dask - Python for climatology, oceanography and atmospheric science

dask

Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love

https://dask.org/

Documentation

Embarrassingly parallel Workloads — Dask Examples documentation

https://examples.dask.org/applications/embarrassingly-parallel.html

Best Practices — Dask documentation

https://docs.dask.org/en/latest/best-practices.html

If you have a machine with 100 GB and 10 cores, then you might want to choose chunks in the 1 GB range. You have space for ten chunks per core, which gives Dask a healthy margin without having tasks that are too small.
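The arithmetic behind that rule of thumb can be sketched in a few lines. The helper name `chunks_per_core` is made up for this note and is not part of Dask. It assumes memory is shared evenly between cores.

```python
def chunks_per_core(total_memory_gb, n_cores, chunk_size_gb):
    """Return how many chunks of the given size fit in one core's share of memory.

    Hypothetical helper for illustration only; assumes memory is split
    evenly across cores.
    """
    memory_per_core_gb = total_memory_gb / n_cores
    return memory_per_core_gb / chunk_size_gb


# The example from the best-practices page: 100 GB, 10 cores, 1 GB chunks.
print(chunks_per_core(100, 10, 1))  # 10.0
```

With the same machine and 5 GB chunks, the result drops to 2 chunks per core, which leaves much less headroom.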

Array Best Practices — Dask documentation

https://docs.dask.org/en/latest/array-best-practices.html

Select a good chunk size

While optimal sizes and shapes are highly problem specific, it is rare to see chunk sizes below 100 MB in size. If you are dealing with float64 data then this is around (4000, 4000) in size for a 2D array or (100, 400, 400) for a 3D array.
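To check the quoted shapes, the size of one chunk can be computed from the chunk shape and the dtype. This is a minimal sketch: the array shapes and the helper `chunk_megabytes` are chosen for illustration, and MB here means 10**6 bytes.

```python
import dask.array as da


def chunk_megabytes(array):
    """Size of one full chunk of a dask array in MB (10**6 bytes).

    Hypothetical helper for illustration; uses the largest chunk shape.
    """
    n_elements = 1
    for length in array.chunksize:
        n_elements *= length
    return n_elements * array.dtype.itemsize / 1e6


# Arrays are lazy, so nothing is allocated until a computation is requested.
x2d = da.zeros((8000, 8000), chunks=(4000, 4000), dtype="float64")
x3d = da.zeros((200, 400, 400), chunks=(100, 400, 400), dtype="float64")

print(chunk_megabytes(x2d))  # 128.0
print(chunk_megabytes(x3d))  # 128.0
```

Both shapes hold 16 million float64 values, so each chunk is about 128 MB, in line with the "around 100 MB" guidance.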

DataFrame Best Practices — Dask documentation

https://docs.dask.org/en/latest/dataframe-best-practices.html

Tutorial

Dask: Introduction - YouTube

https://www.youtube.com/watch?v=nnndxbr_Xq4

Dask Live by Coiled - YouTube

2021/10/07

https://www.youtube.com/watch?v=nHMcqEYZ5qY&t=898

https://github.com/coiled/dask-mini-tutorial/blob/main/README.md

Tips

Data Pre-Processing in Python: How I learned to love parallelized applies with Dask and Numba

https://towardsdatascience.com/how-i-learned-to-love-parallelized-applies-with-python-pandas-dask-and-numba-f06b0b367138

python - How to map a column with dask - Stack Overflow

https://stackoverflow.com/questions/40019905/how-to-map-a-column-with-dask

Speeding up time-consuming preprocessing with Dask - ぴよぴよ.py

https://cocodrips.hateblo.jp/entry/2018/12/18/201752

Best practices to go from 1000s of netcdf files to analyses on a HPC cluster? - HPC - Pangeo

https://discourse.pangeo.io/t/best-practices-to-go-from-1000s-of-netcdf-files-to-analyses-on-a-hpc-cluster/588/6

Examples

CDAT/dask-cdms: cdms using dask cluster

https://github.com/CDAT/dask-cdms

weather data across a cluster using NumPy in parallel with dask.array

http://matthewrocklin.com/blog//work/2016/02/26/dask-distributed-part-3

python - Dask Read Data from Binary File - Stack Overflow

https://stackoverflow.com/questions/51025228/dask-read-data-from-binary-file

Experiment with Dask and TensorFlow

https://matthewrocklin.com/blog//work/2017/02/11/dask-tensorflow

Asynchronous Optimization Algorithms with Dask

https://matthewrocklin.com/blog//work/2017/04/19/dask-glm-2

dask with matplotlib · GitHub

https://gist.github.com/dcherian/3a00f9b107d5893965867788b9ce95df

Pandas with Dask, For an Ultra-Fast Notebook - Towards Data Science

https://towardsdatascience.com/pandas-with-dask-for-an-ultra-fast-notebook-e2621c3769f

How to Convert a pandas Dataframe into a Dask Dataframe - YouTube

https://www.youtube.com/watch?v=l9c08OAT7jY

Filtering Dask DataFrames with loc

Post on X: "Just tried using dask.distributed's client.submit for a much larger dataset (OISST 2000-2024, ~34 GBs) to parallelize outputting an animation. Takes ~17 mins to generate 8762 frames without exceeding memory usage (consistently ~0.5 GB). #python https://t.co/90cNK5XZ0U"

https://x.com/IAteAnDrew1/status/1760552429626802212
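The pattern described in that post, one task per frame submitted through `client.submit`, might look roughly like the sketch below. It is not the code from the post. `render_frame` is a hypothetical stand-in that only returns a file name, where a real version would load one time step, plot it and save an image. The client runs with threads in the local process to keep the example small.

```python
from dask.distributed import Client


def render_frame(index):
    """Hypothetical stand-in for drawing and saving one animation frame."""
    # A real implementation would read one time step, plot it, and write a PNG.
    return f"frame_{index:04d}.png"


def render_all(n_frames):
    """Submit one task per frame and collect the results in frame order."""
    # processes=False keeps the workers as threads inside this process;
    # dashboard_address=None turns off the diagnostics dashboard.
    with Client(processes=False, dashboard_address=None) as client:
        futures = [client.submit(render_frame, i) for i in range(n_frames)]
        return client.gather(futures)


if __name__ == "__main__":
    print(render_all(4))
```

Because each task handles a single frame, only the data for the frames currently being drawn needs to be in memory at once.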

Tools

Dask JupyterLab Extension

https://github.com/dask/dask-labextension

dask-ml: for machine learning

Subpages

dask array operations

dask distributed

Paid tutorial

Parallel Computing with Dask | DataCamp

https://www.datacamp.com/courses/parallel-computing-with-dask