dask
Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love
https://dask.org/
Documentation
Embarrassingly parallel Workloads — Dask Examples documentation
https://examples.dask.org/applications/embarrassingly-parallel.html
Best Practices — Dask documentation
https://docs.dask.org/en/latest/best-practices.html
If you have a machine with 100 GB of memory and 10 cores, then you might want to choose chunks in the 1 GB range. You have space for ten chunks per core, which gives Dask a healthy margin without having tasks that are too small.
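The arithmetic behind this rule of thumb can be checked in a few lines. This is a minimal sketch in plain Python; the helper name `chunks_per_core` is hypothetical and not part of Dask, and the inputs are the figures from the excerpt above.

```python
def chunks_per_core(total_memory_gb: float, n_cores: int, chunk_gb: float) -> float:
    """Return how many chunks of size chunk_gb fit in one core's share of memory.

    Hypothetical helper for illustration only; not a Dask API.
    """
    memory_per_core_gb = total_memory_gb / n_cores
    return memory_per_core_gb / chunk_gb


# Figures from the excerpt: 100 GB of memory, 10 cores, 1 GB chunks.
headroom = chunks_per_core(total_memory_gb=100, n_cores=10, chunk_gb=1)
print(headroom)  # 10.0
```

A result near ten matches the "ten chunks per core" margin in the excerpt. A much smaller value would suggest the chunks are too large for the machine.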
Array Best Practices — Dask documentation
https://docs.dask.org/en/latest/array-best-practices.html
Select a good chunk size
While optimal sizes and shapes are highly problem specific, it is rare to see chunks smaller than 100 MB. For float64 data, that is roughly a (4000, 4000) chunk for a 2D array or a (100, 400, 400) chunk for a 3D array.
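The example shapes can be checked against the 100 MB guideline with a quick calculation. This is a sketch in plain Python, assuming dense float64 chunks (8 bytes per element) and 1 MB = 10^6 bytes; `chunk_megabytes` is a hypothetical helper, not a Dask function.

```python
import math

FLOAT64_BYTES = 8  # one float64 element occupies 8 bytes


def chunk_megabytes(shape: tuple, itemsize: int = FLOAT64_BYTES) -> float:
    """Size of one dense chunk in MB (1 MB = 10**6 bytes).

    Hypothetical helper for illustration only; not a Dask API.
    """
    return math.prod(shape) * itemsize / 1e6


print(chunk_megabytes((4000, 4000)))     # 128.0
print(chunk_megabytes((100, 400, 400)))  # 128.0
```

Both shapes hold 16 million elements, so each chunk is about 128 MB, just above the 100 MB floor mentioned in the excerpt.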
DataFrame Best Practices — Dask documentation
https://docs.dask.org/en/latest/dataframe-best-practices.html
Tutorial
Dask: Introduction - YouTube
https://www.youtube.com/watch?v=nnndxbr_Xq4
Dask Live by Coiled - YouTube
2021/10/07
https://www.youtube.com/watch?v=nHMcqEYZ5qY&t=898
https://github.com/coiled/dask-mini-tutorial/blob/main/README.md
Tips
Data Pre-Processing in Python: How I learned to love parallelized applies with Dask and Numba
https://towardsdatascience.com/how-i-learned-to-love-parallelized-applies-with-python-pandas-dask-and-numba-f06b0b367138
python - How to map a column with dask - Stack Overflow
https://stackoverflow.com/questions/40019905/how-to-map-a-column-with-dask
Speeding up time-consuming preprocessing with Dask - ぴよぴよ.py
https://cocodrips.hateblo.jp/entry/2018/12/18/201752
Best practices to go from 1000s of netcdf files to analyses on a HPC cluster? - HPC - Pangeo
https://discourse.pangeo.io/t/best-practices-to-go-from-1000s-of-netcdf-files-to-analyses-on-a-hpc-cluster/588/6
Examples
CDAT/dask-cdms: cdms using dask cluster
https://github.com/CDAT/dask-cdms
Analyzing weather data across a cluster using NumPy in parallel with dask.array
http://matthewrocklin.com/blog//work/2016/02/26/dask-distributed-part-3
python - Dask Read Data from Binary File - Stack Overflow
https://stackoverflow.com/questions/51025228/dask-read-data-from-binary-file
Experiment with Dask and TensorFlow
https://matthewrocklin.com/blog//work/2017/02/11/dask-tensorflow
Asynchronous Optimization Algorithms with Dask
https://matthewrocklin.com/blog//work/2017/04/19/dask-glm-2
dask with matplotlib · GitHub
https://gist.github.com/dcherian/3a00f9b107d5893965867788b9ce95df
Pandas with Dask, For an Ultra-Fast Notebook - Towards Data Science
https://towardsdatascience.com/pandas-with-dask-for-an-ultra-fast-notebook-e2621c3769f
How to Convert a pandas Dataframe into a Dask Dataframe - YouTube
https://www.youtube.com/watch?v=l9c08OAT7jY
Filtering Dask DataFrames with loc
ah on X: "Just tried using dask.distributed's client.submit for a much larger dataset (OISST 2000-2024, ~34 GBs) to parallelize outputting an animation. Takes ~17 mins to generate 8762 frames without exceeding memory usage (consistently ~0.5 GB). #python https://t.co/90cNK5XZ0U"
https://x.com/IAteAnDrew1/status/1760552429626802212
Tools
Dask JupyterLab Extension
https://github.com/dask/dask-labextension
dask-ml: for machine learning
Subpages
dask array operations
dask delayed
dask dashboard
dask distributed
Paid tutorial
Parallel Computing with Dask | DataCamp
https://www.datacamp.com/courses/parallel-computing-with-dask