Dask for ArviZ#
Dask is a big data processing library used for:
Parallelizing the computation of the workflow consisting of NumPy, pandas, xarray and scikit-learn frameworks.
Scaling the workflows up or down depending upon the hardware that is being used.
Most notably, it provides the support for working with larger-than-memory datasets. In this case, dask partitions the dataset into smaller chunks, then loads only a few chunks from the disk, and once the necessary processing is completed, it throws away the intermediate values. This way, the computations are performed without exceeding the memory limit.
Check out these links if you’re unsure whether your workflow can benefit from using Dask or not:
Excerpt from “Dask Array Best Practices” doc.
If your data fits comfortably in RAM and you are not performance bound, then using
NumPymight be the right choice. Dask adds another layer of complexity which may get in the way.
If you are just looking for speedups rather than scalability then you may want to consider a project like
Dask is an optional dependency inside ArviZ, which is still being actively developed. Currently, few functions belonging to diagnostics and stats module can utilize Dask’s capabilities.
# optional imports from dask.distributed import Client from dask.diagnostics import ResourceProfiler from bokeh.resources import INLINE import bokeh.io bokeh.io.output_notebook(INLINE) %reload_ext memory_profiler