Dask for ArviZ
Dask overview
Dask is a big data processing library used for:

- Parallelizing computations in workflows built on NumPy, pandas, xarray, and scikit-learn.
- Scaling workflows up or down to match the available hardware.
Most notably, it provides support for working with larger-than-memory datasets. In this case, Dask partitions the dataset into smaller chunks, loads only a few chunks from disk at a time, and discards the intermediate values as soon as the necessary processing is complete. This way, computations are performed without exceeding the memory limit.
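The snippet below is a minimal sketch of this chunked, lazy model using dask.array; the array size and chunk shape are arbitrary choices for illustration, not values used by ArviZ.

import dask.array as da

# Build a 20000 x 20000 array split into 1000 x 1000 chunks; nothing is
# computed yet -- Dask only records a task graph describing the work.
x = da.random.normal(size=(20_000, 20_000), chunks=(1_000, 1_000))

# The reduction is lazy too. Calling .compute() processes a few chunks at
# a time and discards intermediates, so peak memory use stays bounded.
column_means = x.mean(axis=0)
result = column_means.compute()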
Check out these links if you’re unsure whether your workflow can benefit from using Dask:
Excerpt from “Dask Array Best Practices” doc.
If your data fits comfortably in RAM and you are not performance bound, then using NumPy might be the right choice. Dask adds another layer of complexity which may get in the way. If you are just looking for speedups rather than scalability then you may want to consider a project like Numba.
Caution
Dask is an optional dependency of ArviZ, and Dask support is still under active development. Currently, only a few functions in the diagnostics and stats modules can utilize Dask’s capabilities.
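As a sketch of how that opt-in works, ArviZ’s user guide describes an az.Dask helper for toggling Dask support; the keyword arguments below mirror xarray.apply_ufunc’s Dask options. Check your installed version, since this interface may change while development continues.

import arviz as az

# Opt in: supported diagnostics/stats functions will dispatch their inner
# computation through Dask (the "parallelized" mode of xarray.apply_ufunc).
az.Dask.enable_dask(dask_kwargs={"dask": "parallelized", "output_dtypes": [float]})

# ... call e.g. az.ess(...) or az.rhat(...) on Dask-backed data here ...

# Opt out: restore the default eager behaviour.
az.Dask.disable_dask()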
# optional imports
from dask.distributed import Client  # launches a local Dask cluster
from dask.diagnostics import ResourceProfiler  # samples CPU/memory use during computations
from bokeh.resources import INLINE  # bundles Bokeh's JS/CSS assets into the notebook
import bokeh.io

bokeh.io.output_notebook(INLINE)  # render Bokeh plots (e.g. profiler output) inline
%reload_ext memory_profiler  # enables the %memit magic for memory benchmarks
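A sketch of how these optional pieces fit together (the cluster settings, array size, and sampling interval are illustrative, not requirements): the Client starts a local cluster, while ResourceProfiler hooks into Dask’s local schedulers, so the profiled computation below requests the threaded scheduler explicitly.

import dask.array as da

# A small local cluster; it registers itself as the default scheduler.
client = Client(processes=False, n_workers=1, threads_per_worker=4)

# Sample CPU and memory usage every 0.25 s while a computation runs.
# ResourceProfiler only observes Dask's *local* schedulers, hence the
# explicit scheduler="threads" for this call.
with ResourceProfiler(dt=0.25) as rprof:
    da.random.normal(size=(5_000, 5_000), chunks=(1_000, 1_000)).mean().compute(scheduler="threads")

rprof.visualize()  # rendered inline thanks to bokeh.io.output_notebook above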