Dask for ArviZ

Dask overview

Dask is a big data processing library used for:

  1. Parallelizing computations in workflows built on NumPy, pandas, xarray, and scikit-learn.

  2. Scaling those workflows up or down depending on the available hardware.

Most notably, it provides support for working with larger-than-memory datasets. In this case, Dask partitions the dataset into smaller chunks, loads only a few chunks from disk at a time, and discards intermediate values as soon as the necessary processing is completed. This way, computations are performed without exceeding the memory limit.
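As a rough sketch of this idea (the array size and chunk shape here are arbitrary choices for illustration, not taken from the original notebook), a Dask array splits one large reduction into many per-chunk tasks:

import dask.array as da

# ~3.2 GB of float64 values, split into 25 chunks of ~128 MB each;
# nothing is allocated or generated at this point
x = da.random.normal(size=(20_000, 20_000), chunks=(4_000, 4_000))

# building the expression only records a task graph
total = x.mean()

# compute() evaluates a few chunks at a time and discards intermediate
# results, so peak memory stays well below the full array size
print(total.compute())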


Check out the following resources if you’re unsure whether your workflow can benefit from using Dask:

An excerpt from the “Dask Array Best Practices” documentation:

If your data fits comfortably in RAM and you are not performance bound, then using NumPy might be the right choice. Dask adds another layer of complexity which may get in the way.

If you are just looking for speedups rather than scalability then you may want to consider a project like Numba.


Caution

Dask support is an optional feature inside ArviZ and is still under active development. Currently, only a few functions in the diagnostics and stats modules can utilize Dask’s capabilities.

import arviz as az
import numpy as np
import timeit
import dask

# Dask helper from ArviZ, used to toggle Dask support in stats/diagnostics
from arviz.utils import conditional_jit, Dask

# optional imports: distributed scheduler and resource profiling
from dask.distributed import Client
from dask.diagnostics import ResourceProfiler

# render Bokeh output (used by Dask's profiler plots) inline in the notebook
from bokeh.resources import INLINE
import bokeh.io
bokeh.io.output_notebook(INLINE)

# enable the %memit magic for measuring peak memory usage
%reload_ext memory_profiler
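With the setup above in place, Dask support can be toggled through the Dask helper imported from arviz.utils. A minimal sketch, assuming enable_dask forwards its dask_kwargs to xarray.apply_ufunc and disable_dask restores the default behavior:

# turn on Dask-aware computation in the supported stats/diagnostics functions
Dask.enable_dask(dask_kwargs={"dask": "parallelized", "output_dtypes": [float]})

# optionally start a local distributed scheduler; its dashboard makes it
# easy to watch task progress and memory usage while ArviZ functions run
client = Client(threads_per_worker=4, n_workers=2)

# ... run ArviZ stats/diagnostics on Dask-backed data here ...

# switch Dask support off again once it is no longer needed
Dask.disable_dask()
client.close()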