Dask for ArviZ#

Dask overview#

Dask is a big data processing library used for:

  1. Parallelizing computations in workflows built on frameworks such as NumPy, pandas, xarray and scikit-learn.

  2. Scaling those workflows up or down depending on the hardware available.

Most notably, it provides support for working with larger-than-memory datasets. In that case, Dask partitions the dataset into smaller chunks, loads only a few chunks from disk at a time, and discards intermediate values once the necessary processing is completed. This way, the computation is performed without exceeding the memory limit.
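
As a minimal sketch of this chunked, lazy model (the array size and chunk choice here are illustrative only):

import dask.array as da

# ~2 GB of float64 values, split by dask into manageable chunks
lazy = da.ones(250_000_000, chunks="auto")
total = lazy.sum()   # lazy: only a task graph is built at this point
total.compute()      # chunks are processed a few at a time and then discarded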


Check out these links if you’re unsure whether your workflow can benefit from using Dask or not:

Excerpt from “Dask Array Best Practices” doc.

If your data fits comfortably in RAM and you are not performance bound, then using NumPy might be the right choice. Dask adds another layer of complexity which may get in the way.

If you are just looking for speedups rather than scalability then you may want to consider a project like Numba.


Caution

Dask is an optional dependency of ArviZ, and the integration is still under active development. Currently, only a few functions in the diagnostics and stats modules can utilize Dask's capabilities.
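
If these optional packages are missing, one possible way to install them from inside the notebook is shown below (a suggestion only; adapt it to your environment and package manager):

# dask[complete] pulls in dask.array, dask.distributed and the diagnostics extras
%pip install "dask[complete]" memory_profiler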

import arviz as az
import numpy as np
import timeit
import dask

from arviz.utils import conditional_jit, Dask
# optional imports
from dask.distributed import Client
from dask.diagnostics import ResourceProfiler

from bokeh.resources import INLINE
import bokeh.io

bokeh.io.output_notebook(INLINE)

%reload_ext memory_profiler

Note

ResourceProfiler and Client are optional. They are only used for visualizing and profiling the Dask-enabled methods. The ArviZ-Dask integration can be used without these objects.

client = Client(threads_per_worker=4, n_workers=1, memory_limit="1.2GB")
client

Client: Client-3a9af271-4541-11ec-b824-5820b17a12fa
Connection method: Cluster object    Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Variance example#

array_size = 250_000_000

Calculating variance using NumPy:

%%memit 
data = np.random.randn(array_size)
np.var(data, ddof=1)
del data
peak memory: 4072.28 MiB, increment: 3815.28 MiB

Calculating variance using Dask arrays:

  • Divides the array into multiple chunks.

  • Produces lazy objects that are computed on the fly.

  • Builds a task graph of the entire computation and parallelizes its execution.

%memit data = dask.array.random.normal(size=array_size, chunks="auto")
data
peak memory: 258.30 MiB, increment: 0.28 MiB
        Array          Chunk
Bytes   1.86 GiB       119.21 MiB
Shape   (250000000,)   (15625000,)
Count   16 Tasks       16 Chunks
Type    float64        numpy.ndarray
var = dask.array.var(data, ddof=1)
var.visualize()
[Task graph for the variance computation, output of var.visualize()]
with ResourceProfiler(dt=0.25) as rprof:
    var.compute()

rprof.visualize();
del data

Here, the NumPy version consumed around 4 GB of memory, while the Dask version computed the variance within the 1.2 GB limit set in the Client configuration above, which shows how beneficial Dask can be when dealing with large datasets.

ArviZ-Dask integration#

Creating Dask-backed InferenceData objects#

InferenceData is the central data format for ArviZ, and there are several ways to generate this object (you can look them up here).

However, as the ArviZ-Dask integration is still a work in progress, to use an InferenceData object with Dask-compatible methods we'll have to generate it in a different way. arviz.from_netcdf() has an experimental group_kwargs argument that can be used to read netCDF files directly with Dask.
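
For instance, something along these lines should yield a Dask-backed posterior group (a sketch only: the file name is hypothetical and, as the argument is experimental, its exact structure may change between ArviZ versions):

# "posterior_data.nc" is a placeholder file name.
# group_kwargs maps group names to keyword arguments forwarded to
# xarray.open_dataset; passing chunks makes the group Dask-backed.
idata = az.from_netcdf(
    "posterior_data.nc",
    group_kwargs={"posterior": {"chunks": {"draw": 500}}},
)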

We will progressively add more ways to generate Dask-backed InferenceData objects and document them here. If you are interested in helping out, reach out on Gitter.

From dictionary using dask.array#

We start by creating a Dask array with random samples, which we can then convert to InferenceData using arviz.from_dict(). ArviZ passes values and coordinate values as-is to xarray, so by passing a Dask array we'll get a Dask-backed InferenceData automatically.

%memit daskdata = dask.array.random.random((10, 1000, 10000), chunks=(10, 1000, 625))
daskdata
peak memory: 260.43 MiB, increment: 0.07 MiB
        Array               Chunk
Bytes   762.94 MiB          47.68 MiB
Shape   (10, 1000, 10000)   (10, 1000, 625)
Count   16 Tasks            16 Chunks
Type    float64             numpy.ndarray
daskdata.visualize()  # Each chunk will follow lazy evaluation
[Task graph of the chunked random array, output of daskdata.visualize()]

Note

Setting the right value of the chunks parameter is very important. Computations on Dask arrays with small chunks are slow because each operation on a chunk has some overhead. On the other hand, if your chunks are too big, they might not fit in memory.
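
As a quick illustration (a sketch; the explicit chunk shape below is arbitrary), chunks="auto" lets Dask pick a reasonable chunk size, while a tuple gives full control:

auto_chunked = dask.array.random.random((10, 1000, 10000), chunks="auto")
manual_chunked = dask.array.random.random((10, 1000, 10000), chunks=(10, 1000, 1250))
auto_chunked.chunksize, manual_chunked.chunksize  # compare the resulting chunk shapes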

datadict = {"x": daskdata}
%memit idata_dask = az.from_dict(posterior=datadict, dims={"x": ["dim_1"]})
idata_dask
peak memory: 260.55 MiB, increment: 0.07 MiB
arviz.InferenceData
    • <xarray.Dataset>
      Dimensions:  (chain: 10, draw: 1000, dim_1: 10000)
      Coordinates:
        * chain    (chain) int64 0 1 2 3 4 5 6 7 8 9
        * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
        * dim_1    (dim_1) int64 0 1 2 3 4 5 6 ... 9993 9994 9995 9996 9997 9998 9999
      Data variables:
          x        (chain, draw, dim_1) float64 dask.array<chunksize=(10, 1000, 625), meta=np.ndarray>
      Attributes:
          created_at:     2021-11-14T11:51:52.050185
          arviz_version:  0.11.4

Executing ArviZ functions with Dask#

arviz.Dask provides functionality for enabling and disabling Dask within ArviZ. It is an ArviZ-specific class, and therefore only affects ArviZ functions that support computation via Dask.

We can also use it to set default arguments, which are then picked up by the Dask-supporting functions and passed to xarray.apply_ufunc().
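
The basic pattern used in the rest of this notebook looks like this (a sketch collecting the calls demonstrated below):

# enable Dask and set default kwargs forwarded to xarray.apply_ufunc
Dask.enable_dask(dask_kwargs={"dask": "parallelized", "output_dtypes": [float]})
# ... call Dask-aware ArviZ functions, e.g. az.ess(idata_dask) ...
# switch back to the default (non-Dask) behaviour when done
Dask.disable_dask()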

For comparison, let's first create an InferenceData object using a NumPy array:

%memit npdata = np.random.rand(10, 1000, 10000)
datadict = {"x": npdata}
idata_numpy = az.from_dict(posterior=datadict, dims={"x": ["dim_1"]})
idata_numpy
peak memory: 1023.65 MiB, increment: 762.97 MiB
arviz.InferenceData
    • <xarray.Dataset>
      Dimensions:  (chain: 10, draw: 1000, dim_1: 10000)
      Coordinates:
        * chain    (chain) int64 0 1 2 3 4 5 6 7 8 9
        * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
        * dim_1    (dim_1) int64 0 1 2 3 4 5 6 ... 9993 9994 9995 9996 9997 9998 9999
      Data variables:
          x        (chain, draw, dim_1) float64 0.3969 0.9378 0.349 ... 0.8059 0.6392
      Attributes:
          created_at:     2021-11-14T11:51:53.110470
          arviz_version:  0.11.4

arviz.ess#

%%time
%%memit

az.ess(idata_numpy)
peak memory: 1034.65 MiB, increment: 10.89 MiB
CPU times: user 21 s, sys: 192 ms, total: 21.2 s
Wall time: 21 s

Tip

Set the most common default dask_kwargs when enabling Dask in order to simplify future function calls. If needed, those default kwargs can always be overridden with the function-specific dask_kwargs argument.

Dask.enable_dask(dask_kwargs={"dask": "parallelized", "output_dtypes": [float]})
%%time
%%memit

ess = az.ess(idata_dask)

with ResourceProfiler(dt=0.25) as rprof:
    ess.compute()
peak memory: 1035.45 MiB, increment: 0.79 MiB
CPU times: user 643 ms, sys: 104 ms, total: 747 ms
Wall time: 15.8 s

Each chunk also carries its evaluation expression, which will be computed in parallel and on the fly:

ess.data_vars["x"].data.visualize()
[Task graph of the ESS computation, output of ess.data_vars["x"].data.visualize()]
rprof.visualize()
Dask.disable_dask()

Here, the Dask-enabled method consumed around 400 MB of memory, roughly 360 MB less than the vanilla method (also accounting for the memory already occupied by the NumPy array). The Dask-enabled method is also a bit faster.

arviz.rhat#

%%time
%%memit

az.rhat(idata_numpy)
peak memory: 1035.81 MiB, increment: 0.30 MiB
CPU times: user 32.7 s, sys: 167 ms, total: 32.9 s
Wall time: 32.5 s
Dask.enable_dask(dask_kwargs={"dask": "parallelized", "output_dtypes": [int]})

We have now enabled Dask with incorrect default kwargs, which we have to override in the function call:

%%time
%%memit

rhat = az.rhat(idata_dask, dask_kwargs={"output_dtypes": [float]})

with ResourceProfiler(dt=0.25) as rprof:
    rhat.compute()
peak memory: 1036.70 MiB, increment: 0.88 MiB
CPU times: user 709 ms, sys: 156 ms, total: 865 ms
Wall time: 20.6 s
rprof.visualize()
Dask.disable_dask()

arviz.hdi#

%%time
%%memit

az.hdi(idata_numpy, hdi_prob=0.68)
peak memory: 1037.05 MiB, increment: 0.20 MiB
CPU times: user 5.9 s, sys: 63.8 ms, total: 5.96 s
Wall time: 5.95 s
Dask.enable_dask(dask_kwargs={"dask": "parallelized", "output_dtypes": [float]})

With arviz.hdi() we are introducing a new dimension in the output, the one containing the lower and upper HDI limits. We therefore need to use dask_gufunc_kwargs from xarray.apply_ufunc(), which is passed as **kwargs first to arviz.wrap_xarray_ufunc() and then on to xarray.apply_ufunc().

%%time
%%memit

hdi = az.hdi(idata_dask, hdi_prob=0.68, dask_gufunc_kwargs={"output_sizes": {"hdi": 2}})

with ResourceProfiler(dt=0.25) as rprof:
    hdi.compute()
peak memory: 1037.55 MiB, increment: 0.50 MiB
CPU times: user 266 ms, sys: 78.1 ms, total: 344 ms
Wall time: 2.78 s
rprof.visualize()
Dask.disable_dask()
client.close()

In all the examples, it’s noticeable that:

  1. The data structures provided by Dask reduce the overall memory footprint, as the data is divided into multiple chunks.

  2. By breaking down complex computations into small tasks and parallelizing their execution, the Dask-supported methods achieve significant performance gains.