InferenceData schema specification for ArviZ#

ArviZ uses DataTree objects from the xarray library to store and organize the outputs of Bayesian inference in a structured, labeled format.

This structure serves three goals:

Usefulness in the analysis of Bayesian inference results.
Reproducibility of Bayesian inference analysis.
Interoperability between different inference backends and programming languages.

This specification describes the schema that DataTree objects should follow to be compatible with ArviZ’s functionalities for exploratory analysis of Bayesian models.

Basic structure#

Each node in the tree, called a group, represents a different conceptual quantity, such as the posterior distribution or the observed data. Each group contains one or more multidimensional labeled variables representing specific quantities related to that concept.

Terminology#

The terminology used in this specification is based on xarray’s terminology. Below is a summary of the most relevant terms for this specification:

Variable: NetCDF-like variables are multidimensional labeled arrays representing a single quantity. Variables and their dimensions must be named. They can also have attributes describing them.
Dimension: The dimensions of an object are its named axes. A variable containing 3D data can have dimensions [chain, draw, dim0], i.e., its 0th-dimension is chain, its 1st-dimension is draw, and so on. Every dimension present in a DataTree variable must share its name with a coordinate. Given that dimensions must be named, dimension and dimension name are used interchangeably.
Coordinate: A named array that labels a dimension. A coordinate named chain with values [0, 1, 2, 3] would label the chain dimension. Coordinate names and values can be loosely thought of as labels and tick labels along a dimension, respectively.
Attribute: An ordered dictionary that can store arbitrary metadata.
Group: Dataset containing one or several variables with a conceptual link between them. Variables inside a group will generally share some dimensions too. For example, the posterior group contains a representation of the posterior distribution conditioned on the observations in the observed_data group.
Matching samples: Two variables (or groups) are said to have matching samples if they are generated with the same set of samples. Therefore, they share the dimensions and coordinates corresponding to the sampling process. Sample dimensions (generally (chain, draw)) are the ones introduced by the sampling process.
Matching variables: Two groups with matching variables are groups that conceptually share variables, variable dimensions and coordinates of the variable dimensions but do not necessarily share variable names nor sample dimensions. Variable dimensions are the ones intrinsic to the data and model as opposed to sample dimensions which are the ones relative to the sampling process. When talking about specific variables, this same idea is expressed as one variable being the counterpart of the other.
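The notion of matching samples can be made concrete with a small check. The sketch below is one possible interpretation, not an ArviZ function: the helper `have_matching_samples` is hypothetical, and it assumes the sample dimensions are chain and draw and that groups are available as xarray Datasets.

```python
import numpy as np
import xarray as xr


def have_matching_samples(a, b, sample_dims=("chain", "draw")):
    """Return True if both datasets share the sample dimensions and their coordinate values."""
    for dim in sample_dims:
        if dim not in a.dims or dim not in b.dims:
            return False
        # Compare the coordinate values that label the sample dimension.
        if not a[dim].equals(b[dim]):
            return False
    return True


coords = {"chain": [0, 1], "draw": np.arange(5)}
posterior = xr.Dataset({"mu": (("chain", "draw"), np.zeros((2, 5)))}, coords=coords)
log_likelihood = xr.Dataset(
    {"y": (("chain", "draw", "obs_id"), np.zeros((2, 5, 3)))},
    coords={**coords, "obs_id": [0, 1, 2]},
)
# A prior with a different number of chains and draws does not match.
prior = xr.Dataset(
    {"mu": (("chain", "draw"), np.zeros((1, 8)))},
    coords={"chain": [0], "draw": np.arange(8)},
)

print(have_matching_samples(posterior, log_likelihood))  # True
print(have_matching_samples(posterior, prior))  # False
```

Note that the variable dimension `obs_id` plays no role in the comparison, since only sample dimensions are relevant for matching samples.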

Rules#

Below are a few rules that should be followed:

Each group should have one entry per variable and each variable should be named.
Dimension names chain, draw, sample and pred_id are reserved for use to indicate sample dimensions.
- chain indicates the MCMC chain
- draw indicates the iteration within each MCMC chain. ArviZ assumes all chains have the same length for better interoperability with NumPy and xarray.
- sample indicates a unique id per value combining chain and draw. It is useful because plotting often does not depend on chain and draw, only on all the samples of the distribution as a whole.
- pred_id is interpreted as the dimension storing multiple independent and identically distributed values per sample.
Dimensions, including sample dimensions, should be identified by name only. The dimension order does not matter, only their names.
For groups like observed_data or constant_data, all sample dimensions can be omitted. For groups like prior, posterior or posterior_predictive, either sample must be present or both chain and draw must be present. Any combination that satisfies this requirement is valid.
Dimensions must be named and share their name with a coordinate specifying the index values, called coordinate values.
Coordinate values can be repeated and need not be numerical.
Variables must not share names with dimensions.
Groups, variables or the DataTree itself can have arbitrary metadata stored.
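Two of the rules above lend themselves to a simple programmatic check. The following is a minimal sketch in plain Python; the helper names `has_valid_sample_dims` and `names_clashing_with_dims` are hypothetical, and a group is represented as a mapping from variable names to tuples of dimension names rather than as an actual DataTree node.

```python
def has_valid_sample_dims(dims):
    """Check the sample dimension rule for groups that store samples.

    Either `sample` is present, or both `chain` and `draw` are present.
    Dimensions are identified by name only, so their order is irrelevant.
    """
    dims = set(dims)
    return "sample" in dims or {"chain", "draw"} <= dims


def names_clashing_with_dims(variables):
    """Return the variable names that are also used as dimension names.

    `variables` maps each variable name to a tuple of dimension names.
    """
    all_dims = {dim for dims in variables.values() for dim in dims}
    return sorted(name for name in variables if name in all_dims)


posterior_vars = {
    "intercept": ("chain", "draw"),
    "coefs": ("draw", "chain", "predictor"),  # order does not matter
}
print(all(has_valid_sample_dims(d) for d in posterior_vars.values()))  # True
print(has_valid_sample_dims(("chain", "obs_id")))  # False
print(names_clashing_with_dims({"obs_id": ("obs_id",)}))  # ['obs_id']
```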

Metadata#

No metadata is required to be present in order to be compliant with ArviZ’s schema. However, it is recommended to store the following fields when relevant:

name: All the quantities stored in a DataTree are tied to a single model. The model identifier can be added as metadata to simplify the calls to model comparison functions.
sample_dims: list of dimensions that were generated through a sampling process. Common examples are chain, draw or sample. It will generally be taken as the default value for arguments like dim or sample_dims.
sampled_variables: list of variable names on which inference was performed.
created_at: the date of creation of the group.
creation_library: the library used to create the DataTree, which need not be ArviZ.
creation_library_version: the version of creation_library that generated the DataTree.
creation_library_language: the programming language from which creation_library was used to create the DataTree.
inference_library: the library used to run the inference.
inference_library_version: version of the inference library used.

Metadata can be stored at the whole DataTree level but also at group level when needed. In particular, the name attribute is only taken into account at the DataTree level whereas sample_dims or sampled_variables are only taken into account at the group level.
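The split between DataTree-level and group-level metadata might look as follows. This is only a sketch: the model, library and version strings are placeholders, and the metadata is shown as plain dictionaries (xarray exposes attributes as a dictionary through `.attrs`, so these could be merged into it with `.attrs.update(...)`).

```python
from datetime import datetime, timezone

# Metadata for the whole DataTree; `name` is only taken into account here.
tree_attrs = {
    "name": "linear_regression",  # placeholder model identifier
    "creation_library": "my_converter",  # placeholder library name
    "creation_library_version": "0.1.0",
    "creation_library_language": "Python",
    "inference_library": "my_sampler",  # placeholder library name
    "inference_library_version": "1.2.3",
}

# Metadata for a single group, e.g. posterior; `sample_dims` and
# `sampled_variables` are only taken into account at this level.
posterior_attrs = {
    "created_at": datetime.now(timezone.utc).isoformat(),
    "sample_dims": ["chain", "draw"],
    "sampled_variables": ["intercept", "slope"],
}
```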

Relations between groups#

DataTree objects may contain any subset of the groups described below. The presence of additional groups, or the absence of some groups described here, does not violate the schema. However, whenever related groups are present, the relationships between their variables and dimensions defined in this specification should be respected.

`posterior`#

Samples from the posterior distribution \(p(\theta \mid y)\) in the parameter (also called constrained) space.

`unconstrained_posterior`#

Samples from the posterior distribution \(p(\theta_{\text{transformed}} \mid y)\) in the unconstrained (also called transformed) space.

Only variables that undergo a transformation for sampling should be present here. Therefore, to get the samples for all the variables in the unconstrained space, a variable should be taken from the unconstrained_posterior group if it is present there; otherwise, its values in the posterior group should be used.

Samples should match between the posterior and the unconstrained_posterior groups. All variables in unconstrained_posterior should have a counterpart in posterior with the same name. However, they need not have the same dimensions or shape.
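The lookup order described in this section can be sketched with a small helper. The function `get_unconstrained_variable` is hypothetical, and the example uses nested dictionaries as a stand-in for the tree; it assumes only that groups support `in` and item access by name, as mappings do. The log transform of `sigma` is an arbitrary choice for the example.

```python
import math


def get_unconstrained_variable(tree, name):
    """Return `name` in the unconstrained space, falling back to the posterior group."""
    if "unconstrained_posterior" in tree and name in tree["unconstrained_posterior"]:
        return tree["unconstrained_posterior"][name]
    # Variables that were not transformed only exist in the posterior group.
    return tree["posterior"][name]


sigma_draws = [1.2, 0.8]
tree = {
    "posterior": {"mu": [0.1, -0.3], "sigma": sigma_draws},
    # Only sigma was transformed (here with a log) for sampling.
    "unconstrained_posterior": {"sigma": [math.log(s) for s in sigma_draws]},
}

print(get_unconstrained_variable(tree, "mu"))  # [0.1, -0.3]
```

For `sigma` the helper returns the transformed values, whereas `mu` is read from the posterior group because it has no entry in unconstrained_posterior.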

Note

Both DataTree groups and variables can have metadata, which in the unconstrained_posterior case could be used to store the transformations each variable goes through to map between the constrained and unconstrained spaces. The schema leaves this completely up to the user and imposes no conventions or restrictions on such metadata.

`sample_stats`#

Information and diagnostics for each posterior sample, provided by the inference backend. It may vary depending on the algorithm used by the backend (e.g. an affine invariant sampler has no associated energy). Therefore, none of these parameters should be assumed to be present in the sample_stats group. The convention below serves to ensure that if a variable is present with one of these names, it corresponds to the definition given next to it. Moreover, some sample_stats may be constant throughout the sampling process; these variables do not need to have any sample dimensions.

`log_likelihood`#

Pointwise log likelihood data. Samples should match with posterior ones and its variables should match observed_data variables. The observed_data counterpart variable may have a different name. Moreover, some cases such as a multivariate normal may require some dimensions or coordinates to be different.

`log_prior`#

Pointwise evaluation of the prior distribution’s log pdf/pmf at the posterior samples. Samples should match with posterior ones and its variables should match posterior variables, or be a subset of them.

`posterior_predictive`#

Posterior predictive samples \(p(\tilde{y} \mid y)\) corresponding to the posterior predictive distribution evaluated at the observed_data. Samples should match with posterior ones and its variables should match observed_data variables. The observed_data counterpart variable may have a different name.

`observed_data`#

Observed data on which the posterior is conditional. It should only contain data which is modeled as a random variable. Each variable should have a counterpart in posterior_predictive; however, the posterior_predictive counterpart variable may have a different name.

`constant_data`#

Model constants, data included in the model which is not modeled as a random variable. It should be the data used to generate samples in all the groups except the predictions groups.

`prior`#

Samples from the prior distribution \(p(\theta)\). Samples do not need to match posterior samples. However, this group still follows the conventions on sample dimensions such as chain and draw. It should have matching variables with the posterior group.

`prior_predictive`#

Samples from the prior predictive distribution. Samples should match prior samples and each variable should have a counterpart in posterior_predictive/observed_data.

`predictions`#

Out-of-sample posterior predictive samples \(p(y' \mid y)\). Samples should match posterior samples. Its variables should have a counterpart in posterior_predictive. However, variables in predictions and their counterpart in posterior_predictive can have different coordinate values.

`predictions_constant_data`#

Model constants used to get the predictions samples. Its variables should have a counterpart in constant_data. However, variables in predictions_constant_data and their counterpart in constant_data can have different coordinate values.

Note on sample stats, warmup and unconstrained groups

The schema does not define which warmup or unconstrained groups exist or can exist by default. We recognize both the samplers and the models are continuously evolving. Some models already require the use of sampling algorithms to get prior samples, in which case we basically need to treat the prior and posterior groups in the same way.

We define the prefixes to allow third-party libraries to be aware of the potential relations and hopefully support as many cases as possible. Back to the case above, it might be necessary to generate a pair plot for prior samples generated with NUTS and its associated divergences, which would then come from sample_stats_prior.

Sample stats groups#

Information and diagnostics for the samples in any DataTree group other than the posterior should be stored in a separate group with the sample_stats_ prefix, for example, sample_stats_prior.

The same rules and conventions defined in sample_stats apply to any sample stats group.

Warmup groups#

Samples generated during the adaptation/warmup phases of algorithms like HMC can also be stored in a DataTree. In such cases, the data/samples generated during the adaptation process should be stored in groups with the same name with the warmup_ prefix, e.g. warmup_posterior, warmup_sample_stats_prior. The warmup_ prefix goes before other prefixes.

Unconstrained groups#

These groups store samples in the unconstrained space. They apply when the samples are generated with a sampling algorithm that requires a transformation to an unconstrained space.

They are described in more detail in the unconstrained_posterior section, which we expect to be the most common case, but other groups can also have a linked unconstrained group, e.g. prior and unconstrained_prior.
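A third-party library could use the prefixes to work out how groups relate to each other. The sketch below is one possible way to do it; `GroupName` and `parse_group_name` are hypothetical names. Since the schema does not state how the sample_stats_ and unconstrained_ prefixes would combine, the sketch recognizes at most one of them, after an optional warmup_ prefix.

```python
from typing import NamedTuple


class GroupName(NamedTuple):
    warmup: bool  # True if the group stores warmup/adaptation samples
    prefix: str   # "", "sample_stats" or "unconstrained"
    base: str     # the group these samples or diagnostics refer to


def parse_group_name(name: str) -> GroupName:
    """Split a group name into its warmup flag, prefix and base group."""
    # The warmup_ prefix goes before any other prefix.
    warmup = name.startswith("warmup_")
    rest = name[len("warmup_"):] if warmup else name
    # A bare sample_stats group refers to the posterior.
    if rest == "sample_stats":
        return GroupName(warmup, "sample_stats", "posterior")
    for prefix in ("sample_stats", "unconstrained"):
        if rest.startswith(prefix + "_"):
            return GroupName(warmup, prefix, rest[len(prefix) + 1:])
    return GroupName(warmup, "", rest)


print(parse_group_name("warmup_sample_stats_prior"))
# GroupName(warmup=True, prefix='sample_stats', base='prior')
```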

Examples#

To clarify the definitions above, an example of DataTree generation for a 1D linear regression is available in several probabilistic programming frameworks. This inference task was chosen because it is widely known while still being useful, and because it allows all the fields in the DataTree object to be populated.
