{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(xarray_for_arviz)=\n", "# Introduction to xarray, InferenceData, and netCDF for ArviZ" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While ArviZ supports plotting from familiar data types, such as dictionaries and NumPy arrays, there are a couple of data structures central to ArviZ that are useful to know when using the library. \n", "\n", "They are \n", "\n", "* {class}`xarray:xarray.Dataset`\n", "* {class}`arviz.InferenceData`\n", "* {ref}`netCDF ` \n", "\n", "\n", "## Why more than one data structure?\n", "[Bayesian inference](https://en.wikipedia.org/wiki/Bayesian_inference) generates numerous datasets that represent different aspects of the model. For example, in a single analysis, a Bayesian practitioner could end up with any of the following data.\n", "\n", "\n", "\n", "* Prior Distribution for N number of variables\n", "* Posterior Distribution for N number of variables\n", "* Prior Predictive Distribution\n", "* Posterior Predictive Distribution\n", "* Trace data for each of the above\n", "* Sample statistics for each inference run\n", "* Any other array like data source\n", "\n", "For more detail, see the `InferenceData` structure specification {ref}`here `.\n", "\n", "\n", "## Why not Pandas Dataframes or NumPy Arrays?\n", "Data from [probabilistic programming](https://en.wikipedia.org/wiki/Probabilistic_programming) is naturally high dimensional. To add to the complexity ArviZ must handle the data generated from multiple Bayesian modeling libraries, such as PyMC3 and PyStan. This application is handled by the *xarray* package quite well. The xarray package lets users manage high dimensional data with human readable dimensions and coordinates quite easily.\n", "\n", "![InferenceData Structure](InferenceDataStructure.png) \n", "\n", "Above is a visual representation of the data structures and their relationships. 
Although it may seem more complex at first glance, the ArviZ developers believe that using *xarray*, `InferenceData`, and *netCDF* will simplify the handling, referencing, and serialization of data generated during Bayesian analysis. \n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## An introduction to each\n", "To help you get familiar with each, ArviZ includes some toy datasets. You can check the different ways to start an `InferenceData` {ref}`here `. For illustration purposes, here we show only one example provided by the library: a sample `az.InferenceData` object can be loaded from disk." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "
arviz.InferenceData
\n", "
\n", "
    \n", " \n", "
  • \n", " \n", " \n", "
    \n", "
    \n", "
      \n", "
      \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
      <xarray.Dataset>\n",
             "Dimensions:  (chain: 4, draw: 500, school: 8)\n",
             "Coordinates:\n",
             "  * chain    (chain) int64 0 1 2 3\n",
             "  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499\n",
             "  * school   (school) object 'Choate' 'Deerfield' ... "St. Paul's" 'Mt. Hermon'\n",
             "Data variables:\n",
             "    mu       (chain, draw) float64 -3.477 -2.456 -2.826 ... 4.597 5.899 0.1614\n",
             "    theta    (chain, draw, school) float64 1.669 -8.537 -2.623 ... 10.59 4.523\n",
             "    tau      (chain, draw) float64 3.73 2.075 3.703 4.146 ... 8.346 7.711 5.407\n",
             "Attributes:\n",
             "    created_at:                 2019-06-21T17:36:34.398087\n",
             "    inference_library:          pymc3\n",
             "    inference_library_version:  3.7

      \n", "
    \n", "
    \n", "
  • \n", " \n", "
  • \n", " \n", " \n", "
    \n", "
    \n", "
      \n", "
      \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
      <xarray.Dataset>\n",
             "Dimensions:  (chain: 4, draw: 500, school: 8)\n",
             "Coordinates:\n",
             "  * chain    (chain) int64 0 1 2 3\n",
             "  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499\n",
             "  * school   (school) object 'Choate' 'Deerfield' ... "St. Paul's" 'Mt. Hermon'\n",
             "Data variables:\n",
             "    obs      (chain, draw, school) float64 7.85 -19.03 -22.5 ... 4.698 -15.07\n",
             "Attributes:\n",
             "    created_at:                 2019-06-21T17:36:34.489022\n",
             "    inference_library:          pymc3\n",
             "    inference_library_version:  3.7

      \n", "
    \n", "
    \n", "
  • \n", " \n", "
  • \n", " \n", " \n", "
    \n", "
    \n", "
      \n", "
      \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
      <xarray.Dataset>\n",
             "Dimensions:           (chain: 4, draw: 500, school: 8)\n",
             "Coordinates:\n",
             "  * chain             (chain) int64 0 1 2 3\n",
             "  * draw              (draw) int64 0 1 2 3 4 5 6 ... 493 494 495 496 497 498 499\n",
             "  * school            (school) object 'Choate' 'Deerfield' ... 'Mt. Hermon'\n",
             "Data variables:\n",
             "    tune              (chain, draw) bool True False False ... False False False\n",
             "    depth             (chain, draw) int64 5 3 3 4 5 5 4 4 5 ... 4 4 4 5 5 5 5 5\n",
             "    tree_size         (chain, draw) float64 31.0 7.0 7.0 15.0 ... 31.0 31.0 31.0\n",
             "    lp                (chain, draw) float64 -59.05 -56.19 ... -63.62 -58.35\n",
             "    energy_error      (chain, draw) float64 0.07387 -0.1841 ... -0.087 -0.003652\n",
             "    step_size_bar     (chain, draw) float64 0.2417 0.2417 ... 0.1502 0.1502\n",
             "    max_energy_error  (chain, draw) float64 0.131 -0.2067 ... -0.101 -0.1757\n",
             "    energy            (chain, draw) float64 60.76 62.76 64.4 ... 67.77 67.21\n",
             "    mean_tree_accept  (chain, draw) float64 0.9506 0.9906 ... 0.9875 0.9967\n",
             "    step_size         (chain, draw) float64 0.1275 0.1275 ... 0.1064 0.1064\n",
             "    diverging         (chain, draw) bool False False False ... False False False\n",
             "    log_likelihood    (chain, draw, school) float64 -5.168 -4.589 ... -3.896\n",
             "Attributes:\n",
             "    created_at:                 2019-06-21T17:36:34.485802\n",
             "    inference_library:          pymc3\n",
             "    inference_library_version:  3.7

      \n", "
    \n", "
    \n", "
  • \n", " \n", "
  • \n", " \n", " \n", "
    \n", "
    \n", "
      \n", "
      \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
      <xarray.Dataset>\n",
             "Dimensions:    (chain: 1, draw: 500, school: 8)\n",
             "Coordinates:\n",
             "  * chain      (chain) int64 0\n",
             "  * draw       (draw) int64 0 1 2 3 4 5 6 7 ... 492 493 494 495 496 497 498 499\n",
             "  * school     (school) object 'Choate' 'Deerfield' ... 'Mt. Hermon'\n",
             "Data variables:\n",
             "    tau        (chain, draw) float64 6.561 1.016 68.91 ... 1.56 5.949 0.7631\n",
             "    tau_log__  (chain, draw) float64 1.881 0.01593 4.233 ... 1.783 -0.2704\n",
             "    mu         (chain, draw) float64 5.293 0.8137 0.7122 ... -1.658 -3.273\n",
             "    theta      (chain, draw, school) float64 2.357 7.371 7.251 ... -3.775 -3.555\n",
             "    obs        (chain, draw, school) float64 -3.54 6.769 19.68 ... -21.16 -6.071\n",
             "Attributes:\n",
             "    created_at:                 2019-06-21T17:36:34.490387\n",
             "    inference_library:          pymc3\n",
             "    inference_library_version:  3.7

      \n", "
    \n", "
    \n", "
  • \n", " \n", "
  • \n", " \n", " \n", "
    \n", "
    \n", "
      \n", "
      \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
      <xarray.Dataset>\n",
             "Dimensions:  (school: 8)\n",
             "Coordinates:\n",
             "  * school   (school) object 'Choate' 'Deerfield' ... "St. Paul's" 'Mt. Hermon'\n",
             "Data variables:\n",
             "    obs      (school) float64 28.0 8.0 -3.0 7.0 -1.0 1.0 18.0 12.0\n",
             "Attributes:\n",
             "    created_at:                 2019-06-21T17:36:34.491909\n",
             "    inference_library:          pymc3\n",
             "    inference_library_version:  3.7

      \n", "
    \n", "
    \n", "
  • \n", " \n", "
\n", "
\n", " " ], "text/plain": [ "Inference data with groups:\n", "\t> posterior\n", "\t> posterior_predictive\n", "\t> sample_stats\n", "\t> prior\n", "\t> observed_data" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load the centered eight schools model\n", "import arviz as az\n", "\n", "data = az.load_arviz_data(\"centered_eight\")\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case the `az.InferenceData` object contains both a posterior predictive distribution and the observed data, among other datasets. Each group in `InferenceData` is both an attribute on `InferenceData` and itself a `xarray.Dataset` object. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset>\n",
       "Dimensions:  (chain: 4, draw: 500, school: 8)\n",
       "Coordinates:\n",
       "  * chain    (chain) int64 0 1 2 3\n",
       "  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499\n",
       "  * school   (school) object 'Choate' 'Deerfield' ... "St. Paul's" 'Mt. Hermon'\n",
       "Data variables:\n",
       "    mu       (chain, draw) float64 -3.477 -2.456 -2.826 ... 4.597 5.899 0.1614\n",
       "    theta    (chain, draw, school) float64 1.669 -8.537 -2.623 ... 10.59 4.523\n",
       "    tau      (chain, draw) float64 3.73 2.075 3.703 4.146 ... 8.346 7.711 5.407\n",
       "Attributes:\n",
       "    created_at:                 2019-06-21T17:36:34.398087\n",
       "    inference_library:          pymc3\n",
       "    inference_library_version:  3.7
" ], "text/plain": [ "\n", "Dimensions: (chain: 4, draw: 500, school: 8)\n", "Coordinates:\n", " * chain (chain) int64 0 1 2 3\n", " * draw (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499\n", " * school (school) object 'Choate' 'Deerfield' ... \"St. Paul's\" 'Mt. Hermon'\n", "Data variables:\n", " mu (chain, draw) float64 -3.477 -2.456 -2.826 ... 4.597 5.899 0.1614\n", " theta (chain, draw, school) float64 1.669 -8.537 -2.623 ... 10.59 4.523\n", " tau (chain, draw) float64 3.73 2.075 3.703 4.146 ... 8.346 7.711 5.407\n", "Attributes:\n", " created_at: 2019-06-21T17:36:34.398087\n", " inference_library: pymc3\n", " inference_library_version: 3.7" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the posterior dataset\n", "posterior = data.posterior\n", "posterior" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our eight schools model example, the posterior trace consists of 3 variables and approximately over 4 chains. In addition, it is a hierarchical model where values for the variable `theta` are associated with a particular school. \n", "\n", "According to the xarray's terminology: \n", "* Data variables are the actual values generated from the MCMC draws\n", "* Dimensions are the axes that refer to the data variables\n", "* Coordinates are pointers to specific slices or points in the `xarray.Dataset`\n", "\n", "Observed data from the eight schools model can be accessed through the same method." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset>\n",
       "Dimensions:  (school: 8)\n",
       "Coordinates:\n",
       "  * school   (school) object 'Choate' 'Deerfield' ... "St. Paul's" 'Mt. Hermon'\n",
       "Data variables:\n",
       "    obs      (school) float64 28.0 8.0 -3.0 7.0 -1.0 1.0 18.0 12.0\n",
       "Attributes:\n",
       "    created_at:                 2019-06-21T17:36:34.491909\n",
       "    inference_library:          pymc3\n",
       "    inference_library_version:  3.7
" ], "text/plain": [ "\n", "Dimensions: (school: 8)\n", "Coordinates:\n", " * school (school) object 'Choate' 'Deerfield' ... \"St. Paul's\" 'Mt. Hermon'\n", "Data variables:\n", " obs (school) float64 28.0 8.0 -3.0 7.0 -1.0 1.0 18.0 12.0\n", "Attributes:\n", " created_at: 2019-06-21T17:36:34.491909\n", " inference_library: pymc3\n", " inference_library_version: 3.7" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the observed xarray\n", "observed_data = data.observed_data\n", "observed_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It should be noted that the observed dataset contains only 8 data variables. Moreover, it doesn't have a chain and draw dimension or coordinates unlike posterior. This difference in sizes is the motivating reason behind `InferenceData`. Rather than force multiple different sized arrays into one array, or have users manage multiple objects corresponding to different datasets, it is easier to hold references to each `xarray.Dataset` in an `InferenceData` object." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(netcdf)=\n", "\n", "## NetCDF\n", "[NetCDF](https://www.unidata.ucar.edu/software/netcdf/) is a standard for referencing array oriented files. In other words, while `xarray.Dataset`s, and by extension `InferenceData`, are convenient for accessing arrays in Python memory, *netCDF* provides a convenient mechanism for persistence of model data on disk. In fact, the netCDF dataset was the inspiration for `InferenceData` as netCDF4 supports the concept of groups. `InferenceData` merely wraps `xarray.Dataset` with the same functionality.\n", "\n", "Most users will not have to concern themselves with the *netCDF* standard but for completeness it is good to make its usage transparent. 
It is also worth noting that the netCDF4 file standard is interoperable with HDF5, which may be familiar from other contexts.\n", "\n", "Earlier in this tutorial, the `InferenceData` was loaded from a *netCDF* file:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "data = az.load_arviz_data(\"centered_eight\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, `InferenceData` objects can be persisted to disk in the netCDF format:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'eight_schools_model.nc'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.to_netcdf(\"eight_schools_model.nc\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional Reading\n", "Additional documentation and tutorials exist for xarray and netCDF4. Check the following links:\n", "\n", "### `InferenceData`\n", "* {ref}`working_with_InferenceData`: Tutorial covering the most common operations with `InferenceData` objects\n", "* {ref}`creating_InferenceData`: Cookbook with examples of generating InferenceData objects from multiple sources, including external inference libraries\n", "* {ref}`data module API reference `\n", "* {ref}`InferenceData API reference `: description of all available `InferenceData` methods, grouped by topic\n", "\n", "### xarray\n", "* To get to know xarray, check the [xarray documentation](http://xarray.pydata.org/en/stable/why-xarray.html)\n", "* Feel free to watch the [xarray lightning talk at SciPy 2015](https://www.youtube.com/watch?v=X0pAhJgySxk&t=949s), including the Q/A session\n", "\n", "### NetCDF\n", "* Read the introduction to netCDF in the official [NetCDF documentation](https://www.unidata.ucar.edu/software/netcdf/docs/)\n", "* The netCDF4-python library is used to read/write netCDF files in both netCDF4 and netCDF3 formats. 
Learn more about it in the [NetCDF4 API documentation](http://unidata.github.io/netcdf4-python/)\n", "* xarray provides direct serialization and IO to the netCDF format. Learn how to read/write netCDF files directly as xarray objects at {ref}`NetCDF usage in xarray `\n", "* Check how to read/write netCDF4 files with HDF5 and vice versa at [NetCDF interoperability with HDF5](https://www.unidata.ucar.edu/software/netcdf/docs/interoperability_hdf5.html)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 4 }
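The save and load steps shown in this tutorial can be combined into a full round trip. The sketch below assumes write access to the working directory; `az.from_netcdf` is the reading counterpart of `InferenceData.to_netcdf`:

```python
import arviz as az

data = az.load_arviz_data("centered_eight")

# Persist every group to a single netCDF file; to_netcdf returns the path
path = data.to_netcdf("eight_schools_model.nc")

# Read the file back; the group structure survives the round trip
restored = az.from_netcdf(path)
assert set(restored.groups()) == set(data.groups())
print(restored.groups())
```

Because each group maps to a netCDF4 group on disk, the resulting `.nc` file can also be opened with generic netCDF or HDF5 tools outside of Python.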