arviz_stats.compare#
- arviz_stats.compare(compare_dict, method='stacking', var_name=None, reference=None, round_to='auto')[source]#
Compare models based on their expected log pointwise predictive density (ELPD).
The ELPD is estimated by Pareto smoothed importance sampling leave-one-out cross-validation, the same method used by
arviz_stats.loo. The method is described in [1] and [2]. By default, the weights are estimated using"stacking"as described in [3].If more than 11 models are compared, a diagnostic check for selection bias is performed [4]. If detected, avoid LOO-based selection and use model averaging/stacking or projection predictive inference.
See the EABM chapters on Model Comparison, Model Comparison (Case Study), and Model Comparison for Large Data for more details.
- Parameters:
- compare_dict: dict of {str: DataTree or ELPDData}
A dictionary of model names and
xr.DataTreeorELPDData.- method: str, optional
Method used to estimate the weights for each model. Available options are:
‘stacking’ : stacking of predictive distributions.
‘BB-pseudo-BMA’ : pseudo-Bayesian Model averaging using Akaike-type weighting. The weights are stabilized using the Bayesian bootstrap.
‘pseudo-BMA’: pseudo-Bayesian Model averaging using Akaike-type weighting, without Bootstrap stabilization (not recommended).
For more information read https://arxiv.org/abs/1704.02030
- var_name: str, optional
If there is more than a single observed variable in the
InferenceData, which should be used as the basis for comparison.- reference: str, optional
Name of the reference model used for computing
elpd_diff. IfNone(default), the best-performing model (highest ELPD) is used as the reference. When specified, allelpd_diffvalues are computed relative to this model, which will haveelpd_diff = 0. This is useful for comparing against a baseline model, null model, or a specific model of interest rather than the top-ranked model.- round_to
intor {“auto”, “none”}, optional Rounding specification. Defaults to “auto”. If integer, number of decimal places to round to. Use the string “None” or “none” to return raw numbers. If None use
rcParams["stats.round_to"]. If"auto", applies custom rounding rules to columns in the returned DataFrame:elpdandelpd_diffare rounded based onseanddserespectively,using the same rule as
summarystat/se pairs.
seanddseare rounded based onrcParams["stats.round_to"].pis rounded to 1 decimal place.weightuses precision based on the largest weight, showing approximately 2 significant digits for that maximum value.
- Returns:
DataFrameA DataFrame, ordered from best to worst model (measured by the ELPD). The index reflects the key with which the models are passed to this function. The columns are:
rank: The rank-order of the models. 0 is the best.
elpd_diff: The difference in ELPD between each model and the reference model, computed as
elpd_model - elpd_reference. By default the reference is the top-ranked model, so all values are negative or zero. The reference model always has anelpd_diffof 0.dse: Standard error of the difference in ELPD between each model
p_worse: The probability that each model is worse than the best ranked model. Probabilities are computed with a normal approximation. [5] presents the conditions when this approximation is good. If a reference model is specified, this column is renamed to p_better and reflects the probability that each model is better than the reference model.
diag_diff: Potential issues with the ELPD difference. It can take 3 values: N < 100 (small data), |elpd_diff| < 4 (models make similar predictions), or empty string (no issues detected). If either of the first two values are shown, the ELPD differences and probabilities (worse or better) should be interpreted with caution as the error distribution is skewed or thick tailed and the normal approximation not well calibrated. However, elpd_diff and dse values will still be useful for understanding the magnitude of the differences. For example, if |elpd_diff| is many times larger than dse the difference is quite certain even if the exact probability value is not well calibrated (and likely overestimated). In addition, if the model is not well specified and there are outliers, the error distribution can also be skewed or thick tailed and the normal approximation is not well calibrated. Possible model misspecification and outliers can be diagnosed with usual predictive checking methods.
diag_elpd: Potential issues with the ELPD estimate. It can take the value
K k_psis > thresholdwhere K is the number of high Pareto k values in the PSIS computation, or empty string if no issues detected. IfK k_psis > thresholdis shown, there may be significant bias in ELPD differences favoring models with a large number of high Pareto k values. Thethresholdis thegood_kattribute in the input ELPD results.p: pIC, Estimated effective number of parameters.
elpd: ELPD estimated using PSIS-LOO-CV (elpd_loo). Higher ELPD indicates higher out-of-sample predictive fit (“better” model).
se: Standard error of the ELPD estimate. If method = BB-pseudo-BMA these values are estimated using Bayesian bootstrap.
weight: Relative weight for each model. This can be loosely interpreted as the probability of each model (among the compared models) given the data. By default the uncertainty in the weights estimation is considered using Bayesian bootstrap.
subsampling_dse: (Only when subsampling is used) The subsampling component of the standard error of the ELPD difference. This quantifies the uncertainty due to using a subsample rather than all observations.
warning: A value of 1 indicates that the computation of the ELPD may not be reliable. This could be indication of LOO starting to fail see http://arxiv.org/abs/1507.04544 for details.
See also
looCompute the ELPD using the Pareto smoothed importance sampling Leave-one-out cross-validation method.
arviz_plots.plot_compareSummary plot for model comparison.
References
[1]Vehtari et al. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing. 27(5) (2017) https://doi.org/10.1007/s11222-016-9696-4 arXiv preprint https://arxiv.org/abs/1507.04544.
[2]Vehtari et al. Pareto Smoothed Importance Sampling. Journal of Machine Learning Research, 25(72) (2024) https://jmlr.org/papers/v25/19-556.html arXiv preprint https://arxiv.org/abs/1507.02646
[3]Yao et al. Using stacking to average Bayesian predictive distributions Bayesian Analysis, 13, 3 (2018). https://doi.org/10.1214/17-BA1091 arXiv preprint https://arxiv.org/abs/1704.02030.
[4]McLatchie, Y., Vehtari, A. Efficient estimation and correction of selection-induced bias with order statistics. Statistics and Computing, 34, 132 (2024). https://doi.org/10.1007/s11222-024-10442-4 arXiv preprint https://arxiv.org/abs/2309.03742
[5]Sivula et al. Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison. (2025). https://doi.org/10.48550/arXiv.2008.10296
Examples
Compare the centered and non centered models of the eight school problem:
In [1]: In [1]: from arviz_stats import compare ...: ...: from arviz_base import load_arviz_data ...: ...: data1 = load_arviz_data("non_centered_eight") ...: ...: data2 = load_arviz_data("centered_eight") ...: ...: compare_dict = {"non centered": data1, "centered": data2} ...: ...: compare(compare_dict) ...: Out[1]: rank elpd_diff dse p_worse ... p elpd se weight non centered 0 0.0 0.000 NaN ... 0.9 -31.0 1.4 1.0 centered 1 -0.0 0.055 0.7 ... 0.9 -31.0 1.3 0.0 [2 rows x 10 columns]
Compare models using subsampled LOO:
In [2]: In [1]: from arviz_stats import loo_subsample ...: ...: from arviz_base import load_arviz_data ...: ...: data1 = load_arviz_data("non_centered_eight") ...: ...: data2 = load_arviz_data("centered_eight") ...: ...: loo_sub1 = loo_subsample(data1, observations=6, pointwise=True, seed=42) ...: ...: loo_sub2 = loo_subsample(data2, observations=6, pointwise=True, seed=42) ...: ...: compare({"non_centered": loo_sub1, "centered": loo_sub2}) ...: Out[2]: rank elpd_diff dse ... se weight subsampling_dse non_centered 0 0.0 0.000 ... 1.5 0.5 0.00000 centered 1 -0.0 0.052 ... 1.4 0.5 0.01881 [2 rows x 11 columns]
When using subsampled LOO, the
subsampling_dsecolumn quantifies the additional uncertainty from using subsamples instead of all observations. Theelpd_diffvalues are computed using a difference-of-estimators approach on overlapping observations, which can differ from simple subtraction of ELPD values. Using the same seed across models ensures overlapping observations for more accurate paired comparisons with smaller standard errors.