#### \(n_{eff} / N\)

#### \(mcse / sd\)

#### \(\hat{R}\)

ShinyStan

Use your mouse to select a range in the traceplot to zoom into.
The other plots on the screen will update accordingly.
Double-click to reset.

Lines are mean (solid) and median (dashed)


Large red points indicate which (if any) iterations
encountered a divergent transition. Yellow indicates
a transition hitting the maximum treedepth.

In the original implementation of NUTS, `accept_stat` was the acceptance
probability averaged over samples in the slice. In more recent versions of Stan
the NUTS algorithm uses multinomial sampling over the states for each
Hamiltonian trajectory. For HMC without NUTS, `accept_stat` is the standard
Metropolis acceptance probability.
If the leapfrog integrator were perfect numerically, there would be no need for any randomization per transition beyond generating a random momentum vector. In practice, to account for numerical errors during integration, a Metropolis acceptance step is applied. If the proposal is not accepted, the previous parameter value is returned as the next draw and used to initialize the next iteration.
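As a concrete illustration of the momentum resampling, leapfrog integration, and Metropolis correction described above, here is a minimal sketch of a single HMC transition on a one-dimensional standard-normal target. All names are hypothetical and this is far simpler than Stan's sampler (no NUTS, no adaptation); it only shows the structure of one transition.

```python
import math
import random

def log_prob(q):
    # Log density of a standard-normal target, up to a constant
    return -0.5 * q * q

def grad_log_prob(q):
    return -q

def hmc_transition(q, step_size, n_leapfrog, rng):
    p = rng.gauss(0.0, 1.0)                  # fresh random momentum
    current_h = -log_prob(q) + 0.5 * p * p   # Hamiltonian = potential + kinetic

    # Leapfrog integration of the Hamiltonian dynamics
    q_new, p_new = q, p
    p_new += 0.5 * step_size * grad_log_prob(q_new)
    for _ in range(n_leapfrog - 1):
        q_new += step_size * p_new
        p_new += step_size * grad_log_prob(q_new)
    q_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_prob(q_new)

    # Metropolis acceptance step corrects for numerical integration error
    proposed_h = -log_prob(q_new) + 0.5 * p_new * p_new
    delta = current_h - proposed_h
    accept_prob = 1.0 if delta >= 0 else math.exp(delta)
    if rng.random() < accept_prob:
        return q_new
    return q                                 # rejected: keep the previous draw

rng = random.Random(1)
q, draws = 0.0, []
for _ in range(2000):
    q = hmc_transition(q, step_size=0.3, n_leapfrog=10, rng=rng)
    draws.append(q)
mean = sum(draws) / len(draws)
```

Note how a rejected proposal simply returns the previous parameter value as the next draw, exactly as described above.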

By setting the target acceptance parameter to a value closer to 1 (its value must be strictly less than 1 and its default value is 0.8), adaptation will be forced to use smaller step sizes. This can improve sampling efficiency (effective samples per iteration) at the cost of increased iteration times. Raising the target will also allow some models that would otherwise get stuck to overcome their blockages.

Glossary entries are compiled (with minor edits) from various excerpts of the Stan Modeling Language User's Guide and Reference Manual (CC BY v3).

The indicator `divergent` is 1 for iterations that encountered a divergent
transition and 0 otherwise; the mean of `divergent` over all iterations is
therefore the proportion of iterations with divergent error.
When numerical issues arise during the evaluation of the parameter Jacobians or the model log density, an exception is raised in the underlying code and the current expansion of the Hamiltonian forward and backward in time is halted. This is marked as a divergent transition.

The primary cause of divergent transitions in Euclidean HMC (other than bugs in the model code) is numerical instability in the leapfrog integrator used to simulate the Hamiltonian dynamics. The fundamental problem is that a fixed step size is multiplied by the gradient at a particular point to determine the next simulated point. If the step size is too large, this can overshoot into ill-defined portions of the posterior.

**If there are (post-warmup) divergences then the results may be biased and should not be used.**

In some cases, simply lowering the initial step size and increasing the target acceptance rate will keep the step size small enough that sampling can proceed.

The exact cause of each divergent transition is printed as a warning message in the output console. This can be useful in cases where managing the step size is insufficient. In such cases, a reparameterization is often required so that the posterior curvature is more manageable; see the section about Neal's Funnel in the Stan manual for an example.

For more details see Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo.

`energy` is the value of the Hamiltonian (up to an additive constant) at each
sample. While divergences can identify light tails and incomplete exploration
of the target distribution, the energy diagnostic can identify overly heavy
tails that are also challenging for sampling. Informally, the energy
diagnostic for HMC quantifies the heaviness of the tails of the posterior
distribution. The energy diagnostic plot shows overlaid histograms of the
(centered) marginal energy distribution and the first-differenced
distribution. Keep an eye out for discrepancies between these distributions.
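The comparison between the marginal energy distribution and the first-differenced distribution is often summarized by a single number, the energy Bayesian fraction of missing information (E-BFMI) from Betancourt (2017). The Python sketch below is purely illustrative (the function name and the ~0.3 warning threshold convention are assumptions, not part of ShinyStan's interface): values much smaller than 1 suggest the momentum resampling cannot explore the energy distribution efficiently, i.e. heavy tails.

```python
import random

def e_bfmi(energies):
    # Mean squared difference of successive energies, relative to the
    # variance of the marginal energy distribution
    n = len(energies)
    mean_e = sum(energies) / n
    var_e = sum((e - mean_e) ** 2 for e in energies) / (n - 1)
    sq_diffs = sum((energies[i] - energies[i - 1]) ** 2 for i in range(1, n))
    return (sq_diffs / (n - 1)) / var_e

rng = random.Random(0)
# Independent energies: the two distributions match, so the statistic is
# close to 2 (E[(X - Y)^2] = 2 * Var for independent X, Y)
energies = [rng.gauss(0.0, 1.0) for _ in range(5000)]
value = e_bfmi(energies)
```

Values well below this (commonly flagged below about 0.3) indicate the kind of discrepancy between the two histograms that the energy plot is designed to reveal.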

For more details see Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo.

All implementations of HMC use numerical integrators requiring a step size (equivalently, discretization time interval).

If `step_size` is too large, the leapfrog integrator will be inaccurate and
too many proposals will be rejected. If `step_size` is too small, too many
small steps will be taken by the leapfrog integrator, leading to long
simulation times per interval. Thus the goal is to balance the acceptance rate
between these extremes.


If `n_leapfrog` is too small, the trajectory traced out in each iteration will
be too short and sampling will devolve to a random walk. If `n_leapfrog` is
too large, the algorithm will do too much work on each iteration.


Configuring NUTS involves putting a cap on the depth of the trees that it
evaluates during each iteration. This is controlled through a maximum depth
parameter. `n_leapfrog` is then bounded by 2 to the power of the maximum
depth, minus 1.

Tree depth is an important diagnostic tool for NUTS. For example,
`treedepth = 0` occurs when the first leapfrog step is immediately rejected
and the initial state returned, indicating extreme curvature and a poorly
chosen `stepsize` (at least relative to the current position).

On the other hand, `treedepth = max_treedepth` indicates that NUTS is taking
many leapfrog steps and being terminated prematurely to avoid excessively long
execution time.

Taking very many steps may be a sign of poor adaptation, may be due to targeting a very high acceptance rate, or may simply indicate a difficult posterior from which to sample. In the latter case, reparameterization may help with efficiency. But in the rare cases where the model is correctly specified and a large number of steps is necessary, the maximum depth should be increased to ensure that the NUTS tree can grow as large as necessary.


In each iteration, HMC simulates the evolution of the parameters for a given
`stepsize` and number of steps `n_leapfrog` according to the Hamiltonian
dynamics. Then a Metropolis acceptance step is applied, and a decision is made
whether to update to the new state or keep the existing state.

NUTS automatically selects `n_leapfrog` in each iteration in order to allow
the proposals to traverse the posterior without doing unnecessary work. The
motivation is to maximize the expected squared jump distance (see, e.g.,
Roberts et al. (1997)) at each step and avoid the random-walk behavior that
arises in random-walk Metropolis or Gibbs samplers when there is correlation
in the posterior. For a precise definition of the NUTS algorithm see Hoffman
and Gelman (2011, 2014).

Each time NUTS doubles the trajectory, `treedepth` is increased by one,
doubling `n_leapfrog` and effectively doubling the computation time. The
algorithm terminates in one of two ways: either
- the NUTS criterion (i.e., a U-turn in Euclidean space on a subtree) is satisfied for a new subtree or the completed tree, or
- the depth of the completed tree hits the maximum depth allowed.

Configuring the no-U-turn sampler involves putting a cap on the `treedepth`
that it evaluates during each iteration. This is controlled through a maximum
depth parameter. The number of leapfrog steps taken is then bounded by 2 to
the power of the maximum depth, minus 1.
For more details see Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo.


**The idea behind posterior predictive checking is simple:**

*If our model is a good fit then we should be able to use it to generate data that looks a lot like the data we observed.*

To generate this 'replicated' data we use the *posterior predictive distribution*

\[ p(y^{rep} | y) = \int p(y^{rep} | \theta) \, p(\theta | y) \, d\theta, \]

where \(y\) is the observed data and \(\theta\) the parameters in our model.

For each draw of \(\theta\) from the posterior \(p(\theta | y)\) we simulate data \(y^{rep}\) from the data model \(p(y^{rep} | \theta)\); taken together, these simulations are draws from the posterior predictive distribution \(p(y^{rep} | y)\).

Using the simulations of \(y^{rep}\) we can make various graphical displays comparing our observed data to the replications.

For a more thorough discussion of posterior predictive checking see Chapter 6 of BDA3.

In this tutorial we do the following:

- Generate some fake data to play with
- Write code for a simple Stan model
- Fit the model using **RStan**
- Use **ShinyStan** for graphical posterior predictive checks

First we'll generate some fake data in R to use for this example:

```
set.seed(1) # for reproducibility
# Number of observations
N <- 100
# Model matrix (with column of 1s for intercept and one covariate)
X <- cbind(Const = 1, X1 = rnorm(N))
K <- ncol(X)
# Generate fake outcome y
beta <- c(2, 1/2) # pick intercept and coefficient
sigma <- 1 # standard deviation
y <- rnorm(N, mean = X %*% beta, sd = sigma) # generate data
```

Now we can write Stan code for a simple linear regression model.

```
data {
  int<lower=1> N ;       // number of observations
  int<lower=1> K ;       // number of columns in model matrix
  matrix[N,K] X ;        // N by K model matrix
  vector[N] y ;          // vector of N observations
}
parameters {
  real<lower=0> sigma ;  // real number > 0, standard deviation
  vector[K] beta ;       // K-vector of regression coefficients
}
model {
  beta ~ normal(0, 5) ;        // prior for betas
  sigma ~ cauchy(0, 2.5) ;     // prior for sigma
  y ~ normal(X * beta, sigma) ; // vectorized likelihood
}
generated quantities {
  // Here we do the simulations from the posterior predictive distribution
  vector[N] y_rep ;      // vector of same length as the data y
  for (n in 1:N)
    y_rep[n] = normal_rng(X[n] * beta, sigma) ;
}
```

In this case the posterior predictive distribution we want to simulate from is the normal distribution with mean and standard deviation updated to reflect the posterior draws of `beta` and `sigma`.

The code in the `generated quantities` block will be evaluated for each posterior draw of the parameters. For example, if we have 100 post-warmup iterations then we will have 100 `y_rep` vectors, each of length `N`.

If we've saved our Stan code in a file called `stan_code.stan` then we can run this model with **RStan** and then launch **ShinyStan** like this:

```
library(rstan)
library(shinystan)
# Prepare the data we'll need as a list
stan_data <- list(y = y, X = X, N = N, K = K)
# Fit the model
stanfit <- stan(file = "stan_code.stan", data = stan_data)
# Launch ShinyStan
launch_shinystan(stanfit)
```

Once we've launched **ShinyStan** we can navigate to the page for posterior predictive checking. In the dropdown menus it will ask us to select the object containing our data from our R global environment and the name of the parameter from our model containing the posterior predictive replications. So we enter `y` and `y_rep`, respectively.

**ShinyStan** will then generate graphics that will aid in checking the fit of our model, including comparisons of the distribution of the observed data to the distributions of the posterior predictive replications, distributions of test statistics, and residual plots.

Add parameters by regex search

Use your mouse to highlight areas in the traceplot to zoom into. Double-click to reset. The number in the small black box in the bottom left corner controls the

ggplot2 pdf

Use your mouse and trackpad to rotate the plot and zoom in or out.

ShinyStan

mc-stan.org

Citation

@Misc{shinystan-software:2017, title = {{shinystan}: Interactive Visual and Numerical Diagnostics and Posterior Analysis for {Bayesian} Models}, author = {Stan Development Team}, note = {R package version 2.4.0}, year = {2017}, url = {https://mc-stan.org}}

Michael Betancourt

Bob Carpenter

Yuanjun Gao

Andrew Gelman

Ben Goodrich

Daniel Lee

Dongying Song

Rob Trangucci

Samples in a Markov chain are only drawn with the marginal distribution \(p(\theta | y,x)\) after the chain has converged to its equilibrium distribution. There are several methods to test whether an MCMC method has failed to converge; unfortunately, passing the tests does not guarantee convergence. The recommended method for Stan is to run multiple Markov chains, initialized randomly with a diffuse set of initial parameter values, discard the warmup/adaptation samples, then split the remainder of each chain in half and compute the potential scale reduction statistic \(\hat{R}\).

If the effective sample size is too low to make inferences with the desired precision, double the number of iterations and start again, including rerunning warmup and everything. Often, a small effective sample size is the result of too few warmup iterations. At most, this rerunning strategy will consume about 50% more cycles than guessing the correct number of iterations at the outset.

The estimation of effective sample size is described in detail in the 'Markov Chain Monte Carlo Sampling' chapter of the Stan Modeling Language User's Guide and Reference Manual.

When estimating a mean based on a sample of \(M\) independent draws, the estimation error is proportional to \(1/\sqrt{M}\). If the draws are positively correlated, as they typically are when drawn using MCMC methods, the error is proportional to \(1/\sqrt{n_{eff}}\), where \(n_{eff}\) is the effective sample size. Thus it is standard practice to also monitor (an estimate of) the effective sample size until it is large enough for the estimation or inference task at hand.
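To make this concrete, here is a simplified single-chain sketch in Python of how such an estimate can be formed: estimate \(n_{eff}\) from empirical autocovariances using Geyer's initial-positive-sequence truncation, then form the Monte Carlo standard error as \(sd/\sqrt{n_{eff}}\). Stan's actual estimator combines multiple chains and differs in several details; function names here are hypothetical.

```python
import math
import random

def autocovariance(x, lag):
    n = len(x)
    m = sum(x) / n
    return sum((x[i] - m) * (x[i + lag] - m) for i in range(n - lag)) / n

def n_eff(x):
    # Integrated autocorrelation time via Geyer's initial positive sequence:
    # sum consecutive pairs of autocorrelations until a pair goes negative.
    n = len(x)
    c0 = autocovariance(x, 0)
    tau, k = -1.0, 0
    while 2 * k + 1 < n:
        pair = (autocovariance(x, 2 * k) + autocovariance(x, 2 * k + 1)) / c0
        if pair < 0:
            break
        tau += 2.0 * pair
        k += 1
    return n / max(tau, 1.0)

def mcse(x):
    sd = math.sqrt(autocovariance(x, 0))     # marginal standard deviation
    return sd / math.sqrt(n_eff(x))          # Monte Carlo standard error

# Demo: an AR(1) chain with coefficient 0.9 has integrated autocorrelation
# time (1 + 0.9) / (1 - 0.9) = 19, so n_eff should be roughly n / 19.
rng = random.Random(2)
x, chain = 0.0, []
for _ in range(5000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)
    chain.append(x)
```

The ratio `mcse(x) / sd` corresponds to the `mcse / sd` quantity reported in the diagnostics table.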

Gelman and Rubin’s recommendation is that the independent Markov chains be initialized with diffuse starting values for the parameters and sampled until all values for \(\hat{R}\) are below 1.1. Stan allows users to specify initial values for parameters and it is also able to draw diffuse random initializations itself.

Details on the computation of \(\hat{R}\) and some of its limitations can be found in the 'Markov Chain Monte Carlo Sampling' chapter of the Stan Modeling Language User's Guide and Reference Manual.
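For illustration, the split-\(\hat{R}\) computation described in the manual can be sketched as follows: split each chain in half, compare the between-half variance \(B\) with the within-half variance \(W\), and report \(\sqrt{\widehat{var}^{+}/W}\) where \(\widehat{var}^{+} = \frac{n-1}{n}W + \frac{1}{n}B\). This simplified Python version handles a single scalar parameter; Stan's implementation adds further refinements.

```python
import math
import random

def split_rhat(chains):
    # Split each chain in half and treat the halves as separate chains
    halves = []
    for c in chains:
        m = len(c) // 2
        halves.append(c[:m])
        halves.append(c[m:2 * m])
    n = len(halves[0])
    means = [sum(h) / n for h in halves]
    grand_mean = sum(means) / len(means)

    # Between-half variance B and within-half variance W
    b = n * sum((mu - grand_mean) ** 2 for mu in means) / (len(means) - 1)
    w = sum(sum((v - mu) ** 2 for v in h) / (n - 1)
            for h, mu in zip(halves, means)) / len(halves)

    var_plus = (n - 1) / n * w + b / n   # overestimate of posterior variance
    return math.sqrt(var_plus / w)

rng = random.Random(0)
# Four well-mixed chains targeting the same distribution: R-hat near 1
good = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# One chain stuck in a different region: R-hat well above 1.1
bad = [[rng.gauss(mu, 1.0) for _ in range(1000)] for mu in (0.0, 0.0, 0.0, 3.0)]
```

Running `split_rhat(good)` and `split_rhat(bad)` shows why the 1.1 threshold works: chains exploring the same distribution give values very close to 1, while a stuck chain inflates the between-chain variance and hence \(\hat{R}\).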


To ask a question or suggest a new feature visit the Stan users message board.

To report a bug or suggest a new feature visit the GitHub issue tracker.

Clicking on a 'Save ggplot2 object' button will save an .RData file that you
can load into your Global Environment using the `load` function in R. You can
then make changes to the plot using the functions in the ggplot2 package.

Any plot that can be saved as a ggplot2 object can also be saved as a PDF.

The `drop_parameters` function in the **shinystan** R package will allow you
to reduce the size of a shinystan object by removing parameters. See
`help('drop_parameters', 'shinystan')` for the documentation.

Additionally, for large models, the `launch_shinystan` function will launch
the app faster when used with a shinystan object rather than a stanfit object
(because no conversion is required). If ShinyStan takes a long time to launch
for your model then it can help to first create a shinystan object using the
`as.shinystan` function. Alternatively, the first time you launch ShinyStan
using a stanfit object, a shinystan object will be returned if you assign the
value of `launch_shinystan` to a name, e.g.

`sso <- launch_shinystan(stanfit)`

rather than just

`launch_shinystan(stanfit)`

The next time you launch ShinyStan for the same model you can launch it using
`sso` rather than `stanfit` and it should be quicker to launch. If it is still
too slow then dropping some large parameters from the shinystan object is the
best solution.