This vignette gives a first and very brief overview of how the package JointAI can be used. The different settings and options are explained in more depth in the help pages and other vignettes.
Here, we use the NHANES data that are part of the JointAI package. For more info on this data, check the help file for the NHANES data, go to the web page of the National Health and Nutrition Examination Survey (NHANES) and check out the vignette Visualizing Incomplete Data, in which the NHANES data is explored.
Fitting a linear regression model with JointAI is
straightforward with the function lm_imp()
:
lm1 <- lm_imp(SBP ~ gender + age + race + WC + alc + educ + albu + bili,
data = NHANES, n.iter = 500)
#>
|
| | 0%
|
|* | 2%
|
|** | 4%
|
|*** | 6%
|
|**** | 8%
|
|***** | 10%
|
|****** | 12%
|
|******* | 14%
|
|******** | 16%
|
|********* | 18%
|
|********** | 20%
|
|*********** | 22%
|
|************ | 24%
|
|************* | 26%
|
|************** | 28%
|
|*************** | 30%
|
|**************** | 32%
|
|***************** | 34%
|
|****************** | 36%
|
|******************* | 38%
|
|******************** | 40%
|
|********************* | 42%
|
|********************** | 44%
|
|*********************** | 46%
|
|************************ | 48%
|
|************************* | 50%
|
|************************** | 52%
|
|*************************** | 54%
|
|**************************** | 56%
|
|***************************** | 58%
|
|****************************** | 60%
|
|******************************* | 62%
|
|******************************** | 64%
|
|********************************* | 66%
|
|********************************** | 68%
|
|*********************************** | 70%
|
|************************************ | 72%
|
|************************************* | 74%
|
|************************************** | 76%
|
|*************************************** | 78%
|
|**************************************** | 80%
|
|***************************************** | 82%
|
|****************************************** | 84%
|
|******************************************* | 86%
|
|******************************************** | 88%
|
|********************************************* | 90%
|
|********************************************** | 92%
|
|*********************************************** | 94%
|
|************************************************ | 96%
|
|************************************************* | 98%
|
|**************************************************| 100%
The specification of lm_imp()
is similar to the
specification of a linear regression model for complete data using
lm()
. In this minimal example, the only difference is that
for lm_imp()
the number of iterations n.iter
has to be specified. Of course, there are many more parameters that can
(and sometimes should) be specified. In the vignette Model
Specification, many of these parameters are explained in
detail.
n.iter
specifies the length of the Markov chain,
i.e., the number of draws from the posterior distribution of the
parameter or unobserved value. How many iterations are necessary depends
on the data and complexity of the model and can vary from as few as 100
up to thousands or millions.
One important criterion is that the Markov chains need to have converged. Convergence can be evaluated visually with a trace plot.
traceplot(lm1)
The function traceplot()
produces a plot of the sampled values across iterations per parameter.
Different chains are represented by different colours.
When the sampler has converged, the chains show a horizontal band, as
in the above figure. Consequently, when traces show a trend, convergence
has not been reached, and more iterations are necessary (e.g., using the
function add_samples()
).
When the chains have converged, we can obtain the result of the model from the model summary.
Results from a model fitted with JointAI can be
printed using summary()
:
summary(lm1)
#>
#> Bayesian linear model fitted with JointAI
#>
#> Call:
#> lm_imp(formula = SBP ~ gender + age + race + WC + alc + educ +
#> albu + bili, data = NHANES, n.iter = 500)
#>
#>
#> Posterior summary:
#> Mean SD 2.5% 97.5% tail-prob. GR-crit MCE/SD
#> (Intercept) 61.928 22.0386 20.6177 104.822 0.00400 1.00 0.0264
#> genderfemale -3.116 2.2651 -7.4493 1.407 0.16267 1.00 0.0258
#> age 0.364 0.0733 0.2248 0.511 0.00000 1.00 0.0258
#> raceOther Hispanic 0.469 5.0392 -9.1897 10.436 0.90667 1.00 0.0258
#> raceNon-Hispanic White -1.615 3.0196 -7.5193 4.246 0.58267 1.00 0.0258
#> raceNon-Hispanic Black 8.793 3.6753 1.7979 15.790 0.01467 1.00 0.0258
#> raceother 3.734 3.4810 -2.9797 10.729 0.30133 1.00 0.0265
#> WC 0.238 0.0825 0.0847 0.404 0.01200 1.00 0.0258
#> alc>=1 7.244 2.2752 2.6597 11.601 0.00133 1.00 0.0293
#> educhigh -3.535 2.0936 -7.5238 0.489 0.08267 1.01 0.0258
#> albu 5.087 3.9275 -2.7394 12.826 0.20400 1.00 0.0280
#> bili -5.545 4.7804 -14.9430 3.511 0.24000 1.01 0.0282
#>
#> Posterior summary of residual std. deviation:
#> Mean SD 2.5% 97.5% GR-crit MCE/SD
#> sigma_SBP 13.2 0.739 11.8 14.7 1 0.0277
#>
#>
#> MCMC settings:
#> Iterations = 101:600
#> Sample size per chain = 500
#> Thinning interval = 1
#> Number of chains = 3
#>
#> Number of observations: 186
The output gives the posterior summary, i.e., the summary of the MCMC (Markov Chain Monte Carlo) sample, which consists of all chains combined.
By default, summary()
will only print the posterior summary for the main model parameters of
the analysis model. How one can select which parameters are shown is
described in the vignette Selecting
Parameters.
The summary consists of the posterior mean, the standard deviation, 2.5% and 97.5% quantiles of the MCMC sample, the tail probability, the Gelman-Rubin criterion for convergence and the ratio of the Monte Carlo error to the standard deviation of the posterior sample.
The tail probability is a measure of how likely the value 0 is under the estimated posterior distribution, and is calculated as \[2\times\min\left\{Pr(\theta > 0), Pr(\theta < 0)\right\}\] (where \(\theta\) is the parameter of interest).
In the following graphics, the shaded areas represent the minimum of \(Pr(\theta > 0)\) and \(Pr(\theta < 0)\):
The Gelman-Rubin1 criterion, also available via the function
GR_crit()
,
compares the within and between chain variation. When it is close enough
to 12,
the chains can be assumed to have converged.
The Monte Carlo error is a measure of the error that is made because
the estimate is based on a finite sample. It can also be obtained using
the function MC_error()
.
Ideally, the Monte Carlo error should be small compared to the standard
error of the posterior sample. The values shown in the model summary are
the Monte Carlo error divided by the posterior standard deviation.
In the model summary, additionally, some important characteristics of
the MCMC samples on which the summary is based, are given. This includes
the range and number of iterations (=
Sample size per chain
), thinning interval and number of
chains.
Furthermore, the number of observations (the sample size of the data) is given.
With the arguments start
, end
and
thin
it is possible to select which iterations from the
MCMC sample are included in the summary.
For example:
When the traceplot shows that the chains only
converged after 1500 iterations, start = 1500
should be
specified in summary()
.
A summary of the missing values in all variables involved in the
model can be added to the summary by setting
missinfo = TRUE
.
The posterior distributions can be visualized using the function
densplot()
:
densplot(lm1)
By default, densplot()
plots the empirical distribution
of each of the chains separately. When joined = TRUE
the
distributions of the combined chains are plotted.
Besides linear regression models, it is also possible to fit
generalized linear models: glm_imp()
log-normal models: lognorm_imp()
beta models for continuous outcomes between 0
and 1 (proportions): betareg_imp()
cumulative logit models for ordinal outcomes: clm_imp()
multinomial logit models for unordered factors
with multiple levels: mlogit_imp()
linear mixed models: lme_imp()
or lmer_imp()
(identical)
generalized linear mixed models: glme_imp()
or glmer_imp()
(identical)
log-normal mixed models: lognormmm_imp()
beta mixed models: betamm_imp()
cumulative logit mixed models: clmm_imp()
multinomial logit mixed models: mlogitmm_imp()
parametric (Weibull) survival models: survreg_imp()
proportional hazards survival models: coxph_imp()
joint models for longitudinal and survival data:
JM_imp()
It is also possible to specify multiple analysis models at the same time, by providing a list of model formulas.
Gelman, A and Rubin, DB (1992) Inference from iterative
simulation using multiple sequences, Statistical Science, 7,
457-511.
Brooks, SP. and Gelman, A. (1998) General methods for
monitoring convergence of iterative simulations. Journal of
Computational and Graphical Statistics, 7, 434-455.↩︎
for example < 1.1; but this is not a generally accepted cut-off↩︎