Software
The output in this practical was
generated using R version 4.2.1 (2022-06-23 ucrt) and
JointAI version 1.0.3.9000 (with rjags
version 4.13 and JAGS version 4.3.1).
load("Data/datHCV.RData")
We will work with a simulated dataset, `datHCV`, that is based on a cohort of patients with chronic hepatitis C:

- 250 patients (`id`) from 10 hospitals (`center`)
- 985 rows in the data
- potential additional grouping factor `Genotype` (crossed with `id`)
- 4 repeatedly measured variables: `logCreatinin`, `Albumin`, `logFIB4`, `Creatinin`
- time variable: `time`
- 10 baseline covariates: `Age0`, `Sex`, `alc_pwk`, `AntiHBc`, `DM`, `race`, `Weight`, `Height`, `Alcoholabuse`, `BMI`
- time-to-event outcome: `event` (values: `FALSE`, `TRUE`) and corresponding event/censoring time `etime`
We can visualize the variables in the data using the function
plot_all()
(⇨
vignette). The argument idvars
allows us to specify
variables that define a hierarchical structure. The plot then shows the
observed frequencies on that level (e.g., the subject-specific variables
are counted only once per subject).
library("JointAI")
par(mar = c(3, 3, 2, 1), mgp = c(2, 0.6, 0))
plot_all(datHCV, idvars = c("id", "center", "Genotype"), breaks = 50)
The missing data pattern of the data can be obtained and plotted
using md_pattern()
(⇨
vignette). Due to the large number of patterns arising when
considering all variables we produce separate plots for the time-varying
variables and the baseline variables.
Missing data pattern of the time-constant variables:
md_pattern(subset(datHCV, select = -c(logFIB4, Albumin, logCreatinin, time)))
Missing data pattern of the longitudinal variables:
md_pattern(subset(datHCV, select = c(logFIB4, Albumin, logCreatinin, time)))
Preparations
To get to know the syntax used in
JointAI, we start by fitting linear regression models
to the example data.
First, we create a version of the data that only contains the observations at baseline:
HCVbase <- subset(datHCV, time == 0)
The specification of models in JointAI is kept as similar as possible to the specification using the functions `lm()` or `glm()` for complete data. The corresponding functions in JointAI are `lm_imp()` and `glm_imp()`.
The main functions in JointAI have the structure `*_imp()`; currently available are:

Generalized linear models (and extensions)
- `lm_imp()`: linear regression
- `glm_imp()`: generalized linear regression
- `lognorm_imp()`: log-normal regression
- `betareg_imp()`: regression using a beta-distribution
- `clm_imp()`: ordinal (cumulative logit) regression
- `mlogit_imp()`: multinomial regression

Mixed models
- `lme_imp()`, `lmer_imp()`: linear mixed model
- `glme_imp()`, `glmer_imp()`: generalized linear mixed model
- `betamm_imp()`: beta mixed model
- `lognormmm_imp()`: log-normal mixed model
- `clmm_imp()`: cumulative logit mixed model
- `mlogitmm_imp()`: multinomial logit mixed model

Survival models
- `coxph_imp()`: proportional hazards model (with B-spline baseline hazard)
- `survreg_imp()`: parametric (Weibull) model
- `JM_imp()`: joint model for longitudinal and survival data

Important arguments that are known from other functions/packages are
- `formula`
- `data`
- `family` (⇨ Examples / Documentation)
- `seed`
The only additional argument that has to be specified is
n.iter
, which determines the number of iterations of the
MCMC sampler.
By default, n.iter = 0
, so that no MCMC sample is
generated. This is to avoid unintentionally long-running models. Only
the adaptive phase is then run (default: n.adapt = 100
,
more details later).
Fit a linear regression model to investigate the association between a subject’s `Albumin` (at baseline) and the covariates `Age0`, `Sex`, `race`, `DM`, and `alc_pwk`. For now, a small number of iterations, say 200, should be sufficient.

- Use `lm_imp()`.
- Specify the arguments `formula`, `data`, and `n.iter` (and optionally the `seed`).
lm1 <- lm_imp(Albumin ~ Age0 + Sex + DM + race + alc_pwk,
              data = HCVbase, n.iter = 200, seed = 1234)
After fitting the model, convergence of the MCMC chains has to be
confirmed. This can be done using a trace plot, i.e., with the help of
the function traceplot()
.
The function densplot()
shows us the posterior
densities.
Investigate the trace plot and density plot for model
lm1
.
traceplot(lm1)
The chains seem to have converged, but on close inspection we can see
that the trace plots for the three dummies for alc_pwk
look
very similar.
densplot(lm1)
The functions traceplot()
and densplot()
have some more options to change how things look, including the use of
ggplot2 instead of using base
graphics.
You can find out more in the help of the functions and in the vignette.
When chains have converged, we can obtain the model
summary()
(⇨
vignette).
The function `summary()` returns a summary of the posterior distribution, (by default) consisting of the posterior mean, standard deviation, the 2.5% and 97.5% quantiles, the tail probability, the Gelman-Rubin criterion, and the ratio of the Monte Carlo error to the posterior standard deviation.
The tail probability is calculated as \(2 \times \min\{Pr(\theta > 0), Pr(\theta < 0)\}\), where \(\theta\) is the parameter of interest. It is a measure of how likely the value 0 is under the estimated posterior distribution. The figure visualizes three examples of posterior distributions and the corresponding minimum of \(Pr(\theta > 0)\) and \(Pr(\theta < 0)\) (shaded area):
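As a standalone illustration (not JointAI code), the tail probability can be estimated from a vector of posterior draws by the corresponding sample proportions; here simulated normal draws stand in for an MCMC sample:

```r
# simulated draws standing in for an MCMC sample of a parameter theta
set.seed(1)
theta <- rnorm(10000, mean = 1.5, sd = 1)

# 2 * min(Pr(theta > 0), Pr(theta < 0)), estimated by sample proportions
tail_prob <- 2 * min(mean(theta > 0), mean(theta < 0))
tail_prob
```

A posterior centred far away from 0 gives a small tail probability; one centred near 0 gives a value close to 1.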
The Gelman-Rubin criterion (aka R-hat or “potential scale reduction factor”) uses the within-chain and between-chain variation to evaluate whether the chains have converged to the same distribution.
`GR_crit()` internally uses `coda::gelman.diag()`.
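To see what `coda::gelman.diag()` computes, we can apply it outside of JointAI to three hand-made chains of independent draws (which should give values close to 1):

```r
library("coda")

# three artificial, well-mixed 'chains' of independent standard-normal draws
set.seed(2)
chains <- mcmc.list(lapply(1:3, function(i) mcmc(rnorm(500))))

# potential scale reduction factor; values close to 1 indicate convergence
gelman.diag(chains)
```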
It also contains information on

- the MCMC chains,
- the data, and
- (with `missinfo = TRUE`) the amount of missing values.

Obtain the `summary()` of `lm1` and inspect it.
summary(lm1, missinfo = TRUE)
##
## Bayesian linear model fitted with JointAI
##
## Call:
## lm_imp(formula = Albumin ~ Age0 + Sex + DM + race + alc_pwk,
## data = HCVbase, n.iter = 200, seed = 1234)
##
##
## Posterior summary:
## Mean SD 2.5% 97.5% tail-prob. GR-crit MCE/SD
## (Intercept) 48.0372 4.2296 39.7040 56.5031 0.0000 1.01 0.0762
## Age0 -0.0536 0.0391 -0.1301 0.0235 0.1533 1.02 0.0408
## SexFemale -1.6004 1.0402 -3.6236 0.3323 0.1333 1.01 0.0408
## DMYes 0.0751 1.2443 -2.4021 2.5021 0.9633 1.01 0.0453
## raceOther -1.7816 1.6269 -4.7840 1.3969 0.3000 1.00 0.0490
## raceAsian 3.7429 1.9834 -0.0554 7.6367 0.0533 1.02 0.0523
## alc_pwk<1 Drink per week 3.2305 3.9734 -4.4354 10.6095 0.3900 1.03 0.0776
## alc_pwk1-14 Drinks per week 2.2815 3.7556 -5.1671 9.2548 0.5167 1.00 0.0792
## alc_pwk> 14 Drinks per week 0.5339 3.8327 -7.0285 8.2627 0.9000 1.01 0.0830
##
## Posterior summary of residual std. deviation:
## Mean SD 2.5% 97.5% GR-crit MCE/SD
## sigma_Albumin 6.01 0.341 5.39 6.75 1.01 0.0427
##
##
## MCMC settings:
## Iterations = 101:300
## Sample size per chain = 200
## Thinning interval = 1
## Number of chains = 3
##
## Number of observations: 250
##
##
## Number and proportion of complete cases:
## # %
## lvlone 46 18.4
##
## Number and proportion of missing values:
## # NA % NA
## Age0 0 0.0
## Sex 0 0.0
## race 53 21.2
## DM 66 26.4
## Albumin 71 28.4
## alc_pwk 99 39.6
The Gelman-Rubin criterion and Monte Carlo error to posterior
standard deviation ratio show what we already saw in the trace plot:
most of the parameters have converged and mix nicely, but the parameters
for alc_pwk
have a slightly higher Monte Carlo error ratio
than the other parameters.
Let’s have a look at what might be the issue with
alc_pwk
.
alc_pwk
is a categorical variable with the following
observed frequencies:
table(HCVbase$alc_pwk)
##
## Never used <1 Drink per week 1-14 Drinks per week > 14 Drinks per week
## 7 25 49 70
There are only 7 observations in the “Never used” category, and this is currently the reference category.
Convergence/mixing is often better when a large group is used as the reference category. Following the convention in R, the default in JointAI is to use the first category.
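The convention that the first factor level is the reference category is inherited from base R, where it can be changed with `relevel()`; a quick base-R illustration (with made-up data):

```r
# by default the first (alphabetically ordered) level is the reference
x <- factor(c("Never", "Often", "Often", "Sometimes"))
levels(x)[1]               # "Never"

# make "Often" (the largest group here) the reference level instead
x2 <- relevel(x, ref = "Often")
levels(x2)[1]              # "Often"
```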
We can specify which category should be the reference for some or all
categorical variables in a model using the argument refcats
.
refcats
can be a character string that specifies that
for all categorical variables the "first"
,
"last"
or "largest"
category should be used,
or a named vector that specifies per variable which category should be
used (either via one of the category labels, category indices, or
"first"
, "last"
or "largest"
).
(See also set_refcats()
and the vignette.)
Re-fit `lm1` but use the largest category of `alc_pwk` as reference category. You could use, for example:

- `refcats = "largest"`
- `refcats = c(alc_pwk = "largest")`
- `refcats = c(alc_pwk = "> 14 Drinks per week")`
- `refcats = c(alc_pwk = 4)`
You can use `update()` to re-fit the model so that you only need to specify the additional argument `refcats`.
Check the convergence using a `traceplot()`, and inspect the Gelman-Rubin criterion and Monte Carlo error of the new model.
You can also use the functions `GR_crit()` and `MC_error()`.
lm2 <- update(lm1, refcats = "largest")
traceplot(lm2)
The traces for the different dummy variables of `alc_pwk` no longer look the same!
We can check the Gelman-Rubin criterion and Monte Carlo error
directly using the functions GR_crit()
and
MC_error()
:
GR_crit(lm2)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## (Intercept) 1.002 1.004
## Age0 1.002 1.003
## SexFemale 0.997 0.999
## DMYes 1.002 1.010
## raceOther 1.006 1.017
## raceAsian 0.998 1.002
## alc_pwkNever used 1.018 1.069
## alc_pwk<1 Drink per week 1.030 1.103
## alc_pwk1-14 Drinks per week 0.998 1.001
## sigma_Albumin 0.998 1.000
##
## Multivariate psrf
##
## 1.03
⇨ The Gelman-Rubin criterion is now close to one for all regression coefficients.
By default, the function `coda::gelman.diag()` that is used internally by `GR_crit()` discards the first half of the MCMC chains as burn-in. Because here the chains have converged and we only have very few samples, we could specify to use the whole chains by setting `autoburnin = FALSE`.
MC_error(lm2)
## est MCSE SD MCSE/SD
## (Intercept) 48.567 0.0867 1.93 0.045
## Age0 -0.051 0.0016 0.04 0.041
## SexFemale -1.606 0.0433 1.06 0.041
## DMYes 0.027 0.0560 1.25 0.045
## raceOther -1.890 0.0781 1.65 0.047
## raceAsian 3.444 0.1114 2.06 0.054
## alc_pwkNever used -0.375 0.3246 4.08 0.080
## alc_pwk<1 Drink per week 2.688 0.1437 1.86 0.077
## alc_pwk1-14 Drinks per week 1.499 0.0852 1.39 0.061
## sigma_Albumin 5.993 0.0141 0.33 0.042
⇨ The Monte Carlo error is below/close to 5% for most parameters. To reduce the Monte Carlo error we can increase the number of iterations or use more MCMC chains (in this case probably the better option because we don’t need a long burn-in).
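For intuition about the MCSE/SD ratio: for n independent draws, the Monte Carlo standard error of a posterior mean is SD/sqrt(n), so the ratio is 1/sqrt(n); correlated MCMC draws have a larger error because their effective sample size is smaller than n. A minimal sketch for the independent case:

```r
# For n independent draws, the Monte Carlo standard error of the posterior
# mean is SD/sqrt(n), so the MCSE/SD ratio is 1/sqrt(n).
set.seed(3)
draws <- rnorm(600)                       # e.g., 3 chains x 200 iterations
mcse  <- sd(draws) / sqrt(length(draws))  # naive MCSE for independent draws
mcse / sd(draws)                          # = 1/sqrt(600), approx. 0.041
```

This is close to the MCSE/SD values of roughly 0.04 seen above for the well-mixing parameters; the `alc_pwk` parameters have larger ratios because their draws are more strongly autocorrelated.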
There are currently no functions in JointAI to obtain and/or plot auto- and cross-correlation of the MCMC chains, but we can use the functions provided in the package coda.
A fitted JointAI object has an element MCMC
that is of
class mcmc.list
, the type of object coda
works with.
We could, hence, use `coda::crosscorr(lm2$MCMC)` or `coda::crosscorr.plot(lm2$MCMC)`, plot the auto-correlation using `coda::autocorr.plot(lm2$MCMC)`, or create our own plot:
library("ggplot2")
library("magrittr")

coda::autocorr(lm2$MCMC) %>%
  reshape2::melt() %>%
  subset(., Var2 == Var3 & Var1 != "Lag 0") %>%
  ggplot(aes(x = Var1, y = value, fill = factor(L1))) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap("Var2") +
  scale_fill_discrete(name = "MCMC chain") +
  coord_cartesian(ylim = c(-1, 1)) +
  theme(legend.position = c(0.75, 0.15))
You can obtain the information on which regression coefficient refers
to which covariate term (in which model) via the coef_list
element of a fitted JointAI model:
lm2$coef_list
## $Albumin
## outcome outcat varname coef varnam_print
## 1 Albumin <NA> (Intercept) beta[1] (Intercept)
## 2 Albumin <NA> Age0 beta[2] Age0
## 3 Albumin <NA> SexFemale beta[3] SexFemale
## 4 Albumin <NA> DMYes beta[4] DMYes
## 5 Albumin <NA> raceOther beta[5] raceOther
## 6 Albumin <NA> raceAsian beta[6] raceAsian
## 7 Albumin <NA> alc_pwkNever used beta[7] alc_pwkNever used
## 8 Albumin <NA> alc_pwk<1 Drink per week beta[8] alc_pwk<1 Drink per week
## 9 Albumin <NA> alc_pwk1-14 Drinks per week beta[9] alc_pwk1-14 Drinks per week
##
## $alc_pwk
## outcome outcat varname coef varnam_print
## 1 alc_pwk <NA> Age0 alpha[1] Age0
## 2 alc_pwk <NA> SexFemale alpha[2] SexFemale
## 3 alc_pwk <NA> DMYes alpha[3] DMYes
## 4 alc_pwk <NA> raceOther alpha[4] raceOther
## 5 alc_pwk <NA> raceAsian alpha[5] raceAsian
##
## $DM
## outcome outcat varname coef varnam_print
## 1 DM <NA> (Intercept) alpha[6] (Intercept)
## 2 DM <NA> Age0 alpha[7] Age0
## 3 DM <NA> SexFemale alpha[8] SexFemale
## 4 DM <NA> raceOther alpha[9] raceOther
## 5 DM <NA> raceAsian alpha[10] raceAsian
##
## $race
## outcome outcat varname coef varnam_print
## 1 race raceAsian (Intercept) alpha[14] raceAsian: (Intercept)
## 2 race raceAsian Age0 alpha[15] raceAsian: Age0
## 3 race raceAsian SexFemale alpha[16] raceAsian: SexFemale
## 4 race raceOther (Intercept) alpha[11] raceOther: (Intercept)
## 5 race raceOther Age0 alpha[12] raceOther: Age0
## 6 race raceOther SexFemale alpha[13] raceOther: SexFemale
There are a number of arguments related to settings for the MCMC sampling:

- `n.chains`: the number of MCMC chains (default is 3)
- `n.adapt`: number of iterations in the adaptive phase (default is 100)
- `n.iter`: number of iterations in the sampling phase (default is 0)
- `thin`: thinning interval (default is 1, i.e., no thinning) ⇨ vignette
- `monitor_params`: parameters/nodes to be monitored ⇨ vignette
- `inits`: (optional) initial values ⇨ vignette

To get more MCMC samples, we could increase `n.iter`, and then the sampling would just take longer. Alternatively, we can increase the number of chains (i.e., set `n.chains`).
To save computational time, chains can be run in parallel. JointAI uses the future framework for parallel computation (see also this vignette of the future package). Which type of future can/should be used may depend on the operating system.
library("future")
plan(multisession, workers = 8)
This creates a multisession future with 8
processes. Each chain will then be
run in a separate session. This setting will remain in place for the
entire session, unless we overwrite
it with another call to plan()
(e.g.,
plan(sequential)
).
Re-run lm2
, now with 8 chains sampled in parallel.
You may also explore the trace plot and model summary to see whether the extra MCMC samples have improved the mixing, Gelman-Rubin criterion and Monte Carlo error.
lm3 <- update(lm2, n.chains = 8)
##
## Parallel sampling with 8 workers started (2022-05-10 15:05:33).
It is not possible to get a progress bar when using parallel computation.
traceplot(lm3)
summary(lm3)
##
## Bayesian linear model fitted with JointAI
##
## Call:
## lm_imp(formula = Albumin ~ Age0 + Sex + DM + race + alc_pwk,
## data = HCVbase, n.chains = 8, n.iter = 200, refcats = "largest",
## seed = 1234)
##
##
## Posterior summary:
## Mean SD 2.5% 97.5% tail-prob. GR-crit MCE/SD
## (Intercept) 48.5655 1.9043 44.919 52.2095 0.0000 1.00 0.0271
## Age0 -0.0515 0.0389 -0.127 0.0255 0.1900 1.00 0.0257
## SexFemale -1.5343 1.0319 -3.571 0.4318 0.1450 1.01 0.0266
## DMYes -0.0322 1.2081 -2.328 2.3883 0.9775 1.01 0.0278
## raceOther -1.8667 1.6471 -4.952 1.3973 0.2500 1.01 0.0315
## raceAsian 3.6193 2.0358 -0.602 7.4365 0.0963 1.03 0.0385
## alc_pwkNever used -0.3994 3.9919 -8.100 7.5503 0.9250 1.02 0.0504
## alc_pwk<1 Drink per week 2.7278 1.8208 -0.811 6.2098 0.1175 1.01 0.0414
## alc_pwk1-14 Drinks per week 1.4957 1.3880 -1.250 4.2225 0.2637 1.00 0.0334
##
## Posterior summary of residual std. deviation:
## Mean SD 2.5% 97.5% GR-crit MCE/SD
## sigma_Albumin 5.98 0.348 5.34 6.69 1 0.0289
##
##
## MCMC settings:
## Iterations = 101:300
## Sample size per chain = 200
## Thinning interval = 1
## Number of chains = 8
##
## Number of observations: 250
⇨ Both the Gelman-Rubin criterion and the Monte Carlo error are fine now.
We can get some information on the computational settings via the `comp_info` element of a fitted JointAI object:
lm3$comp_info
## $start_time
## [1] "2022-05-10 15:05:33 CEST"
##
## $duration
## $duration$adapt
## chain1 chain2 chain3 chain4 chain5 chain6
## run 1 11.09768 secs 13.37164 secs 15.35581 secs 17.14637 secs 18.04228 secs 18.94701 secs
## chain7 chain8
## run 1 18.73093 secs 19.02346 secs
##
## $duration$sample
## chain1 chain2 chain3 chain4 chain5 chain6
## run 1 29.67038 secs 31.25177 secs 32.14474 secs 31.11918 secs 30.929 secs 29.81292 secs
## chain7 chain8
## run 1 29.08424 secs 28.78671 secs
##
##
## $JointAI_version
## [1] '1.0.3.9000'
##
## $R_version
## [1] "R version 4.1.3 (2022-03-10)"
##
## $parallel
## [1] TRUE
##
## $workers
## [1] 8
and can get the computational time via
sum_duration()
sum_duration(lm2)
## Time difference of 30.81611 secs
sum_duration(lm3)
## Time difference of 51.1682 secs
The output of lm3$comp_info
will look different in
models fitted with JointAI version ≤ 1.0.3, and the
function sum_duration()
is only available in the
development version available on GitHub.
When models take a bit longer to run we may want to continue sampling
rather than throwing away the MCMC samples we already have. We can do
this with the function add_samples()
.
When using add_samples()
we need to make sure that the
setting for parallel computation in the current
session matches the setting used to
fit the original JointAI object.
Add an extra 100 iterations to `lm2`.

Since `lm2` was run sequentially, you need to first set `plan(sequential)`.

You can check how many iterations the resulting new JointAI object has via the `summary()`, or via the `mcmc_settings` element of the fitted JointAI object.

plan(sequential)
lm4 <- add_samples(lm2, n.iter = 100)
## NOTE: Stopping adaptation
The relevant part of the output of summary()
shows
us:
summary(lm4)
## [...]
##
## MCMC settings:
## Iterations = 101:400
## Sample size per chain = 300
## Thinning interval = 1
## Number of chains = 3
##
## Number of observations: 250
Note that the call is now a list containing the original call and the
call to add_samples()
:
lm4$call
## [[1]]
## lm_imp(formula = Albumin ~ Age0 + Sex + DM + race + alc_pwk,
## data = HCVbase, n.iter = 200, refcats = "largest", seed = 1234)
##
## [[2]]
## add_samples(object = lm2, n.iter = 100)
From the model summary()
:
##
## Bayesian linear model fitted with JointAI
##
## Call:
## list(lm_imp(formula = Albumin ~ Age0 + Sex + DM + race + alc_pwk,
## data = HCVbase, n.iter = 200, refcats = "largest", seed = 1234),
## add_samples(object = lm2, n.iter = 100))
##
## [...]
In the mcmc_settings
element of lm4
we can
also see that the original JointAI object was run for 200 iterations and
add_samples()
was run with 100 iterations:
lm4$mcmc_settings[c("n.adapt", "n.iter", "n.chains", "thin")]
## $n.adapt
## [1] 100
##
## $n.iter
## [1] 200 100
##
## $n.chains
## [1] 3
##
## $thin
## [1] 1 1
The function add_samples()
has an argument
add
which is TRUE
by default, so that the new
MCMC samples are added to the existing ones. When
add = FALSE
, the resulting JointAI object will only contain
the new samples. This is useful to exclude the original samples as
burn-in.
With sequential evaluation, the MCMC samples from multiple chains are
created by one jags
object (returned by package
rjags), but when parallel computation is used a
separate jags
model object is created per chain.
The number of chains in a jags model is fixed, and so when continuing sampling from a JointAI object we have to run it in the same (compatible) type of future.
Specifically:

- When the original model was fitted sequentially:
  - if no `future::plan()` was set in the current session, there is no problem (`add_samples()` runs sequentially);
  - if a parallel `future::plan()` has been called in the current session, we need to re-set the future to sequential using `plan(sequential)`.
- When the original model was fitted in parallel and we want to run `add_samples()` in parallel:
  - if `future::plan()` has not been called or a sequential future is set, `add_samples()` will run sequentially (with a note);
  - if a parallel future is set, `add_samples()` runs in parallel.

JointAI automatically chooses the type of models to
use for (incomplete) covariates based on the type of variable
(factor
, ordered
, numeric
),
number of levels (and the hierarchical level in multi-level data).
We can see the type of model that was used in the models
element of the fitted object:
lm2$models
## Albumin alc_pwk DM race
## "glm_gaussian_identity" "clm" "glm_binomial_logit" "mlogit"
The function list_models()
returns a more elaborate
version:
list_models(lm2)
## Linear model for "Albumin"
## family: gaussian
## link: identity
## * Predictor variables:
## (Intercept), Age0, SexFemale, DMYes, raceOther, raceAsian, alc_pwkNever used, alc_pwk<1
## Drink per week, alc_pwk1-14 Drinks per week
## * Regression coefficients:
## beta[1:9] (normal prior(s) with mean 0 and precision 1e-04)
## * Precision of "Albumin" :
## tau_Albumin (Gamma prior with shape parameter 0.01 and rate parameter 0.01)
##
##
## Cumulative logit model for "alc_pwk"
## * Reference category: "> 14 Drinks per week"
## * Predictor variables:
## Age0, SexFemale, DMYes, raceOther, raceAsian
## * Regression coefficients:
## alpha[1:5] (normal prior(s) with mean 0 and precision 1e-04)
## * Intercepts:
## - Never used: gamma_alc_pwk[1] (normal prior with mean 0 and precision 1e-04)
## - <1 Drink per week: gamma_alc_pwk[2] = gamma_alc_pwk[1] + exp(delta_alc_pwk[1])
## - 1-14 Drinks per week: gamma_alc_pwk[3] = gamma_alc_pwk[2] + exp(delta_alc_pwk[2])
## * Increments:
## delta_alc_pwk[1:2] (normal prior(s) with mean 0 and precision 1e-04)
##
##
## Binomial model for "DM"
## family: binomial
## link: logit
## * Reference category: "No"
## * Predictor variables:
## (Intercept), Age0, SexFemale, raceOther, raceAsian
## * Regression coefficients:
## alpha[6:10] (normal prior(s) with mean 0 and precision 1e-04)
##
##
## Multinomial logit model for "race"
## * Reference category: "Caucasian"
## * Predictor variables:
## (Intercept), Age0, SexFemale
## * Regression coefficients:
## raceOther: alpha[14:16]
## raceAsian: alpha[11:13] (normal prior(s) with mean 0 and precision 1e-04)
The arguments of list_models()
allow you to omit parts of these details.
Available Covariate Model Types

Implemented model types that can be chosen in the argument `models` for baseline covariates (not repeatedly measured) are:

| model type | description |
|---|---|
| `lm` | linear (normal) model with identity link (alternatively: `glm_gaussian_identity`); default for continuous variables |
| `glm_gaussian_log` | linear (normal) model with log link |
| `glm_gaussian_inverse` | linear (normal) model with inverse link |
| `glm_logit` | logistic model for binary data (alternatively: `glm_binomial_logit`); default for binary variables |
| `glm_probit` | probit model for binary data (alternatively: `glm_binomial_probit`) |
| `glm_binomial_log` | binomial model with log link |
| `glm_binomial_cloglog` | binomial model with complementary log-log link |
| `glm_gamma_inverse` | gamma model with inverse link for skewed continuous data |
| `glm_gamma_identity` | gamma model with identity link for skewed continuous data |
| `glm_gamma_log` | gamma model with log link for skewed continuous data |
| `glm_poisson_log` | Poisson model with log link for count data |
| `glm_poisson_identity` | Poisson model with identity link for count data |
| `lognorm` | log-normal model for skewed continuous data |
| `beta` | beta model (with logit link) for skewed continuous data in (0, 1) |
| `mlogit` | multinomial logit model for unordered categorical variables; default for unordered factors with >2 levels |
| `clm` | cumulative logit model for ordered categorical variables; default for ordered factors |
The distributions used in the chosen model types must be appropriate for the respective variables. This means, for instance, that if a covariate has observed values ≤ 0, choosing a log-normal or a Gamma model will result in an error.
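A quick way to check whether a log-normal or Gamma model is admissible for a covariate is to verify that all observed values are strictly positive (shown here with a hypothetical vector):

```r
# hypothetical covariate with a missing value
x <- c(0.8, 1.2, NA, 2.5, 0.4)
all(x > 0, na.rm = TRUE)   # TRUE: a lognorm / Gamma model is admissible

# adding a zero makes these model types invalid for this variable
y <- c(x, 0)
all(y > 0, na.rm = TRUE)   # FALSE
```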
Change the Covariate Model Types
The argument
models
takes a named vector of model types, where the names
are the names of the (incomplete) variables.
When models are specified for only a subset of the variables for which a model is needed, the default model choices (as indicated in the table) are used for the unspecified variables.
Fit a linear regression model for Albumin
with
covariates Age0
, Sex
, DM
and
Creatinin
.
Creatinin
has a skewed distribution and the default
(normal) distribution may not be appropriate to model/impute this
variable.
⇨ Set a more appropriate model type for Creatinin
.
models = c(Creatinin = "<select a model type from the list above>")
.
Given the shape of the histogram for Creatinin
(see the
histogram above) a log-normal model or a Gamma model might be
appropriate.
The histogram of Creatinin
shows the marginal
distribution. More relevant would be the distribution conditional on the
variables in the linear predictor of the model for
Creatinin
, but we don’t know this distribution.
lm5a <- lm_imp(Albumin ~ Age0 + Sex + DM + Creatinin,
               models = c(Creatinin = "lognorm"),
               n.iter = 200, data = HCVbase)
lm5b <- lm_imp(Albumin ~ Age0 + Sex + DM + Creatinin,
               models = c(Creatinin = "glm_gamma_log"),
               n.iter = 200, data = HCVbase)
Of course we would now check the convergence before inspecting the results…
Did you notice that the sampling is faster when using
lognorm
than with glm_gamma_log
?
sum_duration(lm5a)
## Time difference of 3.990817 secs
sum_duration(lm5b)
## Time difference of 14.41182 secs
Different types of models use different samplers in JAGS, resulting in slower or faster sampling.
This has to do with the complexity of the models (also the number of parameters) and the efficiency of the samplers that JAGS chooses for the different nodes.
The two models for Albumin result in very similar estimates. We can compare them, for instance, by plotting the posterior means and credible intervals.
library("magrittr")
library("ggplot2")
rbind(
  data.frame(coef = coef(lm5a)[[1]],
             confint(lm5a)[[1]],
             model = "lognorm", check.names = FALSE) %>%
    tibble::rownames_to_column(var = "variable"),
  data.frame(coef = coef(lm5b)[[1]],
             confint(lm5b)[[1]],
             model = "gamma", check.names = FALSE) %>%
    tibble::rownames_to_column(var = "variable")
) %>%
  ggplot(aes(x = model, y = coef)) +
  geom_point() +
  geom_errorbar(aes(ymin = `2.5%`, ymax = `97.5%`)) +
  facet_wrap("variable", scales = "free")
Sometimes, the linear association with the response that is implied by just adding a covariate to the linear predictor is not appropriate.
As in other packages for regression, we can specify functions inside a model formula or use interactions of variables.
An important difference to other packages is that when functions/interactions involve an incomplete variable, the function or interaction has to be computed in JAGS, using the imputed value. This limits the functionality to functions that are directly available in JAGS, or have been implemented. This is the case for many standard functions such as

- `log()`, `exp()`
- `sin()`, `cos()`, …
- `abs()`, `sqrt()`
- `I(...^2)`

Functions may also involve multiple (incomplete) variables.
Not yet available are splines for incomplete variables (e.g., `bs()` and `ns()`).
For interactions, the standard syntax with *
,
:
or ^
is used.
See the vignette for more explanations and examples.
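The same formula syntax can be tried out with `lm()` on complete data; for instance, a quadratic term via `I()` and an interaction via `*` (using the built-in `mtcars` data, unrelated to this practical):

```r
# quadratic term via I()
fit_sq  <- lm(mpg ~ wt + I(wt^2), data = mtcars)

# main effects of wt and hp plus their interaction via *
fit_int <- lm(mpg ~ wt * hp, data = mtcars)
names(coef(fit_int))  # "(Intercept)" "wt" "hp" "wt:hp"
```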
We now want to add BMI as covariate to the model for
Albumin
. BMI
can be calculated from
Height
and Weight
as \(\frac{\texttt{Weight}}{(\texttt{Height}/100)^2}\).
There are some patients for whom only either Height
or
Weight
is available:
md_pattern(subset(datHCV, select = c(Height, Weight, BMI)))
If we directly use `BMI` in the model, we lose the partial information we have for these patients.
Specify a linear model for Albumin
with covariates
Age0
, Sex
, DM
,
Creatinin
and BMI, in which BMI is calculated from
Height
and Weight
.
To specify the model, we can just include the formula for the BMI, wrapped in an `I()` function, in the model formula:
lm6 <- lm_imp(Albumin ~ Age0 + Sex + DM + Creatinin + I(Weight/(Height/100)^2),
              data = HCVbase, models = c(Creatinin = "lognorm"))
##
## Note: No MCMC sample will be created when n.iter is set to 0.
## Error in rjags::jags.model(file = modelfile, data = data_list, inits = inits, :
## Error in node M_lvlone[6,10]
## Invalid parent values
But this results in an error! What is going wrong?
Since we haven’t specified model types for Height
and
Weight
, the default, i.e., a normal model, is chosen. This
means that imputed values for Height
could be any value on
the real line. This includes the value 0 which would result in division
by 0, causing the error.
Note that even if all observed values of Height
are far
away from zero, and probably none of the imputed values would ever be
exactly zero, we still get this error.
The error message says that the node `M_lvlone[6,10]` has invalid parent values.
M_lvlone
is the name of the design matrix containing the
data. (All design matrices are called M
;
lvlone
refers to the level of the particular variable.)
The design matrix is stored in the `data_list` element of the returned object.
head(lm6$data_list$M_lvlone)
## Albumin Creatinin DM Height Weight (Intercept) Age0 SexFemale DMYes
## 960.1 57.92853 70.29047 0 180.6634 63.25236 1 24.02857 0 NA
## 512.112 NA 35.83949 0 164.2964 87.17301 1 35.28016 0 NA
## 704.445 47.74663 NA 0 168.6981 85.74012 1 42.65734 1 NA
## 960.889 52.02988 32.21713 0 190.5249 86.73939 1 49.15190 0 NA
## 671 40.72937 NA 0 155.2527 NA 1 47.18493 1 NA
## 936.46 45.86362 42.20511 NA NA 50.48447 1 22.20862 0 NA
## I(Weight/(Height/100)^2)
## 960.1 NA
## 512.112 NA
## 704.445 NA
## 960.889 NA
## 671 NA
## 936.46 NA
Column 10 contains the (to be) calculated BMI, and row 6 is a row in which the value for `Height` is missing.
`Height` is being imputed using the default normal model, which would allow for the value 0, which causes the error because we divide by it.
To avoid this problem we could either

1. use a model type with a distribution for `Height` that excludes the value 0 (e.g., a Gamma or log-normal model), or
2. truncate the (default) normal distribution so that it excludes the value 0.

Option 2 should only be used when the likely imputed values are far enough away from zero, so that the truncation does not actually impact the imputed values.
We can specify truncation using the argument `trunc`:

lm6 <- lm_imp(Albumin ~ Age0 + Sex + DM + Creatinin + I(Weight/(Height/100)^2),
              data = HCVbase, models = c(Creatinin = "lognorm"),
              trunc = list(Height = c(1e-5, NA)))
##
## Note: No MCMC sample will be created when n.iter is set to 0.
In contrast to, for instance, the mice package,
JointAI uses by default only the variables that appear
in the model formula and ignores all other columns in the input
data
.
It is possible to specify that some variables should be used as auxiliary variables. Auxiliary variables are used as covariates in the models for (incomplete) covariates, but not in the analysis model.
The argument auxvars
takes a one-sided (RHS)
formula.
`Alcoholabuse` was originally obtained from `alc_pwk` and additional information that is not included in this data. There are more missing values in `alc_pwk` than in `Alcoholabuse`.
⇨ `Alcoholabuse` is strongly related to `alc_pwk` and may provide relevant information when imputing `alc_pwk`.
Update our previous model lm1
(it had the model formula
Albumin ~ Age0 + Sex + DM + race + alc_pwk
), so that
Alcoholabuse
is used as auxiliary variable for the
imputation.
Use list_models()
to see in which sub-models
Alcoholabuse
was included.
(Note: I set n.iter = 0
and n.adapt = 0
here, i.e., no MCMC samples, to avoid unnecessary computational
time.)
lm7 <- update(lm1, auxvars = ~ Alcoholabuse, n.iter = 0, n.adapt = 0)
##
## Note: No MCMC sample will be created when n.iter is set to 0.
list_models(lm7, priors = FALSE, regcoef = FALSE, otherpars = FALSE)
## Linear model for "Albumin"
## family: gaussian
## link: identity
## * Predictor variables:
## (Intercept), Age0, SexFemale, DMYes, raceOther, raceAsian, alc_pwk<1 Drink per week,
## alc_pwk1-14 Drinks per week, alc_pwk> 14 Drinks per week
##
##
## Cumulative logit model for "alc_pwk"
## * Reference category: "Never used"
## * Predictor variables:
## Age0, SexFemale, DMYes, raceOther, raceAsian, AlcoholabuseYes
##
##
## Binomial model for "DM"
## family: binomial
## link: logit
## * Reference category: "No"
## * Predictor variables:
## (Intercept), Age0, SexFemale, raceOther, raceAsian, AlcoholabuseYes
##
##
## Multinomial logit model for "race"
## * Reference category: "Caucasian"
## * Predictor variables:
## (Intercept), Age0, SexFemale, AlcoholabuseYes
##
##
## Binomial model for "Alcoholabuse"
## family: binomial
## link: logit
## * Reference category: "No"
## * Predictor variables:
## (Intercept), Age0, SexFemale
`Alcoholabuse` is included in all covariate models, because the variable has fewer missing values than all other incomplete covariates.
It is also possible to use functions of variables as auxiliary variables:
lm7b <- update(lm1, auxvars = ~ Alcoholabuse + I(Age0^2),
               n.iter = 0, n.adapt = 0)
##
## Note: No MCMC sample will be created when n.iter is set to 0.
list_models(lm7b, priors = FALSE, regcoef = FALSE, otherpars = FALSE)
## Linear model for "Albumin"
## family: gaussian
## link: identity
## * Predictor variables:
## (Intercept), Age0, SexFemale, DMYes, raceOther, raceAsian, alc_pwk<1 Drink per week,
## alc_pwk1-14 Drinks per week, alc_pwk> 14 Drinks per week
##
##
## Cumulative logit model for "alc_pwk"
## * Reference category: "Never used"
## * Predictor variables:
## Age0, SexFemale, DMYes, raceOther, raceAsian, AlcoholabuseYes, I(Age0^2)
##
##
## Binomial model for "DM"
## family: binomial
## link: logit
## * Reference category: "No"
## * Predictor variables:
## (Intercept), Age0, SexFemale, raceOther, raceAsian, AlcoholabuseYes, I(Age0^2)
##
##
## Multinomial logit model for "race"
## * Reference category: "Caucasian"
## * Predictor variables:
## (Intercept), Age0, SexFemale, AlcoholabuseYes, I(Age0^2)
##
##
## Binomial model for "Alcoholabuse"
## family: binomial
## link: logit
## * Reference category: "No"
## * Predictor variables:
## (Intercept), Age0, SexFemale, I(Age0^2)
The analysis model only includes Age0
, but the covariate
models use Age0
as well as I(Age0^2)
.
See also the vignette for more explanation / examples.