Software
The output in this practical was
generated using R version 4.2.1 (2022-06-23 ucrt) and
JointAI version 1.0.3.9000 (with rjags
version 4.13 and JAGS version 4.3.1).
load("Data/datHCV.RData")
We will work with a simulated dataset, `datHCV`, that is based on a cohort of patients with chronic hepatitis C:

- 250 patients (`id`) from 10 hospitals (`center`)
- 985 rows in the data
- potential additional grouping factor `Genotype` (crossed with `id`)
- 4 repeatedly measured variables: `logCreatinin`, `Albumin`, `logFIB4`, `Creatinin`
- time variable: `time`
- 10 baseline covariates: `Age0`, `Sex`, `alc_pwk`, `AntiHBc`, `DM`, `race`, `Weight`, `Height`, `Alcoholabuse`, `BMI`
- time-to-event outcome: `event` (values: `FALSE`, `TRUE`) and corresponding event/censoring time `etime`
We can visualize the variables in the data using the function
plot_all()
(⇨
vignette). The argument idvars
allows us to specify
variables that define a hierarchical structure. The plot then shows the
observed frequencies on that level (e.g., the subject-specific variables
are counted only once per subject).
library("JointAI")
par(mar = c(3, 3, 2, 1), mgp = c(2, 0.6, 0))
plot_all(datHCV, idvars = c("id", "center", "Genotype"), breaks = 50)
The missing data pattern of the data can be obtained and plotted
using md_pattern()
(⇨
vignette). Due to the large number of patterns arising when
considering all variables we produce separate plots for the time-varying
variables and the baseline variables.
Missing data pattern of the time-constant variables:
md_pattern(subset(datHCV, select = -c(logFIB4, Albumin, logCreatinin, time)))
Missing data pattern of the longitudinal variables:
md_pattern(subset(datHCV, select = c(logFIB4, Albumin, logCreatinin, time)))
Preparations
To get to know the syntax used in
JointAI, we start by fitting linear regression models
to the example data.
First, we create a version of the data that only contains the observations at baseline:
HCVbase <- subset(datHCV, time == 0)
The specification of models in JointAI is kept as similar as possible to the specification using the functions `lm()` or `glm()` for complete data. The corresponding functions in JointAI are `lm_imp()` and `glm_imp()`.
The main functions in JointAI have the structure `*_imp()`; currently available are:

Generalized linear models (and extensions)
- `lm_imp()`: linear regression
- `glm_imp()`: generalized linear regression
- `lognorm_imp()`: log-normal regression
- `betareg_imp()`: regression using a beta-distribution
- `clm_imp()`: ordinal (cumulative logit) regression
- `mlogit_imp()`: multinomial regression

Mixed models
- `lme_imp()`, `lmer_imp()`: linear mixed model
- `glme_imp()`, `glmer_imp()`: generalized linear mixed model
- `betamm_imp()`: beta mixed model
- `lognormmm_imp()`: log-normal mixed model
- `clmm_imp()`: cumulative logit mixed model
- `mlogitmm_imp()`: multinomial logit mixed model

Survival models
- `coxph_imp()`: proportional hazards model (with B-spline baseline hazard)
- `survreg_imp()`: parametric (Weibull) model
- `JM_imp()`: joint model for longitudinal and survival data

Important arguments that are known from other functions/packages are
- `formula`
- `data`
- `family` (⇨ Examples / Documentation)
- `seed`
The only additional argument that has to be specified is
n.iter
, which determines the number of iterations of the
MCMC sampler.
By default, n.iter = 0
, so that no MCMC sample is
generated. This is to avoid unintentionally long-running models. Only
the adaptive phase is then run (default: n.adapt = 100
,
more details later).
Fit a linear regression model to investigate the association between a subject’s `Albumin` (at baseline) and the covariates `Age0`, `Sex`, `race`, `DM`, and `alc_pwk`. For now, a small number of iterations, say 200, should be sufficient.

- Use `lm_imp()`.
- Specify the arguments `formula`, `data`, and `n.iter` (and optionally the `seed`).
lm1 <- lm_imp(Albumin ~ Age0 + Sex + DM + race + alc_pwk,
              data = HCVbase, n.iter = 200, seed = 1234)
After fitting the model, convergence of the MCMC chains has to be
confirmed. This can be done using a trace plot, i.e., with the help of
the function traceplot()
.
The function densplot()
shows us the posterior
densities.
Investigate the trace plot and density plot for model
lm1
.
traceplot(lm1)
The chains seem to have converged, but on close inspection we can see
that the trace plots for the three dummies for alc_pwk
look
very similar.
densplot(lm1)
The functions traceplot()
and densplot()
have some more options to change how things look, including the use of
ggplot2 instead of using base
graphics.
You can find out more in the help of the functions and in the vignette.
When chains have converged, we can obtain the model
summary()
(⇨
vignette).
The function `summary()` returns a summary of the posterior distribution, (by default) consisting of the posterior mean, standard deviation, the 2.5% and 97.5% quantiles, the tail probability, the Gelman-Rubin criterion, and the ratio of the Monte Carlo error to the posterior standard deviation.
The tail probability is calculated as \(2 \times \min\{Pr(\theta > 0), Pr(\theta < 0)\}\), where \(\theta\) is the parameter of interest. It is a measure of how likely the value 0 is under the estimated posterior distribution. The figure visualizes three examples of posterior distributions and the corresponding minimum of \(Pr(\theta > 0)\) and \(Pr(\theta < 0)\) (shaded area):
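As a standalone illustration (not JointAI code), the tail probability can be estimated from a vector of posterior draws by the corresponding sample proportions; here simulated normal draws stand in for an MCMC sample:

```r
# simulated draws standing in for an MCMC sample of a parameter theta
set.seed(1)
theta <- rnorm(10000, mean = 1.5, sd = 1)

# 2 * min(Pr(theta > 0), Pr(theta < 0)), estimated by sample proportions
tail_prob <- 2 * min(mean(theta > 0), mean(theta < 0))
tail_prob
```

A posterior centred far away from 0 gives a small tail probability; one centred near 0 gives a value close to 1.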
The Gelman-Rubin criterion (aka R-hat or “potential scale reduction factor”) uses the within-chain and between-chain variation to evaluate whether the chains have converged to the same distribution.
`GR_crit()` internally uses `coda::gelman.diag()`.
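To see what `coda::gelman.diag()` computes, we can apply it outside of JointAI to three hand-made chains of independent draws (which should give values close to 1):

```r
library("coda")

# three artificial, well-mixed 'chains' of independent standard-normal draws
set.seed(2)
chains <- mcmc.list(lapply(1:3, function(i) mcmc(rnorm(500))))

# potential scale reduction factor; values close to 1 indicate convergence
gelman.diag(chains)
```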
It also contains information on

- the MCMC chains,
- the data, and
- (with `missinfo = TRUE`) the amount of missing values.

Obtain the `summary()` of `lm1` and inspect it.
summary(lm1, missinfo = TRUE)
##
## Bayesian linear model fitted with JointAI
##
## Call:
## lm_imp(formula = Albumin ~ Age0 + Sex + DM + race + alc_pwk,
## data = HCVbase, n.iter = 200, seed = 1234)
##
##
## Posterior summary:
## Mean SD 2.5% 97.5% tail-prob. GR-crit MCE/SD
## (Intercept) 48.0372 4.2296 39.7040 56.5031 0.0000 1.01 0.0762
## Age0 -0.0536 0.0391 -0.1301 0.0235 0.1533 1.02 0.0408
## SexFemale -1.6004 1.0402 -3.6236 0.3323 0.1333 1.01 0.0408
## DMYes 0.0751 1.2443 -2.4021 2.5021 0.9633 1.01 0.0453
## raceOther -1.7816 1.6269 -4.7840 1.3969 0.3000 1.00 0.0490
## raceAsian 3.7429 1.9834 -0.0554 7.6367 0.0533 1.02 0.0523
## alc_pwk<1 Drink per week 3.2305 3.9734 -4.4354 10.6095 0.3900 1.03 0.0776
## alc_pwk1-14 Drinks per week 2.2815 3.7556 -5.1671 9.2548 0.5167 1.00 0.0792
## alc_pwk> 14 Drinks per week 0.5339 3.8327 -7.0285 8.2627 0.9000 1.01 0.0830
##
## Posterior summary of residual std. deviation:
## Mean SD 2.5% 97.5% GR-crit MCE/SD
## sigma_Albumin 6.01 0.341 5.39 6.75 1.01 0.0427
##
##
## MCMC settings:
## Iterations = 101:300
## Sample size per chain = 200
## Thinning interval = 1
## Number of chains = 3
##
## Number of observations: 250
##
##
## Number and proportion of complete cases:
## # %
## lvlone 46 18.4
##
## Number and proportion of missing values:
## # NA % NA
## Age0 0 0.0
## Sex 0 0.0
## race 53 21.2
## DM 66 26.4
## Albumin 71 28.4
## alc_pwk 99 39.6
The Gelman-Rubin criterion and Monte Carlo error to posterior
standard deviation ratio show what we already saw in the trace plot:
most of the parameters have converged and mix nicely, but the parameters
for alc_pwk
have a slightly higher Monte Carlo error ratio
than the other parameters.
Let’s have a look at what might be the issue with
alc_pwk
.
alc_pwk
is a categorical variable with the following
observed frequencies:
table(HCVbase$alc_pwk)
##
## Never used <1 Drink per week 1-14 Drinks per week > 14 Drinks per week
## 7 25 49 70
There are only 7 observations in the “Never used” category, and this is currently the reference category.
Convergence/mixing is often better when a large group is used as the reference category. Following the convention in R, the default in JointAI is to use the first category.
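The convention that the first factor level is the reference category is inherited from base R, where it can be changed with `relevel()`; a quick base-R illustration (with made-up data):

```r
# by default the first (alphabetically ordered) level is the reference
x <- factor(c("Never", "Often", "Often", "Sometimes"))
levels(x)[1]               # "Never"

# make "Often" (the largest group here) the reference level instead
x2 <- relevel(x, ref = "Often")
levels(x2)[1]              # "Often"
```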
We can specify which category should be the reference for some or all
categorical variables in a model using the argument refcats
.
refcats
can be a character string that specifies that
for all categorical variables the "first"
,
"last"
or "largest"
category should be used,
or a named vector that specifies per variable which category should be
used (either via one of the category labels, category indices, or
"first"
, "last"
or "largest"
).
(See also set_refcats()
and the vignette.)
Re-fit `lm1` but use the largest category of `alc_pwk` as reference category. You could use, for example:

- `refcats = "largest"`
- `refcats = c(alc_pwk = "largest")`
- `refcats = c(alc_pwk = "> 14 Drinks per week")`
- `refcats = c(alc_pwk = 4)`
You can use `update()` to re-fit the model so that you only need to specify the additional argument `refcats`.
Check the convergence using a `traceplot()`, and inspect the Gelman-Rubin criterion and Monte Carlo error of the new model.
You can also use the functions `GR_crit()` and `MC_error()`.
lm2 <- update(lm1, refcats = "largest")
traceplot(lm2)
The traces for the different dummy variables of `alc_pwk` no longer look the same!
We can check the Gelman-Rubin criterion and Monte Carlo error
directly using the functions GR_crit()
and
MC_error()
:
GR_crit(lm2)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## (Intercept) 1.002 1.004
## Age0 1.002 1.003
## SexFemale 0.997 0.999
## DMYes 1.002 1.010
## raceOther 1.006 1.017
## raceAsian 0.998 1.002
## alc_pwkNever used 1.018 1.069
## alc_pwk<1 Drink per week 1.030 1.103
## alc_pwk1-14 Drinks per week 0.998 1.001
## sigma_Albumin 0.998 1.000
##
## Multivariate psrf
##
## 1.03
⇨ The Gelman-Rubin criterion is now close to one for all regression coefficients.
By default, the function `coda::gelman.diag()` that is used internally by `GR_crit()` discards the first half of the MCMC chains as burn-in. Because here the chains have converged and we only have very few samples, we could specify to use the whole chains by setting `autoburnin = FALSE`.
MC_error(lm2)
## est MCSE SD MCSE/SD
## (Intercept) 48.567 0.0867 1.93 0.045
## Age0 -0.051 0.0016 0.04 0.041
## SexFemale -1.606 0.0433 1.06 0.041
## DMYes 0.027 0.0560 1.25 0.045
## raceOther -1.890 0.0781 1.65 0.047
## raceAsian 3.444 0.1114 2.06 0.054
## alc_pwkNever used -0.375 0.3246 4.08 0.080
## alc_pwk<1 Drink per week 2.688 0.1437 1.86 0.077
## alc_pwk1-14 Drinks per week 1.499 0.0852 1.39 0.061
## sigma_Albumin 5.993 0.0141 0.33 0.042
⇨ The Monte Carlo error is below/close to 5% for most parameters. To reduce the Monte Carlo error we can increase the number of iterations or use more MCMC chains (in this case probably the better option because we don’t need a long burn-in).
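For intuition about the MCSE/SD ratio: for n independent draws, the Monte Carlo standard error of a posterior mean is SD/sqrt(n), so the ratio is 1/sqrt(n); correlated MCMC draws have a larger error because their effective sample size is smaller than n. A minimal sketch for the independent case:

```r
# For n independent draws, the Monte Carlo standard error of the posterior
# mean is SD/sqrt(n), so the MCSE/SD ratio is 1/sqrt(n).
set.seed(3)
draws <- rnorm(600)                       # e.g., 3 chains x 200 iterations
mcse  <- sd(draws) / sqrt(length(draws))  # naive MCSE for independent draws
mcse / sd(draws)                          # = 1/sqrt(600), approx. 0.041
```

This is close to the MCSE/SD values of roughly 0.04 seen above for the well-mixing parameters; the `alc_pwk` parameters have larger ratios because their draws are more strongly autocorrelated.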
There are currently no functions in JointAI to obtain and/or plot auto- and cross-correlation of the MCMC chains, but we can use the functions provided in the package coda.
A fitted JointAI object has an element MCMC
that is of
class mcmc.list
, the type of object coda
works with.
We could, hence, use `coda::crosscorr(lm2$MCMC)` or `coda::crosscorr.plot(lm2$MCMC)`, plot the auto-correlation using `coda::autocorr.plot(lm2$MCMC)`, or create our own plot:
library("ggplot2")
library("magrittr")

coda::autocorr(lm2$MCMC) %>%
  reshape2::melt() %>%
  subset(., Var2 == Var3 & Var1 != "Lag 0") %>%
  ggplot(aes(x = Var1, y = value, fill = factor(L1))) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap("Var2") +
  scale_fill_discrete(name = "MCMC chain") +
  coord_cartesian(ylim = c(-1, 1)) +
  theme(legend.position = c(0.75, 0.15))
You can obtain the information on which regression coefficient refers
to which covariate term (in which model) via the coef_list
element of a fitted JointAI model:
lm2$coef_list
## $Albumin
## outcome outcat varname coef varnam_print
## 1 Albumin <NA> (Intercept) beta[1] (Intercept)
## 2 Albumin <NA> Age0 beta[2] Age0
## 3 Albumin <NA> SexFemale beta[3] SexFemale
## 4 Albumin <NA> DMYes beta[4] DMYes
## 5 Albumin <NA> raceOther beta[5] raceOther
## 6 Albumin <NA> raceAsian beta[6] raceAsian
## 7 Albumin <NA> alc_pwkNever used beta[7] alc_pwkNever used
## 8 Albumin <NA> alc_pwk<1 Drink per week beta[8] alc_pwk<1 Drink per week
## 9 Albumin <NA> alc_pwk1-14 Drinks per week beta[9] alc_pwk1-14 Drinks per week
##
## $alc_pwk
## outcome outcat varname coef varnam_print
## 1 alc_pwk <NA> Age0 alpha[1] Age0
## 2 alc_pwk <NA> SexFemale alpha[2] SexFemale
## 3 alc_pwk <NA> DMYes alpha[3] DMYes
## 4 alc_pwk <NA> raceOther alpha[4] raceOther
## 5 alc_pwk <NA> raceAsian alpha[5] raceAsian
##
## $DM
## outcome outcat varname coef varnam_print
## 1 DM <NA> (Intercept) alpha[6] (Intercept)
## 2 DM <NA> Age0 alpha[7] Age0
## 3 DM <NA> SexFemale alpha[8] SexFemale
## 4 DM <NA> raceOther alpha[9] raceOther
## 5 DM <NA> raceAsian alpha[10] raceAsian
##
## $race
## outcome outcat varname coef varnam_print
## 1 race raceAsian (Intercept) alpha[14] raceAsian: (Intercept)
## 2 race raceAsian Age0 alpha[15] raceAsian: Age0
## 3 race raceAsian SexFemale alpha[16] raceAsian: SexFemale
## 4 race raceOther (Intercept) alpha[11] raceOther: (Intercept)
## 5 race raceOther Age0 alpha[12] raceOther: Age0
## 6 race raceOther SexFemale alpha[13] raceOther: SexFemale
There are a number of arguments related to settings for the MCMC sampling:

- `n.chains`: the number of MCMC chains (default is 3)
- `n.adapt`: number of iterations in the adaptive phase (default is 100)
- `n.iter`: number of iterations in the sampling phase (default is 0)
- `thin`: thinning interval (default is 1, i.e., no thinning) ⇨ vignette
- `monitor_params`: parameters/nodes to be monitored ⇨ vignette
- `inits`: (optional) initial values ⇨ vignette

To get more MCMC samples, we could increase `n.iter`, and then the sampling would just take longer. Alternatively, we can increase the number of chains (i.e., set `n.chains`).
To save computational time, chains can be run in parallel. JointAI uses the future framework for parallel computation (see also this vignette of the future package). Which type of future can/should be used may depend on the operating system.
library("future")
plan(multisession, workers = 8)
This creates a multisession future with 8
processes. Each chain will then be
run in a separate session. This setting will remain in place for the
entire session, unless we overwrite
it with another call to plan()
(e.g.,
plan(sequential)
).
Re-run lm2
, now with 8 chains sampled in parallel.
You may also explore the trace plot and model summary to see whether the extra MCMC samples have improved the mixing, Gelman-Rubin criterion and Monte Carlo error.
lm3 <- update(lm2, n.chains = 8)
##
## Parallel sampling with 8 workers started (2022-05-10 15:05:33).
It is not possible to get a progress bar when using parallel computation.
traceplot(lm3)
summary(lm3)
##
## Bayesian linear model fitted with JointAI
##
## Call:
## lm_imp(formula = Albumin ~ Age0 + Sex + DM + race + alc_pwk,
## data = HCVbase, n.chains = 8, n.iter = 200, refcats = "largest",
## seed = 1234)
##
##
## Posterior summary:
## Mean SD 2.5% 97.5% tail-prob. GR-crit MCE/SD
## (Intercept) 48.5655 1.9043 44.919 52.2095 0.0000 1.00 0.0271
## Age0 -0.0515 0.0389 -0.127 0.0255 0.1900 1.00 0.0257
## SexFemale -1.5343 1.0319 -3.571 0.4318 0.1450 1.01 0.0266
## DMYes -0.0322 1.2081 -2.328 2.3883 0.9775 1.01 0.0278
## raceOther -1.8667 1.6471 -4.952 1.3973 0.2500 1.01 0.0315
## raceAsian 3.6193 2.0358 -0.602 7.4365 0.0963 1.03 0.0385
## alc_pwkNever used -0.3994 3.9919 -8.100 7.5503 0.9250 1.02 0.0504
## alc_pwk<1 Drink per week 2.7278 1.8208 -0.811 6.2098 0.1175 1.01 0.0414
## alc_pwk1-14 Drinks per week 1.4957 1.3880 -1.250 4.2225 0.2637 1.00 0.0334
##
## Posterior summary of residual std. deviation:
## Mean SD 2.5% 97.5% GR-crit MCE/SD
## sigma_Albumin 5.98 0.348 5.34 6.69 1 0.0289
##
##
## MCMC settings:
## Iterations = 101:300
## Sample size per chain = 200
## Thinning interval = 1
## Number of chains = 8
##
## Number of observations: 250
⇨ Both the Gelman-Rubin criterion and the Monte Carlo error are fine now.
We can get some information on the computational settings via the `comp_info` element of a fitted JointAI object:
lm3$comp_info
## $start_time
## [1] "2022-05-10 15:05:33 CEST"
##
## $duration
## $duration$adapt
## chain1 chain2 chain3 chain4 chain5 chain6
## run 1 11.09768 secs 13.37164 secs 15.35581 secs 17.14637 secs 18.04228 secs 18.94701 secs
## chain7 chain8
## run 1 18.73093 secs 19.02346 secs
##
## $duration$sample
## chain1 chain2 chain3 chain4 chain5 chain6
## run 1 29.67038 secs 31.25177 secs 32.14474 secs 31.11918 secs 30.929 secs 29.81292 secs
## chain7 chain8
## run 1 29.08424 secs 28.78671 secs
##
##
## $JointAI_version
## [1] '1.0.3.9000'
##
## $R_version
## [1] "R version 4.1.3 (2022-03-10)"
##
## $parallel
## [1] TRUE
##
## $workers
## [1] 8
and can get the computational time via
sum_duration()
sum_duration(lm2)
## Time difference of 30.81611 secs
sum_duration(lm3)
## Time difference of 51.1682 secs
The output of lm3$comp_info
will look different in
models fitted with JointAI version ≤ 1.0.3, and the
function sum_duration()
is only available in the
development version available on GitHub.
When models take a bit longer to run we may want to continue sampling
rather than throwing away the MCMC samples we already have. We can do
this with the function add_samples()
.
When using add_samples()
we need to make sure that the
setting for parallel computation in the current
session matches the setting used to
fit the original JointAI object.
Add an extra 100 iterations to `lm2`.

Since `lm2` was run sequentially, you need to first set `plan(sequential)`.

You can check how many iterations the resulting new JointAI object has via the `summary()`, or via the `mcmc_settings` element of the fitted JointAI object.

plan(sequential)
lm4 <- add_samples(lm2, n.iter = 100)
## NOTE: Stopping adaptation
The relevant part of the output of summary()
shows
us:
summary(lm4)
## [...]
##
## MCMC settings:
## Iterations = 101:400
## Sample size per chain = 300
## Thinning interval = 1
## Number of chains = 3
##
## Number of observations: 250
Note that the call is now a list containing the original call and the
call to add_samples()
:
lm4$call
## [[1]]
## lm_imp(formula = Albumin ~ Age0 + Sex + DM + race + alc_pwk,
## data = HCVbase, n.iter = 200, refcats = "largest", seed = 1234)
##
## [[2]]
## add_samples(object = lm2, n.iter = 100)
From the model summary()
:
##
## Bayesian linear model fitted with JointAI
##
## Call:
## list(lm_imp(formula = Albumin ~ Age0 + Sex + DM + race + alc_pwk,
## data = HCVbase, n.iter = 200, refcats = "largest", seed = 1234),
## add_samples(object = lm2, n.iter = 100))
##
## [...]
In the mcmc_settings
element of lm4
we can
also see that the original JointAI object was run for 200 iterations and
add_samples()
was run with 100 iterations:
lm4$mcmc_settings[c("n.adapt", "n.iter", "n.chains", "thin")]
## $n.adapt
## [1] 100
##
## $n.iter
## [1] 200 100
##
## $n.chains
## [1] 3
##
## $thin
## [1] 1 1
The function add_samples()
has an argument
add
which is TRUE
by default, so that the new
MCMC samples are added to the existing ones. When
add = FALSE
, the resulting JointAI object will only contain
the new samples. This is useful to exclude the original samples as
burn-in.
With sequential evaluation, the MCMC samples from multiple chains are
created by one jags
object (returned by package
rjags), but when parallel computation is used a
separate jags
model object is created per chain.
The number of chains in a jags model is fixed, and so when continuing sampling from a JointAI object we have to run it in the same (compatible) type of future.
Specifically:

- When the original model was fitted sequentially:
  - if no `future::plan()` was set in the current session, there is no problem (`add_samples()` runs sequentially);
  - if a parallel `future::plan()` has been called in the current session, we need to re-set the future to sequential using `plan(sequential)`.
- When the original model was fitted in parallel and we want to run `add_samples()` in parallel:
  - if `future::plan()` has not been called or a sequential future is set, `add_samples()` will run sequentially (with a note);
  - if a parallel future is set, `add_samples()` runs in parallel.

JointAI automatically chooses the type of models to
use for (incomplete) covariates based on the type of variable
(factor
, ordered
, numeric
),
number of levels (and the hierarchical level in multi-level data).
We can see the type of model that was used in the models
element of the fitted object:
lm2$models
## Albumin alc_pwk DM race
## "glm_gaussian_identity" "clm" "glm_binomial_logit" "mlogit"
The function list_models()
returns a more elaborate
version:
list_models(lm2)
## Linear model for "Albumin"
## family: gaussian
## link: identity
## * Predictor variables:
## (Intercept), Age0, SexFemale, DMYes, raceOther, raceAsian, alc_pwkNever used, alc_pwk<1
## Drink per week, alc_pwk1-14 Drinks per week
## * Regression coefficients:
## beta[1:9] (normal prior(s) with mean 0 and precision 1e-04)
## * Precision of "Albumin" :
## tau_Albumin (Gamma prior with shape parameter 0.01 and rate parameter 0.01)
##
##
## Cumulative logit model for "alc_pwk"
## * Reference category: "> 14 Drinks per week"
## * Predictor variables:
## Age0, SexFemale, DMYes, raceOther, raceAsian
## * Regression coefficients:
## alpha[1:5] (normal prior(s) with mean 0 and precision 1e-04)
## * Intercepts:
## - Never used: gamma_alc_pwk[1] (normal prior with mean 0 and precision 1e-04)
## - <1 Drink per week: gamma_alc_pwk[2] = gamma_alc_pwk[1] + exp(delta_alc_pwk[1])
## - 1-14 Drinks per week: gamma_alc_pwk[3] = gamma_alc_pwk[2] + exp(delta_alc_pwk[2])
## * Increments:
## delta_alc_pwk[1:2] (normal prior(s) with mean 0 and precision 1e-04)
##
##
## Binomial model for "DM"
## family: binomial
## link: logit
## * Reference category: "No"
## * Predictor variables:
## (Intercept), Age0, SexFemale, raceOther, raceAsian
## * Regression coefficients:
## alpha[6:10] (normal prior(s) with mean 0 and precision 1e-04)
##
##
## Multinomial logit model for "race"
## * Reference category: "Caucasian"
## * Predictor variables:
## (Intercept), Age0, SexFemale
## * Regression coefficients:
## raceOther: alpha[14:16]
## raceAsian: alpha[11:13] (normal prior(s) with mean 0 and precision 1e-04)
The arguments of list_models()
allow you to omit parts of these details.
Available Covariate Model Types

Implemented model types that can be chosen in the argument `models` for baseline covariates (not repeatedly measured) are:

| model type | description |
|---|---|
| `lm` | linear (normal) model with identity link (alternatively: `glm_gaussian_identity`); default for continuous variables |
| `glm_gaussian_log` | linear (normal) model with log link |
| `glm_gaussian_inverse` | linear (normal) model with inverse link |
| `glm_logit` | logistic model for binary data (alternatively: `glm_binomial_logit`); default for binary variables |
| `glm_probit` | probit model for binary data (alternatively: `glm_binomial_probit`) |
| `glm_binomial_log` | binomial model with log link |
| `glm_binomial_cloglog` | binomial model with complementary log-log link |
| `glm_gamma_inverse` | gamma model with inverse link for skewed continuous data |
| `glm_gamma_identity` | gamma model with identity link for skewed continuous data |
| `glm_gamma_log` | gamma model with log link for skewed continuous data |
| `glm_poisson_log` | Poisson model with log link for count data |
| `glm_poisson_identity` | Poisson model with identity link for count data |
| `lognorm` | log-normal model for skewed continuous data |
| `beta` | beta model (with logit link) for skewed continuous data in (0, 1) |
| `mlogit` | multinomial logit model for unordered categorical variables; default for unordered factors with >2 levels |
| `clm` | cumulative logit model for ordered categorical variables; default for ordered factors |
The distributions used in the chosen model types must be appropriate for the respective variables. This means, for instance, that if a covariate has observed values ≤ 0, choosing a log-normal or a Gamma model will result in an error.
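A quick way to check whether a log-normal or Gamma model is admissible for a covariate is to verify that all observed values are strictly positive (shown here with a hypothetical vector):

```r
# hypothetical covariate with a missing value
x <- c(0.8, 1.2, NA, 2.5, 0.4)
all(x > 0, na.rm = TRUE)   # TRUE: a lognorm / Gamma model is admissible

# adding a zero makes these model types invalid for this variable
y <- c(x, 0)
all(y > 0, na.rm = TRUE)   # FALSE
```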
Change the Covariate Model Types
The argument
models
takes a named vector of model types, where the names
are the names of the (incomplete) variables.
When models are specified for only a subset of the variables for which a model is needed, the default model choices (as indicated in the table) are used for the unspecified variables.
Fit a linear regression model for Albumin
with
covariates Age0
, Sex
, DM
and
Creatinin
.
Creatinin
has a skewed distribution and the default
(normal) distribution may not be appropriate to model/impute this
variable.
⇨ Set a more appropriate model type for Creatinin
.
models = c(Creatinin = "<select a model type from the list above>")
.
Given the shape of the histogram for Creatinin
(see the
histogram above) a log-normal model or a Gamma model might be
appropriate.
The histogram of Creatinin
shows the marginal
distribution. More relevant would be the distribution conditional on the
variables in the linear predictor of the model for
Creatinin
, but we don’t know this distribution.
lm5a <- lm_imp(Albumin ~ Age0 + Sex + DM + Creatinin,
               models = c(Creatinin = "lognorm"),
               n.iter = 200, data = HCVbase)
lm5b <- lm_imp(Albumin ~ Age0 + Sex + DM + Creatinin,
               models = c(Creatinin = "glm_gamma_log"),
               n.iter = 200, data = HCVbase)
Of course we would now check the convergence before inspecting the results…
Did you notice that the sampling is faster when using
lognorm
than with glm_gamma_log
?
sum_duration(lm5a)
## Time difference of 3.990817 secs
sum_duration(lm5b)
## Time difference of 14.41182 secs
Different types of models use different samplers in JAGS, resulting in slower or faster sampling.
This has to do with the complexity of the models (also the number of parameters) and the efficiency of the samplers that JAGS chooses for the different nodes.
The two models for Albumin result in very similar estimates. We can compare them, for instance, by plotting the posterior means and credible intervals.
library("magrittr")
library("ggplot2")
rbind(
  data.frame(coef = coef(lm5a)[[1]],
             confint(lm5a)[[1]],
             model = "lognorm", check.names = FALSE) %>%
    tibble::rownames_to_column(var = "variable"),
  data.frame(coef = coef(lm5b)[[1]],
             confint(lm5b)[[1]],
             model = "gamma", check.names = FALSE) %>%
    tibble::rownames_to_column(var = "variable")
) %>%
  ggplot(aes(x = model, y = coef)) +
  geom_point() +
  geom_errorbar(aes(ymin = `2.5%`, ymax = `97.5%`)) +
  facet_wrap("variable", scales = "free")
Sometimes, the linear association with the response that is implied by just adding a covariate to the linear predictor is not appropriate.
As in other packages for regression, we can specify functions inside a model formula or use interactions of variables.
An important difference to other packages is that when functions/interactions involve an incomplete variable, the function or interaction has to be computed in JAGS, using the imputed value. This limits the functionality to functions that are directly available in JAGS, or have been implemented. This is the case for many standard functions such as

- `log()`, `exp()`
- `sin()`, `cos()`, …
- `abs()`, `sqrt()`
- `I(...^2)`

Functions may also involve multiple (incomplete) variables.
Not yet available are splines for incomplete variables (e.g., `bs()` and `ns()`).
For interactions, the standard syntax with *
,
:
or ^
is used.
See the vignette for more explanations and examples.
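The same formula syntax can be tried out with `lm()` on complete data; for instance, a quadratic term via `I()` and an interaction via `*` (using the built-in `mtcars` data, unrelated to this practical):

```r
# quadratic term via I()
fit_sq  <- lm(mpg ~ wt + I(wt^2), data = mtcars)

# main effects of wt and hp plus their interaction via *
fit_int <- lm(mpg ~ wt * hp, data = mtcars)
names(coef(fit_int))  # "(Intercept)" "wt" "hp" "wt:hp"
```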
We now want to add BMI as covariate to the model for
Albumin
. BMI
can be calculated from
Height
and Weight
as \(\frac{\texttt{Weight}}{(\texttt{Height}/100)^2}\).
There are some patients for whom only either Height
or
Weight
is available:
md_pattern(subset(datHCV, select = c(Height, Weight, BMI)))
If we directly use `BMI` in the model, we lose the partial information we have for these patients.
Specify a linear model for Albumin
with covariates
Age0
, Sex
, DM
,
Creatinin
and BMI, in which BMI is calculated from
Height
and Weight
.
To specify the model, we can just include the formula for the BMI, wrapped in an `I()` function, in the model formula:
lm6 <- lm_imp(Albumin ~ Age0 + Sex + DM + Creatinin + I(Weight/(Height/100)^2),
              data = HCVbase, models = c(Creatinin = "lognorm"))
##
## Note: No MCMC sample will be created when n.iter is set to 0.
## Error in rjags::jags.model(file = modelfile, data = data_list, inits = inits, :
## Error in node M_lvlone[6,10]
## Invalid parent values
But this results in an error! What is going wrong?
Since we haven’t specified model types for Height
and
Weight
, the default, i.e., a normal model, is chosen. This
means that imputed values for Height
could be any value on
the real line. This includes the value 0 which would result in division
by 0, causing the error.
Note that even if all observed values of Height
are far
away from zero, and probably none of the imputed values would ever be
exactly zero, we still get this error.
The error message says that the node `M_lvlone[6,10]` has invalid parent values.
M_lvlone
is the name of the design matrix containing the
data. (All design matrices are called M
;
lvlone
refers to the level of the particular variable.)
The design matrix is stored in the `data_list` element of the returned object.
head(lm6$data_list$M_lvlone)
## Albumin Creatinin DM Height Weight (Intercept) Age0 SexFemale DMYes
## 960.1 57.92853 70.29047 0 180.6634 63.25236 1 24.02857 0 NA
## 512.112 NA 35.83949 0 164.2964 87.17301 1 35.28016 0 NA
## 704.445 47.74663 NA 0 168.6981 85.74012 1 42.65734 1 NA
## 960.889 52.02988 32.21713 0 190.5249 86.73939 1 49.15190 0 NA
## 671 40.72937 NA 0 155.2527 NA 1 47.18493 1 NA
## 936.46 45.86362 42.20511 NA NA 50.48447 1 22.20862 0 NA
## I(Weight/(Height/100)^2)
## 960.1 NA
## 512.112 NA
## 704.445 NA
## 960.889 NA
## 671 NA
## 936.46 NA
Column 10 contains the (to be) calculated BMI, and row 6 is a row in which the value for `Height` is missing.
`Height` is being imputed using the default normal model, which would allow for the value 0, which causes the error because we divide by it.
To avoid this problem we could either

1. use a model type with a distribution for `Height` that excludes the value 0 (e.g., a Gamma or log-normal model), or
2. truncate the (default) normal distribution so that it excludes the value 0.

Option 2 should only be used when the likely imputed values are far enough away from zero, so that the truncation does not actually impact the imputed values.
We can specify truncation using the argument `trunc`:

lm6 <- lm_imp(Albumin ~ Age0 + Sex + DM + Creatinin + I(Weight/(Height/100)^2),
              data = HCVbase, models = c(Creatinin = "lognorm"),
              trunc = list(Height = c(1e-5, NA)))
##
## Note: No MCMC sample will be created when n.iter is set to 0.
In contrast to, for instance, the mice package,
JointAI uses by default only the variables that appear
in the model formula and ignores all other columns in the input
data
.
It is possible to specify that some variables should be used as auxiliary variables. Auxiliary variables are used as covariates in the models for (incomplete) covariates, but not in the analysis model.
The argument auxvars
takes a one-sided (RHS)
formula.
`Alcoholabuse` was originally obtained from `alc_pwk` and additional information that is not included in this data. There are more missing values in `alc_pwk` than in `Alcoholabuse`.
⇨ `Alcoholabuse` is strongly related to `alc_pwk` and may provide relevant information when imputing `alc_pwk`.
Update our previous model lm1
(it had the model formula
Albumin ~ Age0 + Sex + DM + race + alc_pwk
), so that
Alcoholabuse
is used as auxiliary variable for the
imputation.
Use list_models()
to see in which sub-models
Alcoholabuse
was included.
(Note: I set n.iter = 0
and n.adapt = 0
here, i.e., no MCMC samples, to avoid unnecessary computational
time.)
lm7 <- update(lm1, auxvars = ~ Alcoholabuse, n.iter = 0, n.adapt = 0)
##
## Note: No MCMC sample will be created when n.iter is set to 0.
list_models(lm7, priors = FALSE, regcoef = FALSE, otherpars = FALSE)
## Linear model for "Albumin"
## family: gaussian
## link: identity
## * Predictor variables:
## (Intercept), Age0, SexFemale, DMYes, raceOther, raceAsian, alc_pwk<1 Drink per week,
## alc_pwk1-14 Drinks per week, alc_pwk> 14 Drinks per week
##
##
## Cumulative logit model for "alc_pwk"
## * Reference category: "Never used"
## * Predictor variables:
## Age0, SexFemale, DMYes, raceOther, raceAsian, AlcoholabuseYes
##
##
## Binomial model for "DM"
## family: binomial
## link: logit
## * Reference category: "No"
## * Predictor variables:
## (Intercept), Age0, SexFemale, raceOther, raceAsian, AlcoholabuseYes
##
##
## Multinomial logit model for "race"
## * Reference category: "Caucasian"
## * Predictor variables:
## (Intercept), Age0, SexFemale, AlcoholabuseYes
##
##
## Binomial model for "Alcoholabuse"
## family: binomial
## link: logit
## * Reference category: "No"
## * Predictor variables:
## (Intercept), Age0, SexFemale
`Alcoholabuse` is included in all covariate models, because the variable has fewer missing values than all other incomplete covariates.
It is also possible to use functions of variables as auxiliary variables:
lm7b <- update(lm1, auxvars = ~ Alcoholabuse + I(Age0^2),
               n.iter = 0, n.adapt = 0)
##
## Note: No MCMC sample will be created when n.iter is set to 0.
list_models(lm7b, priors = FALSE, regcoef = FALSE, otherpars = FALSE)
## Linear model for "Albumin"
## family: gaussian
## link: identity
## * Predictor variables:
## (Intercept), Age0, SexFemale, DMYes, raceOther, raceAsian, alc_pwk<1 Drink per week,
## alc_pwk1-14 Drinks per week, alc_pwk> 14 Drinks per week
##
##
## Cumulative logit model for "alc_pwk"
## * Reference category: "Never used"
## * Predictor variables:
## Age0, SexFemale, DMYes, raceOther, raceAsian, AlcoholabuseYes, I(Age0^2)
##
##
## Binomial model for "DM"
## family: binomial
## link: logit
## * Reference category: "No"
## * Predictor variables:
## (Intercept), Age0, SexFemale, raceOther, raceAsian, AlcoholabuseYes, I(Age0^2)
##
##
## Multinomial logit model for "race"
## * Reference category: "Caucasian"
## * Predictor variables:
## (Intercept), Age0, SexFemale, AlcoholabuseYes, I(Age0^2)
##
##
## Binomial model for "Alcoholabuse"
## family: binomial
## link: logit
## * Reference category: "No"
## * Predictor variables:
## (Intercept), Age0, SexFemale, I(Age0^2)
The analysis model only includes Age0
, but the covariate
models use Age0
as well as I(Age0^2)
.
See also the vignette for more explanation / examples.