Imputation of Longitudinal or Multi-level Data

Preface

R packages

In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:

R version 4.3.0 (2023-04-21 ucrt)
mice (version: 3.15.0)
miceadds (version: 3.16.18)
lme4 (version: 3.1.162)
JointAI (version: 1.0.5)
splines (version: 4.3.0)
ggplot2 (version: 3.4.2)

Help files

You can find help files for any function by adding a ? before the name of the function.

Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html

Aim

The focus of this practical is the imputation of data that has features that require special attention.

In the interest of time, we will focus on these features and abbreviate steps that are the same as in any imputation setting (e.g., getting to know the data or checking that imputed values are realistic). Nevertheless, these steps are of course required when analysing data in practice.

Data & Model of interest

For this practical, we will use the EP16dat4 dataset, which is a subset of data from a trial on primary biliary cirrhosis (PBC) of the liver.

To get the EP16dat4 dataset, load the file EP16dat4.RData. You can download it here.

To load this dataset into R, you can use the command file.choose() which opens the explorer and allows you to navigate to the location of the file on your computer.

If you know the path to the file, you can also use load("<path>/EP16dat4.RData").

The variables contained in the dataset EP16dat4 are:


`id`	patient identifier
`day`	continuously measured day of follow-up time (the time variable)
`sex`	patients’ sex (f: female, m: male)
`trt`	treatment group (0: D-penicillmain, 1: placebo)
`age`	patients’ age at intake
`ascites`	presence of ascites at baseline (0:no, 1:yes)
`hepato`	presence of hepatomegaly or enlarged liver
`bili`	serum bilirubin level at baseline
`copper`	urine copper (ug/day)
`albumin`	serum albumin level at follow-up (time-varying)

The variables have the following distributions and proportions of missing values:

albumin, bili and hepato were measured repeatedly, day is the day of that measurement, and the other variables were only measured once, at baseline.

The missing data pattern is:

The longitudinal outcome albumin shows relatively linear trajectories over time:

To analyse the trajectories of albumin we want to use the following linear mixed effects model with random intercept and slope (either using lme from the package nlme or using lmer from the package lme4; and only AFTER we have imputed the data):

# using package nlme
lme(albumin ~ day + sex + trt + age + ascites + log(bili) + copper, random = ~day|id)


# using package lme4
lmer(albumin ~ day + sex + trt + age + ascites + log(bili) + copper + (day|id))

Imputation using mice

With the current data, the approach to impute the data in wide format is not feasible, since the data are unbalanced, i.e., the measurements do not follow a pre-specified pattern, but different patients have different numbers of measurements and are measured at different time-points.

Theoretically, there is some functionality in mice for handling longitudinal data, using special imputation methods like "2l.norm", "2lonly.norm" or "2lonly.pmm", which take into account the multi-level structure of the data.

The package has undergone some recent updates with the result that we are unable to replicate the results we obtained before - data that was imputed now results in warning messages, and values are not being imputed. This may be a user error, due to changes in functionality, or it may be an issue with the functions itself.

Going forward we will perform imputation in this setting using an alternative approach.

The idea is to find a summary of the longitudinal variables that will capture the relevant characteristics of the longitudinal profiles so that as little information as possible is lost. This summary also has to be represented in the same number of values for each subject, and these values need to have the same interpretation for all subjects.

One option that fulfils these requirements is to fit mixed models to the longitudinal variables, and to represent the trajectories by the estimated random effects in the imputation (Erler et al., 2016). This approach, however, does not work for imputation of missing values in longitudinal variables. Hence, we will not consider the variable hepato here.

To get a good representation, the model needs to be flexible enough to fit the data sufficiently well.

We already saw how the traces for albumin look like, here is the corresponding plot for log(bili):

Task 1

Our model of interest involves the longitudinal variables albumin and bili, which are both completely observed.

Fit linear mixed models for albumin and for log(bili).

The function lmer() from the package lme4 is faster than lme() from nlme.

To solve convergence issues in the function lmer(), you can set the argument control = lmerControl(optimizer = "bobyqa").

Use a natural cubic spline with 2 degrees of freedom for the time variable day and also include this spline specification in the random effects.
Extract the random effects from these models.

Random effects can be extracted from the model using the function ranef().

Solution 1

library("splines")

## 
## Attaching package: 'splines'

## The following object is masked from 'package:JointAI':
## 
##     bs

library("lme4")

# model for albumin
mod_albu0 <- lmer(albumin ~ age + sex + ns(day, df = 2) + (ns(day, df = 2) | id),
                  data = EP16dat4)

# model for log(bili)
mod_bili0 <- lmer(log(bili) ~ ns(day, df = 2) + (ns(day, df = 2) | id),
                  data = EP16dat4,
                  control = lmerControl(optimizer = "bobyqa"))

# extract the random effects
b_albu <- ranef(mod_albu0)$id
b_bili <- ranef(mod_bili0)$id

head(b_albu)

##   (Intercept) ns(day, df = 2)1 ns(day, df = 2)2
## 1   -0.444125         0.031114         0.270071
## 2    0.303350        -0.494048        -0.641496
## 3    0.046973         0.026260        -0.018409
## 4   -0.512316        -0.390659        -0.010550
## 5   -0.237393        -0.766935        -0.293471
## 6    0.440042         0.899356         0.472693

head(b_bili)

##   (Intercept) ns(day, df = 2)1 ns(day, df = 2)2
## 1    2.161134          2.03407          1.31636
## 2   -0.598690          0.40027         -0.53131
## 3   -0.287299         -0.15803         -0.16803
## 4   -0.046146          0.64818          0.06175
## 5    0.262130          2.43750          1.10205
## 6   -0.677375         -2.36292         -1.28033

To be able to distinguish between the random effects from the two models, we should give them different names:

names(b_albu) <- paste0("b_albu", 0:2)
names(b_bili) <- paste0("b_bili", 0:2)

To check that our model fits the data suitably well, we can look at the observed and fitted values for a few cases:

# Combine the original data with the fitted values
plotDF <- cbind(EP16dat4,
                fit_albu = fitted(mod_albu0),
                fit_bili = fitted(mod_bili0)
)

# make a subset of 24 randomly chosen subjects
plotDF_sub <- subset(plotDF, id %in% sample(unique(EP16dat4$id), size = 30))


# For albumin:
ggplot(plotDF_sub, aes(x = day, y = albumin, group = id)) +
  geom_point(alpha = 0.5, size = 1) +
  geom_line(aes(y = fit_albu), color = 'blue') +
  geom_point(aes(y = fit_albu), color = 'blue', alpha = 0.3) +
  facet_wrap('id', ncol = 6) +
  theme(panel.grid = element_blank())

# For log(bili):
ggplot(plotDF_sub, aes(x = day, y = log(bili), group = id)) +
  geom_point(alpha = 0.5, size = 1) +
  geom_line(aes(y = fit_bili), color = 'blue') +
  geom_point(aes(y = fit_bili), color = 'blue', alpha = 0.3) +
  facet_wrap('id', ncol = 6) +
  theme(panel.grid = element_blank())

Task 2

The next step is to get a wide-format version of our data. We do not need to worry about the longitudinal variables, instead we need to include the random effects as columns in this wide-format data.

Get the baseline observations from all subjects as basis for the wide-format data. You can use match(unique(EP16dat4$id), EP16dat4$id) to select the first row for each subject.
Add the random effects as columns to this data.

You can use cbind() to add columns to a data.frame.

Solution 2

# combine the first row of each subject with the random effects
EP16dat_wide <- cbind(EP16dat4[match(unique(EP16dat4$id), EP16dat4$id), ],
                      b_albu, b_bili)

head(EP16dat_wide)

##    id trt    age sex day ascites hepato bili albumin copper   b_albu0   b_albu1   b_albu2   b_bili0
## 1   1   1 58.765   f   0    <NA>      1 14.5    2.60    156 -0.444125  0.031114  0.270071  2.161134
## 3   2   1 56.446   f   0       0      1  1.1    4.14     54  0.303350 -0.494048 -0.641496 -0.598690
## 12  3   1 70.073   m   0       0      0  1.4    3.48     NA  0.046973  0.026260 -0.018409 -0.287299
## 16  4   1 54.741   f   0       0      1  1.8    2.54     64 -0.512316 -0.390659 -0.010550 -0.046146
## 23  5   0 38.105   f   0       0      1  3.4    3.53    143 -0.237393 -0.766935 -0.293471  0.262130
## 29  6   0 66.259   f   0       0      1  0.8    3.98     50  0.440042  0.899356  0.472693 -0.677375
##     b_bili1  b_bili2
## 1   2.03407  1.31636
## 3   0.40027 -0.53131
## 12 -0.15803 -0.16803
## 16  0.64818  0.06175
## 23  2.43750  1.10205
## 29 -2.36292 -1.28033

Task 3

With these extra columns in the data, we can run the imputation using mice().

Perform the necessary steps to impute the data. Make sure you exclude unnecessary columns/variables from the imputation.

The longitudinal variables are now represented by their random effects. This means, we do not need to include them any more in the imputation.

Solution 3

library("mice")
imp0 <- mice(EP16dat_wide, maxit = 0)

## Warning: Number of logged events: 1

There is one logged event. This is only the set-up run, so it is not a problem, but we should be interested in what the problem might be to fix this in the actual imputation.

imp0$loggedEvents

##   it im dep     meth out
## 1  0  0     constant day

The variable day is constant (because we used the first row for everyone, and the first measurement was taken on day 0 for all patients). We would exclude this variable from the imputation anyway since it is a longitudinal variable.

meth <- imp0$method
meth

##       id      trt      age      sex      day  ascites   hepato     bili  albumin   copper  b_albu0 
##       ""       ""       ""       ""       "" "logreg"       ""       ""       ""    "pmm"       "" 
##  b_albu1  b_albu2  b_bili0  b_bili1  b_bili2 
##       ""       ""       ""       ""       ""

We do not need to change anything in the imputation method if we think that using pmm for copper is a good choice.

pred <- imp0$predictorMatrix
pred[, c('day', 'hepato', 'albumin', 'bili', 'id')] <- 0 # mice already set day to 0

imp_mice <- mice(EP16dat_wide, method = meth, predictorMatrix = pred,
                 maxit = 20, seed = 2020, printFlag = FALSE)

Task 4

We now need to combine the imputed baseline data with the longitudinal variables in order to do our analysis.

Extract the imputed data from the mids object in long format and do not include the original data.
Split the data by imputation number into a list of data frames

You can split the data into a list with the help of the function split.

The variable .imp refers to the imputation number.

Merge each of the data.frames in the list with the longitudinal variables

When merging the data, the imputed data should only include the baseline covariates, id variable, and columns that were added by mice. The longitudinal data should only include the time-varying variables (including day) and the id variable.

Convert the list of merged data.frames to a mids object with the help of the function datalist2mids from the package miceadds.

Solution 4

We first extract the data from the mids object. Here we use the option include = FALSE because we will later use the function datalist2mids(). If we would use as.mids() we would need to include the original data.

impdat <- complete(imp_mice, action = 'long', include = FALSE)

We then split the data by imputation number. Then we merge each part of the data. We can conveniently wrap this into lapply(), because we want to apply the same function to each element of a list.

imp_list <- lapply(split(impdat, impdat$.imp), function(x) {
  merge(subset(x, select = c(.imp, .id, id, trt, age, sex, ascites, copper)),
        subset(EP16dat4, select = c(id, day, hepato, bili, albumin)),
        all = TRUE)
})

When converting the resulting list of longitudinal datasets to a mids object we would have (at least) two options: the function as.mids() and the function datalist2mids.

as.mids() does not work well with longitudinal data. It gives an error about duplicate row names. For that reason, we use datalist2mids.

mids_long <- miceadds::datalist2mids(imp_list)

## Warning: Number of logged events: 1

Task 5

Fit the mixed model of interest (as specified above) and obtain the pooled results.

You need to load the broom.mixed package (i.e., library("broom.mixed")) before pooling results from a mixed model.

The model of interest is

lmer(albumin ~ ns(day, df = 2) + sex + trt + age + ascites +                         log(bili) + copper + (ns(day, df = 2)|id)

Solution 5

mod_mice <- with(mids_long,
                 lmer(albumin ~ ns(day, df = 2) + sex + trt + age + ascites +
                        log(bili) + copper + (ns(day, df = 2)|id),
                      control = lmerControl(optimizer = "bobyqa")))

library("broom.mixed")
summary(pool(mod_mice), conf.int = TRUE)

##               term    estimate  std.error statistic       df     p.value      2.5 %     97.5 %
## 1      (Intercept)  4.20832530 0.11898216  35.36938  781.019 2.4869e-164  3.9747626  4.4418880
## 2 ns(day, df = 2)1 -0.81749356 0.06482659 -12.61047 1591.773  8.0453e-35 -0.9446480 -0.6903391
## 3 ns(day, df = 2)2 -0.57945448 0.09503880  -6.09703 1790.413  1.3205e-09 -0.7658531 -0.3930558
## 4             sexf -0.14529983 0.05988992  -2.42612  635.757  1.5539e-02 -0.2629058 -0.0276939
## 5              trt  0.02090008 0.03579124   0.58394 1556.820  5.5934e-01 -0.0493040  0.0911042
## 6              age -0.00790092 0.00176303  -4.48143 1885.037  7.8597e-06 -0.0113586 -0.0044432
## 7         ascites1 -0.20739675 0.07845242  -2.64360  250.789  8.7199e-03 -0.3619063 -0.0528872
## 8        log(bili) -0.16106260 0.01485340 -10.84348  223.804  2.7174e-22 -0.1903330 -0.1317922
## 9           copper -0.00055242 0.00032133  -1.71913   20.611  1.0057e-01 -0.0012214  0.0001166

Imputation using JointAI

To analyse incomplete longitudinal data using a linear mixed model the R package JointAI provides the functions lme_imp() and lmer_imp(). The specification of the main model components is analogous to the function lme() from the nlme package or lmer() from the lme4 package.

Specification of longitudinal models:
When imputing variables in a longitudinal (or other multi-level) model and there are missing values in baseline (level-2) covariates, models need to be specified for all longitudinal covariates, even if they do not have missing values. Specifying no model would imply that the incomplete baseline covariates are independent of the complete longitudinal variable (see also here). Therefore, JointAI automatically specifies models for all longitudinal covariates in such a setting.

An exception may be the time variable: it is often reasonable to assume that the baseline covariates are independent of the measurement times of the outcome and longitudinal covariates. To tell JointAI not to specify a model for the time variable, the argument no_model can be used.

Model types for longitudinal covariates:
For longitudinal covariates the following model types are implemented:

name	model	variable type
`lmm`	linear mixed model	continuous variables
`glmm_lognorm`	log-normal mixed model	skewed continuous variables
`glmm_beta`	beta mixed model	continuous variables in [0, 1]
`glmm_logit`	logistic mixed model	factors with two levels
`glmm_probit`	probit mixed model	factors with two levels
`glmm_gamma_inverse`	gamma mixed model (with inverse link)	skewed continuous variables
`glmm_gamma_identity`	gamma mixed model (with identity link)	skewed continuous variables
`glmm_gamma_log`	gamma mixed model (with log link)	skewed continuous variables
`glmm_poisson_log`	poisson mixed model (with log link)	count variables
`glmm_poisson_identity`	poisson mixed model (with identity link)	count variables
`clmm`	cumulative logit mixed model	ordered factors with >2 levels
`mlogitmm`	multinomial logit mixed model	unordered factors with >2 levels

More info:
For the specification of the other arguments of lme_imp(), refer to

the help page,
the vignette on Model Specification, or
the details given in the practical on Imputation with Non-linear Associations

Task 1

Run the imputation (start with n.iter = 500; this will take a few seconds).

Remember to specify appropriate models for the incomplete covariates and longitudinal variables.

To see what the defaults are, you can run lme_imp() with n.adapt = 0 and n.iter = 0, and extract the vector of model types using <mymodel>$models, where <mymodel> is the name you gave the model.

Prevent specification of a model for day.
Check convergence using a traceplot().

If you use JointAI version 0.6.1 and R version 4.0.0 you need to set the option use_ggplot = TRUE in traceplot().

Solution 1

library("JointAI")
JointAIlong <- lme_imp(albumin ~ ns(day, df = 2) + sex + trt + age + ascites + 
                         bili + copper, random = ~ns(day, df = 2)|id,
                       models = c(copper = 'lognorm', bili = 'glmm_lognorm'),
                       no_model = 'day', data = EP16dat4, n.iter = 500, seed = 2020)

traceplot(JointAIlong, use_ggplot = TRUE) +
  theme(legend.position = 'none',
        panel.grid = element_blank())

Task 2

The traceplot shows that there is some autocorrelation (i.e., values are very similar to the previous value) in some of the chains (for example, ns(day, df = 2)2).

To continue the sampling in order to run the model for longer, we can use the function add_samples(). The number of additional iterations can be specified using the argument n.iter.

Add another 1000 samples to the model and check the traceplot() again.
Also evaluate convergence using the Gelman-Rubin criterion (GR_crit())
Evaluate the Monte Carlo error (MC_error()).

Solution 2

JointAIlong_extra <- add_samples(JointAIlong, n.iter = 1000)

traceplot(JointAIlong_extra, use_ggplot = TRUE) +
  theme(legend.position = 'none',
        panel.grid = element_blank())

The traceplots now look a bit better.

By setting the argument autoburnin = FALSE in GR_crit() we select to use the full MCMC chains. By default (autoburnin = TRUE) the function automatically discards the first half of the iterations and only uses the second half to evaluate convergence.

GR_crit(JointAIlong_extra, autoburnin = FALSE)

## Potential scale reduction factors:
## 
##                  Point est. Upper C.I.
## (Intercept)            1.00       1.00
## sexf                   1.00       1.00
## trt1                   1.00       1.00
## age                    1.00       1.00
## ascites1               1.01       1.03
## copper                 1.00       1.00
## ns(day, df = 2)1       1.01       1.02
## ns(day, df = 2)2       1.06       1.19
## bili                   1.00       1.02
## sigma_albumin          1.00       1.00
## 
## Multivariate psrf
## 
## 1.06

For some parameters, the criterion is still too large. The traceplot indicates that this is not really an issue of convergence, but slow mixing: because of the high correlation between subsequent samples, the chain only moves very slowly. As a result, there is more variation between the chains that within one chain.

We would have to run the model even longer to give the chains the “time” to fully explore the full range of the posterior distribution.

MC_error(JointAIlong_extra)

##                       est    MCSE      SD MCSE/SD
## (Intercept)       4.19683 3.8e-03 0.12529   0.030
## sexf             -0.12265 1.9e-03 0.06305   0.031
## trt1              0.00978 9.4e-04 0.03795   0.025
## age              -0.00736 5.3e-05 0.00193   0.028
## ascites1         -0.27734 3.2e-03 0.08791   0.036
## copper           -0.00092 1.2e-05 0.00028   0.043
## ns(day, df = 2)1 -0.91264 3.8e-03 0.06844   0.055
## ns(day, df = 2)2 -0.70565 1.3e-02 0.12384   0.105
## bili             -0.02413 5.4e-05 0.00246   0.022
## sigma_albumin     0.31713 1.4e-04 0.00592   0.024

The Monte Carlo error is also still larger then we would want it to be.

Based on the Gelman-Rubin criterion and the Monte Carlo error we need more MCMC samples to get precise results. We can either run the MCMC chains for longer, but since the issue is not that the chains do not have converged, we could also get more samples by increasing the number of MCMC chains.

Using more MCMC chains has the advantage that different chains are independent and that they can be run in parallel.

Note: On computers with multiple CPUs it is possible to run the different MCMC chains of a model in parallel to save some time.

This can be achieved using the doFuture package. For details see the vignette on MCMC settings.

Additional exercise JointAI

We want to fit a logistic mixed model for the variable hepato and explore if the association is non-linear over time.

Task 1

Fit a logistic mixed model using the function glme_imp using the same covariates as before.
Specify a natural cubic spline with 3 degrees of freedom for day.
Check convergence using a traceplot().

When specifying a generalized (mixed) model remember to specify the model family and link function.

To use natural cubic splines use the function ns() from the package splines, i.e., ns(day, df = 3).

Solution 1

library("splines")
JointAIlong2 <- glme_imp(hepato ~ ns(day, df = 3) + sex + trt + age + ascites +
                           bili + copper, random = ~1|id, family = binomial(),
                         models = c(copper = 'lognorm', bili = 'glmm_gamma_log'),
                         no_model = 'day', data = EP16dat4, n.iter = 1000,
                         seed = 2020)

traceplot(JointAIlong2, use_ggplot = TRUE) +
  theme(legend.position = 'none',
        panel.grid = element_blank())

Task 2

When the model has converged, we want to visualize the potentially non-linear association of day. To do that, we can create a new dataset containing information on an “average” subject, with different values for day.

The function predDF() creates such a dataset from an object of class JointAI. It sets reference values (i.e., the median for continuous variables and the reference category for categorical variables) for all variables other than the ones specified in the argument var (a one-sided formula). The variables given in var will range across the range of values of that variable encountered in the data.

Use predDF() to create a dataset that allows visualization of the effect of day.

Solution 2

newdf <- predDF(JointAIlong2, var = ~day)

head(newdf)

##   hepato    day sex trt    age ascites bili copper  id
## 1      0   0.00   m   0 48.871       0  1.4     64 126
## 2      0  52.04   m   0 48.871       0  1.4     64 126
## 3      0 104.08   m   0 48.871       0  1.4     64 126
## 4      0 156.12   m   0 48.871       0  1.4     64 126
## 5      0 208.16   m   0 48.871       0  1.4     64 126
## 6      0 260.20   m   0 48.871       0  1.4     64 126

Task 3

We can now predict the outcome of our model for our “average” subject using the function predict(). It takes a JointAI object and a data.frame containing the data to predict from as arguments. The argument quantiles can be used to specify which quantiles of the distribution of each fitted value are returned (default is 2.5% and 97.5%).

predict() returns a list with the following elements

dat: the data.frame provided by the user extended with the fitted values and 2.5% and 97.5% quantiles that form the credible interval for the fitted values
fit: a vector containing the fitted values (the mean of the distribution of the fitted value)
quantiles: a matrix containing the credible interval for each fitted value

Use predict() to obtain the fitted values and corresponding intervals
Visualize the result by plotting fitted values and quantiles (y-axis) over time (day; x-axis)

Solution 3

pred <- predict(JointAIlong2, newdata = newdf, type = "response")

## Warning: 
## Prediction in multi-level settings currently only takes into account the fixed effects,
## i.e., assumes that the random effect realizations are equal to zero.

ggplot(pred$newdata, aes(x = day, y = fit)) +
  geom_ribbon(aes(ymin = `2.5%`, ymax = `97.5%`), alpha = 0.3) +
  geom_line() +
  theme(panel.grid = element_blank()) +
  ylab("expected probability")

References

Erler, N. S., Rizopoulos, D., Rosmalen, J. van, Jaddoe, V. W., Franco, O. H., & Lesaffre, E. M. (2016). Dealing with missing covariates in epidemiologic studies: A comparison between multiple imputation and a full Bayesian approach. Statistics in Medicine, 35(17), 2955–2974. https://doi.org/10.1002/sim.6944