In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:
mice
(version: 3.15.0)miceadds
(version: 3.16.18)lme4
(version: 3.1.162)JointAI
(version: 1.0.5)splines
(version: 4.3.0)ggplot2
(version: 3.4.2)You can find help files for any function by adding a ?
before the name of the function.
Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html
The focus of this practical is the imputation of data that has features that require special attention.
In the interest of time, we will focus on these features and abbreviate steps that are the same as in any imputation setting (e.g., getting to know the data or checking that imputed values are realistic). Nevertheless, these steps are of course required when analysing data in practice.
For this practical, we will use the EP16dat4 dataset, which is a subset of data from a trial on primary biliary cirrhosis (PBC) of the liver.
To get the EP16dat4 dataset, load the file
EP16dat4.RData
. You can download it here.
To load this dataset into R, you can use the command
file.choose()
which opens the explorer and allows you to
navigate to the location of the file on your computer.
If you know the path to the file, you can also use
load("<path>/EP16dat4.RData")
.
The variables contained in the dataset EP16dat4
are:
id
|
patient identifier |
day
|
continuously measured day of follow-up time (the time variable) |
sex
|
patients’ sex (f: female, m: male) |
trt
|
treatment group (0: D-penicillmain, 1: placebo) |
age
|
patients’ age at intake |
ascites
|
presence of ascites at baseline (0:no, 1:yes) |
hepato
|
presence of hepatomegaly or enlarged liver |
bili
|
serum bilirubin level at baseline |
copper
|
urine copper (ug/day) |
albumin
|
serum albumin level at follow-up (time-varying) |
The variables have the following distributions and proportions of missing values:
albumin
, bili
and hepato
were
measured repeatedly, day
is the day of that measurement,
and the other variables were only measured once, at baseline.
The missing data pattern is:
The longitudinal outcome albumin
shows relatively linear
trajectories over time:
To analyse the trajectories of albumin
we want to use
the following linear mixed effects model with random intercept and slope
(either using lme
from the package nlme or
using lmer
from the package lme4; and only
AFTER we have imputed the data):
# using package nlme
lme(albumin ~ day + sex + trt + age + ascites + log(bili) + copper, random = ~day|id)
# using package lme4
lmer(albumin ~ day + sex + trt + age + ascites + log(bili) + copper + (day|id))
With the current data, the approach to impute the data in wide format is not feasible, since the data are unbalanced, i.e., the measurements do not follow a pre-specified pattern, but different patients have different numbers of measurements and are measured at different time-points.
Theoretically, there is some functionality in mice
for handling longitudinal data, using special imputation methods like
"2l.norm"
, "2lonly.norm"
or
"2lonly.pmm"
, which take into account the multi-level
structure of the data.
The package has undergone some recent updates with the result that we are unable to replicate the results we obtained before - data that was imputed now results in warning messages, and values are not being imputed. This may be a user error, due to changes in functionality, or it may be an issue with the functions itself.
Going forward we will perform imputation in this setting using an alternative approach.
The idea is to find a summary of the longitudinal variables that will capture the relevant characteristics of the longitudinal profiles so that as little information as possible is lost. This summary also has to be represented in the same number of values for each subject, and these values need to have the same interpretation for all subjects.
One option that fulfils these requirements is to fit mixed models to
the longitudinal variables, and to represent the trajectories by the
estimated random effects in the imputation (Erler
et al., 2016). This approach, however, does not work for
imputation of missing values in longitudinal variables. Hence, we will
not consider the variable hepato
here.
To get a good representation, the model needs to be flexible enough to fit the data sufficiently well.
We already saw how the traces for albumin
look like,
here is the corresponding plot for log(bili)
:
Our model of interest involves the longitudinal variables
albumin
and bili
, which are both completely
observed.
albumin
and for
log(bili)
.lmer()
from the package lme4
is faster than lme()
from nlme.
lmer()
, you can
set the argument
control = lmerControl(optimizer = "bobyqa")
.
day
and also include this spline specification in
the random effects.ranef()
.
library("splines")
##
## Attaching package: 'splines'
## The following object is masked from 'package:JointAI':
##
## bs
library("lme4")
# model for albumin
<- lmer(albumin ~ age + sex + ns(day, df = 2) + (ns(day, df = 2) | id),
mod_albu0 data = EP16dat4)
# model for log(bili)
<- lmer(log(bili) ~ ns(day, df = 2) + (ns(day, df = 2) | id),
mod_bili0 data = EP16dat4,
control = lmerControl(optimizer = "bobyqa"))
# extract the random effects
<- ranef(mod_albu0)$id
b_albu <- ranef(mod_bili0)$id b_bili
head(b_albu)
## (Intercept) ns(day, df = 2)1 ns(day, df = 2)2
## 1 -0.444125 0.031114 0.270071
## 2 0.303350 -0.494048 -0.641496
## 3 0.046973 0.026260 -0.018409
## 4 -0.512316 -0.390659 -0.010550
## 5 -0.237393 -0.766935 -0.293471
## 6 0.440042 0.899356 0.472693
head(b_bili)
## (Intercept) ns(day, df = 2)1 ns(day, df = 2)2
## 1 2.161134 2.03407 1.31636
## 2 -0.598690 0.40027 -0.53131
## 3 -0.287299 -0.15803 -0.16803
## 4 -0.046146 0.64818 0.06175
## 5 0.262130 2.43750 1.10205
## 6 -0.677375 -2.36292 -1.28033
To be able to distinguish between the random effects from the two models, we should give them different names:
names(b_albu) <- paste0("b_albu", 0:2)
names(b_bili) <- paste0("b_bili", 0:2)
To check that our model fits the data suitably well, we can look at the observed and fitted values for a few cases:
# Combine the original data with the fitted values
<- cbind(EP16dat4,
plotDF fit_albu = fitted(mod_albu0),
fit_bili = fitted(mod_bili0)
)
# make a subset of 24 randomly chosen subjects
<- subset(plotDF, id %in% sample(unique(EP16dat4$id), size = 30))
plotDF_sub
# For albumin:
ggplot(plotDF_sub, aes(x = day, y = albumin, group = id)) +
geom_point(alpha = 0.5, size = 1) +
geom_line(aes(y = fit_albu), color = 'blue') +
geom_point(aes(y = fit_albu), color = 'blue', alpha = 0.3) +
facet_wrap('id', ncol = 6) +
theme(panel.grid = element_blank())
# For log(bili):
ggplot(plotDF_sub, aes(x = day, y = log(bili), group = id)) +
geom_point(alpha = 0.5, size = 1) +
geom_line(aes(y = fit_bili), color = 'blue') +
geom_point(aes(y = fit_bili), color = 'blue', alpha = 0.3) +
facet_wrap('id', ncol = 6) +
theme(panel.grid = element_blank())
The next step is to get a wide-format version of our data. We do not need to worry about the longitudinal variables, instead we need to include the random effects as columns in this wide-format data.
match(unique(EP16dat4$id), EP16dat4$id)
to select the first
row for each subject.cbind()
to add columns to a
data.frame
.
# combine the first row of each subject with the random effects
<- cbind(EP16dat4[match(unique(EP16dat4$id), EP16dat4$id), ],
EP16dat_wide
b_albu, b_bili)
head(EP16dat_wide)
## id trt age sex day ascites hepato bili albumin copper b_albu0 b_albu1 b_albu2 b_bili0
## 1 1 1 58.765 f 0 <NA> 1 14.5 2.60 156 -0.444125 0.031114 0.270071 2.161134
## 3 2 1 56.446 f 0 0 1 1.1 4.14 54 0.303350 -0.494048 -0.641496 -0.598690
## 12 3 1 70.073 m 0 0 0 1.4 3.48 NA 0.046973 0.026260 -0.018409 -0.287299
## 16 4 1 54.741 f 0 0 1 1.8 2.54 64 -0.512316 -0.390659 -0.010550 -0.046146
## 23 5 0 38.105 f 0 0 1 3.4 3.53 143 -0.237393 -0.766935 -0.293471 0.262130
## 29 6 0 66.259 f 0 0 1 0.8 3.98 50 0.440042 0.899356 0.472693 -0.677375
## b_bili1 b_bili2
## 1 2.03407 1.31636
## 3 0.40027 -0.53131
## 12 -0.15803 -0.16803
## 16 0.64818 0.06175
## 23 2.43750 1.10205
## 29 -2.36292 -1.28033
With these extra columns in the data, we can run the imputation using
mice()
.
Perform the necessary steps to impute the data. Make sure you exclude unnecessary columns/variables from the imputation.
library("mice")
<- mice(EP16dat_wide, maxit = 0) imp0
## Warning: Number of logged events: 1
There is one logged event. This is only the set-up run, so it is not a problem, but we should be interested in what the problem might be to fix this in the actual imputation.
$loggedEvents imp0
## it im dep meth out
## 1 0 0 constant day
The variable day
is constant (because we used the first
row for everyone, and the first measurement was taken on day 0 for all
patients). We would exclude this variable from the imputation anyway
since it is a longitudinal variable.
<- imp0$method
meth meth
## id trt age sex day ascites hepato bili albumin copper b_albu0
## "" "" "" "" "" "logreg" "" "" "" "pmm" ""
## b_albu1 b_albu2 b_bili0 b_bili1 b_bili2
## "" "" "" "" ""
We do not need to change anything in the imputation method if we
think that using pmm
for copper
is a good
choice.
<- imp0$predictorMatrix
pred c('day', 'hepato', 'albumin', 'bili', 'id')] <- 0 # mice already set day to 0 pred[,
<- mice(EP16dat_wide, method = meth, predictorMatrix = pred,
imp_mice maxit = 20, seed = 2020, printFlag = FALSE)
We now need to combine the imputed baseline data with the longitudinal variables in order to do our analysis.
mids
object in long
format and do not include the original data.split
.
.imp
refers to the imputation number.
data.frame
s in the list with the
longitudinal variablesday
) and the
id
variable.
mids
object
with the help of the function datalist2mids
from the
package miceadds.We first extract the data from the mids
object. Here we
use the option include = FALSE
because we will later use
the function datalist2mids()
. If we would use
as.mids()
we would need to include the original data.
<- complete(imp_mice, action = 'long', include = FALSE) impdat
We then split the data by imputation number. Then we merge each part
of the data. We can conveniently wrap this into lapply()
,
because we want to apply the same function to each element of a
list.
<- lapply(split(impdat, impdat$.imp), function(x) {
imp_list merge(subset(x, select = c(.imp, .id, id, trt, age, sex, ascites, copper)),
subset(EP16dat4, select = c(id, day, hepato, bili, albumin)),
all = TRUE)
})
When converting the resulting list of longitudinal datasets to a
mids
object we would have (at least) two options: the
function as.mids()
and the function
datalist2mids
.
as.mids()
does not work well with longitudinal data. It
gives an error about duplicate row names. For that reason, we use
datalist2mids
.
<- miceadds::datalist2mids(imp_list) mids_long
## Warning: Number of logged events: 1
Fit the mixed model of interest (as specified above) and obtain the pooled results.
You need to load the broom.mixed package (i.e.,
library("broom.mixed")
) before pooling results from a mixed
model.
lmer(albumin ~ ns(day, df = 2) + sex + trt + age + ascites + log(bili) + copper + (ns(day, df = 2)|id)
<- with(mids_long,
mod_mice lmer(albumin ~ ns(day, df = 2) + sex + trt + age + ascites +
log(bili) + copper + (ns(day, df = 2)|id),
control = lmerControl(optimizer = "bobyqa")))
library("broom.mixed")
summary(pool(mod_mice), conf.int = TRUE)
## term estimate std.error statistic df p.value 2.5 % 97.5 %
## 1 (Intercept) 4.20832530 0.11898216 35.36938 781.019 2.4869e-164 3.9747626 4.4418880
## 2 ns(day, df = 2)1 -0.81749356 0.06482659 -12.61047 1591.773 8.0453e-35 -0.9446480 -0.6903391
## 3 ns(day, df = 2)2 -0.57945448 0.09503880 -6.09703 1790.413 1.3205e-09 -0.7658531 -0.3930558
## 4 sexf -0.14529983 0.05988992 -2.42612 635.757 1.5539e-02 -0.2629058 -0.0276939
## 5 trt 0.02090008 0.03579124 0.58394 1556.820 5.5934e-01 -0.0493040 0.0911042
## 6 age -0.00790092 0.00176303 -4.48143 1885.037 7.8597e-06 -0.0113586 -0.0044432
## 7 ascites1 -0.20739675 0.07845242 -2.64360 250.789 8.7199e-03 -0.3619063 -0.0528872
## 8 log(bili) -0.16106260 0.01485340 -10.84348 223.804 2.7174e-22 -0.1903330 -0.1317922
## 9 copper -0.00055242 0.00032133 -1.71913 20.611 1.0057e-01 -0.0012214 0.0001166
To analyse incomplete longitudinal data using a linear mixed model
the R package JointAI provides the functions
lme_imp()
and lmer_imp()
. The specification of
the main model components is analogous to the function
lme()
from the nlme package or
lmer()
from the lme4 package.
Specification of longitudinal models:
When
imputing variables in a longitudinal (or other multi-level) model and
there are missing values in baseline (level-2) covariates, models need
to be specified for all longitudinal covariates, even if they do not
have missing values. Specifying no model would imply that the incomplete
baseline covariates are independent of the complete longitudinal
variable (see also here).
Therefore, JointAI automatically specifies models for
all longitudinal covariates in such a setting.
An exception may be the time variable: it is often reasonable to
assume that the baseline covariates are independent of the measurement
times of the outcome and longitudinal covariates. To tell
JointAI not to specify a model for the time variable,
the argument no_model
can be used.
name | model | variable type |
---|---|---|
lmm
|
linear mixed model | continuous variables |
glmm_lognorm
|
log-normal mixed model | skewed continuous variables |
glmm_beta
|
beta mixed model | continuous variables in [0, 1] |
glmm_logit
|
logistic mixed model | factors with two levels |
glmm_probit
|
probit mixed model | factors with two levels |
glmm_gamma_inverse
|
gamma mixed model (with inverse link) | skewed continuous variables |
glmm_gamma_identity
|
gamma mixed model (with identity link) | skewed continuous variables |
glmm_gamma_log
|
gamma mixed model (with log link) | skewed continuous variables |
glmm_poisson_log
|
poisson mixed model (with log link) | count variables |
glmm_poisson_identity
|
poisson mixed model (with identity link) | count variables |
clmm
|
cumulative logit mixed model | ordered factors with >2 levels |
mlogitmm
|
multinomial logit mixed model | unordered factors with >2 levels |
More info:
For the specification of the other
arguments of lme_imp()
, refer to
Run the imputation (start with n.iter = 500
; this will
take a few seconds).
lme_imp()
with
n.adapt = 0
and n.iter = 0
, and extract the
vector of model types using <mymodel>$models
, where
<mymodel>
is the name you gave the model.
day
.traceplot()
.use_ggplot = TRUE
in
traceplot()
.
library("JointAI")
<- lme_imp(albumin ~ ns(day, df = 2) + sex + trt + age + ascites +
JointAIlong + copper, random = ~ns(day, df = 2)|id,
bili models = c(copper = 'lognorm', bili = 'glmm_lognorm'),
no_model = 'day', data = EP16dat4, n.iter = 500, seed = 2020)
traceplot(JointAIlong, use_ggplot = TRUE) +
theme(legend.position = 'none',
panel.grid = element_blank())
The traceplot shows that there is some autocorrelation (i.e., values
are very similar to the previous value) in some of the chains (for
example, ns(day, df = 2)2
).
To continue the sampling in order to run the model for longer, we can
use the function add_samples()
. The number of additional
iterations can be specified using the argument n.iter
.
traceplot()
again.GR_crit()
)MC_error()
).<- add_samples(JointAIlong, n.iter = 1000) JointAIlong_extra
traceplot(JointAIlong_extra, use_ggplot = TRUE) +
theme(legend.position = 'none',
panel.grid = element_blank())
The traceplots now look a bit better.
By setting the argument autoburnin = FALSE
in
GR_crit()
we select to use the full MCMC chains. By default
(autoburnin = TRUE
) the function automatically discards the
first half of the iterations and only uses the second half to evaluate
convergence.
GR_crit(JointAIlong_extra, autoburnin = FALSE)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## (Intercept) 1.00 1.00
## sexf 1.00 1.00
## trt1 1.00 1.00
## age 1.00 1.00
## ascites1 1.01 1.03
## copper 1.00 1.00
## ns(day, df = 2)1 1.01 1.02
## ns(day, df = 2)2 1.06 1.19
## bili 1.00 1.02
## sigma_albumin 1.00 1.00
##
## Multivariate psrf
##
## 1.06
For some parameters, the criterion is still too large. The traceplot indicates that this is not really an issue of convergence, but slow mixing: because of the high correlation between subsequent samples, the chain only moves very slowly. As a result, there is more variation between the chains that within one chain.
We would have to run the model even longer to give the chains the “time” to fully explore the full range of the posterior distribution.
MC_error(JointAIlong_extra)
## est MCSE SD MCSE/SD
## (Intercept) 4.19683 3.8e-03 0.12529 0.030
## sexf -0.12265 1.9e-03 0.06305 0.031
## trt1 0.00978 9.4e-04 0.03795 0.025
## age -0.00736 5.3e-05 0.00193 0.028
## ascites1 -0.27734 3.2e-03 0.08791 0.036
## copper -0.00092 1.2e-05 0.00028 0.043
## ns(day, df = 2)1 -0.91264 3.8e-03 0.06844 0.055
## ns(day, df = 2)2 -0.70565 1.3e-02 0.12384 0.105
## bili -0.02413 5.4e-05 0.00246 0.022
## sigma_albumin 0.31713 1.4e-04 0.00592 0.024
The Monte Carlo error is also still larger then we would want it to be.
Based on the Gelman-Rubin criterion and the Monte Carlo error we need more MCMC samples to get precise results. We can either run the MCMC chains for longer, but since the issue is not that the chains do not have converged, we could also get more samples by increasing the number of MCMC chains.
Using more MCMC chains has the advantage that different chains are independent and that they can be run in parallel.
Note: On computers with multiple CPUs it is possible to run the different MCMC chains of a model in parallel to save some time.
This can be achieved using the doFuture package. For details see the vignette on MCMC settings.
We want to fit a logistic mixed model for the variable
hepato
and explore if the association is non-linear over
time.
glme_imp
using the same covariates as before.day
.traceplot()
.ns()
from the
package splines, i.e., ns(day, df = 3)
.
library("splines")
<- glme_imp(hepato ~ ns(day, df = 3) + sex + trt + age + ascites +
JointAIlong2 + copper, random = ~1|id, family = binomial(),
bili models = c(copper = 'lognorm', bili = 'glmm_gamma_log'),
no_model = 'day', data = EP16dat4, n.iter = 1000,
seed = 2020)
traceplot(JointAIlong2, use_ggplot = TRUE) +
theme(legend.position = 'none',
panel.grid = element_blank())
When the model has converged, we want to visualize the potentially
non-linear association of day
. To do that, we can create a
new dataset containing information on an “average” subject, with
different values for day
.
The function predDF()
creates such a dataset from an
object of class JointAI
. It sets reference values (i.e.,
the median for continuous variables and the reference category for
categorical variables) for all variables other than the ones specified
in the argument var
(a one-sided formula). The variables
given in var
will range across the range of values of that
variable encountered in the data.
Use predDF()
to create a dataset that allows
visualization of the effect of day
.
<- predDF(JointAIlong2, var = ~day)
newdf
head(newdf)
## hepato day sex trt age ascites bili copper id
## 1 0 0.00 m 0 48.871 0 1.4 64 126
## 2 0 52.04 m 0 48.871 0 1.4 64 126
## 3 0 104.08 m 0 48.871 0 1.4 64 126
## 4 0 156.12 m 0 48.871 0 1.4 64 126
## 5 0 208.16 m 0 48.871 0 1.4 64 126
## 6 0 260.20 m 0 48.871 0 1.4 64 126
We can now predict the outcome of our model for our “average” subject
using the function predict()
. It takes a
JointAI
object and a data.frame
containing the
data to predict from as arguments. The argument quantiles
can be used to specify which quantiles of the distribution of each
fitted value are returned (default is 2.5%
and
97.5%
).
predict()
returns a list with the following elements
dat
: the data.frame
provided by the user
extended with the fitted values and 2.5% and 97.5% quantiles that form
the credible interval for the fitted valuesfit
: a vector containing the fitted values (the mean of
the distribution of the fitted value)quantiles
: a matrix containing the credible interval
for each fitted valuepredict()
to obtain the fitted values and
corresponding intervalsday
; x-axis)<- predict(JointAIlong2, newdata = newdf, type = "response") pred
## Warning:
## Prediction in multi-level settings currently only takes into account the fixed effects,
## i.e., assumes that the random effect realizations are equal to zero.
ggplot(pred$newdata, aes(x = day, y = fit)) +
geom_ribbon(aes(ymin = `2.5%`, ymax = `97.5%`), alpha = 0.3) +
geom_line() +
theme(panel.grid = element_blank()) +
ylab("expected probability")
© Nicole Erler