In this practical, a number of R packages are used. If any of them are not installed, you may be able to follow the practical but will not be able to run all of the code. The packages used (with the versions that were used to generate the solutions) are:

- mice (version 3.15.0)
- survival (version 3.5.5)
- JointAI (version 1.0.5)
- ggplot2 (version 3.4.2)

You can find help files for any function by adding a ? before the name of the function. Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html
The focus of this practical is the imputation of data that has features that require special attention.
In the interest of time, we will focus on these features and abbreviate steps that are the same as in any imputation setting (e.g., getting to know the data or checking that imputed values are realistic). Nevertheless, these steps are of course required when analysing data in practice.
For this practical, we will use the EP16dat3 dataset, which is a subset of the PBC (Primary Biliary Cirrhosis) data.
To get the EP16dat3 dataset, load the file EP16dat3.RData. You can download it here.
To load this dataset into R, you can use the command load(file.choose()), which opens the explorer and allows you to navigate to the location of the file on your computer. If you know the path to the file, you can also use load("<path>/EP16dat3.RData").
The variables in this subset (EP16dat3) are:
time | number of years between inclusion and death, transplantation, or end of follow-up
status | status at time (censored, transplant, dead)
age | patient’s age at intake
sex | patient’s sex
platelet | platelet count
chol | serum cholesterol
stage | histologic stage of disease
The variables in the EP16dat3 dataset have the following distributions:

The missing data pattern is:
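As a quick, self-contained illustration (using a toy data frame, not the EP16dat3 data), the amount of missingness per variable can be tabulated in base R; the pattern plot itself is typically produced with md.pattern() from the mice package:

```r
# Toy data frame with hypothetical values, for illustration only
df <- data.frame(chol     = c(261, NA, 176, NA, 244),
                 platelet = c(190, 221, NA, 188, 211),
                 age      = c(58, 54, 70, 66, 38))

colSums(is.na(df))          # number of missing values per variable
mean(!complete.cases(df))   # proportion of incomplete rows
```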
We want to determine predictor variables for patient survival, using the following Cox proportional hazards model:
coxph(Surv(time, status %in% c('transplant', 'dead')) ~ platelet + age + sex + chol + stage)
As we have seen in the lecture, the mice package provides the function nelsonaalen(), which calculates the Nelson-Aalen estimator with which the cumulative hazard can be approximated. The estimate can then be included as a predictor when imputing with mice().

Note: nelsonaalen() does not accept the %in% c('transplant', 'dead') specification of the event; hence, we need to create a new event indicator event.
We first create the event variable, and calculate the Nelson-Aalen estimator:
library("mice")
library("survival")

EP16dat3$event <- EP16dat3$status %in% c('transplant', 'dead')
EP16dat3$na <- nelsonaalen(data = EP16dat3, timevar = "time", statusvar = "event")
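To clarify what nelsonaalen() computes: the Nelson-Aalen estimator of the cumulative hazard is H(t) = sum over event times t_i <= t of d_i / n_i, with d_i events and n_i subjects at risk at t_i. A hand calculation on toy data (hypothetical times, assuming no tied event times):

```r
time  <- c(2, 3, 3, 5, 8)   # follow-up times, already sorted (hypothetical)
event <- c(1, 1, 0, 1, 0)   # 1 = event, 0 = censored

n_at_risk <- length(time) - seq_along(time) + 1   # 5, 4, 3, 2, 1
H <- cumsum(event / n_at_risk)                    # cumulative hazard estimate
round(H, 3)   # 0.200 0.450 0.450 0.950 0.950
```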
Then we can continue as usual, by doing the set-up run and adapting
the arguments method
and predictorMatrix
:
micesurv0 <- mice(EP16dat3, maxit = 0)
micesurvmeth <- micesurv0$meth
micesurvpred <- micesurv0$pred
We can set the imputation method for platelet to "norm", since this variable has a relatively symmetric distribution, and exclude time and status as predictors from the imputation models:
micesurvmeth[c("platelet")] <- "norm"
micesurvpred[, c("time", "status")] <- 0
We can now check our settings:
# check the method and predictorMatrix
micesurvmeth
## time status platelet age sex chol stage event na
## "" "" "norm" "" "" "pmm" "polr" "" ""
micesurvpred
## time status platelet age sex chol stage event na
## time 0 0 1 1 1 1 1 1 1
## status 0 0 1 1 1 1 1 1 1
## platelet 0 0 0 1 1 1 1 1 1
## age 0 0 1 0 1 1 1 1 1
## sex 0 0 1 1 0 1 1 1 1
## chol 0 0 1 1 1 0 1 1 1
## stage 0 0 1 1 1 1 0 1 1
## event 0 0 1 1 1 1 1 0 1
## na 0 0 1 1 1 1 1 1 0
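The convention in this matrix is that entry [r, c] = 1 means variable c is used as a predictor in the imputation model for variable r. The exclusion we applied above can be mimicked on a small example matrix (variable names chosen for illustration):

```r
vars <- c("time", "chol", "stage")
pm <- matrix(1, nrow = 3, ncol = 3, dimnames = list(vars, vars))
diag(pm) <- 0        # a variable is never a predictor for itself
pm[, "time"] <- 0    # remove time as predictor from all imputation models
pm
```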
Run the imputation and analyse the imputed data.
Run the imputation:
micesurv <- mice(EP16dat3, method = micesurvmeth, predictorMatrix = micesurvpred,
                 maxit = 10, m = 5, printFlag = FALSE, seed = 2020)
Fit the model and pool the results:
micesurv_mira <- with(micesurv, coxph(Surv(time = time, event) ~ platelet +
                                        age + sex + chol + stage))
summary(pool(micesurv_mira), conf.int = TRUE)
## term estimate std.error statistic df p.value 2.5 % 97.5 %
## 1 platelet -0.0018713 0.0014181 -1.319531 21.367 0.2009570 -0.0048173 0.0010748
## 2 age 0.0141105 0.0120983 1.166322 65.855 0.2476907 -0.0100455 0.0382665
## 3 sexf -0.1701306 0.3538187 -0.480841 67.936 0.6321742 -0.8761767 0.5359155
## 4 chol 0.0014801 0.0004508 3.283222 13.003 0.0059347 0.0005062 0.0024539
## 5 stage.L 1.7394505 0.6912830 2.516264 71.504 0.0141062 0.3612398 3.1176612
## 6 stage.Q -0.0123688 0.5486179 -0.022545 69.517 0.9820775 -1.1066865 1.0819489
## 7 stage.C 0.4333561 0.3393589 1.276985 61.957 0.2063728 -0.2450220 1.1117342
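pool() combines the results from the five analyses using Rubin's rules. For a single coefficient, the calculation can be sketched as follows (the estimates and standard errors below are made up for illustration):

```r
est <- c(0.10, 0.12, 0.08, 0.11, 0.09)        # estimates from m = 5 imputed datasets
se  <- c(0.040, 0.042, 0.039, 0.041, 0.040)   # corresponding standard errors
m <- length(est)

qbar  <- mean(est)                 # pooled point estimate
ubar  <- mean(se^2)                # within-imputation variance
b     <- var(est)                  # between-imputation variance
t_var <- ubar + (1 + 1/m) * b      # total variance (Rubin's rules)
pooled_se <- sqrt(t_var)
c(estimate = qbar, std.error = pooled_se)
```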
Note:
For older versions of the mice package there were two warning messages:
## Warning: Unknown or uninitialised column: `df.residual`.
## Warning in pool.fitlist(getfit(object), dfcom = dfcom): Large sample assumed.
The function pool() has an argument dfcom, referring to the degrees of freedom from the complete-data analysis, which is NULL by default. In that case, mice tries to extract them from the first analysis, but this failed for a Cox PH model.
The warning message (about the unknown column
df.residual
) can just be ignored.
The second warning means that, because the degrees of freedom could not be extracted from the model, it will be assumed that this number is very large (i.e., that the data is really large compared to the number of parameters that are estimated).
You can prevent these warning messages by specifying the argument
dfcom
in the function pool()
.
In the summary of the pooled results we see that orthogonal
polynomial coding for the ordered factor stage
was used.
Hence, we cannot interpret the coefficients from this variable the way
we would interpret dummy variables.
For some model types we can overwrite the default contrasts (i.e., how categorical variables are handled) using the argument contrasts. In a linear or logistic regression model, for example, we could specify contrasts = list(stage = 'contr.treatment').
Unfortunately, this is not possible for coxph(). Here we need to change the global option that controls how contrasts are set.
The global option for contrasts looks like this:
options("contrasts")
## $contrasts
## unordered ordered
## "contr.treatment" "contr.poly"
contr.treatment
refers to dummy coding and
contr.poly
to orthogonal polynomials. We would like to also
use contr.treatment
for ordered factors.
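The difference between the two coding schemes can be inspected directly, e.g. for a four-level factor such as stage:

```r
contr.treatment(4)   # dummy coding: each column compares one level with level 1
contr.poly(4)        # orthogonal polynomials: linear, quadratic, cubic contrasts
```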
To be able to change the setting back, we first save the current options:
op <- options()
Then we can overwrite the options for the contrasts:
options(contrasts = rep('contr.treatment', 2))
options("contrasts")
## $contrasts
## [1] "contr.treatment" "contr.treatment"
If we now repeat our analysis, we will get dummy-coded effects for
stage
:
micesurv_mira2 <- with(micesurv, coxph(Surv(time = time, event) ~ platelet +
                                         age + sex + chol + stage))
summary(pool(micesurv_mira2), conf.int = TRUE)
Afterwards, we change the settings back:
options(op)
options("contrasts")
## $contrasts
## unordered ordered
## "contr.treatment" "contr.poly"
Analysis of a Cox proportional hazards model using JointAI works analogously to the function coxph() from the package survival; the corresponding function is coxph_imp().

library("JointAI")

JointAIcox <- coxph_imp(Surv(time, event) ~ platelet + age + sex + chol + stage,
                        data = EP16dat3, n.iter = 1500, seed = 2020)

traceplot(JointAIcox, use_ggplot = TRUE) +
  theme(legend.position = 'none')
The mixing of the parameters for stage
is not great. We
would have to increase the number of iterations and/or the number of
chains to get better results.
When the groups of a categorical variable are unbalanced, convergence is usually better when the largest group (or a relatively large group) is used as the reference category.
We can try this by specifying that we would like to use the fourth category of stage as the reference (via the argument refcats):

JointAIcox2 <- coxph_imp(Surv(time, event) ~ platelet + age + sex + chol + stage,
                         data = EP16dat3, n.iter = 500, seed = 2020,
                         refcats = list(stage = 4))

traceplot(JointAIcox2, use_ggplot = TRUE) +
  theme(legend.position = 'none')
For larger datasets with many different event times the Cox model implemented in JointAI can become quite slow. This is because it has to use the counting process notation which requires a loop over all event times in each iteration of the MCMC sampler.
A parametric survival model, e.g. assuming a Weibull distribution
(see Section 4.2 of the slides of EP03:
Biostatistical Methods II: Survival Analysis), is often faster. This
is implemented in the function survreg_imp()
.
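As a reminder of the parametric assumption: under a Weibull model the survival function is S(t) = exp(-(t/scale)^shape), which can be evaluated in closed form (parameter values below are hypothetical):

```r
shape <- 1.5; scale <- 10      # hypothetical Weibull parameters
t <- 5
S  <- exp(-(t / scale)^shape)  # survival probability S(t) by the formula
S2 <- pweibull(t, shape = shape, scale = scale, lower.tail = FALSE)  # same via base R
c(S, S2)
```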
We fit the Weibull model and check convergence using traceplot() and the Gelman-Rubin criterion (GR_crit()):

JointAIsurv <- survreg_imp(Surv(time, event) ~ platelet +
                             age + sex + chol + stage, data = EP16dat3,
                           n.iter = 1500, seed = 2020)

traceplot(JointAIsurv, use_ggplot = TRUE)
GR_crit(JointAIsurv, autoburnin = FALSE)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## (Intercept) 1.10 1.27
## platelet 1.01 1.02
## age 1.00 1.00
## sexf 1.02 1.06
## chol 1.00 1.01
## stage2 1.12 1.28
## stage3 1.11 1.29
## stage4 1.12 1.31
## shape_Surv_time_event 1.00 1.01
##
## Multivariate psrf
##
## 1.05
Both the traceplot and the Gelman-Rubin criterion show that the
parameters for stage
don’t converge well. There are clear
patterns visible in the plots and the upper limit of the confidence
interval of the Gelman-Rubin criterion is much larger than one.
In some cases, convergence of coefficients for categorical variables can be improved by changing the reference category. Especially when the categories are very unbalanced, convergence is better when the largest (or a large) category is chosen as the reference.
The plot of the distribution of the variables in the EP16dat3 data at the beginning of this practical shows that there are few patients with stage 1. We therefore re-fit the model with a larger category of stage as reference (using the argument refcats). Check out the help file and the vignette on Model Specification for more details on how to use the argument refcats.
We use traceplot() and GR_crit() to check whether the change in reference category improved convergence:

JointAIsurv2 <- survreg_imp(Surv(time, event) ~ platelet +
                              age + sex + chol + stage, data = EP16dat3,
                            n.iter = 1000, seed = 2020, refcats = 'largest')

traceplot(JointAIsurv2, use_ggplot = TRUE)
GR_crit(JointAIsurv2, autoburnin = FALSE)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## (Intercept) 1.03 1.11
## platelet 1.01 1.02
## age 1.00 1.00
## sexm 1.00 1.01
## chol 1.11 1.25
## stage1 1.00 1.01
## stage2 1.01 1.04
## stage3 1.02 1.08
## shape_Surv_time_event 1.00 1.01
##
## Multivariate psrf
##
## 1.05
The traceplots look a lot better and the Gelman-Rubin criterion has improved. Nevertheless, more iterations or chains would be necessary to obtain more reliable results.
summary(JointAIsurv2, start = 250)
##
## Bayesian weibull survival model fitted with JointAI
##
## Call:
## survreg_imp(formula = Surv(time, event) ~ platelet + age + sex +
## chol + stage, data = EP16dat3, n.iter = 1000, refcats = "largest",
## seed = 2020)
##
##
## Number of events: 81
##
## Posterior summary:
## Mean SD 2.5% 97.5% tail-prob. GR-crit MCE/SD
## (Intercept) 7.66797 0.187077 7.31477 8.032284 0.0000 1.04 0.0908
## platelet 0.00123 0.001288 -0.00119 0.003685 0.3478 1.02 0.0601
## age -0.01570 0.013266 -0.04240 0.010492 0.2382 1.00 0.0526
## sexm -0.05638 0.383715 -0.76021 0.743525 0.8508 1.01 0.0462
## chol -0.00121 0.000629 -0.00226 0.000179 0.0744 1.05 0.0863
## stage1 3.21866 1.342265 1.12702 6.228553 0.0000 1.01 0.1440
## stage2 1.41357 0.380650 0.73327 2.224563 0.0000 1.02 0.0842
## stage3 1.27369 0.335360 0.67164 1.964542 0.0000 1.07 0.0925
##
## Posterior summary of the shape of the Weibull distribution:
## Mean SD 2.5% 97.5% GR-crit MCE/SD
## shape_Surv_time_event 0.965 0.0925 0.783 1.15 1 0.0679
##
##
## MCMC settings:
## Iterations = 250:1100
## Sample size per chain = 851
## Thinning interval = 1
## Number of chains = 3
##
## Number of observations: 160
© Nicole Erler