In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:
mice (version 3.15.0), survival (version 3.5.5), JointAI (version 1.0.5), ggplot2 (version 3.4.2).

You can find help files for any function by adding a ?
before the name of the function.
Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html
The focus of this practical is the imputation of data that has features that require special attention.
In the interest of time, we will focus on these features and abbreviate steps that are the same as in any imputation setting (e.g., getting to know the data or checking that imputed values are realistic). Nevertheless, these steps are of course required when analysing data in practice.
For this practical, we will use the EP16dat3 dataset, which is a subset of the PBC (Primary Biliary Cirrhosis) data.
To get the EP16dat3 dataset, load the file
EP16dat3.RData. You can download it here.
To load this dataset into R, you can use the command
file.choose() which opens the explorer and allows you to
navigate to the location of the file on your computer.
If you know the path to the file, you can also use
load("<path>/EP16dat3.RData").
The variables in this subset (EP16dat3) are:
| time | number of years between inclusion and death, transplantation, or end of follow-up | 
| status | status at time (censored, transplant, dead) | 
| age | patient’s age at intake | 
| sex | patient’s sex | 
| platelet | platelet count | 
| chol | serum cholesterol | 
| stage | histologic stage of disease | 
The variables in the EP16dat3 dataset have the following
distributions:
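Such distribution plots can be created, for example, with the plot_all() function from the JointAI package (a sketch; panel layout arguments are optional):

```r
library("JointAI")
# Plot a histogram (continuous variables) or barplot (categorical variables)
# for every variable in the dataset:
plot_all(EP16dat3)
```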
The missing data pattern is
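One way to display it is the md.pattern() function from mice (a sketch; rotate.names only affects the plot labels):

```r
library("mice")
# Tabulate (and plot) the missing data pattern: one row per pattern,
# 1 = observed, 0 = missing, with counts in the margins.
md.pattern(EP16dat3, rotate.names = TRUE)
```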
We are interested in determining predictor variables for patient survival, using the following Cox proportional hazards model:

coxph(Surv(time, status %in% c('transplant', 'dead')) ~ platelet + age + sex + chol + stage)

As we have seen in the lecture, the mice package
provides the function nelsonaalen() that calculates the
Nelson-Aalen estimator, with which the cumulative hazard can be
approximated.

Calculate the Nelson-Aalen estimator and perform the imputation using
mice().

Note:
nelsonaalen() does not accept the
%in% c('transplant', 'dead') specification of the event;
hence, we need to create a new event indicator event.
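The quantity that nelsonaalen() computes can be illustrated with a small toy example (illustrative data, not part of the practical; ties are handled in the simplest way here):

```r
# Toy sketch: the Nelson-Aalen estimate of the cumulative hazard is
# H(t) = sum over event times t_i <= t of d_i / n_i, where d_i is the number
# of events and n_i the number at risk at time t_i.
time  <- c(2, 3, 3, 5, 8)
event <- c(1, 1, 0, 1, 0)     # 1 = event, 0 = censored
ord   <- order(time)
time  <- time[ord]; event <- event[ord]
n_at_risk <- length(time):1   # number still at risk at each ordered time
# the hazard increases only at event times:
H <- cumsum(ifelse(event == 1, 1 / n_at_risk, 0))
cbind(time, H)
```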
We first create the event variable, and calculate the Nelson-Aalen estimator:
library("mice")
library("survival")
EP16dat3$event <- EP16dat3$status %in% c('transplant', 'dead')
EP16dat3$na <- nelsonaalen(data = EP16dat3, timevar = "time",  statusvar = "event")

Then we can continue as usual, by doing the set-up run and adapting
the arguments method and predictorMatrix:
micesurv0 <- mice(EP16dat3, maxit = 0)
micesurvmeth <- micesurv0$meth
micesurvpred <- micesurv0$pred

We can set the imputation method for platelet to
"norm", since this variable has a relatively symmetric
distribution, and exclude time and status as predictors from the
imputation models:

micesurvmeth[c("platelet")] <- "norm"
micesurvpred[, c("time", "status")] <- 0

We can now check our settings:
# check the method and predictorMatrix
micesurvmeth

##     time   status platelet      age      sex     chol    stage    event       na 
##       ""       ""   "norm"       ""       ""    "pmm"   "polr"       ""       ""

micesurvpred

##          time status platelet age sex chol stage event na
## time        0      0        1   1   1    1     1     1  1
## status      0      0        1   1   1    1     1     1  1
## platelet    0      0        0   1   1    1     1     1  1
## age         0      0        1   0   1    1     1     1  1
## sex         0      0        1   1   0    1     1     1  1
## chol        0      0        1   1   1    0     1     1  1
## stage       0      0        1   1   1    1     0     1  1
## event       0      0        1   1   1    1     1     0  1
## na          0      0        1   1   1    1     1     1  0

Run the imputation and analyse the imputed data.
Run the imputation:
micesurv <- mice(EP16dat3, method = micesurvmeth, predictorMatrix = micesurvpred,
                 maxit = 10, m = 5, printFlag = FALSE, seed = 2020)

Fit the model and pool the results:
micesurv_mira <- with(micesurv, coxph(Surv(time = time, event) ~ platelet +
                                        age + sex + chol + stage))
summary(pool(micesurv_mira), conf.int = TRUE)

##       term   estimate std.error statistic     df   p.value      2.5 %    97.5 %
## 1 platelet -0.0018713 0.0014181 -1.319531 21.367 0.2009570 -0.0048173 0.0010748
## 2      age  0.0141105 0.0120983  1.166322 65.855 0.2476907 -0.0100455 0.0382665
## 3     sexf -0.1701306 0.3538187 -0.480841 67.936 0.6321742 -0.8761767 0.5359155
## 4     chol  0.0014801 0.0004508  3.283222 13.003 0.0059347  0.0005062 0.0024539
## 5  stage.L  1.7394505 0.6912830  2.516264 71.504 0.0141062  0.3612398 3.1176612
## 6  stage.Q -0.0123688 0.5486179 -0.022545 69.517 0.9820775 -1.1066865 1.0819489
## 7  stage.C  0.4333561 0.3393589  1.276985 61.957 0.2063728 -0.2450220 1.1117342

Note:
For older versions of the mice package there were two warning messages:

## Warning: Unknown or uninitialised column: `df.residual`.
## Warning in pool.fitlist(getfit(object), dfcom = dfcom): Large sample assumed.

The function pool() has an argument dfcom,
referring to the degrees of freedom from the complete data analysis,
which is NULL by default. In that case, mice will try to
extract them from the first analysis. But this failed for a Cox PH
model.
The warning message (about the unknown column
df.residual) can just be ignored.
The second warning means that, because the degrees of freedom could not be extracted from the model, they are assumed to be very large (i.e., that the sample size is very large compared to the number of parameters being estimated).
You can prevent these warning messages by specifying the argument
dfcom in the function pool().
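For example (a sketch using the fitted object from above; the value of dfcom shown is only an illustrative assumption, e.g. the number of observations minus the number of estimated parameters):

```r
# Sketch: pass the complete-data degrees of freedom explicitly to pool().
# 152 is an illustrative value only; choose it based on your own analysis.
summary(pool(micesurv_mira, dfcom = 152), conf.int = TRUE)
```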
In the summary of the pooled results we see that orthogonal
polynomial coding for the ordered factor stage was used.
Hence, we cannot interpret the coefficients from this variable the way
we would interpret dummy variables.
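To see what these contrasts look like for a factor with four levels (like stage), you can inspect the contrast matrix:

```r
# Contrast matrix for a 4-level ordered factor: the columns are the
# linear (.L), quadratic (.Q) and cubic (.C) contrasts used in the model.
contr.poly(4)
```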
For some model types we can overwrite the default contrasts (i.e.,
how categorical variables are handled) using an argument
contrasts. In a linear or logistic regression model, for
example, we could specify
contrasts = list(stage = 'contr.treatment').
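For instance, in a purely illustrative linear model (not part of this analysis):

```r
# Hypothetical example: dummy coding for the ordered factor 'stage' via the
# 'contrasts' argument of lm(); rows with missing values are dropped by default.
lm(platelet ~ stage, data = EP16dat3,
   contrasts = list(stage = 'contr.treatment'))
```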
Unfortunately, this is not possible for coxph(). Here we
need to change the global option that controls how contrasts are set.
The global option for contrasts looks like this:
options("contrasts")

## $contrasts
##         unordered           ordered 
## "contr.treatment"      "contr.poly"

contr.treatment refers to dummy coding and
contr.poly to orthogonal polynomials. We would like to also
use contr.treatment for ordered factors.
To be able to change the setting back, we first save the current options:
op <- options()

Then we can overwrite the options for the contrasts:
options(contrasts = rep('contr.treatment', 2))
options("contrasts")

## $contrasts
## [1] "contr.treatment" "contr.treatment"

If we now repeat our analysis, we will get dummy-coded effects for
stage:
micesurv_mira2 <- with(micesurv, coxph(Surv(time = time, event) ~ platelet +
                                        age + sex + chol + stage))
summary(pool(micesurv_mira2), conf.int = TRUE)

Afterwards, we change the settings back:
options(op)
options("contrasts")

## $contrasts
##         unordered           ordered 
## "contr.treatment"      "contr.poly"

Analysis of a Cox proportional hazards model using
JointAI works analogously to the function
coxph() from the survival package.

Fit the Cox model using coxph_imp() (and 500 iterations).

library("JointAI")
JointAIcox <- coxph_imp(Surv(time, event) ~ platelet + age + sex + chol + stage,
                        data = EP16dat3, n.iter = 1500, seed = 2020)

traceplot(JointAIcox, use_ggplot = TRUE) + 
  theme(legend.position = 'none')

The mixing of the parameters for stage is not great. We
would have to increase the number of iterations and/or the number of
chains to get better results.
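A sketch of such a re-run (not executed here because it is slow; n.chains is assumed to be available as in other JointAI model functions, and the values 3000 and 5 are illustrative choices):

```r
# Sketch: more iterations and additional MCMC chains for better mixing.
JointAIcox_long <- coxph_imp(Surv(time, event) ~ platelet + age + sex + chol + stage,
                             data = EP16dat3, n.iter = 3000, n.chains = 5,
                             seed = 2020)
```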
When the groups of a categorical variable are unbalanced, convergence is usually better when the largest group (or a relatively large group) is used as the reference category.
We can try this by specifying stage 4, the largest group, as the
reference category:

JointAIcox2 <- coxph_imp(Surv(time, event) ~ platelet + age + sex + chol + stage,
                         data = EP16dat3, n.iter = 500, seed = 2020,
                         refcats = list(stage = 4))

traceplot(JointAIcox2, use_ggplot = TRUE) + 
  theme(legend.position = 'none')

For larger datasets with many different event times, the Cox model implemented in JointAI can become quite slow. This is because it has to use the counting process notation, which requires a loop over all event times in each iteration of the MCMC sampler.
A parametric survival model, e.g. assuming a Weibull distribution
(see Section 4.2 of the slides of EP03:
Biostatistical Methods II: Survival Analysis), is often faster. This
is implemented in the function survreg_imp().
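As a quick reminder of what the Weibull shape parameter means (a generic sketch, independent of JointAI's internal parameterisation):

```r
# Weibull hazard h(t) = (shape/scale) * (t/scale)^(shape - 1):
# shape > 1: increasing hazard; shape < 1: decreasing; shape = 1: exponential.
weibull_hazard <- function(t, shape, scale = 1) {
  (shape / scale) * (t / scale)^(shape - 1)
}
round(weibull_hazard(c(1, 5, 10), shape = 0.97), 3)
```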
Fit a Weibull survival model using survreg_imp() and check
convergence using traceplot() and the
Gelman-Rubin criterion (GR_crit()).

JointAIsurv <- survreg_imp(Surv(time, event) ~ platelet + 
                             age + sex + chol + stage, data = EP16dat3,
                           n.iter = 1500, seed = 2020)

traceplot(JointAIsurv, use_ggplot = TRUE)

GR_crit(JointAIsurv, autoburnin = FALSE)

## Potential scale reduction factors:
## 
##                       Point est. Upper C.I.
## (Intercept)                 1.10       1.27
## platelet                    1.01       1.02
## age                         1.00       1.00
## sexf                        1.02       1.06
## chol                        1.00       1.01
## stage2                      1.12       1.28
## stage3                      1.11       1.29
## stage4                      1.12       1.31
## shape_Surv_time_event       1.00       1.01
## 
## Multivariate psrf
## 
## 1.05

Both the traceplot and the Gelman-Rubin criterion show that the
parameters for stage don’t converge well. There are clear
patterns visible in the plots and the upper limit of the confidence
interval of the Gelman-Rubin criterion is much larger than one.
In some cases, convergence of coefficients for categorical variables can be improved by changing the reference category. Especially when the categories are very unbalanced, convergence is better when the largest (or a large) category is chosen as the reference.
The plot of the distribution of the variables in the
EP16dat3 data at the beginning of this practical shows that
there are few patients with stage 1.
Re-fit the model, using the largest category of
stage as reference (using the argument
refcats). Check out the help
file and the vignette on Model
Specification for more details on how to use the argument
refcats.
Use traceplot() and GR_crit()
to see if the change in reference category improved convergence.

JointAIsurv2 <- survreg_imp(Surv(time, event) ~ platelet + 
                              age + sex + chol + stage, data = EP16dat3,
                            n.iter = 1000, seed = 2020, refcats = 'largest')

traceplot(JointAIsurv2, use_ggplot = TRUE)

GR_crit(JointAIsurv2, autoburnin = FALSE)

## Potential scale reduction factors:
## 
##                       Point est. Upper C.I.
## (Intercept)                 1.03       1.11
## platelet                    1.01       1.02
## age                         1.00       1.00
## sexm                        1.00       1.01
## chol                        1.11       1.25
## stage1                      1.00       1.01
## stage2                      1.01       1.04
## stage3                      1.02       1.08
## shape_Surv_time_event       1.00       1.01
## 
## Multivariate psrf
## 
## 1.05

The traceplots look a lot better and the Gelman-Rubin criterion has improved. Nevertheless, more iterations or chains would be necessary to obtain more reliable results.

summary(JointAIsurv2, start = 250)

## 
## Bayesian weibull survival model fitted with JointAI
## 
## Call:
## survreg_imp(formula = Surv(time, event) ~ platelet + age + sex + 
##     chol + stage, data = EP16dat3, n.iter = 1000, refcats = "largest", 
##     seed = 2020)
## 
## 
## Number of events: 81 
## 
## Posterior summary:
##                 Mean       SD     2.5%    97.5% tail-prob. GR-crit MCE/SD
## (Intercept)  7.66797 0.187077  7.31477 8.032284     0.0000    1.04 0.0908
## platelet     0.00123 0.001288 -0.00119 0.003685     0.3478    1.02 0.0601
## age         -0.01570 0.013266 -0.04240 0.010492     0.2382    1.00 0.0526
## sexm        -0.05638 0.383715 -0.76021 0.743525     0.8508    1.01 0.0462
## chol        -0.00121 0.000629 -0.00226 0.000179     0.0744    1.05 0.0863
## stage1       3.21866 1.342265  1.12702 6.228553     0.0000    1.01 0.1440
## stage2       1.41357 0.380650  0.73327 2.224563     0.0000    1.02 0.0842
## stage3       1.27369 0.335360  0.67164 1.964542     0.0000    1.07 0.0925
## 
## Posterior summary of the shape of the Weibull distribution:
##                        Mean     SD  2.5% 97.5% GR-crit MCE/SD
## shape_Surv_time_event 0.965 0.0925 0.783  1.15       1 0.0679
## 
## 
## MCMC settings:
## Iterations = 250:1100
## Sample size per chain = 851 
## Thinning interval = 1 
## Number of chains = 3 
## 
## Number of observations: 160

© Nicole Erler