Preface

R packages

In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:

  • R version 4.3.0 (2023-04-21 ucrt)
  • mice (version: 3.15.0)
  • survival (version: 3.5.5)
  • JointAI (version: 1.0.5)
  • ggplot2 (version: 3.4.2)

Help files

You can find help files for any function by adding a ? before the name of the function.

Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html
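For example, to open the help file for the mice() function:

```r
# open the help file for the mice() function
?mice
# equivalently:
help("mice")
```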

Aim

The focus of this practical is the imputation of data with features that require special attention.

In the interest of time, we will focus on these features and abbreviate steps that are the same as in any imputation setting (e.g., getting to know the data or checking that imputed values are realistic). Nevertheless, these steps are of course required when analysing data in practice.

Data & Model of interest

For this practical, we will use the EP16dat3 dataset, which is a subset of the PBC (Primary Biliary Cirrhosis) data.

To get the EP16dat3 dataset, load the file EP16dat3.RData. You can download it here.

To load this dataset into R, you can use the command load(file.choose()), which opens a file browser and allows you to navigate to the location of the file on your computer.

If you know the path to the file, you can also use load("<path>/EP16dat3.RData").

The variables contained in this subset (EP16dat3) are:
  • time: number of years between inclusion and death, transplantation, or end of follow-up
  • status: status at time (censored, transplant, dead)
  • age: patient’s age at intake
  • sex: patient’s sex
  • platelet: platelet count
  • chol: serum cholesterol
  • stage: histologic stage of disease

The variables in the EP16dat3 dataset have the following distributions:

The missing data pattern is

We are interested in determining predictor variables for patient survival, using the following Cox proportional hazards model:

coxph(Surv(time, status %in% c('transplant', 'dead')) ~ platelet + age + sex + chol + stage)

Imputation with mice

As we have seen in the lecture, the mice package provides the function nelsonaalen() that calculates the Nelson-Aalen estimator, with which the cumulative hazard can be approximated.

Task 1

  • Calculate the Nelson-Aalen estimate for patient survival in the EP16dat3 data and
  • perform the usual setup steps for imputation using mice().

Note:

nelsonaalen() does not accept the %in% c('transplant', 'dead') specification of the event, hence, we need to create a new event indicator event.

Solution 1

We first create the event variable, and calculate the Nelson-Aalen estimator:

library("mice")
library("survival")

EP16dat3$event <- EP16dat3$status %in% c('transplant', 'dead')
EP16dat3$na <- nelsonaalen(data = EP16dat3, timevar = "time",  statusvar = "event")

Then we can continue as usual, by doing the set-up run and adapting the arguments method and predictorMatrix:

micesurv0 <- mice(EP16dat3, maxit = 0)
micesurvmeth <- micesurv0$meth
micesurvpred <- micesurv0$pred

We can set the imputation method for platelet to "norm", since this variable has a relatively symmetric distribution, and exclude time and status as predictors from the imputation models:

micesurvmeth[c("platelet")] <- "norm"
micesurvpred[, c("time", "status")] <- 0

We can now check our settings:

# check the method and predictorMatrix
micesurvmeth
##     time   status platelet      age      sex     chol    stage    event       na 
##       ""       ""   "norm"       ""       ""    "pmm"   "polr"       ""       ""
micesurvpred
##          time status platelet age sex chol stage event na
## time        0      0        1   1   1    1     1     1  1
## status      0      0        1   1   1    1     1     1  1
## platelet    0      0        0   1   1    1     1     1  1
## age         0      0        1   0   1    1     1     1  1
## sex         0      0        1   1   0    1     1     1  1
## chol        0      0        1   1   1    0     1     1  1
## stage       0      0        1   1   1    1     0     1  1
## event       0      0        1   1   1    1     1     0  1
## na          0      0        1   1   1    1     1     1  0

Task 2

Run the imputation and analyse the imputed data.

Solution 2

Run the imputation:

micesurv <- mice(EP16dat3, method = micesurvmeth, predictorMatrix = micesurvpred,
                 maxit = 10, m = 5, printFlag = FALSE, seed = 2020)

Fit the model and pool the results:

micesurv_mira <- with(micesurv, coxph(Surv(time = time, event) ~ platelet +
                                        age + sex + chol + stage))

summary(pool(micesurv_mira), conf.int = TRUE)
##       term   estimate std.error statistic     df   p.value      2.5 %    97.5 %
## 1 platelet -0.0018713 0.0014181 -1.319531 21.367 0.2009570 -0.0048173 0.0010748
## 2      age  0.0141105 0.0120983  1.166322 65.855 0.2476907 -0.0100455 0.0382665
## 3     sexf -0.1701306 0.3538187 -0.480841 67.936 0.6321742 -0.8761767 0.5359155
## 4     chol  0.0014801 0.0004508  3.283222 13.003 0.0059347  0.0005062 0.0024539
## 5  stage.L  1.7394505 0.6912830  2.516264 71.504 0.0141062  0.3612398 3.1176612
## 6  stage.Q -0.0123688 0.5486179 -0.022545 69.517 0.9820775 -1.1066865 1.0819489
## 7  stage.C  0.4333561 0.3393589  1.276985 61.957 0.2063728 -0.2450220 1.1117342

Note:

For older versions of the mice package there were two warning messages:

## Warning: Unknown or uninitialised column: `df.residual`.
## Warning in pool.fitlist(getfit(object), dfcom = dfcom): Large sample assumed.

The function pool() has an argument dfcom, referring to the degrees of freedom from the complete data analysis, which is NULL by default. In that case, mice tries to extract them from the first analysis, but this fails for a Cox PH model.

The warning message (about the unknown column df.residual) can just be ignored.

The second warning means that, because the degrees of freedom could not be extracted from the model, it will be assumed that this number is very large (i.e., that the data is really large compared to the number of parameters that are estimated).

You can prevent these warning messages by specifying the argument dfcom in the function pool().
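As a sketch (the value passed to dfcom here is an assumption: it should be the complete-data degrees of freedom, e.g., the number of observations minus the number of regression parameters, which you should adjust to your data):

```r
# sketch: supply the complete-data degrees of freedom manually
# (here: number of rows minus the 7 regression parameters; adjust as needed)
summary(pool(micesurv_mira, dfcom = nrow(EP16dat3) - 7), conf.int = TRUE)
```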

About Contrasts

In the summary of the pooled results we see that orthogonal polynomial coding was used for the ordered factor stage. Hence, we cannot interpret the coefficients of this variable the way we would interpret dummy-coded ones.

For some model types we can overwrite the default contrasts (i.e., how categorical variables are handled) using an argument contrasts. In a linear or logistic regression model, for example, we could specify contrasts = list(stage = 'contr.treatment').
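As a sketch (the model formula here is just an illustration, not our analysis model):

```r
# sketch: dummy (treatment) coding for the ordered factor stage
# in a linear regression model
lm(platelet ~ age + stage, data = EP16dat3,
   contrasts = list(stage = 'contr.treatment'))
```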

Unfortunately, this is not possible for coxph(). Here we need to change the global option that controls how contrasts are set.

The global option for contrasts looks like this:

options("contrasts")
## $contrasts
##         unordered           ordered 
## "contr.treatment"      "contr.poly"

contr.treatment refers to dummy coding and contr.poly to orthogonal polynomials. We would like to also use contr.treatment for ordered factors.

To be able to change the setting back, we first save the current options:

op <- options()

Then we can overwrite the options for the contrasts:

options(contrasts = rep('contr.treatment', 2))
options("contrasts")
## $contrasts
## [1] "contr.treatment" "contr.treatment"

If we now repeat our analysis, we will get dummy-coded effects for stage:

micesurv_mira2 <- with(micesurv, coxph(Surv(time = time, event) ~ platelet +
                                        age + sex + chol + stage))

summary(pool(micesurv_mira2), conf.int = TRUE)

Afterwards, we change the settings back:

options(op)
options("contrasts")
## $contrasts
##         unordered           ordered 
## "contr.treatment"      "contr.poly"

Imputation with JointAI

Analysis of a Cox proportional hazards model using JointAI works analogously to the function coxph() from the package survival.

Task 1

  • Fit the Cox model given above using the function coxph_imp() (and 500 iterations).
  • Check the traceplot to confirm convergence.

Solution 1

library("JointAI")
JointAIcox <- coxph_imp(Surv(time, event) ~ platelet + age + sex + chol + stage,
                        data = EP16dat3, n.iter = 1500, seed = 2020)
traceplot(JointAIcox, use_ggplot = TRUE) + 
  theme(legend.position = 'none')

The mixing of the parameters for stage is not great. We would have to increase the number of iterations and/or the number of chains to get better results.

Task 2

When the groups of a categorical variable are unbalanced, convergence is usually better when the largest group (or a relatively large group) is used as the reference category.

We can try this by specifying, via the argument refcats, that we would like to use stage 4 as the reference category:

JointAIcox2 <- coxph_imp(Surv(time, event) ~ platelet + age + sex + chol + stage,
                         data = EP16dat3, n.iter = 500, seed = 2020,
                         refcats = list(stage = 4))
traceplot(JointAIcox2, use_ggplot = TRUE) + 
  theme(legend.position = 'none')

Additional exercise JointAI

For larger datasets with many different event times, the Cox model implemented in JointAI can become quite slow. This is because it has to use the counting-process notation, which requires a loop over all event times in each iteration of the MCMC sampler.

A parametric survival model, e.g. assuming a Weibull distribution (see Section 4.2 of the slides of EP03: Biostatistical Methods II: Survival Analysis), is often faster. This is implemented in the function survreg_imp().

Task 1

  • Fit a Weibull survival model using the model structure of the Cox model.
  • Investigate convergence using the traceplot() and the Gelman-Rubin criterion (GR_crit()).

Solution 1

JointAIsurv <- survreg_imp(Surv(time, event) ~ platelet + 
                             age + sex + chol + stage, data = EP16dat3,
                           n.iter = 1500, seed = 2020)
traceplot(JointAIsurv, use_ggplot = TRUE)

GR_crit(JointAIsurv, autoburnin = FALSE)
## Potential scale reduction factors:
## 
##                       Point est. Upper C.I.
## (Intercept)                 1.10       1.27
## platelet                    1.01       1.02
## age                         1.00       1.00
## sexf                        1.02       1.06
## chol                        1.00       1.01
## stage2                      1.12       1.28
## stage3                      1.11       1.29
## stage4                      1.12       1.31
## shape_Surv_time_event       1.00       1.01
## 
## Multivariate psrf
## 
## 1.05

Both the traceplot and the Gelman-Rubin criterion show that the parameters for stage don’t converge well. There are clear patterns visible in the plots and the upper limit of the confidence interval of the Gelman-Rubin criterion is much larger than one.

Task 2

In some cases, convergence of coefficients for categorical variables can be improved by changing the reference category. Especially when the categories are very unbalanced, convergence is better when the largest (or a large) category is chosen as the reference.

The plot of the distribution of the variables in the EP16dat3 data at the beginning of this practical shows that there are few patients with stage 1.

  • Re-run the Weibull survival model with the most frequent category of stage as reference (using the argument refcats).

Check out the help file and the vignette on Model Specification for more details on how to use the argument refcats.

  • Then look at the traceplot() and GR_crit() to see if the change in reference category improved convergence.

Solution 2

JointAIsurv2 <- survreg_imp(Surv(time, event) ~ platelet + 
                              age + sex + chol + stage, data = EP16dat3,
                            n.iter = 1000, seed = 2020, refcats = 'largest')
traceplot(JointAIsurv2, use_ggplot = TRUE)

GR_crit(JointAIsurv2, autoburnin = FALSE)
## Potential scale reduction factors:
## 
##                       Point est. Upper C.I.
## (Intercept)                 1.03       1.11
## platelet                    1.01       1.02
## age                         1.00       1.00
## sexm                        1.00       1.01
## chol                        1.11       1.25
## stage1                      1.00       1.01
## stage2                      1.01       1.04
## stage3                      1.02       1.08
## shape_Surv_time_event       1.00       1.01
## 
## Multivariate psrf
## 
## 1.05

The traceplots look a lot better and the Gelman-Rubin criterion has improved. Nevertheless, more iterations or chains would be necessary to obtain more reliable results.

summary(JointAIsurv2, start = 250)
## 
## Bayesian weibull survival model fitted with JointAI
## 
## Call:
## survreg_imp(formula = Surv(time, event) ~ platelet + age + sex + 
##     chol + stage, data = EP16dat3, n.iter = 1000, refcats = "largest", 
##     seed = 2020)
## 
## 
## Number of events: 81 
## 
## Posterior summary:
##                 Mean       SD     2.5%    97.5% tail-prob. GR-crit MCE/SD
## (Intercept)  7.66797 0.187077  7.31477 8.032284     0.0000    1.04 0.0908
## platelet     0.00123 0.001288 -0.00119 0.003685     0.3478    1.02 0.0601
## age         -0.01570 0.013266 -0.04240 0.010492     0.2382    1.00 0.0526
## sexm        -0.05638 0.383715 -0.76021 0.743525     0.8508    1.01 0.0462
## chol        -0.00121 0.000629 -0.00226 0.000179     0.0744    1.05 0.0863
## stage1       3.21866 1.342265  1.12702 6.228553     0.0000    1.01 0.1440
## stage2       1.41357 0.380650  0.73327 2.224563     0.0000    1.02 0.0842
## stage3       1.27369 0.335360  0.67164 1.964542     0.0000    1.07 0.0925
## 
## Posterior summary of the shape of the Weibull distribution:
##                        Mean     SD  2.5% 97.5% GR-crit MCE/SD
## shape_Surv_time_event 0.965 0.0925 0.783  1.15       1 0.0679
## 
## 
## MCMC settings:
## Iterations = 250:1100
## Sample size per chain = 851 
## Thinning interval = 1 
## Number of chains = 3 
## 
## Number of observations: 160
 

© Nicole Erler