Working with Multiply Imputed Data in Different Formats

Preface

R packages

In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:

R version 4.3.0 (2023-04-21 ucrt)
mice (version: 3.15.0)
mitools (version: 2.4)
miceadds (version: 3.16.18)
plyr (version: 1.8.8)

Help files

You can find help files for any function by adding a ? before the name of the function.

Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html

Dataset

For this practical, we will use the EP16dat1 dataset, which is a subset of the NHANES (National Health and Nutrition Examination Survey) data.

To get the EP16dat1 dataset, load the file EP16dat1.RData. You can download it here.

To load this dataset into R, you can use the command file.choose() which opens the explorer and allows you to navigate to the location of the file on your computer.

If you know the path to the file, you can also use load("<path>/EP16dat1.RData").

If you have not followed the first practical or if you re-loaded the EP16dat1 data, you need to re-code the variable educ again:

EP16dat1$educ <- as.ordered(EP16dat1$educ)

Imputed data

The imputed data are stored in a mids object called imp that we created in the previous practical.

You can load it into your workspace by clicking the object imps.RData if you are using RStudio. Alternatively, you can load this workspace using load("<path>/imps.RData").

Aim of this practical

Working with multiply imputed data can get a lot more complex, for example, when the research question cannot be answered by just one regression model, or when the data need some further processing after imputation.

In the lecture, we have seen that the mice package allows post-processing of variables, but for more complex calculations it is often more convenient to perform them after imputation.

When data have been imputed with a different R package, but you want to use the convenient functions of the mice package, the imputed data needs to be converted to mids or mira objects first.

The focus of this practical is on functions to convert multiply imputed data between different formats. An overview of possible workflows and the necessary functions is given in this flow diagram:

Adding a filter variable after imputation with mice

Assume that for a secondary research question we want to analyse a subset of the imputed data: subjects without hypertension.

We define subjects as hypertensive if they have

systolic blood pressure above 140, or
diastolic blood pressure above 90, or
take medication for hypertension.

How can we select the correct cases for the analysis?

The standard functions for regression models (e.g. lm, glm, lme, …) have an argument subset that allows the specification of an expression that defines the subset, i.e. a “filter”. The expression should return a logical value (TRUE or FALSE), and only cases for which the value is TRUE are included in the analysis.

When the criterion consists of multiple variables it can be useful to pre-calculate it, so that we can easily check if our filter does what it is supposed to.

Task 1

Since the imputed data is a mids object we first need to convert it to a data.frame. This can be done with the function complete() from the mice package.

Convert imp to a long-format data.frame in which the original data and all imputed datasets are stacked onto each other.

Solution 1

library("mice")
imp_long <- complete(imp, action = 'long', include = TRUE)

Some packages may not return the imputed data in long format but as a list of data frames. In that case you can use the function datalist2mids from the R package miceadds to create a mids object.

Alternatively, you can use do.call(rbind, <list of data.frames>) to convert the list of data frames into one long format data frame to then use as.mids(). You need to make sure that

the list contains the original data
there are variables identifying the imputation number and subject id in each data.frame

Task 2

Calculate a filter variable hypten_filter that separates cases with hypertension (SBP > 140 or DBP > 90 or HyperMed == 'yes') from the other cases. Cases where HyperMed is missing should be classified based on SBP and DBP only.
Check that the filter does what it is supposed to, for example with the help of a table.

Take care of the missing values in HyperMed. HyperMed == "yes" will return NA if HyperMed is NA.

Solution 2

imp_long$hypten_filter <- with(imp_long,
                               factor(
                                 ifelse(SBP <= 140 & DBP <= 90 & (HyperMed != "yes" | is.na(HyperMed)),
                                        "no", "yes")))

We can check that the filter variable does what it is supposed to by creating a table:

plyr::ddply(imp_long, c('HyperMed', "SBP > 140", "DBP > 90"), plyr::summarize,
            N = length(hypten_filter),
            'yes' = sum(hypten_filter == 'yes'),
            'no' = sum(hypten_filter == 'no'))

##    HyperMed SBP > 140 DBP > 90    N yes   no
## 1        no     FALSE    FALSE  117   0  117
## 2        no     FALSE     TRUE    6   6    0
## 3        no      TRUE    FALSE   12  12    0
## 4        no      TRUE     TRUE   14  14    0
## 5        no        NA       NA    1  NA   NA
## 6  previous     FALSE    FALSE   82   0   82
## 7  previous     FALSE     TRUE    6   6    0
## 8  previous      TRUE    FALSE   18  18    0
## 9  previous      TRUE     TRUE   12  12    0
## 10 previous        NA       NA    2  NA   NA
## 11      yes     FALSE    FALSE  562 562    0
## 12      yes     FALSE     TRUE    9   9    0
## 13      yes      TRUE    FALSE  202 202    0
## 14      yes      TRUE     TRUE   62  62    0
## 15      yes        NA       NA   17  17    0
## 16     <NA>     FALSE    FALSE 4385   0 4385
## 17     <NA>     FALSE     TRUE   69  69    0
## 18     <NA>      TRUE    FALSE  331 331    0
## 19     <NA>      TRUE     TRUE   54  54    0
## 20     <NA>        NA       NA   39  NA   NA

Note:

Subjects for whom SBP or DBP is imputed may be classified to have hypertension in some sets but not in other, so that the number of subjects in the analysis of the subset will differ between imputed datasets. This may complicate subsequent analyses.

with(imp_long, table(hypten_filter, .imp, exclude = NULL))

##              .imp
## hypten_filter   0   1   2   3   4   5
##          no   734 769 770 771 770 770
##          yes  224 231 230 229 230 230
##          <NA>  42   0   0   0   0   0

Task 3

Convert the data back to a mids object using the function as.mids() from the mice package.
Analyse and pool the data using the logistic regression with model formula DM ~ gender + race + age + chol + SBP + WC, restricted to cases without hypertension.

Use subset = hypten_filter == "no" to select cases without hypertension.

Solution 3

imp_mids <- as.mids(imp_long)

subres <- with(imp_mids,
               glm(DM ~ gender + race + age + chol + SBP + WC,
                   family = binomial(),
                   subset = hypten_filter == 'no'))

summary(pool(subres), conf.int = TRUE)

##                      term   estimate std.error statistic      df    p.value       2.5 %    97.5 %
## 1             (Intercept) -12.207041 1.9496057  -6.26129 640.833 6.9959e-10 -16.0354289 -8.378654
## 2            genderfemale   0.267147 0.2878060   0.92822 597.845 3.5367e-01  -0.2980867  0.832381
## 3      raceOther Hispanic  -0.341800 0.4782119  -0.71475 740.202 4.7499e-01  -1.2806135  0.597013
## 4  raceNon-Hispanic White  -1.070861 0.4001554  -2.67611 619.034 7.6453e-03  -1.8566876 -0.285034
## 5  raceNon-Hispanic Black  -0.382342 0.4185385  -0.91352 713.969 3.6128e-01  -1.2040559  0.439371
## 6               raceother  -0.726813 0.4888762  -1.48670 720.273 1.3753e-01  -1.6866060  0.232979
## 7                     age   0.044300 0.0095845   4.62204 673.368 4.5551e-06   0.0254810  0.063119
## 8                    chol   0.023963 0.1534679   0.15614  56.666 8.7647e-01  -0.2833905  0.331316
## 9                     SBP   0.020590 0.0136697   1.50625 191.241 1.3365e-01  -0.0063728  0.047553
## 10                     WC   0.058242 0.0089357   6.51788 306.792 2.9259e-10   0.0406589  0.075825

If you want to check if the analysis was indeed performed on the subset only, you can check the length of the fitted values from each analysis:

sapply(subres$analyses, function(x) length(x$fitted.values))

## [1] 769 770 771 770 770

Another way of pooling

The function pool() from the mice package is not the only function that performs pooling using Rubin’s rules. The package mitools also provides this functionality.

The diagram above shows how to obtain pooled results using mitools.

Task 1

Split the data by imputation number using the function split().
Convert the resulting list to an object of class imputationList. The imputationList should only contain the imputed data, not the original data.

To exclude an element from a list, say the first element, use <name of the list object>[-1].

Solution 1

imp_list <- imputationList(split(imp_long, imp_long$.imp)[-1])

Task 2

Analyse each of the imputed datasets in imp_list using with().
Pool the results using MIcombine() and obtain the summary() of the pooled results.

Solution 2

reslist <- with(imp_list, glm(DM ~ gender + race + age + chol + SBP + WC,
                   family = binomial(),
                   subset = hypten_filter == 'no'))

summary(MIcombine(reslist))

## Multiple imputation results:
##       with(imp_list, glm(DM ~ gender + race + age + chol + SBP + WC, 
##     family = binomial(), subset = hypten_filter == "no"))
##       MIcombine.default(reslist)
##                           results        se      (lower    upper) missInfo
## (Intercept)            -12.207041 1.9496057 -16.0291279 -8.384955      3 %
## genderfemale             0.267147 0.2878060  -0.2971500  0.831444      4 %
## raceOther Hispanic      -0.341800 0.4782119  -1.2790994  0.595499      1 %
## raceNon-Hispanic White  -1.070861 0.4001554  -1.8553897 -0.286332      3 %
## raceNon-Hispanic Black  -0.382342 0.4185385  -1.2027215  0.438037      2 %
## raceother               -0.726813 0.4888762  -1.6850497  0.231423      1 %
## age                      0.044300 0.0095845   0.0255118  0.063089      2 %
## chol                     0.023963 0.1534679  -0.2827212  0.330647     27 %
## SBP                      0.020590 0.0136697  -0.0063235  0.047503     13 %
## WC                       0.058242 0.0089357   0.0406897  0.075794      9 %

Citing R Packages

When you use R for the analysis in a scientific publication, it is good practice to properly reference the software and packages that you used, including the version numbers.

Moreover, package developers spent a lot of time providing you with the software, and they deserve getting some credit for their work.

The function citation() will tell you how you should cite R or specific packages.

For example:

citation()

## To cite R in publications use:
## 
##   R Core Team (2023). _R: A Language and Environment for Statistical Computing_. R
##   Foundation for Statistical Computing, Vienna, Austria. <https://www.R-project.org/>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {R: A Language and Environment for Statistical Computing},
##     author = {{R Core Team}},
##     organization = {R Foundation for Statistical Computing},
##     address = {Vienna, Austria},
##     year = {2023},
##     url = {https://www.R-project.org/},
##   }
## 
## We have invested a lot of time and effort in creating R, please cite it when using it for
## data analysis. See also 'citation("pkgname")' for citing R packages.

citation("mice")

## To cite mice in publications use:
## 
##   Stef van Buuren, Karin Groothuis-Oudshoorn (2011). mice: Multivariate Imputation by
##   Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. DOI
##   10.18637/jss.v045.i03.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {{mice}: Multivariate Imputation by Chained Equations in R},
##     author = {Stef {van Buuren} and Karin Groothuis-Oudshoorn},
##     journal = {Journal of Statistical Software},
##     year = {2011},
##     volume = {45},
##     number = {3},
##     pages = {1-67},
##     doi = {10.18637/jss.v045.i03},
##   }

To obtain the version numbers you can use:

R.version.string

## [1] "R version 4.3.0 (2023-04-21 ucrt)"

packageVersion("mice")

## [1] '3.15.0'

Because software changes over time it is important to know with which version an analysis was performed in order to be able to reproduce the results.