In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:
mice
(version: 3.15.0)mitools
(version: 2.4)miceadds
(version: 3.16.18)plyr
(version: 1.8.8)You can find help files for any function by adding a ?
before the name of the function.
Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html
For this practical, we will use the EP16dat1 dataset, which is a subset of the NHANES (National Health and Nutrition Examination Survey) data.
To get the EP16dat1 dataset, load the file
EP16dat1.RData
. You can download it here.
To load this dataset into R, you can use the command
file.choose()
which opens the explorer and allows you to
navigate to the location of the file on your computer.
If you know the path to the file, you can also use
load("<path>/EP16dat1.RData")
.
If you have not followed the first practical or if you re-loaded the
EP16dat1 data, you need to re-code the variable
educ
again:
$educ <- as.ordered(EP16dat1$educ) EP16dat1
The imputed data are stored in a mids
object called
imp
that we created in the previous practical.
You can load it into your workspace by clicking the object
imps.RData
if you are using RStudio. Alternatively, you can
load this workspace using
load("<path>/imps.RData")
.
Working with multiply imputed data can get a lot more complex, for example, when the research question cannot be answered by just one regression model, or when the data need some further processing after imputation.
In the lecture, we have seen that the mice package allows post-processing of variables, but for more complex calculations it is often more convenient to perform them after imputation.
When data have been imputed with a different R package, but you want
to use the convenient functions of the mice package, the
imputed data needs to be converted to mids
or
mira
objects first.
The focus of this practical is on functions to convert multiply imputed data between different formats. An overview of possible workflows and the necessary functions is given in this flow diagram:
Assume that for a secondary research question we want to analyse a subset of the imputed data: subjects without hypertension.
We define subjects as hypertensive if they have
How can we select the correct cases for the analysis?
The standard functions for regression models (e.g. lm
,
glm
, lme
, …) have an argument
subset
that allows the specification of an expression that
defines the subset, i.e. a “filter”. The expression should return a
logical
value (TRUE
or FALSE
),
and only cases for which the value is TRUE
are included in
the analysis.
When the criterion consists of multiple variables it can be useful to pre-calculate it, so that we can easily check if our filter does what it is supposed to.
Since the imputed data is a mids
object we first need to
convert it to a data.frame
. This can be done with the
function complete()
from the mice
package.
Convert imp
to a long-format data.frame
in
which the original data and all imputed datasets are stacked onto each
other.
library("mice")
<- complete(imp, action = 'long', include = TRUE) imp_long
Some packages may not return the imputed data in long format but as a
list of data frames. In that case you can use the function
datalist2mids
from the R package miceadds
to create a mids
object.
Alternatively, you can use
do.call(rbind, <list of data.frames>)
to convert the
list of data frames into one long format data frame to then use
as.mids()
. You need to make sure that
data.frame
hypten_filter
that
separates cases with hypertension (SBP > 140
or
DBP > 90
or HyperMed == 'yes'
) from the
other cases. Cases where HyperMed
is missing should be
classified based on SBP
and DBP
only.HyperMed
.
HyperMed == "yes"
will return NA
if
HyperMed
is NA
.
$hypten_filter <- with(imp_long,
imp_longfactor(
ifelse(SBP <= 140 & DBP <= 90 & (HyperMed != "yes" | is.na(HyperMed)),
"no", "yes")))
We can check that the filter variable does what it is supposed to by creating a table:
::ddply(imp_long, c('HyperMed', "SBP > 140", "DBP > 90"), plyr::summarize,
plyrN = length(hypten_filter),
'yes' = sum(hypten_filter == 'yes'),
'no' = sum(hypten_filter == 'no'))
## HyperMed SBP > 140 DBP > 90 N yes no
## 1 no FALSE FALSE 117 0 117
## 2 no FALSE TRUE 6 6 0
## 3 no TRUE FALSE 12 12 0
## 4 no TRUE TRUE 14 14 0
## 5 no NA NA 1 NA NA
## 6 previous FALSE FALSE 82 0 82
## 7 previous FALSE TRUE 6 6 0
## 8 previous TRUE FALSE 18 18 0
## 9 previous TRUE TRUE 12 12 0
## 10 previous NA NA 2 NA NA
## 11 yes FALSE FALSE 562 562 0
## 12 yes FALSE TRUE 9 9 0
## 13 yes TRUE FALSE 202 202 0
## 14 yes TRUE TRUE 62 62 0
## 15 yes NA NA 17 17 0
## 16 <NA> FALSE FALSE 4385 0 4385
## 17 <NA> FALSE TRUE 69 69 0
## 18 <NA> TRUE FALSE 331 331 0
## 19 <NA> TRUE TRUE 54 54 0
## 20 <NA> NA NA 39 NA NA
Note:
Subjects for whom SBP
or DBP
is imputed may
be classified to have hypertension in some sets but not in other, so
that the number of subjects in the analysis of the subset will differ
between imputed datasets. This may complicate subsequent analyses.
with(imp_long, table(hypten_filter, .imp, exclude = NULL))
## .imp
## hypten_filter 0 1 2 3 4 5
## no 734 769 770 771 770 770
## yes 224 231 230 229 230 230
## <NA> 42 0 0 0 0 0
mids
object using the
function as.mids()
from the mice
package.DM ~ gender + race + age + chol + SBP + WC
,
restricted to cases without hypertension.subset = hypten_filter == "no"
to select cases without
hypertension.
<- as.mids(imp_long)
imp_mids
<- with(imp_mids,
subres glm(DM ~ gender + race + age + chol + SBP + WC,
family = binomial(),
subset = hypten_filter == 'no'))
summary(pool(subres), conf.int = TRUE)
## term estimate std.error statistic df p.value 2.5 % 97.5 %
## 1 (Intercept) -12.207041 1.9496057 -6.26129 640.833 6.9959e-10 -16.0354289 -8.378654
## 2 genderfemale 0.267147 0.2878060 0.92822 597.845 3.5367e-01 -0.2980867 0.832381
## 3 raceOther Hispanic -0.341800 0.4782119 -0.71475 740.202 4.7499e-01 -1.2806135 0.597013
## 4 raceNon-Hispanic White -1.070861 0.4001554 -2.67611 619.034 7.6453e-03 -1.8566876 -0.285034
## 5 raceNon-Hispanic Black -0.382342 0.4185385 -0.91352 713.969 3.6128e-01 -1.2040559 0.439371
## 6 raceother -0.726813 0.4888762 -1.48670 720.273 1.3753e-01 -1.6866060 0.232979
## 7 age 0.044300 0.0095845 4.62204 673.368 4.5551e-06 0.0254810 0.063119
## 8 chol 0.023963 0.1534679 0.15614 56.666 8.7647e-01 -0.2833905 0.331316
## 9 SBP 0.020590 0.0136697 1.50625 191.241 1.3365e-01 -0.0063728 0.047553
## 10 WC 0.058242 0.0089357 6.51788 306.792 2.9259e-10 0.0406589 0.075825
If you want to check if the analysis was indeed performed on the subset only, you can check the length of the fitted values from each analysis:
sapply(subres$analyses, function(x) length(x$fitted.values))
## [1] 769 770 771 770 770
The function pool()
from the mice
package is not the only function that performs pooling using Rubin’s
rules. The package mitools also provides this
functionality.
The diagram above shows how to obtain pooled results using mitools.
split()
.imputationList
. The imputationList
should only
contain the imputed data, not the original data.<name of the list object>[-1]
.
<- imputationList(split(imp_long, imp_long$.imp)[-1]) imp_list
imp_list
using
with()
.MIcombine()
and obtain the
summary()
of the pooled results.<- with(imp_list, glm(DM ~ gender + race + age + chol + SBP + WC,
reslist family = binomial(),
subset = hypten_filter == 'no'))
summary(MIcombine(reslist))
## Multiple imputation results:
## with(imp_list, glm(DM ~ gender + race + age + chol + SBP + WC,
## family = binomial(), subset = hypten_filter == "no"))
## MIcombine.default(reslist)
## results se (lower upper) missInfo
## (Intercept) -12.207041 1.9496057 -16.0291279 -8.384955 3 %
## genderfemale 0.267147 0.2878060 -0.2971500 0.831444 4 %
## raceOther Hispanic -0.341800 0.4782119 -1.2790994 0.595499 1 %
## raceNon-Hispanic White -1.070861 0.4001554 -1.8553897 -0.286332 3 %
## raceNon-Hispanic Black -0.382342 0.4185385 -1.2027215 0.438037 2 %
## raceother -0.726813 0.4888762 -1.6850497 0.231423 1 %
## age 0.044300 0.0095845 0.0255118 0.063089 2 %
## chol 0.023963 0.1534679 -0.2827212 0.330647 27 %
## SBP 0.020590 0.0136697 -0.0063235 0.047503 13 %
## WC 0.058242 0.0089357 0.0406897 0.075794 9 %
When you use R for the analysis in a scientific publication, it is good practice to properly reference the software and packages that you used, including the version numbers.
Moreover, package developers spent a lot of time providing you with the software, and they deserve getting some credit for their work.
The function citation()
will tell you how you should
cite R or specific packages.
For example:
citation()
## To cite R in publications use:
##
## R Core Team (2023). _R: A Language and Environment for Statistical Computing_. R
## Foundation for Statistical Computing, Vienna, Austria. <https://www.R-project.org/>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {R: A Language and Environment for Statistical Computing},
## author = {{R Core Team}},
## organization = {R Foundation for Statistical Computing},
## address = {Vienna, Austria},
## year = {2023},
## url = {https://www.R-project.org/},
## }
##
## We have invested a lot of time and effort in creating R, please cite it when using it for
## data analysis. See also 'citation("pkgname")' for citing R packages.
citation("mice")
## To cite mice in publications use:
##
## Stef van Buuren, Karin Groothuis-Oudshoorn (2011). mice: Multivariate Imputation by
## Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. DOI
## 10.18637/jss.v045.i03.
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {{mice}: Multivariate Imputation by Chained Equations in R},
## author = {Stef {van Buuren} and Karin Groothuis-Oudshoorn},
## journal = {Journal of Statistical Software},
## year = {2011},
## volume = {45},
## number = {3},
## pages = {1-67},
## doi = {10.18637/jss.v045.i03},
## }
To obtain the version numbers you can use:
R.version.string
## [1] "R version 4.3.0 (2023-04-21 ucrt)"
packageVersion("mice")
## [1] '3.15.0'
Because software changes over time it is important to know with which version an analysis was performed in order to be able to reproduce the results.
© Nicole Erler