In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:
mice (version: 3.15.0)
You can find help files for any function by adding a ? before the name of the function.
Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html
For this practical, we will use the EP16dat1 dataset, which is a subset of the NHANES (National Health and Nutrition Examination Survey) data.
To get the EP16dat1 dataset, load the file EP16dat1.RData. You can download it here.
To load this dataset into R, you can use the command load(file.choose()). The call to file.choose() opens the explorer and allows you to navigate to the location of the file on your computer; load() then reads the selected file.
If you know the path to the file, you can also use load("<path>/EP16dat1.RData").
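The behaviour of load() can be checked without the course data. The following is a minimal sketch that assumes a small made-up data frame (toy) and a temporary file in place of EP16dat1.RData; it shows that load() restores objects under their saved names and returns those names.

```r
# Hypothetical stand-in for EP16dat1 (made-up values, not the course data)
toy <- data.frame(age = c(34, 51, NA), educ = c("low", "high", "mid"))

# Save it to a temporary .RData file and remove it from the workspace
path <- file.path(tempdir(), "toy.RData")
save(toy, file = path)
rm(toy)

# load() restores the object under its original name and
# (invisibly) returns the names of the objects it created
loaded <- load(path)
loaded
str(toy)
```

Because load() restores objects under the names they were saved with, EP16dat1.RData makes an object called EP16dat1 available without any assignment.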
If you have not followed the first practical or if you re-loaded the EP16dat1 data, you need to re-code the variable educ again:
EP16dat1$educ <- as.ordered(EP16dat1$educ)
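To see what this re-coding does, here is a small sketch with a made-up factor (the level names are assumptions, not the actual categories of educ). as.ordered() keeps the existing level order of a factor, so it is worth checking levels() before converting.

```r
# Made-up education levels; the real categories in EP16dat1 may differ
educ <- factor(c("low", "high", "mid", "low"),
               levels = c("low", "mid", "high"))
is.ordered(educ)

# Convert to an ordered factor; the level order low < mid < high is kept
educ_ord <- as.ordered(educ)
is.ordered(educ_ord)
levels(educ_ord)

# Comparisons between levels are now meaningful
educ_ord < "high"
```

The class matters for imputation because mice() chooses its default method based on it: ordered factors with more than two categories get polr, unordered ones polyreg.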
Imputation needs to be tailored to the dataset at hand and, hence, using the function mice() well requires several arguments to be specified. To make the specification easier, it is useful to do a dry-run which will create the default versions of everything that needs to be specified.
These default settings can then be adapted to our data.
Do the set-up run of mice() with the EP16dat1 data without any iterations (maxit = 0).
# Note: This command may not produce any output.
library("mice")
imp0 <- mice(EP16dat1, maxit = 0)
## Warning: Number of logged events: 1
There are many imputation methods available in mice. You can find the list in the help page of the mice() function. We will focus here on the following ones:
name | variable type | description |
---|---|---|
pmm | any | Predictive mean matching |
norm | numeric | Bayesian linear regression |
logreg | binary | Logistic regression |
polr | ordered | Proportional odds model |
polyreg | unordered | Polytomous logistic regression |
The default imputation methods that mice() selects can be specified in the argument defaultMethod.
If unspecified, mice will use pmm for numerical columns, logreg for factor columns with two categories, polyreg for columns with unordered factors with more than two categories, and polr for columns with ordered factors with more than two categories.
When a normal imputation model seems to be appropriate for most of the continuous covariates, you may want to specify norm as the default method in the set-up run. Let’s do that!
imp0 <- mice(EP16dat1, maxit = 0,
             defaultMethod = c("norm", "logreg", "polyreg", "polr"))
## Warning: Number of logged events: 1
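If you do not have EP16dat1 at hand, the effect of defaultMethod can be tried on nhanes2, a small example dataset that ships with mice. This is only a sketch on different data; the variable names (age, bmi, hyp, chl) belong to nhanes2, not to EP16dat1.

```r
library("mice")

# Dry run with the package defaults
imp_default <- mice(nhanes2, maxit = 0)
imp_default$method

# Dry run with "norm" as the default for numeric variables
imp_norm <- mice(nhanes2, maxit = 0,
                 defaultMethod = c("norm", "logreg", "polyreg", "polr"))
imp_norm$method
```

Only the entries for the numeric variables change between the two runs; complete variables (here age) keep the empty method "" in both.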
In the histograms we made for the continuous variables during the previous practical, we could see that the variable creat had a skewed distribution; hence, using a normal imputation model may not work well.
Get the method from imp0.
Change the value for creat so that this variable will be imputed using predictive mean matching.
When the method is set to the empty string ("") the variable will not be imputed.
meth <- imp0$method
meth["creat"] <- "pmm"
meth
## HDL race DBP bili smoke DM gender WC chol HyperMed
## "norm" "" "norm" "norm" "polr" "" "" "norm" "norm" "polr"
## alc SBP wgt hypten cohort occup age educ albu creat
## "polr" "norm" "norm" "logreg" "" "polyreg" "" "polr" "norm" "pmm"
## uricacid BMI hypchol hgt
## "norm" "norm" "logreg" "norm"
The predictor matrix specifies which variables are used in the linear predictors of each of the imputation models.
A value of 1 specifies that the variable given in the column name is used in the model to impute the variable given in the row name (and 0 specifies that this variable is not used in that model).
Get the predictorMatrix from imp0. Notice that mice has already set some of the values to 0. Do you understand why?
pred <- imp0$predictorMatrix
pred
## HDL race DBP bili smoke DM gender WC chol HyperMed alc SBP wgt hypten cohort occup age
## HDL 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## race 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## DBP 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## bili 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1
## smoke 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1
## DM 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1
## gender 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1
## WC 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1
## chol 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1
## HyperMed 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1
## alc 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1
## SBP 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1
## wgt 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1
## hypten 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1
## cohort 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## occup 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1
## age 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0
## educ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## albu 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## creat 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## uricacid 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## BMI 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## hypchol 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## hgt 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## educ albu creat uricacid BMI hypchol hgt
## HDL 1 1 1 1 1 1 1
## race 1 1 1 1 1 1 1
## DBP 1 1 1 1 1 1 1
## bili 1 1 1 1 1 1 1
## smoke 1 1 1 1 1 1 1
## DM 1 1 1 1 1 1 1
## gender 1 1 1 1 1 1 1
## WC 1 1 1 1 1 1 1
## chol 1 1 1 1 1 1 1
## HyperMed 1 1 1 1 1 1 1
## alc 1 1 1 1 1 1 1
## SBP 1 1 1 1 1 1 1
## wgt 1 1 1 1 1 1 1
## hypten 1 1 1 1 1 1 1
## cohort 1 1 1 1 1 1 1
## occup 1 1 1 1 1 1 1
## age 1 1 1 1 1 1 1
## educ 0 1 1 1 1 1 1
## albu 1 0 1 1 1 1 1
## creat 1 1 0 1 1 1 1
## uricacid 1 1 1 0 1 1 1
## BMI 1 1 1 1 0 1 1
## hypchol 1 1 1 1 1 0 1
## hgt 1 1 1 1 1 1 0
The column corresponding to the variable cohort is set to 0, which means that this variable is not used in any of the imputation models. cohort has the same value for all observations, so it would not be useful as a covariate.
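This behaviour can be reproduced on a small scale. The sketch below adds a made-up constant column (called cohort here, mirroring the text) to the nhanes example data from mice and inspects the set-up run; it is an illustration on different data, not part of the solution.

```r
library("mice")

toy <- nhanes          # small example data shipped with mice
toy$cohort <- 2011     # hypothetical constant column

# Set-up run only; a warning about logged events is expected here
imp_toy <- mice(toy, maxit = 0)

# The column for the constant variable is set to 0 ...
imp_toy$predictorMatrix[, "cohort"]

# ... and the corrective action is recorded
imp_toy$loggedEvents
```

The warning about logged events that appeared in the set-up run with EP16dat1 points to the same kind of record in imp0$loggedEvents.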
Because BMI is calculated from height (hgt) and weight (wgt), and there are cases where only one of these two variables is missing, we want to impute hgt and wgt separately. BMI should be imputed using “passive imputation”.
To avoid multicollinearity (which may lead to problems during imputation), imputation models should not include all three variables as predictor variables. In this example, we will use BMI to impute the other incomplete variables in the data.
Moreover, we need to exclude WC from the imputation model for wgt because the high correlation between WC, BMI and wgt would otherwise lead to problems during imputation.
And since HyperMed does not give us a lot more information than hypten, but has a lot more missing values, we do not want to use it as a predictor variable.
Apply the necessary changes to pred and meth.
For passive imputation, the entry in meth is a formula given as a string of the form "~I(...)".
# BMI will not be used as predictor of height and weight
pred[c("hgt", "wgt"), "BMI"] <- 0
# height and weight will not be used as predictor in any model
pred[, c("hgt", "wgt")] <- 0
# height and weight will be used as predictors for each other
pred["hgt", "wgt"] <- 1
pred["wgt", "hgt"] <- 1
# WC is not used as predictor for weight
pred["wgt", "WC"] <- 0
# HyperMed will not be used as predictor in any model
pred[, "HyperMed"] <- 0
# hypchol will not be used as predictor in the imputation model for chol
pred["chol", "hypchol"] <- 0
# BMI will be imputed passively
meth["BMI"] <- "~I(wgt/hgt^2)"
# HyperMed will not be imputed
meth["HyperMed"] <- ""
The visit sequence specifies the order in which the variables are imputed.
To be sure that the imputed values of BMI match the imputed values of hgt and wgt, BMI needs to be imputed after hgt and wgt.
Get the visitSequence from imp0, and move BMI to the end so that it is imputed after hgt and wgt.
visSeq <- imp0$visitSequence
which_BMI <- match("BMI", visSeq)
visSeq <- c(visSeq[-which_BMI], visSeq[which_BMI])
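The same reordering can be wrapped in a small helper. move_to_end() below is a hypothetical convenience function, not part of mice, and the example vector is made up. The extra check is there because match() returns NA when the name is absent, and indexing with that NA would silently replace the visit sequence with missing values.

```r
# Hypothetical helper: move one variable to the end of a visit sequence
move_to_end <- function(vis, var) {
  pos <- match(var, vis)
  if (is.na(pos)) {
    stop("'", var, "' is not part of the visit sequence")
  }
  c(vis[-pos], vis[pos])
}

# Made-up visit sequence for illustration
vis <- c("HDL", "BMI", "wgt", "hgt")
move_to_end(vis, "BMI")
```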
With the changes that we have made to the predictorMatrix and method, we can now perform the imputation. Use m = 5 and maxit = 10.
imp <- mice(EP16dat1, method = meth, predictorMatrix = pred, visitSequence = visSeq,
            maxit = 10, m = 5, seed = 2020)
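To check that passive imputation keeps BMI consistent with height and weight, the idea can be tried on a smaller problem. The sketch below uses the boys data that ships with mice (height is in cm there, hence the division by 100) and only four of its columns; the object names are made up for this example and the settings are kept small so that it runs quickly.

```r
library("mice")

# Subset of the boys example data: age, height (cm), weight (kg), BMI
dat <- boys[, c("age", "hgt", "wgt", "bmi")]

# Default set-up, then adapt it along the lines of the practical
ini <- mice(dat, maxit = 0)
meth_b <- ini$method
pred_b <- ini$predictorMatrix

meth_b["bmi"] <- "~I(wgt / (hgt / 100)^2)"   # passive imputation of BMI
pred_b[c("hgt", "wgt"), "bmi"] <- 0          # BMI does not predict hgt or wgt

# bmi is the last column, so by default it is visited after hgt and wgt
imp_b <- mice(dat, method = meth_b, predictorMatrix = pred_b,
              m = 2, maxit = 5, seed = 2020, printFlag = FALSE)

# In the completed data, imputed BMI agrees with height and weight
comp <- complete(imp_b, 1)
mis <- is.na(dat$bmi)
all.equal(as.numeric(comp$bmi[mis]),
          (comp$wgt / (comp$hgt / 100)^2)[mis])
```

If BMI were visited before hgt and wgt, the passively imputed values would be based on the height and weight from the previous iteration, which is why the visit sequence matters.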
mice() prints the name of the variable being imputed for each iteration and imputation. If you run mice() on your own computer the output will show up continuously. There, you may notice that imputation is slowest for categorical variables, especially when they have many categories.
You can hide the lengthy output by specifying printFlag = FALSE.
mice() does not return a data.frame. Find out the class of the object returned by the mice() function using the function class(), and take a look at the help file for this class.
class(imp)
## [1] "mids"
We see that imp is an object of class mids.
The help file tells us that a mids object is a list with several elements:
element | description |
---|---|
data | Original (incomplete) data set. |
imp | The imputed values: a list of ncol(data) components, each list component is a matrix with nmis[j] rows and m columns. |
m | The number of imputations. |
where | The missingness indicator matrix. |
blocks | The blocks argument of the mice() function. |
call | The call that created the mids object. |
nmis | The number of missing observations per variable. |
method | The vector of imputation methods. |
predictorMatrix | The predictor matrix. |
visitSequence | The sequence in which columns are visited during imputation. |
formulas | A named list of formulas corresponding to the imputed variables (blocks). |
post | A vector of strings of length length(blocks) with commands for post-processing. |
seed | The seed value of the solution. |
iteration | The number of iterations. |
lastSeedValue | The most recent seed value. |
chainMean | The mean of imputed values per variable and iteration: a list of m components. Each component is a matrix with maxit columns and length(visitSequence) rows. |
chainVar | The variances of imputed values per variable and iteration (same structure as chainMean). |
loggedEvents | A data.frame with the record of automatic corrective actions and warnings (NULL if no action was made). |
version | Version number of the mice package that created the object. |
date | Date at which the object was created. |
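Some of these elements can be inspected directly. The sketch below uses the small nhanes example data from mice rather than EP16dat1, so the numbers will differ from those of the practical; imp_small is a name chosen for this example.

```r
library("mice")

imp_small <- mice(nhanes, m = 3, maxit = 2, seed = 1, printFlag = FALSE)

class(imp_small)
imp_small$m                 # number of imputations
imp_small$nmis              # missing values per variable
head(imp_small$imp$bmi)     # imputed values: one column per imputation
dim(imp_small$imp$bmi)      # nmis["bmi"] rows, m columns
head(imp_small$where)       # TRUE where a value was missing
is.null(imp_small$loggedEvents)
```

The row names of imp_small$imp$bmi identify the cases in the original data whose bmi value was missing.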
Details of the loggedEvents:
mice() does some pre-processing of the data. Furthermore, during each iteration, a polr imputation that does not converge is replaced by polyreg.
The data.frame in loggedEvents has the following columns:
column | description |
---|---|
it | iteration number |
im | imputation number |
dep | name of the variable being imputed |
meth | imputation method used |
out | character vector with names of altered/removed predictors |
In Section 4 of the lectures, we have seen different methods to obtain imputed values: Bayesian imputation, bootstrap imputation and predictive mean matching.
The imputation methods implemented in the mice package that we talked about in Section 6 of the lectures and here of course fall into these categories.
norm and logreg perform Bayesian linear and logistic regression, respectively. There are bootstrap alternatives available: norm.boot and logreg.boot.
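Switching to a bootstrap variant only requires changing entries of the method vector. A minimal sketch on numeric columns of the boys example data from mice (not EP16dat1), with object names chosen for this example:

```r
library("mice")

# Numeric columns of the boys example data shipped with mice
dat_num <- boys[, c("age", "hgt", "wgt", "hc")]

# Start from the default methods and switch the incomplete
# variables to the bootstrap version of linear regression
meth_boot <- mice(dat_num, maxit = 0)$method
meth_boot[meth_boot != ""] <- "norm.boot"
meth_boot

imp_boot <- mice(dat_num, method = meth_boot, m = 2, maxit = 2,
                 seed = 2020, printFlag = FALSE)
imp_boot$method
```

Entries that are "" (complete variables such as age) are left untouched, so those variables are still not imputed.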
The default method pmm performs predictive mean matching using, by default, 5 donors and type-1 matching.
The number of donors can be changed using the argument donors. A selection between type-0, type-1 and type-2 matching is possible via the argument matchtype.
For example, we could adjust the syntax from above:
mice(data = EP16dat1, m = 5, method = meth, predictorMatrix = pred,
visitSequence = visSeq, maxit = 10, seed = 2020,
donors = 5, matchtype = 2L)
Both arguments can be passed to mice() and will apply to all imputation models that use pmm.
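One consequence of predictive mean matching is that every imputed value is copied from an observed case (a donor). The following sketch checks this on the nhanes example data from mice; it is an illustration on other data, and the choice donors = 3 is arbitrary.

```r
library("mice")

imp_pmm <- mice(nhanes, method = "pmm", donors = 3, matchtype = 1L,
                m = 3, maxit = 2, seed = 2020, printFlag = FALSE)

# All imputed bmi values also occur among the observed bmi values
obs_bmi <- nhanes$bmi[!is.na(nhanes$bmi)]
imputed_bmi <- unlist(imp_pmm$imp$bmi)
all(imputed_bmi %in% obs_bmi)
```

This is also why pmm is attractive for skewed variables such as creat: the imputed values cannot fall outside the range of the observed data.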
An alternative implementation of predictive mean matching is the imputation method midastouch. It uses type-2 matching with a leave-one-out approach, where parameter estimates are obtained using bootstrap. Donors are selected from all cases for which the variable that is to be imputed is observed, with probability depending on the distance of the predicted values.
© Nicole Erler