Preface

R packages

In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:

  • R version 4.3.0 (2023-04-21 ucrt)
  • mice (version: 3.15.0)

Help files

You can find help files for any function by adding a ? before the name of the function.

Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html

Dataset

Overview

For this practical, we will use the EP16dat1 dataset, which is a subset of the NHANES (National Health and Nutrition Examination Survey) data.

To get the EP16dat1 dataset, load the file EP16dat1.RData. You can download it here.

To load this dataset into R, you can use the command file.choose() which opens the explorer and allows you to navigate to the location of the file on your computer.

If you know the path to the file, you can also use load("<path>/EP16dat1.RData").

If you have not followed the first practical or if you re-loaded the EP16dat1 data, you need to re-code the variable educ again:

EP16dat1$educ <- as.ordered(EP16dat1$educ)

Preparing for imputation

Set-up run

Imputation needs to be tailored to the dataset at hand and, hence, using the function mice() well requires several arguments to be specified. To make the specification easier it is useful to do a dry-run which will create the default versions of everything that needs to be specified.

These default settings can then be adapted to our data.

Task

Do the set-up run of mice() with the EP16dat1 data without any iterations (maxit = 0).

Solution

# Note: This command may not produce any output.
library("mice")
imp0 <- mice(EP16dat1, maxit = 0)
## Warning: Number of logged events: 1

Imputation method

There are many imputation methods available in mice. You can find the list in the help page of the mice() function. We will focus here on the following ones:

name variable type description
pmm any Predictive mean matching
norm numeric Bayesian linear regression
logreg binary Logistic regression
polr ordered Proportional odds model
polyreg unordered Polytomous logistic regression

The default imputation methods that mice() selects can be specified in the argument defaultMethod.

If unspecified, mice will use

  • pmm for numerical columns,
  • logreg for factor columns with two categories,
  • polyreg for columns with unordered and
  • polr for columns with ordered factors with more than two categories.

Task 1

When a normal imputation model seems to be appropriate for most of the continuous covariates, you may want to specify norm as the default method in the set-up run. Let’s do that!

The order of the types of variable is: continuous, binary, factor, ordered factor.

Solution 1

imp0 <- mice(EP16dat1, maxit = 0, 
             defaultMethod = c("norm", 'logreg', 'polyreg', 'polr'))
## Warning: Number of logged events: 1

Task 2

In the histograms we made for the continuous variables during the previous practical, we could see that the variable creat had a skewed distribution, hence, using a normal imputation model may not work well.

  • Extract the default settings of the imputation method from imp0.
  • Change the imputation method for creat so that this variable will be imputed using predictive mean matching.
  • Check that all specified imputation methods are correct. When no imputation method is specified ("") the variable will not be imputed.

Solution 2

meth <- imp0$method
meth["creat"] <- "pmm"
meth
##       HDL      race       DBP      bili     smoke        DM    gender        WC      chol  HyperMed 
##    "norm"        ""    "norm"    "norm"    "polr"        ""        ""    "norm"    "norm"    "polr" 
##       alc       SBP       wgt    hypten    cohort     occup       age      educ      albu     creat 
##    "polr"    "norm"    "norm"  "logreg"        "" "polyreg"        ""    "polr"    "norm"     "pmm" 
##  uricacid       BMI   hypchol       hgt 
##    "norm"    "norm"  "logreg"    "norm"

Predictor matrix

The predictor matrix specifies which variables are used in the linear predictors of each of the imputation models.

A value of 1 specifies that the variable given in the column name is used in the model to impute the variable given in the row name (and 0 specifies that this variable is not used in that model).

Task 1

Get the predictorMatrix from imp0. Notice that mice has already set some of the values to 0. Do you understand why?

Solution 1

pred <- imp0$predictorMatrix
pred
##          HDL race DBP bili smoke DM gender WC chol HyperMed alc SBP wgt hypten cohort occup age
## HDL        0    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## race       1    0   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## DBP        1    1   0    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## bili       1    1   1    0     1  1      1  1    1        1   1   1   1      1      0     1   1
## smoke      1    1   1    1     0  1      1  1    1        1   1   1   1      1      0     1   1
## DM         1    1   1    1     1  0      1  1    1        1   1   1   1      1      0     1   1
## gender     1    1   1    1     1  1      0  1    1        1   1   1   1      1      0     1   1
## WC         1    1   1    1     1  1      1  0    1        1   1   1   1      1      0     1   1
## chol       1    1   1    1     1  1      1  1    0        1   1   1   1      1      0     1   1
## HyperMed   1    1   1    1     1  1      1  1    1        0   1   1   1      1      0     1   1
## alc        1    1   1    1     1  1      1  1    1        1   0   1   1      1      0     1   1
## SBP        1    1   1    1     1  1      1  1    1        1   1   0   1      1      0     1   1
## wgt        1    1   1    1     1  1      1  1    1        1   1   1   0      1      0     1   1
## hypten     1    1   1    1     1  1      1  1    1        1   1   1   1      0      0     1   1
## cohort     1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## occup      1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     0   1
## age        1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   0
## educ       1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## albu       1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## creat      1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## uricacid   1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## BMI        1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## hypchol    1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## hgt        1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
##          educ albu creat uricacid BMI hypchol hgt
## HDL         1    1     1        1   1       1   1
## race        1    1     1        1   1       1   1
## DBP         1    1     1        1   1       1   1
## bili        1    1     1        1   1       1   1
## smoke       1    1     1        1   1       1   1
## DM          1    1     1        1   1       1   1
## gender      1    1     1        1   1       1   1
## WC          1    1     1        1   1       1   1
## chol        1    1     1        1   1       1   1
## HyperMed    1    1     1        1   1       1   1
## alc         1    1     1        1   1       1   1
## SBP         1    1     1        1   1       1   1
## wgt         1    1     1        1   1       1   1
## hypten      1    1     1        1   1       1   1
## cohort      1    1     1        1   1       1   1
## occup       1    1     1        1   1       1   1
## age         1    1     1        1   1       1   1
## educ        0    1     1        1   1       1   1
## albu        1    0     1        1   1       1   1
## creat       1    1     0        1   1       1   1
## uricacid    1    1     1        0   1       1   1
## BMI         1    1     1        1   0       1   1
## hypchol     1    1     1        1   1       0   1
## hgt         1    1     1        1   1       1   0

The column corresponding to the variable cohort is set to 0 which means that this variable is not used in any of the imputation models. cohort has the same value for all observations, so it would not be useful as a covariate.

Task 2

Because BMI is calculated from height (hgt) and weight (wgt), and there are cases where only one of these two variables is missing, we want to impute hgt and wgt separately. BMI should be imputed using “passive imputation”.

To avoid multicollinearity (which may lead to problems during imputation), imputation models should not include all three variables as predictor variables. In this example, we will use BMI to impute the other incomplete variables in the data.

Moreover, we need to exclude WC from the imputation model for wgt because the high correlation between WC, BMI and wgt would otherwise lead to problems during imputation.

And since HyperMed does not give us a lot more information than hypten, but has a lot more missing values, we do not want to use it as a predictor variable.

Apply the necessary changes to pred and meth.

For passive imputation, you need to specify the formula used to calculate BMI in meth using "~I(...)".

Solution 2

# BMI will not be used as predictor of height and weight
pred[c("hgt", "wgt"), "BMI"] <- 0

# height and weight will not be used as predictor in any model
pred[, c("hgt", "wgt")] <- 0

# height and weight will be used as predictors for each other
pred["hgt", "wgt"] <- 1
pred["wgt", "hgt"] <- 1

# WC is not used as predictor for weight
pred["wgt", "WC"] <- 0

# HyperMed will not be used as predictor in any model
pred[, "HyperMed"] <- 0

# hypchol will not be used as predictor in the imputation model for chol
pred["chol", "hypchol"] <- 0

# BMI will be imputed passively
meth["BMI"] <- "~I(wgt/hgt^2)"

# HyperMed will not be imputed
meth["HyperMed"] <- ""

Visit sequence

The visit sequence specifies the order in which the variables are imputed.

Task

To be sure that the imputed values of BMI match the imputed values of hgt and wgt, BMI needs to be imputed after hgt and wgt.

  • Get the visitSequence from imp0, and
  • change it if necessary.

Solution

visSeq <- imp0$visitSequence

which_BMI <- match("BMI", visSeq)
visSeq <- c(visSeq[-which_BMI], visSeq[which_BMI])

Imputation

Running the imputation

Task

With the changes that we have made to the predictorMatrix and method, we can now perform the imputation. Use m = 5 and maxit = 10.

Solution

imp <- mice(EP16dat1, method = meth, predictorMatrix = pred, visitSequence = visSeq,
            maxit = 10, m = 5, seed = 2020)

mice() prints the name of the variable being imputed for each iteration and imputation. If you run mice() on your own computer the output will show up continuously. There, you may notice that imputation is slowest for categorical variables, especially when they have many categories.

You can hide the lengthy output by specifying printFlag = FALSE.

What does mice return?

Task

mice() does not return a data.frame. Find out the class of the object returned by mice() function using the function class(), and take a look at the help file for this class.

Solution

class(imp)
## [1] "mids"

We see that imp is an object of class mids.

The help file tells us that a mids object is a list with several elements:

data: Original (incomplete) data set.
imp: The imputed values: A list of ncol(data) components, each list component is a matrix with nmis[j] rows and m columns.
m: The number of imputations.
where: The missingness indicator matrix.
blocks The blocks argument of the mice() function.
call: The call that created the mids object.
nmis: The number of missing observations per variable.
method: The vector imputation methods.
predictorMatrix: The predictor matrix.
visitSequence: The sequence in which columns are visited during imputation.
formulas A named list of formulas corresponding the the imputed variables (blocks).
post: A vector of strings of length length(blocks) with commands for post-processing.
seed: The seed value of the solution.
iteration: The number of iterations.
lastSeedValue: The most recent seed value.
chainMean: The mean of imputed values per variable and iteration: a list of m components. Each component is a matrix with maxitcolumns and length(visitSequence) rows.
chainVar: The variances of imputed values per variable and iteration(same structure as chainMean).
loggedEvents: A data.frame with the record of automatic corrective actions and warnings; (NULL if no action was made).
version Version number of the mice package that created the object.
date Date at which the object was created.

Details of the loggedEvents:

mice() does some pre-processing of the data:

  • variables containing missing values, that are not imputed but used as predictor are removed
  • constant variables are removed
  • collinear variables are removed

Furthermore, during each iteration

  • variables (or dummy variables) that are linearly dependent are removed
  • polr imputation that does not converge is replaced by polyreg.
The data.frame in loggedEvents has the following columns:
it iteration number
im imputation number
dep name of the name of the variable being imputed
meth imputation method used
out character vector with names of altered/removed predictors

A Note on Imputation Methods

In Section 4 of the lectures, we have seen different methods to obtain imputed values: Bayesian imputation, bootstrap imputation and predictive mean matching.

The imputation methods implemented in the mice that we talked about in Section 6 of the lectures and here of course fall into these categories.

norm and logreg perform Bayesian linear and logistic regression, respectively. There are bootstrap alternatives available: norm.boot and logreg.boot.

The default method pmm performs predictive mean matching using, by default, 5 donors and type-1 matching.

The number of donors can be changed using the argument donors. A selection between type-0, type-1 and type-2 matching is possible via the argument matchtype.

For example, we could adjust the syntax from above:

mice(data = EP16dat1, m = 5, method = meth, predictorMatrix = pred, 
     visitSequence = visSeq, maxit = 10, seed = 2020,
     donors = 5, matchtype = 2L)

Both arguments can be passed to mice() and will apply to all imputation models that use pmm.

An alternative implementation of predictive mean matching is the imputation method midastouch. It uses type-2 matching with a leave-one-out approach, where parameter estimates are obtained using bootstrap. Donors are selected from all cases for which the variable that is to be imputed is observed, with probability depending on the distance of the predicted values.

 

© Nicole Erler