Preface

R packages

In this practical, a number of R packages are used. If any of them are not installed, you may be able to follow the practical but will not be able to run all of the code. The software used (with the versions that were used to generate the solutions) is:

  • R version 3.6.0 (2019-04-26)
  • mice (version: 3.4.0)
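To check whether the required package is available before starting, a minimal sketch like the following can be used. It only reports what is installed and does not install anything; the object names pkgs and installed are made up for this illustration.

```r
# Packages needed for this practical
pkgs <- c("mice")

# TRUE for every package that can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# R version and (if available) the installed version of mice,
# for comparison with the versions listed above
R.version.string
if (installed[["mice"]]) packageVersion("mice")
```

A missing package can be installed with install.packages("mice").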

Help files

You can find help files for any function by adding a ? before the name of the function.
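For example, the help page of mice() can be opened in any of the following ways (a small sketch; it assumes that the package mice is installed):

```r
library(mice)

?mice                    # help page of the function mice()
help("mice")             # the same, written as a function call
help(package = "mice")   # index of all help pages in the package
```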

Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html

Dataset

For this practical, we will again use the NHANES dataset that we used in the previous practical.

To load this dataset, you can use load(file.choose()): file.choose() opens a file browser that allows you to navigate to the location of the file NHANES_for_practicals.RData on your computer and returns the selected path, which load() then reads. If you know the path to the file, you can also use load("<path>/NHANES_for_practicals.RData").
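load() restores the objects stored in an .RData file into the workspace and returns their names, which is a quick way to see what was loaded. The sketch below demonstrates this with a small throw-away file, since the practical's data file is not assumed to be available here; toy and the temporary file name are hypothetical.

```r
# Create a small stand-in .RData file (hypothetical example data)
toy <- data.frame(id = 1:3, educ = c("low", "high", NA))
path <- file.path(tempdir(), "toy_example.RData")
save(toy, file = path)
rm(toy)

# load() restores the saved objects and returns their names
loaded <- load(path)
loaded

# For the practical, the same call would take the real file, e.g.
# load(file.choose())                          # pick the file interactively
# load("<path>/NHANES_for_practicals.RData")   # or give the path directly
```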

If you have not followed the first practical, or if you re-loaded the NHANES data, you need to re-code the variable educ again.
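The re-coding itself is not shown here. Since educ is imputed with polr later in this practical, a plausible reading is that educ has to be converted to an ordered factor. The sketch below illustrates that assumption on a hypothetical toy data frame; the category labels are invented, and with the practical's data the same call would be applied to the educ column of the loaded data.

```r
# Hypothetical stand-in for the educ column (labels are made up)
toy <- data.frame(
  educ = factor(c("low", "high", "medium", NA, "low"),
                levels = c("low", "medium", "high"))
)

# Assumed re-coding: turn the factor into an ordered factor,
# keeping the existing order of the levels
toy$educ <- as.ordered(toy$educ)

is.ordered(toy$educ)
levels(toy$educ)
```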

Preparing for imputation

Set-up run

Imputation needs to be tailored to the dataset at hand and, hence, using the function mice() well requires several arguments to be specified. To make the specification easier, it is useful to do a dry run, which creates the default versions of everything that needs to be specified.

These default settings can then be adapted to our data.

Task

Do the set-up run of mice() with the NHANES data without any iterations (maxit = 0).
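A minimal sketch of such a set-up run. Because the practical's data file is not assumed to be available here, the small nhanes2 data that ships with mice is used as a stand-in; with the practical's data, the first argument would be the loaded NHANES data instead.

```r
library(mice)

# Set-up run: maxit = 0 means that no iterations are performed
imp0 <- mice(nhanes2, maxit = 0)

# Default settings created by the set-up run
imp0$method           # imputation method per variable
imp0$predictorMatrix  # which variables predict which
imp0$visitSequence    # order in which variables are imputed
```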

Imputation method

There are many imputation methods available in mice. You can find the list in the help page of the mice() function. We will focus here on the following ones:

name      variable type   description
pmm       any             Predictive mean matching
norm      numeric         Bayesian linear regression
logreg    binary          Logistic regression
polr      ordered         Proportional odds model
polyreg   unordered       Polytomous logistic regression

The default imputation methods that mice() selects can be specified in the argument defaultMethod.

If unspecified, mice will use

  • pmm for numerical columns,
  • logreg for factor columns with two categories,
  • polyreg for unordered factor columns with more than two categories, and
  • polr for ordered factor columns with more than two categories.
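The defaults listed above can be checked on a small example. The sketch below uses a hypothetical toy data frame with one incomplete column of each type (all names are invented for the illustration) and prints the methods that mice() selects.

```r
library(mice)
set.seed(2)

# Hypothetical toy data: one column per variable type
n <- 40
toy <- data.frame(
  num = rnorm(n),
  bin = factor(sample(c("no", "yes"), n, replace = TRUE)),
  cat = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  ord = factor(sample(c("low", "mid", "high"), n, replace = TRUE),
               levels = c("low", "mid", "high"), ordered = TRUE)
)

# Set five values per column to missing
for (v in names(toy)) toy[sample(n, 5), v] <- NA

# Set-up run with the default methods
imp0 <- mice(toy, maxit = 0)
imp0$method
```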

Task 1

When a normal imputation model seems to be appropriate for most of the continuous covariates, you may want to specify norm as the default method in the set-up run. Let’s do that.

The argument defaultMethod takes a vector of four method names. The order of the types of variable is: continuous, binary, unordered factor, ordered factor.
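A sketch of a set-up run with norm as the default for continuous variables. The nhanes2 data shipped with mice is used as a stand-in for the practical's data, which is not assumed to be available here; the other three defaults are left at the values that mice() uses anyway.

```r
library(mice)

# Order of defaultMethod: continuous, binary, unordered factor, ordered factor
imp0 <- mice(nhanes2, maxit = 0,
             defaultMethod = c("norm", "logreg", "polyreg", "polr"))

# Incomplete continuous variables are now set to "norm"
imp0$method
```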

Task 2

In the histograms we made for the continuous variables during the previous practical, we could see that the variable creat had a skewed distribution, hence, using a normal imputation model may not work well.

  • Extract the default settings of meth from imp0.
  • Change the imputation method for creat so that this variable will be imputed using predictive mean matching.
  • Check that all specified imputation methods are correct. When no imputation method is specified ("") the variable will not be imputed.

Solution 2

##       HDL      race       DBP      bili     smoke        DM    gender        WC      chol  HyperMed 
##    "norm"        ""    "norm"    "norm"    "polr"        ""        ""    "norm"    "norm"    "polr" 
##       alc       SBP       wgt    hypten    cohort     occup       age      educ      albu     creat 
##    "polr"    "norm"    "norm"  "logreg"        "" "polyreg"        ""    "polr"    "norm"     "pmm" 
##  uricacid       BMI   hypchol       hgt 
##    "norm"    "norm"  "logreg"    "norm"
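The steps of this task can be sketched as follows. The example runs on a hypothetical toy data frame (only the variable name creat is taken from the practical), so its output is much shorter than the output shown above.

```r
library(mice)
set.seed(1)

# Hypothetical toy data with a right-skewed variable creat
toy <- data.frame(
  age   = rnorm(50, 50, 10),
  chol  = rnorm(50, 5, 1),
  creat = rlnorm(50, 0, 0.4)
)
toy$chol[1:5] <- NA
toy$creat[6:12] <- NA

# Set-up run with norm as default for continuous variables
imp0 <- mice(toy, maxit = 0,
             defaultMethod = c("norm", "logreg", "polyreg", "polr"))

# Extract the methods and switch creat to predictive mean matching
meth <- imp0$method
meth["creat"] <- "pmm"

# Check: "" marks variables that will not be imputed (here: age)
meth
```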

Predictor matrix

The predictor matrix specifies which variables are used in the linear predictors of each of the imputation models.

A value of 1 specifies that the variable given in the column name is used in the model to impute the variable given in the row name (and 0 specifies that this variable is not used in that model).

Task 1

Get the predictorMatrix from imp0. Notice that mice has already set some of the values to 0. Do you understand why?

Solution 1

##          HDL race DBP bili smoke DM gender WC chol HyperMed alc SBP wgt hypten cohort occup age
## HDL        0    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## race       1    0   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## DBP        1    1   0    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## bili       1    1   1    0     1  1      1  1    1        1   1   1   1      1      0     1   1
## smoke      1    1   1    1     0  1      1  1    1        1   1   1   1      1      0     1   1
## DM         1    1   1    1     1  0      1  1    1        1   1   1   1      1      0     1   1
## gender     1    1   1    1     1  1      0  1    1        1   1   1   1      1      0     1   1
## WC         1    1   1    1     1  1      1  0    1        1   1   1   1      1      0     1   1
## chol       1    1   1    1     1  1      1  1    0        1   1   1   1      1      0     1   1
## HyperMed   1    1   1    1     1  1      1  1    1        0   1   1   1      1      0     1   1
## alc        1    1   1    1     1  1      1  1    1        1   0   1   1      1      0     1   1
## SBP        1    1   1    1     1  1      1  1    1        1   1   0   1      1      0     1   1
## wgt        1    1   1    1     1  1      1  1    1        1   1   1   0      1      0     1   1
## hypten     1    1   1    1     1  1      1  1    1        1   1   1   1      0      0     1   1
## cohort     1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## occup      1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     0   1
## age        1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   0
## educ       1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## albu       1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## creat      1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## uricacid   1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## BMI        1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## hypchol    1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
## hgt        1    1   1    1     1  1      1  1    1        1   1   1   1      1      0     1   1
##          educ albu creat uricacid BMI hypchol hgt
## HDL         1    1     1        1   1       1   1
## race        1    1     1        1   1       1   1
## DBP         1    1     1        1   1       1   1
## bili        1    1     1        1   1       1   1
## smoke       1    1     1        1   1       1   1
## DM          1    1     1        1   1       1   1
## gender      1    1     1        1   1       1   1
## WC          1    1     1        1   1       1   1
## chol        1    1     1        1   1       1   1
## HyperMed    1    1     1        1   1       1   1
## alc         1    1     1        1   1       1   1
## SBP         1    1     1        1   1       1   1
## wgt         1    1     1        1   1       1   1
## hypten      1    1     1        1   1       1   1
## cohort      1    1     1        1   1       1   1
## occup       1    1     1        1   1       1   1
## age         1    1     1        1   1       1   1
## educ        0    1     1        1   1       1   1
## albu        1    0     1        1   1       1   1
## creat       1    1     0        1   1       1   1
## uricacid    1    1     1        0   1       1   1
## BMI         1    1     1        1   0       1   1
## hypchol     1    1     1        1   1       0   1
## hgt         1    1     1        1   1       1   0

The column corresponding to the variable cohort is set to 0, which means that this variable is not used in any of the imputation models. cohort has the same value for all observations, so it would not be useful as a covariate.
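This behaviour can be reproduced on a small example. The sketch below uses a hypothetical toy data frame with a constant column named cohort (the values are invented) and extracts the predictor matrix from the set-up run. mice() is expected to warn about one logged event here, which is the removal of the constant variable.

```r
library(mice)
set.seed(3)

# Hypothetical toy data; cohort is the same for every observation
n <- 30
toy <- data.frame(
  HDL    = rnorm(n, 1.4, 0.4),
  DBP    = rnorm(n, 70, 10),
  cohort = rep(2011, n)
)
toy$HDL[1:4] <- NA
toy$DBP[5:9] <- NA

imp0 <- mice(toy, maxit = 0)

# Rows: variable to be imputed; columns: predictor variables
pred <- imp0$predictorMatrix
pred

# The reason for the change is recorded in the logged events
imp0$loggedEvents
```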

Task 2

Because BMI is calculated from height (hgt) and weight (wgt), and there are cases where only one of these two variables is missing, we want to impute hgt and wgt separately. BMI should be imputed using “passive imputation”.

To avoid multicollinearity (which may lead to problems during imputation), imputation models should not include all three variables as predictor variables. In this example, we will use BMI to impute the other variables.

Moreover, we need to exclude WC from the imputation model for wgt because the high correlation between WC, BMI and wgt would otherwise lead to problems during imputation.

And since HyperMed does not give us a lot more information than hypten, but has a lot more missing values, we do not want to use it as a predictor variable.

Apply the necessary changes to pred and meth.

For passive imputation, you need to specify the formula used to calculate BMI in meth using "~I(...)".
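One possible way to implement these changes is sketched below. It runs on a hypothetical toy data frame that only borrows the variable names from the practical. Two things are assumptions rather than part of the task description: that hgt is measured in metres and wgt in kilograms (so that BMI = wgt / hgt^2), and that hgt and wgt are kept as predictors of each other while BMI is not used to impute them.

```r
library(mice)
set.seed(2019)

# Hypothetical toy data; units (m, kg) are an assumption
n <- 60
toy <- data.frame(
  age = rnorm(n, 50, 12),
  hgt = rnorm(n, 1.70, 0.10),
  wgt = rnorm(n, 75, 12)
)
toy$WC <- 40 + 0.7 * toy$wgt + rnorm(n, 0, 5)
toy$hypten <- factor(sample(c("no", "yes"), n, replace = TRUE))
toy$HyperMed <- factor(sample(c("no", "previous", "yes"), n, replace = TRUE))
toy$hgt[1:5] <- NA
toy$wgt[4:10] <- NA
toy$WC[11:15] <- NA
toy$hypten[16:18] <- NA
toy$HyperMed[19:40] <- NA
toy$BMI <- toy$wgt / toy$hgt^2   # missing whenever hgt or wgt is missing

imp0 <- mice(toy, maxit = 0)
meth <- imp0$method
pred <- imp0$predictorMatrix

# hgt and wgt are not used as predictors for other variables (BMI is) ...
pred[, c("hgt", "wgt")] <- 0
# ... but they are used to impute each other
pred["hgt", "wgt"] <- 1
pred["wgt", "hgt"] <- 1
# BMI is derived from hgt and wgt, so it does not predict them
pred[c("hgt", "wgt"), "BMI"] <- 0
# WC is not used in the imputation model for wgt
pred["wgt", "WC"] <- 0
# HyperMed is not used as a predictor in any model
pred[, "HyperMed"] <- 0

# Passive imputation: BMI is recalculated from the imputed hgt and wgt
meth["BMI"] <- "~I(wgt / hgt^2)"

meth
pred
```

The modified meth and pred can then be passed to mice() through the arguments method and predictorMatrix.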