Preface

R packages

In this practical, a number of R packages are used. If any of them are not installed, you may still be able to follow the practical but will not be able to run all of the code. The packages used (with the versions that were used to generate the solutions) are:

  • R version 3.6.0 (2019-04-26)
  • mice (version: 3.4.0)
  • JointAI (version: 0.5.1)
  • ggplot2 (version: 3.1.1)
  • reshape2 (version: 1.4.3)
  • ggpubr (version: 0.2)

Help files

You can find the help file for any function by adding a ? before the name of the function, e.g., ?mice.

Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html

Dataset

For this practical, we will use a subset of the NHANES dataset that we have seen in the previous practicals. It contains only those cases for which wgt is observed, and some columns that are not needed have been excluded.

Download the file NHANES_for_practicals_2.RData from here. To load this dataset, you can use load(file.choose()): file.choose() opens the explorer and allows you to navigate to the location of the file NHANES_for_practicals_2.RData on your computer, and load() then reads the selected file. If you know the path to the file, you can also use load("<path>/NHANES_for_practicals_2.RData").

Aim

The focus of this practical is the imputation of data that has features that require special attention.

In the interest of time, we will focus on these features and abbreviate steps that are the same as in any imputation setting (e.g., getting to know the data or checking that imputed values are realistic). Nevertheless, these steps are of course required when analysing data in practice.

Our aim is to fit the following linear regression model for weight:

We expect that the effects of cholesterol and HDL may differ with age, and, hence, include interaction terms between age and chol and HDL, respectively.

Additionally, we want to include the other variables in the dataset as auxiliary variables.

Imputation using mice

When the analysis model of interest involves interaction terms between incomplete variables, mice has limited options to reduce the bias that may be introduced by naive handling of the missing values.

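The “Just Another Variable” (JAV) approach, in which the interaction terms are calculated in the incomplete data and then imputed like any other variable, can reduce bias in some settings. Alternatively, we can use passive imputation, i.e., re-calculate the interaction terms from the imputed variables in each iteration of the MICE algorithm. Furthermore, predictive mean matching tends to lead to less bias than imputation models that assume a normal distribution.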

Just Another Variable approach

Task 1

  • Calculate the interaction terms in the incomplete data.
  • Perform the setup-run of mice() without any iterations.

Solution 1

## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##      wgt   gender     bili      age     chol      HDL      hgt     educ     race      SBP   hypten 
##       ""       ""   "norm"       ""   "norm"   "norm"   "norm"   "polr"       ""   "norm" "logreg" 
##       WC  agechol   ageHDL 
##   "norm"   "norm"   "norm" 
## PredictorMatrix:
##        wgt gender bili age chol HDL hgt educ race SBP hypten WC agechol ageHDL
## wgt      0      1    1   1    1   1   1    1    1   1      1  1       1      1
## gender   1      0    1   1    1   1   1    1    1   1      1  1       1      1
## bili     1      1    0   1    1   1   1    1    1   1      1  1       1      1
## age      1      1    1   0    1   1   1    1    1   1      1  1       1      1
## chol     1      1    1   1    0   1   1    1    1   1      1  1       1      1
## HDL      1      1    1   1    1   0   1    1    1   1      1  1       1      1
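
The setup described in Task 1 can be sketched as follows. This is a minimal example on simulated data, not the code that produced the output above: the data frame toy and all its values are hypothetical stand-ins for the NHANES subset, and the defaultMethod setting is an assumption inferred from the methods printed above ("norm", "logreg", "polr").

```r
library(mice)

# Simulated stand-in for the NHANES subset (hypothetical values, not the real data)
set.seed(2019)
n <- 200
toy <- data.frame(age  = round(runif(n, 20, 80)),
                  bili = rlnorm(n, meanlog = -0.5, sdlog = 0.4),
                  chol = rnorm(n, mean = 5, sd = 1),
                  HDL  = rnorm(n, mean = 1.4, sd = 0.4))
toy$wgt <- 60 + 0.1 * toy$age + 2 * toy$chol - 4 * toy$HDL + rnorm(n, sd = 8)
for (v in c("bili", "chol", "HDL")) toy[sample(n, 30), v] <- NA

# Interaction terms are calculated in the incomplete data, so they are
# missing wherever chol or HDL is missing
toy$agechol <- toy$age * toy$chol
toy$ageHDL  <- toy$age * toy$HDL

# Setup run: maxit = 0 creates the mids object without running any iterations
imp0 <- mice(toy, maxit = 0,
             defaultMethod = c("norm", "logreg", "polyreg", "polr"))
imp0
```

Printing imp0 shows the default imputation methods and the first rows of the predictor matrix, which are the two objects adjusted in the next task.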

Task 2

Apply the necessary change to the imputation method and predictor matrix.

Since the interaction terms are calculated from the original variables, these interaction terms should not be used to impute the original variables.

Solution 2

##      wgt   gender     bili      age     chol      HDL      hgt     educ     race      SBP   hypten 
##       ""       ""    "pmm"       ""   "norm"   "norm"   "norm"   "polr"       ""   "norm" "logreg" 
##       WC  agechol   ageHDL 
##   "norm"   "norm"   "norm"

##         wgt gender bili age chol HDL hgt educ race SBP hypten WC agechol ageHDL
## wgt       0      1    1   1    1   1   1    1    1   1      1  1       1      1
## gender    1      0    1   1    1   1   1    1    1   1      1  1       1      1
## bili      1      1    0   1    1   1   1    1    1   1      1  1       1      1
## age       1      1    1   0    1   1   1    1    1   1      1  1       1      1
## chol      1      1    1   1    0   1   1    1    1   1      1  1       0      1
## HDL       1      1    1   1    1   0   1    1    1   1      1  1       1      0
## hgt       1      1    1   1    1   1   0    1    1   1      1  1       1      1
## educ      1      1    1   1    1   1   1    0    1   1      1  1       1      1
## race      1      1    1   1    1   1   1    1    0   1      1  1       1      1
## SBP       1      1    1   1    1   1   1    1    1   0      1  1       1      1
## hypten    1      1    1   1    1   1   1    1    1   1      0  1       1      1
## WC        1      1    1   1    1   1   1    1    1   1      1  0       1      1
## agechol   1      1    1   1    1   1   1    1    1   1      1  1       0      1
## ageHDL    1      1    1   1    1   1   1    1    1   1      1  1       1      0
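
The changes visible in the output above (bili imputed with "pmm", and zeros in the chol/agechol and HDL/ageHDL cells of the predictor matrix) could be made along the following lines. This is a sketch on simulated data; toy is a hypothetical stand-in for the NHANES subset, and the names meth and pred follow the names used later in this practical.

```r
library(mice)

# Simulated stand-in for the NHANES subset (hypothetical values, not the real data)
set.seed(2019)
n <- 200
toy <- data.frame(age  = round(runif(n, 20, 80)),
                  bili = rlnorm(n, meanlog = -0.5, sdlog = 0.4),
                  chol = rnorm(n, mean = 5, sd = 1),
                  HDL  = rnorm(n, mean = 1.4, sd = 0.4))
toy$wgt <- 60 + 0.1 * toy$age + 2 * toy$chol - 4 * toy$HDL + rnorm(n, sd = 8)
for (v in c("bili", "chol", "HDL")) toy[sample(n, 30), v] <- NA
toy$agechol <- toy$age * toy$chol
toy$ageHDL  <- toy$age * toy$HDL

imp0 <- mice(toy, maxit = 0,
             defaultMethod = c("norm", "logreg", "polyreg", "polr"))

# Start from the defaults chosen by the setup run
meth <- imp0$method
pred <- imp0$predictorMatrix

# Use predictive mean matching for bili, as in the output above
meth["bili"] <- "pmm"

# Rows are the variables being imputed, columns are the predictors:
# do not use an interaction term to impute the variable it was derived from
pred["chol", "agechol"] <- 0
pred["HDL", "ageHDL"]   <- 0

meth
pred
```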

Task 3

Run the imputation using the JAV approach and check the traceplot.
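
A possible way to run the JAV imputation and inspect the traceplot is sketched below on simulated data. The data frame toy is a hypothetical stand-in for the NHANES subset, and the values of maxit, m and seed are assumptions chosen for illustration, not the settings used in the solution.

```r
library(mice)

# Simulated stand-in for the NHANES subset (hypothetical values, not the real data)
set.seed(2019)
n <- 200
toy <- data.frame(age  = round(runif(n, 20, 80)),
                  bili = rlnorm(n, meanlog = -0.5, sdlog = 0.4),
                  chol = rnorm(n, mean = 5, sd = 1),
                  HDL  = rnorm(n, mean = 1.4, sd = 0.4))
toy$wgt <- 60 + 0.1 * toy$age + 2 * toy$chol - 4 * toy$HDL + rnorm(n, sd = 8)
for (v in c("bili", "chol", "HDL")) toy[sample(n, 30), v] <- NA
toy$agechol <- toy$age * toy$chol
toy$ageHDL  <- toy$age * toy$HDL

# Setup run and JAV adjustments of method and predictor matrix
imp0 <- mice(toy, maxit = 0,
             defaultMethod = c("norm", "logreg", "polyreg", "polr"))
meth <- imp0$method
pred <- imp0$predictorMatrix
meth["bili"] <- "pmm"
pred["chol", "agechol"] <- 0
pred["HDL", "ageHDL"]   <- 0

# JAV imputation: the interaction terms are imputed like any other variable
# (maxit, m and seed are illustrative choices)
impJAV <- mice(toy, method = meth, predictorMatrix = pred,
               maxit = 10, m = 5, seed = 2019, printFlag = FALSE)

# Traceplot: mean and standard deviation of the imputed values per iteration
plot(impJAV)
```

Under JAV the imputed interaction terms are generally not equal to the product of the imputed variables, because each is drawn from its own imputation model.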

Task 4

We skip the more detailed evaluation of the imputed values. With the settings given in the solution, the chains have converged and the distributions of the imputed values match the distributions of the observed data closely enough.

  • Analyse the imputed data and pool the results.

Solution 4
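
Analysing and pooling could look like the sketch below, again on simulated data. The data frame toy is a hypothetical stand-in for the NHANES subset, the imputation settings are illustrative, and the model formula is a reduced, hypothetical version that contains only the simulated variables. With JAV, the interaction terms enter the analysis model as the imputed columns agechol and ageHDL.

```r
library(mice)

# Simulated stand-in for the NHANES subset (hypothetical values, not the real data)
set.seed(2019)
n <- 200
toy <- data.frame(age  = round(runif(n, 20, 80)),
                  bili = rlnorm(n, meanlog = -0.5, sdlog = 0.4),
                  chol = rnorm(n, mean = 5, sd = 1),
                  HDL  = rnorm(n, mean = 1.4, sd = 0.4))
toy$wgt <- 60 + 0.1 * toy$age + 2 * toy$chol - 4 * toy$HDL + rnorm(n, sd = 8)
for (v in c("bili", "chol", "HDL")) toy[sample(n, 30), v] <- NA
toy$agechol <- toy$age * toy$chol
toy$ageHDL  <- toy$age * toy$HDL

# JAV imputation with the adjusted method and predictor matrix
imp0 <- mice(toy, maxit = 0,
             defaultMethod = c("norm", "logreg", "polyreg", "polr"))
meth <- imp0$method
pred <- imp0$predictorMatrix
meth["bili"] <- "pmm"
pred["chol", "agechol"] <- 0
pred["HDL", "ageHDL"]   <- 0
impJAV <- mice(toy, method = meth, predictorMatrix = pred,
               maxit = 10, m = 5, seed = 2019, printFlag = FALSE)

# Fit the analysis model on each imputed dataset; the JAV columns are used
# directly instead of age:chol and age:HDL (reduced, hypothetical formula)
fitJAV <- with(impJAV, lm(wgt ~ bili + age + chol + HDL + agechol + ageHDL))

# Pool the results across the imputed datasets
resJAV <- summary(pool(fitJAV), conf.int = TRUE)
resJAV
```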

Passive Imputation

For the passive imputation, we can re-use the adjusted versions of meth and pred we created for the JAV approach, but additional changes to meth are necessary.

Task 1

Specify the new imputation method, i.e., adapt meth and save it as methPAS.

For passive imputation, instead of an imputation method, you need to specify the formula used to calculate the value that is imputed passively.
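
In mice, such a formula is given as a character string that starts with "~". A minimal sketch on simulated data follows; toy is a hypothetical stand-in for the NHANES subset, and meth is re-created here only so that the example runs on its own.

```r
library(mice)

# Simulated stand-in for the NHANES subset (hypothetical values, not the real data)
set.seed(2019)
n <- 200
toy <- data.frame(age  = round(runif(n, 20, 80)),
                  bili = rlnorm(n, meanlog = -0.5, sdlog = 0.4),
                  chol = rnorm(n, mean = 5, sd = 1),
                  HDL  = rnorm(n, mean = 1.4, sd = 0.4))
toy$wgt <- 60 + 0.1 * toy$age + 2 * toy$chol - 4 * toy$HDL + rnorm(n, sd = 8)
for (v in c("bili", "chol", "HDL")) toy[sample(n, 30), v] <- NA
toy$agechol <- toy$age * toy$chol
toy$ageHDL  <- toy$age * toy$HDL

# Method vector as adjusted for the JAV approach
imp0 <- mice(toy, maxit = 0,
             defaultMethod = c("norm", "logreg", "polyreg", "polr"))
meth <- imp0$method
meth["bili"] <- "pmm"

# Passive imputation: replace the method of each interaction term by the
# formula that re-calculates it from the (imputed) original variables
methPAS <- meth
methPAS[c("agechol", "ageHDL")] <- c("~I(age * chol)", "~I(age * HDL)")
methPAS
```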

Task 2

Run the imputation using passive imputation and check the traceplot.
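
The passive imputation run can be sketched as follows on simulated data. The data frame toy is a hypothetical stand-in for the NHANES subset, and maxit, m and seed are illustrative assumptions rather than the settings of the solution.

```r
library(mice)

# Simulated stand-in for the NHANES subset (hypothetical values, not the real data)
set.seed(2019)
n <- 200
toy <- data.frame(age  = round(runif(n, 20, 80)),
                  bili = rlnorm(n, meanlog = -0.5, sdlog = 0.4),
                  chol = rnorm(n, mean = 5, sd = 1),
                  HDL  = rnorm(n, mean = 1.4, sd = 0.4))
toy$wgt <- 60 + 0.1 * toy$age + 2 * toy$chol - 4 * toy$HDL + rnorm(n, sd = 8)
for (v in c("bili", "chol", "HDL")) toy[sample(n, 30), v] <- NA
toy$agechol <- toy$age * toy$chol
toy$ageHDL  <- toy$age * toy$HDL

# Re-use the JAV adjustments, then switch the interaction terms to passive
imp0 <- mice(toy, maxit = 0,
             defaultMethod = c("norm", "logreg", "polyreg", "polr"))
pred <- imp0$predictorMatrix
pred["chol", "agechol"] <- 0
pred["HDL", "ageHDL"]   <- 0
methPAS <- imp0$method
methPAS["bili"] <- "pmm"
methPAS[c("agechol", "ageHDL")] <- c("~I(age * chol)", "~I(age * HDL)")

# Run the imputation (maxit, m and seed are illustrative choices)
impPAS <- mice(toy, method = methPAS, predictorMatrix = pred,
               maxit = 10, m = 5, seed = 2019, printFlag = FALSE)

# Traceplot for the imputed variables
plot(impPAS)

# In a completed dataset the interaction equals the product of its parts
d <- complete(impPAS, 1)
max(abs(as.numeric(d$agechol) - d$age * d$chol))
```

The last line is a quick consistency check: unlike with JAV, the passively imputed interaction terms should agree with the product of the imputed variables up to rounding error.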