Imputation of Data with Non-linear Associations

Preface

R packages

In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:

R version 4.3.0 (2023-04-21 ucrt)
mice (version: 3.15.0)
JointAI (version: 1.0.5)
ggplot2 (version: 3.4.2)
reshape2 (version: 1.4.4)
ggpubr (version: 0.6.0)

Help files

You can find help files for any function by adding a ? before the name of the function.

Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html

Dataset

For this practical, we will use the EP16dat2 dataset, which is a subset of the NHANES (National Health and Nutrition Examination Survey) data.

To get the EP16dat2 dataset, load the file EP16dat2.RData. You can download it here.

To load this dataset into R, you can use the command file.choose() which opens the explorer and allows you to navigate to the location of the file on your computer.

If you know the path to the file, you can also use load("<path>/EP16dat2.RData").

To give you a head start, here some basic summary of this data:

str(EP16dat2)

## 'data.frame':    500 obs. of  12 variables:
##  $ wgt   : num  78 78 75.3 90.7 112 ...
##  $ gender: Factor w/ 2 levels "male","female": 1 1 2 1 2 1 2 2 1 1 ...
##  $ bili  : num  1.1 0.7 0.5 0.8 0.6 0.7 1.1 0.8 0.8 0.5 ...
##  $ age   : num  67 39 64 36 33 62 56 63 55 20 ...
##  $ chol  : num  6.13 4.65 4.14 3.47 6.31 4.47 6.41 5.51 7.01 3.75 ...
##  $ HDL   : num  1.09 1.14 1.29 1.37 1.27 0.85 1.81 2.38 2.79 1.03 ...
##  $ hgt   : num  1.75 1.78 1.63 1.93 1.73 ...
##  $ educ  : Ord.factor w/ 5 levels "Less than 9th grade"<..: 5 3 5 4 4 3 4 5 4 2 ...
##  $ race  : Factor w/ 5 levels "Mexican American",..: 5 3 5 3 4 5 4 5 3 3 ...
##  $ SBP   : num  139 103 NaN 115 107 ...
##  $ hypten: Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 NA 1 2 1 ...
##  $ WC    : num  91.6 84.5 91.6 95.4 119.6 ...

par(mar = c(3, 3, 2, 1), mgp = c(2, 0.6, 0))
JointAI::plot_all(EP16dat2, breaks = 50, nrow = 3)

Aim

The focus of this practical is the imputation of data that has features that require special attention.

In the interest of time, we will focus on these features and abbreviate steps that are the same as in any imputation setting (e.g., getting to know the data or checking that imputed values are realistic). Nevertheless, these steps are of course required when analysing data in practice.

Our aim is to impute data, so that we can later fit the following linear regression model for weight:

mod <- lm(wgt ~ gender + bili + age * (chol + HDL) + hgt)

(Note: this line of syntax will not run! This is just to give you the model specification.)

We expect that the effects of cholesterol and HDL may differ with age, and, hence, include interaction terms between age and chol and HDL, respectively.

Additionally, we want to include the other variables in the dataset as auxiliary variables.

Imputation using mice

When the analysis model of interest involves interaction terms between incomplete variables, mice has limited options to reduce the bias that may be introduced by naive handling of the missing values.

Use of the “Just Another Variable” approach can in some settings reduce bias. Alternatively, we can use passive imputation, i.e., calculate the interaction terms in each iteration of the MICE algorithm. Furthermore, predictive mean matching tends to lead to less bias than normal imputation models.

Just Another Variable approach

Task 1

Calculate the interaction terms in the incomplete data.
Perform the setup-run of mice() without any iterations.

Solution 1

# calculate the interaction terms
EP16dat2$agechol <- EP16dat2$age * EP16dat2$chol
EP16dat2$ageHDL <- EP16dat2$age * EP16dat2$HDL

# setup run
imp0 <- mice(EP16dat2, maxit = 0, 
             defaultMethod = c('norm', 'logreg', 'polyreg', 'polr'))
imp0

## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##      wgt   gender     bili      age     chol      HDL      hgt     educ     race      SBP   hypten 
##       ""       ""   "norm"       ""   "norm"   "norm"   "norm"   "polr"       ""   "norm" "logreg" 
##       WC  agechol   ageHDL 
##   "norm"   "norm"   "norm" 
## PredictorMatrix:
##        wgt gender bili age chol HDL hgt educ race SBP hypten WC agechol ageHDL
## wgt      0      1    1   1    1   1   1    1    1   1      1  1       1      1
## gender   1      0    1   1    1   1   1    1    1   1      1  1       1      1
## bili     1      1    0   1    1   1   1    1    1   1      1  1       1      1
## age      1      1    1   0    1   1   1    1    1   1      1  1       1      1
## chol     1      1    1   1    0   1   1    1    1   1      1  1       1      1
## HDL      1      1    1   1    1   0   1    1    1   1      1  1       1      1

Task 2

Apply the necessary changes to the imputation method and predictor matrix.

Since the interaction terms are calculated from the orignal variables, these interaction terms should not be used to impute the original variables.

Solution 2

meth <- imp0$method 
pred <- imp0$predictorMatrix

# change imputation for "bili" to pmm (to prevent negative values)
meth["bili"] <- 'pmm'
 
# changes in predictor matrix to prevent feedback from the interactions to the
# original variables
pred["chol", "agechol"] <- 0
pred["HDL", "ageHDL"] <- 0

meth

##      wgt   gender     bili      age     chol      HDL      hgt     educ     race      SBP   hypten 
##       ""       ""    "pmm"       ""   "norm"   "norm"   "norm"   "polr"       ""   "norm" "logreg" 
##       WC  agechol   ageHDL 
##   "norm"   "norm"   "norm"

pred

##         wgt gender bili age chol HDL hgt educ race SBP hypten WC agechol ageHDL
## wgt       0      1    1   1    1   1   1    1    1   1      1  1       1      1
## gender    1      0    1   1    1   1   1    1    1   1      1  1       1      1
## bili      1      1    0   1    1   1   1    1    1   1      1  1       1      1
## age       1      1    1   0    1   1   1    1    1   1      1  1       1      1
## chol      1      1    1   1    0   1   1    1    1   1      1  1       0      1
## HDL       1      1    1   1    1   0   1    1    1   1      1  1       1      0
## hgt       1      1    1   1    1   1   0    1    1   1      1  1       1      1
## educ      1      1    1   1    1   1   1    0    1   1      1  1       1      1
## race      1      1    1   1    1   1   1    1    0   1      1  1       1      1
## SBP       1      1    1   1    1   1   1    1    1   0      1  1       1      1
## hypten    1      1    1   1    1   1   1    1    1   1      0  1       1      1
## WC        1      1    1   1    1   1   1    1    1   1      1  0       1      1
## agechol   1      1    1   1    1   1   1    1    1   1      1  1       0      1
## ageHDL    1      1    1   1    1   1   1    1    1   1      1  1       1      0

Note:

A look at the histogram of bili tells us that the variable is not too skewed, but the distribution is relatively close to zero. And in a previous practical we already saw that there were negative values for bili when this variable was imputed with a linear regression model.

In practice, I would still try a normal distribution and only use predictive mean matching if the distribution of the imputed values is off.

Task 3

Run the imputation using the JAV approach and check the traceplot.

Solution 3

# run imputation with the JAV approach
impJAV <- mice(EP16dat2, method = meth, predictorMatrix = pred,
               maxit = 20, printFlag = FALSE, seed = 2020)

Note The argument printFlag prevents the lengthy output and the argument seed can be set to get reproducible results (this sets the seed value for the random number generator).

plot(impJAV, layout = c(4, 6))

Task 4

We skip the more detailed evaluation of the imputed values. With the settings given in the solution the chains have converged and distributions of the imputed values match the distributions of the observed data closely enough.

Analyse the imputed data and pool the results.

Solution 4

miraJAV <- with(impJAV, 
                lm(wgt  ~ gender + bili + age + chol + HDL + agechol + ageHDL + hgt))
summary(pool(miraJAV), conf.int = TRUE)

##           term     estimate   std.error statistic       df      p.value         2.5 %       97.5 %
## 1  (Intercept) -109.9006404 19.81204897 -5.547162 345.1777 5.786808e-08 -148.86817385 -70.93310696
## 2 genderfemale    3.9698542  1.90856165  2.080024 416.6427 3.813405e-02    0.21824403   7.72146428
## 3         bili   -7.4887824  2.91479097 -2.569235 235.8189 1.080863e-02  -13.23113824  -1.74642663
## 4          age    0.6765514  0.23133062  2.924608 325.0598 3.691235e-03    0.22145726   1.13164548
## 5         chol   10.0151253  2.20722706  4.537424 126.1109 1.309768e-05    5.64712509  14.38312556
## 6          HDL  -21.1952860  6.10635637 -3.471020 312.1504 5.915349e-04  -33.21010890  -9.18046317
## 7      agechol   -0.1577853  0.04245688 -3.716365 173.9412 2.722120e-04   -0.24158224  -0.07398827
## 8       ageHDL    0.1448126  0.12399776  1.167865 257.9545 2.439396e-01   -0.09936415   0.38898939
## 9          hgt   99.5806561  9.29987107 10.707746 328.0904 3.810318e-23   81.28575613 117.87555608

Passive Imputation

For the passive imputation, we can re-use the adjusted versions of meth and pred we created for the JAV approach, but additional changes to meth are necessary.

Task 1

Specify the new imputation method, i.e., adapt meth and save it as methPAS.

For passive imputation, instead of an imputation method you need to specify the formula used to calculate the value that is imputed passively.

Solution 1

# changes in imputation method for passive imputation
methPAS <- meth
methPAS[c("agechol", "ageHDL")] <- c("~I(age*chol)", "~I(age*HDL)")

Note: the visit sequence does not have to be changed here, because the variables that are being imputed passively are already last in the visit sequence:

imp0$visitSequence

##  [1] "wgt"     "gender"  "bili"    "age"     "chol"    "HDL"     "hgt"     "educ"    "race"   
## [10] "SBP"     "hypten"  "WC"      "agechol" "ageHDL"

Task 2

Run the imputation using passive imputation and check the traceplot.

Solution 2

# run imputation with passive imputation
impPAS <- mice(EP16dat2, method = methPAS, predictorMatrix = pred,
               maxit = 20, printFlag = FALSE, seed = 2020)

plot(impPAS, layout = c(4, 5))

Task 3

We will again skip the detailed evaluation of convergence and the imputed values.

Analyse the imputed data and pool the results.

Solution 3

miraPAS <- with(impPAS, 
                lm(wgt ~ gender + bili + age + chol + HDL + agechol + ageHDL + hgt))

summary(pool(miraPAS), conf.int = TRUE)

##           term     estimate  std.error statistic        df      p.value         2.5 %       97.5 %
## 1  (Intercept) -100.7599669 19.6078398 -5.138759 389.20359 4.385119e-07 -139.31050645 -62.20942743
## 2 genderfemale    3.8709508  1.9139558  2.022487 410.05791 4.377470e-02    0.10856145   7.63334016
## 3         bili   -6.3948346  3.0809505 -2.075604  66.23626 4.181688e-02  -12.54574123  -0.24392790
## 4          age    0.5275668  0.2277888  2.316035 402.14931 2.105850e-02    0.07976123   0.97537237
## 5         chol    8.5359003  2.1434765  3.982269 275.31302 8.741749e-05    4.31621389  12.75558674
## 6          HDL  -21.6226691  5.9733178 -3.619876 430.64217 3.297853e-04  -33.36315300  -9.88218521
## 7      agechol   -0.1287719  0.0415769 -3.097198 352.57687 2.110252e-03   -0.21054180  -0.04700196
## 8       ageHDL    0.1535878  0.1193828  1.286516 463.08705 1.989059e-01   -0.08101126   0.38818693
## 9          hgt   98.1348739  9.0180337 10.882070 486.33993 7.843135e-25   80.41575662 115.85399115

Imputation with JointAI

JointAI provides functions that allow us to fit Bayesian regression models with incomplete covariates. The main functions are designed to resemble the standard functions to fit regression models with complete data.

For univariate outcomes the following functions are available:

lm_imp(): for linear regression
lognorm_imp(): for regression using a log-normal distribution
betareg_imp(): for continuous outcomes in the interval \([0, 1]\)
glm_imp(): for generalized linear regression (e.g., logistic, gamma or Poisson)
clm_imp(): for ordinal (cumulative logit) regression
mlogit_imp(): for multinomial logit models

Note:

There will be a new version of JointAI (later this year) which will also provide functions to fit models using a log-normal or a beta distribution, and a function to fit multinomial logit models (for unordered categorical outcomes).

Specification of the analysis model

Similar to the complete data versions, the functions from JointAI take the following arguments:


`formula`	model formula
`data`	original, incomplete dataset
`family`	for glm’s: the distribution family of the outcome (e.g., `binomial()` for a logistic model)

Task 1

Specify the linear regression model with the model formula given at the beginning of this practical, using lm_imp().

You need to specify the arguments formula, data and n.iter. Set n.iter = 100 (we will learn more about this argument a little bit later).

Solution 1

lm1 <- lm_imp(wgt ~ gender + bili + age * (chol + HDL) + hgt, data = EP16dat2,
              n.iter = 100)

lm1

## 
## Call:
## lm_imp(formula = wgt ~ gender + bili + age * (chol + HDL) + hgt, 
##     data = EP16dat2, n.iter = 100)
## 
##  Bayesian linear model for "wgt" 
## 
## 
## Coefficients:
##  (Intercept) genderfemale         bili          age         chol          HDL          hgt 
##     -98.1158       3.4622      -6.8458       0.6221       9.3998     -21.6244      94.4908 
##     age:chol      age:HDL 
##      -0.1485       0.1514 
## 
## 
## Residual standard deviation:
## sigma_wgt 
##     15.59

Task 2

lm_imp() returns an object of class JointAI, which contains

the original data (data),
information on the type of model (call, analysis_type, models, fixed, random, hyperpars, scale_pars) and
information about the MCMC sampling (mcmc_settings),
the JAGS model (model) and
the MCMC sample (MCMC; if a sample was generated),
the computational time (time) of the MCMC sampling, and
some additional elements that are used by methods for objects of class JointAI but are typically not of interest for the user.

Check which imputation models were used in lm1.

The imputation model types are returned in the JointAI object under “models”.

Solution 2

lm1$models

##                     wgt                    bili                    chol                     HDL 
## "glm_gaussian_identity"                    "lm"                    "lm"                    "lm" 
##                     hgt 
##                    "lm"

Specification of the imputation models

In JointAI, there are some arguments related to the imputation part of the model:


`models`	vector of imputation methods (for details see below and the vignette on Model Specification)
`auxvars`	a formula of variables that are not part of the analysis model but should be used to predict missing values (optional; for details see the vignette on Model Specification)
`refcats`	allows specification of which category of categorical variables is used as reference (optional; for details see the vignette on Model Specification)
`trunc`	allow truncation of distributions of incomplete continuous covariates (for details see the vignette on Model Specification)

Like in mice default imputation models are chosen based on the class of each of the incomplete variables. The default choices for baseline (not time-varying) covariates are

name	model	variable type
`lm`	linear regression	continuous variables
`glm_logit`	logistic regression	factors with two levels
`mlogit`	multinomial logit model	unordered factors with >2 levels
`clm`	cumulative logit model	ordered factors with >2 levels

Alternative imputation methods are available for continuous baseline covariates:

name	model	variable type
`lognorm`	normal regression of the log-transformed variable	right-skewed variables >0
`glm_gamma_inverse`	Gamma regression (with inverse-link)	right-skewed variables >0
`glm_gamma_identity`	Gamma regression (with identity-link)	right-skewed variables >0
`glm_gamma_log`	Gamma regression (with log-link)	right-skewed variables >0
`beta`	beta regression (with logit-link)	continuous variables with values in [0, 1]

Task

Re-fit the linear regression model, but now

specify a log-normal or a gamma distribution for bili
set the reference category to the largest group
use the other variables that are in the data as auxiliary variables

To specify a non-default imputation method use the argument models = c(bili = ...).

To set the respective largest group as reference category for all categorical variables use the argument refcats = "largest".

Auxiliary variables need to be specified explicitely using the argument auxvars. It takes a formula of variable names.

Solution

lm2 <- lm_imp(wgt ~ gender + bili + age * (chol + HDL) + hgt, data = EP16dat2,
              auxvars = ~ educ + race + SBP + hypten + WC,
              models = c(bili = 'lognorm'), refcats = 'largest',
              n.iter = 100, seed = 2020)

Note:

By default, no output is printed during the adaptive phase. If you want to be sure that the model is running and your computer did not just get stuck you can set the argument quiet = FALSE.

Specification of the MCMC settings

Specification of the basic settings for the MCMC sampling can be done using the following arguments:


`n.chains`	number of MCMC chains
`n.adapt`	number of iterations used in the adaptive phase
`n.iter`	number of iterations per MCMC chain in the sampling phase

JointAI has more arguments than the ones given above, but in this practical we focus only on the most important. You may find out more about all the arguments in the vignette on MCMC Settings.

By default, n.chains = 3, n.adapt = 100 and n.iter = 0.

It is useful to use more than one chain to be able to evaluate convergence of the MCMC chains.

Samples in the adaptive phase are not used for the final MCMC sample. They are needed to optimize the MCMC sampler. When the number provided via the argument n.adapt is insufficient for this optimization a warning message will be printed. In simple models (e.g., linear regression) usually the default value of n.adapt = 100 can be used.

The default value for n.iter, the number of iterations in the sampling phase is 0 (no MCMC sample will be created). The number of iterations that is needed depends on how complex the model and the data is and can range from somewhere as low as 100 up to several million.

In the following, we will look at some criteria to determine if the number of MCMC samples that was used is sufficient.

Evaluation of the MCMC sample

The first step after fitting a Bayesian model should be to confirm that the MCMC chains have converged. This can be done visually, using a traceplot (plotting the sampled values per parameter and chain across iterations) or using the Gelman-Rubin criterion.

Task 1

Investigate convergence of lm2 by creating a traceplot using the function traceplot(). The plot should show a horizontal band without trends or patterns and the different chains should be mixed.

Solution 1

traceplot(lm2, use_ggplot = TRUE, ncol = 4) +
  theme(legend.position = 'none')

Note:

In the recent update of R to version 4.0.0, several things changed. These changes also affect the function traceplot() of JointAI when the (default) matplot() is use instead of the use_ggplot = TRUE option.

Currently, the standard output (when use_ggplot = TRUE is not used) does not work properly. This will be fixed in the next version of JointAI.

Task 2

Investigate convergence of lm2 using the Gelman-Rubin criterion, implemented in the function GR_crit().

The upper limit of the confidence interval (“Upper C.I.”) should not be much larger than 1.

Solution 2

GR_crit(lm2)

## Potential scale reduction factors:
## 
##              Point est. Upper C.I.
## (Intercept)       1.018       1.07
## genderfemale      1.036       1.13
## bili              0.997       1.00
## age               1.023       1.09
## chol              1.006       1.03
## HDL               1.003       1.02
## hgt               1.011       1.04
## age:chol          1.005       1.03
## age:HDL           1.007       1.04
## sigma_wgt         1.001       1.01
## 
## Multivariate psrf
## 
## 1.06

Continue

When we are satisfied with the convergence of the MCMC chains we can take a look at the MCMC sample is precise enough. We can do this by comparing the Monte Carlo error (which describes the error made since we have just a finite sample) to the estimated standard deviation. This is implemented in the function MC_error().

Task 3

Calculate the Monte Carlo error for lm2.

Solution 3

MC_error(lm2)

##                  est   MCSE     SD MCSE/SD
## (Intercept)  -105.95 1.1639 20.160   0.058
## genderfemale    4.17 0.1136  1.968   0.058
## bili           -7.12 0.1604  2.779   0.058
## age             0.61 0.0130  0.226   0.058
## chol            9.31 0.1125  1.949   0.058
## HDL           -21.89 0.3491  6.046   0.058
## hgt            99.49 0.5474  9.482   0.058
## age:chol       -0.14 0.0022  0.038   0.058
## age:HDL         0.14 0.0068  0.117   0.058
## sigma_wgt      15.35 0.0266  0.460   0.058

The output shows us the posterior mean (est), and standard deviaiton SD, as well as the Monte Carlo error (MCSE) and the ratio of the Monte Carlo error to the posterior standard deviation (MCSE/SD).

This ratio should not bee too large. A rule of thumb is that it should be less than 5%.

To get a quick overview of this ratio we can also plot it:

par(mar = c(3.2, 5, 1, 1), mgp = c(2, 0.6, 0))
plot(MC_error(lm2))

The Monte Carlo error is approximately 6% for all our model parameters. We could reduce this proportion by increasing the number of iterations (n.iter), but for the purpose of this practical the precision is acceptable.

Results

Task

Get the summary of the model using the function summary().

Solution

summary(lm2)

## 
## Bayesian linear model fitted with JointAI
## 
## Call:
## lm_imp(formula = wgt ~ gender + bili + age * (chol + HDL) + hgt, 
##     data = EP16dat2, n.iter = 100, auxvars = ~educ + race + SBP + 
##         hypten + WC, refcats = "largest", models = c(bili = "lognorm"), 
##     seed = 2020)
## 
## 
## Posterior summary:
##                  Mean      SD      2.5%    97.5% tail-prob. GR-crit MCE/SD
## (Intercept)  -105.949 20.1602 -146.4318 -67.7366    0.00000    1.07 0.0577
## genderfemale    4.171  1.9675    0.3303   8.1145    0.04000    1.13 0.0577
## bili           -7.122  2.7790  -12.3151  -1.8318    0.00667    1.00 0.0577
## age             0.608  0.2258    0.1616   1.0170    0.00667    1.09 0.0577
## chol            9.305  1.9487    5.5637  13.0444    0.00000    1.03 0.0577
## HDL           -21.893  6.0462  -34.0475 -10.6014    0.00000    1.02 0.0577
## hgt            99.486  9.4816   82.1896 116.8122    0.00000    1.04 0.0577
## age:chol       -0.143  0.0381   -0.2102  -0.0709    0.00000    1.03 0.0577
## age:HDL         0.145  0.1172   -0.0846   0.3670    0.20667    1.04 0.0577
## 
## Posterior summary of residual std. deviation:
##           Mean   SD 2.5% 97.5% GR-crit MCE/SD
## sigma_wgt 15.4 0.46 14.4  16.2    1.01 0.0577
## 
## 
## MCMC settings:
## Iterations = 101:200
## Sample size per chain = 100 
## Thinning interval = 1 
## Number of chains = 3 
## 
## Number of observations: 500

Additional exercise JointAI

Monitoring imputed values

JointAI also allows us to extract imputed values to generate multiple imputed datasets that can, for instance, be used for a secondary analysis.

To be able to extract the imputed values, it is necessary to specify in advance that these values should be monitored (“recorded”). This can be done using the argument monitor_params.

monitor_params uses a number of key words to specify which (groups of) parameters or values should be monitored. Each key word works like a switch and can be set to TRUE or FALSE. For more details see the vignette on Parameter Selection.

For monitoring imputed values, monitor_params = c(imps = TRUE) needs to be specified.

Task

Re-fit the linear regression model, but now additionally set the argument monitor_params to keep the imputed values.

Solution

lm3 <- lm_imp(wgt ~ gender + bili + age * (chol + HDL) + hgt, data = EP16dat2,
               auxvars = ~ educ + race + SBP + hypten + WC,
               models = c(bili = 'lognorm'), refcats = 'largest',
               n.iter = 250, monitor_params = c(imps = TRUE), seed = 2020)

Extracting imputed data

The function get_MIdat() allows us to create multiple completed datasets from an object of class JointAI.

It takes the following arguments


`object`	an object of class `JointAI`
`m`	number of imputed datasets
`include`	should the original, incomplete data be included? (default is `TRUE`)
`start`	first iteration of interest; allows discarding burn-in iterations
`minspace`	minimum number of iterations between iterations chosen as imputed values (default is 50)
`seed`	optional seed value
`export_to_SPSS`	logical; should the completed data be exported to SPSS?
`resdir`	optional directory for results (for export to SPSS)
`filename`	optional file name (for export to SPSS)

Task 1

Extract 10 imputed datasets from lm3.
Inspect the resulting object.

Solution 1

MIdat3 <- get_MIdat(lm3, m = 10)

head(MIdat3)

##      Imputation_       wgt gender bili age chol  HDL    hgt                 educ               race
## 597            0  78.01789   male  1.1  67 6.13 1.09 1.7526     College or above              other
## 159            0  78.01789   male  0.7  39 4.65 1.14 1.7780 High school graduate Non-Hispanic White
## 2121           0  75.29633 female  0.5  64 4.14 1.29 1.6256     College or above              other
## 2549           0  90.71847   male  0.8  36 3.47 1.37 1.9304         some college Non-Hispanic White
## 1172           0 112.03732 female  0.6  33 6.31 1.27 1.7272         some college Non-Hispanic Black
## 1199           0  58.96701   male  0.7  62 4.47 0.85 1.6002 High school graduate              other
##           SBP hypten    WC agechol ageHDL .rownr .id
## 597  138.6667    yes  91.6  410.71  73.03      1   1
## 159  102.6667     no  84.5  181.35  44.46      2   2
## 2121       NA    yes  91.6  264.96  82.56      3   3
## 2549 115.3333    yes  95.4  124.92  49.32      4   4
## 1172 106.6667     no 119.6  208.23  41.91      5   5
## 1199 140.6667    yes  89.0  277.14  52.70      6   6

We see that some columns were added to the data:

Imputation_ identifies the imputation number
.id is the subject ID
.rownr refers to the row number of the original data

summary(MIdat3)

##   Imputation_      wgt            gender          bili             age             chol       
##  Min.   : 0   Min.   : 39.01   male  :2772   Min.   :0.2000   Min.   :20.00   Min.   : 2.030  
##  1st Qu.: 2   1st Qu.: 65.20   female:2728   1st Qu.:0.6000   1st Qu.:31.00   1st Qu.: 4.268  
##  Median : 5   Median : 76.20                 Median :0.7000   Median :43.00   Median : 4.860  
##  Mean   : 5   Mean   : 78.25                 Mean   :0.7441   Mean   :45.02   Mean   : 4.995  
##  3rd Qu.: 8   3rd Qu.: 86.41                 3rd Qu.:0.9000   3rd Qu.:58.00   3rd Qu.: 5.640  
##  Max.   :10   Max.   :167.38                 Max.   :2.9000   Max.   :79.00   Max.   :10.680  
##                                              NA's   :47                       NA's   :41      
##       HDL              hgt                          educ                      race     
##  Min.   :0.2507   Min.   :1.397   Less than 9th grade : 341   Mexican American  : 572  
##  1st Qu.:1.1100   1st Qu.:1.626   9-11th grade        : 762   Other Hispanic    : 638  
##  Median :1.3400   Median :1.676   High school graduate:1267   Non-Hispanic White:2002  
##  Mean   :1.3979   Mean   :1.685   some college        :1631   Non-Hispanic Black:1232  
##  3rd Qu.:1.6000   3rd Qu.:1.753   College or above    :1498   other             :1056  
##  Max.   :3.1300   Max.   :1.930   NA's                :   1                            
##  NA's   :41       NA's   :11                                                           
##       SBP          hypten           WC            agechol           ageHDL           .rownr     
##  Min.   : 77.79   no  :4062   Min.   : 61.90   Min.   : 45.54   Min.   :  7.92   Min.   :  1.0  
##  1st Qu.:109.33   yes :1417   1st Qu.: 85.00   1st Qu.:140.16   1st Qu.: 39.37   1st Qu.:125.8  
##  Median :118.67   NA's:  21   Median : 95.40   Median :222.31   Median : 57.40   Median :250.5  
##  Mean   :120.35               Mean   : 96.29   Mean   :227.71   Mean   : 63.14   Mean   :250.5  
##  3rd Qu.:129.33               3rd Qu.:105.10   3rd Qu.:296.40   3rd Qu.: 80.04   3rd Qu.:375.2  
##  Max.   :202.00               Max.   :154.70   Max.   :725.90   Max.   :171.36   Max.   :500.0  
##  NA's   :29                   NA's   :23       NA's   :451      NA's   :451                     
##       .id       
##  Min.   :  1.0  
##  1st Qu.:125.8  
##  Median :250.5  
##  Mean   :250.5  
##  3rd Qu.:375.2  
##  Max.   :500.0  
##

Task 2

Similar to the functions densplot() from the mice package and propplot(), the function plot_imp_distr() from JointAI allows us to plot the distribution of the observed and imputed values for the incomplete variables.

It takes the following arguments


`data`	a data.frame in long format containing multiple imputations (and the original incomplete data)
`imp`	the name of the variable specifying the imputation indicator
`id`	the name of the variable specifying the subject indicator
`rownr`	the name of a variable identifying which rows correspond to the same observation in the original (unimputed) data
`ncol`, `nrow`	optional number of rows and columns in the plot layout; automatically chosen if unspecified

Check the imputed values in MIdat3 using plot_imp_distr().

Solution 2

plot_imp_distr(MIdat3)

Transforming imputed data to a mids object

To perform standard analyses on the imputed data it is usefull to convert them to a mids object, so that they can be treated as if they had been imputed with mice().

The mice package proves the function as.mids() to convert a long-format dataset (with original and multiple imputed datasets stacked onto each other) to a mids object.

Task

Transform MIdat3 to a mids object and confirm that it has worked by checking the class of the result.

Solution

mids3 <- mice::as.mids(MIdat3, .imp = "Imputation_", .id = '.id')

class(mids3)

## [1] "mids"