In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:
mice
(version: 3.15.0)JointAI
(version: 1.0.5)ggplot2
(version: 3.4.2)reshape2
(version: 1.4.4)ggpubr
(version: 0.6.0)You can find help files for any function by adding a ?
before the name of the function.
Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html
For this practical, we will use the EP16dat2 dataset, which is a subset of the NHANES (National Health and Nutrition Examination Survey) data.
To get the EP16dat2 dataset, load the file
EP16dat2.RData
. You can download it here.
To load this dataset into R, you can use the command
file.choose()
which opens the explorer and allows you to
navigate to the location of the file on your computer.
If you know the path to the file, you can also use
load("<path>/EP16dat2.RData")
.
To give you a head start, here some basic summary of this data:
str(EP16dat2)
## 'data.frame': 500 obs. of 12 variables:
## $ wgt : num 78 78 75.3 90.7 112 ...
## $ gender: Factor w/ 2 levels "male","female": 1 1 2 1 2 1 2 2 1 1 ...
## $ bili : num 1.1 0.7 0.5 0.8 0.6 0.7 1.1 0.8 0.8 0.5 ...
## $ age : num 67 39 64 36 33 62 56 63 55 20 ...
## $ chol : num 6.13 4.65 4.14 3.47 6.31 4.47 6.41 5.51 7.01 3.75 ...
## $ HDL : num 1.09 1.14 1.29 1.37 1.27 0.85 1.81 2.38 2.79 1.03 ...
## $ hgt : num 1.75 1.78 1.63 1.93 1.73 ...
## $ educ : Ord.factor w/ 5 levels "Less than 9th grade"<..: 5 3 5 4 4 3 4 5 4 2 ...
## $ race : Factor w/ 5 levels "Mexican American",..: 5 3 5 3 4 5 4 5 3 3 ...
## $ SBP : num 139 103 NaN 115 107 ...
## $ hypten: Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 NA 1 2 1 ...
## $ WC : num 91.6 84.5 91.6 95.4 119.6 ...
par(mar = c(3, 3, 2, 1), mgp = c(2, 0.6, 0))
::plot_all(EP16dat2, breaks = 50, nrow = 3) JointAI
The focus of this practical is the imputation of data that has features that require special attention.
In the interest of time, we will focus on these features and abbreviate steps that are the same as in any imputation setting (e.g., getting to know the data or checking that imputed values are realistic). Nevertheless, these steps are of course required when analysing data in practice.
Our aim is to impute data, so that we can later fit the following linear regression model for weight:
<- lm(wgt ~ gender + bili + age * (chol + HDL) + hgt) mod
(Note: this line of syntax will not run! This is just to give you the model specification.)
We expect that the effects of cholesterol and HDL may differ with
age, and, hence, include interaction terms between
age
and chol
and HDL
,
respectively.
Additionally, we want to include the other variables in the dataset as auxiliary variables.
When the analysis model of interest involves interaction terms between incomplete variables, mice has limited options to reduce the bias that may be introduced by naive handling of the missing values.
Use of the “Just Another Variable” approach can in some settings reduce bias. Alternatively, we can use passive imputation, i.e., calculate the interaction terms in each iteration of the MICE algorithm. Furthermore, predictive mean matching tends to lead to less bias than normal imputation models.
mice()
without any
iterations.# calculate the interaction terms
$agechol <- EP16dat2$age * EP16dat2$chol
EP16dat2$ageHDL <- EP16dat2$age * EP16dat2$HDL
EP16dat2
# setup run
<- mice(EP16dat2, maxit = 0,
imp0 defaultMethod = c('norm', 'logreg', 'polyreg', 'polr'))
imp0
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## wgt gender bili age chol HDL hgt educ race SBP hypten
## "" "" "norm" "" "norm" "norm" "norm" "polr" "" "norm" "logreg"
## WC agechol ageHDL
## "norm" "norm" "norm"
## PredictorMatrix:
## wgt gender bili age chol HDL hgt educ race SBP hypten WC agechol ageHDL
## wgt 0 1 1 1 1 1 1 1 1 1 1 1 1 1
## gender 1 0 1 1 1 1 1 1 1 1 1 1 1 1
## bili 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## age 1 1 1 0 1 1 1 1 1 1 1 1 1 1
## chol 1 1 1 1 0 1 1 1 1 1 1 1 1 1
## HDL 1 1 1 1 1 0 1 1 1 1 1 1 1 1
Apply the necessary changes to the imputation method and predictor matrix.
<- imp0$method
meth <- imp0$predictorMatrix
pred
# change imputation for "bili" to pmm (to prevent negative values)
"bili"] <- 'pmm'
meth[
# changes in predictor matrix to prevent feedback from the interactions to the
# original variables
"chol", "agechol"] <- 0
pred["HDL", "ageHDL"] <- 0
pred[
meth
## wgt gender bili age chol HDL hgt educ race SBP hypten
## "" "" "pmm" "" "norm" "norm" "norm" "polr" "" "norm" "logreg"
## WC agechol ageHDL
## "norm" "norm" "norm"
pred
## wgt gender bili age chol HDL hgt educ race SBP hypten WC agechol ageHDL
## wgt 0 1 1 1 1 1 1 1 1 1 1 1 1 1
## gender 1 0 1 1 1 1 1 1 1 1 1 1 1 1
## bili 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## age 1 1 1 0 1 1 1 1 1 1 1 1 1 1
## chol 1 1 1 1 0 1 1 1 1 1 1 1 0 1
## HDL 1 1 1 1 1 0 1 1 1 1 1 1 1 0
## hgt 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## educ 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## race 1 1 1 1 1 1 1 1 0 1 1 1 1 1
## SBP 1 1 1 1 1 1 1 1 1 0 1 1 1 1
## hypten 1 1 1 1 1 1 1 1 1 1 0 1 1 1
## WC 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## agechol 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## ageHDL 1 1 1 1 1 1 1 1 1 1 1 1 1 0
Note:
A look at the histogram of bili
tells us that the
variable is not too skewed, but the distribution is relatively close to
zero. And in a previous practical we already saw that there were
negative values for bili
when this variable was imputed
with a linear regression model.
In practice, I would still try a normal distribution and only use predictive mean matching if the distribution of the imputed values is off.
Run the imputation using the JAV approach and check the traceplot.
# run imputation with the JAV approach
<- mice(EP16dat2, method = meth, predictorMatrix = pred,
impJAV maxit = 20, printFlag = FALSE, seed = 2020)
Note The argument printFlag
prevents the lengthy output and the argument seed
can be
set to get reproducible results (this sets the seed value for the random
number generator).
plot(impJAV, layout = c(4, 6))
We skip the more detailed evaluation of the imputed values. With the settings given in the solution the chains have converged and distributions of the imputed values match the distributions of the observed data closely enough.
Analyse the imputed data and pool the results.
<- with(impJAV,
miraJAV lm(wgt ~ gender + bili + age + chol + HDL + agechol + ageHDL + hgt))
summary(pool(miraJAV), conf.int = TRUE)
## term estimate std.error statistic df p.value 2.5 % 97.5 %
## 1 (Intercept) -109.9006404 19.81204897 -5.547162 345.1777 5.786808e-08 -148.86817385 -70.93310696
## 2 genderfemale 3.9698542 1.90856165 2.080024 416.6427 3.813405e-02 0.21824403 7.72146428
## 3 bili -7.4887824 2.91479097 -2.569235 235.8189 1.080863e-02 -13.23113824 -1.74642663
## 4 age 0.6765514 0.23133062 2.924608 325.0598 3.691235e-03 0.22145726 1.13164548
## 5 chol 10.0151253 2.20722706 4.537424 126.1109 1.309768e-05 5.64712509 14.38312556
## 6 HDL -21.1952860 6.10635637 -3.471020 312.1504 5.915349e-04 -33.21010890 -9.18046317
## 7 agechol -0.1577853 0.04245688 -3.716365 173.9412 2.722120e-04 -0.24158224 -0.07398827
## 8 ageHDL 0.1448126 0.12399776 1.167865 257.9545 2.439396e-01 -0.09936415 0.38898939
## 9 hgt 99.5806561 9.29987107 10.707746 328.0904 3.810318e-23 81.28575613 117.87555608
For the passive imputation, we can re-use the adjusted versions of
meth
and pred
we created for the JAV approach,
but additional changes to meth
are necessary.
Specify the new imputation method, i.e., adapt meth
and
save it as methPAS
.
# changes in imputation method for passive imputation
<- meth
methPAS c("agechol", "ageHDL")] <- c("~I(age*chol)", "~I(age*HDL)") methPAS[
Note: the visit sequence does not have to be changed here, because the variables that are being imputed passively are already last in the visit sequence:
$visitSequence imp0
## [1] "wgt" "gender" "bili" "age" "chol" "HDL" "hgt" "educ" "race"
## [10] "SBP" "hypten" "WC" "agechol" "ageHDL"
Run the imputation using passive imputation and check the traceplot.
# run imputation with passive imputation
<- mice(EP16dat2, method = methPAS, predictorMatrix = pred,
impPAS maxit = 20, printFlag = FALSE, seed = 2020)
plot(impPAS, layout = c(4, 5))
We will again skip the detailed evaluation of convergence and the imputed values.
Analyse the imputed data and pool the results.
<- with(impPAS,
miraPAS lm(wgt ~ gender + bili + age + chol + HDL + agechol + ageHDL + hgt))
summary(pool(miraPAS), conf.int = TRUE)
## term estimate std.error statistic df p.value 2.5 % 97.5 %
## 1 (Intercept) -100.7599669 19.6078398 -5.138759 389.20359 4.385119e-07 -139.31050645 -62.20942743
## 2 genderfemale 3.8709508 1.9139558 2.022487 410.05791 4.377470e-02 0.10856145 7.63334016
## 3 bili -6.3948346 3.0809505 -2.075604 66.23626 4.181688e-02 -12.54574123 -0.24392790
## 4 age 0.5275668 0.2277888 2.316035 402.14931 2.105850e-02 0.07976123 0.97537237
## 5 chol 8.5359003 2.1434765 3.982269 275.31302 8.741749e-05 4.31621389 12.75558674
## 6 HDL -21.6226691 5.9733178 -3.619876 430.64217 3.297853e-04 -33.36315300 -9.88218521
## 7 agechol -0.1287719 0.0415769 -3.097198 352.57687 2.110252e-03 -0.21054180 -0.04700196
## 8 ageHDL 0.1535878 0.1193828 1.286516 463.08705 1.989059e-01 -0.08101126 0.38818693
## 9 hgt 98.1348739 9.0180337 10.882070 486.33993 7.843135e-25 80.41575662 115.85399115
JointAI provides functions that allow us to fit Bayesian regression models with incomplete covariates. The main functions are designed to resemble the standard functions to fit regression models with complete data.
For univariate outcomes the following functions are available:
lm_imp()
: for linear regressionlognorm_imp()
: for regression using a log-normal
distributionbetareg_imp()
: for continuous outcomes in the interval
\([0, 1]\)glm_imp()
: for generalized linear regression (e.g.,
logistic, gamma or Poisson)clm_imp()
: for ordinal (cumulative logit)
regressionmlogit_imp()
: for multinomial logit modelsNote:
There will be a new version of JointAI (later this year) which will also provide functions to fit models using a log-normal or a beta distribution, and a function to fit multinomial logit models (for unordered categorical outcomes).
formula
|
model formula |
data
|
original, incomplete dataset |
family
|
for glm’s: the distribution family of the outcome (e.g.,
binomial() for a logistic model)
|
Specify the linear regression model with the model formula given at
the beginning of this practical, using lm_imp()
.
You need to specify the arguments formula
,
data
and n.iter
. Set n.iter = 100
(we will learn more about this argument a little bit later).
<- lm_imp(wgt ~ gender + bili + age * (chol + HDL) + hgt, data = EP16dat2,
lm1 n.iter = 100)
lm1
##
## Call:
## lm_imp(formula = wgt ~ gender + bili + age * (chol + HDL) + hgt,
## data = EP16dat2, n.iter = 100)
##
## Bayesian linear model for "wgt"
##
##
## Coefficients:
## (Intercept) genderfemale bili age chol HDL hgt
## -98.1158 3.4622 -6.8458 0.6221 9.3998 -21.6244 94.4908
## age:chol age:HDL
## -0.1485 0.1514
##
##
## Residual standard deviation:
## sigma_wgt
## 15.59
lm_imp()
returns an object of class
JointAI
, which contains
data
),call
,
analysis_type
, models
, fixed
,
random
, hyperpars
, scale_pars
)
andmcmc_settings
),model
) andMCMC
; if a sample was generated),time
) of the MCMC sampling,
andJointAI
but are typically not of interest for the
user.Check which imputation models were used in lm1
.
$models lm1
## wgt bili chol HDL
## "glm_gaussian_identity" "lm" "lm" "lm"
## hgt
## "lm"
models
|
vector of imputation methods (for details see below and the vignette on Model Specification) |
auxvars
|
a formula of variables that are not part of the analysis model but should be used to predict missing values (optional; for details see the vignette on Model Specification) |
refcats
|
allows specification of which category of categorical variables is used as reference (optional; for details see the vignette on Model Specification) |
trunc
|
allow truncation of distributions of incomplete continuous covariates (for details see the vignette on Model Specification) |
class
of each of the incomplete variables. The
default choices for baseline (not time-varying) covariates are
name | model | variable type |
---|---|---|
lm
|
linear regression | continuous variables |
glm_logit
|
logistic regression | factors with two levels |
mlogit
|
multinomial logit model | unordered factors with >2 levels |
clm
|
cumulative logit model | ordered factors with >2 levels |
name | model | variable type |
---|---|---|
lognorm
|
normal regression of the log-transformed variable | right-skewed variables >0 |
glm_gamma_inverse
|
Gamma regression (with inverse-link) | right-skewed variables >0 |
glm_gamma_identity
|
Gamma regression (with identity-link) | right-skewed variables >0 |
glm_gamma_log
|
Gamma regression (with log-link) | right-skewed variables >0 |
beta
|
beta regression (with logit-link) | continuous variables with values in [0, 1] |
Re-fit the linear regression model, but now
bili
models = c(bili = ...)
.
refcats = "largest"
.
auxvars
. It takes a formula of variable names.
<- lm_imp(wgt ~ gender + bili + age * (chol + HDL) + hgt, data = EP16dat2,
lm2 auxvars = ~ educ + race + SBP + hypten + WC,
models = c(bili = 'lognorm'), refcats = 'largest',
n.iter = 100, seed = 2020)
Note:
By default, no output is printed during the adaptive phase. If you
want to be sure that the model is running and your computer did not just
get stuck you can set the argument quiet = FALSE
.
n.chains
|
number of MCMC chains |
n.adapt
|
number of iterations used in the adaptive phase |
n.iter
|
number of iterations per MCMC chain in the sampling phase |
JointAI has more arguments than the ones given above, but in this practical we focus only on the most important. You may find out more about all the arguments in the vignette on MCMC Settings.
By default, n.chains = 3
, n.adapt = 100
and
n.iter = 0
.
It is useful to use more than one chain to be able to evaluate convergence of the MCMC chains.
Samples in the adaptive phase are not used for the final MCMC sample.
They are needed to optimize the MCMC sampler. When the number provided
via the argument n.adapt
is insufficient for this
optimization a warning message will be printed. In simple models (e.g.,
linear regression) usually the default value of
n.adapt = 100
can be used.
The default value for n.iter
, the number of iterations
in the sampling phase is 0
(no MCMC sample will be
created). The number of iterations that is needed depends on how complex
the model and the data is and can range from somewhere as low as 100 up
to several million.
In the following, we will look at some criteria to determine if the number of MCMC samples that was used is sufficient.
The first step after fitting a Bayesian model should be to confirm that the MCMC chains have converged. This can be done visually, using a traceplot (plotting the sampled values per parameter and chain across iterations) or using the Gelman-Rubin criterion.
Investigate convergence of lm2
by creating a traceplot
using the function traceplot()
. The plot should show a
horizontal band without trends or patterns and the different chains
should be mixed.
traceplot(lm2, use_ggplot = TRUE, ncol = 4) +
theme(legend.position = 'none')
Note:
In the recent update of R to version 4.0.0, several things changed.
These changes also affect the function traceplot()
of
JointAI when the (default) matplot()
is
use instead of the use_ggplot = TRUE
option.
Currently, the standard output (when use_ggplot = TRUE
is not used) does not work properly. This will be fixed in the next
version of JointAI.
Investigate convergence of lm2
using the Gelman-Rubin
criterion, implemented in the function GR_crit()
.
The upper limit of the confidence interval
(“Upper C.I.
”) should not be much larger than 1.
GR_crit(lm2)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## (Intercept) 1.018 1.07
## genderfemale 1.036 1.13
## bili 0.997 1.00
## age 1.023 1.09
## chol 1.006 1.03
## HDL 1.003 1.02
## hgt 1.011 1.04
## age:chol 1.005 1.03
## age:HDL 1.007 1.04
## sigma_wgt 1.001 1.01
##
## Multivariate psrf
##
## 1.06
When we are satisfied with the convergence of the MCMC chains we can
take a look at the MCMC sample is precise enough. We can do this by
comparing the Monte Carlo error (which describes the error made since we
have just a finite sample) to the estimated standard deviation. This is
implemented in the function MC_error()
.
Calculate the Monte Carlo error for lm2
.
MC_error(lm2)
## est MCSE SD MCSE/SD
## (Intercept) -105.95 1.1639 20.160 0.058
## genderfemale 4.17 0.1136 1.968 0.058
## bili -7.12 0.1604 2.779 0.058
## age 0.61 0.0130 0.226 0.058
## chol 9.31 0.1125 1.949 0.058
## HDL -21.89 0.3491 6.046 0.058
## hgt 99.49 0.5474 9.482 0.058
## age:chol -0.14 0.0022 0.038 0.058
## age:HDL 0.14 0.0068 0.117 0.058
## sigma_wgt 15.35 0.0266 0.460 0.058
The output shows us the posterior mean (est
), and
standard deviaiton SD
, as well as the Monte Carlo error
(MCSE
) and the ratio of the Monte Carlo error to the
posterior standard deviation (MCSE/SD
).
This ratio should not bee too large. A rule of thumb is that it should be less than 5%.
To get a quick overview of this ratio we can also plot it:
par(mar = c(3.2, 5, 1, 1), mgp = c(2, 0.6, 0))
plot(MC_error(lm2))
The Monte Carlo error is approximately 6% for all our model
parameters. We could reduce this proportion by increasing the number of
iterations (n.iter
), but for the purpose of this practical
the precision is acceptable.
Get the summary of the model using the function
summary()
.
summary(lm2)
##
## Bayesian linear model fitted with JointAI
##
## Call:
## lm_imp(formula = wgt ~ gender + bili + age * (chol + HDL) + hgt,
## data = EP16dat2, n.iter = 100, auxvars = ~educ + race + SBP +
## hypten + WC, refcats = "largest", models = c(bili = "lognorm"),
## seed = 2020)
##
##
## Posterior summary:
## Mean SD 2.5% 97.5% tail-prob. GR-crit MCE/SD
## (Intercept) -105.949 20.1602 -146.4318 -67.7366 0.00000 1.07 0.0577
## genderfemale 4.171 1.9675 0.3303 8.1145 0.04000 1.13 0.0577
## bili -7.122 2.7790 -12.3151 -1.8318 0.00667 1.00 0.0577
## age 0.608 0.2258 0.1616 1.0170 0.00667 1.09 0.0577
## chol 9.305 1.9487 5.5637 13.0444 0.00000 1.03 0.0577
## HDL -21.893 6.0462 -34.0475 -10.6014 0.00000 1.02 0.0577
## hgt 99.486 9.4816 82.1896 116.8122 0.00000 1.04 0.0577
## age:chol -0.143 0.0381 -0.2102 -0.0709 0.00000 1.03 0.0577
## age:HDL 0.145 0.1172 -0.0846 0.3670 0.20667 1.04 0.0577
##
## Posterior summary of residual std. deviation:
## Mean SD 2.5% 97.5% GR-crit MCE/SD
## sigma_wgt 15.4 0.46 14.4 16.2 1.01 0.0577
##
##
## MCMC settings:
## Iterations = 101:200
## Sample size per chain = 100
## Thinning interval = 1
## Number of chains = 3
##
## Number of observations: 500
JointAI also allows us to extract imputed values to generate multiple imputed datasets that can, for instance, be used for a secondary analysis.
To be able to extract the imputed values, it is necessary to specify
in advance that these values should be monitored (“recorded”). This can
be done using the argument monitor_params
.
monitor_params
uses a number of key words to specify
which (groups of) parameters or values should be monitored. Each key
word works like a switch and can be set to TRUE
or
FALSE
. For more details see the vignette on Parameter
Selection.
For monitoring imputed values,
monitor_params = c(imps = TRUE)
needs to be specified.
Re-fit the linear regression model, but now additionally set the
argument monitor_params
to keep the imputed values.
<- lm_imp(wgt ~ gender + bili + age * (chol + HDL) + hgt, data = EP16dat2,
lm3 auxvars = ~ educ + race + SBP + hypten + WC,
models = c(bili = 'lognorm'), refcats = 'largest',
n.iter = 250, monitor_params = c(imps = TRUE), seed = 2020)
The function get_MIdat()
allows us to create multiple
completed datasets from an object of class JointAI
.
object
|
an object of class JointAI
|
m
|
number of imputed datasets |
include
|
should the original, incomplete data be included? (default is
TRUE )
|
start
|
first iteration of interest; allows discarding burn-in iterations |
minspace
|
minimum number of iterations between iterations chosen as imputed values (default is 50) |
seed
|
optional seed value |
export_to_SPSS
|
logical; should the completed data be exported to SPSS? |
resdir
|
optional directory for results (for export to SPSS) |
filename
|
optional file name (for export to SPSS) |
lm3
.<- get_MIdat(lm3, m = 10)
MIdat3
head(MIdat3)
## Imputation_ wgt gender bili age chol HDL hgt educ race
## 597 0 78.01789 male 1.1 67 6.13 1.09 1.7526 College or above other
## 159 0 78.01789 male 0.7 39 4.65 1.14 1.7780 High school graduate Non-Hispanic White
## 2121 0 75.29633 female 0.5 64 4.14 1.29 1.6256 College or above other
## 2549 0 90.71847 male 0.8 36 3.47 1.37 1.9304 some college Non-Hispanic White
## 1172 0 112.03732 female 0.6 33 6.31 1.27 1.7272 some college Non-Hispanic Black
## 1199 0 58.96701 male 0.7 62 4.47 0.85 1.6002 High school graduate other
## SBP hypten WC agechol ageHDL .rownr .id
## 597 138.6667 yes 91.6 410.71 73.03 1 1
## 159 102.6667 no 84.5 181.35 44.46 2 2
## 2121 NA yes 91.6 264.96 82.56 3 3
## 2549 115.3333 yes 95.4 124.92 49.32 4 4
## 1172 106.6667 no 119.6 208.23 41.91 5 5
## 1199 140.6667 yes 89.0 277.14 52.70 6 6
We see that some columns were added to the data:
Imputation_
identifies the imputation number.id
is the subject ID.rownr
refers to the row number of the original
datasummary(MIdat3)
## Imputation_ wgt gender bili age chol
## Min. : 0 Min. : 39.01 male :2772 Min. :0.2000 Min. :20.00 Min. : 2.030
## 1st Qu.: 2 1st Qu.: 65.20 female:2728 1st Qu.:0.6000 1st Qu.:31.00 1st Qu.: 4.268
## Median : 5 Median : 76.20 Median :0.7000 Median :43.00 Median : 4.860
## Mean : 5 Mean : 78.25 Mean :0.7441 Mean :45.02 Mean : 4.995
## 3rd Qu.: 8 3rd Qu.: 86.41 3rd Qu.:0.9000 3rd Qu.:58.00 3rd Qu.: 5.640
## Max. :10 Max. :167.38 Max. :2.9000 Max. :79.00 Max. :10.680
## NA's :47 NA's :41
## HDL hgt educ race
## Min. :0.2507 Min. :1.397 Less than 9th grade : 341 Mexican American : 572
## 1st Qu.:1.1100 1st Qu.:1.626 9-11th grade : 762 Other Hispanic : 638
## Median :1.3400 Median :1.676 High school graduate:1267 Non-Hispanic White:2002
## Mean :1.3979 Mean :1.685 some college :1631 Non-Hispanic Black:1232
## 3rd Qu.:1.6000 3rd Qu.:1.753 College or above :1498 other :1056
## Max. :3.1300 Max. :1.930 NA's : 1
## NA's :41 NA's :11
## SBP hypten WC agechol ageHDL .rownr
## Min. : 77.79 no :4062 Min. : 61.90 Min. : 45.54 Min. : 7.92 Min. : 1.0
## 1st Qu.:109.33 yes :1417 1st Qu.: 85.00 1st Qu.:140.16 1st Qu.: 39.37 1st Qu.:125.8
## Median :118.67 NA's: 21 Median : 95.40 Median :222.31 Median : 57.40 Median :250.5
## Mean :120.35 Mean : 96.29 Mean :227.71 Mean : 63.14 Mean :250.5
## 3rd Qu.:129.33 3rd Qu.:105.10 3rd Qu.:296.40 3rd Qu.: 80.04 3rd Qu.:375.2
## Max. :202.00 Max. :154.70 Max. :725.90 Max. :171.36 Max. :500.0
## NA's :29 NA's :23 NA's :451 NA's :451
## .id
## Min. : 1.0
## 1st Qu.:125.8
## Median :250.5
## Mean :250.5
## 3rd Qu.:375.2
## Max. :500.0
##
Similar to the functions densplot()
from the
mice package and propplot()
, the function
plot_imp_distr()
from JointAI allows us to
plot the distribution of the observed and imputed values for the
incomplete variables.
It takes the following arguments
data
|
a data.frame in long format containing multiple imputations (and the original incomplete data) |
imp
|
the name of the variable specifying the imputation indicator |
id
|
the name of the variable specifying the subject indicator |
rownr
|
the name of a variable identifying which rows correspond to the same observation in the original (unimputed) data |
ncol , nrow
|
optional number of rows and columns in the plot layout; automatically chosen if unspecified |
Check the imputed values in MIdat3
using
plot_imp_distr()
.
plot_imp_distr(MIdat3)
To perform standard analyses on the imputed data it is usefull to
convert them to a mids
object, so that they can be treated
as if they had been imputed with mice()
.
The mice package proves the function
as.mids()
to convert a long-format dataset (with original
and multiple imputed datasets stacked onto each other) to a
mids
object.
Transform MIdat3
to a mids
object and
confirm that it has worked by checking the class
of the
result.
<- mice::as.mids(MIdat3, .imp = "Imputation_", .id = '.id')
mids3
class(mids3)
## [1] "mids"
© Nicole Erler