In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:
mice
(version: 3.15.0)You can find help files for any function by adding a ?
before the name of the function.
Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html
For this practical, we will use the EP16dat1 dataset, which is a subset of the NHANES (National Health and Nutrition Examination Survey) data.
To get the EP16dat1 dataset, load the file
EP16dat1.RData
. You can download it here.
To load this dataset into R, you can use the command
file.choose()
which opens the explorer and allows you to
navigate to the location of the file on your computer.
If you know the path to the file, you can also use
load("<path>/EP16dat1.RData")
.
If you have not followed the first practical or if you re-loaded the
EP16dat1 data, you need to re-code the variable
educ
again:
$educ <- as.ordered(EP16dat1$educ) EP16dat1
The imputed data are stored in a mids
object called
imp
that we created in the previous practical.
You can load it into your workspace by clicking the object
imps.RData
if you are using RStudio. Alternatively, you can
load this workspace using
load("<path>/imps.RData")
.
mids
objectsWhen we are confident that the imputation was successful, the imputed data can be analysed with any standard complete data method.
Depending on the research question, for the NHANES data you might use
lm()
glm()
with
family = binomial()
glm()
with an appropriate family
Since our imputed data imp
is not a
data.frame
but a mids
object, we cannot use
the standard specification
lm(formula = ..., data = imp, ...)
but need to use the
function with()
from the mice package:
with(imp, lm(formula, ...))
class
and structure of the resulting
object.For example:
mod1 <- with(imp, lm(SBP ~ age + gender + bili + chol + BMI))
mod2 <- with(imp, glm(hypchol ~ age + gender + bili + race + BMI, family = "binomial"))
mod3 <- with(imp, glm(creat ~ age + gender + uricacid + WC + BMI, family = Gamma(link = "log")))
# You may, for example, choose one of these models:
<- with(imp, lm(SBP ~ age + gender + bili + chol + BMI))
mod1 <- with(imp, glm(hypchol ~ age + gender + bili + race + BMI,
mod2 family = "binomial"))
<- with(imp, glm(creat ~ age + gender + uricacid + WC + BMI,
mod3 family = Gamma(link = 'log')))
# to determine the class
class(mod1)
## [1] "mira" "matrix"
# to determine the structure
names(mod1)
## [1] "call" "call1" "nmis" "analyses"
sapply(mod1, class)
## call call1 nmis analyses
## "call" "call" "integer" "list"
# explore further
$analyses mod1
## [[1]]
##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Coefficients:
## (Intercept) age genderfemale bili chol BMI
## 91.8201 0.3961 -6.0867 -1.7343 1.0969 0.3472
##
##
## [[2]]
##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Coefficients:
## (Intercept) age genderfemale bili chol BMI
## 95.7341 0.3935 -5.7506 -1.3572 0.3308 0.3300
##
##
## [[3]]
##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Coefficients:
## (Intercept) age genderfemale bili chol BMI
## 94.1262 0.3836 -5.7174 -2.5476 1.1176 0.3040
##
##
## [[4]]
##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Coefficients:
## (Intercept) age genderfemale bili chol BMI
## 93.7607 0.4083 -5.8519 -1.0055 0.6318 0.3152
##
##
## [[5]]
##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Coefficients:
## (Intercept) age genderfemale bili chol BMI
## 94.7909 0.3863 -5.6186 -0.2588 0.5052 0.3196
summary(mod1$analyses[[1]])
##
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.069 -9.164 -0.947 7.956 67.742
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.82008 3.54252 25.919 < 2e-16 ***
## age 0.39610 0.02987 13.261 < 2e-16 ***
## genderfemale -6.08667 0.98134 -6.202 8.15e-10 ***
## bili -1.73435 1.69586 -1.023 0.3067
## chol 1.09693 0.45654 2.403 0.0165 *
## BMI 0.34717 0.07880 4.406 1.17e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.77 on 994 degrees of freedom
## Multiple R-squared: 0.2227, Adjusted R-squared: 0.2188
## F-statistic: 56.97 on 5 and 994 DF, p-value: < 2.2e-16
mira
objectsAnalyses performed on mids
objects will result in a
mira
object (multiply imputed repeated analyses).
mira
object is a list with the following elements:
call :
|
The call that created the mira object.
|
call1 :
|
The call that created the mids object.
|
nmis :
|
The number of missing observations per column. |
analyses :
|
The models fitted on each of the imputed datasets: A list with
imp$m components, where each component contains a fitted
model object.
|
The class of each of the components of imp$analyses
will
depend on the type of model that was fitted.
If you want to check the results from one particular imputed dataset,
e.g., the 3rd, you can access the fitted model via
mod1$analyses[[3]]
. Alternatively you can use the function
getfit()
.
To pool the results from the repeated analyses contained in a
mira
object, the function pool()
can be
used.
Pool one of the provided mira
objects
(mod1
, mod2
or mod3
) and
investigate its structure.
# pool the model
<- pool(mod1)
pooled1
# investigate the structure
class(pooled1)
## [1] "mipo" "data.frame"
names(pooled1)
## [1] "call" "m" "pooled" "glanced"
str(pooled1)
## Classes 'mipo' and 'data.frame': 0 obs. of 4 variables:
## $ call : language pool(object = mod1)
## $ m : int 5
## $ pooled :'data.frame': 6 obs. of 11 variables:
## ..$ term : Factor w/ 6 levels "(Intercept)",..: 1 2 3 4 5 6
## ..$ m : int 5 5 5 5 5 5
## ..$ estimate: num 94.046 0.394 -5.805 -1.381 0.736 ...
## ..$ ubar : num 1.29e+01 9.02e-04 9.73e-01 2.87 2.11e-01 ...
## ..$ b : num 2.11 9.42e-05 3.17e-02 7.22e-01 1.26e-01 ...
## ..$ t : num 15.43697 0.00101 1.01112 3.73869 0.36258 ...
## ..$ dfcom : int 994 994 994 994 994 994
## ..$ df : num 125.9 236.2 713.2 67.9 22.1 ...
## ..$ riv : num 0.1964 0.1253 0.0391 0.3014 0.7158 ...
## ..$ lambda : num 0.1642 0.1114 0.0377 0.2316 0.4172 ...
## ..$ fmi : num 0.1771 0.1188 0.0404 0.2533 0.4636 ...
## $ glanced:'data.frame': 5 obs. of 12 variables:
## ..$ r.squared : num 0.223 0.206 0.207 0.217 0.199
## ..$ adj.r.squared: num 0.219 0.202 0.203 0.214 0.195
## ..$ sigma : num 14.8 14.9 14.9 14.8 14.9
## ..$ statistic : num 57 51.5 51.8 55.3 49.4
## ..$ p.value : num 3.65e-52 1.37e-47 8.14e-48 9.90e-51 1.04e-45
## ..$ df : num 5 5 5 5 5
## ..$ logLik : num -4108 -4115 -4116 -4114 -4119
## ..$ AIC : num 8231 8244 8245 8241 8253
## ..$ BIC : num 8265 8279 8279 8276 8287
## ..$ deviance : num 216783 219748 219908 219047 221545
## ..$ df.residual : int 994 994 994 994 994
## ..$ nobs : int 1000 1000 1000 1000 1000
mipo
objectsPooling of a mira
object results in a mipo
object (multiply imputed pooled analysis).
mipo
object is a list that has the following elements:
call :
|
The call that created the mipo object.
|
m :
|
The number of imputed datasets. |
pooled :
|
A data.frame containing pooled estimates and variances.
|
pooled
has the following columns:
estimate :
|
Pooled parameter estimates |
ubar :
|
The within-imputation variance of the estimate
|
b :
|
The between-imputation variance of the estimate
|
t :
|
The total variance of the estimate
|
dfcom :
|
Degrees of freedom in the complete data |
df :
|
Degrees of freedom associated with the \(t\)-statistics. |
riv :
|
The relative increase in variance due to the missing values: \((b+b/m)/ubar\) |
lambda :
|
Proportion of the variation due to the missing values: \((b+b/m)/t\). |
fmi :
|
Fraction of missing information. |
From the elements contained in the mipo
object, we could
calculate the pooled confidence intervals and p-values by hand, using
the function provided in the course notes.
However, the function summary()
applied to the
mipo
object will do this for us. This function has four
more arguments that may be relevant:
conf.int
: needs to be set to TRUE
in order
to to get confidence intervalsconf.level
: the confidence level, by default this is
0.95exponentiate
: whether to exponentiate the pooled
estimates and confidence intervals (e.g., for logistic regression)Get the summary of the pooled results.
<- pool(mod1)
pooled1 summary(pooled1, conf.int = TRUE)
## term estimate std.error statistic df p.value 2.5 % 97.5 %
## 1 (Intercept) 94.0463858 3.92899092 23.9365241 125.89054 1.090617e-48 86.2709627 101.8218088
## 2 age 0.3935746 0.03185725 12.3543153 236.17521 2.236391e-27 0.3308139 0.4563352
## 3 genderfemale -5.8050262 1.00554606 -5.7730087 713.18805 1.161878e-08 -7.7792106 -3.8308418
## 4 bili -1.3806815 1.93356811 -0.7140589 67.91568 4.776377e-01 -5.2391433 2.4777804
## 5 chol 0.7364822 0.60214286 1.2231021 22.10613 2.341702e-01 -0.5119381 1.9849026
## 6 BMI 0.3231938 0.08187090 3.9476031 614.16507 8.806992e-05 0.1624130 0.4839747
Note: pool()
extracts the
model coefficients and variance-covariance matrices using
tidy
from the package broom. Hence,
pooling using the pool()
function from
mice only works for models of classes for which a
method tidy
exists.
© Nicole Erler