How to Analyse Data after Multiple Imputation

Preface

R packages

In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:

R version 4.3.0 (2023-04-21 ucrt)
mice (version: 3.15.0)

Help files

You can find help files for any function by adding a ? before the name of the function.

Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html

Dataset

Overview

For this practical, we will use the EP16dat1 dataset, which is a subset of the NHANES (National Health and Nutrition Examination Survey) data.

To get the EP16dat1 dataset, load the file EP16dat1.RData. You can download it here.

To load this dataset into R, you can use the command file.choose() which opens the explorer and allows you to navigate to the location of the file on your computer.

If you know the path to the file, you can also use load("<path>/EP16dat1.RData").

If you have not followed the first practical or if you re-loaded the EP16dat1 data, you need to re-code the variable educ again:

EP16dat1$educ <- as.ordered(EP16dat1$educ)

Imputed data

The imputed data are stored in a mids object called imp that we created in the previous practical.

You can load it into your workspace by clicking the object imps.RData if you are using RStudio. Alternatively, you can load this workspace using load("<path>/imps.RData").

Analysing mulitple imputed datasets

Fitting models on `mids` objects

When we are confident that the imputation was successful, the imputed data can be analysed with any standard complete data method.

Depending on the research question, for the NHANES data you might use

linear regression using lm()
logistic regression using glm() with family = binomial()
other generalized linear regression models using glm() with an appropriate family

Since our imputed data imp is not a data.frame but a mids object, we cannot use the standard specification lm(formula = ..., data = imp, ...) but need to use the function with() from the mice package:

with(imp, lm(formula, ...))

Task

Choose any analysis model of interest and fit it on the imputed data.
Find out the class and structure of the resulting object.

For example:

mod1 <- with(imp, lm(SBP ~ age + gender + bili + chol + BMI))

mod2 <- with(imp, glm(hypchol ~ age + gender + bili + race + BMI, family = "binomial"))

mod3 <- with(imp, glm(creat ~ age + gender + uricacid + WC + BMI, family = Gamma(link = "log")))

Solution

# You may, for example, choose one of these models:
mod1 <- with(imp, lm(SBP ~ age + gender + bili + chol + BMI))
mod2 <- with(imp, glm(hypchol ~ age + gender + bili + race + BMI,
                       family = "binomial"))
mod3 <- with(imp, glm(creat ~ age + gender + uricacid + WC + BMI,
                       family = Gamma(link = 'log')))

# to determine the class
class(mod1)

## [1] "mira"   "matrix"

# to determine the structure
names(mod1)

## [1] "call"     "call1"    "nmis"     "analyses"

sapply(mod1, class)

##      call     call1      nmis  analyses 
##    "call"    "call" "integer"    "list"

# explore further
mod1$analyses

## [[1]]
## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Coefficients:
##  (Intercept)           age  genderfemale          bili          chol           BMI  
##      91.8201        0.3961       -6.0867       -1.7343        1.0969        0.3472  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Coefficients:
##  (Intercept)           age  genderfemale          bili          chol           BMI  
##      95.7341        0.3935       -5.7506       -1.3572        0.3308        0.3300  
## 
## 
## [[3]]
## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Coefficients:
##  (Intercept)           age  genderfemale          bili          chol           BMI  
##      94.1262        0.3836       -5.7174       -2.5476        1.1176        0.3040  
## 
## 
## [[4]]
## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Coefficients:
##  (Intercept)           age  genderfemale          bili          chol           BMI  
##      93.7607        0.4083       -5.8519       -1.0055        0.6318        0.3152  
## 
## 
## [[5]]
## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Coefficients:
##  (Intercept)           age  genderfemale          bili          chol           BMI  
##      94.7909        0.3863       -5.6186       -0.2588        0.5052        0.3196

summary(mod1$analyses[[1]])

## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.069  -9.164  -0.947   7.956  67.742 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  91.82008    3.54252  25.919  < 2e-16 ***
## age           0.39610    0.02987  13.261  < 2e-16 ***
## genderfemale -6.08667    0.98134  -6.202 8.15e-10 ***
## bili         -1.73435    1.69586  -1.023   0.3067    
## chol          1.09693    0.45654   2.403   0.0165 *  
## BMI           0.34717    0.07880   4.406 1.17e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.77 on 994 degrees of freedom
## Multiple R-squared:  0.2227, Adjusted R-squared:  0.2188 
## F-statistic: 56.97 on 5 and 994 DF,  p-value: < 2.2e-16

Structure of `mira` objects

Analyses performed on mids objects will result in a mira object (multiply imputed repeated analyses).

A mira object is a list with the following elements:


`call`:	The call that created the `mira` object.
`call1`:	The call that created the `mids` object.
`nmis`:	The number of missing observations per column.
`analyses`:	The models fitted on each of the imputed datasets: A list with `imp$m` components, where each component contains a fitted model object.

The class of each of the components of imp$analyses will depend on the type of model that was fitted.

If you want to check the results from one particular imputed dataset, e.g., the 3rd, you can access the fitted model via mod1$analyses[[3]]. Alternatively you can use the function getfit().

Pooling

Combining results

To pool the results from the repeated analyses contained in a mira object, the function pool() can be used.

Task

Pool one of the provided mira objects (mod1, mod2 or mod3) and investigate its structure.

Solution

# pool the model
pooled1 <- pool(mod1)

# investigate the structure
class(pooled1)

## [1] "mipo"       "data.frame"

names(pooled1)

## [1] "call"    "m"       "pooled"  "glanced"

str(pooled1)

## Classes 'mipo' and 'data.frame': 0 obs. of  4 variables:
##  $ call   : language pool(object = mod1)
##  $ m      : int 5
##  $ pooled :'data.frame': 6 obs. of  11 variables:
##   ..$ term    : Factor w/ 6 levels "(Intercept)",..: 1 2 3 4 5 6
##   ..$ m       : int  5 5 5 5 5 5
##   ..$ estimate: num  94.046 0.394 -5.805 -1.381 0.736 ...
##   ..$ ubar    : num  1.29e+01 9.02e-04 9.73e-01 2.87 2.11e-01 ...
##   ..$ b       : num  2.11 9.42e-05 3.17e-02 7.22e-01 1.26e-01 ...
##   ..$ t       : num  15.43697 0.00101 1.01112 3.73869 0.36258 ...
##   ..$ dfcom   : int  994 994 994 994 994 994
##   ..$ df      : num  125.9 236.2 713.2 67.9 22.1 ...
##   ..$ riv     : num  0.1964 0.1253 0.0391 0.3014 0.7158 ...
##   ..$ lambda  : num  0.1642 0.1114 0.0377 0.2316 0.4172 ...
##   ..$ fmi     : num  0.1771 0.1188 0.0404 0.2533 0.4636 ...
##  $ glanced:'data.frame': 5 obs. of  12 variables:
##   ..$ r.squared    : num  0.223 0.206 0.207 0.217 0.199
##   ..$ adj.r.squared: num  0.219 0.202 0.203 0.214 0.195
##   ..$ sigma        : num  14.8 14.9 14.9 14.8 14.9
##   ..$ statistic    : num  57 51.5 51.8 55.3 49.4
##   ..$ p.value      : num  3.65e-52 1.37e-47 8.14e-48 9.90e-51 1.04e-45
##   ..$ df           : num  5 5 5 5 5
##   ..$ logLik       : num  -4108 -4115 -4116 -4114 -4119
##   ..$ AIC          : num  8231 8244 8245 8241 8253
##   ..$ BIC          : num  8265 8279 8279 8276 8287
##   ..$ deviance     : num  216783 219748 219908 219047 221545
##   ..$ df.residual  : int  994 994 994 994 994
##   ..$ nobs         : int  1000 1000 1000 1000 1000

Structure of `mipo` objects

Pooling of a mira object results in a mipo object (multiply imputed pooled analysis).

The mipo object is a list that has the following elements:


`call`:	The call that created the `mipo` object.
`m`:	The number of imputed datasets.
`pooled`:	A `data.frame` containing pooled estimates and variances.

pooled has the following columns:

`estimate`:	Pooled parameter estimates
`ubar`:	The within-imputation variance of the `estimate`
`b`:	The between-imputation variance of the `estimate`
`t`:	The total variance of the `estimate`
`dfcom`:	Degrees of freedom in the complete data
`df`:	Degrees of freedom associated with the $t$-statistics.
`riv`:	The relative increase in variance due to the missing values: $(b+b/m)/ubar$
`lambda`:	Proportion of the variation due to the missing values: $(b+b/m)/t$.
`fmi`:	Fraction of missing information.

From the elements contained in the mipo object, we could calculate the pooled confidence intervals and p-values by hand, using the function provided in the course notes.

However, the function summary() applied to the mipo object will do this for us. This function has four more arguments that may be relevant:

conf.int: needs to be set to TRUE in order to to get confidence intervals
conf.level: the confidence level, by default this is 0.95
exponentiate: whether to exponentiate the pooled estimates and confidence intervals (e.g., for logistic regression)

Task

Get the summary of the pooled results.

Solution

pooled1 <- pool(mod1)
summary(pooled1, conf.int = TRUE)

##           term   estimate  std.error  statistic        df      p.value      2.5 %      97.5 %
## 1  (Intercept) 94.0463858 3.92899092 23.9365241 125.89054 1.090617e-48 86.2709627 101.8218088
## 2          age  0.3935746 0.03185725 12.3543153 236.17521 2.236391e-27  0.3308139   0.4563352
## 3 genderfemale -5.8050262 1.00554606 -5.7730087 713.18805 1.161878e-08 -7.7792106  -3.8308418
## 4         bili -1.3806815 1.93356811 -0.7140589  67.91568 4.776377e-01 -5.2391433   2.4777804
## 5         chol  0.7364822 0.60214286  1.2231021  22.10613 2.341702e-01 -0.5119381   1.9849026
## 6          BMI  0.3231938 0.08187090  3.9476031 614.16507 8.806992e-05  0.1624130   0.4839747

Note: pool() extracts the model coefficients and variance-covariance matrices using tidy from the package broom. Hence, pooling using the pool() function from mice only works for models of classes for which a method tidy exists.

How to Analyse Data after Multiple Imputation

Preface

R packages

Help files

Dataset

Overview

Imputed data

Analysing mulitple imputed datasets

Fitting models on mids objects

Task

Solution

Structure of mira objects

Pooling

Combining results

Task

Solution

Structure of mipo objects

Task

Solution

Fitting models on `mids` objects

Structure of `mira` objects

Structure of `mipo` objects