Preface

R packages

In this practical, a number of R packages are used. If any of them are not installed you can still follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:

  • R version 3.6.0 (2019-04-26)
  • mice (version: 3.4.0)

Help files

You can find help files for any function by adding a ? before the name of the function.

Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html

Dataset

For this practical, we will again use the NHANES dataset that we have seen in the previous practicals.

To load this dataset, you can use the command file.choose() which opens the explorer and allows you to navigate to the location of the file NHANES_for_practicals.RData on your computer. If you know the path to the file, you can also use load("<path>/NHANES_for_practicals.RData").

If you have not followed the first practical or if you re-loaded the NHANES data you need to re-code the variable educ again:

Imputed data

The imputed data are stored in a mids object called imp that we created in the previous practical.

You can load it into your workspace by clicking the object imps.RData if you are using RStudio. Alternatively, you can load this workspace using load("<path>/imps.RData"). You then need to run:

Analysing mulitple imputed datasets

Fitting models on mids objects

When we are confident that the imputation was successfull, the imputed data can be analysed with any standard complete data method.

Depending on the research question, for the NHANES data you might use

  • linear regression using lm()
  • logistic regression using glm() with family = binomial()
  • other generalized linear regression models using glm() with an appropriate family

Since our imputed data imp is not a data.frame but a mids object, we cannot use the standard specification lm(formula = ..., data = imp, ...) but need to use the function with() from the mice package:

with(imp, lm(formula, ...))

Task

  • Choose any analysis model of interest and fit it on the imputed data.
  • Find out the class and structure of the resulting object.

For example:

mod1 <- with(imp, lm(SBP ~ age + gender + bili + chol + BMI))

mod2 <- with(imp, glm(hypchol ~ age + gender + bili + race + BMI, family = "binomial"))

mod3 <- with(imp, glm(creat ~ age + gender + uricacid + WC + BMI, family = Gamma(link = "log")))

Solution

## [1] "mira"   "matrix"
## [1] "call"     "call1"    "nmis"     "analyses"
## $call
## [1] "call"
## 
## $call1
## [1] "call"
## 
## $nmis
## [1] "integer"
## 
## $analyses
## [1] "list"
## [[1]]
## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Coefficients:
##  (Intercept)           age  genderfemale          bili          chol           BMI  
##      91.3922        0.4009       -5.2851       -3.4062        1.6550        0.2715  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Coefficients:
##  (Intercept)           age  genderfemale          bili          chol           BMI  
##      93.8528        0.4096       -5.5699       -2.8959        0.9808        0.2801  
## 
## 
## [[3]]
## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Coefficients:
##  (Intercept)           age  genderfemale          bili          chol           BMI  
##      93.8626        0.3982       -5.4364       -3.7986        1.1172        0.2914  
## 
## 
## [[4]]
## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Coefficients:
##  (Intercept)           age  genderfemale          bili          chol           BMI  
##      92.4412        0.4116       -5.5242       -2.5347        1.1863        0.2862  
## 
## 
## [[5]]
## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Coefficients:
##  (Intercept)           age  genderfemale          bili          chol           BMI  
##      92.5550        0.3970       -5.6263       -2.8590        1.2476        0.2956
## 
## Call:
## lm(formula = SBP ~ age + gender + bili + chol + BMI)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.939  -9.111  -0.540   7.751  68.422 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  91.39220    3.20770  28.491  < 2e-16 ***
## age           0.40092    0.02739  14.640  < 2e-16 ***
## genderfemale -5.28511    0.90512  -5.839 7.10e-09 ***
## bili         -3.40618    1.50234  -2.267 0.023589 *  
## chol          1.65496    0.42232   3.919 9.51e-05 ***
## BMI           0.27148    0.07175   3.784 0.000164 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.93 on 994 degrees of freedom
## Multiple R-squared:  0.2441, Adjusted R-squared:  0.2403 
## F-statistic: 64.19 on 5 and 994 DF,  p-value: < 2.2e-16

Structure of mira objects

Analyses performed on mids objects will result in a mira object (multiply imputed repeated analyses).

A mira object is a list with the following elements:
call: The call that created the mira object.
call1: The call that created the mids object.
nmis: The number of missing observations per column.
analyses: The models fitted on each of the imputed datasets: A list with imp$m components, where each component contains a fitted model object.

The class of each of the components of imp$analyses will depend on the type of model that was fitted.

If you want to check the results from one particular imputed dataset, e.g., the 3rd, you can access the fitted model via mod1$analyses[[3]]. Alternatively you can use the function getfit().

Pooling

Combining results

To pool the results from the repeated analyses contained in a mira object, the function pool() can be used.

Task

Pool one of the provided mira objects (mod1, mod2 or mod3) and investigate its structure.

Solution

## [1] "mipo"       "data.frame"
## [1] "call"   "m"      "pooled"
## Classes 'mipo' and 'data.frame': 0 obs. of  3 variables:
##  $ call  : language pool(object = mod1)
##  $ m     : int 5
##  $ pooled:'data.frame':  6 obs. of  9 variables:
##   ..$ estimate: num  92.821 0.403 -5.488 -3.099 1.237 ...
##   ..$ ubar    : num  1.02e+01 7.53e-04 8.26e-01 2.24 1.77e-01 ...
##   ..$ b       : num  1.10 4.48e-05 1.77e-02 2.50e-01 6.43e-02 ...
##   ..$ t       : num  1.15e+01 8.06e-04 8.47e-01 2.54 2.54e-01 ...
##   ..$ dfcom   : int  994 994 994 994 994 994
##   ..$ df      : num  225.4 456.3 839 216.2 40.8 ...
##   ..$ riv     : num  0.1298 0.0714 0.0258 0.1338 0.4362 ...
##   ..$ lambda  : num  0.1149 0.0667 0.0251 0.118 0.3037 ...
##   ..$ fmi     : num  0.1226 0.0707 0.0274 0.1261 0.3355 ...

Structure of mipo objects

Pooling of a mira object results in a mipo object (multiply imputed pooled analysis).

The mipo object is a list that has the following elements:
call: The call that created the mipo object.
m: The number of imputed datasets.
pooled: A data.frame containing pooled estimates and variances.
pooled has the following columns:
estimate: Pooled parameter estimates
ubar: The within-imputation variance of the estimate
b: The between-imputation variance of the estimate
t: The total variance of the estimate
dfcom: Degrees of freedom in the complete data
df: Degrees of freedom associated with the \(t\)-statistics.
riv: The relative increase in variance due to the missing values: \((b+b/m)/ubar\)
lambda: Proportion of the variation due to the missing values: \((b+b/m)/t\).
fmi: Fraction of missing information.

From the elements contained in the mipo object, we could calculate the pooled confidence intervals and p-values by hand, using the function provided in the course notes.

However, the function summary() applied to the mipo object will do this for us. This function has four more arguments that may be relevant:

  • conf.int: needs to be set to TRUE in order to to get confidence intervals
  • conf.level: the confidence level, by default this is 0.95