Preface

R packages

This practical uses a number of R packages. If any of them are not installed, you may still be able to follow the practical, but you will not be able to run all of the code. The packages used (with the versions that were used to generate the solutions) are:

  • R version 3.6.0 (2019-04-26)
  • mice (version: 3.4.0)
  • visdat (version: 0.5.3)
  • JointAI (version: 0.5.1)
  • VIM (version: 4.8.0)
  • plyr (version: 1.8.4)
  • corrplot (version: 0.84)
  • ggplot2 (version: 3.1.1)
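
To see which of the packages listed above are available on your machine, a check along the following lines may help. This is a minimal sketch in base R; the object names `pkgs`, `installed` and `missing_pkgs` are chosen here for illustration only.

```r
# Packages used in this practical (names taken from the list above)
pkgs <- c("mice", "visdat", "JointAI", "VIM", "plyr", "corrplot", "ggplot2")

# requireNamespace() returns TRUE if a package can be loaded,
# without attaching it to the search path
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All packages are installed.")
}
```

Newer package versions than the ones listed may produce slightly different output from the solutions shown in this practical.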

Help files

You can find help files for any function by adding a ? before the name of the function.

Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html
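
As a small illustration of the ? shortcut, the two calls below should open the same help page. The function summary() is used only as an example, and the commented line assumes that the mice package is installed.

```r
# Two equivalent ways to open the help page of a function
?summary
help("summary")

# For a function from a package that is not attached, name the package
# (this only works if the package, here mice, is installed):
# help("md.pattern", package = "mice")
```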

Dataset

Overview

For this example, we will use the NHANES dataset. To get the NHANES data, load the file NHANES_for_practicals.RData. You can download it here. To load this dataset into R, you can use load(file.choose()): file.choose() opens a file browser that lets you navigate to the file on your computer and returns its path, which load() then reads. If you know the path to the file, you can also use load("<path>/NHANES_for_practicals.RData").
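
To illustrate what load() does, the following sketch saves a small made-up data frame to a temporary .RData file and loads it again. The object `toy` and the file name toy_example.RData are hypothetical stand-ins and not part of the practical's data.

```r
# Toy stand-in for the practical's data (hypothetical, not the real NHANES file)
toy <- data.frame(age = c(22, 39, 51), HDL = c(1.58, NA, 1.71))

# Save it to a temporary .RData file, then remove it from the workspace
path <- file.path(tempdir(), "toy_example.RData")
save(toy, file = path)
rm(toy)

# load() restores the saved objects under their original names
# and invisibly returns those names
loaded <- load(path)
loaded        # "toy"
dim(toy)      # 3 2

# Interactive alternative for the real file: pick it in a file dialog
# load(file.choose())
```

Because load() restores objects under the names they were saved with, it can silently overwrite an object of the same name that already exists in your workspace.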

Task

Let’s take a first look at the data. Useful functions are dim(), head(), str() and summary().
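
If you have not used these functions before, the sketch below shows them on a small made-up data frame. The object `toy` and its values are invented for illustration; in the practical you would apply the same calls to the NHANES data.

```r
# Small made-up data frame with the same kinds of columns as NHANES
toy <- data.frame(
  age    = c(22, 39, 51, 45, 31),
  HDL    = c(1.58, 1.24, NA, 1.01, 1.09),
  gender = factor(c("female", "male", "male", "female", "male")),
  smoke  = factor(c("never", "never", "current", NA, "former"),
                  levels = c("never", "former", "current"), ordered = TRUE)
)

dim(toy)       # number of rows and columns
head(toy, 3)   # first three rows
str(toy)       # class and first values of every column
summary(toy)   # per-column summary, including the number of NA's
```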

Solution

## [1] 1000   24
## 'data.frame':    1000 obs. of  24 variables:
##  $ HDL     : num  1.58 1.24 1.71 1.01 1.09 NA NA 1.16 1.16 1.5 ...
##  $ race    : Factor w/ 5 levels "Mexican American",..: 2 5 3 3 3 3 2 3 1 3 ...
##  $ DBP     : num  56.7 81.3 70 66 69.3 ...
##  $ bili    : num  0.5 0.9 0.7 0.4 0.9 NA NA 0.9 0.6 0.9 ...
##  $ smoke   : Ord.factor w/ 3 levels "never"<"former"<..: 1 1 3 1 3 2 1 2 2 3 ...
##  $ DM      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ gender  : Factor w/ 2 levels "male","female": 2 1 1 2 1 2 1 1 1 1 ...
##  $ WC      : num  64.5 81.4 76 139.5 94.6 ...
##  $ chol    : num  4.29 4.27 4.22 3.96 4.97 NA NA 5.2 5.56 4.86 ...
##  $ HyperMed: Ord.factor w/ 3 levels "no"<"previous"<..: NA NA NA NA NA 3 3 NA NA NA ...
##  $ alc     : Ord.factor w/ 5 levels "0"<"<=1"<"1-3"<..: 2 NA NA 2 4 1 1 5 3 4 ...
##  $ SBP     : num  105 125 133 141 117 ...
##  $ wgt     : num  46.3 63.5 62.1 113.9 102.1 ...
##  $ hypten  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 2 1 1 1 ...
##  $ cohort  : chr  "2011" "2011" "2011" "2011" ...
##  $ occup   : Factor w/ 3 levels "working","looking for work",..: 1 1 1 1 1 3 3 1 1 1 ...
##  $ age     : num  22 39 51 45 31 75 73 52 29 40 ...
##  $ educ    : Factor w/ 5 levels "Less than 9th grade",..: 4 5 5 5 3 1 2 4 3 1 ...
##  $ albu    : num  3.8 4.3 4.3 3.6 4.9 NA NA 3.9 4.9 4.3 ...
##  $ creat   : num  0.61 0.87 0.89 0.61 0.83 NA NA 0.91 0.93 0.94 ...
##  $ uricacid: num  3.6 5.4 6.2 4.3 6.1 NA NA 7.8 3.9 4.9 ...
##  $ BMI     : num  19.3 19.5 22.1 41.8 32.3 ...
##  $ hypchol : Factor w/ 2 levels "no","yes": 1 1 1 1 1 NA NA 1 1 1 ...
##  $ hgt     : num  1.55 1.8 1.68 1.65 1.78 ...
##       HDL                        race          DBP              bili            smoke       DM     
##  Min.   :0.360   Mexican American  :112   Min.   : 12.67   Min.   :0.2000   never  :608   no :853  
##  1st Qu.:1.110   Other Hispanic    :110   1st Qu.: 64.00   1st Qu.:0.6000   former :191   yes:147  
##  Median :1.320   Non-Hispanic White:350   Median : 70.67   Median :0.7000   current:199            
##  Mean   :1.391   Non-Hispanic Black:229   Mean   : 70.80   Mean   :0.7527   NA's   :  2            
##  3rd Qu.:1.580   other             :199   3rd Qu.: 77.33   3rd Qu.:0.9000                          
##  Max.   :4.030                            Max.   :130.00   Max.   :2.9000                          
##  NA's   :86                               NA's   :59       NA's   :95                              
##     gender          WC              chol            HyperMed     alc           SBP        
##  male  :493   Min.   : 61.90   Min.   : 2.070   no      : 25   0   :113   Min.   : 81.33  
##  female:507   1st Qu.: 85.00   1st Qu.: 4.270   previous: 20   <=1 :281   1st Qu.:109.33  
##               Median : 95.10   Median : 4.910   yes     :142   1-3 :105   Median :118.00  
##               Mean   : 96.35   Mean   : 4.998   NA's    :813   3-7 : 81   Mean   :120.15  
##               3rd Qu.:105.50   3rd Qu.: 5.610                  >7  :101   3rd Qu.:128.67  
##               Max.   :154.70   Max.   :10.680                  NA's:319   Max.   :202.00  
##               NA's   :49       NA's   :86                                 NA's   :59      
##       wgt          hypten       cohort                       occup          age       
##  Min.   : 39.01   no  :693   Length:1000        working         :544   Min.   :20.00  
##  1st Qu.: 63.50   yes :265   Class :character   looking for work: 46   1st Qu.:31.00  
##  Median : 76.88   NA's: 42   Mode  :character   not working     :393   Median :43.00  
##  Mean   : 78.35                                 NA's            : 17   Mean   :45.23  
##  3rd Qu.: 89.13                                                        3rd Qu.:59.00  
##  Max.   :167.83                                                        Max.   :79.00  
##  NA's   :22                                                                           
##                    educ          albu           creat           uricacid          BMI       
##  Less than 9th grade : 83   Min.   :3.000   Min.   :0.4400   Min.   :2.300   Min.   :15.34  
##  9-11th grade        :133   1st Qu.:4.100   1st Qu.:0.6900   1st Qu.:4.400   1st Qu.:23.18  
##  High school graduate:228   Median :4.300   Median :0.8200   Median :5.300   Median :26.58  
##  some college        :283   Mean   :4.289   Mean   :0.8525   Mean   :5.356   Mean   :27.49  
##  College or above    :272   3rd Qu.:4.500   3rd Qu.:0.9500   3rd Qu.:6.200   3rd Qu.:30.73  
##  NA's                :  1   Max.   :5.400   Max.   :7.4600   Max.   :9.900   Max.   :60.54  
##                             NA's   :91      NA's   :91       NA's   :92      NA's   :37     
##  hypchol         hgt       
##  no  :813   Min.   :1.397  
##  yes :101   1st Qu.:1.626  
##  NA's: 86   Median :1.676  
##             Mean   :1.685  
##             3rd Qu.:1.753  
##             Max.   :1.981  
##             NA's   :22

Variable coding

It is important to check that all variables are coded correctly, i.e., have the correct class. Imputation software (e.g., the mice package) uses the class to automatically select imputation methods. When importing data from other software it can happen that factors become continuous variables or that ordered factors lose their ordering.

str() showed that in the NHANES data smoke and alc are correctly specified as ordinal variables, but educ is an unordered factor.
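
How an ordered factor can lose its ordering on import can be reproduced with a round trip through a CSV file. This is a sketch with made-up data; the object names `toy` and `back` are illustrative assumptions.

```r
# Made-up ordered factor, similar in spirit to the smoke variable
toy <- data.frame(
  smoke = factor(c("never", "former", "current", "never"),
                 levels = c("never", "former", "current"), ordered = TRUE)
)
is.ordered(toy$smoke)    # TRUE

# Round trip through a CSV file, which stores only the labels as text
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
back <- read.csv(path)

class(back$smoke)        # "character" in R >= 4.0, "factor" in older versions
is.ordered(back$smoke)   # FALSE: the ordering is gone

# Restore the coding by stating the level order explicitly
back$smoke <- factor(back$smoke, levels = c("never", "former", "current"),
                     ordered = TRUE)
str(back)
```

Stating the levels explicitly matters here: without the levels argument, factor() sorts the labels alphabetically, which would give the order "current" < "former" < "never".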

Task

Using levels(NHANES$educ) we can print the names of the categories of educ. Convert the unordered factor to an ordered factor, for example using as.ordered(). Afterwards, check if the conversion was successful.
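
The sketch below shows the conversion on a made-up factor. The vector `educ` used here is a hypothetical stand-in with only three levels; in the practical the same idea applies to NHANES$educ.

```r
# Made-up unordered factor whose levels are already in a meaningful order
educ <- factor(c("some college", "9-11th grade", "College or above"),
               levels = c("9-11th grade", "some college", "College or above"))
is.ordered(educ)        # FALSE

# as.ordered() keeps the existing level order and only adds the ordering
educ <- as.ordered(educ)

is.ordered(educ)        # TRUE
levels(educ)            # unchanged order of the levels
educ[1] > educ[2]       # TRUE: "some college" ranks above "9-11th grade"
```

Since as.ordered() does not rearrange the levels, it is only appropriate when levels() already lists the categories from lowest to highest.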

Solution

## [1] "Less than 9th grade"  "9-11th grade"         "High school graduate" "some college"        
## [5] "College or above"
##  Ord.factor w/ 5 levels "Less than 9th grade"<..: 4 5 5 5 3 1 2 4 3 1 ...

Distribution of missing values

Missing data pattern

The output of summary() already showed that several variables have missing values. The missing data pattern can be obtained and visualized with functions from several packages. Examples are

  • md.pattern() from package mice
  • md_pattern() from package JointAI (with argument pattern = TRUE)
  • aggr() from package VIM
  • vis_dat() and vis_miss() from package visdat
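
To make clear what a missing data pattern is, the following base R sketch tabulates the patterns in a small made-up data frame. It is not the implementation used by any of the packages above; the object names `toy`, `observed`, `pattern` and `pattern_counts` are illustrative assumptions.

```r
# Made-up data with missing values (not the NHANES data)
toy <- data.frame(
  age  = c(22, 39, 51, 45, 31, 75),
  HDL  = c(1.58, NA, 1.71, 1.01, NA, NA),
  bili = c(0.5, NA, 0.7, NA, 0.9, NA)
)

# Number of missing values per variable
colSums(is.na(toy))

# One row per observation: 1 = observed, 0 = missing
# (the same coding that md.pattern() uses)
observed <- ifelse(is.na(toy), 0, 1)

# Collapse each row into a pattern string and count how often each occurs
pattern <- apply(observed, 1, paste, collapse = "")
pattern_counts <- table(pattern)
pattern_counts

# With the packages installed, the functions listed above give richer displays, e.g.
# mice::md.pattern(toy)
# visdat::vis_miss(toy)
```

In this toy example the pattern "111" marks the complete cases, and "100" marks rows in which only age is observed.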

Task

Explore the missing data pattern of the NHANES data.