In this practical, a number of R packages are used. If any of them are not installed, you may be able to follow the practical but will not be able to run all of the code. The packages used (with the versions that were used to generate the solutions) are:

mice (version 3.15.0)
visdat (version 0.6.0)
JointAI (version 1.0.5)
VIM (version 6.2.2)
plyr (version 1.8.8)
corrplot (version 0.92)
ggplot2 (version 3.4.2)

You can find the help file for any function by adding a ? before the name of the function.
Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html
For this practical, we will use the EP16dat1 dataset, which is a subset of the NHANES (National Health and Nutrition Examination Survey) data.
To get the EP16dat1 dataset, load the file
EP16dat1.RData
. You can download it here.
To load this dataset into R, you can use the command
load(file.choose())
which opens the explorer and allows you to
navigate to the location of the file on your computer.
If you know the path to the file, you can also use
load("<path>/EP16dat1.RData")
.
Take a first look at the data. Useful functions are
dim()
, head()
, str()
and
summary()
.
dim(EP16dat1)
## [1] 1000 24
head(EP16dat1)
## HDL race DBP bili smoke DM gender WC chol HyperMed alc SBP
## 858 1.58 Other Hispanic 56.66667 0.5 never no female 64.5 4.29 <NA> <=1 104.6667
## 103 1.24 other 81.33333 0.9 never no male 81.4 4.27 <NA> <NA> 124.6667
## 338 1.71 Non-Hispanic White 70.00000 0.7 current no male 76.0 4.22 <NA> <NA> 133.3333
## 344 1.01 Non-Hispanic White 66.00000 0.4 never no female 139.5 3.96 <NA> <=1 141.3333
## 2382 1.09 Non-Hispanic White 69.33333 0.9 current no male 94.6 4.97 <NA> 3-7 117.3333
## 2479 NA Non-Hispanic White 71.33333 NA former no female 101.8 NA yes 0 148.0000
## wgt hypten cohort occup age educ albu creat uricacid BMI
## 858 46.26642 no 2011 working 22 some college 3.8 0.61 3.6 19.27253
## 103 63.50293 no 2011 working 39 College or above 4.3 0.87 5.4 19.52584
## 338 62.14215 no 2011 working 51 College or above 4.3 0.89 6.2 22.11215
## 344 113.85168 yes 2011 working 45 College or above 3.6 0.61 4.3 41.76816
## 2382 102.05828 no 2011 working 31 High school graduate 4.9 0.83 6.1 32.28381
## 2479 70.30682 yes 2011 not working 75 Less than 9th grade NA NA NA 24.27618
## hypchol hgt
## 858 no 1.5494
## 103 no 1.8034
## 338 no 1.6764
## 344 no 1.6510
## 2382 no 1.7780
## 2479 <NA> 1.7018
str(EP16dat1)
## 'data.frame': 1000 obs. of 24 variables:
## $ HDL : num 1.58 1.24 1.71 1.01 1.09 NA NA 1.16 1.16 1.5 ...
## $ race : Factor w/ 5 levels "Mexican American",..: 2 5 3 3 3 3 2 3 1 3 ...
## $ DBP : num 56.7 81.3 70 66 69.3 ...
## $ bili : num 0.5 0.9 0.7 0.4 0.9 NA NA 0.9 0.6 0.9 ...
## $ smoke : Ord.factor w/ 3 levels "never"<"former"<..: 1 1 3 1 3 2 1 2 2 3 ...
## $ DM : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ gender : Factor w/ 2 levels "male","female": 2 1 1 2 1 2 1 1 1 1 ...
## $ WC : num 64.5 81.4 76 139.5 94.6 ...
## $ chol : num 4.29 4.27 4.22 3.96 4.97 NA NA 5.2 5.56 4.86 ...
## $ HyperMed: Ord.factor w/ 3 levels "no"<"previous"<..: NA NA NA NA NA 3 3 NA NA NA ...
## $ alc : Ord.factor w/ 5 levels "0"<"<=1"<"1-3"<..: 2 NA NA 2 4 1 1 5 3 4 ...
## $ SBP : num 105 125 133 141 117 ...
## $ wgt : num 46.3 63.5 62.1 113.9 102.1 ...
## $ hypten : Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 2 1 1 1 ...
## $ cohort : chr "2011" "2011" "2011" "2011" ...
## $ occup : Factor w/ 3 levels "working","looking for work",..: 1 1 1 1 1 3 3 1 1 1 ...
## $ age : num 22 39 51 45 31 75 73 52 29 40 ...
## $ educ : Factor w/ 5 levels "Less than 9th grade",..: 4 5 5 5 3 1 2 4 3 1 ...
## $ albu : num 3.8 4.3 4.3 3.6 4.9 NA NA 3.9 4.9 4.3 ...
## $ creat : num 0.61 0.87 0.89 0.61 0.83 NA NA 0.91 0.93 0.94 ...
## $ uricacid: num 3.6 5.4 6.2 4.3 6.1 NA NA 7.8 3.9 4.9 ...
## $ BMI : num 19.3 19.5 22.1 41.8 32.3 ...
## $ hypchol : Factor w/ 2 levels "no","yes": 1 1 1 1 1 NA NA 1 1 1 ...
## $ hgt : num 1.55 1.8 1.68 1.65 1.78 ...
summary(EP16dat1)
## HDL race DBP bili smoke DM
## Min. :0.360 Mexican American :112 Min. : 12.67 Min. :0.2000 never :608 no :853
## 1st Qu.:1.110 Other Hispanic :110 1st Qu.: 64.00 1st Qu.:0.6000 former :191 yes:147
## Median :1.320 Non-Hispanic White:350 Median : 70.67 Median :0.7000 current:199
## Mean :1.391 Non-Hispanic Black:229 Mean : 70.80 Mean :0.7527 NA's : 2
## 3rd Qu.:1.580 other :199 3rd Qu.: 77.33 3rd Qu.:0.9000
## Max. :4.030 Max. :130.00 Max. :2.9000
## NA's :86 NA's :59 NA's :95
## gender WC chol HyperMed alc SBP
## male :493 Min. : 61.90 Min. : 2.070 no : 25 0 :113 Min. : 81.33
## female:507 1st Qu.: 85.00 1st Qu.: 4.270 previous: 20 <=1 :281 1st Qu.:109.33
## Median : 95.10 Median : 4.910 yes :142 1-3 :105 Median :118.00
## Mean : 96.35 Mean : 4.998 NA's :813 3-7 : 81 Mean :120.15
## 3rd Qu.:105.50 3rd Qu.: 5.610 >7 :101 3rd Qu.:128.67
## Max. :154.70 Max. :10.680 NA's:319 Max. :202.00
## NA's :49 NA's :86 NA's :59
## wgt hypten cohort occup age
## Min. : 39.01 no :693 Length:1000 working :544 Min. :20.00
## 1st Qu.: 63.50 yes :265 Class :character looking for work: 46 1st Qu.:31.00
## Median : 76.88 NA's: 42 Mode :character not working :393 Median :43.00
## Mean : 78.35 NA's : 17 Mean :45.23
## 3rd Qu.: 89.13 3rd Qu.:59.00
## Max. :167.83 Max. :79.00
## NA's :22
## educ albu creat uricacid BMI
## Less than 9th grade : 83 Min. :3.000 Min. :0.4400 Min. :2.300 Min. :15.34
## 9-11th grade :133 1st Qu.:4.100 1st Qu.:0.6900 1st Qu.:4.400 1st Qu.:23.18
## High school graduate:228 Median :4.300 Median :0.8200 Median :5.300 Median :26.58
## some college :283 Mean :4.289 Mean :0.8525 Mean :5.356 Mean :27.49
## College or above :272 3rd Qu.:4.500 3rd Qu.:0.9500 3rd Qu.:6.200 3rd Qu.:30.73
## NA's : 1 Max. :5.400 Max. :7.4600 Max. :9.900 Max. :60.54
## NA's :91 NA's :91 NA's :92 NA's :37
## hypchol hgt
## no :813 Min. :1.397
## yes :101 1st Qu.:1.626
## NA's: 86 Median :1.676
## Mean :1.685
## 3rd Qu.:1.753
## Max. :1.981
## NA's :22
It is important to check that all variables are coded correctly,
i.e., have the correct class
. Imputation software (e.g.,
the mice package) uses the class
to
automatically select imputation methods. When importing data from other
software, it can happen that factors become continuous variables or that
ordered factors lose their ordering.
str()
(in the solution above) showed that in the
EP16dat1 data the variables smoke
and
alc
are correctly specified as ordinal variables, but
educ
is an unordered factor.
Using levels(EP16dat1$educ)
we can print the names of
the categories of educ
. Convert the unordered factor to an
ordered factor, for example using as.ordered()
. Afterwards,
check if the conversion was successful.
levels(EP16dat1$educ)
## [1] "Less than 9th grade" "9-11th grade" "High school graduate" "some college"
## [5] "College or above"
EP16dat1$educ <- as.ordered(EP16dat1$educ)
str(EP16dat1$educ)
## Ord.factor w/ 5 levels "Less than 9th grade"<..: 4 5 5 5 3 1 2 4 3 1 ...
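A side note, illustrated on toy data (not part of the EP16dat1 dataset): as.ordered() keeps the current order of the factor levels, so the conversion only gives the intended ordering if the levels were already sorted correctly.

```r
# as.ordered() keeps the existing level order; set it explicitly if needed
x <- factor(c("low", "high", "medium"),
            levels = c("low", "medium", "high"))  # define the intended order
x_ord <- as.ordered(x)

is.ordered(x_ord)   # TRUE
levels(x_ord)
```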
In the summary()
we could already see that there are
missing values in several variables. The missing data pattern can be
obtained and visualized by several functions from different packages.
Examples are:
md.pattern() from package mice
md_pattern() from package JointAI (with argument pattern = TRUE)
aggr() from package VIM
vis_dat() and vis_miss() from package visdat

Explore the missing data pattern of the EP16dat1 data.
mdp <- mice::md.pattern(EP16dat1)
mdp
## race DM gender cohort age educ smoke occup wgt hgt BMI hypten WC DBP SBP HDL chol hypchol albu
## 33 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 527 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 75 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 159 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 15 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 21 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 8 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 12 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 10 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 10 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 6 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 0
## 11 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 12 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
## 3 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
## 3 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 0 0 0
## 2 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 0 0 0
## 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 0 0 0
## 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 1
## 4 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## 9 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0
## 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0
## 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0
## 0 0 0 0 0 1 2 17 22 22 37 42 49 59 59 86 86 86 91
## creat uricacid bili alc HyperMed
## 33 1 1 1 1 1 0
## 527 1 1 1 1 0 1
## 75 1 1 1 0 1 1
## 159 1 1 1 0 0 2
## 2 1 1 0 1 0 2
## 1 1 1 0 0 0 3
## 1 1 0 0 1 0 3
## 3 0 0 0 1 0 5
## 2 0 0 0 0 1 5
## 15 0 0 0 1 1 7
## 21 0 0 0 1 0 8
## 8 0 0 0 0 1 8
## 12 0 0 0 0 0 9
## 10 1 1 1 1 1 2
## 3 1 1 1 0 1 3
## 10 1 1 1 1 1 1
## 6 1 1 1 1 0 2
## 4 1 1 1 0 1 2
## 3 1 1 1 0 0 3
## 3 0 0 0 1 1 8
## 3 0 0 0 0 1 9
## 3 0 0 0 0 0 10
## 3 1 1 1 0 1 4
## 1 0 0 0 1 1 10
## 2 0 0 0 0 1 11
## 2 1 1 1 1 0 2
## 1 0 0 0 1 0 9
## 11 1 1 1 1 0 4
## 12 1 1 1 0 0 5
## 2 0 0 0 1 0 11
## 3 0 0 0 0 0 12
## 1 1 1 1 0 0 6
## 1 0 0 0 1 0 12
## 1 0 0 0 0 0 13
## 2 1 1 1 1 1 2
## 4 1 1 1 1 0 3
## 1 1 1 1 0 1 3
## 2 1 1 1 0 0 4
## 1 0 0 0 1 0 10
## 1 1 1 1 0 1 6
## 1 1 1 1 1 0 6
## 1 1 1 1 0 0 7
## 1 1 1 1 0 0 8
## 1 0 0 0 0 0 15
## 3 1 1 1 1 1 2
## 1 1 1 1 1 0 3
## 2 1 1 1 0 1 3
## 3 1 1 1 0 0 4
## 1 0 0 0 0 0 11
## 2 0 0 0 0 1 11
## 1 0 0 0 0 0 12
## 1 1 1 1 0 0 7
## 1 1 1 1 0 0 8
## 4 1 1 1 1 0 4
## 2 1 1 1 0 0 5
## 4 1 1 1 1 1 1
## 9 1 1 1 1 0 2
## 2 0 0 0 0 0 10
## 1 1 1 1 0 0 6
## 1 1 1 1 1 0 5
## 1 1 1 1 1 0 2
## 1 0 0 0 0 0 10
## 1 0 0 0 1 0 13
## 91 92 95 319 813 2069
The function md.pattern()
from the package
mice gives us a matrix where each row represents one
missing data pattern (1 = observed, 0 = missing). The row names show how
many rows of the dataset have the given pattern. The last column shows
the number of missing values in each pattern, the last row the number of
missing values per variable.
md.pattern()
also plots the missing data pattern
automatically.
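To see what md.pattern() computes, the pattern counts can also be reproduced with base R. This is a sketch on a small toy dataset, not part of the practical's solution:

```r
# toy data with missing values
dat <- data.frame(a = c(1, NA, 3, NA),
                  b = c(1, 2, NA, NA))

# response indicator: 1 = observed, 0 = missing (as in md.pattern())
R <- 1 - is.na(dat)

# collapse each row's indicators into a pattern string and count the rows
patterns <- apply(R, 1, paste, collapse = "")
table(patterns)

# number of missing values per variable (the last row of md.pattern())
colSums(is.na(dat))
```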
MDP <- JointAI::md_pattern(EP16dat1, pattern = TRUE)
MDP
## $plot
##
## $pattern
## race DM gender cohort age educ smoke occup wgt hgt BMI hypten WC DBP SBP HDL chol hypchol albu
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 7 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 8 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 9 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 10 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 11 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 12 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## 13 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 14 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 15 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 16 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## 17 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
## 18 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 19 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 20 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 21 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1
## 22 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 23 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 24 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 25 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 26 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 27 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
## 28 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 29 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 0 0 0
## 30 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 31 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
## 32 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 33 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 34 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0
## 35 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
## 36 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## 37 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 38 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 39 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 40 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0
## 41 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 0 0 0
## 42 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 0 0 0
## 43 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1
## 44 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 1
## 45 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1
## 46 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 47 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1
## 48 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1
## 49 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 0 0
## 50 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 51 1 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0
## 52 1 1 1 1 1 1 1 0 1 1 1 0 1 0 0 1 1 1 1
## 53 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
## 54 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1
## 55 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
## 56 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0 1 1 1 1
## 57 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 58 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
## 59 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
## 60 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 1 1
## 61 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 62 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 63 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 0
## Nmis 0 0 0 0 0 1 2 17 22 22 37 42 49 59 59 86 86 86 91
## creat uricacid bili alc HyperMed Npat
## 1 1 1 1 1 0 527
## 2 1 1 1 0 0 159
## 3 1 1 1 0 1 75
## 4 1 1 1 1 1 33
## 5 0 0 0 1 0 21
## 6 0 0 0 1 1 15
## 7 0 0 0 0 0 12
## 8 1 1 1 0 0 12
## 9 1 1 1 1 0 11
## 10 1 1 1 1 1 10
## 11 1 1 1 1 1 10
## 12 1 1 1 1 0 9
## 13 0 0 0 0 1 8
## 14 1 1 1 1 0 6
## 15 1 1 1 0 1 4
## 16 1 1 1 1 1 4
## 17 1 1 1 1 0 4
## 18 1 1 1 1 0 4
## 19 0 0 0 1 1 3
## 20 1 1 1 0 0 3
## 21 1 1 1 0 1 3
## 22 0 0 0 0 0 3
## 23 1 1 1 0 1 3
## 24 1 1 1 1 1 3
## 25 0 0 0 1 0 3
## 26 0 0 0 0 1 3
## 27 0 0 0 0 0 3
## 28 1 1 1 0 0 3
## 29 0 0 0 0 1 2
## 30 0 0 0 0 1 2
## 31 0 0 0 1 0 2
## 32 1 1 0 1 0 2
## 33 1 1 1 0 1 2
## 34 0 0 0 0 0 2
## 35 1 1 1 0 0 2
## 36 1 1 1 1 0 2
## 37 0 0 0 0 1 2
## 38 1 1 1 1 1 2
## 39 1 1 1 0 0 2
## 40 0 0 0 0 0 1
## 41 0 0 0 0 0 1
## 42 0 0 0 0 0 1
## 43 1 1 1 0 0 1
## 44 1 1 1 0 0 1
## 45 1 1 1 0 0 1
## 46 1 1 0 0 0 1
## 47 1 1 1 1 0 1
## 48 1 1 1 0 0 1
## 49 0 0 0 1 0 1
## 50 1 1 1 0 1 1
## 51 0 0 0 1 0 1
## 52 1 1 1 0 0 1
## 53 0 0 0 1 0 1
## 54 1 1 1 1 0 1
## 55 0 0 0 0 0 1
## 56 1 1 1 0 0 1
## 57 1 0 0 1 0 1
## 58 1 1 1 1 0 1
## 59 0 0 0 0 0 1
## 60 1 1 1 0 1 1
## 61 1 1 1 1 0 1
## 62 0 0 0 1 1 1
## 63 0 0 0 1 0 1
## Nmis 91 92 95 319 813 2069
The function md_pattern()
from package
JointAI gives a matrix very similar to the one obtained
from mice::md.pattern()
. However, here, the number of rows in the data that have a particular missing data pattern is given in the
last column.
For more information on how to customize the visualization by
md_pattern()
see the vignette Visualizing
Incomplete Data on the webpage of
JointAI.
VIM::aggr(EP16dat1)
aggr()
from VIM plots a histogram of
the proportion of missing values per column as well as a visualization
of the missing data pattern. Here, a small histogram on the right edge
of the plot shows the proportion of cases in each pattern.
For more options on how to customize the plot, see the VIM documentation.
Now, get an overview of the number and proportion of complete cases, and of the number and proportion of missing values in each variable.
The function complete.cases()
can be applied to a
data.frame
and will give you a vector of values
TRUE
(if the row does not have missing values) and
FALSE
if there are missing values.
is.na()
returns TRUE
if the supplied object
is missing, FALSE
if it is observed.
colSums() calculates the sum of the values in each column of a data.frame or matrix.
colMeans() calculates the mean of the values in each column of a data.frame or matrix.
# number and proportion of complete cases
cbind(
"#" = table(ifelse(complete.cases(EP16dat1), 'complete', 'incomplete')),
"%" = round(100 * table(!complete.cases(EP16dat1))/nrow(EP16dat1), 2)
)
## # %
## complete 33 3.3
## incomplete 967 96.7
# number and proportion of missing values per variable
cbind("# NA" = sort(colSums(is.na(EP16dat1))),
"% NA" = round(sort(colMeans(is.na(EP16dat1))) * 100, 2))
## # NA % NA
## race 0 0.0
## DM 0 0.0
## gender 0 0.0
## cohort 0 0.0
## age 0 0.0
## educ 1 0.1
## smoke 2 0.2
## occup 17 1.7
## wgt 22 2.2
## hgt 22 2.2
## BMI 37 3.7
## hypten 42 4.2
## WC 49 4.9
## DBP 59 5.9
## SBP 59 5.9
## HDL 86 8.6
## chol 86 8.6
## hypchol 86 8.6
## albu 91 9.1
## creat 91 9.1
## uricacid 92 9.2
## bili 95 9.5
## alc 319 31.9
## HyperMed 813 81.3
In the missing data pattern we could already see that some variables tend to be missing together. But there may be different types of relationships between variables that are of interest.
Our data contains hgt
(height), wgt
(weight) and BMI
. Check the missing data pattern
specifically for those three variables.
There are multiple ways in which you could approach this.
We could obtain the missing data pattern for only the three variables of interest:
JointAI::md_pattern(EP16dat1[, c("hgt", "wgt", "BMI")], pattern = TRUE, plot = FALSE)
## hgt wgt BMI Npat
## 1 1 1 1 963
## 2 1 0 0 15
## 3 0 1 0 15
## 4 0 0 0 7
## Nmis 22 22 37 81
Alternatively, we could look at a 3-dimensional table of the missing data indicator for the three variables:
with(EP16dat1,
table(ifelse(is.na(hgt), 'height mis.', 'height obs.'),
ifelse(is.na(wgt), 'weight mis.', 'weight obs.'),
ifelse(is.na(BMI), 'BMI mis.', 'BMI obs.'))
)
## , , = BMI mis.
##
##
## weight mis. weight obs.
## height mis. 7 15
## height obs. 15 0
##
## , , = BMI obs.
##
##
## weight mis. weight obs.
## height mis. 0 0
## height obs. 0 963
We could also use a table in long format to describe the missing data pattern:
plyr::ddply(EP16dat1, c(height = "ifelse(is.na(hgt), 'missing', 'observed')",
                        weight = "ifelse(is.na(wgt), 'missing', 'observed')",
                        BMI = "ifelse(is.na(BMI), 'missing', 'observed')"),
            plyr::summarize,
            N = length(hgt)
)
## height weight BMI N
## 1 missing missing missing 7
## 2 missing observed missing 15
## 3 observed missing missing 15
## 4 observed observed observed 963
As we have also seen in the lecture, there are some cases where only
either hgt
or wgt
is missing. BMI
is only observed when both components are observed. To use all available
information, we want to impute hgt
and wgt
separately and calculate BMI
from the imputed values.
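In mice, this can be achieved with so-called passive imputation. The following is a sketch on simulated toy data (not the EP16dat1 workflow itself): BMI is derived from the imputed wgt and hgt via a formula-based method, and BMI is excluded as a predictor of its own components to avoid feedback.

```r
library(mice)

# toy data: BMI is a deterministic function of wgt and hgt
set.seed(2024)
dat <- data.frame(hgt = rnorm(100, 1.7, 0.1),
                  wgt = rnorm(100, 75, 12))
dat$BMI <- dat$wgt / dat$hgt^2
dat$hgt[1:10] <- NA
dat$wgt[11:20] <- NA
dat$BMI[1:20] <- NA          # BMI is missing whenever a component is missing

meth <- make.method(dat)
meth["BMI"] <- "~ I(wgt / hgt^2)"   # passive imputation: derive BMI

pred <- make.predictorMatrix(dat)
pred[c("hgt", "wgt"), "BMI"] <- 0   # do not use BMI to impute its components

imp <- mice(dat, method = meth, predictorMatrix = pred,
            m = 1, maxit = 5, printFlag = FALSE, seed = 1)
com <- complete(imp)

# check: BMI matches wgt / hgt^2 in the completed data
all(abs(com$BMI - com$wgt / com$hgt^2) < 1e-8)
```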
The data contains variables on hypertension (hypten
) and
the use of medication to treat hypertension (HyperMed
). We
might expect that medication is only prescribed for patients with
hypertension, hence, we need to investigate the relationship between
HyperMed
and hypten
.
Note: By default, table() excludes missing values. To show them in the output of
table()
, you need to specify the
argument exclude = NULL
.
with(EP16dat1, table(HyperMed, hypten, exclude = NULL))
## hypten
## HyperMed no yes <NA>
## no 0 25 0
## previous 0 20 0
## yes 0 142 0
## <NA> 693 78 42
Furthermore, we can expect a systematic relationship between the
variables hypchol
(hypercholesterolemia) and
chol
(serum cholesterol).
with(EP16dat1, plot(chol ~ hypchol))
Based on this plot it seems that hypchol
is
yes
if chol
is above some threshold. To be
sure, and to find out what this threshold may be, we need to look at the
numbers:
with(EP16dat1, summary(chol[hypchol == "no"]))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.070 4.220 4.780 4.755 5.380 6.180 86
with(EP16dat1, summary(chol[hypchol == "yes"]))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6.21 6.39 6.70 6.95 7.16 10.68 86
Indeed, there is no overlap between values of chol
for
participants in the two groups defined by hypchol
. The
cut-off for hypchol
is probably 6.2.
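If hypchol is indeed determined by chol, it should not be imputed independently but re-derived from (imputed) chol. A toy sketch, where the threshold of 6.2 is our guess based on the group summaries above:

```r
# hypothetical threshold of 6.2, inferred from the group summaries
chol <- c(4.2, 5.9, 6.21, 7.3, NA)
hypchol <- factor(ifelse(chol >= 6.2, "yes", "no"),
                  levels = c("no", "yes"))
as.character(hypchol)   # "no" "no" "yes" "yes" NA
```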
Visualize the correlations between variables to detect associations that you may have missed.
# convert all variables to numeric
dat_num <- sapply(EP16dat1, as.numeric)

# calculate the Spearman correlation matrix
cormat <- cor(dat_num, use = 'pair', method = 'spearman')

# plot the correlation matrix
corrplot::corrplot(cormat, method = 'square', type = 'lower')
Note: Correlations involving categorical variables are not usually done! We only use this as a quick-and-dirty method for visualization of relationships.
The question marks in the plot indicate that no correlation could be
calculated for cohort
(because cohort
is the
same for all subjects) and neither between HyperMed
and
hypten
(but we already knew that from the lecture).
Besides the relations we were already aware of, the plot shows us
some additional potentially relevant correlations, for example, that
wgt
is strongly correlated with WC
.
Comparing the number of cases where only either wgt
or
WC
is missing shows that there are 18 cases where
wgt
is missing but WC
is observed, and 45 cases
where WC
is missing and wgt
is observed.
with(EP16dat1, table(ifelse(is.na(WC), 'WC mis.', 'WC obs.'),
ifelse(is.na(wgt), 'wgt mis.', 'wgt obs.')
))
##
## wgt mis. wgt obs.
## WC mis. 4 45
## WC obs. 18 933
Before imputing missing values it is important to take a look at how the observed parts of incomplete variables are distributed, so that we can choose appropriate imputation models.
Visualize the distributions of the incomplete continuous and
categorical variables. The package JointAI provides the
convenient function plot_all()
, but you can of course use
functions from other packages as well.
Pay attention to the shape of the distributions (e.g., skewed continuous variables) and to categories with very few observations.
Note: When you use RStudio and the “Plots”
panel is too small, you may not be able to show the distributions of all
variables. You can try to extend the panel, or tell R to plot the graph
in a new window, by first calling windows()
(on Windows; on other systems, dev.new()).
It can also help to reduce the white space around the separate plots
by changing the margins in the graphical parameters:
par(mar = c(3, 3, 2, 1))
par(mar = c(3, 3, 2, 1), mgp = c(2, 0.6, 0))
JointAI::plot_all(EP16dat1, breaks = 30, ncol = 4)
To learn more about additional options of plot_all()
check out the vignette Visualizing
Incomplete Data on the webpage of
JointAI.
© Nicole Erler