In this practical, a number of R packages are used. If any of them are not installed, you may be able to follow the practical but will not be able to run all of the code. The packages used (with the versions that were used to generate the solutions) are:

mice (version 3.15.0)
visdat (version 0.6.0)
JointAI (version 1.0.5)
VIM (version 6.2.2)
plyr (version 1.8.8)
corrplot (version 0.92)
ggplot2 (version 3.4.2)

You can find the help file for any function by adding a ? before the name of the function.
Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html
For this practical, we will use the EP16dat1 dataset, which is a subset of the NHANES (National Health and Nutrition Examination Survey) data.
To get the EP16dat1 dataset, load the file
EP16dat1.RData
. You can download it here.
To load this dataset into R, you can use the command
load(file.choose())
which opens the explorer and allows you to
navigate to the location of the file on your computer.
If you know the path to the file, you can also use
load("<path>/EP16dat1.RData")
.
Take a first look at the data. Useful functions are
dim()
, head()
, str()
and
summary()
.
dim(EP16dat1)
## [1] 1000 24
head(EP16dat1)
## HDL race DBP bili smoke DM gender WC chol HyperMed alc SBP
## 858 1.58 Other Hispanic 56.66667 0.5 never no female 64.5 4.29 <NA> <=1 104.6667
## 103 1.24 other 81.33333 0.9 never no male 81.4 4.27 <NA> <NA> 124.6667
## 338 1.71 Non-Hispanic White 70.00000 0.7 current no male 76.0 4.22 <NA> <NA> 133.3333
## 344 1.01 Non-Hispanic White 66.00000 0.4 never no female 139.5 3.96 <NA> <=1 141.3333
## 2382 1.09 Non-Hispanic White 69.33333 0.9 current no male 94.6 4.97 <NA> 3-7 117.3333
## 2479 NA Non-Hispanic White 71.33333 NA former no female 101.8 NA yes 0 148.0000
## wgt hypten cohort occup age educ albu creat uricacid BMI
## 858 46.26642 no 2011 working 22 some college 3.8 0.61 3.6 19.27253
## 103 63.50293 no 2011 working 39 College or above 4.3 0.87 5.4 19.52584
## 338 62.14215 no 2011 working 51 College or above 4.3 0.89 6.2 22.11215
## 344 113.85168 yes 2011 working 45 College or above 3.6 0.61 4.3 41.76816
## 2382 102.05828 no 2011 working 31 High school graduate 4.9 0.83 6.1 32.28381
## 2479 70.30682 yes 2011 not working 75 Less than 9th grade NA NA NA 24.27618
## hypchol hgt
## 858 no 1.5494
## 103 no 1.8034
## 338 no 1.6764
## 344 no 1.6510
## 2382 no 1.7780
## 2479 <NA> 1.7018
str(EP16dat1)
## 'data.frame': 1000 obs. of 24 variables:
## $ HDL : num 1.58 1.24 1.71 1.01 1.09 NA NA 1.16 1.16 1.5 ...
## $ race : Factor w/ 5 levels "Mexican American",..: 2 5 3 3 3 3 2 3 1 3 ...
## $ DBP : num 56.7 81.3 70 66 69.3 ...
## $ bili : num 0.5 0.9 0.7 0.4 0.9 NA NA 0.9 0.6 0.9 ...
## $ smoke : Ord.factor w/ 3 levels "never"<"former"<..: 1 1 3 1 3 2 1 2 2 3 ...
## $ DM : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ gender : Factor w/ 2 levels "male","female": 2 1 1 2 1 2 1 1 1 1 ...
## $ WC : num 64.5 81.4 76 139.5 94.6 ...
## $ chol : num 4.29 4.27 4.22 3.96 4.97 NA NA 5.2 5.56 4.86 ...
## $ HyperMed: Ord.factor w/ 3 levels "no"<"previous"<..: NA NA NA NA NA 3 3 NA NA NA ...
## $ alc : Ord.factor w/ 5 levels "0"<"<=1"<"1-3"<..: 2 NA NA 2 4 1 1 5 3 4 ...
## $ SBP : num 105 125 133 141 117 ...
## $ wgt : num 46.3 63.5 62.1 113.9 102.1 ...
## $ hypten : Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 2 1 1 1 ...
## $ cohort : chr "2011" "2011" "2011" "2011" ...
## $ occup : Factor w/ 3 levels "working","looking for work",..: 1 1 1 1 1 3 3 1 1 1 ...
## $ age : num 22 39 51 45 31 75 73 52 29 40 ...
## $ educ : Factor w/ 5 levels "Less than 9th grade",..: 4 5 5 5 3 1 2 4 3 1 ...
## $ albu : num 3.8 4.3 4.3 3.6 4.9 NA NA 3.9 4.9 4.3 ...
## $ creat : num 0.61 0.87 0.89 0.61 0.83 NA NA 0.91 0.93 0.94 ...
## $ uricacid: num 3.6 5.4 6.2 4.3 6.1 NA NA 7.8 3.9 4.9 ...
## $ BMI : num 19.3 19.5 22.1 41.8 32.3 ...
## $ hypchol : Factor w/ 2 levels "no","yes": 1 1 1 1 1 NA NA 1 1 1 ...
## $ hgt : num 1.55 1.8 1.68 1.65 1.78 ...
summary(EP16dat1)
## HDL race DBP bili smoke DM
## Min. :0.360 Mexican American :112 Min. : 12.67 Min. :0.2000 never :608 no :853
## 1st Qu.:1.110 Other Hispanic :110 1st Qu.: 64.00 1st Qu.:0.6000 former :191 yes:147
## Median :1.320 Non-Hispanic White:350 Median : 70.67 Median :0.7000 current:199
## Mean :1.391 Non-Hispanic Black:229 Mean : 70.80 Mean :0.7527 NA's : 2
## 3rd Qu.:1.580 other :199 3rd Qu.: 77.33 3rd Qu.:0.9000
## Max. :4.030 Max. :130.00 Max. :2.9000
## NA's :86 NA's :59 NA's :95
## gender WC chol HyperMed alc SBP
## male :493 Min. : 61.90 Min. : 2.070 no : 25 0 :113 Min. : 81.33
## female:507 1st Qu.: 85.00 1st Qu.: 4.270 previous: 20 <=1 :281 1st Qu.:109.33
## Median : 95.10 Median : 4.910 yes :142 1-3 :105 Median :118.00
## Mean : 96.35 Mean : 4.998 NA's :813 3-7 : 81 Mean :120.15
## 3rd Qu.:105.50 3rd Qu.: 5.610 >7 :101 3rd Qu.:128.67
## Max. :154.70 Max. :10.680 NA's:319 Max. :202.00
## NA's :49 NA's :86 NA's :59
## wgt hypten cohort occup age
## Min. : 39.01 no :693 Length:1000 working :544 Min. :20.00
## 1st Qu.: 63.50 yes :265 Class :character looking for work: 46 1st Qu.:31.00
## Median : 76.88 NA's: 42 Mode :character not working :393 Median :43.00
## Mean : 78.35 NA's : 17 Mean :45.23
## 3rd Qu.: 89.13 3rd Qu.:59.00
## Max. :167.83 Max. :79.00
## NA's :22
## educ albu creat uricacid BMI
## Less than 9th grade : 83 Min. :3.000 Min. :0.4400 Min. :2.300 Min. :15.34
## 9-11th grade :133 1st Qu.:4.100 1st Qu.:0.6900 1st Qu.:4.400 1st Qu.:23.18
## High school graduate:228 Median :4.300 Median :0.8200 Median :5.300 Median :26.58
## some college :283 Mean :4.289 Mean :0.8525 Mean :5.356 Mean :27.49
## College or above :272 3rd Qu.:4.500 3rd Qu.:0.9500 3rd Qu.:6.200 3rd Qu.:30.73
## NA's : 1 Max. :5.400 Max. :7.4600 Max. :9.900 Max. :60.54
## NA's :91 NA's :91 NA's :92 NA's :37
## hypchol hgt
## no :813 Min. :1.397
## yes :101 1st Qu.:1.626
## NA's: 86 Median :1.676
## Mean :1.685
## 3rd Qu.:1.753
## Max. :1.981
## NA's :22
It is important to check that all variables are coded correctly,
i.e., have the correct class
. Imputation software (e.g.,
the mice package) uses the class
to
automatically select imputation methods. When importing data from other
software, it can happen that factors become continuous variables or that
ordered factors lose their ordering.
str()
(in the solution above) showed that in the
EP16dat1 data the variables smoke
and
alc
are correctly specified as ordinal variables, but
educ
is an unordered factor.
Using levels(EP16dat1$educ)
we can print the names of
the categories of educ
. Convert the unordered factor to an
ordered factor, for example using as.ordered()
. Afterwards,
check if the conversion was successful.
levels(EP16dat1$educ)
## [1] "Less than 9th grade" "9-11th grade" "High school graduate" "some college"
## [5] "College or above"
EP16dat1$educ <- as.ordered(EP16dat1$educ)
str(EP16dat1$educ)
## Ord.factor w/ 5 levels "Less than 9th grade"<..: 4 5 5 5 3 1 2 4 3 1 ...
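A side note, illustrated on toy data (not part of the EP16dat1 dataset): as.ordered() keeps the current order of the factor levels, so the conversion only gives the intended ordering if the levels were already sorted correctly.

```r
# as.ordered() keeps the existing level order; set it explicitly if needed
x <- factor(c("low", "high", "medium"),
            levels = c("low", "medium", "high"))  # define the intended order
x_ord <- as.ordered(x)

is.ordered(x_ord)   # TRUE
levels(x_ord)
```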
In the summary()
we could already see that there are
missing values in several variables. The missing data pattern can be
obtained and visualized by several functions from different packages.
Examples are:
md.pattern() from package mice
md_pattern() from package JointAI (with argument pattern = TRUE)
aggr() from package VIM
vis_dat() and vis_miss() from package visdat

Explore the missing data pattern of the EP16dat1 data.
mdp <- mice::md.pattern(EP16dat1)
mdp
## race DM gender cohort age educ smoke occup wgt hgt BMI hypten WC DBP SBP HDL chol hypchol albu
## 33 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 527 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 75 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 159 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 15 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 21 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 8 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 12 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 10 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 10 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 6 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 3 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 0
## 11 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 12 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
## 3 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 0 0
## 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
## 3 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 0 0 0
## 2 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 0 0 0
## 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 0 0 0
## 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 1
## 4 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## 9 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0
## 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 0 1 1 1 1
## 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0
## 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0
## 0 0 0 0 0 1 2 17 22 22 37 42 49 59 59 86 86 86 91
## creat uricacid bili alc HyperMed
## 33 1 1 1 1 1 0
## 527 1 1 1 1 0 1
## 75 1 1 1 0 1 1
## 159 1 1 1 0 0 2
## 2 1 1 0 1 0 2
## 1 1 1 0 0 0 3
## 1 1 0 0 1 0 3
## 3 0 0 0 1 0 5
## 2 0 0 0 0 1 5
## 15 0 0 0 1 1 7
## 21 0 0 0 1 0 8
## 8 0 0 0 0 1 8
## 12 0 0 0 0 0 9
## 10 1 1 1 1 1 2
## 3 1 1 1 0 1 3
## 10 1 1 1 1 1 1
## 6 1 1 1 1 0 2
## 4 1 1 1 0 1 2
## 3 1 1 1 0 0 3
## 3 0 0 0 1 1 8
## 3 0 0 0 0 1 9
## 3 0 0 0 0 0 10
## 3 1 1 1 0 1 4
## 1 0 0 0 1 1 10
## 2 0 0 0 0 1 11
## 2 1 1 1 1 0 2
## 1 0 0 0 1 0 9
## 11 1 1 1 1 0 4
## 12 1 1 1 0 0 5
## 2 0 0 0 1 0 11
## 3 0 0 0 0 0 12
## 1 1 1 1 0 0 6
## 1 0 0 0 1 0 12
## 1 0 0 0 0 0 13
## 2 1 1 1 1 1 2
## 4 1 1 1 1 0 3
## 1 1 1 1 0 1 3
## 2 1 1 1 0 0 4
## 1 0 0 0 1 0 10
## 1 1 1 1 0 1 6
## 1 1 1 1 1 0 6
## 1 1 1 1 0 0 7
## 1 1 1 1 0 0 8
## 1 0 0 0 0 0 15
## 3 1 1 1 1 1 2
## 1 1 1 1 1 0 3
## 2 1 1 1 0 1 3
## 3 1 1 1 0 0 4
## 1 0 0 0 0 0 11
## 2 0 0 0 0 1 11
## 1 0 0 0 0 0 12
## 1 1 1 1 0 0 7
## 1 1 1 1 0 0 8
## 4 1 1 1 1 0 4
## 2 1 1 1 0 0 5
## 4 1 1 1 1 1 1
## 9 1 1 1 1 0 2
## 2 0 0 0 0 0 10
## 1 1 1 1 0 0 6
## 1 1 1 1 1 0 5
## 1 1 1 1 1 0 2
## 1 0 0 0 0 0 10
## 1 0 0 0 1 0 13
## 91 92 95 319 813 2069
The function md.pattern()
from the package
mice gives us a matrix where each row represents one
missing data pattern (1 = observed, 0 = missing). The row names show how
many rows of the dataset have the given pattern. The last column shows
the number of missing values in each pattern, the last row the number of
missing values per variable.
md.pattern()
also plots the missing data pattern
automatically.
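To see what md.pattern() computes, the pattern counts can also be reproduced with base R. This is a sketch on a small toy dataset, not part of the practical's solution:

```r
# toy data with missing values
dat <- data.frame(a = c(1, NA, 3, NA),
                  b = c(1, 2, NA, NA))

# response indicator: 1 = observed, 0 = missing (as in md.pattern())
R <- 1 - is.na(dat)

# collapse each row's indicators into a pattern string and count the rows
patterns <- apply(R, 1, paste, collapse = "")
table(patterns)

# number of missing values per variable (the last row of md.pattern())
colSums(is.na(dat))
```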
MDP <- JointAI::md_pattern(EP16dat1, pattern = TRUE)
MDP
## $plot
##
## $pattern
## race DM gender cohort age educ smoke occup wgt hgt BMI hypten WC DBP SBP HDL chol hypchol albu
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 7 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 8 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 9 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1
## 10 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 11 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 12 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## 13 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
## 14 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 15 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 16 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## 17 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
## 18 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 19 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 20 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 21 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1
## 22 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 23 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1
## 24 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 25 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 26 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0
## 27 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
## 28 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## 29 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 0 0 0
## 30 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 31 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0
## 32 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 33 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 34 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0
## 35 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
## 36 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## 37 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 38 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 39 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 40 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0
## 41 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 0 0 0
## 42 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 0 0 0 0
## 43 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1
## 44 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 1
## 45 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1
## 46 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 47 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1
## 48 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1
## 49 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 0 0
## 50 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1
## 51 1 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0
## 52 1 1 1 1 1 1 1 0 1 1 1 0 1 0 0 1 1 1 1
## 53 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
## 54 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 1
## 55 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
## 56 1 1 1 1 1 1 1 1 0 1 0 0 1 0 0 1 1 1 1
## 57 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 58 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1
## 59 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
## 60 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 1 1
## 61 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1
## 62 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## 63 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 0
## Nmis 0 0 0 0 0 1 2 17 22 22 37 42 49 59 59 86 86 86 91
## creat uricacid bili alc HyperMed Npat
## 1 1 1 1 1 0 527
## 2 1 1 1 0 0 159
## 3 1 1 1 0 1 75
## 4 1 1 1 1 1 33
## 5 0 0 0 1 0 21
## 6 0 0 0 1 1 15
## 7 0 0 0 0 0 12
## 8 1 1 1 0 0 12
## 9 1 1 1 1 0 11
## 10 1 1 1 1 1 10
## 11 1 1 1 1 1 10
## 12 1 1 1 1 0 9
## 13 0 0 0 0 1 8
## 14 1 1 1 1 0 6
## 15 1 1 1 0 1 4
## 16 1 1 1 1 1 4
## 17 1 1 1 1 0 4
## 18 1 1 1 1 0 4
## 19 0 0 0 1 1 3
## 20 1 1 1 0 0 3
## 21 1 1 1 0 1 3
## 22 0 0 0 0 0 3
## 23 1 1 1 0 1 3
## 24 1 1 1 1 1 3
## 25 0 0 0 1 0 3
## 26 0 0 0 0 1 3
## 27 0 0 0 0 0 3
## 28 1 1 1 0 0 3
## 29 0 0 0 0 1 2
## 30 0 0 0 0 1 2
## 31 0 0 0 1 0 2
## 32 1 1 0 1 0 2
## 33 1 1 1 0 1 2
## 34 0 0 0 0 0 2
## 35 1 1 1 0 0 2
## 36 1 1 1 1 0 2
## 37 0 0 0 0 1 2
## 38 1 1 1 1 1 2
## 39 1 1 1 0 0 2
## 40 0 0 0 0 0 1
## 41 0 0 0 0 0 1
## 42 0 0 0 0 0 1
## 43 1 1 1 0 0 1
## 44 1 1 1 0 0 1
## 45 1 1 1 0 0 1
## 46 1 1 0 0 0 1
## 47 1 1 1 1 0 1
## 48 1 1 1 0 0 1
## 49 0 0 0 1 0 1
## 50 1 1 1 0 1 1
## 51 0 0 0 1 0 1
## 52 1 1 1 0 0 1
## 53 0 0 0 1 0 1
## 54 1 1 1 1 0 1
## 55 0 0 0 0 0 1
## 56 1 1 1 0 0 1
## 57 1 0 0 1 0 1
## 58 1 1 1 1 0 1
## 59 0 0 0 0 0 1
## 60 1 1 1 0 1 1
## 61 1 1 1 1 0 1
## 62 0 0 0 1 1 1
## 63 0 0 0 1 0 1
## Nmis 91 92 95 319 813 2069
The function md_pattern()
from package
JointAI gives a matrix very similar to the one obtained
from mice::md.pattern()
. However, here, the number of rows in the data that have a particular missing data pattern is given in the
last column.
For more information on how to customize the visualization by
md_pattern()
see the vignette Visualizing
Incomplete Data on the webpage of
JointAI.
VIM::aggr(EP16dat1)
aggr()
from VIM plots a histogram of
the proportion of missing values per column as well as a visualization
of the missing data pattern. Here, a small histogram on the right edge
of the plot shows the proportion of cases in each pattern.
For more options on how to customize the plot, see the VIM documentation.
Now, get an overview of the number and proportion of complete cases, and of the number and proportion of missing values in each variable.
The function complete.cases()
can be applied to a
data.frame
and will give you a vector of values
TRUE
(if the row does not have missing values) and
FALSE
if there are missing values.
is.na()
returns TRUE
if the supplied object
is missing, FALSE
if it is observed.
colSums() calculates the sum of the values in each column of a data.frame or matrix.
colMeans() calculates the mean of the values in each column of a data.frame or matrix.
# number and proportion of complete cases
cbind(
"#" = table(ifelse(complete.cases(EP16dat1), 'complete', 'incomplete')),
"%" = round(100 * table(!complete.cases(EP16dat1))/nrow(EP16dat1), 2)
)
## # %
## complete 33 3.3
## incomplete 967 96.7
# number and proportion of missing values per variable
cbind("# NA" = sort(colSums(is.na(EP16dat1))),
"% NA" = round(sort(colMeans(is.na(EP16dat1))) * 100, 2))
## # NA % NA
## race 0 0.0
## DM 0 0.0
## gender 0 0.0
## cohort 0 0.0
## age 0 0.0
## educ 1 0.1
## smoke 2 0.2
## occup 17 1.7
## wgt 22 2.2
## hgt 22 2.2
## BMI 37 3.7
## hypten 42 4.2
## WC 49 4.9
## DBP 59 5.9
## SBP 59 5.9
## HDL 86 8.6
## chol 86 8.6
## hypchol 86 8.6
## albu 91 9.1
## creat 91 9.1
## uricacid 92 9.2
## bili 95 9.5
## alc 319 31.9
## HyperMed 813 81.3
In the missing data pattern we could already see that some variables tend to be missing together. But there may be different types of relationships between variables that are of interest.
Our data contains hgt
(height), wgt
(weight) and BMI
. Check the missing data pattern
specifically for those three variables.
There are multiple ways in which you could approach this.
We could obtain the missing data pattern for only the three variables of interest:
JointAI::md_pattern(EP16dat1[, c("hgt", "wgt", "BMI")], pattern = TRUE, plot = FALSE)
## hgt wgt BMI Npat
## 1 1 1 1 963
## 2 1 0 0 15
## 3 0 1 0 15
## 4 0 0 0 7
## Nmis 22 22 37 81
Alternatively, we could look at a 3-dimensional table of the missing data indicator for the three variables:
with(EP16dat1,
table(ifelse(is.na(hgt), 'height mis.', 'height obs.'),
ifelse(is.na(wgt), 'weight mis.', 'weight obs.'),
ifelse(is.na(BMI), 'BMI mis.', 'BMI obs.'))
)
## , , = BMI mis.
##
##
## weight mis. weight obs.
## height mis. 7 15
## height obs. 15 0
##
## , , = BMI obs.
##
##
## weight mis. weight obs.
## height mis. 0 0
## height obs. 0 963
We could also use a table in long format to describe the missing data pattern:
plyr::ddply(EP16dat1, c(height = "ifelse(is.na(hgt), 'missing', 'observed')",
                        weight = "ifelse(is.na(wgt), 'missing', 'observed')",
                        BMI = "ifelse(is.na(BMI), 'missing', 'observed')"),
            plyr::summarize,
            N = length(hgt)
)
## height weight BMI N
## 1 missing missing missing 7
## 2 missing observed missing 15
## 3 observed missing missing 15
## 4 observed observed observed 963
As we have also seen in the lecture, there are some cases where only
either hgt
or wgt
is missing. BMI
is only observed when both components are observed. To use all available
information, we want to impute hgt
and wgt
separately and calculate BMI
from the imputed values.
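In mice, this can be achieved with so-called passive imputation. The following is a sketch on simulated toy data (not the EP16dat1 workflow itself): BMI is derived from the imputed wgt and hgt via a formula-based method, and BMI is excluded as a predictor of its own components to avoid feedback.

```r
library(mice)

# toy data: BMI is a deterministic function of wgt and hgt
set.seed(2024)
dat <- data.frame(hgt = rnorm(100, 1.7, 0.1),
                  wgt = rnorm(100, 75, 12))
dat$BMI <- dat$wgt / dat$hgt^2
dat$hgt[1:10] <- NA
dat$wgt[11:20] <- NA
dat$BMI[1:20] <- NA          # BMI is missing whenever a component is missing

meth <- make.method(dat)
meth["BMI"] <- "~ I(wgt / hgt^2)"   # passive imputation: derive BMI

pred <- make.predictorMatrix(dat)
pred[c("hgt", "wgt"), "BMI"] <- 0   # do not use BMI to impute its components

imp <- mice(dat, method = meth, predictorMatrix = pred,
            m = 1, maxit = 5, printFlag = FALSE, seed = 1)
com <- complete(imp)

# check: BMI matches wgt / hgt^2 in the completed data
all(abs(com$BMI - com$wgt / com$hgt^2) < 1e-8)
```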
The data contains variables on hypertension (hypten
) and
the use of medication to treat hypertension (HyperMed
). We
might expect that medication is only prescribed for patients with
hypertension, hence, we need to investigate the relationship between
HyperMed
and hypten
.
Note: By default, table() excludes missing values. To show them in the output of
table()
, you need to specify the
argument exclude = NULL
.
with(EP16dat1, table(HyperMed, hypten, exclude = NULL))
## hypten
## HyperMed no yes <NA>
## no 0 25 0
## previous 0 20 0
## yes 0 142 0
## <NA> 693 78 42
Furthermore, we can expect a systematic relationship between the
variables hypchol
(hypercholesterolemia) and
chol
(serum cholesterol).
with(EP16dat1, plot(chol ~ hypchol))
Based on this plot it seems that hypchol
is
yes
if chol
is above some threshold. To be
sure, and to find out what this threshold may be, we need to look at the
numbers:
with(EP16dat1, summary(chol[hypchol == "no"]))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.070 4.220 4.780 4.755 5.380 6.180 86
with(EP16dat1, summary(chol[hypchol == "yes"]))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6.21 6.39 6.70 6.95 7.16 10.68 86
Indeed, there is no overlap between values of chol
for
participants in the two groups defined by hypchol
. The
cut-off for hypchol
is probably 6.2.
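If hypchol is indeed determined by chol, it should not be imputed independently but re-derived from (imputed) chol. A toy sketch, where the threshold of 6.2 is our guess based on the group summaries above:

```r
# hypothetical threshold of 6.2, inferred from the group summaries
chol <- c(4.2, 5.9, 6.21, 7.3, NA)
hypchol <- factor(ifelse(chol >= 6.2, "yes", "no"),
                  levels = c("no", "yes"))
as.character(hypchol)   # "no" "no" "yes" "yes" NA
```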
Visualize the correlations between variables to detect associations that you may have missed.
# convert all variables to numeric
dat_num <- sapply(EP16dat1, as.numeric)

# calculate the Spearman correlation matrix
cormat <- cor(dat_num, use = 'pair', method = 'spearman')

# plot the correlation matrix
corrplot::corrplot(cormat, method = 'square', type = 'lower')
Note: Correlations involving categorical variables are not usually done! We only use this as a quick-and-dirty method for visualization of relationships.
The question marks in the plot indicate that no correlation could be
calculated for cohort
(because cohort
is the
same for all subjects) and neither between HyperMed
and
hypten
(but we already knew that from the lecture).
Besides the relations we were already aware of, the plot shows us
some additional potentially relevant correlations, for example, that
wgt
is strongly correlated with WC
.
Comparing the number of cases where only either wgt
or
WC
is missing shows that there are 18 cases where
wgt
is missing but WC
is observed, and 45 cases
where WC
is missing and wgt
is observed.
with(EP16dat1, table(ifelse(is.na(WC), 'WC mis.', 'WC obs.'),
ifelse(is.na(wgt), 'wgt mis.', 'wgt obs.')
))
##
## wgt mis. wgt obs.
## WC mis. 4 45
## WC obs. 18 933
Before imputing missing values it is important to take a look at how the observed parts of incomplete variables are distributed, so that we can choose appropriate imputation models.
Visualize the distributions of the incomplete continuous and
categorical variables. The package JointAI provides the
convenient function plot_all()
, but you can of course use
functions from other packages as well.
Pay attention to the shape of the distributions (e.g., skewed continuous variables) and to categories with very few observations.
Note: When you use RStudio and the “Plots”
panel is too small, you may not be able to show the distributions of all
variables. You can try to extend the panel, or tell R to plot the graph
in a new window, by first calling windows()
(on Windows; on other systems, dev.new()).
It can also help to reduce the white space around the separate plots
by changing the margins in the graphical parameters:
par(mar = c(3, 3, 2, 1))
par(mar = c(3, 3, 2, 1), mgp = c(2, 0.6, 0))
JointAI::plot_all(EP16dat1, breaks = 30, ncol = 4)
To learn more about additional options of plot_all()
check out the vignette Visualizing
Incomplete Data on the webpage of
JointAI.
© Nicole Erler