Dataset

For this practical, we will use the lung dataset from the survival package.

You can either load the survival package to get access to this dataset

library(survival)

or make a copy of this data

lung <- survival::lung

Getting to know the data

Data dimensions

Task

Find out the class and dimension of the lung data.

Use the functions class(), dim(), nrow() and ncol().

Solution

class(lung)
## [1] "data.frame"
dim(lung)
## [1] 228  10
nrow(lung)
## [1] 228
ncol(lung)
## [1] 10

Data Structure

Task 1

Now investigate the structure of the data.

  • Which variables do the data contain?
  • What types of variables are these?
  • Are there missing values?
Use the functions str(), head(), names() and summary().

Solution 1

head(lung)
##   inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1    3  306      2  74   1       1       90       100     1175      NA
## 2    3  455      2  68   1       0       90        90     1225      15
## 3    3 1010      1  56   1       0       90        90       NA      15
## 4    5  210      2  57   1       1       90        60     1150      11
## 5    1  883      2  60   1       0      100        90       NA       0
## 6   12 1022      1  74   1       1       50        80      513       0
names(lung)
##  [1] "inst"      "time"      "status"    "age"       "sex"       "ph.ecog"   "ph.karno"  "pat.karno"
##  [9] "meal.cal"  "wt.loss"
str(lung)
## 'data.frame':    228 obs. of  10 variables:
##  $ inst     : num  3 3 3 5 1 12 7 11 1 7 ...
##  $ time     : num  306 455 1010 210 883 ...
##  $ status   : num  2 2 1 2 2 1 2 2 2 2 ...
##  $ age      : num  74 68 56 57 60 74 68 71 53 61 ...
##  $ sex      : num  1 1 1 1 1 1 2 2 1 1 ...
##  $ ph.ecog  : num  1 0 0 1 0 1 2 2 1 2 ...
##  $ ph.karno : num  90 90 90 90 100 50 70 60 70 70 ...
##  $ pat.karno: num  100 90 90 60 90 80 60 80 80 70 ...
##  $ meal.cal : num  1175 1225 NA 1150 NA ...
##  $ wt.loss  : num  NA 15 15 11 0 0 10 1 16 34 ...

Task 2

All variables are coded as numeric.

  • Do you think some of them should be factors?
  • Explore how many distinct values there are in those variables that may not actually be continuous.
The functions unique(), table() and length() are useful here.

Solution 2

  • Looking at the output of str(lung) we can see that status, sex and ph.ecog may only have values 0, 1 and 2
  • Also inst, ph.karno, pat.karno, and wt.loss could be categorical.

To confirm that status, sex and ph.ecog only have very few levels, unique() or table() can be used:

unique(lung$status)
## [1] 2 1
table(lung$sex, exclude = NULL)
## 
##   1   2 
## 138  90
table(lung$ph.ecog, exclude = NULL)
## 
##    0    1    2    3 <NA> 
##   63  113   50    1    1

To prevent possibly very lengthy output for the other variables (if they have many different values) we could first check how many different values there are:

length(unique(lung$inst))
## [1] 19
length(unique(lung$ph.karno))
## [1] 7
length(unique(lung$pat.karno))
## [1] 9
length(unique(lung$wt.loss))
## [1] 54

We decide that they all should remain continuous variables (although for variables with few different values, like ph.karno or pat.karno, it may often not be appropriate to treat them as continuous).

Re-coding

Turning continuous values into factors

Task

  • Re-code status, sex and ph.ecog as factors.
  • Use meaningful factor labels for status (1 = censored, 2 = dead) and sex (1 = male, 2 = female).
  • Check that the resulting variables are indeed of class factor and that they have the correct levels.
Use the function factor().
To make sure that the correct categories get the correct labels, specify not only the argument labels, but also levels.
To check the class, use the function class(). To check levels, you can either use levels() or use a table().

Solution

# For ph.ecog, just convert to a factor
lung$ph.ecog <- factor(lung$ph.ecog)

# For the other two, use levels and labels
lung$sex <- factor(lung$sex, levels = c(1, 2), labels = c('male', 'female'))
lung$status <- factor(lung$status, levels = c(1, 2), labels = c('censored', 'dead'))

# Confirm the class, either with
class(lung$ph.ecog)
## [1] "factor"
class(lung$sex)
## [1] "factor"
class(lung$status)
## [1] "factor"
# or just use (str(lung))
 
levels(lung$ph.ecog)
## [1] "0" "1" "2" "3"
table(lung$ph.ecog)
## 
##   0   1   2   3 
##  63 113  50   1
table(lung$sex)
## 
##   male female 
##    138     90
table(lung$status)
## 
## censored     dead 
##       63      165

Descriptive Statistics

Summary

Task 1

Get the summary of the lung data.

Solution 1

summary(lung)
##       inst            time             status         age            sex      ph.ecog   
##  Min.   : 1.00   Min.   :   5.0   censored: 63   Min.   :39.00   male  :138   0   : 63  
##  1st Qu.: 3.00   1st Qu.: 166.8   dead    :165   1st Qu.:56.00   female: 90   1   :113  
##  Median :11.00   Median : 255.5                  Median :63.00                2   : 50  
##  Mean   :11.09   Mean   : 305.2                  Mean   :62.45                3   :  1  
##  3rd Qu.:16.00   3rd Qu.: 396.5                  3rd Qu.:69.00                NA's:  1  
##  Max.   :33.00   Max.   :1022.0                  Max.   :82.00                          
##  NA's   :1                                                                              
##     ph.karno        pat.karno         meal.cal         wt.loss       
##  Min.   : 50.00   Min.   : 30.00   Min.   :  96.0   Min.   :-24.000  
##  1st Qu.: 75.00   1st Qu.: 70.00   1st Qu.: 635.0   1st Qu.:  0.000  
##  Median : 80.00   Median : 80.00   Median : 975.0   Median :  7.000  
##  Mean   : 81.94   Mean   : 79.96   Mean   : 928.8   Mean   :  9.832  
##  3rd Qu.: 90.00   3rd Qu.: 90.00   3rd Qu.:1150.0   3rd Qu.: 15.750  
##  Max.   :100.00   Max.   :100.00   Max.   :2600.0   Max.   : 68.000  
##  NA's   :1        NA's   :3        NA's   :47       NA's   :14

Task 2

  • Calculate each of the values of this summary for the variables ph.karno and ph.ecog “by hand” (i.e., using other functions).
  • Also calculate the inter-quartile range for ph.karno.
You can use the functions min(), max(), mean(), quantile() for the continuous variable (but there are other options), and table() for the categorical variable.
There are missing values in both variables: remember to set na.rm = TRUE or exclude = NULL.

Solution 2

min(lung$ph.karno, na.rm = TRUE)
## [1] 50
max(lung$ph.karno, na.rm = TRUE)
## [1] 100
# alternatively: range(lung$ph.karno, na.rm = TRUE)

quantile(lung$ph.karno, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
## 25% 50% 75% 
##  75  80  90
# alternatively: median(lung$ph.karno, na.rm = TRUE)

IQR(lung$ph.karno, na.rm = TRUE)
## [1] 15
table(lung$ph.ecog, exclude = NULL)
## 
##    0    1    2    3 <NA> 
##   63  113   50    1    1

Variance and Standard Deviation

Task

  • Calculate the variance of meal.cal.
  • Calculate the standard deviation of meal.cal.
  • Check that the standard deviation is indeed the square root of the variance.
  • Check that the variance is indeed the square of the standard deviation.
Use the functions sd() and var().
To calculate the square root, use the function sqrt().
To calculate the square, use ^2.

Solution

cal_sd <- sd(lung$meal.cal, na.rm = TRUE); cal_sd
## [1] 402.1747
cal_var <- var(lung$meal.cal, na.rm = TRUE); cal_var
## [1] 161744.5
sqrt(cal_var)
## [1] 402.1747
cal_sd^2
## [1] 161744.5

Tables

Task 1

  • Get a table of sex and status.
  • Also calculate the probabilities for each combination:
    • relative to the total number of subjects
    • relative to sex
To get the probabilities, use the function prop.table().
Use the argument margin to specify if the probabilities are relative to the whole sample size or only one of the two variables.

Solution 1

# 2x2 table of status and sex:
tab <- table(lung$status, lung$sex)

# probabilities, relative to the total number of subjects:
prop.table(tab)
##           
##                 male    female
##   censored 0.1140351 0.1622807
##   dead     0.4912281 0.2324561
# probabilities, relative to sex:
prop.table(tab, margin = 2)
##           
##                 male    female
##   censored 0.1884058 0.4111111
##   dead     0.8115942 0.5888889

Note:

Make sure choose the correct margin!

When sex is in the rows, you need margin = 1.

Always check that the table shows the correct numbers by roughly adding up the proportions in your head.

Task 2

  • Get a table of sex and ph.ecog per status.
  • Experiment with the effect the order of the variables in table() has.
  • Convert this 3-dimensional table into a flat table.
To get a flat table, use ftable().

Solution 2

tab <- table(lung$sex, lung$ph.ecog, lung$status, exclude = NULL)
tab
## , ,  = censored
## 
##         
##           0  1  2  3 <NA>
##   male    8 17  1  0    0
##   female 18 14  5  0    0
## 
## , ,  = dead
## 
##         
##           0  1  2  3 <NA>
##   male   28 54 28  1    1
##   female  9 28 16  0    0
# The first variable specifies the rows, the second the columns, and the 
# third variables the third dimension:
table(lung$ph.ecog, lung$status, lung$sex, exclude = NULL)
## , ,  = male
## 
##       
##        censored dead
##   0           8   28
##   1          17   54
##   2           1   28
##   3           0    1
##   <NA>        0    1
## 
## , ,  = female
## 
##       
##        censored dead
##   0          18    9
##   1          14   28
##   2           5   16
##   3           0    0
##   <NA>        0    0
# Convert to a flat table:
ftable(tab, exclude = NULL)
##            censored dead
##                         
## male   0          8   28
##        1         17   54
##        2          1   28
##        3          0    1
##        NA         0    1
## female 0         18    9
##        1         14   28
##        2          5   16
##        3          0    0
##        NA         0    0
 

© Nicole Erler