For this practical, we will use the lung dataset from the survival package.
You can either load the survival package to get access to this dataset
library(survival)or make a copy of this data
lung <- survival::lungFind out the class and dimension of the lung data.
class(), dim(), nrow() and ncol().
class(lung)## [1] "data.frame"
dim(lung)## [1] 228 10
nrow(lung)## [1] 228
ncol(lung)## [1] 10
Now investigate the structure of the data.
str(), head(), names() and summary().
head(lung)## inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1 3 306 2 74 1 1 90 100 1175 NA
## 2 3 455 2 68 1 0 90 90 1225 15
## 3 3 1010 1 56 1 0 90 90 NA 15
## 4 5 210 2 57 1 1 90 60 1150 11
## 5 1 883 2 60 1 0 100 90 NA 0
## 6 12 1022 1 74 1 1 50 80 513 0
names(lung)## [1] "inst" "time" "status" "age" "sex" "ph.ecog" "ph.karno" "pat.karno"
## [9] "meal.cal" "wt.loss"
str(lung)## 'data.frame': 228 obs. of 10 variables:
## $ inst : num 3 3 3 5 1 12 7 11 1 7 ...
## $ time : num 306 455 1010 210 883 ...
## $ status : num 2 2 1 2 2 1 2 2 2 2 ...
## $ age : num 74 68 56 57 60 74 68 71 53 61 ...
## $ sex : num 1 1 1 1 1 1 2 2 1 1 ...
## $ ph.ecog : num 1 0 0 1 0 1 2 2 1 2 ...
## $ ph.karno : num 90 90 90 90 100 50 70 60 70 70 ...
## $ pat.karno: num 100 90 90 60 90 80 60 80 80 70 ...
## $ meal.cal : num 1175 1225 NA 1150 NA ...
## $ wt.loss : num NA 15 15 11 0 0 10 1 16 34 ...
All variables are coded as numeric.
unique(), table() and length() are useful here.
str(lung) we can see that status, sex and ph.ecog may only have values 0, 1 and 2inst, ph.karno, pat.karno, and wt.loss could be categorical.To confirm that status, sex and ph.ecog only have very few levels, unique() or table() can be used:
unique(lung$status)## [1] 2 1
table(lung$sex, exclude = NULL)##
## 1 2
## 138 90
table(lung$ph.ecog, exclude = NULL)##
## 0 1 2 3 <NA>
## 63 113 50 1 1
To prevent possibly very lengthy output for the other variables (if they have many different values) we could first check how many different values there are:
length(unique(lung$inst))## [1] 19
length(unique(lung$ph.karno))## [1] 7
length(unique(lung$pat.karno))## [1] 9
length(unique(lung$wt.loss))## [1] 54
We decide that they all should remain continuous variables (although for variables with few different values, like ph.karno or pat.karno, it may often not be appropriate to treat them as continuous).
status, sex and ph.ecog as factors.status (1 = censored, 2 = dead) and sex (1 = male, 2 = female).factor and that they have the correct levels.factor().
labels, but also levels.
class(). To check levels, you can either use levels() or use a table().
# For ph.ecog, just convert to a factor
lung$ph.ecog <- factor(lung$ph.ecog)
# For the other two, use levels and labels
lung$sex <- factor(lung$sex, levels = c(1, 2), labels = c('male', 'female'))
lung$status <- factor(lung$status, levels = c(1, 2), labels = c('censored', 'dead'))
# Confirm the class, either with
class(lung$ph.ecog)## [1] "factor"
class(lung$sex)## [1] "factor"
class(lung$status)## [1] "factor"
# or just use (str(lung))
levels(lung$ph.ecog)## [1] "0" "1" "2" "3"
table(lung$ph.ecog)##
## 0 1 2 3
## 63 113 50 1
table(lung$sex)##
## male female
## 138 90
table(lung$status)##
## censored dead
## 63 165
Get the summary of the lung data.
summary(lung)## inst time status age sex ph.ecog
## Min. : 1.00 Min. : 5.0 censored: 63 Min. :39.00 male :138 0 : 63
## 1st Qu.: 3.00 1st Qu.: 166.8 dead :165 1st Qu.:56.00 female: 90 1 :113
## Median :11.00 Median : 255.5 Median :63.00 2 : 50
## Mean :11.09 Mean : 305.2 Mean :62.45 3 : 1
## 3rd Qu.:16.00 3rd Qu.: 396.5 3rd Qu.:69.00 NA's: 1
## Max. :33.00 Max. :1022.0 Max. :82.00
## NA's :1
## ph.karno pat.karno meal.cal wt.loss
## Min. : 50.00 Min. : 30.00 Min. : 96.0 Min. :-24.000
## 1st Qu.: 75.00 1st Qu.: 70.00 1st Qu.: 635.0 1st Qu.: 0.000
## Median : 80.00 Median : 80.00 Median : 975.0 Median : 7.000
## Mean : 81.94 Mean : 79.96 Mean : 928.8 Mean : 9.832
## 3rd Qu.: 90.00 3rd Qu.: 90.00 3rd Qu.:1150.0 3rd Qu.: 15.750
## Max. :100.00 Max. :100.00 Max. :2600.0 Max. : 68.000
## NA's :1 NA's :3 NA's :47 NA's :14
ph.karno and ph.ecog “by hand” (i.e., using other functions).ph.karno.min(), max(), mean(), quantile() for the continuous variable (but there are other options), and table() for the categorical variable.
na.rm = TRUE or exclude = NULL.
min(lung$ph.karno, na.rm = TRUE)## [1] 50
max(lung$ph.karno, na.rm = TRUE)## [1] 100
# alternatively: range(lung$ph.karno, na.rm = TRUE)
quantile(lung$ph.karno, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)## 25% 50% 75%
## 75 80 90
# alternatively: median(lung$ph.karno, na.rm = TRUE)
IQR(lung$ph.karno, na.rm = TRUE)## [1] 15
table(lung$ph.ecog, exclude = NULL)##
## 0 1 2 3 <NA>
## 63 113 50 1 1
meal.cal.meal.cal.sd() and var().
sqrt().
^2.
cal_sd <- sd(lung$meal.cal, na.rm = TRUE); cal_sd## [1] 402.1747
cal_var <- var(lung$meal.cal, na.rm = TRUE); cal_var## [1] 161744.5
sqrt(cal_var)## [1] 402.1747
cal_sd^2## [1] 161744.5
sex and status.sexprop.table().
margin to specify if the probabilities are relative to the whole sample size or only one of the two variables.
# 2x2 table of status and sex:
tab <- table(lung$status, lung$sex)
# probabilities, relative to the total number of subjects:
prop.table(tab)##
## male female
## censored 0.1140351 0.1622807
## dead 0.4912281 0.2324561
# probabilities, relative to sex:
prop.table(tab, margin = 2)##
## male female
## censored 0.1884058 0.4111111
## dead 0.8115942 0.5888889
Note:
Make sure choose the correct margin!
When sex is in the rows, you need margin = 1.
Always check that the table shows the correct numbers by roughly adding up the proportions in your head.
sex and ph.ecog per status.table() has.ftable().
tab <- table(lung$sex, lung$ph.ecog, lung$status, exclude = NULL)
tab## , , = censored
##
##
## 0 1 2 3 <NA>
## male 8 17 1 0 0
## female 18 14 5 0 0
##
## , , = dead
##
##
## 0 1 2 3 <NA>
## male 28 54 28 1 1
## female 9 28 16 0 0
# The first variable specifies the rows, the second the columns, and the
# third variables the third dimension:
table(lung$ph.ecog, lung$status, lung$sex, exclude = NULL)## , , = male
##
##
## censored dead
## 0 8 28
## 1 17 54
## 2 1 28
## 3 0 1
## <NA> 0 1
##
## , , = female
##
##
## censored dead
## 0 18 9
## 1 14 28
## 2 5 16
## 3 0 0
## <NA> 0 0
# Convert to a flat table:
ftable(tab, exclude = NULL)## censored dead
##
## male 0 8 28
## 1 17 54
## 2 1 28
## 3 0 1
## NA 0 1
## female 0 18 9
## 1 14 28
## 2 5 16
## 3 0 0
## NA 0 0
© Nicole Erler