For this practical, we will use the lung dataset from the survival package.
You can either load the survival package to get access to this dataset
library(survival)
or make a copy of this data
lung <- survival::lung
Find out the class and dimension of the lung data.
Useful functions: class(), dim(), nrow() and ncol().
class(lung)
## [1] "data.frame"
dim(lung)
## [1] 228 10
nrow(lung)
## [1] 228
ncol(lung)
## [1] 10
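As a small aside, dim() returns the number of rows and the number of columns in a single vector, so nrow() and ncol() correspond to its first and second element. A minimal sketch on a made-up data frame (toy_df is a hypothetical example and not part of the lung data):

```r
# toy_df is a made-up data frame used only for illustration
toy_df <- data.frame(id = 1:4, value = c(2.5, 3.1, NA, 4.8))

# dim() gives c(number of rows, number of columns)
dims <- dim(toy_df)
dims

# nrow() and ncol() match the first and second element of dim()
nrow(toy_df) == dims[1]
ncol(toy_df) == dims[2]
```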
Now investigate the structure of the data.
Useful functions: str(), head(), names() and summary().
head(lung)
## inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1 3 306 2 74 1 1 90 100 1175 NA
## 2 3 455 2 68 1 0 90 90 1225 15
## 3 3 1010 1 56 1 0 90 90 NA 15
## 4 5 210 2 57 1 1 90 60 1150 11
## 5 1 883 2 60 1 0 100 90 NA 0
## 6 12 1022 1 74 1 1 50 80 513 0
names(lung)
## [1] "inst" "time" "status" "age" "sex" "ph.ecog" "ph.karno" "pat.karno"
## [9] "meal.cal" "wt.loss"
str(lung)
## 'data.frame': 228 obs. of 10 variables:
## $ inst : num 3 3 3 5 1 12 7 11 1 7 ...
## $ time : num 306 455 1010 210 883 ...
## $ status : num 2 2 1 2 2 1 2 2 2 2 ...
## $ age : num 74 68 56 57 60 74 68 71 53 61 ...
## $ sex : num 1 1 1 1 1 1 2 2 1 1 ...
## $ ph.ecog : num 1 0 0 1 0 1 2 2 1 2 ...
## $ ph.karno : num 90 90 90 90 100 50 70 60 70 70 ...
## $ pat.karno: num 100 90 90 60 90 80 60 80 80 70 ...
## $ meal.cal : num 1175 1225 NA 1150 NA ...
## $ wt.loss : num NA 15 15 11 0 0 10 1 16 34 ...
All variables are coded as numeric. Check which of them might actually be categorical: unique(), table() and length() are useful here.
From the output of str(lung) we can see that status, sex and ph.ecog may only have values 0, 1 and 2. The variables inst, ph.karno, pat.karno and wt.loss could also be categorical.
To confirm that status, sex and ph.ecog only have very few levels, unique() or table() can be used:
unique(lung$status)
## [1] 2 1
table(lung$sex, exclude = NULL)
##
## 1 2
## 138 90
table(lung$ph.ecog, exclude = NULL)
##
## 0 1 2 3 <NA>
## 63 113 50 1 1
To prevent possibly very lengthy output for the other variables (if they have many different values) we could first check how many different values there are:
length(unique(lung$inst))
## [1] 19
length(unique(lung$ph.karno))
## [1] 7
length(unique(lung$pat.karno))
## [1] 9
length(unique(lung$wt.loss))
## [1] 54
We decide that they all should remain continuous variables (although for variables with few different values, like ph.karno or pat.karno, it may often not be appropriate to treat them as continuous).
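One detail worth keeping in mind when counting values with length(unique()): unique() keeps NA as a separate value, so for variables with missing values the count is one higher than the number of distinct observed values. A minimal sketch, using a made-up vector x (an assumption for illustration, not taken from the lung data):

```r
# x is a made-up vector with one missing value
x <- c(10, 20, 20, NA, 30)

# NA is counted as a separate value
n_with_na <- length(unique(x))
n_with_na

# Number of distinct observed (non-missing) values
n_observed <- length(unique(x[!is.na(x)]))
n_observed

# Number of missing values
n_missing <- sum(is.na(x))
n_missing
```

Comparing the two counts is a quick way to see whether a missing value contributed to the total.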
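Convert status, sex and ph.ecog to factors. Use meaningful labels for status (1 = censored, 2 = dead) and sex (1 = male, 2 = female). Check that the variables are now of class factor and that they have the correct levels.
Useful function: factor().
You may need to specify not only labels, but also levels.
The class can be checked with class(). To check levels, you can either use levels() or use a table().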
# For ph.ecog, just convert to a factor
lung$ph.ecog <- factor(lung$ph.ecog)
# For the other two, use levels and labels
lung$sex <- factor(lung$sex, levels = c(1, 2), labels = c('male', 'female'))
lung$status <- factor(lung$status, levels = c(1, 2), labels = c('censored', 'dead'))
# Confirm the class, either with
class(lung$ph.ecog)
## [1] "factor"
class(lung$sex)
## [1] "factor"
class(lung$status)
## [1] "factor"
# or just use str(lung)
levels(lung$ph.ecog)
## [1] "0" "1" "2" "3"
table(lung$ph.ecog)
##
## 0 1 2 3
## 63 113 50 1
table(lung$sex)
##
## male female
## 138 90
table(lung$status)
##
## censored dead
## 63 165
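Checking the counts after the conversion matters because factor() silently turns any value that is not listed in levels into NA. The following sketch illustrates this with a made-up coding vector x (hypothetical, not from the lung data) that contains an unexpected code 3:

```r
# x is a made-up numeric coding, not taken from the lung data
x <- c(1, 2, 2, 3, 1)

# Only the codes 1 and 2 are declared; the value 3 becomes NA without a warning
f <- factor(x, levels = c(1, 2), labels = c("male", "female"))
f

# Compare counts before and after the conversion to spot lost values
table(x, useNA = "ifany")
table(f, useNA = "ifany")
```

If the two tables do not show the same counts per category, some original values were not covered by levels.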
Get the summary of the lung data.
summary(lung)
## inst time status age sex ph.ecog
## Min. : 1.00 Min. : 5.0 censored: 63 Min. :39.00 male :138 0 : 63
## 1st Qu.: 3.00 1st Qu.: 166.8 dead :165 1st Qu.:56.00 female: 90 1 :113
## Median :11.00 Median : 255.5 Median :63.00 2 : 50
## Mean :11.09 Mean : 305.2 Mean :62.45 3 : 1
## 3rd Qu.:16.00 3rd Qu.: 396.5 3rd Qu.:69.00 NA's: 1
## Max. :33.00 Max. :1022.0 Max. :82.00
## NA's :1
## ph.karno pat.karno meal.cal wt.loss
## Min. : 50.00 Min. : 30.00 Min. : 96.0 Min. :-24.000
## 1st Qu.: 75.00 1st Qu.: 70.00 1st Qu.: 635.0 1st Qu.: 0.000
## Median : 80.00 Median : 80.00 Median : 975.0 Median : 7.000
## Mean : 81.94 Mean : 79.96 Mean : 928.8 Mean : 9.832
## 3rd Qu.: 90.00 3rd Qu.: 90.00 3rd Qu.:1150.0 3rd Qu.: 15.750
## Max. :100.00 Max. :100.00 Max. :2600.0 Max. : 68.000
## NA's :1 NA's :3 NA's :47 NA's :14
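Compute the summary measures for ph.karno and ph.ecog “by hand” (i.e., using other functions). Also compute the inter-quartile range of ph.karno.
You can use min(), max(), mean(), quantile() for the continuous variable (but there are other options), and table() for the categorical variable.
Because both variables contain missing values, you may need to specify na.rm = TRUE or exclude = NULL.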
min(lung$ph.karno, na.rm = TRUE)
## [1] 50
max(lung$ph.karno, na.rm = TRUE)
## [1] 100
# alternatively: range(lung$ph.karno, na.rm = TRUE)
quantile(lung$ph.karno, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
## 25% 50% 75%
## 75 80 90
# alternatively: median(lung$ph.karno, na.rm = TRUE)
IQR(lung$ph.karno, na.rm = TRUE)
## [1] 15
table(lung$ph.ecog, exclude = NULL)
##
## 0 1 2 3 <NA>
## 63 113 50 1 1
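To see why the arguments na.rm = TRUE and exclude = NULL are needed, it can help to compare the results with and without them. A minimal sketch on made-up vectors (x and g are hypothetical examples, not lung variables):

```r
# x and g are made-up vectors, each with one missing value
x <- c(50, 70, NA, 90, 100)
g <- c("a", "b", NA, "a", "a")

# Without na.rm = TRUE, a single NA makes the result NA
mean_default <- mean(x)
mean_default

# With na.rm = TRUE, missing values are dropped before computing
mean_narm <- mean(x, na.rm = TRUE)
mean_narm

# By default, table() does not show missing values
tab_default <- table(g)
tab_default

# exclude = NULL adds a separate <NA> category
tab_all <- table(g, exclude = NULL)
tab_all
```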
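Compute the standard deviation and the variance of meal.cal. Confirm that the standard deviation is the square root of the variance of meal.cal.
Useful functions: sd() and var().
The square root can be calculated with sqrt().
To square a value, use ^2.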
cal_sd <- sd(lung$meal.cal, na.rm = TRUE); cal_sd
## [1] 402.1747
cal_var <- var(lung$meal.cal, na.rm = TRUE); cal_var
## [1] 161744.5
sqrt(cal_var)
## [1] 402.1747
cal_sd^2
## [1] 161744.5
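Comparing printed values by eye works here, but the comparison can also be done programmatically. Because of floating-point rounding, all.equal() is usually a safer choice than == for this. A minimal sketch with a made-up vector x (hypothetical, not the meal.cal data):

```r
# x is a made-up vector with missing values
x <- c(2.5, 3.1, NA, 4.8, 6.0, NA, 1.7)

x_sd  <- sd(x, na.rm = TRUE)
x_var <- var(x, na.rm = TRUE)

# all.equal() compares with a small numerical tolerance
check_square <- isTRUE(all.equal(x_sd^2, x_var))
check_square

check_root <- isTRUE(all.equal(sqrt(x_var), x_sd))
check_root
```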
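Create a table of sex and status. Then compute the proportions of subjects in each cell, and the proportions relative to sex.
Useful function: prop.table().
Use the argument margin to specify if the probabilities are relative to the whole sample size or only one of the two variables.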
# 2x2 table of status and sex:
tab <- table(lung$status, lung$sex)
# probabilities, relative to the total number of subjects:
prop.table(tab)
##
## male female
## censored 0.1140351 0.1622807
## dead 0.4912281 0.2324561
# probabilities, relative to sex:
prop.table(tab, margin = 2)
##
## male female
## censored 0.1884058 0.4111111
## dead 0.8115942 0.5888889
Note:
Make sure you choose the correct margin!
When sex is in the rows, you need margin = 1.
Always check that the table shows the correct numbers by roughly adding up the proportions in your head.
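Instead of adding up the proportions in your head, the sums can also be computed. With margin = 1 each row should sum to 1, with margin = 2 each column, and without margin the whole table. A minimal sketch on made-up data (toy and tab_toy are hypothetical names, not part of the practical):

```r
# toy is a made-up data frame used only for illustration
toy <- data.frame(
  status = c("censored", "dead", "dead", "dead", "censored", "dead"),
  sex    = c("male", "male", "male", "female", "female", "female")
)
tab_toy <- table(toy$status, toy$sex)  # status in rows, sex in columns

# Relative to sex (the columns): every column sums to 1
col_sums <- colSums(prop.table(tab_toy, margin = 2))
col_sums

# Relative to status (the rows): every row sums to 1
row_sums <- rowSums(prop.table(tab_toy, margin = 1))
row_sums

# Relative to the total number of subjects: the whole table sums to 1
total <- sum(prop.table(tab_toy))
total
```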
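Create a table of sex and ph.ecog per status.
table() can take more than two variables.
To display the result more compactly, convert it to a flat table with ftable().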
tab <- table(lung$sex, lung$ph.ecog, lung$status, exclude = NULL)
tab
## , , = censored
##
##
## 0 1 2 3 <NA>
## male 8 17 1 0 0
## female 18 14 5 0 0
##
## , , = dead
##
##
## 0 1 2 3 <NA>
## male 28 54 28 1 1
## female 9 28 16 0 0
# The first variable specifies the rows, the second the columns, and the
# third variable the third dimension:
table(lung$ph.ecog, lung$status, lung$sex, exclude = NULL)
## , , = male
##
##
## censored dead
## 0 8 28
## 1 17 54
## 2 1 28
## 3 0 1
## <NA> 0 1
##
## , , = female
##
##
## censored dead
## 0 18 9
## 1 14 28
## 2 5 16
## 3 0 0
## <NA> 0 0
# Convert to a flat table:
ftable(tab, exclude = NULL)
## censored dead
##
## male 0 8 28
## 1 17 54
## 2 1 28
## 3 0 1
## NA 0 1
## female 0 18 9
## 1 14 28
## 2 5 16
## 3 0 0
## NA 0 0
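The layout of a flat table can be controlled with the row.vars and col.vars arguments of ftable(). This is easiest when the dimensions of the table are named. A minimal sketch on made-up data (the vectors and the dimension names sex, ecog and status are chosen here for illustration only):

```r
# Made-up vectors, not taken from the lung data
sex    <- c("male", "male", "female", "female", "male", "female")
ecog   <- c(0, 1, 0, 1, 1, 0)
status <- c("dead", "dead", "censored", "dead", "censored", "dead")

# Naming the arguments gives the table named dimensions
tab3 <- table(sex = sex, ecog = ecog, status = status)

# Put status and sex in the rows; ecog then forms the columns
ft <- ftable(tab3, row.vars = c("status", "sex"))
ft
```

The flat table holds the same counts as the three-dimensional table, only arranged differently.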
© Nicole Erler