For this practical, we simulate data using the following syntax:
set.seed(2020)
nrj <- data.frame(sex = factor(sample(c('male', 'female'), size = 100, replace = TRUE)),
                  kcal = round(rnorm(100, mean = 2200, sd = 250)),
                  weight = runif(100, min = 45, max = 150),
                  height = rnorm(100, mean = 170, sd = 10),
                  age = rnorm(100, mean = 50, sd = 10),
                  sports = factor(sample(c('never', 'sometimes', 'regularly'),
                                         size = 100, replace = TRUE), ordered = TRUE)
)
The first six rows of the resulting data.frame are:
sex | kcal | weight | height | age | sports |
---|---|---|---|---|---|
female | 2153 | 83.25 | 162.70 | 59.17 | never |
female | 2350 | 74.35 | 178.51 | 47.73 | never |
male | 2032 | 47.71 | 166.04 | 61.82 | sometimes |
female | 2319 | 82.89 | 174.07 | 65.21 | regularly |
female | 2230 | 88.52 | 159.61 | 64.38 | sometimes |
male | 2230 | 112.04 | 157.44 | 73.27 | regularly |
We want to calculate the Basal Metabolic Rate (BMR) for the individuals in the nrj data.
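The formula differs for men and women (these are the formulas implemented in the solutions below):
men: BMR = 66 + 13.75 × weight + 5 × height − 6.76 × age
women: BMR = 655 + 9.56 × weight + 1.85 × height − 4.68 × age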
Which function would you choose to calculate the BMR? ifelse() or if() ... else ...?
Both are possible, but ifelse() is more straightforward in this setting.
Calculate the BMR using ifelse() and add it to the dataset as a new variable BMR1.
ifelse() has arguments test, yes and no.
Use nrj$sex == "male" as the test.
nrj$BMR1 <- ifelse(nrj$sex == 'male',
                   13.75 * nrj$weight + 5 * nrj$height - 6.76 * nrj$age + 66,
                   9.56 * nrj$weight + 1.85 * nrj$height - 4.68 * nrj$age + 655
)
head(nrj)
## sex kcal weight height age sports BMR1
## 1 female 2153 83.24565 162.7047 59.16676 never 1474.932
## 2 female 2350 74.34613 178.5138 47.73101 never 1472.618
## 3 male 2032 47.70793 166.0351 61.82203 sometimes 1134.242
## 4 female 2319 82.89275 174.0668 65.21487 regularly 1464.273
## 5 female 2230 88.51872 159.6145 64.37977 sometimes 1495.228
## 6 male 2230 112.04331 157.4412 73.27302 regularly 1898.476
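As a quick sanity check, the BMR of the first subject can be recomputed by hand. This sketch uses the rounded values shown in the table of the first rows (a woman with weight 83.25, height 162.70 and age 59.17), so it agrees with the BMR1 column only up to rounding. The object names are chosen for illustration.

```r
# rounded values of the first subject (a woman), taken from the table of the first rows
weight <- 83.25
height <- 162.70
age    <- 59.17

# formula for females, as used in the ifelse() call
bmr_row1 <- 9.56 * weight + 1.85 * height - 4.68 * age + 655
round(bmr_row1, 1)
## [1] 1474.9
```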
We now calculate the BMR a second time, using if() ... else ... instead.
How is the condition used in if() different from the test used in ifelse()?
The argument test in ifelse() expects a vector of “tests” (a vector of TRUE and FALSE), while the condition argument in if() expects a single “test” (a single TRUE or FALSE).
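The difference can be seen on a small toy vector (the names x, signs, first and all_pos are made up for this sketch). Passing the whole logical vector x > 0 to if() is an error in R 4.2.0 and later; earlier versions only used the first element and gave a warning.

```r
x <- c(3, -1, 5)

# ifelse() evaluates the test for every element of x
signs <- ifelse(x > 0, "positive", "not positive")
# signs now contains: "positive", "not positive", "positive"

# if() needs a single TRUE or FALSE, so we pick one element ...
first <- if (x[1] > 0) "positive" else "not positive"

# ... or reduce the logical vector to a single value with all() or any()
all_pos <- if (all(x > 0)) "all positive" else "not all positive"
```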
How can we check row by row if a subject is male or female?
We could use a for() loop that runs through all rows in nrj.
Calculate the BMR using if() ... else ... and add it to the nrj data as a new variable BMR2.
Start by creating an “empty” version of BMR2.
Our syntax needs to have the general form:
for (i in <"rows of nrj">) {
  if (<"subject is male">) {
    <"formula for males">
  } else {
    <"formula for females">
  }
}
nrj$BMR2 <- NA # "empty" version of BMR2

# loop over all rows
for (i in 1:nrow(nrj)) {
  # test if the subject is male
  nrj$BMR2[i] <- if (nrj$sex[i] == 'male') {
    # formula for males
    13.75 * nrj$weight[i] + 5 * nrj$height[i] - 6.76 * nrj$age[i] + 66
  } else {
    # formula for females
    9.56 * nrj$weight[i] + 1.85 * nrj$height[i] - 4.68 * nrj$age[i] + 655
  }
}
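One caveat with 1:nrow(nrj), which is not a problem for the 100 rows used here: for a data.frame without rows, 1:0 counts down (1, 0), so the loop body would still run. A common safeguard is seq_len(). A small sketch with a hypothetical empty data.frame called empty:

```r
empty <- data.frame(weight = numeric(0))  # hypothetical data.frame without rows

1:nrow(empty)
## [1] 1 0
seq_len(nrow(empty))
## integer(0)

n_runs <- 0
for (i in seq_len(nrow(empty))) {
  n_runs <- n_runs + 1  # never reached, because there are no rows
}
n_runs
## [1] 0
```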
Check that BMR1 and BMR2 are the same.
There are multiple possible ways to check this:
all.equal(nrj$BMR1, nrj$BMR2)
## [1] TRUE
identical(nrj$BMR1, nrj$BMR2)
## [1] TRUE
table(nrj$BMR1 == nrj$BMR2)
##
## TRUE
## 100
summary(nrj$BMR1 - nrj$BMR2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
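These checks are not equivalent in general: identical() and == compare the values exactly, while all.equal() tolerates small numerical differences. BMR1 and BMR2 result from the same arithmetic, so here all checks agree. The sketch below uses two made-up numbers, a and b, to show a case where the checks differ.

```r
a <- 0.1 + 0.2
b <- 0.3

a == b
## [1] FALSE
identical(a, b)
## [1] FALSE
all.equal(a, b)
## [1] TRUE

# all.equal() returns a description (not FALSE) when the values differ,
# so wrap it in isTRUE() when it is used as a condition
isTRUE(all.equal(1, 1.1))
## [1] FALSE
```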
We would now like to get descriptive statistics of the nrj data. For continuous variables, we want the mean and standard deviation; for categorical variables, we want the proportion of subjects in each category.
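Remember how to get summary measures?
Calculate the mean and standard deviation of kcal.
Calculate the proportion of subjects in each category of sports.
Use mean(), sd(), table() and prop.table().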
mean(nrj$kcal)
## [1] 2200.36
sd(nrj$kcal)
## [1] 294.8881
prop.table(table(nrj$sports))
##
## never regularly sometimes
## 0.34 0.23 0.43
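To see how table() and prop.table() work together, here is a minimal sketch on a made-up factor grp with four observations. names() returns the category labels and as.numeric() the plain proportions.

```r
# hypothetical toy factor with four observations
grp <- factor(c("a", "b", "a", "a"))

toy_counts <- table(grp)             # absolute frequencies: a = 3, b = 1
toy_props  <- prop.table(toy_counts) # relative frequencies: a = 0.75, b = 0.25

names(toy_props)
## [1] "a" "b"
as.numeric(toy_props)
## [1] 0.75 0.25
```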
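Use paste() to create some nice-looking output for these summaries:
For kcal, combine the mean and standard deviation into one string of the form "kcal: mean (sd)".
For sports, combine the percentage and the name of each category into one string.
paste() has arguments sep and collapse, but you could also use paste0().
Use round() to reduce the number of digits in the output.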
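The difference between sep and collapse is easiest to see on a few short strings. A minimal sketch with made-up inputs:

```r
# sep is placed between the arguments of one paste() call
paste("mean", "sd", sep = "/")
## [1] "mean/sd"

# paste0() is paste() with sep = ""; it works element by element
paste0("x", 1:3)
## [1] "x1" "x2" "x3"

# collapse joins the elements of the resulting vector into a single string
paste0("x", 1:3, collapse = ", ")
## [1] "x1, x2, x3"
```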
# for "kcal":
paste0("kcal: ", round(mean(nrj$kcal), 1), " (", round(sd(nrj$kcal), 1), ")")
## [1] "kcal: 2200.4 (294.9)"
# for "sports":
# table of proportions:
tab <- prop.table(table(nrj$sports))

# combine proportions with category names:
props <- paste0(round(tab * 100, 1), "% ", names(tab), collapse = ", ")

# combine variable name with proportions string:
paste('sports:', props)
## [1] "sports: 34% never, 23% regularly, 43% sometimes"
Write a loop that creates the summary strings we created in the previous Solution for each variable in the nrj data.
Use print() to print each of the summaries.
You will need to check whether a variable is a factor or not to choose the correct summary type.
Our syntax has the general form:
for (i in <"columns of nrj">) {
  if (<"variable is factor">) {
    <"syntax for summary of a factor">
  } else {
    <"syntax for summary of a continuous variable">
  }
}
# loop over all columns
for (i in 1:ncol(nrj)) {
  # test if column is a factor
  if (is.factor(nrj[, i])) {
    # syntax for categorical variable
    tab <- prop.table(table(nrj[, i]))
    props <- paste0(round(tab * 100, 1), "% ", names(tab), collapse = ", ")
    print(paste0(names(nrj)[i], ": ", props))
  } else {
    # syntax for continuous variable
    print(paste0(names(nrj)[i], ": ",
                 round(mean(nrj[, i]), 1), " (", round(sd(nrj[, i]), 1), ")"))
  }
}
## [1] "sex: 60% female, 40% male"
## [1] "kcal: 2200.4 (294.9)"
## [1] "weight: 95.6 (30.8)"
## [1] "height: 168.6 (11.5)"
## [1] "age: 50.1 (10.3)"
## [1] "sports: 34% never, 23% regularly, 43% sometimes"
## [1] "BMR1: 1734.6 (375.8)"
## [1] "BMR2: 1734.6 (375.8)"
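The same loop can also run over the column names instead of the column numbers. The sketch below uses a small hypothetical data.frame toy so that it runs on its own; toy[[nm]] extracts one column as a vector, and the strings are additionally collected in a vector out (a name made up for this example).

```r
# hypothetical toy data with one factor and one continuous variable
toy <- data.frame(
  group = factor(c("a", "b", "a", "a")),
  score = c(1, 2, 3, 4)
)

out <- character(0)  # collects the summary strings

for (nm in names(toy)) {
  x <- toy[[nm]]  # current column as a vector
  if (is.factor(x)) {
    tab <- prop.table(table(x))
    res <- paste0(round(tab * 100, 1), "% ", names(tab), collapse = ", ")
  } else {
    res <- paste0(round(mean(x), 1), " (", round(sd(x), 1), ")")
  }
  out[nm] <- paste0(nm, ": ", res)
  print(out[[nm]])
}
## [1] "group: 75% a, 25% b"
## [1] "score: 2.5 (1.3)"
```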
We also want to practice writing our own functions.
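Write two functions:
one that creates the summary string for a continuous variable, and
one that creates the summary string for a categorical variable.
Assume that the input will be a vector, such as nrj$kcal or nrj$sports.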
Note that when we work with these vectors, they do not contain the variable name any more, so we cannot use the variable name in the two functions.
A function is defined using function(){...}.
We can re-use the syntax we used before, and only remove the part that adds the variable name.
Our functions have one argument, which we call x here:
summary_continuous <- function(x) {
  paste0(round(mean(x), 1), " (", round(sd(x), 1), ")")
}
summary_continuous(nrj$kcal)
## [1] "2200.4 (294.9)"
summary_categorical <- function(x) {
  tab <- prop.table(table(x))
  paste0(round(tab * 100, 1), "% ", names(tab), collapse = ", ")
}
summary_categorical(nrj$sports)
## [1] "34% never, 23% regularly, 43% sometimes"
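A possible extension, which is not part of the exercise: arguments with default values make a function more flexible without changing how it is called in the simple case. In the sketch below, the function name summary_continuous_v2 and the arguments digits and na.rm are hypothetical additions; vals is made-up data with one missing value.

```r
# hypothetical variant of summary_continuous() with two extra arguments
summary_continuous_v2 <- function(x, digits = 1, na.rm = FALSE) {
  paste0(round(mean(x, na.rm = na.rm), digits),
         " (", round(sd(x, na.rm = na.rm), digits), ")")
}

vals <- c(1, 2, 3, 4, NA)

summary_continuous_v2(vals)  # the missing value propagates
## [1] "NA (NA)"
summary_continuous_v2(vals, na.rm = TRUE)
## [1] "2.5 (1.3)"
summary_continuous_v2(vals, digits = 2, na.rm = TRUE)
## [1] "2.5 (1.29)"
```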
Write another function that has a data.frame as input and prints the summary string for each variable, using the two functions from the previous solution.
This function also has one argument, which we call dat. We can, again, re-use syntax from above:
summary_df <- function(dat) {
  # loop over all columns
  for (i in 1:ncol(dat)) {
    # check if the column is a factor
    summary_string <- if (is.factor(dat[, i])) {
      # syntax for a categorical variable
      summary_categorical(dat[, i])
    } else {
      # syntax for a continuous variable
      summary_continuous(dat[, i])
    }
    # print the result of the summary
    print(summary_string)
  }
}
summary_df(dat = nrj)
## [1] "60% female, 40% male"
## [1] "2200.4 (294.9)"
## [1] "95.6 (30.8)"
## [1] "168.6 (11.5)"
## [1] "50.1 (10.3)"
## [1] "34% never, 23% regularly, 43% sometimes"
## [1] "1734.6 (375.8)"
## [1] "1734.6 (375.8)"
Note:
It would also be possible to use print() directly around the functions summary_categorical() and summary_continuous(). However, for the next Task it is more convenient to first collect the summary string in an object (summary_string) and do the further steps with that object.
The output of the function summary_df() that we created in the previous solution does not contain any variable names.
Modify the function so that the output strings look like the output in the previous exercise (with “variable name: …”).
To adjust the function, the only line that needs changing is the one in which we print the summary:
summary_df <- function(dat) {
  # loop over all columns
  for (i in 1:ncol(dat)) {
    # check if the column is a factor
    summary_string <- if (is.factor(dat[, i])) {
      # syntax for a categorical variable
      summary_categorical(dat[, i])
    } else {
      # syntax for a continuous variable
      summary_continuous(dat[, i])
    }
    # print the result of the summary together with the variable name
    print(paste0(names(dat)[i], ": ", summary_string))
  }
}
summary_df(dat = nrj)
## [1] "sex: 60% female, 40% male"
## [1] "kcal: 2200.4 (294.9)"
## [1] "weight: 95.6 (30.8)"
## [1] "height: 168.6 (11.5)"
## [1] "age: 50.1 (10.3)"
## [1] "sports: 34% never, 23% regularly, 43% sometimes"
## [1] "BMR1: 1734.6 (375.8)"
## [1] "BMR2: 1734.6 (375.8)"
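print() only displays the strings; they cannot be re-used afterwards. As a possible alternative, and only as a sketch, the hypothetical function summary_df_chr() below returns the summaries as a named character vector using vapply(). The two helper functions are repeated from the earlier solution so that the block runs on its own, and toy is made-up data.

```r
# helper functions from the earlier solution, repeated to keep the sketch self-contained
summary_continuous <- function(x) {
  paste0(round(mean(x), 1), " (", round(sd(x), 1), ")")
}
summary_categorical <- function(x) {
  tab <- prop.table(table(x))
  paste0(round(tab * 100, 1), "% ", names(tab), collapse = ", ")
}

# hypothetical variant: return the summaries instead of printing them
summary_df_chr <- function(dat) {
  vapply(dat, function(x) {
    if (is.factor(x)) summary_categorical(x) else summary_continuous(x)
  }, FUN.VALUE = character(1))
}

toy <- data.frame(group = factor(c("a", "b", "a", "a")), score = c(1, 2, 3, 4))
res <- summary_df_chr(toy)

# the names of the result are the variable names
paste0(names(res), ": ", res)
## [1] "group: 75% a, 25% b" "score: 2.5 (1.3)"
```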
© Nicole Erler