For this practical, we simulate data using the following syntax:
set.seed(1234)
nrj <- data.frame(sex = factor(sample(c('male', 'female'),
                                      size = 100, replace = TRUE)),
                  kcal = round(rnorm(100, mean = 2200, sd = 250)),
                  weight = runif(100, min = 45, max = 150),
                  height = rnorm(100, mean = 170, sd = 10),
                  age = rnorm(100, mean = 50, sd = 10),
                  sports = factor(sample(c('never', 'sometimes', 'regularly'),
                                         size = 100, replace = TRUE),
                                  ordered = TRUE)
)
The first six rows of the resulting data.frame are:
head(nrj)
## sex kcal weight height age sports
## 1 female 1748 82.06484 174.8523 44.20043 sometimes
## 2 female 2054 148.00062 176.9677 40.46721 regularly
## 3 female 1923 101.58269 171.8551 48.20571 sometimes
## 4 female 1946 91.62356 177.0073 60.09808 never
## 5 male 2159 144.68351 173.1168 50.23627 never
## 6 female 2341 92.51075 177.6046 43.50972 sometimes
We want to calculate the Basal Metabolic Rate (BMR) for the individuals in the nrj data.
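The formula differs for men and women (weight in kg, height in cm, age in years):
men: BMR = 66 + 13.75 * weight + 5 * height - 6.76 * age
women: BMR = 655 + 9.56 * weight + 1.85 * height - 4.68 * age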
Which function would you choose to calculate the BMR, ifelse() or if() ... else ...?
Both are possible, but ifelse() is more straightforward to use in this setting.
Calculate the BMR using ifelse() and add it to the dataset as a new variable BMR1.
The function ifelse() has arguments test, yes and no.
You can use nrj$sex == "male" as the test.
We need to create a test that distinguishes between males and females. This can be done by simply comparing the value of the variable nrj$sex with either "male" or "female".
When we use nrj$sex == "male", we need to provide the formula for males as the yes argument and the formula for females as the no argument.
nrj$BMR1 <- ifelse(nrj$sex == "male",
                   13.75 * nrj$weight + 5 * nrj$height - 6.76 * nrj$age + 66,
                   9.56 * nrj$weight + 1.85 * nrj$height - 4.68 * nrj$age + 655
)
head(nrj)
## sex kcal weight height age sports BMR1
## 1 female 1748 82.06484 174.8523 44.20043 sometimes 1556.159
## 2 female 2054 148.00062 176.9677 40.46721 regularly 2207.890
## 3 female 1923 101.58269 171.8551 48.20571 sometimes 1718.460
## 4 female 1946 91.62356 177.0073 60.09808 never 1577.126
## 5 male 2159 144.68351 173.1168 50.23627 never 2581.385
## 6 female 2341 92.51075 177.6046 43.50972 sometimes 1664.346
The formula that we specify for the yes argument evaluates to the BMR for all rows in the nrj data, calculated as if all persons were male, and the formula we specify for the no argument calculates the BMR as if all persons in the data were female.
yes and no are therefore vectors of the same length as nrj$sex, and depending on whether nrj$sex == "male" returns TRUE or FALSE, the corresponding element of the first or second vector is used.
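The element-wise selection described above can be seen in a small sketch. The vectors sex, yes and no below are made-up illustration values, not part of the nrj data:

```r
# Hypothetical toy vectors, only for illustration
sex <- c("male", "female", "female", "male")
yes <- c(1, 2, 3, 4)       # values used where the test is TRUE
no  <- c(10, 20, 30, 40)   # values used where the test is FALSE

# For each position, ifelse() picks the element of 'yes' or 'no'
picked <- ifelse(sex == "male", yes, no)
picked
## [1]  1 20 30  4
```

Positions 1 and 4 come from yes (the test is TRUE there), positions 2 and 3 from no.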
As an exercise, we now want to use if() ... else ... to calculate the BMR.
How is the argument cond used in if() different from the argument test used in ifelse()?
The argument test in ifelse() expects a vector of “tests” (a vector of TRUE and FALSE) while the cond argument in if() expects a single “test” (a single TRUE or FALSE).
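A minimal sketch of this difference, using a made-up vector x (an assumption for illustration, not part of the practical):

```r
x <- c(5, -3, 2)  # hypothetical example values

# ifelse() tests every element of the vector
res_vec <- ifelse(x > 0, "pos", "neg")
res_vec
## [1] "pos" "neg" "pos"

# if() uses a single TRUE or FALSE, so we select one element first
res_one <- if (x[2] > 0) "pos" else "neg"
res_one
## [1] "neg"
```

In R 4.2.0 and later, passing a condition of length greater than one to if() is an error; older versions used only the first element and gave a warning.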
How can we check row by row if a subject is male or female?
We could use a for() loop that runs through all rows in nrj.
Calculate the BMR using if() ... else ... and add it to the nrj data as a new variable BMR2.
You will first need to create an "empty" version of BMR2.
Our syntax needs to have the general form:
for (<"index"> in <"rows of nrj">) {
  if (<"subject is male">) {
    <"formula for males">
  } else {
    <"formula for females">
  }
}
nrj$BMR2 <- NA # "empty" version of BMR2

# loop over all rows
for (i in 1:nrow(nrj)) {
  # test if the subject is male
  nrj$BMR2[i] <- if (nrj$sex[i] == "male") {
    # formula for males
    13.75 * nrj$weight[i] + 5 * nrj$height[i] - 6.76 * nrj$age[i] + 66
  } else {
    # formula for females
    9.56 * nrj$weight[i] + 1.85 * nrj$height[i] - 4.68 * nrj$age[i] + 655
  }
}
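The loop assigns the result of if() ... else ... directly to an element of BMR2. This works because if() ... else ... returns the value of whichever branch is run. A minimal sketch with a made-up value x (not part of the nrj data):

```r
x <- 7  # hypothetical example value

# if() ... else ... returns the value of the branch that is run,
# so the result can be assigned directly to an object
size <- if (x > 5) {
  "large"
} else {
  "small"
}
size
## [1] "large"
```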
Check that BMR1 and BMR2 are the same.
There are multiple possible ways to check this:
all.equal(nrj$BMR1, nrj$BMR2)
## [1] TRUE
identical(nrj$BMR1, nrj$BMR2)
## [1] TRUE
table(nrj$BMR1 == nrj$BMR2)
##
## TRUE
## 100
summary(nrj$BMR1 - nrj$BMR2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
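These checks are not always interchangeable: identical() and == require exact equality, while all.equal() allows for a small numerical tolerance. Here all of them agree, but with floating point arithmetic they can differ. A small sketch with made-up numbers a and b:

```r
# Hypothetical values illustrating floating point rounding
a <- 0.1 + 0.2
b <- 0.3

exact <- identical(a, b)           # exact comparison
approx <- isTRUE(all.equal(a, b))  # comparison with a small tolerance

exact
## [1] FALSE
approx
## [1] TRUE
```

all.equal() returns either TRUE or a description of the difference, which is why it is wrapped in isTRUE() when a single TRUE or FALSE is needed.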
We would now like to get descriptive statistics of the nrj data. For continuous variables, we want the mean and standard deviation; for categorical variables, we want to report the proportion of subjects in each category.
Eventually, the aim is to write a function that will do this automatically for a dataset. The steps for developing such a function are spread over the different Tasks in this section and the next.
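Remember how to get summary measures?
Calculate the mean and standard deviation of kcal.
Calculate the proportion of subjects in each category of sports.
You can use mean(), sd(), table() and prop.table().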
mean(nrj$kcal)
## [1] 2219.93
sd(nrj$kcal)
## [1] 237.6322
prop.table(table(nrj$sports))
##
## never regularly sometimes
## 0.33 0.36 0.31
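Use paste() to create some nice-looking output for these summaries:
kcal: the mean, followed by the standard deviation in parentheses
sports: the percentage of subjects in each category, followed by the category name
paste() has arguments sep and collapse, but you could also use paste0().
Use round() to reduce the number of digits in the output.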
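As a quick reminder of how sep and collapse differ, here is a short sketch with made-up inputs (the object names p1 to p4 are only for illustration):

```r
# sep is placed between the arguments that are pasted together
p1 <- paste("kcal", 2200, sep = ": ")
p1
## [1] "kcal: 2200"

# collapse joins the elements of a vector into one single string
p2 <- paste(c("a", "b", "c"), collapse = ", ")
p2
## [1] "a, b, c"

# paste0() is paste() with sep = ""
p3 <- paste0("x", 1:3)
p3
## [1] "x1" "x2" "x3"

# both steps combined: paste element-wise, then collapse
p4 <- paste0(c(10, 20), "%", collapse = " and ")
p4
## [1] "10% and 20%"
```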
# for "kcal":
paste0("kcal: ", round(mean(nrj$kcal), 1), " (", round(sd(nrj$kcal), 1), ")")
## [1] "kcal: 2219.9 (237.6)"
# for "sports":
# table of proportions:
tab <- prop.table(table(nrj$sports))
# combine the proportions with category names:
props <- paste0(round(tab * 100, 1), "% ", names(tab), collapse = ", ")
# combine the variable name with the proportions string:
paste('sports:', props)
## [1] "sports: 33% never, 36% regularly, 31% sometimes"
Write a loop that creates the summary strings we created in the previous solution for each variable in the nrj data.
Use print() to print each of the summaries.
You need to check whether a variable is a factor or not (for example with is.factor()) to choose the correct summary type.
Our syntax has the general form:
for (<"index"> in <"columns of nrj">) {
  if (<"variable is factor">) {
    <"syntax for summary of a factor">
  } else {
    <"syntax for summary of a continuous variable">
  }
}
# loop over all columns
for (i in 1:ncol(nrj)) {
  # test if column is a factor
  if (is.factor(nrj[, i])) {
    # syntax for categorical variable
    tab <- prop.table(table(nrj[, i]))
    props <- paste0(round(tab * 100, 1), "% ", names(tab), collapse = ", ")
    print(paste0(names(nrj)[i], ": ", props))
  } else {
    # syntax for continuous variable
    print(paste0(names(nrj)[i], ": ",
                 round(mean(nrj[, i]), 1), " (", round(sd(nrj[, i]), 1), ")"))
  }
}
## [1] "sex: 58% female, 42% male"
## [1] "kcal: 2219.9 (237.6)"
## [1] "weight: 99.6 (31.4)"
## [1] "height: 171.5 (9.6)"
## [1] "age: 49.9 (10.5)"
## [1] "sports: 33% never, 36% regularly, 31% sometimes"
## [1] "BMR1: 1807.1 (390.2)"
## [1] "BMR2: 1807.1 (390.2)"
Now we will put the different elements from the previous part together using functions.
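Write two functions: one that creates the summary string for a continuous variable, and one that creates the summary string for a categorical variable.
Assume that the input will be a vector, such as nrj$kcal or nrj$sports.
Note that when we work with these vectors, they do not contain the variable name any more, so we cannot use the variable name in the two functions.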
<function name> <- function( <arguments> ) {
<function body>
}
We can re-use the syntax we used before, and only remove the part that adds the variable name.
Our functions have one argument, which we call x here:
summary_continuous <- function(x) {
  paste0(round(mean(x), 1), " (", round(sd(x), 1), ")")
}
summary_continuous(nrj$kcal)
## [1] "2219.9 (237.6)"
summary_categorical <- function(x) {
  tab <- prop.table(table(x))
  paste0(round(tab * 100, 1), "% ", names(tab), collapse = ", ")
}
summary_categorical(nrj$sports)
## [1] "33% never, 36% regularly, 31% sometimes"
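One way to convince ourselves that the two functions behave as intended is to try them on small vectors for which the result can be worked out by hand. The sketch below repeats the two function definitions so that it runs on its own; the vectors scores and groups are made-up examples.

```r
# Same definitions as in the solution, repeated so the sketch is self-contained
summary_continuous <- function(x) {
  paste0(round(mean(x), 1), " (", round(sd(x), 1), ")")
}
summary_categorical <- function(x) {
  tab <- prop.table(table(x))
  paste0(round(tab * 100, 1), "% ", names(tab), collapse = ", ")
}

# Hypothetical toy data: mean is 2.5, standard deviation is about 1.29
scores <- c(1, 2, 3, 4)
res_cont <- summary_continuous(scores)
res_cont
## [1] "2.5 (1.3)"

# Hypothetical toy data: three of four values are "a"
groups <- factor(c("a", "b", "a", "a"))
res_cat <- summary_categorical(groups)
res_cat
## [1] "75% a, 25% b"
```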
Write another function that has a data.frame as input and prints a summary string for each variable, using the two functions from the previous solution (i.e., so that summary_continuous() is used for continuous variables and summary_categorical() is used for categorical variables).
The general structure for our syntax is
<"function name"> <- function(<"data">) {
  for (<"index"> in <"columns of the data">) {
    if (<"variable is factor">) {
      <"function for summary of a factor">
    } else {
      <"function for summary of a continuous variable">
    }
  }
}
This function also has one argument, which we call dat.
We can, again, re-use syntax from above:
summary_df <- function(dat) {
  # loop over all columns
  for (i in 1:ncol(dat)) {
    # check if the column is a factor
    summary_string <- if (is.factor(dat[, i])) {
      # syntax for a categorical variable
      summary_categorical(dat[, i])
    } else {
      # syntax for a continuous variable
      summary_continuous(dat[, i])
    }
    # print the result of the summary
    print(summary_string)
  }
}
summary_df(dat = nrj)
## [1] "58% female, 42% male"
## [1] "2219.9 (237.6)"
## [1] "99.6 (31.4)"
## [1] "171.5 (9.6)"
## [1] "49.9 (10.5)"
## [1] "33% never, 36% regularly, 31% sometimes"
## [1] "1807.1 (390.2)"
## [1] "1807.1 (390.2)"
It would also be possible to use print() directly around the functions summary_categorical() and summary_continuous(). However, for the next Task it is more convenient to first collect the summary string in an object (summary_string), and do the further steps with that object.
The function summary_df() that we created in the previous solution does not contain any variable names. Here we want to add those names to the output.
Modify the function so that the output strings look like the output in the previous exercise (with “variable name: …”).
To adjust the function, the only line that needs changing is the one in which we print the summary:
summary_df <- function(dat) {
  # loop over all columns
  for (i in 1:ncol(dat)) {
    # check if the column is a factor
    summary_string <- if (is.factor(dat[, i])) {
      # syntax for a categorical variable
      summary_categorical(dat[, i])
    } else {
      # syntax for a continuous variable
      summary_continuous(dat[, i])
    }
    # print the result of the summary together with the variable name
    print(paste0(names(dat)[i], ": ", summary_string))
  }
}
summary_df(dat = nrj)
## [1] "sex: 58% female, 42% male"
## [1] "kcal: 2219.9 (237.6)"
## [1] "weight: 99.6 (31.4)"
## [1] "height: 171.5 (9.6)"
## [1] "age: 49.9 (10.5)"
## [1] "sports: 33% never, 36% regularly, 31% sometimes"
## [1] "BMR1: 1807.1 (390.2)"
## [1] "BMR2: 1807.1 (390.2)"
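A possible variation, not part of the practical itself: instead of printing, the function could collect the strings in a vector and return them, using the same idea of an "empty" object that is filled in a loop. The name summary_df_strings and the data.frame toy are hypothetical; the two helper functions are repeated so that the sketch runs on its own.

```r
# Helper functions as defined in the earlier solution
summary_continuous <- function(x) {
  paste0(round(mean(x), 1), " (", round(sd(x), 1), ")")
}
summary_categorical <- function(x) {
  tab <- prop.table(table(x))
  paste0(round(tab * 100, 1), "% ", names(tab), collapse = ", ")
}

# Hypothetical variant: returns the summary strings instead of printing them
summary_df_strings <- function(dat) {
  out <- character(ncol(dat))  # "empty" vector with one element per column
  for (i in seq_len(ncol(dat))) {
    summary_string <- if (is.factor(dat[, i])) {
      summary_categorical(dat[, i])
    } else {
      summary_continuous(dat[, i])
    }
    out[i] <- paste0(names(dat)[i], ": ", summary_string)
  }
  out
}

# Hypothetical toy data with one factor and one continuous variable
toy <- data.frame(group = factor(c("a", "b", "a", "a")),
                  score = c(1, 2, 3, 4))
res <- summary_df_strings(toy)
```

Returning the strings makes it possible to reuse them later, for example in a table, while print(res) would still show them on the screen.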