Data

For this practical, we simulate data using the following syntax:

set.seed(2020)
nrj <- data.frame(sex = factor(sample(c('male', 'female'), size = 100, replace = TRUE)),
                  kcal = round(rnorm(100, mean = 2200, sd = 250)),
                  weight = runif(100, min = 45, max = 150),
                  height = rnorm(100, mean = 170, sd = 10),
                  age = rnorm(100, mean = 50, sd = 10),
                  sports = factor(sample(c('never', 'sometimes', 'regularly'),
                                         size = 100, replace = TRUE), ordered = TRUE)
)
The first six rows of the resulting data.frame are:
sex     kcal  weight  height    age  sports
female  2153   83.25  162.70  59.17  never
female  2350   74.35  178.51  47.73  never
male    2032   47.71  166.04  61.82  sometimes
female  2319   82.89  174.07  65.21  regularly
female  2230   88.52  159.61  64.38  sometimes
male    2230  112.04  157.44  73.27  regularly

Calculating the BMR

We want to calculate the Basal Metabolic Rate (BMR) for the individuals in the nrj data.

The formula differs for men and women:

  • men: \((13.75 \times \text{weight}) + (5 \times \text{height}) - (6.76 \times \text{age}) + 66\)
  • women: \((9.56 \times \text{weight}) + (1.85 \times \text{height}) - (4.68 \times \text{age}) + 655\)
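As a quick sanity check of the two formulas, the following sketch evaluates them for two made-up individuals. The values (80 kg, 180 cm, 40 years and 60 kg, 165 cm, 40 years) are illustrative assumptions and are not taken from the nrj data. We assume weight in kg, height in cm and age in years, which matches the scale of the simulated variables.

```r
# Hypothetical example values (not from the nrj data)
bmr_man   <- 13.75 * 80 + 5 * 180 - 6.76 * 40 + 66
bmr_woman <- 9.56 * 60 + 1.85 * 165 - 4.68 * 40 + 655

bmr_man    # approximately 1795.6
bmr_woman  # approximately 1346.65
```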

Task 1

Which function would you choose to calculate the BMR? ifelse() or if() ... else ...?

Solution 1

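Both are possible, but ifelse() is more straightforward in this setting because it is vectorized: it evaluates the test for all rows at once, whereas if() ... else ... handles only a single value at a time.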

Task 2

  • Run the syntax from above to create the dataset.
  • Calculate the BMR using ifelse() and add it to the dataset as a new variable BMR1.
The function ifelse() has arguments test, yes and no.
You could use nrj$sex == "male" as the test.

Solution 2

nrj$BMR1 <- ifelse(nrj$sex == 'male',
                  13.75 * nrj$weight + 5 * nrj$height - 6.76 * nrj$age + 66,
                  9.56 * nrj$weight + 1.85 * nrj$height - 4.68 * nrj$age + 655
)
head(nrj)
##      sex kcal    weight   height      age    sports     BMR1
## 1 female 2153  83.24565 162.7047 59.16676     never 1474.932
## 2 female 2350  74.34613 178.5138 47.73101     never 1472.618
## 3   male 2032  47.70793 166.0351 61.82203 sometimes 1134.242
## 4 female 2319  82.89275 174.0668 65.21487 regularly 1464.273
## 5 female 2230  88.51872 159.6145 64.37977 sometimes 1495.228
## 6   male 2230 112.04331 157.4412 73.27302 regularly 1898.476

Task 3

We now want to calculate the BMR using if() ... else ....

How is the condition used in if() different from the test used in ifelse()?

Solution 3

The argument test in ifelse() expects a vector of “tests” (a vector of TRUE and FALSE) while the condition argument in if() expects a single “test” (a single TRUE or FALSE).
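A minimal sketch of this difference, using a small made-up vector sex (not the nrj data). The object names res_vec and res_one are chosen for illustration only.

```r
sex <- c("male", "female", "male")   # small made-up vector

# ifelse(): the test is a logical vector, the result has the same length
res_vec <- ifelse(sex == "male", "M", "F")
res_vec
## [1] "M" "F" "M"

# if(): the condition must be a single TRUE or FALSE,
# so we have to pick one element
res_one <- if (sex[2] == "male") "M" else "F"
res_one
## [1] "F"
```

In R 4.2.0 and later, passing a condition with more than one element to if() is an error, which is why the index is needed here.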

Task 4

How can we check row by row if a subject is male or female?

Solution 4

We could use a for()-loop that runs through all rows in nrj.

Task 5

Calculate the BMR using if() ... else ... and add it to the nrj data as a new variable BMR2.

You need a sequence from 1 to the number of rows of nrj.
You need to pre-specify (an empty version of) the new variable BMR2.
Don’t forget that you now need to use indices.

Solution 5

Our syntax needs to have the general form:

for (i in <"rows of nrj">) {
  
  if (<"subject is male">) {
    
    <"formula for males">
    
  } else {
    
    <"formula for females">
    
  }
}

Filled in, this becomes:

nrj$BMR2 <- NA   # "empty" version of BMR2

# loop over all rows
for (i in 1:nrow(nrj)) {
  
  # test if the subject is male
  nrj$BMR2[i] <- if (nrj$sex[i] == 'male') {
    
    # formula for males
    13.75 * nrj$weight[i] + 5 * nrj$height[i] - 6.76 * nrj$age[i] + 66
    
  } else {
    
    # formula for females
    9.56 * nrj$weight[i] + 1.85 * nrj$height[i] - 4.68 * nrj$age[i] + 655
    
  }
}
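A side note on the sequence used in the loop: 1:nrow(nrj) works here because nrj has 100 rows. The following sketch, with a made-up data.frame called empty, shows why seq_len() is often suggested as a safer alternative. For a data.frame without rows, 1:nrow() counts down from 1 to 0 instead of giving an empty sequence.

```r
empty <- data.frame(weight = numeric(0))   # hypothetical data.frame without rows

1:nrow(empty)          # counts down from 1 to 0
## [1] 1 0
seq_len(nrow(empty))   # empty sequence
## integer(0)

# with seq_len() the loop body is never run
n_iter <- 0
for (i in seq_len(nrow(empty))) n_iter <- n_iter + 1
n_iter
## [1] 0
```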

Task 6

Check that BMR1 and BMR2 are the same.

Solution 6

There are several ways to check this:

all.equal(nrj$BMR1, nrj$BMR2)
## [1] TRUE
identical(nrj$BMR1, nrj$BMR2)
## [1] TRUE
table(nrj$BMR1 == nrj$BMR2)
## 
## TRUE 
##  100
summary(nrj$BMR1 - nrj$BMR2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0       0
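These checks differ in how strict they are: identical() and == require exact equality, while all.equal() allows for tiny numerical differences. For BMR1 and BMR2 all checks agree because both use exactly the same arithmetic. The following sketch with made-up numbers shows a case in which they differ:

```r
a <- 0.1 + 0.2
b <- 0.3

a == b                    # exact comparison of floating point numbers
## [1] FALSE
identical(a, b)
## [1] FALSE
all.equal(a, b)           # equal up to a small tolerance
## [1] TRUE
isTRUE(all.equal(a, b))   # safe to use as a condition in if()
## [1] TRUE
```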

Data Summary

We would now like to get descriptive statistics of the nrj data. For continuous variables, we want the mean and standard deviation; for categorical variables, we want to report the proportion of subjects in each category.

Task 1

Remember how to get summary measures?

  • Get the mean and standard deviation of the variable kcal.
  • Get the proportion of observations in each category for the variable sports.
You might use mean(), sd(), table() and prop.table().

Solution 1

mean(nrj$kcal)
## [1] 2200.36
sd(nrj$kcal)
## [1] 294.8881
prop.table(table(nrj$sports))
## 
##     never regularly sometimes 
##      0.34      0.23      0.43

Task 2

Use paste() to create some nice-looking output for these summaries:

  • “kcal: <mean> (<sd>)” for kcal
  • “sports: <proportion>% <category 1>, <proportion>% <category 2>, …” for sports
Remember: The function paste() has arguments sep and collapse, but you could also use paste0().
You might also want to use the function round() to reduce the number of digits in the output.
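As a reminder of how these arguments behave, here is a short sketch with made-up strings and numbers (not the nrj data):

```r
paste("mean", "sd", sep = "/")            # sep goes between the arguments
## [1] "mean/sd"

paste0("x", 1:3)                          # no separator, works element-wise
## [1] "x1" "x2" "x3"

paste(c("a", "b", "c"), collapse = ", ")  # collapse gives one single string
## [1] "a, b, c"

round(3.14159, 1)
## [1] 3.1
```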

Solution 2

# for "kcal":
paste0("kcal: ", round(mean(nrj$kcal), 1), " (", round(sd(nrj$kcal), 1), ")")
## [1] "kcal: 2200.4 (294.9)"
# for "sports":

# table of proportions:
tab <- prop.table(table(nrj$sports))

# combine proportions with category names:
props <- paste0(round(tab * 100, 1), "% ", names(tab), collapse = ", ")

# combine variable name with proportions string:
paste('sports:', props)
## [1] "sports: 34% never, 23% regularly, 43% sometimes"

Task 3

Write a loop that creates the summary strings we created in the previous Solution for each variable in the nrj data.

Use print() to print each of the summaries.

You need a loop that iterates through the columns of nrj.
You need to automatically check for each variable whether it is a factor or not to choose the correct summary type.
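A small sketch of such a check, using a made-up data.frame called toy (an assumption for illustration, not part of the practical):

```r
toy <- data.frame(group = factor(c("a", "b", "a")),
                  value = c(1.5, 2.5, 3.5))

is.factor(toy$group)
## [1] TRUE
is.factor(toy$value)
## [1] FALSE

# columns can also be selected by their index
is.factor(toy[, 1])
## [1] TRUE

# the matching variable name
names(toy)[1]
## [1] "group"
```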

Solution 3

Our syntax has the general form:

for (i in <"columns of nrj">) {
  
  if (<"variable is factor">) {
    
    <"syntax for summary of a factor">
    
  } else {
    
    <"syntax for summary of a continuous variable">
    
  }
}

With the placeholders filled in:

# loop over all columns
for (i in 1:ncol(nrj)) {
  
  # test if column is a factor
  if (is.factor(nrj[, i])) {
    
    # syntax for categorical variable
    tab <- prop.table(table(nrj[, i]))
    props <- paste0(round(tab * 100, 1), "% ", names(tab), collapse = ", ")
    print(paste0(names(nrj)[i], ": ", props))
    
  } else {
    
    # syntax for continuous variable
    print(paste0(names(nrj)[i], ": ", 
                 round(mean(nrj[, i]), 1), " (", round(sd(nrj[, i]), 1), ")"))
    
  }
}
## [1] "sex: 60% female, 40% male"
## [1] "kcal: 2200.4 (294.9)"
## [1] "weight: 95.6 (30.8)"
## [1] "height: 168.6 (11.5)"
## [1] "age: 50.1 (10.3)"
## [1] "sports: 34% never, 23% regularly, 43% sometimes"
## [1] "BMR1: 1734.6 (375.8)"
## [1] "BMR2: 1734.6 (375.8)"

Writing your own function

We also want to practice writing our own functions.

Task 1

Write two functions:

  • one that creates the summary string for a continuous variable (mean and sd),
  • one that creates the output (proportion of subjects per category) for a categorical variable.

Assume that the input will be a vector, such as nrj$kcal or nrj$sports.
Note that when we work with these vectors, they do not contain the variable name any more, so we cannot use the variable name in the two functions.

The general form of a function is function(){...}.
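As a minimal illustration of this general form, here is a toy function with one argument. The name add_one is made up for this sketch.

```r
# Hypothetical example: a function with one argument x
add_one <- function(x) {
  x + 1    # the value of the last expression is returned
}

add_one(5)
## [1] 6
add_one(c(1, 2, 3))
## [1] 2 3 4
```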

Solution 1

We can re-use the syntax we used before, and only remove the part that adds the variable name.

Our functions have one argument, which we call x here:

summary_continuous <- function(x) {
  paste0(round(mean(x), 1), " (", round(sd(x), 1), ")")
}

summary_continuous(nrj$kcal)
## [1] "2200.4 (294.9)"
summary_categorical <- function(x) {
  tab <- prop.table(table(x))
  paste0(round(tab * 100, 1), "% ", names(tab), collapse = ", ")
}

summary_categorical(nrj$sports)
## [1] "34% never, 23% regularly, 43% sometimes"

Task 2

Write another function that has a data.frame as input and prints the summary string for each variable, using the two functions from the previous solution.

Solution 2

This function also has one argument, which we call dat. We can, again, re-use syntax from above:

summary_df <- function(dat) {
  
  # loop over all columns
  for (i in 1:ncol(dat)) {
    
    # check if the column is a factor
    summary_string <- if (is.factor(dat[, i])) {
      
      # syntax for a categorical variable
      summary_categorical(dat[, i])
      
    } else {
      
      # syntax for a continuous variable
      summary_continuous(dat[, i])
    }
    
    # print the result of the summary
    print(summary_string)
  }
  
}

summary_df(dat = nrj)
## [1] "60% female, 40% male"
## [1] "2200.4 (294.9)"
## [1] "95.6 (30.8)"
## [1] "168.6 (11.5)"
## [1] "50.1 (10.3)"
## [1] "34% never, 23% regularly, 43% sometimes"
## [1] "1734.6 (375.8)"
## [1] "1734.6 (375.8)"

Note:

It would also be possible to use print() directly around the functions summary_categorical() and summary_continuous(). However, for the next Task it is more convenient to first collect the summary string in an object (summary_string), and do the further steps with that object.
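Collecting the result works because if() ... else ... returns the value of the branch that was run. A minimal sketch with made-up objects x and size:

```r
x <- 7   # made-up value

# the result of if() ... else ... can be assigned to an object
size <- if (x > 5) {
  "large"
} else {
  "small"
}

size
## [1] "large"

# a further step using the stored object
paste0("x is ", size)
## [1] "x is large"
```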

Task 3

The function summary_df() that we created in the previous solution does not contain any variable names.

Modify the function so that the output strings look like the output in the previous exercise (with “variable name: …”).

Solution 3

To adjust the function, the only line that needs changing is the one in which we print the summary:

summary_df <- function(dat) {
  
  # loop over all columns
  for (i in 1:ncol(dat)) {
    
    # check if the column is a factor
    summary_string <- if (is.factor(dat[, i])) {
      
      # syntax for a categorical variable
      summary_categorical(dat[, i])
      
    } else {
      
      # syntax for a continuous variable
      summary_continuous(dat[, i])
    }
    
    # print the result of the summary together with the variable name
    print(paste0(names(dat)[i], ": ", summary_string))
  }
}

summary_df(dat = nrj)
## [1] "sex: 60% female, 40% male"
## [1] "kcal: 2200.4 (294.9)"
## [1] "weight: 95.6 (30.8)"
## [1] "height: 168.6 (11.5)"
## [1] "age: 50.1 (10.3)"
## [1] "sports: 34% never, 23% regularly, 43% sometimes"
## [1] "BMR1: 1734.6 (375.8)"
## [1] "BMR2: 1734.6 (375.8)"
 

© Nicole Erler