Data

For this practical, we simulate data using the following syntax:

set.seed(1234)
nrj <- data.frame(sex = factor(sample(c('male', 'female'),
                                      size = 100, replace = TRUE)),
                  kcal = round(rnorm(100, mean = 2200, sd = 250)),
                  weight = runif(100, min = 45, max = 150),
                  height = rnorm(100, mean = 170, sd = 10),
                  age = rnorm(100, mean = 50, sd = 10),
                  sports = factor(sample(c('never', 'sometimes', 'regularly'),
                                         size = 100, replace = TRUE),
                                  ordered = TRUE)
)

The first six rows of the resulting data.frame are:

head(nrj)
##      sex kcal    weight   height      age    sports
## 1 female 1748  82.06484 174.8523 44.20043 sometimes
## 2 female 2054 148.00062 176.9677 40.46721 regularly
## 3 female 1923 101.58269 171.8551 48.20571 sometimes
## 4 female 1946  91.62356 177.0073 60.09808     never
## 5   male 2159 144.68351 173.1168 50.23627     never
## 6 female 2341  92.51075 177.6046 43.50972 sometimes

Calculating the BMR

We want to calculate the Basal Metabolic Rate (BMR) for the individuals in the nrj data.

The formula differs for men and women:

  • men: \((13.75 \times \text{weight}) + (5 \times \text{height}) - (6.76 \times \text{age}) + 66\)
  • women: \((9.56 \times \text{weight}) + (1.85 \times \text{height}) - (4.68 \times \text{age}) + 655\)
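
As a quick check of the arithmetic, the formula for women can be evaluated by hand for a single person. The sketch below uses the (rounded) values printed by head(nrj) for the first row; the object names w, h, a and bmr_woman are only illustrative and are not used elsewhere in this practical.

```r
# values of the first row of nrj (a woman), rounded as printed by head(nrj)
w <- 82.06484   # weight
h <- 174.8523   # height
a <- 44.20043   # age

# formula for women, applied to this single person
bmr_woman <- 9.56 * w + 1.85 * h - 4.68 * a + 655
round(bmr_woman, 1)
## [1] 1556.2
```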

Task 1

Which function would you choose to calculate the BMR, ifelse() or if() ... else ...?

Solution 1

Both are possible, but ifelse() is more straightforward to use in this setting.

Task 2

  • Run the syntax from above to create the dataset.
  • Calculate the BMR using ifelse() and add it to the dataset as a new variable BMR1.

The function ifelse() has arguments test, yes and no.

You could use nrj$sex == "male" as the test.

Solution 2

We need to create a test that distinguishes between males and females. This can be done by simply comparing the value of the variable nrj$sex with either "male" or "female".

When we use nrj$sex == "male", we need to provide the formula for males as the yes argument and the formula for females as the no argument.

nrj$BMR1 <- ifelse(nrj$sex == "male",
                  13.75 * nrj$weight + 5 * nrj$height - 6.76 * nrj$age + 66,
                  9.56 * nrj$weight + 1.85 * nrj$height - 4.68 * nrj$age + 655
)
head(nrj)
##      sex kcal    weight   height      age    sports     BMR1
## 1 female 1748  82.06484 174.8523 44.20043 sometimes 1556.159
## 2 female 2054 148.00062 176.9677 40.46721 regularly 2207.890
## 3 female 1923 101.58269 171.8551 48.20571 sometimes 1718.460
## 4 female 1946  91.62356 177.0073 60.09808     never 1577.126
## 5   male 2159 144.68351 173.1168 50.23627     never 2581.385
## 6 female 2341  92.51075 177.6046 43.50972 sometimes 1664.346

The expression that we pass to the yes argument evaluates to the BMR for all rows in the nrj data, calculated as if all persons were male, and the expression that we pass to the no argument evaluates to the BMR calculated as if all persons were female.

So what we pass to yes and no are vectors of the same length as nrj$sex, and depending on whether nrj$sex == "male" returns TRUE or FALSE, the corresponding element of the first or second vector is used.
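
To make this element-wise selection visible, here is a minimal sketch with three made-up subjects. The vectors sex, as_male and as_female are hypothetical toy values, not part of the nrj data.

```r
# toy data: three subjects
sex       <- c("male", "female", "male")
as_male   <- c(10, 20, 30)   # values calculated "as if male"
as_female <- c(1, 2, 3)      # values calculated "as if female"

# for each position, pick the element of as_male where the test is TRUE
# and the element of as_female where it is FALSE
res <- ifelse(sex == "male", as_male, as_female)
res
## [1] 10  2 30
```

The second element comes from as_female because the second subject is female; the first and third elements come from as_male.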

Task 3

As an exercise, we now want to calculate the BMR using if() ... else ....

How is the argument cond used in if() different from the argument test used in ifelse()?

Solution 3

The argument test in ifelse() expects a vector of “tests” (a vector of TRUE and FALSE) while the cond argument in if() expects a single “test” (a single TRUE or FALSE).
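
The difference can be illustrated with a small sketch (test_vec and outcome are hypothetical names). As far as we know, passing a vector of length greater than one to if() is an error from R 4.2.0 onwards, while older versions only gave a warning and used the first element, so the sketch catches both cases.

```r
test_vec <- c(TRUE, FALSE, TRUE)

# ifelse() handles the whole vector of tests at once
ifelse(test_vec, "yes", "no")
## [1] "yes" "no"  "yes"

# if() needs a single TRUE or FALSE, e.g. one element of the vector
if (test_vec[2]) "yes" else "no"
## [1] "no"

# passing the whole vector to if() does not work as intended:
# we record whether R signals an error or a warning
outcome <- tryCatch(
  if (test_vec) "yes" else "no",
  error   = function(e) "error",
  warning = function(w) "warning"
)
```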

Task 4

How can we check row by row if a subject is male or female?

Solution 4

We could use a for()-loop that runs through all rows in nrj.

Task 5

Calculate the BMR using if() ... else ... and add it to the nrj data as a new variable BMR2.

You need a sequence from 1 to the number of rows of nrj.
You need to pre-specify (an empty version of) the new variable BMR2.
Don’t forget that you now need to use indices.

Solution 5

Our syntax needs to have the general form:

for (<"index"> in <"rows of nrj">) {
  
  if (<"subject is male">) {
    
    <"formula for males">
    
  } else {
    
    <"formula for females">
    
  }
}

Filled in, and with the result assigned to the pre-specified variable BMR2, this becomes:

nrj$BMR2 <- NA   # "empty" version of BMR2

# loop over all rows
for (i in 1:nrow(nrj)) {
  
  # test if the subject is male
  nrj$BMR2[i] <- if (nrj$sex[i] == "male") {
    
    # formula for males
    13.75 * nrj$weight[i] + 5 * nrj$height[i] - 6.76 * nrj$age[i] + 66
    
  } else {
    
    # formula for females
    9.56 * nrj$weight[i] + 1.85 * nrj$height[i] - 4.68 * nrj$age[i] + 655
    
  }
}
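
A side remark that goes slightly beyond the solution above: 1:nrow(nrj) works here because nrj has 100 rows, but for a data.frame without rows the sequence 1:0 counts down from 1 to 0, so the loop would run twice. The sketch below, using a hypothetical empty data.frame called empty, shows that seq_len() avoids this.

```r
# hypothetical data.frame with zero rows
empty <- data.frame(weight = numeric(0))

1:nrow(empty)          # counts down from 1 to 0
## [1] 1 0
seq_len(nrow(empty))   # empty sequence
## integer(0)

# a loop over seq_len() is simply skipped when there are no rows
n_iter <- 0
for (i in seq_len(nrow(empty))) {
  n_iter <- n_iter + 1
}
n_iter
## [1] 0
```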

Task 6

Check that BMR1 and BMR2 are the same.

Solution 6

There are multiple possible ways to check this:

all.equal(nrj$BMR1, nrj$BMR2)
## [1] TRUE
identical(nrj$BMR1, nrj$BMR2)
## [1] TRUE
table(nrj$BMR1 == nrj$BMR2)
## 
## TRUE 
##  100
summary(nrj$BMR1 - nrj$BMR2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0       0
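
These checks are not equally strict: identical() and == require exact equality, whereas all.equal() allows for small numerical differences. For BMR1 and BMR2 this makes no difference, since both are calculated with the same arithmetic, but the distinction matters in general. A minimal sketch with two made-up numbers x and y:

```r
x <- 0.1 + 0.2
y <- 0.3

x == y                    # exact comparison fails due to floating point rounding
## [1] FALSE
identical(x, y)
## [1] FALSE
isTRUE(all.equal(x, y))   # equal up to a small tolerance
## [1] TRUE
```

Wrapping all.equal() in isTRUE() is useful because all.equal() returns a description of the difference, not FALSE, when the objects differ.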

Data Summary

We would now like to get descriptive statistics of the nrj data. For continuous variables, we want the mean and standard deviation; for categorical variables, we want the proportion of subjects in each category.

Eventually, the aim is to write a function that will automatically do this for a dataset. The steps of how we can develop such a function are divided over the different Tasks in this section and the next section.

Task 1

Remember how to get summary measures?

  • Get the mean and standard deviation of the variable kcal.
  • Get the proportion of observations in each category for the variable sports.

You might use mean(), sd(), table() and prop.table().

Solution 1

mean(nrj$kcal)
## [1] 2219.93
sd(nrj$kcal)
## [1] 237.6322
prop.table(table(nrj$sports))
## 
##     never regularly sometimes 
##      0.33      0.36      0.31

Task 2

Use paste() to create some nice-looking output for these summaries:

  • “kcal: <mean> (<sd>)” for the variable kcal
  • “sports: <percentage>% <category 1>, <percentage>% <category 2>, …” for the variable sports

The function paste() has arguments sep and collapse, but you could also use paste0().
You might also want to use the function round() to reduce the number of digits in the output.

Solution 2

# for "kcal":
paste0("kcal: ", round(mean(nrj$kcal), 1), " (", round(sd(nrj$kcal), 1), ")")
## [1] "kcal: 2219.9 (237.6)"
# for "sports":

# table of proportions:
tab <- prop.table(table(nrj$sports))

# combine the proportions with category names:
props <- paste0(round(tab * 100, 1), "% ", names(tab), collapse = ", ")

# combine the variable name with the proportions string:
paste('sports:', props)
## [1] "sports: 33% never, 36% regularly, 31% sometimes"
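
The roles of sep and collapse can be seen in a small sketch. The vectors pct and cats are hypothetical stand-ins for the rounded percentages and the category names.

```r
pct  <- c(33, 36, 31)
cats <- c("never", "regularly", "sometimes")

# sep is placed between the elements that are pasted together;
# the result is still a vector with one string per category
per_cat <- paste(pct, cats, sep = "% ")
per_cat
## [1] "33% never"     "36% regularly" "31% sometimes"

# collapse additionally combines these strings into one single string
one_string <- paste(pct, cats, sep = "% ", collapse = ", ")
one_string
## [1] "33% never, 36% regularly, 31% sometimes"

# paste0() is paste() with sep = "", so the separator is given explicitly
identical(one_string, paste0(pct, "% ", cats, collapse = ", "))
## [1] TRUE
```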

Task 3

Write a loop that creates the summary strings we created in the previous solution for each variable in the nrj data.

Use print() to print each of the summaries.

You need a loop that iterates through the columns of nrj.
You need to automatically check for each variable whether it is a factor, to choose the correct summary type.

Solution 3

Our syntax has the general form:

for (<"index"> in <"columns of nrj">) {
  
  if (<"variable is factor">) {
    
    <"syntax for summary of a factor">
    
  } else {
    
    <"syntax for summary of a continuous variable">
    
  }
}

Filled in with the syntax from the previous solution:

# loop over all columns
for (i in 1:ncol(nrj)) {
  
  # test if column is a factor
  if (is.factor(nrj[, i])) {
    
    # syntax for categorical variable
    tab <- prop.table(table(nrj[, i]))
    props <- paste0(round(tab * 100, 1), "% ", names(tab), collapse = ", ")
    print(paste0(names(nrj)[i], ": ", props))
    
  } else {
    
    # syntax for continuous variable
    print(paste0(names(nrj)[i], ": ", 
                 round(mean(nrj[, i]), 1), " (", round(sd(nrj[, i]), 1), ")"))
    
  }
}
## [1] "sex: 58% female, 42% male"
## [1] "kcal: 2219.9 (237.6)"
## [1] "weight: 99.6 (31.4)"
## [1] "height: 171.5 (9.6)"
## [1] "age: 49.9 (10.5)"
## [1] "sports: 33% never, 36% regularly, 31% sometimes"
## [1] "BMR1: 1807.1 (390.2)"
## [1] "BMR2: 1807.1 (390.2)"

Writing your own function

Now we will put the different elements from the previous part together using functions.

Task 1

Write two functions:

  • one that creates the summary string for a continuous variable (the same as we created earlier in this practical, with the mean and sd),
  • one that creates the output (proportion of subjects per category; as we did before) for a categorical variable.

Assume that the input will be a vector, such as nrj$kcal or nrj$sports.
Note that when we work with these vectors, they do not contain the variable name any more, so we cannot use the variable name in the two functions.

The general form of a function is
<function name> <- function( <arguments> ) {
  <function body>
}
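
As an unrelated toy illustration of this general form (the function value_range() is hypothetical and not needed for the task), a function with one argument could look like this. The value of the last expression evaluated in the function body is what the function returns.

```r
# function name: value_range, argument: x
value_range <- function(x) {
  # function body: difference between the largest and smallest value
  max(x) - min(x)
}

value_range(c(3, 10, 7))
## [1] 7
```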

Solution 1

We can re-use the syntax we used before, and only remove the part that adds the variable name.

Our functions have one argument, which we call x here:

summary_continuous <- function(x) {
  paste0(round(mean(x), 1), " (", round(sd(x), 1), ")")
}

summary_continuous(nrj$kcal)
## [1] "2219.9 (237.6)"

summary_categorical <- function(x) {
  tab <- prop.table(table(x))
  paste0(round(tab * 100, 1), "% ", names(tab), collapse = ", ")
}

summary_categorical(nrj$sports)
## [1] "33% never, 36% regularly, 31% sometimes"

Task 2

Write another function that takes a data.frame as input and prints a summary string for each variable, using the two functions from the previous solution, i.e., so that summary_continuous() is used for continuous variables and summary_categorical() is used for categorical variables.

Solution 2

The general structure for our syntax is

<"function name"> <- function(<"data">) {

  for (<"index"> in <"columns of the data">) {
    
    if (<"variable is factor">) {
      
      <"function for summary of a factor">
        
    } else {
      
      <"function for summary of a continuous variable">
        
    }
  }
}

This function also has one argument which we call dat. We can, again, re-use syntax from above:

summary_df <- function(dat) {
  
  # loop over all columns
  for (i in 1:ncol(dat)) {
    
    # check if the column is a factor
    summary_string <- if (is.factor(dat[, i])) {
      
      # syntax for a categorical variable
      summary_categorical(dat[, i])
      
    } else {
      
      # syntax for a continuous variable
      summary_continuous(dat[, i])
    }
    
    # print the result of the summary
    print(summary_string)
  }
  
}

summary_df(dat = nrj)
## [1] "58% female, 42% male"
## [1] "2219.9 (237.6)"
## [1] "99.6 (31.4)"
## [1] "171.5 (9.6)"
## [1] "49.9 (10.5)"
## [1] "33% never, 36% regularly, 31% sometimes"
## [1] "1807.1 (390.2)"
## [1] "1807.1 (390.2)"

It would also be possible to use print() directly around the functions summary_categorical() and summary_continuous(). However, for the next Task it is more convenient to first collect the summary string in an object (summary_string), and do the further steps with that object.
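
Collecting the result in an object works because if() ... else ... returns the value of the branch that was evaluated, so the whole construct can be placed on the right-hand side of an assignment. A minimal sketch, in which x and label are hypothetical names:

```r
x <- factor(c("a", "b"))

# the value of the evaluated branch is assigned to label
label <- if (is.factor(x)) {
  "categorical"
} else {
  "continuous"
}

label
## [1] "categorical"
```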

Task 3

The function summary_df() that we created in the previous solution does not contain any variable names. Here we want to add those names to the output.

Modify the function so that the output strings look like the output of the loop in Task 3 of the previous section (with “variable name: …”).

Solution 3

To adjust the function, the only line that needs changing is the one in which we print the summary:

summary_df <- function(dat) {
  
  # loop over all columns
  for (i in 1:ncol(dat)) {
    
    # check if the column is a factor
    summary_string <- if (is.factor(dat[, i])) {
      
      # syntax for a categorical variable
      summary_categorical(dat[, i])
      
    } else {
      
      # syntax for a continuous variable
      summary_continuous(dat[, i])
    }
    
    # print the result of the summary together with the variable name
    print(paste0(names(dat)[i], ": ", summary_string))
  }
}

summary_df(dat = nrj)
## [1] "sex: 58% female, 42% male"
## [1] "kcal: 2219.9 (237.6)"
## [1] "weight: 99.6 (31.4)"
## [1] "height: 171.5 (9.6)"
## [1] "age: 49.9 (10.5)"
## [1] "sports: 33% never, 36% regularly, 31% sometimes"
## [1] "BMR1: 1807.1 (390.2)"
## [1] "BMR2: 1807.1 (390.2)"
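
Because summary_df() only relies on the column types and names, it should also work for other data.frames. The sketch below repeats the three functions from the solutions so that it runs on its own, and applies them to a small made-up data.frame toy (a hypothetical example, not part of the practical's data).

```r
summary_continuous <- function(x) {
  paste0(round(mean(x), 1), " (", round(sd(x), 1), ")")
}

summary_categorical <- function(x) {
  tab <- prop.table(table(x))
  paste0(round(tab * 100, 1), "% ", names(tab), collapse = ", ")
}

summary_df <- function(dat) {
  for (i in 1:ncol(dat)) {
    # choose the summary function depending on the column type
    summary_string <- if (is.factor(dat[, i])) {
      summary_categorical(dat[, i])
    } else {
      summary_continuous(dat[, i])
    }
    print(paste0(names(dat)[i], ": ", summary_string))
  }
}

# made-up data: one factor and one continuous variable
toy <- data.frame(group = factor(c("a", "b", "a", "a")),
                  value = c(1, 2, 3, 4))

summary_df(dat = toy)
## [1] "group: 75% a, 25% b"
## [1] "value: 2.5 (1.3)"
```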