A Custom Subset Function

Selection based on an interval

The aim of this practical is to write a function that selects a subset of a dataset based on whether the value of a particular variable is inside an interval or not.

Several tasks of this practical are quite technical and the aim here is to demonstrate the abstract way of thinking that is needed for some programming tasks.

Note that for several tasks of this practical the given solution is only one of many possible solutions. In some cases it may not be the most efficient or elegant solution, because the focus is on keeping the syntax simple and easier to understand.

Task 1

We want to write a function that creates a subset of a data.frame based on whether the values of a particular variable in that data.frame are within a certain interval or not.

How many arguments does our function need?
What type of objects should these arguments represent?

Solution 1

The function needs three arguments. We could call them

dat: a data.frame
variable: the name of the variable based on which the subset is created (a character string)
interval: a numeric vector of length two, with the lower and upper limit of the interval

Alternatively, we could use separate arguments for the lower and upper limit of the interval. We could also accept a vector of values instead of the name of a variable, so that the selection could be based on a variable that is not part of the data.frame.
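The alternative interface described above could look roughly like the following sketch. The function name subset_by_values and the argument names values, lower and upper are hypothetical and only serve as an illustration; they are not used in the rest of this practical.

```r
# Hypothetical alternative interface: separate limits and a vector of values
# instead of an interval vector and a variable name.
subset_by_values <- function(dat, values, lower, upper) {
  # 'values' needs one entry per row of 'dat'
  stopifnot(length(values) == nrow(dat))

  filter <- values > lower & values < upper
  subset(dat, subset = filter)
}

d <- data.frame(id = 1:5)
score <- c(10, 20, 30, 40, 50)   # not a column of d

res_alt <- subset_by_values(d, values = score, lower = 15, upper = 45)
res_alt
```

With this design the selection no longer depends on a column of dat, at the cost of the user having to make sure that values and dat belong together.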

Task 2

Now write the function.

It should create a logical filter variable that tells us for each row of the dataset dat if the value of variable is inside the interval or not.
Using this filter variable, create the subset and return it.
Create your own example data.frame to test the function.

To specify that two conditions need to be fulfilled you can use the & operator.
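As a small illustration of the & operator on a plain vector (the vector x is made up for this example), each element is checked against both conditions separately:

```r
x <- c(1, 4, 6, 9)

x > 2                     # first condition, one result per element
x < 8                     # second condition, one result per element

# TRUE only where both conditions are TRUE
both <- x > 2 & x < 8
both
```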

Solution 2

It is easiest to start with creating the example data because we can then use it to help us to try things as we go.

The test data needs at least one numeric variable (which we call a). We also create a second variable (b) to make the data look more like a normal dataset:

exdat <- data.frame(a = 1:10,
                    b = factor(sample(c('A', 'B'), size = 10, replace = TRUE))
)

exdat

##     a b
## 1   1 B
## 2   2 B
## 3   3 B
## 4   4 B
## 5   5 A
## 6   6 B
## 7   7 B
## 8   8 A
## 9   9 A
## 10 10 A
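Because b is created with sample(), its values will usually differ from the output shown here when you run the syntax yourself. If you want the same example data every time, one option is to set a seed first. This is a minimal sketch; the seed value 2024 is arbitrary and the object names exdat_a and exdat_b are only used for this illustration.

```r
# Setting the same seed before each call to sample() gives the same draw
set.seed(2024)   # arbitrary seed, chosen only for illustration
exdat_a <- data.frame(a = 1:10,
                      b = factor(sample(c('A', 'B'), size = 10, replace = TRUE)))

set.seed(2024)
exdat_b <- data.frame(a = 1:10,
                      b = factor(sample(c('A', 'B'), size = 10, replace = TRUE)))

identical(exdat_a, exdat_b)
```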

To create the filter variable, we need to check if the values of variable are inside interval, i.e., whether they are larger than the lower bound of interval and smaller than the upper bound of interval.

Because variable is the name of the variable (and not the vector of values), we need to use the corresponding column of dat.
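The difference between a column name stored in an object and the column itself can be seen in a small sketch. The objects d and col_name are made up for this illustration:

```r
d <- data.frame(a = 1:3, b = c("x", "y", "z"))
col_name <- "a"          # the *name* of the column, stored as a string

by_bracket <- d[, col_name]    # looks up the column whose name is stored in col_name
by_double  <- d[[col_name]]    # same result
by_dollar  <- d$col_name       # looks for a column literally called "col_name"

by_bracket
by_dollar
```

The $ operator does not evaluate col_name, so it is not suitable when the column name is passed as an argument.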

We can try the first step, creating the filter, with our example data:

intrvl <- c(2, 5)
exdat[, "a"] > intrvl[1] & exdat[, "a"] < intrvl[2]

Here, I use names for the data, interval and variable that are different from the names of the arguments in our function. This is on purpose. If we defined dat, variable and interval outside our function, they would be available in the global environment. If we then wrote a function that uses these arguments but made a mistake in the syntax, two things could happen:

we have a function that works only in this particular session but not in a new, clean session (because the function uses objects from the global environment, which are not available in the new session)
we may get errors or wrong results when the objects that we pass to the function differ from the objects in the global environment, because then they don’t match any more (e.g., we use a different example dataset that doesn’t have a variable "a", but the value of variable is still set to "a").
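The following sketch shows how such a mistake can go unnoticed. The function buggy_filter and the objects other and res_buggy are hypothetical and only serve to illustrate the problem:

```r
dat <- data.frame(a = 1:5)   # an object that happens to exist in the session

# Hypothetical function with a mistake: the argument is called 'data',
# but the body refers to 'dat'.
buggy_filter <- function(data, variable) {
  dat[, variable] > 2        # silently uses the 'dat' defined outside
}

other <- data.frame(a = 6:10)

# All values of other$a are larger than 2, but the result is based on 'dat'
res_buggy <- buggy_filter(other, "a")
res_buggy
```

No error is raised, so the wrong result is easy to miss. In a clean session without dat the same call would fail instead.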

To write a function that uses this filter we need to replace the names of the example data with the names of the arguments used in the function:

fun1 <- function(dat, variable, interval) {
  dat[, variable] > interval[1] & dat[, variable] < interval[2]
}

We can try this with our example data:

fun1(exdat, variable = "a", interval = c(3, 6))

##  [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

So far, the function only creates the filter, but doesn’t use this filter yet to make and return the subset. We still need to add this step to our function:

fun1 <- function(dat, variable, interval) {
  filter <- dat[, variable] > interval[1] & dat[, variable] < interval[2]
  
  subset(dat, subset = filter)
}

For interval = c(3, 7) the output should be those rows with a equal to 4, 5, or 6:

fun1(exdat, variable = "a", interval = c(3, 7))

##   a b
## 4 4 B
## 5 5 A
## 6 6 B

Task 3

Extend the function with an argument incl_boundaries that allows the user to specify whether the boundaries of the interval should be included in the subset that is returned.

Specify this argument so that by default the boundaries are always included.
Check your extended function using the example data.

You now need different filter variables for the case with and the case without the boundaries.

Solution 3

To include the boundaries by default, we set the default value of incl_boundaries to TRUE.

When the boundaries should be included, the comparison of the value of variable with the interval has to be >= and <= instead of > and <.

To implement this, we can use an if() ... else statement that uses the argument incl_boundaries as condition.

fun2 <- function(dat, variable, interval, incl_boundaries = TRUE) {
  
  # check if boundaries should be included
  filter <- if (incl_boundaries) {
    
    # syntax for subset including boundaries
    dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
    
  } else {
    
    # syntax for subset not including boundaries
    dat[, variable] > interval[1] & dat[, variable] < interval[2]
  }
  
  subset(dat, subset = filter)
}
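The assignment filter <- if (...) { ... } else { ... } in fun2() relies on the fact that in R an if() ... else statement returns the value of the branch that was evaluated. A minimal illustration, with made-up object names:

```r
incl <- TRUE

# The value of the evaluated branch is assigned to 'operators'
operators <- if (incl) {
  ">= and <="
} else {
  "> and <"
}

operators
```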

With the default setting to include the boundaries, fun2() should return the rows of exdat where a is 3, …, 6:

fun2(exdat, variable = "a", interval = c(3, 6))

##   a b
## 3 3 B
## 4 4 B
## 5 5 A
## 6 6 B

When we set incl_boundaries = FALSE the rows where a = 3 and a = 6 should be excluded:

fun2(exdat, variable = "a", interval = c(3, 6), incl_boundaries = FALSE)

##   a b
## 4 4 B
## 5 5 A

Inside or outside?

Task 1

We want to extend the function further with an additional argument outside that allows the user to choose whether cases with values of the variable inside or outside the specified interval should be selected.

Write this extended version of the function using another if() ... else statement.
Set the argument so that by default values inside the interval are selected.

To select cases that fulfill either one condition OR another condition, you can use the | operator.

Solution 1

We set the default value as outside = FALSE to return, by default, the values inside the specified interval.

To select the correct rows of the data, depending on the values of incl_boundaries and outside we need nested if() ... else statements, so that the general structure of the function should be:

fun3 <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  
  # check if boundaries should be included
  filter <- if (incl_boundaries) {
    
    # check if values outside the interval should be selected
    if (outside) {
      # syntax for subset outside the interval, including the boundaries
    } else {
      # syntax for subset inside the interval, including the boundaries
    }
    
  } else {
    
    # check if values outside the interval should be selected
    if (outside) {
      # syntax for subset outside the interval, not including the boundaries
    } else {
      # syntax for subset inside the interval, not including the boundaries
    }
  }
  
  subset(dat, subset = filter)
}

We now need to work out how the filter variable should be defined in the four different scenarios with respect to the inside or outside of the interval and whether the boundaries should be included or excluded.

Selecting values outside the interval means they should be either smaller than (or equal to) the lower bound or larger than (or equal to) the upper bound of the interval. This “or” is implemented as the | operator.

(The following syntax does not run on its own outside the function, because dat, variable and interval only exist as arguments inside the function.)

# syntax for subset outside the interval, including the boundaries
dat[, variable] <= interval[1] | dat[, variable] >= interval[2]

# syntax for subset outside the interval, not including the boundaries
dat[, variable] < interval[1] | dat[, variable] > interval[2]

The syntax for the other two scenarios (values inside the interval) is as before.

So, filling in the different pieces of syntax for the filters in the different scenarios, we get the function:

fun3 <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  
  # check if boundaries should be included
  filter <- if (incl_boundaries) {
    
    # check if values outside the interval should be selected
    if (outside) {
      # syntax for subset outside the interval, including the boundaries
      dat[, variable] <= interval[1] | dat[, variable] >= interval[2]
    } else {
      # syntax for subset inside the interval, including the boundaries
      dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
    }
    
  } else {
    
    # check if values outside the interval should be selected
    if (outside) {
      # syntax for subset outside the interval, not including the boundaries
      dat[, variable] < interval[1] | dat[, variable] > interval[2]
    } else {
      # syntax for subset inside the interval, not including the boundaries
      dat[, variable] > interval[1] & dat[, variable] < interval[2]
    }
  }
  
  subset(dat, subset = filter)
}

We test all four scenarios (different combinations of incl_boundaries and outside) with interval = c(3, 7):

incl_boundaries = TRUE, outside = FALSE should include values 3, …, 7:

fun3(exdat, variable = "a", interval = c(3, 7), incl_boundaries = TRUE, outside = FALSE)

##   a b
## 3 3 B
## 4 4 B
## 5 5 A
## 6 6 B
## 7 7 B

incl_boundaries = TRUE, outside = TRUE should include values 1, 2, 3, and 7, …, 10:

fun3(exdat, variable = "a", interval = c(3, 7), incl_boundaries = TRUE, outside = TRUE)

##     a b
## 1   1 B
## 2   2 B
## 3   3 B
## 7   7 B
## 8   8 A
## 9   9 A
## 10 10 A

incl_boundaries = FALSE, outside = FALSE should include values 4, 5, 6:

fun3(exdat, variable = "a", interval = c(3, 7), incl_boundaries = FALSE, outside = FALSE)

##   a b
## 4 4 B
## 5 5 A
## 6 6 B

incl_boundaries = FALSE, outside = TRUE should include values 1, 2 and 8, 9, 10:

fun3(exdat, variable = "a", interval = c(3, 7), incl_boundaries = FALSE, outside = TRUE)

##     a b
## 1   1 B
## 2   2 B
## 8   8 A
## 9   9 A
## 10 10 A

Task 2

In fun3(), many of the lines of syntax are almost identical.

Re-write the function to make better use of the fact that a logical value can be reversed, i.e., you can use the filter to specify which cases should be selected or which cases should be excluded.

The idea is that we could use filter when we want to select cases with values inside the interval and !filter when we want to select cases outside the interval.

It is helpful to make a table of the combinations of options for incl_boundaries and outside and then think about which version of the syntax is needed:

version 1) dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
version 2) dat[, variable] > interval[1] & dat[, variable] < interval[2]

Solution 2

Our first idea might be to try this function:

fun4_false <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  
  filter <- if (incl_boundaries) {
    # inside interval, including boundaries
    dat[, variable] >= interval[1] & dat[, variable] <= interval[2]  # version 1
  } else {
    # inside interval, excluding boundaries
    dat[, variable] > interval[1] & dat[, variable] < interval[2]    # version 2
  }
  
  if (outside) 
    filter <- !filter
  
  subset(dat, filter)
}

Depending on the incl_boundaries argument, the filter selects either the inside of the interval, or the inside AND the boundaries.
Depending on the argument outside, we then use filter directly (when we want to return the inside of the interval) or negate it (i.e., turn TRUE into FALSE and vice versa) when outside = TRUE.

The problem is that by negating the filter, we also “negate” the in- or exclusion of the boundary values:

when filter includes the boundary values, they are excluded in !filter
when filter excludes the boundary values, they are included in !filter
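This effect can be checked on a plain vector. The sketch below uses the values 1, …, 10 and the interval from 3 to 7; the object names are only used for this illustration:

```r
x <- 1:10

inside_incl <- x >= 3 & x <= 7   # inside, boundaries included
inside_excl <- x > 3 & x < 7     # inside, boundaries excluded

# Negating the filter that includes the boundaries drops 3 and 7
outside_from_incl <- x[!inside_incl]
outside_from_incl

# Negating the filter that excludes the boundaries keeps 3 and 7
outside_from_excl <- x[!inside_excl]
outside_from_excl
```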

We somehow need to consider the two arguments incl_boundaries and outside together. This is where the table comes in handy:

incl_boundaries	outside	version
TRUE	TRUE
FALSE	TRUE
TRUE	FALSE
FALSE	FALSE

We can then fill in the table by looking at each scenario:

If we want the outside of the interval including the boundaries, we need to exclude the inside of the interval without the boundaries => version 2
If we want the outside of the interval excluding the boundaries, we need to exclude the inside of the interval including the boundaries => version 1
If we want the inside of the interval including the boundaries, we select exactly that => version 1
If we want the inside of the interval excluding the boundaries, we select exactly that => version 2

The filled-in version of the table then is:

incl_boundaries	outside	version
TRUE	TRUE	2
FALSE	TRUE	1
TRUE	FALSE	1
FALSE	FALSE	2

We note that we need to choose version 1 when incl_boundaries is different from outside and version 2 when incl_boundaries and outside have the same value.
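The rule can be checked against the table with a few lines of R. This is only a sketch to verify the reasoning; the object combos is made up for this purpose. For logical values, the comparison != gives the same result as the base R function xor().

```r
# All four combinations, in the same order as in the table above
combos <- expand.grid(incl_boundaries = c(TRUE, FALSE),
                      outside = c(TRUE, FALSE))

# version 1 when the two arguments differ, version 2 when they are equal
combos$version <- ifelse(combos$incl_boundaries != combos$outside, 1, 2)
combos

# '!=' on logical values is the same as an "exclusive or"
same_as_xor <- all(xor(combos$incl_boundaries, combos$outside) ==
                     (combos$incl_boundaries != combos$outside))
same_as_xor
```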

With this we can fix our function, which now is a lot shorter than fun3:

fun4_correct <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  
  # check if "incl_boundaries" is different from "outside"
  filter <- if (incl_boundaries != outside) {
    # values inside the interval, including boundaries
    dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
  } else {
    # values inside the interval, excluding boundaries
    dat[, variable] > interval[1] & dat[, variable] < interval[2]
  }
  
  # check if values outside the interval should be returned
  if (outside) {
    # invert the filter variable
    filter <- !filter
  }
  
  subset(dat, filter)
}

We repeat the same set of tests as before, with interval = c(3, 7):

incl_boundaries = TRUE, outside = FALSE should include values 3, …, 7:

fun4_correct(exdat, variable = "a", interval = c(3, 7), incl_boundaries = TRUE, outside = FALSE)

##   a b
## 3 3 B
## 4 4 B
## 5 5 A
## 6 6 B
## 7 7 B

incl_boundaries = TRUE, outside = TRUE should include values 1, 2, 3, and 7, …, 10:

fun4_correct(exdat, variable = "a", interval = c(3, 7), incl_boundaries = TRUE, outside = TRUE)

##     a b
## 1   1 B
## 2   2 B
## 3   3 B
## 7   7 B
## 8   8 A
## 9   9 A
## 10 10 A

incl_boundaries = FALSE, outside = FALSE should include values 4, 5, 6:

fun4_correct(exdat, variable = "a", interval = c(3, 7), incl_boundaries = FALSE, outside = FALSE)

##   a b
## 4 4 B
## 5 5 A
## 6 6 B

incl_boundaries = FALSE, outside = TRUE should include values 1, 2 and 8, 9, 10:

fun4_correct(exdat, variable = "a", interval = c(3, 7), incl_boundaries = FALSE, outside = TRUE)

##     a b
## 1   1 B
## 2   2 B
## 8   8 A
## 9   9 A
## 10 10 A

If you didn’t figure this one out by yourself, don’t worry. This really was a tough one!

Check for categorical variables

Task 1

What happens when we call fun4_correct() but specify variable to be a factor?

Try to figure this out by looking at the syntax. Then check it using the example data.

Solution 1

In the definition of the filter variable, we compare dat[, variable] with the lower or upper bound of the interval.

When we compare a factor variable (such as b in exdat) with a numeric value we get NA and a warning message because the comparison is not meaningful:

exdat$b < 3

## Warning in Ops.factor(exdat$b, 3): '<' not meaningful for factors

##  [1] NA NA NA NA NA NA NA NA NA NA

The filter variable will therefore not contain a vector of TRUE and FALSE but a vector of NA values.

The function subset() selects only those rows of the data for which the vector passed to its argument subset is TRUE and excludes rows for which it is FALSE or missing.

This means that fun4_correct will return a data.frame with zero rows.

We can easily check that:

fun4_correct(exdat, variable = "b", interval = c(3, 7))

## Warning in Ops.factor(dat[, variable], interval[1]): '>=' not meaningful for factors

## Warning in Ops.factor(dat[, variable], interval[2]): '<=' not meaningful for factors

## [1] a b
## <0 rows> (or 0-length row.names)
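How subset() treats missing values in the filter, as described above, can also be seen in isolation. The following sketch compares it with selecting rows using square brackets, which handles NA differently. The objects d and keep are made up for this illustration:

```r
d <- data.frame(x = 1:3)
keep <- c(TRUE, NA, FALSE)

# subset() drops rows for which the filter is FALSE or NA
res_subset <- subset(d, subset = keep)

# square brackets return an additional row of NA for the missing filter value
res_bracket <- d[keep, , drop = FALSE]

nrow(res_subset)
nrow(res_bracket)
```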

Task 2

Extend fun4_correct() so that it first checks whether the selected variable is numeric. When a non-numeric variable is selected, the function should not try to produce a subset but instead print a message.

You can use the function is.numeric() to check this.

You can use print(), message() or warning() to print the message.
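To get a feeling for what is.numeric() returns for different types of objects, you could try a few small examples such as these (the object names are only used for this illustration):

```r
num_int    <- is.numeric(1:3)                   # integers count as numeric
num_double <- is.numeric(2.5)
num_factor <- is.numeric(factor(c("A", "B")))   # factors are not numeric
num_char   <- is.numeric("3")                   # a number stored as text is not numeric
num_logic  <- is.numeric(TRUE)

c(num_int, num_double, num_factor, num_char, num_logic)
```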

Solution 2

We add another if() ... else statement to fun4_correct() to implement this additional check, i.e., the structure of the function is

fun5 <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  
  if (<"variable is not numeric">) {
    
    # warning message
    
  } else { 
    
    # same syntax as in "fun4_correct()"
  
  }
}

With the functions that we have seen so far, one solution would be:

fun5 <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  
  # check if the variable is not numeric
  if (!is.numeric(dat[, variable])) {
    print("The variable you selected is not numeric!")
    
  } else {
    
    # same syntax as before
    
    filter <- if (incl_boundaries != outside) {
      dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
    } else {
      dat[, variable] > interval[1] & dat[, variable] < interval[2]
    }
    
    if (outside) {
      filter <- !filter
    }
    
    subset(dat, filter)
  }
}

fun5(exdat, variable = "b", interval = c(3, 7))

## [1] "The variable you selected is not numeric!"

Typically we would not just print a message. Instead, we would either issue a warning with warning(), which lets the function continue, or throw an error with stop(), which immediately stops the execution of the function:

fun5b <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  
  if (!is.numeric(dat[, variable])) {
    stop("The variable you selected is not numeric!", call. = FALSE)
  }
  
  filter <- if (incl_boundaries != outside) {
    dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
  } else {
    dat[, variable] > interval[1] & dat[, variable] < interval[2]
  }
  
  if (outside) {
    filter <- !filter
  }
  
  subset(dat, filter)
}

fun5b(exdat, variable = "b", interval = c(3, 7))

## Error: The variable you selected is not numeric!

When we use stop() we do not need the else part of the if() statement because the function will be stopped immediately and the rest of the syntax will not be evaluated.
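The difference between warning() and stop() can be seen in a small sketch. The functions f_warn and f_stop are hypothetical and only serve as an illustration. Here, suppressWarnings() and tryCatch() are used so that the example runs through without interruption.

```r
f_warn <- function(x) {
  if (!is.numeric(x)) {
    warning("x is not numeric")
  }
  "reached the end"      # still evaluated after the warning
}

f_stop <- function(x) {
  if (!is.numeric(x)) {
    stop("x is not numeric", call. = FALSE)
  }
  "reached the end"      # never reached when x is not numeric
}

# the warning does not prevent the function from returning a value
res_warn <- suppressWarnings(f_warn("a"))
res_warn

# the error stops the function; here we catch it and keep the message
res_stop <- tryCatch(f_stop("a"), error = function(e) conditionMessage(e))
res_stop
```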

Biostatistics II: Introduction to R
