The aim of this practical is to write a function that selects a subset of a dataset based on whether the value of a particular variable is inside an interval or not.
Several tasks of this practical are quite technical and the aim here is to demonstrate the abstract way of thinking that is needed for some programming tasks.
Note that for several tasks of this practical the given solution is only one of many possible solutions. In some cases it may not be the most efficient or elegant solution, because the focus is on keeping the syntax simple and easier to understand.
We want to write a function that creates a subset of a
data.frame based on whether the values of a particular
variable in that data.frame are within a certain interval or
not.
The function needs three arguments. We could call them

- dat: a data.frame
- variable: the name of the variable based on which the subset is created (a character string)
- interval: a numeric vector of length two, with the lower and upper limit of the interval

Alternatively, we could of course also use separate arguments for the
lower and upper limit of the interval, and we could consider accepting a
vector of values instead of the name of a variable, so that it
would be possible to do the selection based on a variable that is not
part of the data.frame.
Now write the function.

- Create a logical filter variable that tells us for each row of the dataset dat if the value of variable is inside the interval or not.
- Create an example data.frame to test the function.
- You may need the & operator.
It is easiest to start with creating the example data because we can then use it to help us to try things as we go.
The test data has to have at least one numeric variable (which we
call a here), but here we will also create a second
variable (b) to make the data look more like a normal
dataset:
exdat <- data.frame(a = 1:10,
                    b = factor(sample(c('A', 'B'), size = 10, replace = TRUE))
)
exdat
## a b
## 1 1 B
## 2 2 B
## 3 3 B
## 4 4 B
## 5 5 A
## 6 6 B
## 7 7 B
## 8 8 A
## 9 9 A
## 10 10 A
To create the filter variable, we need to check if the values of
variable are inside interval, i.e., whether
they are larger than the lower bound of interval and
smaller than the upper bound of interval.
Because variable is the name of the
variable (and not the vector of values), we need to use the
corresponding column of dat.
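As a small illustration of selecting a column by a name that is stored in another object, consider the sketch below. The objects demo_dat and varname are hypothetical names used only for this example; they are not part of the practical.

```r
# Hypothetical example data and a column name stored as a character string
demo_dat <- data.frame(a = 1:5, b = c("u", "v", "w", "x", "y"))
varname <- "a"

# Both of these return the values of column "a"
by_bracket <- demo_dat[, varname]
by_double_bracket <- demo_dat[[varname]]

# `$` does not evaluate varname; it looks for a column literally
# called "varname", which does not exist, so the result is NULL
by_dollar <- demo_dat$varname

by_bracket
by_dollar
```

This is why the function has to use dat[, variable] (or dat[[variable]]) rather than dat$variable.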
We can try the first step, creating the filter, with our example data:
intrvl <- c(2, 5)
exdat[, "a"] > intrvl[1] & exdat[, "a"] < intrvl[2]

Here, I use names for the data, interval and variable that are
different from the names of the arguments in our function. This is on
purpose. If we specified dat, variable and interval outside our function, they would be available in the
global environment. If we then wrote a function that uses these
arguments but made a mistake in the syntax, the mistake could go unnoticed: R looks up names that it cannot find inside the function in the global environment, so the function may seem to work even though it does not use its arguments correctly (for example, because the value of variable is still set to "a").

To write a function that uses this filter we need to replace the names of the example data with the names of the arguments used in the function:
fun1 <- function(dat, variable, interval) {
  dat[, variable] > interval[1] & dat[, variable] < interval[2]
}

We can try this with our example data:

fun1(exdat, variable = "a", interval = c(3, 6))
## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
So far, the function only creates the filter, but doesn’t use this filter yet to make and return the subset. We still need to add this step to our function:
fun1 <- function(dat, variable, interval) {
  filter <- dat[, variable] > interval[1] & dat[, variable] < interval[2]
  subset(dat, subset = filter)
}

For interval = c(3, 7) the output should be those rows
with a equal to 4, 5, or 6:

fun1(exdat, variable = "a", interval = c(3, 7))
## a b
## 4 4 B
## 5 5 A
## 6 6 B
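The earlier remark about objects in the global environment can be illustrated with a deliberately faulty function. This is only a sketch: buggy_filter() and demo_dat are hypothetical names, and the mistake (using intrvl instead of the argument interval) is made on purpose.

```r
demo_dat <- data.frame(a = 1:10)

# Deliberate mistake: the body uses `intrvl`, which is not an argument
buggy_filter <- function(dat, variable, interval) {
  dat[, variable] > intrvl[1] & dat[, variable] < intrvl[2]
}

# A leftover object from earlier experiments
intrvl <- c(2, 5)

# The call runs without error, but silently ignores interval = c(3, 7)
res <- buggy_filter(demo_dat, variable = "a", interval = c(3, 7))
which(res)
```

The result marks rows 3 and 4 (the interval 2 to 5) instead of rows 4, 5 and 6. If intrvl did not exist outside the function, the same call would stop with an "object not found" error, which makes the mistake much easier to spot.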
Extend the function with an argument incl_boundaries
that allows the user to specify whether the boundaries of the
interval should be included in the subset that is
returned.
To set the default value for incl_boundaries so that the
boundaries are included we set it to TRUE.
When the boundaries should be included, the comparison of the value
of variable with the interval has to be
>= and <= instead of >
and <.
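The difference between the two sets of operators is easiest to see on a plain vector. The names x, lower and upper below are chosen only for this illustration.

```r
x <- 1:10
lower <- 3
upper <- 6

# Strict comparison: the boundaries 3 and 6 are dropped
excl <- x[x > lower & x < upper]

# Non-strict comparison: the boundaries 3 and 6 are kept
incl <- x[x >= lower & x <= upper]

excl
incl
```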
To implement this, we can use an if() ... else statement
that uses the argument incl_boundaries as condition.
fun2 <- function(dat, variable, interval, incl_boundaries = TRUE) {
  # check if boundaries should be included
  filter <- if (incl_boundaries) {
    # syntax for subset including boundaries
    dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
  } else {
    # syntax for subset not including boundaries
    dat[, variable] > interval[1] & dat[, variable] < interval[2]
  }
  subset(dat, subset = filter)
}

With the default setting to include the boundaries,
fun2() should return the rows of exdat where
a is 3, …, 6:

fun2(exdat, variable = "a", interval = c(3, 6))
## a b
## 3 3 B
## 4 4 B
## 5 5 A
## 6 6 B
When we set incl_boundaries = FALSE the rows where
a = 3 and a = 6 should be excluded:
fun2(exdat, variable = "a", interval = c(3, 6), incl_boundaries = FALSE)
## a b
## 4 4 B
## 5 5 A
We want to extend the function further with an additional argument
outside that allows the user to choose whether cases with
values of the variable inside or outside the specified
interval should be selected.
- You will need another if() ... else statement.
- You may need the | operator.
We set the default value as outside = FALSE to return,
by default, the values inside the specified interval.
To select the correct rows of the data, depending on the values of
incl_boundaries and outside we need nested
if() ... else statements, so that the general structure of
the function should be:
fun3 <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  # check if boundaries should be included
  filter <- if (incl_boundaries) {
    # check if values outside the interval should be selected
    if (outside) {
      # syntax for subset outside the interval, including the boundaries
    } else {
      # syntax for subset inside the interval, including the boundaries
    }
  } else {
    # check if values outside the interval should be selected
    if (outside) {
      # syntax for subset outside the interval, not including the boundaries
    } else {
      # syntax for subset inside the interval, not including the boundaries
    }
  }
  subset(dat, subset = filter)
}

We now need to work out how the filter variable should be defined in the four different scenarios with respect to the inside or outside of the interval and whether the boundaries should be included or excluded.
Selecting values outside the interval means they should
be either smaller than (or equal to) the lower bound or larger than (or
equal to) the upper bound of the interval. This “or” is
implemented as the | operator.
(The following syntax does not run when we use it outside the function.)
# syntax for subset outside the interval, including the boundaries
dat[, variable] <= interval[1] | dat[, variable] >= interval[2]
# syntax for subset outside the interval, not including the boundaries
dat[, variable] < interval[1] | dat[, variable] > interval[2]

The syntax for the other two scenarios (values inside the interval) is as before.
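A version of the two "outside" filters that does run on its own can be sketched with a plain vector instead of dat[, variable]. The names x and limits are used only for this illustration.

```r
x <- 1:10
limits <- c(3, 7)

# Outside the interval, including the boundaries
outside_incl <- x[x <= limits[1] | x >= limits[2]]

# Outside the interval, not including the boundaries
outside_excl <- x[x < limits[1] | x > limits[2]]

outside_incl
outside_excl
```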
So, filling in the different pieces of syntax for the filters in the different scenarios, we get the function:
fun3 <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  # check if boundaries should be included
  filter <- if (incl_boundaries) {
    # check if values outside the interval should be selected
    if (outside) {
      # syntax for subset outside the interval, including the boundaries
      dat[, variable] <= interval[1] | dat[, variable] >= interval[2]
    } else {
      # syntax for subset inside the interval, including the boundaries
      dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
    }
  } else {
    # check if values outside the interval should be selected
    if (outside) {
      # syntax for subset outside the interval, not including the boundaries
      dat[, variable] < interval[1] | dat[, variable] > interval[2]
    } else {
      # syntax for subset inside the interval, not including the boundaries
      dat[, variable] > interval[1] & dat[, variable] < interval[2]
    }
  }
  subset(dat, subset = filter)
}

We test all four scenarios (different combinations of
incl_boundaries and outside) with
interval = c(3, 7):
incl_boundaries = TRUE, outside = FALSE should include
values 3, …, 7:
fun3(exdat, variable = "a", interval = c(3, 7), incl_boundaries = TRUE, outside = FALSE)
## a b
## 3 3 B
## 4 4 B
## 5 5 A
## 6 6 B
## 7 7 B
incl_boundaries = TRUE, outside = TRUE should include
values 1, 2, 3, and 7, …, 10:
fun3(exdat, variable = "a", interval = c(3, 7), incl_boundaries = TRUE, outside = TRUE)
## a b
## 1 1 B
## 2 2 B
## 3 3 B
## 7 7 B
## 8 8 A
## 9 9 A
## 10 10 A
incl_boundaries = FALSE, outside = FALSE should include
values 4, 5, 6:
fun3(exdat, variable = "a", interval = c(3, 7), incl_boundaries = FALSE, outside = FALSE)
## a b
## 4 4 B
## 5 5 A
## 6 6 B
incl_boundaries = FALSE, outside = TRUE should include
values 1, 2 and 8, 9, 10:
fun3(exdat, variable = "a", interval = c(3, 7), incl_boundaries = FALSE, outside = TRUE)
## a b
## 1 1 B
## 2 2 B
## 8 8 A
## 9 9 A
## 10 10 A
In fun3(), many of the lines of syntax are almost
identical.
Re-write the function to make better use of the fact that a
logical value can be reversed (i.e., you can use the
filter to specify which cases should be selected or which
cases should be excluded.)
- We can use filter when we want to select cases with values inside the interval and !filter when we want to select cases outside the interval.
- Write down a table of all combinations of incl_boundaries and outside and then think about which version of the syntax is needed:

dat[, variable] >= interval[1] & dat[, variable] <= interval[2]   # version 1
dat[, variable] > interval[1] & dat[, variable] < interval[2]     # version 2
Our first idea might be to try this function:
fun4_false <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  filter <- if (incl_boundaries) {
    # inside interval, including boundaries
    dat[, variable] >= interval[1] & dat[, variable] <= interval[2] # version 1
  } else {
    # inside interval, excluding boundaries
    dat[, variable] > interval[1] & dat[, variable] < interval[2] # version 2
  }
  if (outside)
    filter <- !filter
  subset(dat, filter)
}

- Depending on the incl_boundaries argument, the filter selects either the inside of the interval, or the inside AND the boundaries.
- Depending on outside, we then use filter directly (when we want to return the inside of the interval) or negate it (i.e., turn TRUE into FALSE and vice versa) when outside = TRUE.

The problem is that by negating the filter, we also “negate” the in- or exclusion of the boundary values:

- When filter includes the boundary values, they are excluded in !filter.
- When filter excludes the boundary values, they are included in !filter.

We therefore need to consider incl_boundaries and outside together. This is where the table comes in
handy:
| incl_boundaries | outside | version |
|---|---|---|
| TRUE | TRUE | |
| FALSE | TRUE | |
| TRUE | FALSE | |
| FALSE | FALSE | |
We can then fill in the table by looking at each scenario:
| incl_boundaries | outside | version |
|---|---|---|
| TRUE | TRUE | 2 |
| FALSE | TRUE | 1 |
| TRUE | FALSE | 1 |
| FALSE | FALSE | 2 |
We note that we need to choose version 1 when
incl_boundaries is different from outside and
version 2 when incl_boundaries and outside have
the same value.
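The reasoning behind the table can be checked with a short sketch on a plain vector. The names x, lo, hi and scenarios are chosen for this illustration only. The first part confirms that negating one "inside" version gives the "outside" filter with the opposite treatment of the boundaries; the second part rebuilds the version column from the rule "version 1 when the two arguments differ".

```r
x <- 1:10
lo <- 3
hi <- 7

inside_incl <- x >= lo & x <= hi   # version 1
inside_excl <- x > lo & x < hi     # version 2

# Negating version 1 gives "outside, excluding the boundaries"
neg_v1_ok <- identical(!inside_incl, x < lo | x > hi)
# Negating version 2 gives "outside, including the boundaries"
neg_v2_ok <- identical(!inside_excl, x <= lo | x >= hi)

# All four combinations, in the same row order as the table
scenarios <- expand.grid(incl_boundaries = c(TRUE, FALSE),
                         outside = c(TRUE, FALSE))
scenarios$version <- ifelse(scenarios$incl_boundaries != scenarios$outside, 1, 2)
scenarios
```

For logical values, the comparison incl_boundaries != outside gives the same result as the base R function xor(incl_boundaries, outside).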
With this we can fix our function, which now is a lot shorter than
fun3:
fun4_correct <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  # check if "incl_boundaries" is different from "outside"
  filter <- if (incl_boundaries != outside) {
    # values inside the interval, including boundaries
    dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
  } else {
    # values inside the interval, excluding boundaries
    dat[, variable] > interval[1] & dat[, variable] < interval[2]
  }
  # check if values outside the interval should be returned
  if (outside) {
    # invert the filter variable
    filter <- !filter
  }
  subset(dat, filter)
}

We repeat the same set of tests as before, with
interval = c(3, 7):
incl_boundaries = TRUE, outside = FALSE should include
values 3, …, 7:
fun4_correct(exdat, variable = "a", interval = c(3, 7), incl_boundaries = TRUE, outside = FALSE)
## a b
## 3 3 B
## 4 4 B
## 5 5 A
## 6 6 B
## 7 7 B
incl_boundaries = TRUE, outside = TRUE should include
values 1, 2, 3, and 7, …, 10:
fun4_correct(exdat, variable = "a", interval = c(3, 7), incl_boundaries = TRUE, outside = TRUE)
## a b
## 1 1 B
## 2 2 B
## 3 3 B
## 7 7 B
## 8 8 A
## 9 9 A
## 10 10 A
incl_boundaries = FALSE, outside = FALSE should include
values 4, 5, 6:
fun4_correct(exdat, variable = "a", interval = c(3, 7), incl_boundaries = FALSE, outside = FALSE)
## a b
## 4 4 B
## 5 5 A
## 6 6 B
incl_boundaries = FALSE, outside = TRUE should include
values 1, 2 and 8, 9, 10:
fun4_correct(exdat, variable = "a", interval = c(3, 7), incl_boundaries = FALSE, outside = TRUE)
## a b
## 1 1 B
## 2 2 B
## 8 8 A
## 9 9 A
## 10 10 A
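Instead of comparing the printed output by eye, the four expectations could also be checked automatically. The sketch below is one possible way to do this with stopifnot(); select_interval() is a hypothetical stand-alone copy of the logic of fun4_correct() and check_dat is a hypothetical test dataset, both defined here only so that the example runs on its own.

```r
# Hypothetical copy of the logic used in fun4_correct()
select_interval <- function(dat, variable, interval,
                            incl_boundaries = TRUE, outside = FALSE) {
  x <- dat[, variable]
  filter <- if (incl_boundaries != outside) {
    x >= interval[1] & x <= interval[2]
  } else {
    x > interval[1] & x < interval[2]
  }
  if (outside) {
    filter <- !filter
  }
  subset(dat, filter)
}

check_dat <- data.frame(a = 1:10)

# Values of a returned in each of the four scenarios
in_incl  <- select_interval(check_dat, "a", c(3, 7), TRUE,  FALSE)$a
out_incl <- select_interval(check_dat, "a", c(3, 7), TRUE,  TRUE)$a
in_excl  <- select_interval(check_dat, "a", c(3, 7), FALSE, FALSE)$a
out_excl <- select_interval(check_dat, "a", c(3, 7), FALSE, TRUE)$a

# stopifnot() is silent when all conditions hold and stops with an error otherwise
stopifnot(
  identical(in_incl, 3:7),
  identical(out_incl, c(1:3, 7:10)),
  identical(in_excl, 4:6),
  identical(out_excl, c(1:2, 8:10))
)
```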
If you didn’t figure this one out by yourself, don’t worry. This really was a tough one!
What happens when we call fun4_correct() but specify
variable to be a factor?
Try to figure this out by looking at the syntax. Then check it using the example data.
In the definition of the filter variable, we compare
dat[, variable] with the lower or upper bound of the
interval.
When we compare a factor variable (such as b in
exdat) with a numeric value we get NA and a
warning message because the comparison is not meaningful:
exdat$b < 3
## Warning in Ops.factor(exdat$b, 3): '<' not meaningful for factors
## [1] NA NA NA NA NA NA NA NA NA NA
The filter variable will therefore not contain a vector
of TRUE and FALSE but a vector of
NA values.
The function subset() selects only those rows of the
data for which the vector passed to its argument subset is
TRUE and excludes rows for which it is FALSE
or missing.
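How subset() treats missing values in the filter can be seen in a small sketch, here contrasted with plain [ indexing, which behaves differently. The objects na_dat and keep are hypothetical and used only for this illustration.

```r
na_dat <- data.frame(a = 1:3)
keep <- c(TRUE, NA, FALSE)

# subset() drops rows where the filter is FALSE or NA
kept_subset <- subset(na_dat, keep)

# Indexing with [ keeps a row of NA values for each NA in the filter
kept_bracket <- na_dat[keep, , drop = FALSE]

nrow(kept_subset)
nrow(kept_bracket)
```

With subset() only the first row remains; with [ the result has two rows, the second of which contains only NA.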
This means that fun4_correct will return a
data.frame with zero rows.
We can easily check that:
fun4_correct(exdat, variable = "b", interval = c(3, 7))
## Warning in Ops.factor(dat[, variable], interval[1]): '>=' not meaningful for factors
## Warning in Ops.factor(dat[, variable], interval[2]): '<=' not meaningful for factors
## [1] a b
## <0 rows> (or 0-length row.names)
Add a check to fun4_correct that first checks if
variable is of type numeric. When a
non-numeric variable is selected, the function should not try to produce
a subset but instead print a message.
- Use is.numeric() to check this.
- Use print(), message() or warning() to print the message.
We add another if() ... else statement to
fun4_correct() to implement this additional check, i.e.,
the structure of the function is
fun5 <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  if (<"variable is not numeric">) {
    # warning message
  } else {
    # same syntax as in "fun4_correct()"
  }
}

With the functions that we have seen so far, one solution would be:
fun5 <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  # check if the variable is not numeric
  if (!is.numeric(dat[, variable])) {
    print("The variable you selected is not numeric!")
  } else {
    # same syntax as before
    filter <- if (incl_boundaries != outside) {
      dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
    } else {
      dat[, variable] > interval[1] & dat[, variable] < interval[2]
    }
    if (outside) {
      filter <- !filter
    }
    subset(dat, filter)
  }
}

fun5(exdat, variable = "b", interval = c(3, 7))
## [1] "The variable you selected is not numeric!"
Typically we would not just want to print a message but either raise
a warning (using warning()) or an error (using stop()). An error
immediately stops the execution of the function:
fun5b <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  if (!is.numeric(dat[, variable])) {
    stop("The variable you selected is not numeric!", call. = FALSE)
  }
  filter <- if (incl_boundaries != outside) {
    dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
  } else {
    dat[, variable] > interval[1] & dat[, variable] < interval[2]
  }
  if (outside) {
    filter <- !filter
  }
  subset(dat, filter)
}

fun5b(exdat, variable = "b", interval = c(3, 7))
## Error: The variable you selected is not numeric!
When we use stop() we do not need the else
part of the if() statement because the function will be
stopped immediately and the rest of the syntax will not be
evaluated.
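Because stop() raises an error, code that calls such a function can catch the error and react to it, for example with tryCatch(). The sketch below assumes a hypothetical helper check_numeric() that contains only the check from fun5b(); it is not part of the practical.

```r
# Hypothetical helper containing only the check used in fun5b()
check_numeric <- function(x) {
  if (!is.numeric(x)) {
    stop("The variable you selected is not numeric!", call. = FALSE)
  }
  # no "else" needed: this line is only reached for numeric input
  invisible(TRUE)
}

# Catch the error and keep its message instead of stopping the script
msg <- tryCatch(
  check_numeric(factor(c("A", "B"))),
  error = function(e) conditionMessage(e)
)
msg
```

Here msg contains the error message as a character string, while check_numeric(1:3) passes the check and returns TRUE invisibly.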