The aim of this practical is to write a function that selects a subset of a dataset based on whether a particular variable is inside an interval.
Note:
When writing functions, there are usually many ways that lead to the same solution. The solutions given here are suggestions, but there are many other ways the functions could be written.
The function needs three arguments. We will call them
dat: a data.frame
variable: the name of the variable based on which the subset is created (a character string)
interval: a numeric vector of length two, with the lower and upper limit of the interval
Now, write the function.
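Hint: first determine, for each row of dat, if the value of variable is inside the interval or not. You will also need a data.frame to test the function.
fun1 <- function(dat, variable, interval) {
  # logical vector: TRUE for rows with a value strictly inside the interval
  filter <- dat[, variable] > interval[1] & dat[, variable] < interval[2]
  subset(dat, subset = filter)
}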
To test the function, we create a dataset that allows us to check easily if what the function returns is correct:
exdat <- data.frame(a = 1:10,
                    b = sample(c('A', 'B'), size = 10, replace = TRUE)
)
exdat
## a b
## 1 1 B
## 2 2 B
## 3 3 A
## 4 4 B
## 5 5 B
## 6 6 A
## 7 7 A
## 8 8 B
## 9 9 B
## 10 10 B
For interval = c(3, 7) the output should be those rows with a equal to 4, 5, or 6:
fun1(exdat, variable = 'a', interval = c(3, 7))
## a b
## 4 4 B
## 5 5 B
## 6 6 A
Extend the function with an argument incl_boundaries that allows the user to specify whether the boundaries of the interval should be included in the subset that is returned.
fun2 <- function(dat, variable, interval, incl_boundaries = TRUE) {
  # check if boundaries should be included
  filter <- if (incl_boundaries) {
    # syntax for subset including boundaries
    dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
  } else {
    # syntax for subset not including boundaries
    dat[, variable] > interval[1] & dat[, variable] < interval[2]
  }
  subset(dat, subset = filter)
}
With the default setting to include the boundaries, fun2() should return the rows of exdat where a is 3, …, 6:
fun2(exdat, variable = 'a', interval = c(3, 6))
## a b
## 3 3 A
## 4 4 B
## 5 5 B
## 6 6 A
When we set incl_boundaries = FALSE the rows where a = 3 and a = 6 should be excluded:
fun2(exdat, variable = 'a', interval = c(3, 6), incl_boundaries = FALSE)
## a b
## 4 4 B
## 5 5 B
We want to extend the function further with an additional argument outside that allows the user to choose whether cases with values of the variable inside or outside the specified interval should be selected.
Hint: you could use a nested if() ... else statement.
fun3 <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  # check if boundaries should be included
  filter <- if (incl_boundaries) {
    # check if values outside (or inside) the interval should be selected
    if (outside) {
      # syntax for subset outside the interval, including boundaries
      dat[, variable] <= interval[1] | dat[, variable] >= interval[2]
    } else {
      # syntax for subset inside the interval, including boundaries
      dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
    }
  } else {
    # check if values outside (or inside) the interval should be selected
    if (outside) {
      # syntax for subset outside the interval, not including boundaries
      dat[, variable] < interval[1] | dat[, variable] > interval[2]
    } else {
      # syntax for subset inside the interval, not including boundaries
      dat[, variable] > interval[1] & dat[, variable] < interval[2]
    }
  }
  subset(dat, subset = filter)
}
We test all four scenarios (different combinations of incl_boundaries and outside) with interval = c(3, 7):
incl_boundaries = TRUE, outside = FALSE
should include values 3, …, 7:
fun3(exdat, variable = 'a', interval = c(3, 7), incl_boundaries = TRUE, outside = FALSE)
## a b
## 3 3 A
## 4 4 B
## 5 5 B
## 6 6 A
## 7 7 A
incl_boundaries = TRUE, outside = TRUE
should include values 1, 2, 3, and 7, …, 10:
fun3(exdat, variable = 'a', interval = c(3, 7), incl_boundaries = TRUE, outside = TRUE)
## a b
## 1 1 B
## 2 2 B
## 3 3 A
## 7 7 A
## 8 8 B
## 9 9 B
## 10 10 B
incl_boundaries = FALSE, outside = FALSE
should include values 4, 5, 6:
fun3(exdat, variable = 'a', interval = c(3, 7), incl_boundaries = FALSE, outside = FALSE)
## a b
## 4 4 B
## 5 5 B
## 6 6 A
incl_boundaries = FALSE, outside = TRUE
should include values 1, 2 and 8, 9, 10:
fun3(exdat, variable = 'a', interval = c(3, 7), incl_boundaries = FALSE, outside = TRUE)
## a b
## 1 1 B
## 2 2 B
## 8 8 B
## 9 9 B
## 10 10 B
In fun3(), many of the lines of syntax are almost identical.
Re-write the function to make better use of the fact that a logical value can be reversed (i.e., you can use the filter to specify which cases should be selected or which cases should be excluded).
We can use filter when we want to select cases with values inside the interval and !filter when we want to select cases outside the interval.
It is helpful to make a table of the combinations of options for incl_boundaries and outside and then think about which version of the syntax is needed:
version 1: dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
version 2: dat[, variable] > interval[1] & dat[, variable] < interval[2]
Our first idea might be to try this function:
fun4_false <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  filter <- if (incl_boundaries) {
    dat[, variable] >= interval[1] & dat[, variable] <= interval[2] # version 1
  } else {
    dat[, variable] > interval[1] & dat[, variable] < interval[2] # version 2
  }
  if (outside)
    filter <- !filter
  subset(dat, filter)
}
Depending on the incl_boundaries argument, the filter selects either the inside of the interval, or the inside AND the boundaries. Depending on outside, we then use filter directly (when we want to return the inside of the interval) or negate it (i.e., turn TRUE into FALSE and vice versa) when outside = TRUE.
The problem is that by negating the filter, we also “negate” the in- or exclusion of the boundary values:
when filter includes the boundary values, they are excluded in !filter
when filter excludes the boundary values, they are included in !filter
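The effect of negating the filter on the boundary values can be seen in a small example. This is only a sketch: it uses a plain vector a with the values 1, …, 10 (as in exdat) instead of the data.frame, and the names filter_incl and filter_excl are introduced here purely for illustration.

```r
a <- 1:10
interval <- c(3, 7)

# filter that includes the boundaries (version 1)
filter_incl <- a >= interval[1] & a <= interval[2]
# filter that excludes the boundaries (version 2)
filter_excl <- a > interval[1] & a < interval[2]

a[filter_incl]   # 3 4 5 6 7
a[!filter_incl]  # 1 2 8 9 10 (the boundaries 3 and 7 are no longer included)
a[filter_excl]   # 4 5 6
a[!filter_excl]  # 1 2 3 7 8 9 10 (the boundaries 3 and 7 are now included)
```

So to select the outside of the interval together with the boundaries, we have to negate the filter that excludes them.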
We therefore need to consider incl_boundaries and outside together. This is where the table comes in handy:
incl_boundaries | outside | version |
---|---|---|
TRUE | TRUE | |
FALSE | TRUE | |
TRUE | FALSE | |
FALSE | FALSE | |
We can then fill in the table by looking at each scenario:
incl_boundaries | outside | version |
---|---|---|
TRUE | TRUE | 2 |
FALSE | TRUE | 1 |
TRUE | FALSE | 1 |
FALSE | FALSE | 2 |
We note that we need to choose version 1 when incl_boundaries != outside and version 2 when incl_boundaries == outside!
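Before rewriting the function, we can convince ourselves that this rule reproduces the four explicit filters from fun3(). The following is a sketch on a plain vector a; the helper functions explicit() and by_rule() are hypothetical names used only for this check and are not part of the solution.

```r
a <- 1:10
interval <- c(3, 7)

version1 <- a >= interval[1] & a <= interval[2]
version2 <- a > interval[1] & a < interval[2]

# explicit filters for the four scenarios, written out as in fun3()
explicit <- function(incl_boundaries, outside) {
  if (incl_boundaries) {
    if (outside) a <= interval[1] | a >= interval[2] else version1
  } else {
    if (outside) a < interval[1] | a > interval[2] else version2
  }
}

# filter built with the rule from the table, negated when outside = TRUE
by_rule <- function(incl_boundaries, outside) {
  filter <- if (incl_boundaries != outside) version1 else version2
  if (outside) !filter else filter
}

# all four combinations of the two arguments
combos <- expand.grid(incl_boundaries = c(TRUE, FALSE), outside = c(TRUE, FALSE))
combos$agree <- mapply(function(i, o) identical(explicit(i, o), by_rule(i, o)),
                       combos$incl_boundaries, combos$outside)
combos
```

The column agree should be TRUE in all four rows.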
With this we can fix our function, which is now a lot shorter than fun3():
fun4_correct <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  filter <- if (incl_boundaries != outside) {
    dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
  } else {
    dat[, variable] > interval[1] & dat[, variable] < interval[2]
  }
  if (outside)
    filter <- !filter
  subset(dat, filter)
}
We repeat the same set of tests as before, with interval = c(3, 7):
incl_boundaries = TRUE, outside = FALSE
should include values 3, …, 7:
fun4_correct(exdat, variable = 'a', interval = c(3, 7), incl_boundaries = TRUE, outside = FALSE)
## a b
## 3 3 A
## 4 4 B
## 5 5 B
## 6 6 A
## 7 7 A
incl_boundaries = TRUE, outside = TRUE
should include values 1, 2, 3, and 7, …, 10:
fun4_correct(exdat, variable = 'a', interval = c(3, 7), incl_boundaries = TRUE, outside = TRUE)
## a b
## 1 1 B
## 2 2 B
## 3 3 A
## 7 7 A
## 8 8 B
## 9 9 B
## 10 10 B
incl_boundaries = FALSE, outside = FALSE
should include values 4, 5, 6:
fun4_correct(exdat, variable = 'a', interval = c(3, 7), incl_boundaries = FALSE, outside = FALSE)
## a b
## 4 4 B
## 5 5 B
## 6 6 A
incl_boundaries = FALSE, outside = TRUE
should include values 1, 2 and 8, 9, 10:
fun4_correct(exdat, variable = 'a', interval = c(3, 7), incl_boundaries = FALSE, outside = TRUE)
## a b
## 1 1 B
## 2 2 B
## 8 8 B
## 9 9 B
## 10 10 B
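Instead of comparing the printed output by eye, the four tests could also be written as assertions. This is a minimal sketch, assuming the definition of fun4_correct() from above (repeated so that the block runs on its own); the list expected and its element names are introduced only for illustration, and b is filled deterministically because its values do not matter for the check.

```r
# fun4_correct() as defined above, repeated so that this block is self-contained
fun4_correct <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  filter <- if (incl_boundaries != outside) {
    dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
  } else {
    dat[, variable] > interval[1] & dat[, variable] < interval[2]
  }
  if (outside)
    filter <- !filter
  subset(dat, filter)
}

# test data; the values of b are irrelevant for these checks (assumption)
exdat <- data.frame(a = 1:10, b = rep(c('A', 'B'), times = 5))

# expected values of a for each of the four scenarios
expected <- list(
  list(incl = TRUE,  out = FALSE, a = 3:7),
  list(incl = TRUE,  out = TRUE,  a = c(1:3, 7:10)),
  list(incl = FALSE, out = FALSE, a = 4:6),
  list(incl = FALSE, out = TRUE,  a = c(1:2, 8:10))
)

for (s in expected) {
  res <- fun4_correct(exdat, variable = 'a', interval = c(3, 7),
                      incl_boundaries = s$incl, outside = s$out)
  # stopifnot() throws an error if the condition is not TRUE
  stopifnot(identical(res$a, s$a))
}
```

If the loop finishes without an error, all four scenarios return the expected rows.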
Note:
If you didn’t figure this one out by yourself, don’t worry. This really was a tough one!
What happens when we call fun4_correct() but specify variable to be a factor?
We can easily check that:
fun4_correct(exdat, variable = 'b', interval = c(3, 7))
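What the comparison inside the function does with a factor can be looked at in isolation. The following is a sketch under the assumption that b is stored as a factor (the default for character columns in data.frame() before R 4.0.0); here the conversion is done explicitly, and the objects b, res and dat are illustrative only.

```r
# assumption: b is a factor (converted explicitly here)
b <- factor(c('A', 'B', 'A'))

# comparisons other than == and != are not meaningful for unordered factors:
# R issues a warning and returns NA for every element
res <- suppressWarnings(b >= 3)
res

# subset() treats NA in the filter as FALSE, so no rows are selected
dat <- data.frame(a = 1:3, b = b)
nrow(subset(dat, subset = res))  # 0

# is.numeric() identifies the problem before any comparison is made
is.numeric(b)  # FALSE
```

The call therefore does not fail, which makes the mistake easy to overlook.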
Add a check to fun4_correct() that first checks if variable is of type numeric. When a non-numeric variable is selected, the function should not try to produce a subset but instead print a message.
You can use is.numeric() to check this.
With the functions that we have seen so far, one solution would be:
fun5 <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  if (!is.numeric(dat[, variable])) {
    print('The variable you selected is not numeric!')
  } else {
    filter <- if (incl_boundaries != outside) {
      dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
    } else {
      dat[, variable] > interval[1] & dat[, variable] < interval[2]
    }
    if (outside) {
      filter <- !filter
    }
    subset(dat, filter)
  }
}
fun5(exdat, variable = 'b', interval = c(3, 7))
## [1] "The variable you selected is not numeric!"
Note:
Typically we would not just want to print a message but either create a warning() or an error message (using stop()) which would immediately stop the execution of the function:
fun5b <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  if (!is.numeric(dat[, variable])) {
    stop('The variable you selected is not numeric!', call. = FALSE)
  }
  filter <- if (incl_boundaries != outside) {
    dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
  } else {
    dat[, variable] > interval[1] & dat[, variable] < interval[2]
  }
  if (outside) {
    filter <- !filter
  }
  subset(dat, filter)
}
fun5b(exdat, variable = 'b', interval = c(3, 7))
## Error: The variable you selected is not numeric!
When we use stop() we do not need the else part of the if() statement because the function will be stopped immediately and the rest of the syntax will not be evaluated.
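That the code after stop() is skipped can be illustrated with a small example. This is a sketch only: check_numeric() is a hypothetical helper that contains just the check from fun5b(), and tryCatch() is used here so that the error can be captured and inspected instead of ending the script.

```r
# hypothetical helper containing only the check from fun5b()
check_numeric <- function(x) {
  if (!is.numeric(x)) {
    stop('The variable you selected is not numeric!', call. = FALSE)
  }
  # this line is only reached when x is numeric
  'check passed'
}

check_numeric(1:10)

# tryCatch() intercepts the error and returns its message
msg <- tryCatch(
  check_numeric(c('A', 'B')),
  error = function(e) conditionMessage(e)
)
msg
```

For the numeric input the function returns 'check passed'; for the character input the last line of the function is never evaluated and msg contains the error message.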
© Nicole Erler