The aim of this practical is to write a function that selects a subset of a dataset based on whether the value of a particular variable is inside an interval or not.
Several tasks of this practical are quite technical and the aim here is to demonstrate the abstract way of thinking that is needed for some programming tasks.
Note that for several tasks of this practical the given solution is only one of many possible solutions. In some cases it may not be the most efficient or elegant solution, because the focus is on keeping the syntax simple and easier to understand.
We want to write a function that creates a subset of a `data.frame` based on whether the values of a particular variable in that `data.frame` are within a certain interval or not.
The function needs three arguments. We could call them
- `dat`: a `data.frame`
- `variable`: the name of the variable based on which the subset is created (a character string)
- `interval`: a numeric vector of length two, with the lower and upper limit of the interval
Alternatively, we could of course also use separate arguments for the lower and upper limit of the interval, and we could consider accepting a vector of values instead of the name of a variable, so that it would be possible to do the selection based on a variable that is not part of the `data.frame`.
Now write the function.
You will need:
- a `logical` filter variable that tells us for each row of the dataset `dat` whether the value of `variable` is inside the `interval` or not
- a `data.frame` to test the function
- the `&` operator
It is easiest to start with creating the example data because we can then use it to help us to try things as we go.
The test data has to have at least one numeric variable (which we call `a` here), but here we will also create a second variable (`b`) to make the data look more like a normal dataset:
exdat <- data.frame(a = 1:10,
                    b = factor(sample(c('A', 'B'), size = 10, replace = TRUE))
)
exdat
## a b
## 1 1 B
## 2 2 B
## 3 3 B
## 4 4 B
## 5 5 A
## 6 6 B
## 7 7 B
## 8 8 A
## 9 9 A
## 10 10 A
To create the filter variable, we need to check if the values of `variable` are inside `interval`, i.e., whether they are larger than the lower bound of `interval` and smaller than the upper bound of `interval`.
Because `variable` is the name of the variable (and not the vector of values), we need to use the corresponding column of `dat`.
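As a small illustration of why the column has to be looked up by name, the sketch below uses a made-up data set `lookup_demo` and a made-up object `varname` (neither is part of the practical). Indexing with `[, varname]` or `[[varname]]` uses the character string stored in `varname`, whereas `$varname` looks for a column that is literally called `varname`.

```r
# hypothetical demo data and variable name, used for illustration only
lookup_demo <- data.frame(a = 1:5, b = letters[1:5])
varname <- "a"

# both forms use the string stored in "varname" to find the column
col_bracket <- lookup_demo[, varname]
col_double  <- lookup_demo[[varname]]

# "$" does not evaluate "varname"; it looks for a column named "varname"
col_dollar <- lookup_demo$varname

identical(col_bracket, col_double)  # TRUE
is.null(col_dollar)                 # TRUE
```

This is why the filter is written with `dat[, variable]` rather than `dat$variable`.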
We can try the first step, creating the filter, with our example data:
intrvl <- c(2, 5)
exdat[, "a"] > intrvl[1] & exdat[, "a"] < intrvl[2]
Here, I use names for the data, interval and variable that are different from the names of the arguments in our function. This is on purpose. If we now specified `dat`, `variable` and `interval` outside our function, they would be available in the global environment. If we then wrote a function that uses these arguments but made a mistake in the syntax, R could silently fall back on the objects in the global environment. The function might then seem to work although it contains a mistake, or it might return results for the wrong input (e.g., we ask for a variable other than `"a"`, but the value of `variable` is still set to `"a"`).
To write a function that uses this filter we need to replace the names of the example data with the names of the arguments used in the function:
fun1 <- function(dat, variable, interval) {
  dat[, variable] > interval[1] & dat[, variable] < interval[2]
}
We can try this with our example data:
fun1(exdat, variable = "a", interval = c(3, 6))
## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
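The reason for keeping the argument names out of the global environment can be shown with a deliberately broken sketch. The names `bounds` and `inside_buggy()` are invented for this illustration. The argument is misspelled as `bound`, so `bounds` in the body is not an argument; R finds the global object instead and the function runs without any error.

```r
# object in the global environment (hypothetical name)
bounds <- c(2, 5)

# the argument is misspelled ("bound"), but the body uses "bounds"
inside_buggy <- function(x, bound) {
  x > bounds[1] & x < bounds[2]  # silently uses the global "bounds"
}

# we ask for the interval (7, 9) but get the result for (2, 5)
buggy_result <- inside_buggy(1:10, bound = c(7, 9))
which(buggy_result)  # 3 4
```

With test objects named differently from the arguments, the same mistake would instead produce an "object not found" error, which is much easier to spot.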
So far, the function only creates the filter, but doesn’t use this filter yet to make and return the subset. We still need to add this step to our function:
fun1 <- function(dat, variable, interval) {
  filter <- dat[, variable] > interval[1] & dat[, variable] < interval[2]
  subset(dat, subset = filter)
}
For `interval = c(3, 7)` the output should be those rows with `a` equal to 4, 5, or 6:
fun1(exdat, variable = "a", interval = c(3, 7))
## a b
## 4 4 B
## 5 5 A
## 6 6 B
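The alternative interface mentioned at the beginning (separate lower and upper limits, and a vector of values instead of a variable name) could be sketched as follows. The function name `subset_by_values()`, its argument names and the demo objects are assumptions made for this example, not part of the practical.

```r
# hypothetical alternative: pass the values directly and give the limits separately
subset_by_values <- function(dat, values, lower, upper) {
  # the vector of values must have one entry per row of the data
  stopifnot(length(values) == nrow(dat))
  filter <- values > lower & values < upper
  subset(dat, subset = filter)
}

alt_demo <- data.frame(a = 1:10)
# select rows based on a vector that is not a column of the data
external <- seq(10, 100, by = 10)
alt_result <- subset_by_values(alt_demo, values = external, lower = 30, upper = 70)
alt_result$a  # 4 5 6
```

The price of this flexibility is that the caller has to make sure the vector of values is in the same row order as the data.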
Extend the function with an argument `incl_boundaries` that allows the user to specify whether the boundaries of the `interval` should be included in the subset that is returned.
To make including the boundaries the default behaviour, we set the default value of `incl_boundaries` to `TRUE`.
When the boundaries should be included, the comparison of the value of `variable` with the `interval` has to be `>=` and `<=` instead of `>` and `<`.
To implement this, we can use an `if() ... else` statement that uses the argument `incl_boundaries` as condition.
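In R, `if () ... else` is an expression that returns the value of the branch that was evaluated, so its result can be assigned directly to an object. A minimal sketch with an invented helper `compare_label()` (not part of the practical):

```r
# hypothetical helper: returns a label depending on a logical flag
compare_label <- function(incl_boundaries = TRUE) {
  # the value of the evaluated branch is assigned to "label"
  label <- if (incl_boundaries) {
    "use >= and <="
  } else {
    "use > and <"
  }
  label
}

compare_label(TRUE)
compare_label(FALSE)
```

The same pattern lets us assign the result of the whole `if() ... else` statement to `filter`.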
fun2 <- function(dat, variable, interval, incl_boundaries = TRUE) {
  # check if boundaries should be included
  filter <- if (incl_boundaries) {
    # syntax for subset including boundaries
    dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
  } else {
    # syntax for subset not including boundaries
    dat[, variable] > interval[1] & dat[, variable] < interval[2]
  }
  subset(dat, subset = filter)
}
With the default setting to include the boundaries, `fun2()` should return the rows of `exdat` where `a` is 3, …, 6:
fun2(exdat, variable = "a", interval = c(3, 6))
## a b
## 3 3 B
## 4 4 B
## 5 5 A
## 6 6 B
When we set `incl_boundaries = FALSE`, the rows where `a = 3` and `a = 6` should be excluded:
fun2(exdat, variable = "a", interval = c(3, 6), incl_boundaries = FALSE)
## a b
## 4 4 B
## 5 5 A
We want to extend the function further with an additional argument `outside` that allows the user to choose whether cases with values of the `variable` inside or outside the specified interval should be selected.
You will need:
- an `if() ... else` statement
- the `|` operator
We set the default value as `outside = FALSE` to return, by default, the values inside the specified interval.
To select the correct rows of the data, depending on the values of `incl_boundaries` and `outside`, we need nested `if() ... else` statements, so that the general structure of the function should be:
fun3 <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  # check if boundaries should be included
  filter <- if (incl_boundaries) {
    # check if values outside the interval should be selected
    if (outside) {
      # syntax for subset outside the interval, including the boundaries
    } else {
      # syntax for subset inside the interval, including the boundaries
    }
  } else {
    # check if values outside the interval should be selected
    if (outside) {
      # syntax for subset outside the interval, not including the boundaries
    } else {
      # syntax for subset inside the interval, not including the boundaries
    }
  }
  subset(dat, subset = filter)
}
We now need to work out how the filter variable should be defined in the four different scenarios with respect to the inside or outside of the interval and whether the boundaries should be included or excluded.
Selecting values outside the `interval` means they should be either smaller than (or equal to) the lower bound or larger than (or equal to) the upper bound of the `interval`. This “or” is implemented as the `|` operator.
(The following syntax does not run when we use it outside the function.)
# syntax for subset outside the interval, including the boundaries
dat[, variable] <= interval[1] | dat[, variable] >= interval[2]
# syntax for subset outside the interval, not including the boundaries
dat[, variable] < interval[1] | dat[, variable] > interval[2]
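To try these comparisons outside a function, we can substitute a plain vector for `dat[, variable]` and a test interval for `interval`. The names `x` and `limits` below are stand-ins chosen for this sketch.

```r
# hypothetical stand-ins for dat[, variable] and interval
x <- 1:10
limits <- c(3, 7)

# outside the interval, including the boundaries
out_incl <- x <= limits[1] | x >= limits[2]
# outside the interval, not including the boundaries
out_excl <- x < limits[1] | x > limits[2]

x[out_incl]  # 1 2 3 7 8 9 10
x[out_excl]  # 1 2 8 9 10
```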
The syntax for the other two scenarios (values inside the interval) is as before.
So, filling in the different pieces of syntax for the filters in the different scenarios, we get the function:
fun3 <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  # check if boundaries should be included
  filter <- if (incl_boundaries) {
    # check if values outside the interval should be selected
    if (outside) {
      # syntax for subset outside the interval, including the boundaries
      dat[, variable] <= interval[1] | dat[, variable] >= interval[2]
    } else {
      # syntax for subset inside the interval, including the boundaries
      dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
    }
  } else {
    # check if values outside the interval should be selected
    if (outside) {
      # syntax for subset outside the interval, not including the boundaries
      dat[, variable] < interval[1] | dat[, variable] > interval[2]
    } else {
      # syntax for subset inside the interval, not including the boundaries
      dat[, variable] > interval[1] & dat[, variable] < interval[2]
    }
  }
  subset(dat, subset = filter)
}
We test all four scenarios (different combinations of `incl_boundaries` and `outside`) with `interval = c(3, 7)`:
`incl_boundaries = TRUE, outside = FALSE` should include values 3, …, 7:
fun3(exdat, variable = "a", interval = c(3, 7), incl_boundaries = TRUE, outside = FALSE)
## a b
## 3 3 B
## 4 4 B
## 5 5 A
## 6 6 B
## 7 7 B
`incl_boundaries = TRUE, outside = TRUE` should include values 1, 2, 3, and 7, …, 10:
fun3(exdat, variable = "a", interval = c(3, 7), incl_boundaries = TRUE, outside = TRUE)
## a b
## 1 1 B
## 2 2 B
## 3 3 B
## 7 7 B
## 8 8 A
## 9 9 A
## 10 10 A
`incl_boundaries = FALSE, outside = FALSE` should include values 4, 5, 6:
fun3(exdat, variable = "a", interval = c(3, 7), incl_boundaries = FALSE, outside = FALSE)
## a b
## 4 4 B
## 5 5 A
## 6 6 B
`incl_boundaries = FALSE, outside = TRUE` should include values 1, 2 and 8, 9, 10:
fun3(exdat, variable = "a", interval = c(3, 7), incl_boundaries = FALSE, outside = TRUE)
## a b
## 1 1 B
## 2 2 B
## 8 8 A
## 9 9 A
## 10 10 A
In `fun3()`, many of the lines of syntax are almost identical.
Re-write the function to make better use of the fact that a `logical` value can be reversed (i.e., you can use the `filter` to specify which cases should be selected or which cases should be excluded).
- Use `filter` when we want to select cases with values inside the interval and `!filter` when we want to select cases outside the interval.
- Make a table of the possible combinations of `incl_boundaries` and `outside` and then think about which version of the syntax is needed:
dat[, variable] >= interval[1] & dat[, variable] <= interval[2]  # version 1
dat[, variable] > interval[1] & dat[, variable] < interval[2]    # version 2
Our first idea might be to try this function:
fun4_false <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  filter <- if (incl_boundaries) {
    # inside interval, including boundaries
    dat[, variable] >= interval[1] & dat[, variable] <= interval[2]  # version 1
  } else {
    # inside interval, excluding boundaries
    dat[, variable] > interval[1] & dat[, variable] < interval[2]    # version 2
  }
  if (outside)
    filter <- !filter
  subset(dat, filter)
}
- Depending on the `incl_boundaries` argument, the `filter` selects either the inside of the interval, or the inside AND the boundaries.
- Depending on `outside`, we then use `filter` directly (when we want to return the inside of the interval) or negate it (i.e., turn `TRUE` into `FALSE` and vice versa) when `outside = TRUE`.
The problem is that by negating the filter, we also “negate” the in- or exclusion of the boundary values:
- If `filter` includes the boundary values, they are excluded in `!filter`.
- If `filter` excludes the boundary values, they are included in `!filter`.
We therefore have to consider `incl_boundaries` and `outside` together. This is where the table comes in handy:
incl_boundaries | outside | version |
---|---|---|
TRUE | TRUE | |
FALSE | TRUE | |
TRUE | FALSE | |
FALSE | FALSE | |
We can then fill in the table by looking at each scenario:
incl_boundaries | outside | version |
---|---|---|
TRUE | TRUE | 2 |
FALSE | TRUE | 1 |
TRUE | FALSE | 1 |
FALSE | FALSE | 2 |
We note that we need to choose version 1 when `incl_boundaries` is different from `outside` and version 2 when `incl_boundaries` and `outside` have the same value.
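The two "outside" rows of the table can be checked on a plain vector: negating one version of the inside filter should give exactly the outside filter with the opposite treatment of the boundaries. The names `x` and `limits` are stand-ins chosen for this sketch.

```r
# hypothetical stand-ins for dat[, variable] and interval
x <- 1:10
limits <- c(3, 7)

version1 <- x >= limits[1] & x <= limits[2]  # inside, including boundaries
version2 <- x > limits[1] & x < limits[2]    # inside, excluding boundaries

# outside, boundaries included: negate version 2
check_incl <- identical(!version2, x <= limits[1] | x >= limits[2])
# outside, boundaries excluded: negate version 1
check_excl <- identical(!version1, x < limits[1] | x > limits[2])

c(check_incl, check_excl)  # TRUE TRUE
```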
With this we can fix our function, which is now a lot shorter than `fun3()`:
fun4_correct <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  # check if "incl_boundaries" is different from "outside"
  filter <- if (incl_boundaries != outside) {
    # values inside the interval, including boundaries
    dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
  } else {
    # values inside the interval, excluding boundaries
    dat[, variable] > interval[1] & dat[, variable] < interval[2]
  }
  # check if values outside the interval should be returned
  if (outside) {
    # invert the filter variable
    filter <- !filter
  }
  subset(dat, filter)
}
We repeat the same set of tests as before, with `interval = c(3, 7)`:
`incl_boundaries = TRUE, outside = FALSE` should include values 3, …, 7:
fun4_correct(exdat, variable = "a", interval = c(3, 7), incl_boundaries = TRUE, outside = FALSE)
## a b
## 3 3 B
## 4 4 B
## 5 5 A
## 6 6 B
## 7 7 B
`incl_boundaries = TRUE, outside = TRUE` should include values 1, 2, 3, and 7, …, 10:
fun4_correct(exdat, variable = "a", interval = c(3, 7), incl_boundaries = TRUE, outside = TRUE)
## a b
## 1 1 B
## 2 2 B
## 3 3 B
## 7 7 B
## 8 8 A
## 9 9 A
## 10 10 A
`incl_boundaries = FALSE, outside = FALSE` should include values 4, 5, 6:
fun4_correct(exdat, variable = "a", interval = c(3, 7), incl_boundaries = FALSE, outside = FALSE)
## a b
## 4 4 B
## 5 5 A
## 6 6 B
`incl_boundaries = FALSE, outside = TRUE` should include values 1, 2 and 8, 9, 10:
fun4_correct(exdat, variable = "a", interval = c(3, 7), incl_boundaries = FALSE, outside = TRUE)
## a b
## 1 1 B
## 2 2 B
## 8 8 A
## 9 9 A
## 10 10 A
If you didn’t figure this one out by yourself, don’t worry. This really was a tough one!
What happens when we call `fun4_correct()` but specify `variable` to be a factor?
Try to figure this out by looking at the syntax. Then check it using the example data.
In the definition of the `filter` variable, we compare `dat[, variable]` with the lower or upper bound of the `interval`.
When we compare a factor variable (such as `b` in `exdat`) with a numeric value we get `NA` and a warning message because the comparison is not meaningful:
exdat$b < 3
## Warning in Ops.factor(exdat$b, 3): '<' not meaningful for factors
## [1] NA NA NA NA NA NA NA NA NA NA
The `filter` variable will therefore not contain a vector of `TRUE` and `FALSE` but a vector of `NA` values.
The function `subset()` selects only those rows of the data for which the vector passed to its argument `subset` is `TRUE` and excludes rows for which it is `FALSE` or missing.
This means that `fun4_correct()` will return a `data.frame` with zero rows.
We can easily check that:
fun4_correct(exdat, variable = "b", interval = c(3, 7))
## Warning in Ops.factor(dat[, variable], interval[1]): '>=' not meaningful for factors
## Warning in Ops.factor(dat[, variable], interval[2]): '<=' not meaningful for factors
## [1] a b
## <0 rows> (or 0-length row.names)
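The way `subset()` treats missing values in the filter differs from indexing with `[`, which can be seen in a small sketch. The objects `na_demo` and `keep` are invented for this illustration.

```r
# hypothetical demo data and a filter that contains a missing value
na_demo <- data.frame(id = 1:3)
keep <- c(TRUE, NA, FALSE)

# subset() treats NA like FALSE and drops the row
kept_subset <- subset(na_demo, keep)

# indexing with "[" keeps a row of NA values for the missing filter value
kept_bracket <- na_demo[keep, , drop = FALSE]

nrow(kept_subset)   # 1
nrow(kept_bracket)  # 2
```

With a filter that consists only of `NA` values, `subset()` therefore returns no rows at all, whereas `[` would return one row of `NA` values per row of the data.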
Add a check to `fun4_correct()` that first checks if `variable` is of type `numeric`. When a non-numeric variable is selected, the function should not try to produce a subset but instead print a message.
- Use `is.numeric()` to check this.
- Use `print()`, `message()` or `warning()` to print the message.
We add another `if() ... else` statement to `fun4_correct()` to implement this additional check, i.e., the structure of the function is:
fun5 <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  if (<"variable is not numeric">) {
    # warning message
  } else {
    # same syntax as in "fun4_correct()"
  }
}
With the functions that we have seen so far, one solution would be:
fun5 <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  # check if the variable is not numeric
  if (!is.numeric(dat[, variable])) {
    print("The variable you selected is not numeric!")
  } else {
    # same syntax as before
    filter <- if (incl_boundaries != outside) {
      dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
    } else {
      dat[, variable] > interval[1] & dat[, variable] < interval[2]
    }
    if (outside) {
      filter <- !filter
    }
    subset(dat, filter)
  }
}
fun5(exdat, variable = "b", interval = c(3, 7))
## [1] "The variable you selected is not numeric!"
Typically we would not just want to print a message but rather create a `warning()` or an error message (using `stop()`), the latter of which immediately stops the execution of the function:
fun5b <- function(dat, variable, interval, incl_boundaries = TRUE, outside = FALSE) {
  if (!is.numeric(dat[, variable])) {
    stop("The variable you selected is not numeric!", call. = FALSE)
  }
  filter <- if (incl_boundaries != outside) {
    dat[, variable] >= interval[1] & dat[, variable] <= interval[2]
  } else {
    dat[, variable] > interval[1] & dat[, variable] < interval[2]
  }
  if (outside) {
    filter <- !filter
  }
  subset(dat, filter)
}
fun5b(exdat, variable = "b", interval = c(3, 7))
## Error: The variable you selected is not numeric!
When we use `stop()` we do not need the `else` part of the `if()` statement because the function will be stopped immediately and the rest of the syntax will not be evaluated.
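The difference between `warning()` and `stop()` can be seen in a minimal sketch. The two checker functions below are invented for this comparison and are not part of the practical: after a warning the rest of the function is still evaluated, whereas after an error it is not.

```r
# hypothetical checkers used only to compare warning() and stop()
check_with_warning <- function(x) {
  if (!is.numeric(x)) warning("not numeric", call. = FALSE)
  "reached the end"
}
check_with_stop <- function(x) {
  if (!is.numeric(x)) stop("not numeric", call. = FALSE)
  "reached the end"
}

# warning(): the rest of the function is still evaluated
res_warning <- suppressWarnings(check_with_warning(factor("A")))

# stop(): execution ends; tryCatch() lets us capture the error message
res_stop <- tryCatch(check_with_stop(factor("A")),
                     error = function(e) conditionMessage(e))

res_warning  # "reached the end"
res_stop     # "not numeric"
```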