For this practical, we will use the survey dataset from the package MASS.
You can either load the MASS package to get access to this dataset
library(MASS)
or make a copy of this data
<- MASS::survey survey
The survey dataset contains responses of 237 statistics students at the University of Adelaide to a number of questions:
str(survey)
## 'data.frame': 237 obs. of 12 variables:
## $ Sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 2 1 2 1 2 2 ...
## $ Wr.Hnd: num 18.5 19.5 18 18.8 20 18 17.7 17 20 18.5 ...
## $ NW.Hnd: num 18 20.5 13.3 18.9 20 17.7 17.7 17.3 19.5 18.5 ...
## $ W.Hnd : Factor w/ 2 levels "Left","Right": 2 1 2 2 2 2 2 2 2 2 ...
## $ Fold : Factor w/ 3 levels "L on R","Neither",..: 3 3 1 3 2 1 1 3 3 3 ...
## $ Pulse : int 92 104 87 NA 35 64 83 74 72 90 ...
## $ Clap : Factor w/ 3 levels "Left","Neither",..: 1 1 2 2 3 3 3 3 3 3 ...
## $ Exer : Factor w/ 3 levels "Freq","None",..: 3 2 2 2 3 3 1 1 3 3 ...
## $ Smoke : Factor w/ 4 levels "Heavy","Never",..: 2 4 3 2 2 2 2 2 2 2 ...
## $ Height: num 173 178 NA 160 165 ...
## $ M.I : Factor w/ 2 levels "Imperial","Metric": 2 1 NA 2 2 1 1 2 2 2 ...
## $ Age : num 18.2 17.6 16.9 20.3 23.7 ...
Age
differencesFirst, we want to test if there is a difference in Age
between Female
and Male
students.
Which statistical test do you think is appropriate to test this?
hist()
creates a histogram.
par(mfrow = c(1, 2)) # arrange plots in a grid with one row and two columns
hist(survey$Age[which(survey$Sex == 'Female')], nclass = 50, main = 'Female', xlab = 'age')
hist(survey$Age[which(survey$Sex == 'Male')], nclass = 50, main = 'Male', xlab = 'age')
We notice that the distribution within each of the groups is skewed and does not follow a normal distribution. We decide to use a non-parametric test.
Note:
In the above syntax we used ...[which(survey$Sex == 'Female')]...
and not just ...[survey$Sex == 'Female]...
. These two ways of selecting cases give different results when there are missing values:
<- c("A", "A", "B", NA, "A", "B")
x <- c("a", "a", "b", "na_value", "a", "b")
y
== "A"] y[x
## [1] "a" "a" NA "a"
which(x == "A")] y[
## [1] "a" "a" "a"
Perform a non-parametric test to investigate whether males are older than females.
It is recommended to use the exact p-value calculation when possible (= when there are no ties), and to use continuity correction if a exact calculation is not possble.
wilcox.test()
function.
To perform a one-sided test, the argument alternative
needs to be specified to either "less"
or "greater"
.
In the help file for wilcox.test()
it is written that
[…] the one-sided alternative “greater” is that
x
is shifted to the right ofy
.
When using the formula specification, x
refers to the first factor level (Female
in our case) and y
refers to the second level (Male
in our case).
Setting alternative = "greater"
would, hence, test the alternative hypothesis \(H_1: Female > Male\). Since we want to test \(H_1: Male > Female\) (i.e., \(H_0: Male \leq Female\)) we need to set alternative = "less"
:
<- wilcox.test(Age ~ Sex, data = survey, alternative = 'less', exact = TRUE) wtest
## Warning in wilcox.test.default(x = c(18.25, 21, 35.833, 28.5, 18.75, 17.5, : cannot compute exact p-
## value with ties
wtest
##
## Wilcoxon rank sum test with continuity correction
##
## data: Age by Sex
## W = 5869, p-value = 0.01858
## alternative hypothesis: true location shift is less than 0
Because there are ties in the data, the exact calculation of the p-value could not be used and the normal approximation was used instead. Continuity correction was used (by default).
The output shows that we can reject the null hypothesis that males are younger than (or equally as old as) females, i.e., males are likely older.
Next, we would like to test if the span of the writing hand (Wr.Hnd
; distance from tip of thumb to tip of little finger) differs from the span of the non-writing hand (NW.Hnd
).
Make a plot to investigate if a t.test()
would be appropriate to answer this question.
We are now investigating a paired sample and are, therefore, interested in the difference between the spans of the writing and non-writing hand. (Remember: a paired t-test is the same as a one-sample t-test of the differences)
hist(survey$NW.Hnd - survey$Wr.Hnd, nclass = 50)
t.test()
to answer our research question about the spand of the hands.# one-sample t-test:
<- t.test(survey$Wr.Hnd - survey$NW.Hnd)
ttest1
# paired t-test:
<- t.test(survey$Wr.Hnd, survey$NW.Hnd, paired = TRUE)
ttest2
ttest1
##
## One Sample t-test
##
## data: survey$Wr.Hnd - survey$NW.Hnd
## t = 2.1268, df = 235, p-value = 0.03448
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.006367389 0.166513967
## sample estimates:
## mean of x
## 0.08644068
ttest2
##
## Paired t-test
##
## data: survey$Wr.Hnd and survey$NW.Hnd
## t = 2.1268, df = 235, p-value = 0.03448
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.006367389 0.166513967
## sample estimates:
## mean of the differences
## 0.08644068
?t.test
).
In the help file we can read:
## Default S3 method:
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
The default values that were used are:
y = NULL
(correct, because we are doing a one-sample test)alternative = "two.sided"
(correct, we want to test for difference in either direction)mu = 0
(correct, we want to test for any difference)paired = FALSE
(only needs to be specified when we use both arguments x
and y
)var.equal = FALSE
(only relevant for independent samples test)conf.level = 0.95
(OK, since this is a common choice and nothing else was specified in the exercise)We now want to test if the writing hand (W.Hnd
) is associated with which hand was on top when the students clapped their hands (Clap
).
Perform an appropriate test.
chisq.test(table(survey$W.Hnd, survey$Clap))
## Warning in chisq.test(table(survey$W.Hnd, survey$Clap)): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: table(survey$W.Hnd, survey$Clap)
## X-squared = 19.252, df = 2, p-value = 6.598e-05
Note:
A common recommendation is / used to be that Fisher’s Exact test should be used instead of the \(\chi^2\)-test whenever one of the expected frequencies is \(< 5\), however this “cut-off” is arbitrary (https://doi.org/10.1002/sim.2832) and a \(\chi^2\)-test can be used.
Check if the conclusion differs when we take into account the students’ Sex
.
Sex
as stratifying variable.
To test for an association between W.Hnd
and Clap
in each of the two strata defined by Sex
we can use the mantelhaen.test()
:
<- mantelhaen.test(table(survey$W.Hnd, survey$Clap, survey$Sex))
MHtest MHtest
##
## Cochran-Mantel-Haenszel test
##
## data: table(survey$W.Hnd, survey$Clap, survey$Sex)
## Cochran-Mantel-Haenszel M^2 = 19.922, df = 2, p-value = 4.721e-05
Obtain adjusted p-values for these test and perform a multiplicity correction:
Age
by Sex
Wr.Hnd
and NW.Hnd
W.Hnd
, Clap
and Sex
p.adjust()
.
<- c(wtest = wtest$p.value,
pvals ttest = ttest1$p.value,
MHtest = MHtest$p.value)
<- p.adjust(pvals, method = "BH")
padj padj
## wtest ttest MHtest
## 0.0278728964 0.0344813844 0.0001416164
Note:
To make a nice looking comparison we could use the following syntax:
data.frame('p-value' = format.pval(pvals, eps = 0.001, digits = 2),
'adj.p-value' = format.pval(padj, eps = 0.001, digits = 2),
row.names = names(pvals), check.names = FALSE
)
## p-value adj.p-value
## wtest 0.019 0.028
## ttest 0.034 0.034
## MHtest <0.001 <0.001
By setting the argument check.names = FALSE
in data.frame()
we prevent that R “cleanes” the variable names (replaces the “-” with “.”, because a “-” in a variable name is usually rather inconvenient and can give problems.).
The argument eps
in format.pval()
controls the cutoff at which the “<…” notation is used, the argument digits
sets the number of significant digits are used (significant has nothing to do with statistical significance in this case).
© Nicole Erler