In this practical, we will fit a multiple linear regression model
using the function lm()
, investigate the output of the
model summary()
and how all the different elements in this
summary are derived.
The aim is to gain a better understanding of what the numbers shown in the model summary mean, to recap some of the theory about linear regression models and the least squares estimator, and to practice working with R.
It may seem to you that there is too much “R stuff” in the practicals and not enough “statistics”, but in practice, cleaning/preparing the data for the analysis and visualizing the results is usually much more work than running the analysis itself. Therefore, one of the aims of these practicals is to prepare you for working with your own data in the future.
This practical is intended to be “learning by doing”, i.e., you may not be able to solve all tasks by yourself without the hints or even checking the solution, and you will come across functions that you haven’t seen in the lecture slides.
When a function is new to you, you can find out more about it in the
help file that you can access by typing
?<function name>
.
For several tasks of this practical the given solution is only one of many possible solutions.
There are a few parts indicated as “Advanced Level” containing some more complex solutions (for those of you for whom working with R is a piece of cake).
For this practical we will use a subset of the NHANES (National Health and Nutrition Examination Survey) data.
To load this dataset into R, you can use the command load(file.choose()), which opens a file explorer and allows you to navigate to the location of the file on your computer.
If you know the path to the file, you can also use load("<path>/nhanes.RData").
Here is a short overview of the data:
## SBP gender age race WC alc educ creat albu uricacid bili occup smoke
## 1 110.6667 male 22 Non-Hispanic White 81.0 <NA> low 0.91 4.8 4.9 0.6 working never
## 2 118.0000 female 44 Non-Hispanic White 80.1 <NA> high 0.89 3.7 4.5 0.3 working never
## 3 124.6667 male 21 other 69.6 <1 low 0.87 4.4 5.4 0.2 not working never
## 4 102.0000 female 43 Non-Hispanic Black 120.4 >=1 low 0.68 4.4 5.0 0.8 not working current
## 5 146.6667 male 51 other 81.1 <NA> low 0.99 4.1 5.0 0.5 looking for work current
## 6 122.0000 male 80 Non-Hispanic White 112.5 <NA> low 1.01 4.2 4.8 0.7 not working never
## 'data.frame': 2765 obs. of 13 variables:
## $ SBP : num 111 118 125 102 147 ...
## $ gender : Factor w/ 2 levels "male","female": 1 2 1 2 1 1 1 2 1 1 ...
## $ age : num 22 44 21 43 51 80 26 30 70 35 ...
## $ race : Factor w/ 5 levels "Mexican American",..: 3 3 5 4 5 3 4 5 4 4 ...
## $ WC : num 81 80.1 69.6 120.4 81.1 ...
## $ alc : Factor w/ 2 levels "<1",">=1": NA NA 1 2 NA NA 2 2 1 NA ...
## $ educ : Factor w/ 2 levels "low","high": 1 2 1 1 1 1 1 2 1 2 ...
## $ creat : num 0.91 0.89 0.87 0.68 0.99 1.01 1.03 0.91 0.8 1.1 ...
## $ albu : num 4.8 3.7 4.4 4.4 4.1 4.2 5 4.8 3.7 4 ...
## $ uricacid: num 4.9 4.5 5.4 5 5 4.8 5.4 6.7 5.4 6.7 ...
## $ bili : num 0.6 0.3 0.2 0.8 0.5 0.7 1.1 0.9 0.9 0.9 ...
## $ occup : Factor w/ 3 levels "working","looking for work",..: 1 1 3 3 2 3 1 1 3 1 ...
## $ smoke : Ord.factor w/ 3 levels "never"<"former"<..: 1 1 1 3 3 1 1 1 2 1 ...
Fit a linear regression model on the nhanes data with SBP as the response and gender, age, race, smoke and albu as covariates.
Use the function lm().
You only need to specify the arguments formula and data.
To fit a linear regression model, we use the function
lm()
. This function has multiple arguments, but for now we
only need to specify the formula
and the
data
.
In the model formula
we use the name of the response
variable, i.e., SBP
, then a tilde, and then the names of
the covariates, separated by plus symbols.
The left hand side (LHS) of a formula refers to the part before the tilde, the right hand side (RHS) refers to the part after the tilde.
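For our model this looks as follows (we store the fitted model as lm1, the name used later in this practical):
lm1 <- lm(SBP ~ gender + age + race + smoke + albu, data = nhanes)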
Print the summary() of the fitted model and take a close look at it. Do you know what all the different parts mean?
What is the effect of age and what is its interpretation?
How is the effect of race estimated? What type of coding was used and how should the estimates be interpreted?
What about the effect of smoke?
##
## Call:
## lm(formula = SBP ~ gender + age + race + smoke + albu, data = nhanes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.379 -9.772 -1.071 8.194 95.891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95.86921 5.04766 18.993 < 2e-16 ***
## genderfemale -4.45617 0.69972 -6.368 2.30e-10 ***
## age 0.46631 0.01988 23.456 < 2e-16 ***
## raceOther Hispanic -0.21361 1.43025 -0.149 0.881
## raceNon-Hispanic White -0.12290 1.15504 -0.106 0.915
## raceNon-Hispanic Black 5.32980 1.23176 4.327 1.58e-05 ***
## raceother -0.26289 1.29833 -0.202 0.840
## smoke.L 0.97193 0.61667 1.576 0.115
## smoke.Q 0.61565 0.69081 0.891 0.373
## albu 1.25940 1.07471 1.172 0.241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.78 on 2316 degrees of freedom
## (439 observations deleted due to missingness)
## Multiple R-squared: 0.2354, Adjusted R-squared: 0.2324
## F-statistic: 79.2 on 9 and 2316 DF, p-value: < 2.2e-16
The output of the model summary has 4 parts: the model call, the summary of the residuals, the table of coefficients, and the overall model summaries (residual standard error, R-squared, and the F-statistic).
We will take a closer look at all of these elements throughout this
practical. But first, let’s see what the effects of age
,
race
and smoke
are and how they should be
interpreted.
age
The effect of age
is estimated to be 0.466. Its
interpretation is that a 1 unit increase in age
is
associated with an increase of 0.466 in the expected SBP
when all other covariates are kept constant.
The p-value for the effect of age
is < 2.22e-16,
i.e., it would be very unlikely to see an effect of this size when the
true effect was zero. This means that we can reject the (implied)
null-hypothesis that the effect of age
on SBP
is zero and state that age
has a (statistically
significant) effect on SBP
.
race
The effect of race
is represented by the effects of the
four dummy variables raceOther Hispanic
,
raceNon-Hispanic White
,
raceNon-Hispanic Black
, and raceother
. We can
see that R used dummy coding because the names of the different categories are
appended to the variable name. Moreover, dummy coding is the default for
unordered factors (and we know that race
is an unordered
factor from str(nhanes)
shown in the introduction to this
practical).
From the output of the summary, we cannot say whether
race
overall has a relevant contribution to explaining the
variance in SBP
. The regression coefficients correspond to
the difference in the expected value of SBP
when a
participant has a particular value of race
compared to the
reference category (and the same values for all other covariates).
We see that the p-values for the effects of
other Hispanic
, Non-Hispanic White
and
other
are very large. This means that effects of this size would be quite likely to be observed even if there were no actual difference in SBP between these groups and the reference category
(Mexican American
). But we see a coefficient of 5.330 for
Non-Hispanic Black
participants. The corresponding p-value
indicates that observing such an effect would be very unlikely if there
was no difference in SBP
between
Non-Hispanic Black
and Mexican American
.
We can conclude that Non-Hispanic Black
participants had
a (statistically significantly) higher SBP
than
Mexican American
participants and that there was no
evidence that participants of the other race categories differed from
Mexican American
participants regarding their
SBP
.
smoke
The variable smoke
is an ordered factor; therefore, R has used orthogonal polynomials to
represent the ordering of the categories in the estimated effects.
Interpretation of such effects is not straightforward. A visualization of orthogonal polynomials can be found on slides 21/22 of the section on Linear Regression in R.
Because smoke
is an ordered factor,
R has used orthogonal polynomial contrasts, but we prefer to use dummy coding for this variable, to
facilitate the interpretation of the effects.
Re-fit the linear regression model (lm1
) and specify
that dummy-coding should be used for the variable smoke
. Do
this via an argument to lm()
. Print the summary of this new
model.
If you are unsure which argument to use, or how to specify this, take
a look at the help (?lm
).
Remember: Contrasts were covered in the slides on Linear Regression in R.
The argument you are looking for is contrasts.
The help on the argument contrasts tells us to check the
description of the argument contrasts.arg
of the function
model.matrix.default
. There we read that we should provide
a list, whose entries are values (numeric matrices, functions or character strings naming functions) to be used as replacement values for the contrasts replacement function and whose names are the names of columns of data containing factors.
This means we need to provide a list with a named element, where
the name is the name of the variable for which we want to change the
contrast, and the element itself is the name of the contrast we want to
use.
Dummy coding corresponds to contr.treatment in R.
We need to specify the argument contrasts
by providing a
list
that has elements for each covariate for which we want
to specify the type of contrast.
In our case, this is only the variable smoke
, and we set
the contrast for smoke
to be contr.treatment
,
i.e., dummy coding:
lm2 <- lm(SBP ~ gender + age + race + smoke + albu, data = nhanes,
contrasts = list(smoke = "contr.treatment"))
summary(lm2)
##
## Call:
## lm(formula = SBP ~ gender + age + race + smoke + albu, data = nhanes,
## contrasts = list(smoke = "contr.treatment"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.379 -9.772 -1.071 8.194 95.891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95.43329 5.05898 18.864 < 2e-16 ***
## genderfemale -4.45617 0.69972 -6.368 2.30e-10 ***
## age 0.46631 0.01988 23.456 < 2e-16 ***
## raceOther Hispanic -0.21361 1.43025 -0.149 0.881
## raceNon-Hispanic White -0.12290 1.15504 -0.106 0.915
## raceNon-Hispanic Black 5.32980 1.23176 4.327 1.58e-05 ***
## raceother -0.26289 1.29833 -0.202 0.840
## smokeformer -0.06676 0.85987 -0.078 0.938
## smokecurrent 1.37452 0.87210 1.576 0.115
## albu 1.25940 1.07471 1.172 0.241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.78 on 2316 degrees of freedom
## (439 observations deleted due to missingness)
## Multiple R-squared: 0.2354, Adjusted R-squared: 0.2324
## F-statistic: 79.2 on 9 and 2316 DF, p-value: < 2.2e-16
We could also have used the function update()
to re-fit
the model.
When using update()
, we specify the object that we want
to update, i.e., lm1
, and provide new versions of those
arguments that need to be changed, i.e., contrasts
:
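lm2 <- update(lm1, contrasts = list(smoke = "contr.treatment"))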
In the following, we will inspect the different parts of the model summary and re-create them “by hand” to get a better understanding of what they are and how they are calculated.
We will compare them to the values calculated by lm()
(or summary()
) to double-check that our own calculations
are correct.
Before we can start to calculate anything, we should investigate whether any values are missing in our data, because missing values complicate the calculations: they are automatically omitted in some steps, which leads to objects (e.g., vectors or matrices) having different dimensions than we expect, which in turn can result in errors.
Investigate if there are any missing values in the variables used in the model.
summary()
of the data tells us the number of missing
values per variable. Alternatively, you could also use the function
is.na()
.
There are multiple ways to check for missing values.
The number of cases with missing values is displayed in the
summary()
of the data:
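summary(nhanes)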
## SBP gender age race WC alc educ creat
## Min. : 81.33 male :1381 Min. :18.00 Mexican American : 282 Min. : 56.20 <1 :1149 low :1192 Min. :0.3400
## 1st Qu.:110.00 female:1384 1st Qu.:31.00 Other Hispanic : 294 1st Qu.: 85.40 >=1 : 897 high:1433 1st Qu.:0.7000
## Median :120.00 Median :47.00 Non-Hispanic White:1022 Median : 96.20 NA's: 719 NA's: 140 Median :0.8400
## Mean :122.59 Mean :47.53 Non-Hispanic Black: 688 Mean : 97.56 Mean :0.8911
## 3rd Qu.:131.33 3rd Qu.:63.00 other : 479 3rd Qu.:107.10 3rd Qu.:0.9900
## Max. :234.67 Max. :80.00 Max. :176.00 Max. :9.5100
## NA's :127 NA's :150 NA's :214
## albu uricacid bili occup smoke
## Min. :2.600 Min. : 0.400 Min. :0.1000 working :1375 never :1505
## 1st Qu.:4.000 1st Qu.: 4.500 1st Qu.:0.6000 looking for work: 137 former : 594
## Median :4.300 Median : 5.400 Median :0.7000 not working :1217 current: 524
## Mean :4.263 Mean : 5.476 Mean :0.7382 NA's : 36 NA's : 142
## 3rd Qu.:4.500 3rd Qu.: 6.400 3rd Qu.:0.9000
## Max. :5.400 Max. :11.300 Max. :4.1000
## NA's :214 NA's :215 NA's :218
Showing the summary of the whole dataset can be inconvenient when the data contain many variables and we are only interested in a few of them.
We could then focus on only those variables that we have used in the model, i.e.,
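summary(nhanes[, c("SBP", "gender", "age", "race", "smoke", "albu")])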
## SBP gender age race smoke albu
## Min. : 81.33 male :1381 Min. :18.00 Mexican American : 282 never :1505 Min. :2.600
## 1st Qu.:110.00 female:1384 1st Qu.:31.00 Other Hispanic : 294 former : 594 1st Qu.:4.000
## Median :120.00 Median :47.00 Non-Hispanic White:1022 current: 524 Median :4.300
## Mean :122.59 Mean :47.53 Non-Hispanic Black: 688 NA's : 142 Mean :4.263
## 3rd Qu.:131.33 3rd Qu.:63.00 other : 479 3rd Qu.:4.500
## Max. :234.67 Max. :80.00 Max. :5.400
## NA's :127 NA's :214
An even more convenient way to obtain the names of all the variables used in the model is to extract them from the model formula.
We can obtain the model formula as
## SBP ~ gender + age + race + smoke + albu
and we can use the function all.vars()
to extract all
variable names from a formula
object, i.e.,
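all.vars(formula(lm2))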
## [1] "SBP" "gender" "age" "race" "smoke" "albu"
Using this, we can get the same summary, without having to type all the separate variable names:
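summary(nhanes[, all.vars(formula(lm2))])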
## SBP gender age race smoke albu
## Min. : 81.33 male :1381 Min. :18.00 Mexican American : 282 never :1505 Min. :2.600
## 1st Qu.:110.00 female:1384 1st Qu.:31.00 Other Hispanic : 294 former : 594 1st Qu.:4.000
## Median :120.00 Median :47.00 Non-Hispanic White:1022 current: 524 Median :4.300
## Mean :122.59 Mean :47.53 Non-Hispanic Black: 688 NA's : 142 Mean :4.263
## 3rd Qu.:131.33 3rd Qu.:63.00 other : 479 3rd Qu.:4.500
## Max. :234.67 Max. :80.00 Max. :5.400
## NA's :127 NA's :214
Using all.vars()
and formula()
can save us
some work when we have models with many variables and prevents us from
making typos in the variable names (which would result in an error when
the name we typed does not exist in the data).
Moreover, when, for some reason, we change the formula used in
lm2
(e.g., to include an additional variable) this syntax
will still give us the variables used in the model. If we had created
the vector of variable names by typing them, we would also have to
change this vector every time we change the model formula.
We can also count missing values per variable, using the functions
is.na()
and colSums()
.
The function is.na()
returns the value TRUE
if an element is NA
and FALSE
otherwise.
colSums()
sums up values per column.
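Combining the two for the variables in our model, for example:
colSums(is.na(nhanes[, all.vars(formula(lm2))]))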
## SBP gender age race smoke albu
## 127 0 0 0 142 214
is.na(<...>)
gives us a matrix of
TRUE
and FALSE
with the same dimension as the
input data, indicating for each value whether it is missing or not.
(Shown here are only rows 21 - 29.)
## SBP gender age race smoke albu
## [...]
## 21 FALSE FALSE FALSE FALSE FALSE FALSE
## 22 FALSE FALSE FALSE FALSE TRUE FALSE
## 23 FALSE FALSE FALSE FALSE FALSE FALSE
## 24 FALSE FALSE FALSE FALSE FALSE FALSE
## 25 FALSE FALSE FALSE FALSE FALSE FALSE
## 26 FALSE FALSE FALSE FALSE FALSE FALSE
## 27 FALSE FALSE FALSE FALSE FALSE FALSE
## 28 FALSE FALSE FALSE FALSE FALSE FALSE
## 29 FALSE FALSE FALSE FALSE TRUE TRUE
## [...]
When we calculate the sum over a logical
vector, the value
TRUE
is counted as 1
and FALSE
is
counted as 0
.
To sum up (missing) values per row we can use the
corresponding function rowSums()
.
is.na()
, rowSums()
and
table()
could come in handy.
We can calculate the number of missing values per subject using almost the same syntax as before (where we counted missing values per variable), but now we count them per case, i.e., per row in our data.
Here, it is extremely important that we only select the columns of
the nhanes
data that are used in the model!
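For example:
table(rowSums(is.na(nhanes[, all.vars(formula(lm2))])))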
##
## 0 1 2
## 2326 395 44
rowSums(is.na(nhanes[, c("SBP", ..., "albu")]))
returns
a vector of integers, with one entry for each row in the data.
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 [...]
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 [...]
We can use table()
to then count how many rows there are
with 0, 1, 2, … missing values.
There are 439 observations with missing values in the variables used in our model.
Because there are nrow(nhanes)
= 2765 rows in the data,
we have 2326 observations with complete data.
Find the information about the number of incomplete cases in the
output of the summary()
for our model
(lm2
).
Information on the number of cases that were excluded due to
missingness is given in the last part of the output of the
summary()
:
## [...]
## Residual standard error: 15.78 on 2316 degrees of freedom
## (439 observations deleted due to missingness)
## Multiple R-squared: 0.2354, Adjusted R-squared: 0.2324
## F-statistic: 79.2 on 9 and 2316 DF, p-value: < 2.2e-16
It tells us that 439 observations were deleted due to missingness.
In this part, we will calculate the least squares estimates for the
regression coefficients from the data, and compare them with the values
that we have obtained using lm()
.
Remember (or look up) the formula for the least squares estimator (in matrix notation).
The formula can be found on slide 3 of the slides on the Least Squares Estimator:
\[\boldsymbol{\hat\beta} = \left(\mathbf X^\top\mathbf X\right)^{-1}\mathbf X^\top\mathbf y\]
The little \(^\top\) indicates that a vector or a matrix is transposed (see slide 17 of Multiple Linear Regression).
The superscript \(^{-1}\) refers to the inverse of the matrix.
In R, the inverse of a matrix is
calculated using the function solve()
and the function
t()
transposes a vector or a matrix.
The next step is to create the design matrix, X
, and the
vector of the response values, y
.
X
has to be a matrix
containing only
numeric values. It should have a column for the intercept and
categorical variables have to be coded appropriately.
y
should be a column vector, i.e., a matrix with a
single column.
To create a design matrix, you can use the function
model.matrix()
. See ?model.matrix for details.
A model.matrix()
can be created from different types of
objects. Here, we use a formula
as the object
(and the nhanes
data as data
).
In creating the design matrix / model matrix, R automatically applies contrasts, e.g., converts categorical variables into the corresponding dummy variables.
To use contrasts types other than the default, we need to specify the
argument contrasts.arg
(the same way we specified
contrasts
in lm()
).
Create the two objects X and y.
Create the design matrix X:
X <- model.matrix(SBP ~ gender + age + race + smoke + albu,
data = nhanes,
contrasts.arg = list(smoke = "contr.treatment"))
To get dummy variables for smoke
instead of orthogonal
polynomials (default for ordered factors), we need to specify the
contrasts.arg
argument.
You can extract the formula from a fitted model object using the
function formula()
, e.g.:
## SBP ~ gender + age + race + smoke + albu
So, instead of typing the formula from lm2
again, we
could also create the design matrix X
like this:
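X <- model.matrix(formula(lm2), data = nhanes,
                  contrasts.arg = list(smoke = "contr.treatment"))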
Create the response y
:
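y <- nhanes$SBP   # the response as a numeric vector (a one-column matrix would also work)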
Check their class and dimensions:
## num [1:2326, 1:10] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:2326] "1" "2" "3" "4" ...
## ..$ : chr [1:10] "(Intercept)" "genderfemale" "age" "raceOther Hispanic" ...
## - attr(*, "assign")= int [1:10] 0 1 2 3 3 3 3 4 4 5
## - attr(*, "contrasts")=List of 3
## ..$ gender: chr "contr.treatment"
## ..$ race : chr "contr.treatment"
## ..$ smoke : chr "contr.treatment"
## num [1:2765] 111 118 125 102 147 ...
X
is a numeric matrix with 2326 rows and 10 columns and
y
is a numeric vector of length 2765.
We can also confirm that X is indeed a matrix by checking its class:
## [1] "matrix" "array"
or by testing whether it is a matrix with is.matrix():
## [1] TRUE
## (Intercept) genderfemale age raceOther Hispanic raceNon-Hispanic White raceNon-Hispanic Black raceother smokeformer smokecurrent albu
## 1 1 0 22 0 1 0 0 0 0 4.8
## 2 1 1 44 0 1 0 0 0 0 3.7
## 3 1 0 21 0 0 0 1 0 0 4.4
## 4 1 1 43 0 0 1 0 0 1 4.4
## 5 1 0 51 0 0 0 1 0 1 4.1
## 6 1 0 80 0 1 0 0 0 0 4.2
## [1] 110.6667 118.0000 124.6667 102.0000 146.6667 122.0000
The structure of X
and y
looks OK, but
there are 2765 observations in y
, i.e., y
includes the response for subjects with missing covariate values, but
X
only has 2326 rows, i.e.,
R has automatically excluded incomplete
observations from X
.
To be able to estimate the parameters, X
and
y
need to have matching dimensions. We need to exclude the
observations in y
that belong to incomplete cases.
We can do this with the help of the function
complete.cases()
.
The function complete.cases()
returns a
logical
vector specifying which rows (of a
matrix
or data.frame
) are completely observed
(TRUE
) and which have missing values
(FALSE
).
As before, we need to select the relevant columns of the
nhanes
data to make sure that only those rows in which
there is a missing value in these variables are considered incomplete,
e.g., use
nhanes[, c("SBP", "gender", "age", "race", "smoke", "albu")]
:
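cc <- complete.cases(nhanes[, c("SBP", "gender", "age", "race", "smoke", "albu")])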
We can now use the logical
vector cc
to
select rows in y
.
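For example:
y <- y[cc]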
Using the formula of the least squares estimates from Task 1, and the two objects created in Task 2, calculate the parameter estimates.
%*%
.
t()
.
solve()
.
Calculating the least squares estimates of the regression
coefficients using the formula \[\boldsymbol{\hat\beta} = \left(\mathbf
X^\top\mathbf X\right)^{-1}\mathbf X^\top\mathbf y\] in
R, we use t()
to transpose
a matrix ( \(^\top\) in the
mathematical formula) and use solve()
instead of the \(^{-1}\) to calculate the inverse:
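betas <- solve(t(X) %*% X) %*% t(X) %*% y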
The result is a column vector with the parameter estimates:
## [,1]
## (Intercept) 95.43328823
## genderfemale -4.45617279
## age 0.46631424
## raceOther Hispanic -0.21360840
## raceNon-Hispanic White -0.12290498
## raceNon-Hispanic Black 5.32980363
## raceother -0.26288646
## smokeformer -0.06675802
## smokecurrent 1.37452263
## albu 1.25940379
If you get an error message like this:
## Error in solve(t(X) %*% X) %*% t(X) %*% y: non-conformable arguments
this means that the dimensions of y
and X
do not match. Check the dimensions of both objects (dim(X)
,
length(y)
).
The number of rows in X must be the same as the length of y, and y has to be either a plain vector or a column vector (a matrix with one column), not a row vector (a matrix with one row).
Compare the calculated parameter estimates betas
with
the estimates calculated by lm()
above
(lm2
).
coef()
to extract the vector of
regression coefficients from a fitted model object.
We can compare the values in two vectors in different ways. The easiest (with regards to the syntax) is just to print both of them separately.
But the comparison is a bit more convenient when we print the two
vectors of parameter estimates next to each other, i.e., by combining
them with the function cbind()
:
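cbind(betas, coef(lm2))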
## [,1] [,2]
## (Intercept) 95.43328823 95.43328823
## genderfemale -4.45617279 -4.45617279
## age 0.46631424 0.46631424
## raceOther Hispanic -0.21360840 -0.21360840
## raceNon-Hispanic White -0.12290498 -0.12290498
## raceNon-Hispanic Black 5.32980363 5.32980363
## raceother -0.26288646 -0.26288646
## smokeformer -0.06675802 -0.06675802
## smokecurrent 1.37452263 1.37452263
## albu 1.25940379 1.25940379
The parameter estimates are the same!
Comparing vectors by looking at them and checking if all the numbers are the same works for small vectors, but is not really feasible when we have longer vectors, or other types of objects.
You will see other options in the solutions of some other tasks.
Since we need the residuals to calculate the standard errors of the regression coefficients and the residual standard deviation, we continue with calculating the residuals.
Calculate the residuals and obtain the 5-point summary (minimum, 1st quartile, median, 3rd quartile, maximum).
quantile()
may come in handy.
The residuals can be calculated as the difference between the observed response \(\mathbf y\) and the fitted values \(\mathbf{\hat y} = \mathbf X\boldsymbol{\hat\beta}\), i.e., \[\boldsymbol{\hat\varepsilon} = \mathbf y - \mathbf X\boldsymbol{\hat\beta}.\]
We have already created the vectors y
and X
containing the response and the design matrix in the previous part of
this practical, and have calculated the vector of parameter estimates
betas
.
Using matrix multiplication (%*%
) we can then calculate
the residuals, and we will call them r
here:
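r <- y - X %*% betas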
The 5-point summary consisting of the minimum, 1st
quartile, median, 3rd quartile, maximum can be obtained very
easily using the function quantile()
.
By default, the function quantile()
returns the 0, 0.25,
0.5, 0.75 and 1 quantile of a vector of numeric values.
How can we know that by default the function quantile()
returns the 0, 0.25, 0.5, 0.75 and 1 quantiles?
We have a look at the help file ?quantile
.
In the Usage section we see written
The values specified for each argument here are the default values,
i.e., seq(0, 1, 0.25)
is the default for the argument
probs
.
And seq(0, 1, 0.25)
is a sequence from 0 to 1 in steps
of 0.25, and, hence, returns
## [1] 0.00 0.25 0.50 0.75 1.00
The minimum and maximum correspond, of course, to the 0 and 1
quantile, and the median to the 0.5 quantile, and so the default values
of quantile()
are exactly what we need here.
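For our residuals:
quantile(r)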
## 0% 25% 50% 75% 100%
## -54.378987 -9.771811 -1.070625 8.193726 95.891053
Compare these values with the residual summary shown in the summary() for lm2.
Compare the residuals you calculated (r) with the residuals returned by the model lm2.
The 5-point summary of the residuals is given in the printed model summary before the table of coefficients:
##
## Call:
## lm(formula = SBP ~ gender + age + race + smoke + albu, data = nhanes,
## contrasts = list(smoke = "contr.treatment"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.379 -9.772 -1.071 8.194 95.891
##
## Coefficients:
## [...]
These values coincide with the values we obtained in the previous step. This is already a good indication that we calculated the residuals correctly. (But it is no proof.)
To compare the full vector of residuals we first need to extract them
from the fitted model object lm2
. This can be done in two
ways:
lm2$residuals
residuals(lm2)
Because we have 2326 residuals, comparing them by just printing the values next to each other is not very efficient.
Alternatively, we can check if the differences between the two vectors are (close to) zero. Because there are still 2326 of these differences, we just check the summary of the differences:
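summary(r - lm2$residuals)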
## V1
## Min. :-1.306e-11
## 1st Qu.:-8.974e-12
## Median :-8.583e-12
## Mean :-8.592e-12
## 3rd Qu.:-8.181e-12
## Max. :-5.697e-12
The very small differences between the values we calculated and the values from the fitted model occur due to differences in the implemented mathematical operations that are used internally.
There are two functions in R that
can help us with comparing objects: all.equal()
and
identical()
.
In case the two objects are not the same all.equal()
will give us some information, but identical()
does
not:
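identical(r, lm2$residuals)
all.equal(r, lm2$residuals)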
## [1] FALSE
## [1] "names for current but not for target" "Attributes: < names for target but not for current >"
## [3] "Attributes: < Length mismatch: comparison on first 0 components >" "target is matrix, current is numeric"
There seem to be some differences in the names of the two objects,
and one is a matrix (r
, because we calculated it via matrix
multiplication) and one is numeric.
We need to turn r
into a vector using
c(r)
:
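identical(c(r), lm2$residuals)
all.equal(c(r), lm2$residuals)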
## [1] FALSE
## [1] "names for current but not for target"
but that does not solve (all of) the issues with the names.
Names are an “attribute” of an object, and we can see all the
attributes that an object has using the function
attributes()
:
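attributes(c(r))
attributes(lm2$residuals)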
## NULL
## $names
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19"
## [20] "20" "21" "23" "24" "25" "26" "27" "28" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "41"
## [....]
We see that c(r)
does not have any attributes, and
lm2$residuals
has an attribute called names
,
which contains just the indices of the observations.
The function all.equal()
has an argument that allows us
to specify that attributes should not be compared:
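all.equal(c(r), lm2$residuals, check.attributes = FALSE)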
## [1] TRUE
identical()
does not have this argument. But we can
remove attributes by setting them to NULL
:
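r_lm2 <- lm2$residuals
attributes(r_lm2) <- NULL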
When we now compare r_lm2
and the original
lm2$residuals
we can see that the names have
disappeared:
## 1 2 3 4 5 6
## -0.9477681 1.9681688 14.1622891 -21.2743309 21.1761603 -15.9050187
## [1] -0.9477681 1.9681688 14.1622891 -21.2743309 21.1761603 -15.9050187
## [1] FALSE
But identical() still does not consider the two vectors to be identical. This is not really surprising, because they are not identical. We saw previously that the differences between the values are not equal to zero.
The function identical()
really requires vectors to be
identical, but all.equal()
tests if the two vectors are
almost equal. It has an argument
tolerance
which is, by default, equal to the square root of
the machine epsilon. The machine epsilon is the smallest positive
floating-point number x
such that x + 1
is not
identical to 1
.
## [1] 2.220446e-16
## [1] 1.490116e-08
So differences that are smaller than \(1.4901161\times 10^{-8}\) are not
considered to be differences when using all.equal()
but
they are considered when using identical
.
all.equal()
.
Using the residuals, we can now calculate the estimate for the residual variance.
Look up the formula for the estimator for the residual variance.
The formula for the residual variance is given in the slides on the Least Squares Estimator (slide 6):
\[\hat\sigma^2 = \frac{1}{n - p - 1} \boldsymbol{\hat\varepsilon}^\top \boldsymbol{\hat\varepsilon}\]
Calculate the residual standard error and compare it with the value shown in summary(lm2).
Which additional piece of information is reported in summary(lm2) together with the residual standard error?
The formula for the estimator of the residual variance is \[\hat\sigma^2 = \frac{1}{n - p - 1} \boldsymbol{\hat\varepsilon}^\top \boldsymbol{\hat\varepsilon}.\]
Here, \(n\) is the number of observations, i.e., 2326, and \(p\) is the number of regression coefficients excluding the intercept, i.e., 9.
The residual standard error is shown in the lower part of the model summary:
## [...]
##
## Residual standard error: 15.78 on 2316 degrees of freedom
## (439 observations deleted due to missingness)
## Multiple R-squared: 0.2354, Adjusted R-squared: 0.2324
## F-statistic: 79.2 on 9 and 2316 DF, p-value: < 2.2e-16
The standard error is the square root of the variance of the residuals, i.e.,
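sig2 <- c(t(r) %*% r) / (nrow(X) - ncol(X))  # residual variance: RSS / (n - p - 1)
sqrt(sig2)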
## [1] 15.78038
The degrees of freedom are the number of observations 2326 minus the number of regression coefficients (9 + 1 = 10), i.e., 2316.
We can obtain them from the fitted model object as
lm2$df.residual
.
## [1] 2316
Look up the formula for the variance-covariance matrix of the parameter estimates.
The formula for the variance-covariance matrix of the parameter estimates \(\boldsymbol{\hat\beta}\) is given in the slides on the Least Squares Estimator (slide 7):
\[\mathrm{cov(\boldsymbol{\hat\beta})} = \sigma^2 (\mathbf X^\top \mathbf X)^{-1}\]
Calculate the variance-covariance matrix of the parameter estimates
and compare it with the variance-covariance matrix that you can obtain
from the fitted model (lm2
) using the function
vcov()
.
Use the estimated residual variance (sig2) to replace \(\sigma^2\) in the formula for the variance of \(\boldsymbol{\hat\beta}\).
Calculate the variance-covariance matrix:
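V <- sig2 * solve(t(X) %*% X)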
V
is a symmetric matrix with 10 rows and columns, that
has the variances of the regression coefficients on the diagonal, and
the covariances as off-diagonal elements.
We can compare it to the matrix derived using vcov()
using differences:
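summary(c(V - vcov(lm2)))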
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.983e-12 -4.381e-15 5.000e-18 7.304e-14 3.927e-15 1.343e-11
Here, we compare the values of the two matrices by looking at their
differences. The function c()
is used to convert the
matrices to vectors in order to get one overall summary, and not a
separate summary per column of the matrix.
Calculate the standard errors of the regression coefficients and compare them with the values given in the model summary.
Use the function diag() to extract the diagonal elements from a matrix.
The standard errors are the square root of the variances, and the
variances are the elements on the diagonal of the variance-covariance
matrix V
. Hence, we can obtain the standard errors as:
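se <- sqrt(diag(V))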
In the model summary, the standard errors are part of the “coefficients” table. This table can be obtained using the following syntax:
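summary(lm2)$coefficients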
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95.43328823 5.05897617 18.8641506 5.939995e-74
## genderfemale -4.45617279 0.69972411 -6.3684711 2.295145e-10
## age 0.46631424 0.01988055 23.4558012 2.443005e-109
## raceOther Hispanic -0.21360840 1.43025364 -0.1493500 8.812905e-01
## raceNon-Hispanic White -0.12290498 1.15504476 -0.1064071 9.152686e-01
## raceNon-Hispanic Black 5.32980363 1.23175526 4.3269989 1.575735e-05
## raceother -0.26288646 1.29833254 -0.2024801 8.395592e-01
## smokeformer -0.06675802 0.85986713 -0.0776376 9.381230e-01
## smokecurrent 1.37452263 0.87209759 1.5761110 1.151368e-01
## albu 1.25940379 1.07471405 1.1718501 2.413778e-01
This “table” of coefficients is actually a matrix
object
and so we can compare the standard errors either by printing them next
to each other
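For example:
cbind(summary(lm2)$coefficients[, "Std. Error"], se = se)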
## se
## (Intercept) 5.05897617 5.05897617
## genderfemale 0.69972411 0.69972411
## age 0.01988055 0.01988055
## raceOther Hispanic 1.43025364 1.43025364
## raceNon-Hispanic White 1.15504476 1.15504476
## raceNon-Hispanic Black 1.23175526 1.23175526
## raceother 1.29833254 1.29833254
## smokeformer 0.85986713 0.85986713
## smokecurrent 0.87209759 0.87209759
## albu 1.07471405 1.07471405
or again using the differences
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.328e-12 -4.025e-15 -3.220e-15 -1.652e-13 -6.774e-16 2.220e-16
The test statistic is \[\frac{\boldsymbol{\hat\beta} - \boldsymbol \beta}{\widehat{\mathrm{se}}(\boldsymbol{\hat\beta)}},\] where \(\boldsymbol{\hat\beta}\) is the vector of parameter estimates, \(\boldsymbol \beta\) is the vector of values of the regression coefficients assumed under the null-hypothesis, and \(\widehat{\mathrm{se}}(\boldsymbol{\hat\beta)}\) are the estimated standard errors for the parameter estimates.
In the slides on Hypothesis Tests (slide 6), the test statistic was given for a single regression coefficient \(\beta_j\):
\[T_j = \frac{\hat\beta_j - \beta_{0j}}{\hat\sigma_j},\qquad j = 0,\ldots,p\]
In the solution above it is written for the whole vector of regression coefficients \[\boldsymbol\beta = \begin{pmatrix}\beta_0\\\beta_1\\\vdots\\\beta_p\end{pmatrix}.\] When the model assumptions are fulfilled, this test statistic follows a \(t\)-distribution, i.e., we are performing a \(t\)-test.
In regression models, we typically perform a 2-sided \(t\)-test with null hypothesis \(\boldsymbol\beta = \mathbf 0\) and alternative hypothesis \(\boldsymbol\beta \neq \mathbf 0\).
Calculate the test statistics for all regression coefficients and compare them to the corresponding values in the model summary.
Because our null hypothesis is that the regression coefficients are equal to zero, we can omit subtracting \(\boldsymbol\beta\) from the formula and only need to divide the parameter estimates by their standard error:
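t_stats <- betas / se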
As before, we can compare the values by printing them next to each other
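For example:
cbind(t_stats, summary(lm2)$coefficients[, "t value"])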
## [,1] [,2]
## (Intercept) 18.8641506 18.8641506
## genderfemale -6.3684711 -6.3684711
## age 23.4558012 23.4558012
## raceOther Hispanic -0.1493500 -0.1493500
## raceNon-Hispanic White -0.1064071 -0.1064071
## raceNon-Hispanic Black 4.3269989 4.3269989
## raceother -0.2024801 -0.2024801
## smokeformer -0.0776376 -0.0776376
## smokecurrent 1.5761110 1.5761110
## albu 1.1718501 1.1718501
or by checking if their differences are (close to) zero:
## V1
## Min. :-1.162e-13
## 1st Qu.:-8.108e-14
## Median : 1.643e-13
## Mean : 4.884e-13
## 3rd Qu.: 6.328e-13
## Max. : 2.145e-12
The formula for the p-value for a 2-sided test was given in the slides on Hypothesis tests (slide 10): \[p = 2\min\{\Pr(t\leq T\mid H_0),\, \Pr(t \geq T\mid H_0)\}\] The p-value is the probability to observe a test statistic / effect / regression coefficient at least as extreme as the observed test statistic / effect / regression coefficient if the null hypothesis is true (where extreme means into the direction of the alternative hypothesis).
Under the null-hypothesis, the test statistic follows a \(t\)-distribution with \(n - p - 1\) degrees of freedom (i.e., the
residual degrees of freedom). We have already calculated the number of
degrees of freedom earlier, for example, as
nrow(X) - ncol(X)
= 2316.
Calculate the p-values for the regression coefficients from
lm2
.
The function pt()
calculates the distribution function
for the \(t\)-distribution. Check the
help (?pt
) for information on how to use this function.
pmin()
can be used to determine the “parallel”
minima of two (or more) vectors. For example,
pmin(c(1, 2, 3, 4), c(1.2, 0.9, 3, 3.3))
returns a vector
c(1, 0.9, 3, 3.3)
The residual degrees of freedom are:
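resid_df <- nrow(X) - ncol(X)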
Remember, you can also obtain the residual degrees of freedom directly from the fitted model:
## [1] 2316
For each regression coefficient, we need to find the minimum of \(\Pr(t\leq T\mid H_0)\) and \(\Pr(t \geq T\mid H_0)\), and multiply this minimum with 2:
p_vals <- 2 * pmin(
pt(t_stats, df = resid_df, lower.tail = FALSE),
pt(t_stats, df = resid_df, lower.tail = TRUE)
)
The function pt()
calculates the distribution function
of the \(t\)-distribution and has the
arguments:
q: a vector of quantiles
df: the (residual) degrees of freedom
ncp: a non-centrality parameter (which we don’t need here)
lower.tail = TRUE: an indicator specifying if the lower or upper tail of the distribution should be returned
log.p = FALSE: an indicator specifying if the returned probabilities should be log-transformed (which we don’t need here)
For the formula of the p-value, we need both \(\Pr(t\leq T\mid H_0)\) and \(\Pr(t \geq T\mid H_0)\), i.e., the area under the density function that is left (lower tail) or right (upper tail) of the observed test statistic.
For example:
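pt(t_stats, df = resid_df, lower.tail = TRUE)   # Pr(t <= T) for each coefficient
pt(t_stats, df = resid_df, lower.tail = FALSE)  # Pr(t >= T) for each coefficient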
The function pmin()
selects the smaller value from each
of the pairs of probabilities.
Compare the p-values you just calculated with the p-values from the
output of summary(lm2)
.
The p-values are also part of the “coefficients table” returned by
the summary()
of lm2
. We can thus work with
them by selecting the column with name "Pr(>|t|)"
from
this matrix:
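summary(p_vals - summary(lm2)$coefficients[, "Pr(>|t|)"])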
## V1
## Min. :-8.612e-13
## 1st Qu.:-7.053e-14
## Median :-3.554e-14
## Mean :-1.162e-13
## 3rd Qu.:-4.700e-18
## Max. : 0.000e+00
With every step we make in the calculation, the small differences we
observed between the numbers derived from summary(lm2)
and
the numbers we calculated ourselves become larger.
Especially when the numbers themselves are relatively small, as may be the case for p-values, it could be difficult to judge if the observed differences are indeed just due to the numeric differences in the calculation or if we made a mistake in our own calculation.
To check that we “implemented” the formula correctly, we should use
the values from summary(lm2)
as input. For example:
Extract the values of the test statistics from the
summary()
output:
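t_stats_lm <- summary(lm2)$coefficients[, "t value"]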
Calculate the p-values (i.e., use “our” formula/implementation but
the input values calculated by lm()
):
p_vals_lm <- 2 * pmin(
pt(abs(t_stats_lm), df = resid_df, lower.tail = FALSE),
pt(abs(t_stats_lm), df = resid_df, lower.tail = TRUE)
)
Compare with the p-values from the summary()
output:
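summary(p_vals_lm - summary(lm2)$coefficients[, "Pr(>|t|)"])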
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
We now get the exact same numbers, indicating that our syntax to
calculate the p-values is correct.
The output of the summary()
function does not contain
the confidence intervals for the regression coefficients. We can,
however, obtain them using the function confint()
.
Read the help for the function confint()
(?confint
) to understand which parameters you need to
provide and which options you have.
The function confint()
takes a fitted model object (such
as lm2
) as its main argument (object
). The
additional argument parm
allows us to specify for which
parameters we want to obtain the confidence interval. We could specify
either a vector with numbers referring to the index of the vector of
regression coefficients, or the names of the parameters (i.e., what is
returned by names(coef(lm2))
).
The confidence level is specified via the argument
level
.
In the Details section we can read that
confint
is a “generic function”. This means that it can be
applied to different types of objects, and
R will look for a specific “method” for
the type of object.
We can also read that in the default version of
confint()
normality is assumed (i.e., the regression
coefficients are assumed to be approximately normally distributed).
The Details section also tells us that the fitted
model object must be of a type for which coef()
and
vcov()
return the parameter estimates and
variance-covariance matrix.
Calculate the 90% confidence intervals for the regression parameters in lm2, once using confint() and once by hand (using the formula from the lecture slides), and compare them.
For the calculation by hand you will need the function qt().
Using confint()
(we need to change the argument
level
to get 90% CIs instead of the default 95% CIs):
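confint(lm2, level = 0.9)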
## 5 % 95 %
## (Intercept) 87.10868312 103.7578933
## genderfemale -5.60757709 -3.3047685
## age 0.43360056 0.4990279
## raceOther Hispanic -2.56710767 2.1398909
## raceNon-Hispanic White -2.02354479 1.7777348
## raceNon-Hispanic Black 3.30293579 7.3566715
## raceother -2.39930801 1.8735351
## smokeformer -1.48167954 1.3481635
## smokecurrent -0.06052427 2.8095695
## albu -0.50905088 3.0278585
The formula for the confidence interval was given in the slides on Hypothesis Tests: \[\hat\beta_j \pm t_{1-\alpha/2}(n-p-1)\, \widehat{\mathrm{se}}(\hat\beta_j)\] Since we want 90% confidence intervals, \(\alpha\) is 0.1, which means \(1-\alpha/2\) is 0.95, and therefore we use \(t_{0.95}(n-p-1)\) instead of \(t_{0.975}(n-p-1)\).
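For example, collecting the lower and upper bounds in a matrix (here called CI):
CI <- cbind(betas - qt(0.95, df = resid_df) * se,
            betas + qt(0.95, df = resid_df) * se)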
Compare them
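For example, placing the two lower bounds and the two upper bounds next to each other:
cbind(confint(lm2, level = 0.9), CI)[, c(1, 3, 2, 4)]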
## 5 % 95 %
## (Intercept) 87.10868312 87.10868312 103.7578933 103.7578933
## genderfemale -5.60757709 -5.60757709 -3.3047685 -3.3047685
## age 0.43360056 0.43360056 0.4990279 0.4990279
## raceOther Hispanic -2.56710767 -2.56710767 2.1398909 2.1398909
## raceNon-Hispanic White -2.02354479 -2.02354479 1.7777348 1.7777348
## raceNon-Hispanic Black 3.30293579 3.30293579 7.3566715 7.3566715
## raceother -2.39930801 -2.39930801 1.8735351 1.8735351
## smokeformer -1.48167954 -1.48167954 1.3481635 1.3481635
## smokecurrent -0.06052427 -0.06052427 2.8095695 2.8095695
## albu -0.50905088 -0.50905088 3.0278585 3.0278585
The formulas for the coefficient of determination and the adjusted coefficient of determination are given on pages 16 and 17 of the Slides on Hypothesis Tests: \[R^2 = \frac{\text{ESS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}\qquad\text{and}\qquad R_{adj}^2 = 1 - \frac{n-1}{n-p-1}(1 - R^2),\] where \(\text{RSS}\) is the residual sum of squares, \(\text{ESS}\) is the explained variation (“explained sum of squares”) and \(\text{TSS}\) the total variation (“total sum of squares”).
The residual sum of squares is calculated as the sum over the squared residuals \[\text{RSS} = \boldsymbol\varepsilon^\top\boldsymbol\varepsilon = \sum_{i = 1}^n \varepsilon_i^2,\] the variation explained by the model is calculated as the sum over the squared deviations of the fitted values from the mean response, \[\text{ESS} = (\mathbf{\hat y} - \bar y)^\top (\mathbf{\hat y} - \bar y) = \sum_{i = 1}^n (\hat y_i - \bar y)^2,\] and the total variation is calculated as the sum over the squared differences between the response and the mean response, \[\text{TSS} = (\mathbf y - \bar y)^\top (\mathbf y - \bar y) = \sum_{i = 1}^n (y_i - \bar y)^2,\] where \(\bar y = \displaystyle\frac{1}{n}\sum_{i = 1}^n y_i\).
Calculate the coefficient of determination and the adjusted
coefficient of determination and compare them to the values shown in the
output of summary(lm2)
.
The calculation involves four steps: the residual sum of squares, the total sum of squares, the coefficient of determination, and the adjusted coefficient of determination (see the sketch below).
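One possible implementation, using the objects X, y and r created earlier (the names RSS, TSS, R2 and R2adj are our own choice):
RSS <- c(t(r) %*% r)                    # residual sum of squares
TSS <- sum((y - mean(y))^2)             # total sum of squares
R2 <- 1 - RSS / TSS                     # coefficient of determination
R2adj <- 1 - (nrow(X) - 1) / (nrow(X) - ncol(X)) * (1 - R2)  # adjusted version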
In the model summary()
, the coefficient of determination
is shown in the last part of the output:
## [...]
## Residual standard error: 15.78 on 2316 degrees of freedom
## (439 observations deleted due to missingness)
## Multiple R-squared: 0.2354, Adjusted R-squared: 0.2324
## F-statistic: 79.2 on 9 and 2316 DF, p-value: < 2.2e-16
Our own calculated values are
## [1] 0.2353521
## [1] 0.2323807
and, thus, the same as the values shown in the model summary.
The overall F-test has the null hypothesis that all regression coefficients (except for the intercept) are equal to zero, and the alternative hypothesis that at least one of these coefficients is different from zero. This means, that we are testing whether the model explains any of the variation in the response, i.e., if it is “better” than using just the average response.
The test statistic for the F-test is calculated as \[F = \frac{\text{ESS}}{\text{RSS}}\frac{n - p - 1}{p}\] and, under the null-hypothesis, follows a \(F\)-distribution with \(p\) and \(n-p-1\) degrees of freedom (see the slides on Hypothesis Tests (slide 14)).
The p-value is the probability under the null-hypothesis to observe a test statistic that is at least as large as the one observed in the data.
Analogous to how we calculated the p-value for the \(t\)-test for the regression coefficients, we can determine this probability as the area under the density of the \(F\)-distribution in the part that is larger than the calculated test statistic. The difference is that we are performing a 1-sided test now, so we don’t need the part where we took the minimum and multiplied by two.
Calculate the overall-F-test for lm2
and compare your
results with the results of the output of summary(lm2)
.
The part of the variation explained by the model is calculated from
the fitted values (X %*% betas
) and the average response
(mean(y)
):
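ESS <- sum((X %*% betas - mean(y))^2)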
The test statistic of the F-test is then:
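F_stat <- (ESS / RSS) * resid_df / (ncol(X) - 1)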
The p-value is obtained from the distribution function of the \(F\)-distribution evaluated at the test statistic, i.e., it is the area under the density for values at least as large as the test statistic (the upper tail of the distribution):
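pf(F_stat, df1 = ncol(X) - 1, df2 = resid_df, lower.tail = FALSE)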
We find the results of the overall F-test in the last line of the
output of summary(lm2)
:
## [...]
##
## Residual standard error: 15.78 on 2316 degrees of freedom
## (439 observations deleted due to missingness)
## Multiple R-squared: 0.2354, Adjusted R-squared: 0.2324
## F-statistic: 79.2 on 9 and 2316 DF, p-value: < 2.2e-16
The value of the test statistic, degrees of freedom and p-value are
elements in the summary.lm
object, and we can print them
next to our results for comparison:
cbind("summary" = summary(lm2)$fstatistic,
"calculated" = c(value = F_stat,
numdf = ncol(X) - 1,
dendf = resid_df)
)
## summary calculated
## value 79.205 79.205
## numdf 9.000 9.000
## dendf 2316.000 2316.000
We can obtain the same results with the help of the function
anova()
.
# fit a model with just an intercept, no covariates:
lm0 <- lm(SBP ~ 1, data = lm2$model)
anova(lm2, lm0)
## Analysis of Variance Table
##
## Model 1: SBP ~ gender + age + race + smoke + albu
## Model 2: SBP ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2316 576731
## 2 2325 754244 -9 -177513 79.205 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we use lm2$model
, i.e., the model frame from
lm2
, because this is the relevant subset of
nhanes
in which observations with missing values were
excluded.
If we had fitted lm0 on the full nhanes data, only observations with missing values in SBP would have been excluded (because lm0 does not involve any covariates).
We would then get an error when we use anova()
because the
two models, lm0
and lm2
were fitted on
different sample sizes. By using lm2$model
we can use the
same data used to fit lm2
.