In this practical, we will fit a multiple linear regression model using the function lm(), investigate the output of the model summary(), and see how all the different elements in this summary are derived.

The aim is to gain a better understanding of what the numbers shown in the model summary mean, to recap some of the theory about linear regression models and the least squares estimator, and to practice working with R.

It may seem to you that there is too much “R stuff” in the practicals and not enough “statistics” but in practice, cleaning/preparing the data for the analysis and visualizing the results usually is much more work than running the analysis itself. Therefore one of the aims of these practicals is to prepare you for working with your own data in the future.

This practical is intended to be “learning by doing”, i.e., you may not be able to solve all tasks by yourself without the hints or even checking the solution, and you will come across functions that you haven’t seen in the lecture slides.

When a function is new to you, you can find out more about it in the help file that you can access by typing ?<function name>.
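
For example, to open the help file for the function lm() you can run either of the following:

?lm          # shorthand for help("lm")
help("lm")   # equivalent call using the help() function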

For several tasks of this practical the given solution is only one of many possible solutions.

There are a few parts indicated as “Advanced Level” containing some more complex solutions (for those of you for whom working with R is a piece of cake).

Data

For this practical we will use a subset of the NHANES (National Health and Nutrition Examination Survey) data.

To load this dataset into R, you can use the command file.choose(), which opens a file explorer and allows you to navigate to the location of the file on your computer, and pass its result to load().

If you know the path to the file, you can also use load("<path>/nhanes.RData").
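
For instance (a minimal sketch; the path below is only a placeholder that you need to replace):

# let file.choose() open the file explorer and pass the selected path to load():
load(file.choose())

# or specify the path directly:
# load("<path>/nhanes.RData")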

Here is a short overview over the data:

head(nhanes)
##        SBP gender age               race    WC  alc educ creat albu uricacid bili            occup   smoke
## 1 110.6667   male  22 Non-Hispanic White  81.0 <NA>  low  0.91  4.8      4.9  0.6          working   never
## 2 118.0000 female  44 Non-Hispanic White  80.1 <NA> high  0.89  3.7      4.5  0.3          working   never
## 3 124.6667   male  21              other  69.6   <1  low  0.87  4.4      5.4  0.2      not working   never
## 4 102.0000 female  43 Non-Hispanic Black 120.4  >=1  low  0.68  4.4      5.0  0.8      not working current
## 5 146.6667   male  51              other  81.1 <NA>  low  0.99  4.1      5.0  0.5 looking for work current
## 6 122.0000   male  80 Non-Hispanic White 112.5 <NA>  low  1.01  4.2      4.8  0.7      not working   never
str(nhanes)
## 'data.frame':    2765 obs. of  13 variables:
##  $ SBP     : num  111 118 125 102 147 ...
##  $ gender  : Factor w/ 2 levels "male","female": 1 2 1 2 1 1 1 2 1 1 ...
##  $ age     : num  22 44 21 43 51 80 26 30 70 35 ...
##  $ race    : Factor w/ 5 levels "Mexican American",..: 3 3 5 4 5 3 4 5 4 4 ...
##  $ WC      : num  81 80.1 69.6 120.4 81.1 ...
##  $ alc     : Factor w/ 2 levels "<1",">=1": NA NA 1 2 NA NA 2 2 1 NA ...
##  $ educ    : Factor w/ 2 levels "low","high": 1 2 1 1 1 1 1 2 1 2 ...
##  $ creat   : num  0.91 0.89 0.87 0.68 0.99 1.01 1.03 0.91 0.8 1.1 ...
##  $ albu    : num  4.8 3.7 4.4 4.4 4.1 4.2 5 4.8 3.7 4 ...
##  $ uricacid: num  4.9 4.5 5.4 5 5 4.8 5.4 6.7 5.4 6.7 ...
##  $ bili    : num  0.6 0.3 0.2 0.8 0.5 0.7 1.1 0.9 0.9 0.9 ...
##  $ occup   : Factor w/ 3 levels "working","looking for work",..: 1 1 3 3 2 3 1 1 3 1 ...
##  $ smoke   : Ord.factor w/ 3 levels "never"<"former"<..: 1 1 1 3 3 1 1 1 2 1 ...

Fitting the Linear Regression Model

Task 1

Fit a linear regression model on the nhanes data with SBP as the response and gender, age, race, smoke and albu as covariates.

You need to use the function lm().
You need to specify the arguments formula and data.

Solution 1

To fit a linear regression model, we use the function lm(). This function has multiple arguments, but for now we only need to specify the formula and the data.

In the model formula we use the name of the response variable, i.e., SBP, then a tilde, and then the names of the covariates, separated by plus symbols.

lm1 <- lm(SBP ~ gender + age + race + smoke + albu, data = nhanes)

The left hand side (LHS) of a formula refers to the part before the tilde, the right hand side (RHS) refers to the part after the tilde.

Task 2

  • Get the model summary() and take a close look at it. Do you know what all the different parts mean?
  • What is the effect estimate of age and what is its interpretation?
  • How is the effect of race estimated? What type of coding was used and how should the estimates be interpreted?
  • What about the interpretation of the effect of smoke?

Solution 2

summary(lm1)
## 
## Call:
## lm(formula = SBP ~ gender + age + race + smoke + albu, data = nhanes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.379  -9.772  -1.071   8.194  95.891 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            95.86921    5.04766  18.993  < 2e-16 ***
## genderfemale           -4.45617    0.69972  -6.368 2.30e-10 ***
## age                     0.46631    0.01988  23.456  < 2e-16 ***
## raceOther Hispanic     -0.21361    1.43025  -0.149    0.881    
## raceNon-Hispanic White -0.12290    1.15504  -0.106    0.915    
## raceNon-Hispanic Black  5.32980    1.23176   4.327 1.58e-05 ***
## raceother              -0.26289    1.29833  -0.202    0.840    
## smoke.L                 0.97193    0.61667   1.576    0.115    
## smoke.Q                 0.61565    0.69081   0.891    0.373    
## albu                    1.25940    1.07471   1.172    0.241    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.78 on 2316 degrees of freedom
##   (439 observations deleted due to missingness)
## Multiple R-squared:  0.2354, Adjusted R-squared:  0.2324 
## F-statistic:  79.2 on 9 and 2316 DF,  p-value: < 2.2e-16

The output of the model summary has 4 parts:

  • The function call,
  • a summary of the residuals,
  • the table of coefficients, and
  • additional information on the residual standard error, missing values, the coefficient of determination, and the overall F test.

We will take a closer look at all of these elements throughout this practical. But first, let’s see what the effects of age, race and smoke are and how they should be interpreted.

Effect of age

The effect of age is estimated to be 0.466. Its interpretation is that a 1 unit increase in age is associated with an increase of 0.466 in the expected SBP when all other covariates are kept constant.

The p-value for the effect of age is < 2.22e-16, i.e., it would be very unlikely to observe an effect of this size if the true effect were zero. This means that we can reject the (implied) null hypothesis that the effect of age on SBP is zero and state that age has a (statistically significant) effect on SBP.
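
To extract this number directly from the fitted model (a small sketch using the coef() function):

coef(lm1)["age"]       # estimated change in expected SBP per additional year of age
10 * coef(lm1)["age"]  # corresponding difference for a 10-year age difference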

Effect of race:

The effect of race is represented by the effects of the four dummy variables raceOther Hispanic, raceNon-Hispanic White, raceNon-Hispanic Black, and raceother. We can see that R used dummy coding because the names of the different categories are appended to the variable name. Moreover, dummy coding is the default for unordered factors (and we know that race is an unordered factor from str(nhanes) shown in the introduction to this practical).

From the output of the summary, we cannot say whether race overall has a relevant contribution to explaining the variance in SBP. The regression coefficients correspond to the difference in the expected value of SBP when a participant has a particular value of race compared to the reference category (and the same values for all other covariates).

We see that the p-values for the effects of Other Hispanic, Non-Hispanic White and other are very large. This means that estimates of this size would not be unusual even if there were no actual difference in SBP between these categories and the reference category (Mexican American). But we see a coefficient of 5.330 for Non-Hispanic Black participants, and the corresponding p-value indicates that observing such an effect would be very unlikely if there were no difference in SBP between Non-Hispanic Black and Mexican American participants.

We can conclude that Non-Hispanic Black participants had a (statistically significantly) higher SBP than Mexican American participants and that there was no evidence that participants of the other race differed from Mexican American participants regarding their SBP.
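
If you want to see which category serves as the reference and how the dummy variables are constructed, you can inspect the factor levels and the contrast matrix (under dummy coding, the first level is the reference):

levels(nhanes$race)     # the first level, Mexican American, is the reference
contrasts(nhanes$race)  # the dummy (treatment) coding used for race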

Effect of smoke

The variable smoke is an ordered factor; therefore, R has used orthogonal polynomials to represent the ordering of the categories in the estimated effects.

Interpretation of such effects is not straightforward. A visualization of orthogonal polynomials can be found on slides 21/22 of the section on Linear Regression in R.
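
To see the orthogonal polynomial coding that R applies to the ordered factor smoke, you can again inspect its contrast matrix; its columns correspond to the terms smoke.L (linear) and smoke.Q (quadratic) in the coefficient table:

contrasts(nhanes$smoke)  # orthogonal polynomial contrasts (columns .L and .Q)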

⇨ Continue with Task 3.

Task 3

Because smoke is an ordered factor, R has used orthogonal polynomial contrasts, but we prefer to use dummy coding for this variable to facilitate the interpretation of the effects.

Re-fit the linear regression model (lm1) and specify that dummy-coding should be used for the variable smoke. Do this via an argument to lm(). Print the summary of this new model.

If you are unsure which argument to use, or how to specify this, take a look at the help (?lm).

Remember: Contrasts were covered in the slides on Linear Regression in R.

The argument you need to use is called contrasts.
The help for the argument contrasts tells us to check the description of the argument contrasts.arg of the function model.matrix.default. There we read that we should provide
a list, whose entries are values (numeric matrices, functions or character strings naming functions) to be used as replacement values for the contrasts replacement function and whose names are the names of columns of data containing factors.

So we can create a list with a named element, where the name is the name of the variable for which we want to change the contrast, and the element itself is the name of the contrast we want to use.
Dummy coding is called contr.treatment in R.

Solution 3

We need to specify the argument contrasts by providing a list that has elements for each covariate for which we want to specify the type of contrast.

In our case, this is only the variable smoke, and we set the contrast for smoke to be contr.treatment, i.e., dummy coding:

lm2 <- lm(SBP ~ gender + age + race + smoke + albu, data = nhanes,
          contrasts = list(smoke = "contr.treatment"))

summary(lm2)
## 
## Call:
## lm(formula = SBP ~ gender + age + race + smoke + albu, data = nhanes, 
##     contrasts = list(smoke = "contr.treatment"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.379  -9.772  -1.071   8.194  95.891 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            95.43329    5.05898  18.864  < 2e-16 ***
## genderfemale           -4.45617    0.69972  -6.368 2.30e-10 ***
## age                     0.46631    0.01988  23.456  < 2e-16 ***
## raceOther Hispanic     -0.21361    1.43025  -0.149    0.881    
## raceNon-Hispanic White -0.12290    1.15504  -0.106    0.915    
## raceNon-Hispanic Black  5.32980    1.23176   4.327 1.58e-05 ***
## raceother              -0.26289    1.29833  -0.202    0.840    
## smokeformer            -0.06676    0.85987  -0.078    0.938    
## smokecurrent            1.37452    0.87210   1.576    0.115    
## albu                    1.25940    1.07471   1.172    0.241    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.78 on 2316 degrees of freedom
##   (439 observations deleted due to missingness)
## Multiple R-squared:  0.2354, Adjusted R-squared:  0.2324 
## F-statistic:  79.2 on 9 and 2316 DF,  p-value: < 2.2e-16

We could also have used the function update() to re-fit the model.

When using update(), we specify the object that we want to update, i.e., lm1, and provide new versions of those arguments that need to be changed, i.e., contrasts:

lm2 <- update(lm1, contrasts = list(smoke = "contr.treatment"))
This can be convenient when the model is large and typing everything again would be a lot of work.

The Elements of the Model Summary

In the following, we will inspect the different parts of the model summary and re-create them “by hand” to get a better understanding of what they are and how they are calculated.

We will compare them to the values calculated by lm() (or summary()) to double-check that our own calculations are correct.

Missing Values

Before we can start to really calculate anything we should investigate if there are any values missing in our data because missing values complicate the calculations. They are automatically omitted in some steps, which leads to objects (e.g., vectors or matrices) having different dimensions than we may expect them to have, which could result in errors.

Task 1

Investigate if there are any missing values in the variables used in the model.

The summary() of the data tells us the number of missing values per variable. Alternatively, you could also use the function is.na().

Solution 1

There are multiple ways to check for missing values.

Option 1

The number of cases with missing values is displayed in the summary() of the data:

summary(nhanes)
##       SBP            gender          age                        race            WC           alc         educ          creat       
##  Min.   : 81.33   male  :1381   Min.   :18.00   Mexican American  : 282   Min.   : 56.20   <1  :1149   low :1192   Min.   :0.3400  
##  1st Qu.:110.00   female:1384   1st Qu.:31.00   Other Hispanic    : 294   1st Qu.: 85.40   >=1 : 897   high:1433   1st Qu.:0.7000  
##  Median :120.00                 Median :47.00   Non-Hispanic White:1022   Median : 96.20   NA's: 719   NA's: 140   Median :0.8400  
##  Mean   :122.59                 Mean   :47.53   Non-Hispanic Black: 688   Mean   : 97.56                           Mean   :0.8911  
##  3rd Qu.:131.33                 3rd Qu.:63.00   other             : 479   3rd Qu.:107.10                           3rd Qu.:0.9900  
##  Max.   :234.67                 Max.   :80.00                             Max.   :176.00                           Max.   :9.5100  
##  NA's   :127                                                              NA's   :150                              NA's   :214     
##       albu          uricacid           bili                     occup          smoke     
##  Min.   :2.600   Min.   : 0.400   Min.   :0.1000   working         :1375   never  :1505  
##  1st Qu.:4.000   1st Qu.: 4.500   1st Qu.:0.6000   looking for work: 137   former : 594  
##  Median :4.300   Median : 5.400   Median :0.7000   not working     :1217   current: 524  
##  Mean   :4.263   Mean   : 5.476   Mean   :0.7382   NA's            :  36   NA's   : 142  
##  3rd Qu.:4.500   3rd Qu.: 6.400   3rd Qu.:0.9000                                         
##  Max.   :5.400   Max.   :11.300   Max.   :4.1000                                         
##  NA's   :214     NA's   :215      NA's   :218

Showing the summary of the whole dataset can be inconvenient when the data contain many variables and we are only interested in a few of them.

We could then focus on only those variables that we have used in the model, i.e.,

summary(nhanes[, c("SBP", "gender", "age", "race", "smoke", "albu")])
##       SBP            gender          age                        race          smoke           albu      
##  Min.   : 81.33   male  :1381   Min.   :18.00   Mexican American  : 282   never  :1505   Min.   :2.600  
##  1st Qu.:110.00   female:1384   1st Qu.:31.00   Other Hispanic    : 294   former : 594   1st Qu.:4.000  
##  Median :120.00                 Median :47.00   Non-Hispanic White:1022   current: 524   Median :4.300  
##  Mean   :122.59                 Mean   :47.53   Non-Hispanic Black: 688   NA's   : 142   Mean   :4.263  
##  3rd Qu.:131.33                 3rd Qu.:63.00   other             : 479                  3rd Qu.:4.500  
##  Max.   :234.67                 Max.   :80.00                                            Max.   :5.400  
##  NA's   :127                                                                             NA's   :214

An even more convenient way to obtain the names of all the variables used in the model is to extract them from the model formula.

We can obtain the model formula as

formula(lm2)
## SBP ~ gender + age + race + smoke + albu

and we can use the function all.vars() to extract all variable names from a formula object, i.e.,

all.vars(formula(lm2))
## [1] "SBP"    "gender" "age"    "race"   "smoke"  "albu"

Using this, we can get the same summary, without having to type all the separate variable names:

summary(nhanes[all.vars(formula(lm2))])
##       SBP            gender          age                        race          smoke           albu      
##  Min.   : 81.33   male  :1381   Min.   :18.00   Mexican American  : 282   never  :1505   Min.   :2.600  
##  1st Qu.:110.00   female:1384   1st Qu.:31.00   Other Hispanic    : 294   former : 594   1st Qu.:4.000  
##  Median :120.00                 Median :47.00   Non-Hispanic White:1022   current: 524   Median :4.300  
##  Mean   :122.59                 Mean   :47.53   Non-Hispanic Black: 688   NA's   : 142   Mean   :4.263  
##  3rd Qu.:131.33                 3rd Qu.:63.00   other             : 479                  3rd Qu.:4.500  
##  Max.   :234.67                 Max.   :80.00                                            Max.   :5.400  
##  NA's   :127                                                                             NA's   :214

Using all.vars() and formula() can save us some work when we have models with many variables and prevents us from making typos in the variable names (which would result in an error when the name we typed does not exist in the data).

Moreover, when, for some reason, we change the formula used in lm2 (e.g., to include an additional variable) this syntax will still give us the variables used in the model. If we had created the vector of variable names by typing them, we would also have to change this vector every time we change the model formula.

Option 2

We can also count missing values per variable, using the functions is.na() and colSums().

The function is.na() returns the value TRUE if an element is NA and FALSE otherwise. colSums() sums up values per column.

colSums(is.na(nhanes[, c("SBP", "gender", "age", "race", "smoke", "albu")]))
##    SBP gender    age   race  smoke   albu 
##    127      0      0      0    142    214

is.na(<...>) gives us a matrix of TRUE and FALSE with the same dimension as the input data, indicating for each value whether it is missing or not. (Shown here are only rows 21 - 29.)

is.na(nhanes[, c("SBP", "gender", "age", "race", "smoke", "albu")])
##        SBP gender   age  race smoke  albu
## [...]
## 21   FALSE  FALSE FALSE FALSE FALSE FALSE
## 22   FALSE  FALSE FALSE FALSE  TRUE FALSE
## 23   FALSE  FALSE FALSE FALSE FALSE FALSE
## 24   FALSE  FALSE FALSE FALSE FALSE FALSE
## 25   FALSE  FALSE FALSE FALSE FALSE FALSE
## 26   FALSE  FALSE FALSE FALSE FALSE FALSE
## 27   FALSE  FALSE FALSE FALSE FALSE FALSE
## 28   FALSE  FALSE FALSE FALSE FALSE FALSE
## 29   FALSE  FALSE FALSE FALSE  TRUE  TRUE
## [...]
When we calculate the sum over a logical vector, the value TRUE is counted as 1 and FALSE is counted as 0.

To sum up (missing) values per row we can use the corresponding function rowSums().

Of course, we could also again use all.vars() and formula():

colSums(is.na(nhanes[all.vars(formula(lm2))]))
##    SBP gender    age   race  smoke   albu 
##    127      0      0      0    142    214

Task 2

  • Determine how many cases (“patients”) there are for which one or more values is/are missing in the variables used in the model.
  • What does this mean for the number of observations that are used for fitting the model?
The functions is.na(), rowSums() and table() could come in handy.

Solution 2

We can calculate the number of missing values per subject using almost the same syntax as before (where we counted missing values per variable), but now we count them per case, i.e., per row in our data.

Here, it is extremely important that we only select the columns of the nhanes data that are used in the model!

table(rowSums(is.na(nhanes[, c("SBP", "gender", "age", "race", "smoke", "albu")])))
## 
##    0    1    2 
## 2326  395   44

rowSums(is.na(nhanes[, c("SBP", ..., "albu")])) returns a vector of integers, with one entry for each row in the data.

rowSums(is.na(nhanes[, c("SBP", "gender", "age", "race", "smoke", "albu")]))
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20 [...]
##    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 [...]
We can use table() to then count how many rows there are with 0, 1, 2, … missing values.

There are 439 observations with missing values in the variables used in our model.

Because there are nrow(nhanes) = 2765 rows in the data, we have 2326 observations with complete data.
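
The same numbers can be computed directly (a small check, using the column selection from above):

vars <- c("SBP", "gender", "age", "race", "smoke", "albu")
sum(rowSums(is.na(nhanes[, vars])) > 0)    # number of incomplete cases: 395 + 44 = 439
sum(rowSums(is.na(nhanes[, vars])) == 0)   # number of complete cases: 2326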

Task 3

Find the information about the number of incomplete cases in the output of the summary() for our model (lm2).

Solution 3

Information on the number of cases that were excluded due to missingness is given in the last part of the output of the summary():

## [...]
##  Residual standard error: 15.78 on 2316 degrees of freedom
##   (439 observations deleted due to missingness)
## Multiple R-squared:  0.2354, Adjusted R-squared:  0.2324 
## F-statistic:  79.2 on 9 and 2316 DF,  p-value: < 2.2e-16

It tells us that 439 observations were deleted due to missingness.

Parameter Estimates

In this part, we will calculate the least squares estimates for the regression coefficients from the data, and compare them with the values that we have obtained using lm().

Task 1

Remember (or look up) the formula for the least squares estimator (in matrix notation).

This formula was given in the slides on the Least Squares Estimator

Solution 1

The formula can be found on slide 3 of the slides on the Least Squares Estimator:

\[\boldsymbol{\hat\beta} = \left(\mathbf X^\top\mathbf X\right)^{-1}\mathbf X^\top\mathbf y\]

The little \(^\top\) indicates that a vector or a matrix is transposed (see slide 17 of Multiple Linear Regression).

The superscript \(^{-1}\) refers to the inverse of the matrix.

In R, the inverse of a matrix is calculated using the function solve() and the function t() transposes a vector or a matrix.
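
As a small illustration of these two functions (using a made-up 2 x 2 matrix):

A <- matrix(c(2, 1, 1, 3), nrow = 2)
t(A)            # the transpose of A
solve(A)        # the inverse of A
solve(A) %*% A  # recovers (up to rounding) the 2 x 2 identity matrix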

Task 2

The next step is to create the design matrix, X, and the vector of the response values, y.

X has to be a matrix containing only numeric values. It should have a column for the intercept and categorical variables have to be coded appropriately.

y should be a column vector, i.e., a matrix with a single column.

To create a design matrix, you can use the function model.matrix() (see ?model.matrix).

A model.matrix() can be created from different types of objects. Here, we use a formula as the object (and the nhanes data as data).

When creating the design matrix / model matrix, R automatically applies contrasts, e.g., it converts categorical variables into the corresponding dummy variables.

To use contrast types other than the default, we need to specify the argument contrasts.arg (in the same way we specified contrasts in lm()).

  • Create both X and y.
  • Investigate the resulting objects and check that they are of the correct type and have the correct length/dimension.

Solution 2

Create the design matrix X:

X <- model.matrix(SBP ~ gender + age + race + smoke + albu, 
                  data = nhanes,
                  contrasts.arg = list(smoke = "contr.treatment"))

To get dummy variables for smoke instead of orthogonal polynomials (default for ordered factors), we need to specify the contrasts.arg argument.

You can extract the formula from a fitted model object using the function formula(), e.g.:

formula(lm2)
## SBP ~ gender + age + race + smoke + albu

So, instead of typing the formula from lm2 again, we could also create the design matrix X like this:

X <- model.matrix(formula(lm2),
                  data = nhanes,
                  contrasts.arg = list(smoke = "contr.treatment"))

Create the response y:

y <- nhanes$SBP

Check their class and dimensions:

str(X)
##  num [1:2326, 1:10] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:2326] "1" "2" "3" "4" ...
##   ..$ : chr [1:10] "(Intercept)" "genderfemale" "age" "raceOther Hispanic" ...
##  - attr(*, "assign")= int [1:10] 0 1 2 3 3 3 3 4 4 5
##  - attr(*, "contrasts")=List of 3
##   ..$ gender: chr "contr.treatment"
##   ..$ race  : chr "contr.treatment"
##   ..$ smoke : chr "contr.treatment"
str(y)
##  num [1:2765] 111 118 125 102 147 ...

X is a numeric matrix with 2326 rows and 10 columns and y is a numeric vector of length 2765.

We can also confirm that X is indeed a matrix by checking its class:

class(X)
## [1] "matrix" "array"

or testing if it is a matrix

is.matrix(X)
## [1] TRUE
head(X)
##   (Intercept) genderfemale age raceOther Hispanic raceNon-Hispanic White raceNon-Hispanic Black raceother smokeformer smokecurrent albu
## 1           1            0  22                  0                      1                      0         0           0            0  4.8
## 2           1            1  44                  0                      1                      0         0           0            0  3.7
## 3           1            0  21                  0                      0                      0         1           0            0  4.4
## 4           1            1  43                  0                      0                      1         0           0            1  4.4
## 5           1            0  51                  0                      0                      0         1           0            1  4.1
## 6           1            0  80                  0                      1                      0         0           0            0  4.2
head(y)
## [1] 110.6667 118.0000 124.6667 102.0000 146.6667 122.0000

The structure of X and y looks OK, but there are 2765 observations in y, i.e., y includes the response for subjects with missing covariate values, but X only has 2326 rows, i.e., R has automatically excluded incomplete observations from X.

To be able to estimate the parameters, X and y need to have matching dimensions. We need to exclude the observations in y that belong to incomplete cases.

We can do this with the help of the function complete.cases().

The function complete.cases() returns a logical vector specifying which rows (of a matrix or data.frame) are completely observed (TRUE) and which have missing values (FALSE).

As before, we need to select the relevant columns of the nhanes data to make sure that only those rows in which there is a missing value in these variables are considered incomplete, e.g., use nhanes[, c("SBP", "gender", "age", "race", "smoke", "albu")]:

cc <- complete.cases(nhanes[, c("SBP", "gender", "age", "race", "smoke", "albu")])

We can now use the logical vector cc to select rows in y.

y <- y[cc]

Task 3

Using the formula of the least squares estimates from Task 1, and the two objects created in Task 2, calculate the parameter estimates.

You need to use matrix multiplication.
The symbol for matrix multiplication in R is %*%.
To transpose a matrix, use the function t().
To get the inverse of a matrix (i.e., to calculate \((\mathbf X^\top\mathbf X)^{-1}\)) use the function solve().

Solution 3

To calculate the least squares estimates of the regression coefficients using the formula \[\boldsymbol{\hat\beta} = \left(\mathbf X^\top\mathbf X\right)^{-1}\mathbf X^\top\mathbf y\] in R, we use t() to transpose a matrix (\(^\top\) in the mathematical formula) and solve() instead of the \(^{-1}\) to calculate the inverse:

betas <- solve(t(X) %*% X) %*% t(X) %*% y

The result is a column vector with the parameter estimates:

betas
##                               [,1]
## (Intercept)            95.43328823
## genderfemale           -4.45617279
## age                     0.46631424
## raceOther Hispanic     -0.21360840
## raceNon-Hispanic White -0.12290498
## raceNon-Hispanic Black  5.32980363
## raceother              -0.26288646
## smokeformer            -0.06675802
## smokecurrent            1.37452263
## albu                    1.25940379

If you get an error message like this:

## Error in solve(t(X) %*% X) %*% t(X) %*% y: non-conformable arguments

this means that the dimensions of y and X do not match. Check the dimensions of both objects (dim(X), length(y)).

The number of rows in X must be the same as the length of y, and y must be either a plain vector or a column vector (a matrix with one column), not a row vector (a matrix with one row).
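
A quick check of the dimensions (assuming the objects X and y created above):

dim(X)                # number of rows and columns of X
length(y)             # must be equal to nrow(X)
nrow(X) == length(y)  # should be TRUE before we apply the formula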

Task 4

Compare the calculated parameter estimates betas with the estimates calculated by lm() above (lm2).

You can use the function coef() to extract the vector of regression coefficients from a fitted model object.

Solution 4

We can compare the values in two vectors in different ways. The easiest (with regards to the syntax) is just to print both of them separately.

But the comparison is a bit more convenient when we print the two vectors of parameter estimates next to each other, i.e., by combining them with the function cbind():

cbind(betas, coef(lm2))
##                               [,1]        [,2]
## (Intercept)            95.43328823 95.43328823
## genderfemale           -4.45617279 -4.45617279
## age                     0.46631424  0.46631424
## raceOther Hispanic     -0.21360840 -0.21360840
## raceNon-Hispanic White -0.12290498 -0.12290498
## raceNon-Hispanic Black  5.32980363  5.32980363
## raceother              -0.26288646 -0.26288646
## smokeformer            -0.06675802 -0.06675802
## smokecurrent            1.37452263  1.37452263
## albu                    1.25940379  1.25940379

The parameter estimates are the same!

Comparing vectors by looking at them and checking if all the numbers are the same works for small vectors, but is not really feasible when we have longer vectors, or other types of objects.

You will see other options in the solutions of some other tasks.
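
One of these options, previewed here, is the function all.equal(), which checks whether two objects are (nearly) equal; because betas is a matrix with row names and coef(lm2) is a named vector, we strip these attributes first:

all.equal(c(betas), unname(coef(lm2)))  # should return TRUE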

Residuals

Since we need the residuals to calculate the standard errors of the regression coefficients and the residual standard deviation, we continue with calculating the residuals.

Task 1

Calculate the residuals and obtain the 5-point summary (minimum, 1st quartile, median, 3rd quartile, maximum).

The formula for the residuals in a simple linear regression model is given on slide 6 of the section Simple Linear Regression and on slide 3 of the section on The Least Squares Estimator.
The function quantile() may come in handy.

Solution 1

The residuals can be calculated as the difference between the observed response \(\mathbf y\) and the fitted values \(\mathbf{\hat y} = \mathbf X\boldsymbol{\hat\beta}\), i.e., \[\boldsymbol{\hat\varepsilon} = \mathbf y - \mathbf X\boldsymbol{\hat\beta}.\]

We have already created the vectors y and X containing the response and the design matrix in the previous part of this practical, and have calculated the vector of parameter estimates betas.

Using matrix multiplication (%*%) we can then calculate the residuals, and we will call them r here:

r <- y - X %*% betas

The 5-point summary consisting of the minimum, 1st quartile, median, 3rd quartile, maximum can be obtained very easily using the function quantile().

By default, the function quantile() returns the 0, 0.25, 0.5, 0.75 and 1 quantile of a vector of numeric values.

How can we know that by default the function quantile() returns the 0, 0.25, 0.5, 0.75 and 1 quantiles?

We have a look at the help file ?quantile.

In the Usage section we see written

quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
         names = TRUE, type = 7, digits = 7, ...)

The values specified for each argument here are the default values, i.e., seq(0, 1, 0.25) is the default for the argument probs.

And seq(0, 1, 0.25) is a sequence from 0 to 1 in steps of 0.25, and, hence, returns

seq(0, 1, 0.25)
## [1] 0.00 0.25 0.50 0.75 1.00

The minimum and maximum correspond, of course, to the 0 and 1 quantile, and the median to the 0.5 quantile, and so the default values of quantile() are exactly what we need here.

quantile(r)
##         0%        25%        50%        75%       100% 
## -54.378987  -9.771811  -1.070625   8.193726  95.891053

An alternative way to obtain the required values would be to use the function summary():

summary(r)
##        V1         
##  Min.   :-54.379  
##  1st Qu.: -9.772  
##  Median : -1.071  
##  Mean   :  0.000  
##  3rd Qu.:  8.194  
##  Max.   : 95.891

It additionally returns the mean and the number of missing values if there are any.

Task 2

  • Find the summary of the residuals in the printed model summary() for lm2.
  • Compare the calculated residuals (r) with the residuals returned by the model lm2.
See the slides on Linear Regression in R for how to obtain residuals from a fitted linear model object.

Solution 2

The 5-point summary of the residuals is given in the printed model summary before the table of coefficients:

## 
## Call:
## lm(formula = SBP ~ gender + age + race + smoke + albu, data = nhanes, 
##     contrasts = list(smoke = "contr.treatment"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.379  -9.772  -1.071   8.194  95.891 
## 
## Coefficients: 
## [...]

These values coincide with the values we obtained in the previous step. This is already a good indication that we calculated the residuals correctly. (But it is no proof.)

To compare the full vector of residuals we first need to extract them from the fitted model object lm2. This can be done in two ways:

  • lm2$residuals
  • residuals(lm2)

Because we have 2326 residuals, comparing them by just printing the values next to each other is not very efficient.

Alternatively, we can check if the differences between the two vectors are (close to) zero. Because there are still 2326 of these differences, we just check the summary of the differences:

summary(r - lm2$residuals)
##        V1            
##  Min.   :-1.306e-11  
##  1st Qu.:-8.974e-12  
##  Median :-8.583e-12  
##  Mean   :-8.592e-12  
##  3rd Qu.:-8.181e-12  
##  Max.   :-5.697e-12

The very small differences between the values we calculated and the values from the fitted model occur due to differences in the implemented mathematical operations that are used internally.

There are two functions in that can help us with comparing objects: all.equal() and identical().

In case the two objects are not the same, all.equal() will give us some information, but identical() does not:

identical(r, lm2$residuals)
## [1] FALSE
all.equal(r, lm2$residuals)
## [1] "names for current but not for target"                              "Attributes: < names for target but not for current >"             
## [3] "Attributes: < Length mismatch: comparison on first 0 components >" "target is matrix, current is numeric"

There seem to be some differences in the names of the two objects, and one is a matrix (r, because we calculated it via matrix multiplication) and one is numeric.

We need to turn r into a vector using c(r):

identical(c(r), lm2$residuals)
## [1] FALSE
all.equal(c(r), lm2$residuals)
## [1] "names for current but not for target"

but that does not solve (all of) the issues with the names.

Names are an “attribute” of an object, and we can see all the attributes that an object has using the function attributes():

attributes(c(r))
## NULL
attributes(lm2$residuals)
## $names
##    [1] "1"    "2"    "3"    "4"    "5"    "6"    "7"    "8"    "9"    "10"   "11"   "12"   "13"   "14"   "15"   "16"   "17"   "18"   "19"  
##   [20] "20"   "21"   "23"   "24"   "25"   "26"   "27"   "28"   "30"   "31"   "32"   "33"   "34"   "35"   "36"   "37"   "38"   "39"   "41"   
## [....]

We see that c(r) does not have any attributes, and lm2$residuals has an attribute called names, which contains just the indices of the observations.

The function all.equal() has an argument that allows us to specify that attributes should not be compared:

all.equal(c(r), lm2$residuals, check.attributes = FALSE)
## [1] TRUE

identical() does not have this argument. But we can remove attributes by setting them to NULL:

r_lm2 <- lm2$residuals
attributes(r_lm2) <- NULL

When we now compare r_lm2 and the original lm2$residuals we can see that the names have disappeared:

head(lm2$residuals)
##           1           2           3           4           5           6 
##  -0.9477681   1.9681688  14.1622891 -21.2743309  21.1761603 -15.9050187
head(r_lm2)
## [1]  -0.9477681   1.9681688  14.1622891 -21.2743309  21.1761603 -15.9050187
identical(c(r), r_lm2)
## [1] FALSE

But identical() still does not consider the two vectors to be identical. This is not really surprising, because they are not identical: we saw previously that the differences between the values are not equal to zero.

The function identical() really requires vectors to be identical, but all.equal() tests if the two vectors are almost equal. It has an argument tolerance which is, by default, equal to the square root of the machine epsilon. The machine epsilon is the smallest positive floating-point number x such that x + 1 is not identical to 1.

.Machine$double.eps
## [1] 2.220446e-16
sqrt(.Machine$double.eps)
## [1] 1.490116e-08

So differences smaller than \(1.4901161\times 10^{-8}\) are not considered to be differences when using all.equal(), but they are when using identical().
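
A small demonstration of this difference (with made-up numbers):

a <- 1
b <- 1 + 1e-10                      # difference far below the default tolerance
identical(a, b)                     # FALSE: the stored values are not exactly equal
all.equal(a, b)                     # TRUE: the difference is below ~1.49e-08
all.equal(a, b, tolerance = 1e-12)  # with a stricter tolerance the difference is reported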

Conclusion: For most practical purposes we prefer to use all.equal().

Residual Variance

Using the residuals, we can now calculate the estimate for the residual variance.

Task 1

Look up the formula for the estimator for the residual variance.

See the slides on the Least Squares Estimator.

Solution 1

The formula for the residual variance is given in the slides on the Least Squares Estimator (slide 6):

\[\hat\sigma^2 = \frac{1}{n - p - 1} \boldsymbol{\hat\varepsilon}^\top \boldsymbol{\hat\varepsilon}\]

Task 2

  • Calculate the estimated residual variance and compare the result with the residual standard error shown in the output of summary(lm2).
  • How are the degrees of freedom calculated that are mentioned in the output of summary(lm2) together with the residual standard error?

Solution 2

The formula for the estimator of the residual variance is \[\hat\sigma^2 = \frac{1}{n - p - 1} \boldsymbol{\hat\varepsilon}^\top \boldsymbol{\hat\varepsilon}.\]

Here, \(n\) is the number of observations, i.e., 2326, and \(p\) is the number of regression coefficients excluding the intercept, i.e., 9.

# we use
# * n = nrow(X)
# * p = ncol(X) - 1

sig2 <- 1 / (nrow(X) - ncol(X)) * sum(r^2)

The residual standard error is shown in the lower part of the model summary:

## [...]
##  
## Residual standard error: 15.78 on 2316 degrees of freedom
##   (439 observations deleted due to missingness)
## Multiple R-squared:  0.2354, Adjusted R-squared:  0.2324 
## F-statistic:  79.2 on 9 and 2316 DF,  p-value: < 2.2e-16

The residual standard error is the square root of the estimated residual variance, i.e.,

sqrt(sig2)
## [1] 15.78038

The degrees of freedom are the number of observations 2326 minus the number of regression coefficients (9 + 1 = 10), i.e., 2316.

We can obtain them from the fitted model object as lm2$df.residual.

lm2$df.residual
## [1] 2316

Standard Errors of the Regression Coefficients

Task 1

Look up the formula for the variance-covariance matrix of the parameter estimates.

See the slides on the Least Squares Estimator.

Solution 1

The formula for the variance-covariance matrix of the parameter estimates \(\boldsymbol{\hat\beta}\) is given in the slides on the Least Squares Estimator (slide 7):

\[\mathrm{cov(\boldsymbol{\hat\beta})} = \sigma^2 (\mathbf X^\top \mathbf X)^{-1}\]

Task 2

Calculate the variance-covariance matrix of the parameter estimates and compare it with the variance-covariance matrix that you can obtain from the fitted model (lm2) using the function vcov().

Use the estimate of the residual variance calculated in the previous step (sig2) to replace \(\sigma^2\) in the formula for the variance of \(\boldsymbol{\hat\beta}\).

Solution 2

Calculate the variance-covariance matrix:

V <- sig2 * solve(t(X) %*% X)

V is a symmetric matrix with 10 rows and columns that has the variances of the regression coefficients on the diagonal and the covariances as off-diagonal elements.
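
We can quickly confirm these properties (assuming the matrix V just created):

dim(V)          # a 10 x 10 matrix
isSymmetric(V)  # should be TRUE (up to numerical tolerance)
diag(V)         # the variances of the 10 regression coefficients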

We can compare it to the matrix derived using vcov() using differences:

summary(c(V - vcov(lm2)))
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -2.983e-12 -4.381e-15  5.000e-18  7.304e-14  3.927e-15  1.343e-11

Here, we compare the values of the two matrices by looking at their differences. The function c() is used to convert the matrices to vectors in order to get one overall summary, and not a separate summary per column of the matrix.

Task 3

  • Calculate the standard error of the parameter estimates from the variance-covariance matrix.
  • Compare this to the standard errors shown in the model summary.
You can use diag() to extract the diagonal elements from a matrix.

Solution 3

The standard errors are the square root of the variances, and the variances are the elements on the diagonal of the variance-covariance matrix V. Hence, we can obtain the standard errors as:

se <- sqrt(diag(V))

In the model summary, the standard errors are part of the “coefficients” table. This table can be obtained using the following syntax:

summary(lm2)$coefficients
##                           Estimate Std. Error    t value      Pr(>|t|)
## (Intercept)            95.43328823 5.05897617 18.8641506  5.939995e-74
## genderfemale           -4.45617279 0.69972411 -6.3684711  2.295145e-10
## age                     0.46631424 0.01988055 23.4558012 2.443005e-109
## raceOther Hispanic     -0.21360840 1.43025364 -0.1493500  8.812905e-01
## raceNon-Hispanic White -0.12290498 1.15504476 -0.1064071  9.152686e-01
## raceNon-Hispanic Black  5.32980363 1.23175526  4.3269989  1.575735e-05
## raceother              -0.26288646 1.29833254 -0.2024801  8.395592e-01
## smokeformer            -0.06675802 0.85986713 -0.0776376  9.381230e-01
## smokecurrent            1.37452263 0.87209759  1.5761110  1.151368e-01
## albu                    1.25940379 1.07471405  1.1718501  2.413778e-01

This “table” of coefficients is actually a matrix object and so we can compare the standard errors either by printing them next to each other

cbind(summary(lm2)$coefficients[, "Std. Error"], se)
##                                           se
## (Intercept)            5.05897617 5.05897617
## genderfemale           0.69972411 0.69972411
## age                    0.01988055 0.01988055
## raceOther Hispanic     1.43025364 1.43025364
## raceNon-Hispanic White 1.15504476 1.15504476
## raceNon-Hispanic Black 1.23175526 1.23175526
## raceother              1.29833254 1.29833254
## smokeformer            0.85986713 0.85986713
## smokecurrent           0.87209759 0.87209759
## albu                   1.07471405 1.07471405

or again using the differences

summary(summary(lm2)$coefficients[, "Std. Error"] - sqrt(diag(V)))
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -1.328e-12 -4.025e-15 -3.220e-15 -1.652e-13 -6.774e-16  2.220e-16

Test Statistic

Task 1

  • Look up the formula for the test-statistic used for the regression coefficients.
  • What type of test is usually performed?
  • What are the null- and alternative hypotheses?
Check the slides on Hypothesis Tests.

Solution 1

The test statistic is \[\frac{\boldsymbol{\hat\beta} - \boldsymbol \beta}{\widehat{\mathrm{se}}(\boldsymbol{\hat\beta})},\] where \(\boldsymbol{\hat\beta}\) is the vector of parameter estimates, \(\boldsymbol \beta\) is the vector of values of the regression coefficients assumed under the null-hypothesis, and \(\widehat{\mathrm{se}}(\boldsymbol{\hat\beta})\) are the estimated standard errors of the parameter estimates.

In the slides on Hypothesis Tests (slide 6), the test statistic was given for a single regression coefficient \(\beta_j\):

\[T_j = \frac{\hat\beta_j - \beta_{0j}}{\hat\sigma_j},\qquad j = 0,\ldots,p\]

In the solution above it is written for the whole vector of regression coefficients \[\boldsymbol\beta = \begin{pmatrix}\beta_0\\\beta_1\\\vdots\\\beta_p\end{pmatrix}\]

When the model assumptions are fulfilled, this test statistic follows a \(t\)-distribution, i.e., we are performing a \(t\)-test.

In regression models, we typically perform a 2-sided \(t\)-test with null hypothesis \(\boldsymbol\beta = \mathbf 0\) and alternative hypothesis \(\boldsymbol\beta \neq \mathbf 0\).

Task 2

Calculate the test statistics for all regression coefficients and compare them to the corresponding values in the model summary.

Solution 2

Because our null hypothesis is that the regression coefficients are equal to zero, we can omit subtracting \(\boldsymbol\beta\) from the formula and only need to divide the parameter estimates by their standard error:

t_stats <- betas/se

As before, we can compare the values by printing them next to each other

cbind(summary(lm2)$coefficients[, "t value"], t_stats)
##                              [,1]       [,2]
## (Intercept)            18.8641506 18.8641506
## genderfemale           -6.3684711 -6.3684711
## age                    23.4558012 23.4558012
## raceOther Hispanic     -0.1493500 -0.1493500
## raceNon-Hispanic White -0.1064071 -0.1064071
## raceNon-Hispanic Black  4.3269989  4.3269989
## raceother              -0.2024801 -0.2024801
## smokeformer            -0.0776376 -0.0776376
## smokecurrent            1.5761110  1.5761110
## albu                    1.1718501  1.1718501

or by checking if their differences are (close to) zero:

summary(summary(lm2)$coefficients[, "t value"] - t_stats)
##        V1            
##  Min.   :-1.162e-13  
##  1st Qu.:-8.108e-14  
##  Median : 1.643e-13  
##  Mean   : 4.884e-13  
##  3rd Qu.: 6.328e-13  
##  Max.   : 2.145e-12

P-values

Task 1

  • Look up the formula for the p-value for a 2-sided test.
  • What is the interpretation of the p-value (in general)?
  • Which distribution do we need to use in the calculation of the p-value in our example here?
Check the slides on Hypothesis Tests.

Solution 1

The formula for the p-value for a 2-sided test was given in the slides on Hypothesis Tests (slide 10): \[p = 2\min\{\Pr(t\leq T\mid H_0),\, \Pr(t \geq T\mid H_0)\}\] The p-value is the probability of observing a test statistic / effect / regression coefficient at least as extreme as the one observed if the null hypothesis is true (where “extreme” means in the direction of the alternative hypothesis).

Under the null-hypothesis, the test statistic follows a \(t\)-distribution with \(n - p - 1\) degrees of freedom (i.e., the residual degrees of freedom). We have already calculated the number of degrees of freedom earlier, for example, as nrow(X) - ncol(X) = 2316.

Task 2

Calculate the p-values for the regression coefficients from lm2.

The function pt() calculates the distribution function for the \(t\)-distribution. Check the help (?pt) for information on how to use this function.

The function pmin() can be used to determine the “parallel” minima of two (or more) vectors. For example, pmin(c(1, 2, 3, 4), c(1.2, 0.9, 3, 3.3)) returns a vector c(1, 0.9, 3, 3.3)

Solution 2

The residual degrees of freedom are:

resid_df <- nrow(X) - ncol(X)

Remember, you can also obtain the residual degrees of freedom directly from the fitted model:

lm2$df.residual
## [1] 2316

For each regression coefficient, we need to find the minimum of \(\Pr(t\leq T\mid H_0)\) and \(\Pr(t \geq T\mid H_0)\), and multiply this minimum with 2:

p_vals <- 2 * pmin(
  pt(t_stats, df = resid_df, lower.tail = FALSE),
  pt(t_stats, df = resid_df, lower.tail = TRUE)
)

The function pt() calculates the distribution function of the \(t\)-distribution and has the arguments

  • q: a vector of quantiles
  • df: the (residual) degrees of freedom
  • ncp: a non-centrality parameter (which we don’t need here)
  • lower.tail = TRUE: an indicator specifying if the lower or upper tail of the distribution should be returned
  • log.p = FALSE: an indicator specifying if the returned probabilities should be log-transformed (which we don’t need here)

For the formula of the p-value, we need both \(\Pr(t\leq T\mid H_0)\) and \(\Pr(t \geq T\mid H_0)\), i.e., the area under the density function that is left (lower tail) or right (upper tail) of the observed test statistic.

For example:
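
Taking the test statistic of age (a sketch, assuming the objects t_stats and resid_df created above; t_stats is a column matrix, so we select its row by name):

T_age <- t_stats["age", 1]
pt(T_age, df = resid_df, lower.tail = TRUE)   # P(t <= T | H0), practically 1 here
pt(T_age, df = resid_df, lower.tail = FALSE)  # P(t >= T | H0), extremely small here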

The function pmin() selects the smaller value from each of the pairs of probabilities.

Task 3

Compare the p-values you just calculated with the p-values from the output of summary(lm2).

Solution 3

The p-values are also part of the “coefficients table” returned by the summary() of lm2. We can thus work with them by selecting the column with name "Pr(>|t|)" from this matrix:

summary(summary(lm2)$coefficients[, "Pr(>|t|)"] - p_vals)
##        V1            
##  Min.   :-8.612e-13  
##  1st Qu.:-7.053e-14  
##  Median :-3.554e-14  
##  Mean   :-1.162e-13  
##  3rd Qu.:-4.700e-18  
##  Max.   : 0.000e+00

With every step we make in the calculation, the small differences we observed between the numbers derived from summary(lm2) and the numbers we calculated ourselves become larger.

Especially when the numbers themselves are relatively small, as may be the case for p-values, it could be difficult to judge if the observed differences are indeed just due to the numeric differences in the calculation or if we made a mistake in our own calculation.

To check that we “implemented” the formula correctly, we should use the values from summary(lm2) as input. For example:

Extract the values of the test statistics from the summary() output:

t_stats_lm <- summary(lm2)$coefficients[, "t value"]

Calculate the p-values (i.e., use “our” formula/implementation but the input values calculated by lm()):

p_vals_lm <- 2 * pmin(
  pt(abs(t_stats_lm), df = resid_df, lower.tail = FALSE),
  pt(abs(t_stats_lm), df = resid_df, lower.tail = TRUE)
)

Compare with the p-values from the summary() output:

summary(summary(lm2)$coefficients[, "Pr(>|t|)"] - p_vals_lm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0       0
We now get the exact same numbers, indicating that our syntax to calculate the p-values is correct.

Confidence intervals

The output of the summary() function does not contain the confidence intervals for the regression coefficients. We can, however, obtain them using the function confint().

Task 1

Read the help for the function confint() (?confint) to understand which parameters you need to provide and which options you have.

Solution 1

The function confint() takes a fitted model object (such as lm2) as its main argument (object). The additional argument parm allows us to specify for which parameters we want to obtain the confidence interval. We could specify either a vector with numbers referring to the index of the vector of regression coefficients, or the names of the parameters (i.e., what is returned by names(coef(lm2))).

The confidence level is specified via the argument level.

In the Details section we can read that confint is a “generic function”. This means that it can be applied to different types of objects, and will look for a specific “method” for the type of object.

We can also read that in the default version of confint() normality is assumed (i.e., the regression coefficients are assumed to be approximately normally distributed).

The Details section also tells us that the fitted model object must be of a type for which coef() and vcov() return the parameter estimates and variance-covariance matrix.
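
For example (a small sketch; both calls use the fitted model lm2 from above):

confint(lm2, parm = "age")                            # 95% CI (the default level) for age only
confint(lm2, parm = c("age", "albu"), level = 0.95)   # CIs for two selected coefficients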

Task 2

Calculate the 90% confidence intervals for the regression parameters in lm2

  • using confint() and
  • “by hand”

and compare them.

You can obtain the quantile from a \(t\)-distribution by using the function qt().

Solution 2

Using confint() (we need to change the argument level to get 90% CIs instead of the default 95% CIs):

ci <- confint(lm2, level = 0.9)
ci
##                                5 %        95 %
## (Intercept)            87.10868312 103.7578933
## genderfemale           -5.60757709  -3.3047685
## age                     0.43360056   0.4990279
## raceOther Hispanic     -2.56710767   2.1398909
## raceNon-Hispanic White -2.02354479   1.7777348
## raceNon-Hispanic Black  3.30293579   7.3566715
## raceother              -2.39930801   1.8735351
## smokeformer            -1.48167954   1.3481635
## smokecurrent           -0.06052427   2.8095695
## albu                   -0.50905088   3.0278585

The formula for the confidence interval was given in the slides on Hypothesis Tests: \[\hat\beta_j \pm t_{1-\alpha/2}(n-p-1) \widehat{\mathrm{se}}(\hat\beta_j)\] Since we want 90% confidence intervals, \(\alpha\) is 0.1, which means \(\alpha/2\) is 0.05, and therefore we use \(t_{0.95}(n-p-1)\) instead of \(t_{0.975}(n-p-1)\):

lwr <- betas - qt(0.95, resid_df) * se
upr <- betas + qt(0.95, resid_df) * se

Compare them

cbind(lwr, ci, upr)
##                                            5 %        95 %            
## (Intercept)            87.10868312 87.10868312 103.7578933 103.7578933
## genderfemale           -5.60757709 -5.60757709  -3.3047685  -3.3047685
## age                     0.43360056  0.43360056   0.4990279   0.4990279
## raceOther Hispanic     -2.56710767 -2.56710767   2.1398909   2.1398909
## raceNon-Hispanic White -2.02354479 -2.02354479   1.7777348   1.7777348
## raceNon-Hispanic Black  3.30293579  3.30293579   7.3566715   7.3566715
## raceother              -2.39930801 -2.39930801   1.8735351   1.8735351
## smokeformer            -1.48167954 -1.48167954   1.3481635   1.3481635
## smokecurrent           -0.06052427 -0.06052427   2.8095695   2.8095695
## albu                   -0.50905088 -0.50905088   3.0278585   3.0278585

R-Squared

Task 1

  • Look up the formulas for the coefficient of determination (R2) and the adjusted coefficient of determination (R2adj).
  • Think about how to calculate the different elements of the formulas.
The coefficient of determination was also part of the slides on Hypothesis Tests.

Solution 1

The formulas for the coefficient of determination and the adjusted coefficient of determination are given on slides 16 and 17 of the slides on Hypothesis Tests: \[R^2 = \frac{\text{ESS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}\qquad\text{and}\qquad R_{adj}^2 = 1 - \frac{n-1}{n-p-1}(1 - R^2),\] where \(\text{RSS}\) is the residual sum of squares, \(\text{ESS}\) the explained variation (“explained sum of squares”), and \(\text{TSS}\) the total variation (“total sum of squares”).

The residual sum of squares is calculated as the sum over the squared residuals \[\text{RSS} = \boldsymbol\varepsilon^\top\boldsymbol\varepsilon = \sum_{i = 1}^n \varepsilon_i^2,\] the variation explained by the model is calculated as the sum over the squared deviations of the fitted values from the mean response, \[\text{ESS} = (\mathbf{\hat y} - \bar y)^\top (\mathbf{\hat y} - \bar y) = \sum_{i = 1}^n (\hat y_i - \bar y)^2,\] and the total variation is calculated as the sum over the squared differences between the response and the mean response, \[\text{TSS} = (\mathbf y - \bar y)^\top (\mathbf y - \bar y) = \sum_{i = 1}^n (y_i - \bar y)^2,\] where \(\bar y = \displaystyle\frac{1}{n}\sum_{i = 1}^n y_i\).

Task 2

Calculate the coefficient of determination and the adjusted coefficient of determination and compare them to the values shown in the output of summary(lm2).

Solution 2

Residual sum of squares

RSS <- sum(r^2)

Total sum of squares

TSS <- sum((y - mean(y))^2)

Coefficient of determination

R2 <- 1 - RSS/TSS

Adjusted coefficient of determination.

# resid_df <- nrow(X) - ncol(X)
R2adj <- 1 - (nrow(X) - 1)/resid_df * (1 - R2)

In the model summary(), the coefficient of determination is shown in the last part of the output:

## [...]
##  Residual standard error: 15.78 on 2316 degrees of freedom
##   (439 observations deleted due to missingness)
## Multiple R-squared:  0.2354, Adjusted R-squared:  0.2324 
## F-statistic:  79.2 on 9 and 2316 DF,  p-value: < 2.2e-16

Our own calculated values are

R2
## [1] 0.2353521
R2adj
## [1] 0.2323807

and, thus, the same as the values shown in the model summary.

Overall-F-Test

Task 1

  • What are the null and alternative hypotheses of the overall-F-test?
  • How is the test statistic for this test calculated?
  • Based on what you have learned so far on how p-values are defined and calculated, derive how to calculate the p-value for the overall-F-test.

Solution 1

The overall F-test has the null hypothesis that all regression coefficients (except for the intercept) are equal to zero, and the alternative hypothesis that at least one of these coefficients is different from zero. This means that we are testing whether the model explains any of the variation in the response, i.e., whether it is “better” than using just the average response.

The test statistic for the F-test is calculated as \[F = \frac{\text{ESS}}{\text{RSS}}\frac{n - p - 1}{p}\] and, under the null-hypothesis, follows an \(F\)-distribution with \(p\) and \(n-p-1\) degrees of freedom (see the slides on Hypothesis Tests (slide 14)).

The p-value is the probability under the null-hypothesis to observe a test statistic that is at least as large as the one observed in the data.

Analogous to how we calculated the p-value for the \(t\)-test for the regression coefficients, we can determine this probability as the area under the density of the \(F\)-distribution in the part that is larger than the calculated test statistic. The difference is that we are performing a 1-sided test now, so we don’t need the part where we took the minimum and multiplied by two.

Task 2

Calculate the overall-F-test for lm2 and compare your results with the results of the output of summary(lm2).

Solution 2

The part of the variation explained by the model is calculated from the fitted values (X %*% betas) and the average response (mean(y)):

ESS <- sum((X %*% betas - mean(y))^2)

The test statistic of the F-test is then:

F_stat <- ESS/RSS * resid_df / (ncol(X) - 1)

The p-value is the upper tail probability of this distribution at the observed test statistic, i.e., the area under the density for values at least as large as the test statistic:

p_val <- pf(F_stat, ncol(X) - 1, resid_df, lower.tail = FALSE)

We find the results of the overall F-test in the last line of the output of summary(lm2):

## [...]
##  
## Residual standard error: 15.78 on 2316 degrees of freedom
##   (439 observations deleted due to missingness)
## Multiple R-squared:  0.2354, Adjusted R-squared:  0.2324 
## F-statistic:  79.2 on 9 and 2316 DF,  p-value: < 2.2e-16

The value of the test statistic, degrees of freedom and p-value are elements in the summary.lm object, and we can print them next to our results for comparison:

cbind("summary" = summary(lm2)$fstatistic, 
      "calculated" = c(value = F_stat,
                       numdf = ncol(X) - 1,
                       dendf = resid_df)
)
##        summary calculated
## value   79.205     79.205
## numdf    9.000      9.000
## dendf 2316.000   2316.000

We can obtain the same results with the help of the function anova().

# fit a model with just an intercept, no covariates:
lm0 <- lm(SBP ~ 1, data = lm2$model)

anova(lm2, lm0)
## Analysis of Variance Table
## 
## Model 1: SBP ~ gender + age + race + smoke + albu
## Model 2: SBP ~ 1
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   2316 576731                                  
## 2   2325 754244 -9   -177513 79.205 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we use lm2$model, i.e., the model frame from lm2, because this is the relevant subset of nhanes in which observations with missing values were excluded.

Because lm0 does not involve any covariates, only observations with missing values in SBP would be excluded. We would then get an error when we use anova() because the two models, lm0 and lm2 were fitted on different sample sizes. By using lm2$model we can use the same data used to fit lm2.