For this practical, we will use the birthwt dataset from the package MASS.
You can either load the MASS package to get access to this dataset
library(MASS)
or make a copy of this data
birthwt <- MASS::birthwt
The birthwt data have the following 10 variables:
str(birthwt)
## 'data.frame': 189 obs. of 10 variables:
## $ low : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : int 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: int 0 0 1 1 1 0 0 0 1 1 ...
## $ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ht : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ui : int 1 0 0 1 1 0 0 0 0 0 ...
## $ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
Some of the variables are actually categorical, but we will not re-code them to factors yet, in order to investigate how the variable coding can change the results.
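Since all variables are stored as integers, str() alone does not show which of them are categorical. As a quick check (a small sketch, not part of the exercise itself), we could for example tabulate race and smoke; only a few distinct values suggest a categorical variable:
table(birthwt$race)
table(birthwt$smoke)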
Write the model formula with bwt (birth weight) as the outcome and age (mother’s age), lwt (mother’s weight in pounds), race (mother’s race) and smoke (smoking status during pregnancy) as covariates. Use ~ to separate the outcome from the linear predictor and + to separate the covariates.
bwt ~ age + lwt + race + smoke
## bwt ~ age + lwt + race + smoke
Fit a linear regression model with the formula that we have just specified.
Use lm() and don’t forget to specify the data argument.
mod1 <- lm(bwt ~ age + lwt + race + smoke, data = birthwt)
Note that if we don’t assign a name to the model, the results are just printed, but we cannot use or investigate them any further.
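For example, calling lm() without an assignment just prints the fitted model once; there is then no object left to pass to summary() or coef():
lm(bwt ~ age + lwt + race + smoke, data = birthwt)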
Get the summary()
of the model.
summary(mod1)
##
## Call:
## lm(formula = bwt ~ age + lwt + race + smoke, data = birthwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2269.55 -437.06 -2.89 456.40 1700.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3066.311 354.446 8.651 2.5e-15 ***
## age 1.278 9.781 0.131 0.896148
## lwt 3.058 1.692 1.807 0.072346 .
## race -210.345 60.077 -3.501 0.000581 ***
## smoke -408.542 110.155 -3.709 0.000276 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 688.1 on 184 degrees of freedom
## Multiple R-squared: 0.1285, Adjusted R-squared: 0.1095
## F-statistic: 6.78 on 4 and 184 DF, p-value: 4.117e-05
Investigate the model summary closely. Where do you find the regression coefficients, and what do they tell you? Compare them with the output of coef().
The regression coefficients are given in the first column (“Estimate”) of the output under the header “Coefficients”.
## [...]
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3066.311 354.446 8.651 2.5e-15 ***
## age 1.278 9.781 0.131 0.896148
## lwt 3.058 1.692 1.807 0.072346 .
## race -210.345 60.077 -3.501 0.000581 ***
## smoke -408.542 110.155 -3.709 0.000276 ***
## [...]
They are the same as:
coef(mod1)
## (Intercept) age lwt race smoke
## 3066.310971 1.278421 3.057756 -210.345356 -408.542106
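Because coef() returns a named vector, a single coefficient can be extracted by name, for example:
coef(mod1)["smoke"]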
The p-values are given in the last column (Pr(>|t|)) of the output under the header “Coefficients”. We can see that the intercept and the effects for race and smoke significantly differ from 0.
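The full coefficient table (estimates, standard errors, t-values and p-values) is also stored as a matrix in the summary object, so the p-values can be extracted directly, for instance:
summary(mod1)$coefficients[, "Pr(>|t|)"]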
What does the model summary tell you about the model fit?
In the lowest part of the output of summary()
:
## [...]
## Multiple R-squared: 0.1285, Adjusted R-squared: 0.1095
## [...]
We can read there that the R-squared is 0.1285 and that the adjusted R-squared is 0.1095.
In the last row we see the F-statistic which tests the model against the null-model (intercept only).
## [...]
## F-statistic: 6.78 on 4 and 184 DF, p-value: 4.117e-05
It shows that the model explains significantly more variation in the outcome than the null-model (p-value: 4.117e-05).
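Both the R-squared values and the F-test can also be obtained directly, for example by extracting them from the summary object and by comparing the model against the intercept-only model with anova() (a small sketch):
summary(mod1)$r.squared
summary(mod1)$adj.r.squared
mod0 <- lm(bwt ~ 1, data = birthwt)  # intercept-only (null) model
anova(mod0, mod1)                    # reproduces the F-test from the summary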
Get the 90% confidence intervals of the model using confint().
confint(mod1, level = 0.9)
## 5 % 95 %
## (Intercept) 2480.3496976 3652.272245
## age -14.8907558 17.447597
## lwt 0.2607886 5.854724
## race -309.6629066 -111.027806
## smoke -590.6480521 -226.436160
Combine the coefficients and 95% confidence intervals of the model into one matrix.
cbind(estimate = coef(mod1),
confint(mod1))
## estimate 2.5 % 97.5 %
## (Intercept) 3066.310971 2367.0109618 3765.610981
## age 1.278421 -18.0182561 20.575097
## lwt 3.057756 -0.2802109 6.395723
## race -210.345356 -328.8732642 -91.817448
## smoke -408.542106 -625.8716388 -191.212574
Note: Because coef() returns a vector, the corresponding column in our matrix would not have a name; we can assign the name when we use cbind(). confint() returns a matrix which already has column names.
Now we will re-code those covariates that are actually categorical variables. Re-code race and smoke as factors using factor().
birthwt$race <- factor(birthwt$race)
birthwt$smoke <- factor(birthwt$smoke)
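We can confirm that the re-coding worked, for example by checking the structure of the two variables (race should now be a factor with 3 levels, smoke with 2):
str(birthwt[, c("race", "smoke")])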
How did the results change?
mod2 <- lm(bwt ~ age + lwt + race + smoke, data = birthwt)
summary(mod2)
##
## Call:
## lm(formula = bwt ~ age + lwt + race + smoke, data = birthwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2281.9 -449.1 24.3 474.1 1746.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2839.433 321.435 8.834 8.2e-16 ***
## age -1.948 9.820 -0.198 0.84299
## lwt 4.000 1.738 2.301 0.02249 *
## race2 -510.501 157.077 -3.250 0.00137 **
## race3 -398.644 119.579 -3.334 0.00104 **
## smoke1 -401.720 109.241 -3.677 0.00031 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 682.1 on 183 degrees of freedom
## Multiple R-squared: 0.1483, Adjusted R-squared: 0.125
## F-statistic: 6.373 on 5 and 183 DF, p-value: 1.758e-05
When we compare the “Coefficients” part of the summary for mod2
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2839.433 321.435 8.834 8.2e-16 ***
## age -1.948 9.820 -0.198 0.84299
## lwt 4.000 1.738 2.301 0.02249 *
## race2 -510.501 157.077 -3.250 0.00137 **
## race3 -398.644 119.579 -3.334 0.00104 **
## smoke1 -401.720 109.241 -3.677 0.00031 ***
with the “Coefficients” part of the summary for mod1
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3066.311 354.446 8.651 2.5e-15 ***
## age 1.278 9.781 0.131 0.896148
## lwt 3.058 1.692 1.807 0.072346 .
## race -210.345 60.077 -3.501 0.000581 ***
## smoke -408.542 110.155 -3.709 0.000276 ***
we notice that in the summary for mod2 we have coefficients for race2 and race3, whereas in the summary for mod1 we only had one coefficient, and that the name of the coefficient for smoke is now smoke1.
Note: R automatically converts a factor into dummy variables! Because we now have more parameters, the results for the model fit have of course also changed slightly.
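To see how the factor is expanded into dummy variables, we can inspect the design matrix that lm() used for mod2, for example:
head(model.matrix(mod2))  # one 0/1 column per non-reference level: race2, race3, smoke1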
What is the difference in the interpretation of the effects for race
and smoke
in mod1
and mod2
?
For smoke
there is no real difference. smoke
has only two categories, therefore there is one dummy variable smoke1
which is 1 if smoke == 1
and 0 otherwise. Since this is the same as the original variable smoke
, nothing really changes.
For race
, the assumption in mod1
is that the effect of category 3
compared to 1
is twice as large as the effect of category 2
compared to 1
because race
is treated as continuous and the coefficient is the effect of a “one unit increase” (and 3
vs 1
is a two unit increase).
In mod2
there are two dummy variables for race
, which measure the effect of each category compared to the reference category 1
. The effect of category 3 is not assumed to be double the effect of category 2.
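As a small numeric illustration using the coefficients estimated above, we can compare the implied race effects in the two models:
coef(mod1)["race"]               # mod1: effect of category 2 vs 1 (one unit)
2 * coef(mod1)["race"]           # mod1: implied effect of category 3 vs 1 (two units)
coef(mod2)[c("race2", "race3")]  # mod2: freely estimated effects of 2 vs 1 and 3 vs 1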
© Nicole Erler