For this practical, we will use the birthwt dataset from the package MASS.
You can either load the MASS package to get access to this dataset
library(MASS)
or make a copy of this data
birthwt <- MASS::birthwt
The birthwt data have the following 10 variables:
str(birthwt)
## 'data.frame': 189 obs. of 10 variables:
## $ low : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : int 19 33 20 21 18 21 22 17 29 26 ...
## $ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
## $ race : int 2 3 1 1 1 3 1 3 1 1 ...
## $ smoke: int 0 0 1 1 1 0 0 0 1 1 ...
## $ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ht : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ui : int 1 0 0 1 1 0 0 0 0 0 ...
## $ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
## $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
Some of the variables are actually categorical, but we will not re-code them to factors yet, in order to investigate how the variable coding can change the results.
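Since all variables are stored as integers, str() alone does not show which of them are categorical. As a quick check (a small sketch, not part of the exercise itself), we could for example tabulate race and smoke; only a few distinct values suggest a categorical variable:
table(birthwt$race)
table(birthwt$smoke)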
Write the model formula with bwt (birth weight) as the outcome and age (mother’s age), lwt (mother’s weight in pounds), race (mother’s race) and smoke (smoking status during pregnancy) as covariates. Use ~ to separate the outcome from the linear predictor and + to separate the covariates.
bwt ~ age + lwt + race + smoke
## bwt ~ age + lwt + race + smoke
Fit a linear regression model with the formula that we have just specified.
Use lm() and don’t forget to specify the data argument.
mod1 <- lm(bwt ~ age + lwt + race + smoke, data = birthwt)
Note that if we don’t assign a name to the model, the results are just printed, but we cannot use or investigate them any further.
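For example, calling lm() without an assignment just prints the fitted model once; there is then no object left to pass to summary() or coef():
lm(bwt ~ age + lwt + race + smoke, data = birthwt)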
Get the summary()
of the model.
summary(mod1)
##
## Call:
## lm(formula = bwt ~ age + lwt + race + smoke, data = birthwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2269.55 -437.06 -2.89 456.40 1700.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3066.311 354.446 8.651 2.5e-15 ***
## age 1.278 9.781 0.131 0.896148
## lwt 3.058 1.692 1.807 0.072346 .
## race -210.345 60.077 -3.501 0.000581 ***
## smoke -408.542 110.155 -3.709 0.000276 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 688.1 on 184 degrees of freedom
## Multiple R-squared: 0.1285, Adjusted R-squared: 0.1095
## F-statistic: 6.78 on 4 and 184 DF, p-value: 4.117e-05
Investigate the model summary closely. Where do you find the regression coefficients, and what do they tell you? Compare them with the output of coef().
The regression coefficients are given in the first column (“Estimate”) of the output under the header “Coefficients”.
## [...]
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3066.311 354.446 8.651 2.5e-15 ***
## age 1.278 9.781 0.131 0.896148
## lwt 3.058 1.692 1.807 0.072346 .
## race -210.345 60.077 -3.501 0.000581 ***
## smoke -408.542 110.155 -3.709 0.000276 ***
## [...]
They are the same as:
coef(mod1)
## (Intercept) age lwt race smoke
## 3066.310971 1.278421 3.057756 -210.345356 -408.542106
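Because coef() returns a named vector, a single coefficient can be extracted by name, for example:
coef(mod1)["smoke"]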
The p-values are given in the last column (Pr(>|t|)) of the output under the header “Coefficients”. We can see that the intercept and the effects for race and smoke significantly differ from 0.
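The full coefficient table (estimates, standard errors, t-values and p-values) is also stored as a matrix in the summary object, so the p-values can be extracted directly, for instance:
summary(mod1)$coefficients[, "Pr(>|t|)"]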
What does the model summary tell you about the model fit?
In the lowest part of the output of summary()
:
## [...]
## Multiple R-squared: 0.1285, Adjusted R-squared: 0.1095
## [...]
We can read there that the R-squared is 0.1285 and that the adjusted R-squared is 0.1095.
In the last row we see the F-statistic which tests the model against the null-model (intercept only).
## [...]
## F-statistic: 6.78 on 4 and 184 DF, p-value: 4.117e-05
It shows that the model explains significantly more variation in the outcome than the null-model (p-value: 4.117e-05).
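Both the R-squared values and the F-test can also be obtained directly, for example by extracting them from the summary object and by comparing the model against the intercept-only model with anova() (a small sketch):
summary(mod1)$r.squared
summary(mod1)$adj.r.squared
mod0 <- lm(bwt ~ 1, data = birthwt)  # intercept-only (null) model
anova(mod0, mod1)                    # reproduces the F-test from the summary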
Get the 90% confidence intervals of the model using confint().
confint(mod1, level = 0.9)
## 5 % 95 %
## (Intercept) 2480.3496976 3652.272245
## age -14.8907558 17.447597
## lwt 0.2607886 5.854724
## race -309.6629066 -111.027806
## smoke -590.6480521 -226.436160
Combine the coefficients and 95% confidence intervals of the model into one matrix.
cbind(estimate = coef(mod1),
confint(mod1))
## estimate 2.5 % 97.5 %
## (Intercept) 3066.310971 2367.0109618 3765.610981
## age 1.278421 -18.0182561 20.575097
## lwt 3.057756 -0.2802109 6.395723
## race -210.345356 -328.8732642 -91.817448
## smoke -408.542106 -625.8716388 -191.212574
Note: Because coef() returns a vector, the corresponding column in our matrix would not have a name; we can assign the name when we use cbind(). confint() returns a matrix which already has column names.
Now we will re-code those covariates that are actually categorical variables. Re-code race and smoke as factors using factor().
birthwt$race <- factor(birthwt$race)
birthwt$smoke <- factor(birthwt$smoke)
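We can confirm that the re-coding worked, for example by checking the structure of the two variables (race should now be a factor with 3 levels, smoke with 2):
str(birthwt[, c("race", "smoke")])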
How did the results change?
mod2 <- lm(bwt ~ age + lwt + race + smoke, data = birthwt)
summary(mod2)
##
## Call:
## lm(formula = bwt ~ age + lwt + race + smoke, data = birthwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2281.9 -449.1 24.3 474.1 1746.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2839.433 321.435 8.834 8.2e-16 ***
## age -1.948 9.820 -0.198 0.84299
## lwt 4.000 1.738 2.301 0.02249 *
## race2 -510.501 157.077 -3.250 0.00137 **
## race3 -398.644 119.579 -3.334 0.00104 **
## smoke1 -401.720 109.241 -3.677 0.00031 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 682.1 on 183 degrees of freedom
## Multiple R-squared: 0.1483, Adjusted R-squared: 0.125
## F-statistic: 6.373 on 5 and 183 DF, p-value: 1.758e-05
When we compare the “Coefficients” part of the summary for mod2
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2839.433 321.435 8.834 8.2e-16 ***
## age -1.948 9.820 -0.198 0.84299
## lwt 4.000 1.738 2.301 0.02249 *
## race2 -510.501 157.077 -3.250 0.00137 **
## race3 -398.644 119.579 -3.334 0.00104 **
## smoke1 -401.720 109.241 -3.677 0.00031 ***
with the “Coefficients” part of the summary for mod1
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3066.311 354.446 8.651 2.5e-15 ***
## age 1.278 9.781 0.131 0.896148
## lwt 3.058 1.692 1.807 0.072346 .
## race -210.345 60.077 -3.501 0.000581 ***
## smoke -408.542 110.155 -3.709 0.000276 ***
we notice that in the summary for mod2 we have coefficients for race2 and race3, whereas in the summary for mod1 we only had one coefficient, and that the name of the coefficient for smoke is now smoke1.
Note: R automatically converts a factor into dummy variables! Because we now have more parameters, the results for the model fit have of course also changed slightly.
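To see how the factor is expanded into dummy variables, we can inspect the design matrix that lm() used for mod2, for example:
head(model.matrix(mod2))  # one 0/1 column per non-reference level: race2, race3, smoke1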
What is the difference in the interpretation of the effects for race
and smoke
in mod1
and mod2
?
For smoke
there is no real difference. smoke
has only two categories, therefore there is one dummy variable smoke1
which is 1 if smoke == 1
and 0 otherwise. Since this is the same as the original variable smoke
, nothing really changes.
For race
, the assumption in mod1
is that the effect of category 3
compared to 1
is twice as large as the effect of category 2
compared to 1
because race
is treated as continuous and the coefficient is the effect of a “one unit increase” (and 3
vs 1
is a two unit increase).
In mod2
there are two dummy variables for race
, which measure the effect of each category compared to the reference category 1
. The effect of category 3 is not assumed to be double the effect of category 2.
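As a small numeric illustration using the coefficients estimated above, we can compare the implied race effects in the two models:
coef(mod1)["race"]               # mod1: effect of category 2 vs 1 (one unit)
2 * coef(mod1)["race"]           # mod1: implied effect of category 3 vs 1 (two units)
coef(mod2)[c("race2", "race3")]  # mod2: freely estimated effects of 2 vs 1 and 3 vs 1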
© Nicole Erler