In this practical, we will perform model diagnostics for a fitted linear regression model.
The aim is to revisit the theory on model diagnostics and to learn how to perform the necessary steps in R.
Note that for several tasks of this practical the given solution is only one of many possible solutions. In some cases it may not be the most efficient or elegant solution, because the focus is on keeping the syntax simple and easier to understand.
Most of the plots are created using the base graphics. For some plots, example syntax showing how to create them using ggplot2 is available. These examples are intended as just that, examples, and the explanations are not sufficient to understand the syntax if you are completely new to ggplot2. If you are new to ggplot2, chapter 2 of the ggplot2 book provides a nice introduction to the basics of ggplot2 graphics.
This practical is intended to be “learning by doing”, i.e., you may not be able to solve all tasks by yourself without the hints or even checking the solution, and you will come across functions that you haven’t seen in the lecture slides.
When a function is new to you, you can find out more about it in the help file, which you can access by typing ?<function name>.
There are a few parts indicated as “Advanced Level” containing some more complex solutions (for those of you for whom working with R is a piece of cake).
For this practical we will use the same subset of the NHANES (National Health and Nutrition Examination Survey) data as used in the first practical.
To load this dataset into R, you can use the command file.choose(), which opens the file explorer and allows you to navigate to the location of the file on your computer. If you know the path to the file, you can also use load("<path>/nhanes.RData").
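For example:
load(file.choose())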
Here is a short overview of the data:
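The first rows can be displayed, for instance, with:
head(nhanes)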
## SBP gender age race WC alc educ creat albu uricacid bili occup smoke
## 1 110.6667 male 22 Non-Hispanic White 81.0 <NA> low 0.91 4.8 4.9 0.6 working never
## 2 118.0000 female 44 Non-Hispanic White 80.1 <NA> high 0.89 3.7 4.5 0.3 working never
## 3 124.6667 male 21 other 69.6 <1 low 0.87 4.4 5.4 0.2 not working never
## 4 102.0000 female 43 Non-Hispanic Black 120.4 >=1 low 0.68 4.4 5.0 0.8 not working current
## 5 146.6667 male 51 other 81.1 <NA> low 0.99 4.1 5.0 0.5 looking for work current
## 6 122.0000 male 80 Non-Hispanic White 112.5 <NA> low 1.01 4.2 4.8 0.7 not working never
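And the structure of the data with:
str(nhanes)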
## 'data.frame': 2765 obs. of 13 variables:
## $ SBP : num 111 118 125 102 147 ...
## $ gender : Factor w/ 2 levels "male","female": 1 2 1 2 1 1 1 2 1 1 ...
## $ age : num 22 44 21 43 51 80 26 30 70 35 ...
## $ race : Factor w/ 5 levels "Mexican American",..: 3 3 5 4 5 3 4 5 4 4 ...
## $ WC : num 81 80.1 69.6 120.4 81.1 ...
## $ alc : Factor w/ 2 levels "<1",">=1": NA NA 1 2 NA NA 2 2 1 NA ...
## $ educ : Factor w/ 2 levels "low","high": 1 2 1 1 1 1 1 2 1 2 ...
## $ creat : num 0.91 0.89 0.87 0.68 0.99 1.01 1.03 0.91 0.8 1.1 ...
## $ albu : num 4.8 3.7 4.4 4.4 4.1 4.2 5 4.8 3.7 4 ...
## $ uricacid: num 4.9 4.5 5.4 5 5 4.8 5.4 6.7 5.4 6.7 ...
## $ bili : num 0.6 0.3 0.2 0.8 0.5 0.7 1.1 0.9 0.9 0.9 ...
## $ occup : Factor w/ 3 levels "working","looking for work",..: 1 1 3 3 2 3 1 1 3 1 ...
## $ smoke : Ord.factor w/ 3 levels "never"<"former"<..: 1 1 1 3 3 1 1 1 2 1 ...
Fit a linear regression model on the nhanes data with SBP as the response and gender, age, waist circumference (WC), creatinine (creat) and albumin (albu) as covariates.
Note that some of the covariates in the model have missing values. In the following steps, we need to plot residuals and fitted values against the covariates, so it is convenient to make sure that residuals and fitted values will have the same length as the original data. Specify the argument na.action so that fitted values and residuals have NA values for incomplete cases (see the slides on missing values in Linear Regression with R).
When we use the option na.exclude
, the vectors of the
fitted values and residuals will have the same length as the input data,
and have NA
values for observations with missing values.
This is convenient for this practical because we will need to plot
residuals against covariate values, and when we take these covariate
values from the input data, they will have the length of the original
data.
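The model fit could, for example, look like this (stored as lm1, the name used throughout this practical):
lm1 <- lm(SBP ~ gender + age + WC + creat + albu, data = nhanes,
          na.action = "na.exclude")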
In the lectures, we’ve seen different types of residuals. In R, we have the following options to calculate them:
resid(lm1)
or residuals(lm1)
lm1$resid
or lm1$residuals
rstandard(lm1)
rstudent(lm1)
resid(lm1, type = "partial")
or
residuals(lm1, type = "partial")
Calculate the different types of residuals and combine them into a
data.frame
.
Make sure the columns have meaningful names, i.e., that it is clear they contain residuals and which type of residuals.
You can rename columns using colnames(). The function paste0() allows you to concatenate strings, which is useful for changing the variable names.
We first obtain the matrix
of partial residuals and
re-name the columns before combining it into a data.frame
with the other types of residuals.
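For example:
part_resid <- resid(lm1, type = "partial")
head(part_resid)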
## gender age WC creat albu
## 1 -0.32886268 -12.3572653 -4.0367656 -2.169207 -1.302762
## 2 0.06174613 0.7849141 0.0621059 1.994264 1.024472
## 3 16.07153418 3.6365442 11.1174041 14.163616 14.417626
## 4 -20.77244221 -20.4558616 -16.3665601 -19.194686 -18.619701
## 5 24.92404036 24.6866707 21.2270693 23.218844 22.760126
## 6 -15.17003887 -3.6163756 -15.4344193 -16.841449 -17.163951
The partial residuals have the variable names as column names. Because later we want to plot them against the actual covariates (and will have residuals and data in the same data.frame), it is important that we can distinguish columns containing data from columns containing residuals. Therefore, we add the prefix "rpart_" to each column name.
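For example:
colnames(part_resid) <- paste0("rpart_", colnames(part_resid))
colnames(part_resid)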
## [1] "rpart_gender" "rpart_age" "rpart_WC" "rpart_creat" "rpart_albu"
The function paste0()
creates character strings from one
or more objects.
When we provide multiple objects as we did above, where we had the
character string "rpart_"
and a vector of character strings
(the column names), the objects will be converted to characters and
concatenated:
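For example:
paste0("abc", 123, "-", 16)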
## [1] "abc123-16"
When we provide one object with multiple elements, e.g., a
vector
or a list
, the different elements are
only combined into a single string when we specify the additional
argument collapse
:
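Without collapse:
paste0(c("abc", 123, "-", 16))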
## [1] "abc" "123" "-" "16"
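And with collapse:
paste0(c("abc", 123, "-", 16), collapse = "_")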
## [1] "abc_123_-_16"
The value passed to collapse can be any character string:
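paste0(c("abc", 123, "-", 16), collapse = "_ something _")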
## [1] "abc_ something _123_ something _-_ something _16"
We can now combine the partial residuals with the other types of
residuals into a data.frame
:
dat_resid <- data.frame(
resid = resid(lm1),
rstandard = rstandard(lm1),
rstudent = rstudent(lm1),
part_resid
)
head(dat_resid)
## resid rstandard rstudent rpart_gender rpart_age rpart_WC rpart_creat rpart_albu
## 1 -2.201540 -0.1414100 -0.1413806 -0.32886268 -12.3572653 -4.0367656 -2.169207 -1.302762
## 2 1.995717 0.1282378 0.1282110 0.06174613 0.7849141 0.0621059 1.994264 1.024472
## 3 14.198856 0.9123318 0.9122993 16.07153418 3.6365442 11.1174041 14.163616 14.417626
## 4 -18.838471 -1.2103031 -1.2104227 -20.77244221 -20.4558616 -16.3665601 -19.194686 -18.619701
## 5 23.051363 1.4806115 1.4809867 24.92404036 24.6866707 21.2270693 23.218844 22.760126
## 6 -17.042717 -1.0947569 -1.0948031 -15.17003887 -3.6163756 -15.4344193 -16.841449 -17.163951
When we plot residuals, we often plot them against the fitted values
or against the covariates. For the following tasks, it is convenient to
have all the values in the same data.frame
.
Add the fitted values and the covariates used in the model as columns
to the data.frame
containing the residuals.
You can obtain the fitted values using fitted.values() (see slide 10 of Linear Regression in R).
We can get the fitted values directly from the fitted model object,
but the resulting vector is shorter than the original data because
NA
values were omitted. Since we used
na.action = "na.exclude"
when we fitted the model, this is
not the case when we use fitted.values()
(or its
abbreviation fitted()
).
There are many ways in which we could add the original variables as columns to dat_resid. Some examples follow (note: if you run all of them, you will create copies of some columns).
We could add each variable separately using the $
operator:
dat_resid$SBP <- nhanes$SBP
dat_resid$gender <- nhanes$gender
dat_resid$age <- nhanes$age
dat_resid$WC <- nhanes$WC
dat_resid$creat <- nhanes$creat
dat_resid$albu <- nhanes$albu
This is a lot of typing and hence susceptible to typos and/or
copy-paste mistakes. Moreover, if we change the covariates used in
lm1
, we also need to remember to change the syntax
here.
We could also add all variables at the same time using
cbind()
:
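A sketch of this option:
dat_resid <- cbind(dat_resid,
                   nhanes[c("SBP", "gender", "age", "WC", "creat", "albu")])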
This option requires less typing, but we still need to remember to
change the variable names here if we change the covariates in
lm1
.
We can extract the model formula from lm1
using the
function formula()
, and the function
all.vars()
allows us to get the names of all variables used
in a formula
:
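For example:
formula(lm1)
all.vars(formula(lm1))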
## SBP ~ gender + age + WC + creat + albu
## [1] "SBP" "gender" "age" "WC" "creat" "albu"
Using these two functions, we can extract the correct columns from
the nhanes
data using syntax that will remain correct even
if we change the covariates in lm1
:
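For instance:
dat_resid <- cbind(dat_resid, nhanes[all.vars(formula(lm1))])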
Plot the usual residuals, standardized and studentized residuals against the fitted values.
Add a horizontal reference line to each plot to indicate zero.
You can add the reference line using abline().
The global graphics parameters can be controlled via the function
par()
. To display multiple plots in one figure, we can
specify the mfrow
argument. It takes a numeric vector of
length 2 that specifies the number of rows and number of columns of the
plot layout.
par()
allows you to change many other settings. Here is
some additional information on the ones used in this solution:
The graphics setting mar
controls the margins of the
plot. The values give the size of bottom, left, top and right margin in
number of lines. The default, c(5, 4, 4, 2) + 0.1
can
result in much white space between plots.
mgp
sets the margin line for the axis titles, axis
labels and axis line (i.e., how far they are from the box around the
plot). We can set this here to move the axis text and title towards the
plot to save some space.
To reset the graphical settings, call dev.off() (without any arguments). This removes all plots and restores the default settings.
par(mfrow = c(2, 2), mar = c(3.5, 3.5, 2, 1), mgp = c(2, 0.6, 0))
plot(resid ~ fit, data = dat_resid,
ylab = "residuals", xlab = "fitted values")
abline(h = 0, lty = 2)
plot(rstandard ~ fit, data = dat_resid,
ylab = "standardized residuals", xlab = "fitted values")
abline(h = 0, lty = 2)
plot(rstudent ~ fit, data = dat_resid,
ylab = "studentized residuals", xlab = "fitted values")
abline(h = 0, lty = 2)
abline() draws straight lines in (base R) graphics.
Using the argument h
we specify that we want a
horizontal line at the position 0 on the y-axis. Alternative arguments
are v
(for vertical lines) and a
and
b
for general lines, where a
is the intercept
and b
the slope.
lty
controls the line type. To get more
information about graphical parameters such as lty
, see the
help for ?par
.
All three plots look very similar and don’t show an obvious pattern. There might be a bit more variation in the residuals for larger observations.
Here I used the formula specification for the plot, i.e., using
arguments formula
(without using the name of this argument)
and data
.
Alternatively, we could also create these plots by using a
specification using the arguments x
and y
:
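For example, for the first plot:
plot(x = dat_resid$fit, y = dat_resid$resid,
     xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = 2)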
For the specification using the arguments x
and
y
we need to provide vectors containing the values to be
plotted. Therefore we need to use dat_resid$fit
and cannot
just use the name of the variable, i.e., fit
.
To create the figure using the ggplot2 package, we could use two different approaches:
The first approach is to create the three plots separately and combine them using ggpubr::ggarrange() (or cowplot::plot_grid()). The :: syntax allows us to use a function from a package without loading this package, i.e., <package>::<function>. This can be convenient when we don’t want functions from one package to interfere with functions from other packages (that may have the same function name), or to make clear which package the function is from.
In the following, we first create 3 separate plots and then combine them into one figure in a second step, to keep things simple.
A ggplot
object is built up in layers:
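The first layer could look like this:
ggplot(dat_resid, aes(x = fit, y = resid))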
This syntax just specifies the data (the first argument, which is used
here without a name), and the basic aesthetics (aes()
). To
actually plot the data we need to specify how we want to add them, i.e.,
as dots, or lines, or bars, …
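For example, adding the data as dots with geom_point():
ggplot(dat_resid, aes(x = fit, y = resid)) +
  geom_point()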
## Warning: Removed 405 rows containing missing values (geom_point).
We get a warning message that 405 rows were removed because they contain
missing values. We know about these missing values in the variable
resid
(otherwise we should now investigate why there are
missing values and if we might have introduced them by mistake somewhere
in our syntax). The warning can be prevented by specifying
na.rm = TRUE
in geom_point()
to indicate that
we would like to remove the cases with missing values.
The next layer then adds a horizontal line at the position
0
on the y-axis, and uses line type lty = 2
,
i.e., a dashed line:
ggplot(dat_resid, aes(x = fit, y = resid)) +
geom_point(na.rm = TRUE) +
geom_hline(yintercept = 0, lty = 2)
We could adjust the colours of the dots and the line by specifying different values in the corresponding functions, for example:
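A sketch (the colours chosen here are just an example):
ggplot(dat_resid, aes(x = fit, y = resid)) +
  geom_point(na.rm = TRUE, colour = "grey") +
  geom_hline(yintercept = 0, lty = 2, colour = "red")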
By default, the axes are labelled with the name of the variable that
is shown on the axis, but we can change this to get more informative
titles using xlab()
and ylab()
.
ggplot(dat_resid, aes(x = fit, y = resid)) +
geom_point(na.rm = TRUE) +
geom_hline(yintercept = 0, lty = 2) +
xlab("fitted values") +
ylab("residuals")
We can “store” the ggplot
object by assigning it to a
name, so that we can call it again later.
p_resid <- ggplot(dat_resid, aes(x = fit, y = resid)) +
geom_point(na.rm = TRUE) +
geom_hline(yintercept = 0, lty = 2) +
xlab("fitted values") +
ylab("residuals")
The above syntax will have no output, but this does:
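p_resid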
## Warning: Using the `size` aesthetic with geom_segment was deprecated in ggplot2 3.4.0.
## ℹ Please use the `linewidth` aesthetic instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
We can create the other two plots (for the standardized and studentized residuals) in the same manner:
p_rstand <- ggplot(dat_resid, aes(x = fit, y = rstandard)) +
geom_point(na.rm = TRUE) +
geom_hline(yintercept = 0, lty = 2) +
xlab("fitted values") +
ylab("standardized residuals")
p_rstud <- ggplot(dat_resid, aes(x = fit, y = rstudent)) +
geom_point(na.rm = TRUE) +
geom_hline(yintercept = 0, lty = 2) +
xlab("fitted values") +
ylab("studentized residuals")
To combine them into one figure, we can use ggpubr::ggarrange(p_resid, p_rstand, p_rstud).
Alternatively, we could first convert the data into a long format, so that we have multiple rows per observation, one for each type of residual:
dat_resid_long <- reshape(dat_resid, direction = "long",
varying = c("resid", "rstandard", "rstudent"),
v.names = "residual", timevar = "type",
times = c("usual", "standardized", "studentized"))
head(dat_resid_long)
## rpart_gender rpart_age rpart_WC rpart_creat rpart_albu fit SBP gender age WC creat albu type residual id
## 1.usual -0.32886268 -12.3572653 -4.0367656 -2.169207 -1.302762 112.8682 110.6667 male 22 81.0 0.91 4.8 usual -2.201540 1
## 2.usual 0.06174613 0.7849141 0.0621059 1.994264 1.024472 116.0043 118.0000 female 44 80.1 0.89 3.7 usual 1.995717 2
## 3.usual 16.07153418 3.6365442 11.1174041 14.163616 14.417626 110.4678 124.6667 male 21 69.6 0.87 4.4 usual 14.198856 3
## 4.usual -20.77244221 -20.4558616 -16.3665601 -19.194686 -18.619701 120.8385 102.0000 female 43 120.4 0.68 4.4 usual -18.838471 4
## 5.usual 24.92404036 24.6866707 21.2270693 23.218844 22.760126 123.6153 146.6667 male 51 81.1 0.99 4.1 usual 23.051363 5
## 6.usual -15.17003887 -3.6163756 -15.4344193 -16.841449 -17.163951 139.0427 122.0000 male 80 112.5 1.01 4.2 usual -17.042717 6
The new variable type
contains the information on what
type of residual it is (“usual”, “standardized” or “studentized”), and
the new column residual
contains the value of the
residual.
ggplot(dat_resid_long, aes(x = fit, y = residual)) +
geom_point() +
geom_hline(yintercept = 0, lty = 2) +
facet_wrap("type", scales = "free_y", ncol = 2)
The function facet_wrap()
creates a separate plot for
each different value of type
and arranges them in
ncol = 2
columns. The argument
scales = "free_y"
specifies that the y-axes of the
different plots are not restricted to the same scale.
Plot the standardized residuals against each of the covariates and interpret what you see.
par(mfrow = c(2, 3), mar = c(3.5, 3.5, 2, 1), mgp = c(2, 0.6, 0))
plot(rstandard ~ age, data = dat_resid,
ylab = "standardized residuals")
abline(h = 0, lty = 2)
plot(rstandard ~ gender, data = dat_resid,
ylab = "standardized residuals")
abline(h = 0, lty = 2)
plot(rstandard ~ WC, data = dat_resid,
ylab = "standardized residuals")
abline(h = 0, lty = 2)
plot(rstandard ~ creat, data = dat_resid,
ylab = "standardized residuals")
abline(h = 0, lty = 2)
plot(rstandard ~ albu, data = dat_resid,
ylab = "standardized residuals")
abline(h = 0, lty = 2)
We see that for age
the scatter seems to be spread a bit
wider for larger values. The other plots do not show clear patterns.
creat
is skewed which makes it a lot harder to detect if
there are any trends or patterns in this residual plot.
For the categorical covariate gender, plot() creates box plots by default. To get a scatter plot, we could use plot.default() instead of plot().
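For example (converting the factor to numeric for plotting):
plot.default(x = as.numeric(dat_resid$gender), y = dat_resid$rstandard,
             xlab = "gender", ylab = "standardized residuals")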
To investigate if the assumption of a linear association between the covariates and the response is reasonable, plot the partial residuals against the corresponding covariates.
par(mfrow = c(2, 3), mar = c(3.5, 3.5, 2, 1), mgp = c(2, 0.6, 0))
plot(rpart_age ~ age, data = dat_resid,
xlab = "age", ylab = "partial residuals")
plot(rpart_gender ~ gender, data = dat_resid,
xlab = "gender", ylab = "partial residuals")
plot(rpart_WC ~ WC, data = dat_resid,
xlab = "waist circumference", ylab = "partial residuals")
plot(rpart_creat ~ creat, data = dat_resid,
xlab = "creatinine", ylab = "partial residuals")
plot(rpart_albu ~ albu, data = dat_resid,
xlab = "albumin", ylab = "partial residuals")
If the assumption of linear effects is violated we should see a curved pattern in the plots. There are no strong patterns for any of the continuous covariates.
To investigate whether the residuals are approximately normally distributed, create a plot comparing the empirical quantiles of the standardized residuals to the theoretical quantiles of a standard normal distribution (QQ-plot).
Use the functions qqnorm()
and qqline()
.
See the examples at the end of the help (?qqnorm
) for how
to use these functions.
A QQ plot comparing the empirical quantiles of the residuals to the
theoretical quantiles of a standard normal distribution can be created
using qqnorm()
. The function qqline()
adds a
reference line:
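For example:
qqnorm(rstandard(lm1))
qqline(rstandard(lm1), col = "red")  # the colour is just a choice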
By setting the argument col
we can specify the colour in
which the line should be plotted.
The package car provides functionality to create a plot with a point-wise confidence-envelope:
The :: syntax allows us to use a function from a package without loading this package, i.e., <package>::<function>. This can be convenient when we don’t want functions from one package to interfere with functions from other packages (that may have the same function name), or to make clear which package the function is from.
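For example:
car::qqPlot(rstandard(lm1))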
We see that the standardized residuals in the upper tail of the distribution deviate from the normal distribution.
To get a QQ-plot with confidence envelope using
ggplot2 we can use the functions
stat_qq_line()
and stat_qq()
. The package
qqplotr
provides the additional functionality to get a
confidence band via the function stat_qq_band()
:
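A sketch of this approach:
ggplot(dat_resid, aes(sample = rstandard)) +
  qqplotr::stat_qq_band() +   # point-wise confidence band (from qqplotr)
  stat_qq_line() +            # reference line
  stat_qq()                   # points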
Check for multicollinearity using the variance inflation factor. You
can use the function vif()
from the package
car to do this. vif()
takes a fitted model
object as its only argument.
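For example:
car::vif(lm1)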
## gender age WC creat albu
## 1.241285 1.142430 1.211550 1.140478 1.285415
The variance inflation factor is relatively low for all covariates, indicating that there are no issues with multicollinearity in our model.
From the lectures (section Model Diagnostics II: Linearity, Normality & Multicollinearity) we know how the variance inflation factor is calculated.
Calculate the variance inflation factor for lm1
using
this formula and compare the results to the variance inflation factor
calculated in the previous step.
The variance inflation factor for a covariate \(j\) is calculated as \[VIF_j = \frac{1}{1 - R_j^2},\] where \(R_j\) is the coefficient of determination from a model that has variable \(j\) as response and all other covariates in its linear predictor (see slides 15/16 of Model Diagnostics II: Linearity, Normality & Multicollinearity).
We fit a model for each of the covariates:
lm_gender <- update(lm1, formula = as.numeric(gender) ~ . - gender, data = lm1$model)
lm_age <- update(lm1, formula = age ~ . - age, data = lm1$model)
lm_WC <- update(lm1, formula = WC ~ . - WC, data = lm1$model)
lm_creat <- update(lm1, formula = creat ~ . - creat, data = lm1$model)
lm_albu <- update(lm1, formula = albu ~ . - albu, data = lm1$model)
Here I use update()
instead of writing the model
specification.
update()
will call the same function that was used to
create the object that is passed to it, with the same arguments that
were provided in the original function call, except for those that are
specified in update()
.
In this case, we specify a new model formula.
The dot means “everything as it was in the original model”. For instance, age ~ . would create the formula age ~ gender + age + WC + creat + albu, where the right hand side of the original formula is used, i.e., age is now both the response and a covariate. This would lead to a warning message and to age being excluded as a covariate. To avoid this, we remove age as a covariate ourselves, by writing age ~ . - age.
If we wanted to keep the response from the original model, but apply changes to the right hand side of the formula, we could use the dot on the left hand side.
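For example, a hypothetical variant that keeps SBP as the response but drops albu from the covariates:
update(lm1, formula = . ~ . - albu)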
Note that the use of .
means something different in a
model formula in lm()
.
In a formula in lm()
, the dot will include all other
variables (i.e., all except the response) into the model.
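For example:
lm(age ~ ., data = nhanes)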
##
## Call:
## lm(formula = age ~ ., data = nhanes)
##
## Coefficients:
## (Intercept) SBP genderfemale raceOther Hispanic raceNon-Hispanic White
## 28.27832 0.33531 0.74008 3.43255 3.88019
## raceNon-Hispanic Black raceother WC alc>=1 educhigh
## -0.68617 -0.66851 0.06762 -1.95229 -1.50444
## creat albu uricacid bili occuplooking for work
## 3.22239 -8.96571 0.01064 5.19037 -4.63012
## occupnot working smoke.L smoke.Q
## 8.96362 -1.37670 -7.07694
(see also slides 3/4 from Linear Regression in R)
The models for the covariates do not include SBP, but lm1 does. Therefore, cases with missing values in SBP are excluded when fitting lm1, but they would not be in the models for the covariates. We need to make sure that all models are fitted on the exact same data. By using the model frame returned in lm1, we can easily achieve this.
Then we get the coefficient of determination for each model (it is returned as part of the model summary) and store them in a vector.
R2 <- c(gender = summary(lm_gender)$r.squared,
age = summary(lm_age)$r.squared,
WC = summary(lm_WC)$r.squared,
creat = summary(lm_creat)$r.squared,
albu = summary(lm_albu)$r.squared
)
Using this vector we can calculate the variance inflation factor and compare it to the values obtained using the car package:
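For example:
cbind(VIF = 1 / (1 - R2), `car::vif` = car::vif(lm1))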
## VIF car::vif
## gender 1.241285 1.241285
## age 1.142430 1.142430
## WC 1.211550 1.211550
## creat 1.140478 1.140478
## albu 1.285415 1.285415
In the plots created above, we already saw that there might be some heteroscedasticity.
For a better visualization of the variation of the residuals, we can plot an estimate of this variation instead of the residuals themselves. A common choice is to use the square root of the absolute value of the (standardized) residuals.
Plot the square root of the absolute value of the standardized residuals against the fitted values.
To also get a LOESS (locally estimated scatter plot smoothing) curve
you can use the function scatter.smooth()
instead of
plot()
. (Note that scatter.smooth()
cannot be
specified via arguments formula
and data
. You
will have to provide vectors of values to the arguments x
and y
.)
We can obtain the absolute value with the help of the function
abs()
and the square root is calculated by
sqrt()
.
scatter.smooth(y = sqrt(abs(dat_resid$rstandard)), x = dat_resid$fit,
xlab = "fitted values", ylab = "sqrt(abs(standardized residuals))")
It is a bit difficult to see here, but the LOESS fit shows an upward trend, indicating that residual variation increases with increasing fitted values.
We can increase the visibility of the LOESS fit by changing the colour of the scatter to grey:
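For example:
scatter.smooth(y = sqrt(abs(dat_resid$rstandard)), x = dat_resid$fit,
               xlab = "fitted values",
               ylab = "sqrt(abs(standardized residuals))", col = "grey")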
We can improve the plot by changing the axis labels and highlighting the LOESS fit with colour.
The style of this line can be controlled via a list()
of
options that is passed to the argument lpars
. For instance,
lpars = list(col = "red", lwd = 2)
makes the line thicker
and red.
We should also give the plot a nicer y-axis label. To write the
square root symbol in the title of the y-axis (because it looks nicer
than writing “sqrt” and is shorter than writing the whole title in
words) we need to specify it as an expression()
.
The syntax used inside the function expression()
, i.e.,
sqrt(abs(~standardized~residuals~""))
uses “plotmath” (see
?plotmath
) and can be a bit tricky to figure out.
plotmath provides a number of functions to write common mathematical symbols, but it cannot handle normal text directly. This is why ~ is used instead of spaces.
An improved version of the plot could then look like this:
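scatter.smooth(y = sqrt(abs(dat_resid$rstandard)), x = dat_resid$fit,
               xlab = "fitted values",
               ylab = expression(sqrt(abs(~standardized~residuals~""))),
               col = "grey", lpars = list(col = "red", lwd = 2))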
Alternatively, we could add the LOESS curve (obtained with
loess.smooth()
) to a standard plot (and get the exact same
output):
plot(sqrt(abs(rstandard)) ~ fit, data = dat_resid,
ylab = "sqrt(abs(standardized residuals))",
xlab = "fitted values", col = "grey")
loess_fit <- loess.smooth(y = sqrt(abs(dat_resid$rstandard)),
x = dat_resid$fit)
lines(loess_fit)
Of course, we can apply the same changes (y-axis title and colour of the line) as before:
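plot(sqrt(abs(rstandard)) ~ fit, data = dat_resid,
     ylab = expression(sqrt(abs(~standardized~residuals~""))),
     xlab = "fitted values", col = "grey")
lines(loess_fit, col = "red", lwd = 2)  # loess_fit from loess.smooth() above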
To be able to get confidence intervals for this line, however, we
need to get the fitted loess
object (by using the function
loess()
) because then we can get estimates of the standard
errors that allow us to calculate confidence intervals.
Here is a demonstration how we can do this:
The function loess()
allows us to fit a model using
“Local Polynomial Regression” (the details about this method are outside
the scope of this course). This is the same fit we obtained by
scatter.smooth()
in the previous step, but in the form of a
model, which means we have more options to work with the resulting
object.
The formula remains the same as before: the square root of the absolute standardized residuals is the response, and the fitted values from lm1 are the (only) covariate.
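A sketch of this fit (the object name loess_mod is our choice, mirroring loess_mod_w used later for the weighted model):
loess_mod <- loess(sqrt(abs(rstandard)) ~ fit, data = dat_resid)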
Using the function predict()
we can obtain the fitted
values and the corresponding standard error (when we set the argument
se = TRUE
) from this LOESS fit, which allows us to
calculate confidence intervals.
(We’ll see more about predict()
and fitted values in the
section on Effect Plots.)
The object returned by predict()
is a list
with elements fit
, se.fit
(and two more
elements):
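For example:
loess_fit <- predict(loess_mod, dat_resid, se = TRUE, level = 0.95)
str(loess_fit)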
## List of 4
## $ fit : Named num [1:2765] 0.69 0.712 0.675 0.763 0.798 ...
## ..- attr(*, "names")= chr [1:2765] "1" "2" "3" "4" ...
## $ se.fit : Named num [1:2765] 0.0128 0.013 0.0145 0.0135 0.0133 ...
## ..- attr(*, "names")= chr [1:2765] "1" "2" "3" "4" ...
## $ residual.scale: num 0.355
## $ df : num 2355
We can calculate the lower (lwr
) and upper
(upr
) limits of the 95% confidence interval as:
loess_fit$lwr <- loess_fit$fit - qnorm(0.975) * loess_fit$se.fit
loess_fit$upr <- loess_fit$fit + qnorm(0.975) * loess_fit$se.fit
We could now add lines for this confidence interval to the plot we
created in the previous step, however, the fitted values in
dat_resid
are not sorted. This means that when we plot
lines using dat_resid$fit
as x-coordinate, we get a mess of
zigzag lines.
In a scatter plot everything looks fine, but when using lines…
⇨ When you get weird zigzag lines in a plot, check how it looks when using dots. If it looks fine with dots, try sorting the data by the variable shown on the x-axis.
To sort dat_resid$fit
and get the elements of
loess_fit
in the same order, it is convenient to have the
elements of loess_fit
and dat_resid$fit
in the
same data.frame
. We add the fitted values from
lm1
to loess_fit
(instead of vice versa)
because that is more convenient for the next task:
loess_fit$fit_lm1 <- dat_resid$fit
loess_fit <- data.frame(loess_fit[c("fit_lm1", "fit", "lwr", "upr")])
We can now sort loess_fit by the fitted values from lm1:
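loess_fit <- loess_fit[order(loess_fit$fit_lm1), ]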
It is important to combine the LOESS fit (and confidence interval)
with the same version of the fit from lm1
in
dat_resid
that was used to create loess_fit
because only then will the values in loess_fit
correspond
to the rows in dat_resid
.
This means we can either sort only loess_fit (which now contains both the LOESS fit and the fitted values from lm1), or sort dat_resid and re-create loess_fit from the sorted data. Here, we used the first option.
We can now plot the square root of the absolute standardized residuals against the fitted values and add the corresponding LOESS fit with 95% confidence interval to the plot:
plot(sqrt(abs(rstandard)) ~ fit, data = dat_resid,
ylab = "sqrt(abs(standardized residuals))",
xlab = "fitted values")
lines(x = loess_fit$fit_lm1, y = loess_fit$fit, col = "red", lwd = 2)
lines(x = loess_fit$fit_lm1, y = loess_fit$lwr, lty = 2, col = "red", lwd = 2)
lines(x = loess_fit$fit_lm1, y = loess_fit$upr, lty = 2, col = "red", lwd = 2)
An alternative way to add the fitted line and confidence interval
would be to use matplot()
which plots multiple variables
(columns in the input data) at the same time:
plot(sqrt(abs(rstandard)) ~ fit, data = dat_resid,
ylab = "sqrt(abs(standardized residuals))",
xlab = "fitted values")
matplot(x = loess_fit$fit_lm1, y = loess_fit[, c("fit", "lwr", "upr")],
type = "l", lty = c(1, 2, 2), col = "red", add = TRUE)
In matplot()
we need to set a few additional
arguments.
The argument type = "l" specifies that lines should be plotted. By default, matplot() would plot each observation using the numbers 1, 2 or 3 (one for each column) as the plotting symbol.
lty
sets the line type and we specify that type
1
(solid) is used for the first column of the data
(fit
) and type 2
(dashed) is used for the
other two columns.
Finally, we need to set add = TRUE; otherwise, matplot() would create a new plot instead of adding to the existing one.
It seems that the residual standard deviation increases a bit with increasing fitted values, but not a lot.
We can obtain a scatter plot smoother using
geom_smooth()
. The function chooses the method
to be used based on the amount of data. To get a LOESS curve we can set
method = "loess"
:
ggplot(dat_resid, aes(x = fit, y = sqrt(abs(rstandard)))) +
geom_point() +
geom_smooth(method = "loess") +
xlab("fitted values") +
ylab("sqrt(abs(standardized residuals))")
## `geom_smooth()` using formula 'y ~ x'
Create plots of the square root of the absolute value of the standardized residuals against the covariates.
For the continuous covariates, add the loess
fit and the
corresponding 95% confidence intervals.
The following syntax relies a lot on copy-pasting.
Whenever we need to copy-paste the same syntax multiple times (and
make small changes to each version) a better option would be to create a
function that runs the syntax, to use
loops, or to use a function from the apply
family.
This would reduce the risk of introducing copy-paste mistakes, and when we want to make a change to the syntax at a later point, we would only need to make that change once, not in every instance of a copied piece of syntax.
Because we haven’t covered yet how to write functions or how to use
loops or apply()
(that will follow in Biostatistics II),
we’ll copy-paste for now.
Fit the LOESS curves for the continuous covariates:
loess_age <- loess(sqrt(abs(rstandard)) ~ age, data = dat_resid)
loess_WC <- loess(sqrt(abs(rstandard)) ~ WC, data = dat_resid)
loess_creat <- loess(sqrt(abs(rstandard)) ~ creat, data = dat_resid)
loess_albu <- loess(sqrt(abs(rstandard)) ~ albu, data = dat_resid)
Calculate the loess
fit and confidence intervals:
loess_fit_age <- predict(loess_age, dat_resid, se = TRUE, level = 0.95)
loess_fit_age$lwr <- loess_fit_age$fit - qnorm(0.975) * loess_fit_age$se.fit
loess_fit_age$upr <- loess_fit_age$fit + qnorm(0.975) * loess_fit_age$se.fit
loess_fit_WC <- predict(loess_WC, dat_resid, se = TRUE, level = 0.95)
loess_fit_WC$lwr <- loess_fit_WC$fit - qnorm(0.975) * loess_fit_WC$se.fit
loess_fit_WC$upr <- loess_fit_WC$fit + qnorm(0.975) * loess_fit_WC$se.fit
loess_fit_creat <- predict(loess_creat, dat_resid, se = TRUE, level = 0.95)
loess_fit_creat$lwr <- loess_fit_creat$fit - qnorm(0.975) * loess_fit_creat$se.fit
loess_fit_creat$upr <- loess_fit_creat$fit + qnorm(0.975) * loess_fit_creat$se.fit
loess_fit_albu <- predict(loess_albu, dat_resid, se = TRUE, level = 0.95)
loess_fit_albu$lwr <- loess_fit_albu$fit - qnorm(0.975) * loess_fit_albu$se.fit
loess_fit_albu$upr <- loess_fit_albu$fit + qnorm(0.975) * loess_fit_albu$se.fit
Add the covariate, convert to a data.frame
and sort by
the value of the covariate:
loess_fit_age$covar <- dat_resid$age
loess_fit_age <- data.frame(loess_fit_age[c("covar", "fit", "lwr", "upr")])
loess_fit_age <- loess_fit_age[order(loess_fit_age$covar), ]
loess_fit_WC$covar <- dat_resid$WC
loess_fit_WC <- data.frame(loess_fit_WC[c("covar", "fit", "lwr", "upr")])
loess_fit_WC <- loess_fit_WC[order(loess_fit_WC$covar), ]
loess_fit_creat$covar <- dat_resid$creat
loess_fit_creat <- data.frame(loess_fit_creat[c("covar", "fit", "lwr", "upr")])
loess_fit_creat <- loess_fit_creat[order(loess_fit_creat$covar), ]
loess_fit_albu$covar <- dat_resid$albu
loess_fit_albu <- data.frame(loess_fit_albu[c("covar", "fit", "lwr", "upr")])
loess_fit_albu <- loess_fit_albu[order(loess_fit_albu$covar), ]
Create the plots:
par(mfrow = c(2, 3), mar = c(3.5, 3.5, 2, 1), mgp = c(2, 0.6, 0))
# plot for age
plot(sqrt(abs(rstandard)) ~ age, data = dat_resid, col = "grey",
ylab = "sqrt(abs(standardized residuals))")
matplot(x = loess_fit_age$covar, y = loess_fit_age[, c("fit", "lwr", "upr")],
lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)
# plot for gender
plot(sqrt(abs(rstandard)) ~ gender, data = dat_resid,
ylab = "sqrt(abs(standardized residuals))")
# plot for WC
plot(sqrt(abs(rstandard)) ~ WC, data = dat_resid, col = "grey",
ylab = "sqrt(abs(standardized residuals))")
matplot(x = loess_fit_WC$covar, y = loess_fit_WC[, c("fit", "lwr", "upr")],
lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)
# plot for creat
plot(sqrt(abs(rstandard)) ~ creat, data = dat_resid, col = "grey",
ylab = "sqrt(abs(standardized residuals))")
matplot(x = loess_fit_creat$covar, y = loess_fit_creat[, c("fit", "lwr", "upr")],
lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)
# plot for albu
plot(sqrt(abs(rstandard)) ~ albu, data = dat_resid, col = "grey",
ylab = "sqrt(abs(standardized residuals))")
matplot(x = loess_fit_albu$covar, y = loess_fit_albu[, c("fit", "lwr", "upr")],
lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)
We see a clear upwards trend for age and a slightly decreasing trend for albu. For creat it is difficult to judge, given the skewness of the variable.
We can see the fit for creat
in the area with the most
data better when we restrict the x-axis by setting the argument
xlim
:
par(mfrow = c(1, 1))
plot(sqrt(abs(rstandard)) ~ creat, data = dat_resid, col = "grey",
ylab = "sqrt(abs(standardized residuals))",
xlim = c(0, 3))
matplot(x = loess_fit_creat$covar, y = loess_fit_creat[, c("fit", "lwr", "upr")],
lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)
It seems that there is no strong relation between the variation of
the residuals and creat
.
We will now fit a second model using weighted least squares to see if this will lead to more homoscedastic error terms and maybe also improve the normality of the resulting residuals.
Fit a weighted least squares version of lm1
(see slides
19-21 of Model
Diagnostics III: Heteroscedasticity).
For calculating the weights, use the same covariates as in
lm1
.
The weights can be passed to lm() via the argument weights.
We first fit a model to the natural logarithm of the squared
residuals, using the same covariates as in the original model
lm1
:
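A sketch (the object name lm_logres2 is our choice):
lm_logres2 <- update(lm1, formula = log(resid^2) ~ ., data = dat_resid)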
Here, I again used update()
instead of writing the whole
model “by hand”. This requires adjusting the model formula, and
replacing the data.
In the model formula, we replace the original response with the
logarithm of the squared residuals. The right hand side of the formula
stays the same as it was in lm1
.
Instead of the nhanes data, we now need to use the data that contains the residuals (and, of course, also the covariates, but we added them to dat_resid in a previous step).
From this model, we extract the fitted values as estimates of the (log-transformed) residual variation (i.e., \(\widehat{\log(\hat\varepsilon_i^2)} = \mathbf z_i^\top \boldsymbol{\hat\alpha}\)) and calculate the weights as \(w = \frac{1}{\exp\left(\widehat{\log(\hat{\varepsilon}_i^2)}\right)}\):
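w <- 1 / exp(fitted(lm_logres2))  # lm_logres2 as defined above (name assumed)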
Using these weights, we can then fit the weighted least squares model
by updating the original model lm1
with specifying the
weights:
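lm_weighted <- update(lm1, weights = w)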
Create plots to check for heteroscedasticity in the weighted least squares model.
We follow the same steps as before.
First, make a data.frame
of the standardized residuals,
fitted values and covariates, and plot the square root of the absolute
standardized residuals against the fitted values
dat_resid_w <- data.frame(
rstandard = rstandard(lm_weighted),
fit = fitted.values(lm_weighted),
nhanes[all.vars(formula(lm_weighted))]
)
scatter.smooth(y = sqrt(abs(dat_resid_w$rstandard)), x = dat_resid_w$fit,
xlab = "fitted values", ylab = "sqrt(abs(standardized residuals))",
col = "grey")
The upward trend of the residual variation has disappeared.
Fit the LOESS curves for the continuous covariates:
loess_age_w <- update(loess_age, data = dat_resid_w)
loess_WC_w <- update(loess_WC, data = dat_resid_w)
loess_creat_w <- update(loess_creat, data = dat_resid_w)
loess_albu_w <- update(loess_albu, data = dat_resid_w)
Calculate the loess
fit and confidence intervals:
loess_fit_age_w <- predict(loess_age_w, dat_resid_w, se = TRUE, level = 0.95)
loess_fit_age_w$lwr <- loess_fit_age_w$fit - qnorm(0.975) * loess_fit_age_w$se.fit
loess_fit_age_w$upr <- loess_fit_age_w$fit + qnorm(0.975) * loess_fit_age_w$se.fit
loess_fit_WC_w <- predict(loess_WC_w, dat_resid_w, se = TRUE, level = 0.95)
loess_fit_WC_w$lwr <- loess_fit_WC_w$fit - qnorm(0.975) * loess_fit_WC_w$se.fit
loess_fit_WC_w$upr <- loess_fit_WC_w$fit + qnorm(0.975) * loess_fit_WC_w$se.fit
loess_fit_creat_w <- predict(loess_creat_w, dat_resid_w, se = TRUE, level = 0.95)
loess_fit_creat_w$lwr <- loess_fit_creat_w$fit - qnorm(0.975) * loess_fit_creat_w$se.fit
loess_fit_creat_w$upr <- loess_fit_creat_w$fit + qnorm(0.975) * loess_fit_creat_w$se.fit
loess_fit_albu_w <- predict(loess_albu_w, dat_resid_w, se = TRUE, level = 0.95)
loess_fit_albu_w$lwr <- loess_fit_albu_w$fit - qnorm(0.975) * loess_fit_albu_w$se.fit
loess_fit_albu_w$upr <- loess_fit_albu_w$fit + qnorm(0.975) * loess_fit_albu_w$se.fit
Add the covariate, convert to a data.frame
and sort by
the value of the covariate:
loess_fit_age_w$covar <- dat_resid_w$age
loess_fit_age_w <- data.frame(loess_fit_age_w[c("covar", "fit", "lwr", "upr")])
loess_fit_age_w <- loess_fit_age_w[order(loess_fit_age_w$covar), ]
loess_fit_WC_w$covar <- dat_resid_w$WC
loess_fit_WC_w <- data.frame(loess_fit_WC_w[c("covar", "fit", "lwr", "upr")])
loess_fit_WC_w <- loess_fit_WC_w[order(loess_fit_WC_w$covar), ]
loess_fit_creat_w$covar <- dat_resid_w$creat
loess_fit_creat_w <- data.frame(loess_fit_creat_w[c("covar", "fit", "lwr", "upr")])
loess_fit_creat_w <- loess_fit_creat_w[order(loess_fit_creat_w$covar), ]
loess_fit_albu_w$covar <- dat_resid_w$albu
loess_fit_albu_w <- data.frame(loess_fit_albu_w[c("covar", "fit", "lwr", "upr")])
loess_fit_albu_w <- loess_fit_albu_w[order(loess_fit_albu_w$covar), ]
Create the plots:
par(mfrow = c(2, 3), mar = c(3.5, 3.5, 2, 1), mgp = c(2, 0.6, 0))
plot(sqrt(abs(rstandard)) ~ age, data = dat_resid_w, col = "grey",
ylab = "sqrt(abs(standardized residuals))")
matplot(x = loess_fit_age_w$covar, y = loess_fit_age_w[, c("fit", "lwr", "upr")],
lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)
plot(sqrt(abs(rstandard)) ~ gender, data = dat_resid_w,
ylab = "sqrt(abs(standardized residuals))")
plot(sqrt(abs(rstandard)) ~ WC, data = dat_resid_w, col = "grey",
ylab = "sqrt(abs(standardized residuals))")
matplot(x = loess_fit_WC_w$covar, y = loess_fit_WC_w[, c("fit", "lwr", "upr")],
lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)
plot(sqrt(abs(rstandard)) ~ creat, data = dat_resid_w, col = "grey",
ylab = "sqrt(abs(standardized residuals))")
matplot(x = loess_fit_creat_w$covar, y = loess_fit_creat_w[, c("fit", "lwr", "upr")],
lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)
plot(sqrt(abs(rstandard)) ~ albu, data = dat_resid_w, col = "grey",
ylab = "sqrt(abs(standardized residuals))")
matplot(x = loess_fit_albu_w$covar, y = loess_fit_albu_w[, c("fit", "lwr", "upr")],
lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)
There seems to be less of a trend in these plots compared to the corresponding plots from the ordinary least squares model.
We can visualize the difference between the LOESS fit from
lm1
and lm_weighted
better by plotting both in
the same figure, for example, for age
and
albu
:
par(mfrow = c(1, 2), mar = c(3.5, 3.5, 2, 1), mgp = c(2, 0.6, 0))
plot(sqrt(abs(rstandard)) ~ age, data = dat_resid_w, col = "grey",
ylab = "sqrt(abs(standardized residuals))")
matplot(x = loess_fit_age_w$covar, y = loess_fit_age_w[, c("fit", "lwr", "upr")],
lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)
matplot(x = loess_fit_age$covar, y = loess_fit_age[, c("fit", "lwr", "upr")],
lty = c(1, 2, 2), type = "l", col = "red", add = TRUE)
legend("topright", lty = c(1, 1), col = c(1, 2), legend = c("weighted", "OLS"),
bty = "n")
plot(sqrt(abs(rstandard)) ~ albu, data = dat_resid_w, col = "grey",
ylab = "sqrt(abs(standardized residuals))")
matplot(x = loess_fit_albu_w$covar, y = loess_fit_albu_w[, c("fit", "lwr", "upr")],
lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)
matplot(x = loess_fit_albu$covar, y = loess_fit_albu[, c("fit", "lwr", "upr")],
lty = c(1, 2, 2), type = "l", col = "red", add = TRUE)
legend("topright", lty = c(1, 1), col = c(1, 2), legend = c("weighted", "OLS"),
bty = "n")
The function legend()
adds a legend. Here, we place it
in the top right corner. We need to specify what this legend should show
ourselves, i.e., two solid lines that have colours 1 (black) and 2
(red), which are labelled “weighted” and “OLS”. The argument
bty specifies the box type (i.e., whether a line should be drawn around the legend) and we set it to "n", i.e., no box.
loess_mod_w <- loess(sqrt(abs(rstandard)) ~ fit, data = dat_resid_w)
loess_fit_w <- predict(loess_mod_w, dat_resid_w, se = TRUE, level = 0.95)
loess_fit_w$lwr <- loess_fit_w$fit - qnorm(0.975) * loess_fit_w$se.fit
loess_fit_w$upr <- loess_fit_w$fit + qnorm(0.975) * loess_fit_w$se.fit
loess_fit_w$fit_lm1_w <- dat_resid_w$fit
loess_fit_w <- data.frame(loess_fit_w[c("fit_lm1_w", "fit", "lwr", "upr")])
loess_fit_w <- loess_fit_w[order(loess_fit_w$fit_lm1_w), ]
plot(sqrt(abs(rstandard)) ~ fit, data = dat_resid_w,
ylab = expression(sqrt(abs(~standardized~residuals~""))),
xlab = "fitted values", col = "gray")
matplot(x = loess_fit$fit_lm1, y = loess_fit[, c("fit", "lwr", "upr")],
type = "l", lty = c(1, 2, 2), col = "red", add = TRUE)
matplot(x = loess_fit_w$fit_lm1_w, y = loess_fit_w[, c("fit", "lwr", "upr")],
type = "l", lty = c(1, 2, 2), col = "black", add = TRUE)
legend("topleft", lty = c(1, 1), col = c(1, 2), legend = c("weighted", "OLS"),
bty = "n")
Check if the residuals in the weighted least squares model are now closer to a normal distribution by comparing the QQ-plot from the original model to the QQ-plot for the weighted model (side by side, not within the same plot).
You can use car::qqPlot().
To print the two plots next to each other, we can set the graphical
parameters (par()
) option mfrow
. The function
qqPlot()
from the car package only
requires us to provide the vector of residuals. To make it easily clear
which plot is which, we can add an informative title
(main
):
par(mfrow = c(1, 2))
car::qqPlot(rstandard(lm1),
main = "ordinary least squares")
car::qqPlot(rstandard(lm_weighted), ylim = c(-3.7, 4.7),
main = "weighted least squares")
I’ve also specified the argument ylim to extend the range of the y-axis, because the lower end of the confidence envelope was cut off in the right figure.
The upper quantiles of the standardized residuals have become a bit more similar to the quantiles of the standard normal distribution in the weighted least squares model (but the upper tail is still deviating).
How can we identify potential outliers? Which type of residuals do we need to use for this?
From the section Model Diagnostics IV: Outliers & Influential Observations (slides 9/10) we know that we can identify potential outliers by comparing the studentized residuals to the \(1-\alpha/2\) and \(\alpha/2\) quantiles of a \(t\)-distribution with \(n-p-1\) degrees of freedom.
The reasoning is that if the model is correctly specified, the studentized residuals should have such a \(t\)-distribution and values outside these quantiles are relatively unlikely if the studentized residuals do indeed come from this distribution.
Plot the studentized residuals for lm_weighted
against
the fitted values and add horizontal lines for the 2.5% and 97.5%
quantiles of the corresponding \(t\)-distribution.
Use qt() to obtain the quantiles of the t-distribution and abline() to add the horizontal lines.
If you find some potential outliers, think about how you could investigate those observations further.
If you haven’t already done so earlier, add the studentized residuals
for lm_weighted
to dat_resid_w
:
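dat_resid_w$rstudent <- rstudent(lm_weighted)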
We can now plot the studentized residuals against the fitted values and add horizontal lines indicating the \(1 - \alpha/2\) and \(\alpha/2\) quantiles of the \(t\)-distribution with \(n - p - 1\) degrees of freedom (and \(\alpha = 0.05\)).
plot(rstudent ~ fit, data = dat_resid_w,
xlab = "fitted values", ylab = "studentized residuals")
abline(h = qt(c(0.025, 0.975), df = lm_weighted$df.residual), lty = 2)
We can obtain the quantiles of the \(t\)-distribution using the function
qt()
, for which we specify a vector of probabilities, i.e.,
c(0.025, 0.975)
and the degrees of freedom, \(n - p - 1\). The degrees of freedom are
returned in the fitted model object as df.residual
.
The lines are added using abline(). The argument h indicates at which position(s) on the y-axis (a) horizontal line(s) should be drawn, and lty = 2 specifies that the line type should be “dashed”.
Several observations are outside the range that is likely to be observed if the studentized residuals would really follow a \(t\)-distribution. We should investigate these observations but there is no need to immediately exclude them. Keep in mind that these more extreme values for the studentized residuals are only less likely under the \(t\)-distribution, but not impossible.
There are too many of these potential outliers to investigate each case separately by hand. One way to get a bit more information is to visualize the situation.
A possibility to visualize the observations is to look at the distribution of each variable separately, but there might be observations that have relatively normal values in each variable, but unusual combinations. We can get a bit more information when we plot scatter plots of pairs of variables.
To be able to highlight the observations we have identified as
potential outliers in a scatter plot, we add a new variable to
dat_resid_w
that indicates if the studentized residual is
more extreme than the quantiles of the \(t\)-distribution:
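For example:
dat_resid_w$outlier <- abs(dat_resid_w$rstudent) >
  qt(0.975, df = lm_weighted$df.residual)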
We can obtain a matrix of all pairwise scatter plots using the
function pairs()
:
pairs(dat_resid_w[c("SBP", "gender", "age", "WC", "creat", "albu")],
col = c("grey", "red")[as.numeric(dat_resid_w$outlier) + 1])
The variable outlier
in dat_resid_w
is a
logical
vector (with values TRUE
when the
observation is a potential outlier, and FALSE
when it is
not).
To plot the observations with different colours we need to assign a
colour to each observation. We do that by converting the
logical
vector to a numeric
vector, but then
it contains values 0
and 1
, which is why we
add + 1
.
The resulting vector with values 1 and 2 allows us to use it as an index for a vector of colours, c("grey", "red").
Most of the observations that we identified seem to have a systolic blood pressure above 140. Depending on the aim of our analysis, we should consider whether we want to have patients with hypertension in the analysis or if we should exclude them because we only want to draw conclusions for patients without hypertension. We should also consider to perform sensitivity analyses if hypertension is a relevant issue in the application at hand.
From a statistical point of view, the values are not that extreme that we would not expect any problems with regards to the stability of the model, and so we will keep these observations in the analysis for now.
Using the indicator variable outlier
, we could now also
create a subset of the data containing only these observations, count
them, investigate them further, …
We now want to identify observations with high leverage. Should we use lm1 or lm_weighted for this?
The leverage values are the diagonal values of the hat matrix (Section Model Diagnostics IV: Outliers & Influential Observations, slide 5). They indicate how far the covariate values of an observation are from the sample mean, and are a measure of how much the observed response for a particular subject influences the corresponding fitted value.
For a standard linear regression model the hat matrix can be
calculated from the design matrix \(\mathbf
X\) as (see Model
Diagnostics I: Residuals, slide 3) \[\mathbf H = \mathbf X(\mathbf X^\top \mathbf
X)^{-1}\mathbf X^\top\] Because there was some heteroscedasticity
in lm1
, and the weighted least squares approach used in
lm_weighted
seemed to have reduced that, we would prefer to
continue to work with lm_weighted
. But, because this model
was not estimated with the ordinary least squares estimator, the
calculation of the hat matrix has to take into account the weights that
were used to fit the model.
For the weighted least squares model, the regression coefficients are calculated as \[\boldsymbol{\hat\beta} = (\mathbf X^\top \mathbf W \mathbf X)^{-1} \mathbf X^\top \mathbf W \mathbf {y},\] where \(\mathbf W\) is a weight matrix, for example \(w\mathbf I\), and, therefore, the hat matrix is \[\mathbf H = \mathbf X\left(\mathbf X^\top \mathbf W \mathbf X\right)^{-1} \mathbf X^\top \mathbf W\]
This formula for the weighted least squares estimator (and corresponding hat matrix) is not in the lecture slides and not relevant for the exam.
But because the weighted least squares model had the better fit and working with the formulas for the weighted model is not much more difficult than working with the corresponding formulas for an OLS model, we will work with the weighted version here.
There is no universal cut-off for high leverage values, but we could use \(2 (p + 1)/n\) to give us some guidance (Section Model Diagnostics IV: Outliers & Influential Observations, slide 11).
Calculate the hat matrix \(\mathbf
H\) for lm_weighted
and identify high leverage
points.
You can obtain the design matrix using model.matrix().
The function diag() can be used for multiple purposes, depending on its argument:
- applied to a matrix: to extract the diagonal elements of the matrix,
- applied to a vector: to create a diagonal matrix that has the vector as its diagonal,
- applied to an integer: to create the identity matrix (that has 1’s on the diagonal and is 0 everywhere else) of dimension given by the integer.
%*%
is used for matrix multiplication,
t()
to transpose a vector or a matrix, and you can get the
inverse of a matrix with the function solve()
.
Did you get the following warning message?
## Warning in w * diag(nrow(X)): longer object length is not a multiple of shorter object length
This means that the dimensions of the weights w and the design matrix X don’t match.
Explore how large/long your objects are and have a look at the weights to figure out what the issue might be.
na.omit()
can help you to resolve the problem.
We can obtain the design matrix \(\mathbf X\) from the fitted model object:
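X <- model.matrix(lm_weighted)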
With this matrix we can now calculate the hat matrix and extract its
diagonal elements using diag()
:
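A first attempt might look like this; it triggers the warning shown above, because w still contains the incomplete cases:
W <- w * diag(nrow(X))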
The vector of weights, w
, that we calculated previously,
has length 2765 and contains some NA
values (because the
fitted values it was calculated from had these NA
values).
The matrix X
, however, only has 2360 rows and does not
contain data from incomplete observations.
To get a vector without the NA
values we use the
function na.omit()
.
diag(nrow(X)) creates an identity matrix, e.g.,
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0 0 0
## [2,] 0 1 0 0 0
## [3,] 0 0 1 0 0
## [4,] 0 0 0 1 0
## [5,] 0 0 0 0 1
(but with dimensions equal to the number of rows in X
and not 5, but that would be too large to display here), so that the
resulting weights matrix W
has the weights on the diagonal
and is zero everywhere else, e.g.,
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.03614327 0.00000000 0.00000000 0.00000000 0.00000000
## [2,] 0.00000000 0.01688889 0.00000000 0.00000000 0.00000000
## [3,] 0.00000000 0.00000000 0.03268759 0.00000000 0.00000000
## [4,] 0.00000000 0.00000000 0.00000000 0.02350066 0.00000000
## [5,] 0.00000000 0.00000000 0.00000000 0.00000000 0.01568025
(but then much larger…)
Using this weights matrix we can then calculate the hat matrix and its diagonal:
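For example:
W <- na.omit(w) * diag(nrow(X))                       # weights on the diagonal
H <- X %*% solve(t(X) %*% W %*% X) %*% t(X) %*% W     # hat matrix
h <- diag(H)                                          # leverage values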
We can now plot the leverage values and add a horizontal line to the plot to indicate the reference line:
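plot(h, ylab = "leverage")
abline(h = 2 * ncol(X) / nrow(X), lty = 2, col = "red")  # 2(p + 1)/n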
There are a few observations above the line, but none that are really
standing out. As before, we can create a new variable in
dat_resid
that we can then use to highlight those
observations in a scatter plot:
dat_resid_w$high_lev[!is.na(dat_resid_w$fit)] <- h > 2 * ncol(X)/nrow(X)
pairs(dat_resid_w[c("SBP", "gender", "age", "WC", "creat", "albu")],
col = c("grey", "red")[as.numeric(dat_resid_w$high_lev) + 1])
As expected, the highlighted dots are more at the edge of the
observed values. For most variables, there are no extreme values we
should be concerned about, only for creat
. To further
inspect (some) of the identified high leverage observations, more
knowledge on the particular (clinical) application/study would be
necessary.
We could further investigate how the higher values of
SBP
and creat
are related to the residuals,
e.g., whether they are responsible for the deviation from normality in
the upper tail that we see in the QQ-plot.
It is also possible to obtain the hat values from a fitted linear
model using the function hatvalues()
.
Compare the hat values you just calculated with the values you get
using hatvalues()
.
To extract the hat values from lm_weighted
:
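hatvalues(lm_weighted)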
## 1 2 3 4 5 6 7 8
## 0.0029673072 0.0026032854 0.0040234376 0.0030324754 0.0019482885 0.0014668865 0.0039065911 0.0039496513
## 9 10 11 12 13 14 15 16
## 0.0000000000 0.0021375842 0.0010374903 0.0026008281 0.0024350893 0.0016604297 0.0022520324 0.0023776542
## 17 18 19 20 21 22 23 24
## 0.0015994164 0.0014601304 0.0024934888 0.0017250689 0.0015578032 0.0040213386 0.0028244693 0.0015483757
## 25 26 27 28 29 30 31 32
## 0.0016432862 0.0025492124 0.0017022925 0.0014401198 0.0000000000 0.0018833256 0.0010927249 0.0437888411
## [...]
The first thing we might notice when we try to compare the two sets of values by taking the differences or plotting them against each other is that they have different length.
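For example:
length(h)
length(hatvalues(lm_weighted))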
## [1] 2360
## [1] 2765
hatvalues()
includes values for the incomplete cases,
but our own calculated hat values do not. We “lost” those observations
when we used model.matrix()
, which automatically excludes
rows that contain missing values.
When we inspect the values from hatvalues()
that
correspond to incomplete observations (which have value NA
in the fitted values and residuals in dat_resid_w
), we see
that they all have the value 0:
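head(hatvalues(lm_weighted)[is.na(dat_resid_w$fit)], n = 10)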
## 9 29 35 43 44 48 51 62 81 86
## 0 0 0 0 0 0 0 0 0 0
The function is.na()
checks if a value is missing and
returns a logical vector with TRUE
if the value is
NA
and FALSE
if it is not NA
:
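is.na(dat_resid_w$fit)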
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [...]
Using this logical value we can create a subset that only contains
the elements of a vector, list, matrix or data.frame
for
which the logical vector has value TRUE
.
(In the output above, only the first values were displayed using head() with the argument n set to 10.)
To compare the two vectors, we take the subset for
hatvalues(lm_weighted)
that corresponds to complete
observations:
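# one way to compare: summarize the differences between the two versions
summary(h - hatvalues(lm_weighted)[!is.na(dat_resid_w$fit)])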
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.297e-15 2.515e-17 1.393e-16 3.748e-16 4.567e-16 7.171e-15
Now, we are interested in selecting only those elements of the hat values that correspond to elements of the residuals or fitted values that are not missing.
Since is.na()
returns TRUE
for missing
values, we use !
to invert this logical vector, i.e., a
TRUE
becomes a FALSE
and a FALSE
becomes a TRUE
.
The differences are all very close to zero: apart from small numerical differences due to the way the hat values were calculated, our results match the values returned by hatvalues().
We can obtain a whole range of influence measures for a fitted model
object using the function influence.measures()
, or separate
influence measures via dffits()
, dfbeta()
,
dfbetas()
, cooks.distance()
.
Calculate the influence measures for lm_weighted
using
the separate functions and inspect them via plots.
You can add reference lines based on the rules of thumb given in the section "Model Diagnostics IV: Outliers & Influential Observations" on slides 15-20.
To obtain quantiles of the \(F\)-distribution, use the function
qf()
.
Cook’s distance:
There are multiple rules of thumb for which values of Cook's distance should be considered large: values larger than 1, values larger than the median of the \(F(p,\, n-p-1)\)-distribution, or values larger than \(\frac{4}{n-p-1}\). The code below adds a reference line for each of them:
plot(cooks.distance(lm_weighted))
# reference lines at 1, the median of the F(p, n-p-1)-distribution,
# and 4/(n-p-1)
abline(h = c(1,
             qf(0.5, ncol(X) - 1, lm_weighted$df.residual),
             4/lm_weighted$df.residual),
       lty = 2, col = "red")
Since Cook’s distance is \(<0.052\) for all observations, only the reference line for \(\frac{4}{n - p - 1}\) is visible. There are many values above this line, but all observations are far below the other two cut-offs (at 1 and at 0.8705419). There is one observation that clearly stands out from the rest (the one with value > 0.05). We could investigate this particular subject further, since it has much more influence on the model fit than all the other observations.
We can set the range of values to be shown on the y-axis with the
help of the argument ylim
:
DFFITS:
Reference lines for DFFITS can be drawn at \(\pm 2 \sqrt{\frac{p + 2}{n-p-2}}\).
plot(dffits(lm_weighted), ylim = c(-0.5, 0.65))
# note: ncol(X) + 1 = p + 2 and lm_weighted$df.residual - 1 = n - p - 2
abline(h = c(-2, 2) * sqrt((ncol(X) + 1)/(lm_weighted$df.residual - 1)),
       lty = 2, col = "red")
There are multiple observations outside the reference lines. When investigating those cases we might start with the most extreme ones (one with a value close to 0.6 and three with a value close to -0.3).
DFBETAS:
Reference lines for DFBETAS can be drawn at \(\pm 2 /\sqrt{n}\). This measure of influence is covariate-specific, i.e., we need to plot it separately for each of the covariates:
DFBETAS <- dfbetas(lm_weighted)
par(mfrow = c(2, 3), mar = c(3.5, 3.5, 2, 1), mgp = c(2, 0.6, 0))

# one panel per coefficient, each with reference lines at +/- 2/sqrt(n)
for (k in seq_len(ncol(DFBETAS))) {
  plot(DFBETAS[, k], main = colnames(DFBETAS)[k])
  abline(h = c(-2, 2) / sqrt(nrow(X)), lty = 2, col = "red")
}
Calculate the influence measures using influence.measures() and investigate the resulting object.
Calculate the influence measures:
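infl <- influence.measures(lm_weighted)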
Investigate what type of object infl
is and obtain its
structure so that we can see how we can work with this object:
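class(infl)
str(infl)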
## [1] "infl"
## List of 3
## $ infmat: num [1:2765, 1:10] 0.00216 0.00864 0.04857 0.05175 0.04474 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2765] "1" "2" "3" "4" ...
## .. ..$ : chr [1:10] "dfb.1_" "dfb.gndr" "dfb.age" "dfb.WC" ...
## $ is.inf: logi [1:2765, 1:10] FALSE FALSE FALSE FALSE FALSE FALSE ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2765] "1" "2" "3" "4" ...
## .. ..$ : chr [1:10] "dfb.1_" "dfb.gndr" "dfb.age" "dfb.WC" ...
## $ call : language lm(formula = SBP ~ gender + age + WC + creat + albu, data = nhanes, weights = w, na.action = "na.exclude")
## - attr(*, "class")= chr "infl"
We can also get some more information by looking at the help file
? influence.measures
(first paragraph of the Details
section).
influence.measures()
returns a list with 3 elements.
The element infmat
is a matrix containing the different
influence measures:
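head(infl$infmat)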
## dfb.1_ dfb.gndr dfb.age dfb.WC dfb.cret dfb.albu dffit cov.r
## 1 0.0021606757 0.0047689097 0.0055777801 0.003597802 -0.0005995998 -0.006067404 -0.01692031 1.0052898
## 2 0.0086380136 0.0009500429 0.0004547626 -0.006880392 0.0021805136 -0.008590003 0.01156868 1.0050380
## 3 0.0485668125 -0.0415907741 -0.0224686608 -0.052947175 -0.0038479099 -0.030203698 0.07800047 1.0027455
## 4 0.0517479672 -0.0444103127 0.0020164298 -0.061156462 0.0071348600 -0.041421772 -0.07965685 1.0002695
## 5 0.0447438334 -0.0386745638 0.0211528257 -0.043705240 -0.0012308017 -0.037784900 0.06403226 0.9991465
## 6 0.0008270652 0.0110187106 -0.0244282322 -0.001451829 0.0034179197 0.001341547 -0.03067003 1.0023877
## cook.d hat
## 1 4.773448e-05 0.002967307
## 2 2.231474e-05 0.002603285
## 3 1.013794e-03 0.004023438
## 4 1.057048e-03 0.003032475
## 5 6.830357e-04 0.001948289
## 6 1.567991e-04 0.001466887
And the element is.inf is a matrix indicating whether the corresponding elements of infmat are above the thresholds that R uses for the different influence measures. These thresholds, taken from the source code of influence.measures(), are (this information is not written in the help file):
- any of the DFBETAS larger than 1 in absolute value
- \(|\text{DFFITS}| > 3\sqrt{k/(n-k)}\)
- \(|1 - \text{COVRATIO}| > \frac{3k}{n-k}\)
- Cook's distance larger than the median of the \(F(k,\, n-k)\)-distribution
- hat value larger than \(\frac{3k}{n}\)

where \(k\) is the number of regression coefficients and \(n\) the number of observations used in the fit.
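The first few rows of this matrix:
head(infl$is.inf)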
## dfb.1_ dfb.gndr dfb.age dfb.WC dfb.cret dfb.albu dffit cov.r cook.d hat
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
We can get an overview of those observations for which at least one of the influence measures is above its threshold:
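summary(infl)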
## Potentially influential observations of
## lm(formula = SBP ~ gender + age + WC + creat + albu, data = nhanes, weights = w, na.action = "na.exclude") :
##
## dfb.1_ dfb.gndr dfb.age dfb.WC dfb.cret dfb.albu dffit cov.r cook.d hat
## 32 0.01 -0.09 0.04 0.04 -0.31 0.03 -0.32_* 1.04_* 0.02 0.04_*
## 59 -0.01 0.00 0.00 -0.01 0.00 0.01 0.02 1.01_* 0.00 0.01
## 61 0.00 0.00 0.00 0.00 -0.01 0.00 -0.01 1.01_* 0.00 0.01
## 71 -0.02 0.03 0.07 0.02 0.00 0.00 0.10 0.99_* 0.00 0.00
## 75 -0.01 0.02 0.01 0.00 0.07 0.00 0.07 1.01_* 0.00 0.01_*
## 113 -0.03 0.03 -0.02 0.06 0.00 0.01 0.07 1.01_* 0.00 0.01
## 126 0.00 0.00 0.01 -0.01 0.00 0.00 -0.02 1.01_* 0.00 0.01
## 133 0.00 0.02 0.06 -0.10 -0.01 0.03 -0.13 1.01_* 0.00 0.01_*
## 152 0.04 -0.07 0.06 -0.06 -0.06 -0.02 0.11 0.99_* 0.00 0.00
## 157 -0.01 0.06 0.10 -0.03 0.07 -0.01 0.16_* 0.96_* 0.00 0.00
## 189 0.00 -0.01 0.01 -0.03 0.00 0.01 -0.03 1.01_* 0.00 0.01_*
## 215 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.01_* 0.00 0.01
## 224 0.09 -0.09 0.00 -0.01 -0.01 -0.10 -0.13 0.99_* 0.00 0.00
## 275 0.07 -0.07 0.03 -0.04 -0.03 -0.07 0.10 0.99_* 0.00 0.00
## 314 -0.08 0.10 0.06 -0.05 -0.04 0.12 0.20_* 0.96_* 0.01 0.00
## 317 -0.12 0.10 -0.03 0.01 0.02 0.15 0.18_* 1.01 0.01 0.01_*
## 328 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.01_* 0.00 0.01
## 336 0.02 0.03 0.01 0.03 0.01 -0.04 0.09 0.99_* 0.00 0.00
## 339 -0.03 0.03 -0.04 0.09 0.00 0.00 0.10 1.01 0.00 0.01_*
## 354 -0.04 0.23 -0.14 0.04 0.54 -0.08 0.56_* 1.02_* 0.05 0.03_*
## 369 0.10 -0.09 0.00 0.02 -0.05 -0.11 0.15_* 0.99 0.00 0.00
## 371 0.02 0.00 -0.02 0.00 0.00 -0.02 0.03 1.01_* 0.00 0.01
## 390 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.01_* 0.00 0.01
## 408 0.00 0.00 0.00 0.00 -0.01 0.00 -0.01 1.07_* 0.00 0.06_*
## 410 0.02 -0.02 0.01 0.00 0.00 -0.03 -0.04 1.01_* 0.00 0.01_*
## 421 -0.02 0.03 0.08 -0.03 0.01 0.03 0.10 0.99_* 0.00 0.00
## 422 0.02 -0.03 0.00 0.01 0.04 -0.04 0.10 0.99_* 0.00 0.00
## 436 0.02 -0.07 0.00 0.04 -0.04 -0.03 0.11 0.98_* 0.00 0.00
## 452 -0.02 -0.02 -0.04 0.06 0.04 0.00 0.11 0.99_* 0.00 0.00
## 453 0.07 0.01 0.08 -0.08 0.02 -0.08 0.14 0.98_* 0.00 0.00
## 471 -0.04 -0.04 0.12 -0.03 0.01 0.05 0.16_* 0.96_* 0.00 0.00
## 488 -0.03 0.00 0.02 0.00 0.01 0.03 -0.04 1.01_* 0.00 0.01_*
## 499 -0.08 0.11 -0.12 0.14 0.06 0.05 0.20_* 0.99 0.01 0.01
## 509 -0.07 0.02 0.02 0.01 0.02 0.08 -0.09 1.01_* 0.00 0.01_*
## 510 -0.07 -0.03 -0.08 0.10 -0.03 0.07 0.16_* 1.00 0.00 0.01
## 515 -0.02 0.00 -0.02 0.04 -0.01 0.02 0.05 1.01_* 0.00 0.01_*
## 522 -0.02 0.00 -0.02 0.04 0.00 0.01 0.04 1.01_* 0.00 0.01_*
## 540 0.00 0.00 0.00 0.00 -0.01 0.00 -0.01 1.01_* 0.00 0.01_*
## 555 -0.01 0.00 0.00 0.00 0.01 0.01 0.01 1.01_* 0.00 0.01
## 573 -0.12 0.00 -0.05 0.03 0.00 0.15 0.21_* 1.00 0.01 0.01
## 607 -0.03 0.06 0.06 0.01 0.03 0.01 -0.12 0.99_* 0.00 0.00
## 617 0.07 -0.08 0.11 -0.09 -0.03 -0.05 0.15 0.98_* 0.00 0.00
## 621 0.01 0.00 0.00 -0.01 0.00 -0.01 -0.01 1.01_* 0.00 0.01
## 660 0.00 0.00 0.00 0.00 0.00 0.00 -0.01 1.01_* 0.00 0.01
## 782 0.00 -0.01 0.00 0.00 -0.02 0.00 -0.02 1.07_* 0.00 0.06_*
## 785 0.08 -0.13 0.06 -0.03 -0.35 -0.01 -0.35_* 1.11_* 0.02 0.10_*
## 804 -0.01 0.00 0.00 0.00 0.00 0.01 -0.01 1.02_* 0.00 0.01_*
## 819 -0.03 0.00 0.01 0.00 0.00 0.03 -0.04 1.01_* 0.00 0.01
## 825 -0.01 0.05 -0.03 0.03 0.02 0.00 -0.07 0.99_* 0.00 0.00
## 830 0.04 0.00 -0.01 -0.01 -0.01 -0.04 0.04 1.01_* 0.00 0.01
## 858 0.01 0.00 0.00 0.00 0.00 0.00 0.01 1.01_* 0.00 0.01
## 887 0.00 0.03 0.02 0.04 -0.02 -0.02 0.09 0.99_* 0.00 0.00
## 888 0.00 0.03 0.04 0.01 0.00 -0.01 0.08 0.99_* 0.00 0.00
## 911 -0.08 -0.02 0.06 0.03 -0.02 0.08 0.13 0.98_* 0.00 0.00
## 912 -0.01 0.01 0.00 0.01 0.00 0.01 -0.01 1.01_* 0.00 0.01
## 936 -0.01 0.05 0.11 -0.07 -0.03 0.04 0.16_* 0.96_* 0.00 0.00
## 972 0.05 0.02 0.07 -0.08 0.00 -0.05 0.12 0.98_* 0.00 0.00
## 974 0.00 0.02 0.02 -0.02 0.02 0.00 -0.04 1.01_* 0.00 0.01
## 995 -0.01 0.06 -0.02 0.00 0.14 -0.02 0.14 1.02_* 0.00 0.02_*
## 1013 -0.02 0.03 0.09 -0.03 -0.05 0.04 0.12 0.99_* 0.00 0.00
## 1015 -0.08 0.16 0.06 -0.01 0.29 0.02 0.32_* 0.96_* 0.02 0.01
## 1054 0.04 -0.01 -0.01 -0.01 -0.01 -0.04 0.04 1.01_* 0.00 0.01_*
## 1072 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.01_* 0.00 0.01
## 1084 0.00 -0.02 0.04 0.00 0.03 -0.01 0.08 0.99_* 0.00 0.00
## 1098 0.01 -0.01 0.00 -0.02 0.00 -0.01 -0.02 1.01_* 0.00 0.01
## 1117 0.00 0.00 0.01 -0.01 -0.01 0.00 -0.01 1.01_* 0.00 0.01
## 1118 0.02 0.04 0.04 -0.05 0.02 -0.01 0.10 0.98_* 0.00 0.00
## 1137 0.05 -0.05 0.01 0.02 0.00 -0.07 -0.10 1.01 0.00 0.01_*
## 1166 0.00 -0.05 0.05 -0.02 -0.04 0.01 0.09 0.99_* 0.00 0.00
## 1196 -0.01 0.01 -0.01 0.02 0.00 0.00 0.02 1.01_* 0.00 0.01
## 1203 -0.05 0.10 0.01 -0.02 0.01 0.07 0.15 0.97_* 0.00 0.00
## 1214 -0.02 -0.04 0.01 0.02 0.02 0.01 -0.09 0.99_* 0.00 0.00
## 1215 0.01 0.02 0.07 -0.04 0.00 0.00 0.09 0.99_* 0.00 0.00
## 1221 0.03 0.02 -0.01 0.06 -0.10 -0.03 0.17_* 0.97_* 0.00 0.00
## 1263 0.01 -0.05 0.04 0.02 -0.01 -0.02 0.09 0.99_* 0.00 0.00
## 1309 -0.02 -0.01 0.02 0.01 -0.16 0.05 -0.18_* 1.00 0.01 0.01_*
## 1312 -0.07 0.07 0.07 0.02 -0.03 0.07 0.13 0.98_* 0.00 0.00
## 1314 -0.02 -0.04 -0.03 0.02 -0.01 0.03 0.09 0.99_* 0.00 0.00
## 1328 -0.10 0.12 -0.01 0.23 0.05 0.02 0.29_* 0.95_* 0.01 0.00
## 1351 -0.03 0.02 -0.01 0.04 0.01 0.03 0.05 1.01_* 0.00 0.01_*
## 1361 -0.03 0.10 -0.02 -0.03 0.24 -0.01 0.25_* 1.03_* 0.01 0.03_*
## 1375 0.08 -0.11 0.07 -0.03 -0.03 -0.09 -0.16_* 0.99 0.00 0.00
## 1381 -0.02 0.00 0.01 0.00 0.00 0.02 -0.02 1.01_* 0.00 0.01
## 1387 0.01 0.00 0.01 -0.02 0.00 -0.01 -0.03 1.01_* 0.00 0.01_*
## 1410 -0.04 0.04 0.07 -0.01 -0.01 0.04 0.09 0.99_* 0.00 0.00
## 1451 -0.01 -0.04 0.02 0.03 -0.01 0.01 0.08 0.99_* 0.00 0.00
## 1466 0.00 0.02 0.03 0.01 -0.06 0.01 0.09 0.99_* 0.00 0.00
## 1483 -0.02 0.03 -0.03 0.05 0.02 0.00 0.06 1.01_* 0.00 0.01_*
## 1514 0.00 -0.03 0.05 0.00 0.03 0.00 0.09 0.99_* 0.00 0.00
## 1517 0.00 -0.01 0.02 -0.03 0.01 0.01 -0.04 1.01_* 0.00 0.01
## 1540 -0.03 0.05 -0.07 0.05 0.02 0.02 -0.10 0.99_* 0.00 0.00
## 1571 -0.03 0.00 0.01 0.01 0.00 0.04 -0.04 1.01_* 0.00 0.01
## 1572 -0.05 -0.01 0.07 -0.06 0.02 0.08 -0.14 1.01 0.00 0.01_*
## 1573 0.11 -0.08 -0.03 -0.01 0.05 -0.14 -0.17_* 0.99 0.01 0.01
## 1575 -0.05 0.01 0.01 0.00 0.00 0.06 -0.06 1.01_* 0.00 0.01
## 1592 0.01 0.01 -0.03 0.03 0.01 -0.02 0.05 1.01_* 0.00 0.01
## 1604 0.01 0.00 0.02 -0.03 0.00 0.00 -0.03 1.01_* 0.00 0.01_*
## 1615 0.03 0.01 0.11 -0.04 -0.03 -0.03 0.14 0.97_* 0.00 0.00
## 1639 -0.03 0.03 -0.05 0.14 -0.03 0.00 0.16_* 1.02_* 0.00 0.02_*
## 1655 0.01 -0.02 -0.01 0.00 -0.03 -0.01 -0.04 1.01_* 0.00 0.01
## 1661 0.00 0.00 -0.01 0.01 0.00 0.00 0.01 1.01_* 0.00 0.01
## 1667 0.01 0.02 0.08 -0.05 0.00 0.00 0.10 0.99_* 0.00 0.00
## 1672 0.03 -0.02 0.00 -0.01 0.00 -0.03 -0.03 1.01_* 0.00 0.01
## 1703 0.03 -0.07 0.04 -0.02 -0.14 -0.01 -0.14 1.02_* 0.00 0.02_*
## 1723 -0.11 -0.08 0.11 0.13 -0.14 0.11 0.26_* 0.95_* 0.01 0.00
## 1726 -0.10 0.00 -0.05 0.10 0.06 0.08 0.16_* 0.99_* 0.00 0.00
## 1744 -0.02 0.02 -0.02 0.02 0.01 0.02 0.03 1.01_* 0.00 0.01
## 1757 -0.01 0.05 -0.02 0.01 0.03 0.01 -0.07 0.99_* 0.00 0.00
## 1775 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.01_* 0.00 0.01
## 1785 0.01 0.05 -0.01 0.03 -0.03 -0.01 0.11 0.98_* 0.00 0.00
## 1788 -0.01 0.01 0.00 0.00 0.02 0.01 0.02 1.01_* 0.00 0.01
## 1793 -0.05 0.04 -0.04 0.09 0.00 0.03 0.10 1.01_* 0.00 0.01_*
## 1795 -0.02 0.04 0.10 -0.04 -0.04 0.03 0.13 0.98_* 0.00 0.00
## 1803 -0.01 0.00 0.00 0.00 0.00 0.01 0.02 1.01_* 0.00 0.01_*
## 1830 -0.03 0.04 0.01 0.00 0.07 0.01 0.08 1.01_* 0.00 0.01_*
## 1894 0.00 -0.02 0.06 -0.09 0.00 0.03 -0.12 1.01_* 0.00 0.01_*
## 1921 0.02 0.05 0.04 -0.01 0.07 -0.04 0.12 0.98_* 0.00 0.00
## 1985 -0.01 0.03 0.02 0.02 -0.04 0.01 0.09 0.99_* 0.00 0.00
## 2019 -0.02 -0.01 -0.02 0.06 0.00 0.01 0.06 1.01_* 0.00 0.01_*
## 2022 0.11 -0.12 -0.05 -0.12 -0.09 -0.05 0.20_* 0.99_* 0.01 0.01
## 2052 0.02 0.01 0.07 -0.04 0.00 -0.02 0.09 0.99_* 0.00 0.00
## 2077 0.03 0.02 0.05 -0.03 0.01 -0.03 0.08 0.99_* 0.00 0.00
## 2081 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.01_* 0.00 0.01
## 2082 0.07 0.00 -0.02 -0.07 -0.05 -0.05 -0.12 0.99_* 0.00 0.00
## 2085 0.02 -0.02 0.06 -0.01 0.06 -0.04 0.12 0.99_* 0.00 0.00
## 2095 0.01 0.00 0.00 -0.04 0.00 0.00 -0.04 1.01_* 0.00 0.01_*
## 2102 0.01 -0.09 0.05 0.04 -0.10 -0.01 0.15 0.97_* 0.00 0.00
## 2106 -0.03 -0.01 0.06 -0.03 0.05 0.03 0.11 0.99_* 0.00 0.00
## 2108 0.04 0.03 0.04 0.03 -0.05 -0.06 -0.16_* 0.98_* 0.00 0.00
## 2135 0.01 -0.04 0.00 -0.02 0.02 0.00 0.09 0.99_* 0.00 0.00
## 2159 -0.01 -0.03 -0.03 -0.01 -0.02 0.02 -0.08 0.99_* 0.00 0.00
## 2165 -0.09 0.02 0.03 0.03 0.01 0.10 -0.10 1.01_* 0.00 0.01_*
## 2186 -0.01 -0.04 0.00 0.04 -0.01 0.01 0.08 0.99_* 0.00 0.00
## 2211 0.01 0.00 0.00 0.00 0.00 -0.01 0.02 1.01_* 0.00 0.01
## 2216 0.07 -0.12 0.04 0.00 -0.36 -0.01 -0.36_* 1.04_* 0.02 0.04_*
## 2269 0.01 0.05 0.04 -0.11 0.05 0.02 -0.15_* 1.00 0.00 0.01
## 2278 0.06 0.02 0.01 -0.05 0.00 -0.06 -0.10 0.99_* 0.00 0.00
## 2281 0.00 0.00 0.00 0.00 0.00 0.00 0.01 1.01_* 0.00 0.01
## 2288 0.01 0.01 0.07 0.00 -0.04 -0.01 0.10 0.99_* 0.00 0.00
## 2289 -0.03 0.06 -0.02 0.01 -0.02 0.05 0.10 0.99_* 0.00 0.00
## 2295 0.06 0.03 -0.03 0.00 -0.01 -0.06 0.12 0.98_* 0.00 0.00
## 2303 -0.04 -0.02 -0.08 0.12 0.02 0.01 0.15_* 1.00 0.00 0.01
## 2304 0.01 -0.02 0.00 0.00 0.00 -0.02 -0.03 1.01_* 0.00 0.01
## 2312 0.00 -0.01 0.01 -0.02 -0.01 0.00 -0.02 1.01_* 0.00 0.01
## 2323 -0.09 -0.03 -0.04 0.15 -0.01 0.06 0.19_* 0.98_* 0.01 0.00
## 2334 -0.02 0.06 0.05 -0.02 0.10 0.00 0.13 0.99_* 0.00 0.00
## 2340 -0.03 0.06 -0.03 0.01 0.18 0.00 0.18_* 1.07_* 0.01 0.06_*
## 2373 -0.09 0.00 -0.14 0.16 0.07 0.06 0.23_* 0.99_* 0.01 0.01
## 2389 0.00 -0.01 0.00 0.00 -0.03 0.00 -0.03 1.02_* 0.00 0.02_*
## 2456 -0.02 0.02 -0.01 0.06 -0.01 0.01 0.07 1.01_* 0.00 0.01_*
## 2461 0.02 -0.09 -0.07 0.08 -0.10 -0.02 0.18_* 0.99_* 0.01 0.00
## 2467 0.00 -0.01 0.01 0.01 -0.03 0.00 -0.04 1.02_* 0.00 0.02_*
## 2469 0.02 0.00 -0.01 -0.01 -0.01 -0.02 0.02 1.01_* 0.00 0.01_*
## 2486 -0.01 0.05 0.01 -0.02 0.12 -0.01 0.13 1.01_* 0.00 0.01_*
## 2513 0.01 -0.01 0.00 -0.01 0.00 0.00 -0.01 1.01_* 0.00 0.01
## 2527 0.03 -0.04 0.04 -0.08 0.00 -0.01 -0.09 1.01_* 0.00 0.01_*
## 2537 -0.01 0.02 0.04 0.04 -0.04 -0.01 0.10 0.99_* 0.00 0.00
## 2542 0.02 -0.01 -0.01 -0.02 0.00 -0.02 -0.03 1.01_* 0.00 0.01
## 2616 -0.06 -0.01 0.02 0.00 -0.03 0.08 0.10 1.01_* 0.00 0.01_*
## 2623 0.08 -0.14 -0.03 0.13 -0.09 -0.12 0.27_* 0.95_* 0.01 0.00
## 2633 0.07 0.01 0.02 -0.12 0.03 -0.06 -0.13 1.01 0.00 0.01_*
## 2634 -0.02 0.06 0.07 -0.02 0.05 0.01 0.12 0.97_* 0.00 0.00
## 2645 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.01_* 0.00 0.01
## 2652 0.01 0.03 0.06 -0.04 0.00 -0.01 0.09 0.99_* 0.00 0.00
## 2673 -0.04 0.05 0.04 0.02 0.01 0.03 0.08 0.99_* 0.00 0.00
## 2685 -0.03 0.05 0.04 -0.01 -0.02 0.04 0.10 0.98_* 0.00 0.00
## 2728 -0.06 0.06 0.02 0.04 0.02 0.05 0.09 0.99_* 0.00 0.00
We could, of course, again investigate these observations further to determine if there might be any mistakes in the data and whether we should perform sensitivity analyses in which highly influential observations are excluded.
Throughout this practical, we created the plots ourselves and also calculated many of the measures “by hand”. R provides a convenient plotting method that gives us much of this information automatically: we only need to plot() the fitted model object to get a set of standard diagnostic plots (by default: residuals vs. fitted values, a normal QQ-plot of the standardized residuals, a scale-location plot, and residuals vs. leverage). The argument which controls which of the available plots are shown.
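A minimal example, arranging the four default plots in one figure:
par(mfrow = c(2, 2))
plot(lm_weighted)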