Linear Regression: Model Diagnostics

In this practical, we will perform model diagnostics for a fitted linear regression model.

The aim is to revisit the theory on model diagnostics and to learn how to perform the necessary steps in .

Note that for several tasks of this practical the given solution is only one of many possible solutions. In some cases it may not be the most efficient or elegant solution, because the focus is on keeping the syntax simple and easier to understand.

Most of the plots are created using the base graphics. For some plots, example syntax on how to create them using ggplot2 is available. These examples are intended as just that, examples, but the explanations are not sufficient to understand the syntax if you are completely new to ggplot2. If you are new to ggplot2, chapter 2 of the ggplot2 book will provides a nice introduction into the basics of ggplot2 graphics.

This practical is intended to be “learning by doing”, i.e., you may not be able to solve all tasks by yourself without the hints or even checking the solution, and you will come across functions that you haven’t seen in the lecture slides.

When a function is new to you, you can find out more about it in the help file that you can access by typing ?<function name>.

There are a few parts indicated as “Advanced Level” containing some more complex solutions (for those of you, for whom working with is a piece of cake).

Data

For this practical we will use the same subset of the NHANES (National Health and Nutrition Examination Survey) data as used in the first practical.

To load this dataset into , you can use the command file.choose() which opens the explorer and allows you to navigate to the location of the file on your computer.

If you know the path to the file, you can also use load("<path>/nhanes.RData").

Here a short overview over the data:

head(nhanes)

##        SBP gender age               race    WC  alc educ creat albu uricacid bili            occup   smoke
## 1 110.6667   male  22 Non-Hispanic White  81.0 <NA>  low  0.91  4.8      4.9  0.6          working   never
## 2 118.0000 female  44 Non-Hispanic White  80.1 <NA> high  0.89  3.7      4.5  0.3          working   never
## 3 124.6667   male  21              other  69.6   <1  low  0.87  4.4      5.4  0.2      not working   never
## 4 102.0000 female  43 Non-Hispanic Black 120.4  >=1  low  0.68  4.4      5.0  0.8      not working current
## 5 146.6667   male  51              other  81.1 <NA>  low  0.99  4.1      5.0  0.5 looking for work current
## 6 122.0000   male  80 Non-Hispanic White 112.5 <NA>  low  1.01  4.2      4.8  0.7      not working   never

str(nhanes)

## 'data.frame':    2765 obs. of  13 variables:
##  $ SBP     : num  111 118 125 102 147 ...
##  $ gender  : Factor w/ 2 levels "male","female": 1 2 1 2 1 1 1 2 1 1 ...
##  $ age     : num  22 44 21 43 51 80 26 30 70 35 ...
##  $ race    : Factor w/ 5 levels "Mexican American",..: 3 3 5 4 5 3 4 5 4 4 ...
##  $ WC      : num  81 80.1 69.6 120.4 81.1 ...
##  $ alc     : Factor w/ 2 levels "<1",">=1": NA NA 1 2 NA NA 2 2 1 NA ...
##  $ educ    : Factor w/ 2 levels "low","high": 1 2 1 1 1 1 1 2 1 2 ...
##  $ creat   : num  0.91 0.89 0.87 0.68 0.99 1.01 1.03 0.91 0.8 1.1 ...
##  $ albu    : num  4.8 3.7 4.4 4.4 4.1 4.2 5 4.8 3.7 4 ...
##  $ uricacid: num  4.9 4.5 5.4 5 5 4.8 5.4 6.7 5.4 6.7 ...
##  $ bili    : num  0.6 0.3 0.2 0.8 0.5 0.7 1.1 0.9 0.9 0.9 ...
##  $ occup   : Factor w/ 3 levels "working","looking for work",..: 1 1 3 3 2 3 1 1 3 1 ...
##  $ smoke   : Ord.factor w/ 3 levels "never"<"former"<..: 1 1 1 3 3 1 1 1 2 1 ...

Fitting the Linear Regression Model

Task

Fit a linear regression model on the nhanes data with SBP as the response and gender, age, waist circumference (WC), creatinine (creat) and albumin (albu) as covariates.

Note that the some of the covariates in the model have missing values. In the following steps, we need to plot residuals and fitted values against the covariates. It is convenient to make sure that residuals and fitted values will have the same length as the original data.

You should set the argument na.action so that fitted values and residuals have NA values for incomplete cases (see the slides on missing values in Linear Regression with R).

Solution

When we use the option na.exclude, the vectors of the fitted values and residuals will have the same length as the input data, and have NA values for observations with missing values. This is convenient for this practical because we will need to plot residuals against covariate values, and when we take these covariate values from the input data, they will have the length of the original data.

lm1 <- lm(SBP ~ gender + age + WC + creat + albu, data = nhanes,
          na.action = "na.exclude")

Residuals

In the lectures, we’ve seen different types of residuals. In , we have the following options to calculate them:

the usual residuals
- resid(lm1) or residuals(lm1)
- lm1$resid or lm1$residuals
standardized residuals
- rstandard(lm1)
studentized residuals
- rstudent(lm1)
partial residuals
- resid(lm1, type = "partial") or residuals(lm1, type = "partial")

Task 1

Calculate the different types of residuals and combine them into a data.frame.

Make sure the columns have meaningful names, i.e., that it is clear they contain residuals and which type of residuals.

You can see and change column names using the function colnames().

The function paste0() allows you to concatenate strings. This is useful for changing the variable names.

Solution 1

We first obtain the matrix of partial residuals and re-name the columns before combining it into a data.frame with the other types of residuals.

part_resid <- resid(lm1, type = "partial", na.action = "na.exclude")
head(part_resid)

##         gender         age          WC      creat       albu
## 1  -0.32886268 -12.3572653  -4.0367656  -2.169207  -1.302762
## 2   0.06174613   0.7849141   0.0621059   1.994264   1.024472
## 3  16.07153418   3.6365442  11.1174041  14.163616  14.417626
## 4 -20.77244221 -20.4558616 -16.3665601 -19.194686 -18.619701
## 5  24.92404036  24.6866707  21.2270693  23.218844  22.760126
## 6 -15.17003887  -3.6163756 -15.4344193 -16.841449 -17.163951

The partial residuals have the variable names as column names. Because later we want to plot them against the actual covariates (and will have residuals and data in the same data.frame) it is important that we can distinguish columns containing data and columns containing residuals. Therefore we add a suffix "rpart_" to each column name.

colnames(part_resid) <- paste0("rpart_", colnames(part_resid))
colnames(part_resid)

## [1] "rpart_gender" "rpart_age"    "rpart_WC"     "rpart_creat"  "rpart_albu"

The function paste0() creates character strings from one or more objects.

When we provide multiple objects as we did above, where we had the character string "rpart_" and a vector of character strings (the column names), the objects will be converted to characters and concatenated:

paste0("abc", 123, "-", 4^2)

## [1] "abc123-16"

When we provide one object with multiple elements, e.g., a vector or a list, the different elements are only combined into a single string when we specify the additional argument collapse:

paste0(c("abc", 123, "-", 4^2))

## [1] "abc" "123" "-"   "16"

paste0(c("abc", 123, "-", 4^2), collapse = "_")

## [1] "abc_123_-_16"

collapse can be any character string

paste0(c("abc", 123, "-", 4^2), collapse = "_ something _")

## [1] "abc_ something _123_ something _-_ something _16"

We can now combine the partial residuals with the other types of residuals into a data.frame:

dat_resid <- data.frame(
  resid = resid(lm1),
  rstandard = rstandard(lm1),
  rstudent = rstudent(lm1),
  part_resid
)

head(dat_resid)

##        resid  rstandard   rstudent rpart_gender   rpart_age    rpart_WC rpart_creat rpart_albu
## 1  -2.201540 -0.1414100 -0.1413806  -0.32886268 -12.3572653  -4.0367656   -2.169207  -1.302762
## 2   1.995717  0.1282378  0.1282110   0.06174613   0.7849141   0.0621059    1.994264   1.024472
## 3  14.198856  0.9123318  0.9122993  16.07153418   3.6365442  11.1174041   14.163616  14.417626
## 4 -18.838471 -1.2103031 -1.2104227 -20.77244221 -20.4558616 -16.3665601  -19.194686 -18.619701
## 5  23.051363  1.4806115  1.4809867  24.92404036  24.6866707  21.2270693   23.218844  22.760126
## 6 -17.042717 -1.0947569 -1.0948031 -15.17003887  -3.6163756 -15.4344193  -16.841449 -17.163951

Task 2

When we plot residuals, we often plot them against the fitted values or against the covariates. For the following tasks, it is convenient to have all the values in the same data.frame.

Add the fitted values and the covariates used in the model as columns to the data.frame containing the residuals.

The fitted values are returned as element in the fitted model object but you can also use the function fitted.values() (see slide 10 of Linear Regression in R).

Solution 2

We can get the fitted values directly from the fitted model object, but the resulting vector is shorter than the original data because NA values were omitted. Since we used na.action = "na.exclude" when we fitted the model, this is not the case when we use fitted.values() (or its abbreviation fitted()).

dat_resid$fit <- fitted.values(lm1)

There are many ways how we could add the original variables as columns to dat_resid.

Some examples: (if you run all of them you will create copies of some columns )

Option 1

We could add each variable separately using the $ operator:

dat_resid$SBP <- nhanes$SBP
dat_resid$gender <- nhanes$gender
dat_resid$age <- nhanes$age
dat_resid$WC <- nhanes$WC
dat_resid$creat <- nhanes$creat
dat_resid$albu <- nhanes$albu

This is a lot of typing and hence susceptible to typos and/or copy-paste mistakes. Moreover, if we change the covariates used in lm1, we also need to remember to change the syntax here.

Option 2

We could also add all variables at the same time using cbind():

dat_resid <- cbind(dat_resid, nhanes[, c("SBP", "gender", "age", "WC", "creat", "albu")])

This option requires less typing, but we still need to remember to change the variable names here if we change the covariates in lm1.

Option 3

We can extract the model formula from lm1 using the function formula(), and the function all.vars() allows us to get the names of all variables used in a formula:

formula(lm1)

## SBP ~ gender + age + WC + creat + albu

all.vars(formula(lm1))

## [1] "SBP"    "gender" "age"    "WC"     "creat"  "albu"

Using these two functions, we can extract the correct columns from the nhanes data using syntax that will remain correct even if we change the covariates in lm1:

dat_resid <- cbind(dat_resid,
                   nhanes[all.vars(formula(lm1))])

Task 3

Plot the usual residuals, standardized and studentized residuals against the fitted values.

Add a horizontal reference line to each plot to indicate zero.

A horizontal line can be set with the function abline().

Solution 3

Base R graphics

The global graphics parameters can be controlled via the function par(). To display multiple plots in one figure, we can specify the mfrow argument. It takes a numeric vector of length 2 that specifies the number of rows and number of columns of the plot layout.

par() allows you to change many other settings. Here is some additional information on the ones used in this solution:

The graphics setting mar controls the margins of the plot. The values give the size of bottom, left, top and right margin in number of lines. The default, c(5, 4, 4, 2) + 0.1 can result in much white space between plots.

mgp sets the margin line for the axis titles, axis labels and axis line (i.e., how far they are from the box around the plot). We can set this here to move the axis text and title towards the plot to save some space.

To re-set the graphics device (the part of the user interface showing the plots) you can call the function dev.off() (without any arguments). This removes all plots and restores the default settings.

par(mfrow = c(2, 2), mar = c(3.5, 3.5, 2, 1), mgp = c(2, 0.6, 0))

plot(resid ~ fit, data = dat_resid,
     ylab = "residuals", xlab = "fitted values")
abline(h = 0, lty = 2)

plot(rstandard ~ fit, data = dat_resid, 
     ylab = "standardized residuals", xlab = "fitted values")
abline(h = 0, lty = 2)

plot(rstudent ~ fit, data = dat_resid,
     ylab = "studentized residuals", xlab = "fitted values")
abline(h = 0, lty = 2)

abline() draws straight lines into (base R) graphics.

Using the argument h we specify that we want a horizontal line at the position 0 on the y-axis. Alternative arguments are v (for vertical lines) and a and b for general lines, where a is the intercept and b the slope.

The argument lty controls the line type. To get more information about graphical parameters such as lty, see the help for ?par.

All three plots look very similar and don’t show an obvious pattern. There might be a bit more variation in the residuals for larger observations.

Here I used the formula specification for the plot, i.e., using arguments formula (without using the name of this argument) and data.

Alternatively, we could also create these plots by using a specification using the arguments x and y:

plot(x = dat_resid$fit, y = dat_resid$resid,
     ylab = "residuals", xlab = "fitted values")

For the specification using the arguments x and y we need to provide vectors containing the values to be plotted. Therefore we need to use dat_resid$fit and cannot just use the name of the variable, i.e., fit.

ggplot2

To create the figure using the ggplot2 package, we could use two different approaches:

create the plots separately and combine them into one figure using ggpubr::ggarrange() (or cowplot::plot_grid())
convert the data to long format and use facets

We can use the notation :: to use a function from a package without loading this package, i.e., <package>::<function>. This can be convenient when we don’t want functions from one package to interfere with functions from other packages (that may have the same function name), or to make clear which package the function is from.

In the following, we first create 3 separate plots and then combine them into one figure in a second step, to keep things simple.

A ggplot object is build up in layers:

library("ggplot2")
ggplot(dat_resid, aes(x = fit,  y = resid))

This syntax just specifies the data (the first argument, which is used here without a name), and the basic aesthetics (aes()). To actually plot the data we need to specify how we want to add them, i.e., as dots, or lines, or bars, …

ggplot(dat_resid, aes(x = fit,  y = resid)) +
  geom_point()

## Warning: Removed 405 rows containing missing values (geom_point).

We get a warning message that 405 rows were removed because they contain missing values. We know about these missing values in the variable resid (otherwise we should now investigate why there are missing values and if we might have introduced them by mistake somewhere in our syntax). The warning can be prevented by specifying na.rm = TRUE in geom_point() to indicate that we would like to remove the cases with missing values.

The next layer then adds a horizontal line at the position 0 on the y-axis, and uses line type lty = 2, i.e., a dashed line:

ggplot(dat_resid, aes(x = fit,  y = resid)) +
  geom_point(na.rm = TRUE) +
  geom_hline(yintercept = 0, lty = 2)

We could adjust the colours of the dots and the line by specifying different values in the corresponding functions, for example:

ggplot(dat_resid, aes(x = fit,  y = resid)) +
  geom_point(na.rm = TRUE, col = "darkblue") +
  geom_hline(yintercept = 0, lty = 2, col = "red")

By default, the axes are labelled with the name of the variable that is shown on the axis, but we can change this to get more informative titles using xlab() and ylab().

ggplot(dat_resid, aes(x = fit,  y = resid)) +
  geom_point(na.rm = TRUE) +
  geom_hline(yintercept = 0, lty = 2) +
  xlab("fitted values") +
  ylab("residuals")

We can “store” the ggplot object by assigning it to a name, so that we can call it again later.

p_resid <- ggplot(dat_resid, aes(x = fit,  y = resid)) +
  geom_point(na.rm = TRUE) +
  geom_hline(yintercept = 0, lty = 2) +
  xlab("fitted values") +
  ylab("residuals")

The above syntax will have no output, but this does:

p_resid

## Warning: Using the `size` aesthetic with geom_segment was deprecated in ggplot2 3.4.0.
## ℹ Please use the `linewidth` aesthetic instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

We can create the other two plots (for the standardized and studentized residuals) in the same manner:

p_rstand <- ggplot(dat_resid, aes(x = fit,  y = rstandard)) +
  geom_point(na.rm = TRUE) +
  geom_hline(yintercept = 0, lty = 2) +
  xlab("fitted values") +
  ylab("standardized residuals")

p_rstud <- ggplot(dat_resid, aes(x = fit,  y = rstudent)) +
  geom_point(na.rm = TRUE) +
  geom_hline(yintercept = 0, lty = 2) +
  xlab("fitted values") +
  ylab("studentized residuals")

To combine them into one figure, we can use `ggpubr::ggarrange(p_resid, p_rstand, p_rstud)

ggpubr::ggarrange(p_resid, p_rstand, p_rstud)

Alternatively, we could first convert the data into a long format, so that we have multiple rows per observation, one for each type of residual:

dat_resid_long <- reshape(dat_resid, direction = "long",
        varying = c("resid", "rstandard", "rstudent"),
        v.names = "residual", timevar = "type", 
        times = c("usual", "standardized", "studentized"))

head(dat_resid_long)

##         rpart_gender   rpart_age    rpart_WC rpart_creat rpart_albu      fit      SBP gender age    WC creat albu  type   residual id
## 1.usual  -0.32886268 -12.3572653  -4.0367656   -2.169207  -1.302762 112.8682 110.6667   male  22  81.0  0.91  4.8 usual  -2.201540  1
## 2.usual   0.06174613   0.7849141   0.0621059    1.994264   1.024472 116.0043 118.0000 female  44  80.1  0.89  3.7 usual   1.995717  2
## 3.usual  16.07153418   3.6365442  11.1174041   14.163616  14.417626 110.4678 124.6667   male  21  69.6  0.87  4.4 usual  14.198856  3
## 4.usual -20.77244221 -20.4558616 -16.3665601  -19.194686 -18.619701 120.8385 102.0000 female  43 120.4  0.68  4.4 usual -18.838471  4
## 5.usual  24.92404036  24.6866707  21.2270693   23.218844  22.760126 123.6153 146.6667   male  51  81.1  0.99  4.1 usual  23.051363  5
## 6.usual -15.17003887  -3.6163756 -15.4344193  -16.841449 -17.163951 139.0427 122.0000   male  80 112.5  1.01  4.2 usual -17.042717  6

The new variable type contains the information on what type of residual it is (“usual”, “standardized” or “studentized”), and the new column residual contains the value of the residual.

ggplot(dat_resid_long, aes(x = fit, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, lty = 2) +
  facet_wrap("type", scales = "free_y", ncol = 2)

The function facet_wrap() creates a separate plot for each different value of type and arranges them in ncol = 2 columns. The argument scales = "free_y" specifies that the y-axes of the different plots are not restricted to the same scale.

Task 4

Plot the standardized residuals against each of the covariates and interpret what you see.

Solution 4

par(mfrow = c(2, 3), mar = c(3.5, 3.5, 2, 1), mgp = c(2, 0.6, 0))

plot(rstandard ~ age, data = dat_resid,
     ylab = "standardized residuals")
abline(h = 0, lty = 2)

plot(rstandard ~ gender, data = dat_resid,
     ylab = "standardized residuals")
abline(h = 0, lty = 2)

plot(rstandard ~ WC, data = dat_resid,
     ylab = "standardized residuals")
abline(h = 0, lty = 2)

plot(rstandard ~ creat, data = dat_resid,
     ylab = "standardized residuals")
abline(h = 0, lty = 2)

plot(rstandard ~ albu, data = dat_resid,
     ylab = "standardized residuals")
abline(h = 0, lty = 2)

We see that for age the scatter seems to be spread a bit wider for larger values. The other plots do not show clear patterns. creat is skewed which makes it a lot harder to detect if there are any trends or patterns in this residual plot.

For the categorical covariate gender, creates box plots by default. To get a scatter plot we could use plot.default() instead of plot().

Linearity, Normality & Multicollinearity

Task 1

To investigate if the assumption of a linear association between the covariates and the response is reasonable, plot the partial residuals against the corresponding covariates.

Solution 1

par(mfrow = c(2, 3), mar = c(3.5, 3.5, 2, 1), mgp = c(2, 0.6, 0))

plot(rpart_age ~ age, data = dat_resid,
     xlab = "age", ylab = "partial residuals")

plot(rpart_gender ~ gender, data = dat_resid,
     xlab = "gender", ylab = "partial residuals")

plot(rpart_WC ~ WC, data = dat_resid,
     xlab = "waist circumference", ylab = "partial residuals")

plot(rpart_creat ~ creat, data = dat_resid,
     xlab = "creatinine", ylab = "partial residuals")

plot(rpart_albu ~ albu, data = dat_resid,
     xlab = "albumin", ylab = "partial residuals")

If the assumption of linear effects is violated we should see a curved pattern in the plots. There are no strong patterns for any of the continuous covariates.

Task 2

To investigate whether the residuals are approximately normally distributed, create a plot comparing the empirical quantiles of the standardized residuals to the theoretical quantiles of a standard normal distribution (QQ-plot).

Use the functions qqnorm() and qqline(). See the examples at the end of the help (?qqnorm) for how to use these functions.

Solution 2

A QQ plot comparing the empirical quantiles of the residuals to the theoretical quantiles of a standard normal distribution can be created using qqnorm(). The function qqline() adds a reference line:

qqnorm(dat_resid$rstandard)
qqline(dat_resid$rstandard, col = "red")

By setting the argument col we can specify the colour in which the line should be plotted.

The package car provides functionality to create a plot with a point-wise confidence-envelope:

car::qqPlot(dat_resid$rstandard)

We can use the notation :: to use a function from a package without loading this package, i.e.., <package>::<function>. This can be convenient when we don’t want functions from one package to interfere with functions from other packages (that may have the same function name), or to make clear which package the function is from.

We see that the standardized residuals in the upper tail of the distribution deviate from the normal distribution.

To get a QQ-plot with confidence envelope using ggplot2 we can use the functions stat_qq_line() and stat_qq(). The package qqplotr provides the additional functionality to get a confidence band via the function stat_qq_band():

ggplot(dat_resid, aes(sample = rstandard)) +
  qqplotr::stat_qq_band() +
  stat_qq_line() +
  stat_qq()

Task 3

Check for multicollinearity using the variance inflation factor. You can use the function vif() from the package car to do this. vif() takes a fitted model object as its only argument.

Solution 3

car::vif(lm1)

##   gender      age       WC    creat     albu 
## 1.241285 1.142430 1.211550 1.140478 1.285415

The variance inflation factor is relatively low for all covariates, indicating that there are no issues with multicollinearity in our model.

Task 4

From the lectures (section Model Diagnostics II: Linearity, Normality & Multicollinearity) we know how the variance inflation factor is calculated.

Calculate the variance inflation factor for lm1 using this formula and compare the results to the variance inflation factor calculated in the previous step.

The $R_j^2$ in the formula for the VIF is the coefficient of determination from the model with covariate $x_j$ as response and all other covariates as independent variables.

Solution 4

The variance inflation factor for a covariate $j$ is calculated as \[VIF_j = \frac{1}{1 - R_j^2},\] where $R_j$ is the coefficient of determination from a model that has variable $j$ as response and all other covariates in its linear predictor (see slides 15/16 of Model Diagnostics II: Linearity, Normality & Multicollinearity).

We fit a model for each of the covariates:

lm_gender <- update(lm1, formula = as.numeric(gender) ~ . - gender, data = lm1$model)
lm_age <- update(lm1, formula = age ~ . - age, data = lm1$model)
lm_WC <- update(lm1, formula = WC ~ . - WC, data = lm1$model)
lm_creat <- update(lm1, formula = creat ~ . - creat, data = lm1$model)
lm_albu <- update(lm1, formula = albu ~ . - albu, data = lm1$model)

Here I use update() instead of writing the model specification.

update() will call the same function that was used to create the object that is passed to it, with the same arguments that were provided in the original function call, except for those that are specified in update().

In this case, we specify a new model formula.

The dot means “everything as it was in the original model”. For instance, age ~ . would create a formula age ~ gender + age + WC + creat + albu where the right hand side of the original formula is used, i.e, age is now the response and a covariate. This would lead to a warning message and excluding age as covariate. To avoid this, we remove age as a covariate ourselves, by writing age ~ . - age.

If we wanted to keep the response from the original model, but apply changes to the right hand side of the formula, we could use the dot on the left hand side.

Note that the use of . means something different in a model formula in lm().

In a formula in lm(), the dot will include all other variables (i.e., all except the response) into the model.

lm(age ~ ., data = nhanes)

## 
## Call:
## lm(formula = age ~ ., data = nhanes)
## 
## Coefficients:
##            (Intercept)                     SBP            genderfemale      raceOther Hispanic  raceNon-Hispanic White  
##               28.27832                 0.33531                 0.74008                 3.43255                 3.88019  
## raceNon-Hispanic Black               raceother                      WC                  alc>=1                educhigh  
##               -0.68617                -0.66851                 0.06762                -1.95229                -1.50444  
##                  creat                    albu                uricacid                    bili   occuplooking for work  
##                3.22239                -8.96571                 0.01064                 5.19037                -4.63012  
##       occupnot working                 smoke.L                 smoke.Q  
##                8.96362                -1.37670                -7.07694

(see also slides 3/4 from Linear Regression in R)

The new models do not include the variable SBP, but lm1 does. Therefore, cases with missing values in SBP are excluded when fitting lm1 but they would not in the models for the covariates. We need to make sure that all models are fitted on the exact same data. By using the model frame returned in lm1 we can easily achieve this.

Then we get the coefficient of determination for each model (it is returned as part of the model summary) and store them in a vector.

R2 <- c(gender = summary(lm_gender)$r.squared,
        age = summary(lm_age)$r.squared,
        WC = summary(lm_WC)$r.squared,
        creat = summary(lm_creat)$r.squared,
        albu = summary(lm_albu)$r.squared
)

Using this vector we can calculate the variance inflation factor and compare it to the values obtained using the car package:

cbind(VIF = 1 / (1 - R2),
      "car::vif" = car::vif(lm1)
)

##             VIF car::vif
## gender 1.241285 1.241285
## age    1.142430 1.142430
## WC     1.211550 1.211550
## creat  1.140478 1.140478
## albu   1.285415 1.285415

Homoscedasticity

Residual Variation

In the plots created above, we already saw that there might be some heteroscedasticity.

For a better visualization of the variation of the residuals, we can plot an estimate of this variation instead of the residuals themselves. A common choice is to use the square root of the absolute value of the (standardized) residuals.

Task 1

Plot the square root of the absolute value of the standardized residuals against the fitted values.

To also get a LOESS (locally estimated scatter plot smoothing) curve you can use the function scatter.smooth() instead of plot(). (Note that scatter.smooth() cannot be specified via arguments formula and data. You will have to provide vectors of values to the arguments x and y.)

Solution 1

We can obtain the absolute value with the help of the function abs() and the square root is calculated by sqrt().

scatter.smooth(y = sqrt(abs(dat_resid$rstandard)), x = dat_resid$fit,
               xlab = "fitted values", ylab = "sqrt(abs(standardized residuals))")

It is a bit difficult to see here, but the LOESS fit shows an upward trend, indicating that residual variation increases with increasing fitted values.

We can increase the visibility of the LOESS fit by changing the colour of the scatter to grey:

scatter.smooth(y = sqrt(abs(dat_resid$rstandard)), x = dat_resid$fit,
               xlab = "fitted values", ylab = "sqrt(abs(standardized residuals))",
               col = "grey")

We can improve the plot by changing the axis labels and highlighting the LOESS fit with colour.

The style of this line can be controlled via a list() of options that is passed to the argument lpars. For instance, lpars = list(col = "red", lwd = 2) makes the line thicker and red.

We should also give the plot a nicer y-axis label. To write the square root symbol in the title of the y-axis (because it looks nicer than writing “sqrt” and is shorter than writing the whole title in words) we need to specify it as an expression().

The syntax used inside the function expression(), i.e., sqrt(abs(~standardized~residuals~"")) uses “plotmath” (see ?plotmath) and can be a bit tricky to figure out.

plotmath provides a number of function to write common mathematical symbols but it cannot handle normal text directly. This is why the ~ is used instead of spaces.

An improved version of the plot could then look like this:

scatter.smooth(y = sqrt(abs(dat_resid$rstandard)), x = dat_resid$fit,
               ylab = expression(sqrt(abs(~standardized~residuals~""))),
               xlab = "fitted values", lpars = list(col = "red", lwd = 2))

Alternatively, we could add the LOESS curve (obtained with loess.smooth()) to a standard plot (and get the exact same output):

plot(sqrt(abs(rstandard)) ~ fit, data = dat_resid, 
     ylab = "sqrt(abs(standardized residuals))",
     xlab = "fitted values", col = "grey")

loess_fit <- loess.smooth(y = sqrt(abs(dat_resid$rstandard)),
                          x = dat_resid$fit)
lines(loess_fit)

Of course, we can apply the same changes (y-axis title and colour of the line) as before:

plot(sqrt(abs(rstandard)) ~ fit, data = dat_resid, 
     ylab = expression(sqrt(abs(~standardized~residuals~""))),
     xlab = "fitted values")

loess_fit <- loess.smooth(y = sqrt(abs(dat_resid$rstandard)), x = dat_resid$fit)
lines(loess_fit, col = "red", lwd = 2)

Confidence Intervals

To be able to get confidence intervals for this line, however, we need to get the fitted loess object (by using the function loess()) because then we can get estimates of the standard errors that allow us to calculate confidence intervals.

Here is a demonstration how we can do this:

The function loess() allows us to fit a model using “Local Polynomial Regression” (the details about this method are outside the scope of this course). This is the same fit we obtained by scatter.smooth() in the previous step, but in the form of a model, which means we have more options to work with the resulting object.

The formula remains the same as before: the squared absolute standardized residuals are the response, the fitted values from lm1 the (only) covariate.

loess_mod <- loess(sqrt(abs(rstandard)) ~ fit, data = dat_resid)

Using the function predict() we can obtain the fitted values and the corresponding standard error (when we set the argument se = TRUE) from this LOESS fit, which allows us to calculate confidence intervals.

(We’ll see more about predict() and fitted values in the section on Effect Plots.)

loess_fit <- predict(loess_mod, dat_resid, se = TRUE, level = 0.95)

The object returned by predict() is a list with elements fit, se.fit (and two more elements):

str(loess_fit)

## List of 4
##  $ fit           : Named num [1:2765] 0.69 0.712 0.675 0.763 0.798 ...
##   ..- attr(*, "names")= chr [1:2765] "1" "2" "3" "4" ...
##  $ se.fit        : Named num [1:2765] 0.0128 0.013 0.0145 0.0135 0.0133 ...
##   ..- attr(*, "names")= chr [1:2765] "1" "2" "3" "4" ...
##  $ residual.scale: num 0.355
##  $ df            : num 2355

We can calculate the lower (lwr) and upper (upr) limits of the 95% confidence interval as:

loess_fit$lwr <- loess_fit$fit - qnorm(0.975) * loess_fit$se.fit
loess_fit$upr <- loess_fit$fit + qnorm(0.975) * loess_fit$se.fit

We could now add lines for this confidence interval to the plot we created in the previous step, however, the fitted values in dat_resid are not sorted. This means that when we plot lines using dat_resid$fit as x-coordinate, we get a mess of zigzag lines.

In a scatter plot everything looks fine, but when using lines…

par(mfrow = c(1, 2))
plot(x = dat_resid$fit, y = loess_fit$lwr, type = "p")
plot(x = dat_resid$fit, y = loess_fit$lwr, type = "l")

⇨ When you get weird zigzag lines in a plot, check how it looks like when using dots. If it looks fine with dots, try sorting the data by the variable shown on the x-axis.

To sort dat_resid$fit and get the elements of loess_fit in the same order, it is convenient to have the elements of loess_fit and dat_resid$fit in the same data.frame. We add the fitted values from lm1 to loess_fit (instead of vice versa) because that is more convenient for the next task:

loess_fit$fit_lm1 <- dat_resid$fit
loess_fit <- data.frame(loess_fit[c("fit_lm1", "fit", "lwr", "upr")])

We can now sort dat_resid by the fitted values:

loess_fit <- loess_fit[order(loess_fit$fit_lm1), ]

It is important to combine the LOESS fit (and confidence interval) with the same version of the fit from lm1 in dat_resid that was used to create loess_fit because only then will the values in loess_fit correspond to the rows in dat_resid.

This means we can either

create loess_fit
combine loess_fit and the fitted values
sort

sort dat_resid
create loess_fit
combine

We can now plot the square root of the absolute standardized residuals against the fitted values and add the corresponding LOESS fit with 95% confidence interval to the plot:

plot(sqrt(abs(rstandard)) ~ fit, data = dat_resid, 
     ylab = "sqrt(abs(standardized residuals))",
     xlab = "fitted values")

lines(x = loess_fit$fit_lm1, y = loess_fit$fit, col = "red", lwd = 2)
lines(x = loess_fit$fit_lm1, y = loess_fit$lwr, lty = 2, col = "red", lwd = 2)
lines(x = loess_fit$fit_lm1, y = loess_fit$upr, lty = 2, col = "red", lwd = 2)

An alternative way to add the fitted line and confidence interval would be to use matplot() which plots multiple variables (columns in the input data) at the same time:

plot(sqrt(abs(rstandard)) ~ fit, data = dat_resid, 
     ylab = "sqrt(abs(standardized residuals))",
     xlab = "fitted values")

matplot(x = loess_fit$fit_lm1, y = loess_fit[, c("fit", "lwr", "upr")],
        type = "l", lty = c(1, 2, 2), col = "red", add = TRUE)

In matplot() we need to set a few additional arguments.

The argument type = "l" specifies that lines should be plotted. By default, matplot() would plot each observation using digits 1, 2 or 3.

lty sets the line type and we specify that type 1 (solid) is used for the first column of the data (fit) and type 2 (dashed) is used for the other two columns.

Because we want to add the lines on top of the scatter plot we already have we set the argument add = TRUE. Otherwise matplot() would create a new plot.

It seems that the residual standard deviation increases a bit with increasing fitted values, but not a lot.

We can obtain a scatter plot smoother using geom_smooth(). The function chooses the method to be used based on the amount of data. To get a LOESS curve we can set method = "loess":

ggplot(dat_resid, aes(x = fit, y = sqrt(abs(rstandard)))) +
  geom_point() +
  geom_smooth(method = "loess") +
  xlab("fitted values") +
  ylab("sqrt(abs(standardized residuals))")

## `geom_smooth()` using formula 'y ~ x'

Task 2

Create plots of the square root of the absolute value of the standardized residuals against the covariates.

For the continuous covariates, add the loess fit and the corresponding 95% confidence intervals.

Solution 2

The following syntax relies a lot on copy-pasting.

Whenever we need to copy-paste the same syntax multiple times (and make small changes to each version) a better option would be to create a function that runs the syntax, to use loops, or to use a function from the apply family.

This would reduce the risk to introduce copy-paste mistakes and when we want to make a change to the syntax at a later point we would only need to make that change once but not in every instance of a copied piece of syntax.

Because we haven’t covered yet how to write functions or how to use loops or apply() (that will follow in Biostatistics II), we’ll copy-paste for now.

Fit the LOESS curves for the continuous covariates:

loess_age <- loess(sqrt(abs(rstandard)) ~ age, data = dat_resid)
loess_WC <- loess(sqrt(abs(rstandard)) ~ WC, data = dat_resid)
loess_creat <- loess(sqrt(abs(rstandard)) ~ creat, data = dat_resid)
loess_albu <- loess(sqrt(abs(rstandard)) ~ albu, data = dat_resid)

Calculate the loess fit and confidence intervals:

loess_fit_age <- predict(loess_age, dat_resid, se = TRUE, level = 0.95)
loess_fit_age$lwr <- loess_fit_age$fit - qnorm(0.975) * loess_fit_age$se.fit
loess_fit_age$upr <- loess_fit_age$fit + qnorm(0.975) * loess_fit_age$se.fit

loess_fit_WC <- predict(loess_WC, dat_resid, se = TRUE, level = 0.95)
loess_fit_WC$lwr <- loess_fit_WC$fit - qnorm(0.975) * loess_fit_WC$se.fit
loess_fit_WC$upr <- loess_fit_WC$fit + qnorm(0.975) * loess_fit_WC$se.fit

loess_fit_creat <- predict(loess_creat, dat_resid, se = TRUE, level = 0.95)
loess_fit_creat$lwr <- loess_fit_creat$fit - qnorm(0.975) * loess_fit_creat$se.fit
loess_fit_creat$upr <- loess_fit_creat$fit + qnorm(0.975) * loess_fit_creat$se.fit

loess_fit_albu <- predict(loess_albu, dat_resid, se = TRUE, level = 0.95)
loess_fit_albu$lwr <- loess_fit_albu$fit - qnorm(0.975) * loess_fit_albu$se.fit
loess_fit_albu$upr <- loess_fit_albu$fit + qnorm(0.975) * loess_fit_albu$se.fit

Add the covariate, convert to a data.frame and sort by the value of the covariate:

loess_fit_age$covar <- dat_resid$age
loess_fit_age <- data.frame(loess_fit_age[c("covar", "fit", "lwr", "upr")])
loess_fit_age <- loess_fit_age[order(loess_fit_age$covar), ]

loess_fit_WC$covar <- dat_resid$WC
loess_fit_WC <- data.frame(loess_fit_WC[c("covar", "fit", "lwr", "upr")])
loess_fit_WC <- loess_fit_WC[order(loess_fit_WC$covar), ]

loess_fit_creat$covar <- dat_resid$creat
loess_fit_creat <- data.frame(loess_fit_creat[c("covar", "fit", "lwr", "upr")])
loess_fit_creat <- loess_fit_creat[order(loess_fit_creat$covar), ]

loess_fit_albu$covar <- dat_resid$albu
loess_fit_albu <- data.frame(loess_fit_albu[c("covar", "fit", "lwr", "upr")])
loess_fit_albu <- loess_fit_albu[order(loess_fit_albu$covar), ]

Create the plots:

par(mfrow = c(2, 3), mar = c(3.5, 3.5, 2, 1), mgp = c(2, 0.6, 0))

# plot for age
plot(sqrt(abs(rstandard)) ~ age, data = dat_resid, col = "grey",
     ylab = "sqrt(abs(standardized residuals))")

matplot(x = loess_fit_age$covar, y = loess_fit_age[, c("fit", "lwr", "upr")],
        lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)


# plot for gender
plot(sqrt(abs(rstandard)) ~ gender, data = dat_resid,
     ylab = "sqrt(abs(standardized residuals))")


# plot for WC
plot(sqrt(abs(rstandard)) ~ WC, data = dat_resid,  col = "grey",
     ylab = "sqrt(abs(standardized residuals))")

matplot(x = loess_fit_WC$covar, y = loess_fit_WC[, c("fit", "lwr", "upr")],
        lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)


# plot for creat
plot(sqrt(abs(rstandard)) ~ creat, data = dat_resid,  col = "grey",
     ylab = "sqrt(abs(standardized residuals))")

matplot(x = loess_fit_creat$covar, y = loess_fit_creat[, c("fit", "lwr", "upr")],
        lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)


# plot for albu
plot(sqrt(abs(rstandard)) ~ albu, data = dat_resid,  col = "grey",
     ylab = "sqrt(abs(standardized residuals))")

matplot(x = loess_fit_albu$covar, y = loess_fit_albu[, c("fit", "lwr", "upr")],
        lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)

We see a clear upwards trend for age and a bit a decreasing line for albu. For creat it is difficult to judge given the skewness of the variable.

We can see the fit for creat in the area with the most data better when we restrict the x-axis by setting the argument xlim:

par(mfrow = c(1, 1))
plot(sqrt(abs(rstandard)) ~ creat, data = dat_resid,  col = "grey",
     ylab = "sqrt(abs(standardized residuals))",
     xlim  = c(0, 3))

matplot(x = loess_fit_creat$covar, y = loess_fit_creat[, c("fit", "lwr", "upr")],
        lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)

It seems that there is no strong relation between the variation of the residuals and creat.

Weighted Least Squares

We will now fit a second model using weighted least squares to see if this will lead to more homoscedastic error terms and maybe also improve the normality of the resulting residuals.

Task 1

Fit a weighted least squares version of lm1 (see slides 19-21 of Model Diagnostics III: Heteroscedasticity).

For calculating the weights, use the same covariates as in lm1.

You need to fit a model to the log-transformed “variance” of the residuals.

You need to provide the weights to lm() via the argument weights.

Solution 1

We first fit a model to the natural logarithm of the squared residuals, using the same covariates as in the original model lm1:

mod_logresid <- update(lm1, formula = log(I(resid^2)) ~ ., data = dat_resid)

Here, I again used update() instead of writing the whole model “by hand”. This requires adjusting the model formula, and replacing the data.

In the model formula, we replace the original response with the logarithm of the squared residuals. The right hand side of the formula stays the same as it was in lm1.

Instead of the nhanes data we now need to use the data that contains the residuals (and, of course, also the covariates, but we have added them to dat_resid in a previous step).

From this model, we extract the fitted values as estimates of the (log-transformed) residual variation (i.e., $\widehat{\log(\hat\varepsilon_i^2)} = \mathbf z_i^\top \boldsymbol{\hat\alpha}$) and calculate the weights as $w = \frac{1}{\exp\left(\widehat{\log(\hat{\varepsilon}_i^2)}\right)}$:

w <- 1 / exp(fitted.values(mod_logresid))

Using these weights, we can then fit the weighted least squares model by updating the original model lm1 with specifying the weights:

lm_weighted <- update(lm1, weights = w)

Task 2

Create plots to check for heteroscedasticity in the weighted least squares model.

Plot the square root of the absolute value of the standardized residuals against the fitted values and against each of the continuous covariates.

Solution 2

We follow the same steps as before.

First, make a data.frame of the standardized residuals, fitted values and covariates, and plot the square root of the absolute standardized residuals against the fitted values

dat_resid_w <- data.frame(
  rstandard = rstandard(lm_weighted),
  fit = fitted.values(lm_weighted),
  nhanes[all.vars(formula(lm_weighted))]
)

scatter.smooth(y = sqrt(abs(dat_resid_w$rstandard)), x = dat_resid_w$fit,
               xlab = "fitted values", ylab = "sqrt(abs(standardized residuals))",
               col = "grey")

The upward trend of the residual variation has disappeared.

Fit the LOESS curves for the continuous covariates:

loess_age_w <- update(loess_age, data = dat_resid_w)
loess_WC_w <- update(loess_WC, data = dat_resid_w)
loess_creat_w <- update(loess_creat, data = dat_resid_w)
loess_albu_w <- update(loess_albu, data = dat_resid_w)

Calculate the loess fit and confidence intervals:

loess_fit_age_w <- predict(loess_age_w, dat_resid_w, se = TRUE, level = 0.95)
loess_fit_age_w$lwr <- loess_fit_age_w$fit - qnorm(0.975) * loess_fit_age_w$se.fit
loess_fit_age_w$upr <- loess_fit_age_w$fit + qnorm(0.975) * loess_fit_age_w$se.fit

loess_fit_WC_w <- predict(loess_WC_w, dat_resid_w, se = TRUE, level = 0.95)
loess_fit_WC_w$lwr <- loess_fit_WC_w$fit - qnorm(0.975) * loess_fit_WC_w$se.fit
loess_fit_WC_w$upr <- loess_fit_WC_w$fit + qnorm(0.975) * loess_fit_WC_w$se.fit

loess_fit_creat_w <- predict(loess_creat_w, dat_resid_w, se = TRUE, level = 0.95)
loess_fit_creat_w$lwr <- loess_fit_creat_w$fit - qnorm(0.975) * loess_fit_creat_w$se.fit
loess_fit_creat_w$upr <- loess_fit_creat_w$fit + qnorm(0.975) * loess_fit_creat_w$se.fit

loess_fit_albu_w <- predict(loess_albu_w, dat_resid_w, se = TRUE, level = 0.95)
loess_fit_albu_w$lwr <- loess_fit_albu_w$fit - qnorm(0.975) * loess_fit_albu_w$se.fit
loess_fit_albu_w$upr <- loess_fit_albu_w$fit + qnorm(0.975) * loess_fit_albu_w$se.fit

Add the covariate, convert to a data.frame and sort by the value of the covariate:

loess_fit_age_w$covar <- dat_resid_w$age
loess_fit_age_w <- data.frame(loess_fit_age_w[c("covar", "fit", "lwr", "upr")])
loess_fit_age_w <- loess_fit_age_w[order(loess_fit_age_w$covar), ]

loess_fit_WC_w$covar <- dat_resid_w$WC
loess_fit_WC_w <- data.frame(loess_fit_WC_w[c("covar", "fit", "lwr", "upr")])
loess_fit_WC_w <- loess_fit_WC_w[order(loess_fit_WC_w$covar), ]

loess_fit_creat_w$covar <- dat_resid_w$creat
loess_fit_creat_w <- data.frame(loess_fit_creat_w[c("covar", "fit", "lwr", "upr")])
loess_fit_creat_w <- loess_fit_creat_w[order(loess_fit_creat_w$covar), ]

loess_fit_albu_w$covar <- dat_resid_w$albu
loess_fit_albu_w <- data.frame(loess_fit_albu_w[c("covar", "fit", "lwr", "upr")])
loess_fit_albu_w <- loess_fit_albu_w[order(loess_fit_albu_w$covar), ]

Create the plots:

par(mfrow = c(2, 3), mar = c(3.5, 3.5, 2, 1), mgp = c(2, 0.6, 0))

plot(sqrt(abs(rstandard)) ~ age, data = dat_resid_w,  col = "grey",
     ylab = "sqrt(abs(standardized residuals))")

matplot(x = loess_fit_age_w$covar, y = loess_fit_age_w[, c("fit", "lwr", "upr")],
        lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)



plot(sqrt(abs(rstandard)) ~ gender, data = dat_resid_w, 
     ylab = "sqrt(abs(standardized residuals))")



plot(sqrt(abs(rstandard)) ~ WC, data = dat_resid_w,  col = "grey",
     ylab = "sqrt(abs(standardized residuals))")

matplot(x = loess_fit_WC_w$covar, y = loess_fit_WC_w[, c("fit", "lwr", "upr")],
        lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)



plot(sqrt(abs(rstandard)) ~ creat, data = dat_resid_w,  col = "grey",
     ylab = "sqrt(abs(standardized residuals))")

matplot(x = loess_fit_creat_w$covar, y = loess_fit_creat_w[, c("fit", "lwr", "upr")],
        lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)



plot(sqrt(abs(rstandard)) ~ albu, data = dat_resid_w,  col = "grey",
     ylab = "sqrt(abs(standardized residuals))")

matplot(x = loess_fit_albu_w$covar, y = loess_fit_albu_w[, c("fit", "lwr", "upr")],
        lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)

There seems to be less of a trend in these plots compared to the corresponding plots from the ordinary least squares model.

We can visualize the difference between the LOESS fit from lm1 and lm_weighted better by plotting both in the same figure, for example, for age and albu:

par(mfrow = c(1, 2), mar = c(3.5, 3.5, 2, 1), mgp = c(2, 0.6, 0))

plot(sqrt(abs(rstandard)) ~ age, data = dat_resid_w,  col = "grey",
     ylab = "sqrt(abs(standardized residuals))")

matplot(x = loess_fit_age_w$covar, y = loess_fit_age_w[, c("fit", "lwr", "upr")],
        lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)

matplot(x = loess_fit_age$covar, y = loess_fit_age[, c("fit", "lwr", "upr")],
        lty = c(1, 2, 2), type = "l", col = "red", add = TRUE)
legend("topright", lty = c(1, 1), col = c(1, 2), legend = c("weighted", "OLS"),
       bty = "n")


plot(sqrt(abs(rstandard)) ~ albu, data = dat_resid_w,  col = "grey",
     ylab = "sqrt(abs(standardized residuals))")

matplot(x = loess_fit_albu_w$covar, y = loess_fit_albu_w[, c("fit", "lwr", "upr")],
        lty = c(1, 2, 2), type = "l", col = "black", add = TRUE)

matplot(x = loess_fit_albu$covar, y = loess_fit_albu[, c("fit", "lwr", "upr")],
        lty = c(1, 2, 2), type = "l", col = "red", add = TRUE)

legend("topright", lty = c(1, 1), col = c(1, 2), legend = c("weighted", "OLS"),
       bty = "n")

The function legend() adds a legend. Here, we place it in the top right corner. We need to specify what this legend should show ourselves, i.e., two solid lines that have colours 1 (black) and 2 (red), which are labelled “weighted” and “OLS”. The argument bty specifies the box type (i.e., whether a line should be drawn around the legend) and we set it to "n", i.e,. no box.

loess_mod_w <- loess(sqrt(abs(rstandard)) ~ fit, data = dat_resid_w)

loess_fit_w <- predict(loess_mod_w, dat_resid_w, se = TRUE, level = 0.95)
loess_fit_w$lwr <- loess_fit_w$fit - qnorm(0.975) * loess_fit_w$se.fit
loess_fit_w$upr <- loess_fit_w$fit + qnorm(0.975) * loess_fit_w$se.fit


loess_fit_w$fit_lm1_w <- dat_resid_w$fit
loess_fit_w <- data.frame(loess_fit_w[c("fit_lm1_w", "fit", "lwr", "upr")])
loess_fit_w <- loess_fit_w[order(loess_fit_w$fit_lm1_w), ]

plot(sqrt(abs(rstandard)) ~ fit, data = dat_resid_w, 
     ylab = expression(sqrt(abs(~standardized~residuals~""))),
     xlab = "fitted values", col = "gray")

matplot(x = loess_fit$fit_lm1, y = loess_fit[, c("fit", "lwr", "upr")],
        type = "l", lty = c(1, 2, 2), col = "red", add = TRUE)

matplot(x = loess_fit_w$fit_lm1_w, y = loess_fit_w[, c("fit", "lwr", "upr")],
        type = "l", lty = c(1, 2, 2), col = "black", add = TRUE)

legend("topleft", lty = c(1, 1), col = c(1, 2), legend = c("weighted", "OLS"),
       bty = "n")

Task 3

Check if the residuals in the weighted least squares model are now closer to a normal distribution by comparing the QQ-plot from the original model to the QQ-plot for the weighted model (side by side, not within the same plot).

To plot QQ-plots with confidence envelope, you can use the function car::qqPlot().

Solution 3

To print the two plots next to each other, we can set the graphical parameters (par()) option mfrow. The function qqPlot() from the car package only requires us to provide the vector of residuals. To make it easily clear which plot is which, we can add an informative title (main):

par(mfrow = c(1, 2))
car::qqPlot(rstandard(lm1),
            main = "ordinary least squares")
car::qqPlot(rstandard(lm_weighted), ylim = c(-3.7, 4.7),
            main = "weighted least squares")

I’ve also specified the argument ylim to extend the range of the y-axis because the lower end of the confidence envelope was cut-off in the right figure.

The upper quantiles of the standardized residuals have become a bit more similar to the quantiles of the standard normal distribution in the weighted least squares model (but the upper tail is still deviating).

Outliers & Leverage

Task 1

How can we identify potential outliers? Which type of residuals do we need to use for this?

Solution 1

From the section Model Diagnostics IV: Outliers & Influential Observations (slides 9/10) we know that we can identify potential outliers by comparing the studentized residuals to the $1-\alpha/2$ and $\alpha/2$ quantiles of a $t$-distribution with $n-p-1$ degrees of freedom.

The reasoning is that if the model is correctly specified, the studentized residuals should have such a $t$-distribution and values outside these quantiles are relatively unlikely if the studentized residuals do indeed come from this distribution.

Task 2

Plot the studentized residuals for lm_weighted against the fitted values and add horizontal lines for the 2.5% and 97.5% quantiles of the corresponding $t$-distribution.

You can obtain quantiles from the $t$-distribution via the function qt().

Horizontal lines can be added with the function abline().

If you find some potential outliers, think about how you could investigate those observations further.

Solution 2

If you haven’t already done so earlier, add the studentized residuals for lm_weighted to dat_resid_w:

dat_resid_w$rstudent <- rstudent(lm_weighted)

We can now plot the studentized residuals against the fitted values and add horizontal lines indicating the $1 - \alpha/2$ and $\alpha/2$ quantiles of the $t$-distribution with $n - p - 1$ degrees of freedom (and $\alpha = 0.05$).

plot(rstudent ~ fit, data = dat_resid_w,
     xlab = "fitted values", ylab = "studentized residuals")
abline(h = qt(c(0.025, 0.975), df = lm_weighted$df.residual), lty = 2)

We can obtain the quantiles of the $t$-distribution using the function qt(), for which we specify a vector of probabilities, i.e., c(0.025, 0.975) and the degrees of freedom, $n - p - 1$. The degrees of freedom are returned in the fitted model object as df.residual.

The horizontal lines are drawn with the function abline(). The argument h indicates at which position(s) on the y-axis (a) horizontal line(s) should be drawn. lty = 2 specifies that the line type should be “dashed”.

Several observations are outside the range that is likely to be observed if the studentized residuals would really follow a $t$-distribution. We should investigate these observations but there is no need to immediately exclude them. Keep in mind that these more extreme values for the studentized residuals are only less likely under the $t$-distribution, but not impossible.

There are too many of these potential outliers to investigate each case separately by hand. One way to get a bit more information is to visualize the situation.

A possibility to visualize the observations is to look at the distribution of each variable separately, but there might be observations that have relatively normal values in each variable, but unusual combinations. We can get a bit more information when we plot scatter plots of pairs of variables.

To be able to highlight the observations we have identified as potential outliers in a scatter plot, we add a new variable to dat_resid_w that indicates if the studentized residual is more extreme than the quantiles of the $t$-distribution:

dat_resid_w$outlier <- abs(dat_resid_w$rstudent) > qt(0.975, df = lm_weighted$df.residual)

We can obtain a matrix of all pairwise scatter plots using the function pairs():

pairs(dat_resid_w[c("SBP", "gender", "age", "WC", "creat", "albu")],
      col = c("grey", "red")[as.numeric(dat_resid_w$outlier) + 1])

The variable outlier in dat_resid_w is a logical vector (with values TRUE when the observation is a potential outlier, and FALSE when it is not).

To plot the observations with different colours we need to assign a colour to each observation. We do that by converting the logical vector to a numeric vector, but then it contains values 0 and 1, which is why we add + 1.

Having a vector with values 1 and 2 allows us to use it as index for a vector of colours, c("grey", "red").

Most of the observations that we identified seem to have a systolic blood pressure above 140. Depending on the aim of our analysis, we should consider whether we want to have patients with hypertension in the analysis or if we should exclude them because we only want to draw conclusions for patients without hypertension. We should also consider to perform sensitivity analyses if hypertension is a relevant issue in the application at hand.

From a statistical point of view, the values are not that extreme that we would not expect any problems with regards to the stability of the model, and so we will keep these observations in the analysis for now.

Using the indicator variable outlier, we could now also create a subset of the data containing only these observations, count them, investigate them further, …

Task 3

We now want to identify observations with high leverage.

How is the leverage defined?
Remember (or look up) the formula to calculate the leverage values.
Should we calculate the leverage values in lm1 or lm_weighted?
Is there a cut-off for how large the leverage has to be for an observation to have “high leverage”?

Solution 3

The leverage values are the diagonal values of the hat matrix (Section Model Diagnostics IV: Outliers & Influential Observations, slide 5). They indicate how far the covariate values of an observation are from the sample mean and are a measure of how much the observed response for a particular subject influences the corresponding fitted value.

For a standard linear regression model the hat matrix can be calculated from the design matrix $\mathbf X$ as (see Model Diagnostics I: Residuals, slide 3) \[\mathbf H = \mathbf X(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top\] Because there was some heteroscedasticity in lm1, and the weighted least squares approach used in lm_weighted seemed to have reduced that, we would prefer to continue to work with lm_weighted. But, because this model was not estimated with the ordinary least squares estimator, the calculation of the hat matrix has to take into account the weights that were used to fit the model.

For the weighted least squares model, the regression coefficients are calculated as \[\boldsymbol{\hat\beta} = (\mathbf X^\top \mathbf W \mathbf X)^{-1} \mathbf X^\top \mathbf W \mathbf {y},\] where $\mathbf W$ is a weight matrix, for example $w\mathbf I$, and, therefore, the hat matrix is \[\mathbf H = \mathbf X\left(\mathbf X^\top \mathbf W \mathbf X\right)^{-1} \mathbf X^\top \mathbf W\]

This formula for the weighted least squares estimator (and corresponding hat matrix) is not in the lecture slides and not relevant for the exam.

But because the weighted least squares model had the better fit and working with the formulas for the weighted model is not much more difficult than working with the corresponding formulas for an OLS model, we will work with the weighted version here.

There is no universal cut-off for high leverage values, but we could use $2 (p + 1)/n$ to give us some guidance (Section Model Diagnostics IV: Outliers & Influential Observations, slide 11).

Task 4

Calculate the hat matrix $\mathbf H$ for lm_weighted and identify high leverage points.

You can get the design matrix from a fitted model object using the function model.matrix().

The function diag() can be used for multiple purposes:

when used with a matrix: to extract the diagonal elements of the matrix,
when used with a vector: to create a diagonal matrix that has the vector as its diagonal,
when used with an integer: to create the identity matrix (that has 1’s on the diagonal and is 0 everywhere else) of dimension given by the integer.

Remember: %*% is used for matrix multiplication, t() to transpose a vector or a matrix, and you can get the inverse of a matrix with the function solve().

Did you get the following warning message?

## Warning in w * diag(nrow(X)): longer object length is not a multiple of shorter object length

This means that the dimensions of the weights w and deign matrix X don’t match.

Explore how large/long your objects are and have a look at the weights to figure out what the issue might be.

The function na.omit() can help you to resolve the problem.

Solution 4

We can obtain the design matrix $\mathbf X$ from the fitted model object:

X <- model.matrix(lm_weighted)

With this matrix we can now calculate the hat matrix and extract its diagonal elements using diag():

W <- na.omit(w) * diag(nrow(X))

The vector of weights, w, that we calculated previously, has length 2765 and contains some NA values (because the fitted values it was calculated from had these NA values). The matrix X, however, only has 2360 rows and does not contain data from incomplete observations.

To get a vector without the NA values we use the function na.omit().

diag(nrow(X)) creates an identity matrix , e.g.,

diag(5)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    1    0    0    0
## [3,]    0    0    1    0    0
## [4,]    0    0    0    1    0
## [5,]    0    0    0    0    1

(but with dimensions equal to the number of rows in X and not 5, but that would be too large to display here), so that the resulting weights matrix W has the weights on the diagonal and is zero everywhere else, e.g.,

w[1:5] * diag(5)

##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] 0.03614327 0.00000000 0.00000000 0.00000000 0.00000000
## [2,] 0.00000000 0.01688889 0.00000000 0.00000000 0.00000000
## [3,] 0.00000000 0.00000000 0.03268759 0.00000000 0.00000000
## [4,] 0.00000000 0.00000000 0.00000000 0.02350066 0.00000000
## [5,] 0.00000000 0.00000000 0.00000000 0.00000000 0.01568025

(but then much larger…)

Using this weights matrix we can then calculate the hat matrix and its diagonal:

H <- X %*% solve(t(X) %*% W %*% X) %*% t(X) %*% W
h <- diag(H)

We can now plot the leverage values and add a horizontal line to the plot to indicate the reference line:

plot(h)
abline(h = 2 * ncol(X)/nrow(X), lty = 2, col = "red")

There are a few observations above the line, but none that are really standing out. As before, we can create a new variable in dat_resid that we can then use to highlight those observations in a scatter plot:

dat_resid_w$high_lev[!is.na(dat_resid_w$fit)] <- h > 2 * ncol(X)/nrow(X)

pairs(dat_resid_w[c("SBP", "gender", "age", "WC", "creat", "albu")],
      col = c("grey", "red")[as.numeric(dat_resid_w$high_lev) + 1])

As expected, the highlighted dots are more at the edge of the observed values. For most variables, there are no extreme values we should be concerned about, only for creat. To further inspect (some) of the identified high leverage observations, more knowledge on the particular (clinical) application/study would be necessary.

We could further investigate how the higher values of SBP and creat are related to the residuals, e.g., whether they are responsible for the deviation from normality in the upper tail that we see in the QQ-plot.

Task 5

It is also possible to obtain the hat values from a fitted linear model using the function hatvalues().

Compare the hat values you just calculated with the values you get using hatvalues().

Solution 5

To extract the hat values from lm_weighted:

hatvalues(lm_weighted)

##            1            2            3            4            5            6            7            8 
## 0.0029673072 0.0026032854 0.0040234376 0.0030324754 0.0019482885 0.0014668865 0.0039065911 0.0039496513 
##            9           10           11           12           13           14           15           16 
## 0.0000000000 0.0021375842 0.0010374903 0.0026008281 0.0024350893 0.0016604297 0.0022520324 0.0023776542 
##           17           18           19           20           21           22           23           24 
## 0.0015994164 0.0014601304 0.0024934888 0.0017250689 0.0015578032 0.0040213386 0.0028244693 0.0015483757 
##           25           26           27           28           29           30           31           32 
## 0.0016432862 0.0025492124 0.0017022925 0.0014401198 0.0000000000 0.0018833256 0.0010927249 0.0437888411  
## [...]

The first thing we might notice when we try to compare the two sets of values by taking the differences or plotting them against each other is that they have different length.

length(h)

## [1] 2360

length(hatvalues(lm_weighted))

## [1] 2765

hatvalues() includes values for the incomplete cases, but our own calculated hat values do not. We “lost” those observations when we used model.matrix(), which automatically excludes rows that contain missing values.

When we inspect the values from hatvalues() that correspond to incomplete observations (which have value NA in the fitted values and residuals in dat_resid_w), we see that they all have the value 0:

head(hatvalues(lm_weighted)[is.na(dat_resid_w$fit)], 10)

##  9 29 35 43 44 48 51 62 81 86 
##  0  0  0  0  0  0  0  0  0  0

The function is.na() checks if a value is missing and returns a logical vector with TRUE if the value is NA and FALSE if it is not NA:

is.na(dat_resid_w$fit)

##    [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [...]

Using this logical value we can create a subset that only contains the elements of a vector, list, matrix or data.frame for which the logical vector has value TRUE.

And because this would create a very long output, we tell to only show us the first 10 elements by using the function head() with argument n set to 10.

To compare the two vectors, we take the subset for hatvalues(lm_weighted) that corresponds to complete observations:

summary(h - hatvalues(lm_weighted)[!is.na(dat_resid_w$fit)])

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -2.297e-15  2.515e-17  1.393e-16  3.748e-16  4.567e-16  7.171e-15

Now, we are interested in selecting only those elements of the hat values that correspond to elements of the residuals or fitted values that are not missing.

Since is.na() returns TRUE for missing values, we use ! to invert this logical vector, i.e., a TRUE becomes a FALSE and a FALSE becomes a TRUE.

The differences are all very close to zero, demonstrating that except for some small differences due to the way the hat values were calculated, our results are the same as the numbers calculated by the function.

Influential Values

Task 1

We can obtain a whole range of influence measures for a fitted model object using the function influence.measures(), or separate influence measures via dffits(), dfbeta(), dfbetas(), cooks.distance().

Calculate the influence measures for lm_weighted using the separate functions and inspect them via plots.

You can add reference lines based on the rules of thumb given in section Model Diagnostics IV: Outliers & Influential Observations” on slides 15-20.

To obtain quantiles of the $F$-distribution, use the function qf().

Solution 1

Cook’s distance:

There are multiple rules of thumb for which values of Cook’s distance should be considered large:

$D_i > F_{0.5}(p, n - p - 1}$
$D_i > \frac{4}{n - p - 1}$
$D_i > 1$

plot(cooks.distance(lm_weighted))

abline(h = c(1,
             qf(0.5, ncol(X) - 1, lm_weighted$df.residual),
             4/lm_weighted$df.residual),
       lty  = 2, col = "red")

Since Cook’s distance is $<0.052$ for all observations, only the reference line for $\frac{4}{n - p - 1}$ is visible.

There are many values larger than this reference line, but all observations are much smaller than the other two lines (at 1 and at 0.8705419).

There is one observations that clearly stands out from the rest (the one with value > 0.05). We could investigate this particular subject further since it has much more influence on the model fit than all other observations.

We can set the range of values to be shown on the y-axis with the help of the argument ylim:

plot(cooks.distance(lm_weighted), ylim = c(0, 1))
abline(h = c(1,
             qf(0.5, ncol(X) - 1, lm_weighted$df.residual),
             4/lm_weighted$df.residual),
       lty  = 2, col = "red")

DFFITS:

Reference lines for DFFITS can be drawn at $\pm 2 \sqrt{\frac{p + 2}{n-p-2}}$.

plot(dffits(lm_weighted), ylim = c(-0.5, 0.65))

abline(h = c(-2, 2) * sqrt((ncol(X) + 1)/(lm_weighted$df.residual - 1)),
       lty = 2, col = "red")

There are multiple observations outside the reference lines. When investigating those cases we might start with the most extreme ones (one with a value close to 0.6 and three with a value close to -0.3).

DFBETAS:

Reference lines for DFBETAS can be drawn at $\pm 2 /\sqrt{n}$. This measure of influence is covariate-specific, i.e., we need to plot it separately for each of the covariates:

DFBETAS <- dfbetas(lm_weighted)

par(mfrow = c(2, 3), mar = c(3.5, 3.5, 2, 1), mgp = c(2, 0.6, 0))
plot(DFBETAS[, 1], main = colnames(DFBETAS)[1])
abline(h = c(-2, 2) / sqrt(nrow(X)), lty = 2, col = "red")

plot(DFBETAS[, 2], main = colnames(DFBETAS)[2])
abline(h = c(-2, 2) / sqrt(nrow(X)), lty = 2, col = "red")

plot(DFBETAS[, 3], main = colnames(DFBETAS)[3])
abline(h = c(-2, 2) / sqrt(nrow(X)), lty = 2, col = "red")

plot(DFBETAS[, 4], main = colnames(DFBETAS)[4])
abline(h = c(-2, 2) / sqrt(nrow(X)), lty = 2, col = "red")

plot(DFBETAS[, 5], main = colnames(DFBETAS)[5])
abline(h = c(-2, 2) / sqrt(nrow(X)), lty = 2, col = "red")

plot(DFBETAS[, 6], main = colnames(DFBETAS)[6])
abline(h = c(-2, 2) / sqrt(nrow(X)), lty = 2, col = "red")

Task 2

Calculate the influence measures using infuence.measures() and investigate the resulting object.

Solution 2

Calculate the influence measures:

infl <- influence.measures(lm_weighted)

Investigate what type of object infl is and obtain its structure so that we can see how we can work with this object:

class(infl)

## [1] "infl"

str(infl)

## List of 3
##  $ infmat: num [1:2765, 1:10] 0.00216 0.00864 0.04857 0.05175 0.04474 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2765] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:10] "dfb.1_" "dfb.gndr" "dfb.age" "dfb.WC" ...
##  $ is.inf: logi [1:2765, 1:10] FALSE FALSE FALSE FALSE FALSE FALSE ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2765] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:10] "dfb.1_" "dfb.gndr" "dfb.age" "dfb.WC" ...
##  $ call  : language lm(formula = SBP ~ gender + age + WC + creat + albu, data = nhanes, weights = w, na.action = "na.exclude")
##  - attr(*, "class")= chr "infl"

We can also get some more information by looking at the help file ? influence.measures (first paragraph of the Details section).

influence.measures() returns a list with 3 elements.

The element infmat is a matrix containing the different influence measures:

head(infl$infmat)

##         dfb.1_      dfb.gndr       dfb.age       dfb.WC      dfb.cret     dfb.albu       dffit     cov.r
## 1 0.0021606757  0.0047689097  0.0055777801  0.003597802 -0.0005995998 -0.006067404 -0.01692031 1.0052898
## 2 0.0086380136  0.0009500429  0.0004547626 -0.006880392  0.0021805136 -0.008590003  0.01156868 1.0050380
## 3 0.0485668125 -0.0415907741 -0.0224686608 -0.052947175 -0.0038479099 -0.030203698  0.07800047 1.0027455
## 4 0.0517479672 -0.0444103127  0.0020164298 -0.061156462  0.0071348600 -0.041421772 -0.07965685 1.0002695
## 5 0.0447438334 -0.0386745638  0.0211528257 -0.043705240 -0.0012308017 -0.037784900  0.06403226 0.9991465
## 6 0.0008270652  0.0110187106 -0.0244282322 -0.001451829  0.0034179197  0.001341547 -0.03067003 1.0023877
##         cook.d         hat
## 1 4.773448e-05 0.002967307
## 2 2.231474e-05 0.002603285
## 3 1.013794e-03 0.004023438
## 4 1.057048e-03 0.003032475
## 5 6.830357e-04 0.001948289
## 6 1.567991e-04 0.001466887

And the element is.inf is a matrix with information about whether the elements in infmat are above the thresholds that uses for the different influence measures.

The thresholds that uses are:

$\text{DFBETAS} > 1$
$\text{DFFIT} > 3\sqrt{(p + 1)/(n - p - 1)}$
Cook’s distance: $D_i > F_{0.5}(p + 1, n - p - 1)$
$h_{ii} > 3 (p + 1)/n$

(This information is not written in the help file.)

head(infl$is.inf)

##   dfb.1_ dfb.gndr dfb.age dfb.WC dfb.cret dfb.albu dffit cov.r cook.d   hat
## 1  FALSE    FALSE   FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
## 2  FALSE    FALSE   FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
## 3  FALSE    FALSE   FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
## 4  FALSE    FALSE   FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
## 5  FALSE    FALSE   FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
## 6  FALSE    FALSE   FALSE  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE

We can get an overview for those observations for which at least one of the influence measures was above the threshold:

summary(infl)

## Potentially influential observations of
##   lm(formula = SBP ~ gender + age + WC + creat + albu, data = nhanes,      weights = w, na.action = "na.exclude") :
## 
##      dfb.1_ dfb.gndr dfb.age dfb.WC dfb.cret dfb.albu dffit   cov.r   cook.d hat    
## 32    0.01  -0.09     0.04    0.04  -0.31     0.03    -0.32_*  1.04_*  0.02   0.04_*
## 59   -0.01   0.00     0.00   -0.01   0.00     0.01     0.02    1.01_*  0.00   0.01  
## 61    0.00   0.00     0.00    0.00  -0.01     0.00    -0.01    1.01_*  0.00   0.01  
## 71   -0.02   0.03     0.07    0.02   0.00     0.00     0.10    0.99_*  0.00   0.00  
## 75   -0.01   0.02     0.01    0.00   0.07     0.00     0.07    1.01_*  0.00   0.01_*
## 113  -0.03   0.03    -0.02    0.06   0.00     0.01     0.07    1.01_*  0.00   0.01  
## 126   0.00   0.00     0.01   -0.01   0.00     0.00    -0.02    1.01_*  0.00   0.01  
## 133   0.00   0.02     0.06   -0.10  -0.01     0.03    -0.13    1.01_*  0.00   0.01_*
## 152   0.04  -0.07     0.06   -0.06  -0.06    -0.02     0.11    0.99_*  0.00   0.00  
## 157  -0.01   0.06     0.10   -0.03   0.07    -0.01     0.16_*  0.96_*  0.00   0.00  
## 189   0.00  -0.01     0.01   -0.03   0.00     0.01    -0.03    1.01_*  0.00   0.01_*
## 215   0.00   0.00     0.00    0.00   0.00     0.00     0.00    1.01_*  0.00   0.01  
## 224   0.09  -0.09     0.00   -0.01  -0.01    -0.10    -0.13    0.99_*  0.00   0.00  
## 275   0.07  -0.07     0.03   -0.04  -0.03    -0.07     0.10    0.99_*  0.00   0.00  
## 314  -0.08   0.10     0.06   -0.05  -0.04     0.12     0.20_*  0.96_*  0.01   0.00  
## 317  -0.12   0.10    -0.03    0.01   0.02     0.15     0.18_*  1.01    0.01   0.01_*
## 328   0.00   0.00     0.00    0.00   0.00     0.00     0.00    1.01_*  0.00   0.01  
## 336   0.02   0.03     0.01    0.03   0.01    -0.04     0.09    0.99_*  0.00   0.00  
## 339  -0.03   0.03    -0.04    0.09   0.00     0.00     0.10    1.01    0.00   0.01_*
## 354  -0.04   0.23    -0.14    0.04   0.54    -0.08     0.56_*  1.02_*  0.05   0.03_*
## 369   0.10  -0.09     0.00    0.02  -0.05    -0.11     0.15_*  0.99    0.00   0.00  
## 371   0.02   0.00    -0.02    0.00   0.00    -0.02     0.03    1.01_*  0.00   0.01  
## 390   0.00   0.00     0.00    0.00   0.00     0.00     0.00    1.01_*  0.00   0.01  
## 408   0.00   0.00     0.00    0.00  -0.01     0.00    -0.01    1.07_*  0.00   0.06_*
## 410   0.02  -0.02     0.01    0.00   0.00    -0.03    -0.04    1.01_*  0.00   0.01_*
## 421  -0.02   0.03     0.08   -0.03   0.01     0.03     0.10    0.99_*  0.00   0.00  
## 422   0.02  -0.03     0.00    0.01   0.04    -0.04     0.10    0.99_*  0.00   0.00  
## 436   0.02  -0.07     0.00    0.04  -0.04    -0.03     0.11    0.98_*  0.00   0.00  
## 452  -0.02  -0.02    -0.04    0.06   0.04     0.00     0.11    0.99_*  0.00   0.00  
## 453   0.07   0.01     0.08   -0.08   0.02    -0.08     0.14    0.98_*  0.00   0.00  
## 471  -0.04  -0.04     0.12   -0.03   0.01     0.05     0.16_*  0.96_*  0.00   0.00  
## 488  -0.03   0.00     0.02    0.00   0.01     0.03    -0.04    1.01_*  0.00   0.01_*
## 499  -0.08   0.11    -0.12    0.14   0.06     0.05     0.20_*  0.99    0.01   0.01  
## 509  -0.07   0.02     0.02    0.01   0.02     0.08    -0.09    1.01_*  0.00   0.01_*
## 510  -0.07  -0.03    -0.08    0.10  -0.03     0.07     0.16_*  1.00    0.00   0.01  
## 515  -0.02   0.00    -0.02    0.04  -0.01     0.02     0.05    1.01_*  0.00   0.01_*
## 522  -0.02   0.00    -0.02    0.04   0.00     0.01     0.04    1.01_*  0.00   0.01_*
## 540   0.00   0.00     0.00    0.00  -0.01     0.00    -0.01    1.01_*  0.00   0.01_*
## 555  -0.01   0.00     0.00    0.00   0.01     0.01     0.01    1.01_*  0.00   0.01  
## 573  -0.12   0.00    -0.05    0.03   0.00     0.15     0.21_*  1.00    0.01   0.01  
## 607  -0.03   0.06     0.06    0.01   0.03     0.01    -0.12    0.99_*  0.00   0.00  
## 617   0.07  -0.08     0.11   -0.09  -0.03    -0.05     0.15    0.98_*  0.00   0.00  
## 621   0.01   0.00     0.00   -0.01   0.00    -0.01    -0.01    1.01_*  0.00   0.01  
## 660   0.00   0.00     0.00    0.00   0.00     0.00    -0.01    1.01_*  0.00   0.01  
## 782   0.00  -0.01     0.00    0.00  -0.02     0.00    -0.02    1.07_*  0.00   0.06_*
## 785   0.08  -0.13     0.06   -0.03  -0.35    -0.01    -0.35_*  1.11_*  0.02   0.10_*
## 804  -0.01   0.00     0.00    0.00   0.00     0.01    -0.01    1.02_*  0.00   0.01_*
## 819  -0.03   0.00     0.01    0.00   0.00     0.03    -0.04    1.01_*  0.00   0.01  
## 825  -0.01   0.05    -0.03    0.03   0.02     0.00    -0.07    0.99_*  0.00   0.00  
## 830   0.04   0.00    -0.01   -0.01  -0.01    -0.04     0.04    1.01_*  0.00   0.01  
## 858   0.01   0.00     0.00    0.00   0.00     0.00     0.01    1.01_*  0.00   0.01  
## 887   0.00   0.03     0.02    0.04  -0.02    -0.02     0.09    0.99_*  0.00   0.00  
## 888   0.00   0.03     0.04    0.01   0.00    -0.01     0.08    0.99_*  0.00   0.00  
## 911  -0.08  -0.02     0.06    0.03  -0.02     0.08     0.13    0.98_*  0.00   0.00  
## 912  -0.01   0.01     0.00    0.01   0.00     0.01    -0.01    1.01_*  0.00   0.01  
## 936  -0.01   0.05     0.11   -0.07  -0.03     0.04     0.16_*  0.96_*  0.00   0.00  
## 972   0.05   0.02     0.07   -0.08   0.00    -0.05     0.12    0.98_*  0.00   0.00  
## 974   0.00   0.02     0.02   -0.02   0.02     0.00    -0.04    1.01_*  0.00   0.01  
## 995  -0.01   0.06    -0.02    0.00   0.14    -0.02     0.14    1.02_*  0.00   0.02_*
## 1013 -0.02   0.03     0.09   -0.03  -0.05     0.04     0.12    0.99_*  0.00   0.00  
## 1015 -0.08   0.16     0.06   -0.01   0.29     0.02     0.32_*  0.96_*  0.02   0.01  
## 1054  0.04  -0.01    -0.01   -0.01  -0.01    -0.04     0.04    1.01_*  0.00   0.01_*
## 1072  0.00   0.00     0.00    0.00   0.00     0.00     0.00    1.01_*  0.00   0.01  
## 1084  0.00  -0.02     0.04    0.00   0.03    -0.01     0.08    0.99_*  0.00   0.00  
## 1098  0.01  -0.01     0.00   -0.02   0.00    -0.01    -0.02    1.01_*  0.00   0.01  
## 1117  0.00   0.00     0.01   -0.01  -0.01     0.00    -0.01    1.01_*  0.00   0.01  
## 1118  0.02   0.04     0.04   -0.05   0.02    -0.01     0.10    0.98_*  0.00   0.00  
## 1137  0.05  -0.05     0.01    0.02   0.00    -0.07    -0.10    1.01    0.00   0.01_*
## 1166  0.00  -0.05     0.05   -0.02  -0.04     0.01     0.09    0.99_*  0.00   0.00  
## 1196 -0.01   0.01    -0.01    0.02   0.00     0.00     0.02    1.01_*  0.00   0.01  
## 1203 -0.05   0.10     0.01   -0.02   0.01     0.07     0.15    0.97_*  0.00   0.00  
## 1214 -0.02  -0.04     0.01    0.02   0.02     0.01    -0.09    0.99_*  0.00   0.00  
## 1215  0.01   0.02     0.07   -0.04   0.00     0.00     0.09    0.99_*  0.00   0.00  
## 1221  0.03   0.02    -0.01    0.06  -0.10    -0.03     0.17_*  0.97_*  0.00   0.00  
## 1263  0.01  -0.05     0.04    0.02  -0.01    -0.02     0.09    0.99_*  0.00   0.00  
## 1309 -0.02  -0.01     0.02    0.01  -0.16     0.05    -0.18_*  1.00    0.01   0.01_*
## 1312 -0.07   0.07     0.07    0.02  -0.03     0.07     0.13    0.98_*  0.00   0.00  
## 1314 -0.02  -0.04    -0.03    0.02  -0.01     0.03     0.09    0.99_*  0.00   0.00  
## 1328 -0.10   0.12    -0.01    0.23   0.05     0.02     0.29_*  0.95_*  0.01   0.00  
## 1351 -0.03   0.02    -0.01    0.04   0.01     0.03     0.05    1.01_*  0.00   0.01_*
## 1361 -0.03   0.10    -0.02   -0.03   0.24    -0.01     0.25_*  1.03_*  0.01   0.03_*
## 1375  0.08  -0.11     0.07   -0.03  -0.03    -0.09    -0.16_*  0.99    0.00   0.00  
## 1381 -0.02   0.00     0.01    0.00   0.00     0.02    -0.02    1.01_*  0.00   0.01  
## 1387  0.01   0.00     0.01   -0.02   0.00    -0.01    -0.03    1.01_*  0.00   0.01_*
## 1410 -0.04   0.04     0.07   -0.01  -0.01     0.04     0.09    0.99_*  0.00   0.00  
## 1451 -0.01  -0.04     0.02    0.03  -0.01     0.01     0.08    0.99_*  0.00   0.00  
## 1466  0.00   0.02     0.03    0.01  -0.06     0.01     0.09    0.99_*  0.00   0.00  
## 1483 -0.02   0.03    -0.03    0.05   0.02     0.00     0.06    1.01_*  0.00   0.01_*
## 1514  0.00  -0.03     0.05    0.00   0.03     0.00     0.09    0.99_*  0.00   0.00  
## 1517  0.00  -0.01     0.02   -0.03   0.01     0.01    -0.04    1.01_*  0.00   0.01  
## 1540 -0.03   0.05    -0.07    0.05   0.02     0.02    -0.10    0.99_*  0.00   0.00  
## 1571 -0.03   0.00     0.01    0.01   0.00     0.04    -0.04    1.01_*  0.00   0.01  
## 1572 -0.05  -0.01     0.07   -0.06   0.02     0.08    -0.14    1.01    0.00   0.01_*
## 1573  0.11  -0.08    -0.03   -0.01   0.05    -0.14    -0.17_*  0.99    0.01   0.01  
## 1575 -0.05   0.01     0.01    0.00   0.00     0.06    -0.06    1.01_*  0.00   0.01  
## 1592  0.01   0.01    -0.03    0.03   0.01    -0.02     0.05    1.01_*  0.00   0.01  
## 1604  0.01   0.00     0.02   -0.03   0.00     0.00    -0.03    1.01_*  0.00   0.01_*
## 1615  0.03   0.01     0.11   -0.04  -0.03    -0.03     0.14    0.97_*  0.00   0.00  
## 1639 -0.03   0.03    -0.05    0.14  -0.03     0.00     0.16_*  1.02_*  0.00   0.02_*
## 1655  0.01  -0.02    -0.01    0.00  -0.03    -0.01    -0.04    1.01_*  0.00   0.01  
## 1661  0.00   0.00    -0.01    0.01   0.00     0.00     0.01    1.01_*  0.00   0.01  
## 1667  0.01   0.02     0.08   -0.05   0.00     0.00     0.10    0.99_*  0.00   0.00  
## 1672  0.03  -0.02     0.00   -0.01   0.00    -0.03    -0.03    1.01_*  0.00   0.01  
## 1703  0.03  -0.07     0.04   -0.02  -0.14    -0.01    -0.14    1.02_*  0.00   0.02_*
## 1723 -0.11  -0.08     0.11    0.13  -0.14     0.11     0.26_*  0.95_*  0.01   0.00  
## 1726 -0.10   0.00    -0.05    0.10   0.06     0.08     0.16_*  0.99_*  0.00   0.00  
## 1744 -0.02   0.02    -0.02    0.02   0.01     0.02     0.03    1.01_*  0.00   0.01  
## 1757 -0.01   0.05    -0.02    0.01   0.03     0.01    -0.07    0.99_*  0.00   0.00  
## 1775  0.00   0.00     0.00    0.00   0.00     0.00     0.00    1.01_*  0.00   0.01  
## 1785  0.01   0.05    -0.01    0.03  -0.03    -0.01     0.11    0.98_*  0.00   0.00  
## 1788 -0.01   0.01     0.00    0.00   0.02     0.01     0.02    1.01_*  0.00   0.01  
## 1793 -0.05   0.04    -0.04    0.09   0.00     0.03     0.10    1.01_*  0.00   0.01_*
## 1795 -0.02   0.04     0.10   -0.04  -0.04     0.03     0.13    0.98_*  0.00   0.00  
## 1803 -0.01   0.00     0.00    0.00   0.00     0.01     0.02    1.01_*  0.00   0.01_*
## 1830 -0.03   0.04     0.01    0.00   0.07     0.01     0.08    1.01_*  0.00   0.01_*
## 1894  0.00  -0.02     0.06   -0.09   0.00     0.03    -0.12    1.01_*  0.00   0.01_*
## 1921  0.02   0.05     0.04   -0.01   0.07    -0.04     0.12    0.98_*  0.00   0.00  
## 1985 -0.01   0.03     0.02    0.02  -0.04     0.01     0.09    0.99_*  0.00   0.00  
## 2019 -0.02  -0.01    -0.02    0.06   0.00     0.01     0.06    1.01_*  0.00   0.01_*
## 2022  0.11  -0.12    -0.05   -0.12  -0.09    -0.05     0.20_*  0.99_*  0.01   0.01  
## 2052  0.02   0.01     0.07   -0.04   0.00    -0.02     0.09    0.99_*  0.00   0.00  
## 2077  0.03   0.02     0.05   -0.03   0.01    -0.03     0.08    0.99_*  0.00   0.00  
## 2081  0.00   0.00     0.00    0.00   0.00     0.00     0.00    1.01_*  0.00   0.01  
## 2082  0.07   0.00    -0.02   -0.07  -0.05    -0.05    -0.12    0.99_*  0.00   0.00  
## 2085  0.02  -0.02     0.06   -0.01   0.06    -0.04     0.12    0.99_*  0.00   0.00  
## 2095  0.01   0.00     0.00   -0.04   0.00     0.00    -0.04    1.01_*  0.00   0.01_*
## 2102  0.01  -0.09     0.05    0.04  -0.10    -0.01     0.15    0.97_*  0.00   0.00  
## 2106 -0.03  -0.01     0.06   -0.03   0.05     0.03     0.11    0.99_*  0.00   0.00  
## 2108  0.04   0.03     0.04    0.03  -0.05    -0.06    -0.16_*  0.98_*  0.00   0.00  
## 2135  0.01  -0.04     0.00   -0.02   0.02     0.00     0.09    0.99_*  0.00   0.00  
## 2159 -0.01  -0.03    -0.03   -0.01  -0.02     0.02    -0.08    0.99_*  0.00   0.00  
## 2165 -0.09   0.02     0.03    0.03   0.01     0.10    -0.10    1.01_*  0.00   0.01_*
## 2186 -0.01  -0.04     0.00    0.04  -0.01     0.01     0.08    0.99_*  0.00   0.00  
## 2211  0.01   0.00     0.00    0.00   0.00    -0.01     0.02    1.01_*  0.00   0.01  
## 2216  0.07  -0.12     0.04    0.00  -0.36    -0.01    -0.36_*  1.04_*  0.02   0.04_*
## 2269  0.01   0.05     0.04   -0.11   0.05     0.02    -0.15_*  1.00    0.00   0.01  
## 2278  0.06   0.02     0.01   -0.05   0.00    -0.06    -0.10    0.99_*  0.00   0.00  
## 2281  0.00   0.00     0.00    0.00   0.00     0.00     0.01    1.01_*  0.00   0.01  
## 2288  0.01   0.01     0.07    0.00  -0.04    -0.01     0.10    0.99_*  0.00   0.00  
## 2289 -0.03   0.06    -0.02    0.01  -0.02     0.05     0.10    0.99_*  0.00   0.00  
## 2295  0.06   0.03    -0.03    0.00  -0.01    -0.06     0.12    0.98_*  0.00   0.00  
## 2303 -0.04  -0.02    -0.08    0.12   0.02     0.01     0.15_*  1.00    0.00   0.01  
## 2304  0.01  -0.02     0.00    0.00   0.00    -0.02    -0.03    1.01_*  0.00   0.01  
## 2312  0.00  -0.01     0.01   -0.02  -0.01     0.00    -0.02    1.01_*  0.00   0.01  
## 2323 -0.09  -0.03    -0.04    0.15  -0.01     0.06     0.19_*  0.98_*  0.01   0.00  
## 2334 -0.02   0.06     0.05   -0.02   0.10     0.00     0.13    0.99_*  0.00   0.00  
## 2340 -0.03   0.06    -0.03    0.01   0.18     0.00     0.18_*  1.07_*  0.01   0.06_*
## 2373 -0.09   0.00    -0.14    0.16   0.07     0.06     0.23_*  0.99_*  0.01   0.01  
## 2389  0.00  -0.01     0.00    0.00  -0.03     0.00    -0.03    1.02_*  0.00   0.02_*
## 2456 -0.02   0.02    -0.01    0.06  -0.01     0.01     0.07    1.01_*  0.00   0.01_*
## 2461  0.02  -0.09    -0.07    0.08  -0.10    -0.02     0.18_*  0.99_*  0.01   0.00  
## 2467  0.00  -0.01     0.01    0.01  -0.03     0.00    -0.04    1.02_*  0.00   0.02_*
## 2469  0.02   0.00    -0.01   -0.01  -0.01    -0.02     0.02    1.01_*  0.00   0.01_*
## 2486 -0.01   0.05     0.01   -0.02   0.12    -0.01     0.13    1.01_*  0.00   0.01_*
## 2513  0.01  -0.01     0.00   -0.01   0.00     0.00    -0.01    1.01_*  0.00   0.01  
## 2527  0.03  -0.04     0.04   -0.08   0.00    -0.01    -0.09    1.01_*  0.00   0.01_*
## 2537 -0.01   0.02     0.04    0.04  -0.04    -0.01     0.10    0.99_*  0.00   0.00  
## 2542  0.02  -0.01    -0.01   -0.02   0.00    -0.02    -0.03    1.01_*  0.00   0.01  
## 2616 -0.06  -0.01     0.02    0.00  -0.03     0.08     0.10    1.01_*  0.00   0.01_*
## 2623  0.08  -0.14    -0.03    0.13  -0.09    -0.12     0.27_*  0.95_*  0.01   0.00  
## 2633  0.07   0.01     0.02   -0.12   0.03    -0.06    -0.13    1.01    0.00   0.01_*
## 2634 -0.02   0.06     0.07   -0.02   0.05     0.01     0.12    0.97_*  0.00   0.00  
## 2645  0.00   0.00     0.00    0.00   0.00     0.00     0.00    1.01_*  0.00   0.01  
## 2652  0.01   0.03     0.06   -0.04   0.00    -0.01     0.09    0.99_*  0.00   0.00  
## 2673 -0.04   0.05     0.04    0.02   0.01     0.03     0.08    0.99_*  0.00   0.00  
## 2685 -0.03   0.05     0.04   -0.01  -0.02     0.04     0.10    0.98_*  0.00   0.00  
## 2728 -0.06   0.06     0.02    0.04   0.02     0.05     0.09    0.99_*  0.00   0.00

We could, of course, again investigate these observations further to determine if there might be any mistakes in the data and whether we should perform sensitivity analyses in which highly influential observations are excluded.

Default plot method for fitted models

Throughout this practical, we created the plots ourselves and also calculated many of the measures “by hand”.

provides a convenient plotting method to give us much of this information automatically. We only need to plot() the fitted model object to get

a scatter plot of the residuals against the fitted values,
a QQ plot for the standardized residuals,
a plot of the square root of the absolute values of the standardized residuals,
a plot of Cook’s distance (only when we specify it via the argument which)
a plot of the standardized residuals vs the leverage, and
a plot of Cook’s distance vs the leverage values.

par(mfrow = c(2, 3))
plot(lm_weighted, which = 1:6)

Linear Regression: Model Diagnostics

Biostatistics I: Linear Regression

Data

Fitting the Linear Regression Model

Task

Solution

Residuals

Task 1

Solution 1

Task 2

Solution 2

Option 1

Option 2

Option 3

Task 3

Solution 3

Base R graphics

ggplot2

Task 4

Solution 4

Linearity, Normality & Multicollinearity

Task 1

Solution 1

Task 2

Solution 2

Task 3

Solution 3

Task 4

Solution 4

Homoscedasticity

Residual Variation

Task 1

Solution 1

Confidence Intervals

Task 2

Solution 2

Weighted Least Squares

Task 1

Solution 1

Task 2

Solution 2

Task 3

Solution 3

Outliers & Leverage

Task 1

Solution 1

Task 2

Solution 2

Task 3

Solution 3

Task 4

Solution 4

Task 5

Solution 5

Influential Values

Task 1

Solution 1

Task 2

Solution 2

Default plot method for fitted models