Linear Regression: Non-linear Effects

In this practical, we will fit multiple linear regression models that include non-linear effects and interactions, and visualize the estimated effects using effect plots.

To focus on the relevant issues with regards to non-linear effects and interactions we will skip some of the model evaluation steps that would of course be required in an analysis performed to answer a real research question. For the same reason, we also do not follow the recommended procedure for model building here.

This practical is intended to be “learning by doing”, i.e., you may not be able to solve all tasks by yourself without the hints or even checking the solution, and you will come across functions that you haven’t seen in the lecture slides.

When a function is new to you, you can find out more about it in the help file that you can access by typing ?<function name>.

For several tasks of this practical the given solution is only one of many possible solutions.

There are a few parts indicated as “Advanced Level” containing some more complex solutions (for those of you, for whom working with is a piece of cake).

Data

For this practical we will use the same subset of the NHANES (National Health and Nutrition Examination Survey) data as used in the first two practicals.

To load this dataset into , you can use the command file.choose() which opens the explorer and allows you to navigate to the location of the file on your computer.

If you know the path to the file, you can also use load("<path>/nhanes.RData").

Here a short overview over the data:

head(nhanes)

##        SBP gender age               race    WC  alc educ creat albu uricacid bili            occup   smoke
## 1 110.6667   male  22 Non-Hispanic White  81.0 <NA>  low  0.91  4.8      4.9  0.6          working   never
## 2 118.0000 female  44 Non-Hispanic White  80.1 <NA> high  0.89  3.7      4.5  0.3          working   never
## 3 124.6667   male  21              other  69.6   <1  low  0.87  4.4      5.4  0.2      not working   never
## 4 102.0000 female  43 Non-Hispanic Black 120.4  >=1  low  0.68  4.4      5.0  0.8      not working current
## 5 146.6667   male  51              other  81.1 <NA>  low  0.99  4.1      5.0  0.5 looking for work current
## 6 122.0000   male  80 Non-Hispanic White 112.5 <NA>  low  1.01  4.2      4.8  0.7      not working   never

str(nhanes)

## 'data.frame':    2765 obs. of  13 variables:
##  $ SBP     : num  111 118 125 102 147 ...
##  $ gender  : Factor w/ 2 levels "male","female": 1 2 1 2 1 1 1 2 1 1 ...
##  $ age     : num  22 44 21 43 51 80 26 30 70 35 ...
##  $ race    : Factor w/ 5 levels "Mexican American",..: 3 3 5 4 5 3 4 5 4 4 ...
##  $ WC      : num  81 80.1 69.6 120.4 81.1 ...
##  $ alc     : Factor w/ 2 levels "<1",">=1": NA NA 1 2 NA NA 2 2 1 NA ...
##  $ educ    : Factor w/ 2 levels "low","high": 1 2 1 1 1 1 1 2 1 2 ...
##  $ creat   : num  0.91 0.89 0.87 0.68 0.99 1.01 1.03 0.91 0.8 1.1 ...
##  $ albu    : num  4.8 3.7 4.4 4.4 4.1 4.2 5 4.8 3.7 4 ...
##  $ uricacid: num  4.9 4.5 5.4 5 5 4.8 5.4 6.7 5.4 6.7 ...
##  $ bili    : num  0.6 0.3 0.2 0.8 0.5 0.7 1.1 0.9 0.9 0.9 ...
##  $ occup   : Factor w/ 3 levels "working","looking for work",..: 1 1 3 3 2 3 1 1 3 1 ...
##  $ smoke   : Ord.factor w/ 3 levels "never"<"former"<..: 1 1 1 3 3 1 1 1 2 1 ...

Transformed Response

Task 1

We want to model the effect of age, gender, smoke, SBP, albu and creat on bili.

Fit the corresponding linear regression model.

Based on the distribution of the residuals, do you think this model is appropriate for the data at hand?

Investigate the distribution of the resiudals via plots. See also the Model Diagnostics practical.

The variable smoke is an ordered factor. You may want to specify the type of coding used for this variable.

You can do this via the contrasts argument (see also Task 3 from the Linear Regression Summary practical).

Solution 1

Fit the linear regression model for bili:

lm1 <- lm(bili ~ age + gender + smoke + SBP + albu + creat, data = nhanes,
          contrasts = list(smoke = "contr.treatment"))

Because smoke is an ordered factor would by default use orthogonal polynomials. Typically, we prefer dummy coding and specify this via the contrasts argument here.

To check the distribution of the residuals we can create an QQ-plot and a histogram:

par(mfrow = c(1, 2), mar = c(3, 3, 2, 1), mgp = c(2, 0.6, 0))

car::qqPlot(rstandard(lm1), ylim = c(-3, 15))
hist(rstandard(lm1), breaks = 100)

Whenever we only need a single/few function(s) from a package, we may not want to load the package (with library()) but can use :: to indicate from which package the function is, e.g., car::qqPlot().

Here we set the limits of the y-axis (ylim in qqPlot()) because by default, the range of the y-axis is based on the quantiles, but does not include the 95% confidence envelope, which would then be partially cut off.

For the histogram, we increase the number of breaks to 100 because this gives us some more detail about the distribution than the default, in which the observations are grouped into a smaller number of intervals.

The QQ-plot shows that the residuals are not normally distributed. They are right-skewed with some residuals being a lot larger than expected under a standard normal distribution. We can also see this in the histogram.

Task 2

How can we improve our model to better comply with the assumptions?

Fit the improved model and check if the change did indeed improve the normality of the residuals.

You should transform the response.

Solution 2

The response variable bili is right-skewed, and because the covariates used in lm1 cannot sufficiently explain the skewness, the residuals are skewed as well.

Transformation of the response may result in more normally distributed residuals.

For variables that are right-skewed, transformation with the logarithm or wit the square root can help to obtain a more symmetric distribution.

To find a suitable transformation, we can plot histograms of different transformations of bili. Note that having a normally distributed response does not guarantee that the residuals are normally distributed as well, but it many cases the response and covariates have similar distributions (and it is more convenient to work with the response because we have it available without having to fit a model each time).

par(mfrow = c(1, 3))
hist(nhanes$bili, breaks = 100)
hist(log(nhanes$bili), breaks = 100)
hist(sqrt(nhanes$bili), breaks = 100)

The transformation with the natural logarithm gives us a relatively symmetric distribution, so we will work with this transformation.

Re-fit the model with the transformed response:

lm2 <- update(lm1, formula = log(bili) ~ .)

Here we again use the function update() instead of re-writing the model because we only need to change one detail of the model: the response. Therefore, we specify the new model formula via the argument formula. The dot after the tilde indicates that we keep the right hand side of the model formula the same as it was in lm1.

Check the distribution of the residuals by re-creating the plots:

par(mfrow = c(1, 2))

car::qqPlot(rstandard(lm2))
hist(rstandard(lm2), breaks = 100)

The distribution of the residuals is now closer to a normal distribution.

Task 3

Now that we have transformed the response, how does this change the interpretation of the results?

Solution 3

The estimates for the regression coefficients from lm2 give the expected change in log(bili) for a 1 unit increase in the covariate, not the change in bili.

Task 4

Visualize the effect of albu and gender on log(bili) as well as on bili with an effect plot (using lm2). What do the plots show you about these associations?

First, you need to create data that contains the values of the covariates for which you want to calculate the fitted values.

Values of the variables of interest need to vary, and the other variables should have “reference” values.

Then you need to calculate the fitted values, i.e., “predict” them.

To show the fitted values for bili you need to transform the fitted values obtained from the model for log(bili) back.

Getting a nice-looking plot for gender is a bit complex when using the base framework. Try, but don’t worry if you can’t figure it out.

Did you get the following error message (followed by some warning messages)?

## Error in plot.window(...): need finite 'xlim' values

When creating the plot, checks the range of the data to determine the range of the x-axis and y-axis for the plot. To do this, the data is converted to numeric values, and then the range is determined (using range()) for the subset of values that are finite, i.e., range(x[is.finite(x)]).

Check what type of variable gender is in the data you created for prediction and see what happens when you try to convert it to numeric values.

To fix the issue, convert the varible gender to a factor.

Solution 4

First, we need to create the data for which we want to calculate the fitted values. We use a sequence of values for albu, both factor levels for gender, and reference values for all other covariates (the median or the first category).

# data for visualization of the effect of "albu"
ndf1 <- data.frame(albu = seq(min(nhanes$albu, na.rm = TRUE),
                              max(nhanes$albu, na.rm = TRUE),
                              by = 0.05),
                   age = median(nhanes$age),
                   gender = "male",
                   smoke = "never",
                   SBP = median(nhanes$SBP, na.rm = TRUE),
                   creat = median(nhanes$creat, na.rm = TRUE))

# data for visualization of the effect of "gender"
ndf2 <- data.frame(albu = median(nhanes$albu, na.rm = TRUE),
                   age = median(nhanes$age),
                   gender = levels(nhanes$gender),
                   smoke = "never",
                   SBP = median(nhanes$SBP, na.rm = TRUE),
                   creat = median(nhanes$creat, na.rm = TRUE))

I specified two separate datasets for visualization of the effects of albu and gender since there is no interaction between the two variables. This means that the association between one of these variables and the outcome is independent from the value of the other variable.

If we would specify one dataset in which both albu and gender vary, we’d get a dataset with all the different values for albu for males, and then the same for females. This would allow us to visualize the effects of both albu and gender in the same plot. This would be required to visualize an interaction between the two variables.

In the following, we would then only have to use predict() once.

The plotting in the base R framework (using plot() or matplot()) would get a bit more complicated, though.

Then we can calculate the fitted values and corresponding 95% confidence intervals using the function predict():

pred1 <- predict(lm2, newdata = ndf1, interval = "confidence")
pred2 <- predict(lm2, newdata = ndf2, interval = "confidence")

Effectplot for `albumin`

The fitted values for bili can be obtained by transforming the fitted values using exp(), the inverse of the natural logarithm.

To compare the fitted values on both scales, i.e., log(bili) and bili we plot them next to each other. For the continuous variable albu, we can use the function matplot() which allows us to simultaneously plot the fitted values and confidence bands.

par(mfrow = c(1, 2))

matplot(x = ndf1$albu, y = pred1,
        xlab = "albumin", ylab = "expected value of log(bili)",
        lty = c(1, 2, 2), col = "black", type = "l")

matplot(x = ndf1$albu, y = exp(pred1),
        xlab = "albumin", ylab = "expected value of bili",
        lty = c(1, 2, 2), col = "black", type = "l")

Effectlot for `gender`

The function matplot() does not work well for categorical variables because the default type of plot for categorical variables is a bar plot. To get points and error bars we use plot.default().

This may sound confusing, because I just said that the default is a bar plot, but plot.default() is specifically plotting scatter plots. When we call plot(), checks of which type the provided objects are, and then internally chooses whether to use plot.default() — for continuous data — or another type of plot.

When we use plot.default() (or, in general plot()), we need to plot the confidence intervals separately.

plot.default() tries to convert the data that are provided to numeric values. That is not possible for character vectors (and gender is a character vector in ndf), but it is possible for factors. We therefore convert gender to a factor.

par(mfrow = c(1, 2))

plot.default(factor(ndf2$gender), pred2[, "fit"],
             ylim = range(pred2), xlim = c(0, 3), xaxt = "n", 
             xlab = "gender", ylab = "expected value of log(bili)")
axis(side = 1, at = seq_along(ndf2$gender), labels = levels(factor(ndf2$gender)))

arrows(x0 = as.numeric(factor(ndf2$gender)),
       x1 = as.numeric(factor(ndf2$gender)),
       y0 = pred2[, "lwr"],
       y1 = pred2[, "upr"],
       code = 3, angle = 90, length = 0.1)


plot.default(factor(ndf2$gender), exp(pred2[, "fit"]),
             ylim = range(exp(pred2)), xlim = c(0, 3), xaxt = "n", 
             xlab = "gender", ylab = "expected value of bili")
axis(side = 1, at = seq_along(ndf2$gender), labels = levels(factor(ndf2$gender)))

arrows(x0 = as.numeric(factor(ndf2$gender)), 
       x1 = as.numeric(factor(ndf2$gender)),
       y0 = exp(pred2[, "lwr"]),
       y1 = exp(pred2[, "upr"]),
       code = 3, angle = 90, length = 0.1)

To get a nicely labelled x-axis even though we are now plotting gender as numeric values, i.e., 1 and 2 instead of female and male, we can create the x-axis ourselves.

First, we omit the x-axis from the plot by setting the x-axis type to “none” (xaxt = "n"). Then we create the axis using axis(). The argument side specifies on which side of the plot the axis should be printed (1 is the bottom, 2 would be left, 3 the top and 4 the right side).

at specifies the location on the axis at which we want the axis to have values and tick marks. seq_along(ndf2$gender) creates a sequence of integers that “counts” the elements in ndf2$gender, i.e, it will return the vector c(1, 2). With labels we can change what is printed at the tick mark locations on the axis.

The error bars are created with arrows(). x0, x1, y0 and y1 set the x- and y-coordinates of the arrows to be drawn, code specifies the type of arrow (at which end the arrow head should be drawn, where 3 refers to both ends). The arguments angel and length determine how the arrow head looks like. Since we do not actually want an arrow, but just a horizontal line, we set angle = 90 (degrees). length specifies how wide the horizontal lines at the ends of the error bars are.

The plots demonstrate that transformation of the response introduces a non-linear association between the covariates and the response on the original scale. The effect plot for log(bili) and albu shows a straight line, because the model assumed a linear effect of albu, so that a step of size 1 on the x-axis will always result in the same change on the y-axis.

But this is not the case for the plot with bili on the y-axis. The higher the value of albu, the larger the effect of a 1 unit increase on bili.

In the plots for the effect of gender we do not see large differences with respect to the differences in the expected values. But plotting the exp() (or another transformation) of the expected value will lead to confidence intervals that are no longer symmetric around the expected value. In this particular example, however, this is not visible due to the small expected values.

Non-linear Effects

As we have seen, transformation of the response introduces non-linear associations between the response and the covariates. It may be, however, that the shape of the association that is introduced is not appropriate. Either, because a covariate has a linear effect with the response on its original scale, or because the true association has a different, maybe more complex shape.

Task 1

We want to explore the shapes of the associations for the continuous covariates in lm2.

Explore the distributions of the continuous covariates used in lm2 to decide if the default settings uses for splines are appropriate or not.

Then update lm2 to model the associations with the continuous covariates as non-linear using natural cubic splines with 3 degrees of freedom.

To fit a natural cubic spline, use the function ns(). You need to load the splines package. See also the slides on non-linear effects

Solution 1

First, we explore the distributions of the three continuous covariates age, albu and bili:

par(mfrow = c(4, 1), mgp = c(2, 0.6, 0), mar = c(3, 3, 2, 1))

hist(nhanes$age, breaks = 200)

hist(nhanes$SBP, breaks = 200)

hist(nhanes$albu, breaks = 200)

hist(nhanes$creat, breaks = 200)

Again, to be able to see where the data is located, we increase the number of breaks in the histograms.

The variable age is relatively uniformly distributed. albu is symmetric but has a few observations in the tails of the distribution and SBP and creat are skewed with some observations that are much larger than the rest.

The default behaviour for ns() is that the boundary knots are placed at the minimum and maximum value of the covariate.

For SBP, albu and creat, the minima and/or maxima are outside the area where we have many observations. We should, therefore, change the boundary knots so that they are not at the minimum and maximum of the distribution, but exclude the few more extreme values.

We can indicate the positions of certain quantiles using vertical lines. This can help us to find where to place the boundary knots.

par(mfrow = c(4, 1), mgp = c(2, 0.6, 0), mar = c(3, 3, 2, 1))

hist(lm2$model$age, breaks = 200)

hist(lm2$model$SBP, breaks = 200)
abline(v = quantile(nhanes$SBP, c(0.01, 0.99), na.rm = TRUE),
       lty = 2, col = "red")

hist(lm2$model$albu, breaks = 200)
abline(v = quantile(nhanes$albu, c(0.01, 0.99), na.rm = TRUE),
       lty = 2, col = "red")

hist(lm2$model$creat, breaks = 200)
abline(v = quantile(lm2$model$creat, c(0, 0.975), na.rm = TRUE),
       lty = 2, col = "red")

The positions of the quantiles shown by the vertical lines were chosen based on trial-and-error and visual inspection.

Here, I do not plot the covariates from the original data (nhanes), but use the element model from the fitted model object lm2. The reason is that we have some missing values in the nhanes data and the data.frame returned by lm2$model only contains the observations that were actually used in the model. If there are extreme observations in the original data which are excluded due to a missing value in another variable, these observations are not relevant for us here.

In the specification of the possible locations for the boundary knots, indicated by the red vertical lines, we use the quantiles of the original data (nhanes) because when we specify the boundary knots using quantiles() in ns(), calculates the quantiles in the input data, which is nhanes.

In many applications it may not matter too much if we use the original data or the complete cases to investigate the locations of the boundary knots. But when there are many missing values, the complete cases may differ quite a bit from the full data and hence the chosen positions for the knots are not as appropriate as we may think they are.

Because the variables (might) contain missing values, we specify the argument na.rm = TRUE in quantile() so that missing values are excluded when calculating the quantiles. Without this argument, quantile() would return NA if there are any values missing.

We can now update our model using ns(). Because we want to replace the linear effects of age, albu and creat with the spline specification we need to remove them from the formula. Otherwise we would have both the linear effect and the spline in the model.

library("splines")

lm3 <- update(lm2,
              formula = .~ . - age - SBP - creat - albu +
                ns(age, df = 3) +
                ns(SBP, df = 3, B = quantile(SBP, c(0.01, 0.99), na.rm = TRUE)) +
                ns(albu, df = 3, B = quantile(albu, c(0.01, 0.99), na.rm = TRUE)) +
                ns(creat, df = 3, B = quantile(creat, c(0, 0.975), na.rm = TRUE))
)

Although, with such extensive changes in the model we can also consider to just write the new model from scratch and not use update().

We can abbreviate the name of the argument Boundary.knots with just B. This can be convenient because the variable names in the output can get really long when using splines so that the output is broken into multiply lines and hard to read.

The resulting model now has the formula

formula(lm3)

## log(bili) ~ gender + smoke + ns(age, df = 3) + ns(SBP, df = 3, 
##     B = quantile(SBP, c(0.01, 0.99), na.rm = TRUE)) + ns(albu, 
##     df = 3, B = quantile(albu, c(0.01, 0.99), na.rm = TRUE)) + 
##     ns(creat, df = 3, B = quantile(creat, c(0, 0.975), na.rm = TRUE))

Task 2

We want to explore the estimated shapes of the associations of the continuous covariates.

We could do this by creating all the effect plots ourselves, but we can also use existing packages for this. We work with these packages here because they are convenient alternatives to doing everything by hand, but for the purpose of this course, you don’t need to become an expert in the use of these packages.

Check whether the packages visreg, effects and ggeffects are installed and if needed install them using the function install.packages().

You can see the list of installed packages in RStudio in the “Packages” tab. You can also use the function packageVersion("<package name>") which returns the version number of the package if it is available and give an error if the package is not available.

Read the help pages for these functions (and take a look at the Examples section in the help file!!!) to find out how to use these functions and what exactly they calculate. Do they all show the same thing?
Create effect plots using the functions
- visreg() from the package visreg,
- predictorEffects() from the package effects (and plot()), and
- ggpredict() from the package ggeffects (and plot()).

For plotting multiple ggplot plots into one figure you can use the function ggpubr::ggarrange(). You may need to install the ggpubr package first.

Solution 2

`visreg::visreg`

The visreg package works with the Trellis graphics framework. To show multiple plots in the same figure, we can use the general graphics settings (par()) that we’ve used before.

The function visreg has several additional arguments. One of these arguments is partial, which allows us to choose whether we want to display the partial residuals or not.

Showing the partial residuals can help to interpret the non-linearity of the fit in the context of the remaining uncertainty.

par(mfrow = c(2, 3), mgp = c(2, 0.6, 0), mar = c(3, 3, 2, 1))
visreg::visreg(lm3, partial = FALSE)

Using the argument trans we could transform the values on the y-axis. This means, we can also show the expected values of bili instead of log(bili).

To explore whether we need the spline for a continuous covariate we need to explore the plots with the y-axis using the scale the model was fitted with. In our case, that is log(bili). If the effect plot shows a relatively straight line (or rather a confidence band through which we could draw a straight line) then this indicates that there is no evidence for a non-linear association.

Since a linear effect on the transformed response corresponds to a curve when we plot the expected response on the original scale of the data, then it is not easily possible to identify whether the spline is needed or not.

The effect plot for lm3 shows that the effects of SBP and age are relatively linear, and the effects for albu and creat show some more curvature. Because creat is very skewed there is very little information for large values of creat and we cannot see much of the details for the part of creat that has the most observations.

With the argument xvar we can specify a single variable for which the effect should be visualized.

We can create a single plot for creat in which we adjust the range of the x-axis:

visreg::visreg(lm3, xvar = "creat", xlim = c(0, 3), partial = FALSE)

Reference Values

In the explanation for the argument type in the Arguments section of the help ?visreg::visreg we can read that when type = "conditional"

the plot returned shows the value of the variable on the x-axis and the change in response on the y-axis, holding all other variables constant (by default, median for numeric variables and most common category for factors).

In the plots we obtained from visreg, the chosen “reference” values were, thus, the median and largest category.

To get more information about the visreg package, see

http://pbreheny.github.io/visreg/index.html
?visreg::`visreg-faq`

`effects::predictorEffects`

The function predictorEffects from the package effects calculates the effects, but does not immediately plot them.

predEff <- effects::predictorEffects(lm3)

names(predEff)

## [1] "gender" "smoke"  "age"    "SBP"    "albu"   "creat"

summary(predEff$gender)

## 
##  gender effect
## gender
##       male     female 
## -0.3094679 -0.4475585 
## 
##  Lower 95 Percent Confidence Limits
## gender
##       male     female 
## -0.3366397 -0.4812631 
## 
##  Upper 95 Percent Confidence Limits
## gender
##       male     female 
## -0.2822960 -0.4138538

We can obtain plots by using plot() on the resulting object:

par(mfrow = c(2, 3), mgp = c(2, 0.6, 0), mar = c(3, 3, 2, 1))
plot(predEff)

When we compare the plots with the plots we obtained from the visreg package, they look very similar, however, the fitted values are different, because other “reference” values are used.

Reference Values

effects uses “average” values for the covariates not shown in a particular plot. This is the mean of the observed data for continuous variables and a weighted average of the categories (as numeric values), weighted by the frequency each category is observed in the data.

For more information on the options in the effects package, see the vignette.

`ggeffects::ggpredict`

As the name already indicates, ggeffects works with ggplot2, i.e., uses a different graphics framework than the other two functions.

The function ggpredict() calculates the fitted values and corresponding 95% confidence intervals

gg_pred <- ggeffects::ggpredict(lm3, back.transform = FALSE)

gg_pred

## $gender
## # Predicted values of bili
## 
## gender | Predicted |         95% CI
## -----------------------------------
## male   |     -0.28 | [-0.31, -0.25]
## female |     -0.42 | [-0.45, -0.38]
## 
## Adjusted for:
## * smoke =  never
## *   age =  48.79
## *   SBP = 122.71
## *  albu =   4.26
## * creat =   0.90
## 
## $smoke
## # Predicted values of bili
## 
## smoke   | Predicted |         95% CI
## ------------------------------------
## never   |     -0.28 | [-0.31, -0.25]
## former  |     -0.31 | [-0.35, -0.28]
## current |     -0.39 | [-0.42, -0.35]
## 
## Adjusted for:
## * gender =   male
## *    age =  48.79
## *    SBP = 122.71
## *   albu =   4.26
## *  creat =   0.90
## 
## $age
## # Predicted values of bili
## 
## age | Predicted |         95% CI
## --------------------------------
##  20 |     -0.33 | [-0.38, -0.27]
##  27 |     -0.31 | [-0.34, -0.27]
##  35 |     -0.29 | [-0.33, -0.26]
##  43 |     -0.28 | [-0.32, -0.25]
##  50 |     -0.28 | [-0.31, -0.25]
##  57 |     -0.27 | [-0.31, -0.24]
##  65 |     -0.25 | [-0.28, -0.21]
##  80 |     -0.17 | [-0.23, -0.12]
## 
## Adjusted for:
## * gender =   male
## *  smoke =  never
## *    SBP = 122.71
## *   albu =   4.26
## *  creat =   0.90

## 
## Not all rows are shown in the ouput. Use `print(..., n = Inf)` to show all rows.

## 
## $SBP
## # Predicted values of bili
## 
##    SBP | Predicted |         95% CI
## -----------------------------------
##  81.33 |     -0.20 | [-0.31, -0.10]
##  96.67 |     -0.23 | [-0.28, -0.18]
## 110.00 |     -0.25 | [-0.29, -0.22]
## 122.67 |     -0.28 | [-0.31, -0.25]
## 134.00 |     -0.30 | [-0.34, -0.27]
## 147.33 |     -0.31 | [-0.35, -0.27]
## 160.00 |     -0.30 | [-0.35, -0.25]
## 234.67 |     -0.20 | [-0.40,  0.00]
## 
## Adjusted for:
## * gender =  male
## *  smoke = never
## *    age = 48.79
## *   albu =  4.26
## *  creat =  0.90

## 
## Not all rows are shown in the ouput. Use `print(..., n = Inf)` to show all rows.

## 
## $albu
## # Predicted values of bili
## 
## albu | Predicted |         95% CI
## ---------------------------------
## 2.60 |     -0.45 | [-0.63, -0.26]
## 3.10 |     -0.43 | [-0.54, -0.32]
## 3.40 |     -0.42 | [-0.49, -0.35]
## 3.70 |     -0.41 | [-0.45, -0.37]
## 4.10 |     -0.34 | [-0.37, -0.30]
## 4.40 |     -0.23 | [-0.26, -0.20]
## 4.70 |     -0.14 | [-0.18, -0.11]
## 5.40 |     -0.01 | [-0.14,  0.12]
## 
## Adjusted for:
## * gender =   male
## *  smoke =  never
## *    age =  48.79
## *    SBP = 122.71
## *  creat =   0.90

## 
## Not all rows are shown in the ouput. Use `print(..., n = Inf)` to show all rows.

## 
## $creat
## # Predicted values of bili
## 
## creat | Predicted |         95% CI
## ----------------------------------
##  0.34 |     -0.48 | [-0.60, -0.36]
##  0.59 |     -0.34 | [-0.39, -0.30]
##  0.79 |     -0.29 | [-0.32, -0.25]
##  0.98 |     -0.28 | [-0.31, -0.25]
##  1.17 |     -0.28 | [-0.32, -0.25]
##  1.37 |     -0.29 | [-0.33, -0.25]
##  1.63 |     -0.29 | [-0.33, -0.25]
##  9.51 |     -0.43 | [-0.76, -0.11]
## 
## Adjusted for:
## * gender =   male
## *  smoke =  never
## *    age =  48.79
## *    SBP = 122.71
## *   albu =   4.26

## 
## Not all rows are shown in the ouput. Use `print(..., n = Inf)` to show all rows.

## 
## attr(,"class")
## [1] "ggalleffects" "list"        
## attr(,"model.name")
## [1] "lm3"

By default, ggpredict will transform the results back to the original scale of the response. We change this behaviour here using the argument back.transform.

The argument typical controls how the “reference” values for the covariates not of interest are set. The default is that the mean is used for continuous covariates.

To create a plot from the calculated fitted values we can use the function plot(). This results in a list of ggplot objects and each plot is shown separately. You can click through them by using the arrow buttons in the Plot tab in RStudio.

To show all plots in one figure, we can, for instance, use the function ggarrange() from the ggpubr package:

ggpubr::ggarrange(
  plotlist = plot(gg_pred)
)

Even though we set back.transform = "FALSE", which means that we get the fitted values for log(bili), the y-axis is (wrongly) labelled bili.

The ggeffects package has another function, ggeffect(), which does something similar to what ggpredict() does. In the help file we can read that

ggpredict() uses predict() for generating predictions, while ggeffect() computes marginal effects by internally calling effects::Effect()…

gg_eff <- ggeffects::ggeffect(lm3)

ggpubr::ggarrange(
  plotlist = plot(gg_eff)
)

More information on the ggeffects package can be found on the package website https://strengejacke.github.io/ggeffects/.

Comparison

The four functions from the three packages thus do not show us the same thing.

This is because they use different reference values:

	reference value
package	categorical	continuous
`visreg::visreg`	largest	median
`effects::predictorEffects`	weighted average	mean
`ggeffects::ggpredict`	reference category	mean
`ggeffects::ggeffect`	weighted average	mean

For variables that do not have an interaction with other variables this results only in a change of the absolute value (they are shifted along the y-axis).

We can see this, for instance, in the results for the covariate gender:

res_visreg <- visreg::visreg(lm3, xvar = "gender", plot = FALSE)

res_visreg$fit

##   gender smoke age SBP albu creat bili  visregFit  visregLwr  visregUpr
## 1   male never  48 120  4.3  0.84  0.7 -0.2594933 -0.2933712 -0.2256154
## 2 female never  48 120  4.3  0.84  0.7 -0.3975839 -0.4316886 -0.3634792

predEff$gender

## 
##  gender predictor effect
## 
##  gender effect
## gender
##       male     female 
## -0.3094679 -0.4475585

gg_pred$gender

## # Predicted values of bili
## 
## gender | Predicted |         95% CI
## -----------------------------------
## male   |     -0.28 | [-0.31, -0.25]
## female |     -0.42 | [-0.45, -0.38]
## 
## Adjusted for:
## * smoke =  never
## *   age =  48.79
## *   SBP = 122.71
## *  albu =   4.26
## * creat =   0.90

gg_eff$gender

## # Predicted values of bili
## 
## gender | Predicted |         95% CI
## -----------------------------------
## male   |     -0.31 | [-0.34, -0.28]
## female |     -0.45 | [-0.48, -0.41]

The difference in expected value between different values of the covariate remains the same with all four functions:

diff(res_visreg$fit[, "visregFit"])
## [1] -0.1380906
diff(predEff$gender$fit)
##         [,1]
## 2 -0.1380906
diff(gg_pred$gender$predicted)
## [1] -0.1380906
diff(gg_eff$gender$predicted)
## [1] -0.1380906

Interactions

In the effect plots from the previous section we saw that age and SBP seemed to have quite linear effects.

To simplify the model a bit, we model these two variables using linear effects but will now introduce interactions into the model.

Task 1

Fit another linear regression model for log(bili), with natural cubic splines for creat and albu (both with 3 degrees of freedom), linear effects for age and SBP, gender, smoke, and the interactions between gender and age.
Obtain the model summary.
Interpret the results for age and gender.

Solution 1

lm4 <- lm(log(bili) ~  SBP + smoke + 
            gender * age +  
            ns(creat, df = 3, B = quantile(creat, c(0.025, 0.975), na.rm = TRUE)) + 
            ns(albu, df = 3, B = quantile(albu, c(0.01, 0.99), na.rm = TRUE)),
          data = nhanes, contrasts = list(smoke = "contr.treatment"))

summary(lm4)

## 
## Call:
## lm(formula = log(bili) ~ SBP + smoke + gender * age + ns(creat, 
##     df = 3, B = quantile(creat, c(0.025, 0.975), na.rm = TRUE)) + 
##     ns(albu, df = 3, B = quantile(albu, c(0.01, 0.99), na.rm = TRUE)), 
##     data = nhanes, contrasts = list(smoke = "contr.treatment"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.79300 -0.20056  0.00025  0.19427  1.92024 
## 
## Coefficients:
##                                                                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                                                            -0.4635453  0.0699667  -6.625 4.30e-11 ***
## SBP                                                                    -0.0010210  0.0004285  -2.383  0.01726 *  
## smokeformer                                                            -0.0295654  0.0177632  -1.664  0.09616 .  
## smokecurrent                                                           -0.1081718  0.0178814  -6.049 1.69e-09 ***
## genderfemale                                                           -0.1459250  0.0445310  -3.277  0.00106 ** 
## age                                                                     0.0018190  0.0006141   2.962  0.00309 ** 
## ns(creat, df = 3, B = quantile(creat, c(0.025, 0.975), na.rm = TRUE))1  0.0654868  0.0264376   2.477  0.01332 *  
## ns(creat, df = 3, B = quantile(creat, c(0.025, 0.975), na.rm = TRUE))2  0.1770324  0.0544186   3.253  0.00116 ** 
## ns(creat, df = 3, B = quantile(creat, c(0.025, 0.975), na.rm = TRUE))3  0.0488002  0.0247937   1.968  0.04916 *  
## ns(albu, df = 3, B = quantile(albu, c(0.01, 0.99), na.rm = TRUE))1      0.2410533  0.0294314   8.190 4.26e-16 ***
## ns(albu, df = 3, B = quantile(albu, c(0.01, 0.99), na.rm = TRUE))2      0.3129724  0.0715270   4.376 1.26e-05 ***
## ns(albu, df = 3, B = quantile(albu, c(0.01, 0.99), na.rm = TRUE))3      0.3323907  0.0350030   9.496  < 2e-16 ***
## genderfemale:age                                                        0.0002956  0.0008007   0.369  0.71207    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3247 on 2309 degrees of freedom
##   (443 observations deleted due to missingness)
## Multiple R-squared:  0.1552, Adjusted R-squared:  0.1508 
## F-statistic: 35.34 on 12 and 2309 DF,  p-value: < 2.2e-16

The relevant subset of the results table is

##                      Estimate   Std. Error    t value    Pr(>|t|)
## genderfemale     -0.145925042 0.0445309694 -3.2769339 0.001064956
## age               0.001818959 0.0006141296  2.9618494 0.003089180
## genderfemale:age  0.000295560 0.0008007182  0.3691187 0.712073122

For males, a one unit increase in age is associated with a 0.002 higher log(bili). There was no evidence that the effect of age on log(bili) differs for males and females (and there is no evidence that the effect of gender changes with age). At zero years of age, females have a -0.146 “higher” log(bili) than males.

Task 2

Visualize the effects of gender and age using an effect plot.

Because there is an interaction between the two variables, the data you create for the effect plot has to have all combinations for the two variables. Using two separate data.frame objects does not work here.

Solution 2

Because there is an interaction between gender and age, we need to create data that has all the combinations of the values for the two variables.

ndf <- with(nhanes,
            expand.grid(
              age = seq(min(age), max(age), length = 50),
              gender = levels(gender),
              smoke = "never",
              SBP = median(SBP, na.rm = TRUE),
              creat = median(creat, na.rm = TRUE),
              albu = median(albu, na.rm = TRUE)
            )
)

To avoid having to write nhanes$ for each covariate we can use the function with(). It evaluates an expression (i.e., the expand.grid(...) in our case) in an environment constructed from data, i.e., nhanes here.

It is convenient because it avoids some work, but if you have used nhanes$... that is perfectly fine as well.

We can then calculate the fitted values and add them as covariates to the dataset;

pred <- predict(lm4, newdata = ndf, interval = "confidence")

ndf <- cbind(ndf, pred)

In the following, I show multiple different solutions how to obtain the effect plots.

Using `plot()` or `matplot()`

Because the fitted values for males and females are in the same column of ndf when we plot them as lines (either in matplot() or with plot() and lines()) we will get a zigzag pattern:

matplot(x = ndf$age, y = ndf[, c("fit", "lwr", "upr")],
        lty = c(1, 2, 2), type = "l")

To avoid this, we could either plot it with dots instead of lines (but with matplot() it is not possible to use different colour for males and females). With plot() this is possible, but we need to plot the fitted values, lower and upper bound of the confidence interval separately:

par(mfrow = c(1, 2))

matplot(x = ndf$age, y = ndf[, c("fit", "lwr", "upr")],
         type = "p", pch = c(19, 1, 1), col = "black")

plot(fit ~ age, data = ndf, ylim = range(lwr, upr), col = gender, pch = 19)
points(lwr ~ age, data = ndf, col = gender)
points(upr ~ age, data = ndf, col = gender)

Alternatively, we could convert the data to have the values for males and females in separate columns:

ndf_wide <- reshape(ndf, direction = "wide",
                    v.names = c("fit", "lwr", "upr"),
                    idvar = "age", timevar = "gender")
head(ndf_wide)

##        age smoke SBP creat albu   fit.male   lwr.male   upr.male fit.female lwr.female upr.female
## 1 18.00000 never 120  0.84  4.3 -0.3029198 -0.3481019 -0.2577378 -0.4435248 -0.4924835 -0.3945661
## 2 19.26531 never 120  0.84  4.3 -0.3006183 -0.3446811 -0.2565555 -0.4408493 -0.4885696 -0.3931290
## 3 20.53061 never 120  0.84  4.3 -0.2983168 -0.3412853 -0.2553483 -0.4381738 -0.4846732 -0.3916744
## 4 21.79592 never 120  0.84  4.3 -0.2960152 -0.3379163 -0.2541142 -0.4354983 -0.4807956 -0.3902009
## 5 23.06122 never 120  0.84  4.3 -0.2937137 -0.3345762 -0.2528512 -0.4328227 -0.4769385 -0.3887070
## 6 24.32653 never 120  0.84  4.3 -0.2914121 -0.3312674 -0.2515569 -0.4301472 -0.4731035 -0.3871909

So that we could then use matplot():

matplot(x = ndf_wide$age, 
        y = ndf_wide[, grep("male|female", names(ndf_wide))],
        xlab = "age", ylab = "expected value of log(bili)",
        type = "l", lty = c(1, 2, 2), col = rep(c("blue", "red"), each = 3))

With the data in wide format, we could also use plot() and create each line separately with the help of lines().

ggplot2

When we use ggplot2 we can plot the data in the original (long) format:

library("ggplot2")

ggplot(ndf,
       aes(x = age, y = fit, color = gender, fill = gender)) +
  geom_line() +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2, color = NA) +
  scale_fill_manual(values = c("blue",  "red"), aesthetics = c("fill", "color"))

ggeffects

The ggeffects package can help us to create such a plot without the need to create the data ourselves:

gg_pred <- ggeffects::ggpredict(lm4, c("age", "gender"),
                                back.transform = FALSE)

## Model has log-transformed response. Predictions are on log-scale.

plot(gg_pred)

visreg

Using the visreg package also avoids having to create the data ourselves:

visreg::visreg(lm4, xvar = "age", by = "gender", partial = FALSE)

Task 3

In the interpretation of the model, we had to refer to the effect of gender “at zero years of age”, which has no meaningful interpretation since we only have adults in the data.

Because there seems to be no interaction between gender and age (the effect plots showed lines that are virtually parallel, i.e., the estimated interaction is so small that it is not clinically relevant) the “at zero years of age” isn’t that relevant here.

But in other examples, it may be relevant to be able to say what the expected difference in the response is for certain covariate values.

Calculate the difference in the expected response between females and males for the minimum, median and maximum age observed in the data.

Solution 3

There is two ways we can calculate this difference.

Option 1

We could predict() the fitted values at the desired ages and take their difference.

To do this, we create a new data.frame for prediction in which we set age to the minimum, median and maximum of the age in the nhanes data and then calculate fitted values with predict():

ndf <- with(nhanes,
            expand.grid(
              age = c(min(age), median(age), max(age)),
              gender = levels(gender),
              smoke = "never",
              SBP = median(SBP, na.rm = TRUE),
              creat = median(creat, na.rm = TRUE),
              albu = median(albu, na.rm = TRUE)
            )
)


ndf$fit <- predict(lm4, newdata = ndf)

ndf

##   age gender smoke SBP creat albu        fit
## 1  18   male never 120  0.84  4.3 -0.3029198
## 2  47   male never 120  0.84  4.3 -0.2501700
## 3  80   male never 120  0.84  4.3 -0.1901444
## 4  18 female never 120  0.84  4.3 -0.4435248
## 5  47 female never 120  0.84  4.3 -0.3822037
## 6  80 female never 120  0.84  4.3 -0.3124246

To be able to calculate the differences it is convenient to convert ndf so that the fitted values for males and females are given in separate columns. We can do this using the function reshape:

ndf_wide <- reshape(ndf, direction = "wide", v.names = "fit",
                    idvar = "age", timevar = "gender")
ndf_wide

##   age smoke SBP creat albu   fit.male fit.female
## 1  18 never 120  0.84  4.3 -0.3029198 -0.4435248
## 2  47 never 120  0.84  4.3 -0.2501700 -0.3822037
## 3  80 never 120  0.84  4.3 -0.1901444 -0.3124246

We can now take the difference between the two columns:

ndf_wide$fit.female - ndf_wide$fit.male

## [1] -0.1406050 -0.1320337 -0.1222802

Option 2

We can also calculate the difference directly from the parameter estimates.

The relevant part of the model summary was:

##                      Estimate   Std. Error    t value    Pr(>|t|)
## genderfemale     -0.145925042 0.0445309694 -3.2769339 0.001064956
## age               0.001818959 0.0006141296  2.9618494 0.003089180
## genderfemale:age  0.000295560 0.0008007182  0.3691187 0.712073122

The difference is the effect of gender plus the age times the interaction term:

18 years: -0.1459 + 0.0003 * 18 = -0.1406
47 years: -0.1459 + 0.0003 * 47 = -0.1320
80 years: -0.1459 + 0.0003 * 80 = -0.1223

Task 4

Instead of the interaction between gender and age, re-fit the model with interactions between gender and the spline for creat, and between gender and the spline for albu.

Create an effect plot to help you with the interpretation.

Solution 4

The new model:

lm5 <- lm(log(bili) ~ SBP + smoke + age + gender * 
            (ns(creat, df = 3, B = quantile(creat, c(0.025, 0.975), na.rm = TRUE)) + 
               ns(albu, df = 3, B = quantile(albu, c(0.01, 0.99), na.rm = TRUE))),
          data = nhanes, contrasts = list(smoke = "contr.treatment"))

summary(lm5)

## 
## Call:
## lm(formula = log(bili) ~ SBP + smoke + age + gender * (ns(creat, 
##     df = 3, B = quantile(creat, c(0.025, 0.975), na.rm = TRUE)) + 
##     ns(albu, df = 3, B = quantile(albu, c(0.01, 0.99), na.rm = TRUE))), 
##     data = nhanes, contrasts = list(smoke = "contr.treatment"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.81817 -0.20039 -0.00437  0.19154  1.94836 
## 
## Coefficients:
##                                                                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                                                                         -0.4957812  0.1458443  -3.399 0.000687 ***
## SBP                                                                                 -0.0009418  0.0004289  -2.196 0.028213 *  
## smokeformer                                                                         -0.0295538  0.0176337  -1.676 0.093876 .  
## smokecurrent                                                                        -0.1059635  0.0179034  -5.919 3.73e-09 ***
## age                                                                                  0.0020465  0.0004773   4.288 1.88e-05 ***
## genderfemale                                                                        -0.1425123  0.1404511  -1.015 0.310367    
## ns(creat, df = 3, B = quantile(creat, c(0.025, 0.975), na.rm = TRUE))1               0.1554122  0.0634964   2.448 0.014456 *  
## ns(creat, df = 3, B = quantile(creat, c(0.025, 0.975), na.rm = TRUE))2               0.3371666  0.2162565   1.559 0.119109    
## ns(creat, df = 3, B = quantile(creat, c(0.025, 0.975), na.rm = TRUE))3               0.1217016  0.0515367   2.361 0.018286 *  
## ns(albu, df = 3, B = quantile(albu, c(0.01, 0.99), na.rm = TRUE))1                   0.1971162  0.0508439   3.877 0.000109 ***
## ns(albu, df = 3, B = quantile(albu, c(0.01, 0.99), na.rm = TRUE))2                   0.1593285  0.1648436   0.967 0.333874    
## ns(albu, df = 3, B = quantile(albu, c(0.01, 0.99), na.rm = TRUE))3                   0.3071743  0.0488327   6.290 3.78e-10 ***
## genderfemale:ns(creat, df = 3, B = quantile(creat, c(0.025, 0.975), na.rm = TRUE))1 -0.1822389  0.0747034  -2.439 0.014783 *  
## genderfemale:ns(creat, df = 3, B = quantile(creat, c(0.025, 0.975), na.rm = TRUE))2 -0.1813105  0.2240910  -0.809 0.418545    
## genderfemale:ns(creat, df = 3, B = quantile(creat, c(0.025, 0.975), na.rm = TRUE))3 -0.1410623  0.0631180  -2.235 0.025520 *  
## genderfemale:ns(albu, df = 3, B = quantile(albu, c(0.01, 0.99), na.rm = TRUE))1      0.0560519  0.0649681   0.863 0.388359    
## genderfemale:ns(albu, df = 3, B = quantile(albu, c(0.01, 0.99), na.rm = TRUE))2      0.1876493  0.1850707   1.014 0.310721    
## genderfemale:ns(albu, df = 3, B = quantile(albu, c(0.01, 0.99), na.rm = TRUE))3      0.0350688  0.0776279   0.452 0.651487    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3242 on 2304 degrees of freedom
##   (443 observations deleted due to missingness)
## Multiple R-squared:  0.1596, Adjusted R-squared:  0.1534 
## F-statistic: 25.74 on 17 and 2304 DF,  p-value: < 2.2e-16

We can use, for example, ggpredict() to create the effect plot:

gg_pred <- ggeffects::ggpredict(lm5, terms = c("creat", "gender"),
                                back.transform = FALSE)

## Model has log-transformed response. Predictions are on log-scale.

plot(gg_pred, color = c("blue", "red"))

We notice that for males the confidence band is extremely wide for large values of creat, and for males also for very small values.

The crossing lines indicate that there is a relevant difference in the effect of creat for males and females. Or the other way around: that there is a difference in the effect of gender across values of creat.

gg_pred <- ggeffects::ggpredict(lm5, terms = c("albu", "gender"),
                                back.transform = FALSE)

## Model has log-transformed response. Predictions are on log-scale.

plot(gg_pred, color = c("blue", "red"))

For albu, the confidence intervals are also getting wider at the edges of the plot, but it is not as extreme as for creat. We see that the curves for males and females run relatively parallel, indicating that there is no (strong) interaction.

Task 5

Explore why we get such a wide confidence band for small values of creat for males but not for females (and generally for large values of creat).

Find out for which range of creat we have enough information to get a meaningful result/effect plot.

Explore the distribution of creat for males and females separately.

Solution 5

Apparently, there is a difference in the data between males and females. So far, we only explored the distribution of creat overall to find out if and where to set the boundary knots, but not per category of gender.

par(mfrow = c(1, 2), mar = c(3, 3, 2, 1), mgp = c(2, 0.6, 0))
hist(lm2$model$creat[lm2$model$gender == "male"], breaks = 200,
     main = "males", xlab = "creat")

hist(lm2$model$creat[lm2$model$gender == "female"], breaks = 200,
     main = "females", xlab = "creat")

There are not many observations larger than 2 (but we already knew that) and because the x-axis also includes large values, we can’t see well what is going on for the smaller values of creat.

To see this better, we can restrict the range of the x-axis by setting xlim = c(0, 2):

par(mfrow = c(1, 2), mar = c(3, 3, 2, 1), mgp = c(2, 0.6, 0))
hist(lm2$model$creat[lm2$model$gender == "male"], breaks = 200,
     main = "males", xlab = "creat", xlim = c(0, 2))

hist(lm2$model$creat[lm2$model$gender == "female"], breaks = 200,
     main = "females", xlab = "creat", xlim = c(0, 2))

In this plot we see that males do not have any values of creat < 0.5, but females have.

The wide confidence intervals in the effect plots are due to insufficient information for the smallest and highest values of creat. We could restrict the effect plots to areas where we have at least a certain number of observations (within a relatively small interval), for example, 5. (I’d say this is the absolute minimum required, in other applications you might also need more observations to get a reasonable precise fitted value.)

Using the histogram, we can guess for which range of creat we have at least 5 observations per interval. This is easier when we add horizontal lines at the value 5:

par(mfrow = c(1, 2), mar = c(3, 3, 2, 1), mgp = c(2, 0.6, 0))
hist(lm2$model$creat[lm2$model$gender == "male"], breaks = 200,
     main = "males", xlab = "creat", xlim = c(0, 2))
abline(h = 5, lty = 2)

hist(lm2$model$creat[lm2$model$gender == "female"], breaks = 200,
     main = "females", xlab = "creat", xlim = c(0, 2))
abline(h = 5, lty = 2)

For males, we could use the range from 0.6 to 1.5, and for females the range from 0.4 until 1.3.

We can also extract the information which of the intervals contain at least 5 subjects from the histogram.

For the purpose of visualization we can add a horizontal line at 5, and to be able to use the information from the histograms we assign them names (hm and hf), so that we can then work with these objects.

par(mfrow = c(1, 2), mar = c(3, 3, 2, 1), mgp = c(2, 0.6, 0))
hm <- hist(lm2$model$creat[lm2$model$gender == "male"], breaks = 200,
           main = "males", xlab = "creat", xlim = c(0, 2))
abline(h = 5, lty = 2)

hf <- hist(lm2$model$creat[lm2$model$gender == "female"], breaks = 200,
           main = "females", xlab = "creat", xlim = c(0, 2))
abline(h = 5, lty = 2)

The histograms return a list with (among others) elements mids (the mid points of the intervals) and counts (the number of observations per interval). We can use this to find suitable ranges for creat for males and females.

Here, we take the range of mid points for intervals that contain at least 10 observations. This is, of course, an arbitrary choice, and it depends on the number of intervals in our data whether this choice makes sense or not.

# for males:
range(hm$mids[which(hm$counts > 10)])

## [1] 0.625 1.375

# for females:
range(hf$mids[which(hf$counts > 10)])

## [1] 0.475 1.125

Task 6

Create an effect plot for creat and gender that uses a more appropriate (and gender specific) range of values of creat for males and females.

Create this plot yourself, i.e., not using visreg, ggeffects or the effects package.

Solution 6

Using the bounds determined in the previous step, we can create data for calculation of the fitted values. We could either create separate datasets for males and females and then combine them, or create a combined dataset and then exclude unwanted rows:

ndf <- with(nhanes,
            expand.grid(
              age = median(age, na.rm = TRUE),
              gender = levels(gender),
              smoke = "never",
              SBP = median(SBP, na.rm = TRUE),
              creat = round(seq(0.45, 1.4, by = 0.05), 2),
              albu = median(albu, na.rm = TRUE)
            )
)


ndf <- subset(ndf, (creat >= 0.6 | gender == "female") & 
                (creat <= 1.15 | gender == "male"))

Here we created ndf with the larger range of values (the lowest and highest value to be used for either males or females) and now have to exclude certain rows to prevent their data from being plotted.

To exclude the values at the lower end of the range for males, we can use the condition that creat should be at least 0.6 OR that gender should be “female”, i.e., we exclude rows with values lower than 0.6 for males.

The same is possible for excluding the larger values of creat for females, where we use the restriction that either creat should be smaller or equal to 1.15 OR gender should be “male”.

Then we combine the data with the fitted values:

ndf <- cbind(ndf, 
             predict(lm5, newdata = ndf, interval = "confidence"))

And can now plot them:

ggplot(ndf, aes(x = creat, y = fit, color = gender, fill = gender)) +
  geom_line() +
  geom_ribbon(aes(ymin = lwr, ymax = upr), 
              alpha = 0.2, color = NA) +
  scale_color_manual(values = c("blue", "red"), 
                     aesthetics = c("color", "fill"))

Restricting what is shown in the plot to the areas with sufficient data can help to prevent confusion about the interpretation.

We could, of course, also create the plot using plot() or matplot() as we did before, but the ggplot2 solution is easier here.

Task 7

Describe the effect plot.

Give the estimated values of the fitted values and confidence intervals for relevant values of creat in this description.

For a more meaningful clinical interpretation it helps to use a plot that shows the expected values of bili rather than the expected values of log(bili).

Solution 7

To get an interpretation of the plot that is more meaningful from a clinical point of view, we should use the plot for bili rather than log(bili).

The plot for log(bili) should be used to investigate whether there is a need for splines, but in our example here, there is very little difference between the two plots. This is because the fitted values are relatively small and not widely spread.

The y-axis on the plot ranges from approximately -0.6 to -0.2. The exp() transformation for these values shows an almost linear curve:

plot(seq(-0.6, -0.2, 0.01), exp(seq(-0.6, -0.2, 0.01)))
lines(x = c(-0.6, -0.2), y = exp(c(-0.6, -0.2)), lty = 2)

This would be different if our fitted values were 10 times as large:

plot(seq(-0.6, -0.2, 0.01) * 10, exp(seq(-0.6, -0.2, 0.01) * 10))
lines(x = c(-0.6, -0.2) * 10, y = exp(c(-0.6, -0.2) * 10), lty = 2)

In that case we would see much more differences between the plots for the response on the original scale and the log-transformed response.

ggpubr::ggarrange(
  ggplot(ndf, aes(x = creat, y = fit, color = gender, fill = gender)) +
    geom_line() +
    geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2, color = NA) +
    scale_color_manual(values = c("blue", "red"), 
                       aesthetics = c("color", "fill")) +
    ylab("expected value of log(bili)"),
  
  ggplot(ndf, aes(x = creat, y = exp(fit), color = gender, fill = gender)) +
    geom_line() +
    geom_ribbon(aes(ymin = exp(lwr), ymax = exp(upr)), alpha = 0.2, color = NA) +
    scale_color_manual(values = c("blue", "red"), 
                       aesthetics = c("color", "fill")) +
    ylab("expected value of bili"),
  
common.legend = TRUE
)

The function ggarrange() from the package ggpubr allows us to plot multiple ggplot2 objects in a grid, i.e., what par(mfrow(1, 2)) would do for base R graphics.

Some values we could use in the description are:

ndf$fit_exp <- exp(ndf$fit)
ndf$lwr_exp <- exp(ndf$lwr)
ndf$upr_exp <- exp(ndf$upr)

subset(ndf, gender == "female" & creat %in% c(0.55, 0.75, 1.15))

##    age gender smoke SBP creat albu        fit        lwr        upr   fit_exp   lwr_exp   upr_exp
## 6   47 female never 120  0.55  4.3 -0.4559922 -0.4976694 -0.4143151 0.6338188 0.6079459 0.6607927
## 14  47 female never 120  0.75  4.3 -0.3758135 -0.4086156 -0.3430115 0.6867304 0.6645697 0.7096300
## 30  47 female never 120  1.15  4.3 -0.4397020 -0.5005517 -0.3788523 0.6442284 0.6061961 0.6846467

subset(ndf, gender == "male" & creat %in% c(0.6, 0.9, 1.4))

##    age gender smoke SBP creat albu        fit        lwr        upr   fit_exp   lwr_exp   upr_exp
## 7   47   male never 120   0.6  4.3 -0.3752777 -0.4976481 -0.2529073 0.6870984 0.6079588 0.7765399
## 19  47   male never 120   0.9  4.3 -0.2540724 -0.2850398 -0.2231051 0.7756356 0.7519843 0.8000308
## 39  47   male never 120   1.4  4.3 -0.2379764 -0.2779690 -0.1979839 0.7882213 0.7573203 0.8203831

A possible description of the results for creat and gender would be:

The plot shows that males and females have differently shaped non-linear effects of creat on log(bili) (and bili).

For females, we see that bili increases for increasing values of creat, from 0.634 (95% CI [0.608, 0.661]) for a creat of 0.55 until approximately 0.687 (95% CI [0.665, 0.710]) for a creat of 0.75, after which higher values of creat result in a decrease in the expected bili to 0.644 (95% CI [0.606, 0.685]) for a creat of 1.15.

For males, the expected bili is increasing from 0.687 (95% CI [0.608, 0.675]) for a value of creat of 0.6, to 0.776 (95% CI [0.752, 0.694]) for a value of creat of approximately 0.9, however, for small values of creat there is much uncertainty. The expected bili remains relatively stable for large values of creat (0.788, 95% CI [0.757,] for a creat of 1.4).

Linear Regression: Non-linear Effects

Biostatistics I: Linear Regression

Data

Transformed Response

Task 1

Solution 1

Task 2

Solution 2

Task 3

Solution 3

Task 4

Solution 4

Effectplot for albumin

Effectlot for gender

Non-linear Effects

Task 1

Solution 1

Task 2

Solution 2

visreg::visreg

Reference Values

effects::predictorEffects

Reference Values

ggeffects::ggpredict

Comparison

Interactions

Task 1

Solution 1

Task 2

Solution 2

Using plot() or matplot()

ggplot2

ggeffects

visreg

Task 3

Solution 3

Option 1

Option 2

Task 4

Solution 4

Task 5

Solution 5

Task 6

Solution 6

Task 7

Solution 7

Effectplot for `albumin`

Effectlot for `gender`

`visreg::visreg`

`effects::predictorEffects`

`ggeffects::ggpredict`

Using `plot()` or `matplot()`