Logistic Regression & Non-linear Effects

Preface

R packages

In this practical, we use the R package effects. It can be installed by running the following syntax:

install.packages("effects")

Dataset

For this practical, we will use the pbc dataset on Primary Biliary Cholangitis from the package survival.

You can either load the survival package to get access to this dataset

library(survival)

or make a copy of this data

pbc <- survival::pbc

The pbc data contain (among others) the following variables:

str(pbc[, c(5:11, 20)])

## 'data.frame':    418 obs. of  8 variables:
##  $ age    : num  58.8 56.4 70.1 54.7 38.1 ...
##  $ sex    : Factor w/ 2 levels "m","f": 2 2 1 2 2 2 2 2 2 2 ...
##  $ ascites: int  1 0 0 0 0 0 0 0 0 1 ...
##  $ hepato : int  1 1 0 1 1 1 1 0 0 0 ...
##  $ spiders: int  1 1 0 1 1 0 0 0 1 1 ...
##  $ edema  : num  1 0 0.5 0.5 0 0 0 0 0 1 ...
##  $ bili   : num  14.5 1.1 1.4 1.8 3.4 0.8 1 0.3 3.2 12.6 ...
##  $ stage  : int  4 3 4 4 3 3 3 3 2 4 ...

The relevant variables are

age: patient’s age in years
sex: patient’s sex
ascites: presence of ascites
spiders: presence of blood vessel malformations in the skin
bili: serum bilirubin
stage: histologic stage of the disease

Fitting a Logistic Regression

We want to fit a logistic regression model for the outcome spiders using the covariates age, sex, bili, stage and ascites.

Variable coding

Task 1

Are the variables coded correctly? If not, re-code them.

Three of the variables of interest are coded as integer.

Solution 1

The variables ascites, spiders and stage are coded as integer. Both ascites and spiders do only have two levels, therefore we could use them as they are:

table(pbc$spiders, exclude = NULL)

## 
##    0    1 <NA> 
##  222   90  106

table(pbc$ascites, exclude = NULL)

## 
##    0    1 <NA> 
##  288   24  106

#  In this case optional:
pbc$spiders <- factor(pbc$spiders)
pbc$ascites <- factor(pbc$ascites)

stage has four levels and should be re-coded as a factor:

table(pbc$stage, exclude = NULL)

## 
##    1    2    3    4 <NA> 
##   21   92  155  144    6

pbc$stage <- factor(pbc$stage)

Task 2

Fit the logistic model described above and get the model summary.

A logistic regression model is a GLM.

You need to specify the model formula, the data and the family.

Solution 2

mod1 <- glm(spiders ~ age + sex + bili + stage + ascites, data = pbc,
            family = binomial())
summary(mod1)

## 
## Call:
## glm(formula = spiders ~ age + sex + bili + stage + ascites, family = binomial(), 
##     data = pbc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7728  -0.7471  -0.5353   0.9351   2.3718  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -2.93568    1.36930  -2.144  0.03204 * 
## age         -0.02051    0.01410  -1.455  0.14577   
## sexf         1.21276    0.57295   2.117  0.03429 * 
## bili         0.09867    0.03109   3.174  0.00151 **
## stage2       0.65103    1.10384   0.590  0.55533   
## stage3       1.37092    1.06306   1.290  0.19719   
## stage4       2.27858    1.06739   2.135  0.03278 * 
## ascites1     0.26188    0.53826   0.487  0.62659   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 374.88  on 311  degrees of freedom
## Residual deviance: 321.27  on 304  degrees of freedom
##   (106 observations deleted due to missingness)
## AIC: 337.27
## 
## Number of Fisher Scoring iterations: 5

Task 3

Investigate the output of the model summary.

Were there missing values in the data? What happened with those cases?
What is the interpretation of the coefficient for bili in this model?
What is the interpretation of the coefficients for stage?

Solution 3

Missing values:
In the lower part of the output of the model summary we can read:

## [...]
##      Null deviance: 374.88  on 311  degrees of freedom
##  Residual deviance: 321.27  on 304  degrees of freedom
##    (106 observations deleted due to missingness)
##  AIC: 337.27
##  [...]

There were 106 cases with missing values. These were excluded from the analysis.

Interpretation of the effect of bili:

## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
##  [...]
##  bili         0.09867    0.03109   3.174  0.00151 **
##  [...]

With every 1 unite increase of bili the expected log odds of having spiders, i.e., \(\log\left(\frac{P(\texttt{spiders = 1})}{1 - P(\texttt{spiders=1})}\right)\), increases by \(\beta_{\texttt{bili}} = 0.099\) (while all other covariates in the model are kept constant).

This means that the odds ratio of bili is \(\exp(\beta_{\texttt{bili}}) = \exp(0.099) = 1.104\).

Interpretation of the effect(s) of stage:

## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
##  [...]
##  stage2       0.65103    1.10384   0.590  0.55533   
##  stage3       1.37092    1.06306   1.290  0.19719   
##  stage4       2.27858    1.06739   2.135  0.03278 * 
##  [...]

A patient with stage = 2 has a \(\beta_{\texttt{stage2}} = 0.651\) higher log odds to have spiders than a patient with stage = 1 (and the same values for all other covariates).
The odds ratio of stage = 2 is \(\exp(\beta_{\texttt{stage2}}) = \exp(0.651) = 1.918\).

Patients with stage = 3 or stage = 4 have a \(\beta_{\texttt{stage3}} = 1.371\) and \(\beta_{\texttt{stage4}} = 2.279\) higher log odds to have spiders than a patient with stage = 1 (who has the same values for all other covariates).
The corresponding odds ratios of stage = 3 and stage = 4 are \(\exp(\beta_{\texttt{stage3}}) = \exp(1.371) = 3.939\) and \(\exp(\beta_{\texttt{stage4}}) = \exp(2.279) = 9.763\), respectively.

Comparing Models

We would like to know if the variable stage has a significant/relevant contribution to the model.

Task 1

How can we determine this? Does the output of the summary() give us an answer?

Solution 1

Since stage is a factor with >2 levels, multiple dummy variables are used in the model. To answer the question whether stage has a significant/relevant contribution we need to consider all three dummy variables simultaneously.

Therefore, we need to compare mod1 with the model in which stage was removed.

Task 2

Re-fit mod1 without stage (but give it a different name).

Solution 2

mod1b <- glm(spiders ~ age + sex + bili + ascites, data = pbc,
             family = binomial())

# We could also use:
mod1b <- update(mod1, formula = . ~ . - stage)

Note:

The function update() allows us to update models or formulas.

To update a model, specify the new values for those arguments that change.
To update a formula, we can use the . to say ‘as it was’ and only add the change.

Task 3

Are the models mod1 and mod1b nested or not?
Which method can we use to compare them?
Perform this comparison and interpret the result.

Solution 3

The models are nested. mod1b is a specific case of mod1 for which \(\beta_{\texttt{stage2}} = \beta_{\texttt{stage3}} = \beta_{\texttt{stage4}} = 0\)

To compare nested models, we can use a likelihood ratio test:

anova(mod1, mod1b, test = 'LRT')

## Analysis of Deviance Table
## 
## Model 1: spiders ~ age + sex + bili + stage + ascites
## Model 2: spiders ~ age + sex + bili + ascites
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1       304     321.27                         
## 2       307     342.65 -3   -21.38 8.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The small p-value indicates that the observed difference in residual deviance between the two models would be very unlikely to observe if mod1b was the correct model.

Hence, the model with stage fits the data better.

Non-linear effects

We now want to investigate if the continuous variables have non-linear effect.

Task 1

To do this, use a natural cubic splines with three degrees of freedom.

Use the function ns() from the package splines.

Solution 1

library("splines")
mod2 <- glm(spiders ~ ns(age, df = 3) + sex + ns(bili, df = 3) + stage + ascites,
            data = pbc, family = binomial())

summary(mod2)

## 
## Call:
## glm(formula = spiders ~ ns(age, df = 3) + sex + ns(bili, df = 3) + 
##     stage + ascites, family = binomial(), data = pbc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5599  -0.7723  -0.4592   0.9689   2.4957  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -4.9676     1.4427  -3.443 0.000575 ***
## ns(age, df = 3)1   -0.5062     0.6187  -0.818 0.413253    
## ns(age, df = 3)2   -0.2193     1.8429  -0.119 0.905292    
## ns(age, df = 3)3   -0.4484     1.0426  -0.430 0.667141    
## sexf                1.4727     0.5869   2.509 0.012098 *  
## ns(bili, df = 3)1   1.9401     0.8638   2.246 0.024699 *  
## ns(bili, df = 3)2   4.5549     1.2426   3.666 0.000247 ***
## ns(bili, df = 3)3   2.2240     1.1900   1.869 0.061642 .  
## stage2              0.3438     1.1180   0.308 0.758447    
## stage3              0.9110     1.0815   0.842 0.399585    
## stage4              1.6438     1.0919   1.506 0.132190    
## ascites1            0.2423     0.5394   0.449 0.653314    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 374.88  on 311  degrees of freedom
## Residual deviance: 310.91  on 300  degrees of freedom
##   (106 observations deleted due to missingness)
## AIC: 334.91
## 
## Number of Fisher Scoring iterations: 5

Task 2

What is the interpretation of the regression coefficients for bili in mod2?

Solution 2

The coefficients tell us how much each of the basis functions of the splines have been scaled. They do not have a direct meaningful clinical interpretation.

Visualization

To visualize the resulting non-linear fit, we can use an effect plot.

The idea is to obtain fitted values for a specific set of cases, which vary in the value of the covariate of interest, but have constant values for all other covariates.

The R package effects can help us in creating such plots:

# if not installed, run:
# install.packages("effects")

library("effects")
plot(predictorEffects(mod2))

From the effects package documentation:

[…] the vertical axis is in linear predictor scale, which is the log-odds or logit for this logistic regression example, but the vertical axis labels are in mean (probability) scale, so the tick-marks are not equally spaced

The plot for age, for example, shows us the estimated probability (and corresponding 95% confidence band) for having spiders for any given age while the other variables have been fixed to the population mean.

For a more information about the effects package and its functions, please refer to the reference manual and vignettes available at https://CRAN.R-project.org/package=effects.

Task 3

The effect plots show that the effect of age is quite linear, but that the effect of bili is indeed non-linear.

Confirm that the non-linear effect for age is not needed, i.e., does not fit the data better than the corresponding model with a linear effect of age.

Re-fit the model and replace the spline for age with just age to compare the model fit.

For the comparison, use the AIC() or BIC().

Solution 3

mod2b <- glm(spiders ~ age + sex + ns(bili, df = 3) + stage + ascites,
             data = pbc, family = binomial())

When comparing models with regards to the effect of a variable that is modelled using a spline, models are usually not nested (except in specific cases). Therefore, we will use here the information criteria to compare mod2 and mod2b:

BIC(mod2, mod2b)

##       df      BIC
## mod2  12 379.8257
## mod2b 10 368.5101

We see that the smaller (less degrees of freedom / parameters) model has the lower BIC. This confirms that the spline specification for age is not necessary.

Task 4

The non-linear association for bili resembles the shape of the logarithmic function.

Investigate if we can replace the spline for bili with the (natural) logarithm.

using information criteria
visually

For the visual comparison, the function predictorEffect(predictor = "bili", mod = <name of model>) may be useful if you want to show the effect plot for the variable bili only.

Solution 4

mod2c <- glm(spiders ~ age + sex + log(bili) + stage + ascites,
             data = pbc, family = binomial())

BIC(mod2b, mod2c)

##       df      BIC
## mod2b 10 368.5101
## mod2c  8 358.4755

The BIC shows that the model with log(bili) has better model fit.

plot(predictorEffect('bili', mod2b))

plot(predictorEffect('bili', mod2c))

Additional Exercises

Ordinal Covariate

The variable stage is actually ordinal, but when we re-coded it above, we did not consider that.

Task 1

Make an ordered factor from stage and call the new variable stage_ord.

Solution 1

pbc$stage_ord <- as.ordered(pbc$stage)

# or alternatively:
# pbc$stage_ord <- factor(pbc$stage, ordered = TRUE)

Note:

R automatically orders factors by their level (alphabetically or by number):

factor(c('low', 'medium', 'high'), ordered = TRUE)

## [1] low    medium high  
## Levels: high < low < medium

For ordered factors where the levels should not be ordered by alphabet (or number) you need to specify the levels in factor():

factor(c('low', 'medium', 'high'), levels = c('low', 'medium', 'high'), ordered = TRUE)

## [1] low    medium high  
## Levels: low < medium < high

Task 2

Fit another version of mod1 using stage_ord instead of stage.
Do not copy-paste the syntax from above, but use update() to change the model.
What is the interpretation of the regression coefficients for stage_ord?

Solution 2

To update mod1, we need to specify that we want to change the formula. We remove the original stage and add stage_ord:

mod1c <- update(mod1, formula = . ~ . -stage + stage_ord)
summary(mod1c)

## 
## Call:
## glm(formula = spiders ~ age + sex + bili + ascites + stage_ord, 
##     family = binomial(), data = pbc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7728  -0.7471  -0.5353   0.9351   2.3718  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -1.86055    0.98226  -1.894  0.05821 . 
## age         -0.02051    0.01410  -1.455  0.14577   
## sexf         1.21276    0.57295   2.117  0.03429 * 
## bili         0.09867    0.03109   3.174  0.00151 **
## ascites1     0.26188    0.53826   0.487  0.62659   
## stage_ord.L  1.68949    0.72208   2.340  0.01930 * 
## stage_ord.Q  0.12831    0.57199   0.224  0.82250   
## stage_ord.C  0.02659    0.37473   0.071  0.94343   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 374.88  on 311  degrees of freedom
## Residual deviance: 321.27  on 304  degrees of freedom
##   (106 observations deleted due to missingness)
## AIC: 337.27
## 
## Number of Fisher Scoring iterations: 5

The summary of the coefficients for stage_ord looks a bit different than the corresponding part in mod1. We now have:

## Coefficients:
## [...]
## stage_ord.L  1.68949    0.72208   2.340  0.01930 * 
## stage_ord.Q  0.12831    0.57199   0.224  0.82250   
## stage_ord.C  0.02659    0.37473   0.071  0.94343   
## [...]

where in mod1 we had:

## Coefficients:
## [...]
## stage2       0.65103    1.10384   0.590  0.55533   
## stage3       1.37092    1.06306   1.290  0.19719   
## stage4       2.27858    1.06739   2.135  0.03278 * 
## [...]

When R creates dummy variables, the factor levels are added at the end of the variable name. For ordered factors, R uses by default orthogonal polynomials, i.e., in this case a linear, a quadratic and a cubic term.

You can obtain the coding for the (ordered) factor using contrasts():

contrasts(pbc$stage_ord)

##              .L   .Q         .C
## [1,] -0.6708204  0.5 -0.2236068
## [2,] -0.2236068 -0.5  0.6708204
## [3,]  0.2236068 -0.5 -0.6708204
## [4,]  0.6708204  0.5  0.2236068

These values are values from a linear, quadratic and cubic curve at the numeric values of the factor levels:

The coefficients in the summary() output are thus the effect estimates for these linear, quadratic and cubic curves and cannot be interpreted as difference from a reference category like it was the case with a dummy coded factor.