In this practical, we use the R package effects. It can be installed by running the following syntax:
install.packages("effects")
For this practical, we will use the pbc dataset on Primary Biliary Cholangitis from the package survival.
You can either load the survival package to get access to this dataset
library(survival)
or make a copy of this data
<- survival::pbc pbc
The pbc data contain (among others) the following variables:
str(pbc[, c(5:11, 20)])
## 'data.frame': 418 obs. of 8 variables:
## $ age : num 58.8 56.4 70.1 54.7 38.1 ...
## $ sex : Factor w/ 2 levels "m","f": 2 2 1 2 2 2 2 2 2 2 ...
## $ ascites: int 1 0 0 0 0 0 0 0 0 1 ...
## $ hepato : int 1 1 0 1 1 1 1 0 0 0 ...
## $ spiders: int 1 1 0 1 1 0 0 0 1 1 ...
## $ edema : num 1 0 0.5 0.5 0 0 0 0 0 1 ...
## $ bili : num 14.5 1.1 1.4 1.8 3.4 0.8 1 0.3 3.2 12.6 ...
## $ stage : int 4 3 4 4 3 3 3 3 2 4 ...
The relevant variables are
age
: patient’s age in yearssex
: patient’s sexascites
: presence of ascitesspiders
: presence of blood vessel malformations in the skinbili
: serum bilirubinstage
: histologic stage of the diseaseWe want to fit a logistic regression model for the outcome spiders
using the covariates age
, sex
, bili
, stage
and ascites
.
Are the variables coded correctly? If not, re-code them.
The variables ascites
, spiders
and stage
are coded as integer. Both ascites
and spiders
do only have two levels, therefore we could use them as they are:
table(pbc$spiders, exclude = NULL)
##
## 0 1 <NA>
## 222 90 106
table(pbc$ascites, exclude = NULL)
##
## 0 1 <NA>
## 288 24 106
# In this case optional:
$spiders <- factor(pbc$spiders)
pbc$ascites <- factor(pbc$ascites) pbc
stage
has four levels and should be re-coded as a factor:
table(pbc$stage, exclude = NULL)
##
## 1 2 3 4 <NA>
## 21 92 155 144 6
$stage <- factor(pbc$stage) pbc
Fit the logistic model described above and get the model summary.
formula
, the data
and the family
.
<- glm(spiders ~ age + sex + bili + stage + ascites, data = pbc,
mod1 family = binomial())
summary(mod1)
##
## Call:
## glm(formula = spiders ~ age + sex + bili + stage + ascites, family = binomial(),
## data = pbc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7728 -0.7471 -0.5353 0.9351 2.3718
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.93568 1.36930 -2.144 0.03204 *
## age -0.02051 0.01410 -1.455 0.14577
## sexf 1.21276 0.57295 2.117 0.03429 *
## bili 0.09867 0.03109 3.174 0.00151 **
## stage2 0.65103 1.10384 0.590 0.55533
## stage3 1.37092 1.06306 1.290 0.19719
## stage4 2.27858 1.06739 2.135 0.03278 *
## ascites1 0.26188 0.53826 0.487 0.62659
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 374.88 on 311 degrees of freedom
## Residual deviance: 321.27 on 304 degrees of freedom
## (106 observations deleted due to missingness)
## AIC: 337.27
##
## Number of Fisher Scoring iterations: 5
Investigate the output of the model summary.
bili
in this model?stage
?Missing values:
In the lower part of the output of the model summary we can read:
## [...]
## Null deviance: 374.88 on 311 degrees of freedom
## Residual deviance: 321.27 on 304 degrees of freedom
## (106 observations deleted due to missingness)
## AIC: 337.27
## [...]
There were 106 cases with missing values. These were excluded from the analysis.
Interpretation of the effect of bili
:
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## [...]
## bili 0.09867 0.03109 3.174 0.00151 **
## [...]
With every 1 unite increase of bili
the expected log odds of having spiders
, i.e., \(\log\left(\frac{P(\texttt{spiders = 1})}{1 - P(\texttt{spiders=1})}\right)\), increases by \(\beta_{\texttt{bili}} = 0.099\) (while all other covariates in the model are kept constant).
This means that the odds ratio of bili
is \(\exp(\beta_{\texttt{bili}}) = \exp(0.099) = 1.104\).
Interpretation of the effect(s) of stage
:
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## [...]
## stage2 0.65103 1.10384 0.590 0.55533
## stage3 1.37092 1.06306 1.290 0.19719
## stage4 2.27858 1.06739 2.135 0.03278 *
## [...]
A patient with stage = 2
has a \(\beta_{\texttt{stage2}} = 0.651\) higher log odds to have spiders
than a patient with stage = 1
(and the same values for all other covariates).
The odds ratio of stage = 2
is \(\exp(\beta_{\texttt{stage2}}) = \exp(0.651) = 1.918\).
Patients with stage = 3
or stage = 4
have a \(\beta_{\texttt{stage3}} = 1.371\) and \(\beta_{\texttt{stage4}} = 2.279\) higher log odds to have spiders
than a patient with stage = 1
(who has the same values for all other covariates).
The corresponding odds ratios of stage = 3
and stage = 4
are \(\exp(\beta_{\texttt{stage3}}) = \exp(1.371) = 3.939\) and \(\exp(\beta_{\texttt{stage4}}) = \exp(2.279) = 9.763\), respectively.
We would like to know if the variable stage
has a significant/relevant contribution to the model.
How can we determine this? Does the output of the summary()
give us an answer?
Since stage
is a factor with >2 levels, multiple dummy variables are used in the model. To answer the question whether stage
has a significant/relevant contribution we need to consider all three dummy variables simultaneously.
Therefore, we need to compare mod1
with the model in which stage
was removed.
Re-fit mod1
without stage
(but give it a different name).
<- glm(spiders ~ age + sex + bili + ascites, data = pbc,
mod1b family = binomial())
# We could also use:
<- update(mod1, formula = . ~ . - stage) mod1b
Note:
The function update()
allows us to update models or formulas.
.
to say ‘as it was’ and only add the change.mod1
and mod1b
nested or not?The models are nested. mod1b
is a specific case of mod1
for which \(\beta_{\texttt{stage2}} = \beta_{\texttt{stage3}} = \beta_{\texttt{stage4}} = 0\)
To compare nested models, we can use a likelihood ratio test:
anova(mod1, mod1b, test = 'LRT')
## Analysis of Deviance Table
##
## Model 1: spiders ~ age + sex + bili + stage + ascites
## Model 2: spiders ~ age + sex + bili + ascites
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 304 321.27
## 2 307 342.65 -3 -21.38 8.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The small p-value indicates that the observed difference in residual deviance between the two models would be very unlikely to observe if mod1b
was the correct model.
Hence, the model with stage
fits the data better.
We now want to investigate if the continuous variables have non-linear effect.
To do this, use a natural cubic splines with three degrees of freedom.
ns()
from the package splines
.
library("splines")
<- glm(spiders ~ ns(age, df = 3) + sex + ns(bili, df = 3) + stage + ascites,
mod2 data = pbc, family = binomial())
summary(mod2)
##
## Call:
## glm(formula = spiders ~ ns(age, df = 3) + sex + ns(bili, df = 3) +
## stage + ascites, family = binomial(), data = pbc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5599 -0.7723 -0.4592 0.9689 2.4957
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.9676 1.4427 -3.443 0.000575 ***
## ns(age, df = 3)1 -0.5062 0.6187 -0.818 0.413253
## ns(age, df = 3)2 -0.2193 1.8429 -0.119 0.905292
## ns(age, df = 3)3 -0.4484 1.0426 -0.430 0.667141
## sexf 1.4727 0.5869 2.509 0.012098 *
## ns(bili, df = 3)1 1.9401 0.8638 2.246 0.024699 *
## ns(bili, df = 3)2 4.5549 1.2426 3.666 0.000247 ***
## ns(bili, df = 3)3 2.2240 1.1900 1.869 0.061642 .
## stage2 0.3438 1.1180 0.308 0.758447
## stage3 0.9110 1.0815 0.842 0.399585
## stage4 1.6438 1.0919 1.506 0.132190
## ascites1 0.2423 0.5394 0.449 0.653314
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 374.88 on 311 degrees of freedom
## Residual deviance: 310.91 on 300 degrees of freedom
## (106 observations deleted due to missingness)
## AIC: 334.91
##
## Number of Fisher Scoring iterations: 5
What is the interpretation of the regression coefficients for bili
in mod2
?
The coefficients tell us how much each of the basis functions of the splines have been scaled. They do not have a direct meaningful clinical interpretation.
To visualize the resulting non-linear fit, we can use an effect plot.
The idea is to obtain fitted values for a specific set of cases, which vary in the value of the covariate of interest, but have constant values for all other covariates.
The R package effects can help us in creating such plots:
# if not installed, run:
# install.packages("effects")
library("effects")
plot(predictorEffects(mod2))
From the effects package documentation:
[…] the vertical axis is in linear predictor scale, which is the log-odds or logit for this logistic regression example, but the vertical axis labels are in mean (probability) scale, so the tick-marks are not equally spaced
The plot for age
, for example, shows us the estimated probability (and corresponding 95% confidence band) for having spiders
for any given age
while the other variables have been fixed to the population mean.
For a more information about the effects package and its functions, please refer to the reference manual and vignettes available at https://CRAN.R-project.org/package=effects.
The effect plots show that the effect of age
is quite linear, but that the effect of bili
is indeed non-linear.
Confirm that the non-linear effect for age
is not needed, i.e., does not fit the data better than the corresponding model with a linear effect of age
.
age
with just age
to compare the model fit.
AIC()
or BIC()
.
<- glm(spiders ~ age + sex + ns(bili, df = 3) + stage + ascites,
mod2b data = pbc, family = binomial())
When comparing models with regards to the effect of a variable that is modelled using a spline, models are usually not nested (except in specific cases). Therefore, we will use here the information criteria to compare mod2
and mod2b
:
BIC(mod2, mod2b)
## df BIC
## mod2 12 379.8257
## mod2b 10 368.5101
We see that the smaller (less degrees of freedom / parameters) model has the lower BIC. This confirms that the spline specification for age
is not necessary.
The non-linear association for bili
resembles the shape of the logarithmic function.
Investigate if we can replace the spline for bili
with the (natural) logarithm.
predictorEffect(predictor = "bili", mod = <name of model>)
may be useful if you want to show the effect plot for the variable bili
only.
<- glm(spiders ~ age + sex + log(bili) + stage + ascites,
mod2c data = pbc, family = binomial())
BIC(mod2b, mod2c)
## df BIC
## mod2b 10 368.5101
## mod2c 8 358.4755
The BIC shows that the model with log(bili)
has better model fit.
plot(predictorEffect('bili', mod2b))
plot(predictorEffect('bili', mod2c))
The variable stage
is actually ordinal, but when we re-coded it above, we did not consider that.
Make an ordered factor from stage
and call the new variable stage_ord
.
$stage_ord <- as.ordered(pbc$stage)
pbc
# or alternatively:
# pbc$stage_ord <- factor(pbc$stage, ordered = TRUE)
Note:
R automatically orders factors by their level (alphabetically or by number):
factor(c('low', 'medium', 'high'), ordered = TRUE)
## [1] low medium high
## Levels: high < low < medium
For ordered factors where the levels should not be ordered by alphabet (or number) you need to specify the levels
in factor()
:
factor(c('low', 'medium', 'high'), levels = c('low', 'medium', 'high'), ordered = TRUE)
## [1] low medium high
## Levels: low < medium < high
mod1
using stage_ord
instead of stage
.update()
to change the model.stage_ord
?To update mod1
, we need to specify that we want to change the formula
. We remove the original stage
and add stage_ord
:
<- update(mod1, formula = . ~ . -stage + stage_ord)
mod1c summary(mod1c)
##
## Call:
## glm(formula = spiders ~ age + sex + bili + ascites + stage_ord,
## family = binomial(), data = pbc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7728 -0.7471 -0.5353 0.9351 2.3718
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.86055 0.98226 -1.894 0.05821 .
## age -0.02051 0.01410 -1.455 0.14577
## sexf 1.21276 0.57295 2.117 0.03429 *
## bili 0.09867 0.03109 3.174 0.00151 **
## ascites1 0.26188 0.53826 0.487 0.62659
## stage_ord.L 1.68949 0.72208 2.340 0.01930 *
## stage_ord.Q 0.12831 0.57199 0.224 0.82250
## stage_ord.C 0.02659 0.37473 0.071 0.94343
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 374.88 on 311 degrees of freedom
## Residual deviance: 321.27 on 304 degrees of freedom
## (106 observations deleted due to missingness)
## AIC: 337.27
##
## Number of Fisher Scoring iterations: 5
The summary of the coefficients for stage_ord
looks a bit different than the corresponding part in mod1
. We now have:
## Coefficients:
## [...]
## stage_ord.L 1.68949 0.72208 2.340 0.01930 *
## stage_ord.Q 0.12831 0.57199 0.224 0.82250
## stage_ord.C 0.02659 0.37473 0.071 0.94343
## [...]
where in mod1
we had:
## Coefficients:
## [...]
## stage2 0.65103 1.10384 0.590 0.55533
## stage3 1.37092 1.06306 1.290 0.19719
## stage4 2.27858 1.06739 2.135 0.03278 *
## [...]
When R creates dummy variables, the factor levels are added at the end of the variable name. For ordered factors, R uses by default orthogonal polynomials, i.e., in this case a linear, a quadratic and a cubic term.
You can obtain the coding for the (ordered) factor using contrasts()
:
contrasts(pbc$stage_ord)
## .L .Q .C
## [1,] -0.6708204 0.5 -0.2236068
## [2,] -0.2236068 -0.5 0.6708204
## [3,] 0.2236068 -0.5 -0.6708204
## [4,] 0.6708204 0.5 0.2236068
These values are values from a linear, quadratic and cubic curve at the numeric values of the factor levels:
The coefficients in the summary()
output are thus the effect estimates for these linear, quadratic and cubic curves and cannot be interpreted as difference from a reference category like it was the case with a dummy coded factor.
© Nicole Erler