In this practical, we use the R package effects. It can be installed by running the following syntax:
install.packages("effects")For this practical, we will use the pbc dataset on Primary Biliary Cholangitis from the package survival.
You can either load the survival package to get access to this dataset
library(survival)or make a copy of this data
pbc <- survival::pbcThe pbc data contain (among others) the following variables:
str(pbc[, c(5:11, 20)])## 'data.frame': 418 obs. of 8 variables:
## $ age : num 58.8 56.4 70.1 54.7 38.1 ...
## $ sex : Factor w/ 2 levels "m","f": 2 2 1 2 2 2 2 2 2 2 ...
## $ ascites: int 1 0 0 0 0 0 0 0 0 1 ...
## $ hepato : int 1 1 0 1 1 1 1 0 0 0 ...
## $ spiders: int 1 1 0 1 1 0 0 0 1 1 ...
## $ edema : num 1 0 0.5 0.5 0 0 0 0 0 1 ...
## $ bili : num 14.5 1.1 1.4 1.8 3.4 0.8 1 0.3 3.2 12.6 ...
## $ stage : int 4 3 4 4 3 3 3 3 2 4 ...
The relevant variables are
age: patient’s age in yearssex: patient’s sexascites: presence of ascitesspiders: presence of blood vessel malformations in the skinbili: serum bilirubinstage: histologic stage of the diseaseWe want to fit a logistic regression model for the outcome spiders using the covariates age, sex, bili, stage and ascites.
Are the variables coded correctly? If not, re-code them.
The variables ascites, spiders and stage are coded as integer. Both ascites and spiders do only have two levels, therefore we could use them as they are:
table(pbc$spiders, exclude = NULL)##
## 0 1 <NA>
## 222 90 106
table(pbc$ascites, exclude = NULL)##
## 0 1 <NA>
## 288 24 106
# In this case optional:
pbc$spiders <- factor(pbc$spiders)
pbc$ascites <- factor(pbc$ascites)stage has four levels and should be re-coded as a factor:
table(pbc$stage, exclude = NULL)##
## 1 2 3 4 <NA>
## 21 92 155 144 6
pbc$stage <- factor(pbc$stage)Fit the logistic model described above and get the model summary.
formula, the data and the family.
mod1 <- glm(spiders ~ age + sex + bili + stage + ascites, data = pbc,
family = binomial())
summary(mod1)##
## Call:
## glm(formula = spiders ~ age + sex + bili + stage + ascites, family = binomial(),
## data = pbc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7728 -0.7471 -0.5353 0.9351 2.3718
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.93568 1.36930 -2.144 0.03204 *
## age -0.02051 0.01410 -1.455 0.14577
## sexf 1.21276 0.57295 2.117 0.03429 *
## bili 0.09867 0.03109 3.174 0.00151 **
## stage2 0.65103 1.10384 0.590 0.55533
## stage3 1.37092 1.06306 1.290 0.19719
## stage4 2.27858 1.06739 2.135 0.03278 *
## ascites1 0.26188 0.53826 0.487 0.62659
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 374.88 on 311 degrees of freedom
## Residual deviance: 321.27 on 304 degrees of freedom
## (106 observations deleted due to missingness)
## AIC: 337.27
##
## Number of Fisher Scoring iterations: 5
Investigate the output of the model summary.
bili in this model?stage?Missing values:
In the lower part of the output of the model summary we can read:
## [...]
## Null deviance: 374.88 on 311 degrees of freedom
## Residual deviance: 321.27 on 304 degrees of freedom
## (106 observations deleted due to missingness)
## AIC: 337.27
## [...]
There were 106 cases with missing values. These were excluded from the analysis.
Interpretation of the effect of bili:
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## [...]
## bili 0.09867 0.03109 3.174 0.00151 **
## [...]
With every 1 unite increase of bili the expected log odds of having spiders, i.e., \(\log\left(\frac{P(\texttt{spiders = 1})}{1 - P(\texttt{spiders=1})}\right)\), increases by \(\beta_{\texttt{bili}} = 0.099\) (while all other covariates in the model are kept constant).
This means that the odds ratio of bili is \(\exp(\beta_{\texttt{bili}}) = \exp(0.099) = 1.104\).
Interpretation of the effect(s) of stage:
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## [...]
## stage2 0.65103 1.10384 0.590 0.55533
## stage3 1.37092 1.06306 1.290 0.19719
## stage4 2.27858 1.06739 2.135 0.03278 *
## [...]
A patient with stage = 2 has a \(\beta_{\texttt{stage2}} = 0.651\) higher log odds to have spiders than a patient with stage = 1 (and the same values for all other covariates).
The odds ratio of stage = 2 is \(\exp(\beta_{\texttt{stage2}}) = \exp(0.651) = 1.918\).
Patients with stage = 3 or stage = 4 have a \(\beta_{\texttt{stage3}} = 1.371\) and \(\beta_{\texttt{stage4}} = 2.279\) higher log odds to have spiders than a patient with stage = 1 (who has the same values for all other covariates).
The corresponding odds ratios of stage = 3 and stage = 4 are \(\exp(\beta_{\texttt{stage3}}) = \exp(1.371) = 3.939\) and \(\exp(\beta_{\texttt{stage4}}) = \exp(2.279) = 9.763\), respectively.
We would like to know if the variable stage has a significant/relevant contribution to the model.
How can we determine this? Does the output of the summary() give us an answer?
Since stage is a factor with >2 levels, multiple dummy variables are used in the model. To answer the question whether stage has a significant/relevant contribution we need to consider all three dummy variables simultaneously.
Therefore, we need to compare mod1 with the model in which stage was removed.
Re-fit mod1 without stage (but give it a different name).
mod1b <- glm(spiders ~ age + sex + bili + ascites, data = pbc,
family = binomial())
# We could also use:
mod1b <- update(mod1, formula = . ~ . - stage)Note:
The function update() allows us to update models or formulas.
. to say ‘as it was’ and only add the change.mod1 and mod1b nested or not?The models are nested. mod1b is a specific case of mod1 for which \(\beta_{\texttt{stage2}} = \beta_{\texttt{stage3}} = \beta_{\texttt{stage4}} = 0\)
To compare nested models, we can use a likelihood ratio test:
anova(mod1, mod1b, test = 'LRT')## Analysis of Deviance Table
##
## Model 1: spiders ~ age + sex + bili + stage + ascites
## Model 2: spiders ~ age + sex + bili + ascites
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 304 321.27
## 2 307 342.65 -3 -21.38 8.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The small p-value indicates that the observed difference in residual deviance between the two models would be very unlikely to observe if mod1b was the correct model.
Hence, the model with stage fits the data better.
We now want to investigate if the continuous variables have non-linear effect.
To do this, use a natural cubic splines with three degrees of freedom.
ns() from the package splines.
library("splines")
mod2 <- glm(spiders ~ ns(age, df = 3) + sex + ns(bili, df = 3) + stage + ascites,
data = pbc, family = binomial())
summary(mod2)##
## Call:
## glm(formula = spiders ~ ns(age, df = 3) + sex + ns(bili, df = 3) +
## stage + ascites, family = binomial(), data = pbc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5599 -0.7723 -0.4592 0.9689 2.4957
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.9676 1.4427 -3.443 0.000575 ***
## ns(age, df = 3)1 -0.5062 0.6187 -0.818 0.413253
## ns(age, df = 3)2 -0.2193 1.8429 -0.119 0.905292
## ns(age, df = 3)3 -0.4484 1.0426 -0.430 0.667141
## sexf 1.4727 0.5869 2.509 0.012098 *
## ns(bili, df = 3)1 1.9401 0.8638 2.246 0.024699 *
## ns(bili, df = 3)2 4.5549 1.2426 3.666 0.000247 ***
## ns(bili, df = 3)3 2.2240 1.1900 1.869 0.061642 .
## stage2 0.3438 1.1180 0.308 0.758447
## stage3 0.9110 1.0815 0.842 0.399585
## stage4 1.6438 1.0919 1.506 0.132190
## ascites1 0.2423 0.5394 0.449 0.653314
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 374.88 on 311 degrees of freedom
## Residual deviance: 310.91 on 300 degrees of freedom
## (106 observations deleted due to missingness)
## AIC: 334.91
##
## Number of Fisher Scoring iterations: 5
What is the interpretation of the regression coefficients for bili in mod2?
The coefficients tell us how much each of the basis functions of the splines have been scaled. They do not have a direct meaningful clinical interpretation.
To visualize the resulting non-linear fit, we can use an effect plot.
The idea is to obtain fitted values for a specific set of cases, which vary in the value of the covariate of interest, but have constant values for all other covariates.
The R package effects can help us in creating such plots:
# if not installed, run:
# install.packages("effects")
library("effects")
plot(predictorEffects(mod2))From the effects package documentation:
[…] the vertical axis is in linear predictor scale, which is the log-odds or logit for this logistic regression example, but the vertical axis labels are in mean (probability) scale, so the tick-marks are not equally spaced
The plot for age, for example, shows us the estimated probability (and corresponding 95% confidence band) for having spiders for any given age while the other variables have been fixed to the population mean.
For a more information about the effects package and its functions, please refer to the reference manual and vignettes available at https://CRAN.R-project.org/package=effects.
The effect plots show that the effect of age is quite linear, but that the effect of bili is indeed non-linear.
Confirm that the non-linear effect for age is not needed, i.e., does not fit the data better than the corresponding model with a linear effect of age.
age with just age to compare the model fit.
AIC() or BIC().
mod2b <- glm(spiders ~ age + sex + ns(bili, df = 3) + stage + ascites,
data = pbc, family = binomial())When comparing models with regards to the effect of a variable that is modelled using a spline, models are usually not nested (except in specific cases). Therefore, we will use here the information criteria to compare mod2 and mod2b:
BIC(mod2, mod2b)## df BIC
## mod2 12 379.8257
## mod2b 10 368.5101
We see that the smaller (less degrees of freedom / parameters) model has the lower BIC. This confirms that the spline specification for age is not necessary.
The non-linear association for bili resembles the shape of the logarithmic function.
Investigate if we can replace the spline for bili with the (natural) logarithm.
predictorEffect(predictor = "bili", mod = <name of model>) may be useful if you want to show the effect plot for the variable bili only.
mod2c <- glm(spiders ~ age + sex + log(bili) + stage + ascites,
data = pbc, family = binomial())BIC(mod2b, mod2c)## df BIC
## mod2b 10 368.5101
## mod2c 8 358.4755
The BIC shows that the model with log(bili) has better model fit.
plot(predictorEffect('bili', mod2b))plot(predictorEffect('bili', mod2c))The variable stage is actually ordinal, but when we re-coded it above, we did not consider that.
Make an ordered factor from stage and call the new variable stage_ord.
pbc$stage_ord <- as.ordered(pbc$stage)
# or alternatively:
# pbc$stage_ord <- factor(pbc$stage, ordered = TRUE)Note:
R automatically orders factors by their level (alphabetically or by number):
factor(c('low', 'medium', 'high'), ordered = TRUE)## [1] low medium high
## Levels: high < low < medium
For ordered factors where the levels should not be ordered by alphabet (or number) you need to specify the levels in factor():
factor(c('low', 'medium', 'high'), levels = c('low', 'medium', 'high'), ordered = TRUE)## [1] low medium high
## Levels: low < medium < high
mod1 using stage_ord instead of stage.update() to change the model.stage_ord?To update mod1, we need to specify that we want to change the formula. We remove the original stage and add stage_ord:
mod1c <- update(mod1, formula = . ~ . -stage + stage_ord)
summary(mod1c)##
## Call:
## glm(formula = spiders ~ age + sex + bili + ascites + stage_ord,
## family = binomial(), data = pbc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7728 -0.7471 -0.5353 0.9351 2.3718
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.86055 0.98226 -1.894 0.05821 .
## age -0.02051 0.01410 -1.455 0.14577
## sexf 1.21276 0.57295 2.117 0.03429 *
## bili 0.09867 0.03109 3.174 0.00151 **
## ascites1 0.26188 0.53826 0.487 0.62659
## stage_ord.L 1.68949 0.72208 2.340 0.01930 *
## stage_ord.Q 0.12831 0.57199 0.224 0.82250
## stage_ord.C 0.02659 0.37473 0.071 0.94343
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 374.88 on 311 degrees of freedom
## Residual deviance: 321.27 on 304 degrees of freedom
## (106 observations deleted due to missingness)
## AIC: 337.27
##
## Number of Fisher Scoring iterations: 5
The summary of the coefficients for stage_ord looks a bit different than the corresponding part in mod1. We now have:
## Coefficients:
## [...]
## stage_ord.L 1.68949 0.72208 2.340 0.01930 *
## stage_ord.Q 0.12831 0.57199 0.224 0.82250
## stage_ord.C 0.02659 0.37473 0.071 0.94343
## [...]
where in mod1 we had:
## Coefficients:
## [...]
## stage2 0.65103 1.10384 0.590 0.55533
## stage3 1.37092 1.06306 1.290 0.19719
## stage4 2.27858 1.06739 2.135 0.03278 *
## [...]
When R creates dummy variables, the factor levels are added at the end of the variable name. For ordered factors, R uses by default orthogonal polynomials, i.e., in this case a linear, a quadratic and a cubic term.
You can obtain the coding for the (ordered) factor using contrasts():
contrasts(pbc$stage_ord)## .L .Q .C
## [1,] -0.6708204 0.5 -0.2236068
## [2,] -0.2236068 -0.5 0.6708204
## [3,] 0.2236068 -0.5 -0.6708204
## [4,] 0.6708204 0.5 0.2236068
These values are values from a linear, quadratic and cubic curve at the numeric values of the factor levels:
The coefficients in the summary() output are thus the effect estimates for these linear, quadratic and cubic curves and cannot be interpreted as difference from a reference category like it was the case with a dummy coded factor.
© Nicole Erler