In this practical, we use the R package ggplot2. It can be installed by running the following syntax:
install.packages("ggplot2")Like in the previous practical on logistic regression, we will use the pbc dataset from the package survival.
You can either load the survival package to get access to this dataset
library(survival)or make a copy of this data
pbc <- survival::pbcThe pbc data contain (among others) the following variables:
str(pbc[, c(5:12, 20)])## 'data.frame': 418 obs. of 9 variables:
## $ age : num 58.8 56.4 70.1 54.7 38.1 ...
## $ sex : Factor w/ 2 levels "m","f": 2 2 1 2 2 2 2 2 2 2 ...
## $ ascites: int 1 0 0 0 0 0 0 0 0 1 ...
## $ hepato : int 1 1 0 1 1 1 1 0 0 0 ...
## $ spiders: int 1 1 0 1 1 0 0 0 1 1 ...
## $ edema : num 1 0 0.5 0.5 0 0 0 0 0 1 ...
## $ bili : num 14.5 1.1 1.4 1.8 3.4 0.8 1 0.3 3.2 12.6 ...
## $ chol : int 261 302 176 244 279 248 322 280 562 200 ...
## $ stage : int 4 3 4 4 3 3 3 3 2 4 ...
The relevant variables are
age: patient’s age in yearssex: patient’s sexascites: presence of ascitesspiders: presence of blood vessel malformations in the skinbili: serum bilirubinchol: serum cholesterolstage: histologic stage of the diseaseIn the previous practical, we have re-coded some of the categorical variables and fitted the logistic regression model.
If you are using a new R session, you need to first run the following syntax:
pbc$stage <- factor(pbc$stage)
pbc$spiders <- factor(pbc$spiders)
mod1 <- glm(spiders ~ age + sex + bili + stage + ascites, data = pbc,
family = binomial())We would now like to add another continuous variable to mod1: the variable chol.
mod1 to include chol.mod_chol <- update(mod1, formula = . ~ . + chol)
summary(mod_chol)##
## Call:
## glm(formula = spiders ~ age + sex + bili + stage + ascites +
## chol, family = binomial(), data = pbc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6892 -0.7688 -0.5030 0.9416 2.2690
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.812e+01 1.074e+03 -0.017 0.9865
## age -1.149e-02 1.518e-02 -0.757 0.4491
## sexf 1.274e+00 5.857e-01 2.175 0.0297 *
## bili 8.606e-02 3.580e-02 2.404 0.0162 *
## stage2 1.529e+01 1.074e+03 0.014 0.9886
## stage3 1.624e+01 1.074e+03 0.015 0.9879
## stage4 1.714e+01 1.074e+03 0.016 0.9873
## ascites 9.736e-02 5.765e-01 0.169 0.8659
## chol -1.710e-04 6.779e-04 -0.252 0.8009
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 341.38 on 283 degrees of freedom
## Residual deviance: 289.81 on 275 degrees of freedom
## (134 observations deleted due to missingness)
## AIC: 307.81
##
## Number of Fisher Scoring iterations: 16
We can compare the coefficients form mod1 and mod_chol and notice that the (Intercept), and the coefficients for stage are much more extreme than the ones in mod1:
round(
cbind(mod1 = coef(mod1)[names(coef(mod_chol))],
mod_chol = coef(mod_chol)
), 3)## mod1 mod_chol
## (Intercept) -2.936 -18.116
## age -0.021 -0.011
## sexf 1.213 1.274
## bili 0.099 0.086
## stage2 0.651 15.292
## stage3 1.371 16.241
## stage4 2.279 17.144
## ascites 0.262 0.097
## <NA> NA 0.000
Note:
If we would just use cbind() to combine the coefficients into a matrix we could not be sure that the coefficients of the same variable are in the same row.
a <- c(x = 1, y = 2, z = 3)
b <- c(x = 1, z = 3, y = 2)
cbind(a, b)## a b
## x 1 1
## y 2 3
## z 3 2
Moreover, the “missing” effect for chol in mod1 will just be filled in with the first value of the vector.
cbind(coef(mod1), coef(mod_chol))## Warning in cbind(coef(mod1), coef(mod_chol)): number of rows of result is not a multiple of vector
## length (arg 1)
## [,1] [,2]
## (Intercept) -2.93568220 -1.811631e+01
## age -0.02050837 -1.148978e-02
## sexf 1.21276329 1.273726e+00
## bili 0.09867415 8.606048e-02
## stage2 0.65103163 1.529183e+01
## stage3 1.37091993 1.624066e+01
## stage4 2.27857603 1.714371e+01
## ascites 0.26188382 9.736020e-02
## chol -2.93568220 -1.709538e-04
By using coef(mod1)[names(coef(mod_chol))] we make sure that the coefficients from mod1 are given in the same order as the coefficients from mod_chol. Since there is no chol in coef(mod1), NA is filled in:
coef(mod1)[names(coef(mod_chol))]## (Intercept) age sexf bili stage2 stage3 stage4 ascites
## -2.93568220 -0.02050837 1.21276329 0.09867415 0.65103163 1.37091993 2.27857603 0.26188382
## <NA>
## NA
To compare the coefficients from different models it is useful to plot coefficients with their corresponding confidence intervals.
Using base R plots:
# coefficients & CI from mod1
res1 <- cbind(coef = coef(mod1), confint(mod1))## Waiting for profiling to be done...
res1 <- rbind(res1, chol = c(NA, NA, NA)) # add an "empty" row for chol
# coefficients & CI from mod_chol
res3 <- cbind(coef = coef(mod_chol), confint(mod_chol))## Waiting for profiling to be done...
# graphical settings:
# * mfrow = c(3, 3) print the plots in a grid with 3 rows and 3 columns
# * mar = c(3, 3, 2, 1): adjust the plot margins (looks nicer)
par(mfrow = c(3, 3), mar = c(3, 3, 2, 1))
for (k in rownames(res1)) { # loop over all variables
plot(x = c(1, 2), # position on the x-axis
y = c(res1[k, 'coef'], res3[k, 'coef']), # coefficient value on the y-axis
main = k, # set variable name as title
xlim = c(0, 3), # set limit of the x-axis (looks nicer)
ylim = range(res1[k, ], res3[k, ], na.rm = TRUE), # set the limits of the y-axis
# so that also the CIs fit in
xaxt = 'n', # don't plot the x-axis (we want our own version)
ylab = 'log OR & 95% CI') # specify the y-axis label
# specification of the x-axis
axis(side = 1, # draw axis at the bottom of the plot
at = c(1, 2), # locations of the tick-marks on the x-axis
labels = c('mod1', 'mod_chol')) # labels of the tick-marks
# add vertical lines representing the confidence intervals
lines(x = c(1, 1), y = res1[k, c('2.5 %', '97.5 %')])
lines(x = c(2, 2), y = res3[k, c('2.5 %', '97.5 %')])
}Using ggplot2:
results <- rbind(
# data.frame containing results from mod1
data.frame(coef = coef(mod1),
confint(mod1),
variable = names(coef(mod1)),
model = 'mod1',
check.names = FALSE),
# data.frame containing results from mod_chol
data.frame(coef = coef(mod_chol),
confint(mod_chol),
variable = names(coef(mod_chol)),
model = 'mod_chol',
check.names = FALSE)
)## Waiting for profiling to be done...
## Waiting for profiling to be done...
library("ggplot2")
ggplot(results, # specification of the data
aes(x = model, y = coef)) + # specify what goes on which axis
geom_point() + # plot points for the coefficients
geom_errorbar(aes(ymin = `2.5 %`, # add confidence intervals
ymax = `97.5 %`),
width = 0.3) + # width of the horizontal lines
facet_wrap('variable', # make a separate plot per variable
scales = 'free_y') # allow the y-axes to vary between variablesSomething is clearly going wrong with the variable stage.
Investigate what is going on by looking at stage in relation to the outcome spiders.
table().
table(stage = pbc$stage,
spiders = pbc$spiders, exclude = NULL)## spiders
## stage 0 1 <NA>
## 1 15 1 5
## 2 58 9 25
## 3 90 30 35
## 4 59 50 35
## <NA> 0 0 6
There are very few observations with spiders for stage = 1 and stage = 2.
We saw that there is almost perfect separation between spiders for the patients with stage = 1, but as long as there is a patient, for every combination of covariate and outcome, we should not get such non-sense results.
We may have overlooked something when we created the table.
Check if the table did indeed include those cases that are analysed in mod_chol by checking if the table at least contains the correct number of cases.
summary().
sum(table(stage = pbc$stage, spiders = pbc$spiders))## [1] 312
In the table, we have 312 cases where both spiders and stage are observed.
The output of summary(mod_chol) showed the following:
## [...]
## Null deviance: 341.38 on 283 degrees of freedom
## Residual deviance: 289.81 on 275 degrees of freedom
## (134 observations deleted due to missingness)
## AIC: 307.81
##
## Number of Fisher Scoring iterations: 16
134 observations were excluded.
In the full pbc data we have nrow(pbc) = 418 observations.
We should, hence, be looking at only 418 - 134 = 284 observations!
There are some other variables in our model that have missing values!
Note:
The output of summary(mod_chol)) also shows that the deviance of the null model (the model with only intercept) had 283 degrees of freedom, which is N - 1. We could also derive from there that there were 283 + 1 cases in the analysis.
We need to exclude all cases that have missing values in any of the variables involved in mod_chol and check the table of stage and spiders again.
stage vs spiders using the subset.A convenient way to obtain all variables from a model is to use the function all.vars() on the model formula:
all.vars(formula(mod_chol))## [1] "spiders" "age" "sex" "bili" "stage" "ascites" "chol"
We can use this to get a summary of exactly those columns of the pbc data:
summary(pbc[, all.vars(formula(mod_chol))])## spiders age sex bili stage ascites chol
## 0 :222 Min. :26.28 m: 44 Min. : 0.300 1 : 21 Min. :0.00000 Min. : 120.0
## 1 : 90 1st Qu.:42.83 f:374 1st Qu.: 0.800 2 : 92 1st Qu.:0.00000 1st Qu.: 249.5
## NA's:106 Median :51.00 Median : 1.400 3 :155 Median :0.00000 Median : 309.5
## Mean :50.74 Mean : 3.221 4 :144 Mean :0.07692 Mean : 369.5
## 3rd Qu.:58.24 3rd Qu.: 3.400 NA's: 6 3rd Qu.:0.00000 3rd Qu.: 400.0
## Max. :78.44 Max. :28.000 Max. :1.00000 Max. :1775.0
## NA's :106 NA's :134
Note:
Alternatively, we could have looked at the sum of missing values in each column:
colSums(is.na(pbc[, all.vars(formula(mod_chol))]))## spiders age sex bili stage ascites chol
## 106 0 0 0 6 106 134
We can make a subset in which we exclude cases with missing values in variables other than stage and spiders. (We could also exclude the cases with missing values in those two variables, but they are shown in separate columns/rows in the table anyway.)
pbc_sub <- subset(pbc, !is.na(chol) & !is.na(stage))
# or:
# pbc_sub <- pbc[, !is.na(ascites) & !is.na(chol)]
table(pbc_sub$stage, pbc_sub$spiders, exclude = NULL)##
## 0 1
## 1 13 0
## 2 54 7
## 3 85 30
## 4 50 45
Indeed, there are no subjects with spiders in the stage = 1 group in the analysis. This explains the inflated effect estimates and standard errors.
Due to the dummy coding in which stage = 1 was the reference category, there was no information available to estimate any of the stage dummy effects, because they are relative to stage = 1.
Change the reference category of stage to another level and re-fit mod_chol.
We choose category 4 as reference category. There are (at least) two options to change the reference category:
pbc$stage_rev <- factor(pbc$stage, levels = 4:1, labels = 4:1)
pbc$stage_ref4 <- relevel(pbc$stage, ref = 4)
levels(pbc$stage)
## [1] "1" "2" "3" "4"
levels(pbc$stage_ref4)
## [1] "4" "1" "2" "3"
levels(pbc$stage_rev)
## [1] "4" "3" "2" "1"Even though stage_ref4 and stage_rev have different ordering of their levels (but thy are not ordered factors!), they give the same result when used as covariates in a regression model, since all non-reference categories are only compared with the reference category.
mod_chol2 <- glm(spiders ~ age + sex + chol + bili + stage_rev + ascites,
data = pbc, family = binomial())
summary(mod_chol2)##
## Call:
## glm(formula = spiders ~ age + sex + chol + bili + stage_rev +
## ascites, family = binomial(), data = pbc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6892 -0.7688 -0.5030 0.9416 2.2690
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.726e-01 1.098e+00 -0.885 0.375931
## age -1.149e-02 1.518e-02 -0.757 0.449141
## sexf 1.274e+00 5.857e-01 2.175 0.029666 *
## chol -1.710e-04 6.779e-04 -0.252 0.800890
## bili 8.606e-02 3.580e-02 2.404 0.016230 *
## stage_rev3 -9.030e-01 3.320e-01 -2.720 0.006534 **
## stage_rev2 -1.852e+00 4.769e-01 -3.883 0.000103 ***
## stage_rev1 -1.714e+01 1.074e+03 -0.016 0.987263
## ascites 9.736e-02 5.765e-01 0.169 0.865879
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 341.38 on 283 degrees of freedom
## Residual deviance: 289.81 on 275 degrees of freedom
## (134 observations deleted due to missingness)
## AIC: 307.81
##
## Number of Fisher Scoring iterations: 16
We see that the effect estimates and standard errors of levels 2 and 3 are back to normal values.
© Nicole Erler