In this practical, we use the R package ggplot2. It can be installed by running the following syntax:
install.packages("ggplot2")
Like in the previous practical on logistic regression, we will use the pbc dataset from the package survival.
You can either load the survival package to get access to this dataset
library(survival)
or make a copy of this data
<- survival::pbc pbc
The pbc data contain (among others) the following variables:
str(pbc[, c(5:12, 20)])
## 'data.frame': 418 obs. of 9 variables:
## $ age : num 58.8 56.4 70.1 54.7 38.1 ...
## $ sex : Factor w/ 2 levels "m","f": 2 2 1 2 2 2 2 2 2 2 ...
## $ ascites: int 1 0 0 0 0 0 0 0 0 1 ...
## $ hepato : int 1 1 0 1 1 1 1 0 0 0 ...
## $ spiders: int 1 1 0 1 1 0 0 0 1 1 ...
## $ edema : num 1 0 0.5 0.5 0 0 0 0 0 1 ...
## $ bili : num 14.5 1.1 1.4 1.8 3.4 0.8 1 0.3 3.2 12.6 ...
## $ chol : int 261 302 176 244 279 248 322 280 562 200 ...
## $ stage : int 4 3 4 4 3 3 3 3 2 4 ...
The relevant variables are
age
: patient’s age in yearssex
: patient’s sexascites
: presence of ascitesspiders
: presence of blood vessel malformations in the skinbili
: serum bilirubinchol
: serum cholesterolstage
: histologic stage of the diseaseIn the previous practical, we have re-coded some of the categorical variables and fitted the logistic regression model.
If you are using a new R session, you need to first run the following syntax:
$stage <- factor(pbc$stage)
pbc$spiders <- factor(pbc$spiders)
pbc
<- glm(spiders ~ age + sex + bili + stage + ascites, data = pbc,
mod1 family = binomial())
We would now like to add another continuous variable to mod1
: the variable chol
.
mod1
to include chol
.<- update(mod1, formula = . ~ . + chol)
mod_chol summary(mod_chol)
##
## Call:
## glm(formula = spiders ~ age + sex + bili + stage + ascites +
## chol, family = binomial(), data = pbc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6892 -0.7688 -0.5030 0.9416 2.2690
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.812e+01 1.074e+03 -0.017 0.9865
## age -1.149e-02 1.518e-02 -0.757 0.4491
## sexf 1.274e+00 5.857e-01 2.175 0.0297 *
## bili 8.606e-02 3.580e-02 2.404 0.0162 *
## stage2 1.529e+01 1.074e+03 0.014 0.9886
## stage3 1.624e+01 1.074e+03 0.015 0.9879
## stage4 1.714e+01 1.074e+03 0.016 0.9873
## ascites 9.736e-02 5.765e-01 0.169 0.8659
## chol -1.710e-04 6.779e-04 -0.252 0.8009
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 341.38 on 283 degrees of freedom
## Residual deviance: 289.81 on 275 degrees of freedom
## (134 observations deleted due to missingness)
## AIC: 307.81
##
## Number of Fisher Scoring iterations: 16
We can compare the coefficients form mod1
and mod_chol
and notice that the (Intercept)
, and the coefficients for stage
are much more extreme than the ones in mod1
:
round(
cbind(mod1 = coef(mod1)[names(coef(mod_chol))],
mod_chol = coef(mod_chol)
3) ),
## mod1 mod_chol
## (Intercept) -2.936 -18.116
## age -0.021 -0.011
## sexf 1.213 1.274
## bili 0.099 0.086
## stage2 0.651 15.292
## stage3 1.371 16.241
## stage4 2.279 17.144
## ascites 0.262 0.097
## <NA> NA 0.000
Note:
If we would just use cbind()
to combine the coefficients into a matrix we could not be sure that the coefficients of the same variable are in the same row.
<- c(x = 1, y = 2, z = 3)
a <- c(x = 1, z = 3, y = 2)
b cbind(a, b)
## a b
## x 1 1
## y 2 3
## z 3 2
Moreover, the “missing” effect for chol
in mod1
will just be filled in with the first value of the vector.
cbind(coef(mod1), coef(mod_chol))
## Warning in cbind(coef(mod1), coef(mod_chol)): number of rows of result is not a multiple of vector
## length (arg 1)
## [,1] [,2]
## (Intercept) -2.93568220 -1.811631e+01
## age -0.02050837 -1.148978e-02
## sexf 1.21276329 1.273726e+00
## bili 0.09867415 8.606048e-02
## stage2 0.65103163 1.529183e+01
## stage3 1.37091993 1.624066e+01
## stage4 2.27857603 1.714371e+01
## ascites 0.26188382 9.736020e-02
## chol -2.93568220 -1.709538e-04
By using coef(mod1)[names(coef(mod_chol))]
we make sure that the coefficients from mod1
are given in the same order as the coefficients from mod_chol
. Since there is no chol
in coef(mod1)
, NA
is filled in:
coef(mod1)[names(coef(mod_chol))]
## (Intercept) age sexf bili stage2 stage3 stage4 ascites
## -2.93568220 -0.02050837 1.21276329 0.09867415 0.65103163 1.37091993 2.27857603 0.26188382
## <NA>
## NA
To compare the coefficients from different models it is useful to plot coefficients with their corresponding confidence intervals.
Using base R plots:
# coefficients & CI from mod1
<- cbind(coef = coef(mod1), confint(mod1)) res1
## Waiting for profiling to be done...
<- rbind(res1, chol = c(NA, NA, NA)) # add an "empty" row for chol
res1
# coefficients & CI from mod_chol
<- cbind(coef = coef(mod_chol), confint(mod_chol)) res3
## Waiting for profiling to be done...
# graphical settings:
# * mfrow = c(3, 3) print the plots in a grid with 3 rows and 3 columns
# * mar = c(3, 3, 2, 1): adjust the plot margins (looks nicer)
par(mfrow = c(3, 3), mar = c(3, 3, 2, 1))
for (k in rownames(res1)) { # loop over all variables
plot(x = c(1, 2), # position on the x-axis
y = c(res1[k, 'coef'], res3[k, 'coef']), # coefficient value on the y-axis
main = k, # set variable name as title
xlim = c(0, 3), # set limit of the x-axis (looks nicer)
ylim = range(res1[k, ], res3[k, ], na.rm = TRUE), # set the limits of the y-axis
# so that also the CIs fit in
xaxt = 'n', # don't plot the x-axis (we want our own version)
ylab = 'log OR & 95% CI') # specify the y-axis label
# specification of the x-axis
axis(side = 1, # draw axis at the bottom of the plot
at = c(1, 2), # locations of the tick-marks on the x-axis
labels = c('mod1', 'mod_chol')) # labels of the tick-marks
# add vertical lines representing the confidence intervals
lines(x = c(1, 1), y = res1[k, c('2.5 %', '97.5 %')])
lines(x = c(2, 2), y = res3[k, c('2.5 %', '97.5 %')])
}
Using ggplot2:
<- rbind(
results # data.frame containing results from mod1
data.frame(coef = coef(mod1),
confint(mod1),
variable = names(coef(mod1)),
model = 'mod1',
check.names = FALSE),
# data.frame containing results from mod_chol
data.frame(coef = coef(mod_chol),
confint(mod_chol),
variable = names(coef(mod_chol)),
model = 'mod_chol',
check.names = FALSE)
)
## Waiting for profiling to be done...
## Waiting for profiling to be done...
library("ggplot2")
ggplot(results, # specification of the data
aes(x = model, y = coef)) + # specify what goes on which axis
geom_point() + # plot points for the coefficients
geom_errorbar(aes(ymin = `2.5 %`, # add confidence intervals
ymax = `97.5 %`),
width = 0.3) + # width of the horizontal lines
facet_wrap('variable', # make a separate plot per variable
scales = 'free_y') # allow the y-axes to vary between variables
Something is clearly going wrong with the variable stage
.
Investigate what is going on by looking at stage
in relation to the outcome spiders
.
table()
.
table(stage = pbc$stage,
spiders = pbc$spiders, exclude = NULL)
## spiders
## stage 0 1 <NA>
## 1 15 1 5
## 2 58 9 25
## 3 90 30 35
## 4 59 50 35
## <NA> 0 0 6
There are very few observations with spiders
for stage = 1
and stage = 2
.
We saw that there is almost perfect separation between spiders
for the patients with stage = 1
, but as long as there is a patient, for every combination of covariate and outcome, we should not get such non-sense results.
We may have overlooked something when we created the table.
Check if the table did indeed include those cases that are analysed in mod_chol
by checking if the table at least contains the correct number of cases.
summary()
.
sum(table(stage = pbc$stage, spiders = pbc$spiders))
## [1] 312
In the table, we have 312 cases where both spiders
and stage
are observed.
The output of summary(mod_chol)
showed the following:
## [...]
## Null deviance: 341.38 on 283 degrees of freedom
## Residual deviance: 289.81 on 275 degrees of freedom
## (134 observations deleted due to missingness)
## AIC: 307.81
##
## Number of Fisher Scoring iterations: 16
134 observations were excluded.
In the full pbc data we have nrow(pbc) = 418
observations.
We should, hence, be looking at only 418 - 134 = 284
observations!
There are some other variables in our model that have missing values!
Note:
The output of summary(mod_chol)
) also shows that the deviance of the null model (the model with only intercept) had 283 degrees of freedom, which is N - 1. We could also derive from there that there were 283 + 1 cases in the analysis.
We need to exclude all cases that have missing values in any of the variables involved in mod_chol
and check the table of stage
and spiders
again.
stage
vs spiders
using the subset.A convenient way to obtain all variables from a model is to use the function all.vars()
on the model formula:
all.vars(formula(mod_chol))
## [1] "spiders" "age" "sex" "bili" "stage" "ascites" "chol"
We can use this to get a summary of exactly those columns of the pbc data:
summary(pbc[, all.vars(formula(mod_chol))])
## spiders age sex bili stage ascites chol
## 0 :222 Min. :26.28 m: 44 Min. : 0.300 1 : 21 Min. :0.00000 Min. : 120.0
## 1 : 90 1st Qu.:42.83 f:374 1st Qu.: 0.800 2 : 92 1st Qu.:0.00000 1st Qu.: 249.5
## NA's:106 Median :51.00 Median : 1.400 3 :155 Median :0.00000 Median : 309.5
## Mean :50.74 Mean : 3.221 4 :144 Mean :0.07692 Mean : 369.5
## 3rd Qu.:58.24 3rd Qu.: 3.400 NA's: 6 3rd Qu.:0.00000 3rd Qu.: 400.0
## Max. :78.44 Max. :28.000 Max. :1.00000 Max. :1775.0
## NA's :106 NA's :134
Note:
Alternatively, we could have looked at the sum of missing values in each column:
colSums(is.na(pbc[, all.vars(formula(mod_chol))]))
## spiders age sex bili stage ascites chol
## 106 0 0 0 6 106 134
We can make a subset in which we exclude cases with missing values in variables other than stage
and spiders
. (We could also exclude the cases with missing values in those two variables, but they are shown in separate columns/rows in the table anyway.)
<- subset(pbc, !is.na(chol) & !is.na(stage))
pbc_sub
# or:
# pbc_sub <- pbc[, !is.na(ascites) & !is.na(chol)]
table(pbc_sub$stage, pbc_sub$spiders, exclude = NULL)
##
## 0 1
## 1 13 0
## 2 54 7
## 3 85 30
## 4 50 45
Indeed, there are no subjects with spiders
in the stage = 1
group in the analysis. This explains the inflated effect estimates and standard errors.
Due to the dummy coding in which stage = 1
was the reference category, there was no information available to estimate any of the stage
dummy effects, because they are relative to stage = 1
.
Change the reference category of stage
to another level and re-fit mod_chol
.
We choose category 4 as reference category. There are (at least) two options to change the reference category:
$stage_rev <- factor(pbc$stage, levels = 4:1, labels = 4:1)
pbc$stage_ref4 <- relevel(pbc$stage, ref = 4)
pbc
levels(pbc$stage)
## [1] "1" "2" "3" "4"
levels(pbc$stage_ref4)
## [1] "4" "1" "2" "3"
levels(pbc$stage_rev)
## [1] "4" "3" "2" "1"
Even though stage_ref4
and stage_rev
have different ordering of their levels (but thy are not ordered factors!), they give the same result when used as covariates in a regression model, since all non-reference categories are only compared with the reference category.
<- glm(spiders ~ age + sex + chol + bili + stage_rev + ascites,
mod_chol2 data = pbc, family = binomial())
summary(mod_chol2)
##
## Call:
## glm(formula = spiders ~ age + sex + chol + bili + stage_rev +
## ascites, family = binomial(), data = pbc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6892 -0.7688 -0.5030 0.9416 2.2690
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.726e-01 1.098e+00 -0.885 0.375931
## age -1.149e-02 1.518e-02 -0.757 0.449141
## sexf 1.274e+00 5.857e-01 2.175 0.029666 *
## chol -1.710e-04 6.779e-04 -0.252 0.800890
## bili 8.606e-02 3.580e-02 2.404 0.016230 *
## stage_rev3 -9.030e-01 3.320e-01 -2.720 0.006534 **
## stage_rev2 -1.852e+00 4.769e-01 -3.883 0.000103 ***
## stage_rev1 -1.714e+01 1.074e+03 -0.016 0.987263
## ascites 9.736e-02 5.765e-01 0.169 0.865879
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 341.38 on 283 degrees of freedom
## Residual deviance: 289.81 on 275 degrees of freedom
## (134 observations deleted due to missingness)
## AIC: 307.81
##
## Number of Fisher Scoring iterations: 16
We see that the effect estimates and standard errors of levels 2 and 3 are back to normal values.
© Nicole Erler