Logistic Regression: Additional Exercise

Preface

R packages

In this practical, we use the R package ggplot2. It can be installed by running the following syntax:

install.packages("ggplot2")

Dataset

Like in the previous practical on logistic regression, we will use the pbc dataset from the package survival.

You can either load the survival package to get access to this dataset

library(survival)

or make a copy of this data

pbc <- survival::pbc

The pbc data contain (among others) the following variables:

str(pbc[, c(5:12, 20)])

## 'data.frame':    418 obs. of  9 variables:
##  $ age    : num  58.8 56.4 70.1 54.7 38.1 ...
##  $ sex    : Factor w/ 2 levels "m","f": 2 2 1 2 2 2 2 2 2 2 ...
##  $ ascites: int  1 0 0 0 0 0 0 0 0 1 ...
##  $ hepato : int  1 1 0 1 1 1 1 0 0 0 ...
##  $ spiders: int  1 1 0 1 1 0 0 0 1 1 ...
##  $ edema  : num  1 0 0.5 0.5 0 0 0 0 0 1 ...
##  $ bili   : num  14.5 1.1 1.4 1.8 3.4 0.8 1 0.3 3.2 12.6 ...
##  $ chol   : int  261 302 176 244 279 248 322 280 562 200 ...
##  $ stage  : int  4 3 4 4 3 3 3 3 2 4 ...

The relevant variables are

age: patient’s age in years
sex: patient’s sex
ascites: presence of ascites
spiders: presence of blood vessel malformations in the skin
bili: serum bilirubin
chol: serum cholesterol
stage: histologic stage of the disease

Exploring problems in logistic regression

In the previous practical, we have re-coded some of the categorical variables and fitted the logistic regression model.

If you are using a new R session, you need to first run the following syntax:

pbc$stage <- factor(pbc$stage)
pbc$spiders <- factor(pbc$spiders)

mod1 <- glm(spiders ~ age + sex + bili + stage + ascites, data = pbc,
            family = binomial())

We would now like to add another continuous variable to mod1: the variable chol.

Task 1

Update mod1 to include chol.
Investigate how the results change.

Solution 1

mod_chol <- update(mod1, formula = . ~ . + chol)
summary(mod_chol)

## 
## Call:
## glm(formula = spiders ~ age + sex + bili + stage + ascites + 
##     chol, family = binomial(), data = pbc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6892  -0.7688  -0.5030   0.9416   2.2690  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -1.812e+01  1.074e+03  -0.017   0.9865  
## age         -1.149e-02  1.518e-02  -0.757   0.4491  
## sexf         1.274e+00  5.857e-01   2.175   0.0297 *
## bili         8.606e-02  3.580e-02   2.404   0.0162 *
## stage2       1.529e+01  1.074e+03   0.014   0.9886  
## stage3       1.624e+01  1.074e+03   0.015   0.9879  
## stage4       1.714e+01  1.074e+03   0.016   0.9873  
## ascites      9.736e-02  5.765e-01   0.169   0.8659  
## chol        -1.710e-04  6.779e-04  -0.252   0.8009  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 341.38  on 283  degrees of freedom
## Residual deviance: 289.81  on 275  degrees of freedom
##   (134 observations deleted due to missingness)
## AIC: 307.81
## 
## Number of Fisher Scoring iterations: 16

We can compare the coefficients form mod1 and mod_chol and notice that the (Intercept), and the coefficients for stage are much more extreme than the ones in mod1:

round(
  cbind(mod1 = coef(mod1)[names(coef(mod_chol))],
        mod_chol = coef(mod_chol)
  ), 3)

##               mod1 mod_chol
## (Intercept) -2.936  -18.116
## age         -0.021   -0.011
## sexf         1.213    1.274
## bili         0.099    0.086
## stage2       0.651   15.292
## stage3       1.371   16.241
## stage4       2.279   17.144
## ascites      0.262    0.097
## <NA>            NA    0.000

Note:

If we would just use cbind() to combine the coefficients into a matrix we could not be sure that the coefficients of the same variable are in the same row.

a <- c(x = 1, y = 2, z = 3)
b <- c(x = 1, z = 3, y = 2)
cbind(a, b)

##   a b
## x 1 1
## y 2 3
## z 3 2

Moreover, the “missing” effect for chol in mod1 will just be filled in with the first value of the vector.

cbind(coef(mod1), coef(mod_chol))

## Warning in cbind(coef(mod1), coef(mod_chol)): number of rows of result is not a multiple of vector
## length (arg 1)

##                    [,1]          [,2]
## (Intercept) -2.93568220 -1.811631e+01
## age         -0.02050837 -1.148978e-02
## sexf         1.21276329  1.273726e+00
## bili         0.09867415  8.606048e-02
## stage2       0.65103163  1.529183e+01
## stage3       1.37091993  1.624066e+01
## stage4       2.27857603  1.714371e+01
## ascites      0.26188382  9.736020e-02
## chol        -2.93568220 -1.709538e-04

By using coef(mod1)[names(coef(mod_chol))] we make sure that the coefficients from mod1 are given in the same order as the coefficients from mod_chol. Since there is no chol in coef(mod1), NA is filled in:

coef(mod1)[names(coef(mod_chol))]

## (Intercept)         age        sexf        bili      stage2      stage3      stage4     ascites 
## -2.93568220 -0.02050837  1.21276329  0.09867415  0.65103163  1.37091993  2.27857603  0.26188382 
##        <NA> 
##          NA

Visualization

To compare the coefficients from different models it is useful to plot coefficients with their corresponding confidence intervals.

Using base R plots:

# coefficients & CI from mod1
res1 <- cbind(coef = coef(mod1), confint(mod1))

## Waiting for profiling to be done...

res1 <- rbind(res1, chol = c(NA, NA, NA)) # add an "empty" row for chol

# coefficients & CI from mod_chol
res3 <- cbind(coef = coef(mod_chol), confint(mod_chol))

## Waiting for profiling to be done...

# graphical settings:
# * mfrow = c(3, 3) print the plots in a grid with 3 rows and 3 columns
# * mar = c(3, 3, 2, 1): adjust the plot margins (looks nicer)
par(mfrow = c(3, 3), mar = c(3, 3, 2, 1))

for (k in rownames(res1)) {                               # loop over all variables
  plot(x = c(1, 2),                                       # position on the x-axis
       y = c(res1[k, 'coef'], res3[k, 'coef']),           # coefficient value on the y-axis
       main = k,                                          # set variable name as title
       xlim = c(0, 3),                                    # set limit of the x-axis (looks nicer)
       ylim = range(res1[k, ], res3[k, ], na.rm = TRUE),  # set the limits of the y-axis
                                                          # so that also the CIs fit in
       xaxt = 'n',                                        # don't plot the x-axis (we want our own version)
       ylab = 'log OR & 95% CI')                          # specify the y-axis label
  
  # specification of the x-axis
  axis(side = 1,                    # draw axis at the bottom of the plot
       at = c(1, 2),                # locations of the tick-marks on the x-axis
       labels = c('mod1', 'mod_chol'))  # labels of the tick-marks
  
  # add vertical lines representing the confidence intervals
  lines(x = c(1, 1), y = res1[k, c('2.5 %', '97.5 %')])
  lines(x = c(2, 2), y = res3[k, c('2.5 %', '97.5 %')])
}

Using ggplot2:

results <- rbind(
  # data.frame containing results from mod1
  data.frame(coef = coef(mod1), 
             confint(mod1), 
             variable = names(coef(mod1)),
             model = 'mod1',
             check.names = FALSE),
  
  # data.frame containing results from mod_chol
  data.frame(coef = coef(mod_chol), 
             confint(mod_chol), 
             variable = names(coef(mod_chol)),
             model = 'mod_chol',
             check.names = FALSE)
)

## Waiting for profiling to be done...
## Waiting for profiling to be done...

library("ggplot2")
ggplot(results,                           # specification of the data
       aes(x = model, y = coef)) +        # specify what goes on which axis
  geom_point() +                          # plot points for the coefficients
  geom_errorbar(aes(ymin = `2.5 %`,       # add confidence intervals
                    ymax = `97.5 %`),
                width = 0.3) +            # width of the horizontal lines
  facet_wrap('variable',                  # make a separate plot per variable
             scales = 'free_y')           # allow the y-axes to vary between variables

Task 2

Something is clearly going wrong with the variable stage.

Investigate what is going on by looking at stage in relation to the outcome spiders.

Use a table().

Solution 2

table(stage = pbc$stage,
      spiders = pbc$spiders, exclude = NULL)

##       spiders
## stage   0  1 <NA>
##   1    15  1    5
##   2    58  9   25
##   3    90 30   35
##   4    59 50   35
##   <NA>  0  0    6

There are very few observations with spiders for stage = 1 and stage = 2.

Task 3

We saw that there is almost perfect separation between spiders for the patients with stage = 1, but as long as there is a patient, for every combination of covariate and outcome, we should not get such non-sense results.

We may have overlooked something when we created the table.

Check if the table did indeed include those cases that are analysed in mod_chol by checking if the table at least contains the correct number of cases.

You can find relevant info in the output of the summary().

Solution 3

sum(table(stage = pbc$stage, spiders = pbc$spiders))

## [1] 312

In the table, we have 312 cases where both spiders and stage are observed.

The output of summary(mod_chol) showed the following:

## [...]
##     Null deviance: 341.38  on 283  degrees of freedom
## Residual deviance: 289.81  on 275  degrees of freedom
##   (134 observations deleted due to missingness)
## AIC: 307.81
## 
## Number of Fisher Scoring iterations: 16

134 observations were excluded.

In the full pbc data we have nrow(pbc) = 418 observations.

We should, hence, be looking at only 418 - 134 = 284 observations!

There are some other variables in our model that have missing values!

Note:

The output of summary(mod_chol)) also shows that the deviance of the null model (the model with only intercept) had 283 degrees of freedom, which is N - 1. We could also derive from there that there were 283 + 1 cases in the analysis.

Task 4

We need to exclude all cases that have missing values in any of the variables involved in mod_chol and check the table of stage and spiders again.

Check which variables have missing values.
Make a subset excluding those cases.
Create the table of stage vs spiders using the subset.

Solution 4

A convenient way to obtain all variables from a model is to use the function all.vars() on the model formula:

all.vars(formula(mod_chol))

## [1] "spiders" "age"     "sex"     "bili"    "stage"   "ascites" "chol"

We can use this to get a summary of exactly those columns of the pbc data:

summary(pbc[, all.vars(formula(mod_chol))])

##  spiders         age        sex          bili         stage        ascites             chol       
##  0   :222   Min.   :26.28   m: 44   Min.   : 0.300   1   : 21   Min.   :0.00000   Min.   : 120.0  
##  1   : 90   1st Qu.:42.83   f:374   1st Qu.: 0.800   2   : 92   1st Qu.:0.00000   1st Qu.: 249.5  
##  NA's:106   Median :51.00           Median : 1.400   3   :155   Median :0.00000   Median : 309.5  
##             Mean   :50.74           Mean   : 3.221   4   :144   Mean   :0.07692   Mean   : 369.5  
##             3rd Qu.:58.24           3rd Qu.: 3.400   NA's:  6   3rd Qu.:0.00000   3rd Qu.: 400.0  
##             Max.   :78.44           Max.   :28.000              Max.   :1.00000   Max.   :1775.0  
##                                                                 NA's   :106       NA's   :134

Note:

Alternatively, we could have looked at the sum of missing values in each column:

colSums(is.na(pbc[, all.vars(formula(mod_chol))]))

## spiders     age     sex    bili   stage ascites    chol 
##     106       0       0       0       6     106     134

We can make a subset in which we exclude cases with missing values in variables other than stage and spiders. (We could also exclude the cases with missing values in those two variables, but they are shown in separate columns/rows in the table anyway.)

pbc_sub <- subset(pbc, !is.na(chol) & !is.na(stage))

# or:
# pbc_sub <- pbc[, !is.na(ascites) & !is.na(chol)]

table(pbc_sub$stage, pbc_sub$spiders, exclude = NULL)

##    
##      0  1
##   1 13  0
##   2 54  7
##   3 85 30
##   4 50 45

Indeed, there are no subjects with spiders in the stage = 1 group in the analysis. This explains the inflated effect estimates and standard errors.

Changing the reference category

Due to the dummy coding in which stage = 1 was the reference category, there was no information available to estimate any of the stage dummy effects, because they are relative to stage = 1.

Task 1

Change the reference category of stage to another level and re-fit mod_chol.

Solution 1

We choose category 4 as reference category. There are (at least) two options to change the reference category:

pbc$stage_rev <- factor(pbc$stage, levels = 4:1, labels = 4:1)
pbc$stage_ref4 <- relevel(pbc$stage, ref = 4)

levels(pbc$stage)
## [1] "1" "2" "3" "4"
levels(pbc$stage_ref4)
## [1] "4" "1" "2" "3"
levels(pbc$stage_rev)
## [1] "4" "3" "2" "1"

Even though stage_ref4 and stage_rev have different ordering of their levels (but thy are not ordered factors!), they give the same result when used as covariates in a regression model, since all non-reference categories are only compared with the reference category.

mod_chol2 <- glm(spiders ~ age + sex + chol + bili + stage_rev + ascites, 
                 data = pbc, family = binomial())
summary(mod_chol2)

## 
## Call:
## glm(formula = spiders ~ age + sex + chol + bili + stage_rev + 
##     ascites, family = binomial(), data = pbc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6892  -0.7688  -0.5030   0.9416   2.2690  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.726e-01  1.098e+00  -0.885 0.375931    
## age         -1.149e-02  1.518e-02  -0.757 0.449141    
## sexf         1.274e+00  5.857e-01   2.175 0.029666 *  
## chol        -1.710e-04  6.779e-04  -0.252 0.800890    
## bili         8.606e-02  3.580e-02   2.404 0.016230 *  
## stage_rev3  -9.030e-01  3.320e-01  -2.720 0.006534 ** 
## stage_rev2  -1.852e+00  4.769e-01  -3.883 0.000103 ***
## stage_rev1  -1.714e+01  1.074e+03  -0.016 0.987263    
## ascites      9.736e-02  5.765e-01   0.169 0.865879    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 341.38  on 283  degrees of freedom
## Residual deviance: 289.81  on 275  degrees of freedom
##   (134 observations deleted due to missingness)
## AIC: 307.81
## 
## Number of Fisher Scoring iterations: 16

We see that the effect estimates and standard errors of levels 2 and 3 are back to normal values.