The aim of this practical is to write a function that returns a nice looking summary of the results from a linear or logistic regression model.

Creating Data

Task

Create a data.frame with example data that we can use to fit example models to test our function.

To create data, you could draw random numbers from a normal distribution, using rnorm(), from a uniform distribution, using runif() or from a binomial distribution, using rbinom(). The functions sample() and factor() could also be useful.

Solution

N <- 1000
testdat <- data.frame(y = rnorm(N),
                      ybin = factor(rbinom(N, size = 1, prob = 0.7)),
                      x1 = runif(N, min = 0, max = 4),
                      x2 = factor(sample(paste0('cat', 1:4), size = N, replace = TRUE))
)

testdat$x3 <- 0.5 * testdat$y + 0.5 * as.numeric(testdat$ybin) + rnorm(N, 0, 1)

summary(testdat)
##        y            ybin          x1              x2            x3          
##  Min.   :-3.16613   0:323   Min.   :0.002036   cat1:226   Min.   :-3.21724  
##  1st Qu.:-0.71303   1:677   1st Qu.:1.055275   cat2:243   1st Qu.:-0.01074  
##  Median :-0.05912           Median :1.909885   cat3:287   Median : 0.82795  
##  Mean   :-0.02762           Mean   :1.960608   cat4:244   Mean   : 0.82695  
##  3rd Qu.: 0.63855           3rd Qu.:2.954759              3rd Qu.: 1.62160  
##  Max.   : 3.70290           Max.   :3.999473              Max.   : 4.36887

Note:

We create one variable that is (strongly) associated with the outcome variables to make sure we have some significant p-values.

Example Models

Task

Using your test data, fit a linear and a logistic regression model.

Solution

test_lm <- lm(y ~ ., data = testdat)
test_logit <- glm(ybin ~ ., data = testdat, family = binomial())

Building the Summary

Task 1

Our summary should contain the estimates of the regression coefficients, the 95% confidence interval, and the p-value.

Explore the summary() of the linear regression model (not the printed output but the object that is returned) to find out how you can access the p-values from a lm.

Solution 1

summary_lm <- summary(test_lm)
class(summary_lm)
## [1] "summary.lm"
names(summary_lm)
##  [1] "call"          "terms"         "residuals"     "coefficients"  "aliased"       "sigma"        
##  [7] "df"            "r.squared"     "adj.r.squared" "fstatistic"    "cov.unscaled"
is.list(summary_lm)
## [1] TRUE
summary_lm$coefficients
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) -0.22409323 0.09212938 -2.4323754 1.517595e-02
## ybin1       -0.22828842 0.06517862 -3.5025048 4.814306e-04
## x1          -0.01443955 0.02641363 -0.5466703 5.847281e-01
## x2cat2       0.10598917 0.08662939  1.2234782 2.214394e-01
## x2cat3       0.08122001 0.08320072  0.9761936 3.292063e-01
## x2cat4       0.03061912 0.08629760  0.3548085 7.228084e-01
## x3           0.39034164 0.02553041 15.2892816 1.504170e-47
class(summary_lm$coefficients)
## [1] "matrix" "array"

Task 2

Combine the regression coefficients, the 95% confidence interval and the p-values from the linear regression model into one data.frame.

Make sure the columns are named properly (for example: estimate, 2.5 %, 97.5 %, p-value).

You cannot get the confidence interval from the model itself, but you have already seen and used a function that gives you the confidence interval.
Try the argument check.names = FALSE.

Solution 2

dat <- data.frame(estimate = coef(test_lm),
                  confint(test_lm),
                  'p-value' = summary(test_lm)$coefficients[, 'Pr(>|t|)'],
                  check.names = FALSE)
dat
##                estimate       2.5 %      97.5 %      p-value
## (Intercept) -0.22409323 -0.40488386 -0.04330261 1.517595e-02
## ybin1       -0.22828842 -0.35619205 -0.10038478 4.814306e-04
## x1          -0.01443955 -0.06627250  0.03739340 5.847281e-01
## x2cat2       0.10598917 -0.06400852  0.27598686 2.214394e-01
## x2cat3       0.08122001 -0.08204942  0.24448944 3.292063e-01
## x2cat4       0.03061912 -0.13872748  0.19996573 7.228084e-01
## x3           0.39034164  0.34024189  0.44044139 1.504170e-47

Note:

We need to wrap p-value in quotes because it does contain a “-” and is not a syntactically valid name.

estimate is a syntactically valid name.

Task 3

  • Write a function that has a linear regression model as input and returns a summary like the one we have constructed in the previous solution.
  • Test the function with the example linear regression.

Solution 3

fun1 <- function(model) {
  dat <- data.frame(estimate = coef(model),
                  confint(model),
                  'p-value' = summary(model)$coefficients[, 'Pr(>|t|)'],
                  check.names = FALSE)
  
  dat
}

fun1(test_lm)
##                estimate       2.5 %      97.5 %      p-value
## (Intercept) -0.22409323 -0.40488386 -0.04330261 1.517595e-02
## ybin1       -0.22828842 -0.35619205 -0.10038478 4.814306e-04
## x1          -0.01443955 -0.06627250  0.03739340 5.847281e-01
## x2cat2       0.10598917 -0.06400852  0.27598686 2.214394e-01
## x2cat3       0.08122001 -0.08204942  0.24448944 3.292063e-01
## x2cat4       0.03061912 -0.13872748  0.19996573 7.228084e-01
## x3           0.39034164  0.34024189  0.44044139 1.504170e-47

Improving the Look

Task 1

To make the output look nicer, we want to limit the decimal places in the output to a number specified by the user. Implement this using an argument decimals and decide on a default value for it.

You can use the function round().

Solution 1

fun2 <- function(model, decimals = 3) {
  dat <- data.frame(estimate = coef(model),
                  confint(model),
                  'p-value' = summary(model)$coefficients[, 'Pr(>|t|)'],
                  check.names = FALSE)
  
  round(dat, decimals)
}

fun2(test_lm)
##             estimate  2.5 % 97.5 % p-value
## (Intercept)   -0.224 -0.405 -0.043   0.015
## ybin1         -0.228 -0.356 -0.100   0.000
## x1            -0.014 -0.066  0.037   0.585
## x2cat2         0.106 -0.064  0.276   0.221
## x2cat3         0.081 -0.082  0.244   0.329
## x2cat4         0.031 -0.139  0.200   0.723
## x3             0.390  0.340  0.440   0.000

Task 2

In another practical, we have seen that the function format.pval() will convert values below a certain value in to “<…” notation.

Use this in our summary function.

format.pval() has the additional arguments digits and eps.
To use the $ to select a column with a syntactically invalid name you need to wrap the name in “`”, i.e., `the name`.
Values can also be represented as \(10^{-3}\).

Solution 2

fun3 <- function(model, decimals = 3) {
  dat <- data.frame(estimate = coef(model),
                    confint(model),
                    check.names = FALSE)
  
  dat <- round(dat, decimals)
  
  dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
                               eps = 10^(-decimals), digits = decimals)
  
  dat
}

fun3(test_lm)
##             estimate  2.5 % 97.5 % p-value
## (Intercept)   -0.224 -0.405 -0.043  0.0152
## ybin1         -0.228 -0.356 -0.100  <0.001
## x1            -0.014 -0.066  0.037  0.5847
## x2cat2         0.106 -0.064  0.276  0.2214
## x2cat3         0.081 -0.082  0.244  0.3292
## x2cat4         0.031 -0.139  0.200  0.7228
## x3             0.390  0.340  0.440  <0.001
fun3(test_lm, decimals = 5)
##             estimate    2.5 %   97.5 %    p-value
## (Intercept) -0.22409 -0.40488 -0.04330 0.01517595
## ybin1       -0.22829 -0.35619 -0.10038 0.00048143
## x1          -0.01444 -0.06627  0.03739 0.58472808
## x2cat2       0.10599 -0.06401  0.27599 0.22143938
## x2cat3       0.08122 -0.08205  0.24449 0.32920630
## x2cat4       0.03062 -0.13873  0.19997 0.72280840
## x3           0.39034  0.34024  0.44044    < 1e-05
fun3(test_lm, decimals = 1)
##             estimate 2.5 % 97.5 % p-value
## (Intercept)     -0.2  -0.4    0.0    <0.1
## ybin1           -0.2  -0.4   -0.1    <0.1
## x1               0.0  -0.1    0.0     0.6
## x2cat2           0.1  -0.1    0.3     0.2
## x2cat3           0.1  -0.1    0.2     0.3
## x2cat4           0.0  -0.1    0.2     0.7
## x3               0.4   0.3    0.4    <0.1

Note:

We do not want to first round the p-value and then apply format.pval() because this might give wrong results in specific cases, for example:

pv <- 0.0009
format.pval(pv, digits = 2, eps = 0.001)
## [1] "<0.001"
round(pv, 3)
## [1] 0.001
format.pval(round(pv, 3), digits = 2, eps = 0.001)
## [1] "0.001"

Logistic Regression

Task 1

We also would like to use this function for output from logistic regression models.

Does the current version of the function work with logistic regression models?

  • If it does: does it give us the summary that we are interested in?
  • If it does not: investigate what the problem is.

Solution 1

fun3(test_logit)
## Waiting for profiling to be done...
## Error in summary(model)$coefficients[, "Pr(>|t|)"]: subscript out of bounds

We get an error message “subscript out of bounds”. This means we are trying to select an element that does not exist.

To find the problem, we check each part of the function body, replacing model by test_logit:

decimals <- 3

dat <- data.frame(estimate = coef(test_logit),
                  confint(test_logit),
                  check.names = FALSE)
## Waiting for profiling to be done...
dat <- round(dat, decimals)

dat$`p-value` <- format.pval(summary(test_logit)$coefficients[, 'Pr(>|t|)'],
                             eps = 10^(-decimals), digits = decimals)
## Error in summary(test_logit)$coefficients[, "Pr(>|t|)"]: subscript out of bounds

The problem is in the p-value column, so we split this syntax up further:

summary(test_logit)$coefficients[, 'Pr(>|t|)']
## Error in summary(test_logit)$coefficients[, "Pr(>|t|)"]: subscript out of bounds

Same error, so probably the column 'Pr(>|t|)' does not exist.

summary(test_logit)$coefficients
##               Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)  0.3866473 0.19527421  1.9800226 4.770099e-02
## y           -0.2611655 0.07590243 -3.4408056 5.799851e-04
## x1          -0.1111445 0.06261764 -1.7749710 7.590268e-02
## x2cat2       0.4532677 0.20868071  2.1720631 2.985090e-02
## x2cat3       0.1598570 0.19434209  0.8225549 4.107611e-01
## x2cat4       0.1164228 0.20042718  0.5808733 5.613258e-01
## x3           0.5353623 0.06973867  7.6766919 1.632494e-14

Indeed, for a logistic regression model, the column is called Pr(>|z|)!

Task 2

Adjust the function so that it works with both linear and logistic regression models.

Solution 2

fun4 <- function(model, decimals = 3, modeltype) {
  dat <- data.frame(estimate = coef(model),
                    confint(model),
                    check.names = FALSE)
  
  dat <- round(dat, decimals)
  
  if (modeltype == 'lm') {
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
                                 eps = 10^(-decimals), digits = decimals)
  } else {
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
                                 eps = 10^(-decimals), digits = decimals)
  }
  
  dat
}

fun4(test_logit, modeltype = 'logit')
## Waiting for profiling to be done...
##             estimate  2.5 % 97.5 % p-value
## (Intercept)    0.387  0.006  0.772  0.0477
## y             -0.261 -0.411 -0.113  <0.001
## x1            -0.111 -0.234  0.011  0.0759
## x2cat2         0.453  0.046  0.865  0.0299
## x2cat3         0.160 -0.222  0.541  0.4108
## x2cat4         0.116 -0.277  0.510  0.5613
## x3             0.535  0.401  0.674  <0.001
fun4(test_lm, modeltype = 'lm')
##             estimate  2.5 % 97.5 % p-value
## (Intercept)   -0.224 -0.405 -0.043  0.0152
## ybin1         -0.228 -0.356 -0.100  <0.001
## x1            -0.014 -0.066  0.037  0.5847
## x2cat2         0.106 -0.064  0.276  0.2214
## x2cat3         0.081 -0.082  0.244  0.3292
## x2cat4         0.031 -0.139  0.200  0.7228
## x3             0.390  0.340  0.440  <0.001

Task 3

Now that the function “works”, i.e., runs for both types of models, we can come back to the other question:

For a logistic model: is this the output that we are interested in?

Solution 3

For a logistic regression model, the effects are usually reported as odds ratios. In our current version, we present the effects on the log odds scale.

Task 4

Adjust the function so that for logistic regression models the summary is given as odds ratios.

Do not forget to change the column name(s).

Solution 4

fun5 <- function(model, decimals = 3, modeltype) {
  dat <- data.frame(estimate = coef(model),
                    confint(model),
                    check.names = FALSE)
  
  if (modeltype == 'lm') {
    dat <- round(dat, decimals)
    
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
                                 eps = 10^(-decimals), digits = decimals)
  } else {
    dat <- round(exp(dat), decimals)
    
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
                                 eps = 10^(-decimals), digits = decimals)
    
    names(dat)[1] <- 'OR'
  }
  
  dat
}

fun5(test_logit, modeltype = 'logit')
## Waiting for profiling to be done...
##                OR 2.5 % 97.5 % p-value
## (Intercept) 1.472 1.006  2.164  0.0477
## y           0.770 0.663  0.893  <0.001
## x1          0.895 0.791  1.011  0.0759
## x2cat2      1.573 1.047  2.374  0.0299
## x2cat3      1.173 0.801  1.718  0.4108
## x2cat4      1.123 0.758  1.665  0.5613
## x3          1.708 1.493  1.963  <0.001
fun5(test_lm, modeltype = 'lm')
##             estimate  2.5 % 97.5 % p-value
## (Intercept)   -0.224 -0.405 -0.043  0.0152
## ybin1         -0.228 -0.356 -0.100  <0.001
## x1            -0.014 -0.066  0.037  0.5847
## x2cat2         0.106 -0.064  0.276  0.2214
## x2cat3         0.081 -0.082  0.244  0.3292
## x2cat4         0.031 -0.139  0.200  0.7228
## x3             0.390  0.340  0.440  <0.001

Task 5

To make the function more convenient to use, we want to automatically identify the model type, so that the user does not need to provide the modeltype.

The function family() can be used on regression models.

Solution 5

A first idea might be to use class(model) to identify the type of model.

class(test_lm)
## [1] "lm"
class(test_logit)
## [1] "glm" "lm"

The problem with this would be that a user might also have used glm() to fit a linear regression model:

class(glm(formula = y ~ ., data = testdat, family = gaussian()))
## [1] "glm" "lm"

Moreover, not all GLMs use a logit link (not even all binary GLMs do!) and using exp() would then not give an odds ratio:

class(glm(formula = ybin ~ ., data = testdat, family = binomial(link = 'probit')))
## [1] "glm" "lm"

Using family() we can extract the model family and the link function:

family(test_lm)
## 
## Family: gaussian 
## Link function: identity
family(test_logit)
## 
## Family: binomial 
## Link function: logit

We need to inspect this further to figure out how to access that information:

fam_lm <- family(test_lm)
str(fam_lm)
## List of 11
##  $ family    : chr "gaussian"
##  $ link      : chr "identity"
##  $ linkfun   :function (mu)  
##  $ linkinv   :function (eta)  
##  $ variance  :function (mu)  
##  $ dev.resids:function (y, mu, wt)  
##  $ aic       :function (y, n, mu, wt, dev)  
##  $ mu.eta    :function (eta)  
##  $ initialize:  expression({  n <- rep.int(1, nobs)  if (is.null(etastart) && is.null(start) && is.null(mustart) && ((family$link| __truncated__
##  $ validmu   :function (mu)  
##  $ valideta  :function (eta)  
##  - attr(*, "class")= chr "family"
fam_lm$family
## [1] "gaussian"
fam_lm$link
## [1] "identity"
family(test_logit)$family
## [1] "binomial"
family(test_logit)$link
## [1] "logit"

Using this information we can adapt our function:

fun6 <- function(model, decimals = 3) {

  # extract the family and link  
  fam <- family(model)$family
  link <- family(model)$link
  
  modeltype <- if (fam == 'binomial' & link == 'logit') {
    'logit'
  } else {
    if (fam == 'gaussian' & link == 'identity') {
      'lm'
    }
  }
  
  dat <- data.frame(estimate = coef(model),
                    confint(model),
                    check.names = FALSE)
  
  if (modeltype == 'lm') {
    dat <- round(dat, decimals)
    
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
                                 eps = 10^(-decimals), digits = decimals)
  } else {
    dat <- round(exp(dat), decimals)
    
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
                                 eps = 10^(-decimals), digits = decimals)
    
    names(dat)[1] <- 'OR'
  }
  
  dat
}

Task 6

What would happen if we would apply our function to a model that is not a standard linear regression (with identity link) or logistic regression model?

Try to figure this out by looking at the syntax without trying it out!

Solution 6

If we provide a model that does not fit either of the criteria fam == 'binomial' & link == 'logit' or fam == 'gaussian' & link == 'identity' the value of modeltype will be NULL.

Why?
The first condition (fam == 'binomial' & link == 'logit') would be FALSE so we would end up in the else part. There, the condition (fam == 'gaussian' & link == 'identity') would also be FALSE. Since this if() does not provide any alternative, it will return NULL.

The question is what happens when we try to evaluate if (modeltype == 'lm') when modeltype = NULL.

The comparison of NULL with 'lm' will result in a logical vector of length zero:

NULL == 'lm'
## logical(0)

Using a logical vector of length zero as condition to if() results in an error:

if (NULL == 'lm') {}
## Error in if (NULL == "lm") {: argument is of length zero
test_probit <- glm(ybin ~ ., family = binomial(link = 'probit'), data = testdat)
fun6(test_probit)
## Waiting for profiling to be done...
## Error in if (modeltype == "lm") {: argument is of length zero

Task 7

Prevent this error from occuring by stopping the evaluation of the function before it happens and provide an informative error message.

To create an error message, use stop().
You could use the function is.null().

Solution 7

fun7 <- function(model, decimals = 3) {

  # extract the family and link  
  fam <- family(model)$family
  link <- family(model)$link
  
  modeltype <- if (fam == 'binomial' & link == 'logit') {
    'logit'
  } else {
    if (fam == 'gaussian' & link == 'identity') {
      'lm'
    }
  }
  
  if (is.null(modeltype)) {
    stop("I do not know how to create the summary for the type of model you have provided.")
  }
  
  dat <- data.frame(estimate = coef(model),
                    confint(model),
                    check.names = FALSE)
  
  if (modeltype == 'lm') {
    dat <- round(dat, decimals)
    
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
                                 eps = 10^(-decimals), digits = decimals)
  } else {
    dat <- round(exp(dat), decimals)
    
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
                                 eps = 10^(-decimals), digits = decimals)
    
    names(dat)[1] <- 'OR'
  }
  
  dat
}
fun7(test_probit)
## Error in fun7(test_probit): I do not know how to create the summary for the type of model you have provided.
 

© Nicole Erler