The aim of this practical is to write a function that returns a nice-looking summary of the results of a linear or logistic regression model.
Create a data.frame with example data that we can use to fit models for testing our function.
Hint: You can generate random values from a normal distribution using rnorm(), from a uniform distribution using runif(), or from a binomial distribution using rbinom(). The functions sample() and factor() could also be useful.
N <- 1000
testdat <- data.frame(y = rnorm(N),
ybin = factor(rbinom(N, size = 1, prob = 0.7)),
x1 = runif(N, min = 0, max = 4),
x2 = factor(sample(paste0('cat', 1:4), size = N, replace = TRUE))
)
testdat$x3 <- 0.5 * testdat$y + 0.5 * as.numeric(testdat$ybin) + rnorm(N, 0, 1)
summary(testdat)
## y ybin x1 x2 x3
## Min. :-3.16613 0:323 Min. :0.002036 cat1:226 Min. :-3.21724
## 1st Qu.:-0.71303 1:677 1st Qu.:1.055275 cat2:243 1st Qu.:-0.01074
## Median :-0.05912 Median :1.909885 cat3:287 Median : 0.82795
## Mean :-0.02762 Mean :1.960608 cat4:244 Mean : 0.82695
## 3rd Qu.: 0.63855 3rd Qu.:2.954759 3rd Qu.: 1.62160
## Max. : 3.70290 Max. :3.999473 Max. : 4.36887
Note:
We create one variable that is (strongly) associated with the outcome variables to make sure we have some significant p-values.
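Because the data are generated randomly, the exact numbers in the output below will differ between runs. A small addition (not part of the original code) that makes the example reproducible is to set a random seed before creating testdat; the seed value is an arbitrary choice:
# fix the random number generator so that rnorm(), runif(), rbinom() and
# sample() return the same values on every run (seed value chosen arbitrarily)
set.seed(2024)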
Using your test data, fit a linear and a logistic regression model.
test_lm <- lm(y ~ ., data = testdat)
test_logit <- glm(ybin ~ ., data = testdat, family = binomial())
Our summary should contain the estimates of the regression coefficients, the 95% confidence interval, and the p-value.
Explore the summary() of the linear regression model (not the printed output but the object that is returned) to find out how you can access the p-values from a lm.
summary_lm <- summary(test_lm)
class(summary_lm)
## [1] "summary.lm"
names(summary_lm)
## [1] "call" "terms" "residuals" "coefficients" "aliased" "sigma"
## [7] "df" "r.squared" "adj.r.squared" "fstatistic" "cov.unscaled"
is.list(summary_lm)
## [1] TRUE
summary_lm$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.22409323 0.09212938 -2.4323754 1.517595e-02
## ybin1 -0.22828842 0.06517862 -3.5025048 4.814306e-04
## x1 -0.01443955 0.02641363 -0.5466703 5.847281e-01
## x2cat2 0.10598917 0.08662939 1.2234782 2.214394e-01
## x2cat3 0.08122001 0.08320072 0.9761936 3.292063e-01
## x2cat4 0.03061912 0.08629760 0.3548085 7.228084e-01
## x3 0.39034164 0.02553041 15.2892816 1.504170e-47
class(summary_lm$coefficients)
## [1] "matrix" "array"
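Because the coefficients element is a matrix, the p-values can be extracted by selecting the corresponding column by name. This is the piece we will need below:
# select the column of p-values from the coefficient matrix by its column name;
# this returns a named numeric vector (the last column printed above)
summary_lm$coefficients[, 'Pr(>|t|)']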
Combine the regression coefficients, the 95% confidence interval and the p-values from the linear regression model into one data.frame.
Make sure the columns are named properly (for example: estimate, 2.5 %, 97.5 %, p-value).
Hint: To keep column names that are not syntactically valid (such as 2.5 % or p-value), use the argument check.names = FALSE in data.frame().
dat <- data.frame(estimate = coef(test_lm),
confint(test_lm),
'p-value' = summary(test_lm)$coefficients[, 'Pr(>|t|)'],
check.names = FALSE)
dat
## estimate 2.5 % 97.5 % p-value
## (Intercept) -0.22409323 -0.40488386 -0.04330261 1.517595e-02
## ybin1 -0.22828842 -0.35619205 -0.10038478 4.814306e-04
## x1 -0.01443955 -0.06627250 0.03739340 5.847281e-01
## x2cat2 0.10598917 -0.06400852 0.27598686 2.214394e-01
## x2cat3 0.08122001 -0.08204942 0.24448944 3.292063e-01
## x2cat4 0.03061912 -0.13872748 0.19996573 7.228084e-01
## x3 0.39034164 0.34024189 0.44044139 1.504170e-47
Note:
We need to wrap p-value in quotes because it contains a “-” and is therefore not a syntactically valid name.
estimate, on the other hand, is a syntactically valid name and does not need quotes.
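A small illustration (not part of the original solution) of why check.names = FALSE is needed: without it, data.frame() runs the names through make.names() and the hyphen is replaced by a dot.
# without check.names = FALSE, data.frame() converts "p-value" to "p.value"
names(data.frame('p-value' = 1:2))
## [1] "p.value"
names(data.frame('p-value' = 1:2, check.names = FALSE))
## [1] "p-value"
These steps can now be wrapped into a function that takes the fitted model as its argument: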
fun1 <- function(model) {
dat <- data.frame(estimate = coef(model),
confint(model),
'p-value' = summary(model)$coefficients[, 'Pr(>|t|)'],
check.names = FALSE)
dat
}
fun1(test_lm)
## estimate 2.5 % 97.5 % p-value
## (Intercept) -0.22409323 -0.40488386 -0.04330261 1.517595e-02
## ybin1 -0.22828842 -0.35619205 -0.10038478 4.814306e-04
## x1 -0.01443955 -0.06627250 0.03739340 5.847281e-01
## x2cat2 0.10598917 -0.06400852 0.27598686 2.214394e-01
## x2cat3 0.08122001 -0.08204942 0.24448944 3.292063e-01
## x2cat4 0.03061912 -0.13872748 0.19996573 7.228084e-01
## x3 0.39034164 0.34024189 0.44044139 1.504170e-47
To make the output look nicer, we want to limit the decimal places in the output to a number specified by the user. Implement this using an argument decimals and decide on a default value for it.
Hint: Use round() to limit the number of decimal places.
fun2 <- function(model, decimals = 3) {
dat <- data.frame(estimate = coef(model),
confint(model),
'p-value' = summary(model)$coefficients[, 'Pr(>|t|)'],
check.names = FALSE)
round(dat, decimals)
}
fun2(test_lm)
## estimate 2.5 % 97.5 % p-value
## (Intercept) -0.224 -0.405 -0.043 0.015
## ybin1 -0.228 -0.356 -0.100 0.000
## x1 -0.014 -0.066 0.037 0.585
## x2cat2 0.106 -0.064 0.276 0.221
## x2cat3 0.081 -0.082 0.244 0.329
## x2cat4 0.031 -0.139 0.200 0.723
## x3 0.390 0.340 0.440 0.000
In another practical, we have seen that the function format.pval() converts values below a certain threshold into “<…” notation.
Use this in our summary function.
Hint: format.pval() has the additional arguments digits and eps.
Hint: To select a column with a syntactically invalid name using $, you need to wrap the name in backticks, i.e., `the name`.
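A small illustration of these two arguments (the example p-values are taken from the output above; this is not part of the original solution): values below eps are replaced by the “<…” notation, while the remaining values are shown with the requested number of significant digits. A column with a non-syntactic name, such as the p-value column of dat created above, is selected with backticks.
# the second value falls below eps = 0.001 and is therefore shown as "<0.001";
# the first value is shown with 3 significant digits
format.pval(c(0.3292063, 0.0004814306), eps = 0.001, digits = 3)
# a column whose name is not syntactically valid is selected with backticks
dat$`p-value`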
fun3 <- function(model, decimals = 3) {
dat <- data.frame(estimate = coef(model),
confint(model),
check.names = FALSE)
dat <- round(dat, decimals)
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
eps = 10^(-decimals), digits = decimals)
dat
}
fun3(test_lm)
## estimate 2.5 % 97.5 % p-value
## (Intercept) -0.224 -0.405 -0.043 0.0152
## ybin1 -0.228 -0.356 -0.100 <0.001
## x1 -0.014 -0.066 0.037 0.5847
## x2cat2 0.106 -0.064 0.276 0.2214
## x2cat3 0.081 -0.082 0.244 0.3292
## x2cat4 0.031 -0.139 0.200 0.7228
## x3 0.390 0.340 0.440 <0.001
fun3(test_lm, decimals = 5)
## estimate 2.5 % 97.5 % p-value
## (Intercept) -0.22409 -0.40488 -0.04330 0.01517595
## ybin1 -0.22829 -0.35619 -0.10038 0.00048143
## x1 -0.01444 -0.06627 0.03739 0.58472808
## x2cat2 0.10599 -0.06401 0.27599 0.22143938
## x2cat3 0.08122 -0.08205 0.24449 0.32920630
## x2cat4 0.03062 -0.13873 0.19997 0.72280840
## x3 0.39034 0.34024 0.44044 < 1e-05
fun3(test_lm, decimals = 1)
## estimate 2.5 % 97.5 % p-value
## (Intercept) -0.2 -0.4 0.0 <0.1
## ybin1 -0.2 -0.4 -0.1 <0.1
## x1 0.0 -0.1 0.0 0.6
## x2cat2 0.1 -0.1 0.3 0.2
## x2cat3 0.1 -0.1 0.2 0.3
## x2cat4 0.0 -0.1 0.2 0.7
## x3 0.4 0.3 0.4 <0.1
Note:
We do not want to first round the p-value and then apply format.pval() because this might give wrong results in specific cases, for example:
pv <- 0.0009
format.pval(pv, digits = 2, eps = 0.001)
## [1] "<0.001"
round(pv, 3)
## [1] 0.001
format.pval(round(pv, 3), digits = 2, eps = 0.001)
## [1] "0.001"
We also would like to use this function for output from logistic regression models.
Does the current version of the function work with logistic regression models?
fun3(test_logit)
## Waiting for profiling to be done...
## Error in summary(model)$coefficients[, "Pr(>|t|)"]: subscript out of bounds
We get an error message “subscript out of bounds”. This means we are trying to select an element that does not exist.
To find the problem, we check each part of the function body, replacing model by test_logit:
decimals <- 3
dat <- data.frame(estimate = coef(test_logit),
confint(test_logit),
check.names = FALSE)
## Waiting for profiling to be done...
dat <- round(dat, decimals)
dat$`p-value` <- format.pval(summary(test_logit)$coefficients[, 'Pr(>|t|)'],
eps = 10^(-decimals), digits = decimals)
## Error in summary(test_logit)$coefficients[, "Pr(>|t|)"]: subscript out of bounds
The problem is in the p-value column, so we split this syntax up further:
summary(test_logit)$coefficients[, 'Pr(>|t|)']
## Error in summary(test_logit)$coefficients[, "Pr(>|t|)"]: subscript out of bounds
Same error, so probably the column 'Pr(>|t|)' does not exist.
summary(test_logit)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3866473 0.19527421 1.9800226 4.770099e-02
## y -0.2611655 0.07590243 -3.4408056 5.799851e-04
## x1 -0.1111445 0.06261764 -1.7749710 7.590268e-02
## x2cat2 0.4532677 0.20868071 2.1720631 2.985090e-02
## x2cat3 0.1598570 0.19434209 0.8225549 4.107611e-01
## x2cat4 0.1164228 0.20042718 0.5808733 5.613258e-01
## x3 0.5353623 0.06973867 7.6766919 1.632494e-14
Indeed, for a logistic regression model, the column is called Pr(>|z|)!
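A quicker way to check this (an extra step, not shown in the original solution) would have been to look only at the column names of the coefficient matrix:
# only the column names, instead of the full coefficient matrix
colnames(summary(test_logit)$coefficients)
## [1] "Estimate"   "Std. Error" "z value"    "Pr(>|z|)"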
Adjust the function so that it works with both linear and logistic regression models.
fun4 <- function(model, decimals = 3, modeltype) {
dat <- data.frame(estimate = coef(model),
confint(model),
check.names = FALSE)
dat <- round(dat, decimals)
if (modeltype == 'lm') {
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
eps = 10^(-decimals), digits = decimals)
} else {
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
eps = 10^(-decimals), digits = decimals)
}
dat
}
fun4(test_logit, modeltype = 'logit')
## Waiting for profiling to be done...
## estimate 2.5 % 97.5 % p-value
## (Intercept) 0.387 0.006 0.772 0.0477
## y -0.261 -0.411 -0.113 <0.001
## x1 -0.111 -0.234 0.011 0.0759
## x2cat2 0.453 0.046 0.865 0.0299
## x2cat3 0.160 -0.222 0.541 0.4108
## x2cat4 0.116 -0.277 0.510 0.5613
## x3 0.535 0.401 0.674 <0.001
fun4(test_lm, modeltype = 'lm')
## estimate 2.5 % 97.5 % p-value
## (Intercept) -0.224 -0.405 -0.043 0.0152
## ybin1 -0.228 -0.356 -0.100 <0.001
## x1 -0.014 -0.066 0.037 0.5847
## x2cat2 0.106 -0.064 0.276 0.2214
## x2cat3 0.081 -0.082 0.244 0.3292
## x2cat4 0.031 -0.139 0.200 0.7228
## x3 0.390 0.340 0.440 <0.001
Now that the function “works”, i.e., runs for both types of models, we can come back to the other question:
For a logistic model: is this the output that we are interested in?
For a logistic regression model, the effects are usually reported as odds ratios. In our current version, we present the effects on the log odds scale.
Adjust the function so that for logistic regression models the summary is given as odds ratios.
Do not forget to change the column name(s).
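Hint: the odds ratios are obtained by exponentiating the estimated coefficients, and because the confidence limits on the log-odds scale transform directly, the limits of the confidence interval can be exponentiated as well (the p-values are not transformed). A minimal sketch of the key step, not the full solution:
# exponentiate the coefficients and their profile-likelihood confidence limits
exp(cbind(OR = coef(test_logit), confint(test_logit)))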
fun5 <- function(model, decimals = 3, modeltype) {
dat <- data.frame(estimate = coef(model),
confint(model),
check.names = FALSE)
if (modeltype == 'lm') {
dat <- round(dat, decimals)
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
eps = 10^(-decimals), digits = decimals)
} else {
dat <- round(exp(dat), decimals)
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
eps = 10^(-decimals), digits = decimals)
names(dat)[1] <- 'OR'
}
dat
}
fun5(test_logit, modeltype = 'logit')
## Waiting for profiling to be done...
## OR 2.5 % 97.5 % p-value
## (Intercept) 1.472 1.006 2.164 0.0477
## y 0.770 0.663 0.893 <0.001
## x1 0.895 0.791 1.011 0.0759
## x2cat2 1.573 1.047 2.374 0.0299
## x2cat3 1.173 0.801 1.718 0.4108
## x2cat4 1.123 0.758 1.665 0.5613
## x3 1.708 1.493 1.963 <0.001
fun5(test_lm, modeltype = 'lm')
## estimate 2.5 % 97.5 % p-value
## (Intercept) -0.224 -0.405 -0.043 0.0152
## ybin1 -0.228 -0.356 -0.100 <0.001
## x1 -0.014 -0.066 0.037 0.5847
## x2cat2 0.106 -0.064 0.276 0.2214
## x2cat3 0.081 -0.082 0.244 0.3292
## x2cat4 0.031 -0.139 0.200 0.7228
## x3 0.390 0.340 0.440 <0.001
To make the function more convenient to use, we want to automatically identify the model type, so that the user does not need to provide the modeltype.
Hint: The function family() can be used on regression models.
A first idea might be to use class(model) to identify the type of model.
class(test_lm)
## [1] "lm"
class(test_logit)
## [1] "glm" "lm"The problem with this would be that a user might also have used glm() to fit a linear regression model:
class(glm(formula = y ~ ., data = testdat, family = gaussian()))## [1] "glm" "lm"
Moreover, not all GLMs use a logit link (not even all binary GLMs do!) and using exp() would then not give an odds ratio:
class(glm(formula = ybin ~ ., data = testdat, family = binomial(link = 'probit')))
## [1] "glm" "lm"
Using family() we can extract the model family and the link function:
family(test_lm)
##
## Family: gaussian
## Link function: identity
family(test_logit)
##
## Family: binomial
## Link function: logit
We need to inspect this further to figure out how to access that information:
fam_lm <- family(test_lm)
str(fam_lm)
## List of 11
## $ family : chr "gaussian"
## $ link : chr "identity"
## $ linkfun :function (mu)
## $ linkinv :function (eta)
## $ variance :function (mu)
## $ dev.resids:function (y, mu, wt)
## $ aic :function (y, n, mu, wt, dev)
## $ mu.eta :function (eta)
## $ initialize: expression({ n <- rep.int(1, nobs) if (is.null(etastart) && is.null(start) && is.null(mustart) && ((family$link| __truncated__
## $ validmu :function (mu)
## $ valideta :function (eta)
## - attr(*, "class")= chr "family"
fam_lm$family
## [1] "gaussian"
fam_lm$link
## [1] "identity"
family(test_logit)$family
## [1] "binomial"
family(test_logit)$link
## [1] "logit"
Using this information we can adapt our function:
fun6 <- function(model, decimals = 3) {
# extract the family and link
fam <- family(model)$family
link <- family(model)$link
modeltype <- if (fam == 'binomial' & link == 'logit') {
'logit'
} else {
if (fam == 'gaussian' & link == 'identity') {
'lm'
}
}
dat <- data.frame(estimate = coef(model),
confint(model),
check.names = FALSE)
if (modeltype == 'lm') {
dat <- round(dat, decimals)
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
eps = 10^(-decimals), digits = decimals)
} else {
dat <- round(exp(dat), decimals)
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
eps = 10^(-decimals), digits = decimals)
names(dat)[1] <- 'OR'
}
dat
}
What would happen if we applied our function to a model that is not a standard linear regression (with identity link) or a logistic regression model?
Try to figure this out by looking at the syntax without trying it out!
If we provide a model that does not fulfil either of the criteria fam == 'binomial' & link == 'logit' or fam == 'gaussian' & link == 'identity', the value of modeltype will be NULL.
Why?
The first condition (fam == 'binomial' & link == 'logit') would be FALSE, so we would end up in the else part. There, the condition (fam == 'gaussian' & link == 'identity') would also be FALSE. Since this inner if() does not have an else part, it returns NULL.
The question is what happens when we try to evaluate if (modeltype == 'lm') when modeltype = NULL.
The comparison of NULL with 'lm' will result in a logical vector of length zero:
NULL == 'lm'
## logical(0)
Using a logical vector of length zero as condition to if() results in an error:
if (NULL == 'lm') {}
## Error in if (NULL == "lm") {: argument is of length zero
test_probit <- glm(ybin ~ ., family = binomial(link = 'probit'), data = testdat)
fun6(test_probit)
## Waiting for profiling to be done...
## Error in if (modeltype == "lm") {: argument is of length zero
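As an aside (not the solution chosen below): wrapping the comparison in isTRUE() would avoid the zero-length-condition error, because isTRUE() returns FALSE unless its argument is a single TRUE. However, an unrecognised model would then silently end up in the else branch and be summarised as if it were a logistic model, so an explicit check with an informative error message is preferable.
# a zero-length comparison is treated as FALSE instead of throwing an error
isTRUE(NULL == 'lm')
## [1] FALSE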
Prevent this error from occurring by stopping the evaluation of the function before it happens and providing an informative error message.
Hint: stop() throws an error with a given message.
Hint: is.null() tests whether an object is NULL.
fun7 <- function(model, decimals = 3) {
# extract the family and link
fam <- family(model)$family
link <- family(model)$link
modeltype <- if (fam == 'binomial' & link == 'logit') {
'logit'
} else {
if (fam == 'gaussian' & link == 'identity') {
'lm'
}
}
if (is.null(modeltype)) {
stop("I do not know how to create the summary for the type of model you have provided.")
}
dat <- data.frame(estimate = coef(model),
confint(model),
check.names = FALSE)
if (modeltype == 'lm') {
dat <- round(dat, decimals)
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
eps = 10^(-decimals), digits = decimals)
} else {
dat <- round(exp(dat), decimals)
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
eps = 10^(-decimals), digits = decimals)
names(dat)[1] <- 'OR'
}
dat
}
fun7(test_probit)
## Error in fun7(test_probit): I do not know how to create the summary for the type of model you have provided.
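As a final check (not part of the original practical), the model type is now determined from the family and the link function rather than from the class of the object, so a linear model fitted with glm() is also summarised correctly:
# a gaussian GLM with identity link is recognised and treated as a linear model
fun7(glm(y ~ ., data = testdat, family = gaussian()))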
© Nicole Erler