The aim of this practical is to write a function that returns a nice looking summary of the results from a linear or logistic regression model.
Create a data.frame
with example data that we can use to fit example models to test our function.
rnorm()
, from a uniform distribution, using runif()
or from a binomial distribution, using rbinom()
. The functions sample()
and factor()
could also be useful.
<- 1000
N <- data.frame(y = rnorm(N),
testdat ybin = factor(rbinom(N, size = 1, prob = 0.7)),
x1 = runif(N, min = 0, max = 4),
x2 = factor(sample(paste0('cat', 1:4), size = N, replace = TRUE))
)
$x3 <- 0.5 * testdat$y + 0.5 * as.numeric(testdat$ybin) + rnorm(N, 0, 1)
testdat
summary(testdat)
## y ybin x1 x2 x3
## Min. :-3.16613 0:323 Min. :0.002036 cat1:226 Min. :-3.21724
## 1st Qu.:-0.71303 1:677 1st Qu.:1.055275 cat2:243 1st Qu.:-0.01074
## Median :-0.05912 Median :1.909885 cat3:287 Median : 0.82795
## Mean :-0.02762 Mean :1.960608 cat4:244 Mean : 0.82695
## 3rd Qu.: 0.63855 3rd Qu.:2.954759 3rd Qu.: 1.62160
## Max. : 3.70290 Max. :3.999473 Max. : 4.36887
Note:
We create one variable that is (strongly) associated with the outcome variables to make sure we have some significant p-values.
Using your test data, fit a linear and a logistic regression model.
<- lm(y ~ ., data = testdat)
test_lm <- glm(ybin ~ ., data = testdat, family = binomial()) test_logit
Our summary should contain the estimates of the regression coefficients, the 95% confidence interval, and the p-value.
Explore the summary()
of the linear regression model (not the printed output but the object that is returned) to find out how you can access the p-values from a lm
.
<- summary(test_lm)
summary_lm class(summary_lm)
## [1] "summary.lm"
names(summary_lm)
## [1] "call" "terms" "residuals" "coefficients" "aliased" "sigma"
## [7] "df" "r.squared" "adj.r.squared" "fstatistic" "cov.unscaled"
is.list(summary_lm)
## [1] TRUE
$coefficients summary_lm
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.22409323 0.09212938 -2.4323754 1.517595e-02
## ybin1 -0.22828842 0.06517862 -3.5025048 4.814306e-04
## x1 -0.01443955 0.02641363 -0.5466703 5.847281e-01
## x2cat2 0.10598917 0.08662939 1.2234782 2.214394e-01
## x2cat3 0.08122001 0.08320072 0.9761936 3.292063e-01
## x2cat4 0.03061912 0.08629760 0.3548085 7.228084e-01
## x3 0.39034164 0.02553041 15.2892816 1.504170e-47
class(summary_lm$coefficients)
## [1] "matrix" "array"
Combine the regression coefficients, the 95% confidence interval and the p-values from the linear regression model into one data.frame
.
Make sure the columns are named properly (for example: estimate
, 2.5 %
, 97.5 %
, p-value
).
check.names = FALSE
.
<- data.frame(estimate = coef(test_lm),
dat confint(test_lm),
'p-value' = summary(test_lm)$coefficients[, 'Pr(>|t|)'],
check.names = FALSE)
dat
## estimate 2.5 % 97.5 % p-value
## (Intercept) -0.22409323 -0.40488386 -0.04330261 1.517595e-02
## ybin1 -0.22828842 -0.35619205 -0.10038478 4.814306e-04
## x1 -0.01443955 -0.06627250 0.03739340 5.847281e-01
## x2cat2 0.10598917 -0.06400852 0.27598686 2.214394e-01
## x2cat3 0.08122001 -0.08204942 0.24448944 3.292063e-01
## x2cat4 0.03061912 -0.13872748 0.19996573 7.228084e-01
## x3 0.39034164 0.34024189 0.44044139 1.504170e-47
Note:
We need to wrap p-value
in quotes because it does contain a “-” and is not a syntactically valid name.
estimate
is a syntactically valid name.
<- function(model) {
fun1 <- data.frame(estimate = coef(model),
dat confint(model),
'p-value' = summary(model)$coefficients[, 'Pr(>|t|)'],
check.names = FALSE)
dat
}
fun1(test_lm)
## estimate 2.5 % 97.5 % p-value
## (Intercept) -0.22409323 -0.40488386 -0.04330261 1.517595e-02
## ybin1 -0.22828842 -0.35619205 -0.10038478 4.814306e-04
## x1 -0.01443955 -0.06627250 0.03739340 5.847281e-01
## x2cat2 0.10598917 -0.06400852 0.27598686 2.214394e-01
## x2cat3 0.08122001 -0.08204942 0.24448944 3.292063e-01
## x2cat4 0.03061912 -0.13872748 0.19996573 7.228084e-01
## x3 0.39034164 0.34024189 0.44044139 1.504170e-47
To make the output look nicer, we want to limit the decimal places in the output to a number specified by the user. Implement this using an argument decimals
and decide on a default value for it.
round()
.
<- function(model, decimals = 3) {
fun2 <- data.frame(estimate = coef(model),
dat confint(model),
'p-value' = summary(model)$coefficients[, 'Pr(>|t|)'],
check.names = FALSE)
round(dat, decimals)
}
fun2(test_lm)
## estimate 2.5 % 97.5 % p-value
## (Intercept) -0.224 -0.405 -0.043 0.015
## ybin1 -0.228 -0.356 -0.100 0.000
## x1 -0.014 -0.066 0.037 0.585
## x2cat2 0.106 -0.064 0.276 0.221
## x2cat3 0.081 -0.082 0.244 0.329
## x2cat4 0.031 -0.139 0.200 0.723
## x3 0.390 0.340 0.440 0.000
In another practical, we have seen that the function format.pval()
will convert values below a certain value in to “<…” notation.
Use this in our summary function.
format.pval()
has the additional arguments digits
and eps
.
$
to select a column with a syntactically invalid name you need to wrap the name in “`”, i.e., `the name`.
<- function(model, decimals = 3) {
fun3 <- data.frame(estimate = coef(model),
dat confint(model),
check.names = FALSE)
<- round(dat, decimals)
dat
$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
dateps = 10^(-decimals), digits = decimals)
dat
}
fun3(test_lm)
## estimate 2.5 % 97.5 % p-value
## (Intercept) -0.224 -0.405 -0.043 0.0152
## ybin1 -0.228 -0.356 -0.100 <0.001
## x1 -0.014 -0.066 0.037 0.5847
## x2cat2 0.106 -0.064 0.276 0.2214
## x2cat3 0.081 -0.082 0.244 0.3292
## x2cat4 0.031 -0.139 0.200 0.7228
## x3 0.390 0.340 0.440 <0.001
fun3(test_lm, decimals = 5)
## estimate 2.5 % 97.5 % p-value
## (Intercept) -0.22409 -0.40488 -0.04330 0.01517595
## ybin1 -0.22829 -0.35619 -0.10038 0.00048143
## x1 -0.01444 -0.06627 0.03739 0.58472808
## x2cat2 0.10599 -0.06401 0.27599 0.22143938
## x2cat3 0.08122 -0.08205 0.24449 0.32920630
## x2cat4 0.03062 -0.13873 0.19997 0.72280840
## x3 0.39034 0.34024 0.44044 < 1e-05
fun3(test_lm, decimals = 1)
## estimate 2.5 % 97.5 % p-value
## (Intercept) -0.2 -0.4 0.0 <0.1
## ybin1 -0.2 -0.4 -0.1 <0.1
## x1 0.0 -0.1 0.0 0.6
## x2cat2 0.1 -0.1 0.3 0.2
## x2cat3 0.1 -0.1 0.2 0.3
## x2cat4 0.0 -0.1 0.2 0.7
## x3 0.4 0.3 0.4 <0.1
Note:
We do not want to first round the p-value and then apply format.pval()
because this might give wrong results in specific cases, for example:
<- 0.0009
pv format.pval(pv, digits = 2, eps = 0.001)
## [1] "<0.001"
round(pv, 3)
## [1] 0.001
format.pval(round(pv, 3), digits = 2, eps = 0.001)
## [1] "0.001"
We also would like to use this function for output from logistic regression models.
Does the current version of the function work with logistic regression models?
fun3(test_logit)
## Waiting for profiling to be done...
## Error in summary(model)$coefficients[, "Pr(>|t|)"]: subscript out of bounds
We get an error message “subscript out of bounds”. This means we are trying to select an element that does not exist.
To find the problem, we check each part of the function body, replacing model
by test_logit
:
<- 3
decimals
<- data.frame(estimate = coef(test_logit),
dat confint(test_logit),
check.names = FALSE)
## Waiting for profiling to be done...
<- round(dat, decimals)
dat
$`p-value` <- format.pval(summary(test_logit)$coefficients[, 'Pr(>|t|)'],
dateps = 10^(-decimals), digits = decimals)
## Error in summary(test_logit)$coefficients[, "Pr(>|t|)"]: subscript out of bounds
The problem is in the p-value column, so we split this syntax up further:
summary(test_logit)$coefficients[, 'Pr(>|t|)']
## Error in summary(test_logit)$coefficients[, "Pr(>|t|)"]: subscript out of bounds
Same error, so probably the column 'Pr(>|t|)'
does not exist.
summary(test_logit)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3866473 0.19527421 1.9800226 4.770099e-02
## y -0.2611655 0.07590243 -3.4408056 5.799851e-04
## x1 -0.1111445 0.06261764 -1.7749710 7.590268e-02
## x2cat2 0.4532677 0.20868071 2.1720631 2.985090e-02
## x2cat3 0.1598570 0.19434209 0.8225549 4.107611e-01
## x2cat4 0.1164228 0.20042718 0.5808733 5.613258e-01
## x3 0.5353623 0.06973867 7.6766919 1.632494e-14
Indeed, for a logistic regression model, the column is called Pr(>|z|)
!
Adjust the function so that it works with both linear and logistic regression models.
<- function(model, decimals = 3, modeltype) {
fun4 <- data.frame(estimate = coef(model),
dat confint(model),
check.names = FALSE)
<- round(dat, decimals)
dat
if (modeltype == 'lm') {
$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
dateps = 10^(-decimals), digits = decimals)
else {
} $`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
dateps = 10^(-decimals), digits = decimals)
}
dat
}
fun4(test_logit, modeltype = 'logit')
## Waiting for profiling to be done...
## estimate 2.5 % 97.5 % p-value
## (Intercept) 0.387 0.006 0.772 0.0477
## y -0.261 -0.411 -0.113 <0.001
## x1 -0.111 -0.234 0.011 0.0759
## x2cat2 0.453 0.046 0.865 0.0299
## x2cat3 0.160 -0.222 0.541 0.4108
## x2cat4 0.116 -0.277 0.510 0.5613
## x3 0.535 0.401 0.674 <0.001
fun4(test_lm, modeltype = 'lm')
## estimate 2.5 % 97.5 % p-value
## (Intercept) -0.224 -0.405 -0.043 0.0152
## ybin1 -0.228 -0.356 -0.100 <0.001
## x1 -0.014 -0.066 0.037 0.5847
## x2cat2 0.106 -0.064 0.276 0.2214
## x2cat3 0.081 -0.082 0.244 0.3292
## x2cat4 0.031 -0.139 0.200 0.7228
## x3 0.390 0.340 0.440 <0.001
Now that the function “works”, i.e., runs for both types of models, we can come back to the other question:
For a logistic model: is this the output that we are interested in?
For a logistic regression model, the effects are usually reported as odds ratios. In our current version, we present the effects on the log odds scale.
Adjust the function so that for logistic regression models the summary is given as odds ratios.
Do not forget to change the column name(s).
<- function(model, decimals = 3, modeltype) {
fun5 <- data.frame(estimate = coef(model),
dat confint(model),
check.names = FALSE)
if (modeltype == 'lm') {
<- round(dat, decimals)
dat
$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
dateps = 10^(-decimals), digits = decimals)
else {
} <- round(exp(dat), decimals)
dat
$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
dateps = 10^(-decimals), digits = decimals)
names(dat)[1] <- 'OR'
}
dat
}
fun5(test_logit, modeltype = 'logit')
## Waiting for profiling to be done...
## OR 2.5 % 97.5 % p-value
## (Intercept) 1.472 1.006 2.164 0.0477
## y 0.770 0.663 0.893 <0.001
## x1 0.895 0.791 1.011 0.0759
## x2cat2 1.573 1.047 2.374 0.0299
## x2cat3 1.173 0.801 1.718 0.4108
## x2cat4 1.123 0.758 1.665 0.5613
## x3 1.708 1.493 1.963 <0.001
fun5(test_lm, modeltype = 'lm')
## estimate 2.5 % 97.5 % p-value
## (Intercept) -0.224 -0.405 -0.043 0.0152
## ybin1 -0.228 -0.356 -0.100 <0.001
## x1 -0.014 -0.066 0.037 0.5847
## x2cat2 0.106 -0.064 0.276 0.2214
## x2cat3 0.081 -0.082 0.244 0.3292
## x2cat4 0.031 -0.139 0.200 0.7228
## x3 0.390 0.340 0.440 <0.001
To make the function more convenient to use, we want to automatically identify the model type, so that the user does not need to provide the modeltype
.
family()
can be used on regression models.
A first idea might be to use class(model)
to identify the type of model.
class(test_lm)
## [1] "lm"
class(test_logit)
## [1] "glm" "lm"
The problem with this would be that a user might also have used glm()
to fit a linear regression model:
class(glm(formula = y ~ ., data = testdat, family = gaussian()))
## [1] "glm" "lm"
Moreover, not all GLMs use a logit link (not even all binary GLMs do!) and using exp()
would then not give an odds ratio:
class(glm(formula = ybin ~ ., data = testdat, family = binomial(link = 'probit')))
## [1] "glm" "lm"
Using family()
we can extract the model family and the link function:
family(test_lm)
##
## Family: gaussian
## Link function: identity
family(test_logit)
##
## Family: binomial
## Link function: logit
We need to inspect this further to figure out how to access that information:
<- family(test_lm)
fam_lm str(fam_lm)
## List of 11
## $ family : chr "gaussian"
## $ link : chr "identity"
## $ linkfun :function (mu)
## $ linkinv :function (eta)
## $ variance :function (mu)
## $ dev.resids:function (y, mu, wt)
## $ aic :function (y, n, mu, wt, dev)
## $ mu.eta :function (eta)
## $ initialize: expression({ n <- rep.int(1, nobs) if (is.null(etastart) && is.null(start) && is.null(mustart) && ((family$link| __truncated__
## $ validmu :function (mu)
## $ valideta :function (eta)
## - attr(*, "class")= chr "family"
$family fam_lm
## [1] "gaussian"
$link fam_lm
## [1] "identity"
family(test_logit)$family
## [1] "binomial"
family(test_logit)$link
## [1] "logit"
Using this information we can adapt our function:
<- function(model, decimals = 3) {
fun6
# extract the family and link
<- family(model)$family
fam <- family(model)$link
link
<- if (fam == 'binomial' & link == 'logit') {
modeltype 'logit'
else {
} if (fam == 'gaussian' & link == 'identity') {
'lm'
}
}
<- data.frame(estimate = coef(model),
dat confint(model),
check.names = FALSE)
if (modeltype == 'lm') {
<- round(dat, decimals)
dat
$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
dateps = 10^(-decimals), digits = decimals)
else {
} <- round(exp(dat), decimals)
dat
$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
dateps = 10^(-decimals), digits = decimals)
names(dat)[1] <- 'OR'
}
dat }
What would happen if we would apply our function to a model that is not a standard linear regression (with identity link) or logistic regression model?
Try to figure this out by looking at the syntax without trying it out!
If we provide a model
that does not fit either of the criteria fam == 'binomial' & link == 'logit'
or fam == 'gaussian' & link == 'identity'
the value of modeltype
will be NULL
.
Why?
The first condition (fam == 'binomial' & link == 'logit'
) would be FALSE
so we would end up in the else
part. There, the condition (fam == 'gaussian' & link == 'identity'
) would also be FALSE
. Since this if()
does not provide any alternative, it will return NULL
.
The question is what happens when we try to evaluate if (modeltype == 'lm')
when modeltype = NULL
.
The comparison of NULL
with 'lm'
will result in a logical
vector of length zero:
NULL == 'lm'
## logical(0)
Using a logical
vector of length zero as condition to if()
results in an error:
if (NULL == 'lm') {}
## Error in if (NULL == "lm") {: argument is of length zero
<- glm(ybin ~ ., family = binomial(link = 'probit'), data = testdat)
test_probit fun6(test_probit)
## Waiting for profiling to be done...
## Error in if (modeltype == "lm") {: argument is of length zero
Prevent this error from occuring by stopping the evaluation of the function before it happens and provide an informative error message.
stop()
.
is.null()
.
<- function(model, decimals = 3) {
fun7
# extract the family and link
<- family(model)$family
fam <- family(model)$link
link
<- if (fam == 'binomial' & link == 'logit') {
modeltype 'logit'
else {
} if (fam == 'gaussian' & link == 'identity') {
'lm'
}
}
if (is.null(modeltype)) {
stop("I do not know how to create the summary for the type of model you have provided.")
}
<- data.frame(estimate = coef(model),
dat confint(model),
check.names = FALSE)
if (modeltype == 'lm') {
<- round(dat, decimals)
dat
$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
dateps = 10^(-decimals), digits = decimals)
else {
} <- round(exp(dat), decimals)
dat
$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
dateps = 10^(-decimals), digits = decimals)
names(dat)[1] <- 'OR'
}
dat }
fun7(test_probit)
## Error in fun7(test_probit): I do not know how to create the summary for the type of model you have provided.
© Nicole Erler