ModelSpecification.Rmd
In this vignette we use the NHANES data for examples in crosssectional data and the dataset simLong for examples in longitudinal data. For more info on these datasets, check out the vignette Visualizing Incomplete Data, in which the distributions of variables and missing values in both sets is explored.
To learn more about the theoretical background of the statistical approach that is implemented in JointAI, check out the vignette Theoretical Background.
Note:
In some of the examples we use n.adapt = 0
(and n.iter = 0
, which is the default). This is in order to prevent the MCMC sampling and thereby reduce computational time when compiling this vignette.
JointAI has several main functions (which we abbreviate in the following with *_imp()
):
lm_imp()
: linear regression
glm_imp()
: generalized linear regression
clm_imp()
: cumulative logit models
lme_imp()
: linear mixed effects regression
glme_imp()
: generalized linear mixed effects regression
clmm_imp()
: cumulative logit mixed models
survreg_imp()
: parametric (Weibull) survival models
coxph_imp()
: Cox proportional hazards survival modelsSpecification of these functions is similar to the specification of the complete data versions lm()
, glm()
, lme()
(from package nlme) and clm()
and clmm2()
(from package ordinal), survreg()
and coxph()
(from package survival). glme_imp()
uses a combination of the specification used for lme()
and glm()
.
lm_imp()
, glm_imp()
, clm_imp()
, survreg_imp()
and coxph_imp()
take arguments formula
and data
, whereas lme_imp()
and glme_imp()
, clmm_imp()
require the specification of a fixed
effects and a random
effects formula. Specification of the fixed effects formula is demonstrated in section Model formula, specification of the random effects in section Multilevel structure & longitudinal covariates.
Additionally, glm_imp()
and glme_imp()
require the specification of the model family
(and link
function).
Implemented families and links for glm_imp()
and glme_imp()
are
family  

gaussian 
with links: identity , log

binomial 
with links: logit , probit , log , cloglog

Gamma 
with links: inverse , identity , log

poisson 
with links: log , identity

The argument family
can be given as character string or as a function. If the link function is omitted, the default link is used.
Example:
The following three specifications are equal:
mod1a < glm_imp(educ ~ age + gender + creat, data = NHANES,
family = "binomial", n.adapt = 0)
mod1b < glm_imp(educ ~ age + gender + creat, data = NHANES,
family = binomial(), n.adapt = 0)
mod1c < glm_imp(educ ~ age + gender + creat, data = NHANES,
family = binomial(link = 'logit'), n.adapt = 0)
mod1a$analysis_type
#> [1] "glm"
#> attr(,"family")
#> [1] "binomial"
#> attr(,"link")
#> [1] "logit"
To use a probit link instead, this needs to be specified explicitly:
mod1d < glm_imp(educ ~ age + gender + creat, data = NHANES,
family = binomial(link = 'probit'), n.adapt = 0)
mod1d$analysis_type
#> [1] "glm"
#> attr(,"family")
#> [1] "binomial"
#> attr(,"link")
#> [1] "probit"
The arguments formula
(in lm_imp()
, glm_imp()
, clm_imp()
, survreg_imp()
and coxph_imp()
) and fixed
(in lme_imp()
, glme_imp()
and clmm_imp()
) take a twosided formula
object, where ~
separates the response (outcome / dependent variable) from the linear predictor, in which covariates (independent variables) are separated by +
. An intercept is added automatically (except in Cox models or models for ordinal outcomes).
survreg_imp()
and coxph_imp()
expect a survival object (Surv()
) on the left hand side of the model formula. Currently, only right censored data can be handled.
Interactions between variables can be introduced using :
or *
, which adds the interaction term AND the main effects, i.e.,
is equivalent to
Interactions between multiple variables can be specified using parentheses:
mod2a < glm_imp(educ ~ gender * (age + smoke + creat),
data = NHANES, family = binomial(), n.adapt = 0)
parameters(mod2a)
#> [1] "(Intercept)" "genderfemale" "age"
#> [4] "smokeformer" "smokecurrent" "creat"
#> [7] "genderfemale:age" "genderfemale:smokeformer" "genderfemale:smokecurrent"
#> [10] "genderfemale:creat"
The function parameters()
returns the parameters that are specified to be followed (even for models where no MCMC sampling was performed, i.e., when n.iter = 0
and n.adapt = 0
).
To specify interactions of a given level, ^
can be used:
# all twoway interactions:
mod2b < glm_imp(educ ~ gender + (age + smoke + creat)^2,
data = NHANES, family = binomial(), n.adapt = 0)
parameters(mod2b)
#> [1] "(Intercept)" "genderfemale" "age" "smokeformer"
#> [5] "smokecurrent" "creat" "age:smokeformer" "age:smokecurrent"
#> [9] "age:creat" "smokeformer:creat" "smokecurrent:creat"
# all two and threeway interactions:
mod2c < glm_imp(educ ~ gender + (age + smoke + creat)^3,
data = NHANES, family = binomial(), n.adapt = 0)
parameters(mod2c)
#> [1] "(Intercept)" "genderfemale" "age"
#> [4] "smokeformer" "smokecurrent" "creat"
#> [7] "age:smokeformer" "age:smokecurrent" "age:creat"
#> [10] "smokeformer:creat" "smokecurrent:creat" "age:smokeformer:creat"
#> [13] "age:smokecurrent:creat"
In JointAI, interactions between any type of variables (observed, incomplete, timevarying) are allowed. When an incomplete variable is involved, the interaction term is recalculated within each iteration of the MCMC sampling, using the imputed values from the current iteration.
It is important not to precalculate interactions with incomplete variables as extra variables in the dataset, but to specify them in the model formula. Otherwise imputed values of the main effect and interaction term will not match and results may be incorrect.
In practice, associations between outcome and covariates do not always meet the standard assumption that all covariate effects are linear. Often, assuming a logarithmic, quadratic or other nonlinear effect is more appropriate.
Nonlinear associations can be specified in the model formula using either functions such as log()
(the natural logarithm), sqrt()
(the square root) or exp()
(the exponential function), or by specifying a function of a variable using I()
, for example, I(x^2)
would be the quadratic term of a variable x
.
For completely observed covariates, JointAI can handle any common type of function implemented in R, including splines, e.g., using ns()
or bs()
from the package splines (which is automatically installed with R).
Since functions involving variables that have missing values need to be recalculated in each iteration of the MCMC sampling, currently, only functions that are available in JAGS can be used for incomplete variables. Those functions include:
log()
, exp()
sqrt()
, polynomials (using I()
)abs()
sin()
, cos()
The list of functions implemented in JAGS can be found in the JAGS user manual.
Some examples:^{1}
# Absolute difference between bili and creat
mod3a < lm_imp(SBP ~ age + gender + abs(bili  creat), data = NHANES)
# Using a natural cubic spline for age (completely observed) and a quadratic
# and a cubic effect for bili
library(splines)
mod3b < lm_imp(SBP ~ ns(age, df = 2) + gender + I(bili^2) + I(bili^3), data = NHANES)
# A function of creat and albu
mod3c < lm_imp(SBP ~ age + gender + I(creat/albu^2), data = NHANES,
models = c(creat = 'lognorm', albu = 'lognorm'))
# This function may make more sense to calculate BMI as weight/height^2, but
# we currently do not have those variables in the NHANES dataset.
# Using the sinus and cosinus
mod3d < lm_imp(SBP ~ bili + sin(creat) + cos(albu), data = NHANES)
When a function of a complete or incomplete variable is used in the model formula, the main effect of that variable is automatically added as auxiliary variable (more on auxiliary variables in section Auxiliary variables), and (usually) only main effects are used as predictors in models for incomplete variables.
In mod3b
, for example, the spline of age is used as predictor for SBP
, but in the imputation model for bili
, age
enters with a linear effect.
list_models(mod3b, priors = FALSE, regcoef = FALSE, otherpars = FALSE)
#> Linear regression model for "bili"
#> * Predictor variables:
#> (Intercept), genderfemale, age
The function list_models()
prints a list of the models specified for covariates in a JointAI
object. Since, here, we are only interested in the predictor variables, we suppress printing of information on prior distributions, regression coefficients and other parameters by setting priors
, regcoef
and otherpars
to FALSE
.
When a function of a variable is specified as an auxiliary variable, this function is used (as well) in the models for covariates. For example, in mod3e
, waist circumference (WC
) is not part of the model for SBP
, and I(WC^2)
is used in the linear predictor of the imputation model for bili
:
mod3e < lm_imp(SBP ~ age + gender + bili, auxvars = ~ I(WC^2), data = NHANES)
list_models(mod3e, priors = F, regcoef = F, otherpars = F)
#> Linear regression model for "WC"
#> * Predictor variables:
#> (Intercept), age, genderfemale
#>
#> Linear regression model for "bili"
#> * Predictor variables:
#> (Intercept), age, genderfemale, I(WC^2)
When WC
is used as predictor variable in the main model and a function of WC
is specified as auxiliary variable, both variables are used as predictors in the imputation models:
mod3f < lm_imp(SBP ~ age + gender + bili + WC, auxvars = ~ I(WC^2), data = NHANES)
list_models(mod3f, priors = F, regcoef = F, otherpars = F)
#> Linear regression model for "WC"
#> * Predictor variables:
#> (Intercept), age, genderfemale
#>
#> Linear regression model for "bili"
#> * Predictor variables:
#> (Intercept), age, genderfemale, WC, I(WC^2)
When a function of a covariate is used in the linear predictor of the analysis model, and that function should also be used in the linear predictor of imputation models, it is necessary to also include that function in the argument auxvars
:
mod3g < lm_imp(SBP ~ age + gender + bili + I(WC^2), auxvars = ~ I(WC^2), data = NHANES)
list_models(mod3g, priors = F, regcoef = F, otherpars = F)
#> Linear regression model for "WC"
#> * Predictor variables:
#> (Intercept), age, genderfemale
#>
#> Linear regression model for "bili"
#> * Predictor variables:
#> (Intercept), age, genderfemale, I(WC^2)
Incomplete variables are always imputed on their original scale, i.e.,
mod3b
the variable bili
is imputed and the quadratic and cubic versions calculated from the imputed values.creat
and albu
in mod3c
are imputed separately, and I(creat/albu^2)
calculated from the imputed (and observed) values.Important::
When different transformations of the same incomplete variable are used in one model, or interaction terms involve the main and interaction effect of an incomplete variable, it is strongly discouraged to calculate these transformations or interaction terms beforehand and supply them as different variables.
If, for example, a model formula contains both x
and x2
(where x2
= x^2
), they are treated as separate variables and imputed with separate models. Imputed values of x2
are thus not equal to the square of imputed values of x
. Instead, x + I(x^2)
should be used in the model formula. Then, only x
is imputed and used in the linear predictor of models for other incomplete variables, and x^2
is calculated from the imputed values of x
.
When a function has restricted support, e.g., log(x)
is only defined for x > 0
, the distribution used to impute that variable needs to comply with these restrictions. This can either be achieved by truncating the distribution, using the argument trunc
, or by selecting an imputation method that meets the restrictions. For more information on imputation methods, see the section Imputation model types.
Example:
When using a log()
transformation for the covariate bili
, we can either use the default imputation method "norm"
(a normal distribution) and truncate it by specifying trunc = list(bili = c(<lower>, <upper>))
(where the lower and upper limits are the smallest and largest allowed values) or choose an imputation method (using the argument models
; more details see the section on imputation model types) that only imputes positive values such as a lognormal distribution ("lognorm"
) or a Gamma distribution ("gamma"
):
# truncation of the distribution of bili
mod4a < lm_imp(SBP ~ age + gender + log(bili) + exp(creat),
trunc = list(bili = c(1e5, 1e10)), data = NHANES)
# lognormal model for bili
mod4b < lm_imp(SBP ~ age + gender + log(bili) + exp(creat),
models = c(bili = 'lognorm', creat = 'norm'), data = NHANES)
# gamma model for bili
mod4c < lm_imp(SBP ~ age + gender + log(bili) + exp(creat),
models = c(bili = 'gamma', creat = 'norm'), data = NHANES)
Truncation always requires to specify both limits. Since Inf
and Inf
are not accepted, a value outside the range of the variable (here: 1e10) can be selected for the second boundary of the truncation interval.
It is possible to use functions that have different names in R and JAGS, or that do exist in JAGS, but not in R, by defining a new function in R that has the name of the function in JAGS.
Example:
In JAGS the inverse logit transformation is defined in the function ilogit
. In R, there is no function ilogit
, but the inverse logit is available as the distribution function of the logistic distribution plogis()
.
# Define the function ilogit
ilogit < plogis
# Use ilogit in the model formula
mod5a < lm_imp(SBP ~ age + gender + ilogit(creat), data = NHANES)
It is also possible to nest a function in another function.
Example:^{2}
The complementary log log transformation is restricted to values larger than 0 and smaller than 1. In order to use this function on a variable that exceeds this range, as is the case for creat
, a second transformation might be used, for instance the inverse logit from the previous example.
In JAGS, the complementary loglog transformation is implemented as cloglog
, but since this function does not exist in (base) R, we first need to define it:
# define the complementary log log transformation
cloglog < function(x) log(log(1  x))
# define the inverse logit (in case you have not done it in the previous example)
ilogit < plogis
# nest ilogit inside cloglog
mod6a < lm_imp(SBP ~ age + gender + cloglog(ilogit(creat)), data = NHANES)
In multilevel models, additional to the fixed effects structure specified by the argument fixed
a random effects structure needs to be provided via the argument random
.
random
takes a onesided formula starting with a ~
. Variables for which a random effect should be included are usually separated by a +
, and the grouping variable is separated by 
. A random intercept is added automatically and only needs to be specified in a random intercept only model.
A few examples:
random = ~ 1  id
: random intercept only, with id
as grouping variablerandom = ~ time  id
: random intercept and slope for variable time
random = ~ time + I(time^2)  id
: random intercept, slope and quadratic random effect for time
random = ~ time + x  id
random intercept, random slope for time
and random effect for variable x
It is possible to use splines in the random effects structure (but only for completely observed variables), e.g.:
mod7a < lme_imp(bmi ~ GESTBIR + ETHN + HEIGHT_M + ns(age, df = 2),
random = ~ns(age, df = 2)  ID, data = simLong)
Currently, multiple levels of nesting (nested or crossed random effects) are not yet available.
From JointAI version 0.5.0 onward imputation of longitudinal covariates is possible. Currently, imputation of continuous, binary and ordered factors is implemented. For more details, see the section covariate model types below.
Note:
When incomplete baseline covariates are involved in the model it is necessary to specify models for all longitudinal covariates, complete and incomplete. This is done automatically by JointAI, but it may be necessary to change the default model types to models that better fit the distributions of the respective longitudinal variables.
When no incomplete baseline covariates are involved, models only need to be specified for incomplete longitudinal covariates, but not for completely observed longitudinal covariates.
The joint distribution of an outcome \(y\), covariates \(x\), random effects \(b\) and parameters \(\theta\), \(p(y, x, b, \theta)\), is modeled as the product of univariate conditional distributions. To facilitate the specification of these distributions they are ordered so that longitudinal (level1) variables may have baseline (level2) variables in their linear predictors but not vice versa.
For example: \[\begin{align} p(y, x, b, \theta) = & p(y \mid x_1, ..., x_4, b_y, \theta_y) && \text{analysis model}\\ & p(x_1\mid \theta_{x1}) && \text{model for a complete baseline covariate}\\ & p(x_2\mid x_1, \theta_{x2}) && \text{model for an incomplete baseline covariate}\\ & p(x_3\mid x_1, x_2, b_{x3}, \theta_{x3}) && \text{model for a complete longitudinal covariate}\\ & p(x_4\mid x_1, x_2, x_3, b_{x4}, \theta_{x4}) && \text{model for an incomplete longitudinal covariate}\\ & p(b_y\theta_b) p(b_{x3}\theta_b) p(b_{x4}\theta_b) && \text{models for the random effects}\\ & p(\theta_y) p(\theta_{x1}) \ldots p(\theta_{x4}) p(\theta_b) && \text{prior distributions}\end{align}\]
Since the parameter vectors \(\theta_{x1}\), \(\theta_{x2}\), … are assumed to be a priori independent, and furthermore \(x_1\) is completely observed and modeled independently of incomplete variables, estimation of the other model parts is not affected by \(p(x_1\mid \theta_{x1})\) and, hence, this model can be omitted.
\(p(x_3 \mid x_1, x_2, b_{x3}, \theta_{x3})\), on the other hand is modeled conditional on the incomplete covariate \(x_2\) and can therefore not be omitted.
If there were no incomplete baseline covariates, i.e., if \(x_2\) was completely observed, \(p(x_3 \mid x_1, x_2, b_{x3}, \theta_{x3})\) would also not affect the estimation of parameters in the other parts of the model and could be omitted.
JointAI automatically selects an imputation model for each of the incomplete baseline covariates (level2 covariates) and for each of the (incomplete) longitudinal covariates (level1 covariates), based on the class
of the variable and the number of levels. The automatically selected types for baseline covariates are:
name  model  variable type 

norm 
linear regression  continuous variables 
logit 
logistic regression  factors with two levels 
multilogit 
multinomial logit model  unordered factors with >2 levels 
cumlogit 
cumulative logit model  ordered factors with >2 levels 
The default methods for longitudinal covariates are:
name  model  variable type 

lmm 
linear mixed model  continuous longitudinal variables 
glmm_logit 
logistic mixed model  longitudinal factors with two levels 
clmm 
cumulative logit mixed model  longitudinal ordered factors with >2 levels 
The imputation models that are chosen by default may not necessarily be appropriate for the data at hand, especially for continuous variables, which often do not comply with the assumptions of (conditional) normality.
Therefore, alternative imputation methods are available for continuous baseline covariates:
name  model  variable type 

lognorm 
normal regression of the logtransformed variable  rightskewed variables >0 
gamma 
Gamma regression (with loglink)  rightskewed variables >0 
beta 
beta regression (with logitlink)  continuous variables with values in [0, 1] 
lognorm
assumes a normal distribution for the natural logarithm of the variable, but the variable enters the linear predictor of the analysis model (and possibly other imputation models) on its original scale.
For longitudinal covariates the following alternative model types are available:
name  model  variable type 

glmm_lognorm 
normal mixed model for the logtransformed variable  longitudinal rightskewed variables >0 
glmm_gamma 
gamma mixed model (with loglink)  longitudinal rightskewed variables >0 
glmm_poisson 
poisson mixed model (with loglink)  longitudinal count variables 
In models mod4b
and mod4c
we have already seen two examples in which the imputation model type was changed using the argument models
.
models
takes a named vector of (imputation/covariate) model types, where the names are the names of the (incomplete) baseline or longitudinal variables. When the vector supplied to models
only contains specifications for a subset of those variables, default models are used for the remaining variables.
mod8a < lm_imp(SBP ~ age + gender + WC + alc + bili + occup + smoke,
models = c(bili = 'gamma', WC = 'lognorm'),
data = NHANES, n.iter = 100, progress.bar = 'none')
mod8a$models
#> WC smoke bili occup alc
#> "lognorm" "cumlogit" "gamma" "multilogit" "logit"
The function get_models()
, which finds and assigns the default covariate model types, can also be called directly. get_models()
has arguments
fixed
: the fixed effects formularandom
: the random effects formula (if necessary)data
: the datasetauxvars
: an optional onesided formula of auxiliary variablesno_model
: an optional vector of variables for which no model should be specifiedmodels
: and optional vector of (nondefault) model types chosen by the usermod8b_models < get_models(bmi ~ GESTBIR + ETHN + HEIGHT_M + SMOKE + hc + MARITAL + ns(age, df = 2),
random = ~ns(age, df = 2)  ID, data = simLong)
mod8b_models
#> $models
#> HEIGHT_M ETHN MARITAL SMOKE age hc
#> "norm" "logit" "multilogit" "cumlogit" "lmm" "lmm"
#>
#> $meth
#> HEIGHT_M ETHN MARITAL SMOKE hc
#> "norm" "logit" "multilogit" "cumlogit" "lmm"
get_models()
returns a list of two vectors:
models
: containing all specified modelsmeth
: containing the models for the incomplete variables onlyWhen there is a “time” variable in the model, such as age
in our example, which is the age of the child at the time of the measurement, it may not be meaningful to specify a model for that variable. Especially when the “time” variable is prespecified by the design of the study it can usually be assumed to be independent of the covariates and a model for it has no useful interpretation.
The argument no_model
(in get_models()
as well as in the main functions *_imp()
) allows us to exclude models for such variables (as long as they are completely observed):
mod8c_models < get_models(bmi ~ GESTBIR + ETHN + HEIGHT_M + SMOKE + hc + MARITAL + ns(age, df = 2),
random = ~ns(age, df = 2)  ID, data = simLong, no_model = "age")
mod8c_models
#> $models
#> HEIGHT_M ETHN MARITAL SMOKE hc
#> "norm" "logit" "multilogit" "cumlogit" "lmm"
#>
#> $meth
#> HEIGHT_M ETHN MARITAL SMOKE hc
#> "norm" "logit" "multilogit" "cumlogit" "lmm"
By excluding the model for age
we implicitly assume that incomplete baseline variables are independent of age
.
Note:
When a continuous incomplete variable has only two different values it is assumed to be binary and its coding and default imputation model will be changed accordingly. This behavior can be overwritten when the imputation method for that variable is specified directly by the user.
Variables of type logical
are automatically converted to binary factors.
In JointAI, models for level1 covariates are specified after models for level2 covariates, so that the level2 covariates are used as predictor variables in the models for level1 covariates, but not vice versa.
Within the two groups, models are ordered by number of missing values (decreasing), so that the model for the variable with the most missing values has the most other variables in its linear predictor:
# number of missing values in the covariates in mod8a
colSums(is.na(NHANES[, names(mod8a$models)]))
#> WC smoke bili occup alc
#> 2 7 8 28 34
# => The sequence of models is ordered according to the number of missing values.
#
# print information on the imputation models (and omit everything but the predictor variables)
list_models(mod8a, priors = F, regcoef = F, otherpars = F, refcat = F)
#> Lognormal regression model for "WC"
#> * Predictor variables:
#> (Intercept), age, genderfemale
#>
#> Cumulative logit model for "smoke"
#> * Predictor variables:
#> age, genderfemale, WC
#>
#> Gamma regression model for "bili"
#> * Parametrization:
#>  shape: shape_bili = mu_bili^2 * tau_bili
#>  rate: rate_bili = mu_bili * tau_bili
#> * Predictor variables:
#> (Intercept), age, genderfemale, WC, smokeformer, smokecurrent
#>
#> Multinomial logit model for "occup"
#> * Predictor variables:
#> (Intercept), age, genderfemale, WC, smokeformer, smokecurrent, bili
#>
#> Logistic regression model for "alc"
#> * Predictor variables:
#> (Intercept), age, genderfemale, WC, smokeformer, smokecurrent, bili, occuplooking for work, occupnot working
Auxiliary variables are variables that are not part of the analysis model, but should be considered as predictor variables in the imputation models because they can inform the imputation of on unobserved values.
Good auxiliary variables are ^{3}
In the main functions, *_imp()
, auxiliary variables can be specified with the argument auxvars
, which is a onesided formula.
Example:
We might consider the variables educ
and smoke
as predictors for the imputation of occup
:
mod9a < lm_imp(SBP ~ gender + age + occup, auxvars = ~ educ + smoke,
data = NHANES, n.iter = 100, progress.bar = 'none')
The variables educ
and smoke
are not used as predictors in the analysis model:
summary(mod9a)
#>
#> Linear model fitted with JointAI
#>
#> Call:
#> lm_imp(formula = SBP ~ gender + age + occup, data = NHANES, n.iter = 100,
#> auxvars = ~educ + smoke, progress.bar = "none")
#>
#> Posterior summary:
#> Mean SD 2.5% 97.5% tailprob. GRcrit
#> (Intercept) 105.984 3.3214 99.420 112.437 0.000 1.01
#> genderfemale 5.880 2.2662 10.449 1.399 0.020 1.12
#> age 0.369 0.0762 0.226 0.528 0.000 1.00
#> occuplooking for work 3.732 7.2562 9.387 17.981 0.647 1.02
#> occupnot working 0.547 2.6481 5.591 4.685 0.853 1.13
#>
#> Posterior summary of residual std. deviation:
#> Mean SD 2.5% 97.5% GRcrit
#> sigma_SBP 14.4 0.685 13.2 16 0.998
#>
#>
#> MCMC settings:
#> Iterations = 101:200
#> Sample size per chain = 100
#> Thinning interval = 1
#> Number of chains = 3
#>
#> Number of observations: 186
They are, however, used as predictors in the imputation for occup
and imputed themselves (if they have missing values):
list_models(mod9a, priors = FALSE, regcoef = FALSE, otherpars = FALSE, refcat = FALSE)
#> Cumulative logit model for "smoke"
#> * Predictor variables:
#> genderfemale, age, educhigh
#>
#> Multinomial logit model for "occup"
#> * Predictor variables:
#> (Intercept), genderfemale, age, educhigh, smokeformer, smokecurrent
As shown above in mod3e
and mod3f
, it is possible to specify functions of auxiliary variables. In that case, the auxiliary variable is not considered as linear effect but as specified by the function:
mod9b < lm_imp(SBP ~ gender + age + occup, data = NHANES,
auxvars = ~ educ + smoke + log(WC),
trunc = list(WC = c(1e10, 1e10)), n.adapt = 0)
list_models(mod9b, priors = FALSE, regcoef = FALSE, otherpars = FALSE,
refcat = FALSE)
#> Linear regression model for "WC"
#> * Predictor variables:
#> (Intercept), genderfemale, age, educhigh
#>
#> Cumulative logit model for "smoke"
#> * Predictor variables:
#> genderfemale, age, educhigh, log(WC)
#>
#> Multinomial logit model for "occup"
#> * Predictor variables:
#> (Intercept), genderfemale, age, educhigh, log(WC), smokeformer, smokecurrent
Note:
Omitting auxiliary variables from the analysis model implies that the outcome is independent of these variables, conditional on the other variables in the model. If this is not true, the model is misspecified which may lead to biased results (similar to leaving a confounding variable out of a model).
In JointAI, dummy coding is used when categorical variables enter a linear predictor, i.e., a binary variables is created for each category, except the reference category. These binary variables have value one when that category was observed and zero otherwise. Contrary to the behavior in base R, this is also done for ordered factors. Note that it is not possible to overwrite this behavior using the base R contrasts specification.
By default, the first category of a categorical variable (ordered or unordered) is used as reference, however, this may not always allow the desired interpretation of the regression coefficients. Moreover, when categories are unbalanced, setting the largest group as reference may result in better mixing of the MCMC chains.
Therefore, JointAI allows specification of the reference category separately for each variable, via the argument refcats
.
To specify the choice of reference category globally for all variables in the model, refcats
can be set as
refcats = "first"
refcats = "last"
refcats = "largest"
For example:
mod10a < lm_imp(SBP ~ gender + age + race + educ + occup + smoke,
refcats = "largest", data = NHANES, n.adapt = 0)
Alternatively, refcats
takes a named vector, in which the reference category for each variable can be specified either by its number or its name, or one of the three global types: “first”, “last” or “largest”. For variables for which no reference category is specified in the list the default is used.
mod10b < lm_imp(SBP ~ gender + age + race + educ + occup + smoke,
refcats = list(occup = "not working", race = 3, educ = 'largest'),
data = NHANES, n.adapt = 0)
To help to specify the reference category, the function set_refcat()
can be used. It prints the names of the categorical variables that are selected by
or all categorical variables in the data if only data
is provided, and asks a number of questions which the user needs to reply to by input of a number.
refs_mod10 < set_refcat(NHANES, formula = formula(mod10b))
#> The categorical variables are:
#>  "gender"
#>  "race"
#>  "educ"
#>  "occup"
#>  "smoke"
#>
#> How do you want to specify the reference categories?
#>
#> 1: Use the first category for each variable.
#> 2: Use the last category for each variabe.
#> 3: Use the largest category for each variable.
#> 4: Specify the reference categories individually.
When option 4 is chosen, a question for each categorical variable is asked, for example:
#> The reference category for “race” should be
#>
#> 1: Mexican American
#> 2: Other Hispanic
#> 3: NonHispanic White
#> 4: NonHispanic Black
#> 5: other
After specification of the reference categories for all categorical variables, the determined specification for the argument refcats
is printed:
#> In the JointAI model specify:
#> refcats = c(gender = 'female', race = 'NonHispanic White', educ = 'low',
#> occup = 'not working', smoke = 'never')
#>
#> or use the output of this function.
set_refcat()
also returns a named vector that can be passed to the argument refcats
:
mod10c < lm_imp(SBP ~ gender + age + race + educ + occup + smoke,
refcats = refs_mod10, data = NHANES, n.adapt = 0)
Contrary to base R behavior, dummy coding (i.e., contr.treatment
contrasts) are used for ordered factors in any linear predictor. Since the order of levels in an ordered factor contains information relevant to the imputation of missing values, it is important that incomplete ordinal variables are coded as such.
For unordered factors, the reference category could be changed in the dataset directly, outside of JointAI, e.g., using relevel()
. For ordered factors, however, this is not possible without loosing or changing the order of the levels. In this case, the refcats
argument in JointAI has to be used.
Changing a reference category via the argument refcats
does not change the order of levels in the dataset or any of the data matrices inside JointAI. Only when, in the JAGS model, the categorical variables is converted into dummy variables, the reference category is used to determine for which levels the dummies are created.
In the Bayesian framework, parameters are random variables for which a distribution needs to be specified. These distributions depend on parameters themselves, i.e., on hyperparameters.
The function default_hyperpars()
returns a list containing the default hyper parameters used in a JointAI
model.
default_hyperpars()
#> $norm
#> mu_reg_norm tau_reg_norm shape_tau_norm rate_tau_norm
#> 0e+00 1e04 1e02 1e02
#>
#> $gamma
#> mu_reg_gamma tau_reg_gamma shape_tau_gamma rate_tau_gamma
#> 0e+00 1e04 1e02 1e02
#>
#> $beta
#> mu_reg_beta tau_reg_beta shape_tau_beta rate_tau_beta
#> 0e+00 1e04 1e02 1e02
#>
#> $logit
#> mu_reg_logit tau_reg_logit
#> 0e+00 1e04
#>
#> $poisson
#> mu_reg_poisson tau_reg_poisson
#> 0e+00 1e04
#>
#> $probit
#> mu_reg_probit tau_reg_probit
#> 0e+00 1e04
#>
#> $multinomial
#> mu_reg_multinomial tau_reg_multinomial
#> 0e+00 1e04
#>
#> $ordinal
#> mu_reg_ordinal tau_reg_ordinal mu_delta_ordinal tau_delta_ordinal
#> 0e+00 1e04 0e+00 1e04
#>
#> $Z
#> function(nranef) {
#> if (nranef > 1) {
#> RinvD < diag(as.numeric(rep(NA, nranef)))
#> KinvD < nranef
#> } else {
#> RinvD < KinvD < NULL
#> }
#>
#> list(
#> RinvD = RinvD,
#> KinvD = KinvD,
#> shape_diag_RinvD = 0.1,
#> rate_diag_RinvD = 0.01
#> )
#> }
#> <bytecode: 0x0000000018f2cfa8>
#> <environment: 0x000000001b649d30>
#>
#> $surv
#> mu_reg_surv tau_reg_surv
#> 0.000 0.001
To change the hyperparameters in a JointAI model, the list obtained from default_hyperpars()
can be edited and passed to the argument hyperpars
in the main functions *_imp()
.
mu_reg_*
and tau_reg_*
refer to the mean and precision in the distribution for regression coefficients. shape_tau_*
and rate_tau_*
are the shape and rate parameters of a Gamma distribution that is used has prior for precision parameters. RinvD
is the scale matrix in the Wishart prior for the inverse of the random effects design matrix D
, and KinvD
the number of degrees of freedom in that distribution. shape_diag_RinvD
and rate_diag_RinvD
are the scale and rate parameters of the Gamma prior of the diagonal elements of RinvD
. In random effects models with only one random effect, instead of the Wishart distribution a Gamma prior is used for the inverse of D
.
When variables are measured on very different scales this can result in slow convergence and bad mixing. Therefore, JointAI automatically scales continuous variables to have mean zero and standard deviation one. Results (parameters and imputed values) are transformed back to the original scale.
In some settings, however, it is not possible to scale continuous variables. This is the case for incomplete variables that enter a linear predictor in a function and variables that are imputed with models that are defined on a subset of the real line (i.e., with a Gamma or a Beta distribution). Variables that are imputed with a lognormal distribution are scaled, but not centered. To prevent scaling, the argument scale_vars
can be set to FALSE
. When a vector of variable names is supplied to scale_vars
, only those variables are scaled.
By default, only the MCMC samples that is scaled back to the scale of the data is stored in a JointAI
object. When the argument keep_scaled_mcmc = TRUE
also the scaled sample is kept.
Using the argument ridge = TRUE
it is possible to impose a ridge penalty on the parameters of the analysis model, shrinking these parameters towards zero. This is done by specification of a \(\text{Ga}(0.01, 0.01)\) prior for the precision of the regression coefficients \(\beta\) instead of setting it to a fixed (small) value.
Note: these examples are chosen to demonstrate functionality and may not fit the data.↩︎
Again, this is just a demonstration of the possibilities in JointAI, but nesting transformations will most often result in coefficients that that do not have meaningful interpretation in practice.↩︎
Van Buuren, S. (2012). Flexible imputation of missing data. Chapman and Hall/CRC. See also the second edition online.↩︎