# Theoretical Background
Consider the general setting of a regression model where interest lies in a set of parameters $\boldsymbol\theta$ that describe the association between a univariate outcome $\mathbf y$ and a set of covariates $\mathbf X = (\mathbf x_1, \ldots, \mathbf x_p)$. In the Bayesian framework, inference over $\boldsymbol\theta$ is obtained by estimation of the posterior distribution of $\boldsymbol\theta$, which is proportional to the product of the likelihood of the data $(\mathbf y, \mathbf X)$ and the prior distribution of $\boldsymbol\theta$:
$$p(\boldsymbol\theta \mid \mathbf y, \mathbf X) \propto p(\mathbf y, \mathbf X \mid \boldsymbol\theta)\, p(\boldsymbol\theta).$$
When some of the covariates are incomplete, $\mathbf X$ consists of two parts, the completely observed variables $\mathbf X_{obs}$ (i.e., those columns of $\mathbf X$ that do not have any missing values) and those variables that are incomplete, $\mathbf X_{mis}$ (i.e., those columns of $\mathbf X$ that do have missing values). If $\mathbf y$ had missing values (and this missingness was ignorable), the only necessary change in the formulas below would be to replace $\mathbf y$ by its observed part, $\mathbf y_{obs}$. Here, we will, therefore, consider $\mathbf y$ to be completely observed. In the implementation in the R package JointAI, however, missing values in the outcome are allowed and are imputed automatically.
The likelihood of the complete data, i.e., the observed and unobserved data, can be factorized in the following convenient way:
$$p(\mathbf y, \mathbf X_{obs}, \mathbf X_{mis} \mid \boldsymbol\theta) =
p(\mathbf y \mid \mathbf X_{obs}, \mathbf X_{mis}, \boldsymbol\theta_{y\mid x})\;
p(\mathbf X_{mis} \mid \mathbf X_{obs}, \boldsymbol\theta_x),$$
where the first factor constitutes the analysis model of interest, described by a vector of parameters $\boldsymbol\theta_{y\mid x}$, and the second factor is the joint distribution of the incomplete variables, i.e., the imputation part of the model, described by parameters $\boldsymbol\theta_x$, and $\boldsymbol\theta = (\boldsymbol\theta_{y\mid x}^\top, \boldsymbol\theta_x^\top)^\top$.
Explicitly specifying the joint distribution of all data is one of the major advantages of the Bayesian approach, since this facilitates the use of all available information from the outcome in the imputation of the incomplete covariates (Erler et al. 2016), which becomes especially relevant for more complex outcomes such as repeatedly measured variables (see the section on imputation with longitudinal outcomes).
In complex models the posterior distribution usually cannot be derived analytically, so Markov chain Monte Carlo (MCMC) methods are used to obtain samples from it. The MCMC sampling in JointAI is done using Gibbs sampling, which iteratively samples from the full conditional distributions of the unknown parameters and missing values.
In the following sections, each of the three parts of the model, the analysis model, the imputation part and the prior distributions, is described in more detail.
## Analysis model

The analysis model of interest is described by the probability density function $p(\mathbf y \mid \mathbf X, \boldsymbol\theta_{y\mid x})$.
The R package JointAI can currently handle analysis models that are

* generalized linear models (GLM),
* generalized linear mixed models (GLMM),
* cumulative logit (mixed) models for ordinal outcomes,
* parametric (Weibull) survival models, or
* Cox proportional hazards models.
Moreover, it is possible to fit joint models of longitudinal and survival data, where the survival model is a proportional hazards model and the longitudinal variables are modelled using generalized, ordinal, beta or log-normal mixed models.
For a GLM, the probability density function is chosen from the exponential family and has the linear predictor
$$g\{E(y_i \mid \mathbf X, \boldsymbol\theta_{y\mid x})\} = \mathbf x_i^\top \boldsymbol\beta,$$
where $g(\cdot)$ is a link function, $y_i$ the value of the outcome variable for subject $i$, and $\mathbf x_i$ is a column vector containing the row of $\mathbf X$ that contains the covariate information for $i$.
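As an illustration, a GLM with incomplete covariates could be specified with `glm_imp()`. This is a minimal sketch; the data frame `dat` and the variables in it are hypothetical:

```{r, eval = FALSE}
library("JointAI")

# Hypothetical data: binary outcome y, completely observed covariate x1,
# incomplete covariates x2 and x3. Only the analysis model needs to be
# specified; JointAI adds imputation models for x2 and x3 automatically.
mod <- glm_imp(y ~ x1 + x2 + x3, family = binomial(),
               data = dat, n.iter = 1000)

summary(mod)
```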
For a GLMM the linear predictor is of the form
$$g\{E(y_{ij} \mid \mathbf X, \mathbf b_i, \boldsymbol\theta_{y\mid x})\} = \mathbf x_{ij}^\top \boldsymbol\beta + \mathbf z_{ij}^\top \mathbf b_i,$$
where $y_{ij}$ is the $j$-th outcome of subject $i$, $\mathbf x_{ij}$ is the corresponding vector of covariate values, $\mathbf b_i$ a vector of random effects pertaining to subject $i$, and $\mathbf z_{ij}$ a column vector containing the row of the design matrix of the random effects, $\mathbf Z$, that corresponds to the $j$-th measurement of subject $i$. $\mathbf Z$ typically contains a subset of the variables in $\mathbf X$, and $\mathbf b_i$ follows a normal distribution with mean zero and covariance matrix $\mathbf D$.
In both cases the parameter vector $\boldsymbol\theta_{y\mid x}$ contains the regression coefficients $\boldsymbol\beta$, and potentially additional variance parameters (e.g., for linear (mixed) models), for which prior distributions will be specified in the section on prior distributions.
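A mixed model of the form above can be specified with `lme_imp()` (or `glme_imp()` for non-Gaussian outcomes), using nlme-style syntax for the random effects. Again a sketch with hypothetical data:

```{r, eval = FALSE}
# Hypothetical longitudinal data: outcome y measured repeatedly per
# subject (id), with a random intercept and a random slope for time.
mod_mixed <- lme_imp(y ~ time + x1 + x2, random = ~ time | id,
                     data = dat_long, n.iter = 1000)
```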
In JointAI most standard link functions are available (for details see the documentation of glm_imp()). One exception is the square-root link, which is not available since it is not implemented as a link in JAGS.
Cumulative logit mixed models are of the form
$$\begin{aligned}
y_{ij} &\sim \text{Mult}(\pi_{ij,1}, \ldots, \pi_{ij,K}),\\
\pi_{ij,1} &= P(y_{ij} \leq 1),\\
\pi_{ij,k} &= P(y_{ij} \leq k) - P(y_{ij} \leq k - 1), \quad k \in 2, \ldots, K-1,\\
\pi_{ij,K} &= 1 - \sum_{k=1}^{K-1} \pi_{ij,k},\\
\operatorname{logit}\{P(y_{ij} \leq k)\} &= \gamma_k + \eta_{ij}, \quad k \in 1, \ldots, K,\\
\eta_{ij} &= \mathbf x_{ij}^\top \boldsymbol\beta + \mathbf z_{ij}^\top \mathbf b_i,\\
\gamma_1, \delta_1, \ldots, \delta_{K-1} &\overset{iid}{\sim} N(\mu_\gamma, \sigma_\gamma^2),\\
\gamma_k &= \gamma_{k-1} + \exp(\delta_{k-1}), \quad k = 2, \ldots, K,
\end{aligned}$$
where $\pi_{ij,k} = P(y_{ij} = k)$ and $\operatorname{logit}(x) = \log\left(\frac{x}{1-x}\right)$. A cumulative logit regression model for a univariate outcome $y_i$ can be obtained by dropping the index $j$ and omitting $\mathbf z_{ij}^\top \mathbf b_i$. In cumulative logit (mixed) models, the design matrix $\mathbf X$ does not contain an intercept, since outcome category specific intercepts $\gamma_1, \ldots, \gamma_K$ are specified. Here, the parameter vector $\boldsymbol\theta_{y\mid x}$ includes the regression coefficients $\boldsymbol\beta$, the first intercept $\gamma_1$ and the increments $\delta_1, \ldots, \delta_{K-1}$.
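Such models can be fitted with `clm_imp()` (univariate) or `clmm_imp()` (mixed); a minimal sketch with a hypothetical ordered factor `y_ord`:

```{r, eval = FALSE}
# Hypothetical ordinal outcome y_ord (an ordered factor), measured
# repeatedly per subject (id), with a random intercept only.
mod_ord <- clmm_imp(y_ord ~ time + x1 + x2, random = ~ 1 | id,
                    data = dat_long, n.iter = 1000)
```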
Survival data are typically characterized by the observed event or censoring times, $T_i$, and the event indicator, $D_i$, which is one if the event was observed and zero otherwise. JointAI provides two types of models to analyse right-censored survival data: a parametric model, which assumes a Weibull distribution for the true (but partially unobserved) survival times $T^*$, and a semi-parametric Cox proportional hazards model.
The parametric survival model is implemented as
$$\begin{aligned}
T_i^* &\sim \text{Weibull}(1, r_i, s),\\
D_i &\sim \mathbb 1(T_i^* \geq C_i),\\
\log(r_i) &= -\mathbf x_i^\top \boldsymbol\beta,\\
s &\sim \text{Exp}(0.01),
\end{aligned}$$
where $C_i$ is the censoring time and $\mathbb 1$ is the indicator function, which is one if $T_i^* \geq C_i$, and zero otherwise.
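A parametric survival model of this form could be fitted with `survreg_imp()`, using the `Surv()` notation from the survival package; the data here are hypothetical:

```{r, eval = FALSE}
library("survival")

# Hypothetical right-censored data: observed time 'time', event
# indicator 'event' (1 = event observed, 0 = censored), and
# partially observed covariates x1 and x2.
mod_weib <- survreg_imp(Surv(time, event) ~ x1 + x2,
                        data = dat_surv, n.iter = 1000)
```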
The Cox proportional hazards model can be written as
$$h_i(t) = h_0(t) \exp(\mathbf x_i^\top \boldsymbol\beta),$$
where $h_0(t)$ is the baseline hazard function, which, in JointAI, is modelled using a B-spline approach, i.e.,
$$h_0(t) = \sum_{q=1}^{d} \gamma_{Bq} B_q(t),$$
where $B_q$ denotes the $q$-th basis function.
The survival function of the Cox model is
$$S(t \mid \boldsymbol\theta) =
\exp\left\{-\int_0^t h_0(s) \exp(\mathbf x_i^\top \boldsymbol\beta)\, ds\right\} =
\exp\left\{-\exp(\mathbf x_i^\top \boldsymbol\beta) \int_0^t h_0(s)\, ds\right\},$$
where $\boldsymbol\theta$ includes the regression coefficients $\boldsymbol\beta$ (which do not contain an intercept) and the coefficients $\boldsymbol\gamma_B$ used in the specification of the baseline hazard. Since the integral over the baseline hazard does not have a closed-form solution, in JointAI it is approximated using Gauss-Kronrod quadrature with 15 evaluation points.
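The corresponding semi-parametric model is available via `coxph_imp()`; a sketch for the same hypothetical data:

```{r, eval = FALSE}
# Cox proportional hazards model; the baseline hazard is modelled
# with B-splines and the integral over it approximated by
# Gauss-Kronrod quadrature, as described above.
mod_cox <- coxph_imp(Surv(time, event) ~ x1 + x2,
                     data = dat_surv, n.iter = 1000)
```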
## Imputation part

A convenient way to specify the joint distribution of the incomplete covariates $\mathbf X_{mis} = (\mathbf x_{mis_1}, \ldots, \mathbf x_{mis_q})$ is to use a sequence of conditional univariate distributions (Ibrahim, Chen, and Lipsitz 2002; Erler et al. 2016)
$$p(\mathbf x_{mis_1}, \ldots, \mathbf x_{mis_q} \mid \mathbf X_{obs}, \boldsymbol\theta_x) =
p(\mathbf x_{mis_1} \mid \mathbf X_{obs}, \boldsymbol\theta_{x_1})
\prod_{\ell=2}^{q} p(\mathbf x_{mis_\ell} \mid \mathbf X_{obs}, \mathbf x_{mis_1}, \ldots, \mathbf x_{mis_{\ell-1}}, \boldsymbol\theta_{x_\ell}), \tag{1}$$
with $\boldsymbol\theta_x = (\boldsymbol\theta_{x_1}^\top, \ldots, \boldsymbol\theta_{x_q}^\top)^\top$.
Each of the conditional distributions is a member of the exponential family, extended with distributions for ordinal categorical variables, and chosen according to the type of the respective variable. Its linear predictor is
$$g_\ell\left\{E(x_{i, mis_\ell} \mid \mathbf x_{i, obs}, \mathbf x_{i, mis_{<\ell}}, \boldsymbol\theta_{x_\ell})\right\} =
(\mathbf x_{i, obs}^\top, x_{i, mis_1}, \ldots, x_{i, mis_{\ell-1}})\, \boldsymbol\alpha_\ell,
\quad \ell = 1, \ldots, q,$$
where $\mathbf x_{i, mis_{<\ell}} = (x_{i, mis_1}, \ldots, x_{i, mis_{\ell-1}})^\top$ and $\mathbf x_{i, obs}$ is the vector of values for subject $i$ of those covariates that are observed for all subjects.
Factorization of the joint distribution of the covariates in such a sequence yields a straightforward specification of the joint distribution, even when the covariates are of mixed type.
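In JointAI, the type of model for each incomplete covariate is selected automatically, based on the class of the variable, but it can be overridden via the `models` argument of the model fitting functions. A sketch with hypothetical variables (the model type keywords used here, `lognorm` and `logit`, are examples taken from the package documentation):

```{r, eval = FALSE}
# Request a log-normal model for the skewed continuous covariate x2
# and a logistic model for the binary covariate x3.
mod <- glm_imp(y ~ x1 + x2 + x3, family = binomial(), data = dat,
               models = c(x2 = "lognorm", x3 = "logit"),
               n.iter = 1000)
```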
Missing values in the covariates are sampled from their full conditional distributions, which can be derived from the full joint distribution of the outcome and covariates.
When, for instance, the analysis model is a GLM, the full conditional distribution of an incomplete covariate $x_{i, mis_\ell}$ can be written as
$$\begin{aligned}
p(x_{i, mis_\ell} \mid y_i, \mathbf x_{i, obs}, \mathbf x_{i, mis_{-\ell}}, \boldsymbol\theta)
\propto\ & p(y_i \mid \mathbf x_{i, obs}, \mathbf x_{i, mis}, \boldsymbol\theta_{y\mid x})\,
p(\mathbf x_{i, mis} \mid \mathbf x_{i, obs}, \boldsymbol\theta_x)\,
p(\boldsymbol\theta_{y\mid x})\, p(\boldsymbol\theta_x)\\
\propto\ & p(y_i \mid \mathbf x_{i, obs}, \mathbf x_{i, mis}, \boldsymbol\theta_{y\mid x})\;
p(x_{i, mis_\ell} \mid \mathbf x_{i, obs}, \mathbf x_{i, mis_{<\ell}}, \boldsymbol\theta_{x_\ell})\\
& \left\{\prod_{k=\ell+1}^{q} p(x_{i, mis_k} \mid \mathbf x_{i, obs}, \mathbf x_{i, mis_{<k}}, \boldsymbol\theta_{x_k})\right\}
p(\boldsymbol\theta_{y\mid x})\, p(\boldsymbol\theta_{x_\ell})
\prod_{k=\ell+1}^{q} p(\boldsymbol\theta_{x_k}),
\end{aligned} \tag{2}$$
where $\boldsymbol\theta_{x_\ell}$ is the vector of parameters describing the model for the $\ell$-th covariate, and contains the vector of regression coefficients $\boldsymbol\alpha_\ell$ and potentially additional (variance) parameters. The product of distributions enclosed by curly brackets represents the models of those covariates that have $\mathbf x_{mis_\ell}$ as a predictor variable in the specification of the sequence in (1).
Note that (2) describes the actual imputation model, i.e., the distribution from which the imputed values for $x_{i, mis_\ell}$ are sampled, while the sequence of covariate models in (1) is used to specify the joint distribution; (2) is then derived from that joint distribution.
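To make this concrete, consider the case of $q = 2$ incomplete covariates: the full conditional used to impute the first covariate contains the analysis model, the covariate's own model, and the model for the second covariate, which has the first one as a predictor,
$$p(x_{i, mis_1} \mid y_i, \mathbf x_{i, obs}, x_{i, mis_2}, \boldsymbol\theta) \propto
p(y_i \mid \mathbf x_{i, obs}, x_{i, mis_1}, x_{i, mis_2}, \boldsymbol\theta_{y\mid x})\;
p(x_{i, mis_1} \mid \mathbf x_{i, obs}, \boldsymbol\theta_{x_1})\;
p(x_{i, mis_2} \mid \mathbf x_{i, obs}, x_{i, mis_1}, \boldsymbol\theta_{x_2}).$$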
Factorizing the joint distribution into an analysis model and an imputation part also facilitates extensions to settings with more complex outcomes, such as repeatedly measured outcomes. In the case where the analysis model is a GLMM or an ordinal mixed model, the conditional distribution of the outcome in (2), $p(y_i \mid \mathbf x_{i, obs}, \mathbf x_{i, mis}, \boldsymbol\theta_{y\mid x})$, has to be replaced by
$$\left\{\prod_{j=1}^{n_i} p(y_{ij} \mid \mathbf x_{i, obs}, \mathbf x_{i, mis}, \mathbf b_i, \boldsymbol\theta_{y\mid x})\right\}. \tag{3}$$
Since $\mathbf y$ does not appear in any of the other terms in (2), and (3) can be chosen to be a model that is appropriate for the outcome at hand, the full conditional distribution of $x_{i, mis_\ell}$ specified in this way allows us to draw valid imputations that use all available information on the outcome.
This is an important difference from multiple imputation using a fully conditional specification (FCS, also known as MICE), where the full conditional distributions used to impute missing values are specified directly, usually as regression models, and require the outcome to be explicitly included in the linear predictor of each imputation model. In settings with complex outcomes it is not clear how this should be done, and simplifications may lead to biased results (Erler et al. 2016). The joint model specification used in JointAI overcomes this difficulty.
When some of the covariates are time-varying, it is convenient to specify the models for these variables at the beginning of the sequence of covariate models, so that the models for longitudinal variables have other longitudinal as well as baseline covariates in their linear predictors, while longitudinal covariates do not enter the predictors of baseline covariates.
Note that, whenever there are incomplete baseline covariates, it is necessary to specify models for all longitudinal variables, even completely observed ones, while models for completely observed baseline covariates can be omitted. This becomes clear when we extend the factorized joint distribution from above with completely and incompletely observed longitudinal (level-1) covariates $\mathbf s_{obs}$ and $\mathbf s_{mis}$:
$$\begin{aligned}
& p(y_{ij} \mid \mathbf s_{ij, obs}, \mathbf s_{ij, mis}, \mathbf x_{i, obs}, \mathbf x_{i, mis}, \boldsymbol\theta_{y\mid x})\;
p(\mathbf s_{ij, mis} \mid \mathbf s_{ij, obs}, \mathbf x_{i, obs}, \mathbf x_{i, mis}, \boldsymbol\theta_{s_{mis}})\;
p(\mathbf s_{ij, obs} \mid \mathbf x_{i, obs}, \mathbf x_{i, mis}, \boldsymbol\theta_{s_{obs}})\\
& \times\; p(\mathbf x_{i, mis} \mid \mathbf x_{i, obs}, \boldsymbol\theta_{x_{mis}})\;
p(\mathbf x_{i, obs} \mid \boldsymbol\theta_{x_{obs}})\;
p(\boldsymbol\theta_{y\mid x})\, p(\boldsymbol\theta_{s_{mis}})\, p(\boldsymbol\theta_{s_{obs}})\, p(\boldsymbol\theta_{x_{mis}})\, p(\boldsymbol\theta_{x_{obs}}).
\end{aligned}$$
Given that the parameter vectors $\boldsymbol\theta_{x_{obs}}$, $\boldsymbol\theta_{x_{mis}}$, $\boldsymbol\theta_{s_{obs}}$ and $\boldsymbol\theta_{s_{mis}}$ are a priori independent, and $p(\mathbf x_{i, obs} \mid \boldsymbol\theta_{x_{obs}})$ is independent of both $\mathbf x_{mis}$ and $\mathbf s_{mis}$, it can be omitted.
Since $p(\mathbf s_{ij, obs} \mid \mathbf x_{i, obs}, \mathbf x_{i, mis}, \boldsymbol\theta_{s_{obs}})$, however, has $\mathbf x_{i, mis}$ in its linear predictor and will, hence, be part of the full conditional distribution of $\mathbf x_{i, mis}$, it cannot be omitted from the model.
Other settings in which the fully Bayesian approach employed in JointAI has an advantage over standard FCS are settings with interaction terms that involve incomplete covariates, or when the association of the outcome with an incomplete covariate is non-linear. In standard FCS such settings lead to incompatible imputation models (White, Royston, and Wood 2011; Bartlett et al. 2015). This becomes clear when considering the following simple example, where the analysis model of interest is the linear regression $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$ and $x_i$ is imputed using $x_i = \alpha_0 + \alpha_1 y_i + \tilde\varepsilon_i$. While the analysis model assumes a quadratic relationship, the imputation model assumes a linear association between $x$ and $y$, and there cannot be a joint distribution that has the imputation and analysis model as its full conditional distributions.
Because, in JointAI, the analysis model is a factor in the full conditional distribution that is used to impute $x_i$, the non-linear association is taken into account. Furthermore, since it is the joint distribution that is specified, and the full conditional then derived from it, the joint distribution is ensured to exist.
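A minimal sketch of such a model in JointAI, with a hypothetical data frame `dat`:

```{r, eval = FALSE}
# The quadratic term is part of the analysis model formula; because
# the analysis model is a factor in the full conditional of x, the
# imputed values of x reflect the quadratic association with y.
mod_quad <- lm_imp(y ~ x + I(x^2), data = dat, n.iter = 1000)
```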
## Prior distributions

Prior distributions have to be specified for all (hyper)parameters. A common prior choice for the regression coefficients is the normal distribution with mean zero and large variance. In JointAI, the variance parameters in models for normally distributed variables have inverse-gamma prior distributions, which are vague by default.
The covariance matrix of the random effects in a mixed model, $\mathbf D$, is assumed to follow an inverse-Wishart distribution, where the degrees of freedom are usually chosen to be equal to the dimension of the random effects, and the scale matrix is diagonal. Since the magnitude of the diagonal elements relates to the variance of the random effects, the choice of suitable values depends on the scale of the variable the random effect is associated with. Therefore, JointAI uses independent gamma hyper-priors for each of the diagonal elements. More details about the default hyper-parameters and how to change them are given in the section on hyper-parameters in the vignette about Model Specification.
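The default hyper-parameters can be inspected with `default_hyperpars()`, and a modified version can be passed to the model fitting functions via the `hyperpars` argument. A sketch; the element name `norm` (hyper-parameters for models of normally distributed variables) is taken from the package documentation and may differ between versions:

```{r, eval = FALSE}
library("JointAI")

# Inspect the full list of default hyper-parameters.
hyp <- default_hyperpars()

# Hyper-parameters used in models for normally distributed variables.
hyp$norm

# A modified version can be supplied when fitting a model, e.g.:
# lm_imp(y ~ x1 + x2, data = dat, hyperpars = hyp, n.iter = 1000)
```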