class: left, top, title-slide
Bayesian Methods for Handling Missing Values in Complex Settings
CMStatistics 2021
Nicole Erler
Department of Biostatistics
n.erler@erasmusmc.nl
N_Erler
NErler
https://nerler.com
--- count: false layout: true <div class="my-footer"><span> <a href="https://twitter.com/N_Erler"><i class="fab fa-twitter"></i> N_Erler</a>      <a href="https://github.com/NErler"><i class="fab fa-github"></i> NErler</a>      <a href = "https://nerler.com"><i class="fas fa-globe-americas"></i> nerler.com</a> </span></div> --- ## How do we Handle Missing Data? [Shivasabesan et al. (2018)](https://doi.org/10.1016/j.injury.2018.03.035) * all trauma registry-based manuscripts (2015 - 2017) * 241/539 (45%) described how missing data was handled: * 234 (43%) **complete case analysis** (CCA) * 18 (3%) **multiple imputation** (MI) * 34 (6%) combination of CCA & MI ??? Life isn't perfect, and neither is real world data. And, so in most datasets, there are some or even many observations missing, at least when you work with clinical data like I do. In randomized controlled trials, missing data is usually restricted to missing outcome values when participants drop out early, but in observational studies missing values are a lot bigger issue. We need to include many covariates in the model to reduce bias due to confounding, and the more variables we need, the more likely it is that some haven't been recorded very thoroughly. When data is collected retrospectively things are usually worse then in prospective studies, and proportion of missing values up to 50% in some variables are not uncommon. Missing data are such a widespread issue, that it is not even feasible to get an overview of how many studies have to deal with them. But there are a few studies that have looked into how missing data are handled and reported in specific types of studies or fields of application. - - - A more recent systematic review in the clinical journal "Injury" investigated missing data in trauma registries. About half of the 539 included papers described how missing data was handled, and the most common approach was again to perform a complete case analysis, followed by a combination with multiple imputation or just multiple imputation. -- [Carroll et al. (2020)](https://doi.org/10.1186/s12874-020-01018-7) * observational time-to-event studies in oncology using prop. hazards model (2012 - 2018) * approx. 130/148 (88%) reported on the missing data method * 79 (53%) **complete case analysis** * 33 (22%) **multiple imputation** ??? Last year, Carroll and colleagues published their investigation on missing data methods in observational time-to-event studies that used proportional hazards models. They included 148 studies, of which most reported on how missing data was handled, and, again, half of the studies performed complete case analyses, and the next most popular method was multiple imputation. --- ## How do we Handle Missing Data? [Hunt et al. (2021)](https://doi.org/10.1002/pds.5245) * multi-database pharmacoepidemiologic studies (2018 - 2019) * 19/62 (31%) reported the missing data method: * 13 (21%) **complete case analysis** * 4 (6%) **multiple imputation** ??? Even more recently, a review on multi-database studies in pharmacoepidemiology found that in that area of research, only on third of the studies reported on the missing data method, and also there, complete cases analysis, followed by multiple imputation was the most commonly reported method. - - - More and more clinical researchers are aware that in most settings complete case analysis is not appropriate, and multiple imputation is somehow considered the universal remedy to the missing data problem. --- ## Multiple Imputation <img src = "figures/MI.png", height = 480, style = "margin: auto; display: block;"> ??? The principle of multiple imputation is simple: * We impute each missing value multiple times, in order to take into account the uncertainty that we have about the missing value, and thereby create a set of completed datasets. * Each of these datasets can then be analysed by standard complete data methods, and the results of these separate analyses are then pooled to obtain an overall result. One often cited advantage of this approach is that we can run this imputation once, and then re-use the imputed data for multiple analyses performed by different researchers. --- ## Imputation Step .flex-grid[ .col[ <table class="simpletable"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\color{var(--nord15)}{\mathbf x_1}\)</th> <th>\(\mathbf x_2\)</th> <th>\(\mathbf x_3\)</th> </tr> <tr><td></td><td colspan = "4"; style = "padding: 0px;"><hr /></td><tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class = "rownr"></td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> </tr> </table> * `\(\mathbf y\)`: **response** variable * `\(\color{var(--nord15)}{\mathbf x_1}\)`: **incomplete** covariate * `\(\mathbf x_2\)`, `\(\mathbf x_3\)`: **complete** covariates <div style = "width: 450px;"></div> ] .col[ **Sampling from the predictive distribution:  ** {{content}} ] ] ??? The most interesting part of this multiple imputation is the imputation step. Missing values should be imputed from their predictive distribution. The idea for imputation of `\(x_1\)` is similar to predicting the missing values. We can think of this as fitting a model to the cases where `\(x_1\)` is observed, **for example**, this could be a **linear regression model** that has the **response `\(y\)`** and the other covariates as its covariates. From this model, we can get estimates for the regression coefficients `\(\beta\)` and the standard deviation of the residuals `\(\sigma\)`. - - - - - -- * Define a model for the incomplete variable `\(\color{var(--nord15)}{\mathbf x_1}\)`, e.g. `$$\color{var(--nord15)}{\mathbf x_1} = \beta_0 + \beta_1 \mathbf y + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \varepsilon$$` to get estimates `\(\boldsymbol{\hat\beta}\)` (and `\(\hat\sigma\)`) {{content}} -- <br> * Sample `\(\color{var(--nord15)}{\hat{x}_{i1}}\)` from the predictive distribution `$$p(\color{var(--nord15)}{x_{i1}} \mid y_i, x_{i2}, x_{i3}, \boldsymbol{\hat\beta}, \hat\sigma)$$` ??? And using these parameter estimates we can then calculate the fitted or predicted value of `\(x_1\)` for those cases where `\(x_1\)` is missing. --- ## Multiple Imputation using FCS (MICE) .flex-grid[ .col[ <div style = "text-align: center; margin-bottom: 25px;"> <strong>Multivariate<br>Missingness</strong></div> <table class="simpletable"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\color{var(--nord15)}{\mathbf x_1}\)</th> <th>\(\color{var(--nord15)}{\mathbf x_2}\)</th> <th>\(\color{var(--nord15)}{\mathbf x_3}\)</th> <th>\(\ldots\)</th> </tr> <tr><td></td><td colspan = "5"; style = "padding: 0px;"><hr /></td><tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class = "rownr"></td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td></td> </tr> </table> ] .col[ **Most common approach:**<br> <span style = "font-weight: bold;">MICE</span> <span style = "color: var(--lgrey);">(multivariate imputation by chained equations)</span><br> <span style = "font-weight: bold;">FCS</span> <span style = "color: var(--lgrey);">(fully conditional specification), i.e.,</span> `$$p(\color{var(--nord15)}{x_{mis}} \mid \text{everything else})$$` <div> \begin{alignat}{10} \color{var(--nord15)}{\mathbf x_1} &= \beta_0 &+& \beta_1 \mathbf y &+& \beta_2 \color{var(--nord15)}{\mathbf x_2} &+& \beta_3 \color{var(--nord15)}{\mathbf x_3} &+& \ldots &+& \boldsymbol\varepsilon \\ \color{var(--nord15)}{\mathbf x_2} &= \alpha_0 &+& \alpha_1 \mathbf y &+& \alpha_2 \color{var(--nord15)}{\mathbf x_1} &+& \alpha_3 \color{var(--nord15)}{\mathbf x_3} &+& \ldots &+& \boldsymbol\varepsilon \\ \color{var(--nord15)}{\mathbf x_3} &= \theta_0 &+& \theta_1 \mathbf y &+& \theta_2 \color{var(--nord15)}{\mathbf x_1} &+& \theta_3 \color{var(--nord15)}{\mathbf x_2} &+& \ldots &+& \boldsymbol\varepsilon \end{alignat} </div> {{content}} ] ] ??? In practice, we usually have missing values in just a single variable, but in multiple variables. And the most common approach to imputation in this setting is MICE, short for **multivariate imputation by chained equations**, an approach that is also called **fully conditional specification**. The principle is an extension to what we've seen on the previous slide. We impute missing values using models that have all other data in their linear predictor. - - - -- <br> * iterative * based on the idea of the Gibbs sampler ??? Because in these imputation models we now have incomplete covariates, we use an iterative algorithm. You start by randomly drawing starting values from the observed part of the data, and then you cycle through the incomplete variables and impute one at a time. Once we have imputed each missing value, we start again with the first variable, but now use the imputed values of the other variables instead of the starting values, and we do this a few times until the algorithm has converged. This algorithm is based on the idea of the Gibbs sampler, which allows you to sample from a multivariate distribution by iteratively sampling from the full conditionals derived from the joint distribution. --- ## A Simple Example .pull-left[ **Implied Assumption:**<br> <span>Linear association</span> between `\(\color{var(--nord15)}{\mathbf x_1}\)` and `\(\mathbf y\)`: `$$\color{var(--nord15)}{\mathbf x_1} = \theta_0 + \bbox[#3B4252, 2pt]{\theta_1 \mathbf y} + \theta_2 \mathbf x_2 + \theta_3 \mathbf x_3 + \ldots$$` <!-- <img src="figures/linplot.png", width = "450", height = "300", style="position:absolute; bottom:45px;"> --> <img src="index_files/figure-html/unnamed-chunk-2-1.png" width="100%" style="display: block; margin: auto;" /> ] ??? One thing that is implied when we use such simple regression models for the imputation is that there is a linear association between incompl. covariate and the response (and other covariates). -- .pull-right[ <br> But what if `$$\mathbf y = \beta_0 + \bbox[#3B4252, 2pt]{\beta_1 \color{var(--nord15)}{\mathbf x_1} + \beta_2 \color{var(--nord15)}{\mathbf x_1}^2} + \beta_3 \mathbf x_2 + \ldots$$` <img src="index_files/figure-html/unnamed-chunk-3-1.png" width="100%" style="display: block; margin: auto;" /> <!-- <img src="figures/qdrplot.png", width = "450", height = "300", style="position:absolute; bottom:45px;"> --> ] ??? But that is of course not always the case. What if we have a setting where we assume that there is a non-linear association, for example quadratic? --- ## Non-linear Associations .pull-left[ * <span style="font-weight: bold; color:var(--nord4);">true association</span>: non-linear * <span style="font-weight: bold; color:var(--nord7);">imputation assumption</span>: linear ] .pull-right[ ] <img src="index_files/figure-html/unnamed-chunk-4-1.png" width="80%" style="display: block; margin: auto;" /> ??? If we * correctly assume a non-linear association in the analysis model * but a linear association in the imputation model we introduce bias, even if we analyse the imputed data under the correct assumption because the imputed values will follow the linear association and thereby reduce the shape of the structure in the observed data. --- count: false class: animated, fadeIn ## Non-linear Associations .pull-left[ * <span style="font-weight: bold; color:var(--nord4);">true association</span>: non-linear * <span style="font-weight: bold; color:var(--nord7);">imputation assumption</span>: linear ] .pull-right[ <span style="font-size: 56pt; position: relative; right: 110px; bottom: 20px;">} ⇨</span> <span style = "color: var(--nord11); font-size: 1.2rem; font-weight: bold; position: relative; bottom: 30px; right: 100px;"> bias!</span> ] <img src="index_files/figure-html/unnamed-chunk-5-1.png" width="80%" style="display: block; margin: auto;" /> ??? If we * correctly assume a non-linear association in the analysis model * but a linear association in the imputation model we introduce bias, even if we analyse the imputed data under the correct assumption because the imputed values will follow the linear association and thereby reduce the shape of the structure in the observed data. --- ## Time-to-Event Outcomes (Simple) **Proportional Hazards Model:** `$$h(t) = h_0(t) \exp\left(\color{var(--nord15)}{\mathbf x}^\top \boldsymbol\beta\right)$$` -- **Log-likelihood** `$$p(\mathbf T, \boldsymbol \delta \mid \color{var(--nord15)}{\mathbf x}, \boldsymbol\beta) = \boldsymbol\delta \left(\log h_0(T) + \color{var(--nord15)}{\mathbf x} \boldsymbol\beta\right) - \int_0^T h_0(s)\exp\left( \color{var(--nord15)}{\mathbf x} \boldsymbol\beta\right)ds$$` <br> -- Inconsistent with the (naive) imputation model: `$$\color{var(--nord15)}{\mathbf x} = \theta_0 + \theta_1 \mathbf T + \theta_2 \boldsymbol\delta + \theta_3\ldots$$` ??? The corresponding log likelihood then is given by this formula here. This is the likelihood used in our analysis model, and it implies a non-linear association between the response variables, the event time and event indicator, and the potentially incomplete covariates `\(x\)`. --- ## Multi-level Data .flex-grid[ .col[ <img src="index_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" /> ] .col[ <div style = "width: 300px;"> </div> <table class="simpletable"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\color{var(--nord15)}{\mathbf x_1}\)</th> <th>\(\mathbf x_2\)</th> <th>\(\mathbf x_3\)</th> </tr> <tr><td></td><td colspan = "4"; style = "padding: 0px;"><hr /></td><tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr class="hlgt-row"> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr class = "hlgt-row"> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr class = "hlgt-row"> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class = "rownr"></td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> </tr> </table> ] ] ??? We have multi-level data when we have measured the same variable repeatedly in the same patient, but also when we have a clustering structure in our data, for example in a multi-center study. In both cases, observations from the same patient, or the same cluster are not independent. We typically represent this type of data in long format, so that we now have multiple rows that belong to the same patient. --- ## FCS in Multi-level Data If `\(\color{var(--nord15)}{\mathbf x_1}\)` is **level-1**:<br> `\(p(\color{var(--nord15}{x_{i1}(t)} \mid \text{everything else})\)` could be a mixed model, e.g.: `$$x_{i1}(t) = \underset{\color{var(--nord3)}{\text{fixed effects}}}{\color{var(--nord3)}{\underbrace{\color{var(--nord4)}{\theta_0 + \theta_1 y_i(t) + \theta_2 x_{2i}(t) + \theta_3 x_{3i}(t)}}}} + \underset{\color{var(--nord3)}{\substack{\text{random}\\\text{effects}}}}{\color{var(--nord3)}{\underbrace{\color{var(--nord4)}{\mathbf b_i \mathbf z_i(t)}}}} + \varepsilon_i(t)$$` -- .flex-grid[ .col[ But what if `\(\color{var(--nord15)}{\mathbf x_1}\)` is a **level-2** variable? {{content}} ] .col[ <table class="simpletable"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\color{var(--nord15)}{\mathbf x_1}\)</th> <th>\(\mathbf x_2\)</th> <th>\(\mathbf x_3\)</th> </tr> <tr><td></td><td colspan = "4"; style = "padding: 0px;"><hr /></td> <tr> <td class = "rownr"></td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> </tr> <tr class="hlgt-row"> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr class = "hlgt-row"> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr class = "hlgt-row"> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class = "rownr"></td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> </tr> </table> ] ] ??? If `\(x_1\)` is a level-2 variable: * imputed values for the same subject should be identical {{content}} -- * Use a mixed model? {{content}} -- <i class="fas fa-thumbs-down" style = "color:var(--nord11);"></i> {{content}} -- * Use a GLM? {{content}} -- <i class="fas fa-thumbs-down" style = "color:var(--nord11);"></i> {{content}} -- <br> <ul class = "arrowlist"> <li><strong>Average</strong> imputed values?</li> <li><strong>Summarize</strong> time-varying variables?</li> <li>Imputation in <strong>wide format</strong>?</li> </ul> ??? ⇨ For incomplete baseline variables imputation in wide format might be better(?) --- ## Non-linear Associations & Multi-level In settings with **non-linear associations** and **multi-level data** the **correct predictive distribution** $$ p(\color{var(--nord15)}{\mathbf x_{mis}} \mid \text{everything else})$$ may not have a closed form. <br> <div style="width: 85%; padding: 1.15em 2em; background-color: var(--nord0); margin: auto; display: block; text-align: center;"> <strong><span style="font-size: 1.5rem;">⇨</span> We cannot easily specify the correct imputation model directly.</strong> </div> ??? So, in summary, whenever we have a non-linear association between the response and incomplete covariates, including time-to-event outcomes, or when we have multi-level data, it is not straightforward to specify the correct imputation model, meaning the correct predictive distribution of the missing values, directly because usually they do not have a closed form. --- ## Getting the Correct Distribution .flex-grid[ .col[ * We need `\(\;p(\color{var(--nord15)}{\mathbf x} \mid \mathbf y, \ldots)\)` * We know `\(\;p(\mathbf y \mid \color{var(--nord15)}{\mathbf x}, \ldots)\)` ] .col[ <div style = "color: var(--nord3);"> <ul> <li>\(\mathbf x\): incomplete covariate(s)</li> <li>\(\mathbf y\): outcome(s)</li> <li>\(\ldots\): everything else </li> </ul> </div> ] ] ??? The interesting question now is: How do we get this predictive distribution? What we need is the distribution of the incomplete variable `\(x\)` conditional on the response and everything else. And what we do know is the distribution of the response conditional on the covariates and everything else, because this is the analysis model that we have specified already. And here Bayes comes to the rescue. -- **Bayes Theorem:** `\begin{align} p(\color{var(--nord15)}{\mathbf x} \mid \mathbf y, \ldots) = \frac{p(\mathbf y \mid \color{var(--nord15)}{\mathbf x}, \ldots)\; p(\color{var(--nord15)}{\mathbf x},\ldots)}{p\left(\mathbf y, \ldots\right)} &\propto \underset{\text{joint distribution}}{\underbrace{p(\mathbf y, \color{var(--nord15)}{\mathbf x}, \ldots)}} \end{align}` ??? Because Bayes theorem allows us to convert a conditional distribution into the reversed conditional, and so using this theorem we can derive that the distribution of interest, the distribution of the incomplete covariate conditional on everything else is proportional to the distribution of the response conditional on the covariates, and the distribution of the covariates and everything else. We can now factorize this further, but I find it easier to just look at it as a specification of the joint distribution of everything. -- For example: $$ p(\mathbf y, \color{var(--nord15)}{\mathbf x_1}, \mathbf x_2, \mathbf x_3, \boldsymbol\theta) = \underset{\text{analysis model}}{\underbrace{p(\mathbf y \mid \color{var(--nord15)}{\mathbf x_1}, \mathbf x_2, \mathbf x_3, \boldsymbol\theta)}}\; \underset{\text{covariate model(s)}}{\underbrace{p(\color{var(--nord15)}{\mathbf x_1}, \mathbf x_2, \mathbf x_3 \mid \boldsymbol\theta)}}\; \underset{\text{priors}}{\underbrace{p(\boldsymbol\theta)}} $$ ??? And we can use this principle for our example. So we can write the joint distribution of `\(y\)` and `\(x_1\)`, `\(x_2\)` and `\(x_3\)` as the conditional distribution of `\(y\)` given the covariates., the distribution of the covariates conditional on the parameters, and, of course, prior distributions for those parameters. --- name: covariatemodels ## Covariate Models For example `\begin{align} p(\color{var(--nord15)}{\mathbf x_1}, \mathbf x_2, \color{var(--nord15)}{\mathbf x_3(t)}, \mathbf x_4(t) \mid \boldsymbol\theta) =\, & p(\color{var(--nord15)}{\mathbf x_3(t)} \mid \color{var(--nord15)}{\mathbf x_1}, \mathbf x_2, \mathbf x_4(t), \boldsymbol\theta) & \color{var(--nord3)}{\small\text{e.g., GLMM}}\\ & p(\mathbf x_4(t) \mid \color{var(--nord15)}{\mathbf x_1}, \mathbf x_2, \boldsymbol\theta) & \color{var(--nord3)}{\small\text{e.g., GLMM}}\\ & p(\color{var(--nord15)}{\mathbf x_1} \mid \mathbf x_2, \boldsymbol\theta) & \color{var(--nord3)}{\small\text{e.g., GLM}}\\ & \color{var(--nord3)}{p(\mathbf x_2 \mid \boldsymbol\theta)} & \color{var(--nord3)}{\small\text{(can be omitted)}} \end{align}` ??? [Jump to Hep C example default values slide](#hepcdefaults) -- <br> <div style="width: 65%; padding: 1.15em 2em; background-color: var(--nord0); margin: auto; display: block; text-align: center;"> Covariate models \(\neq\) imputation models! </div> --- ## Estimation <div class = "container"> <div class = "box"> <div class = "box-row"> <div class = "box-cell" style = "background: var(--nord0);">imputation<br>models</div> <div class = "box-cell">\(\propto\)</div> <div class = "box-cell" style = "background: var(--nord0);">joint<br>distribution</div> <div class = "box-cell">\(=\)</div> <div class = "box-cell" style = "background: var(--nord0);">analysis<br>model</div> <div class = "box-cell" style = "background: var(--nord0);">covariate<br>models</div> <div class = "box-cell" style = "background: var(--nord0)">priors</div> </div> </div> </div> .flex-grid[ .col[ <img src = "figures/Gibbs.png", height = 350;> ] .col[ <br> <div style = "width: 650px;"> Estimation via MCMC<br> ⇨ Gibbs sampler </div> ] ] ??? In most cases, the posterior distribution will not have a closed form and so we won't be able to derive it analytically. Instead, Markov Chain Monte Carlo methods are used to create a sample from the posterior distribution. The results from the Bayesian analysis are then presented as summary measures of this sample, usually the mean and the 2.5% and 97.5% quantiles, which form the 95% credible interval. Since, in the Bayesian framework, the result is given in terms of the probability distribution of the unknown parameters conditional on the data that was observed, these results have a more intuitive interpretation than frequentist results. -- <img src = "figures/MICE.png", height = 300; style = "position: absolute; bottom: 50px; right: 60px;"> --- ## Advantages of the Bayesian Approach `$$p(\mathbf y, \mathbf x, \boldsymbol\theta) = \bbox[#2E3440, 5pt]{\underset{\text{analysis model}}{\underbrace{p(\mathbf y \mid \mathbf x, \color{var(--nord14)}{\boldsymbol\theta_{y\mid x}})}}}\; p(\mathbf x \mid \boldsymbol\theta_x)\; p(\color{var(--nord14}{\boldsymbol\theta_{y\mid x}}, \boldsymbol\theta_x)\qquad \color{var(--nord3)}{\text{with } \boldsymbol\theta = (\boldsymbol\theta_{y\mid x}, \boldsymbol\theta_x)}$$` * specification of the joint distribution<br> ⇨ assures **compatibility** of imputation models -- * use of the analysis model in the specification of the joint distribution<br> ⇨ **parameters of interest** `\(\color{var(--nord14)}{\boldsymbol\theta_{y\mid x}}\)` are estimated directly<br> ⇨ **non-linear associations** are taken into account<br> ⇨ assures **congeniality** -- * response is not in any linear predictor<br> ⇨ no problem to use **complex outcomes** --- ## In Complex Settings **(Multivariate) joint model for longitudinal and survival data:** `$$p(\mathbf T, \boldsymbol\delta, \mathbf y, \color{var(--nord15)}{\mathbf x}, \mathbf b, \boldsymbol\theta) = \underset{\text{analysis model}}{\underbrace{ \underset{\substack{\text{survival}\\\text{model}}}{\underbrace{ p(\mathbf T, \boldsymbol\delta \mid \mathbf b, \color{var(--nord15)}{\mathbf x}, \boldsymbol \theta)}}\;\; \underset{\substack{\text{(multivariate)}\\\text{longitudinal}\\\text{model}}}{\underbrace{ p(\mathbf y \mid \mathbf b, \color{var(--nord15)}{\mathbf x}, \boldsymbol\theta)\; p(\mathbf b\mid \boldsymbol \theta)}} }}\;\; \underset{\substack{\text{covariate}\\\text{models}}}{\underbrace{ p(\color{var(--nord15)}{\mathbf x} \mid \boldsymbol\theta)}}\;\; \underset{\text{priors}}{\underbrace{p(\boldsymbol\theta)}}$$` ??? For example, for our multivariate joint model for longitudinal and survival data. There, the model for the response consists of multiple parts, the time-to-event model, and the multivariate mixed model for the longitudinal biomarkers. -- <br> <div class = "container"> <div class = "box"> <div class = "box-row"> <div class = "box-cell" style = "background: var(--nord0);">joint<br>distribution</div> <div class = "box-cell">\(=\)</div> <div class = "box-cell" style = "background: var(--nord0);">analysis<br>model</div> <div class = "box-cell" style = "background: var(--nord0);">covariate<br>models</div> <div class = "box-cell" style = "background: var(--nord0)">priors</div> </div> </div> </div> ??? And this means that we can always divide the joint distribution into the analysis model, a model for the covariates, and prior distributions. And this also works for fairly complex models. --- ## In practice: <i class="fab fa-r-project"></i> Package JointAI <div class = "container"> <div class = "box" style = "margin-left: 20px;"> <div class = "box-row"> <div class = "box-cell" style="width:180px;"> <strong>Specification</strong><br>of the joint<br> distribution: </div> <div class = "box-cell" style = "background: var(--nord0);"> <div style = "padding-bottom: 10px;"> <strong>User</strong></div> <div class = "box-cell" style = "background: var(--nord1); border: 2px solid var(--nord8); ">analysis<br>model</div> </div> <div class = "box-cell" style = "background: var(--nord0);"> <div style = "padding-bottom: 10px;"> <strong>JointAI</strong></div> <div class = "box-row"> <div class = "box-cell" style = "background: var(--nord1);">covariate<br>models</div> <div class = "box-cell" style = "background: var(--nord1)">priors</div> </div></div> </div></div></div> -- <br> <div class = "container"> <div class = "box" style = "margin-left: 20px;"> <div class = "box-row"> <div class = "box-cell" style="width:180px;"> <strong>Estimation:</strong> </div> <div class = "box-cell" style = "background: var(--nord0);"> <div style = "padding-bottom: 10px;"> <strong>JointAI</strong></div> <div class = "box-cell" style = "background: var(--nord1);">pre-processing</div> </div> <div class = "box-cell" style = "background: var(--nord0);"> <div style = "padding-bottom: 10px;"> <strong>rjags / JAGS</strong></div> <div class = "box-cell" style = "background: var(--nord1);">MCMC sampling</div> </div> <div class = "box-cell" style = "background: var(--nord0);"> <div style = "padding-bottom: 10px;"> <strong>JointAI</strong></div> <div class = "box-cell" style = "background: var(--nord1);">post-processing</div> </div> </div></div></div> --- ## JointAI: Model Types & Specification .flex-grid[ .col[ **Univariate Models:** * `lm_imp()` * `glm_imp()` * `clm_imp()` * `mlogit_imp()` * `betareg_imp()` * `lognorm_imp()` ] .col[ **Mixed Models** * `lme_imp()` * `glme_imp()` * `clmm_imp()` * `mlogitmm_imp()` * `betamm_imp()` * `lognormmm_imp()` ] .col[ **Survival Models** * `coxph_imp()` * `survreg_imp()` * `JM_imp()` ] ] **For Example:** ```r mod <- lm_imp(SBP ~ gender + age + alc * creat, data = NHANES, n.iter = 200) ``` <p class = "smallbreak"> </p> .footnote[ [Full documentation at https://nerler.github.io/JointAI/](https://nerler.github.io/JointAI/) ] --- ## JointAI: More Features .flex-grid[ .col[ <button class="modal-button" href="#myModal1">Model Specification</button> <div id="myModal1" class="modal"> <div class="modal-content"> <span class="close">×</span> <ul> <li>auxiliary variables (<a href = "https://nerler.github.io/JointAI/articles/ModelSpecification.html#auxiliary-variables"><code>auxvars</code></a>)</li> <li> shrinkage (ridge) (<a href = "https://nerler.github.io/JointAI/articles/ModelSpecification.html#shrinkage"><code>shrinkage</code></a>) <li> truncation of continuous distributions (<a href = "https://nerler.github.io/JointAI/articles/ModelSpecification.html#functions-with-restricted-support"><code>trunc</code></a>) <li> setting reference categories (<a href="https://nerler.github.io/JointAI/articles/ModelSpecification.html#reference-categories"><code>refcats</code></a>) <li> change hyper-parameters (<a href = "https://nerler.github.io/JointAI/articles/ModelSpecification.html#hyper-parameters"><code>hyperpars</code></a>) <li> parameters to be monitored (<a href="https://nerler.github.io/JointAI/articles/SelectingParameters.html"><code>monitor_params</code></a>)</li> <li> export the JAGS model (<code>modelname</code>, <code>modeldir</code>, <code>keep_model</code>) <li> customize the JAGS model (<code>modelname</code>, <code>modeldir</code>, <code>overwrite</code>) </ul> </div> </div> <button class="modal-button" href="#myModal2">MCMC settings</button> <div id="myModal2" class="modal"> <div class="modal-content"> <span class="close">×</span> <ul> <li>adaptive phase (<a href="https://nerler.github.io/JointAI/articles/MCMCsettings.html#adaptive-phase"><code>n.adapt</code></a>)</li> <li>sampling phase (<a href="https://nerler.github.io/JointAI/articles/MCMCsettings.html#sampling-iterations"><code>n.iter</code></a>)</li> <li>number of chains (<a href="https://nerler.github.io/JointAI/articles/MCMCsettings.html#number-of-chains"><code>n.chains</code></a>)</li> <li>thinning (<a href="https://nerler.github.io/JointAI/articles/MCMCsettings.html#thinning"><code>thin</code></a>)</li> <li>initial values (<a href="https://nerler.github.io/JointAI/articles/MCMCsettings.html#initial-values"><code>inits</code></a>, <code><object>$mcmc_settings$inits</code>)</li> <li>seed value (<code>seed</code>)</li> <li>continue sampling (<a href="https://nerler.github.io/JointAI/reference/add_samples.html"><code>add_samples()</code></a>)</li> </ul> </div> </div> <button class="modal-button" href = "#modal-plots">Plots</button> <div class="modal" id="modal-plots"> <div class="modal-content"> <span class="close">×</span> <ul> <li>data distribution (<a href="https://nerler.github.io/JointAI/articles/VisualizingIncompleteData.html#visualize-the-distribution-of-each-variable"> <code>plot_all()</code></a>)</li> <li>missing data pattern (<a href = "https://nerler.github.io/JointAI/articles/VisualizingIncompleteData.html#missing-data-pattern"> <code>md_pattern()</code></a>)</li> <li>MCMC chains (<a href = "https://nerler.github.io/JointAI/articles/AfterFitting.html#trace-plot"> <code>traceplot()</code></a>)</li> <li>posterior density (<a href="https://nerler.github.io/JointAI/articles/AfterFitting.html#density-plot"><code>densplot()</code></a>)</li> <li>imputed vs observed data (<a href="https://nerler.github.io/JointAI/articles/AfterFitting.html#export-of-imputed-values"> <code>plot_imp_distr()</code></a>)</li> <li>Monte Carlo Error (<a href="https://nerler.github.io/JointAI/articles/AfterFitting.html#sec:mcerror"><code>plot(MC_error(<object>))</code></a>)</li> </ul> </div> </div> <button class="modal-button" href = "#modal-parallel">Parallel Sampling</button> <div class="modal" id="modal-parallel"> <div class="modal-content"> <span class="close">×</span> Using the R packages <strong>future</strong> and <strong>doFuture</strong>: <pre><code class="r hljs remark-code"> <div class="remark-code-line"><span class="hljs-keyword">library</span>(<span class="hljs-string">"doFuture"</span>)</div> <div class="remark-code-line">plan(multiprocess(workers = <span class="hljs-number">6</span>))</div> <div class="remark-code-line">registerDoFuture()</div> <div class="remark-code-line"></div> <div class="remark-code-line"><span class="hljs-comment">## fit JointAI model ...</span></div><br> <div class="remark-code-line"><span class="hljs-comment">## to re-set to sequential evaluation:</span></div> <div class="remark-code-line">plan(sequential())</div> </code></pre> </div> </div> ] .col[ <button class="modal-button" href = "#modal-clm"> Cumulative Logit Models</button> <div class="modal" id="modal-clm"> <div class="modal-content"> <span class="close">×</span> <ul> <li>invert the OR (<a href="https://nerler.github.io/JointAI/reference/model_imp.html#cumulative-logit-mixed-models"><code>rev</code></a>) \[\log\left(\frac{P(y_i \color{var(--nord14)}{>} k)}{P(y_i \color{var(--nord14)}{\leq} k)}\right) = \gamma_k + \eta_i \quad\text{vs}\quad \log\left(\frac{P(y_i \color{var(--nord14)}{\leq} k)}{P(y_i \color{var(--nord14)}{>} k)}\right) = \gamma_k + \eta_i\] </li> <li>partial proportional odds (<a href="https://nerler.github.io/JointAI/reference/model_imp.html#cumulative-logit-mixed-models"><code>nonprop</code></a>) \[\log\left(\frac{P(y_i > k)}{P(y_i \leq k)}\right) = \gamma_k + \eta_i \quad\text{vs}\quad \log\left(\frac{P(y_i \leq k)}{P(y_i > k)}\right) = \gamma_k + \eta_i \color{var(--nord14)}{+ \eta_{ki}}\] </li> </ul> </div> </div> <button class="modal-button" href = "#modal-multi-level"> Multi-level Models</button> <div class="modal" id="modal-multi-level"> <div class="modal-content"> <span class="close">×</span> <ul> <li><strong>lme4</strong> or <strong>nlme</strong> type specification</li> <li>nested and crossed random effects<br>(determined by data structure)</li> <li>2, 3, 4, ... levels of grouping, i.e., <code> ... + (time | id) + (1 | group) + (1 | center ) + (1 | country) + ...</code></li> <li>multi-level structure also for survival models</li><br> <li>uses hierarchical centering, i.e., \(\mathbf b \sim N(\mathbf X\boldsymbol\beta, \mathbf D)\)</li> </ul> <a href="https://nerler.github.io/JointAI/reference/model_imp.html#model-formulas">⇨ For more info see the package documentation.</a> </div> </div> <button class="modal-button" href = "#modal-survival"> Survival<br>(with Time-varying Covariates) </button> <div class="modal" id="modal-survival"> <div class="modal-content"> <span class="close">×</span> <ul> <li>proportional hazards model with <strong>time-varying covariates</strong><br> (using <a href="https://nerler.github.io/JointAI/reference/model_imp.html#survival-models-with-frailties-or-time-varying-covariates"> <code> + (1 | id)</code></a> and <a href="https://nerler.github.io/JointAI/reference/model_imp.html#survival-models-with-frailties-or-time-varying-covariates"> <code> timevar = "<...>"</code></a>)</li> <li><strong>baseline hazard</strong>: B-spline with <code>df_basehaz</code> degrees of freedom</li><br> <li></strong>joint model</strong> for longitudinal and survival data<br> (using <a href = "https://nerler.github.io/JointAI/reference/model_imp.html#modelling-multiple-models-simultaneously-joint-models"> <code>formula = list(<...>)</code></a> and <a href="https://nerler.github.io/JointAI/reference/model_imp.html#modelling-multiple-models-simultaneously-joint-models"> <code>timevar = "<...>"</code></a>)</li> <li><strong>association structure</strong> (<code>assoc_type</code>) <ul> <li>underlying value (<code>underl.value</code>)</li> <li>observed/imputed value (<code>obs.value</code>)</li> </ul> </li> <li><strong>multivariate joint model</strong>:<br> full / block-diagonal / independent random effects</li> </ul> </div> </div> ] .col[ <button class="modal-button" href = "#modal-blackbox"> Shining Light into the Black Box</button> <div class="modal" id="modal-blackbox"> <div class="modal-content"> <span class="close">×</span> <ul> <li>monitor any node (<a href="https://nerler.github.io/JointAI/articles/SelectingParameters.html#other-parameters"> <code>monitor_params</code></a>)</li> <li>see the JAGS model (<code><object>$jagsmodel</code>)</li> <li>print model info (<a href="https://nerler.github.io/JointAI/articles/SelectingParameters.html#side-note-getting-information-about-of-the-imputation-models"> <code>list_models()</code></a>, <code><object>$models</code>)</li> <li>access the JAGS model object (<code><object>$model</code>)</li> <li>list all monitored parameters (<a href="https://nerler.github.io/JointAI/articles/SelectingParameters.html#parameters-of-the-analysis-model"> <code>parameters()</code></a>)</li> <li>list all regression coefficients (<code><object>$coef_list</code>)</li> </ul> </div> </div> <button class="modal-button" href = "#modal-compinfo"> Computational Infos</button> <div class="modal" id="modal-compinfo"> <div class="modal-content"> <span class="close">×</span> <code><object>$comp_info</code>: <ul> <li>start time</li> <li>duration</li> <li>package version</li> <li>parallel setting</li> </ul> </div> </div> <button class="modal-button" href = "#modal-subset"> Output Options</button> <div class="modal" id="modal-subset"> <div class="modal-content"> <span class="close">×</span> <ul> <li>Subset selection <ul> <li>iterations (<a href="https://nerler.github.io/JointAI/articles/AfterFitting.html#subset-of-mcmc-samples"> <code>start</code>, <code>end</code>, <code>thin</code></a>)</li> <li>nodes (<a href="https://nerler.github.io/JointAI/articles/AfterFitting.html#subset-of-parameters"> <code>subset</code></a>)</li> <li>chains (<code>exclude_chains</code>)</li> <li>sub-models (<code>outcome</code>)</li> </ul><br> <li>export imputed values<br>(<code>monitor_params = c(imps = TRUE)</code>, <code>get_MIdat()</code>)</li> </ul> </div> </div> <button class="modal-button" href = "#modal-prediction">Prediction</button> <div class="modal" id="modal-prediction"> <div class="modal-content"> <span class="close">×</span> <ul> <li>create "newdata" for effect plots (<a href="https://nerler.github.io/JointAI/reference/predDF.html"> <code>predDF()</code></a>)</li><br> <li>prediction (<code>predict()</code>) <ul> <li><strong>GLM(M)</strong> type models: linear predictor or outcome scale<br> <span style = "text-decoration: underline;">for now</span>: <span style = "color: var(--nord12);"> assumption random effects = 0</span> </li> <li><strong>prop. hazard models</strong>: (log) hazard, (-log) survival</li> <li><strong>joint model</strong> for longitudinal & survival data: <span style = "color:var(--nord11);">not <span style = "text-decoration: underline;">yet</span> possible</span></li> </ul></li> </ul> </div> </div> ] ] --- ## Summary Imputation via the posterior predictive distribution<br> ⇨ difficult to specify directly in complex settings * **MICE** uses direct specification<br> ⇨ violations in complex settings ⇨ bias * **Bayes**: specification via factorization of the joint distribution<br> ⇨ theoretically valid -- <br> Valid under **MAR** <br> ⇨ Extension to **MNAR** by specifying models for missingness process. <p class = "smallbreak"> </p> Allows **tailoring to** model **specific features** of the data. --- class: the-end, center, middle layout: true count: false # Thank you for your attention! <div class="contact"> <i class="fas fa-envelope"></i> <a href="mailto:n.erler@erasmusmc.nl" class="email">n.erler@erasmusmc.nl</a>  <a href="https://twitter.com/N_Erler" target="_blank"><i class="fab fa-twitter"></i> N_Erler</a>  <a href="https://github.com/NErler" target="_blank"><i class="fab fa-github"></i> NErler</a>  <a href="https://nerler.com" target="_blank"><i class="fas fa-globe-americas"></i> https://nerler.com</a> </div> --- count: false <script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML"> </script> <script> // Get the button that opens the modal var btn = document.querySelectorAll("button.modal-button"); // All page modals var modals = document.querySelectorAll('.modal'); // Get the <span> element that closes the modal var spans = document.getElementsByClassName("close"); // When the user clicks the button, open the modal for (var i = 0; i < btn.length; i++) { btn[i].onclick = function(e) { e.preventDefault(); modal = document.querySelector(e.target.getAttribute("href")); modal.style.display = "block"; } } // When the user clicks on <span> (x), close the modal for (var i = 0; i < spans.length; i++) { spans[i].onclick = function() { for (var index in modals) { if (typeof modals[index].style !== 'undefined') modals[index].style.display = "none"; } } } // When the user clicks anywhere outside of the modal, close it window.onclick = function(event) { if (event.target.classList.contains('modal')) { for (var index in modals) { if (typeof modals[index].style !== 'undefined') modals[index].style.display = "none"; } } } </script>