class: left, top, title-slide
When and Why Imputation with MICE / FCS Might Fail
Nicole Erler
Assistant Professor
Department of Biostatistics
n.erler@erasmusmc.nl
N_Erler
NErler
https://nerler.com
--- count: false layout: true <div class="my-footer"><span> <a href="https://twitter.com/N_Erler"><i class="fab fa-twitter"></i> N_Erler</a>      <a href="https://github.com/NErler"><i class="fab fa-github"></i> NErler</a>      <a href = "https://nerler.com"><i class="fas fa-globe-americas"></i> nerler.com</a> </span></div> --- count: false ## Outline .itlist-wider[ * Motivation / Setting the Stage * (Multiple) Imputation & MICE / FCS * Potential Issues with MICE * A Fully Bayesian Alternative * Summary ] -- <i class="far fa-images link-slides "></i> <span class = "link-address"> <a href = "https://www.nerler.com/talk/greifswald2023/">https://nerler.com/#talks</a> </span> --- class: center, middle # Setting the Stage --- ## Motivation: Chronic Hepatitis C * N `\(\approx\)` 700 patients with chronic hepatitis C * follow-up: `\(\leq\)` 38 years * several **baseline variables**: * age, sex, smoking, BMI, ... * **0 - 19% missing values**   ⇨ 66% complete cases ??? This project is motivated by a study in patients with a **chronic hepatitis C infection**. In many patients, the hepatitis has caused severe liver damage, and they will need a transplant. In light of the scarcity of donor organs, it is important to **identify factors that indicate a patient's risk** for liver failure. We have observational data from approx. **700 patients**. In this data, several **variables are only measured once**, at baseline. This includes the patient's **age at baseline** and **sex** and the **year of diagnosis**, as well as variables like **alcohol consumption**, **smoking**, or **BMI**, which could change over time, but were only measured once. In some of these variables, we have **missing values** for **up to 19% of the patients**. If we'd **exclude** everyone with missing values in these variables, we would **lose about a third of the data**. 
- - - -- * multiple **biomarkers**<br> (repeatedly measured) <img src="figures/plot_nobs.png", width = "650", style="position:absolute; bottom:45px; right:60px;"> ??? We also have several **biomarkers** that are **repeatedly measured**, but since the data is not from a prospective study, the measurements are **very irregular**. The number of measurements per patient is shown here in the histogram. The **median** number of measurement times per patient is **15**, but there are a few patients with just **a single observation** and even one patient with **> 700 measurement** occasions. -- * **Time-to-event Outcome**:<br> Clinical events<br> (death, transplant, ...) --- ## Missing Biomarker Values <img src="figures/HCVlongmissing2.png" width="100%" style="display: block; margin: auto;" /> ??? In the **plots** here, you see the trajectories of 6 different biomarkers of interest for 7 patients. The values are standardized to have similar scales for the purpose of this visualization. In many cases, the biomarkers are **measured at the same time points**, but it also happens that one biomarker value is missing or was measured at another time. Observations that we would lose if we **restricted our data** to only those time points at which all 6 markers are measured are shown as transparent. For the full dataset, this would result in the loss of 50% of the biomarker observations. [2:00 min] --- ## Analysis: Multivariate Joint Model **Proportional hazards model** for time until event: `$$h_i(t) = h_0(t)\;\exp\left(\underset{\substack{\text{time}\\\text{constant}} }{\underbrace{\mathbf x_i^\top \boldsymbol\beta^{(tc)}}} + \underset{\text{time varying}}{\underbrace{\sum_{k=1}^K \eta_{ki}(t)^\top\beta^{(tv)}_k}}\right)$$` ??? To analyse this data, we'd like to use a multivariate joint model. This model consists of **two parts**, a **proportional hazards** model for the time-to-event outcome and a **multivariate mixed model** for the longitudinal biomarkers.
The **proportional hazards** model has the usual form. In the linear predictor, we have some variables that are **time constant**, our baseline covariates, and we have the **time-varying biomarker values**. - - - -- **Longitudinal (mixed) model** for each biomarker `\(k = 1, ... K\)`: `\begin{align*} \mathbb E(y_{ki}(t)\mid \mathbf b_{ki}) &= \eta_{ki}(t)\\ &= \underset{\text{fixed effects}}{\underbrace{\mathbf x_{ki}(t)^\top\boldsymbol\beta^{(k)}}} + \underset{\text{random effects}}{\underbrace{\mathbf z_{ki}(t)^\top \mathbf b_{ki}}} \qquad\scriptsize\text{with } \begin{pmatrix}\mathbf b_{1i}\\\vdots\\\mathbf b_{Ki}\end{pmatrix} \sim N(\mathbf 0, \mathbf D) \end{align*}` ??? Each biomarker is modelled using a mixed model. - - - -- <div style = "border: solid 3pt var(--nord14); width: 270px; height: 130px; position: absolute; right: 170px; bottom: 65px;"> </div> ??? The **random effects** of the models for the different biomarkers are **modelled jointly** in a multivariate normal distribution. - - - -- <div style = "border: solid 10pt var(--nordbg); width: 255px; height: 115px; position: absolute; right: 168px; bottom: 63px;"> </div> <div style = "border: solid 3pt var(--nord14); width: 80px; height: 60px; position: absolute; left: 405px; bottom: 172px;"> </div> <div style = "position: absolute; top: 170px; right: 430px; color: var(--nord14); font-weight: 500; font-size: 3rem;"> ⇩ </div> ??? By using the linear predictors `\(\eta\)` of the longitudinal models as covariates in the time-to-event model, we assume that the underlying value of the biomarkers, which may be measured with error, is associated with the risk of an event at the same time point. It is, of course, possible to assume different association structures, for example, a time lag or a cumulative effect, but since the focus here is on the missing values, we'll keep it simple. 
--- count: false ## Analysis: Multivariate Joint Model **Proportional hazards model** for time until event: `$$h_i(t) = h_0(t)\;\exp\left(\underset{\substack{\text{time}\\\text{constant}} }{\underbrace{\bbox[#3B4252, 2px]{\color{var(--nord15)}{\mathbf x_i}^\top} \boldsymbol\beta^{(tc)}}} + \underset{\text{time varying}}{\underbrace{\sum_{k=1}^K \eta_{ki}(t)^\top\beta^{(tv)}_k}}\right)$$` **Longitudinal (mixed) model** for each biomarker `\(k = 1, ... K\)`: `\begin{align*} \mathbb E(y_{ki}(t)\mid \mathbf b_{ki}) &= \eta_{ki}(t)\\ &= \underset{\text{fixed effects}}{\underbrace{\bbox[#3B4252, 2px]{\color{var(--nord15)}{\mathbf x_{ki}(t)}^\top}\boldsymbol\beta^{(k)}}} + \underset{\text{random effects}}{\underbrace{\mathbf z_{ki}(t)^\top \mathbf b_{ki}}} \qquad\scriptsize\text{with } \begin{pmatrix}\mathbf b_{1i}\\\vdots\\\mathbf b_{Ki}\end{pmatrix} \sim N(\mathbf 0, \mathbf D) \end{align*}` ??? In both model parts, we might want to use **baseline covariates**. However, in our motivating data, those variables **have missing values**. [3:22 min] --- ## How are Missing Data Handled? [<i class="fas fa-file-alt"></i> Shivasabesan et al. (2018)](https://doi.org/10.1016/j.injury.2018.03.035) * all trauma registry-based manuscripts (2015 - 2017) * 241/539 (45%) described how missing data was handled: * 234 (43%) **complete case analysis** (CCA) * 18 (3%) **multiple imputation** (MI) * 34 (6%) combination of CCA & MI ??? Missing data are a widespread issue, especially in observational data. But how are they usually dealt with? It is not easy to get a very good overview of how Missing Values are handled in practice. But there are a few studies that have looked into how missing data are handled and reported in specific types of studies or fields of application. - - - A systematic review that was published in the clinical journal "Injury" investigated missing data in trauma registries. 
About half of the 539 included papers described how missing data was handled, and the most common approach was again to perform a complete case analysis, followed by a combination with multiple imputation or just multiple imputation. - - - -- <p class = "smallbreak"> </p> [<i class="fas fa-file-alt"></i> Carroll et al. (2020)](https://doi.org/10.1186/s12874-020-01018-7) * observational time-to-event studies in oncology using prop. hazards model (2012 - 2018) * approx. 130/148 (88%) reported on the missing data method * 79 (53%) **complete case analysis** * 33 (22%) **multiple imputation** ??? In 2020, Carroll and colleagues published their investigation on missing data methods in observational time-to-event studies that used proportional hazards models. They included 148 studies, of which most reported on how missing data was handled, and, again, half of the studies performed complete case analyses, and the next most popular method was multiple imputation. --- ## How are Missing Data Handled? [<i class="fas fa-file-alt"></i> Hunt et al. (2021)](https://doi.org/10.1002/pds.5245) * multi-database pharmacoepidemiologic studies (2018 - 2019) * 19/62 (31%) reported the missing data method: * 13 (21%) **complete case analysis** * 4 (6%) **multiple imputation** ??? Even more recently, a review on multi-database studies in pharmacoepidemiology found that in that area of research, only one third of the studies reported on the missing data method, and also there, complete case analysis, followed by multiple imputation, was the most commonly reported method. - - - More and more clinical researchers are aware that in most settings complete case analysis is not appropriate, and multiple imputation is somehow considered the universal remedy to the missing data problem.
--- ## Complete Case Analysis .pull-left[ Complete Case Analysis is <ul class="fa-ul"> <li><span class = "fa-li" style = "color:var(--nord11);"><i class="far fa-frown"></i></span>inefficient</li> <li><span class = "fa-li" style = "color:var(--nord11);"><i class="far fa-frown"></i></span>usually biased</li> </ul> <br> **For Example:** <ul class="fa-ul"> <li> <a href = "https://thestatsgeek.com/2013/07/06/when-is-complete-case-analysis-unbiased"> <span class = "fa-li"><i class="fab fa-wordpress"></i></span> thestatsgeek.com (2013)</a> </li> <li> <a href = "https://doi.org/10.1002/sim.3944"> <span class = "fa-li"><i class="fas fa-file-alt"></i></span> White & Carlin (2010)</a> </li> <li> <a href = "https://doi.org/10.1016/j.jclinepi.2009.08.028"> <span class = "fa-li"><i class="fas fa-file-alt"></i></span> Knol et al. (2010)</a> </li> <li> <a href = "https://doi.org/10.1016/j.jclinepi.2006.01.015"> <span class = "fa-li"><i class="fas fa-file-alt"></i></span> Van der Heijden et al. (2006)</a> </li> <li> <a href = "https://doi.org/10.1016/j.jclinepi.2009.12.008"> <span class = "fa-li"><i class="fas fa-file-alt"></i></span> Janssen et al. (2010)</a> </li> </ul> ] .pull-right[ <img src="index_files/figure-html/ccdemo-1.png" width="120%" align = "right" style="display: block; margin: auto;" /> ] ??? The issue with complete case analysis is that it is at least inefficient. Depending on the number of variables that have missing values and the proportion of values missing, you may lose quite a large proportion of your data, especially in observational studies, which require the inclusion of many covariates to reduce the bias due to confounding. I have listed here a couple of resources on the issues associated with complete case analysis. --- class: center, middle # (Multiple) Imputation & MICE / FCS --- ## Development of (Multiple) Imputation **In the 1960s/70s:** Increasing number of missing values in the US census:<br> ⇨ Need to solve the missing data problem.
-- <img src = "graphics/computer.png" class = "img-computer"> -- <br> .hlgt-box.w60[ **Idea / Requirement**<br> ⇨ fix the missing data problem once (centrally)<br> ⇨ supply completed data to many analysts ] --- ## Imputation of Missing Values .footnote[ [<i class = "fas fa-book"></i> Rubin (2004)](https://doi.org/10.1198/000313004X6355) ] **Rubin (in the 1970s):** * **uncertainty** about the missing value * some values **more likely** than others * relationship with **other** available **data** ??? The basic ideas about imputation go back to Donald Rubin and his work in the 1960s and 70s. The important issue in working with missing values is the **uncertainty** about what the value would have been. And so we **can't just pick** a single value and fill it in because then we would ignore this uncertainty. But we have some knowledge about the missing values. Usually, some values will be more likely than others, and often there is a relationship between the incomplete variable and the other data. - - - -- <div style = "position: absolute; right: 60px; top: 140px; width: 500px;"> <strong>⇨</strong> missing values have a <strong>distribution</strong> <img src="index_files/figure-html/unnamed-chunk-5-1.png" width="60%" style="display: block; margin: auto;" /> <!-- <img src="figures/ImpDens.png", height = 250, style = "margin: auto; display: block;"> --> </div> ??? So, it makes sense to assume that missing values have a distribution and that we need a model to describe how an incomplete variable is related to the other data. - - - - -- <div class = "bayes-text"> Bayesian Concept!!! </div> -- <br><br><br> .hlgt-box[ Impute from the **predictive distribution** of the .n15[missing values] given the observed values: `$$p(\color{var(--nord15)}{\mathbf{\mathcal D}_{mis}}\mid \mathbf{\mathcal D}_{obs})$$` ] ??? And this means that we can impute the missing values by **sampling from the predictive distribution of the missing values conditional on the observed data**. 
[4:00 min] --- ## Multiple Imputation <img src = "graphics/MI.png", height = 480, style = "margin: auto; display: block;"> ??? The principle of multiple imputation is simple: * We impute each missing value multiple times, in order to take into account the uncertainty that we have about the missing value, and thereby create a set of completed datasets. * Each of these datasets can then be analysed by standard complete data methods, and the results of these separate analyses are then pooled to obtain an overall result. One often cited advantage of this approach is that we can run this imputation once, and then re-use the imputed data for multiple analyses performed by different researchers. -- .covr-box[ ] --- ## Imputation Step .flex-grid[ .col[ <br> <table class="simpletable2"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\color{var(--nord15)}{\mathbf x_1}\)</th> <th>\(\mathbf x_2\)</th> <th>\(\mathbf x_3\)</th> </tr> <tr><td></td><td colspan = "4"; style = "padding: 0px;"><hr /></td><tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class = "rownr"></td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> </tr> </table> <br> * `\(\mathbf y\)`: **response** variable * 
`\(\color{var(--nord15)}{\mathbf x_1}\)`: **incomplete** covariate * `\(\mathbf x_2\)`, `\(\mathbf x_3\)`: **complete** covariates <div style = "width: 450px;"></div> ] .col[ **Sampling from the predictive distribution:  ** {{content}} ] ] ??? The most interesting part of this multiple imputation is the imputation step. Missing values should be imputed from their predictive distribution. The idea for imputation of `\(x_1\)` is similar to predicting the missing values. We can think of this as fitting a model to the cases where `\(x_1\)` is observed, **for example**, this could be a **linear regression model** with the **response `\(y\)`** and the other covariates as predictors. From this model, we can get estimates for the regression coefficients `\(\beta\)` and the standard deviation of the residuals `\(\sigma\)`. - - - - - -- * Define a model for the incomplete variable `\(\color{var(--nord15)}{\mathbf x_1}\)`, e.g. `$$\color{var(--nord15)}{\mathbf x_1} = \beta_0 + \beta_1 \mathbf y + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \varepsilon$$` to get estimates `\(\boldsymbol{\hat\beta}\)` (and `\(\hat\sigma\)`) {{content}} -- <br> * Sample `\(\color{var(--nord15)}{\hat{x}_{i1}}\)` from the predictive distribution `$$p(\color{var(--nord15)}{x_{i1}} \mid y_i, x_{i2}, x_{i3}, \boldsymbol{\hat\beta}, \hat\sigma)$$` ??? And using these parameter estimates we can then sample a value of `\(x_1\)` from its predictive distribution for those cases where `\(x_1\)` is missing, rather than simply filling in the fitted value.
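These two steps can be sketched in a few lines of code. The talk's own tooling is R (the `mice` package's `norm` method implements this idea); the following is a minimal, self-contained numpy sketch with simulated data, and all variable names are hypothetical. Note that it also draws the parameters (not just the missing value) to propagate parameter uncertainty:

```python
import numpy as np

rng = np.random.default_rng(2023)

def impute_draw(x1, y, x2, x3, rng):
    """One stochastic imputation of x1 from p(x1 | y, x2, x3),
    drawing the parameters first to propagate their uncertainty."""
    obs = ~np.isnan(x1)
    Z = np.column_stack([np.ones(len(y)), y, x2, x3])
    Zo, xo = Z[obs], x1[obs]
    beta_hat, *_ = np.linalg.lstsq(Zo, xo, rcond=None)  # fit on observed cases
    resid = xo - Zo @ beta_hat
    df = Zo.shape[0] - Zo.shape[1]
    # draw sigma^2, then beta | sigma^2 (normal linear model posterior)
    sigma2 = resid @ resid / rng.chisquare(df)
    beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(Zo.T @ Zo))
    # draw the missing x1 values from the predictive distribution
    out = x1.copy()
    out[~obs] = Z[~obs] @ beta + rng.normal(0.0, np.sqrt(sigma2), (~obs).sum())
    return out

# hypothetical data with roughly 30% of x1 missing completely at random
n = 500
x2, x3 = rng.normal(size=n), rng.normal(size=n)
x1 = 0.5 * x2 - 0.3 * x3 + rng.normal(size=n)
y = 1 + 2 * x1 + x2 + rng.normal(size=n)
x1_mis = np.where(rng.random(n) < 0.3, np.nan, x1)

completed = impute_draw(x1_mis, y, x2, x3, rng)
```

Repeating this draw `\(m\)` times, each time with fresh parameter draws, yields the `\(m\)` completed datasets used in multiple imputation.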
--- ## Multivariate Missingness <div style = "width: 300px; margin-right: 50px;"> <div style = "text-align: center; margin-bottom: 25px; font-weight: 600;"> Multivariate<br>Missingness</div> <table class="simpletable2"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\color{var(--nord15)}{\mathbf x_1}\)</th> <th>\(\color{var(--nord15)}{\mathbf x_2}\)</th> <th>\(\color{var(--nord15)}{\mathbf x_3}\)</th> <th>\(\ldots\)</th> </tr> <tr><td></td><td colspan = "5"; style = "padding: 0px;"><hr /></td><tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class = "rownr"></td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td></td> </tr> </table> <br> <div style = "display: block; margin: auto;"> <ul class = "arrowlist"> <li>\(p(\color{var(--nord15)}{\mathbf{\mathcal D}_{mis}}\mid \mathbf{\mathcal D}_{obs})\) is<br> <strong>multivariate</strong></li> </ul> </div> </div> ??? In practice, we usually have missing values in multiple variables, which means that this **predictive distribution is multivariate**. 
- - - -- <div style = "width: 60%; position: absolute; top: 100px; right: 60px;"> <i class="fas fa-exclamation-triangle fa-2x" style = "color: var(--nord11);"></i> <div style = "position: relative; left: 75px; top: -57px; margin-bottom: -57px;"> When covariates are of <strong>mixed type</strong>:<br> <span class = "red bold">no closed form</span>. </div> ⇨ approximate `\(p(\color{var(--nord15)}{\mathbf{\mathcal D}_{mis}}\mid \mathbf{\mathcal D}_{obs})\)` {{content}} </div> ??? When we also have covariates of mixed type, meaning continuous and categorical, this multivariate distribution does not have a closed-form, and so we need to approximate it. - - - -- <p class = "smallbreak"> </p> **"Original" approach:**<br> Joint Model (Multiple) Imputation * Approximate `\(p(\color{var(--nord15)}{\mathbf{\mathcal D}_{mis}}\mid \mathbf{\mathcal D}_{obs})\)` with a known multivariate distribution. {{content}} -- <p class = "smallbreak"> </p> **Most common approach:**<br> <span class = "n4"> <span class = "n6 bold">M</span><span class = "s">ultivariate</span> <span class = "n6 bold">I</span><span class = "s">mputation by</span> <span class = "n6 bold">C</span><span class = "s">hained</span> <span class = "n6 bold">E</span><span class = "s">quations</span> </span><br> <span class = "n4"> <span class = "n6 bold">F</span><span class = "s">ully</span> <span class = "n6 bold">C</span><span class = "s">onditional</span> <span class = "n6 bold">S</span><span class = "s">pecification</span> </span><br> * Based on the idea of the **Gibbs sampler** --- ## Joint Model MI For example, assume all incomplete variables and the outcome are (latent) normal: `$$\mathbf x_k \text{ binary }\rightarrow \text{ latent } \boldsymbol{\hat{\mathbf{x}}_k} \text{ is standard normal: } \left\{\begin{array}{c} \mathbf x_k = 1\\ \mathbf x_k = 0\end{array}\right.\text{ if } \begin{array}{c} \boldsymbol{\hat{\mathbf{x}}_k}\geq \kappa_k\\ \boldsymbol{\hat{\mathbf{x}}_k} < \kappa_k\end{array}$$` .pull-left[ ⇨ 
`\(p(\color{var(--nord15)}{\mathbf{\mathcal D}_{mis}}\mid \mathbf{\mathcal D}_{obs})\)` is multivariate normal<br> ⇨ easy to sample from <br> ⇨ <i class="fab fa-r-project"></i> package [**jomo**](https://CRAN.R-project.org/package=jomo). ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## MICE / FCS Iterative sampling from the **full-conditionals**. -- .flex-grid[ .col[ * Specify a model for each incomplete variable with **all other variables in the linear predictor**. {{content}} ] .col[ <div style = "width: 400px; margin-left: 60px;"> <strong>Example:</strong> <table class="simpletable2"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\color{var(--nord15)}{\mathbf x_1}\)</th> <th>\(\color{var(--nord15)}{\mathbf x_2}\)</th> <th>\(\color{var(--nord15)}{\mathbf x_3}\)</th> <th>\(\ldots\)</th> </tr> <tr><td></td><td colspan = "5"; style = "padding: 0px;"><hr /></td><tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class = "rownr"></td> <td>\(\vdots\)</td> 
<td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td></td> </tr> </table> <br> <div class = "s"> \begin{alignat}{10} \color{var(--nord15)}{\mathbf x_1} &= \theta^{(1)}_0 &+& \theta^{(1)}_1 \mathbf y &+& \theta^{(1)}_2 \color{var(--nord15)}{\mathbf x_2} &+& \theta^{(1)}_3 \color{var(--nord15)}{\mathbf x_3} &+& \ldots &+& \boldsymbol\varepsilon \\ \color{var(--nord15)}{\mathbf x_2} &= \theta^{(2)}_0 &+& \theta^{(2)}_1 \mathbf y &+& \theta^{(2)}_2 \color{var(--nord15)}{\mathbf x_1} &+& \theta^{(2)}_3 \color{var(--nord15)}{\mathbf x_3} &+& \ldots &+& \boldsymbol\varepsilon \\ \color{var(--nord15)}{\mathbf x_3} &= \theta_0^{(3)} &+& \theta_1^{(3)} \mathbf y &+& \theta_2^{(3)} \color{var(--nord15)}{\mathbf x_1} &+& \theta_3^{(3)} \color{var(--nord15)}{\mathbf x_2} &+& \ldots &+& \boldsymbol\varepsilon \end{alignat} </div> ] ] ??? The most common approach to handle such multivariate missingness is the MICE algorithm, which is also called imputation using a **fully conditional specification** because that is exactly what it does. It follows the idea of the Gibbs sampler to obtain a sample from the multivariate distribution by iterative sampling from the full conditional distribution of each of the incomplete variables. **In practice**, these full conditionals are typically specified using **regression models** that include **all other variables in the linear predictor**. An implication of this is that we assume linear associations between the dependent variables, here, the incomplete variables, and the independent variables, in our case the other covariates and the response of our analysis model of interest. [5:15 min] -- * Draw random **starting values** from the observed data. {{content}} -- * **Cycle** through all incomplete variables until convergence. 
{{content}} -- * Keep last value for each missing observation:<br> **⇨ one imputed dataset** {{content}} -- **Multiple runs** with different starting values<br> ⇨ multiple imputed datasets --- class: center, middle # Potential Issues with MICE --- ## A Simple Example .pull-left[ **Implied Assumption:**<br> <span>Linear association</span> between `\(\color{var(--nord15)}{\mathbf x_1}\)` and `\(\mathbf y\)`: `$$\color{var(--nord15)}{\mathbf x_1} = \theta_0 + \bbox[#3B4252, 2pt]{\theta_1 \mathbf y} + \theta_2 \mathbf x_2 + \theta_3 \mathbf x_3 + \ldots$$` <!-- <img src="figures/linplot.png", width = "450", height = "300", style="position:absolute; bottom:45px;"> --> <img src="index_files/figure-html/unnamed-chunk-8-1.png" width="100%" style="display: block; margin: auto;" /> ] ??? One thing that is implied when we use such simple regression models for the imputation is that there is a linear association between incompl. covariate and the response (and other covariates). -- .pull-right[ <br> But what if `$$\mathbf y = \beta_0 + \bbox[#3B4252, 2pt]{\beta_1 \color{var(--nord15)}{\mathbf x_1} + \beta_2 \color{var(--nord15)}{\mathbf x_1}^2} + \beta_3 \mathbf x_2 + \ldots$$` <img src="index_files/figure-html/unnamed-chunk-9-1.png" width="100%" style="display: block; margin: auto;" /> <!-- <img src="figures/qdrplot.png", width = "450", height = "300", style="position:absolute; bottom:45px;"> --> ] ??? But that is of course not always the case. What if we have a setting where we assume that there is a non-linear association, for example quadratic? --- ## Non-linear Associations * .bold[true association]: non-linear * .bold.n7[imputation assumption]: linear <img src="figures/p_obs.png", height = 350, style = "position: absolute; right: 226px; bottom: 97px;"> ??? But that is, of course, not always the case. We might, for example, have a quadratic association between the response `\(y\)` and the incomplete covariate `\(x_1\)`. 
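This failure is easy to reproduce in a small simulation — a hypothetical numpy sketch, unrelated to the data behind the figures: generate `\(y\)` quadratic in `\(x_1\)`, impute `\(x_1\)` from a naive linear model in `\(y\)`, and then refit the correct quadratic analysis model on the imputed data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x1 = rng.normal(size=n)
y = 1 + x1 + x1**2 + rng.normal(size=n)            # true model: quadratic in x1

x1_mis = np.where(rng.random(n) < 0.4, np.nan, x1)  # 40% missing (MCAR)
obs = ~np.isnan(x1_mis)

# naive imputation model: x1 linear in y (plus residual noise)
Z = np.column_stack([np.ones(obs.sum()), y[obs]])
theta, *_ = np.linalg.lstsq(Z, x1_mis[obs], rcond=None)
sigma = np.std(x1_mis[obs] - Z @ theta)
x1_imp = np.where(obs, x1_mis,
                  theta[0] + theta[1] * y + rng.normal(0.0, sigma, n))

# refit the (correct) quadratic analysis model on the singly imputed data
X = np.column_stack([np.ones(n), x1_imp, x1_imp**2])
coef_imp = np.linalg.lstsq(X, y, rcond=None)[0]
# the quadratic coefficient is attenuated relative to its true value of 1
```

Even though the analysis model is correctly specified, the linearly imputed values dilute the curvature, so the estimated quadratic coefficient is biased towards zero.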
- - - -- <img src="figures/p_imp.png", height = 350, style = "position: absolute; right: 226px; bottom: 97px;"> ??? If we now impute missing values in `\(x_1\)` using a standard regression model, all the imputed values follow a linear relationship. - - - -- <div style = "position: absolute; right: 380px; top: 120px;"> <span style="font-size: 56pt; position: relative; right: 110px; bottom: 20px;">} ⇨</span> <span style = "color: var(--nord11); font-size: 1.2rem; font-weight: bold; position: relative; bottom: 30px; right: 100px;"> bias!</span> </div> <img src="figures/p_bias.png", height = 350, style = "position: absolute; right: 226px; bottom: 97px;"> ??? And even if we correctly assume the non-linear association in the analysis model that we fit on the imputed data, we will get biased results. I'm talking about a quadratic effect here because I can visualize it more easily than the non-linear relationship implied by the survival part of our joint model. --- ## Time-to-Event Outcomes (Simple) **Proportional Hazards Model:** `$$h_i(t) = h_0(t) \exp\left(\color{var(--nord15)}{\mathbf x_i}^\top \boldsymbol\beta^{(tc)}\right)$$` **Likelihood** `$$p(T_i, \delta_i \mid \color{var(--nord15)}{\mathbf x_i}, \boldsymbol\beta^{(tc)}) = \left[h_0(T_i) \exp \left\{ \color{var(--nord15)}{\mathbf x_i}^\top \boldsymbol\beta^{(tc)}\right\}\right]^{\delta_i} \exp \left[-\int_0^{T_i} h_0(s)\exp\left( \color{var(--nord15)}{\mathbf x_i}^\top \boldsymbol\beta^{(tc)}\right)ds\right]$$` ??? But we can see the issue when we look at this simple proportional hazards model (I've excluded the time-varying part for now) and the corresponding likelihood for a subject `\(i\)`. This likelihood describes a complex and non-linear relationship between the response (in this case, the event time `\(T\)` and event indicator `\(\delta\)`) and incomplete covariates `\(x\)`.
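A common partial remedy in this setting (White & Royston, 2009) is to impute the covariates using the event indicator `\(\delta\)` and the Nelson–Aalen estimate of the cumulative hazard `\(\hat H(T)\)` instead of the raw event time; in R, the mice package provides a `nelsonaalen()` helper for this. Below is a rough numpy sketch of that estimator on hypothetical simulated data (ties are handled only approximately):

```python
import numpy as np

def nelson_aalen(time, delta):
    """Nelson-Aalen estimate of the cumulative hazard, evaluated at each
    subject's own (possibly censored) time; ties handled only approximately."""
    order = np.argsort(time, kind="stable")
    at_risk = len(time) - np.arange(len(time))   # risk-set size at each time
    increments = delta[order] / at_risk          # d_i / n_i at event times
    H = np.empty(len(time))
    H[order] = np.cumsum(increments)
    return H

# hypothetical survival data: hazard depends on x1, with random censoring
rng = np.random.default_rng(7)
n = 300
x1 = rng.normal(size=n)
t_event = rng.exponential(1.0 / np.exp(0.5 * x1))
t_cens = rng.exponential(2.0, size=n)
time = np.minimum(t_event, t_cens)
delta = (t_event <= t_cens).astype(float)

# imputation-model design matrix: delta and H(T) instead of the raw time
H = nelson_aalen(time, delta)
Z = np.column_stack([np.ones(n), delta, H])
```

This approximation works reasonably well in many cases, but it does not remove the fundamental mismatch between the imputation model and the likelihood above.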
- - - -- <p class = "smallbreak"> </p> .bold.red[Inconsistent] with the (naive) imputation model: `$$\color{var(--nord15)}{\mathbf x_1} = \theta_0 + \theta_1 \mathbf T + \theta_2 \boldsymbol \delta + \theta_3 \mathbf x_2 + \ldots$$` .footnote[ [<i class = "fas fa-file-alt"></i> White & Royston (2009)](https://doi.org/10.1002/sim.3618) [<i class = "fas fa-file-alt"></i> Bartlett et al. (2015)](https://doi.org/10.1177/0962280214521348) ] ??? But the regression model we would use to impute values in, say `\(x_1\)`, would look like this and imply a linear association. This inconsistency between the analysis model and imputation model will result in bias. [6:30 min] --- ## Multi-level Data .flex-grid[ .col[ <img src="index_files/figure-html/unnamed-chunk-11-1.png" width="100%" style="display: block; margin: auto;" /> ] .col[ <div style = "width: 300px;"> </div> <table class="simpletable2"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\color{var(--nord15)}{\mathbf x_1}\)</th> <th>\(\mathbf x_2\)</th> <th>\(\mathbf x_3\)</th> </tr> <tr><td></td><td colspan = "4"; style = "padding: 0px;"><hr /></td><tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr class="hlgt-row"> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr class = "hlgt-row"> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr class = "hlgt-row"> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> 
<td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class = "rownr"></td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> </tr> </table> ] ] ??? We have multi-level data when we have measured the same variable repeatedly in the same patient, but also when we have a clustering structure in our data, for example in a multi-center study. In both cases, observations from the same patient, or the same cluster are not independent. We typically represent this type of data in long format, so that we now have multiple rows that belong to the same patient. --- ## Multi-level Data .footnote[ [<i class = "fas fa-file-alt"></i> Erler et al. 
(2016)](https://doi.org/10.1002/sim.6944) ] .flex-grid[ .col[ <div style = "width: 780px;"></div> For example: `$$p(\color{var(--nord15)}{\text{sex}} \mid \text{age}, ..., \underset{\text{time varying}}{\underbrace{\text{ALT}, \text{AST}, ...}})$$` {{content}} ] .col[ <table class="simpletable2">
<tr> <th></th> <th>age</th> <th>sex</th> <th>time</th> <th>ALT</th> <th>AST</th> <th>\(\ldots\)</th> </tr>
<tr><td></td><td colspan = "5" style = "padding: 0px;"><hr /></td></tr>
<tr> <td class="rownr"></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td>\(\ldots\)</td> </tr>
<tr> <td class="rownr"></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td>\(\ldots\)</td> </tr>
<tr class="hlgt-row"> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"></i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td>\(\ldots\)</td> </tr>
<tr class = "hlgt-row"> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"></i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"></i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"></i></td> <td>\(\ldots\)</td> </tr>
<tr class = "hlgt-row"> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"></i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td>\(\ldots\)</td> </tr>
<tr> <td class="rownr"></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td>\(\ldots\)</td> </tr>
<tr> <td class="rownr"></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"></i></td> <td>\(\ldots\)</td> </tr>
<tr> <td class="rownr"></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td><i class = "fas fa-check"></i></td> <td>\(\ldots\)</td> </tr>
<tr> <td class = "rownr"></td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td></td> </tr>
</table> ] ] ??? The other component of our data, the **longitudinally measured biomarkers, gives us an additional challenge** when dealing with the missing values in the baseline covariates. Because we have **unbalanced** biomarker measurements, our data is in **long format**, where multiple rows belong to the same patient. To impute missing values in baseline covariates, for example in `sex`, we need the distribution of that covariate, conditional on everything else, which includes the longitudinally measured biomarker values, for example `AST` and `ALT`. Given that `sex` does not change over time, the imputed values should also all be the same in the rows belonging to the same patient. - - - -- <i class = "fas fa-bolt red"></i> time-varying imputations for time-constant covariates!<br> <i class = "fas fa-bolt red"></i> ignore correlation between repeated observations? {{content}} ??? If we now applied the MICE algorithm to our data in long format, we would get **time-varying imputations for the time-constant covariates**. 
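This pitfall is easy to demonstrate. Below is a minimal numpy sketch (my own toy illustration, not the `mice` package or the hepatitis C data): drawing the imputation for a time-constant covariate independently in each row of long-format data produces within-patient inconsistency, whereas a single patient-level draw does not. The `alt` values and the fixed logistic curve are stand-ins for whatever imputation model would actually be fitted.

```python
import numpy as np

rng = np.random.default_rng(1)

# One patient in long format: many biomarker rows, but the
# (time-constant) covariate `sex` is missing in every row.
n_rows = 1000
alt = rng.normal(0.2, 1.0, size=n_rows)   # hypothetical ALT measurements

# Plug-in "imputation model": P(sex = 1 | ALT) via a fixed logistic curve.
p_row = 1 / (1 + np.exp(-alt))

# Naive row-wise imputation (what running FCS on long-format data amounts to):
# every row gets its own Bernoulli draw, so `sex` appears to vary over time.
sex_rowwise = rng.binomial(1, p_row)

# Patient-level imputation: one draw, shared by all rows of the patient.
sex_patient = np.full(n_rows, rng.binomial(1, 1 / (1 + np.exp(-alt.mean()))))

print(np.unique(sex_rowwise))   # expect both 0 and 1 within this one patient
print(np.unique(sex_patient))   # a single value
```

Note that averaging `sex_rowwise` afterwards would weight all ALT measurements equally, regardless of when they were taken.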
And if we're not careful, we **might also ignore** that rows belonging to the same patient are **correlated**. - - - -- <ul class = "arrowlist"> <li> <strong>Average</strong> imputed values?</li> <li> <strong>Summarize</strong> time-varying variables ⇨ wide format?</li> </ul> <p class = "smallbreak"> </p> Imputation of baseline covariates is not straightforward. ??? So what could we do? We could try to **average the repeated imputations** of baseline covariates, but that would imply that all biomarker values are equally related to baseline variables, irrespective of when they were measured (and in our data we have follow-up of more than 20 years...). Or we could somehow **summarize time-varying variables** to get our data into wide format. But neither strategy really captures the hierarchical structure of the data. [7:45 min] --- ## Non-linear Associations & Multi-level In settings with **non-linear associations** and **multi-level data** the **correct predictive distribution** $$ p(\color{var(--nord15)}{\mathbf x_{mis}} \mid \text{everything else})$$ may not have a closed form. <br> <div style="width: 85%; padding: 1.15em 2em; background-color: var(--nord0); margin: auto; display: block; text-align: center;"> <strong><span style="font-size: 1.5rem;">⇨</span> We cannot easily specify the correct imputation model directly.</strong> </div> ??? So, in summary, whenever we have a non-linear association between the response and incomplete covariates, including time-to-event outcomes, or when we have multi-level data, it is not straightforward to specify the correct imputation model, meaning the correct predictive distribution of the missing values, directly because usually they do not have a closed form. 
--- class: center, middle # A Fully Bayesian Alternative --- ## Getting the Correct Distribution .flex-grid[ .col[ * We need `\(\;p(\color{var(--nord15)}{\mathbf X_{mis}} \mid \mathbf y, \ldots)\)` * We know `\(\;p(\mathbf y \mid \color{var(--nord15)}{\mathbf X_{mis}}, \ldots)\)` ] .col[ <div style = "color: var(--nord3);"> <ul> <li>\(\mathbf X_{mis}\): incomplete covariate(s)</li> <li>\(\mathbf y\): outcome(s)</li> <li>\(\ldots\): everything else </li> </ul> </div> ] ] ??? The interesting question now is: How do we get this predictive distribution? What we need is the distribution of the incomplete variable `\(x\)` conditional on the response and everything else. And what we do know is the distribution of the response conditional on the covariates and everything else, because this is the analysis model that we have specified already. And here Bayes comes to the rescue. -- **Bayes Theorem:** `\begin{align} p(\color{var(--nord15)}{\mathbf X_{mis}}, \boldsymbol\theta \mid \mathbf y, \mathbf X_{obs}) = \frac{p(\mathbf y, \mathbf X_{obs} \mid \color{var(--nord15)}{\mathbf X_{mis}}, \boldsymbol\theta)\; p(\color{var(--nord15)}{\mathbf X_{mis}}, \boldsymbol\theta)}{p\left(\mathbf y, \mathbf X_{obs}\right)} &\propto \underset{\text{joint distribution}}{\underbrace{p(\mathbf y, \mathbf X_{obs}, \color{var(--nord15)}{\mathbf X_{mis}}, \boldsymbol\theta)}} \end{align}` ??? Because Bayes theorem allows us to convert a conditional distribution into the reversed conditional, and so using this theorem we can derive that the distribution of interest, the distribution of the incomplete covariate conditional on everything else is proportional to the distribution of the response conditional on the covariates, and the distribution of the covariates and everything else. We can now factorize this further, but I find it easier to just look at it as a specification of the joint distribution of everything. ??? And so, in the Bayesian approach, we go a different route. 
From Bayes theorem, we can derive that the posterior distribution of the missing values and parameters `\(\theta\)` is proportional to the joint distribution of all our data, observed and unobserved, and the parameters. I use `\(y\)` here to represent the response of an analysis model of interest, whatever that might be. - - - -- <br> This joint distribution does not have a closed form either **⇨ factorization!** ??? This joint distribution does not have a closed-form either, but we can specify it conveniently as a product of conditional distributions. [8:30 min] --- ## Factorization of the Joint Distribution <br> `$$p(\mathbf y, \mathbf X_{obs}, \color{var(--nord15)}{\mathbf X_{mis}}, \boldsymbol\theta) = p(\mathbf y \mid \mathbf X_{obs}, \color{var(--nord15)}{\mathbf X_{mis}}, \boldsymbol\theta)\;\; p(\mathbf X_{obs}, \color{var(--nord15)}{\mathbf X_{mis}}\mid \boldsymbol\theta)\;\; p(\boldsymbol\theta)$$` ??? We can, for instance, split it into the distribution for the response `\(y\)` conditional on the covariates, the distribution of the covariates, observed and missing, and the distribution of the parameter vector `\(\boldsymbol\theta\)`. - - - -- <div style = "position: relative; top: -66px; background-color: var(--nordbg);"> \[ p(\mathbf y, \mathbf X_{obs}, \color{var(--nord15)}{\mathbf X_{mis}}, \boldsymbol\theta) = \underset{\text{analysis model}}{\underbrace{p(\mathbf y \mid \mathbf X_{obs}, \color{var(--nord15)}{\mathbf X_{mis}}, \boldsymbol\theta)}}\;\; \underset{\text{covariate model(s)}}{\underbrace{p(\mathbf X_{obs}, \color{var(--nord15)}{\mathbf X_{mis}}\mid \boldsymbol\theta)}}\;\; \underset{\text{priors}}{\underbrace{p(\boldsymbol\theta)}} \] </div> ??? By factorizing the joint distribution like this, we have written it as the product of our analysis model, a model for the covariates, and the prior distributions for the parameters. The nice thing about this division is that it works irrespective of how complex our analysis model of interest is. 
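To make the "derive rather than specify" idea concrete, consider a deliberately simple toy case (my own example, not from the study): a linear analysis model `\(y_i = \beta x_i + \varepsilon_i\)` with `\(\varepsilon_i \sim N(0, \sigma^2)\)` and covariate model `\(x_i \sim N(\mu, \tau^2)\)`. Multiplying likelihood and covariate model and completing the square gives the imputation model in closed form:

`$$p(\color{var(--nord15)}{x_i} \mid y_i, \boldsymbol\theta) \propto p(y_i \mid \color{var(--nord15)}{x_i}, \boldsymbol\theta)\, p(\color{var(--nord15)}{x_i} \mid \boldsymbol\theta) \quad\Rightarrow\quad \color{var(--nord15)}{x_i} \mid y_i, \boldsymbol\theta \sim N\left(\frac{\mu/\tau^2 + \beta y_i/\sigma^2}{1/\tau^2 + \beta^2/\sigma^2},\; \frac{1}{1/\tau^2 + \beta^2/\sigma^2}\right)$$`

In more complex settings this full conditional has no closed form, but it can still be sampled from, which is exactly the route taken below.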
- - - -- <br> **For Example:** `$$p(\mathbf y, \color{var(--nord15)}{\mathbf x_1}, \mathbf x_2, \mathbf x_3, \boldsymbol\theta) = \underset{\text{analysis model}}{\underbrace{p(\mathbf y \mid \color{var(--nord15)}{\mathbf x_1}, \mathbf x_2, \mathbf x_3, \boldsymbol\theta)}}\; \underset{\text{covariate model(s)}}{\underbrace{p(\color{var(--nord15)}{\mathbf x_1}, \mathbf x_2, \mathbf x_3 \mid \boldsymbol\theta)}}\; \underset{\text{priors}}{\underbrace{p(\boldsymbol\theta)}}$$` ??? And we can use this principle for our example. So we can write the joint distribution of `\(y\)` and `\(x_1\)`, `\(x_2\)` and `\(x_3\)` as the conditional distribution of `\(y\)` given the covariates., the distribution of the covariates conditional on the parameters, and, of course, prior distributions for those parameters. --- ## Factorization of the Joint Distribution <br> `$$p(\mathbf y, \mathbf X_{obs}, \color{var(--nord15)}{\mathbf X_{mis}}, \boldsymbol\theta) = \underset{\text{analysis model}}{\underbrace{p(\mathbf y \mid \mathbf X_{obs}, \color{var(--nord15)}{\mathbf X_{mis}}, \boldsymbol\theta)}}\;\; \underset{\text{covariate model(s)}}{\underbrace{p(\mathbf X_{obs}, \color{var(--nord15)}{\mathbf X_{mis}}\mid \boldsymbol\theta)}}\;\; \underset{\text{priors}}{\underbrace{p(\boldsymbol\theta)}}$$` <br> **(Multivariate) joint model for longitudinal and survival data:** `$$p(\mathbf T, \boldsymbol\delta, \mathbf y, \color{var(--nord15)}{\mathbf X}, \mathbf b, \boldsymbol\theta) = \underset{\text{analysis model}}{\underbrace{ \underset{\substack{\text{survival}\\\text{model}}}{\underbrace{ p(\mathbf T, \boldsymbol\delta \mid \mathbf b, \color{var(--nord15)}{\mathbf X}, \boldsymbol \theta)}}\;\; \underset{\substack{\text{(multivariate)}\\\text{longitudinal}\\\text{model}}}{\underbrace{ p(\mathbf y \mid \mathbf b, \color{var(--nord15)}{\mathbf X}, \boldsymbol\theta)\; p(\mathbf b\mid \boldsymbol \theta)}} }}\; \bbox[#242933, 3pt, border: 2px solid 
#242933]{\underset{\substack{\text{covariate}\\\text{models}}}{\underbrace{ p(\color{var(--nord15)}{\mathbf X} \mid \boldsymbol\theta)}}}\; \underset{\text{priors}}{\underbrace{p(\boldsymbol\theta)}}$$` ??? For example, for our multivariate joint model, the analysis model consists of multiple parts, the time-to-event sub-model and the model for the longitudinal biomarkers, which also involves the specification of the distribution of the random effects `\(\mathbf b\)`. --- count: false ## Factorization of the Joint Distribution <br> `$$p(\mathbf y, \mathbf X_{obs}, \color{var(--nord15)}{\mathbf X_{mis}}, \boldsymbol\theta) = \underset{\text{analysis model}}{\underbrace{p(\mathbf y \mid \mathbf X_{obs}, \color{var(--nord15)}{\mathbf X_{mis}}, \boldsymbol\theta)}}\;\; \underset{\text{covariate model(s)}}{\underbrace{p(\mathbf X_{obs}, \color{var(--nord15)}{\mathbf X_{mis}}\mid \boldsymbol\theta)}}\;\; \underset{\text{priors}}{\underbrace{p(\boldsymbol\theta)}}$$` <br> **(Multivariate) joint model for longitudinal and survival data:** `$$p(\mathbf T, \boldsymbol\delta, \mathbf y, \color{var(--nord15)}{\mathbf X}, \mathbf b, \boldsymbol\theta) = \underset{\text{analysis model}}{\underbrace{ \underset{\substack{\text{survival}\\\text{model}}}{\underbrace{ p(\mathbf T, \boldsymbol\delta \mid \mathbf b, \color{var(--nord15)}{\mathbf X}, \boldsymbol \theta)}}\;\; \underset{\substack{\text{(multivariate)}\\\text{longitudinal}\\\text{model}}}{\underbrace{ p(\mathbf y \mid \mathbf b, \color{var(--nord15)}{\mathbf X}, \boldsymbol\theta)\; p(\mathbf b\mid \boldsymbol \theta)}} }}\; \bbox[#2E3440, 3pt, border: 2px solid #88C0D0]{\underset{\substack{\text{covariate}\\\text{models}}}{\underbrace{ p(\color{var(--nord15)}{\mathbf X} \mid \boldsymbol\theta)}}}\; \underset{\text{priors}}{\underbrace{p(\boldsymbol\theta)}}$$` ??? Most of this is straightforward to specify because it is the same that we'd need to specify when doing a Bayesian analysis on complete data. 
The only extra thing we need to do is specify the distribution of the covariates. --- ## Covariate Models <i class="fas fa-exclamation-triangle" style="color: var(--nord11);"></i> When `\(\color{var(--nord15)}{\mathbf X}\)` is multivariate & of mixed type ⇨ .bold.red[no closed form] <div style = "margin: auto; display: block; text-align: center;"> <strong>⇨ Factorization!</strong> </div> ??? But we still have a multivariate distribution with no closed form. The trick is to apply the same strategy as for the joint distribution and to specify it as a product of conditional distributions. - - - -- For example `\begin{align} p(\color{var(--nord15)}{\mathbf x_1}, \mathbf x_2, \color{var(--nord15)}{\mathbf x_3(t)}, \mathbf x_4(t) \mid \boldsymbol\theta) =\, & p(\color{var(--nord15)}{\mathbf x_3(t)} \mid \color{var(--nord15)}{\mathbf x_1}, \mathbf x_2, \mathbf x_4(t), \boldsymbol\theta) & \color{var(--nord3)}{\small\text{e.g., GLMM}}\\ & p(\mathbf x_4(t) \mid \color{var(--nord15)}{\mathbf x_1}, \mathbf x_2, \boldsymbol\theta) & \color{var(--nord3)}{\small\text{e.g., GLMM}}\\ & p(\color{var(--nord15)}{\mathbf x_1} \mid \mathbf x_2, \boldsymbol\theta) & \color{var(--nord3)}{\small\text{e.g., GLM}}\\ & \color{var(--nord3)}{p(\mathbf x_2 \mid \boldsymbol\theta)} & \color{var(--nord3)}{\small\text{(not needed)}} \end{align}` ??? For example, if we had 4 covariates, `\(x_1\)` and `\(x_3\)` having missing values, and `\(x_1\)` and `\(x_2\)` being baseline covariates while `\(x_3\)` and `\(x_4\)` are time-varying, we could specify the multivariate distribution using this sequence of models. Here I always condition on those other covariates for which I haven't specified a model yet. And when we are smart about the order in which we take the covariates, we can start with the models for time-varying covariates, because then we no longer need to include them in the linear predictors of the subsequent models for the baseline covariates. 
This solves our issue with the hierarchical structure of the data. - - - -- <p class = "smallbreak"> </p> <div style="width: 60%; padding: 1em 0 0.7em; background-color: var(--nord0); margin: auto; display: block; text-align: center; border: solid 2px var(--nord8); border-radius: 10px;"> covariate models \(\neq\) imputation models! </div> ??? An important thing to realize is that these models here are not the imputation models. They are still part of the specification of the joint distribution of all the data. [10:45 min] --- ## Imputation Models <div class = "container"> <div class = "box"> <div class = "box-row"> <div class = "box-cell" style = "background: var(--nord0);">imputation<br>models</div> <div class = "box-cell">\(\propto\)</div> <div class = "box-cell" style = "background: var(--nord0);">joint<br>distribution</div> <div class = "box-cell">\(=\)</div> <div class = "box-cell" style = "background: var(--nord0);">analysis<br>model</div> <div class = "box-cell" style = "background: var(--nord0);">covariate<br>models</div> <div class = "box-cell" style = "background: var(--nord0)">priors</div> </div> </div> </div> <img src = "graphics/Gibbs.png", height = 350; style = "position: absolute; left: 60px; bottom: 80px;"> <div style = "position: absolute; left: 430px; bottom: 250px;"> <span style = "font-weight: bold; color: var(--nord7);">Gibbs sampler</span> </div> ??? The imputation models are then derived as full-conditional distributions from this joint distribution in the same way we do it with the posterior distributions of the parameters, typically using Gibbs sampling, - - - -- <div style = "position: absolute; right: 60px; bottom: 350px;"> MICE / FCS </div> <img src = "graphics/MICE.png", height = 300; style = "position: absolute; bottom: 50px; right: 60px;"> ??? 
The important difference to the MICE algorithm is that in the Bayesian approach, we do specify the joint distribution and derive the imputation models as full conditionals from this joint distribution. The models are usually much more complex than the ones we specify when using MICE and can provide a sample from the joint distribution of the missing values, even in complex settings. In MICE, we specify the imputation models directly, which means they typically have a relatively simple structure, and they could be incompatible with each other and the analysis model. Hence they may not produce a joint distribution that approximates the correct joint distribution well enough. [11:30 min] -- <div class = "MH"> Metropolis-Hastings </div> --- ## Advantages of the Bayesian Approach `$$p(\mathbf y, \color{var(--nord15)}{\mathbf X}, \boldsymbol\theta) = \bbox[#2E3440, 5pt]{\underset{\text{analysis model}}{\underbrace{p(\mathbf y \mid \color{var(--nord15)}{\mathbf X}, \color{var(--nord14)}{\boldsymbol\theta_{y\mid x}})}}}\; p(\color{var(--nord15)}{\mathbf X} \mid \boldsymbol\theta_x)\; p(\color{var(--nord14}{\boldsymbol\theta_{y\mid x}}, \boldsymbol\theta_x)\qquad \color{var(--nord3)}{\text{with } \boldsymbol\theta = (\boldsymbol\theta_{y\mid x}, \boldsymbol\theta_x)}$$` * specification of the joint distribution<br> ⇨ assures **compatibility** of imputation models ??? This is one of the advantages of the Bayesian approach. Because we specify the joint distribution, we can ensure that all imputation models are compatible with each other. - - - -- * use of the analysis model in the specification of the joint distribution<br> ⇨ **parameters of interest** `\(\color{var(--nord14)}{\boldsymbol\theta_{y\mid x}}\)` are estimated directly<br> ⇨ **non-linear associations** are taken into account<br> ⇨ assures **congeniality** ??? 
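These mechanics can be sketched in a few lines of code (a toy model of my own, not JointAI or the joint model from the talk): analysis model `\(y_i = \beta x_i + \varepsilon_i\)` with `\(\varepsilon_i \sim N(0, 1)\)`, covariate model `\(x_i \sim N(0, 1)\)`, and prior `\(\beta \sim N(0, 10^2)\)`. Both full conditionals below are derived from this joint specification; no imputation model is specified directly.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate data from the toy joint model, then delete 20% of x.
n, beta_true = 200, 2.0
x = rng.normal(size=n)
y = beta_true * x + rng.normal(size=n)
mis = np.zeros(n, dtype=bool)
mis[:40] = True

# Gibbs sampler: alternate between the two full conditionals
# derived from the joint distribution p(y, x, beta).
n_iter, burn = 3000, 500
x_cur = np.where(mis, 0.0, x)
beta_draws, x_mis_draws = [], []
for it in range(n_iter):
    # p(beta | y, x): normal, combining the N(0, 100) prior with the likelihood
    prec_b = 1 / 100 + x_cur @ x_cur
    mean_b = (x_cur @ y) / prec_b
    beta = rng.normal(mean_b, np.sqrt(1 / prec_b))
    # p(x_mis | y, beta): normal, combining the N(0, 1) covariate model
    # with the likelihood contribution of y (this IS the imputation model)
    prec_x = 1 + beta ** 2
    x_cur[mis] = rng.normal(beta * y[mis] / prec_x, np.sqrt(1 / prec_x))
    if it >= burn:
        beta_draws.append(beta)
        x_mis_draws.append(x_cur[mis].copy())

beta_hat = np.mean(beta_draws)          # posterior mean, close to beta_true = 2
x_mis_hat = np.mean(x_mis_draws, axis=0)
```

The parameter of interest and the imputations come out of the same run, which is the "simultaneous analysis and imputation" idea that JointAI (via JAGS) implements for far more complex models.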
And because the analysis model is part of the specification of the joint distribution, we can be sure that the imputation models are also compatible with the analysis model, even when there are non-linear associations with covariates. In addition, the parameters of interest, so the parameters of the analysis model, are directly estimated. We do not have separate steps for imputation and analysis, but everything happens simultaneously in the same procedure. - - - -- * response is not in any linear predictor<br> ⇨ no problem to use **complex outcomes** ??? Because we can specify the joint distribution so that one of the conditional models is the analysis model, we can usually make sure that the outcome does not need to appear in the linear predictor of models for covariates. This makes it possible to use this approach in settings where we have complex outcomes that could not be easily included in the linear predictor of some or all of the covariates. [12:30 min] --- ## In practice: <i class="fab fa-r-project"></i> Package JointAI .footnote[ [<i class = "fas fa-file-alt"></i> Erler et al. 
(2021)](https://doi.org/10.18637/jss.v100.i20) [<i class = "fas fa-globe-americas"></i> nerler.github.io/JointAI](https://nerler.github.io/JointAI) ] <div class = "container"> <div class = "box" style = "margin-left: 20px;"> <div class = "box-row"> <div class = "box-cell" style="width:180px;"> <strong>Specification</strong><br>of the joint<br> distribution: </div> <div class = "box-cell" style = "background: var(--nord0);"> <div style = "padding-bottom: 10px;"> <strong>User</strong></div> <div class = "box-cell" style = "background: var(--nord1); border: 2px solid var(--nord8); ">analysis<br>model</div> </div> <div class = "box-cell" style = "background: var(--nord0);"> <div style = "padding-bottom: 10px;"> <strong>JointAI</strong></div> <div class = "box-row"> <div class = "box-cell" style = "background: var(--nord1);">covariate<br>models</div> <div class = "box-cell" style = "background: var(--nord1)">priors</div> </div></div> </div></div></div> <a href = "https://nerler.github.io/JointAI/"> <img src = "graphics/JointAI.png" style = "position: absolute; right: 62px; top: 117px; width: 165px;"></a> ??? To use this Bayesian approach, for example, to fit the multivariate joint model in the motivating example, you can use the R package JointAI, short for Joint Analysis and Imputation. To specify the joint distribution, the user only has to specify the analysis model of interest, and the software sets defaults for the covariate models and priors. Several different analysis models are available, and the specification is almost the same as for standard functions in R like `lm()` or `glm()`. 
- - - -- <br> <div class = "container"> <div class = "box" style = "margin-left: 20px;"> <div class = "box-row"> <div class = "box-cell" style="width:180px;"> <strong>Estimation:</strong> </div> <div class = "box-cell" style = "background: var(--nord0);"> <div style = "padding-bottom: 10px;"> <strong>JointAI</strong></div> <div class = "box-cell" style = "background: var(--nord1);">pre-processing</div> </div> <div class = "box-cell" style = "background: var(--nord0);"> <div style = "padding-bottom: 10px;"> <strong>rjags / JAGS</strong></div> <div class = "box-cell" style = "background: var(--nord1);">MCMC sampling</div> </div> <div class = "box-cell" style = "background: var(--nord0);"> <div style = "padding-bottom: 10px;"> <strong>JointAI</strong></div> <div class = "box-cell" style = "background: var(--nord1);">post-processing</div> </div> </div></div></div> ??? The model estimation is done via MCMC with the help of JAGS, an external and freely available software that does Gibbs sampling. 
[13:15] --- ## JointAI: Model Types & Specification .flex-grid[ .col[ **Univariate Models:** * `lm_imp()` * `glm_imp()` * `clm_imp()` * `mlogit_imp()` * `betareg_imp()` * `lognorm_imp()` ] .col[ **Mixed Models** * `lme_imp()` * `glme_imp()` * `clmm_imp()` * `mlogitmm_imp()` * `betamm_imp()` * `lognormmm_imp()` ] .col[ **Survival Models** * `coxph_imp()` * `survreg_imp()` * `JM_imp()` ] ] **For Example:** ```r mod <- lm_imp(SBP ~ gender + age + alc * creat, data = NHANES, n.iter = 200) ``` <p class = "smallbreak"> </p> .footnote[ [Full documentation at https://nerler.github.io/JointAI/](https://nerler.github.io/JointAI/) ] --- class: center, middle # Summary --- ## In Comparison <table class = "ppt" style = "margin-top: -20px;"> <tr> <th style = "text-align: center;"></th> <th style = "width: 25%;">MICE</th> <th style = "width: 30%;">Joint Model MI</th> <th>Bayesian Analysis</th> </tr> <tr> <td></td> <td colspan = "2" style = "text-align: center;"> <strong>separate</strong> imputation & analysis</td> <td><strong>simultaneous</strong> analysis & imputation</td> </tr> <tr> <td></td> <td colspan = "2" style = "text-align: center;"><strong>direct</strong> specification of imputation model</td> <td><strong>indirect</strong> specification of the imputation model</td> </tr> <tr style = "padding-top: 0px;"> <td style = "text-align: center;"><i class="fas fa-thumbs-up fa-2x" style = "color: var(--nord10)"></i></td> <td><ul> <li>simple settings</li> </ul></td> <td><ul> <li>simple settings</li> <li>multi-level data</li> </ul></td> <td><ul> <li>non-linear associations</li> <li>time-to-event outcomes</li> <li>multi-level data</li> <li>more complex analyses</li></ul></td> </tr> <tr> <td style = "text-align: center;"><i class="fas fa-thumbs-down fa-2x" style = "color: var(--nord10)"></i></td> <td> <ul> <li>non-linear associations</li> <li>complex outcomes / data structure</li> </ul></td> <td><ul> <li>non-linear associations</li> <li>many incomplete variables / complex random 
effects</li> </ul> </td> <td><ul> <li> very large datasets</li> </ul></td> </tr> <tr style = "text-align: center;"> <td><i class="fab fa-r-project fa-2x" style = "color: var(--nord10);"></i></td> <td><span style = "color: var(--nord10); font-weight: bold;">mice</span></td> <td><span style = "color: var(--nord10); font-weight: bold;">jomo</span></td> <td><span style = "color: var(--nord10); font-weight: bold;">JointAI</span></td> </tr> </table> ??? MICE and joint model imputation are two different options to perform the imputation step in a multiple imputation procedure. This means that in both cases the imputation is completely separate from the analysis. This separation can be convenient when the same incomplete data is used in multiple analyses, but it also introduces the risk of having imputation models that are not compatible with the analysis, for example, when there is a non-linear association in the analysis model. In the Bayesian approach, analysis and imputation are combined. This combination assures that the imputation and analysis models do not contradict each other. It is, however, possible to extract the imputed values sampled in the Bayesian approach, so this method can also serve as the imputation step in a multiple imputation procedure. In MICE and joint model imputation, we specify the imputation models directly. This is what makes these approaches difficult to use when the incomplete variables do not just have simple linear associations with the other variables. In the Bayesian approach, we specify the likelihood for the data, but the imputed values are sampled from the posterior distribution that is derived from the likelihood and the prior. With all the advanced sampling techniques available nowadays, this also works when the posterior does not have a closed form, so this approach is well suited for settings with complex associations, while MICE is better suited for simpler settings. 
Joint model imputation can handle multi-level settings, but assumes linear associations between all sub-models. Because the Bayesian approach is more computationally intensive, it may be less well suited for very large datasets. All three approaches are available in R, as the R packages mice, jomo and JointAI. --- ## Summary Imputation requires sampling from `\(p(\color{var(--nord15)}{\mathbf{\mathcal D}_{mis}} \mid \mathbf{\mathcal D}_{obs})\)`. -- **Multiple** Imputation: * Practical approach to handle the uncertainty about the missing values:<br> ⇨ Sample just a "few" values from `\(p(\color{var(--nord15)}{\mathbf{\mathcal D}_{mis}} \mid \mathbf{\mathcal D}_{obs})\)`. * Pooling to take into account within- & between-imputation variation. -- <p class = "smallbreak"> </p> For models that involve * survival outcomes * other non-linear associations * multi-level data it is **difficult to specify** `\(p(\color{var(--nord15)}{\mathbf{\mathcal D}_{mis}} \mid \mathbf{\mathcal D}_{obs})\)` directly. --- ## Summary **MICE / FCS**<br> Approximate `\(p(\color{var(--nord15)}{\mathbf{\mathcal D}_{mis}} \mid \mathbf{\mathcal D}_{obs})\)` by sampling from directly specified full conditionals `\(p(\color{var(--nord15)}{\mathbf x_k} \mid \cdot)\)`. <ul class = "arrowlist"> <li>incompatibility / uncongeniality issues</li> <li>bias</li> </ul> ??? MICE tries to approximate this distribution using the idea of the Gibbs sampler. But because the directly specified full conditionals may not be a good approximation of the correct full conditionals, it has issues with incompatibility and uncongeniality in complex settings and may result in bias. - - - -- <p class = "smallbreak"> </p> <strong>Fully Bayesian approach</strong> <ul> <li>Specification of the joint distribution \(p(\color{var(--nord15)}{\mathbf{\mathcal D}_{mis}}, \mathbf{\mathcal D}_{obs}, \boldsymbol \theta)\) via factorization.</li> <li>Derive \(p(\color{var(--nord15)}{\mathbf x_k} \mid \cdot)\) from this joint distribution. 
<ul class = "arrowlist"> <li>theoretically valid, compatible & congenial</li> </ul> </li> </ul> <span style = "float: right;"> .sgrey[... MAR, model fit, other model assumptions, ...] </span> ??? Using a fully Bayesian approach, we can specify the joint distribution as a sequence of conditional models that gives us a number of advantages. The imputation models are then derived from this joint distribution, resulting in a theoretically valid approach that assures compatibility and congeniality. --- class: the-end, center, middle layout: true count: false # Thank you for your attention! <div class="contact"> <i class="fas fa-envelope"></i> <a href="mailto:n.erler@erasmusmc.nl" class="email">n.erler@erasmusmc.nl</a>  <a href="https://twitter.com/N_Erler" target="_blank"><i class="fab fa-twitter"></i> N_Erler</a>  <a href="https://github.com/NErler" target="_blank"><i class="fab fa-github"></i> NErler</a>  <a href="https://nerler.com" target="_blank"><i class="fas fa-globe-americas"></i> https://nerler.com</a> </div> --- count: false