class: left, top, title-slide
Imputation Magic…
How (not) to deal with incomplete data
Nicole Erler
Department of Biostatistics
n.erler@erasmusmc.nl
N_Erler
NErler
https://nerler.com
--- count: false layout: true <div class="my-footer"><span> <a href="https://twitter.com/N_Erler"><i class="fab fa-twitter"></i> N_Erler</a>      <a href="https://github.com/NErler"><i class="fab fa-github"></i> NErler</a>      <a href = "https://nerler.com"><i class="fas fa-globe-americas"></i> nerler.com</a> </span></div> <!-- --- --> <!-- ## Outline / Topics --> <!-- * Missing Values are a Problem --> <!-- * General Considerations & Missing Data Mechanisms --> <!-- * Naive Imputation Approaches --> <!-- * Multiple Imputation --> <!-- * General Concept --> <!-- * Multivariate Missingness --> <!-- * General Considerations --> <!-- * Issues with imputation --> <!-- * in multi-level data --> <!-- * with non-linear associations / survival data --> --- count: false class: center, middle # Missing Values are a Problem! ??? Let's start right at the beginning. When we want to analyse data in which some values are missing, we have a problem. Why is that? Because even a single missing value can make it impossible to get any results at all. 
--- ## Example .gr-left[ **Data** <table class="data-table"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\mathbf x_1\)</th> <th>\(\mathbf x_2\)</th> <th>\(\mathbf x_3\)</th> </tr> <tr><td></td><td colspan = "4"; style = "padding: 0px;"><hr /></td><tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td>-0.1</td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td>-1.9</td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td>-0.2</td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td>-0.6</td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> </table> ] .gr-right[ <br> <br> **What is the mean of `\(\color{white}{\mathbf x_1}\)`?** {{content}} ] ??? As an example, imagine we have the following data with four variables, but for now we are only interested in `\(x_1\)`. We want to calculate the mean of `\(x_1\)`, but for one of the patients, the value of `\(x_1\)` is missing. So, how do we calculate the mean? -- <br> `$$\boldsymbol{\bar x}_1 = \frac{-0.1 + \;\color{var(--nord15)}{\boldsymbol ?} - 1.9 - 0.2 - 0.6}{5}$$` ??? We need to sum up all the values of `\(x_1\)` and divide by the number of observations. The problem is, that we cannot even calculate this sum --- ## Missing Values are a Problem! Even with **just a single missing value** most (summary) statistics or parameters .red[cannot be calculated!] 
.pull-left[ <br> <table class="data-table"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\mathbf x_1\)</th> <th>\(\mathbf x_2\)</th> <th>\(\mathbf x_3\)</th> </tr> <tr><td></td><td colspan = "4"; style = "padding: 0px;"><hr /></td><tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td>-0.1</td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td>-1.9</td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td>-0.2</td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td>-0.6</td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> </table> ] ??? With just a single value missing in our data set, we are not able to get results for an analysis as simple as a linear regression. So, what is the solution to this problem? - - - -- .pull-right[ <br><br><br> <strong>Solution:</strong><br> .red[Exclude] incomplete cases? <div id = "hline1"></div> <img src = "materials/magic-wand.png" id = "magicwand"> ] ??? A common "solution" is to make the missing data problem "disappear" by just excluding all patients who have one or more missing values. You may not even be aware that you are doing this because the software does it for you. What many researchers also aren't aware of is that they are not actually "avoiding" the missing data problem by doing this. Like any analysis method and any method to deal with missing data, a complete case analysis implies certain assumptions and has consequences. 
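To make this concrete, here is a minimal sketch in plain Python (not taken from the original slides) using the five `\(x_1\)` values from the example table. Encoding the missing value as `float("nan")` and the variable names are assumptions made for illustration. It shows that the naive mean cannot be calculated, and what a silent complete case analysis does instead.

```python
import math

# x_1 from the example table; float("nan") marks the missing value.
x1 = [-0.1, float("nan"), -1.9, -0.2, -0.6]

# Naive mean: the NaN propagates through the sum.
naive_mean = sum(x1) / len(x1)

# Complete case analysis: the incomplete observation is dropped.
observed = [v for v in x1 if not math.isnan(v)]
cc_mean = sum(observed) / len(observed)

print(naive_mean)                                   # nan
print(round(cc_mean, 2))                            # -0.7
print(len(observed), "of", len(x1), "cases used")   # 4 of 5 cases used
```

The second mean looks like a perfectly normal result, which is exactly why it is easy to overlook that a case was excluded.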
--- class: center, middle # Complete Case Analysis is (usually) a Bad Idea! ??? Because of those assumptions and consequences, complete case analysis is in most cases a rather bad idea. --- ## Complete Case Analysis: Inefficient! <img src="index_files/figure-html/ccdemo-1.png" width="100%" /> ??? In any case, complete case analysis is inefficient because you throw away information. You see on the y-axis the proportion of complete cases in a data set, and on the x-axis the number of incomplete variables. Each line represents a different proportion of missing values per variable. So, if we had 10% missing values in each of 25 variables, we may end up with only 7% of the original sample size. And if we had 10% missing values in 10 variables, we may only have 35% of our data left in a complete case analysis. --- ## Complete Case Analysis Complete Case Analysis is <ul class="fa-ul"> <li><span class = "fa-li" style = "color:var(--nord11);"><i class="far fa-frown"></i></span>inefficient</li> <li><span class = "fa-li" style = "color:var(--nord11);"><i class="far fa-frown"></i></span>usually biased</li> </ul> <br> **For Example:** <ul class="fa-ul"> <li> <a href = "https://thestatsgeek.com/2013/07/06/when-is-complete-case-analysis-unbiased"> <span class = "fa-li"><i class="fab fa-wordpress"></i></span> thestatsgeek.com (2013)</a> </li> <li> <a href = "https://doi.org/10.1002/sim.3944"> <span class = "fa-li"><i class="fas fa-file-alt"></i></span> White & Carlin (2010)</a> </li> <li> <a href = "https://doi.org/10.1016/j.jclinepi.2009.08.028"> <span class = "fa-li"><i class="fas fa-file-alt"></i></span> Knol et al. (2010)</a> </li> <li> <a href = "https://doi.org/10.1016/j.jclinepi.2006.01.015"> <span class = "fa-li"><i class="fas fa-file-alt"></i></span> Van der Heijden et al. (2006)</a> </li> <li> <a href = "https://doi.org/10.1016/j.jclinepi.2009.12.008"> <span class = "fa-li"><i class="fas fa-file-alt"></i></span> Janssen et al. (2010)</a> </li> </ul> ??? 
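The percentages quoted for the figure on the previous slide can be checked with a short calculation. The sketch below is in Python and rests on an assumption that is not stated in the slides: values are missing independently across variables, so the expected share of complete cases is `(1 - p)^k` for `k` incomplete variables with a proportion `p` missing each. The function name is made up for this illustration.

```python
def expected_complete_fraction(p_missing: float, n_vars: int) -> float:
    """Expected share of complete cases if each of n_vars variables is
    missing independently with probability p_missing (an assumption)."""
    return (1.0 - p_missing) ** n_vars


print(round(expected_complete_fraction(0.10, 25), 3))  # 0.072
print(round(expected_complete_fraction(0.10, 10), 3))  # 0.349
```

If missingness tends to occur in the same subjects, more complete cases remain than this formula suggests, so the numbers are best read as a rough guide.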
In addition, complete case analysis is biased in most settings. There are a few very specific exceptions, depending on what kind of model you use, where the missing values are, and why they are missing. --- class: center, middle # Imputation ??? So we need a better way to handle missing values, and the magic word here is "Imputation". --- class: center, middle, animated, fadeIn count: false # Imputation<br><br><br><br><br> <img src = "materials/magic-hat.png" id = "magichat"> ??? Imputation is this magic procedure where you just draw the correct values that are missing out of a hat, right? Unfortunately, it is not quite that easy! To figure out how to impute missing values we first need to understand more about them. --- ## Understanding Missing Values .pull-left[ * There is **uncertainty** about the missing value. {{content}} ] ??? First, I think you can agree with me on this, we need to accept that there is **uncertainty** about what the value would have been. And so we **can't just pick** one value and fill it in, because then we would ignore this uncertainty. If the value of `height` is missing for one patient, we don't know what that value would have been. - - - - -- * Some values are **more likely** than others. {{content}} ??? Second, some values are usually going to be more likely than others. For the missing `height`, a value somewhere around 1.70 - 1.80m is probably more likely than values of 1.50m or 2.10m. Those values are also possible, but they are less likely. - - - - -- **⇨ Missing values have a distribution.** <img src="figures/ImpDens.png", height = 250, style = "margin: auto; display: block;"> ??? So, in statistical terms, we can say that missing values have a distribution. - - - - -- .pull-right[ * There is a relationship with **other** (available) **data**. <br> {{content}} ] ??? Moreover, there typically is some relationship with the rest of the data. 
If we know that the missing `height` value is from a male, larger values become more likely and smaller values less likely. This means that we can use a model to learn how the incomplete variable is related to the other data. - - - -- <div class = "box bg-0" style = "margin-top: 5px"> <strong>Predictive distribution</strong> of the missing values given the observed values. $$ p(\color{var(--nord15)}{x_{mis}}\mid\text{everything else}) $$ </div> ??? This model defines what we call the **predictive distribution**. And this is the distribution that we need to sample imputed values from. So, you could say, that this predictive distribution is our magic hat. --- ## A Simple Example .gr-left[ <table class="data-table"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\color{var(--nord15)}{\mathbf x_1}\)</th> <th>\(\mathbf x_2\)</th> <th>\(\mathbf x_3\)</th> </tr> <tr><td></td><td colspan = "4"; style = "padding: 0px;"><hr /></td><tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class = "rownr"></td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> </tr> </table> * `\(\mathbf y\)`: **response** * `\(\color{var(--nord15)}{\mathbf x_1}\)`: **incomplete** covariate * `\(\mathbf x_2\)`, `\(\mathbf 
x_3\)`: **complete**<br>covariates ] .gr-right[ Fit a model to the cases with observed `\(\color{var(--nord15)}{\mathbf x_1}\)`:<br> `\(\color{var(--nord15)}{\mathbf x_1} = \alpha_0 + \alpha_1 \mathbf y_{-i} + \alpha_2 \mathbf x_{-i2} + \alpha_3 \mathbf x_{-i3} + \boldsymbol\varepsilon,\;\; \color{var(--nord3)}{\small\varepsilon \sim N(0, \sigma^2)}\)` ⇨ Estimate parameters `\(\boldsymbol{\hat\alpha}, \mathbf{\hat\sigma}\)` <br> {{content}} ] ??? Let's look at an example to get a better idea about how this works. We have the same data as before, where we have one or more missing values in the variable `\(x_1\)`. We could fit a model to all cases where `\(x_1\)` is observed and use the other variables as covariates in this model. From this model we get parameter estimates for the regression coefficients `\(\alpha\)` and the standard deviation of the error terms, `\(\sigma\)`. - - - -- **Predictive distribution** of `\(\color{var(--nord15)}{x_{i1}}\)`: <!-- `$$p(\color{var(--nord15)}{x_{i1}} \mid y_i, x_{i2}, x_{i3}, \boldsymbol{\hat\alpha}, \mathbf{\hat{\sigma}})$$` --> Normal distribution with * mean   `\(\mathbf{\hat \alpha}_0 + \mathbf{\hat\alpha}_1 y_i + \mathbf{\hat{\alpha}}_2 x_{i2} + \mathbf{\hat{\alpha}}_3 x_{i3}\)` * variance   `\(\mathbf{\hat\sigma}^2\)` ??? This model defines the predictive distribution for the missing values. Because we used a linear regression model we assume that the missing value `\(x_{i1}\)` is from a normal distribution with a mean equal to the linear predictor, of the regression model, and standard deviation as estimated in the model fitted on the rest of the data. Essentially, we just "predict" the missing value in `\(x_1\)` from the model fitted on the part of the data in which `\(x_1\)` is observed. 
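A minimal sketch of this idea in plain Python, simplified to a single complete variable `\(y\)` in the imputation model so that the least-squares fit can be written out by hand. The simulated data, the seed and all variable names are assumptions for illustration and do not come from the slides.

```python
import math
import random

random.seed(1)

# Hypothetical data: x1 depends linearly on one complete variable y.
n = 200
y = [random.gauss(0, 1) for _ in range(n)]
x1 = [1.0 + 2.0 * yi + random.gauss(0, 0.5) for yi in y]
x1[0] = None  # pretend x1 is missing for the first case

# Fit the imputation model on the cases with observed x1.
obs = [(yi, xi) for yi, xi in zip(y, x1) if xi is not None]
m = len(obs)
y_bar = sum(yi for yi, _ in obs) / m
x_bar = sum(xi for _, xi in obs) / m
s_yy = sum((yi - y_bar) ** 2 for yi, _ in obs)
s_yx = sum((yi - y_bar) * (xi - x_bar) for yi, xi in obs)
alpha1 = s_yx / s_yy
alpha0 = x_bar - alpha1 * y_bar
rss = sum((xi - alpha0 - alpha1 * yi) ** 2 for yi, xi in obs)
sigma = math.sqrt(rss / (m - 2))  # residual standard deviation

# Predictive distribution of the missing value: N(pred_mean, sigma^2).
pred_mean = alpha0 + alpha1 * y[0]
draw = random.gauss(pred_mean, sigma)

print("estimates:", round(alpha0, 2), round(alpha1, 2), round(sigma, 2))
print("predictive mean:", round(pred_mean, 2), "one draw:", round(draw, 2))
```

Note that this plug-in version treats the estimated parameters as if they were known, so it only reflects the random variation around the regression line.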
--- ## Imputation of Missing Values <img src="index_files/figure-html/imp regline2-1.png" width="100%" /> <div id="mathformula"> \(\mathbf{\hat\alpha}_0 + \mathbf{\hat\alpha}_1 \mathbf y + \mathbf{\hat{\alpha}}_2 \mathbf x_2 + \mathbf{\hat{\alpha}}_3 \mathbf x_3\) </div> ??? We can visualize this for the case with only one other variable in the imputation model. On the x-axis we have one of the complete variables and on the y-axis we have the incomplete variable `\(x_1\)` that we are imputing. In practice we will of course have more variables, but then I couldn't show it in a simple plot any more. So this is really just to get the idea. We know the value of the other variable, so we know where our incomplete cases are on the x-axis, but we don't know where to place them on the y-axis. Therefore I only marked them as empty circles on the x-axis here. When we now fit a model on the observed cases we can represent that as the corresponding regression line. - - - --- count: false name: predval_reg ## Imputation of Missing Values <img src="index_files/figure-html/imp regline3-1.png" width="100%" /> <div id="mathformula"> \(\mathbf{\hat\alpha}_0 + \mathbf{\hat\alpha}_1 \mathbf y + \mathbf{\hat{\alpha}}_2 \mathbf x_2 + \mathbf{\hat{\alpha}}_3 \mathbf x_3\) </div> ??? [jump to regression imputation](#regimp) When we then use this model to predict the missing values, we calculate where on the regression line the missing values would be. The regression line is the linear predictor from the model on the previous slide. Could we now just take those values as our imputed values? --- ## Imputation of Missing Values .box.bg-0[ **Important:** We need to take into account the **uncertainty**! ] ??? Not quite. We can't just use the fitted value to impute the missing value because there is uncertainty that we haven't taken into account. 
- - - - - -- .pull-left[ About the **actual value:** `\(\color{var(--nord15)}{\mathbf{\hat{x}_1}}\)` <img src="index_files/figure-html/imp prederror-1.png" width="100%" /> ] ??? There is uncertainty about the imputed values. Missing values have a distribution and we need to sample from this distribution. The regression line is the mean of this distribution, but we are not doing any random sampling if we take the mean. The observed data also is not exactly on this regression line, but spread around it. So we'd expect the same for the missing values. - - - -- .pull-right[ About the **parameter estimates:** <img src="index_files/figure-html/imp reglines multi-1.png" width="100%" /> ] ??? In addition, there is uncertainty about the parameter estimates in the imputation model, the `\(\hat\alpha\)`. Because our data is just a sample, we don't know the true parameters. With a different sample, we'd get a slightly different regression line. --- ## Imputation of Missing Values **We want:**<br> Imputation from the **predictive distribution** `\(p(\color{var(--nord15)}{x_{mis}} \mid \text{everything else})\)`. <br> **Idea:**<br> Use a "prediction" model. <br> **Take into account:** * **uncertainty in parameter** estimates `\(\boldsymbol{\hat\alpha}\)` * **prediction error** `\((\color{var(--nord15)}{\mathbf{\hat x}_{mis}} \neq \color{var(--nord15)}{\mathbf x_{mis}})\)` * A missing value has a **distribution** ⇨ we can't just replace it with **one** value. ??? So, in summary, what have we seen so far? We want to impute missing values from the predictive distribution of the missing value given everything else. We could do that via some sort of prediction to make use of the relationships between variables. 
But we need to take into account that we have multiple sources of uncertainty or variation: - uncertainty about the **parameters** in the imputation model - **random variation** of the unknown values (also called **prediction error**) - and we need to take into account that there is **uncertainty about the missing value**, so that we can't represent a missing value by one single imputed value because that would not capture that uncertainty (the additional uncertainty that we have compared to an observed value) --- class: center, middle # Naive Ways to Handle Missing Data ??? So with this in mind, let's have a look at some unfortunately still used naive methods to handle missing data. --- ## Naive Ways to Handle Missing Data <img src="index_files/figure-html/unnamed-chunk-3-1.png" width="100%" /> ??? On the next few slides, I'll visualize some of these naive methods for imputation. I use this plot with the incomplete covariate `\(x_1\)` on the x-axis and the response `\(y\)` on the y-axis, so a plot that represents our analysis of interest. All of the white dots represent patients for whom we have both `\(x\)` and `\(y\)` observed and the empty red-ish dots are the cases for whom the value of the covariate `\(x\)` is missing. --- ## Mean Imputation <img src="index_files/figure-html/unnamed-chunk-4-1.png" width="100%" /> ??? First, we have **mean imputation**, where all missing values are replaced by the mean of the observed values of `\(x_1\)`. You can clearly see that the imputed values are not a good representation of the distribution of the true but missing values. They don't vary enough and this method will usually result in bias. --- ## Missing Indicator Method <img src="index_files/figure-html/unnamed-chunk-6-1.png" width="100%" /> ??? Then, we have the missing indicator method. The idea here is to replace the missing values with a fixed value, for example zero. 
And, to distinguish the incomplete cases from the complete cases, we additionally add an indicator variable that is zero for observed cases and one for incomplete cases. As for mean imputation, we see that the imputed values do not at all represent the spread of the missing values. If we fit a model to this imputed data we will again get biased results. --- name: regimp ## Regression Imputation <img src="index_files/figure-html/unnamed-chunk-8-1.png" width="100%" /> ??? [jump to regline with predicted values](#predval_reg) Next, we have regression imputation, where we impute based on a model like the one I showed you a couple of slides ago, but we use the values on the regression line and ignore the random variation. The imputed values are a bit more in the range of the original data, but you still see that we underestimate the variability and thereby the uncertainty about the results. --- ## Single Imputation <img src="index_files/figure-html/unnamed-chunk-10-1.png" width="100%" /> ??? In single imputation we now improve upon the regression imputation by taking into account both the uncertainty about the parameters in the imputation model and the random variation. And we can see that the imputed values have a distribution that is much more similar to the distribution of the missing values. The imputed values don't need to be identical to the original values, but they need to be from the correct distribution. --- ## Single Imputation Can take into account * **uncertainty in parameter** estimates `\(\boldsymbol{\hat\alpha}\)` * **prediction error** `\((\color{var(--nord15)}{\mathbf{\hat x}_{mis}} \neq \color{var(--nord15)}{\mathbf x_{mis}})\)` .pull-left[ **But:** <img src="index_files/figure-html/unnamed-chunk-11-1.png" width="100%" /> ] .pull-right[ Single imputation does not take into account the **uncertainty about the imputed value**! ] ??? 
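To make the comparison from the last few slides concrete, here is a small Python sketch that contrasts the spread of values produced by mean imputation, regression imputation and a stochastic single imputation. Everything in it is an assumption for illustration: the simulated data, the helper `fit_line`, and the use of a bootstrap refit as one simple way to reflect parameter uncertainty (the slides do not say which method was used for the plots).

```python
import math
import random
import statistics

random.seed(2)


def fit_line(pairs):
    """Least-squares fit of x1 on y; returns (intercept, slope, residual sd)."""
    n = len(pairs)
    y_bar = sum(y for y, _ in pairs) / n
    x_bar = sum(x for _, x in pairs) / n
    s_yy = sum((y - y_bar) ** 2 for y, _ in pairs)
    s_yx = sum((y - y_bar) * (x - x_bar) for y, x in pairs)
    slope = s_yx / s_yy
    intercept = x_bar - slope * y_bar
    rss = sum((x - intercept - slope * y) ** 2 for y, x in pairs)
    return intercept, slope, math.sqrt(rss / (n - 2))


# Hypothetical data: 300 cases with x1 observed, 300 cases with x1 missing.
observed = []
for _ in range(300):
    y = random.gauss(0, 1)
    observed.append((y, 1.0 + 1.0 * y + random.gauss(0, 1)))
y_missing = [random.gauss(0, 1) for _ in range(300)]

a0, a1, sigma = fit_line(observed)

# Mean imputation: every missing value gets the observed mean.
mean_imp = [statistics.fmean(x for _, x in observed)] * len(y_missing)

# Regression imputation: values on the regression line, no noise.
reg_imp = [a0 + a1 * y for y in y_missing]

# Stochastic single imputation: refit on a bootstrap sample (parameter
# uncertainty), then add a random residual (prediction error).
boot = random.choices(observed, k=len(observed))
b0, b1, b_sigma = fit_line(boot)
sto_imp = [random.gauss(b0 + b1 * y, b_sigma) for y in y_missing]

for name, values in [("mean", mean_imp), ("regression", reg_imp),
                     ("stochastic", sto_imp)]:
    print(name, "sd of imputed values:", round(statistics.pstdev(values), 2))
```

In this setup the mean-imputed values have no spread at all, the regression-imputed values vary less than the simulated `x1`, and only the stochastic version has a spread comparable to the data.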
With single imputation we can take into account two of the sources of uncertainty or variation, but we only have one imputed value. We have no way of taking into account the added uncertainty that we have about the imputed value compared to an observed value, when we just have one single value. --- class: center, middle # Multiple Imputation ??? And this is why Donald Rubin came up with the idea of multiple imputation. --- ## Multiple Imputation MI was developed in the 1960s/70s... <img src = "materials/PCpic.png", height = 380 style = "position: absolute; right: 60px; bottom: 60px;"> -- <br> **Requirements** * computationally feasible * "fix" the missing data problem once / centrally<br> ⇨ distribute imputed data to other researchers --- ## Multiple Imputation <img src = "figures/MI.png", height = 480, style = "margin: auto; display: block;"> ??? The idea behind multiple imputation is that, using this principle, we sample imputed values and fill them into the original, incomplete data to create a completed dataset. And in order to take into account the uncertainty that we have about the missing values, we do this multiple times, so that we obtain multiple completed datasets. Because all the missing values have now been filled in, we can analyse each of these datasets separately with standard statistical techniques. To obtain overall results, the results from each of these analyses need to be combined in a way that takes into account both the uncertainty that we have about the estimates from each analysis, and the variation between these estimates. 
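The combining step is usually done with Rubin's rules: the pooled estimate is the average of the `m` estimates, and the total variance is the average within-imputation variance plus the between-imputation variance, inflated by the factor `1 + 1/m`. A minimal Python sketch for a single parameter follows. The function name `pool` and the five estimates and variances are made up for illustration.

```python
import statistics


def pool(estimates, variances):
    """Pool m estimates and their squared standard errors (Rubin's rules)."""
    m = len(estimates)
    beta_bar = statistics.fmean(estimates)   # pooled estimate
    w_bar = statistics.fmean(variances)      # average within-imputation variance
    b = statistics.variance(estimates)       # between-imputation variance (divisor m - 1)
    total = w_bar + b + b / m                # T = W + B + B/m
    return beta_bar, total


# Hypothetical results from analysing m = 5 imputed datasets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]

beta_bar, total_var = pool(estimates, variances)
print(round(beta_bar, 3), round(total_var, 3))  # 1.0 0.074
```

Here the average within-imputation variance is 0.044 and the between-imputation variance is 0.025, so ignoring the variation between the imputed datasets would clearly understate the uncertainty.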
--- ## Multiple Imputation <img src="index_files/figure-html/unnamed-chunk-15-1.png" width="100%" /> --- ## Multiple Imputation <img src="index_files/figure-html/unnamed-chunk-17-1.png" width="100%" /> --- ## Multiple Imputation: Pooling **Pooled Parameter Estimate:**<br> `$$\mathbf{\bar\beta} = \frac{1}{m}\sum_{\ell = 1}^m \mathbf{\hat\beta}^{(\ell)} \qquad \text{(average estimate)}$$` -- **Pooled Variance:** `$$T = \overline W + B + B/m$$` * `\(\displaystyle\overline W = \frac{1}{m}\sum_{\ell = 1}^m \mathrm{var}\left(\mathbf{\hat\beta}^{(\ell)}\right)\)` average within imputation variance * `\(\displaystyle B = \frac{1}{m - 1}\sum_{\ell = 1}^m \left(\mathbf{\hat\beta}^{(\ell)} - \mathbf{\bar\beta}\right)^2\)` between imputation variance --- ## Multiple Imputation <img src="index_files/figure-html/unnamed-chunk-18-1.png" width="100%" /> --- class: center, middle # Multivariate Missingness --- ## In Practice .flex-grid[ .col[ <div style = "text-align: center; margin-bottom: 25px;"> <strong>Multivariate<br>Missingness</strong></div> <table class="data-table"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\mathbf x_1\)</th> <th>\(\mathbf x_2\)</th> <th>\(\mathbf x_3\)</th> <th>\(\ldots\)</th> </tr> <tr><td></td><td colspan = "5"; style = "padding: 0px;"><hr /></td><tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i 
class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class = "rownr"></td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td></td> </tr> </table> ] .col[ **Most common approach:**<br> <span style = "color: var(--nord10); font-weight: bold;">MICE</span> <span style = "color: var(--nord3);">(multivariate imputation by chained equations)</span><br> <span style = "color: var(--nord10); font-weight: bold;">FCS</span> <span style = "color: var(--nord3);">(fully conditional specification)</span> <br> **Predictive distributions** <div style = "width: 700px;"> based on models </div> <div> \begin{alignat}{10} \color{var(--nord15)}{\mathbf x_1} &= \alpha_0 &+& \alpha_1 \mathbf y &+& \alpha_2 \color{var(--nord15)}{\mathbf x_2} &+& \alpha_3 \color{var(--nord15)}{\mathbf x_3} &+& \ldots \\ \color{var(--nord15)}{\mathbf x_2} &= \gamma_0 &+& \gamma_1 \mathbf y &+& \gamma_2 \color{var(--nord15)}{\mathbf x_1} &+& \gamma_3 \color{var(--nord15)}{\mathbf x_3} &+& \ldots\\ \color{var(--nord15)}{\mathbf x_3} &= \theta_0 &+& \theta_1 \mathbf y &+& \theta_2 \color{var(--nord15)}{\mathbf x_1} &+& \theta_3 \color{var(--nord15)}{\mathbf x_2} &+& \ldots \end{alignat} </div> ] ] ??? And the most common approach to imputation in this setting is MICE, short for **multivariate imputation by chained equations**, an approach that is also called **fully conditional specification**. The principle is an extension to what we've seen on the previous slides. We impute missing values using models that have all other data in their linear predictor. --- <!-- ## MICE / FCS: Imputation Model Types --> <!-- **Parametric imputation models** --> <!-- * Linear model .sgrey[(continuous, cond. 
normal variable)] --> <!-- * Logistic model .sgrey[(binary variable)] --> <!-- * Multinomial model .sgrey[(categorical variable)] --> <!-- * ... --> <!-- **Semi-parametric models** --> <!-- * Predictive Mean Matching (PMM) .sgrey[(any type of variable)] --> <!--  [⇨ <i class = "fas fa-presentation"></i> NIHES EL009](https://nerler.github.io/EP16_Multiple_Imputation/slide/04_imputation_step_ii.pdf#page=15) --> <!-- * Classification and regression trees --> <!-- * Random Forest --> <!-- * ... --> <!-- ??? --> <!-- The models for the different variables can be specified according to the type --> <!-- of variable. --> <!-- Once we have imputed each missing value, we start again with the first --> <!-- variable, but now use the imputed values of the other variables instead of the --> <!-- starting values, and we do this a few times until the algorithm has converged. --> ## MICE / FCS .pull-left[ **Iterative Algorithm:** - Start with **random draws** from the observed data.<br> ⇨ Not samples from the correct distribution! - Cycle through the models to **update the imputed values**. ⇨ Keep only last imputed value. ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-19-1.png" width="100%" /> ] ??? Because in these imputation models we now have incomplete covariates, we use an iterative algorithm. We start by randomly drawing starting values from the observed part of the data, and then we cycle through the incomplete variables and impute one at a time. --- class: center, middle # Missing Data ??? And, so, for most methods to handle missing values we can't make a general statement that will always be true. For the impact of a method there are a number of relevant aspects. --- ## Missing Values **Relevant** for the choice / impact of methods: .flex-grid[ .col[ - **How much is missing?** * per variable * per subject * complete cases ] .col[ - **How much information is available?** * sample size * relevant covariates * strength of association ] ] ??? 
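As a rough sketch, the "how much is missing" quantities listed on this slide could be computed as follows in plain Python. The small dataset is hypothetical (it only mimics the missingness pattern of the table shown earlier), and `None` is used as the missing-value marker by assumption.

```python
# Hypothetical data: None marks a missing value.
rows = [
    {"y": 2.1, "x1": 0.5,  "x2": 1.0,  "x3": None},
    {"y": 1.7, "x1": None, "x2": 0.3,  "x3": 4.2},
    {"y": 3.0, "x1": None, "x2": None, "x3": 3.9},
    {"y": 2.4, "x1": 0.1,  "x2": 0.8,  "x3": 4.0},
]
variables = ["y", "x1", "x2", "x3"]

# Proportion of missing values per variable.
miss_per_var = {
    v: sum(r[v] is None for r in rows) / len(rows) for v in variables
}

# Number of missing values per subject.
miss_per_subject = [sum(r[v] is None for v in variables) for r in rows]

# Number of complete cases.
n_complete = sum(k == 0 for k in miss_per_subject)

print(miss_per_var)      # {'y': 0.0, 'x1': 0.5, 'x2': 0.25, 'x3': 0.25}
print(miss_per_subject)  # [1, 1, 2, 0]
print(n_complete)        # 1
```

No variable has more than half of its values missing, yet only one of the four subjects is a complete case, which is why all three summaries are worth looking at.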
The first question we usually ask ourselves is how much is actually missing in the data. Here we can distinguish between the proportion or number of missing values per variable and per subject. And, as we've seen, we might also need to check what that means for the number of complete cases. But what I sometimes find even more relevant is how much information is available. Again, this concerns the number of observations per variable and per subject, but also whether there are relevant covariates that are associated with the incomplete variables, how strong these associations are, and whether these covariates are observed for the cases with missing values. - - - -- - **Where are values missing?** * response * covariates - **Why are values missing?**<br> ⇨ Missing Data Mechanism ??? We also need to distinguish between missing values in covariates and the response, and we need to think about, and make assumptions about, why the values are missing, that is, the missing data mechanism. --- ## How much information is missing / available? 
.flex-grid[ .col[ <table class="data-table"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\mathbf x_1\)</th> <th>\(\mathbf x_2\)</th> <th>\(\mathbf x_3\)</th> <th>\(\ldots\)</th> </tr> <tr><td></td><td colspan = "5"; style = "padding: 0px;"><hr /></td><tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas 
fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class = "rownr"></td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td></td> </tr> </table> ] .col[ <div style = "width: 700px;"> Imputation of \(\color{var(--nord15)}{\mathbf x_1}\) based on: \[\color{var(--nord15)}{\mathbf x_1} = \alpha_0 + \alpha_1 \mathbf y + \alpha_2 \color{var(--nord15)}{\mathbf x_2} + \alpha_3 \color{var(--nord15)}{\mathbf x_3} + \ldots\] <ul> <li> Fit model on cases with observed \(\color{var(--nord15)}{\mathbf x_1}\)</li> <li> Predict missing \( \color{var(--nord15)}{\mathbf x_1} \)</li> </ul> </div> {{content}} ] ] -- <br> <div> <p> <strong>Scenario 1:</strong>  N = 200,  90% of \(\color{var(--nord15)}{\mathbf x_1}\) is missing<br> ⇨ N = 20 to estimate \(\boldsymbol\alpha\) </p> <br> {{content}} </div> -- <div> <strong>Scenario 2:</strong>  N = 5000,  90% of \(\color{var(--nord15)}{\mathbf x_1}\) is missing<br> ⇨ N = 500 to estimate \(\boldsymbol\alpha\) </div> --- ## Relevant covariates / strength of association <div> Imputation of \(\color{var(--nord15)}{\mathbf x_1}\) based on: \[\color{var(--nord15)}{\mathbf x_1} = \beta_0 + \beta_1 \mathbf y + \beta_2 \color{var(--nord15)}{\mathbf x_2} + \beta_3 \color{var(--nord15)}{\mathbf x_3} + \ldots\] Say, \(\color{var(--nord15)}{\mathbf x_1}\) is <span style = "color: var(--nord10); font-weight: bold;">bilirubin</span>. 
</div> <br> -- <div style = "width: 800px;"> <div class = "flex-grid"> <div class = "col"> <strong>Scenario 1:</strong><br> other covariates: <ul> <li>age</li> <li>gender</li> <li>eye color</li> </ul> </div> <div class = "col"> {{content}} </div> </div> </div> -- <strong>Scenario 2:</strong><br> other covariates: <div class = "flex-grid"> <div class = "col"> <ul> <li>age</li> <li>gender</li> <li>height</li> <li>weight</li> <li>family history</li> </ul> </div> <div class = "col"> <ul> <li>comorbidities</li> <li>creatinine</li> <li>AST, ALT, ALP</li> <li>...</li> </ul> </div> </div> -- <img src="materials/rabbit1.png" id="rabbit"> --- ## Where are values missing? .pull-left[ **Imputation Model** for `\(\color{var(--nord15)}{\mathbf y}\)`: `$$\color{var(--nord15)}{\mathbf y} = \alpha_0 + \alpha_1 \color{var(--nord15)}{\mathbf x_1} + \alpha_2 \mathbf x_2 + \alpha_3 \mathbf x_3 + \varepsilon_y$$` **Analysis Model** `$$\color{var(--nord15)}{\mathbf y} = \beta_0 + \beta_1 \color{var(--nord15)}{\mathbf x_1} + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \varepsilon_y$$` ] -- .pull-right[ If analysis model `\(=\)` imputation model <br> ⇨ `\(\boldsymbol{\hat\beta} = \boldsymbol{\hat\alpha}\)`<br> ⇨ No point in imputing responses. <br> {{content}} ] -- **Auxiliary variables**:<br> ⇨ analysis model `\(\neq\)` imputation model<br> ⇨ `\(\boldsymbol{\hat\beta} \neq \boldsymbol{\hat\alpha}\)`<br> ⇨ Imputing responses can be beneficial. <!-- ## Missing Data Mechanisms --> <!-- **Missing Completely At Random (MCAR)** --> <!-- `$$\mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) = \mathrm{Pr}(\text{value missing})$$` --> <!-- .sgrey[no systematic difference between complete and incomplete cases] --> <!-- ??? --> <!-- For the missing data mechanism there is a specific terminology. --> <!-- First, we can have "missing completely at random" missing data. 
--> <!-- Missing completely at random means that the probability of a value being missing --> <!-- does not depend on anything; it is completely random and has nothing to do with --> <!-- what we are investigating in our study. --> <!-- This means that there are no systematic differences between complete and --> <!-- incomplete cases. --> <!-- - - - --> <!-- -- --> <!-- <br> --> <!-- **Missing At Random** --> <!-- `$$\mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) = \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs})$$` --> <!-- ??? --> <!-- Then, we have missing at random. --> <!-- In missing at random the assumption is that the probability of a value being --> <!-- missing depends on other things, but only on things that we have measured in --> <!-- our data and that are actually observed. --> <!-- - - - --> <!-- -- --> <!-- **Missing Not At Random** --> <!-- `$$\mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) \neq \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs})$$` --> <!-- ??? --> <!-- The last missing data mechanism is missing not at random, and here the probability --> <!-- that a value is missing does depend on things that we have not measured or --> <!-- that are missing. This can either be the missing value itself, or missing values --> <!-- in other variables, or things that we haven't measured at all. --> <!-- There is no way of testing if we are dealing with MNAR or MAR. We will always --> <!-- need to make an assumption about whether we have MNAR data. --> <!-- Sometimes you can read in clinical papers that they assumed "random missingness", --> <!-- or "that the missing values are random". I assume that they refer either to --> <!-- MCAR or MAR, but it isn't clear which one, and it can make a very important --> <!-- difference whether you have MCAR or MAR. --> <!-- --- --> <!-- ## Some Examples --> <!-- * Data is collected by questionnaire ⇨ some got lost in the mail --> <!-- ??? 
--> <!-- Let's look at a few examples to see which type of missing data mechanism we --> <!-- might have. --> <!-- Say, we have a study for which we have collected data using a questionnaire. --> <!-- Some of the questionnaires were filled in, but on the way back they got lost --> <!-- in the mail. --> <!-- Which type of missing data mechanism would that be? --> <!-- * * * * --> <!-- If this is a study in the Netherlands we could probably argue that this is --> <!-- MCAR. But if we were performing a study in various areas in, say, Africa, --> <!-- and the postal service in the rural areas is much more unreliable than in the --> <!-- cities, and there are other factors that are of interest in our study that also --> <!-- differ between rural areas and cities, we won't have MCAR any more. --> <!-- - - - --> <!-- -- --> <!-- * A particular biomarker was not part of the standard panel before 2008<br> --> <!-- ⇨ missing for many patients who entered before 2008 --> <!-- ??? --> <!-- Another example. Imagine a particular biomarker was not part of the standard --> <!-- blood panel before 2008. And so, for most of the patients who entered the --> <!-- study before 2008 this value is missing, but for people who entered later it is --> <!-- mostly observed. --> <!-- Which missing data mechanism do we have? --> <!-- If we know the year of inclusion, then we'd have MAR. --> <!-- - - - --> <!-- -- --> <!-- * In a survey among pregnant women some do not fill in the answer to "Are you currently smoking?" --> <!-- ??? --> <!-- Another example. We have a survey that we send out to pregnant women. One of the --> <!-- questions is whether they are currently smoking. What mechanism would you expect for --> <!-- the missing values in that variable? --> <!-- .... --> <!-- - - - --> <!-- -- --> <!-- * Same survey: missing values in "chocolate consumption". --> <!-- ??? --> <!-- In the same survey, we also ask about the women's daily chocolate consumption. 
--> <!-- What about the missing values in this variable? --> <!-- - - - --> <!-- -- --> <!-- <br> --> <!-- .box.bg-0.brdr-8[ --> <!-- MCAR / MAR / MNAR are NOT a property of the data but of a **model**. --> <!-- ] --> <!-- ??? --> <!-- As you see, the missing data mechanism is actually not a property of the data --> <!-- itself, but rather of the model that we use to fit the data or to impute it. --> <!-- And to make suitable assumptions you need expert knowledge on how the data --> <!-- was measured. This is not something that the statistician can determine. --> --- ## Why are values missing? Imputation of `\(\color{var(--nord15)}{\mathbf x_1}\)` based on: `$$\color{var(--nord15)}{\mathbf x_1} = \beta_0 + \beta_1 \mathbf y + \beta_2 \color{var(--nord15)}{\mathbf x_2} + \beta_3 \color{var(--nord15)}{\mathbf x_3} + \ldots$$` <ul> <li> Fit model on cases with observed \(\color{var(--nord15)}{\mathbf x_1}\)</li> <li> Predict missing \( \color{var(--nord15)}{\mathbf x_1} \)</li> </ul> -- .box.bg-0.brdr-8[ ⇨ Imputed `\(\color{var(--nord15)}{\mathbf x_1}\)` will have the same distribution as observed `\(\color{var(--nord15)}{\mathbf x_1}\)` in cases with **the same values of all other variables**. 
] **⇨ FCS MI is valid under M**issing **A**t **R**andom (**MAR**) --- ## FCS MI in Practice * valid under **MAR**<br> <span style = "color: grey; font-size: 0.9rem;"> imputation models need to contain the important predictors in the right form</span> -- * allows us to take into account * uncertainty about the missing values<br> <span style = "color: grey; font-size: 0.9rem;"> if we use enough imputed datasets </span> * uncertainty about the parameters in the imputation model<br> <span style = "color: grey; font-size: 0.9rem;"> requires Bayes or bootstrap   [⇨ NIHES EL009](https://nerler.github.io/EP16_Multiple_Imputation/slide/04_imputation_step_ii.pdf#page=5) </span> * prediction error<br> <span style = "color: grey; font-size: 0.9rem;"> requires Bayes, or PMM with appropriate settings   [⇨ NIHES EL009](https://nerler.github.io/EP16_Multiple_Imputation/slide/04_imputation_step_ii.pdf#page=22) </span> -- * Imputation models need to fit the data - no contradiction between imputation models - no contradiction between imputation models and analysis model(s) <ul class="fa-ul"> <li><span class = "fa-li" style = "color:var(--nord11);"><i class="fas fa-bolt"></i></span>multi-level data, non-linear associations, survival data</li> </ul> --- ## Multiple Imputation FAQ * How many imputed datasets do I need? -- * Should we do a complete-case analysis as a sensitivity analysis? -- * What percentage of missing values is still OK? -- * Can I impute missing values in the response? -- * Can I impute missing values in the exposure? -- * Which variables do I need to include in the imputation? -- * Why do I need to include the response in the imputation models? Won't that artificially increase the association? -- * How should I report missing data / imputation in a paper? --- class: the-end, center, middle layout: true count: false # Thank you for your attention! 
<div class="contact"> <i class="fas fa-envelope"></i> <a href="mailto:n.erler@erasmusmc.nl" class="email">n.erler@erasmusmc.nl</a>  <a href="https://twitter.com/N_Erler" target="_blank"><i class="fab fa-twitter"></i> N_Erler</a>  <a href="https://github.com/NErler" target="_blank"><i class="fab fa-github"></i> NErler</a>  <a href="https://nerler.com" target="_blank"><i class="fas fa-globe-americas"></i> https://nerler.com</a> </div> --- count: false