Imputation Magic…

<div class = "title">Imputation Magic…</div>
<div class = "subtitle">How (not) to deal with incomplete data</div>
<div class  = "author">Nicole Erler</div>
<div class = "institute">Department of Biostatistics</div>
<div class = "contact">
<i class="fas fa-envelope"></i> <a href="mailto:n.erler@erasmusmc.nl" class="email">n.erler@erasmusmc.nl</a> 
<a href="https://twitter.com/N_Erler"><i class="fab fa-twitter"></i> N_Erler</a> 
  <a href="https://github.com/NErler"><i class="fab fa-github"></i> NErler</a> 
  <a href="https://nerler.com"><i class="fas fa-globe-americas"></i> https://nerler.com</a>
</div>

---

<div class="my-footer"><span>
<a href="https://twitter.com/N_Erler"><i class="fab fa-twitter"></i> N_Erler</a>
&emsp;&emsp;&emsp;&emsp;
<a href="https://github.com/NErler"><i class="fab fa-github"></i> NErler</a> &emsp;&emsp;&emsp;&emsp;
<a href = "https://nerler.com"><i class="fas fa-globe-americas"></i> nerler.com</a>
</span></div>

---
count: false
class: center, middle

# Missing Values are a Problem!

???

Let's start right at the beginning.

When we want to analyse data in which some values are missing, we have a problem.
Why is that?

Because even a single missing value can make it impossible to get any results
at all.

---

## Example

<table class="data-table">
<tr>
<th></th>
<th>$\mathbf y$</th>
<th>$\mathbf x_1$</th>
<th>$\mathbf x_2$</th>
<th>$\mathbf x_3$</th>
</tr>
<tr><td></td><td colspan = "4" style = "padding: 0px;"><hr /></td></tr>
<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td>-0.1</td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
</tr>
<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td style="color: var(--nord15);"><i class = "fas fa-question"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
</tr>
<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td>-1.9</td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
</tr>
<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td>-0.2</td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
</tr>
<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td>-0.6</td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
</tr>
</table>

]

<br>

**What is the mean of `$\color{white}{\mathbf x_1}$`?**

]

???

As an example, imagine we have the following data with four variables, but for
now we are only interested in `$x_1$`.

We want to calculate the mean of `$x_1$`, but for one of the patients, the value
of `$x_1$` is missing.

So, how do we calculate the mean?

<br>

`$$\boldsymbol{\bar x}_1  = \frac{-0.1 + \;\color{var(--nord15)}{\boldsymbol ?} - 1.9 - 0.2 - 0.6}{5}$$`

???

We need to sum up all the values of `$x_1$` and divide by the number of observations.

The problem is that we cannot even calculate this sum, because one of the
summands is unknown.
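
To make this concrete, here is a minimal sketch in Python (the choice of language is mine, not part of the original slides). It uses the four observed values of `$x_1$` from the table and represents the missing one as `NaN`; the variable names are purely illustrative.

```python
import math

# Values of x_1 from the example table; the second one is missing (NaN).
x1 = [-0.1, float("nan"), -1.9, -0.2, -0.6]

# The naive mean propagates the missing value: any sum involving NaN is NaN.
naive_mean = sum(x1) / len(x1)

# Dropping the incomplete case gives a number, but it silently changes
# the sample that the mean refers to (4 cases instead of 5).
observed = [v for v in x1 if not math.isnan(v)]
complete_case_mean = sum(observed) / len(observed)

print(naive_mean)
print(round(complete_case_mean, 2))
```

The second number is what software typically reports when it silently drops incomplete cases.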

---

## Missing Values are a Problem!

Even with **just a single missing value** 
most (summary) statistics or parameters .red[cannot be calculated!]

<br>

]

???

With just a single value missing in our data set, we are not able to 
get results for an analysis as simple as a linear regression.

So, what is the solution to this problem?

- - -

<br><br><br>
<strong>Solution:</strong><br>
.red[Exclude] incomplete cases?

]

???

A common "solution" is to make the missing data problem "disappear" by just
excluding all patients who have one or more missing values.

You may not even be aware that you are doing this because the software does
it for you.

What many researchers also aren't aware of is that this does not actually
"avoid" the missing data problem. Like any analysis method, and any method to
deal with missing data, a complete case analysis implies certain assumptions
and has consequences.

---
class: center, middle

# Complete Case Analysis is (usually) a Bad Idea!

???

Because of those assumptions and consequences,
complete case analysis is in most cases a rather bad idea.

---

## Complete Case Analysis: Inefficient!

???

In any case, complete case analysis is inefficient because you throw away
information.

You see on the y-axis the proportion of complete cases in a data set, and on the
x-axis the number of incomplete variables. Each line represents a different
proportion of missing values per variable.

So, if we had 10% missing values in each of 25 variables, we may end up with
only 7% of the original sample size.

And if we had 10% missing values in 10 variables, we may only have 35% of our data
left over in a complete case analysis.
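
The numbers above can be reproduced with a short calculation. The sketch below assumes that values are missing independently across variables, which is my simplifying assumption to get a closed-form answer; the function name is hypothetical.

```python
def expected_complete_fraction(prop_missing: float, n_vars: int) -> float:
    """Expected fraction of complete cases when each of n_vars variables
    is missing independently with probability prop_missing (assumption)."""
    return (1.0 - prop_missing) ** n_vars


for n_vars in (10, 25):
    frac = expected_complete_fraction(0.10, n_vars)
    print(f"{n_vars} variables with 10% missing each: {frac:.0%} complete cases")
```

If missingness tends to occur in the same subjects, more complete cases remain than this calculation suggests.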

---

## Complete Case Analysis

Complete Case Analysis is 
<ul class="fa-ul">
<li><span class = "fa-li" style = "color:var(--nord11);"><i class="far fa-frown"></i></span>inefficient</li>
<li><span class = "fa-li" style = "color:var(--nord11);"><i class="far fa-frown"></i></span>usually biased</li>
</ul>

<br>

**For Example:**
<ul class="fa-ul">
<li>
<a href = "https://thestatsgeek.com/2013/07/06/when-is-complete-case-analysis-unbiased">
<span class = "fa-li"><i class="fab fa-wordpress"></i></span>
thestatsgeek.com (2013)</a>

<li>
<a href = "https://doi.org/10.1002/sim.3944">
<span class = "fa-li"><i class="fas fa-file-alt"></i></span>
White & Carlin (2010)</a>
</li>

<li>
<a href = "https://doi.org/10.1016/j.jclinepi.2006.01.015">
<span class = "fa-li"><i class="fas fa-file-alt"></i></span>
Van der Heijden et al. (2006)</a>
</li>

<li>
<a href = "https://doi.org/10.1016/j.jclinepi.2009.12.008">
<span class = "fa-li"><i class="fas fa-file-alt"></i></span>
Janssen et al. (2010)</a>
</li>
</ul>

???

In addition, complete case analysis is biased in most settings. There are a few
very specific exceptions, depending on what kind of model you use, where the
missing values are, and why they are missing.

---

# Imputation

???

So we need a better way to handle missing values, and the magic word here
is "Imputation".

---

# Imputation<br><br><br><br><br>

???

Imputation is this magic procedure where you just draw the correct
values that are missing out of a hat, right?

Unfortunately, it is not quite that easy!

To figure out how to impute missing values we first need to understand
more about them.

---

## Understanding Missing Values

* There is **uncertainty** about the missing value.

]

???

First, I think you can agree with me on this, we need to accept that there is
**uncertainty** about what the value would have been.

And so we **can't just pick** one value
and fill it in, because then we would ignore this uncertainty.

If the value of `height` is missing for one patient, we don't know what that
value would have been.

- - - -
--

* Some values are **more likely** than others.

???

Second, some values are usually going to be more likely than others.

For the missing `height`, a value somewhere around 1.70 - 1.80m is probably
more likely than values of 1.50m or 2.10m. Those values are also possible,
but they are less likely.

- - - -

**&#8680; Missing values have a distribution.**

???

So, in statistical terms, we can say that missing values have a distribution.

- - - -

* There is a relationship with **other** (available) **data**.

<br>

{{content}}
]

???

Moreover, there typically is some relationship with the rest of the data.

If we know that the missing `height` value is from a male, larger values become
more likely and smaller values less likely.

This means that we can use a model to learn how the incomplete variable is
related to the other data.

- - -

<div class = "box bg-0" style = "margin-top: 5px">
<strong>Predictive distribution</strong>
of the missing values given the observed values.

$$ p(\color{var(--nord15)}{x_{mis}}\mid\text{everything else}) $$

</div>

???

This model defines what we call the **predictive distribution**.
And this is the distribution that we need to sample imputed values from.

So, you could say, that this predictive distribution is our magic hat.

---

## A Simple Example

<table class="data-table">
<tr>
<th></th>
<th>$\mathbf y$</th>
<th>$\color{var(--nord15)}{\mathbf x_1}$</th>
<th>$\mathbf x_2$</th>
<th>$\mathbf x_3$</th>
</tr>
<tr><td></td><td colspan = "4" style = "padding: 0px;"><hr /></td></tr>
<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
</tr>
<tr>
<td class="rownr">$i$</td>
<td><i class = "fas fa-check"></i></td>
<td style="color: var(--nord15);"><i class = "fas fa-question"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
</tr>
<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
</tr>
<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
</tr>
<tr>
<td class = "rownr"></td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
</tr>
</table>

* `$\mathbf y$`: **response**
* `$\color{var(--nord15)}{\mathbf x_1}$`: **incomplete** covariate
* `$\mathbf x_2$`, `$\mathbf x_3$`: **complete**<br>covariates

]

Fit a model to the cases with observed
`$\color{var(--nord15)}{\mathbf x_1}$`:<br>
`$\color{var(--nord15)}{\mathbf x_1} = \alpha_0 + \alpha_1 \mathbf y_{-i} + \alpha_2 \mathbf x_{-i2} + \alpha_3 \mathbf x_{-i3} + \boldsymbol\varepsilon,\;\; \color{var(--nord3)}{\small\varepsilon \sim N(0, \sigma^2)}$`

&#8680; Estimate parameters `$\boldsymbol{\hat\alpha}, \mathbf{\hat\sigma}$`

<br>

]

???

Let's look at an example to get a better idea about how this works.

We have the same data as before, where we have one or more missing values
in the variable `$x_1$`.

We could fit a model to all cases where `$x_1$` is observed and use the other
variables as covariates in this model.

From this model we get parameter estimates for the regression coefficients 
`$\alpha$` and the standard deviation of the error terms, `$\sigma$`.

- - -

**Predictive distribution** of `$\color{var(--nord15)}{x_{i1}}$`:

Normal distribution with
* mean &emsp; `$\mathbf{\hat \alpha}_0 + \mathbf{\hat\alpha}_1 y_i + \mathbf{\hat{\alpha}}_2 x_{i2} + \mathbf{\hat{\alpha}}_3 x_{i3}$`
* variance &emsp; `$\mathbf{\hat\sigma}^2$`

???

This model defines the predictive distribution for the missing values.

Because we used a linear regression model, we assume that the missing value
`$x_{i1}$` is from a normal distribution with a mean equal to the linear
predictor of the regression model and a standard deviation as estimated
in the model fitted on the rest of the data.

Essentially, we just "predict" the missing value in `$x_1$` from the model 
fitted on the part of the data in which `$x_1$` is observed.
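
As a rough illustration of these two steps, here is a sketch in plain Python. To keep it short it uses only one predictor (`y`) instead of all three, and the data are made up; `fit_simple_regression` is a hypothetical helper, not a function from any package.

```python
import math


def fit_simple_regression(predictor, target):
    """Least-squares fit of target = a0 + a1 * predictor + error.
    Returns (a0, a1, sigma), where sigma is the residual standard deviation."""
    n = len(target)
    mean_p = sum(predictor) / n
    mean_t = sum(target) / n
    sxx = sum((p - mean_p) ** 2 for p in predictor)
    sxy = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictor, target))
    a1 = sxy / sxx
    a0 = mean_t - a1 * mean_p
    rss = sum((t - (a0 + a1 * p)) ** 2 for p, t in zip(predictor, target))
    sigma = math.sqrt(rss / (n - 2))
    return a0, a1, sigma


# Made-up toy data: y is complete, x1 is missing (None) for one case.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x1 = [0.9, 2.1, None, 4.2, 4.8, 6.0]

# Step 1: fit the model on the cases in which x1 is observed.
obs = [(yi, xi) for yi, xi in zip(y, x1) if xi is not None]
a0, a1, sigma = fit_simple_regression([o[0] for o in obs], [o[1] for o in obs])

# Step 2: the predictive distribution for the incomplete case is
# Normal(pred_mean, sigma^2), using that case's observed y.
y_missing_case = y[x1.index(None)]
pred_mean = a0 + a1 * y_missing_case
print(f"predictive mean = {pred_mean:.2f}, sd = {sigma:.2f}")
```

Note that this only gives the mean and standard deviation of the predictive distribution; it does not yet draw a value from it.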

---

## Imputation of Missing Values

<div id="mathformula">
$\mathbf{\hat\alpha}_0 + \mathbf{\hat\alpha}_1 \mathbf y + 
 \mathbf{\hat{\alpha}}_2 \mathbf x_2 + \mathbf{\hat{\alpha}}_3 \mathbf x_3$
</div>

???

We can visualize this for the case with only one other variable in the 
imputation model.

On the x-axis we have one of the complete variables and on the y-axis we have
the incomplete variable `$x_1$` that we are imputing.

In practice we will of course have more variables, but then I couldn't show it
in a simple plot any more. So this is really just to get the idea.

We know the value of the other variable, so we know where our incomplete cases
are on the x-axis, but we don't know where to place them on the y-axis. 
Therefore I only marked them as empty circles on the x-axis here.

When we now fit a model on the observed cases we can represent that as the
corresponding regression line.

- - -

---
count: false
name: predval_reg

## Imputation of Missing Values

<div id="mathformula">
$\mathbf{\hat\alpha}_0 + \mathbf{\hat\alpha}_1 \mathbf y + 
 \mathbf{\hat{\alpha}}_2 \mathbf x_2 + \mathbf{\hat{\alpha}}_3 \mathbf x_3$
</div>

???

[jump to regression imputation](#regimp)

When we then use this model to predict the missing values, we calculate
where on the regression line the missing values would be.

The regression line is the linear predictor from the model on the previous slide.

Could we now just take those values as our imputed values?

---

## Imputation of Missing Values

.box.bg-0[
**Important:** We need to take into account the **uncertainty**!
]

???

Not quite. We can't just use the fitted value to impute the missing value
because there is uncertainty that we haven't taken into account.

- - - - -

]

???

There is uncertainty about the imputed values.

Missing values have a distribution and we need to sample from this distribution.
The regression line is the mean of this distribution, but we are not doing any
random sampling if we take the mean.

The observed data also is not exactly on this regression line, but spread around
it. So we'd expect the same for the missing values.

- - -

<img src="index_files/figure-html/imp reglines multi-1.png" width="100%" />
]

???

In addition, there is uncertainty about the parameter estimates in the imputation
model, the `$\hat\alpha$`.

Because our data is just a sample, we don't know the true parameters.
With a different sample, we'd get a slightly different regression line.

---

## Imputation of Missing Values

**We want:**<br>
Imputation from the **predictive distribution**
`$p(\color{var(--nord15)}{x_{mis}} \mid \text{everything else})$`.

<br>

**Idea:**<br>
Use a "prediction" model.

<br>

**Take into account:**
* **uncertainty in parameter** estimates `$\boldsymbol{\hat\alpha}$`
* **prediction error** `$(\color{var(--nord15)}{\mathbf{\hat x}_{mis}} \neq \color{var(--nord15)}{\mathbf x_{mis}})$`
* A missing value has a **distribution** &#8680; we can't just replace it with **one** value.

???

So, in summary, what have we seen so far?

We want to impute missing values from the predictive distribution of the missing
value given everything else.

We could do that via some sort of prediction to make use of the relationships
between variables.

But we need to take into account that we have multiple sources of uncertainty or
variation: 
- uncertainty about the **parameters** in the imputation model
- **random variation** of the unknown values (also called **prediction error**)
- and we need to take into account that there is **uncertainty about the missing
value**, so that we can't represent a missing value by one single imputed value
because that would not capture that uncertainty (the additional uncertainty that
we have compared to an observed value)
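
A sketch of how these three points could be combined, again in plain Python with one predictor and made-up data. Here the parameter uncertainty is reflected by refitting the model on a bootstrap sample; that is one possible choice (a Bayesian draw of the parameters would be another), and all function names are hypothetical.

```python
import math
import random


def fit_line(predictor, target):
    """Least-squares fit; returns (intercept, slope, residual sd)."""
    n = len(target)
    mean_p = sum(predictor) / n
    mean_t = sum(target) / n
    sxx = sum((p - mean_p) ** 2 for p in predictor)
    sxy = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictor, target))
    slope = sxy / sxx
    intercept = mean_t - slope * mean_p
    rss = sum((t - intercept - slope * p) ** 2 for p, t in zip(predictor, target))
    return intercept, slope, math.sqrt(rss / (n - 2))


def draw_imputation(pred_obs, target_obs, pred_new, rng):
    """One random draw for a missing target value.
    1. Refit on a bootstrap sample   -> uncertainty in the parameters.
    2. Add normal noise with sd sigma -> prediction error."""
    n = len(target_obs)
    while True:
        idx = [rng.randrange(n) for _ in range(n)]
        p = [pred_obs[i] for i in idx]
        if len(set(p)) > 1:  # a line needs two distinct predictor values
            break
    t = [target_obs[i] for i in idx]
    intercept, slope, sigma = fit_line(p, t)
    return rng.gauss(intercept + slope * pred_new, sigma)


# Made-up observed cases and the predictor value of the incomplete case.
y_obs = [1.0, 2.0, 4.0, 5.0, 6.0]
x1_obs = [0.9, 2.1, 4.2, 4.8, 6.0]
rng = random.Random(2024)

# 3. Several draws instead of one -> uncertainty about the missing value.
imputations = [draw_imputation(y_obs, x1_obs, 3.0, rng) for _ in range(5)]
print([round(v, 2) for v in imputations])
```

Each of the five draws would go into a separate completed dataset.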

---
class: center, middle

# Naive Ways to Handle Missing Data

???

So with this in mind, let's have a look at some naive methods to handle
missing data that are unfortunately still in use.

---

## Naive Ways to Handle Missing Data

???

On the next few slides, I'll visualize some of these naive methods for 
imputation.

I use this plot with the incomplete covariate `$x_1$` on the x-axis and the response
`$y$` on the y-axis, so a plot that represents our analysis of interest.

All of the white dots represent patients for whom we have both `$x$` and `$y$` observed
and the empty red-ish dots are the cases for whom the value of the covariate
`$x$` is missing.

---

## Mean Imputation

???

First, we have **mean imputation**, where all missing values are replaced by the
mean of the observed values of `$x_1$`.

You can clearly see that the imputed values are not a good representation of the
distribution of the true but missing values. They don't vary enough and this 
method will usually result in bias.
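
The loss of variability is easy to check numerically. The sketch below takes the four observed `$x_1$` values from the first example and assumes, hypothetically, that three further values are missing.

```python
import statistics

observed = [-0.1, -1.9, -0.2, -0.6]  # observed x_1 values from the example
n_missing = 3                         # hypothetical number of missing values

# Mean imputation: every missing value is replaced by the observed mean.
mean_obs = statistics.fmean(observed)
imputed = observed + [mean_obs] * n_missing

var_before = statistics.variance(observed)
var_after = statistics.variance(imputed)
print(f"variance of observed values:     {var_before:.3f}")
print(f"variance after mean imputation:  {var_after:.3f}")
```

The mean stays the same, but the sample variance shrinks because the imputed values add cases without adding any spread.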

---

## Missing Indicator Method

???

Then, we have the missing indicator method.

The idea here is to replace the missing values with a fixed value, for example
zero. And, to distinguish the incomplete cases from the complete cases we
additionally add an indicator variable that is zero for observed cases and one
for incomplete cases.

As with mean imputation, we see that the imputed values do not at all represent 
the spread of the missing values. If we fit a model to this imputed data we will
again get biased results.

---

## Regression Imputation
<img src="index_files/figure-html/unnamed-chunk-8-1.png" width="100%" />

???

[jump to regline with predicted values](#predval_reg)

Next, we have regression imputation, where we impute based on a model like
the one I showed you a couple of slides ago, but we use the values on the
regression line and ignore the random variation.

The imputed values are a bit more in the range of the original data, but
you still see that we underestimate the variability and thereby the uncertainty
about the results.

---

## Single Imputation

???

In single imputation we now improve upon the regression imputation by taking
into account both the uncertainty about the parameters in the imputation model
and the random variation.

And we can see that the imputed values have a distribution that is much more
similar to the distribution of the missing values.

The imputed values don't need to be identical to the original values, but 
they need to be from the correct distribution.

---

## Single Imputation

Can take into account
* **uncertainty in parameter** estimates `$\boldsymbol{\hat\alpha}$`
* **prediction error** `$(\color{var(--nord15)}{\mathbf{\hat x}_{mis}} \neq \color{var(--nord15)}{\mathbf x_{mis}})$`

]

.pull-right[
Single imputation does not take into account the **uncertainty about the imputed
value**!

]

???

With single imputation we can take into account two of the sources of 
uncertainty or variation, but we only have one imputed value.

We have no way of taking into account the added uncertainty that we have 
about the imputed value compared to an observed value, when we just have one
single value.

---
class: center, middle

# Multiple Imputation

???

And this is why Donald Rubin came up with the idea of multiple imputation.

---

## Multiple Imputation

MI was developed in the 1970s...

<br>

**Requirements**
* computationally feasible
* "fix" the missing data problem once / centrally<br>
  &#8680; distribute imputed data to other researchers

---

## Multiple Imputation

???

The idea behind multiple imputation is that, using this principle,
we sample imputed values and fill them into the original, incomplete data to 
create a completed dataset.

And in order to take into account the uncertainty that we have about the missing
values, we do this multiple times, so that we obtain multiple completed datasets.

Because all the missing values have now been filled in, we can analyse each of
these datasets separately with standard statistical techniques.

To obtain overall results, the results from each of these analyses need to be
combined in a way that takes into account both the uncertainty that we have
about the estimates from each analysis, and the variation between these estimates.

---

## Multiple Imputation

---

## Multiple Imputation

---

## Multiple Imputation: Pooling

**Pooled Parameter Estimate:**<br>
`$$\mathbf{\bar\beta} = \frac{1}{m}\sum_{\ell = 1}^m \mathbf{\hat\beta}^{(\ell)} \qquad
\text{(average estimate)}$$`

**Pooled Variance:**
`$$T = \overline W + B + B/m$$`
* `$\displaystyle\overline W = \frac{1}{m}\sum_{\ell = 1}^m \mathrm{var}\left(\mathbf{\hat\beta}^{(\ell)}\right)$`
  &nbsp;&nbsp;&nbsp;average within imputation variance

* `$\displaystyle B = \frac{1}{m - 1}\sum_{\ell = 1}^m \left(\mathbf{\hat\beta}^{(\ell)} - \mathbf{\bar\beta}\right)^2$`
  &nbsp;&nbsp;&nbsp;between imputation variance
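
For a single (scalar) parameter, the pooling formulas above can be written as a small function. This is a sketch in Python; `pool_rubin` is a hypothetical name and the input numbers are made up.

```python
def pool_rubin(estimates, variances):
    """Pool m estimates of one scalar parameter and their variances.
    Returns (pooled estimate, total variance T = W + B + B/m)."""
    m = len(estimates)
    pooled = sum(estimates) / m                # average estimate
    within = sum(variances) / m                # W: average within-imputation variance
    between = sum((e - pooled) ** 2 for e in estimates) / (m - 1)  # B
    total = within + between + between / m     # T
    return pooled, total


# Made-up results from m = 3 imputed datasets.
pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, round(total, 4))
```

The pooled standard error is the square root of `total`.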

---

## Multiple Imputation

---
class: center, middle

# Multivariate Missingness

---

## In Practice

<div style = "text-align: center; margin-bottom: 25px;">
<strong>Multivariate<br>Missingness</strong></div>

<table class="data-table">
<tr>
<th></th>
<th>$\mathbf y$</th>
<th>$\mathbf x_1$</th>
<th>$\mathbf x_2$</th>
<th>$\mathbf x_3$</th>
<th>$\ldots$</th>
</tr>
<tr><td></td><td colspan = "5" style = "padding: 0px;"><hr /></td></tr>
<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<td style="color: var(--nord15);"><i class = "fas fa-question"></i></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr">$i$</td>
<td><i class = "fas fa-check"></i></td>
<td style="color: var(--nord15);"><i class = "fas fa-question"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td style="color: var(--nord15);"><i class = "fas fa-question"></i></td>
<td style="color: var(--nord15);"><i class = "fas fa-question"></i></td>
<td><i class = "fas fa-check"></i></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class = "rownr"></td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td></td>
</tr>
</table>

]

**Most common approach:**<br>
<span style = "color: var(--nord10); font-weight: bold;">MICE</span> 
  <span style = "color: var(--nord3);">(multivariate imputation by chained equations)</span><br>
  <span style = "color: var(--nord10); font-weight: bold;">FCS</span> 
  <span style = "color: var(--nord3);">(fully conditional specification)</span>

<br>

**Predictive distributions**

<div style = "width: 700px;">
based on models
</div>

<div>
\begin{alignat}{10}
\color{var(--nord15)}{\mathbf x_1} &= \alpha_0 &+& \alpha_1 \mathbf y &+&
\alpha_2 \color{var(--nord15)}{\mathbf x_2} &+& \alpha_3 \color{var(--nord15)}{\mathbf x_3} &+& \ldots \\
\color{var(--nord15)}{\mathbf x_2} &= \gamma_0 &+& \gamma_1 \mathbf y &+&
\gamma_2 \color{var(--nord15)}{\mathbf x_1} &+& \gamma_3 \color{var(--nord15)}{\mathbf x_3} &+& \ldots\\
\color{var(--nord15)}{\mathbf x_3} &= \theta_0 &+& \theta_1 \mathbf y &+&
\theta_2 \color{var(--nord15)}{\mathbf x_1} &+& \theta_3 \color{var(--nord15)}{\mathbf x_2} &+& \ldots
\end{alignat} 
</div>

]
]

???

And the most common approach to imputation in this setting is MICE, short for
**multivariate imputation by chained equations**, an approach that is also
called **fully conditional specification**.

The principle is an extension of what we've seen on the previous slides.
We impute missing values using models that have all other data in their linear
predictor.

---

## MICE / FCS

**Iterative Algorithm:**
- Start with **random draws** from the observed data.<br>
  &#8680; Not samples from the correct distribution!

- Cycle through the models to **update the imputed values**.

&#8680; Keep only last imputed value.
]

<img src="index_files/figure-html/unnamed-chunk-19-1.png" width="100%" />
]

???
Because in these imputation models we now have incomplete covariates, we use an
iterative algorithm. We start by randomly drawing starting values from the 
observed part of the data, and then we cycle through the
incomplete variables and impute one at a time.
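
The following is a deliberately stripped-down sketch of such a cycle in Python, for two incomplete variables and made-up data. It is only meant to show the flow of the algorithm: for brevity it ignores the uncertainty about the parameters, and established software does considerably more. All names are hypothetical.

```python
import math
import random


def fit_line(predictor, target):
    """Least-squares fit; returns (intercept, slope, residual sd)."""
    n = len(target)
    mean_p = sum(predictor) / n
    mean_t = sum(target) / n
    sxx = sum((p - mean_p) ** 2 for p in predictor)
    sxy = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictor, target))
    slope = sxy / sxx
    intercept = mean_t - slope * mean_p
    rss = sum((t - intercept - slope * p) ** 2 for p, t in zip(predictor, target))
    return intercept, slope, math.sqrt(rss / (n - 2))


def mice_sketch(a, b, n_iter=10, seed=1):
    """Chained-equations sketch for two variables with missing values (None)."""
    rng = random.Random(seed)
    data = {"a": list(a), "b": list(b)}
    missing = {k: [i for i, v in enumerate(vals) if v is None]
               for k, vals in data.items()}

    # Start with random draws from the observed values of each variable.
    for k, vals in data.items():
        observed = [v for v in vals if v is not None]
        for i in missing[k]:
            vals[i] = rng.choice(observed)

    # Cycle through the models and update the imputed values.
    for _ in range(n_iter):
        for target, other in (("a", "b"), ("b", "a")):
            rows = [i for i in range(len(a)) if i not in missing[target]]
            intercept, slope, sigma = fit_line(
                [data[other][i] for i in rows], [data[target][i] for i in rows]
            )
            for i in missing[target]:
                mean = intercept + slope * data[other][i]
                data[target][i] = rng.gauss(mean, sigma)

    # Only the values from the last iteration are kept.
    return data["a"], data["b"]


a = [1.0, 2.1, None, 3.9, 5.2, None, 7.1, 8.0]
b = [2.0, None, 6.1, 8.2, 9.9, 12.1, None, 16.0]
a_imp, b_imp = mice_sketch(a, b)
print([round(v, 1) for v in a_imp])
print([round(v, 1) for v in b_imp])
```

Running this with several different seeds would give several completed datasets.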

---

# Missing Data

???

And, so, for most methods to handle missing values we can't make a general
statement that will always be true.

For the impact of a method there are a number of relevant aspects.

---

## Missing Values

**Relevant** for the choice / impact of methods:

.flex-grid[
.col[
- **How much is missing?**
  * per variable
  * per subject
  * complete cases
]
.col[
- **How much information is available?**
  * sample size
  * relevant covariates
  * strength of association
]
]
  
???

The first question we usually ask ourselves is: how much is actually
missing in the data? Here we can distinguish between the proportion or number
of missing values per variable and per subject.

And, as we've seen, we might also need to check what that means for the 
number of complete cases.

But what I sometimes find even more relevant is how much information is 
available. Again, this concerns the number of observations per variable and
per subject. It also concerns whether there are relevant covariates that are
associated with the incomplete variables, how strong these associations are,
and whether those covariates are observed for the cases that have missing
values in the other variables.

- - -

--
  
- **Where are values missing?**
  * response
  * covariates
  
- **Why are values missing?**<br>
  &#8680; Missing Data Mechanism
  
???

We also need to distinguish between missing values in covariates and the
response, and we need to think about, and make assumptions about why the values
are missing, meaning, the missing data mechanism.

---

## How much information is missing / available?

<table class="data-table">
<tr>
<th></th>
<th>$\mathbf y$</th>
<th>$\mathbf x_1$</th>
<th>$\mathbf x_2$</th>
<th>$\mathbf x_3$</th>
<th>$\ldots$</th>
</tr>
<tr><td></td><td colspan = "5" style = "padding: 0px;"><hr /></td></tr>
<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<td style="color: var(--nord15);"><i class = "fas fa-question"></i></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr">$i$</td>
<td><i class = "fas fa-check"></i></td>
<td style="color: var(--nord15);"><i class = "fas fa-question"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td style="color: var(--nord15);"><i class = "fas fa-question"></i></td>
<td style="color: var(--nord15);"><i class = "fas fa-question"></i></td>
<td><i class = "fas fa-check"></i></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td style="color: var(--nord15);"><i class = "fas fa-question"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<th>$\ldots$</th>
</tr>

<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<td style="color: var(--nord15);"><i class = "fas fa-question"></i></td>
<td><i class = "fas fa-check"></i></td>
<th>$\ldots$</th>
</tr>

<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<th>$\ldots$</th>
</tr>

<tr>
<td class="rownr"></td>
<td><i class = "fas fa-check"></i></td>
<td style="color: var(--nord15);"><i class = "fas fa-question"></i></td>
<td><i class = "fas fa-check"></i></td>
<td><i class = "fas fa-check"></i></td>
<th>$\ldots$</th>
</tr>

<tr>
<td class = "rownr"></td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td></td>
</tr>
</table>

]

<div style = "width: 700px;">
Imputation of $\color{var(--nord15)}{\mathbf x_1}$ based on:

\[\color{var(--nord15)}{\mathbf x_1} = \alpha_0 + \alpha_1 \mathbf y +
\alpha_2 \color{var(--nord15)}{\mathbf x_2} + \alpha_3 \color{var(--nord15)}{\mathbf x_3} + \ldots\]

<ul>
<li> Fit model on cases with observed $\color{var(--nord15)}{\mathbf x_1}$</li>
<li> Predict missing $ \color{var(--nord15)}{\mathbf x_1} $</li>
</ul>

</div>

]
]

<p>
<strong>Scenario 1:</strong>&emsp;
N = 200,&emsp; 90% of $\color{var(--nord15)}{\mathbf x_1}$ is missing<br>
&#8680; N = 20 to estimate $\boldsymbol\alpha$
</p>
<br>

{{content}}
</div>

<div>
<strong>Scenario 2:</strong>&emsp;
N = 5000,&emsp; 90% of $\color{var(--nord15)}{\mathbf x_1}$ is missing<br>
&#8680; N = 500 to estimate $\boldsymbol\alpha$
</div>

---

## Relevant covariates / strength of association

<div>

Imputation of $\color{var(--nord15)}{\mathbf x_1}$ based on:

\[\color{var(--nord15)}{\mathbf x_1} = \alpha_0 + \alpha_1 \mathbf y +
\alpha_2 \color{var(--nord15)}{\mathbf x_2} + \alpha_3 \color{var(--nord15)}{\mathbf x_3} + \ldots\]

Say, $\color{var(--nord15)}{\mathbf x_1}$ is <span style = "color: var(--nord10); font-weight: bold;">bilirubin</span>.

</div>

<br>

<strong>Scenario 1:</strong><br>

other covariates:
<ul>
<li>age</li>
<li>gender</li>
<li>eye color</li>
</ul>
</div>

<div class = "col">
{{content}}
</div>
</div>
</div>

<strong>Scenario 2:</strong><br>
other covariates:

<div class = "flex-grid">
<div class = "col">
<ul>
<li>age</li>
<li>gender</li>
<li>height</li>
<li>weight</li>
<li>family history</li>
</ul>
</div>

<div class = "col">
<ul>
<li>comorbidities</li>
<li>creatinine</li>
<li>AST, ALT, ALP</li>
<li>...</li>
</ul>
</div>
</div>

---

## Where are values missing?

**Imputation Model** for `$\color{var(--nord15)}{\mathbf y}$`:
`$$\color{var(--nord15)}{\mathbf y} = \alpha_0 + \alpha_1 \color{var(--nord15)}{\mathbf x_1} + \alpha_2 \mathbf x_2 + \alpha_3 \mathbf x_3 + \varepsilon_y$$`

**Analysis Model**
`$$\color{var(--nord15)}{\mathbf y} = \beta_0 + \beta_1 \color{var(--nord15)}{\mathbf x_1} + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \varepsilon_y$$`

]

.pull-right[
If analysis model `$=$` imputation model <br>
&#8680; `$\boldsymbol{\hat\beta} = \boldsymbol{\hat\alpha}$`<br>
&#8680; No point in imputing responses.

<br>

]

**Auxiliary variables**:<br>
&#8680; analysis model `$\neq$` imputation model<br>
&#8680; `$\boldsymbol{\hat\beta} \neq \boldsymbol{\hat\alpha}$`<br>
&#8680; Imputing responses can be beneficial.

---

## Why are values missing?

Imputation of `$\color{var(--nord15)}{\mathbf x_1}$` based on:

`$$\color{var(--nord15)}{\mathbf x_1} = \alpha_0 + \alpha_1 \mathbf y +
\alpha_2 \color{var(--nord15)}{\mathbf x_2} + \alpha_3 \color{var(--nord15)}{\mathbf x_3} + \ldots$$`

<ul>
<li> Fit model on cases with observed $\color{var(--nord15)}{\mathbf x_1}$</li>
<li> Predict missing $ \color{var(--nord15)}{\mathbf x_1} $</li>
</ul>

.box.bg-0.brdr-8[
&#8680; Imputed `$\color{var(--nord15)}{\mathbf x_1}$` will have the same
distribution as observed `$\color{var(--nord15)}{\mathbf x_1}$` with **the same
values of all other variables**.
]

**&#8680; FCS MI is valid under M**issing **A**t **R**andom (**MAR**)

---

## FCS MI in Practice

* valid under **MAR**<br>
  <span style = "color: grey; font-size: 0.9rem;">
  imputation models need to contain the important predictors in the right
  form</span>
  
--

* allows us to take into account
    * uncertainty about missing value<br>
        <span style = "color: grey; font-size: 0.9rem;">
        if we use enough imputed datasets
        </span>
    * uncertainty about parameters in imputation model<br>
        <span style = "color: grey; font-size: 0.9rem;">
        requires Bayes or Bootstrap &emsp; [&#8680; NIHES EL009](https://nerler.github.io/EP16_Multiple_Imputation/slide/04_imputation_step_ii.pdf#page=5)
        </span>
    * prediction error<br>
      <span style = "color: grey; font-size: 0.9rem;">
      requires Bayes, or PMM with appropriate settings &emsp; [&#8680; NIHES EL009](https://nerler.github.io/EP16_Multiple_Imputation/slide/04_imputation_step_ii.pdf#page=22)
      </span>

* Imputation models need to fit the data
  - no contradiction between imputation models
  - no contradiction between imputation models and analysis model(s)
<ul class="fa-ul">
<li><span class = "fa-li" style = "color:var(--nord11);"><i class="fas fa-bolt"></i></span>multi-level data, non-linear associations, survival data</li>
</ul>

---

## Multiple Imputation FAQ

* How many imputed datasets do I need?

* Should we do a complete case analysis as sensitivity analysis?

* What % missing values is still ok?

* Can I impute missing values in the response?

* Can I impute missing values in the exposure?

* Which variables do I need to include in the imputation?

* Why do I need to include the response in the imputation models? Won't that
  artificially increase the association?

* How should I report missing data / imputation in a paper?

---

# Thank you for your attention!

<div class="contact">
<i class="fas fa-envelope"></i> <a href="mailto:n.erler@erasmusmc.nl" class="email">n.erler@erasmusmc.nl</a>&emsp;
<a href="https://twitter.com/N_Erler" target="_blank"><i class="fab fa-twitter"></i> N_Erler</a>&emsp;
  <a href="https://github.com/NErler" target="_blank"><i class="fab fa-github"></i> NErler</a>&emsp;
  <a href="https://nerler.com" target="_blank"><i class="fas fa-globe-americas"></i> https://nerler.com</a>
</div>

---