Working with Missing Data and Imputation

<div class = "title">Working with Missing Data and Imputation</div>
<div class = "author">Nicole Erler</div>
<div class = "institute">Department of Biostatistics</div>
<div class = "contact">
 <a href="mailto:n.erler@erasmusmc.nl" class="email">n.erler@erasmusmc.nl</a> 
<a href="https://twitter.com/N_Erler"> N_Erler</a> 
 <a href="https://github.com/NErler"> NErler</a> 
 <a href="https://nerler.com"> https://nerler.com</a>
</div>

---

<div class="my-footer">
<a href="https://twitter.com/N_Erler"> N_Erler</a>
&emsp;&emsp;&emsp;&emsp;
<a href="https://github.com/NErler"> NErler</a> &emsp;&emsp;&emsp;&emsp;
<a href = "https://nerler.com"> nerler.com</a>
</div>

---

## Outline / Topics

* Missing Values are a Problem
* General Considerations & Missing Data Mechanisms

* Naive Imputation Approaches

* Multiple Imputation
  * General Concept
  * Multivariate Missingness
  * General Considerations
  
  
* Issues with imputation
  * in multi-level data
  * with non-linear associations / survival data

---
count: false
class: center, middle

# Missing Values are a Problem!

???

I'm going to start right at the beginning, and want to demonstrate why 
missing values are a problem.

- researchers who thought it was possible to use cases if there was only one
  single value missing
- SPSS options that make it seem as if this is possible

This is a bit theoretical, with lots of math, but don't worry, the math is more
for visualization, and the presentation won't be all formulas.

I'll then talk a bit in general about missing data, look at some naive missing
data methods, and then we'll take a look at multiple imputation.

---

## Example: Linear Regression

**Linear Regression Model:**

`\begin{eqnarray*}
\mathbf y &=& \beta_0 + \beta_1 \mathbf x_1 + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \boldsymbol\varepsilon\\
&=& \mathbf X \boldsymbol\beta + \boldsymbol\varepsilon
\end{eqnarray*}`

???

We use linear regression as an example, because there, we can calculate the
solution for the regression coefficients by hand with a formula, and,
theoretically, wouldn't need a computer to fit the model.

A linear regression model is written as a response `$y$` with covariates `$x$`, and
some regression coefficients `$\beta$`, and we have the error terms, `$\varepsilon$`.

We can also write this model in matrix notation, ...

- - -

with

`$$\mathbf y = \begin{pmatrix}
y_1\\
y_2\\
y_3\\
y_4\\
y_5
\end{pmatrix} \qquad
\mathbf X = \begin{pmatrix}
1 & x_{11} & x_{12} & x_{13}\\
1 & \color{var(--nord15)}{?} & x_{22} & x_{23}\\
1 & x_{31} & x_{32} & x_{33}\\
1 & x_{41} & x_{42} & x_{43}\\
1 & x_{51} & x_{52} & x_{53}
\end{pmatrix} \qquad
\boldsymbol\beta = \begin{pmatrix}
\beta_0\\
\beta_1\\
\beta_2\\
\beta_3
\end{pmatrix}$$`

???

... and then we have `$y$` as a vector, here, as an example for 5 subjects,
`$X$` is the design matrix, which contains the different covariates in the columns
and has a column of 1s for the intercept, and the value for `$x_1$` for the second
subject is missing.
The regression coefficients `$\beta$` are also a vector.

---

## Example: Linear Regression

**The Least Squares Estimator**

`$$\hat{\boldsymbol\beta} = (\mathbf X^\top\mathbf X)^{-1} \mathbf X^\top \mathbf y$$`

???

The regression coefficients in the linear model are usually estimated using 
the least squares estimator, and this estimator has a simple formula that
depends only on the design matrix `$X$` and the response `$y$`.

We'll now go through this formula in steps to see how the calculation is impacted
by the one missing value in `$X$`.

- - -

???

We start with the product of `$X^\top$` and `$X$`.

`$X^\top$` is the design matrix, but with rows and columns swapped, so that each
row is one variable, and each column is one subject.

And we need to multiply these two matrices.

---

## Example: Linear Regression

`$$\mathbf X^\top \mathbf X = \begin{pmatrix}
1      & 1 & 1      & 1      & 1 \\
x_{11} & \color{var(--nord15)}{?} & x_{31} & x_{41} & x_{51}\\
x_{12} & x_{22} & x_{32} & x_{42} & x_{52}\\
x_{13} & x_{23} & x_{33} & x_{43} & x_{53}
\end{pmatrix} \begin{pmatrix}
1 & x_{11} & x_{12} & x_{13}\\
1 & \color{var(--nord15)}{?} & x_{22} & x_{23}\\
1 & x_{31} & x_{32} & x_{33}\\
1 & x_{41} & x_{42} & x_{43}\\
1 & x_{51} & x_{52} & x_{53}
\end{pmatrix} = \begin{pmatrix}
 \cdot & \cdot & \cdot & \cdot\\
 \cdot & \cdot & \cdot & \cdot\\
 \cdot & \cdot & \cdot & \cdot\\
 \cdot & \cdot & \cdot & \cdot
\end{pmatrix}$$`

???

How does matrix multiplication work?

We always multiply one row from the first matrix with a column from the second
matrix, and take the sum over all the products from these two vectors.

The result from the first row and first column will then be the top left element
in the result matrix.

And because here we have the intercept multiplied with itself, we have the sum
over the product of 1s, which is 5 in this case, because we have 5 subjects.

---

## Example: Linear Regression

`$$\mathbf X^\top \mathbf X = \begin{pmatrix}
1      & 1 & 1      & 1      & 1 \\
x_{11} & \color{var(--nord15)}{?} & x_{31} & x_{41} & x_{51}\\
x_{12} & x_{22} & x_{32} & x_{42} & x_{52}\\
x_{13} & x_{23} & x_{33} & x_{43} & x_{53}
\end{pmatrix} \begin{pmatrix}
1 & x_{11} & x_{12} & x_{13}\\
1 & \color{var(--nord15)}{?} & x_{22} & x_{23}\\
1 & x_{31} & x_{32} & x_{33}\\
1 & x_{41} & x_{42} & x_{43}\\
1 & x_{51} & x_{52} & x_{53}
\end{pmatrix} = \begin{pmatrix}
 \cdot & \color{var(--nord15)}{?} & \cdot & \cdot\\
 \cdot & \cdot & \cdot & \cdot\\
 \cdot & \cdot & \cdot & \cdot\\
 \cdot & \cdot & \cdot & \cdot
\end{pmatrix}$$`

???

Then we move on to the second column, and here we multiply again each element with
one, so, one times `$x_{11}$`, one times the missing value, and so on.

And then we need to sum up all the products, but because one of the summands is
unknown, the sum will also be unknown.
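
A minimal sketch in Python of how one unknown summand spreads through `$X^\top X$`. The numbers in the design matrix are made up, and `nan` stands in for the missing value; only the position of the missing entry matters.

```python
import math

nan = float("nan")  # stands in for the missing value "?"

# Design matrix for 5 subjects: intercept column plus three covariates.
# The values are invented for illustration.
X = [
    [1.0, 2.0, 0.5, 1.0],
    [1.0, nan, 1.5, 0.0],  # x_1 is missing for the second subject
    [1.0, 3.0, 2.5, 1.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 4.0, 1.0, 1.0],
]


def cross_product(X):
    """Return X^T X: entry (j, k) is the sum over subjects of x_ij * x_ik."""
    p = len(X[0])
    return [[sum(row[j] * row[k] for row in X) for k in range(p)]
            for j in range(p)]


XtX = cross_product(X)

# Mark every entry of X^T X that can no longer be computed.
unknown = [[math.isnan(v) for v in row] for row in XtX]
for row in unknown:
    print(" ".join("?" if u else "." for u in row))
```

Every sum that involves the column of `$x_1$` contains the missing value, so the whole second row and second column of the result are unknown, while the top left element is still 5.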

---

## Example: Linear Regression

`$$\mathbf X^\top \mathbf X = \begin{pmatrix}
1      & 1 & 1      & 1      & 1 \\
x_{11} & \color{var(--nord15)}{?} & x_{31} & x_{41} & x_{51}\\
x_{12} & x_{22} & x_{32} & x_{42} & x_{52}\\
x_{13} & x_{23} & x_{33} & x_{43} & x_{53}
\end{pmatrix} \begin{pmatrix}
1 & x_{11} & x_{12} & x_{13}\\
1 & \color{var(--nord15)}{?} & x_{22} & x_{23}\\
1 & x_{31} & x_{32} & x_{33}\\
1 & x_{41} & x_{42} & x_{43}\\
1 & x_{51} & x_{52} & x_{53}
\end{pmatrix} = \begin{pmatrix}
 \cdot & \color{var(--nord15)}{?} & \cdot &  \cdot\\
 \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\
 \cdot & \color{var(--nord15)}{?} & \cdot & \cdot\\
 \cdot & \color{var(--nord15)}{?} & \cdot & \cdot
\end{pmatrix}$$`

<div class = "small">
$
\color{grey}{\hat{\boldsymbol\beta} = \color{silver}{(\mathbf X^\top\mathbf X)^{-1}} \mathbf X^\top \mathbf y}
$
</div>

`$$(\mathbf X^\top \mathbf X)^{-1} = \begin{pmatrix}
 \cdot & \color{var(--nord15)}{?} & \cdot &  \cdot\\
 \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\
 \cdot & \color{var(--nord15)}{?} & \cdot & \cdot\\
 \cdot & \color{var(--nord15)}{?} & \cdot & \cdot
\end{pmatrix}^{-1} = \begin{pmatrix}
 \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\
 \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\
 \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\
 \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}
\end{pmatrix}$$`

???

But in the formula for the least squares estimator we have to then take the
inverse of this new matrix.

Calculating the inverse by hand is a bit tedious, so I'm not going to go through
it step by step. But the result is that we now have unknown values in all 
positions of the inverted matrix, because the calculations always involve one
or more of the unknown elements of the input matrix.

---

## Example: Linear Regression

Even with **just a single missing value** &#8680; parameters `$\boldsymbol\beta$`
cannot be estimated!

**Solution:** Exclude incomplete cases?

???

And so it is clear, whenever we have missing values in the covariates,
we cannot estimate our regression coefficients. And the same goes for missing
values in the response `$y$`.

And so the logical conclusion would be that we would have to exclude all those
cases for which some values are missing, and perform a complete case analysis.

---
class: center, middle

# Complete Case Analysis is (usually) a Bad Idea!

???

But, a complete case analysis is in most cases a rather bad idea.

---

## Complete Case Analysis: Inefficient!

???

Here is one reason why.

You see on the y-axis the proportion of complete cases in a dataset, and on the
x-axis the number of incomplete variables. Each line represents a different
proportion of missing values per variable.

So, if we had 10% missing values in 25 variables, we'd end up with only 7% of the
original sample size.
And if we had 10% missing values in 10 variables, we'd have 35% of our data
left over in a complete case analysis.

Even if we had only 2% missing in only 5 variables, we could lose 10% of
the data.
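
The numbers above can be reproduced with a small calculation. This sketch assumes that values are missing independently across variables, each with the same proportion; the function name `complete_case_fraction` is made up for this example.

```python
def complete_case_fraction(p_missing, n_incomplete_vars):
    """Expected proportion of complete cases.

    Assumption: each of the incomplete variables is missing with
    probability p_missing, independently of the other variables.
    """
    return (1.0 - p_missing) ** n_incomplete_vars


for p, k in [(0.10, 25), (0.10, 10), (0.02, 5)]:
    frac = complete_case_fraction(p, k)
    print(f"{p:.0%} missing in {k} variables: {frac:.0%} complete cases")
```

If missingness in different variables tends to occur in the same subjects, more complete cases remain than this calculation suggests.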

---

## Complete Case Analysis

Complete Case Analysis is 
<ul class="fa-ul">
<li>inefficient</li>
<li>usually biased</li>
</ul>

**For Example:**
<ul class="fa-ul">
<li>
<a href = "https://thestatsgeek.com/2013/07/06/when-is-complete-case-analysis-unbiased">

thestatsgeek.com (2013)</a>
</li>

<li>
<a href = "https://doi.org/10.1002/sim.3944">

White & Carlin (2010)</a>
</li>

<li>
<a href = "https://doi.org/10.1016/j.jclinepi.2006.01.015">

Van der Heijden et al. (2006)</a>
</li>

<li>
<a href = "https://doi.org/10.1016/j.jclinepi.2009.12.008">

Janssen et al. (2010)</a>
</li>
</ul>

???

So it is clear, complete case analysis is very inefficient. In many cases we'll
lose quite a bit of data.

Moreover, complete case analysis is biased in most settings. There are a few
very specific exceptions, depending on what kind of model you use, where the
missing values are, and why they are missing.

---
class: center, middle

# Missing Data & Imputation

???

And, so, for most methods to handle missing values we can't make a general
statement that will always be true.

For the impact of a method there are a number of relevant aspects.

---

## Missing Values

**Relevant** for the choice / impact of methods:

.flex-grid[
.col[
- **How much is missing?**
  * per variable
  * per subject
  * complete cases
]
.col[
- **How much information is available?**
  * sample size
  * relevant covariates
  * strength of association
]
]
  
???

The first question that we usually ask ourselves is how much is actually
missing in the data? And we can distinguish between the proportion or number
of missing values per variable or per subject.

And, as we've seen, we might also need to check what that means for the 
number of complete cases.

But what I find sometimes even more relevant is how much information is 
available? Again, with respect to the number of observations per variable and
per subject, and, are there relevant covariates that are associated with the
variables that have missing values, how strong these associations are, and
if these other variables are observed for the cases with missing values in the
other variables.

- - -

--
 
- **Where are values missing?**
 * response
 * covariates
 
- **Why are values missing?** 
 &#8680; Missing Data Mechanism
 
???

We also need to distinguish between missing values in covariates and the
response, and we need to think about, and make assumptions about why the values
are missing, meaning, the missing data mechanism.

---

## Missing Data Mechanisms

**Missing Completely At Random (MCAR)**
`$$\mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) = \mathrm{Pr}(\text{value missing})$$`
.sgrey[no systematic difference between complete and incomplete cases]

???

For the missing data mechanism there is a specific terminology.

First, we can have "missing completely at random" missing data.
Missing completely at random means that the probability of a value being missing
does not depend on anything, it is completely random and has nothing to do with
what we are investigating in our study.

This means that there are no systematic differences between complete and 
incomplete cases.

- - -

**Missing At Random (MAR)**
`$$\mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) = \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs})$$`
???

Then, we have missing at random.
In missing at random the assumption is that the probability of a value being
missing depends on other things, but only on things that we have measured in 
our data, and that are actually observed.

- - -

**Missing Not At Random (MNAR)**
`$$\mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) \neq \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs})$$`
???

The last missing data mechanism is missing not at random, and here the probability
that a value is missing does depend on things that we have not measured or
that are missing. This can either be the missing value itself, or missing values
in other variables, or things that we haven't measured at all.

There is no way of testing if we are dealing with MNAR or MAR. We will always 
need to make an assumption about whether we have MNAR data.

Sometimes you can read in clinical papers that the authors assumed "random
missingness", or "that the missing values are random". I assume that they refer
either to MCAR or MAR, but it isn't clear which one, and it can make a very
important difference whether you have MCAR or MAR.
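
To make the three mechanisms concrete, here is a small simulation sketch in Python. The variables (`age`, `bp`), the probabilities and the cut-offs are all invented for illustration; the point is only what the probability of being missing is allowed to depend on.

```python
import random

random.seed(1)
n = 20000

# Hypothetical data: age is always observed, blood pressure (bp) depends on age.
age = [random.gauss(50, 10) for _ in range(n)]
bp = [100 + 0.5 * a + random.gauss(0, 10) for a in age]


def mean(values):
    return sum(values) / len(values)


# Probability that bp is missing under each mechanism (illustrative choices).
def p_mcar(a, b):
    return 0.3                      # depends on nothing


def p_mar(a, b):
    return 0.6 if a > 50 else 0.1   # depends only on the observed age


def p_mnar(a, b):
    return 0.6 if b > 125 else 0.1  # depends on the unobserved bp itself


def observed_mean(p_missing):
    """Mean of bp among the cases in which bp was not set to missing."""
    kept = [b for a, b in zip(age, bp) if random.random() >= p_missing(a, b)]
    return mean(kept)


true_mean = mean(bp)
results = {name: observed_mean(p)
           for name, p in [("MCAR", p_mcar), ("MAR", p_mar), ("MNAR", p_mnar)]}
for name, value in results.items():
    print(f"{name}: observed mean {value:.1f} vs. full-data mean {true_mean:.1f}")
```

In this setup the observed mean stays close to the full-data mean only under MCAR. Under MAR the difference can be explained by the observed `age`; under MNAR it cannot be recovered from the observed data alone.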

---

## Some Examples

* Data is collected by questionnaire &#8680; some got lost in the mail

???

Let's look at a few examples to see which type of missing data mechanism we
might have.

Say, we have a study for which we have collected data using a questionnaire.
Some of the questionnaires were filled in, but on the way back they got lost
in the mail.

Which type of missing data mechanism would that be?

* * * *

If this is a study in the Netherlands we could probably argue that this is 
MCAR. But if we were performing a study in various areas in, say, Africa,
and the postal service in the rural areas is much more unreliable than in the 
cities, and there are other factors that are of interest in our study that also
differ between rural areas and cities, we won't have MCAR any more.

- - -

* A particular biomarker was not part of the standard panel before 2008 
 &#8680; missing for many patients who entered < 2008

???
Another example. Imagine a particular biomarker was not part of the standard 
blood panel before 2008. And so, for most of the patients who entered the 
study before 2008 this value is missing, but for people who entered later it is
mostly observed.

Which missing data mechanism do we have?

If we know the year of inclusion, then we'd have MAR.
- - -

* In a survey in pregnant women some do not fill in the answer to "Are you currently smoking?"

???
Another example. We have a survey that we send out to pregnant women. One of the
questions is if they are currently smoking. What mechanism would you expect for
the missing values in that variable?

....

- - -

* Same survey: missing values in "chocolate consumption".

???
In the same survey, we also ask about the women's daily chocolate consumption.
What about the missing values in this variable?

- - -

.box.bg-0.brdr-8[
MCAR / MAR / MNAR are NOT a property of the data but of a **model**.
]

???

As you see, the missing data mechanism is actually not a property of the data
itself, but rather of the model that we use to fit the data or to impute it.

And to make suitable assumptions you need expert knowledge on how the data 
was measured. This is not something that the statistician can determine.

---

## Understanding Missing Values

* there is **uncertainty** about the missing value

???

The important issue in imputing missing values is that there is **uncertainty**
about what the value would have been. And so we **can't just pick** one value
and fill it in, because then we would just ignore this uncertainty.

If the value of `height` is missing for one patient, we don't know what that
value would have been.

- - - -
--

* some values are **more likely** than others

???
Also: some values are going to be more likely than others, and usually there is
a relationship between the variable that has missing values and the other data
that we have collected.

For the missing value in `height` we could expect something around 1.70 / 1.80m.
And values of 1.50m and 2.10m are possible, but less likely.

- - - -

**&#8680; missing values have a distribution**

???
So, in statistical terms, we can say that missing values have a distribution.

- - - -

* there is a relationship with **other** available **data**

.box.bg-0[
Predictive distribution
of the missing values given the observed values.
`$$p(x_{mis}\mid\text{everything else})$$`
]

???

Moreover, there usually is some relationship between the missing value and other
data that we have collected. If we know that the missing value in `height` is
from a male, larger values become more likely and smaller values less likely.

This means that we need a model to learn how the incomplete variable is related
to the other data.

This model, together with an assumption about the type of distribution the 
missing value has, then allows us to specify the distribution from which we
should sample values to impute the missing value.
We call this the predictive distribution. And the predictive distribution is 
generally based on everything else, including all other data
and parameters.

---

## A Simple Example

<table class="data-table">
<tr>
<th></th>
<th>$\mathbf y$</th>
<th>$\mathbf x_1$</th>
<th>$\mathbf x_2$</th>
<th>$\mathbf x_3$</th>
</tr>
<tr><td></td><td colspan="4" style="padding: 0px;"><hr /></td></tr>
<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td class="rownr">$i$</td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
</tr>
<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td class = "rownr"></td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
</tr>
</table>

* `$\mathbf y$`: **response**
* `$\color{var(--nord15)}{\mathbf x_1}$`: **incomplete** covariate
* `$\mathbf x_2$`, `$\mathbf x_3$`: **complete** covariates


**Predictive distribution:**

`$$p(\color{var(--nord15)}{\mathbf x_1} \mid \mathbf y, \mathbf x_2, \mathbf x_3,
\boldsymbol\beta, \sigma)$$`


???
Let's look at a simple example. Imagine, we have the following dataset, where we
have a completely observed response variable `$y$`, a variable `$x_1$` that is
missing for patient `$i$`, and two other covariates that are completely observed.

And so the predictive distribution that we need to sample the imputed value
from, would be the distribution of `$x_1$`, given the response `$y$`, the other
covariates, and some parameters.
- - -

For example:

* Fit a model to the cases with observed
  `$\color{var(--nord15)}{\mathbf x_1}$`:
  `$$\color{var(--nord15)}{\mathbf x_1} = \beta_0 + \beta_1 \mathbf y + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \boldsymbol\varepsilon$$`
{{content}}

???
For example, we could think of this as fitting a regression model with `$x_1$` as
the dependent variable, and `$y$` & the other covariates as independent variables.

We can then fit this model to all those cases for which we have `$x_1$` observed,...
- - - -

* Estimate parameters `$\boldsymbol{\hat\beta}, \hat\sigma$` 
 &#8680; define distribution 
 `$p(\color{var(--nord15)}{x_{i1}} \mid y_i, x_{i2}, x_{i3}, \boldsymbol{\hat\beta}, \hat\sigma)$`

???
... in order to estimate the parameters, and to learn what the
distribution of `$x_1$` conditional on the other data looks like.

And then we can use this information to specify the predictive distribution for
the cases with missing `$x_1$` and sample imputed values from this distribution.
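
The two steps just described (fit the model to the cases with observed `$x_1$`, then use the estimates to define the predictive distribution for the incomplete cases) could look roughly like this in Python. The data are simulated with made-up coefficients, a normal linear model is assumed, and the helper names `solve` and `fit_ols` are hypothetical.

```python
import random


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * p for a, p in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_ols(rows, target):
    """Least squares fit; each row already contains the 1 for the intercept."""
    p = len(rows[0])
    XtX = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    Xty = [sum(r[j] * t for r, t in zip(rows, target)) for j in range(p)]
    beta = solve(XtX, Xty)
    resid = [t - sum(b * v for b, v in zip(beta, r)) for r, t in zip(rows, target)]
    sigma = (sum(e * e for e in resid) / (len(rows) - p)) ** 0.5
    return beta, sigma


random.seed(2)
# Made-up data with columns y, x1, x2, x3; x1 is None when it is missing.
data = []
for _ in range(200):
    x2, x3 = random.gauss(0, 1), random.gauss(0, 1)
    x1 = 1.0 + 0.5 * x2 - 0.3 * x3 + random.gauss(0, 1)
    y = 2.0 + 1.0 * x1 + 0.5 * x2 + random.gauss(0, 1)
    data.append([y, x1 if random.random() > 0.3 else None, x2, x3])

# Step 1: fit x1 ~ y + x2 + x3 on the cases with observed x1.
observed = [d for d in data if d[1] is not None]
beta_hat, sigma_hat = fit_ols([[1.0, d[0], d[2], d[3]] for d in observed],
                              [d[1] for d in observed])

# Step 2: predictive distribution for an incomplete case i is taken to be
# Normal(pred_mean_i, sigma_hat^2), using the estimates from step 1.
incomplete = [d for d in data if d[1] is None]
pred_means = [beta_hat[0] + beta_hat[1] * d[0] + beta_hat[2] * d[2]
              + beta_hat[3] * d[3] for d in incomplete]
print(beta_hat, sigma_hat)
```

Plugging in the estimates as if they were the true parameters ignores the uncertainty about them, so this is only the starting point for a proper imputation.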

---

## Imputation of Missing Values

???

We can visualize this for the case where we only have two variables, one 
incomplete, shown on the y-axis, and one complete, shown on the x-axis. So this 
is a visualization of the imputation model.

In practice we will of course have more variables, but then I couldn't show it
in a simple plot any more. So this is really just to get the idea.

We know the value of the other variable, so we know where our incomplete cases
are on the x-axis, but we don't know where to place them on the y-axis. 
Therefore I only marked them as empty circles on the x-axis here.

---
count: false

## Imputation of Missing Values

???

When we now fit a model on the observed cases we can represent that as the
corresponding regression line.

---
count: false
name: predval_reg

## Imputation of Missing Values

???

[jump to regression imputation](#regimp)

When we now plug the observed values of our incomplete cases into the
estimated model, we get the fitted values, meaning the
corresponding values on the regression line.

Could we now just take those values as our imputed values? We fitted the
model on the complete cases, and then we predicted the value of the incomplete
variable from that model.

---

## Imputation of Missing Values

.box.bg-0[
**Important:** We need to take into account the **uncertainty**!
]

???

Not quite. We can't just use the fitted value to impute the missing value
because there is uncertainty that we haven't taken into account.

- - - - -

<img src="index_files/figure-html/imp reglines multi-1.png" width="100%" />

???

There is uncertainty about the parameter estimates in the imputation model.
Because our data is just a sample, we don't know the true parameters.
With a different sample, we'd get a slightly different regression line.


???

And there is uncertainty about the values themselves. In the observed data,
the data points are not exactly on the regression line, but spread around it.
So we'd expect the same for the missing values. 
Using the fitted values, the values on the regression line, would ignore this
random variation that we have in the data.

This is the part where we assume that the missing values have a distribution.
This distribution is the random variation around the expected value.

---

## Imputation of Missing Values

**We want:** 
Imputation from the **predictive distribution**
`$p(\color{var(--nord15)}{x_{mis}} \mid \text{everything else})$`.

**Idea:** 
Use a "prediction" model.

**Take into account:**
* **uncertainty in parameter** estimates `$\boldsymbol{\hat\beta}$`
* **prediction error** `$(\mathbf{\hat x}_{mis} \neq \mathbf x_{mis})$`
* missing values have a **distribution** &#8680; we can't just replace them with **one** value.

???

So, in summary, what have we seen so far?

We want to impute missing values from the predictive distribution of the missing
value given everything else.

The idea is to do that via a prediction model.

But we need to take into account that we have multiple sources of uncertainty or
variation: 
- uncertainty about the parameters in the imputation model
- random variation of the unknown values (also called prediction error)
- and we need to take into account that there is uncertainty about the missing
value, so that we can't represent a missing value by one single imputed value
because that would not capture that uncertainty (the additional uncertainty that
we have compared to an observed value)
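
A sketch of how the three points could be combined for one incomplete variable `x` and one complete variable `z`. It assumes a normal linear imputation model with the usual non-informative prior; the function name `draw_imputations` and the simulated data are made up for this example.

```python
import random


def draw_imputations(z_obs, x_obs, z_mis, rng):
    """Draw one set of imputed values for x given z.

    Works with the centred model x = a + b * (z - zbar) + e, in which the
    estimates of a and b are uncorrelated and can be drawn separately.
    """
    n = len(z_obs)
    zbar = sum(z_obs) / n
    sxx = sum((z - zbar) ** 2 for z in z_obs)
    a_hat = sum(x_obs) / n
    b_hat = sum((z - zbar) * x for z, x in zip(z_obs, x_obs)) / sxx
    rss = sum((x - a_hat - b_hat * (z - zbar)) ** 2
              for z, x in zip(z_obs, x_obs))

    # 1. Parameter uncertainty: draw sigma^2 as rss / chi^2 with n - 2 df,
    #    then draw a and b around their estimates.
    sigma2 = rss / rng.gammavariate((n - 2) / 2, 2)
    a = rng.gauss(a_hat, (sigma2 / n) ** 0.5)
    b = rng.gauss(b_hat, (sigma2 / sxx) ** 0.5)

    # 2. Prediction error: add random noise around the drawn regression line.
    return [rng.gauss(a + b * (z - zbar), sigma2 ** 0.5) for z in z_mis]


rng = random.Random(3)
z_obs = [rng.gauss(0, 1) for _ in range(50)]
x_obs = [1.0 + 2.0 * z + rng.gauss(0, 1) for z in z_obs]
z_mis = [-1.0, 0.0, 1.5]  # z values of the cases with missing x

# 3. More than one value per missing entry: repeat the draw m times.
m = 5
imputations = [draw_imputations(z_obs, x_obs, z_mis, rng) for _ in range(m)]
for imp in imputations:
    print([round(v, 2) for v in imp])
```

Each repetition uses a slightly different regression line and different noise, so every missing value gets several plausible imputed values instead of a single one.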

---
class: center, middle

# Naive Ways to Handle Missing Data

???

So, with this knowledge on missing data and all the things that we need to 
take into account, let's have a look at some naive methods to handle missing
data that are, unfortunately, still used.

---

## Naive Ways to Handle Missing Data

???

We are now looking at the data that we would use for the actual analysis of 
interest, and the regression line from that analysis model.
So on the x-axis we now have the incomplete covariate and on the y-axis the
response, which we assume is fully observed.

The cases for which the covariate is observed are drawn as white dots, the cases
for which the covariate is missing as empty purple circles.
The correct regression line, that we would get if we didn't have any missing
values is shown with the dashed line.

---

## Complete Case Analysis

???

In a complete case analysis, the regression line would be calculated just based
on the white data points. Because the missing values are not missing completely
at random, but values are more likely to be missing for larger response values,
the estimated line now is lower than the true line.

---

## Mean Imputation
<img src="index_files/figure-html/unnamed-chunk-4-1.png" width="100%" />

???

The first imputation method that I'll show here is mean imputation. All missing
values in the covariate are filled in with the mean of the observed values of
that covariate. This is shown here with the filled purple dots. 
You can clearly see that they are not a good representation of the distribution
of the true but missing values.

The corresponding regression line, shown with the solid white line is closer
to the true line than for complete case analysis but is flatter than the true line.
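
A tiny sketch with made-up numbers of why the mean-imputed values do not represent the distribution of the missing values: the mean stays the same, but the spread of the completed variable shrinks.

```python
from statistics import mean, stdev

# Invented covariate values; None marks a missing value.
x = [4.1, None, 5.3, 6.8, None, 3.9, 7.4, None, 5.0, 6.1]

observed = [v for v in x if v is not None]
fill = mean(observed)

# Mean imputation: every missing value gets the same number.
x_mean_imputed = [fill if v is None else v for v in x]

print("mean:", mean(observed), "->", mean(x_mean_imputed))
print("sd:  ", stdev(observed), "->", stdev(x_mean_imputed))
```

All imputed values sit exactly at the mean, so they add no variation, and the standard deviation of the completed variable is smaller than that of the observed values.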

---

## Missing Indicator Method
<img src="index_files/figure-html/unnamed-chunk-5-1.png" width="100%" />

???

The second missing data method is the missing indicator method. The idea here is
to replace the missing values with a fixed value, for example zero. And, to
distinguish the incomplete cases from the complete cases we additionally add an
indicator variable that is zero for observed cases and one for incomplete cases.

As for mean imputation we see that the imputed values do not at all represent 
the spread of the missing values. Because of the indicator variable we now get
two regression lines, one for observed and one for incomplete cases, but they have
the same slope, which seems to be similar to the slope of the true regression 
line.

---

## Regression Imputation
<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="100%" />

???

[jump to regline with predicted values](#predval_reg)

Next, we have regression imputation. The idea here is to impute based on a 
prediction model, like we saw before, but to just use the fitted values from 
that prediction. Because this method also does not take into account the 
random variation, we see that all imputed values are on one straight line.

And again we see that the model fitted on the imputed data results in a 
regression line with a different slope than the true line, now with a steeper 
slope.

---

## Single Imputation
<img src="index_files/figure-html/unnamed-chunk-7-1.png" width="100%" />

???

In single imputation we now improve upon the regression imputation by taking
into account both the uncertainty about the parameters in the imputation model
and the random variation. And we can see that the imputed values have a
distribution that is much more similar to the distribution of the missing values.

The corresponding regression line is also almost identical to the true line.

---

## Single Imputation

Can take into account
* **uncertainty in parameter** estimates `$\boldsymbol{\hat\beta}$`
* **prediction error** `$(\mathbf{\hat x}_{mis} \neq \mathbf x_{mis})$`


.pull-right[
Single imputation does not take into account the **uncertainty about the imputed
value**!

]

???

In the single imputation we did take into account two of the sources of 
uncertainty or variation, but we only have one imputed value.
We have no way of taking into account the added uncertainty that we have 
about the imputed value compared to an observed value, when we just have one
single value.

We see that the different methods disagree quite a bit in their estimates.
The single imputation comes closest, but when we take a closer look at the CI 
we see that it is actually a bit narrower than the true CI. 
In the example I used here, I have a bit more than 50% missing values. So we
should have quite a bit additional uncertainty compared to the complete data.

The single imputation approach clearly underestimates the uncertainty that we
have about the effect of the covariate.

---
class: center, middle

# Multiple Imputation

???

And this is why Donald Rubin came up with the idea of multiple imputation.

---

## Multiple Imputation

MI was developed in the 1960s/70s...

**Requirements**
* computationally feasible
* "fix" the missing data problem once / centrally 
 &#8680; distribute imputed data to other researchers

---

## Multiple Imputation

???

The idea behind multiple imputation is that, using this principle,
we sample imputed values and fill them into the original, incomplete data to 
create a completed dataset.

And in order to take into account the uncertainty that we have about the missing
values, we do this multiple times, so that we obtain multiple completed datasets.

Because all the missing values have now been filled in, we can analyse each of
these datasets separately with standard statistical techniques.

To obtain overall results, the results from each of these analyses need to be
combined in a way that takes into account both the uncertainty that we have
about the estimates from each analysis, and the variation between these estimates.

---

## Multiple Imputation

---

## Multiple Imputation

---

## Multiple Imputation

---

## Multiple Imputation: Pooling

**Pooled Parameter Estimate:** 
`$$\bar\beta = \frac{1}{m}\sum_{\ell = 1}^m \hat\beta^{(\ell)} \qquad
\text{(average estimate)}$$`

**Pooled Variance:**
`$$T = \overline W + B + B/m$$`
* `$\displaystyle\overline W = \frac{1}{m}\sum_{\ell = 1}^m \mathrm{var}\left(\hat\beta^{(\ell)}\right)$`
  &nbsp;&nbsp;&nbsp;average within imputation variance

* `$\displaystyle B = \frac{1}{m - 1}\sum_{\ell = 1}^m \left(\hat \beta^{(\ell)} - \bar\beta\right)^2$`
  &nbsp;&nbsp;&nbsp;between imputation variance
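
As a small worked sketch, the pooling formulas above translate directly into Python. The estimates and variances from the `$m = 5$` analyses are invented, and the function name `pool` is chosen for this example.

```python
from statistics import mean, variance


def pool(estimates, variances):
    """Pool one coefficient over m imputed datasets.

    estimates: the m estimates of the coefficient
    variances: the m squared standard errors of those estimates
    """
    m = len(estimates)
    beta_bar = mean(estimates)   # pooled estimate
    w_bar = mean(variances)      # average within imputation variance
    b = variance(estimates)      # between imputation variance (divisor m - 1)
    total = w_bar + b + b / m    # pooled variance T
    return beta_bar, total


# Invented results from analysing m = 5 completed datasets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]

beta_bar, total = pool(estimates, variances)
print(beta_bar, total, total ** 0.5)
```

With these numbers, `$\overline W = 0.044$`, `$B = 0.025$` and `$T = 0.044 + 0.025 + 0.005 = 0.074$`, so the pooled variance is clearly larger than the variance from any single completed dataset.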

---

## Multiple Imputation

---
class: center, middle

# Multivariate Missingness

---

## In Practice

<div style = "text-align: center; margin-bottom: 25px;">
Multivariate Missingness</div>

<table class="data-table">
<tr>
<th></th>
<th>$\mathbf y$</th>
<th>$\mathbf x_1$</th>
<th>$\mathbf x_2$</th>
<th>$\mathbf x_3$</th>
<th>$\ldots$</th>
</tr>
<tr><td></td><td colspan="5" style="padding: 0px;"><hr /></td></tr>
<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td></td>
<td style="color: var(--nord15);"></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr">$i$</td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr"></td>
<td></td>
<td style="color: var(--nord15);"></td>
<td style="color: var(--nord15);"></td>
<td></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td></td>
<td></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class = "rownr"></td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td></td>
</tr>
</table>


<div style = "width: 700px;">
based on models
</div>

<div>
\begin{alignat}{10}
\color{var(--nord15)}{\mathbf x_1} &= \beta_0 &+& \beta_1 \mathbf y &+&
\beta_2 \color{var(--nord15)}{\mathbf x_2} &+& \beta_3 \color{var(--nord15)}{\mathbf x_3} &+& \ldots \\
\color{var(--nord15)}{\mathbf x_2} &= \alpha_0 &+& \alpha_1 \mathbf y &+&
\alpha_2 \color{var(--nord15)}{\mathbf x_1} &+& \alpha_3 \color{var(--nord15)}{\mathbf x_3} &+& \ldots\\
\color{var(--nord15)}{\mathbf x_3} &= \theta_0 &+& \theta_1 \mathbf y &+&
\theta_2 \color{var(--nord15)}{\mathbf x_1} &+& \theta_3 \color{var(--nord15)}{\mathbf x_2} &+& \ldots
\end{alignat} 
</div>

{{content}}

???

And the most common approach to imputation in this setting is MICE, short for
**multivariate imputation by chained equations**, an approach that is also
called **fully conditional specification**.

The principle is an extension of what we've seen on the previous slides.
We impute missing values using models that have all other data in their linear
predictor.
- - -

--

**Most common approach:**

* MICE (multivariate imputation by chained equations)
* FCS (fully conditional specification)

---

## MICE / FCS

- cycle through the models to **update the imputed values**

- until **convergence**

&#8680; keep only last imputed value

<img src="index_files/figure-html/unnamed-chunk-16-1.png" width="100%" />

???
Because in these imputation models we now have incomplete covariates, we use an
iterative algorithm. We start by randomly drawing starting values from the 
observed part of the data, and then we cycle through the
incomplete variables and impute one at a time.
- - - - - -

**Flexible model types:** 
Choose a different type of model per incomplete variable.

???

The models for the different variables can be specified according to the type
of variable.

Once we have imputed each missing value, we start again with the first
variable, but now use the imputed values of the other variables instead of the
starting values, and we do this a few times until the algorithm has converged.
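
The cycle described above can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm as implemented in any package: it assumes only two incomplete continuous variables, a simple linear imputation model for each, and imputations drawn as prediction plus normal noise (uncertainty about the model parameters is ignored). The names `fit_line` and `chained_equations` are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares intercept, slope and residual SD for y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, (rss / (n - 2)) ** 0.5


def chained_equations(data, n_iter=10, seed=1):
    """Impute None entries in a dict with exactly two equally long columns."""
    rng = random.Random(seed)
    names = list(data)
    missing = {v: [i for i, val in enumerate(data[v]) if val is None] for v in names}
    # Starting values: random draws from the observed part of each variable
    filled = {}
    for v in names:
        observed = [val for val in data[v] if val is not None]
        filled[v] = [val if val is not None else rng.choice(observed) for val in data[v]]
    for _ in range(n_iter):
        for target in names:  # cycle through the incomplete variables
            (predictor,) = [v for v in names if v != target]
            rows = [i for i in range(len(data[target])) if i not in missing[target]]
            a, b, sd = fit_line([filled[predictor][i] for i in rows],
                                [filled[target][i] for i in rows])
            for i in missing[target]:  # update the imputed values
                filled[target][i] = a + b * filled[predictor][i] + rng.gauss(0, sd)
    return filled  # only the last imputed values are kept


# Simulated example data (hypothetical)
rng = random.Random(42)
x1 = [rng.gauss(0, 1) for _ in range(50)]
x2 = [0.5 * v + rng.gauss(0, 1) for v in x1]
data = {"x1": [None if i % 5 == 0 else v for i, v in enumerate(x1)],
        "x2": [None if i % 7 == 3 else v for i, v in enumerate(x2)]}
completed = chained_equations(data)
```

Running the function several times with different seeds would give several completed datasets, which is what multiple imputation requires.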

---

## MICE / FCS: Imputation Model Types

**Parametric imputation models**
* Linear model .sgrey[(continuous, cond. normal variable)]
* Logistic model .sgrey[(binary variable)]
* Multinomial model .sgrey[(categorical variable)]
* ...

**Semi-parametric and non-parametric models**
* Predictive Mean Matching (PMM) .sgrey[(any type of variable)]
 &emsp;[&#8680; NIHES EL009](https://nerler.github.io/EP16_Multiple_Imputation/slide/04_imputation_step_ii.pdf#page=15)
* Classification and regression trees
* Random Forest
* ...

---

## MICE / FCS: Predictive Mean Matching

<table class="data-table">
<tr>
<th>$\mathbf y$</th>
<th>$\color{var(--nord15)}{\mathbf x_1}$</th>
<th>$\mathbf x_2$</th>
<th>$\mathbf x_3$</th>
<th>$\ldots$</th>
</tr>
<tr><td colspan = "5" style = "padding: 0px;"><hr /></td></tr>
<tr>
<td></td>
<td></td>
<td></td>
<td style="color: var(--nord15);"></td>
<td>$\ldots$</td>
</tr>
<tr>
<td></td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td>$\ldots$</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td style="color: var(--nord15);"></td>
<td>$\ldots$</td>
</tr>

<tr>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
<th>$\ldots$</th>
</tr>
<tr>
<td></td>
<td style="color: var(--nord15);"></td>
<td style="color: var(--nord15);"></td>
<td></td>
<th>$\ldots$</th>
</tr>

<tr>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td style="color: var(--nord15);"></td>
<th>$\ldots$</th>
</tr>

<tr>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td></td>
</tr>
</table>


* Fit a model on cases with `$\color{var(--nord15)}{\mathbf x_1}$` observed:
  `$\mathbf x_1^{obs} = \beta_0 + \beta_1 \mathbf y + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \ldots$`
* Calculate the predicted values `$\mathbf{\hat x_1^{obs}}$`

* Calculate the predicted values `$\mathbf{\hat x_1^{mis}}$`
* Find cases where `$\hat x_{j1}^{obs}$` is similar to `$\hat x_{i1}^{mis}$`
* Use the corresponding **observed** value(s) `$x_{j1}^{obs}$` 
 to impute `$x_{i1}^{mis}$`
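
The steps above can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions: a single complete predictor, a fixed number of candidate donors (the hypothetical argument `n_donors`), and no draw of the regression parameters, so it does not reflect parameter uncertainty. The function names are made up for this example.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def pmm_impute(predictor, target, n_donors=3, seed=1):
    """Impute None entries of target by predictive mean matching."""
    rng = random.Random(seed)
    obs = [i for i, v in enumerate(target) if v is not None]
    mis = [i for i, v in enumerate(target) if v is None]
    # Fit the model on cases with the target observed
    a, b = fit_line([predictor[i] for i in obs], [target[i] for i in obs])
    pred = [a + b * x for x in predictor]  # predicted values for all cases
    imputed = list(target)
    for i in mis:
        # Observed cases whose predicted value is closest to that of case i
        donors = sorted(obs, key=lambda j: abs(pred[j] - pred[i]))[:n_donors]
        imputed[i] = target[rng.choice(donors)]  # borrow an observed value
    return imputed


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x1 = [1.1, None, 2.9, 4.2, None, 6.1]
x1_imputed = pmm_impute(y, x1, n_donors=2)
print(x1_imputed)
```

Because every imputed value is copied from an observed case, imputations always lie within the range of the observed data.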

---

## Missing Values

**Relevant** for the choice / impact of methods:

- **Where are values missing?**
 * response
 * covariates
 
- **Why are values missing?** 
 &#8680; Missing Data Mechanism
 
---

## Considerations for the Use of FCS MI

**How much is missing / how much information is available?**

<table class="data-table">
<tr>
<th></th>
<th>$\mathbf y$</th>
<th>$\mathbf x_1$</th>
<th>$\mathbf x_2$</th>
<th>$\mathbf x_3$</th>
<th>$\ldots$</th>
</tr>
<tr><td></td><td colspan = "5" style = "padding: 0px;"><hr /></td></tr>
<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td></td>
<td style="color: var(--nord15);"></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr">$i$</td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr"></td>
<td></td>
<td style="color: var(--nord15);"></td>
<td style="color: var(--nord15);"></td>
<td></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td></td>
<td></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr"></td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
<th>$\ldots$</th>
</tr>

<tr>
<td class="rownr"></td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
<th>$\ldots$</th>
</tr>

<tr>
<td class = "rownr"></td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td></td>
</tr>
</table>


<div style = "width: 700px;">
Imputation of $\color{var(--nord15)}{\mathbf x_1}$ based on:

\[\color{var(--nord15)}{\mathbf x_1} = \beta_0 + \beta_1 \mathbf y +
\beta_2 \color{var(--nord15)}{\mathbf x_2} + \beta_3 \color{var(--nord15)}{\mathbf x_3} + \ldots\]

<ul>
<li> Fit model on cases with observed $\color{var(--nord15)}{\mathbf x_1}$</li>
<li> Predict missing $\color{var(--nord15)}{\mathbf x_1}$</li>
</ul>

</div>


<div>

Scenario 1:&emsp;
N = 200,&emsp; 90% of $\color{var(--nord15)}{\mathbf x_1}$ is missing 
&#8680; N = 20 to estimate $\boldsymbol\beta$

{{content}}
</div>

<div>
Scenario 2:&emsp;
N = 5000,&emsp; 90% of $\color{var(--nord15)}{\mathbf x_1}$ is missing 
&#8680; N = 500 to estimate $\boldsymbol\beta$
</div>

---

## Considerations for the Use of FCS MI

**Relevant covariates / strength of association**

<table class="data-table">
<tr>
<th></th>
<th>$\mathbf y$</th>
<th>$\mathbf x_1$</th>
<th>$\mathbf x_2$</th>
<th>$\mathbf x_3$</th>
<th>$\ldots$</th>
</tr>
<tr><td></td><td colspan = "5" style = "padding: 0px;"><hr /></td></tr>
<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td></td>
<td style="color: var(--nord15);"></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr">$i$</td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr"></td>
<td></td>
<td style="color: var(--nord15);"></td>
<td style="color: var(--nord15);"></td>
<td></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td></td>
<td></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr"></td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
<th>$\ldots$</th>
</tr>

<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<th>$\ldots$</th>
</tr>

<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td></td>
<td></td>
<th>$\ldots$</th>
</tr>

<tr>
<td class = "rownr"></td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td></td>
</tr>
</table>


<div style = "width: 700px;">
Imputation of $\color{var(--nord15)}{\mathbf x_1}$ based on:

\[\color{var(--nord15)}{\mathbf x_1} = \beta_0 + \beta_1 \mathbf y +
\beta_2 \color{var(--nord15)}{\mathbf x_2} + \beta_3 \color{var(--nord15)}{\mathbf x_3} + \ldots\]

Say, $\color{var(--nord15)}{\mathbf x_1}$ is bilirubin.
</div>


Scenario 1: 
other covariates:

<div class = "flex-grid">
<div class = "col">
<ul>
<li>age</li>
<li>gender</li>
<li>eye color</li>
</ul>
</div>

<div class = "col">
{{content}}
</div>
</div>

Scenario 2: 
other covariates:

<div class = "flex-grid">
<div class = "col">
<ul>
<li>age</li>
<li>gender</li>
<li>height</li>
<li>weight</li>
<li>family history</li>
</ul>
</div>

<div class = "col">
<ul>
<li>comorbidities</li>
<li>creatinine</li>
<li>AST, ALT, ALP</li>
<li>MELD</li>
<li>...</li>
</ul>
</div>
</div>

---

## Considerations for the Use of FCS MI

**Where are values missing?**

<table class="data-table">
<tr>
<th></th>
<th>$\mathbf y$</th>
<th>$\mathbf x_1$</th>
<th>$\mathbf x_2$</th>
<th>$\mathbf x_3$</th>
<th>$\ldots$</th>
</tr>
<tr><td></td><td colspan = "5" style = "padding: 0px;"><hr /></td></tr>
<tr>
<td class="rownr"></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
<td></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr"></td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr"></td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td></td>
<td></td>
<th>$\ldots$</th>
</tr>
<tr>
<td class="rownr"></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
<td></td>
<th>$\ldots$</th>
</tr>

<tr>
<td class="rownr"></td>
<td style="color: var(--nord15);"></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
<th>$\ldots$</th>
</tr>

<tr>
<td class = "rownr"></td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td></td>
</tr>
</table>


**Imputation Model** for `$\color{var(--nord15)}{\mathbf y}$`:
`$$\color{var(--nord15)}{\mathbf y} = \alpha_0 + \alpha_1 \color{var(--nord15)}{\mathbf x_1} + \alpha_2 \mathbf x_2 + \alpha_3 \mathbf x_3 + \varepsilon_y$$`

* fit on cases with observed `$\color{var(--nord15)}{\mathbf y}$`
 &#8680; `$\boldsymbol{\hat\alpha}$`
* predict missing `$\color{var(--nord15)}{\mathbf y}$` using `$\boldsymbol{\hat\alpha}$` 
 &#8680; imputed cases will always have estimates equal to `$\boldsymbol{\hat\alpha}$`
 
{{content}}


**Analysis Model**
* estimates in observed part: `$\boldsymbol{\hat\alpha}$`
* estimates in imputed part: `$\boldsymbol{\hat\alpha}$` 
&#8680; same results as in imputation model
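
The argument on this slide can be checked numerically. The sketch below assumes a single covariate and replaces each missing response by its predicted value without adding noise, which is a simplification; with random draws the two sets of estimates would agree only up to simulation noise. The data values are made up.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [0.3, 1.9, 4.2, None, 8.1, None, 11.8, None]

# Imputation model: fitted on the cases with observed response
obs = [i for i, v in enumerate(y) if v is not None]
a_obs, b_obs = fit_line([x[i] for i in obs], [y[i] for i in obs])

# Impute the missing responses with their predicted values
y_completed = [v if v is not None else a_obs + b_obs * x[i] for i, v in enumerate(y)]

# Analysis model (same form as the imputation model): fitted on the completed data
a_all, b_all = fit_line(x, y_completed)
print(a_obs, b_obs)
print(a_all, b_all)
```

The imputed cases lie exactly on the fitted line, so they add no information and the analysis model reproduces the estimates of the imputation model.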

---

## Considerations for the Use of FCS MI

**Missing values in the response:**

If analysis model `$=$` imputation model 
&#8680; `$\boldsymbol{\hat\beta} = \boldsymbol{\hat\alpha}$` 
&#8680; No point in imputing responses.

**Auxiliary variables**: 
&#8680; analysis model `$\neq$` imputation model 
&#8680; `$\boldsymbol{\hat\beta} \neq \boldsymbol{\hat\alpha}$` 
&#8680; Imputing responses can be beneficial.

---

## Considerations for the Use of FCS MI

**Why are values missing?**

Imputation of `$\color{var(--nord15)}{\mathbf x_1}$` based on:

`$$\color{var(--nord15)}{\mathbf x_1} = \beta_0 + \beta_1 \mathbf y +
\beta_2 \color{var(--nord15)}{\mathbf x_2} + \beta_3 \color{var(--nord15)}{\mathbf x_3} + \ldots$$`

<ul>
<li> Fit model on cases with observed $\color{var(--nord15)}{\mathbf x_1}$</li>
<li> Predict missing $\color{var(--nord15)}{\mathbf x_1}$</li>
</ul>

.box.bg-0.brdr-8[
&#8680; Imputed `$\color{var(--nord15)}{\mathbf x_1}$` will have the same
distribution as observed `$\color{var(--nord15)}{\mathbf x_1}$` with **the same
values of all other variables**.
]

**&#8680; FCS MI is valid under MAR**

---

## FCS MI in Practice

* valid under **MAR**, provided the imputation models contain the important
  predictors in the right form

--

* allows us to take into account
  * uncertainty about the missing values
    (if we use enough imputed datasets)
  * uncertainty about the parameters in the imputation model
    (requires Bayes or bootstrap) &emsp; [&#8680; NIHES EL009](https://nerler.github.io/EP16_Multiple_Imputation/slide/04_imputation_step_ii.pdf#page=5)
  * prediction error
    (requires Bayes, or PMM with appropriate settings) &emsp; [&#8680; NIHES EL009](https://nerler.github.io/EP16_Multiple_Imputation/slide/04_imputation_step_ii.pdf#page=22)

* Imputation models need to fit the data
 - no contradiction between imputation models
 - no contradiction between imputation models and analysis model(s)
<ul class="fa-ul">
<li>multi-level data, non-linear associations, survival data</li>
</ul>

---

# Multi-level Data

---

## Multi-level Data

<table class="data-table">
<tr>
<th></th>
<th>$\mathbf y$</th>
<th>$\color{var(--nord15)}{\mathbf x_1}$</th>
<th>$\mathbf x_2$</th>
<th>$\mathbf x_3$</th>
</tr>
<tr><td></td><td colspan = "4" style = "padding: 0px;"><hr /></td></tr>
<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr class="hlgt-row">
<td class="rownr">$i$</td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
</tr>
<tr class = "hlgt-row">
<td class="rownr">$i$</td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
</tr>
<tr class = "hlgt-row">
<td class="rownr">$i$</td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
</tr>
<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td class = "rownr"></td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
</tr>
</table>


???

We have multi-level data when we have measured the same variable repeatedly in
the same patient, but also when we have a clustering structure in our data,
for example in a multi-center study. In both cases, observations from the 
same patient, or the same cluster are not independent.

We typically represent this type of data in long format, so that we now have
multiple rows that belong to the same patient.

---

## Multi-level Data

* observations of the same patient / cluster are correlated

* unbalanced data



---

## Multi-level Data

**(Linear) Mixed Model**
`$$y_{ij} = \underset{\text{fixed effects}}{\underbrace{\mathbf x_{ij}^\top\boldsymbol\beta}} + 
\underset{\text{random effects}}{\underbrace{\mathbf z_{ij}^\top\mathbf b_i}} +
\varepsilon_{ij}$$`

* **level-1** variables: repeatedly measured / time-varying

* **level-2** variables: baseline / patient (or cluster) specific / time-constant

???

For analysis: &#8680; typically use a mixed model

* takes into account that the repeated measurements for a patient are not independent
* can handle unbalanced data

Things get interesting when we have missing values in a baseline covariate: 
&#8680; when imputing we do not only need to take into account that multiple
missing values may belong to the same patient and should therefore be correlated,
**but that they should be identical.**

---

## FCS in Multi-level Data

.flex-grid[
.col[
If `$\color{var(--nord15)}{\mathbf x_1}$` is **level-1**: 
The imputation model for `$\color{var(--nord15)}{x_{i1}(t)}$` could be
a mixed model, e.g.:
`$$\mathbb{E}[\color{var(--nord15)}{x_{i1}(t)}] = 
\underset{\color{var(--nord3)}{\text{fixed effects}}}{\color{var(--nord3)}{\underbrace{\color{var(--nord4)}{\theta_0 + \theta_1 y_i(t) + \theta_2 x_{i2}(t) + \theta_3 x_{i3}(t)}}}} + 
\underset{\color{var(--nord3)}{\substack{\text{random}\\\text{effects}}}}{\color{var(--nord3)}{\underbrace{\color{var(--nord4)}{\mathbf z_i(t)^\top \mathbf u_i}}}}$$`

]

.col[
<table class="data-table">
<tr>
<th></th>
<th>$\mathbf y$</th>
<th>$\color{var(--nord15)}{\mathbf x_1}$</th>
<th>$\mathbf x_2$</th>
<th>$\mathbf x_3$</th>
</tr>
<tr><td></td><td colspan = "4" style = "padding: 0px;"><hr /></td></tr>
<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr class="hlgt-row">
<td class="rownr">$i$</td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
</tr>
<tr class = "hlgt-row">
<td class="rownr">$i$</td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
</tr>
<tr class = "hlgt-row">
<td class="rownr">$i$</td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
</tr>
<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td class="rownr"></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td class = "rownr"></td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
</tr>
</table>

]
]

But what if `$\color{var(--nord15)}{\mathbf x_1}$` is a **level-2** variable?

???

If `$x_1$` is a level-2 variable:

* imputed values for the same subject should be identical

* Use a mixed model?&emsp;{{content}}
--

{{content}}
--
* Use a GLM?&emsp;{{content}}
--

{{content}}

--
**&#8680; Imputation in wide format?**

???
&#8680; For incomplete baseline variables imputation in wide format might be better(?)

---

## Imputation in Wide Format?

---
count: false
class: animated, fadeIn

## Imputation in Wide Format?
<img src="figures/wideform1.png" height="480" style="margin: auto; display: block;">

---
class: animated, fadeIn

## Imputation in Wide Format?
<img src="figures/wideform2.png" height="450" style="margin: auto; display: block;">

---
class: animated, fadeIn

## Imputation in Wide Format?
<img src="figures/wideform3.png" height="450" style="margin: auto; display: block;">

???

* In yellow: unnecessary imputations = imputations that we would not need
for the analysis with a mixed model 
* in red: imputations after death

* we would impute all these values for the outcome and each time-varying covariate
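
How quickly unnecessary imputations accumulate can be illustrated with a small, made-up example. The sketch below assumes long-format data stored as `(patient id, visit number, outcome)` tuples and counts the empty cells that appear when unbalanced data are reshaped to wide format; all values are hypothetical.

```python
long_data = [
    # (patient id, visit number, outcome); hypothetical values
    (1, 1, 5.1), (1, 2, 5.4), (1, 3, 5.9), (1, 4, 6.3),
    (2, 1, 4.8), (2, 2, 5.0),
    (3, 1, 6.2),
]

# One column per visit that occurs anywhere in the data
visits = sorted({visit for _, visit, _ in long_data})

# Reshape to wide format: one row per patient, None where no measurement exists
wide = {}
for pid, visit, value in long_data:
    wide.setdefault(pid, {v: None for v in visits})[visit] = value

n_cells = len(wide) * len(visits)
n_empty = sum(val is None for row in wide.values() for val in row.values())
print(f"{n_empty} of {n_cells} wide-format cells would have to be imputed")
```

None of these empty cells is needed to fit a mixed model to the long-format data, and the same inflation would occur for every time-varying covariate.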

---

## Imputation in Wide Format?

* may **not** be **feasible**
* if feasible:
  * may **require summarizing** the data
  * can be (very) **inefficient**

--

**Alternatives?**

---

## Imputation in Wide Format

.pull-left[
<img src="figures/p_first.png" style="width:100%">
]
--
.pull-right[
<img src="figures/p_mean.png" style="width:100%">
]

---

## Imputation in Wide Format

.pull-left[
<img src="figures/p_rd.png" style="width:100%">
]
--
.pull-right[
<img src="figures/p_ri.png" style="width:100%">
]

---
count: false

## Imputation in Wide Format

---

## Imputation in Wide Format

.flex-grid[
.col[
<table class="data-table">
<tr>
<th></th>
<th style = "color: var(--nord10);">$\mathbf b_0$</th>
<th style = "color: var(--nord10);">$\mathbf b_1$</th>
<th style = "color: var(--nord10);">$\mathbf b_2$</th>
<th>$\mathbf x_1$</th>
<th>$\mathbf x_2$</th>
<th>$\mathbf x_3$</th>
</tr>
<tr><td></td><td colspan = "6" style = "padding: 0px;"><hr /></td></tr>
<tr>
<td class="rownr"></td>
<td style = "color: var(--nord10);"></td>
<td style = "color: var(--nord10);"></td>
<td style = "color: var(--nord10);"></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td class="rownr"></td>
<td style = "color: var(--nord10);"></td>
<td style = "color: var(--nord10);"></td>
<td style = "color: var(--nord10);"></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td></td>
</tr>
<tr class = "hlgt-row">
<td class="rownr">$i$</td>
<td style = "color: var(--nord10);"></td>
<td style = "color: var(--nord10);"></td>
<td style = "color: var(--nord10);"></td>
<td></td>
<td></td>
<td style="color: var(--nord15);"></td>
</tr>
<tr>
<td class="rownr"></td>
<td style = "color: var(--nord10);"></td>
<td style = "color: var(--nord10);"></td>
<td style = "color: var(--nord10);"></td>
<td style="color: var(--nord15);"></td>
<td style="color: var(--nord15);"></td>
<td></td>
</tr>
<tr>
<td class="rownr"></td>
<td style = "color: var(--nord10);"></td>
<td style = "color: var(--nord10);"></td>
<td style = "color: var(--nord10);"></td>
<td style="color: var(--nord15);"></td>
<td></td>
<td style="color: var(--nord15);"></td>
</tr>
<tr>
<td class="rownr"></td>
<td style = "color: var(--nord10);"></td>
<td style = "color: var(--nord10);"></td>
<td style = "color: var(--nord10);"></td>
<td></td>
<td style="color: var(--nord15);"></td>
<td></td>
</tr>
<tr>
<td class = "rownr"></td>
<td style = "color: var(--nord10);">$\vdots$</td>
<td style = "color: var(--nord10);">$\vdots$</td>
<td style = "color: var(--nord10);">$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
</tr>
</table>

.footnote[
[Practical (NIHES EL009)](https://nerler.github.io/EP16_Multiple_Imputation/practical/07_Imputation_of_Longitudinal_Data.html#imputation_using_mice)
]
]
]

---

## Imputation in Wide Format

Simple summaries of longitudinal variables: 
* may introduce bias

Summarize longitudinal trajectories using random effects:

<ul class="fa-ul">
<li>more efficient</li>
<li>requires sufficient fit</li>
<li>only for incomplete baseline covariates</li>
</ul>

---

## Non-linear Associations

.pull-left[
**Implied Assumption:** 
Linear association
between `$\color{var(--nord15)}{\mathbf x_1}$` and `$\mathbf y$`:

`$$\color{var(--nord15)}{\mathbf x_1} = 
\beta_0 + \bbox[#3B4252, 2pt]{\beta_1 \mathbf y} +
\beta_2 \mathbf x_2 + \beta_3 \mathbf x_3$$`

]

???

Implied assumption: linear association between the incomplete covariate and the
outcome (and the other covariates)

.pull-right[
 
But what if 
`$$\mathbf y = \theta_0 + 
\bbox[#3B4252, 2pt]{\theta_1 \color{var(--nord15)}{\mathbf x_1} +
\theta_2 \color{var(--nord15)}{\mathbf x_1}^2} +
\theta_3 \mathbf x_2 + \theta_4 \mathbf x_3$$`

]

???

But what if we have a setting where we assume that there is a non-linear 
association, for example quadratic?

---

## Non-linear Associations

???

If we

* correctly assume a non-linear association in the analysis model
* but a linear association in the imputation model

we introduce bias, even if we analyse the imputed data under the correct assumption

---

## Non-linear Associations

With non-linear associations, specifying the **correct imputation model may
not be possible**.

Settings with non-linear associations: 
* (multiple) **transformations** of incomplete variables
* **interactions** with incomplete variables
* **survival models**

???

* In many such settings the correct predictive distribution will not have a 
    closed form

=> we then cannot just specify the imputation model as a simple regression 
model with all other variables in the linear predictor.
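
For the quadratic example, the problem can be made explicit with a short derivation. This is a sketch under simplifying assumptions that are not part of the slides: the other covariates are ignored, the outcome is conditionally normal with residual variance `$\sigma^2$`, and the incomplete covariate is marginally normal with mean `$\mu$` and variance `$\tau^2$`.

```latex
\begin{align*}
y \mid x_1 &\sim N\left(\theta_0 + \theta_1 x_1 + \theta_2 x_1^2,\; \sigma^2\right),
\qquad x_1 \sim N(\mu, \tau^2) \\
p(x_1 \mid y) &\propto p(y \mid x_1)\, p(x_1) \\
&\propto \exp\left\{
  -\frac{\left(y - \theta_0 - \theta_1 x_1 - \theta_2 x_1^2\right)^2}{2\sigma^2}
  -\frac{(x_1 - \mu)^2}{2\tau^2}
\right\}
\end{align*}
```

For `$\theta_2 \neq 0$` the exponent is a polynomial of degree four in `$x_1$`, whereas a normal density has a quadratic exponent. The conditional distribution of `$x_1$` given `$y$` is therefore not normal, and a linear regression of `$x_1$` on `$y$` is not the correct imputation model.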

**Also critical:** 
settings with correlated observations

* **longitudinal data**
* clustered data (e.g. **multi-center studies**)

.footnote[
 [NIHES EL009](https://nerler.github.io/EP16_Multiple_Imputation/slide/10_requirements_for_mice.pdf)
]

---

## (Some) Software

**Multiple Imputation using FCS**
* R package [**mice**](https://amices.org/mice/)
(add-ons: [**miceadds**](https://alexanderrobitzsch.github.io/miceadds/),
[**mitools**](https://cran.r-project.org/web/packages/mitools/index.html))
&emsp;&emsp;&emsp; .small[[online book on **mice**](https://stefvanbuuren.name/fimd/)]
* SPSS: limited functionality

**Alternative Approaches in R**
* package [**smcfcs**](https://CRAN.R-project.org/package=smcfcs)
 * substantive model compatible fully conditional specification
 * hybrid FCS MI & Bayesian approach (usage is similar to **mice**)
 * for survival data & non-linear associations
* package [**jomo**](https://CRAN.R-project.org/package=jomo) (Joint Model Multiple Imputation)
 * fully Bayesian
 * for survival & multi-level data and non-linear associations
* package [**JointAI**](https://nerler.github.io/JointAI/) (Joint Analysis and Imputation)
 * fully Bayesian
 * "standard" settings, survival & multi-level, non-linear associations

---

## Multiple Imputation FAQ

* How many imputed datasets do I need?

* Should we do a complete-case analysis as a sensitivity analysis?

* What % missing values is still ok?

* Can I impute missing values in the response?

* Can I impute missing values in the exposure?

* Which variables do I need to include in the imputation?

* Why do I need to include the response into the imputation models? Won't that
  artificially increase the association?

* How should I report missing data / imputation in a paper?

---

# Thank you for your attention!

<div class="contact">
 <a href="mailto:n.erler@erasmusmc.nl" class="email">n.erler@erasmusmc.nl</a>&emsp;
<a href="https://twitter.com/N_Erler" target="_blank"> N_Erler</a>&emsp;
 <a href="https://github.com/NErler" target="_blank"> NErler</a>&emsp;
 <a href="https://nerler.com" target="_blank"> https://nerler.com</a>
</div>

---