

Working with Missing Data and Imputation
Nicole Erler
Department of Biostatistics
1

Missing Values are a Problem!

1

I'm going to start right at the beginning, and want to demonstrate why missing values are a problem.

  • researchers who thought it was possible to use cases if there was only a single value missing
  • SPSS options that make it seem this is possible

This is a bit theoretical, with lots of math, but don't worry, the math is more for visualization, and the presentation won't be all formulas.

I'll then talk a bit in general about missing data, look at some naive missing data methods, and then we'll take a look at multiple imputation.

Example: Linear Regression

Linear Regression Model:

\begin{eqnarray*} y &=& \beta_0 + \beta_1 \mathbf x_1 + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \boldsymbol\varepsilon\\ &=& \mathbf X^\top \boldsymbol\beta + \boldsymbol\varepsilon \end{eqnarray*}

2

We use linear regression as an example, because there, we can calculate the solution for the regression coefficients by hand with a formula, and, theoretically, wouldn't need a computer to fit the model.

A linear regression model is written as a response y with covariates x, and some regression coefficients \beta, and we have the error terms, \varepsilon.

We can also write this model in matrix notation, ...


Example: Linear Regression

Linear Regression Model:

\begin{eqnarray*} y &=& \beta_0 + \beta_1 \mathbf x_1 + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \boldsymbol\varepsilon\\ &=& \mathbf X^\top \boldsymbol\beta + \boldsymbol\varepsilon \end{eqnarray*}

with

\mathbf y = \begin{pmatrix} y_1\\ y_2\\ y_3\\ y_4\\ y_5 \end{pmatrix} \qquad \mathbf X = \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13}\\ 1 & ? & x_{22} & x_{23}\\ 1 & x_{31} & x_{32} & x_{33}\\ 1 & x_{41} & x_{42} & x_{43}\\ 1 & x_{51} & x_{52} & x_{53} \end{pmatrix} \qquad \boldsymbol\beta = \begin{pmatrix} \beta_0\\ \beta_1\\ \beta_2\\ \beta_3 \end{pmatrix}

2



... and then we have y as a vector, here, as an example for 5 subjects, X is the design matrix, which contains the different covariates in the columns and has a column of 1s for the intercept, and the value for x_1 for the second subject is missing. The regression coefficients \beta are also a vector.

Example: Linear Regression

The Least Squares Estimator

\hat{\boldsymbol\beta} = (\mathbf X^\top\mathbf X)^{-1} \mathbf X^\top \mathbf y

3

The regression coefficients in the linear model are usually estimated using the least squares estimator, and this estimator has a simple formula that depends only on the design matrix X and the response y.

We'll now go through this formula in steps to see how the calculation is impacted by the one missing value in X.


Example: Linear Regression

The Least Squares Estimator

\hat{\boldsymbol\beta} = (\mathbf X^\top\mathbf X)^{-1} \mathbf X^\top \mathbf y


\mathbf X^\top \mathbf X = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ x_{11} & ? & x_{31} & x_{41} & x_{51}\\ x_{12} & x_{22} & x_{32} & x_{42} & x_{52}\\ x_{13} & x_{23} & x_{33} & x_{43} & x_{53} \end{pmatrix} \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13}\\ 1 & ? & x_{22} & x_{23}\\ 1 & x_{31} & x_{32} & x_{33}\\ 1 & x_{41} & x_{42} & x_{43}\\ 1 & x_{51} & x_{52} & x_{53} \end{pmatrix}

3



We start with the product of X^\top and X.

X^\top is the design matrix, but with rows and columns swapped, so that each row is one variable, and each column is one subject.

And we need to multiply these two matrices.

Example: Linear Regression

\mathbf X^\top \mathbf X = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ x_{11} & ? & x_{31} & x_{41} & x_{51}\\ x_{12} & x_{22} & x_{32} & x_{42} & x_{52}\\ x_{13} & x_{23} & x_{33} & x_{43} & x_{53} \end{pmatrix} \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13}\\ 1 & ? & x_{22} & x_{23}\\ 1 & x_{31} & x_{32} & x_{33}\\ 1 & x_{41} & x_{42} & x_{43}\\ 1 & x_{51} & x_{52} & x_{53} \end{pmatrix} = \begin{pmatrix} \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot \end{pmatrix}


\cdot = 1\cdot 1 +1\cdot 1 + 1\cdot 1 + 1\cdot 1 + 1\cdot 1

4

How does matrix multiplication work?

We always multiply one row from the first matrix with a column from the second matrix, and take the sum over all the products of these two vectors.

The result from the first row and first column will then be the top left element in the result matrix.

And because here we have the intercept column multiplied with itself, we get a sum of products of 1s, which is 5 in this case, because we have 5 subjects.

Example: Linear Regression

\mathbf X^\top \mathbf X = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ x_{11} & ? & x_{31} & x_{41} & x_{51}\\ x_{12} & x_{22} & x_{32} & x_{42} & x_{52}\\ x_{13} & x_{23} & x_{33} & x_{43} & x_{53} \end{pmatrix} \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13}\\ 1 & ? & x_{22} & x_{23}\\ 1 & x_{31} & x_{32} & x_{33}\\ 1 & x_{41} & x_{42} & x_{43}\\ 1 & x_{51} & x_{52} & x_{53} \end{pmatrix} = \begin{pmatrix} \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot \end{pmatrix}


\begin{eqnarray*} \cdot &=& 1 \cdot x_{11} + 1\cdot ? + 1\cdot x_{31} + 1\cdot x_{41} + 1\cdot x_{51}\\ &=& x_{11} + ? + x_{31} + x_{41} + x_{51}\\ &=& ? \end{eqnarray*}

5

Then we move on to the second column, and here we multiply again each element with one, so, one times x_{11}, one times the missing value, and so on.

And then we need to sum up all the products, but because one of the summands is unknown, the sum will also be unknown.

Example: Linear Regression

\mathbf X^\top \mathbf X = \begin{pmatrix} \cdot & ? & \cdot & \cdot\\ ? & ? & ? & ?\\ \cdot & ? & \cdot & \cdot\\ \cdot & ? & \cdot & \cdot \end{pmatrix}

6

And when we continue like this, some elements in the result of X^\top X are unknown, indicated by the question marks, while all the values marked with a dot can be calculated.


Example: Linear Regression

\mathbf X^\top \mathbf X = \begin{pmatrix} \cdot & ? & \cdot & \cdot\\ ? & ? & ? & ?\\ \cdot & ? & \cdot & \cdot\\ \cdot & ? & \cdot & \cdot \end{pmatrix}


(\mathbf X^\top \mathbf X)^{-1} = \begin{pmatrix} \cdot & ? & \cdot & \cdot\\ ? & ? & ? & ?\\ \cdot & ? & \cdot & \cdot\\ \cdot & ? & \cdot & \cdot \end{pmatrix}^{-1} = \begin{pmatrix} ? & ? & ? & ?\\ ? & ? & ? & ?\\ ? & ? & ? & ?\\ ? & ? & ? & ? \end{pmatrix}

6



But in the formula for the least squares estimator we have to then take the inverse of this new matrix.

Calculating the inverse by hand is a bit tedious, so I'm not going to go through it step by step. But the result is that we now have unknown values on all positions of the inverted matrix, because the calculations always involve one or more of the unknown elements of the input matrix.
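The same propagation is easy to check numerically. Here is a minimal numpy sketch (my own illustration, not part of the presentation), encoding the missing value as NaN:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 3))])  # intercept + x1, x2, x3
y = rng.normal(size=5)

X[1, 1] = np.nan   # the value of x_1 is missing for subject 2

XtX = X.T @ X
print(XtX)         # row 2 and column 2 of X'X are NaN
# every entry of the inverse depends on these unknown entries, so
# (X'X)^{-1} X'y, and hence the least squares estimate, cannot be computed
```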

Example: Linear Regression

When there are missing values in \mathbf X or \mathbf y we cannot estimate \boldsymbol\beta!!!

⇨ Exclude cases with missing values?

7

And so it is clear, whenever we have missing values in the covariates, we cannot estimate our regression coefficients. And the same goes for missing values in the response y.

And so the logical conclusion would be that we would have to exclude all those cases for which some values are missing, and perform a complete case analysis.

Complete Case Analysis is (usually) a Bad Idea!

8

But, a complete case analysis is in most cases a rather bad idea.

Complete Case Analysis

9

Here is one reason why.

You see on the y-axis the proportion of complete cases in a dataset, and on the x-axis the number of incomplete variables. Each line represents a different proportion of missing values per variable.

So, if we had 10% missing values in 25 variables, we'd end up with only 7% of the original sample size. And if we had 10% missing values in 10 variables, we'd have 35% of our data left over in a complete case analysis.

Even if we had only 2% missing in only 5 variables, we could lose 10% of the data.
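These numbers follow from a simple calculation: if each of k variables independently has a proportion p of missing values, the expected proportion of complete cases is (1 - p)^k. A quick check (assuming independent missingness across variables):

```python
# expected proportion of complete cases: (1 - p)**k
for p, k in [(0.10, 25), (0.10, 10), (0.02, 5)]:
    print(f"p = {p:.0%}, k = {k:2d} variables -> {(1 - p)**k:.0%} complete cases")
# p = 10%, k = 25 variables -> 7% complete cases
# p = 10%, k = 10 variables -> 35% complete cases
# p = 2%,  k =  5 variables -> 90% complete cases
```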

Complete Case Analysis

Complete Case Analysis is

  • inefficient
  • usually biased


For Example:

10

So it is clear, complete case analysis is very inefficient. In many cases we'll lose quite a bit of data.

Moreover, complete case analysis is biased in most settings. There are a few very specific exceptions, depending on what kind of model you use, where the missing values are, and why they are missing.

Missing Data & Imputation

11

And, so, for most methods to handle missing values we can't make a general statement that will always be true.

For the impact of a method there are a number of relevant aspects.

Missing Values

Relevant for the choice / impact of methods:

  • How much is missing?
    • per variable
    • per subject
    • complete cases
  • How much information is available?
    • sample size
    • relevant covariates
    • strength of association
12

The first question we usually ask ourselves is how much is actually missing in the data. And we can distinguish between the proportion or number of missing values per variable or per subject.

And, as we've seen, we might also need to check what that means for the number of complete cases.

But what I sometimes find even more relevant is how much information is available: again with respect to the number of observations per variable and per subject, but also whether there are relevant covariates that are associated with the variables that have missing values, how strong these associations are, and whether these other variables are observed for the cases with missing values.


Missing Values

Relevant for the choice / impact of methods:

  • How much is missing?
    • per variable
    • per subject
    • complete cases
  • How much information is available?
    • sample size
    • relevant covariates
    • strength of association
  • Where are values missing?

    • response
    • covariates
  • Why are values missing?
    ⇨ Missing Data Mechanism

12



We also need to distinguish between missing values in covariates and the response, and we need to think about, and make assumptions about, why the values are missing, meaning the missing data mechanism.

Missing Data Mechanisms

Missing Completely At Random (MCAR)

\mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) = \mathrm{Pr}(\text{value missing})

⇨ no systematic difference between complete and incomplete cases

13

For the missing data mechanism there is a specific terminology.

First, we can have data that is "missing completely at random". Missing completely at random means that the probability of a value being missing does not depend on anything; it is completely random and has nothing to do with what we are investigating in our study.

This means that there are no systematic differences between complete and incomplete cases.


Missing Data Mechanisms

Missing Completely At Random (MCAR)

\mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) = \mathrm{Pr}(\text{value missing})

⇨ no systematic difference between complete and incomplete cases


Missing At Random (MAR)

\mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) = \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs})

13



Then, we have missing at random. In missing at random the assumption is that the probability of a value being missing depends on other things, but only on things that we have measured in our data and that are actually observed.


Missing Data Mechanisms

Missing Completely At Random (MCAR)

\mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) = \mathrm{Pr}(\text{value missing})

⇨ no systematic difference between complete and incomplete cases


Missing At Random (MAR)

\mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) = \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs})

Missing Not At Random (MNAR)

\mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) \neq \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs})

13


The last missing data mechanism is missing not at random, and here the probability that a value is missing depends on things that we have not measured or that are themselves missing. This can be the missing value itself, missing values in other variables, or things that we haven't measured at all.

There is no way of testing whether we are dealing with MNAR or MAR. We will always need to make an assumption about whether we have MNAR data.

Sometimes you can read in clinical papers that they assumed "random missingness", or "that the missing values are random". I assume they refer to either MCAR or MAR, but it isn't clear which one, and it can make a very important difference whether you have MCAR or MAR.
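To make the three definitions tangible, here is a small simulation of my own (illustration only, with made-up parameters): the probability that x is missing is constant (MCAR), depends on a fully observed covariate z (MAR), or depends on x itself (MNAR).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
z = rng.normal(size=n)                    # fully observed covariate
x = 0.5 * z + rng.normal(size=n)          # variable that will be incomplete

mechanisms = {
    "MCAR": np.full(n, 0.3),              # constant probability
    "MAR":  1 / (1 + np.exp(-z)),         # depends only on observed z
    "MNAR": 1 / (1 + np.exp(-x)),         # depends on x itself
}
for name, p in mechanisms.items():
    observed = rng.uniform(size=n) >= p
    print(f"{name}: mean of observed x = {x[observed].mean():+.3f}")
# only under MCAR is the observed mean close to the true mean (0); under
# MAR the difference can be corrected using z, under MNAR it cannot
```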

Some Examples

  • Data is collected by questionnaire ⇨ some got lost in the mail
14

Let's look at a few examples to see which type of missing data mechanism we might have.

Say, we have a study for which we have collected data using a questionnaire. Some of the questionnaires were filled in, but on the way back they got lost in the mail.

Which type of missing data mechanism would that be?


If this is a study in the Netherlands we could probably argue that this is MCAR. But if we were performing a study in various areas in, say, Africa, and the postal service in the rural areas is much more unreliable than in the cities, and there are other factors that are of interest in our study that also differ between rural areas and cities, we won't have MCAR any more.


Some Examples

  • Data is collected by questionnaire ⇨ some got lost in the mail

  • A particular biomarker was not part of the standard panel before 2008
    ⇨ missing for many patients who entered < 2008

14



Another example. Imagine a particular biomarker was not part of the standard blood panel before 2008. And so, for most of the patients who entered the study before 2008 this value is missing, but for people who entered later it is mostly observed.

Which missing data mechanism do we have?

If we know the year of inclusion, then we'd have MAR.


Some Examples

  • Data is collected by questionnaire ⇨ some got lost in the mail

  • A particular biomarker was not part of the standard panel before 2008
    ⇨ missing for many patients who entered < 2008

  • In a survey in pregnant women some do not fill in the answer to "Are you currently smoking?"

14



Another example. We have a survey that we send out to pregnant women. One of the questions is if they are currently smoking. What mechanism would you expect for the missing values in that variable?

....


Some Examples

  • Data is collected by questionnaire ⇨ some got lost in the mail

  • A particular biomarker was not part of the standard panel before 2008
    ⇨ missing for many patients who entered < 2008

  • In a survey in pregnant women some do not fill in the answer to "Are you currently smoking?"

  • Same survey: missing values in "chocolate consumption".

14



In the same survey, we also ask about the women's daily chocolate consumption. What about the missing values in this variable?


Some Examples

  • Data is collected by questionnaire ⇨ some got lost in the mail

  • A particular biomarker was not part of the standard panel before 2008
    ⇨ missing for many patients who entered < 2008

  • In a survey in pregnant women some do not fill in the answer to "Are you currently smoking?"

  • Same survey: missing values in "chocolate consumption".


MCAR / MAR / MNAR are NOT a property of the data but of a model.

14



As you see, the missing data mechanism is actually not a property of the data itself, but rather of the model that we use to fit the data or to impute it.

And to make suitable assumptions you need expert knowledge on how the data was measured. This is not something that the statistician can determine.

Understanding Missing Values

  • there is uncertainty about the missing value
15

The important issue in imputing missing values is that there is uncertainty about what the value would have been. And so we can't just pick one value and fill it in, because then we would just ignore this uncertainty.

If the value of height is missing for one patient, we don't know what that value would have been.


Understanding Missing Values

  • there is uncertainty about the missing value
  • some values are more likely than others
15



Also: some values are going to be more likely than others, and usually there is a relationship between the variable that has missing values and the other data that we have collected.

For the missing value in height we could expect something around 1.70 / 1.80m. And values of 1.50m and 2.10m are possible, but less likely.


Understanding Missing Values

  • there is uncertainty about the missing value
  • some values are more likely than others

⇨ missing values have a distribution

15



So, in statistical terms, we can say that missing values have a distribution.


Understanding Missing Values

  • there is uncertainty about the missing value
  • some values are more likely than others

⇨ missing values have a distribution

  • there is a relationship with other available data


Predictive distribution of the missing values given the observed values. p(x_{mis}\mid\text{everything else})

15



Moreover, there usually is some relationship between the missing value and other data that we have collected. If we know that the missing value in height is from a male, larger values become more likely and smaller values less likely.

This means that we need a model to learn how the incomplete variable is related to the other data.

This model, together with an assumption about the type of distribution the missing value has, then allows us to specify the distribution from which we should sample values to impute the missing value. We call this the predictive distribution. And the predictive distribution is generally conditional on everything else, including all other data and parameters.

A Simple Example

[Table: data on \mathbf y, \mathbf x_1, \mathbf x_2, \mathbf x_3; the value of \mathbf x_1 is missing for subject i]

  • \mathbf y: response
  • \mathbf x_1: incomplete covariate
  • \mathbf x_2, \mathbf x_3: complete covariates

Predictive distribution:

p(\mathbf x_1 \mid \mathbf y, \mathbf x_2, \mathbf x_3, \boldsymbol\beta, \sigma)


16

Let's look at a simple example. Imagine, we have the following dataset, where we have a completely observed response variable y, a variable x_1 that is missing for patient i, and two other covariates that are completely observed.

And so the predictive distribution that we need to sample the imputed value from would be the distribution of x_1, given the response y, the other covariates, and some parameters.


A Simple Example

[Table: data on \mathbf y, \mathbf x_1, \mathbf x_2, \mathbf x_3; the value of \mathbf x_1 is missing for subject i]

  • \mathbf y: response
  • \mathbf x_1: incomplete covariate
  • \mathbf x_2, \mathbf x_3: complete covariates

Predictive distribution:

p(\mathbf x_1 \mid \mathbf y, \mathbf x_2, \mathbf x_3, \boldsymbol\beta, \sigma)


For example:

  • Fit a model to the cases with observed \mathbf x_1: \mathbf x_1 = \beta_0 + \beta_1 \mathbf y + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \boldsymbol\varepsilon
16



For example, we could think of this as fitting a regression model with x_1 as the dependent variable, and y & the other covariates as independent variables.

We can then fit this model to all those cases for which we have x_1 observed,...


A Simple Example

[Table: data on \mathbf y, \mathbf x_1, \mathbf x_2, \mathbf x_3; the value of \mathbf x_1 is missing for subject i]

  • \mathbf y: response
  • \mathbf x_1: incomplete covariate
  • \mathbf x_2, \mathbf x_3: complete covariates

Predictive distribution:

p(\mathbf x_1 \mid \mathbf y, \mathbf x_2, \mathbf x_3, \boldsymbol\beta, \sigma)


For example:

  • Fit a model to the cases with observed \mathbf x_1: \mathbf x_1 = \beta_0 + \beta_1 \mathbf y + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \boldsymbol\varepsilon

  • Estimate parameters \boldsymbol{\hat\beta}, \hat\sigma
    ⇨ define distribution p(x_{i1} \mid y_i, x_{i2}, x_{i3}, \boldsymbol{\hat\beta}, \hat\sigma)

16



... in order to estimate the parameters, and to learn what the distribution of x_1 conditional on the other data looks like.

And then we can use this information to specify the predictive distribution for the cases with missing x_1 and sample imputed values from this distribution.
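As a concrete sketch of these two steps, here is my own minimal numpy illustration with simulated data (not code from the presentation):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
y, x2, x3 = rng.normal(size=(3, n))
x1 = 1 + 0.5 * y - 0.3 * x2 + 0.2 * x3 + rng.normal(scale=0.8, size=n)
x1[rng.uniform(size=n) < 0.3] = np.nan        # 30% of x1 missing

obs = ~np.isnan(x1)
Z = np.column_stack([np.ones(n), y, x2, x3])  # design of the imputation model
beta_hat, *_ = np.linalg.lstsq(Z[obs], x1[obs], rcond=None)
resid = x1[obs] - Z[obs] @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (obs.sum() - Z.shape[1]))

# predictive distribution for a case i with missing x1:
#   x1_i ~ Normal(Z[i] @ beta_hat, sigma_hat**2)
```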

Imputation of Missing Values

17

We can visualize this for the case where we only have two variables, one incomplete, shown on the y-axis, and one complete, shown on the x-axis. So this is a visualization of the imputation model.

In practice we will of course have more variables, but then I couldn't show it in a simple plot any more. So this is really just to get the idea.

We know the value of the other variable, so we know where our incomplete cases are on the x-axis, but we don't know where to place them on the y-axis. Therefore I only marked them as empty circles on the x-axis here.

Imputation of Missing Values

17

When we now fit a model on the observed cases we can represent that as the corresponding regression line.

Imputation of Missing Values

17

jump to regression imputation

What happens when we now plug in the observed variables of our incomplete case in the estimated model, is that we get the fitted values, meaning the corresponding values on the regression line.

Could we now just take those values as our imputed values? We fitted the model on the complete cases, and then we predicted the value of the incomplete variable from that model.

Imputation of Missing Values

Important: We need to take into account the uncertainty!

18

Not quite. We can't just use the fitted value to impute the missing value, because there is uncertainty that we haven't taken into account.


Imputation of Missing Values

Important: We need to take into account the uncertainty!

about the parameter estimates

18



There is uncertainty about the parameter estimates in the imputation model. Because our data is just a sample, we don't know the true parameters. With a different sample, we'd get a slightly different regression line.

Imputation of Missing Values

Important: We need to take into account the uncertainty!

about the parameter estimates

about the fitted/predicted value \color{var(--nord15)}{\mathbf{\hat{x}_1}}

18


And there is uncertainty about the values themselves. In the observed data, the data points are not exactly on the regression line, but spread around it. So we'd expect the same for the missing values. Using the fitted values, the values on the regression line, would ignore this random variation that we have in the data.

This is the part where we assume that the missing values have a distribution. This distribution is the random variation around the expected value.

Imputation of Missing Values

We want:
Imputation from the predictive distribution p(\color{var(--nord15)}{x_{mis}} \mid \text{everything else}).


Idea:
Use a "prediction" model.


Take into account:

  • uncertainty in parameter estimates \boldsymbol{\hat\beta}
  • prediction error (\mathbf{\hat x}_{mis} \neq \mathbf x_{mis})
  • missing values have a distribution ⇨ we can't just replace them with one value.
19

So, in summary, what have we seen so far?

We want to impute missing values from the predictive distribution of the missing value given everything else.

The idea is to do that via a prediction model.

But we need to take into account that we have multiple sources of uncertainty or variation:

  • uncertainty about the parameters in the imputation model
  • random variation of the unknown values (also called prediction error)
  • and we need to take into account that there is uncertainty about the missing value, so that we can't represent a missing value by one single imputed value because that would not capture that uncertainty (the additional uncertainty that we have compared to an observed value)

Naive Ways to Handle Missing Data

20

So, with this knowledge about missing data and all the things that we need to take into account, let's have a look at some naive, but unfortunately still used, methods to handle missing data.

Naive Ways to Handle Missing Data

21

We are now looking at the data that we would use for the actual analysis of interest, and the regression line from that analysis model. So on the x-axis we now have the incomplete covariate and on the y-axis the response, which we assume is fully observed.

The cases for which the covariate is observed are drawn as white dots, the cases for which the covariate is missing as empty purple circles. The correct regression line, which we would get if we didn't have any missing values, is shown as the dashed line.

Complete Case Analysis

22

In a complete case analysis, the regression line would be calculated just based on the white data points. Because the missing values are not missing completely at random, but values are more likely to be missing for larger response values, the estimated line now is lower than the true line.

Mean Imputation

23

The first imputation method that I'll show here is mean imputation. All missing values in the covariate are filled in with the mean of the observed values of that covariate. This is shown here with the filled purple dots. You can clearly see that they are not a good representation of the distribution of the true but missing values.

The corresponding regression line, shown as the solid white line, is closer to the true line than for complete case analysis, but is flatter than the true line.
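In code, mean imputation is a one-liner, which is exactly the problem: every gap gets the same value. A tiny illustration with made-up numbers:

```python
import numpy as np

x1 = np.array([1.2, np.nan, 0.7, np.nan, 1.9, 0.4])
x1_imp = np.where(np.isnan(x1), np.nanmean(x1), x1)
# both missing entries become 1.05: no spread, distorted distribution
```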

Missing Indicator Method

24

The second missing data method is the missing indicator method. The idea here is to replace the missing values with a fixed value, for example zero. And, to distinguish the incomplete cases from the complete cases we additionally add an indicator variable that is zero for observed cases and one for incomplete cases.

As for mean imputation, we see that the imputed values do not at all represent the spread of the missing values. Because of the indicator variable we now get two regression lines, one for observed and one for incomplete cases, but they have the same slope, which seems to be similar to the slope of the true regression line.
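A sketch of the construction, again with made-up numbers: the missing values are set to a fixed value, and a 0/1 indicator is added as an extra covariate.

```python
import numpy as np

x1 = np.array([1.2, np.nan, 0.7, np.nan, 1.9, 0.4])
r = np.isnan(x1).astype(float)               # indicator: 1 = value was missing
x1_filled = np.where(np.isnan(x1), 0.0, x1)  # fill with a fixed value (0)
# the analysis model then includes both x1_filled and r as covariates,
# which is what produces the two parallel regression lines in the plot
```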

Regression Imputation

25

jump to regline with predicted values

Next, we have regression imputation. The idea here is to impute based on a prediction model, like we saw before, but to just use the fitted values from that prediction. Because this method also does not take into account the random variation, we see that all imputed values are on one straight line.

And again we see that the model fitted on the imputed data results in a regression line with different slope than the true line, now with a steeper slope.

Single Imputation

26

In single imputation we now improve upon the regression imputation by taking into account both the uncertainty about the parameters in the imputation model and the random variation. And we can see that the imputed values have a distribution that is much more similar to the distribution of the missing values.

The corresponding regression line is also almost identical to the true line.
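In code, the difference between regression imputation and single (stochastic) imputation is one extra term. This sketch reuses the quantities (Z, obs, beta_hat, sigma_hat, rng) from the numpy example after the "Simple Example" slide:

```python
# regression imputation: fitted values only -> every imputation lies
# exactly on the regression line
x1_regression = Z[~obs] @ beta_hat

# single (stochastic) imputation: add a draw of the residual error
# (ideally beta_hat is also redrawn to reflect parameter uncertainty)
x1_single = Z[~obs] @ beta_hat + rng.normal(0, sigma_hat, (~obs).sum())
```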

Naive Ways to Impute Missing Values

27

Here I have an overview of the parameter estimate of the incomplete covariate and the corresponding 95% confidence interval. So this is the slope that we saw for all the different methods.

On top is the value for the complete data, and to make the comparison easier I have added a shaded area that has the width of the 95% CI from the complete data analysis.

Of course, because this is just the results from one very simple example, we can't draw any conclusions about how much bias we get from which method and how they compare in general. This was just to visualize a bit what happens when you use one of these naive methods.

We see that the different methods disagree quite a bit in their estimates. The single imputation comes closest, but when we take a closer look at the CI we see that it is actually a bit narrower than the true CI. In the example I used here, I have a bit more than 50% missing values, so we should have quite a bit of additional uncertainty compared to the complete data.

The single imputation approach clearly underestimates the uncertainty that we have about the effect of the covariate.

Single Imputation

Can take into account

  • uncertainty in parameter estimates \boldsymbol{\hat\beta}
  • prediction error (\mathbf{\hat x}_{mis} \neq \mathbf x_{mis})

But:

Single imputation does not take into account the uncertainty about the imputed value!

28

In the single imputation we did take into account two of the sources of uncertainty or variation, but we only have one imputed value. With just one single value, we have no way of taking into account the added uncertainty that we have about the imputed value compared to an observed value.

Multiple Imputation

29

And this is why Donald Rubin came up with the idea of multiple imputation.

Multiple Imputation

30

Multiple Imputation

31

Multiple Imputation

MI was developed in the 1960s/70s...


Requirements

  • computationally feasible
  • "fix" the missing data problem once / centrally
    ⇨ distribute imputed data to other researchers
32

Multiple Imputation

33

The idea behind multiple imputation is that, using this principle, we sample imputed values and fill them into the original, incomplete data to create a completed dataset.

And in order to take into account the uncertainty that we have about the missing values, we do this multiple times, so that we obtain multiple completed datasets.

Because all the missing values have now been filled in, we can analyse each of these datasets separately with standard statistical techniques.

To obtain overall results, the results from each of these analyses need to be combined in a way that takes into account both the uncertainty that we have about the estimates from each analysis, and the variation between these estimates.
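Schematically, the workflow looks as follows; impute_once() and analyse() are hypothetical stand-ins for drawing one completed dataset and fitting the analysis model, not functions from a real library.

```python
# Schematic MI workflow (sketch only; impute_once() and analyse() are
# hypothetical stand-ins, not functions from an actual library)
m = 20                                         # number of imputed datasets
results = []
for l in range(m):
    completed = impute_once(incomplete_data)   # one completed dataset
    results.append(analyse(completed))         # fit the analysis model
# the m sets of estimates and variances are then pooled with Rubin's rules
# (see the pooling formulas below)
```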

Multiple Imputation

34

Multiple Imputation: Pooling

Pooled Parameter Estimate:
\bar\beta = \frac{1}{m}\sum_{\ell = 1}^m \hat\beta^{(\ell)} \qquad \text{(average estimate)}

Pooled Variance: T = \bar W + B + B/m

  • \displaystyle\bar W = \frac{1}{m}\sum_{\ell = 1}^m \mathrm{var}(\hat\beta^{(\ell)})    average within imputation variance

  • \displaystyle B = \frac{1}{m - 1}\sum_{\ell = 1}^m (\hat \beta^{(\ell)} - \bar\beta)^2    between imputation variance

35
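These pooling formulas translate directly into a few lines of Python; a straightforward transcription for a single scalar coefficient, with made-up example numbers:

```python
import numpy as np

def pool(estimates, variances):
    """Pool m estimates beta_hat^(l) and their variances with Rubin's rules."""
    est = np.asarray(estimates)
    m = len(est)
    beta_bar = est.mean()              # pooled estimate (average estimate)
    W = np.mean(variances)             # average within-imputation variance
    B = est.var(ddof=1)                # between-imputation variance
    return beta_bar, W + B + B / m     # pooled estimate and total variance T

# made-up example: estimates and variances from m = 5 imputed datasets
beta_bar, T = pool([0.42, 0.47, 0.39, 0.45, 0.44],
                   [0.010, 0.012, 0.011, 0.009, 0.010])
print(beta_bar, np.sqrt(T))            # pooled estimate and standard error
```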

Multiple Imputation

36

Multivariate Missingness

37

In Practice

Multivariate Missingness

[Table: data on \mathbf y, \mathbf x_1, \mathbf x_2, \mathbf x_3, \ldots; missing values occur in several variables and subjects]

Predictive distributions

based on models
\begin{alignat}{10} \mathbf x_1 &= \beta_0 &+& \beta_1 \mathbf y &+& \beta_2 \mathbf x_2 &+& \beta_3 \mathbf x_3 &+& \ldots \\ \mathbf x_2 &= \alpha_0 &+& \alpha_1 \mathbf y &+& \alpha_2 \mathbf x_1 &+& \alpha_3 \mathbf x_3 &+& \ldots\\ \mathbf x_3 &= \theta_0 &+& \theta_1 \mathbf y &+& \theta_2 \mathbf x_1 &+& \theta_3 \mathbf x_2 &+& \ldots \end{alignat}


Most common approach:
MICE (multivariate imputation by chained equations)
FCS (fully conditional specification)

38

And the most common approach to imputation in this setting is MICE, short for multivariate imputation by chained equations, an approach that is also called fully conditional specification.

The principle is an extension to what we've seen on the previous slides. We impute missing values using models that have all other data in their linear predictor.


MICE / FCS

Iterative:

  • start with random draws from the observed data

  • cycle through the models to update the imputed values

  • until convergence

⇨ keep only last imputed value

39

Because in these imputation models we now have incomplete covariates, we use an iterative algorithm. We start by randomly drawing starting values from the observed part of the data, and then we cycle through the incomplete variables and impute one at a time.


MICE / FCS

Iterative:

  • start with random draws from the observed data

  • cycle through the models to update the imputed values

  • until convergence

⇨ keep only last imputed value

Flexible model types
choose a different type of model per incomplete variable

39



The models for the different variables can be specified according to the type of variable.

Once we have imputed each missing value, we start again with the first variable, but now use the imputed values of the other variables instead of the starting values, and we do this a few times until the algorithm has converged.
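A schematic sketch of this loop, assuming each incomplete variable is imputed with a normal linear model as in the earlier example. Real implementations (e.g. the R package mice) also redraw the imputation-model parameters via Bayesian draws or bootstrap, which this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(0)

def impute_one(data, j, mis):
    # illustrative draw: linear model of column j on all other columns,
    # fit on the rows where column j is observed
    Z = np.column_stack([np.ones(data.shape[0]), np.delete(data, j, axis=1)])
    beta, *_ = np.linalg.lstsq(Z[~mis], data[~mis, j], rcond=None)
    resid = data[~mis, j] - Z[~mis] @ beta
    sigma = np.sqrt(resid @ resid / max(1, (~mis).sum() - Z.shape[1]))
    return Z[mis] @ beta + rng.normal(0, sigma, mis.sum())

def fcs_one_dataset(data, n_iter=10):
    data = data.copy()
    missing = {j: np.isnan(data[:, j]) for j in range(data.shape[1])}
    # start with random draws from the observed values of each variable
    for j, mis in missing.items():
        if mis.any():
            data[mis, j] = rng.choice(data[~mis, j], size=mis.sum())
    # cycle through the incomplete variables, updating the imputed values
    for _ in range(n_iter):
        for j, mis in missing.items():
            if mis.any():
                data[mis, j] = impute_one(data, j, mis)
    return data   # only the final imputed values are kept
```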

Missing Values

Relevant for the choice / impact of methods:

  • How much is missing?
    • per variable
    • per subject
    • complete cases
  • How much information is available?
    • sample size
    • relevant covariates
    • strength of association
  • Where are values missing?

    • response
    • covariates
  • Why are values missing?
    ⇨ Missing Data Mechanism

40


Considerations for the Use of FCS MI

How much is missing / how much information is available?

[Table: data on \mathbf y, \mathbf x_1, \mathbf x_2, \mathbf x_3, \ldots; \mathbf x_1 is missing for most subjects, including subject i]

Imputation of \mathbf x_1 based on: \mathbf x_1 = \beta_0 + \beta_1 \mathbf y + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \ldots

  • Fit model on cases with observed \mathbf x_1
  • Predict missing \mathbf x_1


Scenario 1:  N = 200,  90% of \mathbf x_1 is missing
⇨ N = 20 to estimate \boldsymbol\beta


Scenario 2:  N = 5000,  90% of \mathbf x_1 is missing
⇨ N = 500 to estimate \boldsymbol\beta
41

Considerations for the Use of FCS MI

Relevant covariates / strength of association

[Table: data on \mathbf y, \mathbf x_1, \mathbf x_2, \mathbf x_3, \ldots; \mathbf x_1 is missing for most subjects]

Imputation of \mathbf x_1 based on: \mathbf x_1 = \beta_0 + \beta_1 \mathbf y + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \ldots

Say, \mathbf x_1 is bilirubin.


Scenario 1:
other covariates:
  • age
  • gender
  • eye color
Scenario 2:
other covariates:
  • age
  • gender
  • height
  • weight
  • family history
  • comorbidities
  • creatinine
  • AST, ALT, ALP
  • MELD
  • ...
42

Considerations for the Use of FCS MI

Where are values missing?

[Table: data on \mathbf y, \mathbf x_1, \mathbf x_2, \mathbf x_3, \ldots; \mathbf y is missing for several subjects]

Imputation Model for \mathbf y: \mathbf y = \alpha_0 + \alpha_1 \mathbf x_1 + \alpha_2 \mathbf x_2 + \alpha_3 \mathbf x_3 + \varepsilon_y

  • fit on cases with observed \mathbf y ⇨ \boldsymbol{\hat\alpha}
  • predict missing \mathbf y using \boldsymbol{\hat\alpha}
    ⇨ imputed cases will always have estimates equal to \boldsymbol{\hat\alpha}

Analysis Model

  • estimates in observed part: \boldsymbol{\hat\alpha}
  • estimates in imputed part: \boldsymbol{\hat\alpha}
    ⇨ same results as in imputation model
43

Considerations for the Use of FCS MI

Missing values in the response:

If analysis model = imputation model
\boldsymbol{\hat\beta} = \boldsymbol{\hat\alpha}
⇨ No point in imputing responses


Auxiliary variables:
⇨ analysis model \neq imputation model
\boldsymbol{\hat\beta} \neq \boldsymbol{\hat\alpha}
⇨ Imputing responses can be beneficial

44

Considerations for the Use of FCS MI

Why are values missing?

Imputation of \mathbf x_1 based on:

\mathbf x_1 = \beta_0 + \beta_1 \mathbf y + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \ldots

  • Fit model on cases with observed \mathbf x_1
  • Predict missing \mathbf x_1

⇨ Imputed \mathbf x_1 will have the same distribution as observed \mathbf x_1 with the same values of all other variables.

⇨ FCS MI is valid under MAR

45

FCS MI in Practice

  • valid under MAR
    imputation models need to contain the important predictors in the right form

  • allows us to take into account

    • uncertainty about missing value
      if we use enough imputed datasets
    • uncertainty about parameters in imputation model
      requires Bayes or Bootstrap
    • prediction error
      requires Bayes, or predictive mean matching with appropriate settings
  • Imputation models need to fit the data
    • no contradiction between imputation models
    • no contradiction between imputation models and analysis model(s)
46

Non-linear Associations

Implied Assumption:
Linear association between \mathbf x_1 and \mathbf y:

\mathbf x_1 = \beta_0 + \beta_1 \mathbf y + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3

47

Implied assumption: linear association between the incomplete covariate and the outcome (and the other covariates)

Non-linear Associations

Implied Assumption:
Linear association between \mathbf x_1 and \mathbf y:

\mathbf x_1 = \beta_0 + \beta_1 \mathbf y + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3


But what if \mathbf y = \theta_0 + \theta_1 \mathbf x_1 + \theta_2 \mathbf x_1^2 + \theta_3 \mathbf x_2 + \theta_4 \mathbf x_3 ?

47


But what if we have a setting where we assume that there is a non-linear association, for example quadratic?

Non-linear Associations

  • true association: non-linear
  • imputation assumption: linear

⇨ bias!

48

If we

  • correctly assume a non-linear association in the analysis model
  • but a linear association in the imputation model

we introduce bias, even if we analyse the imputed data under the correct assumption.
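A small simulation of my own (assumed parameters, illustration only): the outcome is generated with a quadratic effect of x1, x1 is imputed from a linear model, and the quadratic analysis of the singly imputed data then gives a distorted estimate of the quadratic coefficient.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 5_000
x1 = rng.normal(size=n)
y = 1 + 1.0 * x1 + 0.5 * x1**2 + rng.normal(size=n)

mis = rng.uniform(size=n) < 0.5                 # 50% MCAR missingness in x1

# linear single imputation of x1 from y (the misspecified imputation model)
Zo = np.column_stack([np.ones((~mis).sum()), y[~mis]])
b, *_ = np.linalg.lstsq(Zo, x1[~mis], rcond=None)
resid = x1[~mis] - Zo @ b
sigma = np.sqrt(resid @ resid / (len(resid) - 2))
x1_imp = x1.copy()
x1_imp[mis] = b[0] + b[1] * y[mis] + rng.normal(0, sigma, mis.sum())

# quadratic analysis model fit on the singly imputed data
Z = np.column_stack([np.ones(n), x1_imp, x1_imp**2])
theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(theta)   # true values (1, 1.0, 0.5); the quadratic term is biased
```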

Non-linear Associations

With non-linear associations specification of the correct imputation model may not be possible.

Settings with non-linear associations:

  • (multiple) transformations of incomplete variables
  • interactions with incomplete variables
  • survival models
49
  • In many such settings the correct predictive distribution will not have a closed form

=> we then cannot just specify the imputation model as a simple regression model with all other variables in the linear predictor.

Non-linear Associations

With non-linear associations specification of the correct imputation model may not be possible.

Settings with non-linear associations:

  • (multiple) transformations of incomplete variables
  • interactions with incomplete variables
  • survival models


Also critical:
settings with correlated observations

  • longitudinal data
  • clustered data (e.g. multi-center studies)
49

Non-linear Associations

With non-linear associations specification of the correct imputation model may not be possible.

Settings with non-linear associations:

  • (multiple) transformations of incomplete variables
  • interactions with incomplete variables
  • survival models


Also critical:
settings with correlated observations

  • longitudinal data
  • clustered data (e.g. multi-center studies)

⇨ Bayes

49


Multiple Imputation FAQ

  • How many imputed datasets do I need?

  • Should we do a complete case analysis as a sensitivity analysis?

  • What % missing values is still ok?

  • Can I impute missing values in the response?

  • Can I impute missing values in the exposure?

  • Which variables do I need to include in the imputation?

  • Why do I need to include the response into the imputation models? Won't that artificially increase the association?

  • How should I report missing data / imputation in a paper?

50

