I'm going to start right at the beginning, and want to demonstrate why missing values are a problem.
This is a bit theoretical, with lots of math, but don't worry, the math is more for visualization, and the presentation won't be all formulas.
I'll then talk a bit in general about missing data, look at some naive missing data methods, and then we'll take a look at multiple imputation.
Linear Regression Model:
\begin{eqnarray*} y &=& \beta_0 + \beta_1 \mathbf x_1 + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \boldsymbol\varepsilon\\ &=& \mathbf X^\top \boldsymbol\beta + \boldsymbol\varepsilon \end{eqnarray*}
We use linear regression as an example, because there, we can calculate the solution for the regression coefficients by hand with a formula, and, theoretically, wouldn't need a computer to fit the model.
A linear regression model is written as a response y with covariates x, and some regression coefficients \beta, and we have the error terms, \varepsilon.
We can also write this model in matrix notation, ...
with
\mathbf y = \begin{pmatrix} y_1\\ y_2\\ y_3\\ y_4\\ y_5 \end{pmatrix} \qquad \mathbf X = \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13}\\ 1 & \color{var(--nord15)}{?} & x_{22} & x_{23}\\ 1 & x_{31} & x_{32} & x_{33}\\ 1 & x_{41} & x_{42} & x_{43}\\ 1 & x_{51} & x_{52} & x_{53} \end{pmatrix} \qquad \boldsymbol\beta = \begin{pmatrix} \beta_0\\ \beta_1\\ \beta_2\\ \beta_3 \end{pmatrix}
... and then we have y as a vector, here, as an example for 5 subjects, X is the design matrix, which contains the different covariates in the columns and has a column of 1s for the intercept, and the value for x_1 for the second subject is missing. The regression coefficients \beta are also a vector.
The Least Squares Estimator
\hat{\boldsymbol\beta} = (\mathbf X^\top\mathbf X)^{-1} \mathbf X^\top \mathbf y
The regression coefficients in the linear model are usually estimated using the least squares estimator, and this estimator has a simple formula that depends only on the design matrix X and the response y.
We'll now go through this formula in steps to see how the calculation is impacted by the one missing value in X.
We start with the product of X^\top and X.
X^\top is the design matrix, but with rows and columns swapped, so that each row is one variable, and each column is one subject.
And we need to multiply these two matrices.
\mathbf X^\top \mathbf X = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ x_{11} & \color{var(--nord15)}{?} & x_{31} & x_{41} & x_{51}\\ x_{12} & x_{22} & x_{32} & x_{42} & x_{52}\\ x_{13} & x_{23} & x_{33} & x_{43} & x_{53} \end{pmatrix} \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13}\\ 1 & \color{var(--nord15)}{?} & x_{22} & x_{23}\\ 1 & x_{31} & x_{32} & x_{33}\\ 1 & x_{41} & x_{42} & x_{43}\\ 1 & x_{51} & x_{52} & x_{53} \end{pmatrix} = \begin{pmatrix} \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot \end{pmatrix}
\cdot = 1\cdot 1 +1\cdot 1 + 1\cdot 1 + 1\cdot 1 + 1\cdot 1
How does matrix multiplication work?
We always multiply one row from the first matrix with a column from the second matrix, and take the sum over all the products of these two vectors.
The result from the first row and first column will then be the top left element in the result matrix.
And because here we have the intercept multiplied with itself, we have the sum over the product of 1s, which is 5 in this case, because we have 5 subjects.
\begin{eqnarray*} \cdot &=& 1 \cdot x_{11} + 1\cdot\color{var(--nord15)}{?} + 1\cdot x_{31} + 1\cdot x_{41} + 1\cdot x_{51}\\ &=& x_{11} + \color{var(--nord15)}{?} + x_{31} + x_{41} + x_{51}\\ &=& \color{var(--nord15)}{?} \end{eqnarray*}
Then we move on to the second column, and here we multiply again each element with one, so, one times x_{11}, one times the missing value, and so on.
And then we need to sum up all the products, but because one of the summands is unknown, the sum will also be unknown.
\mathbf X^\top \mathbf X = \begin{pmatrix} \cdot & \color{var(--nord15)}{?} & \cdot & \cdot\\ \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\ \cdot & \color{var(--nord15)}{?} & \cdot & \cdot\\ \cdot & \color{var(--nord15)}{?} & \cdot & \cdot \end{pmatrix}
And when we continue like this, some elements in the result of X^\top X are unknown, indicated by the question marks, while all the elements marked with a dot can be calculated.
(\mathbf X^\top \mathbf X)^{-1} = \begin{pmatrix} \cdot & \color{var(--nord15)}{?} & \cdot & \cdot\\ \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\ \cdot & \color{var(--nord15)}{?} & \cdot & \cdot\\ \cdot & \color{var(--nord15)}{?} & \cdot & \cdot \end{pmatrix}^{-1} = \begin{pmatrix} \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\ \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\ \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\ \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} \end{pmatrix}
But in the formula for the least squares estimator we have to then take the inverse of this new matrix.
Calculating the inverse by hand is a bit tedious, so I'm not going to go through it step by step. But the result is that we now have unknown values on all positions of the inverted matrix, because the calculations always involve one or more of the unknown elements of the input matrix.
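To make this concrete, here is a small numpy sketch (with made-up numbers) of how a single missing value, stored as NaN, propagates through \mathbf X^\top\mathbf X and leaves the least squares estimate uncomputable:

```python
import numpy as np

# Design matrix for 5 subjects: intercept + 3 covariates;
# x_1 is missing (NaN) for subject 2 -- hypothetical values.
X = np.array([
    [1.0, 2.1,    0.5, 3.2],
    [1.0, np.nan, 1.1, 2.8],
    [1.0, 1.7,    0.9, 3.0],
    [1.0, 2.5,    0.4, 2.5],
    [1.0, 1.9,    0.8, 3.4],
])
y = np.array([10.2, 9.8, 10.5, 11.0, 9.6])

XtX = X.T @ X
print(XtX)  # every entry involving x_1 is NaN, exactly as on the slide

# Inverting a matrix with NaN entries cannot give a usable result,
# so beta cannot be estimated from the formula:
if np.isnan(XtX).any():
    print("X'X contains unknowns -> least squares estimate undefined")
```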
When there are missing values in \mathbf X or \mathbf y we cannot estimate \boldsymbol\beta!!!
⇨ Exclude cases with missing values?
And so it is clear, whenever we have missing values in the covariates, we cannot estimate our regression coefficients. And the same goes for missing values in the response y.
And so the logical conclusion would be that we would have to exclude all those cases for which some values are missing, and perform a complete case analysis.
But, a complete case analysis is in most cases a rather bad idea.
Here is one reason why.
You see on the y-axis the proportion of complete cases in a dataset, and on the x-axis the number of incomplete variables. Each line represents a different proportion of missing values per variable.
So, if we had 10% missing values in 25 variables, we'd end up with only 7% of the original sample size. And if we had 10% missing values in 10 variables, we'd have 35% of our data left over in a complete case analysis.
Even when we have only 2% missing in only 5 variables, we could lose 10% of the data.
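These percentages follow from a simple calculation: if each of k variables independently has a fraction p of missing values, a case is complete with probability (1 - p)^k. A quick sketch to reproduce the numbers behind the plot:

```python
def complete_case_fraction(p, k):
    """Expected fraction of complete cases when each of k variables
    independently has a proportion p of missing values."""
    return (1 - p) ** k

print(complete_case_fraction(0.10, 25))  # ~0.07 -> only ~7% of cases left
print(complete_case_fraction(0.10, 10))  # ~0.35 -> ~35% left
print(complete_case_fraction(0.02, 5))   # ~0.90 -> ~10% of the data lost
```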
Complete Case Analysis is inefficient and, in most settings, biased.
For example: with 10% missing values in 10 variables, only about 35% of the cases remain complete.
So it is clear, complete case analysis is very inefficient. In many cases we'll lose quite a bit of data.
Moreover, complete case analysis is biased in most settings. There are a few very specific exceptions, depending on what kind of model you use, where the missing values are, and why they are missing.
And, so, for most methods to handle missing values we can't make a general statement that will always be true.
For the impact of a method there are a number of relevant aspects.
Relevant for the choice / impact of methods:
The first question we usually ask ourselves is how much is actually missing in the data? And we can distinguish between the proportion or number of missing values per variable or per subject.
And, as we've seen, we might also need to check what that means for the number of complete cases.
But what I find sometimes even more relevant is how much information is available? Again, with respect to the number of observations per variable and per subject, and, are there relevant covariates that are associated with the variables that have missing values, how strong these associations are, and if these other variables are observed for the cases with missing values in the other variables.
Where are values missing?
Why are values missing?
⇨ Missing Data Mechanism
We also need to distinguish between missing values in covariates and the response, and we need to think about, and make assumptions about, why the values are missing, meaning, the missing data mechanism.
Missing Completely At Random (MCAR) \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) = \mathrm{Pr}(\text{value missing}) no systematic difference between complete and incomplete cases
For the missing data mechanism there is a specific terminology.
First, we can have "missing completely at random" missing data. Missing completely at random means that the probability of a value being missing does not depend on anything, it is completely random and has nothing to do with what we are investigating in our study.
This means that there are no systematic differences between complete and incomplete cases.
Missing At Random \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) = \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs})
Then, we have missing at random. In missing at random the assumption is that the probability of a value being missing depends on other things, but only on things that we have measured in our data and that are actually observed.
Missing Not At Random \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) \neq \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs})
The last missing data mechanism is missing not at random, and here the probability that a value is missing does depend on things that we have not measured or that are missing. This can either be the missing value itself, or missing values in other variables, or things that we haven't measured at all.
There is no way of testing if we are dealing with MNAR or MAR. We will always need to make an assumption about whether we have MNAR data.
Sometimes you can read in clinical papers that they assumed "random missingness", or "that the missing values are random". I assume they refer either to MCAR or MAR, but it isn't clear which one, and it can make a very important difference whether you have MCAR or MAR.
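To make the distinction tangible, here is a small simulation sketch (hypothetical data) that generates missingness under each mechanism and shows why the difference matters:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
x = rng.normal(size=n)        # fully observed covariate
z = x + rng.normal(size=n)    # variable that will receive missing values

mcar = rng.random(n) < 0.3                   # pure chance
mar = rng.random(n) < 1 / (1 + np.exp(-x))   # depends on observed x only
mnar = rng.random(n) < 1 / (1 + np.exp(-z))  # depends on the value itself

print("full-data mean of z:", round(z.mean(), 2))
for name, is_missing in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, "mean of observed z:", round(z[~is_missing].mean(), 2))
# MCAR: observed mean ~ full-data mean (no systematic difference).
# MAR / MNAR: the observed cases differ systematically from the rest.
```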
Let's look at a few examples to see which type of missing data mechanism we might have.
Say, we have a study for which we have collected data using a questionnaire. Some of the questionnaires were filled in, but on the way back they got lost in the mail.
Which type of missing data mechanism would that be?
If this is a study in the Netherlands we could probably argue that this is MCAR. But if we were performing a study in various areas in, say, Africa, and the postal service in the rural areas is much more unreliable than in the cities, and there are other factors that are of interest in our study that also differ between rural areas and cities, we won't have MCAR any more.
Data is collected by questionnaire ⇨ some got lost in the mail
A particular biomarker was not part of the standard panel before 2008
⇨ missing for many patients who entered < 2008
Another example. Imagine a particular biomarker was not part of the standard blood panel before 2008. And so, for most of the patients who entered the study before 2008 this value is missing, but for people who entered later it is mostly observed.
Which missing data mechanism do we have?
If we know the year of inclusion, then we'd have MAR.
In a survey in pregnant women some do not fill in the answer to "Are you currently smoking?"
Another example. We have a survey that we send out to pregnant women. One of the questions is if they are currently smoking. What mechanism would you expect for the missing values in that variable?
....
Same survey: missing values in "chocolate consumption".
In the same survey, we also ask about the women's daily chocolate consumption. What about the missing values in this variable?
MCAR / MAR / MNAR are NOT a property of the data but of a model.
As you see, the missing data mechanism is actually not a property of the data itself, but rather of the model that we use to fit the data or to impute it.
And to make suitable assumptions you need expert knowledge on how the data was measured. This is not something that the statistician can determine.
The important issue in imputing missing values is that there is uncertainty about what the value would have been. And so we can't just pick one value and fill it in, because then we would just ignore this uncertainty.
If the value of height is missing for one patient, we don't know what that value would have been.
Also: some values are going to be more likely than others, and usually there is a relationship between the variable that has missing values and the other data that we have collected.
For the missing value in height we could expect something around 1.70 / 1.80 m. And values of 1.50 m and 2.10 m are possible, but less likely.
⇨ missing values have a distribution
So, in statistical terms, we can say that missing values have a distribution.
Predictive distribution of the missing values given the observed values. p(x_{mis}\mid\text{everything else})
Moreover, there usually is some relationship between the missing value and other data that we have collected. If we know that the missing value in height is from a male, larger values become more likely and smaller values less likely.
This means that we need a model to learn how the incomplete variable is related to the other data.
This model, together with an assumption about the type of distribution the missing value has, then allows us to specify the distribution we should sample values to impute the missing value. We call this the predictive distribution. And the predictive distribution is generally based on everything else, including all other data and parameters.
| | \mathbf y | \mathbf x_1 | \mathbf x_2 | \mathbf x_3 |
|---|---|---|---|---|
| i | y_i | \color{var(--nord15)}{?} | x_{i2} | x_{i3} |
| \vdots | \vdots | \vdots | \vdots | \vdots |
Predictive distribution:
p(\color{var(--nord15)}{\mathbf x_1} \mid \mathbf y, \mathbf x_2, \mathbf x_3, \boldsymbol\beta, \sigma)
Let's look at a simple example. Imagine, we have the following dataset, where we have a completely observed response variable y, a variable x_1 that is missing for patient i, and two other covariates that are completely observed.
And so the predictive distribution that we need to sample the imputed value from would be the distribution of x_1, given the response y, the other covariates, and some parameters.
For example:
For example, we could think of this as fitting a regression model with x_1 as the dependent variable, and y and the other covariates as independent variables.
We can then fit this model to all those cases for which we have x_1 observed,...
Fit a model to the cases with observed \color{var(--nord15)}{\mathbf x_1}: \color{var(--nord15)}{\mathbf x_1} = \beta_0 + \beta_1 \mathbf y + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \boldsymbol\varepsilon
Estimate parameters \boldsymbol{\hat\beta}, \hat\sigma
⇨ define distribution
p(\color{var(--nord15)}{x_{i1}} \mid y_i, x_{i2}, x_{i3}, \boldsymbol{\hat\beta}, \hat\sigma)
... in order to estimate the parameters, and to learn what the distribution of x_1 conditional on the other data looks like.
And then we can use this information to specify the predictive distribution for the cases with missing x_1 and sample imputed values from this distribution.
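As a minimal numpy sketch of this step (it ignores, for the moment, the uncertainty about the estimated parameters, which the next slides turn to):

```python
import numpy as np

def impute_x1_once(y, x1, x2, x3, rng):
    """Draw one imputation for the missing entries of x1 from the
    predictive distribution implied by a linear imputation model
    x1 ~ y + x2 + x3, fitted to the cases with observed x1."""
    obs = ~np.isnan(x1)
    Z = np.column_stack([np.ones_like(y), y, x2, x3])
    beta, *_ = np.linalg.lstsq(Z[obs], x1[obs], rcond=None)
    resid = x1[obs] - Z[obs] @ beta
    sigma = resid.std(ddof=Z.shape[1])   # residual standard deviation
    x1_imp = x1.copy()
    # predictive draw: fitted value plus random (normal) variation
    x1_imp[~obs] = rng.normal(Z[~obs] @ beta, sigma)
    return x1_imp
```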
We can visualize this for the case where we only have two variables, one incomplete, shown on the y-axis, and one complete, shown on the x-axis. So this is a visualization of the imputation model.
In practice we will of course have more variables, but then I couldn't show it in a simple plot any more. So this is really just to get the idea.
We know the value of the other variable, so we know where our incomplete cases are on the x-axis, but we don't know where to place them on the y-axis. Therefore I only marked them as empty circles on the x-axis here.
When we now fit a model on the observed cases we can represent that as the corresponding regression line.
When we now plug the observed variables of our incomplete cases into the estimated model, we get the fitted values, meaning the corresponding values on the regression line.
Could we now just take those values as our imputed values? We fitted the model on the complete cases, and then we predicted the value of the incomplete variable from that model.
Important: We need to take into account the uncertainty!
Not quite. We can't just use the fitted value to impute the missing value because there is uncertainty that we haven't taken into account.
about the parameter estimates
There is uncertainty about the parameter estimates in the imputation model. Because our data is just a sample, we don't know the true parameters. With a different sample, we'd get a slightly different regression line.
about the fitted/predicted value \color{var(--nord15)}{\mathbf{\hat{x}_1}}
And there is uncertainty about the values themselves. In the observed data, the data points are not exactly on the regression line, but spread around it. So we'd expect the same for the missing values. Using the fitted values, the values on the regression line, would ignore this random variation that we have in the data.
This is the part where we assume that the missing values have a distribution. This distribution is the random variation around the expected value.
We want:
Imputation from the predictive distribution
p(\color{var(--nord15)}{x_{mis}} \mid \text{everything else}).
Idea:
Use a "prediction" model.
Take into account: the uncertainty about the parameter estimates, and the random variation around the fitted/predicted value.
So, in summary, what have we seen so far?
We want to impute missing values from the predictive distribution of the missing value given everything else.
The idea is to do that via a prediction model.
But we need to take into account that we have multiple sources of uncertainty or variation: the uncertainty about the parameter estimates, and the random variation of the values around the regression line.
So with this knowledge on missing data and all the things that we need to take into account let's have a look at some unfortunately still used naive methods to handle missing data.
We are now looking at the data that we would use for the actual analysis of interest, and the regression line from that analysis model. So on the x-axis we now have the incomplete covariate and on the y-axis the response, which we assume is fully observed.
The cases for which the covariate is observed are drawn as white dots, the cases for which the covariate is missing as empty purple circles. The correct regression line, which we would get if we didn't have any missing values, is shown with the dashed line.
In a complete case analysis, the regression line would be calculated just based on the white data points. Because the missing values are not missing completely at random, but values are more likely to be missing for larger response values, the estimated line now is lower than the true line.
The first imputation method that I'll show here is mean imputation. All missing values in the covariate are filled in with the mean of the observed values of that covariate. This is shown here with the filled purple dots. You can clearly see that they are not a good representation of the distribution of the true but missing values.
The corresponding regression line, shown with the solid white line is closer to the true line than for complete case analysis but is flatter than the true line.
The second missing data method is the missing indicator method. The idea here is to replace the missing values with a fixed value, for example zero. And, to distinguish the incomplete cases from the complete cases we additionally add an indicator variable that is zero for observed cases and one for incomplete cases.
As for mean imputation, we see that the imputed values do not at all represent the spread of the missing values. Because of the indicator variable we now get two regression lines, one for observed and one for incomplete cases, but they have the same slope, which seems to be similar to the slope of the true regression line.
Next, we have regression imputation. The idea here is to impute based on a prediction model, like we saw before, but to just use the fitted values from that prediction. Because this method also does not take into account the random variation, we see that all imputed values are on one straight line.
And again we see that the model fitted on the imputed data results in a regression line with different slope than the true line, now with a steeper slope.
In single imputation we now improve upon the regression imputation by taking into account both the uncertainty about the parameters in the imputation model and the random variation. And we can see that the imputed values have a distribution that is much more similar to the distribution of the missing values.
The corresponding regression line is also almost identical to the true line.
Here I have an overview of the parameter estimate of the incomplete covariate and the corresponding 95% confidence interval. So this is the slope that we saw for all the different methods.
On top is the value for the complete data, and to make the comparison easier I have added a shaded area that has the width of the 95% CI from the complete data analysis.
Of course, because this is just the results from one very simple example, we can't draw any conclusions about how much bias we get from which method and how they compare in general. This was just to visualize a bit what happens when you use one of these naive methods.
We see that the different methods disagree quite a bit in their estimates. The single imputation comes closest, but when we take a closer look at the CI we see that it is actually a bit narrower than the true CI. In the example I used here, I have a bit more than 50% missing values. So we should have quite a bit additional uncertainty compared to the complete data.
The single imputation approach clearly underestimates the uncertainty that we have about the effect of the covariate.
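The following numpy sketch mimics this little experiment with simulated data (two variables only, hypothetical parameter values; the exact numbers will differ from the figures shown here):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 1 + 0.5 * x + rng.normal(scale=0.5, size=n)    # true slope: 0.5

# x is more likely to be missing for larger response values (not MCAR)
is_mis = rng.random(n) < 1 / (1 + np.exp(-2 * (y - 1)))
obs = ~is_mis

slope = lambda xv, yv: np.polyfit(xv, yv, 1)[0]
print("complete cases:", round(slope(x[obs], y[obs]), 2))

x_mean = np.where(is_mis, x[obs].mean(), x)        # mean imputation
print("mean imputation:", round(slope(x_mean, y), 2))

b = np.polyfit(y[obs], x[obs], 1)                  # imputation model x ~ y
x_reg = np.where(is_mis, np.polyval(b, y), x)      # fitted values only
print("regression imputation:", round(slope(x_reg, y), 2))

sd = (x[obs] - np.polyval(b, y[obs])).std()        # + residual noise
x_sto = np.where(is_mis, np.polyval(b, y) + rng.normal(scale=sd, size=n), x)
print("stochastic single imputation:", round(slope(x_sto, y), 2))
```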
Can take into account: the uncertainty about the parameter estimates and the random variation.
But:
Single imputation does not take into account the uncertainty about the imputed value!
In the single imputation we did take into account two of the sources of uncertainty or variation, but we only have one imputed value. We have no way of taking into account the added uncertainty that we have about the imputed value compared to an observed value, when we just have one single value.
And this is why Donald Rubin came up with the idea of multiple imputation.
MI was developed in the 1960s/70s...
The idea behind multiple imputation is that, using this principle, we sample imputed values and fill them into the original, incomplete data to create a completed dataset.
And in order to take into account the uncertainty that we have about the missing values, we do this multiple times, so that we obtain multiple completed datasets.
Because all the missing values have now been filled in, we can analyse each of these datasets separately with standard statistical techniques.
To obtain overall results, the results from each of these analyses need to be combined in a way that takes into account both the uncertainty that we have about the estimates from each analysis, and the variation between these estimates.
Pooled Parameter Estimate:
\bar\beta = \frac{1}{m}\sum_{\ell = 1}^m \hat\beta^{(\ell)} \qquad
\text{(average estimate)}
Pooled Variance: T = \bar W + B + B/m
\displaystyle\bar W = \frac{1}{m}\sum_{\ell = 1}^m \mathrm{var}(\hat\beta^{(\ell)}) average within imputation variance
\displaystyle B = \frac{1}{m - 1}\sum_{\ell = 1}^m (\hat \beta^{(\ell)} - \bar\beta)^2 between imputation variance
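These pooling rules are easy to implement; a small sketch (with made-up estimates and variances for illustration):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine m point estimates and their variances (Rubin's rules)."""
    est = np.asarray(estimates, dtype=float)
    m = len(est)
    beta_bar = est.mean()      # pooled estimate
    W = np.mean(variances)     # average within-imputation variance
    B = est.var(ddof=1)        # between-imputation variance
    T = W + B + B / m          # total (pooled) variance
    return beta_bar, T

# hypothetical results from analysing m = 5 imputed datasets
beta_bar, T = pool_rubin([0.48, 0.53, 0.50, 0.45, 0.55], [0.02] * 5)
print(beta_bar, np.sqrt(T))    # pooled estimate and its standard error
```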
[Table: dataset with response \mathbf y and covariates \mathbf x_1, \mathbf x_2, \mathbf x_3, \ldots, with missing values in several variables]
Predictive distributions
And the most common approach to imputation in this setting is MICE, short for multivariate imputation by chained equations, an approach that is also called fully conditional specification.
The principle is an extension to what we've seen on the previous slides. We impute missing values using models that have all other data in their linear predictor.
Most common approach:
MICE
(multivariate imputation by chained equations)
FCS
(fully conditional specification)
Iterative:
start with random draws from the observed data
cycle through the models to update the imputed values
until convergence
⇨ keep only last imputed value
Because in these imputation models we now have incomplete covariates, we use an iterative algorithm. We start by randomly drawing starting values from the observed part of the data, and then we cycle through the incomplete variables and impute one at a time.
Flexible model types
choose a different type of model per incomplete variable
The models for the different variables can be specified according to the type of variable.
Once we have imputed each missing value, we start again with the first variable, but now use the imputed values of the other variables instead of the starting values, and we do this a few times until the algorithm has converged.
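A stripped-down sketch of one such chained-equations cycle for numeric data (for brevity it reuses the least squares point estimates rather than drawing new parameters in every cycle, which a full MICE implementation, such as the R package mice, would do):

```python
import numpy as np
import pandas as pd

def fcs_once(df, n_iter=10, seed=None):
    """One completed dataset via fully conditional specification:
    cycle through the incomplete numeric columns and impute each
    from all other columns with stochastic linear regression."""
    rng = np.random.default_rng(seed)
    miss = df.isna()
    incomplete = [c for c in df.columns if miss[c].any()]
    imp = df.astype(float).copy()
    for c in incomplete:  # starting values: random draws from observed data
        imp.loc[miss[c], c] = rng.choice(df[c].dropna(), miss[c].sum())
    for _ in range(n_iter):
        for c in incomplete:
            Z = np.column_stack([np.ones(len(imp)),
                                 imp.drop(columns=c).to_numpy()])
            o = ~miss[c].to_numpy()
            beta, *_ = np.linalg.lstsq(Z[o], imp.loc[o, c], rcond=None)
            resid = imp.loc[o, c].to_numpy() - Z[o] @ beta
            sigma = resid.std(ddof=Z.shape[1])
            imp.loc[miss[c], c] = rng.normal(Z[~o] @ beta, sigma)
    return imp

# m imputed datasets: run the cycle m times with different seeds,
# analyse each, and pool the results with Rubin's rules as above.
```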
Relevant for the choice / impact of methods:
Where are values missing?
Why are values missing?
⇨ Missing Data Mechanism
How much is missing / how much information is available?
[Table: the same dataset; \mathbf x_1 is missing for most subjects]
Scenario 1:
N = 200, 90% of \color{var(--nord15)}{\mathbf x_1} is missing
⇨ N = 20 to estimate \boldsymbol\beta
Relevant covariates / strength of association
[Table: dataset illustration; observed covariates associated with the incomplete \mathbf x_1 provide information for its imputation]
Where are values missing?
[Table: dataset with missing values in the response \mathbf y and in \mathbf x_1]
Imputation Model for \color{var(--nord15)}{\mathbf y}: \color{var(--nord15)}{\mathbf y} = \alpha_0 + \alpha_1 \color{var(--nord15)}{\mathbf x_1} + \alpha_2 \mathbf x_2 + \alpha_3 \mathbf x_3 + \varepsilon_y
Analysis Model: \color{var(--nord15)}{\mathbf y} = \beta_0 + \beta_1 \color{var(--nord15)}{\mathbf x_1} + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \boldsymbol\varepsilon
Missing values in the response:
If analysis model = imputation model
⇨ \boldsymbol{\hat\beta} = \boldsymbol{\hat\alpha}
⇨ No point in imputing responses
Auxiliary variables:
⇨ analysis model \neq imputation model
⇨ \boldsymbol{\hat\beta} \neq \boldsymbol{\hat\alpha}
⇨ Imputing responses can be beneficial
Why are values missing?
Imputation of \color{var(--nord15)}{\mathbf x_1} based on:
\color{var(--nord15)}{\mathbf x_1} = \beta_0 + \beta_1 \mathbf y + \beta_2 \color{var(--nord15)}{\mathbf x_2} + \beta_3 \color{var(--nord15)}{\mathbf x_3} + \ldots
⇨ Imputed \color{var(--nord15)}{\mathbf x_1} will have the same distribution as observed \color{var(--nord15)}{\mathbf x_1} with the same values of all other variables.
⇨ FCS MI is valid under MAR
valid under MAR
imputation models need to contain the important predictors in the right form
allows us to take into account the uncertainty about the missing values
Implied Assumption:
Linear association
between \color{var(--nord15)}{\mathbf x_1} and \mathbf y:
\color{var(--nord15)}{\mathbf x_1} = \beta_0 + \bbox[#3B4252, 2pt]{\beta_1 \mathbf y} + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3
Implied assumption: linear association between the incomplete covariate and the outcome (and the other covariates).
But what if
\mathbf y = \theta_0 + \bbox[#3B4252, 2pt]{\theta_1 \color{var(--nord15)}{\mathbf x_1} + \theta_2 \color{var(--nord15)}{\mathbf x_1}^2} + \theta_3 \mathbf x_2 + \theta_4 \mathbf x_3
But what if we have a setting where we assume that there is a non-linear association, for example quadratic?
⇨ bias! If we impute under the (incorrect) assumption of a linear association, we introduce bias, even if we analyse the imputed data under the correct assumption.
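A quick simulation sketch of this effect (hypothetical values; the exact size of the bias depends on the setting):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
x1 = rng.normal(size=n)
y = 1 + x1 + x1**2 + rng.normal(scale=0.5, size=n)   # truly quadratic

is_mis = rng.random(n) < 0.5                  # MCAR, to isolate the effect
b = np.polyfit(y[~is_mis], x1[~is_mis], 1)    # *linear* imputation model x1 ~ y
sd = (x1[~is_mis] - np.polyval(b, y[~is_mis])).std()
x1_imp = np.where(is_mis, np.polyval(b, y) + rng.normal(scale=sd, size=n), x1)

print(np.polyfit(x1, y, 2))      # full data: close to [1, 1, 1]
print(np.polyfit(x1_imp, y, 2))  # after linear imputation: distorted,
#                                  even though the analysis model is correct
```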
With non-linear associations specification of the correct imputation model may not be possible.
Settings with non-linear associations:
⇨ We then cannot just specify the imputation model as a simple regression model with all other variables in the linear predictor.
Also critical:
settings with correlated observations
⇨ Bayes
How many imputed datasets do I need?
Should we do a complete case analysis as sensitivity analysis?
What % missing values is still ok?
Can I impute missing values in the response?
Can I impute missing values in the exposure?
Which variables do I need to include in the imputation?
Why do I need to include the response into the imputation models? Won't that artificially increase the association?
How should I report missing data / imputation in a paper?
I'm going to start right at the beginning, and want to demonstrate why missing values are a problem.
This is a bit theoretical, with lots of math, but don't worry, the math is more for visualization, and the presentation won't be all formulas.
I'll then talk a bit in general about missing data, look at some naive missing data methods, and then we'll take a look at multiple imputation.
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
o | Tile View: Overview of Slides |
Esc | Back to slideshow |
I'm going to start right at the beginning, and want to demonstrate why missing values are a problem.
This is a bit theoretical, with lots of math, but don't worry, the math is more for visualization, and the presentation won't be all formulas.
I'll then talk a bit in general about missing data, look at some naive missing data methods, and then we'll take a look at multiple imputation.
Linear Regression Model:
\begin{eqnarray*} y &=& \beta_0 + \beta_1 \mathbf x_1 + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \boldsymbol\varepsilon\\ &=& \mathbf X^\top \boldsymbol\beta + \boldsymbol\varepsilon \end{eqnarray*}
We use linear regression as an example, because there, we can calculate the solution for the regression coefficients by hand with a formula, and, theoretically, wouldn't need a computer to fit the model.
A linear regression model is written as a response y with covariates x, and some regression coefficients \beta, and we have the error terms, \varepsilon.
We can also write this model in matrix notation, ...
Linear Regression Model:
\begin{eqnarray*} y &=& \beta_0 + \beta_1 \mathbf x_1 + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \boldsymbol\varepsilon\\ &=& \mathbf X^\top \boldsymbol\beta + \boldsymbol\varepsilon \end{eqnarray*}
with
\mathbf y = \begin{pmatrix} y_1\\ y_2\\ y_3\\ y_4\\ y_5 \end{pmatrix} \qquad \mathbf X = \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13}\\ 1 & \color{var(--nord15)}{?} & x_{22} & x_{23}\\ 1 & x_{31} & x_{32} & x_{33}\\ 1 & x_{41} & x_{42} & x_{43}\\ 1 & x_{51} & x_{52} & x_{53} \end{pmatrix} \qquad \boldsymbol\beta = \begin{pmatrix} \beta_0\\ \beta_1\\ \beta_2\\ \beta_3 \end{pmatrix}
We use linear regression as an example, because there, we can calculate the solution for the regression coefficients by hand with a formula, and, theoretically, wouldn't need a computer to fit the model.
A linear regression model is written as a response y with covariates x, and some regression coefficients \beta, and we have the error terms, \varepsilon.
We can also write this model in matrix notation, ...
... and then we have y as a vector, here, as an example for 5 subjects, X is the design matrix, which contains the different covariates in the columns and has a column of 1s for the intercept, and the value for x_1 for the second subject is missing. The regression coefficients \beta are also a vector.
The Least Squares Estimator
\hat{\boldsymbol\beta} = (\mathbf X^\top\mathbf X)^{-1} \mathbf X^\top \mathbf y
The regression coefficients in the linear model are usually estimated using the least squares estimator, and this estimator has a simple formula that depends only on the design matrix X and the response y.
We'll now go through this formula in steps to see how the calculation is impacted by the one missing value in X.
The Least Squares Estimator
\hat{\boldsymbol\beta} = (\mathbf X^\top\mathbf X)^{-1} \mathbf X^\top \mathbf y
\mathbf X^\top \mathbf X = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ x_{11} & \color{var(--nord15)}{?} & x_{31} & x_{41} & x_{51}\\ x_{12} & x_{22} & x_{32} & x_{42} & x_{52}\\ x_{13} & x_{23} & x_{33} & x_{43} & x_{53} \end{pmatrix} \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13}\\ 1 & \color{var(--nord15)}{?} & x_{22} & x_{23}\\ 1 & x_{31} & x_{32} & x_{33}\\ 1 & x_{41} & x_{42} & x_{43}\\ 1 & x_{51} & x_{52} & x_{53} \end{pmatrix}
The regression coefficients in the linear model are usually estimated using the least squares estimator, and this estimator has a simple formula that depends only on the design matrix X and the response y.
We'll now go through this formula in steps to see how the calculation is impacted by the one missing value in X.
We start with the product of X^\top and X.
X^\top is the design matrix, but with rows and colums swapped, so that each row is one variable, and each column is one subject.
And we need to multiply these two matrices.
\mathbf X^\top \mathbf X = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ x_{11} & \color{var(--nord15)}{?} & x_{31} & x_{41} & x_{51}\\ x_{12} & x_{22} & x_{32} & x_{42} & x_{52}\\ x_{13} & x_{23} & x_{33} & x_{43} & x_{53} \end{pmatrix} \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13}\\ 1 & \color{var(--nord15)}{?} & x_{22} & x_{23}\\ 1 & x_{31} & x_{32} & x_{33}\\ 1 & x_{41} & x_{42} & x_{43}\\ 1 & x_{51} & x_{52} & x_{53} \end{pmatrix} = \begin{pmatrix} \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot \end{pmatrix}
\cdot = 1\cdot 1 +1\cdot 1 + 1\cdot 1 + 1\cdot 1 + 1\cdot 1
How does matrix multiplication work?
We always multiply one row from the first matrix with a column from the second matrix, and take the sum over all the product from these two vectors.
The result from the first row and first column will then be the top left element in the result matrix.
And because here we have the intercept multiplied with itself, we have the sum over the product of 1s, which is 5 in this case, because we have 5 subjects.
\mathbf X^\top \mathbf X = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ x_{11} & \color{var(--nord15)}{?} & x_{31} & x_{41} & x_{51}\\ x_{12} & x_{22} & x_{32} & x_{42} & x_{52}\\ x_{13} & x_{23} & x_{33} & x_{43} & x_{53} \end{pmatrix} \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13}\\ 1 & \color{var(--nord15)}{?} & x_{22} & x_{23}\\ 1 & x_{31} & x_{32} & x_{33}\\ 1 & x_{41} & x_{42} & x_{43}\\ 1 & x_{51} & x_{52} & x_{53} \end{pmatrix} = \begin{pmatrix} \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot \end{pmatrix}
\begin{eqnarray*} \cdot &=& 1 \cdot x_{11} + 1\cdot\color{var(--nord15)}{?} + 1\cdot x_{31} + 1\cdot x_{41} + 1\cdot x_{51}\\ &=& x_{11} + \color{var(--nord15)}{?} + x_{31} + x_{41} + x_{51}\\ &=& \color{var(--nord15)}{?} \end{eqnarray*}
Then we move on to the second column, and here we multiply again each element with one, so, one times x_{11}, one times the missing value, and so on.
And then we need to sum up all the products, but because one of the summands is unknown, the sum will also be unknown.
\mathbf X^\top \mathbf X = \begin{pmatrix} \cdot & \color{var(--nord15)}{?} & \cdot & \cdot\\ \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\ \cdot & \color{var(--nord15)}{?} & \cdot & \cdot\\ \cdot & \color{var(--nord15)}{?} & \cdot & \cdot \end{pmatrix}
And when we continue to do that, in the result of X^\top X there are some elements unknown, indicated by the questionmarks, and all the values where we have a dot we can calculate.
\mathbf X^\top \mathbf X = \begin{pmatrix} \cdot & \color{var(--nord15)}{?} & \cdot & \cdot\\ \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\ \cdot & \color{var(--nord15)}{?} & \cdot & \cdot\\ \cdot & \color{var(--nord15)}{?} & \cdot & \cdot \end{pmatrix}
(\mathbf X^\top \mathbf X)^{-1} = \begin{pmatrix} \cdot & \color{var(--nord15)}{?} & \cdot & \cdot\\ \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\ \cdot & \color{var(--nord15)}{?} & \cdot & \cdot\\ \cdot & \color{var(--nord15)}{?} & \cdot & \cdot \end{pmatrix}^{-1} = \begin{pmatrix} \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\ \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\ \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\ \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} \end{pmatrix}
But in the formula for the least squares estimator we have to then take the inverse of this new matrix.
Calculating the inverse by hand is a bit tedious, so I'm not going to go through it step by step. But the result is that we now have unknown values on all positions of the inverted matrix, because the calculations always involve one or more of the unknown elements of the input matrix.
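To see this happen numerically, here is a minimal numpy sketch (my own illustration, not part of the derivation): a single NaN in X makes some entries of X^\top X NaN, and the least squares estimate then consists entirely of NaN.

```python
import numpy as np

# 5 subjects, intercept + 3 covariates, with x_1 missing (NaN) for subject 2
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 3))])
X[1, 1] = np.nan
y = rng.normal(size=5)

XtX = X.T @ X
print(XtX)  # row 1 and column 1 are NaN, all other entries are computable

# (X'X)^{-1} X'y then consists entirely of NaN
# (or raises an error, depending on the linear algebra backend)
beta_hat = np.linalg.solve(XtX, X.T @ y)
print(beta_hat)
```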
When there are missing values in \mathbf X or \mathbf y we cannot estimate \boldsymbol\beta!!!
⇨ Exclude cases with missing values?
And so it is clear, whenever we have missing values in the covariates, we cannot estimate our regression coefficients. And the same goes for missing values in the response y.
And so the logical conclusion would be that we have to exclude all those cases for which some values are missing, and perform a complete case analysis.
But, a complete case analysis is in most cases a rather bad idea.
Here is one reason why.
You see on the y-axis the proportion of complete cases in a dataset, and on the x-axis the number of incomplete variables. Each line represents a different proportion of missing values per variable.
So, if we had 10% missing values in 25 variables, we'd end up with only 7% of the original sample size. And if we had 10% missing values in 10 variables, we'd have 35% of our data left over in a complete case analysis.
Even if we had only 2% missing values in only 5 variables, we could lose 10% of the data.
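These numbers follow from a simple calculation: if each of k variables has a proportion p of missing values, and missingness is independent across variables (an idealizing assumption; in real data it rarely is), the expected fraction of complete cases is (1 - p)^k. A quick sketch:

```python
def complete_case_fraction(p: float, k: int) -> float:
    """Expected proportion of complete cases when k variables each have
    a proportion p missing, assuming independent missingness."""
    return (1 - p) ** k

print(complete_case_fraction(0.10, 25))  # ~0.07: only 7% complete cases
print(complete_case_fraction(0.10, 10))  # ~0.35: 35% of the data left
print(complete_case_fraction(0.02, 5))   # ~0.90: roughly 10% of the data lost
```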
Complete Case Analysis is
inefficient
biased (in most settings)
For Example:
So it is clear: complete case analysis is very inefficient. In many cases we'll lose quite a bit of data.
Moreover, complete case analysis is biased in most settings. There are a few very specific exceptions, depending on what kind of model you use, where the missing values are, and why they are missing.
And, so, for most methods to handle missing values we can't make a general statement that will always be true.
For the impact of a method there are a number of relevant aspects.
Relevant for the choice / impact of methods:
How much is missing / how much information is available?
Where are values missing?
Why are values missing?
⇨ Missing Data Mechanism
The first question we usually ask ourselves is: how much is actually missing in the data? Here we can distinguish between the proportion or number of missing values per variable and per subject.
And, as we've seen, we might also need to check what that means for the number of complete cases.
But what I find sometimes even more relevant is how much information is available? Again, with respect to the number of observations per variable and per subject; and: are there relevant covariates that are associated with the variables that have missing values, how strong are these associations, and are these other variables observed for the cases that have missing values?
We also need to distinguish between missing values in covariates and in the response, and we need to think about, and make assumptions about, why the values are missing, meaning, the missing data mechanism.
Missing Completely At Random (MCAR) \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) = \mathrm{Pr}(\text{value missing}) no systematic difference between complete and incomplete cases
For the missing data mechanism there is a specific terminology.
First, we can have data that is "missing completely at random". Missing completely at random means that the probability of a value being missing does not depend on anything; it is completely random and has nothing to do with what we are investigating in our study.
This means that there are no systematic differences between complete and incomplete cases.
Missing At Random \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) = \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs})
Then, we have missing at random. Under missing at random, the assumption is that the probability of a value being missing does depend on other things, but only on things that we have measured in our data and that are actually observed.
Missing Not At Random \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) \neq \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs})
The last missing data mechanism is missing not at random, and here the probability that a value is missing does depend on things that we have not measured or that are missing. This can be the missing value itself, missing values in other variables, or things that we haven't measured at all.
There is no way of testing if we are dealing with MNAR or MAR. We will always need to make an assumption about whether we have MNAR data.
Sometimes you can read in clinical papers that they assumed "random missingness", or "that the missing values are random". I assume that they refer either to MCAR or MAR, but it isn't clear which one, and it can make a very important difference which of the two you have.
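If it helps to see the three mechanisms side by side, here is a small simulation sketch (hypothetical variables x and z; the logistic function just turns a linear predictor into a missingness probability):

```python
import numpy as np

def expit(a):
    """Logistic function: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=n)               # fully observed covariate
z = 0.5 * x + rng.normal(size=n)     # variable that will get missing values

# MCAR: the probability of being missing ignores the data entirely
r_mcar = rng.uniform(size=n) < 0.3

# MAR: it depends only on the *observed* covariate x
r_mar = rng.uniform(size=n) < expit(-1 + 2 * x)

# MNAR: it depends on the (possibly unobserved) value of z itself
r_mnar = rng.uniform(size=n) < expit(-1 + 2 * z)

z_mar = np.where(r_mar, np.nan, z)   # e.g. the MAR version of z
```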
Let's look at a few examples to see which type of missing data mechanism we might have.
Say, we have a study for which we have collected data using a questionnaire. Some of the questionnaires were filled in, but on the way back they got lost in the mail.
Which type of missing data mechanism would that be?
If this is a study in the Netherlands we could probably argue that this is MCAR. But if we were performing a study in various areas in, say, Africa, and the postal service in the rural areas is much more unreliable than in the cities, and there are other factors that are of interest in our study that also differ between rural areas and cities, we won't have MCAR any more.
Data is collected by questionnaire ⇨ some got lost in the mail
A particular biomarker was not part of the standard panel before 2008
⇨ missing for many patients who entered < 2008
Another example. Imagine a particular biomarker was not part of the standard blood panel before 2008. And so, for most of the patients who entered the study before 2008 this value is missing, but for people who entered later it is mostly observed.
Which missing data mechanism do we have?
If we know the year of inclusion, then we'd have MAR.
In a survey in pregnant women some do not fill in the answer to "Are you currently smoking?"
Another example. We have a survey that we send out to pregnant women. One of the questions is if they are currently smoking. What mechanism would you expect for the missing values in that variable?
....
Same survey: missing values in "chocolate consumption".
In the same survey, we also ask about the women's daily chocolate consumption. What about the missing values in this variable?
MCAR / MAR / MNAR are NOT a property of the data but of a model.
As you see, the missing data mechanism is actually not a property of the data itself, but rather of the model that we use to fit the data or to impute it.
And to make suitable assumptions you need expert knowledge on how the data was measured. This is not something that the statistician can determine.
The important issue in imputing missing values is that there is uncertainty about what the value would have been. And so we can't just pick one value and fill it in, because then we would just ignore this uncertainty.
If the value of height is missing for one patient, we don't know what that value would have been.
Also: some values are going to be more likely than others, and usually there is a relationship between the variable that has missing values and the other data that we have collected.
For the missing value in height we could expect something around 1.70/1.80 m. And values of 1.50 m and 2.10 m are possible, but less likely.
⇨ missing values have a distribution
So, in statistical terms, we can say that missing values have a distribution.
Predictive distribution of the missing values given the observed values. p(x_{mis}\mid\text{everything else})
Moreover, there usually is some relationship between the missing value and other data that we have collected. If we know that the missing value in height is from a male, larger values become more likely and smaller values less likely.
This means that we need a model to learn how the incomplete variable is related to the other data.
This model, together with an assumption about the type of distribution the missing value has, then allows us to specify the distribution from which we should sample values to impute the missing value. We call this the predictive distribution. And the predictive distribution is generally conditional on everything else, including all other data and parameters.
[Table: dataset with response \mathbf y and covariates \mathbf x_1, \mathbf x_2, \mathbf x_3; \mathbf x_1 is missing for subject i]
Predictive distribution:
p(\color{var(--nord15)}{\mathbf x_1} \mid \mathbf y, \mathbf x_2, \mathbf x_3, \boldsymbol\beta, \sigma)
Let's look at a simple example. Imagine, we have the following dataset, where we have a completely observed response variable y, a variable x_1 that is missing for patient i, and two other covariates that are completely observed.
And so the predictive distribution that we need to sample the imputed value from would be the distribution of x_1, given the response y, the other covariates, and some parameters.
For example:
For example, we could think of this as fitting a regression model with x_1 as the dependent variable, and y & the other covariates as independent variables.
We can then fit this model to all those cases for which we have x_1 observed,...
Fit a model to the cases with observed \color{var(--nord15)}{\mathbf x_1}: \color{var(--nord15)}{\mathbf x_1} = \beta_0 + \beta_1 \mathbf y + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \boldsymbol\varepsilon
Estimate parameters \boldsymbol{\hat\beta}, \hat\sigma
⇨ define distribution
p(\color{var(--nord15)}{x_{i1}} \mid y_i, x_{i2}, x_{i3}, \boldsymbol{\hat\beta}, \hat\sigma)
... in order to estimate the parameters, and to learn what the distribution of x_1 conditional on the other data looks like.
And then we can use this information to specify the predictive distribution for the cases with missing x_1 and sample imputed values from this distribution.
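As a rough sketch of this procedure, assuming hypothetical numpy arrays y, x1, x2, x3 where x1 contains NaN for the incomplete cases, and a normal imputation model:

```python
import numpy as np

# Hypothetical data: y, x2, x3 fully observed; x1 contains NaN
rng = np.random.default_rng(2024)
n = 500
y, x2, x3 = rng.normal(size=(3, n))
x1 = 0.4 * y - 0.3 * x2 + rng.normal(size=n)
x1[rng.uniform(size=n) < 0.3] = np.nan

obs = ~np.isnan(x1)
Z = np.column_stack([np.ones(n), y, x2, x3])   # predictors of x1

# Fit x1 = Z beta + eps on the cases with observed x1
beta_hat, *_ = np.linalg.lstsq(Z[obs], x1[obs], rcond=None)
resid = x1[obs] - Z[obs] @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (obs.sum() - Z.shape[1]))

# Sample imputations: fitted value plus random error
# (parameter uncertainty is ignored here; more on that below)
x1_imp = x1.copy()
x1_imp[~obs] = Z[~obs] @ beta_hat + rng.normal(scale=sigma_hat,
                                               size=(~obs).sum())
```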
We can visualize this for the case where we only have two variables, one incomplete, shown on the y-axis, and one complete, shown on the x-axis. So this is a visualization of the imputation model.
In practice we will of course have more variables, but then I couldn't show it in a simple plot any more. So this is really just to get the idea.
We know the value of the other variable, so we know where our incomplete cases are on the x-axis, but we don't know where to place them on the y-axis. Therefore I only marked them as empty circles on the x-axis here.
When we now fit a model on the observed cases we can represent that as the corresponding regression line.
When we now plug the observed variables of our incomplete cases into the estimated model, we get the fitted values, meaning the corresponding values on the regression line.
Could we now just take those values as our imputed values? We fitted the model on the complete cases, and then we predicted the value of the incomplete variable from that model.
Important: We need to take into account the uncertainty!
about the parameter estimates
about the fitted/predicted value \color{var(--nord15)}{\mathbf{\hat{x}_1}}
Not quite. We can't just use the fitted value to impute the missing value, because there is uncertainty that we haven't taken into account.
There is uncertainty about the parameter estimates in the imputation model. Because our data is just a sample, we don't know the true parameters. With a different sample, we'd get a slightly different regression line.
And there is uncertainty about the values themselves. In the observed data, the data points are not exactly on the regression line, but spread around it. So we'd expect the same for the missing values. Using the fitted values, the values on the regression line, would ignore this random variation that we have in the data.
This is the part where we assume that the missing values have a distribution. This distribution is the random variation around the expected value.
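In code, taking both sources into account means drawing the parameters before sampling the values. A sketch, reusing Z, obs, beta_hat, resid and rng from the earlier snippet, and assuming a vague prior:

```python
# Draw sigma^2 and beta from their posterior (vague prior) first,
# so that the imputations also reflect parameter uncertainty
df = obs.sum() - Z.shape[1]
sigma2_draw = resid @ resid / rng.chisquare(df)       # draw of sigma^2
cov = sigma2_draw * np.linalg.inv(Z[obs].T @ Z[obs])
beta_draw = rng.multivariate_normal(beta_hat, cov)    # draw of beta
x1_imp[~obs] = Z[~obs] @ beta_draw + rng.normal(scale=np.sqrt(sigma2_draw),
                                                size=(~obs).sum())
```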
We want:
Imputation from the predictive distribution
p(\color{var(--nord15)}{x_{mis}} \mid \text{everything else}).
Idea:
Use a "prediction" model.
Take into account:
So, in summary, what have we seen so far?
We want to impute missing values from the predictive distribution of the missing value given everything else.
The idea is to do that via a prediction model.
But we need to take into account that we have multiple sources of uncertainty or variation: the uncertainty about the parameters of the prediction model, and the random variation of the values around their expectation.
So, with this knowledge about missing data and all the things that we need to take into account, let's have a look at some, unfortunately still used, naive methods to handle missing data.
We are now looking at the data that we would use for the actual analysis of interest, and the regression line from that analysis model. So on the x-axis we now have the incomplete covariate and on the y-axis the response, which we assume is fully observed.
The cases for which the covariate is observed are drawn as white dots, the cases for which the covariate is missing as empty purple circles. The correct regression line, which we would get if we didn't have any missing values, is shown as the dashed line.
In a complete case analysis, the regression line would be calculated just based on the white data points. Because the missing values are not missing completely at random, but values are more likely to be missing for larger response values, the estimated line now is lower than the true line.
The first imputation method that I'll show here is mean imputation. All missing values in the covariate are filled in with the mean of the observed values of that covariate. This is shown here with the filled purple dots. You can clearly see that they are not a good representation of the distribution of the true but missing values.
The corresponding regression line, shown with the solid white line is closer to the true line than for complete case analysis but is flatter than the true line.
The second missing data method is the missing indicator method. The idea here is to replace the missing values with a fixed value, for example zero. And, to distinguish the incomplete cases from the complete cases we additionally add an indicator variable that is zero for observed cases and one for incomplete cases.
As for mean imputation we see that the imputed values do not at all represent the spread of the missing values. Because of the indicator variable we now get two regression lines, one for observed one for incomplete cases, but they have the same slope, which seems to be similar to the slope of the true regression line.
Next, we have regression imputation. The idea here is to impute based on a prediction model, like we saw before, but to just use the fitted values from that prediction. Because this method also does not take into account the random variation, we see that all imputed values are on one straight line.
And again we see that the model fitted on the imputed data results in a regression line with different slope than the true line, now with a steeper slope.
In single imputation we now improve upon the regression imputation by taking into account both the uncertainty about the parameters in the imputation model and the random variation. And we can see that the imputed values have a distribution that is much more similar to the distribution of the missing values.
The corresponding regression line is also almost identical to the true line.
Here I have an overview of the parameter estimate of the incomplete covariate and the corresponding 95% confidence interval. So this is the slope that we saw for all the different methods.
On top is the value for the complete data, and to make the comparison easier I have added a shaded area that has the width of the 95% CI from the complete data analysis.
Of course, because this is just the results from one very simple example, we can't draw any conclusions about how much bias we get from which method and how they compare in general. This was just to visualize a bit what happens when you use one of these naive methods.
We see that the different methods disagree quite a bit in their estimates. The single imputation comes closest, but when we take a closer look at the CI we see that it is actually a bit narrower than the true CI. In the example I used here, I have a bit more than 50% missing values, so we should have quite a bit of additional uncertainty compared to the complete data.
The single imputation approach clearly underestimates the uncertainty that we have about the effect of the covariate.
Can take into account:
the uncertainty about the parameter estimates
the random variation around the expected value
But:
Single imputation does not take into account the uncertainty about the imputed value!
In the single imputation we did take into account two of the sources of uncertainty or variation, but we only have one imputed value. We have no way of taking into account the added uncertainty that we have about the imputed value compared to an observed value, when we just have one single value.
And this is why Donald Rubin came up with the idea of multiple imputation.
MI was developed in the 1960s/70s...
Requirements
The idea behind multiple imputation is that, using this principle, we sample imputed values and fill them into the original, incomplete data to create a completed dataset.
And in order to take into account the uncertainty that we have about the missing values, we do this multiple times, so that we obtain multiple completed datasets.
Because all the missing values have now been filled in, we can analyse each of these datasets separately with standard statistical techniques.
To obtain overall results, the results from each of these analyses need to be combined in a way that takes into account both the uncertainty that we have about the estimates from each analysis, and the variation between these estimates.
Pooled Parameter Estimate:
\bar\beta = \frac{1}{m}\sum_{\ell = 1}^m \hat\beta^{(\ell)} \qquad
\text{(average estimate)}
Pooled Variance: T = \bar W + B + B/m
\displaystyle\bar W = \frac{1}{m}\sum_{\ell = 1}^m \mathrm{var}(\hat\beta^{(\ell)}) average within imputation variance
\displaystyle B = \frac{1}{m - 1}\sum_{\ell = 1}^m (\hat \beta^{(\ell)} - \bar\beta)^2 between imputation variance
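These pooling rules are easy to apply in code. A small sketch with made-up numbers (the estimates and variances are purely illustrative):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m estimates and their variances (squared standard errors)
    using Rubin's rules; returns the pooled estimate and total variance."""
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    m = est.size
    beta_bar = est.mean()          # pooled estimate (average)
    W = var.mean()                 # average within-imputation variance
    B = est.var(ddof=1)            # between-imputation variance
    T = W + B + B / m              # total variance
    return beta_bar, T

# e.g. pooling the results of m = 5 analyses:
beta_bar, T = pool_rubin([0.52, 0.47, 0.55, 0.50, 0.49],
                         [0.010, 0.012, 0.009, 0.011, 0.010])
print(beta_bar, np.sqrt(T))        # pooled estimate and its standard error
```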
[Table: dataset with missing values scattered across several variables (\mathbf y, \mathbf x_1, \mathbf x_2, \mathbf x_3, \ldots)]
Predictive distributions
And the most common approach to imputation in this setting is MICE, short for multivariate imputation by chained equations, an approach that is also called fully conditional specification.
The principle is an extension to what we've seen on the previous slides. We impute missing values using models that have all other data in their linear predictor.
Most common approach:
MICE (multivariate imputation by chained equations)
FCS (fully conditional specification)
Iterative:
start with random draws from the observed data
cycle through the models to update the imputed values
until convergence
⇨ keep only last imputed value
Because in these imputation models we now have incomplete covariates, we use an iterative algorithm. We start by randomly drawing starting values from the observed part of the data, and then we cycle through the incomplete variables and impute one at a time.
Flexible model types
choose a different type of model per incomplete variable
The models for the different variables can be specified according to the type of variable.
Once we have imputed each missing value, we start again with the first variable, but now use the imputed values of the other variables instead of the starting values, and we do this a few times until the algorithm has converged.
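A stripped-down sketch of this algorithm, assuming all variables are numeric and every imputation model is a normal linear model (real implementations such as the mice package in R also draw the model parameters and support other model types per variable):

```python
import numpy as np

def mice_sketch(data, n_iter=10, rng=None):
    """One chained-equations imputation of a numeric 2D array
    (rows = subjects, columns = variables): each incomplete variable
    is imputed from a normal linear model given all other variables."""
    rng = rng if rng is not None else np.random.default_rng()
    miss = np.isnan(data)
    imp = data.copy()
    n, p = data.shape

    # Starting values: random draws from the observed part of each column
    for j in range(p):
        if miss[:, j].any():
            imp[miss[:, j], j] = rng.choice(data[~miss[:, j], j],
                                            size=miss[:, j].sum())

    for _ in range(n_iter):              # cycle until (assumed) convergence
        for j in range(p):
            if not miss[:, j].any():
                continue
            obs = ~miss[:, j]
            Z = np.column_stack([np.ones(n), np.delete(imp, j, axis=1)])
            beta, *_ = np.linalg.lstsq(Z[obs], imp[obs, j], rcond=None)
            resid = imp[obs, j] - Z[obs] @ beta
            sigma = np.sqrt(resid @ resid / max(obs.sum() - Z.shape[1], 1))
            imp[~obs, j] = Z[~obs] @ beta + rng.normal(scale=sigma,
                                                       size=(~obs).sum())
    return imp                           # keep only the last imputed values
```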
Relevant for the choice / impact of methods:
Where are values missing?
Why are values missing?
⇨ Missing Data Mechanism
How much is missing / how much information is available?
[Table: dataset where \mathbf x_1 is missing for most subjects, including subject i]
How much is missing / how much information is available?
Scenario 1:
N = 200, 90% of \color{var(--nord15)}{\mathbf x_1} is missing
⇨ N = 20 to estimate \boldsymbol\beta
Relevant covariates / strength of association
[Table: the same dataset, highlighting the covariates that are associated with the incomplete \mathbf x_1]
Where are values missing?
[Table: dataset where values are missing in the response \mathbf y]
Imputation Model for \color{var(--nord15)}{\mathbf y}: \color{var(--nord15)}{\mathbf y} = \alpha_0 + \alpha_1 \color{var(--nord15)}{\mathbf x_1} + \alpha_2 \mathbf x_2 + \alpha_3 \mathbf x_3 + \varepsilon_y
Analysis Model
Missing values in the response:
If analysis model = imputation model
⇨ \boldsymbol{\hat\beta} = \boldsymbol{\hat\alpha}
⇨ No point in imputing responses
Auxiliary variables:
⇨ analysis model \neq imputation model
⇨ \boldsymbol{\hat\beta} \neq \boldsymbol{\hat\alpha}
⇨ Imputing responses can be beneficial
Why are values missing?
Imputation of \color{var(--nord15)}{\mathbf x_1} based on:
\color{var(--nord15)}{\mathbf x_1} = \beta_0 + \beta_1 \mathbf y + \beta_2 \color{var(--nord15)}{\mathbf x_2} + \beta_3 \color{var(--nord15)}{\mathbf x_3} + \ldots
⇨ Imputed \color{var(--nord15)}{\mathbf x_1} will have the same distribution as observed \color{var(--nord15)}{\mathbf x_1} with the same values of all other variables.
⇨ FCS MI is valid under MAR
valid under MAR
imputation models need to contain the important predictors in the right form
allows us to take into account the uncertainty about the imputed values
Implied Assumption:
Linear association
between \color{var(--nord15)}{\mathbf x_1} and \mathbf y:
\color{var(--nord15)}{\mathbf x_1} = \beta_0 + \bbox[#3B4252, 2pt]{\beta_1 \mathbf y} + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3
Implied assumption: linear association between incompl. covariate and outcome (and other covariates)
But what if
\mathbf y = \theta_0 + \bbox[#3B4252, 2pt]{\theta_1 \color{var(--nord15)}{\mathbf x_1} + \theta_2 \color{var(--nord15)}{\mathbf x_1}^2} + \theta_3 \mathbf x_2 + \theta_4 \mathbf x_3
But what if we have a setting where we assume that there is a non-linear association, for example quadratic?
⇨ bias!
If we impute \color{var(--nord15)}{\mathbf x_1} under the implied linear assumption, we introduce bias, even if we analyse the imputed data under the correct (quadratic) assumption.
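A small simulation sketch of this problem (entirely hypothetical numbers): we generate data from a quadratic model, impute x_1 under the linear assumption, and the quadratic coefficient estimated from the imputed data is attenuated.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
x1 = rng.normal(size=n)
y = 1 + 0.5 * x1 + 0.5 * x1**2 + rng.normal(size=n)    # truly quadratic

x1_mis = x1.copy()
x1_mis[rng.uniform(size=n) < 0.5] = np.nan             # 50% MCAR missing

# Impute x1 from a *linear* model in y (the implied, wrong assumption)
obs = ~np.isnan(x1_mis)
Z = np.column_stack([np.ones(n), y])
beta, *_ = np.linalg.lstsq(Z[obs], x1[obs], rcond=None)
resid = x1[obs] - Z[obs] @ beta
sigma = np.sqrt(resid @ resid / (obs.sum() - 2))
x1_imp = x1_mis.copy()
x1_imp[~obs] = Z[~obs] @ beta + rng.normal(scale=sigma, size=(~obs).sum())

# Analyse the imputed data under the *correct* quadratic model
X = np.column_stack([np.ones(n), x1_imp, x1_imp**2])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)   # theta_2 comes out attenuated relative to the true 0.5
```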
With non-linear associations specification of the correct imputation model may not be possible.
Settings with non-linear associations:
⇨ We then cannot just specify the imputation model as a simple regression model with all other variables in the linear predictor.
Also critical:
settings with correlated observations
⇨ Bayes
How many imputed datasets do I need?
Should we do a complete case analysis as sensitivity analysis?
What % missing values is still ok?
Can I impute missing values in the response?
Can I impute missing values in the exposure?
Which variables do I need to include in the imputation?
Why do I need to include the response into the imputation models? Won't that artificially increase the association?
How should I report missing data / imputation in a paper?