class: left, top, title-slide
Working with Missing Data and Imputation
Nicole Erler
Department of Biostatistics
n.erler@erasmusmc.nl
N_Erler
NErler
https://nerler.com
--- count: false layout: true <div class="my-footer"><span> <a href="https://twitter.com/N_Erler"><i class="fab fa-twitter"></i> N_Erler</a>      <a href="https://github.com/NErler"><i class="fab fa-github"></i> NErler</a>      <a href = "https://nerler.com"><i class="fas fa-globe-americas"></i> nerler.com</a> </span></div> --- count: false class: center, middle # Missing Values are a Problem! ??? I'm going to start right at the beginning, and want to demonstrate why missing values are a problem. - researchers who thought it was possible to use cases if there was only a single value missing - SPSS options that make it seem this is possible This is a bit theoretical, with lots of math, but don't worry, the math is more for visualization, and the presentation won't be all formulas. I'll then talk a bit in general about missing data, look at some naive missing data methods, and then we'll take a look at multiple imputation. --- ## Example: Linear Regression **Linear Regression Model:** `\begin{eqnarray*} \mathbf y &=& \beta_0 + \beta_1 \mathbf x_1 + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \boldsymbol\varepsilon\\ &=& \mathbf X \boldsymbol\beta + \boldsymbol\varepsilon \end{eqnarray*}` ??? We use linear regression as an example because there we can calculate the solution for the regression coefficients by hand with a formula, and, theoretically, wouldn't need a computer to fit the model. A linear regression model is written as a response `\(y\)` with covariates `\(x\)`, some regression coefficients `\(\beta\)`, and the error terms `\(\varepsilon\)`. We can also write this model in matrix notation, ... - - - -- with `$$\mathbf y = \begin{pmatrix} y_1\\ y_2\\ y_3\\ y_4\\ y_5 \end{pmatrix} \qquad \mathbf X = \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13}\\ 1 & \color{var(--nord15)}{?} & x_{22} & x_{23}\\ 1 & x_{31} & x_{32} & x_{33}\\ 1 & x_{41} & x_{42} & x_{43}\\ 1 & x_{51} & x_{52} & x_{53} \end{pmatrix} \qquad \boldsymbol\beta = \begin{pmatrix} \beta_0\\ \beta_1\\ \beta_2\\ \beta_3 \end{pmatrix}$$` ??? ... and then we have `\(y\)` as a vector, here, as an example, for 5 subjects. `\(X\)` is the design matrix, which contains the different covariates in the columns and has a column of 1s for the intercept; the value of `\(x_1\)` for the second subject is missing. The regression coefficients `\(\beta\)` are also a vector. --- ## Example: Linear Regression **The Least Squares Estimator** `$$\hat{\boldsymbol\beta} = (\mathbf X^\top\mathbf X)^{-1} \mathbf X^\top \mathbf y$$` ??? The regression coefficients in the linear model are usually estimated using the least squares estimator, and this estimator has a simple formula that depends only on the design matrix `\(X\)` and the response `\(y\)`. We'll now go through this formula in steps to see how the calculation is impacted by the one missing value in `\(X\)`. - - - -- <br> `$$\mathbf X^\top \mathbf X = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ x_{11} & \color{var(--nord15)}{?} & x_{31} & x_{41} & x_{51}\\ x_{12} & x_{22} & x_{32} & x_{42} & x_{52}\\ x_{13} & x_{23} & x_{33} & x_{43} & x_{53} \end{pmatrix} \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13}\\ 1 & \color{var(--nord15)}{?} & x_{22} & x_{23}\\ 1 & x_{31} & x_{32} & x_{33}\\ 1 & x_{41} & x_{42} & x_{43}\\ 1 & x_{51} & x_{52} & x_{53} \end{pmatrix}$$` ??? We start with the product of `\(X^\top\)` and `\(X\)`. `\(X^\top\)` is the design matrix, but with rows and columns swapped, so that each row is one variable, and each column is one subject. And we need to multiply these two matrices.
--- ## Example: Linear Regression <svg id = "rect1" width="320" height="35"> <rect width="100%" height="100%" rx = "3" style="fill:var(--nord1);" /> </svg> <svg id = "rect2" width = 30 height = 210> <rect width="100%" height="100%" rx = "3" style="fill:var(--nord1);" /> </svg> <svg id = "square1" width="30" height="30"> <rect width="100%" height="100%" rx = "3" style="fill:var(--nord1);" /> </svg> <svg id = "square2" width="30" height="30"> <rect width="100%" height="100%" rx = "3" style="fill:var(--nord1);" /> </svg> `$$\mathbf X^\top \mathbf X = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ x_{11} & \color{var(--nord15)}{?} & x_{31} & x_{41} & x_{51}\\ x_{12} & x_{22} & x_{32} & x_{42} & x_{52}\\ x_{13} & x_{23} & x_{33} & x_{43} & x_{53} \end{pmatrix} \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13}\\ 1 & \color{var(--nord15)}{?} & x_{22} & x_{23}\\ 1 & x_{31} & x_{32} & x_{33}\\ 1 & x_{41} & x_{42} & x_{43}\\ 1 & x_{51} & x_{52} & x_{53} \end{pmatrix} = \begin{pmatrix} \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot \end{pmatrix}$$` <br> `$$\cdot = 1\cdot 1 +1\cdot 1 + 1\cdot 1 + 1\cdot 1 + 1\cdot 1$$` ??? How does matrix multiplication work? We always multiply one row from the first matrix with a column from the second matrix, and take the sum over all the product from these two vectors. The result from the first row and first column will then be the top left element in the result matrix. And because here we have the intercept multiplied with itself, we have the sum over the product of 1s, which is 5 in this case, because we have 5 subjects. --- ## Example: Linear Regression <svg id = "rect1" width="320" height="35"> <rect width="100%" height="100%" rx = "3" style="fill:var(--nord1);" /> </svg> <svg id = "rect3" width = 45 height = 210> <rect width="100%" height="100%" rx = "3" style="fill:var(--nord1);" /> </svg> <svg id = "square3" width="30" height="30"> <rect width="100%" height="100%" rx = "3" style="fill:var(--nord1);" /> </svg> <svg id = "square4" width="30" height="30"> <rect width="100%" height="100%" rx = "3" style="fill:var(--nord1);" /> </svg> `$$\mathbf X^\top \mathbf X = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ x_{11} & \color{var(--nord15)}{?} & x_{31} & x_{41} & x_{51}\\ x_{12} & x_{22} & x_{32} & x_{42} & x_{52}\\ x_{13} & x_{23} & x_{33} & x_{43} & x_{53} \end{pmatrix} \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13}\\ 1 & \color{var(--nord15)}{?} & x_{22} & x_{23}\\ 1 & x_{31} & x_{32} & x_{33}\\ 1 & x_{41} & x_{42} & x_{43}\\ 1 & x_{51} & x_{52} & x_{53} \end{pmatrix} = \begin{pmatrix} \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot\\ \cdot & \cdot & \cdot & \cdot \end{pmatrix}$$` <br> `\begin{eqnarray*} \cdot &=& 1 \cdot x_{11} + 1\cdot\color{var(--nord15)}{?} + 1\cdot x_{31} + 1\cdot x_{41} + 1\cdot x_{51}\\ &=& x_{11} + \color{var(--nord15)}{?} + x_{31} + x_{41} + x_{51}\\ &=& \color{var(--nord15)}{?} \end{eqnarray*}` ??? Then we move on to the second column, and here we multiply again each element with one, so, one times `\(x_{11}\)`, one times the missing value, and so on. And then we need to sum up all the products, but because one of the summands is unknown, the sum will also be unknown. 
--- ## Example: Linear Regression `$$\mathbf X^\top \mathbf X = \begin{pmatrix} \cdot & \color{var(--nord15)}{?} & \cdot & \cdot\\ \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\ \cdot & \color{var(--nord15)}{?} & \cdot & \cdot\\ \cdot & \color{var(--nord15)}{?} & \cdot & \cdot \end{pmatrix}$$` ??? When we continue like this, some elements in the result of `\(X^\top X\)` are unknown, indicated by the question marks; all the values shown as dots we can calculate. - - - -- <br> `$$(\mathbf X^\top \mathbf X)^{-1} = \begin{pmatrix} \cdot & \color{var(--nord15)}{?} & \cdot & \cdot\\ \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\ \cdot & \color{var(--nord15)}{?} & \cdot & \cdot\\ \cdot & \color{var(--nord15)}{?} & \cdot & \cdot \end{pmatrix}^{-1} = \begin{pmatrix} \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\ \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\ \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?}\\ \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} & \color{var(--nord15)}{?} \end{pmatrix}$$` ??? But in the formula for the least squares estimator we then have to take the inverse of this new matrix. Calculating the inverse by hand is a bit tedious, so I'm not going to go through it step by step. But the result is that we now have unknown values on all positions of the inverted matrix, because the calculations always involve one or more of the unknown elements of the input matrix. --- ## Example: Linear Regression When there are **missing values** in `\(\mathbf X\)` or `\(\mathbf y\)` we **cannot estimate `\(\boldsymbol\beta\)`!!!** ⇨ Exclude cases with missing values? <img src = "materials/giphy.gif", height = 300 style = "margin: auto; display: block;"> ??? And so it is clear: whenever we have missing values in the covariates, we cannot estimate our regression coefficients. And the same goes for missing values in the response `\(y\)`. And so the logical conclusion would be that we would have to exclude all those cases for which some values are missing, and perform a complete case analysis. --- class: center, middle # Complete Case Analysis is (usually) a Bad Idea! ??? But, a complete case analysis is in most cases a rather bad idea. --- ## Complete Case Analysis <img src="index_files/figure-html/ccdemo-1.png" width="100%" /> ??? Here is one reason why. You see on the y-axis the proportion of complete cases in a dataset, and on the x-axis the number of incomplete variables. Each line represents a different proportion of missing values per variable. So, if we had 10% missing values in 25 variables, we'd end up with only 7% of the original sample size. And if we had 10% missing values in 10 variables, we'd have 35% of our data left over in a complete case analysis. Even when we'd have only 2% missing in only 5 variables, we could lose 10% of the data.
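---
## What Happens in R?

A minimal sketch of the two points above (toy data, not taken from the slides): a single `NA` makes `\(\mathbf X^\top\mathbf X\)` and its inverse incomputable, and `lm()` quietly falls back to a complete case analysis.

```r
set.seed(2020)
X <- cbind(1, x1 = rnorm(5), x2 = rnorm(5), x3 = rnorm(5))
y <- drop(X %*% c(1, 2, -1, 0.5) + rnorm(5))

X[2, "x1"] <- NA          # x_1 missing for subject 2

crossprod(X)              # X'X: the row and column for x_1 are NA
try(solve(crossprod(X)))  # the inverse cannot be computed

fit <- lm(y ~ X[, -1])    # default na.action = na.omit
nobs(fit)                 # 4: the incomplete case was silently dropped
```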
--- ## Complete Case Analysis Complete Case Analysis is <ul class="fa-ul"> <li><span class = "fa-li" style = "color:var(--nord11);"><i class="far fa-frown"></i></span>inefficient</li> <li><span class = "fa-li" style = "color:var(--nord11);"><i class="far fa-frown"></i></span>usually biased</li> </ul> <br> **For Example:** <ul class="fa-ul"> <li> <a href = "https://thestatsgeek.com/2013/07/06/when-is-complete-case-analysis-unbiased"> <span class = "fa-li"><i class="fab fa-wordpress"></i></span> thestatsgeek.com (2013)</a> </li> <li> <a href = "https://doi.org/10.1002/sim.3944"> <span class = "fa-li"><i class="fas fa-file-alt"></i></span> White & Carlin (2010)</a> </li> <li> <a href = "https://doi.org/10.1016/j.jclinepi.2009.08.028"> <span class = "fa-li"><i class="fas fa-file-alt"></i></span> Knol et al. (2010)</a> </li> <li> <a href = "https://doi.org/10.1016/j.jclinepi.2006.01.015"> <span class = "fa-li"><i class="fas fa-file-alt"></i></span> Van der Heijden et al. (2006)</a> </li> <li> <a href = "https://doi.org/10.1016/j.jclinepi.2009.12.008"> <span class = "fa-li"><i class="fas fa-file-alt"></i></span> Janssen et al. (2010)</a> </li> </ul> <!-- --- --> <!-- ## Literature --> <!-- <ul class="fa-ul"> --> <!-- <li> --> <!-- <a href = "https://thestatsgeek.com/2013/07/06/when-is-complete-case-analysis-unbiased"> --> <!-- <span class = "fa-li"><i class="fab fa-wordpress"></i></span> --> <!-- thestatsgeek.com (2013)</a><br> --> <!-- When is complete case analysis unbiased? --> <!-- </li> --> <!-- ??? --> <!-- When missingness occurs in either the outcome, one or more of the predictors, --> <!-- or both, fitting the regression model to the complete cases is unbiased provided the probability of being a complete case is independent of Y, conditional on X. --> <!-- For example, suppose X are factors measured on subjects at recruitment into the --> <!-- cohort study, and that the outcome Y is measured some time after recruitment. --> <!-- Suppose one of the predictors in X has missing values. Then missingness in X --> <!-- can't be directly caused by Y, since the future value of Y is yet to be determined. Missingness in X is either caused by the value of X itself, or by other factors/variables. Only if missingness is caused by such other factors, and these factors independently affect the outcome Y, will complete case analysis be biased. --> <!-- Unfortunately, this cannot be tested with the observed data. --> <!-- -- --> <!-- <li> --> <!-- <a href = "https://doi.org/10.1002/sim.3944"> --> <!-- <span class = "fa-li"><i class="fas fa-file-alt"></i></span> --> <!-- White & Carlin (2010)</a><br> --> <!-- Bias and efficiency of multiple imputation compared with complete-case analysis --> <!-- for missing covariate values --> <!-- </li> --> <!-- <li> --> <!-- <a href = "https://doi.org/10.1016/j.jclinepi.2009.08.028"> --> <!-- <span class = "fa-li"><i class="fas fa-file-alt"></i></span> --> <!-- Knol et al. (2010)</a><br> --> <!-- Unpredictable bias when using the missing indicator method or complete case --> <!-- analysis for missing confounder values: an empirical example --> <!-- </li> --> <!-- <li> --> <!-- <a href = "https://doi.org/10.1016/j.jclinepi.2006.01.015"> --> <!-- <span class = "fa-li"><i class="fas fa-file-alt"></i></span> --> <!-- Van der Heijden et al.
(2006)</a><br> --> <!-- Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: A clinical example --> <!-- </li> --> <!-- <li> --> <!-- <a href = "https://doi.org/10.1503/cmaj.110977"> --> <!-- <span class = "fa-li"><i class="fas fa-file-alt"></i></span> --> <!-- Groenwold et al. (2012)</a><br> --> <!-- Missing covariate data in clinical research: when and when not to use the missing-indicator method for analysis --> <!-- </li> --> <!-- <li> --> <!-- <a href = "https://doi.org/10.1016/j.jclinepi.2009.12.008"> --> <!-- <span class = "fa-li"><i class="fas fa-file-alt"></i></span> --> <!-- Janssen et al. (2010)</a><br> --> <!-- Missing covariate data in medical research: to impute is better than to ignore --> <!-- </li> --> ??? So it is clear, complete case analysis is very inefficient. In many cases we'll lose quite a bit of data. Moreover, complete case analysis is biased in most settings. There are a few very specific exceptions, depending on what kind of model you use, where the missing values are, and why they are missing. --- class: center, middle # Missing Data & Imputation ??? And, so, for most methods to handle missing values we can't make a general statement that will always be true. For the impact of a method there are a number of relevant aspects. --- ## Missing Values **Relevant** for the choice / impact of methods: .flex-grid[ .col[ - **How much is missing?** * per variable * per subject * complete cases ] .col[ - **How much information is available?** * sample size * relevant covariates * strength of association ] ] ??? The first question that we usually ask ourselves is how much is actually missing in the data. We can distinguish between the proportion or number of missing values per variable or per subject. And, as we've seen, we might also need to check what that means for the number of complete cases. But what I find sometimes even more relevant is how much information is available. Again, with respect to the number of observations per variable and per subject: are there relevant covariates that are associated with the variables that have missing values, how strong are these associations, and are these covariates observed for the cases with missing values in the other variables? - - - -- - **Where are values missing?** * response * covariates - **Why are values missing?**<br> ⇨ Missing Data Mechanism ??? We also need to distinguish between missing values in covariates and the response, and we need to think about, and make assumptions about, why the values are missing, meaning, the missing data mechanism.
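---
## How Much is Missing?

A quick first look in R, as a minimal sketch (`dat` is a placeholder for the analysis data; the **mice** package is assumed to be installed):

```r
library(mice)

colMeans(is.na(dat))        # proportion of missing values per variable
table(rowSums(is.na(dat)))  # number of missing values per subject
mean(complete.cases(dat))   # proportion of complete cases

md.pattern(dat)             # missing data patterns (from mice)
```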
--- ## Missing Data Mechanisms **Missing Completely At Random (MCAR)** `$$\mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) = \mathrm{Pr}(\text{value missing})$$` .sgrey[no systematic difference between complete and incomplete cases] ??? For the missing data mechanism there is a specific terminology. First, we can have "missing completely at random" missing data. Missing completely at random means that the probability of a value being missing does not depend on anything; it is completely random and has nothing to do with what we are investigating in our study. This means that there are no systematic differences between complete and incomplete cases. - - - -- <br> **Missing At Random (MAR)** `$$\mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) = \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs})$$` ??? Then, we have missing at random. In missing at random the assumption is that the probability of a value being missing depends on other things, but only on things that we have measured in our data and that are actually observed. - - - -- **Missing Not At Random (MNAR)** `$$\mathrm{Pr}(\text{value missing} \mid \text{data}_{obs}, \text{data}_{mis}) \neq \mathrm{Pr}(\text{value missing} \mid \text{data}_{obs})$$` ??? The last missing data mechanism is missing not at random, and here the probability that a value is missing does depend on things that we have not measured or that are missing. This can either be the missing value itself, or missing values in other variables, or things that we haven't measured at all. There is no way of testing whether we are dealing with MNAR or MAR. We will always need to make an assumption about whether we have MNAR data. Sometimes you can read in clinical papers that they assumed "random missingness", or "that the missing values are random". I assume that they refer either to MCAR or MAR, but it isn't clear which one, and it can make a very important difference whether you have MCAR or MAR. --- ## Some Examples * Data is collected by questionnaire ⇨ some got lost in the mail ??? Let's look at a few examples to see which type of missing data mechanism we might have. Say, we have a study for which we have collected data using a questionnaire. Some of the questionnaires were filled in, but on the way back they got lost in the mail. Which type of missing data mechanism would that be? * * * * If this is a study in the Netherlands we could probably argue that this is MCAR. But if we were performing a study in various areas in, say, Africa, and the postal service in the rural areas is much more unreliable than in the cities, and there are other factors that are of interest in our study that also differ between rural areas and cities, we won't have MCAR any more. - - - -- * A particular biomarker was not part of the standard panel before 2008<br> ⇨ missing for many patients who entered < 2008 ??? Another example. Imagine a particular biomarker was not part of the standard blood panel before 2008. And so, for most of the patients who entered the study before 2008 this value is missing, but for people who entered later it is mostly observed. Which missing data mechanism do we have? If we know the year of inclusion, then we'd have MAR. - - - -- * In a survey in pregnant women some do not fill in the answer to "Are you currently smoking?" ??? Another example. We have a survey that we send out to pregnant women. One of the questions is whether they are currently smoking. What mechanism would you expect for the missing values in that variable? .... - - - -- * Same survey: missing values in "chocolate consumption". ??? In the same survey, we also ask about the women's daily chocolate consumption. What about the missing values in this variable? - - - -- <br> .box.bg-0.brdr-8[ MCAR / MAR / MNAR are NOT a property of the data but of a **model**. ] ??? As you see, the missing data mechanism is actually not a property of the data itself, but rather of the model that we use to fit the data or to impute it. And to make suitable assumptions you need expert knowledge on how the data was measured. This is not something that the statistician can determine.
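---
## Missing Data Mechanisms: A Toy Simulation

A small sketch of the difference (simulated data; the coefficients are arbitrary): missingness in `\(x_1\)` that depends only on the observed `\(y\)` is MAR, while missingness that depends on `\(x_1\)` itself is MNAR.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
y  <- 1 + 0.5 * x1 + rnorm(n)

# MAR: probability of missingness depends on the (observed) response only
x1_mar  <- ifelse(runif(n) < plogis(-1 + y), NA, x1)

# MNAR: probability of missingness depends on the value of x1 itself
x1_mnar <- ifelse(runif(n) < plogis(-1 + x1), NA, x1)
```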
--- ## Understanding Missing Values .pull-left[ * there is **uncertainty** about the missing value {{content}} ] ??? The important issue in imputing missing values is that there is **uncertainty** about what the value would have been. And so we **can't just pick** one value and fill it in, because then we would just ignore this uncertainty. If the value of `height` is missing for one patient, we don't know what that value would have been. - - - - -- * some values are **more likely** than others {{content}} ??? Also: some values are going to be more likely than others, and usually there is a relationship between the variable that has missing values and the other data that we have collected. For the missing value in `height` we could expect something around 1.70 / 1.80m. And values of 1.50m and 2.10m are possible, but less likely. - - - - -- **⇨ missing values have a distribution** <img src="figures/ImpDens.png", height = 250, style = "margin: auto; display: block;"> ??? So, in statistical terms, we can say that missing values have a distribution. - - - - -- .pull-right[ * there is a relationship with **other** available **data** <br> .box.bg-0[ <strong>Predictive distribution</strong> of the missing values given the observed values. `$$p(x_{mis}\mid\text{everything else})$$` ] ] ??? Moreover, there usually is some relationship between the missing value and other data that we have collected. If we know that the missing value in `height` is from a male, larger values become more likely and smaller values less likely. This means that we need a model to learn how the incomplete variable is related to the other data. This model, together with an assumption about the type of distribution the missing value has, then allows us to specify the distribution from which we should sample values to impute the missing value. We call this the predictive distribution. And the predictive distribution is generally based on everything else, including all other data and parameters.
--- ## A Simple Example .gr-left[ <table class="data-table"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\mathbf x_1\)</th> <th>\(\mathbf x_2\)</th> <th>\(\mathbf x_3\)</th> </tr> <tr><td></td><td colspan = "4"; style = "padding: 0px;"><hr /></td><tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> </tr> <tr> <td class = "rownr"></td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> </tr> </table> * `\(\mathbf y\)`: **response** * `\(\color{var(--nord15)}{\mathbf x_1}\)`: **incomplete** covariate * `\(\mathbf x_2\)`, `\(\mathbf x_3\)`: **complete** covariates ] .gr-right[ **Predictive distribution:** `$$p(\color{var(--nord15)}{\mathbf x_1} \mid \mathbf y, \mathbf x_2, \mathbf x_3, \boldsymbol\beta, \sigma)$$` <br> {{content}} ] ??? Let's look at a simple example. Imagine we have the following dataset, where we have a completely observed response variable `\(y\)`, a variable `\(x_1\)` that is missing for patient `\(i\)`, and two other covariates that are completely observed. And so the predictive distribution that we need to sample the imputed value from would be the distribution of `\(x_1\)`, given the response `\(y\)`, the other covariates, and some parameters. - - - -- For example: * Fit a model to the cases with observed `\(\color{var(--nord15)}{\mathbf x_1}\)`: `$$\color{var(--nord15)}{\mathbf x_1} = \beta_0 + \beta_1 \mathbf y + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3 + \boldsymbol\varepsilon$$` {{content}} ??? For example, we could think of this as fitting a regression model with `\(x_1\)` as the dependent variable, and `\(y\)` & the other covariates as independent variables. We can then fit this model to all those cases for which we have `\(x_1\)` observed,... - - - - -- * Estimate parameters `\(\boldsymbol{\hat\beta}, \hat\sigma\)`<br> ⇨ define distribution `\(p(\color{var(--nord15)}{x_{i1}} \mid y_i, x_{i2}, x_{i3}, \boldsymbol{\hat\beta}, \hat\sigma)\)` ??? ... in order to estimate the parameters, and to learn what the distribution of `\(x_1\)` conditional on the other data looks like. And then we can use this information to specify the predictive distribution for the cases with missing `\(x_1\)` and sample imputed values from this distribution. --- ## Imputation of Missing Values <img src="index_files/figure-html/imp regline1-1.png" width="100%" /> ??? We can visualize this for the case where we only have two variables, one incomplete, shown on the y-axis, and one complete, shown on the x-axis. So this is a visualization of the imputation model. In practice we will of course have more variables, but then I couldn't show it in a simple plot any more. So this is really just to get the idea.
We know the value of the other variable, so we know where our incomplete cases are on the x-axis, but we don't know where to place them on the y-axis. Therefore I only marked them as empty circles on the x-axis here. --- count: false ## Imputation of Missing Values <img src="index_files/figure-html/imp regline2-1.png" width="100%" /> ??? When we now fit a model on the observed cases we can represent that as the corresponding regression line. --- count: false name: predval_reg ## Imputation of Missing Values <img src="index_files/figure-html/imp regline3-1.png" width="100%" /> ??? [jump to regression imputation](#regimp) When we now plug the observed variables of our incomplete cases into the estimated model, we get the fitted values, meaning the corresponding values on the regression line. Could we now just take those values as our imputed values? We fitted the model on the complete cases, and then we predicted the value of the incomplete variable from that model. --- ## Imputation of Missing Values .box.bg-0[ **Important:** We need to take into account the **uncertainty**! ] ??? Not quite. We can't just use the fitted value to impute the missing value because there is uncertainty that we haven't taken into account. - - - - - -- .pull-left[ about the **parameter estimates** <img src="index_files/figure-html/imp reglines multi-1.png" width="100%" /> ] ??? There is uncertainty about the parameter estimates in the imputation model. Because our data is just a sample, we don't know the true parameters. With a different sample, we'd get a slightly different regression line. -- .pull-right[ about the **fitted/predicted value** `\(\color{var(--nord15)}{\mathbf{\hat{x}_1}}\)` <img src="index_files/figure-html/imp prederror-1.png" width="100%" /> ] ??? And there is uncertainty about the values themselves. In the observed data, the data points are not exactly on the regression line, but spread around it. So we'd expect the same for the missing values. Using the fitted values, the values on the regression line, would ignore this random variation that we have in the data. This is the part where we assume that the missing values have a distribution. This distribution is the random variation around the expected value. --- ## Imputation of Missing Values **We want:**<br> Imputation from the **predictive distribution** `\(p(\color{var(--nord15)}{x_{mis}} \mid \text{everything else})\)`. <br> **Idea:**<br> Use a "prediction" model. <br> **Take into account:** * **uncertainty in parameter** estimates `\(\boldsymbol{\hat\beta}\)` * **prediction error** `\((\mathbf{\hat x}_{mis} \neq \mathbf x_{mis})\)` * missing values have a **distribution** ⇨ we can't just replace them with **one** value. ??? So, in summary, what have we seen so far? We want to impute missing values from the predictive distribution of the missing value given everything else. The idea is to do that via a prediction model. But we need to take into account that we have multiple sources of uncertainty or variation: - uncertainty about the parameters in the imputation model - random variation of the unknown values (also called prediction error) - and we need to take into account that there is uncertainty about the missing value, so that we can't represent a missing value by one single imputed value, because that would not capture the additional uncertainty that we have compared to an observed value. --- class: center, middle # Naive Ways to Handle Missing Data ???
So, with this knowledge about missing data and all the things that we need to take into account, let's have a look at some naive methods to handle missing data that are unfortunately still used. --- ## Naive Ways to Handle Missing Data <img src="index_files/figure-html/unnamed-chunk-2-1.png" width="100%" /> ??? We are now looking at the data that we would use for the actual analysis of interest, and the regression line from that analysis model. So on the x-axis we now have the incomplete covariate and on the y-axis the response, which we assume is fully observed. The cases for which the covariate is observed are drawn as white dots, the cases for which the covariate is missing as empty purple circles. The correct regression line, which we would get if we didn't have any missing values, is shown with the dashed line. --- ## Complete Case Analysis <img src="index_files/figure-html/unnamed-chunk-3-1.png" width="100%" /> ??? In a complete case analysis, the regression line would be calculated just based on the white data points. Because the missing values are not missing completely at random, but values are more likely to be missing for larger response values, the estimated line is now lower than the true line. --- ## Mean Imputation <img src="index_files/figure-html/unnamed-chunk-4-1.png" width="100%" /> ??? The first imputation method that I'll show here is mean imputation. All missing values in the covariate are filled in with the mean of the observed values of that covariate. This is shown here with the filled purple dots. You can clearly see that they are not a good representation of the distribution of the true but missing values. The corresponding regression line, shown with the solid white line, is closer to the true line than for complete case analysis, but is flatter than the true line. --- ## Missing Indicator Method <img src="index_files/figure-html/unnamed-chunk-5-1.png" width="100%" /> ??? The second missing data method is the missing indicator method. The idea here is to replace the missing values with a fixed value, for example zero. And, to distinguish the incomplete cases from the complete cases, we additionally add an indicator variable that is zero for observed cases and one for incomplete cases. As for mean imputation, we see that the imputed values do not at all represent the spread of the missing values. Because of the indicator variable we now get two regression lines, one for observed and one for incomplete cases, but they have the same slope, which seems to be similar to the slope of the true regression line. --- name: regimp ## Regression Imputation <img src="index_files/figure-html/unnamed-chunk-6-1.png" width="100%" /> ??? [jump to regline with predicted values](#predval_reg) Next, we have regression imputation. The idea here is to impute based on a prediction model, like we saw before, but to just use the fitted values from that prediction. Because this method also does not take into account the random variation, we see that all imputed values are on one straight line. And again we see that the model fitted on the imputed data results in a regression line with a different slope than the true line, now steeper. --- ## Single Imputation <img src="index_files/figure-html/unnamed-chunk-7-1.png" width="100%" /> ??? In single imputation we now improve upon the regression imputation by taking into account both the uncertainty about the parameters in the imputation model and the random variation. And we can see that the imputed values have a distribution that is much more similar to the distribution of the missing values. The corresponding regression line is also almost identical to the true line.
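---
## Single Imputation in R

A sketch of one such stochastic imputation for `\(\color{var(--nord15)}{\mathbf x_1}\)` (toy code, not the slides' own; `dat` with columns `x1`, `y`, `x2`, `x3` is a placeholder, and the **MASS** package is assumed to be available):

```r
fit <- lm(x1 ~ y + x2 + x3, data = dat)  # fit on cases with observed x1
mis <- is.na(dat$x1)

# uncertainty about the parameters: draw beta* around the estimates
beta_star <- MASS::mvrnorm(1, mu = coef(fit), Sigma = vcov(fit))

# prediction error: draw around the regression line
# (sigma is kept fixed for brevity; a fully 'proper' draw would sample it too)
X_mis <- model.matrix(~ y + x2 + x3, data = dat[mis, ])
dat$x1[mis] <- drop(X_mis %*% beta_star) +
  rnorm(sum(mis), mean = 0, sd = summary(fit)$sigma)
```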
--- ## Naive Ways to Impute Missing Values <img src="index_files/figure-html/unnamed-chunk-8-1.png" width="100%" /> ??? Here I have an overview of the parameter estimate of the incomplete covariate and the corresponding 95% confidence interval. So this is the slope that we saw for all the different methods. On top is the value for the complete data, and to make the comparison easier I have added a shaded area that has the width of the 95% CI from the complete data analysis. Of course, because these are just the results from one very simple example, we can't draw any conclusions about how much bias we get from which method and how they compare in general. This was just to visualize a bit what happens when you use one of these naive methods. We see that the different methods disagree quite a bit in their estimates. The single imputation comes closest, but when we take a closer look at the CI we see that it is actually a bit narrower than the true CI. In the example I used here, I have a bit more than 50% missing values. So we should have quite a bit of additional uncertainty compared to the complete data. The single imputation approach clearly underestimates the uncertainty that we have about the effect of the covariate. --- ## Single Imputation **Can take into account** * **uncertainty in parameter** estimates `\(\boldsymbol{\hat\beta}\)` * **prediction error** `\((\mathbf{\hat x}_{mis} \neq \mathbf x_{mis})\)` **But:** .pull-left[ <img src="index_files/figure-html/unnamed-chunk-9-1.png" width="100%" /> ] .pull-right[ Single imputation does not take into account the **uncertainty about the imputed value**! ] ??? In the single imputation we did take into account two of the sources of uncertainty or variation, but we only have one imputed value. With just one single value, we have no way of taking into account the added uncertainty that we have about the imputed value compared to an observed value. --- class: center, middle # Multiple Imputation ??? And this is why Donald Rubin came up with the idea of multiple imputation. --- ## Multiple Imputation <img src="index_files/figure-html/unnamed-chunk-11-1.png" width="100%" /> --- ## Multiple Imputation <img src="index_files/figure-html/unnamed-chunk-12-1.png" width="100%" /> --- ## Multiple Imputation MI was developed in the 1960s/70s... <img src = "materials/PCpic.png", height = 380 style = "position: absolute; right: 60px; bottom: 60px;"> -- <br> **Requirements** * computationally feasible * "fix" the missing data problem once / centrally<br> ⇨ distribute imputed data to other researchers --- ## Multiple Imputation <img src = "figures/MI.png", height = 480, style = "margin: auto; display: block;"> ??? The idea behind multiple imputation is that, using this principle, we sample imputed values and fill them into the original, incomplete data to create a completed dataset. And in order to take into account the uncertainty that we have about the missing values, we do this multiple times, so that we obtain multiple completed datasets. Because all the missing values have now been filled in, we can analyse each of these datasets separately with standard statistical techniques.
To obtain overall results, the results from each of these analyses need to be combined in a way that takes into account both the uncertainty that we have about the estimates from each analysis, and the variation between these estimates. --- ## Multiple Imputation <img src="index_files/figure-html/unnamed-chunk-14-1.png" width="100%" /> --- ## Multiple Imputation: Pooling **Pooled Parameter Estimate:**<br> `$$\bar\beta = \frac{1}{m}\sum_{\ell = 1}^m \hat\beta^{(\ell)} \qquad \text{(average estimate)}$$` -- **Pooled Variance:** `$$T = \bar W + B + B/m$$` * `\(\displaystyle\bar W = \frac{1}{m}\sum_{\ell = 1}^m \mathrm{var}(\hat\beta^{(\ell)})\)` average within imputation variance * `\(\displaystyle B = \frac{1}{m - 1}\sum_{\ell = 1}^m (\hat \beta^{(\ell)} - \bar\beta)^2\)` between imputation variance --- ## Multiple Imputation <img src="index_files/figure-html/unnamed-chunk-15-1.png" width="100%" /> --- class: center, middle # Multivariate Missingness --- ## In Practice .flex-grid[ .col[ <div style = "text-align: center; margin-bottom: 25px;"> <strong>Multivariate<br>Missingness</strong></div> <table class="data-table"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\mathbf x_1\)</th> <th>\(\mathbf x_2\)</th> <th>\(\mathbf x_3\)</th> <th>\(\ldots\)</th> </tr> <tr><td></td><td colspan = "5"; style = "padding: 0px;"><hr /></td><tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class = "rownr"></td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td></td> </tr> </table> ] .col[ **Predictive distributions** <div style = "width: 700px;"> based on models </div> <div> \begin{alignat}{10} \color{var(--nord15)}{\mathbf x_1} &= \beta_0 &+& \beta_1 \mathbf y &+& \beta_2 \color{var(--nord15)}{\mathbf x_2} &+& \beta_3 \color{var(--nord15)}{\mathbf x_3} &+& \ldots \\ \color{var(--nord15)}{\mathbf x_2} &= \alpha_0 &+& \alpha_1 \mathbf y &+& \alpha_2 \color{var(--nord15)}{\mathbf x_1} &+& \alpha_3 \color{var(--nord15)}{\mathbf x_3} &+& \ldots\\ \color{var(--nord15)}{\mathbf x_3} &= \theta_0 &+& \theta_1 \mathbf y &+& \theta_2 \color{var(--nord15)}{\mathbf x_1} &+& \theta_3 \color{var(--nord15)}{\mathbf x_2} &+& \ldots \end{alignat} </div> {{content}} ] ] ??? And the most common approach to imputation in this setting is MICE, short for **multivariate imputation by chained equations**, an approach that is also called **fully conditional specification**. The principle is an extension to what we've seen on the previous slides. We impute missing values using models that have all other data in their linear predictor. 
- - - -- <br> **Most common approach:**<br> <span style = "color: var(--nord10); font-weight: bold;">MICE</span> <span style = "color: var(--nord3);">(multivariate imputation by chained equations)</span><br> <span style = "color: var(--nord10); font-weight: bold;">FCS</span> <span style = "color: var(--nord3);">(fully conditional specification)</span> --- ## MICE / FCS .pull-left[ **Iterative:** - start with **random draws** from the observed data - cycle through the models to **update the imputed values** - until **convergence** ⇨ keep only last imputed value ] .pull-right[ <img src="index_files/figure-html/unnamed-chunk-16-1.png" width="100%" /> ] ??? Because in these imputation models we now have incomplete covariates, we use an iterative algorithm. We start by randomly drawing starting values from the observed part of the data, and then we cycle through the incomplete variables and impute one at a time. - - - - - - -- **Flexible model types**<br> choose a different type of model per incomplete variable ??? The models for the different variables can be specified according to the type of variable. Once we have imputed each missing value, we start again with the first variable, but now use the imputed values of the other variables instead of the starting values, and we do this a few times until the algorithm has converged. --- ## Missing Values **Relevant** for the choice / impact of methods: .flex-grid[ .col[ - **How much is missing?** * per variable * per subject * complete cases ] .col[ - **How much information is available?** * sample size * relevant covariates * strength of association ] ] - **Where are values missing?** * response * covariates - **Why are values missing?**<br> ⇨ Missing Data Mechanism --- ## Considerations for the Use of FCS MI **How much is missing / how much information is available?** .flex-grid[ .col[ <table class="data-table"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\mathbf x_1\)</th> <th>\(\mathbf x_2\)</th> <th>\(\mathbf x_3\)</th> <th>\(\ldots\)</th> </tr> <tr><td></td><td colspan = "5"; style = "padding: 0px;"><hr /></td><tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> 
<th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class = "rownr"></td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td></td> </tr> </table> ] .col[ <div style = "width: 700px;"> Imputation of \(\color{var(--nord15)}{\mathbf x_1}\) based on: \[\color{var(--nord15)}{\mathbf x_1} = \beta_0 + \beta_1 \mathbf y + \beta_2 \color{var(--nord15)}{\mathbf x_2} + \beta_3 \color{var(--nord15)}{\mathbf x_3} + \ldots\] <ul> <li> Fit model on cases with observed \(\color{var(--nord15)}{\mathbf x_1}\)</li> <li> Predict missing \( \color{var(--nord15)}{\mathbf x_1} \)</li> </ul> </div> {{content}} ] ] -- <br> <div> <p> <strong>Scenario 1:</strong>  N = 200,  90% of \(\color{var(--nord15)}{\mathbf x_1}\) is missing<br> ⇨ N = 20 to estimate \(\boldsymbol\beta\) </p> <br> {{content}} </div> -- <div> <strong>Scenario 2:</strong>  N = 5000,  90% of \(\color{var(--nord15)}{\mathbf x_1}\) is missing<br> ⇨ N = 500 to estimate \(\boldsymbol\beta\) </div> --- ## Considerations for the Use of FCS MI **Relevant covariates / strength of association** .flex-grid[ .col[ <table class="data-table"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\mathbf x_1\)</th> <th>\(\mathbf x_2\)</th> <th>\(\mathbf x_3\)</th> <th>\(\ldots\)</th> </tr> <tr><td></td><td colspan = "5"; style = "padding: 0px;"><hr /></td><tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr">\(i\)</td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> 
<td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class = "rownr"></td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td></td> </tr> </table> ] .col[ <div style = "width: 700px;"> Imputation of \(\color{var(--nord15)}{\mathbf x_1}\) based on: \[\color{var(--nord15)}{\mathbf x_1} = \beta_0 + \beta_1 \mathbf y + \beta_2 \color{var(--nord15)}{\mathbf x_2} + \beta_3 \color{var(--nord15)}{\mathbf x_3} + \ldots\] Say, \(\color{var(--nord15)}{\mathbf x_1}\) is <span style = "color: var(--nord10); font-weight: bold;">bilirubin</span>. </div> <br> {{content}} ] ] -- <div class = "flex-grid"> <div class = "col"> <strong>Scenario 1:</strong><br> other covariates: <ul> <li>age</li> <li>gender</li> <li>eye color</li> </ul> </div> <div class = "col"> {{content}} </div> </div> -- <strong>Scenario 2:</strong><br> other covariates: <div class = "flex-grid"> <div class = "col"> <ul> <li>age</li> <li>gender</li> <li>height</li> <li>weight</li> <li>family history</li> </ul> </div> <div class = "col"> <ul> <li>comorbidities</li> <li>creatinine</li> <li>AST, ALT, ALP</li> <li>MELD</li> <li>...</li> </ul> </div> </div> --- ## Considerations for the Use of FCS MI **Where are values missing?** .flex-grid[ .col[ <table class="data-table"> <tr> <th></th> <th>\(\mathbf y\)</th> <th>\(\mathbf x_1\)</th> <th>\(\mathbf x_2\)</th> <th>\(\mathbf x_3\)</th> <th>\(\ldots\)</th> </tr> <tr><td></td><td colspan = "5"; style = "padding: 0px;"><hr /></td><tr> <td class="rownr"></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class="rownr"></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td style="color: var(--nord15);"><i class = "fas fa-question"></i></td> <td><i class = "fas fa-check"</i></td> <td><i class = "fas fa-check"</i></td> <th>\(\ldots\)</th> </tr> <tr> <td class = 
"rownr"></td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td>\(\vdots\)</td> <td></td> </tr> </table> ] .col[ **Imputation Model** for `\(\color{var(--nord15)}{\mathbf y}\)`: `$$\color{var(--nord15)}{\mathbf y} = \alpha_0 + \alpha_1 \color{var(--nord15)}{\mathbf x_1} + \alpha_2 \mathbf x_2 + \alpha_3 \mathbf x_3 + \varepsilon_y$$` * fit on cases with observed `\(\color{var(--nord15)}{\mathbf y}\)` ⇨ `\(\boldsymbol{\hat\alpha}\)` * predict missing `\(\color{var(--nord15)}{\mathbf y}\)` using `\(\boldsymbol{\hat\alpha}\)`<br> ⇨ imputed cases will always have estimates equal to `\(\boldsymbol{\hat\alpha}\)` {{content}} ] ] -- **Analysis Model** * estimates in observed part: `\(\boldsymbol{\hat\alpha}\)` * estimates in imputed part: `\(\boldsymbol{\hat\alpha}\)`<br> ⇨ same results as in imputation model --- ## Considerations for the Use of FCS MI **Missing values in the response:** If analysis model `\(=\)` imputation model <br> ⇨ `\(\boldsymbol{\hat\beta} = \boldsymbol{\hat\alpha}\)`<br> ⇨ No point in imputing responses -- <br> **Auxiliary variables**:<br> ⇨ analysis model `\(\neq\)` imputation model<br> ⇨ `\(\boldsymbol{\hat\beta} \neq \boldsymbol{\hat\alpha}\)`<br> ⇨ Imputing responses can be beneficial --- ## Considerations for the Use of FCS MI **Why are values missing?** Imputation of `\(\color{var(--nord15)}{\mathbf x_1}\)` based on: `$$\color{var(--nord15)}{\mathbf x_1} = \beta_0 + \beta_1 \mathbf y + \beta_2 \color{var(--nord15)}{\mathbf x_2} + \beta_3 \color{var(--nord15)}{\mathbf x_3} + \ldots$$` <ul> <li> Fit model on cases with observed \(\color{var(--nord15)}{\mathbf x_1}\)</li> <li> Predict missing \( \color{var(--nord15)}{\mathbf x_1} \)</li> </ul> -- .box.bg-0.brdr-8[ ⇨ Imputed `\(\color{var(--nord15)}{\mathbf x_1}\)` will have the same distribution as observed `\(\color{var(--nord15)}{\mathbf x_1}\)` with **the same values of all other variables**. ] **⇨ FCS MI is valid under MAR** --- ## FCS MI in Practice * valid under **MAR**<br> <span style = "color: grey; font-size: 0.9rem;"> imputation models need to contain the important predictors in the right form</span> -- * allows us to take into account * uncertainty about missing value<br> <span style = "color: grey; font-size: 0.9rem;"> if we use enough imputed datasets </span> * uncertainty about parameters in imputation model<br> <span style = "color: grey; font-size: 0.9rem;"> requires Bayes or Bootstrap </span> * prediction error<br> <span style = "color: grey; font-size: 0.9rem;"> requires Bayes, or predictive mean matching with appropriate settings </span> -- * Imputation models need to fit the data - no contradiction between imputation models - no contradiction between imputation models and analysis model(s) --- ## Non-linear Associations .pull-left[ **Implied Assumption:**<br> <span>Linear association</span> between `\(\color{var(--nord15)}{\mathbf x_1}\)` and `\(\mathbf y\)`: `$$\color{var(--nord15)}{\mathbf x_1} = \beta_0 + \bbox[#3B4252, 2pt]{\beta_1 \mathbf y} + \beta_2 \mathbf x_2 + \beta_3 \mathbf x_3$$` <img src="materials/linplot.png", width = "450", height = "300", style="position:absolute; bottom:45px;"> ] ??? Implied assumption: linear association between incompl. 
covariate and the outcome (and other covariates) -- .pull-right[ <br> But what if `$$\mathbf y = \theta_0 + \bbox[#3B4252, 2pt]{\theta_1 \color{var(--nord15)}{\mathbf x_1} + \theta_2 \color{var(--nord15)}{\mathbf x_1}^2} + \theta_3 \mathbf x_2 + \theta_4 \mathbf x_3$$` <img src="materials/qdrplot.png", width = "450", height = "300", style="position:absolute; bottom:45px;"> ] ??? But what if we have a setting where we assume that there is a non-linear association, for example quadratic? --- ## Non-linear Associations .pull-left[ * <span style="font-weight: bold; color:var(--nord4);">true association</span>: non-linear * <span style="font-weight: bold; color:var(--nord7);">imputation assumption</span>: linear ] .pull-right[ <span style="font-size: 56pt; position: relative; right: 110px; bottom: 20px;">} ⇨</span> <span style = "color: var(--nord11); font-size: 1.2rem; font-weight: bold; position: relative; bottom: 30px; right: 100px;"> bias!</span> ] <img src="materials/impplot.png", height = 350, style = "margin: auto; display: block;"> ??? If we * correctly assume a non-linear association in the analysis model * but a linear association in the imputation model we introduce bias, even if we analyse the imputed data under the correct assumption --- ## Non-linear Associations With non-linear associations, specification of the **correct imputation model may not be possible**. Settings with non-linear associations: * (multiple) **transformations** of incomplete variables * **interactions** with incomplete variables * **survival models** ??? * In many such settings the correct predictive distribution will not have a closed form ⇨ we then cannot just specify the imputation model as a simple regression model with all other variables in the linear predictor. -- <br> **Also critical:**<br> settings with correlated observations * **longitudinal data** * clustered data (e.g. **multi-center studies**) -- <div style = "position: fixed; right:60px; bottom: 180px;"> .box.bg-0.brdr-8[ **⇨ Bayes** <i class="fas fa-smile fa-lg" style = "color: var(--nord8);"></i> ] </div> --- ## Multiple Imputation FAQ * How many imputed datasets do I need? -- * Should we do a complete case analysis as sensitivity analysis? -- * What % missing values is still ok? -- * Can I impute missing values in the response? -- * Can I impute missing values in the exposure? -- * Which variables do I need to include in the imputation? -- * Why do I need to include the response in the imputation models? Won't that artificially increase the association? -- * How should I report missing data / imputation in a paper? --- class: the-end, center, middle layout: true count: false ## Thank you for your attention! <div class="contact"> <i class="fas fa-envelope"></i> <a href="mailto:n.erler@erasmusmc.nl" class="email">n.erler@erasmusmc.nl</a> <a href="https://twitter.com/N_Erler" target="_blank"><i class="fab fa-twitter"></i> N_Erler</a> <a href="https://github.com/NErler" target="_blank"><i class="fab fa-github"></i> NErler</a> <a href="https://nerler.com" target="_blank"><i class="fas fa-globe-americas"></i> https://nerler.com</a> </div> --- count: false
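---
layout: false
count: false
## Appendix: FCS MI with the mice package

A minimal end-to-end sketch in R (using the `nhanes` data that ships with **mice**; `m`, `maxit`, and the analysis model are purely illustrative, not a recommendation):

```r
library(mice)

imp <- mice(nhanes, m = 5, maxit = 10, seed = 2020, printFlag = FALSE)

fits   <- with(imp, lm(bmi ~ age + chl))  # analyse each completed dataset
pooled <- pool(fits)                      # combine with Rubin's rules:
summary(pooled, conf.int = TRUE)          # pooled estimate & T = W + B + B/m
```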