
Generally speaking, there are three main approaches to handling missing data:

(1) Imputation, where estimated values are filled in for the missing data;

(2) Omission, where samples with invalid data are discarded from further analysis; and

(3) Analysis, where methods unaffected by the missing values are applied directly.

There are many ways to approach missing data. The most common, I believe, is to ignore it. But
making no choice means that your statistical software is choosing for you.

Most of the time, your software is choosing listwise deletion. Listwise deletion may or may not
be a bad choice, depending on why and how much data are missing.
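
For example, in Python, pandas performs listwise deletion with dropna(). A minimal sketch, with made-up data:

```python
import numpy as np
import pandas as pd

# A small hypothetical data set with missing values coded as NaN.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29],
    "income": [48000, np.nan, 52000, 61000, 55000],
})

# Listwise deletion: drop every row that has any missing value.
# Only rows 0, 3, and 4 survive.
print(df.dropna())
```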

Another common approach among those who are paying attention is imputation. Imputation
simply means replacing the missing values with an estimate, then analyzing the full data set as if
the imputed values were actual observed values.

How do you choose that estimate? The following are common methods:

Mean imputation

Simply replace each missing value with the mean of the observed values for that variable, calculated across all individuals who are non-missing.

It has the advantage of keeping the same mean and the same sample size, but many, many
disadvantages. Pretty much every method listed below is better than mean imputation.
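
To make that concrete, here is a minimal sketch of mean imputation in Python with pandas (the data are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, 32.0, np.nan, 41.0, np.nan]})

# Mean imputation: fill each missing value with the mean of the
# observed values (pandas' mean() skips NaN by default).
df["age_imputed"] = df["age"].fillna(df["age"].mean())
print(df)
```
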
Substitution

Impute the value from a new individual who was not selected to be in the sample.

In other words, go find a new subject and use their value instead.

Hot deck imputation

A randomly chosen value from an individual in the sample who has similar values on other
variables.

In other words, find all the sample subjects who are similar on other variables, then randomly
choose one of their values on the missing variable.

One advantage is you are constrained to only possible values. In other words, if Age in your
study is restricted to being between 5 and 10, you will always get a value between 5 and 10 this
way.

Another is the random component, which adds in some variability. This is important for
accurate standard errors.
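
Here is a minimal sketch of hot deck imputation in Python. The grouping variable and data are made up for illustration; real hot deck procedures often define "similar" using several matching variables:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical data: 'score' is partly missing; 'grade' is fully
# observed and defines the pools of similar subjects (donors).
df = pd.DataFrame({
    "grade": [1, 1, 1, 2, 2, 2],
    "score": [10.0, 12.0, np.nan, 20.0, np.nan, 22.0],
})

def hot_deck(scores):
    # Randomly draw an observed value from the same group for each
    # missing case. (Cold deck would use a fixed rule instead.)
    donors = scores.dropna().to_numpy()
    filled = scores.copy()
    mask = filled.isna()
    filled[mask] = rng.choice(donors, size=mask.sum())
    return filled

df["score_imputed"] = df.groupby("grade")["score"].transform(hot_deck)
print(df)
```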

Cold deck imputation

A systematically chosen value from an individual who has similar values on other variables.
This is similar to Hot Deck in most ways, but removes the random variation. So for example, you
may always choose the third individual in the same experimental condition and block.

Regression imputation

The predicted value obtained by regressing the missing variable on other variables.

So instead of just taking the mean, you’re taking the predicted value, based on other variables.
This preserves relationships among variables involved in the imputation model, but not
variability around predicted values.

Stochastic regression imputation

The predicted value from a regression plus a random residual value.

This has all the advantages of regression imputation but adds in the advantages of the random
component.

Most multiple imputation is based on some form of stochastic regression imputation.
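
Here is a minimal sketch in Python, using scikit-learn for the regression step, that contrasts plain regression imputation with the stochastic version (the data are simulated for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simulated data: y depends on x; every 10th y is missing.
x = rng.normal(size=100).reshape(-1, 1)
y = 2.0 * x.ravel() + rng.normal(size=100)
y[::10] = np.nan

observed = ~np.isnan(y)

# Fit the imputation model on the complete cases.
model = LinearRegression().fit(x[observed], y[observed])

# Regression imputation: just the predicted value.
pred = model.predict(x[~observed])

# Stochastic regression imputation: prediction plus a random residual,
# drawn using the residual standard deviation from the fitted model.
resid_sd = np.std(y[observed] - model.predict(x[observed]), ddof=2)
y_stochastic = y.copy()
y_stochastic[~observed] = pred + rng.normal(scale=resid_sd, size=pred.size)
```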

Interpolation and extrapolation

An estimated value from other observations from the same individual. It usually works only with longitudinal data.

Use caution, though. Interpolation, for example, might make more sense for a variable like height in children, one that can't go back down over time. Extrapolation means you're estimating beyond the actual range of the data, and that requires making more assumptions than you should.
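
For instance, pandas can interpolate linearly within an individual's series. A minimal sketch, assuming yearly height measurements with one missed visit:

```python
import numpy as np
import pandas as pd

# Hypothetical longitudinal data: one child's height (cm), measured
# yearly, with the 2019 visit missed.
height = pd.Series(
    [110.0, np.nan, 118.0, 122.0],
    index=pd.Index([2018, 2019, 2020, 2021], name="year"),
)

# Linear interpolation fills 2019 halfway between its neighbors: 114.0.
print(height.interpolate(method="linear"))
```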

Single or Multiple Imputation?

There are two types of imputation: single and multiple. Usually when people talk about
imputation, they mean single.

Single refers to the fact that you come up with a single estimate of the missing value, using one
of the seven methods listed above.

It’s popular because it is conceptually simple and because the resulting sample has the same
number of observations as the full data set.

Single imputation looks very tempting when listwise deletion eliminates a large portion of the
data set.

But it has limitations.

Some imputation methods result in biased parameter estimates, such as means, correlations, and regression coefficients, unless the data are Missing Completely at Random (MCAR). The bias is often worse than with listwise deletion, the default in most software.

The extent of the bias depends on many factors, including the imputation method, the missing data mechanism, the proportion of the data that is missing, and the information available in the data set.

Moreover, all single imputation methods underestimate standard errors.

Since the imputed observations are themselves estimates, their values have corresponding random error. But when you put in that estimate as a data point, your software doesn't know that. So it overlooks the extra source of error, resulting in too-small standard errors and too-small p-values.

And although imputation is conceptually simple, it is difficult to do well in practice. So it’s not
ideal but might suffice in certain situations.

Multiple imputation, by contrast, comes up with multiple estimates. Two of the methods listed above can serve as the imputation method in multiple imputation: hot deck and stochastic regression.

Because these two methods have a random component, the multiple estimates are slightly
different. This re-introduces some variation that your software can incorporate in order to give
your model accurate estimates of standard error.

Multiple imputation was a huge breakthrough in statistics about 20 years ago. It solves a lot of problems with missing data (though, unfortunately, not all) and, if done well, leads to unbiased parameter estimates and accurate standard errors.

Two methods for dealing with missing data, vast improvements over traditional approaches, have become available in mainstream statistical software in the last few years.

Both of the methods discussed here require that the data are missing at random (MAR): the probability that a value is missing may depend on the observed data, but not on the missing values themselves. If this assumption holds, the resulting estimates (i.e., regression coefficients and standard errors) will be unbiased with no loss of power.

The first method is Multiple Imputation (MI). Just like the old-fashioned imputation methods,
Multiple Imputation fills in estimates for the missing data. But to capture the uncertainty in
those estimates, MI estimates the values multiple times. Because it uses an imputation method
with error built in, the multiple estimates should be similar, but not identical.

The result is multiple data sets with identical values for all of the non-missing values and slightly
different values for the imputed values in each data set. The statistical analysis of interest, such
as ANOVA or logistic regression, is performed separately on each data set, and the results are
then combined. Because of the variation in the imputed values, there should also be variation in
the parameter estimates, leading to appropriate estimates of standard errors and appropriate p-
values.
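
Here is a minimal sketch of that workflow in Python, using scikit-learn's IterativeImputer with posterior sampling as the stochastic imputation method. The data and regression model are made up for illustration, and a full analysis would also pool the standard errors with Rubin's rules:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Simulated data: two predictors and an outcome; x2 is partly missing.
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
x2[rng.random(n) < 0.3] = np.nan
data = np.column_stack([x1, x2, y])

m = 5  # number of imputed data sets
coef_estimates = []
for i in range(m):
    # sample_posterior=True adds the random draw that makes each
    # completed data set slightly different.
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    completed = imputer.fit_transform(data)
    # Run the analysis of interest separately on each completed data set.
    fit = LinearRegression().fit(completed[:, :2], completed[:, 2])
    coef_estimates.append(fit.coef_)

# Pool the results: Rubin's rules average the point estimates (and
# combine within- and between-imputation variance for the SEs).
print("pooled coefficients:", np.mean(coef_estimates, axis=0))
```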

Multiple Imputation is available in SAS, S-Plus, R, and now SPSS 17.0 (but you need the Missing
Values Analysis add-on module).

The second method is to analyze the full, incomplete data set using maximum likelihood estimation. This method does not impute any data, but rather uses each case's available data to compute maximum likelihood estimates. The maximum likelihood estimate of a parameter is the value of the parameter that is most likely to have resulted in the observed data.

When data are missing, we can factor the likelihood function. The likelihood is computed separately for the cases with complete data on all variables and the cases with data on only some variables. These two likelihoods are then maximized together to find the estimates. Like multiple imputation, this method gives unbiased parameter estimates and standard errors. One advantage is that it does not require the careful selection of variables used to impute values that Multiple Imputation requires. It is, however, limited to linear models.
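
To illustrate the idea of factoring the likelihood, here is a minimal sketch in Python for a bivariate normal model where one variable is partly missing. The data are simulated, and each case contributes the density of only the values it actually has:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

# Simulated bivariate normal data with some y2 values missing.
n = 100
data = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.6], [0.6, 2.0]], size=n)
data[rng.random(n) < 0.25, 1] = np.nan

def negloglik(params):
    # Unpack: means, log standard deviations, and a correlation
    # mapped through tanh to keep the covariance matrix valid.
    mu = params[:2]
    sd = np.exp(params[2:4])
    rho = np.tanh(params[4])
    cov = np.array([[sd[0] ** 2, rho * sd[0] * sd[1]],
                    [rho * sd[0] * sd[1], sd[1] ** 2]])
    ll = 0.0
    for row in data:
        obs = ~np.isnan(row)
        # Complete cases contribute the full joint density; incomplete
        # cases contribute the marginal density of what was observed.
        ll += multivariate_normal.logpdf(row[obs], mu[obs], cov[np.ix_(obs, obs)])
    return -ll

result = minimize(negloglik, x0=np.zeros(5), method="Nelder-Mead")
print("estimated means:", result.x[:2])
```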

Analysis of the full, incomplete data set using maximum likelihood estimation is available in
AMOS. AMOS is a structural equation modeling package, but it can run multiple linear
regression models. AMOS is easy to use and is now integrated into SPSS, but it will not produce
residual plots, influence statistics, and other typical output from regression packages.

