
Modern Practices of Handling Missing Values⋆

Andrea Antonio Rachetta1

RWTH Aachen University, Templergraben 55, 52062 Aachen, Germany


andrea.rachetta@rwth-aachen.de

Abstract. Recorded data contain missing values for various reasons, and many methods exist for handling this loss of information. The efficiency of a given method is judged by comparing the underlying data mechanisms against its theory and supporting studies; when an approach is applied to the wrong kind of data, we obtain biased results. This paper aims to outline many useful types of algorithms and existing implementations which can solve this problem. It is difficult to pinpoint the best method, since they all utilize well-known theories such as Maximum Likelihood, Regression, Random Forests, Bayesian statistics, Neural Networks, and Genetic Algorithms. Modern approaches have the most promising potential, since they can adapt themselves to the data. In comparison, there are simpler alternatives, like deleting the part of the data that contains missing values, as well as median or mean substitution. The theory selected for our method is based on a regression model, and we compare our method against eleven selected methods. The performance of selected libraries for Python and R is included in the result table.

Keywords: Missing Data · Missing Values · Data Preparation.

1 Introduction

There are many ways to handle missing values (MVs). This paper discusses the advantages and disadvantages of each algorithm, covering an overview from statistical models that impute the missing values to machine learning models that fill them in. MVs are observed to have negative side effects on data analysis and data mining: they cause a loss of efficiency and complicate data analysis by distorting the statistical power. Calculations in the presence of MVs lead to biased results, and training machine learning models with MVs is not possible if the models are not robust to them.
Huang showed that imputing MVs was effective on an incomplete medical dataset after using instance selection, which is another step in data preprocessing [11]. A study showed that data scientists spend most of their time preprocessing data, so getting an overview of the various methods is crucial for preparing the data in the best way computer science can offer.
⋆ Supported by the PADS Group at RWTH Aachen University

First, we describe the nature of MVs as well as most of the currently available methods and related studies, with a focus on some advanced ones. In the following sections, many algorithms are presented with a focus on practicability, difficulty, efficiency, and availability of an implementation. Finally, we focus on our regression-based method and its comparison against other popular algorithms.

1.1 Motivation
MVs are shown as "NULL", "NA", or "NaN" values, as a question mark in software, or as an empty field in a spreadsheet. They therefore contain no information and leave many gaps in the data. Those gaps can negatively influence, e.g., data analysis, machine learning, and other models that are not robust to them; removing missing values greatly improves the quality of a dataset. Handling MVs in medical data and in datasets related to the Internet of Things is of great importance. The goal is to handle them efficiently and to fill the missing values with realistic/correct ones, so that the imputed data yields the same results as a complete dataset of the same or higher quality. That goal is ideal but not fully achievable; these methods treat the imputed values as if they had been recorded correctly.

1.2 Data Patterns


In this section, we investigate the patterns in which MVs appear in datasets [7].
Figure 1 presents each of the common patterns.

Fig. 1. This graphic shows a dataset with a univariate, monotone, multivariate, and general pattern.

Univariate and Multivariate Pattern - When exactly one feature in the data contains MVs, a univariate pattern is observed. Ehrlinger claims that the multivariate pattern is a special case in which the univariate pattern is observed in more than one feature, all with the same percentage of missingness [7].

Monotone Pattern - In the monotone pattern, the percentage of missingness differs between the features: it consistently increases or decreases from feature to feature, which indicates that a monotone pattern depends on a certain unknown variable.

General Pattern - The general pattern is the most common one and is typically a combination of the other patterns. The MVs follow no specific pattern and are randomly distributed among the data. It is the most difficult one to handle, so an algorithm is particularly useful if it handles this pattern efficiently.

Data patterns are not the only way to describe MVs.

1.3 Data Mechanisms


Data mechanisms describe the background of MVs from a statistical point of view. In this section, we present the specific patterns of distribution of MVs across the dataset [7]. It is not easy to distinguish between the mechanisms.

Missing Completely At Random (MCAR) - Values that are MCAR have no specific reason for where they are placed. As the name suggests, there is no correlation between the missing values and the rest of the data.

Missing at Random (MAR) - Values that are MAR, on the other hand, have a connection with the rest of the data. This is a more common case, since there is a correlation between the existing data and the missing data which can be estimated.

Missing not at Random (MNAR) - Lastly, values can be MNAR. In other words, the missingness depends on the unobserved value itself or on an unknown dependent variable. An example of values MNAR is sensor data where temperatures above or below a certain threshold are not recorded. Values MNAR are difficult to impute, since most approaches do not allow for MNAR values in their theory.

The paper is organized as follows: In Section 2, we present related work. Section 3 introduces our method and Section 4 evaluates it. Finally, in Section 6, we conclude the paper, followed by the references.

2 Related Work
This section contains an overview of existing studies on handling missing data. The methods can be grouped according to the difficulty of their theory and implementation. We discuss studies and obtain an overview of the best performing methods to compare our method against.

2.1 Common Approaches


Generally, we can fill MVs with plausible values, also known as imputation, or delete the MVs from the dataset. Common approaches, which utilize basic statistical ideas or simple deletion, are often not optimal and lead to biased estimates [1, 15]. We briefly summarize papers in which these less advanced methods proved effective or produced biased results.

Deletion Methods and Simple Imputation Deleting the part of the data that contains the MVs is an easy approach and reasonable if missingness dominates a feature. In Listwise Deletion (LD), features or samples that are incomplete are deleted. LD is also known as Casewise Deletion (CD) or Complete Case Analysis (CCA) and works reasonably well if values are MCAR and the sample is large [3, 12, 20]. Unfortunately, the MCAR assumption is often unreasonable, and it is misleading to call listwise deletion conservative given its major loss of data; nevertheless, it is often the default approach in statistical packages. With multivariate missing data there is the danger of under- or overestimating some effects, and if the data are not MCAR, LD may yield biased results [1].
Pairwise Deletion (PD), also known as Available Case Analysis (AC), is one way to resolve the waste of LD [3]. PD uses all of the available values of a variable, regardless of whether the same samples have missing values in other variables. PD works reasonably well when researchers can assume values are MCAR, but it can produce a covariance matrix that is not positive definite or cannot be inverted [1]. The final issue Acock observed is that, since the given values of each attribute come from different samples, it is difficult to choose a sample size that does not over- or underestimate the results of the analysis [1].
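As a minimal illustration in Python (pandas, toy data assumed), listwise deletion drops whole rows, while pairwise deletion is what pandas applies implicitly when computing correlations:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [23, np.nan, 31, 45],
                   "income": [50000, 42000, np.nan, 61000]})

# Listwise deletion (LD/CCA): drop every sample with at least one MV.
complete_cases = df.dropna()

# Pairwise deletion (PD/AC): pandas computes each pairwise statistic
# over the available cases only, so different entries of the result
# may be based on different sample sizes.
corr = df.corr()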
The most common and simple option is to impute the mean, known as Mean Imputation/Substitution (MS). MS can unbalance the dataset with a strong concentration on one point, since a single value is imputed for all the MVs [1]. MS increasingly underestimates the standard deviation of the dataset as the amount of MVs grows. Another disadvantage is that MS weakens the variance and can produce inconsistent bias when the number of missing values differs strongly between variables [3]. Another simple method is Median Imputation/Substitution (MDS), which imputes the median instead of the mean. Mode Imputation is the counterpart for categorical data: it imputes the most frequent nominal value and is as inefficient as its two numerical counterparts. Even though MDS and Mode Imputation are less biased in the sense that outliers influence their calculation less, they are as inefficient as MS.
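In scikit-learn, these simple substitutions correspond to SimpleImputer strategies; a sketch on toy data:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 4.0],
              [np.nan, 6.0]])

# MS and MDS: impute the column mean or median for every MV.
X_mean   = SimpleImputer(strategy="mean").fit_transform(X)
X_median = SimpleImputer(strategy="median").fit_transform(X)
# Mode Imputation for categorical data uses strategy="most_frequent".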
A simple SI method for longitudinal data is Last Value Carried Forward (LVCF). This method imputes each MV with the last recorded value in the same feature; depending on the dataset, it therefore does not impute the same value for every MV. LVCF underestimates the variance, but less so than MS and MDS do. Although this strategy has enjoyed widespread use in the medical and clinical trials literature, methodological studies have demonstrated that it is capable of producing substantial bias, even under an MCAR mechanism [8].
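LVCF maps directly onto a forward fill in pandas, sketched here on an assumed series of repeated measurements:

import numpy as np
import pandas as pd

# Toy longitudinal feature: repeated measurements of one subject.
temperature = pd.Series([36.5, np.nan, np.nan, 37.1, np.nan])

# LVCF: every MV takes the last recorded value on the same feature.
lvcf = temperature.ffill()   # -> 36.5, 36.5, 36.5, 37.1, 37.1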

2.2 Advanced Approaches

In this subsection, we first look at studies that explain the two general types of imputation and the opinions on them, and then focus on studies that present likelihood-, regression-, machine learning-, and evolutionary-based methods.

Advanced Imputation MS and MDS fall under the category of Single Imputation (SI), because the one possible mean/median is treated as the true value for each MV. SI using probabilistic models was an important advance over traditional approaches, but SI methods have one inherent flaw in general: they tend to underestimate the standard errors and thus overestimate the level of precision [1].

Multiple Imputation (MI) techniques are regarded in the literature as a less biased approach to handle MVs. MI imputes M possible values for each MV to create M complete datasets. The resulting pooled dataset, which summarizes the M prior ones, reflects the extra variation in the data [3]. The suggested number of imputations M is 10, according to Acock [1]. Sullivan supports the argument that MI offers some protection against MNAR data mechanisms [20]. Van Ginkel adds that it is better than listwise deletion and other simple approaches [10]. It is almost always preferable to pick MI over deletion approaches [16].
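The MI workflow can be sketched with scikit-learn's IterativeImputer: drawing from the predictive posterior M times yields M completed datasets whose analyses are pooled (full Rubin's rules would additionally combine the within- and between-imputation variance):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, np.nan], [2.0, 4.0], [np.nan, 6.0], [4.0, 8.0]])

M = 10                       # number of imputations suggested by Acock [1]
column_means = []
for m in range(M):
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    completed = imputer.fit_transform(X)       # one completed dataset
    column_means.append(completed.mean(axis=0))

pooled_mean = np.mean(column_means, axis=0)    # pooled point estimate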

One huge category for predicting MVs are probabilistic methods that calculate the MVs with statistical modeling, also known as likelihood-based methods [12]. Expectation Maximization (EM) is an iterative procedure with two steps: the Expectation step (E-step) and the Maximization step (M-step) [3]. The E-step estimates the values based on a given distribution for the observed data; the M-step then re-estimates the parameters using maximum likelihood. Like MI, this method assumes that the data are MAR [8]. Raw Maximum Likelihood (RML) is similar to EM but omits the E-step [3].
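To make the E/M alternation concrete, the following is a simplified sketch for a bivariate case; it omits the variance corrections of full EM and assumes x1 is completely observed:

import numpy as np

def em_style_impute(x1, x2, n_iter=100, tol=1e-8):
    """Alternate estimating parameters (M-step) and replacing MVs in x2
    by their conditional expectation given x1 (E-step). Illustrative only."""
    x2 = x2.astype(float).copy()
    missing = np.isnan(x2)
    if not missing.any():
        return x2
    x2[missing] = np.nanmean(x2)                 # crude starting values
    for _ in range(n_iter):
        # M-step: re-fit the regression of x2 on x1 to the completed data.
        b1 = np.cov(x1, x2, ddof=0)[0, 1] / np.var(x1)
        b0 = x2.mean() - b1 * x1.mean()
        # E-step: expected values of the MVs under the current parameters.
        expected = b0 + b1 * x1[missing]
        if np.max(np.abs(expected - x2[missing])) < tol:
            break
        x2[missing] = expected
    return x2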

Regression models rely on statistical principles to predict values. The common models are based on Linear Regression (LR) for numerical data or Logistic Regression (LGR) for categorical data. Cumulative Linear Regression (CLR) is a development of the common LR. Mostafa thoroughly compared CLR against popular R and Python packages [17]; in the end, its performance was roughly similar to that of the methods that were already published.
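A minimal regression-imputation sketch, assuming the remaining columns are complete:

import numpy as np
from sklearn.linear_model import LinearRegression

def lr_impute(X, target):
    """Impute the MVs of column `target` by regressing it on the
    other (complete) columns, as in single regression imputation."""
    X = X.astype(float).copy()
    missing = np.isnan(X[:, target])
    predictors = np.delete(X, target, axis=1)
    model = LinearRegression().fit(predictors[~missing],
                                   X[~missing, target])
    X[missing, target] = model.predict(predictors[missing])
    return X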

Machine Learning (ML) based methods use models other than regression to predict MVs. ML models provide several possibilities to handle missing values [9]. Complex ML models like Random Forest (RF) are built from more than one Decision Tree (DT) [18]. For classification, an RF lets each DT vote and imputes the value with the most votes. The best performing RF is the unsupervised model [15, 21]. RFs are customizable; for example, Anoop built an RF that could impute within MICE [2].

There are also approaches that fall into the category of Evolutionary Algorithms. The Genetic Algorithm (GA) and the Gravitational Search Algorithm (GSA) are another pair of promising algorithms which can help to impute MVs. GA and GSA are both nature-inspired algorithms capable of solving optimization problems. First studies have already shown that RF outperforms GA on classification problems [14]. GSA is newer, and hybrids like GA-GSA already show better performance in terms of clustering data [5].

3 Methodology

Our method is based on regression. Multi Linear Regression (MLR) uses more than one feature to build a regression model in a more than two-dimensional space. Since it can relate the variance of the feature that contains MVs to more than one other feature, it has more possibilities to extract dependencies between the variables of a dataset. Hongguo also proposed MLR for imputing MVs in wireless sensor networks [22].
Multi Linear Regression for handling Missing Values with an Unknown Dependent variable (MLRMUD) is one of the best performing multi regression models, and its performance will be evaluated. This method is effective when the MVs have an unknown dependent variable, and it can handle difficult data patterns like the multivariate pattern.
The least squares method is used to find the coefficients of the linear regression model represented in Equation 1 below. It constructs a hyperplane in a k-dimensional space and minimizes the Euclidean distance between the plane and the data.
Y = B_0 + B_1 X_1 + B_2 X_2 + \dots + B_k X_k + E \qquad (1)
The regression coefficient vector B is calculated as shown in Equation 2:

B = (X^T X)^{-1} X^T y \qquad (2)
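Equation 2 can be checked directly with NumPy on toy data (values assumed for illustration); an intercept column of ones is prepended so that B_0 is estimated together with the slopes:

import numpy as np

# Observed rows of two predictors and the target (toy data).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

X1 = np.hstack([np.ones((len(X), 1)), X])    # [1, X1, X2] per row
B = np.linalg.inv(X1.T @ X1) @ X1.T @ y      # B = (X^T X)^{-1} X^T y

# A sample with a missing Y is then imputed as B0 + B1*x1 + B2*x2:
y_hat = np.array([1.0, 2.5, 3.5]) @ B
# (np.linalg.lstsq(X1, y, rcond=None) is the numerically safer route.)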

4 Evaluation

In this section, we focus on our method MLRMUD and its performance. Chhabra et al. compared MLRMUD against six advanced approaches: Predictive Mean Matching (PMM), Multiple RF Regression, Multiple Bayesian Regression, MLR using non-Bayesian imputation, Multiple Classification and Regression Tree (CART), and MLR with Bootstrap Imputation [4]. PMM is similar to MI combined with Hot Deck Imputation (HDI) and is one of the many developments built upon MI.

4.1 Python and R Packages

In addition, we will present public packages and libraries available in Python and R to compare our method against more approaches.

A known SI method is HDI, which uses observed characteristics in the current data to impute MVs. The values imputed come from other samples that have the same values in the features aside from the one the MV is in [3]. When no similar sample is found, either MS or MDS can be used to impute the MV. Cold Deck Imputation (CDI) is very similar to HDI but takes its values from older datasets [3]. HDI is used to compare its performance against MLRMUD.
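A hot-deck sketch in pandas (the column names are hypothetical): each MV receives the value of a random donor that matches on another feature:

import numpy as np
import pandas as pd

def hot_deck(df, target, match_on, seed=0):
    """HDI sketch: fill MVs in `target` with values drawn from donors
    that share the same `match_on` value (MS/MDS would serve as the
    fallback when a group has no donors)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for _, group in out.groupby(match_on):
        donors = group[target].dropna().to_numpy()
        gaps = group.index[group[target].isna()]
        if len(donors) and len(gaps):
            out.loc[gaps, target] = rng.choice(donors, size=len(gaps))
    return out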

Clustering algorithms are another good candidate to impute MVs, since they observe the data, cluster it into groups, and impute values that belong to the same cluster. K-Nearest Neighbor (KNN) imputation groups the k most similar cases together with the help of the Euclidean distance and then imputes the most frequent value within those cases [9].
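scikit-learn ships a ready-made variant; note that KNNImputer averages the k nearest neighbours (using a Euclidean distance that skips missing coordinates) instead of taking the most frequent value:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [7.0, 8.0]])

# Each MV becomes the mean of that feature over the k closest rows.
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)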
The function missForest in the R package of the same name cycles through the variables in the dataset and imputes each one using an RF with the other variables as features [15]. This process may be repeated several times. The outputs are the imputed dataset with the missing values filled in and a set of prediction errors, one per variable, which give a measure of the success of the imputation. The random forest settings can be tweaked if desired; we use the non-parametric default setting.
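There is no official missForest port for Python, but a comparable setup can be sketched by plugging a random forest into scikit-learn's IterativeImputer, which likewise cycles through the variables:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 6.0, 9.0],
              [7.0, 8.0, 12.0]])

# Round-robin imputation: each variable with MVs is predicted by an RF
# trained on the remaining variables, repeated for max_iter rounds.
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0)
X_imputed = rf_imputer.fit_transform(X)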
Multiple Imputation by Chained Equations (MICE): mvis (2004) as well as the newer ice are almost equivalent to MICE and are implemented in Stata [19]. Different variations of and adjustments to MICE are available. MICE can be used with categorical data when it is transformed to dummy variables beforehand. Misztal compared MI with MICE [15]. MICE can also be used on MNAR data, where weights can be added to the values as an extra parameter to outweigh the disadvantages; some MICE packages are specially configured for data MNAR. Otherwise, MICE is generally very effective on data which is MAR or MCAR. The MICE package in R used in this experiment assumes that the data are MAR. aregImpute has similar performance to MICE [16].
Full Information Maximum Likelihood (FIML) has a user-defined Python implementation available on GitHub. In addition, FIML is supported by Mx and by commercial software like Mplus, LISREL, and AMOS [1]. FIML uses all the available information in a maximum likelihood estimation [1].
There are many developments of LR/LGR, like Binomial LR and Multinomial LGR. Built upon different theories, for example Bayesian theory, are Bayesian LR and Bayesian Binary LGR. These four developments of LR/LGR are part of the Python package Autoimpute. We take Binomial LR and compare it against our method.

4.2 Performance Evaluation


Our main statistic to measure the performance of the methods is the Mean Standard Error (MSE). Assuming the data has n attributes, where each column is treated as a vector of actual and predicted values, the MSE is defined as

MSE = \frac{1}{n} \sum_{i=1}^{n} \left( x_{i,\mathrm{actual}} - x_{i,\mathrm{predicted}} \right)^2 \qquad (3)
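As a minimal numerical check of Equation 3 (toy values, a single attribute):

import numpy as np

actual    = np.array([3.0, 5.0, 7.0])      # held-out true values
predicted = np.array([2.5, 5.5, 8.0])      # values an imputer produced
mse = np.mean((actual - predicted) ** 2)   # equation (3): 0.5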
Karama used MLRMUD on the Iris and Wine datasets with different percentages of MVs. There, it already outperformed hybrid GA methods with respect to the Root Mean Square Error [13].

4.3 Results and Discussion


Table 1 gives a summary of the performance together with some advanced MI-based methods, as evaluated by Chhabra et al. [4]; we added the methods from the Evaluation section. The table shows that our method achieved overall better performance, except against MICE.

Table 1. Comparison of MLRMUD with eleven different approaches with respect to the Mean Standard Error

S.No  Method                                                     Mean Standard Error
1     Predictive Mean Matching                                   0.1032988
2     Multiple random forest regression imputation               0.09765137
3     Multiple Bayesian regression imputation                    0.09503033
4     Multiple linear regression using non-Bayesian imputation   0.11876531
5     Multiple classification and regression tree (CART)         0.10915661
6     Multiple linear regression with bootstrap imputation       0.11446101
7     MLRMUD                                                     0.090175019
8     HDI                                                        0.42115
9     KNN                                                        0.263
10    missForest                                                 0.1075207
11    MICE                                                       0.02212442

5 Acknowledgement

Special thanks to Karama et al. for the theory and the sources for the setting of the experiment.
There are many further advanced approaches worth taking note of. A few honorable mentions of algorithms and databases that contain more advanced methods than covered in this paper, but are not grouped into one of the theories in the Related Work section, are the imputation algorithms found in:

KEEL: The Knowledge Extraction Evolutionary Learning (KEEL) database [6] presents many different algorithms. Examples of the methods mentioned in the category of MVs are based on Support Vector Machines (SVM) or Fuzzy K-Means.

6 Conclusion

To conclude, the future relies on the further optimization of machine learning algorithms like decision trees, random forests and neural networks, or, as Mostafa suggests, on further research into evolutionary algorithms like GSA-GA [17]. They have a lot of potential to improve already existing imputation methods.
MS, MDS or Mode Imputation should be avoided and only be used when there are very few missing values. In order for them to be effective, they must be used on MCAR data, which is already difficult to identify correctly.

Effective are the machine learning models like missForest or other variations of random forests, since they can impute on MCAR, MAR and MNAR data with mixed data types and they do not need any parameters. On the other hand, they can become computationally expensive, especially combined with MI and on big datasets, so it might be better to use algorithms that rely on more statistical approaches and do not need to be trained.
Multiple Imputation and MICE, which can also handle multi-type data, are common and advanced statistical tools to impute missing values. Since their performance is overall great, they should be used. A lot of well-known statistical software, for example SAS, SPSS or R, already includes many functions with similar results and approaches that rely on MI and MICE. Python provides, with scikit-learn's IterativeImputer, an equally great multivariate imputer.
Regression-based approaches perform well, and MLRMUD shows great performance. The same can be said of EM or FIML, which lie on a similar level of performance. Their disadvantage is the more difficult implementation and choice of parameters; they most likely need to be implemented by the user.

References
1. Acock, A.C.: Working With Missing Values. Journal of Marriage and Family 67(4), 1012–1028 (2005)
2. Anoop: Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study (2014)
3. Bennett, D.A.: How can I deal with missing data in my study? Australian and New Zealand Journal of Public Health 25(5), 464–469 (2001)
4. Chhabra, G., Vashisht, V., Ranjan: A Comparison of Multiple Imputation Methods for Data with Missing Values. Indian Journal of Science and Technology 10(19), 1–7 (2017)
5. Diwakar, M., Kumar, J., Gupta, I.K.: A hybrid GA-GSA noval algorithm for data clustering. In: 2018 4th International Conference on Recent Advances in Information Technology (RAIT). pp. 1–6 (2018)
6. Educacion, M.D.: KEEL Homepage (2011, accessed June 7, 2020), https://sci2s.ugr.es/keel/algorithms.php
7. Ehrlinger, L., Grubinger, T., Varga, B., Pichler, M., Natschläger, T., Zeindl, J.: Treating Missing Data in Industrial Data Analytics. In: 2018 Thirteenth International Conference on Digital Information Management (ICDIM). pp. 148–155 (2018)
8. Enders, C.K.: Analyzing longitudinal data with missing values. Rehabilitation Psychology 56(4), 267–288 (2011)
9. Ezzine, I., Benhlima, L.: A Study of Handling Missing Data Methods for Big Data. In: 2018 IEEE 5th International Congress on Information Science and Technology (CiSt). pp. 498–501 (2018)
10. van Ginkel, J.R., Linting, M., Rippe, R.C.A., van der Voort, A.: Rebutting Existing Misconceptions About Multiple Imputation as a Method for Handling Missing Data. Journal of Personality Assessment 102(3), 297–308 (2020)
11. Huang, M.W., Lin, W.C., Chen, C.W., Ke, S.W., Tsai, C.F., Eberle, W.: Data preprocessing issues for incomplete medical datasets. Expert Systems 33(5), 432–438 (2016)
12. Hughes, R.A., Heron, J., Sterne, J.A.C., Tilling, K.: Accounting for missing data in statistical analyses: multiple imputation is not always the answer. International Journal of Epidemiology 48(4), 1294–1304 (2019)
13. Karama, A., Farouk, M., Atiya, A.: A Multi Linear Regression Approach for Handling Missing Values with Unknown Dependent Variable (MLRMUD). In: 2018 14th International Computer Engineering Conference (ICENCO). pp. 195–201 (2018)
14. Leke, C., Twala, B., Marwala, T.: Modeling of missing data prediction: Computational intelligence and optimization algorithms. In: 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE (2014)
15. Misztal, M.A.: Comparison of Selected Multiple Imputation Methods for Continuous Variables – Preliminary Simulation Study Results. Acta Universitatis Lodziensis. Folia Oeconomica 6(339), 73–98 (2019)
16. Moons, K.G.M., Donders, R.A.R.T., Stijnen, T., Harrell, F.E.: Using the outcome for imputation of missing predictor values was preferred. Journal of Clinical Epidemiology 59(10), 1092–1101 (2006)
17. Mostafa, S.M.: Imputing missing values using cumulative linear regression. CAAI Transactions on Intelligence Technology 4(3), 182–200 (2019)
18. Pristyanto, Y., Pratama, I.: Missing Values Estimation on Multivariate Dataset: Comparison of Three Type Methods Approach. In: 2019 International Conference on Information and Communications Technology (ICOIACT). pp. 342–347 (2019)
19. Royston, P., White, I.R., et al.: Multiple imputation by chained equations (MICE): implementation in Stata. The Stata Journal: Promoting communications on statistics and Stata (2011)
20. Sullivan, T.R., White, I.R., Salter, A.B., Ryan, P., Lee, K.J.: Should multiple imputation be the method of choice for handling missing data in randomized trials? Statistical Methods in Medical Research 27(9), 2610–2626 (2018)
21. Tang, F., Ishwaran, H.: Random Forest Missing Data Algorithms. Statistical Analysis and Data Mining 10(6), 363–377 (2017)
22. Zhang, H., Yang, L.: An improved algorithm for missing data in wireless sensor networks. In: International Conference on Software Intelligence Technologies and Applications & International Conference on Frontiers of Internet of Things 2014. pp. 346–350 (2014)
