J Patrec 2015 08 023
PII: S0167-8655(15)00288-3
DOI: 10.1016/j.patrec.2015.08.023
Reference: PATREC 6335
Please cite this article as: Fabio Lobato, Claudomiro Sales, Igor Araujo, Vincent Tadaiesky, Lilian Dias,
Leonardo Ramos, Adamo Santana, Multi-Objective Genetic Algorithm For Missing Data Imputation,
Pattern Recognition Letters (2015), doi: 10.1016/j.patrec.2015.08.023
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
Research Highlights
• The paper proposes a novel Multi-objective Genetic Algorithm for Data Imputation, called MOGAImp.
• This is the first method that applies a multi-objective approach in data imputation.
• The results confirm the advantage of MOGAImp when the evaluation measures conflict.
• The MOGAImp codification scheme makes it possible to adapt the method to different application domains.
Fabio Lobato (a,b,∗∗), Claudomiro Sales (a), Igor Araujo (a), Vincent Tadaiesky (a), Lilian Dias (a), Leonardo Ramos (a), Adamo Santana (a)
(a) Technological Institute, Federal University of Para, R. Augusto Correa, 01, Guamá, Postal Code: 479, Belem, Para, Brazil
(b) Engineering and Geoscience Institute, Federal University of Western Para, R. Vera Paz, S/N, Sale, Belem, Para, Brazil
ABSTRACT
A large number of techniques for data analysis have been developed in recent years; however, most of them do not deal satisfactorily with a ubiquitous problem in the area: missing data. To mitigate the bias imposed by this problem, several treatment methods have been proposed, most notably data imputation methods, which can be viewed as an optimization problem where the goal is to reduce the bias caused by the absence of information. However, most imputation methods are restricted to one type of variable, whether categorical or continuous; moreover, they usually discard samples with incomplete instances. To fill these gaps, this paper presents a multi-objective genetic algorithm for data imputation called MOGAImp, based on NSGA-II, which is suitable for mixed-attribute datasets and takes into account information from incomplete instances and the modeling task. The performance of the algorithm was evaluated using 30 datasets with induced missing values and five classifiers divided into three classes (rule induction learning, lazy learning and approximate models), and it was compared with three techniques presented in the literature. The results obtained confirm that MOGAImp outperforms some well-established missing data treatment methods. Furthermore, the proposed method proved to be flexible, since it can be adapted to different application domains.
© 2015 Elsevier Ltd. All rights reserved.
Values (MV), should be highlighted because of their ubiquity in the data analysis process (Graham, 2009). Their causes are the most diverse and related to the application domain, such as drawbacks in the data acquisition, measurement errors, sensors [...]

[...] can be seen as an optimization problem because the estimation should preserve the intrinsic characteristics of the database (Zhang, 2010).

In this sense, evolutionary algorithms have been widely ap- [...]

[...]ing the other is degenerating, thus justifying the application of multi-objective optimization techniques.

Regardless of the paradigm on which the imputation method is based, some restrictions of these state-of-the-art techniques should be pointed out. For instance, Zhang et al. (2011) note that most of the available imputation methods are restricted to one type of variable only (categorical or numerical). In other words, these methods handle variables of different types separately, losing possible relationships between them. It is critical to remember that classification algorithms commonly exploit this kind of correlation, thus it is important to treat MV in mixed-attribute datasets properly. Two other important restrictions are: 1) imputation methods cannot be properly evaluated apart from the modeling task (Silva and Hruschka, 2013); and 2) complete-case analysis, where the information of instances or attributes with missing values is removed, should be avoided (Graham, 2009).

Aiming to fill these gaps in the literature, this paper proposes a multi-objective evolutionary algorithm for data imputation. The well-known NSGA-II has been used for the implementation due to its low complexity, flexibility and great effectiveness. In order to reduce the computational cost, a multi-threaded genetic algorithm that makes NSGA-II even more computationally efficient was also introduced. To compute the objective functions, the two most common evaluation measures in this application niche were chosen: Root Mean Square Error (RMSE) and the classifier accuracy.

The MOGAImp performance was compared with some well-accepted methods for handling missing data in pattern classification. The comparison was performed using 30 datasets with induced missing values and five classifiers divided into three classes (rule induction learning, lazy learning and approximate models), against three techniques presented in the literature. The results proved promising, con- [...]

[...] work in Section 3. The multi-objective genetic algorithm for missing data imputation is described in Section 4 and then the experimental setup description is given in Section 5. The results and conclusions are discussed in Sections 6 and 7, respectively.

[...] the class of n or possible labels: C1, C2, ..., Cn.

$$D = \{X, T, M\} = \{(x_n, t_n, m_n)\}_{n=1}^{N} \qquad (1)$$

where $x_n = [x_{1n}, x_{2n}, \ldots, x_{dn}]^T$ is the n-th vector composed of d attributes, labeled as $t_n \in [C_1, C_2, \ldots, C_c]$; and $m_n = [m_{1n}, m_{2n}, \ldots, m_{dn}]^T$ indicates which input attributes are unknown in $x_n$. The vector that indicates missing data, $m_n$, is also termed vector-response indicator. X is the set of input data and M is a binary matrix that indicates the absence of values; both have dimension [d × N]. The label set T has dimension [1 × N]. According to M, X is divided into two parts:

$$X = \{X_0, X_m\} \qquad (2)$$

$X_0$ and $X_m$ are the observed values of the dataset and the instances with missing values, respectively. These definitions provide the basis required to understand the relationship between the cause of the missing data and the statistical effect called the missingness mechanism, of which there are three: Missing Completely at Random (MCAR); Missing at Random (MAR); and Missing Not at Random (MNAR) (Little and Rubin, 2002). The proper way to deal with missing values depends, in most cases, on how the attributes become missing. For practical purposes, most of the studies involving missing values treatment assume that the missing data mechanism is governed by MAR or MCAR.

Generally, methods for pattern recognition with missing values can be classified into four categories (García-Laencina et al., 2009): 1) Traditional approaches, also called naive methods, which represent the complete case analysis and deal with this problem by simple omission of attributes or instances with missing data; 2) Data imputation, which replaces the value associated with the missing data, usually “null” or “?”, by a plausible value; 3) Models are iterative procedures that aim to use [...]

[...]niques were not modeled to deal directly with this problem. Currently, a convergence of the missing values treatment methods towards the data imputation strategy can be perceived.
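The notation of Eqs. (1) and (2) can be illustrated with a small sketch. It is not taken from the paper: the toy values, the helper names `response_indicator` and `split_by_missingness`, and the choice to store instances as rows (rather than the [d × N] layout used in the text) are assumptions made for readability. The split shown is one reading of Eq. (2), separating fully observed instances from instances with at least one missing value.

```python
# Toy dataset with d = 3 attributes and N = 4 instances (illustrative values only).
# None marks a missing entry; each row is one instance x_n.
X = [
    [5.1, "Yes", 2.3],
    [None, "No", 1.9],
    [4.7, None, 2.0],
    [6.0, "Yes", 2.8],
]
T = ["C1", "C2", "C1", "C2"]  # class labels t_n


def response_indicator(X):
    """Build the binary matrix M: 1 where an attribute is missing, 0 otherwise."""
    return [[1 if value is None else 0 for value in row] for row in X]


def split_by_missingness(X, M):
    """Split X into fully observed instances and instances with missing values."""
    observed = [x for x, m in zip(X, M) if sum(m) == 0]
    incomplete = [x for x, m in zip(X, M) if sum(m) > 0]
    return observed, incomplete


M = response_indicator(X)
X_observed, X_incomplete = split_by_missingness(X, M)
```

Complete-case analysis would keep only `X_observed`; an imputation method that uses incomplete instances also draws on the observed entries of `X_incomplete`.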
3. Related Work
[...] compromise between the single and multiple imputation methods. Finally, the iterative imputation techniques primarily use a generate-and-test mechanism, taking into account useful information (including incomplete cases).

The three categories of imputation proposed by Zhang (2010) can be seen as an optimization problem. The MOGAImp fits into the iterative imputation category, in which bioinspired algorithms have been widely applied because of their balance between accuracy, development time and computational performance. It is important to highlight that, in most missing values treatment methods, bioinspired algorithms are not used to perform the data imputation itself; they are applied to improve convergence or to help in setting parameters (Marwala, 2009; Aydilek and Arslan, 2013).

Among the imputation methods based on bioinspired algorithms, Figueroa García et al. (2010, 2011) deserve attention. In their work, the authors used statistical measures such as the covariance matrix and the auto-correlation function to guide the estimation process. However, these measures were computed from instances without missing values, and then the values to be imputed were estimated. This approach falls into complete case analysis; therefore, it loses potentially important information for further inferences and can thus produce biased results. Aiming to avoid this drawback, the MOGAImp takes into account the information of incomplete instances to estimate plausible values to impute, making it even more robust to noise.

In Luengo et al. (2011), the authors compare 14 different imputation methods, using 23 classification methods, which were divided into three categories: 1) rule induction learning, 2) approximate models and 3) methods based on distance. The primary evaluation parameter used is the accuracy of the predictive model. While the main contribution of that work is establishing which imputation method is most applicable to a particular group of classifiers, another relevant point is the analysis of the influence of the imputation method on the data in relation to two measures, Wilcoxon signed rank test and average mutual information difference, in addition to the classifiers' accuracies.

[...] (Hruschka et al., 2009; García-Laencina et al., 2009; Silva and Hruschka, 2013) suggest, the best predictive accuracy results do not necessarily lead to the lowest classification bias. Therefore, it is clear the missing data treatment methods cannot be prop- [...]

[...] world datasets properly - they commonly have both data types; the MOGAImp was designed to treat mixed-attribute data simultaneously, taking into account the relationships among the attributes.

Many studies developed in this field are not intended to improve the process of handling missing values itself, but aim to contribute to a particular application domain. For instance, Favorskaya et al. (2013) propose a method for texture reconstruction in dynamic scenes; Ding and Ross (2012) compare missing data treatment methods in multibiometric systems; Miranda et al. (2012) use autoencoders to predict values for imputation in electricity distribution networks.

In this sense, the MOGAImp flexibility should be highlighted. Its coding system allows easy adaptation to various application domains. In particular, this article is dedicated to presenting the application of the proposed approach to pattern classification, as will be shown in the following section.

4. Multi-Objective Genetic Algorithm For Missing Data Imputation

Genetic algorithms (GAs) have been widely applied as a global search method in various application domains, including data mining and optimization problems. Our main motivations for developing a solution based on the GA paradigm are: 1) GAs effectively explore large search spaces while also exploiting optimal solutions; 2) the GA paradigm is extremely scalable and can be efficiently parallelized; 3) GAs are relatively easy to implement, adapt and tune for different application domains; 4) they are successfully applied in Multi-Objective Problems (MOPs), where the main goal is to obtain a set of Pareto-optimal solutions. For these reasons, a considerable number of Multi-Objective Evolutionary Algorithms (MOEAs) have been proposed during the last two decades, with the common characteristic of finding multiple solutions in one single run, since the population diversity can be maintained along the Pareto fronts. One of the most prominent MOEAs is the Non-dominated Sorting Genetic Algorithm II (NSGA-II), which is [...]

[...] genetic algorithm for missing data imputation based on the NSGA-II, called MOGAImp. The following section will describe the genetic structure, including the multi-thread architecture, and a pseudo code description of MOGAImp.
13 Pt ← survivors selection by NSGA-II(Rt, Oacc, ORMSE);
14 end
15 F ← Pareto front(Pt)
16 return impute datasets(F, pool);
Algorithm 1: Multi-Objective Genetic Algorithm Imputation
Fig. 1. Chromosome represented by an array of floating-point data.
[...]tribute, thus “5” is mapped to “2.3”, “2” represents “Yes” and so on. This codification strategy was developed with two broad goals in mind: 1) provide abstraction of the data type, allowing the algorithm to handle both continuous and categorical attributes; 2) yield a data structure suitable for applying genetic operators and further knowledge extraction.

Aiming to evaluate the fitness of the candidate solutions, a [...]

Aiming to resolve this problem, a parallel approach is also proposed, attempting to reduce the processing time without interfering with other properties of the algorithm. Here, when calculating the fitness, each individual is assigned to one thread, which is used to calculate the fitness function; the main process waits for all threads to finish before applying the NSGA-II and genetic operations.
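The parallel fitness evaluation described above can be sketched with a thread pool. This is an illustration under stated assumptions, not the authors' implementation: the `evaluate` function is a placeholder that only computes an RMSE against made-up reference values (the classifier accuracy objective is omitted), and a bounded pool of worker threads is used instead of strictly one thread per individual.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Illustrative target values for the missing cells; not taken from the paper.
REFERENCE = [2.3, 2.0, 1.0]


def evaluate(chromosome):
    """Placeholder fitness: RMSE between the chromosome genes and REFERENCE."""
    squared = [(gene - ref) ** 2 for gene, ref in zip(chromosome, REFERENCE)]
    return math.sqrt(sum(squared) / len(squared))


def evaluate_population(population, max_workers=4):
    """Evaluate every individual in worker threads and wait for all results.

    Leaving the `with` block joins the workers, so the caller only continues
    (e.g. to selection and genetic operators) once every fitness is known.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate, population))


# Each chromosome is an array of floats, one gene per missing value.
population = [[2.3, 2.0, 1.0], [2.3, 2.0, 2.0], [0.3, 2.0, 1.0]]
scores = evaluate_population(population)
```

`pool.map` returns results in the order of the population, so `scores[i]` belongs to `population[i]`. In CPython, threads mainly help when the fitness function releases the interpreter lock (for instance by calling into native code); for pure-Python fitness functions a process pool may be the better choice.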
[...] used with 10-fold cross-validation.

Five well-known classification algorithms were selected to carry out the experimentation in order to represent three classification categories. The grouped list is: Rule Induction Learning - C4.5 (Weka's J48), Conjunctive Rule and OneR; Approximate Models - Naïve-Bayes; Lazy Learning - 3NN.

To evaluate the performance of the imputation methods, three measures were chosen: the classification accuracy, which is the ratio between the number of instances correctly classified and the total number of instances; the Wilson's noise ratio, in order to study the impact of the imputation method on the classification of instances with missing values (Wilson, 1972; Luengo et al., 2011); and the Root Mean Square Error, which computes the distance between the imputed values and the originals - for categorical attributes, the distance is considered 1. Lower RMSE values represent better predictive accuracy of the imputation method. Aiming to facilitate the comparisons, the RMSE was [...]

The MOGAImp parameters were set by means of calibrating tests with artificial datasets. The evaluating results obtained from the simulations confirmed the convergence of the [...]

[...]mizes the result for RMSE and classifier accuracy, respectively; while the last one presents the solution with the greatest distance from the origin, which represents a balance between the two aforementioned performance measures. First, this section discusses the overall results regarding the three evaluation measures adopted. Then, the behavior of the imputation methods is analyzed according to the classification algorithms.

6.1. Evaluation Measures

In order to evaluate the hypothesis that some performance measures commonly used in the literature are conflicting, the results for all datasets and classifiers were grouped together to make an overall analysis of the accuracy, RMSE and Wilson's Noise Ratio. Figures 2 and 3 show the performance of the imputation methods according to the classifiers' accuracy and RMSE, respectively. These figures display box plots; on each box, the central mark is the median, the gray [...]

[Figure: box plots; y-axis: Accuracy; x-axis: Missing value treatment methods (MOGAImp RMSE, MOGAImp ACC, MOGAImp O, WKNNI, CMC, MC).]
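The accuracy and RMSE measures described in the experimental setup can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, a categorical mismatch is given distance 1 and a match distance 0 (the latter is an assumption), and no normalization of numerical attributes is applied because the corresponding sentence of the text is truncated.

```python
import math


def classification_accuracy(true_labels, predicted_labels):
    """Ratio between correctly classified instances and all instances."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


def mixed_rmse(original, imputed, is_categorical):
    """RMSE between imputed and original values for mixed attribute types.

    Numerical cells use the plain difference; categorical cells contribute
    a distance of 1 when the imputed category differs (0 when it matches).
    """
    squared = []
    for orig, imp, categorical in zip(original, imputed, is_categorical):
        if categorical:
            distance = 0.0 if orig == imp else 1.0
        else:
            distance = float(orig) - float(imp)
        squared.append(distance * distance)
    return math.sqrt(sum(squared) / len(squared))


# Four imputed cells: two numerical and two categorical (illustrative values).
original = [2.0, "Yes", 5.0, "No"]
imputed = [2.5, "No", 5.0, "No"]
is_categorical = [False, True, False, True]

rmse = mixed_rmse(original, imputed, is_categorical)
accuracy = classification_accuracy(["a", "b", "a"], ["a", "b", "b"])
```

In this example the squared distances are 0.25, 1, 0 and 0, so the RMSE is the square root of 1.25 / 4; the accuracy is 2 correct predictions out of 3.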
As expected, the MOGAImp RMSE and the MOGAImp ACC achieved the best results regarding the RMSE and the classifiers' accuracies, respectively. Observing the MOGAImp O, it is possible to attest that this solution provides the best trade-off between the evaluation measures analyzed, since it achieved values close to optimal in both the RMSE and the classifiers' accuracies.

The results for the Wilson's noise ratio are shown in Table 2, as well as the rank obtained by each imputation method according to the Friedman test. In the last row, the symbol denotes that the difference between one or more methods is statistically significant. For instance, {method 1} {method 2, method 3} indicates that “method 1” is significantly better than “methods 2 and 3”. The results presented in Table 2 show that CMC obtained the smallest rank sum; in other words, this method achieved the best overall result for the Wilson's noise [...]

[...] problem of this strategy is that, at classification time, the class value is not known, therefore a different approach is required. Even so, the Nemenyi post-hoc test indicated that the CMC is not statistically significantly better than the solutions produced by MOGAImp O and MOGAImp ACC. Therefore, MOGAImp O can be considered the best trade-off between the three evaluation measures analyzed, with the advantage that this solution is class-independent, contrasting with CMC in this aspect.

[...] to classification algorithms grouped into three classes: rule induction learning, approximate models and lazy learning. To perform this analysis, the imputation methods were compared using the Wilcoxon Signed Rank test; Table 3 shows the results for classifier accuracy and RMSE. Analyzing these results, it is possible to conclude that the MOGAImp provides a better trade-off between the classifier accuracy and the RMSE than the other imputation methods. Moreover, no differences are observed in the behavior of the imputation methods for a class of classifiers. Individually, most of the classifiers have statistically identical behavior to the others, independently of the missing value treatment method.

SHT      98.44  98.39  98.44  99.43  99.26  98.36
TTT      92.26  92.69  92.63  92.39  94.07  90.71
VTC      87.22  91.11  89.44  83.72  96.23  85.28
WNE      98.88  99.04  99.68  96     100    99.2
Ranking  4.1    2.53   3.0    4.53   1.67   5.14
{CMC, MOGAImpO, MOGAImpACC} {MOGAImpRMSE, WKNNI, MC}

In summary, the results shown above indicate that the proposed method has a very competitive performance, obtaining results superior to the baseline methods. Despite the best rank of CMC in the Wilson's noise ratio, this result is not statistically significant. Moreover, it is important to stress that the proposed method has the advantage of being class-independent and provides the possibility of incorporating more evaluation measures, making it easily adaptable to different application domains.

7. Conclusions

[...]ary algorithm NSGA-II. This new method, called MOGAImp, differs from the current evolutionary methods for data imputation, with three contributions: 1) it is capable of tackling conflicting evaluation measures; 2) it is suitable for mixed-attribute datasets; 3) the proposed method takes into account information of incomplete instances and the model building. As the literature review demonstrates, this is the first method that applies a multi-objective approach in this application domain. Regard- [...]

[...] dataset, which is used to compute the fitness functions. For the scenario analyzed, two well-established evaluation measures of data imputation methods were used as objective functions: the RMSE and classification accuracy.

The MOGAImp was compared against three well-known imputation methods, namely CMC, MC and WKNNI, on 30 publicly available benchmarking datasets. To assess the algorithm performance, 5 classification algorithms were used in order to represent three groups of classification methods: rule induction learning, approximate models and lazy learning. The experimental results showed that the MOGAImp outperforms the other imputation methods, achieving better statistical ranking in
MOGAImp O  4    4    4    4    4    4    4
WKNNI      2.5  2.5  3    2.5  2.5  2.6  3
CMC        2.5  2.5  1.5  2.5  2.5  2.3  2
MC         5    5    5    5    5    5    5
both objective functions studied and in the Wilson's noise ratio. The MOGAImp is also capable of giving a set of solutions, which can be used for further knowledge extraction, helping data analysts to better understand the missing values problem in the domain studied. Moreover, the MOGAImp flexibility should be highlighted: its unique codification scheme makes it possible to adapt it to different application domains.

The future research possibilities are broad. Prospects include investigating the MOGAImp application to different domains such as regression, clustering and time-series analysis; investigating the adoption of heuristics to generate the initial population in order to reduce the search space; and implementing a knowledge extraction method aiming to provide a comprehensible model to the data analyst.

Acknowledgments

The authors would like to thank CNPQ and CAPES for supporting this research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

Alcalá, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F., 2010. Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing 17, 255–287.
Aydilek, I.B., Arslan, A., 2013. A hybrid method for imputation of missing values using optimized fuzzy c-means with support vector regression and a genetic algorithm. Information Sciences 233, 25–35.
Blake, R., Mangiameli, P., 2011. The effects and interactions of data quality and problem complexity on classification. J. Data and Information Quality 2, 8:1–8:28.
Ding, Y., Ross, A., 2012. A comparison of imputation methods for handling missing scores in biometric fusion. Pattern Recognition 45, 919–933.
Favorskaya, M., Damov, M., Zotin, A., 2013. Accurate spatio-temporal reconstruction of missing data in dynamic scenes. Pattern Recognition Letters 34, 1694–1700.
Figueroa García, J.C., Kalenatic, D., López Bello, C.A., 2010. An evolutionary approach for imputing missing data in time series. Journal of Circuits, Systems and Computers 19, 107–121.
Figueroa García, J.C., Kalenatic, D., López Bello, C.A., 2011. Missing data imputation in multivariate data by evolutionary algorithms. Computers in Human Behavior 27, 1468–1474.
Frank, A., Asuncion, A., 2010. UCI Machine Learning Repository. Technical Report. University of California, Irvine, School of Information and Computer Sciences.
García-Laencina, P.J., Sancho-Gómez, J.L., Figueiras-Vidal, A.R., 2009. Pattern classification with missing data: a review. Neural Computing and Applications 19, 263–282.
Graham, J.W., 2009. Missing data analysis: making it work in the real world. Annual Review of Psychology 60, 549–576.
Honaker, J., King, G., 2013. What to do about missing values in time-series cross-section data. American Journal of Political Science 54, 561–581.
Hruschka, E.R., Garcia, A.J.T., Hruschka Jr., E.R., Ebecken, N.F.F., 2009. On the influence of imputation in classification: practical issues. Journal of Experimental & Theoretical Artificial Intelligence 21, 43–58.
Japkowicz, N., Shah, M., 2011. Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, New York, NY, USA.
Little, R.J.A., Rubin, D.B., 2002. Statistical Analysis with Missing Data. 2 ed., Wiley, New York.
Luengo, J., García, S., Herrera, F., 2011. On the choice of the best imputation methods for missing values considering three groups of classification methods. Knowledge and Information Systems 32, 77–108.
Marwala, T., 2009. Computational Intelligence for Missing Data Imputation, Estimation and Management: Knowledge Optimization Techniques. 1 ed., Information Science Reference.
Meng, Z., Shi, Z., 2012. Extended rough set-based attribute reduction in inconsistent incomplete decision systems. Information Sciences 204, 44–69.
Miranda, V., Krstulovic, J., Keko, H., Moreira, C., Pereira, J., 2012. Reconstructing missing data in state estimation with autoencoders. IEEE Trans- [...]
Stekhoven, D.J., Bühlmann, P., 2012. MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics (Oxford, England) 28, 112–118.
Wilson, D.L., 1972. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics 2, 408–421.
Wohlrab, L., Fürnkranz, J., 2010. A review and comparison of strategies for handling missing values in separate-and-conquer rule learning. Journal of Intelligent Information Systems 36, 73–98.
Zhang, S., 2010. Shell-neighbor method and its application in missing data imputation. Applied Intelligence 35, 123–133.
Zhang, S., Jin, Z., Zhu, X., 2011. Missing value estimation for mixed-attribute data sets. IEEE Transactions on Knowledge and Data Engineering 23, 110–121.