
Accepted Manuscript

Multi-Objective Genetic Algorithm For Missing Data Imputation

Fabio Lobato, Claudomiro Sales, Igor Araujo, Vincent Tadaiesky, Lilian Dias, Leonardo Ramos, Adamo Santana

PII: S0167-8655(15)00288-3
DOI: 10.1016/j.patrec.2015.08.023
Reference: PATREC 6335

To appear in: Pattern Recognition Letters

Received date: 10 February 2015


Accepted date: 31 August 2015

Please cite this article as: Fabio Lobato, Claudomiro Sales, Igor Araujo, Vincent Tadaiesky, Lilian Dias,
Leonardo Ramos, Adamo Santana, Multi-Objective Genetic Algorithm For Missing Data Imputation,
Pattern Recognition Letters (2015), doi: 10.1016/j.patrec.2015.08.023

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.

Research Highlights
• The paper proposes a novel Multi-objective Genetic Algorithm for Data Imputation, called MOGAImp.
• This is the first method that applies a multi-objective approach to data imputation.
• MOGAImp presents a good trade-off between the evaluation measures studied.
• The results confirm that MOGAImp performs well when conflicting evaluation measures must be considered.
• MOGAImp's encoding scheme makes it possible to adapt it to different application domains.

Pattern Recognition Letters


journal homepage: www.elsevier.com

Multi-Objective Genetic Algorithm For Missing Data Imputation

Fabio Lobato a,b,∗∗, Claudomiro Sales a, Igor Araujo a, Vincent Tadaiesky a, Lilian Dias a, Leonardo Ramos a, Adamo Santana a
a Technological Institute, Federal University of Para, R. Augusto Correa, 01, Guamá, Postal Code: 479, Belem, Para, Brazil
b Engineering and Geoscience Institute, Federal University of Western Para, R. Vera Paz, S/N, Salé, Santarém, Para, Brazil
ABSTRACT

A large number of techniques for data analysis have been developed in recent years; however, most of them do not deal satisfactorily with a ubiquitous problem in the area: missing data. In order to mitigate the bias imposed by this problem, several treatment methods have been proposed, notably the data imputation methods, which can be viewed as an optimization problem whose goal is to reduce the bias caused by the absence of information. However, most imputation methods are restricted to one type of variable (whether categorical or continuous); moreover, they usually disregard samples with incomplete instances. To fill these gaps, this paper presents a multi-objective genetic algorithm for data imputation, called MOGAImp, based on NSGA-II, which is suitable for mixed-attribute datasets and takes into account information from incomplete instances and from the modeling task. A set of tests for evaluating the performance of the algorithm was applied using 30 datasets with induced missing values and five classifiers divided into three classes: rule induction learning, lazy learning and approximate models; the results were compared with three techniques presented in the literature. The results obtained confirm that MOGAImp outperforms some well-established missing data treatment methods. Furthermore, the proposed method proved to be flexible, since it is possible to adapt it to different application domains.
© 2015 Elsevier Ltd. All rights reserved.

∗∗ Corresponding author: Tel.: +55-91-3201-8112; fax: +55-91-3201-76340; e-mail: fabio.lobato@ufopa.edu.br (Fabio Lobato)

1. Introduction

Real data are often affected by noise, whereas data mining algorithms are designed for quality data (Blake and Mangiameli, 2011). In this context, missing data, also called Missing Values (MV), deserve attention because of their ubiquity in the data analysis process (Graham, 2009). Their causes are diverse and related to the application domain, such as drawbacks in data acquisition, measurement errors, sensor network problems, data migration failures and unwillingness to respond to survey questions (Honaker et al., 2013).

MV bring harmful consequences to data analysis because of the bias they impose; in addition, data analysis techniques are not designed to handle MV directly. Many strategies to tackle this problem have been proposed. Some of them are considered naive, such as ignoring/deleting instances with missing values, whereby useful information is lost (Meng and Shi, 2012). Other approaches aim to estimate plausible values to complete the dataset; some of them are based on machine learning and others are imported from statistical learning theory (García-Laencina et al., 2009). The process of replacing a missing value with a plausible one is called data imputation, which can be seen as an optimization problem because the estimation should preserve the intrinsic characteristics of the database (Zhang, 2010).

In this sense, evolutionary algorithms have been widely applied due to two properties, exploration and exploitation, in addition to their balance between accuracy, development effort and convergence time. Several works have developed data imputation methods using evolutionary algorithms (Figueroa García et al., 2010, 2011); however, to the best of our knowledge, this is the first approach based on Evolutionary Multi-objective Optimization.

It should be noted that diverse measures are used to evaluate the imputation process, with emphasis on the classification accuracy and on measures that evaluate the predictive and distributive accuracies of the imputation method itself. Some of these measures are conflicting; in other words, while one is being optimized the other degenerates, thus justifying the application of multi-objective optimization techniques.

Regardless of the paradigm on which an imputation method is based, some restrictions of the state-of-the-art techniques should be pointed out. For instance, Zhang et al. (2011) point out that most of the available imputation methods are restricted to one type of variable only (categorical or numerical). In other words, these methods handle variables of different types separately, losing possible relationships between them. It is important to remember that classification algorithms commonly explore this kind of correlation, so it is important to treat MV in mixed-attribute datasets properly. Two other important restrictions are: 1) imputation methods cannot be properly evaluated apart from the modeling task (Silva and Hruschka, 2013); and 2) complete-case analysis, in which the information of instances or attributes with missing values is discarded, should be avoided (Graham, 2009).

Aiming to fill these gaps in the literature, this paper proposes a multi-objective evolutionary algorithm for data imputation. The well-known NSGA-II was used as the basis of the implementation due to its low complexity, flexibility and effectiveness. In order to reduce the computational cost, a multi-threaded version of the genetic algorithm that makes NSGA-II even more computationally efficient was also introduced. To compute the objective functions, the two most common evaluation measures in this application niche were chosen: the Root Mean Square Error (RMSE) and the classifier accuracy.

The MOGAImp performance was compared with well-accepted methods for handling missing data in pattern classification. The comparison was performed using 30 datasets with induced missing values, five classifiers divided into three classes (rule induction learning, lazy learning and approximate models), and three techniques presented in the literature. The results proved promising, confirming the suitability of MOGAImp when different and conflicting evaluation measures are considered. Moreover, MOGAImp showed a good trade-off between the evaluation measures analyzed.

The organization of this paper is as follows. Section 2 presents a brief theoretical background on pattern classification with missing data, followed by an overview of related work in Section 3. The multi-objective genetic algorithm for missing data imputation is described in Section 4, and the experimental setup is described in Section 5. The results and conclusions are discussed in Sections 6 and 7, respectively.

2. Pattern Classification With Missing Data

In a dataset, the absence of information in instances is called missing data, also known as missing values or incomplete data. According to García-Laencina et al. (2009), a dataset D, in the pattern classification scenario, may be defined as a set of N labeled instances. An instance is represented by a vector of d attributes (continuous or discrete), for example x = [x_1, x_2, ..., x_i, ..., x_d]^T. Additionally, each sample belongs to one of c possible classes: C_1, C_2, ..., C_c.

D = {X, T, M} = {(x_n, t_n, m_n)}_{n=1}^{N}    (1)

where x_n = [x_{1n}, x_{2n}, ..., x_{dn}]^T is the n-th vector composed of d attributes, labeled as t_n ∈ {C_1, C_2, ..., C_c}; and m_n = [m_{1n}, m_{2n}, ..., m_{dn}]^T indicates which input attributes are unknown in x_n. The vector that indicates missing data, m_n, is also termed the response-indicator vector. X is the set of input data and M is a binary matrix that indicates the absence of values; both have dimension [d × N]. The label set T has dimension [1 × N]. According to M, X is divided into two parts:

X = {X_o, X_m}    (2)

where X_o and X_m are the observed values of the dataset and the instances with missing values, respectively. These definitions provide the background required to understand the relationship between the cause of the missing data and the statistical effect called the missingness mechanism, of which there are three: Missing Completely at Random (MCAR); Missing at Random (MAR); and Missing Not at Random (MNAR) (Little and Rubin, 2002). The proper way to deal with missing values depends, in most cases, on how the attributes become missing. For practical purposes, most studies involving missing value treatment assume that the missing data mechanism is governed by MAR or MCAR.
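For illustration only (not from the original paper), the following Python sketch shows one way to hold the input data X, the labels T and the indicator matrix M described above, and to induce MCAR missing values with a given amputation probability; the toy attribute names and values, the induce_mcar helper and the 30% rate (borrowed from Section 5.1) are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy mixed-attribute dataset (hypothetical values), mirroring D = {X, T, M}.
X = pd.DataFrame({
    "Att1": [2.3, 1.7, 3.1, 2.9, 2.0],          # continuous
    "Att2": ["Yes", "No", "Yes", "No", "Yes"],   # categorical
    "Att3": [10, 12, 9, 11, 10],                 # discrete
})
T = pd.Series(["C1", "C2", "C1", "C1", "C2"], name="class")

def induce_mcar(X: pd.DataFrame, prob: float) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return a copy of X with MCAR missing values and the indicator matrix M.

    Under MCAR the probability of a cell being missing does not depend on
    observed or unobserved values, so a uniform Bernoulli mask is enough.
    """
    M = pd.DataFrame(rng.random(X.shape) < prob, index=X.index, columns=X.columns)
    X_missing = X.mask(M)          # masked cells become NaN ("?" in the paper's notation)
    return X_missing, M.astype(int)

X_mis, M = induce_mcar(X, prob=0.30)   # 30% amputation probability, as in Section 5.1
print(X_mis)
print(M)
```

Inducing MAR missingness would instead condition the mask on observed attributes; that variant is not shown here.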
Generally, methods for pattern recognition with missing values can be classified into four categories (García-Laencina et al., 2009): 1) Traditional approaches, also called naive methods, which represent the complete-case analysis and deal with the problem by simply omitting attributes or instances with missing data; 2) Data imputation, which replaces the value associated with the missing data, usually "null" or "?", by a plausible value; 3) Model-based methods, iterative procedures that use maximum likelihood estimation to calculate the joint distribution function of each attribute in order to predict the value to be replaced; 4) Machine learning methods, which aim to avoid the explicit MV replacement through the development/adaptation of machine learning algorithms that are more robust to missing data.

Whatever the approach chosen, the goal is to reduce the bias imposed by the missing data, given that data analysis techniques were not modeled to deal directly with this problem. Currently, a convergence of the missing value treatment methods towards the data imputation strategy can be perceived.

3. Related Work

A convergence of missing value treatment towards data imputation methods is observed. Zhang (2010) presents a categorization of imputation methods based on the number of operations performed: 1) single imputation, 2) multiple imputation, 3) fractional imputation and 4) iterative imputation. The methods belonging to the first category provide a single estimate for each missing value, while the multiple imputation methods estimate several possible values and combine them, using appropriate accuracy measures, to obtain the final value. The third category represents a compromise between the single and multiple imputation methods.

Finally, the iterative imputation techniques primarily use a generate-and-test mechanism, taking into account useful information (including incomplete cases).

Three of the imputation categories proposed by Zhang (2010) can be seen as optimization problems. MOGAImp fits into the iterative imputation category, in which bioinspired algorithms have been widely applied because of their balance between accuracy, development time and computational performance. It is important to highlight that, in most missing value treatment methods, bioinspired algorithms are not used to perform the data imputation itself; they are applied to improve convergence or to help in setting parameters (Marwala, 2009; Aydilek and Arslan, 2013).

Among the imputation methods based on bioinspired algorithms, Figueroa García et al. (2010, 2011) deserve attention. In their work, the authors used statistical measures such as the covariance matrix and the auto-correlation function to guide the estimation process. However, these measures were computed from instances without missing values, and only then were the values to be imputed estimated. This approach falls into complete-case analysis and therefore loses potentially important information for further inferences, which can produce biased results. Aiming to avoid this drawback, MOGAImp takes into account the information of incomplete instances to estimate plausible values to impute, making it more robust to noise.

In Luengo et al. (2011), the authors compare 14 different imputation methods, using 23 classification methods divided into three categories: 1) rule induction learning, 2) approximate models and 3) methods based on distance. The primary evaluation parameter used is the accuracy of the predictive model. Whereas the main contribution of that work is the correlation indicating which imputation method is most applicable to a particular group of classifiers, another relevant point is the analysis of the influence of the imputation method on the data with respect to two measures, the Wilcoxon signed rank test and the average mutual information difference, in addition to the classifiers' accuracies.

In the classification context, traditional approaches used to evaluate imputation methods usually take into consideration the distance between the original and imputed values, called predictive accuracy, or the classification accuracy. However, as Hruschka et al. (2009), García-Laencina et al. (2009) and Silva and Hruschka (2013) suggest, the best predictive accuracy results do not necessarily lead to the lowest classification bias. Therefore, it is clear that missing data treatment methods cannot be properly evaluated apart from the modeling task. In addition, it is possible to perceive that these measures are conflicting, which justifies the adoption of a multi-objective optimization method, as proposed in the present study.

It is important to point out that some imputation methods are specifically designed for discrete or continuous data, so, for mixed-attribute datasets, these methods handle different data types separately (Zhang et al., 2011; Stekhoven and Bühlmann, 2012). This approach ignores the possible relation among the variable types, which negatively impacts the classification process because such relationships are usually explored by machine learning methods. In order to treat real-world datasets properly - they commonly have both data types - MOGAImp was designed to treat mixed-attribute data simultaneously, taking into account the relationships among the attributes.

Many studies developed in this field are not intended to improve the process of handling missing values itself, but aim to contribute to a particular application domain. For instance, Favorskaya et al. (2013) propose a method for texture reconstruction in dynamic scenes; Ding and Ross (2012) compare missing data treatment methods in multibiometric systems; and Miranda et al. (2012) use autoencoders to predict values for imputation in electricity distribution networks.

In this sense, the MOGAImp flexibility should be highlighted. Its coding system allows easy adaptation to various application domains. In particular, this article is dedicated to presenting the application of the proposed approach to pattern classification, as will be shown in the following section.

4. Multi-Objective Genetic Algorithm For Missing Data Imputation

Genetic algorithms have been widely applied as a global search method in various application domains, including data mining and optimization problems. Our main motivations for developing a solution based on the GA paradigm are: 1) GAs effectively explore large search spaces while also exploiting good solutions; 2) the GA paradigm is extremely scalable and can be efficiently parallelized; 3) GAs are relatively easy to implement, adapt and tune for different application domains; 4) they have been successfully applied to Multi-Objective Problems (MOPs), where the main goal is to obtain a set of Pareto-optimal solutions. For these reasons, a considerable number of Multi-Objective Evolutionary Algorithms (MOEAs) have been proposed during the last two decades, with the common characteristic of finding multiple solutions in one single run, since the population diversity can be maintained along the Pareto fronts. One of the most prominent MOEAs is the Non-dominated Sorting Genetic Algorithm II (NSGA-II), a computationally fast and elitist MOEA based on sorting by dominance.

Accordingly, this paper presents a novel multi-objective genetic algorithm for missing data imputation based on NSGA-II, called MOGAImp. The following sections describe the genetic structure, including the multi-thread architecture, and a pseudocode description of MOGAImp.

4.1. Individual Representation and Fitness Function

Before outlining the genetic structure of the proposed method, it is important to explain the individual encoding and the fitness function. As an example, Figure 1 (a) shows a dataset with five missing values, which should be filled with plausible ones. In MOGAImp, an individual is a complete solution: each gene contains a value that will replace a corresponding missing data cell, thus producing a complete dataset (Figure 1 (b)).

The candidate solutions are represented by an array of integer values, which maps/indexes the feasible values for each attribute from the solution pool, as shown in Figure 1 (d).
Data: Incomplete dataset, GA parameters
Result: Imputed datasets
1  for each attribute k in dataset do
2      pool(k) ← list_attribute_values(k);
3  end
4  P0 ← initialization(population_size, pool);
5  D0 ← imputed_datasets(P0, pool);
6  [Oacc, ORMSE] ← evaluate_objectives(D0);
7  P0 ← non-domination sort by NSGA-II(P0, Oacc, ORMSE);
8  for t ← 1 to max_generations do
9      Qt ← apply_genetic_operators(Pt−1, Oacc, ORMSE);
10     Dt ← imputed_datasets(Qt, pool);
11     [O′acc, O′RMSE] ← evaluate_objectives(Dt);
12     Rt ← Pt−1 ∪ Qt;
13     Pt ← survivors selection by NSGA-II(Rt, Oacc, ORMSE);
14 end
15 F ← Pareto_front(Pt);
16 return impute_datasets(F, pool);
Algorithm 1: Multi-Objective Genetic Algorithm Imputation

Fig. 1. Chromosome represented by an array of floating-point data.

The solution pools are built for each attribute presenting missing data, and consist of all the feasible values ordered into an array - categorical attributes are ordered lexicographically. Figure 1 (c) illustrates this data structure: attributes Att1, Att2 and Att3 have missing values and thus one solution pool each.

The decoding process is done by mapping the index in the genotype to the respective solution pool (attribute), so the value pointed to by the index is used in the phenotype. For instance, the genotype presented in Figure 1 (d), 5, 2, 1, 1, 3, is mapped by consulting the solution pool of the respective attribute; thus "5" is mapped to "2.3", "2" represents "Yes", and so on. This codification strategy was developed with two broad goals in mind: 1) provide abstraction of the data type, allowing the algorithm to handle both continuous and categorical attributes; 2) yield a data structure suitable for applying genetic operators and for further knowledge extraction.

To evaluate the fitness of the candidate solutions, a multi-objective strategy was adopted. Two evaluation measures widely applied in data imputation experiments were considered: the classification accuracy and the distance between the real and imputed values, computed by the RMSE.
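To make the encoding concrete, the sketch below reimplements the solution-pool construction and the genotype decoding described above; it is an illustrative reading of the scheme rather than the authors' code, and the helper names (build_solution_pools, decode), the modulo guard on the index and the example genotype are assumptions.

```python
import pandas as pd

def build_solution_pools(X_mis: pd.DataFrame) -> dict[str, list]:
    """One ordered pool of feasible values per attribute that has missing data.

    Numerical attributes are sorted numerically, categorical ones
    lexicographically, as described for Figure 1 (c).
    """
    pools = {}
    for col in X_mis.columns:
        if not X_mis[col].isna().any():
            continue
        values = X_mis[col].dropna().unique().tolist()
        pools[col] = sorted(values, key=str) if X_mis[col].dtype == object else sorted(values)
    return pools

def decode(genotype: list[int], missing_cells: list[tuple[int, str]],
           pools: dict[str, list], X_mis: pd.DataFrame) -> pd.DataFrame:
    """Map each integer gene to a value in the pool of the attribute it fills.

    `missing_cells` lists the (row, column) positions of the missing values,
    in the same order as the genes; the result is a complete imputed dataset.
    """
    X_imp = X_mis.copy()
    for gene, (row, col) in zip(genotype, missing_cells):
        # modulo keeps a hypothetical out-of-range index inside the pool
        X_imp.at[row, col] = pools[col][gene % len(pools[col])]
    return X_imp

# Hypothetical usage with the genotype 5, 2, 1, 1, 3 of Figure 1 (d),
# assuming X_mis and its list of missing cells were built beforehand:
# X_complete = decode([5, 2, 1, 1, 3], missing_cells, build_solution_pools(X_mis), X_mis)
```

Indexing into a per-attribute pool rather than storing raw values is what gives the data-type abstraction of goal 1: the same integer chromosome covers continuous and categorical attributes.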
4.2. Genetic Structure

After the solution pool construction, described in Algorithm 1 (lines 1-3), the method follows the traditional structure of NSGA-II. It starts by initializing the population of individuals randomly, where each individual generates an imputed dataset (lines 4-5). Line 6 illustrates the population evaluation according to the adopted objective functions, the classification accuracy and the RMSE. Using information from the previous step, the initial population then undergoes the NSGA-II fast-non-dominated-sort and crowding-distance-assignment procedures, which organize the population based on dominance of the fitness functions and also guide the selection process towards a uniformly spread-out Pareto-optimal front.

Thereafter, the population goes through the evolutionary process (lines 8-14) until a defined maximum number of generations, set as the stopping criterion. Each pair of individuals selected (following the tournament method) undergoes the crossover and mutation operations in order to create the offspring Qt. For the crossover and mutation, the well-known n-point crossover and creep mutation were chosen.
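As a hedged illustration of the operators named above, the sketch below shows a generic n-point crossover and a creep mutation acting on the integer-index chromosomes of Section 4.1; the step size, default rates and function names are assumptions rather than values from the paper (the 50% mutation rate merely mirrors the setting reported in Section 5.2).

```python
import random

def n_point_crossover(parent_a: list[int], parent_b: list[int], n: int = 2) -> tuple[list[int], list[int]]:
    """Classic n-point crossover: swap the gene segments between n cut points."""
    cuts = sorted(random.sample(range(1, len(parent_a)), n))
    child_a, child_b = parent_a[:], parent_b[:]
    swap, prev = False, 0
    for cut in cuts + [len(parent_a)]:
        if swap:
            child_a[prev:cut], child_b[prev:cut] = parent_b[prev:cut], parent_a[prev:cut]
        swap = not swap
        prev = cut
    return child_a, child_b

def creep_mutation(genotype: list[int], pool_sizes: list[int],
                   rate: float = 0.5, step: int = 1) -> list[int]:
    """Creep mutation: nudge a gene up or down by a small step, staying inside its pool."""
    mutated = genotype[:]
    for i, size in enumerate(pool_sizes):
        if random.random() < rate:
            mutated[i] = min(size - 1, max(0, mutated[i] + random.choice((-step, step))))
    return mutated

# Hypothetical usage on two 5-gene parents whose pools have sizes 6, 3, 4, 4, 5:
# c1, c2 = n_point_crossover([5, 2, 1, 1, 3], [0, 1, 2, 3, 4], n=2)
# c1 = creep_mutation(c1, pool_sizes=[6, 3, 4, 4, 5], rate=0.5)
```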
In many problems, the GA operators are very simple and have limited impact on the processing time. On the other hand, the fitness calculation may demand high computational resources. For the application presented here, the classification model construction and the RMSE function have to be computed for each individual, thus creating a bottleneck.

Aiming to resolve this problem, a parallel approach is also proposed, attempting to reduce the processing time without interfering with the other properties of the algorithm. Here, when calculating the fitness, each individual is assigned to one thread, which is used to calculate the fitness function, with the main process waiting for all threads to finish before applying the NSGA-II and genetic operations.
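A minimal sketch of that one-thread-per-individual evaluation, assuming a user-supplied evaluate_individual callable that returns the (accuracy, RMSE) pair for one candidate; the thread-pool API choice and the function names are illustrative, not taken from the paper.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_population(population, evaluate_individual, max_workers=None):
    """Compute the two objective values of every individual in parallel.

    Each individual is handed to a worker thread; the main process blocks
    until all evaluations are done, as described for MOGAImp, and only then
    are the NSGA-II sorting and the genetic operators applied.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() preserves population order, so objectives[i] belongs to population[i]
        objectives = list(executor.map(evaluate_individual, population))
    return objectives  # e.g. [(accuracy_1, rmse_1), (accuracy_2, rmse_2), ...]
```

In CPython, threads only pay off here if evaluate_individual releases the GIL (for example by calling into native classifier code); otherwise a process pool would be the more effective drop-in replacement.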
5. Experimental Setup

This section presents the experimental setup adopted in the present study and the computational analyses. A more detailed description of the experimental framework, including the datasets and the algorithm parameterization, as well as the detailed results, can be found at http://linc.ufpa.br/mogaimp/.

5.1. Methodology

The MOGAImp performance was compared against three relevant imputation methods available in Keel (Alcalá et al., 2010), namely: Concept Most Common Attribute Value for Symbolic Attributes (CMC); Global Most Common Attribute Value for Symbolic Attributes and Global Average Value for Numerical Attributes (MC); and the Weighted Imputation with KNN (WKNNI). These methods achieved the best results in the experiments conducted by Luengo et al. (2011).

Altogether, 30 datasets with induced missing values were used in the experiments; the original datasets can be found at the

UCI Repository of Machine Learning Databases (Frank and Asuncion, 2010). The datasets used have the following composition: 10 were obtained from the Keel repository with induced missing values (Alcalá et al., 2010) - iris (IRS), pima (PIM), wine (WNE), australian (AUS), newthyroid (NTD), ecoli (ECO), satimage (SAT), german (GER), magic (MAG) and shuttle (SHT); and 20 were obtained by emulating the two missingness mechanisms proposed by Little and Rubin (2002) (MAR and MCAR), with 30% and 45% amputation probability; the original datasets used to produce them were: contraceptive method choice (CTV), glass identification (GLI), lymphography (LPG), tic-tac-toe endgame (TTT) and vertebral column (VTC). For the training and testing sets, sampling was performed with 10-fold cross-validation.

Five well-known classification algorithms were selected to carry out the experimentation, representing three classification categories. The grouped list is: Rule Induction Learning - C4.5 (Weka's J48), Conjunctive Rule and OneR; Approximate Models - Naïve-Bayes; Lazy Learning - 3NN.

To evaluate the performance of the imputation methods, three measures were chosen: the classification accuracy, which is the ratio between the number of instances correctly classified and the total number of instances; Wilson's noise ratio, in order to study the impact of the imputation method on the classification of instances with missing values (Wilson, 1972; Luengo et al., 2011); and the Root Mean Square Error (RMSE), which computes the distance between the imputed values and the original ones - for categorical attributes, the distance is considered 1. Lower RMSE values represent better predictive accuracy of the imputation method. To facilitate the comparisons, the RMSE was normalized (per dataset) to [0,1] and then inverted.
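As an illustration of how this mixed-type RMSE and its per-dataset normalization could be computed (a sketch under assumptions, not the authors' implementation: the categorical distance of 1 is applied when labels differ, and the normalization is read here as taken over the methods compared on each dataset):

```python
import numpy as np
import pandas as pd

def mixed_rmse(X_true: pd.DataFrame, X_imp: pd.DataFrame, mask: pd.DataFrame) -> float:
    """RMSE over the imputed cells only.

    Numeric cells contribute their difference; for categorical cells the distance
    is 1 when the imputed label differs from the original one (0 otherwise),
    following the convention stated in Section 5.1.
    """
    sq_errors = []
    for col in X_true.columns:
        rows = mask[col].astype(bool)
        if not rows.any():
            continue
        if pd.api.types.is_numeric_dtype(X_true[col]):
            diff = X_true.loc[rows, col].to_numpy(float) - X_imp.loc[rows, col].to_numpy(float)
        else:
            diff = (X_true.loc[rows, col].to_numpy() != X_imp.loc[rows, col].to_numpy()).astype(float)
        sq_errors.append(diff ** 2)
    return float(np.sqrt(np.concatenate(sq_errors).mean()))

def normalize_and_invert(rmse_per_method: dict[str, float]) -> dict[str, float]:
    """Min-max normalize the RMSE of all methods on one dataset to [0,1] and invert,
    so that higher values mean better predictive accuracy (as plotted in Fig. 3)."""
    lo, hi = min(rmse_per_method.values()), max(rmse_per_method.values())
    span = (hi - lo) or 1.0
    return {m: 1.0 - (v - lo) / span for m, v in rmse_per_method.items()}
```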
The statistical tests adopted to support the analysis of the results were the Wilcoxon Signed Rank Test for the classifiers' accuracies and the RMSE, at a 90% confidence level, which provides good statistical evidence about the behavior of the imputation methods; and the Friedman test with the Nemenyi post-hoc test to verify the statistical significance of the results according to Wilson's noise ratio, also at a 90% confidence level (Japkowicz and Shah, 2011).
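For reference, both tests are available in SciPy; the sketch below applies them to placeholder result arrays (the values are invented), and the Nemenyi post-hoc step would require an additional library such as scikit-posthocs, which is not shown.

```python
import numpy as np
from scipy import stats

# Placeholder paired results: accuracy of two imputation methods on the same datasets.
acc_mogaimp = np.array([85.1, 62.8, 90.5, 81.9, 80.1])
acc_baseline = np.array([85.0, 60.7, 82.7, 82.9, 77.8])

# Wilcoxon signed rank test on the paired differences (alpha = 0.10 for 90% confidence).
w_stat, w_p = stats.wilcoxon(acc_mogaimp, acc_baseline)
print(f"Wilcoxon: statistic={w_stat:.2f}, p={w_p:.3f}, significant={w_p < 0.10}")

# Friedman test: each argument is one method's measurements over the same datasets.
noise_m1 = [4.1, 3.9, 5.0, 4.4, 4.7]
noise_m2 = [3.8, 3.5, 4.9, 4.1, 4.2]
noise_m3 = [4.6, 4.4, 5.3, 4.9, 5.1]
f_stat, f_p = stats.friedmanchisquare(noise_m1, noise_m2, noise_m3)
print(f"Friedman: statistic={f_stat:.2f}, p={f_p:.3f}")
```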

5.2. MOGAImp Parameters

The MOGAImp parameters were set by means of calibration tests with artificial datasets. The results obtained from these simulations confirmed the convergence of the proposed method. At the same time, it was perceived that the algorithm parameterization depends on the dataset complexity; therefore, the datasets were grouped according to dimensionality, amount of missing values and the time required for model building. Four groups were created for the Keel datasets, with the following composition: G1) iris, wine, newthyroid and ecoli; G2) australian, german and pima; G3) shuttle and magic; G4) satimage. The remaining datasets belong to group 5 (G5).

Table 1 shows the parameters that were specified for each group. The crossover rate, mutation rate and tournament size were the same for all groups: 100%, 50% and 10 individuals, respectively.

Table 1. Genetic Algorithm Parameters.

Parameters        G1    G2    G3    G4    G5
Population size   400   350   250   250   100
Generations       2000  1500  500   1500  100
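The per-group settings of Table 1, together with the operator rates shared by all groups, can be collected in a small configuration structure; this is only a convenience sketch for reproducing the setup, not code from the paper.

```python
# Group-dependent parameters from Table 1, plus the operator settings shared by all groups.
GA_PARAMS = {
    "G1": {"population_size": 400, "generations": 2000},
    "G2": {"population_size": 350, "generations": 1500},
    "G3": {"population_size": 250, "generations": 500},
    "G4": {"population_size": 250, "generations": 1500},
    "G5": {"population_size": 100, "generations": 100},
}
SHARED = {"crossover_rate": 1.0, "mutation_rate": 0.5, "tournament_size": 10}

def params_for(group: str) -> dict:
    """Merge the group-specific values with the shared operator settings."""
    return {**GA_PARAMS[group], **SHARED}

# e.g. params_for("G3") -> {'population_size': 250, 'generations': 500,
#                           'crossover_rate': 1.0, 'mutation_rate': 0.5, 'tournament_size': 10}
```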
6. Results and Discussion

This section presents the analysis of the test results obtained using the experimental framework described previously. For the purposes of the MOGAImp performance analysis, three solutions were extracted from the Pareto-optimal set: MOGAImp RMSE, MOGAImp ACC and MOGAImp O. The first two optimize the result for the RMSE and the classifier accuracy, respectively, while the last one is the solution with the greatest distance from the origin, which represents a balance between the two performance measures. First, this section discusses the overall results regarding the three evaluation measures adopted. Then, the behavior of the imputation methods is analyzed according to the classification algorithms.
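A short sketch of how those three representative solutions could be picked from a Pareto front, assuming each solution carries its (accuracy, inverted normalized RMSE) pair on comparable scales; the greatest-distance-from-the-origin rule follows the description above, while the helper name and the example values are invented.

```python
import math

def pick_representatives(front):
    """front: list of (accuracy, inv_norm_rmse) pairs, both 'higher is better'.

    Returns the indices of the accuracy-optimal, RMSE-optimal and balanced
    (farthest from the origin) solutions, i.e. MOGAImp ACC, RMSE and O.
    """
    idx_acc = max(range(len(front)), key=lambda i: front[i][0])
    idx_rmse = max(range(len(front)), key=lambda i: front[i][1])
    idx_o = max(range(len(front)), key=lambda i: math.hypot(front[i][0], front[i][1]))
    return {"MOGAImp ACC": idx_acc, "MOGAImp RMSE": idx_rmse, "MOGAImp O": idx_o}

# Hypothetical front of four non-dominated solutions (accuracy %, inverted RMSE scaled to %):
print(pick_representatives([(92.0, 40.0), (88.0, 75.0), (83.0, 90.0), (79.0, 95.0)]))
```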
6.1. Evaluation Measures

In order to evaluate the hypothesis that some performance measures commonly used in the literature are conflicting, the results for all datasets and classifiers were grouped together to allow an overall analysis of the accuracy, the RMSE and Wilson's noise ratio. Figures 2 and 3 show the performance of the imputation methods according to the classifiers' accuracy and the RMSE, respectively. These figures display box plots: on each box, the central mark is the median, the gray blocks are the 25th and 75th percentiles, and the lines extend to the most extreme data points not considered outliers.

Fig. 2. Classifiers' Accuracies in overall comparisons.

Analyzing the results presented in Figures 2 and 3, it is possible to perceive that the classifier accuracy and the RMSE are conflicting performance measures: while one is optimized, the other is harmed. As mentioned previously, the RMSE is normalized and inverted, so higher values in the box plots represent lower RMSE values. This is most clearly observed for WKNNI: this imputation method achieved a very good RMSE performance but presented the worst behavior in terms of classifier accuracy.
Fig. 3. RMSE in overall comparisons.

Table 2. Wilson's noise ratio results per imputation method and ranking according to the Friedman test.

          Missing value treatment methods
Dataset   MOGAImp  MOGAImp  MOGAImp  WKNNI  CMC    MC
          RMSE     ACC      O
AUS       85.05    87.02    85.05    85.01  88.91  85.01
CTC       61.86    62.75    62.77    60.66  71.11  61.16
ECO       88.89    91.23    90.49    82.72  92.59  84.57
GER       81.88    82.18    81.88    82.88  81.63  82.25
GLI       79.42    80.80    80.05    77.81  83.56  78.25
IRS       95.92    97.14    96.73    93.88  95.92  87.76
LPG       86.53    87.84    87.29    84.32  88.42  83.21
MAG       81.90    82.42    81.90    81.48  85.19  81.12
NTD       96.84    97.37    98.16    96.05  98.68  92.11
PIM       81.95    82.87    81.95    80.51  84.87  80.51
SAT       91.82    92.13    91.99    93.56  93.47  91.86
SHT       98.44    98.39    98.44    99.43  99.26  98.36
TTT       92.26    92.69    92.63    92.39  94.07  90.71
VTC       87.22    91.11    89.44    83.72  96.23  85.28
WNE       98.88    99.04    99.68    96     100    99.2
Ranking   4.1      2.53     3.0      4.53   1.67   5.14
{CMC, MOGAImp O, MOGAImp ACC} ≻ {MOGAImp RMSE, WKNNI, MC}

As expected, MOGAImp RMSE and MOGAImp ACC achieved the best results regarding the RMSE and the classifiers' accuracies, respectively. Observing MOGAImp O, it is possible to attest that this solution provides the best trade-off between the evaluation measures analyzed, since it achieved values close to optimal in both the RMSE and the classifiers' accuracies.

The results for Wilson's noise ratio are shown in Table 2, as well as the rank obtained by each imputation method according to the Friedman test. In the last row, the symbol ≻ denotes that the difference between one or more methods is statistically significant. For instance, {method 1} ≻ {method 2, method 3} indicates that "method 1" is significantly better than "methods 2 and 3". The results presented in Table 2 show that CMC obtained the smallest rank sum; in other words, this method achieved the best overall result for Wilson's noise ratio. At this point, it should be observed that CMC is considered a class-dependent imputation method because it takes the class label into consideration to estimate the value to be imputed; however, as mentioned by Wohlrab and Fürnkranz (2010), the key problem of this strategy is that, at classification time, the class value is not known, therefore a different approach is required. Even so, the Nemenyi post-hoc test indicated that CMC is not statistically significantly better than the solutions produced by MOGAImp O and MOGAImp ACC. Therefore, MOGAImp O can be considered the best trade-off between the three evaluation measures analyzed, with the advantage that this solution is class-independent, contrasting with CMC in this aspect.

In summary, the results shown above indicate that the proposed method has a very competitive performance, obtaining results superior to the baseline methods. Despite the best rank of CMC in Wilson's noise ratio, this result is not statistically significant. Moreover, it is important to stress that the proposed method has the advantage of being class-independent and of allowing more evaluation measures to be incorporated, making it easily adaptable to different application domains.

6.2. Classification Algorithms

This section shows the MOGAImp comparison in relation to the classification algorithms grouped into three classes: rule induction learning, approximate models and lazy learning. To perform this analysis, the imputation methods were compared using the Wilcoxon Signed Rank test; Table 3 shows the resulting ranks for classifier accuracy and RMSE. Analyzing these results, it is possible to conclude that MOGAImp provides a better trade-off between the classifier accuracy and the RMSE than the other imputation methods. Moreover, it is observed that there are no differences in the behavior of the imputation methods within a class of classifiers. Individually, most of the classifiers have statistically identical behavior to the others, independently of the missing value treatment method.

7. Conclusions

This paper has proposed a novel multi-objective genetic algorithm for data imputation based on the well-known evolutionary algorithm NSGA-II. This new method, called MOGAImp, differs from the current evolutionary methods for data imputation in three contributions: 1) it is capable of tackling conflicting evaluation measures; 2) it is suitable for mixed-attribute datasets; 3) it takes into account information from incomplete instances and from the model building. As the literature review demonstrates, this is the first method that applies a multi-objective approach in this application domain. Advantages 2 and 3 follow from the individual's coding scheme, which analyses the instances, storing all plausible values in order to use them to build the chromosome. Each individual is a complete solution, producing a unique imputed dataset, which is used to compute the fitness functions. For the scenario analyzed, two well-established evaluation measures of data imputation methods were used as objective functions: the RMSE and the classification accuracy.

MOGAImp was compared against three well-known imputation methods, namely CMC, MC and WKNNI, on 30 publicly available benchmark datasets. To assess the algorithm performance, five classification algorithms were used in order to represent three groups of classification methods: rule induction learning, approximate models and lazy learning. The experimental results showed that MOGAImp outperforms the other imputation methods in both objective functions studied and in Wilson's noise ratio.

Table 3. Wilcoxon results for accuracy and RMSE per classification method.

           Imp. Methods     J48   KNN   Naïve-Bayes  Conjunctive  OneR  AVG   RANK
Accuracy   MOGAImp RMSE     4     4     4            4            4     4     4
           MOGAImp ACC      1     1     1            2            1     1.2   1
           MOGAImp O        3     2.5   3            3            3     2.9   3
           WKNNI            5.5   5.5   5.5          5            5     5.3   5
           CMC              2     2.5   2            1            2     1.9   2
           MC               5.5   5.5   5.5          6            6     5.7   6
RMSE       MOGAImp RMSE     1     1     1.5          1            1     1.1   1
           MOGAImp ACC      6     6     6            6            6     6     6
           MOGAImp O        4     4     4            4            4     4     4
           WKNNI            2.5   2.5   3            2.5          2.5   2.6   3
           CMC              2.5   2.5   1.5          2.5          2.5   2.3   2
           MC               5     5     5            5            5     5     5
The MOGAImp is also capable of providing a set of solutions, which can be used for further knowledge extraction, helping data analysts to better understand the missing values problem in the domain studied. Moreover, the MOGAImp flexibility should be highlighted: its unique encoding scheme makes it possible to adapt it to different application domains.

The future research possibilities are broad. Prospects can be drawn towards investigating the application of MOGAImp to different domains such as regression, clustering and time-series analysis; investigating the adoption of heuristics to generate the initial population in order to reduce the search space; and implementing a knowledge extraction method aiming to provide a comprehensible model to the data analyst.

Acknowledgments

The authors would like to thank CNPQ and CAPES for supporting this research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

Alcalá, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F., 2010. Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing 17, 255–287.
Aydilek, I.B., Arslan, A., 2013. A hybrid method for imputation of missing values using optimized fuzzy c-means with support vector regression and a genetic algorithm. Information Sciences 233, 25–35.
Blake, R., Mangiameli, P., 2011. The effects and interactions of data quality and problem complexity on classification. Journal of Data and Information Quality 2, 8:1–8:28.
Ding, Y., Ross, A., 2012. A comparison of imputation methods for handling missing scores in biometric fusion. Pattern Recognition 45, 919–933.
Favorskaya, M., Damov, M., Zotin, A., 2013. Accurate spatio-temporal reconstruction of missing data in dynamic scenes. Pattern Recognition Letters 34, 1694–1700.
Figueroa García, J.C., Kalenatic, D., López Bello, C.A., 2010. An evolutionary approach for imputing missing data in time series. Journal of Circuits, Systems and Computers 19, 107–121.
Figueroa García, J.C., Kalenatic, D., López Bello, C.A., 2011. Missing data imputation in multivariate data by evolutionary algorithms. Computers in Human Behavior 27, 1468–1474.
Frank, A., Asuncion, A., 2010. UCI Machine Learning Repository. Technical Report. University of California, Irvine, School of Information and Computer Sciences.
García-Laencina, P.J., Sancho-Gómez, J.L., Figueiras-Vidal, A.R., 2009. Pattern classification with missing data: a review. Neural Computing and Applications 19, 263–282.
Graham, J.W., 2009. Missing data analysis: making it work in the real world. Annual Review of Psychology 60, 549–576.
Honaker, J., King, G., 2013. What to do about missing values in time-series cross-section data. American Journal of Political Science 54, 561–581.
Hruschka, E.R., Garcia, A.J.T., Hruschka Jr., E.R., Ebecken, N.F.F., 2009. On the influence of imputation in classification: practical issues. Journal of Experimental & Theoretical Artificial Intelligence 21, 43–58.
Japkowicz, N., Shah, M., 2011. Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, New York, NY, USA.
Little, R.J.A., Rubin, D.B., 2002. Statistical Analysis with Missing Data. 2nd ed., Wiley, New York.
Luengo, J., García, S., Herrera, F., 2011. On the choice of the best imputation methods for missing values considering three groups of classification methods. Knowledge and Information Systems 32, 77–108.
Marwala, T., 2009. Computational Intelligence for Missing Data Imputation, Estimation and Management: Knowledge Optimization Techniques. 1st ed., Information Science Reference.
Meng, Z., Shi, Z., 2012. Extended rough set-based attribute reduction in inconsistent incomplete decision systems. Information Sciences 204, 44–69.
Miranda, V., Krstulovic, J., Keko, H., Moreira, C., Pereira, J., 2012. Reconstructing missing data in state estimation with autoencoders. IEEE Transactions on Power Systems 27.
Silva, J.D.A., Hruschka, E.R., 2013. An experimental study on the use of nearest neighbor-based imputation algorithms for classification tasks. Data & Knowledge Engineering 84, 47–58.
Stekhoven, D.J., Bühlmann, P., 2012. MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 112–118.
Wilson, D.L., 1972. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man and Cybernetics 2, 408–421.
Wohlrab, L., Fürnkranz, J., 2010. A review and comparison of strategies for handling missing values in separate-and-conquer rule learning. Journal of Intelligent Information Systems 36, 73–98.
Zhang, S., 2010. Shell-neighbor method and its application in missing data imputation. Applied Intelligence 35, 123–133.
Zhang, S., Jin, Z., Zhu, X., 2011. Missing value estimation for mixed-attribute data sets. IEEE Transactions on Knowledge and Data Engineering 23, 110–121.
