International Journal of Information Technology & Decision Making
Vol. 10, No. 1 (2011) 187–206
© World Scientific Publishing Company
DOI: 10.1142/S0219622011004282

ENSEMBLE OF SOFTWARE DEFECT PREDICTORS:


AN AHP-BASED EVALUATION METHOD

YI PENG, GANG KOU, GUOXUN WANG and WENSHUAI WU


School of Management and Economics
University of Electronic Science and Technology of China
Chengdu, P. R. China, 610054
kougang@uestc.edu.cn
YONG SHI
College of Information Science & Technology
University of Nebraska at Omaha
Omaha, NE 68182, USA
and
CAS Research Center on Fictitious Economy and Data Sciences
Beijing 100080, China

Classification algorithms that help to identify software defects or faults play a crucial role
in software risk management. Experimental results have shown that ensembles of classifiers
are often more accurate and more robust to the effects of noisy data, and achieve lower
average error rates, than any of the constituent classifiers. However, inconsistencies exist
across different studies, and the performances of learning algorithms may vary using different
performance measures and under different circumstances. Therefore, more research is
needed to evaluate the performance of ensemble algorithms in software defect prediction.
The goal of this paper is to assess the quality of ensemble methods in software defect
prediction with the analytic hierarchy process (AHP), which is a multicriteria
decision-making approach that prioritizes decision alternatives based on pairwise comparisons.
Through the application of the AHP, this study experimentally compares the performance of
several popular ensemble methods using 13 different performance metrics over 10 public-domain
software defect datasets from the NASA Metrics Data Program (MDP) repository. The results
indicate that ensemble methods can improve the classification results of software defect
prediction in general, and that AdaBoost gives the best results. In addition, tree- and
rule-based classifiers perform better in software defect prediction than the other types of
classifiers included in the experiment. In terms of single classifiers, K-nearest-neighbor,
C4.5, and Naïve Bayes tree ranked higher than the other classifiers.

Keywords: Ensemble; classification; software defect prediction; the analytic hierarchy
process (AHP).

1. Introduction
Large and complex software systems have become an essential part of our society.
Defects existing in software systems are prevalent and expensive. According to the

Research Triangle Institute (RTI), software defects cost the US economy billions of
dollars annually, and more than a third of the costs associated with software defects
could be avoided by improving software testing.1
As a useful software testing tool, software defect prediction can help detect
software faults at an early stage, which facilitates efficient test resource allocation,
improves software architecture design, and reduces the number of defective modules.2
Software defect prediction can be modeled as a two-group classification problem by
categorizing software units as either fault-prone (fp) or nonfault-prone (nfp) using
historical data.
Researchers have developed many classification models for software defect
prediction.2-15 Previous studies illustrate that ensemble methods, which combine
classifiers using some mechanism, are superior to others in software defect
prediction.2,16 However, other works indicate that classifiers' performances may
vary using different performance measures and under different circumstances.17-20
Furthermore, there are many ways to construct ensembles of classifiers. How to
select the most appropriate ensemble method for the software defect prediction
problem has not been fully investigated.
The objective of this study is to evaluate the quality of ensemble methods for
software defect prediction with the analytic hierarchy process (AHP) method. The
AHP is a multicriteria decision-making approach that helps decision makers structure
a decision problem based on pairwise comparisons and experts' judgments.21
Three popular ensemble methods (bagging, boosting, and stacking) are compared
with 12 well-known classification methods using 13 performance measures over 10
public-domain datasets from the NASA Metrics Data Program (MDP) repository.22
The classification results are then analyzed using the AHP to determine the best
classifier for the software defect prediction task.
The rest of this paper is organized as follows: Sections 2 and 3 describe the
ensemble methods and the AHP, respectively; Section 4 explains the performance
metrics, datasets, and methodology used in the experiment and analyzes the results;
Section 5 summarizes.
2. Ensemble Methods
Ensemble learning algorithms construct a set of classifiers and then combine the
results of these classifiers using some mechanism to classify new data records.23
Experimental results have shown that ensembles are often more accurate and robust
to the effects of noisy data, and achieve lower average error rates than any of the
constituent classifiers.24-28
How to construct good ensembles of classifiers is one of the most active research
areas in machine learning, and many methods for constructing ensembles have
been proposed in the past two decades.29 Dietterich30 divides these methods into
five groups: Bayesian voting, manipulating the training examples, manipulating the
input features, manipulating the output targets, and injecting randomness. Several
comparative studies have been conducted to examine the effectiveness and performance
of ensemble methods. Results of these studies indicate that bagging and
boosting are very useful in improving the accuracy of certain classifiers,31 and their
performances vary with added classification noise.23 To investigate the capabilities
of ensemble methods in software defect prediction, this study concentrates on three
popular ensemble methods (i.e. bagging, boosting, and stacking) and compares their
performances on public-domain software defect datasets.
2.1. Bagging
Bagging combines multiple outputs of a learning algorithm by taking a plurality
vote to get an aggregated single prediction.32 The multiple outputs of a learning
algorithm are generated by randomly sampling with replacement from the original
training dataset and applying the predictor to each sample. Many experimental
results show that bagging can improve accuracy substantially. The vital element in
whether bagging will improve accuracy is the instability of the predictor.32 For an
unstable predictor, a small change in the training dataset may cause large changes
in predictions.33 For a stable predictor, however, bagging may slightly degrade the
performance.32
Researchers have performed large empirical studies to investigate the capabilities
of ensemble methods. For instance, Bauer and Kohavi31 compared bagging
and boosting algorithms with a decision tree inducer and a Naïve Bayes inducer.
They concluded that bagging reduces the variance of unstable methods and leads to
significant reductions in mean-squared errors. Dietterich23 studied three ensemble
methods (bagging, boosting, and randomization) using the decision tree algorithm
C4.5 and pointed out that bagging is much better than boosting when there is
substantial classification noise.
In this study, bagging is generated by averaging probability estimates.34
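For illustration, the following minimal Python sketch builds a bagged predictor by bootstrap sampling and averaging probability estimates, in the spirit of the description above; the experiments in this paper used WEKA's implementation, so the scikit-learn-style code below is only an assumed analogue.

```python
# Minimal bagging sketch: bootstrap the training set, fit one copy of the base
# learner per sample, and average the probability estimates of the copies.
# Assumes numpy arrays and a two-class problem in which every bootstrap sample
# contains both classes.
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagged_predict_proba(base_learner, X_train, y_train, X_test, n_bags=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    probas = []
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)                 # sample with replacement
        model = clone(base_learner).fit(X_train[idx], y_train[idx])
        probas.append(model.predict_proba(X_test))       # unstable learners vary across bags
    return np.mean(probas, axis=0)                       # average the probability estimates

# e.g. proba = bagged_predict_proba(DecisionTreeClassifier(), X_tr, y_tr, X_te)
```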
2.2. Boosting
Similar to bagging, the boosting method also combines the different decisions of a
learning algorithm to produce an aggregated prediction.24,35 In boosting, however,
the weights of training instances change in each iteration to force learning algorithms
to put more emphasis on instances that were predicted incorrectly previously and less
emphasis on instances that were predicted correctly previously.30 Boosting often
achieves more accurate results than bagging and other ensemble methods.23,29,31
However, boosting may overfit the data, and its performance deteriorates with
classification noise.
This study evaluates a widely used boosting method, the AdaBoost algorithm, in
the experiment. AdaBoost is the abbreviation for adaptive boosting algorithm,
because it adjusts adaptively to the errors returned by classifiers from previous
iterations.24,36 The algorithm assigns equal weight to each training instance at the
beginning. It then builds a classifier by applying the learning algorithm to the
training data. Weights of misclassified instances are increased, while weights of
correctly classified instances are decreased. Thus, the new classifier concentrates more
on incorrectly classified instances in each iteration.
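The reweighting loop just described can be sketched as follows for the two-class case with labels in {-1, +1}; this is a generic discrete AdaBoost outline under those assumptions, not the exact WEKA AdaBoostM1 code used in the experiment.

```python
# Discrete AdaBoost sketch: each round fits a weak learner on the weighted data,
# then raises the weights of misclassified instances and lowers the others.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):                     # y must contain -1/+1 labels
    n = len(X)
    w = np.full(n, 1.0 / n)                              # equal weights at the beginning
    models, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()               # weighted error of this round
        if err == 0 or err >= 0.5:                       # stop if perfect or no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)                   # misclassified: weight up; correct: down
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    scores = sum(a * m.predict(X) for a, m in zip(alphas, models))
    return np.sign(scores)                               # weighted vote over the rounds
```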
2.3. Stacking
Stacked generalization, often abbreviated as stacking, is a scheme for minimizing
the generalization error rate of one or more learning algorithms.37 Unlike bagging
and boosting, stacking can be applied to combine different types of learning
algorithms. Each base learner, also called a level-0 model, generates a class value for
each instance. The predictions of the level-0 models are then fed into the level-1 model,
which combines them to form a final prediction.34
Another ensemble method used in the experiment is voting, which is a simple
average of multiple classifiers' probability estimates provided by WEKA.34
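A compact sketch of both schemes, using scikit-learn stand-ins for the WEKA meta.Stacking and meta.Vote components (the particular base learners and combiner below are assumptions for illustration):

```python
# Stacking: level-0 learners produce predictions that a level-1 model combines.
# Voting: a simple average of the level-0 probability estimates ("soft" voting).
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

level0 = [
    ("tree", DecisionTreeClassifier()),       # rough analogues of some base classifiers
    ("knn", KNeighborsClassifier()),
    ("nb", GaussianNB()),
]

stacking = StackingClassifier(estimators=level0,
                              final_estimator=LogisticRegression())   # level-1 combiner
voting = VotingClassifier(estimators=level0, voting="soft")           # average probabilities

# stacking.fit(X_tr, y_tr); voting.fit(X_tr, y_tr)
```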
2.4. Selected classification models
As a powerful tool that has numerous applications, classification methods have been
studied extensively in several fields, such as machine learning, statistics, and data
mining.38 Previous studies have shown that an ideal ensemble should consist of
accurate and diverse classifiers.39,40 Therefore, this study selects 12 classifiers to
build ensembles. They represent five categories of classifiers (i.e. trees, functions,
Bayesian classifiers, lazy classifiers, and rules) and were implemented in WEKA.
For the trees category, we chose classification and regression tree (CART), Naïve
Bayes tree, and C4.5. The functions category includes linear logistic regression, radial
basis function (RBF) network, sequential minimal optimization (SMO), and
neural networks. Bayesian classifiers include Bayesian network and Naïve Bayes.
K-nearest-neighbor was chosen to represent lazy classifiers. For the rules category,
decision table and Repeated Incremental Pruning to Produce Error Reduction
(RIPPER) rule induction were selected.
Classification and regression tree (CART) can predict both continuous and
categorical dependent attributes by building regression trees and discrete classes,
respectively.41 Naïve Bayes tree is an algorithm that combines the Naïve Bayes
induction algorithm and decision trees to increase the scalability and interpretability
of Naïve Bayes classifiers.42 C4.5 is a decision tree algorithm that constructs decision
trees in a top-down recursive divide-and-conquer manner.43
Linear logistic regression models the probability of occurrence of an event as a
linear function of a set of predictor variables.44 A neural network is a collection of
artificial neurons that learns relationships between inputs and outputs by adjusting
the weights.28 RBF network45 is an artificial neural network that uses radial basis
functions as activation functions. The centers and widths of hidden units are derived
using k-means, and the outputs obtained from the hidden layer are combined using
logistic regression.34 SMO is a sequential minimal optimization algorithm for training
support vector machines (SVM).46,47


Bayesian network and Naïve Bayes both model probabilistic relationships
between the predictor variables and the class variable. While the Naïve Bayes classifier48
estimates the class-conditional probability based on Bayes' theorem and can only
represent simple distributions, a Bayesian network is a probabilistic graphical model
and can represent conditional independencies between variables.49
K-nearest-neighbor50 classifies a given data instance based on learning by analogy;
that is, it assigns an instance to the class of the closest training examples in the
feature space.
Decision table selects the best-performing attribute subsets using best-first
search and uses cross-validation for evaluation.51 RIPPER52 is a sequential covering
algorithm that extracts classification rules directly from the training data
without generating a decision tree first.28
Each of stacking and voting combines all classifiers to generate one prediction.
Since bagging and boosting are designed to combine multiple outputs of a single
learning algorithm, they are applied to each of the 12 classifiers; together with
stacking and voting, this produces a total of 26 aggregated outputs.
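To make the counting concrete: applying bagging and AdaBoost to each of the 12 base classifiers yields 24 homogeneous ensembles, and adding stacking and voting gives the 26 aggregated outputs; together with the 12 single classifiers this is the pool of 38 models evaluated later. The sketch below enumerates such a pool with scikit-learn analogues of a few of the WEKA learners (the mapping is an assumption; several WEKA classifiers, e.g. NBTree, decision table, and RBF network, have no direct equivalent and are left out).

```python
# Enumerate single classifiers plus their bagged and boosted versions.
# Only learners that accept sample weights (required by AdaBoostClassifier) are used here.
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

base = {
    "CART-like": DecisionTreeClassifier(),
    "C4.5-like": DecisionTreeClassifier(criterion="entropy"),
    "Logistic": LogisticRegression(max_iter=1000),
    "SMO-like": SVC(probability=True),
    "NaiveBayes": GaussianNB(),
}

models = dict(base)                                        # the single classifiers
for name, clf in base.items():
    models[name + ".Bagging"] = BaggingClassifier(clf, n_estimators=10)
    models[name + ".AdaBoost"] = AdaBoostClassifier(clf, n_estimators=10)
# Stacking and voting over the base learners would be added as two further entries,
# as sketched in Sec. 2.3.
```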

3. The Analytic Hierarchy Process (AHP)
The analytic hierarchy process is a multicriteria decision-making approach that
helps decision makers structure a decision problem based on pairwise comparisons
and experts' judgments.53,54 Saaty55 summarizes four major steps for the AHP.
In the first step, decision makers define the problem and decompose it into a
three-level hierarchy (the goal of the decision, the criteria or factors that contribute
to the solution, and the alternatives associated with the problem through the
criteria) of interrelated decision elements.56 The middle level of criteria might
be expanded to include subcriteria levels. After the hierarchy is established, the
decision makers compare the criteria two by two using a fundamental scale in the
second step. In the third step, these human judgments are converted to a matrix of
relative priorities of decision elements at each level using the eigenvalue method. The
fourth step calculates the composite or global priorities for each decision alternative
to determine their ratings.
The AHP has been applied in diverse decision problems, such as economics and
planning, policies and allocations of resources, conflict resolution, arms control,
material handling and purchasing, manpower selection and performance measurement,
project selection, marketing, portfolio selection, model selection, politics, and
environment.57 Over the last 20 years, the AHP has been studied extensively and
various variants of the AHP have been proposed.58-61
In this study, the decision problem is to select the best ensemble method for
the task of software defect prediction. The first step of the AHP is to decompose
the problem into a decision hierarchy. As shown in Fig. 1, the goal is to select an
ensemble method that is superior to other ensemble methods over public-domain
software defect datasets through the comparison of a set of performance measurements.
The criteria are performance measures for classifiers, such as overall accuracy,
F-measure, area under ROC (AUC), precision, recall, and Kappa statistic.
The decision alternatives are ensembles and individual classification methods, such
as AdaBoost, bagging, stacking, C4.5, SMO, and Naïve Bayes. Individual classifiers
are included as decision alternatives for the purpose of comparison.

Fig. 1. An AHP hierarchy for the ensemble selection problem.

In step 2, the input data for the hierarchy, which is a scale of numbers that indicates
the preference of decision makers about the relative importance of the criteria,
are collected. Saaty53 provides a fundamental scale for this purpose, which has been
validated theoretically and practically. The scale ranges from 1 to 9 with increasing
importance. Numbers 1, 3, 5, 7, and 9 represent equal, moderate, strong, very
strong, and extreme importance, respectively, while 2, 4, 6, and 8 indicate intermediate
values. This study uses 13 measures to assess the capability of ensembles
and individual classifiers. The matrix of pairwise comparisons of these measures is
exhibited in Table 1. Previous works have proved that the AUC is the most informative
and objective measurement of predictive accuracy62 and is an extremely
important measure in software defect prediction. Therefore, it is assigned the
number 9. The F-measure, mean absolute error, and overall accuracy are very important
measures, but less important than the AUC. The true positive rate (TPR), true
negative rate (TNR), false positive rate (FPR), false negative rate (FNR), precision,
recall, and Kappa statistic are strongly important classification measures that
are less important than the F-measure, mean absolute error, and overall accuracy.
Training and test time refer to the time needed to train and test a classification
algorithm or ensemble method, respectively. They are useful measures in real-time
software defect identification. Since this study is not aimed at the real-time software
defect identification problem, they are included to measure the efficiency of
ensemble methods and are given the lowest importance.
The third step of the AHP computes the principal eigenvector of the matrix
to estimate the relative weights (or priorities) of the criteria. The estimated priorities
are obtained through a two-step process: (1) raise the matrix to large powers
(square it); (2) sum and normalize each row. This process is repeated until the
difference between the sums of each row in two consecutive rounds is smaller than a
prescribed value. The priorities, which provide the relative ranking of the performance
measures, are shown in the rightmost column of Table 1.

Table 1. Pairwise comparisons of performance measures.

Measures    Acc   TPR   FPR   TNR   FNR   Precision  Recall  F-measure  AUC   Kappa  MAE   TrainTime  TestTime  Priority
Acc         1     3     3     3     3     3          3       1          1/3   3      1     7          7         0.1262
TPR         1/3   1     1     1     1     1          1       1/3        1/5   1      1/3   5          5         0.0491
FPR         1/3   1     1     1     1     1          1       1/3        1/5   1      1/3   5          5         0.0491
TNR         1/3   1     1     1     1     1          1       1/3        1/5   1      1/3   5          5         0.0491
FNR         1/3   1     1     1     1     1          1       1/3        1/5   1      1/3   5          5         0.0491
Precision   1/3   1     1     1     1     1          1       1/3        1/5   1      1/3   5          5         0.0491
Recall      1/3   1     1     1     1     1          1       1/3        1/5   1      1/3   5          5         0.0491
F-measure   1     3     3     3     3     3          3       1          1/3   3      1     7          7         0.1262
AUC         3     5     5     5     5     5          5       3          1     5      3     9          9         0.2513
Kappa       1/3   1     1     1     1     1          1       1/3        1/5   1      1/3   5          5         0.0491
MAE         1     3     3     3     3     3          3       1          1/3   3      1     7          7         0.1262
TrainTime   1/7   1/5   1/5   1/5   1/5   1/5        1/5     1/7        1/9   1/5    1/7   1          1         0.0133
TestTime    1/7   1/5   1/5   1/5   1/5   1/5        1/5     1/7        1/9   1/5    1/7   1          1         0.0133

After obtaining the priority vector of the criteria level, the AHP method moves to
the lowest level in the hierarchy, which consists of ensemble methods and classification
algorithms in this experiment. The pairwise comparisons at this level compare the
learning algorithms with respect to each performance measure in the level immediately
above. The matrices of comparisons of the learning algorithms with respect to the
criteria and their priorities are analyzed and summarized in the experimental study
section. The ratings for the learning algorithms are produced by aggregating the
relative priorities of decision elements.21
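For illustration, the squaring-and-normalizing procedure can be written down directly; the sketch below applies it to the judgment matrix of Table 1 and should closely reproduce the Priority column (the original computation was done in Matlab 7.0; the Python version is only an illustrative analogue).

```python
# Estimate AHP priorities: repeatedly square the pairwise comparison matrix,
# sum and normalize each row, and stop when the normalized row sums settle.
import numpy as np

def ahp_priorities(A, tol=1e-9, max_iter=50):
    w = None
    for _ in range(max_iter):
        A = A @ A                          # raise the matrix to a large power (square it)
        A = A / A.max()                    # rescale only to avoid overflow; priorities unchanged
        w_new = A.sum(axis=1)              # sum each row ...
        w_new = w_new / w_new.sum()        # ... and normalize
        if w is not None and np.abs(w_new - w).max() < tol:
            break                          # row sums stopped changing
        w = w_new
    return w_new

# Rows of Table 1 (only four distinct row patterns occur).
a = [1, 3, 3, 3, 3, 3, 3, 1, 1/3, 3, 1, 7, 7]          # Acc, F-measure, MAE
b = [1/3, 1, 1, 1, 1, 1, 1, 1/3, 1/5, 1, 1/3, 5, 5]    # TPR, FPR, TNR, FNR, Precision, Recall, Kappa
c = [3, 5, 5, 5, 5, 5, 5, 3, 1, 5, 3, 9, 9]            # AUC
d = [1/7] + [1/5] * 6 + [1/7, 1/9, 1/5, 1/7, 1, 1]     # TrainTime, TestTime
A = np.array([a, b, b, b, b, b, b, a, c, b, a, d, d], dtype=float)

print(np.round(ahp_priorities(A), 4))   # expected to be close to the Priority column of Table 1
```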
4. Experimental Study
The experiment is designed to compare a wide selection of ensemble methods and
individual classifiers for software defect prediction based on the AHP method.
As discussed in Sec. 3, the performances of ensemble methods and classification
algorithms are evaluated using 13 measures over 10 public-domain software
defect datasets. The following paragraphs define the performance measures and the
datasets, describe the experimental design, and present the results.
4.1. Performance measures
There is an extensive number of performance measures for classification. These
measures have been introduced for different applications and to evaluate different
things. Commonly used performance measures in software defect classification are
accuracy, precision, recall, F-measure, AUC, and mean absolute error.63,64
Besides these popular measures, this work includes seven other major classification
measures. The definitions of these measures are as follows.

Overall accuracy: Accuracy is the percentage of correctly classified modules.28 It
is one of the most widely used classification performance metrics.

Overall accuracy = (TP + TN) / (TP + FP + FN + TN).

True positive (TP): TP is the number of correctly classified fault-prone modules.
The TP rate measures how well a classifier can recognize fault-prone modules. It is
also called the sensitivity measure.

True positive rate/Sensitivity = TP / (TP + FN).

False positive (FP): FP is the number of nonfault-prone modules that are misclassified
as fault-prone. The FP rate measures the percentage of nonfault-prone modules
that were incorrectly classified.

False positive rate = FP / (FP + TN).


True negative (TN): TN is the number of correctly classified nonfault-prone modules.
The TN rate measures how well a classifier can recognize nonfault-prone modules.
It is also called the specificity measure.

True negative rate/Specificity = TN / (TN + FP).

False negative (FN): FN is the number of fault-prone modules that are misclassified
as nonfault-prone. The FN rate measures the percentage of fault-prone modules
that were incorrectly classified.

False negative rate = FN / (FN + TP).

Precision: This is the percentage of modules classified as fault-prone that actually
are fault-prone.

Precision = TP / (TP + FP).

Recall: This is the percentage of fault-prone modules that are correctly classified.

Recall = TP / (TP + FN).

F-measure: This is the harmonic mean of precision and recall. The F-measure has
been widely used in information retrieval.65

F-measure = (2 × Precision × Recall) / (Precision + Recall).

AUC: ROC stands for receiver operating characteristic, a curve that shows the
trade-off between the TP rate and the FP rate.28 The AUC (area under the ROC
curve) represents the accuracy of a classifier: the larger the area, the better the
classifier.

Kappa statistic (KapS): This is a classifier performance measure that estimates
the similarity between the members of an ensemble in multi-classifier systems.66

KapS = (P(A) − P(E)) / (1 − P(E)),

where P(A) is the accuracy of the classifier and P(E) is the probability that agreement
among classifiers is due to chance,

P(E) = Σ_{k=1}^{c} ( [Σ_{j=1}^{c} Σ_{i=1}^{m} f(i,k) C(i,j)] × [Σ_{j=1}^{c} Σ_{i=1}^{m} f(i,j) C(i,k)] ) / m².

Here m is the number of modules and c is the number of classes. f(i,j) is the actual
probability of module i belonging to class j, so Σ_{i=1}^{m} f(i,j) is the number of
modules of class j. Given a threshold, C(i,j) is 1 if and only if j is the predicted class
for module i obtained from P(i,j); otherwise it is 0.62


Mean absolute error (MAE): This measures how much the predictions deviate
from the true probability. P(i,j) is the estimated probability of module i belonging
to class j, taking values in [0, 1].62

MAE = ( Σ_{j=1}^{c} Σ_{i=1}^{m} |f(i,j) − P(i,j)| ) / (m · c).

Training time: The computational time for training a classification algorithm or
ensemble method.

Test time: The computational time for testing a classification algorithm or ensemble
method.
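The measures above (apart from the two timing measures) can be computed from a confusion matrix and class-probability estimates; the sketch below is one way to do so for the two-class case, using scikit-learn helpers where convenient, and may differ in minor bookkeeping from WEKA's output.

```python
# Compute the classification measures of Sec. 4.1 for a two-class problem.
# y_true and y_pred are 0/1 (1 = fault-prone); p_fp is the estimated P(fault-prone).
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix, roc_auc_score

def defect_measures(y_true, y_pred, p_fp):
    y_true = np.asarray(y_true)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                               # true positive rate / sensitivity
    measures = {
        "Acc": (tp + tn) / (tp + fp + fn + tn),
        "TPR": recall,
        "FPR": fp / (fp + tn),
        "TNR": tn / (tn + fp),                            # specificity
        "FNR": fn / (fn + tp),
        "Precision": precision,
        "Recall": recall,
        "F-measure": 2 * precision * recall / (precision + recall),
        "AUC": roc_auc_score(y_true, p_fp),
        "Kappa": cohen_kappa_score(y_true, y_pred),       # (P(A) - P(E)) / (1 - P(E))
    }
    # MAE between the 0/1 class-membership indicators and the estimated probabilities
    probs = np.column_stack([1 - np.asarray(p_fp), p_fp])
    actual = np.column_stack([1 - y_true, y_true])
    measures["MAE"] = np.abs(actual - probs).mean()
    return measures
```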
4.2. Data sources
The datasets used in this study are 10 public-domain software defect datasets provided
by the NASA IV&V Facility Metrics Data Program (MDP) repository. Brief
descriptions of these MDP datasets are provided by the NASA Web site:

CM1: This dataset is from a science instrument written in C code with approximately
20 kilo-source lines of code (KLOC). It contains 505 modules.
JM1: This dataset is from a real-time C project containing about 315 KLOC. There
are 8 years of error data associated with the metrics, and it has 2012 modules.
KC3: This dataset is about the collection, processing, and delivery of satellite
metadata. It is written in Java with 18 KLOC and has 458 modules.
KC4: This dataset is a ground-based subscription server written in Perl code
containing 25 KLOC with 125 modules.
MC1: This dataset is about a combustion experiment designed to fly on the space
shuttle, written in C and C++ code containing 63 KLOC. There are 23 526 modules.
MW1: This dataset is about a zero gravity experiment related to combustion,
written in C code containing 8 KLOC with 403 modules.
PC1: This dataset is flight software from an earth orbiting satellite that is no
longer operational. It contains 40 KLOC of C code with 1107 modules.
PC2: This dataset is a dynamic simulator for attitude control systems. It contains
26 KLOC of C code with 5589 modules.
PC3: This dataset is flight software from an earth orbiting satellite that is
currently operational. It has 40 KLOC of C code with 1563 modules.
PC4: This dataset is flight software from an earth orbiting satellite that is
currently operational. It has 36 KLOC of C code with 1458 modules.

Though these datasets have different sets of attributes, they share some common
structures. Four attributes (i.e. module id, defect id, priority, and severity) that are
important for the defect classification task exist in all 10 datasets. Module id is a
unique identifier for every individual module. Defect id identifies the types of defects
and is the dependent attribute in software defect classification. In two-class software
defect classification, modules with a nonempty defect id are labeled as fp and modules
with an empty defect id are labeled as nfp.
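As a small illustration of this labeling rule, assuming an MDP product table loaded into pandas with a defect-id column (the column name below is an assumption, since the individual files name their attributes differently):

```python
# Label modules as fault-prone (fp) when the defect id is nonempty, else nonfault-prone (nfp).
import pandas as pd

def label_modules(df: pd.DataFrame, defect_col: str = "DEFECT_ID") -> pd.Series:
    nonempty = df[defect_col].notna() & (df[defect_col].astype(str).str.strip() != "")
    return nonempty.map({True: "fp", False: "nfp"})

# e.g. cm1 = pd.read_csv("CM1.csv"); cm1["class"] = label_modules(cm1)
```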
4.3. Experimental design
The experiment was carried out according to the following process:

Step 1. Prepare datasets: select relevant features.
Step 2. Train and test the classification models on randomly sampled partitions
(i.e. 10-fold cross-validation) using WEKA 3.7.34
Step 3. Collect the evaluation measures of the classification models using WEKA 3.7.
These performance measures are the input data for the pairwise comparisons of
the alternatives (i.e. classification models) in the AHP.
Step 4. Construct a set of pairwise comparison matrices of the classification models
with respect to each performance measure (criterion in the AHP) using Matlab 7.0.
Step 5. Multiply the relative rankings obtained from Step 4 by the priorities of the
performance measures (Table 1) to get the overall priorities for each classification
model (sketches of Steps 2-3 and 4-5 are given below).
END
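A hedged sketch of Steps 2 and 3, collecting cross-validated measures for a pool of models (the paper used WEKA 3.7; the code below reuses the defect_measures helper sketched in Sec. 4.1 and a models dictionary like the one sketched in Sec. 2.4):

```python
# Steps 2-3: 10-fold cross-validated predictions per model, then the Sec. 4.1 measures.
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def evaluate_models(models, X, y, seed=1):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    results = {}
    for name, clf in models.items():
        p_fp = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
        results[name] = defect_measures(y, (p_fp >= 0.5).astype(int), p_fp)
    return results            # one dictionary of measures per model, as in Table 2
```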
As mentioned in Sec. 2, this study compares three ensemble methods (i.e.
AdaBoost, bagging, and stacking) and a selection of 12 classifiers. These learning
models are separated into three groups: AdaBoost, bagging, and others. Each of the
AdaBoost and bagging groups has 12 algorithms, which are generated by applying
AdaBoost and bagging to each of the 12 classifiers, respectively. The third group
has 14 algorithms, including stacking, voting, and the 12 individual classifiers.
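Steps 4 and 5 can be sketched as follows: for each performance measure, a pairwise comparison of the candidate models yields a local priority vector, and the overall priority of a model is the weighted sum of its local priorities, with the weights taken from the Priority column of Table 1. The local priority numbers below are illustrative placeholders only, not values from the experiment; the paper does not spell out how the model-versus-model judgments were derived from the measured values.

```python
# Step 5: combine local priority vectors (one per measure) using the Table 1 weights.
import numpy as np

measure_weights = {"AUC": 0.2513, "Acc": 0.1262, "F-measure": 0.1262, "MAE": 0.1262,
                   "Kappa": 0.0491}                     # ... remaining weights from Table 1

local_priorities = {                                    # Step 4 output: placeholders for
    "AUC": np.array([0.40, 0.35, 0.25]),                # three hypothetical candidate models
    "Acc": np.array([0.30, 0.40, 0.30]),
    "F-measure": np.array([0.35, 0.35, 0.30]),
    "MAE": np.array([0.30, 0.30, 0.40]),
    "Kappa": np.array([0.33, 0.33, 0.34]),
}

overall = sum(w * local_priorities[m] for m, w in measure_weights.items())
ranking = np.argsort(-overall)                          # global ranking of the models
```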
4.4. Experimental results
Table 2 summarizes the average classification results of all learning methods
over the 10 datasets (test datasets) using 10-fold cross-validation. The names of the
classifiers presented in Table 2 follow the names of the classification methods used
in WEKA 3.7. The 13 performance measures described in Sec. 4.1 are collected for
each learning method. The classification algorithm producing the best result for a
specific performance measure is highlighted in boldface; if more than one algorithm
achieves the best result, they are all highlighted in boldface.

Table 2. Classification results of ensembles and classifiers. (For each of the 38 learning
methods, the table reports the 13 measures of Sec. 4.1: area under ROC, overall accuracy,
F-measure, mean absolute error, Kappa statistic, true positive rate, false positive rate,
true negative rate, false negative rate, precision, recall, training time, and test time;
measure values are given in percents. The individual entries are omitted here.)

From Table 2, we observe that the performances of some classifiers on certain
measures are rather close. For instance, bagging of Naïve Bayes tree
(trees.NBTree.Bagging) achieves the best AUC (0.9042), which is considered
an extremely important measure in software defect prediction,2,62 while
boosting of CART (trees.SimpleCart.Adaboost), bagging of the C4.5 decision tree
(trees.J48.Bagging), and bagging of decision table (rules.DecisionTable.Bagging)
produce similar results in terms of the AUC. The second observation is that a
classifier which obtains the best result for a given measure may perform poorly
on other measures. For instance, Bayesian network (Bayes.BayesNet) has the
best FP and TN rates, but performs poorly on F-measure, precision, and recall.
The third observation is that no classifier yields the best result across all 13
measures, which is consistent with Challagulla et al.'s work.64
After the computation of the classification models over the software defect datasets,
the next step is to conduct a set of pairwise comparisons. As Saaty21 pointed out, one
should put decision elements into groups when their number is large. Since there are
38 classification models in the experiment, pairwise comparisons were carried out
in two stages. In the first stage, the classification models are grouped into AdaBoost,
bagging, and others; the others group includes stacking, voting, and the 12 individual
base classifiers. Pairwise comparison is conducted to find the relatively high-ranking
classification models within each group. Second, the top five ranking algorithms
from each group are then compared with each other to get a global ranking of
classification algorithms for software defect prediction.

Table 3. Priorities of AdaBoost classifiers (Group 1).

Algorithms                                   Priorities
trees.SimpleCart.Adaboost                    0.11092
trees.J48.Adaboost                           0.10994
trees.NBTree.Adaboost                        0.10025
rules.JRip.Adaboost                          0.09858
rules.DecisionTable.Adaboost                 0.09497
lazy.IBk.Adaboost                            0.09341
functions.MultilayerPerceptron.Adaboost      0.08028
functions.Logistic.Adaboost                  0.07428
bayes.BayesNet.Adaboost                      0.06831
functions.SMO.Adaboost                       0.06339
functions.RBFNetwork.Adaboost                0.05892
bayes.NaiveBayes.Adaboost                    0.04675

Table 4. Priorities of bagging classifiers (Group 2).

Algorithms                                   Priorities
trees.J48.Bagging                            0.11215
lazy.IBk.Bagging                             0.10731
trees.NBTree.Bagging                         0.10696
trees.SimpleCart.Bagging                     0.10545
rules.JRip.Bagging                           0.09810
functions.MultilayerPerceptron.Bagging       0.09752
rules.DecisionTable.Bagging                  0.09395
functions.Logistic.Bagging                   0.08071
functions.RBFNetwork.Bagging                 0.05279
bayes.BayesNet.Bagging                       0.05246
functions.SMO.Bagging                        0.05239
bayes.NaiveBayes.Bagging                     0.04020


Table 5. Priorities of stacking, voting, and individual classifiers (Group 3).

Algorithms                                   Priorities
meta.Stacking                                0.09551
lazy.IBk                                     0.09531
trees.J48                                    0.08763
meta.Vote                                    0.08494
trees.NBTree                                 0.08475
trees.SimpleCart                             0.07873
functions.MultilayerPerceptron               0.07809
rules.JRip                                   0.07737
rules.DecisionTable                          0.07480
functions.Logistic                           0.07264
bayes.BayesNet                               0.04673
functions.SMO                                0.04445
functions.RBFNetwork                         0.04271
bayes.NaiveBayes                             0.03633

The priorities of the classification methods within each group, obtained using pairwise
comparisons, are summarized in Tables 3-5, respectively. The rightmost column of
Table 3 reports the priorities of the 12 AdaBoost algorithms, each with a different
base classifier. The priorities were calculated following the AHP method described
in Sec. 3. Within the AdaBoost group, CART (trees.SimpleCart.Adaboost), C4.5
(trees.J48.Adaboost), and Naïve Bayes tree (trees.NBTree.Adaboost) are the top-ranked
classifiers. C4.5 (trees.J48.Bagging), K-nearest-neighbor (lazy.IBk.Bagging),
and Naïve Bayes tree (trees.NBTree.Bagging) are the top-ranked classifiers in the
bagging group. For the others group, stacking and K-nearest-neighbor (lazy.IBk)
achieve the highest ranking, followed by C4.5 (trees.J48). The results presented
in Tables 3-5 suggest that the C4.5 decision tree, Naïve Bayes tree, and
K-nearest-neighbor are among the best performers for software defect detection.
Table 6 gives the results of the pairwise comparisons of the top five ranking
classifiers selected from each of the three groups. Among the top five ranking
classifiers in Table 6, three are AdaBoost algorithms and one is a bagging algorithm,
which indicates that ensemble methods can certainly improve the performance of
base classifiers for the task of software defect detection.

Table 6. Priorities of the top five classifiers from each group.

Algorithms                                   Priorities
trees.SimpleCart.Adaboost                    0.07529
trees.J48.Adaboost                           0.07465
trees.J48.Bagging                            0.07029
trees.NBTree.Adaboost                        0.06840
lazy.IBk                                     0.06768
meta.Stacking                                0.06744
rules.JRip.Adaboost                          0.06724
trees.NBTree.Bagging                         0.06707
lazy.IBk.Bagging                             0.06684
trees.SimpleCart.Bagging                     0.06629
rules.DecisionTable.Adaboost                 0.06466
trees.J48                                    0.06213
rules.JRip.Bagging                           0.06157
trees.NBTree                                 0.06026
meta.Vote                                    0.06020

5. Conclusions
Though some previous studies have illustrated that ensemble methods can achieve
satisfactory results in software defect prediction, inconsistencies exist across different
studies, and the performances of learning algorithms may vary using different
performance measures and under different circumstances. Therefore, more research
is needed to improve our understanding of the performance of ensemble algorithms
in software defect prediction. Realizing that experimental results using different
performance measures over different datasets may be inconsistent, this work
introduced the AHP method, a multicriteria decision-making approach, to derive
the priorities of ensemble algorithms for the task of software defect prediction.
An experiment was designed to compare three popular ensemble methods (bagging,
boosting, and stacking) and 12 well-known classification methods using 13
performance measures over 10 public-domain software defect datasets from the
NASA Metrics Data Program (MDP) repository. The experimental results can be
summarized in the following observations:

Ensemble methods can improve the classification results for software defect
prediction in general, and AdaBoost gives the best results.

Tree- and rule-based classifiers perform better in software defect prediction than
the other types of classifiers included in the experiment. In terms of single
classifiers, K-nearest-neighbor (lazy.IBk), C4.5 (trees.J48), and Naïve Bayes tree
(trees.NBTree) ranked higher than the other classifiers.

Stacking and voting can improve classification results and provide relatively
stable outcomes, but the results are not as good as those of AdaBoost and bagging.

The ranking of algorithms may change in different settings of comparisons.
For example, voting (meta.Vote) outranks Naïve Bayes tree (trees.NBTree) in
group three (Table 5), while in the overall comparisons Naïve Bayes tree
(trees.NBTree) ranks higher than voting (meta.Vote) (Table 6). This is due to
the pairwise comparisons conducted by the AHP: when the set of alternative
classifiers changes, the relative ranking of algorithms may change, especially when
the difference between two classifiers is statistically significant.


Acknowledgments
The authors would like to thank the anonymous reviewers for their insightful comments and the NASA MDP for providing the software defect datasets. This research
has been partially supported by grants from the National Natural Science Foundation of China (Nos. 70901011, 70901015, and 70921061).
References
1. NIST Planning Report 02-3, The Economic Impacts of Inadequate Infrastructure for Software Testing (U.S. Department of Commerce's National Institute of Standards & Technology, 2002), <http://www.nist.gov/director/prog-ofc/report02-3.pdf>.
2. S. Lessmann, B. Baesens, C. Mues and S. Pietsch, Benchmarking classification models for software defect prediction: A proposed framework and novel findings, IEEE Transactions on Software Engineering 34(4) (2008) 485–496.
3. J. C. Munson and T. M. Khoshgoftaar, The detection of fault-prone programs, IEEE Transactions on Software Engineering 18(5) (1992) 423–433.
4. T. M. Khoshgoftaar, A. S. Pandya and D. L. Lanning, Application of neural networks for predicting faults, Annals of Software Engineering 1(1) (1995) 141–154.
5. T. M. Khoshgoftaar, E. B. Allen, J. P. Hudepohl and S. J. Aud, Application of neural networks to software quality modeling of a very large telecommunications system, IEEE Transactions on Neural Networks 8(4) (1997) 902–909.
6. T. M. Khoshgoftaar, E. B. Allen, W. D. Jones and J. P. Hudepohl, Classification-tree models of software-quality over multiple releases, IEEE Transactions on Reliability 49(1) (2000) 4–11.
7. T. M. Khoshgoftaar, E. B. Allen and J. Deng, Using regression trees to classify fault-prone software modules, IEEE Transactions on Reliability 51(4) (2002) 455–462.
8. T. M. Khoshgoftaar and N. Seliya, Analogy-based practical classification rules for software quality estimation, Empirical Software Engineering 8(4) (2003) 325–350.
9. T. Menzies, J. DiStefano, A. Orrego and R. Chapman, Assessing predictors of software defects, in Proceedings of Workshop on Predictive Software Models (2004).
10. A. A. Porter and R. W. Selby, Evaluating techniques for generating metric-based classification trees, Journal of Systems and Software 12(3) (1990) 209–218.
11. K. El-Emam, S. Benlarbi, N. Goel and S. N. Rai, Comparing case-based reasoning classifiers for predicting high risk software components, Journal of Systems and Software 55(3) (2001) 301–310.
12. K. Ganesan, T. M. Khoshgoftaar and E. B. Allen, Case-based software quality prediction, International Journal of Software Engineering and Knowledge Engineering 10(2) (2000) 139–152.
13. K. O. Elish and M. O. Elish, Predicting defect-prone software modules using support vector machines, Journal of Systems and Software 81(5) (2008) 649–660.
14. Y. Peng, G. Kou, G. Wang, H. Wang and F. Ko, Empirical evaluation of classifiers for software risk management, International Journal of Information Technology and Decision Making 8(4) (2009) 749–768.
15. Y. Peng, G. Wang and H. Wang, User preferences based software defect detection algorithms selection using MCDM, Information Sciences (2010), doi: 10.1016/j.ins.2010.04.019.
16. L. Guo, Y. Ma, B. Cukic and H. Singh, Robust prediction of fault-proneness by random forests, in Proceedings of the 15th International Symposium on Software Reliability Engineering (2004).


17. N. E. Fenton and M. Neil, A critique of software defect prediction models, IEEE Transactions on Software Engineering 25(5) (1999) 675–689.
18. I. Myrtveit and E. Stensrud, A controlled experiment to assess the benefits of estimating with analogy and regression models, IEEE Transactions on Software Engineering 25(4) (1999) 510–525.
19. I. Myrtveit, E. Stensrud and M. Shepperd, Reliability and validity in comparative studies of software prediction models, IEEE Transactions on Software Engineering 31(5) (2005) 380–391.
20. M. Shepperd and G. Kadoda, Comparing software prediction techniques using simulation, IEEE Transactions on Software Engineering 27(11) (2001) 1014–1022.
21. T. L. Saaty, The Analytic Hierarchy Process: Planning, Priority Setting, Resource Allocation (McGraw-Hill, Columbus, OH, 1980).
22. M. Chapman, P. Callis and W. Jackson, Metrics Data Program, NASA IV and V Facility, http://mdp.ivv.nasa.gov/, 2004.
23. T. G. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization, Machine Learning 40(2) (2000) 139–157.
24. Y. Freund and R. E. Schapire, Experiments with a new boosting algorithm, in Proc. 13th International Conference on Machine Learning (Morgan Kaufmann, San Francisco, 1996), pp. 148–156.
25. T. G. Dietterich, Machine learning research: Four current directions, AI Magazine 18 (1997) 97–136.
26. K. M. Ting and Z. Zheng, A study of AdaBoost with Naïve Bayesian classifiers: Weakness and improvement, Computational Intelligence 19(2) (2003) 186–200.
27. T. Wilson, J. Wiebe and R. Hwa, Recognizing strong and weak opinion clauses, Computational Intelligence 22(2) (2006) 73–99.
28. J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd edn. (Morgan Kaufmann, 2006).
29. D. Opitz and R. Maclin, Popular ensemble methods: An empirical study, Journal of Artificial Intelligence Research 11 (1999) 169–198.
30. T. G. Dietterich, Ensemble methods in machine learning, in J. Kittler and F. Roli (eds.), First International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science, Vol. 1857 (Springer-Verlag, New York, 2000b), pp. 1–15.
31. E. Bauer and R. Kohavi, An empirical comparison of voting classification algorithms: Bagging, boosting, and variants, Machine Learning 36(1/2) (1999) 105–139.
32. L. Breiman, Bagging predictors, Machine Learning 24(2) (1996) 123–140.
33. L. Breiman, Heuristics of instability in model selection, The Annals of Statistics 24(6) (1994) 2350–2383.
34. I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. (Morgan Kaufmann, San Francisco, 2005).
35. R. Schapire, The strength of weak learnability, Machine Learning 5(2) (1990) 197–227.
36. Y. Freund and R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1997) 119–139.
37. D. H. Wolpert, Stacked generalization, Neural Networks 5 (1992) 241–259.
38. Y. Peng, G. Kou, Y. Shi and Z. Chen, A descriptive framework for the field of data mining and knowledge discovery, International Journal of Information Technology and Decision Making 7(4) (2008) 639–682.
39. L. Hansen and P. Salamon, Neural network ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (1990) 993–1001.


40. A. Krogh and J. Vedelsby, Neural network ensembles, cross validation, and active learning, in G. Tesauro, D. Touretzky and T. Leen (eds.), Advances in Neural Information Processing Systems, Vol. 7 (MIT Press, Cambridge, MA, 1995), pp. 231–238.
41. L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees (Wadsworth International Group, Belmont, California, 1984).
42. R. Kohavi, Scaling up the accuracy of Naïve Bayes classifiers: A decision tree hybrid, in E. Simoudis, J. W. Han and U. Fayyad (eds.), Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (AAAI Press, Portland, OR, 1996), pp. 202–207.
43. J. R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann, 1993).
44. S. Le Cessie and J. C. Houwelingen, Ridge estimators in logistic regression, Applied Statistics 41(1) (1992) 191–201.
45. C. M. Bishop, Neural Networks for Pattern Recognition (Oxford University Press, 1995).
46. J. C. Platt, Fast training of support vector machines using sequential minimal optimization, in B. Schölkopf, C. J. C. Burges and A. Smola (eds.), Advances in Kernel Methods: Support Vector Learning (MIT Press, 1998), pp. 185–208.
47. V. N. Vapnik, The Nature of Statistical Learning Theory (Springer, New York, USA, 1995).
48. P. Domingos and M. Pazzani, On the optimality of the simple Bayesian classifier under zero-one loss, Machine Learning 29(2–3) (1997) 103–130.
49. S. M. Weiss and C. A. Kulikowski, Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning and Expert Systems (Morgan Kaufmann, 1991).
50. B. V. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques (IEEE Computer Society Press, 1991).
51. R. Kohavi, The power of decision tables, in N. Lavrac and S. Wrobel (eds.), Proceedings of the Eighth European Conference on Machine Learning (Springer-Verlag, Iraklion, Crete, Greece, 1995), pp. 174–189.
52. W. W. Cohen, Fast effective rule induction, in Proceedings of the Twelfth International Conference on Machine Learning (Morgan Kaufmann, 1995), pp. 115–123.
53. T. L. Saaty, How to make a decision: The analytic hierarchy process, European Journal of Operational Research 48 (1990) 9–26.
54. T. L. Saaty and M. Sagir, Extending the measurement of tangibles to intangibles, International Journal of Information Technology & Decision Making 8(1) (2009) 7–27.
55. T. L. Saaty, Decision making with the analytic hierarchy process, International Journal of Services Sciences 1(1) (2008) 83–98.
56. T. L. Saaty, A scaling method for priorities in hierarchical structures, Journal of Mathematical Psychology 15(3) (1977) 234–281.
57. F. Zahedi, The analytic hierarchy process: A survey of the method and its applications, Interfaces 16(4) (1986) 96–108.
58. W. Ho, Integrated analytic hierarchy process and its applications: A literature review, European Journal of Operational Research 186(1) 211–228.
59. K. Sugihara and H. Tanaka, Interval evaluations in the analytic hierarchy process by possibility analysis, Computational Intelligence 17(3) (2001) 567–579.
60. H. Li and L. Ma, Ranking decision alternatives by integrated DEA, AHP and Gower plot techniques, International Journal of Information Technology & Decision Making 7(2) (2008) 241–258.


61. D. K. Despotis and D. Derpanis, A min–max goal programming approach to priority derivation in AHP with interval judgments, International Journal of Information Technology & Decision Making 7(1) (2008) 175–182.
62. C. Ferri, J. Hernandez-Orallo and R. Modroiu, An experimental comparison of performance measures for classification, Pattern Recognition Letters (2009) 27–38.
63. C. Mair, G. Kadoda, M. Lefley, L. Phalp, K. Schofield, M. Shepperd and S. Webster, An investigation of machine learning based prediction systems, Journal of Systems and Software 53(1) (2000) 23–29.
64. V. U. B. Challagulla, F. B. Bastani, I. Y. Raymond and A. Paul, Empirical assessment of machine learning based software defect prediction techniques, International Journal on Artificial Intelligence Tools 17(2) (2008) 389–400.
65. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval (Addison Wesley, 1999).
66. L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms (Wiley, 2004).

