
Innovations in Systems and Software Engineering

https://doi.org/10.1007/s11334-021-00427-1

ORIGINAL ARTICLE

3PcGE: 3-parent child-based genetic evolution for software defect prediction
Somya Goyal1

Received: 31 May 2021 / Accepted: 21 November 2021


© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022

Abstract
Software defect prediction (SDP) is an active research area in the software industry aimed at enhancing the quality of software products. SDP classifiers predict the fault-prone modules in early development phases, before the testing phase begins, so that testing effort can be focused on the predicted fault-prone modules. In this way, the early detection of fault-prone modules increases the chances of releasing error-free products to clients with reduced testing effort and cost. For SDP applications, which use voluminous high-dimensional data, feature selection (FS) has become an essential data preprocessing technique. Over the past three decades, search-based feature selection has been prominently deployed to improve the efficiency of predictors. This paper proposes a new approach, namely 3PcGE, for feature selection based on three-parent child (3Pc) and genetic evolution (GE). 3PcGE is inspired by evolutionary computation involving the three-parent biological evolution process, which yields an offspring with the best survival capability. The 3Pc process separates the spindle from the mother's cell body, which has defective mitochondria, and places the separated spindle in the emptied donor cell body, which has healthy mitochondria. In this way, a 3-parent child is healthier than a 2-parent child and free from fatal disease. 3PcGE searches the feature space for an optimal feature subset using the classification performance and the number of selected features as the fitness function. FS is modeled as a multi-objective optimization problem, and a Pareto-optimal solution is sought using an evolutionary algorithm (3PcGE). The performance is compared with state-of-the-art FS techniques. The experimental results show that the proposed 3PcGE outperforms the competing filter-based FS techniques by 18.98% and wrapper-based FS techniques by 17.5% in the AUC measure. The statistical comparison with the baseline technique (NSGA-II) shows that the proposed FS technique 3PcGE is effective in selecting optimal features and results in better accuracy of SDP models.

Keywords Software defect prediction (SDP) · Feature selection (FS) · Three-parent child · Search-based software engineering
(SBSE) · Multi-objective evolutionary algorithm (MOEA) · Genetic algorithm (GA) · Artificial neural network (ANN) ·
Accuracy

Somya Goyal, somyagoyal1988@gmail.com
1 Manipal University Jaipur, Jaipur, Rajasthan 303007, India

1 Introduction

Software development comprises a broad range of diverse tasks and processes to engineer a good-quality software product as per the specified requirements within time and cost constraints. During these processes, errors are inevitably introduced in the product. Testing is the phase dedicated to detecting errors before they become defects in the final product. More than 70% of the total development cost (or time) is consumed in the testing phase, yet it cannot be ensured that the product is 100% error-free [36]. Testing resources are scarce, but the testing phase is crucial to keep defects from passing through deployment into the final product. It is desirable to focus the testing resources on the high-risk components by early identification of fault-prone modules and to perform testing effectively at optimum cost.

Software defect prediction (SDP) has proved very effective in identifying potentially faulty modules prior to the testing phase and in allowing efficient allocation of testing resources. Porter and Selby [37] cast the problem of early identification of high-risk components as a classification problem and successfully devised measurement-based models for forecasting high-risk, error-prone components. Khoshgoftaar [24] investigated software quality models to correct software faults early in development, which is highly effective and inexpensive. The work devised software quality classification models as a valuable guide for cost-effective software quality improvement in the end product.

The past three decades of literature show that SDP has been successfully formulated as a learning problem and approached with machine learning (ML) algorithms [5, 9, 15, 26, 40, 45].

For the SDP process, features (or metrics) are extracted from the software repository to detect the fault-proneness of the software modules. These features should be highly correlated with software quality and reliability. A few of the extracted features can be redundant or irrelevant and can adversely affect the performance of the ML classifier used for SDP. This problem is called the curse of dimensionality. Feature selection (FS) is one of the potentially active solutions to it [1, 6, 10, 20, 23, 25, 38, 48]. Feature selection reduces the number of features and consequently increases the simplicity of the SDP classifier, which then operates at a higher speed. A model built using a reduced feature set is more understandable and has a lower cost of metrics measurement [41].

FS is a complex combinatorial problem which falls in the NP-hard class. It is next to impossible to conduct an exhaustive search when a significant number of features is present in software data. To solve this, evolutionary algorithms are widely deployed to seek an optimal subset of features. In the literature, many researchers have successfully employed search-based software engineering (SBSE) [18, 19] for FS in SDP and made huge progress over classical (filter-based and wrapper-based) FS techniques [2, 14, 21, 27–29, 46]. The literature suggests that research studies formulate FS as a single-objective optimization problem to maximize the performance of the SDP classifier.

1.1 Motivation

Although a large number of studies exist in the field of SDP, there is still a lack of studies formulating FS in SDP as a multi-objective optimization problem to simulate the real-world situation of the software industry. Only two studies from the literature over the span of more than three decades [4, 35] attempted to model SDP as a multi-objective optimization problem, which is a very small number to rely upon, and more investigative studies are needed to better understand the perspective of MOEA for FS in SDP. Further, these studies are limited in scope because their results rest on the following weaknesses. First, limited datasets have been used for model development without any justification for selecting only those. More datasets should be considered as benchmark datasets to demonstrate the generalization ability of the proposed method. The studies neglected the most prevalent NASA datasets. Second, both studies neglected the use of an artificial neural network (ANN)-based ML classifier for their investigation. However, ANN is the most powerful SDP classifier as the literature suggests. Third, their work lacks symmetry in results due to the evaluation criteria used to compare the performance of SDP models on class-imbalanced datasets. Fourth, the work is just an application of the genetic algorithm as originally defined; hence, the studies do not devise any tweak or novelty in the working algorithms. Fifth, they have not conducted any statistical analysis of the data to select the statistical test for comparison among the models.

My work addresses these five issues, and I propose a new approach, namely 3PcGE, to select features for the SDP classification problem. It is based upon the three-parent child process [3, 49]. 3PcGE is an improvement over genetic evolution adapted for the feature selection problem; hence, 3PcGE uses all the characteristics of genetic evolution after the production of the 3-parent child. 3Pc ensures the convergence of the proposed algorithm to the global optimum and hence improves the genetic evolution.

I pursue SBSE with the MOEA approach [39] to formulate the FS issue over cost–benefit analysis. I formulate two conflicting objectives: one based on the benefit aspect and the other based on the cost aspect. Hence, the Pareto-optimal algorithm NSGA-II [7] is adapted with the 3Pc concept to build a wrapper-based feature selector. For this study, I expand the research and analysis over a wide range of datasets including both NASA and PROMISE repositories, which are the most popular data repositories. I use the five most prominent classification algorithms in the field of SDP, namely ANN, decision tree (DT), Naïve Bayes (NB), support vector machine (SVM) and k-nearest neighbor (KNN). K-fold cross-validation is used with k = 10. I use a wide range of evaluation criteria including accuracy, precision, recall, F-measure, receiver operating characteristic (ROC) curve, area under the curve (AUC) and confusion matrix to compare the performance of the proposed algorithm and the baseline FS techniques. The hypervolume indicator is used to compare the proposed model with the baseline model. It is statistically verified, using the paired t-test after statistical analysis of the data used, that the proposed algorithm 3PcGE performs better than the baseline FS technique.

1.2 Objective

The following objectives are set to steer the research work:

• To devise a novel algorithm 3PcGE to accurately classify the software modules as fault-prone or not using a fresh concept of three-parent child in the SDP field.
• To select an optimal feature subset with minimum computational cost using 3PcGE along with its generalization over a wide range of datasets.
• To ensure that 3PcGE is effective in improving the performance of the SDP classifier as a feature selector.
• To compare the impact of the proposed 3PcGE with baseline models.
• To validate that the proposed 3PcGE performs statistically better than competing algorithms.

1.3 Findings

In this paper, I propose a new FS technique which dynamically selects the features using the 3PcGE global optimization technique and optimizes the performance of machine learning (ML)-based SDP classifiers. To the best of my knowledge, three-parent child (3Pc) has not been used in SDP. The findings of this paper can be summarized as follows. First, I devise a novel 3PcGE algorithm and investigate the impact of FS in SDP using 3PcGE by adapting the genetic evolution process with three-parent child production. 3PcGE is implemented by adapting the Pareto multi-objective optimization evolutionary algorithm NSGA-II. Then, the performance of 3PcGE is compared with the baseline NSGA-II model to confirm that 3PcGE performs better than NSGA-II with lower computational cost. The experimental results show that 3PcGE performs statistically the best and returns a minimal feature set with the most accurate classification.

1.4 Contribution

The major contributions of this work are summarized as follows:

• To the best of my knowledge, no study has so far used three-parent child (3Pc)-based genetic evolution for FS in SDP. Hence, this paper can be a good supplement to the progress already made in this field.
• This paper formulates FS in SDP as a multi-objective optimization problem and solves it using a Pareto-based evolutionary algorithm adapted to three-parent child genetic evolution.
• I propose a novel 3PcGE technique adapting the NSGA-II algorithm.
• This study investigates the impact of the proposed 3PcGE technique on FS by comparing its performance with the baseline model (NSGA-II) based on the hypervolume (HV) indicator over datasets from NASA repositories.
• I compare the performance of the proposed 3PcGE technique with state-of-the-art FS over recall, precision, ROC, AUC, F1 measure and accuracy using datasets from NASA repositories.

1.5 Organization of work

The rest of the paper is organized as follows. Section 2 describes the research methodology along with the research questions and the proposed algorithm 3PcGE in detail. Section 3 covers the experimental setup, including dataset, evaluation criteria and experimental design. Section 4 reports the results of experiments and discussions. The paper is concluded in Sect. 5 with remarks on future work.

2 Methodology

This section highlights the research methodology followed in this paper and gives a detailed description of the 3PcGE algorithm along with the problem formulation and the three-parent child concept.

2.1 Process model

The proposed process model is shown in Fig. 1. First, the dataset is divided into a training dataset and a testing dataset by random selection in the ratio 70–30%, respectively. Second, feature selection techniques are applied to the training dataset. A selected feature subset is obtained, which is then used to reduce the dimensions of the training dataset and the testing dataset separately. Third, classification algorithms are applied, and SDP models are constructed using the reduced training dataset. Finally, the reduced testing dataset is used to evaluate the performance of the constructed model. The literature advocates that FS techniques should be applied only to the training dataset to avoid bias [23, 31, 44].

Fig. 1 Process model for the research work

3 Research questions

The following research questions are designed to steer the research work toward the desired objectives and to verify the effectiveness of the proposed algorithm 3PcGE.

RQ1: Is the proposed 3PcGE algorithm, based on the three-parent child concept, suitable for feature selection in SDP?

RQ2: Is the proposed 3PcGE competitive among other Pareto-based multi-objective optimization algorithms which are deployed as FS in SDP?

RQ3: Does the proposed 3PcGE have an advantage over the state-of-the-art FS techniques in selecting an optimal feature subset with better model accuracy?

3.1 Proposed 3PcGE algorithm

The proposed 3PcGE is a search-based heuristic optimization technique. It utilizes the evolutionary process for producing a 3-parent child together with a genetic algorithm. The objective of the 3-parent child is to avoid mitochondrial disease. A mitochondrion is an organelle residing within a cell and containing DNA. If there is a mutation in that DNA, it can cause fatal disease. This technique swaps the defective mitochondria of the woman's cell with those of a donor [3, 49]. In this way, it produces a 3-parent child with better survival possibility and fitter progeny. This work adapts the 3-parent child (3Pc) approach (shown in Fig. 2) and non-dominated sorting genetic algorithm (NSGA-II)-based genetic evolution (GE) to select optimal features for SDP. The pseudo-code for the proposed 3PcGE algorithm is given as a listing in Algorithm_1.

Fig. 2 Three-parent child approach

3.1.1 Problem formulation

The FS problem is to find the most relevant, non-redundant and minimal subset of features to improve the accuracy of the SDP classifier. This paper formulates the FS problem as a combinatorial optimization problem. The feature space is encoded as a population where each vector is N-dimensional (N: total number of input features) and represented as a chromosome member of some specific population, denoted by $X_j$:

$$X_j = (x_{j1}, x_{j2}, \ldots, x_{jN}), \quad j = 1, \ldots, p_{Size}$$

where N denotes the total number of input features and $p_{Size}$ denotes the size of the population, i.e., the number of chromosomes in a population at the current generation. $x_{ji} \in \{0, 1\}$, where 0 denotes non-selection and 1 denotes selection of the ith gene (feature) of the jth solution chromosome (feature vector).

(Example: suppose there are 7 features {f1, f2, f3, f4, f5, f6, f7} and the initial population is {1001000, 0010010, 1011010}. This implies that the initial population is of size 3 (3 chromosomes). Each chromosome corresponds to a 7-dimensional feature vector. The respective feature subsets are {f1, f4}, {f3, f6} and {f1, f3, f4, f6}.)

The feature space is of size $2^N$, and exhaustive search is computationally very expensive. With a significant number of features, it may become impossible to search the entire solution space with restricted resources. Hence, survival of the fittest is followed, using a specific fitness function over a population of limited size for a specified maximum number of generations. The decision variables (genes) evolve over successive generations to better adapt to the environment, which is equivalent to better accuracy and lower cost. From the next evolved generation, the fittest chromosome is selected to breed a new generation of fitter offspring.
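The binary encoding in the example can be checked with a short Python sketch. The helper name `decode` and the feature labels are illustrative assumptions, not part of the paper's MATLAB implementation.

```python
from typing import List


def decode(chromosome: str, names: List[str]) -> List[str]:
    """Map a bit-string chromosome to the feature subset it selects.

    Hypothetical helper: gene i equal to '1' means feature i is selected.
    """
    if len(chromosome) != len(names):
        raise ValueError("chromosome length must equal the number of features")
    return [name for bit, name in zip(chromosome, names) if bit == "1"]


# The 7-feature example from the text.
features = [f"f{i}" for i in range(1, 8)]
population = ["1001000", "0010010", "1011010"]
subsets = [decode(c, features) for c in population]

# Size of the search space for N = 7 features (2^N candidate subsets).
search_space_size = 2 ** len(features)
```

Even for this toy case the search space holds 128 candidate subsets, which illustrates why exhaustive search becomes impractical as N grows.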

In this way, population-based learning occurs, and optimized solutions are obtained. I propose two fitness values for each chromosome using cost–benefit analysis. One fitness value is derived from the cost perspective: the size of the selected feature subset, i.e., the number of features selected (the feature subset {f1, f5} returns a fitness value of 2). The second fitness value is from the benefit perspective: the accuracy of the SDP classifier using the corresponding feature subset (for example, data with feature subset {f1, f5} fed to the SDP classifier returns a fitness value of 88% model accuracy). The objective functions are shown in (1) and (2):

$$\text{Min} \sum_{i} x_{ji}, \quad \text{where } f(x_j) = \{x_{j1}, x_{j2}, \ldots, x_{jn} \mid x_{ji} \in \{0, 1\}\} \qquad (1)$$

$$\text{Max AUC of the SDP classifier for the current } f(x) \qquad (2)$$

The fitness function for 3PcGE follows Pareto optimization, as it maximizes classification performance and minimizes the number of selected features. The two objective functions from the cost–benefit perspective are conflicting. The fitness function $\Phi(j)$ is given as (3):

$$\Phi(j) = k \cdot \acute{R}(j) + (1 - k)\,\frac{\sum_i x_i}{N} \qquad (3)$$

where k denotes the selective pressure and $\acute{R}(j)$ denotes the ranking of the jth chromosome on the basis of the accuracy of the SDP classifier using the jth chromosome's feature subset. $\sum_i x_i$ denotes the number of features selected in the jth chromosome.

The generation of the three-parent child-based population $p'_i$ from $p_i$ occurs through mitochondrial changes in the chromosomes of population $p_i$, as happens in biological synthesis. Some random bits of the feature vector $f_j(x)$ of $p_i$ (the spindle of the chromosome) are kept reserved, and the rest of the bits are changed (the replacement of defective mitochondria with healthy ones from the donor). Mathematically, this is done by randomly generating a donor population $p_d$ of the same size as $p_i$ (the original population). The jth chromosome of population $p_d$ is an N-sized vector $f_d(x_i) = \{x_{j1}, x_{j2}, \ldots, x_{jn} \mid x_{ji} \in \{-1, 0, 1\}\}$. The value '0' in donor chromosomes represents the 'removed spindle of the donor cell' and {−1, +1} represents the 'healthy mitochondria' of the donor cell. For each original chromosome (vector), a new random vector (donor chromosome) is generated and added to the original vector to produce a new three-parent child.

Example: Consider the ith generation; p denotes the original population; the population size is 3, meaning there are 3 chromosomes (feature vectors) in each population; each chromosome has 5 genes denoting 5 input features.
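A minimal Python sketch of three-parent child generation for a population of 3 chromosomes with 5 genes each is given here. The gene values of `p` and `p_d` are hypothetical, chosen only for illustration, and the function names are assumptions. The sketch follows the paper's description: a donor vector with genes in {−1, 0, +1} is added element-wise to the original chromosome and the result is truncated to the bounds 0 and 1.

```python
import random
from typing import List

Chromosome = List[int]


def make_donor(n_genes: int, rng: random.Random) -> Chromosome:
    """Random donor chromosome: 0 = removed spindle, -1/+1 = healthy mitochondria."""
    return [rng.choice((-1, 0, 1)) for _ in range(n_genes)]


def three_parent_child(original: Chromosome, donor: Chromosome) -> Chromosome:
    """Element-wise sum of original and donor genes, truncated to [0, 1]."""
    return [min(1, max(0, x + d)) for x, d in zip(original, donor)]


# Hypothetical original population p and donor population p_d (3 x 5).
p = [[1, 0, 1, 0, 0],
     [0, 1, 1, 0, 1],
     [1, 1, 0, 0, 0]]
p_d = [[0, 1, -1, 0, 1],
       [-1, 0, 0, 1, 0],
       [0, -1, 1, 0, 0]]

# Three-parent child population: one child per original chromosome.
p_child = [three_parent_child(x, d) for x, d in zip(p, p_d)]
```

In an actual run the donor population would be drawn randomly, for instance with `make_donor(5, random.Random(seed))`, rather than fixed by hand as above.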

Mathematically, the spindle of the donor cell is removed by setting the gene value to '0' in donor chromosomes, and the healthy mitochondria are retained by setting the gene value to {−1, +1} in donor chromosomes. The spindle of the original cell is transplanted into the donor cell, and mitochondrial changes are reflected by adding the corresponding vectors element-wise as in (4), truncating the result with 0 as the lower bound and 1 as the upper bound.

$$\text{Three-parent child population: } p + p_d \rightarrow p' \qquad (4)$$

Example: the original population p and the donor population $p_d$ are added following element-wise matrix addition. $p'$ is the resulting 3-parent child population of size 3, with each chromosome having 5 genes.

The selection of the fittest chromosome is proposed using a roulette wheel with ranking based on the fitness function $\Phi(j)$. It is done as 'one-to-one survival of the fittest', which involves the comparison of each trial vector (chromosome) against its corresponding target vector (chromosome), as given in (5). Mathematically, the vector $X_j(g + 1)$ in the (g + 1)th generation is obtained from the trial vector $X_j(g)$ and the target vector $X_{j\_new}(g)$ from the gth generation.

$$X_j(g+1) = \begin{cases} X_{j\_new}(g) & \text{if } \Phi\big(X_{j\_new}(g)\big) \ge \Phi\big(X_j(g)\big) \\ X_j(g) & \text{otherwise} \end{cases} \qquad (5)$$
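The fitness function (3) and the one-to-one survival rule (5) can be sketched in Python as follows. This is an illustration under stated assumptions: the accuracy-based ranking is represented by a hypothetical normalized score in [0, 1] (higher is better), the selective pressure k = 0.8 is an arbitrary example value, and Eq. (3) is implemented exactly as printed.

```python
from typing import Callable, Sequence


def fitness(rank_score: float, chromosome: Sequence[int], k: float) -> float:
    """Eq. (3) as printed: k * R(j) + (1 - k) * (selected features / N).

    rank_score is an assumed normalized accuracy-based ranking in [0, 1].
    """
    n = len(chromosome)
    return k * rank_score + (1.0 - k) * sum(chromosome) / n


def survive(current: Sequence[int],
            candidate: Sequence[int],
            phi: Callable[[Sequence[int]], float]) -> Sequence[int]:
    """Eq. (5): keep the candidate only if its fitness is at least the current one."""
    return candidate if phi(candidate) >= phi(current) else current


# Hypothetical rank scores for two chromosomes (not taken from the paper).
rank_scores = {(1, 0, 1, 0, 0): 0.6, (1, 1, 0, 0, 1): 0.9}


def phi(chromosome: Sequence[int]) -> float:
    return fitness(rank_scores[tuple(chromosome)], chromosome, k=0.8)


current = [1, 0, 1, 0, 0]
candidate = [1, 1, 0, 0, 1]
next_gen_member = survive(current, candidate, phi)
```

With these example values the candidate has the higher fitness (about 0.84 against 0.56), so it replaces the current chromosome in the next generation.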


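Algorithm_1 relies on two operations taken from NSGA-II: non-dominated ranking into Pareto fronts and crowding distance within a front. A minimal Python sketch of both for the two objectives used here (minimize the number of features, maximize AUC) is shown here. The function names and the example objective values are assumptions for illustration; this is a simple quadratic-time version, not the author's implementation.

```python
from typing import Dict, List, Tuple

# Each solution is (n_features, auc): minimize the first, maximize the second.
Objectives = Tuple[int, float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on both objectives and better on at least one."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    better = a[0] < b[0] or a[1] > b[1]
    return no_worse and better


def non_dominated_sort(points: List[Objectives]) -> List[List[int]]:
    """Return Pareto fronts F1, F2, ... as sorted lists of indices into points."""
    remaining = set(range(len(points)))
    fronts: List[List[int]] = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


def crowding_distance(points: List[Objectives], front: List[int]) -> Dict[int, float]:
    """Crowding distance of each member of one front (boundary points get infinity)."""
    dist = {i: 0.0 for i in front}
    for m in range(2):  # loop over the two objectives
        ordered = sorted(front, key=lambda i: points[i][m])
        lo, hi = points[ordered[0]][m], points[ordered[-1]][m]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (points[nxt][m] - points[prev][m]) / (hi - lo)
    return dist


# Hypothetical (number of features, AUC) values for five chromosomes.
points = [(2, 0.70), (3, 0.80), (5, 0.85), (4, 0.75), (6, 0.80)]
fronts = non_dominated_sort(points)
distances = crowding_distance(points, fronts[0])
```

Members of earlier fronts are preferred when filling the next generation, and crowding distance breaks ties inside the last front that only partly fits.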
In Algorithm_1, Step 1 and Step 2 are initialization steps that generate the initial population randomly. The function initialPop($P_{Size}$) generates an initial population of size $P_{Size}$. Each member chromosome is a binary string of N bits set randomly to 0/1. Step 3 to Step 22 are repeated until the maximum number of generations ($N_{Gen}$) is reached.

Step 5 and Step 6 are repeated for each chromosome in the current population to generate the corresponding three-parent child. Then, the three-parent child is added to the current population, doubling its size. This is the most crucial step for introducing diversity and overcoming the limitation of NSGA-II. It exposes fitter solutions to the search. In each iteration, elitism allows only the highly fit chromosomes to reproduce; hence, within a few generations, the entire population is filled with similar solutions of similar fitness. Elitism ensures that high fitness passes on to the next generation, but if it is immediately followed by crossover, as is the case in NSGA-II, it endangers the algorithm's chance of reaching the global optimum. The three-parent child population exhibits diversity and ensures the escape from the local optimum trap. Step 8 updates the current population by including the three-parent population in it. This results in 2*$P_{Size}$ chromosomes in the updated population.

Step 9 is the genetic evolution of chromosomes, similar to biological evolution. It involves the random selection of pairs of chromosomes as parents from the population (a mix of two-parent and three-parent), the crossover of parents to generate two new offspring for increased exploitation in the solution subspace, and the mutation of newly born offspring to explore the global search space. The function Crossover_and_Mutation($p_i$) is used to perform the genetic operators (selection, crossover and mutation) on the population $p_i$. Step 10 adds this new population $C_i$ (comprising two-parent and three-parent children) to form the population $b_i$ of 4*$P_{Size}$ chromosomes.

Step 11 and Step 12 compute the fitness function for $b_i$ considering the conflicting objectives and produce Pareto fronts F using non-dominated search while ranking as per the fitness. This involves training and validation of SDP classifiers (mentioned in subsequent Sect. 3.3) using the feature subset represented by $b_i$. The performance is evaluated using the AUC measure. The function Rank_And_Select($b_i$) encapsulates the non-dominated ranking of each member chromosome of $b_i$ and their assignment to Pareto fronts $F_1$, $F_2$, $F_3$, and so on.

Step 13 initializes the next population $p_{i+1}$, taking only the fittest solutions of the previous population $p_i$.

Step 15, Step 16 and Step 17 are repeated until the next generation currently being populated, $p_{i+1}$, contains exactly $P_{Size}$ chromosomes, which are the fittest from $p_i$. Step 19 and Step 20 ensure that the size of the new generation is $P_{Size}$. The function Crowding distance assignment($F_j$) is used to rank the chromosomes in the same Pareto front to select the next member of the new population in case $p_{i+1}$ is lacking members to reach $P_{Size}$.

Step 21 updates the generation counter till $N_{Gen}$. The loop repeats until the termination criterion is reached and the algorithm has converged to a stable optimal solution. Ultimately, the globally optimal solution is returned in Step 23 as the fittest chromosome of the current population, with genes showing 0/1 for non-selection and selection of the corresponding feature.

The overall flow of the proposed algorithm 3PcGE is given as a flow chart in Fig. 3.

Fig. 3 Flow chart of proposed 3PcGE algorithm

In 3PcGE, the search for the globally optimal solution occurs through two phases, namely exploration and exploitation. Exploration is achieved by generating a diverse population, and exploitation is achieved through elitism. The breeding of new offspring involves the selection of parents followed by genetic operations like crossover and mutation. Then, fitness is computed to find the fitter
offspring that make up the next generation. The non-dominated search and crowding distance are special features taken from NSGA-II. Convergence to a globally optimal solution is achieved by generating the three-parent child (3Pc) population along with the selected fittest survivors from the preceding population.

3.2 Classifiers

In this study, I have used the five most popular classification algorithms, namely Naïve Bayes (NB), k-nearest neighbor (KNN), decision trees (DT), artificial neural networks (ANN) and support vector machine (SVM) [8, 13, 50, 51].

The Naïve Bayes classifier performs classification using probability theory from statistics. Bayes' rule is applied to predict whether a module is buggy or not. It predicts that the test sample data point belongs to the class which has the highest posterior probability for that data point. Suppose that, for the defect prediction problem, vector x denotes the attribute set and y is a set with two elements {buggy, clean}, denoting the classes to which each data point uniquely belongs. The Naïve Bayes classifier predicts that a specific module with attribute vector x belongs to the 'buggy' class only if Eq. (6) is satisfied. Otherwise, it predicts that the module belongs to the 'clean' class.

$$P(\text{buggy} \mid x) \ge P(\text{clean} \mid x) \qquad (6)$$

In Eq. (6), P(buggy|x) denotes the posterior probability of class buggy after having seen x, and P(clean|x) denotes the posterior probability of class clean after having seen x. Equation (6) shows that, for a two-class classification problem, whichever class has the highest posterior probability will be predicted by the classifier for the given x. The posterior probability for any class can be computed using Bayes' rule as given in Eq. (7.a). Equation (7.a) can be rewritten as Eq. (7.b) for class buggy and as Eq. (8) for class clean.

$$\text{Posterior} = \frac{\text{Prior} \times \text{Likelihood}}{\text{Evidence}} \qquad (7.a)$$

$$P(\text{buggy} \mid x) = \frac{p(x \mid \text{buggy}) \times P(\text{buggy})}{p(x)} \qquad (7.b)$$

where p(x|buggy) denotes the class likelihood of x, i.e., the probability of seeing x as input when it is known to belong to the buggy class, and P(buggy) is the prior probability of the buggy class; the priors satisfy inequality (9) and Eq. (10).

$$P(\text{clean} \mid x) = \frac{p(x \mid \text{clean}) \times P(\text{clean})}{p(x)} \qquad (8)$$

where p(x|clean) denotes the class likelihood of x, i.e., the probability of seeing x as input when it is known to belong to the clean class, and P(clean) is the prior probability of the clean class; the priors satisfy inequality (9) and Eq. (10).

$$P(\text{buggy}) \ge 0, \quad P(\text{clean}) \ge 0 \qquad (9)$$

$$P(\text{buggy}) + P(\text{clean}) = 1 \qquad (10)$$

p(x) denotes the evidence, which is the marginal probability that x is seen, regardless of whether it belongs to the buggy class or the clean class. It can be computed as in Eq. (11).

$$p(x) = p(x \mid \text{buggy}) \times P(\text{buggy}) + p(x \mid \text{clean}) \times P(\text{clean}) \qquad (11)$$

Equation (7), which represents Bayes' rule, is the basis of the Naïve Bayes classifier. By substituting the values from Eqs. (7), (8) and (11) into Eq. (6), the prediction of whether a given data point belongs to the 'buggy' class or not can be made.

K-nearest neighbor is another classification algorithm from statistics. It uses similarity between data points to predict the class. In our experimental setup, we utilize the Euclidean distance, which can be computed between any two data points $x_i$ and $x_j$ as in Eq. (12), where n is the number of attributes. Suppose that, for the defect prediction problem, vector x denotes the attribute set and y is a set with two elements {buggy, clean}, denoting the classes to which each data point uniquely belongs.

$$D(x_i, x_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2} \qquad (12)$$

Assume buggy is denoted by '+1' and clean by '−1'; hence y = {+1, −1}. For an instance $x_q$, K-NN makes the classification using Eq. (13) after computing the 'k' nearest neighbors of $x_q$ using Eq. (12). Suppose $N_k$ denotes the set of 'k' neighbors of $x_q$.

$$y_q = \operatorname{sign}\Big(\sum_{x_i \in N_k} y_i\Big) \qquad (13)$$

Decision tree-based classifiers are built using the Classification and Regression Trees (CART) algorithm [32]. Decision trees are hierarchical, nonparametric, supervised machine learning models. A tree comprises a few internal nodes with decision functions and external leaves. In our experiments, we used 'entropy' as the measure of impurity, which in turn records the goodness of a split. Let us compute the entropy for a node 'a' in a classification tree: $N_a$ denotes the number of instances that reach node 'a'; $N_a^{buggy}$ and $N_a^{clean}$ denote the number of instances in $N_a$ that belong to class 'buggy' and class 'clean,' respectively. Suppose an instance reaches node 'a'; then its chances of being 'buggy' are given
123
3PcGE: 3-parent child-based genetic evolution for software defect prediction

as Eq. (14). Similarly, its chances of being ‘clean’ are computed using Eq. (15). Entropy is computed as Eq. (16) for the 2-class classification problem.

$$p_a^{\mathrm{buggy}} = \frac{N_a^{\mathrm{buggy}}}{N_a} \tag{14}$$

$$p_a^{\mathrm{clean}} = \frac{N_a^{\mathrm{clean}}}{N_a} \tag{15}$$

$$\mathrm{Entropy}(\text{node } a) = -\left(p_a^{\mathrm{buggy}} \log p_a^{\mathrm{buggy}} + p_a^{\mathrm{clean}} \log p_a^{\mathrm{clean}}\right) \tag{16}$$

Artificial neural networks are implemented with the standard feed-forward, error backpropagation algorithm. Parameter settings are reported in Table 3. For n-feature input data X = <x1, x2, …, xn>, there are n input neurons. For the sigmoid activation function, the output yi of the ith neuron is computed using Eq. (17). In this way, features are fed in the forward direction from the input layer to the hidden layer and then from the hidden layer to the output layer. The output computed at the output neuron is compared with the actual output, and the error is computed as Eq. (18), i.e., half of the sum of squared differences between the actual output and the predicted output. The error is then backpropagated to update the weights as per Eq. (19), and learning takes place in this way to minimize the error.

$$\hat{y}_i = \mathrm{sig}\left(\sum_{i=1}^{n} w_i x_i + w_o\right) \tag{17}$$

where wi denotes the weight for the ith neuron and wo denotes the bias;

$$\mathrm{error} = \frac{1}{2} \sum_{i=1}^{m} \left(y_i - \hat{y}_i\right)^2 \tag{18}$$

where m denotes the number of output neurons.

$$\Delta w = \eta \cdot \mathrm{error} \cdot \text{input signal} = \eta \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y_i - \hat{y}_i\right)^2 \cdot x_i \tag{19}$$

where η denotes the learning rate.

Support vector machine works on Vapnik's theory of maximum-margin methods. We used the RBF kernel setting for SVM. For ‘n’ instances denoted as <Xi, yi>, it finds the optimal separating hyperplane between two classes denoted {buggy as +1, clean as −1} by finding w1 and w2 which satisfy Eq. (20).

$$y\,(w_2 x + w_1) \geq +1 \tag{20}$$

SVM solves the optimal hyperplane problem by Lagrangian multipliers. First, a new higher-dimensional mapping is achieved with the function φ, as shown in Eq. (21).

$$y = w^{T} \varphi(x) + c \tag{21}$$

where w is the weight vector and c is a scalar. The SVM has to optimize Eq. (22)

$$\text{Minimize } \frac{1}{2} w^{T} w + \rho \sum_{i} \mathrm{error}_i^{2} \tag{22}$$

$$\text{Subject to } y = w^{T} \varphi(x) + c + \mathrm{error}$$

where ρ denotes the cost function.

After solving this, the prediction made by the SVM classifier can be given as Eq. (23) in terms of the kernel.

$$\hat{Y} = \left(\alpha - \alpha^{*}\right)^{T} \cdot K(x_{\mathrm{centre}}, x) + b \tag{23}$$

In Eq. (23), K(x_centre, x) denotes the kernel based on the radial basis function as defined in Eq. (24); in our experiments we have used the RBF kernel for SVM, where the center and radius are defined by the user.

$$K(x_{\mathrm{centre}}, x) = e^{-\frac{\left|x_{\mathrm{centre}} - x\right|^{2}}{2 \cdot (\mathrm{radius})^{2}}} \tag{24}$$

4 Experimental setup

The previous section explained the research methodology. This section covers the experimental design, the setup and the description of the datasets utilized for the experiment. The set-up is installed on a Windows™ 10 Pro computer with an Intel® Core™ i5-8265U CPU @1.60 GHz 1.80 GHz (64-bit processor) and 8 GB of RAM. The proposed experimental design is evaluated by running k-fold cross-validation with 10 as the value of k for replication and randomization purposes. The statistics and machine learning toolbox and the optimization toolbox from MatlabR2019a are used for implementation with default settings.

4.1 Dataset

The cleaned NASA repository is used in this study [33, 34]. Out of 14 datasets, 12 are available online and are used in the experimental study. The details are tabulated in Table 1 [31]. NASA projects possess a set of features measured as static code metrics, for example, LOC counts and the Halstead and McCabe complexity metrics [16, 30, 31]. This is a widely accepted metric set because it is easy to compute at low cost [12, 13, 38, 43, 47].
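The 10-fold cross-validation protocol of the experimental setup can be sketched as follows. This is only an illustration under stated assumptions: the folds are stratified by class (the paper does not say whether the MATLAB folds were stratified), the class counts are those of CM1 in Table 1, and the function name `stratified_k_fold` and the seed are hypothetical, not taken from the original implementation.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Split indices into k folds with roughly equal class proportions.

    Hypothetical helper: stratification is an assumption of this sketch.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # rotate the starting fold per class so fold sizes stay balanced
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[(pos + offset) % k].append(idx)
        offset += len(indices)
    return folds


# CM1 in Table 1: 42 defective (+1) and 285 non-defective (-1) modules.
labels = [1] * 42 + [-1] * 285
folds = stratified_k_fold(labels, k=10)

for held_out, test_idx in enumerate(folds):
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # A classifier would be trained on train_idx and scored on test_idx here.
    assert len(train_idx) + len(test_idx) == len(labels)
```

Each module is held out exactly once, so the ten per-fold scores can be averaged into a single estimate for a candidate feature subset.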


Table 1 NASA dataset [31]
Dataset  Attributes  Modules  Defective  Non-defective  Defective (%)
CM1      38          327      42         285            12.8
JM1      22          7720     1612       6108           20.8
KC1      22          1162     294        868            25.3
KC3      40          194      36         158            18.5
MC1      39          1952     36         1916           1.8
MC2      40          124      44         80             35.4
MW1      38          250      25         225            10
PC1      38          679      55         624            8.1
PC2      37          722      16         706            2.2
PC3      38          1053     130        923            12.3
PC4      38          1270     176        1094           13.8
PC5      39          1694     458        1236           27

Table 2 Parameter value for 3PcGE
Parameter description        Value
Number of generations        100
Population size              20
3Pc generation probability   0.1
Parent selection             Tournament method
Crossover operator           2-point crossover
Crossover probability        0.9
Mutation operator            Bit flip
Mutation probability         1/n (n = number of features)

Table 3 Parameter value for classification algorithms
Classifier                        Parameter: value
Artificial neural networks (ANN)  Input Layer Size: variable chromosome size; Hidden Layer Size: 10 neurons; Output Layer Size: 2; Number of Hidden Layers: 1; Training Function: trainscg (scaled conjugate gradient); Performance Function: Cross-Entropy
Decision trees (DT)               Algorithm: CART; tenfold CV
Naïve Bayes (NB)                  tenfold CV
Support vector machine (SVM)      Kernel: Radial Basis Function; Algorithm: SMO (Sequential Minimal Optimization); tenfold CV
k-nearest neighbor (k-NN)         K = 5; Distance Criteria: Euclidean measure; tenfold CV

4.2 Parameter settings

The adapted values of parameters for the 3PcGE algorithm and for the classification algorithms are given in Tables 2 and 3, respectively.

4.3 Feature selection techniques

In this study, wrapper-based FS techniques are deployed that use the proposed 3PcGE algorithm in conjunction with five classification algorithms, with AUC as the performance measure for evaluating the importance of a feature subset. Cross-validation is used to estimate the accuracy of the predetermined classification technique for a set of features. I propose five different wrapper-based FS techniques using “one proposed 3PcGE algorithm + five classification algorithms + one evaluation measure”.

4.4 Evaluation criteria

The performance of the proposed 3PcGE-based SDP classifiers is measured using the following evaluation criteria, which are widely accepted in studies from the literature [1, 2, 13, 21, 22, 40, 45].

Fig. 4 Confusion matrix
                      PREDICTED CLASS: BUGGY            PREDICTED CLASS: CLEAN
ACTUAL CLASS: BUGGY   True Positive (Buggy >> Buggy)    False Negative (Buggy >> Clean)
ACTUAL CLASS: CLEAN   False Positive (Clean >> Buggy)   True Negative (Clean >> Clean)

The confusion matrix contains information about the actual values and the predicted values of the output class variable in the form of a matrix. The classifications predicted by the fault prediction model are compared with the actual values, and performance is evaluated.

As shown in Fig. 4, the class ‘buggy’ is considered the positive class and the class ‘clean’ is considered the negative class. The term ‘True Positive’ refers to the count of modules which are actually buggy and are classified as buggy by the classifier. The term ‘True Negative’ refers to the count of modules which are clean in the actual dataset and are predicted as clean by the classifier. This leads to two other terms, ‘False Positive’ and ‘False Negative.’ ‘False Positive’ refers to the count of modules which belong to the clean class in the actual dataset and are predicted as buggy by the classifier under consideration. ‘False Negative’ refers to the count of modules which are buggy in the actual dataset and are predicted as clean by the classifier.

Sensitivity (recall or true positive rate or TPR) and specificity (1 − false positive rate or 1 − FPR) are computed as Eqs. (25) and (26). True positive rate, TPR, can be thought as


hit rate, which accounts for the proportion of buggy modules we correctly predict, and false positive rate, FPR, refers to the proportion of clean modules we wrongly accept as buggy. Precision is computed as Eq. (27).

$$\text{Sensitivity (or recall)} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}} \tag{25}$$

$$\text{Specificity} = \frac{\text{true negative}}{\text{true negative} + \text{false positive}} \tag{26}$$

$$\text{Precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}} \tag{27}$$

The receiver operating characteristics (ROC) curve is analyzed to evaluate the performance of the prediction model. To construct the ROC curve, many cutoff points between 0 and 1 are selected, and the sensitivity and (1 − specificity) at each cutoff point are calculated (see Fig. 5). The closer a classifier's curve gets to the upper left corner, the better its performance. When comparing classifiers, the one whose curve lies above the other is considered better.

Fig. 5 ROC

Area under the ROC curve (AUC) is a combined measure of sensitivity and specificity. It gives the averaged performance of the classifier over different situations. AUC = 1 is considered ideal.

Accuracy is the measure of the correctness of the prediction model. It is defined as the ratio of correctly classified instances to the total number of instances [17] and is computed as Eq. (28).

$$\text{Accuracy} = \frac{\text{true positive} + \text{true negative}}{\text{true positive} + \text{false positive} + \text{true negative} + \text{false negative}} \tag{28}$$

F1-measure computes the harmonic mean of precision and sensitivity as in Eq. (29).

$$\text{F1-measure} = \frac{2 \cdot \text{true positive}}{2 \cdot \text{true positive} + \text{false positive} + \text{false negative}} \tag{29}$$

5 Results and discussion

In this section, I report the results obtained from the experimental work and carry out the empirical analysis to address the research questions.

RQ1: Is the proposed 3PcGE algorithm based on the three-parent child concept suitable for feature selection in SDP?

To address this question, I implemented the proposed 3PcGE for feature selection in SDP classification. I first collected the AUC measures for all five 3PcGE-SDP models over the NASA datasets (see Table 4).

Then, I collected the accuracy and F1-measure because these measures take sensitivity and specificity into consideration simultaneously. Results are reported in Tables 5 and 6.

From the experimental results, I infer that feature selection using 3PcGE is effective for SDP classification. The 3PcGE.SVM classifier, which utilizes 3PcGE as the feature selector and SVM as the classification algorithm, outperforms the other models. All models show good results over all criteria. For further analysis, boxplots are plotted in Figs. 6, 7 and 8 for AUC, accuracy and F1-measure, respectively.

It is clear from the boxplots that the SVM classifier with 3PcGE feature selection performs better than the rest of the proposed models. Along with the performance criteria, the computational time taken to run the multi-objective evolutionary algorithm is also considered for performance evaluation. Table 7 reports the computational time for all five classifiers over the NASA datasets.

Figure 9 plots the computational run time consumed by the five proposed classifiers. 3PcGE.SVM has the lowest average run time (66.26 s, Table 7), although 3PcGE.NB is faster on the two smallest datasets, MC2 and MW1. From the experimental results, I infer that 3PcGE is suitable for feature selection in the SDP problem, as it delivers strong results at low computational cost.

RQ2: Is the proposed 3PcGE competent among other pareto-based multi-objective optimization algorithms, which are deployed as FS in SDP?

To address this question, I compare the performance of the proposed algorithm with the most popular and baseline

123
S. Goyal

Table 4 AUC measure for 3PcGE-based SDP classifiers
Dataset 3PcGE.SVM 3PcGE.KNN 3PcGE.NB 3PcGE.ANN 3PcGE.DT

CM1 0.961 0.78 0.81 0.9 0.92


JM1 0.948 0.67 0.82 0.89 0.91
KC1 0.901 0.687 0.78 0.88 0.863
KC3 0.917 0.77 0.82 0.87 0.88
MC1 0.944 0.73 0.79 0.91 0.88
MC2 0.926 0.69 0.78 0.92 0.916
MW1 0.916 0.782 0.782 0.816 0.83
PC1 0.938 0.781 0.891 0.838 0.85
PC2 0.951 0.776 0.876 0.921 0.89
PC3 0.976 0.75 0.82 0.9 0.87
PC4 0.965 0.79 0.79 0.89 0.91
PC5 0.969 0.78 0.85 0.89 0.85
Average 0.942666667 0.748833333 0.817416667 0.885416667 0.88075
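The AUC values in Table 4 summarize the ROC curve in a single number. As a hedged illustration (not the MATLAB routine used in the study), AUC can be computed directly as the fraction of buggy/clean module pairs in which the buggy module receives the higher score, which is equivalent to the area under the ROC curve. The function name `auc_score` and the example scores are hypothetical.

```python
def auc_score(y_true, scores):
    """AUC as the probability that a buggy module outranks a clean one.

    Labels: +1 for buggy (positive class), anything else for clean.
    Tied scores count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y != 1]
    if not pos or not neg:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for two buggy and two clean modules.
example_auc = auc_score([1, 1, -1, -1], [0.9, 0.4, 0.5, 0.1])
```

In the example, three of the four buggy/clean pairs are ranked correctly, so the sketch returns 0.75; a classifier that ranks every buggy module above every clean one reaches the ideal value of 1.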

Table 6 F1-measure for 3PcGE-based SDP classifiers
Dataset 3PcGE.SVM 3PcGE.KNN 3PcGE.NB 3PcGE.ANN 3PcGE.DT

CM1 0.8712 0.76831 0.796 0.867 0.85


JM1 0.8544 0.7098 0.71029 0.798 0.8
KC1 0.8519 0.7207 0.7903 0.832 0.845
KC3 0.8387 0.731 0.7891 0.801 0.831
MC1 0.8278 0.681 0.685 0.7999 0.801
MC2 0.801 0.57038 0.701 0.789 0.8
MW1 0.8216 0.6682 0.6482 0.801 0.789
PC1 0.83241 0.79921 0.7961 0.831 0.831
PC2 0.7901 0.683 0.76 0.77 0.789
PC3 0.8011 0.72 0.782 0.758 0.8
PC4 0.883 0.6688 0.756 0.851 0.86
PC5 0.8915 0.691 0.801 0.878 0.85
Average 0.838725833 0.70095 0.751249167 0.814658333 0.8205

Table 5 Accuracy measure for 3PcGE-based SDP classifiers
Dataset 3PcGE.SVM 3PcGE.KNN 3PcGE.NB 3PcGE.ANN 3PcGE.DT

CM1 0.9212 0.8831 0.8996 0.9189 0.9021


JM1 0.9444 0.7498 0.8029 0.8202 0.9312
KC1 0.9519 0.8307 0.8703 0.9475 0.933
KC3 0.9387 0.831 0.841 0.9198 0.9127
MC1 0.9378 0.73 0.79 0.91 0.88
MC2 0.901 0.6238 0.767 0.847 0.9052
MW1 0.916 0.782 0.782 0.816 0.83
PC1 0.9541 0.8921 0.9261 0.9455 0.9363
PC2 0.911 0.763 0.891 0.899 0.878
PC3 0.921 0.8 0.876 0.9 0.919
PC4 0.953 0.788 0.821 0.912 0.923
PC5 0.955 0.791 0.89 0.88 0.899
Average 0.933758333 0.788708333 0.846408333 0.892991667 0.904125
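The accuracy and F1 values in Tables 5 and 6 follow from the confusion-matrix counts through Eqs. (25) to (29). A minimal sketch of that arithmetic is given below; the function name `sdp_metrics` and the counts are hypothetical (chosen only to add up to the 42 buggy and 285 clean modules of CM1), and the sketch assumes no denominator is zero.

```python
def sdp_metrics(tp, fn, fp, tn):
    """Evaluate Eqs. (25)-(29) from confusion-matrix counts.

    'buggy' is the positive class and 'clean' the negative class.
    """
    return {
        "sensitivity": tp / (tp + fn),                 # Eq. (25)
        "specificity": tn / (tn + fp),                 # Eq. (26)
        "precision": tp / (tp + fp),                   # Eq. (27)
        "accuracy": (tp + tn) / (tp + fp + tn + fn),   # Eq. (28)
        "f1": 2 * tp / (2 * tp + fp + fn),             # Eq. (29)
    }


# Hypothetical counts: 30 + 12 = 42 buggy modules, 15 + 270 = 285 clean modules.
metrics = sdp_metrics(tp=30, fn=12, fp=15, tn=270)

# Eq. (29) is the harmonic mean of precision and sensitivity.
p, r = metrics["precision"], metrics["sensitivity"]
harmonic_mean = 2 * p * r / (p + r)
```

With these made-up counts the accuracy is about 0.92 while the F1-measure is about 0.69, which shows why accuracy alone can look flattering on imbalanced datasets such as those in Table 1.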


Fig. 6 AUC measure for all five classifiers over NASA dataset

Fig. 7 Accuracy measure for all five classifiers over NASA dataset

algorithm, i.e., NSGA-II [35]. The comparison criterion is the hypervolume (HV), tabulated in Tables 8 and 9 for the proposed and the baseline technique, respectively.

Table 8 tabulates the HV measurement for all five classification algorithms using the features selected by the 3PcGE FS algorithm, and Fig. 10 shows the corresponding boxplots for graphical analysis. Table 8 and Fig. 10 show that 3PcGE FS with SVM performs well as a multi-objective optimization algorithm.

Table 9 reports the HV values for the same classification algorithms using the features selected by the NSGA-II algorithm [35]. The study [35] is repeated over the NASA datasets to collect these measurements for comparative analysis purposes.
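For readers unfamiliar with the hypervolume indicator behind Tables 8 and 9, the following sketch computes it for a two-objective front. It rests on assumptions that are not stated in the text: both objectives are minimized (for example, classification error 1 − AUC and the fraction of features selected), and the reference point is (1, 1). The function name `hypervolume_2d` and the example front are hypothetical; the exact objectives and reference point used in the study and in [35] may differ.

```python
def hypervolume_2d(front, ref=(1.0, 1.0)):
    """Area dominated by a 2-objective front and bounded by ref.

    Both objectives are assumed to be minimized; points that do not
    improve on the reference point are ignored.
    """
    pts = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_f2 = ref[1]
    for f1, f2 in pts:  # sweep in order of increasing first objective
        if f2 < best_f2:  # point is non-dominated among those seen so far
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


# Hypothetical front of (error, feature fraction) pairs; (0.5, 0.5) is dominated.
front = [(0.1, 0.6), (0.2, 0.3), (0.4, 0.1), (0.5, 0.5)]
hv = hypervolume_2d(front)
```

A larger hypervolume means the front covers more of the objective space, so under these assumptions higher values in Tables 8 and 9 indicate a better trade-off between prediction quality and subset size.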

Table 8 Hypervolume measure for 3PcGE-based SDP classifiers
Dataset 3PcGE.SVM 3PcGE.KNN 3PcGE.NB 3PcGE.ANN 3PcGE.DT

CM1 9.47E−01 7.16E−01 8.27E−01 9.08E−01 9.08E−01


JM1 8.04E−01 6.15E−01 7.98E−01 8.04E−01 7.95E−01
KC1 7.94E−01 5.35E−01 7.42E−01 7.88E−01 7.65E−01
KC3 8.28E−01 6.13E−01 8.10E−01 8.28E−01 7.93E−01
MC1 9.23E−01 7.86E−01 9.39E−01 9.76E−01 9.73E−01
MC2 6.49E−01 5.02E−01 7.57E−01 7.30E−01 7.84E−01
MW1 9.07E−01 7.66E−01 8.27E−01 9.07E−01 8.67E−01
PC1 9.66E−01 7.83E−01 8.97E−01 9.66E−01 9.31E−01
PC2 9.79E−01 7.55E−01 9.45E−01 9.77E−01 9.68E−01
PC3 8.99E−01 6.29E−01 7.66E−01 8.64E−01 8.64E−01
PC4 8.98E−01 6.88E−01 8.61E−01 8.74E−01 8.53E−01
PC5 7.42E−01 5.23E−01 7.54E−01 7.56E−01 7.58E−01

Table 9 Hypervolume measure for NSGA-II-based SDP classifiers [35]
Dataset NSGA.SVM NSGA.KNN NSGA.NB NSGA.ANN NSGA.DT
CM1 8.67E−01 7.01E−01 7.90E−01 8.89E−01 8.98E−01
JM1 7.92E−01 6.12E−01 6.99E−01 7.90E−01 6.75E−01
KC1 7.13E−01 5.24E−01 6.99E−01 7.88E−01 6.80E−01
KC3 6.98E−01 6.04E−01 8.09E−01 8.13E−01 6.98E−01
MC1 8.90E−01 7.98E−01 9.12E−01 9.56E−01 8.77E−01
MC2 6.22E−01 5.12E−01 7.33E−01 6.99E−01 7.68E−01
MW1 8.93E−01 6.98E−01 7.99E−01 9.00E−01 7.57E−01
PC1 7.51E−01 7.13E−01 8.57E−01 9.07E−01 8.21E−01
PC2 8.77E−01 7.33E−01 9.24E−01 9.35E−01 8.58E−01
PC3 8.00E−01 6.12E−01 7.24E−01 7.89E−01 6.79E−01
PC4 8.82E−01 6.16E−01 7.90E−01 7.40E−01 7.45E−01
PC5 6.78E−01 5.12E−01 7.23E−01 7.13E−01 6.55E−01

Table 7 Computational time for 3PcGE-based SDP classifiers (in seconds)
Dataset 3PcGE.SVM 3PcGE.KNN 3PcGE.NB 3PcGE.ANN 3PcGE.DT
CM1 40.322 97.592 52.963 103.65 99.628
JM1 90.955 196.55 143.892 759.273 284.539
KC1 82.501 165.55 121.952 376.56 199.537
KC3 33.472 83.908 48.134 78.678 102.548
MC1 88.484 179.545 143.106 401.527 302.485
MC2 29.787 35.394 14.88 67.987 59.436
MW1 31.154 55.898 23.455 101.45 70.743
PC1 62.333 178.919 74.538 209.729 205.47
PC2 77.026 192.34 94.008 298.951 199.756
PC3 79.451 157.155 91.654 392.593 185.63
PC4 81.341 153.32 111.193 387.456 187.57
PC5 98.236 190.54 143.56 363.845 302.56
Average 66.25516667 140.55925 88.61125 295.1415833 183.3252


Fig. 8 F1-measure for all five classifiers over NASA dataset

Fig. 9 Computational time for 3PcGE-based five classifiers over NASA dataset (bar chart titled "Computational Cost of 3PcGE"; x-axis: datasets CM1 to PC5; y-axis: cost in seconds, 0 to 800; series: 3PcGE.SVM, 3PcGE.KNN, 3PcGE.NB, 3PcGE.ANN, 3PcGE.DT; the plotted values are those of Table 7)

The statistical test for comparing the performance of the two algorithms, 3PcGE and NSGA-II, as feature selectors for five classification algorithms over the NASA datasets is selected taking into consideration the assumptions of dependence and normality [42]. The HV measure for the five pairs {(3PcGE.SVM, NSGA.SVM), (3PcGE.KNN, NSGA.KNN), (3PcGE.NB, NSGA.NB), (3PcGE.ANN, NSGA.ANN), (3PcGE.DT, NSGA.DT)} is collected over the same projects of the NASA datasets at two different time instants; hence, the five pairs are dependent (or paired). Normality is tested using the Shapiro–Wilk test at significance level 0.05. The p-values of the Shapiro–Wilk test are reported in Table 10. The p-value is greater than 0.05 for all five pairs. Hence, the null hypothesis cannot be rejected and normality is confirmed.

Now, as the assumptions for the paired t-test, (1) two samples, (2) dependence and (3) normality, are satisfied, I perform the paired t-test to compare the performance of the five pairs of SDP classifiers working on the two pareto-based multi-objective algorithms (3PcGE and NSGA-II) using the hypervolume measure. Table 11 reports the results of the t-test at the 95% confidence level. The null hypotheses are rejected (every p-value is less than 0.05), and hence, the performance of the two algorithms (3PcGE and NSGA-II) is significantly different.
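The paired t-test can be retraced by hand for one pair. The sketch below uses the 3PcGE.SVM and NSGA.SVM hypervolume columns of Tables 8 and 9 and compares the t statistic with the standard two-sided critical value of Student's t for 11 degrees of freedom at the 0.05 level (2.201). The function name `paired_t_statistic` is hypothetical, and this is an illustration in Python rather than the MATLAB procedure used in the study.

```python
import math


def paired_t_statistic(a, b):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n), n - 1


# Hypervolume per dataset (CM1 ... PC5), copied from Tables 8 and 9.
hv_3pcge_svm = [0.947, 0.804, 0.794, 0.828, 0.923, 0.649,
                0.907, 0.966, 0.979, 0.899, 0.898, 0.742]
hv_nsga_svm = [0.867, 0.792, 0.713, 0.698, 0.890, 0.622,
               0.893, 0.751, 0.877, 0.800, 0.882, 0.678]

t_stat, dof = paired_t_statistic(hv_3pcge_svm, hv_nsga_svm)

# Two-sided critical value of Student's t for 11 degrees of freedom, alpha = 0.05.
T_CRIT = 2.201
reject_null = abs(t_stat) > T_CRIT
```

The statistic exceeds the critical value, which agrees with the rejection of the null hypothesis reported for the SVM pair in Table 11.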


Fig. 10 Hypervolume for 3PcGE-based SDP classifiers

Table 10 p-value for Shapiro–Wilk test
Model        p-value
3PcGE.SVM    3.13E−01
NSGA.SVM     1.60E−01
3PcGE.KNN    2.04E−01
NSGA.KNN     3.68E−01
3PcGE.NB     2.35E−01
NSGA.NB      1.81E−01
3PcGE.ANN    5.19E−01
NSGA.ANN     3.58E−01
3PcGE.DT     2.54E−01
NSGA.DT      1.84E−01

Table 11 p-value of paired t-test for all five pairs of classifiers
Proposed (3PcGE)   Baseline [35]   p-value
3PcGE.SVM          NSGA.SVM        0.0014
3PcGE.KNN          NSGA.KNN        0.0225
3PcGE.NB           NSGA.NB         2.45E−04
3PcGE.ANN          NSGA.ANN        4.50E−03
3PcGE.DT           NSGA.DT         1.72E−05

Table 12 Comparison with state-of-the-art FS techniques
FS method type                                    FS technique          Average AUC over NASA
Filter based [12]                                 Chi-square            0.79
Filter based [12]                                 InfoGain              0.75
Filter based [12]                                 ReliefF               0.76
Wrapper-based subset selection techniques [12]    k-Nearest neighbor    0.79
Wrapper-based subset selection techniques [12]    Logistic regression   0.8
Wrapper-based subset selection techniques [12]    Naïve Bayes           0.8
Proposed 3PcGE-based FS                           3PcGE.SVM             0.94
Proposed 3PcGE-based FS                           3PcGE.KNN             0.74
Proposed 3PcGE-based FS                           3PcGE.NB              0.817
Proposed 3PcGE-based FS                           3PcGE.ANN             0.885
Proposed 3PcGE-based FS                           3PcGE.DT              0.88
Best value: 0.94 (3PcGE.SVM)

Table 11 shows that the performance of 3PcGE is statistically better than that of NSGA-II. Further, the computational cost for NSGA-II is 107 s (as reported by the study [35]), and the average cost for the proposed 3PcGE.SVM model (see Table 7) is lower, at only 66.255 s.

From the experimental results, I infer that the proposed algorithm 3PcGE is competent among other pareto-based multi-objective optimization algorithms and statistically outperforms the NSGA-II algorithm.


RQ3: Does the proposed 3PcGE have an advantage over the state-of-the-art FS techniques in selecting an optimal feature subset with better model accuracy?

To address this question, I compare the performance of the proposed method with the state-of-the-art feature selection methods. The studies [11, 12] are selected for comparative analysis as they utilized most of the state-of-the-art FS techniques over the NASA datasets.

Table 12 shows that the proposed 3PcGE is advantageous over the state-of-the-art FS techniques, including the filter-based techniques and the wrapper-based techniques. Among all the classifiers, the SVM classifier using 3PcGE as the FS technique is the best performer in terms of the AUC measure. In order to confirm this inference statistically, the statistical test is selected taking into consideration the assumption of sphericity. Mauchly's test for sphericity is conducted at a significance level of 5% (significance value = 0.05) to confirm that the assumption of sphericity is met. The value of the test statistic for Mauchly's test comes to 0.1691. Hence, the assumption of sphericity is tenable (Table 12).

In this study, repeated ANOVA is selected for the statistical comparison between the proposed technique and the literature work because (1) the number of samples is more than two, (2) normality is met and (3) sphericity is met. The repeated ANOVA yields a p-value reported as 0.00 at a significance value of 0.05. Hence, the null hypothesis is rejected. The performance of 3PcGE.SVM is statistically better than that of the state-of-the-art FS techniques.

The statistical analysis shows that the proposed 3PcGE is a competent FS technique and performs better than the state-of-the-art FS techniques.

6 Conclusion and future work

In this paper, I have formulated the feature selection problem as a multi-objective optimization problem and applied a novel concept of the three-parent child to get optimal results. The implementation of 3PcGE as a feature selection technique in the field of software defect prediction is innovative and new. In this way, I have approached the problem from the pareto-based multi-objective optimization perspective. The performance of the proposed method is then evaluated over multiple performance evaluation criteria (AUC, accuracy, F1-measure, run-time). To gain insight into the statistical comparison with other PMAs, NSGA-II, which is one of the most popular optimization algorithms, is selected, and a paired t-test is conducted over the hypervolume measure. The study [35] is selected for comparative analysis. The results show that the proposed technique is competent and that, with support vector machine as the classification algorithm, it outperforms all other SDP classifiers. Then, to compare performance empirically with other feature selection methods which are popular in the literature, the proposed technique is compared with the traditional filter and wrapper FS techniques. The study [12] is selected for comparative analysis. The repeated ANOVA test is performed to statistically verify the results. The overall contribution of this study is a novel 3PcGE evolutionary algorithm that solves the feature selection problem in SDP as a pareto-based multi-objective optimization problem. The results support the competency of the proposed method.

In future, I propose to extend the work by taking more projects into consideration, which will enhance the generalizability of the technique. I also propose to implement more pareto-based multi-objective algorithms to compete with the proposed technique and to enhance the solution accuracy.

References

1. Afzal W, Torkar R (2016) Towards benchmarking feature subset selection methods for software fault prediction. In: Pedrycz W, Succi G, Sillitti A (eds) Computational intelligence and quantitative software engineering. Studies in computational intelligence, vol 617. Springer, Cham. https://doi.org/10.1007/978-3-319-25964-2-3
2. Anbu M, Anandha Mala GS (2019) Feature selection using firefly algorithm in software defect prediction. Cluster Comput 22:10925–10934. https://doi.org/10.1007/s10586-017-1235-3
3. Barritt JA et al (2001) Cytoplasmic transfer in assisted reproduction. Hum Reprod Update 7:428. https://doi.org/10.1093/humupd/7.4.428
4. Canfora G, Lucia AD, Penta MD, Oliveto R, Panichella A, Panichella S (2015) Defect prediction as a multiobjective optimization problem. Softw Test Verific Reliab 25(4):426–459
5. Catal C (2011) Software fault prediction: a literature review and current trends. Expert Syst Appl 38(4):4626–4636
6. Catal C, Diri B (2009) Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem. Inf Sci 179(8):1040–1058
7. Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 6(2):182–197
8. Erturk E, Sezer EA (2015) A comparison of some soft computing methods for software fault prediction. Expert Syst Appl 42(4):1872–1879. https://doi.org/10.1016/j.eswa.2014.10.025
9. Fenton NE, Neil M (1999) A critique of software defect prediction models. IEEE Trans Softw Eng 25(5):675–689. https://doi.org/10.1109/32.815326
10. Gao K, Khoshgoftaar TM, Wang H, Seliya N (2011) Choosing software metrics for defect prediction: an investigation on feature selection techniques. Softw Pract Exp 41(5):579–606
11. Ghotra B, McIntosh S, Hassan AE (2015) Revisiting the impact of classification techniques on the performance of defect prediction models. In: Proceedings of the international conference on software engineering, pp 789–800
12. Ghotra B, McIntosh S, Hassan AE (2017) A large-scale study of the impact of feature selection techniques on defect classification models. In: Proceedings of the international conference on mining software repositories, pp 146–157


13. Goyal S, Bhatia PK (2020) Comparison of machine learning techniques for software quality prediction. Int J Knowl Syst Sci (IJKSS) 11(2):21–40
14. Holmes HG et al (2003) Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans Knowl Data Eng 15(6):1437–1447
15. Hall T, Beecham S, Bowes D, Gray D, Counsell S (2012) A systematic literature review on fault prediction performance in software engineering. Trans Softw Eng IEEE 38(6):1276–1304
16. Halstead MH (1977) Elements of software science. Elsevier North Holland, New York
17. Hanley J, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic ROC curve. Radiology 143:29–36
18. Harman M, Jones B (2001) Search based software engineering. J Inf Softw Technol 43(14):833–839
19. Harman M, Mansouri SA, Zhang Y (2012) Search-based software engineering: trends, techniques and applications. ACM Comput Surv (CSUR) 45(1):1–61
20. He P, Li B, Liu X, Chen J, Ma Y (2015) An empirical study on software defect prediction with a simplified metric set. Inf Softw Technol 59:170–190
21. Hosseini S, Turhan B, Mäntylä M (2018) A benchmark study on the effectiveness of search-based data selection and feature selection for cross project defect prediction. Inf Softw Technol J 95:296–312
22. Hosseini S, Turhan B, Gunarathna D (2019) A systematic literature review and meta-analysis on cross project defect prediction. IEEE Trans Softw Eng 45(2):111–147
23. Jiarpakdee J, Tantithamthavorn C, Hassan AE (2019) The impact of correlated metrics on the interpretation of defect models. IEEE Trans Softw Eng. https://doi.org/10.1109/TSE.2019.2891758
24. Khoshgoftaar TM, Allen EB (2000) A practical classification-rule for software quality models. IEEE Trans Reliab 49(2):209–216
25. Kondo M, Bezemer C-P, Kamei Y, Hassan AE, Mizuno O (2019) The impact of feature reduction techniques on defect prediction models. Empir Softw Eng 24:1925–1963
26. Li Z, Jing XY, Zhu X (2018) Progress on approaches to software defect prediction. IET Softw 12(3):161–175
27. Lin S-W, Ying K-C, Chen S-C, Lee Z-J (2008) Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Syst Appl 35:1817–1824
28. Liu YC, Khoshgoftaar TM, Seliya N (2010) Evolutionary optimization of software quality modeling with multiple repositories. IEEE Trans Softw Eng 36(6):852–864
29. Mafarja M, Mirjalili S (2017) Whale optimization approaches for wrapper feature selection. Appl Soft Comput 62:441–453
30. McCabe TJ (1976) A complexity measure. IEEE Trans Softw Eng 4:308–320
31. Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. IEEE Trans Softw Eng 33(1):2–13
32. Mitchell T (1997) Machine learning. McGraw-Hill, New York
33. NASA – Software Defect Datasets [Online]. Available: https://nasasoftwaredefectdatasets.wikispaces.com. Accessed 19 Aug 2019
34. NASA Defect Dataset [Online]. Available: https://github.com/klainfo/NASADefectDataset. Accessed 19 Aug 2019
35. Ni C, Chen X, Wu F, Shen Y, Gu Q (2019) An empirical study on pareto based multi-objective feature selection for software defect prediction. J Syst Softw 152:215–238. https://doi.org/10.1016/j.jss.2019.03.012
36. Pressman RS (1997) Software engineering: a practitioner’s approach. McGraw-Hill, New York
37. Porter A, Selby R (1990) Evaluating techniques for generating metric-based classification trees. J Syst Softw 12:209–218
38. Radjenović D, Heričko M, Torkar R, Živkovič A (2013) Software fault prediction metrics: a systematic literature review. Inf Softw Technol 55(8):1397–1418
39. Aurora R, José RR, Sebastián V (2019) A survey of many-objective optimisation in search-based software engineering. J Syst Softw 149:382–395. https://doi.org/10.1016/j.jss.2018.12.015
40. Rathore SS, Kumar S (2019) A study on software fault prediction techniques. Artif Intell Rev 51(2):255–327. https://doi.org/10.1007/s10462-017-9563-5
41. Rodríguez D, Ruiz R, Cuadrado-Gallego J, AguilarRuiz J (2007) Detecting fault modules applying feature selection to classifiers. In: IEEE international conference on information reuse and integration, 2007. IRI 2007, pp 667–672. IEEE
42. Ross SM (2004) Introduction to probability and statistics for engineers and scientists, 3rd edn. Elsevier Press, Cambridge
43. Shepperd M, Song Q, Sun Z, Mair C (2013) Data quality: some comments on the NASA software defect datasets. IEEE Trans Softw Eng 39(9):1208–1215
44. Song Q, Jia Z, Shepperd M, Ying S, Liu J (2011) A general software defect-proneness prediction framework. IEEE Trans Softw Eng 37(3):356–370
45. Wahono RS (2015) A systematic literature review of software defect prediction. J Softw Eng 1(1):1–16
46. Wahono RS, Suryana N, Ahmad S (2014) Metaheuristic optimization based feature selection for software defect prediction. J Softw 9(5):1324–1333
47. Xu Z, Liu J, Yang Z, An G, Jia X (2016) The impact of feature selection on defect prediction performance: an empirical comparison. In: 2016 IEEE 27th international symposium on software reliability engineering (ISSRE), pp 309–320. IEEE
48. Yu Q, Qian J, Jiang S, Zhenhua Wu, Zhang G (2019) An empirical study on the effectiveness of feature selection for cross-project defect prediction. IEEE Access 7(2019):35710–35718
49. Zhang J et al (2016) Pregnancy derived from human zygote pronuclear transfer in a patient who had arrested embryos after IVF. Reprod Biomed Online 33:529. https://doi.org/10.1016/j.rbmo.2016.07.008
50. Zhou Y, Leung H (2006) Empirical analysis of object-oriented design metrics for predicting high and low severity faults. IEEE Trans Softw Eng 32(10):771–789. https://doi.org/10.1109/TSE.2006.102
51. Zhang Y, Lo D, Xia X, Sun J (2018) Combined classifier for cross-project defect prediction: an extended empirical study. Front Comp Sci 12(2):280–296. https://doi.org/10.1007/s11704-017-6015-y

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
