

Ridge and Lasso Regression Models for Cross-Version Defect Prediction

Xiaoxing Yang and Wushao Wen

Abstract—Sorting software modules in order of defect count can help testers to focus on software modules with more defects. One of the most popular methods for sorting modules is generalized linear regression. However, our previous study showed the poor performance of these regression models, which might be caused by severe multicollinearity. Ridge regression (RR) can improve the prediction performance for multicollinearity problems, and lasso regression (LAR) is a worthy competitor to RR. Therefore, we investigate both RR and LAR models for cross-version defect prediction. Cross-version defect prediction is an approximation of real applications: it constructs prediction models from a previous version of a project and predicts defects in the next version. Experimental results based on 11 projects from the PROMISE repository, consisting of 41 different versions, show that: 1) there exist severe multicollinearity problems in the experimental datasets; 2) both RR and LAR models perform better than linear regression and negative binomial regression for cross-version defect prediction; and 3) compared with the two best methods in our previous study for sorting software modules according to the predicted number of defects, RR has comparable performance and less model construction time.

Index Terms—Lasso regression (LAR), multicollinearity, ridge regression (RR), software defect prediction, sorting modules in order of defect count.

Manuscript received September 7, 2017; revised March 10, 2018 and May 22, 2018; accepted June 9, 2018. Date of publication June 29, 2018; date of current version August 30, 2018. This work was supported in part by the National Natural Science Foundation of China under Grant 61602534 and in part by the Fundamental Research Funds for the Central Universities (No. 17lgpy122). Associate Editor: T. H. Tse. (Corresponding author: Wushao Wen.)

X. Yang is with the School of Data and Computer Science, Sun Yat-Sen University, Guangzhou 510006, China, and also with the Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, China (e-mail: yangxx27@mail.sysu.edu.cn).

W. Wen is with the School of Data and Computer Science, Sun Yat-Sen University, Guangzhou 510006, China (e-mail: wenwsh@mail.sysu.edu.cn).

Digital Object Identifier 10.1109/TR.2018.2847353

ABBREVIATIONS AND ACRONYMS

CK       Chidamber and Kemerer.
CLC      Cumulative lift chart (also the area under the CLC).
FPA      Fault-percentile-average.
GCV      Generalized cross-validation.
HKB      Ridge estimator proposed by Hoerl, Kennard, and Baldwin.
LAR      Lasso regression.
Lasso    Least absolute shrinkage and selection operator.
LR       Linear regression.
LTR      Learning-to-rank approach.
LW       Ridge estimator proposed by Lawless and Wang.
NBR      Negative binomial regression.
PCA      Principal component analysis.
PCR      Principal component regression.
Ranksum  Wilcoxon rank-sum test.
RF       Random forest.
RR       Ridge regression.
VIF      Variance inflation factor.

NOTATIONS

n        Number of training software modules.
d        Number of metrics.
X        Metric matrix of the n training samples.
x_i      Metric vector of the ith training software module.
x_{i,j}  jth metric value in vector x_i.
y_i      Defect number of the ith training module.
m        Number of testing software modules.
z_i      Metric vector of the ith testing software module.
f(z_i)   Relative defect number of the ith testing software module predicted by models.
X^T X    Moment matrix.
λ_max    Largest eigenvalue of the moment matrix.
λ_min    Smallest eigenvalue of the moment matrix.
R_i^2    Coefficient of determination in the regression of explanatory variable x_i on the remaining explanatory variables of the model.
α_j      Parameter for x_j in the prediction model.
k        Ridge parameter of RR.
I        Identity matrix.
RSS      (y − Xα)^T (y − Xα).
s_i      Actual defect number in module f_i.
s        Total number of defects in all modules.

I. INTRODUCTION

Software testing activities play a key role in software development and consume a great amount of resources, including time, money, and personnel [1]. Sorting software modules (files, packages, etc.) in order of defect count, which we call software defect prediction for the ranking task, can help testers to focus on software modules with more defects and identify defects more quickly [1], [2]. The process mainly includes two parts: data and model construction methods [3]. First, we obtain data from software modules according to software metrics (also referred to as features or attributes), such as lines of code and previous defects [4]. Second, modeling approaches such as linear regression (LR) and random forest (RF) are employed to construct a model based on the data from modules with known defect numbers. Finally, the constructed model is used to predict the defect information of the software modules whose defect numbers are unknown, and thus an order of these modules based on the predicted defect numbers is obtained, which can help allocate testing resources [3].

One of the most popular methods for sorting modules in order of defect count is generalized LR [1], [2], [5], [6], including LR [7]. In our previous work, we found that these generalized LR models performed poorly over most experimental datasets, and the possible reason was that the experimental datasets had severe multicollinearity problems (their condition numbers were larger than 100) [3]. Generalized LR models have merits: they are simple, and it is easy to interpret the relationship between prediction values and metrics. Therefore, we attempt to find a model that has the above merits of generalized LR models and gives better performance when multicollinearity problems exist. To be specific, we focus only on the multicollinearity problems in this paper, instead of other data quality problems such as class imbalance [8], [9] or unlabeled datasets [10], and we attempt to find a method that can reduce multicollinearity when building models, without additional data processing.

According to Freund et al.'s book [11], multicollinearity problems can be solved by variable selection, redefining variables, and biased estimation. Selecting partial software metrics (variable selection) can reduce multicollinearity and lead to more accurate models. Variable selection is used in many applications to reduce noisy or redundant variables. However, it is not easy to determine how many metrics should be selected, and the reduced datasets might lose some important metrics. Variable redefinition includes methods based on knowledge of the variables and methods based on statistical analysis. The former methods need knowledge of the variables, and it is difficult to decide how to redefine variables, especially when there is a large collection of variables. The latter methods convert the original metrics into a new set of metrics. A famous representative is principal component analysis (PCA). One shortcoming of these methods is that we have to transform the new metrics back to obtain an estimator that can tell the effectiveness of the original metrics. Biased estimation methods can achieve better prediction without data processing for multicollinearity problems: they reduce multicollinearity while constructing models.

In this paper, we investigate two famous biased estimation methods (ridge regression (RR) [12], [13] and least absolute shrinkage and selection operator (lasso) regression [14]) for sorting software modules in order of defect count over cross-version datasets. Cross-version defect prediction is similar to practical use: a former version is used to train a model, and the constructed model is used to predict the latter version. It is more practical than the other two ways of building models, i.e., cross-project defect prediction and cross-validation defect prediction [15].

Our objectives include: 1) investigating whether RR or lasso regression (LAR) could perform better than LR and negative binomial regression (NBR) for cross-version defect prediction; 2) studying the choice of the ridge parameter of RR for software defect prediction; 3) checking whether there exist multicollinearity problems in the experimental datasets (11 projects from the PROMISE repository consisting of 41 different versions of these projects); and 4) comparing RR and LAR against the two best methods in our previous study [3] and one PCA-based method for sorting software modules according to the predicted number of defects.

The main contributions of this paper include the following.
1) First analysis of the choice of the ridge parameter of RR for software defect prediction for the ranking task.
2) First application of LAR for software defect prediction for the ranking task.
3) Further analysis of multicollinearity problems in 41 publicly available datasets in the software defect prediction domain.
4) A comparison study of RR and LAR with five other model construction algorithms for cross-version defect prediction for the ranking task.

The rest of this paper is organized as follows. We present an overview of related work in Section II. Section III describes multicollinearity, RR, and LAR. Section IV details the experimental methodologies, including datasets, evaluation measures, and implementation. We show and discuss experimental results in Section V. The threats to validity are described in Section VI. Section VII draws the conclusions and points out future work.

II. RELATED WORK

In this paper, we investigate the application of RR and LAR to solve multicollinearity problems in cross-version defect prediction for the ranking task. Therefore, this section presents the related work from three aspects: model construction methods for defect prediction for the ranking task, cross-version defect prediction, and multicollinearity in defect prediction.

A. Model Construction Methods for Defect Prediction for the Ranking Task

Sorting software modules in order of defect count is one kind of software defect prediction, which employs software metrics to predict the number of defects in order to support software testing activities [2], [3], [16].

Ohlsson and Alberg [17] used an LR method to construct software defect prediction models, in order to predict the number of defects in software modules before coding started. They applied the Alberg diagram [the same as the cumulative lift chart (CLC)] and percentages of defects in the top modules to evaluate models, and demonstrated the usefulness of the Alberg diagram. Khoshgoftaar and Allen [7] used LR to construct models to predict an order of software modules, and they pointed out that module-order models gave management more flexible reliability enhancement strategies than classification models.

Ostrand et al. [1] applied NBR and a very simple model based on only the lines of code to construct models, to predict the expected number of defects in each module of the next release of a large commercial system, employing percentages of defects in the top 20% of modules to evaluate software defect prediction models. NBR achieved better results than the simple model. However, from the graphs of the actual defects and models, there was much room for improving the prediction models. Ostrand et al. also used the prediction models to give the order of modules according to defect density.

Gao and Khoshgoftaar [5] compared eight count models over a full-scale industrial software system.


The comparative study showed that zero-inflated NBR and hurdle NBR were more effective according to the pairwise hypothesis testing techniques, information criteria-based comparative techniques, and Pearson's chi-square measure, but hurdle Poisson regression with threshold 2 was better according to the prediction accuracy. The comparison results of model construction methods highly depended on the performance measures used for evaluating models. Therefore, it is important to use appropriate performance measures for evaluating defect prediction models.

Schroter et al. [18] compared four methods, including RR, to construct prediction models. They used RR to penalize outliers. Judging from their performance measures and analysis, Schroter et al. focused on the classification task instead of the ranking task, and they concluded that RR was similar to LR, so they did not discuss RR in detail. Different from their work, we attempt to study the application of RR for software defect prediction for the ranking task, instead of the classification task. In addition, we analyze the choice of the ridge parameter of RR in this application, which was not included in Schroter et al.'s work [18].

Weyuker et al. [6] compared four approaches—NBR, RF, recursive partitioning, and Bayesian additive regression trees—for predicting the ranking of modules based on defects. Weyuker et al. proposed a new performance measure [fault-percentile-average (FPA)], which was based on the whole ranking given by the prediction models and put more emphasis on the former ranking. Experimental results indicated that NBR and RF models performed better than the other two models according to percentages of defects in the top 20% modules and FPA. Based on the longer time for fitting RF and the nondeterministic results of RF models, Weyuker et al. concluded that linear and additive models were good and realistic for software defect prediction.

We [3] proposed a learning-to-rank (LTR) approach to directly optimize the ranking performance measures, and the experimental results showed that the LTR approach and RF were better for constructing software defect prediction models for the ranking task: LTR was better over datasets with three metrics, and RF was better over the original datasets. We found that the poor results of LR models and generalized LR models might be caused by the existence of severe multicollinearity in the experimental datasets, whose condition numbers were larger than 100.

B. Cross-Version Defect Prediction

When using one-version data, holdout, cross-validation, and bootstrap family techniques are popular model validation techniques. Tantithamthavorn et al. [19] investigated the bias and variance of 12 model validation techniques in the domain of defect prediction, and they pointed out that selecting an appropriate model validation technique was a critical experimental design choice. The authors recommended the use of out-of-sample bootstrap validation in defect prediction studies.

Cross-project prediction [20] has been a challenging and hot topic in defect prediction recently. Shukla et al. [15] pointed out that, as a previous version of the same software project would have a similar parameter distribution among files, cross-version defect prediction was more practical than cross-project prediction and cross-validation prediction. Therefore, cross-version defect prediction has also been studied recently [3], [15], [21].

Shukla et al. [15] compared multiobjective defect prediction models with four traditional machine learning algorithms for cross-version defect prediction for the classification task, and they pointed out that the multiobjective logistic regression was more cost-effective than single-objective algorithms.

Bennin et al. [21] conducted a cross-release evaluation of 11 fault density prediction models (K star, M5, etc.) based on data collected from 25 open source software projects, using an effort-aware performance measure. The authors found that the M5 tree model performed best when assuming cross-release prediction, but there were no statistically significant differences among all methods.

C. Multicollinearity in Defect Prediction

The phenomenon that some software metrics are highly correlated with each other was found long ago. For example, Ohlsson et al. [17] used correlation coefficients to investigate not only the relationship between metrics and the number of defects, but also the relationship among 11 metrics, and they found that the metrics most correlated with the number of defects were highly correlated with each other. Graves et al. [22] computed correlation coefficients among complexity metrics, and found that most complexity metrics were highly correlated to the lines of code.

In order to reduce the redundancy of software metrics, many studies focus on variable selection methods. For example, Menzies et al. [23] used information gain to select metrics, and pointed out that two or three metrics could work as well as all metrics for building prediction models. Khoshgoftaar et al. [24] compared seven metric selection methods, and found that information gain and signal-to-noise ratio were better than other methods for choosing metrics to construct prediction models. Chen et al. [25] proposed a new metric selection method using multiobjective optimization, and the experimental results demonstrated the effectiveness of their proposed method.

Parsa et al. [26] applied recursive RR for selecting the most effective bug predictors of software, and used an association rule generation technique to find the main causes of program failure. Different from Parsa et al.'s work, we employ RR to directly construct defect prediction models based on all the software metrics in this paper, instead of selecting bug predictors. The analysis of the choice of the ridge parameter of RR was not included in Parsa et al.'s work [26].

III. MULTICOLLINEARITY AND PENALIZED REGRESSION

In this section, we introduce multicollinearity and two penalized regressions, i.e., RR and LAR.

A. Multicollinearity

Multicollinearity can be defined as the existence of strong correlations among the independent variables [11]. It can be caused by many reasons. One possible reason is that the physical meaning of variables (i.e., metrics in software defect prediction) determines the collinearity among them. For example, if lines of code correspond to lines of comments in a program, then lines of code are correlated to lines of comments in the dataset generated from this program.


Another reason is the large collection of metrics. In order to construct a better model, we tend to collect as many metrics that are relevant to defects as possible, in case we miss some key metrics of software modules. This often causes serious multicollinearity in datasets.

There are several ways to detect multicollinearity [27]. The first way is to compute the correlation matrix of the predictor variables. However, this method cannot tell clearly the degree of multicollinearity. Other ways include the tolerance value or variance inflation factor (VIF), eigenvalues, and the condition number. In this paper, we adopt the condition number and VIF to detect multicollinearity.

The condition number can be calculated as the ratio of the largest eigenvalue to the smallest eigenvalue:

$$\mathrm{cond} = \frac{\lambda_{\max}}{\lambda_{\min}} \tag{1}$$

where λ_max and λ_min are the largest and smallest eigenvalues, which are characteristic roots of the moment matrix X^T X. The moment matrix will be introduced in Section III-B. If the smallest eigenvalue is close to zero or the condition number is larger than 100, it indicates serious multicollinearity [13].

VIF can be computed as follows:

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2} \tag{2}$$

where R_i^2 is the coefficient of determination in the regression of explanatory variable x_i on the remaining explanatory variables of the model. If VIF_i is larger than 10, there exists high multicollinearity [27].

As mentioned in Section I, variable selection (such as information gain [3]), variable redefinition (such as PCA), and biased estimation are three main kinds of methods to deal with multicollinearity problems. In this paper, we investigate two biased estimation methods (RR [12] and LAR [14]) for software defect prediction. They are introduced in Section III-B.
prediction. They are introduced in Section III-B.
In this section, we detail datasets, evaluation measures, and
B. Ridge and Lasso Regression implementation, respectively.

Given n training metric vectors of software modules A. Datasets


xi = (xi,1 , xi,2 , · · · · · · , xi,d ) (i: 1 to n, xi is the metric vec-
tor of the ith software module, xi,j is the jth metric value in In this paper, we use the 11 open-source projects that include
vector xi , and d is number of metrics), which compose a met- data for three or more versions in the PROMISE repository
ric matrix X, and corresponding defect number yi , the goal of [15]. Different from Shukla et al.’s work [15], we employ the
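A minimal R sketch of this setup follows; it is our illustration, not the authors' code. MASS::lm.ridge reports the HKB and LW estimates of k and a GCV score for every candidate k, which matches the three estimators compared in this paper; the data frame train, its bug column, and the companion frame test are assumed stand-ins for two PROMISE versions.

    library(MASS)
    fit <- lm.ridge(bug ~ ., data = train, lambda = seq(0, 10, by = 0.01))
    k_gcv <- fit$lambda[which.min(fit$GCV)]  # k minimizing GCV [28]
    k_hkb <- fit$kHKB                        # Hoerl-Kennard-Baldwin estimate [29]
    k_lw  <- fit$kLW                         # Lawless-Wang estimate [30]
    # Refit at the chosen k; lm.ridge has no predict() method, so the
    # coefficients of (5) are applied to the next version manually.
    ridge <- lm.ridge(bug ~ ., data = train, lambda = k_gcv)
    pred  <- as.matrix(cbind(1, test[, names(train) != "bug"])) %*% coef(ridge)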
Ordinary least-squares regression minimizes RSS = (y − Xα)^T (y − Xα) directly, while RR minimizes RSS subject to the constraint $\sum_{i=1}^{d} |\alpha_i|^2 \le t$ (t ≥ 0). LAR minimizes RSS subject to the constraint $\sum_{i=1}^{d} |\alpha_i| \le t$ (t ≥ 0). The lasso shrinks the ordinary least-squares estimator towards zero and potentially sets α_i to zero for some i, so it can perform as a variable selection operator [31]. The lasso solution can be computed by the combined quadratic programming method [14], the shooting algorithm [31], least angle regression [32], and the coordinate descent algorithm [33].
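For the lasso, a corresponding sketch with the lars package, which implements least angle regression [32], is shown below. How the constraint t is tuned is not specified by the paper, so the cross-validation step is our assumption; train and test are the same hypothetical data frames as above.

    library(lars)
    x <- as.matrix(train[, names(train) != "bug"])
    fit <- lars(x, train$bug, type = "lasso")
    # Pick the constraint (as a fraction of the full L1 norm) by CV on the
    # training version only, then predict the next version.
    cv <- cv.lars(x, train$bug, type = "lasso", mode = "fraction", plot.it = FALSE)
    frac <- cv$index[which.min(cv$cv)]
    pred <- predict(fit, newx = as.matrix(test[, colnames(x)]),
                    s = frac, mode = "fraction")$fit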
IV. EXPERIMENTAL SETUP

In this section, we detail the datasets, evaluation measures, and implementation.

A. Datasets

In this paper, we use the 11 open-source projects that include data for three or more versions in the PROMISE repository [15]. Different from Shukla et al.'s work [15], we employ the defect number instead of the categories (defect-prone or not) as the dependent variable. These datasets include the Chidamber and Kemerer (CK) metrics (http://openscience.us/repo/defect/ck/) [34], [35], which are popularly used in the software defect prediction domain [15], [36].

The characteristics of these experimental datasets are presented in Table I. The column "faulty modules" records the number of modules having defects (with the percentages of faulty modules in the subsequent brackets), the column "range of defects" records the ranges of defect numbers in the corresponding datasets, and the column "total defects" records the total number of defects in all the modules of the corresponding datasets.

TABLE I: EXPERIMENTAL DATASETS


B. Evaluation Measures

Because we research software defect prediction for the ranking task instead of the classification task, we adopt the CLC (also referred to as the Alberg diagram) [17], [37] and FPA [6], instead of recall, F-measure, or other classification indicators. In our previous work [3], we proved that CLC and FPA are consistent for evaluating a ranking. Nevertheless, we adopt both CLC and FPA as performance measures because they are advocated by different researchers. Details of FPA and CLC are described below.

Considering m modules f_1, f_2, ..., f_m, listed in increasing order of predicted defect number, s_i as the actual defect number in module f_i, and s = s_1 + s_2 + ... + s_m as the total number of defects in all the modules, the proportion of actual defects in the top t predicted modules (i.e., the top t modules predicted to have the most defects) to the whole defects is $\frac{1}{s}\sum_{i=m-t+1}^{m} s_i$. Then FPA is defined as follows [6]:

$$\mathrm{FPA} = \frac{1}{m}\sum_{t=1}^{m}\frac{1}{s}\sum_{i=m-t+1}^{m} s_i .$$

Actually, FPA is the average of the proportions of actual defects in the top i (i: 1 to m) predicted modules. A larger FPA means better ranking performance, and can help allocate testing resources better on the whole.

The CLC [37] uses percentages of modules as the x-axis and percentages of defects as the y-axis. The area under the CLC is always used for comparison, and this area is simply denoted as CLC in this paper. Considering m modules f_1, f_2, ..., f_m, listed in increasing order of predicted defect number, s_i as the actual defect number in module f_i, and s = s_1 + s_2 + ... + s_m as the total number of defects in all the modules, the area under the curve should be computed as the sum of the areas of the trapezoids composed of two adjacent points and the axes in the CLC. Using CLC to denote the area under the curve in the rest of the paper, it can be computed as follows:

$$\mathrm{CLC} = \sum_{t=1}^{m}\mathrm{trapezoid}_t = \frac{1}{m}\left[\frac{1}{2}\left(0+\frac{s_m}{s}\right)+\cdots+\frac{1}{2}\left(\frac{s_m+\cdots+s_2}{s}+\frac{s_m+\cdots+s_1}{s}\right)\right]. \tag{6}$$
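Both measures reduce to a few lines of R. The sketch below (ours) computes FPA and the area under the CLC exactly as defined above; pred holds the predicted and actual the true defect count per module, and ties in the predicted order are broken arbitrarily by order().

    fpa <- function(pred, actual) {
      s_ord <- actual[order(pred)]   # modules in increasing predicted order
      m <- length(s_ord); s <- sum(s_ord)
      # s_i enters the top-t sum for t = m - i + 1, ..., m, i.e., i times.
      sum(seq_len(m) * s_ord) / (m * s)
    }
    clc <- function(pred, actual) {
      s_ord <- actual[order(pred)]
      m <- length(s_ord); s <- sum(s_ord)
      y <- c(0, cumsum(rev(s_ord)) / s)          # cumulative defect proportions
      sum((head(y, -1) + tail(y, -1)) / 2) / m   # trapezoid areas of Eq. (6)
    }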

C. Implementation

In this section, we present the implementation details. According to the objectives, we mainly conduct four experiments.
1) First, we compute condition numbers and VIFs of these datasets to detect the degree of multicollinearity.
2) Subsequently, we compare the performance of RR and LAR with LR and NBR, in order to investigate whether RR or LAR could perform better than LR and NBR for cross-version defect prediction. We also study different choices of the ridge parameter of RR for software defect prediction.
3) After that, we compare RR and LAR with principal component regression (PCR).
4) Finally, we compare RR and LAR with the two best methods (RF and LTR) in the previous study [3] for sorting software modules in order of defect count, in order to investigate whether RR or LAR is a good method for sorting modules in order of defect count. We also compare the model construction time of all methods.

Condition numbers and VIFs are computed in MATLAB (2009); RR, LAR, LR, NBR, RF, and PCR are implemented in R (http://www.r-project.org/); and LTR is implemented in Java. Like the previous work [3], we use the default parameters for LR and NBR, and set the number of trees for RF to 500. For LTR, we set the feasible solution space to $\Omega = \prod_{i=1}^{d} [-20, 20]$, and both the population size and the maximal generation are set to 100. For LAR, the lasso solution is computed by least angle regression [32]. We compare three estimation methods for the ridge parameter k of RR: GCV [28], HKB [29], and LW [30]. For PCR, we use the cumulative percent variance to select the number of principal components [38].

There are mainly three prediction techniques for software defect prediction [15], which are as follows.


1) Cross-validation prediction: Training and testing data are from the same version of a project. That is, defect prediction models are built from partial data and used to test the remaining data.
2) Cross-version prediction: The former version of the data
is used for training a model and the latter version is used
as a validation set to evaluate the model.
3) Cross-project prediction: Prediction models built from
one project are used to test data of another project.
Considering different versions of the same software project
might have similar parameter distribution among files [15], we
use cross-version prediction in this paper. The former version of
the data is used to build the model and the latter version is used
as a validation set to evaluate the model. For example, we use
ant-1.3 to train a model and use the model to predict ant-1.4.
Because metrics with only one value can cause “NAN” values in
the scale matrix, which is needed in some regression methods,
we delete these metrics before applying the model construction
methods.
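The protocol above can be sketched in R as follows. This is our illustration rather than the authors' scripts: the CSV file names and the bug column reflect the PROMISE CK data layout, and we assume identifier columns have already been dropped so that only numeric metrics and the defect count remain.

    train <- read.csv("ant-1.3.csv")   # former version: training data
    test  <- read.csv("ant-1.4.csv")   # latter version: validation data
    # Delete metrics with only one value; they yield NaNs when data are scaled.
    single <- sapply(train, function(v) length(unique(v)) <= 1)
    train <- train[, !single]; test <- test[, !single]
    lr <- lm(bug ~ ., data = train)               # LR with default parameters
    library(MASS)
    nbr <- glm.nb(bug ~ ., data = train)          # NBR with default parameters
    fpa(predict(lr, test), test$bug)              # evaluate on the next version
    fpa(predict(nbr, test, type = "response"), test$bug)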
Because some methods such as RF are not deterministic ap-
proaches (that is, they may obtain different results over the same
training and testing datasets), all methods run ten times for the
same set of data. As a result, we obtain ten results for each
method over each set of training and testing data. The Wilcoxon
rank-sum test [39] (called ranksum for short in the rest of the paper) is used as the statistical test. Ranksum is a nonpara-
metric test of the null hypothesis that two populations are the
same against an alternative hypothesis that a particular popula-
tion tends to have larger values than the other. To be specific,
ranksum is used to test whether ten results by one method are
significantly larger than those by another method using the same
training and testing data. The sample size of ten per group is suf-
ficient for ranksum at 0.05 significance. We also use Wilcoxon
signed-rank test [40] to test the significance of the differences
between mean results achieved by the two methods over all
datasets.
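Both tests are available in base R through wilcox.test. In the sketch below (ours), a and b are hypothetical vectors of the ten FPA results of two methods on one training/testing pair, and mean_a and mean_b are the mean results of two methods over all the sets of data.

    wilcox.test(a, b, alternative = "greater")   # ranksum [39]: per set of data
    wilcox.test(mean_a, mean_b, paired = TRUE)   # signed-rank [40]: over all sets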

V. EXPERIMENTAL RESULTS

In this section, we show the experimental results. First, we detect the degree of multicollinearity in the experimental datasets. Subsequently, we show the comparison of RR and LAR with LR and NBR for sorting modules in order of defect count. After that, the comparison results of RR and LAR with PCR are shown. Finally, the comparison of RR and LAR with RF and LTR is reported, and the model construction time is shown.

A. Multicollinearity in Experimental Datasets

In this section, we compute condition numbers and VIFs of our experimental datasets. There are 20 metrics, so there are 20 VIF values for each dataset. We record the maximum VIFs and average VIFs. Results are shown in Table II.

TABLE II: CONDITION NUMBERS AND VIFS OF EXPERIMENTAL DATASETS

According to He's definition [13], when the condition number is larger than 100, there exists severe multicollinearity. According to El-Dereny and Rashwan [27], when VIF is larger than 10, there exists high multicollinearity. From Table II, all condition numbers are larger than 1000, and all average VIFs are larger than 10, not to mention the maximum VIFs. Therefore, all experimental datasets have serious multicollinearity problems according to both condition numbers and VIFs.

B. Comparison of RR and LAR With LR and NBR

LR [7], [17] and NBR [1], [6] are popular methods for sorting software modules in order of defect count. However, multicollinearity, particularly severe multicollinearity, might cause an unstable inverse matrix. When LR or generalized LR uses the unstable inverse matrix, it might result in poor performance of prediction models. Section V-A has shown that there exist multicollinearity problems over all the experimental data. Hence, it might be inappropriate to directly use ordinary LR or NBR. In this case, biased estimation methods such as RR [12] and LAR [14], which can deal with multicollinearity problems, should be good options. Therefore, in this section, we show the comparison results of four regression models: RR, LAR, LR, and NBR, in order to investigate whether the two biased estimation methods (RR and LAR) can outperform LR and NBR over these datasets.


Because there are different ways of choosing a ridge parameter of RR, we also compare three choices for the ridge parameter: GCV [28], HKB [29], and LW [30], in order to study the best choice for the ridge parameter when applying RR for software defect prediction.

As mentioned in Section IV, we adopt the area under CLC [17], [37] and FPA as performance measures. The four compared regression methods in this section are deterministic approaches. That is, when using the same training and testing data, they obtain the same results. Hence, the ten results in ten runs for each set of data are the same, and the variances of these ten results are zero. If result A is larger than result B, A is significantly larger than B. Mean FPA and CLC results of these regression models are shown in Table III.

TABLE III: MEAN FPA AND CLC RESULTS OF RR, LAR, NBR, AND LR MODELS FOR CROSS-VERSION DEFECT PREDICTION

"Ant-1.3-1.4" in Table III means using ant-1.3 as the training set and ant-1.4 as the testing set, and the others are similar. RR-GCV means using GCV to decide the ridge parameter of RR, and the others are similar. The best FPA and CLC results achieved by these regression models are in bold type. The row of "maximum times" records the times of achieving maximum FPA or CLC values by the corresponding methods. The row of "compare with NBR" summarizes the comparison results of the corresponding method with NBR over each set of data. For example, "24-0-6" in column "RR-GCV" means that, compared with NBR, RR-GCV performs better over 24 sets of data, equally over no sets of data, and worse over 6 sets of data. The row of "compare with LR" summarizes the comparison results of the corresponding method with LR over each set of data. The row of "total mean" records the mean results of the corresponding method over all the sets of data. The row of "final with NBR" records the p-values of the Wilcoxon signed-rank test for mean results achieved by the corresponding method and NBR over all the sets of data. The rows of "final with LR" and "final with LAR" are similar.

As pointed out in the previous work [3], CLC and FPA are consistent for evaluating a ranking. From Table III, RR performs better than NBR in 24 out of 30 sets of data, and LAR performs better than NBR in 25 out of 30 sets of data. Compared with LR, RR and LAR perform better in no less than 19 out of 30 sets of data. Therefore, RR and LAR perform better than LR and NBR in most cases. According to the p-values in the rows of "final with NBR" and "final with LR", RR and LAR achieve significantly better results than LR and NBR at the 0.05 significance level. This implies that RR and LAR are more appropriate than LR and NBR for software defect prediction when there exists multicollinearity. Considering the severe multicollinearity problem in other datasets in our previous work [3], multicollinearity might be a common problem in software defect prediction. In such a case, we recommend using RR or LAR to deal with multicollinearity problems instead of using LR and NBR.

Compared with LAR, RR-GCV performs better over 19 sets of data, equally over 1 set of data, and worse over 10 sets of data. RR-LW performs better over 19 sets of data and worse over 11 sets of data than LAR. RR-HKB performs better over 18 and worse over 12 sets of data than LAR. However, the times of achieving maximum FPA or CLC values by RR and LAR are similar (8, 7, 5, 6).

According to the p-values in the last row of Table III, the differences between RR results and LAR results are not significant at the 0.05 significance level. Therefore, RR is slightly better than LAR in general, but not significantly.

The total mean of RR-GCV is slightly larger than those of RR-LW and RR-HKB. Compared with RR-GCV, RR-LW performs better over 14 sets of data, equally over 2 sets of data, and worse over 14 sets of data. RR-HKB performs better over 12 sets, equally over 1 set, and worse over 17 sets of data than RR-GCV. RR-HKB performs better over 13 sets, equally over 2 sets, and worse over 15 sets of data than RR-LW. In addition, the p-values of any two of these three RR models are larger than 0.4. Therefore, GCV is slightly better for choosing the ridge parameter of RR for software defect prediction, but not significantly.

C. Comparison of RR and LAR With PCR

In this section, we compare RR and LAR with a PCA-based method: PCR. PCA is a powerful tool for analyzing and compressing data [41]. PCA can convert the original metrics into a new set of metrics that are linearly uncorrelated, without much loss of information. In this experiment, cumulative percent variances at 0.9 and 0.99 are used to select the number of principal components [38].
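A PCR sketch under these settings (ours; the helper name and data layout are assumptions) keeps the smallest number of components whose cumulative percent variance reaches the threshold, fits ordinary least squares on the component scores, and projects the testing version onto the same components:

    pcr_predict <- function(train, test, threshold = 0.9) {
      # assumes constant metrics were already removed, so scaling is safe
      x  <- as.matrix(train[, names(train) != "bug"])
      pc <- prcomp(x, scale. = TRUE)
      cumvar <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
      q <- which(cumvar >= threshold)[1]   # number of principal components
      fit <- lm(train$bug ~ ., data = as.data.frame(pc$x[, 1:q, drop = FALSE]))
      # Note: fit's coefficients refer to components; mapping them back to the
      # original metrics requires the loadings in pc$rotation.
      scores <- predict(pc, as.matrix(test[, colnames(x)]))[, 1:q, drop = FALSE]
      predict(fit, as.data.frame(scores))
    }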
When the number of principal components is fixed, PCR is also a deterministic approach, which obtains the same results when using the same training and testing data, so the variance of the ten results achieved by PCR is also zero. Mean FPA and CLC results of RR, LAR, and PCR models are shown in Table IV.

TABLE IV: MEAN FPA AND CLC RESULTS OF RR, LAR, AND PCR MODELS FOR CROSS-VERSION DEFECT PREDICTION

"PCR9" means that the number of principal components for PCR is decided by the cumulative percent variance at 0.9, and "PCR99" means that it is decided by the cumulative percent variance at 0.99. The best FPA and CLC results achieved by these regression models are in bold type. The row of "maximum times" records the times of achieving maximum FPA or CLC values by the corresponding methods. The rows of "compare with PCR9" and "compare with PCR99" summarize the comparison results of the corresponding method with PCR9 and PCR99, respectively, over each set of data. The row of "total mean" records the mean results of the corresponding method over all the sets of data. The row of "final with PCR9"/"final with PCR99" records the p-values of the Wilcoxon signed-rank test for mean results achieved by the corresponding method and PCR9/PCR99, respectively, over all the sets of data.

As illustrated in Table IV, RR performs better than PCR in no less than 17 out of 30 sets of data, and LAR performs better than PCR in 16 out of 30 sets of data. That is, RR and LAR perform better than PCR in more sets of data than the reverse. The total mean of RR is slightly larger than that of PCR. However, LAR and PCR99 achieve maximum FPA or CLC values more often. In addition, according to the p-values in the last two rows, there are no significant differences between the RR/LAR and PCR results over all the sets of data. Therefore, RR is slightly better than PCR in general, but not significantly.


As mentioned above, PCA converts the original metrics into a new set of metrics, so the vector of estimated coefficients obtained by ordinary least-squares regression corresponds to the selected principal components instead of the original metrics. Hence, we need to transform this vector back to get the final PCR estimator, in order to describe the relationship between defects and the original metrics and to predict new software modules. In contrast, RR and LAR need no such conversion or transformation, because their models directly describe the relationship between defects and the original metrics. Together with the slightly better performance of RR in Table IV, this makes RR a good choice for cross-version defect prediction.

D. Comparison of RR and LAR With RF and LTR

In our previous work [3], RF and LTR performed best among the compared methods for sorting software modules in order of defect count. In order to investigate whether RR or LAR is a good method for sorting modules in order of defect count, we show the comparison of RR and LAR with RF and LTR. Table V shows the mean calculated over the ten testing FPA and CLC results in ten runs.

TABLE V: MEAN FPA AND CLC RESULTS OF RR, LAR, RF, AND LTR MODELS FOR CROSS-VERSION DEFECT PREDICTION

To be noted, because the variances are not zero for RF or LTR models, the largest results might not be significantly larger than the other results; thus, the largest FPA and CLC results are not shown in bold type and the row of "maximum times" is not recorded in Table V. The rows of "compare with RF" and "compare with LTR" record the comparison results of the corresponding method with RF and LTR. For example, "13-2-15" in column "RR-GCV" means that, compared with RF, RR-GCV performs significantly better over 13 sets of data, equally over 2 sets of data, and worse over 15 sets of data. Ranksum at 0.05 significance is used to test whether or not the results of one method are significantly better than those of another method in ten runs over each set of data. The row of "total mean" records the mean results of the corresponding method over all sets of data. The row of "final with RF"/"final with LTR" records the p-values of the Wilcoxon signed-rank test for mean results achieved by the corresponding method and RF/LTR, respectively, over all the sets of data.

Compared with LTR, RR-GCV performs no worse in 19 out of 30 sets, RR-LW performs no worse in 20 out of 30 sets, RR-HKB performs no worse in 21 out of 30 sets of data, and LAR performs no worse in 15 out of 30 sets of data. The total mean of RR is larger than that of LTR, but the total mean of LAR is smaller than that of LTR. The p-values of RR/LAR with LTR are larger than 0.08. Therefore, RR models achieve slightly better results, and LAR models slightly worse results, than LTR, but not significantly.

Shih [42] pointed out that RF could deal well with correlation and high-order interactions, and recommended using RF when working with highly correlated data or data with many interactions. This means that RF can handle multicollinearity problems. As given in Table V, the total mean of RF models is even larger than that of RR. Nevertheless, according to the p-values, there is no significant difference between RR results and RF results at the 0.05 significance level. Therefore, RR can achieve comparable results with RF.

Considering that it is more difficult for RF models to interpret the relationship between defects and the original metrics, RR is still a good choice.

Table VI shows the mean model construction time in ten runs of all compared methods in our experiments.

TABLE VI: MEAN MODEL CONSTRUCTION TIME (IN SECONDS) OF ALL METHODS FOR CROSS-VERSION DEFECT PREDICTION

The row of "mean time" records the mean time of the corresponding method over all sets of data. The row of "final with RR-GCV" records the p-values of the Wilcoxon signed-rank test for mean results achieved by the corresponding method and RR-GCV over all sets of data. From Table VI, the model construction time of RR-LW, RR-HKB, or PCR99 is not significantly different from that of RR-GCV. LR costs the least time to build models among these methods. LTR, RF, NBR, and LAR cost more time to construct defect prediction models than RR-GCV.

E. Discussion

In this section, we summarize the above experimental results.

There exist severe multicollinearity problems in all experimental datasets according to both condition numbers and VIFs. Considering the serious multicollinearity problem in other datasets in our previous work [3], multicollinearity might be a common problem in software defect prediction.

The comparison of RR with the other methods can be summarized in Table VII. "Worse" in Table VII means that, compared with RR, the corresponding method performs worse; the other entries are similar.

TABLE VII: COMPARISON OF RR WITH ALL OTHER METHODS FOR CROSS-VERSION DEFECT PREDICTION

RR can achieve better results than LR and NBR; slightly (not significantly) better results than LAR, PCR, and LTR; and slightly (not significantly) worse results than RF. It is easier for RR models than for RF to interpret the relationship between defects and the original metrics. RR is also a deterministic method, whereas RF is a nondeterministic method, which cannot produce the same model when using the same training data. In addition, RR costs less time to construct models than LTR, RF, NBR, and LAR. Therefore, RR is a good choice for software defect prediction for the ranking task when multicollinearity problems exist.

GCV is slightly better than LW and HKB for choosing the ridge parameter of RR for software defect prediction, but not significantly.

LAR also performs better than LR and NBR, and achieves comparable results with PCR and LTR. Compared with RF, LAR performs worse, but LAR costs less construction time and its model is easier to interpret in terms of the relationship between defects and the original metrics. As mentioned in Section III, LAR shrinks the ordinary least-squares estimator toward zero and potentially sets some coefficients to zero [31], so LAR can perform as a variable selection operator. The merits of LAR and RF are different.

To sum up, compared with LR and NBR, all the other compared methods (RR, LAR, PCR, RF, and LTR) can deal with multicollinearity problems to a degree. When considering only the model performance, RF and RR are better for software defect prediction for the ranking task when multicollinearity problems exist. From the three aspects of model performance, interpretation, and construction time, RR is a good choice. Among the three compared algorithms (GCV, LW, and HKB) for selecting the ridge parameter, GCV is slightly better.

VI. THREATS TO VALIDITY

In order to simulate real applications, we use 11 open-source projects including 41 versions in the PROMISE repository [15] to conduct experiments. We use a former version of a project to train models and use the models to predict the latter version. In order to investigate the performance of RR and LAR, we compare them with ordinary LR, generalized linear regression (NBR), one PCA-based method, and the two best methods in our previous study (RF and LTR).


All of these make us confident that the obtained results are strongly related to the software defect prediction domain, and the results are convincing. However, some potential threats to validity should be considered.

A. Threats to Internal Validity

One threat is the choice of parameters. The best parameter might be different for each dataset. In this paper, we simply set the parameters of the four compared methods according to our previous work [3]. In addition, only cumulative percent variances at 0.9 and 0.99 are used to select the number of principal components for PCR, the lasso solution is computed only by least angle regression, and only three methods are used for selecting the parameter of RR. These parameters might not be the best parameters, and these methods might not be the best methods. The results might be different when using different parameters or different methods.

Another threat to validity is the choice of the model performance measures. We adopt the area under CLC (also referred to as the Alberg diagram) [17], [37] and FPA [6] as the performance measures, because this paper focuses on software defect prediction for the ranking task instead of the classification task. However, there exist other performance measures. For example, the percentage of defects found in the top 20% of modules is a performance measure for software defect prediction for the ranking task. The conclusion might be different when adopting other performance measures.

B. Threats to External Validity

The main threat to external validity is the experimental data. Our experiments are based on a large collection of publicly available datasets, so the conclusions should be convincing. However, these datasets are only a very small part of all datasets in the real world. It is not necessary that all datasets have severe multicollinearity problems, and the conclusions over these 41 datasets might not hold for other datasets, especially when other datasets have no multicollinearity problems.

VII. CONCLUSION AND FUTURE WORK

Sorting software modules in order of defect count is very important because it can help software developers to allocate testing resources effectively and efficiently. Generalized LR approaches including LR have been popularly used and have been demonstrated to be effective for this task. However, our previous study showed that these regression approaches did not perform well over datasets having multicollinearity problems [3]. In this paper, we investigate 11 open-source projects (including 41 versions) in the PROMISE repository, and find that these datasets also have multicollinearity problems. Considering that biased estimation methods can achieve better prediction without data processing for multicollinearity problems, we employ two famous biased estimation methods (RR [12], [13] and LAR [14]) for sorting software modules in order of defect count over cross-version datasets. Our experimental results show the following conclusions.
1) According to both condition numbers and VIFs, all of these experimental datasets have severe multicollinearity problems. Considering the serious multicollinearity problem in other datasets in our previous work [3], multicollinearity might be a common problem in software defect prediction, and the problem should be solved.
2) RR and LAR perform better than LR and NBR over most experimental datasets. This implies that RR and LAR can overcome the multicollinearity problems over these datasets. Optimistically, it might imply that RR and LAR can deal with multicollinearity problems in the software defect prediction domain.
3) When considering only the model performance, RF and RR are better for software defect prediction for the ranking task when multicollinearity problems exist. From the three aspects of model performance, interpretation, and construction time, RR is an attractive approach to sort software modules according to the number of defects.
4) The difference among the compared methods for selecting the ridge parameter is not very significant. Nevertheless, GCV is slightly better for choosing the ridge parameter of RR for software defect prediction.

In general, RR and RF can perform best among all methods without data processing, according to our experimental results. Compared with RF, the merits of RR include: RR is a deterministic approach, it costs less time to construct RR models, and it is easy for RR models to interpret the relationship between defects and metrics. In our previous work [3], when using variable selection to select three metrics, LTR performed better than RF. However, RF without variable selection achieved larger FPA results than LTR using variable selection in most cases. Therefore, the conclusion that RR achieves comparable results with RF without variable selection might imply that RR also achieves comparable results with LTR using variable selection. In our future work, we will further investigate the choice of parameters of biased estimation methods for software defect prediction and compare them with more methods (including variable selection methods and variable redefinition methods).

REFERENCES

[1] T. Ostrand, E. Weyuker, and R. Bell, "Predicting the location and number of faults in large software systems," IEEE Trans. Softw. Eng., vol. 31, no. 4, pp. 340–355, Apr. 2005.
[2] M. D'Ambros, M. Lanza, and R. Robbes, "Evaluating defect prediction approaches: A benchmark and an extensive comparison," Empirical Softw. Eng., vol. 17, pp. 531–577, 2012.
[3] X. Yang, K. Tang, and X. Yao, "A learning-to-rank approach to software defect prediction," IEEE Trans. Rel., vol. 64, no. 1, pp. 234–246, Mar. 2015.


[4] E. Mills, "Software metrics," Defense Tech. Rep. CMU/SEI-88-CM-012, Inf. Center, Fort Belvoir, VA, USA, DTIC Document, 1988.
[5] K. Gao and T. Khoshgoftaar, "A comprehensive empirical study of count models for software defect prediction," IEEE Trans. Rel., vol. 56, no. 2, pp. 223–236, Jun. 2007.
[6] E. Weyuker, T. Ostrand, and R. Bell, "Comparing the effectiveness of several modeling methods for fault prediction," Empirical Softw. Eng., vol. 15, no. 3, pp. 277–295, 2010.
[7] T. M. Khoshgoftaar and E. B. Allen, "A comparative study of ordering and classification of fault-prone software modules," Empirical Softw. Eng., vol. 4, no. 2, pp. 159–186, 1999.
[8] K. Bennin, J. Keung, A. Monden, and Y. Kamei, "Investigating the effects of balanced training and testing datasets on effort-aware fault prediction models," in Proc. IEEE 40th Annu. Comput. Softw. Appl. Conf., 2016, pp. 154–163.
[9] S. Wang and X. Yao, "Using class imbalance learning for software defect prediction," IEEE Trans. Rel., vol. 62, no. 2, pp. 434–443, Jun. 2013.
[10] J. Nam and S. Kim, "Clami: Defect prediction on unlabeled datasets," in Proc. 30th IEEE/ACM Int. Conf. Automated Softw. Eng., 2015, pp. 452–463.
[11] R. Freund, W. Wilson, P. Sa, and C. Shen, Regression Analysis: Statistical Modeling of a Response Variable. Chongqing, China: Chongqing Univ. Press, 2012.
[12] A. Hoerl and R. Kennard, "Ridge regression: Biased estimation for nonorthogonal problems," Technometrics, vol. 12, pp. 55–67, 1970.
[13] X. He, Practical Regression Analysis. Beijing, China: Higher Educ. Press, 2008.
[14] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Statist. Soc., Series B, vol. 58, no. 1, pp. 267–288, 1996.
[15] S. Shukla, T. Radhakrishnan, and K. Muthukumaran, "Multi-objective cross-version defect prediction," Soft Computing, vol. 22, pp. 1959–1980, 2016.
[16] T. Zimmermann, R. Premraj, and A. Zeller, "Predicting defects for eclipse," in Proc. Int. Workshop Predictor Models Softw. Eng., IEEE, 2007, pp. 9–15.
[17] N. Ohlsson and H. Alberg, "Predicting fault-prone software modules in telephone switches," IEEE Trans. Softw. Eng., vol. 22, no. 12, pp. 886–894, Dec. 1996.
[18] A. Schroter, T. Timmermann, and A. Zeller, "Predicting component failures at design time," in Proc. ACM/IEEE Int. Symp. Empirical Softw. Eng., 2006, pp. 18–27.
[19] C. Tantithamthavorn, S. McIntosh, A. E. Hassan, and K. A. Matsumoto, "An empirical comparison of model validation techniques for defect prediction models," IEEE Trans. Softw. Eng., vol. 43, no. 1, pp. 1–18, Jan. 2017.
[20] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project defect prediction: A large scale experiment on data vs. domain vs. process," in Proc. 7th Joint Meeting Eur. Softw. Eng. Conf. ACM SIGSOFT Int. Symp. Foundations Softw. Eng., 2009, pp. 91–100.
[21] K. E. Bennin, K. Toda, Y. Kamei, J. Keung, A. Monden, and N. Ubayashi, "Empirical evaluation of cross-release effort-aware defect prediction models," in Proc. IEEE Int. Conf. Softw. Quality Rel. Security, 2016, pp. 214–221.
[22] T. Graves, A. Karr, J. Marron, and H. Siy, "Predicting fault incidence using software change history," IEEE Trans. Softw. Eng., vol. 26, no. 7, pp. 653–661, Jul. 2000.
[23] T. Menzies, J. Greenwald, and A. Frank, "Data mining static code attributes to learn defect predictors," IEEE Trans. Softw. Eng., vol. 33, no. 1, pp. 2–13, Jan. 2007.
[24] T. Khoshgoftaar, K. Gao, and A. Napolitano, "An empirical study of feature ranking techniques for software quality prediction," Int. J. Softw. Eng. Knowl. Eng., vol. 22, no. 2, pp. 161–183, 2012.
[25] X. Chen, Y. Shen, Z. Cui, and X. Ju, "Applying feature selection to software defect prediction using multi-objective optimization," in Proc. IEEE 41st Annu. Comput. Softw. Appl. Conf., 2017, pp. 54–59.
[26] S. Parsa, M. Vahidi-Asl, and S. A. Naree, "Finding causes of software failure using ridge regression and association rule generation methods," in Proc. 9th ACIS Int. Conf. Softw. Eng., Artif. Intell., Netw., Parallel/Distrib. Comput., 2008, pp. 873–878.
[27] M. El-Dereny and N. I. Rashwan, "Solving multicollinearity problem using ridge regression models," Int. J. Contemp. Math. Sci., vol. 6, no. 12, pp. 585–600, 2011.
[28] G. Golub, M. Heath, and G. Wahba, "Generalized cross-validation as a method for choosing a good ridge parameter," Technometrics, vol. 21, no. 2, pp. 215–223, 1979.
[29] A. Hoerl, R. Kennard, and K. Baldwin, "Ridge regression: Some simulations," Commun. Statist., vol. 4, pp. 105–123, 1975.
[30] J. Lawless and P. Wang, "A simulation study of ridge and other regression estimators," Commun. Statist.–Theory Methods, vol. 5, no. 4, pp. 307–323, 1976.
[31] W. J. Fu, "Penalized regressions: The bridge versus the lasso," J. Comput. Graph. Statist., vol. 7, no. 3, pp. 397–416, 1998.
[32] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Ann. Statist., vol. 32, no. 2, pp. 407–499, 2004.
[33] T. Wu and K. Lange, "Coordinate descent algorithms for lasso penalized regression," Ann. Appl. Statist., vol. 2, no. 1, pp. 224–244, 2008.
[34] S. Chidamber and C. Kemerer, "A metrics suite for object oriented design," IEEE Trans. Softw. Eng., vol. 20, no. 6, pp. 476–493, Jun. 1994.
[35] M. Jureczko and L. Madeyski, "Towards identifying software project clusters with regard to defect prediction," in Proc. 6th Int. Conf. Predictive Models Softw. Eng., New York, NY, USA, 2010, pp. 9:1–9:10.
[36] G. Canfora, A. Lucia, M. Penta, R. Oliveto, A. Panichella, and S. Panichella, "Multi-objective cross-project defect prediction," in Proc. IEEE 6th Int. Conf. Softw. Testing, Verification Validation, 2013, pp. 252–261.
[37] Y. Jiang, B. Cukic, and Y. Ma, "Techniques for evaluating fault prediction models," Empirical Softw. Eng., vol. 13, no. 5, pp. 561–595, 2008.
[38] S. Valle, W. Li, and S. J. Qin, "Selection of the number of principal components: The variance of the reconstruction error criterion with a comparison to other methods," Ind. Eng. Chem. Res., vol. 38, pp. 4389–4401, 1999.
[39] M. Fay and M. Proschan, "Wilcoxon–Mann–Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules," Statist. Surveys, vol. 4, pp. 1–39, 2010.
[40] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
[41] L. I. Smith, "A tutorial on principal components analysis," Cornell Univ., Ithaca, NY, USA, Tech. Rep. OUCS-2002-12, 2002.
[42] S. S. Shih, "Random forests, for model (and predictor) selection," UCLA Ling 251: Variation in Phonology, 2013.

Xiaoxing Yang received the B.S. and Ph.D. degrees in computer science from the University of Science and Technology of China, Hefei, China. She is currently an Associate Research Fellow with the School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China. Her research interests include software defect prediction, machine learning, evolutionary computation, and cloud computing.

Wushao Wen received the B.S. degree from the University of Science and Technology of China, Hefei, China, in 1993, and the M.S. and Ph.D. degrees from the University of California, Davis, CA, USA, in 1999 and 2001, respectively. He is currently a Professor with the School of Data and Computer Science, Sun Yat-Sen University, China. He was an Engineer and a Project Manager with China Telecommunication, Inc., from 1993 to 1997. From 2000 to 2001, he was an Engineer with Cisco Systems, Inc., USA. From 2001 to 2004, he was a Senior Engineer with CIENA Corporation, Cupertino, CA, leading the design and implementation of optical routing and signaling systems. He was a Senior Networking Expert with McAfee, Inc., from 2004 to 2006, and a Staff Engineer with Juniper Networks from 2006 to 2009 in the network intrusion detection area. He has been a Professor with Sun Yat-Sen University since 2009 and was appointed as the Chief Director of the Networks and Information Center and the Director of the Shared Experimentation Teaching Center, Sun Yat-Sen University, China, in 2013 and 2014, respectively. He is currently doing research in cloud computing, network security, network architectures, and multimedia networks.

