
Proceedings of the SMART–2021, IEEE Conference ID: 52563

10th International Conference on System Modeling & Advancement in Research Trends, 10th–11th December, 2021
Faculty of Engineering & Computing Sciences, Teerthanker Mahaveer University, Moradabad, India

Comparative Analysis of Supervised Learning Techniques of Machine Learning for Software Defect Prediction

978-1-6654-3970-1/21/$31.00 ©2021 IEEE | DOI: 10.1109/SMART52563.2021.9676307
Anurag Gupta1, Ratnesh Kumar Shukla2, Dr. Abhishek Bhola3, Alok Singh Sengar4
1,2Computer Science & Engineering, College of Computing Science & Information Technology, Moradabad, India
3Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, AP, India
4Computer Science & Engineering, College of Computing Science & Information Technology, Moradabad, India
E-mail: 1anurag.aeg@gmail.com, 2ratnesh.nitttr@gmail.com, 3abhishek_bhola@hotmail.com, 4aalok_iitr@live.com

Abstract—Software bug prediction, or defect prediction, is very important for organizations because detecting bugs at an early stage of the software development process lets developers know the vulnerable areas where defects may be present. In this research paper we compare different statistical techniques, such as Linear Regression, Naïve Bayes, Random Forest, Decision Tree, and Artificial Neural Networks, and identify the best among them for bug prediction. The comparison is made using performance measures: accuracy, precision, recall, and F-measure.

Keywords: Software Bug Prediction, Vulnerable, Statistical Techniques

I. Introduction

In this paper a study of different statistical techniques has been carried out and a comparative analysis has been made. There are many existing techniques for software defect prediction based on different factors. The techniques used in our analysis are described below.

1) Linear Regression (LR)
Linear regression is a supervised machine learning technique that uses the concept of independent and dependent variables. When the dependent variable depends on a single variable, it is simple linear regression; when it depends on multiple variables, it is multiple linear regression.

2) Naïve Bayes (NB)
Naïve Bayes is a supervised machine learning technique based on the well-known Bayes theorem. It predicts the class of a dataset using prior and posterior probabilities, and it can be used for binary and multiclass classification.

3) Random Forest (RF)
Random forest is a supervised learning technique that is a group of decision trees working as an ensemble. Every tree in the random forest gives a class prediction, and the class with the maximum votes becomes the model's prediction.

4) Artificial Neural Networks (ANN)
Artificial neural networks are a supervised machine learning technique based on the biological working of the brain, which is built from small cells called neurons.

5) Decision Tree (DT)
A decision tree is based on a hierarchical, top-down approach. Decisions are taken based on a condition: if the condition is true, one child node is followed, and if false, the other. This process continues until a leaf node is reached.

II. Literature Review

The studies reviewed here are listed in the References section.

Malhotra, Ruchika [1] presented a comparative analysis of software bug prediction techniques: the paper reviewed techniques for building software bug prediction models and assessed their performance.

D'Ambros, Marco, Michele Lanza, and Romain Robbes [2] made a useful comparison between different bug prediction methods. The study compared the existing methods, introduced a new approach, and evaluated its performance through a thorough comparison with the other approaches.

Gupta and Saxena [3] built a model for an object-oriented software defect forecasting system. The study used many similar defect datasets from the Promise software database and proposed a model with an average accuracy of 76.27%.

T. Gyimothy and others [4] analyzed various object-oriented metrics. The results showed that the Coupling Between Objects (CBO) metric is the best metric for predicting bugs in a class and that lines of code is also good, but the
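The majority-vote rule described above for Random Forest can be sketched in a few lines of Python (a stand-alone illustration; the per-tree votes below are hypothetical, not output of this paper's experiments):

```python
from collections import Counter

def forest_predict(tree_predictions):
    """Each tree in the forest votes for a class; the class with the
    maximum number of votes becomes the forest's prediction."""
    votes = Counter(tree_predictions)
    return votes.most_common(1)[0][0]

# Hypothetical votes from five trees for one software module:
print(forest_predict(["buggy", "clean", "buggy", "buggy", "clean"]))  # buggy
```

In a real random forest each tree is trained on a bootstrap sample of the data, but the aggregation step is exactly this majority vote.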
406 Copyright © IEEE–2021 ISBN: 978-1-6654-3970-1


Depth of Inheritance Tree (DIT) and Number of Children (NOC) were not good features for prediction.

III. Datasets

The datasets were taken from the Promise database. Fi is the number of faults and Ti is the number of tests for each day (Di) over a part of a software project's lifetime. The DSET1 dataset has 46 measurements; DSET2 has 111 measurements.

Table 1: Dataset DSET1

Di Fi Ti  | Di Fi Ti
1  2  75  | 24 2  8
2  0  31  | 25 1  15
3  30 63  | 26 7  31
4  13 128 | 27 0  1
5  13 122 | 28 22 57
6  3  27  | 29 2  27
7  17 136 | 30 5  35
8  2  49  | 31 12 26
9  2  26  | 32 14 36
10 20 102 | 33 5  28
11 13 53  | 34 2  22
12 3  26  | 35 0  4
13 3  78  | 36 7  8
14 4  48  | 37 3  5
15 4  75  | 38 0  27
16 0  14  | 39 0  6
17 0  4   | 40 0  6
18 0  14  | 41 0  4
19 0  22  | 42 5  0
20 0  5   | 43 2  6
21 0  9   | 44 3  5
22 30 33  | 45 0  8
23 15 118 | 46 0  2

Table 2: Dataset DSET2

Di Fi Ti | Di Fi Ti | Di  Fi Ti
1  5  4  | 38 15 8  | 75  0  4
2  5  4  | 39 7  8  | 76  0  4
3  5  4  | 40 15 8  | 77  1  4
4  5  4  | 41 21 8  | 78  2  2
5  6  4  | 42 8  8  | 79  0  2
6  8  5  | 43 6  8  | 80  1  2
7  2  5  | 44 20 8  | 81  0  2
8  7  5  | 45 10 8  | 82  0  2
9  4  5  | 46 3  8  | 83  0  2
10 2  5  | 47 3  8  | 84  0  2
11 31 5  | 48 8  4  | 85  0  2
12 4  5  | 49 5  4  | 86  0  2
13 24 5  | 50 1  4  | 87  2  2
14 49 5  | 51 2  4  | 88  0  2
15 14 5  | 52 2  4  | 89  0  2
16 12 5  | 53 2  4  | 90  0  2
17 8  5  | 54 7  4  | 91  0  2
18 9  5  | 55 2  4  | 92  0  2
19 4  5  | 56 0  4  | 93  0  2
20 7  5  | 57 2  4  | 94  0  2
21 6  5  | 58 3  4  | 95  0  2
22 9  5  | 59 2  4  | 96  1  2
23 4  5  | 60 7  4  | 97  0  2
24 4  5  | 61 3  4  | 98  0  2
25 2  5  | 62 0  4  | 99  0  2
26 4  5  | 63 1  4  | 100 1  2
27 3  5  | 64 0  4  | 101 0  1
28 9  6  | 65 1  4  | 102 0  1
29 2  6  | 66 0  4  | 103 1  1
30 5  6  | 67 0  4  | 104 2  1
31 4  6  | 68 1  3  | 105 0  1
32 1  6  | 69 1  3  | 106 1  2
33 4  6  | 70 0  3  | 107 0  2
34 3  6  | 71 0  3  | 108 0  1
35 6  6  | 72 1  3  | 109 1  1
36 13 6  | 73 1  4  | 110 0  1
37 19 8  | 74 0  4  | 111 1  1

IV. Analysis

We analyzed the datasets; the number of faults per class is given below.

Table 3: Number of Faults

Class | No. of Faults | DSET1 | DSET2
A     | 0 to 4        | 30    | 76
B     | 5 to 9        | 5     | 23
C     | 10 to 14      | 5     | 4
D     | 15 to 19      | 2     | 3
E     | 20 or more    | 4     | 5

Table 4: Confusion Matrix

Forecasted | Actual Class X  | Actual Class Y
X          | True Positive   | False Positive
Y          | False Negative  | True Negative

Tables 3 and 4 above show the number of faults and the confusion matrix used to measure the performance of the statistical techniques, respectively.

A. Accuracy

The accuracy performance metric is the proportion of correct predictions (TP and TN) among the total number of examined cases. The best accuracy is one, whereas the worst accuracy is zero. Accuracy is calculated by the following formula:

Accuracy = (True Positives + True Negatives) / (All Positives + All Negatives) (i)

Table 5: Accuracy Measure for Five Statistical Techniques

DATASETS | LR     | NB     | RF     | DT     | ANN
DSET1    | 0.872  | 0.898  | 0.962  | 0.951  | 0.938
DSET2    | 0.891  | 0.911  | 0.975  | 0.972  | 0.954
AVERAGE  | 0.8815 | 0.9045 | 0.9685 | 0.9615 | 0.946
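Equation (i) can be checked with a small Python helper (a sketch; the confusion-matrix counts below are hypothetical, not the counts behind Table 5):

```python
def accuracy(tp, tn, fp, fn):
    """Equation (i): correct predictions (TP + TN) over all examined cases."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts: 90 true positives, 95 true negatives,
# 10 false positives, 5 false negatives out of 200 modules.
print(accuracy(tp=90, tn=95, fp=10, fn=5))  # 0.925
```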
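The class bins of Table 3 can be reproduced from the DSET1 fault counts of Table 1 (a sketch; the bin edges come from Table 3, with class E read as 20 or more faults so that every count is covered):

```python
from collections import Counter

def fault_class(fi):
    """Map a daily fault count Fi to the fault classes of Table 3."""
    if fi <= 4:
        return "A"   # 0 to 4 faults
    if fi <= 9:
        return "B"   # 5 to 9
    if fi <= 14:
        return "C"   # 10 to 14
    if fi <= 19:
        return "D"   # 15 to 19
    return "E"       # 20 or more

# Fi column of Table 1 (DSET1, days 1-46):
dset1_fi = [2, 0, 30, 13, 13, 3, 17, 2, 2, 20, 13, 3, 3, 4, 4, 0, 0, 0,
            0, 0, 0, 30, 15, 2, 1, 7, 0, 22, 2, 5, 12, 14, 5, 2, 0, 7,
            3, 0, 0, 0, 0, 5, 2, 3, 0, 0]
counts = Counter(fault_class(f) for f in dset1_fi)
print({c: counts[c] for c in "ABCDE"})  # {'A': 30, 'B': 5, 'C': 5, 'D': 2, 'E': 4}
```

The printed counts match the DSET1 column of Table 3.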

Fig. 1: Chart Showing Accuracy Measures for Five Statistical Techniques

B. Precision (Performance Measure for Positive Prediction)

The precision performance metric is the number of correct positive predictions divided by the total number of positive predictions. The best precision is 1, whereas the worst is 0.

Precision = True Positives / (True Positives + False Positives) (ii)

Table 6: Precision Measure for Five Statistical Techniques

DATASETS | LR    | NB    | RF   | DT   | ANN
DSET1    | 0.91  | 0.95  | 0.99 | 1    | 1
DSET2    | 0.89  | 0.98  | 1    | 0.99 | 0.98
AVERAGE  | 0.901 | 0.972 | 0.99 | 0.99 | 0.99

Fig. 2: Chart Showing Precision Measures for Five Statistical Techniques

C. Recall (Performance Measure for True Positive Rate)

We calculated recall as the number of true positive predictions divided by the total number of actual positives. The best recall value is 1 and the worst value is 0. Recall is given by:

Recall = True Positives / (True Positives + False Negatives) (iii)

Table 7: Recall Measure for Five Statistical Techniques

DATASETS | LR    | NB    | RF | DT     | ANN
DSET1    | 0.988 | 0.956 | 1  | 0.999  | 1
DSET2    | 0.992 | 0.978 | 1  | 0.998  | 0.99
AVERAGE  | 0.99  | 0.967 | 1  | 0.9985 | 0.995

Fig. 3: Chart Showing Recall Measures for Five Statistical Techniques

D. F-measure

The F-measure performance metric combines the recall and precision values into one metric in order to compare the different supervised techniques with each other. F-measure is given by:

F-measure = (2 * Recall * Precision) / (Recall + Precision) (iv)

Table 8: F-Measure for Five Statistical Techniques

DATASETS | LR   | NB   | RF   | DT   | ANN
DSET1    | 0.95 | 0.96 | 1.00 | 1.00 | 1.00
DSET2    | 0.94 | 0.98 | 1.00 | 0.99 | 0.99
AVERAGE  | 0.90 | 0.97 | 1.00 | 1.00 | 0.99

Fig. 4: Chart Showing F-measure for Five Statistical Techniques

V. Conclusion and Scope for Future Enhancements

In this research paper we evaluated five statistical techniques for software bug prediction: Linear Regression, Naïve Bayes, Random Forest, Decision Tree, and Artificial Neural Networks. The process was applied to two real bug datasets, and the results were collected using multiple performance measures: see Table 5 and Fig. 1 for the accuracy measure, Table 6 and Fig. 2 for precision, Table 7 and Fig. 3 for recall, and Table 8 and Fig. 4 for F-measure. The comparative study clearly shows that the Random Forest supervised learning technique gives the best results over the others. As a future enhancement, we may include more bug data to obtain more accurate results.
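The three measures of equations (ii)-(iv) can be computed together, as in this short sketch (the confusion-matrix counts below are hypothetical, not the counts behind Tables 6-8):

```python
def precision(tp, fp):
    """Equation (ii): correct positive predictions over all positive predictions."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Equation (iii): true positives over all actual positives."""
    return tp / (tp + fn)

def f_measure(p, r):
    """Equation (iv): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical counts for one technique on one dataset:
tp, fp, fn = 90, 10, 5
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 3), round(r, 3), round(f_measure(p, r), 3))  # 0.9 0.947 0.923
```

Because the F-measure is a harmonic mean, it is pulled toward the lower of the two values, so a technique cannot score well on F-measure by trading precision for recall or vice versa.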

References

[1] R. Malhotra, "Comparative analysis of statistical and machine learning methods for predicting faulty modules," Applied Soft Computing, 21 (2014): 286-297.
[2] M. D'Ambros, M. Lanza, and R. Robbes, "An extensive comparison of bug prediction approaches," Mining Software Repositories (MSR), 2010 7th IEEE Working Conference on, IEEE, 2010.
[3] D. L. Gupta and K. Saxena, "Software bug prediction using object-oriented metrics," Sādhanā (2017): 1-15.
[4] T. Gyimothy, R. Ferenc, and I. Siket, "Empirical Validation of Object-Oriented Metrics on Open Source Software for Fault Prediction," IEEE Transactions on Software Engineering, 2005.
[5] Y. Tohman, K. Tokunaga, S. Nagase, and M. Y., "Structural approach to the estimation of the number of residual software faults based on the hyper-geometric distribution model," IEEE Trans. on Software Engineering, pp. 345-355, 1989.
[6] A. Sheta and D. Rine, "Modeling Incremental Faults of Software Testing Process Using AR Models," Proceedings of the 4th International Multi-Conference on Computer Science and Information Technology (CSIT 2006), Amman, Jordan, Vol. 3, 2006.
[7] D. Sharma and P. Chandra, "Software Fault Prediction Using Machine-Learning Techniques," Smart Computing and Informatics, Springer, Singapore, 2018, 541-549.
[8] M. M. Rosli, N. H. I. Teo, N. S. M. Yusop, and N. S. Moham, "The Design of a Software Fault Prone Application Using Evolutionary Algorithm," IEEE Conference on Open Systems, 2011.
[9] P. D. Singh and A. Chug, "Software defect prediction analysis using machine learning algorithms," 7th International Conference on Cloud Computing, Data Science & Engineering (Confluence), IEEE, 2017.
[10] R. K. Shukla and A. K. Tiwari, "A Machine Learning Approaches on Face Detection and Recognition," Solid State Technology, 63(5), 2020, 7619-7627.
[11] A. K. Tiwari and R. K. Shukla, "Machine Learning Approaches for Face Identification Feed Forward Algorithms," Proceedings of 2nd International Conference on Advanced Computing and Software Engineering (ICACSE), 2019.
[12] R. Kumar Shukla and A. Kumar Tiwari, "Comparative Analysis of Machine Learning Based Approaches for Face Detection and Recognition," Journal of Information Technology Management, 13(1), 2021, 1-21.
[13] P. Singhal, P. K. Srivastava, A. K. Tiwari, and R. K. Shukla, "A Survey: Approaches to Facial Detection and Recognition with Machine Learning Techniques," Proceedings of Second Doctoral Symposium on Computational Intelligence, Springer, Singapore, 2022, pp. 103-125.
[14] R. K. Shukla, V. Prakash, and S. Pandey, "A Perspective on Internet of Things: Challenges & Applications," 2020 9th International Conference on System Modeling and Advancement in Research Trends (SMART), IEEE, 2020, pp. 184-189.
[15] A. Jain, A. Kumar, and S. Sharma, "Comparative Design and Analysis of Mesh, Torus and Ring NoC," Procedia Computer Science, 48, 2015, 330-337.
[16] D. Ghai, H. K. Gianey, A. Jain, and R. S. Uppal, "Quantum and dual-tree complex wavelet transform-based image watermarking," International Journal of Modern Physics B, 34(04), 2020, 2050009.
[17] N. Agrawal, A. Jain, and A. Agarwal, "Simulation of Network on Chip for 3D Router Architecture," International Journal of Recent Technology and Engineering, 8, 2019, 58-62.
[18] R. K. Shukla, A. Agarwal, and A. K. Malviya, "An Introduction of Face Recognition and Face Detection for Blurred and Noisy Images," International Journal of Scientific Research in Computer Science and Engineering, 6(3), 2018, 39-43.
[19] A. K. Agarwal and A. Jain, "Synthesis of 2D and 3D NoC mesh router architecture in HDL environment," J Adv Res Dyn Control Syst, 11(4), 2019, 2573-2581.
[20] A. Jain, A. K. Gahlot, R. Dwivedi, A. Kumar, and S. K. Sharma, "Fat Tree NoC Design and Synthesis," Intelligent Communication, Control and Devices, Springer, 2018, pp. 1749-1756.
