ASC Review Article

Applied Soft Computing Journal 162 (2024) 111805
Contents lists available at ScienceDirect
Applied Soft Computing

journal homepage: www.elsevier.com/locate/asoc
A systematic review of machine learning methods in software testing

Sedighe Ajorloo a, Amirhossein Jamarani b, Mehdi Kashfi a, Mostafa Haghi Kashani a, *,
Abbas Najafizadeh a
a
Department of Computer Engineering, Shahr-e-Qods Branch, Islamic Azad University, Tehran, Iran
b
The Center for Advanced Computer Studies, University of Louisiana at Lafayette, LA, USA
H I G H L I G H T S
• A comprehensive systematic review on machine learning methods in software testing is provided.

• The main ideas, methods, tools, merits, demerits, evaluation metrics, and evaluation methods are discussed.
• A scientific taxonomy of machine learning methods in software testing is presented.
• A detailed list of challenges, open issues, and future research directions is outlined.
A R T I C L E I N F O A B S T R A C T
Keywords: Background: The quest for higher software quality remains a paramount concern in software testing, prompting a
Machine learning shift towards leveraging machine learning techniques for enhanced testing efficacy.
Software testing Objective: The objective of this paper is to identify, categorize, and systematically compare the present studies on
Quality of software
software testing utilizing machine learning methods.
Systematic review
Method: This study conducts a systematic literature review (SLR) of 40 pertinent studies spanning from 2018 to
March 2024 to comprehensively analyze and classify machine learning methods in software testing. The review
encompasses supervised learning, unsupervised learning, reinforcement learning, and hybrid learning
approaches.
Results: The strengths and weaknesses of each reviewed paper are dissected in this study. This paper also provides
an in-depth analysis of the merits of machine learning methods in the context of software testing and addresses
current unresolved issues. Potential areas for future research have been discussed, and statistics of each review
paper have been collected.
Conclusion: By addressing these aspects, this study contributes to advancing the discourse on machine learning’s
role in software testing and paves the way for substantial improvements in testing efficacy and software quality.
1. Introduction dependent on software-based features. Software also plays a significant

role in determining the overall quality of the product. Nonetheless, there
Computer software refers to the compilation of data and computer are bugs in every piece of software [10], irrespective of the amount of
instructions that establish the operational trajectory and execution of a testing done or the technologies utilized [11]. Bugs can endanger
computer system. The core components of software engineering humans and the environment in highly automated systems that rely
encompass methods [1], processes [2], tools [3,4], quality [5,6], and heavily on software, such as automobiles and airplanes. It is nearly
maintenance [7,8]. In many sectors and scientific disciplines, software impossible to create a software-based, sophisticated system that works
has played a key role. It has a big impact on automation initiatives and flawlessly the first time [11]. As a result, software testing (ST) proced
influences the rise of productivity [9]. In addition, the development of ures are carried out alongside other quality control procedures
new products in many other industries is becoming increasingly throughout the software development process to provide additional
* Corresponding author.
E-mail addresses: s.ajorloo@qodsiau.ac.ir (S. Ajorloo), c00550518@louisiana.edu (A. Jamarani), mh.kashani@qodsiau.ac.ir (M. Haghi Kashani), a.najafizadeh@
qodsiau.ac.i (A. Najafizadeh).
https://doi.org/10.1016/j.asoc.2024.111805
Received 21 September 2023; Received in revised form 27 March 2024; Accepted 19 May 2024
Available online 25 May 2024
1568-4946/© 2024 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
S. Ajorloo et al. Applied Soft Computing 162 (2024) 111805
relevant information about the reliability and quality of the product environments, datasets and case studies, and methods for measuring
[12]. Because of this, the level of quality that is present during the accuracy and efficiency. By probing into these matters and research
testing procedures is one of the most important variables that will questions (RQ) below, we aim to unveil the challenges, prospects, and
determine the quality of the product in its final form [13]. forthcoming trajectories for the application of ML in ST.
ST is a process that identifies errors or bugs in software while also RQ1: What are the different ML methods that have been applied to
providing information about the functionality and quality characteris ST?
tics of the software being tested [14]. Testing should follow the devel RQ2: What are the evaluation metrics used to evaluate the effec
opment phase of any software project [15]. According to a study that tiveness of ML methods in ST?
was carried out by the National Institute of Standards and Technology, RQ3: Which tools or evaluation environments are commonly used for
the amount of damage that is incurred as a result of software issues is ST and ML, and how are they integrated?
significant [16]. As a result, the ST phase is extremely important and RQ4: What datasets and case studies have been used to evaluate the
accounts for around half of the total software development budget [17]. performance of ML methods in ST?
ST uses a significant amount of resources but adds no new functionality RQ5: What evaluation methods have been employed to measure the
to the product; hence, a significant amount of effort has been expended accuracy and efficiency of ML methods in ST?
to reduce its cost by providing automated ST solutions. Over the past RQ6: What are the current challenges and opportunities for applying
decade, many different methods for automated ST have been developed, ML to ST, and what are the possible future directions?
all with the same overarching goal: to maximize error detection while Guidelines in [27,28] have been used to explore, categorize, and
generating as little test input data as possible. compare available ML methods in ST. In order to present a compre
Artificial intelligence (AI) has demonstrated its efficacy in addressing hensive taxonomy for the classification of ML methods in ST, 40 papers
challenges across various fields, including but not limited to biology, were selected and analyzed. In addition, this review provides a concise
social media [18], medical sciences [19], space and environmental ap overview of the key obstacles and open issues, as well as a description of
plications [20], the Internet of things (IoT) [21], self-driving vehicles the significant areas in which future research might improve the
[22], and unmanned aircraft systems [22]. It is, therefore, logical to methodologies used in the selected studies.
consider that AI, encompassing machine learning (ML), deep learning This SLR is structured as Fig. 1: Section 2 discusses related work and
(DL), search algorithms, and optimization techniques, has the potential motivation. Section 3 illustrates the research methodology. A full review
to contribute to the advancement of software engineering practices, of the chosen papers is provided in Section 4 after the classification of
particularly in the domain of ST. In terms of reducing the time and effort ML methods in ST. The analysis of the results and future work are
needed for various software engineering methodologies, AI has proven described in Sections 5 and 6, respectively. Threats to validity and
to be a formidable opponent [23]. Automating a variety of tasks that are limitations are outlined in Section 7. Finally, Section 8 explains the
associated with software engineering can be accomplished with the help conclusion.
of ML, a subfield of AI [24]. In addition, the complexity of software is
growing at an exponential rate, and the standard testing methodologies 2. Related work and motivation
that have been used in the past are unable to adequately scale to meet
the demands of increasingly sophisticated software [25]. Considering Numerous reviews have been conducted in ST and ML. Nevertheless,
the ever-increasing complexity of today’s programming frameworks, these literature reviews have some limitations. This section pertains to
ML-based methods have become increasingly appealing. The deploy several review studies that examine strategies for handling the common
ment of ML methods to automate the process of ST has been the subject boundaries of ST and ML. Section 2.1 provides an overview of the
of a significant amount of research. It doesn’t matter if the objective at related works, which have been categorized as surveys. Additionally,
hand is to generate the test cases, prioritize the test cases, or do other Section 2.2 highlights the weaknesses of these reviews. We present a
sorts of testing like black box or white box testing; ML has demonstrated summary of the related works in Table 1, which includes various pa
its usefulness in all of these cases [26]. rameters such as the main concepts, review types, paper selection pro
The application of ML methods has proven to be an effective strategy cesses, taxonomies, open issues, evaluation parameters, applied tools,
for automating testing-related operations, the construction and evalua and publication year of each study.
tion of test Oracles, and the prediction of the cost and amount of time
required for testing. Because of this, programming analysts can imagine 2.1. Review studies on software testing and machine learning
more accurate results than traditional testing ever could, thanks to AI,
which educates frameworks to learn and implement that information in An overview of machine learning testing was provided by the authors
the future [25]. However, the probability of making an error is not the in [29]. Their work surveyed properties, components, workflows, and
only aspect that is reduced. The amount of time expected to be spent application scenarios. The paper also analyzed trends in datasets and
performing testing on a product and discovering potential faults is also research focus and identified challenges and directions in ML testing.
reduced, yet the amount of information that must be managed can still Additionally, the authors offered a survey of testing techniques for
increase without placing any load on the testing group. machine learning applications. Nonetheless, a drawback in the paper’s
The limited examination of comprehensive Machine Learning (ML) organization lies in its absence of a systematic framework and structure.
methods in Software Testing (ST) complicates understanding the diverse The authors in [30] established a framework for analyzing the usage of
approaches and challenges. So far, there have not been many complete DNNs in SE, covering aspects such as trends, techniques, data process
reviews on this topic. Furthermore, since ML methods in ST are an ultra- ing, research topics, relationships, datasets, optimization algorithms,
critical and sensitive field, it is necessary to provide a comprehensive and evaluation metrics. Also, the authors proposed a research roadmap
study. Recognizing the importance of ML methods in ST, this study aims outlining opportunities for future work in the application of DNNs in SE.
to thoroughly analyze the obstacles, potential directions, advantages, However, no taxonomy was provided in their study.
and disadvantages involved. The study also delves into the connection Chen and Babar [31] provided a review of machine learning-based
between ML methods and ST. Through a systematic literature review, we modern software systems (MLBSS) security. They highlight the need
aim to find, categorize, and compare various ML methods in ST. We for joint efforts from software engineering, system security, and machine
meticulously scrutinize the methodologies employed in existing litera learning disciplines to address the evolving threats posed by MLBSS. The
ture and devise a coherent classification framework. This methodolog authors not only examined security threats but also offered insights into
ical review addresses fundamental questions, such as the types of ML secure practices throughout the software lifecycle. By summarizing
methods used in ST, the metrics for evaluating them, common tools and existing literature and outlining future research directions, they aimed
2
Fig. 1. The structure of this study.
Table 1
Related studies in the field of software testing and machine learning.
Review Paper Main topic Publication Paper selection Taxonomy Future Covered
type year process work years
Survey [29] Machine learning methods in software testing 2020 Yes Yes Yes 2007–2019
[30] Deep learning for software testing 2022 Yes No Yes 2006–2022
[31] Software systems based on machine learning 2024 Yes Yes Yes Not
mentioned
[32] Investigation into the detection and anticipation of security risks through 2024 Yes No Yes 2011–2022
the application of diverse deep learning frameworks in software testing
SMS [23] Software testing pertains to machine learning 2019 Yes No Yes 2017–2018
SLR [33] Software testing and machine learning methods 2012 Yes No Yes 1991–2010
[34] Software testing and machine learning methods 2015 Yes Yes Yes 1991–2013
[35] Verification and testing of neural networks in software testing 2020 Yes Yes Yes 2011–2018
[36] The utilization of data mining, machine learning, and DL methods for 2022 Yes Yes Yes 2010–2021
predicting software faults
[37] Software testing and reinforcement learning 2023 Yes Yes Yes 2012–2021
Our machine learning methods in software testing 2024 Yes Yes Yes 2018–2024
study
to foster an understanding of system security engineering in the realm of detecting ST faults. They reviewed papers published between 1991 and
MLBSS. Suman and Khan [32] analyzed the literature on using various 2013. However, the papers they reviewed have become outdated as ML
deep learning models to detect and predict security risks in software and ST have grown in recent years. Zhang and Li [35] stated that neural
testing. They reviewed existing techniques, evaluated performance in network (NN) algorithms are increasingly used in Safety-Critical
dicators, and discussed benefits and drawbacks. CyberPhysical Systems (SCCPSs), prompting interest in testing and
Durelli, et al. [23] performed a review for the purpose of conducting verification (T&V) methods for NN-based control software. A review of
a systematic mapping study (SMS) to provide an overview of the T&V methodologies was conducted by analyzing 83 papers from 2011 to
research at the intersection of ST and ML to automate and streamline ST. 2018. The study categorized approaches into themes like robustness,
Their covered years, including papers, were quite limited. A systematic failure resilience, and interpretability of NNs. However, gaps exist in
literature review was conducted by Wen, et al. [33] to explore models achieving repeatability and defined testing configurations.
for estimating software development efforts based on ML methods. They Batool and Khan [36] analyzed previously published reviews, sur
commented on the limitations of the applications of ST and ML and veys, and related studies to extract a set of questions. They explained the
provided information regarding the improvement of those limitations. significance of answering the newly added questions and categorized
The authors, nevertheless, did not compile a taxonomy to have all their previous work according to data mining, ML, and DL while evaluating
reviewed studies segregate. their respective performances. In [37], the authors examined the utili
The authors in [34] provided an SLR on the applications of ML in zation of reinforcement learning (RL) in software testing, addressing its
3
application where traditional machine learning methods falter. Results were not included.
indicated widespread usage of RL in certain testing scenarios, albeit • Some of the related papers have not explicitly focused on unre
primarily limited to two applications, highlighting a need for explora solved matters; they have briefly and implicitly listed future challenges.
tion in advanced RL techniques and multi-agent RL. • While some papers lack proper classification or taxonomies, this
paper not only offers a clear and visual categorization but also estab
lishes a subclass for each of them.
2.2. The motivation for an SLR on machine learning methods in software • The evaluation parameters and tools were largely disregarded in
testing most reviews.
• Previous studies reviewed only a limited number of papers.
In order to conduct a systematic review of the ML method in ST,
recent studies are identified, classified, and compared. The focus of this 3. Research methodology
review is to provide a detailed analysis and classification of ML methods
taken in ST in various sectors, as presented in Section 4. To verify the ST and ML have both seen a lot of research published recently. Due to
novelty of our work, an optimally thorough search on Google Scholar, the empirical study, this work is able to offer a thorough analysis of the
Springer, IEEE Explorer, ScienceDirect, SAGE, Taylor&Francis, Wiley, subject. In this section, an SLR approach to ML methods in ST is pre
Emerald, ACM, and Inderscience was conducted using a specific sented. A systematic review is the procedure of locating, categorizing,
research string as follows:
Based on our review, none of the studies we examined fully analyzing, and producing a comparative overview to arrive at solutions
addressed our research questions in Section 3.1, which highlights the to research problems and questions pertaining to certain research topics
need for an SLR to strengthen and update the current evidence-based ML [27]. This approach can be applied in any academic discipline to clearly
method in ST. Table 1 summarizes the surveys we studied, detailing grasp, reduce prejudice, and identify unresolved concerns and future
their review types, main topics, publication years, paper selection pro guidelines, according to [38]. This study’s specific goal was to provide a
cesses, taxonomies, future works, and covered years. It is evident that thorough method for practical steps in this subject.
only five of the papers used the relatively similar SLR method, and four As depicted in Fig. 2, this methodical procedure employs a three-
papers used the simple review method. Our research has a transparent phase guideline, namely, planning, conducting, and documenting. During
paper selection process, a prepared taxonomy, and an explanation of the planning phase, we first determine the questions and requirements
future works, and includes recently published papers up to March 2024. that are motivating this SLR. Then, during the conducting phase, papers
However, some of the papers that have already been reviewed do not are chosen based on inclusion and exclusion criteria. The observations
provide a taxonomy. So, we conducted this comprehensive study to are finally recorded in the documenting phase, and the analyses, com
cover the following deficiencies: parisons, and visualizations of the results that produce the conclusions
• Newly published papers, particularly from 2021 to March 2024, to the research questions are presented.
Fig. 2. Introduction to the research method.
4
3.1. Planning phase (Table 2).
The process of planning for this SLR commences with identifying the
3.2. Conducting phase
research motivation and culminates in the development of a review
protocol, which includes the following steps:
The second phase of the research methodology involves conducting a
Step 1: Determining your research’s purpose. The first step entails
comprehensive search for relevant papers, beginning with the selection
defining the motivation on the basis of the unique value added by this
of appropriate papers and ending with data extraction. This section aims
SLR in comparison to the existing reviews discussed in Section 2.2.
to outline the process of paper selection that is carried out during the
Step 2: Involves question formulation. The second step involves
second phase of the SLR. The paper selection process follows a three-step
defining research questions to aid in the creation and validation of the
guideline.
review methodology, with the paper’s purpose serving as inspiration. In
Step 1: Selecting Primary Research: The initial step in the research
the Introduction section is a list of the study’s questions. Answering the
process involved searching through Google Scholar, the most popular
questions at hand can help uncover knowledge gaps that could lead to
search engine, using well-known academic publishers like Springer,
breakthroughs during the documentation phase.
IEEE, ScienceDirect, SAGE, Taylor&Francis, Wiley, Emerald, ACM, and
Step 3: Establish the review protocol. The review scope and research
Inderscience based on titles and keywords. The following is a definition
questions were established in the prior stage in accordance with the
of the search terms:
objectives of this SLR in order to modify the search terms for the • Initial selection: At the conclusion of step one, 284 papers from
extraction of literature [27]. A protocol was also developed by taking journals, conferences, white papers, book chapters, and books were
into account [39] and our prior SLR experience [40–55]. We asked an extracted, as noted in Table 2. In this step, the papers’ abstracts and
outside expert with experience conducting SLRs in this era for feedback conclusions were examined, after which papers that had been pub
in order to assess the defined protocol before it was put into action. The lished online between 2018 and March 2024 were selected. Addi
updated protocol took into account his suggestions. To lessen improper tionally, non-peer-reviewed and non-English papers, thesis, review
motivation in research and increase the efficiency of data extraction, a papers, short papers, and book chapters were excluded in order to
pilot study (representing approximately 25% of the included papers) find the papers that were the most pertinent. Overall, we located 43
was conducted. Additionally, during the pilot phase, we improved the papers at this stage.
review’s parameters, search techniques, and inclusion/exclusion criteria • Final selection: Ultimately, an exhaustive review was conducted on
the entirety of the selected articles, resulting in the identification of
40 relevant papers for subsequent comprehensive scrutiny. These
Table 2
papers were deemed suitable for further examination as they effec
Inclusion and exclusion criteria.
tively addressed our research inquiries and provided comprehensive
Criteria Justification
accounts of the methodologies employed and problems encountered.
Inclusion • Studies that mainly focus on • Having a comprehensible image
the integration of software of software testing and machine Steps 2 and 3: Data extraction and synthesis: By looking at 40 pertinent
testing with machine learning learning
• Papers online from 2018 to • Recent papers have referenced
papers, we are able to classify ML methods in ST in Section 4 and
March 2024 the classical and fundamental highlight their advantages and disadvantages.
literature regarding this subject.
Exclusion • Short papers that are less than • These papers do not provide 3.3. Documenting phase
five pages adequate material to be used in
our SLR
• Survey and review papers • The studies fail to provide The observations are documented, and the selected articles for re
practical, noteworthy, view can be found in Section 4, along with relevant information based
innovative solutions or on the proposed classification. Section 5 analyzes, visualizes, and reports
information.
on the results, while Section 7 explores and outlines any threats to
• Studies that are not composed • The papers that were not
in English nor evaluated by evaluated and those written in validity and limitations.
referees non-English languages were
excluded due to doubts about 4. Taxonomy of machine learning methods in software testing
their quality and the inability to
investigate them
Within this particular section, a taxonomy has been established for
5
Fig. 3. The taxonomy of machine learning methods in software testing.
the literature pieces that have been examined. The task of organizing the Table 3. These criteria are systematically assessed and juxtaposed within
existing literature on ML methods in ST, including the various methods each subcategory of analysis.
and approaches employed, poses a significant challenge due to the wide
range of studies conducted in this field. The proposed taxonomy 4.1. Supervised learning methods
scheme’s framework is depicted in Fig. 3. There are four primary cate
gories that have been identified: Supervised learning methods, unsu Supervised learning is a learning technique that involves the process
pervised learning methods, reinforcement learning methods, and hybrid of assigning labels to data. Supervised learning methods are provided
learning methods. Given that the predominant focus of scholarly inquiry with a training dataset that consists of labeled data where both the input
in this particular domain revolves around topics associated with one of and corresponding output are known. This dataset is used to construct a
the aforementioned four perspectives, conducting a comprehensive ex system model that captures the learned relationship between the input
amination of the literature from each of these vantage points facilitates and output variables. Following the completion of training, the trained
the categorization of the assessed papers under a broad thematic model is capable of utilizing a newly introduced input to generate the
framework. In this particular instance, the 40 chosen papers have been anticipated output [56,57].
organized based on the aforementioned criteria. The fundamental This section focuses on conducting research pertaining to supervised
characteristics of the techniques employed in the reviewed papers, learning methods, specifically classification and regression techniques.
including their distinctions, main ideas, assessment methodologies, In this study, we examine the classification and regression techniques
tools utilized, drawbacks, and benefits, are thoroughly examined and that are relevant to the research field, as discussed in Section 4.1.1.
deliberated upon. Among the criteria to measure machine learning and
software testing, nineteen critical ones are opted for and described in
6
Table 3 Table 3 (continued )

Metric definitions. Evaluation parameter Description and formula
Evaluation parameter Description and formula
Confusion matrix The utilization of a confusion matrix is to assess the
Recall The recall of a proposed model refers to the proportion efficacy of a classification model through the
of positive observations in the actual class of a dataset comparison of predicted class labels with the actual class
that are correctly predicted by the model. Essentially, labels of a given test dataset. The matrix is a tabular
recall measures how well a classifier can accurately representation that presents the counts of true positives,
TP false positives, true negatives, and false negatives.
identify all positive samples. Recall =
TP + FN Suitability The extent to which a software application meets the
Precision Precision is a metric that pays attention to false positives needs, requirements, and expectations of its intended
and represents the ratio of true positive predictions to all users, encompassing aspects such as functionality,
predicted positive observations. It reflects the model’s performance, reliability, usability, compatibility, and
capacity to avoid labeling negative samples as positive. security.
TP
Precision =
TP + FP
Cost The cost associated with obtaining, creating, executing,
4.1.1. Summary of the chosen Supervised Learning methods
or sustaining the desired service.
Reliability This statement pertains to the likelihood of a system’s
In [58], a cost-sensitive approach was introduced with the aim of
ability to successfully carry out a task as intended in a enhancing the efficacy of ML classifiers in the prediction of defective
particular environment and timeframe. software modules, even in the presence of unstable datasets. The find
Time This study comprehensively examines various ings indicated a decrease in the occurrence of false negatives, an in
dimensions associated with time, encompassing
crease in the efficacy of the ST, and an increase in the accurate
execution time, average response time, statistical
analysis time, and delay. identification of defective modules. Nevertheless, the proposed model
Accuracy This refers to the evaluation process used to select the expanded the range of testing. Additionally, a decline in ST productivity,
best model for identifying relationships within a dataset, a rise in the number of false positives, and a decrease in overall accuracy
which is primarily based on input or training data. were observed. This study utilized a single NASA data set and an ML
TP + TN
Accuracy =
TP + TN + FP + FN
(N→Negative, method exclusively.
P→Positive, F→False, T→True) Nakajima [59] proposed a metamorphic testing approach that is
Performance The number of productive tasks completed within a suitable for neural network learning models. The framework was
defined timeframe.
explained by employing a test case involving ML programs based on
Efficiency Efficiency refers to the ability to achieve maximum
productivity with minimum wasted effort, time, or neural networks, which were utilized to categorize handwritten nu
resources. It is the ratio of output to input or the amount merical digits. The primary focal points of the discussion were the di
of work done in a given period of time. versity of the dataset and the utilization of behavioral oracles. The
Mutual information (MI) Mutual information is a metric used to quantify the concept of dataset diversity encompasses the extent to which datasets
extent to which one random variable provides
information about another. The dimensionless quantity
rely on training outcomes and proposes an approach for generating
under consideration typically possesses units of bits and subsequent test inputs. The statistical indicators were modified by
can be conceptualized as the decrease in uncertainty behavioral oracle observers throughout the training procedures. The
regarding a random variable when information about suggested approach can be advantageous in ST and ML programs.
another variable is known.
In [60], the dpEmu framework was employed to simulate data
Success rate Success rate is a measure of the proportion of successful
outcomes or events compared to the total number of anomalies and evaluate the performance of ML methods when con
attempts or opportunities. It is typically expressed as a fronted with flawed data. This framework promoted the practice of
percentage and can be used to evaluate the effectiveness conducting analysis on models and systems in instances where the inputs
of a process, system, or individual’s performance. to the system are deemed to be faulty. The primary objective was to
F-measure F-measure is a statistical metric that combines precision
achieve flexibility, whereby alternative data sources could be employed,
and recall to evaluate the performance of a binary
classification model. and the system being evaluated could encompass either an ML model or
Coverage Code coverage is a software testing metric that measures a complex software system. This study showcased the potential of
the percentage of code lines, branches, or paths executed simulating and leveraging data errors in the context of ML solution
during the execution of a test suite.
research. The DpEmu prototype exhibited restricted functionality and
Root mean square error It displays predictive errors and defines the error rate
(RMSE) between actual and predicted values. The model will
usability. This study demonstrated the integration of data robustness
perform better if the RMSE is smaller. and fault tolerance into the development of ML systems.
Effectiveness It is a measure of how well something works in practice The study conducted by Raman, et al. [61] demonstrated the effec
rather than simply in theory or on paper. tiveness of ML in evaluating the crucial features of exploratory data
Average Percentage of The average percentage of faults detected (APFD)
analysis (EDA) software and assessing the accuracy of results, specif
Fault Detection quantifies the weighted average of the percentage of
(APFD) faults that are identified during the execution of a test ically in the field of field-programmable gate array (FPGA) architec
suite. The APFD metric is utilized to assess the efficiency tures. The suggested testing program employed an ML model to identify
of a test suite in promptly identifying faults. discrepancies in delay calculations. The efficacy of employing ML
Mutation Score Mutation score is a metric used to assess the
methodologies was exemplified in the evaluation of delay calculators
effectiveness of a software testing strategy or test suite. It
measures the percentage of code mutations that are
and the identification of instances where these calculators deviated from
detected by the tests. the anticipated accurate labels. The results obtained from the study
Mean squared error The estimator’s mean squared error quantifies the demonstrated a significant level of effectiveness in utilizing this partic
(MSE) average of the squared errors, which represents the ular method.
average squared discrepancy between the estimated
Kassaymeh, et al. [62] examined the incorporation of a metaheuristic
values and the true value.
Mean absolute error The term MAE stands for the average absolute distance algorithm into the backpropagation neural network for the purpose of
(MAE) between each point’s identity line in either the optimizing network parameters in order to address the ST team size
∑n ⃒⃒ ⃒
i̇=1 yi̇ − xi
⃒ prediction problem. The incorporation of this integration led to an
horizontal or vertical direction. MAE = =
∑n n enhancement in the accuracy of the backpropagation neural network’s
i=1 |ei |
n
prediction, consequently elevating the overall quality of the outcomes.
The recommended technique underwent evaluation on two separate
7
datasets. This study examined the influence of the Salp Swarm Algo while upholding a commendable level of precision.
rithm on the predictive efficacy of backpropagation neural networks. Yahmed, et al. [68] introduced DiverGet, a search-based ST method,
The findings demonstrated the superiority of the recommended as a means to assess the efficacy of deep neural network (DNN) quan
approach in comparison to alternative methods. tization. The authors acknowledged that the process of quantization
Kamaraj, et al. [63] introduced a weight-optimized ANN that in may result in a decline in accuracy and pose challenges to performance.
corporates stochastic diffusion search (SDS) for the purpose of identi Consequently, they put forward a solution called DiverGet to tackle
fying the optimal weights based on a predetermined fitness function. these concerns. DiverGet employed an ML-based methodology to
The outcomes were assigned random classifications by default and ob generate a diverse collection of test cases with the aim of evaluating the
tained from the classifier. Consequently, the individuals were catego precision and efficiency of different quantization schemes. The study
rized according to their respective priorities, with the test items sharing presented experimental findings that demonstrate the efficacy of
the same priority being grouped together. The findings of the study DiverGet in identifying high-quality quantization schemes and
indicated that the proposed neural network outperformed the tradi improving the performance and accuracy of DNNs.
tional ANN in terms of both correct output misclassification and wrong The employment of DL models based on short-term memory (LSTM)
output misclassification. was suggested as a method for prioritizing test cases [69]. The predictive
Sheta, et al. [64] introduced two computational models for deter capability of LSTM was utilized to estimate the probability of each test
mining the required sample size in ST, utilizing ANN and genetic pro case detecting a fault within the cycle. This estimation was based on the
gramming. The proposed model utilized two computational techniques, testing data derived from all preceding continuous integration cycles.
namely multilayer perceptron ANN methods and genetic programming, The findings of the study indicated that there is potential for enhancing
in order to illustrate the correlation between the number of test workers the fault detection rate in continuous integration testing. Moreover, the
and the errors quantified in the software. Both of the recommended tester had the ability to identify the previous test cases within the con
models produced promising approximation results in real-time appli straints of the testing time. LSTM training data sets lacked the inclusion
cations and demonstrated the ability to predict the necessary team size. of the relationship between experimental items in the conducted
The TORC method, as described in [65], proposed a resolution to the experiment. Additionally, the factors of time costs and errors were not
challenge of the Oracle problem in scientific models. This approach taken into account. Ultimately, the framework and parameters of the
incorporated a combination of the combined interaction test, mutation LSTM network did not exhibit optimal predictive accuracy.
analysis, and convolutional neural networks (CNN). Furthermore, a The primary objective of software engineering, as stated in [70], is to
study was undertaken wherein CNNs were employed as Oracle proced deliver an output while simultaneously optimizing the cost and time
ures. Additionally, a technique based on features and neighborhoods required for application development. In order to attain the objective, it
was introduced by CNNs to elucidate the phenomenon of image was observed that software teams engage in testing their applications
misclassification. The experimental findings indicated that the perfor before they are deployed for live production. Documentation assumed a
mance of a shallower CNN is not consistently surpassed by its deeper critical role in facilitating their test automation efforts. The study pri
counterpart. The analysis of scientific models and errors detected in marily centered on the utilization of existing test resources to automate
CNNs did not reveal any discernible correlation between software bugs. test generation, with a particular emphasis on overcoming challenges
This paper did not include a discussion on the comparison of the per through systematic process enhancement and establishing a compre
formance of feature and neighborhood-based analysis techniques to hensive test strategy applicable to various software applications.
metrics such as neuron coverage, DeepGauge, and surprise adequacy for Raj and Chandrasekaran [71] introduced a methodology for
DL systems. leveraging ML methods in the automated generation of test sets. The
Ruospo, et al. [66] conducted an analysis of six potential scenarios NEAT algorithm was employed in this study to autonomously generate
pertaining to the feasibility of employing the ST library for online testing novel test sets or augment the coverage of an existing test set. The NEAT
of an embedded system designed to execute AI-based programs. This algorithm combined the benefits of ANNs and evolutionary genetic al
study examined the impact of incorporating a software test library on gorithms. The approach demonstrated superior coverage compared to
the performance of artificial neural networks (ANN) in specific recom alternative methods, requiring less time and a reduced number of test
mended scenarios. The performance of CNNs remained consistent across cases. Based on the findings of this study, it can be inferred that the
the three examined scenarios, where idle time was utilized for executing NEAT exhibited efficacy in automating the generation of test suites for
the software test library. Furthermore, three potential resolutions were the software being evaluated.
presented, with the primary determinant being the duration required for In Oleshchenko [72], the author described an approach that involved
error detection. The objective of these methodologies, with the excep establishing a test result repository, implementing automated analysis
tion of one instance, was to reduce the duration required for fault using the kNN algorithm, and providing real-time updates on the
detection. The findings indicated that the most favorable outcome was progress of testing. Although the author noted the need for compre
achieved by employing a combination of idle times and cross-layer hensive examination across a broader range of applications, the
implementation for both CNNs. approach indicated the potential to improve accuracy and efficiency by
Sivaji and Rao [67] proposed a CNN method that utilizes African a significant 25% when compared to current error analysis methodolo
buffalo-based techniques to optimize regression testing by reducing both gies. Its integration of Java, MongoDB, Elasticsearch, Jenkins, and
time and resource consumption. Initially, the data sets to be examined Docker technologies made it simple to deploy and operate.
were gathered. Subsequently, a CNN model utilizing African buffalo
data was developed to conduct test case minimization. Subsequently, 4.1.2. Overview of supervised learning methods
the process of test case minimization was executed. Furthermore, the The categorization of selected papers and important elements for
classification layer of the convolutional neural slice model was utilized analyzing supervised learning are elaborated upon in greater depth in
to establish the fitness function of the African buffalo. Subsequently, the Table 4. The evaluation of the assigned studies using evaluation factors
regression test was conducted in order to evaluate the software’s per in supervised learning methods is elaborated in Table 5. The factors
formance. The simulation of the model was ultimately completed using encompassed in this study include recall, precision, time, accuracy,
software, and the performance criteria were subsequently compared performance, efficiency, success rate, coverage, RMSE, effectiveness,
with those of previous studies. The model that was developed demon APFD, MSE, MAE, confusion matrix, and cost. The majority of papers
strated a notable level of accuracy and efficiency in its ability to process reviewed in the context of supervised learning methods evaluated the
multiple test cases. Hence, the implemented methodology effectively performance of their proposed approach using the metric of accuracy.
decreased the duration of execution and energy consumption, all the
8
Table 4
Classification of recent research and additional information on supervised learning methods.
Article Applied method Main idea Datasets Evaluation Tool(s) or Advantage(s) Disadvantage(s)
method evaluation
environment
[58] Multi-Layer Enhancing ML classifiers NASA JM1 (http: Simulation Not • Reduced false • Reduction of the ST
perceptron for software module defect //promise.site.uotta mentioned negatives efficiency
prediction with wa.ca/SERepository/ • Efficacy of the ST • Increase in the
unbalanced datasets datasets/jm1.arff) Improved module number of false
identification as defective positives
• Decreased overall
accuracy
[59] Support vector Investigating a framework MNIST Simulation Python Able to show bug- The algorithm
machines for NN that incorporates injected program requires lots of data
dataset diversity and anomalies and development
behavioral oracle time.
[60] Support vector Emulating data faults to MNIST Prototype Python Improving ML system The approach exhibits
machines study and develop ML training limitations in terms of
solutions both functionality and
usability.
[61] Random forest Showing how AI can Not mentioned Simulation Python High efficiency Not expanding the
efficiently test and validate model to predict the
critical EDA software net input slope
components, especially in
FPGA architectures
[62] Backpropagation Predicting the ST phase • Albrecht dataset Simulation MATLAB High performance The algorithm
(feedforward) developer count with a • Cosmic dataset requires lots of data
neural networks method • Desharnais dataset and development time
• Kemerer dataset
[63] Backpropagation Proposing techniques to Not mentioned Simulation • MATLAB • Less computational The algorithm
(feedforward) implement test Oracles • WEKA time requires lots of data
neural networks • Less misclassification and development time
[64] Multi-Layer Estimating the number of Not mentioned Simulation FORTRAN Predicted the team size Lack of evaluation of
perceptron test workers needed to test proposed models for
software based on other test instances
measured faults using ANN
and GP methods
[65] Convolutional Proposing a method to https://github. Simulation Lua An alternative that • Not considering
neural networks address the Oracle problem com/vsantjr/TOrC directly and more types of
for scientific models comprehensively images and
explains DNN failures environments
Avoiding FNA
performance
comparisons with
neuron coverage and
SADL
[66] Convolutional Analyzing and integrating CIFAR-10 Simulation Not Matching idle times with Not exploring GPU
neural networks STLs for online testing of mentioned layered implementation architectures and all
embedded systems running for both CNNs unexplored
ANN-based applications possibilities
[67] Convolutional Developing African Not mentioned Simulation Python • Low execution time The algorithm used
neural networks buffalo-based • Low amount of has a high cost.
convolutional neural memory
slicing to reduce regression • High precision
testing time and resource • High efficiency for
use detecting faults
• High Recall
High performance for
minimizing the test cases
[68] Convolutional Proposing a search-based Pavia University Simulation Python Automatic generation of DNN’s performance
neural network approach to evaluate DNN dataset (PU) (https diverse test cases for a was not evaluated.
quantization techniques ://www.ehu.eus broad range of
/ccwintco/index.ph quantization
p/Hyperspectral_Re configurations
mote_Sensing_Scenes)
Salinas dataset (SA)
(https://www.ehu.eus
/ccwintco/index.ph
p/Hyperspectral_Re
mote_Sensing_Scenes)
[69] Recurrent neural Successful regression Paint Control dataset Simulation Python • Improving the • Neglecting time and
network testing for embedded IOF/ROL (https://doi. prioritization error costs
software in continuous org/10.7910/D effectiveness • LSTM network
integration VN/GIJ5DE) • Increasing the fault parameters’ poor
detection rate in the CI prediction accuracy
environment
(continued on next page)
9
Table 4 (continued )
method evaluation
environment
Decreasing testing Uncorrelated LSTM

execution time training data sets and
test cases
[70] Recurrent neural Examining challenges Not mentioned Simulation Junit • Decreasing execution Low reliability
network brought by orderly practice time
improvement and • Lower cost
characterization of test
systems for any software
application
[71] Recurrent neural Using ML to automatically Not mentioned Simulation Python • High coverage • Need to improve
network generate the test suites • Less time to generate test set generation
the test suite time
Fewer number of test Need for better
cases evaluation of branch
and loop coverages
[72] kNN Investigating the kNN Not mentioned Simulation Java • High efficiency • Not implementing it
clustering algorithm for MongoDB in a real-world
software testing fault Node JS • High accuracy environment
classification technique
development
4.2. Unsupervised learning methods features of DL systems and examine mutation operator interconnections.
Liu, et al. [77] proposed a testing methodology named Deep
In contrast to the process of supervised learning, an unsupervised Boundary with the aim of enhancing the scope of DL software applica
learning method is provided with a collection of inputs that lack any tions. The authors examined the challenges associated with evaluating
associated labels, thereby lacking an output. In essence, an unsupervised DL software and emphasized the importance of achieving comprehen
learning method endeavors to identify patterns, structures, or knowl sive coverage. They provided a definition for decision boundary repre
edge within unlabeled data through the process of clustering sample sentation and demonstrated its application in generating test cases that
data into distinct groups based on their similarity. Unsupervised specifically address the unexplored regions of the input space. The
learning methods are commonly employed in the domains of clustering empirical findings indicated that DeepBoundary exhibits superior
and data aggregation [57,73]. The following section, specifically sub coverage compared to conventional testing methodologies, all the while
section 4.2.1, provides a summary of the papers examined in this study necessitating a reduced number of test cases. The researchers reached
that utilize unsupervised learning methods. the conclusion that the implementation of DeepBoundary had the po
tential to enhance the overall quality and dependability of DL software.
4.2.1. Overview of the selected unsupervised learning methods Additionally, it can serve as a complementary tool to traditional testing
Ali, et al. [74] developed a priority and test case selection method to methodologies.
improve agile fault detection rates. The suggested strategy prioritizes Suman and Khan [78] introduced the Dove Swarm-based Deep
and selects test cases by regularly changing the test suite and discovering Neural Method (DSbDNM) in terms of software testing and intrusion
failing test cases through continuous integration in each agile develop detection. This study compared four deep learning-based vulnerability
ment cycle. The CTFF had two stages. Clustering was the first step, prediction (DLVP) methods, which are known for accurately identifying
followed by selecting the highest-priority test cases from each cluster for security vulnerabilities. To attain this purpose, the algorithm’s perfor
execution to find the biggest fault. Three agile-developed software mance was compared to other methods. The DSbDNM model was
programs tested the CTFF model with different-sized test cases. The initially trained using internet presentation data that contained intru
study found that CTFF detected more errors than random prioritizing sion information. Following that, feature extraction and predictions of
and other error-based methods. The CTFF model outperformed previous malicious actions were performed. Furthermore, a thorough classifica
methods. tion of various forms of assault and negative behaviors was conducted.
A clustering-based adaptive random sequence methodology was The efficiency of the generated prediction model was also tested by
presented in [75] to improve regression testing. Adaptive random se initiating and detecting an unidentified assault. As a result, the proposed
quences maximize neighboring test case variety in this strategy. This method would improve performance and attain high levels of accuracy
study established adaptive random sequence algorithms using three within a restricted computing time frame.
clustering methodologies. Adaptive random sequences were created
using MSampling. Compared to the random prioritization approach and 4.2.2. Overview of unsupervised learning methods
the coverage test prioritization method, early error detection and effi The categorization of selected papers and important elements for
cacy increased. All three clustering algorithms outperformed Method analyzing unsupervised learning are elaborated upon in greater depth in
Coverage and RT-MS. For prioritizing test cases, DM-clustering was the Table 6. The evaluation of the assigned studies using evaluation factors
most successful strategy. in unsupervised learning methods is elaborated in Table 7. The factors
Mutation testing for DL systems was introduced in [76] to evaluate encompassed in this study include recall, precision, accuracy, perfor
test data. To detect DL source mistakes, a thorough collection of muta mance, efficiency, F-measure, coverage, effectiveness, APFD, mutation
tion operators at the resource level was designed. Next, model-level score, and time.
mutation operators were created to directly introduce faults into DL
models without training. Finally, injection error detection was used to
evaluate the quality of the test data. The test data provided extensive 4.3. Reinforcement learning methods
feedback and suggestions to help understand and build DL systems. This
study did not propose enhanced mutation operators to cover more Reinforcement learning is a widely recognized and quite popular
learning method [79,80]. In contrast to supervised learning, which relies
10
on labeled data, training data for reinforcement learning methods only
Suitability
offers an indication of correctness without explicit labels. The process of
acquiring "good" behavior is achieved through iterative interactions
with the environment. The learning process in question bears a resem
*
*
*
*
blance to supervised learning, albeit with a distinct characteristic: rather
than relying on an extensive dataset with labeled information, the model
Cost
must engage in interactive exchanges with its environment. These in
*
teractions subsequently yield either positive rewards or negative pun
ishments. This feedback serves to reinforce the behavior of the model,
Confusion matrix
thereby attributing it to a designation. In this section, we analyze the

reinforcement learning methods that pertain to the research domain, as
outlined below.
*
*
*
4.3.1. Overview of the selected reinforcement learning methods

In [81], the memetic algorithm was proposed as the preferred
MAE
approach for automated testing. This algorithm, which incorporates

*
reinforcement learning, combines genetic methods with local search

steps. The present study employed the structural method to automate
MSE
the process of generating test data, with a focus on satisfying the crite
*
*
rion of covering all limited paths. In addition, the proposed algorithm

APFD
was evaluated in comparison to meta-heuristic and evolutionary

methods. The results obtained from the study demonstrated that this
*
particular method outperformed alternative algorithms in terms of both

the number of fitness evaluations conducted and the rate of successful
Effectiveness
outcomes.
In order to efficiently detect errors in recently developed code,
Rawat, et al. [82] introduced a method for prioritizing test cases that
*
*
*
combined hidden Markov models and reinforcement learning. The

suggested method effectively identified essential test cases with a low
RMSE
rate of false positives, as demonstrated by experimenting on a set of five

*
web applications, yielding an exceptional F1 score of 0.849. This

method increased overall efficiency by streamlining testing activities
Coverage
and saving time and resources.

The objective of [83] was to employ reinforcement learning methods
*
in order to design a reward function within a continuous integration

environment. This paper proposed the utilization of weighted reward
Success rate
functions that were derived from dissimilar historical outcomes. The

technology being presented had the potential to increase the number of
test cases executed and identify a greater number of defects. The find
*
ings indicated that the utilization of reward functions significantly

improved the ability to identify errors during the integration test. The
Efficiency
examination of sorting objective optimization and the implementation

of DL methods for error detection in agent algorithms were not thor
*
oughly investigated.
The authors of [84] presented a reinforcement learning-based
Performance
method for the simultaneous localization of multiple defects in ST.

The selected distribution population was transformed into a functional
program using this approach, resulting in enhanced performance of the
Supervised learning methods metrics in reviewed approaches.
*
*
*
*
*
*
*
genetic algorithm and evaluation index through the identification of

various defects in ST. This article presented a methodology for the
Accuracy
identification of concurrent faults in ST, which has been found to have a

positive impact on positioning accuracy. Test case prioritization in the
*
*
*
*
*
*
*
context of continuous integration (CI) and game testing are two

important software testing difficulties that Nouwou Mindom, et al. [85]
Time
empirically addressed through the utilization of Deep Reinforcement

*
*
*
*
*
*
*
Learning (DRL) algorithms. The research aimed to determine which DRL

methods demonstrated improved performance in identifying game
Precision
problems and solving the test case prioritization problem. The authors
used DRL methods to determine the relative ranking of test cases during
*
their experimentation in a CI environment to achieve this. For the game

testing task, the researchers also experimented with a simple game and
Recall
used DRL methods to examine the game environment for bugs. The re
*
sults indicated that some selected DRL methods performed better than
more recent approaches that have been published in the literature.
Table 5
Article
[58]
[59]
[60]
[61]
[62]
[63]
[64]
[65]
[66]
[67]
[68]
[69]
[70]
[71]
[72]
11
Table 6
Classification of recent research and additional information on unsupervised learning methods.
Article Applied Main idea Datasets Evaluation Tool or Advantage(s) Disadvantage(s)
method method evaluation
environment
[74] K-means Proposing a solution relevant Not mentioned Design Not mentioned • High error Not extending CTFF to resolve
clustering to regression testing in agile detection rate regression testing constraints in
practices • High component-based software and
performance product line engineering.
[75] K-means Increasing the test case Not mentioned Simulation Not mentioned • Early fault The selected test cases for object-
clustering prioritization effectiveness for detection oriented software were not evenly
object-oriented software • High spread across the input domain.
effectiveness
[76] Deep neural Measuring the quality of test • MNIST Simulation Python • Generating Advanced mutation operators are not
networks data with a mutation testing CIFAR-10 higher-quality proposed for investigating the
framework for DL systems test data relationships between mutation
operators.
[77] Deep neural Proposing a coverage testing • MNIST Simulation Python • High coverage Limited evaluation of datasets
networks method for DL software
[78] Deep Introducing DSbDNM as a • NF-UQ-NIDS- prototype Python • High accuracy Lack of Investigating classifiers that
Neural high-precision solution that v2 • High Precision might be able to lessen the
Network can be obtained in a short • High Recall limitations of deep learning-based
computational time • Network • High F-measure models without missing necessary
Intrusion features
Detection
datasets
Table 7
Unsupervised learning methods metrics in reviewed approaches.
Article Recall Precision Accuracy performance Efficiency F-measure Coverage Effectiveness APFD Mutation Score Time Suitability
[74] * * * * * * * * *
[75] * * * * *
[76] * *
[77] * * * * *
[78] * * * * * *
Table 8
Classification of recent research and additional information on reinforcement learning methods.
method evaluation
environment(s)
[81] Q-learning Automating test data Not mentioned Simulation • Python • Low fitness This paper’s approach
production with a structural evaluations does not guarantee the
method • Higher success quality of the final
rate solution to a problem.
[82] Q-learning employing efficient and Not mentioned Simulation • Not • High accuracy • High complexity
sustainable prioritization mentioned • High • Validation challenge
techniques to reduce the Adaptability
number of software test cases • High
while improving their quality. accessibility
[83] Reinforcement Using reinforcement learning to • Paint Control dataset Simulation Not mentioned High fault Not changing the agent
learning design a continuous integration • IOF/ROL (https detection algorithm to improve
reward function ://doi.org error detection
/10.7910/D
VN/GIJ5DE)
Google Open-Source
Data Set (GSDTSR)
(https://doi.org
/10.7910/DVN/MJFKD
N)
[84] Reinforcement Reinforcement learning-based Not mentioned Simulation • Siemens EXAMF/EXAML This paper’s method takes
learning ST multiple defect localization. suites evaluation index plenty of data and
• Grep reduction computation.
• Gzip
• Sed
[85] Reinforcement Assessing and contrasting the • Paint-Control Real testbed • Python High performance Inadequate analysis of
learning various DRL frameworks’ • IOFROL PPO-SB and PPO-TF
implemented algorithms https://github.com/ics implementations
e20/RT-CI
12
Table 9
Reinforcement learning methods metrics in reviewed approaches.
Article Time Performance Efficiency Success rate Coverage APFD Suitability
[81] * * *
[82] * * * * *
[83] * * *
[84] * *
[85] * * * *
4.3.2. Overview of reinforcement learning methods reinforcement learning-based test data generation. This study was
The categorization of the aforementioned studies and the influential accompanied by a small empirical investigation. The formulation of
factors for analyzing reinforcement learning methods are elaborated search-based test data generation involved the establishment of a deci
upon in greater depth in Table 8. Table 9 presents an assessment of the sion process that leverages reinforcement learning methods. The find
aforementioned studies, employing evaluation criteria specific to rein ings of the study demonstrated the feasibility of acquiring learning
forcement learning methods. The factors encompassed in this study behaviors through the utilization of a metaheuristic algorithm.
comprise time, performance, efficiency, success rate, coverage, and Chen, et al. [89] proposed the utilization of DRLGENCERT, a
APFD. framework that employs deep reinforcement learning methods for the
automated testing of certificate verification in implementations of
Secure Sockets Layer/Transport Layer security. The DRLGENCERT sys
4.4. Hybrid learning methods tem incorporated the utilization of standard certificates as both input
and output, potentially resulting in variations in efficiency during the
In order to address the limitations inherent in individual ML certificate generation process. The employed framework utilized deep
methods, it is necessary to integrate them into a cohesive approach to reinforcement learning to determine the optimal subsequent action
achieve optimal efficiency. In order to generate novel hybrid learning based on the outcomes of prior modifications, as opposed to random or
methods, it is imperative to employ a diverse range of methodologies inadvertent combinations. The findings suggested that DRLGENCERT
across multiple processes. The utilization of a hybrid learning method is functioned as an automated system for generating test cases.
predicated on the prevalent practice among researchers of combining Xiao, et al. [90] employed neural network models in their study to
two or more methods. However, the method is versatile and has the construct prediction models for the error recognition and fault correc
potential to be applied to a wide range of problems. The subsequent tion processes, taking into account the testing effort. A proposal was
portion of this section examines the principal attributes of the hybrid made for the implementation of a forecasting algorithm to carry out the
learning methods that have been selected. prediction models. The findings indicated that the ANN models recom
mended in this study outperformed the analytical model in accurately
4.4.1. Overview of the selected hybrid learning methods predicting the number of errors detected. The suggested model had the
Ahmad, et al. [86] showcased a heuristic performance testing potential to aid decision-makers in the assessment of reliability, esti
approach that automates the process of identifying system inputs and mation of costs, and determination of the most favorable release time.
determining specific input combinations that lead to performance bot López-Martín [91] conducted an investigation into the prediction of
tlenecks. The methodology, referred to as iPerfXRL, employs deep software test efforts using various ML models. The models were pre
reinforcement learning methods to manage extensive input spaces with pared and tested using datasets selected from a globally accessible
multiple dimensions effectively. This paper presented empirical evi public repository of software projects. The selection of data sets was
dence supporting the superiority of the proposed method in generating contingent upon evaluations of data quality, level of development,
more contextually appropriate outcomes compared to alternative ap development platform, programming language, measurement method,
proaches. Moreover, empirical evidence demonstrated that this and project resource level. The findings indicated that a highly precise
approach possesses the capability to identify and distinguish regions prediction of software test effort can contribute to the estimation of
within the input space that exhibit interconnected combinations. It was project cost, facilitate the development of a project schedule, and assist
recommended that diverse key performance indicators, such as CPU and the team leader in effectively establishing a test team.
memory utilization, be incorporated within the reward function to Kahles, et al. [92] utilized ML methods to automate the process of
enhance performance outcomes. Furthermore, it was possible to expand root cause analysis in agile ST environments. In order to achieve precise
the methodology to identify potential underlying factors contributing to categorization of the factors contributing to unsuccessful experiments,
performance limitations. the utilization of clustering and ML classification methods was
Chen and Huang [87] presented a reinforcement learning method contemplated. The findings of this study indicated that the utilization of
that utilizes test program generation to identify transition postponement MLP-based classification outperformed cluster analysis in the identifi
errors. The initial findings indicated that the utilization of a reinforce cation of underlying factors contributing to test failures. To enhance the
ment learning method has the potential to facilitate the creation of quality of conclusions, it was recommended to employ comprehensive
software-based self-test programs. The processor that was subjected to interviews with test engineers in order to identify novel features that can
testing was specifically aimed at complex test cases and successfully enhance the performance of algorithms.
achieved a high level of fault coverage for transition delay faults. The issue of testing and debugging neural network systems was
Although the processor under test did not include several recent pro discussed in [93]. Furthermore, the examination of the contrasting
cessor designs, it is important to note that flawless data and instruction systems with regard to algorithm implementation was conducted with a
caching were not taken into consideration in the scope of this paper. focus on testing. The aim of this article was to identify the requirements
In [88], a Double Deep Q-Networks (DDQN) agent was employed to of test systems and analyze the specific features of various neural
examine the potential substitution of human-designed meta-heuristic network models. The discourse revolved around strategies for mitigating
algorithms with reinforcement learning in the context of search-based the drawbacks associated with the systems.
ST methodologies. The authors presented a DDQN agent that was The model proposed in [94] employed a combination of RFM anal
trained using DNNs. Additionally, they introduced a comprehensive ysis and regression techniques to effectively minimize the testing
framework called GunPowder for search-based ST. In a similar vein, duration and resource allocation when applied to an existing dataset.
GunPowder conducted a feasibility study that explored the potential of
13
The development of these methods aimed to automate the model and comprehensiveness attained by ML methods in detecting secure and
enhance the comprehension of the test cases’ behavior during evalua insecure test scenarios for autonomous vehicles, specifically when
tion. Prior to conducting any formal examinations, the different modules employing fixed autonomous vehicle attributes. Consequently, the main
slated for testing underwent a thorough inspection. In cases where it was aim of their study was to investigate the potential improvement of
deemed necessary, the electrical properties of these individual compo SDC-Scissor’s ability to distinguish between secure and insecure test
nents were simulated to ascertain their precise operational capabilities. cases through the refinement of ML methods. To evaluate the practical
The test outcomes provided insights into the functioning mechanisms of feasibility of SDC-Scissor, the researchers integrated their tool into the
these systems, enabling the assessment of their reliability. The findings operational structure of an automotive industry entity.
were disseminated to the test management team, which integrated the The primary focus of [99] was to evaluate the precision of estimation
data into the system as a component of formulating a proficient test results by examining the quality of the inputs used for estimation as well
environment strategy. as the model employed. The aim of this study was to improve the ac
In [95], a method for prioritizing test inputs in DNNs was introduced. curacy of software regression test effort estimation (SRTEE) by devel
This approach utilized intelligent mutation analysis to consider a larger oping a technique called StackSRTEE. This technique incorporated a
number of test inputs, aiming to detect bugs at an earlier stage within a stacking ensemble model. The composition consisted of the three most
shorter time frame. Consequently, this technique enhanced the testing commonly used ML methods. The grid search (GS) methodology was
efficiency of DNNs by simplifying the process. This paper presented an utilized by the researchers to optimize the hyperparameters of the
approach for prioritizing DNNs based on intelligent change analysis. A StackSRTEE model. Following this, the model underwent training and
designed set of model and input change rules was specifically employed, evaluation using a dataset obtained from the ISBSG repository. The size
and rank-based learning was utilized to integrate these changes in order of the functional change was the primary independent variable
to prioritize the test input. The findings of this study provided evidence employed to augment the inputs of the StackSRTEE model.
supporting the suitability and effectiveness of the proposed methodol Khan, et al. [100] concentrated on using machine learning methods
ogy. This study examined the proposed methodology employed by a to anticipate cumulative software failure levels, with the ultimate goal of
corporation in the context of autonomous vehicles, with the aim of enhancing residual defect forecasts and acquiring a thorough compre
assessing its feasibility. The efficacy of this methodology was demon hension of software-related difficulties. The main data source for the
strated by the outcomes achieved. research study was software metrics and defect data that were retrieved
The utilization of ML methods for ST was discussed in [96]. The from a static code repository. Using a correlation methodology to find
study proposed an approach for identifying software defects by inte meaningful metrics, this dataset allowed the examination of the rela
grating the methodologies of random forest (RF) and CNN. The CNN tionship between different software metrics and reported problems.
method was employed to process the input data and extract relevant
features, which were subsequently utilized by the RF method for the 4.4.2. Overview of hybrid learning methods
purpose of classification. The evaluation of the proposed method Table 10 provides a comprehensive categorization of the selected
involved the utilization of two datasets. The results of this evaluation studies and elucidates in greater depth the influential factors that are
indicated that the CNN and RF methods exhibited superior accuracy employed for the analysis of hybrid learning methods. Table 11 presents
compared to conventional methods. Based on the findings of the study, an assessment of selected studies through the utilization of evaluation
the proposed methodology demonstrated its potential as a valuable in factors in hybrid methodologies. The factors considered in this study
strument for ST, particularly in scenarios where traditional testing encompassed various aspects such as recall, precision, cost, reliability,
methodologies may prove insufficient. However, further investigation time, accuracy, performance, efficiency, MI, f-measure, coverage, RMSE,
was necessary in order to validate the proposed strategy on larger and effectiveness, MSE, MAE, and Success rate. The performance and time of
more diverse datasets. hybrid learning methods were assessed in the majority of research
The combination of support vector machines (SVM) and k-nearest studies.
neighbor (KNN) methods was proposed as a method for solving the
problem of ST [97]. The methodology was employed within the 5. Analysis of results
framework of educational assistance software, whose effectiveness is
contingent upon precise and efficient testing procedures. The method In this section, the analysis of the results from the systematic review
ology involved conducting experiments on a dataset comprising test is presented. Section 5.1 provides a comprehensive summary of the
cases, wherein the SVM and KNN methods were employed to classify the chosen studies. In order to fulfill the objective of this review, which is to
test cases into two categories, namely passing or failing, based on the examine and analyze the distinctions, benefits, and drawbacks of
input attributes. The empirical results indicated that the hybrid learning different ML methods in ST, a comprehensive examination of the
method exhibited superior accuracy and efficacy compared to tradi mentioned classification is presented in Section 5.2.
tional testing methodologies. The conclusion of the article posited that
the proposed methodology could potentially be employed to improve 5.1. Overview of the selected studies
the precision and effectiveness of other ST scenarios. In general, the
study underscored the benefits associated with employing a hybrid The subsequent inquiries are designed to investigate the current
methodology and demonstrated the promising capabilities of ML advancements in ML methods in ST.
methods such as SVM and KNN in the context of ST.
Birchler, et al. [98] introduced a methodology called SDC-Scissor, • Which publishing houses have the highest number of papers on ML
which employs ML methods to identify and eliminate test cases with a methods in ST?
low likelihood of detecting faults in self-driving cars (SDCs) before their • What was the distribution of publishers and the amount of annual
execution. The main aim of their inquiry was to assess the viability and research dedicated to ML methods in the field of ST?
level of precision of categorizing test cases for SDCs as either safe or • What are the groups and active research communities on ML
unsafe before their implementation. The challenge that garnered methods in ST?
attention was related to the advancement of methodologies that can
effectively exploit features derived from SDC test cases. The aim was to
reduce the costs associated with testing while ensuring that the testing 5.1.1. A chronological examination of the studies
process remained highly effective. The researchers also investigated the Based on the data presented in Fig. 4, it can be observed that a ma
possibility of a maximum threshold for the precision and jority of the papers, specifically 40%, were published in IEEE. Following
14
Table 10
Classification of recent research and additional information on hybrid learning methods.
method evaluation
environment
(s)
[86] • Deep neural Representing an automated Not mentioned Simulation • Python Finding the more Reward performance does
networks heuristic performance relevant not include CPU and
(unsupervised) testing method that combination memory usage.
Q-learning identifies input
combinations that cause
performance bottlenecks
[87] • Convolutional Transition delay fault Not mentioned Simulation Not mentioned High coverage The tested processor’s
neural network detection via reinforcement outdated design
Q-learning learning-based test
program generation
[88] • Deep neural Investigating whether Not mentioned Formal Python High coverage Unoptimized network
network search-based software architecture
• Q-Learning testing (SBST)
reinforcement learning can
replace human-designed
metaheuristic algorithms
[89] • Q-learning Proposing a Deep Not mentioned Prototype • ZMap [101] Discovering several The used algorithm
Deep neural Reinforcement Learning previously requires large amounts of
networks system for SSL/TLS unknown certificate data, and trained models
certificate verification verification flaws cannot multitask.
automation
[90] • Feedforward Developing ANN-based • Firefox from Bugzilla Simulation Not mentioned High prediction The algorithm demands a
neural network error recognition and fault dataset (https://bugz errors lot of data and
• Recurrent neural correction prediction illa.mozilla.org/) development time.
network models Product of Tomcat 8
Convolutional dataset (https://bz.apa
neural network che.org/bugzilla/)
[91] • Simple linear Investigating ML models for The International Simulation Not mentioned • Calculating the Not utilizing ML models
regression software test effort Software Benchmarking project cost for defect correction effort
• Multilayer prediction Standards Group dataset • Making project prediction.
perceptron (ISBSG) schedule
• Support vector Enabling the team
regression leader to form a test
• Decision trees team correctly.
• K-Nearest
neighbors
[92] • k-means Root cause analysis Not mentioned Prototype Not mentioned High accuracy • Avoiding
algorithm automation in agile ST hyperparameter
• Gaussian environments using ML optimization
mixture • Ignoring granular
• Multilayer ground truth failure
perceptron categories
• No feature extraction to
improve algorithms’
outcomes
Not testing NLP
[93] • Recurrent neural Adapting ST methods to Not mentioned Design Not mentioned Providing a Considering an example of
network test neural networks guideline to a small dimension
• Multi-Layer improve neural
perceptron network models
[94] • Ridge regression Automating the model and Not mentioned Simulation Python Evaluating system The algorithm misfits
• Simple linear understanding test case reliability complicated datasets.
regression behavior using some
• Support vector methods
regression
[95] • Convolutional Proposing intelligent • CIFAR-10 [102] Simulation Python High effectiveness The algorithm in use
neural networks mutation analysis to • CIFAR-100 [102] requires a lot of data.
• Recurrent neural prioritize DNN test inputs • MNIST (http://yann.le
network to label more bug-revealing cun.com/ex
inputs early for a short time db/mnist/)
• MNIST_VS_USPS [103]
• COIL (https://www.
cs.columbia.edu/
CAVE/software/softli
b/coil-20.php)
• PIE27_VS_PIE5
(http://www.cs.cmu.
edu/afs/
(continued on next page)
15
Table 10 (continued )
method evaluation
environment
(s)
cs/project/PIE/
MultiPie/Multi-Pie/
Home.html)
• PIE27_VS_PIE9
(http://www.cs.cmu.
edu/afs/
cs/project/PIE/
MultiPie/Multi-Pie/
Home.html)
• Driving (https://uda
city.com/
self-driving-car)
• TREC [104]
• IMDB [105]
• SMS Spam [106]
• CoLA [107]
• Hate Speech [108]
• KDDCUP99
(http://kdd.ics.uci.
edu/databases/
kddcup99/kddcup99.
html)
[96] • Convolutional Providing information Not mentioned Simulation Python • High accuracy Limited dataset
neural networks regarding a financial High coverage
Random forest institution’s ability to
authorize credit cards for
its clients
[97] • Support vector Utilizing an educational Not mentioned Real testbed Not mentioned High accuracy No comparison to other
machines assistant software system to ML methods
• K-Nearest show the efficacy of a
neighbors hybrid ML method for ST
[98] • Naive Bayes Determining ML-based test Zenodo (https://zenodo. Simulation Python • High accuracy • Need further SDC
• Random Forest selection methods for cost- org/record/7011983) • High precision datasets
effective simulation-based High Recall Not considering having
SDC testing flaky tests in virtual
environments
[99] • Neural networks Evaluating the efficacy of ISBSG Simulation Not mentioned • High accuracy Needs better ML
• Support vector estimation inputs and the • High precision algorithms to increase test
regression corresponding model in regression accuracy
• Decision trees generating precise
regression estimation outputs
[100] Recurrent neural Investigating the feasibility Not mentioned Simulation Not mentioned High effectiveness limited by the lack of
network of combining software identifying clusters certain data
DBSCAN clustering and fault prone to errors variations in performance
k-means prediction • Low failure across different methods
Random Forest patterns within
Support vector software systems
machines
Convolutional
neural networks
this, 37% of the papers were published by Springer, while 12% were learning methods in software testing publications reflects a natural
published by Elsevier. A smaller proportion of papers, namely 6%, were evolution in research priorities and the broader landscape of software
published in Wiley and Taylor&Francis collectively, and the remaining testing methodologies.
5% were published in ACM. Furthermore, Fig. 5 presents a visual rep Table 12 displays the categorization of the papers into two distinct
resentation of the quantity of scholarly papers published in both journals groups, namely conferences and journals, where they have been pub
and conference proceedings spanning the years 2018 to March 2024. lished. Out of the total of 40 papers examined, it was found that three of
Additionally, it is illustrated that 2018 witnessed the greatest number of them were published in EMSE (impact factor: 4.1), two were published
published papers. The trend of decreasing machine learning methods in in ISSRE, and two were published in the Software Quality Journal
software testing publications over the years may be attributed to several (impact factor: 1.9). Furthermore, it is evident that the majority of the
factors. Initially, there was significant excitement and exploration of the papers were disseminated through conference proceedings, as indicated
potential applications of machine learning in software testing, leading to by Table 12.
a surge in related research publications. However, as the field matured,
researchers likely shifted their focus towards refining existing method 5.1.2. Active research communities
ologies, addressing practical challenges, and exploring interdisciplinary Table 13 illustrates the distribution of papers across publication
approaches. Additionally, advancements in other areas of software channels that have published a minimum of two papers in the specific
testing, such as automation tools and techniques, may have also influ area under investigation. This table provides insights into the distribu
enced the distribution of research efforts away from exclusive reliance tion of active research communities within the selected papers, taking
on machine learning methods. Overall, the decreasing trend in machine into account the affiliations of the authors following the final selection
16
during the conducting phase. Table 13 presents a comprehensive
Suitability
compilation of active communities that have been involved in a mini
mum of two studies that were included in the analysis. Additionally, the
table provides information regarding the specific research focus of each
*
community.
Researchers from the Xiamen University of Technology in China, Al-
Success rate
Balqa Applied University in Jordan, and Sathyabama Institute of Science

and Technology in India have each published two papers on this subject.
As illustrated in Table 13, the majority of researchers directed their
*
attention towards methods of supervised learning.
MAE
*
*
5.2. Research objectives, methods, and evaluation metrics
MSE
In this section, we provide responses to the research inquiries that

*
*
*
were introduced in Section 3.1. The research questions provide essential
information regarding the specific aspects that should be examined
Effectiveness
when analyzing the presented survey. Additionally, they serve as a guide

for future discussions and investigations. To address Q1, the preceding
sections have delineated the predominant ML methods in ST, which are
*
*
*
classified into four primary categories. The taxonomy encompasses

various categories of ML methods, including supervised learning
RMSE
methods, unsupervised learning methods, reinforcement learning

*
methods, and hybrid learning methods.

Fig. 6 demonstrates that a significant proportion of the selected pa
Coverage
pers, specifically 74%, focused on the collective utilization of supervised

learning methods and hybrid learning methods. This stems from the
*
origins of hybrid learning and supervised learning methods to tackle

diverse real-world problems. Supervised learning, where models learn
F-measure
from labeled data, is favored for its straightforward implementation and

ability to provide accurate predictions when ample labeled data is
*
*
*
available. Hybrid learning methods, combining elements of different

approaches, offer a flexible and adaptive approach, allowing for
MI
improved performance and generalization across varied datasets. The

*
popularity of these methods stems from their proven success in a wide

Efficiency
range of applications, including image recognition, natural language

processing, and predictive analytics. By leveraging the strengths of both
supervised and hybrid learning, researchers and practitioners can ach
*
*
*
*
ieve superior results and address complex challenges in machine

performance
learning effectively. Also, considering the significant prevalence of

mixed approaches incorporating supervised and hybrid methods,
particularly the latter, which amalgamates various techniques within
machine learning, it’s evident that supervised and hybrid methods hold
*
*
*
*
*
greater popularity. Another primary factor motivating researchers to

Accuracy
utilize supervised methods is the widespread availability of labeled

datasets, which significantly facilitates data-handling processes. Addi
*
*
*
*
*
*
tionally, 13% of the papers examined unsupervised learning methods,

and 13% explored reinforcement learning methods.
Time
This study involved the assessment of reviewed papers using multi

*
*
*
*
*
*
*
*
ple evaluation parameters, as documented in Tables 5, 7, 9, and 11. The

Hybrid learning methods metrics in reviewed approaches.
formula presented below is employed to determine the most important

Reliability
and significant parameters of Q2. Fig. 7 depicts the parameters

employed by researchers for the evaluation of techniques and methods
utilized in the reviewed papers. As indicated in the given parameters,
*
the percentage of each parameter was calculated using Eq. (1). Eq. (1)
Cost
entails the process of tallying the frequency of individual occurrences,

*
dividing it by the total number of occurrences, and subsequently

multiplying the result by 100. In this instance, there was a notable
Precision
emphasis on accuracy, constituting 14% of the overall focus, in contrast

to the combined emphasis on other metrics, which accounted for more
*
*
*
*
than 34%. Performance and time exhibited a somewhat analogous

pattern to accuracy, with both accounting for approximately 28% of the
Recall
attention.
*
*
Machine learning techniques offer the potential for high accuracy,

time efficiency, performance, and overall testing efficiency in software
Table 11
Article
[100]
testing. By automating various testing processes, machine learning al

[86]
[87]
[88]
[89]
[90]
[91]
[92]
[93]
[94]
[95]
[96]
[97]
[98]
[99]
gorithms can execute tests faster and with greater precision than manual
17
Fig. 4. Percentage of chosen papers classification.
Fig. 5. Distribution of papers depending on publication years.
methods. These algorithms excel at recognizing patterns and anomalies approaches.

within data, enabling the detection of defects and bugs with remarkable In response to Q3, as indicated in Fig. 9, it was found that 30% of the
accuracy. Moreover, machine learning models can leverage predictive papers did not mention the tools utilized for the analysis and imple
analytics to prioritize testing efforts effectively, optimize resource allo mentation of the proposed methods. Based on the data presented in
cation, and improve overall testing efficiency. Their adaptability allows Tables 4, 6, 8, and 10, it can be observed that the Python programming
them to continuously learn and evolve from new data and experiences, language was the most frequently utilized environment, accounting for
leading to more effective testing strategies over time. Additionally, 39% of the total usage across 40 research studies focused on ML methods
machine learning facilitates parallel testing across different platforms in ST. The widespread use of Python in machine learning and software
and configurations, further reducing testing time while maintaining testing has both positive and negative implications. On the positive side,
high levels of accuracy. The feedback loop provided by machine Python’s simplicity and readability facilitate accessibility and collabo
learning models also fosters continuous improvement in the software ration within the field, fostering a vibrant community and extensive
development process, enhancing overall quality and efficiency. How open-source resources. However, this reliance on Python may lead to a
ever, successful implementation relies on factors such as the quality of lack of diversity in tooling and methodologies, potentially limiting
training data, algorithm selection, and the expertise of the testing team innovation. Furthermore, challenges arise regarding reproducibility and
in utilizing machine learning effectively. interoperability, as different libraries and frameworks may vary in
support and documentation. The dominance of Python may inadver
Number of each occurrence
occurrence percentage(i) = ∑ ∗ 100 (1) tently exclude researchers proficient in other languages, potentially
Number of all occurrence
overlooking valuable perspectives and insights. Lastly, in software
Fig. 8 illustrates that within the domain of supervised learning testing, while Python offers powerful testing libraries, there’s a risk of
methods, scholars primarily emphasized accuracy (19%) and perfor blind spots and biases in test coverage due to the reliance on a single
mance (16%). Conversely, in the realm of unsupervised learning language. The utilization of MATLAB was 4%, and FORTRAN and Junit
methods, accuracy (16%) and performance (8%) emerged as the pivotal were observed, each representing 3% of the chosen papers. At the same
factors. The efficiency and time parameters of reinforcement learning time, other tools or evaluation environments were employed in 21% of
methods experienced a respective increase of 25% and 19%, respec the selected papers.
tively. In the papers employing hybrid methods, the pivotal metrics of In regard to Q4, as depicted in Fig. 10, a number of selected studies
interest are time and performance, which account for 12% and 14% have employed diverse datasets in order to assess their methodologies
improvement, and accuracy, which exhibits an 11% enhancement. The for analyzing experimental outcomes. A significant portion of the re
findings indicate that accuracy plays a crucial role in the majority of searchers failed to specify the dataset utilized in their study. The MNIST
18
Table 12 Table 12 (continued )

The distribution of the studies per publication channel. Category Publisher Publication channel Impact Count
Category Publisher Publication channel Impact Count factor
factor (2022)
(2022)
International Conference - 1
Conferences IEEE International Symposium - 2 on Emerging Trends and
on Software Reliability Advances in Electrical
Engineering (ISSRE) Engineering and
International Conference - 1 Renewable Energy
on Data Science and (ETAEERE)
Advanced Analytics Elsevier Materials Today: - 1
(DSAA) Proceedings (MATPR)
Asian Test Symposium - 1 Journals Springer Empirical Software 4.1 3
(ATS) Engineering (EMSE)
IEEE Symposium on On- - 1 Software Quality Journal 1.9 2
Line Testing (IOLTS) Distributed and Parallel 1.2 1
International Conference - 1 Databases
on Machine Learning and Programming and 0.7 1
Applications (ICMLA) Computer Software
IEEE Symposium Series on - 1 Soft Computing (SOCO) 4.1 1
Computational Intelligence Journal of Supercomputing 3.3 1
(SSCI) (JSUP)
International Conference 1 Elsevier Expert Systems with 8.5 1
on Software Engineering Applications (ESWA)
(ICSE) Applied Soft Computing 8.7 1
International Symposium - 1 (ASOC)
on Quality Electronic Computers & Security 5.6 1
Design (ISQED) Journal of Systems and 3.5 1
International Conference - 1 Software
on Contemporary IEEE IEEE Access 3.9 1
Information Technology Taylor&Francis International Journal of - 1
and Mathematics (ICCITM) Computers and
IEEE International - 1 Applications (IJCA)
Conference on Software Wiley Concurrency and 2.0 1
Quality, Reliability and Computation: Practice and
Security Companion (QRS- Experience (CCPE)
C)
on Software Maintenance
(ICSM) Table 13
IEEE Conference on - 1 Active communities and their research focus.
Software Testing,
Affiliation Ref. Category
Validation, and
Verification (ICST) Xiamen University of Technology, Xiamen, [83, Reinforcement learning
International Conference - 1 China 69] methods,
on Electrical Engineering Supervised learning
and Informatics (ICEEI) methods
International Conference - 1 Al-Balqa Applied University, Salt, Jordan [62, Supervised learning
on Code Quality (ICCQ) 64] methods
ACM International Workshop on - 1 Sathyabama Institute of Science and [70, Supervised learning
Search-Based Software Technology, Chennai, India 97] methods,
Testing (SBST) Hybrid learning methods
on Automation of Software
Test (AST)
Springer International Workshop on - 1
Structured Object-Oriented
Formal Language and
Method (SOFL+MSVL)
on Multimedia Technology
and Enhanced Learning
(ICMTEL)
on Intelligent Computing,
Information and Control
Systems (ICICCS)
Europe Middle East & Fig. 6. Percentage of selected papers classification.
North Africa Information
Systems and Technologies
to Support Learning dataset is widely recognized as the most frequently utilized dataset in
(EMENA-ISTL) various domains. Following MNIST, the CIFAR-10 dataset, the Paint
International Conference - 1 Control dataset, the IOF/ROL dataset, and the ISBSG dataset are sub
On Innovative Computing
sequently regarded as significant resources in their respective fields.
And Communication
(ICICC) However, curating high-quality training datasets for machine
learning in software testing presents several challenges, including the
need for diverse, representative, and labeled data. To address these
19
Fig. 7. Percentage of evaluation metrics in the reviewed methods.
Fig. 8. Percentage of evaluation metrics in the chosen articles for each classification.
Fig. 10. Repetition of employed datasets and case studies in the cho
Fig. 9. Percentage of evaluation tools in the chosen articles for each sen articles.
classification.
list of datasets suitable for machine learning applications in software
challenges, researchers and practitioners employ various strategies and testing in Table 14. It categorizes datasets based on their specific pur
best practices. This includes collecting data from sources such as bug poses, including bug prediction, test case generation, code quality
repositories and code repositories, annotating the data with relevant assessment, security testing, mobile app testing, regression testing,
labels like bug severity and test case outcomes, and augmenting the image classification, and other miscellaneous tasks. Each dataset is
dataset through techniques like synthetic data generation and transfer accompanied by a brief description highlighting its relevance and usage
learning. Active learning techniques help prioritize annotation efforts, in software testing research and practice. This curated list serves as a
while collaboration with domain experts ensures dataset relevance and valuable resource for researchers and practitioners seeking high-quality
quality. Ethical considerations, continuous improvement, and bench training data for developing and evaluating machine learning algo
marking efforts further enhance dataset curation, ultimately facilitating rithms in the field of software testing.
more effective and reliable machine-learning approaches in software To specify Q5, as indicated in Fig. 11, it can be observed that the
testing. evaluation methods employed in the selected papers are predominantly
Drawing from the reviewed articles, we compiled a comprehensive simulations, accounting for 77% of the total. Prototypes constitute 10%
of the evaluation methods, while design-based methods comprise 5%.
20
Table 14
Datasets for machine learning in software testing.
Purpose Dataset Name Description
Bug Prediction PROMISE Repository Collection of datasets for software

engineering tasks, including defect
prediction.
NASA MDP NASA’s Metrics Data Program
provides datasets for software
engineering research.
ECLIPSE Dataset from the ECLIPSE project for
defect prediction and software
quality assessment.
Test Case GZoltar A dataset containing code coverage
Generation information to facilitate test case
generation. Fig. 11. Percentage of evaluation methods in the chosen articles for each
KLEE Dataset from KLEE symbolic
classification.
execution engine, useful for
generating test cases.
Siemens Suite A well-known dataset for software 6. Open issues and future works
testing, consisting of programs and
test cases.
Code Quality CK Contains software projects with
Significant concerns regarding the employment of ML methods in ST
Assessment associated metrics for code quality have yet to be comprehensively and exhaustively explored, as illustrated
assessment. in Fig. 12. This section presents an overview of the major challenges and
GitHub Repositories Various repositories on GitHub future trends associated with ML methods in ST, with a specific focus on
provide datasets for code quality
addressing Q6. The present study employed the SLR approach to collect
assessment and analysis.
MSR Mining Challenge Datasets from the Mining Software data. Consequently, the subsequent sections discuss the challenges and
Datasets Repositories (MSR) challenge for issues that are currently regarded as unresolved. The reviewed papers
code quality analysis. have not provided a comprehensive and precise depiction of the chal
Security Testing OWASP WebGoat A deliberately insecure web lenges and future trends.
application designed for testing
security tools and knowledge.
SARD (Software Collection of known vulnerabilities • Test Oracle: A significant portion of the testing process is pragmatic
Assurance Reference and vulnerable software and satisfying; however, there exists a necessity to enhance the ef
Dataset) applications. ficiency, affordability, and dependability of the testing process. To
NIST National Dataset of known software, software
address this requirement, it is necessary to establish a test Oracle,
Software Reference libraries, and file profiles for forensic
Library (NSRL) analysis. which is a procedure that distinguishes between the correct and
Mobile App DroidBench A benchmark suite to evaluate the incorrect behaviors exhibited by the system during testing [59,63,65,
Testing effectiveness of dynamic program 109]. The matter of test Oracle automation has received considerably
analysis tools for Android apps. less focus compared to various other facets of test automation, and it
Sapienz A dataset for automated testing of
continues to be a relatively unresolved issue, as noted in [110,111].
Android apps, containing
automatically generated test cases. There is a pressing need for a collaborative endeavor to improve
Regression DaCapo A benchmark suite for Java-based testing methodologies with the aim of achieving a more compre
Testing dynamic compilation research. hensive and advanced test automation framework. Additionally,
Apache Commons Lang A dataset for regression testing,
there is a need to explore more effective approaches to tackle the test
consisting of projects from the
Apache Commons Lang library. Oracle problem and to develop automated or semi-automated test
JUnit Test Suites Test suites from various projects Oracle solutions. This challenge is particularly significant for
using the JUnit testing framework, ML-based systems since they operate on probabilistic reasoning and
useful for regression testing. lack established predicted values for comparison with actual values
Image MNIST Handwritten digits dataset widely
during testing. Consequently, assessing the accuracy of their output
Classification used for image classification tasks.
CIFAR-10 Dataset of small color images for becomes a challenging task [112].
object recognition. • Multi-objective optimization: Upon reviewing and analyzing various
Paint Control Dataset A dataset for image classification in research papers, it has been observed that the majority of them did
the context of paint control
not take into account the simultaneous consideration of multi-
applications.
Other ISBSG International Software
objective goals in ML methods for ST. Each paper is centered on a
Miscellaneous Benchmarking Standards Group unique challenge that exhibits diverse characteristics. As an illus
dataset for software project tration, some papers offer several techniques, such as scalability or
benchmarking. precision, while others disregard these metrics entirely. The avail
IOF/ROL Dataset for input/output error
able techniques tend to overlook multi-objective goals. Therefore,
detection and recovery in software
systems. there is a need for an effective and multi-objective method to
determine the parameters of ML methods in ST. These methods
should aim to create a trade-off between the parameters. This issue is
Real test beds and formal methods each represent 5% and 3% of the currently an open and tempting area of research.
evaluation techniques utilized. Researchers prefer simulation environ • Test coverage: According to an analysis of ML methods in ST, only
ments over real test-bed examinations mainly due to cost efficiency, 20% of studies addressed the concept of test coverage. The metric of
control over experimental variables, risk mitigation, and accessibility. test coverage in ST is a subject of interest, as it measures the extent of
Simulations offer scalability and reproducibility, allowing for large-scale testing carried out by a given set of tests. According to [94], the
experimentation and exploration of diverse scenarios, albeit with assessment of software test coverage measures determines the extent
acknowledged limitations compared to real-world testing. to which testing has been thorough. The techniques that can be
employed are statement coverage, branch coverage, and path
21
Fig. 12. Open issues and future works on ML methods implemented in ST.
coverage, which may still pose a challenge. This challenge is espe software systems in various domains. One proposed methodology is
cially important in the context of ML, where assessing complex sys to integrate trust and security mechanisms directly into the design
tems requires rigorous testing methods to guarantee their accuracy and implementation of ML models used for testing purposes. This can
and reliability. involve employing techniques such as adversarial training to
• Accuracy: The accuracy of testing is a crucial challenge that ML enhance the robustness of ML models against malicious attacks and
methods must address. Based on the literature review, it appears that ensure compliance with industry-standard security protocols and
52% of studies addressed the concept of test accuracy, yet as a very best practices. Moreover, incorporating explainability and inter
important metric in ML methods, it can be an open issue. The con pretability features into ML-based testing approaches can enhance
cepts of accuracy and precision are inherently ambiguous across trust by providing insights into the decision-making process of these
languages, but they can be effectively measured as a metric of quality models.
[94]. Assessing the aforementioned attribute constitutes a means of • Interoperability: Interoperability testing is an ST methodology that
guaranteeing trustworthiness in the results of simulations. The ac evaluates the ability of a software system to function seamlessly with
curate outcome is considered the standard for measuring quality. The other software components, systems, and versions. One of the pri
attribute of accuracy is employed as a component of validation and mary advantages of utilizing standards-based products is the
verification in scientific ST. Although the concepts of accuracy and achievement of effective interoperation. Regardless of whether the
precision have been studied in ML and ST methods, the subject is still standards are proprietary, international, or public in nature, users
considered an open issue. commonly anticipate that products supported by standards will
• Scalability: Scalability is a critical aspect to consider when deploying seamlessly interact with comparable products, as noted in [113].
ML methods in software testing, particularly concerning the perfor Various forms of testing are commonly utilized to assist consumers in
mance and efficiency of these methods as the size and complexity of making informed decisions regarding the compatibility of products
software systems increase. One potential framework to address with similar standards and their functionality when used in
scalability issues is to adopt distributed computing paradigms and conjunction with other products. The various tiers of interoperability
parallel processing techniques. By distributing the computational testing encompass specification-level interoperability, data-type
workload across multiple nodes or processors, scalability can be interoperability, physical interoperability, and semantic interopera
enhanced, allowing ML-based testing methods to handle larger bility. This subject matter holds promise for prospective research
datasets and more complex software systems. Additionally, endeavors.
designing algorithms and architectures that are inherently scalable • Failure management: The findings of our study indicate that the suc
can ensure that ML-based testing approaches can adapt to the cess of software projects cannot be attributed to a singular factor.
growing demands of modern software development practices. Several techniques that can enhance the success rates of projects
• Trust & security: Ensuring trust and security in ML-based testing have been investigated. However, a significant majority of these
methodologies is paramount, especially given the critical nature of techniques focus on identifying and mitigating the impact of factors
22
that may contribute to project failure. According to [65,114], and from labeled data, often garners more prominence compared to un
[115], the timely prediction of software system malfunctions can aid supervised learning, which relies on unlabeled data. This discrep
in determining potential remedies that may enhance the achieve ancy may arise from historical precedence, ease of implementation,
ment rate. Defects that arise during the testing phase can result in or perceived efficacy in specific contexts. Additionally, reinforce
software failure and are commonly referred to as failures. Further ment learning, celebrated for its trial-and-error-based learning with
more, instances of failure may not necessarily arise solely in reaction rewards, may receive undue focus due to its association with notable
to unidentified faults or unresolved glitches. Hardware or firmware successes like AlphaGo. Furthermore, hybrid learning methods,
faults can arise from environmental factors or errors, such as an amalgamating diverse approaches, encounter biases based on
incorrect input value utilized during the testing phase. The factors perceived novelty or utility. These biases have far-reaching impli
leading to failure may vary, as indicated by previous research [114]. cations, impacting research agendas, funding distribution, and
The integration of failure management into current methodologies educational priorities within the machine learning domain. Conse
may present a compelling avenue for future research. quently, there emerges a risk of overlooking crucial insights and
• Empirical research: ML methods have received a lot of attention in hindering comprehensive progress across learning paradigms.
recent years, including in ST. However, implementing these ideas in Addressing these biases is paramount to fostering a more inclusive
real-world settings remains difficult. According to the evaluated and balanced landscape, ensuring equitable advancement in the
papers, only 5% of the proposed ML methods have been tested in real field.
testbeds, with the remaining studies relying on simulation tools for
testing. In addition, empirical research in the realm of ML methods 7. Threats to validity and limitations
for software testing is vital for understanding the practical implica
tions and effectiveness of these methods. To address this challenge, it The potential limitations that may affect the validity of this research
is essential to design and conduct experiments in real-world settings study are carefully examined through a series of steps, which are out
rather than relying solely on simulations. One proposed methodol lined as follows:
ogy is to establish collaborative partnerships between academia and First step. It involves examining the potential risks that may
industry to facilitate the deployment of ML methods in actual test compromise the accurate identification of comprehensive primary pa
beds. These partnerships can enable access to diverse datasets, real- pers. The primary determinant of the research strategy is the acquisition
world software systems, and the infrastructure necessary for con of a comprehensive body of literature that is free from bias. In order to
ducting empirical studies. Furthermore, establishing benchmarks achieve this objective, the common strings in the search term were
and standardized evaluation protocols can aid in comparing the searched and combined. Additionally, a review protocol was formulated
performance of different ML-based testing approaches across various with the aim of identifying pertinent and impartial studies.
domains. Second step. It involves identifying and analyzing potential threats to
• High-quality training datasets: Access to high-quality training datasets the process of selection and data extraction. In the context of systematic
is fundamental for the success of ML-based testing methods. To reviews, it is customary for each study to undergo a process of quality
address this challenge, researchers can explore innovative tech assessment [27].
niques for data collection, augmentation, and synthesis. This may
involve leveraging crowdsourcing platforms, simulation environ • The presence of bias in the outcome is undesirable, as the objective is
ments, and data generation algorithms to create diverse and repre to obtain accurate results.
sentative datasets for software testing. Additionally, establishing • Internal validity refers to the extent to which a study is conducted
data-sharing initiatives and repositories within the research com without any systematic errors.
munity can facilitate the dissemination of high-quality datasets and • External validity refers to the extent to which the findings of a study
promote collaboration among researchers. can be generalized and applied to real-world situations beyond the
• Test input generation: SBST utilizes a meta-heuristic optimization specific context in which the study was conducted.
search method, such as a genetic algorithm, to autonomously create
test inputs [88]. This test-generation technique has been extensively This investigation employed two distinct types of quality assess
employed in academic studies pertaining to conventional software ments. Two types of assessments were conducted. The first type involved
testing paradigms. In addition to its application in producing test evaluating the quality of the papers based on their ability to address the
inputs for evaluating functional qualities such as program correct research questions. The second type of assessment was carried out spe
ness, SBST has also been employed to investigate conflicts related to cifically to address one of our main research questions.
algorithmic fairness during requirement analysis [116]. The appli Third step. We examined the potential risks and vulnerabilities that
cation of SBST has demonstrated successful outcomes in the testing may compromise the integrity and security of the synthesized data, as
of autonomous driving systems, as evidenced by multiple studies well as the subsequent outcomes and findings. The challenge of reviews
[98,117,118,119]. There are several research opportunities avail is further compounded by the presence of a reliability threat [120]. The
able in the application of SBST for generating test inputs in the present matter was derived from a comprehensive characterization
context of testing other ML systems. This is due to the evident model involving the collaboration of multiple researchers as well as the
compatibility between SBST and ML, as SBST is capable of dynami implementation of various methodological and procedural steps, which
cally searching for test inputs throughout extensive input spaces. were subsequently piloted and subjected to external evaluation. In
Current methodologies for creating test inputs mostly concentrate on conducting our SLR, we adhered to the guidelines outlined in [27,28].
producing adversarial inputs to evaluate the resilience of an ML However, it is important to note that certain deviations from their pre
system. However, adversarial examples have faced criticism because scribed methods were encountered, as detailed in Section 3. By
of their lack of representation of genuine input data. Therefore, an employing a rigorous systematic review methodology, conducting
interesting and challenging area of research is the development of external evaluations, and engaging multiple researchers, it is possible to
methods for generating authentic test inputs and the automatic assert that the review possesses a high level of validity. This review
evaluation of their naturalness. sought to offer a thorough and methodical examination. Nevertheless, it
• Bias in representation across machine learning paradigms: In the realm is important to acknowledge certain limitations of this study that should
of machine learning research and practice, there exists a noticeable be taken into account in future research. Some of the limitations of this
disparity in the attention and resources allocated to different study are as follows:
learning methodologies. Supervised learning, wherein models learn
23
• This research exclusively considers reputable journal and conference interests or personal relationships that could have appeared to influence
papers that possess the highest qualifications. Consequently, books, the work reported in this paper.
book chapters, non-English scripts, non-JCR papers, commentaries
and review papers, and short papers have been excluded. Data availability
• In this SLR, nine well-known online databases were utilized to
identify relevant, reputable papers. However, it is important to note No data was used for the research described in the article.
that the authors cannot assert that all papers on the subject of ML
methods in ST have been selected. References
• Section 3.1 of this SLR presents six questions that were posed to
explore this topic further; however, it is possible to propose addi [1] H. Zuse, Software complexity: measures and methods, Walter de Gruyter GmbH
& Co KG, 2019, pp. 1–5.
tional questions. [2] A. Fuggetta, Software process: a roadmap, Proc. Conf. Future Softw. Eng. (2000)
• The reviewed papers in this research were classified into four cate 25–34.
gories: supervised learning methods, unsupervised learning [3] B.W. Kernighan, P.J. Plauger, Software tools, ACM SIGSOFT Softw. Eng. Notes
vol. 1 (1) (1976) 15–20.
methods, reinforcement learning methods, and hybrid learning [4] M. Nikravan, et al., An intelligent energy efficient QoS-routing scheme for WSN,
methods. However, there may be other possible categories. Int. J. Adv. Eng. Sci. Technol. vol. 8 (1) (2011) 121–124.
[5] A. Mishra, Review: software quality assurance—from theory to implementation,
Comput. J. vol. 47 (6) (2004) 728–7218, https://doi.org/10.1093/comjnl/
8. Conclusion and future work 47.6.728.
[6] B. Arasteh, S.M.J. Hosseini, Traxtor: an automatic software test suit generation
This study tried to fill a gap in the existing pieces of literature by method inspired by imperialist competitive optimization algorithms, J. Electron.
Test. vol. 38 (2) (2022) 205–215.
providing a comprehensive review of ML methods in ST, an area that has
[7] B. Arasteh, P. Gunes, A. Bouyer, F. Soleimanian Gharehchopogh, H. Alipour
not received thorough attention in previous reviews. With a recognition Banaei, R. Ghanbarzadeh, A modified horse herd optimization algorithm and its
of the significance of ML methods in ST, our study aimed to thoroughly application in the program source code clustering (vol), Complexity 2023 (2023)
analyze the methods’ obstacles, potential directions, advantages, and 3988288.
[8] B. Arasteh, R. Ghanbarzadeh, F.S. Gharehchopogh, A. Hosseinalipour, Generating
disadvantages. By conducting a systematic literature review, we cate the structural graph-based model from a program source-code using chaotic
gorized and compared various ML methods in ST, aiming to establish a forrest optimization algorithm, Expert Syst. vol. 40 (6) (2023) e13228.
clear classification system. Key questions addressed include the types of [9] B. Broekman, E. Notenboom, Testing embedded software, Pearson Education,
2003, pp. 217–228.
ML methods utilized, evaluation metrics, common tools and environ [10] B. Arasteh, F.S. Gharehchopogh, P. Gunes, F. Kiani, M. Torkamanian-Afshar,
ments, available datasets, case studies, and methods for assessing ac A novel metaheuristic based method for software mutation test using the
curacy and efficiency. We analyzed 40 papers sourced from 284 journals discretized and modified forrest optimization algorithm, J. Electron. Test. vol. 39
(3) (2023) 347–370.
and conferences spanning 2018 to March 2024. Notable developments [11] H. Reza, K. Ogaard, A. MalgeA model based testing technique to test web
include fluctuating publication trends, with 2018 displaying the highest applications using statecharts IEEE , in Fifth International Conference on
volume of papers and 2024 showing the lowest. Dominant publishing Information Technology: New Generations (itng 2008) , 2008, in Fifth
International Conference on Information Technology: New Generations (itng ),
sources were identified, with IEEE, Springer, and Elsevier contributing 2008183–188.
significantly to the total publications. Papers were categorized into four [12] A.M. Nascimento, L.F. Vismari, P.S. Cugnasca, J.B.C. Júnior, J.R. de Almeira
primary methods: supervised learning, unsupervised learning, rein JúniorA cost-sensitive approach to enhance the use of ML classifiers in software
testing efforts IEEE , 2019 18th IEEE International Conference On Machine
forcement learning, and hybrid learning methods, with supervised
Learning And Applications (ICMLA) , 2019, 18th International Conference On
learning and hybrid learning methods collectively representing 74% of Machine Learning And Applications (ICMLA), IEEE20191806–1813.
the published papers. Python emerged as the predominant tool and [13] Z. Asghari, B. Arasteh, A. Koochari, Effective software mutation-test using
platform in 39% of the reviewed papers. Evaluation methods predomi program instructions classification, J. Electron. Test. vol. 39 (5) (2023) 631–657.
[14] P. Bourque, J.-M. Lavoie, A. Lee, S. Trudel, T.C. LethbridgeGuide to the software
nantly involved simulation environments, with various quality assess engineering body of knowledge (swebok) and the software engineering education
ment metrics employed as recall, precision, accuracy, time, knowledge (seek)-a preliminary mapping Proc. 10th Int. Workshop Softw.
performance, and MSE. While most papers focused on prominent met Technol. Eng. Pract. , 2002, , 8–9 (IEEE Computer Society).
[15] B. Arasteh, P. Imanzadeh, K. Arasteh, F.S. Gharehchopogh, B. Zarei, A source-
rics like accuracy and time, a notable proportion made efforts to code aware method for software mutation testing using artificial bee colony
enhance testing efficiency through ML methods in ST. Identified chal algorithm, J. Electron. Test. vol. 38 (3) (2022) 289–302.
lenges include the need for empirical research, scalability, test coverage, [16] M. Newman"Software errors cost us economy $59.5 billion annually," NIST
Assesses Technical Needs of Industry to Improve Software-Testing , 2002.
test input generation, failure management, interoperability, test Oracle, [17] G.J. Myers, T. Badgett, T.M. Thomas, C. Sandler, 2004, Wiley Online
accuracy, trust, security, and access to high-quality training datasets, Library123–156, The art of software testingvol. 2.
highlighting avenues for future work. [18] S. Bazzaz Abkenar, E. Mahdipour, S.M. Jameii, M. Haghi Kashani, A hybrid
classification method for Twitter spam detection based on differential evolution
and random forest, Concurr. Comput.: Pract. Exp. vol. 33 (21) (2021) e6381.
Participation of authors [19] G. Liang, W. Fan, H. Luo, X. Zhu, The emerging roles of artificial intelligence in
cancer drug development and precision therapy, Biomed. Pharmacother. vol. 128
(2020) 110255.
The first and second authors have equal contributions to this work.
[20] J. Liu, F. Weng, Z. Li, "Satellite-based PM2. 5 estimation directly from reflectance
at the top of the atmosphere using a machine learning algorithm, Atmos. Environ.
CRediT authorship contribution statement vol. 208 (2019) 113–122.
[21] B.K. Mohanta, D. Jena, U. Satapathy, S. Patnaik, Survey on IoT security:
challenges and solution using machine learning, artificial intelligence and
Sedighe Ajorloo: Investigation, Methodology, Resources, Visuali blockchain technology, Internet Things vol. 11 (2020) 100227.
zation, Writing – original draft. Amirhossein Jamarani: Investigation, [22] D. Truong, W. Choi, Using machine learning algorithms to predict the risk of
Methodology, Resources, Visualization, Writing – original draft. Mehdi small unmanned aircraft system violations in the national airspace system, J. Air
Transp. Manag. vol. 86 (2020) 101822.
Kashfi: Investigation, Writing – review & editing. Mostafa Haghi [23] V.H. Durelli, et al., Machine learning applied to software testing: a systematic
Kashani: Conceptualization, Methodology, Supervision, Validation, mapping study, IEEE Trans. Reliab. vol. 68 (3) (2019) 1189–1212.
Writing – review & editing. Abbas Najafizadeh: Validation, [24] M. Noorian, E. Bagheri, W. Du, Machine learning-based software testing: towards
a classification framework, SEKE (2011) 225–229.
Visualization. [25] N. Jha, R. Popli, S. Chakraborty, and P. Kumar, Software Test Automation Using
Selenium and Machine Learning," in Proceedings of First International
Declaration of Competing Interest Conference on Computational Electronics for Wireless Communications,
Singapore, S. Rawat, A. Kumar, P. Kumar, and J. Anguera, Eds., 2022// 2022:
Springer Nature Singapore, pp. 419-429.
The authors declare that they have no known competing financial
24
[26] R. LachmannMachine learning-driven test case prioritization approaches for [58] A.M. Nascimento, L.F. Vismari, P.S. Cugnasca, J.B.C. Júnior, J.R. d A. Júnior,
black-box software testing Eur. Test. Telem. Conf. Nuremberg, Ger. , 2018. A cost-sensitive approach to enhance the use of ml classifiers in software testing
[27] P. Brereton, B.A. Kitchenham, D. Budgen, M. Turner, M. Khalil, Lessons from efforts, 18th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA) (2019) 1806–1813, 16-
applying the systematic literature review process within the software engineering 19 Dec. 2019 2019.
domain, J. Syst. Softw. vol. 80 (4) (2007) 571–583. [59] S. NakajimaDataset Diversity for Metamorphic Testing of Machine Learning
[28] B. Kitchenham, "Procedures for performing systematic reviews," Keele, UK, Keele Software Cham , Springer International Publishing , Structured Object-Oriented
University, vol. 33, no. 2004, pp. 1-26, 2004. Formal Language and Method // 2019 , 2019, // , 201921–38Z. Duan, S. Liu, C.
[29] J.M. Zhang, M. Harman, L. Ma, Y. Liu, Machine learning testing: survey, Tian, F. Nagoya (Eds.).
landscapes and horizons, IEEE Trans. Softw. Eng. vol. 48 (1) (2020) 1–36. [60] J.K. Nurminen, Software Framework for Data Fault Injection to Test Machine
[30] Y. Yang, X. Xia, D. Lo, J. Grundy, A survey on deep learning for software Learning Systems 2019 IEEE Int. Symp. . Softw. Reliab. Eng. Workshops (ISSREW)
engineering, p. Article 206, ACM Comput. Surv. vol. 54 (10s) (2022). p. Article 27-30 Oct. 2019 , 2019, , 294–299.
206. [61] M. Raman, N. Abdallah, J. DunoyerAn Artificial Intelligence Approach to EDA
[31] H. Chen, M.A. Babar, Security for machine learning-based software systems: a Software Testing: Application to Net Delay Algorithms in FPGAs 6-7 20th
survey of threats, practices, and challenges, p. Article 151, ACM Comput. Surv. International Symposium on Quality Electronic Design (ISQED) , March 2019, ,
vol. 56 (6) (2024). p. Article 151. 311–316 2019.
[32] Suman, R.A. Khan, Survey on identification and prediction of security threats [62] S. Kassaymeh, S. Abdullah, M. Alweshah, A.I. HammouriA Hybrid Salp Swarm
using various deep learning models on software testing, Multimed. Tools Appl. Algorithm with Artificial Neural Network Model for Predicting the Team Size
(2024). Required for Software Testing Phase 2021 2021 International Conference on
[33] J. Wen, S. Li, Z. Lin, Y. Hu, C. Huang, Systematic literature review of machine Electrical Engineering and Informatics (ICEEI) 12-13 , Oct. 2021, , International
learning based software development effort estimation models, Inf. Softw. Conference on Electrical Engineering and Informatics (ICEEI) 20211–6.
Technol. vol. 54 (1) (2012) 41–59. [63] K. Kamaraj, C. Arvind, K. Srihari, A weight optimized artificial neural network for
[34] R. Malhotra, A systematic review of machine learning techniques for software automated software test oracle, Soft Comput. vol. 24 (17) (2020) 13501–13511.
fault prediction, Appl. Soft Comput. vol. 27 (2015) 504–518. [64] A. Sheta, S. Aljahdali, M. Braik, Utilizing Faults and Time to Finish Estimating the
[35] J. Zhang, J. Li, Testing and verification of neural-network-based safety-critical Number of Software Test Workers Using Artificial Neural Networks and Genetic
control software: a systematic literature review, Inf. Softw. Technol. vol. 123 Programming, in: Á. Rocha, M. Serrhini (Eds.), Information Systems and
(2020) 106296. Technologies to Support Learning, Springer International Publishing, Cham,
[36] I. Batool, T.A. Khan, Software fault prediction using data mining, machine 2019, pp. 613–624.
learning and deep learning techniques: a systematic literature review, Comput. [65] V.A.D.S. Júnior, "A method and experiment to evaluate deep neural networks as
Electr. Eng. vol. 100 (2022) 107886. /05/01/ 2022. test oracles for scientific software," presented at the Proceedings of the 3rd ACM/
[37] A. Abo-eleneen, A. Palliyali, C. Catal, The role of reinforcement learning in IEEE International Conference on Automation of Software Test, Pittsburgh,
software testing, Inf. Softw. Technol. vol. 164 (2023) 107325. Pennsylvania, 2022.
[38] M. Rahimi, M. Songhorabadi, M.H. Kashani, Fog-based smart homes: a systematic [66] A. Ruospo, D. Piumatti, A. Floridia, E. SanchezA SUitability Analysis of Software
review, J. Netw. Comput. Appl. vol. 153 (2020) 102531. /03/01/ 2020. Based Testing Strategies for the On-line Testing of Artificial Neural Networks
[39] C. Calero, M.F. Bertoa, M.Á. Moraga, A systematic literature review for software Applications in Embedded Devices 2021 IEEE 27th Int. Symp. . -Line Test. Robust.
sustainability measures," presented at the Proceedings of, 2nd Int. Workshop Syst. Des. (IOLTS) 28-30 June 20212021, , 1–6.
Green. Sustain. Softw., San. Fr., Calif. (2013). [67] U. Sivaji, P.S. RaoWITHDRAWN: Test case minimization for regression testing by
[40] N. Khoshniat, A. Jamarani, A. Ahmadzadeh, M. Haghi Kashani, E. Mahdipour, analyzing software performance using the novel method " ed: Elsevier , 2021.
Nature-inspired metaheuristic methods in software testing, Soft Comput. vol. 28 [68] A.H. Yahmed, H.B. Braiek, F. Khomh, S. Bouzidi, R. Zaatour, DiverGet: a SeArch-
(2) (2024) 1503–1544. based Software Testing Approach for Deep Neural Network Quantization
[41] M. Songhorabadi, M. Rahimi, A. MoghadamFarid, M. Haghi Kashani, Fog Assessment, Empir. Softw. Eng. vol. 27 (7) (2022) 193.
computing approaches in IoT-enabled smart cities, J. Netw. Comput. Appl. vol. [69] L. Xiao, H. Miao, T. Shi, Y. Hong, LSTM-based deep learning for spatial–temporal
211 (2023) 103557. software testing, Distrib. Parallel Databases vol. 38 (3) (2020) 687–712.
[42] M. Haghi Kashani, E. Mahdipour, Load balancing algorithms in fog computing, [70] M. Tejo Vinay, M. Lukeshnadh, B. Keerthi Samhitha, S.C. Mana, J. JoseA Robust
IEEE Trans. Serv. Comput. vol. 16 (2) (2023) 1505–1521. and Intelligent Machine Learning Algorithm for Software Testing Singapore ,
[43] S. Nemati, M. Haghi Kashani, R. Faghih Mirzaee, Comprehensive survey of Springer Nature Singapore , Advances in Electronics, Communication and
ternary full adders: Statistics, corrections, and assessments, IET Circuits, Devices Computing 2021// , 2021, //Springer Nature , Singapore2021455–462P.K.
Syst. vol. 17 (3) (2023) 111–134. Mallick, A.K. Bhoi, G.-S. Chae, K. Kalita (Eds.).
[44] M. Etemadi, et al., A systematic review of healthcare recommender systems: open [71] H.L.P. Raj, K. ChandrasekaranNEAT Algorithm for Testsuite generation in
issues, challenges, and techniques, Expert Syst. Appl. vol. 213 (2023) 118823. Automated Software Testing 2018 IEEE Symp. . Ser. Comput. Intell. (SSCI) 18-21
[45] S. Bazzaz Abkenar, M. Haghi Kashani, M. Akbari, E. Mahdipour, Learning textual Nov. 2018 , 2018, , 2361–2368.
features for Twitter spam detection: a systematic literature review, Expert Syst. [72] L. OleshchenkoSoftware Testing Errors Classification Method Using Clustering
Appl. vol. 228 (2023) 120366. Algorithms Singapore , Springer Nature Singapore , International Conference on
[46] M. Sheikh Sofla, M. Haghi Kashani, E. Mahdipour, R. Faghih Mirzaee, Towards Innovative Computing and Communications 2023// , 2023, //Springer Nature ,
effective offloading mechanisms in fog computing, Multimed. Tools Appl. vol. 81 Singapore2023553–566A.E. Hassanien, O. Castillo, S. Anand, A. Jaiswal (Eds.).
(2) (2022) 1997–2042. [73] E. Alpaydin, Introduction to Machine Learning, MIT press, 2020, pp. 11–12.
[47] M. Nikravan, M. Haghi Kashani, A review on trust management in fog/edge [74] S. Ali, Y. Hafeez, S. Hussain, S. Yang, Enhanced regression testing technique for
computing: Techniques, trends, and challenges, J. Netw. Comput. Appl. vol. 204 agile software development and continuous integration strategies, Softw. Qual. J.
(2022) 103402. vol. 28 (2) (2020) 397–423.
[48] Y. Karimi, M. Haghi Kashani, M. Akbari, E. Mahdipour, Leveraging big data in [75] J. Chen, et al., Test case prioritization for object-oriented software: an adaptive
smart cities: a systematic review, Concurr. Comput.: Pract. Exp., Submitt. Publ. random sequence approach based on clustering, J. Syst. Softw. vol. 135 (2018)
vol. 33 (21) (2021). 107–125.
[49] M. Haghi Kashani, M. Madanipour, M. Nikravan, P. Asghari, E. Mahdipour, [76] L. Ma, Deepmutation: Mutation testing of deep learning systems IEEE , 2018 IEEE
A systematic review of IoT in healthcare: applications, techniques, and trends, 29th international symposium on software reliability engineering (ISSRE) , 2018,
J. Netw. Comput. Appl. vol. 192 (2021) 103164. 29th international symposium on software reliability engineering (ISSRE),
[50] M. Fathi, M. Haghi Kashani, S.M. Jameii, E. Mahdipour, Big data analytics in IEEE2018100–111.
weather forecasting: a systematic review, Arch. Comput. Methods Engineering, [77] Y. Liu, L. Feng, X. Wang, and S. Zhang, "DeepBoundary: A Coverage Testing
Submitt. Publ. vol. 29 (2) (2021). Method of Deep Learning Software based on Decision Boundary Representation,"
[51] Z. Ahmadi, M. Haghi Kashani, M. Nikravan, E. Mahdipour, Fog-based healthcare in 2022 IEEE 22nd International Conference on Software Quality, Reliability, and
systems: a systematic review, Multimed. Tools Appl. vol. 80 (30) (2021) Security Companion (QRS-C), 5-9 Dec. 2022 2022, pp. 166-172.
36361–36400. [78] Suman, R.A. Khan, An optimized neural network for prediction of security threats
[52] M. Songhorabadi, M. Rahimi, A.M.M. Farid, M.H. Kashani" arXiv preprint Fog on software testing, Comput. Secur. vol. 137 (2024) 103626.
Comput. Approaches Smart Cities.: A State—Art. Rev. arXiv:2011.14732 , 2020, , [79] R.S. Sutton, A.G. Barto, The reinforcement learning problem. in Reinforcement
1–19. learning: An introduction, MIT Press, Cambridge, MA, 1998, pp. 51–85.
[53] M. Haghi Kashani, A.M. Rahmani, N. Jafari Navimipour, Quality of service-aware [80] L.P. Kaelbling, M.L. Littman, A.W. Moore, Reinforcement learning: a survey,
approaches in fog computing, Int. J. Commun. Syst. vol. 33 (8) (2020) e4340. J. Artif. Intell. Res. vol. 4 (1996) 237–285.
[54] S.B. Abkenar, M.H. Kashani, M. Akbari, E. Mahdipour" arXiv preprint Twitter [81] M. Esnaashari, A.H. Damia, Automation of software test data generation using
spam Detect.: A Syst. Rev. arXiv:2011.14754, , 2020. genetic algorithm and reinforcement learning, Expert Syst. Appl. vol. 183 (C, p.
[55] S. Bazzaz Abkenar, M. Haghi Kashani, E. Mahdipour, S.M. Jameii, Big data 12) (2021).
analytics meets social media: A systematic review of techniques, open issues, and [82] N. Rawat, V. Somani, A.K. Tripathi, Prioritizing software regression testing using
future directions, Telemat. Inform. vol. 57 (2020) 101517–105525. reinforcement learning and hidden Markov model, Int. J. Comput. Appl. vol. 45
[56] S.B. Kotsiantis, I. Zaharakis, P. Pintelas, Supervised machine learning: a review of (12) (2023) 748–754.
classification techniques, Emerg. Artif. Intell. Appl. Comput. Eng. vol. 160 (1) [83] T. Shi, L. Xiao, K. Wu, Reinforcement Learning Based Test Case Prioritization for
(2007) 3–24. Enhancing the Security of Software, 6-9 Oct. 2020, 2020 IEEE 7th Int. Conf. Data
[57] T. Hastie, R. Tibshirani, and J. Friedman, "The elements of statistical learning. Sci. Adv. Anal. (DSAA) (2020) 663–672.
Springer series in statistics," New York, NY, USA, 2001. [84] J. Fang, Y. LuSimultaneous Localization of Multiple Defects in Software Testing
Based on Reinforcement Learning Cham , Springer International Publishing ,
25
Multimedia Technology and Enhanced Learning 2021// , 2021, //, [102] A. Krizhevsky, G. Hinton, Learn. Mult. layers Features tiny Images (2009).
2021180–190W. Fu, Y. Xu, S.-H. Wang, Y. Zhang (Eds.). [103] M. Long, J. Wang, G. Ding, J. Sun, P.S. YuTransfer feature learning with joint
[85] P.S. Nouwou Mindom, A. Nikanjam, F. Khomh, A comparison of reinforcement distribution adaptation," in Proceedings of IEEE Int. Conf. Comput. Vis. , 2013, ,
learning frameworks for software testing tasks, Empir. Softw. Eng. vol. 28 (5) 2200–2207.
(2023) 111. [104] X. Li, D. RothLearning question classifiers Cooling 2002 The 19th International
[86] T. Ahmad, A. Ashraf, D. Truscan, A. Domi, I. Porres, Using deep reinforcement Conference on Computational Linguistics , 2002.
learning for exploratory performance testing of software systems with multi- [105] A. Maas, R.E. Daly, P.T. Pham, D. Huang, A.Y. Ng, C. PottsLearning word vectors
dimensional input spaces, IEEE Access vol. 8 (2020) 195000–195020. for sentiment analysis Proc. 49th Annu. Meet. Assoc. Comput. Linguist.: Hum.
[87] C.Y. Chen, J.L. Huang, Reinforcement-learning-based test program generation for Lang. Technol. , 2011, , 142–150.
software-based self-test, 10-13 Dec. 2019, 2019 IEEE 28th Asian Test. Symp. [106] T.A. Almeida, J.M.G. Hidalgo, A. YamakamiContributions to the study of SMS
(ATS) (2019) 73–735. spam filtering: new collection and results Proc. 11th ACM Symp. . Doc. Eng. ,
[88] J. Kim, M. Kwon, S. YooGenerating Test Input with Deep Reinforcement Learning 2011, , 259–262.
2018 IEEE/ACM 11th International Workshop on Search-Based Software Testing [107] A. Warstadt, A. Singh, S.R. Bowman, Neural network acceptability judgments,
(SBST) 28-29 May 2018 , 2018, , 51–58, 28-29 May 2018. Trans. Assoc. Comput. Linguist. vol. 7 (2019) 625–641.
[89] C. Chen, W. Diao, Y. Zeng, S. Guo, C. HuDRLgencert: Deep Learning-Based [108] T. Davidson, D. Warmsley, M. Macy, I. WeberAutomated hate speech detection
Automated Testing of Certificate Verification in SSL/TLS Implementations 2018 and the problem of offensive language 1 ( vol. 11 Proc. Int. AAAI Conf. web Soc.
IEEE Int. Conf. Softw. Maint. Evol. (ICSME) 23-29 Sept. 2018 , 2018, , 48–58. Media , 2017, , 512–515.
[90] H. Xiao, M. Cao, R. Peng, Artificial neural network based software fault detection [109] A. Singhal, A. BansalGeneration of test oracles using neural network and decision
and correction prediction models considering testing effort, Appl. Soft Comput. tree model IEEE , in 2014 5th International Conference-Confluence The Next
vol. 94 (2020) 106491. Generation Information Technology Summit (Confluence) , 2014, in 5th
[91] C. López-Martín, Machine learning techniques for software testing effort International Conference-Confluence The Next Generation Information
prediction, Softw. Qual. J. vol. 30 (1) (2022) 65–100. Technology Summit (Confluence), 2014313–318.
[92] J. Kahles, J. Törrönen, T. Huuhtanen, A. Jung, Automating root cause analysis via [110] E.T. Barr, M. Harman, P. McMinn, M. Shahbaz, S. Yoo, The oracle problem in
machine learning in agile software testing environments, 22-27 April 2019, 2019 software testing: a survey, IEEE Trans. Softw. Eng. vol. 41 (5) (2014) 507–525.
12th IEEE Conf. Softw. Test., Valid. Verif. (ICST) (2019) 379–390. [111] A. Singhal, A. Bansal, A. Kumar, An approach to design test oracle for aspect
[93] Y.L. Karpov, L.E. Karpov, Y.G. Smetanin, Adaptation of general concepts of oriented software systems using soft computing approach, Int. J. Syst. Assur. Eng.
software testing to neural networks, Program. Comput. Softw. vol. 44 (5) (2018) Manag. vol. 7 (2016) 1–5.
324–334. [112] D. Marijan, A. Gotlieb, M.K. Ahuja, Challenges of testing machine learning based
[94] S.H. Managoli, U. PadmaData Analysis for Implementing an Efficient Testing systems. 2019 IEEE international conference on artificial intelligence testing
Model in Software Testing Using Machine Learning Singapore , Springer Nature (AITest), IEEE, 2019, pp. 101–102.
Singapore , Proceedings of Third International Conference on Intelligent [113] H. Van Der Veer, A. Wiles, Achieving technical interoperability, Eur.
Computing, Information and Control Systems 2022// , 2022, //Springer Nature , Telecommun. Stand. Inst. (2008).
Singapore2022777–789A.P. Pandian, R. Palanisamy, M. Narayanan, T. Senjyu [114] D. Graham, R. Blackand E. Van Veenendaal, " Cengage Learning , Foundations of
(Eds.). software testing ISTQB Certification , 2021, , 127–155.
[95] Z. Wang, H. You, J. Chen, Y. Zhang, X. Dong, W. ZhangPrioritizing Test Inputs for [115] A. Mohammadjafari, S.F. Ghannadpour, M. Bagherpour, F. Zandieh" arXiv
Deep Neural Networks via Mutation Analysis 2021 IEEE/ACM 43rd Int. Conf. preprint Multi-Object. Multi-mode Time-Cost. Trade Model. Constr. Proj.
Softw. Eng. (ICSE) 22-30 May 2021 , 2021, , 397–409. Considering Product. Improv. arXiv:2401.12388 , 2024arXiv:2401.12388.
[96] N. Sulaiman, S.O. HasoonApplication of Convolution Neural Networks and [116] A. Finkelstein, M. Harman, S.A. Mansouri, J. Ren, Y. Zhang, A search based
Randomforest for Software Test 31 2022 8th Int. Conf. Contemp. Inf. Technol. approach to fairness analysis in requirement assignments to aid negotiation,
Math. (ICCITM) Aug.-1 Sept. 2022 , 2022, ,Aug.-1 Sept. 2022146–152. mediation and decision making, Requir. Eng. vol. 14 (2009) 231–245.
[97] L. Ramesh, S. Radhika, S. Jothi, Hybrid support vector machine and K-nearest [117] R.B. Abdessalem, S. Nejati, L.C. Briand, T. StifterTesting vision-based control
neighbor-based software testing for educational assistant, Concurr. Comput. systems using learnable evolutionary algorithms Proc. 40th Int. Conf. Softw. Eng.
Pract. Exp. vol. 35 (1) (2023) e7433. , 2018, , 1016–1026.
[98] C. Birchler, S. Khatiri, B. Bosshard, A. Gambi, S. Panichella, Machine learning- [118] R. Ben Abdessalem, S. Nejati, L.C. Briand, T. StifterTesting advanced driver
based test selection for simulation-based testing of self-driving cars software, assistance systems using multi-objective search and neural networks Proc. 31st
Empir. Softw. Eng. vol. 28 (3) (2023) 71. IEEE/ACM Int. Conf. Autom. Softw. Eng. , 2016, , 63–74.
[99] T. Labidi, Z. Sakhrawi, On the value of parameter tuning in stacking ensemble [119] R.B. Abdessalem, A. Panichella, S. Nejati, L.C. Briand, T. StifterTesting
model for software regression test effort estimation, J. Supercomput. (2023). autonomous cars for feature interaction failures using many-objective search
[100] A. Khan, R.R. Mekuria, R. Isaev, Applying machine learning analysis for software Proc. 33rd ACM/IEEE Int. Conf. Autom. Softw. Eng. , 2018, , 143–154.
quality test, 22-22 April 2023, 2023 Int. Conf. Code Qual. (ICCQ) (2023) 1–15. [120] P. Jamshidi, A. Ahmad, C. Pahl, Cloud migration research: a systematic review,
[101] Z. Durumeric, E. Wustrow, J.A. Halderman{ZMap}: fast internet-wide scanning IEEE Trans. Cloud Comput. vol. 1 (2) (2013) 142–157.
and its security applications," in 22nd USENIX Secur. Symp. . (USENIX Secur. 13)
, 2013, , 605–620.
26

ASC Review Article

Uploaded by

Copyright:

Available Formats

You might also like

ASC Review Article

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

ASC Review Article

Uploaded by

Copyright:

Available Formats

Applied Soft Computing Journal 162 (2024) 111805

Contents lists available at ScienceDirect

Applied Soft Computing

A systematic review of machine learning methods in software testing

• A comprehensive systematic review on machine learning methods in software testing is provided.

1. Introduction dependent on software-based features. Software also plays a significant

Fig. 1. The structure of this study.

Fig. 2. Introduction to the research method.

3.1. Planning phase (Table 2).

Fig. 3. The taxonomy of machine learning methods in software testing.

Table 3 Table 3 (continued )

Decreasing testing Uncorrelated LSTM

on labeled data, training data for reinforcement learning methods only

must engage in interactive exchanges with its environment. These in­

thereby attributing it to a designation. In this section, we analyze the

4.3.1. Overview of the selected reinforcement learning methods

approach for automated testing. This algorithm, which incorporates

reinforcement learning, combines genetic methods with local search

rion of covering all limited paths. In addition, the proposed algorithm

was evaluated in comparison to meta-heuristic and evolutionary

particular method outperformed alternative algorithms in terms of both

combined hidden Markov models and reinforcement learning. The

rate of false positives, as demonstrated by experimenting on a set of five

web applications, yielding an exceptional F1 score of 0.849. This

and saving time and resources.

in order to design a reward function within a continuous integration

functions that were derived from dissimilar historical outcomes. The

ings indicated that the utilization of reward functions significantly

examination of sorting objective optimization and the implementation

method for the simultaneous localization of multiple defects in ST.

genetic algorithm and evaluation index through the identification of

identification of concurrent faults in ST, which has been found to have a

context of continuous integration (CI) and game testing are two

empirically addressed through the utilization of Deep Reinforcement

Learning (DRL) algorithms. The research aimed to determine which DRL

their experimentation in a CI environment to achieve this. For the game

during the conducting phase. Table 13 presents a comprehensive

Balqa Applied University in Jordan, and Sathyabama Institute of Science

In this section, we provide responses to the research inquiries that

when analyzing the presented survey. Additionally, they serve as a guide

classified into four primary categories. The taxonomy encompasses

methods, unsupervised learning methods, reinforcement learning

methods, and hybrid learning methods.

pers, specifically 74%, focused on the collective utilization of supervised

origins of hybrid learning and supervised learning methods to tackle

from labeled data, is favored for its straightforward implementation and

available. Hybrid learning methods, combining elements of different

improved performance and generalization across varied datasets. The

popularity of these methods stems from their proven success in a wide

range of applications, including image recognition, natural language

ieve superior results and address complex challenges in machine

learning effectively. Also, considering the significant prevalence of

greater popularity. Another primary factor motivating researchers to

utilize supervised methods is the widespread availability of labeled

tionally, 13% of the papers examined unsupervised learning methods,

This study involved the assessment of reviewed papers using multi­

ple evaluation parameters, as documented in Tables 5, 7, 9, and 11. The

formula presented below is employed to determine the most important

and significant parameters of Q2. Fig. 7 depicts the parameters

must engage in interactive exchanges with its environment. These in

This study involved the assessment of reviewed papers using multi

testing. By automating various testing processes, machine learning al