ASC Review Article
ARTICLE INFO

Keywords: Machine learning; Software testing; Quality of software; Systematic review

ABSTRACT

Background: The quest for higher software quality remains a paramount concern in software testing, prompting a shift towards leveraging machine learning techniques for enhanced testing efficacy.

Objective: The objective of this paper is to identify, categorize, and systematically compare the present studies on software testing utilizing machine learning methods.

Method: This study conducts a systematic literature review (SLR) of 40 pertinent studies spanning from 2018 to March 2024 to comprehensively analyze and classify machine learning methods in software testing. The review encompasses supervised learning, unsupervised learning, reinforcement learning, and hybrid learning approaches.

Results: The strengths and weaknesses of each reviewed paper are dissected in this study. This paper also provides an in-depth analysis of the merits of machine learning methods in the context of software testing and addresses current unresolved issues. Potential areas for future research are discussed, and statistics on each reviewed paper have been collected.

Conclusion: By addressing these aspects, this study contributes to advancing the discourse on machine learning's role in software testing and paves the way for substantial improvements in testing efficacy and software quality.
* Corresponding author.
E-mail addresses: s.ajorloo@qodsiau.ac.ir (S. Ajorloo), c00550518@louisiana.edu (A. Jamarani), mh.kashani@qodsiau.ac.ir (M. Haghi Kashani), a.najafizadeh@qodsiau.ac.ir (A. Najafizadeh).
https://doi.org/10.1016/j.asoc.2024.111805
Received 21 September 2023; Received in revised form 27 March 2024; Accepted 19 May 2024
Available online 25 May 2024
1568-4946/© 2024 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
S. Ajorloo et al. Applied Soft Computing 162 (2024) 111805
relevant information about the reliability and quality of the product [12]. Because of this, the level of quality that is present during the testing procedures is one of the most important variables that will determine the quality of the product in its final form [13].

ST is a process that identifies errors or bugs in software while also providing information about the functionality and quality characteristics of the software being tested [14]. Testing should follow the development phase of any software project [15]. According to a study that was carried out by the National Institute of Standards and Technology, the amount of damage that is incurred as a result of software issues is significant [16]. As a result, the ST phase is extremely important and accounts for around half of the total software development budget [17]. ST uses a significant amount of resources but adds no new functionality to the product; hence, a significant amount of effort has been expended to reduce its cost by providing automated ST solutions. Over the past decade, many different methods for automated ST have been developed, all with the same overarching goal: to maximize error detection while generating as little test input data as possible.

Artificial intelligence (AI) has demonstrated its efficacy in addressing challenges across various fields, including but not limited to biology, social media [18], medical sciences [19], space and environmental applications [20], the Internet of Things (IoT) [21], self-driving vehicles [22], and unmanned aircraft systems [22]. It is, therefore, logical to consider that AI, encompassing machine learning (ML), deep learning (DL), search algorithms, and optimization techniques, has the potential to contribute to the advancement of software engineering practices, particularly in the domain of ST. AI has proven formidable at reducing the time and effort needed for various software engineering methodologies [23]. A variety of tasks associated with software engineering can be automated with the help of ML, a subfield of AI [24]. In addition, the complexity of software is growing at an exponential rate, and the standard testing methodologies that have been used in the past are unable to adequately scale to meet the demands of increasingly sophisticated software [25]. Considering the ever-increasing complexity of today's programming frameworks, ML-based methods have become increasingly appealing. The deployment of ML methods to automate the process of ST has been the subject of a significant amount of research. Whether the objective at hand is to generate test cases, prioritize test cases, or perform other sorts of testing such as black-box or white-box testing, ML has demonstrated its usefulness in all of these cases [26].

The application of ML methods has proven to be an effective strategy for automating testing-related operations, the construction and evaluation of test oracles, and the prediction of the cost and time required for testing. Because of this, analysts can obtain more accurate results than traditional testing ever could, thanks to AI, which enables frameworks to learn and apply that knowledge in the future [25]. However, the probability of making an error is not the only aspect that is reduced. The time expected to be spent testing a product and discovering potential faults is also reduced, yet the amount of information that must be managed can still increase without placing any load on the testing group.

The limited examination of comprehensive machine learning (ML) methods in software testing (ST) complicates understanding of the diverse approaches and challenges. So far, there have not been many complete reviews on this topic. Furthermore, since ML methods in ST are a critical and sensitive field, it is necessary to provide a comprehensive study. Recognizing the importance of ML methods in ST, this study aims to thoroughly analyze the obstacles, potential directions, advantages, and disadvantages involved. The study also delves into the connection between ML methods and ST. Through a systematic literature review, we aim to find, categorize, and compare various ML methods in ST. We meticulously scrutinize the methodologies employed in existing literature and devise a coherent classification framework. This methodological review addresses fundamental questions, such as the types of ML methods used in ST, the metrics for evaluating them, common tools and environments, datasets and case studies, and methods for measuring accuracy and efficiency. By probing into these matters and the research questions (RQ) below, we aim to unveil the challenges, prospects, and forthcoming trajectories for the application of ML in ST.

RQ1: What are the different ML methods that have been applied to ST?
RQ2: What are the evaluation metrics used to evaluate the effectiveness of ML methods in ST?
RQ3: Which tools or evaluation environments are commonly used for ST and ML, and how are they integrated?
RQ4: What datasets and case studies have been used to evaluate the performance of ML methods in ST?
RQ5: What evaluation methods have been employed to measure the accuracy and efficiency of ML methods in ST?
RQ6: What are the current challenges and opportunities for applying ML to ST, and what are the possible future directions?

Guidelines in [27,28] have been used to explore, categorize, and compare available ML methods in ST. In order to present a comprehensive taxonomy for the classification of ML methods in ST, 40 papers were selected and analyzed. In addition, this review provides a concise overview of the key obstacles and open issues, as well as a description of the significant areas in which future research might improve the methodologies used in the selected studies.

This SLR is structured as shown in Fig. 1: Section 2 discusses related work and motivation. Section 3 illustrates the research methodology. A full review of the chosen papers is provided in Section 4 after the classification of ML methods in ST. The analysis of the results and future work are described in Sections 5 and 6, respectively. Threats to validity and limitations are outlined in Section 7. Finally, Section 8 presents the conclusion.

2. Related work and motivation

Numerous reviews have been conducted in ST and ML. Nevertheless, these literature reviews have some limitations. This section examines several review studies that address strategies spanning the common boundaries of ST and ML. Section 2.1 provides an overview of the related works, which have been categorized as surveys. Additionally, Section 2.2 highlights the weaknesses of these reviews. We present a summary of the related works in Table 1, which includes parameters such as the main concepts, review types, paper selection processes, taxonomies, open issues, evaluation parameters, applied tools, and publication year of each study.

2.1. Review studies on software testing and machine learning

An overview of machine learning testing was provided by the authors in [29]. Their work surveyed properties, components, workflows, and application scenarios. The paper also analyzed trends in datasets and research focus and identified challenges and directions in ML testing. Additionally, the authors offered a survey of testing techniques for machine learning applications. Nonetheless, a drawback in the paper's organization lies in its absence of a systematic framework and structure. The authors in [30] established a framework for analyzing the usage of DNNs in SE, covering aspects such as trends, techniques, data processing, research topics, relationships, datasets, optimization algorithms, and evaluation metrics. The authors also proposed a research roadmap outlining opportunities for future work in the application of DNNs in SE. However, no taxonomy was provided in their study.

Chen and Babar [31] provided a review of machine learning-based modern software systems (MLBSS) security. They highlight the need for joint efforts from the software engineering, system security, and machine learning disciplines to address the evolving threats posed by MLBSS. The authors not only examined security threats but also offered insights into secure practices throughout the software lifecycle. By summarizing existing literature and outlining future research directions, they aimed
Table 1
Related studies in the field of software testing and machine learning.

| Review type | Paper | Main topic | Publication year | Paper selection process | Taxonomy | Future work | Covered years |
|---|---|---|---|---|---|---|---|
| Survey | [29] | Machine learning methods in software testing | 2020 | Yes | Yes | Yes | 2007–2019 |
| Survey | [30] | Deep learning for software testing | 2022 | Yes | No | Yes | 2006–2022 |
| Survey | [31] | Software systems based on machine learning | 2024 | Yes | Yes | Yes | Not mentioned |
| Survey | [32] | Detection and anticipation of security risks through diverse deep learning frameworks in software testing | 2024 | Yes | No | Yes | 2011–2022 |
| SMS | [23] | Software testing pertaining to machine learning | 2019 | Yes | No | Yes | 2017–2018 |
| SLR | [33] | Software testing and machine learning methods | 2012 | Yes | No | Yes | 1991–2010 |
| SLR | [34] | Software testing and machine learning methods | 2015 | Yes | Yes | Yes | 1991–2013 |
| SLR | [35] | Verification and testing of neural networks in software testing | 2020 | Yes | Yes | Yes | 2011–2018 |
| SLR | [36] | Data mining, machine learning, and DL methods for predicting software faults | 2022 | Yes | Yes | Yes | 2010–2021 |
| SLR | [37] | Software testing and reinforcement learning | 2023 | Yes | Yes | Yes | 2012–2021 |
| SLR | Our study | Machine learning methods in software testing | 2024 | Yes | Yes | Yes | 2018–2024 |
to foster an understanding of system security engineering in the realm of MLBSS. Suman and Khan [32] analyzed the literature on using various deep learning models to detect and predict security risks in software testing. They reviewed existing techniques, evaluated performance indicators, and discussed benefits and drawbacks.

Durelli, et al. [23] performed a systematic mapping study (SMS) to provide an overview of the research at the intersection of ST and ML aimed at automating and streamlining ST. The years and papers they covered were quite limited. A systematic literature review was conducted by Wen, et al. [33] to explore models for estimating software development effort based on ML methods. They commented on the limitations of the applications of ST and ML and provided information on overcoming those limitations. The authors, nevertheless, did not compile a taxonomy to organize all their reviewed studies.

The authors in [34] provided an SLR on the applications of ML in detecting ST faults. They reviewed papers published between 1991 and 2013. However, the papers they reviewed have become outdated as ML and ST have grown in recent years. Zhang and Li [35] stated that neural network (NN) algorithms are increasingly used in Safety-Critical Cyber-Physical Systems (SCCPSs), prompting interest in testing and verification (T&V) methods for NN-based control software. A review of T&V methodologies was conducted by analyzing 83 papers from 2011 to 2018. The study categorized approaches into themes like robustness, failure resilience, and interpretability of NNs. However, gaps exist in achieving repeatability and defined testing configurations.

Batool and Khan [36] analyzed previously published reviews, surveys, and related studies to extract a set of questions. They explained the significance of answering the newly added questions and categorized previous work according to data mining, ML, and DL while evaluating their respective performances. In [37], the authors examined the utilization of reinforcement learning (RL) in software testing, addressing its
application where traditional machine learning methods falter. Results indicated widespread usage of RL in certain testing scenarios, albeit primarily limited to two applications, highlighting a need for exploration of advanced RL techniques and multi-agent RL.

2.2. The motivation for an SLR on machine learning methods in software testing

In order to conduct a systematic review of ML methods in ST, recent studies are identified, classified, and compared. The focus of this review is to provide a detailed analysis and classification of the ML methods applied in ST across various sectors, as presented in Section 4. To verify the novelty of our work, a thorough search of Google Scholar, Springer, IEEE Xplore, ScienceDirect, SAGE, Taylor & Francis, Wiley, Emerald, ACM, and Inderscience was conducted using a specific research string.

Based on our review, none of the studies we examined fully addressed our research questions in Section 3.1, which highlights the need for an SLR to strengthen and update the current evidence on ML methods in ST. Table 1 summarizes the surveys we studied, detailing their review types, main topics, publication years, paper selection processes, taxonomies, future works, and covered years. It is evident that only five of the papers used the relatively similar SLR method, and four papers used the simple review method. Our research has a transparent paper selection process, a prepared taxonomy, and an explanation of future works, and includes recently published papers up to March 2024. However, some of the papers that have already been reviewed do not provide a taxonomy. So, we conducted this comprehensive study to cover the following deficiencies:

• Newly published papers, particularly from 2021 to March 2024, were not included.
• Some of the related papers have not explicitly focused on unresolved matters; they have only briefly and implicitly listed future challenges.
• While some papers lack proper classification or taxonomies, this paper not only offers a clear and visual categorization but also establishes a subclass for each category.
• The evaluation parameters and tools were largely disregarded in most reviews.
• Previous studies reviewed only a limited number of papers.

3. Research methodology

ST and ML have both seen a great deal of recently published research. Through an empirical study, this work offers a thorough analysis of the subject. In this section, an SLR approach to ML methods in ST is presented. A systematic review is the procedure of locating, categorizing, analyzing, and producing a comparative overview to arrive at solutions to research problems and questions pertaining to certain research topics [27]. This approach can be applied in any academic discipline to gain clear understanding, reduce bias, and identify unresolved concerns and future guidelines, according to [38]. This study's specific goal was to provide a thorough method for practical steps in this subject.

As depicted in Fig. 2, this methodical procedure employs a three-phase guideline, namely planning, conducting, and documenting. During the planning phase, we first determine the questions and requirements that are motivating this SLR. Then, during the conducting phase, papers are chosen based on inclusion and exclusion criteria. The observations are finally recorded in the documenting phase, and the analyses, comparisons, and visualizations of the results that produce the conclusions to the research questions are presented.
3.1. Planning phase

The process of planning for this SLR commences with identifying the research motivation and culminates in the development of a review protocol, which includes the following steps:

Step 1: Determining the research purpose. The first step entails defining the motivation on the basis of the unique value added by this SLR in comparison to the existing reviews discussed in Section 2.2.

Step 2: Question formulation. The second step involves defining research questions to aid in the creation and validation of the review methodology, with the paper's purpose serving as inspiration. The study's questions are listed in the Introduction section. Answering them can help uncover knowledge gaps that may lead to breakthroughs during the documentation phase.

Step 3: Establishing the review protocol. The review scope and research questions were established in the prior stage in accordance with the objectives of this SLR in order to adapt the search terms for the extraction of literature [27]. A protocol was also developed by taking into account [39] and our prior SLR experience [40–55]. We asked an outside expert with experience conducting SLRs in this area for feedback in order to assess the defined protocol before it was put into action. The updated protocol took his suggestions into account. To lessen improper motivation in research and increase the efficiency of data extraction, a pilot study (representing approximately 25% of the included papers) was conducted. Additionally, during the pilot phase, we refined the review's parameters, search techniques, and inclusion/exclusion criteria.

3.2. Conducting phase

The second phase of the research methodology involves conducting a comprehensive search for relevant papers, beginning with the selection of appropriate papers and ending with data extraction. This section aims to outline the process of paper selection that is carried out during the second phase of the SLR. The paper selection process follows a three-step guideline.

Step 1: Selecting primary research. The initial step in the research process involved searching through Google Scholar, the most popular search engine, across well-known academic publishers such as Springer, IEEE, ScienceDirect, SAGE, Taylor & Francis, Wiley, Emerald, ACM, and Inderscience, with the search terms defined from titles and keywords.

• Initial selection: At the conclusion of step one, 284 papers from journals, conferences, white papers, book chapters, and books were extracted, as noted in Table 2. In this step, the papers' abstracts and conclusions were examined, after which papers that had been published online between 2018 and March 2024 were selected. Additionally, non-peer-reviewed and non-English papers, theses, review papers, short papers, and book chapters were excluded in order to find the most pertinent papers. Overall, we located 43 papers at this stage.

• Final selection: Ultimately, an exhaustive review was conducted on the entirety of the selected articles, resulting in the identification of 40 relevant papers for subsequent comprehensive scrutiny. These papers were deemed suitable for further examination as they effectively addressed our research inquiries and provided comprehensive accounts of the methodologies employed and problems encountered.

Steps 2 and 3: Data extraction and synthesis. By examining the 40 pertinent papers, we are able to classify ML methods in ST in Section 4 and highlight their advantages and disadvantages.

Table 2
Inclusion and exclusion criteria.

| Category | Criteria | Justification |
|---|---|---|
| Inclusion | Studies that mainly focus on the integration of software testing with machine learning | Provides a comprehensible picture of software testing and machine learning |
| Inclusion | Papers published online from 2018 to March 2024 | Recent papers have referenced the classical and fundamental literature on this subject |
| Exclusion | Short papers of fewer than five pages | These papers do not provide adequate material for our SLR |
| Exclusion | Survey and review papers | These studies fail to provide practical, noteworthy, or innovative solutions or information |
| Exclusion | Studies not written in English or not refereed | Papers that were not peer-reviewed, and those written in languages other than English, were excluded due to doubts about their quality and the inability to investigate them |

3.3. Documenting phase

The observations are documented, and the selected articles for review can be found in Section 4, along with relevant information based on the proposed classification. Section 5 analyzes, visualizes, and reports on the results, while Section 7 explores and outlines any threats to validity and limitations.

4. Taxonomy of machine learning methods in software testing

Within this particular section, a taxonomy has been established for
the literature pieces that have been examined. The task of organizing the existing literature on ML methods in ST, including the various methods and approaches employed, poses a significant challenge due to the wide range of studies conducted in this field. The proposed taxonomy scheme's framework is depicted in Fig. 3. Four primary categories have been identified: supervised learning methods, unsupervised learning methods, reinforcement learning methods, and hybrid learning methods. Given that the predominant focus of scholarly inquiry in this domain revolves around topics associated with one of these four perspectives, examining the literature from each of these vantage points facilitates the categorization of the assessed papers under a broad thematic framework. In this instance, the 40 chosen papers have been organized based on the aforementioned criteria. The fundamental characteristics of the techniques employed in the reviewed papers, including their distinctions, main ideas, assessment methodologies, tools utilized, drawbacks, and benefits, are thoroughly examined and deliberated upon. Among the criteria for measuring machine learning and software testing, nineteen critical ones are selected and described in Table 3. These criteria are systematically assessed and juxtaposed within each subcategory of analysis.

4.1. Supervised learning methods

Supervised learning is a learning technique that involves the process of assigning labels to data. Supervised learning methods are provided with a training dataset that consists of labeled data where both the input and corresponding output are known. This dataset is used to construct a system model that captures the learned relationship between the input and output variables. Following the completion of training, the trained model is capable of utilizing a newly introduced input to generate the anticipated output [56,57].

This section focuses on supervised learning methods, specifically the classification and regression techniques that are relevant to the research field, as discussed in Section 4.1.1.
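The train-then-predict contract just described can be sketched in a few lines of plain Python. The following is a minimal illustration only — a 1-nearest-neighbor classifier (the simplest member of the kNN family later applied in [72]), with feature vectors and labels invented for the example rather than taken from any reviewed study:

```python
import math

def nearest_neighbor_predict(train, query):
    """Return the label of the labeled training point closest to `query`."""
    def dist(a, b):
        # Euclidean distance between two feature vectors
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(train, key=lambda item: dist(item[0], query))
    return label

# Labeled training set: (feature vector, known output) pairs. The
# features (say, code churn and past failure count) are hypothetical.
labeled = [
    ((0.9, 12), "defective"),
    ((0.8, 10), "defective"),
    ((0.2, 1), "clean"),
    ((0.1, 0), "clean"),
]

# After training (here, simply storing the labeled data), the model
# maps a newly introduced input to an anticipated output.
print(nearest_neighbor_predict(labeled, (0.85, 11)))  # → defective
```

The reviewed studies replace this toy distance rule with models such as multilayer perceptrons or CNNs, but the interface is the same: labeled (input, output) pairs in, a predictor for unseen inputs out.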
datasets. This study examined the influence of the Salp Swarm Algo while upholding a commendable level of precision.
rithm on the predictive efficacy of backpropagation neural networks. Yahmed, et al. [68] introduced DiverGet, a search-based ST method,
The findings demonstrated the superiority of the recommended as a means to assess the efficacy of deep neural network (DNN) quan
approach in comparison to alternative methods. tization. The authors acknowledged that the process of quantization
Kamaraj, et al. [63] introduced a weight-optimized ANN that in may result in a decline in accuracy and pose challenges to performance.
corporates stochastic diffusion search (SDS) for the purpose of identi Consequently, they put forward a solution called DiverGet to tackle
fying the optimal weights based on a predetermined fitness function. these concerns. DiverGet employed an ML-based methodology to
The outcomes were assigned random classifications by default and ob generate a diverse collection of test cases with the aim of evaluating the
tained from the classifier. Consequently, the individuals were catego precision and efficiency of different quantization schemes. The study
rized according to their respective priorities, with the test items sharing presented experimental findings that demonstrate the efficacy of
the same priority being grouped together. The findings of the study DiverGet in identifying high-quality quantization schemes and
indicated that the proposed neural network outperformed the tradi improving the performance and accuracy of DNNs.
tional ANN in terms of both correct output misclassification and wrong The employment of DL models based on short-term memory (LSTM)
output misclassification. was suggested as a method for prioritizing test cases [69]. The predictive
Sheta, et al. [64] introduced two computational models for deter capability of LSTM was utilized to estimate the probability of each test
mining the required sample size in ST, utilizing ANN and genetic pro case detecting a fault within the cycle. This estimation was based on the
gramming. The proposed model utilized two computational techniques, testing data derived from all preceding continuous integration cycles.
namely multilayer perceptron ANN methods and genetic programming, The findings of the study indicated that there is potential for enhancing
in order to illustrate the correlation between the number of test workers the fault detection rate in continuous integration testing. Moreover, the
and the errors quantified in the software. Both of the recommended tester had the ability to identify the previous test cases within the con
models produced promising approximation results in real-time appli straints of the testing time. LSTM training data sets lacked the inclusion
cations and demonstrated the ability to predict the necessary team size. of the relationship between experimental items in the conducted
The TORC method, as described in [65], proposed a resolution to the experiment. Additionally, the factors of time costs and errors were not
challenge of the Oracle problem in scientific models. This approach taken into account. Ultimately, the framework and parameters of the
incorporated a combination of the combined interaction test, mutation LSTM network did not exhibit optimal predictive accuracy.
analysis, and convolutional neural networks (CNN). Furthermore, a The primary objective of software engineering, as stated in [70], is to
study was undertaken wherein CNNs were employed as Oracle proced deliver an output while simultaneously optimizing the cost and time
ures. Additionally, a technique based on features and neighborhoods required for application development. In order to attain the objective, it
was introduced by CNNs to elucidate the phenomenon of image was observed that software teams engage in testing their applications
misclassification. The experimental findings indicated that the perfor before they are deployed for live production. Documentation assumed a
mance of a shallower CNN is not consistently surpassed by its deeper critical role in facilitating their test automation efforts. The study pri
counterpart. The analysis of scientific models and errors detected in marily centered on the utilization of existing test resources to automate
CNNs did not reveal any discernible correlation between software bugs. test generation, with a particular emphasis on overcoming challenges
This paper did not include a discussion on the comparison of the per through systematic process enhancement and establishing a compre
formance of feature and neighborhood-based analysis techniques to hensive test strategy applicable to various software applications.
metrics such as neuron coverage, DeepGauge, and surprise adequacy for Raj and Chandrasekaran [71] introduced a methodology for
DL systems. leveraging ML methods in the automated generation of test sets. The
Ruospo, et al. [66] conducted an analysis of six potential scenarios NEAT algorithm was employed in this study to autonomously generate
pertaining to the feasibility of employing the ST library for online testing of an embedded system designed to execute AI-based programs. This study examined the impact of incorporating a software test library on the performance of artificial neural networks (ANN) in specific recommended scenarios. The performance of CNNs remained consistent across the three examined scenarios, where idle time was utilized for executing the software test library. Furthermore, three potential resolutions were presented, with the primary determinant being the duration required for error detection. The objective of these methodologies, with the exception of one instance, was to reduce the duration required for fault detection. The findings indicated that the most favorable outcome was achieved by employing a combination of idle times and cross-layer implementation for both CNNs.

Sivaji and Rao [67] proposed a CNN method that utilizes African buffalo-based techniques to optimize regression testing by reducing both time and resource consumption. Initially, the data sets to be examined were gathered. Subsequently, a CNN model utilizing African buffalo data was developed to conduct test case minimization, and the minimization process was executed. Furthermore, the classification layer of the convolutional neural slice model was utilized to establish the fitness function of the African buffalo. The regression test was then conducted in order to evaluate the software's performance. The simulation of the model was ultimately completed in software, and the performance criteria were compared with those of previous studies. The developed model demonstrated a notable level of accuracy and efficiency in its ability to process multiple test cases. Hence, the implemented methodology effectively decreased the duration of execution and energy consumption.

[...] novel test sets or augment the coverage of an existing test set. The NEAT algorithm combined the benefits of ANNs and evolutionary genetic algorithms. The approach demonstrated superior coverage compared to alternative methods, requiring less time and a reduced number of test cases. Based on the findings of this study, it can be inferred that NEAT exhibited efficacy in automating the generation of test suites for the software being evaluated.

In Oleshchenko [72], the author described an approach that involved establishing a test result repository, implementing automated analysis using the kNN algorithm, and providing real-time updates on the progress of testing. Although the author noted the need for comprehensive examination across a broader range of applications, the approach indicated the potential to improve accuracy and efficiency by a significant 25% when compared to current error analysis methodologies. Its integration of Java, MongoDB, Elasticsearch, Jenkins, and Docker technologies made it simple to deploy and operate.

4.1.2. Overview of supervised learning methods

The categorization of selected papers and important elements for analyzing supervised learning are elaborated upon in greater depth in Table 4. The evaluation of the assigned studies using evaluation factors in supervised learning methods is elaborated in Table 5. The factors encompassed in this study include recall, precision, time, accuracy, performance, efficiency, success rate, coverage, RMSE, effectiveness, APFD, MSE, MAE, confusion matrix, and cost. The majority of papers reviewed in the context of supervised learning methods evaluated the performance of their proposed approach using the metric of accuracy.
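The repository-plus-kNN analysis described for Oleshchenko [72] can be illustrated with a minimal sketch. This is not the author's implementation: the feature names, labels, and data below are hypothetical; only the idea of classifying a new test result by majority vote among its nearest historical results is taken from the description above.

```python
from collections import Counter
import math

def knn_classify(history, new_point, k=3):
    """Label a new test result by majority vote among the k nearest
    historical results (Euclidean distance over numeric features)."""
    ranked = sorted(history, key=lambda rec: math.dist(rec[0], new_point))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors: (duration_s, failed_asserts, log_error_count)
history = [
    ((0.2, 0, 0), "pass"),
    ((0.3, 0, 1), "pass"),
    ((2.5, 3, 12), "product-bug"),
    ((2.8, 4, 15), "product-bug"),
    ((0.4, 1, 30), "flaky-env"),
    ((0.5, 1, 28), "flaky-env"),
]

print(knn_classify(history, (2.6, 3, 13)))  # nearest neighbors are product-bug runs
```

In a deployment like the one described, the `history` list would be replaced by queries against the test result repository, and new results would be classified as they stream in.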
S. Ajorloo et al. Applied Soft Computing 162 (2024) 111805
Table 4
Classification of recent research and additional information on supervised learning methods.

[58] Applied method: Multi-layer perceptron. Main idea: Enhancing ML classifiers for software module defect prediction with unbalanced datasets. Datasets: NASA JM1 (http://promise.site.uottawa.ca/SERepository/datasets/jm1.arff). Evaluation method: Simulation. Tool(s) or evaluation environment: Not mentioned. Advantage(s): Reduced false negatives; efficacy of the ST; improved identification of modules as defective. Disadvantage(s): Reduction of the ST efficiency; increase in the number of false positives; decreased overall accuracy.

[59] Applied method: Support vector machines. Main idea: Investigating a framework for NNs that incorporates dataset diversity and a behavioral oracle. Datasets: MNIST. Evaluation method: Simulation. Tool(s): Python. Advantage(s): Able to show bug-injected program anomalies. Disadvantage(s): The algorithm requires lots of data and development time.

[60] Applied method: Support vector machines. Main idea: Emulating data faults to study and develop ML solutions. Datasets: MNIST. Evaluation method: Prototype. Tool(s): Python. Advantage(s): Improving ML system training. Disadvantage(s): The approach exhibits limitations in terms of both functionality and usability.

[61] Applied method: Random forest. Main idea: Showing how AI can efficiently test and validate critical EDA software components, especially in FPGA architectures. Datasets: Not mentioned. Evaluation method: Simulation. Tool(s): Python. Advantage(s): High efficiency. Disadvantage(s): Not expanding the model to predict the net input slope.

[62] Applied method: Backpropagation (feedforward) neural networks. Main idea: Predicting the ST-phase developer count. Datasets: Albrecht, Cosmic, Desharnais, and Kemerer datasets. Evaluation method: Simulation. Tool(s): MATLAB. Advantage(s): High performance. Disadvantage(s): The algorithm requires lots of data and development time.

[63] Applied method: Backpropagation (feedforward) neural networks. Main idea: Proposing techniques to implement test oracles. Datasets: Not mentioned. Evaluation method: Simulation. Tool(s): MATLAB; WEKA. Advantage(s): Less computational time; less misclassification. Disadvantage(s): The algorithm requires lots of data and development time.

[64] Applied method: Multi-layer perceptron. Main idea: Estimating the number of test workers needed to test software based on measured faults, using ANN and GP methods. Datasets: Not mentioned. Evaluation method: Simulation. Tool(s): FORTRAN. Advantage(s): Predicted the team size. Disadvantage(s): Lack of evaluation of the proposed models for other test instances.

[65] Applied method: Convolutional neural networks. Main idea: Proposing a method to address the oracle problem for scientific models. Datasets: https://github.com/vsantjr/TOrC. Evaluation method: Simulation. Tool(s): Lua. Advantage(s): An alternative that directly and more comprehensively explains DNN failures. Disadvantage(s): Not considering more types of images and environments; avoiding FNA performance comparisons with neuron coverage and SADL.

[66] Applied method: Convolutional neural networks. Main idea: Analyzing and integrating STLs for online testing of embedded systems running ANN-based applications. Datasets: CIFAR-10. Evaluation method: Simulation. Tool(s): Not mentioned. Advantage(s): Matching idle times with layered implementation for both CNNs. Disadvantage(s): Not exploring GPU architectures and all unexplored possibilities.

[67] Applied method: Convolutional neural networks. Main idea: Developing African buffalo-based convolutional neural slicing to reduce regression testing time and resource use. Datasets: Not mentioned. Evaluation method: Simulation. Tool(s): Python. Advantage(s): Low execution time; low memory use; high precision; high recall; high efficiency in detecting faults; high performance in minimizing test cases. Disadvantage(s): The algorithm used has a high cost.

[68] Applied method: Convolutional neural network. Main idea: Proposing a search-based approach to evaluate DNN quantization techniques. Datasets: Pavia University (PU) and Salinas (SA) datasets (https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes). Evaluation method: Simulation. Tool(s): Python. Advantage(s): Automatic generation of diverse test cases for a broad range of quantization configurations. Disadvantage(s): The DNN's performance was not evaluated.

[69] Applied method: Recurrent neural network. Main idea: Successful regression testing for embedded software in continuous integration. Datasets: Paint Control and IOF/ROL datasets (https://doi.org/10.7910/DVN/GIJ5DE). Evaluation method: Simulation. Tool(s): Python. Advantage(s): Improving the prioritization effectiveness; increasing the fault detection rate in the CI environment. Disadvantage(s): Neglecting time and error costs; the LSTM network parameters' poor prediction accuracy.
Table 4 (continued): rows for articles [70] through [72]; the continuation rows were not recoverable from the extraction.
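Several supervised entries in Table 4 (e.g., [58]) frame defect prediction as binary classification over module metrics. As a hedged illustration only, a minimal single-neuron perceptron learner is sketched below; the metrics, scaling, and data are invented, and the cited studies use multi-layer networks rather than this toy:

```python
def train_perceptron(rows, epochs=50, lr=0.1):
    """Train a single-neuron classifier; rows are (features, label),
    label 1 = defective module, 0 = clean module."""
    w = [0.0] * len(rows[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in rows:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred  # perceptron update only on misclassification
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Hypothetical module metrics: (lines_of_code / 1000, cyclomatic_complexity / 10)
data = [((0.1, 0.2), 0), ((0.2, 0.1), 0), ((1.5, 1.2), 1), ((1.2, 1.4), 1)]
model = train_perceptron(data)
print(predict(model, (1.3, 1.1)), predict(model, (0.15, 0.2)))  # prints: 1 0
```

On linearly separable data like this toy set the perceptron converges; the unbalanced-dataset problem highlighted by [58] arises precisely when defective modules are too rare for such a learner to see enough positive examples.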
4.2. Unsupervised learning methods

In contrast to the process of supervised learning, an unsupervised learning method is provided with a collection of inputs that lack any associated labels, thereby lacking an output. In essence, an unsupervised learning method endeavors to identify patterns, structures, or knowledge within unlabeled data through the process of clustering sample data into distinct groups based on their similarity. Unsupervised learning methods are commonly employed in the domains of clustering and data aggregation [57,73]. The following section, specifically subsection 4.2.1, provides a summary of the papers examined in this study that utilize unsupervised learning methods.

4.2.1. Overview of the selected unsupervised learning methods

Ali, et al. [74] developed a priority and test case selection method to improve agile fault detection rates. The suggested strategy prioritizes and selects test cases by regularly changing the test suite and discovering failing test cases through continuous integration in each agile development cycle. The CTFF had two stages. Clustering was the first step, followed by selecting the highest-priority test cases from each cluster for execution to find the biggest fault. Three agile-developed software programs tested the CTFF model with different-sized test cases. The study found that CTFF detected more errors than random prioritizing and other error-based methods. The CTFF model outperformed previous methods.

A clustering-based adaptive random sequence methodology was presented in [75] to improve regression testing. Adaptive random sequences maximize neighboring test case variety in this strategy. This study established adaptive random sequence algorithms using three clustering methodologies. Adaptive random sequences were created using MSampling. Compared to the random prioritization approach and the coverage test prioritization method, early error detection and efficacy increased. All three clustering algorithms outperformed Method Coverage and RT-MS. For prioritizing test cases, DM-clustering was the most successful strategy.

Mutation testing for DL systems was introduced in [76] to evaluate test data. To detect DL source mistakes, a thorough collection of mutation operators at the resource level was designed. Next, model-level mutation operators were created to directly introduce faults into DL models without training. Finally, injection error detection was used to evaluate the quality of the test data. The test data provided extensive feedback and suggestions to help understand and build DL systems. This study did not propose enhanced mutation operators to cover more features of DL systems and examine mutation operator interconnections.

Liu, et al. [77] proposed a testing methodology named DeepBoundary with the aim of enhancing the scope of DL software applications. The authors examined the challenges associated with evaluating DL software and emphasized the importance of achieving comprehensive coverage. They provided a definition for decision boundary representation and demonstrated its application in generating test cases that specifically address the unexplored regions of the input space. The empirical findings indicated that DeepBoundary exhibits superior coverage compared to conventional testing methodologies, all the while necessitating a reduced number of test cases. The researchers reached the conclusion that the implementation of DeepBoundary had the potential to enhance the overall quality and dependability of DL software. Additionally, it can serve as a complementary tool to traditional testing methodologies.

Suman and Khan [78] introduced the Dove Swarm-based Deep Neural Method (DSbDNM) for software testing and intrusion detection. This study compared four deep learning-based vulnerability prediction (DLVP) methods, which are known for accurately identifying security vulnerabilities. To attain this purpose, the algorithm's performance was compared to other methods. The DSbDNM model was initially trained using internet presentation data that contained intrusion information. Following that, feature extraction and predictions of malicious actions were performed. Furthermore, a thorough classification of various forms of assault and negative behaviors was conducted. The efficiency of the generated prediction model was also tested by initiating and detecting an unidentified assault. As a result, the proposed method would improve performance and attain high levels of accuracy within a restricted computing time frame.

4.2.2. Overview of unsupervised learning methods

The categorization of selected papers and important elements for analyzing unsupervised learning are elaborated upon in greater depth in Table 6. The evaluation of the assigned studies using evaluation factors in unsupervised learning methods is elaborated in Table 7. The factors encompassed in this study include recall, precision, accuracy, performance, efficiency, F-measure, coverage, effectiveness, APFD, mutation score, and time.

4.3. Reinforcement learning methods

Reinforcement learning is a widely recognized and quite popular learning method [79,80]. In contrast to supervised learning, which relies
[...] offers an indication of correctness without explicit labels. The process of acquiring "good" behavior is achieved through iterative interactions with the environment. The learning process in question bears a resemblance to supervised learning, albeit with a distinct characteristic: rather than relying on an extensive dataset with labeled information, the model [...] interactions subsequently yield either positive rewards or negative punishments. This feedback serves to reinforce the behavior of the model, [...] the process of generating test data, with a focus on satisfying the criteria [...] outcomes.

In order to efficiently detect errors in recently developed code, Rawat, et al. [82] introduced a method for prioritizing test cases that [...] thoroughly investigated.

The authors of [84] presented a reinforcement learning-based [...] problems and solving the test case prioritization problem. The authors used DRL methods to determine the relative ranking of test cases during [...] used DRL methods to examine the game environment for bugs. The results indicated that some selected DRL methods performed better than more recent approaches that have been published in the literature.

Table 5
Supervised learning methods metrics in reviewed approaches, covering articles [58] through [72]. [The metric-by-article matrix was lost in extraction; per the text of Section 4.1.2, the metrics include recall, precision, time, accuracy, performance, efficiency, success rate, coverage, RMSE, effectiveness, APFD, MSE, MAE, confusion matrix, and cost.]
Table 6
Classification of recent research and additional information on unsupervised learning methods.

[74] Applied method: K-means clustering. Main idea: Proposing a solution relevant to regression testing in agile practices. Datasets: Not mentioned. Evaluation method: Design. Tool or evaluation environment: Not mentioned. Advantage(s): High error detection rate; high performance. Disadvantage(s): Not extending CTFF to resolve regression testing constraints in component-based software and product line engineering.

[75] Applied method: K-means clustering. Main idea: Increasing the test case prioritization effectiveness for object-oriented software. Datasets: Not mentioned. Evaluation method: Simulation. Tool: Not mentioned. Advantage(s): Early fault detection; high effectiveness. Disadvantage(s): The selected test cases for object-oriented software were not evenly spread across the input domain.

[76] Applied method: Deep neural networks. Main idea: Measuring the quality of test data with a mutation testing framework for DL systems. Datasets: MNIST; CIFAR-10. Evaluation method: Simulation. Tool: Python. Advantage(s): Generating higher-quality test data. Disadvantage(s): Advanced mutation operators are not proposed for investigating the relationships between mutation operators.

[77] Applied method: Deep neural networks. Main idea: Proposing a coverage testing method for DL software. Datasets: MNIST. Evaluation method: Simulation. Tool: Python. Advantage(s): High coverage. Disadvantage(s): Limited evaluation of datasets.

[78] Applied method: Deep neural network. Main idea: Introducing DSbDNM as a high-precision solution that can be obtained in a short computational time. Datasets: NF-UQ-NIDS-v2; network intrusion detection datasets. Evaluation method: Prototype. Tool: Python. Advantage(s): High accuracy; high precision; high recall; high F-measure. Disadvantage(s): Lack of investigation of classifiers that might lessen the limitations of deep learning-based models without missing necessary features.
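The cluster-then-prioritize idea behind [74] and [75] can be sketched generically. This is not the CTFF algorithm itself: the two-dimensional coverage-like features and failure counts below are hypothetical, and a fixed deterministic initialization stands in for whatever seeding the original studies used.

```python
def kmeans(points, k, iters=20):
    """Plain k-means; initial centers are the first k points (deterministic)."""
    centers = points[:k]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Recompute each center as its cluster mean; keep old center if empty.
        centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return clusters

# Hypothetical tests: id -> (coverage-like feature vector, historical failure count)
tests = {
    "t1": ((1.0, 0.0), 5), "t2": ((0.9, 0.1), 1),
    "t3": ((0.0, 1.0), 2), "t4": ((0.1, 0.9), 7),
}
clusters = kmeans([v[0] for v in tests.values()], k=2)
# From each cluster, schedule first the member with the most historical failures.
selected = sorted(
    max((t for t in tests if tests[t][0] in cl), key=lambda t: tests[t][1])
    for cl in clusters if cl
)
print(selected)  # one high-failure representative per cluster
```

Picking one representative per cluster is what lets such methods cover dissimilar regions of the test suite early, which is the source of the early-fault-detection gains reported for [74] and [75].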
Table 7
Unsupervised learning methods metrics in reviewed approaches.
Article Recall Precision Accuracy Performance Efficiency F-measure Coverage Effectiveness APFD Mutation score Time Suitability
[74] * * * * * * * * *
[75] * * * * *
[76] * *
[77] * * * * *
[78] * * * * * *
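APFD, which recurs across Tables 5, 7, and 9, is the standard Average Percentage of Faults Detected measure from the test prioritization literature: APFD = 1 - (TF1 + ... + TFm)/(n*m) + 1/(2n), where n is the number of tests, m the number of faults, and TFi the 1-based position of the first test revealing fault i. A small sketch with an invented fault matrix:

```python
def apfd(ordering, fault_matrix):
    """APFD = 1 - (sum of first-detection positions) / (n * m) + 1 / (2n).

    ordering: test ids in execution order.
    fault_matrix: fault id -> set of test ids that detect it.
    """
    n, m = len(ordering), len(fault_matrix)
    pos = {t: i + 1 for i, t in enumerate(ordering)}  # 1-based positions
    total = sum(min(pos[t] for t in detecting)
                for detecting in fault_matrix.values())
    return 1 - total / (n * m) + 1 / (2 * n)

faults = {"f1": {"t3"}, "f2": {"t1", "t4"}, "f3": {"t2"}}
print(apfd(["t3", "t1", "t2", "t4"], faults))  # fault-revealing tests first: 0.625
print(apfd(["t4", "t2", "t1", "t3"], faults))  # worse ordering scores lower
```

Higher APFD means faults are revealed earlier in the execution order, which is why the prioritization studies above report it.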
Table 8
Classification of recent research and additional information on reinforcement learning methods.

[81] Applied method: Q-learning. Main idea: Automating test data production with a structural method. Datasets: Not mentioned. Evaluation method: Simulation. Tool(s) or evaluation environment(s): Python. Advantage(s): Low fitness evaluations; higher success rate. Disadvantage(s): This paper's approach does not guarantee the quality of the final solution to a problem.

[82] Applied method: Q-learning. Main idea: Employing efficient and sustainable prioritization techniques to reduce the number of software test cases while improving their quality. Datasets: Not mentioned. Evaluation method: Simulation. Tool(s): Not mentioned. Advantage(s): High accuracy; high adaptability; high accessibility. Disadvantage(s): High complexity; validation challenge.

[83] Applied method: Reinforcement learning. Main idea: Using reinforcement learning to design a continuous integration reward function. Datasets: Paint Control and IOF/ROL datasets (https://doi.org/10.7910/DVN/GIJ5DE); Google Open-Source Data Set (GSDTSR) (https://doi.org/10.7910/DVN/MJFKDN). Evaluation method: Simulation. Tool(s): Not mentioned. Advantage(s): High fault detection. Disadvantage(s): Not changing the agent algorithm to improve error detection.

[84] Applied method: Reinforcement learning. Main idea: Reinforcement learning-based ST for multiple defect localization. Datasets: Not mentioned. Evaluation method: Simulation. Tool(s): Siemens suites; Grep; Gzip; Sed. Advantage(s): EXAMF/EXAML evaluation index reduction. Disadvantage(s): This paper's method takes plenty of data and computation.

[85] Applied method: Reinforcement learning. Main idea: Assessing and contrasting the algorithms implemented in various DRL frameworks. Datasets: Paint-Control and IOFROL (https://github.com/icse20/RT-CI). Evaluation method: Real testbed. Tool(s): Python. Advantage(s): High performance. Disadvantage(s): Inadequate analysis of the PPO-SB and PPO-TF implementations.
Table 9
Reinforcement learning methods metrics in reviewed approaches.
Article Time Performance Efficiency Success rate Coverage APFD Suitability
[81] * * *
[82] * * * * *
[83] * * *
[84] * *
[85] * * * *
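The Q-learning formulation used by several of these studies (e.g., test data generation in [81], prioritization in [82]) can be sketched as a toy tabular agent. The state, action, and reward design below, along with the fault map, are assumptions for illustration, not the reviewed papers' actual setups.

```python
import random

def train_selector(tests, faults_of, episodes=500, alpha=0.5, gamma=0.9, eps=0.2):
    """Toy tabular Q-learning for ordering tests.

    State:  frozenset of faults revealed so far.
    Action: one of the not-yet-executed tests.
    Reward: number of previously unseen faults the chosen test reveals.
    """
    rng = random.Random(1)
    q = {}
    for _ in range(episodes):
        found, remaining = frozenset(), set(tests)
        while remaining:
            acts = sorted(remaining)
            if rng.random() < eps:            # epsilon-greedy exploration
                action = rng.choice(acts)
            else:
                action = max(acts, key=lambda t: q.get((found, t), 0.0))
            reward = len(faults_of[action] - found)
            nxt = found | faults_of[action]
            rest = remaining - {action}
            best_next = max((q.get((nxt, t), 0.0) for t in rest), default=0.0)
            old = q.get((found, action), 0.0)
            q[(found, action)] = old + alpha * (reward + gamma * best_next - old)
            found, remaining = nxt, rest
    return q

faults_of = {"t1": {"f1"}, "t2": {"f1", "f2"}, "t3": set()}  # hypothetical fault map
q = train_selector(["t1", "t2", "t3"], faults_of)
best_first = max(["t1", "t2", "t3"], key=lambda t: q.get((frozenset(), t), 0.0))
print(best_first)  # converges to "t2", the test revealing the most faults
```

The DRL variants surveyed above replace the Q table with a neural network so the same update rule scales to state spaces (coverage vectors, execution histories) far too large to enumerate.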
4.3.2. Overview of reinforcement learning methods

The categorization of the aforementioned studies and the influential factors for analyzing reinforcement learning methods are elaborated upon in greater depth in Table 8. Table 9 presents an assessment of the aforementioned studies, employing evaluation criteria specific to reinforcement learning methods. The factors encompassed in this study comprise time, performance, efficiency, success rate, coverage, and APFD.

4.4. Hybrid learning methods

In order to address the limitations inherent in individual ML methods, it is necessary to integrate them into a cohesive approach to achieve optimal efficiency. In order to generate novel hybrid learning methods, it is imperative to employ a diverse range of methodologies across multiple processes. The utilization of a hybrid learning method is predicated on the prevalent practice among researchers of combining two or more methods. However, the method is versatile and has the potential to be applied to a wide range of problems. The subsequent portion of this section examines the principal attributes of the hybrid learning methods that have been selected.

4.4.1. Overview of the selected hybrid learning methods

Ahmad, et al. [86] showcased a heuristic performance testing approach that automates the process of identifying system inputs and determining specific input combinations that lead to performance bottlenecks. The methodology, referred to as iPerfXRL, employs deep reinforcement learning methods to manage extensive input spaces with multiple dimensions effectively. This paper presented empirical evidence supporting the superiority of the proposed method in generating more contextually appropriate outcomes compared to alternative approaches. Moreover, empirical evidence demonstrated that this approach possesses the capability to identify and distinguish regions within the input space that exhibit interconnected combinations. It was recommended that diverse key performance indicators, such as CPU and memory utilization, be incorporated within the reward function to enhance performance outcomes. Furthermore, it was possible to expand the methodology to identify potential underlying factors contributing to performance limitations.

Chen and Huang [87] presented a reinforcement learning method that utilizes test program generation to identify transition postponement errors. The initial findings indicated that the utilization of a reinforcement learning method has the potential to facilitate the creation of software-based self-test programs. The processor that was subjected to testing was specifically aimed at complex test cases and successfully achieved a high level of fault coverage for transition delay faults. Although the processor under test did not include several recent processor designs, it is important to note that flawless data and instruction caches were not taken into consideration in the scope of this paper.

In [88], a Double Deep Q-Networks (DDQN) agent was employed to examine the potential substitution of human-designed meta-heuristic algorithms with reinforcement learning in the context of search-based ST methodologies. The authors presented a DDQN agent that was trained using DNNs. Additionally, they introduced a comprehensive framework called GunPowder for search-based ST. In a similar vein, GunPowder conducted a feasibility study that explored the potential of reinforcement learning-based test data generation. This study was accompanied by a small empirical investigation. The formulation of search-based test data generation involved the establishment of a decision process that leverages reinforcement learning methods. The findings of the study demonstrated the feasibility of acquiring learning behaviors through the utilization of a metaheuristic algorithm.

Chen, et al. [89] proposed the utilization of DRLGENCERT, a framework that employs deep reinforcement learning methods for the automated testing of certificate verification in implementations of Secure Sockets Layer/Transport Layer Security. The DRLGENCERT system incorporated the utilization of standard certificates as both input and output, potentially resulting in variations in efficiency during the certificate generation process. The employed framework utilized deep reinforcement learning to determine the optimal subsequent action based on the outcomes of prior modifications, as opposed to random or inadvertent combinations. The findings suggested that DRLGENCERT functioned as an automated system for generating test cases.

Xiao, et al. [90] employed neural network models in their study to construct prediction models for the error recognition and fault correction processes, taking into account the testing effort. A proposal was made for the implementation of a forecasting algorithm to carry out the prediction models. The findings indicated that the ANN models recommended in this study outperformed the analytical model in accurately predicting the number of errors detected. The suggested model had the potential to aid decision-makers in the assessment of reliability, estimation of costs, and determination of the most favorable release time.

López-Martín [91] conducted an investigation into the prediction of software test efforts using various ML models. The models were prepared and tested using datasets selected from a globally accessible public repository of software projects. The selection of data sets was contingent upon evaluations of data quality, level of development, development platform, programming language, measurement method, and project resource level. The findings indicated that a highly precise prediction of software test effort can contribute to the estimation of project cost, facilitate the development of a project schedule, and assist the team leader in effectively establishing a test team.

Kahles, et al. [92] utilized ML methods to automate the process of root cause analysis in agile ST environments. In order to achieve precise categorization of the factors contributing to unsuccessful experiments, the utilization of clustering and ML classification methods was contemplated. The findings of this study indicated that the utilization of MLP-based classification outperformed cluster analysis in the identification of underlying factors contributing to test failures. To enhance the quality of conclusions, it was recommended to employ comprehensive interviews with test engineers in order to identify novel features that can enhance the performance of algorithms.

The issue of testing and debugging neural network systems was discussed in [93]. Furthermore, the examination of the contrasting systems with regard to algorithm implementation was conducted with a focus on testing. The aim of this article was to identify the requirements of test systems and analyze the specific features of various neural network models. The discourse revolved around strategies for mitigating the drawbacks associated with the systems.

The model proposed in [94] employed a combination of RFM analysis and regression techniques to effectively minimize the testing duration and resource allocation when applied to an existing dataset.
The development of these methods aimed to automate the model and enhance the comprehension of the test cases' behavior during evaluation. Prior to conducting any formal examinations, the different modules slated for testing underwent a thorough inspection. In cases where it was deemed necessary, the electrical properties of these individual components were simulated to ascertain their precise operational capabilities. The test outcomes provided insights into the functioning mechanisms of these systems, enabling the assessment of their reliability. The findings were disseminated to the test management team, which integrated the data into the system as a component of formulating a proficient test environment strategy.

In [95], a method for prioritizing test inputs in DNNs was introduced. This approach utilized intelligent mutation analysis to consider a larger number of test inputs, aiming to detect bugs at an earlier stage within a shorter time frame. Consequently, this technique enhanced the testing efficiency of DNNs by simplifying the process. This paper presented an approach for prioritizing DNNs based on intelligent change analysis. A designed set of model and input change rules was specifically employed, and rank-based learning was utilized to integrate these changes in order to prioritize the test input. The findings of this study provided evidence supporting the suitability and effectiveness of the proposed methodology. This study examined the proposed methodology employed by a corporation in the context of autonomous vehicles, with the aim of assessing its feasibility. The efficacy of this methodology was demonstrated by the outcomes achieved.

The utilization of ML methods for ST was discussed in [96]. The study proposed an approach for identifying software defects by integrating the methodologies of random forest (RF) and CNN. The CNN method was employed to process the input data and extract relevant features, which were subsequently utilized by the RF method for the purpose of classification. The evaluation of the proposed method involved the utilization of two datasets. The results of this evaluation indicated that the CNN and RF methods exhibited superior accuracy compared to conventional methods. Based on the findings of the study, the proposed methodology demonstrated its potential as a valuable instrument for ST, particularly in scenarios where traditional testing methodologies may prove insufficient. However, further investigation was necessary in order to validate the proposed strategy on larger and more diverse datasets.

The combination of support vector machines (SVM) and k-nearest neighbor (KNN) methods was proposed as a method for solving the problem of ST [97]. The methodology was employed within the framework of educational assistance software, whose effectiveness is contingent upon precise and efficient testing procedures. The methodology involved conducting experiments on a dataset comprising test cases, wherein the SVM and KNN methods were employed to classify the test cases into two categories, namely passing or failing, based on the input attributes. The empirical results indicated that the hybrid learning method exhibited superior accuracy and efficacy compared to traditional testing methodologies. The conclusion of the article posited that the proposed methodology could potentially be employed to improve the precision and effectiveness of other ST scenarios. In general, the study underscored the benefits associated with employing a hybrid methodology and demonstrated the promising capabilities of ML methods such as SVM and KNN in the context of ST.

Birchler, et al. [98] introduced a methodology called SDC-Scissor, which employs ML methods to identify and eliminate test cases with a low likelihood of detecting faults in self-driving cars (SDCs) before their execution. The main aim of their inquiry was to assess the viability and level of precision of categorizing test cases for SDCs as either safe or unsafe before their implementation. The challenge that garnered attention was related to the advancement of methodologies that can effectively exploit features derived from SDC test cases. The aim was to reduce the costs associated with testing while ensuring that the testing process remained highly effective. The researchers also investigated the possibility of a maximum threshold for the precision and comprehensiveness attained by ML methods in detecting secure and insecure test scenarios for autonomous vehicles, specifically when employing fixed autonomous vehicle attributes. Consequently, the main aim of their study was to investigate the potential improvement of SDC-Scissor's ability to distinguish between secure and insecure test cases through the refinement of ML methods. To evaluate the practical feasibility of SDC-Scissor, the researchers integrated their tool into the operational structure of an automotive industry entity.

The primary focus of [99] was to evaluate the precision of estimation results by examining the quality of the inputs used for estimation as well as the model employed. The aim of this study was to improve the accuracy of software regression test effort estimation (SRTEE) by developing a technique called StackSRTEE. This technique incorporated a stacking ensemble model. The composition consisted of the three most commonly used ML methods. The grid search (GS) methodology was utilized by the researchers to optimize the hyperparameters of the StackSRTEE model. Following this, the model underwent training and evaluation using a dataset obtained from the ISBSG repository. The size of the functional change was the primary independent variable employed to augment the inputs of the StackSRTEE model.

Khan, et al. [100] concentrated on using machine learning methods to anticipate cumulative software failure levels, with the ultimate goal of enhancing residual defect forecasts and acquiring a thorough comprehension of software-related difficulties. The main data source for the research study was software metrics and defect data that were retrieved from a static code repository. Using a correlation methodology to find meaningful metrics, this dataset allowed the examination of the relationship between different software metrics and reported problems.

4.4.2. Overview of hybrid learning methods

Table 10 provides a comprehensive categorization of the selected studies and elucidates in greater depth the influential factors that are employed for the analysis of hybrid learning methods. Table 11 presents an assessment of selected studies through the utilization of evaluation factors in hybrid methodologies. The factors considered in this study encompassed various aspects such as recall, precision, cost, reliability, time, accuracy, performance, efficiency, MI, F-measure, coverage, RMSE, effectiveness, MSE, MAE, and success rate. The performance and time of hybrid learning methods were assessed in the majority of research studies.

5. Analysis of results

In this section, the analysis of the results from the systematic review is presented. Section 5.1 provides a comprehensive summary of the chosen studies. In order to fulfill the objective of this review, which is to examine and analyze the distinctions, benefits, and drawbacks of different ML methods in ST, a comprehensive examination of the mentioned classification is presented in Section 5.2.

5.1. Overview of the selected studies

The subsequent inquiries are designed to investigate the current advancements in ML methods in ST.

• Which publishing houses have the highest number of papers on ML methods in ST?
• What was the distribution of publishers and the amount of annual research dedicated to ML methods in the field of ST?
• What are the groups and active research communities on ML methods in ST?

5.1.1. A chronological examination of the studies

Based on the data presented in Fig. 4, it can be observed that a majority of the papers, specifically 40%, were published in IEEE. Following
Table 10
Classification of recent research and additional information on hybrid learning methods.

[86] Applied method: Deep neural networks (unsupervised); Q-learning. Main idea: Representing an automated heuristic performance testing method that identifies input combinations that cause performance bottlenecks. Datasets: Not mentioned. Evaluation method: Simulation. Tool(s) or evaluation environment(s): Python. Advantage(s): Finding the more relevant combinations. Disadvantage(s): Reward performance does not include CPU and memory usage.

[87] Applied method: Convolutional neural network; Q-learning. Main idea: Transition delay fault detection via reinforcement learning-based test program generation. Datasets: Not mentioned. Evaluation method: Simulation. Tool(s): Not mentioned. Advantage(s): High coverage. Disadvantage(s): The tested processor's outdated design.

[88] Applied method: Deep neural network; Q-learning. Main idea: Investigating whether reinforcement learning can replace human-designed metaheuristic algorithms in search-based software testing (SBST). Datasets: Not mentioned. Evaluation method: Formal. Tool(s): Python. Advantage(s): High coverage. Disadvantage(s): Unoptimized network architecture.

[89] Applied method: Q-learning; deep neural networks. Main idea: Proposing a deep reinforcement learning system for SSL/TLS certificate verification automation. Datasets: Not mentioned. Evaluation method: Prototype. Tool(s): ZMap [101]. Advantage(s): Discovering several previously unknown certificate verification flaws. Disadvantage(s): The used algorithm requires large amounts of data, and trained models cannot multitask.

[90] Applied method: Feedforward, recurrent, and convolutional neural networks. Main idea: Developing ANN-based error recognition and fault correction prediction models. Datasets: Firefox from Bugzilla dataset (https://bugzilla.mozilla.org/); product of Tomcat 8 dataset (https://bz.apache.org/bugzilla/). Evaluation method: Simulation. Tool(s): Not mentioned. Advantage(s): High accuracy in predicting errors. Disadvantage(s): The algorithm demands a lot of data and development time.

[91] Applied method: Simple linear regression; multilayer perceptron; support vector regression; decision trees; k-nearest neighbors. Main idea: Investigating ML models for software test effort prediction. Datasets: The International Software Benchmarking Standards Group (ISBSG) dataset. Evaluation method: Simulation. Tool(s): Not mentioned. Advantage(s): Calculating the project cost; making the project schedule; enabling the team leader to form a test team correctly. Disadvantage(s): Not utilizing ML models for defect correction effort prediction.
[92] • k-means Root cause analysis Not mentioned Prototype Not mentioned High accuracy • Avoiding
algorithm automation in agile ST hyperparameter
• Gaussian environments using ML optimization
mixture • Ignoring granular
• Multilayer ground truth failure
perceptron categories
• No feature extraction to
improve algorithms’
outcomes
Not testing NLP
[93] • Recurrent neural Adapting ST methods to Not mentioned Design Not mentioned Providing a Considering an example of
network test neural networks guideline to a small dimension
• Multi-Layer improve neural
perceptron network models
[94] • Ridge regression Automating the model and Not mentioned Simulation Python Evaluating system The algorithm misfits
• Simple linear understanding test case reliability complicated datasets.
regression behavior using some
• Support vector methods
regression
[95] • Convolutional Proposing intelligent • CIFAR-10 [102] Simulation Python High effectiveness The algorithm in use
neural networks mutation analysis to • CIFAR-100 [102] requires a lot of data.
• Recurrent neural prioritize DNN test inputs • MNIST (http://yann.le
network to label more bug-revealing cun.com/ex
inputs early for a short time db/mnist/)
• MNIST_VS_USPS [103]
• COIL (https://www.
cs.columbia.edu/
CAVE/software/softli
b/coil-20.php)
• PIE27_VS_PIE5
(http://www.cs.cmu.
edu/afs/
(continued on next page)
15
S. Ajorloo et al. Applied Soft Computing 162 (2024) 111805
Table 10 (continued )
Article Applied method Main idea Datasets Evaluation Tool(s) or Advantage(s) Disadvantage(s)
method evaluation
environment
(s)
cs/project/PIE/
MultiPie/Multi-Pie/
Home.html)
• PIE27_VS_PIE9
(http://www.cs.cmu.
edu/afs/
cs/project/PIE/
MultiPie/Multi-Pie/
Home.html)
• Driving (https://uda
city.com/
self-driving-car)
• TREC [104]
• IMDB [105]
• SMS Spam [106]
• CoLA [107]
• Hate Speech [108]
• KDDCUP99
(http://kdd.ics.uci.
edu/databases/
kddcup99/kddcup99.
html)
[96] • Convolutional Providing information Not mentioned Simulation Python • High accuracy Limited dataset
neural networks regarding a financial High coverage
Random forest institution’s ability to
authorize credit cards for
its clients
[97] • Support vector Utilizing an educational Not mentioned Real testbed Not mentioned High accuracy No comparison to other
machines assistant software system to ML methods
• K-Nearest show the efficacy of a
neighbors hybrid ML method for ST
[98] • Naive Bayes Determining ML-based test Zenodo (https://zenodo. Simulation Python • High accuracy • Need further SDC
• Random Forest selection methods for cost- org/record/7011983) • High precision datasets
effective simulation-based High Recall Not considering having
SDC testing flaky tests in virtual
environments
[99] • Neural networks Evaluating the efficacy of ISBSG Simulation Not mentioned • High accuracy Needs better ML
• Support vector estimation inputs and the • High precision algorithms to increase test
regression corresponding model in regression accuracy
• Decision trees generating precise
regression estimation outputs
[100] Recurrent neural Investigating the feasibility Not mentioned Simulation Not mentioned High effectiveness limited by the lack of
network of combining software identifying clusters certain data
DBSCAN clustering and fault prone to errors variations in performance
k-means prediction • Low failure across different methods
Random Forest patterns within
Support vector software systems
machines
Convolutional
neural networks
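Several of the hybrid approaches in Table 10 combine a distance-based learner such as K-nearest neighbors with other models to predict test verdicts (e.g., [91,97]). As a rough, self-contained sketch of the nearest-neighbor idea (not any specific paper's implementation), the classifier below votes among the k closest labeled test cases; all feature names and data are hypothetical:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # Majority vote among the k nearest labeled test cases;
    # distances are plain Euclidean over the feature vectors.
    nearest = sorted((math.dist(x, query), label) for x, label in train)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical test-case features: [road_curvature, max_speed, lane_changes]
train = [
    ([0.1, 30.0, 0], "safe"),
    ([0.2, 35.0, 1], "safe"),
    ([0.9, 70.0, 4], "unsafe"),
    ([0.8, 65.0, 3], "unsafe"),
]
print(knn_predict(train, [0.85, 68.0, 3]))  # prints "unsafe"
```

In practice, the reviewed papers train such classifiers on much larger labeled corpora and pair them with other models; this sketch only illustrates the voting mechanism.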
this, 37% of the papers were published by Springer, while 12% were published by Elsevier. A smaller proportion of papers, namely 6%, were published in Wiley and Taylor&Francis collectively, and the remaining 5% were published in ACM. Furthermore, Fig. 5 presents a visual representation of the quantity of scholarly papers published in both journals and conference proceedings spanning the years 2018 to March 2024. Additionally, it is illustrated that 2018 witnessed the greatest number of published papers. The trend of decreasing machine learning methods in software testing publications over the years may be attributed to several factors. Initially, there was significant excitement and exploration of the potential applications of machine learning in software testing, leading to a surge in related research publications. However, as the field matured, researchers likely shifted their focus towards refining existing methodologies, addressing practical challenges, and exploring interdisciplinary approaches. Additionally, advancements in other areas of software testing, such as automation tools and techniques, may have also influenced the distribution of research efforts away from exclusive reliance on machine learning methods. Overall, the decreasing trend in machine learning methods in software testing publications reflects a natural evolution in research priorities and the broader landscape of software testing methodologies.
Table 12 displays the categorization of the papers into two distinct groups, namely conferences and journals, where they have been published. Out of the total of 40 papers examined, it was found that three of them were published in EMSE (impact factor: 4.1), two were published in ISSRE, and two were published in the Software Quality Journal (impact factor: 1.9). Furthermore, it is evident that the majority of the papers were disseminated through conference proceedings, as indicated by Table 12.

5.1.2. Active research communities
Table 13 illustrates the distribution of papers across publication channels that have published a minimum of two papers in the specific area under investigation. This table provides insights into the distribution of active research communities within the selected papers, taking into account the affiliations of the authors following the final selection
[...] compilation of active communities that have been involved in a minimum of two studies that were included in the analysis. Additionally, the table provides information regarding the specific research focus of each community.
Researchers from the Xiamen University of Technology in China, Al- [...] attention towards methods of supervised learning.

5.2. Research objectives, methods, and evaluation metrics

[...] were introduced in Section 3.1. The research questions provide essential information regarding the specific aspects that should be examined [...] the percentage of each parameter was calculated using Eq. (1). Eq. (1) [...] attention. [...] algorithms can execute tests faster and with greater precision than manual [...]

[Table 11 residue: an asterisk grid marking which evaluation metrics (cost, effectiveness, MSE, MAE, success rate, suitability) each reviewed article, through [100], employs; the grid itself is not recoverable from the extraction.]
Fig. 8. Percentage of evaluation metrics in the chosen articles for each classification.
Fig. 9. Percentage of evaluation tools in the chosen articles for each classification.
Fig. 10. Repetition of employed datasets and case studies in the chosen articles.
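The evaluation metrics tallied per classification in Fig. 8 (accuracy, precision, recall, and the like) are standard quantities derived from predicted versus actual test verdicts. A minimal sketch of how they are computed, with entirely hypothetical verdict data:

```python
def classification_metrics(y_true, y_pred, positive="fail"):
    # Count true positives, false positives, and false negatives for
    # the "fail" verdict, then derive the three common metrics.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return {
        "accuracy": correct / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical predicted vs. actual verdicts for eight test cases.
truth = ["fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
pred  = ["fail", "pass", "pass", "pass", "fail", "fail", "pass", "pass"]
print(classification_metrics(truth, pred))
```

MSE and MAE, which several reviewed papers report instead, are the analogous error averages for regression-style outputs.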
challenges, researchers and practitioners employ various strategies and best practices. This includes collecting data from sources such as bug repositories and code repositories, annotating the data with relevant labels like bug severity and test case outcomes, and augmenting the dataset through techniques like synthetic data generation and transfer learning. Active learning techniques help prioritize annotation efforts, while collaboration with domain experts ensures dataset relevance and quality. Ethical considerations, continuous improvement, and benchmarking efforts further enhance dataset curation, ultimately facilitating more effective and reliable machine-learning approaches in software testing.
Drawing from the reviewed articles, we compiled a comprehensive list of datasets suitable for machine learning applications in software testing in Table 14. It categorizes datasets based on their specific purposes, including bug prediction, test case generation, code quality assessment, security testing, mobile app testing, regression testing, image classification, and other miscellaneous tasks. Each dataset is accompanied by a brief description highlighting its relevance and usage in software testing research and practice. This curated list serves as a valuable resource for researchers and practitioners seeking high-quality training data for developing and evaluating machine learning algorithms in the field of software testing.
To specify Q5, as indicated in Fig. 11, it can be observed that the evaluation methods employed in the selected papers are predominantly simulations, accounting for 77% of the total. Prototypes constitute 10% of the evaluation methods, while design-based methods comprise 5%.
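The distribution figures quoted here (77% simulation, 10% prototype, 5% design) are simple count shares over the 40 reviewed papers. A sketch of that computation, using hypothetical counts chosen to approximate the reported shares:

```python
def metric_percentages(counts):
    # Share of each category as a percentage of all reviewed papers,
    # rounded to one decimal place.
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

# Hypothetical counts over 40 papers approximating the reported shares.
evaluation_methods = {"simulation": 31, "prototype": 4, "design": 2, "other": 3}
print(metric_percentages(evaluation_methods))
# {'simulation': 77.5, 'prototype': 10.0, 'design': 5.0, 'other': 7.5}
```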
Table 14
Datasets for machine learning in software testing. (Columns: Purpose, Dataset Name, Description; the table body was not captured in the extraction.)
Fig. 12. Open issues and future works on ML methods implemented in ST.
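One of the open issues summarized in Fig. 12, scalability, is commonly tackled by distributing the evaluation workload across workers. The sketch below parallelizes a stand-in test-case scoring function with a thread pool; the scoring function and all parameters are hypothetical, and a CPU-bound model scorer would use a process pool or a cluster scheduler instead:

```python
from concurrent.futures import ThreadPoolExecutor

def score_test_case(case_id):
    # Stand-in for an expensive ML-based evaluation of one test case;
    # a real scorer would run model inference on the case's features.
    return case_id, (case_id * 2654435761) % 97

def score_all(case_ids, workers=4):
    # Spread the scoring workload across a worker pool. For CPU-bound
    # model inference, a ProcessPoolExecutor (or a scheduler across
    # cluster nodes) would replace the thread pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(score_test_case, case_ids))

scores = score_all(range(1000))
print(len(scores))  # prints 1000, one score per test case
```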
coverage, which may still pose a challenge. This challenge is especially important in the context of ML, where assessing complex systems requires rigorous testing methods to guarantee their accuracy and reliability.
• Accuracy: The accuracy of testing is a crucial challenge that ML methods must address. Based on the literature review, 52% of studies addressed the concept of test accuracy, yet as a very important metric in ML methods, it remains an open issue. The concepts of accuracy and precision are inherently ambiguous across languages, but they can be effectively measured as a metric of quality [94]. Assessing this attribute constitutes a means of guaranteeing trustworthiness in the results of simulations. The accurate outcome is considered the standard for measuring quality. The attribute of accuracy is employed as a component of validation and verification in scientific ST. Although the concepts of accuracy and precision have been studied in ML and ST methods, the subject is still considered an open issue.
• Scalability: Scalability is a critical aspect to consider when deploying ML methods in software testing, particularly concerning the performance and efficiency of these methods as the size and complexity of software systems increase. One potential framework to address scalability issues is to adopt distributed computing paradigms and parallel processing techniques. By distributing the computational workload across multiple nodes or processors, scalability can be enhanced, allowing ML-based testing methods to handle larger datasets and more complex software systems. Additionally, designing algorithms and architectures that are inherently scalable can ensure that ML-based testing approaches adapt to the growing demands of modern software development practices.
• Trust & security: Ensuring trust and security in ML-based testing methodologies is paramount, especially given the critical nature of software systems in various domains. One proposed methodology is to integrate trust and security mechanisms directly into the design and implementation of ML models used for testing purposes. This can involve employing techniques such as adversarial training to enhance the robustness of ML models against malicious attacks and ensure compliance with industry-standard security protocols and best practices. Moreover, incorporating explainability and interpretability features into ML-based testing approaches can enhance trust by providing insights into the decision-making process of these models.
• Interoperability: Interoperability testing is an ST methodology that evaluates the ability of a software system to function seamlessly with other software components, systems, and versions. One of the primary advantages of utilizing standards-based products is the achievement of effective interoperation. Regardless of whether the standards are proprietary, international, or public in nature, users commonly anticipate that products supported by standards will seamlessly interact with comparable products, as noted in [113]. Various forms of testing are commonly utilized to assist consumers in making informed decisions regarding the compatibility of products with similar standards and their functionality when used in conjunction with other products. The tiers of interoperability testing encompass specification-level, data-type, physical, and semantic interoperability. This subject matter holds promise for prospective research endeavors.
• Failure management: The findings of our study indicate that the success of software projects cannot be attributed to a singular factor. Several techniques that can enhance the success rates of projects have been investigated. However, a significant majority of these techniques focus on identifying and mitigating the impact of factors
that may contribute to project failure. According to [65,114] and [115], the timely prediction of software system malfunctions can aid in determining potential remedies that may enhance the achievement rate. Defects that arise during the testing phase can result in software failure and are commonly referred to as failures. Furthermore, instances of failure may not necessarily arise solely in reaction to unidentified faults or unresolved glitches. Hardware or firmware faults can arise from environmental factors or errors, such as an incorrect input value utilized during the testing phase. The factors leading to failure may vary, as indicated by previous research [114]. The integration of failure management into current methodologies may present a compelling avenue for future research.
• Empirical research: ML methods have received a lot of attention in recent years, including in ST. However, implementing these ideas in real-world settings remains difficult. According to the evaluated papers, only 5% of the proposed ML methods have been tested in real testbeds, with the remaining studies relying on simulation tools for testing. In addition, empirical research in the realm of ML methods for software testing is vital for understanding the practical implications and effectiveness of these methods. To address this challenge, it is essential to design and conduct experiments in real-world settings rather than relying solely on simulations. One proposed methodology is to establish collaborative partnerships between academia and industry to facilitate the deployment of ML methods in actual testbeds. These partnerships can enable access to diverse datasets, real-world software systems, and the infrastructure necessary for conducting empirical studies. Furthermore, establishing benchmarks and standardized evaluation protocols can aid in comparing the performance of different ML-based testing approaches across various domains.
• High-quality training datasets: Access to high-quality training datasets is fundamental for the success of ML-based testing methods. To address this challenge, researchers can explore innovative techniques for data collection, augmentation, and synthesis. This may involve leveraging crowdsourcing platforms, simulation environments, and data generation algorithms to create diverse and representative datasets for software testing. Additionally, establishing data-sharing initiatives and repositories within the research community can facilitate the dissemination of high-quality datasets and promote collaboration among researchers.
• Test input generation: SBST utilizes a meta-heuristic optimization search method, such as a genetic algorithm, to autonomously create test inputs [88]. This test-generation technique has been extensively employed in academic studies pertaining to conventional software testing paradigms. In addition to its application in producing test inputs for evaluating functional qualities such as program correctness, SBST has also been employed to investigate conflicts related to algorithmic fairness during requirement analysis [116]. The application of SBST has demonstrated successful outcomes in the testing of autonomous driving systems, as evidenced by multiple studies [98,117,118,119]. There are several research opportunities available in the application of SBST for generating test inputs in the context of testing other ML systems. This is due to the evident compatibility between SBST and ML, as SBST is capable of dynamically searching for test inputs throughout extensive input spaces. Current methodologies for creating test inputs mostly concentrate on producing adversarial inputs to evaluate the resilience of an ML system. However, adversarial examples have faced criticism because of their lack of representation of genuine input data. Therefore, an interesting and challenging area of research is the development of methods for generating authentic test inputs and the automatic evaluation of their naturalness.
• Bias in representation across machine learning paradigms: In the realm of machine learning research and practice, there exists a noticeable disparity in the attention and resources allocated to different learning methodologies. Supervised learning, wherein models learn from labeled data, often garners more prominence compared to unsupervised learning, which relies on unlabeled data. This discrepancy may arise from historical precedence, ease of implementation, or perceived efficacy in specific contexts. Additionally, reinforcement learning, celebrated for its trial-and-error-based learning with rewards, may receive undue focus due to its association with notable successes like AlphaGo. Furthermore, hybrid learning methods, amalgamating diverse approaches, encounter biases based on perceived novelty or utility. These biases have far-reaching implications, impacting research agendas, funding distribution, and educational priorities within the machine learning domain. Consequently, there emerges a risk of overlooking crucial insights and hindering comprehensive progress across learning paradigms. Addressing these biases is paramount to fostering a more inclusive and balanced landscape, ensuring equitable advancement in the field.

7. Threats to validity and limitations

The potential limitations that may affect the validity of this research study are carefully examined through a series of steps, which are outlined as follows:
First step. It involves examining the potential risks that may compromise the accurate identification of comprehensive primary papers. The primary determinant of the research strategy is the acquisition of a comprehensive body of literature that is free from bias. In order to achieve this objective, the common strings in the search term were searched and combined. Additionally, a review protocol was formulated with the aim of identifying pertinent and impartial studies.
Second step. It involves identifying and analyzing potential threats to the process of selection and data extraction. In the context of systematic reviews, it is customary for each study to undergo a process of quality assessment [27].

• The presence of bias in the outcome is undesirable, as the objective is to obtain accurate results.
• Internal validity refers to the extent to which a study is conducted without any systematic errors.
• External validity refers to the extent to which the findings of a study can be generalized and applied to real-world situations beyond the specific context in which the study was conducted.

This investigation employed two distinct types of quality assessments. The first type involved evaluating the quality of the papers based on their ability to address the research questions. The second type was carried out specifically to address one of our main research questions.
Third step. We examined the potential risks and vulnerabilities that may compromise the integrity and security of the synthesized data, as well as the subsequent outcomes and findings. The challenge of reviews is further compounded by the presence of a reliability threat [120]. The present matter was derived from a comprehensive characterization model involving the collaboration of multiple researchers as well as the implementation of various methodological and procedural steps, which were subsequently piloted and subjected to external evaluation. In conducting our SLR, we adhered to the guidelines outlined in [27,28]. However, it is important to note that certain deviations from their prescribed methods were encountered, as detailed in Section 3. By employing a rigorous systematic review methodology, conducting external evaluations, and engaging multiple researchers, it is possible to assert that the review possesses a high level of validity. This review sought to offer a thorough and methodical examination. Nevertheless, it is important to acknowledge certain limitations of this study that should be taken into account in future research. Some of the limitations of this study are as follows:
• This research exclusively considers reputable journal and conference papers that possess the highest qualifications. Consequently, books, book chapters, non-English scripts, non-JCR papers, commentaries and review papers, and short papers have been excluded.
• In this SLR, nine well-known online databases were utilized to identify relevant, reputable papers. However, it is important to note that the authors cannot assert that all papers on the subject of ML methods in ST have been selected.
• Section 3.1 of this SLR presents six questions that were posed to explore this topic further; however, it is possible to propose additional questions.
• The reviewed papers in this research were classified into four categories: supervised learning methods, unsupervised learning methods, reinforcement learning methods, and hybrid learning methods. However, there may be other possible categories.

8. Conclusion and future work

This study tried to fill a gap in the existing literature by providing a comprehensive review of ML methods in ST, an area that has not received thorough attention in previous reviews. Recognizing the significance of ML methods in ST, our study aimed to thoroughly analyze the methods' obstacles, potential directions, advantages, and disadvantages. By conducting a systematic literature review, we categorized and compared various ML methods in ST, aiming to establish a clear classification system. Key questions addressed include the types of ML methods utilized, evaluation metrics, common tools and environments, available datasets, case studies, and methods for assessing accuracy and efficiency. We analyzed 40 papers sourced from 284 journals and conferences spanning 2018 to March 2024. Notable developments include fluctuating publication trends, with 2018 displaying the highest volume of papers and 2024 showing the lowest. Dominant publishing sources were identified, with IEEE, Springer, and Elsevier contributing significantly to the total publications. Papers were categorized into four primary methods: supervised learning, unsupervised learning, reinforcement learning, and hybrid learning methods, with supervised learning and hybrid learning methods collectively representing 74% of the published papers. Python emerged as the predominant tool and platform in 39% of the reviewed papers. Evaluation methods predominantly involved simulation environments, with various quality assessment metrics employed, such as recall, precision, accuracy, time, performance, and MSE. While most papers focused on prominent metrics like accuracy and time, a notable proportion made efforts to enhance testing efficiency through ML methods in ST. Identified challenges include the need for empirical research, scalability, test coverage, test input generation, failure management, interoperability, test oracle, accuracy, trust, security, and access to high-quality training datasets, highlighting avenues for future work.

Participation of authors

The first and second authors have equal contributions to this work.

CRediT authorship contribution statement

Sedighe Ajorloo: Investigation, Methodology, Resources, Visualization, Writing – original draft. Amirhossein Jamarani: Investigation, Methodology, Resources, Visualization, Writing – original draft. Mehdi Kashfi: Investigation, Writing – review & editing. Mostafa Haghi Kashani: Conceptualization, Methodology, Supervision, Validation, Writing – review & editing. Abbas Najafizadeh: Validation, Visualization.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

References

[1] H. Zuse, Software Complexity: Measures and Methods, Walter de Gruyter GmbH & Co KG, 2019, pp. 1–5.
[2] A. Fuggetta, Software process: a roadmap, Proc. Conf. Future Softw. Eng. (2000) 25–34.
[3] B.W. Kernighan, P.J. Plauger, Software tools, ACM SIGSOFT Softw. Eng. Notes vol. 1 (1) (1976) 15–20.
[4] M. Nikravan, et al., An intelligent energy efficient QoS-routing scheme for WSN, Int. J. Adv. Eng. Sci. Technol. vol. 8 (1) (2011) 121–124.
[5] A. Mishra, Review: software quality assurance—from theory to implementation, Comput. J. vol. 47 (6) (2004) 728–7218, https://doi.org/10.1093/comjnl/47.6.728.
[6] B. Arasteh, S.M.J. Hosseini, Traxtor: an automatic software test suit generation method inspired by imperialist competitive optimization algorithms, J. Electron. Test. vol. 38 (2) (2022) 205–215.
[7] B. Arasteh, P. Gunes, A. Bouyer, F. Soleimanian Gharehchopogh, H. Alipour Banaei, R. Ghanbarzadeh, A modified horse herd optimization algorithm and its application in the program source code clustering, Complexity vol. 2023 (2023) 3988288.
[8] B. Arasteh, R. Ghanbarzadeh, F.S. Gharehchopogh, A. Hosseinalipour, Generating the structural graph-based model from a program source-code using chaotic forrest optimization algorithm, Expert Syst. vol. 40 (6) (2023) e13228.
[9] B. Broekman, E. Notenboom, Testing Embedded Software, Pearson Education, 2003, pp. 217–228.
[10] B. Arasteh, F.S. Gharehchopogh, P. Gunes, F. Kiani, M. Torkamanian-Afshar, A novel metaheuristic based method for software mutation test using the discretized and modified forrest optimization algorithm, J. Electron. Test. vol. 39 (3) (2023) 347–370.
[11] H. Reza, K. Ogaard, A. Malge, A model based testing technique to test web applications using statecharts, in: Fifth International Conference on Information Technology: New Generations (ITNG 2008), IEEE, 2008, pp. 183–188.
[12] A.M. Nascimento, L.F. Vismari, P.S. Cugnasca, J.B.C. Júnior, J.R. de Almeira Júnior, A cost-sensitive approach to enhance the use of ML classifiers in software testing efforts, in: 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2019, pp. 1806–1813.
[13] Z. Asghari, B. Arasteh, A. Koochari, Effective software mutation-test using program instructions classification, J. Electron. Test. vol. 39 (5) (2023) 631–657.
[14] P. Bourque, J.-M. Lavoie, A. Lee, S. Trudel, T.C. Lethbridge, Guide to the software engineering body of knowledge (SWEBOK) and the software engineering education knowledge (SEEK): a preliminary mapping, in: Proc. 10th Int. Workshop Softw. Technol. Eng. Pract., IEEE Computer Society, 2002, pp. 8–9.
[15] B. Arasteh, P. Imanzadeh, K. Arasteh, F.S. Gharehchopogh, B. Zarei, A source-code aware method for software mutation testing using artificial bee colony algorithm, J. Electron. Test. vol. 38 (3) (2022) 289–302.
[16] M. Newman, "Software errors cost US economy $59.5 billion annually," NIST Assesses Technical Needs of Industry to Improve Software-Testing, 2002.
[17] G.J. Myers, T. Badgett, T.M. Thomas, C. Sandler, The Art of Software Testing, vol. 2, Wiley Online Library, 2004, pp. 123–156.
[18] S. Bazzaz Abkenar, E. Mahdipour, S.M. Jameii, M. Haghi Kashani, A hybrid classification method for Twitter spam detection based on differential evolution and random forest, Concurr. Comput.: Pract. Exp. vol. 33 (21) (2021) e6381.
[19] G. Liang, W. Fan, H. Luo, X. Zhu, The emerging roles of artificial intelligence in cancer drug development and precision therapy, Biomed. Pharmacother. vol. 128 (2020) 110255.
[20] J. Liu, F. Weng, Z. Li, Satellite-based PM2.5 estimation directly from reflectance at the top of the atmosphere using a machine learning algorithm, Atmos. Environ. vol. 208 (2019) 113–122.
[21] B.K. Mohanta, D. Jena, U. Satapathy, S. Patnaik, Survey on IoT security: challenges and solution using machine learning, artificial intelligence and blockchain technology, Internet Things vol. 11 (2020) 100227.
[22] D. Truong, W. Choi, Using machine learning algorithms to predict the risk of small unmanned aircraft system violations in the national airspace system, J. Air Transp. Manag. vol. 86 (2020) 101822.
[23] V.H. Durelli, et al., Machine learning applied to software testing: a systematic mapping study, IEEE Trans. Reliab. vol. 68 (3) (2019) 1189–1212.
[24] M. Noorian, E. Bagheri, W. Du, Machine learning-based software testing: towards a classification framework, SEKE (2011) 225–229.
[25] N. Jha, R. Popli, S. Chakraborty, P. Kumar, Software test automation using Selenium and machine learning, in: Proceedings of First International Conference on Computational Electronics for Wireless Communications, Springer Nature Singapore, 2022, pp. 419–429.