Professional Documents
Culture Documents
A Machine-Learning-Driven Evolutionary Approach For Testing Web Application Firewalls
A Machine-Learning-Driven Evolutionary Approach For Testing Web Application Firewalls
Abstract—Web application firewalls (WAFs) are an essential In this paper, we focus our testing efforts on a common cat-
protection mechanism for online software systems. Because of the egory of attacks, namely, SQL injections (SQLi). SQLi has
relentless flow of new kinds of attacks as well as their increased received a lot of attention from academia as well as practition-
sophistication, WAFs have to be updated and tested regularly to
prevent attackers from easily circumventing them. In this paper, ers [7], [10], [14], [24]–[27], [32], [34], [45]. Yet the Open Web
we focus on testing WAFs for SQL injection attacks, but the gen- Application Security Project (OWASP) finds that the prevalence
eral principles and strategy we propose can be adapted to other of SQLi vulnerabilities is common and the impact of a success-
contexts. We present ML-Driven, an approach based on machine ful exploitation is severe [50]. While we assess our approach
learning and an evolutionary algorithm to automatically detect based on the example of SQLi, we believe that many of the
holes in WAFs that let SQL injection attacks bypass them. Ini-
tially, ML-Driven automatically generates a diverse set of attacks principles of our methodology can be adapted to other forms of
and submits them to the system being protected by the target WAF. attacks.
Then, ML-Driven selects attacks that exhibit patterns (substrings) Various techniques have been proposed in the literature to
associated with bypassing the WAF and evolves them to generate detect SQLi attacks based on a variety of approaches, includ-
new successful bypassing attacks. Machine learning is used to in- ing white-box testing [21], static analysis [20], model-based
crementally learn attack patterns from previously generated at-
tacks according to their testing results, i.e., if they are blocked testing [30], and black-box testing [18]. However, such tech-
or bypass the WAF. We implemented ML-Driven in a tool and niques present some limitations that may adversely impact their
evaluated it on ModSecurity, a widely used open-source WAF, and practical applicability as well as their vulnerability detection
a proprietary WAF protecting a financial institution. Our empir- capability. For example, white-box testing techniques and static
ical results indicate that ML-Driven is effective and efficient at analysis tools require access to source code [33], which might
generating SQL injection attacks bypassing WAFs and identifying
attack patterns. not be possible when dealing with third-party components or
industrial appliances, and are linked to specific programming
Index Terms—Software security testing, SQL injection (SQLi), languages [19]. Model-based testing techniques require mod-
web application firewall (WAF).
els expressing the security policies or the implementation of
WAFs and the web application under test [30], which are often
I. INTRODUCTION not available or very difficult to manually construct. Black-box
testing strategies do not require models or the source code but
EB application firewalls (WAFs) protect enterprise web
W systems from malicious attacks. As a facade to the web
application they protect, WAFs inspect incoming HTTP mes-
they are less effective in detecting SQLi vulnerabilities. Indeed,
comprehensive reviews on black-box techniques [13], [18] have
revealed that many types of security vulnerabilities (including
sages and decide whether blocking or forwarding them to the
SQLi attacks) remain largely undetected and, thus, warrant fur-
target web application. The decision is often performed based on
ther research.
a set of rules, which are designed to detect attack patterns. Since
In our preliminary work [9], we introduced a novel
cyber-attacks are increasingly sophisticated, WAF rules tend to
black-box technique, namely, ML-Driven, that combines the
become complex and difficult to manually maintain and test.
classical (μ + λ) evolutionary algorithm (EAs) with machine
Therefore, automated testing techniques for WAFs are crucial
learning algorithms for generating tests (i.e., attacks) bypassing
to prevent malicious requests from reaching web applications
a WAF’s validation routines. ML-Driven uses machine learn-
and services.
ing to incrementally learn attack patterns and build a classifier,
Manuscript received August 1, 2017; revised January 12, 2018; accepted i.e., that predicts combinations of attack substrings (“slices”) as-
January 27, 2018. Date of publication June 25, 2018; date of current version sociated with bypassing the WAF. The resulting classifier is used
August 30, 2018. This work was supported by the European Research Council
(ERC) under the European Union’s Horizon 2020 Research and Innovation within the main loop of (μ + λ)-EAs to rank tests depending on
Programme under Grant 694277. Associate Editor: T. H. Tse. (Corresponding which substrings compose them and the corresponding bypass-
author: Annibale Panichella.) ing probabilities. In each iteration, tests with the highest rank are
The authors are with the SnT Centre, University of Luxembourg, Luxembourg
City L-2721, Luxembourg (e-mail: dennis.appelt@googlemail.com; cdnguyen. selected and mutated to generate λ new tests (offsprings), which
dr@gmail.com; annibale.panichella@uni.lu; briand@svv.lu). are then executed against the WAF. The corresponding execu-
Color versions of one or more of the figures in this paper are available online tion results are used to retrain the classifier to incrementally
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TR.2018.2805763 improve its accuracy. Through subsequent generations, tests are
0018-9529 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications standards/publications/rights/index.html for more information.
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
734 IEEE TRANSACTIONS ON RELIABILITY, VOL. 67, NO. 3, SEPTEMBER 2018
evolved to increase the number of attacks able to bypass the replace its predecessors, leaving the practitioners with a
target WAF. single, yet the best, option.
We defined two variants of ML-Driven, namely, ML- 2) Assessing the influence of the selected machine learning
Driven D and ML-Driven B, that differ in the number of algorithm on the test results by comparing two alterna-
tests being selected for generating new tests (offsprings). The tive classification models, namely, RandomTree and Ran-
former variant selects fewer tests to evolve but generates more domForest, which are both adapted to large numbers of
offsprings per selected test, thus increasing exploitation (deep features and datasets, but with complementary advantages
search). The latter variant selects more tests to mutate, which and drawbacks.
results in a lower number of offsprings per selected test, hence 3) Extending the previous evaluation and conducting a large-
increasing exploration (broad search). Our preliminary study scale experiment on a proprietary WAF protecting a finan-
with ModSecurity,1 a popular open source WAF, has shown that cial institution.
both ML-Driven D and ML-Driven B are effective in gen- 4) Comparing ML-Driven with SqlMap and WAF
erating a large number of distinct SQLi attacks passing thought Testing Framework, which are state-of-the-art vul-
the WAF. However, ML-Driven D performs better in the ear- nerability detection tools.
lier stages of the search, while ML-Driven B outperforms it 5) A qualitative analysis showing that the additional distinct
in the last part of the search, for reasons we will explain. attacks found by ML-Driven E help security analysts
In the race against cyber-attacks, time is vital. Being able design better patches compared to the other ML-Driven
to learn and anticipate attacks that successfully bypass WAFs variants, RAN, and state-of-the-art vulnerability detection
in a timely manner is critical. With more successful, distinct tools.
attacks being detected, the WAF administrator is in a better po- The remainder of this paper is structured as follows.
sition to identify missed attack patterns and to devise patches Section II-A provides background information on WAFs as well
that block all further attacks sharing the same patterns. There- as SQLi attacks and discusses related work. Section III details
fore, our goal is to devise a technique that can efficiently gen- our approach followed by Section IV where we describe the
erate as many successful distinct attacks as possible. To this design and procedure of our empirical evaluation. Section V
aim, in this paper, we extended our prior work and propose an describes the evaluation results and their implications regarding
adaptive variant of ML-Driven, namely, ML-Driven E (en- our research questions, while Section VI explains and illustrates
hanced), that combines the strengths of ML-Driven D (deep how to use the generated attacks for repairing vulnerable WAFs.
search), and ML-Driven B (broad search) in an adaptive man- Further reflections on the results are provided in Section VII
ner. In ML-Driven E, the number of offsprings is computed while Section VIII concludes this paper.
dynamically depending on the bypassing probability assigned
to each selected test by the constructed classifier. Conversely, II. BACKGROUND AND RELATED WORK
ML-Driven B and D generate a fixed number of offsprings
In this section, we provide background notions about SQLi
per attack/test. Therefore, ML-Driven E is a more flexible
vulnerabilities and describe existing white-box and black-box
approach that better balances eploration and exploitation over
approaches aimed at uncovering them.
run time.
Moreover, we conducted a much larger empirical study with
two popular WAFs that protect three open-source and 44 propri- A. SQLi Vulnerabilities
etary web services. The results show that the enhanced version In systems that use databases, such as web-based systems,
of our technique (ML-Driven E) significantly outperforms the SQL statements accessing back-end databases are usually
the following: treated as strings. These strings are formed by concatenating
1) its predecessors ML-Driven B and ML-Driven D; different string fragments based on user choices or the appli-
2) a random test strategy, which serves as a baseline; cation’s control flow. Once an SQL statement is formed, it is
3) two state-of-the-art vulnerability detection tools, namely, submitted to the database server to be executed. For example,
SqlMap and WAF Testing Framework. an SQL statement can be formed as follows (a simplified exam-
We also performed a qualitative analysis of the attacks gen- ple from one of our web services in the case study):
erated by our approach and we found out that they enable the
$sql = "select ∗ from hotelList where
identification of attack patterns that are strongly associated with
country =’";
bypassing the WAFs, thus providing better support for improv-
$sql = $sql.$country;
ing the WAFs’ rule set.
$sql = $sql."’";
To summarize, the key contributions of this paper include the
$result = mysql_query($sql) or
following.
die(mysql_error());
1) Enhancing ML-Driven with an adaptive test selection
and generation heuristic, which more effectively explores The variable $country is an input provided by the user,
attack patterns with higher likelihood of bypassing the which is concatenated with the rest of the SQL statement and
WAF. This enhanced ML-Driven variant is intended to then stored in the string variable $sql. The string is then passed
to the function mysql_query that sends the SQL statement
1 [Online]. Available: https://www.modsecurity.org to the database server to be executed.
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
APPELT et al.: MACHINE-LEARNING-DRIVEN EVOLUTIONARY APPROACH FOR TESTING WEB APPLICATION FIREWALLS 735
SQLi is an attack technique in which attackers inject ma- a WAF and a web application for broken access control vul-
licious SQL code fragments into input parameters that lack nerabilities, e.g., forceful browsing, by explicitly specifying the
proper validation or sanitization. An attacker might construct interactions of application components on the source code level
input values in a way that changes the behavior of the resulting and by applying static and dynamic verification to enforce only
SQL statement and performs arbitrary actions on the database legal state transitions. In contrast, we propose a black-box tech-
(e.g., exposure of sensitive data, insertion or alteration of data nique that does not require access to source code or client-side
without authorization, loss of data, or even taking control of the input checks to generate test cases. In our approach, we use ma-
database server). chine learning to identify the patterns recognized by the firewall
In the previous example, if the input $country received as SQLi attacks and generate bypassing test cases that avoid
the attack payload ’ or 1 = 1 --, then the resulting SQL those patterns.
statement is Tripp et al. [48] proposed XSS Analyzer, a learning approach
to web security testing. The authors tackle the problem of effi-
select ∗ from hotelList
ciently selecting from a comprehensive input space of attacks
where country=’’ or 1 = 1 --’.
by learning from previous attack executions. Based on the per-
The clause or 1=1 is a tautology, i.e., the condition will formed learning, the selection of new attacks is adjusted to
always be true, and is thus able to bypassing the original con- select attacks with a high probability of revealing a vulnerabil-
dition in the where clause, making the SQL query return all ity. More specifically, XSS Analyzer generates attacks from a
rows in the table. grammar and learns constraints that express which literals an
Web Application Firewalls: Web applications with high se- attack cannot contain in order to evade detection. The authors
curity requirements are commonly protected by WAFs. In the find that XSS Analyzer outperforms a comparable state-of-the-
overall system architecture, a WAF is placed in front of the web art algorithm, thus, suggesting that learning input constraints
application that has to be protected. Every request that is sent to (a concept similar to the path conditions in ML-Driven) is
the web application is examined by the WAF before it reaches the effective to guide the test case generation.
web application. The WAF hand over the request to the web ap- In contrast to our work, XSS Analyzer applies learning to
plication only if the request complies with the firewall’s rule set. individual literals only while our approach also learns if a com-
A common approach to define the firewall’s rule set is using a bination of literals is likely to bypass or be blocked. Therefore,
blacklist. A blacklist contains string patterns, typically defined XSS Analyzer cannot capture more complex input constraints
as regular expressions. Requests recognized by these patterns involving multiple literals simultaneously, e.g., an attack should
are likely to be malicious attacks (e.g., SQLi) and, therefore, contain literal a, but not b and c in order to evade detection.
are blocked. For example, the following regular expression de- Furthermore, to analyze which literals in an attack are blocked,
scribes the syntax for SQL comments, e.g., /∗∗ / or #, which XSS Analyzer first splits each attack into its composing tokens;
are frequently used in SQLi attacks: then, it resends each token to the target web application. Since
multiple attacks can share the same tokens, XSS Analyzer sends
/\∗!?|\∗/|[’;]--|--[\s\r\n\v\f]
each token only once to avoid performing the same analysis
|(?:--[ˆ-]∗?-)
multiple times.
|([ˆ\-&])#.∗?[\s\r\n\v\f]|;?\\x00.
This procedure consumes a large quantity of HTTP requests.
There are several reasons why a WAF may provide insufficient In contrast, ML-Driven does not require to resubmit individ-
protection, including implementation bugs or misconfiguration. ual slices, but learns path conditions solely from the previously
One way to ensure the resilience of a WAF against attacks executed test case and, thus, spends its test budget more effi-
is to rely on an automated testing procedure that thoroughly ciently. In addition, there are differences between XSS Analyser
and efficiently detects vulnerabilities. This paper addresses this and ML-Driven in terms of objectives: The former addresses
challenge for SQLi, one of the main types of attacks in practice. cross-site scripting sanitization in web applications, while the
latter addresses the detection of SQLi attacks in WAFs.
In grammar-based testing, a strategy typically samples a large
B. Related Work input space defined by a grammar. Several grammar-based ap-
Previous research on ensuring the resilience of IT systems proaches exist in the literature for testing security properties
against malicious requests has focused on the testing of firewalls of an application under test [37], [47]. Godefroid et al. [23]
as well as input validation mechanisms. proposed white-box fuzzing, which starts by executing an ap-
Offutt et al. introduced the concept of bypass testing in which plication under test with a given well-formed input and uses
an application’s input validation is tested for robustness and se- symbolic execution to create input constraints when conditional
curity [38]. Tests are generated to intentionally violate the client- statements are encountered on the execution path. Then, new in-
side input checks and are then sent to the server application to puts are created that exercise different execution paths by negat-
test whether the input constraints are adequately evaluated. Liu ing the previously collected constraints. The implementation of
and Kuan Tan [35] proposed an automated approach to recover the approach found a critical memory corruption vulnerability
an input validation model from program source code and for- in a file-processing application. In a follow-up work, the au-
mulated two coverage criteria for testing input validation based thors propose grammar-based white-box fuzzing [21], which
on the model. Desmet et al. [17] verify a given combination of addresses the generation of highly structured program inputs.
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
736 IEEE TRANSACTIONS ON RELIABILITY, VOL. 67, NO. 3, SEPTEMBER 2018
In this paper, well-formed inputs are generated from a grammar they are injected in. We systematically surveyed known SQLi
and the constraints created during execution are expressed as attacks published in the literature, e.g., [5], [6], [24], and from
constraints on grammar tokens. To generate new inputs, a con- other sources, e.g., OWASP2 and SqlMap.3 Then, we defined
straint solver searches the grammar for inputs that satisfy the a context-free grammar for generating and analyzing SQLi
constraints. The authors implemented their work in a tool named attacks.
SAGE and found several security-related bugs [22]. In contrast We consider three main categories of SQLi attacks in our
to the mentioned work of Godefroid et al., our work does not grammar: 1) Boolean; 2) Union; and 3) Piggy-backed. These
require access to the source code of the application under test, type of attacks aim at manipulating the intended logic by inject-
which is in many practical scenarios not available, but proposes ing additional SQL code fragments in the original SQL queries.
a black-box approach based on machine learning to efficiently We briefly discuss each attack category and provide example
sample the input space defined by an attack grammar. attacks that can be derived using our grammar. For a detailed
The topic of testing network firewalls has also been addressed discussion, refer to the literature [10], [27].
by an abundant literature. Although network firewalls operate 1) Boolean Attacks: The intent of a Boolean SQLi attack is to
on a lower layer than application firewalls, which are our focus, influence the where clause within an SQL statement to always
they share some commonalities. Both use policies to decide evaluate either to true or false. As a result, a statement, into
which traffic is allowed to pass or should be rejected. Therefore, which a Boolean SQLi attack is injected, returns on its execution
testing approaches to find flaws in network firewall policies either all data records of the queried database tables (in case the
might also be applicable to WAF policies. Brucker et al. [1] where clause evaluates always to true) or none (in case the where
proposed a model-based testing approach which transforms a clause evaluates always to false). This attack method is typically
firewall policy into a normal form. Based on case studies, they used to bypass authentication mechanisms, extract data without
found that this policy transformation increases the efficiency of authorization, or to identify injectable parameters. The example
test case generation by at least two orders of magnitude. Hwang described in Section II-A is an instance of Boolean attacks.
et al. [29] defined structural coverage criteria of policies under 2) Union Attacks: The union keyword joins the result set
test and developed a test generation technique based on con- of multiple select statements and, hence, union SQLi attacks
straint solving that tries to maximize structural coverage. Other are typically used to extract data located in other database tables
research has focused on testing the firewalls implementation in- than the original statement is querying. For example, consider
stead of policies. Al-Shaer et al. [3] developed a framework to an application that retrieves a list of product names based on a
automatically test if a policy is correctly enforced by a firewall. search term. The SQL statement to retrieve the product names
Therefore, the framework generates a set of policies as well as might be
test traffic and checks whether the firewall handles the generated
SELECT name FROM products WHERE name
traffic correctly according to the generated policy. Some authors
LIKE "%search term%"
have proposed specification-based firewall testing. Jürjens and
Wimmel [30] proposed to formally model the tested firewall and where search term is a string provided by the user. If
to automatically derive test cases from the formal specification. the SQL statement is formed in an insecure way, an attacker
Senn et al. [43] proposed a formal language for specifying secu- could provide the search term phone%" UNION SELECT
rity policies and automatically generate test cases from formal passwd FROM users #%", which would result in the state-
policies to test the firewall. In contrast, in addition to targeting ment
application firewalls, our approach does not rely in any mod-
SELECT name FROM products WHERE name
els of security policies or the firewall under test, such formal
LIKE "%phone%" UNION SELECT passwd FROM
models are rarely available in practice.
users.
III. APPROACH Hence, in addition to a list of products containing the search
term phone, the attacker could obtain the passwords of all users
This section introduces an approach for testing WAFs.
with the modified query above.
Section III-A defines the input space for this testing problem
3) Piggy-Backed Attacks: In SQL, the semicolon (;) can be
as a context-free grammar. Section III-B presents a simple at-
used to separate two SQL statements. Piggy-backed attacks use
tack generation strategy that randomly samples the input space
the semicolon to append an additional statement to the orig-
and serves as baseline. Section III-C presents two test genera-
inal statement and can be used for a wide range of attack
tion strategies that make use of machine learning to guide test
purposes (e.g., data extracting or modification, and denial of ser-
generation toward areas in the input space that are more likely to
vice). An example of a piggy-backed attack is ; DROP TABLE
contain successful attacks. Finally, Section III-D details an ap-
users #. If this attack is injected into a vulnerable SQL state-
proach to combine such attack generation strategies to achieve
ment, it drops the table user and, thus, potentially breaks the
better results.
application.
A. Context-Free Grammar for SQLi Attacks
SQLi attacks (or test cases in this context) are small “pro- 2 [Online]. Available: https://www.owasp.org
grams” that aim at changing the intent of the target SQL queries 3 [Online]. Available: http://sqlmap.org
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
APPELT et al.: MACHINE-LEARNING-DRIVEN EVOLUTIONARY APPROACH FOR TESTING WEB APPLICATION FIREWALLS 737
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
738 IEEE TRANSACTIONS ON RELIABILITY, VOL. 67, NO. 3, SEPTEMBER 2018
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
APPELT et al.: MACHINE-LEARNING-DRIVEN EVOLUTIONARY APPROACH FOR TESTING WEB APPLICATION FIREWALLS 739
TABLE I
EXAMPLE OF TEST DECOMPOSITIONS AND THEIR ENCODING
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
740 IEEE TRANSACTIONS ON RELIABILITY, VOL. 67, NO. 3, SEPTEMBER 2018
The paths from the root node of the DT to its leaf nodes em-
Algorithm 2: ML-driven SQLi Attack Generation.
body the combinations of slices which are likely to be the reason
for tests to bypass or to be blocked, depending on the classifica- 1: procedure MLDRIVENGEN(initT ests, outputT ests)
tion. More generally, we define a concept of path condition as 2: execute(initT ests)
follows. 3: P ← initT ests
Definition 3: (Path Condition) A path condition represents 4: archive ← UPDATEARCHIVE(P )
a set of slices that the machine learning technique deems to be 5: // learn the initial classifier
relevant for the attack’s classification into blocked or bypassing. 6: trainData ← transf orm(archive)
7: DT ← learnClassif ier(trainData)
The path condition is represented as a conjunction ki (si =
val), in which val = 1 | 0, and k is the number of relevant 8: rankT ests(P, DT )
slices. 9: while not-done do
The procedure for computing path conditions depends on the 10: O ← OFFSPRINGSGEN(P , λ, MAXM )
machine learning algorithm that is used to build DTs. We have 11: execute(O)
selected two alternative algorithms and assessed their overall 12: archive ← UPDATEARCHIVE(O)
impact on test results. 13: // re-training the classifier
RandomTree: The most prominent difference to classical DT 14: trainData ← transf orm(archive)
algorithms (e.g., C4.5 [42]) is that RandomTree relies on ran- 15: DT ← learnClassif ier(trainData)
domization for building the DT [15]. When selecting an attribute 16: // new population
for a tree node, the algorithm chooses the best attribute amongst 17: P ← SELECT(P ∪ O)
a randomly selected subset of attributes. By choosing only sub- 18: end while
set of attributes, the algorithm scales well with the size of the 19: outputT ests ← f ilterBypassingT ests(archive)
training dataset. In our context, a scalable learner is important 20: return outputT ests
since the datasets contain a larger number of attributes and the 21: end procedure
DT is frequently rebuilt, as described in Section III-C4.
Given a DT and a slice vector V of a test t, we can obtain the
path condition for t by visiting the DT from the root and check class. Finally, the class with the highest prediction confidence
the presence (value = 1) or absence (value = 0) of the attributes is chosen by RandomForest as final classification.
encountered with respect to V . The procedure stops once a leaf To compute the path condition for a given attack with Ran-
is reached. For instance, Fig. 4 presents a DT learned from the domForest, for all trees that classify the attack as bypassing, a
example data discussed in Table I. For test t1 , the attribute s3 is path condition is computed separately. Thereby, the path con-
present in the test’s slice vector s1 , s2 , s3 , and thus, t1 follows dition for each tree is computed according to the previously
the left branch. Hence, the path condition is (s3 = 1). Similarly, described procedure. The overall path condition for the entire
for test t3 with the slice vector s4 , s5 , the attribute s3 is not RandomForest is, then, the conjunction of the path conditions
present, thus the right branch is followed leading to attribute s5 , computed from the trees.
which is present in the slice vector. Therefore, the resulting path 4) ML-Driven Evolutionary Testing Strategy: To increase
condition for t3 is (s3 = 0 ∧ s5 = 1). the likelihood of generating bypassing attacks, we propose an
RandomForest: Machine classification methods are well ML-driven evolutionary testing strategy whose pseudocode is
known to be unstable, i.e., a small change in the training data can detailed in Algorithm 2. It combines machine learning algo-
result in a different classification model [51]. Ensemble methods rithms (either RandomTree or RandomForest) with the classical
have been proposed to address the issue. In essence, multiple (μ + λ) EA, which is a population-based EA with a population
models are learned so that their collective predictions can mit- size μ and a restricted number of λ offsprings.
igate bias in individual models. In this paper, we implement As any EA, Algorithm 2 starts with an initial set of solutions
an ensemble of classifiers to guide the test generation, i.e., in- (initT ests) usually called population. The initial population
stead of using only one RandomTree, we extend our technique can either be generated by the random attack generator from
to make use of ensembles of trees produced by RandomFor- Section III-B or by selecting tests from existing test suites. The
est [15]. However, the benefits of RandomForest come at the initial tests are executed against a target WAF (line 2) if their
cost of an increased computational overhead, e.g., learning an execution results are not yet known (i.e., in case of randomly
ensemble of RandomTrees takes longer than learning only a sin- generated tests). Each solution is assigned a fitness score ac-
gle RandomTree. We evaluate in Section V whether the benefits cording to its probability of bypassing the WAF determined the
justify the increased overhead. classifier built in lines 5 and 6 of Algorithm 2. The classifier is
A RandomForest consists of multiple RandomTrees. To clas- built as follows: 1) the tests are transformed into the training set
sify an attack with RandomForest, each individual RandomTree (routine trainData in line 6); 2) a classifier DT is then trained
first classifies the attack and computes the prediction confidence from the data (line 7 of Algorithm 2). Then, individuals in the
(an estimated probability for the classification is to be correct). initial population are ranked using the DT classifier (routine
Then, all individual classifications are consolidated by comput- rankT ests in line 8), which assigns each individual (test) a
ing the average of the prediction confidence values for each probability of bypassing the WAF.
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
APPELT et al.: MACHINE-LEARNING-DRIVEN EVOLUTIONARY APPROACH FOR TESTING WEB APPLICATION FIREWALLS 741
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
742 IEEE TRANSACTIONS ON RELIABILITY, VOL. 67, NO. 3, SEPTEMBER 2018
Motivations: Let us now explain why we decided to opt for the neighborhood of highly probable bypassing tests. This leads
the classical (μ, λ)-EA, which is a mutation-based EA with no to generating more bypassing attacks in the earlier stages of
crossover operator. With this algorithm, MAXN offsprings are the search. Reversely, since MAXM is lower in ML-Driven
generated from one single test t via mutation only. Crossover, B, more tests are selected for mutation, resulting in selecting
which is another well-known operator widely applied in various not only the few tests with a high bypassing probability, but also
EAs (e.g., genetic algorithms), is not applied here. Usually, it tests with a low bypassing probability. As a result, ML-Driven
generates two offsprings from a pair of solutions (parents) by B generates fewer new bypassing tests.
randomly exchanging their genes. However, for our problem, a On the other hand, after some iterations, there are more tests
crossover operator cannot be defined since different solutions with a high bypassing probability being available for muta-
(tests) within the same population have different derivation trees tion. In such a scenario, selecting more tests and mutating each
that represent instances of incompatible derivation rules of our one less often helps preserving diversity: ML-Driven B will
grammar. For example, let us assume we select two test cases t1 explore the neighborhoods of multiple highly probable bypass-
and t2 for reproduction. The former instantiates the derivation ing tests. ML-Driven D explores the neighborhoods of fewer
rule start ::= numericCtx while the latter is an instance tests while many other tests with similar bypassing probability
of the rule start ::= sQuoteCtx. If we apply any crossover remain unexplored.
operator (e.g., the single-point crossover), then the resulting In this section, we propose an improved variant of our at-
two offsprings would violate our grammar since each of the tack generation strategy called ML-Driven E (Enhanced). Its
new test case would contain slices from different (incompatible) goal is to combine the strengths of ML-Driven B and ML-
derivation rules. Driven D by adaptively adjusting the parameter MAXM to
Exploration Versus Exploitation: Assuming that the total better balance exploration and exploitation. We propose ML-
number of offsprings (λ) to generate is constant, MAXM con- Driven E, which is a more flexible approach for assigning
trols how the EA explores the test space either broadly (ex- the test budget to individual tests. Instead of generating a fixed
ploration) or deeply (exploitation). When MAXM is small, the number of mutants per test, as done with ML-Driven B/D,
approach generates fewer offsprings per selected test, but selects the number of mutants is calculated dynamically. The rationale
more tests to mutate, thus exploring the test space in a broader of ML-Driven E is twofold. First, the available test budget
fashion (higher exploration). When MAXM is large, on the con- should be allocated only to tests with a high bypassing prob-
trary, the approach generates more offsprings per selected test, ability. Second, the test budget should be divided amongst all
but selects fewer tests to mutate, thus exploring the test space in tests in proportion of that probability, thus favoring those more
a deeper fashion (higher exploitation). In the evaluation section, likely to yield new attacks.
two variants of ML-Driven are distinguished: ML-Driven Given the total number of offsprings λ to generate, a set T of
B (broad, with MAXM = 10) and ML-Driven D (deep, with tests selected as parents, the probability P (t) of test t to bypass
MAXM = 100). Section III-D presents a thorough analysis on the WAF, then the number of offsprings mt to generate for test
how the parameter MAXM influences the overall test results. t is defined as
P (t)
mt = ∗ λ. (1)
D. Enhancing ML-Driven: An Adaptive Approach to Balance x∈T P (x)
Exploration and Exploitation On the right-hand side, the fraction represents the relative by-
Maintaining a good balance between exploration and ex- passing probability of t by dividing its bypassing probability
ploitation is extremely important for a successful application with the sum of all bypassing probabilities of tests x ∈ T . By
of EAs [49]. In our case, the balance is determined by the pa- multiplying the relative bypassing probability of test t with λ
rameter MAXM , which affects the number mutants (offsprings) (total number of offsprings), t is assigned a share of the total
generated from each selected test. For ML-Driven D, MAXM budget proportional to its relative bypassing probability. In other
is set to a higher value, which leads selecting fewer tests for mu- words, the number of offsprings/mutants to generate for each
tation (≈ λ/100), but generating more mutants per selected test test t is proportional to its bypassing probability in relation to
(higher exploitation). For ML-Driven B, MAXM is set to the probabilities of all selected parents.
a lower value, which leads selecting more tests for mutation ML-Driven E shares the same pseudocode with ML-
(≈ λ/10), but generating fewer mutants per selected test (higher Driven D and ML-Driven B with the exception of
exploration). Notice that for both variants, the total test budget the routine OFFSPRINGSGEN used to generate offsprings in
is the same, but the allocation of the test budget differs. Algorithm 2. Instead of using Algorithm 3, ML-Driven E
A detailed analysis presented in Section V-A shows that no uses the new adaptive routine detailed in Algorithm 4, which
variant is superior over the other [8]: ML-Driven D performs includes the proposed modification to the budget calculation.
better at the beginning of the search but ML-Driven B out- Similarly to Algorithm 3, Algorithm 4 generates λ offsprings
performs it in later stages. The reason for this phenomenon is within the loop in lines 3–18. The differences concern 1) the
because of the number of bypassing tests selected and their in- number test cases selected as parents and 2) the number of
fluence on the overall performance. At the beginning, there are mutants/offsprings generated from each individual parent. In
only a few available tests with a high bypassing probability. line 4, the routine selectAttacksF orM utation selects a set of
Due to the higher value of MAXM , ML-Driven D is likely attacks T that have the highest bypassing probability from all
to select only these tests for mutation and, thus, exploits better available attacks in the current population. All attacks above a
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
APPELT et al.: MACHINE-LEARNING-DRIVEN EVOLUTIONARY APPROACH FOR TESTING WEB APPLICATION FIREWALLS 743
A. Subject Applications
Algorithm 4: Adaptive Offsprings Generation.
1: procedure ADAPTIVEOFFSPRINGSGEN(population, λ, 1) Open-Source WAF: In this case study, the firewall under
DT ) test is ModSecurity, which implements the OWASP core rule set.
2: offsprings ← ∅ ModSecurity is an open-source WAF that can be deployed with
3: while |offsprings| < λ do the Apache HTTP Server to protect web applications hosted un-
4: T ← selectAttacksF orM utation(parents) der the server. Depending on the applications under protection,
5: for all t ∈ T do different firewall rule sets defined for different purposes can be
6: mt ← getM utationBudget(t, T, DT ) used. The OWASP core rules target various kinds of attacks,
7: V ← getSliceV ector(t) e.g., Trojan, Denial of Service, and SQLi, and are maintained
8: pathCondition ← getP ath(V, DT ) by an active community of security experts.
9: while mt > 0 do The web applications under protection are HotelRS, Cyclos,
10: s ← pickASliceF rom(V ) and SugarCRM. HotelRS is a service-oriented based system,
11: if satisf y(s, pathCondition) then providing web services for room reservation. It was developed
12: newT est ← mutate(t, s) and used in [16]. Cyclos is a popular open-source Java/Servlet
13: of f springs ← of f springs ∪ newT est Web Application for e-commerce and online payment.5 Sugar-
14: mt ← mt − 1 CRM is a popular customer relationship management system.6
15: end if SugarCRM and Cyclos have been widely used in practice. In
16: end while our experiment setting, the three applications are deployed on an
17: end for Apache HTTP Server under Linux. ModSecurity is embedded
18: end while within the web server; it protects the application’s web services
19: return offsprings from SQLi attacks. Specifically, since these web services re-
20: end procedure ceive SOAP messages7 from web clients, a malicious client can
seed an SQLi attack string into a SOAP message and submit
it to the web services in order to gain illegal access to data or
configurable threshold σ, e.g., in our experiment σ = 80%, functionality of the system.
are selected. The loop from line 5 to 17 is executed for In this paper, note that our testing target is the WAF that
each selected attack in T . Within the loop, in line 6, the protects the applications, not the applications themselves, as our
routine getM utationBudget calculates the number of off- focus is on testing firewalls. HotelRS, SugarCRM, and Cyclos
springs/mutants mt to generate from attack t with respect to play solely the role of a destination for SQLi tests that bypass
T and DT , the latter being the classifier. In other words, this the WAF.
method implements (1). Lines 9–16 describe the mutation pro- 2) Proprietary WAF: In the industrial setting, we evaluate
cedure for t, that is mostly unchanged from the original ver- our approach on a proprietary WAF that is used in a corporate
sion of the algorithm, except that the number of generated off- IT environment to protect back-end web services. These ser-
springs/mutants for t is set to mt . vices are the backbone of a financial corporation and process
Since the number of tests with a bypassing probability larger thousands of transactions daily. To provide protection from ma-
than 80% may vary across generations, both the number of licious requests, the WAF validates incoming requests in two
selected parents and the budget allocation vary over time. When steps. First, the values in a request are validated with respect
there are only few tests with a large bypassing probability (≥σ), to data types (e.g., string or numeric) and boundary constraints,
Algorithm 4 selects only those tests as parents. As a result, the e.g., a credit card number is expected to be a sequence of 16 to
number of offsprings mt generated from each parent t will be 19 digits. In a second step, each value is checked to make sure
large (higher exploitation). Instead, when the current population that it does not contain known malicious string patterns (i.e.,
contains many tests with a large bypassing probability (≥σ), the using an SQLi blacklist) commonly used in attacks. Only if the
total budget of μ offsprings is shared among a larger set of request passes both validation steps, the request is forwarded to
parents. As a consequence, fewer mutants will be generated the back-end services.
from each parent (higher exploration). Therefore, Algorithm 4 The computation time required to evaluate all testing strate-
adaptively balances exploration and exploitation depending on gies with the proprietary WAF is highly expensive because of
the number tests in the current population that have a bypassing the following:
probability greater than σ. 1) the slow responsive time of the web services;
2) the number of testing strategies to compare;
IV. EMPIRICAL STUDY 3) the number of repetitions to perform for each testing
strategy.
This section evaluates the proposed testing strategies in two Therefore, for the experiment, we used of a high-performance
separate case studies: a popular open-source WAF and a pro- cluster [2] when testing the proprietary WAF. Furthermore, we
prietary WAF that protects a financial institution. Section IV-A
introduces the case studies. Section IV-B formulates the research
questions. Section IV-C explains the procedure we followed to 5 http://project.cyclos.org
execute the experiments, while Section IV-E describes the ex- 6 http://sourceforge.net
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
744 IEEE TRANSACTIONS ON RELIABILITY, VOL. 67, NO. 3, SEPTEMBER 2018
TABLE II Since all testing strategies generate attacks from the same
AVERAGE RESPONSE TIME IN MILLISECONDS OF SOME WEB SERVICE
OPERATIONS IN OUR EXPERIMENTAL ENVIRONMENT COMPARED TO
input space, i.e., the grammar introduced in Section III-A, we
THE CASE STUDY’S ENVIRONMENT evaluate how efficient the different strategies are in sampling
the input space for bypassing attacks. Therefore, we measure
Op. 1 Op. 2 Op. 3 Op. 4 for each strategy how many distinct bypassing attacks are found
over time.
RTTH P C 11, 36 11, 39 11, 46 16, 61
RTTA C T 456, 56 180, 02 302, 09 854, 63
RQ1: How efficient are ML-Driven E, ML-Driven B, ML-
had to optimize a replica of the test environment to significantly Driven D, and RAN in finding bypassing tests?
decrease response times when invoking services.
In our optimized test environment, all configurations related To assess the impact of the machine learning algorithms on
to request filtering, that is whether a request is blocked or let the test result, we implemented two alternative classifiers for
through, are copied. Other configurations, e.g., logging or en- each of the ML-Driven strategies: RandomTree and Random-
cryption, are disabled. The web application under protection is Forest. As detailed in Section III-C3, both algorithms have com-
replaced by a simple mock-up application, which implements plementary advantages and drawbacks. In brief, RandomForest
the same interface as the original application under protec- is an ensemble classifier and, hence, is expected to be more
tion. The mock-up replays a set of recorded responses from robust against changes in the training set than RandomTree.
the original application and, thus, the WAF remains unaffected. However, RandomForest comes at a higher computational cost
Table II shows the message round trip time (RTT) per operation than RandomTree, since multiple models have to be learned. To
computed over a time span of 30 days with the actual envi- assess the influence of both algorithms on our approach, we run
ronment (RTTACT ) compared to the optimized environment on ML-Driven with both algorithms and compare the number of
the HPC (RTTHPC ). As we can see, the time has been reduced identified bypassing attacks and path conditions in a given time
significantly. Note that, even with these optimizations, the total budget.
computation time of our experiments is equivalent to 8 years,
337 days, and 12 h on a single CPU core. RQ2: Does the choice of machine learning algorithm
Even though an optimized test environment is required from matter?
an experimental standpoint, in practice, testing the firewall is just
a single test activity in an array of test activities (e.g., testing the We evaluate whether, in our context, the benefits of the Ran-
WAF, services, front-end). Given time and resource constraints, domForest compared to the RandomTree justify the increased
creating and maintaining test environments that are specific for computation overhead to learn the classifier. Therefore, we com-
each single test activity is very costly. Based on our experience, pare how many bypassing tests are found over time with these
we found that test engineers prefer test copies of the actual WAF two algorithms. In addition, we assess whether the algorithms
configuration and services. have an impact on the stability of the test result, i.e., we compare
Since we, by design, optimized our test environment to enable the variation among repetitions of the same test run.
large-scale experiments, the test execution times in our experi-
mental setting is not representative of test execution times in the RQ3: How does ML-Driven compare to similar techniques?
actual environment (see Table II). This is a problem as it biases
the results of our experiments to the advantage of approaches RQ3 compares our proposed technique with existing tech-
that are less expensive in terms of test case generation but lead niques for testing WAFs. We perform a quantitative comparison
to the execution of more test cases, such as random testing. of the techniques by comparing the number of bypassing attacks
Therefore, in our analyses, we transform the time scale of the found by each technique.
optimized environment into a realistic one, accounting for the
actual test execution times in practice. RQ4: Are we learning new, useful attack patterns as the
Assume T = {t1 , . . . , tn } is a set of timestamps measured number of distinct, bypassing attacks increases?
on the experimental environment, such that one timestamp is
noted after the execution of every test. Then, for the ith test case
RQ4 assesses whether, as we find more bypassing tests, we
execution, f (ti ) = ti + (RTTACT − RTTHPC ) ∗ i transforms a
also identify more attack patterns that can be useful to improve
timestamp ti ∈ T into the corresponding timestamp in the actual
the rule set of the WAF. In our context, an attack pattern is the
environment f (ti ). We will use the latter time scale to compare
underlying root cause that enables an attack to bypass the WAF.
test strategies in a realistic fashion.
For example, the attacks union ∗!50000∗select pwd
from user and union ∗!50000∗select 99 share the
B. Research Questions
pattern union ∗!50000∗select (root cause) while the re-
This paper investigates several variants of a machine- mainder of the attacks differ. We want to investigate whether
learning-driven testing strategy and a random testing strategy. identifying successful attack patterns help understanding why
We compare and evaluate all these strategies for their capability attacks are bypassing and eventually fix the WAF to cor-
of finding bypassing attacks on both subject applications, i.e., rectly detect further attacks. For the ML-Driven techniques,
ModSecurity and the proprietary WAF. such a pattern is characterized by a path condition (see
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
APPELT et al.: MACHINE-LEARNING-DRIVEN EVOLUTIONARY APPROACH FOR TESTING WEB APPLICATION FIREWALLS 745
Definition 3). A path condition characterizes the slices, or com- problem and on the size of the search space [31]. In our case,
bination of slices, that are likely causing an attack to bypass. the population size determines the number of tests that will
Thus, a path condition represents a pattern that is not correctly be preserved for the next generation. A small population size
detected by the WAF. may lead to loss of test cases with large bypassing probability,
To answer RQ4, we measure how many path conditions can thus reducing the chances of generating new attacks able to
be extracted from a model that is learned by the ML-Driven pass through the WAF. From our preliminary experiments, we
techniques. More specifically, we analyze the growth in the observed that a population size of 500 individuals works best
number of path conditions as the number of successful, distinct for ML-Driven since it helps preserving as many as possible
attacks grows over time. probable bypassing attacks.
2) Number of Offsprings (λ): In ML-Driven, newly gen-
C. Procedure erated offsprings (test cases) are used to update the archive,
We implemented the techniques proposed in this paper which is later used to retrain the classifier DT by applying ei-
into our SQLi testing tool called Xavier. Xavier supports the ther RandomTree or RandomForest. Therefore, offsprings are
automated testing of web services for SQLi vulnerabilities and the additional “data points” that are used for updating DT . A
has been described in [10]. ML-Driven generates test cases large λ value means that in each generation a significant por-
in the form of SQLi attack strings, such as ‘ OR"1"="1"#. tion of the training set consists of new test cases; a small λ
To generate malicious requests, Xavier takes such attack strings value leads to retraining DT with almost the same training set
and injects them into sample SOAP messages, which are sub- (i.e., updating DT would be less effective). In this paper, we
sequently sent to an application under test. Xavier, therefore, set λ = 4000 to guarantee that DT is always retrained with a
relies on sample SOAP messages as inputs. They can be taken large portion new test cases. While the total number offsprings
from existing web service test suites, or can easily be generated is fixed, the number of offsprings generated from each test t dif-
from the WSDL8 that describes the service interface under test. fers for the three variants of ML-Driven. For ML-Driven B
In our experiments, when available, we use SOAP messages and ML-Driven D, it is fixed and it is equal to MAXM = 10
from the functional test suite of the service under test (Cyclos and MAXM = 100, respectively. In ML-Driven E, the num-
and our industrial case study) or, otherwise, we manually cre- ber of offsprings for each test is determined dynamically using
ate SOAP messages from the WSDLs (HotelRS, SugarCRM). the adaptive strategy described in Section III-D.
Each of these messages consists of a number of parameters and 3) Archive Size: The archive keeps track of all attacks gen-
their legitimate values, which the services expect and that the erated across the generations and it is used at the end of each
WAF has to let through. In our testing process, each SOAP generation as training set for machine learning. To avoid exceed-
message is considered separately. A test generation technique, ing the resources of a typical laptop and to finish in a reasonable
ML-Driven or RAN, continuously generates attacks, inject- time, we limit the training set to 6000 blocked attacks and 6000
ing one attack each time into a parameter of the selected SOAP bypassing attacks. If the number of tests in the archive is larger,
message to create a new SOAP message, and then sends it to we consider only the most recent tests, i.e., those produced in
the web server. latest generations.
Incoming SOAP requests to the web server are first treated 4) Stopping Condition: The search terminates when the
by the WAF and only those that comply with firewall rules maximum search budget of 175 min is reached for ModSe-
are forwarded to web applications, and otherwise are blocked. cure. For the proprietary WAF, we used a larger search budget
In case a request is blocked, the WAF replies to the client that of 500 min. To allow a fair comparison, RAN was configured
issued the request with a special response stating that the request with the same stopping condition.
has been denied. When our testing tool, Xavier, receives such 5) Machine Learning Setting: For the machine learning al-
a response, it marks the test, embedded in the original request, gorithms used in this paper, i.e., RandomTree and RandomFor-
with a blocked label “B” (for blocked), and otherwise a passed est, we used their implementations available in Weka [28] with
label “P” (bypassing). the default parameter values.
6) Number of Repetition: To account for the randomization
D. Parameter Setting involved in the testing techniques (e.g., RAN), each algorithm
was run ten times for each subject application.
For the ML-Driven approaches, there are various param-
eters to set for both the machine learning algorithms and
(μ + λ)-EA. For most of the parameters (e.g., machine learn- E. Variables
ing training setting), we follow the recommendations from the The following dependent variables are controlled or measured
literature [15], [28], [41]. Moreover, we used a trial-and-error in our experiments.
procedure to calibrate the parameters μ and λ. The final param- 1) Dt : The number of distinct tests that can bypass a target
eter values are as follows: WAF at time t is a way to measure the efficiency of a test
1) Population Size (μ): Typically, the population size for strategy. Note that, given two distinct tests in the same
EAs can vary from tens to several thousands of individuals [41]. attack category, one might be caught by the WAF while
The best setting for this parameter depends on the nature of the the other bypasses it. This may be caused by tests in a
category that should be handled by different rules, some
8 [Online]. Available: http://www.w3.org/TR/wsdl of them missing or incorrect in the current WAF rule set,
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
746 IEEE TRANSACTIONS ON RELIABILITY, VOL. 67, NO. 3, SEPTEMBER 2018
Fig. 5. Average number of bypassing tests (D t ) found for nine tested opera-
Fig. 6. Statistical variation for the number of bypassing tests found (D t )
tions (ten repetitions each) for ModSecurity.
across all tested parameters for ModSecurity.
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
APPELT et al.: MACHINE-LEARNING-DRIVEN EVOLUTIONARY APPROACH FOR TESTING WEB APPLICATION FIREWALLS 747
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
748 IEEE TRANSACTIONS ON RELIABILITY, VOL. 67, NO. 3, SEPTEMBER 2018
TABLE V
RESULTS OF THE WILCOXON TEST FOR THE PROPRIETARY WAF
For each testing strategy (e.g., ML-E), we report whether it statistically outperforms its counterparts (e.g., RAN) as well the
corresponding p-values.
separate the parameters into groups to better analyze the results. Table IV shows the groups: Group 1 has two parameters that
Parameters in the same group share similar input constraints in share an input constraint that restricts the number of characters
terms of number of allowed characters and tend to produce a to a maximum of eight. Similarly, Groups 2, 3, and 4 have
similar number of bypassing test cases. It is worth noticing that similar constraints with 16, 25, and 35 characters, respectively.
this particular WAF considers not only such input constraints but For each group, Table V reports the results of the Wilcoxon test
also other criteria (e.g., SQLi blacklist). Therefore, parameters when comparing the number of bypassing attacks generated by
in a same group do not necessarily produce the same results. the different testing strategies.
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
APPELT et al.: MACHINE-LEARNING-DRIVEN EVOLUTIONARY APPROACH FOR TESTING WEB APPLICATION FIREWALLS 749
Fig. 9. Group 2: Average number of bypassing attacks, proprietary WAF, Fig. 11. Group 4: Average number of bypassing attacks, proprietary WAF, 12
seven parameters. parameters.
According to Table V, we observe that for Groups 2, 3, and 4 the ML-Driven techniques find around 30 bypassing tests.
ML-Driven E is always significantly superior to its predeces- The most prominent difference of this group, as compared to
sor as well as to RAN. However, looking at Figs. 9–11, we notice the others, is that very few bypassing tests are found. This is due
the following trend: the more the bypassing tests, the smaller to the fact that the number of allowed characters is only eight,
the difference in test results across ML-Driven techniques. thus resulting in a small number of attacks to bypass. Examin-
For example, ML-Driven E generates, on average, +25% ing the training sets used to learn the classifiers more closely,
bypassing attacks compared to the second best approach in the we find that the training sets are heavily imbalanced. There are
comparison (i.e., ML-Driven B) for Group 2. For Group 3, only between 15 and 30 bypassing attacks compared to up to
the gap between ML-Driven E and the closest alternative ap- 6000 blocked attacks, i.e., they represent only 0.25%–0.5% of
proach (i.e., RAN) is +30% on average. For Group 4, the average the data points in the training set. For the other tested parame-
difference between the ML-Driven E and its predecessor is ters, where the machine-learning-driven techniques outperform
lower than 10%. RAN, there are typically several hundreds of bypassing attacks
The average number of bypassing tests for the two parameters in a training set. As a result of the imbalanced training data, the
in Group 1 is depicted in Fig. 12. All techniques tend to saturate learned classifier has a low recall (≈80%) and is less accurate in
after about 125 min and, after that, the number of bypassing tests predicting an attack’s bypassing probability. RandomTree and
increases very slightly. RAN is the most effective approach, RandomForest tend to generate a constant classifier in such a
significantly outperforming all ML-Driven variants after scenario, i.e., the resulting classifier likely labels all data tests
105 min of running time (see Table V). At the end of the search, as “blocked” because this leads to a very low classification error
RAN reaches up to approximately 40 bypassing tests while all (< 1%).
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
750 IEEE TRANSACTIONS ON RELIABILITY, VOL. 67, NO. 3, SEPTEMBER 2018
Fig. 13. Comparison of the RandomTree and RandomForest classifier. Aver- Fig. 14. Boxplots for the number of bypassing tests found with one represen-
age number of bypassing tests found for nine tested operations (ten repetitions tative parameter (get-relationships) in ModSecurity.
each) in ModSecurity.
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
APPELT et al.: MACHINE-LEARNING-DRIVEN EVOLUTIONARY APPROACH FOR TESTING WEB APPLICATION FIREWALLS 751
TABLE VI
TAMPER SCRIPTS OF SQLMAP SELECTED FOR OUR EXPERIMENTS
SqlMap uses several tamper scripts. Each tamper script applies TABLE VII
RESULTS FOR TESTING THE PROPRIETARY WAF WITH THE WAF TESTING
a specific obfuscation method to an SQLi attack (e.g., randomly FRAMEWORK AND SQLMAP
changes the case of letters in an SQL keyword). Since some tam-
per scripts are only meant to be applied with particular WAFs,
Parameter WAF Testing Framework SqlMap
web application frameworks (e.g., ASP or PHP), or database
management systems, we selected a number of tamper scripts #TCTo ta l #TCB y p a ss #TCTo ta l #TCB y p a ss
that are applicable to our subject applications (i.e., applicable to
Group 1 19 0 3283.9 0.0
the proprietary WAF/ModSecurity, SOAP web services written Group 2 19 0 3281.6 0.0
in PHP/Java, and MySQL). Table VI lists the tamper scripts se- Group 3 19 0 3278.9 0.3
lected for our experiments and gives a brief description for each. Group 4 19 3 3260.5 66.5
Note that SqlMap does not report how many or which attacks
bypassed a WAF under test. Therefore, we routed the HTTP traf-
fic between SqlMap and the WAF under test through a custom of Group 4, three of the 19 attacks are bypassing. For all other
HTTP proxy, which determines whether each generated attack tested parameters, the WAF Testing Framework does not find
bypassed the WAF or was blocked. a bypassing attack. SqlMap generates a similar number of test
To compare ML-Driven with the WAF Testing Framework cases for each tested parameter (on average between 3260.5 and
and SqlMap, we use both tools to test ModSecurity and our 3283.9 test cases). For the parameter of Group 4, SqlMap finds,
industry partner’s proprietary WAF. The configuration of both on average, 66.5 bypassing attacks and for the parameter of
WAFs is the same as in the other reported experiments. Since Group 3, SqlMap finds, on average, only 0.3 bypassing attacks.
SqlMap’s attack generation is randomized, we repeated each For the other parameters, SqlMap does not find any bypassing
experiment involving SqlMap ten times and report the averages. attacks.
The attack generation in the WAF Testing Framework is deter- The fact that both tools find the most bypassing attacks
ministic and, hence, we report the result of a single run. We do for Group 4 can be explained with the fact that the input
not impose any limit on the number of generated test cases; both length constraint for this group is the least strict (refer to Ta-
tools are configured to execute as many test cases as they are ble IV). This also confirms the trend observed in the results
capable of generating. for ML-Driven, which found the most bypassing attacks for
For the experiment with the proprietary WAF, we randomly Group 4 as well. For the other groups, the WAF Testing Frame-
selected one representative parameter to test from each of the work and SqlMap do not reliably find bypassing attacks, be-
four parameter groups (refer to Table IV). We found in the cause the generated attacks are either too long to satisfy the
evaluation of RQ1 that parameters in a same group share a input length constraint or contain a pattern that is blacklisted.
similar number of bypassing attacks. Given how computation For the experiment with ModSecurity, we use the WAF Test-
intensive these experiments are, we test only a representative ing Framework and SqlMap to test the same nine parameters as
parameter of each group and interpret the result as a performance selected for the evaluation of ML-Driven (refer to RQ1). The
indicator for the other parameters in the same group. Table VII WAF Testing Framework executes again 19 test cases per tested
reports for both tools the total number of generated test cases parameter and SqlMap executes a similar number of test cases
(TCTotal ) and the number of test cases bypassing the proprietary as in the previous experiment (∼3270 test cases per parameter).
WAF (TCBypass ). The WAF Testing Framework executes 19 However, neither the WAF Testing Framework nor SqlMap find
test cases for each tested parameter. For the tested parameter any bypassing attack for any of the tested parameters.
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
752 IEEE TRANSACTIONS ON RELIABILITY, VOL. 67, NO. 3, SEPTEMBER 2018
TABLE VIII
SLICE ENCODINGS
S1d or Sb !
Sf 1 or 1 S3 1
S5 /∗∗ / S6 ∼ 1
S26 # S57 1
S2 S29 0
S0 0 S15 )
Sf true Sc ”
TABLE IX
DEMONSTRATION OF THE PATTERN LEARNED FROM EACH PATH CONDITION
AND EXAMPLE ATTACKS CONTAINING THE PATTERN (THE PATTERN
CHARACTERIZED BY THE PATH CONDITIONS IS HIGHLIGHTED IN RED)
Fig. 15. Number of path conditions (red y-axis on the left) and bypassing tests
(blue y-axis on the right) for ML-Driven E with RandomTree and Random-
Generations Id Pattern Example SQLi Attack
Forest.
1 pc1 or 1, /∗∗ / 0 or 1/∗∗ /
5 pc2 or, 1 ‘or true like 1 ||’
When comparing the results of the WAF Testing Framework 5 pc3 or, ), # ‘) or not false#
and SqlMap with ML-Driven (refer to Section V-A), we find 5 pc4 or, 0 1 or true > ( 0)
that the formers are either completely ineffective or signifi- 5 pc5 or, true, # ‘or∼true#
5 pc6 or, 0 0) or ! ”−−
cantly less effective than ML-Driven. For most of the tested 5 pc7 or, ” “ || true or”
parameters, both tools fail to find any bypassing attack, while 5 pc8 or, /∗∗ /, #, , 0, ! ‘ or/∗∗ /!0 is true#
ML-Driven finds a considerable number of them. For the few 5 pc9 or, /∗∗ /, 1, ∼ 1 1 or/∗∗ / 0 < (∼ 1)
parameters, for which the alternative tools find bypassing at-
tacks, ML-Driven finds an order of magnitude more bypassing
model. The number of path conditions that can be extracted
attacks. Furthermore, the attacks found by the alternative tools
from a model is also steadily increasing for both techniques,
are a subset of the attacks found by ML-Driven, and hence,
although beginning from generation 4, there are significantly
using the alternative tools in combination with ML-Driven
more path conditions extracted with RandomForest than with
would not provide more distinct bypassing attacks. Another
RandomTree. For RandomForest, the number of path condi-
advantage of ML-Driven over the alternative tools is that ML-
tions is close to 200 after 10 generations, compared with 50 for
Driven abstracts path conditions from the bypassing attacks,
RandomTree.
which helps the firewall administrator to repair the WAF (refer
For both techniques, the number of bypassing tests used to
to Section VI).
train the model and the number of path conditions obtained
from the model are concurrently increasing. This is no surprise
D. RQ4: Are We Learning New, Useful Attack Patterns as the
as more test cases lead to refined models. For RandomForest,
Number of Distinct, Bypassing Attacks Increases? that leads to even more path conditions since many more trees
As explained in Section III-C, the ML-Driven techniques are built.
periodically build a model (i.e., DTs) that is used to guide the In general, path conditions can help determine the reason why
generation of new attacks. However, this model can also be a set of attacks is bypassing. This knowledge can be utilized
used to abstract common string patterns shared by and possibly to stop further attack attempts from succeeding by inferring a
causing bypassing attacks. Therefore, for RQ4, we analyze the string pattern from the path condition that matches all the attacks
models that are learned during the test runs of ML-Driven E. described by it. Such a string pattern can be in turn added to the
For both RandomForest and RandomTree, we analyze how the WAF’s rule set to block all further attacks containing the same
number of path conditions increases as the result of generating string pattern.
more distinct bypassing attacks. To better explain the benefits of having more attack patterns,
Fig. 15 depicts the average number of path conditions that let us describe examples of path conditions and the attacks they
can be extracted from a model (red) and the number of by- characterize that are obtained from a test run of the industrial
passing tests with which the model is trained (blue). Since the WAF used in our evaluation. In particular, let us consider an
ML-Driven techniques retrain the model at the end of each example of a path condition: pc1 = S1d ∧ Sf 1 ∧ S5 ∧ ¬S11 ∧
generation, the corresponding training iterations (or EA genera- ¬S25 ∧ ¬S26 ∧ ¬S1b ∧ ¬S15 ∧ ¬S2f .
tions) are depicted on the horizontal axis. These curves are based Table VIII shows the literals that are represented by the cor-
on the average of all test runs performed with the ModSecurity responding slices. An example attack characterized by pc1 is
case study. 0 or 1/∗∗ /, which is an obfuscated variation of a tautology
The number of distinct, bypassing tests, which are used to attack, e.g. ’’ or 1 = 1. All attacks characterized by pc1
train the model, is steadily increasing for both techniques. This contain the slices S1d, Sf 1, and S5. If a WAF blocks input
is due to fact that the further a test run progresses, the more containing any of the three slices, the attacks characterized by
bypassing tests are found and are used, in turn, to retrain the pc1 do no longer bypass.
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
APPELT et al.: MACHINE-LEARNING-DRIVEN EVOLUTIONARY APPROACH FOR TESTING WEB APPLICATION FIREWALLS 753
TABLE X
PATH CONDITIONS FROM GENERATION 5
Id Path condition
pc2 S1d ∧ S57 ∧ ¬S25 ∧ ¬S26 ∧ ¬Sc ∧ ¬S14 ∧ ¬S1 ∧ ¬S3e ∧ ¬S0 ∧ ¬S38 ∧ ¬S29 ∧ ¬S5 ∧ ¬Se1 ∧ ¬S1c ∧ ¬S1de ∧
¬S76 ∧ ¬S67
pc3 S1d ∧ S26 ∧ S15 ∧ ¬Sf ∧ ¬S1c ∧ ¬Se1 ∧ ¬S25 ∧ ¬S5 ∧ ¬Sc ∧ ¬S14 ∧ ¬S1de ∧ ¬S3e ∧ ¬S76 ∧ ¬S29 ∧ ¬S38 ∧ ¬S67
pc4 S1d ∧ S29 ∧ ¬Se1 ∧ ¬S25 ∧ ¬S5 ∧ ¬S14 ∧ ¬S1de ∧ ¬S3e ∧ ¬S76 ∧ ¬S38 ∧ ¬S67
pc5 S1d ∧ Sf ∧ S26 ∧ ¬S1c ∧ ¬Se1 ∧ ¬S25 ∧ ¬S5 ∧ ¬Sc ∧ ¬S14 ∧ ¬S1de ∧ ¬S3e ∧ ¬S76 ∧ ¬S29 ∧ ¬S38 ∧ ¬S67
pc6 S1d ∧ S0 ∧ ¬S25 ∧ ¬S26 ∧ ¬Sc ∧ ¬S14 ∧ ¬S3e ∧ ¬S38 ∧ ¬S29 ∧ ¬S5 ∧ ¬Se1 ∧ ¬S1c ∧ ¬S1de ∧ ¬S76 ∧ ¬S67
pc7 S1d ∧ Sc ∧ ¬S1c ∧ ¬Se1 ∧ ¬S25 ∧ ¬S5 ∧ ¬S14 ∧ ¬S1de ∧ ¬S3e ∧ ¬S76 ∧ ¬S29 ∧ ¬S38 ∧ ¬S67
pc8 S1d ∧ S5 ∧ S26 ∧ S2 ∧ S0 ∧ Sb ∧ ¬Sa8 ∧ ¬S7c3 ∧ ¬Sf ∧ ¬S1e1 ∧ ¬S25 ∧ ¬Sd ∧ ¬Sc ∧ ¬S14 ∧ ¬S15 ∧ ¬S26d ∧
¬S60 ∧ ¬S2e ∧ ¬S2d ∧ ¬S38 ∧ ¬S3
pc9 S1d ∧ S3 ∧ S5 ∧ S6 ∧ ¬Sf ∧ ¬S25 ∧ ¬Sf 1 ∧ ¬Sc ∧ ¬S26 ∧ ¬S1b ∧ ¬S5a ∧ ¬S14 ∧ ¬S71
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
754 IEEE TRANSACTIONS ON RELIABILITY, VOL. 67, NO. 3, SEPTEMBER 2018
all attacks in A. We will use a running example in which A and 3) ri does not account for SQL comment characters, i.e., #
P C are initialized with the attacks and path conditions from and /∗∗ /, which are frequently used in bypassing attacks
generation five of the previous examples (refer to Table IX). R (e.g., pc3 , pc5 , pc8 , and pc9 ).
is initialized with the regular expressions from the proprietary Based on these observations, we devise an improved regular
WAF used in our evaluation. Note that the examples are not expression ri∗ to avoid the mentioned shortcomings:
artificial as the regular expressions were actively protecting the
(?i)(( |"|\d) .∗ or)|(or .∗ (/∗∗ /|#)).
industrial web service application used in our evaluation. The
attacks in Table IX would have bypassed the WAF deployed in The modified expression ri∗ consists of two parts. The first
the production system. part is (( |"|\d) .∗ or) and addresses 1) as well as 2).
The strategy starts in line 2 by initializing the output variable It addresses 1) by replacing the apostrophe with ( |"|\d)
R by assigning R to it. The while loop from line 2 to 2 is and it addresses 2) by replacing [\s] + with an arbitrary
repeated until all path conditions in P C are processed. In line 2, character sequence (denoted by .∗ ). Therefore, the first part
the loop begins by selecting a slice s that is the most frequently matches strings that contain either an apostrophe, a double
contained slice among the path conditions in P C. Line 2 selects quote, or a digit, followed by an arbitrary character sequence,
all path conditions P Cs ⊂ P C that contain slice s. Then, line 2 and followed by or. The second part is (or .∗ (/∗∗ /|#)) and
selects all attacks As ⊂ A that are characterized by the path addresses 3) as it matches strings that contain or and an SQL
conditions in P Cs . In the running example, the slice S1d is comment character (/∗∗ / or #). Since both parts are joined with
assigned to s because it is a part of eight path conditions (pc2 – an alternation (|), the strings matched by ri∗ is the union set of
pc9 ) and, hence, is a part of more path conditions than any other the strings matched by the first part and the second part.
slice. The path conditions pc2 –pc9 (see Table X) are assigned to With the performed changes, ri∗ matches all attacks in As .
P Cs because they contain slice S1d. Accordingly, the attacks In line 2, we remove the old expression ri from the output set
characterized by the path conditions pc2 –pc9 are assigned to As R and add ri∗ to R . Since all attacks in As are now correctly
(refer to Table IX). identified, we remove in line 11 the path conditions P Cs from
In line 8, a security analyst manually inspects the regular ex- P C and continue with the next loop iteration until all path
pressions in R and identifies a regular expression ri that should conditions in P C are processed. In the running example, P C is
be modified in order to correctly identify the bypassing attacks empty after the first iteration and the main loop ends.
in As . Depending on the regular expressions R, there might al- Note that the proposed strategy helps a security expert to
ready exist an expression ri ∈ R that identifies attacks similar to efficiently deal with the typically large number of path condi-
those in As and, hence, could be modified to also account for As . tions found by ML-Driven by grouping conditions containing
Alternatively, if no such regular expression exists, a new regular similar attack patterns. Grouping patterns helps the security
expression can be created or, if there is more than one candidate experts to consider relevant patterns together when devising an
regular expression for modification, several regular expressions improved regular expression, and hence, they do not have to con-
can be modified so that each expression matches a subset of sider each pattern individually. The strategy is defined in such a
As . The security analyst is free to make this decision based on way that the most frequent attack patterns are tackled first.
how he intends to logically structure the rule set. In the running
example, we assign the following regular expression to ri : B. Suitability of Other Techniques for the Purpose of
Repairing a WAF
(?i)’[\s]+or.
The testing techniques and tools discussed throughout this
The regular expression begins with (?i), which enables paper find bypassing attacks with varying effectiveness. This
case-insensitive matching. It matches strings that contain an section discusses how suitable the different techniques are for
apostrophe followed by at least one whitespace ([\s]+ ) fol- the purpose of repairing a WAF.
lowed by the SQL keyword or. We select this expression for In our experiments, SqlMap and the WAF testing framework
modification because it is the only expression in the rule set that did find none or very few bypassing attacks and, hence, do not
tries to match attacks containing the SQL keyword or. Since provide much useful information for repairing a WAF as many
all attacks in As also contain or, the expression is a suitable bypassing attacks would remain uncovered. RAN is more ef-
candidate for modification. fective in finding bypassing attacks than SqlMap and the WAF
In line 9, the security analyst extends the regular expression Testing Framework, but it does not identify common bypassing
ri so that it also matches the attacks in As . The path conditions attack patterns. In absence of such patterns, a security analyst
in P Cs provide guidance on how ri could be extended. In the has to translate a possibly large number of bypassing attacks into
running example, when we contrast the expression ri with the an WAF patch. This limitation affects not only RAN but also
path conditions in P Cs , we make several observations regarding the other tools (i.e., SqlMap and WAF Testing Framework) as
why ri does not identify the attacks in As : they provide the set of bypassing attacks without any additional
1) unlike in ri , not all attacks in As start with an apostrophe, information about which substrings are associated with bypass-
but they start either with an apostrophe, a double quote ing the WAF. Instead, our ML-Driven techniques provide both
(e.g., pc7 ), or a digit (e.g., pc6 ); the set of bypassing attacks (which are larger in number) and
2) unlike in ri , none of the path conditions requires a leading the associated attack patterns identified by machine learning
whitespace before the literal or; and algorithms (i.e., RandomTree and RandomForest).
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
APPELT et al.: MACHINE-LEARNING-DRIVEN EVOLUTIONARY APPROACH FOR TESTING WEB APPLICATION FIREWALLS 755
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
756 IEEE TRANSACTIONS ON RELIABILITY, VOL. 67, NO. 3, SEPTEMBER 2018
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.
APPELT et al.: MACHINE-LEARNING-DRIVEN EVOLUTIONARY APPROACH FOR TESTING WEB APPLICATION FIREWALLS 757
[24] W. Halfond, J. Viegas, and A. Orso, “A classification of sql-injection [50] J. Williams and D. Wichers, “Owasp, top 10, the ten most critical web
attacks and countermeasures,” in Proc. IEEE Int. Symp. Secure Softw. application security risks,” Open Web Appl. Security Project, Tech.
Eng., 2006, vol. 1, pp. 13–15. Rep., 2013. [Online]. Available: http://owasptop10.googlecode.com/
[25] W. G. Halfond, S. Anand, and A. Orso, “Precise interface identification files/OWASP%20Top%2010%20-%202010.pdf.
to improve testing and analysis of web applications,” in Proc. 18th Int. [51] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning
Symp. Softw. Testing Anal., 2009, pp. 285–296. Tools and Techniques. San Mateo, CA, USA: Morgan Kaufmann, 2011.
[26] W. G. Halfond and A. Orso, “Amnesia: Analysis and monitoring for
neutralizing sql-injection attacks,” in Proc. 20th IEEE/ACM Int. Conf. Dennis Appelt received the Ph.D. degree in
Automated Softw. Eng., 2005, pp. 174–183. software engineering from the University of
[27] W. G. J. Halfond and A. Orso, “Preventing SQL injection at- Luxembourg, Luxembourg City, Luxembourg, in
tacks using AMNESIA,” in Proc. 28th Int. Conf. Softw. Eng., 2006, 2016. His thesis was entitled “Automated security
pp. 795–798. testing of web-based systems against SQL injection
[28] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. attacks.”
Witten, “The weka data mining software: An update,” SIGKDD Explo- He conducted his Ph.D. studies from July 2012 to
rations Newsl., vol. 11, no. 1, pp. 10–18, Nov. 2009. June 2016 with the Interdisciplinary Centre for Secu-
[29] J. Hwang, T. Xie, F. Chen, and A. X. Liu, “Systematic structural testing of rity, Reliability, and Trust (SnT), University of Lux-
firewall policies,” in Proc. IEEE Symp. Rel. Distrib. Syst., 2008, pp. 105– embourg, under the supervision of Prof. L. Briand.
114. From July 2016 to November 2016, he was a Postdoc-
[30] J. Jürjens and G. Wimmel, “Specification-based testing of firewalls,” in toral Researcher with SnT. He disseminated his Ph.D. work in the proceedings
Perspectives of System Informatics (Lecture Notes in Computer Science, of various international conferences (e.g., ASE, ISSTA, ICST). He is currently a
vol. 2244), D. Bjørner, M. Broy, and A. Zamulin, Eds. Berlin, Germany: Software Engineer with the Research Department, Ripple, Inc., San Francisco,
Springer, 2001, pp. 308–316. CA, USA, and works on secure protocol design and secure implementations for
[31] G. Karafotias, M. Hoogendoorn, and A. E. Eiben, “Parameter control in financial software. His research interests include software security, automated
evolutionary algorithms: Trends and challenges,” IEEE Trans. Evol. Com- software testing, and distributed ledger technologies.
put., vol. 19, no. 2, pp. 167–187, Apr. 2015.
[32] A. Kieyzun, P. J. Guo, K. Jayaraman, and M. D. Ernst, “Automatic creation
of SQL injection and cross-site scripting attacks,” in Proc. 31st Int. Conf. Cu D. Nguyen received the Ph.D. degree with a dis-
Softw. Eng., 2009, pp. 199–209. tinguished dissertation in computer science from the
[33] Y.-F. Li, P. K. Das, and D. L. Dowe, “Two decades of web applica- University of Trento, Trento, Italy, in 2009.
tion testing: A survey of recent advances,” Inf. Syst., vol. 43, pp. 20–54, Before joining POST Luxembourg as a Data Sci-
2014. entist and Security Expert, he was a Researcher with
[34] A. Liu, Y. Yuan, D. Wijesekera, and A. Stavrou, “Sqlprob: A proxy-based the University of Luxembourg and a Postdoctoral
architecture towards preventing sql injection attacks,” in Proc. 2009 ACM Researcher with Fondazione Bruno Kessler, Trento,
Symp. Appl. Comput., 2009, pp. 2054–2061. Italy. His fields of interests include machine learning,
[35] H. Liu and H. B. Kuan Tan, “Testing input validation in web applications computer security, and secure software engineering.
through automated model recovery,” J. Syst. Softw., vol. 81, no. 2, pp. 222–
233, 2008.
[36] P. McMinn, “Search-based software test data generation: A survey,” Softw. Annibale Panichella received the Ph.D. degree in
Testing, Verification Rel., vol. 14, no. 2, pp. 105–156, 2004. software engineering from the University of Salerno,
[37] R. McNally, K. Yiu, D. Grove, and D. Gerhardy, “Fuzzing: The state of Fisciano, Italy, in 2014. His dissertation was entitled
the art,” DTIC Doc., Tech. Rep. DSTO–TN–1043, 2012. “Search-based software maintenance and testing.”
[38] J. Offutt, Y. Wu, X. Du, and H. Huang, “Bypass testing of web He is an Assistant Professor with the Delft Univer-
applications,” in Proc. 15th Int. Symp. Softw. Rel. Eng., 2004, sity of Technology, Delft, The Netherlands. He is also
pp. 187–197. a Research Fellow with the Interdisciplinary Cen-
[39] A. Panichella, F. Kifetew, and P. Tonella, “Automated test case genera- tre for Security, Reliability and Trust , University of
tion as a many-objective optimisation problem with dynamic selection of Luxembourg, Luxembourg City, Luxembourg, where
the targets,” IEEE Trans. Softw. Eng., vol. 44, no. 2, pp. 122–158, Feb. he was a Research Associate until January 2018. He
2018. serves and has served as a program committee mem-
[40] A. Panichella, F. M. Kifetew, and P. Tonella, “Reformulating branch cov- ber of various international conference (e.g., ICSE, GECCO, ICST, and ICPC)
erage as a many-objective optimization problem,” in Proc. 8th IEEE Int. and as a reviewer for various international journals (e.g., TSE, TOSEM, TEVC,
Conf. Softw. Testing, Verification Validation, Graz, Austria, Apr. 13–17, EMSE, STVR) in the fields of software engineering and evolutionary com-
2015, pp. 1–10. putation. His research interests include security testing, evolutionary testing,
[41] A. Petrowski and S. Ben-Hamida, Evolutionary Algorithms. New York, search-based software engineering, textual analysis, and empirical software
NY, USA: Wiley, 2017. engineering.
[42] J. R. Quinlan, C4.5: Programs for Machine Learning, vol. 1. San Mateo,
CA, USA: Morgan Kaufmann, 1993.
[43] D. Senn, D. Basin, and G. Caronni, “Firewall conformance testing,” in Lionel C. Briand (F’10) is a Professor of soft-
Testing of Communicating Systems (Lecture Notes in Computer Science, ware verification and validation with the SnT Cen-
vol. 3502), F. Khendek and R. Dssouli, Eds. Berlin, Germany: Springer, tre for Security, Reliability, and Trust, University of
2005, pp. 226–241. Luxembourg, Luxembourg City, Luxembourg, where
[44] S. Shamshiri, J. M. Rojas, G. Fraser, and P. McMinn, “Random or genetic he is also the Vice Director of the centre. He
algorithm search for object-oriented test suite generation?” in Proc. Annu. is currently running multiple collaborative research
Conf. Genet. Evol. Comput., 2015, pp. 1367–1374. projects with companies in the automotive, satellite,
[45] L. K. Shar and H. B. K. Tan, “Defeating sql injection,” Computer, vol. 3, financial, and legal domains. He has held various en-
pp. 69–77, 2013. gineering, academic, and leading positions in five
[46] M. Soltani, A. Panichella, and A. van Deursen, “A guided genetic algo- other countries. He became an IEEE Fellow in 2010
rithm for automated crash reproduction,” in Proc. 39th Int. Conf. Softw. for his work on the testing of object-oriented systems.
Eng., Buenos Aires, Argentina, May 20–28, 2017, pp. 209–220. His research interests include software testing and verification, model-driven
[47] M. Sutton, A. Greene, and P. Amini, Fuzzing: Brute Force Vulnerability software development, search-based software engineering, and empirical soft-
Discovery. London, U.K.: Pearson, 2007. ware engineering.
[48] O. Tripp, O. Weisman, and L. Guy, “Finding your way in the testing Prof. Briand was a recipient of the IEEE Computer Society Harlan Mills
jungle: A learning approach to web security testing,” in Proc. Int. Symp. Award and the IEEE Reliability Society Engineer-of-the-Year Award for his
Softw. Testing Anal., 2013, pp. 347–357. work on model-based verification and testing, respectively, in 2012 and 2013.
[49] M. Črepinšek, S.-H. Liu, and M. Mernik, “Exploration and exploitation He received an ERC Advanced grant in 2016—on the topic of modeling and
in evolutionary algorithms: A survey,” ACM Comput. Surveys, vol. 45, testing cyber-physical systems—which is the most prestigious individual re-
pp. 35, 2013. search grant in the European Union.
Authorized licensed use limited to: VIT University. Downloaded on September 04,2021 at 05:02:30 UTC from IEEE Xplore. Restrictions apply.