Automating Privilege Escalation with Deep Reinforcement Learning

Kalle Kujanpää, Aalto University, Espoo, Finland (kalle.kujanpaa@aalto.fi)
Willie Victor, F-Secure, Johannesburg, South Africa (willie.victor@f-secure.com)
Alexander Ilin, Aalto University, Espoo, Finland (alexander.ilin@aalto.fi)

ABSTRACT
AI-based defensive solutions are necessary to defend networks and information assets against intelligent automated attacks. Gathering enough realistic data for training machine learning-based defenses is a significant practical challenge. An intelligent red teaming agent capable of performing realistic attacks can alleviate this problem. However, there is little scientific evidence demonstrating the feasibility of fully automated attacks using machine learning. In this work, we exemplify the potential threat of malicious actors using deep reinforcement learning to train automated agents. We present an agent that uses a state-of-the-art reinforcement learning algorithm to perform local privilege escalation. Our results show that the autonomous agent can escalate privileges in a Windows 7 environment using a wide variety of different techniques depending on the environment configuration it encounters. Hence, our agent is usable for generating realistic attack sensor data for training and evaluating intrusion detection systems.

CCS CONCEPTS
• Computing methodologies → Reinforcement learning; Neural networks; Partially-observable Markov decision processes; • Security and privacy → Network security; Intrusion detection systems; • Applied computing → System forensics.

KEYWORDS
A2C, actor-critic, attack automation, autonomous cyber defense, autonomous malware, deep reinforcement learning, red teaming, neural networks, POMDP, post-exploitation, privilege escalation

ACM Reference Format:
Kalle Kujanpää, Willie Victor, and Alexander Ilin. 2021. Automating Privilege Escalation with Deep Reinforcement Learning. In Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security (AISec '21), November 15, 2021, Virtual Event, Republic of Korea. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3474369.3486877

1 INTRODUCTION
Defending networks and information assets from attack in a constantly evolving threat landscape remains a substantial challenge in our modern and connected world. As better detection and response methods are developed, attackers invariably adapt their tools and techniques to remain competitive. One adaptation is the use of advanced automation to perform attack sequences so quickly that defenders cannot respond to them. An example of this is the NotPetya malware from 2017. The malware spread very rapidly by credential dumping and lateral movement techniques usually expected of human-on-keyboard attacks (see, e.g., [20]).

Partial or even full automation of attacks is nothing new. Many exploits and techniques consist of discrete steps, and they can be easily scripted. Frameworks such as Metasploit, OpenVAS, Cobalt Strike, and PowerShell Empire support and automate red teaming activities. However, this type of automation relies on several assumptions about the target environment and often requires a human to configure it for a specific scenario. Moreover, the highly predictable sequences of observable events generated by automated attacks give a further advantage to defenders. Hence, these kinds of attacks can often be detected and responded to efficiently. In contrast, a human expert might be capable of determining the optimal approach for a given target almost immediately and avoiding detection, thanks to the experience that they have gathered.

Can the human approach be emulated by intelligent machine learning agents that learn from experience, instead of the user having to enumerate all possibilities and create logic trees to account for all the alternatives? There would be many potential use cases for an intelligent machine learning red teaming agent. To train comprehensive defensive systems using machine learning, enormous amounts of data of realistic attacks might be necessary. An intelligent agent enables the generation of large amounts of attack data, on demand, that could be used to train or refine detection models. Moreover, in order to understand how to defend against attacks and perform risk assessment, the potential behavior and threats posed by these machine learning-based agents must be understood by the cyber security community.

Despite a fear of malicious actors using reinforcement learning for offensive purposes and the significant advances in deep reinforcement learning during the past few years, few studies on red teaming with reinforcement learning have been published. The most likely reason for this is that the problem of learning to perform an attack is extremely hard:

• a complete attack typically consists of a long sequence of interdependent steps;
• the action space is practically infinite if the actions are the commands that the agent can execute;
• even formalizing red teaming activities as machine learning problems can be extremely challenging.


Therefore, existing research has focused on automating smaller sub-tasks such as initial access [37] or lateral movement during post-exploitation [28].

DeepExploit [37] is a deep reinforcement learning agent that is trained to automate gaining initial access using known vulnerabilities and exploits. It is built on the Metasploit framework [35]. After a successful penetration, it tries to recursively gain access to other hosts in the local network of the given input IP address. DeepExploit is primarily a framework for penetration testing, and its support for post-exploitation activities is very limited: the agent treats lateral movement as a second initial access task.

The deep RL agent of Maeda and Mimura [28] is a step forward in emulating adversarial behavior in real environments: it is trained to perform lateral movement in Windows domains. The authors train the agent using the modules of PowerShell Empire as the action space. The state of the proposed agent consists of ten entries, such as the number of discovered computers in the network, the number of compromised computers, and whether the agent has local administrative privileges. The authors demonstrate that the reinforcement learning agent can learn to perform lateral movement and obtain domain controller privileges.

In this work, we present one potential use case for an intelligent machine learning red teaming agent: we use deep RL to automate the task of local privilege escalation. Privilege escalation is the typical first step performed by an attacker after gaining initial access, and it is often followed by lateral movement to other hosts in the penetrated network. We consider privilege escalation in Windows 7 environments, which may have an arbitrary number of system components (such as services, DLLs, and tasks). We propose a formalization of the privilege escalation task as a reinforcement learning problem and present a novel architecture of an actor-critic RL agent. We experimentally show that the proposed agent can learn to perform the privilege escalation task.

Although we focus on one sub-task performed by a malicious actor, the learning problem that we consider is significantly harder compared to previous works [28, 37]:

• Privilege escalation needs a sequence of actions, the selection of which depends on the changing system state. The scenario of [37], for example, has one-step solutions without any changes in the system state.
• Privilege escalation can be accomplished by multiple different strategies.
• The attacked system can have a varying number of system components (services, DLLs, tasks), and the agent should generalize to any number of those.
• Our training setup is more realistic and diverse compared to [28]. Instead of attacking a system whose variability is implemented by noise in system parameters, we use different system configurations in each training episode.
• The state of the system in our experiments is described by thousands of variables, which is much larger than the states of the agents in [28, 37].
• The actions that we design for our learning environment are more atomic compared to [28, 37]. Most of our actions can be implemented with standard OS commands instead of using modules of existing exploitation frameworks.

Thus, our study takes a step towards solving more complex red teaming tasks with artificial intelligence.

2 RELATED WORK
Applying reinforcement learning in cyber security has been a subject of much recent research [32]. Examples of application areas include, among others, anti-jamming communication systems [21], spoofing detection in wireless networks [41], phishing detection [8], autonomous cyber defense in software-defined networking [22], mobile cloud offloading for malware detection [39], botnet detection [1], security in mobile edge caching [42], and security in autonomous vehicle systems [16]. In addition, reinforcement learning has been applied to research of physical security, such as grid security [33], and to green security games [40].

Previously, multi-agent reinforcement learning has been applied to cyber security simulations with competing adversarial and defensive agents [5, 12, 23]. It has been shown that both the attacking and the defending reinforcement learning agents can learn to improve their performance. The success of multi-agent reinforcement learning might have wider implications for information security research even though these simulation-based studies are not directly applicable to real environments.

There have also been attempts to apply reinforcement learning to penetration testing [6, 10, 18, 19, 37, 43]. The results of these efforts suggest that reinforcement learning can support the human in charge of the penetration testing process [6, 18]. Reinforcement learning has also been applied to planning the steps following the initial penetration by learning a policy in a simulated version of the environment [10]. Penetration testing [43] and web hacking [13] have also been converted to simulated capture-the-flag challenges that can be solved with reinforcement learning. Finally, reinforcement learning has been applied to attacking static Portable Executable (PE) malware models [2] and even supervised learning-based anti-malware engines [15]. The trained agents are capable of modifying the malware to evade detection.

Reinforcement learning has also been applied to blue teaming. Microsoft has developed a research toolkit called CyberBattleSim, which enables modeling the behavior of autonomous agents in a high-level abstraction of a computer network [38]. Reinforcement learning agents that operate in the abstracted network can be trained using the framework. The objective of the platform is to create an understanding of how malicious reinforcement learning agents could behave in a network and how reinforcement learning can be used for threat detection. Deep reinforcement learning can also be applied to improving feature selection for malware detection [14].

Non-learning-based approaches to automating adversary emulation have been developed as well. Caldera is a framework capable of automatically planning adversarial actions against Windows enterprise networks. Caldera uses an inbuilt model of the structure of enterprise domains and knowledge about the objectives and potential actions of an attacker. Then, an intelligent heuristics-based planner decides which adversarial actions to perform [3]. Moreover, several different non-RL-based AI approaches to penetration testing and vulnerability analysis have been proposed [29].


Supervised and unsupervised learning with and without neural networks have been applied for blue teaming purposes. For instance, malicious PowerShell commands can be detected with novel deep learning methods [24]. An ensemble detector combining an NLP-based classifier with a CNN-based classifier was the best at detecting malicious commands. The detection performance was high enough to be useful in practice. The detector was evaluated using a large dataset consisting of legitimate commands executed by standard users, malicious commands executed by malware, and malicious commands designed by security experts. The suitability of machine learning for intrusion detection, malicious code detection, malware analysis, and spam detection has been discussed [4, 11, 26]. These methods often rely on extensive feature engineering [7, 26]. In defensive tasks, good performance can often be achieved without deep neural networks, with solutions like logistic regression, support vector machines, and random forests [30]. Machine learning-based systems are vulnerable to adversarial attacks [9], and as different ML-based techniques have dissimilar weaknesses, a combination of machine learning techniques is often necessary [4, 7].

3 REINFORCEMENT LEARNING
Reinforcement learning is one of the three main paradigms of machine learning alongside supervised and unsupervised learning [36]. In reinforcement learning, an agent interacts with an environment over discrete time steps to maximize its long-run reward. At a given time step t, the environment has state s^e_t and the agent is given an observation o_t = f(s^e_t) and a reward signal r_t. If the environment is fully observable, the observation is equal to the environment state, o_t = s^e_t. In a more general scenario, the agent receives only a partial observation o_t which does not represent the full environment state. In this case, the agent has its own state s_t which might differ from the environment state s^e_t. The agent selects an action a_t from the set of possible actions A and acts in the environment. The environment transitions to a new state s^e_{t+1} and the agent receives a new observation o_{t+1} and a new reward r_{t+1}. The goal of the agent is to maximize the sum of the collected rewards $R = \sum_{t=0}^{\infty} \gamma^t r_t$, where γ ∈ (0, 1] is a discount factor used to discount future rewards [31].

In this paper, we use a model-free approach to reinforcement learning in which the agent does not build an explicit model of the environment. The agent selects an action a_t according to a policy function π(a_t | s_t) which depends on the agent state s_t. We use an algorithm called the advantage actor-critic (A2C) [31] in which the policy is parameterized as π(a | s; θ). The parameters θ of the policy are updated in the direction of

$$\nabla_\theta \log \pi(a_t \mid s_t; \theta) \, A(a_t, s_t) \qquad (1)$$

where A(a_t, s_t) is an advantage function which estimates the (relative) benefit of taking action a_t in state s_t in terms of the expected total reward. In A2C, the advantage function is computed as

$$A(a_t, s_t) = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}) - V(s_t)$$

where V(s) is the state-value function which estimates the expected total reward when the agent starts at state s_t and follows policy π(a_t | s_t; θ):

$$V(s) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s \right].$$

We update the parameters ω of the value function V(s_t; ω) using Monte Carlo estimates of the total discounted rewards as the targets,

$$\mathrm{loss}\left( V(s_t; \omega), \, \sum_{k=0}^{\infty} \gamma^k r_{t+k} \right) \rightarrow \min_\omega,$$

using the Huber loss [25]. In practice, most of the parameters θ and ω are shared (see Section 5.2 and Figure 1).
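For illustration, a minimal sketch of how such an A2C update could be implemented in PyTorch is shown below. The `model(state)` interface, the hyperparameter values, and the use of full Monte Carlo returns as value targets are assumptions consistent with the description above, not the exact code used in this work.

```python
# Minimal A2C episode update (sketch; assumed interface, not the implementation used here).
# `model(state)` is assumed to return (action_logits, state_value) for one agent state.
import torch
import torch.nn.functional as F

def a2c_episode_update(model, optimizer, states, actions, rewards, gamma=0.99):
    """One update from a finished episode, using Monte Carlo returns as targets."""
    # Discounted returns G_t = sum_k gamma^k r_{t+k}, computed backwards over the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)

    policy_loss = torch.tensor(0.0)
    value_loss = torch.tensor(0.0)
    for state, action, G in zip(states, actions, returns):
        logits, value = model(state)
        log_prob = F.log_softmax(logits, dim=-1)[action]
        target = torch.tensor(G)
        advantage = target - value.detach()                 # A(a_t, s_t) with a Monte Carlo target
        policy_loss = policy_loss - log_prob * advantage    # ascent on Eq. (1), written as descent
        value_loss = value_loss + F.smooth_l1_loss(value.squeeze(), target)  # Huber loss

    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()
```

In practice, the policy and value heads share most parameters, so both losses are backpropagated through the same network, as noted above.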
4 PRIVILEGE ESCALATION AS A REINFORCEMENT LEARNING TASK

4.1 Problem Definition
In this work, we focus on automating one particular step often performed by red teaming actors: local privilege escalation in Windows. For our reinforcement learning agent, there will be three possible paths to success:

• Add the current user as a local administrator
• Obtain administrative credentials
• Overwrite a program that is executed with elevated privileges when a user or an administrator logs on

The first alternative is hardly how a true red teaming actor would approach the problem, as changes in the local administrators of a workstation are easily detectable by any advanced detection and response system. However, if the agent is successful at doing that, it demonstrates that the agent can, with some exceptions, execute arbitrary code with elevated privileges on the victim host. The second alternative is a more realistic way of performing local privilege escalation. The third method is arguably inferior to the other two methods as it requires the attacker to wait for the system to be rebooted or some other event to occur that triggers the scheduled task or the AutoRun.

4.2 Learning Environment
The learning environment is a simulated Windows 7 environment with a random non-zero number of services, tasks, and AutoRuns. In each training episode, we introduce one vulnerability in the simulated system by selecting randomly from the following 12 alternatives:

(1) hijackable DLL
    • missing DLL
    • writable DLL
(2) re-configurable service
(3) unquoted service path
(4) modifiable ImagePath in the service registry
(5) writable executable pointed to by a service
(6) missing service binary and a writable service folder
(7) writable binary pointed to by an AutoRun
(8) AlwaysInstallElevated bits set to one
(9) credentials of a user with elevated access in the WinLogon registry
(10) credentials of a user with elevated access in an Unattend file


(11) writable binary pointed to by a scheduled task running with elevated privileges
(12) writable Startup folder

To increase the variability of the environment states, we also randomly add services, tasks, and AutoRuns that might initially seem vulnerable to the agent. For instance, a service with one of the service-specific vulnerabilities above but without elevated privileges, or a service with a writable parent folder but without an unquoted path, can be added. Moreover, standard user credentials might be added to the registry, or a folder on the Windows path might be made writable, among others.

To train an autonomous reinforcement learning agent to perform local privilege escalation, we need to formalize the learning problem, that is, we need to define the reward function r, the action space A, and the space of the agent states.

Defining the reward function is perhaps the easiest task. We selected the simplest possible reward structure without any reward shaping. The agent is given a reward r = 1 for the final action of the episode if the privilege escalation has been performed successfully. Otherwise, a zero reward is given. Based on our experiments, this simple sparse reward signal is sufficient for teaching the agent to perform privilege escalation with as few actions as possible because the reward is progressively discounted as more steps are taken by the agent. We also experimented by giving the agent only half a reward (+1/2) for performing privilege escalation by the third, arguably inferior method, but the agent had trouble learning the desired behavior.

The state s^e_t of the environment and its dynamics are determined by the Windows 7 environment (or its simulator) that the agent interacts with. The environment is only partially observable: the observations o_t are the outputs of the commands that the agent executes. Working with such a rich observation space is difficult, and therefore, we have designed a custom procedure that converts the observations into the agent state s_t. It is the agent state that is used as the input of the policy π(a_t | s_t) and value V(s_t) functions. We also manually designed a set A of high-level actions that the agent needs to choose from. We describe the agent state and the action space in the following sections.

4.3 State of the Agent
We update the state of the agent by keeping the information that is relevant for the task of privilege escalation. The agent state includes variables that contain general information about the system and information about discovered services, dynamic-link libraries (DLLs), the AutoRun registry, and scheduled tasks.

The general information is represented by the 27 variables listed in Table 1. Seven of these variables are trinary (true/unknown/false) and contain information useful for the task of privilege escalation. The remaining 20 variables are binary (true/false), and they also contain information about the previous actions of the agent. The previous actions are included in the state to make the agent state as close to Markov as possible, which makes the training easier.

Table 1: General information about the system stored in the agent state

Trinary variables:
(1) Are there credentials in files?
(2) Do the credentials in the files belong to users with elevated privileges?
(3) Are there credentials in the registry?
(4) Do the credentials in the registry belong to users with elevated privileges?
(5) Is there a writable folder on the Windows path?
(6) Are the AlwaysInstallElevated bits set?
(7) Can the AutoRuns be enumerated using an external PowerShell module?

Binary variables:
Has a malicious executable been created in Kali Linux?
Has a malicious service executable been created in Kali Linux?
Has a malicious DLL been created in Kali Linux?
Has a malicious MSI file been created in Kali Linux?
Has a malicious executable been downloaded?
Has a malicious service executable been downloaded?
Has a malicious DLL been downloaded?
Has a malicious MSI file been downloaded?
Does the agent know the list of local users?
Are there users whose privileges need to be checked?
Does the agent know the services running on the OS?
Does the agent know the scheduled tasks running on the OS?
Does the agent know the AutoRuns of the OS?
Has the agent performed a static analysis of the service binaries to detect DLLs?
Have the DLLs loaded by the service binaries been searched?
Are there folders whose permissions must be checked?
Are there executables whose permissions must be checked?
Does the agent know the current username?
Does the agent know the Windows path?
Are there base64-credentials to decode?

At the beginning of each training episode, the agent has no knowledge of the services running on the host. The agent has to collect a list of services by taking the action A31 Get a list of services. Once a service is detected, it is described by its name, full path, the owning user, and the 11 trinary attributes listed in Table 2. Each of these attributes can have three possible values: true (+1), unknown (0), and false (-1). Then, the agent needs to perform actions such as A25 Check service permissions with accesschk64 to fill in the values of the unknown attributes.

Since local privilege escalation can be performed by DLL hijacking, we also include information about the DLLs used by the services in the state. Each DLL is described using a set of attributes listed in Table 2. This information is added to the state after taking action A29 Analyze service executables for DLLs.

Privileges can be elevated in Windows by using vulnerable executables in the AutoRun registry and misconfigured scheduled tasks. Therefore, we add information about the AutoRun files and the scheduled tasks to the agent state. Each AutoRun file and each scheduled task is described using the trinary attributes defined in Table 2.
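As a concrete illustration of this encoding, the sketch below maps a partially observed service to the fixed-length +1/0/-1 vector described above. The attribute names paraphrase Table 2 and the helper is hypothetical, not the exact feature extraction used in this work.

```python
# Trinary encoding sketch: true -> +1, unknown -> 0, false -> -1.
# Attribute names paraphrase Table 2; the function is illustrative only.
import numpy as np

TRUE, UNKNOWN, FALSE = 1.0, 0.0, -1.0

SERVICE_ATTRS = [
    "is_running", "runs_elevated", "unquoted_path", "writable_parent_folder",
    "whitespace_in_path", "binary_in_windows_dir", "executable_writable",
    "reconfigurable", "registry_modifiable", "loads_vulnerable_dll", "exploited",
]

def encode_service(known: dict) -> np.ndarray:
    """Map a partially observed service to a fixed-length vector of 11 trinary attributes."""
    return np.array([known.get(attr, UNKNOWN) for attr in SERVICE_ATTRS], dtype=np.float32)

# Right after A31, most attributes are still unknown; actions such as A25 fill them in:
print(encode_service({"is_running": TRUE, "unquoted_path": FALSE}))
```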

160
Session 2B: Machine Learning for Cybersecurity AISec ’21, November 15, 2021, Virtual Event, Republic of Korea

In addition to the variables defined in Table 1 and Table 2, the agent maintains a collection of auxiliary information in its memory. The information is needed to fill the arguments of the commands executed by the agent. Examples of the auxiliary information are given in Table 3. This information is gathered and updated based on the observations, that is, the outputs of the commands performed by the agent. The auxiliary information is not given as input to the neural network, and hence, it affects neither the policy π(a_t | s_t) nor the value V(s_t) directly.

Table 2: Part of the agent state with information about services, DLLs, AutoRuns, and scheduled tasks (all are trinary variables)

Service:
    Is the service running?
    Is the service run with elevated privileges?
    Is the service path unquoted?
    Is there a writable parent folder?
    Is there whitespace in the service path?
    Is the service binary in C:\Windows?
    Can the service executable be written?
    Can the service be re-configured?
    Can the service registry be modified?
    Does the service load a vulnerable DLL?
    Has the service been exploited?

DLL:
    Is the DLL missing?
    Is the DLL writable?
    Has the DLL been replaced with a malicious DLL?

AutoRun:
    Is the AutoRun file writable?
    Is the AutoRun file in C:\Windows?

Task:
    Is the task run with elevated privileges?
    Is the executable writable?
    Is the executable in C:\Windows?

Table 3: Examples of auxiliary information used to fill the command arguments

Service: Name, executable path, user
Executable: Path, linked DLLs
DLL: Name, calling executable, path
AutoRun: Executable path, trigger
Task: Name, executable path, trigger, user
Credentials: Username, password, plaintext
File system: Folders, executables, permissions

4.4 Action Space
We designed the action space of the agent by including actions needed for gathering information about the victim Windows host and performing the privilege escalation techniques. The action space consists of the 38 actions listed in Table 4. Although the action space is crafted for known privilege escalation vulnerabilities (which we consider unavoidable within the constraints of current RL), there is no one-to-one relationship between the actions and vulnerabilities. Some actions are only relevant for specific vulnerabilities, whereas many others are more general and can be used in multiple scenarios (see Appendix B). Our general principle in constructing the action space has been to make the actions as atomic as possible while keeping the problem potentially solvable by current RL methods.

The actions are defined on a high level, which means that their exact implementation can vary, for example, depending on the platform. For instance, action A29 Analyze service executables for DLLs can be implemented by static analysis of the Portable Executable files with an open-source analyzer to detect the loaded DLLs. The same action can be implemented using a custom analyzer or a script to download the executable and analyze it with Process Monitor. Our high-level action definition enables modifying the low-level implementations of the actions, such as changing the frameworks used, without affecting the trained agent. To create the necessary malicious executables, we use Kali Linux with Metasploit. The malicious DLLs needed for performing DLL hijacking are compiled manually. However, the low-level implementation of these commands can easily be changed if desired.

Each of the high-level actions is well-specified and can be performed using only a handful of standard Windows (cmd.exe and PowerShell) and Linux (zsh) commands. Many of the commands need arguments. For instance, to take the action A9 Start an exploited service, the name of the service must be specified. In this work, we automatically fill the arguments using the auxiliary information collected as discussed in Section 4.3. For example, one of the actions defined is A28 Check directory permissions with icacls. The agent maintains an internal list of directories that are of interest, and when the action to analyze the permissions of directories is selected, every directory on the list is scanned.
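The sketch below illustrates one way a high-level action could be bound to low-level commands whose arguments are filled from the auxiliary memory of Table 3. The command strings and field names are assumptions made for illustration; the actual mapping used in this work is not published.

```python
# High-level action abstraction (sketch; command templates and field names are illustrative).
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class HighLevelAction:
    name: str
    build_commands: Callable[[Dict], List[str]]   # fills arguments from auxiliary memory

def a28_check_directory_permissions(aux: Dict) -> List[str]:
    # The agent keeps a list of interesting directories in its auxiliary memory
    # and scans every one of them when this action is selected.
    return [f'icacls "{folder}"' for folder in aux.get("directories_of_interest", [])]

def a31_get_service_list(aux: Dict) -> List[str]:
    # One possible low-level realization; the framework or command used can be
    # swapped out without retraining the agent, as discussed above.
    return ["wmic service get name,displayname,pathname,startmode"]

A28 = HighLevelAction("A28. Check directory permissions with icacls", a28_check_directory_permissions)
A31 = HighLevelAction("A31. Get a list of services", a31_get_service_list)
```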
5 EXPERIMENTS

5.1 Simulator of a Windows 7 Virtual Machine
A key practical challenge for training a reinforcement learning agent to perform red teaming tasks is the slow simulation speed when performing actions on a real virtual machine. For example, running commands necessary for privilege escalation can take longer than a minute on a full-featured Windows 7 virtual machine, even if the agent acts optimally. At the beginning of training, when the agent selects actions very close to randomly, one episode of training on a real VM can last significantly longer. Moreover, each training episode requires a new virtual machine that has been configured with one of the available vulnerabilities. Provisioning and configuring a virtual machine in such a manner will further add to the time it would take to train the agent. Training a successful agent may require thousands of training episodes, which can take a prohibitively large amount of time when training on a real operating system. Developing an infrastructure to tackle the long simulation times on a real system is a significant challenge, and it is left outside the scope of this study.


To alleviate this issue, we implemented a simulated Python environment that emulates the behavior of a genuine Windows 7 operating system relevant to the privilege escalation task. The simulation consists of, among others, a file system with access controls, Windows registry, AutoRuns, scheduled tasks, users, executables, and services. Using this environment, the actions taken by the agent can be simulated in a highly efficient manner. Moreover, creating simulated machines with random vulnerabilities for training requires little programming and computing power and can be done very fast. However, to determine whether training the agent in a simulated environment instead of a real operating system is feasible, the trained agent will be evaluated by testing it on a vulnerable Windows 7 virtual machine.
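A minimal sketch of what the interface of such a simulated environment could look like is given below. The class name, the method signatures, and the success check are placeholders; the actual simulator models the file system, registry, services, tasks, and AutoRuns in far greater detail.

```python
# Sketch of a simulated Windows 7 environment with a gym-like interface (illustrative only).
import random

class SimulatedWindows7Env:
    MAX_STEPS = 1000  # maximum episode length (the paper uses 1,000 steps; see Section 5.3)

    def reset(self):
        """Create a random host with one of the 12 vulnerabilities; return the empty agent state."""
        self.vulnerability = random.randint(1, 12)
        self.num_services = random.randint(10, 300)   # illustrative range; varies per episode
        self.steps = 0
        self.escalated = False
        return {}

    def step(self, action_id):
        """Apply one high-level action and return (observation, reward, done)."""
        self.steps += 1
        observation = self._simulate_command_output(action_id)
        self.escalated = self.escalated or self._escalation_succeeded(action_id)
        reward = 1.0 if self.escalated else 0.0       # sparse terminal reward (Section 4.2)
        done = self.escalated or self.steps >= self.MAX_STEPS
        return observation, reward, done

    # --- trivial stand-ins for the detailed simulation logic -----------------
    def _simulate_command_output(self, action_id):
        return f"simulated output of action A{action_id}"

    def _escalation_succeeded(self, action_id):
        # Placeholder check; the real simulator tracks permissions, registry keys,
        # services, tasks, and AutoRuns to decide whether an exploit worked.
        return action_id == self.vulnerability
```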
Table 4: Actions

A1. Create a malicious executable in Kali Linux
A2. Create a malicious service executable in Kali Linux
A3. Compile a custom malicious DLL in Kali Linux
A4. Create a malicious MSI in Kali Linux
A5. Download a malicious executable in Windows
A6. Download a malicious service executable in Windows
A7. Download a malicious DLL in Windows
A8. Download a malicious MSI in Windows
A9. Start an exploited service
A10. Stop an exploited service
A11. Overwrite the executable of an AutoRun
A12. Overwrite the executable of a scheduled task
A13. Overwrite a service binary
A14. Move a malicious executable so that it is executed by an unquoted service path
A15. Overwrite a DLL
A16. Move a malicious DLL to a folder on Windows path to replace a missing DLL
A17. Re-configure service to use a malicious executable
A18. Re-configure service to add the user to local administrators
A19. Change service registry to point to a malicious executable
A20. Change service registry to add the user to local administrators
A21. Install a malicious MSI file
A22. Search for unattend* sysprep* unattended* files
A23. Decode base64 credentials
A24. Test credentials
A25. Check service permissions with accesschk64
A26. Check the ACLs of the service registry with Get-ACL
A27. Check executable permissions with icacls
A28. Check directory permissions with icacls
A29. Analyze service executables for DLLs
A30. Search for DLLs
A31. Get a list of services
A32. Get a list of AutoRuns
A33. Get a list of scheduled tasks
A34. Check AlwaysInstallElevated bits
A35. Check for passwords in Winlogon registry
A36. Get a list of local users and administrators
A37. Get the current user
A38. Get the Windows path

[Figure 1 appears here: per-AutoRun, per-service, per-DLL, and per-task blocks are aggregated with max operations and combined into the policy (π) and value (v) outputs.]

Figure 1: The architecture of the A2C agent. Colored boxes represent multilayer perceptrons (same colors denote shared parameters). The black circles represent the concatenation of input signals.

5.2 The Architecture of the Agent
We use an A2C reinforcement learning agent (described in Section 3) in which the policy π(a|s) and value V(s) functions are modeled with neural networks. In practice, we use a single neural network with two heads: one produces the estimated value of the state V(s) and the other the probabilities π(a|s) of taking one of the 38 actions. The network gets as inputs the state variables described in Tables 1 and 2. The complete model has less than 27,000 parameters.

The main challenge in designing the neural network is that the number of AutoRuns, services, tasks, and DLLs can vary across training episodes. A Windows host might have anything from dozens to thousands of services, and the number of tasks and AutoRuns might also vary significantly depending on the host. We want our agent to be able to generalize to any number of those. Therefore, we process the information about each service, AutoRun, and scheduled task separately and aggregate the outputs of these computational blocks using the maximum operation. The architecture of the neural network is presented in Figure 1. Computational blocks with shared parameters are shown with the same colors.

In the proposed architecture, we concatenate the max-aggregated outputs of the blocks that process the information about AutoRuns, tasks, and DLLs with the outputs of the blocks that process the information about the individual services. Intuitively, this corresponds to augmenting the service data with the information about the most vulnerable AutoRun and task. Note that we also augment the service data with the information about the DLLs used by the corresponding service. Then, we pass the concatenated information through multilayer perceptrons that output value estimates and policies for all services. Finally, we regard the service with the highest value estimate as the most vulnerable one and select the policy corresponding to that service as the final policy.
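The following PyTorch sketch shows one way the architecture of Figure 1 could be wired. The layer sizes, the exact concatenation order, and the handling of the general state variables are assumptions; the paper states only that the complete model has fewer than 27,000 parameters.

```python
# Sketch of the two-headed A2C network of Figure 1 (layer sizes and wiring are assumed).
import torch
import torch.nn as nn

class A2CNet(nn.Module):
    def __init__(self, n_general=27, n_service=11, n_dll=3, n_autorun=2,
                 n_task=3, hidden=32, n_actions=38):
        super().__init__()
        self.dll_mlp = nn.Sequential(nn.Linear(n_dll, hidden), nn.ReLU())
        self.autorun_mlp = nn.Sequential(nn.Linear(n_autorun, hidden), nn.ReLU())
        self.task_mlp = nn.Sequential(nn.Linear(n_task, hidden), nn.ReLU())
        # Each service is combined with the general variables, its own DLL summary,
        # and the max-aggregated AutoRun and task summaries.
        self.service_mlp = nn.Sequential(
            nn.Linear(n_general + n_service + 3 * hidden, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)   # pi(a|s)
        self.value_head = nn.Linear(hidden, 1)            # V(s)

    def forward(self, general, services, dlls_per_service, autoruns, tasks):
        # A service with no known DLLs would need a padding row; omitted in this sketch.
        autorun_summary = self.autorun_mlp(autoruns).max(dim=0).values
        task_summary = self.task_mlp(tasks).max(dim=0).values
        per_service = []
        for svc, dlls in zip(services, dlls_per_service):
            dll_summary = self.dll_mlp(dlls).max(dim=0).values
            per_service.append(torch.cat(
                [general, svc, dll_summary, autorun_summary, task_summary]))
        h = self.service_mlp(torch.stack(per_service))
        values = self.value_head(h).squeeze(-1)   # one value estimate per service
        best = torch.argmax(values)               # most vulnerable service
        return self.policy_head(h[best]), values[best]
```

Because every block is applied per item and then max-aggregated, the same parameters handle any number of services, DLLs, AutoRuns, and tasks.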


[Figure 2 appears here: a plot of the average episode length over the course of training.]

Figure 2: The average episode length during training on a logarithmic scale. The red line illustrates the episode length of an optimal policy that is approximately 10.7 actions per episode.

5.3 Training
Training consists of episodes in which the agent interacts with one instance of a sampled environment. At the beginning of each episode, the agent has no knowledge of the environment. The empty agent state is fed into the neural network that produces the value and the policy outputs. The action is sampled from the probabilities given by the policy output. Thus, we do not use explicit exploration strategies such as epsilon-greedy. The selected action is performed in the simulated environment, and the reward and the observations received as a result of the action are passed to the agent. The observations are parsed to update the agent state as described in Section 4.3. Then, a new action is selected based on the updated state. The iteration continues until the maximum number of steps for one episode is reached, or the agent has successfully performed privilege escalation.

The parameters of the neural network are updated at the end of each episode. The gradient (1) is computed automatically by backpropagation with PyTorch [34] and the parameters are updated using the Adam optimizer [27]. The agent is trained for as long as the average reward per episode continues to increase. The hyperparameters used for training the agent are given in Appendix A.

Figure 2 presents the evolution of the episode length (averaged over 100 episodes) during one training run. The average episode length starts from around 200, and it gradually decreases reaching a level slightly above 11 after approximately 30,000 training episodes. We use 1,000 as the maximum number of steps per episode (see Appendix A), which implies that the agent manages to solve the problem and gets rewards from the very beginning of the optimization procedure when it takes close-to-random actions. We estimated that the average episode length is approximately 10.7 actions if the agent acts according to the optimal policy. Thus, the results indicate that the agent has learned to master the task of privilege escalation.

Training the agent for 50,000 episodes in the simulated environment (see Section 5.1) takes less than two hours without any significant code optimizations using a single NVIDIA GeForce GTX 1080 Ti GPU, which is a high-end consumer GPU from 2017. Note that performing the same training on a real Windows 7 virtual machine could take weeks.
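A sketch of this episode loop is shown below. It samples actions directly from the policy probabilities (no epsilon-greedy) and performs one update per episode; the hyperparameter values and the `model(state)` interface are placeholders, `update_state` stands in for the observation parser of Section 4.3, and `a2c_episode_update` refers to the illustrative sketch in Section 3.

```python
# Episode loop sketch (hyperparameters are placeholders; see Appendix A of the paper).
# `model(state)` is assumed to map an agent state to (action_logits, value).
import torch

def train(env, model, n_episodes=50_000, lr=1e-3, gamma=0.99, max_steps=1000,
          update_state=lambda state, observation: state):   # placeholder observation parser
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_episodes):
        state, done, steps = env.reset(), False, 0
        states, actions, rewards = [], [], []
        while not done and steps < max_steps:
            logits, _ = model(state)
            probs = torch.softmax(logits, dim=-1)
            action = torch.multinomial(probs, 1).item()   # sample from the policy
            observation, reward, done = env.step(action)
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            state = update_state(state, observation)
            steps += 1
        a2c_episode_update(model, optimizer, states, actions, rewards, gamma)
```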
5.4 Testing the Agent
Next, we test whether the agent trained in our simulated environment can transfer to a real Windows 7 operating system without any adaptation. We also compare the performance of our agent to two baselines: a ruleset crafted by an expert that can be interpreted as the optimal policy and a random policy.

We create a Windows 7 virtual machine using Hyper-V provided by the Windows 10 operating system. We assume that the offensive actor has gained low-level access with code execution rights by performing, for example, a successful penetration test or a phishing campaign. This is simulated by installing an SSH server on the victim host. We use the Paramiko SSH library in Python to connect to the virtual machine and execute commands with user-level credentials [17]. We use Hyper-V to create a Kali Linux virtual machine with Metasploit for generating malicious executables. However, instead of using Metasploit for creating malicious DLLs, the agent has to modify and compile a custom DLL code by taking action A3 Compile a custom malicious DLL in Kali Linux. Paramiko is also used to connect to the Kali machine.
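The snippet below sketches how a single low-level command could be executed on the victim host over SSH with Paramiko [17]. The host address and credentials are placeholders, and the helper is illustrative rather than the exact tooling used in the experiments.

```python
# Executing one low-level command on the victim over SSH (sketch; placeholder credentials).
import paramiko

def run_on_victim(command, host="192.168.0.10", username="lowpriv", password="changeme"):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=username, password=password)
    try:
        _stdin, stdout, stderr = client.exec_command(command)
        return stdout.read().decode(errors="replace"), stderr.read().decode(errors="replace")
    finally:
        client.close()

# Example: the observation for action A37 (Get the current user) could be gathered with
# out, err = run_on_victim("whoami")
```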
Using SSH for simulating low-level code execution rights on the victim Windows 7 host has some limitations. Some of the Windows command-line utilities, such as wmic and accesschk64, are blocked for non-privileged users over SSH. To overcome this limitation of the test scenario, we open a second SSH session using elevated credentials and run the blocked commands in that session. In practice, a malicious actor would be able to execute these utilities while accessing the victim's environment via a reverse shell or meterpreter session. Care was taken to prevent the test infrastructure from affecting the target environment. For example, due to the selection of an SSH tunnel as an off-the-shelf communication channel for testing purposes, the agent does not target any SSH-related vulnerabilities. Engineering effort was not prioritized for creating a production-ready attack agent, as it was considered beyond the scope of the research.

In order to take some of the actions listed in Table 4, the victim host has to have Windows Sysinternals with accesschk64. Moreover, we need an executable for scanning the DLLs loaded by the PE files. We used an open-source solution for that, but it failed to detect a DLL loaded by a handcrafted service executable. To work around this issue, we hard-coded the result of the scan in the agent. To properly address this issue, the high-level action of performing PE scanning could be mapped to a script that uploads the service executable on a Windows machine and uses ProcMon from Sysinternals to analyze the DLLs loaded by the service executable. Alternatively, a superior PE analyzer could be used.

First, we tested our agent on a virtual machine without external antivirus (AV) software or an intrusion detection system, but which had an up-to-date and active Windows Defender (which is essentially only an anti-spyware program in Windows 7).


We kept the number of services similar to the number of services during training by excluding all services in C:\Windows from the list of services gathered by the agent. We made the agent deterministically select the action with the highest probability. Our agent was successful in exploiting all twelve vulnerabilities. Examples of the sequences of actions taken by the agent during evaluation can be found in Tables 5–8. The agent took very few unnecessary actions. The performance (measured in terms of the number of actions) could be improved by gathering more information before scanning for directory permissions. Now, the agent prefers scanning the directory permissions immediately after finding interesting directories. However, the amount of noise generated by the agent would have been similar, as the agent would have performed more Windows commands per high-level action.

After that, we did not limit the number of services (by excluding services in C:\Windows) and let the agent perform privilege escalation. The increased number of services had no negative effect on the agent, and the agent was successful at the task. An example sequence of commands is given in Appendix C. However, because the agent performs each selected action on every applicable service, the agent generates some noise by scanning through the permissions of all services in C:\Windows. That could have caused an alert in an advanced detection and response system.

Table 5: Actions to exploit a missing DLL file

A35. Check for passwords in Winlogon registry
A37. Get the current user
A31. Get a list of services
A28. Check directory permissions with icacls
A36. Get a list of local users and administrators
A26. Check the ACLs of service registries with Get-ACL
A25. Check service permissions with accesschk64
A34. Check AlwaysInstallElevated bits
A32. Get a list of AutoRuns
A28. Check directory permissions with icacls
A27. Check executable permissions with icacls
A22. Search for unattend* sysprep* unattended* files
A33. Get a list of scheduled tasks
A28. Check directory permissions with icacls
A27. Check executable permissions with icacls
A29. Analyze service executables for DLLs
A30. Search for DLLs
A28. Check directory permissions with icacls
A38. Get the Windows path
A28. Check directory permissions with icacls
A3. Compile a custom malicious DLL in Kali Linux
A7. Download a malicious DLL in Windows
A16. Move a malicious DLL to a folder on Windows path to replace a missing DLL
A9. Start an exploited service

Table 6: Actions to exploit a service with a missing binary

A35. Check for passwords in Winlogon registry
A37. Get the current user
A31. Get a list of services
A28. Check directory permissions with icacls
A2. Create a malicious service executable in Kali Linux
A6. Download a malicious service executable in Windows
A13. Overwrite a service binary
A9. Start an exploited service

Table 7: Actions to exploit elevated credentials in the WinLogon registry

A35. Check for passwords in Winlogon registry
A36. Get a list of local users and administrators
A24. Test credentials

Table 8: Actions to exploit AlwaysInstallElevated

A35. Check for passwords in Winlogon registry
A37. Get the current user
A31. Get a list of services
A28. Check directory permissions with icacls
A36. Get a list of local users and administrators
A26. Check the ACLs of service registries with Get-ACL
A25. Check service permissions with accesschk64
A34. Check AlwaysInstallElevated bits
A4. Create a malicious MSI file in Kali Linux
A8. Download a malicious MSI file in Windows
A21. Install a malicious MSI file

The number of actions used by the agent to escalate privileges during the testing phase is given in Table 9. We compare the following agents:

• the oracle agent, which assumes complete knowledge of the system, including the vulnerability;
• the optimal policy, which is approximated using a fixed ruleset crafted by an expert;
• the deterministic RL agent, which selects the action with the highest probability;
• the stochastic RL agent, which samples the action from the probabilities produced by the policy network;
• an agent taking random actions.

For all random trials, we used 1,000 samples and computed the average number of actions used by the agent. Because of the computational cost of running thousands of episodes, all tests involving randomness were run in a simulated environment similar to the testing VM. The results suggest that the policy of the deterministic agent is close to optimal. The addition of stochasticity to action selection has a slightly negative effect on the performance, but it increases the variability of the agent's actions, making the agent potentially more difficult to detect.

We additionally tested the ability of the agent to generalize to multiple vulnerabilities which might be simultaneously present in the system. This was done in three ways. First, the agent was evaluated in an environment with six different types of vulnerable services present. Second, the agent was evaluated in an environment with all twelve vulnerabilities present. Finally, random combinations of any two vulnerabilities were tested. The agent had little trouble performing privilege escalation in any of these scenarios.


Table 9: Number of actions used to escalate privileges

Vulnerability   Oracle (full knowledge)   Expert (≈ optimal policy)   Deterministic RL   Stochastic RL   Random
1               10                        20                          24                 25.3            231.2
2               5                         10                          9                  8.0             152.2
3               7                         7                           8                  8.2             206.0
4               5                         9                           8                  9.0             147.7
5               7                         10                          15                 11.5            212.2
6               7                         7                           8                  8.3             208.1
7               6                         9                           14                 14.1            171.7
8               5                         15                          11                 11.0            156.0
9               3                         11                          3                  10.9            96.0
10              4                         13                          14                 13.5            162.8
11              6                         9                           17                 17.6            166.9
12              6                         8                           13                 13.3            165.0
AVG             5.9                       10.7                        12.0               12.6            173.0

As a matter of interest, we finally evaluated the agent's performance against a host running an up-to-date version of a standard endpoint protection software, Microsoft Security Essentials, with real-time protection enabled. As expected, the AV software managed to recognize the default malicious executables created by msfvenom in Kali Linux. However, the AV software failed to recognize the custom DLL compiled by the agent, and hence, privilege escalation using DLL hijacking was possible. Moreover, the AV software failed to detect any methods that did not involve a downloaded malicious payload, such as re-configuring the vulnerable service to execute a command that added the user as a local administrator. Hence, privilege escalation was possible in many of the scenarios, even with up-to-date AV software present. It should be noted that these techniques fall beyond the scope of file-based threat detection used by standard antivirus software and would require more advanced protection strategies to counter, such as behavioral- or heuristics-based detection. The agent's performance against such detection engines was considered to be beyond the scope of the project and was not assessed.

6 DISCUSSION
Our work demonstrates that it is possible to train a deep reinforcement learning agent to perform local privilege escalation in Windows 7 using known vulnerabilities. Our method is the first reinforcement learning agent, to the best of our knowledge, that performs privilege escalation with an extensive state space and a high-level action space with easily customizable low-level implementations. Despite being trained in simulated environments, the test results demonstrate that our agent can solve the formalized privilege escalation problem in a close-to-optimal fashion on full-featured Windows machines with realistic vulnerabilities.

The efficacy of our implementation is limited if up-to-date antivirus software is running on the victim host because only a handcrafted DLL is used, whereas the malicious executables are created using Metasploit with default settings. However, if the mapping from the high-level actions to the low-level commands (see Table 4) was improved so that more sophisticated payloads were used or the action space was expanded with actions for defense evasion, a reinforcement learning agent could be capable of privilege escalation in hosts with up-to-date antivirus software but without an advanced detection and response system.

While simple attacks are likely to be detected by advanced breach detection solutions, not all companies employ those for various reasons. The constant stream of breaches seen in the news reflects that reality. Moreover, if adversaries develop RL-based tools for learning and automating adversarial actions, they might prefer to target networks that are less likely to be running breach detection software.

The current threat level presented by reinforcement learning agents is most likely limited to agents capable of exploiting existing well-known vulnerabilities. The same could be achieved by a scripted agent with a suitable hard-coded logic. However, the RL approach offers a number of benefits compared to a scripted agent:

• Scripting an attacking agent can be difficult when the number of potentially exploitable vulnerabilities grows and if the attacked system contains an IDS.
• The probabilistic approach of our RL agent will produce more varied attacks (and attempted attacks) than a scripted robot that follows hard-coded rules, which makes our agent more usable for training ML-based defenses and testing and evaluating intrusion detection systems.
• An RL agent may be quickly adapted to changes in the environment. For example, if certain sequences of actions cause an alarm raised by an intrusion detection system, the agent might learn to take a different route, which is not detectable by the IDS. This would produce invaluable information for strengthening the defense system.

In the long run, RL agents could have the potential to discover and exploit novel unseen vulnerabilities, which would have an enormous impact on the field.


To implement this idea, agents would most likely need to interact with an authentic environment, which would require a great deal of engineering effort and a huge amount of computational resources. Crafting the action space would nevertheless be most likely unavoidable within the constraints of the current RL methods. However, the development should go in the direction of making the actions more atomic and minimizing the amount of prior knowledge used in designing the action space. This could allow the agent to encompass more vulnerabilities and could be a way to get closer to the ultimate goal of discovering new vulnerabilities.

Another research direction is to increase the complexity of the learning task. In this first step, we wanted to understand how RL-powered attacks could work in a constrained, varied setup, and our key result is showing that the RL approach works for such a complex learning task. Defeating defensive measures or expanding to a wider range of target environments would be a research topic with a significantly larger scope. It would be interesting to see whether a reinforcement learning agent can perform more steps in the cyber security kill chain, such as defense evasion. It would also be interesting to train the agent in an environment with an intrusion detection system or a defensive RL agent and perform multi-agent reinforcement learning, which has been done in previous research on post-exploitation in simulated environments [12].

Ethical Considerations
The primary goal of this work is to contribute to building resistant defense systems that can detect and prevent various types of potential attacks. Rule-based defense systems can be effective, but as the number of attack scenarios grows, they become increasingly difficult to build and maintain. Data-driven defensive systems trained with machine learning offer a promising alternative, but implementing this idea in practice is challenged by the scarcity of available training data in this domain. In this work, we present a possible solution to this problem by training a reinforcement learning agent to perform malicious activities and therefore to generate invaluable training data for improving defense systems. The presented approach can also be useful to support the red teaming activities performed by cyber security experts by automating some steps in the kill chain. However, the system developed in this project can potentially be dangerous in the wrong hands. Hence, the code created in this project will not be open-sourced or released to the public.

ACKNOWLEDGMENTS
We thank the Security and Software Engineering Research Center S2ERC for funding our work. In addition, we thank David Karpuk, Andrew Patel, Paolo Palumbo, Alexey Kirichenko, and Matti Aksela from F-Secure for their help in running this project and their domain expertise, Tuomas Aura for giving us highly valuable feedback, and the Academy of Finland for the support within the Flagship Programme Finnish Center for Artificial Intelligence (FCAI).

REFERENCES
[1] Mohammad Alauthman, Nauman Aslam, Mouhammd Al-Kasassbeh, Suleman Khan, Ahmad Al-Qerem, and Kim-Kwang Raymond Choo. 2020. An efficient reinforcement learning-based Botnet detection approach. Journal of Network and Computer Applications 150 (2020), 102479.
[2] Hyrum S Anderson, Anant Kharkar, Bobby Filar, David Evans, and Phil Roth. 2018. Learning to Evade Static PE Machine Learning Malware Models via Reinforcement Learning. arXiv preprint arXiv:1801.08917 (2018).
[3] Andy Applebaum, Doug Miller, Blake Strom, Chris Korban, and Ross Wolf. 2016. Intelligent, Automated Red Team Emulation. In Proceedings of the 32nd Annual Conference on Computer Security Applications. 363–373.
[4] Giovanni Apruzzese, Michele Colajanni, Luca Ferretti, Alessandro Guido, and Mirco Marchetti. 2018. On the Effectiveness of Machine and Deep Learning for Cyber Security. In 2018 10th International Conference on Cyber Conflict (CyCon). IEEE, 371–390.
[5] John A Bland, Mikel D Petty, Tymaine S Whitaker, Katia P Maxwell, and Walter Alan Cantrell. 2020. Machine Learning Cyberattack and Defense Strategies. Computers & Security 92 (2020), 101738.
[6] Francesco Caturano, Gaetano Perrone, and Simon Pietro Romano. 2021. Discovering reflected cross-site scripting vulnerabilities using a multiobjective reinforcement learning environment. Computers & Security 103 (2021), 102204.
[7] Ünal Çavuşoğlu. 2019. A new hybrid approach for intrusion detection using machine learning methods. Applied Intelligence 49, 7 (2019), 2735–2761.
[8] Moitrayee Chatterjee and Akbar-Siami Namin. 2019. Detecting Phishing Websites through Deep Reinforcement Learning. In 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), Vol. 2. IEEE, 227–232.
[9] Tong Chen, Jiqiang Liu, Yingxiao Xiang, Wenjia Niu, Endong Tong, and Zhen Han. 2019. Adversarial attack and defense in reinforcement learning-from AI security view. Cybersecurity 2, 11 (2019), 1–22.
[10] Ankur Chowdhary, Dijiang Huang, Jayasurya Sevalur Mahendran, Daniel Romo, Yuli Deng, and Abdulhakim Sabur. 2020. Autonomous Security Analysis and Penetration Testing. In 2020 16th International Conference on Mobility, Sensing and Networking (MSN). IEEE, 508–515.
[11] Zhihua Cui, Fei Xue, Xingjuan Cai, Yang Cao, Gai-ge Wang, and Jinjun Chen. 2018. Detection of Malicious Code Variants Based on Deep Learning. IEEE Transactions on Industrial Informatics 14, 7 (2018), 3187–3196.
[12] Richard Elderman, Leon JJ Pater, Albert S Thie, Madalina M Drugan, and Marco Wiering. 2017. Adversarial Reinforcement Learning in a Cyber Security Simulation. In ICAART (2). 559–566.
[13] László Erdődi and Fabio Massimo Zennaro. 2021. The Agent Web Model: modeling web hacking for reinforcement learning. International Journal of Information Security (2021), 1–17.
[14] Zhiyang Fang, Junfeng Wang, Jiaxuan Geng, and Xuan Kan. 2019. Feature Selection for Malware Detection Based on Reinforcement Learning. IEEE Access 7 (2019), 176177–176187.
[15] Zhiyang Fang, Junfeng Wang, Boya Li, Siqi Wu, Yingjie Zhou, and Haiying Huang. 2019. Evading Anti-Malware Engines With Deep Reinforcement Learning. IEEE Access 7 (2019), 48867–48879.
[16] Aidin Ferdowsi, Ursula Challita, Walid Saad, and Narayan B Mandayam. 2018. Robust Deep Reinforcement Learning for Security and Safety in Autonomous Vehicle Systems. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 307–312.
[17] Jeff Forcier. 2021. Paramiko: A Python Implementation of SSHv2. http://www.paramiko.org/index.html
[18] Mohamed C Ghanem and Thomas M Chen. 2018. Reinforcement Learning for Intelligent Penetration Testing. In 2018 Second World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4). IEEE, 185–192.
[19] Mohamed C Ghanem and Thomas M Chen. 2020. Reinforcement Learning for Efficient Network Penetration Testing. Information 11, 1 (2020), 6.
[20] Andy Greenberg. 2018. The Untold Story of NotPetya, the Most Devastating Cyberattack in History. Wired, August 22 (2018).
[21] Guoan Han, Liang Xiao, and H Vincent Poor. 2017. Two-Dimensional Anti-Jamming Communication Based on Deep Reinforcement Learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2087–2091.
[22] Yi Han, Benjamin IP Rubinstein, Tamas Abraham, Tansu Alpcan, Olivier De Vel, Sarah Erfani, David Hubczenko, Christopher Leckie, and Paul Montague. 2018. Reinforcement Learning for Autonomous Defence in Software-Defined Networking. In International Conference on Decision and Game Theory for Security. Springer, 145–165.
[23] Xiaofan He, Huaiyu Dai, and Peng Ning. 2016. Faster Learning and Adaptation in Security Games by Exploiting Information Asymmetry. IEEE Transactions on Signal Processing 64, 13 (2016), 3429–3443.
[24] Danny Hendler, Shay Kels, and Amir Rubin. 2018. Detecting Malicious PowerShell Commands using Deep Neural Networks. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security. 187–197.
[25] Peter J Huber. 1992. Robust Estimation of a Location Parameter. In Breakthroughs in Statistics. Springer, 492–518.
[26] TaeGuen Kim, BooJoong Kang, Mina Rho, Sakir Sezer, and Eul Gyu Im. 2018. A Multimodal Deep Learning Method for Android Malware Detection using Various Features. IEEE Transactions on Information Forensics and Security 14, 3 (2018), 773–788.

[27] Diederik P Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014).
[28] Ryusei Maeda and Mamoru Mimura. 2021. Automating post-exploitation with deep reinforcement learning. Computers & Security 100 (2021), 102108.
[29] Dean Richard McKinnel, Tooska Dargahi, Ali Dehghantanha, and Kim-Kwang Raymond Choo. 2019. A systematic literature review and meta-analysis on artificial intelligence in penetration testing and vulnerability assessment. Computers & Electrical Engineering 75 (2019), 175–188.
[30] Nikola Milosevic, Ali Dehghantanha, and Kim-Kwang Raymond Choo. 2017. Machine learning aided Android malware classification. Computers & Electrical Engineering 61 (2017), 266–274.
[31] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous Methods for Deep Reinforcement Learning. In International Conference on Machine Learning. PMLR, 1928–1937.
[32] Thanh Thi Nguyen and Vijay Janapa Reddi. 2019. Deep Reinforcement Learning for Cyber Security. arXiv preprint arXiv:1906.05799 (2019).
[33] Zhen Ni and Shuva Paul. 2019. A Multistage Game in Smart Grid Security: A Reinforcement Learning Solution. IEEE Transactions on Neural Networks and Learning Systems 30, 9 (2019), 2684–2695.
[34] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[35] Rapid7. 2021. Metasploit Framework. https://docs.rapid7.com/metasploit/msf-overview/
[36] Richard S Sutton and Andrew G Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.
[37] Isao Takaesu. 2018. Deep Exploit. https://github.com/13o-bbr-bbq/machine_learning_security/tree/master/DeepExploit
[38] Microsoft 365 Defender Research Team. 2021. Gamifying machine learning for stronger security and AI models. https://www.microsoft.com/security/blog/2021/04/08/gamifying-machine-learning-for-stronger-security-and-ai-models/
[39] Xiaoyue Wan, Geyi Sheng, Yanda Li, Liang Xiao, and Xiaojiang Du. 2017. Reinforcement Learning Based Mobile Offloading for Cloud-Based Malware Detection. In GLOBECOM 2017-2017 IEEE Global Communications Conference. IEEE, 1–6.
[40] Yufei Wang, Zheyuan Ryan Shi, Lantao Yu, Yi Wu, Rohit Singh, Lucas Joppa, and Fei Fang. 2019. Deep Reinforcement Learning for Green Security Games with Real-Time Information. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 1401–1408.
[41] Liang Xiao, Yan Li, Guolong Liu, Qiangda Li, and Weihua Zhuang. 2015. Spoofing Detection with Reinforcement Learning in Wireless Networks. In 2015 IEEE Global Communications Conference (GLOBECOM). IEEE, 1–5.
[42] Liang Xiao, Xiaoyue Wan, Canhuang Dai, Xiaojiang Du, Xiang Chen, and Mohsen Guizani. 2018. Security in Mobile Edge Caching with Reinforcement Learning. IEEE Wireless Communications 25, 3 (2018), 116–122.
[43] Fabio Massimo Zennaro and Laszlo Erdodi. 2021. Modeling Penetration Testing with Reinforcement Learning Using Capture-the-Flag Challenges: Trade-offs between Model-free Learning and A Priori Knowledge. arXiv preprint arXiv:2005.12632 (2021).

A SUPPLEMENTARY MATERIAL FOR RL
The hyperparameters used for training are listed below:

Parameter   Explanation                          Value
gamma       Discount rate                        0.995
alpha       Learning rate for Adam               0.001
beta_1      First moment decay rate for Adam     0.9
beta_2      Second moment decay rate for Adam    0.999
N_max       Maximum number of steps              1000
N_serv      Number of services                   [1, 20]
N_auto      Number of autoruns                   [1, 10]
N_task      Number of tasks                      [1, 10]
M_dlls      Number of DLLs loaded per service    [1, 4]
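For concreteness, the sketch below shows one way the values in the table above could be wired into a PyTorch training setup (PyTorch [34] and Adam [27] are cited above). The class, function, and variable names are illustrative assumptions rather than the authors' implementation, and the policy network is assumed to be an ordinary nn.Module.

# Illustrative sketch only: plugging the hyperparameters from the table above
# into a PyTorch Adam optimizer and a discounted-return computation.
# All names below are hypothetical and not taken from the paper's code base.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class TrainingConfig:
    gamma: float = 0.995          # discount rate
    lr: float = 0.001             # learning rate for Adam
    betas: tuple = (0.9, 0.999)   # first and second moment decay rates for Adam
    n_max_steps: int = 1000       # maximum number of steps per episode
    n_services: tuple = (1, 20)   # services sampled per randomized environment
    n_autoruns: tuple = (1, 10)   # autoruns sampled per environment
    n_tasks: tuple = (1, 10)      # scheduled tasks sampled per environment
    m_dlls: tuple = (1, 4)        # DLLs loaded per service


def make_optimizer(policy: nn.Module, cfg: TrainingConfig) -> torch.optim.Adam:
    # Adam configured with the learning rate and moment decay rates listed above.
    return torch.optim.Adam(policy.parameters(), lr=cfg.lr, betas=cfg.betas)


def discounted_returns(rewards, cfg: TrainingConfig):
    # G_t = r_t + gamma * G_{t+1}, computed backwards over one episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + cfg.gamma * g
        returns.append(g)
    return list(reversed(returns))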
B ACTIONS REQUIRED TO EXPLOIT THE VULNERABILITIES
The sequences of actions that can be used to exploit the twelve vulnerabilities are presented in this section. The actions used in multiple scenarios are marked in blue.

(1.1) Missing DLL:
A37. Get the current user
A31. Get a list of services
A29. Analyze service executables for DLLs
A30. Search for DLLs
A38. Get the Windows path
A28. Check directory permissions with icacls
A3. Compile a custom malicious DLL in Kali Linux
A7. Download a malicious DLL in Windows
A16. Move a malicious DLL to a folder on Windows path to replace a missing DLL
A9. Start an exploited service

(1.2) Writable DLL:
A37. Get the current user
A31. Get a list of services
A29. Analyze service executables for DLLs
A30. Search for DLLs
A27. Check executable permissions with icacls
A3. Compile a custom malicious DLL in Kali Linux
A7. Download a malicious DLL in Windows
A15. Overwrite a DLL
A9. Start an exploited service

(2) Re-configurable Service:
A37. Get the current user
A31. Get a list of services
A25. Check service permissions with accesschk64
A18. Re-configure service to add the user to local administrators
A9. Start an exploited service

(3) Unquoted Service Path:
A37. Get the current user
A31. Get a list of services
A28. Check directory permissions with icacls
A2. Create a malicious service executable in Kali Linux
A6. Download a malicious service executable in Windows
A14. Move a malicious executable so that it is executed by an unquoted service path
A9. Start an exploited service

(4) Modifiable ImagePath:
A37. Get the current user
A31. Get a list of services
A26. Check the ACLs of the service registry with Get-ACL
A20. Change service registry to add the user to local administrators
A9. Start an exploited service
(5) Writable Service Executable:
A37. Get the current user
A31. Get a list of services
A27. Check executable permissions with icacls
A2. Create a malicious service executable in Kali Linux
A6. Download a malicious service executable in Windows
A13. Overwrite a service binary
A9. Start an exploited service

(6) Missing Service Executable:
A37. Get the current user
A31. Get a list of services
A28. Check directory permissions with icacls
A2. Create a malicious service executable in Kali Linux
A6. Download a malicious service executable in Windows
A13. Overwrite a service binary
A9. Start an exploited service

(7) Writable AutoRun Executable:
A37. Get the current user
A32. Get a list of AutoRuns
A27. Check executable permissions with icacls
A1. Create a malicious executable in Kali Linux
A5. Download a malicious executable in Windows
A11. Overwrite the executable of an AutoRun

(8) AlwaysInstallElevated:
A37. Get the current user
A34. Check AlwaysInstallElevated bits
A4. Create a malicious MSI in Kali Linux
A8. Download a malicious MSI in Windows
A21. Install a malicious MSI file

(9) WinLogon Registry:
A35. Check for passwords in Winlogon registry
A36. Get a list of local users and administrators
A24. Test credentials

(10) Unattend File:
A22. Search for unattend* sysprep* unattended* files
A23. Decode base64 credentials
A36. Get a list of local users and administrators
A24. Test credentials

(11) Writable Task Binary:
A37. Get the current user
A33. Get a list of scheduled tasks
A27. Check executable permissions with icacls
A1. Create a malicious executable in Kali Linux
A5. Download a malicious executable in Windows
A12. Overwrite the executable of a scheduled task

(12) Writable Startup Folder:
A37. Get the current user
A32. Get a list of AutoRuns
A28. Check directory permissions with icacls
A1. Create a malicious executable in Kali Linux
A5. Download a malicious executable in Windows
A11. Overwrite the executable of an AutoRun
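The chains above are simple enough to be written down as plain data, for instance to replay a known-good chain or to check whether an agent's trajectory reproduces one. The sketch below is an illustration under that assumption; the dictionary keys and the helper function are hypothetical and are not part of the paper's code.

# Illustrative only: a few of the exploit chains above encoded as data.
# The keys and the replay helper are hypothetical, not from the paper.
EXPLOIT_CHAINS = {
    "1.1 Missing DLL": ["A37", "A31", "A29", "A30", "A38", "A28", "A3", "A7", "A16", "A9"],
    "5 Writable Service Executable": ["A37", "A31", "A27", "A2", "A6", "A13", "A9"],
    "8 AlwaysInstallElevated": ["A37", "A34", "A4", "A8", "A21"],
}


def replay(chain_name, execute_action):
    # Feed the actions of one chain, in order, to a caller-supplied executor,
    # e.g. a function that maps an action ID to a concrete shell command.
    for action in EXPLOIT_CHAINS[chain_name]:
        execute_action(action)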
C COMMAND-LINE EXAMPLE
We exemplify our mapping from actions to commands by showing the commands taken by the agent to escalate privileges by exploiting a service with weak folder permissions and a missing binary (see Table 6). Note that all the commands disclosed below have been derived from public sources (given in the footnotes) and can be recreated by security practitioners. Furthermore, none of the commands are proprietary to F-Secure.

A35. Check for passwords in Winlogon registry:1
reg query "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon" /v DefaultUsername
reg query "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon" /v DefaultPassword

A37. Get the current user:2
whoami

A31. Get a list of services:3
wmic service get name,pathname,startname,startmode,started /format:csv

A28. Check directory permissions with icacls:2
icacls.exe "c:\windows\system32"
icacls.exe "c:\windows"
. . . (15 rows skipped)
icacls.exe "c:\program files (x86)\microsoft\edge"
icacls.exe "c:\program files\missing file service"
icacls.exe "c:\program files"
. . . (4 rows skipped)
icacls.exe "c:\windows\system32\wbem"
icacls.exe "c:\program files\windows media player"

A2. Create a malicious service executable in Kali Linux:4
sudo -S msfvenom -p windows/exec CMD='net localgroup administrators user /add' -f exe-service -o java_updater_svc

A6. Download a malicious service executable in Windows:5
powershell.exe -command "Invoke-WebRequest -Uri '82.130.20.144/java_updater_service' -OutFile 'C:\Users\user\Downloads\java_updater_svc'"
move /y "C:\Users\user\Downloads\java_updater_svc" "C:\Users\user\Downloads\java_updater_svc.exe"

A13. Overwrite a service binary:4
copy /y "C:\Users\user\Downloads\java_updater_svc.exe" "c:\program files\missing file service\missingservice.exe"

A9. Start an exploited service:4
sc start missingsvc

1 https://github.com/sagishahar/lpeworkshop/blob/master/Lab%20Exercises%20Walkthrough%20-%20Windows.pdf
2 https://sushant747.gitbooks.io/total-oscp-guide/content/privilege_escalation_windows.html
3 https://book.hacktricks.xyz/windows/windows-local-privilege-escalation
4 https://infosecwriteups.com/privilege-escalation-in-windows-380bee3a2842
5 https://adamtheautomator.com/powershell-download-file/
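To make the action-to-command mapping concrete, the sketch below shows one way such a mapping could be driven programmatically, assuming commands are executed on the target over SSH with Paramiko [17]. The host address, credentials, and all names in the snippet are placeholders for illustration; the paper's actual execution harness is not public.

# Illustrative sketch: a minimal action-to-command mapping executed over SSH
# with Paramiko [17]. Host, credentials, and names are placeholders; this is
# not the authors' implementation.
import paramiko

ACTION_COMMANDS = {
    "A37": "whoami",
    "A31": "wmic service get name,pathname,startname,startmode,started /format:csv",
    "A9": "sc start missingsvc",
}


def run_action(client, action):
    # Execute the command mapped to the given action ID and return its output.
    stdin, stdout, stderr = client.exec_command(ACTION_COMMANDS[action])
    return stdout.read().decode(errors="replace")


if __name__ == "__main__":
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect("192.0.2.10", username="user", password="password")  # placeholder target
    print(run_action(client, "A37"))
    client.close()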