
HoneyGPT: Breaking the Trilemma in Terminal Honeypots with Large Language Model

arXiv:2406.01882v1 [cs.CR] 4 Jun 2024

Ziyang Wang
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
wangziyang2022@iie.ac.cn

Jianzhou You
Sangfor Technologies Inc., Shenzhen, China
youjianzhou@gmail.com

Haining Wang
Virginia Tech, Arlington, Virginia, USA
hnw@vt.edu

Tianwei Yuan
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
yuantianwei@iie.ac.cn

Shichao Lv
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
lvshichao@iie.ac.cn

Yang Wang
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
yang.wang1@siat.ac.cn

Limin Sun
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
sunlimin@iie.ac.cn
ABSTRACT

Honeypots, as a strategic cyber-deception mechanism designed to emulate authentic interactions and bait unauthorized entities, continue to struggle with balancing flexibility, interaction depth, and deceptive capability despite their evolution over decades. Often they also lack the capability of proactively adapting to an attacker's evolving tactics, which restricts the depth of engagement and subsequent information gathering. Under this context, the emergent capabilities of large language models, in tandem with pioneering prompt-based engineering techniques, offer a transformative shift in the design and deployment of honeypot technologies. In this paper, we introduce HoneyGPT, a pioneering honeypot architecture based on ChatGPT, heralding a new era of intelligent honeypot solutions characterized by their cost-effectiveness, high adaptability, and enhanced interactivity, coupled with a predisposition for proactive attacker engagement. Furthermore, we present a structured prompt engineering framework that augments long-term interaction memory and robust security analytics. This framework, integrating chain-of-thought tactics attuned to honeypot contexts, enhances interactivity and deception, deepens security analytics, and ensures sustained engagement.

The evaluation of HoneyGPT includes two parts: a baseline comparison based on a collected dataset and a field evaluation in real scenarios for four weeks. The baseline comparison demonstrates HoneyGPT's remarkable ability to strike a balance among flexibility, interaction depth, and deceptive capability. The field evaluation further validates HoneyGPT's efficacy, showing its marked superiority in enticing attackers into more profound interactive engagements and capturing a wider array of novel attack vectors in comparison to existing honeypot technologies.

1 INTRODUCTION

Honeypots have been widely used as effective cybersecurity tools for detecting, enticing, and understanding malicious activities on the Internet [2, 10, 12, 14, 25, 26, 29, 32, 33, 42]. These traps are designed to lure attackers, allowing security teams to gather their information and monitor their behaviors for threat analysis, as well as deflect ongoing attacks.

Terminal honeypots are a specialized type of honeypot for emulating real terminal systems. They are typically employed to attract hackers or malicious software targeting terminal systems. Terminal honeypots range from low-interaction variants that simulate a minimal set of protocols or service information [17], to medium-interaction honeypots that replicate a broader array of system functionalities at a higher deployment and maintenance cost, and high-interaction honeypots that leverage sophisticated virtualization technologies, such as VMware, to create convincing environments of terminal systems or even deploy actual systems for maximum authenticity.

However, existing terminal honeypots face a trilemma related to flexibility, interaction level, and deceptive capability, stemming from the limitations of programmatic approaches and the fixed nature of system environments.
In the trilemma, existing terminal honeypot designs struggle to achieve optimal performance across all three dimensions simultaneously, forcing defenders to compromise and select imperfect solutions.

It is very challenging to address the trilemma in practice. The primary methods for running honeypots are based on either programmatic simulations or real operating systems. On one hand, the programmatic simulation approach is limited by development costs, making it difficult to balance scalability and interaction depth. On the other hand, honeypots based on real operating systems suffer from rigidity, due to their inherent configurations that are short of flexibility. Moreover, both deployment methods lack intelligence during interactions, unable to provide responses that align with attackers' intentions to entice further engagement. This significantly limits the honeypot's deception capabilities. However, there has been a growing demand for honeypots that are more dynamic, intelligent, and cost-effective.

The development of Large Language Models (LLMs) [5, 6, 11, 35, 39] appears to offer a potential solution to these problems. By providing attack commands as queries to the large language model, it can generate responses tailored to the attackers' interests based on the provided prompt. This approach breaks the trilemma of traditional honeypots, as a single LLM can simultaneously simulate multiple high-interaction honeypots with different configurations. Furthermore, by designing prompts to guide the LLM, it is possible to better fulfill attackers' objectives and entice them into deeper levels of interaction.

However, designing honeypots based on LLMs still faces three major challenges. A prominent issue is the absence of a standardized approach to honeypot prompt design. To address this problem, this paper establishes a universal specification for honeypot prompt keywords. Another challenge arises from the native thinking constraints of LLMs, which cannot handle complex attack combinations. To address this challenge, we employ a Chain of Thought (CoT) strategy that allows the language model to answer the command's impact on the operating system during each interaction. When posing the next question, we integrate the consolidated changes in the operating system and the user command to enhance the large language model's understanding of the task. The third challenge lies in that the performance of LLMs in prolonged sessions is hampered by their context length limitations and inherent forgetting mechanisms. To manage this issue, we employ the CoT approach to assess the significance of each attacker command and its impact on the terminal system. This assessment allows us to strategically prune the session history, ensuring that only the most essential contextual information is retained within the confines of the context length, thereby optimizing the honeypot's effectiveness in extended interactions. The major contributions of this paper are summarized as follows.

1. We propose HoneyGPT, an intelligent terminal honeypot that addresses the dimensions of flexibility, interaction level, and deceptive capability, enabling extended interactions while overcoming the trilemma.

2. We propose a Chain of Thought strategy for terminal honeypot task scenarios, which enhances the task-solving capabilities of native LLMs and overcomes the inherent context length limitations. The introduction of CoT is essential to advancing honeypot technology and improving network security defense capabilities.

3. We conduct a baseline evaluation of HoneyGPT in comparison with traditional honeypots, utilizing open-source attack data captured by Cowrie. The results indicate that HoneyGPT outperforms conventional honeypots with respect to its composite capabilities in flexibility, level of interaction, and deception efficacy.

4. We deploy HoneyGPT and Cowrie, a programmatic honeypot, on the Internet for four weeks and observe that HoneyGPT captures more attack behaviors and longer interactions than Cowrie.

The rest of this paper is structured as follows. Section 2 provides background on LLMs, prompt engineering, and terminal honeypots. Section 3 describes the trilemma of flexibility, interaction, and deception in existing terminal honeypots. Section 4 presents our HoneyGPT solution, detailing its framework, concept, strategy, prompt management, and configuration. Section 5 validates the efficacy of HoneyGPT, providing baseline comparisons on deception, interaction, and flexibility, as well as field evaluations to measure its effectiveness in engaging attackers and capturing novel attack vectors. Section 6 outlines the limitations and potential future work, and finally, Section 7 concludes the paper.

2 BACKGROUND

In this section, we provide the background of Large Language Models, prompt engineering, and terminal honeypots.

2.1 Large Language Models

Large Language Models (LLMs) are a subset of pre-trained language models (PLMs), characterized by their voluminous parameter sets, reaching or even exceeding hundreds of billions, as defined by Kaplan et al. [11]. Since OpenAI introduced ChatGPT [19], LLMs have been widely used across a spectrum of disciplines [4, 6], particularly in the field of cybersecurity [8, 9, 16, 21, 22, 31]. Compared to previous models, LLMs offer two main advantages in interdisciplinary applications:

Firstly, LLMs display astonishing emergent abilities as identified by Wei et al. [35], which are competencies absent in smaller-scale models but manifest in larger configurations. Such models experience a quantum leap in performance relative to their diminutive counterparts. Prominent among these emergent abilities are:

• In-context learning: LLMs can, upon receiving natural language instructions or several task demonstrations, generate accurate outputs for test cases by continuing the input text sequence, circumventing the need for further training or gradient adjustments [4].
• Instruction following: LLMs possess the ability to comprehend and execute natural language instructions with minimal or even zero examples, adapting to new tasks [30]. By fine-tuning on tasks articulated through instructional prompts, LLMs have demonstrated proficiency in accurately handling tasks they have never seen before [20].
• Step-by-step reasoning: Distinct from conventional PLMs, LLMs are capable of deconstructing complex tasks into a sequence of discrete subtasks by employing prompt-based strategies such as chain-of-thought. It is conjectured that this capacity may be derived from their training on code [6, 36].
Secondly, LLMs transform the traditional employment of AI algorithm development and utilization, significantly diminishing the technical threshold for cross-disciplinary actors to harness such algorithms. In contrast to previous models, users engage with LLMs via a prompt interface (e.g., the GPT-3.5-turbo API), by formulating natural language prompts or directives to modulate the model's behavior to yield anticipated results. This empowers individuals who lack a deep understanding of model training paradigms but are conversant with prompt engineering to employ LLMs for task facilitation.

2.2 Prompt Engineering

Prompt engineering is the process of crafting prompts to better guide large language models to understand and resolve tasks. Here we briefly introduce prompt engineering from its core facets: components, design principles, and strategic approaches.

Components. The standard components of LLM prompts typically include:

• Task Description: Instructions that the LLM is expected to adhere to, expressed in natural language.
• Input Data: The requisite data for a task, presented in natural language.
• Contextual Information: The contextual or background information that can aid in the execution of a task.
• Prompt Style: The formulation of the prompt, which is tailored to optimize the model's responses, such as through role-playing or incremental reasoning.

Design Principles. The development of effective LLM prompts is predicated on:

• Clarity: Ensure instructions are explicit and communication is clear through symbolic delimiters, structured outputs, conditional logic, and "few-shot" examples.
• Deliberation: Allow LLMs time to reason by structuring prompts for sequential information processing using "chain of thought" techniques.

Prompting Strategies. The "Chain of Thought" (CoT) [5, 13, 15, 36, 38, 39, 41] paradigm has been shown to significantly enhance LLM performance on tasks that require complex reasoning by incorporating intermediary reasoning steps within the prompts. The zero-shot CoT, as introduced in foundational research, employs prompts such as "Let's think step by step" to shepherd LLMs to a logical conclusion, thus highlighting the significance of scale in emergent capabilities.

2.3 Terminal Honeypots

Honeypots are security tools designed as decoys to attract and monitor unauthorized or malicious attackers. Based on their implementation approaches, terminal honeypots can be categorized into three primary types: emulated honeypots, which are based on programmed simulations; real-system honeypots, which utilize real systems; and hybrid honeypots, which combine the features of the first two types.

Emulated Honeypots. Emulated honeypots are software constructs that simulate network services and devices to attract cyber adversaries. They are developed using conventional programming methodologies, as exemplified by Honeyd [17], Dionaea [23], and Cowrie [18]. These honeypots utilize rule-based matching and logical controls to mimic desired targets. Despite their effectiveness, they are constrained by the limitations of development and operational costs, preventing a full emulation of an operating system's intricate behaviors. The interactivity spectrum of these honeypots spans from minimal to moderate: Honeyd and Dionaea deliver low-interaction platforms rich in configuration possibilities, while Cowrie steps up to offer a medium-interaction milieu. However, this comes at the cost of diminished adaptability and increased operational costs. Such frameworks have laid the groundwork for a range of specialized offshoots [1, 14].

Real-system Honeypots. Real-system honeypots, by contrast, utilize genuine, unmodified system environments, offering a more authentic simulation of user interactions, file system activities, and operational behaviors [2, 32, 34]. They provide a high level of interaction, creating a convincing facade of a real system, which significantly increases the probability of capturing complex attack strategies that simpler honeypot architectures may miss. However, the inherent rigidity of their physical environments [1] requires individualized configuration for each new system variant, which incurs substantial deployment and maintenance costs.

This authenticity offers potential attackers a highly interactive environment, fostering a convincing deception that aids in uncovering intricate attack vectors that might elude detection by more rudimentary honeypots. However, a challenge arises from the security protocols native to real operating systems, such as stringent permission controls, which are not inherently conducive to fulfilling malicious requests. These protocols can inadvertently deter threat actors from engaging in more profound levels of system exploitation, thus potentially curtailing the honeypot's effectiveness as a lure. Additionally, the deployment and maintenance of real-system honeypots are encumbered by the need for individual configurations for each system variant, leading to significant resource expenditures.

Hybrid Honeypots. In an effort to combine the strengths of both emulated and real-system honeypots, a scalable hybrid honeypot framework has been proposed [37, 40]. This framework employs programmed methods for simulating low-level interactions, such as network protocols, while delegating high-level interactions, such as file executions, to a real file system. Although the hybrid model ostensibly enhances flexibility and interaction depth, it is not devoid of limitations. The framework may experience discordance between the simulated protocols and the actual file system responses, leading to a detectable discrepancy that astute attackers could potentially exploit. Furthermore, the rigidity in configuring a physical environment restricts adaptability and diminishes the diversity of supported environmental configurations, which could impair its capacity to effectively deceive sophisticated attackers.
3 TRILEMMA OF EXISTING TERMINAL HONEYPOTS

The trilemma in honeypot design reflects the intricate balancing act required to optimize the three key attributes of effective honeypot systems: interaction depth, flexibility, and deception ability. Honeypots from their inception have grappled with this trilemma, consistently facing trade-offs that impede their overall efficacy.

3.1 Limited Flexibility

The narrow concept of flexibility within honeypot design is indicative of the system's adaptability in authentically replicating a variety of systems and services. Such adaptability is pivotal for honeypots, as it underpins their ability to create authentic and persuasive simulations of various operational environments. Yet, the quest for such flexibility is one of the most formidable challenges in honeypot development, with the central issue being the support for a wide range of operating system versions and nuanced system configurations.

Contemporary honeypots often struggle to provide sufficient flexibility, particularly regarding the support of diverse operating system versions and security configurations. The pursuit of increased flexibility can compromise the depth of interaction and the honeypot's deception ability.

Code-based emulated honeypots exemplify this issue, as they require extensive resources for the creation and maintenance of simulations that encompass numerous operating system versions and environmental configurations. Such complexity leads to substantial development and operational costs, and no code-based honeypot has yet managed to seamlessly transition across varied terminal environments while maintaining high interaction fidelity. Real-system honeypots offer detailed interactions but lack the adaptability of their emulated counterparts, making them difficult to reconfigure in response to evolving threats. They also impose significant management overhead, due to the requirement of maintaining real operating system environments.

Moreover, traditional honeypot offerings remain constrained to single-dimensional deception techniques. As to leveraging the data they capture, there is a conspicuous lack of flexibility in integrating security analysis, necessitating distinct systems dedicated to data processing and analysis.

The core challenge, therefore, is to architect honeypots that strike an optimal balance between delivering authentic, complex interactions and possessing the agility to counter novel attacks. Achieving this equilibrium should not exacerbate costs or inflate management burdens, crafting a solution that is as economical as it is effective.

3.2 Limited Interaction

Due to limitations in implementation methods and development costs, code-based terminal honeypots are restricted to simple rule matching and lack the capability to execute complex commands as a real operating system does. This limitation significantly reduces the level of interaction these honeypots can manage, making them inadequate in responding to complex commands issued by attackers. For instance, Cowrie, a commonly used medium-interaction honeypot, is unable to handle complex commands that involve concatenation with pipe symbols (|). The inability to process complex operations can lead to honeypot instability or crashes, thereby failing to further engage attackers. Enhancing the interaction of honeypots is crucial for increasing their reliability and effectiveness.

3.3 Limited Deception

A key factor contributing to limited deception is the lack of dynamic and adaptive behavior in existing honeypot systems. Code-based terminal honeypots typically rely on static configurations and predefined matching mechanisms, while honeypots based on real operating systems often perform limited and monotonous simulations of the operating environment due to cost constraints. Experienced attackers can easily identify these singular and rigid environment configurations. The lack of variability reduces the effectiveness of deception because attackers can quickly recognize and bypass the simulated environment of the honeypot. Additionally, code-based terminal honeypots lack authentic user interactions and the realism of contextual information, further limiting their ability to convincingly simulate real systems. When attackers perform complex operations such as file execution or process communication, code-based honeypots struggle to replicate authentic user interactions or provide the depth of a real system.

In addition, in terms of enticing attackers into deeper-level attacks, current honeypots often fall short. Whether it is a code-based honeypot like Cowrie or a honeypot based on real systems, the limited configurations and restricted permissions provided to attackers often fail to meet the intentions of attackers. Catering actions such as modifying user passwords, terminating competing processes, or viewing graphics card information, which could effectively boost an attacker's motivation, are often overlooked by existing honeypots. Especially in a real system environment, the operating system's characteristics, such as permission control mechanisms and limited configurations (such as hardware devices and account password configurations), have limitations in fulfilling the attacker's attack intentions.

4 HONEYGPT

To address the trilemma of existing honeypots, we introduce HoneyGPT, an advanced honeypot framework based on ChatGPT. This framework transforms the conventional "request-response" message interaction (based on terminal protocols) into a "question-answer" text interaction (based on the ChatGPT API).

As shown in Figure 1, the framework consists of three main components: the Terminal Protocol Proxy, the Prompt Manager, and ChatGPT. The Terminal Protocol Proxy is responsible for fundamental protocol tasks, such as establishing connections, fingerprint emulation, message encapsulation, and parsing. We utilize the protocol layer of the Cowrie honeypot to develop the Terminal Protocol Proxy, which supports protocols such as SSH and Telnet. The Prompt Manager is tasked with constructing prompts based on commands sent by attackers and interacting with ChatGPT. ChatGPT generates responses based on the provided prompts. After receiving a response from ChatGPT, the Prompt Manager extracts the terminal output and forwards it to the Terminal Protocol Proxy, which then encapsulates it into messages and returns them to attackers. Since the Terminal Protocol Proxy repurposes the work of the Cowrie honeypot and is not a primary innovation, we will predominantly elaborate on the design of the Prompt Manager in the following section.

Figure 1: HoneyGPT Framework
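To make this message flow concrete, the sketch below shows how the Terminal Protocol Proxy might hand each parsed attacker command to the Prompt Manager and return the generated terminal echo over the session. All names here (serve_session, prompt_manager.answer) are illustrative placeholders under assumed interfaces, not the actual HoneyGPT implementation.

```python
# Illustrative sketch of the proxy loop that turns a terminal "request-response"
# into an LLM "question-answer" exchange (placeholder names, not HoneyGPT's code).
def serve_session(read_message, write_message, prompt_manager):
    """Terminal Protocol Proxy side: decode each attacker message, obtain the
    terminal echo from the Prompt Manager, and send it back over the session."""
    while True:
        raw = read_message()                     # framed by the SSH/Telnet layer
        if raw is None:                          # session closed by the attacker
            break
        command = raw.decode(errors="replace").strip()
        answer = prompt_manager.answer(command)  # prompt construction + ChatGPT call
        write_message((answer + "\n").encode())  # re-encapsulate and return
```

The connection handling, fingerprint emulation, and message framing themselves are inherited from Cowrie's protocol layer; only the command-handling step is replaced by the LLM query.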
4.1 Concept and Strategy

We first articulate the fundamental concepts and methodological strategies that are integral to the design of the Prompt Manager.

4.1.1 Concept of HoneyGPT. We define the interaction history as a sequence of N question-answer pairs:

{(𝑄1, 𝐴1), (𝑄2, 𝐴2), . . . , (𝑄𝑁, 𝐴𝑁)}    (1)

To obtain the response 𝐴𝑖 from the 𝑖-th interaction, we construct complex prompts. The operational logic of HoneyGPT during the 𝑖-th interaction is showcased in Figure 2. The 𝑖-th HoneyGPT interaction is formally defined by Equation 2.

(𝐴𝑖, 𝐶𝑖, 𝐹𝑖) = ChatGPT(𝑃, 𝑆, 𝐻𝑖, 𝑆𝑅𝑖, 𝑄𝑖)    (2)

The components 𝑃, 𝑆, 𝐻𝑖, 𝑆𝑅𝑖, and 𝑄𝑖 together constitute the prompt for the 𝑖-th interaction. Here 𝑃 and 𝑆 represent the static configuration of the prompt, remaining unchanged throughout the interactions. 𝑄𝑖 is the command sent by the attacker during the 𝑖-th interaction. 𝐻𝑖 and 𝑆𝑅𝑖 are dynamic aspects of the prompt, updated after each interaction by the Prompt Manager 𝑀. The Prompt Manager synthesizes these variables into the new prompt for subsequent dialogue sessions, as defined by:

𝑀(𝑃, 𝑆, 𝑄𝑖, 𝑆𝑅𝑖, 𝐶𝑖, 𝐹𝑖, 𝐴𝑖, 𝑤) = (𝑃, 𝑆, 𝐻𝑖+1, 𝑆𝑅𝑖+1, 𝑄𝑖+1)    (3)

The following are definitions for each variable:

Honeypot Principle 𝑃 defines the strategies and guidelines used to create a honeypot.

Honeypot Setting 𝑆 refers to the initial configuration set within the prompt for the honeypot.

History of Interaction 𝐻𝑖 = {(𝑄1, 𝐴1), (𝑄2, 𝐴2), . . . , (𝑄𝑖−1, 𝐴𝑖−1)} provides ChatGPT with the essential context required to support its capacity for extended interaction memory. To ensure that the length of 𝐻𝑖 remains within the context length limits of ChatGPT, the Prompt Manager 𝑀 implements a pruning algorithm to regulate the size of the historical context. This algorithm judiciously trims components of 𝐻𝑖 that have less significance, striking a balance between retaining pertinent context and maintaining the model's operational efficiency within its architectural constraints.

System State Register 𝑆𝑅𝑖 refers to a collection of system states communicated to ChatGPT at the 𝑖-th interaction: {(𝐶1), (𝐶2), . . . , (𝐶𝑖−1)}, where each state corresponds to the system's changes after each prior interaction, paralleling the order of 𝐻𝑖. 𝑆𝑅𝑖 is the manifestation of a "chain-of-thought" strategy, which HoneyGPT leverages to enhance its capacity for generating the subsequent response 𝐴𝑖.

Attacker Query 𝑄𝑖, in the context of HoneyGPT, refers to the command issued by the attacker to the honeypot terminal system during the 𝑖-th interaction.

Honeypot Answer 𝐴𝑖, in the context of HoneyGPT, denotes the command execution echo returned by the honeypot terminal to the attacker during the 𝑖-th interaction.

System State Evaluation 𝐸𝑖 is an assessment provided by ChatGPT regarding the system's state post-interaction, consisting of two parts: Impact Factor 𝐹𝑖, which values the impact of 𝑄𝑖 on the system, and 𝐶𝑖, which denotes the changes to the terminal system environment after 𝑄𝑖 execution.

Weaken Factor 𝑤 is the constant used in the pruning algorithm, which accounts for the timing of interactions, preferentially trimming older and less relevant exchanges to maintain a recent, pertinent dialogue history.

Figure 2: HoneyGPT Architecture

4.1.2 Strategy. The design of HoneyGPT incorporates a Chain of Thought (CoT) strategy to address the intrinsic limitations of ChatGPT in processing complex, extended dialogues and to circumvent the constraints of the prompt context length. The essence of CoT lies in compelling the LLM to decompose a multifaceted problem into a sequence of sub-problems, tackling them incrementally to enhance the LLM's ability to comprehend and address intricate issues.

Complex Extended Dialogue Tasks: In realistic adversarial scenarios, attackers often execute a series of malicious activities through multiple commands. As shown in Figure 3, attackers issue the "write-elevate-execute" attack sequence. While native ChatGPT can effectively handle individual commands, it struggles with complex, interdependent, extended dialogue tasks that require memorization and coherence. By contrast, HoneyGPT can significantly increase the capability to handle such complex command sequences by incorporating the CoT strategy.

Figure 3: Comparison of the Effects of Integrating CoT
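Concretely, a single interaction per Equation (2) can be sketched as one structured query that returns 𝐴𝑖, 𝐶𝑖, and 𝐹𝑖 together. The JSON output format, the prompt wording, and the default model name below are illustrative assumptions, not the exact HoneyGPT prompt; the sketch uses the OpenAI Python client.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def honeygpt_step(P, S, H, SR, Q, model="gpt-4"):
    """One interaction per Eq. (2): returns (A_i, C_i, F_i) for attacker query Q.
    H is a list of (query, answer) pairs, SR a list of prior system changes."""
    history = "\n".join(f"$ {q}\n{a}" for q, a in H)
    messages = [
        {"role": "system", "content": f"{P}\n{S}"},
        {"role": "user", "content": (
            f"Previous interactions:\n{history or 'none'}\n"
            f"System changes so far: {'; '.join(SR) or 'none'}\n"
            f"Attacker command: {Q}\n"
            "Reply in JSON with keys: answer (terminal echo), "
            "change (effect on the system), impact (an integer 0-4)."
        )},
    ]
    reply = client.chat.completions.create(model=model, messages=messages)
    data = json.loads(reply.choices[0].message.content)
    return data["answer"], data["change"], int(data["impact"])
```

Requesting the three quantities in one structured reply is what lets the same call both answer the attacker and feed the chain-of-thought state used in the next prompt.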
Within HoneyGPT, we simplify the complexity of extended dialogues into two steps: first, assessing the impact of previous commands on the operating system, and then determining the system's response to new adversarial commands based on the evaluated impact. To ensure the continuity and automation of the attack interaction process, the task of assessing the impact of attack commands on the operating system has been delegated to ChatGPT. Thus, during each interaction, the prompt must provide ChatGPT with details of the changes that have occurred within the system up to that point (𝑆𝑅𝑖), as well as the present attack command (𝑄𝑖). ChatGPT is then tasked with analyzing the expected output from the terminal (𝐴𝑖) and determining the result of the System State Evaluation (𝐶𝑖, 𝐹𝑖). Following the conclusion of an interaction, the Prompt Manager integrates 𝐶𝑖 into 𝑆𝑅𝑖, setting the stage for the subsequent interaction.

Prompt Pruning Strategy: After each interaction, the Prompt Manager appends content to the prompt, which results in an increasingly lengthy prompt for subsequent interactions. However, due to limitations imposed by computational resources, storage capacities, and model characteristics, LLMs have a restricted context length for prompts. Should the prompt's token count exceed this threshold, ChatGPT will be unable to function. To address this issue, we introduce a strategy known as Prompt Pruning: if the prompt's context length surpasses the limit, we eliminate the least impactful entries, with respect to the current operating system, from the dynamic portion 𝐻𝑖. Although 𝑆𝑅𝑖 also grows dynamically, its length is significantly shorter than the context length limit, and therefore, it does not require pruning. Since 𝐻𝑖 represents the collection of historical honeypot interactions, it suffices to consider both the temporal sequence and the magnitude of impact to identify and remove the least pertinent interaction records for the current exchange. The impact magnitude is determined by ChatGPT during each interaction, for which we have provided a set of grading criteria that ChatGPT uses to generate 𝐹𝑖, as listed in Table 1.
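A minimal sketch of this pruning step, assuming each entry of 𝐻𝑖 is stored as a (query, answer, impact) triple whose impact value follows the 0-4 grading of Table 1 below; the token-counting helper is an assumed stand-in for whatever tokenizer the deployment uses.

```python
def prune_history(history, token_budget, count_tokens):
    """Drop the least significant (query, answer, impact) entries from H_i until
    the serialized history fits within the model's context budget."""
    def total_tokens(entries):
        return sum(count_tokens(q) + count_tokens(a) for q, a, _ in entries)

    pruned = list(history)
    while pruned and total_tokens(pruned) > token_budget:
        # Remove the entry with the smallest impact factor; ties favour older entries.
        victim = min(range(len(pruned)), key=lambda i: (pruned[i][2], i))
        pruned.pop(victim)
    return pruned
```

Because each stored impact value is already attenuated by the weaken factor 𝑤 at update time (Section 4.2.3), recency is folded into the score before pruning runs.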
Table 1: Rules of the Assignment of Values to 𝐹𝑖

Value | Condition
0 | Read file, display system information
1 | File created, install tool
2 | Modify files/dir, change working directory, change shell
3 | Start/stop service, download file, elevate privilege
4 | Impact services, delete files, password changed

4.2 Prompt Manager 𝑀

The operations of the Prompt Manager are categorized into three main processes: prompt construction, prompt pruning, and content updating.

4.2.1 Prompt Construction. The construction process takes place before each interaction session. When the attacker issues a command to HoneyGPT, the Prompt Manager formulates three critical questions for ChatGPT to answer, based on the Honeypot Principle, Honeypot Setting, Attacker Query, System State Register, and History of Interaction:

(1) What is the terminal's response (𝐴𝑖)?
(2) How does the system state change (𝐶𝑖)?
(3) Assign a numerical value to the command's aggressiveness (𝐹𝑖).

Question 2 underpins the CoT strategy, while Question 3 is intended for prompt length pruning and will be discussed later. Upon obtaining the responses 𝐴𝑖, 𝐶𝑖, and 𝐹𝑖, HoneyGPT relays 𝐴𝑖 back to the attacker, while 𝐶𝑖 and 𝐹𝑖 are utilized by the Prompt Manager to update 𝐻𝑖 and 𝑆𝑅𝑖 in the prompt for the subsequent interaction.

4.2.2 Prompt Pruning. Pruning is activated when the length of the prompt exceeds the operational limits of commercial LLMs. The primary objective is to trim interactions of less importance from the History of Interaction (𝐻𝑖). Since the entries regarding system state changes are typically concise, 𝑆𝑅𝑖 is exempt from this curtailment. The success of pruning depends on accurately singling out the least significant interactions from 𝐻𝑖. We have graded the impact of adversarial commands on the current system by using a scale from 0 to 4. The criteria for assigning values to the Impact Factor (𝐹𝑖) are detailed in Table 1. LLMs will generate responses by referencing the table, ensuring that the pruning process is executed in a systematic and uniform manner.

In addition, the design contemplates the temporal dimension of command relevance, positing that earlier exchanges bear less pertinence to the present interaction. The Weaken Factor 𝑤 is introduced to quantify this temporal effect. With each interaction's conclusion, the Impact Factor 𝐹𝑖 for each entry in 𝐻𝑖 is attenuated by multiplication with 𝑤, effectuating a time-based diminution of 𝐹𝑖. During the pruning process, entries with the lowest 𝐹𝑖 values are prioritized for removal to manage and preserve the prompt token limit efficiently.

Regarding the determination of the Weaken Factor 𝑤, both the length of the dialogue and the relevance of the content are considered. Observations from cybersecurity practices indicate that a sequence of attacks with strong correlation typically progresses through a "write-elevate-execute" triad. Therefore, it is advisable to ensure that a newly executed command has a higher significance than a high-privilege command that occurred three steps prior, signified by 𝐹𝑖+3 · 𝑤³ < 𝐹𝑖. Additionally, the value of 𝑤 should not be too diminutive in order to adequately reflect the effect of time on the relevance of past commands. The process of temporal decay is implemented during Content Updating, where 𝐹𝑖 of each element is multiplied by the Weaken Factor (𝑤).

4.2.3 Content Updating. The updating process is conducted at the end of each interaction session, aiming to renew the dynamic cache contents, specifically the System State Register 𝑆𝑅𝑖 and the History of Interaction 𝐻𝑖. This does not include static elements such as the constants 𝑆 and 𝑃, or the adversary's instructions 𝐴𝑖. Upon the completion of the 𝑖-th interaction, the Prompt Manager incorporates the query-response pair (𝑄𝑖, 𝐴𝑖) into 𝐻𝑖, and the context 𝐶𝑖 into 𝑆𝑅𝑖. 𝑆𝑅𝑖 supports the CoT strategy, while 𝐻𝑖 serves as contextual information for enriching ChatGPT's understanding of the interaction history. Additionally, the updating includes the application of a temporal decay to the Impact Factor 𝐹𝑖 to support the pruning mechanism.
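A compact sketch of this update step, corresponding to 𝑀 in Equation (3) and using the same (query, answer, impact) history representation as the earlier sketches: the new pair is appended to 𝐻, the reported change is appended to 𝑆𝑅, and every stored impact factor is decayed by the weaken factor 𝑤. The default value of 𝑤 is purely illustrative.

```python
def update_prompt_state(H, SR, Q_i, A_i, C_i, F_i, w=0.8):
    """Content updating per Section 4.2.3: returns the new (H, SR) used to build
    the next prompt. w is the weaken factor; 0.8 is an illustrative value only."""
    # Apply temporal decay: attenuate the impact factor of every stored entry by w.
    H_next = [(q, a, f * w) for q, a, f in H]
    # Append the just-completed interaction and its system-state change.
    H_next.append((Q_i, A_i, float(F_i)))
    SR_next = SR + [C_i]
    return H_next, SR_next
```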
4.3 Configuration of HoneyGPT

Here we introduce the static components involved in configuring HoneyGPT, namely the System Principles (𝑃) and Honeypot Settings (𝑆), which remain unchanged once a honeypot is initialized. System Principles provide behavioral guidelines for simulating system activities, covering aspects such as the role of HoneyGPT, time sensitivity, input/output formats, and few-shot learning. These principles enable HoneyGPT to effectively attract and analyze potential attack behaviors by maintaining a credible system illusion. The Honeypot Settings involve specific configurations like hardware specifications and software environments, which are essential for simulating realistic targets that are attractive to attackers. These settings are designed to enhance the honeypot's attractiveness and precision in capturing malicious exploitation such as cryptocurrency mining. Once set, both components provide a stable platform for HoneyGPT to operate effectively within its designed parameters. A detailed description of these two sets is given in Appendix A.

5 EVALUATION

Our evaluation of HoneyGPT includes two parts: the baseline evaluation and the field evaluation. The baseline evaluation compares the capabilities in deception, interaction level, and flexibility between HoneyGPT and traditional honeypots. Specifically, to evaluate the honeypots' capability in deception and interaction level under consistent attack scenarios, the baseline evaluation utilizes the attack dataset collected by Cowrie. The field evaluation involves deploying HoneyGPT and Cowrie simultaneously on the Internet for a period of four weeks to assess whether HoneyGPT can influence attackers' attack strategies and capture a wider range of attack types and interaction lengths.

5.1 Baseline Comparison Set

The comparison in the baseline evaluation is from three aspects: deception, interaction level, and flexibility. The comparisons of deception and interaction level are performed in a quantitative manner, by utilizing the same attack dataset to ensure the same attack scenarios, while the flexibility comparison is performed in a qualitative manner, through individual experimental verification for each indicator.

Table 2 shows the experimental setup for the baseline comparison. The comparison of deception is conducted among HoneyGPT based on the GPT-3.5 turbo interface, HoneyGPT based on the GPT-4 turbo interface, Cowrie, and a real system. The interaction level comparison is limited to HoneyGPT based on the GPT-3.5 turbo interface, HoneyGPT based on the GPT-4 turbo interface, and Cowrie, because the real system exhibits the best interaction level and there is no need to compare it. The flexibility comparison is conducted among Honeyd, a low-interaction honeypot, along with HoneyGPT, Cowrie, and the real system.

Table 2: Baseline Comparison Set

Capability Type | Subjects | Method | Research Type
Deception | HoneyGPT (GPT-3.5 turbo), HoneyGPT (GPT-4), Cowrie, Real System | Replay attack dataset | Quantitative research
Interaction | HoneyGPT (GPT-3.5 turbo), HoneyGPT (GPT-4), Cowrie | Replay attack dataset | Quantitative research
Flexibility | HoneyGPT, Cowrie, Honeyd, Real System | Verification individually | Qualitative research

To assess the differences in deception and interaction level between HoneyGPT and traditional honeypots under the same attack scenarios, we use the same attack dataset to replay traffic, simulating identical attack scenarios. This dataset compiles attack data previously captured by Cowrie, a mainstream, code-based, medium-interaction open-source honeypot system [3, 24]. As shown in Table 3, these datasets include a substantial volume of attack commands and valid attack sessions. A sequence of actions performed by an attacker after a successful login is considered a valid attack session. We perform data cleaning and aggregation on these datasets, resulting in 160 attack sessions and 1,489 attack commands, which are used for the baseline comparison.

Table 3: Attack Data Source

Source | Capture Method | Time Span | Number of Attacks | Effective Attack Sessions
Cowrie [3] | Captured | 2021/04/06 - 2021/06/11 | 5099153 | 2479
Cowrie [24] | Captured | 2022/06/09 - 2022/10/29 | 1261689 | 7374

5.2 Baseline Comparison of Deception

Here we assess the deception capabilities of Cowrie, HoneyGPT based on the GPT-3.5 Turbo API, HoneyGPT based on the GPT-4 API, and a real operating system based on virtual machines.

5.2.1 Evaluation Metrics. We assess the deception of honeypots from two aspects:

(1) Whether the execution of commands succeeds or not. Attack behaviors typically exhibit temporal dependencies and control logic, where the execution of commands relies on the desired effects of previous attack commands, such as successful execution or system information that is of interest to attackers. Thus, a higher execution success rate indicates the honeypot's capability to induce attackers into deeper interactions and capture more valuable information.

(2) Whether the outputs of the honeypot comply with the OS logic or not. Whether the honeypot's outputs align with the logic of the operating system directly affects the effectiveness of its deception. Deviations from the expected behaviors of an operating system, such as inconsistent shell prompts or command execution results, may raise suspicion by attackers and expose the honeypot. A higher level of OS logic compliance indicates a stronger deceptive capability of the honeypot.

Based on these two aspects, we then define four distinct categories to classify and compare the deceptive performance of the honeypot system's responses.

• SALC (Successful Attack with Logic Compliance): Successful attacks that conform to the logic of the operating system.
• SALNLC (Successful Attack without Logic Compliance): Successful attacks that do not adhere to the logic of the operating system.
• FALC (Failed Attack with Logic Compliance): Failed attacks that conform to the logic of the operating system.
• FALNLC (Failed Attack without Logic Compliance): Failed attacks that do not adhere to the logic of the operating system.

With these defined categories, we can now introduce four key evaluation metrics to assess the honeypot system's deceptive capabilities:
• Accuracy: It measures the system's ability to generate outputs consistent with the logic of a real operating system. It is calculated by dividing the number of successful attacks that conform to the operating system's logic (SALC) by the total number of successful attacks (SALC + SALNLC). Improving accuracy enables more precise simulation of operating system behaviors, lowers attacker suspicion, and enhances information gathering and threat awareness capabilities.
• Temptation: It evaluates the effectiveness of the system in attracting attackers' engagement. It is calculated by dividing the number of successful attacks that conform to the operating system's logic (SALC) by the total number of attacks with operating system logic (SALC + FALC). A higher level of temptation indicates a more successful lure for attackers and greater information collection.
• Attack Success Rate: It measures the overall success rate of executing deceptive commands. It is calculated by dividing the total number of successful attacks by the total number of attempted attacks. A higher success rate implies more successful interactions and better information collection.
• OS Logic Compliance: It evaluates the system's capability to generate outputs consistent with the logic of a real operating system. It is calculated by dividing the number of attacks that conform to the operating system's logic (SALC + FALC) by the total number of attempted attacks. Improving OS logic compliance allows for the creation of more accurate simulations of operating system behaviors, lowering attackers' suspicions.

5.2.2 Performance Comparison. Figures 4 and 5 illustrate the performance comparison among Cowrie, a real operating system, and HoneyGPT based on GPT-3.5 and GPT-4.

Figure 4: Distribution of Deception Categories Across Different Honeypots

Figure 5: Performance Comparison

In terms of attack success rate and temptation, HoneyGPT based on GPT-3.5 and GPT-4 outperforms both Cowrie and the real operating system. The inherent characteristics of LLMs enable HoneyGPT to better cater to attackers' intentions, enticing them to further engage in attacks, which is a significant advantage of HoneyGPT. Furthermore, increasing the model parameter size leads to improvements in both metrics. Regarding OS logic compliance and accuracy, HoneyGPT based on GPT-3.5 and GPT-4 also demonstrates performance superior to Cowrie's and approaches the level of the real operating system. Additionally, increasing the model parameter size leads to enhancements in both metrics. Note that with the increase in model parameter size, there is a notable improvement in the proportion of successful executions of deceptive commands that comply with the operating system logic (SALC). This indicates that as the model evolves, the precision of HoneyGPT in catering to attackers' intentions also increases. In summary, the performance comparison highlights the advantages of HoneyGPT based on GPT-3.5 and GPT-4 over Cowrie and the real operating system. HoneyGPT demonstrates superior performance in terms of accuracy, temptation, attack success rate, and compliance with operating system logic. Moreover, increasing the model parameter size can further improve the performance in these metrics.
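For reference, the four metrics above reduce to simple ratios over the counts of the four response categories defined in Section 5.2.1. A minimal sketch; the example counts in the docstring are purely hypothetical.

```python
def deception_metrics(counts):
    """Compute the Section 5.2.1 metrics from per-command category counts,
    e.g. counts = {"SALC": 120, "SALNLC": 10, "FALC": 25, "FALNLC": 5}."""
    salc, salnlc = counts["SALC"], counts["SALNLC"]
    falc, falnlc = counts["FALC"], counts["FALNLC"]
    total = salc + salnlc + falc + falnlc
    return {
        "accuracy": salc / (salc + salnlc),        # SALC / all successful attacks
        "temptation": salc / (salc + falc),        # SALC / all logic-compliant attacks
        "attack_success_rate": (salc + salnlc) / total,
        "os_logic_compliance": (salc + falc) / total,
    }
```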
5.3 Baseline Comparison of Interaction

Unlike the deception assessment, we consider a honeypot to provide interaction as long as it can respond without crashing, even if the honeypot's feedback suggests that the service does not exist or there is insufficient permission. We assess the interaction level of Cowrie, HoneyGPT based on the GPT-3.5 Turbo API, and HoneyGPT based on the GPT-4 API. The baseline comparison of interaction level can be divided into two parts. First, we compare and analyze the degree of interaction support that HoneyGPT and Cowrie provide in response to diverse attack requests. Second, based on the ATT&CK matrix, we derive the response rates of HoneyGPT to different attack techniques under varied interface supports and analyze the reasons for non-responses to each attack technique.

5.3.1 Performance Comparison. To assess the interaction level of a honeypot, we introduce four key metrics:

• Complete Session Response Rate: The ratio of sessions in which every request is successfully responded to throughout the session's duration.
• Command Response Rate: The percentage of individual commands within sessions that receive responses.
• Mean Session Length Percentage: Defined as the ratio of the average executed session length to the overall average session length.
• Mean Interaction Degree Percentage: Defined as the average of the proportions of successful responses within individual sessions.

Table 4: Evaluation of Interaction Level

Metrics | Cowrie | HoneyGPT (GPT-3.5-turbo) | HoneyGPT (GPT-4)
Full Session Response Rate | 81.88% | 93.75% | 99.38%
Command Response Rate | 96.98% | 99.13% | 99.93%
Mean Session Length Percentage | 57.28% | 71.73% | 99.93%
Mean Interaction Degree Percentage | 83.24% | 94.45% | 99.91%

Table 4 presents a detailed comparison of the interaction levels for each honeypot. The results clearly demonstrate the superior performance of HoneyGPT, especially the version based on the GPT-4 API, achieving nearly perfect interaction performance across all categories. This indicates the superiority of HoneyGPT's capability to maintain interactive sessions and support command interactions.

5.3.2 Unresponsiveness Analysis and Error Origins. We assess the interactive performance of HoneyGPT based on the GPT-3.5 Turbo API and that of HoneyGPT based on the GPT-4 API in responding to various attack vectors, analyzing the types and frequencies of errors they encounter with each type of attack.

Our analysis employs the "Techniques" component of the ATT&CK framework developed by MITRE to classify the test data [28]. Within the ATT&CK framework, the "Techniques" section comprehensively outlines the specific methods adversaries employ to achieve their tactical objectives. By examining the response rate and error causes for each technique, we gain deeper insights into the performance capabilities and potential limitations of HoneyGPT under different API supports.

The test results, as detailed in Appendix B, indicate that HoneyGPT demonstrates excellent support in responding to each attack technique, achieving response rates above 99%. Notably, the performance of GPT-4 surpasses that of GPT-3.5 Turbo. This analysis not only highlights the robustness of HoneyGPT's design in simulating realistic and responsive interaction, but also underscores the improvements in handling complex security scenarios facilitated by advancements from GPT-3.5 to GPT-4.

Additionally, Figures 6 and 7 provide an analysis of the causes and frequencies of non-responses encountered by each version of HoneyGPT against various ATT&CK techniques. A notable observation is the substantial reduction in commonly encountered error types with GPT-4, such as "response wrong format", "wrong command", and "prompt length exceeds limit", in comparison to GPT-3.5 Turbo.

Figure 6: Analysis of GPT-3.5 Turbo Non-Responses

Figure 7: Analysis of GPT-4 Non-Responses

Response wrong format indicates issues with the parsing of responses due to discrepancies between the returned format and the expected format.

Wrong command occurs when the command is deemed incorrect or inexecutable within the specified system environment.

Prompt length exceeds limit arises when the length of an attack command or response surpasses the LLM's context length limit, making it impossible for the model to function.

These issues are not easily rectifiable solely through prompt engineering in circumstances where the model's performance is subpar. By contrast, with the advent of GPT-4, these error types become virtually nonexistent, with only a minuscule number of "security policy" errors remaining. This highlights the improvement in error management with the introduction of GPT-4, reflecting a significant advancement in interaction over GPT-3.5 Turbo.

Security policy occurs when ChatGPT refrains from executing a command based on its internal security protocols, and it primarily involves operations such as malicious file downloads or deletions. This type of error is unavoidable in the design of honeypots based on commercial LLMs. Fortunately, at this moment, the frequency of such errors is very low, and their impact on honeypot performance is negligible.

5.4 Baseline Comparison of Flexibility

The evaluation of honeypot flexibility focuses on three key aspects: scalability across different systems, adaptability in different system configurations, and integration of additional security analysis features. The comparative evaluation is among HoneyGPT, Cowrie, Honeyd, and a real-system honeypot. Cowrie operates as a medium-interaction honeypot that provides interactive terminal emulation, and Honeyd serves as a low-interaction honeypot designed for general purposes. Real-system honeypots, on the other hand, are actual operating systems running on virtual machines.

5.4.1 Flexibility in Simulating Different Systems. The diversity of operating systems requires traditional honeypots to adapt to various system versions and instruction sets. This adaptation frequently demands independent development efforts for each simulated system. However, the advanced generative capabilities of LLMs offer a
Table 5: Flexibility in Simulating Different Systems additional functionalities through simple prompts. For example,
we can employ ChatGPT to analyze attackers’ behavioral logs,
Honeypot Expansion Cost Simulation Level Scalability enabling functionalities like security policy alerts. As designed in
Cowrie High Medium Medium the prompt manager of HoneyGPT, each interaction with LLMs
Honeyd Low Low High not only inquires about the outcome of the current command but
Real System High High Low also asks about the potential impact of the attack command on the
HoneyGPT Low High High operating system.
This innovative approach represents a paradigm shift where a
honeypot, by leveraging LLMs, seamlessly integrates security anal-
groundbreaking solution to overcome these challenges. We evalu- ysis capabilities. The utility of the honeypot is greatly enhanced
ate the emulation capabilities of these honeypots across different while reducing the cost and complexity associated with the integra-
operating systems, including Ubuntu and CentOS. Table 5 lists tion of these auxiliary features. Therefore, HoneyGPT presents a
the expansion costs, simulation levels, and scalability of each hon- cost-effective, scalable, and efficient approach for future honeypots.
eypot type. Cowrie, as a medium-interaction honeypot, requires
extensive modifications to adapt to different system versions, in- 5.5 Field Evaluation
cluding updates to system information and the file systems, as well
The field evaluation focuses on assessing HoneyGPT’s entrapment
as reconfigurations of the instruction set, and thus the expansion
efficacy in the real world. We have simultaneously deployed Hon-
costs are considerably high. Honeyd, as a low-interaction honeypot,
eyGPT based on GPT-4 and Cowrie inside the same network for
requires only minimal adjustments in configuration information
four weeks. HoneyGPT is implemented using the best practice de-
to accommodate system version changes, offering high scalability.
ployment strategies, while Cowrie is configured with its default
Real-system honeypots, while achieving high simulation fidelity,
settings. We detail the evaluation procedure and our main findings
require the configuration of new physical environments, which
below.
significantly increases the associated costs and reduces overall flex-
ibility. HoneyGPT requires only minimal adjustments to system 5.5.1 HoneyGPT Deployment. Our baseline comparisons demon-
settings in prompt to simulate different systems. The LLMs can strate that HoneyGPT significantly enhances deception, interactiv-
automatically generate file system, instruction set, system user, and ity, and flexibility. However, the economic burden associated with
other details based on the system settings in the prompt. Our approach offers high scalability and a superior level of simulation.

5.4.2 Flexibility in Simulating Different System Configurations. To assess the flexibility of configuration modifications, we conduct a comparative analysis of Cowrie, Honeyd, a real-system honeypot, and HoneyGPT. We categorize configuration changes into five types: network configuration, file system, user configuration, hardware setting, and service status. If configuration updates can be easily achieved by simple command-line or configuration-file modifications, we consider the honeypot as supporting flexible modification for that configuration. Table 6 shows that HoneyGPT is the only product allowing flexible modifications across all configuration aspects. Honeyd, Cowrie, and real-system honeypots exhibit limited configuration flexibility.

Traditional honeypots often struggle with modification of system configurations, leading to fixed deployments that can be easily identified by attackers. This rigidity undermines their effectiveness as deception tools. However, the advent of LLMs introduces a novel paradigm: HoneyGPT enables direct modifications to system configurations via prompt adjustments, offering unprecedented flexibility and adaptability, thereby enhancing deception capabilities.

5.4.3 Flexibility of Integration with Security Analysis Capability. In the context of practical operations, honeypots primarily act as collectors of attack data, representing just one facet of comprehensive security measures. Beyond this basic role, the analysis of collected attack data is crucial. Traditional honeypot designs, such as Cowrie and Honeyd, require specialized development efforts, thereby incurring significant overheads to integrate these functionalities. However, HoneyGPT's design circumvents this challenge by leveraging the capabilities of ChatGPT to seamlessly integrate security analysis capabilities.

The query costs of commercial LLMs present practical challenges for deployment. Furthermore, these models also impose restrictions on the quantity and frequency of requests from a single user within a given timeframe, potentially limiting their capacity to handle highly concurrent requests. To optimize the cost-effectiveness of HoneyGPT under the constraints of high query costs and stringent request limitations, we adopt a hybrid deployment strategy. This strategy integrates HoneyGPT with the traditional emulated honeypot approach, leveraging the strengths of both to enhance performance while reducing costs.

In the strategic deployment of HoneyGPT, prudent management of command processing is crucial for optimizing efficacy and cost-efficiency. Commands that are challenging for LLMs to process, as well as those governed by fixed, simple rules, are delegated to emulated honeypots. In other words, the commands that LLMs typically handle inefficiently, such as binary file reading, those dependent on time mechanisms (e.g., sleep and uptime), and commands following simple rules, such as whoami, are rerouted to emulated honeypots.

Furthermore, resolved interactions from HoneyGPT, comprising terminal prompts and command combinations, are stored in a database. If an attacker's request corresponds to a stored HoneyGPT query, the recorded response from the database is immediately returned. For new, context-sensitive queries, processing is handled by HoneyGPT to ensure responses are coherent and contextually accurate. Because attacker requests are highly repetitive in actual intrusion scenarios, this strategy significantly reduces the query load and congestion on commercial LLMs.
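To make the routing described above concrete, the following minimal sketch shows one way the hybrid strategy could be wired together. It is an illustration under stated assumptions rather than HoneyGPT's actual implementation: the set of delegated commands, the cache table resolved(key, response), and the two injected handlers for the emulated honeypot and the LLM API are hypothetical names introduced for exposition.

```python
import hashlib
import sqlite3
from typing import Callable

# Commands better served by the traditional emulated honeypot than by an LLM:
# fixed, simple rules (whoami) and time-dependent behaviour (sleep, uptime).
EMULATED_COMMANDS = {"whoami", "sleep", "uptime"}


def route_command(
    session_prompt: str,
    command: str,
    db: sqlite3.Connection,
    emulated: Callable[[str], str],   # e.g. a Cowrie-style handler (assumed)
    llm: Callable[[str, str], str],   # wrapper around the commercial LLM API (assumed)
) -> str:
    """Rule-based fast path first, then the response cache, then the LLM."""
    head = command.strip().split()[0] if command.strip() else ""
    if head in EMULATED_COMMANDS:
        return emulated(command)

    # Resolved interactions are keyed by the terminal prompt plus the command.
    key = hashlib.sha256((session_prompt + "\x00" + command).encode()).hexdigest()
    row = db.execute("SELECT response FROM resolved WHERE key = ?", (key,)).fetchone()
    if row:  # repeated attacker request: serve from the database, no LLM query
        return row[0]

    response = llm(session_prompt, command)
    db.execute("INSERT INTO resolved (key, response) VALUES (?, ?)", (key, response))
    db.commit()
    return response
```

Hashing the terminal prompt together with the command means a repeated request is served from the database only when it arrives in an equivalent context, which preserves response consistency while cutting commercial LLM traffic.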

Table 6: Evaluation of Configuration Flexibility for Different Configurations

Honeypot Type | Network Configuration | File System | User Configuration | Hardware Settings | Service Status
Cowrie | ✓ | ✓ | ✓ | × | ×
Honeyd | ✓ | × | × | × | ✓
Real-system honeypot | ✓ | × | ✓ | × | ×
HoneyGPT | ✓ | ✓ | ✓ | ✓ | ✓
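As an illustration of what this prompt-level configurability can look like, the snippet below holds the honeypot settings as a short natural-language block; the host name, addresses, and hardware values are invented for the example and are not taken from HoneyGPT's deployment. Switching any aspect in Table 6, such as the advertised service status, then amounts to editing one sentence rather than rebuilding an image or patching emulation code.

```python
# Illustrative only: a hypothetical natural-language settings block of the kind
# HoneyGPT carries in its prompt. Reconfiguring the honeypot (hardware, services,
# users, network) is a matter of editing these lines.
HONEYPOT_SETTINGS = """
Hostname: gpu-node-03, IP address 10.0.12.7, default gateway 10.0.12.1.
Hardware: 2 x Intel Xeon CPUs (64 cores total), 4 x NVIDIA GPUs, 2 TB NVMe storage.
Users: root and a low-privilege account 'deploy'; /home/deploy contains build scripts.
Services: SSH on port 22 and an HTTP service on port 8080 are running; FTP is disabled.
"""

# Flipping one configuration aspect is a one-line edit, e.g. advertising FTP as enabled:
HONEYPOT_SETTINGS = HONEYPOT_SETTINGS.replace(
    "FTP is disabled", "FTP is enabled on port 21"
)
```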

Moreover, to enhance the appeal of the honeypot to specific adversaries, such as those involved in cryptocurrency mining, the simulated environment within HoneyGPT is configured to mimic a high-performance computing server. Prompts are carefully crafted to align with attackers' intentions, allowing for the successful execution of attack commands. This is intended to modify attacker behaviors and elicit more prolonged and informative interactions, thereby providing deeper insight into their strategies and potentially uncovering more complex attack techniques.
5.5.2 Comparative Analysis of Attacker Engagement. By comparing the interaction data of HoneyGPT and Cowrie, we observe that, for the same attack source under the three cases, HoneyGPT achieves a deeper interaction with the attacker than Cowrie.

• Fulfillment of Attacker's Intent. As illustrated in Figure 8, attackers issue the ps and grep command combination to identify processes related to the "miner" keyword, aiming to detect mining malware. Cowrie cannot reveal any processes matching the "miner" keyword, leading to a premature termination of the attack. Conversely, HoneyGPT effectively mimics the expected system responses, thus fulfilling the attacker's intent and encouraging sustained engagement.

• Command Support Level. As illustrated in Figure 9, when attackers try to use complex command structures with multiple command combinations, the limited interaction capabilities of the Cowrie honeypot often fail to support all the sub-commands within these complex commands. As a result, attackers typically cease their attacks after unsuccessful attempts with similar commands. By contrast, HoneyGPT's generative capabilities are sufficient to handle such complex commands, which can entice attackers to engage in deeper levels of attack.

• Content Rigidity. As illustrated in Figure 10, when attackers engage in reconnaissance of the target system, Cowrie typically provides rigid, easily identifiable honeypot responses. Recognizing these as features of a honeypot, attackers often terminate their interaction prematurely. By contrast, HoneyGPT dynamically generates responses tailored to the honeypot setting outlined in the prompt, effectively camouflaging its honeypot nature and maintaining attacker engagement.

Figure 8: Fulfillment of Attacker's Intent
Figure 9: Command Support Level
Figure 10: Content Rigidity

This comparative analysis highlights HoneyGPT's superior capabilities in providing dynamic, contextually relevant content that can significantly sway attacker behaviors, support a broader array of command interactions, and effectively disguise its honeypot identity, thereby extending engagement and potentially exposing more advanced attack strategies.

5.5.3 New Attack Vectors Discovered by HoneyGPT. The deployment of HoneyGPT has unveiled new attack vectors that significantly differ from those captured by the traditional Cowrie honeypot. As detailed in Appendix C, HoneyGPT identifies 11 types of attack actions categorized into six different ATT&CK techniques, highlighting the evolving landscape of cyber threats and the critical role of advanced, adaptive honeypot technologies like HoneyGPT in identifying and analyzing emerging attack vectors.

6 LIMITATIONS AND FUTURE WORK
While the integration of LLMs has enhanced the entrapment capabilities of honeypot systems, several limitations remain, manifested in the following aspects:

• Context Length Restriction. LLMs are inherently limited by the context length of the prompt. When attackers input and execute lengthy command scripts directly at the terminal, LLMs may fail to respond appropriately. Attempts to limit response lengths within the prompts have proven suboptimal. This issue may necessitate the development of proprietary LLMs tailored for extensive token interactions.

• Request Rate Limit. Although the use of rule-based matching has reduced the number of unnecessary commercial LLM requests, commercial models' limitations on request rates remain a significant obstacle to the deployment of honeypot systems. Professional honeypot models, fine-tuned from open-source LLMs, could potentially address this issue.

• Ineffectiveness against Fixed Attack Sequences. During field testing, we discovered that some attackers employ predetermined attack sequences to target honeypots, regardless of the honeypot's ability to respond to requests or the quality of the response content. For such attackers, HoneyGPT is unable to prolong the interaction length by catering to their intent.

• Prompt-based Attacks on ChatGPT. While no prompt-based attacks against ChatGPT have been discovered thus far, this remains a potential vulnerability that needs to be monitored and addressed [7, 27].

• Overfitting Issue. LLMs may exhibit overfitting to incorrect commands, returning seemingly normal results despite command errors. This issue highlights the need for further refinement of the model's error recognition capabilities.

Moving forward, the development of LLMs for honeypot applications should focus on mitigating these limitations. Research efforts may include creating more sophisticated and context-aware models that can handle longer token sequences and have a deeper understanding of complex systems. Furthermore, enhancing models with temporal awareness and addressing request rate limits will be crucial for advancing honeypot technology. It will also be paramount to continually monitor for new types of attacks and adapt the LLMs accordingly to maintain robust security measures. Finally, addressing the overfitting problem may involve implementing more nuanced training strategies that emphasize the correct handling of anomalous or incorrect command inputs.

7 CONCLUSION
The triad of challenges in traditional honeypots (flexibility, interaction depth, and deceptive capacity) has historically curtailed their allure and efficacy in engaging attackers. To address this trilemma, this paper presents HoneyGPT, a bespoke innovation in adaptive honeypot solutions specifically engineered to circumvent the inherent limitations of traditional honeypots. Our comparison evaluations indicate that HoneyGPT transcends traditional honeypots in terms of flexibility, interaction, and deception, and it also exceeds the entrapment capabilities of real systems, emphasizing its superiority in meeting attackers' intents to deepen engagements. Our field evaluations further validate that HoneyGPT adeptly conducts extended interactions and captures a broader spectrum of attack vectors while facing diverse adversaries in the real world. Overall, HoneyGPT positions itself as an effective and intelligent countermeasure against ongoing security threats, and it also paves the path for future honeypot research.

REFERENCES
[1] Vetterl Alexander and Clayton Richard. 2018. Bitter Harvest: Systematically Fingerprinting Low- and Medium-interaction Honeypots at Internet Scale. In 12th USENIX Workshop on Offensive Technologies. USENIX Association.
[2] K. G. Anagnostakis, S. Sidiroglou, P. Akritidis, K. Xinidis, E. Markatos, and A. D. Keromytis. 2005. Detecting Targeted Attacks Using Shadow Honeypots. In 14th USENIX Security Symposium. USENIX Association.
[3] Melike Başer, Ebu Yusuf Güven, and Muhammed Ali Aydın. 2021. Ssh and telnet protocols attack analysis using honeypot technique: Analysis of ssh and telnet honeypot. In 2021 6th International Conference on Computer Science and Engineering. IEEE Computer Society.
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
[5] Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. 2024. Towards revealing the mystery behind chain of thought: a theoretical perspective. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
[6] Yao Fu, Hao Peng, and Tushar Khot. 2022. How does gpt obtain its ability? Tracing emergent abilities of language models to their sources. Yao Fu's Notion (2022).
[7] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security. Association for Computing Machinery.
[8] J. He and M. Vechev. 2023. Large Language Models for Code: Security Hardening and Adversarial Testing. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security. Association for Computing Machinery.
[9] Peiwei Hu, Ruigang Liang, and Kai Chen. 2024. DeGPT: Optimizing Decompiler Output with LLM. In Proceedings of the 31st Annual Network and Distributed System Security Symposium. Association for Computing Machinery.
[10] Linan Huang and Quanyan Zhu. 2021. Duplicity games for deception design with an application to insider threat mitigation. IEEE Transactions on Information Forensics and Security (2021).
[11] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
[12] Jinwoo Kim, Eduard Marin, Mauro Conti, and Seungwon Shin. 2022. EqualNet: A Secure and Practical Defense for Long-term Network Topology Obfuscation. In Proceedings of the 29th Annual Network and Distributed System Security Symposium. Association for Computing Machinery.
[13] Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, and Hao Su. 2024. Deductive verification of chain-of-thought reasoning. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
[14] Efrén López-Morales, Carlos Rubio-Medrano, Adam Doupé, Yan Shoshitaishvili, Ruoyu Wang, Tiffany Bao, and Gail-Joon Ahn. 2020. HoneyPLC: A Next-Generation Honeypot for Industrial Control Systems. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. Association for Computing Machinery.
[15] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
[16] Ruijie Meng, Martin Mirchev, Marcel Böhme, and Abhik Roychoudhury. 2024. Large language model guided protocol fuzzing. In Proceedings of the 31st Annual Network and Distributed System Security Symposium. Association for Computing Machinery.
[17] Provos Niels. 2004. A Virtual Honeypot Framework. In 13th USENIX Security Symposium. USENIX Association.
[18] M. Oosterhof. [n. d.]. Cowrie. [Online]. https://github.com/micheloosterhof/cowrie.
[19] OpenAI. [n. d.]. ChatGPT. [Online]. https://openai.com/blog/chatgpt/.
[20] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
[21] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022. Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions. In 2022 IEEE Symposium on Security and Privacy. IEEE Computer Society.
[22] Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. 2023. Examining zero-shot vulnerability repair with large language models. In 2023 IEEE Symposium on Security and Privacy. IEEE Computer Society.
[23] PhiBo (phibos). [n. d.]. Dionaea. [Online]. https://github.com/DinoTools/dionaea.

[24] VS Devi Priya and S Sibi Chakkaravarthy. 2023. Containerized cloud-based honeypot deception for tracking attackers. Scientific Reports (2023).
[25] Julian L Rrushi. 2018. Dnic architectural developments for 0-knowledge detection of opc malware. IEEE Transactions on Dependable and Secure Computing (2018).
[26] Takayuki Sasaki, Akira Fujita, Carlos H Ganán, Michel van Eeten, Katsunari Yoshioka, and Tsutomu Matsumoto. 2022. Exposed infrastructures: Discovery, attacks and remediation of insecure ics remote management devices. In 2022 IEEE Symposium on Security and Privacy. IEEE Computer Society.
[27] Wai Man Si, Michael Backes, Yang Zhang, and Ahmed Salem. 2023. Two-in-One: A Model Hijacking Attack Against Text Generation Models. In 32nd USENIX Security Symposium. USENIX Association.
[28] Blake E Strom, Andy Applebaum, Doug P Miller, Kathryn C Nickels, Adam G Pennington, and Cody B Thomas. 2018. Mitre att&ck: Design and philosophy. In Technical report. The MITRE Corporation.
[29] Sadegh Torabi, Elias Bou-Harb, Chadi Assi, ElMouatez Billah Karbab, Amine Boukhtouta, and Mourad Debbabi. 2020. Inferring and investigating IoT-generated scanning campaigns targeting a large network telescope. IEEE Transactions on Dependable and Secure Computing (2020).
[30] Sanh Victor, Webson Albert, Raffel Colin, Bach Stephen, Sutawika Lintang, Alyafeai Zaid, Chaffin Antoine, Stiegler Arnaud, Raja Arun, Dey Manan, et al. 2022. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations. OpenReview.net.
[31] Nishant Vishwamitra, Keyan Guo, Farhan Tajwar Romit, Isabelle Ondracek, Long Cheng, Ziming Zhao, and Hongxin Hu. 2024. Moderating New Waves of Online Hate with Chain-of-Thought Reasoning in Large Language Models. In 2024 IEEE Symposium on Security and Privacy. IEEE Computer Society.
[32] Omar Abdel Wahab, Jamal Bentahar, Hadi Otrok, and Azzam Mourad. 2019. Resource-aware detection and defense system against multi-type attacks in the cloud: Repeated bayesian stackelberg game. IEEE Transactions on Dependable and Secure Computing (2019).
[33] Yuntao Wang, Zhou Su, Abderrahim Benslimane, Qichao Xu, Minghui Dai, and Ruidong Li. 2023. Collaborative honeypot defense in uav networks: A learning-based game approach. IEEE Transactions on Information Forensics and Security (2023).
[34] Zhiwei Wang, Yihui Yan, Yueli Yan, Huangxun Chen, and Zhice Yang. 2022. CamShield: Securing Smart Cameras through Physical Replication and Isolation. In 31st USENIX Security Symposium. USENIX Association.
[35] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022).
[36] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
[37] Yiwen Xu, Yu Jiang, Lu Yu, and Juan Li. 2021. Brief industry paper: Catching iot malware in the wild using honeyiot. In 2021 IEEE 27th Real-Time and Embedded Technology and Applications Symposium. IEEE Computer Society.
[38] Kaiyu Yang, Jia Deng, and Danqi Chen. 2022. Generating Natural Language Proofs with Verifier-Guided Search. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
[39] Mengjiao (Sherry) Yang, Dale Schuurmans, Pieter Abbeel, and Ofir Nachum. 2022. Chain of Thought Imitation with Procedure Cloning. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
[40] Jianzhou You, Shichao Lv, Yue Sun, Hui Wen, and Limin Sun. 2021. Honeyvp: A cost-effective hybrid honeypot architecture for industrial control systems. In ICC 2021 - IEEE International Conference on Communications. IEEE Computer Society.
[41] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D Goodman. 2022. Star: Self-taught reasoner bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems. Curran Associates, Inc.
[42] Zhenxin Zhan, Maochao Xu, and Shouhuai Xu. 2013. Characterizing honeypot-captured cyber attacks: Statistical framework and case study. IEEE Transactions on Information Forensics and Security (2013).

A CONFIGURATION OF HONEYGPT

A.1 System Principles 𝑃
HoneyGPT, serving as a terminal honeypot system, is specifically designed to attract and monitor attack activities. To achieve this objective, we have implemented the system's principles within the prompts, providing operational guidelines for ChatGPT. These prompts play different crucial roles in the functionality of the system, including:

Role of HoneyGPT. HoneyGPT strategically simulates an authentic terminal environment, tailored to align closely with attackers' intents, thereby encouraging deeper and more sustained engagement. The system expertly balances authenticity with the objective of enticing attackers into prolonged interactions, thereby enhancing the honeypot's effectiveness.

Time Sensitivity. In scenarios where attacks are time-sensitive, attackers may issue commands related to current time inquiries, such as uptime and top. Since ChatGPT inherently lacks the capability to query real-time networked time data, it is necessary for us to provide time-related information, such as the current time and boot time, in the prompts to ChatGPT. This approach ensures that the honeypot's responses are temporally accurate, thus more convincingly mimicking a real system environment.

Input/Output Format. The structure of both input and output prompts critically defines each interaction with the LLM in HoneyGPT. The input format is crucial for the LLM's accurate understanding of the honeypot system's state and the commands issued by attackers. Conversely, the output format affects HoneyGPT's operational effectiveness. To ensure HoneyGPT's robustness, we instruct the LLM to use the JSON format for outputs. This standardization ensures consistent and correct parsing of responses during each interaction with the model.

Few-Shot Learning. HoneyGPT utilizes a few-shot learning strategy to enhance ChatGPT's comprehension of task requirements and improve the quality of its outputs, compensating for gaps in its knowledge of certain commands. Our experience indicates that by integrating only 4-5 representative examples, HoneyGPT's generative capabilities can be significantly enhanced.

A.2 Honeypot Setting 𝑆
Honeypot settings describe the terminal system information simulated by HoneyGPT, which is crucial for attracting and capturing potential attackers. Our experience indicates that honeypots with high-end configurations are more attractive to malicious entities, such as cryptocurrency-mining malware. Furthermore, the more detailed the honeypot configuration, the more precise the responses. The ability to customize honeypot settings via natural language significantly enhances flexibility and reduces the complexity of deployment. Honeypot settings are divided into two main categories: hardware and software.

Hardware. It is essential to define the hardware specifications that will be emulated by HoneyGPT. This includes defining CPU types and counts, GPU presence, and storage capacities, attributes that attackers frequently scrutinize. The specifications selected must mirror those of the systems intended to be simulated by HoneyGPT to maintain the authenticity of the honeypot environment. Additionally, configuring systems with high-end GPUs and CPUs can attract specific types of threat actors, such as crypto-mining malware, prompting deeper interactions.
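Before turning to the software side of the settings, the sketch below illustrates the Input/Output Format principle from A.1: the honeypot state and the attacker's command are packaged into a request, and the model is asked to answer in JSON so the reply can be parsed deterministically. The field names and schema here are assumptions made for the example, not HoneyGPT's exact format.

```python
import json

def build_user_message(terminal_prompt: str, command: str) -> str:
    # Package the current honeypot state and the attacker's input into one request.
    return (
        f"Current terminal prompt: {terminal_prompt}\n"
        f"Attacker input: {command}\n"
        'Reply ONLY with JSON of the form {"response": "<terminal output>", '
        '"analysis": "<brief note on the attacker\'s likely intent>"}.'
    )

def parse_reply(raw_reply: str) -> tuple[str, str]:
    """Parse the model's JSON reply; fall back to raw text if it is malformed."""
    try:
        obj = json.loads(raw_reply)
        return obj.get("response", ""), obj.get("analysis", "")
    except json.JSONDecodeError:
        return raw_reply, ""
```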

Software. The software environment of a honeypot, including the operating system, open ports, user configurations, and application services, must be strategically crafted to reflect the behaviors of potential targets. For instance, if the objective is to mimic a web server, HoneyGPT should be configured with the appropriate web server software, along with associated services and open ports. Additionally, attackers are often drawn to details such as system processes, resource utilization, scheduled tasks, user configurations, and filesystem information. These elements can be specifically tailored within HoneyGPT to create customized responses for each engagement scenario.

B SUCCESSFUL RESPONSE RATE OF HONEYGPT BASED ON GPT-3.5-TURBO AND GPT-4
Figure 11 shows the successful response rate of HoneyGPT based on GPT-3.5-turbo and GPT-4, with support rates for each attack technique exceeding 99%.

C NEW ATTACK VECTORS DISCOVERED BY HONEYGPT
As shown in Table 7, HoneyGPT identifies 11 types of attack actions, which are classified into six categories of ATT&CK techniques.

Figure 11: Successful Response Rate with GPT-3.5 Turbo and GPT-4 in HoneyGPT

Table 7: New Attack Vectors Discovered by HoneyGPT

System Service Discovery (T1007): Pipeline operators (|) are used to combine the nvidia-smi command with grep and other filtering commands such as head, awk, and wc, enabling the selective extraction and presentation of precise GPU information according to predetermined criteria.

System Service Discovery (T1007): Pipeline operators (|) are used to combine the lspci command with filtering commands including grep, cut, egrep, and wc, in order to extract and display information related to VGA or 3D graphics devices among the PCI devices present in the system.

System Service Discovery (T1007): The lscpu command is combined with egrep and cut to selectively extract the CPU model name.

System Service Discovery (T1007): The grep command is used to search the /proc/cpuinfo file to calculate the total number of processors or CPU cores in the system.

System Network Configuration Discovery (T1016): The command arp -a is used to display the system's ARP table, containing mappings between IP addresses and MAC addresses.

System Network Configuration Discovery (T1016): An HTTP GET request is sent using "curl ipinfo.io/org" in order to retrieve organizational information associated with the requesting system's IP address.

System Network Configuration Discovery (T1016): The nano command is used to view the /etc/resolv.conf file and access the DNS resolver configuration.

System Information Discovery (T1082): The uptime command is employed to obtain the system's uptime.

Indicator Removal on Host (T1070): Several actions are taken to ensure anonymity and avoid leaving traces of system activity: 1. unsetting environment variables to disable the saving of command history; 2. running history -c to clear the existing history; 3. applying rm -rf to log file paths to delete log files, thereby eliminating traces of system activity.

Virtualization/Sandbox Evasion (T1497): The sleep command is used to ensure program synchronization and prevent detection of anomalous behavior; it can also be used to probe whether the target system is a honeypot, in which case, if execution fails, further attacks are not pursued.

Obfuscated Files or Information (T1027): Scripts written in shell or Perl are obfuscated with base64 encoding prior to execution to avoid detection and identification.
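For readers who want to experiment with the categorization in Table 7, the snippet below sketches a deliberately simplified, rule-based tagger that buckets captured commands into the listed ATT&CK techniques by keyword matching. The keyword lists are rough assumptions distilled from the table; in HoneyGPT itself the security analysis is produced by the LLM-driven prompt framework rather than by fixed rules.

```python
# Simplified, rule-based stand-in for exposition only; not HoneyGPT's analysis logic.
TECHNIQUE_KEYWORDS = {
    "T1007 System Service Discovery": ["nvidia-smi", "lspci", "lscpu", "/proc/cpuinfo"],
    "T1016 System Network Configuration Discovery": ["arp -a", "ipinfo.io", "resolv.conf"],
    "T1082 System Information Discovery": ["uptime"],
    "T1070 Indicator Removal on Host": ["history -c", "unset", "rm -rf"],
    "T1497 Virtualization/Sandbox Evasion": ["sleep"],
    "T1027 Obfuscated Files or Information": ["base64"],
}

def tag_command(command: str) -> list[str]:
    """Return the ATT&CK technique labels whose keywords appear in the command."""
    return [tech for tech, keys in TECHNIQUE_KEYWORDS.items()
            if any(k in command for k in keys)]

# Example: tag_command("cat /proc/cpuinfo | grep processor | wc -l")
# -> ["T1007 System Service Discovery"]
```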
