Download as pdf or txt
Download as pdf or txt
You are on page 1of 6

LLM in the Shell: Generative Honeypots

Muris Sladić Veronica Valeros


Czech Technical University in Prague Czech Technical University in Prague
Czech Republic Czech Republic
sladimur@fel.cvut.cz veronica.valeros@fel.cvut.cz

Carlos Catania Sebastian Garcia


CONICET, UNCuyo Czech Technical University in Prague
Argentina Czech Republic
harpo@ingenieria.uncuyo.edu.ar sebastian.garcia@agents.fel.cvut.cz
arXiv:2309.00155v2 [cs.CR] 9 Feb 2024

ABSTRACT trained on huge amounts of data, to create very realistic honeypot


Honeypots are essential tools in cybersecurity. However, most of systems.
them (even the high-interaction ones) lack the required realism to This article presents two main contributions: human-tested em-
engage and fool human attackers. This limitation makes them eas- pirical evidence supports LLMs in building credible honeypots and
ily discernible, hindering their effectiveness. This work introduces refined LLM-based honeypot software.
a novel method to create dynamic and realistic software honey-
pots based on Large Language Models. Preliminary results indicate
that LLMs can create credible and dynamic honeypots capable of
addressing important limitations of previous honeypots, such as 2 RELATED WORK
deterministic responses, lack of adaptability, etc. We evaluated the The application of artificial intelligence and NLP techniques in
realism of each command by conducting an experiment with hu- the context of honeypot generation is an emerging research topic.
man attackers who needed to say if the answer from the honeypot In particular, the dynamic capabilities of AI techniques seem to
was fake or not. Our proposed honeypot, called shelLM, reached be adequate for dealing with the intrinsic variation of attacker
an accuracy of 0.92. The source code and prompts necessary for behaviors. For instance, the authors of [9] employ a Reinforcement
replicating the experiments have been made publicly available. Learning (RL) approach to design a self-learning honeypot. By
ACM Reference Format: interacting with attackers, the RL-based honeypot is capable of
Muris Sladić, Veronica Valeros, Carlos Catania, and Sebastian Garcia. . LLM adapting its strategy, providing a more dynamic defense mechanism
in the Shell: Generative Honeypots. In . ACM, New York, NY, USA, 6 pages. that learns from each encounter. This paves the way for more
adaptive systems that can better understand and counter evolving
1 INTRODUCTION cyber threats. Furthering the concept of adaptability, the study
in [11] proposes the creation of Self Adapting honeypots, based on
Honeypots allow security researchers and defense teams to detect,
a game-theoretical approach between the attacker and the honeypot.
monitor, and log attacker behavior. The process of successfully un-
This offers a more complex and interactive defensive landscape,
derstanding, learning, and recognizing attackers’ behaviors takes
making it difficult for attackers to identify and bypass the honeypot.
considerable effort and requires large amounts of data. Most honey-
More recently, researchers applied NLP techniques to extract
pots nowadays, however, are used for telemetry and early detection,
information from honeypot logs. The work of [3] employs various
not for analyzing attackers’ behavior since attackers can easily iden-
NLP techniques to cluster different types of cyber attacks, providing
tify most shell-based honeypots [7]. Increasing the credibility of
valuable insights into attacker behaviors and tactics. Such infor-
a shell-based honeypot is not a simple task. The most important
mation can then be used to fine-tune honeypot configurations and
problem of software honeypots is that they do not mimic real com-
improve the overall cybersecurity posture. A similar approach is
puter systems sufficiently well to make human attackers believe
presented in [10], where a fine-tuned LLM (GPT-2C) model is used
long enough that they are interacting with a real computer. These
in parsing dynamic logs generated by Cowrie SSH honeypots. This
honeypots are usually too limited, file systems too generic, answers
fine-tuned model aids in real-time analysis and threat identification,
too deterministic, and the overall system looks too simple [8].
showcasing how machine learning algorithms can contribute to
Our work proposes shelLM, a honeypot creation software based
enhancing the effectiveness of honeypots.
on Large Language Models (LLMs). The aim is to show that the
There are no studies on the application of NLP models for imple-
dynamic creation of fake file systems and command responses can
menting parts of honeypot software. In a preliminary analysis [6],
be more engaging for attackers and prolong their discovery that
authors propose an approach for interfacing with ChatGPT, demon-
they are not interacting with a real system. LLMs can create file
strating how to formulate prompts that can mimic the behavior of
systems and contents on demand, unlike traditional honeypots,
terminal commands on various operating systems.
where they must be manually created. We show that by using good
Our work moves a step forward by implementing a complete
prompt engineering techniques we can leverage LLMs, which are
honeypot software based on the use of Large Language Models such
,, as GPT3.5. It focuses on building more interactive and convincing
. honeypot software that can realistically engage with attackers.
,, Muris Sladić, Veronica Valeros, Carlos Catania, and Sebastian Garcia

3 IMPLEMENTATION (4) Update Prompt: The user command and the LLM response
The key features of shelLM are: (i) the content of a session is are then added to the prompt, which will be used for the
transferred into the new session to keep consistency, (ii) chain- next interaction in the session.
of-thought prompt technique [12], and (iii) prompts with precise
instructions to avoid certain pitfalls. 3.3 Model and Consistency Techniques
The LLM model used was GPT-3.5-turbo-16k from OpenAI [1]. The
3.1 Prompt Techniques model output was generated using a temperature of 0, limited to a
In shelLM, to instruct its behavior, the LLM is provided with an maximum of 800 tokens, with all other parameters in their default
initial prompt at the beginning of each session. We refer to it as values.
the personality prompt. Different prompt techniques are used to To keep the consistency between multiple sessions, the content
instruct the LLM to be precise, to be real, and not to disclose its of each terminal session is saved in .txt file and used as an initial
usage. Initially, through prompt engineering, the LLM is given a prompt for the next sessions. If a session ends, the user can continue
personality that describes the behavior of a real Linux terminal. the interaction in the same state. If the context gets filled, i.e., the
The model was given detailed step-by-step explanations of the sum of input tokens and needed tokens to generate output reaches
expected behavior, including a few examples of the desired output the maximum of 16k tokens, the history of the conversation is
under certain situations. In addition, the model was encouraged to wiped, restoring the initial personality prompt.
think step-by-step following the Chain of Thought (CoT) prompt
technique [12] combined with few-shot [4] prompting to get better 3.4 Demonstrating shelLM in Action
results. Finally, orders considered more important were repeated in An example output generated by shelLM can be observed in Fig-
the prompt multiple times to force the expected behavior. ure 2. The Figure shows the output of a ping command, offering
a glimpse into how the LLM honeypot emulates network connec-
3.2 Session Management tivity diagnostics. Figure 3 shows the output of the xinput com-
mand, demonstrating shelLM’s capability to mimic input device
A shelLM session consists of the complete set of interactions be-
configurations. Lastly, Figure 4 provides a sample of the wget com-
tween a user and the LLM. The session operation is close to the
mand output, illustrating the honeypot capabilities for replicating
approach used for building chatbots on top of an LLM, as shown in
network interactions with other servers. These examples help to
Figure 1.
highlight the precision with which the LLM honeypot simulates
genuine system behaviors, making it an extremely useful tool in
the hands of cybersecurity professionals.

Figure 2: Example output of a ping command generated by


the shelLM LLM honeypot.

Figure 1: Prompt construction during a shelLM session

The session begins with the user interacting with the LLM. This
is the initialization point where the personality prompt is first intro-
duced. Then, a prompt construction process is repeated for each Figure 3: Example output of a xinput command generated by
new user input until the session ends. The prompt construction the shelLM LLM honeypot.
process consists of the following steps:
(1) User Input: The user inputs a command. This software was implemented in Python and published on
(2) Prompt Creation: This prompt integrates the original per- GitHub at https://github.com/stratosphereips/shelLM. The tool
sonality prompt, all previous interactions (both user inputs requires a valid OpenAI API for GPT-3.5-turbo-16k. An online demo
and LLM responses), and the new user command. instance of the shelLM honeypot is currently operational and can be
(3) LLM Processing: The LLM processes this comprehensive accessed at the IP address 147.32.80.38. This access is facilitated
prompt and generates a response. via the Secure Shell (SSH) protocol on port 1337. To log in, use
LLM in the Shell: Generative Honeypots ,,

Figure 4: Example output of a wget command generated by


the shelLM LLM honeypot.

’root’ as the username and ’WhatABeatifulPassword’ as the


password.

4 EXPERIMENT SETUP
Figure 5: Process for evaluating the honeypot software. Users
The hypothesis behind our work was that it is possible to create
interact with the LLM honeypot and send answers with their
an LLM honeypot that humans could not easily distinguish from a
assessment to be evaluated. Security experts compare the
real system. Our method consisted of using an LLM for simulating
output of the LLM with the answers of the participants, de-
a Linux terminal that can be used in real life through a secure shell
termining for each answer if it was an FP/FN/TN/TP.
(SSH).
The implementation of the LLM honeypot software was evalu-
ated by asking 12 users with various levels of security expertise to counted as a false negative. The rest were counted as true positives
try the honeypot and to identify which outputs helped them dis- and true negatives correspondingly.
cern whether the system was a honeypot. Answers were manually A more detailed interpretation of each error is described as fol-
processed to obtain a list of input commands, their outputs, and lows:
the user evaluation. Results were computed by counting the errors
• True Positives (TP): Attackers correctly identified the com-
committed by the participants.
mands as a honeypot, and the expert agreed and confirmed
The experiments focused on human interactions with the LLM
the command was revealing the honeypot.
honeypot. LLM honeypots were deployed in a cloud provider with
• False Positives (FP): Attackers correctly identified the com-
SSH to allow secure access by the participants. Each participant was
mand as a honeypot, but for the wrong reasons, maybe be-
randomly assigned a unique honeypot instance. After logging in, an
cause they knew beforehand that it was a honeypot.
automatic process was triggered to run the LLM honeypot, initiating
• False Negatives (FN): Attackers say the command is nor-
the LLM-user interaction. Users were never able to interact with the
mal, and the expert believes the command was revealing
host system, only with the LLM honeypot. Participants were told
the honeypot. This is considered beneficial for the honeypot
to connect to the LLM honeypot, interact, and send the answers
software.
via email.
• True Negatives (TN): Attackers say it is not a honeypot, and
Participants were told beforehand that the system was a hon-
the expert confirmed it was not revealing a honeypot.
eypot. The goal was to evaluate if a known honeypot generates
text output that does not give participants any indication of failure, The example below shows a true positive, in which a participant
wrongness, or not being a real system. Participants gave specific, identified the system as a "honeypot", and experts confirmed that it
command-by-command feedback about whether, according to them, was a clear LLM-introduced error.
the output belonged to a honeypot or not. Participants sent com-
plete logs, sent screenshots, or recorded videos. jennifer@itcompany:~$ w
To verify the correctness of the experiments and the answers, bash: syntax error near unexpected token `newline'
we consulted with several Linux and security experts, who checked
the correctness of the output and helped classify each command The example below shows a false positive, in which a participant
as well as validate participants’ responses. Figure 5 represents the identified the system as a "honeypot", but experts confirmed that
flow of this experiment. it is common that the ’.bashrc’ file does not exist in systems that
use other interpreters. This is common on Linux-based IoT devices.
4.1 Type Error Interpretation This answer is an artifact of telling the participants they would be
Given the particular nature of the honeypot evaluation, the error inside a honeypot.
types need to be clarified. If a command output was marked as
a "honeypot" when it was actually indistinguishable from a real jennifer@itcompany:~$ cat .bashrc
answer, it was counted as a false positive. If participants marked cat: .bashrc: No such file or directory
an answer as "real" when it was clearly wrong or forged, it was
,, Muris Sladić, Veronica Valeros, Carlos Catania, and Sebastian Garcia

Another example of a false positive was the command shown of commands were false negatives, in which participants missed
below, in which the participant identified this command as honey- noticing inconsistencies in the model output.
pot due to the small number of files. However, many small systems
have this number of files or fewer. We believe this answer was an
artifact of insufficient real security experience.
Experts
jennifer@itcompany:~$ ls /var/run Attackers REAL FORGED
acpid.pid crond.pid dhclient-eth0.pid REAL 167 1
dhclient.pid initramfs motd sshd.pid FORGED 17 41
sudo utmp wpa_supplicant.pid
Accuracy : 0.9204
95% CI : (0.877, 0.9521)
4.2 Metrics for Evaluating Honeypot Software No Information Rate : 0.8142
Following the error interpretation described in subsection 4.1, the P-Value [Acc > NIR] : 5.432e-06
following metrics are considered for evaluating the honeypot soft-
ware performance: False Negative Rate : 0.0232
Overall Accuracy: indicates the number of commands correctly True Negative Rate : 0.9076
identified by the participants over all the commands executed by False Discovery Rate : 0.0923
all the participants. While this metric is not completely adequate 'Positive' Class : REAL
for the current problem it still provides a valuable overview. The
formula is:
𝑇𝑃 +𝑇𝑁
Overall Accuracy = (1)
𝐹𝑁 + 𝑇𝑃 + 𝐹𝑃 + 𝑇𝑃 Figure 6: Confusion Matrix and Performance metrics for the
False Negative Rate: In honeypot systems, the false negative rate LLM-based honeypot.
(FNR) indicates that the honeypot is effective at masquerading as a
real system.
𝐹𝑁
False Negative Rate = (2) For our purposes, a perfect honeypot would only have True Neg-
𝐹𝑁 +𝑇𝑃
atives since we would need attackers to say negative, meaning the
True Negative Rate (Specificity): It is desirable to maximize the
answers are not coming from a honeypot, and it would be true. This
True Negative Rate (TNR) to demonstrate that the honeypot is
is different from other detection methods, where what is expected is
effective at not flagging typical commands as revealing.
mostly a large number of true positives. If we look at the results per
𝑇𝑁 users shown in Table 1), we can confirm that the True Negative Rate
True Negative Rate = (3)
𝑇 𝑁 + 𝐹𝑃 was 1 in most cases. This indicates that the honeypot effectively
False Discovery Rate: The false discovery rate (FDR) is an impor- convinces attackers that they are not interacting with a honeypot.
tant metric to minimize because it measures the number of times On the other hand, User 9 is a notable outlier with a TNR of 0.5,
the honeypot is incorrectly identified as a real system. suggesting the honeypot was only partially successful in deceiving
𝐹𝑃 this attacker.
False Discovery Rate = (4) The FNR values are generally low, which demonstrates the sys-
𝐹𝑃 + 𝑇𝑃
tem’s effectiveness. Again, User 9, who has an FNR of 0 but a lower
5 RESULTS TNR, might indicate a different interaction pattern that could be
A total of 12 human users tested the performance of the honeypot. studied further.
In sum, users executed 226 commands associated with package Finally, when considering the overall accuracy, most users have
management, file system management, network operations, system an accuracy above 0.9, showcasing again the general effectiveness
information, and file management. Only 76 unique commands were of the honeypot.
executed. The top ten commands executed were: cat, ls, sudo, Widely common commands such as ls, pwd, and whoami always
get, echo, pwd, nano, ping, ssh and whois. On average, each user returned the expected output. Similarly, commands including ping,
executed 19 commands, with a minimum of 12 and a maximum of wget, whois, tcpdump, sudo -su and non-valid command inputs
38 commands. resulted in credible output. The issues that occurred were incon-
The results of the experiments are summarized in Figure 6. The sistencies between files in /proc and /etc directories, possibly
analysis showed that 74% (167) of the commands that seemed cred- caused by the limitation on the number of tokens and the occa-
ible to participants had realistic outputs (True Negative Rate). A sional strange behavior of the LLM. This strange behavior of the
total of 7.5% (17) of commands flagged as honeypot indicators were model can be attributed to its context getting bigger. As it was
false positives and could be attributed to the knowledge of partic- observed recently [5], Large Language Models perform the best
ipants that they are in a honeypot system (False Discovery Rate). when the relevant information is at the beginning or the end of the
An 18% (41) of commands flagged as honeypot indicators produced input context. When the relevant information is in the middle of
incorrect or strange output (True Positives). The remaining 0.5% (1) the input context, the model’s performance tends to degrade.
LLM in the Shell: Generative Honeypots ,,

Table 1: Performance metrics of the LLM honeypot per user Table 2: Costs for running the experiment per user

User Accuracy FNR TNR FDR User Tokens Input $ Output $ Time [m]
1 0.643 0.417 1 0 1 7243 0.2939 0.0189 47
2 0.941 0.071 1 0 2 7455 0.3120 0.0197 27
3 1 0 1 0 3 9572 0.4224 0.0252 24
4 0.882 0.2 1 0 4 12731 0.5616 0.0355 26
5 1 0 1 0 5 14365 0.6334 0.0379 44
6 0.923 0.090 1 0 6 9098 0.4012 0.0241 31
7 0.923 0.105 1 0 7 12306 0.5418 0.0324 42
8 0.933 0.090 1 0 8 5253 0.2317 0.0138 34
9 0.938 0 0.5 0.5 9 9586 0.4225 0.0253 55
10 0.867 0.167 1 0 10 3963 0.1748 0.0104 12
11 0.75 0.273 1 0 11 12831 0.5654 0.0338 53
12 1 0 1 0 12 9755 0.4295 0.0257 29

6 COST ANALYSIS Twelve human security experts performed the evaluation of the
A primary constraint with the use of shelLM is the cost involved in honeypot’s credibility. During this preliminary evaluation, the LLM
the generation process. Since the current implementation is based honeypot obtained an accuracy of 92%, marked by 74% true nega-
on the OPENAI GPT-3.5-turbo-16k model and has high computa- tives and 18% true positives. The results confirmed our hypothesis
tional and financial resource requirements, the use of this technol- that it was possible to create an LLM honeypot that humans find
ogy in every scenario may not be feasible. The cost of generating hard to distinguish from a real system.
responses using GPT-3.5-turbo-16k can add up quickly, especially Despite the encouraging preliminary results, we are aware of
when processing large volumes of text. This expense is primarily the limitations of our current implementation. In particular, the
due to the significant amount of energy consumed by the special- occasional strange behavior of the LLM is caused by its inherent
ized hardware and software required for running these models at stochastic nature and its memory issues related to size of the input
scale. Therefore, it is crucial to carefully consider the potential ben- context [5]. Answer latency due to the API responsiveness and the
efits and drawbacks of using shelLM in a given context, weighing response generation time are two other limitations of the current
cost against utility, before making a decision. implementation.
At the time of writing this paper, the costs for the GPT-3.5-turbo- Future work will focus on improving the LLM-based honeypot
16k model are $0, 001 for 1k input tokens and $0, 002 for 1k output responsiveness, engagement, and error management. We plan to
tokens [2]. The estimate of the cost for a full shelLM session of 16k fine-tune the model to attempt to remediate forgetfulness and be-
tokens is around $0.70, from which $0.6577 can be attributed to the havior deterioration issues with larger context. In addition, we
input tokens and $0.0414 for the output tokens. intend to compare shelLM with other well-known honeypots to
In Table 2, we present a comprehensive cost analysis for each determine its appeal to attackers and the quality of data it produces.
of the 12 participants in the experiments. The table presents, for More experiments will be run where the participants would not
each user, the total token count, input and output costs, and total know that they are being tested in the recognition of honeypots
session duration. to measure their growing suspiciousness. Finally, we plan to con-
The average duration participants spent interacting with shelLM duct more experiments focused on differentiating human from bot
was around 35 minutes. Based on this data we can say that the cost behaviors within the honeypot and identify their respective signals.
of usage of shelLM is around $0.4 per 30 minutes of active use,
that is roughly $0.8 per hour. The total cost of our experiment was REFERENCES
$5,2936, out of which $4,9905 was for input tokens and $0,3031 on [1] Open AI. 2023. GPT 3.5. https://platform.openai.com/docs/models/gpt-3-5.
[Online; accessed 22-July-2023].
output tokens. [2] Open AI. 2023. Pricing. https://openai.com/pricing. [Online; accessed
The initial personality prompt was a fundamental element for 9-December-2023].
simulating a Linux terminal. Since it was included in the prompt [3] Matteo Boffa, Giulia Milan, Luca Vassio, Idilio Drago, Marco Mellia, and Zied
Ben Houidi. 2022. Towards NLP-based Processing of Honeypot Logs. In 2022 IEEE
of all the participants of the experiments, we analyzed its costs European Symposium on Security and Privacy Workshops (EuroS’I&’PW). 314–321.
independently. The initial personality prompt was composed of https://doi.org/10.1109/EuroSPW55150.2022.00038
[4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan,
2138 tokens, and its cost was $0.00655. Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan,
7 CONCLUSION AND FUTURE WORK Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter,
Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
In this study, we presented a novel application of LLMs for honeypot Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya
development, resulting in a system that dynamically generates syn- Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners.
CoRR abs/2005.14165 (2020). arXiv:2005.14165 https://arxiv.org/abs/2005.14165
thetic data as demanded by the user. The core of the honeypot was [5] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua,
implemented using several LLM prompt engineering techniques. Fabio Petroni, and Percy Liang. 2023. Lost in the Middle: How Language Models
,, Muris Sladić, Veronica Valeros, Carlos Catania, and Sebastian Garcia

Use Long Contexts. arXiv:2307.03172 [cs.CL] [9] Adrian Pauna and Ion Bica. 2014. RASSH - Reinforced adaptive SSH honeypot.
[6] Forrest McKee and David Noever. 2023. Chatbots in a Honeypot World. In 2014 10th International Conference on Communications (COMM). 1–6. https:
arXiv:2301.03771 [cs.CR] //doi.org/10.1109/ICComm.2014.6866707
[7] Iyatiti Mokube and Michele Adams. 2007. Honeypots: Concepts, Approaches, [10] Febrian Setianto, Erion Tsani, Fatima Sadiq, Georgios Domalis, Dimitris Tsakalidis,
and Challenges. In Proceedings of the 45th Annual Southeast Regional Conference and Panos Kostakos. 2021. GPT-2C: a parser for honeypot logs using large pre-
(Winston-Salem, North Carolina) (ACM-SE 45). Association for Computing Ma- trained language models. 649–653. https://doi.org/10.1145/3487351.3492723
chinery, New York, NY, USA, 321–326. https://doi.org/10.1145/1233341.1233399 [11] Gérard Wagener, Radu State, Thomas Engel, and Alexandre Dulaunoy. 2011.
[8] Shun Morishita, Takuya Hoizumi, Wataru Ueno, Rui Tanabe, Carlos Gañán, Adaptive and self-configurable honeypots. Proceedings of the 12th IFIP/IEEE
Michel J.G. van Eeten, Katsunari Yoshioka, and Tsutomu Matsumoto. 2019. Detect International Symposium on Integrated Network Management, IM 2011, 345–352.
Me If You. . . Oh Wait. An Internet-Wide View of Self-Revealing Honeypots. In https://doi.org/10.1109/INM.2011.5990710
2019 IFIP/IEEE Symposium on Integrated Network and Service Management (IM). [12] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia,
134–143. Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-Thought Prompting Elicits
Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL]

You might also like