3 IMPLEMENTATION
The key features of shelLM are: (i) the content of a session is transferred into the new session to keep consistency, (ii) the chain-of-thought prompting technique [12], and (iii) prompts with precise instructions to avoid certain pitfalls.

3.1 Prompt Techniques
In shelLM, to instruct its behavior, the LLM is provided with an initial prompt at the beginning of each session. We refer to it as the personality prompt. Different prompt techniques are used to instruct the LLM to be precise, to appear real, and not to disclose its usage. Initially, through prompt engineering, the LLM is given a personality that describes the behavior of a real Linux terminal. The model was given detailed step-by-step explanations of the expected behavior, including a few examples of the desired output under certain situations. In addition, the model was encouraged to think step-by-step following the Chain of Thought (CoT) prompting technique [12] combined with few-shot prompting [4] to get better results. Finally, orders considered more important were repeated in the prompt multiple times to force the expected behavior.

3.2 Session Management
A shelLM session consists of the complete set of interactions between a user and the LLM. The session operation is close to the approach used for building chatbots on top of an LLM, as shown in Figure 1.

The session begins with the user interacting with the LLM. This is the initialization point where the personality prompt is first introduced. Then, a prompt construction process is repeated for each new user input until the session ends. The prompt construction process consists of the following steps:

(1) User Input: The user inputs a command.
(2) Prompt Creation: This prompt integrates the original personality prompt, all previous interactions (both user inputs and LLM responses), and the new user command.
(3) LLM Processing: The LLM processes this comprehensive prompt and generates a response.
(4) Update Prompt: The user command and the LLM response are then added to the prompt, which will be used for the next interaction in the session.

3.3 Model and Consistency Techniques
The LLM used was GPT-3.5-turbo-16k from OpenAI [1]. The model output was generated using a temperature of 0, limited to a maximum of 800 tokens, with all other parameters at their default values.

To keep consistency between multiple sessions, the content of each terminal session is saved in a .txt file and used as an initial prompt for the next sessions. If a session ends, the user can continue the interaction in the same state. If the context gets filled, i.e., the sum of input tokens and the tokens needed to generate the output reaches the maximum of 16k tokens, the history of the conversation is wiped, restoring the initial personality prompt.

3.4 Demonstrating shelLM in Action
An example output generated by shelLM can be observed in Figure 2. The figure shows the output of a ping command, offering a glimpse into how the LLM honeypot emulates network connectivity diagnostics. Figure 3 shows the output of the xinput command, demonstrating shelLM's capability to mimic input device configurations. Lastly, Figure 4 provides a sample of the wget command output, illustrating the honeypot's capabilities for replicating network interactions with other servers. These examples highlight the precision with which the LLM honeypot simulates genuine system behaviors, making it an extremely useful tool in the hands of cybersecurity professionals.

Figure 3: Example output of a xinput command generated by the shelLM LLM honeypot.

This software was implemented in Python and published on GitHub at https://github.com/stratosphereips/shelLM. The tool requires a valid OpenAI API key for GPT-3.5-turbo-16k. An online demo instance of the shelLM honeypot is currently operational and can be accessed at the IP address 147.32.80.38. This access is facilitated via the Secure Shell (SSH) protocol on port 1337. To log in, use
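The prompt-construction loop of Section 3.2 together with the context-reset rule of Section 3.3 can be sketched as follows. This is a minimal illustration, not the published implementation: `query_llm` and `estimate_tokens` are hypothetical stand-ins for the OpenAI API call (temperature 0, 800 output tokens) and a real tokenizer such as tiktoken.

```python
# Sketch of the shelLM session loop (Sections 3.2 and 3.3).
# Assumptions: `query_llm` and `estimate_tokens` are placeholders, not the
# actual shelLM code.

MAX_CONTEXT_TOKENS = 16_000   # context window of gpt-3.5-turbo-16k
MAX_OUTPUT_TOKENS = 800       # response limit used by shelLM

def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer (~4 characters per token).
    return max(1, len(text) // 4)

def query_llm(prompt: str) -> str:
    # Hypothetical placeholder for the OpenAI chat-completion call.
    return "simulated terminal output"

class Session:
    def __init__(self, personality_prompt: str):
        self.personality = personality_prompt
        self.history: list[str] = []  # alternating user/LLM turns

    def build_prompt(self, command: str) -> str:
        # (2) Prompt Creation: personality + all previous turns + new command.
        return "\n".join([self.personality, *self.history, command])

    def step(self, command: str) -> str:
        prompt = self.build_prompt(command)
        # Context-reset rule (Section 3.3): if the input tokens plus the
        # tokens needed for the output would exceed 16k, wipe the history
        # and keep only the personality prompt.
        if estimate_tokens(prompt) + MAX_OUTPUT_TOKENS > MAX_CONTEXT_TOKENS:
            self.history.clear()
            prompt = self.build_prompt(command)
        response = query_llm(prompt)          # (3) LLM Processing
        self.history += [command, response]   # (4) Update Prompt
        return response
```

In the real tool the saved .txt transcript of a previous session would seed `history` at start-up, which is what carries state across sessions.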
LLM in the Shell: Generative Honeypots
4 EXPERIMENT SETUP
The hypothesis behind our work was that it is possible to create an LLM honeypot that humans could not easily distinguish from a real system. Our method consisted of using an LLM to simulate a Linux terminal that can be used in real life through a secure shell (SSH).

The implementation of the LLM honeypot software was evaluated by asking 12 users with various levels of security expertise to try the honeypot and to identify which outputs helped them discern whether the system was a honeypot. Answers were manually processed to obtain a list of input commands, their outputs, and the user evaluation. Results were computed by counting the errors committed by the participants.

The experiments focused on human interactions with the LLM honeypot. LLM honeypots were deployed in a cloud provider with SSH access for the participants. Each participant was randomly assigned a unique honeypot instance. After logging in, an automatic process was triggered to run the LLM honeypot, initiating the LLM-user interaction. Users were never able to interact with the host system, only with the LLM honeypot. Participants were told to connect to the LLM honeypot, interact, and send their answers via email.

Participants were told beforehand that the system was a honeypot. The goal was to evaluate whether a known honeypot generates text output that does not give participants any indication of failure, wrongness, or not being a real system. Participants gave specific, command-by-command feedback about whether, according to them, the output belonged to a honeypot or not. Participants sent complete logs, screenshots, or recorded videos.

To verify the correctness of the experiments and the answers, we consulted several Linux and security experts, who checked the correctness of the output, helped classify each command, and validated participants' responses. Figure 5 represents the flow of this experiment.

Figure 5: Process for evaluating the honeypot software. Users interact with the LLM honeypot and send answers with their assessment to be evaluated. Security experts compare the output of the LLM with the answers of the participants, determining for each answer if it was an FP/FN/TN/TP.

4.1 Type Error Interpretation
Given the particular nature of the honeypot evaluation, the error types need to be clarified. If a command output was marked as a "honeypot" when it was actually indistinguishable from a real answer, it was counted as a false positive. If participants marked an answer as "real" when it was clearly wrong or forged, it was counted as a false negative. The rest were counted as true positives and true negatives correspondingly.

A more detailed interpretation of each error is described as follows:

• True Positives (TP): Attackers correctly identified the command as a honeypot, and the expert agreed and confirmed the command was revealing the honeypot.
• False Positives (FP): Attackers identified the command as a honeypot, but for the wrong reasons, perhaps because they knew beforehand that it was a honeypot.
• False Negatives (FN): Attackers say the command is normal, and the expert believes the command was revealing the honeypot. This is considered beneficial for the honeypot software.
• True Negatives (TN): Attackers say it is not a honeypot, and the expert confirmed it was not revealing a honeypot.

The example below shows a true positive, in which a participant identified the system as a "honeypot" and experts confirmed that it was a clear LLM-introduced error.

jennifer@itcompany:~$ w
bash: syntax error near unexpected token `newline'

The example below shows a false positive, in which a participant identified the system as a "honeypot", but experts confirmed that it is common for the '.bashrc' file not to exist in systems that use other interpreters. This is common on Linux-based IoT devices. This answer is an artifact of telling the participants they would be inside a honeypot.

jennifer@itcompany:~$ cat .bashrc
cat: .bashrc: No such file or directory
Muris Sladić, Veronica Valeros, Carlos Catania, and Sebastian Garcia
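The four outcome types of the error interpretation above reduce to a simple mapping between the attacker's verdict and the expert's assessment. A minimal sketch (the label strings are illustrative, not taken from the published tool):

```python
# Mapping from (attacker verdict, expert assessment) to the four outcome
# types of Section 4.1. "honeypot" means the answer was judged to reveal
# the honeypot; "real" means it was judged indistinguishable from a real
# system. The label values are assumptions for illustration only.

def classify(attacker_says: str, expert_says: str) -> str:
    if attacker_says == "honeypot" and expert_says == "honeypot":
        return "TP"  # flagged, and the output really was revealing
    if attacker_says == "honeypot" and expert_says == "real":
        return "FP"  # flagged, but for the wrong reasons
    if attacker_says == "real" and expert_says == "honeypot":
        return "FN"  # a revealing output went unnoticed (good for the honeypot)
    return "TN"      # credible output, correctly judged credible
```

Note the inverted incentives compared to a detector: for the honeypot, FN is a success (a slip-up went unnoticed) and only TP counts against it.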
Another example of a false positive was the command shown below, in which the participant identified this command as a honeypot indicator due to the small number of files. However, many small systems have this number of files or fewer. We believe this answer was an artifact of insufficient real security experience.

jennifer@itcompany:~$ ls /var/run
acpid.pid crond.pid dhclient-eth0.pid
dhclient.pid initramfs motd sshd.pid
sudo utmp wpa_supplicant.pid

4.2 Metrics for Evaluating Honeypot Software
Following the error interpretation described in subsection 4.1, the following metrics are considered for evaluating the honeypot software performance:

Overall Accuracy: indicates the number of commands correctly identified by the participants over all the commands executed by all the participants. While this metric is not completely adequate for the current problem, it still provides a valuable overview. The formula is:

  Overall Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

False Negative Rate: In honeypot systems, the false negative rate (FNR) indicates how effective the honeypot is at masquerading as a real system.

  False Negative Rate = FN / (FN + TP)    (2)

True Negative Rate (Specificity): It is desirable to maximize the True Negative Rate (TNR) to demonstrate that the honeypot is effective at not flagging typical commands as revealing.

  True Negative Rate = TN / (TN + FP)    (3)

False Discovery Rate: The false discovery rate (FDR) is an important metric to minimize because it measures the proportion of commands flagged as a honeypot whose output was actually indistinguishable from a real system.

  False Discovery Rate = FP / (FP + TP)    (4)

5 RESULTS
A total of 12 human users tested the performance of the honeypot. In sum, users executed 226 commands associated with package management, file system management, network operations, system information, and file management. Only 76 unique commands were executed. The top ten commands executed were: cat, ls, sudo, wget, echo, pwd, nano, ping, ssh and whois. On average, each user executed 19 commands, with a minimum of 12 and a maximum of 38 commands.

The results of the experiments are summarized in Figure 6. The analysis showed that 74% (167) of the commands that seemed credible to participants had realistic outputs (true negatives). A total of 7.5% (17) of commands flagged as honeypot indicators were false positives and could be attributed to the participants' knowledge that they were in a honeypot system. An 18% (41) of commands flagged as honeypot indicators produced incorrect or strange output (true positives). The remaining 0.5% (1) of commands were false negatives, in which participants missed noticing inconsistencies in the model output.

Figure 6: Confusion Matrix and Performance metrics for the LLM-based honeypot.

                        Experts
  Attackers          REAL   FORGED
  REAL                167        1
  FORGED               17       41

  Accuracy             : 0.9204
  95% CI               : (0.877, 0.9521)
  No Information Rate  : 0.8142
  P-Value [Acc > NIR]  : 5.432e-06
  False Negative Rate  : 0.0232
  True Negative Rate   : 0.9076
  False Discovery Rate : 0.0923
  'Positive' Class     : REAL

For our purposes, a perfect honeypot would only have true negatives, since we would need attackers to say negative, meaning the answers are not coming from a honeypot, and for that to be true. This is different from other detection methods, where what is expected is mostly a large number of true positives. If we look at the results per user shown in Table 1, we can confirm that the True Negative Rate was 1 in most cases. This indicates that the honeypot effectively convinces attackers that they are not interacting with a honeypot. On the other hand, User 9 is a notable outlier with a TNR of 0.5, suggesting the honeypot was only partially successful in deceiving this attacker.

The FNR values are generally low, which demonstrates the system's effectiveness. Again, User 9, who has an FNR of 0 but a lower TNR, might indicate a different interaction pattern that could be studied further.

Finally, when considering the overall accuracy, most users have an accuracy above 0.9, showcasing again the general effectiveness of the honeypot.

Widely common commands such as ls, pwd, and whoami always returned the expected output. Similarly, commands including ping, wget, whois, tcpdump, sudo -su and non-valid command inputs resulted in credible output. The issues that occurred were inconsistencies between files in the /proc and /etc directories, possibly caused by the limitation on the number of tokens and the occasional strange behavior of the LLM. This strange behavior of the model can be attributed to its context getting bigger. As was observed recently [5], Large Language Models perform best when the relevant information is at the beginning or the end of the input context. When the relevant information is in the middle of the input context, the model's performance tends to degrade.
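Applying Equations (1)-(4) to the Figure 6 counts reproduces the reported accuracy and TNR; the FNR and FDR entries shown in Figure 6 appear to follow the statistics package's own positive-class convention ('Positive' Class: REAL) and do not match a direct application of Equations (2) and (4). A minimal sketch using the paper's own formulas:

```python
# Metrics of Section 4.2 computed from the Figure 6 confusion matrix.
# Mapping per Section 4.1: TN = 167 (real/real), FN = 1 (real/forged),
# FP = 17 (forged/real), TP = 41 (forged/forged).
TP, TN, FP, FN = 41, 167, 17, 1

accuracy = (TP + TN) / (TP + TN + FP + FN)  # Eq. (1)
fnr = FN / (FN + TP)                        # Eq. (2)
tnr = TN / (TN + FP)                        # Eq. (3)
fdr = FP / (FP + TP)                        # Eq. (4)

print(round(accuracy, 4), round(tnr, 4))    # 0.9204 0.9076
```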
Table 1: Performance metrics of the LLM honeypot per user

  User  Accuracy  FNR    TNR  FDR
  1     0.643     0.417  1    0
  2     0.941     0.071  1    0
  3     1         0      1    0
  4     0.882     0.2    1    0
  5     1         0      1    0
  6     0.923     0.090  1    0
  7     0.923     0.105  1    0
  8     0.933     0.090  1    0
  9     0.938     0      0.5  0.5
  10    0.867     0.167  1    0
  11    0.75      0.273  1    0
  12    1         0      1    0

Table 2: Costs for running the experiment per user

  User  Tokens  Input $  Output $  Time [m]
  1     7243    0.2939   0.0189    47
  2     7455    0.3120   0.0197    27
  3     9572    0.4224   0.0252    24
  4     12731   0.5616   0.0355    26
  5     14365   0.6334   0.0379    44
  6     9098    0.4012   0.0241    31
  7     12306   0.5418   0.0324    42
  8     5253    0.2317   0.0138    34
  9     9586    0.4225   0.0253    55
  10    3963    0.1748   0.0104    12
  11    12831   0.5654   0.0338    53
  12    9755    0.4295   0.0257    29
6 COST ANALYSIS
A primary constraint with the use of shelLM is the cost involved in the generation process. Since the current implementation is based on the OpenAI GPT-3.5-turbo-16k model and has high computational and financial resource requirements, the use of this technology may not be feasible in every scenario. The cost of generating responses using GPT-3.5-turbo-16k can add up quickly, especially when processing large volumes of text. This expense is primarily due to the significant amount of energy consumed by the specialized hardware and software required for running these models at scale. Therefore, it is crucial to carefully weigh the potential benefits and drawbacks of using shelLM in a given context, balancing cost against utility, before making a decision.

At the time of writing this paper, the costs for the GPT-3.5-turbo-16k model are $0.001 per 1k input tokens and $0.002 per 1k output tokens [2]. The estimated cost for a full shelLM session of 16k tokens is around $0.70, of which $0.6577 can be attributed to the input tokens and $0.0414 to the output tokens.

In Table 2, we present a comprehensive cost analysis for each of the 12 participants in the experiments. The table presents, for each user, the total token count, input and output costs, and total session duration.

The average duration participants spent interacting with shelLM was around 35 minutes. Based on this data, we can say that the cost of using shelLM is around $0.4 per 30 minutes of active use, that is, roughly $0.8 per hour. The total cost of our experiment was $5.2936, of which $4.9905 was for input tokens and $0.3031 for output tokens.

The initial personality prompt was a fundamental element for simulating a Linux terminal. Since it was included in the prompt of all the participants of the experiments, we analyzed its cost independently. The initial personality prompt was composed of 2,138 tokens, and its cost was $0.00655.

7 CONCLUSION AND FUTURE WORK
In this study, we presented a novel application of LLMs for honeypot development, resulting in a system that dynamically generates synthetic data as demanded by the user. The core of the honeypot was implemented using several LLM prompt engineering techniques. Twelve human security experts performed the evaluation of the honeypot's credibility. During this preliminary evaluation, the LLM honeypot obtained an accuracy of 92%, marked by 74% true negatives and 18% true positives. The results confirmed our hypothesis that it was possible to create an LLM honeypot that humans find hard to distinguish from a real system.

Despite the encouraging preliminary results, we are aware of the limitations of our current implementation. In particular, the occasional strange behavior of the LLM is caused by its inherent stochastic nature and its memory issues related to the size of the input context [5]. Answer latency, due to API responsiveness and response generation time, is another limitation of the current implementation.

Future work will focus on improving the LLM-based honeypot's responsiveness, engagement, and error management. We plan to fine-tune the model to attempt to remediate forgetfulness and behavior deterioration issues with larger contexts. In addition, we intend to compare shelLM with other well-known honeypots to determine its appeal to attackers and the quality of data it produces. More experiments will be run where the participants do not know that they are being tested on the recognition of honeypots, to measure their growing suspiciousness. Finally, we plan to conduct more experiments focused on differentiating human from bot behaviors within the honeypot and identifying their respective signals.

REFERENCES
[1] Open AI. 2023. GPT 3.5. https://platform.openai.com/docs/models/gpt-3-5. [Online; accessed 22-July-2023].
[2] Open AI. 2023. Pricing. https://openai.com/pricing. [Online; accessed 9-December-2023].
[3] Matteo Boffa, Giulia Milan, Luca Vassio, Idilio Drago, Marco Mellia, and Zied Ben Houidi. 2022. Towards NLP-based Processing of Honeypot Logs. In 2022 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW). 314–321. https://doi.org/10.1109/EuroSPW55150.2022.00038
[4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. CoRR abs/2005.14165 (2020). arXiv:2005.14165 https://arxiv.org/abs/2005.14165
[5] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172 [cs.CL]
[6] Forrest McKee and David Noever. 2023. Chatbots in a Honeypot World. arXiv:2301.03771 [cs.CR]
[7] Iyatiti Mokube and Michele Adams. 2007. Honeypots: Concepts, Approaches, and Challenges. In Proceedings of the 45th Annual Southeast Regional Conference (Winston-Salem, North Carolina) (ACM-SE 45). Association for Computing Machinery, New York, NY, USA, 321–326. https://doi.org/10.1145/1233341.1233399
[8] Shun Morishita, Takuya Hoizumi, Wataru Ueno, Rui Tanabe, Carlos Gañán, Michel J.G. van Eeten, Katsunari Yoshioka, and Tsutomu Matsumoto. 2019. Detect Me If You... Oh Wait. An Internet-Wide View of Self-Revealing Honeypots. In 2019 IFIP/IEEE Symposium on Integrated Network and Service Management (IM). 134–143.
[9] Adrian Pauna and Ion Bica. 2014. RASSH - Reinforced adaptive SSH honeypot. In 2014 10th International Conference on Communications (COMM). 1–6. https://doi.org/10.1109/ICComm.2014.6866707
[10] Febrian Setianto, Erion Tsani, Fatima Sadiq, Georgios Domalis, Dimitris Tsakalidis, and Panos Kostakos. 2021. GPT-2C: a parser for honeypot logs using large pre-trained language models. 649–653. https://doi.org/10.1145/3487351.3492723
[11] Gérard Wagener, Radu State, Thomas Engel, and Alexandre Dulaunoy. 2011. Adaptive and self-configurable honeypots. In Proceedings of the 12th IFIP/IEEE International Symposium on Integrated Network Management (IM 2011). 345–352. https://doi.org/10.1109/INM.2011.5990710
[12] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL]