Applying LLMs To Threat Intelligence - by Thomas Roccia - Nov, 2023 - SecurityBreak
LLMs, or Large Language Models, are an exciting technology designed to process and generate natural language. In cybersecurity, and more specifically in threat intelligence, there are challenges that LLMs and generative AI can partially address.
In this blog, I will discuss the potential of LLMs for threat intelligence
applications. I will first introduce some common challenges, then define
what prompt engineering is and how it can be applied to practical use
cases. Next, I will discuss some techniques such as few-shot learning,
RAG, and agents. Everything will be illustrated with code examples. Stay
with me, as we’re about to dive deep and acquire real skills, rather than
just skimming the surface.
With these challenges in mind, let’s explore how LLMs can be utilized to
enhance analysts’ capabilities.
What is Prompt Engineering?
We cannot discuss LLMs without defining what prompt engineering is.
Clarity: define the task you want the model to perform clearly.
But while many individuals focus on crafting the perfect prompt, they often overlook the true potential of LLMs and their vast capabilities.
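To make the clarity principle concrete, here is a minimal sketch of a prompt builder for a threat-intelligence summarization task (the wording and constraints are illustrative, not taken from the original post):

```python
def build_prompt(report_text: str) -> str:
    """A clearly scoped prompt: task, constraints, and output format are explicit."""
    return (
        "You are a threat intelligence assistant.\n"
        "Task: summarize the report below in exactly three bullet points.\n"
        "Only use facts stated in the report; do not speculate.\n\n"
        f"Report:\n{report_text}"
    )

print(build_prompt("APT29 sent spearphishing emails with malicious ISO attachments."))
```

Stating the task, the constraints, and the expected format up front removes most of the guesswork for the model.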
Now, let’s talk about the genuine strength of LLMs and explore how we can pragmatically build our own applications with them.
Practical Application of LLMs
There are multiple techniques that can be used in conjunction with a
model. In this section, I will explore some of them to provide you with
the keys to delve into this technology independently and achieve a better
understanding of it.
Few-Shot Prompting
Few-shot prompting is an interesting technique that can be employed to
instruct an LLM using a very limited amount of data.
The idea is to supply your model with some examples of what you expect
so it can replicate them directly. For instance, in the code below, I ‘teach’
the model a desired output — in this case, a mermaid mindmap — so that
it can produce similar mindmaps in the future.
User: The second line designates the role of “user.” This line presents
examples of user inputs.
Finally, I capture the user input, allowing the assistant to generate the
subsequent mindmap based on that input.
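The few-shot message structure described above can be sketched as follows; the example report and mermaid mindmap content are illustrative, not taken from the original listing:

```python
def build_fewshot_messages(report: str) -> list:
    """Build a chat payload that 'teaches' the model the desired mindmap format."""
    example_input = "APT29 used spearphishing emails and the SUNBURST backdoor."
    example_output = (
        "mindmap\n"
        "  root((APT29))\n"
        "    Initial Access\n"
        "      Spearphishing emails\n"
        "    Malware\n"
        "      SUNBURST backdoor"
    )
    return [
        # System: sets the overall behavior of the assistant
        {"role": "system", "content": "You turn threat reports into mermaid mindmaps."},
        # User: an example of user input
        {"role": "user", "content": example_input},
        # Assistant: the desired output for that input
        {"role": "assistant", "content": example_output},
        # Finally, the real user input the model should answer
        {"role": "user", "content": report},
    ]
```

This list can then be passed as the `messages` argument of a chat-completion call, and the model will imitate the example output format.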
Retrieval Augmented Generation (RAG)
The primary objective of RAG is to enhance a model with your own data. But how does it work under the hood?
For the sake of this blog, I’ve adapted the original code to be compatible with Jupyter Notebook and created an interface using ipywidgets. I’ll walk you through each step to construct your own RAG. In this example, we use LangChain, an open-source library designed for interacting with LLMs.
Note: for this example, the data is stored in Markdown format, but you can use any type of data.
Here we load the MITRE ATT&CK group knowledge base into our RAG.
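The loading step can be sketched as follows, assuming the group pages are Markdown files under a local `attack_groups/` folder (the path, glob, and chunk sizes are assumptions, and LangChain’s `DirectoryLoader` relies on the `unstructured` package by default):

```python
def load_group_docs(path: str = "attack_groups/"):
    """Load the Markdown group pages and split them into embeddable chunks."""
    from langchain.document_loaders import DirectoryLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    docs = DirectoryLoader(path, glob="**/*.md").load()
    # Small overlapping chunks keep related context together for retrieval
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    return splitter.split_documents(docs)
```

Splitting before embedding matters: the retriever returns chunks, so each chunk should be small enough to fit several of them into one prompt.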
Tokenization
Tokenization is the process of converting a sequence of text into
individual units, known as “tokens.” These tokens can range from being
as small as characters to as long as words, depending on the specific
needs of the task and the language in question. Tokenization is an
essential pre-processing step in Natural Language Processing (NLP) and
text analytics models. Tokenization can be done with the tiktoken library.
In our context, tokenization isn’t strictly required, but it is useful for managing the amount of data you send to the model and for optimization and cost control.
The following code demonstrates how to employ this method with our
MITRE ATT&CK Groups data.
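A minimal token-counting helper with tiktoken might look like this (the model name and token budget are illustrative):

```python
def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Return the number of tokens `text` occupies for the given model."""
    import tiktoken  # lazy import: only needed when the helper is called
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def fits_context(text: str, budget: int = 3500) -> bool:
    """Check that a document fits our token budget before sending it."""
    return count_tokens(text) <= budget
```

Counting before sending lets you trim or chunk a large group page instead of silently hitting the context limit.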
Embeddings
Embeddings provide a means to convert words or phrases into numerical
representations, or vectors, so they can be easily processed by
computers. Why is this useful? By transforming text into numerical form,
it becomes simpler to gauge the similarity between words or sentences,
facilitating tasks such as search and classification.
embeddings = OpenAIEmbeddings()
# "db" is the vector store built from the embedded Markdown documents
retriever = db.as_retriever(search_kwargs={"k": 5})  # return the top 5 matches
query = "What are some phishing techniques used by threat actors?"
print("[+] Getting relevant documents for query..")
relevant_docs = retriever.get_relevant_documents(query)
Alright, our retriever is now up and running. The next step is to integrate
this retriever with our LLM.
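One way to wire the retriever into an LLM, assuming LangChain’s `RetrievalQA` chain and an OpenAI chat model (my choices here for the sketch, not necessarily the original ones):

```python
def build_qa_chain(retriever):
    """Combine the retriever with an LLM so answers are grounded in our documents."""
    from langchain.chat_models import ChatOpenAI
    from langchain.chains import RetrievalQA

    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
    # "stuff" simply stuffs the retrieved chunks into the prompt
    return RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
```

`build_qa_chain(retriever).run("What are some phishing techniques used by threat actors?")` would then answer from the ATT&CK group pages rather than from the model’s training data alone.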
We now have our RAG operational. But one thing that’s bothersome is
that our model doesn’t remember what we’ve discussed previously…
RAG + Memory
Being able to interact with your own data is quite powerful; you can feed it virtually any type of data and let your LLM work with your personalized or internal information.
However, as seen in our previous example, the model doesn’t retain the
memory of prior interactions, which can be somewhat frustrating when
trying to gather multiple pieces of information about the same threat
actor.
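One way to add that memory, assuming LangChain’s `ConversationalRetrievalChain` with a `ConversationBufferMemory` (a sketch under those assumptions, not the post’s exact implementation):

```python
def build_conversational_chain(retriever):
    """A RAG chain that keeps the chat history between questions."""
    from langchain.chat_models import ChatOpenAI
    from langchain.chains import ConversationalRetrievalChain
    from langchain.memory import ConversationBufferMemory

    # The memory stores prior turns under "chat_history" for the chain to reuse
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
    llm = ChatOpenAI(temperature=0)
    return ConversationalRetrievalChain.from_llm(llm, retriever=retriever, memory=memory)
```

With the history in place, a follow-up like “what malware does it use?” can resolve “it” to the threat actor discussed in earlier turns.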
Agents
Source: https://peterroelants.github.io/posts/react-repl-agent/
# TIVTLookup wraps a threat-intelligence lookup client; the TILookup
# helper class (the underlying API client) is defined elsewhere and
# elided in this excerpt.
class TIVTLookup:
    def __init__(self):
        self.ti_lookup = TILookup()

    # The ip_info, communicating_samples, and samples_identification
    # methods call the underlying client; their bodies are omitted here.

ti_tool = TIVTLookup()

tools = [
    Tool(
        name="Retrieve_IP_Info",
        func=ti_tool.ip_info,
        description="Useful when you need to look up threat intelligence information about an IP address.",
    ),
    Tool(
        name="Retrieve_Communicating_Samples",
        func=ti_tool.communicating_samples,
        description="Useful when you need to get communicating samples from an IP address.",
    ),
    Tool(
        name="Retrieve_Sample_information",
        func=ti_tool.samples_identification,
        description="Useful when you need to obtain more details about a sample.",
    ),
]
It’s worth noting that numerous other functions can be integrated into
our code. However, for the purpose of this demonstration, we’ll maintain
simplicity.
agent = initialize_agent(
    tools, llm=llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=False
)
agent.run("Can you give me more details about this ip: 77.246.107.91? How many samples communicate with it?")
Conclusion
In this blog, I explored some interesting LLM features that allow you to
build your own application. I created some proof-of-concept
implementations that can be easily adapted for your own use case.
I started with a deep dive into prompt engineering concepts and few-shot
learning, and then looked at how to build a RAG with your own data.
Lastly, I discussed Agents and how they can be used in conjunction with
your existing tools.
I hope you enjoyed the journey. If you want to explore more about these
concepts, check out the resources below.
That’s it! If you liked this blog, you can share and clap for it. You can follow me on Twitter @fr0gger_ for more content like this. ❤
Resources
OTRF/GenAI-Security-Adventures (github.com)
Agents | Langchain
https://peterroelants.github.io/posts/react-repl-agent/
TheIntelBrief (securitybreak.io)