
CSE-ARS: Deep learning-based late fusion of multimodal information

for chat-based social engineering attack recognition


Nikolaos Tsinganos1, Panagiotis Fouliras1, Ioannis Mavridis1 and Dimitris Gritzalis2
1 Department of Applied Informatics, University of Macedonia (UoM), 54636 Thessaloniki, Greece;
{tsinik, pfoul, mavridis}@uom.edu.gr
2 Department of Informatics, Athens University of Economics and Business (AUEB), 10434 Athens, Greece
Corresponding author: Dimitris Gritzalis (dgrit@aueb.gr)

Abstract: In today's digital world, social engineering attacks pose a significant threat as they can be used to
extract sensitive information or manipulate individuals into performing certain actions. Attackers usually exploit
human personality traits and utilize dialogue acts, persuasion, and deception techniques to fool their victims. This
paper proposes a deep learning-based system for recognizing chat-based social engineering (CSE) attacks by
employing a late fusion approach that combines an ensemble of individual recognizers of CSE attack enablers.
The proposed CSE Attack Recognition System (CSE-ARS) and the particular recognizers are evaluated on custom
corpora of real-world and fictional chat texts. The evaluation of CSE-ARS resulted in promising results for
recognizing CSE attacks, thereby demonstrating the potential of deep learning-based approaches to mitigate such
attacks. This study highlights the efficiency of the proposed system and its potential to serve as a valuable tool for
safeguarding individuals and organizations from CSE attacks.

Keywords: chat-based social engineering attacks, deep learning, natural language processing, multimodal, ensemble, late fusion

1. Introduction
Social engineering is a versatile and complex phenomenon that appears both in real life and in the digital world.
It involves a manipulative form of communication that exploits human personality traits in either a masspersonal
or interpersonal way (Gehl and Lawson, 2022). From “fake news” to psychographic advertisements and
from cognitive hacking to spear-phishing, there is a plethora of different types of social engineering attacks.
Verizon’s annual Data Breach Investigations Report (“2022 Data Breach Investigations Report,” n.d.)
consistently identifies social engineering as the most common and hazardous vector by which
malicious users gain access to secured and restricted computer networks. Contemporary attacker practices
have been studied thoroughly (Wang et al., 2021), and numerous solutions have been suggested within the
academic domain as well as within the public sector (Syafitri et al., 2022). Recently, chat-based social engineering
(CSE) attacks have increased due to the widespread use of electronic communication tools, a trend boosted by
the COVID-19 pandemic (Venkatesha et al., 2021), and are now considered mainstream.
CSE attacks are tightly related to pretexting, a type of social engineering attack, in which a storyline is
methodically planned out in advance and the attacker builds a persona with specific characteristics to approach
the human target. The best-known pretexts were created by Kevin Mitnick, a legendary social engineer, whose
stories received wide media coverage and can be found in (Mitnick, 2011). In CSE attacks, pretexts often exploit a
simple fact: human personality has vulnerabilities that can be exploited more broadly through cultural dynamics,
social stereotypes, and gender roles. The complexity of the phenomenon necessitates an interdisciplinary approach
to deal with the many different factors involved. Although several cyber-defense mechanisms have been proposed
(Vishwanath, 2022) to defend against social engineering attacks, most of the solutions focus on technical
countermeasures to improve users’ protection. Being purely technical, these mechanisms do not account for the
known CSE attack enablers identified and illuminated in (Tsinganos et al., 2018). Currently, the majority of CSE
attacks include phishing, spear-phishing, social media scams, COVID-19 scams, and whaling. The attackers’
favorite method of approach is impersonation, where they pretend to be someone else in order to deceive and gain
unauthorized access to sensitive information. Furthermore, attackers have recently used artificial intelligence
techniques to create realistic videos and audio of individuals (deepfakes), which can be used to impersonate them
in CSE attacks.

1.1 Background
CSE attacks can be realized through various online platforms, such as instant messaging apps, chatbots, social
media platforms, or even email services, and can have severe consequences, such as data breaches, financial
losses, or identity theft. Traditional technical security measures such as firewalls and antivirus software are not
effective in preventing CSE attacks, and there is a growing need for new techniques to recognize such attacks. A
typical CSE attack involves the following phases (Mitnick and Simon, 2011):
• Reconnaissance: The attacker gathers information about the target through publicly available information
or by using social media or other online platforms.
• Relationship: The attacker establishes a relationship with the target by using influence tactics.
• Attack: The attacker uses the information and trust they have gained to launch the attack (e.g., disclose
information, click malicious link etc.).
• Consolidation: After the attack is successful, the attacker consolidates her gains (e.g., gain further access
to the target's systems).
• Cover-up: The attacker covers her tracks to avoid detection and make it difficult to trace the attack.
In 2005, Hoeschele and Rogers (Hoeschele and Rogers, 2005) introduced the Social Engineering Defense
Architecture (SEDA) which is designed to detect social engineering attacks during real-time phone conversations.
Although the model was successful in detecting the attacks, it did not incorporate previous activity history or
personality recognition characteristics of both the attacker and the victim. In their 2010 paper, Bezuidenhout et
al. (Bezuidenhout et al., 2010) introduced an architecture named Social Engineering Attack Detection Model
(SEADM), which assists users in making decisions through the use of a simple binary decision tree model.
However, the authors rely on several unrealistic assumptions to justify the logic of their proposed system. SEADM
was revisited in a subsequent paper by Mouton et al. in 2015 (Mouton et al., 2015) where the system was adapted
to account for social engineering attacks that include unidirectional, and bidirectional communication between
the attacker and the victim. According to Bhakta and Harris (Bhakta and Harris, 2015), the most successful social
engineering attacks involve a conversation between the attacker and the victim. Their methodology involves
utilizing a predefined Topic Blacklist (TBL) to check dialogue sentences. The authors report achieving a recall
rate of 88.9% with their approach. Sawa et al. (Sawa et al., 2016) expanded upon the
aforementioned work by implementing advanced language processing techniques that balance syntactic and
semantic analysis. The reported results demonstrate 100% precision and 60% recall. However, they used a small
dataset of only three conversations, which limits the reliability of precision and recall as performance measures. Moreover, the
algorithm did not consider any contextual information during the classification process, making it unaware of the
specific environment in which it operates. Uebelacker and Quiel (Uebelacker and Quiel, 2014) have proposed a
taxonomy of social engineering attacks based on Cialdini's Influence principles. They investigated the relationship
between the Big-5 Theory, which pertains to personality traits, and Cialdini's influence principles. They proposed
a Social Engineering Personality Framework (SEPF) and outlined a complete research roadmap for future work
on SEPF.

1.2 Motivation
CSE attacks are a growing threat that can lead to various negative consequences, including financial loss,
damage to reputation, loss of productivity, or legal consequences. Financial loss can occur due to fraudulent
transactions or data breaches, which can be costly for organizations. A successful CSE attack can damage the
reputation of an organization, leading to a decline in business and loss of customer trust. Loss of productivity can
occur when important data is lost, leading to the need for system rebuilding, and decreased efficiency. Legal and
regulatory penalties can be incurred if an organization is held liable for data breaches. CSE attacks can target
sensitive information essential for daily operation, such as financial data, personal data of employees and clients,
and trade secrets. Given the significant impact of CSE attacks, it is crucial to develop effective detection and
prevention mechanisms. These systems should be cost-effective, easy to use, and enable individuals and
organizations to better protect themselves against the imminent risks.

1.3 Contribution
The contributions of this paper can be summarized as follows:
• We propose a chat-based social engineering recognition system, called CSE-ARS, which embraces an
interdisciplinary approach taking into account personality, linguistic (e.g., dialogue-acts), behavioral
(persuasion, persistence), and information technology characteristics.
• We propose the application of various deep learning models, including recurrent neural networks
(RNNs), convolutional neural networks (CNNs), and transformer-based models, to recognize each
specific enabler of a CSE attack. More specifically, we present the following recognizers:
o Critical Information Leverage Recognizer (CRINL-R): a long short-term memory (LSTM) model
recognizing critical information leakage utilizing named entity recognition (NER)
techniques
o Personality Recognizer (PERST-R): a bidirectional encoder representation for transformers
(BERT) model that predicts personality traits
o Dialogue-act Recognizer (DIACT-R): a BERT model that recognizes dialogue acts that can lead
to deception, taking into account the full dialogue history
o Persuasion Recognizer (PERSU-R): a CNN model that predicts persuasion attempts
o Persistence Recognizer (PERSI-R): a BERT model that predicts persistent behavior by
identifying paraphrasing.
• We introduce five custom corpora of realized and fictional chat-based social engineering attacks, each
one annotated for a particular downstream task. The corpora are utilized for training and evaluating the
different modality recognizers in the context of our transfer learning approach.
• We propose a late fusion approach that combines the final predictions of each recognizer into a single
prediction, exploiting the strengths of each model and reflecting the fact that different types of enablers
are required to successfully conduct a CSE attack.
• We propose a hybrid optimization approach, in which the outputs of the individual recognizers are fed
to a weighted linear aggregation and the weights are optimized using the metaheuristic algorithm of
simulated annealing and k-fold cross validation.
• We provide a detailed description of the system architecture, implementation, and evaluation, making it
easy to reproduce and build upon our work.
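The hybrid optimization contribution can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the function names, the neighborhood step size, the geometric cooling schedule, and the accuracy objective are all assumptions, and the k-fold cross validation used in the paper is omitted for brevity.

```python
import math
import random

import numpy as np

def fuse(weights, recognizer_outputs):
    """Weighted linear aggregation of per-recognizer attack probabilities."""
    return np.clip(np.dot(recognizer_outputs, weights), 0.0, 1.0)

def accuracy(weights, outputs, labels, threshold=0.5):
    preds = fuse(weights, outputs) >= threshold
    return float(np.mean(preds == labels))

def anneal_weights(outputs, labels, n_weights, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Optimize the fusion weights with simulated annealing (illustrative)."""
    rng = random.Random(seed)
    w = np.full(n_weights, 1.0 / n_weights)            # start from uniform weights
    score = accuracy(w, outputs, labels)
    best_w, best_score, t = w.copy(), score, t0
    for _ in range(steps):
        cand = np.abs(w + np.array([rng.gauss(0.0, 0.05) for _ in range(n_weights)]))
        cand /= cand.sum()                             # non-negative weights summing to 1
        cand_score = accuracy(cand, outputs, labels)
        # always accept improvements; accept worse moves with a temperature-dependent chance
        if cand_score >= score or rng.random() < math.exp((cand_score - score) / t):
            w, score = cand, cand_score
            if score > best_score:
                best_w, best_score = w.copy(), score
        t *= cooling                                   # geometric cooling schedule
    return best_w, best_score
```

In a real setting the `outputs` matrix would hold the five recognizers' probabilities on a validation set, and the objective would be evaluated per cross-validation fold.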
The rest of the paper is organized as follows. Section 2 provides a description of the CSE-ARS design
considerations and architecture decisions. In Section 3, we introduce the system’s recognizers, each of which is
designed to recognize a different CSE attack enabler. Section 4 details the training of CSE-ARS. Section 5 presents
the evaluation results obtained from testing the proposed system. In Section 6, we discuss the findings and
limitations of this study. Finally, Section 7 concludes the paper by summarizing the contributions of this work and
highlighting future directions for research.

2. CSE-ARS Architecture
2.1 Design considerations
For a CSE attack to be successful, several enablers may be triggered. By definition (Bernard, 2012), an enabler is
something that enables or facilitates an action. In the context of CSE attacks, enablers refer to the various tactics,
techniques, and artifacts used by attackers to trick their targets into performing certain actions. These enablers
include critical information leverage, personality traits that can be exploited, specific dialogue acts that can lead
to information leakage, persuasion techniques, deception techniques, and paraphrasing methods. These enablers
are often used in combination to increase the chances of a successful CSE attack. Early recognition of these enablers
is important for recognizing CSE attacks. Following our work in (Tsinganos et al., 2018), we identify
the following critical enablers:
• Critical information leverage: Social engineers attempt to extract sensitive or confidential information
from their targets.
• Personality characteristics: Social engineers attempt to exploit certain personality characteristics of their
victims in order to manipulate them and gain their trust.
• Dialogue acts: Social engineers use certain dialogue acts, such as questions or commands, to elicit
information or manipulate the conversation.
• Persuasion attempts: Social engineering attacks often involve attempts to persuade the target to take a
certain action, such as providing critical information or clicking on a link.
• Persistent behavior: Social engineers may paraphrase their requests to obtain additional
information by repeatedly turning the discussion back to the subject of interest.
• Deception attempts: Social engineering attacks often involve deception, such as pretending to be
someone else or providing false information.
• Chat History: A CSE attack can be unleashed in separate and distinct phases. A recognition mechanism
should be able to assess the full conversation between two interlocutors.

2.2 Architectural Design


We designed and implemented a resilient and efficient cyber-defense system that predicts whether a chat-based
dialogue is evolving into a social engineering attack. It then decides whether an interlocutor is unleashing an
attack to extract critical personal, corporate, or information technology (IT) data from the other. The proposed
system, CSE-ARS, utilizes recent trends in deep learning (DL) and natural language processing (NLP) to
examine and classify utterances, taking into account the CSE attack enablers presented in subsection 2.1.
CSE-ARS is a synthesis of particular recognizers that utilize DL models for each individual CSE attack enabler.
The DL models were evaluated based on their performance using the same context in order to draw safe
conclusions. This means that we paid special attention to the training data, the hyperparameters, the experimental
setup, and the number of experiments conducted in order for the comparison to be trustworthy. As a principle
during our evaluations, we treated false positives as less costly than false negatives and focused on minimizing
the latter. This trade-off was a basic design decision for all DL models that were tested and evaluated.
CSE-ARS brings a multi-modal approach to the problem of recognizing CSE attacks. In the context of
our study, multi-modal refers to the use of multiple modalities or sources of information to make predictions. The
proposed system also utilizes a late fusion technique, which takes as input each individual recognizer’s output
and produces a final prediction through a weighted linear aggregation. The weights are determined using a
validation set and are optimized to maximize the performance of the system. This allows us to exploit the strengths
of different DL models and to account for the fact that different types of enablers may be utilized to successfully
conduct a CSE attack. More analytically, CSE-ARS incorporates the following recognizers:
• Critical Information Leverage Recognizer (CRINL-R): a named-entity recognizer (NER), based on a
bi-directional long short-term memory (bi-LSTM) model (Schuster and Paliwal, 1997), recognizing
critical data information.
• Personality Recognizer (PERST-R): a BERT (Devlin et al., 2019) model that predicts personality traits
of the interlocutors based on the Big-5 theory.
• Dialogue-act Recognizer (DIACT-R): a BERT model that recognizes dialogue acts that can lead to
deception, taking into account the whole dialogue history.
• Persuasion Recognizer (PERSU-R): a convolutional neural network (CNN) (Lecun et al., 1998) that
predicts persuasion attempts.
• Persistence Recognizer (PERSI-R): a BERT model that predicts persistent behavior by identifying
paraphrasing in chat dialogue.
After the individual recognizers generate their outputs, the late fusion model combines them to determine whether
the chat text constitutes a social engineering attack. Each recognizer has its own feature extraction network
customized for its specific objective and state. The outputs of the recognizers are fused using a weighted linear
aggregation method. All the models, including CSE-ARS but excluding PERST-R, are trained on different
variants of our custom CSE corpus (Tsinganos and Mavridis, 2021).
The multimodal fusion architecture is depicted in Fig.1. The system’s design is flexible, allowing for the addition
or removal of individual recognizers as needed. This allows the system to be adapted to new types of social
engineering enablers as they emerge. By using a combination of different recognizers, the system is able to capture
a wide range of enablers and provide a comprehensive assessment of the likelihood of a particular input being a
CSE attack.

Figure 1 – The CSE-ARS system architecture

Moreover, this architecture allows the late fusion model to effectively use the strengths of the different recognizers
and make a more accurate prediction by combining their results. In our work, we used concatenation for the fusion
layer to combine the outputs of multiple binary classifiers denoting the independence of the different enablers and
keeping the system simple. The concatenation layer takes the outputs of each classifier and concatenates them
into a single vector, which can then be fed into a fully connected layer for the final prediction. The outputs of the
classifiers are treated as separate features and the information from each classifier is retained in the final combined
vector. This is useful when the classifiers are designed to capture different aspects of the input and the outputs are
independent of each other. In this way, the final combined vector provides a multi-modal representation of the
input and can be used to make a prediction.
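A minimal sketch of this late fusion step follows, with a single fully connected sigmoid unit standing in for the final prediction layer. All names and hyperparameters here are illustrative assumptions rather than the system's actual implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_fusion_layer(recognizer_probs, labels, lr=0.5, epochs=500):
    """Fit one fully connected sigmoid unit on the concatenated recognizer outputs."""
    X = np.asarray(recognizer_probs, dtype=float)   # (n_samples, n_recognizers) fused vector
    y = np.asarray(labels, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)                      # forward pass through the fusion layer
        grad = p - y                                # binary cross-entropy gradient
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict_attack(recognizer_probs, w, b, threshold=0.5):
    """Final CSE-attack decision from the fused recognizer outputs."""
    return sigmoid(np.asarray(recognizer_probs) @ w + b) >= threshold
```

Each column of `recognizer_probs` is one classifier's output, kept as a separate feature in the concatenated vector, which mirrors the independence assumption described above.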

2.3 Evaluation Metrics


A number of performance metrics were used to evaluate the quality of the proposed deep learning models
and system, and to assess how well they make accurate predictions on new, unseen data. More
specifically, the following performance metrics were used throughout our experiments:
• Receiver Operating Characteristic (ROC) is a graphical representation of the performance of a binary
classifier model at different classification thresholds. The ROC curve plots the true positive rate (TPR)
against the false positive rate (FPR) at various threshold settings, where the TPR is the proportion of
actual positive instances that are correctly identified by the model, and the FPR is the proportion of
negative instances that are incorrectly identified as positive.
• Accuracy is a measure of the proportion of correct predictions made by the model, expressed as a
percentage. It is calculated as the ratio of the number of correct predictions to the total number of
predictions made by the model.
• Precision is a measure of the proportion of true positive predictions among all positive predictions made
by the model. It is calculated as the ratio of the number of true positive predictions to the total number
of positive predictions made by the model.
• Recall, also known as sensitivity, is a measure of the proportion of true positive predictions among all
actual positive instances. It is calculated as the ratio of the number of true positive predictions to the total
number of actual positive instances.
• F1 is an evaluation measure that combines precision and recall to assess the performance
of a binary classification model. It provides a single numerical value representing the harmonic mean
of precision and recall, giving equal importance to both measures. The F1 metric is particularly useful
when there is an imbalance between the positive and negative classes in the dataset.
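The threshold-based metrics above follow directly from the confusion-matrix counts. The helper below is an illustrative sketch of those definitions; in practice a library such as scikit-learn provides equivalent functions.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels (illustrative helper)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0   # correct positives among predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0      # a.k.a. sensitivity / true positive rate
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```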

3. CSE-ARS Recognizers
In this section, we present a comprehensive discussion on the particular recognizers of CSE attack enablers.
3.1 CRINL-R
3.1.1 Description
Critical information leverage refers to the unintentional or unauthorized disclosure of sensitive information during
a conversation. In CSE attacks, attackers may attempt to extract sensitive information by asking questions or
making requests that lead the target to reveal information they should not. CRINL-R utilizes NER (Collobert et
al., n.d.), a technique that extracts contextual information from unstructured text data, such as chat text. Pre-trained
NER models can be fine-tuned to recognize additional entities by training them with an annotated corpus
containing the relevant tags. The CSE corpus, presented in (Tsinganos and Mavridis, 2021), is an example of such
a corpus, where personal, IT, and enterprise information have been identified and labeled. CRINL-R utilizes NER
to process CSE dialogue sentences and to identify the words or text spans that can be labeled with a tag associated
with an entity from the CSE ontology (Tsinganos and Mavridis, 2021).
The CSE ontology establishes linkages between social engineering and cybersecurity concepts, with a particular
emphasis on identifying sensitive data that may be inadvertently disclosed by an employee of an organization
during chat-based interactions. This ontology employs a classification scheme that organizes conceptually similar
elements within the context to enable hierarchical categorization of data assets. Its development involved the
utilization of a custom information extraction system and various textual resources, such as corporate IT policies,
CVs of IT professionals, ICT manuals, and similar documents. To ensure optimal effectiveness for text
classification algorithms, the ontology is intentionally designed not to exceed three levels of depth.

3.1.2 Methodology
To build the NER system, the following steps were taken:
• Corpus creation: The CSE corpus is composed of 56 realized and fictional CSE attack dialogues
(3380 sentences).
• Preprocessing: Preprocessing steps involved reading the CSE corpus, which was already annotated in
IOB format, and storing each line as a list of token-tag pairs. Then, a pre-trained English language model
from the spaCy NLP library (“spaCy · Industrial-strength Natural Language Processing in Python,” n.d.)
was used to represent each token with a word embedding.
• Building: Using Keras (“Keras: the Python deep learning API,” n.d.), a bi-LSTM model was constructed,
which comprises two compound layers: bi-directional LSTM and classification.
• Training: A loss function was specified to measure the distance between prediction and truth, and a batch-
wise gradient descent algorithm (Sutskever et al., 2013) was specified for optimization.
• Assessing: The performance assessment of the model was conducted by applying the model to the pre-
processed validation data. For each sample dialogue sentence and each token in a sentence, a tensor was
obtained that contained a predicted probability distribution over the tags. The tag with highest probability
was chosen, and for each sentence and each token the true and predicted tags were returned.
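The preprocessing step, reading IOB-annotated lines into lists of token-tag pairs, might look as follows. The tab-separated line format is an assumption, since the paper does not specify the exact file layout.

```python
def read_iob(lines):
    """Parse IOB-annotated lines ("token<TAB>tag") into sentences of (token, tag) pairs.

    Blank lines are taken to separate sentences; this layout is assumed, not
    taken from the paper.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:                      # sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")
        current.append((token, tag))
    if current:                           # flush the final sentence
        sentences.append(current)
    return sentences
```

Each resulting `(token, tag)` pair would then be mapped to its spaCy word embedding before being fed to the bi-LSTM.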
3.1.3 Corpus
CRINL-R was trained using the CSE Corpus, whose characteristics are presented in Table 1.
Table 1 – CSE Corpus details

Characteristic Value
Corpus Name CSE Corpus
Collection Method Web-scraping, pattern-matching text extraction
Corpus size (N) 56 text dialogues/3380 sentences
Vocabulary size (V) 4500 terms
Total no. of turns 3380
Avg. tokens per turn 7.97
Content Chat-based dialogues
Collection date Jun 2018 – Dec 2020
Language English
Release year Jun 2021
License Private

3.1.4 Training
The bi-LSTM neural network was trained by performing the following steps in sequence:
1. Each dialogue sentence was split into a sequence of token-tag pairs.
2. Each token-tag pair was represented by a word embedding vector.
3. The word embeddings were provided by spaCy NLP library and pretrained to encode semantic
information.
4. Going forward the bi-LSTM model at each step:
1. the input vector was read and combined with the internal memory
2. the output vector was produced and the internal memory was updated
5. This sequence of actions was performed by the bi-LSTM model to the input sequence and produced an
output sequence of the same length.
6. Going backwards, the model read the input sequence again and produced another output sequence.
7. At each position, the outputs of the forward and backward passes (steps 4 and 6) were combined and
fed into a classifier, which output the probability that the input word at this position should be
annotated with the Personal, IT, or Enterprise tag.
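The final combination and classification stage of the steps above can be illustrated with a simplified numerical sketch; the LSTM passes themselves are assumed to have already produced their output sequences, and all shapes and weight names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def classify_tokens(forward_out, backward_out, W, b):
    """Combine the forward and backward pass outputs at each position and classify.

    forward_out, backward_out: (seq_len, hidden) outputs of the two LSTM passes.
    W: (2 * hidden, n_tags) classifier weights; b: (n_tags,) bias.
    """
    combined = np.concatenate([forward_out, backward_out], axis=1)  # merge both passes
    return softmax(combined @ W + b)   # per-token probability distribution over the tags
```

The tag with the highest probability in each row is then chosen as the prediction for that token, as in the assessing step above.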

3.1.5 Results
The training and validation performance results of CRINL-R are presented in Fig.2 (a) and (b).

Figure 2 – CRINL-R (a) Loss metrics, (b) Accuracy metrics.

After training the model for 10 epochs, we reached the scores presented in Table 2 for each term contained in one
of the three categories:

Table 2 - CRINL-R performance results on identifying terms related to Personal, Enterprise or IT


tag F1 Precision Recall Support
PERSONAL 0.67 0.63 0.51 65
IT 0.86 0.93 0.64 122
ENTERPRISE 0.71 0.84 0.56 89

3.2 PERST-R
3.2.1 Description
In psychology, the term "human personality" denotes the unique variations in patterns of cognition, affect, and
behavior that distinguish one individual from another. The Five-Factor Model (FFM) (Salgado, 2002), also known
as the Big-5 Theory, represents a widely recognized framework for the classification of personality traits. The
Big-5 comprises five factors that are thought to account for the majority of individual differences in personality.
These factors are typically assessed on a continuum ranging from 0 to 1 (Spielberger, 2004) and play a significant
role in CSE attacks (Tsinganos et al., 2018), as they can be exploited by attackers to deceive vulnerable
individuals. The five personality traits have also been linked with high or low susceptibility to CSE attacks as
follows:
• Openness has been positively linked to responsible behavior regarding security best practices
(“Propensity to Click on Suspicious Links,” n.d.), so low openness would potentially facilitate dangerous
security behavior.
• Conscientiousness at low levels predicts deviant workplace behavior in the form of irresponsible conduct
or rule-breaking (Salgado, 2002).
• Extraversion at high levels is predictive of increased vulnerability to phishing attacks (Albladi and Weir,
2017).
• Agreeableness is the trait most associated with phishing, and multiple studies (Anawar et al., 2019;
Uebelacker and Quiel, 2014) have reached similar conclusions. Agreeable people
may be manipulated by establishing trust with the target, as this represents a facet of agreeableness.
• Neuroticism at low levels leads to higher susceptibility (Exploring the Role of Individual Employee
Characteristics and Personality on Employee Compliance with Cybersecurity Policies, 2012).
Having a system that recognizes personality traits can also provide valuable insights into the behavior and
decision-making of employees. The main objective of the PERST-R model is to identify the personality traits, as
defined in the Big-5 theory, from the chat dialogue.

3.2.2 Methodology
A pre-trained Bidirectional Encoder Representations from Transformers (BERT) model is used to recognize
personality traits; it is fine-tuned on a large corpus of text data that has been labeled for each of the five
personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism). The steps
performed are as follows:
• Corpus Selection: We selected a suitable corpus of textual data comprising instances of individual
behavior that can be leveraged to deduce personality traits.
• Pre-processing: The collected data were pre-processed and converted into a format that can be used for
training the BERT model.
• Model Training: We trained the BERT model using the preprocessed corpus. The model was trained to
predict the likelihood of each of the five personality traits for a given input text.
• Model Evaluation: The performance of the trained model was evaluated by measuring its accuracy,
precision, recall, and F1 score on a held-out test dataset.
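The evaluation step can be sketched as follows for a model that outputs one likelihood per trait, matching the 0-to-1 continuum mentioned in subsection 3.2.1. The 0.5 decision threshold and the array layout are assumptions, not details from the paper.

```python
import numpy as np

TRAITS = ["Openness", "Conscientiousness", "Extraversion", "Agreeableness", "Neuroticism"]

def evaluate_traits(pred_probs, true_labels, threshold=0.5):
    """Per-trait and average accuracy for a model emitting one likelihood per Big-5 trait.

    pred_probs, true_labels: (n_samples, 5) arrays, one column per trait.
    """
    preds = np.asarray(pred_probs) >= threshold          # binarize the trait likelihoods
    truth = np.asarray(true_labels).astype(bool)
    per_trait = (preds == truth).mean(axis=0)            # accuracy per trait column
    return dict(zip(TRAITS, per_trait)), float(per_trait.mean())
```

Averaging the five per-trait accuracies yields the single overall figure of the kind reported in Table 4.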

3.2.3 Corpus
The training corpus used is the FriendsPersona (Jiang et al., 2020) which is a large-scale conversational dataset
that was constructed using scripts from the popular American TV show "Friends". It contains 1,175 dialogues
between pairs of characters, totaling 105,784 utterances. The dataset is annotated with the Big-5 personality traits,
with each dialogue annotated by three human raters. The FriendsPersona dataset is unique in that it provides both
conversational data and personality trait annotations, which enables researchers to explore the relationship
between personality traits and conversational behavior. In terms of inter-annotator agreement, the creators
achieved an average pair-wise kappa of 54.92% between 2 annotators and Fleiss’ kappa of 20.54% among 3
annotators across five personality traits. Table 3 presents the corpus details.

Table 3 - FriendsPersona corpus details


Characteristic Value
Total dialogues 1,175
Total utterances 105,784
Annotated personality traits Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism
Annotation method Three human raters per dialogue
Demographic information Age, gender, occupation
Relationship labels Friends, family, romantic partners
Source Scripts from the TV show "Friends"
Language English
Release year 2021
License Creative Commons Attribution 4.0 International (CC BY 4.0)

3.2.4 Results
The PERST-R model achieved satisfactory accuracy in recognizing Big-5 personality traits, with an overall accuracy
of 71.12%. Table 4 and Fig.3 present the performance results and ROC graph. The area under the ROC curve
(AUC) for each of the Big-5 personality traits was also computed, with values of 0.83 for Openness, 0.62 for
Conscientiousness, 0.79 for Extraversion, 0.57 for Agreeableness, and 0.80 for Neuroticism. These results
demonstrate the effectiveness of the PERST-R model in recognizing Big-5 personality traits and suggest
that this approach could be useful in CSE attack recognition.

Table 4 - PERST-R Accuracy

Model Openness Conscientiousness Extraversion Agreeableness Neuroticism Average


BERT 80.12 66.79 71.09 65.37 72.27 71.12

Figure 3 – PERST-R ROC

3.3 DIACT-R
3.3.1 Description
Dialogue acts refer to the communicative functions or intentions that speakers convey in their utterances during a
conversation. In CSE attacks, dialogue acts are used by attackers to manipulate the conversation and lead the
target to take certain actions or reveal critical data information. Recognizing dialogue acts in chat text can bring
several advantages in terms of safeguarding individuals and organizations against successful CSE attacks. The
DIACT-R recognizer utilizes the full chat history, which contains important information for recognizing CSE attacks. Previous
interactions can be used to establish a baseline of normal behavior and communication patterns, which can then
be used to identify any anomalies or deviations that may indicate an attempted social engineering attack. For
example, if an attacker is trying to impersonate an individual or organization, monitoring the history of previous
interactions can reveal inconsistencies or red flags that would indicate that the attacker is not who they claim to
be.
3.3.2 Methodology
DIACT-R is based on a BERT-based model, called SG-CSE BERT (Tsinganos et al., 2023), fine-tuned on the SG-CSE
Corpus, which follows the schema-guided paradigm for CSE attack state tracking. The steps towards the SG-CSE BERT
implementation are the following:
• Corpus Creation: Using CSE corpus (Tsinganos and Mavridis, 2021) we extracted related dialogue acts,
called Schema-Guided Chat-based Social Engineering Dialogue Acts (SG-CSE DAs). We created a new
corpus (SG-CSE corpus) appropriate for dialogue state tracking (DST), labeled with SG-CSE DAs. This
corpus is used to fine-tune and evaluate DIACT-R. Fig.4 depicts the distribution of types of dialogue acts
in the SG-CSE Corpus.
• Ontology Creation: We utilized the CSE conceptual model and the CSE ontology (Tsinganos and
Mavridis, 2021) to create the Schema-Guided Chat-based Social Engineering Ontology (SG-CSE
Ontology). Thus, four different CSE attack types (CSE_SME_Info_Extraction,
CSE_IT_Info_Extraction, CSE_Personal_Extraction, CSE_Department_Extraction) were extracted and
they were represented as services. Each service was mapped to a specific schema.
• Model: The SG-CSE BERT model functions within a schema-guided framework, allowing for the
conditioning of the CSE attack service schema through the utilization of intent and slot descriptions.
• Intent Mapping: The schema includes several intents which were mapped to the corresponding DA. The
schema-guided paradigm makes use of a domain’s ontology to create the required schema that describes
each service existing in the domain. In our research, the domain is the CSE attacks and we predict the
service, user intention, and requested slot-value pairs.
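As an illustration of the schema-guided setup described above, a service schema for one of the four CSE attack services might look like the following sketch. The specific intent and slot names here are hypothetical examples, not the exact entries of the SG-CSE Ontology:

```python
# Hypothetical schema for the CSE_SME_Info_Extraction service, following the
# schema-guided DST convention (service description, intents, slots).
# Intent and slot names are illustrative assumptions, not the paper's exact schema.
cse_sme_info_extraction_schema = {
    "service_name": "CSE_SME_Info_Extraction",
    "description": "Attacker tries to extract subject-matter-expert information",
    "intents": [
        {"name": "request_role_info", "description": "Ask about the target's role or duties"},
        {"name": "request_project_info", "description": "Ask about internal projects"},
    ],
    "slots": [
        {"name": "employee_name", "description": "Name of an employee",
         "is_categorical": False},
        {"name": "department", "description": "Department the target works in",
         "is_categorical": True, "possible_values": ["IT", "HR", "Finance"]},
    ],
}

def requested_slots(schema):
    """Return the slot names a tracker would predict over for this service."""
    return [slot["name"] for slot in schema["slots"]]
```

Because predictions are conditioned on the schema's natural-language descriptions rather than on fixed labels, an unseen service can be handled by supplying a new schema of the same shape.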

Figure 4 - Distribution of types of dialogue acts in the SG-CSE Corpus

3.3.3 Corpus
The SG-CSE Corpus is a corpus of task-oriented dialogues that has been specifically developed for CSE attacks
and is created from the CSE corpus. Its principal goal is to function as a training dataset for intent prediction, slot-
value prediction, and dialogue state tracking in the context of CSE attacks. The evaluation set of the SG-CSE Corpus
is composed of services from the CSE domain that have not been observed previously, providing an opportunity
to evaluate the SG-CSE BERT model's ability to generalize in zero-shot situations. The SG-CSE Corpus consists
of 90 dialogues. SG-CSE Corpus details are presented in Table 5.

Table 5 - SG-CSE Corpus details

Characteristic Value
Corpus Name SG-CSE Corpus
Collection Method Web-scraping, pattern-matching text extraction
Corpus size (N) 90 text dialogues/5798 sentences
Vocabulary size (V) 4500 terms
Total no. of turns 5798
Avg. tokens per turn 8.42
No. of slots 18
No. of slot values 234
Content Chat-based dialogues
Collection date Jun 2018 – Dec 2020
Language English
Release year Aug 2022
License Private

3.3.4 Training
DIACT-R is based on the SG-CSE BERT model (Fig. 5), which operates in a schema-guided setting and can
condition the CSE attack service schema using the descriptions of intents and slots. To enable the model to
represent unseen intents and slots, SG-CSE BERT builds on a BERT model pre-trained on large corpora. The
SG-CSE BERT model encodes all schema intents, slots, and categorical slot values into embedded
representations. Since CSE attacks can take many forms, schemas may differ in their number of intents and slots.
Therefore, predictions are made over a dynamic set of intents and slots by conditioning them on the corresponding
schema embedding. In Fig. 5 the two sequences a1, ..., an and v1, ..., vm are fed as a pair to the SG-CSE BERT
encoder. The U_CLS is the embedded representation of the schema and t1, ..., tn+m are the token-level representations.

Figure 5 – The SG-CSE BERT model. A pair of two sequences a, v is fed to the BERT encoder which outputs the pair's
embedding Ucls and the token representations (t1, ..., tn+m)

3.3.5 Results
The Hugging Face library and the BERT uncased model with 12 layers, 768 hidden dimensions, and 12 self-
attention heads were utilized to form the SG-CSE BERT model. To train the model, we set the batch size at 32,
and used a dropout rate of 0.2 for all classification heads. A linear warmup strategy with a duration of 10% of the
training steps was employed, as well as the AdamW optimizer (Loshchilov and Hutter, 2017) with a learning rate
of 2e-5. Fig. 6 illustrates the SG-CSE BERT model's performance, with Active Intent Accuracy and Requested
Slots F1 displaying high efficiency, while Average Goal Accuracy and Joint Goal Accuracy exhibit lower
efficiency.
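The optimization setup above (AdamW at 2e-5 with linear warmup over the first 10% of steps) can be illustrated with a small sketch of the learning-rate schedule. The linear decay after warmup is an assumption for illustration, since the text specifies only the warmup phase and the base rate:

```python
def lr_at_step(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    """Linear warmup over the first warmup_frac of training steps, then a
    linear decay to zero (the decay shape is an assumption, not from the paper)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # ramp up linearly from ~0 to base_lr
        return base_lr * (step + 1) / warmup_steps
    # decay linearly over the remaining steps
    remaining = total_steps - warmup_steps
    return base_lr * max(0.0, (total_steps - step) / remaining)
```

For 1000 total steps, the rate climbs to 2e-5 by step 100 and then falls linearly, which is the behavior Hugging Face's linear schedule with warmup produces.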

Figure 6 – DIACT-R performance

Table 6 depicts the results in tabular form.

Table 6 - SG-CSE Attack state tracker performance


System Model Parameters Active Intent Accuracy Req Slot F1 Avg Goal Accuracy Joint Goal Accuracy
DIACT-R BERTBASE 110M 85.2 89.6 74.1 56.7
Seen BERTBASE 110M 53.8 48.3 31.4 24.9
Unseen BERTBASE 110M 69.5 68.9 52.7 40.8

3.4 PERSU-R
3.4.1 Description
Persuasion is often used to manipulate individuals into taking a certain action or providing sensitive information.
Social engineers use various persuasion techniques, such as building trust, creating a sense of urgency, or
appealing to emotions, to persuade their targets. In a recent study (“2022 Data Breach Investigations Report,”
n.d.), it was found that social engineering attacks that employed persuasion tactics had a success rate of 70%,
while those that did not have a success rate of 30%. Cialdini (Cialdini, 2021) has identified seven well-known
principles of persuasion: authority, reciprocity, liking, social proof, scarcity, commitment and unity. These
principles act as an entry point for most interdisciplinary scientific works on persuasion.

3.4.2 Methodology
The PERSU-R recognizer is based on the CSE-PUC (Tsinganos et al., 2022b) neural network architecture which
identifies whether a written sentence carries a persuasive payload container (pp-container), using a combination
of a Convolutional Neural Network (CNN) and a Multi-Layer Perceptron (MLP). The CNN network is used as a
feature extractor in the first layer of the network, and the MLP is used to produce a probability distribution over
the sentence classes, indicating whether the sentence is a persuasion payload container or not. The CNN
architecture identifies local patterns in the sentence that may indicate the presence of persuasive cues. A nonlinear
function is applied to each k-sized window of tokens in the convolution layer, transforming it into a scalar value.
The pooling layer combines the resulting vectors from the different k-sized windows into a single one-dimensional
vector by taking the maximum value observed in each of the l dimensions over the different k-sized windows.
The output of the l-dimensional vector is then fed into the following parts of the overall network architecture. The
parameters or values of the filters applied are tuned by back-propagating the gradients from the network loss
during the training phase.
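The convolution-and-pooling step described above can be sketched as follows; the toy filter weights, window size, and token embeddings are illustrative only:

```python
def conv_max_pool(embeddings, filt, k):
    """Slide a k-sized window over token embeddings, apply one filter u with a
    ReLU nonlinearity (r_i = f(x_i . u)), then take 1-max pooling over windows.

    embeddings: list of d-dimensional token vectors for one sentence;
    filt: flat k*d weight vector (one convolution filter);
    k: window size in tokens.
    """
    scores = []
    for i in range(len(embeddings) - k + 1):
        # concatenate the k token vectors of this window into one k*d vector
        window = [v for tok in embeddings[i:i + k] for v in tok]
        r = sum(w * x for w, x in zip(filt, window))  # dot product with filter
        scores.append(max(0.0, r))                    # ReLU nonlinearity
    return max(scores)                                # 1-max pooling
```

With several filters of sizes 2-5 (as in Table 9), each filter yields one pooled scalar, and the concatenation of these scalars forms the feature vector passed to the MLP.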

3.4.3 Corpus
The CSE-PUC architecture was trained on the CSE-PUC corpus, which is an enriched version of the CSE corpus.
More specifically, the CSE corpus was enriched using word-embedding techniques to add sentences with similar
words while keeping the meaning the same. Table 7 presents the CSE-PUC corpus details, where we can see that after
preprocessing and standardization, the corpus contains 3380 labeled sentences, with a balanced distribution of pp-
container and neutral sentences.

Table 7 –CSE-PUC corpus details

Characteristic Value
Corpus Name CSE-PUC Corpus
Collection Method Web-scraping, pattern-matching text extraction
Corpus size (N) 56 text dialogues/3380 sentences
Vocabulary size (V) 4500 terms
Total no. of turns 3380
Avg. tokens per turn 7.97
Content Chat-based dialogues
Collection date Jun 2018 – Dec 2020
Language English
Release year Jun 2021
License Private

It's also worth noting that the distribution of the number of words in the sentences varied between the two classes,
with social engineers using shorter sentences to launch their attacks and longer ones to build rapport before their
attack. Following the annotation task, the resulting dataset was balanced, as shown in Fig. 7, with 51.1% of the
sentences being pp-containers and 48.9% being non-pp containers, which are labeled as neutral in the figure. Fig.
8 displays the distribution of the sentence length for each class. It can be observed that a majority of social
engineers employ short sentences to carry out their attacks, whereas they use a greater number of words to
establish a friendly rapport before the actual attack.


Figure 7 - The distribution of the number of words in the sentences varied between the two classes
Figure 8 - The distribution of the sentence length for each class

3.4.4 Training
The training steps of the CSE-PUC are presented in pseudocode in Table 8.

Table 8 - Algorithmic steps of CSE-PUC training

Algorithm CSE-PUC training


Require: labeled corpus, hyperparameters
Input: dataset D = CSE Corpus; n words w_{1:n} = w_1, w_2, ..., w_n; sentence embedding E[w_i] = w_i; concatenated
vector of k words x_i; weight vector u; nonlinear activation function f; r_i ∈ ℝ, x_i ∈ ℝ^{k·d} and
u ∈ ℝ^{k·d}

Initialization: CSE-PUC training hyperparameters initialization: batch size, training epochs, learning
rate, dropout rate, optimizer
Begin Training:
Step 1: CSE-PUC network hyperparameters initialization: filter size, number of filters, weight values,
padding, activation function, stride length, pooling method
Loop: for each sentence E in dataset D
Step 2: do forward propagation:
1-D convolution operation: transform vector x_i = [w_i, w_{i+1}, ..., w_{i+k}] ∈ ℝ^{k·d} to r_i = f(x_i ∘ u)
Pooling operation: output q vectors p_{1:q} where p_i ∈ ℝ^ℓ
Step 3: Calculate Total Error
Total Error = Σ ½ (target probability − output probability)²
Step 4: Backpropagate the error and adjust the model parameters.
Calculate gradients of the error regarding all weights in the network
Execute gradient descent:
Update all filter and weight values
End Loop
End Training
Output: Probabilities over the classes

Before the training process, we still need to specify several hyperparameters, such as the number of filters and
filter size, as shown in Table 9.

Table 9 – CSE-PUC architecture training details

Description Values
Batch size 16
Word Embeddings fastText
Word Embeddings size 300
Filter sizes 2,3,4,5
Number of filters 100,100,100,100
Stride length 1
Zero padding Yes
Activation function ReLU
Pooling method 1-max
Dropout rate 0.1
Training epochs 10

3.4.5 Results
In Table 10, the performance of the PERSU-R recognizer is presented.

Table 10 – CSE-PUC Results

Model Accuracy Macro F1


Own-trained system 62.2 % 54.8 %
CSE-PUC 71.6 % 58.2 %

The evaluation metrics used are accuracy, and Macro F1-score. Figs 9 and 10 depict the training and validation
accuracy scores over the ten training epochs.

Figure 9 – CSE-PUC Validation loss
Figure 10 – CSE-PUC Accuracy

3.5 PERSI-R
3.5.1 Description
In the context of CSE attacks, we define persistence as the behavior of a social engineer who insists on trying to
extract sensitive information, using paraphrasing techniques to request the same information multiple times
without being noticed. Paraphrasing refers to restating information in a different way, and it is a tactic that
malicious users employ to evade recognition while persisting in their attempts to extract sensitive information or
perform other malicious actions.

3.5.2 Methodology
PERSI-R utilizes the CSE-PersistenceBERT model (Tsinganos et al., 2022a), which learns a universal
representation that transfers knowledge with only minimal adaptation to the paraphrase recognition task. CSE-
PersistenceBERT is a fine-tuned version (see Fig. 11) of BERT that is specifically trained for the paraphrase
recognition task in the context of CSE attacks. The model is fine-tuned on the CSE-Persistence corpus, which is
annotated with labels indicating whether a pair of sentences are paraphrases or not. More specifically, the CSE-
PersistenceBERT model takes as input a pair of sentences (see Fig. 12), and outputs a binary classification result
indicating whether the sentences are paraphrases or not. To obtain fixed-size sentence representations that can be
fed into the model, the input sentences are tokenized, padded to a fixed length, and passed through the BERT
model to obtain contextualized word embeddings. The sentence representations are then obtained by taking the
mean of the contextualized word embeddings across all tokens in each sentence.

Figure 11 – PERSI-R transfer learning method

Figure 12 - The proposed classification pipeline

3.5.3 Corpus
Following the SNLI paradigm (Bowman et al., 2015), we transformed our CSE corpus into the CSE-Persistence
corpus, a suitable dataset for paraphrase recognition. Each instance of the CSE-Persistence corpus is composed
of two sentences and is manually labeled as being a member of one of the following three categories:
• Identical (I): the two sentences are semantically close and share a common term targeting the same leaf
entity in the CSE ontology.
• Similar (S): the two sentences are semantically related and share a common intent, which translates into
a higher-level entity in the CSE ontology, targeting a different leaf entity.
• Different (D): the two sentences are not semantically related, and they do not share a common higher-
level or leaf entity.
We also named the two sentences as follows:
• The first sentence is referred to as the Prologue, in the sense that the social engineering attacker uses it to
introduce her intention to the play.
• The second sentence is referred to as the Epilogue, in the sense that it concludes the play and informs us
how the story ends.
Thus, the CSE-Persistence corpus contains instances of training examples where each instance is composed of a
string for the Prologue, a string for the Epilogue, and a Paraphrase Recognition label (Identical, Similar, Different).
Table 11 presents the CSE-Persistence corpus details.

Table 11 - CSE-Persistence corpus details

Characteristic Value
Corpus Name CSE-Persistence Corpus
Collection Method Web-scraping, pattern-matching text extraction
Corpus size 16900 sentences
Source of judgment Three judges
Development pairs 1690
Test pairs 1690
Identical sentences 6484
Similar sentences 5023
Different sentences 5393
Release year 2021
License Private

3.5.4 Training
During the training process, three columns were considered from the CSE-Persistence corpus: ‘Prologue’,
‘Epilogue’, and ‘Paraphrase Recognition label’. The Paraphrase Recognition label is the ‘gold_label’ given to the
instance after the manual annotation of the dataset. The CSE-Persistence corpus was divided into train, validation,
and test sets, where all the source data were elicited from successful or unsuccessful CSE attacks gathered from
websites, books, and logs, as described in our previous work.

3.5.5 Results
After fine-tuning, we compared CSE-PersistenceBERT to the baseline model and achieved the accuracy values
presented in Table 12.

Table 12 - Model benchmarks on the CSE-Persistence corpus.

Model Training Accuracy (%) Training Loss Validation Accuracy (%) Validation Loss
Baseline 82.57 0.36 73.83 0.39
CSE-PersistenceBERT 84.96 0.21 78.03 0.36

Fig. 13a depicts the accuracy of the training and validation datasets of CSE-PersistenceBERT (indicated as CSE),
and Fig. 13b depicts the loss of the training and validation datasets.
Figure 13 - CSE-PersistenceBERT (a) Training and Validation Accuracy (b) Training and Validation Loss

4. CSE-ARS Training
To comprehensively assess the effectiveness of our proposed CSE-ARS, a ten-fold cross-validation experiment
was conducted. An augmented version of the CSE Corpus underwent a randomization process, resulting in its division
into ten distinct subsets. Throughout each iteration of the experiment, nine subsets were further subdivided into
training (80%) and validation (20%) sets, while the remaining subset was designated as the testing set. By
employing this methodology, prediction scores for each testing subset were obtained after ten rounds, which were
subsequently amalgamated to derive an overall prediction score. To enhance the performance of our approach,
several strategies were implemented. Initially, we explored diverse configurations and trained each recognizer
using a single domain training set, with the validation set utilized to mitigate the risk of overfitting. Subsequently,
the optimal configuration parameters were selected based on the evaluation criterion of the area under the receiver
operating characteristic curve (AUC) value. Finally, subsequent to the training of the specific recognizers, an
exhaustive examination of various coefficient combinations was conducted until the classification performance,
as assessed by the AUC value, reached its maximum on the validation set.
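The splitting protocol described above can be sketched as follows; the shuffling seed and the round-robin fold assignment are implementation assumptions:

```python
import random

def ten_fold_splits(n_samples, seed=0):
    """Randomize sample indices and split them into 10 folds. In each round one
    fold is the test set, and the other nine folds are split 80/20 into
    train/validation sets, matching the protocol described in the text."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]  # round-robin fold assignment
    rounds = []
    for i in range(10):
        test = folds[i]
        rest = [j for k, fold in enumerate(folds) if k != i for j in fold]
        cut = int(0.8 * len(rest))  # 80% train, 20% validation
        rounds.append({"train": rest[:cut], "val": rest[cut:], "test": test})
    return rounds
```

Each of the ten rounds yields prediction scores on its held-out fold, and these are combined into the overall prediction score.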

4.3 Corpus
CSE-ARS was trained on the CSE-ARS corpus, which is an augmented version of the CSE corpus (Tsinganos
and Mavridis, 2021) composed of realized and fictional CSE attack dialogues. Table 13 presents the summary of
the CSE-ARS corpus.

Table 13 – CSE-ARS corpus details

Characteristic Value
Corpus Name CSE-ARS corpus
Collection Method Web-scraping, pattern-matching text extraction
Corpus size (N) 56 text dialogues/3380 sentences
Vocabulary size (V) 5400 terms
Total no. of turns 9685
Avg. tokens per turn 9.34
Content Chat-based dialogues
Collection date Jun 2018 – Dec 2020
Language English
Release year Feb 2023
License Private

4.1 Late-Fusion
Ensemble techniques offer a valuable means of enhancing system performance by combining multiple machine
learning models. Such techniques prove particularly beneficial when individual models exhibit high accuracy but
differ in the types of errors they make, as often encountered in multimodal approaches. Within the context of this
study, we consider "modality"(Lahat et al., 2015) as each distinct acquisition framework that captures information
regarding the same phenomenon from various types of recognizers, under different conditions, across multiple
experiments or subjects, among other factors. The late fusion approach, akin to decision-level fusion, constitutes
an ensemble technique that combines multiple models to generate a final prediction. In late fusion, the unimodal
decision values are acquired and merged to derive the ultimate decision. This approach facilitates flexible training
and straightforward predictions, even when one or more modalities are absent, albeit at the expense of disregarding
certain low-level interactions between modalities.
The output of CSE-ARS recognizers is concatenated and subsequently fed into a final classifier to formulate the
conclusive prediction. Leveraging late fusion methodology leads to enhanced performance when compared to
utilizing a solitary model, rendering it a commonly adopted technique across various machine learning domains,
including image classification, natural language processing, and speech recognition. The key advantage of late
fusion lies in its capacity to allow the individual recognizers to specialize in their respective areas of expertise,
thereby contributing to a more accurate final prediction. If a recognizer 𝑚𝑖 is used on modality 𝑖 using input
𝑘𝑖 where i = 1,2, … , M then the final prediction of a late fusion system is given by p =
f(𝑚1 (𝑘1 ), 𝑚2 (𝑘2 ), … , 𝑚𝑀 (𝑘𝑀 )). The strengths of late fusion systems are relatively simple to implement
compared to other models, as they simply combine the outputs of different models into a single prediction.
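A minimal sketch of this weighted late-fusion step, assuming each recognizer emits a probability vector over the same classes:

```python
def late_fusion(outputs, weights):
    """Weighted linear aggregation of per-recognizer class probabilities:
    p = f(m_1(k_1), ..., m_M(k_M)) with f a convex combination of the outputs.

    outputs: list of M probability vectors over the same classes;
    weights: M non-negative coefficients that sum to 1.
    """
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must lie on the simplex"
    n_classes = len(outputs[0])
    return [sum(w * out[c] for w, out in zip(weights, outputs))
            for c in range(n_classes)]
```

Because the combination is convex, the fused vector is itself a valid probability distribution over the classes.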

4.2 Simulated Annealing


The novelty of our approach resides in the architectural design and the fusion of multi-dimensional data. The
fundamental idea behind our multimodal fusion approach is to integrate the distinct output to enhance CSE attack
recognition. The outputs of the individual recognizers are represented by probability distributions, obtained
through the application of the SoftMax classifier. The SoftMax classifier formula is as follows:

softmax(y)_i = e^{y_i} / Σ_{j=1}^{n} e^{y_j}

where y_i represents the i-th element of the output vector, and n represents the output vector dimension. The
SoftMax layer's output constitutes a probability distribution across the two possible classes, i.e., whether an
utterance is considered a CSE attack or not. We conducted multimodal fusion based on weighted linear aggregation;
the specific construction steps are as follows:
• STEP1: The trained CSE-ARS recognizers (CRINL-R, DIACT-R, PERSI-R, PERST-R and PERSU-R)
are applied to the particular validation sets. The output (prediction results) of each individual recognizer
(i.e., output_CRINL-R, output_DIACT-R, output_PERSI-R, output_PERST-R, and output_PERSU-R) has the
form of a matrix with dimensions equal to the number of samples in the corresponding validation set and
the number of classes.
• STEP2: To fuse the features of the individual recognizers, we define the following fusion method:
output_CSE-ARS = α · output_CRINL-R + β · output_DIACT-R + γ · output_PERSI-R + δ · output_PERST-R + ε · output_PERSU-R
α + β + γ + δ + ε = 1 (1)
0 ≤ α, β, γ, δ, ε ≤ 1

where output_CSE-ARS represents the result of feature fusion, α represents the weight of the CRINL-R
output, β the weight of the DIACT-R output, γ the weight of the PERSI-R output, δ the weight of the
PERST-R output, and ε the weight of the PERSU-R output.
• STEP3: We find the best combination of (α, β, γ, δ, ε) on the validation set so that the cross entropy
between output_CSE-ARS and the label (one-hot encoded) is as close as possible to the theoretical
minimum. This step is equivalent to a new round of feature learning.
• STEP4: To find the optimal solution of (α, β, γ, δ, ε) on the validation set, we define the following
optimization problem:
min Loss(output_CSE-ARS, Label)
α + β + γ + δ + ε = 1 (2)
0 ≤ α, β, γ, δ, ε ≤ 1
To obtain the global optimal solution, we utilize the Simulated Annealing (SA) algorithm (Rere et al., 2015). SA
is a metaheuristic optimization algorithm used for discovering the global optimum of a complex objective
function. The probability of accepting a particular state x is determined by the following equation:

p(x) = e^{−Δf(x)/(kT)}

where Δf(x) is the change in the energy of the configuration, k is Boltzmann's constant, and T is the temperature.
The algorithm that describes the optimization procedure is shown in Table 14.
Table 14 - Simulated Annealing algorithm

Algorithm: Simulated Annealing


Generate random initial solution x0
Calculate objective function f(x0)
Parameter initialization (T, k, c)
while control condition not true do
for number of new states
Pick new solution x0 + Δx in neighborhood
# Evaluate new state
if f(x0 + Δx) > f(x0) then
f_new = f(x0 + Δx); x0 = x0 + Δx
else
Δf = f(x0 + Δx) − f(x0)
random r ∈ (0, 1)
if r < exp(Δf / (kT)) then
f_new = f(x0 + Δx); x0 = x0 + Δx
else
f_new = f(x0)
end if
end if
f = f_new
Decrease the temperature periodically: T = c × T
end for
end while
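A minimal sketch of using SA to search for the fusion weights; the Gaussian perturbation scale, the renormalization back onto the simplex, and the cooling schedule are assumptions not specified in the text:

```python
import math, random

def cross_entropy(weights, samples, labels):
    """Cross-entropy between the fused distribution and the true class index.
    samples[i] is the list of per-recognizer probability vectors for sample i."""
    loss = 0.0
    for model_outputs, y in zip(samples, labels):
        fused = [sum(w * out[c] for w, out in zip(weights, model_outputs))
                 for c in range(len(model_outputs[0]))]
        loss -= math.log(max(fused[y], 1e-12))
    return loss / len(labels)

def anneal_weights(samples, labels, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing over (α, β, γ, δ, ε): perturb the weights, project
    back onto the simplex, and accept worse states with probability
    exp(-Δloss / T), cooling the temperature geometrically."""
    rng = random.Random(seed)
    m = len(samples[0])            # number of recognizers
    w = [1.0 / m] * m              # start from uniform weights
    best, best_loss = w[:], cross_entropy(w, samples, labels)
    cur_loss, t = best_loss, t0
    for _ in range(steps):
        cand = [max(1e-6, wi + rng.gauss(0, 0.05)) for wi in w]
        s = sum(cand)
        cand = [c / s for c in cand]   # renormalize onto the simplex
        loss = cross_entropy(cand, samples, labels)
        # accept improvements always, worse states with Boltzmann probability
        if loss < cur_loss or rng.random() < math.exp((cur_loss - loss) / t):
            w, cur_loss = cand, loss
            if loss < best_loss:
                best, best_loss = cand[:], loss
        t *= cooling                   # T = c * T
    return best, best_loss
```

The best tuple found on the validation set is then frozen and applied to the test set, as in STEP4.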

To summarize, the Simulated Annealing (SA) algorithm is a metaheuristic optimization algorithm used to find the
global optimal solution of a complex objective function. It starts with an initial solution and evaluates it using the
objective function. Then, it perturbs the solution and evaluates the new solution. If the new solution is better, it
becomes the new current solution. If it's worse, it may still be accepted with a probability based on the temperature
parameter. The objective function in this work is defined as shown in formula (2), and the SA algorithm is used
to determine the optimal values of (α, β, γ, δ, ε) on the validation set. Finally, the tuple (α, β, γ, δ, ε) is transferred
to the test set to predict CSE attacks using the fused features, and the results are compared to those obtained
through single modality feature learning. The workflow of the CSE-ARS multimodal fusion is shown in Fig. 14.

Figure 14 – Workflow of the CSE-ARS multimodal fusion

The layers depicted in the figure above are as follows:


• Input Layer: This layer receives input from the individual recognizers.
• Fusion Layer: This layer combines the outputs from the individual recognizers into a single
representation. The fusion layer is implemented as a weighted linear aggregation.
• Dense Layer: This layer is used to apply non-linear transformations to the fused representation to produce
the final prediction. The dense layer is implemented as a fully connected layer with a sigmoid activation
function for binary classification problems.
• Output Layer: This layer produces the final prediction, which could be the probability of a given text
being a CSE attack or not.

These steps are repeated until the system reaches satisfactory performance. For this specific problem of CSE
attack recognition, a binary cross-entropy loss function is used. For the optimizer, our method of choice was the
Adam optimizer (Kingma and Ba, 2014), which adaptively adjusts the learning rate for each parameter based on
the historical gradient information. Furthermore, it tends to converge faster and with a better convergence
minimum compared to SGD (Ruder, 2016). In summary, the choice of a binary cross-entropy loss function and
the Adam optimizer is a reasonable one for the late fusion system for CSE attack recognition, as it provides a
robust and efficient way to measure the error and update the system parameters.
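The binary cross-entropy loss used by the fusion head can be written out directly; the clamping constant below is a numerical-stability assumption:

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy averaged over a batch, the loss used to train the
    fusion head. y_true holds 0/1 labels, y_pred holds predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)
```

A maximally uncertain prediction of 0.5 for every sample yields a loss of ln 2 ≈ 0.693, which is the usual reference point when reading loss curves for a binary classifier.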

5. CSE-ARS Evaluation
During our research, every experiment was tracked with respect to its progress and results. Furthermore, the
level of detail logged for each experiment was such that we could recreate it or compare it at a later time with
another experiment. For experiment tracking (e.g., loss curves, model performance metrics, hyperparameters, etc.)
and experiment versioning, we utilized the Weights & Biases platform (Biewald, 2020). Different combinations of
hyper-parameter values gave different performance results. We
exhaustively searched all possible combinations to find the optimal hyper-parameter values. The performance of
each hyper-parameters set was evaluated against a dedicated validation set. Our test split was never used for hyper-
parameter tuning, to avoid overfitting and the best model was selected based on its performance on the validation
set. Afterwards, this model’s performance was measured against the test split. The CSE-ARS system has been
extensively evaluated on the CSE-ARS corpus.
Due to the lack of similar CSE recognition systems, we utilized a baseline system based on the majority voting
method (Jain et al., 2022), in which the predictions of all the classifiers are combined and the class label with the
highest frequency is selected as the final prediction. By comparing the results of the two models, we aim to show
the benefits of using a more sophisticated late fusion model over a simple majority voting ensemble model for
chat-based social engineering attack recognition. The results show that CSE-ARS outperforms the majority voting
ensemble model, which was used as the baseline system.
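The majority-voting baseline can be sketched in a few lines:

```python
from collections import Counter

def majority_vote(predictions):
    """Baseline ensemble: return the mode of the classifiers' predictions,
    i.e., the class label predicted most often across the m classifiers."""
    return Counter(predictions).most_common(1)[0][0]
```

Unlike the weighted late-fusion model, this baseline ignores both the classifiers' confidence scores and their relative reliability, which is what the learned (α, β, γ, δ, ε) weights capture.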
For each recognizer we suppose that if recognition is true (e.g., persuasion is recognized) then a CSE
attack is underway. Thus, there are two different classes we need to predict: CSE-attack and Neutral. The predicted
class label ŷ, if there are m different classifiers, is given by ŷ = mode{C_1(x), C_2(x), ..., C_m(x)}. We evaluated
the model performance of each recognizer via 10-fold cross-validation on the training dataset before combining
them into an ensemble recognizer; the global optimization outcome for (α, β, γ, δ, ε) on the validation set
yielded (0.134, 0.294, 0.201, 0.169, 0.202). Upon acquiring the key parameters (α, β, γ, δ, ε) for the multimodal
fusion model, we tested both the individual recognizers and the fusion model on the test set. Their respective
prediction accuracies were 56.70%, 71.20%, 68.70%, 64.90%, 58.90%, and 79.96%. The multimodal fusion
model achieved a prediction accuracy 8.76% greater than the optimal single modality model. The loss value of
each model was calculated as shown in the following formula:
Loss(L_t, L*_t) = −(1/n) Σ_i [L_t(i) · log L*_t(i)] + λR(w)

where L_t is the correct label of the sample, L*_t is the network output, and λ is the weight of the regularization term.
The loss value for each individual recognizer was 0.304, 0.250, 0.378, 0.401, and 0.290, while that of the
multimodal fusion model was only 0.17954. We ran 10 repetitions of 10-fold cross-validation on the individual
recognizers and the multimodal fusion model, and the results are shown in Fig. 15.
Figure 15 – Sequence of 10-fold Cross-Validation

Each set represents a 10-fold cross-validation. All individual recognizers are represented with gray-scale colors
in each set. Across the 10 repetitions of validation, the average prediction accuracy of the CSE-ARS multimodal
fusion model and of the particular recognizers is shown in Table 15.

Table 15 - Average Accuracy of individual recognizers and CSE-ARS multimodal fusion in 10-fold cross validation

Model Avg. Accuracy Avg. Precision Avg. Recall Avg. F1-score


PERST-R 0.567 0.512 0.501 0.560
DIACT-R 0.712 0.648 0.633 0.604
PERSU-R 0.687 0.601 0.600 0.653
PERSI-R 0.749 0.667 0.640 0.630
CRINL-R 0.589 0.554 0.550 0.601
Majority Voting 0.753 0.700 0.689 0.701
CSE-ARS 0.799 0.702 0.702 0.701

To evaluate the performance of the CSE-ARS system in predicting CSE attacks, we computed ROC curves for
each class and determined the AUC values for each individual model. Fig. 16 shows the ROC curves of the models
used to evaluate the prediction performance of CSE attacks, and Table 16 displays the calculated AUC values for
each model.

Figure 16 - ROC curves of the CSE-ARS and particular recognizer models
Table 16 - AUC values of the CSE-ARS and particular recognizer models.

Model AUC
PERST-R 0.5479 (+/- 0.14)
DIACT-R 0.6783 (+/- 0.12)
PERSU-R 0.7010 (+/- 0.11)
PERSI-R 0.6971 (+/- 0.14)
CRINL-R 0.5783 (+/- 0.09)
Majority Voting 0.649 (+/- 0.12)
CSE-ARS 0.7432 (+/- 0.10)

6. Discussion
Deep learning is a powerful approach for building recognizers as it allows for the learning of high-level
abstractions from data. This is particularly useful in the case of CSE attacks, where the patterns of behavior and
language used by attackers can be complex and difficult to identify. Deep learning models such as CNNs, LSTMs
and BERT are able to learn these complex patterns and make accurate predictions about the likelihood of a CSE
attack. Furthermore, the use of pre-trained deep learning models is beneficial as these models are trained on large
amounts of data, which enables them to generalize well to new data and new types of CSE attacks. This is
important as social engineering attacks are constantly evolving, and being able to adapt to new types of attacks is
crucial for the effectiveness of a CSE attack recognition system. For the implementation of the individual deep
learning recognizers, we mainly used the popular deep learning frameworks PyTorch and HuggingFace.
The CSE-ARS late fusion model outperformed the individual recognizers in terms of accuracy, precision, recall,
F1 score, and AUC value. PERSI-R had the highest performance among the individual models, but the multimodal
fusion model achieved higher accuracy in 10-fold cross-validation. However, there was a large gap between the
AUC values of the multimodal fusion model and PERSU-R. We attribute the superior performance of the
CSE-ARS model to the ability of the late fusion approach to exploit multiple modalities, which may capture
more comprehensive information about CSE attacks. Overall, the results suggest that the multimodal fusion
approach has potential for improving the accuracy of CSE attack prediction. However, further research is needed
to validate the findings on larger and more diverse datasets, and to explore the generalizability and robustness of
the approach. Additionally, it may be beneficial to investigate the interpretability of the model and the specific
features that contribute to its predictions.
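As an illustration of the headline metric used above, the AUC of a binary recognizer can be computed as the probability that a randomly chosen attack session is scored higher than a randomly chosen benign one. The sketch below is our own minimal pure-Python illustration, not code from the CSE-ARS implementation:

```python
def auc_score(labels, scores):
    """Rank-based (Mann-Whitney) AUC: the probability that a randomly
    chosen positive sample receives a higher score than a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative samples")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy example: 4 chat sessions, label 1 = CSE attack, 0 = benign.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]  # hypothetical recognizer confidences
print(auc_score(labels, scores))  # → 0.75
```

An AUC of 0.5 corresponds to random guessing, which is why values such as PERST-R's 0.5479 indicate only marginal discriminative power.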
Many trade-offs had to be considered during model selection, such as computing requirements and performance.
As already mentioned, a design decision was to prefer models with fewer false positives at the expense of the
false-negative rate. A further design decision was to select models based solely on their performance rather than
their computing requirements, even though neural networks demand more powerful hardware (e.g., a GPU rather
than a CPU) to deliver high accuracy with acceptable inference latency. It was also crucial to maintain a
comprehensive record of the definitions required to replicate an experiment, alongside its pertinent artefacts,
i.e., the files created during an experiment, such as loss curves, evaluation-loss graphs, logs, or intermediate
model outputs produced during training. This practice facilitates the comparison of distinct experiments and aids
in selecting the experiment best suited to one's specific requirements.
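The kind of experiment record described above can be kept in a simple, machine-readable form. The sketch below is our own illustration of the idea (the function name, fields, and values are invented for this example; CSE-ARS experiments were tracked with dedicated tooling rather than this exact code):

```python
import hashlib
import json
import time
from pathlib import Path

def save_run_record(run_dir, hyperparams, metrics, artefacts):
    """Persist what is needed to replicate and compare an experiment:
    hyperparameters, final metrics, and paths to artefact files
    (loss curves, logs, checkpoints)."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "hyperparams": hyperparams,
        "metrics": metrics,
        "artefacts": [str(p) for p in artefacts],
    }
    # A stable ID derived from the hyperparameters makes re-runs easy to spot.
    record["run_id"] = hashlib.sha1(
        json.dumps(hyperparams, sort_keys=True).encode()).hexdigest()[:8]
    (run_dir / "record.json").write_text(json.dumps(record, indent=2))
    return record

# Toy usage with made-up values.
record = save_run_record(
    "runs/demo",
    hyperparams={"lr": 2e-5, "epochs": 4, "model": "bert-base-uncased"},
    metrics={"auc": 0.74, "f1": 0.71},
    artefacts=["runs/demo/loss_curve.png", "runs/demo/train.log"],
)
```

Because the record is plain JSON, distinct runs can be diffed or loaded into a table for side-by-side comparison.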
The proposed system, CSE-ARS, has shown promising results in recognizing the various enablers, such as
personality traits, dialogue acts, persuasion attempts, persistent behavior, and critical-information leverage. The
system's performance was evaluated on a dataset of real-world and fictional chat conversations, and the results
indicate that it accurately recognizes these enablers with high precision and recall. One of the main contributions
of this study is the use of the late fusion method to combine the separate outputs of the individual recognizers.
This approach leverages the strengths of each technique and results in a more robust and accurate recognition
system.
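Conceptually, the late fusion step reduces the per-session probabilities emitted by the five recognizers to a single attack score. The sketch below illustrates a weighted linear combination of this kind; the probabilities and weight values are placeholders for illustration, not the values learned for CSE-ARS:

```python
# Hypothetical per-session probabilities from the five recognizers.
recognizer_outputs = {
    "PERST-R": 0.42,  # personality traits
    "DIACT-R": 0.66,  # dialogue acts
    "PERSU-R": 0.71,  # persuasion
    "PERSI-R": 0.69,  # persistence
    "CRINL-R": 0.55,  # critical information leverage
}

# Placeholder fusion weights; in CSE-ARS the weights are tuned with
# simulated annealing and k-fold cross-validation.
weights = {"PERST-R": 0.10, "DIACT-R": 0.20, "PERSU-R": 0.30,
           "PERSI-R": 0.25, "CRINL-R": 0.15}

def late_fusion_score(outputs, weights):
    """Weighted linear combination of recognizer probabilities."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[name] * p for name, p in outputs.items())

score = late_fusion_score(recognizer_outputs, weights)
is_attack = score >= 0.5  # decision threshold
print(round(score, 4), is_attack)
```

Because fusion happens at the decision level, each recognizer can use whatever architecture suits its task (CNN, LSTM, or BERT), and a recognizer can be retrained or replaced without touching the others.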
The individual recognizers used in the proposed system also deserve further discussion. The convolutional neural
network used for recognizing persuasion attempts proved effective at identifying patterns in the language of such
attempts. BERT is well suited to recognizing personality traits, dialogue acts, and persistent behavior, as it is a
pre-trained language model with demonstrated strong performance on natural language understanding tasks. It is
worth noting that the BERT models used in this study were fine-tuned on specific, appropriate corpora, and their
performance may improve with further fine-tuning on larger and more diverse corpora. Moreover, the use of such
models allows the proposed system to be easily updated and improved as new data and techniques become
available, making it more adaptive to constantly evolving social engineering attacks. Overall, the individual
recognizers used in CSE-ARS have proven effective in recognizing the various enablers of social engineering
attacks.
7. Conclusion
In this paper, we have presented the CSE-ARS system, showcasing its efficacy in the recognition and mitigation
of chat-based social engineering attacks. CSE-ARS adopts an interdisciplinary approach, considering diverse
factors encompassing personality traits, linguistic aspects (e.g., dialogue acts), behavioral characteristics (e.g.,
persuasion, persistence), and information technology attributes. Through multi-modal late fusion, CSE-ARS
integrates predictions from multiple recognizers, leveraging the strengths of various deep learning models,
including the Critical Information Leverage Recognizer (CRINL-R), Personality Recognizer (PERST-R),
Dialogue-act Recognizer (DIACT-R), Persuasion Recognizer (PERSU-R), and Persistence Recognizer (PERSI-
R). To facilitate the training and evaluation of these recognizers within a transfer learning framework, we
employed five distinct custom corpora of real-world and fictional chat-based social engineering attacks, with each
corpus specifically annotated for a particular downstream task. Leveraging transfer learning, we effectively
utilized pre-existing pre-trained models, reducing the demand for extensive training data while simultaneously
enhancing the efficiency of the CSE-ARS system. In order to optimize the performance of CSE-ARS, we
employed a hybrid approach wherein the outputs of the individual recognizers were aggregated through weighted
linear combination. The determination of weight values involved employing simulated annealing and k-fold cross-
validation techniques, ensuring that the system attained its peak performance while maintaining a balanced
integration of the individual recognizers. The comprehensive evaluation conducted in this study substantiates the
effectiveness and reliability of CSE-ARS in accurately recognizing CSE attacks, rendering it a valuable tool for
safeguarding individuals and organizations against such threats. Our results indicate that the proposed approach
exhibits satisfactory performance in recognizing chat-based social engineering attacks. Future research endeavors
may focus on expanding the system's capabilities by incorporating additional factors influencing social
engineering attacks. Furthermore, investigating the system's performance in real-world scenarios and exploring
the potential of integrating CSE-ARS with existing security systems would contribute to practical deployment
and wider adoption.
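The weight-tuning procedure mentioned above can be sketched as follows. This is a deliberately simplified simulated-annealing loop over fusion weights, using a tiny made-up validation set and plain accuracy; the actual CSE-ARS procedure also incorporates k-fold cross-validation:

```python
import math
import random

random.seed(0)

# Toy validation set: each row holds the five recognizers' probabilities
# for one chat session, plus the true label (1 = CSE attack, 0 = benign).
val_probs = [
    ([0.9, 0.7, 0.8, 0.6, 0.7], 1),
    ([0.2, 0.3, 0.1, 0.4, 0.2], 0),
    ([0.6, 0.8, 0.7, 0.9, 0.5], 1),
    ([0.4, 0.2, 0.3, 0.1, 0.3], 0),
]

def accuracy(weights):
    """Validation accuracy of the weighted linear fusion at threshold 0.5."""
    correct = 0
    for probs, label in val_probs:
        score = sum(w * p for w, p in zip(weights, probs))
        correct += int((score >= 0.5) == bool(label))
    return correct / len(val_probs)

def normalize(w):
    total = sum(w)
    return [x / total for x in w]

def anneal(steps=500, temp=1.0, cooling=0.99):
    """Simulated annealing: perturb one weight at a time and occasionally
    accept worse solutions, with that probability shrinking as temp cools."""
    weights = normalize([1.0] * 5)  # start from uniform weights
    acc = accuracy(weights)
    best, best_acc = weights, acc
    for _ in range(steps):
        cand = weights[:]
        i = random.randrange(5)
        cand[i] = max(1e-6, cand[i] + random.uniform(-0.2, 0.2))
        cand = normalize(cand)
        cand_acc = accuracy(cand)
        if cand_acc >= acc or random.random() < math.exp((cand_acc - acc) / temp):
            weights, acc = cand, cand_acc
            if acc > best_acc:
                best, best_acc = weights, acc
        temp *= cooling
    return best, best_acc

weights, acc = anneal()
```

Accepting occasional downhill moves lets the search escape local optima, which gradient-free weight tuning over a non-differentiable objective (thresholded accuracy) otherwise gets stuck in.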

