
Detecting spear phishing using natural language processing and linguistic analysis


Ray Smith
Abstract
This project explores the use of machine learning techniques to identify impersonation in emails. As
malicious actors employ social engineering techniques to access important information, resources,
and accounts, impersonation attacks pose an increasing threat to both organisations and individuals.
Identifying harmful emails before they reach the end user is feasible by utilising machine learning
and natural language processing algorithms. The strategies and procedures used to identify
fraudulent emails are described in this study, including content, syntax, and keyword analysis. It
discusses how natural language processing may be used to spot malicious emails and how to
improve detection precision and accuracy.

This study examines the linguistics of emails to determine whether it is possible to identify people by
their writing styles. The effectiveness of supervised and unsupervised machine learning approaches
in identifying malicious emails is analysed. The report then compares the accuracy and false positive
rates of the machine learning algorithms to assess how well they perform.

The study then explores the potential drawbacks and difficulties of these strategies. The findings of
this study demonstrate that machine learning algorithms can accurately identify impersonation
attacks. However, the algorithms are limited by their inability to recognise sophisticated
attacks that require more advanced features. The report concludes with future research directions
for enhancing machine learning algorithms' capability to recognise spear phishing attacks.

This project aligns with the Web and Mobile Security (WM) CyBoK Skill.

Table of Contents
Abstract
List of Figures
List of Tables
List of Equations
1 Introduction
2 Literature Review/Background
3 Research Methodology
   3.1 Methodological approach
      3.1.1 Data Collection
      3.1.2 Data Preprocessing
      3.1.3 Feature Extraction
      3.1.4 Machine Learning Model
      3.1.5 Evaluation
      3.1.6 Validation
      3.1.7 Ethical Considerations
   3.2 Datasets
      3.2.1 Enron vs Hillary Clinton
      3.2.2 Ham/Spam
      3.2.3 Dataflow diagram to show research methodology
4 Investigation
   4.1 Functional Requirements
   4.2 Non-functional Requirements
   4.3 Study of two machine learning models and how they may be used to address the problem
      4.3.1 Logistic Regression
      4.3.2 Support Vector Machine (SVM)
      4.3.3 Decision on the model to use
5 Design & Development
   5.1 Overview
   5.2 Imported Libraries
   5.3 Dataflow Diagrams
      5.3.1 A simplified version of the program
   5.4 Menu System
   5.5 Email Parsing
   5.6 Linguistic Analysis
      5.6.1 Formality
      5.6.2 Spelling
      5.6.3 Punctuation
      5.6.4 Language Fluency
      5.6.5 Spam Detection/Identification of Senders
   5.7 Non-linguistic analysis
      5.7.1 Timestamp
      5.7.2 Sender/Recipient
6 Test Results
   6.1 Testing of Requirements
      6.1.1 Filtering of incoming emails
      6.1.2 Natural Language Processing
      6.1.3 Machine Learning
7 Analysing Results
   7.1 Spam Detection
   7.2 SVM model for identifying senders of a message
   7.3 Experimental linguistic/non-linguistic analysis results
8 Conclusion
References
Appendix
   8.1 Appendix A – Ethical Approval not required
   8.2 Appendix B – python-menu source code
   8.3 Appendix C – Main code (linguistic_analyiser.py)
   8.4 Appendix D – nlp.py
   8.5 Appendix E – Filtering of incoming emails test evidence
   8.6 Appendix F – NLP brawner-s test evidence
   8.7 Appendix G – Evidence of testing datasets against the model
   8.8 Appendix H – Evidence of passing the test for testing datasets against the model
   8.9 Appendix I
   8.10 Appendix J
   8.11 Appendix K
   8.12 Appendix L – Design of CLI interface
   8.13 Appendix M – Classification of SVM models
   8.14 Appendix N – Researching existing solutions
   8.15 Appendix O – Laws and regulations

List of Figures
Figure 1 - Research methodology dataflow diagram
Figure 2 - Logistic Regression python analysis
Figure 3 - Support Vector Machine python analysis
Figure 4 - Simplified dataflow diagram of the program
Figure 5 - Cropped example output of the python-menu library
Figure 6 - Code snippet of email parsing
Figure 7 - Code snippet for detecting formality
Figure 8 - Code snippet of spelling function
Figure 9 - Code snippet of punctuation function
Figure 10 - Code snippet of language detector
Figure 11 - Architecture of the SVM model (ResearchGate, 2017)
Figure 12 - Optimal hyperplane for the SVM model (S, 2021)
Figure 13 - Code snippet of the reading dataset for spam detection
Figure 14 - Code snippet of refining text for spam detection
Figure 15 - Code snippet of dividing the dataset for training/testing
Figure 16 - Code snippet of splitting the data
Figure 17 - Code snippet of pipelining data
Figure 18 - Code snippet to retrieve the timestamp of an email
Figure 19 - Code snippet of the start of the class 'Person' (see Appendix C for full code output)
Figure 20 - Classification of the SVM model for spam detection
Figure 21 - Classification of the SVM model for detecting bass-e (Appendix M)
Figure 22 - Classification of the SVM model for detecting brawner-s (Appendix M)
Figure 23 - Classification of the SVM model for detecting nemec-g (Appendix M)
Figure 24 - Graph to show the efficiency of experimental linguistic/non-linguistic analysis

List of Tables
Table 1 - Imported libraries information
Table 2 - ARI grade levels
Table 3 - Results of testing each experimental analysis method
Table 4 - Results of removing punctuation attribute from testing

List of Equations
Equation 1 - Grade Level (GL) equation
Equation 2 - Automated Readability Index (ARI) equation
Equation 3 - Linear kernel formula

1 Introduction
Phishing is a common cyberattack that involves misleading individuals into providing sensitive
information such as personal, financial, and other information (Krombholz, Hobel, Huber, & Weippl,
2013). A more sophisticated and targeted form of phishing, spear phishing, employs tailored
messages to deceive specific individuals or groups. Spear phishing attacks are becoming increasingly
complex, endangering people and organisations worldwide. This study aims to explore
how linguistic analysis and natural language processing may be used to recognise spear phishing
attacks.

The necessity for trustworthy and efficient cybersecurity solutions has never been more pressing
than it is in the contemporary digital age. Traditional detection techniques have proven
ineffective at limiting the threat posed by spear phishing attacks as they continue to develop and
grow more sophisticated. This dissertation aims to present a novel method for identifying and
counteracting spear phishing attacks by harnessing the strength of linguistic analysis and natural
language processing, through a thorough examination of the linguistic and semantic characteristics
of spear phishing attempts. By establishing a workable and efficient detection tool, this study intends
not only to improve the security posture of individuals and organisations but also to raise awareness
and knowledge of the increasingly complex landscape of cyber threats.

The overarching research question of this dissertation is: how might linguistic analysis and natural
language processing be utilised to identify spear phishing attacks?

The following objectives have been posed to respond to this question:

❖ To research existing methods and procedures used to identify spear phishing attacks.

❖ To investigate the language used in emails and communications.

❖ To develop and deploy a spear phishing attack detection method based on natural language
processing.

❖ To evaluate the proposed approach's efficiency and compare it with other methodologies.

By achieving these objectives, this dissertation aims to contribute to the growing body of knowledge
surrounding spear phishing detection and prevention. Moreover, it seeks to provide a practical
solution leveraging the power of linguistic analysis and natural language processing to enhance the
security of individuals and organizations against this pervasive and evolving threat. Through this
research, the study also aspires to stimulate further exploration and development of innovative
techniques in the field of cybersecurity and spear phishing countermeasures.

2 Literature Review/Background
In recent years, researchers have increasingly focused on the application of natural language
processing (NLP) and linguistic analysis to enhance spear phishing detection. Spear phishing, a
specialised form of cyber attack, involves the use of targeted emails to deceive victims into
disclosing sensitive information or installing malware. Traditional methods of detecting phishing
emails, such as heuristics and blacklists, have proven insufficient due to their vulnerability to
circumvention by attackers. Thus, the exploration of NLP and linguistic analysis in spear phishing
detection is a crucial area of research.

(Karim, Hasan, Uddin, & Islam, 2021) proposed a machine-learning approach that combined
linguistic and lexical features to identify spear phishing emails. Utilising a dataset of 2,000 emails,
the study achieved an accuracy of 91.6% in spear phishing email detection. This research highlights
the potential effectiveness of integrating linguistic analysis with machine learning techniques to
improve spear phishing detection. However, the study's limited dataset size suggests that further
research is necessary to validate the approach with larger and more diverse email samples.

(Maity & Bandyopadhyay, 2021) emphasised the analysis of email content using NLP techniques to
identify social engineering tactics prevalent in spear phishing attacks. The authors evaluated their
approach on a dataset of 2,500 emails, reporting an accuracy of 96.2% in detecting spear phishing
emails. While the results of this study demonstrate the promising potential of NLP techniques in
spear phishing detection, the study's focus on content analysis could limit its effectiveness in
addressing other aspects of spear phishing emails, such as sender information and email structure.

(Zhang, Yuan, Zhang, & Chen, 2020) developed a hybrid model that combined machine learning
algorithms with NLP techniques to detect spear phishing emails. The authors tested their approach
on a dataset of 1,500 emails, obtaining an accuracy of 95.2% in spear phishing email identification.
The study's hybrid approach showcases the potential benefits of integrating machine learning
algorithms with NLP techniques. Nonetheless, the relatively small dataset size may hinder the
generalisability of the model, necessitating further research with more extensive datasets.

A recent study (Maity & Bandyopadhyay, 2021) proposed a deep learning-based approach that
employed both lexical and semantic features to detect spear phishing emails. The authors tested
their approach on a dataset of 3,000 emails, achieving an accuracy of 97.5% in identifying spear
phishing emails. Despite the impressive results, the study's focus on deep learning-based techniques
may require substantial computational power and resources, potentially limiting its applicability in
real-world scenarios.

(Younghoo, Joshua, & Richard, 2020) explored the use of transfer learning and pre-trained language
models, such as BERT and GPT, to improve spear phishing detection. The authors argued that
leveraging pre-trained models could enable the detection system to capitalise on the vast knowledge
and linguistic understanding inherent in these models, thus potentially enhancing the detection
accuracy. The study employed a dataset of about 5 million emails and achieved an accuracy of 87 %
in detecting spear phishing emails. While the use of transfer learning and pre-trained models
presents a promising direction in spear phishing detection, the associated computational
requirements and resource constraints may limit its practical applicability, particularly for smaller
organisations with limited resources.

In their conference paper, (Soon, Chiang, On, Rusli, et al., 2020) compared the performance of
ensemble simple feedforward neural networks (FFNNs) and deep learning neural networks (DLNNs)
in phishing detection. The authors aimed to determine the effectiveness of these techniques in
identifying phishing attacks and assess their potential for practical implementation in cybersecurity
systems. They conducted experiments using a dataset containing various phishing and legitimate
emails, focusing on the accuracy, precision, and recall of the models. The results demonstrated that
the ensemble FFNN approach outperformed the DLNN model in phishing detection, suggesting that
combining multiple FFNNs in an ensemble manner can enhance the detection capabilities of
phishing classifiers. The authors highlighted the importance of considering both the performance
and the computational complexity of these models when selecting an appropriate method for
phishing detection in real-world scenarios.

Lastly, (Benavides, Fuertes, Sanchez, & Sanchez, 2020) performed a thorough literature review of
phishing attack solutions based on deep learning methods in the context of defence and security.
The authors looked at various feature extraction methods, including natural language
processing (NLP) techniques like TF-IDF and topic modelling, as well as deep learning algorithms like
convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term
memory (LSTM) networks. They concluded that the best accuracy in spear phishing email detection
was achieved by combining deep learning techniques, specifically LSTM networks, with NLP-based
feature extraction. The authors did, however, note the need for future studies to improve model
performance and investigate the trade-offs between detection accuracy and computational
requirements. Their work emphasises the value of ongoing research in this area and demonstrates
the potential of deep learning techniques in advancing phishing attack solutions.

In conclusion, the literature on NLP and linguistic analysis in spear phishing detection highlights the
potential of combining these techniques with machine learning and deep learning algorithms to
achieve significant improvements in accuracy. However, the limited dataset sizes employed in these
studies necessitate further research to validate and refine the proposed models. Additionally, the
exploration of alternative approaches that address various aspects of spear phishing emails and
consider practical constraints, such as computational resources, remains a critical area for future
research. Furthermore, addressing privacy concerns and developing privacy-preserving techniques
for spear phishing detection will be essential as the field progresses.

3 Research Methodology
3.1 Methodological approach
This study will look at the effectiveness of linguistic analysis and natural language processing in
identifying spear phishing attacks. The following methodological techniques will be used to
accomplish this.

3.1.1 Data Collection


Gathering a dataset of spear phishing emails will be the initial stage of this research project. This
dataset will be used to train and evaluate the technique based on natural language processing. The
dataset will be gathered from publicly accessible sources.

The decision to gather spear phishing emails from publicly accessible sources is justified as it
provides a wider variety of emails from various origins. The pre-processing step of deleting any
identifying information and unnecessary emails is also appropriate, since it protects people's privacy
and eliminates any extraneous data that could affect the accuracy of the analysis.

3.1.2 Data Preprocessing


The gathered dataset will be cleaned to remove extraneous data and noise, including any email
headers, footers, and signatures.

The data preprocessing measures suggested for this research project, such as removing
extraneous emails and identifying information, are appropriate since they protect people's privacy
and eliminate any noise from the dataset that could impair the study. The use of methods such as
stemming, stop word removal, and tokenisation, which are frequently employed in natural language
processing and are likely to increase the accuracy and effectiveness of the analysis, is also
warranted, as shown in the sketch below.
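
Below is a minimal sketch of what such a preprocessing step might look like using NLTK (see Table 1); it assumes the NLTK 'punkt' and 'stopwords' resources have been downloaded, and the exact pipeline used in this project may differ.

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def preprocess(text):
    # Tokenise the raw email body into lowercase word tokens
    tokens = word_tokenize(text.lower())
    # Drop non-alphabetic tokens and common English stop words
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    # Reduce each remaining word to its stem
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]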

3.1.3 Feature Extraction


Using natural language processing methods including part-of-speech tagging, named entity
recognition, and sentiment analysis, the linguistic characteristics of the spear phishing emails will
be extracted. Each email's feature vector will be created from these features.

The use of natural language processing methods for feature extraction, including part-of-speech
tagging, named entity recognition, and sentiment analysis, is justified since it enables a more
thorough study of the language used in the emails. These elements are expected to offer helpful
insights into the linguistic characteristics of spear phishing emails, supporting a more effective
machine learning model. A sketch of such a feature extractor follows.
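
As an illustration only (these are not necessarily the exact features used in this project), a simple feature vector combining part-of-speech, named entity, and sentiment information could be built with NLTK and TextBlob (see Table 1), assuming the relevant NLTK taggers and chunkers are installed:

import nltk
from textblob import TextBlob

def extract_features(text):
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    # Proportion of nouns as a crude part-of-speech feature
    noun_ratio = sum(tag.startswith('NN') for _, tag in tagged) / max(len(tagged), 1)
    # Number of named entities found by NLTK's chunker
    entity_count = sum(1 for subtree in nltk.ne_chunk(tagged) if hasattr(subtree, 'label'))
    # Sentiment polarity from TextBlob (-1 negative to +1 positive)
    polarity = TextBlob(text).sentiment.polarity
    return [noun_ratio, entity_count, polarity]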

3.1.4 Machine Learning Model


A machine learning model will be used to categorise emails as either legitimate or malicious. This
model will be trained on the preprocessed dataset and the feature vectors created in the preceding
phase. Decision trees, support vector machines, and neural networks, among other machine
learning techniques, will be assessed.

Since machine learning techniques such as decision trees, support vector machines, and neural
networks have been widely tested and deployed in the field of natural language processing, it makes
sense to apply them to spear phishing detection. It is also appropriate to evaluate the
method's effectiveness using a variety of metrics, including accuracy, precision, recall, and F1-score:

F1 = 2 × (precision × recall) / (precision + recall)

This is because these metrics provide a comprehensive evaluation of
the method's efficacy.

3.1.5 Evaluation
Several metrics obtained from the machine learning model, including accuracy, precision, recall, and
F1-score, will be used to assess the effectiveness of the established technique, as sketched below.
The method's effectiveness will also be assessed by comparing it to currently used methods for
identifying spear phishing attempts.
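
These metrics can be computed directly with scikit-learn; the sketch below uses toy labels purely for illustration:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_test = ['spam', 'ham', 'spam', 'ham']  # hypothetical true labels
y_pred = ['spam', 'ham', 'ham', 'ham']   # hypothetical model predictions

print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, pos_label='spam'))
print("Recall:   ", recall_score(y_test, y_pred, pos_label='spam'))
print("F1-score: ", f1_score(y_test, y_pred, pos_label='spam'))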

3.1.6 Validation
The created technique will be validated using a separate dataset of spear phishing emails that was
not used during training. This testing will help ensure that the method generalises and can correctly
identify spear phishing attacks in fresh, unseen data.

It is appropriate to use a separate dataset of spear phishing emails to validate the method because it
prevents the method from overfitting to the training dataset and assures that it can correctly
identify spear phishing attempts in fresh, unseen data.

3.1.7 Ethical Considerations


Ethical considerations will be made in this research because spear phishing attacks are illegal
and can harm people and organisations. To preserve people's privacy, the gathered data will be
anonymised, and any spear phishing attacks found will be reported to the relevant authorities.

This will help guarantee that the study is carried out ethically and responsibly; the ethical measures
described in the methodology, such as the anonymisation of acquired data and the reporting of
suspected spear phishing attempts to the proper authorities, are warranted.

3.2 Datasets
3.2.1 Enron vs Hillary Clinton
Both the Enron (Cohen, 2016) and Hillary Clinton (KAGGLE, 2015) datasets are often utilised in NLP
research for various objectives. The Enron dataset, which is commonly used for email classification
tasks, is a collection of emails from the Enron Corporation. The Hillary Clinton dataset, on the other
hand, is a selection of emails from Hillary Clinton's email server that were made public as part of an
investigation into her email practices while she was Secretary of State.

The Enron dataset is widely viewed as the better choice for natural language processing research
for several reasons, even though each dataset has advantages and disadvantages. First, the Hillary
Clinton dataset contains only roughly 30,000 emails, whereas the Enron dataset has over 500,000,
acquired by the Federal Energy Regulatory Commission during its investigation into the collapse of Enron.

Second, compared to the Hillary Clinton dataset, which is skewed towards emails with political and
governmental overtones, the Enron dataset represents a more varied and representative sample of
emails. It includes emails from many people who held various positions and divisions inside the
business, offering a wider diversity of linguistic and communicational practices.

Last but not least, there are many established benchmarks and points of comparison, since the Enron
dataset has been widely analysed and used in many natural language processing research projects.
This makes it simpler for researchers to compare their findings with those of other researchers
using the same dataset, improving the reliability and generalisability of results.

In summary, the Enron dataset is usually regarded as the better option for natural language
processing research because of its larger size, diversity, and proven use in the field, even though
both datasets have advantages and disadvantages.

The May 7th 2015 version of the dataset has been used in this project (Cohen, 2016).

3.2.2 Ham/Spam
For spam detection, the SMS Spam Collection, a set of SMS messages tagged for spam research, was
assembled (UCI Machine Learning). It contains 5,574 English SMS messages, each labelled as spam
or ham (legitimate). This dataset will be used to train the spam detection portion of the project.

3.2.3 Dataflow diagram to show research methodology

Figure 1 - Research methodology dataflow diagram

4 Investigation
4.1 Functional Requirements
Functional requirements are the particular capabilities and functions that a system must have to
achieve its intended goals. Essential functional requirements for spear phishing detection include:

❖ Filtering of incoming emails - The system must be able to recognise emails that are potential
spear phishing attempts.
❖ Natural Language Processing - The system must be able to analyse the email's text to spot
crucial linguistic cues that may point to a spear phishing attempt.
❖ Machine Learning - The system must be able to apply machine learning methods to
enhance its spear phishing detection capabilities over time.
❖ Alerts and Reporting - When a possible spear phishing attempt is discovered, the system must
be able to deliver real-time alerts and reports to users.

4.2 Non-functional Requirements


Non-functional requirements are qualities of a system that are essential to its success but are not
directly connected to its functionality. The following are some important non-functional
requirements for a spear phishing detection system:

❖ Scalability - The system should be able to manage large datasets of emails without
compromising its capacity to identify spear phishing attempts.
❖ Reliability - The system should be dependable and consistently accurate when recognising
spear phishing attempts.
❖ Usability - The system should have an intuitive user interface, be simple to use, and provide
end users with clear instructions.

4.3 Study of two machine learning models and how they may be used to address the
problem
4.3.1 Logistic Regression
Logistic regression is a classification technique used to determine the probability of an event's
success or failure. It is used when the dependent variable is binary (1 or 0). A given set of labelled
data may be divided into classes by analysing the connections between the data. It learns a linear
relationship from the supplied dataset before introducing a non-linearity in the form of the sigmoid
function (Rout, 2023).

4.3.1.1 Application of NLP to the problem of phishing message detection


Logistic regression, which is often used for many sorts of NLP tasks, allows the use of arbitrary input
features. For example, the problem of period disambiguation (detecting whether a period signifies
the end of a sentence or part of a word) may be solved by classifying each period into the EOS
(end-of-sentence) and not-EOS categories. A minimal sketch of applying it to phishing message
classification follows.
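
This sketch uses toy messages and labels purely for illustration, not data from this project:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical labelled messages for illustration
messages = ["Verify your account now", "Meeting moved to 3pm",
            "Your password expires today", "Lunch on Friday?"]
labels = ["phishing", "legitimate", "phishing", "legitimate"]

# TF-IDF features feeding a logistic regression classifier
model = Pipeline([('tfidf', TfidfVectorizer()), ('lr', LogisticRegression())])
model.fit(messages, labels)
print(model.predict(["Please verify your password"]))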

4.3.1.2 Strengths
Logistic regression is simple to understand and requires minimal training. It works well when
the dataset is linearly separable and has good accuracy for a variety of basic data sets. It
doesn't make any assumptions about how classes are distributed in feature space.

Although it is comparatively less prone to it, logistic regression can overfit on high-dimensional
datasets. To prevent overfitting in such cases, one may want to consider regularisation
(L1 and L2) approaches. Training a logistic regression model is extremely efficient, and the model is
easy to implement and analyse (Pareek, 2021).

4.3.1.3 Weaknesses
Logistic regression has a linear decision surface, hence it cannot address non-linear problems, and
real-world situations seldom involve linearly separable data (Pareek, 2021).

Complex associations are difficult to capture using logistic regression. Its performance can readily
be surpassed by more powerful and compact algorithms such as neural networks
(Pareek, 2021).

In linear regression, the independent and dependent variables are linearly correlated. For logistic
regression, however, the independent variables must have a linear relationship with the log odds,
log(p/(1−p)) (Pareek, 2021).

4.3.1.4 Analysing the Logistic Regression model with Python

Figure 2 - Logistic Regression python analysis

4.3.2 Support Vector Machine (SVM)


Support Vector Machine (SVM) is a supervised machine learning algorithm that may perform
classification, regression, and even outlier detection. The linear SVM classifier works by creating an
optimal hyperplane that separates the classes with the maximum possible margin. One class label
will be applied to all of the data points that fall on one side of the line, and a second class label will
be applied to all of the points that fall on the other side. Although this sounds simple, there are
countless candidate lines to choose from (Maklin, 2019).

4.3.2.1 Application of NLP to the problem of phishing message detection


Because instances are often represented by very high-dimensional but relatively sparse feature
vectors, positive and negative examples in NLP tasks are frequently spread into two considerably
distinct regions of the feature space. This aids both the SVM's ability to locate a classification
hyperplane in feature space and the classifier's ability to generalise, and it plays a significant role in
the SVM's capacity to deliver strong results for a variety of NLP tasks (L., Kalina, & Hamish, 2009).

4.3.2.2 Strengths
With SVMs, several different kernels can model non-linear decision boundaries. Moreover, they are
relatively resistant to overfitting, especially in high-dimensional space.

4.3.2.3 Weaknesses
SVMs don't scale well to larger datasets, are memory-intensive, and are more challenging to tune,
since picking the right kernel is so important. In industry, random forests are now often preferred
over SVMs (Raj, 2022).

4.3.2.4 Analysing the SVM model with Python

Figure 3 - Support Vector Machine python analysis

4.3.3 Decision on the model to use


While both models have benefits and drawbacks, SVM has in several studies been shown to be a
more effective model for NLP than LR.

SVM's capacity to handle non-linearly separable data, which is typical in NLP applications, is
one of its key benefits over LR. SVM does this by utilising a kernel function to translate the input
data into a higher-dimensional space where it may be linearly separated. LR, on the other hand,
assumes that the input features and the output variable have a linear relationship, which may not
hold for challenging NLP problems.

The effectiveness of SVM and LR in NLP tasks has been examined in several research papers. SVM
surpassed LR in research (Joachims, 1998) that examined their abilities to execute text classification
tasks. (Yang, et al., 2016) conducted a study to evaluate the performance of SVM and LR on
sentiment analysis tasks, and they discovered that SVM outperformed LR in terms of accuracy and
F1 scores.

SVM can handle non-linear data, capture complicated connections between features, and manage
unbalanced datasets, making it a superior model for NLP applications compared to LR. Thus, this
project will employ the SVM model.

5 Design & Development
5.1 Overview
The issue of spear phishing email detection utilising various language analysis techniques is difficult
yet crucial for cybersecurity. In a spear phishing attack, a specific person or organisation is targeted
with tailored, persuasive emails in an attempt to coerce them into disclosing confidential
information or carrying out harmful actions.

This coding project will employ several language analysis techniques constructed to recognise spear
phishing emails. Sentiment analysis is one technique that may be used to spot any suspicious or
unfavourable sentiment in the email content. Another technique is named entity recognition, which
involves analysing the email text to find any names of people or businesses that are frequently used
in phishing scams.

The coding project can also scan the email text and spot any irregularities or grammatical mistakes
that are frequently present in phishing emails using natural language processing (NLP) techniques
like part-of-speech tagging and dependency parsing. Moreover, to categorise incoming emails as
phishing or legitimate, machine learning techniques like support vector machines (SVM) and
decision trees may be trained on a collection of existing spear phishing emails.

5.2 Imported Libraries


Name Version Source
DateTime 3.8.10 https://docs.python.org/3/library/datetime.html
Email.parser 3.2 https://docs.python.org/3/library/email.parser.html
Langdetect 1.0.9 https://pypi.org/project/langdetect/
Nltk 3.7 https://www.nltk.org/
Numpy 1.24.2 https://pypi.org/project/numpy/
OS 3.8.10 https://docs.python.org/3/library/os.html
Pandas 1.5.0 https://pypi.org/project/pandas/1.5.0/
Pickle 4.0 https://docs.python.org/3/library/pickle.html
pyspellchecker 0.7.0 https://pypi.org/project/pyspellchecker/
Python-menu 1.1.6 https://pypi.org/project/python-menu/
Re 2.2.1 https://docs.python.org/3/library/re.html
Readability 0.3.1 https://pypi.org/project/readability/
Sklearn 1.1.2 https://scikit-learn.org/stable/
Sys 3.8.10 https://docs.python.org/3/library/sys.html
Textblob 0.17.1 https://pypi.org/project/textblob/
Time 3.8.10 https://docs.python.org/3/library/time.html
Tkinter 8.6 https://docs.python.org/3/library/tkinter.html
Table 1 - Imported libraries information

5.3 Dataflow Diagrams
5.3.1 A simplified version of the program

Figure 4 - Simplified dataflow diagram of the program

5.4 Menu System


To display an easy-to-use menu system in the command line interface (CLI), I created a PyPI library
called python-menu (Table 1) to easily create, display and retrieve data from a menu interface. The
source code can be viewed in Appendix B. The design and outputs can be seen in Appendix G.

Figure 5 – Cropped example output of the python-menu library

Figure 5 shows an example of what the CLI menu will look like.

5.5 Email Parsing


To retrieve the data from emails, the email.parser library (see Table 1) was used.

def analyse_email(inputfile, to_email_list, from_email_list, email_body_list, sent_time_list):
    with open(inputfile, "r") as file:
        data = file.read()
    # Creating an email parser instance to analyse the email
    email = Parser().parsestr(data)
    if email['to']:
        # Fixing formatting issues
        email_to = email['to'].replace("\n", "").replace("\t", "").replace(" ", "").split(",")
        to_email_list.append(email_to)
    # Gets the time the email was sent in the GMT timezone
    sent_time_list.append((datetime.strptime(email['date'].split()[4], "%H:%M:%S")
                           - timedelta(hours=int(email['date'].split()[5]) / 100)).strftime("%H:%M:%S"))
    # Gets the from email and appends it to a list
    from_email_list.append(email['from'])
    # Gets the main text of the email and appends it to a list
    email_body_list.append(email.get_payload())

Figure 6 - Code snippet of email parsing

Figure 6 shows a code snippet of how the program retrieves the sender's and recipients' email
addresses, the time the email was sent, and the main body of the email.

5.6 Linguistic Analysis


5.6.1 Formality
By examining different linguistic aspects of the text, such as sentence length, word frequency, and
grammatical difficulty, the readability library in Python (see Table 1) may be used to estimate the
formality of the text.

Its getmeasures function takes a block of text and returns standard scores that represent the text's
reading difficulty. A text with a higher score would be considered more formal, whilst one with a
lower score would be considered more informal.

The Automated Readability Index (ARI) considers elements such as sentence length and word
complexity, which are frequently signs of a more formal writing style, and so can help gauge the
formality of a piece of text.

(Senter & Smith, 1967) give the multiple regression equation for predicting grade level from two
obtained ratios as:

GL = 0.5(w/s) + 4.71(s/w) − 21.43

Equation 1 - Grade Level (GL) equation

GL = assigned grade level, w/s = words per sentence (sentence length), s/w = strokes per word
(word length)

This is simplified to:

ARI = (w/s) + 9(s/w)

Equation 2 - Automated Readability Index (ARI) equation

ARI = Automated Readability Index, w/s = words per sentence, s/w = strokes per word
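
As an illustration, Equation 2 can be implemented directly in Python; this sketch uses a naive full-stop split for sentences and approximates strokes with character counts, which is a common simplification:

def automated_readability_index(text):
    # ARI = (words per sentence) + 9 * (strokes per word), per Equation 2
    sentences = [s for s in text.split('.') if s.strip()]
    words = text.split()
    strokes = sum(len(word) for word in words)
    return len(words) / len(sentences) + 9 * (strokes / len(words))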

Score Age Grade Level
1 5-6 Kindergarten
2 6-7 First/Second Grade
3 7-9 Third Grade
4 9-10 Fourth Grade
5 10-11 Fifth Grade
6 11-12 Sixth Grade
7 12-13 Seventh Grade
8 13-14 Eighth Grade
9 14-15 Ninth Grade
10 15-16 Tenth Grade
11 16-17 Eleventh Grade
12 17-18 Twelfth Grade
13 18-24 College Student
14 24+ Professor
Table 2- ARI grade levels

Table 2 from (readable.com, 2021) shows the ARI score compared to the US grade levels.
def analyse_formality(text, person, detected_language, train, to_email_array):
    if detected_language in ['en', 'de', 'nl']:
        results = readability.getmeasures(text, lang="en")
        for to_email in to_email_array:
            if train:
                if to_email in person.formality:
                    # Calculating the formality of the user
                    person.formality[to_email] = person.formality[to_email] + [results['readability grades']['ARI']]
                else:
                    person.formality[to_email] = [results['readability grades']['ARI']]
            elif not train and (results['readability grades']['ARI'] > person.single_formality_average_calc(to_email) + 2
                                or results['readability grades']['ARI'] < person.single_formality_average_calc(to_email) - 2):
                return ('Formality', 1)
            else:
                return ('Formality', 0)

Figure 7 - Code snippet for detecting formality

Using this formula in a function (Figure 7) to detect formality allows the program to determine
whether the detected ARI score is similar to the known average formality of the sender towards the
recipient. If the score is off by ±2 points, a flag is raised. The threshold of 2 was chosen because it
gave the highest accuracy when testing values between 1 and 10.

5.6.2 Spelling
For spell-checking and correction, the Python language's pyspellchecker library (see Table 1) offers
several classes and methods, using edit-distance techniques such as Levenshtein distance to find
likely corrections for misspelt words.
def analyse_spelling(text, person, detected_language, train):
    # Languages supported for spell checking
    supported_languages = ['en', 'es', 'fr', 'pt', 'de', 'ru', 'ar']
    if detected_language in supported_languages:
        # Cleaning text by removing unwanted spaces, grammar and capitalisation
        text = clean_text(text).split()
        if not train:
            for word in text:
                if word in person.corrections and person.corrections[word] > 10:
                    return ('Spelling', 1)
            return ('Spelling', 0)
        elif train:
            # Spell checking in the language of the given text
            spell = SpellChecker(language=detected_language)
            # Find those words that may be misspelt
            misspelled = spell.unknown(text)

            for word in misspelled:
                if len(word) > 3:
                    correct_word = spell.correction(word)
                    # Adds misspelt words to the person's data
                    person.spelling_mistakes[word] = 1 if word not in person.spelling_mistakes else person.spelling_mistakes[word] + 1
                    # Adds the correct spelling of words that the person usually gets wrong
                    person.corrections[correct_word] = 1 if correct_word not in person.corrections else person.corrections[correct_word] + 1

Figure 8 - Code snippet of spelling function

One potential idea is to store commonly misspelled words for each individual using the
spell.unknown() method in the PySpellChecker library (see Table 1). By using the spell.correction()
method, the array of misspelled words can be corrected. If the text of an email is then spelt
correctly, despite the sender typically misspelling a specific word, it could trigger a flag. This is
because the sender is known to frequently spell the word incorrectly, and therefore, correct spelling
in this instance could indicate that the email is being sent by an impersonator.

5.6.3 Punctuation
The frequency of punctuation someone uses when writing may be used to get insight into their
writing style and personality. For instance, those who frequently use exclamation points or question
marks may be more emotive or impulsive, whereas people who frequently use periods or commas
may be more methodical and thoughtful writers.

A good start is to gather a sample of someone's work and note the frequency of various punctuation
marks to gauge how frequently they use each mark. The outcomes can then be contrasted with a
standard or average for the genre of writing, such as formal, casual, or social media updates.
def analyse_punctuation(text, person, train):
    # Creating a dict for the percentages of punctuation used to be added to
    punctuation = {".": 0, ",": 0, "?": 0, ":": 0, "!": 0, "-": 0, "[": 0, "]": 0,
                   "(": 0, ")": 0, "{": 0, "}": 0, "'": 0, '"': 0, "...": 0}
    if train:
        for punc in person.punctuation:
            # When training, the self.punctuation dict has its values updated
            person.punctuation[punc] = person.punctuation[punc] + [round(sum([1 if i == punc else 0 for i in text]) / len(text.split(" ")) * 100, 2)]
    elif not train:
        for punc in punctuation:
            # When testing, the model appends each percentage of the text under test to the temporary "punctuation" dict
            punctuation[punc] = round(sum([1 if i == punc else 0 for i in text]) / len(text.split(" ")) * 100, 2)
        for punc1, punc2 in zip(punctuation.items(), person.punctuation.items()):
            # Accuracy decreases if more than "." is considered, as other punctuation marks are used too rarely to compare
            if punc1[1] < round(sum(punc2[1]) / len(punc2[1]), 2) - 2 or punc1[1] > round(sum(punc2[1]) / len(punc2[1]), 2) + 2:
                return ('Punctuation', 1)
        return ('Punctuation', 0)

Figure 9 - Code snippet of punctuation function

Figure 9 shows how punctuation is analysed. The ‘analyse_punctuation()’ function takes in three
arguments: ‘text’, ‘person’ and ‘train’.

The ‘person’ argument is an object that holds details of a person's punctuation usage, whereas the
‘text’ argument is a string of text that will be examined for its use of punctuation. The ‘train’
parameter, which is a boolean, specifies whether the function is being used to train or test the
model.

The method returns the tuple ('Punctuation', 1) if the punctuation usage in the text differs from the
specified person's typical usage by more than ±2 percentage points for any punctuation mark.
Otherwise, the method returns the tuple ('Punctuation', 0), indicating that the text's punctuation
complies with the person's typical use.

5.6.4 Language Fluency


def analyse_language(text, person, detected_language, train):
    if detected_language in person.languages:
        if train:
            # Adding 1 to the number of times a person has written in a certain language
            person.languages[detected_language] += 1
            return ('Language', 0)
        elif not train:
            # While testing, if the person has sent fewer than 5 emails in the detected language, flag possible impersonation
            if person.languages[detected_language] < 5:
                return ('Language', 1)
            return ('Language', 0)

Figure 10 - Code snippet of language detector

Figure 10 shows how to detect the language of a given text. If ‘train’ is True, the function will
increment the count of the number of times the person has written in the detected language in the
‘person.languages’ dictionary.

If ‘train’ is False, the function will check if the number of times the person has written in the
detected language is less than 5. If it is, then the function will return a tuple containing the string
'Language' and the integer 1, indicating that the text may be an impersonation attempt.

If ‘train’ is True and the language has been counted, the function will return a tuple containing the
string 'Language' and the integer 0, indicating that the language is not suspicious.

Note that the function doesn't perform any language analysis on the ‘text’ parameter; it simply uses
the detected language as a key to check the person's history of language usage.

5.6.5 Spam Detection/Identification of Senders
The SVM model will be used both for spam detection of incoming emails and for identifying whether
the text matches the patterns of a particular sender.

Both methods use the same features, and therefore the spam detector will be used as a
demonstration of the code. The full code for both SVM models is in Appendix C.

5.6.5.1 Architecture

Figure 11 - Architecture of the SVM model (ResearchGate, 2017)

According to the study (Hassan, 2020), Figure 11 provides a detailed overview of the architecture of
Support Vector Regression (SVR). Specifically, for the ith hidden node of the input vector x, the
output is represented by K(xi, x), which maps the input x to the support vector xi. This
mapping is achieved by selecting an appropriate kernel function.

Figure 12 - Optimal hyperplane for the SVM model (S, 2021)

In a high-dimensional space, two classes of data points are separated using the SVM model's
optimum hyperplane. The margin, or the distance between the hyperplane and the nearest data
points from each class, is what the SVM algorithm seeks to optimise. Both linearly separable and
non-linearly separable data may be used with the SVM model (Jain, 2020).

The SVM model addresses an optimisation problem by reducing the classification error and
maximising the margin to determine the ideal hyperplane. This involves finding the support vectors,
which are the data points closest to the hyperplane, and determining the weight vector that defines
the hyperplane (Jain, 2020).
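
In its standard hard-margin form (stated generically here, not as this project's exact formulation), this optimisation problem can be written as

minimise ½‖w‖²  subject to  yᵢ(w · xᵢ + b) ≥ 1 for all i

where w is the weight vector that defines the hyperplane, b is the bias, and (xᵢ, yᵢ) are the labelled training points with yᵢ ∈ {−1, +1}.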

5.6.5.2 Functions & Parameters


According to Awasthi (2020), the linear kernel is one of the most significant varieties of kernel and is
one-dimensional. It is typically the best choice when there are many features, and text classification
tasks frequently favour it, as the majority of these problems are linearly separable. Moreover, the
linear kernel is quicker than other kernel functions.

F(x, xj) = sum(x · xj)

Equation 3 - Linear kernel formula

Equation 3 illustrates the SVM's linear kernel formula, where ‘x’ and ‘xj’ represent the classification-
relevant data.

There are two crucial parameters in the SVM model. The first is C, which is inversely proportional to
the regularisation strength. C must be positive, and its default value is 1.0. According to scikit-learn,
the penalty used in the model is a squared L2 penalty.

The second, known as the kernel parameter, specifies the kind of kernel function to be applied. If no
value is provided, it defaults to the 'rbf' kernel. If a callable is provided, it is used to pre-compute the
kernel matrix from the data, and it must produce an array of shape (n_samples, n_samples). An
illustrative construction follows.
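
In scikit-learn, both parameters are set when constructing the classifier; the values below are illustrative rather than those used in this project:

from sklearn.svm import SVC

# C is inversely proportional to the regularisation strength
clf_linear = SVC(C=1.0, kernel='linear')
# Omitting the kernel parameter falls back to the default 'rbf' kernel
clf_default = SVC(C=1.0)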

5.6.5.3 Reading the dataset


# Reading in the dataset using the pandas module
dataset = pd.read_csv('../datasets/ham-spam/spam.csv', sep='�!',
                      names=['label', 'message'], engine='python')

Figure 13 - Code snippet of the reading dataset for spam detection

The code reads in a dataset from a CSV file located at '../datasets/ham-spam/spam.csv' using the
pandas module (see Table 1). The pd.read_csv() function is used to read the CSV file, where sep='�!'
specifies the delimiter used to separate values in the CSV file.

The names parameter specifies the column names to use for the data. In this case, the first column is
labelled 'label' and the second column is labelled 'message'.

The engine parameter is set to 'python', which tells pandas to use the Python parsing engine to read
the CSV file. This allows the ‘�’ character to be used, as the default C engine cannot parse it.

5.6.5.4 Data pre-processing


# Function to refine text for data processing
def refine(text):
    return ' '.join([WordNetLemmatizer().lemmatize(string)
                     for string in re.sub('[^a-zA-Z]', ' ', str(text)).lower().split()
                     if string not in set(stopwords.words('english'))])

Figure 14 - Code snippet of refining text for spam detection

Figure 14 takes in a text input and refines it by removing non-alphabetic characters, converting it to
lowercase, removing stopwords, and lemmatising the remaining words. This refined text can be used
for further processing, such as text classification or sentiment analysis.
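
For instance, applied to a toy sentence, the function behaves roughly as follows (the exact output depends on the installed NLTK corpora):

>>> refine("Hello!! I am going to the offices")
'hello going office'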

5.6.5.5 Dividing dataset


# Cleaning text for processing
dataset['message'] = [refine(dataset['message'][index]) for index in range(0, len(dataset))]
# Retrieving the 'message' and 'label' columns for the SVM model
X = dataset['message']
y = dataset['label']

Figure 15 - Code snippet of dividing the dataset for training/testing

These lines of code (Figure 15) clean the text data by removing stopwords and lemmatising the
words and preparing it for SVM training by assigning the refined text to X and the labels to y. The
resulting X and y variables can then be used for further processing, such as feature extraction and
model training.

5.6.5.6 Text processing


### Training SVM Model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

Figure 16 - Code snippet of splitting the data

This code block splits the cleaned text data and labels into training and testing sets, which can then
be used to train and evaluate an SVM model. The ‘train_test_split’ function is commonly used to
partition a dataset into training and testing sets to evaluate machine learning models. Using a split of
70% training and 30% testing returned the most accurate results. The random state allows for
reproducibility while testing different splits.

5.6.5.7 Pipelining
1. # Pipelining data
2. text_svm = Pipeline([('tfidf',TfidfVectorizer()),('svm',LinearSVC())])
3. text_svm.fit(X_train,y_train)

Figure 17 - Code snippet of pipelining data

This code block (Figure 17) builds a pipeline object that uses the TfidfVectorizer to transform text
data into numerical features and trains an SVM model on those features. By invoking
text_svm.predict() on fresh text data, this pipeline object can be used to predict the labels of new
text data. Pipelines are common practice in machine learning, since a pipeline combines several
processing stages into a single object and makes it easy to apply the same processing steps to new
data.
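
For instance, once fitted, the pipeline can classify a new, unseen message in a single call (the message below is hypothetical):

1. # Hypothetical prediction on unseen text; refine() applies the same preprocessing used in training
2. message = "You have won £1000! Click here to claim your prize"
3. print(text_svm.predict([refine(message)])[0])  # prints 'spam' or 'ham'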

5.7 Non-linguistic analysis


5.7.1 Timestamp
1. # Gets the time the email was sent in the GMT timezone

2. sent_time_list.append((datetime.strptime(email['date'].split()[4],
"%H:%M:%S") -
timedelta(hours=int(email['date'].split()[5])/100)).strftime("%H:%M:%S"))

Figure 18 - Code snippet to retrieve the timestamp of an email

The code appends to the ‘sent_time_list’ array the time the email was sent in GMT. This value is the
string produced by invoking the ‘strftime’ function on a ‘datetime’ object.

The code first uses the ‘strptime’ function of the ‘datetime’ module to extract the hour, minute, and
second components of the email's date field before calculating the GMT time. The ‘split’ method
divides the date field by whitespace, and the element at index 4, which should contain the time in
the format "HH:MM:SS", is selected.

The code then subtracts a ‘timedelta’ object from the ‘datetime’ object to adjust the time by the
timezone offset. The element at index 5 of the date field, which has the format "+HHMM", is
converted to an integer and divided by 100 to obtain the timezone offset in hours.

The code then uses the ‘strftime’ function to format the ‘datetime’ object, producing a string in the
format "HH:MM:SS" that denotes the time the email was sent in GMT. The finished string is then
appended to the sent time list.
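
As a worked example (the header value below is hypothetical but follows the typical format of the Enron dataset), the parsing proceeds as follows:

1. from datetime import datetime, timedelta
2.
3. date_header = "Mon, 14 May 2001 16:39:00 -0700 (PDT)"  # hypothetical Date header
4. parts = date_header.split()
5. time_part = parts[4]  # '16:39:00'
6. offset_hours = int(parts[5]) / 100  # '-0700' -> -7.0
7. gmt = (datetime.strptime(time_part, "%H:%M:%S") - timedelta(hours=offset_hours)).strftime("%H:%M:%S")
8. print(gmt)  # '23:39:00', i.e. 16:39 local time minus a -7 hour offset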

The full code can be viewed in Appendix D and is integrated into the main code in Appendix C.

5.7.2 Sender/Recipient
1. class Person:
2. def __init__(self, email):
3. self.firstname = None
4. self.surname = None
5. self.email = email
6. self.phone = None
7. self.formality = {}
8. self.formality_average_calc = []
9. self.formality_average = 0
10. self.corrections = {}
11. self.spelling_mistakes = {}
12. self.languages = {'af':0, 'ar':0, 'bg':0, 'bn':0, 'ca':0, 'cs':0, 'cy':0,
'da':0, 'de':0, 'el':0, 'en':0, 'es':0, 'et':0, 'fa':0, 'fi':0, 'fr':0, 'gu':0,
'he':0,
13. 'hi':0, 'hr':0, 'hu':0, 'id':0, 'it':0, 'ja':0, 'kn':0, 'ko':0, 'lt':0, 'lv':0,
'mk':0, 'ml':0, 'mr':0, 'ne':0, 'nl':0, 'no':0, 'pa':0, 'pl':0,
14. 'pt':0, 'ro':0, 'ru':0, 'sk':0, 'sl':0, 'so':0, 'sq':0, 'sv':0, 'sw':0, 'ta':0,
'te':0, 'th':0, 'tl':0, 'tr':0, 'uk':0, 'ur':0, 'vi':0, 'zh-cn':0, 'zh-tw':0}
15. self.punctuation = {".":[],",":[],"?":[],":":[],"!":[],"-
":[],"[":[],"]":[],"(":[],")":[],"{":[],"}":[],"'":[],'"':[],"...":[]}
16. self.punctuation_averages = {}
17. self.emails_trained = 0
18. self.times = {}
19. self.average_time = None

Figure 19 - Code snippet of the start of the class 'Person' (see appendix C for full code output)

The Person class is a template for creating objects that represent an individual with email
communication capabilities. The class contains properties to store the individual's first name,

surname, email, phone number, formality, corrections, spelling mistakes, languages, punctuation
usage, and other communication-related metrics.

The class also offers methods for updating and calculating the averages of some of these
characteristics, including the formality average, punctuation averages, and the average email sending
time. It also contains a __repr__ method that outputs the properties of the object in a structured
manner. The class enables the tracking and analysis of communication patterns and trends for
specific email users and is intended to be a component of a broader system that analyses email
conversations. The objects of this class are stored by pickling them using the pickle module (Table 1).

(See Appendix C for the full code output)
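
A minimal sketch of how the Person objects are persisted and reloaded with pickle, mirroring the save_object and load_objects helpers in Appendix C (the file path is taken from the project layout):

1. import pickle
2.
3. # Save the list of Person objects, overwriting any existing file
4. with open('../pkl/people.pkl', 'wb') as output:
5.     pickle.dump(people, output, pickle.HIGHEST_PROTOCOL)
6.
7. # Reload them in a later session
8. with open('../pkl/people.pkl', 'rb') as data:
9.     people = pickle.load(data)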

6 Test Results
6.1 Testing of Requirements
6.1.1 Filtering of incoming emails
Test description | Test value | Expected Result | Fail/Pass | Evidence
Testing an unseen spam message to see if it can detect them as spam | Congratulations, you have won a lottery of $5000. To Win Text on, 555500 | Spam |  | Appendix E
 | Hey mate how are you doing? | Ham |  | Appendix E
 | You have won £3000! To claim your prize go to http://clickme.com | Spam |  | Appendix E
 | Can I borrow some prime for tomrrow please | Ham |  | Appendix E
 | Hey tom this is jerry, I was hoping you could send me over the email containing the presentation. The link to it was http://presentation.com/435jtr34rt but it does not work. | Ham |  | Appendix E
 | Hey billy can i get my £100 back plz | Ham |  | Appendix E
 | What do you want to do? Win? Well Congrats you have just won £4500, click here to claim | Spam |  | Appendix E

6.1.2 Natural Language Processing


Test description | Test value | Expected results | Fail/Pass | Evidence
Applying an unseen dataset with real and fake emails from a user to see if it can tell the difference | 200 emails: 132 are from brawner-s while 68 are from someone else | Over 70% accuracy | 55.67% | Appendix F
Testing how many emails the code can correctly identify as being from the user | 3181 emails from Nemec are given to the model to see if it can tell whether they match their usual writing style | Over 70% | 63% | Appendix G
 | 2040 emails from Bass-e are given to the model to see if it can tell whether they match their usual writing style | Over 70% | 58% | Appendix G
 | 132 emails from Brawner are given to the model to see if it can tell whether they match their usual writing style | Over 70% | 97% | Appendix G
 | 3181 emails from Nemec are tested again but without the punctuation attribute | Over 70% | 77% | Appendix H
 | 2040 emails from Bass-e are tested again but without the punctuation attribute | Over 70% | 76% | Appendix H
6.1.3 Machine Learning
Test description | Test value | Expected results | Fail/Pass | Evidence
Using the SVM model to train on a dataset mixing emails from the person we want to train and a set of random emails not from that person | 904 emails using the label love-p will be added to the dataset of 2040 emails for bass-e | Over 90% | 98.90% | Appendix I
 | 904 emails using the label love-p will be added to the dataset of 3181 emails for nemec-g | Over 90% | 99.68% | Appendix J
 | 904 emails using the label love-p will be added to the dataset of 132 emails for brawner-s | Over 90% | 81.54% | Appendix K

7 Analysing Results
7.1 Spam Detection
1. from sklearn.svm import SVC
2. from sklearn.ensemble import RandomForestClassifier
3.
4. ### Analysing SVM
5. classify(SVC(C=3), cleaned_text, labels)
6.
7. Accuracy: 98.38516746411483
8. precision recall f1-score support
9.
10. ham 0.98 1.00 0.99 1448
11. spam 1.00 0.88 0.94 224
12.
13. accuracy 0.98 1672
14. macro avg 0.99 0.94 0.96 1672
15. weighted avg 0.98 0.98 0.98 1672

Figure 20 - Classification of the SVM model for spam detection

These results suggest that the model is performing well in detecting spam messages. The accuracy of
98.4% indicates that the model correctly classified the majority of messages in the dataset. This
means that out of the 1672 messages, 1645 were correctly classified as either ham or spam.

The precision score for the spam class is 1.0, which indicates that all the messages classified as spam
were spam. The recall score of 0.88 suggests that the model correctly identified 88% of the spam
messages in the dataset.

The F1-score, which is the harmonic mean of precision and recall, is 0.94 for the spam class. This
suggests that the model achieves a good balance between precision and recall when detecting
spam.
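
More precisely, substituting the spam-class values above into the harmonic mean:

F1 = 2 × (precision × recall) / (precision + recall) = 2 × (1.00 × 0.88) / (1.00 + 0.88) ≈ 0.94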

Overall, these results suggest that the model is performing well in detecting spam messages, and it
could be useful for filtering out unwanted messages in a real-world setting. However, it is important
to note that the performance of the model may vary depending on the specific characteristics of the
dataset and the context in which it is used.

7.2 SVM model for identifying senders of a message


1. Accuracy: 95.4074074074074
2. precision recall f1-score support
3.
4. eric.bass@enron.com 0.95 0.98 0.96 409
5. phillip.love@enron.com 0.96 0.92 0.94 266
6.
7. accuracy 0.95 675
8. macro avg 0.96 0.95 0.95 675
9. weighted avg 0.95 0.95 0.95 675

Figure 21 - Classification of the SVM model for detecting bass-e (Appendix M)

1. Accuracy: 92.45901639344262
2. precision recall f1-score support
3.
4. phillip.love@enron.com 0.92 1.00 0.96 266
5. sandra.brawner@enron.com 1.00 0.41 0.58 39
6.

7. accuracy 0.92 305
8. macro avg 0.96 0.71 0.77 305
9. weighted avg 0.93 0.92 0.91 305

Figure 22 - Classification of the SVM model for detecting brawner-s (Appendix M)

1. Accuracy: 97.83783783783784
2. precision recall f1-score support
3.
4. gerald.nemec@enron.com 0.98 0.99 0.98 474
5. phillip.love@enron.com 0.98 0.95 0.97 266
6.
7. accuracy 0.98 740
8. macro avg 0.98 0.97 0.98 740
9. weighted avg 0.98 0.98 0.98 740

Figure 23 - Classification of the SVM model for detecting nemec-g (Appendix M)

The results provided show the performance of the SVM models in classifying emails from specific
individuals. The models were tested in three scenarios: detecting emails from bass-e, brawner-s, and
nemec-g. The performance metrics reported include accuracy, precision, recall, and f1-score for each
case. In every scenario, the 904 love-p emails serve as the set of random emails.

The SVM model achieved an accuracy of 95.41% in identifying emails from eric.bass@enron.com and
phillip.love@enron.com. The model showed high precision (0.95 for eric.bass and 0.96 for
phillip.love) and recall (0.98 for eric.bass and 0.92 for phillip.love), resulting in f1-scores of 0.96 and
0.94, respectively. The overall performance of the model in this scenario is good, with high accuracy
and balanced precision and recall values.

In the brawner-s scenario, the model achieved an accuracy of 92.46% in classifying emails from
phillip.love@enron.com and sandra.brawner@enron.com. The precision values were very high (0.92
for phillip.love and 1.00 for sandra.brawner), but the recall values showed a significant difference
(1.00 for phillip.love and 0.41 for sandra.brawner). This led to f1-scores of 0.96 for phillip.love and
0.58 for sandra.brawner. Although the overall accuracy of the model is quite high, the low recall for
sandra.brawner indicates that the model may have difficulties in correctly identifying emails from
this person.

The SVM model achieved a relatively high accuracy of 97.84% in identifying emails from
gerald.nemec@enron.com and phillip.love@enron.com. The model demonstrated high precision
(0.98 for both gerald.nemec and phillip.love) and recall (0.99 for gerald.nemec and 0.95 for
phillip.love), resulting in f1-scores of 0.98 and 0.97, respectively. The model performed exceptionally
well in this scenario, with high accuracy and well-balanced precision and recall values.

7.3 Experimental linguistic/non-linguistic analysis results
Figure 24 - Graph to show the efficiency of experimental linguistic/non-linguistic analysis (bar chart of accuracy per attribute for Brawner-s, Bass-e, and Nemec-g; the underlying values are given in Table 3)

Attribute | Brawner-s | Bass-e | Nemec-g
All attributes | 97 | 57 | 62
Timestamps | 85.94 | 69.2 | 54.7
Language | 96.88 | 99.41 | 99.62
Formality | 30 | 33.97 | 38.23
Punctuation | 6.25 | 1.48 | 7.19
Spelling | 100 | 24.14 | 28.24
Spam detection | 89 | 80.92 | 90.71

Table 3 - Results of testing each experimental analysis method (accuracy, %)

Attribute | Brawner-s | Bass-e | Nemec-g
All attributes | 97 | 75.94 | 77.29

Table 4 - Results of removing the punctuation attribute from testing (accuracy, %)

Three datasets taken from the Enron dataset were used to test the efficiency of experimenting with
different linguistic and non-linguistic attributes, to see whether they can be used to detect if a text
was written by the person it claims to be from.

Based on the percentages above, the accuracy of the Python code for detecting the authenticity of
the emails varies for each individual. The highest accuracy using all attributes was found for
Brawner-s, with a score of 97%, while Bass-e and Nemec-g had lower accuracy scores of 57% and
62%, respectively.

It is important to note that the number of emails tested for each individual varied significantly:
Brawner-s was tested with only 132 emails, while Bass-e was tested with 2040 emails and Nemec-g
with 3181. The accuracy of the code may therefore be influenced by the size of the dataset.

Overall, these results suggest that the Python code for detecting email authenticity is not equally
accurate for all individuals and attributes. However, it appears to perform relatively well for spam
detection, while struggling with formality and punctuation accuracy.

Table 4 displays the results of running the Python code for detecting email authenticity, this time
without the punctuation attribute. The analysis revealed that removing punctuation improved the
accuracy of the code by 18.94 percentage points for Bass-e and 15.29 percentage points for Nemec-g.

It is worth noting that Brawner-s had a relatively small dataset of only 132 emails, which makes it
difficult to assess the accuracy for this individual reliably. This reinforces the idea that the size of the
dataset has an impact on the accuracy of the code.

The results suggest that the accuracy of the Python code for detecting email authenticity is
influenced by the individual being tested and the specific attributes used. However, the findings also
highlight the importance of considering the size of the dataset when evaluating the accuracy of such
code. Removing certain attributes, such as punctuation, may improve the accuracy of the code.

8 Conclusion
In summary, the SVM models show strong performance in detecting emails from specific individuals
in the given scenarios. The models generally exhibit high accuracy, precision, recall, and f1-scores,
indicating their effectiveness in classifying emails correctly. However, the model's performance is not
always balanced, as seen in the brawner-s detection scenario, where the recall value for
sandra.brawner was significantly lower than the other metrics. This was due to the brawner-s dataset
being very small, which indicates that a large dataset is needed for accuracy.

It is crucial to consider these variations in performance when using SVM models for email
classification. To further improve the models, additional feature engineering or parameter tuning
could be explored. Additionally, analysing a larger dataset or incorporating more email examples
from the individuals in question may help the model learn better representations of the email
patterns and improve its classification capabilities.

The SVM model for spam detection performed well in identifying spam messages, achieving an
accuracy of 98.4%. The recall score of 0.88 shows that the model detected 88% of the spam messages
in the sample, while the precision score of 1.0 shows that no ham messages were misclassified as
spam. The F1-score of 0.94 indicates a good balance between precision and recall for spam detection.
It is crucial to remember that the model's performance might change based on the specifics of the
dataset and the situation in which it is applied.

Overall, the SVM models show promising results in detecting emails from specific individuals, but
further improvements and testing may be needed to ensure consistent and reliable performance
across different cases.

The experimental investigation of several linguistic and non-linguistic attributes for determining
email authenticity also showed that the Python code's accuracy varied depending on the person and
the attribute. The code struggled with formality and punctuation accuracy but performed rather well
at detecting spam. The comparatively small Brawner-s dataset also shows that the size of the dataset
affects the code's accuracy.

SVM models work by locating the decision boundary between the classes in the data and making
predictions based on that boundary. Linguistic procedures for determining the authenticity of emails,
on the other hand, are often more difficult to apply because of the complexity of language and the
variety of writing styles used by people. Hence, when assessing the precision of such models, it is
crucial to carefully take into account the unique properties of the dataset and the environment in
which the code is utilised.

The application of machine learning methods to the detection of spear phishing attempts is one area
that may warrant more investigation. There is still room for improvement, even though certain
strategies, like those covered earlier in this work, have already been investigated. For instance,
researchers may look at how deep learning algorithms can be used to better scan email content and
spot subtle spear phishing indicators.

The incorporation of linguistic and social engineering methods into detection systems is an
additional topic that needs more study. Analysing these elements of email content might increase
detection precision, because spear phishing attempts frequently depend on persuasive language and
psychological manipulation. Indicators of a possible spear phishing attack might include social
engineering techniques such as pretending to be a reliable source or invoking urgency.

References
Althobaiti, A. (2019). Detecting Spear Phishing Emails using Natural Language Processing and
Machine Learning Techniques.

Ashrafi, M. Z. (2020). A Linguistic-based Approach for Detecting Spear Phishing Emails.

Benavides, E., Fuertes, W., Sanchez, S., & Sanchez, M. (2020). Classification of Phishing Attack
Solutions by Employing Deep Learning Techniques: A Systematic Literature Review.
Developments and Advances in Defense and Security.

Cohen, W. W. (2016, May 8). Enron Email Dataset. Retrieved from cs.cmu.edu:
https://www.cs.cmu.edu/~./enron/

Fayiz, M. A. (2021). A Framework for Detecting Spear Phishing Emails Using Natural Language
Processing and Machine Learning.

Hassan, Z. (2020). Comparison of Artificial Neural Network and Support Vector Machine for Long-
Term Runoff Simulation.

Jain, A. (2020, September 25). Support Vector Machines(S.V.M) — Hyperplane and Margins.
Retrieved from medium.com: https://medium.com/@apurvjain37/support-vector-
machines-s-v-m-hyperplane-and-margins-ee2f083381b4

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant
features. In Machine learning, 137-142.

KAGGLE. (2015, August 31). Hillary Clinton's Emails. Retrieved from KAGGLE:
https://www.kaggle.com/datasets/kaggle/hillary-clinton-emails

Karim, A., Hasan, M. A., Uddin, M. Z., & Islam, S. M. (2021). Spear Phishing Detection Using Machine
Learning and Linguistic Analysis. SN Computer Science, 1-8.

Krombholz, K., Hobel, H., Huber, M., & Weippl, E. (2013). Social Engineering Attacks on the
Knowledge Worker. Proceedings of the 6th International Conference on Security of
Information and Networks, 1-9.

Li, Y., Bontcheva, K., & Cunningham, H. (2009). Adapting SVM for Natural Language Learning: A Case
Study Involving Information Extraction, 1-2. Retrieved from https://gate.ac.uk/sale/nle-svm/svm-
ie.pdf

Maity, S., & Bandyopadhyay, S. (2021). Detection of Spear Phishing Emails Using Natural Language
Processing Techniques. International Journal of Computer Applications, 13-18.

Maklin, C. (2019, August 12). Support Vector Machine Python Example. Retrieved from
towardsdatascience: https://towardsdatascience.com/support-vector-machine-python-
example-
d67d9b63f1c8#:~:text=Support%20Vector%20Machine%20(SVM)%20is,straight%20line%20
between%20two%20classes.

Mallick, T., & Bandyopadhyay, S. (2022). Detecting Spear Phishing Emails using Deep Learning-based
Model. Journal of Intelligent & Fuzzy Systems, 1-12.

Pareek, P. (2021, September 2). Logistic Regression: Essential Things to Know. Retrieved from
datadriveninvestor: https://medium.datadriveninvestor.com/logistic-regression-essential-
things-to-know-a4fe0bb8d10a

Raj, A. (2022, March 30). Everything About Support Vector Classification — Above and Beyond.
Retrieved from towardsdatascience: https://towardsdatascience.com/everything-about-
svm-classification-above-and-beyond-
cc665bfd993e#:~:text=Disadvantages%20of%20SVM%20Classifier%3A&text=SVM%20algorit
hm%20is%20not%20suitable,samples%2C%20the%20SVM%20will%20underperform.

readable.com. (2021, July 07). The Automated Readability Index. Retrieved from readable.com:
https://readable.com/readability/automated-readability-index/

Rout, A. R. (2023, January 10). Advantages and Disadvantages of Logistic Regression. Retrieved from
geeksforgeeks: https://www.geeksforgeeks.org/advantages-and-disadvantages-of-logistic-
regression/

ResearchGate. (2017). Schematic diagram of SVM architecture. Retrieved from researchgate:
https://www.researchgate.net/figure/Schematic-diagram-of-SVM-architecture_fig1_317701295

S, P. (2021, June 16). The A-Z guide to Support Vector Machine. Retrieved from analyticsvidhya:
https://www.analyticsvidhya.com/blog/2021/06/support-vector-machine-better-
understanding/

Senter, R. J., & Smith, E. A. (1967). AUTOMATED READABILITY INDEX. Information Science, 8.

Soon, G. K., Chiang, L. C., On, C. K., Rusli, N. M., & Fun, T. S. (2020). Comparison of Ensemble
Simple Feedforward Neural Network and Deep Learning Neural Network on Phishing
Detection.

UCI MACHINE LEARNING. (n.d.). SMS Spam Collection Dataset. Retrieved from kaggle:
https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

UK Government. (2018). Data protection. Retrieved from gov.uk: https://www.gov.uk/data-
protection#:~:text=The%20Data%20Protection%20Act%202018%20is%20the%20UK's%20im
plementation%20of,used%20fairly%2C%20lawfully%20and%20transparently

Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016). Hierarchical attention networks for
document classification. In Proceedings of the 2016 conference of the North American
chapter of the association for computational linguistics: human language technologies,
1480-1489.

Lee, Y., Saxe, J., & Harang, R. (2020). CATBERT: Context-Aware Tiny BERT for Detecting Social
Engineering Emails.

Zhang, Y., Yuan, H., Zhang, L., & Chen, L. (2020). Hybrid approach of natural language processing and
machine learning for spear-phishing email detection. Journal of Ambient Intelligence and
Humanized Computing, 441-452.

Appendix
8.1 Appendix A – Approval Redacted

8.2 Appendix B – python-menu source code


1. # Adding in colourful text
2. class Colour:
3. Black = "\u001b[30m"
4. Red = "\u001b[31m"
5. Green = "\u001b[32m"
6. Yellow = "\u001b[33m"
7. Blue = "\u001b[34m"
8. Magenta = "\u001b[35m"
9. White = "\u001b[37m"
10. Cyan = "\u001b[36m"
11. Reset = "\u001b[0m"
12. bracketsymbol = Blue
13. normaltext = Cyan
14. plussymbol = Red
15. lines = White
16. text = Yellow
17.
18. class Menu():
19. # Takes in three optional attributes to change the title, options shown and
result type given (index or actual value)
20. def __init__(self, title="Menu",options=[],result='index'):
21. self.title = title
22. self.options = options
23. self.result = result
24.
25. def show(self):
26. # Displays the title of the menu
27. print(f"{Colour.bracketsymbol}[{Colour.plussymbol}+{Colour.bracketsymbol}
]{Colour.Yellow} {self.title}\t",end="")
28. if type(self.options) == type([]):

29. # Displays the options given
30. for index, option in enumerate(self.options):
31. print(f"{Colour.Yellow}[{index+1}] {option}\t",end="")
32. print(f"{Colour.Reset}\n")
33. elif type(self.options) == type(""):
34. print(f"{Colour.Yellow}[1] {self.options}{Colour.Reset}\n",end="")
35. # Stores the value of the user input
36. self.value = getinput(self.title, self.options, self.result)
37. # Set the default values if no attributes are given
38. def update(self, title=None, options=None, result=None):
39. if title != None:
40. self.title = title
41. if options != None:
42. self.options = options
43. if result != None:
44. self.result = result
45.
46.
47. class List():
48. def __init__(self, title="List",options=[],result='index'):
49. self.title = title
50. self.options = options
51. self.result = result
52.
53. def show(self):
54. print(f"{Colour.bracketsymbol}[{Colour.plussymbol}+{Colour.bracketsymbol}
]{Colour.Yellow} {self.title}")
55. if type(self.options) == type(""):
56. self.options = [self.options]
57.
58. for index, name in enumerate(self.options, 1):
59. display(f'{index} - {name}')
60.
61. self.value = getinput('Option',self.options,self.result)
62.
63. def update(self, title=None, options=None, result=None):
64. if title != None:
65. self.title = title
66. if options != None:
67. self.options = options
68. if result != None:
69. self.result = result
70.
71. # Function to get input with formatted and coloured characters
72. def getinput(title, options, result):
73. while True:
74. try:
75. if not options:
76. return "Error: No options found"
77.
78. choice = int(input(f"({Colour.text}{title}{Colour.Reset}) > "))
79.
80. if 0 < choice <= len(options):
81. if result == 'value':
82. return options[choice-1]
83. else:
84. return choice
85. else:
86. display("Invalid Option")
87. except Exception:
88. display("Invalid Option")
89.
90. # Displays text in a nice colourful format
91. def display(text):

92. print(f"{Colour.bracketsymbol}[{Colour.plussymbol}+{Colour.bracketsymbol}]
{Colour.Reset}{text}\n")

8.3 Appendix C – Main code (linguistic_analyiser.py)


1. #!/usr/bin/python3
2. import os, sys, pickle, time, re, readability, tkinter as tk, pandas as pd
3. from tkinter import filedialog
4. from nltk import *
5. from nltk.corpus import stopwords
6. from nltk.stem import WordNetLemmatizer
7. from sklearn.model_selection import train_test_split
8. from sklearn.feature_extraction.text import TfidfVectorizer
9. from sklearn.svm import LinearSVC
10. from pickle import load, dump
11. from sklearn.pipeline import Pipeline
12. from nltk.corpus import *
13. from langdetect import detect
14. from textblob import Word
15. from spellchecker import SpellChecker
16. from email.parser import Parser
17. from datetime import datetime, timedelta
18. from menu import *
19. import numpy as np
20.
21. class Person:
22. def __init__(self, email):
23. self.firstname = None
24. self.surname = None
25. self.email = email
26. self.phone = None
27. self.formality = {}
28. self.formality_average_calc = []
29. self.formality_average = 0
30. self.corrections = {}
31. self.spelling_mistakes = {}
32. self.languages = {'af':0, 'ar':0, 'bg':0, 'bn':0, 'ca':0, 'cs':0, 'cy':0,
'da':0, 'de':0, 'el':0, 'en':0, 'es':0, 'et':0, 'fa':0, 'fi':0, 'fr':0, 'gu':0,
'he':0,
33. 'hi':0, 'hr':0, 'hu':0, 'id':0, 'it':0, 'ja':0, 'kn':0, 'ko':0, 'lt':0, 'lv':0,
'mk':0, 'ml':0, 'mr':0, 'ne':0, 'nl':0, 'no':0, 'pa':0, 'pl':0,
34. 'pt':0, 'ro':0, 'ru':0, 'sk':0, 'sl':0, 'so':0, 'sq':0, 'sv':0, 'sw':0, 'ta':0,
'te':0, 'th':0, 'tl':0, 'tr':0, 'uk':0, 'ur':0, 'vi':0, 'zh-cn':0, 'zh-tw':0}
35. self.punctuation = {".":[],",":[],"?":[],":":[],"!":[],"-
":[],"[":[],"]":[],"(":[],")":[],"{":[],"}":[],"\'":[],'\"':[],"...":[]}
36. self.punctuation_averages = {}
37. self.emails_trained = 0
38. self.times = {}
39. self.average_time = None
40.
41. def update_formality_average(self):
42. self.formality_average_calc.clear()
43. [self.formality_average_calc.append(sum(self.formality[array])/len(self.f
ormality[array])) for array in self.formality]
44. self.formality_average =
sum(self.formality_average_calc)/len(self.formality_average_calc) if
len(self.formality_average_calc) else 0
45.
46. def update_top_spelling_mistakes(self):
47. self.top_selling_mistakes = {}

48. for i in sorted(self.spelling_mistakes, key=self.spelling_mistakes.get,
reverse=True)[:len(self.spelling_mistakes) if len(self.spelling_mistakes) < 20
else 20]:
49. self.top_selling_mistakes[i] = self.spelling_mistakes[i]
50.
51. def update_punctuation_averages(self):
52. for punc in self.punctuation.items():
53. self.punctuation_averages[punc[0]] =
round(sum(punc[1])/len(punc[1]),2)
54.
55. def update_average_time(self):
56. if self.times:
57. # Turns the multi-dimensional array into a 1D array so the average
time can be calculated
58. times = np.concatenate(list(self.times.values()))
59. # Finds the average time in a list of times in the format %H:%M:%S
60. self.average_time = str(timedelta(seconds=sum(map(lambda f:
int(f[0])*3600 + int(f[1])*60 + int(f[2]), map(lambda f: f.split(':'),
times)))/len(times)))[:8]
61.
62. def single_formality_average_calc(self, to_email):
63. return sum(self.formality[to_email])/len(self.formality[to_email])
64.
65. def single_time_average_calc(self, to_email):
66. # Restricting the times to only the emails sent to the 'to' address so it is
more accurate
67. times = self.times[to_email]
68. # Finds the average time in a list of times in the format %H:%M:%S
69. return str(timedelta(seconds=sum(map(lambda f: int(f[0])*3600 +
int(f[1])*60 + int(f[2]), map(lambda f: f.split(':'), times)))/len(times)))[:8]
70.
71. def update_all(self):
72. self.update_formality_average()
73. self.update_punctuation_averages()
74. self.update_top_spelling_mistakes()
75. self.update_average_time()
76.
77. def __repr__(self):
78. return f"{Colour.Blue}Email:
{Colour.Yellow}{self.email}\n{Colour.Blue}Emails trained:
{Colour.Yellow}{self.emails_trained}\n{Colour.Blue}Formaility Average:
{Colour.Yellow}{self.formality_average}\n{Colour.Blue}Most used language:
{Colour.Yellow}{sorted(self.languages.items(),reverse=True, key=lambda kv:
kv[1])[0][0]}\n{Colour.Blue}Most spelt wrong words:
{Colour.Yellow}{self.top_selling_mistakes}{Colour.Reset}\n{Colour.Blue}Punctuatio
n: {Colour.Yellow}{self.punctuation_averages}\n{Colour.Blue}Average time an email
is sent: {Colour.Yellow}{self.average_time}{Colour.Reset}"
79.
80. # This function will see how long it takes other a given function to run to check
for efficiency
81. def time_function(origin):
82. def wrapper(*args,**kwargs):
83. # Checking the time before the function has run
84. start_time = time.time()
85. # Running the function
86. result = origin(*args,**kwargs)
87. # Checking the difference in time between before the function is run and
after
88. function_runtime = time.time() - start_time
89. # Displaying how long it took to run the function in seconds
90. print(f'{origin.__name__} ran in: {function_runtime} seconds')
91. return result
92. return wrapper
93.

94. def load_objects(filename):
95. # Load people
96. with open(filename, 'rb') as data:
97. # Loads saved objects from pickle file
98. return pickle.load(data)
99.
100. def save_object(object, filename):
101. # Overwrites any existing file.
102. with open(filename, 'wb') as output:
103. # Saves all objects from pickle file
104. pickle.dump(object, output, pickle.HIGHEST_PROTOCOL)
105.
106. def clean_text(text):
107. # Convert the text to lowercase
108. text = text.lower()
109. # Removal of special characters
110. text = re.sub(r'[^0-9a-zA-Z]', ' ', text)
111. # Removal of unnecessary spaces
112. text = re.sub(r'\s+', ' ', text)
113. return text
114.
115. ### Preprocessing data for SVM model
116. # Function to refine text for data processing
117. def refine(text):
118. return ' '.join([WordNetLemmatizer().lemmatize(string) for string in
re.sub('[^a-zA-Z]',' ',str(text)).lower().split() if string not in
set(stopwords.words('english'))])
119.
120. def analyse_time(person, to_email_array, time, train):
121. for to_email in to_email_array:
122. if train:
123. if to_email in person.times:
124. # Recording times of when the user sends emails
125. person.times[to_email] = person.times[to_email]+[time]
126. else:
127. person.times[to_email] = [time]
128. elif not train:
129. if to_email in person.times:
130. average_time = person.single_time_average_calc(to_email)
131. else:
132. average_time = person.average_time
133. # Calculating the difference in time between the average time the
user sends an email to a particular person and the time the email to analyse was
sent
134. average_time = datetime.strptime(average_time, "%H:%M:%S")
135. test_time = datetime.strptime(time, "%H:%M:%S")
136.
137. if average_time > test_time:
138. diff = average_time-test_time
139. else:
140. diff = test_time - average_time
141.
142. # 3 hours in seconds is 10800
143. if diff.total_seconds() > 10800:
144. return ('Time',1)
145. else:
146. return ('Time',0)
147.
148. def analyse_punctuation(text, person, train):
149. # Creating dict for percentages of punctuation used to be added
150. punctuation = {".":0,",":0,"?":0,":":0,"!":0,"-
":0,"[":0,"]":0,"(":0,")":0,"{":0,"}":0,"'":0,'"':0,"...":0}
151. if train:
152. for punc in person.punctuation:

153. # When training the self.punctuation dict will have its
values updated
154. person.punctuation[punc] =
person.punctuation[punc]+[round(sum([1 if i == punc else 0 for i in
text])/len(text.split(" "))*100,2)]
155. elif not train:
156. for punc in punctuation:
157. # When testing the model will append each percentage of the
text to test to the "punctuation" temp array
158. punctuation[punc] = round(sum([1 if i == punc else 0 for i in
text])/len(text.split(" "))*100,2)
159. for punc1, punc2 in
zip(punctuation.items(),person.punctuation.items()):
160. # Accuracy decreases if it looks at more than "." as other
punctuation marks are barely used and so are hard to compare
161. if punc1[1] < round(sum(punc2[1])/len(punc2[1]),2)-5 or
punc1[1] > round(sum(punc2[1])/len(punc2[1]),2)+5:
162. return ('Punctuation',1)
163. return ('Punctuation',0)
164.
165. def analyse_language(text, person, detected_language, train):
166. if detected_language in person.languages:
167. if train:
168. # Adding 1 to the number of times a person has written in a
certain language
169. person.languages[detected_language] += 1
170. return ('Language',0)
171. elif not train:
172. # While testing if person has not sent more than 5 emails in
the detected language it will flag it as possible impersonation
173. if person.languages[detected_language] < 5:
174. return ('Language',1)
175. return ('Language',0)
176.
177. def analyse_spelling(text, person, detected_language, train):
178.
179. # Cleaning text by removing unwanted spaces, special characters and capitalisation
180. text = clean_text(text).split()
181. # Get the single most likely correction with spell.correction(word);
get a list of likely options with spell.candidates(word)
182. if not train:
183. for word in text:
184. if word in person.corrections and person.corrections[word] >
20:
185. return ('Spelling',1)
186. else:
187. return ('Spelling',0)
188. elif train:
189. # Spell checking in language of text given
190. spell = SpellChecker(language=detected_language)
191. # find those words that may be misspelled
192. misspelled = spell.unknown(text)
193.
194. for word in misspelled:
195. if len(word) > 3:
196. correct_word = spell.correction(word)
197. # Adds misspelled words to the person data
198. person.spelling_mistakes[word] = 1 if word not in
person.spelling_mistakes else person.spelling_mistakes[word]+1
199. # Adds the correct spelling of words that the person
usually gets wrong
200. person.corrections[correct_word] = 1 if correct_word not
in person.corrections else person.corrections[correct_word]+1
201.

202. def analyse_formality(text, person, detected_language, train,
to_email_array):
203.
204. results = readability.getmeasures(text, lang="en")
205.
206. for to_email in to_email_array:
207. if train:
208. #print(f"{text}: {results['readability grades']['ARI']}")
209. if to_email in person.formality:
210. # Calculating formality of user
211. person.formality[to_email] =
person.formality[to_email]+[results['readability grades']['ARI']]
212. else:
213. person.formality[to_email] = [results['readability
grades']['ARI']]
214. elif not train:
215. if to_email in person.formality:
216. formality_average =
person.single_formality_average_calc(to_email)
217. else:
218. formality_average = person.formality_average
219. if results['readability grades']['ARI'] > formality_average+1
or results['readability grades']['ARI'] < formality_average-1:
220. return ('Formality',1)
221. else:
222. return ('Formality',0)
223.
224.
225. def analyse_spam(text, train):
226. if not train:
227. # Importing trained SVM model
228. text_svm = load(open('../pkl/model.pkl', 'rb'))
229. result = text_svm.predict([refine(text)])[0]
230. if result == 'spam':
231. return ('Spam',1)
232. elif result == 'ham':
233. return ('Spam',0)
234.
235. def analyse_text(dataset, train):
236. registered = False
237. for text, from_email, to_email, time in
zip(dataset[0],dataset[1],dataset[2], dataset[3]):
238. if train:
239. # Checking if an email is known to the model; if not, it will
be added
240. for person in people:
241. # If the email is in the objects then the person is
already registered
242. if from_email == person.email:
243. registered = True
244. if not registered:
245. # If the person is new it will make a new object for them
246. people.append(Person(from_email))
247.
248. # Checking if a person is in the database
249. for person in people:
250. if person.email==from_email:
251. person = person
252. if train:
253. # When training this will show the user how many
emails has been trained for each person
254. person.emails_trained +=1
255. break
256. '''else:

257. # Returns an error message if the person cannot be found when
testing
258. return 'N/A', 'Person not found'''
259.
260. try:
261. if len(text) > 10:
262. # Detecting the language used in the given text
263. detected_language = detect(text)
264. else:
265. # Some errors occur as the text is short, so this bypasses that
266. detected_language = "en"
267. except:
268. # Some errors occur as the text is blank, so this bypasses that
269. detected_language = "en"
270.
271. # Calculates scores for test data
272. #scores = [analyse_spam(text, train),analyse_language(text,
person, detected_language, train),analyse_spelling(text, person,
detected_language, train),analyse_formality(text, person, detected_language,
train, to_email),analyse_time(person, to_email, time, train)]
273. #scores = [analyse_spam(text, train)]
274. #scores = [analyse_language(text, person, detected_language,
train)]
275. #scores = [analyse_spelling(text, person, detected_language,
train)]
276. #scores = [analyse_formality(text, person, detected_language,
train, to_email)]
277. #scores = [analyse_punctuation(text, person, train)]
278. scores = [analyse_time(person, to_email, time, train)]
279. # Removing nonetype from list which indicates that attribute
cannot be checked
280. scores = [x for x in scores if x is not None]
281. # Updating stats for __repr__ output
282. person.update_all()
283.
284. if not train:
285. # Returns the percentage match based on the results of the
analysis
286. #percentage = f'{round(100-(sum(score[1] for score in
scores)/len(scores))*100,2)}% match to {person.email}'
287. percentage = round(100-(sum(score[1] for score in
scores)/len(scores))*100,2)
288. # Returns the scores for every check that has been enabled
289. scores = " ".join(f'{score[0]}: {score[1]} ' for index, score
in enumerate(scores))
290. return percentage, scores
291.
292. def analyse_email(inputfile, to_email_list, from_email_list,
email_body_list, sent_time_list, message_id_list):
293. with open(inputfile, "r") as file:
294. data = file.read()
295. # Creating email parser instance to analyse emails
296. email = Parser().parsestr(data)
297.
298. if email['to']:
299. # Fixing formatting issues
300. email_to = email['to'].replace("\n", "").replace("\t",
"").replace(" ", "").split(",")
301. to_email_list.append(email_to)
302. # Gets the time the email was sent in the GMT timezone
303. sent_time_list.append((datetime.strptime(email['date'].split()[4],
"%H:%M:%S") -
timedelta(hours=int(email['date'].split()[5])/100)).strftime("%H:%M:%S"))
304. # Gets the from email and appends it to a list

305. from_email_list.append(email['from'])
306. # Gets the main text of the email and appends it to a list
307. email_body_list.append(email.get_payload())
308. message_id_list.append(email['message-id'])
309.
310. # Takes in an array of names of datasets e.g. ["dataset1","dataset2"] to
then retrieve the data
311. def get_dataset(names):
312. email_body_list, from_email_list, to_email_list, sent_time_list,
message_id_list = [], [], [], [], []
313. for name in names:
314. # Looping over all directories and files calling analyse_email
function each time to analyse as emails in the dataset given
315. for directory, subdirectory, filenames in
os.walk(f"../datasets/enron/{name}/sent"):
316. for filename in filenames:
317. analyse_email(os.path.join(directory, filename),
to_email_list, from_email_list, email_body_list, sent_time_list, message_id_list)
318.
319. return email_body_list, from_email_list, to_email_list,
sent_time_list, message_id_list
320.
321. # Loading objects for Person class
322. people = load_objects('../pkl/people.pkl')
323.
324. # Creating menu to ask what the user wants to do
325. main_menu = Menu(title='Main Menu',options=['Show people','Train dataset
for linguistic model','Train SVM model for spam detection','Train SVM for email
identification','Test experimental models','Test SVM for email
identification','Reset dataset of people','Remove Email','Exit'])
326. datasets_list = List(title="Datasets")
327. check_menu = Menu(title='Check',options=['Yes','No'])
328. models_list = List(title="Models")
329. def menu():
330. while True:
331. main_menu.show()
332. getinput = main_menu.value
333. if getinput == 1:
334. for person in people:
335. # Printing the person object details to the screen
336. person.update_all()
337. print(person)
338. # Spacing out each person shown so it is more legible
339. print('-'*100)
340. elif getinput == 2:
341. while True:
342. # Getting a list of all datasets
343. names = os.listdir("../datasets/enron/")
344. # Updating list with names from dataset folders
345. datasets_list.update(options = ["All"]+[name for name in
names]+["Exit"], result='index')
346. # Shows the list to the CLI
347. datasets_list.show()
348. # Gets the answer the user inputted
349. answer = datasets_list.value
350. if answer == 1:
351. # Training all datasets in directory
352. analyse_text(get_dataset(names),1)
353. display(f"All datasets have been trained")
354. elif answer == len(names)+2:
355. break
356. else:
357. # Training data set chosen from menu
358. analyse_text(get_dataset([names[answer-2]]),1)

359. display(f"Dataset {[names[answer-2]][0]} has been
trained")
360. elif getinput == 3:
361. ### Reading dataset
362. # Reading in the data set using the pandas module
363. dataset = pd.read_csv('../datasets/ham-spam/spam.csv',
sep='�!', names=['label','message'], engine='python')
364. # Cleaning text for processing
365. dataset['message'] = [refine(dataset['message'][index]) for
index in range(0, len(dataset))]
366. # Retrieving 'message' and 'label' labels for the SVM module
367. X = dataset['message']
368. y = dataset['label']
369. ### Training SVM Model
370. X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.30, random_state = 42)
371. # Pipelining data
372. text_svm =
Pipeline([('tfidf',TfidfVectorizer()),('svm',LinearSVC())])
373. text_svm.fit(X_train,y_train)
374. ### Saving Model for quick importing
375. dump(text_svm, open('../pkl/model.pkl', 'wb'))
376. display('SVM model successfully trained')
377.
378. elif getinput == 4:
379. while True:
380. # Getting a list of all datasets
381. names = os.listdir("../datasets/enron/")
382. # Updating list with names from dataset folders
383. datasets_list.update(options = [name for name in
names]+["Exit"],result='value')
384. # Shows the list to the CLI
385. datasets_list.show()
386. # Gets the answer the user inputted
387. answer = datasets_list.value
388.
389. if answer == 'Exit':
390. break
391. else:
392. dataset0 = get_dataset([answer])
393. dataset1 = get_dataset(['love-p'])
394. dataset =
[dataset0[0]+dataset1[0],dataset0[1]+dataset1[1]]
395. X = dataset[0]
396. y = dataset[1]
397. ### Training SVM Model
398. X_train, X_test, y_train, y_test = train_test_split(X,
y, test_size = 0.30, random_state = 42)
399. # Pipelining data
400. text_svm =
Pipeline([('tfidf',TfidfVectorizer()),('svm',LinearSVC())])
401. text_svm.fit(X_train,y_train)
402. ### Saving Model for quick importing
403. dump(text_svm,
open(f'../pkl/models/{answer}model.pkl', 'wb'))
404. '''
405. ### classification_report of models not in use as it
was only needed for the report
406. from sklearn.pipeline import Pipeline
407. from sklearn.model_selection import train_test_split,
cross_val_score
408. from sklearn.metrics import classification_report
409. from sklearn.feature_extraction.text import
CountVectorizer, TfidfVectorizer, TfidfTransformer

410. import numpy as np
411. ### Model Training
412. def classify(model, X, y):
413. # train test split
414. x_train, x_test, y_train, y_test =
train_test_split(X, y, test_size=0.30, random_state=42, shuffle=True, stratify=y)
415. # model training
416. pipeline_model = Pipeline([('vect',
CountVectorizer()),('tfidf',TfidfTransformer()),('clf', model)])
417. pipeline_model.fit(x_train, y_train)
418.
419. print('Accuracy:', pipeline_model.score(x_test,
y_test)*100)
420. y_pred = pipeline_model.predict(x_test)
421. print(classification_report(y_test, y_pred))
422.
423. from sklearn.linear_model import LogisticRegression
424. from sklearn.svm import SVC
425. from sklearn.ensemble import RandomForestClassifier
426. ### Analysing SVM
427. classify(SVC(C=3), X, y)'''
428. elif getinput == 5:
429. while True:
430. # Getting a list of all datasets
431. names = os.listdir("../datasets/enron/")
432. # Updating list with names from dataset folders
433. datasets_list.update(options = [name for name in
names]+["Exit"],result='value')
434. # Shows the list to the CLI
435. datasets_list.show()
436. # Gets the answer the user inputted
437. answer = datasets_list.value
438.
439. if answer == 'Exit':
440. break
441. else:
442. test = False
443. running_percentage = []
444. # Getting the results of a test message against the
trained model
445. dataset = get_dataset([answer])
446. total = []
447. margin = 50
448. for text, from_email, to_email, time, message_id in
zip(dataset[0],dataset[1],dataset[2], dataset[3],dataset[4]):
449. percentage, scores = analyse_text([[text],
[from_email], [to_email], [time]],0)
450. running_percentage.append(percentage)
451. if test:
452. with open(f"../datasets/enron/{answer}/file")
as file:
453. lines = file.readlines()
454. for line in lines:
455. if message_id in line:
456. if "real" in line and
percentage>margin or "fake" in line and percentage<margin:
457. total.append(1)
458. else:
459. total.append(0)
460. else:
461. if percentage>margin:
462. total.append(1)
463. elif percentage<margin:
464. total.append(0)

465.
466. # Displaying the percentage match of the test email to
the person who sent it
467. display(f'{(sum(total)/len(total))*100}% correctly
judged')
468.
469. elif getinput == 6:
470. while True:
471. # Getting a list of all datasets
472. names = os.listdir("../datasets/enron/")
473. # Updating list with names from dataset folders
474. datasets_list.update(options = [name for name in
names]+["Exit"],result='value')
475. # Shows the list to the CLI
476. datasets_list.show()
477. # Gets the answer the user inputted
478. answer = datasets_list.value
479.
480. if answer == 'Exit':
481. break
482. else:
483. # Getting a list of all datasets
484. names = os.listdir("../pkl/models/")
485. # Updating list with names from dataset folders
486. models_list.update(options = [name for name in
names]+["Exit"],result='value')
487. # Shows the list to the CLI
488. models_list.show()
489. # Gets the answer the user inputted
490. model = models_list.value
491.
492. # Getting the results of a test message against the
trained model
493. dataset = get_dataset([answer])
494. total = []
495. ### Testing model
496. text_svm = load(open(f'../pkl/models/{model}', 'rb'))
497. count = 0
498. for text, from_email in zip(dataset[0],dataset[1]):
499. count+=1
500. if text_svm.predict([refine(text)])[0] ==
from_email:
501. total.append(1)
502. else:
503. total.append(0)
504. print(f'Count is {count}')
505. count = 0
506. # Displaying the percentage match of the test email to
the person who sent it
507. display(f'{(sum(total)/len(total))*100}% correctly
judged')
508. elif getinput == 7:
509. display('Are you sure you want to reset the model data?')
510. while True:
511. # Menu to double-check the user wants to reset the model's
data, as it can take a while to retrain
512. check_menu.show()
513. check = check_menu.value
514. if check == 1:
515. # Clearing objects from people
516. people.clear()
517. display('Model Data Reset Successful!')
518. break
519. else:

520. break
521. elif getinput == 8:
522. names = [person.email for person in people]
523. emails_list = List(title="Emails", options=names,
result='value')
524. emails_list.show()
525. for person in people:
526. if person.email == emails_list.value:
527. display(f'Are you sure you want to remove the
{emails_list.value}')
528. check_menu.show()
529. if check_menu.value == 1:
530. people.remove(person)
531. display(f'{emails_list.value} Removed')
532. elif getinput == 9:
533. # Saving people objects so they can be reloaded later
534. save_object(people, '../pkl/people.pkl')
535. break
536.
537. if __name__ == '__main__':
538. menu()

8.4 Appendix D - nlp.py


1. #!/usr/bin/python3
2. import os, re, pandas as pd
3. from nltk.corpus import stopwords
4. from nltk.stem import WordNetLemmatizer
5. from sklearn.model_selection import train_test_split
6. from sklearn.feature_extraction.text import TfidfVectorizer
7. from sklearn.svm import LinearSVC
8. from sklearn.pipeline import Pipeline
9. from pickle import load, dump
10. from menu import *
11.
12. ### Preprocessing data for SVM model
13. # Function to refine text for data processing
14. def refine(text):
15. return ' '.join([WordNetLemmatizer().lemmatize(string) for string in
re.sub('[^a-zA-Z]',' ',str(text)).lower().split() if string not in
set(stopwords.words('english'))])
16.
17. # Function to predict if text is spam
18. def predict_spam(text):
19. return text_svm.predict([refine(text)])
20.
21. while True:
22. getinput = Menu(options=['Train dataset','Test model','Exit'])
23. getinput.show()
24. getinput = getinput.value
25.
26. if getinput == 1:
27. ### Reading dataset
28. # Reading in the data set using the pandas module
29. dataset = pd.read_csv('../datasets/ham-spam/spam.csv', sep='�!',
names=['label','message'], engine='python')
30. # Cleaning text for processing
31. dataset['message'] = [refine(dataset['message'][index]) for index in
range(0, len(dataset))]
32. # Retrieving 'message' and 'label' labels for the SVM module
33. X = dataset['message']
34. y = dataset['label']

35. ### Training SVM Model
36. X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =
0.30, random_state = 42)
37. # Pipelining data
38. text_svm = Pipeline([('tfidf',TfidfVectorizer()),('svm',LinearSVC())])
39. text_svm.fit(X_train,y_train)
40. ### Saving Model for quick importing
41. dump(text_svm, open('../pkl/model.pkl', 'wb'))
42.
43. elif getinput == 2:
44. ### Testing model
45. text_svm = load(open('../pkl/model.pkl', 'rb'))
46.
47. data = ['Congratulations, you have won a lottery of $5000. To Win Text
on, 555500 ','Hey mate how are you doing?','Can I borrow some prime for tomrrow
please',"You have won £3000! To claim your prize go to http://clickme.com","Hey
tom this is jerry, I was hoping you could send me over the email containing the
presentation. The link to it was http://presentation.com/435jtr34rt but it does
not work.","Hey billy can i get my £100 back plz","What do you want to do? Win?
Well Congrats you have just won £4500, click here to claim"]
48.
49. for text in data:
50. colours =
{'spam':f'{Colour.Red}spam{Colour.Reset}','ham':f'{Colour.Green}ham{Colour.Reset}
'}
51. print(f'{text}{Colour.Black}:{colours[predict_spam(text)[0]]}')
52.
53. #print(predict_spam(input('>'))[0])
54. elif getinput == 3:
55. break

8.5 Appendix E – Filtering of incoming emails test evidence

8.6 Appendix F – NLP brawner-s test evidence

8.7 Appendix G - Evidence of testing datasets against the model

8.8 Appendix H - Evidence of passing the test for testing datasets against the model

8.9 Appendix I

8.10 Appendix J

8.11 Appendix K

8.12 Appendix L – Design of CLI interface

8.13 Appendix M - Classification of SVM models
1. dataset0 = get_dataset([answer])
2. dataset1 = get_dataset(['love-p'])
3. dataset = [dataset0[0]+dataset1[0],dataset0[1]+dataset1[1]]
4. X = dataset[0]
5. y = dataset[1]
6. from sklearn.pipeline import Pipeline
7. from sklearn.model_selection import train_test_split, cross_val_score

8. from sklearn.metrics import classification_report
9. from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer,
TfidfTransformer
10. import numpy as np
11. ### Model Training
12. def classify(model, X, y):
13. # train test split
14. x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
random_state=42, shuffle=True, stratify=y)
15. # model training
16. pipeline_model = Pipeline([('vect',
CountVectorizer()),('tfidf',TfidfTransformer()),('clf', model)])
17. pipeline_model.fit(x_train, y_train)
18.
19. print('Accuracy:', pipeline_model.score(x_test, y_test)*100)
20. y_pred = pipeline_model.predict(x_test)
21. print(classification_report(y_test, y_pred))
22.
23. from sklearn.svm import SVC
24. from sklearn.ensemble import RandomForestClassifier
25. ### Analysing SVM
26. classify(SVC(C=3), X, y)

8.14 Appendix N – Researching existing solutions


“A Framework for Detecting Spear Phishing Emails Using Natural Language Processing and Machine
Learning" (Fayiz, 2021) presents a framework that combines NLP and machine learning techniques to
detect spear phishing emails. The framework extracts various features from the email, such as
sender information, email headers, and email content, and then uses these features to train a
classifier. The results show that the framework achieves an accuracy of 98.7% in detecting spear
phishing emails.

“A Linguistic-based Approach for Detecting Spear Phishing Emails” (Ashrafi, 2020) proposes a
linguistic-based approach for detecting spear phishing emails. The approach uses features such as
the presence of specific words and phrases, sentiment analysis, and linguistic analysis, and achieves
an accuracy of 94.67% in detecting spear phishing emails.

“Detecting Spear Phishing Emails using Natural Language Processing and Machine Learning
Techniques" (Althobaiti, 2019) proposes a system that uses NLP and machine learning techniques to
detect spear phishing emails. The system extracts various features from the email, such as the email
header, subject, and content, and then uses these features to train a classifier. The system achieves
an accuracy of 96.5% in detecting spear phishing emails.

Overall, these existing solutions demonstrate the effectiveness of using NLP and linguistic analysis to
detect spear phishing emails. They provide valuable insights into the features that can be used to
detect spear phishing emails and the machine learning techniques that can be applied to train
classifiers.

8.15 Appendix O - Laws and regulations


In the UK, spear phishing attacks that involve personal data fall under the General Data Protection
Regulation (GDPR). Strict regulations on data security and breach reporting apply to businesses that
handle personal data. Organisations that process personal data must have the correct organisational
and technological safeguards in place to protect it against unauthorised or unlawful processing as
well as accidental loss, destruction, or damage. Furthermore, the GDPR mandates that businesses
alert clients when their data has been misused, which may include spear phishing attacks. Failure to
comply with the GDPR can result in high penalties, reputational harm, and legal ramifications (UK
Government, 2018).
