
Extracting Exploits and Attack Vectors from OSINT News using NLP


Alexandra Dinisor, Computer Science, University Politehnica of Bucharest, Bucharest, Romania (maria.dinisor@stud.acs.upb.ro)
Cristina-Veronica Vladescu, Computer Science, University Politehnica of Bucharest, Bucharest, Romania (cristiana.vladescu@stud.acs.upb.ro)
Octavian Grigorescu, Software Development Department, CODA Intelligence SRL, Bucharest, Romania (octavian.grigorescu@codaintelligence.com)
Dragos Corlatescu, Computer Science, University Politehnica of Bucharest, Bucharest, Romania (dragos.corlatescu@upb.ro)
Cristian Sandescu, Software Development Department, CODA Intelligence SRL, Bucharest, Romania (cristian.sandescu@codaintelligence.com)
Mihai Dascalu, Computer Science, University Politehnica of Bucharest, Bucharest, Romania (mihai.dascalu@upb.ro)

Abstract—Cybersecurity has an immense impact on society as it enables the digital protection of individuals and enterprises against an increasing number of online threats. Moreover, the rate at which attackers discover and exploit critical vulnerabilities outperforms the vendors' capabilities to respond accordingly and provide security patches. As such, open-source intelligence (OSINT) data has become a valuable resource, from which details on zero-day vulnerabilities can be retrieved and timely actions can be taken before the patches become available. In this paper we propose a method to automatically label articles on vulnerabilities and cyberattacks from trusted sources. Using Named Entity Recognition, we extract essential information about new vulnerabilities, such as the exploit's public release and the environment in which the attack's exploitation is possible. Our balanced dataset contains 1095 samples, out of which 250 entries are from cybersecurity articles; the rest of the articles were crawled and annotated from the U.S. Government's Vulnerability Database, whereas automated text augmentation techniques were also considered. Our model built on top of spaCy obtained an overall performance of 75% recall on the Exploit Available task. When considering the Attack Vector metric, the model achieved the following recalls: Network 72%, Local 78%, and Physical 92%.

Keywords—Zero-days Attack, Exploit, Attack Vector, Entity Labeling, spaCy.

I. INTRODUCTION

In recent years, the accelerated digitalization process has resulted in an increasing number of vulnerabilities and cyber-attacks. A fitting example is the 67 percent increase in security breaches over a span of five years [1], whereas most attacks are people-based and lead to information leakage. Moreover, Talalaev [2] highlighted that 73 percent of cybercriminals consider traditional antivirus security to be irrelevant when distributing their Trojans or hijacking internet-connected devices.

In terms of preventing data breaches, early notifications on urgent patches, as well as information to mitigate the risk of exposure, should be a top priority. However, the average delay in applying patches by employees in a company is 102 days [3]. This alarming protection gap is a result of misinformation and a lack of concern about the abundant patches and updates released by software vendors. Nevertheless, the ever-changing threat landscape greatly impacts users trying to protect themselves against attackers.

Discovered by security researchers or bug bounty hunters, vulnerabilities are commonly published and discussed on the Internet. Therefore, important articles and in-depth reports are written and issued in niche magazines by cybersecurity analysts and reporters, thus providing support until vendors release the corresponding security patches.

Our research aims to aggregate data from trusted cybersecurity news platforms (i.e., a central OSINT source) and to automatically extract information regarding vulnerabilities exploited in the wild as zero-days. As such, our model extracts key characteristics based on sequence labeling, particularly the exploit's public release and the attack vector (i.e., the attack's operating environment).

The next section describes the current state of the field of cybersecurity, with a focus on the early detection of vulnerabilities, followed by Natural Language Processing techniques for data augmentation and sequence labeling. The third section presents the method, including the datasets used in our experiments as well as the applied algorithms. The results are analyzed in the fourth section, whereas the fifth presents conclusions and future experiments, together with envisioned improvements.

II. STATE OF THE ART

This section introduces state-of-the-art methods for the early detection of vulnerabilities, followed by relevant Natural Language Processing (NLP) techniques, namely data augmentation to enhance our dataset and named entity recognition for labeling text segments.

A. Early Detection of Vulnerabilities

Detection techniques for software vulnerabilities are widely discussed in the scientific literature. For example, Mittal, et al. [4] and Queiroz, et al. [5] correlated user-generated data from social media posts with the identification of new vulnerabilities. Social networks represent a valuable OSINT source, where details about software or hardware-related vulnerabilities are discussed even before being officially disclosed by the vendors.

Moholth, et al. [6] used Reactive Programming as a method to extract vulnerability-relevant tweets. Their experiment relied on filtering data with human involvement to ensure that the conclusions were correct. Moreover, a certain set of keywords was considered, as not all tweets



described a recent issue actively exploited in the wild. New critical flaws (i.e., zero-day vulnerabilities) were identified using a large number of retweets within a short time interval.

Another approach to identifying vulnerabilities that are likely to be exploited was presented by Tavabi, et al. [7], who considered a neural language model whose input texts were extracted from the dark web and deep web, as well as security blog posts. The examination of dark web posts separated the information relevant to the cybersecurity field from posts on illegal activities, such as drug trafficking and the resale of stolen merchandise. The experiment included a Skip-Gram model with Negative Sampling. A further classification task for exploit prediction was carried out with an SVM classifier with a Radial Basis Function kernel, which produced the best results for detecting high-severity flaws, with an F1 score of .80.

Collecting Twitter posts for vulnerability detection is a widely considered perspective. As there can be millions of tweets describing past and recent cyber-attacks, Trabelsi, et al. [8] proposed a machine learning component named SMASH (Social Media Analysis for Security on HANA) to be integrated in a security management platform. The two mentioned tasks focus on detecting zero-day exploits in tweet posts, as well as on separating new CVE assignations from CVE updates. Irrelevant terms were removed, and bag-of-words representations were considered when applying a clustering algorithm. In their experiment, Linux kernel vulnerabilities about zero-day exploits were assessed. The correlation between their findings and the delay in CVE Update detection argues that software developers and companies are in general one step behind vulnerability discovery, thus exposing their products to malicious actors.

B. Natural Language Specific Tasks

1) Data Augmentation

Data augmentation in NLP is a more sensitive task in comparison to the same strategy applied in other machine learning subfields, such as computer vision. In precise disciplines like biology or cybersecurity, a single word might affect the meaning of the entire phrase. As a result, careful testing of the text augmentation techniques [9], such as synonym replacement, random swap, random insertion, and random deletion, is recommended.

Using pre-trained embeddings when augmenting text is a widely employed approach to preserve the context of the input sentence. Contextual embeddings prove to be reliable sources for preventing language models from overfitting. Bidirectional Encoder Representations from Transformers (BERT) [10] provided state-of-the-art results for a variety of NLP tasks, while outperforming classic sequential models. BERT was trained on a large plain text corpus (about 3300 million words) from two popular sources: English Wikipedia and BookCorpus [11]. BERT as a contextual word embedding augmenter was applied on datasets from a variety of scientific fields [12]. Moreover, the cased model keeps the text's original accents and marks, allowing a Named Entity Recognition [13] model to benefit during training. Other BERT-based experiments, such as fine-tuning BERT into a conditional version for text augmentation [14], were conducted with promising outcomes.

2) Named Entity Recognition (NER)

NLP techniques are frequently used to identify sequences that reference a specific entity. Bidirectional long short-term memory with a conditional random field layer on top (BiLSTM-CRF) proved to be beneficial in a variety of real-life situations, including geoscience entity detection [15] or social media message recognition [16]. N-grams and Convolutional Neural Networks were also used in experiments from the biomedical field [17].

The most frequently used format for labeling data in Named Entity Recognition tasks is the IOB scheme, in which chunks of text are labelled as: beginning (B-tag), inside (I-tag), and the outside component (O-tag). In various experiments from the cybersecurity domain, Conditional Random Field classifiers [18] benefitted from this approach. Another successful labeling strategy consists of annotating data with customized relational labels based on their importance in a cyber-attack, which aids in sentence classification and malware-related token prediction [19].

Other approaches for detecting typical tokens in a text, such as the name of a person or a city, consider pre-trained models from spaCy (https://spacy.io/). SpaCy [20] is an open-source Python library that supports various NLP tasks. SpaCy is known for its processing speed in parsing large-scale data, and it offers pretrained language models for more than 15 languages. SpaCy implements a state-of-the-art architecture for NER based on Bloom embeddings [21] and residual CNNs [22].

'Tok2vec' is an independent layer from spaCy that can be shared between components. It is frequently applied in conjunction with another component, such as 'ner', and is set as the first layer to generate suitable dynamic vectors. The built-in spaCy components 'tagger' and 'parser' do not share their features with the 'ner' component. As a result, they are often disabled during training for custom NER models. The named entity recognition system includes several NLP Transformer models, out of which roBERTa [23], an optimized BERT implementation, was considered in our experiments as part of the spaCy version 3 NER pipeline.

III. METHOD

Our overarching aim is to detect software and hardware vulnerabilities from OSINT news. As such, our method extracts specific relevant references to zero-day attacks called EVEs (Early Vulnerability Exposures). First, we required a dataset on exploitable vulnerabilities and cyber-attacks. Starting from an initial corpus of articles introducing vulnerabilities [24], we introduce two new datasets created for the task at hand, namely the EVE and NVD-CVE corpora. Afterwards, text augmentation techniques were applied to extend our datasets, followed by machine learning models trained on the security-relevant data.

A. Initial Corpus of Articles Introducing Vulnerabilities

Cybersecurity reporters from trusted websites provide useful information for our task. However, articles need to be filtered out since part of them consider events outside our scope of vulnerability identification – for example, bug bounty programs, hacked government websites, or data
breaches for certain systems without mentioning the actual exploited vulnerability.

A dataset containing 1000 manually labeled cybersecurity articles was gathered by Iorga, et al. [24] with the aim of performing binary text classification of whether an article introduces a new vulnerability or not. The articles were extracted from four cybersecurity news platforms, namely: The Hacker News (https://thehackernews.com/), Ars Technica (https://arstechnica.com/), Security Affairs (https://securityaffairs.co/wordpress/), and Threatpost (https://threatpost.com/). The selected articles provided insights into the most recent critical vulnerabilities or easy-to-exploit injection flaws.

The scraping of the articles was performed using Newspaper [25]. Additional features were considered to ensure a rigorous text analysis and a precise labeling. The following annotations were also collected, if mentioned in the text: the CVE-ID [26] (i.e., the identifier corresponding to a Common Vulnerability and Exposure), its CVSS (i.e., Common Vulnerability Scoring System) score [27], affected product and version, patched version, or mentioned related products. The final version of this dataset consisted of 596 security-relevant articles and 404 security-irrelevant articles.

B. EVE Corpus

The time lag between crucial moments, such as vulnerability discovery, CVE assignation, patch release, and CVE publish date, is of utmost importance. Even though relevant articles were filtered from the security-irrelevant ones, a further division was necessary. As such, we introduce the EVE (Early Vulnerability Exposure) corpus, which focuses on zero-day exploits, i.e., cyberattacks with and without patched versions available from the affected vendors, with no published CVE. Out of the 596 relevant articles, 250 EVE articles were manually selected and annotated by one cybersecurity expert.

Two new annotations were considered for each article: Exploit Available and Attack Vector. The annotation of examples focused on labeling only the most relevant and shortest sequences; thus, the chunk's first token had to be in strong relation to the annotated metric.

First, the Exploit Available annotation consists of labeling text spans where proof-of-concepts (PoC – e.g., a video demonstration or GitHub exploit code) were mentioned. References to security blog posts containing details about the unpatched flaw or detailed explanations about working PoC exploits were categorized as relevant for this feature. In some cases, a comprehensive description of the exploit technique was provided by the security writer of the article. The articles annotated with Exploit Available represent 22.4% of the EVE articles (see Table I for training set samples).

Second, the annotation for Attack Vector consists of four different classes, together with the corresponding relevant text spans. This annotation relates to the exploitability metric from the CVSS v3 score. This vulnerability metric highlights the context in which the vulnerability was already exploited in the wild or potentially exploited. The four categories are:

• 'NETWORK': the article mentions that the attacker had full remote control over the system or device. The vulnerability was remotely exploitable also in cases of flaws linked to a browser component, such as plugins, browser extensions, or add-ons.

• 'ADJACENT': the attacker was connected to the same network.

• 'PHYSICAL': the article explicitly mentions that the attacker required physical access to the targeted machine. Part of the attacks included the use of peripheral devices, such as USB drives.

• 'LOCAL': the attacker accesses the targeted system locally by using scripts from the console or the keyboard.

TABLE I. ANNOTATED SAMPLES FOR EXPLOIT AVAILABLE TASK

Annotated sample (Exploit Available text span) | Exploit available
"… an anonymous hacker with an online alias "SandboxEscaper" today released proof-of-concept (PoC) exploit code for a new zero-day vulnerability…" | YES
"… a security researcher today publicly disclosed details and proof-of-concept exploits for two 'unpatched' zero-day vulnerabilities …" | YES
"Researchers illustrated and demonstrated four attack scenarios, as explained below …" | YES
"Researchers have no plans to release the proof-of-concept code for these attacks …." | NO

Table II introduces examples for each Attack Vector type, alongside the EVE corpus summary statistics for the Attack Vector. Due to insufficient samples, the security articles describing 'Adjacent' attacks were not taken into consideration in the subsequent machine learning approaches.

TABLE II. ANNOTATED SAMPLES FOR ATTACK VECTOR NER TASK

Attack Vector type | Annotated sample (Attack Vector text span) | Total samples
Network | "… one of which could allow remote hackers to take complete control …" | 202
Physical | "Although exploiting the issue requires physical access, Sintonen explained …" | 15
Local | "… could allow a local attacker to gain and run code with administrative system privileges on the targeted machines …" | 6
Adjacent | "… could have allowed an attacker, connected to the same network as the victim …" | 3

All NER experiments required the conversion of the original texts into inside–outside–beginning (IOB) tagged texts, as the entities had variable length. Entity tokens refer in this particular case to relevant sequences related to the chosen vulnerability features, namely Exploit Available and Attack Vector. The Attack Vector entities represent only 10% of the tokens, whereas the Exploit Available text spans cover only 3% of all the tokens from a news article, thus making the NER approaches very challenging.

As a result, further experiments considered a trimmed dataset containing only the sentences with labeled entities ('OnlyTagSent'), disregarding all other sentences from the news article.
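The IOB conversion and the 'OnlyTagSent' trimming can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: it assumes whitespace tokenization and a single annotated span per article, and the helper names (`to_iob`, `only_tag_sent`) are hypothetical.

```python
# Illustrative sketch of IOB tagging and 'OnlyTagSent' trimming.
# Assumptions (not from the paper): whitespace tokenization and at most
# one annotated span per article; helper names are hypothetical.

def to_iob(tokens, entity_tokens, label):
    """Tag the annotated span with B-/I-<label>; everything else is O."""
    tags = ["O"] * len(tokens)
    n = len(entity_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == entity_tokens:
            tags[i] = f"B-{label}"
            for j in range(i + 1, i + n):
                tags[j] = f"I-{label}"
            break
    return tags

def only_tag_sent(sentences, entity_tokens, label):
    """Keep only the sentences that contain a labeled entity."""
    kept = []
    for sent in sentences:
        tokens = sent.split()
        tags = to_iob(tokens, entity_tokens, label)
        if any(tag != "O" for tag in tags):
            kept.append((tokens, tags))
    return kept

sentences = [
    "The flaw was reported last week .",
    "A researcher released proof-of-concept exploit code today .",
]
entity = "released proof-of-concept exploit code".split()
kept = only_tag_sent(sentences, entity, "exploitAvailable")
```

Trimming in this way raises the share of entity tokens seen during training, which is the motivation behind the 'OnlyTagSent' variant.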
C. NVD-CVE Corpus

Given that the EVE dataset is neither large nor diverse enough, more annotations were collected from multiple sources related to cybersecurity attacks. As such, vulnerability data feeds from the National Vulnerability Database (https://nvd.nist.gov/vuln/data-feeds) of the U.S. government were used. It is worth mentioning that the exploitability metrics of each vulnerability discovered since 2002 are offered and published in a JSON format, with the advantage of already being correctly annotated. This additional dataset was highly imbalanced, with 800 Network, 200 Local, and 150 Physical examples, and was collected by the same expert as in the previous stage.

D. Text Augmentation

For the spaCy experiment described later on in detail, a text augmentation approach was chosen for the Local and Physical Attack Vector types, particularly for the training examples belonging to the NVD data feed, with the aim of having a balanced dataset. These Attack Vector types have a reduced number of examples both in the EVE corpus and in the NVD data feed. A first attempt considered NLPAug (https://github.com/makcedward/nlpaug) for synonym replacement. However, this approach damaged highly specific cyber-security terms (e.g., 'buffer overflow vulnerability' – 'cowcatcher overflow vulnerability', 'attacker with physical access' – 'assailant with forcible access') and was therefore abandoned.

A second experiment was conducted using TextAttack (https://github.com/QData/TextAttack) [28], more specifically an Embedding Augmenter in the Command-Line Interface to change a specific number of words per input. This approach encountered the same issue with the security vocabulary (e.g., 'Cross-site scripting (XSS) vulnerability' – 'Cross-site scripting (XSS) fragility').

The final and successful solution was a word-level augmentation with insertion. A pre-trained contextual embedding with BERT was used for a semantically suitable insertion in the original paragraphs (e.g., 'root login may allow upon a reboot' – 'unauthenticated root commit login may also allow upon a quick reboot'). The BERT-augmented texts derived from the NVD-CVE Local and Physical examples were added to the training set. The goal after sampling was to obtain a balanced dataset (noted as EVE+NVD-CVE+Augmented), which in the end contained 395 Network, 400 Local, and 300 Physical examples.

The evaluation metrics of Precision, Recall, and F1-score are presented for all experiments. We emphasize the importance of Recall, as our focus is to maximize the number of correctly detected sequences that represent a vulnerability. Our overarching goal is to build a framework that can be used in the decision-making process of a cybersecurity expert.

E. Sequence Labeling

Bidirectional word-level Long Short-Term Memory (BiLSTM) networks were built with Tensorflow and Keras instead of a standard LSTM because these networks consider both the preceding and the following context of a specific sequence. The input vocabulary required a tokenizer for the words and their corresponding tags (IOB scheme). The model architecture consisted of an embedding layer, a BiLSTM layer with a dropout rate of 0.2, and a time-distributed layer applied on Dense layers with ReLU activation corresponding to each IOB tag (see Fig. 1 for the architecture). No pre-trained embeddings were chosen due to the small-scale preliminary dataset.

Fig. 1. BiLSTM network architecture.

In terms of hyperparameters, the configuration used an Adam optimizer with a learning rate of 0.005; the model was trained for 20 epochs with a batch size of 32. The performance on the training data was measured with categorical cross-entropy as a loss function.

F. spaCy

Our second approach for the NER tasks considered a custom NER model built using spaCy v3. SpaCy's underlying model architecture is not detailed publicly, but it considers an embedding strategy with subword features and Bloom embeddings. Input data was transposed into the required spaCy v3 training format with IOB tags, while dropping sentence and document indexes. The training of the custom entity types for each EVE feature was performed by adding the specific new entity labels before setting up the pipeline and the entity recognizer. The model started from an English spaCy trained pipeline optimized for CPU, updated in spaCy version 3, and the custom entity labels were added to the entity recognizer. The initial spaCy NLP processing pipeline contained additional components, such as "tok2vec" or "tagger", that did not affect the NER component; as such, they were disabled during the training process. No static values were used in the spaCy training configuration. In terms of hyperparameters, the NER model was trained for 20 epochs on shuffled data to reduce the bias generated by the order of the training examples, with compounding batch sizes from 4 to 32, a spaCy optimizer, and a dropout rate of 0.2.

IV. RESULTS

A. Bi-LSTM

The first experiments with the BiLSTM network exhibited problems in identifying the text spans which describe the EVE tasks in cybersecurity articles: references to a published proof-of-concept exploit video and to an attack vector metric. For the machine learning approaches, the datasets were divided statically into two subsets: train and test data with a split percentage of 80% and 20%, which were later used as a benchmark for the spaCy approach.

The model's performance in the Exploit Available task was evaluated on the entire corpus. The classification report
from Table III indicates that the model clearly overfits the data because the O value, the outside-chunk entity, represents 99.6% of the total entities.

TABLE III. CLASSIFICATION REPORT ON THE WHOLE EXPLOIT CORPUS

Named entity | Precision | Recall | F1 Score
B-exploitAvailable | .00 | .00 | .00
I-exploitAvailable | .00 | .00 | .00
O | 1.00 | 1.00 | 1.00

The cybersecurity articles were excessively long, and the named entities were sparse. Due to the poor performance of the first trial, experiments continued with shortened versions of the EVE features. The Bi-LSTM model was further trained on the corresponding Exploit Available dataset type, the 'OnlyTagSent' dataset. Considering that this corpus contained 15.5% named entity tokens, the model showed a clear overfitting condition after 20 epochs (see the classification report in Table IV). Similar to the previous experiment, such high scores showed only a correct extraction for the outside chunk's entity, 'O'.

Next, the experiments for determining the Attack Vector features were conducted only on the 'OnlyTagSent' corpus, given the weak performance of the Bi-LSTM approach on identifying the Exploit Available labels; however, the Bi-LSTM model exhibited overfitting, as presented in the classification report from Table V.

TABLE IV. CLASSIFICATION REPORT FOR EXPLOIT AVAILABLE 'ONLYTAGSENT' WITH BI-LSTM

Named entity | Precision | Recall | F1 Score
B-exploitAvailable | .00 | .00 | .00
I-exploitAvailable | .00 | .00 | .00
O | .93 | 1.00 | .96

TABLE V. CLASSIFICATION REPORT FOR ATTACK VECTOR 'ONLYTAGSENT' WITH BI-LSTM

Named entity | Precision | Recall | F1 Score
B-Network | .01 | .02 | .01
I-Network | .10 | .01 | .03
B-Local | .00 | .00 | .00
I-Local | .00 | .00 | .00
B-Physical | .00 | .00 | .00
I-Physical | .00 | .00 | .00
O | .97 | .99 | .98

Considering the presented cases, this approach was abandoned in favour of the state-of-the-art Named Entity Recognition tool, spaCy. Further alterations and improvements of the cybersecurity datasets were specifically conducted for this machine learning approach.

B. spaCy

Experiments were conducted with variations of both datasets (i.e., trimmed versions or diverse augmentations); however, the most promising results with the spaCy version 3 model are represented by the trimmed version ('OnlyTagSent') for the Exploit Available feature and the augmented corpus for Attack Vector. The evaluation was conducted using the spaCy scorer.

1) Exploit Available

Table VI introduces the results obtained with our spaCy model when testing on new article sentences from the 'OnlyTagSent' EVE dataset. No NVD-CVE data was added since the exploit's release is not related to the CVSS score and thus could not be extracted from the National Vulnerability Database.

TABLE VI. SPACY EVALUATION ON EVE CORPUS

Named entity | Precision | Recall | F1 Score
B-exploitAvailable | .642 | .750 | .692
I-exploitAvailable | .625 | .409 | .495

2) Attack Vector

Tables VII and VIII introduce the results of our spaCy model on the EVE+NVD-CVE corpus (Table VII) and on the balanced corpus with augmentation (Table VIII). The poor performance on the first dataset is explained by the high imbalance between the classes.

TABLE VII. SPACY EVALUATION ON EVE+NVD-CVE CORPUS

Named entity | Precision | Recall | F1 Score
B-Network | .224 | .083 | .122
I-Network | .059 | .125 | .206
B-Local | .750 | .333 | .461
I-Local | .760 | .308 | .439
B-Physical | .571 | .461 | .510
I-Physical | .896 | .594 | .715

TABLE VIII. SPACY EVALUATION ON EVE+NVD-CVE+AUGMENTED CORPUS

Named entity | Precision | Recall | F1 Score
B-Network | .670 | .724 | .696
I-Network | .718 | .658 | .686
B-Local | .537 | .773 | .634
I-Local | .641 | .784 | .705
B-Physical | .918 | .730 | .813
I-Physical | .881 | .928 | .904

C. Sample Use Cases

1) Exploit Available

When tested on new articles from ZDNet (https://www.zdnet.com/) outside the EVE corpus, the spaCy model identified the Exploit Available entities quite well. Table IX highlights three samples, the last having no 'ExploitAvailable' named entity and being correctly identified as such.

TABLE IX. SPACY EVALUATION ON NEW ENTRIES

Test samples
"'Here's How the Attack Works? The attack involves exploitation of three vulnerabilities via iTunes and the App Store's iOS Notify function."
"'Nelson later released proof-of concept code for the first Steam zero-day, and also criticized Valve and HackerOne for their abysmall handling of his bug report."
"Security researchers and regular Steam users alike are mad because Valve refused to acknowledge the reported issue as a security flaw, and declined to patch it."
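The custom spaCy NER training described in the Method section can be approximated with the spaCy v3 API as follows. This is a hedged sketch rather than the paper's configuration: it starts from a blank English pipeline so the example is self-contained (the authors start from a pretrained CPU-optimized pipeline and disable the non-NER components), and the toy training sample and its character offsets are invented for illustration.

```python
# Sketch of a custom spaCy v3 NER setup with the paper's Attack Vector labels.
# A blank pipeline and a toy example are used so the sketch is self-contained.
import random

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")  # the only component in this sketch
for label in ("NETWORK", "LOCAL", "PHYSICAL"):
    ner.add_label(label)

# Hypothetical training sample: (text, {"entities": [(start, end, label)]}).
TRAIN = [
    ("The flaw lets remote hackers take complete control.",
     {"entities": [(14, 50, "NETWORK")]}),
]

optimizer = nlp.initialize(
    lambda: [Example.from_dict(nlp.make_doc(t), a) for t, a in TRAIN]
)
for epoch in range(20):                 # 20 epochs, as in the paper
    random.shuffle(TRAIN)               # shuffled data to reduce order bias
    for text, annotations in TRAIN:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, drop=0.2)  # dropout rate 0.2

doc = nlp("The flaw lets remote hackers take complete control.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
```

The paper additionally uses compounding batch sizes from 4 to 32; in spaCy that is typically achieved by batching the examples with `spacy.util.minibatch` and a `compounding(4.0, 32.0, 1.001)` size schedule before calling `nlp.update`.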
2) Attack Vector

Tests were performed on paragraphs newly gathered from recent security articles. The test samples annotated by spaCy are presented in Table X. The Network test samples indicate that additional tokens were erroneously tagged, as the recall for the beginning chunk is approximately 0.72. Both test samples for Physical denote adequate identifications, with a minor incorrect labeling with the Local category in the first example. The Local class has the best overall recall and correctly labels multiple sequences in the last example.

TABLE X. SAMPLES AND SPACY OUTPUT

Type | Test sample
Network | "Lastly, an extension named Rainbow Fart was ascertained to have a zip slip vulnerability, which allows an adversary to overwrite arbitrary files on a victim's machine and gain remote code execution."
Network | "The disclosure of the CODESYS flaws comes close on the heels of similar issues that were addressed in Siemens SIMATIC S7-1200 and S7-1500 PLCs that could be exploited by attackers to remotely gain access to protected areas of the memory and achieve unrestricted and undetected code execution."
Network | "The vulnerability exists because the software lacks proper authentication controls to information accessible from the web UI. An attacker could exploit this vulnerability by sending a malicious HTTP request to the web UI of an affected device."
Physical and Local | "The USB Mass Storage Class driver in Microsoft Windows Vista SP2, Windows Server 2008 SP2 and R2 SP1, Windows 7 SP1, Windows 8.1, Windows Server 2012 Gold and R2, Windows RT 8.1, and Windows 10 Gold and 1511 allows physically proximate attackers to execute arbitrary code by inserting a crafted USB device, aka "USB Mass Storage Elevation of Privilege Vulnerability."
Physical | "Setup Wizard in Android 5.1.x before LMY49H and 6.x before 2016-03-01 allows physically proximate attackers to bypass the Factory Reset Protection protection mechanism and delete data via unspecified vectors, aka internal bug 25955042."
Local | "Avamar Data Store (ADS) and Avamar Virtual Edition (AVE) in EMC Avamar Server before 7.3.0-233 allow local users to obtain root privileges by leveraging admin access and entering a sudo command."
Local | "An elevation of privilege vulnerability in Mediaserver in Android 4.x before 4.4.4, 5.0.x before 5.0.2, 5.1.x before 5.1.1, 6.x before 2016-11-01, and 7.0 before 2016-11-01 could enable a local malicious application to execute arbitrary code within the context of a privileged process. This issue is rated as High because it could be used to gain local access to elevated capabilities, which are not normally accessible to a third-party application."

V. CONCLUSIONS AND FUTURE WORK

This paper introduces a method to automatically label articles on vulnerabilities and cyberattacks from trusted sources, while focusing on the exploit's public release and the attack vector. Experiments with two different approaches were employed for two of the three Attack Vector classes (i.e., Physical and Local) to ensure a balanced dataset.

As a future development, cybersecurity information should be crawled and annotated from other sources, such as exploit kits sold on the Dark Web, blogs of cybersecurity experts, or the Google Hacking Database. Additional references for other exploitability metrics from the CVSS v3 score should be retrieved using NER models, for instance, the User Interaction (e.g., "requires user interaction", "victim needs to open the malicious iWork file") and Privileges Required (e.g., "attacker requires no privileges", "non-privileged user can initiate") metrics. In terms of follow-up processing, we envision creating a dashboard with a newsfeed in which articles can be filtered depending on user needs and on the owned infrastructure.

ACKNOWLEDGMENT

This work was supported by a grant of the Romanian National Authority for Scientific Research and Innovation, CNCS – UEFISCDI, project number 2PTE/2020, YGGDRASIL – "Automated System for Early Detection of Cyber Security Vulnerabilities", and by the internal UPB program Proof of Concept.

REFERENCES

[1] K. Bissel, R. M. Lasalle, and P. D. CIN, "Ninth Annual Cost of Cybercrime Study", 2019. [Online]. Available: https://www.accenture.com/us-en/insights/security/cost-cybercrime-study. [Accessed: August 19, 2021]
[2] A. Talalaev, "Website Hacking Statistics You Should Know in 2021", 2021. [Online]. Available: https://patchstack.com/website-hacking-statistics/. [Accessed: June 30, 2021]
[3] Barkly, "Study Reveals 64% of Organizations Experienced Successful Endpoint Attack in 2018", 2018. [Online]. Available: https://www.businesswire.com/news/home/20181016005758/en/Study-Reveals-64-of-Organizations-Experienced-Successful-Endpoint-Attack-in-2018. [Accessed: August 19, 2021]
[4] S. Mittal, P. K. Das, V. Mulwad, A. Joshi, and T. Finin, "Cybertwitter: Using twitter to generate alerts for cybersecurity threats and vulnerabilities," in 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), San Francisco, CA, USA, 2016, pp. 860–867.
[5] A. L. Queiroz, S. Mckeever, and B. Keegan, "Eavesdropping hackers: Detecting software vulnerability communication on social media using text mining," in The Fourth International Conference on Cyber-Technologies and Cyber-Systems, 2019, pp. 41–48.
[6] O. C. Moholth, R. Juric, and K. M. McClenaghan, "Detecting cyber security vulnerabilities through reactive programming," in Proceedings of the 52nd Hawaii International Conference on System Sciences, Grand Wailea, Maui, Hawaii, USA, 2019, pp. 1–10.
[7] N. Tavabi, P. Goyal, M. Almukaynizi, P. Shakarian, and K. Lerman, "Darkembed: Exploit prediction with neural language models," in Conference on Artificial Intelligence, New Orleans, Louisiana, USA, 2018, pp. 7489–7854.
[8] S. Trabelsi, H. Plate, A. Abida, M. M. B. Aoun, A. Zouaoui, C. Missaoui, S. Gharbi, and A. Ayari, "Monitoring software vulnerabilities through social networks analysis," in 2015 12th International Joint Conference on e-Business and Telecommunications (ICETE), Colmar, France, 2015, pp. 236–242.
considered. The bidirectional word-level LSTM model [9] J. Wei and K. Zou, "Eda: Easy data augmentation techniques for
boosting performance on text classification tasks," in Conference on
showed poor results for both tasks, whereas our custom Empirical Methods in Natural Language Processing and the 9th
model built on top of spaCy with Bloom embeddings [21] International Joint Conference Natural Language Processing, Hong
and residual CNNs [22] achieved promising results. Besides Kong, China, 2019, pp. 6381–6387.
the EVE corpus with annotated cybersecurity articles, the [10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-
NVD data feeds with the CVEs from 2017 to 2021 were training of deep bidirectional transformers for language
added into the final corpus. Text augmentation techniques understanding," in Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language
with contextualized word embeddings from BERT were Technologies, Minneapolis, MN, USA, 2019, pp. 4171–4186.
[11] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books," in Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 2015, pp. 19–27.
[12] V. Atliha and D. Šešok, "Text augmentation using BERT for image captioning," Applied Sciences, vol. 10, p. 5978, 2020.
[13] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," in The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, 2016, pp. 260–270.
[14] X. Wu, S. Lv, L. Zang, J. Han, and S. Hu, "Conditional BERT contextual augmentation," in International Conference on Computational Science, Faro, Portugal, 2019, pp. 84–95.
[15] Q. Qiu, Z. Xie, L. Wu, L. Tao, and W. Li, "BiLSTM-CRF for geological named entity recognition from the geoscience literature," Earth Science Informatics, vol. 12, pp. 565–579, 2019.
[16] B. Y. Lin, F. F. Xu, Z. Luo, and K. Zhu, "Multi-channel BiLSTM-CRF model for emerging named entity recognition in social media," in Proceedings of the 3rd Workshop on Noisy User-generated Text, Copenhagen, Denmark, 2017, pp. 160–165.
[17] Q. Zhu, X. Li, A. Conesa, and C. Pereira, "GRAM-CNN: a deep learning approach with local context for named entity recognition in biomedical text," Bioinformatics, vol. 34, pp. 1547–1554, 2018.
[18] A. Sirotina and N. Loukachevitch, "Named entity recognition in information security domain for Russian," in International Conference on Recent Advances in Natural Language Processing (RANLP 2019), Varna, Bulgaria, 2019, pp. 1114–1120.
[19] P. Phandi, A. Silva, and W. Lu, "SemEval-2018 task 8: Semantic extraction from CybersecUrity REports using natural language processing (SecureNLP)," in 12th International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, Louisiana, USA, 2018, pp. 697–706.
[20] M. Honnibal and I. Montani, "spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing," 2017.
[21] J. Serrà and A. Karatzoglou, "Getting deep recommenders fit: Bloom embeddings for sparse binary input/output networks," in Proceedings of the Eleventh ACM Conference on Recommender Systems, Como, Italy, 2017, pp. 279–287.
[22] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Transactions on Image Processing, vol. 26, pp. 3142–3155, 2017.
[23] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," CoRR, vol. abs/1907.11692, 2019.
[24] D. Iorga, D. Corlătescu, O. Grigorescu, C. Săndescu, M. Dascălu, and R. Rughiniş, "Early Detection of Vulnerabilities from News Websites using Machine Learning Models," in 2020 19th RoEduNet Conference: Networking in Education and Research (RoEduNet), Online, 2020, pp. 1–6.
[25] L. Ou-Yang, "Newspaper3k: Article scraping & curation", 2021. [Online]. Available: https://newspaper.readthedocs.io/en/latest/. [Accessed: August 19th 2021]
[26] The MITRE Corporation, "CVE", 2021. [Online]. Available: https://cve.mitre.org/. [Accessed: August 19th 2021]
[27] National Institute of Standards and Technology, "National Vulnerability Database", 2021. [Online]. Available: https://nvd.nist.gov/vuln-metrics/cvss#. [Accessed: August 19th 2021]
[28] J. X. Morris, E. Lifland, J. Y. Yoo, J. Grigsby, D. Jin, and Y. Qi, "TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP," in 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 2020, pp. 119–126.
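The attack-vector labels used in the example CVE descriptions (Network, Local, Physical) can be approximated for illustration by a naive keyword baseline. The patterns and function below are invented for this sketch; they are not the spaCy-based classifier evaluated in the paper:

```python
import re

# Illustrative keyword rules for the three attack-vector labels from the
# example CVE descriptions. A real system would learn these distinctions;
# the patterns here are hand-picked guesses for demonstration only.
RULES = [
    ("Physical", re.compile(r"physically proximate|crafted USB|physical access", re.I)),
    ("Local", re.compile(r"local (users?|malicious application)|elevation of privilege", re.I)),
    ("Network", re.compile(r"\bremote(ly)?\b|HTTP request|web UI", re.I)),
]

def label_attack_vector(description: str) -> str:
    """Return every matching attack-vector label, joined by '/', or 'Unknown'."""
    labels = [label for label, pattern in RULES if pattern.search(description)]
    return "/".join(labels) if labels else "Unknown"

# A description may match several vectors, mirroring the
# "Physical and Local" row among the examples.
print(label_attack_vector("sending a malicious HTTP request to the web UI"))  # Network
print(label_attack_vector("allow local users to obtain root privileges"))     # Local
```

Such a baseline makes the task concrete, but it generalizes poorly, which is the motivation for the learned models compared in the paper.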