AI_report_ver2
Hai Pham Van, Han Nguyen Nam, Huy Nguyen Quang, Kien Do Luong, Thanh Nguyen Van, Tung Nguyen Duy
ABSTRACT
Online newspapers are experiencing exponential growth in the digital age, which leads to the proliferation of fake news and information bias. Addressing these challenges requires robust solutions that leverage advances in Natural Language Processing (NLP). In this paper, we present a comprehensive pipeline for detecting fake news and performing Named Entity Recognition (NER) in online articles. Our methodology integrates several NLP techniques, including Bidirectional Encoder Representations from Transformers (BERT), the Transformer architecture, BiLSTM, and Conditional Random Fields (CRF). We use BERT, a pre-trained model, to understand article semantics and improve accuracy in downstream tasks. For fake news detection, we employ BERT in a classification architecture that uses both headline and content information to significantly improve prediction accuracy. Bidirectional Long Short-Term Memory (BiLSTM) is used for Named Entity Recognition because its bidirectional sequence modeling identifies potential entities efficiently. Additionally, we enhance NER predictions by incorporating a CRF model for structured prediction. We also use Scrapy as a data crawling and preprocessing tool to extract features from online news articles, enriching our dataset and improving model performance in prediction and labeling tasks. The architecture is tested on contemporary datasets, namely the PhoBERT Named Entity Recognition (PhoNER) datasets of COVID-19 and Kaggle datasets relating to fake news.
KEYWORDS
Fake news detection, sentiment analysis, BERT, LSTM, Transformer, NLP
insights into future directions for research and practical applications in the field of online news analysis.
A. General purpose and scope
The primary objective of this paper is to undertake a multifaceted analysis of news articles, employing statistical methodologies to discern their tonal attributes. Through this analysis, the paper aims to provide readers with recommendations for reputable news sources, thereby facilitating informed consumption of information. Additionally, the research endeavors to develop robust mechanisms for the detection of both genuine and counterfeit news items, while also delving into the identification of their respective sources. Furthermore, the paper seeks to devise sophisticated keyword sets utilizing diverse combination rules, thereby enabling precise and efficient retrieval of articles that adhere to predetermined criteria. This research is delimited to the development of a web-based system designed to execute the aforementioned tasks with efficacy and accuracy. Emphasis will be placed on the creation of a user-friendly interface that grants users access to a curated selection of news articles, vetted for authenticity and credibility. The system, while not excessively intricate, will be endowed with a comprehensive suite of functionalities essential for the fulfillment of its objectives. This encompasses, but is not limited to, statistical algorithms for tone analysis, machine learning models for fake news detection, and sophisticated search mechanisms based on predefined keyword sets. The resultant system is envisioned as a reliable and indispensable tool for individuals seeking to navigate the vast landscape of digital news media with discernment and confidence.
B. Data mining
The lack of training data for models is always an important issue to address in any NLP project. In our case, we implemented Scrapy, a data crawling and processing tool which can automate the extraction of features of online news articles from numerous online news forums and newspapers.
With Scrapy, the crawled data consists of articles collected directly from over 100 news websites daily and aggregated for statistical analysis and keyword extraction. The data sources include online news websites such as dantri.com.vn, vnexpress.vn, soha.vn, kenh14.vn and other news websites. The basic data fields that need to be collected are:

Field name           Meaning
domain               Domain name of the site
publishedTime        Date on which the article was posted
title                Title of the article
content              Contents of the news
authorDisplayName    Name of the author
type                 The type of the news
...                  ...

Table 1: Data fields to be collected

C. General Architecture
In terms of architecture, our team has built an integrated system consisting of 2 main modules:
a) Data collection and processing system: Data will be collected from news websites on the internet using Scrapy, then temporarily stored by Kafka. Logstash will retrieve data from Kafka, process it through several steps, and classify the data before adding it to Elasticsearch.
b) Website system: The website allows users to analyze and track issues of interest from the data in Elasticsearch. The backend utilizes Spring Boot to connect with both SQL and Elasticsearch to retrieve data. The frontend is built using ReactJS.
Our result is a complete web application that enables users to create keyword sets for analyzing and tracking data from news websites, as well as review articles collected by the system.

II. Related Works

In order to perform statistical analysis, as well as to detect fake news by topic and category, the first main task is to determine what entities the news articles are targeting. This task is essentially named entity recognition.

A. Named Entity Recognition Task (NER)
1. Overview of NER

In the current era of information explosion, extracting useful information from data (information extraction, IE) has become a prerequisite activity in all existing fields.
According to the paper "A survey on Named Entity Recognition — datasets, tools, and methodologies" [1], Named Entity Recognition or NER "is the process of identifying numerous segments of information referenced in a text and then classifying them into pre-established categories." Entities like “person”, “organization”, “region”, and many more might be considered categories. Named Entity Recognition belongs to a broad category of NLP problems known as sequence tagging.
NER is a subtask of IE, tasked with automatically extracting named entities (such as person names, locations, organizations, dates/times, quantities, etc.) by identifying entities in a given text (Recognition) and determining their types (Classification).
Through NER, we can build a knowledge base. It can organize and arrange information in a way that is useful for humans and also provide useful inference for subsequent NLP algorithms.
NER can be implemented with the help of tool sets such as:
a) SpaCy: [2] This Python package is designed for high-level natural language processing (NLP) tasks and is open source. It enables the creation of programs that can effectively process and comprehend large volumes of text, making it well-suited for deployment in production environments. With support for over 72 languages and 80 pre-trained pipelines across nearly 24 languages, it facilitates the development of systems for information extraction and natural language understanding. Additionally, it promotes multitask learning using advanced models like BERT, offering state-of-the-art processing speed. Furthermore, it allows integration of custom models developed in TensorFlow, PyTorch, and other libraries. For named entity recognition, it offers a dedicated pipeline component called the entity recognizer. The NER module [3] is built on a model that makes use of residual convolutional neural networks and Bloom embeddings. There are models for the English language that were previously trained on the OntoNotes Release 5.0 package, including tagged text taken from news articles,
blogs, and phone conversations.
b) NLTK, recognized as a prominent Python platform for developing applications that leverage human language data, has gained prominence in the realm of natural language processing (NLP). As cited by Bird et al. (2009) [4], NLTK serves as a comprehensive toolkit encompassing various text processing functionalities, including parsing, categorization, tagging, stemming, tokenization, and semantic reasoning. Moreover, it offers named entity recognition (NER) capabilities alongside a vibrant community forum, fostering active discussions and knowledge exchange. NLTK is commonly utilized in both research endeavors and educational settings, serving as a valuable resource for training students in NLP techniques. Notably, its NER module employs a Maximum Entropy Classifier trained on the Automatic Content Extraction corpus.
c) PyTorch, which was created by Facebook [5], is another open-source deep-learning package, designed primarily for use with Python. It serves as a premier platform for both industry and education purposes. PyTorch is mostly used for image recognition and language processing; by making machine learning more scalable, it allows AI models to be built quickly and efficiently.
Some of the other tools offered for industry-based projects are LingPipe, AllenNLP [6], IBM Watson [7], Intellexer [1], ParallelDots [8], and Dandelion API [9]. Some applications of NER include:
- Indexing documents and supporting search tools: integrating NER into document indexing and search tools enhances the efficiency, relevance, and sophistication of information retrieval systems by leveraging the semantic information encoded in text documents [10] [11].
- Customer support chatbots: by integrating NER, chatbots could potentially ask more relevant and contextually appropriate questions during conversations with users. This could lead to more engaging and satisfying interactions [12], and it could enable chatbots to better understand and respond to user queries involving named entities [13].
- Opinion mining: Named Entity Recognition (NER) can aid in opinion mining by identifying and extracting entities mentioned in opinions and linking them to find meaningful relationships, which can provide valuable context and insights for sentiment analysis [14].

2. Basic steps of NER

Fig. 2. Basic steps of NER

Sentence segmentation [15] [16]: is the process of dividing a string of written language into its component sentences. For example, sentences need to be separated whenever punctuation or period marks are found. The purpose of sentence segmentation is to delineate sentence boundaries.
Tokenization [17] [18]: is the process of dividing a string of written language into its constituent words.
Part-of-speech tagging [15] [19]: is the process of assigning a word in a text to its corresponding part-of-speech tag based on the context and its definition. It describes the characteristic structure of the vocabulary terms in a sentence or text.
Entity recognition [20]: is the process of identifying key elements from the text, and then classifying them into predefined categories. This is the crucial step of NER, as it fulfills the purpose of NER.

B. Overview of BiLSTM Architecture
1. LSTM

LSTM, a famous variant of RNN, was proposed by Hochreiter and Schmidhuber in 1997 [21] as a solution to the long-term dependency (vanishing gradient) problem of standard RNNs. The LSTM model consists of three gate networks, "input gate", "forget gate", and "output gate", which let it perform better than a plain RNN.

Fig. 3. Basic architecture of LSTM

LSTM operates in three steps:
Step 1: The first step in LSTM is to decide which information to discard from the cell at that specific time. This is determined with the help of a sigmoid function. It considers the previous state $h_{t-1}$ and the current input $x_t$ and computes the function.
Step 2: There are two functions in the second step. The first is the sigmoid function, and the second is the tanh function. The sigmoid function decides which values to let through (a value between 0 and 1). The tanh function assigns weights to the passed values, determining their importance from -1 to 1.
Step 3: The third step is to decide what the final output will be. First, a sigmoid layer is run to determine which part of the cell state goes to the output. Then, the cell state is passed through the tanh function to push the values into the range from -1 to 1, and the result is multiplied by the output of the sigmoid gate.
On the other hand, accurately identifying proper nouns in a piece of text depends not only on the information preceding the word being considered but also on the information following it. However, a traditional LSTM architecture with a single layer can only predict the label of the current word based on the information obtained from the preceding words.
2. BiLSTM
BiLSTM [22] was created to address the aforementioned weakness. A BiLSTM architecture typically contains 2 individual LSTM networks used simultaneously and independently to model the input sequence in 2 directions:
from left to right (forward LSTM) and from right to left (backward LSTM).

Fig. 4. BiLSTM = forward LSTM + backward LSTM

However, the BiLSTM architecture still does not fully exploit the contextual semantics of the text.

C. Overview of CRF Model

CRF [23] is a probabilistic model for structured prediction tasks and has been applied successfully in many fields such as computer vision, natural language processing, bioinformatics, etc. In the CRF model, the nodes containing input data and the nodes containing output data are directly connected to each other, in contrast with the architecture of LSTM or BiLSTM, where inputs and outputs are indirectly connected through memory cells. CRF can be used for named entity labeling with manually extracted features of a word as input, such as: capitalization of the first letter, all letters capitalized, preceding word capitalized, the word under consideration, the preceding word, ...
In the CRF algorithm, we aim to maximize the conditional probability $P(y \mid x)$, the probability of the output vector y given the input provided by the random variable x, and we take the sequence with the highest probability.

$$\hat{y} = \arg\max_{y} P(y \mid x) \tag{1}$$

In CRF, the input data is sequential, and we rely on the previous context to make predictions about a data point. We use feature functions that have multiple input values:
- The set of input vectors X.
- The position i of the data point we are predicting.
- The label of the data point i - 1 in X.
- The label of the data point i in X.
The feature function $f_j$ is defined as follows: $f_j(X, i, y_{i-1}, y_i)$.
Each feature function is based on the labels of the previous and current word and takes on the value 0 or 1. We assign a weight (lambda value) to each function, which the algorithm will learn:

$$P(y \mid X, \lambda) = \frac{1}{Z(X)} \exp\left\{ \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(X, i, y_{i-1}, y_i) \right\} \tag{2}$$

where:

$$Z(X) = \sum_{y' \in Y} \exp\left\{ \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(X, i, y'_{i-1}, y'_i) \right\} \tag{3}$$

is the normalization function.
To estimate the parameters lambda, we use Maximum Likelihood Estimation (MLE), a statistical method for estimating the parameter values of a probability model, here the CRF. We apply it to the negative log-likelihood of the distribution above over the m training sequences.

$$L(X, \lambda, y) = -\log\left\{ \prod_{k=1}^{m} P(y^k \mid X^k, \lambda) \right\} \tag{4}$$

$$= -\sum_{k=1}^{m} \log\left[ \frac{1}{Z(X^k)} \exp\left\{ \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(X^k, i, y^k_{i-1}, y^k_i) \right\} \right] \tag{5}$$

We take the partial derivative with respect to lambda to find the minimum of this function, because minimizing the negative log-likelihood is equivalent to maximizing the likelihood. Scaling by 1/m, which does not change the location of the minimum, gives:

$$\frac{1}{m}\frac{\partial L(X, y, \lambda)}{\partial \lambda_j} = -\frac{1}{m} \sum_{k=1}^{m} F_j(y^k, X^k) + \frac{1}{m} \sum_{k=1}^{m} \sum_{y' \in Y} P(y' \mid X^k, \lambda) F_j(y', X^k) \tag{6}$$

where:

$$F_j(y, X) = \sum_{i=1}^{n} f_j(X, i, y_{i-1}, y_i) \tag{7}$$

The first term of (6) is the observed mean feature value, and

$$\frac{1}{m} \sum_{k=1}^{m} \sum_{y' \in Y} P(y' \mid X^k, \lambda) F_j(y', X^k) \tag{8}$$

is the expected feature value according to the model.
We use the partial derivatives as a step in the gradient descent method. Gradient descent is an iterative optimization algorithm that updates the parameter values until lambda converges. The final gradient descent update equation for the CRF model, with learning rate $\alpha$, is:

$$\lambda_j \leftarrow \lambda_j + \alpha \left[ \frac{1}{m} \sum_{k=1}^{m} F_j(y^k, X^k) - \frac{1}{m} \sum_{k=1}^{m} \sum_{y' \in Y} P(y' \mid X^k, \lambda) F_j(y', X^k) \right] \tag{9}$$

The CRF model used for entity labeling addresses the drawback of label bias due to the independence of labels from each other in the Hidden Markov Model. CRF first identifies the necessary feature functions, initializes the lambda weights to random values, and then applies the gradient descent method iteratively until the parameter values converge.
However, a problem with linear-chain CRFs is that they only capture the dependencies between labels in a forward direction.

D. Overview of BERT Mechanism
1. Encoder and Decoder Process

Computers cannot learn directly from raw data such as images, text files, audio files, or video clips. Therefore, they require a process of encoding information into numerical form and decoding the numerical form into output results. These are the two processes called encoder and decoder:
Encoder: It is the phase of transforming input into learning features capable of learning tasks. For Neural Network models, the Encoder consists of hidden layers. For CNN models, the Encoder comprises a sequence of Convolutional + Maxpooling layers. For RNN models, the
Encoder process mainly involves Embedding layers and Recurrent Neural Network layers.
Decoder: The output of the encoder is the input to the decoder. This phase aims to determine the probability distribution from the learning features at the encoder, thereby identifying the output labels. The result can be a single label for classification models or a sequence of labels in chronological order for a seq2seq model.

Fig. 5. seq2seq model with encoding and decoding, for machine translation task

2. Transformer architecture
The Transformer architecture was first introduced in the paper titled "Attention is All You Need" by Vaswani et al. [24]. This paper was presented at the Conference on Neural Information Processing Systems (NeurIPS) in 2017.
This architecture consists of two parts, an encoder on the left and a decoder on the right.
Encoder: is a stacked combination of 6 defined layers. Each layer consists of two sub-layers. The first sub-layer is multi-head self-attention. The second sub-layer is simply a fully connected feed-forward layer. Note that a residual connection is used around each sub-layer, followed by a normalization layer. This architecture has a similar idea to the ResNet network in CNN. The output of each sub-layer is LayerNorm(x + SubLayer(x)) with a dimension of 512.
Decoder: The decoder is also a stacked combination of 6 layers. The architecture is similar to the sub-layers in the Encoder except for the addition of 1 sub-layer representing the attention distribution at the first position. This layer is no different from the multi-head self-attention layer except that it is adjusted to avoid including future words in the attention. At the i-th step of the decoder, we only know the words at positions smaller than i, so the adjustment ensures that attention is only applied to words at positions smaller than i. The residual mechanism is also applied similarly to the Encoder.
Note that there is always an additional step of adding Positional Encoding to the inputs of the encoder and decoder to incorporate the time factor into the model, thereby increasing accuracy. This involves adding the position encoding vector of the word in the sentence to the word representation vector. We can encode it as a [0, 1] position vector or use sine and cosine functions [24].
3. Attention mechanism

3.1 Scaled dot-product attention
This is a self-attention mechanism [24] where each word can adjust its weight for the other words in the sentence, so that the weight is larger for words that are more relevant to it and smaller for less relevant words.
After the word embedding step (passing through the embedding layer), we have the input of the encoder and decoder as a matrix X of size m x n, where m and n are the length of the sentence and the dimension of a word embedding vector, respectively.
The three matrices $W_q$, $W_k$, $W_v$ are the parameters that the model needs to train. After multiplying these matrices with the input matrix X, we obtain the matrices Q, K, V (corresponding to the Query, Key, and Value matrices in the figure). The Query and Key matrices are used to compute the score distribution for word pairs. The Value matrix uses the score distribution to calculate the output vector.
The input for calculating attention includes the matrix Q (each of its rows is a query vector representing an input word) and the matrix K (similarly, each row is a key vector representing an input word). These two matrices Q, K are used to compute how much attention the words in the sentence pay to a specific word in the sentence. The attention vector is calculated as the weighted average of the value vectors in the matrix V, with attention weights computed from Q and K.
The function of Attention is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V \tag{10}$$

Dividing by $\sqrt{d_k}$, where $d_k$ is the dimensionality of the key vectors, aims to prevent overflow if the exponent is too large.

3.2 Multi-head attention
After the scaled dot-product process, we obtain an attention matrix. The parameters that the model needs to adjust are the matrices $W_q$, $W_k$, $W_v$. Each such process is called a head of attention. When repeating this process multiple times (in the paper, it's 3 heads), we obtain the Multi-head Attention process as the transformation below. Each branch of the input is a head of attention. In this branch, we perform scaled dot-product attention, and the output is the attention matrices.
After obtaining the three attention matrices at the output, we concatenate them along the columns to obtain the aggregated multi-head matrix with the same height as the input matrix, according to the formula:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{concatenate}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W_0, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i)$$

Note that, to return an output with the same size as the input matrix, we just need to multiply it by the $W_0$ matrix with the same width as the input matrix.

4. Encoder and Decoder process in BERT mechanism
Thus, concluding process 3, we have completed the first sub-layer of the Transformer, which is the multi-head Attention layer. In the second sub-layer, we pass through fully connected connections and return the output with the same size as the input. The purpose is to be able to iterate these blocks N times.
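As a minimal sketch of equation (10) and the MultiHead formula, the following plain-Python code computes scaled dot-product attention and combines several heads. The function names, the toy input and the identity projection matrices are illustrative assumptions, not part of the system described in this paper; a real implementation would use a tensor library and trained weights.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Equation (10): softmax(Q K^T / sqrt(d_k)) V. Returns (output, weights)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head(x, heads, w0):
    """Run one attention per (Wq, Wk, Wv) triple, concatenate the head
    outputs along the columns, then project with W0."""
    outputs = []
    for wq, wk, wv in heads:
        out, _ = attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
        outputs.append(out)
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    return matmul(concat, w0)


# Toy input: m = 3 words, n = 2 embedding dimensions (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
single_out, single_weights = attention(x, x, x)

# Two identical heads; W0 averages them, so the result equals one head.
w0 = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
multi_out = multi_head(x, [(identity, identity, identity)] * 2, w0)
```

Each row of `single_weights` sums to 1, and `multi_out` has the same 3 x 2 shape as the input, which is what allows the block to be stacked N times.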
After the backward propagation transformations, we often lose information about the position of words. Therefore, it is necessary to apply a residual connection to carry previous information into the output. This method is similar to ResNet in CNN. To ensure stable training, we also apply another normalization layer immediately after the addition operation.
If we repeat the above layer block 6 times and denote the result as one encoder, we can simplify the graph of the encoder process as shown below.
The decoder process is completely similar to the encoder except for:
The decoder process generates tokens sequentially from left to right at each time step.
At each block layer of the decoder, we need to add the final matrix from the encoder as an input to the multi-head attention.
A Masked Multi-head Attention sub-layer is added at the beginning of each block layer. This layer is no different from the Multi-head Attention except that it does not consider the attention of future words.

5. Overview of BERT
BERT [25] is trained simultaneously on two different tasks: Masked Language Model and Next Sentence Prediction [25].
The BERTbase model utilizes 12 Transformer encoder blocks, while the BERTlarge model uses 24 Transformer encoder blocks (See Figure 6).

Fig. 7. BERT architecture for MLM task

Here, the masked language model task has an output size equal to the maximum length of the sentence, and the task of predicting whether the next sentence is present or not has an output size of 2. The loss value used for updating the parameters is the total loss of both tasks.

$$\mathrm{Loss} = -\sum_{i=1}^{\text{output size}} y_i \cdot \log \hat{y}_i \tag{11}$$

E. Evaluation metrics

Fig. 8. Confusion matrix
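To make the quantities of this section concrete, the sketch below counts the confusion-matrix cells for one label and derives Precision, Recall and F1 from them, following the definitions given in this section. The function names and the six example labels are hypothetical and chosen only for illustration.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, FN, TN for one label treated as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN),
    F1 = 2 * Precision * Recall / (Precision + Recall).
    Returns 0.0 for a metric whose denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical predictions of a fake-news classifier on six articles.
y_true = ["fake", "fake", "real", "real", "fake", "real"]
y_pred = ["fake", "real", "real", "fake", "fake", "real"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="fake")
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

In this example two fake articles are caught, one is missed and one real article is wrongly flagged, so Precision, Recall and F1 all equal 2/3.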
1. Precision

Precision is defined as the ratio of the number of True Positives to the total number of predicted Positive instances (i.e. take the number of correctly predicted labels and divide it by the total number of predictions made for that label).
A high precision means a high accuracy of correctly labeled instances.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{12}$$

2. Recall

Recall is defined as the ratio of the number of True Positives to the total number of actual Positive instances (i.e. take the number of correctly predicted labels and divide it by the total number of labels that are actually correct).
A high recall means a high proportion of True Positives, indicating a low rate of entities not labeled as such.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{13}$$

3. F1 score
In practice, adjusting the model to increase Recall beyond a certain point may lead to a decrease in Precision, and vice versa, attempting to increase Precision can reduce Recall. In such cases, the F1 score serves as a useful measure of prediction success when dealing with imbalanced classes, calculated from Precision and Recall.

$$\frac{2}{F_1} = \frac{1}{\mathrm{Precision}} + \frac{1}{\mathrm{Recall}} \iff F_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{14}$$

website provides interaction for the following services: Search, Generating statistics, News following and other APIs for Administrators, User personal information and Account.
Specifically, APIs relating to user personal information help manage the user's history of accessing and following news. APIs for Admin and Account provide account management, login/logout actions, authentication and permission. Services and APIs communicate through a shared database. In addition, a news database will be constructed through a news integration module with a procedure containing the steps: retrieval, integration, and data cleaning.
In terms of deployment, the system will be deployed on a personal server and hosted locally. The SQL database is deployed on PostgreSQL and other services are deployed on Docker.
Actors accessing the components of the system, from the frontend to the APIs and database, are always required to authenticate and be given the right permission.
Fig. 9. Architectural System Design
Fig. 11. Use cases for the Administrator Actor
Engine to the Spiders to analyze. When the Spiders receive a response, they begin to extract the information from the response (title, content, author, date published, ...) and handle the URLs that have potential to be crawled, pushing them back to the Engine (requests). At this point, the Engine receives the results from the Spiders for two main tasks: pushing the parsed data to the Item Pipeline to be processed and saved into the database, and pushing new URLs to the Scheduler before returning to step 3.

1.2.2 Kafka
Apache Kafka [27] is a distributed messaging system. Kafka is developed and maintained by Apache, which is why Kafka (message broker) has the name Apache Kafka. Similar to other message brokers, it follows the publish/subscribe model. The party that publishes the data is called the producer, and the party receiving the data, divided by topics, is called the consumer.

1.2.3 Logstash
Logstash [29] is an open-source application that belongs to the ELK Stack ecosystem. Its log pipeline consists of three stages, corresponding to three modules:
• INPUT: Receive and collect raw event log data from various sources such as files, Redis, RabbitMQ, Beats, syslog, ...
• FILTER: After receiving the data, perform event log data operations (such as add, delete, replace, ... log content) according to the administrator’s configuration to rebuild the log event data structure as desired.
• OUTPUT: Finally, forward the event log data to other services such as Elasticsearch for log reception, storage, or display.

Fig. 12. HT LogStash Flow

At the INPUT stage, Logstash is configured to select the form of receiving log events or fetching log data remotely as needed. After obtaining the log events, the INPUT step writes the event data to a centralized queue in RAM or on disk.
Each pipeline worker thread then retrieves a batch of events from this queue and runs FILTER on it to restructure the log data to be sent at the OUTPUT stage. The number of events processed per batch and the number of pipeline worker threads can be configured for optimal tuning.
By default, Logstash uses an in-memory queue in RAM between stages (input -> filter and filter -> output) as a buffer to store event data before processing. If the Logstash service is stopped for some reason in the middle, the event data in the buffer will be lost.

1.2.4 Elasticsearch
Elasticsearch [30] is a distributed search and analytics tool built on Apache Lucene. Elasticsearch was first introduced in the paper titled "ElasticSearch: A Distributed, RESTful Search Engine" by Shay Banon [31]. This paper was presented at the 2010 Berlin Buzzwords Conference. It outlines the design, architecture, and features of Elasticsearch, emphasizing its distributed nature, scalability, and support for real-time search and analytics. The paper highlights Elasticsearch’s RESTful API, which allows users to interact with the search engine using HTTP requests, and discusses its use cases in various domains, including log analysis, full-text search, and business intelligence. Since its launch in 2010, Elasticsearch has quickly become the most popular search tool and is widely used for use cases related to log analysis, full-text search, security information, business analytics, and operational information. Elasticsearch is considered a search engine and inherits from Apache Lucene. Elasticsearch essentially functions as a web server with fast search capabilities through the RESTful protocol. It possesses highly efficient data analysis and statistical capabilities, runs on its own server platform and communicates via REST, so it does not depend much on the client or the existing system it is used from. Therefore, integrating it into your system becomes easier: you only need to send an HTTP request, and it will return the results. Elasticsearch is also a distributed system with incredible scalability and is an open-source system.

2. AI Module

2.1 BERT Model
BERT (Bidirectional Encoder Representations from Transformers) is a model that represents words bidirectionally using Transformer techniques. BERT is designed to pre-train word embeddings. One notable feature of BERT is its ability to balance context in both left and right directions.
The attention mechanism of the Transformer passes all words in the sentence into the model at once, without considering the direction of the sentence. Therefore, the Transformer is considered to be trained bidirectionally, although in reality it is more accurate to say that training is non-directional. This feature allows the model to learn the context of a word based on all surrounding words, including both left and right words.
One unique aspect of BERT that previous embedding models did not have is the ability to fine-tune the training results. We can add an output layer to the model architecture to customize it for specific training tasks.

Fig. 13. Pre-train and Fine-tuning process of BERT

Currently there are many different versions of BERT. All the versions are based on changes to the Transformer architecture, focusing on three parameters:
• L: the number of block sub-layers inside the Transformer
• H: the size of the embedding vectors (aka hidden size)
• A: the number of heads inside a multi-head layer; each head performs a self-attention mechanism
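As a rough illustration of what L, H and A control, the sketch below estimates the parameter count and the per-head dimension for the two standard configurations, BERT-base (L = 12, H = 768, A = 12) and BERT-large (L = 24, H = 1024, A = 16). The vocabulary size (30522), the maximum sequence length (512), the feed-forward width of 4H and the pooling layer are assumptions taken from the publicly released English uncased models, not values stated in this report, and the class name `BertSize` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertSize:
    L: int                     # number of Transformer encoder blocks
    H: int                     # hidden (embedding) size
    A: int                     # attention heads per block
    vocab_size: int = 30522    # assumption: released English uncased vocabulary
    max_positions: int = 512   # assumption: maximum sequence length
    type_vocab: int = 2        # segment A / segment B embeddings

    @property
    def head_dim(self) -> int:
        """Dimension handled by each attention head."""
        return self.H // self.A

    def parameter_count(self) -> int:
        H = self.H
        ffn = 4 * H  # assumption: feed-forward width is 4H
        # Token, position and segment embeddings plus one LayerNorm.
        embeddings = (self.vocab_size + self.max_positions + self.type_vocab) * H + 2 * H
        # Query, key, value and output projections, each with a bias.
        attention = 4 * (H * H + H)
        # Two dense layers of the feed-forward sub-layer.
        feed_forward = (H * ffn + ffn) + (ffn * H + H)
        # Two LayerNorms per block, each with a scale and a shift vector.
        layer_norms = 2 * 2 * H
        # Dense pooling layer applied to the first token.
        pooler = H * H + H
        return embeddings + self.L * (attention + feed_forward + layer_norms) + pooler


bert_base = BertSize(L=12, H=768, A=12)
bert_large = BertSize(L=24, H=1024, A=16)
base_params = bert_base.parameter_count()
large_params = bert_large.parameter_count()
```

Under these assumptions the estimate is about 109 million parameters for BERT-base and about 335 million for BERT-large, close to the commonly quoted 110 and 340 million, and both configurations use 64 dimensions per head.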
There are two main architectures with different names:
BERTbase (L = 12, H = 768, A = 12), which has 110 million
parameters in total, and BERTlarge (L = 24, H = 1024, A = 16),
which has 340 million parameters in total. In the BERTlarge
architecture, the number of layers is doubled, while the hidden
size and the number of heads in each multi-head layer are both
about 1.33 times those of BERTbase.
BERT also takes the classification token (CLS for short) as the
first input, followed by the word sequence. It then passes the
input to the upper layers. Each layer applies self-attention and
passes the result through a feedforward network, then hands over
to the next encoder layer. For each input position, the model
outputs a vector whose size equals BERT's hidden size H. If we
want to build a classifier from this model, we can take the
output corresponding to the CLS token.

pre-trained BERT model. This turns a piece of text into a
fixed-size vector that represents the semantic aspect of the
document.
Step 2: Keywords and phrases (n-grams) are extracted from the
same document using Bag of Words techniques (such as
TfidfVectorizer or CountVectorizer). This is a classical step
in keyword extraction.
Fig. 16. Demonstration for step 1 and 2 of KeyBert
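The two steps illustrated in Fig. 16 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: toy_embed is a word-count vector standing in for the BERT document embedding of Step 1, the stop-word list is a small hypothetical one, and none of the function names come from the KeyBERT library. Candidates are then ranked by cosine similarity to the document, which is how the two steps are typically combined.

```python
import math
import re
from collections import Counter

# Small hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_ngrams(text, n_min=1, n_max=2):
    """Step 2: collect n-grams that neither start nor end with a stop word."""
    tokens = tokenize(text)
    candidates = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            candidates.add(" ".join(gram))
    return sorted(candidates)


def toy_embed(text):
    """Stand-in for Step 1: a sparse word-count vector, NOT a BERT embedding."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def rank_keywords(doc, top_n=3):
    """Rank candidate n-grams by similarity to the whole document."""
    doc_vec = toy_embed(doc)
    scored = [(cosine(doc_vec, toy_embed(c)), c) for c in candidate_ngrams(doc)]
    # Highest similarity first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [c for _, c in scored[:top_n]]


doc = "Fake news detection with BERT. BERT improves fake news detection on news articles."
print(rank_keywords(doc))
```

With a real sentence encoder in place of toy_embed, the ranking would reflect semantic similarity rather than word overlap.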
aims to minimize redundancy and maximize diversity in text
summarization tasks. It starts by selecting the keywords most
similar to the document. Then, it iteratively selects new
candidates that are similar to the document but dissimilar to
the already chosen keywords.

The model works through two main steps. Firstly, the word
embeddings of each word and character are fed into the BiLSTM
layer to extract useful information about the semantics and
morphology of the word and the context surrounding it. Secondly,
the CRF layer processes this information as features to predict
the NER label of each word. In addition to the information
received from the BiLSTM layer, the CRF also relies on
information from previously predicted labels. For example, if
the previous label is B-LOCATION, it is highly likely that the
current word will have the label I-LOCATION.
The learnable parameters of the BiLSTM+CRF architecture include
the word embedding matrix, the weight matrices of the BiLSTM
layer, and the transition matrix of the CRF layer. All of these
parameters are updated during training on labeled data through
the back-propagation algorithm with Stochastic Gradient Descent
(SGD).
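The effect of the CRF transition matrix can be illustrated with a small decoding sketch. It assumes the usual Viterbi decoding for a linear-chain CRF, which this report does not spell out, and every number below is hypothetical: the emission scores stand in for BiLSTM outputs and the transition scores for a trained CRF matrix.

```python
TAGS = ["O", "B-LOCATION", "I-LOCATION"]

# TRANSITIONS[i][j]: score for moving from TAGS[i] to TAGS[j] (hypothetical values).
TRANSITIONS = [
    [0.5, 0.5, -4.0],   # from O: jumping straight to I-LOCATION is penalized
    [-0.5, -1.0, 2.0],  # from B-LOCATION: I-LOCATION is strongly favored
    [0.0, -0.5, 1.0],   # from I-LOCATION
]


def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence for one sentence."""
    n_tags = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for landing on tag j at this position.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


tokens = ["lives", "in", "Ha", "Noi"]
# Hypothetical per-token emission scores in the order of TAGS.
emissions = [
    [2.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.2, 1.5, 0.3],
    [1.0, 0.8, 0.9],  # taken alone, "Noi" looks like O
]

greedy = [TAGS[max(range(len(TAGS)), key=lambda j: e[j])] for e in emissions]
decoded = [TAGS[i] for i in viterbi(emissions, TRANSITIONS)]
print("per-token argmax:", list(zip(tokens, greedy)))
print("CRF decoding:    ", list(zip(tokens, decoded)))
```

In this toy case the per-token argmax labels the last token O, while the transition score from B-LOCATION to I-LOCATION lets the sequence-level decoding recover I-LOCATION, which mirrors the example given in the text.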
3. Backend
4. Frontend
IV. Experimental Results
A. Model Evaluation
1. Experiment environment
The team trains the models on the dataset in a Python
environment using machine learning and deep learning libraries
such as scikit-learn, TensorFlow [35], and PyTorch. Training
runs on local machines and on Google Colab with the
corresponding configurations:
2. Dataset
2.1 PhoNER_COVID19
PhoNER_COVID19 [36] is a dataset for identifying named
entities related to COVID-19 in Vietnamese, built and
published by VinAI Research.
The data was collected from articles tagged with the keywords
"COVID-19" or "COVID" from reputable online news sources in
Vietnam (VnExpress, ZingNews, BaoMoi, and ThanhNien) from
February 2020 to August 2020. Subsequently, the main text
content of the articles was segmented into sentences using the
RDRSegmenter from VnCoreNLP. Sentences related to COVID-19
patients were selected using BM25Plus. The authors then
manually filtered out sentences that did not contain relevant
information about COVID-19 patients in Vietnam, resulting in
10,027 raw sentences.
Next, the data was manually labeled following a defined
procedure. The labeled results were reviewed and amended where
necessary. Finally, from these roughly 10,000 raw sentences,
about 35,000 entities were obtained. The authors divided the
dataset into Train/Validation/Test sets with a ratio of 5/2/3.
PhoNER_COVID19 is published at:
https://github.com/VinAIResearch/PhoNER_COVID19
3. Evaluation Results
Model          PATIENT-ID  AGE  GENDER  DATE  OCCUPATION  NAME  LOCATION  ORGANIZATION  SYMPTOM  TRANSPORTATION
lstm-syllable  96          97   95      98    66          86    90        87            83       92
lstm-word      97          98   95      99    69          88    88        84            83       93
bert-syllable  95          93   94      98    56          90    84        77            84       83
bert-word      97          95   94      99    75          90    86        80            85       85

Model          PATIENT-ID  AGE  GENDER  DATE  OCCUPATION  NAME  LOCATION  ORGANIZATION  SYMPTOM  TRANSPORTATION
lstm-syllable  96          91   90      96    55          82    94        78            88       93
lstm-word      95          88   86      96    57          85    93        80            86       95
bert-syllable  95          80   85      98    60          93    86        71            87       82
bert-word      96          82   86      99    65          93    86        76            88       83

Model          PATIENT-ID  AGE  GENDER  DATE  OCCUPATION  NAME  LOCATION  ORGANIZATION  SYMPTOM  TRANSPORTATION
lstm-syllable  96          92   92      96    60          84    92        82            85       92
lstm-word      96          92   90      97    62          86    90        81            84       94
bert-syllable  95          86   89      98    58          91    85        74            85       82
bert-word      97          88   90      99    70          91    86        78            86       84
Table 4: F1 score
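For reference, the F1 score is the harmonic mean of precision and recall. The following is a minimal sketch of the per-label computation. The function name and the entity counts in the example are hypothetical and are not taken from the experiments reported here.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F1 for one entity label from raw counts."""
    predicted = true_positives + false_positives  # entities the model produced
    actual = true_positives + false_negatives     # entities in the gold data
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for a single label.
p, r, f1 = precision_recall_f1(true_positives=60, false_positives=30, false_negatives=50)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a label with low recall keeps a low F1 even when its precision is high.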
Fig. 24. Comparing F1 score of BiLSTM + CRF model
3.1 NER task
It can be observed that, for our team, the results of
BiLSTM+CRF applied at the word level are better than at the
syllable level. However, the word-level BiLSTM+CRF results
from VinAI Research [36] [37] consistently outperform ours,
especially for the OCCUPATION (JOB) label. Furthermore,
predictions are far more accurate for labels such as
PATIENT_ID, AGE, and GENDER, and less accurate for labels such
as JOB.
These results (Figure 24) suggest that a more detailed approach
at the word level could yield more promising outcomes for our
team in the future. Another observation is that the word-level
BERT model has also shown good results, while the syllable-level
BERT model is outperformed in many categories, such as
OCCUPATION, ORGANIZATION, and LOCATION.
Fig. 27. Dashboard interface containing all dashboards in use
by the user
Fig. 29. Detailed charts in a dashboard
This is the link to the demo video:
https://youtu.be/cbWDSplfBHY

2. Remarks
The web demo has fully met the functional requirements of the
tasks.
The web demo has a user-friendly and clear interface. However,
the real/fake news detection feature is not easy to explain to
users: users do not fully understand why the system classifies
a news article A from website B as real or fake.

V. Conclusion
Secondly, concerning the quality of the AI modules: delve
deeper into the domain of news data; gain a better
understanding of the structural levels of articles and texts;
explore and carry out data enrichment so that the model can
learn more; and build real and fake news datasets in Vietnamese
and proceed with training.

References
[1] B. Jehangir, S. Radhakrishnan, R. Agarwal, “A survey on
named entity recognition—datasets, tools, and methodologies,”
Natural Language Processing Journal, vol. 3, p. 100017, 2023.
[7] D. A. Ferrucci, “Introduction to “this is watson”,” IBM
Journal of Research and Development, vol. 56, no. 3.4,
pp. 1–1, 2012.
[8] A. Jain, I. Aggarwal, A. Singh, “Paralleldots at
semeval-2019 task 3: Domain adaptation with feature embeddings
for contextual emotion analysis,” in Proceedings of the 13th
International Workshop on Semantic Evaluation, 2019,
pp. 185–189.
[9] A. Dandelion, “Dandelion api,” [Online]. Available:
https://dandelion.eu. [Last accessed: 23 January 2016], 2021.
[10] A. Brandsen, S. Verberne, K. Lambers, M. Wansleeben, “Can
bert dig it? named entity recognition for information retrieval
in the archaeology domain,” Journal on Computing and Cultural
Heritage (JOCCH), vol. 15, no. 3, pp. 1–18, 2022.
[11] A. Caravale, P. Moscati, N. Duran-Silva, B. Grimau,
B. Rondelli, “Developing a digital archaeology classification
system using natural language processing and machine learning
techniques,” Archeologia e Calcolatori, vol. 34, no. 2, 2023.
[12] S. Reshmi, K. Balakrishnan, “Enhancing inquisitiveness of
chatbots through ner integration,” in 2018 International
Conference on Data Science and Engineering (ICDSE), 2018,
pp. 1–5, IEEE.
[13] E. M. Kusumaningtyas, E. R. Laurentino, A. R. Barakbah,
“Responsive chatbot using named entity recognition and cosine
similarity,” 2023.
[14] J. Shin, E. Jo, Y. Yoon, J. Jung, “A system for
interviewing and collecting statements based on intent
classification and named entity recognition using
augmentation,” Applied Sciences, vol. 13, no. 20, p. 11545,
2023.
[15] C. Manning, H. Schutze, Foundations of statistical natural
language processing. MIT press, 1999.
[16] B. Minixhofer, J. Pfeiffer, I. Vulić, “Where’s the point?
self-supervised multilingual punctuation-agnostic sentence
segmentation,” arXiv preprint arXiv:2305.18893, 2023.
[17] D. Jurafsky, J. H. Martin, “Speech and language
processing: An introduction to natural language processing,
computational linguistics, and speech recognition.”
[18] B. Elov, S. M. Khamroeva, Z. Xusainova, “The pipeline
processing of nlp,” in E3S Web of Conferences, vol. 413, 2023,
p. 03011, EDP Sciences.
[19] P. Gholami-Dastgerdi, M.-R. Feizi-Derakhshi, “Part of
speech tagging using part of speech sequence graph,” Annals of
Data Science, vol. 10, no. 5, pp. 1301–1328, 2023.
[20] D. Nadeau, S. Sekine, “A survey of named entity
recognition and classification,” Lingvisticae Investigationes,
vol. 30, no. 1, pp. 3–26, 2007.
[21] S. Hochreiter, J. Schmidhuber, “Long short-term memory,”
Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[22] M. Schuster, K. K. Paliwal, “Bidirectional recurrent
neural networks,” IEEE transactions on Signal Processing,
vol. 45, no. 11, pp. 2673–2681, 1997.
[23] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet,
Z. Su, D. Du, C. Huang, P. H. Torr, “Conditional random fields
as recurrent neural networks,” in Proceedings of the IEEE
international conference on computer vision, 2015,
pp. 1529–1537.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
A. N. Gomez, Ł. Kaiser, I. Polosukhin, “Attention is all you
need,” Advances in neural information processing systems,
vol. 30, 2017.
[25] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, “Bert:
Pre-training of deep bidirectional transformers for language
understanding,” arXiv preprint arXiv:1810.04805, 2018.
[26] D. M. Powers, “Evaluation: from precision, recall and
f-measure to roc, informedness, markedness and correlation,”
arXiv preprint arXiv:2010.16061, 2020.
[27] V. Cothey, “Web-crawling reliability,” Journal of the
American Society for Information Science and Technology,
vol. 55, no. 14, pp. 1228–1238, 2004.
[28] J. Wang, Y. Guo, “Scrapy-based crawling and user-behavior
characteristics analysis on taobao,” in 2012 International
Conference on Cyber-Enabled Distributed Computing and Knowledge
Discovery, 2012, pp. 44–52, IEEE.
[29] J. Turnbull, The Logstash Book. James Turnbull, 2013.
[30] M.-P. Scott-Boyer, P. Dufour, F. Belleau, R. Ongaro-Carcy,
C. Plessis, O. Périn, A. Droit, “Use of elasticsearch-based
business intelligence tools for integration and visualization
of biological data,” Briefings in Bioinformatics, vol. 24,
no. 6, p. bbad348, 2023.
[31] A. ElasticSearch, “Distributed restful search engine,”
2012.
[32] https://maartengr.github.io/KeyBERT/.
[33] A. Besbes, “How to extract relevant keywords with
keybert.” [Online]. Available:
https://towardsdatascience.com/how-to-extract-relevant-keywords-with-keybert-6e7b3cf889ae.
[34] M. Mythily, V. R. Kanakala, R. Nambiar, et al., “An
extensive review of spring boot testing based on business
requirements of the software,” in 2023 4th International
Conference on Smart Electronics and Communication (ICOSEC),
2023, pp. 1547–1553, IEEE.
[35] S. Pattanayak, Pro deep learning with TensorFlow 2.0: A
mathematical approach to advanced artificial intelligence in
Python. Springer, 2023.
[36] D. Q. Nguyen, A. T. Nguyen, “Phobert: Pre-trained language
models for vietnamese,” arXiv preprint arXiv:2003.00744, 2020.
[37] T. H. Truong, M. H. Dao, D. Q. Nguyen, “Covid-19 named
entity recognition for vietnamese,” arXiv preprint
arXiv:2104.03879, 2021.