COMP 652 Project Final Paper

DiBERT: Short-text Classification in Disaster Incidents using a Transformer-based Bidirectional Model
I. INTRODUCTION

In recent years, the frequency and severity of major natural disasters have escalated, drawing global attention to the catastrophic effects of earthquakes, hurricanes, wildfires, and floods. Each disaster type brings its own challenges and consequences [1]. The increasing incidence and severity of natural and man-made disasters worldwide necessitate a robust and comprehensive approach to disaster management. Disaster management systems embody the structured and coordinated strategies, technologies, and policies designed to mitigate the impacts of disasters, enhance resilience, and facilitate recovery. These systems span the preparedness, response, recovery, and mitigation phases and involve various stakeholders, including government agencies, non-profit organizations, communities, and international bodies.

This paper analyzes short texts on disaster-related incidents such as earthquakes, hurricanes, and floods. The data come from the HumAID Twitter dataset [2], a comprehensive collection of more than 77,000 tweets that were manually tagged and compiled from 19 major natural disasters, including earthquakes, hurricanes, wildfires, and floods, occurring worldwide between 2016 and 2019. The dataset is categorized into various humanitarian themes and contains exclusively English tweets, making it the most extensive dataset in crisis informatics.

Natural Language Processing (NLP) has emerged as a transformative tool, revolutionizing how information is processed, interpreted, and acted upon during crises [3]. By analyzing vast amounts of data from diverse sources such as social media, news reports, and direct communications, NLP technologies can identify patterns, trends, and crucial information that might otherwise go unnoticed. This capability allows for real-time monitoring of situations, early detection of emerging threats, and the dissemination of critical information to both responders and the affected populations.

The study focuses on the multi-class classification of short text for disaster management response using the humanitarian categories provided in the HumAID dataset.

The texts will undergo a series of transformations as part of the preprocessing. Text normalization will be applied, such as removing stop words, punctuation symbols, retweet tags, HTML tags, URLs, and emoticons, and converting the text to lowercase. Other preprocessing techniques will be used if necessary.

The authors will utilize and improve the DeBERTa model to embed the words. DeBERTa incorporates absolute position embeddings just before the softmax layer, enabling the model to decode masked words by leveraging the combined contextual embeddings of the content and its positions [4].

The preprocessed data and the word embeddings will then be processed using a Bidirectional Long Short-Term Memory (BiLSTM) network. Hyperparameter tuning will be performed to improve the model, and the results will be evaluated using accuracy, precision, and recall. Other evaluations will be carried out if necessary.

The authors will compare the proposed solution against baseline models [5] such as Support Vector Machine, Random Forest, and XGBoost.

Disaster management organizations can use this system to efficiently identify the needs in a given situation and accelerate the response.
II. METHODOLOGY

A. Architecture

1) Micro BERT: A bidirectional encoder transformer (BERT) was designed from native PyTorch functionality and compared to the performance of other pre-trained BERT models. This custom BERT model is called "Micro BERT," as it is by far the smallest of the BERT models used to model the disaster classification task. Micro BERT was trained on the standard BERT tasks of masked language modeling (MLM) and next sentence prediction (NSP), similar to base BERT [6].

To derive the dataset for each of these tasks, the first 2 million sentences from Google's Books Corpus were used. Basic preprocessing was done on these sentences to match the preprocessing applied to the disaster dataset modeled later. This included casting all characters to lowercase and tokenization (using the NLTK word_tokenize function). The specific tokenizer used was a byte pair encoder that encoded word pieces found in the corpus of sentences. The vocabulary size of this tokenizer was set to 15,000 word pieces (about half the size of a base BERT model's vocabulary of approximately 30,000 word pieces) [6]. Next, the dataset for the MLM task was prepared: 15% of the tokens in the sentences were selected for masking [6]. Of these selected tokens, 80% were replaced with the [MASK] token, 10% were replaced with another random token from the vocabulary, and the final 10% were left unchanged [6]. An example of this process applied to a sentence is shown in figure 1. Note that in the actual BERT model, sub-word pieces would be used instead of whole tokens (as shown in the figure 1 example) [6]. This resulted in many encoded sentences from the Books Corpus. The label vectors for the MLM task were also generated: each label vector has zeros at all elements that were not masked and the original word index at all masked token indices.

Fig. 1. This figure shows an example of how tokenized sentences are transformed into masked sentences, label sequences, and attention masks. The BERT model masks 15% of the tokens in each sentence. Of the tokens selected for masking, 80% are replaced with the [MASK] token, 10% are replaced with another random token, and 10% are left unchanged. The label vector has zeros at all indices that were not masked and the original words at masked indices. The attention mask has ones at all indices that are not a [PAD] token (the example contains no pad tokens, so it is a vector of all ones).
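For illustration, the 80/10/10 masking rule described above can be sketched as follows. This is a simplified sketch rather than the exact Micro BERT code; the token-id constants and vocabulary size are placeholders.

import random

MASK_ID, PAD_ID, VOCAB_SIZE = 3, 0, 15000   # placeholder ids, not the actual Micro BERT values

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (masked_ids, labels): labels hold the original id at masked
    positions and 0 everywhere else, matching the label vectors described above."""
    masked, labels = list(token_ids), [0] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if tok == PAD_ID or random.random() >= mask_prob:
            continue
        labels[i] = tok
        r = random.random()
        if r < 0.8:                                   # 80%: replace with [MASK]
            masked[i] = MASK_ID
        elif r < 0.9:                                 # 10%: replace with a random token
            masked[i] = random.randrange(4, VOCAB_SIZE)   # 4 skips placeholder special ids
        # remaining 10%: leave the token unchanged
    return masked, labels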
Next, the dataset was prepared for the NSP task. The encoded sentences were paired together and labeled according to whether they followed each other (meaning the first sentence's index is N and the next sentence's index is N + 1, labeled as 1) for the training of the NSP task [6]. Approximately 40,000 sentence pairs were made that followed each other in the text, and approximately 40,000 sentence pairs were made that did not (the first sentence's index is N and the next sentence's index is not N + 1, labeled as 0). This pairing was also applied to the masked sentence labels. Segment encoding vectors were also derived for each sentence pair; these vectors have a 1 at each index where a token from the first sentence is present and a 2 at any index where a token from the second sentence is present. Only sentence pairs in which both sentences were longer than 10 tokens were chosen. This was done to cut down the total of 2 million sentences extracted from the Books Corpus and to ensure information-dense sentences were selected; it was not feasible to train the model on hundreds of thousands to millions of sentence pairs with the currently accessible computation resources.

A total of 81,908 sentence pairs were created (approximately 40,000 pairs followed each other while approximately 40,000 pairs did not). The [CLS] token was then added at the beginning of the first sentence, and the [SEP] token was added at the end of the first sentence and at the end of the second sentence. Entries were added to the attention mask to tell the attention layers to attend to the added [CLS] and [SEP] tokens. Since the model will never predict the occurrence of a [CLS] or [SEP] token, zeros were added at these indices in the masked sentence labels. A vector that labels any token from the first sentence as 1 and any token from the second sentence as 2 was created. Finally, the mask label vectors were also created. Everything was padded to 512 tokens in length using the integer 0 as the pad token in all of the data structures, since this is the input size our BERT model accepts. Attention mask vectors were also created for each sentence pair for the multi-headed self-attention layers; these vectors have a 1 at all indices that are not a pad token in the original tokenized sentences.

Fig. 2. This figure shows how the masked sentences are combined by adding the [CLS] and [SEP] tokens. This is also done for the masked labels (zeros are added at all [CLS] and [SEP] indices) and the attention mask (ones are added at all [CLS] and [SEP] indices).

The final dataset that was input into the BERT transformer included the input ids of the masked sentence pairs, the attention mask of these sentences, the segment encoding vector of the sentence pairs, the labels for the masked tokens in the original sentence pairs, and the labels for the NSP task indicating whether the sentences logically follow each other, for all 81,908 sentence pairs. This dataset was then partitioned into 70% train, 15% validation, and 15% test splits and broken up into batches of 16 samples.
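The pair-assembly step described above could look roughly like the following sketch; the special-token ids are placeholders, and assigning the [CLS] and first [SEP] tokens to segment 1 is an assumption of the sketch rather than a detail stated in the text.

CLS_ID, SEP_ID, PAD_ID, MAX_LEN = 1, 2, 0, 512   # placeholder ids

def build_pair(ids_a, ids_b, labels_a, labels_b, nsp_label):
    """Combine two masked sentences into one padded training example."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    mlm_labels = [0] + labels_a + [0] + labels_b + [0]            # zeros at [CLS]/[SEP]
    segment_ids = [1] * (len(ids_a) + 2) + [2] * (len(ids_b) + 1)  # 1 = sentence 1, 2 = sentence 2
    attn_mask = [1] * len(input_ids)                               # attend to real tokens only
    pad = MAX_LEN - len(input_ids)                                 # pad everything to 512 with 0
    input_ids += [PAD_ID] * pad
    mlm_labels += [0] * pad
    segment_ids += [0] * pad
    attn_mask += [0] * pad
    return input_ids, segment_ids, attn_mask, mlm_labels, nsp_label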

Micro BERT employed an embedding layer to embed the vectorized input ids. This embedding layer had the vocabulary size used in the tokenizer (15,000 subword tokens, approximately half of the 30,522 subword tokens of BERT base), and the embedding dimension of each token was kept very small at 128, significantly smaller than BERT base's embedding dimension of 768. The embedding layer had a total of 1,920,000 parameters. The model used another embedding layer to embed the segment encoding vectors; it had the same embedding dimension but only a vocabulary size of 3, since the segment vectors contain only zeros (padding indices), ones (sentence 1 indices), and twos (sentence 2 indices). This layer had 384 parameters. Positional embedding vectors (with the same embedding dimension of 128 and a length of 512) were also generated following the sine and cosine functions [7]; the positional embeddings have no trainable parameters. The sentence embedding, the segment embedding, and the positional embedding were all added together (via matrix addition), and a dropout of 0.2 was applied to the resulting matrix to produce the input into the encoder blocks [6].

Fig. 3. This figure shows the sine and cosine equations used to generate the positional embedding vectors at different token positions (pos) and at each index of the embedding dimension (i) [7]. The variable d_model is the dimension of each embedding vector [7].
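The sinusoidal positional embeddings referenced in Fig. 3 can be generated as in the sketch below, a direct PyTorch translation of the sine/cosine formulation of [7] for the 128-dimensional, 512-position setting used by Micro BERT.

import math
import torch

def sinusoidal_positions(max_len=512, d_model=128):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe   # no trainable parameters, as noted above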
The encoder block takes the combined embedding described above as input and performs multi-headed self-attention (all tokens can see all other tokens in the input sequence, effectively making this a bidirectional calculation of attention scores), deriving the key, query, and value (via the respective key, query, and value weight matrices) from the embedded representation [7]. The attention output is calculated from the query (Q), key (K), value (V), and the dimension of the key vector (d_k) via the formula outlined in the paper "Attention Is All You Need" [7]:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

The multi-headed aspect comes from this process of deriving attention scores being carried out multiple times for different parts of the embedding vectors over the entire input sequence. For example, attention head one performs the attention calculation over the entire sequence but only uses indices 1 through 8 of the word embedding [7], while attention head two performs the same attention calculation over the entire sequence using indices 9 through 16 instead. The outputs from the multiple heads are then concatenated together after the attention score calculations to produce an output with the same dimension as the input to the attention layer. Micro BERT used a multi-headed attention layer with 8 heads. The number of parameters in the attention layers of Micro BERT was 66,038. A dropout of 0.2 was applied to the resulting hidden states (keys, queries, values, ...) calculated in the multi-headed attention. The attention masks were applied to the embeddings so as to ignore pad embeddings in all of the sequences, making this computation more efficient and focused. The output from the attention layer also has 0.2 dropout applied, is added to a residual connection of the input of the attention layer via matrix addition, and then undergoes a layer normalization layer with 256 parameters (since the output dimension is still 128, resulting in 128 weights and 128 biases in the layer norm) [6].

Fig. 4. The diagram shows how to calculate the output of scaled dot-product attention, alongside a diagram of multi-headed attention, which performs the attention score calculation on sections of the embedding (for example, head one takes all the tokens' embeddings from index 0 to 9 while head two takes the embeddings from index 10 to 29) [7].

The output from the first half of the encoder block was then passed to a feed-forward neural network with an input layer the size of the embedding dimension (128 in Micro BERT), a hidden layer the size of the feed-forward dimension (128 in Micro BERT) with Gaussian Error Linear Unit (GELU) activation, a dropout of 0.2 between the hidden and output layer, and an output layer the size of the original embedding dimension (128 in Micro BERT). The feed-forward network had a total of 33,024 parameters. This output has 0.2 dropout applied again, is residually connected with the input to the feed-forward network, and undergoes another layer normalization, again with 256 parameters. The output of this final layer normalization is the output of the encoder block [6]. Each encoder block in Micro BERT has a total of 99,574 parameters.
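Because Micro BERT was written in native PyTorch, the attention computation above can be summarized with a sketch such as the following; this is a simplified single-function version with the [PAD] positions blanked out of the score matrix, not the exact implementation.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=0.2):
    """q, k, v: (batch, heads, seq_len, d_k); attn_mask: (batch, 1, 1, seq_len) of 0/1."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5          # softmax(QK^T / sqrt(d_k)) V
    if attn_mask is not None:
        scores = scores.masked_fill(attn_mask == 0, float("-inf"))   # ignore [PAD] tokens
    weights = F.dropout(torch.softmax(scores, dim=-1), p=dropout_p)
    return weights @ v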
A diagram of the encoder block in its entirety can be seen in figure 5.

Fig. 5. The box in the diagram surrounds the encoder architecture that Micro BERT uses [6].

Micro BERT employs 3 stacked encoder blocks, where the output from the previous block is the input to the next block. The parameters of these blocks are distinct from each other, but the architectures are identical. The output of the encoder blocks is then passed to a distinct head based on the task the model is performing. When the model performs MLM, the output of the encoder (the embedding of each individual token) is mapped to an MLP head with a hidden dimension of 128 neurons paired with ReLU activation and a dropout of 0.2. This hidden state is then mapped to an output layer the size of the vocabulary (15,000 in the case of Micro BERT). This MLP head has a total of 1,951,384 parameters. The output sequence of distributions for each token is then compared to the masked label vectors explained above using a cross-entropy loss function that ignores pad tokens. This loss is then used to adjust the weights and biases of the MLM MLP head, the encoder, and the embedding layers of the model.

Micro BERT used a different MLP head when training the model on the NSP task. This head uses only the [CLS] token at the beginning of each sentence as the sole embedding passed to the MLP. The [CLS] embedding is passed to a hidden MLP layer of 128 (with corresponding ReLU activation and dropout of 0.2) and finally to a softmax layer of 2 neurons (a 2-neuron softmax proved to result in higher accuracy on NSP than a 1-neuron output layer with sigmoid activation). This output distribution is compared to the label vector for the NSP task, and the loss/gradient is calculated and used to adjust the weights and biases of the NSP MLP head, the encoder, and the embedding layers. This MLP head has 16,770 parameters.

Finally, Micro BERT included an adjustable MLP head that performs sequence classification via the [CLS] token for the disaster dataset after pretraining. This MLP head takes the [CLS] embedding and maps it to a hidden dimension of 128 (with ReLU and dropout) and an output layer of 10, since there are 10 classes in the disaster dataset. This MLP head has a parameter count of 17,802.

The entire model (including all of the different MLP heads) contained 4,205,476 parameters in total. Micro BERT was then trained on MLM and NSP simultaneously on the dataset derived from the Books Corpus. The hyperparameters used were a word-piece vocabulary size of 15,000, an embedding dimension of 128, 8 attention heads, a feed-forward dimension (used in the hidden dimensions of the feed-forward layers as well as the MLP heads) of 128, a dropout of 0.2, 3 stacked encoder blocks, a validation-loss-sensitive early stopper with a patience of 10, and a learning rate of 1e-5. Note that the cross-entropy loss functions used in the MLP heads differ: the MLM loss needs to ignore the PAD indices when calculating the loss, while the other two heads use standard cross-entropy losses. The AdamW optimizer was used.
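A condensed sketch of one combined pretraining update is shown below. The module and batch field names are illustrative (the actual model and data pipeline are described above), but the sketch reflects the stated choices: a PAD-ignoring cross-entropy for MLM, a standard cross-entropy for NSP, and the AdamW optimizer at a learning rate of 1e-5.

import torch
import torch.nn as nn

PAD_ID = 0   # labels are 0 except at masked positions, so ignore_index=0 restricts the
             # MLM loss to masked tokens and skips [PAD], as described above
mlm_loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_ID)
nsp_loss_fn = nn.CrossEntropyLoss()

def pretraining_step(model, optimizer, batch):
    """One combined MLM + NSP update; `model` is assumed to return both heads' logits."""
    mlm_logits, nsp_logits = model(batch["input_ids"], batch["segment_ids"], batch["attn_mask"])
    loss = (mlm_loss_fn(mlm_logits.view(-1, mlm_logits.size(-1)), batch["mlm_labels"].view(-1))
            + nsp_loss_fn(nsp_logits, batch["nsp_labels"]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)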
A summary of the training process can be seen in figure 6. The model achieved a test loss of .91346, which was close to the validation loss of .91359; both non-training partitions were a bit higher than the training loss of .90215. The model reached an MLM test accuracy of .0140 and an NSP test accuracy of .6341. The test and validation accuracies were always above the training accuracy for the kept epochs, showing small signs of overfitting only on the loss metrics. The total training time for Micro BERT on the NSP and MLM tasks simultaneously was 63 minutes, and the model's state at training epoch 16 was saved as the final state of Micro BERT's pretraining.

Fig. 6. This figure summarizes the training of Micro BERT on the MLM and NSP tasks. The green line marks the epoch at which the model's final parameters were saved. The red line shows the test value for the respective metric. The text below reports the final state of the saved model.

Next, the pretrained model was used to classify categories from the disaster dataset. Micro BERT was initialized from the pretraining checkpoint on the Books Corpus. As in NSP, only the [CLS] token was extracted from the output of the encoder blocks and mapped to an MLP head with a hidden layer of 128 (with ReLU and dropout of 0.2) and finally to an output layer of 10 neurons. Cross-entropy loss was again used as the loss function and Adam as the optimizer. The exact same hyperparameters were used to train Micro BERT on the disaster classification task as on the Books Corpus (embedding dimension 128, 8 attention heads, 3 stacked encoder blocks, feed-forward hidden dimension 128, dropout 0.2, early-stopper patience of 10, learning rate 1e-5). A summary of the final partition losses and accuracies can be found in figure 7.

Fig. 7. This figure summarizes the training of the pretrained Micro BERT on the classification task on the disaster dataset.

The results of Micro BERT on the disaster dataset were losses of .7601, .7900, and .8025 on train, validation, and test, respectively. The accuracies were .7284, .7236, and .7263 on train, validation, and test. The kept validation F1 score was .7162, while the test F1 score was .7179.
Again, the training loss was a bit lower than the other two partition losses, but the accuracies were all close, giving very little evidence of overfitting on this dataset. The total training time for this model was 238 minutes, with the kept epoch being 141.

Finally, a non-pretrained Micro BERT (the weights of the Micro BERT model were not initialized from the training on the Google Books Corpus) was also fit, using the exact same MLP head (whose weights were again not initialized from previous training), on the disaster classification dataset (embedding dimension 128, 8 attention heads, 3 stacked encoder blocks, feed-forward hidden dimension 128, dropout 0.2, early-stopper patience of 10, learning rate 1e-5) to see if pretraining on the Books Corpus made a difference. The results from this version of Micro BERT can be seen in figure 8.

Fig. 8. This figure summarizes the training of the non-pretrained Micro BERT on the classification task on the disaster dataset.

The final losses of the non-pretrained Micro BERT were .8477 train, .9335 validation, and .9446 test. The accuracies were .7044 train, .6926 validation, and .6947 test. The F1 scores for the validation and test partitions were .677 and .675. The same trend of small differences between train and validation/test losses continued, with small differences in accuracy across partitions (the validation and test partitions are about half a percent less accurate). These results show larger signs of overfitting than the training of the pretrained Micro BERT. The training time of the non-pretrained Micro BERT was 120 minutes, with a kept epoch of 77.
B. Dataset and Pre-processing

First, the dataset is loaded from a CSV file, skipping problematic lines, and rows with missing values are deleted to preserve data integrity. The 'class_int' column is cast to an integer type, and the 'tweet_text' rows are converted to strings. One-hot encoding is applied to the categorical labels. The program removes URLs, placeholders, and non-word characters during text preprocessing, along with mentions and hashtags, and the text is also tokenized, stemmed, and lemmatized using NLTK, which has been downloaded previously. The preprocessed tweets are stored in new columns, and boolean values are replaced with numbers, completing this step. The data can then be used for further analytical or predictive problem-solving.

# Load dataset while ignoring errors in specific lines
df = pd.read_csv("dataset_final.csv", encoding='latin1', on_bad_lines='skip')
df.isnull().sum()

event                   0
tweet_text              0
class_int               0
tweet_text_tokenize     0
tweet_stem              0
tweet_lemma            69
dtype: int64

df = df.dropna(axis=0)
df.isnull().sum()

event                  0
tweet_text             0
class_int              0
tweet_text_tokenize    0
tweet_stem             0
tweet_lemma            0
dtype: int64

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 74141 entries, 0 to 74209
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   event                74141 non-null  object
 1   tweet_text           74141 non-null  object
 2   class_int            74141 non-null  object
 3   tweet_text_tokenize  74141 non-null  object
 4   tweet_stem           74141 non-null  object
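The cleaning, tokenization, stemming, and lemmatization steps described above could be written roughly as follows; this is a sketch rather than the exact project code, the regular expressions are illustrative, and the column names follow the dataframe shown above.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

def preprocess_tweet(text):
    """Lowercase, strip URLs/retweet tags/mentions/hashtags/non-word characters, then tokenize."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # URLs
    text = re.sub(r"(rt\s+)?@\w+|#", " ", text)           # retweet tags, mentions, '#'
    text = re.sub(r"[^a-z\s]", " ", text)                 # punctuation, emoticons, digits
    return [t for t in nltk.word_tokenize(text) if t not in stop_words]

df["tweet_text_tokenize"] = df["tweet_text"].astype(str).apply(preprocess_tweet)
df["tweet_stem"] = df["tweet_text_tokenize"].apply(lambda ts: [stemmer.stem(t) for t in ts])
df["tweet_lemma"] = df["tweet_text_tokenize"].apply(lambda ts: [lemmatizer.lemmatize(t) for t in ts])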
C. Disaster BERT Model
The core of this part of the code is the development of a disaster text classification model using a compressed version of BERT, specifically TinyBERT, which is distilled from BERT. The model is fed tweets related to disasters, is trained on the dataset, and is assessed through various metrics including accuracy, F1 score, recall, precision, ROC curves, and AUC scores. The main task is the training and fine-tuning of the disaster BERT model, which is assigned to categorize tweets into the different disaster-related categories. The code processes the dataset of tweets for machine learning by splitting it into training and testing sets with a 20% test size for model evaluation. It utilizes a BERT tokenizer to prepare the data for input into a TinyBERT model, adapting the model's final layer to output ten classes for multi-class classification. The model is compiled using the Adam optimizer and categorical cross-entropy loss. Finally, both the input encodings and labels are converted into TensorFlow tensors, setting the stage for model training and evaluation.
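A sketch of the setup described in this subsection is shown below, assuming the Hugging Face transformers and scikit-learn libraries; the TinyBERT checkpoint name and the learning rate are assumptions, and the exact project code may differ.

import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

MODEL_NAME = "huawei-noah/TinyBERT_General_4L_312D"    # assumed TinyBERT checkpoint

train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["tweet_text"].astype(str).tolist(), df["class_int"].astype(int).tolist(),
    test_size=0.2, random_state=42)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
train_enc = tokenizer(train_texts, truncation=True, padding=True, return_tensors="tf")
test_enc = tokenizer(test_texts, truncation=True, padding=True, return_tensors="tf")

# one-hot labels for categorical cross-entropy, as described above
train_labels_oh = tf.one_hot(train_labels, depth=10)
test_labels_oh = tf.one_hot(test_labels, depth=10)

# final layer adapted to the 10 humanitarian classes; from_pt converts the PyTorch checkpoint
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=10, from_pt=True)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])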
D. DeBERTa

The code as written does not use the DeBERTa model. DeBERTa is a BERT descendant that introduces several improvements, such as disentangled attention and better parameter-sharing strategies. If the DeBERTa model were used instead of the TinyBERT model, the code would have to be modified to load the DeBERTa model and use it for fine-tuning.

E. ALBERT

The supplied code is likewise independent of the ALBERT model, as with DeBERTa. ALBERT is essentially a second iteration of the BERT architecture that trains faster and shares parameters more effectively. If ALBERT were employed rather than TinyBERT, the code would also have to be adjusted to incorporate the ALBERT model and tune it.

F. TinyBERT

This code relies heavily on the TinyBERT model. TinyBERT is a lightweight version of BERT whose purpose is to provide a compact model, obtained through knowledge distillation techniques, that retains levels of performance on downstream tasks comparable to the original BERT model. The code loads the pre-trained TinyBERT model, which is then trained on the disaster tweet dataset; after training, the model is evaluated for text classification. The code executes the training and evaluation of the TinyBERT model, timing the entire training process using Python's time.time() to capture start and end times. Training occurs over 10 epochs with a batch size of 32, using both training and validation datasets. After training, the script prints a model summary and calculates the total number of trainable parameters. It then evaluates the model on the test set to measure loss and accuracy, also timing this process for performance analysis. The output includes training and testing times, the number of epochs, and the total trainable parameters, offering a comprehensive view of the model's training performance and efficiency.
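Continuing the sketch above, the timing, training, parameter counting, and test evaluation described here could look roughly as follows (validation data would be supplied in the same way as the training data; it is omitted here for brevity):

import time
import tensorflow as tf

start = time.time()
history = model.fit(dict(train_enc), train_labels_oh, epochs=10, batch_size=32)
train_time = time.time() - start               # total training time, as reported in the text

model.summary()
trainable_params = int(sum(tf.keras.backend.count_params(w) for w in model.trainable_weights))

start = time.time()
test_loss, test_acc = model.evaluate(dict(test_enc), test_labels_oh, batch_size=32)
test_time = time.time() - start

print(f"train time {train_time:.1f}s, test time {test_time:.1f}s, "
      f"epochs 10, trainable parameters {trainable_params}")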
G. Evaluation

The code below demonstrates the evaluation and visualization process for the trained TinyBERT model on the text classification task. Initially, it plots the training and validation loss and accuracy over epochs, providing visual insight into the model's learning and generalization across training. It then computes predicted probabilities by applying a softmax function to transform the logits into actual probabilities. The script calculates the ROC curve and the Area Under the Curve (AUC) for each class, offering a graphical representation of model performance across different thresholds. The predicted probabilities are further used to derive predictions, which are compared against the true labels to compute accuracy, F1 score, recall, and precision, offering a holistic view of the model's predictive accuracy. A confusion matrix is plotted to visually assess the model's performance in distinguishing between the different classes. Finally, a detailed classification report is generated, providing precision, recall, F1-score, and support for each class, summarizing the model's performance across the various metrics. This series of evaluations and visualizations helps in understanding the model's strengths and weaknesses in classifying text into specific categories.
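The evaluation steps described above can be sketched as follows, continuing from the previous sketches; the plotting of the training curves is omitted, and only the metric computations are shown.

import numpy as np
import tensorflow as tf
from sklearn.metrics import (accuracy_score, f1_score, precision_score, recall_score,
                             roc_curve, auc, confusion_matrix, classification_report)

logits = model.predict(dict(test_enc)).logits
probs = tf.nn.softmax(logits, axis=-1).numpy()         # softmax turns logits into probabilities
y_true = np.argmax(test_labels_oh.numpy(), axis=1)
y_pred = np.argmax(probs, axis=1)

for c in range(probs.shape[1]):                        # one-vs-rest ROC curve and AUC per class
    fpr, tpr, _ = roc_curve((y_true == c).astype(int), probs[:, c])
    print(f"class {c}: AUC = {auc(fpr, tpr):.3f}")

print("accuracy :", accuracy_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred, average="weighted"))
print("recall   :", recall_score(y_true, y_pred, average="weighted"))
print("precision:", precision_score(y_true, y_pred, average="weighted"))
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))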

Fig. 9. This figure shows the confusion matrix for the final model.

Fig. 10. Classification report of the final model.

III. RESULTS

The experiments introducing the DiBERT model for short-text classification in disasters met expectations. The system was found to have a high accuracy rate in classifying short texts into the various disaster-related categories. Core evaluation measures such as accuracy, F1 score, recall, and precision were computed to test the effectiveness of the model in categorizing disasters. The ROC curve and the corresponding AUC score were also employed to measure model success at different thresholds. Our evaluation results indicate that this type of model handles the short-text classification tasks in disaster incidents, which have become a real need today, satisfactorily. The model's ability to represent contextual clues and the sense of the sentences turned out to be key to assigning the texts to the correct disaster-related categories.

Fig. 8. This figure summarizes the training of the pretrained TinyBERT on the classification task on the disaster dataset.

IV. DISCUSSION

The results of the DiBERT model for short-text classification in disaster incidents were quite satisfactory: the model performs well in determining the exact class of disaster for many of the classes. The accuracy of DiBERT, together with metrics such as the F1 score, recall, and precision, exceeded expectations. It outperforms similar models and the conventional solutions. Its capacity to build separate generalizations for each dataset and its flexibility across different disaster scenarios indicate that it is a practical model. Despite the constraints on interpretation, these results still leave many directions to explore for combining fine-tuning and other enhancements to make the model more suitable for disaster management and response.

Pretraining is a cornerstone in equipping the model to extract sophisticated concepts and essential details from large text data, as provided by the corpora used in the process. We have found that this pretraining, in particular, plays a vital role. The results indicate that higher accuracy on unseen data is obtained compared with the non-pretrained network. The pretrained DiBERT model clearly proves very powerful in handling disaster-related texts; it is therefore clear how necessary pretraining is for making the models more suitable for specific tasks.

While achieving a balance between the number of parameters, training time, and model performance is essential, it remains a challenge. Models with fewer parameters could possibly underperform; on the other hand, they have the advantage of being trained within shorter timeframes and of requiring fewer computational resources. For less complicated problems or for resource-restricted circumstances, simpler models may be a more suitable option, as the cost-to-benefit ratio does not necessarily favor the use of complex models. Thus, designing an optimized model architecture that balances resource consumption and model performance remains an important open issue. Also, since scale tends to come with more memory consumption, larger or more advanced models require weighing factors such as computation time against the others.

In addition, the model might not be as strong on token-level tasks such as masked language modeling (MLM), but DiBERT nevertheless shows respectable results on short-text classification. Even though the overall MLM outcomes were not as good as one might expect, the final text classifier built on the MLM-pretrained model was able to resolve a complex document-level task, namely document classification. Future research may explore opportunities to leverage the strengths of transformer models similar to DiBERT at the token level by means of task-specific fine-tuning or architectural modifications prepared specifically for token processing. Nonetheless, accounting for the drawbacks and dilemmas involved in this approach, we argue that pretraining is a determining factor, alongside questions of model-size optimization and directions of future research that may be necessary for transformer-based models to be of high utility in disaster management and text classification tasks.

V. CONCLUSION

In conclusion, the DiBERT model, a short-text classification model for disaster incidents based on a bidirectional transformer architecture, has introduced a robust tool for disaster management and response. The experiments and the evaluation of the model's performance reveal its ability to ensure that disaster-related short texts are properly analyzed and classified. Experience with transformer-based models like DiBERT provides the basis for the development of disaster management systems that process disaster-related information better and faster. Such research findings further the development of natural language processing techniques in disaster response and thus aid in the extraction of useful information from a limited amount of text in emergency situations. This could help pave the way for further research and development in the area of disaster text classification that could be applied in real scenarios such as emergency response, crisis communication, and disaster preparedness. Conclusively, the DiBERT model is an indispensable tool for disaster management, as it provides quick and reliable classification of short texts in emergency situations.
VI. REFERENCES
[1] Tamara L. Sheldon and Crystal Zhan. The impact of natural disasters on US home ownership. Journal of the Association of Environmental and Resource Economists, 6(6):1169–1203, 2019.
[2] Firoj Alam, Umair Qazi, Muhammad Imran, and Ferda Ofli. HumAID: Human-annotated disaster incidents data from Twitter. In 15th International Conference on Web and Social Media (ICWSM), 2021.
[3] Shaheen Khatoon, Majed A. Alshamari, Amna Asif, Md Maruf Hasan, Sherif Abdou, Khaled Mostafa Elsayed, and Mohsen Rashwan. Development of social media analytics system for emergency event detection and crisis management. Comput. Mater. Contin, 68(3), 2021.
[4] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. CoRR, abs/2006.03654, 2020. URL https://arxiv.org/abs/2006.03654.
[5] Rani Koshy and Sivasankar Elango. Multimodal tweet classification in disaster response systems using transformer-based bidirectional attention model. Neural Computing and Applications, 35(2):1607–1627, 2023.
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[7] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
