
Multimedia Tools and Applications (2022) 81:18483–18501

https://doi.org/10.1007/s11042-022-12217-3

Enhancing multimodal disaster tweet classification using state-of-the-art deep learning networks

Divakaran Adwaith1,2 · Ashok Kumar Abishake1,2 · Siva Venkatesh Raghul1,2 · Elango Sivasankar1,2
Received: 11 July 2021 / Revised: 6 November 2021 / Accepted: 10 January 2022 /
Published online: 9 March 2022
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022

Abstract
During disasters, multimedia content on social media sites offers vital information. Reports of injured or deceased people, infrastructure destruction, and missing or found people are among the types of information exchanged. While several studies have demonstrated the importance of both text and picture content for disaster response, previous research has concentrated primarily on the text modality, with comparatively little success on multimodal approaches. The latest research on multimodal classification of disaster-related tweets uses comparatively primitive models such as KIMCNN and VGG16. In this work we take this further and utilize state-of-the-art models in both text and image classification to improve multimodal classification of disaster-related tweets. The research was conducted on two classification tasks: first, to detect whether a tweet is informative or not; second, to understand the response needed. The multimodal analysis proceeds by extracting features from the textual corpus and pre-processing the corresponding image corpus, after which several classification models are trained and their predictions compared while tuning parameters to improve the results. Models such as XLNet, BERT and RoBERTa for text classification and ResNet, ResNeXt and DenseNet for image classification were trained and analyzed. Results show that the proposed multimodal architecture outperforms models trained using a single modality (text or image alone), and that the newer state-of-the-art models outperform the baseline models by a reasonable margin on both classification tasks.

Divakaran Adwaith
adwaith3208@gmail.com

Ashok Kumar Abishake
abishake.dev@gmail.com

Siva Venkatesh Raghul
sraghul127@gmail.com

Elango Sivasankar
sivasankar@nitt.edu

1 Department of Computer Science, National Institute of Technology, Tiruchirappalli, India


2 Tanjore Main Road, National Highway 67, Near BHEL Trichy, Tiruchirappalli,
Tamil Nadu 620015, India

Keywords Multimodal analysis · Deep learning · Disaster response · Tweet classification

1 Introduction

Social media has become one of the most powerful tools during times of natural
calamity/disaster. Many organizations and individuals use social media extensively to post
and collect information. They provide valuable, live information such as the extent of the
crisis’s damage, the number of people affected, and so on. Large-scale disaster response
organizations can use these tweets to make better decisions and respond more quickly. As
a result, more agencies, states, and nonprofits (e.g., disaster relief organizations and news agencies) are interested in automating Twitter tracking. However, a computer has a hard time understanding the information presented in a tweet. For example, a tweet might describe a beautiful orange evening sky as "the evening sky is ablaze", but only in a metaphorical sense. This is immediately apparent to a person, particularly with a visual aid. Building a classification model that combines both textual and visual content to make a decision would have a huge impact on modern disaster management. An attempt has been made to improve multimodal analysis of tweet data using both text and image modalities to predict whether a given tweet is informative for humanitarian aid and, if so, to classify the type of information. Previous results showed that incorporating images alongside on-topic text was beneficial during a disaster and that the imagery content contained important facts about the event [9]. However, previous work overlooked the fact that using state-of-the-art models for each modality would make feature extraction more efficient and lead to better results. As a result, we propose using state-of-the-art deep
learning techniques to learn a joint representation from both modalities (text and image) of
social media data. Furthermore, we run separate experiments in which we train models with
(i) only tweet text, (ii) only tweet image, and (iii) both tweet text and image (multimodal).
Based on the findings, we attempt to demonstrate that the proposed multimodal architecture
outperforms models trained using a single modality (e.g., either text or image).

2 Literature review

Existing work in the field of multimodal analysis for disaster response uses older neural networks such as KIMCNN, traditional LSTMs and VGG16 [5, 9]. There is considerable scope for improvement, since Transformer models like BERT look at the text bidirectionally, thus better simulating human reading behaviour, whereas KIMCNN combined with Continuous Bag Of Words (CBOW) looks at it in a unidirectional fashion [4, 7]. Similarly, models like VGG16 are very heavy [12], making the classifier slow, and they give lower accuracy than more recent models. Newer models like DenseNet, through their innovative architecture, provide excellent results while significantly reducing the number of parameters [3]. In this study, we evaluate various modern state-of-the-art models, striving for an improvement in accuracy while justifying their selection by their superior architectures.
Novel weakly-shared Deep Transfer Networks (DTNs) that transfer cross-domain information from the text domain to the image domain have also been used in previous research on feature learning from multimodal data [11, 14]. The authors propose that DTNs with weakly parameter-shared layers can capture complicated representations of data from several domains more effectively. That line of work is primarily focused on getting information from one domain and using it in another, whereas in our work we extract information from both domains simultaneously and perform classification based on it. Other works in the field of utilising Twitter posts for disaster recovery involve tweet classification combined with location estimation using Markov chains for response [13].
The comparative study is done based on text models such as BERT [1], XLNet [16]
and RoBERTa [6] and image classification models such as ResNet [2], ResNeXt [15] and
DenseNet [3]. These specific models are chosen because they are among the top performers in the field. In addition, following previous research, models using KIMCNN and VGG are trained and compared as baselines [9].

3 Methodology

3.1 Overview

The idea can be split into three parts:


1. Use state-of-the-art Text classification models like BERT, XLNet and
RoBERTa to enhance text classification accuracy.
2. Use state-of-the-art Image classification models like ResNet, ResNeXt and DenseNet
to enhance image classification accuracy.
3. Use different combinations of the models mentioned above, one for text and one for image, for feature extraction, and use the extracted features to train another classifier model.
First, the preprocessed text is used to train BERT, XLNet and RoBERTa as classifiers, which are benchmarked against the base model KIMCNN. Then, the preprocessed images are used to train ResNet, ResNeXt and DenseNet as classifiers, which are benchmarked against the base model VGG. Finally, different combinations of text and image models are used as feature extractors, and the extracted features are used to train a simple neural network classifier, which is benchmarked against the base model KIMCNN + VGG. Figure 1 shows the overall workflow of the research work.

3.2 Data pre-processing

3.2.1 Text pre-processing

Cleaning up the corpus is required in order to work with a standard dataset that is free
of extraneous noise. Figure 2 illustrates the Text pre-processing steps. The pre-processing
consists of the following steps:
1. If the dataset was collected from the web, all irrelevant tags are removed. In terms of disaster response, these tags provide no further information. Punctuation and special characters, which typically make the text noisy, are also removed. This is done using regular expressions.
2. Removing stop words and parts of speech. Stop words are not significant when meaningful features are reconstructed from the text.
3. Tokens are lemmatised so that several variants of the same root word, which contribute the same meaning, are unified. Words may essentially be divided into their proper stems and suffixes or prefixes to generate new words [10]. This is done using tokenizer modules that follow different algorithms specific to the model being trained. The modules used are BertTokenizer, XLNetTokenizer and RobertaTokenizer from HuggingFace, with the model names "bert-base-cased", "xlnet-base-cased" and "roberta-base" respectively.

Fig. 1 Experimentation Methodology Flowchart (text and image inputs are preprocessed and fed to the baseline and state-of-the-art models of each modality; the nine multimodal combinations and the single-modality models are then evaluated and compared)
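As a rough illustration of the tokenization described in step 3, the sketch below applies the three HuggingFace tokenizers to a cleaned tweet. The example tweet and the maximum sequence length of 128 are illustrative assumptions, not values reported in this work.

```python
from transformers import BertTokenizer, XLNetTokenizer, RobertaTokenizer

# Tokenizers corresponding to the three transformer models used in this study.
tokenizers = {
    "bert": BertTokenizer.from_pretrained("bert-base-cased"),
    "xlnet": XLNetTokenizer.from_pretrained("xlnet-base-cased"),
    "roberta": RobertaTokenizer.from_pretrained("roberta-base"),
}

tweet = "flood waters rising near the main bridge rescue teams on site"  # hypothetical cleaned tweet

for name, tokenizer in tokenizers.items():
    # max_length=128 is an assumed value used here only for illustration.
    encoded = tokenizer(
        tweet,
        padding="max_length",
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    print(name, encoded["input_ids"].shape)
```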
Fig. 2 Text Preprocessing Approach (non-alphanumeric characters and stop words are removed; the text is tokenized with the BERT, RoBERTa and XLNet tokenizer modules before being fed to the corresponding models, while Word2Vec-based embeddings are generated for the baseline KIMCNN model)

3.2.2 Image pre-processing

The images need to be preprocessed in order to match the ImageNet specifications. This is important for two main reasons: first, networks like ResNet, ResNeXt and DenseNet are designed with input dimensions corresponding to ImageNet [5, 9, 10]; second, if the images are normalized with the same statistics used when training on the ImageNet dataset, initializing the network with pretrained weights is more effective. Figure 3 illustrates the image pre-processing steps.
The pre-processing consists of the following steps:
1. Splitting the dataset into the Train, Validation and Test dataset. It is split in the ratio
70-15-15 respectively.
2. The image is then resized to a size of 256x256.
3. Centercrop with a size of 224x224 is performed on the image to extract the most important region of the image.
4. The image is normalized according to ImageNet statistics.

Fig. 3 Image Preprocessing Approach (the CrisisMMD images are split into train, validation and test sets, then transformed according to the ImageNet conventions: resize, centre crop to 224x224, and normalisation with ImageNet statistics, before being fed to the ImageNet-based models)
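A minimal sketch of this image pipeline using torchvision transforms is shown below; the resize and crop sizes follow the steps above, and the normalisation constants are the standard ImageNet channel statistics. The file name is hypothetical.

```python
from PIL import Image
from torchvision import transforms

# Standard ImageNet channel statistics used for normalisation.
imagenet_mean = [0.485, 0.456, 0.406]
imagenet_std = [0.229, 0.224, 0.225]

# Resize to 256x256, centre-crop to 224x224, convert to a tensor and normalise.
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=imagenet_mean, std=imagenet_std),
])

img = Image.open("tweet_image.jpg").convert("RGB")  # hypothetical image path
x = preprocess(img)  # tensor of shape [3, 224, 224], ready for the ImageNet-based models
```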

3.3 Building the text models

3.3.1 Architecture

The text models were constructed using the following approach.


1. Take the default model corresponding to that particular architecture.
2. Add a dropout of 0.1.
3. Add a fully connected layer with input dimension matching the last layer of architecture
and output dimension as 2 (for Informative Task) or as 5 (for Humanitarian task).
4. Add a LogSoftmax layer.
The modules used were BertModel, XLNetModel and RobertaModel from the HuggingFace library, with the model names "bert-base-cased", "xlnet-base-cased" and "roberta-base" respectively. The weights are initialized to pre-trained weights. No weights are frozen and the entire network is trained/fine-tuned.
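A minimal sketch of this construction for BERT is given below (the XLNet and RoBERTa variants are analogous). The layer names are our own, and the use of the pooled output as the sentence representation is an assumption about how the encoder output is reduced before the classification head.

```python
import torch.nn as nn
from transformers import BertModel

class TextClassifier(nn.Module):
    def __init__(self, num_classes=2):  # 2 for the Informative task, 5 for the Humanitarian task
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-cased")  # pre-trained weights, not frozen
        self.dropout = nn.Dropout(0.1)
        # Fully connected layer matching the encoder's hidden size (768 for bert-base).
        self.fc = nn.Linear(self.encoder.config.hidden_size, num_classes)
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.pooler_output  # assumption: the pooled representation feeds the classifier head
        return self.log_softmax(self.fc(self.dropout(pooled)))
```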

3.3.2 Learning rate

Different learning rates in the range 1e-1 to 1e-5 were tried for each model to find the rate that optimizes that particular model well. Different learning rate schedulers were also tried for each model.
1. The first is a linear scheduler that steps the learning rate down uniformly after every batch so that it reaches zero at the last step. The number of steps equals the product of the number of batches and the number of epochs. The get_linear_schedule_with_warmup module from the Transformers library was used with 0 warm-up steps.
2. The second scheduler reduces the learning rate by a given factor when a monitored metric stops improving. The ReduceLROnPlateau module from the PyTorch schedulers was used, with validation accuracy as the monitored quantity, a patience of 10, a threshold of 0.01 and a min_lr of 1e-5. The factor was left at its default value and the scheduler was operated in "max" mode.
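The two schedulers can be set up roughly as follows. The AdamW optimizer and the stand-in model are illustrative assumptions; the scheduler arguments are taken from the description above.

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(10, 2)  # stand-in for the actual classifier
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # optimizer choice is an assumption

# Scheduler 1: linear decay to zero over all training steps, with 0 warm-up steps.
num_batches, num_epochs = 100, 40  # illustrative values
linear_scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_batches * num_epochs
)

# Scheduler 2: reduce the learning rate when the monitored validation accuracy plateaus.
plateau_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", patience=10, threshold=0.01, min_lr=1e-5
)

# linear_scheduler.step() is called after every batch;
# plateau_scheduler.step(val_accuracy) is called after each validation pass.
```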

3.3.3 Epochs and loss function

The number of epochs was set to 40 for the Informative Task and to 100 for the Humanitarian Task. The loss function used was the cross entropy loss function. Batch size was
determined based on the memory requirement of the model and the memory available on
Google Colab.

3.3.4 BERT

BERT was created by Google and, unlike earlier language representation models, it pre-trains deep representations in a bi-directional manner. The bi-directional training of the BERT architecture, as opposed to the unidirectional or shallowly bi-directional training of models like OpenAI GPT and ELMo, produces better results. It employs an attention mechanism in each layer [1].
BERT was trained with a batch size of 128. For the Informative Task, a learning rate of
2e-5 was used along with the Linear Scheduler. For the Humanitarian Task, a learning rate
of 1e-1 was used along with ReduceLROnPlateau.

3.3.5 XLNet

XLNet is an unsupervised language representation learning method based on a novel generalized permutation language modelling objective. XLNet's backbone model is Transformer-XL, which has been shown to perform well in language tasks with long context. On a variety of language tasks, such as question answering, natural language inference, and sentiment analysis, XLNet achieves state-of-the-art (SOTA) results [16].
XLNet was trained with a batch size of 48 for the Informative Task along with a learning
rate of 2e-5 and Linear Scheduler. For the Humanitarian Task, a learning rate of 1e-1 was
used along with the Linear Scheduler with a batch size of 32.
18490 Multimedia Tools and Applications (2022) 81:18483–18501

3.3.6 RoBERTa

RoBERTa is an acronym for Robustly Optimized BERT Pre-training Approach. It was created by researchers at Facebook and the University of Washington. The primary aim of this work was to refine BERT's training procedure in order to reduce the time spent on pre-training. RoBERTa's architecture is very similar to BERT's, but the authors made a few design improvements to the architecture and training procedure to address BERT's shortcomings. These modifications are as follows: remove the Next Sentence Prediction (NSP) objective, train with larger batch sizes and longer sequences, and change the masking pattern dynamically [6].
RoBERTa was trained with a batch size of 128. For the Informative Task, a learning rate
of 2e-5 was used along with the Linear Scheduler. For the Humanitarian Task, a learning
rate of 1e-1 was used along with ReduceLROnPlateau.

3.4 Building the image models

3.4.1 Architecture

The image models were constructed using the following approach.


1. Take the default model corresponding to the architecture. Since all the models are ImageNet-based, the output dimension is fixed at 1000, unlike the text models.
2. Add Batch Normalization for these 1000 outputs.
3. Add a fully connected layer with input dimension as 1000 and output dimension as 256.
4. Add the ReLU function.
5. Add a dropout of 0.4.
6. Add another fully connected layer with input dimension as 256 and output dimension
as 2 (for Informative Task) or as 5 (for Humanitarian task).
7. Add a LogSoftmax layer.
The modules used were "resnet18", "resnext50_32x4d" and "densenet161" from the Torchvision models library. The weights are initialized to pre-trained weights. No weights are frozen and the entire network is trained/fine-tuned.
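A minimal sketch of this image-model head on top of a pretrained ResNet-18 is shown below; the ResNeXt-50 and DenseNet-161 variants are built the same way by swapping the backbone.

```python
import torch.nn as nn
from torchvision import models

class ImageClassifier(nn.Module):
    def __init__(self, num_classes=2):  # 2 for the Informative task, 5 for the Humanitarian task
        super().__init__()
        # Pretrained ImageNet backbone whose final layer outputs 1000 logits; nothing is frozen.
        self.backbone = models.resnet18(pretrained=True)
        self.head = nn.Sequential(
            nn.BatchNorm1d(1000),
            nn.Linear(1000, 256),
            nn.ReLU(),
            nn.Dropout(0.4),
            nn.Linear(256, num_classes),
            nn.LogSoftmax(dim=1),
        )

    def forward(self, x):
        return self.head(self.backbone(x))
```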

3.4.2 Learning rate, epochs and loss function

We observed that the best learning rate and scheduler did not vary between models. All the image models were trained with a learning rate of 2e-5 and a linear scheduler. The number of epochs was set to 40 for the Informative Task and to 50 for the Humanitarian Task. The loss function used is again the cross entropy loss function.
Batch size was determined based on the memory requirement of the model and the memory
available on Google Colab.

3.4.3 ResNet

ResNet (Residual Network) is a type of neural network introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in their 2015 paper "Deep Residual Learning for Image Recognition". The network uses skip connections to address the problem of vanishing or exploding gradients. The advantage of this type of skip connection is that any layer that degrades performance can effectively be skipped through regularization [2].

ResNet was trained with a batch size of 512.

3.4.4 ResNeXt

ResNeXt is a simple and modular network architecture made for advanced image clas-
sification. The ResNeXt network is built by reusing a unit block that combines a set
of transformations that follow the same structure. With just a few hyper-parameters to
configure, the simple design yields a homogeneous multi-branch architecture. This tech-
nique reveals a new factor called “cardinality” (the size of the set of transformations) as a
significant element in addition to depth and width [15].
ResNeXt was trained with a batch size of 64.

3.4.5 DenseNet

DenseNet is a neural network in which each layer is connected in a feed-forward fashion to every other layer within the same dense block. The feature maps of all previous layers are treated as separate inputs to each layer, while the current layer's feature maps are passed as inputs to all subsequent layers. This connectivity structure produces state-of-the-art performance on CIFAR-10/100 and SVHN. DenseNet achieves accuracy comparable to ResNet on the large-scale ILSVRC 2012 (ImageNet) dataset while using fewer than half the model parameters and roughly half the FLOPs [3].
DenseNet was trained with a batch size of 32.

3.5 Building the multimodal models

3.5.1 Architecture

Models were built using different combinations of text and image classifiers to achieve the best multimodal accuracy. In total, nine multimodal models were trained and validated (3 text models × 3 image models = 9 combinations, based on the models mentioned above). The two feature vectors obtained from the modalities are fed into a shared representation before a prediction is made. Figure 4 shows how the different multimodal models are built using different combinations of text and image models. The multimodal model is built using the following steps:
1. Take two models one for text and one for image and freeze their weights.
2. Concatenate their outputs.
3. Pass the concatenated outputs to the classifier, whose architecture is given below.
For multimodal classification we use the Text models and Image models for feature
extraction and thus their weights are frozen and are not included in the training process.
Only the classifier is trained for multimodal classification. Table 1 describes the architecture
of the classifier used in multimodal models.
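A rough sketch of this setup is shown below: the text and image models act as frozen feature extractors, their outputs are concatenated, and the concatenation is passed to a classifier following the structure of Table 1. Here `text_model` and `image_model` stand for any of the fine-tuned models above, and `text_dim`/`image_dim` are the dimensions of the feature vectors they return; exactly how those feature vectors are taken from the underlying networks is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class MultimodalClassifier(nn.Module):
    def __init__(self, text_model, image_model, text_dim, image_dim, num_classes=2):
        super().__init__()
        self.text_model, self.image_model = text_model, image_model
        # Freeze both feature extractors; only the classifier head below is trained.
        for p in self.text_model.parameters():
            p.requires_grad = False
        for p in self.image_model.parameters():
            p.requires_grad = False
        d = text_dim + image_dim
        # Classifier head following Table 1 (including its ReLU/dropout after the last linear layer).
        self.classifier = nn.Sequential(
            nn.BatchNorm1d(d),
            nn.Linear(d, 1000), nn.ReLU(),
            nn.Linear(1000, 500), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(500, 250), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(250, 125), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(125, num_classes), nn.ReLU(), nn.Dropout(0.02),
            nn.LogSoftmax(dim=1),
        )

    def forward(self, input_ids, attention_mask, image):
        text_feat = self.text_model(input_ids, attention_mask)  # frozen text features
        image_feat = self.image_model(image)                    # frozen image features
        return self.classifier(torch.cat([text_feat, image_feat], dim=1))
```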

3.5.2 Learning rate, epochs and loss function

We observed that the best learning rate and scheduler did not vary between models, but the best learning rate did vary depending on the task. For the Informative Task the best learning rate was 2e-5, and only a small amount of fine-tuning was needed, whereas for the Humanitarian Task the best learning rate was 1e-2. A linear scheduler was used for all models.
Fig. 4 Changes proposed on the base paper

The number of epochs was set to 50 for the Informative Task and to 70 for the Humanitarian Task. The loss function used is again the cross entropy loss function. The batch size was determined based on the memory requirement of the model and the memory available on Google Colab, and remained constant at 512 for every model.

4 Experimentation

4.1 Dataset

The dataset that has been used here is the CrisisMMD: Multimodal Twitter Datasets from Natural Disasters [8]. It contains hundreds of thousands of manually annotated tweets and photographs gathered during seven major natural disasters in 2017 around the globe, including earthquakes, hurricanes, wildfires, and floods. 70 percent of the dataset was used for training, 15 percent for validation and 15 percent for testing.

Table 1 Architecture of our neural network classifier

Layer name          | Input                                          | Output                                         | Other functions
Batch Normalization | Sum of text and image model output dimensions  | Sum of text and image model output dimensions  | -
Linear 1            | Sum of text and image model output dimensions  | 1000                                           | ReLU
Linear 2            | 1000                                           | 500                                            | ReLU, Dropout = 0.4
Linear 3            | 500                                            | 250                                            | ReLU, Dropout = 0.2
Linear 4            | 250                                            | 125                                            | ReLU, Dropout = 0.1
Linear 5            | 125                                            | 1 or 5                                         | ReLU, Dropout = 0.02
LogSoftmax          | 1 or 5                                         | 1 or 5                                         | -
Dataset used for Informative Task: a set comprising tweet text and its corresponding image, each labelled as either informative or not-informative. To work with this dataset, informative labels are assigned a value of 1 and not-informative labels a value of 0.
Dataset used for Humanitarian Task: a set comprising tweet text and its corresponding image, each labelled with the humanitarian response needed, namely: not humanitarian, infrastructure and utility damage, other relevant information, rescue volunteering or donation effort, and affected individuals. To work with this dataset, the labels are assigned values from 0 to 4 correspondingly.
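For reference, these label encodings can be written as simple mappings; the exact string keys below are illustrative and simply follow the class order given above.

```python
# Informative task: binary labels.
informative_labels = {"not_informative": 0, "informative": 1}

# Humanitarian task: five classes encoded 0-4 in the order listed above.
humanitarian_labels = {
    "not_humanitarian": 0,
    "infrastructure_and_utility_damage": 1,
    "other_relevant_information": 2,
    "rescue_volunteering_or_donation_effort": 3,
    "affected_individuals": 4,
}
```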

4.2 Evaluation metrics

We have used multiple metrics to evaluate the performance of the models. They are:
1. Accuracy
2. F1 score
3. Precision
4. Recall
The ratio of correct predictions to the total number of instances present in the dataset is defined as Accuracy:

Accuracy = Number of correct predictions / Total number of predictions made    (1)
The F1 score is the harmonic mean of precision and recall:

F1 = (2 · Precision · Recall) / (Precision + Recall)    (2)

Fig. 5 Dataset Distribution in Informative Dataset

The precision is given by:

Precision = TP / (TP + FP)    (3)

The recall is given by:

Recall = TP / (TP + FN)    (4)
TP stands for True Positive; FP stands for False Positive, and FN stands for False
Negative in the above equations.
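These metrics can be computed, for example, with scikit-learn as sketched below; the weighted averaging over classes is an assumption about how the per-class precision, recall and F1 values are aggregated for the multiclass task.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 1, 2, 2, 2]  # illustrative ground-truth labels
y_pred = [0, 1, 2, 2, 2, 1]  # illustrative predictions

accuracy = accuracy_score(y_true, y_pred)
# average="weighted" aggregates per-class precision/recall/F1 by class frequency.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(accuracy, precision, recall, f1)
```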

Fig. 6 Dataset Distribution in Humanitarian Dataset



Table 2 Performance analysis for informative task on different text models

Model Accuracy Precision Recall F1 - Score

Base 0.7614 0.7538 0.7614 0.7458
BERT 0.8462 0.8438 0.8462 0.8440
XLNet 0.8592 0.8573 0.8592 0.8568
RoBERTa 0.8625 0.8617 0.8625 0.8586

The bold entries correspond to the best performing model for that classification task

The general observation is that when the data is not evenly split amongst the classes, the F1 score gives a better means of validation. Figures 5 and 6 show the class distributions of the two datasets. Since the distribution of training data is uneven for both datasets, the F1 score is given higher importance.

4.3 Performance analysis

4.3.1 Text models

We analyze the text models first and look at how different Transformer models compare
against each other and against the Base Model which uses the Continuous Bag of Words
(CBOW) along with the Word-to-vector model and KIMCNN. The transformer models used
here are BERT, XLNet and RoBERTa. The transformer models use pre-trained tokenizers
corresponding to the model. Tables 2 and 3 compare the performance of the different text models using different metrics. Figures 7 and 8 plot the F1 scores of these models.
We can clearly observe that the state-of-the-art transformer models outperform the base model. The best text model for the Informative task is RoBERTa, which performs 5.45% better than the base model. Similarly, the best model for the Humanitarian task is XLNet, which outperforms the baseline model by 9.39%.

4.3.2 Image models

We then analyze the image models and look at how the different Torchvision models compare against each other and against the base model, which uses VGG16. The torchvision models used are ResNet, ResNeXt and DenseNet. Tables 4 and 5 compare the performance of the different image models using different metrics. Figures 9 and 10 plot the F1 scores of these models.
Here, we observe that the state-of-the-art torchvision models outperform the base model. The best image model for the Informative task is ResNeXt, which performs 2.29% better than the

Table 3 Performance analysis for humanitarian task on different text models

Model Accuracy Precision Recall F1 - Score

Base 0.7120 0.7085 0.7120 0.6969
BERT 0.7267 0.7261 0.7267 0.7228
XLNet 0.7979 0.8086 0.7979 0.8013
RoBERTa 0.7613 0.7619 0.7613 0.7539

The bold entries correspond to the best performing model for that classification task

Fig. 7 F1 Score Analysis for Informative Task on different Text Models

Fig. 8 F1 Score Analysis for Humanitarian Task on different Text Models

Table 4 Performance analysis for informative task on different image models

Model Accuracy Precision Recall F1 - Score

Base 0.8330 0.8310 0.8330 0.8320
ResNet 0.8488 0.8476 0.8488 0.8480
ResNeXt 0.8559 0.8541 0.8559 0.8545
DenseNet 0.8527 0.8510 0.8527 0.8515

The bold entries correspond to the best performing model for that classification task

Table 5 Performance analysis for humanitarian task on different image models

Model Accuracy Precision Recall F1 - Score

Base 0.7680 0.7640 0.7680 0.7630
ResNet 0.7916 0.7930 0.7916 0.7878
ResNeXt 0.8073 0.8074 0.8073 0.8038
DenseNet 0.8115 0.8086 0.8115 0.8085

The bold entries correspond to the best performing model for that classification task

Fig. 9 F1 Score Analysis for Informative Task on different Image Models

Fig. 10 F1 Score Analysis for Humanitarian Task on different Image Models



Table 6 Performance analysis for informative task on different multimodal models

Model Accuracy Precision Recall F1 - Score

Base 0.8440 0.8410 0.8400 0.8420
BERT + ResNet 0.8735 0.8721 0.8735 0.8724
BERT + ResNeXt 0.8892 0.8882 0.8892 0.8883
BERT + DenseNet 0.8866 0.8854 0.8866 0.8854
XLNet + ResNet 0.8527 0.8574 0.8527 0.8449
XLNet + ResNeXt 0.8540 0.8568 0.8540 0.8472
XLNet + DenseNet 0.8665 0.8732 0.8623 0.8683
RoBERTa + ResNet 0.8950 0.8955 0.8950 0.8926
RoBERTa + ResNeXt 0.8911 0.8910 0.8911 0.8888
RoBERTa + DenseNet 0.8918 0.8915 0.8918 0.8896

The bold entries correspond to the best performing model for that classification task

Table 7 Performance analysis for humanitarian task on different multimodal models

Model Accuracy Precision Recall F1 - Score

Base 0.7840 0.7850 0.7800 0.7830
BERT + ResNet 0.8304 0.8297 0.8304 0.8293
BERT + ResNeXt 0.8251 0.8248 0.8251 0.8237
BERT + DenseNet 0.8586 0.8584 0.8586 0.8576
XLNet + ResNet 0.8157 0.8083 0.8157 0.8107
XLNet + ResNeXt 0.8272 0.8264 0.8272 0.8253
XLNet + DenseNet 0.8262 0.8193 0.8262 0.8221
RoBERTa + ResNet 0.8398 0.8329 0.8398 0.8343
RoBERTa + ResNeXt 0.8398 0.8395 0.8398 0.8375
RoBERTa + DenseNet 0.8597 0.8598 0.8597 0.8587

The bold entries correspond to the best performing model for that classification task

Fig. 11 F1 Score Analysis for Informative Task on different Multimodal Models



Fig. 12 F1 Score Analysis for Humanitarian Task on different Multimodal Models

base model. Similarly, the best model for the Humanitarian task is DenseNet, which outperforms the baseline model by 4.35%.

4.3.3 Multimodal models

Now we analyze the multimodal models and look at how different combinations of text
models and image models compare against each other and against the Base Model which
uses CBOW, Word-to-vector and KIMCNN for text modality and VGG16 for image
modality.
Different models under comparison are BERT + ResNet, BERT + ResNeXt, BERT +
DenseNet, XLNet + ResNet, XLNet + ResNeXt, XLNet + DenseNet, RoBERTa + ResNet,
RoBERTa + ResNeXt, RoBERTa + DenseNet. Tables 6 and 7 compare the performance of the different multimodal models using different metrics. Figures 11 and 12 plot the F1 scores of these models.
We can clearly see that the combinations of state-of-the-art models for each modality outperform the baseline model by a large margin. The best model for the Informative task is RoBERTa + ResNet, which outperforms the baseline model by 5.10%. Similarly, the best model for the Humanitarian task is RoBERTa + DenseNet, which outperforms the baseline model by 7.57%.

5 Conclusion

In this paper, we implemented learning of a joint representation of social media data using
both text and image modalities. To perform two classification tasks, we used state-of-the-
art deep learning architectures which learn high-level feature representations from text and
images.
For the text modality, we compared different Transformer-based models. We observed that RoBERTa gives the best accuracy when text modality is considered alone for the Informative task, while XLNet gives the best accuracy when text modality is considered alone for the Humanitarian task. All of the models outperformed the baseline model. We can conclude that XLNet and RoBERTa outperform BERT. For binary classification RoBERTa outperforms XLNet, while for multiclass classification XLNet outperforms RoBERTa.
When it comes to the image modality, we compared different models from the Torchvision library. We observed that all three models outperformed the baseline model significantly. ResNeXt and DenseNet performed relatively well compared to ResNet. When image modality alone is considered, ResNeXt is the best model for the Informative task while DenseNet is the best model for the Humanitarian task. The relative performance of the image models can be summarized as: ResNet < ResNeXt = DenseNet.
Multimodal models were one of the most interesting parts of the research. As mentioned earlier, the multimodal models use the features given by the text and image models. The text and image models use their pretrained weights, and only the final classification layers are trained. This led to some interesting observations. Despite XLNet performing better than BERT on the text-only classifications, it does not outperform BERT in the multimodal setting. This can be attributed to the fact that the features output by XLNet are not as diverse and do not encompass as complete a set of features as those produced by BERT. The best model for the Informative Task was RoBERTa + ResNet, giving an accuracy of about 89.5%, which is 5.1% better than the baseline model. The best model for the Humanitarian Task was RoBERTa + DenseNet, giving an accuracy of about 85.97%, which is 7.57% better than the baseline model.
In conclusion, we can safely say that this research has provided promising results that help improve the accuracy of Twitter data classification, which in turn has the potential to save lives. This illustrates the impact of the study; however, it is not the end of the road. Most day-to-day Twitter posts do not have a strong correspondence between the text and image modalities. This setting is still under-explored, and a challenging future direction would be to develop a model that also works well with such heterogeneous multimodal inputs.

Availability of Data and Material All the datasets used for supporting the conclusions are available at:
https://crisisnlp.qcri.org/crisismmd.

Code Availability All the code developed and used in this research are available at: https://github.com/
adwaith007/disaster-response-cnn.

Declarations

Conflict of interest The authors declare that they have no competing interests.

References

1. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, vol 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, pp 4171–4186
2. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE
Conference on computer vision and pattern recognition (CVPR), pp 770–778
3. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks.
In: 2017 IEEE Conference on computer vision and pattern recognition (CVPR), pp 2261–2269

4. Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the
2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for
Computational Linguistics, Doha, pp 1746–1751
5. Kumar A, Singh JP, Dwivedi YK, Rana NP (2020) A deep multi-modal neural network for informative
twitter content classification during emergencies. Ann Oper Res:1–32
6. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V
(2020) RoBERTa: A robustly optimized BERT pretraining approach. In: (ICLR 2020). Conference Blind
Submission
7. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector
space. In: Bengio Y, LeCun Y (eds) 1st international conference on learning representations, ICLR 2013,
workshop track proceedings, Scottsdale, pp 1–12
8. Ofli F, Alam F, Imran M (2018) CrisisMMD: multimodal twitter datasets from natural disasters. In:
International AAAI conference on web and social media, North America, pp 465–473
9. Ofli F, Alam F, Imran M (2020) Analysis of social media data using multimodal deep learning for
disaster response. In: Hughes A, McNeill F, Zobel CW (eds) ISCRAM 2020 Conference proceedings
- 17th international conference on information systems for crisis response and management. Virginia
Tech, Blacksburg, pp 802–811
10. Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword units. In: Proceedings of the 54th annual meeting of the association for computational linguistics, vol 1 (Long Papers). Association for Computational Linguistics, Berlin, pp 1715–1725
11. Shu X, Qi G-J, Tang J, Wang J (2015) Weakly-shared deep transfer networks for heterogeneous-domain
knowledge propagation. In: Proceedings of the 23rd ACM international conference on multimedia (MM
’15). Association for Computing Machinery, New York, pp 35–44
12. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: Bengio Y, LeCun Y (eds) 3rd international conference on learning representations, ICLR 2015, conference track proceedings, San Diego
13. Singh JP, Dwivedi YK, Rana NP, Kumar A, Kapoor K (2019) Event classification and location
prediction from tweets during disasters. Ann Oper Res 283:737–757
14. Tang J, Shu X, Li Z, Qi G-J, Wang J (2016) Generalized Deep Transfer Networks for Knowledge Prop-
agation in Heterogeneous Domains. ACM Trans Multimed Comput Commun Appl 12, 4s, Article 68,
22
15. Xie S, Girshick R, Dollár P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: 2017 IEEE Conference on computer vision and pattern recognition (CVPR), pp 5987–5995
16. Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov R, Le QV (2019) XLNet: Generalized autoregressive pretraining for language understanding. In: 33rd conference on neural information processing systems (NeurIPS), Vancouver

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
