
Displays 80 (2023) 102563


Review

Multimodal sentiment analysis: A survey✩


Songning Lai a, Xifeng Hu a, Haoxuan Xu a, Zhaoxia Ren b, Zhi Liu a,∗

a School of Information Science and Engineering, Shandong University, Qingdao, China
b Assets and Laboratory Management Department, Shandong University, Qingdao, China

ARTICLE INFO

Keywords: Multimodal sentiment analysis; Multimodal fusion; Affective computing; Computer vision

ABSTRACT

Multimodal sentiment analysis has emerged as a prominent research field within artificial intelligence, benefiting immensely from recent advancements in deep learning. This technology has unlocked unprecedented possibilities for application and research, rendering it a highly sought-after area of study. In this review, we aim to present a comprehensive overview of multimodal sentiment analysis by delving into its definition, historical context, and evolutionary trajectory. Furthermore, we explore recent datasets and state-of-the-art models, with a particular focus on the challenges encountered and the future prospects that lie ahead. By offering constructive suggestions for promising research directions and the development of more effective multimodal sentiment analysis models, this review intends to provide valuable guidance to researchers in this dynamic field.

1. Introduction

Emotion refers to the subjective response of organisms to external stimuli [1]. Humans possess a remarkable ability for sentiment analysis, and researchers are actively exploring methods to transfer this capability to artificial agents [2].

Sentiment analysis involves the examination of sentiment polarity using available information [3,4]. With the rapid advancements in artificial intelligence, computer vision, and natural language processing, the implementation of sentiment analysis by artificial agents is becoming increasingly feasible. It is an interdisciplinary research field that encompasses computer science, psychology, social science, and other related disciplines [5–7].

For decades, scientists have been dedicated to equipping AI agents with sentiment analysis capabilities, as it is a crucial aspect of achieving human-like AI.

Sentiment analysis holds significant research value, evident from a multitude of studies [8–11]. With the exponential growth of online data, businesses can leverage evaluative information, such as reviews and review videos, to enhance their products and services. Additionally, sentiment analysis has found applications in diverse areas including lie detection, interrogation, and entertainment. Subsequent sections will delve into the practical applications and research potential of sentiment analysis.

In the past, sentiment analysis primarily focused on a single modality, such as visual modality, speech modality, or text modality [12]. Text-based sentiment analysis [13–15] has made significant strides in the field of natural language processing (NLP). Vision-based sentiment analysis emphasizes the analysis of human facial expressions [16] and body language. Speech-based sentiment analysis predominantly extracts features like pitch, timbre, and intonation from speech signals for sentiment analysis [17]. Thanks to advancements in deep learning, these three modalities have made notable progress within the realm of sentiment analysis.

Using a single modality for sentiment analysis has inherent limitations [18–21]. Emotional information captured by a single modality is often limited and incomplete. Conversely, leveraging multiple modalities allows for a more comprehensive understanding of emotional polarity. Analyzing only one modality yields restricted results and hinders accurate emotion analysis.

Recognizing the significance of multi-modal sentiment analysis, researchers have progressively developed numerous models to address this need. Text features, for instance, dominate and serve as a crucial component in deep emotional analysis [22]. Extracting expression and pose features from the visual modality effectively complements text sentiment analysis and judgment [23]. Speech modality, on one hand, extracts textual features, while also enabling the recognition of speech tones to unveil the temporal dynamics of the underlying text [24].

Fig. 1 illustrates the model architecture for a classical multi-modal sentiment analysis, comprising three essential parts: individual modality feature extraction, fusion of modality features, and sentiment analysis based on the fused features. These three components are of utmost importance, and researchers have initiated individual optimization efforts for each part [25].

✩ This paper was recommended for publication by Guangtao Zhai.


∗ Corresponding author.
E-mail addresses: 202000120172@mail.sdu.edu.cn (S. Lai), 201942544@mail.sdu.edu.cn (X. Hu), 202020120237@mail.sdu.edu.cn (H. Xu),
renzx@sdu.edu.cn (Z. Ren), liuzhi@sdu.edu.cn (Z. Liu).

https://doi.org/10.1016/j.displa.2023.102563
Received 28 June 2023; Received in revised form 2 October 2023; Accepted 23 October 2023
Available online 31 October 2023
0141-9382/© 2023 Published by Elsevier B.V.

Fig. 1. The figure illustrates the model architecture for a classical approach to multi-modal sentiment analysis. The architecture comprises three essential components: feature
extraction for each modality, fusion of the extracted features, and sentiment analysis based on the fused features. These three components play a crucial role in the analysis process,
prompting researchers to focus on optimizing each part individually.

Our research focuses on multimodal sentiment analysis, providing a comprehensive overview of the field, including its definition, historical context, and evolutionary trajectory. It explores recent datasets and state-of-the-art models, addressing challenges and future prospects. Its unique contribution lies in offering constructive suggestions for research directions. [26] discusses affective computing and automatic emotion recognition, covering techniques, signal modalities, databases, feature extraction, applications, and issues such as privacy and fairness. [27] emphasizes emotion recognition based on modal information, particularly video and audio, and reviews research methods and fusion technologies. It also examines emotion recognition based on deep learning and provides suggestions for future research. In comparison to other surveys, our research stands out by providing a comprehensive overview and valuable guidance for researchers in multimodal sentiment analysis.

This review provides a comprehensive overview of multimodal sentiment analysis. It includes a summary and introduction to relevant datasets, aiding researchers in selecting appropriate datasets for their studies. We compare and analyze models that hold significant research significance in multimodal sentiment analysis, offering suggestions for model construction. In addition, we explore three types of modal fusion methods, highlighting the advantages and disadvantages of each approach. Finally, we discuss the challenges and future directions in multimodal sentiment analysis, presenting several promising research avenues. Distinct from other reviews in the field, our focus lies in providing constructive suggestions for promising research directions and constructing superior multimodal sentiment analysis models. We accentuate the challenges and future prospects associated with these technologies.

2. Multimodal sentiment analysis datasets

The rapid growth of the Internet has ushered in an era of data explosion [28–30]. Researchers have extensively gathered these data from various online sources such as videos and reviews, and have created sentiment datasets tailored to their specific needs. Table 1 provides a summary of commonly used multimodal datasets. The first column lists the dataset names, while the second column indicates the release year of the sentiment data. The third column categorizes the modalities included in each dataset, and the fourth column specifies the platform from which the dataset was obtained. The fifth column denotes the language used in the dataset, and the sixth column quantifies the data volume within each dataset. Each dataset possesses its own unique characteristics. This section presents well-known datasets within the research community, intending to assist researchers in understanding the distinct features of each dataset and facilitating their selection process.

2.1. IEMOCAP [31]

IEMOCAP, a sentiment analysis dataset released by the Speech Analysis and Interpretation Laboratory in 2008, offers a comprehensive multi-modal dataset. It consists of 1039 conversational segments, with a total video duration of 12 h. The study involved participants engaging in five different scenarios, expressing emotions based on predefined scenarios. The dataset incorporates audio, video, and text information, along with facial expression and posture data collected through additional sensors. Data points are categorized into ten emotions, including neutral, happy, sad, angry, surprised, scared, disgusted, frustrated, excited, and other. IEMOCAP provides a valuable resource for researchers investigating sentiment analysis across multiple modalities.

2.2. DEAP [32]

DEAP is a dataset specifically designed for sentiment analysis that utilizes physiological signals as its source of data. The dataset examines EEG data from 32 subjects, with a 1:1 ratio of male and female participants. EEG signals were collected at 512 Hz from different areas of the subjects' brains, including the frontal, parietal, occipital, and temporal lobes. To annotate the EEG signals, the subjects were asked to rate the corresponding videos in terms of three aspects: Valence, Arousal, and Dominance, on a scale of 1 to 9. This dataset provides valuable resources for researchers to explore sentiment analysis using physiological signals.

2.3. CMU-MOSI [33]

The CMU-MOSI dataset comprises 93 significant YouTube videos, covering a diverse range of topics, as described by Zadeh et al. (2016). The selection process ensured that each video featured a single speaker facing the camera, enabling clear capture of facial expressions. While there were no specific constraints on the camera model, distance, or speaker scene, all presentations and comments in the videos were conducted in English by 89 distinct speakers, including 41 women and 48 men. The 93 videos were further divided into 2199 subjective opinion segments, each annotated with sentiment intensity ranging from strongly negative to strongly positive (−3 to 3). Overall, the CMU-MOSI dataset serves as a valuable resource for researchers engaged in the study of sentiment analysis.


Table 1
The table presented below showcases the utilized multimodal datasets.

Name         Year  Modalities  Source            Language                             Number
IEMOCAP      2008  A+V+T       N/A               English                              10 039
DEAP         2011  A+V+T       N/A               English                              10 039
CMU-MOSI     2016  A+V+T       YouTube           English                              2199
CMU-MOSEI    2018  A+V+T       YouTube           English                              23 453
MELD         2019  A+V+T       The Friends       English                              13 000
Multi-ZOL    2019  V+T         ZOL.com           Chinese                              5288
CH-SIMS      2020  A+V+T       N/A               Chinese                              2281
CMU-MOSEAS   2021  A+V+T       YouTube           Spanish, Portuguese, German, French  40 000
MEMOTION     2022  V+T         Reddit, Facebook  English                              10 000

2.4. CMU-MOSEI [34]

CMU-MOSEI is a highly regarded dataset extensively used in sentiment analysis, comprising 3228 YouTube videos. These videos are divided into 23,453 segments and encompass data from three distinct modalities: text, visual, and sound. With contributions from 1000 speakers and coverage of 250 different topics, this dataset offers a diverse range of perspectives. All videos in the dataset are in English, and annotations for both sentiment and emotion are available. The six emotion categories include happy, sad, angry, scared, disgusted, and surprised, while the sentiment intensity markers span from strongly negative to strongly positive (−3 to 3). Overall, CMU-MOSEI stands as an invaluable resource for researchers delving into sentiment analysis across multiple modalities.

2.5. MELD [35]

MELD is a comprehensive dataset featuring video clips sourced from the well-known television series Friends. The dataset encompasses textual, audio, and video information that aligns with the accompanying textual data. It consists of 1400 videos, further divided into 13,000 individual segments. The dataset is meticulously annotated with seven categories of emotions: Anger, Disgust, Sadness, Joy, Neutral, Surprise, and Fear. Additionally, each segment is annotated with three sentiment labels: positive, negative, and neutral.

2.6. Multi-ZOL [36]

Multi-ZOL is a dataset designed for bimodal sentiment classification of images and text. The dataset consists of reviews of mobile phones collected from ZOL.com. It contains 5288 sets of multimodal data points that cover various models of mobile phones from multiple brands. These data points are annotated with a sentiment intensity rating from 1 to 10 for six aspects.

2.7. CH-SIMS [37]

CH-SIMS is a distinctive dataset comprising 60 open-sourced videos sourced from the web. These videos are further divided into 2281 video clips. The dataset specifically focuses on the Chinese (Mandarin) language and ensures that each segment features only one character's face and voice. It encompasses a diverse array of scenes and includes speakers spanning various age groups. The dataset is meticulously labeled for each modality, rendering it a valuable resource for researchers. The annotations in the dataset encompass sentiment intensity ratings ranging from negative to positive (−1 to 1), along with additional annotations for attributes such as age and gender.

2.8. CMU-MOSEAS [38]

CMU-MOSEAS is a versatile dataset that encompasses multiple languages, including Spanish, Portuguese, German, and French. The dataset consists of 40,000 sentence fragments, covering 250 diverse topics and featuring contributions from 1645 speakers. The annotations within the dataset are classified into two categories: sentiment intensity and binary. Each sentence fragment is annotated with sentiment strength, quantified within the interval [−3, 3]. Additionally, the binary category indicates whether the speaker expressed an opinion or made an objective statement. For each sentence, emotions are categorized into six categories: happiness, sadness, fear, disgust, surprise, and anger.

2.9. MEMOTION [39]

MEMOTION is a meme-based dataset that focuses on popular memes associated with politics, religion, and sports. The dataset comprises 10,000 data points and is organized into three distinct sub-tasks: Sentiment Analysis, Emotion Classification, and Scale/Intensity of Emotion Classes. Annotators approached each data point differently based on the specific subtask. In subtask one, data points were annotated into three categories: negative, neutral, and positive. Subtask two involved annotating each data point into four categories: humor, sarcasm, offense, and motivation. Lastly, in subtask three, each data point was assigned a sentiment intensity value within the interval of [0, 4]. This dataset presents a unique opportunity for researchers to explore the use of memes as a form of communication and expression within modern culture.

2.10. MULTIZOO & MULTIBENCH [40]

The MULTIBENCH dataset is a comprehensive and diverse collection of datasets designed for multimodal learning research. It encompasses 15 datasets spanning six distinct research areas, including multimedia, affective computing, robotics, finance, human–computer interaction, and healthcare. With more than 10 input modalities available, such as language, image, video, audio, time-series, tabular data, force sensors, proprioception sensors, sets, and optical flow, the MULTIBENCH dataset offers a wide range of real-world scenarios for studying multimodal representation learning. Each dataset is carefully curated to cover various prediction tasks, resulting in a total of 20 prediction tasks for evaluation. This diversity allows researchers to explore the integration of different data sources and investigate the performance and robustness of multimodal models across domains and modalities. The availability of the MULTIBENCH dataset provides researchers with a standardized and reliable benchmark for assessing the capabilities and limitations of multimodal models, promoting accessible and reproducible research in the field of multimodal learning.
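To make the annotation schemes above concrete, the sketch below shows one way an utterance-level sample from a CMU-MOSI/CMU-MOSEI-style corpus is often represented after preprocessing; the field names and feature shapes are illustrative assumptions, not an official data layout.

```python
# Minimal sketch of an utterance-level multimodal sample. Field names and
# shapes are assumptions for illustration, not the official SDK layout.
from dataclasses import dataclass
import numpy as np

@dataclass
class MultimodalSegment:
    video_id: str
    text_tokens: list[str]          # transcribed utterance
    audio_features: np.ndarray      # e.g. (T_a, d_a) acoustic descriptors
    visual_features: np.ndarray     # e.g. (T_v, d_v) facial features
    sentiment: float                # intensity in [-3, 3] (CMU-MOSI/MOSEI style)

segment = MultimodalSegment(
    video_id="example_001",
    text_tokens=["this", "movie", "was", "great"],
    audio_features=np.zeros((50, 74)),
    visual_features=np.zeros((50, 35)),
    sentiment=2.0,
)
print(segment.sentiment)
```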


2.11. MER [41]

The Multimodal Emotion Recognition Challenge (MER 2023) dataset is a high-quality benchmark designed to advance research in multimodal emotion recognition, with a particular focus on system robustness. Utilized in the MER 2023 challenge, the dataset comprises three sub-challenges: MER-MULTI, MER-NOISE, and MER-SEMI. In the MER-MULTI sub-challenge, participants recognized both discrete and dimensional emotions, allowing for a comprehensive understanding of emotional labels. The MER-NOISE sub-challenge evaluated system robustness in the presence of modality perturbations, while the MER-SEMI sub-challenge addressed limited labeled data by providing unlabeled samples for semi-supervised learning. The dataset includes labeled samples for training and validation, as well as a substantial amount of unlabeled data. By signing a new End User License Agreement, researchers can access this dataset, contributing to advancements in multimodal emotion recognition and human–computer interaction.

In conclusion, these sentiment analysis datasets provide valuable resources for researchers studying sentiment analysis across multiple modalities. The datasets cover various modalities such as speech, video, text, physiological signals, and memes, allowing researchers to explore different aspects of sentiment analysis. They encompass diverse topics, languages, and cultures, providing a comprehensive understanding of sentiment expression. The datasets are annotated with emotions and sentiment intensity, enabling researchers to analyze and classify sentiment in a nuanced manner. Additionally, some datasets focus on specific challenges, such as system robustness and limited labeled data, promoting advancements in multimodal emotion recognition. Overall, these datasets serve as valuable benchmarks for evaluating and developing multimodal sentiment analysis models.

3. Multimodal fusion

Multimodal data provides a more comprehensive understanding of objects by incorporating different perspectives, making it more informative compared to single-modal data. The information derived from different modalities can complement each other. In the domain of multimodal sentiment analysis, effectively fusing data features from diverse modalities, preserving the semantic integrity of each modality, and achieving robust fusion between them pose significant challenges. Depending on the specific approach to modal fusion, it can be categorized into three stages: feature-based multimodal fusion in the early stage, model-based multimodal fusion in the middle stage, and decision-based multimodal fusion in the late stage. These stages represent different strategies for integrating modalities and optimizing the overall sentiment analysis performance.

3.1. Early feature-based approaches for multimodal fusion

Early feature-based multimodal fusion methods aim to merge the features of different modalities at a shallow level following feature extraction in the initial stage. This fusion process involves bringing the features of different single modalities into a unified parameter space. However, since different modalities contain varying amounts of information, the resulting features often contain redundant information. To address this, dimensionality reduction techniques are commonly employed to eliminate this redundancy. The reduced features are then fed into the model for subsequent feature extraction and prediction. Early feature fusion ensures that the model considers input information from multiple modalities right from the start of the feature modeling process.

Fig. 2. The figure illustrates the overall framework of early feature-based approaches for multimodal fusion. Once the features are extracted, this model employs dedicated components to fuse the features from each modality. These components are designed to integrate the information from different modalities in a cohesive manner, enabling the model to capture the complementary aspects of the data and create a unified representation. By fusing the features, the model can effectively leverage the combined strength of the modalities and enhance the overall performance of the fusion process.

Nevertheless, unifying multiple parameter spaces in the input layer might not always yield the desired outcomes due to differences in the parameter spaces of the various modalities. While these models effectively handle multimodal emotion recognition tasks with high accuracy and robustness, they typically require a substantial amount of training data to achieve optimal performance. Additionally, their structures tend to be complex, resulting in longer training times. Fig. 2 illustrates the overall framework of early feature-based approaches for multimodal fusion. Some notable representative models in this category include:

3.1.1. THMM (Tri-modal Hidden Markov Model) [42]

Multimodal sequence modeling and analysis can be approached by representing eigenvectors of multiple modalities as higher-order tensors and utilizing tensor decomposition methods to extract hidden states and transition probabilities. This approach effectively capitalizes on the correlation and complementarity present in multimodal data, while mitigating issues such as the curse of dimensionality and overfitting. However, a drawback of this approach is that the order and rank of the tensors, as well as the number of hidden states, must be predetermined, which may potentially impact the model's performance and efficiency. It requires careful consideration and tuning to ensure optimal results and avoid potential limitations associated with predefining these parameters.

3.1.2. RMFN (Recurrent Multistage Fusion Network) [43]

An alternative approach involves employing multiple recurrent neural network (RNN) layers to gradually fuse features derived from different modalities. This fusion process occurs progressively, transitioning from local to global information and from low-level to high-level representations, ultimately yielding a comprehensive sentiment representation. In this approach, an attention mechanism is incorporated to dynamically adjust the relevance of features across semantic space for each modality. This attention mechanism enables the model to capture the nuanced variations in emotions exhibited by the same word under different nonverbal behaviors. By adaptively focusing on the relevant features, the model can effectively account for the diverse expressions of sentiment across modalities, enhancing its overall performance in multimodal sentiment analysis.

3.1.3. RAVEN (Recurrent Attended Variation Embedding Network) [44]

For multimodal sentiment analysis, a Hierarchical Fusion Network has been proposed, featuring both a local fusion module and a global fusion module. This network architecture aims to enhance the integration of information across modalities. In particular, local cross-modal fusion is achieved by employing a sliding window approach. This technique effectively reduces computational complexity by enabling efficient fusion of multimodal features within localized segments. By leveraging the sliding window mechanism, the model can capture fine-grained relationships between modalities while maintaining computational efficiency, thus contributing to the overall effectiveness of the fusion process in multimodal sentiment analysis.
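As a minimal illustration of the early feature-level fusion pipeline outlined in Section 3.1 (concatenation into a shared parameter space followed by dimensionality reduction), the sketch below uses randomly generated stand-ins for pre-extracted unimodal features; the feature dimensions, the choice of PCA, and the downstream classifier are assumptions for illustration only.

```python
# Minimal sketch of early (feature-level) fusion as outlined in Section 3.1.
# Assumes per-modality feature vectors have already been extracted; shapes,
# dimensionalities and the downstream classifier are illustrative only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples = 200

# Hypothetical pre-extracted unimodal features (text, audio, visual).
text_feat = rng.normal(size=(n_samples, 300))
audio_feat = rng.normal(size=(n_samples, 74))
visual_feat = rng.normal(size=(n_samples, 35))
labels = rng.integers(0, 2, size=n_samples)       # binary sentiment polarity

# 1. Bring all modalities into one parameter space by concatenation.
fused = np.concatenate([text_feat, audio_feat, visual_feat], axis=1)

# 2. Reduce the redundancy introduced by concatenation (Section 3.1 mentions
#    dimensionality reduction; PCA is one common choice).
fused_reduced = PCA(n_components=64).fit_transform(fused)

# 3. Feed the reduced representation to a single downstream classifier.
clf = LogisticRegression(max_iter=1000).fit(fused_reduced, labels)
print("train accuracy:", clf.score(fused_reduced, labels))
```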


3.1.4. HFFN (Hierarchical feature fusion network with local and global perspectives for multimodal affective computing) [45]

An alternative approach involves utilizing RNNs in conjunction with adversarial learning techniques to learn joint representations across different modalities. This approach aims to enhance the effectiveness of single-modal representations while also addressing challenges such as missing modalities or noise in the data. By leveraging RNNs and incorporating adversarial learning, the model can effectively capture the correlations and dependencies between modalities, leading to improved multimodal representations. These joint representations not only enhance the individual modalities but also enable the model to handle situations where certain modalities may be missing or corrupted. The inclusion of adversarial learning further helps the model to learn robust representations that are resilient to noise or inconsistencies across modalities.

3.1.5. MCTN (Multimodal Cyclic Translation Network) [46]

A medium-term model-based multimodal fusion approach incorporates multimodal data into a network, where the intermediate layers of the model facilitate the fusion of features from different modalities. In this approach, modality feature fusion is strategically performed at intermediate stages to enable interactions between modalities. Model-based fusion methods encompass various techniques, such as multiple kernel learning, neural networks, graph models, and alternative methods. These methods are utilized to determine the optimal fusion points and facilitate the effective integration of information across modalities. By leveraging model-based fusion, the approach enables comprehensive and context-aware fusion that enhances the overall performance of multimodal analysis tasks.

3.2. Medium-term model-based multimodal fusion method

A medium-term model-based multimodal fusion approach involves inputting multimodal data into the network, where the intermediate layers of the model carry out feature fusion between the different modalities. Model-based modality fusion methods offer the flexibility to determine the optimal location for modality feature fusion, enabling intermediate interactions. These fusion methods typically encompass multiple kernel learning, neural networks, graph models, and alternative techniques. The overall framework of the medium-term model-based multimodal fusion method is depicted in Fig. 3.

Fig. 3. The figure illustrates the overarching framework of medium-term model-based approaches for multimodal fusion. In this approach, the feature information from each modality is inputted into multiple kernel learning, neural networks, graph models, and alternative methods to accomplish the fusion of modalities. Notably, the majority of the nodes involved in modal fusion are variable, indicating their adaptability and flexibility in the fusion process. This variability allows the model to effectively integrate and combine the information from different modalities, tailoring the fusion to suit the specific characteristics and requirements of the multimodal data being processed.

3.2.1. MKL (Multiple Kernel Learning) [47]

This model employs a multiple kernel learning approach to facilitate the fusion of multimodal information. It utilizes distinct kernel functions to represent the diverse modal information and dynamically selects an optimal combination of these functions by optimizing an objective function. This methodology enhances the flexibility of the model, enabling it to adaptively choose different kernel functions based on the given task. Consequently, this adaptability enhances the robustness and accuracy of the model. However, it is important to note that the model structure is relatively straightforward and may have limitations in handling complex emotional expressions that require more intricate representations. While the model excels in its flexibility and adaptability, it may benefit from further enhancements to handle the intricacies of complex emotional expressions.

3.2.2. BERT-like (self supervised models to improve multimodal speech emotion recognition) [48]

This model represents a Transformer-based approach for multimodal sentiment analysis, enabling the alignment and fusion of text and image information through the utilization of a self-attention mechanism. By incorporating self-supervised learning techniques, the model demonstrates effectiveness in addressing multimodal emotion recognition tasks with high levels of accuracy and robustness. However, it is worth noting that the quality of data and annotations can have an impact on the model's performance. The model's ability to generalize and make accurate predictions relies on the quality and reliability of the input data as well as the accuracy of the annotations associated with the multimodal samples. Ensuring high-quality data and precise annotations is crucial to optimize the performance of the model in multimodal sentiment analysis tasks.

3.3. Multimodal model based on decision fusion in the later stage

A decision-level fusion method is employed to integrate information from various modalities in this approach. Decision-level fusion involves training separate models on data from different modalities and combining their outputs to make the final decision. Multimodal models based on decision fusion typically employ techniques such as averaging, majority voting, weighting, and learnable models to fuse the modalities. These models are known for being lightweight and flexible, allowing for efficient processing and adaptability. In scenarios where a modality is missing, the decision can still be reached by utilizing the available modalities. The overall framework of the multimodal model based on decision fusion in the later stages is depicted in Fig. 4.

Fig. 4. The figure illustrates the comprehensive framework of a multimodal model based on decision fusion in the later stages. In this approach, individual classifiers are trained for each modality. Subsequently, the outputs of these classifiers are integrated to accomplish the task of multimodal sentiment analysis. By training separate classifiers for each modality, the model can capture the unique characteristics and nuances of different modalities. The fusion of the individual classifiers allows for a holistic analysis that leverages the strengths of each modality, enhancing the overall performance of the sentiment analysis task.

3.3.1. Deep multimodal fusion architecture [49]

In this model, each modality is equipped with its own independent classifier. The prediction results are obtained by averaging the confidence scores from each classifier. This model is characterized by its simple structure, ease of implementation, and effectiveness in addressing multimodal emotion recognition tasks. However, it is important to note that the model lacks the ability to capture the interaction information between modalities, which may result in the loss of valuable insights. The absence of explicit mechanisms for capturing modality interactions can limit the model's ability to fully exploit the synergistic potential of multimodal data. Consequently, there is a possibility of overlooking important contextual cues and interdependencies that could contribute to a more comprehensive understanding of emotions.
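The decision-level strategy of Section 3.3, in which only the outputs of independently trained unimodal classifiers are combined, can be sketched as follows; the probability values and fusion weights below are illustrative assumptions standing in for real classifier outputs.

```python
# Minimal sketch of late (decision-level) fusion as described in Section 3.3:
# per-modality classifiers are trained separately and only their outputs are
# combined. The probability arrays below stand in for real classifier outputs.
import numpy as np

# Hypothetical class-probability outputs of three unimodal classifiers
# (rows = samples, columns = sentiment classes: negative / neutral / positive).
p_text   = np.array([[0.1, 0.2, 0.7],
                     [0.6, 0.3, 0.1]])
p_audio  = np.array([[0.2, 0.3, 0.5],
                     [0.5, 0.4, 0.1]])
p_visual = np.array([[0.3, 0.3, 0.4],
                     [0.2, 0.5, 0.3]])

# Simple averaging of confidences (as in the deep multimodal fusion
# architecture of Section 3.3.1).
p_avg = (p_text + p_audio + p_visual) / 3.0
pred_avg = p_avg.argmax(axis=1)

# Weighted fusion: weights could be tuned on a validation set; the values
# here are illustrative. A missing modality can simply be dropped from the sum.
w = {"text": 0.5, "audio": 0.3, "visual": 0.2}
p_weighted = w["text"] * p_text + w["audio"] * p_audio + w["visual"] * p_visual
pred_weighted = p_weighted.argmax(axis=1)

# Majority vote over per-modality hard decisions.
votes = np.stack([p.argmax(axis=1) for p in (p_text, p_audio, p_visual)])
pred_vote = np.array([np.bincount(col).argmax() for col in votes.T])

print(pred_avg, pred_weighted, pred_vote)
```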


3.3.2. SAL-CNN (Select-Additive Learning CNN) [50]

This model is a multimodal sentiment analysis approach that combines CNN and attention mechanisms. It leverages an adaptive attention mechanism to fuse text and image features, a spatial attention mechanism to extract text-related regions within images, and a fully connected layer for classification. The incorporation of attention mechanisms enables the model to effectively handle multimodal emotion recognition tasks with high levels of accuracy and robustness. However, it is worth noting that the model's performance is reliant on a substantial amount of training data to achieve optimal results. Additionally, due to the relatively complex structure of the model, the training process may require a longer duration. It is important to consider these factors when implementing and training the model for multimodal sentiment analysis tasks.

3.3.3. TSAM (Temporally Selective Attention Model) [51]

The proposed model introduces a time-selective attention mechanism, which employs an attention mechanism to assign weights and guide the model in selecting the relevant time step. The selected time step is then passed to a distinct SDL loss function model for sentiment analysis. SDL, known as Self-Distillation Learning, is a multimodal sentiment analysis method that leverages the complementarity among different modalities to enhance the model's generalization ability and robustness. By integrating a time-selective attention mechanism, the model demonstrates effectiveness in handling multimodal emotion recognition tasks with high levels of accuracy and robustness. The incorporation of the time-selective attention mechanism allows the model to focus on the most relevant temporal information, contributing to improved sentiment analysis performance.

There are three main approaches to multimodal fusion: early feature-based fusion, medium-term model-based fusion, and decision-level fusion.

Early feature-based fusion methods merge the features of different modalities at a shallow level after feature extraction. The goal is to bring the features of different modalities into a unified parameter space. However, this approach often leads to redundant information in the resulting features due to variations in the amount of information contained in different modalities. To address this, dimensionality reduction techniques are commonly used to eliminate redundancy. The reduced features are then fed into the model for further feature extraction and prediction. Early feature fusion ensures that the model considers input information from multiple modalities right from the start of the feature modeling process. However, these models require a substantial amount of training data to achieve optimal performance and tend to have complex structures, resulting in longer training times.

Medium-term model-based fusion approaches involve inputting multimodal data into the network, where the intermediate layers of the model perform feature fusion between different modalities. These methods offer flexibility in determining the optimal location for modality feature fusion, allowing for intermediate interactions. Techniques such as multiple kernel learning, neural networks, graph models, and alternative methods are often used in model-based modality fusion. The overall framework of the medium-term model-based fusion method is shown in Fig. 3.

Decision-level fusion methods integrate information from various modalities by training separate models on data from different modalities and combining their outputs to make the final decision. Techniques such as averaging, majority voting, weighting, and learnable models are commonly employed to fuse the modalities. Decision-level fusion models are lightweight and flexible, enabling efficient processing and adaptability. In scenarios where a modality is missing, the decision can still be reached by utilizing the available modalities. The overall framework of the multimodal model based on decision fusion in the later stages is depicted in Fig. 4.

In summary, early feature-based fusion focuses on merging features at a shallow level, medium-term model-based fusion performs feature fusion at intermediate layers, and decision-level fusion combines outputs from separate models. Each approach has its advantages and considerations in terms of accuracy, robustness, training data requirements, complexity, and flexibility.

4. Latest multimodal sentiment analysis models

In recent years, the field of multimodal sentiment analysis has witnessed significant advancements, leading to the development of numerous practical and efficient models and architectures. It is important to note that it is not feasible to cover all these models within the scope of this chapter. Therefore, we will focus on presenting some of the recent and cutting-edge multimodal sentiment analysis models that have emerged. These models have served as benchmarks for subsequent researchers, providing valuable experimental references. A summary of these models can be found in Table 2. The first column of the table denotes the model names, while the second column indicates the year of publication. The third column provides information regarding the datasets employed for evaluation, and the fourth column presents the corresponding accuracy achieved on these datasets.

Table 2
This table showcases a selection of recent and highly effective multimodal sentiment analysis models. The first column lists the model names, while the second column indicates the respective publication years. The third column indicates the dataset utilized for model evaluation, and the fourth column presents the corresponding accuracy achieved on that dataset.

Name                  Year  Dataset      Acc
MultiSentiNet-Att     2017  MVSA         68.86%
DFF-ATMF              2019  CMU-MOSI     80.98%
                            CMU-MOSEI    77.15%
AHRM                  2020  Flickr       87.10%
                            Getty Image  87.80%
SFNN                  2020  Yelp         62.90%
MISA                  2020  MOSI         83.40%
MAG-BERT              2020  CMU-MOSI     84.10%
                            CMU-MOSEI    84.50%
TIMF                  2021  CMU-MOSI     92.28%
                            CMU-MOSEI    79.46%
Auto-ML based Fusion  2021  B-T4SA       95.19%
Self-MM               2022  CMU-MOSI     84.00%
                            CMU-MOSEI    82.81%
                            CH-SIMS      80.74%
DISRFN                2022  CMU-MOSI     83.60%
                            CMU-MOSEI    87.50%
TETFN                 2023  CMU-MOSI     84.05%
                            CMU-MOSEI    84.25%
TEDT                  2023  CMU-MOSI     89.30%
                            CMU-MOSEI    86.20%
SPIL                  2023  CMU-MOSI     85.06%
                            CMU-MOSEI    85.01%
                            CH-SIMS      81.25%

4.1. MultiSentiNet-Att [52]

In this model, an LSTM network is employed to integrate textual information with word vectors. Additionally, VGG is utilized to extract both target feature information and scene feature information from an image. These target and scene feature vectors are then utilized in the cross-modal attention mechanism learning process alongside the word vectors. This mechanism assigns specific weights to the word vectors that are associated with the sentiment expressed in the text, utilizing a combination of the target and scene feature information from the image. The resulting features are subsequently fed into a multi-layer perceptron to accomplish the sentiment analysis task.
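The cross-modal attention weighting used in MultiSentiNet-Att-style models can be sketched roughly as a single-head scaled dot-product attention in which an image-derived feature acts as the query over the word vectors; the dimensions and the single-head formulation below are simplifying assumptions rather than the exact published architecture.

```python
# Minimal sketch of cross-modal attention in the spirit of Section 4.1:
# image-derived object/scene vectors act as queries that re-weight the word
# vectors of the sentence. Dimensions and the single-head formulation are
# simplifications; the actual model details may differ.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 128                               # shared embedding size (assumed)
words = rng.normal(size=(20, d))      # word vectors from an LSTM encoder
visual_query = rng.normal(size=(d,))  # object/scene feature from a CNN (e.g. VGG)

# Scaled dot-product attention: the visual query scores every word.
scores = words @ visual_query / np.sqrt(d)   # (20,)
weights = softmax(scores)                    # attention weights over words
attended_text = weights @ words              # (d,) sentiment-weighted summary

# The attended text vector would then be concatenated with the visual features
# and passed to a multi-layer perceptron for the final sentiment prediction.
fused = np.concatenate([attended_text, visual_query])
print(fused.shape)
```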


4.2. DFF-ATMF [53]

The proposed model primarily focuses on the text and audio modalities. Its main contribution lies in the introduction of novel strategies for multi-feature fusion and multimodal fusion. The model employs two parallel branches to extract features specifically for the text and audio modalities. Subsequently, a multimodal attention fusion module is employed to effectively combine the features from these two modalities, achieving multimodal fusion. This fusion process enhances the model's ability to capture complementary information and achieve a comprehensive understanding of the input data from both the text and audio sources.

4.3. AHRM [54]

This model primarily aims to capture the relationship between the text and visual modalities. The authors introduce an attention mechanism-based heterogeneous relation model, which effectively combines high-quality representation information from both the text and visual modalities. The progressive dual attention mechanism employed in this model effectively emphasizes channel-level semantic information from both images and text. In order to incorporate social attributes, social relations are introduced to capture complementary information from the social context. Furthermore, heterogeneous networks are constructed to integrate features from different modalities, facilitating a comprehensive understanding of the multimodal data.

4.4. SFNN [55]

The proposed model is a neural network based on semantic feature fusion. Convolutional neural networks and attention mechanisms are used to extract visual features. Visual features are mapped to text features and combined with text modality features for sentiment analysis.

4.5. MISA [56]

The proposed model introduces an innovative framework for multi-modal sentiment analysis. In this framework, each modality undergoes feature extraction and is subsequently mapped into two separate feature spaces. One of these feature spaces primarily focuses on learning the invariant features of the modality, while the other one emphasizes the acquisition of the modality's unique features. This approach allows the model to capture both the shared characteristics across modalities and the distinctive aspects specific to each modality, thereby facilitating a comprehensive analysis of the multi-modal data.

4.6. MAG-BERT [57]

The authors introduce a "multi-modal" adaptation architecture that is specifically applied to BERT. This novel model is capable of accommodating input from multiple modalities during the fine-tuning process. The Multi-modal Adaptation Gateway (MAG) can be conceptualized as a vector embedding structure, enabling the incorporation of multimodal information and its subsequent embedding as a sequence within BERT. This adaptation architecture enhances the model's ability to process and analyze data from diverse modalities, facilitating a more comprehensive understanding of multimodal inputs within the BERT framework.

4.7. TIMF [58]

The main idea of this model is that each modality learns features separately and performs tensor fusion of the features of each modality. In the dataset fusion stage, the feature fusion for each modality is implemented by a tensor fusion network. In the decision fusion stage, the upstream results are fused by soft fusion to adjust the decision results.

4.8. Auto-ML based fusion [59]

The authors propose a novel approach that combines text and image sentiment analysis through the utilization of AutoML for generating a final fused classification. In this approach, individual classifiers for text and image are combined using the best model selected by AutoML. This decision-level fusion technique merges the outputs of the individual classifiers to form the final classification. This approach showcases a typical example of decision-level fusion, where the strengths of both text and image analysis are leveraged to improve the overall sentiment analysis performance.

4.9. Self-MM [60]

In this study, the authors propose a novel multi-modal sentiment analysis architecture that combines self-supervised learning and multi-task learning. To capture the private information specific to each modality, a single-modal label generation module called ULGM is constructed based on self-supervised learning. The corresponding loss function for this module is designed to integrate the private features acquired through three self-supervised learning subtasks into the original multi-modal sentiment analysis model, employing a weight adjustment strategy. The proposed model demonstrates strong performance, and the ULGM module based on self-supervised learning also possesses the capability of calibrating single-modal labels. This combination of self-supervised learning and multi-task learning proves to be effective in enhancing the sentiment analysis performance while capturing modality-specific information.

4.10. DISRFN [61]

The model is a dynamically invariant representation-specific fusion network. The joint domain separation network is improved to obtain a joint domain separation representation for all modalities, so that redundant information can be effectively utilized. Second, a HGFN network is used to dynamically fuse the feature information of each modality and learn the features of multiple modal interactions. At the same time, a loss function that improves the fusion effect is constructed to help the model learn the representation information of each modality in the subspace.

4.11. TEDT [62]

This model introduces a novel multimodal encoding-decoding translation network, incorporating a transformer, to tackle the challenges associated with multimodal sentiment analysis. Specifically, it addresses the influence of individual modal data and the low quality of non-natural language features. The proposed method assigns text as the primary information, while sound and image serve as secondary information. To enhance the quality of non-natural language features, a modality reinforcement cross-attention module is employed to convert them into natural language features. Furthermore, a dynamic filtering mechanism is implemented to filter out error information arising from cross-modal interaction. The model's key strength lies in its ability to enhance multimodal fusion and accurately analyze human sentiment. However, it should be noted that this approach may require significant computational resources and might not be suitable for real-time analysis.
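As a rough illustration of the tensor-fusion stage mentioned for TIMF (Section 4.7), the sketch below forms the three-way outer product of per-modality embeddings; the embedding sizes are assumptions, and real tensor fusion networks add projection and prediction layers not shown here.

```python
# Minimal sketch of outer-product tensor fusion in the spirit of the tensor
# fusion network stage described for TIMF (Section 4.7). Embedding sizes are
# illustrative; implementations typically append a constant 1 to each modality
# vector so that unimodal and bimodal interaction terms are retained.
import numpy as np

rng = np.random.default_rng(0)
z_t = rng.normal(size=(32,))   # text embedding
z_a = rng.normal(size=(16,))   # audio embedding
z_v = rng.normal(size=(16,))   # visual embedding

# Append 1 so the fused tensor keeps unimodal and bimodal interactions.
z_t1 = np.append(z_t, 1.0)
z_a1 = np.append(z_a, 1.0)
z_v1 = np.append(z_v, 1.0)

# Three-way outer product -> (33, 17, 17) interaction tensor, then flattened
# into a single fused feature vector for the downstream predictor.
fusion_tensor = np.einsum("i,j,k->ijk", z_t1, z_a1, z_v1)
fused = fusion_tensor.reshape(-1)
print(fusion_tensor.shape, fused.shape)   # (33, 17, 17) (9537,)
```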


4.12. TETFN [63]

The Text Enhanced Transformer Fusion Network (TETFN) presents a novel approach to MSA that effectively addresses the challenges posed by the varying contributions of textual, visual, and acoustic modalities. This method focuses on learning text-oriented pairwise cross-modal mappings to acquire unified and robust multimodal representations. It achieves this by incorporating textual information into the learning process of sentiment-related nonlinguistic representations through text-based multi-head attention. Additionally, the model incorporates unimodal label prediction to maintain differentiated information among the various modalities. To capture both global and local information of a human face, the vision pre-trained model Vision-Transformer is employed to extract visual features from the original videos. The key strength of this model lies in its ability to leverage textual information to enhance the effectiveness of nonlinguistic modalities in MSA, while simultaneously preserving inter- and intra-modality relationships.

4.13. SPIL [64]

This model introduces a deep modal shared information learning module designed to enhance representation learning in multimodal sentiment analysis tasks. The proposed module effectively captures both shared and private information within a comprehensive modal representation. It achieves this by utilizing a covariance matrix to capture shared information across modalities, while employing a self-supervised learning strategy to capture private information specific to each modality. The module is flexible and can be seamlessly integrated into existing frameworks, allowing for adjustable information exchange between modalities to learn either private or shared information as needed. Additionally, a multi-task learning strategy is implemented to guide the model's attention towards modal differentiation during training. The proposed model demonstrates superior performance compared to current state-of-the-art methods across various metrics on three publicly available datasets. Furthermore, the study explores additional combinatorial techniques for leveraging the capabilities of the proposed module.

5. Introduction to evaluation metrics

In the field of multimodal sentiment analysis, evaluating the performance of sentiment analysis models is crucial for assessing their effectiveness. Various evaluation metrics have been developed to measure the performance of these models, providing insights into their accuracy, robustness, and generalizability. This chapter aims to introduce and discuss the commonly used evaluation metrics in multimodal sentiment analysis.

5.1. Accuracy

Accuracy is a fundamental evaluation metric that measures the overall correctness of the sentiment predictions made by a model. It calculates the ratio of correctly classified instances to the total number of instances in the dataset. While accuracy provides a general understanding of the model's performance, it may not be sufficient for assessing the model's performance when class imbalances exist in the dataset.

5.2. Precision and recall

Precision and recall are evaluation metrics commonly used in binary sentiment classification tasks. Precision measures the proportion of correctly predicted positive instances out of all instances predicted as positive. Recall, on the other hand, calculates the proportion of correctly predicted positive instances out of all actual positive instances in the dataset. Precision and recall provide insights into the model's ability to correctly identify positive instances and avoid false positives.

5.3. F1 score

The F1 score is a metric that combines precision and recall into a single value, providing a balanced evaluation of the model's performance. It is the harmonic mean of precision and recall and is particularly useful when there is an imbalance between positive and negative instances in the dataset. The F1 score ranges from 0 to 1, with higher values indicating better performance.

5.4. Mean Absolute Error (MAE)

The mean absolute error is a metric commonly used in regression-based sentiment analysis tasks. It measures the average absolute difference between the predicted sentiment values and the true sentiment values. MAE provides a measure of the model's accuracy in predicting sentiment on a continuous scale.

5.5. Mean Squared Error (MSE)

Similar to MAE, mean squared error is another metric used in regression-based sentiment analysis tasks. It calculates the average squared difference between the predicted sentiment values and the true sentiment values. MSE amplifies the impact of large errors, making it more sensitive to outliers in the predictions.

5.6. Cross-entropy loss

Cross-entropy loss is a common metric used in sentiment analysis tasks that involve probabilistic models, such as deep learning models. It measures the dissimilarity between the predicted probability distribution and the true probability distribution of sentiment labels. Minimizing cross-entropy loss encourages the model to produce more accurate and confident predictions.

5.7. Cohen's Kappa

Cohen's Kappa is a statistical measure of inter-rater agreement, commonly used when evaluating sentiment analysis models in scenarios where multiple annotators label the same dataset. It takes into account the agreement between the model's predictions and the average agreement expected by chance. Cohen's Kappa provides insights into the model's performance beyond chance agreement.

By using these evaluation metrics, researchers and practitioners can assess the performance of multimodal sentiment analysis models and compare their effectiveness. It is important to consider the specific characteristics of the dataset and the task at hand when selecting appropriate evaluation metrics to ensure a comprehensive evaluation of model performance.
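The metrics discussed in Section 5 can all be computed with standard library routines; the sketch below uses scikit-learn on toy labels and predictions purely for illustration.

```python
# Minimal sketch of the evaluation metrics from Section 5 using scikit-learn.
# y_true / y_pred and the regression values below are toy data for illustration.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_absolute_error, mean_squared_error,
                             log_loss, cohen_kappa_score)

# Binary classification example (1 = positive, 0 = negative sentiment).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.3, 0.7, 0.6, 0.1])  # P(positive)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("log loss :", log_loss(y_true, y_prob))          # cross-entropy loss
print("kappa    :", cohen_kappa_score(y_true, y_pred)) # agreement beyond chance

# Regression-style sentiment intensity (e.g. the [-3, 3] scale of CMU-MOSI).
s_true = np.array([-2.0, 0.5, 3.0, -1.5])
s_pred = np.array([-1.5, 0.0, 2.5, -2.0])
print("MAE:", mean_absolute_error(s_true, s_pred))
print("MSE:", mean_squared_error(s_true, s_pred))
```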


encompass multiple languages. Considering the linguistic and racial of these applications include multimodal emotion analysis for real-
diversity in many countries, it is crucial to have a large and diverse time mental health assessment [73–75], multimodal criminal linguistic
dataset that can be used to train multimodal sentiment analysis models deception detection models [76], and offensive language detection.
with robust generalization capabilities and wide applicability. More- Another exciting prospect is the development of human-like emotion-
over, existing multimodal datasets suffer from low annotation accuracy aware robots. Multimodal emotion analysis is a powerful technique
and have not yet achieved absolute continuous values. To address this for recognizing and analyzing emotions. By leveraging multimodal
issue, researchers need to focus on more precise labeling of multimodal information and combining data from various modalities, sentiment
datasets. Furthermore, most current multimodal datasets only incorpo- analysis models can significantly enhance the accuracy of emotion
rate visual, speech, and text modalities, while neglecting the inclusion recognition. As research progresses, multimodal sentiment analysis
of modal information combined with physiological signals such as brain techniques will continue to improve. It is conceivable that one day
waves and pulses. Thus, there is a need to expand multimodal datasets we may witness the emergence of a multimodal sentiment analysis
to include these additional modalities for more comprehensive analysis. model with a large number of parameters, possessing sentiment analysis
capabilities comparable to those of humans. Such a development would
6.2. Detection of hidden emotions be truly remarkable and pave the way for exciting advancements in the
field.
Multimodal sentiment analysis tasks have long faced a recognized
challenge: the analysis of hidden emotions. Hidden emotions encom- 7. Conclusion
pass various aspects, including sarcastic emotions expressed through
the use of sarcastic words, emotions that require contextual analysis Multimodal sentiment analysis has garnered significant recogni-
for concrete understanding, and complex emotions such as happiness tion in various research fields, establishing itself as a central topic
and sadness experienced by individuals [69,70]. The exploration of in natural language processing and computer vision. In this compre-
these hidden emotions is crucial as it represents a gap between human hensive review, we provide an in-depth exploration of various facets
and artificial intelligence capabilities in understanding and interpreting of multimodal sentiment analysis, covering its research background,
emotions [71,72]. By delving into these hidden emotions, we can definition, and developmental trajectory. Additionally, we present a
bridge this gap and enhance the effectiveness of multimodal sentiment comprehensive overview of commonly used benchmark datasets in
analysis. Table 1, and conduct a comparative analysis of recent state-of-the-
art multimodal sentiment analysis models. Furthermore, we highlight
6.3. Multiple forms of video data the existing challenges within the field and delve into potential future
developments.
Video data presents unique challenges in multimodal sentiment Numerous promising endeavors are currently underway, with sev-
analysis tasks. Despite the speaker facing the camera and the high- eral already implemented. However, there remain several challenges
resolution quality of the video data, the actual situation is often more that necessitate further investigation, giving rise to significant research
complex, necessitating a model that is resilient to noise and capa- directions:
ble of analyzing low-resolution video data. Furthermore, there is an (1) Constructing large-scale multimodal sentiment datasets in mul-
opportunity for researchers to delve into the capture and analysis of tiple languages.
micro-expressions and micro-gestures exhibited by speakers, as these (2) Addressing the domain transfer problem across video, text, and
subtle cues can provide valuable insights for sentiment analysis. There- speech modal data.
fore, exploring techniques to address these challenges and harness the (3) Developing a unified, large-scale multimodal sentiment analysis
potential of micro-expressions and micro-gestures is an important area model with exceptional generalization capabilities.
of focus for researchers in the field. (4) Reducing model parameters, optimizing algorithms, and mini-
mizing algorithmic complexity.
6.4. Multiform language data

In multimodal sentiment analysis tasks, the text modality is typically treated as monolingual, focusing on a single language. However, in online communities, evaluation texts often involve multiple languages, as reviewers switch languages to express their comments more vividly. This presents a challenge in analyzing text data with mixed emotions within a multimodal context. Effectively utilizing memes embedded in the text is another important research area, as memes often convey highly impactful emotional messages from reviewers. Moreover, analyzing emotions becomes more challenging when multiple individuals are engaged in a conversation, as most text data is transcribed directly from speech. Furthermore, considering the cultural characteristics of different regions and countries, the same text may evoke diverse emotions. Understanding and addressing these complexities, namely cross-lingual text data, embedded memes, and cultural nuances, is therefore a crucial area of investigation in multimodal sentiment analysis research.
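One commonly used option for the code-mixed reviews described above, shown here only as an assumption-laden sketch rather than the survey's own method, is to replace a monolingual text encoder with a multilingual pretrained model so that mixed-language utterances are handled by a single tokenizer and encoder. The snippet assumes the HuggingFace transformers library; xlm-roberta-base is just one publicly available checkpoint, and the example sentences are invented.

import torch
from transformers import AutoTokenizer, AutoModel

# Encode code-mixed reviews with a multilingual pretrained encoder instead of a
# monolingual one; weights are downloaded on first use.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

texts = [
    "The camera is great, pero la bateria dura muy poco",   # English/Spanish mix
    "Das Display ist amazing, totally worth it",            # German/English mix
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state   # (batch, tokens, 768)
text_features = hidden[:, 0]                      # sentence-level vectors for later fusion

The resulting sentence vectors can then enter the usual fusion stage in place of monolingual text features.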
6.5. Future prospects

The future of multimodal sentiment analysis techniques holds great promise, with several potential applications on the horizon. Some researchers envision a model with a large number of parameters, possessing sentiment analysis capabilities comparable to those of humans. Such a development would be truly remarkable and pave the way for exciting advancements in the field.

7. Conclusion

Multimodal sentiment analysis has garnered significant recognition in various research fields, establishing itself as a central topic in natural language processing and computer vision. In this comprehensive review, we provide an in-depth exploration of various facets of multimodal sentiment analysis, covering its research background, definition, and developmental trajectory. Additionally, we present a comprehensive overview of commonly used benchmark datasets in Table 1, and conduct a comparative analysis of recent state-of-the-art multimodal sentiment analysis models. Furthermore, we highlight the existing challenges within the field and delve into potential future developments.

Numerous promising endeavors are currently underway, with several already implemented. However, several challenges remain that necessitate further investigation, giving rise to significant research directions:

(1) Constructing large-scale multimodal sentiment datasets in multiple languages.

(2) Addressing the domain transfer problem across video, text, and speech modal data.

(3) Developing a unified, large-scale multimodal sentiment analysis model with exceptional generalization capabilities.

(4) Reducing model parameters, optimizing algorithms, and minimizing algorithmic complexity.

(5) Tackling the challenges posed by multi-lingual hybridity in multimodal sentiment analysis.

(6) Exploring the weighting problem in modal fusion and devising rational weighting schemes for different modalities based on specific cases (a minimal sketch of one such scheme is given below).

(7) Investigating inter-modal correlations and distinguishing between shared and private information to enhance model performance and interpretability.

(8) Constructing a multimodal sentiment analysis model capable of effectively capturing and analyzing hidden emotions.

Addressing these research directions will undoubtedly contribute to the advancement of multimodal sentiment analysis and enable the development of more powerful and comprehensive models in the future.
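The following Python sketch illustrates what a rational weighting scheme of this kind might look like; the feature sizes, the shared dimension, and the scoring network are placeholders of our own rather than any published model. Each modality vector is projected into a shared space, scored, and fused with softmax-normalized, per-sample weights.

import torch
import torch.nn as nn

class WeightedModalityFusion(nn.Module):
    # Projects each modality to a shared space, scores it, and fuses the
    # projections with softmax-normalized, per-sample weights.
    def __init__(self, dims=None, shared_dim=128):
        super().__init__()
        dims = dims or {"text": 768, "audio": 74, "vision": 35}   # placeholder feature sizes
        self.proj = nn.ModuleDict({m: nn.Linear(d, shared_dim) for m, d in dims.items()})
        self.score = nn.Linear(shared_dim, 1)

    def forward(self, features):
        names = sorted(features)                                   # fixed modality order
        z = torch.stack([torch.tanh(self.proj[m](features[m])) for m in names], dim=1)
        weights = torch.softmax(self.score(z).squeeze(-1), dim=1)  # (batch, n_modalities)
        return (weights.unsqueeze(-1) * z).sum(dim=1), weights

# Toy usage: the learned weights show how much each modality contributed per sample.
fusion = WeightedModalityFusion()
fused, weights = fusion({"text": torch.randn(4, 768), "audio": torch.randn(4, 74), "vision": torch.randn(4, 35)})

Because the weights are computed per sample, utterances that rely mainly on, say, the acoustic channel can up-weight it, and inspecting the weights offers a simple interpretability hook, which also relates to direction (7).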
Funding

This work was supported in part by the Joint Fund for Smart Computing of Shandong Natural Science Foundation under Grant ZR2020LZH013; the open project of the State Key Laboratory of Computer Architecture (CARCHA202001); the Major Scientific and Technological Innovation Project in Shandong Province under Grant 2021CXG010506 and 2022CXG010504; and the "New University 20 items" Funding Project of Jinan under Grant 2021GXRC108 and 2021GXRC024.
Declaration of competing interest

The authors declare no conflict of interest. This research was conducted in an unbiased manner and the results presented in this paper are based solely on the data and analysis conducted during the research process. No financial or other relationships exist that could be perceived as a conflict of interest.

Data availability

No data was used for the research described in the article.

References

[1] Julien Deonna, Fabrice Teroni, The Emotions: A Philosophical Introduction, Routledge, 2012.
[2] Clayton Hutto, Eric Gilbert, Vader: A parsimonious rule-based model for sentiment analysis of social media text, in: Proceedings of the International AAAI Conference on Web and Social Media, vol. 8, (no. 1) 2014, pp. 216–225.
[3] Soo-Min Kim, Eduard Hovy, Determining the sentiment of opinions, in: COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, 2004, pp. 1367–1373.
[4] Erik Cambria, Björn Schuller, Yunqing Xia, Catherine Havasi, New avenues in opinion mining and sentiment analysis, IEEE Intell. Syst. 28 (2) (2013) 15–21.
[5] Arshi Parvaiz, Muhammad Anwaar Khalid, Rukhsana Zafar, Huma Ameer, Muhammad Ali, Muhammad Moazam Fraz, Vision transformers in medical computer vision—A contemplative retrospection, Eng. Appl. Artif. Intell. 122 (2023) 106126.
[6] Bo Zhang, Jun Zhu, Hang Su, Toward the third generation artificial intelligence, Sci. China Inf. Sci. 66 (2) (2023) 1–19.
[7] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Comput. Surv. 55 (9) (2023) 1–35.
[8] Jireh Yi-Le Chan, Khean Thye Bea, Steven Mun Hong Leow, Seuk Wai Phoong, Wai Khuen Cheng, State of the art: A review of sentiment analysis based on sequential transfer learning, Artif. Intell. Rev. 56 (1) (2023) 749–780.
[9] Mayur Wankhade, Annavarapu Chandra Sekhara Rao, Chaitanya Kulkarni, A survey on sentiment analysis methods, applications, and challenges, Artif. Intell. Rev. 55 (7) (2022) 5731–5780.
[10] Hui Li, Qi Chen, Zhaoman Zhong, Rongrong Gong, Guokai Han, E-word of mouth sentiment analysis for user behavior studies, Inf. Process. Manage. 59 (1) (2022) 102784.
[11] Ashima Yadav, Dinesh Kumar Vishwakarma, Sentiment analysis using deep learning architectures: A review, Artif. Intell. Rev. 53 (6) (2020) 4335–4385.
[12] Ganesh Chandrasekaran, Tu N. Nguyen, Jude Hemanth D, Multimodal sentimental analysis for social media applications: A comprehensive review, Wiley Interdiscipl. Rev.: Data Mining Knowl. Discov. 11 (5) (2021) e1415.
[13] Bernhard Kratzwald, Suzana Ilic, Mathias Kraus, Stefan Feuerriegel, Helmut Prendinger, Decision support with text-based emotion recognition: Deep learning for affective computing, 2018, arXiv preprint arXiv:1803.06397.
[14] Carlo Strapparava, Rada Mihalcea, Semeval-2007 task 14: Affective text, in: Proceedings of the Fourth International Workshop on Semantic Evaluations, SemEval-2007, 2007, pp. 70–74.
[15] Yang Li, Quan Pan, Suhang Wang, Tao Yang, Erik Cambria, A generative model for category text generation, Inform. Sci. 450 (2018) 301–315.
[16] Rong Dai, Facial expression recognition method based on facial physiological features and deep learning, J. Chongqing Univ. Technol. (Natural Science) 34 (6) (2020) 146–153.
[17] Zhu Ren, Jia Jia, Quan Guo, Kuo Zhang, Lianhong Cai, Acoustics, content and geo-information based sentiment prediction from large-scale networked voice data, in: 2014 IEEE International Conference on Multimedia and Expo, ICME, IEEE, 2014, pp. 1–4.
[18] Liu Jiming, Zhang Peixiang, Liu Ying, Zhang Weidong, Fang Jie, Summary of multi-modal sentiment analysis technology, J. Front. Comput. Sci. Technol. 15 (7) (2021) 1165.
[19] Feiran Huang, Xiaoming Zhang, Zhonghua Zhao, Jie Xu, Zhoujun Li, Image–text sentiment analysis via deep multimodal attentive fusion, Knowl.-Based Syst. 167 (2019) 26–37.
[20] Akshi Kumar, Geetanjali Garg, Sentiment analysis of multimodal twitter data, Multimedia Tools Appl. 78 (2019) 24103–24119.
[21] Ankita Gandhi, Kinjal Adhvaryu, Vidhi Khanduja, Multimodal sentiment analysis: review, application domains and future directions, in: 2021 IEEE Pune Section International Conference, PuneCon, IEEE, 2021, pp. 1–5.
[22] Vaibhav Rupapara, Furqan Rustam, Hina Fatima Shahzad, Arif Mehmood, Imran Ashraf, Gyu Sang Choi, Impact of SMOTE on imbalanced text features for toxic comments classification using RVVC model, IEEE Access 9 (2021) 78621–78634.
[23] Jia Li, Ziyang Zhang, Junjie Lang, Yueqi Jiang, Liuwei An, Peng Zou, Yangyang Xu, Sheng Gao, Jie Lin, Chunxiao Fan, et al., Hybrid multimodal feature extraction, mining and fusion for sentiment analysis, in: Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge, 2022, pp. 81–88.
[24] Anna Favaro, Chelsie Motley, Tianyu Cao, Miguel Iglesias, Ankur Butala, Esther S. Oh, Robert D. Stevens, Jesús Villalba, Najim Dehak, Laureano Moro-Velázquez, A multi-modal array of interpretable features to evaluate language and speech patterns in different neurological disorders, in: 2022 IEEE Spoken Language Technology Workshop, SLT, IEEE, 2023, pp. 532–539.
[25] Soujanya Poria, Erik Cambria, Rajiv Bajpai, Amir Hussain, A review of affective computing: From unimodal analysis to multimodal fusion, Inf. Fusion 37 (2017) 98–125.
[26] Garima Sharma, Abhinav Dhall, A survey on automatic multimodal emotion recognition in the wild, Adv. Data Sci.: Methodol. Appl. (2021) 35–64.
[27] Xin Gu, Yinghua Shen, Jie Xu, Multimodal emotion recognition in deep learning: A survey, in: 2021 International Conference on Culture-Oriented Science & Technology, ICCST, IEEE, 2021, pp. 77–82.
[28] Sathyan Munirathinam, Industry 4.0: Industrial internet of things (IIOT), in: Advances in Computers, vol. 117, (no. 1) Elsevier, 2020, pp. 129–164.
[29] Esteban Ortiz-Ospina, Max Roser, The rise of social media, Our World Data (2023).
[30] Abdul Haseeb, Enjun Xia, Shah Saud, Ashfaq Ahmad, Hamid Khurshid, Does information and communication technologies improve environmental quality in the era of globalization? An empirical analysis, Environ. Sci. Pollut. Res. 26 (2019) 8594–8608.
[31] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, Shrikanth S. Narayanan, IEMOCAP: Interactive emotional dyadic motion capture database, Lang. Resourc. Eval. 42 (2008) 335–359.
[32] Sander Koelstra, Christian Muhl, Mohammad Soleymani, Jong-Seok Lee, Ashkan Yazdani, Touradj Ebrahimi, Thierry Pun, Anton Nijholt, Ioannis Patras, Deap: A database for emotion analysis; Using physiological signals, IEEE Trans. Affect. Comput. 3 (1) (2011) 18–31.
[33] Amir Zadeh, Rowan Zellers, Eli Pincus, Louis-Philippe Morency, Mosi: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos, 2016, arXiv preprint arXiv:1606.06259.
[34] AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, Louis-Philippe Morency, Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2236–2246.
[35] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, Rada Mihalcea, Meld: A multimodal multi-party dataset for emotion recognition in conversations, 2018, arXiv preprint arXiv:1810.02508.
[36] Nan Xu, Wenji Mao, Guandan Chen, Multi-interactive memory network for aspect based multimodal sentiment analysis, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, (no. 01) 2019, pp. 371–378.
[37] Wenmeng Yu, Hua Xu, Fanyang Meng, Yilin Zhu, Yixiao Ma, Jiele Wu, Jiyun Zou, Kaicheng Yang, Ch-sims: A Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3718–3727.
[38] Amir Zadeh, Yan Sheng Cao, Simon Hessner, Paul Pu Liang, Soujanya Poria, Louis-Philippe Morency, CMU-MOSEAS: A multimodal language dataset for Spanish, Portuguese, German and French, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, vol. 2020, NIH Public Access, 2020, p. 1801.
[39] Sathyanarayanan Ramamoorthy, Nethra Gunti, Shreyash Mishra, S. Suryavardan, Aishwarya Reganti, Parth Patwa, Amitava Das, Tanmoy Chakraborty, Amit Sheth, Asif Ekbal, et al., Memotion 2: Dataset on sentiment and emotion analysis of memes, in: Proceedings of de-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.
[40] Paul Pu Liang, Yiwei Lyu, Xiang Fan, Arav Agarwal, Yun Cheng, Louis-Philippe Morency, Ruslan Salakhutdinov, MULTIZOO & multibench: A standardized toolkit for multimodal deep learning, J. Mach. Learn. Res. 24 (2023) 1–7.
[41] Zheng Lian, Haiyang Sun, Licai Sun, Jinming Zhao, Ye Liu, Bin Liu, Jiangyan Yi, Meng Wang, Erik Cambria, Guoying Zhao, et al., MER 2023: Multi-label learning, modality robustness, and semi-supervised learning, 2023, arXiv preprint arXiv:2304.08981.
[42] Louis-Philippe Morency, Rada Mihalcea, Payal Doshi, Towards multimodal sentiment analysis: Harvesting opinions from the web, in: Proceedings of the 13th International Conference on Multimodal Interfaces, 2011, pp. 169–176.
[43] Paul Pu Liang, Ziyin Liu, Amir Zadeh, Louis-Philippe Morency, Multimodal language analysis with recurrent multistage fusion, 2018, arXiv preprint arXiv:1808.03920.
[44] Yansen Wang, Ying Shen, Zhun Liu, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, Words can shift: Dynamically adjusting word representations using nonverbal behaviors, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, (no. 01) 2019, pp. 7216–7223.
[45] Sijie Mai, Haifeng Hu, Songlong Xing, Divide, conquer and combine: Hierarchical feature fusion network with local and global perspectives for multimodal affective computing, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 481–492.
[46] Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, Barnabás Póczos, Found in translation: Learning robust joint representations by cyclic translations between modalities, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, (no. 01) 2019, pp. 6892–6899.
[47] Soujanya Poria, Erik Cambria, Alexander Gelbukh, Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 2539–2544.
[48] Shamane Siriwardhana, Andrew Reis, Rivindu Weerasekera, Suranga Nanayakkara, Jointly fine-tuning "bert-like" self supervised models to improve multimodal speech emotion recognition, 2020, arXiv preprint arXiv:2008.06682.
[49] Behnaz Nojavanasghari, Deepak Gopinath, Jayanth Koushik, Tadas Baltrušaitis, Louis-Philippe Morency, Deep multimodal fusion for persuasiveness prediction, in: Proceedings of the 18th ACM International Conference on Multimodal Interaction, 2016, pp. 284–288.
[50] Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, Eric P. Xing, Select-additive learning: Improving generalization in multimodal sentiment analysis, in: 2017 IEEE International Conference on Multimedia and Expo, ICME, IEEE, 2017, pp. 949–954.
[51] Hongliang Yu, Liangke Gui, Michael Madaio, Amy Ogan, Justine Cassell, Louis-Philippe Morency, Temporally selective attention model for social and affective state recognition in multimedia content, in: Proceedings of the 25th ACM International Conference on Multimedia, 2017, pp. 1743–1751.
[52] Nan Xu, Wenji Mao, Multisentinet: A deep semantic network for multimodal sentiment analysis, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 2399–2402.
[53] Feiyang Chen, Ziqian Luo, Yanyan Xu, Dengfeng Ke, Complementary fusion of multi-features and multi-modalities in sentiment analysis, 2019, arXiv preprint arXiv:1904.08138.
[54] Jie Xu, Zhoujun Li, Feiran Huang, Chaozhuo Li, S. Yu Philip, Social image sentiment analysis by exploiting multimodal content and heterogeneous relations, IEEE Trans. Ind. Inform. 17 (4) (2020) 2974–2982.
[55] Weidong Wu, Yabo Wang, Shuning Xu, Kaibo Yan, SFNN: Semantic features fusion neural network for multimodal sentiment analysis, in: 2020 5th International Conference on Automation, Control and Robotics Engineering, CACRE, IEEE, 2020, pp. 661–665.
[56] Devamanyu Hazarika, Roger Zimmermann, Soujanya Poria, Misa: Modality-invariant and -specific representations for multimodal sentiment analysis, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1122–1131.
[57] Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, Amir Zadeh, Chengfeng Mao, Louis-Philippe Morency, Ehsan Hoque, Integrating multimodal information in large pretrained transformers, in: Proceedings of the Conference. Association for Computational Linguistics. Meeting, vol. 2020, NIH Public Access, 2020, p. 2359.
[58] Jianguo Sun, Hanqi Yin, Ye Tian, Junpeng Wu, Linshan Shen, Lei Chen, Two-level multimodal fusion for sentiment analysis in public security, Secur. Commun. Netw. 2021 (2021) 1–10.
[59] Vasco Lopes, António Gaspar, Luís A. Alexandre, João Cordeiro, An automl-based approach to multimodal image sentiment analysis, in: 2021 International Joint Conference on Neural Networks, IJCNN, IEEE, 2021, pp. 1–9.
[60] Wenmeng Yu, Hua Xu, Ziqi Yuan, Jiele Wu, Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, (no. 12) 2021, pp. 10790–10797.
[61] Jing He, Haonan Yang, Changfan Zhang, Hongrun Chen, Yifu Xu, Dynamic invariant-specific representation fusion network for multimodal sentiment analysis, Comput. Intell. Neurosci. 2022 (2022).
[62] Fan Wang, Shengwei Tian, Long Yu, Jing Liu, Junwen Wang, Kun Li, Yongtao Wang, TEDT: Transformer-based encoding–decoding translation network for multimodal sentiment analysis, Cogn. Comput. 15 (1) (2023) 289–303.
[63] Di Wang, Xutong Guo, Yumin Tian, Jinhui Liu, LiHuo He, Xuemei Luo, TETFN: A text enhanced transformer fusion network for multimodal sentiment analysis, Pattern Recognit. 136 (2023) 109259.
[64] Songning Lai, Xifeng Hu, Yulong Li, Zhaoxia Ren, Zhi Liu, Danmin Miao, Shared and private information learning in multimodal sentiment analysis with deep modal alignment and self-supervised multi-task learning, 2023, arXiv preprint arXiv:2305.08473.
[65] Mahesh G. Huddar, Sanjeev S. Sannakki, Vijay S. Rajpurohit, A survey of computational approaches and challenges in multimodal sentiment analysis, Int. J. Comput. Sci. Eng. 7 (1) (2019) 876–883.
[66] Ramandeep Kaur, Sandeep Kautish, Multimodal sentiment analysis: A survey and comparison, Res. Anthol. Implement. Sentiment Anal. Across Multiple Discipl. (2022) 1846–1870.
[67] Lukas Stappen, Alice Baird, Lea Schumann, Björn Schuller, The multimodal sentiment analysis in car reviews (muse-car) dataset: Collection, insights and improvements, IEEE Trans. Affect. Comput. (2021).
[68] Anurag Illendula, Amit Sheth, Multimodal emotion classification, in: Companion Proceedings of the 2019 World Wide Web Conference, 2019, pp. 439–449.
[69] Donglei Tang, Zhikai Zhang, Yulan He, Chao Lin, Deyu Zhou, Hidden topic–emotion transition model for multi-level social emotion detection, Knowl.-Based Syst. 164 (2019) 426–435.
[70] Petr Hajek, Aliaksandr Barushka, Michal Munk, Fake consumer review detection using deep neural networks integrating word embeddings and emotion mining, Neural Comput. Appl. 32 (2020) 17259–17274.
[71] Soonil Kwon, A CNN-assisted enhanced audio signal processing for speech emotion recognition, Sensors 20 (1) (2019) 183.
[72] Umar Rashid, Muhammad Waseem Iqbal, Muhammad Akmal Skiandar, Muhammad Qasim Raiz, Muhammad Raza Naqvi, Syed Khuram Shahzad, Emotion detection of contextual text using deep learning, in: 2020 4th International Symposium on Multidisciplinary Studies and Innovative Technologies, ISMSIT, IEEE, 2020, pp. 1–5.
[73] Zhentao Xu, Verónica Pérez-Rosas, Rada Mihalcea, Inferring social media users' mental health status from multimodal information, in: Proceedings of the 12th Language Resources and Evaluation Conference, 2020, pp. 6292–6299.
[74] Rahee Walambe, Pranav Nayak, Ashmit Bhardwaj, Ketan Kotecha, Employing multimodal machine learning for stress detection, J. Healthc. Eng. 2021 (2021) 1–12.
[75] Nujud Aloshban, Anna Esposito, Alessandro Vinciarelli, What you say or how you say it? Depression detection through joint modeling of linguistic and acoustic aspects of speech, Cogn. Comput. 14 (5) (2022) 1585–1598.
[76] Safa Chebbi, Sofia Ben Jebara, Deception detection using multimodal fusion approaches, Multimedia Tools Appl. (2021) 1–30.