Professional Documents
Culture Documents
Multimodal sentiment analysis A survey
Multimodal sentiment analysis A survey
Displays
journal homepage: www.elsevier.com/locate/displa
Review
Keywords: Multimodal sentiment analysis has emerged as a prominent research field within artificial intelligence,
Multimodal sentiment analysis benefiting immensely from recent advancements in deep learning. This technology has unlocked unprecedented
Multimodal fusion possibilities for application and research, rendering it a highly sought-after area of study. In this review, we
Affective computing
aim to present a comprehensive overview of multimodal sentiment analysis by delving into its definition,
Computer vision
historical context, and evolutionary trajectory. Furthermore, we explore recent datasets and state-of-the-art
models, with a particular focus on the challenges encountered and the future prospects that lie ahead. By
offering constructive suggestions for promising research directions and the development of more effective
multimodal sentiment analysis models, this review intends to provide valuable guidance to researchers in this
dynamic field.
https://doi.org/10.1016/j.displa.2023.102563
Received 28 June 2023; Received in revised form 2 October 2023; Accepted 23 October 2023
Available online 31 October 2023
0141-9382/© 2023 Published by Elsevier B.V.
S. Lai et al. Displays 80 (2023) 102563
Fig. 1. The figure illustrates the model architecture for a classical approach to multi-modal sentiment analysis. The architecture comprises three essential components: feature
extraction for each modality, fusion of the extracted features, and sentiment analysis based on the fused features. These three components play a crucial role in the analysis process,
prompting researchers to focus on optimizing each part individually.
feature extraction, fusion of modality features, and sentiment analysis characteristics. This section presents well-known datasets within the
based on the fused features. These three components are of utmost im- research community, intending to assist researchers in understanding
portance, and researchers have initiated individual optimization efforts the distinct features of each dataset and facilitating their selection
for each part [25]. process.
Our research focuses on multimodal sentiment analysis, providing a
comprehensive overview of the field, including its definition, historical 2.1. IEMOCAP [31]
context, and evolutionary trajectory. It explores recent datasets and
state-of-the-art models, addressing challenges and future prospects. IEMOCAP, a sentiment analysis dataset released by the Speech
Its unique contribution lies in offering constructive suggestions for Analysis and Interpretation Laboratory in 2008, offers a comprehensive
research directions. [26] discusses affective computing and automatic multi-modal dataset. It consists of 1039 conversational segments, with
emotion recognition, covering techniques, signal modalities, databases, a total video duration of 12 h. The study involved participants engaging
feature extraction, applications, and issues such as privacy and fair- in five different scenarios, expressing emotions based on predefined
ness. [27] emphasizes emotion recognition based on modal informa- scenarios. The dataset incorporates audio, video, and text information,
tion, particularly video and audio, and reviews research methods and along with facial expression and posture data collected through addi-
fusion technologies. It also examines emotion recognition based on tional sensors. Data points are categorized into ten emotions, including
deep learning and provides suggestions for future research. In compar- neutral, happy, sad, angry, surprised, scared, disgusted, frustrated, ex-
ison to other surveys, our research stands out by providing a compre- cited, and other. IEMOCAP provides a valuable resource for researchers
hensive overview and valuable guidance for researchers in multimodal
investigating sentiment analysis across multiple modalities.
sentiment analysis.
This review provides a comprehensive overview of multimodal
2.2. DEAP [32]
sentiment analysis. It includes a summary and introduction to relevant
datasets, aiding researchers in selecting appropriate datasets for their
studies. We compare and analyze models that hold significant research DEAP is a dataset specifically designed for sentiment analysis that
significance in multimodal sentiment analysis, offering suggestions for utilizes physiological signals as its source of data. The dataset examines
model construction. In addition, we explore three types of modal EEG data from 32 subjects, with a 1:1 ratio of male and female
fusion methods, highlighting the advantages and disadvantages of each participants. EEG signals were collected at 512 Hz from different areas
approach. Finally, we discuss the challenges and future directions in of the subjects’ brains, including the frontal, parietal, occipital, and
multimodal sentiment analysis, presenting several promising research temporal lobes. To annotate the EEG signals, the subjects were asked
avenues. Distinct from other reviews in the field, our focus lies in to rate the corresponding videos in terms of three aspects: Valence,
providing constructive suggestions for promising research directions Arousal, and Dominance, on a scale of 1 to 9. This dataset provides
and constructing superior multimodal sentiment analysis models. We valuable resources for researchers to explore sentiment analysis using
accentuate the challenges and future prospects associated with these physiological signals.
technologies.
2.3. CMU-MOSI [33]
2. Multimodal sentiment analysis datasets
The CMU-MOSI dataset comprises 93 significant YouTube videos,
The rapid growth of the Internet has ushered in an era of data explo- covering a diverse range of topics, as described by Zadeh et al. (2016).
sion [28–30]. Researchers have extensively gathered these data from The selection process ensured that each video featured a single speaker
various online sources such as videos and reviews, and have created facing the camera, enabling clear capture of facial expressions. While
sentiment datasets tailored to their specific needs. Table 1 provides there were no specific constraints on the camera model, distance, or
a summary of commonly used multimodal datasets. The first column speaker scene, all presentations and comments in the videos were
lists the dataset names, while the second column indicates the release conducted in English by 89 distinct speakers, including 41 women
year of the sentiment data. The third column categorizes the modalities and 48 men. The 93 videos were further divided into 2199 subjective
included in each dataset, and the fourth column specifies the platform opinion segments, each annotated with sentiment intensity ranging
from which the dataset was obtained. The fifth column denotes the from strongly negative to strongly positive (−3 to 3). Overall, the CMU-
language used in the dataset, and the sixth column quantifies the data MOSI dataset serves as a valuable resource for researchers engaged in
volume within each dataset. Each dataset possesses its own unique the study of sentiment analysis.
2
S. Lai et al. Displays 80 (2023) 102563
Table 1
The table presented below showcases the utilized multimodal datasets.
Name Year Modalities Source Language Number
IEMOCAP 2008 A+V+T N/A English 10 039
A+V+T
DEAP 2011 A+V+T N/A English 10 039
A+V+T
CMU-MOSI 2016 A+V+T YouTube English 2199
CMU-MOSEI 2018 A+V+T YouTube English 23 453
MELD 2019 A+V+T The Friends English 13 000
Multi-ZOL 2019 V+T ZOL.com Chinese 5288
CH-SIMS 2020 A+V+T N/A Chinese 2281
Spanish
Portuguese
CMU-MOSEAS 2021 A+V+T YouTube German 40 000
French
Reddit
MEMOTION 2022 V+T Facebook English 10 000
CMU-MOSEI is a highly regarded dataset extensively used in sen- CMU-MOSEA is a versatile dataset that encompasses multiple lan-
timent analysis, comprising 3228 YouTube videos. These videos are guages, including Spanish, Portuguese, German, and French. The
divided into 23,453 segments and encompass data from three dis- dataset consists of 40,000 sentence fragments, covering 250 diverse
tinct modalities: text, visual, and sound. With contributions from 1000 topics and featuring contributions from 1645 speakers. The annotations
speakers and coverage of 250 different topics, this dataset offers a within the dataset are classified into two categories: sentiment inten-
diverse range of perspectives. All videos in the dataset are in English, sity and binary. Each sentence fragment is annotated with sentiment
and annotations for both sentiment and emotion are available. The six strength, quantified within the interval [−3, 3]. Additionally, the binary
emotion categories include happy, sad, angry, scared, disgusted, and category indicates whether the speaker expressed an opinion or made
an objective statement. For each sentence, emotions are categorized
surprised, while the sentiment intensity markers span from strongly
into six categories: happiness, sadness, fear, disgust, surprise, and
negative to strongly positive (−3 to 3). Overall, CMU-MOSEI stands as
annotation.
an invaluable resource for researchers delving into sentiment analysis
across multiple modalities. 2.9. MEMOTION [39]
2.5. MELD [35] MEMOTION is a meme-based dataset that focuses on popular memes
associated with politics, religion, and sports. The dataset comprises
MELD is a comprehensive dataset featuring video clips sourced from 10,000 data points and is organized into three distinct sub-tasks: Senti-
the well-known television series Friends. The dataset encompasses tex- ment Analysis, Emotion Classification, and Scale/Intensity of Emotion
tual, audio, and video information that aligns with the accompanying Classes. Annotators approached each data point differently based on
textual data. It consists of 1400 videos, further divided into 13,000 the specific subtask. In subtask one, data points were annotated into
individual segments. The dataset is meticulously annotated with seven three categories: negative, neutral, and positive. Subtask two involved
categories of emotions: Anger, Disgust, Sadness, Joy, Neutral, Surprise, annotating each data point into four categories: humor, sarcasm, of-
and Fear. Additionally, each segment is annotated with three sentiment fense, and motivation. Lastly, in subtask three, each data point was
labels: positive, negative, and neutral. assigned a sentiment intensity value within the interval of [0, 4]. This
dataset presents a unique opportunity for researchers to explore the use
2.6. Multi-ZOL [36] of memes as a form of communication and expression within modern
culture.
Multi-ZOL is a dataset designed for bimodal sentiment classification
of images and text. The dataset consists of reviews of mobile phones col- 2.10. MULTIZOO & MULTIBENCH [40]
lected from ZOL.com. It contains 5288 sets of multimodal data points
The MULTIBENCH dataset is a comprehensive and diverse collection
that cover various models of mobile phones from multiple brands.
of datasets designed for multimodal learning research. It encompasses
These data points are annotated with a sentiment intensity rating from
15 datasets spanning six distinct research areas, including multimedia,
1 to 10 for six aspects.
affective computing, robotics, finance, human–computer interaction,
and healthcare. With more than 10 input modalities available, such
2.7. CH-SIMS [37]
as language, image, video, audio, time-series, tabular data, force sen-
sors, proprioception sensors, sets, and optical flow, the MULTIBENCH
CH-SIMS is a distinctive dataset comprising 60 open-sourced videos dataset offers a wide range of real-world scenarios for studying mul-
sourced from the web. These videos are further divided into 2281 video timodal representation learning. Each dataset is carefully curated to
clips. The dataset specifically focuses on the Chinese (Mandarin) lan- cover various prediction tasks, resulting in a total of 20 prediction
guage and ensures that each segment features only one character’s face tasks for evaluation. This diversity allows researchers to explore the
and voice. It encompasses a diverse array of scenes and includes speak- integration of different data sources and investigate the performance
ers spanning various age groups. The dataset is meticulously labeled and robustness of multimodal models across domains and modali-
for each modality, rendering it a valuable resource for researchers. ties. The availability of MULTIBENCH dataset provides researchers
The annotations in the dataset encompass sentiment intensity ratings with a standardized and reliable benchmark for assessing the capabil-
ranging from negative to positive (−1 to 1), along with additional ities and limitations of multimodal models, promoting accessible and
annotations for attributes such as age and gender. reproducible research in the field of multimodal learning.
3
S. Lai et al. Displays 80 (2023) 102563
4
S. Lai et al. Displays 80 (2023) 102563
5
S. Lai et al. Displays 80 (2023) 102563
6
S. Lai et al. Displays 80 (2023) 102563
The proposed model primarily focuses on the text and audio modal- The authors propose a novel approach that combines text and image
ities. Its main contribution lies in the introduction of novel strategies sentiment analysis through the utilization of AutoML for generating a
for multi-feature fusion and multimodal fusion. The model employs final fused classification. In this approach, individual classifiers for text
two parallel branches to extract features specifically for the text and
and image are combined using the best model selected by AutoML. This
audio modalities. Subsequently, a multimodal attention fusion module
decision-level fusion technique merges the outputs of the individual
is employed to effectively combine the features from these two modal-
classifiers to form the final classification. This approach showcases a
ities, achieving multimodal fusion. This fusion process enhances the
typical example of decision-level fusion, where the strengths of both
model’s ability to capture complementary information and achieve a
text and image analysis are leveraged to improve the overall sentiment
comprehensive understanding of the input data from both the text and
audio sources. analysis performance.
This model primarily aims to capture the relationship between In this study, the authors propose a novel multi-modal sentiment
the text and visual modalities. The authors introduce an attention analysis architecture that combines self-supervised learning and multi-
mechanism-based heterogeneous relation model, which effectively task learning. To capture the private information specific to each
combines high-quality representation information from both the text
modality, a single-modal label generation module called ULGM is con-
and visual modalities. The progressive dual attention mechanism em-
structed based on self-supervised learning. The corresponding loss func-
ployed in this model effectively emphasizes channel-level semantic
tion for this module is designed to integrate the private features ac-
information from both images and text. In order to incorporate social
quired through three self-supervised learning subtasks into the original
attributes, social relations are introduced to capture complementary
information from the social context. Furthermore, heterogeneous net- multi-modal sentiment analysis model, employing a weight adjustment
works are constructed to integrate features from different modalities, strategy. The proposed model demonstrates strong performance, and
facilitating a comprehensive understanding of the multimodal data. the ULGM module based on self-supervised learning also possesses
the capability of calibrating single-modal labels. This combination of
4.4. SFNN [55] self-supervised learning and multi-task learning proves to be effec-
tive in enhancing the sentiment analysis performance while capturing
The proposed model is a neural network based on semantic fea- modality-specific information.
ture fusion. Convolutional neural networks and attention mechanisms
are used to extract visual features. Visual features are mapped to 4.10. DISRFN [61]
text features and combined with text modality features for sentiment
analysis.
The model is a dynamically invariant representation-specific fusion
4.5. MISA [56] network. The joint domain separation network is improved to obtain
a joint domain separation representation for all modalities, so that
The proposed model introduces an innovative framework for multi- redundant information can be effectively utilized. Second, a HGFN
modal sentiment analysis. In this framework, each modality undergoes network is used to dynamically fuse the feature information of each
feature extraction and is subsequently mapped into two separate fea- modality and learn the features of multiple modal interactions. At the
ture spaces. One of these feature spaces primarily focuses on learning same time, a loss function that improves the fusion effect is constructed
the invariant features of the modality, while the other one empha- to help the model learn the representation information of each modality
sizes the acquisition of the modality’s unique features. This approach in the subspace.
allows the model to capture both the shared characteristics across
modalities and the distinctive aspects specific to each modality, thereby 4.11. TEDT [62]
facilitating a comprehensive analysis of the multi-modal data.
4.6. MAG-BERT [57] This model introduces a novel multimodal encoding-decoding trans-
lation network, incorporating a transformer, to tackle the challenges as-
The authors introduce a ‘‘multi-modal’’ adaptation architecture that sociated with multimodal sentiment analysis. Specifically, it addresses
is specifically applied to BERT. This novel model is capable of ac- the influence of individual modal data and the low quality of non-
commodating input from multiple modalities during the fine-tuning natural language features. The proposed method assigns text as the
process. The Multi-modal Adaptation Gateway (MAG) can be conceptu- primary information, while sound and image serve as secondary in-
alized as a vector embedding structure, enabling the incorporation of formation. To enhance the quality of non-natural language features, a
multimodal information and its subsequent embedding as a sequence modality reinforcement cross-attention module is employed to convert
within BERT. This adaptation architecture enhances the model’s ability them into natural language features. Furthermore, a dynamic filtering
to process and analyze data from diverse modalities, facilitating a more mechanism is implemented to filter out error information arising from
comprehensive understanding of multimodal inputs within the BERT cross-modal interaction. The model’s key strength lies in its ability to
framework. enhance multimodal fusion and accurately analyze human sentiment.
However, it should be noted that this approach may require signifi-
4.7. TIMF [58] cant computational resources and might not be suitable for real-time
analysis.
The main idea of this model is that each modality learns features
separately and performs tensor fusion of the features of each modality.
In the dataset fusion stage, the feature fusion for each modality is 4.12. TETFN [63]
implemented by a tensor fusion network. In the decision fusion stage,
the upstream results are fused by soft fusion to adjust the decision The Text Enhanced Transformer Fusion Network (TETFN) presents a
results. novel approach to MSA that effectively addresses the challenges posed
7
S. Lai et al. Displays 80 (2023) 102563
by the varying contributions of textual, visual, and acoustic modalities. 5.3. F1 score
This method focuses on learning text-oriented pairwise cross-modal
mappings to acquire unified and robust multimodal representations. The F1 score is a metric that combines precision and recall into
It achieves this by incorporating textual information into the learning a single value, providing a balanced evaluation of the model’s per-
process of sentiment-related nonlinguistic representations through text- formance. It is the harmonic mean of precision and recall and is
based multi-head attention. Additionally, the model incorporates uni- particularly useful when there is an imbalance between positive and
modal label prediction to maintain differentiated information among negative instances in the dataset. The F1 score ranges from 0 to 1, with
the various modalities. To capture both global and local information higher values indicating better performance.
of a human face, the vision pre-trained model Vision-Transformer is
employed to extract visual features from the original videos. The key 5.4. Mean Absolute Error (MAE)
strength of this model lies in its ability to leverage textual information
to enhance the effectiveness of nonlinguistic modalities in MSA, while The mean absolute error is a metric commonly used in regression-
simultaneously preserving inter- and intra-modality relationships. based sentiment analysis tasks. It measures the average absolute dif-
ference between the predicted sentiment values and the true sentiment
4.13. SPIL [64] values. MAE provides a measure of the model’s accuracy in predicting
sentiment on a continuous scale.
This model introduces a deep modal shared information learning
module designed to enhance representation learning in multimodal 5.5. Mean Squared Error (MSE)
sentiment analysis tasks. The proposed module effectively captures
both shared and private information within a comprehensive modal Similar to MAE, mean squared error is another metric used in
representation. It achieves this by utilizing a covariance matrix to regression-based sentiment analysis tasks. It calculates the average
capture shared information across modalities, while employing a self- squared difference between the predicted sentiment values and the true
supervised learning strategy to capture private information specific to sentiment values. MSE amplifies the impact of large errors, making it
each modality. The module is flexible and can be seamlessly integrated more sensitive to outliers in the predictions.
into existing frameworks, allowing for adjustable information exchange
between modalities to learn either private or shared information as 5.6. Cross-entropy loss
needed. Additionally, a multi-task learning strategy is implemented to
guide the model’s attention towards modal differentiation during train-
Cross-entropy loss is a common metric used in sentiment analysis
ing. The proposed model demonstrates superior performance compared
tasks that involve probabilistic models, such as deep learning mod-
to current state-of-the-art methods across various metrics on three
els. It measures the dissimilarity between the predicted probability
publicly available datasets. Furthermore, the study explores additional
distribution and the true probability distribution of sentiment labels.
combinatorial techniques for leveraging the capabilities of the proposed
Minimizing cross-entropy loss encourages the model to produce more
module.
accurate and confident predictions.
In the field of multimodal sentiment analysis, evaluating the per- Cohen’s Kappa is a statistical measure of inter-rater agreement, com-
formance of sentiment analysis models is crucial for assessing their monly used when evaluating sentiment analysis models in scenarios
effectiveness. Various evaluation metrics have been developed to mea- where multiple annotators label the same dataset. It takes into account
sure the performance of these models, providing insights into their the agreement between the model’s predictions and the average agree-
accuracy, robustness, and generalizability. This chapter aims to intro- ment expected by chance. Cohen’s Kappa provides insights into the
duce and discuss the commonly used evaluation metrics in multimodal model’s performance beyond chance agreement.
sentiment analysis. By using these evaluation metrics, researchers and practitioners can
assess the performance of multimodal sentiment analysis models and
5.1. Accuracy compare their effectiveness. It is important to consider the specific
characteristics of the dataset and the task at hand when selecting
Accuracy is a fundamental evaluation metric that measures the appropriate evaluation metrics to ensure a comprehensive evaluation
overall correctness of the sentiment predictions made by a model. of model performance.
It calculates the ratio of correctly classified instances to the total
number of instances in the dataset. While accuracy provides a general 6. Challenges and future scope
understanding of the model’s performance, it may not be sufficient for
assessing the model’s performance when class imbalances exist in the The rapid development of deep learning has led to significant
dataset. advancements in multimodal sentiment analysis techniques [65–68].
However, despite these advancements, there are still numerous chal-
5.2. Precision and recall lenges that need to be addressed in the field of multimodal sentiment
analysis. Therefore, this subsection aims to analyze the current state
Precision and recall are evaluation metrics commonly used in binary of research, identify the challenges faced, and explore potential future
sentiment classification tasks. Precision measures the proportion of developments in multimodal sentiment analysis.
correctly predicted positive instances out of all instances predicted
as positive. Recall, on the other hand, calculates the proportion of 6.1. Dataset
correctly predicted positive instances out of all actual positive instances
in the dataset. Precision and recall provide insights into the model’s The dataset is a critical component in multimodal sentiment anal-
ability to correctly identify positive instances and avoid false positives. ysis. However, there is currently a lack of large-scale datasets that
8
S. Lai et al. Displays 80 (2023) 102563
encompass multiple languages. Considering the linguistic and racial of these applications include multimodal emotion analysis for real-
diversity in many countries, it is crucial to have a large and diverse time mental health assessment [73–75], multimodal criminal linguistic
dataset that can be used to train multimodal sentiment analysis models deception detection models [76], and offensive language detection.
with robust generalization capabilities and wide applicability. More- Another exciting prospect is the development of human-like emotion-
over, existing multimodal datasets suffer from low annotation accuracy aware robots. Multimodal emotion analysis is a powerful technique
and have not yet achieved absolute continuous values. To address this for recognizing and analyzing emotions. By leveraging multimodal
issue, researchers need to focus on more precise labeling of multimodal information and combining data from various modalities, sentiment
datasets. Furthermore, most current multimodal datasets only incorpo- analysis models can significantly enhance the accuracy of emotion
rate visual, speech, and text modalities, while neglecting the inclusion recognition. As research progresses, multimodal sentiment analysis
of modal information combined with physiological signals such as brain techniques will continue to improve. It is conceivable that one day
waves and pulses. Thus, there is a need to expand multimodal datasets we may witness the emergence of a multimodal sentiment analysis
to include these additional modalities for more comprehensive analysis. model with a large number of parameters, possessing sentiment analysis
capabilities comparable to those of humans. Such a development would
6.2. Detection of hidden emotions be truly remarkable and pave the way for exciting advancements in the
field.
Multimodal sentiment analysis tasks have long faced a recognized
challenge: the analysis of hidden emotions. Hidden emotions encom- 7. Conclusion
pass various aspects, including sarcastic emotions expressed through
the use of sarcastic words, emotions that require contextual analysis Multimodal sentiment analysis has garnered significant recogni-
for concrete understanding, and complex emotions such as happiness tion in various research fields, establishing itself as a central topic
and sadness experienced by individuals [69,70]. The exploration of in natural language processing and computer vision. In this compre-
these hidden emotions is crucial as it represents a gap between human hensive review, we provide an in-depth exploration of various facets
and artificial intelligence capabilities in understanding and interpreting of multimodal sentiment analysis, covering its research background,
emotions [71,72]. By delving into these hidden emotions, we can definition, and developmental trajectory. Additionally, we present a
bridge this gap and enhance the effectiveness of multimodal sentiment comprehensive overview of commonly used benchmark datasets in
analysis. Table 1, and conduct a comparative analysis of recent state-of-the-
art multimodal sentiment analysis models. Furthermore, we highlight
6.3. Multiple forms of video data the existing challenges within the field and delve into potential future
developments.
Video data presents unique challenges in multimodal sentiment Numerous promising endeavors are currently underway, with sev-
analysis tasks. Despite the speaker facing the camera and the high- eral already implemented. However, there remain several challenges
resolution quality of the video data, the actual situation is often more that necessitate further investigation, giving rise to significant research
complex, necessitating a model that is resilient to noise and capa- directions:
ble of analyzing low-resolution video data. Furthermore, there is an (1) Constructing large-scale multimodal sentiment datasets in mul-
opportunity for researchers to delve into the capture and analysis of tiple languages.
micro-expressions and micro-gestures exhibited by speakers, as these (2) Addressing the domain transfer problem across video, text, and
subtle cues can provide valuable insights for sentiment analysis. There- speech modal data.
fore, exploring techniques to address these challenges and harness the (3) Developing a unified, large-scale multimodal sentiment analysis
potential of micro-expressions and micro-gestures is an important area model with exceptional generalization capabilities.
of focus for researchers in the field. (4) Reducing model parameters, optimizing algorithms, and mini-
mizing algorithmic complexity.
6.4. Multiform language data (5) Tackling the challenges posed by multi-lingual hybridity in
multimodal sentiment analysis.
In multimodal sentiment analysis tasks, text data is typically uni- (6) Exploring the weighting problem in modal fusion and devising
modal, focusing on a single language. However, in online communities, rational weighting schemes for different modalities based on specific
evaluation texts often involve multiple languages, as reviewers em- cases.
ploy different languages to express their comments more vividly. This (7) Investigating the inter-modal correlations and distinguishing
presents a challenge in analyzing text data with mixed emotions within between shared and private information to enhance model performance
a multimodal context. Effectively utilizing memes that are embedded in and interpretability.
the text is an important research area, as memes often convey highly (8) Constructing a multimodal sentiment analysis model capable of
impactful emotional messages from reviewers. Moreover, analyzing effectively capturing and analyzing hidden emotions.
emotions becomes more challenging when multiple individuals are Addressing these research directions will undoubtedly contribute to
engaged in a conversation, as most text data is transcribed directly the advancement of multimodal sentiment analysis and enable the de-
from speech. Furthermore, considering the cultural characteristics of velopment of more powerful and comprehensive models in the future.
different regions and countries, the same text data may evoke diverse
emotions. Therefore, understanding and addressing these complexi- Funding
ties associated with cross-lingual text data, incorporating memes, and
accounting for cultural nuances are crucial areas of investigation in This work was supported in part by Joint found for smart computing
multimodal sentiment analysis research. of Shandong Natural Science Foundation under Grant ZR2020LZH013;
open project of State Key Laboratory of Computer Architecture
6.5. Future prospects CARCHA202001; the Major Scientific and Technological Innovation
Project in Shandong Province under Grant 2021CXG010506 and
The future of multimodal sentiment analysis techniques holds great 2022CXG010504; ‘‘New University 20 items’’ Funding Project of Jinan
promise, with several potential applications on the horizon. Some under Grant 2021GXRC108 and 2021GXRC024.
9
S. Lai et al. Displays 80 (2023) 102563
Declaration of competing interest [23] Jia Li, Ziyang Zhang, Junjie Lang, Yueqi Jiang, Liuwei An, Peng Zou, Yangyang
Xu, Sheng Gao, Jie Lin, Chunxiao Fan, et al., Hybrid multimodal feature
extraction, mining and fusion for sentiment analysis, in: Proceedings of the 3rd
The authors declare no conflict of interest. This research was con-
International on Multimodal Sentiment Analysis Workshop and Challenge, 2022,
ducted in an unbiased manner and the results presented in this paper pp. 81–88.
are based solely on the data and analysis conducted during the research [24] Anna Favaro, Chelsie Motley, Tianyu Cao, Miguel Iglesias, Ankur Butala,
process. No financial or other relationships exist that could be perceived Esther S. Oh, Robert D. Stevens, Jesús Villalba, Najim Dehak, Laureano Moro-
Velázquez, A multi-modal array of interpretable features to evaluate language
as a conflict of interest.
and speech patterns in different neurological disorders, in: 2022 IEEE Spoken
Language Technology Workshop, SLT, IEEE, 2023, pp. 532–539.
Data availability [25] Soujanya Poria, Erik Cambria, Rajiv Bajpai, Amir Hussain, A review of affective
computing: From unimodal analysis to multimodal fusion, Inf. Fusion 37 (2017)
98–125.
No data was used for the research described in the article.
[26] Garima Sharma, Abhinav Dhall, A survey on automatic multimodal emotion
recognition in the wild, Adv. Data Sci.: Methodol. Appl. (2021) 35–64.
References [27] Xin Gu, Yinghua Shen, Jie Xu, Multimodal emotion recognition in deep learning:
A survey, in: 2021 International Conference on Culture-Oriented Science &
Technology, ICCST, IEEE, 2021, pp. 77–82.
[1] Julien Deonna, Fabrice Teroni, The Emotions: A Philosophical Introduction,
Routledge, 2012. [28] Sathyan Munirathinam, Industry 4.0: Industrial internet of things (IIOT), in:
Advances in Computers, vol. 117, (no. 1) Elsevier, 2020, pp. 129–164.
[2] Clayton Hutto, Eric Gilbert, Vader: A parsimonious rule-based model for sen-
[29] Esteban Ortiz-Ospina, Max Roser, The rise of social media, Our World Data
timent analysis of social media text, in: Proceedings of the International AAAI
(2023).
Conference on Web and Social Media, vol. 8, (no. 1) 2014, pp. 216–225.
[30] Abdul Haseeb, Enjun Xia, Shah Saud, Ashfaq Ahmad, Hamid Khurshid, Does
[3] Soo-Min Kim, Eduard Hovy, Determining the sentiment of opinions, in: COLING
information and communication technologies improve environmental quality in
2004: Proceedings of the 20th International Conference on Computational
the era of globalization? An empirical analysis, Environ. Sci. Pollut. Res. 26
Linguistics, 2004, pp. 1367–1373.
(2019) 8594–8608.
[4] Erik Cambria, Björn Schuller, Yunqing Xia, Catherine Havasi, New avenues in
[31] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower,
opinion mining and sentiment analysis, IEEE Intell. Syst. 28 (2) (2013) 15–21.
Samuel Kim, Jeannette N. Chang, Sungbok Lee, Shrikanth S. Narayanan, IEMO-
[5] Arshi Parvaiz, Muhammad Anwaar Khalid, Rukhsana Zafar, Huma Ameer,
CAP: Interactive emotional dyadic motion capture database, Lang. Resourc. Eval.
Muhammad Ali, Muhammad Moazam Fraz, Vision transformers in medical
42 (2008) 335–359.
computer vision—A contemplative retrospection, Eng. Appl. Artif. Intell. 122
[32] Sander Koelstra, Christian Muhl, Mohammad Soleymani, Jong-Seok Lee, Ashkan
(2023) 106126.
Yazdani, Touradj Ebrahimi, Thierry Pun, Anton Nijholt, Ioannis Patras, Deap: A
[6] Bo Zhang, Jun Zhu, Hang Su, Toward the third generation artificial intelligence,
database for emotion analysis; Using physiological signals, IEEE Trans. Affect.
Sci. China Inf. Sci. 66 (2) (2023) 1–19.
Comput. 3 (1) (2011) 18–31.
[7] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham
[33] Amir Zadeh, Rowan Zellers, Eli Pincus, Louis-Philippe Morency, Mosi: Multi-
Neubig, Pre-train, prompt, and predict: A systematic survey of prompting
modal corpus of sentiment intensity and subjectivity analysis in online opinion
methods in natural language processing, ACM Comput. Surv. 55 (9) (2023) 1–35.
videos, 2016, arXiv preprint arXiv:1606.06259.
[8] Jireh Yi-Le Chan, Khean Thye Bea, Steven Mun Hong Leow, Seuk Wai Phoong,
[34] AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, Louis-
Wai Khuen Cheng, State of the art: A review of sentiment analysis based on
Philippe Morency, Multimodal language analysis in the wild: Cmu-mosei dataset
sequential transfer learning, Artif. Intell. Rev. 56 (1) (2023) 749–780.
and interpretable dynamic fusion graph, in: Proceedings of the 56th Annual
[9] Mayur Wankhade, Annavarapu Chandra Sekhara Rao, Chaitanya Kulkarni, A
Meeting of the Association for Computational Linguistics (Volume 1: Long
survey on sentiment analysis methods, applications, and challenges, Artif. Intell.
Papers), 2018, pp. 2236–2246.
Rev. 55 (7) (2022) 5731–5780.
[35] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik
[10] Hui Li, Qi Chen, Zhaoman Zhong, Rongrong Gong, Guokai Han, E-word of mouth Cambria, Rada Mihalcea, Meld: A multimodal multi-party dataset for emotion
sentiment analysis for user behavior studies, Inf. Process. Manage. 59 (1) (2022) recognition in conversations, 2018, arXiv preprint arXiv:1810.02508.
102784.
[36] Nan Xu, Wenji Mao, Guandan Chen, Multi-interactive memory network for aspect
[11] Ashima Yadav, Dinesh Kumar Vishwakarma, Sentiment analysis using deep based multimodal sentiment analysis, in: Proceedings of the AAAI Conference on
learning architectures: A review, Artif. Intell. Rev. 53 (6) (2020) 4335–4385. Artificial Intelligence, vol. 33, (no. 01) 2019, pp. 371–378.
[12] Ganesh Chandrasekaran, Tu N. Nguyen, Jude Hemanth D, Multimodal senti- [37] Wenmeng Yu, Hua Xu, Fanyang Meng, Yilin Zhu, Yixiao Ma, Jiele Wu, Jiyun
mental analysis for social media applications: A comprehensive review, Wiley Zou, Kaicheng Yang, Ch-sims: A Chinese multimodal sentiment analysis dataset
Interdiscipl. Rev.: Data Mining Knowl. Discov. 11 (5) (2021) e1415. with fine-grained annotation of modality, in: Proceedings of the 58th Annual
[13] Bernhard Kratzwald, Suzana Ilic, Mathias Kraus, Stefan Feuerriegel, Helmut Meeting of the Association for Computational Linguistics, 2020, pp. 3718–3727.
Prendinger, Decision support with text-based emotion recognition: Deep learning [38] Amir Zadeh, Yan Sheng Cao, Simon Hessner, Paul Pu Liang, Soujanya Poria,
for affective computing, 2018, arXiv preprint arXiv:1803.06397. Louis-Philippe Morency, CMU-MOSEAS: A multimodal language dataset for
[14] Carlo Strapparava, Rada Mihalcea, Semeval-2007 task 14: Affective text, in: Spanish, Portuguese, German and French, in: Proceedings of the Conference
Proceedings of the Fourth International Workshop on Semantic Evaluations, on Empirical Methods in Natural Language Processing. Conference on Empirical
SemEval-2007, 2007, pp. 70–74. Methods in Natural Language Processing, vol. 2020, NIH Public Access, 2020,
[15] Yang Li, Quan Pan, Suhang Wang, Tao Yang, Erik Cambria, A generative model p. 1801.
for category text generation, Inform. Sci. 450 (2018) 301–315. [39] Sathyanarayanan Ramamoorthy, Nethra Gunti, Shreyash Mishra, S. Suryavardan,
[16] Rong Dai, Facial expression recognition method based on facial physiological Aishwarya Reganti, Parth Patwa, Amitava DaS, Tanmoy Chakraborty, Amit Sheth,
features and deep learning, J. Chongqing Univ. Technol. (Natural Science) 34 Asif Ekbal, et al., Memotion 2: Dataset on sentiment and emotion analysis of
(6) (2020) 146–153. memes, in: Proceedings of de-Factify: Workshop on Multimodal Fact Checking
[17] Zhu Ren, Jia Jia, Quan Guo, Kuo Zhang, Lianhong Cai, Acoustics, content and and Hate Speech Detection, CEUR, 2022.
geo-information based sentiment prediction from large-scale networked voice [40] Paul Pu Liang, Yiwei Lyu, Xiang Fan, Arav Agarwal, Yun Cheng, Louis-Philippe
data, in: 2014 IEEE International Conference on Multimedia and Expo, ICME, Morency, Ruslan Salakhutdinov, MULTIZOO & multibench: A standardized
IEEE, 2014, pp. 1–4. toolkit for multimodal deep learning, J. Mach. Learn. Res. 24 (2023) 1–7.
[18] Liu Jiming, Zhang Peixiang, Liu Ying, Zhang Weidong, Fang Jie, Summary of [41] Zheng Lian, Haiyang Sun, Licai Sun, Jinming Zhao, Ye Liu, Bin Liu, Jiangyan
multi-modal sentiment analysis technology, J. Front. Comput. Sci. Technol. 15 Yi, Meng Wang, Erik Cambria, Guoying Zhao, et al., MER 2023: Multi-label
(7) (2021) 1165. learning, modality robustness, and semi-supervised learning, 2023, arXiv preprint
[19] Feiran Huang, Xiaoming Zhang, Zhonghua Zhao, Jie Xu, Zhoujun Li, Image–text arXiv:2304.08981.
sentiment analysis via deep multimodal attentive fusion, Knowl.-Based Syst. 167 [42] Louis-Philippe Morency, Rada Mihalcea, Payal Doshi, Towards multimodal sen-
(2019) 26–37. timent analysis: Harvesting opinions from the web, in: Proceedings of the 13th
[20] Akshi Kumar, Geetanjali Garg, Sentiment analysis of multimodal twitter data, International Conference on Multimodal Interfaces, 2011, pp. 169–176.
Multimedia Tools Appl. 78 (2019) 24103–24119. [43] Paul Pu Liang, Ziyin Liu, Amir Zadeh, Louis-Philippe Morency, Multimodal
[21] Ankita Gandhi, Kinjal Adhvaryu, Vidhi Khanduja, Multimodal sentiment analysis: language analysis with recurrent multistage fusion, 2018, arXiv preprint arXiv:
review, application domains and future directions, in: 2021 IEEE Pune Section 1808.03920.
International Conference, PuneCon, IEEE, 2021, pp. 1–5. [44] Yansen Wang, Ying Shen, Zhun Liu, Paul Pu Liang, Amir Zadeh, Louis-Philippe
[22] Vaibhav Rupapara, Furqan Rustam, Hina Fatima Shahzad, Arif Mehmood, Imran Morency, Words can shift: Dynamically adjusting word representations using
Ashraf, Gyu Sang Choi, Impact of SMOTE on imbalanced text features for toxic nonverbal behaviors, in: Proceedings of the AAAI Conference on Artificial
comments classification using RVVC model, IEEE Access 9 (2021) 78621–78634. Intelligence, vol. 33, (no. 01) 2019, pp. 7216–7223.
10
S. Lai et al. Displays 80 (2023) 102563
[45] Sijie Mai, Haifeng Hu, Songlong Xing, Divide, conquer and combine: Hierarchical [60] Wenmeng Yu, Hua Xu, Ziqi Yuan, Jiele Wu, Learning modality-specific rep-
feature fusion network with local and global perspectives for multimodal affec- resentations with self-supervised multi-task learning for multimodal sentiment
tive computing, in: Proceedings of the 57th Annual Meeting of the Association analysis, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol.
for Computational Linguistics, 2019, pp. 481–492. 35, (no. 12) 2021, pp. 10790–10797.
[46] Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, Barnabás [61] Jing He, Haonan Yanga, Changfan Zhang, Hongrun Chen, Yifu Xua, Dy-
Póczos, Found in translation: Learning robust joint representations by cyclic namic invariant-specific representation fusion network for multimodal sentiment
translations between modalities, in: Proceedings of the AAAI Conference on analysis, Comput. Intell. Neurosci. 2022 (2022).
Artificial Intelligence, vol. 33, (no. 01) 2019, pp. 6892–6899. [62] Fan Wang, Shengwei Tian, Long Yu, Jing Liu, Junwen Wang, Kun Li, Yongtao
[47] Soujanya Poria, Erik Cambria, Alexander Gelbukh, Deep convolutional neural Wang, TEDT: Transformer-based encoding–decoding translation network for
network textual features and multiple kernel learning for utterance-level multi- multimodal sentiment analysis, Cogn. Comput. 15 (1) (2023) 289–303.
modal sentiment analysis, in: Proceedings of the 2015 Conference on Empirical [63] Di Wang, Xutong Guo, Yumin Tian, Jinhui Liu, LiHuo He, Xuemei Luo, TETFN:
Methods in Natural Language Processing, 2015, pp. 2539–2544. A text enhanced transformer fusion network for multimodal sentiment analysis,
[48] Shamane Siriwardhana, Andrew Reis, Rivindu Weerasekera, Suranga Pattern Recognit. 136 (2023) 109259.
Nanayakkara, Jointly fine-tuning" bert-like" self supervised models to improve [64] Songning Lai, Xifeng Hu, Yulong Li, Zhaoxia Ren, Zhi Liu, Danmin Miao, Shared
multimodal speech emotion recognition, 2020, arXiv preprint arXiv:2008.06682. and private information learning in multimodal sentiment analysis with deep
[49] Behnaz Nojavanasghari, Deepak Gopinath, Jayanth Koushik, Tadas Baltrušaitis, modal alignment and self-supervised multi-task learning, 2023, arXiv preprint
Louis-Philippe Morency, Deep multimodal fusion for persuasiveness prediction, arXiv:2305.08473.
in: Proceedings of the 18th ACM International Conference on Multimodal [65] Mahesh G. Huddar, Sanjeev S. Sannakki, Vijay S. Rajpurohit, A survey of
Interaction, 2016, pp. 284–288. computational approaches and challenges in multimodal sentiment analysis, Int.
[50] Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, Eric P. Xing, Select- J. Comput. Sci. Eng. 7 (1) (2019) 876–883.
additive learning: Improving generalization in multimodal sentiment analysis, in: [66] Ramandeep Kaur, Sandeep Kautish, Multimodal sentiment analysis: A survey and
2017 IEEE International Conference on Multimedia and Expo, ICME, IEEE, 2017, comparison, Res. Anthol. Implement. Sentiment Anal. Across Multiple Discipl.
pp. 949–954. (2022) 1846–1870.
[51] Hongliang Yu, Liangke Gui, Michael Madaio, Amy Ogan, Justine Cassell, Louis- [67] Lukas Stappen, Alice Baird, Lea Schumann, Schuller Bjorn, The multimodal
Philippe Morency, Temporally selective attention model for social and affective sentiment analysis in car reviews (muse-car) dataset: Collection, insights and
state recognition in multimedia content, in: Proceedings of the 25th ACM improvements, IEEE Trans. Affect. Comput. (2021).
International Conference on Multimedia, 2017, pp. 1743–1751. [68] Anurag Illendula, Amit Sheth, Multimodal emotion classification, in: Companion
[52] Nan Xu, Wenji Mao, Multisentinet: A deep semantic network for multimodal sen- Proceedings of the 2019 World Wide Web Conference, 2019, pp. 439–449.
timent analysis, in: Proceedings of the 2017 ACM on Conference on Information [69] Donglei Tang, Zhikai Zhang, Yulan He, Chao Lin, Deyu Zhou, Hidden topic–
and Knowledge Management, 2017, pp. 2399–2402. emotion transition model for multi-level social emotion detection, Knowl.-Based
[53] Feiyang Chen, Ziqian Luo, Yanyan Xu, Dengfeng Ke, Complementary fusion of Syst. 164 (2019) 426–435.
multi-features and multi-modalities in sentiment analysis, 2019, arXiv preprint [70] Petr Hajek, Aliaksandr Barushka, Michal Munk, Fake consumer review detection
arXiv:1904.08138. using deep neural networks integrating word embeddings and emotion mining,
[54] Jie Xu, Zhoujun Li, Feiran Huang, Chaozhuo Li, S. Yu Philip, Social image Neural Comput. Appl. 32 (2020) 17259–17274.
sentiment analysis by exploiting multimodal content and heterogeneous relations, [71] Soonil Kwon, A CNN-assisted enhanced audio signal processing for speech
IEEE Trans. Ind. Inform. 17 (4) (2020) 2974–2982. emotion recognition, Sensors 20 (1) (2019) 183.
[55] Weidong Wu, Yabo Wang, Shuning Xu, Kaibo Yan, SFNN: Semantic features [72] Umar Rashid, Muhammad Waseem Iqbal, Muhammad Akmal Skiandar, Muham-
fusion neural network for multimodal sentiment analysis, in: 2020 5th Inter- mad Qasim Raiz, Muhammad Raza Naqvi, Syed Khuram Shahzad, Emotion
national Conference on Automation, Control and Robotics Engineering, CACRE, detection of contextual text using deep learning, in: 2020 4th International
IEEE, 2020, pp. 661–665. Symposium on Multidisciplinary Studies and Innovative Technologies, ISMSIT,
[56] Devamanyu Hazarika, Roger Zimmermann, Soujanya Poria, Misa: Modality- IEEE, 2020, pp. 1–5.
invariant and-specific representations for multimodal sentiment analysis, in: [73] Zhentao Xu, Verónica Pérez-Rosas, Rada Mihalcea, Inferring social media users’
Proceedings of the 28th ACM International Conference on Multimedia, 2020, mental health status from multimodal information, in: Proceedings of the 12th
pp. 1122–1131. Language Resources and Evaluation Conference, 2020, pp. 6292–6299.
[57] Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, Amir Zadeh, Chengfeng Mao, [74] Rahee Walambe, Pranav Nayak, Ashmit Bhardwaj, Ketan Kotecha, Employing
Louis-Philippe Morency, Ehsan Hoque, Integrating multimodal information in multimodal machine learning for stress detection, J. Healthc. Eng. 2021 (2021)
large pretrained transformers, in: Proceedings of the Conference. Association for 1–12.
Computational Linguistics. Meeting, vol. 2020, NIH Public Access, 2020, p. 2359. [75] Nujud Aloshban, Anna Esposito, Alessandro Vinciarelli, What you say or how you
[58] Jianguo Sun, Hanqi Yin, Ye Tian, Junpeng Wu, Linshan Shen, Lei Chen, Two- say it? Depression detection through joint modeling of linguistic and acoustic
level multimodal fusion for sentiment analysis in public security, Secur. Commun. aspects of speech, Cogn. Comput. 14 (5) (2022) 1585–1598.
Netw. 2021 (2021) 1–10. [76] Safa Chebbi, Sofia Ben Jebara, Deception detection using multimodal fusion
[59] Vasco Lopes, António Gaspar, Luís A Alexandre, João Cordeiro, An automl-based approaches, Multimedia Tools Appl. (2021) 1–30.
approach to multimodal image sentiment analysis, in: 2021 International Joint
Conference on Neural Networks, IJCNN, IEEE, 2021, pp. 1–9.
11