
Neural Computing and Applications (2023) 35:10945–10956
https://doi.org/10.1007/s00521-023-08276-8

ORIGINAL ARTICLE

Transformer transfer learning emotion detection model: synchronizing socially agreed and self-reported emotions in big data

Sanghyub John Lee1 · JongYoon Lim2 · Leo Paas1 · Ho Seok Ahn2

Received: 11 May 2022 / Accepted: 6 January 2023 / Published online: 26 January 2023
© The Author(s) 2023

Correspondence: Sanghyub John Lee, sanghyub.lee@auckland.ac.nz. Co-authors: JongYoon Lim, jy.lim@auckland.ac.nz; Leo Paas, leo.paas@auckland.ac.nz; Ho Seok Ahn, hs.ahn@auckland.ac.nz.

1 Marketing Department, University of Auckland Business School, Auckland 1142, New Zealand
2 CARES, Department of Electrical, Computer and Software Engineering, University of Auckland, Auckland 1142, New Zealand

Abstract
Tactics to determine the emotions of authors of texts such as Twitter messages often rely on multiple annotators who label relatively small data sets of text passages. An alternative method gathers large text databases that contain the authors' self-reported emotions, to which artificial intelligence, machine learning, and natural language processing tools can be applied. Both approaches have strengths and weaknesses. Emotions evaluated by a few human annotators are susceptible to idiosyncratic biases that reflect the characteristics of the annotators. But models based on large, self-reported emotion data sets may overlook subtle, social emotions that human annotators can recognize. In seeking to establish a means to train emotion detection models so that they can achieve good performance in different contexts, the current study proposes a novel transformer transfer learning approach that parallels human development stages: (1) detect emotions reported by the texts' authors and (2) synchronize the model with social emotions identified in annotator-rated emotion data sets. The analysis, based on a large, novel, self-reported emotion data set (n = 3,654,544) and applied to 10 previously published data sets, shows that the transfer learning emotion model achieves relatively strong performance.

Keywords Emotion data sets · Emotion detection model · Transformer-based language model · Constructed emotion theory

1 Introduction

Billions of social networking service users engage in their areas of interest by sharing their opinions and information, mainly as text. Predicting the emotions they express in these texts is critical for both researchers and businesses [31]; consumer emotions expressed in online restaurant reviews [37] or tweets about COVID-19 [18], for example, may provide valuable insights for commercial firms and public policy makers. Yet detecting such expressions of human emotions is challenging, particularly when the analyzed data only include text, not facial expressions or other nonverbal information [30].

One option is to task human annotators with generating labels of emotions and categorizing text passages [2, 8, 9, 14, 26, 27, 29, 34]. This stream of research reflects a classical view of emotion theory [35], which postulates that categories such as fear and joy have universal biological fingerprints, inherited by humans through evolution, that get aroused by certain situations. Therefore, annotators should be able to detect emotions expressed in texts authored by others [8, 26, 27, 32]. Yet empirical studies also reveal that the number and categories of emotions vary across individuals, depending on their age [11], gender [23], social contexts, and culture [16, 24]. Accordingly, annotators may not be able to judge every emotion expressed by others in all situations, especially those that the annotators have not experienced. In addition, personal experiences are inherently subjective, reflecting the effects of experiential blindness and a person's current state of mind [5]. For example, one and the same person can be evaluated differently by others, depending on those evaluators' states of mind. Men who have just crossed a high suspension bridge perceive women as more attractive than men who have not crossed the bridge, because they confuse their fear with attraction [12]. Wang et al. [41] caution that annotators' assessments of emotions can be subjective and varied too.

Thus an alternative approach analyzes larger data sets, to which emotion labels have been added by the texts' authors [25, 32, 33], which we refer to as self-rated labels. According to the theory of constructed emotions [6], the human brain uses past experiences, categorized as concepts, to guide people's actions and give meaning to their observations of their surroundings. Emotions are part of the meaning attributed by and expressed in texts. Data sets with self-reported emotions are relatively easy to collect and can train artificial intelligence (AI), machine learning (ML), and natural language processing (NLP) algorithms [25, 32, 33]. For example, tweets by authors who label them with emotion hashtags, such as #joy, #anger, and #sadness, can be submitted to AI, ML, and NLP models, which produce general emotion detection rules. These rules then can be applied to other tweets that contain emotional content but do not feature labels created by the text's author or another annotator.

Such ML-based approaches suffer their own limitations; when authors report their own emotions, they might overlook some socially constructed emotions that annotators might be able to identify. Furthermore, models based on self-reported emotion data sets need sufficient amounts of text and labels for training, but what those precise amounts are remains unclear. In particular, ML models do not have previous experience (as annotators might), so they may require more data. Considering these strengths, weaknesses, and gaps, we ask:

RQ1. Are models based on (1) annotator-labeled or (2) self-reported emotions more accurate in detecting emotions in texts with (a) self-reported or (b) annotated emotion labels?

RQ2. Does the size of the data set featuring self-reported emotions affect the classification accuracy of ML-based emotion detection models?

Rather than relying on either self-reported emotions or annotated emotion labels, we propose a novel, combined approach, the transformer transfer learning (TTL) model. In this approach, the transformer model gets trained in a series of stages that mimic human development. Over the course of their social-emotional development [22], children first identify their own emotions and then learn to synchronize their emotions with those of others, such as in response to relationship issues [38]. The proposed TTL approach replicates these two stages by first training a transformer model, such as the RoBERTa-large model [21], on a large, self-reported emotion data set, and then on a relatively small, socially agreed emotion data set with annotator-generated labels.

With this novel two-step approach, we aim to improve forecasting accuracy in terms of predicting emotions across different types of data sets, with self-reported or annotator-labeled emotions. To the best of our knowledge, this study is the first to train models sequentially on self-reported emotion data, followed by data gathered from annotator labels. Previous transformer models trained on large emotion data sets achieve higher classification accuracy than those trained on small data sets only [9], though previous research focuses on differences in the size of the data sets, not the types of emotion data. We also note a previous study [43] that offers similar findings for chemical reactions, though unrelated to emotion detection.

To confirm whether the newly introduced TTL approach achieves greater emotion detection accuracy than alternative modeling approaches, including non-sequentially trained models that also involve (1) annotator-rated emotion data, (2) self-reported emotion data, and (3) both types, we investigate the following question:

RQ3. Does the TTL approach achieve higher classification accuracy than models that have been non-sequentially trained on annotator-rated or self-reported emotion data?

In search of answers for these questions, we gather a novel data set with 3,654,544 tweets that include emotion hashtags, inserted by their authors. We analyze this large data set, along with 10 previously published data sets that contain text with emotion labels, whether provided by the texts' authors or annotators. The new data set can be leveraged for further research; it represents one of the largest data sets of tweets containing self-reported emotion hashtags (n = 3,654,544), posted on Twitter between October 2008 and October 2021. It is available as an open data set for academic purposes (https://github.com/EmotionDetection/Self-Reported-SR-emotion-dataset.git).

In turn, the methodological contribution of this paper is threefold. First, we offer the first (to the best of our knowledge) assessment of the generalizability of the forecasting accuracy of emotion detection models developed on the basis of either self-reported or annotator-labeled emotion data sets. To do so, we apply models trained on each self-reported emotion data set to predict the emotions that annotators have used to label the texts and vice versa. Second, regarding the relevance of large databases, we assess whether models developed on the newly collected self-reported data set achieve higher forecasting accuracy than models developed on smaller, previously published data sets that also contain self-reported emotions.


Third, we propose and apply the TTL approach, which replicates the social-emotional development stages of children.

2 Related work

A growing literature recognizes the importance of specific (fine-grained) emotion detection models and data sets. Existing emotion detection algorithms rely on the concepts of (1) a word-level affect lexicon, (2) phrase-level traditional ML, (3) document-level deep learning, and (4) document-level pretrained transformer models.

A word-level affect lexicon entails the identification of a corpus of words related to specific emotions. For example, "stole" relates to the emotion of anger; "amazing" is related to joy. This vocabulary-based approach assigns the primary emotion to a label in the text by searching for the frequency of words associated with various, specific emotions (e.g., [4, 25]).

Phrase-level traditional ML analyzes texts beyond simple counts of words. The algorithms learn the meaning of words or phrases from the training data set, using an n-gram function (i.e., a sequence of N words; [17]). For example, the 2-gram word tokens "fine young" and "young man" can be extracted from the phrase "fine young man" [20, 28, 40], as in the short sketch below.
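A minimal illustration of such 2-gram extraction, using scikit-learn's CountVectorizer (our own snippet, not taken from the cited studies):

```python
# Illustrative sketch of 2-gram (bigram) token extraction; not part of the
# cited studies. Requires scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(2, 2))   # extract 2-gram word tokens only
vectorizer.fit(["fine young man"])
print(vectorizer.get_feature_names_out())           # ['fine young' 'young man']
```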
For document-level deep learning, Goodfellow et al. [15] point out that as the amount of data used for NLP increases, traditional ML algorithms suffer from insufficient lexical feature extraction. Deep learning algorithms instead use many hidden layers to find complex document-level representations of large amounts of text, without lexical feature extraction [1, 26, 32, 41].

In 2018, Google introduced a pretrained transformer model, "Bidirectional Encoder Representations from Transformers" (BERT) [10], that outperformed other models in many NLP tasks, including specific emotion classifications [3, 7, 9]. Previously published studies indicate that transformer models are optimal for emotion detection, but classification accuracy varies depending on the emotion data set being analyzed [3, 7, 9]. As Table 1 shows, accuracy for classifying human emotions varies from 50 to 80%; it seems difficult to surpass 80%.

Yet it is not clear whether models based on self-reported emotions generalize to data sets containing annotator labels or vice versa (RQ1). Previous research tends to compare different algorithms used for emotion detection, not the type of data being used to define the emotions expressed in texts. According to this comparison, transformer models outperform alternative algorithms [3, 7, 9], so we integrate them into our proposed TTL approach. Table 1 also shows that emotion data sets exhibit a trend of increasing sizes over time (RQ2) but cannot indicate whether larger data sets, containing authors' self-reported emotions, enhance forecasting accuracy.

Finally, recent studies [9, 43] propose that researchers can increase model accuracy by sequentially training a transformer model, first on a large general data set and then on a small, target data set (RQ3). For example, BERT [10] was sequentially trained on a large emotion data set (GoEmotions [9]; n = 58 k) and then small data sets (sampled from various emotion data sets [2, 8, 14, 25, 27, 32, 33]; n < 1,000). It achieves greater classification accuracy than models trained only on a small data set [9]. Prior studies include various annotator-rated and self-reported emotion data sets, but the focus is primarily on size differences, rather than types of labels, self-reported or human annotator labels. As noted in the introduction, we aim to extend this approach to replicate the social-emotional development stages of humans and thereby achieve robust performance across emotion data sets.

3 Data sets

3.1 Annotator-rated and self-reported emotion data sets

A total of 11 data sets were analyzed, each containing at least four of Ekman's [13] six commonly applied specific emotions (anger, disgust, fear, joy, sadness, and surprise). The data sets consist of emotion labels and accompanying text sentences, such as news headlines, tweets, and Reddit comments. Table 2 summarizes seven previously collected annotator-rated emotion data sets:

• Affective Text (D_A1): Six annotators labeled 1,000 news headlines from Google News and CNN (https://web.eecs.umich.edu/~mihalcea/affectivetext/),
• Emotion Cause data set (D_A2): Four annotators labeled 2,000 automatically generated sentences (https://www.site.uottawa.ca/~diana/resources/emotion_stimulus_data/),
• CrowdFlower Sentiment Analysis (D_A3): 40,000 tweets labeled by crowdsourcing (https://data.world/crowdflower/sentiment-analysis-in-text),
• Emotion Intensities 2017 in Tweets (D_A4): Annotators labeled 7,000 tweets (https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html),
• GoEmotions (D_A5): Three to five annotators labeled 58,000 Reddit comments (https://github.com/google-research/google-research/tree/master/goemotions),
• Stance Sentiment Emotion Corpus (SSEC) (D_A6): Three to six annotators labeled 4,800 tweets used in the SemEval 2016 competition (http://www.romanklinger.de/ssec/), and


Table 1 Key reading table for classifying specific emotions

Study | Data | Method | Findings
Balahur et al. [4] | 1081 cases | Simple K-means | EmotiNet derived from existing NLP resources, based on the seven most basic emotions (fear, anger, sadness, guilt, shame, joy, and disgust)
Mohammad [25] | 21,051 tweets labeled by six emotions | Strength of association (SoA) | Twitter Emotion Corpus (TEC), developed by collecting tweets with emotion hashtags, based on six basic emotions [13]
Mohammad and Kiritchenko [28] | TEC and 1250 newspaper headlines data sets | Support vector machine (SVM) | When using a combination of word-level affect lexicon and (1–8)-gram function with SVM, the micro F1 score was the highest at .499
Liu [20] | TEC data set | Various ML algorithms | The weighted F1 score was .579 with SVM, .478 with random forest, and highest at .605 with logistic regression
Volkova and Bachrach [40] | 52,925 tweets | Log-linear models with L2 regularization | Trained the model using lexical features extracted from tweets annotated with six basic emotions, for an overall accuracy score of .78
Abdul-Mageed and Ungar [1] | 682,082 tweets | Gated Recurrent Neural Nets (GRNNs) model | The F1 score was .8012 on the six basic emotions [13] classification
Wang et al. [41] | Approximately 2.5 million tweets with 131 hashtags | MNB and LIBLINEAR | The F1 score was the highest at .6163, with a large-scale linear classification that classifies the seven emotion labels (joy, sadness, anger, love, fear, thankfulness, and surprise)
Saravia et al. [32] | 664,462 tweets with 339 hashtags | CNN architecture with a matrix form of the enriched patterns | The average F1 score was .79 on eight emotions (sadness, joy, fear, anger, surprise, trust, disgust, and anticipation)
Mohammad et al. [26] | 10,983 English tweets for the SemEval-2018 competition | SVM/SVR, LSTMs, and Bi-LSTMs | Deep learning algorithms were applied by the top-performing teams on a classification with four emotions (anger, fear, joy, and sadness)
Chatterjee et al. [7] | 41,179 dialogues for the SemEval-2019 competition | Bi-LSTMs with BERT and ELMo | The highest-ranked team achieved a micro F1 score of .7959 on four emotions (happy, sad, angry, and others)
Al-Omari et al. [3] | Data set of SemEval-2019 | Ensemble model (BERT, GloVe embeddings) | The F1 score was .748 on four emotions (happy, sad, angry, and others)
Demszky et al. [9] | 58 k Reddit comments | Transformer model, BERT | GoEmotions based on Reddit comments. The macro-average F1 score was .64 on the six basic emotions [13] classification

Table 2 Annotator-rated emotion data sets

ID | Data set | Ekman emotions | Size | Tot. emotions | Source
D_A1 [2] | Affective text | All 6 | 1 k | 16 | News headlines
D_A2 [14] | The emotion cause | All 6 | 2 k | 6 | FrameNet
D_A3 [8] | Crowd-flower | Anger, surprise, joy, sadness | 40 k | 14 | Tweets
D_A4 [27] | Emotion intensities 2017 | Anger, fear, joy, sadness | 7 k | 4 | Tweets
D_A5 [9] | Go-emotions | All 6 | 58 k | 27 | Reddit comments
D_A6 [27, 32] | SSEC | All 6 | 4.8 k | 8 | Tweets
D_A7 [26] | SemEval-2018 | All 6 | 2.5 k | 11 | Tweets

• SemEval-2018 Affect in Tweets Data (D_A7): Seven annotators labeled 2,500 tweets (http://saifmohammad.com/WebPages/SentimentEmotionLabeledData.html).

Table 3 lists the three previously published self-reported emotion data sets:


Table 3 Self-reported emotion data sets

ID | Data set | Ekman emotions | Size | Total emotions | Source
D_S1 [32] | CARER | All 6 | 664 k | 8 | Tweets
D_S2 [33] | ISEAR | Joy, fear, anger, sadness, disgust | 7.5 k | 7 | Survey
D_S3 [25] | TEC | All 6 | 21 k | 6 | Tweets

• CARER emotion data set (D_S1): 664,000 tweets with self-reported emotions (https://github.com/dair-ai/emotion_dataset),
• International Survey on Emotion Antecedents and Reactions (ISEAR) (D_S2): 7,500 sentences in which participants reported emotions through a survey (https://github.com/sinmaniphel/py_isear_dataset),
• Twitter Emotion Corpus (TEC) (D_S3): Collection of 21,000 tweets with emotion hashtags (http://saifmohammad.com/WebPages/SentimentEmotionLabeledData.html).

Collecting the self-reported emotion data set

We collected the new self-reported emotion data set (D_SR) by using the Twitter Application Programming Interface (API), with approval from Twitter. The collected data set consists of publicly available information and excludes personally identifiable information.

We collected tweets in English (n = 5,367,357) posted between March 2008 and October 2021 that feature one of Ekman's six basic emotions with a hashtag [13]: #anger, #disgust, #fear, #joy, #sadness, and #surprise [25]. For example, the text "Spring is coming!!!!" featured "#joy," as inserted by the tweet's author. Emotion hashtags in the tweets (independent variables) affect the dependent variables and were used exclusively as label values. Website addresses and special characters were removed to obtain only English words [25]. For example, exclamation marks were removed from "Spring is coming!!!!," resulting in "Spring is coming."

Duplicate tweets also were removed, retaining the first one only (11.32% of tweets in our data set). Tweets including multiple emotion hashtags were removed to capture a unique emotion, such as "Everything makes me cry … everything #sadness #angry #joy" [1] (3.95% of tweets). Tweets containing words that may not represent each emotion were also removed, such as #anger together with "management" or "mentalhealth"; #disgust containing "insideout"; #fear in conjunction with "god"; #joy containing "redvelvet"; and #surprise with "birthday" (4.83% of tweets). Also, tweets that had fewer than three English words and re-tweets were excluded [25] (11.81% of tweets). This process resulted in n = 3,654,544.
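A compact sketch of these filtering steps appears below. It assumes the collected tweets sit in a pandas DataFrame with a text column; the column names, helper functions, and patterns are illustrative rather than the authors' released pipeline.

```python
# Hedged sketch of the preprocessing described above; names and patterns are
# illustrative, not the authors' actual code.
import re
import pandas as pd

EMOTION_TAGS = ["#anger", "#disgust", "#fear", "#joy", "#sadness", "#surprise"]

def emotion_tags_in(text: str) -> list:
    """Return the Ekman emotion hashtags that appear in a tweet."""
    lowered = text.lower()
    return [tag for tag in EMOTION_TAGS if tag in lowered]

def clean_text(text: str) -> str:
    """Remove website addresses, emotion hashtags (labels only), and special characters."""
    text = re.sub(r"http\S+", " ", text)
    for tag in EMOTION_TAGS:
        text = re.sub(tag, " ", text, flags=re.IGNORECASE)
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def build_self_reported_dataset(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw[~raw["text"].str.lower().str.startswith("rt ")].copy()   # drop re-tweets
    df["tags"] = df["text"].map(emotion_tags_in)
    df = df[df["tags"].str.len() == 1].copy()                         # single emotion hashtag only
    df["label"] = df["tags"].str[0].str.lstrip("#")
    df["clean"] = df["text"].map(clean_text)
    df = df.drop_duplicates(subset="clean", keep="first")             # keep first duplicate only
    df = df[df["clean"].str.split().str.len() >= 3]                   # at least three English words
    # Further keyword-based exclusions (e.g., #anger with "management") would follow the same pattern.
    return df[["clean", "label"]].reset_index(drop=True)

# Example usage with two toy tweets:
toy = pd.DataFrame({"text": ["Spring is coming!!!! #joy", "#sadness #joy mixed feelings"]})
print(build_self_reported_dataset(toy))
```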
A first descriptive analysis shows that the number of words in a tweet is characterized by Mean = 15.59, SD = 8.85, Median = 14, Min = 3, Max = 71, Q1 = 9, and Q3 = 20. In Fig. 1, the word cloud of the most frequently occurring words shows that "love" (n = 399,484) is the most mentioned emotion word, which reflects general valence. The most mentioned specific emotion is "joy."

Fig. 1 Word cloud of most frequent words

3.2 Integration of eleven emotion data sets

We included only cases reflecting one of Ekman's [13] six emotions (anger, disgust, fear, joy, sadness, and surprise) from the 11 emotion data sets mentioned previously. Table 4 reports the ten previously collected emotion data sets, which contain a total of 108,317 cases that feature four to six Ekman emotions. The new self-reported data set features 3,654,544 cases. The annotator-rated emotion data sets, D_A1 to D_A7, contain a total of 63,516 cases. The previously collected self-reported data sets, D_S1 to D_S3, include 44,801 cases, and D_SR contains 3,654,544 cases. The proportions of specific emotions across the 11 data sets range from 1.42 to 38.81%.


Table 4 Cases in emotion data sets that contain Ekman emotions

ID | # of emotions | Anger | Disgust | Fear | Joy | Sadness | Surprise | Total | Share
D_A1 | 6 | 79 | 41 | 185 | 433 | 262 | 217 | 1,217 | .03%
D_A2 | 6 | 478 | 95 | 423 | 467 | 575 | 213 | 2,251 | .06%
D_A3 | 4 | 110 | 0 | 0 | 5,209 | 5,165 | 2,187 | 12,671 | .34%
D_A4 | 4 | 1,701 | 0 | 2,252 | 1,616 | 1,533 | 0 | 7,102 | .19%
D_A5 | 6 | 3,921 | 429 | 522 | 18,176 | 2,283 | 3,764 | 29,095 | .77%
D_A6 | 6 | 511 | 49 | 57 | 400 | 69 | 31 | 1,117 | .03%
D_A7 | 4 | 2,357 | 0 | 2,820 | 2,894 | 1,992 | 0 | 10,063 | .27%
D_S1 | 5 | 2,709 | 0 | 2,373 | 6,761 | 5,797 | 719 | 18,359 | .49%
D_S2 | 5 | 1,079 | 1,066 | 1,076 | 1,092 | 1,082 | 0 | 5,395 | .14%
D_S3 | 6 | 1,555 | 761 | 2,814 | 8,239 | 3,830 | 3,848 | 21,047 | .56%
D_SR | 6 | 244,084 | 50,887 | 776,320 | 1,414,936 | 440,633 | 727,684 | 3,654,544 | 97.12%
Total | 6 | 258,584 (6.87%) | 53,328 (1.42%) | 788,842 (20.96%) | 1,460,223 (38.81%) | 463,221 (12.31%) | 738,663 (19.63%) | 3,762,861 | 100.00%

4 Methodology

4.1 Transformer models

As discussed in Sect. 2, transformer models such as BERT [10] outperform alternative ML algorithms [3, 7] for detecting specific emotions in texts. As the first transformer model, BERT relies on text encoders trained on the BooksCorpus (with 800 million words) and the English-language Wikipedia (with 2500 million words) [36]. Its bidirectional training references both left and right sides of sentences simultaneously. Masked language modeling optimizes the weights in the BERT model, such that it can train a language model that calculates the probability distribution of words appearing in a sentence on the basis of large, unlabeled texts with unsupervised learning (see Fig. 2). In turn, pretrained BERT models can be fine-tuned with an additional output layer (Fig. 3) to develop high-performing models for a wide range of NLP tasks, such as linguistic acceptability, natural language inferencing, similarity prediction, and sentiment analysis [10]. The fine-tuning process is introduced in more detail in Sect. 4.2.

We also note two versions of BERT: BERT-base (L = 12, H = 768, A = 12, total parameters = 110 M) and BERT-large (L = 24, H = 1024, A = 16, total parameters = 340 M), where L is the number of transformer blocks, H is the hidden size, and A is the number of attention blocks [10]. Because BERT-large contains more training parameters, its training time for a million sentences in our study is prolonged (approximately seven hours per epoch), compared with BERT-base (approximately three hours per epoch), using the latest RTX3090 GPU. Nevertheless, BERT-large achieves better performance than BERT-base.

Liu et al. [21] point out that Robustly optimized BERT approaches (RoBERTa) outperformed BERT in various NLP tasks. This version uses the same architecture as BERT but pretrains on ten times more data, including both the BERT data and 63 million news articles, a web text corpus, and stories. Because it offers the highest level of forecasting accuracy to date, we apply RoBERTa-large to compare emotion detection models [19].

4.2 Fine-tuning process

With RoBERTa, a fine-tuning process takes place for sequence-level classification tasks, as in the BERT architecture. We follow an existing process [19], in which the emotion data sets get split into a training (80%) and a testing (20%) data set for the transformer model training. At the beginning of the sequence, a special class token [CLS] gets added for classification tasks. The representation values of dimensions are converted from token values. Because the transformer blocks reflect the relationships of all pairs of words in a sentence, the [CLS] vector indicates the contextual meaning of the entire sentence. In the output layer of the transformer model, we add a fully connected layer to categorize the class labels into specific emotions (Fig. 3). During training, we fine-tuned all parameters of the transformer model together, and the output label probabilities were calculated using the Adam function [19]. This process appears as the middle area in Fig. 4.
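A rough sketch of such an output layer, a fully connected layer over the sentence-level ([CLS]-style) token of a RoBERTa encoder, is given below. It is written with PyTorch and Hugging Face transformers and is an illustration of the idea, not the authors' implementation.

```python
# Sketch of a fully connected emotion classification head over the first,
# sentence-level token of RoBERTa; illustrative only.
import torch
from torch import nn
from transformers import RobertaModel, RobertaTokenizerFast

class EmotionHead(nn.Module):
    def __init__(self, num_labels: int = 6):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-large")
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0, :]   # contextual sentence representation
        return self.classifier(cls_vector)                # logits over the emotion labels

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-large")
batch = tokenizer(["Spring is coming"], return_tensors="pt")
logits = EmotionHead()(batch["input_ids"], batch["attention_mask"])
probabilities = torch.softmax(logits, dim=-1)             # one probability per emotion label
```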
Figure 4 summarizes the new two-step TTL architecture that we introduce. First, the transformer model, RoBERTa-large (upper area of Fig. 4), was trained on large self-reported emotion data sets. In our study, the self-reported emotion transformer model (middle area in Fig. 4) was trained on the large, integrated, self-reported emotion data sets, D_S1 to D_S3 and D_SR (n = 3,699,345). Second, the emotion model is synchronized to detect socially agreed emotions by retraining on relatively small, annotator-rated emotion data sets. For our study, the model undergoes further training (lower area in Fig. 4) on the combined annotator-rated emotion data sets, D_A1 to D_A7 (n = 63,516), through a fine-tuning process, by re-using the self-reported emotion model (middle area in Fig. 4).

Fig. 4 TTL architecture
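The two training steps can be written compactly with the Hugging Face Trainer API, as in the minimal sketch below. The toy texts and label indices are placeholders for the integrated self-reported and annotator-rated training sets; only the overall sequence (self-reported first, annotator-rated second) mirrors the TTL design, and the hyperparameter values follow Sect. 5.1.

```python
# Minimal sketch of the two-step TTL training; data are toy placeholders, and the
# hyperparameters follow Sect. 5.1 (batch size 16, learning rate 1e-5, 3 epochs).
from datasets import Dataset
from transformers import (RobertaForSequenceClassification, RobertaTokenizerFast,
                          Trainer, TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-large")
model = RobertaForSequenceClassification.from_pretrained("roberta-large", num_labels=6)

def encode(texts, labels):
    enc = tokenizer(texts, truncation=True, padding=True)
    return Dataset.from_dict({**enc, "labels": labels})

# Placeholder stand-ins for the integrated self-reported (D_S1-D_S3, D_SR) and
# annotator-rated (D_A1-D_A7) training sets.
self_reported_train = encode(["Spring is coming", "Everything went wrong today"], [3, 4])
annotator_rated_train = encode(["What a pleasant surprise", "That film terrified me"], [5, 2])

def fine_tune(model, train_set, output_dir):
    args = TrainingArguments(output_dir=output_dir, per_device_train_batch_size=16,
                             learning_rate=1e-5, num_train_epochs=3)
    Trainer(model=model, args=args, train_dataset=train_set).train()
    return model

model = fine_tune(model, self_reported_train, "ttl_step1_self_reported")      # step 1
model = fine_tune(model, annotator_rated_train, "ttl_step2_annotator_rated")  # step 2
```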
The purpose of RoBERTa and other ML models trained with word vectors is to perform a classification in which the output indicates the likelihood of the input sentence being classified as one of the possible emotion labels, such as fear, anger, sadness, joy, surprise, and disgust.


Fig. 2 Overview of pretraining BERT [10]

Fig. 3 BERT fine-tuning network architecture [10]

We use the weighted F1 score [20] to assess forecasting accuracy; this measure is suitable for unbalanced data distributions [39], as exist in our analyses. The formula for the F1 score of each label (class) can be expressed as follows:

$$\mathrm{F1\ score}(l) = \frac{2 \times \mathrm{precision}(l) \times \mathrm{recall}(l)}{\mathrm{precision}(l) + \mathrm{recall}(l)}, \qquad (1)$$

where l is the label (anger, disgust, fear, joy, sadness, or surprise), precision(l) is truepositive(l) / (truepositive(l) + falsepositive(l)), and recall(l) is truepositive(l) / (truepositive(l) + falsenegative(l)). To calculate the weighted F1 score, we take the mean of the F1 scores of all labels, weighting each label's score by the data set size of that label.
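As a worked illustration of Eq. (1) and the weighted average, the following snippet computes both with scikit-learn on toy labels of our own; it is not part of the study's code.

```python
# Toy check of Eq. (1) and the weighted F1 score with scikit-learn.
from sklearn.metrics import f1_score

y_true = ["joy", "joy", "joy", "fear", "anger", "sadness"]
y_pred = ["joy", "joy", "fear", "fear", "anger", "joy"]

per_label = f1_score(y_true, y_pred, average=None,
                     labels=["anger", "fear", "joy", "sadness"])   # Eq. (1) for each label
weighted = f1_score(y_true, y_pred, average="weighted")            # mean weighted by label support
print(per_label, weighted)
```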
5 Evaluation

5.1 Evaluation by models trained on separate data sets

To address RQ1 and RQ2, we trained 11 RoBERTa-large models on the 11 separate emotion data sets defined in Table 4. The 11 models take the labels M_A1 to M_A7, M_S1 to M_S3, and M_SR, depending on the specific data set on which they rely. We applied an 80/20 data set splitting strategy to derive a training and a test set from each of the 11 data sets; for example, the M_A1 model was trained on 80% of the D_A1 data set, and the test set included 20% of D_A1.

Input variables were stored as word vector tokens (pretrained embeddings for different words), segment embeddings (the sentence number encoded into a vector), and position embeddings (the position of the word within the sentence). Output variables consisted of four to six labels, depending on the Ekman emotions available in the data set (Tables 2 and 3).

For the RoBERTa-large models, we used Hugging Face [42], one of the most frequently applied transformer libraries for the Python programming language. The batch size and learning rate are 16 and 0.00001, respectively, in all models. Devlin et al. [10] point out that overfitting may occur after approximately four epochs, due to the transformer model's numerous parameters, up to 340 million. Thus, training of the RoBERTa-large models was completed in three epochs.
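The 80/20 split described above can be reproduced per data set with the datasets library, as sketched below; the tiny in-memory data set is an illustrative stand-in for D_A1.

```python
# Sketch of the 80/20 train/test split applied to each emotion data set; the
# toy data set stands in for D_A1.
from datasets import Dataset

d_a1 = Dataset.from_dict({
    "text": ["headline one", "headline two", "headline three", "headline four", "headline five"],
    "labels": [0, 3, 3, 4, 5],
})
split = d_a1.train_test_split(test_size=0.2, seed=42)
train_set, test_set = split["train"], split["test"]   # 80% training, 20% testing
print(len(train_set), len(test_set))                  # 4 1
```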
For RQ1, we evaluated the classification accuracy of the seven models based on annotator-labeled data (M_A1 to M_A7) in the seven annotator-rated emotion test sets (D_A1 to D_A7) and in the four self-reported emotion test sets (D_S1 to D_S3 and D_SR). Thus we obtain 77 weighted F1 scores (upper part of Table 5).
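This grid of scores can be produced by looping every model over every held-out test set, as in the sketch below; the callables and test sets are trivial stand-ins for the fine-tuned models and the 11 data sets.

```python
# Sketch of the cross-evaluation behind Table 5: each model is scored on each
# test set with the weighted F1; models and test sets are trivial stand-ins.
from sklearn.metrics import f1_score

models = {"M_A1": lambda text: "joy", "M_SR": lambda text: "sadness"}
test_sets = {
    "D_A1": {"text": ["great news today", "a sad story"], "labels": ["joy", "sadness"]},
    "D_SR": {"text": ["what a lovely day"], "labels": ["joy"]},
}

def weighted_f1(predict, test_set):
    predictions = [predict(t) for t in test_set["text"]]
    return f1_score(test_set["labels"], predictions, average="weighted")

grid = {(m, d): weighted_f1(fn, ts)
        for m, fn in models.items() for d, ts in test_sets.items()}
print(grid)   # one weighted F1 score per (model, test set) pair, as in Table 5
```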


For example, the first two cells in the top row of Table 5 show that M_A1 produces F1 = 0.70 in the 20% test set derived from D_A1, whereas it indicates F1 = 0.52 in the test set derived from D_A2.

The annotator-rated models tend to perform best on test sets derived from the data set on which they were developed, i.e., the diagonal results in Table 5 are relatively high. For example, M_A1 was developed on the training set derived from D_A1, and it achieved an F1 score of 0.70 on the test set derived from D_A1, higher than the scores for the ten other test sets (i.e., 0.38 to 0.62). Thus, data sets may be rated by annotators with idiosyncratic rules for labeling emotions.

Table 5 also shows that annotator-rated models (M_A1 to M_A7) resulted in a higher average F1 score in test sets of the seven annotator-rated emotion data sets (F1 = 0.62; upper left quarter of Table 5) than in the four self-reported emotion test sets (F1 = 0.53; upper right quarter of Table 5). For example, in the first row of Table 5, the average F1 score of M_A1 is 0.55 for annotator-rated emotion test sets (D_A1 to D_A7) and 0.46 for self-reported emotion test sets (D_S1 to D_S3 and D_SR). The unique rules that annotators apply when specifying emotions in each data set thus appear susceptible to bias in relation to classifying self-reported emotions.

Next, we evaluate the four models trained on the four self-reported emotion data sets, D_S1 to D_S3 and D_SR, in all 11 test sets. The resulting 44 weighted F1 scores appear in the lower part of Table 5. Here again, the self-reported models tend to perform best on the test set derived from the data set on which they were developed. We find relatively high diagonal results in Table 5 for the four models developed on the self-reported data sets. Furthermore, the four self-reported models achieve higher average F1 scores in self-reported emotion data sets (F1 = 0.65; lower right quadrilateral, Table 5) than in annotator-rated emotion data sets (F1 = 0.57; lower left quadrilateral, Table 5). Individual authors, across different data sets or in different writing contexts, may exhibit biases similar to those indicated by annotators when expressing their emotions. According to Table 5, this bias has a relatively strong effect when models based on self-reported emotions are applied to data sets with annotator-labeled emotions.

For RQ2, in the self-reported test sets (D_S1 to D_S3 and D_SR), the M_SR model, based on the larger data set, results in the highest average F1 score of 0.70, compared with scores from 0.60 to 0.65 for M_S1 to M_S3 (lower right quadrilateral, Table 5). In contrast, the M_SR model achieved the lowest average F1 score of 0.49, compared with scores ranging from 0.52 to 0.66 for M_S1 to M_S3, when testing the annotator-rated test sets (lower left quadrilateral, Table 5). Thus, a RoBERTa model trained on large data sets can achieve good performance in similar contexts but not as much in different contexts.

5.2 Evaluation by models trained on multiple data sets

To address RQ3, we used the same input and output variables as in Sect. 5.1. To start, we trained the TTL emotion model on the large, self-reported training set, and then on the smaller, annotator-rated training set. To test the proposed advantages of the TTL approach, we consider three alternative RoBERTa models as benchmarks: (1) the annotator-rated emotion RoBERTa model, trained on the seven annotator-rated training sets (containing 80% of D_A1 to D_A7, n = 63,516); (2) the self-reported emotion RoBERTa model, trained on the four self-reported training sets (containing 80% of D_S1 to D_SR, n = 3,699,345); and (3) the integration RoBERTa emotion model, trained simultaneously on all 11 data sets (n = 3,762,861), instead of consecutively.

Table 6 contains the 44 weighted F1 scores for the TTL emotion model and the three alternatives. The TTL emotion model achieved the highest average F1 score of 0.84 across the 11 analyzed data sets. The annotator-rated emotion model achieved the second highest average F1 score (0.79).

Figure 5 reports the plot of the loss, which reflects the classification error in the training and testing sets that occurs while training on the annotator-rated training sets. The loss associated with the self-reported emotion model is greater than that linked to the TTL emotion model; that is, the TTL approach can improve the performance of the transformer model during the model training stage. The TTL emotion model achieved the highest (D_A1, D_A5, and D_A7), second highest (D_A2, D_A3, and D_A6), or third highest (D_A4) F1 scores in the separate annotator-rated test sets.


Table 5 Weighted F1 scores of emotion models (M_##) trained on separate data sets (D_##)

Model D_A1 D_A2 D_A3 D_A4 D_A5 D_A6 D_A7 D_S1 D_S2 D_S3 D_SR Avg
M_A1 .70 .52 .59 .38 .62 .65 .41 .51 .48 .41 .44 .52
M_A2 .45 .98 .57 .70 .61 .56 .69 .69 .59 .38 .46 .61
M_A3 .57 .52 .72 .40 .61 .38 .43 .57 .49 .47 .55 .52
M_A4 .36 .87 .51 .86 .51 .50 .88 .60 .59 .40 .61 .61
M_A5 .58 .75 .59 .46 .87 .74 .49 .63 .60 .42 .44 .60
M_A6 .43 .62 .59 .44 .67 .83 .48 .57 .60 .38 .42 .55
M_A7 .44 .94 .67 .90 .64 .64 .87 .65 .60 .49 .66 .68
M_S1 .44 .78 .55 .56 .63 .63 .58 .94 .66 .38 .40 .60
M_S2 .64 .77 .75 .52 .66 .72 .54 .64 .83 .50 .61 .65
M_S3 .46 .72 .54 .53 .41 .46 .53 .54 .64 .70 .69 .57
M_SR .36 .64 .54 .51 .39 .47 .53 .50 .68 .77 .84 .57

The delimiters (D and M) distinguish between data sets (e.g., D_A1) and models (e.g., M_A1). Avg = Average F1 score of all test sets; for
example, the average value of M_A1 was calculated for D_A1 to D_SR.

Furthermore, it achieved above-average F1 scores, from fifth (D_S1, D_S2, and D_SR) to sixth (D_S3) highest among the 15 emotion models in the separate self-reported test sets.

Figure 6 plots the average F1 scores of the 15 emotion models from Sects. 5.1 and 5.2. The TTL emotion model achieves an average F1 of 0.84, which is higher than the values for the 11 models trained on separate data sets (M_A1 to M_SR; F1 between 0.52 and 0.68) and the three models trained on multiple data sets (the annotator-rated, self-reported, and integration emotion models; F1 between 0.62 and 0.79).

6 Discussion

We examined annotator-rated and self-reported emotion data sets as sources for developing emotion detection models. Each data set has its own rules; any model tends to do best when applied to the test set taken from the data set on which the focal model was trained. This result provides further empirical support for the theory of constructed emotions [6], which argues that the concept of emotion can produce different categories across different people, depending on their personal experiences. This first finding contradicts the classical view of emotion theory [35] that people possess inherent emotions, like universal biological fingerprints.

In relation to RQ1, we find that models developed on annotator-rated emotion data sets perform less well on data sets with self-reported emotions (average F1 = 0.53) than on those with annotator-rated emotions (average F1 = 0.62). Also relevant for RQ1 is our finding that people are biased in expressing their own emotions, similar to the biases shown by annotators. That is, models developed on self-reported emotion data sets perform less well on data sets with annotator-rated emotions (average F1 = 0.57) than on those with self-reported emotions (average F1 = 0.65).

For RQ2, the comparison of the findings with models trained on self-reported emotion data sets confirms that the M_SR model, trained on a relatively large self-reported data set, achieves better performance (average F1 = 0.70) in the self-reported emotion test sets than the three models trained on smaller, self-reported emotion data sets (M_S1 = 0.60, M_S2 = 0.65, M_S3 = 0.64). To the best of our knowledge, the D_SR emotion data set is the largest collection of tweets with emotion hashtag labels (n = 3,654,544) ever collected, spanning 13 years from October 2008 to October 2021. Nevertheless, M_SR earns a relatively low score on annotator-rated emotion data sets (average F1 = 0.49) compared with the other three models that are based on self-reported emotion labels (M_S1 = 0.60, M_S2 = 0.66, M_S3 = 0.64).

To answer RQ3, we offer the TTL emotion model, initially trained on the four combined self-reported emotion data sets (n = 3,699,345) and then on the combined annotator-rated emotion data set (n = 63,516). The model displays relatively strong performance, with the highest average F1 score of 0.84; it achieves the highest average F1 score of 0.87 on annotator-rated emotion test sets, but only 0.79 on self-reported emotion test sets. Notably, the TTL emotion model reveals substantial improvements over the annotator-rated emotion models trained on corresponding training sets (D_A1, D_A3, D_A4, D_A5, and D_A7).


Table 6 Weighted F1 scores of emotion models trained on multiple data sets


Model D_A1 D_A2 D_A3 D_A4 D_A5 D_A6 D_A7 D_S1 D_S2 D_S3 D_SR Avg

Annotator .75 .96 .84 .88 .90 .85 .89 .67 .71 .59 .67 .79
Self .34 .63 .54 .52 .38 .51 .53 .94 .80 .80 .84 .62
Integration .49 .92 .58 .76 .70 .61 .75 .93 .81 .78 .84 .74
TTL .77 .97 .83 .88 .90 .83 .90 .89 .79 .68 .79 .84
Annotator = annotator-rated emotion model trained on annotator-rated training sets (D_A1 to D_A7). Self = self-reported emotion model trained
on the self-reported training sets (D_S1 to D_SR). Integration = integration emotion model trained on the integrated data sets (annotator-rated
and self-reported training sets). TTL = TTL emotion model sequentially trained with self-reported training sets, followed by annotator-rated
training sets

Fig. 5 Plots of the losses of (a) the annotator-rated emotion model and (b) the TTL emotion model during training on the annotator-rated training sets

The average F1 score of the TTL emotion model also is higher than those of the integration emotion model that trained all data sets simultaneously, as well as the annotator-rated and self-reported emotion models (Fig. 6).

Fig. 6 Average F1 scores of fifteen emotion models

Further studies might apply the proposed TTL approach to other target domains with small annotator-rated emotion data sets. For example, it might be useful for developing universally applicable emotion detection models that reflect other target domains, such as specific countries (e.g., USA and China), age groups (e.g., children and adults), and genders, based on large, self-reported emotion data sets.

A limitation of this study is that we only collected emotions expressed in tweets, which may not generalize to other text posted on various social media platforms. The TTL emotion model achieved the highest average F1 score of 0.84, which is only 0.05 higher than the annotator-rated emotion model value of 0.79. Considering that the classification accuracy of previous emotion detection studies falls between 0.50 and 0.80, it may be difficult to increase the performance of human emotion detection dramatically. Continued research should integrate other types of social media data sets. Also, methods such as ensemble techniques can be used to investigate potential improvements to the accuracy of transformer models.

Acknowledgements This article was supported by Twitter, which provided the academic research API that enabled us to collect tweets.

Author's contribution SJL drafted the manuscript and designed the study. All authors considered the results and approved the final manuscript.

Funding Open Access funding enabled and organized by CAUL and its Member Institutions. This work was not funded.

Data availability The analyzed emotion-labeled data set is shared as an open data set for academic purposes (https://github.com/EmotionDetection/Self-Reported-SR-emotion-dataset.git).
genders, based on large, self-reported emotion data sets. tionDetection/Self-Reported-SR-emotion-dataset.git).


Code availability Not applicable.

Declarations

Conflict of interest The authors declare that they have no conflict of interest.

Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.

Consent to participate Not applicable.

Consent for publication Not applicable.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Abdul-Mageed M, Ungar L (2017) EmoNet: fine-grained emotion detection with gated recurrent neural networks. In: Proceedings of the 55th annual meeting of the Association for Computational Linguistics (volume 1: long papers), pp 718–728
2. Agirre E, Màrquez L, Wicentowski R (2007) In: Proceedings of the fourth international workshop on semantic evaluations (SemEval-2007)
3. Al-Omari H, Abdullah MA, Shaikh S (2020) EmoDet2: emotion detection in English textual dialogue using BERT and BiLSTM models. In: 2020 11th international conference on information and communication systems (ICICS), IEEE, pp 226–232
4. Balahur A, Hermida JM, Montoyo A, Muñoz R (2011) EmotiNet: a knowledge base for emotion detection in text built on the appraisal theories. In: International conference on application of natural language to information systems, Springer, Berlin, Heidelberg, pp 27–39
5. Barrett LF (2017) The theory of constructed emotion: an active inference account of interoception and categorization. Soc Cogn Affect Neurosci 12(1):1–23
6. Barrett LF (2017) Categories and their role in the science of emotion. Psychol Inq 28(1):20–26
7. Chatterjee A, Narahari KN, Joshi M, Agrawal P (2019) SemEval-2019 task 3: EmoContext contextual emotion detection in text. In: Proceedings of the 13th international workshop on semantic evaluation, pp 39–48
8. Crowdflower (2017) Crowdflower's data sets. https://data.world/crowdflower. Accessed 19 Dec 2021
9. Demszky D, Movshovitz-Attias D, Ko J, Cowen A, Nemade G, Ravi S (2020) GoEmotions: a data set of fine-grained emotions. arXiv preprint arXiv:2005.00547
10. Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
11. Dunfield K, Kuhlmeier VA, Connell L, Kelley E (2011) Examining the diversity of prosocial behavior: helping, sharing, and comforting in infancy. Infancy 16(3):227–247
12. Dutton DG, Aaron AP (1974) Some evidence for heightened sexual attraction under conditions of high anxiety. J Pers Soc Psychol 30(4):510–517
13. Ekman P (1972) Universals and cultural differences in facial expression of emotions. In: Nebraska symposium on motivation, University of Nebraska Press, Lincoln, pp 83–207
14. Ghazi D, Inkpen D, Szpakowicz S (2015) Detecting emotion stimuli in emotion-bearing sentences. In: International conference on intelligent text processing and computational linguistics, Springer, Cham, pp 152–165
15. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge
16. Kitayama S, Mesquita B, Karasawa M (2006) Cultural affordances and emotional experience: socially engaging and disengaging emotions in Japan and the United States. J Pers Soc Psychol 91(5):890
17. Kumar N, Dangeti P, Bhavsar K (2019) Natural language processing with Python cookbook. Packt Publishing, Birmingham
18. Lee SJ, Kishore S, Lim J, Paas L, Ahn HS (2021) Overwhelmed by fear: emotion analysis of COVID-19 vaccination tweets. In: TENCON 2021 – 2021 IEEE Region 10 Conference (TENCON), pp 429–434
19. Lim J, Sa I, Ahn HS, Gasteiger N, Lee SJ, MacDonald B (2021) Subsentence extraction from text using coverage-based deep learning language models. Sensors 21(8):2712
20. Liu CH (2017) Applications of Twitter emotion detection for stock market prediction. Doctoral dissertation, Massachusetts Institute of Technology
21. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Stoyanov V (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692
22. Malik F, Marwaha R (2018) Developmental stages of social emotional development in children. StatPearls Publishing, Treasure Island
23. Mcconatha JT, Lightner E, Deaner SL (1994) Culture, age, and gender as variables in the expression of emotions. J Soc Behav Pers 9(3):481
24. Mesquita B, Walker R (2003) Cultural differences in emotions: a context for interpreting emotional experiences. Behav Res Ther 41(7):777–793
25. Mohammad S (2012) #Emotional tweets. In: *SEM 2012: the first joint conference on lexical and computational semantics – Volume 1: proceedings of the main conference and the shared task, and Volume 2: proceedings of the sixth international workshop on semantic evaluation (SemEval 2012), pp 246–255
26. Mohammad S, Bravo-Marquez F, Salameh M, Kiritchenko S (2018) SemEval-2018 task 1: affect in tweets. In: Proceedings of the 12th international workshop on semantic evaluation, pp 1–17
27. Mohammad SM, Bravo-Marquez F (2017) Emotion intensities in tweets. arXiv preprint arXiv:1708.03696
28. Mohammad SM, Kiritchenko S (2015) Using hashtags to capture fine emotion categories from tweets. Comput Intell 31(2):301–326
29. Mohammad SM, Sobhani P, Kiritchenko S (2017) Stance and sentiment in tweets. ACM Trans Internet Technol (TOIT) 17(3):1–23
30. Sailunaz K, Dhaliwal M, Rokne J, Alhajj R (2018) Emotion detection from text and speech: a survey. Soc Netw Anal Min 8(1):1–26


31. Salehan M, Kim DJ (2016) Predicting the performance of online consumer reviews: a sentiment mining approach to big data analytics. Decis Support Syst 81:30–40
32. Saravia E, Liu HCT, Huang YH, Wu J, Chen YS (2018) CARER: contextualized affect representations for emotion recognition. In: Proceedings of the 2018 conference on empirical methods in natural language processing, pp 3687–3697
33. Scherer KR, Wallbott HG (1994) Evidence for universality and cultural variation of differential emotion response patterning. J Pers Soc Psychol 66(2):310
34. Schuff H, Barnes J, Mohme J, Padó S, Klinger R (2017) Annotation, modelling and analysis of fine-grained emotions on a stance and sentiment detection corpus. In: Proceedings of the 8th workshop on computational approaches to subjectivity, sentiment and social media analysis, pp 13–23
35. Siegel EH, Sands MK, Noortgate WVD, Condon P, Chang Y, Dy J, Barrett FL (2018) Emotion fingerprints or emotion populations? A meta-analytic investigation of autonomic features of emotion categories. Psychol Bull 144(4):343
36. Tenney I, Das D, Pavlick E (2019) BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950
37. Tian G, Lu L, McIntosh C (2021) What factors affect consumers' dining sentiments and their ratings: evidence from restaurant online review data. Food Qual Prefer 88:104060
38. Uhls YT, Michikyan M, Morris J, Garcia D, Small GW, Zgourou E, Greenfield PM (2014) Five days at outdoor education camp without screens improves preteen skills with nonverbal emotion cues. Comput Hum Behav 39:387–392
39. Vinodhini G, Chandrasekaran RM (2012) Sentiment analysis and opinion mining: a survey. Int J 2(6):282–292
40. Volkova S, Bachrach Y (2016) Inferring perceived demographics from user emotional tone and user-environment emotional contrast. In: Proceedings of the 54th annual meeting of the Association for Computational Linguistics (volume 1: long papers), pp 1567–1578
41. Wang W, Chen L, Thirunarayan K, Sheth AP (2012) Harnessing Twitter "big data" for automatic emotion identification. In: 2012 international conference on privacy, security, risk and trust and 2012 international conference on social computing, IEEE, pp 587–592
42. Wolf T, Chaumond J, Debut L, Sanh V, Delangue C, Moi A, Rush AM (2020) Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pp 38–45
43. Zhang Y, Wang L, Wang X, Zhang C, Ge J, Tang J, Duan H (2021) Data augmentation and transfer learning strategies for reaction prediction in low chemical data regimes. Organ Chem Front 8(7):1415–1423

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
