Will Affective Computing Emerge From Foundation Models and General AI? A First Evaluation on ChatGPT
IEEE Intelligent Systems 38(1). This article is part of the IEEE IS ACSA series. © 2023 ACSA

Affective Computing and Sentiment Analysis

WITH THE ADVENT of increasingly large-data-trained general-purpose machine learning models, a new era of ‘foundation models’ has started. According to [1], these are marked by having been trained on ‘broad’ data – often self-supervised – at scale, leading to a) homogenisation (i.e., most use the same model for fine-tuning and training for down-stream tasks, as they are effective
across many tasks and too cost-intensive to train individually), and b) emergence (i.e., these models can solve tasks they were not originally trained on – potentially even without additional fine-tuning or downstream training). However, at this time, much more research is needed to understand the actual emergence abilities that could potentially lead to a massive paradigm shift in machine learning: models might no longer need to be trained specifically for limited tasks at all, be it from the upstream or the downstream perspective. Here, we consider the example of Affective Computing tasks seen from a Natural Language Processing (NLP) end. In the future, will we need to train extra models at all to tackle tasks such as personality, sentiment, or suicidal tendency recognition from text? Or will ‘big’ foundation models suffice, with these tasks emerging from them?

To this end, we consider ChatGPT as our exemplary ‘big’ foundation model and check for the full emergence of the above three tasks. It was launched on 30 November 2022 and gained over one million users within one week [2]. It has shown very promising results as an interactive chatbot that is capable, to a large extent, of understanding questions posed by humans and giving meaningful answers to them. ChatGPT is one of the above-named ‘foundation models’; it is constructed by fine-tuning a Large Language Model (LLM), namely GPT-3, which can generate English text. The model is fine-tuned using Reinforcement Learning from Human Feedback (RLHF) [3], which makes use of reward models that rank responses based on different criteria; the reward models are then used to sample a more general space of responses [3], [4]. As a result, general artificial intelligence capabilities emerged from this training mechanism, which resulted in a very fast adoption of ChatGPT by many users in a very short time [2]. The effectiveness of these capabilities is not exactly known yet; for example, [5] explored many of the systematic failures of ChatGPT, and [6] explains the history of the Natural Language Processing (NLP) literature up to the point of developing ChatGPT.

In summary, the aim of this paper is to systematise an evaluation framework for assessing the performance of ChatGPT on various classification tasks, to answer the question whether it shows full emergence of other (Affective Computing-related) NLP tasks. We use this framework to show if ChatGPT has general capabilities that could yield competent performance on affective computing problems. The evaluation compares against specialised models that are specifically trained on the downstream tasks. The contributions of this paper are:

• We evaluate whether (NLP) foundation models can lead to ‘full’ (i.e., no need for fine-tuning or downstream training) emergence of other tasks, which would usually be trained on specific data sources; therefore,
• We introduce a method to evaluate ChatGPT on classification tasks.
• We compare the results of ChatGPT on three classification problems in the field of affective computing: big-five personality prediction, sentiment analysis, and suicide and depression detection.

The remainder of this paper is organised as follows: we begin by elaborating on the related work, then we introduce our method, then we present and discuss the results. We finish with concluding remarks.

RELATED WORK
We focus on related work within the key research question of potential emergence (in the text domain) by foundation models. In particular, [7] explores the question whether ChatGPT is a general NLP solver that works for all problems, covering a wide range of tasks such as reasoning, text summarisation, named entity recognition, and sentiment analysis. [8] explores the capabilities of GPT language models (including ChatGPT) in Machine Translation. [5] explores the systematic errors of ChatGPT.

METHOD
The aim of this paper is to evaluate the generalisation capabilities of ChatGPT across a wide range of affective computing tasks. To assess this, we utilise three datasets corresponding to the three aforementioned problems: big-five personality prediction, sentiment analysis, and suicide tendency assessment. On each of the introduced tasks, we attempt to get ChatGPT’s assessment of each example of the corresponding Test set. Furthermore, we compare ChatGPT
January/February 2023
Figure 1. Pipelines of the ChatGPT (top), RoBERTa baseline (second), Word2Vec baseline (third), and BoW baseline (bottom) approaches.
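As an illustration, the simplest of these pipelines, the BoW branch, can be sketched in pure Python; the helper names and whitespace tokenisation are our own assumptions, and the baseline’s IDF weighting and SVM classifier are omitted for brevity:

```python
from collections import Counter

def build_vocab(train_texts, max_words=10000):
    # Most common words across the Train set (the paper restricts to 10,000).
    counts = Counter(w for t in train_texts for w in t.lower().split())
    return [w for w, _ in counts.most_common(max_words)]

def count_vector(text, vocab):
    # Raw word counts over the fixed vocabulary.
    c = Counter(text.lower().split())
    return [c[w] for w in vocab]

def fit_scaler(train_vectors):
    # Per-feature maximum absolute value over the Train set.
    return [max(abs(v[i]) for v in train_vectors) or 1.0
            for i in range(len(train_vectors[0]))]

def scale(vec, max_abs):
    # Divide by the Train-set maximum absolute value, mapping into [-1, 1].
    return [x / m for x, m in zip(vec, max_abs)]

train = ["i feel great today", "i feel sad", "great great news"]
vocab = build_vocab(train)
vecs = [count_vector(t, vocab) for t in train]
scaler = fit_scaler(vecs)
features = [scale(v, scaler) for v in vecs]
```

The resulting vectors would then be fed to a linear-kernel SVM, as in the bottom pipeline of Figure 1.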
5) Repeat steps 3–4 until the predictions for the whole Test set are finished.
6) Postprocess the results in case they need some cleanup.

The formatting in the first step and the postprocessing in the last step are specified in the following two Subsections. We used the version of ChatGPT released on 30.01.2023 (see the ChatGPT release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-notes).

Question Formulation
The formats used for the three problems are given by the following snippets. The example text is substituted in place of the {text} part; the quotation marks are kept, however, since they indicate to ChatGPT that this is a placeholder used by the question being asked. The formulations for the three problems are given by:

1) For the big-five personality traits, we formulate the question:
“What is your guess for the big-five personality traits of someone who said “{text}”, answer low or high with bullet points for the five traits? It does not have to be fully correct. You do not need to explain the traits. Do not show any warning after.”

2) For the sentiment analysis, we formulate the question:
“What is your guess for the sentiment of the text “{text}”, answer positive, neutral, or negative? it does not have to be correct. Do not show any warning after.”

3) For the suicide problem, we formulate the following question:
“What is your guess if a person is saying “{text}” has a suicide tendency or not, answer yes or no? it does not have to be correct. Do not show any warning after.”

The formulation of the question is of crucial importance to the answer ChatGPT will give; we encountered the following aspects:
1) Asking the question directly, without asking for a guess, made ChatGPT in many instances answer that there is too little information provided and that it cannot answer the question exactly. Hence, we ask it to guess the answer, and we declare that it is acceptable to be not fully accurate.
2) It is important to ask what the guess is and not Can you guess, because the latter can give a response similar to 1., where it responds with an answer that starts with No, I cannot accurately answer whether.... Therefore, the question needs to be assertive and specific.
3) We need to specify the exact output format, because ChatGPT can otherwise get innovative about the formatting of the answer, which can make it hard to collect the answers for our experiment. Despite specifying the format, it still sometimes gave different formats. We elaborate in the next Subsection.
4) The questions for the suicide assessment task triggered warnings in the responses of ChatGPT due to its sensitive content. We elaborate on the terms of use in the Acknowledgement Section.

Parsing Responses
The responses of ChatGPT need to be parsed, since ChatGPT can give arbitrary formats for a given answer, even when the content is the same. This is predominant in
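The prompt construction and a possible label cleanup can be sketched as follows; the function names and the keyword-search parsing are illustrative assumptions on our side, not the paper’s exact postprocessing:

```python
import re

# Question templates as quoted above; {text} is the placeholder that gets
# replaced by the example to classify (big-five template omitted for brevity).
SENTIMENT_PROMPT = ('What is your guess for the sentiment of the text '
                    '"{text}", answer positive, neutral, or negative? '
                    'it does not have to be correct. '
                    'Do not show any warning after.')

SUICIDE_PROMPT = ('What is your guess if a person is saying "{text}" '
                  'has a suicide tendency or not, answer yes or no? '
                  'it does not have to be correct. '
                  'Do not show any warning after.')

def make_prompt(template, text):
    # Substitute the example text into the {text} placeholder.
    return template.format(text=text)

def parse_label(response, labels):
    # ChatGPT may wrap the label in arbitrary text and formatting, so we
    # return the first known label word found in the response.
    for token in re.findall(r"[a-z]+", response.lower()):
        if token in labels:
            return token
    return None  # unparsable response, needs manual cleanup

prompt = make_prompt(SENTIMENT_PROMPT, "what a lovely day")
label = parse_label("My guess is: Positive.", {"positive", "neutral", "negative"})
```

A response that contains none of the expected label words yields `None`, mirroring the manual cleanup step the paper describes.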
amounts of text from Google News. The model operates by tokenising a given text into words; each word is assigned an embedding from the pretrained embeddings. The embeddings are then averaged over all words to give a static feature vector of size 300 for the entire string. A Support Vector Machine (SVM) [16] is then used to make predictions for the given task. The pipeline of this model is shown in Figure 1.

We train the SVM model by tuning its hyperparameter C using SMAC [14], sampling 20 values within the range [10⁻⁶, 10⁴] (log-sampled) and choosing the model that yields the best score on the Dev set. We use a Radial Basis Function (RBF) kernel for the SVMs, except for the sentiment dataset, where we apply a linear kernel, as the sentiment dataset is much bigger (as shown in Table 1), which renders the RBF kernel impractical due to its computational cost.

Bag of Words
The Bag-of-Words (BoW) model is a very simple baseline, which does not rely on any knowledge transfer or large-scale training. In particular, it uses only in-domain data for training and no other data for either the upstream or the downstream task.

We utilise the classical technique term frequency – inverse document frequency (TF-IDF), which tokenises the sentences into words; a sentence is then represented by a vector of the counts of the words it contains. The vector is then normalised by the term frequency across the entire Train set of the corresponding dataset. We restrict the words to the most common 10,000 words in the Train set, then we scale each feature to be within [−1, 1] by dividing by the maximum absolute value of the feature across the Train set. We optimise a linear-kernel SVM using SGD [16] due to the high number of features (10,000). We tune the learning rate η of SGD using SMAC [14].

RESULTS
In this section, we review the results of our experiments. In summary, we evaluate the performance of ChatGPT against the three baselines RoBERTa, Word2Vec, and BoW on three downstream classification tasks, namely personality traits, sentiment analysis, and suicide tendency assessment. We measure classification accuracy and Unweighted Average Recall (UAR) [20] as performance measures. UAR has the advantage of exposing whether a model performs very well on one class at the expense of the other class, especially on imbalanced datasets. Additionally, we utilise a randomised permutation test for statistical significance [21]. The main results of the experiments are shown in Table 3.

Discussion
The RoBERTa model achieves the best performance for the personality and suicide assessment tasks, with a statistically significant improvement in accuracy over ChatGPT. ChatGPT is the best in sentiment analysis, but only slightly better than RoBERTa. The UAR values for the personality traits point to similar conclusions about the relative performance; however, they are much lower for all baselines on some of the traits (openness and agreeableness). The UAR measure generally yields similar results for all models on both sentiment analysis and suicide assessment.

The performance of ChatGPT on the personality assessment is inferior to the three baselines on all traits. It is significantly worse than RoBERTa on all traits, and than Word2Vec on three traits.

ChatGPT has the best performance in sentiment analysis, where it is slightly better than RoBERTa and BoW and significantly better than Word2Vec. One of the potential reasons for the inferiority of Word2Vec and BoW on the sentiment dataset is their lack of subword encodings: the sentiment dataset is collected from Twitter, so it is very noisy, which can lead to many mistakes in identifying the words and hence in assigning them the proper embeddings. Subword encoding avoids many of these issues, since a few typos would still yield a meaningful subword representation of the given sentence.

The results on the suicide assessment problem contrast with the aforementioned analyses. The task is not as hard as the personality assessment problem, and comes with a much bigger amount of training data. Suicide assessment can rather be thought of as classifying extreme negative sentiment, and [7] showed that ChatGPT is better at predicting negative sentiment than positive. However, the texts of the suicide dataset are much less noisy compared to the sentiment dataset. In that case, the performance of the
Word2Vec and BoW models is more or less on par with the ChatGPT model, while RoBERTa is significantly better than all of them.

Our experiments indicate that ChatGPT has a decent performance across many tasks (especially sentiment analysis or similar), which is comparable to simple specialised models that solve a downstream task. However, it is not competent enough compared to the best specialised model solving the same downstream task (e.g., fine-tuned RoBERTa). The performance of ChatGPT does not generally show statistically significant differences when compared to the simplest baseline, BoW, which does not make any use of pretraining. This is further confirmed by [8] for machine translation and by [7] for other tasks. In summary, our study suggests that ChatGPT is a generalist model (in contrast to a specialised model) that can decently solve many different problems without specialised training. However, in order to achieve the best results on specific downstream tasks, dedicated training is still required. This might be improved in future versions of ChatGPT and the like by including more diverse tasks in the Reinforcement Learning from Human Feedback (RLHF) component of the training.

Limitations
The most crucial limitation of the presented results is the small amount of data for evaluation (497, 362, and 509 examples for the three tasks), since ChatGPT is only available for manual entries by consumers and not for automated large-scale testing. Additionally, it only responds to approximately 25–35 requests per hour, in order to reduce the computational cost and avoid brute forcing. Another issue that may limit future experiments is parsing the responses: in our experiments, ChatGPT responded with arbitrary formatting despite the desired format being specified explicitly in the question prompt.

CONCLUSION
In this paper, we provided first insight into the potential ‘full’ emergence of tasks by broad-data-trained foundation models. We approached this from the perspective of natural language tasks in the Affective Computing domain, and chose ChatGPT as an exemplary foundation model. To this end, we introduced a framework to evaluate the performance of ChatGPT as a generalist foundation model against specialised models on a total of seven classification tasks from three affective computing problems, namely personality assessment, sentiment analysis, and suicide tendency assessment.

We compared the results against three baselines, which reflect training on the downstream tasks either with or without additional data for the upstream task. The first model was RoBERTa, a large-scale-trained transformer-based language model; the second was Word2Vec, a deep learning model trained to reconstruct the linguistic contexts of words; and the third was a simple bag-of-words (BoW) model.

The experiments have shown that ChatGPT is a generalist model with decent performance on a wide range of problems without specialised training. ChatGPT showed superior performance in sentiment analysis, poor performance on personality assessment, and average performance on suicide assessment. In other words, we could demonstrate genuine emergence properties, potentially rendering future efforts to collect task-specific databases increasingly obsolete.
However, the performance of ChatGPT is not particularly impressive, since it did not show statistically significant differences from a simple BoW model in almost all cases. On the other hand, RoBERTa fine-tuned for a specific task had significantly better performance compared to ChatGPT on the given tasks, which suggests that, despite the generalisation abilities of ChatGPT, specialised models are still the best option for optimal performance. This can be taken into consideration in future developments of foundation models like ChatGPT, in order to yield wider exploration spaces for training.

In the near future, we will extend our experiments to more metrics, e.g., explainability and computational efficiency, on top of accuracy and UAR. We also plan to expand our comparative evaluation to more sophisticated models, e.g., prompt-based classification [22] and neurosymbolic AI [23], to more advanced affective computing tasks, e.g., sarcasm detection [24] and metaphor understanding [25], but also to more complex sentiment datasets requiring commonsense reasoning and/or narrative understanding.

ACKNOWLEDGEMENT
We would like to thank OpenAI for the usage of ChatGPT. We followed the usage policy of ChatGPT (released on 15.02.2023, https://platform.openai.com/docs/usage-policies/disallowed-usage). Our use of ChatGPT is purely for research purposes to assess emerging capabilities of foundation models, and does not promote the use of ChatGPT in any way that violates the aforementioned usage policy, in particular with regards to the subject of self-harm; note that some of the examples in the datasets we use triggered a related warning by ChatGPT.

REFERENCES
1. R. Bommasani et al., “On the Opportunities and Risks of Foundation Models,” arXiv, no. 2108.07258, 2021.
2. S. Mollman, “ChatGPT gained 1 million users in under a week. Here’s why the AI chatbot is primed to disrupt search as we know it,” 2022, https://finance.yahoo.com/news/chatgpt-gained-1-million-followers-224523258.html.
3. L. Ouyang et al., “Training language models to follow instructions with human feedback,” arXiv, no. 2203.02155, 2022.
4. L. Gao, J. Schulman, and J. Hilton, “Scaling Laws for Reward Model Overoptimization,” arXiv, no. 2210.10760, 2022.
5. A. Borji, “A Categorical Archive of ChatGPT Failures,” arXiv, no. 2302.03494, 2023.
6. C. Zhou et al., “A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT,” arXiv, no. 2302.09419, 2023.
7. C. Qin et al., “Is ChatGPT a General-Purpose Natural Language Processing Task Solver?” arXiv, no. 2302.06476, 2023.
8. A. Hendy et al., “How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation,” arXiv, no. 2302.09210, 2023.
9. V. Ponce-López et al., “ChaLearn LAP 2016: First Round Challenge on First Impressions – Dataset and Results,” European Conference on Computer Vision, Springer International Publishing, Cham, Switzerland, 2016, pp. 400–418.
10. H. J. Escalante et al., “Design of an Explainable Machine Learning Challenge for Video Interviews,” International Joint Conference on Neural Networks (IJCNN), IEEE, Anchorage, AK, USA, 2017, pp. 3688–3695.
11. H. Kaya, F. Gurpinar, and A. Ali Salah, “Multi-Modal Score Fusion and Decision Trees for Explainable Automatic Job Candidate Screening From Video CVs,” Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, IEEE, Honolulu, HI, USA, 2017.
12. A. Go, R. Bhayani, and L. Huang, “Twitter Sentiment Classification using Distant Supervision,” CS224N project report, Stanford, 2009.
13. V. Desu et al., “Suicide and Depression Detection in Social Media Forums,” Smart Intelligent Computing and Applications, Volume 2, Springer Nature Singapore, Singapore, 2022, pp. 263–270.
14. M. Lindauer et al., “SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization,” Journal of Machine Learning Research, 2022, pp. 1–9.
15. Y. Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” arXiv, no. 1907.11692, 2019.
16. C. M. Bishop, Pattern Recognition and Machine Learning, Springer, New York City, NY, USA, 2006.
17. D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” 3rd International Conference on Learning Representations, Conference Track Proceedings, ICLR, San Diego, CA, USA, 2015.
18. T. Mikolov et al., “Efficient Estimation of Word Representations in Vector Space,” arXiv, no. 1301.3781, 2013.
19. T. Mikolov et al., “Distributed Representations of Words and Phrases and their Compositionality,” C. Burges