Music Emotion Recognition Using Multi-head Self-attention-Based Models

Yao Xiao1 , Haoxin Ruan1 , Xujian Zhao1(B) , Peiquan Jin2 , and Xuebo Cai3
1 Southwest University of Science and Technology, Mianyang 621010, Sichuan, China
jasonzhaoxj@swust.edu.cn
2 University of Science and Technology of China, Hefei 230026, Anhui, China
3 Sichuan University of Culture and Arts, Mianyang 621000, Sichuan, China
Abstract. Music Emotion Recognition (MER) has been a major challenge in
Music Information Retrieval (MIR) and is essential in many fields, such as music
psychotherapy, individualized instruction, and music recommendation. Some
existing approaches extract emotion-related features through deep neural
networks. Unfortunately, these methods perform poorly at outputting music
emotion continuously, which is important for emotion perception. More recently,
self-attention-based models have been proposed to learn emotion-related
information by determining which parts of the input to focus on and ignoring
irrelevant and noisy data. However, because music emotion carries rich
multi-dimensional semantic information, these models cannot fully excavate the
multi-dimensional implicit semantic information of the feature sequence and
ignore some of the emotion-related information. To address these issues, we
present a new approach for MER that extracts continuous features and then
effectively excavates multi-dimensional information sensitive to emotion. First,
we propose a neural network-based method to extract continuous features,
intending to mine emotion-abundant features. After that, we design a classifier
with a multi-head self-attention-based model to excavate multi-dimensional
information for music emotion recognition. Finally, we conduct experiments on a
real dataset, EMOPIA. Experimental results show that our methods surpass the
baseline on the corresponding modalities (symbolic and acoustic) by 6.1% and
4.4% in accuracy on the same dataset, which verifies the superiority of our
method.
Keywords: Music emotion recognition · Arousal-Valence · Deep learning ·
Short-chunk CNN · Multi-head self-attention
1 Introduction
Music emotion recognition has been a major research subject in MIR (Music Infor-
mation Retrieval) [1], especially when building an automatic MER (Music Emotion
Recognition) system. Emotions are one of the primary motivators for people to engage
and interact with music. In addition, as such systems have the potential to make individ-
ualized music instruction more accessible, the importance of MER systems is becoming
more apparent.
Y. Xiao and H. Ruan—These authors contribute equally to this work.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 101–114, 2023.
https://doi.org/10.1007/978-981-99-4752-2_9
Deep learning methods have become increasingly popular in recent years. How-
ever, they still face many challenges and limitations when performing music emotion
recognition. For example, the continuous recognition degree of Convolutional Neural
Network (CNN) is not high, and the timing of music emotion is not considered in Long
Short-Term Memory (LSTM) models. Moreover, LSTM models do not consider the
influence of local critical information on music emotion. In other words, they do not
perform well in resolving the problem of continuity recognition, which is essential for
emotion perception. Therefore, they are not promising for the MER tasks.
Like other kinds of multimedia, such as pictures, videos, and audio, music also con-
tains rich multi-dimensional semantic information. So far, self-attention-based models
have been proposed to learn the emotion-related information of music by letting the
model focus on different parts of the sequence to determine which part should be the
focus [19]. Such models can help note emotional expression in music and ignore irrel-
evant and noisy information. However, when faced with the task of MER, they cannot
fully excavate the multi-dimensional implicit semantic details of the feature sequence
and will ignore some of the emotion-related information. More specifically, they can
only pay attention to some emotion-related information and dismiss the rest, as Fig. 1
shows.
Fig. 1. How multi-head self-attention (b) works by noting multi-dimensional semantic informa-
tion compared with self-attention (a).
Generally, there are two main challenging issues when formulating MER models,
especially piano music emotion recognition in our paper:
(1) How to extract music features that contain more continuous information than current
methods?
(2) How to capture rich multi-dimensional emotional information between differ-
ent parts of the feature sequence and identify the multi-dimension of emotional
expression?
This study presents a novel approach to address the challenges of formulating MER
models. Firstly, we propose a learning-based method to extract continuous features,
intending to excavate emotion-abundant features. After that, we design a classifier with
the Multi-Head Self-Attention (MHSA) based model to excavate multi-dimensional
emotion-related information.
In summary, we make the following contributions in this paper:

(1) Aiming to extract continuous information sensitive to emotion, we design feature
extractors according to the characteristics of different neural networks. Specifically,
Short-chunk CNN-based and Bidirectional LSTM (BiLSTM) based music feature
extractors are presented for the acoustic and symbolic domains, respectively.
(2) Aiming to capture abundant multi-dimensional emotional information between dif-
ferent parts of the music feature and identify multiple aspects of emotional expres-
sion, we exploit a multi-head self-attention-based model with a varying number of
heads to strengthen the ability to recognize emotions of music.
(3) We conduct experiments on EMOPIA [11], a real dataset with four-category emotion
labels. The experimental results show our methods outperform the baseline.
For the MIDI and audio branches, our method outperforms the baseline in accuracy
by 6.1% and 4.4% on the same dataset, respectively. Additionally, we develop a
comprehensive benchmark for both the MIDI and audio music formats through
ablation experiments that evaluate which settings boost accuracy the most
and provide insights into the performance.
2 Related Works
We divide previous work on music emotion recognition into the following three
categories.
2.1 Symbolic Music vs. Acoustic Music

Symbolic music representations use abstract representations of musical elements such
as notes, rhythms, chords, and melodies and do not involve audio signals. This makes
them easy to process, modify, and use across different platforms. Symbolic music is
typically represented in a MIDI file with text-like characteristics, making it possible to
apply Natural Language Processing (NLP) models like BERT and GPT for improved
results. Ferreira [8] used a combination of LSTM and GPT-2 for emotion
classification, while Qiu [17] proposed the MIDIGPT model based on the early
work [7].
Open datasets have traditionally been limited by copyright laws, which interfere
with training MER models because researchers are deterred from sharing the raw audio
material needed for training and evaluation [1]. Some proper solutions for this situation
are to provide metadata from the datasets and preprocessed audio features like MIDI.
2.2 Deep Learning Methods for MER

With the development of deep learning, the accuracy of using deep learning methods to
identify music emotions has been greatly improved in recent years. In music and speech
recognition, researchers are using neural networks such as CNN to learn high-level
features from audio data [1].
More recently, LSTM networks have gained significant attention in video, speech,
and music recognition [5, 8, 17]. They were introduced to address the long-term depen-
dency issue of RNN. Meanwhile, the AVEC 2017 competition winners showed that
LSTM-RNN could reduce feature engineering efforts and improve performance for
continuous emotion recognition [5]. BiLSTM is an extension of LSTM that processes
sequences in both forward and backward directions, allowing it to capture contextual
information from both the past and future. As the internal correlations in music emotion
are strong and the current state is dependent not only on the previous but also on the
future state, the BiLSTM network is an ideal solution to address this issue [10].
2.3 Attention Mechanism for MER

Self-attention models are effective in MER as they can weigh each position in an input
sequence and learn the dependence relationships between different time steps. The Multi-
scale Context-based Attention model (MCA) proposed by [4] fuses different time scales
with attention to dynamically model the representation of music structure, and experi-
ments show its effectiveness. The attentive LSTM by [3] incorporates a modified atten-
tion mechanism that considers only the hidden states before a certain moment, result-
ing in significantly improved results compared to existing literature. These innovative
approaches highlight the potential of self-attention models in MER. Traditional attention
models are designed to rely heavily on external information [4], and they may not fully
account for music’s complex and diverse emotional characteristics.
Generally, existing studies typically do not perform well in resolving the problem of
continuity recognition, which is important in music emotion recognition. Besides, current
attention models used in MER cannot capture abundant multi-dimensional emotional
information and identify the multi-dimension of emotional expression. In the paper,
our proposal can extract continuous features more efficiently and then excavate multi-
dimensional emotion-related information, making it comprehensive for MER.
3 Methodology
As Figure 2 shows, our method for MER consists of three consecutive tasks: Prepro-
cessing, Feature Extraction, and Emotion Recognition.
Fig. 2. An overview architecture of MER

3.1 Preprocessing
First, music data should be preprocessed to get their general representations, which
can be used as the input of DNNs for the following tasks. In the paper, we model the
music data through feature representation to obtain Mel-spectrogram representations
from audio files and MIDI-like [15] representations from MIDI files.
Mel-Spectrogram Feature Representation. Audio features are the earliest and most
widely studied representations in the MER field, mainly extracted from waveform files
[9]. The most commonly used timbre representation is Mel-spectrogram, which has a
nonlinear correspondence with Hz frequency. It is widely used in speech recognition
and MIR tasks because it provides a more meaningful representation of the spectral
content of sound signals. Our implementation involves using TorchAudio to obtain the
Mel-spectrogram with 128 bins using a 1024-size FFT (with Hanning window) and 512
hop-size at a 22,050 Hz sampling rate. We randomly sample three seconds of audio
chunks to generate input matrices for the following model.
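As a rough check on the input size, the number of spectrogram frames in one three-second chunk follows from the sampling rate and hop size given above. The sketch below is plain arithmetic, not the authors' code; it assumes centered (padded) framing, which is the TorchAudio default but is not stated in the text, and the function name `num_frames` is illustrative.

```python
def num_frames(duration_s: float, sample_rate: int, hop_length: int,
               n_fft: int = 1024, center: bool = True) -> int:
    """Number of STFT frames for a chunk of the given duration."""
    n_samples = int(round(duration_s * sample_rate))
    if center:
        # With centered framing the signal is padded by n_fft // 2 on both
        # sides, so the frame count depends only on the hop size.
        return 1 + n_samples // hop_length
    # Without padding, only windows that fit entirely inside the signal count.
    return 1 + (n_samples - n_fft) // hop_length


# Settings from the text: 3 s chunks, 22,050 Hz, hop size 512, 128 Mel bins.
frames = num_frames(3.0, 22_050, 512)
print((128, frames))  # shape of one Mel-spectrogram input matrix
```

Under the centered-framing assumption, each three-second chunk yields a 128 × 130 matrix; without padding the same chunk would give 128 frames.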
Transcription to MIDI and MIDI-Related Feature Representation. The MIDI files are
transcribed from the original audio using the high-resolution piano transcription model
developed by [12]. After acquiring the MIDI files, we manually obtain the MIDI-like [15]
representation using TorchAudio. Each MIDI file is then converted into a vector,
denoted as V^{1×len}, where the length len depends on the MIDI file.
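To make the idea of a variable-length token vector concrete, the following sketch encodes a list of notes into MIDI-like event tokens. It is an illustration only: the vocabulary layout (128 note-on, 128 note-off, 100 time-shift steps of 10 ms, and 32 velocity bins) is assumed from the performance-event encoding of [15], and the function name, the note tuple format, and the token offsets are hypothetical rather than taken from the authors' pipeline.

```python
from typing import List, Tuple

# Assumed vocabulary layout (388 tokens in total); offsets are illustrative.
NOTE_ON, NOTE_OFF, TIME_SHIFT, VELOCITY = 0, 128, 256, 356
VOCAB_SIZE = 388
TIME_STEP_S = 0.01      # one time-shift unit = 10 ms
MAX_SHIFT_STEPS = 100   # longest single time-shift token = 1 s

Note = Tuple[float, float, int, int]  # (onset_s, offset_s, pitch, velocity)


def encode_midi_like(notes: List[Note]) -> List[int]:
    """Turn notes into a flat token sequence whose length depends on the piece."""
    events = []
    for onset, offset, pitch, velocity in notes:
        events.append((onset, 1, pitch, velocity))  # note-on event
        events.append((offset, 0, pitch, 0))        # note-off event
    events.sort()  # by time; note-offs come before note-ons at the same time

    tokens: List[int] = []
    now = 0  # current time in 10 ms steps
    for time_s, is_on, pitch, velocity in events:
        target = round(time_s / TIME_STEP_S)
        gap = target - now
        while gap > 0:  # long gaps are split into several time-shift tokens
            step = min(gap, MAX_SHIFT_STEPS)
            tokens.append(TIME_SHIFT + step - 1)
            gap -= step
        now = target
        if is_on:
            tokens.append(VELOCITY + velocity * 32 // 128)  # 32 velocity bins
            tokens.append(NOTE_ON + pitch)
        else:
            tokens.append(NOTE_OFF + pitch)
    return tokens


# A single middle C held for half a second.
print(encode_midi_like([(0.0, 0.5, 60, 64)]))
```

Because the number of tokens grows with the number of notes and the gaps between them, two clips of similar duration can still produce vectors of different length, which is why len is file-dependent.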
3.2 Feature Extraction
After that, we train feature extraction networks to extract continuous features abun-
dant in emotional information. Specifically, we leverage these representations on
modality-specific networks enabling independent extraction of music emotion from dif-
ferent inputs. In detail, we introduce two DNNs as continuous feature extractors to
extract emotion-abundant features for audio domain representation and MIDI domain
representation, respectively.
Short-Chunk CNN-Based Feature Extraction for Acoustic Music. The proposed
feature extraction method uses the Short-chunk CNN [21] architecture, shown in
Fig. 3, which earlier experiments have shown to be robust against noise and
other changes [21]. It comprises a 7-layer CNN with residual connections, and
the output is a fixed 128-dimensional vector obtained through max pooling,
followed by two fully connected layers with ReLU activation for classification.
Notably, using small 3 × 3 filters allows a deeper network with a larger
receptive field and the extraction of more detailed music information,
contributing to the effectiveness and efficiency of the feature extraction
process [20]. Consequently, we adopt the Short-chunk CNN as our backbone model
for audio domain feature extraction.
To get continuous implicit semantic features, we feed the Mel-spectrogram matrices,
denoted as Q^{128×len}, into the backbone model. The length of the matrix
depends on the audio file's sample rate and the window size of the FFT from
preprocessing. After that, we get a 512-dimensional vector V^{512×1} for the following
MHSA-based classifier.
Fig. 3. Short-chunk CNN network for feature extraction (c/s/p stands for channel/stride/padding,
respectively)
BiLSTM-Based Feature Extraction for Symbolic Music. We also design a symbolic-
domain method similar to the audio-domain one. Considering the relationship between
emotion and performance time, we use BiLSTM in Fig. 4 to model feature matrix W
from MIDI-like representations, and the output dimensions of W is l × 512, where l is
the length of a MIDI-like file.
3.3 Emotion Recognition
Last, these features are passed to an MHSA-based classifier, which applies a
feed-forward layer and then fully connected layers for music emotion recognition.
In the paper, we incorporate a multi-head self-attention layer, along with a feed-
forward block and two dense layers, to classify the music data. Furthermore, the design
enables adjustment of the number of heads used in the attention block for different
dimensions of music emotion.
By capturing multiple aspects of emotional expression by attending to different
dimensions of musical emotion in each head, our proposal enhances the accuracy and
robustness of MER models. In more detail, by utilizing multiple heads, abundant multi-
dimensional emotional information can be captured between different parts of the music
feature, identifying multiple aspects of emotional expression. Additionally, the number
of heads in multi-head self-attention allows our classifier to attend to multiple aspects
of emotional expression in parallel, enabling more efficient and effective modeling of
music emotions.
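The multi-head self-attention layer described above can be sketched with NumPy as follows. This is a minimal illustration of standard scaled dot-product multi-head attention [19], not the authors' implementation: the projection weights are random rather than trained, the sequence length of 6 is arbitrary, and all function and variable names are assumptions. The model width of 512 and the four heads follow the feature dimension and Table 2.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); each weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    # Each head computes its own attention map over the sequence positions.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq, d_head)
    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights


rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 512, 4
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v, w_o = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)
out, attn = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, attn.shape)  # one attention map per head
```

Each head works on its own 128-dimensional slice of the projected features, so the four attention maps can weight the sequence positions differently; this is the sense in which different heads can attend to different aspects of the input in parallel.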
We feed the deep-semantic features extracted in the previous steps into the MHSA-
based classifier to predict their emotion class. Specifically, the methods equipped
with our MHSA-based classifier are called SCMA and BiLMA for the Short-chunk CNN-
based and BiLSTM-based feature extractors, respectively, as illustrated in Fig. 5.
Because MIDI files are not sliced into fixed lengths, the dimensions of the matrix
W generated by BiLSTM differ between files. As a result, the dimensions of each
matrix W_z generated from W by the multi-head self-attention layer also differ. The
dimensions of W_z and W both equal batchsize × n × m, which means that m differs in
every batch. To ensure consistency across all dimensions of the features, we apply
Eq. 1 to each batch. W_z and W are thus combined into M, whose dimension is n × n.
$$M = W W_z^{T} \tag{1}$$
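The effect of Eq. 1 on tensor shapes can be checked with a short NumPy sketch. It reads Eq. 1 as the product of W with the transpose of W_z, which is the reading that yields the stated n × n result from two batchsize × n × m inputs; the function name and the concrete sizes are illustrative assumptions.

```python
import numpy as np


def fixed_size_map(w: np.ndarray, w_z: np.ndarray) -> np.ndarray:
    """Batched M = W W_z^T: (batch, n, m) x (batch, n, m) -> (batch, n, n)."""
    return w @ np.swapaxes(w_z, -1, -2)


rng = np.random.default_rng(0)
batch, n = 2, 8
for m in (5, 11):  # m varies from batch to batch
    w = rng.standard_normal((batch, n, m))
    w_z = rng.standard_normal((batch, n, m))
    print(m, fixed_size_map(w, w_z).shape)  # the last two axes are always (n, n)
```

Whatever value m takes in a given batch, it is summed out by the product, so the following feed-forward and fully connected layers always receive an input of the same size.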
Fig. 4. BiLSTM network for feature extraction
Fig. 5. Structure of SCMA/BiLMA.
Finally, after modeling feed-forward and fully connected layers, the emotion
classification is performed, as depicted in Fig. 5.
4 Experiments and Results

To evaluate the performance of our proposal, we experiment on the EMOPIA dataset.
In this section, we discuss the performance evaluation of our algorithm.
4.1 Dataset
The EMOPIA dataset [11] is a piano music dataset for symbolic and acoustic music
emotion recognition. The emotion labels are assigned using the 4Q model [18], which
consists of the four quadrants Q1, Q2, Q3, and Q4. Note that the audio files are
gathered from the Internet using the provided metadata. Unfortunately, due to
copyright limitations, we could only access 845 of the total 1,087 audio clips for
our research. While this may have affected the baseline accuracy in our study, we
took great care to reproduce the baseline work for music emotion recognition using
the remaining 845 clips. The repository does, however, include the complete set of
MIDI files.
The MIDI files in the EMOPIA dataset are transcribed from the original audio using
the high-resolution piano transcription model developed by [12]. The dataset creator
manually checked the transcription results for a random set of clips and found the accu-
racy in note pitch, velocity, and duration satisfactory. After that, songs with engineered
ambient effects were removed from the collection, as the resulting transcription could
be fragmented and undesirable.
More information about the EMOPIA dataset is summarized in Table 1.
4.2 Experimental Settings

Modules equipped with our proposed MHSA-based classifier, SCMA for audio and
BiLMA for MIDI, all share the configuration shown in Table 2.
We evaluate the models on the validation set after each training epoch. Training
is stopped early when the validation accuracy for emotion recognition has not
improved for T consecutive epochs, where T = 0.05 × N and N denotes the maximum
number of training epochs. The checkpoint achieving the best accuracy on the
validation set during training is saved and evaluated on the test set. All
experiments are repeated with the same random seed and different numbers of epochs.
Table 1. Summary of EMOPIA [11].
Info                           EMOPIA
Number of MIDI files           1,087
Number of MP3 clips            845 used (of 1,087)
Emotion labels                 4Q taxonomy
Train-validation-test split    7:2:1
Source                         YouTube
Piano music type               Pop and multicultural
Clip duration                  About 30 s
Table 2. Hyper-parameters in detail.
Parameter                       Value
Optimizer                       Adam
Weight decay                    1e-4
Max epochs                      200
Global seed (SCMA/BiLMA)        42/43
Learning rate (SCMA/BiLMA)      1e-4/1e-3
Batch size (SCMA/BiLMA)         32/8
Number of heads (SCMA/BiLMA)    4/4
Evaluation Metrics. The performance of our model is measured in terms of F1-score
and AUROC (Area Under the Receiver Operating Characteristic curve) [2].
The F1-score, defined by Eq. 4, is a performance metric that summarizes the balance
between a classifier's precision and recall in a single value. In Eqs. 2, 3, and 5,
TP stands for True Positive, the number of positive instances that the
classification model correctly predicts as positive. Similarly, FN stands for False
Negative, FP stands for False Positive, and TN stands for True Negative. These
terms are essential for understanding the calculation of various classification
metrics and evaluating the performance of classification models.
$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
AUROC is a performance metric for classification tasks that evaluates a model's
ability to distinguish between classes by analyzing its performance at multiple
thresholds, unlike fixed-threshold measures such as F1-score. AUROC is calculated
based on the relationship between True Positive Rate (TPR) and False Positive Rate
(FPR) at different classification thresholds, as shown in Eq. 5.
$$\text{TPR} = \frac{TP}{TP + FN}, \quad \text{FPR} = \frac{FP}{FP + TN} \tag{5}$$
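The quantities in Eqs. 2 to 5 can be computed from the four confusion counts as in the sketch below, which uses the standard definitions of precision, recall, F1-score, TPR, and FPR. The function name and the example counts are made up for illustration and are not results from the paper. For the four-class 4Q task, such counts would be taken per class (one-vs-rest) and then averaged; the averaging scheme is not stated in the text.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Precision, recall, F1, TPR and FPR from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "tpr": recall,  # TPR is the same quantity as recall
        "fpr": fpr,
    }


# Hypothetical counts for one class at one decision threshold.
metrics = binary_metrics(tp=30, fp=10, fn=20, tn=40)
print(metrics)
```

Sweeping the decision threshold and recording the resulting (FPR, TPR) pairs traces the ROC curve, and the area under that curve is the AUROC.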
AUROC provides a comprehensive measure of its classification ability by consid-
ering the model’s performance at different thresholds. Therefore, our study employed
both AUROC and F1-score, along with accuracy, as evaluation metrics.
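As a closing note on the experimental settings, the early-stopping rule with patience T = 0.05 × N can be sketched as below. The sketch replays a given list of validation accuracies instead of training a model, so the function name, its return values, and the example curve are assumptions for illustration only.

```python
from typing import Sequence, Tuple


def early_stopping_run(val_accuracies: Sequence[float],
                       max_epochs: int = 200,
                       patience_ratio: float = 0.05) -> Tuple[int, float, int]:
    """Return (best_epoch, best_accuracy, last_epoch_run), epochs counted from 0."""
    patience = max(1, round(patience_ratio * max_epochs))  # T = 0.05 * N
    best_acc, best_epoch = float("-inf"), -1
    stale, last_epoch = 0, -1
    for epoch, acc in enumerate(val_accuracies[:max_epochs]):
        last_epoch = epoch
        if acc > best_acc:
            # A real training loop would save the model checkpoint here.
            best_acc, best_epoch, stale = acc, epoch, 0
        else:
            stale += 1
            if stale >= patience:  # no improvement for T consecutive epochs
                break
    return best_epoch, best_acc, last_epoch


# Hypothetical curve: accuracy peaks at epoch 1 and then stays flat.
curve = [0.5, 0.6] + [0.55] * 50
print(early_stopping_run(curve))
```

With N = 200 from Table 2, the patience is T = 10 epochs, so in this example training stops ten epochs after the peak and the checkpoint from the peak epoch is the one evaluated on the test set.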
4.3 Performance of MHSA-Based Models for MER
Table 3 summarizes the performance of our audio branch model compared to the
baseline [21], which has no MHSA-based classifier. Our SCMA model outperforms
the baseline with an accuracy of 0.714, an F1-score of 0.712, and an AUROC score
of 0.933. In contrast, the baseline model achieves an accuracy of 0.670, an
F1-score of 0.634, and an AUROC score of 0.902. This suggests that our SCMA model
can effectively capture relevant information from features for emotion recognition
tasks. Moreover, the MHSA-based classifier enhances its ability to focus on
essential emotion information from different perspectives.
Table 3. Comparison between our audio domain model and Short-chunk CNN (baseline) [21].
Audio branch models acc f1 auroc

Short-chunk CNN [21] 0.670 0.634 0.902
SCMA (our method for audio music) 0.714 0.712 0.933
Table 4 shows that the BiLMA model for the symbolic domain outperforms all current
models on the same EMOPIA dataset in accuracy, improving by 6.1% over the baseline
model. It is worth noting that the F1-score of MT-MIDIBERT (2022) [17] marginally
outperformed our method. However, it should be acknowledged that pre-trained models
like BERT and GPT require significantly larger training datasets and entail higher time
overheads than our lightweight model, which can be trained in a few minutes on a
normal GPU without pre-training. While the small performance gains of pre-trained
models come at a high cost, our model offers a more efficient and practical solution for
certain applications. Further, Fig. 6 shows the detailed classification situation of typical
models.
Table 4. Comparison between existing symbolic domain approaches.
MIDI branch models acc f1

LSTM-Attn (Baseline) [14] 0.647 0.563
SVM (2013) [13] 0.477 0.476
SVM (2013) [16] 0.398 0.362
MIDIGPT (2020) [8] 0.587 0.572
MIDIBERT-Piano (2021) [6] 0.634 0.628
MT-MIDIGPT (2022) [17] 0.625 0.611
MT-MIDIBERT (2022) [17] 0.676 0.664
BiLMA (our method for symbolic music) 0.708 0.631
4.4 Ablation Experiments
In this section, we conduct experiments on the EMOPIA dataset to study the impact of
the network architecture, MIDI-related features, number of heads, and training epochs.
Verify the Superiority of our Feature Extraction Networks. As Table 5 shows, our
model with a MIDI-like feature achieves the best accuracy against REMI [10]. Regarding
accuracy, our model for the symbolic branch outperforms the baseline model on both
MIDI-like and REMI features. However, the improvement is more significant for MIDI-
like features, where the BiLMA model outperforms the baseline model by 6.1%. For the
REMI feature, the BiLMA model shows a gain of 5.7% over the baseline model.
Table 5. Comparison of different MIDI features.
MIDI model                    Feature      acc      f1
LSTM-Attn [14] (baseline)     REMI         0.583    0.481
LSTM-Attn [14] (baseline)     MIDI-like    0.647    0.563
BiLMA (ours)                  REMI         0.640    0.537
BiLMA (ours)                  MIDI-like    0.708    0.631
As Table 6 shows, when our audio branch network’s architecture changes, accuracy
will drop to an extent. Besides, the SCMABiL model shows only marginal performance
improvements compared to the baseline model, indicating that the BiLSTM layer may
add complexity to the model without providing significant benefits.
Fig. 6. The results of the baseline models for audio and MIDI are shown in (a) and (c),
respectively. The performance of the models incorporating a multi-head self-attention
classifier is shown in (b) and (d). The audio model recognizes Q2 effectively but Q3
less so, while the MIDI model exhibits the opposite trend.
Table 6. Comparison of audio models: Impact of different combinations.
Audio model                             acc      f1       auroc
Short-chunk CNN (SC) [21] (baseline)    0.670    0.634    0.902
SC + MA (SCMA) (ours)                   0.714    0.712    0.933
SC + MA + BiLSTM (SCMABiL)              0.690    0.677    0.898
Verify the Superior Training Procedure of our Networks. Table 7 shows that both the
number of heads and the number of training epochs influence the emotion recognition
results of the SCMA model. Increasing the number of heads from 2 to 4 improves
the model's performance on all three metrics. With four heads and 118 epochs, the
model reaches its best accuracy. With eight heads, accuracy drops to its lowest
value (0.636), below the baseline accuracy of 0.670 shown in Table 3, although the
F1-score (0.650) still exceeds the baseline's 0.634.
Table 7. Influence of the different number of head and training epochs on the SCMA model.
Number of heads acc f1 auroc

2 0.677 0.677 0.898
4 (62epoch) 0.705 0.728 0.918
4 (118epoch) 0.714 0.712 0.933
4 (148epoch) 0.670 0.625 0.870
8 0.636 0.650 0.886
16 0.670 0.686 0.870
Table 8 indicates that increasing the number of training epochs from 60 to 100 signif-
icantly improves accuracy and F1-score for the MIDI-like feature. However, increasing
the number of epochs further to 250 does not improve the model’s performance.
Table 8. Influence of the different number of training epochs on the BiLMA model.
BiLMA Epoch acc-remi f1-remi acc-midi-like f1-midi-like

60 0.640 0.537 0.674 0.591
100 0.570 0.440 0.708 0.631
200 (SGD) 0.571 0.440 0.697 0.609
250 (SGD) 0.537 0.458 0.651 0.542
4.5 Discussion
Symbolic music is considered a better form for emotion recognition [17] since it
intrinsically contains information such as pitch, duration, speed, and severity,
which can be used to analyze emotion [9]. Besides, pre-trained models perform well
on text-like data, and MIDI is one such format. However, our results show that the
accuracy of acoustic music is higher than that of symbolic music in emotion
recognition. Compared with the many current MIDI branch models in Table 4, our
audio results in Table 3 still surpass those of symbolic music in all metrics,
which suggests that, beyond pitch, duration, speed, and severity, there may be
other information in audio that determines the emotion of the music.
5 Conclusion
In this paper, we propose an efficient approach for MER. Firstly, a method to extract
continuous features is presented, intending to excavate emotion-abundant features. After
that, we design a classifier with the MHSA-based model to excavate multi-dimensional
information for music emotion recognition. Experimental results demonstrate our pro-
posal’s effectiveness, which achieves state-of-the-art performance on the EMOPIA
dataset, setting a new benchmark in the field. Based on our approach, future research
directions could include multimodal methods incorporating audio, MIDI, and even
videos to understand music emotion from various perspectives. Such research may yield
valuable insights and contribute to developing more sophisticated MER systems.
Acknowledgments. This paper is supported by the Humanities and Social Sciences Founda-
tion of the Ministry of Education (17YJCZH260), the Sichuan Science and Technology Pro-
gram (2020YFS0057), the National Innovation Training Program for Undergraduate Students
(202210619023).
References
1. Cañón, J.S.G., et al.: Music emotion recognition: toward new, robust standards in personalized
and context-sensitive applications. IEEE Sig. Process. Mag. 38, 106–114 (2021)
2. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning
algorithms. Pattern Recogn. 30(7), 1145–1159 (1997)
3. Chaki, S., Doshi, P., Patnaik, P., Bhattacharya, S.: Attentive RNNs for continuous-time emo-
tion prediction in music clips. In: Chhaya, N., Jaidka, K., Healey, J., Ungar, L., Sinha,
A.R. (eds.) Proceedings of the 3rd Workshop on Affective Content Analysis (AffCon 2020)
Co-Located with Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020),
New York, USA, 7 February 2020. CEUR Workshop Proceedings, vol. 2614, pp. 36–46.
CEUR-WS.org (2020)
4. Chang, W.H., Li, J.L., Lin, Y.S., Lee, C.C.: A genre-affect relationship network with task-
specific uncertainty weighting for recognizing induced emotion in music. In: 2018 IEEE
International Conference on Multimedia and Expo (ICME), pp. 1–6 (2018)
5. Chen, S., Jin, Q., Zhao, J., Wang, S.: Multimodal multi-task learning for dimensional and
continuous emotion recognition. In: Proceedings of the 7th Annual Workshop on Audio/Visual
Emotion Challenge, AVEC 2017, pp. 19–26. Association for Computing Machinery, New
York (2017)
6. Chou, Y.H., Chen, I.C., Chang, C.J., Ching, J., Yang, Y.H.: MidiBERT-Piano: large-scale
pre-training for symbolic music understanding. ArXiv abs/2107.05223 (2021)
7. Fan, Y., Lu, X., Li, D., Liu, Y.: Video-based emotion recognition using CNN-RNN and C3D
hybrid networks. In: Proceedings of the 18th ACM International Conference on Multimodal
Interaction, ICMI 2016, pp. 445–450. Association for Computing Machinery, New York
(2016)
8. Ferreira, L.N., Lelis, L.H.S., Whitehead, J.: Computer-generated music for tabletop role-
playing games. In: Lelis, L., Thue, D. (eds.) Proceedings of the Sixteenth AAAI Conference
on Artificial Intelligence and Interactive Digital Entertainment, AIIDE 2020, Virtual, 19–23
October 2020, pp. 59–65. AAAI Press (2020)
9. Han, D., Kong, Y., Han, J., Wang, G.: A survey of music emotion recognition. Front. Comput.
Sci. 16(6), 166335 (2022)
10. Huang, Y.S., Yang, Y.H.: Pop music transformer: beat-based modeling and generation of
expressive pop piano compositions. In: Proceedings of the 28th ACM International Confer-
ence on Multimedia, MM 2020, pp. 1180–1188. Association for Computing Machinery, New
York (2020)
11. Hung, H., Ching, J., Doh, S., Kim, N., Nam, J., Yang, Y.: EMOPIA: a multi-modal pop piano
dataset for emotion recognition and emotion-based music generation. In: Lee, J.H., et al. (eds.)
Proceedings of the 22nd International Society for Music Information Retrieval Conference,
ISMIR 2021, Online, 7–12 November 2021, pp. 318–325 (2021)
12. Kong, Q., Li, B., Song, X., Wan, Y., Wang, Y.: High-resolution piano transcription with
pedals by regressing onset and offset times. IEEE/ACM Trans. Audio Speech Lang. Process.
29, 3707–3717 (2021)
13. Lin, Y., Chen, X., Yang, D.: Exploration of music emotion recognition based on MIDI. In: de
Souza Britto Jr., A., Gouyon, F., Dixon, S. (eds.) Proceedings of the 14th International Society
for Music Information Retrieval Conference, ISMIR 2013, Curitiba, Brazil, 4–8 November
2013, pp. 221–226 (2013)
14. Lin, Z., et al.: A structured self-attentive sentence embedding. In: 5th International Conference
on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017, Conference
Track Proceedings. OpenReview.net (2017)
15. Oore, S., Simon, I., Dieleman, S., Eck, D., Simonyan, K.: This time with feeling: learning
expressive musical performance. Neural Comput. Appl. 32(4), 955–967 (2020)
16. Panda, R.E.S., Malheiro, R., Rocha, B., Oliveira, A.P., Paiva, R.P.: Multi-modal music emotion
recognition: a new dataset, methodology and comparative analysis. In: 10th International
Symposium on Computer Music Multidisciplinary Research (CMMR 2013), pp. 570–582
(2013)
17. Qiu, J., Chen, C., Zhang, T.: Novel Multi-Task Learning Method for Symbolic Music Emotion
Recognition. arXiv preprint arXiv:2201.05782 (2022)
18. Russell, J.A.: A circumplex model of affect. J. Pers. Soc. Psychol. 39, 1161–1178 (1980)
19. Vaswani, A., et al.: Attention is all you need. In: Proceedings of the 31st International Con-
ference on Neural Information Processing Systems, NIPS 2017, pp. 6000–6010. Curran
Associates Inc., Red Hook (2017)
20. Won, M., Choi, K., Serra, X.: Semi-supervised music tagging transformer. In: Lee, J.H.,
et al. (eds.) Proceedings of the 22nd International Society for Music Information Retrieval
Conference, ISMIR 2021, Online, 7–12 November 2021, pp. 769–776 (2021)
21. Won, M., Ferraro, A., Bogdanov, D., Serra, X.: Evaluation of CNN-based automatic music
tagging models. In: Proceedings of 17th Sound and Music Computing (2020)
