Music Emotion Recognition Using Multi-head Self-attention-Based Models
Yao Xiao1, Haoxin Ruan1, Xujian Zhao1(B), Peiquan Jin2, and Xuebo Cai3
1 Southwest University of Science and Technology, Mianyang 621010, Sichuan, China
jasonzhaoxj@swust.edu.cn
2 University of Science and Technology of China, Hefei 230026, Anhui, China
3 Sichuan University of Culture and Arts, Mianyang 621000, Sichuan, China
1 Introduction
Music emotion recognition has been a major research subject in MIR (Music Infor-
mation Retrieval) [1], especially when building an automatic MER (Music Emotion
Recognition) system. Emotions are one of the primary motivators for people to engage
and interact with music. In addition, as such systems have the potential to make individ-
ualized music instruction more accessible, the importance of MER systems is becoming
more apparent.
Y. Xiao and H. Ruan—These authors contribute equally to this work.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 101–114, 2023.
https://doi.org/10.1007/978-981-99-4752-2_9
Deep learning methods have become increasingly popular in recent years. However,
they still face many challenges and limitations when performing music emotion
recognition. For example, Convolutional Neural Network (CNN) models achieve only
limited continuity recognition, and Long Short-Term Memory (LSTM) models do not
consider the timing of music emotion. Moreover, LSTM models do not consider the
influence of local critical information on music emotion. In other words, neither
performs well on continuity recognition, which is essential for emotion perception.
Therefore, they are not promising for MER tasks.
Like other kinds of multimedia, such as pictures, videos, and audio, music also con-
tains rich multi-dimensional semantic information. So far, self-attention-based models
have been proposed to learn the emotion-related information of music by letting the
model focus on different parts of the sequence to determine which part should be the
focus [19]. Such models can help note emotional expression in music and ignore irrel-
evant and noisy information. However, when faced with the task of MER, they cannot
fully excavate the multi-dimensional implicit semantic details of the feature sequence
and will ignore some of the emotion-related information. More specifically, they can
only pay attention to some emotion-related information and dismiss the rest, as Fig. 1
shows.
Fig. 1. How multi-head self-attention (b) works by noting multi-dimensional semantic informa-
tion compared with self-attention (a).
Generally, there are two main challenging issues when formulating MER models,
especially piano music emotion recognition in our paper:
(1) How to extract music features that contain more continuous information than current
methods?
(2) How to capture rich multi-dimensional emotional information between differ-
ent parts of the feature sequence and identify the multi-dimension of emotional
expression?
This study presents a novel approach to address the challenges of formulating MER
models. Firstly, we propose a learning-based method to extract continuous features,
intending to excavate emotion-abundant features. After that, we design a classifier with
the Multi-Head Self-Attention (MHSA) based model to excavate multi-dimensional
emotion-related information.
2 Related Works
We divide previous work on music emotion recognition into the following three
categories.
LSTM-RNN could reduce feature engineering efforts and improve performance for
continuous emotion recognition [5]. BiLSTM is an extension of LSTM that processes
sequences in both forward and backward directions, allowing it to capture contextual
information from both the past and future. As the internal correlations in music emotion
are strong and the current state is dependent not only on the previous but also on the
future state, the BiLSTM network is an ideal solution to address this issue [10].
3 Methodology
As Figure 2 shows, our method for MER consists of three consecutive tasks: Prepro-
cessing, Feature Extraction, and Emotion Recognition.
3.1 Preprocessing
First, music data should be preprocessed to get their general representations, which
can be used as the input of DNNs for the following tasks. In the paper, we model the
music data through feature representation to obtain Mel-spectrogram representations
from audio files and MIDI-like [15] representations from MIDI files.
Mel-Spectrogram Feature Representation. Audio features are the earliest and most
widely studied representations in the MER field, mainly extracted from waveform files
[9]. The most commonly used timbre representation is Mel-spectrogram, which has a
nonlinear correspondence with Hz frequency. It is widely used in speech recognition
and MIR tasks because it provides a more meaningful representation of the spectral
content of sound signals. Our implementation involves using TorchAudio to obtain the
Mel-spectrogram with 128 bins using a 1024-size FFT (with Hanning window) and 512
hop-size at a 22,050 Hz sampling rate. We randomly sample three seconds of audio
chunks to generate input matrices for the following model.
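The preprocessing parameters above fix the shape of each input matrix. The following is a minimal sketch of that arithmetic in plain Python, not the TorchAudio pipeline itself. It assumes the HTK-style mel formula and centre-padded framing; the paper states neither, so both are assumptions, and the helper names are illustrative.

```python
import math

SAMPLE_RATE = 22_050   # Hz, from the text
N_MELS = 128           # mel bins, from the text
HOP_LENGTH = 512       # hop size, from the text
CHUNK_SECONDS = 3      # chunk duration, from the text


def hz_to_mel(freq_hz: float) -> float:
    """Convert Hz to mel with the HTK-style formula (assumed convention)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def num_frames(num_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Frame count of an STFT with centre padding (assumed): 1 + floor(n / hop)."""
    return 1 + num_samples // hop_length


chunk_samples = CHUNK_SECONDS * SAMPLE_RATE   # samples in one three-second chunk
frames = num_frames(chunk_samples)            # time frames per chunk
print(f"input matrix shape: {N_MELS} x {frames}")

# Equal steps in Hz shrink on the mel axis: the mapping is nonlinear.
low_step = hz_to_mel(2000.0) - hz_to_mel(1000.0)
high_step = hz_to_mel(8000.0) - hz_to_mel(7000.0)
print(f"1 kHz step near 1 kHz: {low_step:.1f} mel, near 7 kHz: {high_step:.1f} mel")
```

Under these assumptions a three-second chunk yields a 128 × 130 matrix. Without centre padding the frame count would be smaller.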
Transcript to MIDI and MIDI-Related Feature Representation. The MIDI files are
transcribed from the original audio using the high-resolution piano transcription model
developed by [12]. After acquiring the MIDI files, we manually obtain the MIDI-like [15]
representation using TorchAudio. The MIDI file is then converted into a vector,
denoted as $V^{1 \times len}$, where the length of the vector depends on the MIDI file.
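To make the 1 × len vector concrete, the sketch below encodes a list of notes as a sequence of event indices. It assumes the event vocabulary of the performance encoding in [15] (128 note-on, 128 note-off, 100 time-shift, and 32 velocity events). Whether the authors used exactly this layout is not stated, and the index offsets and function name are illustrative.

```python
from typing import List, Tuple

# Assumed vocabulary layout: 128 NOTE_ON + 128 NOTE_OFF
# + 100 TIME_SHIFT (10 ms steps) + 32 VELOCITY bins = 388 events.
NOTE_ON_OFFSET = 0
NOTE_OFF_OFFSET = 128
TIME_SHIFT_OFFSET = 256
VELOCITY_OFFSET = 356
VOCAB_SIZE = 388
MAX_SHIFT_STEPS = 100   # one TIME_SHIFT token covers at most 1 s
STEP_SECONDS = 0.01

Note = Tuple[int, int, float, float]  # (pitch, velocity, onset_s, offset_s)


def encode_midi_like(notes: List[Note]) -> List[int]:
    """Turn a list of notes into a 1 x len sequence of event indices."""
    events = []
    for pitch, velocity, onset, offset in notes:
        events.append((round(onset / STEP_SECONDS), 1, pitch, velocity))
        events.append((round(offset / STEP_SECONDS), 0, pitch, 0))
    # Sort by time step; offsets (flag 0) come before onsets (flag 1) on ties.
    events.sort()

    tokens: List[int] = []
    current_step = 0
    current_velocity_bin = -1
    for step, is_onset, pitch, velocity in events:
        gap = step - current_step
        while gap > 0:  # long gaps need several TIME_SHIFT tokens
            shift = min(gap, MAX_SHIFT_STEPS)
            tokens.append(TIME_SHIFT_OFFSET + shift - 1)
            gap -= shift
        current_step = step
        if is_onset:
            velocity_bin = velocity // 4  # 128 MIDI velocities -> 32 bins
            if velocity_bin != current_velocity_bin:
                tokens.append(VELOCITY_OFFSET + velocity_bin)
                current_velocity_bin = velocity_bin
            tokens.append(NOTE_ON_OFFSET + pitch)
        else:
            tokens.append(NOTE_OFF_OFFSET + pitch)
    return tokens


# One middle-C note (pitch 60, velocity 64) held for half a second.
print(encode_midi_like([(60, 64, 0.0, 0.5)]))
```

The sequence length grows with the number of notes and the gaps between them, which is why len differs from file to file.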
After that, we train feature extraction networks to extract continuous features abun-
dant in emotional information. Specifically, we leverage these representations on
modality-specific networks enabling independent extraction of music emotion from dif-
ferent inputs. In detail, we introduce two DNNs as continuous feature extractors to
extract emotion-abundant features for audio domain representation and MIDI domain
representation, respectively.
Short-Chunk CNN-Based Feature Extraction for Acoustic Music. The proposed
feature extraction method utilizes the Short-chunk CNN [21] architecture, as shown in
Fig. 3, which has been demonstrated in earlier experiments to be strongly robust
against noise and other changes [21]. It comprises 7-layer CNNs
with residual connections, and the output is a fixed 128-dimensional vector obtained
through max pooling, followed by two fully connected layers with ReLU activation for
classification. Notably, using small 3 × 3 filters allows a deeper network with a
larger receptive field and the extraction of more detailed music information, contributing to
the effectiveness and efficiency of the feature extraction process [20]. Consequently, we
adopt the Short-chunk CNN as our backbone model for audio domain feature extraction.
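The benefit of stacking small filters can be checked with the standard receptive-field recurrence. The sketch below is a generic calculation, not a description of the exact Short-chunk CNN configuration: the layer lists are hypothetical and ignore the pooling layers of the real network.

```python
from typing import List, Tuple


def receptive_field(layers: List[Tuple[int, int]]) -> int:
    """Receptive field along one axis of a stack of (kernel_size, stride) layers."""
    field, jump = 1, 1  # jump: spacing in input samples between adjacent outputs
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


# Hypothetical stacks of stride-1 convolutions (pooling omitted).
one_large = receptive_field([(7, 1)])        # a single 7-wide filter
three_small = receptive_field([(3, 1)] * 3)  # three stacked 3-wide filters
seven_small = receptive_field([(3, 1)] * 7)  # seven stacked 3-wide filters
print(one_large, three_small, seven_small)
```

Three stacked 3 × 3 layers see the same 7 × 7 region as one 7 × 7 layer while using 27 rather than 49 weights per input-output channel pair, and each extra layer adds a nonlinearity.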
To get continuous implicit semantic features, we feed the Mel-spectrogram matrices,
denoted as $Q^{128 \times len}$, into the backbone model. The length of the matrix
is dependent on the audio file's sample rate and the window size of the FFT from
preprocessing. After that, we will get a 512-dimension vector $V^{512 \times 1}$ for the following
MHSA-based classifier.
Fig. 3. Short-chunk CNN network for feature extraction (c/s/p stands for channel/stride/padding,
respectively)
BiLSTM-Based Feature Extraction for Symbolic Music. We also design a symbolic-
domain method similar to the audio-domain one. Considering the relationship between
emotion and performance time, we use the BiLSTM shown in Fig. 4 to model the feature
matrix W from MIDI-like representations. The output dimension of W is l × 512, where l is
the length of a MIDI-like file.
Last, these features are learned by an MHSA-based classifier, each with a feed-forward
layer and then fully connected layers for music recognition.
In the paper, we incorporate a multi-head self-attention layer, along with a feed-
forward block and two dense layers, to classify the music data. Furthermore, the design
enables adjustment of the number of heads used in the attention block for different
dimensions of music emotion.
By capturing multiple aspects of emotional expression by attending to different
dimensions of musical emotion in each head, our proposal enhances the accuracy and
robustness of MER models. In more detail, by utilizing multiple heads, abundant multi-
dimensional emotional information can be captured between different parts of the music
feature, identifying multiple aspects of emotional expression. Additionally, the number
of heads in multi-head self-attention allows our classifier to attend to multiple aspects
of emotional expression in parallel, enabling more efficient and effective modeling of
music emotions.
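As an illustration of how several heads attend to different feature subspaces in parallel, here is a minimal pure-Python sketch of multi-head self-attention in the sense of [19]. It is a simplification and not the authors' classifier: the learned query, key, value, and output projections are replaced by identity maps, and the feed-forward block and dense layers are omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention for a single head."""
    scale = math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out


def multi_head_self_attention(x: Matrix, num_heads: int) -> Matrix:
    """Split the feature axis into heads, attend per head, then concatenate.

    Identity projections stand in for the learned ones (illustration only).
    """
    dim = len(x[0])
    if dim % num_heads != 0:
        raise ValueError("feature size must be divisible by num_heads")
    head_dim = dim // num_heads
    heads = []
    for h in range(num_heads):
        part = [row[h * head_dim:(h + 1) * head_dim] for row in x]
        heads.append(attention(part, part, part))
    # Concatenate the head outputs back along the feature axis.
    return [sum((head[t] for head in heads), []) for t in range(len(x))]


# Hypothetical sequence of three time steps with four features each.
sequence = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.1, 0.9], [1.0, 1.0, 0.3, 0.3]]
output = multi_head_self_attention(sequence, num_heads=2)
print(len(output), len(output[0]))
```

Each head computes its own attention weights over the sequence, so two heads can emphasize different time steps for different feature groups. Changing `num_heads` changes only how the feature axis is partitioned, not the output shape.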
We feed the deep-semantic features extracted in the previous steps into the MHSA-
based classifier to predict their emotion class. Specifically, the methods equipped
with our MHSA-based classifier are called SCMA and BiLMA for the Short-chunk CNN-
based and BiLSTM-based feature extractors, respectively, as illustrated in Fig. 5.
Because MIDI files are not sliced into fixed lengths, the dimensions of the matrix
W generated by BiLSTM are different. As a result, the dimensions of each $W_z$ matrix
generated from W by the multi-head self-attention layer are different. The dimensions of
$W_z$ and W both equal batchsize × n × m, which means m will differ from batch to batch. To
ensure consistency across all dimensions of the features, we apply Eq. 1 to each batch.
Thus $W_z$ and W are combined into M, whose dimension is n × n.
$$M = W W_z^{T} \tag{1}$$
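The shape argument behind Eq. 1 can be verified with a small sketch. It reads Eq. 1 as the product of W with the transpose of W_z, which is an interpretation chosen because it is the arrangement that yields an n × n result from two n × m inputs. The matrices and function name below are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def product_with_transpose(w: Matrix, w_z: Matrix) -> Matrix:
    """Return M = W @ W_z^T for two n x m matrices; M is n x n for any m."""
    if len(w) != len(w_z) or len(w[0]) != len(w_z[0]):
        raise ValueError("W and W_z must have the same shape")
    return [[sum(a * b for a, b in zip(w_row, wz_row)) for wz_row in w_z]
            for w_row in w]


n = 3
for m in (2, 5):  # two hypothetical values of the varying dimension m
    w = [[float(i + j) for j in range(m)] for i in range(n)]
    w_z = [[float(i * j) for j in range(m)] for i in range(n)]
    product = product_with_transpose(w, w_z)
    print(m, len(product), len(product[0]))
```

Because the varying dimension m is summed out, every batch produces a matrix of the same size, so the following feed-forward and fully connected layers can use fixed input dimensions.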
Finally, after modeling feed-forward and fully connected layers, the emotion
classification is performed, as depicted in Fig. 5.
4.1 Dataset
The EMOPIA dataset [11] is a piano music dataset for symbolic and acoustic music
emotion recognition. The emotion labels are assigned using the 4Q model [18], which
consists of Q1, Q2, Q3, and Q4. Note that the audio files are gathered from the Internet
using the provided metadata. Unfortunately, we could only access 845 audio clips out of the total
1087 clips for our research due to copyright limitations. While this may have impacted
the baseline accuracy of our study, we take great care to reproduce the baseline work
for music emotion recognition using the remaining 845 clips. However, the repository
includes complete MIDI format files.
The MIDI files in the EMOPIA dataset are transcribed from the original audio using
the high-resolution piano transcription model developed by [12]. The dataset creator
manually checked the transcription results for a random set of clips and found the accu-
racy in note pitch, velocity, and duration satisfactory. After that, songs with engineered
ambient effects were removed from the collection, as the resulting transcription could
be fragmented and undesirable.
More information about the EMOPIA dataset is summarized in Table 1.
Table 1. Summary of the EMOPIA dataset.
INFO                           EMOPIA
Number of MIDI                 1,087
Number of mp3                  845 clips used (of 1,087)
Emotional Label                4Q taxonomy
Train-validation-test splits   7:2:1
Source                         YouTube
Piano Music Type               pop and multicultural
Single Duration                About 30 s
set during the training procedure is saved and evaluated in the testing set. All experiments
are repeated with the same random seed and different numbers of epochs.
Evaluation Metrics. The performance of our model is measured in terms of F1-score and
AUROC (Area Under the Receiver Operating Characteristic curve) [2].
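For readers less familiar with AUROC, the sketch below computes it for a binary problem as the probability that a positive example is scored above a negative one. The paper does not say how the four-class AUROC is aggregated, so treating one quadrant against the rest is an assumption, and the scores are made up.

```python
from typing import List


def auroc(scores: List[float], labels: List[int]) -> float:
    """Binary AUROC: chance that a positive outranks a negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for one quadrant treated one-vs-rest.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
print(round(auroc(scores, labels), 3))
```

A value of 1.0 means every positive clip is ranked above every negative clip, and 0.5 corresponds to random ranking. The pairwise loop is quadratic and is meant only for illustration.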
The F1-score, defined by Eq. 4, is a performance metric that summarizes the balance
between a classifier's precision and recall in a single value. In Eqs. 2, 3,
and 5, TP stands for True Positive, the number of positive instances
that are correctly predicted as positive by the classification model. Similarly, FN stands
for False Negative, FP stands for False Positive, and TN stands for True Negative. These
terms are essential for understanding the calculation of various classification metrics and
evaluating the performance of classification models.
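$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$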
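A short worked example may help relate the three metrics. The sketch below uses the standard definitions, in which precision divides by TP + FP and recall by TP + FN. The counts are hypothetical and are not results from the paper.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that the model recovers."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for one emotion quadrant.
tp, fp, fn = 8, 2, 4
print(f"precision={precision(tp, fp):.3f}",
      f"recall={recall(tp, fn):.3f}",
      f"f1={f1_score(tp, fp, fn):.3f}")
```

Because the harmonic mean is pulled toward the smaller value, the F1-score here lies between recall and precision but closer to recall.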
Table 3 summarizes the performance of our MER model for the audio branch model
compared to the baseline [10] without an MHSA-based classifier. Our SCMA model
outperforms the baseline model with an accuracy of 0.714, an F1-score of 0.712, and an
AUROC score of 0.933. In contrast, the baseline model achieved an accuracy of 0.670,
an F1-score of 0.634, and an AUROC score of 0.902. This suggests that our SCMA model
can effectively capture relevant information from features for emotion recognition tasks.
Moreover, the MHSA-based classifier enhances its ability to focus on essential emotion
information from different perspectives.
Table 3. Comparison between our audio domain model and Short-chunk CNN (baseline) [21].
Table 4 shows that the BiLMA model for the symbolic domain outperforms all current
models on the EMOPIA dataset in accuracy, improving by 6.1% over the baseline
model. It is worth noting that the F1-score of MT-MIDIBERT (2022) [17] marginally
outperforms that of our method. However, it should be acknowledged that pre-trained models
like BERT and GPT require significantly larger training datasets and entail higher time
overheads than our lightweight model, which can be trained in a few minutes on a
normal GPU without pre-training. While the small performance gains of pre-trained
models come at a high cost, our model offers a more efficient and practical solution for
certain applications. Further, Fig. 6 shows the detailed classification results of typical
models.
In this section, we conduct experiments on the EMOPIA dataset to study the impact of
the network architecture, MIDI-related features, number of heads, and training epochs.
Verify the Superiority of our Feature Extraction Networks. As Table 5 shows, our
model with a MIDI-like feature achieves the best accuracy against REMI [10]. Regarding
accuracy, our model for the symbolic branch outperforms the baseline model on both
MIDI-like and REMI features. However, the improvement is more significant for MIDI-
like features, where the BiLMA model outperforms the baseline model by 6.1%. For the
REMI feature, the BiLMA model shows a gain of 5.7% over the baseline model.
As Table 6 shows, when our audio branch network's architecture changes, accuracy
drops to some extent. Besides, the SCMABiL model shows only marginal performance
improvements compared to the baseline model, indicating that the BiLSTM layer may
add complexity to the model without providing significant benefits.
Verify the Superior Training Procedure of our Networks. Table 7 shows that both the
number of heads and the number of training epochs of the SCMA model influence the result
of emotion recognition. The results indicate that increasing the number of heads from 2 to 4
improves the model's performance for all three metrics. When the head number equals four and
the epoch equals 118, it reaches the best accuracy. Moreover, even when the head number
equals 8, the lowest accuracy effectively surpasses the baseline shown in Table 3.
Fig. 6. The result of baseline models for audio and MIDI is shown in (a) and (c), respectively.
The performance of models incorporated with a multi-head self-attention classifier is shown in
(b) and (d). The audio model effectively recognizes Q2 but is less effective for Q3, while the
MIDI model exhibits the opposite trend.
Table 7. Influence of different numbers of heads and training epochs on the SCMA model.
Table 8 indicates that increasing the number of training epochs from 60 to 100 signif-
icantly improves accuracy and F1-score for the MIDI-like feature. However, increasing
the number of epochs further to 250 does not improve the model’s performance.
Table 8. Influence of different numbers of training epochs on the BiLMA model.
4.5 Discussion
Symbolic music is considered a better form for emotion recognition [17] since it
intrinsically contains information such as pitch, duration, speed, and severity, which can
be used to analyze emotion [9]. Besides, pre-trained models perform well on text-like data,
and MIDI is one such form of data. However, our results show that acoustic music's accuracy
is higher than symbolic music's in emotion recognition. Compared with many current
MIDI branch models in Table 4, the accuracy of our audio model in Table 3 still surpasses
that of symbolic music in all metrics, which suggests that besides pitch, duration, speed,
and severity, there may be other information in audio that determines the emotion of the music.
5 Conclusion
In this paper, we propose an efficient approach for MER. Firstly, a method to extract
continuous features is presented, intending to excavate emotion-abundant features. After
that, we design a classifier with the MHSA-based model to excavate multi-dimensional
information for music emotion recognition. Experimental results demonstrate our pro-
posal’s effectiveness, which achieves state-of-the-art performance on the EMOPIA
dataset, setting a new benchmark in the field. Based on our approach, future research
directions could include multimodal methods incorporating audio, MIDI, and even
videos to understand music emotion from various perspectives. Such research may yield
valuable insights and contribute to developing more sophisticated MER systems.
Acknowledgments. This paper is supported by the Humanities and Social Sciences Founda-
tion of the Ministry of Education (17YJCZH260), the Sichuan Science and Technology Pro-
gram (2020YFS0057), and the National Innovation Training Program for Undergraduate Students
(202210619023).
References
1. Cañón, J.S.G., et al.: Music emotion recognition: toward new, robust standards in personalized
and context-sensitive applications. IEEE Sig. Process. Mag. 38, 106–114 (2021)
2. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning
algorithms. Pattern Recogn. 30(7), 1145–1159 (1997)
3. Chaki, S., Doshi, P., Patnaik, P., Bhattacharya, S.: Attentive RNNs for continuous-time emo-
tion prediction in music clips. In: Chhaya, N., Jaidka, K., Healey, J., Ungar, L., Sinha,
A.R. (eds.) Proceedings of the 3rd Workshop on Affective Content Analysis (AffCon 2020)
Co-Located with Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020),
New York, USA, 7 February 2020. CEUR Workshop Proceedings, vol. 2614, pp. 36–46.
CEUR-WS.org (2020)
4. Chang, W.H., Li, J.L., Lin, Y.S., Lee, C.C.: A genre-affect relationship network with task-
specific uncertainty weighting for recognizing induced emotion in music. In: 2018 IEEE
International Conference on Multimedia and Expo (ICME), pp. 1–6 (2018)
5. Chen, S., Jin, Q., Zhao, J., Wang, S.: Multimodal multi-task learning for dimensional and
continuous emotion recognition. In: Proceedings of the 7th Annual Workshop on Audio/Visual
Emotion Challenge, AVEC 2017, pp. 19–26. Association for Computing Machinery, New
York (2017)
6. Chou, Y.H., Chen, I.C., Chang, C.J., Ching, J., Yang, Y.H.: MidiBERT-Piano: large-scale
pre-training for symbolic music understanding. ArXiv abs/2107.05223 (2021)
7. Fan, Y., Lu, X., Li, D., Liu, Y.: Video-based emotion recognition using CNN-RNN and C3D
hybrid networks. In: Proceedings of the 18th ACM International Conference on Multimodal
Interaction, ICMI 2016, pp. 445–450. Association for Computing Machinery, New York
(2016)
8. Ferreira, L.N., Lelis, L.H.S., Whitehead, J.: Computer-generated music for tabletop role-
playing games. In: Lelis, L., Thue, D. (eds.) Proceedings of the Sixteenth AAAI Conference
on Artificial Intelligence and Interactive Digital Entertainment, AIIDE 2020, Virtual, 19–23
October 2020, pp. 59–65. AAAI Press (2020)
9. Han, D., Kong, Y., Han, J., Wang, G.: A survey of music emotion recognition. Front. Comput.
Sci. 16(6), 166335 (2022)
10. Huang, Y.S., Yang, Y.H.: Pop music transformer: beat-based modeling and generation of
expressive pop piano compositions. In: Proceedings of the 28th ACM International Confer-
ence on Multimedia, MM 2020, pp. 1180–1188. Association for Computing Machinery, New
York (2020)
11. Hung, H., Ching, J., Doh, S., Kim, N., Nam, J., Yang, Y.: EMOPIA: a multi-modal pop piano
dataset for emotion recognition and emotion-based music generation. In: Lee, J.H., et al. (eds.)
Proceedings of the 22nd International Society for Music Information Retrieval Conference,
ISMIR 2021, Online, 7–12 November 2021, pp. 318–325 (2021)
12. Kong, Q., Li, B., Song, X., Wan, Y., Wang, Y.: High-resolution piano transcription with
pedals by regressing onset and offset times. IEEE/ACM Trans. Audio Speech Lang. Process.
29, 3707–3717 (2021)
13. Lin, Y., Chen, X., Yang, D.: Exploration of music emotion recognition based on MIDI. In: de
Souza Britto Jr., A., Gouyon, F., Dixon, S. (eds.) Proceedings of the 14th International Society
for Music Information Retrieval Conference, ISMIR 2013, Curitiba, Brazil, 4–8 November
2013, pp. 221–226 (2013)
14. Lin, Z., et al.: A structured self-attentive sentence embedding. In: 5th International Conference
on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017, Conference
Track Proceedings. OpenReview.net (2017)
15. Oore, S., Simon, I., Dieleman, S., Eck, D., Simonyan, K.: This time with feeling: learning
expressive musical performance. Neural Comput. Appl. 32(4), 955–967 (2020)
16. Panda, R.E.S., Malheiro, R., Rocha, B., Oliveira, A.P., Paiva, R.P.: Multi-modal music emotion
recognition: a new dataset, methodology and comparative analysis. In: 10th International
Symposium on Computer Music Multidisciplinary Research (CMMR 2013), pp. 570–582
(2013)
17. Qiu, J., Chen, C., Zhang, T.: Novel Multi-Task Learning Method for Symbolic Music Emotion
Recognition. arXiv preprint arXiv:2201.05782 (2022)
18. Russell, J.A.: A circumplex model of affect. J. Pers. Soc. Psychol. 39, 1161–1178 (1980)
19. Vaswani, A., et al.: Attention is all you need. In: Proceedings of the 31st International Con-
ference on Neural Information Processing Systems, NIPS 2017, pp. 6000–6010. Curran
Associates Inc., Red Hook (2017)
20. Won, M., Choi, K., Serra, X.: Semi-supervised music tagging transformer. In: Lee, J.H.,
et al. (eds.) Proceedings of the 22nd International Society for Music Information Retrieval
Conference, ISMIR 2021, Online, 7–12 November 2021, pp. 769–776 (2021)
21. Won, M., Ferraro, A., Bogdanov, D., Serra, X.: Evaluation of CNN-based automatic music
tagging models. In: Proceedings of 17th Sound and Music Computing (2020)