
LEARNING SIMILARITY METRICS FOR MELODY RETRIEVAL

Folgert Karsdorp¹, Peter van Kranenburg¹,², Enrique Manjavacas³

¹ KNAW Meertens Instituut, The Netherlands
² Utrecht University, The Netherlands
³ University of Antwerp, Belgium
folgert.karsdorp@meertens.knaw.nl

ABSTRACT

Similarity measures are indispensable in music information retrieval. In recent years, various proposals have been made for measuring melodic similarity in symbolically encoded scores. Many of these approaches are ultimately based on a dynamic programming approach such as sequence alignment or edit distance, which has various drawbacks. First, the similarity scores are not necessarily metrics and are not directly comparable. Second, the algorithms are mostly first-order and of quadratic time-complexity, and finally, the features and weights need to be defined precisely. We propose an alternative approach which employs deep neural networks for end-to-end similarity metric learning. We contrast and compare different recurrent neural architectures (LSTM and GRU) for representing symbolic melodies as continuous vectors, and demonstrate how duplet and triplet loss functions can be employed to learn compact distributional representations of symbolic music in an induced melody space. This approach is contrasted with an alignment-based approach. We present results for the Meertens Tune Collections, which consist of a large number of vocal and instrumental monophonic pieces from Dutch musical sources, spanning five centuries, and demonstrate the robustness of the learned similarity metrics.

1. INTRODUCTION

The question of how melodic similarity can be computationally modeled is of crucial importance for various Music Information Retrieval (MIR) tasks [35]. One classic MIR scenario is a user posing a sung or hummed query to a retrieval system in order to retrieve resembling pieces of music from a music collection [6, 24]. This query-by-humming scenario requires melodic matching methods that are robust against different kinds of melodic variation arising from imprecise memory or limited singing skills. Melody matching is also an important aspect in cover-song detection, where the predominant melody contains information for song identification [29]. Musicologists benefit from melodic similarity measures for exploring and mapping folk songs [19, 30], or other collections with monophonic musical material, such as themes from classical compositions [18], or score incipits [34]. Finally, music similarity detection plays an important role in cases of copyright violation [31], where objective similarity measures can support the court in making decisions.

In this paper, we present a Neural Network approach for melodic similarity learning. The neural encoders learn complex mappings from input sequences into distributed melody representations. The primary aim of our system is to be employable for music retrieval. In addition to accurately modeling melodic similarity, desirable properties of a retrieval system are speed and indexability. Neural Networks accommodate both very well. Generally, once trained, a neural network can compute results very fast by making use of GPUs. Moreover, indexability is served by the application of similarity metrics on top of the learned encodings [3].

Various other approaches to melodic similarity have been taken [35], including sequence alignment and other dynamic programming approaches, such as edit distance and dynamic time warping, and, recently, Recurrent Neural Network representations [4, 9, 11, 16, 27, 30, 33]. In this study, we contrast our approach with a sequence alignment method that has been successfully applied in the context of folk melodies. Alignment-based methods suffer from at least three drawbacks. First, they are typically first-order, taking into account only adjacent items in sequences. Second, they have quadratic time-complexity. Finally, an alignment score is not a proper metric. Our neural network approach overcomes these disadvantages, but does so at the cost of being a supervised learning algorithm.

2. DATA AND FEATURES

2.1 Data sets

The Meertens Tune Collections (MTC)¹ contain a series of data sets with melodic material from Dutch sources (mainly manuscripts, printed sources, and audio recordings), spanning five centuries of music history [20, 22]. These data sets are subsets of the Dutch Song Database, maintained by the KNAW Meertens Institute [21]. The lat-

© F. Karsdorp, P. van Kranenburg, E. Manjavacas. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: F. Karsdorp, P. van Kranenburg, E. Manjavacas. "Learning Similarity Metrics for Melody Retrieval", 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 2019.

¹ http://www.liederenbank.nl/mtc/

Proceedings of the 20th ISMIR Conference, Delft, Netherlands, November 4-8, 2019

Figure 1. Complementary Cumulative Distribution Function of the tune family sizes in the subset of MTC-FS-INST 2.0 which we use in our experiments. [Plot omitted: CCDF against class size, both axes logarithmic.]

           # Mel   # TF   # TF in Train   µ|TF|   σ|TF|
Train      5,975   1,572        –          3.80    5.06
Dev        1,492     495       255         3.01    1.64
Test       1,869     611       287         3.06    1.55

Table 1. Composition of the subsets of MTC-FS-INST 2.0 used for training, development and testing. The table provides the number of melodies (Mel) and tune families (TF) in each set, the number of tune families that are shared with the training set, and the mean and standard deviation of tune family sizes.

Figure 2. Number of melodies per year in MTC-FS-INST 2.0. The plot displays frequencies for instrumental melodies (INST) and vocal melodies (FS). [Plot omitted: annual counts per melody type, 1600–2000.]

est release, MTC-FS-INST 2.0, contains 18,109 digitized melodies with rich metadata. Many of these melodies occur in more than one source. Due to oral and semi-oral transmission, these different occurrences typically show melodic variation. As an example, Figure 3 depicts three variant melodies, illustrating the kind and extent of variation. To denote such a group of variant melodies, we adopt the concept of tune family from folk song research [1]. In a long-term effort, the collection specialists of the Meertens Institute aim to identify each melody in terms of tune family membership.

In this paper, we use MTC-FS-INST 2.0, which reflects the diversity of the contents of the Dutch Song Database. One main distinction in the data set is between vocal and instrumental music, as illustrated in Figure 2. Generally, the instrumental part of the data set dates from the 17th and 18th centuries. It contains melodies that were played in bars and brothels as well as theaters and upper-class private settings. The vocal part of the data set mainly consists of songs from the 19th and 20th centuries. As a whole, the data set provides a rich variety in melodic styles, which renders it a perfect source for training general purpose melodic similarity measures.

To obtain training, development, and test sets, we filter and split the data set. First, we exclude all 5,765 unlabeled melodies and all 3,008 singleton tune families. This leaves a selection of 9,336 melodies in 2,094 tune families. The complementary cumulative distribution function of the class sizes is presented in Figure 1. The distribution of class sizes is heavy-tailed.

An important criterion to measure the level of success achieved by the metric learning approach is its capability to cluster together melodies belonging to tune families unseen during training. In order to make this possible, we perform a controlled test set split, ensuring that all instances from a proportion of tune families do not appear in the training data. The actual proportions of seen and unseen families are shown in Table 1, together with further data set size statistics.²

2.2 Features

Melodies are represented as sequences of notes, and notes as sets of feature-values. Since the MTC provide a rich melody encoding, including key, meter, and phrase boundaries, we can assemble a diverse feature-set in which various musical parameters are represented: pitch, metric structure, rhythm, tonality, and phrase structure. See the supplementary material for an exact list of features.

3. METHODOLOGY

Our approach is based on two components. First, we deploy distributional melody encoders implemented with Neural Networks. Secondly, we train the encoders with Stochastic Gradient Descent to minimize a Contrastive Loss that we describe below.

3.1 Distributional Encoder

An input melody from the dataset x_i ∈ X can be represented by a sequence x_i = [x_(i,1), ..., x_(i,k)] of length k = |x_i|, where each x_(i,t) is a bundle of m features explained in Section 2.2. For simplicity, we refer to the j-th feature of sequence step t as x_t^j, thus dropping data set indices. Our goal is to compute an encoding h = f(x) as a function of the input sequence x, parameterized by a Neural Network f(x).

The encoding process can be described as follows. We first process each time step in the input sequence independently by concatenating all features into a single vector e_t = [e_t^1; ...; e_t^m]. Categorical features are first encoded into a one-hot vector and projected into their own embedding space with model parameters W_j ∈ R^(J×E), where J

² Supplementary material, data sets and code to replicate the experiments are available from https://github.com/fbkarsdorp/melodic-similarity.


 
Figure 3. Three members of tune family Daar was laatstmaal een ruiter 2, showing various kinds of melodic variation. [Score omitted: melodies NLB073337_02, NLB073483_01, and NLB073990_01.]
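As a minimal illustration of the per-note feature embedding just described (a one-hot categorical feature projected through a learned matrix W_j ∈ R^(J×E), concatenated with z-scored continuous features), consider the following NumPy sketch. The feature names and sizes here are invented for illustration only; the paper's exact feature list is given in its supplementary material.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical per-note features: two categorical, one continuous.
VOCAB_SIZES = {"pitch_class": 12, "metric_weight": 4}   # J per feature
EMB_DIM = 8                                             # E

# One learned projection W_j (J x E) per categorical feature; multiplying
# a one-hot vector by W_j reduces to a row lookup.
W = {name: rng.normal(scale=0.1, size=(J, EMB_DIM))
     for name, J in VOCAB_SIZES.items()}

def embed_note(pitch_class: int, metric_weight: int, duration_z: float):
    """Concatenate all feature embeddings into one input vector e_t."""
    return np.concatenate([
        W["pitch_class"][pitch_class],
        W["metric_weight"][metric_weight],
        np.array([duration_z]),        # continuous feature, already z-scored
    ])

e_t = embed_note(pitch_class=7, metric_weight=2, duration_z=0.3)
assert e_t.shape == (2 * EMB_DIM + 1,)   # one 17-dimensional vector per note
```

A melody then becomes a sequence of such vectors, which is fed to the recurrent stack.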

is the total number of possible values of the j-th categorical feature and E is the dimensionality of the embedding space. Continuous features are normalized to have a mean of 0 and standard deviation of 1. For all feature types, the sequences are padded at the beginning and the end, using special symbols in the case of categorical features and the feature mean after re-scaling for continuous features. The resulting sequence of input embeddings is fed to a stack of recurrent layers,³ with the t-th hidden activation at layer l given by h_(t,l) = RNN_l(h_(t,l−1), h_(t−1,l)).

We also experiment with bidirectional RNNs, which extend each RNN layer with an additional RNN run backwards. In the case of the bidirectional RNN, the final melody embedding is given by the concatenation of the last activations of the forward and backward RNNs at the last layer: h = [→h_(k,|L|); ←h_(1,|L|)]. In the case of the unidirectional RNN, the embedding is given by a feature-wise max-pooling operation over the sequence of activations at the last layer, with the p-th output feature defined by h_p = max([h_(1,|L|)]_p, ..., [h_(k,|L|)]_p). Instead of traditional RNN cells [7], we use LSTM [14] and GRU [5] cells, which have been shown to offer stronger performance and better training behavior.

3.2 Contrastive Loss

The goal of our approach is to learn a distributional encoder such that melodic sequences of the same class are embedded into neighboring regions, far from melodic sequences belonging to different families. To this end, we train the encoder using a contrastive loss [12].

3.2.1 Duplet Loss

Let the encodings of two input sequences with tune family labels y_i and y_j be denoted by x_i and x_j. The goal we want to achieve is that the similarity between x_i and x_j is high when y_i and y_j are equal, and low otherwise. More formally, we seek to achieve the following inequality:

    D(x_i, x_j) < D(x_i, x_k) + α    (1)

∀(x_i, x_j, x_k) ∈ X | y_i = y_j ∧ y_i ≠ y_k, where D is a distance function and α is a pre-specified margin. In order to achieve this goal, we optimize a contrastive loss function defined over input pairs, which is therefore known as the duplet loss. The contrastive loss function is decomposed into a positive term L⁺:

    L⁺(x_i, x_j) = β D(x_i, x_j)²    (2)

and a negative term L⁻:

    L⁻(x_i, x_j) = max(0, α − D(x_i, x_j))²    (3)

where β is a parameter used to weight the contribution of the negative term.⁴ The two terms can be combined into a single loss function with the help of a variable Y_ij that takes the value 1 when y_i = y_j and 0 when y_i ≠ y_j:

    L_D(x_i, x_j) = Y_ij L⁺(x_i, x_j) + (1 − Y_ij) L⁻(x_i, x_j)    (4)

For the current study, we restrict ourselves to the cosine distance as defined by Eq. 5:

    D(x_i, x_j) = 1 − (f(x_i) · f(x_j)) / (‖f(x_i)‖ ‖f(x_j)‖)    (5)

Naturally, other distance functions are equally applicable, but the two-sided boundedness of the cosine distance (i.e., distances fall in [0, 2]) allows more efficient optimization of the parameters α and β. Moreover, the loss specified above employs a soft margin. By contrast, [28] propose the use of a hard margin, effectively reducing the loss to zero if it falls below some value. With the hard margin, the negative term in Eq. 3 becomes:

    L⁻(x_i, x_j) = (1 − D(x_i, x_j))²   if D(x_i, x_j) < α
                 = 0                    otherwise    (6)

Figure 4 visualizes the two loss variants. In the experiments below, we compare both versions of the loss.

Figure 4. Visualization of the two loss variants with a hard margin and a soft margin. [Plot omitted: positive loss, soft negative loss, and hard negative loss as functions of the distance D ∈ [0, 2].]

3.2.2 Triplet Loss

The triplet loss [13, 32] differs from the duplet loss in that it considers input example triplets x_i, x_j, x_k consisting of a positive example x_i, a negative example x_j such that

³ During preparatory work, we also experimented with convolutional stacks but found no improvements over the recurrent counterpart, which was, therefore, singled out for the purpose of the present study.
⁴ The scaling parameter β and margin α are optimized on a development data set, which we will discuss below.
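The duplet loss of Section 3.2.1 can be condensed into a small NumPy sketch operating on a precomputed cosine distance (Eq. 5). The α and β values below are illustrative placeholders, not the tuned settings of the paper.

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Eq. 5: one minus cosine similarity; values lie in [0, 2]."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def duplet_loss(d: float, same_family: bool, alpha: float = 0.5,
                beta: float = 1.0, hard_margin: bool = False) -> float:
    """Contrastive loss on a single pair at distance d (Eqs. 2-6)."""
    if same_family:                      # positive term, Eq. 2
        return beta * d ** 2
    if hard_margin:                      # hard-margin negative term, Eq. 6
        return (1.0 - d) ** 2 if d < alpha else 0.0
    return max(0.0, alpha - d) ** 2     # soft-margin negative term, Eq. 3

# Identical encodings from the same family incur no loss ...
assert duplet_loss(0.0, same_family=True) == 0.0
# ... while a close pair from different families is penalized.
assert duplet_loss(0.1, same_family=False) > 0.0
```

Note how the negative term only contributes when a cross-family pair lies inside the margin, which is what makes the margin α worth tuning on development data.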


y_j ≠ y_i, and an anchor x_k such that y_k = y_i. The triplet loss shares the goal of the duplet loss from Eq. 1, but employs anchor positive examples to ensure that the distance between any two instances of the same family is less than the distance to an instance of a different family by at least a pre-specified margin α. More formally, the triplet loss is defined by Eq. 7:

    L_T(x_i, x_j, x_k) = max(0, D(x_i, x_k) − D(x_j, x_k) + α)    (7)

The triplet loss, therefore, presents a more relaxed form than the duplet loss, allowing instances of the same family to occupy larger regions. As opposed to the composite form of the duplet loss, the triplet loss is simple. However, the triplet loss is more heavily dependent on the quality of the sampled negative examples and anchors, which might lead to poorer training dynamics.

3.3 Online Duplet and Triplet Mining

We train our encoder using mini-batches of duplets or triplets from the data set. The number of possible duplets and triplets grows, respectively, quadratically and cubically with the number of instances in the data set, which renders exhaustive training costly. Feasibility aside, training with all possible duplets or triplets is not desirable, as a large proportion of the resulting duplets and triplets make it either too easy or too difficult to fulfill the objective of Eq. 4 and Eq. 7. Such examples prevent the network from learning, and lead to slower convergence.

As suggested by [32] in the context of face recognition, efficient and fast-converging training can be achieved by online selection of 'hard' duplets or triplets, i.e., the most dissimilar positive examples and the least dissimilar negative examples. We apply this approach by first sampling a mini-batch of k instances per each of n sampled unique tune families. Subsequently, using the current model, we compute the encodings of all n × k instances, and for each instance we sample positive and negative, or anchor and negative, examples from the mini-batch. In the case of the duplet loss, for each instance we select all possible positive examples (i.e., all other instances in the mini-batch from the same family) and an equal number of negative examples from the least dissimilar negatives. In the case of the triplet loss, for each instance x_i we select pairs of anchor x_k and negative example x_j such that the distance between positive and anchor is smaller than the distance between negative and anchor, while the difference between the distances lies inside the margin α:

    0 < D(x_j, x_k) − D(x_i, x_k) < α    (8)

In case no negative example can be found that satisfies this condition, we select a random negative.

3.4 Baseline: Alignment

We compare our results with the performance of a previously proposed alignment method [23]. In this method, the Needleman-Wunsch-Gotoh algorithm [10] is used, which computes a global alignment score for two sequences of symbols. The alignment is constructed by inserting gaps at appropriate locations in the sequences following a dynamic programming approach. The alignment score is based on a similarity function for symbols and a gap scoring scheme. The Gotoh variant of the algorithm applies an affine gap scoring function, in which the continuation of a gap obtains a different score than the opening of a gap, as opposed to the basic variant of the algorithm, in which all gaps obtain the same score. We use the best scoring configuration in [23], which uses pitch, metric weight, and the position of a note in its phrase.

3.5 Evaluation

We formulate the task of tune family identification as a ranking problem: given a query melody q_i and a data set of melodies X, q_i ∉ X, the models should provide a ranked list of the melodies in X. To evaluate how well our models solve this problem, we measure their performance by means of three evaluation measures: (i) Average Precision, (ii) Precision at rank 1, and (iii) the Silhouette Coefficient. Each of these measures addresses a different aspect of the performance quality of the models. First, Average Precision (AP) addresses the question whether, given a query melody, all or most relevant melodies are high up in the ranking:

    AP = (Σ_{k=1}^{N} P(k) × rel(k)) / (number of relevant melodies)    (9)

where k is the position in the ranked list of N retrieved melodies, P(k) represents the precision at position k, and rel(k) = 1 if the melody at position k is relevant, rel(k) = 0 otherwise. By computing the average AP over all query melodies, we obtain the Mean Average Precision (MAP). As a second ranking measure, we focus on the Precision at rank 1 score (P@1), which computes the fraction of queries for which the highest ranked sequence is relevant. Third and finally, we compare these two ranking-based evaluation measures with the Silhouette Coefficient, which is a measure of cluster homogeneity and separation. The Silhouette Coefficient contrasts the mean similarity between a sample and all other samples from the same family with the similarity of that sample with members of other families. By taking the average over all silhouette scores, we obtain a measure of cluster homogeneity ranging from −1 (incorrect clustering) to 1 (perfect clustering).⁵

3.6 Training and Hyper-Parameter Optimization

The networks were trained on the training data sets specified in Table 1. We use the Adam optimizer [17] and stop training after no improvement in MAP score has been made on the development data for ten consecutive epochs. The neural network has a large number of hyper-parameters, making hyper-parameter tuning expensive and time-consuming. Following [2], we perform a randomized hyper-parameter search, in which we train n differ-

⁵ See the Supplementary Materials for more information.
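The Average Precision of Eq. 9 takes only a few lines of Python. As in Eq. 9, the denominator here is the number of relevant items, which assumes all relevant melodies appear among the N retrieved; this is a sketch of the measure, not the paper's evaluation code.

```python
def average_precision(rel):
    """Eq. 9: AP over a ranked binary relevance list (1 = relevant)."""
    n_relevant = sum(rel)
    if n_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for k, r in enumerate(rel, start=1):
        if r:
            hits += 1
            total += hits / k        # P(k), evaluated only where rel(k) = 1
    return total / n_relevant

def mean_average_precision(rankings):
    """MAP: the mean of AP over all query rankings."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Both relevant melodies ranked on top -> perfect AP of 1.0.
assert average_precision([1, 1, 0, 0]) == 1.0
# Relevant melodies at ranks 1 and 3 -> (1/1 + 2/3) / 2 = 5/6.
assert abs(average_precision([1, 0, 1, 0]) - 5 / 6) < 1e-9
```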


             MAP                  P@1    Sil.
             all   seen  unseen
RNN_D       0.72   0.71   0.73   0.78   0.34
RNN_T       0.71   0.70   0.71   0.77   0.29
Alignment   0.69    –      –     0.78   0.23

Table 2. Best evaluation scores for the development data. RNN_D is an RNN with duplet loss, and RNN_T is an RNN with triplet loss.

             MAP                  P@1    Sil.
             all   seen  unseen
RNN_D       0.72   0.70   0.74   0.78   0.33
RNN_T       0.68   0.64   0.71   0.75   0.28
Alignment   0.67    –      –     0.78   0.22

Table 3. Evaluation scores for the test data. RNN_D is an RNN with duplet loss; RNN_T is an RNN with triplet loss.

Figure 5. Two-dimensional UMAP [25] projection of the induced melody space obtained with RNN_D. The left subplot visualizes the positions of instrumental melodies (INST) and vocal melodies (FS). The subplot to the right highlights the positions of a small number of randomly chosen tune families. [Plot omitted.]

ent models with random hyper-parameter settings sampled from parameter-specific, relatively flat distributions.⁶

4. RESULTS

Table 2 presents the results for the development data set, with MAP scores for the best performing models trained with duplet (RNN_D) and triplet loss (RNN_T). The best RNN_D achieves a MAP of 0.72, which is markedly better than the Alignment method (0.69), and slightly better than the best RNN_T (0.71). However, as will be discussed in more detail below, RNN_T models are significantly harder to optimize than RNN_D models (at least for the current data set). The columns 'seen' and 'unseen' represent MAP scores for queries of which the corresponding tune families were either seen or unseen during training. Crucially, the scores are almost equivalent, indicating that the neural networks are capable of actually learning a similarity metric, and not just a clustering or classification procedure in which the systems learn to assign sequences to known class labels and data points. The performance differences between the models are further expressed by the Silhouette Coefficient, which indicates superior performance of the RNN_D model. However, note that for P@1, all systems perform equally well.

The best performing models were employed to encode the melodies in the test set. The test results in Table 3 show a similar picture. Again, RNN_D outperforms the other systems. Note that the performance of RNN_T dropped slightly in comparison to the development results, and that the performance difference with respect to RNN_D has become larger. Overall, the RNNs appear to have adequately learned how to form compact distributional representations of symbolic music in an induced melody space. This capability is further illustrated by the two-dimensional projection of the melody space in Figure 5. The left subplot demonstrates that the learned representations clearly separate vocal (FS) from instrumental melodies (INST). The subplot to the right serves as a validation of the clustering capabilities of the encoder. It highlights the positions of a small number of randomly chosen tune families, the members of which all cluster together.

4.1 Hyper-Parameter Importance

We assess the importance of the different hyper-parameters of the neural networks by modelling their influence on the MAP scores resulting from the randomized parameter search [26]. To this end, we fit the following linear regression model:

    MAP_i ∼ N(µ_i, σ)    (10)
    µ_i = γ + β_l l_i + β_m m_i + β_h h_i + β_d d_i + β_b b_i + β_c c_i + β_lm l_i m_i    (11)

where Eq. 10 specifies the likelihood function with mean µ and standard deviation σ, and Eq. 11 represents the linear model. Here, γ represents the intercept of the linear model, l_i is the loss type of model i (i.e., triplet or duplet loss), m_i is the margin value α, h_i is the dimension of the hidden layer, d_i refers to the embedding dropout value, b_i dummy-encodes whether a model employed bidirectional versus unidirectional RNNs, and c_i is the cell type of the RNN (i.e., LSTM or GRU). Since the margin functions differently in the triplet and duplet loss, we model the interaction between margin and loss type (β_lm l_i m_i). All categorical predictors are dummy-encoded, and the continuous predictors are zero-centered. β priors are sampled from uninformative Normal distributions, N(0, 1), and σ is sampled from a weakly regularizing half-Cauchy prior with location 0 and scale 1.

Table 4 presents the posterior distribution estimates of the model along with their estimation errors, their 95% credible intervals (CI₉₅), and the R̂ statistic.⁷ The mean

⁶ The full list of hyper-parameters and the predefined priors are listed in Section 2 of the Supplementary Materials.
⁷ Since the linear model is Bayesian, the credible intervals can be interpreted straightforwardly as the 95% probability that the estimates fall in a particular range. The 'No U-Turn Sampler' (NUTS) was used for sampling [15], which is a specific type of Hamiltonian Monte Carlo (HMC).


        Estimate   Error   l-CI₉₅   u-CI₉₅    R̂
γ         0.66     0.00     0.65     0.67    1.0
β_l      −0.08     0.01    −0.09    −0.07    1.0
β_m       0.01     0.01    −0.02     0.04    1.0
β_d      −0.05     0.02    −0.09     0.00    1.0
β_b       0.03     0.01     0.02     0.04    1.0
β_c      −0.05     0.00    −0.06    −0.04    1.0
β_h       0.02     0.00     0.01     0.03    1.0
β_lm     −0.18     0.02    −0.22    −0.13    1.0
σ         0.04     0.00     0.04     0.04    1.0

Table 4. Posterior distribution estimates for the hyper-parameters of the Neural Networks. In addition to the mean estimates, the table provides the estimation errors, 95% Credible Intervals, and the R̂ statistic.

Figure 6. Marginal effects plot showing the interaction between loss type (i.e., duplet and triplet loss) and the margin α. [Plot omitted: Mean Average Precision against the centered margin.]

Figure 7. Marginal effects plot of RNN cell type (i.e., LSTM or GRU) and bidirectional versus unidirectional RNNs. [Plot omitted: Mean Average Precision per loss type, cell type, and directionality.]

intercept γ = 0.66 represents the mean posterior estimate of MAP values for unidirectional models, fit with GRU cells (c_i = 0), duplet loss (l_i = 0), and mean (i.e., 0) values for the continuous predictors. Given this base model, several interesting observations can be made. First, as suggested by the negative β_l estimate, triplet loss models markedly underperform duplet loss models with, ceteris paribus, a mean drop in performance of 0.08. Second, employing larger hidden dimensions (β_h) and using bidirectional RNNs (β_b) both positively influence the MAP scores. Third, on average, adding too much dropout hurts performance (β_d = −0.05). Fourth, the strong negative posterior distribution estimate for the interaction between the margin α and loss type indicates that careful tuning of α is especially important for the triplet loss. By contrast, different values of α barely impact the performance of duplet loss models. The marginal effects plot in Figure 6 highlights this interaction. Finally, RNNs trained with GRU cells markedly outperform models trained with LSTM cells (β_c = −0.05). Figure 7 illustrates this performance difference. Additionally, the plot demonstrates the benefits of employing bidirectional RNNs, which consistently outperform unidirectional models.

5. CONCLUSION & FUTURE WORK

This paper proposed a method for end-to-end melody similarity metric learning using deep neural networks. We trained distributional melody encoders to minimize duplet and triplet contrastive loss functions, with which we achieve state-of-the-art retrieval performance on a large set of instrumental and vocal melodies. A thorough statistical analysis of the hyper-parameters of the Neural Networks indicates that, on average, duplet loss RNNs are easier to tune and less sensitive to specific hyper-parameter settings. Additionally, RNNs trained with GRU cells consistently outperform LSTM cell implementations. Our system has several major advantages over more traditional, alignment-based methods. First, thanks to its ability to infer complex interactions between input variables, the Neural Network approach is less sensitive to specific feature combinations and feature selection. Second, as shown by our study, the Neural Network approach displays more robustness, achieving similar MAP scores across exclusive sets of tune families (seen vs. unseen).

For future work, we have the following three recommendations. First, the applicability of the proposed approach should be carefully examined on more diverse data sets, in order to test the cross-domain robustness of the learned similarity metrics. Second, a more extensive and thorough comparison (including error analysis) with other existing melodic similarity methods is desired to highlight advantages and possible disadvantages of the neural systems. Finally, we acknowledge that, while successful, the proposed architecture still leaves room for improvement. Inspired by progress in similarity metric learning within the fields of Paraphrase Detection and Semantic Textual Similarity, we would like to experiment with more expressive neural architectures and feature extraction to explore the performance limits of the Neural Network approach on melodic similarity metric learning.

⁷ (cont.) R̂ is a statistic to assess the convergence of the sampler, and should be below 1.1 [8]. For more information about the convergence and parameters, see the Supplementary Information.
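As a compact illustration of the online triplet selection described in Section 3.3, the sketch below filters a mini-batch distance matrix for (positive, negative, anchor) index triplets whose distance gap lies inside the margin α, i.e., the negative is farther from the anchor than the positive, but by less than α. The fallback to a random negative is omitted, and the batch values are invented for illustration.

```python
import numpy as np

def semi_hard_triplets(D: np.ndarray, labels: np.ndarray, alpha: float = 0.2):
    """Select (positive i, negative j, anchor k) triplets from a mini-batch
    pairwise distance matrix D, keeping only those whose gap lies in (0, alpha)."""
    n = len(labels)
    triplets = []
    for i in range(n):
        for k in range(n):
            if k == i or labels[k] != labels[i]:
                continue                     # anchor: same family, not i itself
            for j in range(n):
                if labels[j] == labels[i]:
                    continue                 # negative: different family
                gap = D[j, k] - D[i, k]      # negative farther than positive
                if 0.0 < gap < alpha:
                    triplets.append((i, j, k))
    return triplets

# Toy batch: melodies 0 and 1 share a family; melody 2 does not.
D = np.array([[0.0, 0.1, 0.5],
              [0.1, 0.0, 0.25],
              [0.5, 0.25, 0.0]])
labels = np.array([0, 0, 1])
assert semi_hard_triplets(D, labels) == [(0, 2, 1)]
```

With the default α = 0.2, only the triplet anchored at melody 1 survives: its negative (melody 2) is 0.15 farther away than the positive, which lies inside the margin, while the triplet anchored at melody 0 has a gap of 0.4 and is discarded as too easy to matter.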


6. REFERENCES

[1] S. M. Bayard. Prolegomena to a study of the principal melodic families of British-American folk song. Journal of American Folklore, 63(247):1–44, 1950.

[2] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.

[3] Edgar Chávez, Gonzalo Navarro, Ricardo Baeza-Yates, and José Luis Marroquín. Searching in metric spaces. ACM Computing Surveys, 33(3):273–321, 2001.

[4] Tian Cheng, Satoru Fukayama, and Masataka Goto. Comparing RNN parameters for melodic similarity. In ISMIR, pages 763–770, 2018.

[5] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111. Association for Computational Linguistics, 2014.

[6] Roger B. Dannenberg, William P. Birmingham, Bryan Pardo, Ning Hu, Colin Meek, and George Tzanetakis. A comparative evaluation of search techniques for query-by-humming using the MUSART testbed. Journal of the American Society for Information Science and Technology, 58(5):687–701, 2007.

[7] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

[8] Andrew Gelman, Donald B. Rubin, et al. Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–472, 1992.

[9] Mathieu Giraud, Ken Déguernel, and Emilios Cambouropoulos. Fragmentations with pitch, rhythm and parallelism constraints for variation matching. Volume 8905 of Lecture Notes in Computer Science. Springer, 2014.

[10] O. Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162:705–708, 1982.

[11] Maarten Grachten, Josep Lluís Arcos, and Ramon López de Mántaras. Melody retrieval using the implication/realization model. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), 2005.

[13] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.

[14] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.

[15] Matthew D. Hoffman and Andrew Gelman. The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.

[16] Berit Janssen, Peter van Kranenburg, and Anja Volk. Finding occurrences of melodic segments in folk songs employing symbolic similarity measures. Journal of New Music Research, 46(2):118–134, 2017.

[17] Diederik P. Kingma and Jimmy Lei Ba. Adam: a method for stochastic optimization. International Conference on Learning Representations 2015, pages 1–15, 2015.

[18] Andreas Kornstädt. Themefinder: A web-based melodic search tool. Computing in Musicology, 11:231–236, 1998.

[19] Peter van Kranenburg. A Computational Approach to Content-Based Retrieval of Folk Song Melodies. PhD thesis, Utrecht University, Utrecht, October 2010.

[20] Peter van Kranenburg and Martine de Bruin. The Meertens Tune Collections: MTC-FS-INST 2.0. Meertens Online Reports 2019-1, Meertens Institute, Amsterdam, 2019.

[21] Peter van Kranenburg, Martine de Bruin, and Anja Volk. Documenting a song culture: the Dutch Song Database as a resource for musicological research. International Journal on Digital Libraries, 20(1):13–23, 2019.

[22] Peter van Kranenburg, Martine de Bruin, Louis P. Grijp, and Frans Wiering. The Meertens Tune Collections. Meertens Online Reports 2014-1, Meertens Institute, Amsterdam, 2014.

[23] Peter van Kranenburg, Anja Volk, and Frans Wiering. A comparison between global and local features for computational classification of folk song melodies. Journal of New Music Research, 42(1):1–18, 2013.

[24] Velankar Makarand and Kulkarni Parag. Unified algorithm for melodic music similarity and retrieval in query by humming. In Intelligent Computing and Information and Communication: Proceedings of 2nd International Conference, ICICC 2017. Springer Singapore, 2018.
[12] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimen-
sionality reduction by learning an invariant mapping. [25] Leland McInnes, John Healy, and James Melville.
In Proceedings of the IEEE Computer Society Con- Umap: Uniform manifold approximation and pro-
ference on Computer Vision and Pattern Recognition, jection for dimension reduction. arXiv preprint
2006. arXiv:1802.03426, 2018.
Proceedings of the 20th ISMIR Conference, Delft, Netherlands, November 4-8, 2019
[26] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240, 2018.

[27] Marcel Mongeau and David Sankoff. Comparison of musical sequences. Computers and the Humanities, 24:161–175, 1990.

[28] Paul Neculoiu, Maarten Versteegh, and Mihai Rotaru. Learning text similarity with siamese recurrent networks. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 148–157, 2016.

[29] Justin Salamon, Joan Serrà, and Emilia Gómez. Tonal representations for music retrieval: From version identification to query-by-humming. International Journal of Multimedia Information Retrieval, 2(1):45–58, 2013.

[30] Patrick E. Savage and Quentin D. Atkinson. Automatic tune family identification by musical sequence alignment. In Proceedings of the 16th International Society for Music Information Retrieval Conference, pages 162–168, 2015.

[31] Patrick E. Savage, Charles Cronin, Daniel Müllensiefen, and Quentin D. Atkinson. Quantitative evaluation of music copyright infringement. In Proceedings of the 8th International Workshop on Folk Music Analysis (FMA 2018), pages 61–66, 2018.

[32] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.

[33] Julián Urbano, Juan Lloréns, Jorge Morato, and Sonia Sánchez-Cuadrado. Melodic similarity through shape similarity. In Sølvi Ystad, Mitsuko Aramaki, Richard Kronland-Martinet, and Kristoffer Jensen, editors, Exploring Music Contents, volume 6684 of Lecture Notes in Computer Science, pages 338–355. Springer Berlin Heidelberg, 2011.

[34] Jelmer van Nus, Geert-Jan Giezeman, and Frans Wiering. Melody retrieval and composer attribution using sequence alignment on RISM incipits. In TENOR 2017: International Conference on Technologies for Music Notation and Representation, 2017.

[35] Valerio Velardo, Mauro Vallati, and Steven Jan. Symbolic melodic similarity: State of the art and future challenges. Computer Music Journal, 40(2):70–83, 2016.