
UNIVERSITY OF CALIFORNIA

SANTA CRUZ

LEARNING MUSIC STYLES WITH ACOUSTIC SIMILARITY MEASURES
A project submitted in partial satisfaction of the
requirements for the degree of

MASTER OF SCIENCE

in

COMPUTER SCIENCE

by

Mark C. Deckert

January 2007

The Project of Mark C. Deckert


is approved:

Professor David Helmbold, Chair

Professor Roberto Manduchi


Copyright © by

Mark C. Deckert

2007
Table of Contents

List of Figures

List of Tables

Abstract

Acknowledgments

1 Introduction

2 Related Work

3 Creating a Model

4 Testing the Classification System

5 Modeling Attack

6 A Practical Use

7 Conclusion

Bibliography

A Styles with Four to Ten Artists
List of Figures

3.1 The creation of a style model from an audio corpus containing the songs which represent that style.

3.2 The evaluation of the likelihood that a particular song was produced by the GMM for a style, where “Average Likelihood” is the average of the log likelihood for each frame in the piece of music. To make a classification, this evaluation occurs for multiple style GMMs and the GMM which produces the greatest likelihood is chosen as the style of that song.

4.1–4.8 Confusion matrices where each row shows the percentage of examples with the actual label of that row that were labeled by the classifier with the style indicated at the top of that column.

5.1–5.3 Confusion matrices where each row shows the percentage of examples with the actual label of that row that were labeled by the classifier with the style indicated at the top of that column.

List of Tables

4.1 Narrowed from a larger set, these styles were found to be independent and distinct.

4.2 These styles were found to have overlapping artists.

Abstract

Learning Music Styles with Acoustic Similarity Measures

by

Mark C. Deckert

Music is classified into styles based on features derived from audio signals called Mel-Frequency Cepstral Coefficients (MFCC). Styles are represented by Gaussian mixture models that represent the distribution of MFCC vectors obtained from music of that style. A novel feature based on properties of instrument attack is added to the existing MFCCs to improve classification performance.


Acknowledgments

I would like to thank Roberto Manduchi, David Helmbold, Charlie McDowell, Dan Ellis, David Cope, and my parents for their support and guidance.

Chapter 1

Introduction

Music similarity is an elusive concept—wholly subjective, multifaceted, and a moving target—but one that must be pursued in support of applications to provide automatic organization of large music collections [2].

Having experienced data loss that included a large, well-sorted MP3 collection,

my interest in music similarity is more than academic. As will be further discussed, it is

the contention of this paper that music similarity and, more specifically, music sorting

involves a significant element of personal taste. A major goal of this research is to

create a usable, flexible music sorting application that allows the user to choose music

categories that suit their own preferences. To this end, a sorting application has been

produced which achieves a reasonable degree of flexibility.

A substantial part of the music information retrieval literature and research effort has sought to find a “ground truth” in music similarity measures. I contend

that “ground truth” is an ill-conceived concept. Just as the opening quote purports,

music similarity is a “wholly subjective” concept. While there are certainly quantitative measures which contribute to music similarity, the number of potential individual

measures is staggering. Furthermore, the particular way in which these measures man-

ifest in the mind of a given listener is a function of both the physical features of their

hearing system and their particular mental state. For example, the focus of a musical

layperson and of a musician when listening to a particular piece of music will likely

be quite different. That said, a data set and some necessarily subjective classifications

are certainly required to create style classification systems. To mitigate this problem,

Ellis et al. define a concept called “consensus truth”, skirting the issues surrounding

individualized style perception by attempting to find consensus amongst the varying

opinions of listeners [2].

Having surveyed the multitude of techniques in this arena, I have attempted to select those techniques which most closely match the physiological aspects of the listener, emphasizing techniques that are simple, elegant, and effective. A three-part system has been identified which meets these criteria.

The first part of such a system is the choice of styles and style examples. While

this may seem trivial, the choice of styles and examples can make or break a classification

system. If an artist belongs to multiple styles in a classification set or examples don’t

represent the style especially well, classification systems become less effective. Given

the large number of commonly used style classifications and inconsistency as to which

artists belong to them, finding groups of styles which represent typical end user classi-

fications becomes an important step. Once a selection of styles and classified music is

decided upon, the next component is to extract features from the audio. A well-refined technique from the speech recognition community called Mel-Frequency Cepstral Coefficients (MFCC) is used.

With each frame containing 20 coefficients that comprise a vector, the MFCCs derived

from a corpus of audio representing a style are treated as a bag of frames and mod-

eled with two clustering algorithms, K-means and expectation maximization, creating

a Gaussian mixture model representing each style [10].

The remainder of the paper is organized as follows. Chapter 2 describes re-

lated work. In Chapter 3 the modeling methodology is described in detail. Chapter

4 describes the testing methods used. Chapter 5 covers an additional feature which

models the attack of musical instruments. Chapter 6 describes a practical use of these

style classification techniques and Chapter 7 concludes.

Chapter 2

Related Work

The Music Information Retrieval (MIR) community has seen much progress in

recent years, especially in the area of acoustic similarity measures. Advances in machine

learning techniques and the cost of computational resources have allowed large amounts

of audio to be analyzed over many feature spaces, often drawing from prior work in the

speech recognition community. Three recent papers act to distill the state of the art in

music similarity measures.

Berenzweig et al. [2] provide a strong, cross-site overview of music similarity

measures, exploring the concepts of consensus truth, anchor space and acoustic similar-

ity, and discussing the use of Mel-Frequency Cepstral Coefficients (MFCC), Gaussian

Mixture Models, K-means, and EM in creating models. In addition, they explore vari-

ous metrics for comparing similarity of models such as KL-divergence and Earth Movers

Distance. A distinct difference between their work and my own is that they explore sim-

ilarity between models whereas my own work, which is more specific to music sorting,

takes the approach of averaging log-likelihoods that each frame from a song would be

produced by the model for a given style.

Aucouturier and Pachet [1] focus specifically on global timbre similarity, ex-

ploring the parameter space in search of an optimal parameter set. They find that

adjusting parameters such as the number of Gaussians in the GMM, number of MFCCs,

distance metrics and size of the MFCC window can improve the accuracy significantly.

Building on these results they add various front-end variations such as liftering, cep-

stral mean compensation, delta coefficients, and acceleration coefficients, concluding

that there seems to be a “glass ceiling” limiting the potential for increasing accuracy

above a certain threshold regardless of the combination of front-end variations and tim-

bre similarity parameters used. My own novel feature builds upon the delta coefficient

variation they tested by adding a transformation which I gleaned from my own knowl-

edge as a musician. Though my confidence is not great, I hope future work with this

feature may serve to break through the glass ceiling.

Bergstra et al. [3] provide the most complete exploration of the acoustic simi-

larity feature space to date and significantly improve classification by computing many

features over medium-sized song segments. They combine these meta-features (features

computed from song segments) as weak classifiers using the Adaboost algorithm, chal-

lenging Mandel & Ellis for supremacy at MIREX 2005, a Music Information Retrieval

competition. The use of song segments and Adaboost is an intriguing technique, but is

beyond the scope of my research (their team consists of no fewer than 5 MIR experts).

Chapter 3

Creating a Model

Surveying the existing techniques and prior work in the field of music infor-

mation retrieval, a complex and disparate array of techniques emerges. Most of the

techniques exhibit a fair amount of success but seem a bit complex and unrefined. A

few authors pull more refined techniques from the extensive body of research produced

by the speech recognition community to produce strong results. To quote Dan Ellis, one of the better-performing Music Information Retrieval (MIR) researchers in years past:
Decades of research in the speech community has led to usable systems and
convergence of the features and models used for speech analysis [7].

As a result, this author has paid particular attention to the techniques and results

coming out of LabROSA at Columbia University from Ellis et al. [2, 8]. I have accepted

their and Logan’s opinion [7] that Mel-Frequency Cepstral Coefficients (MFCC) are an

ideal feature for style classification and have obtained a large collection of MFCC data

and music labeling called uspop2002 [2]. Ellis et al. have been working to distribute this

data set as a means of providing a consistent testbed for the evaluation and comparison

of MIR techniques. Though MFCC are not the only possible audio feature, a strong

case has been made that a consistent data set is needed. Unfortunately, due to copyright

restrictions, that data set cannot consist of raw audio and so features must be distributed

instead [6]. The choice of MFCC seems appropriate because, aside from their known

success in the field of speech recognition, they seem to be the features most closely

correlated to human perception [7].

I was able to host the aforementioned data set containing a large amount of

classified MFCC data from 400 artists on UCSC School of Engineering servers and would

like to thank the School of Engineering, David Helmbold and the Machine Learning

group for this privilege. Aside from my own work, this data set will serve as a resource

for future Machine Learning students.

A formal definition of MFCC is borrowed from Bergstra et al. [3]. F_s is the discretized Fourier transform of the audio signal: a vector where each element of F_s represents energy in a small frequency range during a near instant of time (a 32 ms frame). First, |F_s| is projected according to the Mel-scale. The Mel-scale is a psycho-acoustic frequency scale on which a change of 1 unit carries the same perceptual significance, regardless of the position on the scale. The Mel-scale increases identically with Hertz from 0 to 1000 Hz, at which point it continues to rise logarithmically. We refer to the Mel-scale projected version of F_s as F_mel. Following the projection, MFCC are computed by taking the logarithm, which maps power to loudness, and then the DCT (discrete cosine transform), which produces coefficients that estimate the strength of different harmonic series in the signal. The following equations define the MFCC, where each frame is referred to by the superscript k and γ is set to a small value to avoid taking the logarithm of 0:
L_mfcc^(k) = log(F_mel^(k) + γ)

r_mfcc^(k) = dct(L_mfcc^(k))

To clarify, what I refer to as an MFCC is a 20-dimensional vector which represents a near instant of sound. Each coefficient, or element in the vector, is a number which represents the loudness of a harmonic series in the signal. We may simplify further and say that each coefficient represents loudness in a certain frequency range, since lower frequency harmonics tend to mask higher frequency harmonics. In the remainder of the paper, we refer to a single vector as an MFCC and a collection of those vectors which represent an audible amount of audio (generally one or more entire songs) as MFCCs.
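To make the per-frame computation concrete, a minimal Matlab-style sketch follows. It assumes a Mel-projected magnitude spectrum Fmel is already available for one frame; this is illustrative only and is not the feacalc pipeline actually used.

% Sketch of one frame's MFCC, assuming Fmel is the Mel-projected magnitude
% spectrum of a single 32 ms frame (not the feacalc code actually used).
gamma = 1e-8;                 % small constant to avoid log(0)
Lmfcc = log(Fmel + gamma);    % logarithm maps power to loudness
r = dct(Lmfcc);               % DCT estimates strengths of harmonic series
mfcc = r(1:20);               % keep the first 20 coefficients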

The data set used contains 400 artists, 8764 tracks, and 251 unique style tokens.

The MP3s were downsampled to 22050 Hz, and mixed to mono using the mpg123 and

feacalc utilities. Each 32 ms frame is distilled into 20 coefficients [5].

Given the large number of MFCCs, we must now find a way to model them – there are too many for them to be considered individually. The methodology chosen here is selected both for its simplicity and its effectiveness. Again drawing from the speech

recognition community, the MFCCs are modeled as a bag of frames represented by a

Gaussian Mixture Model. The GMM is trained using the well established technique of

initializing with the K-means clustering algorithm and then honing in with Expectation

Maximization. For further explanation, the reader is referred to the excellent survey

of clustering techniques by Rui Xu and Donald Wunsch II [10].

[Figure 3.1 pipeline: audio corpus → short audio frames → MFCC vectors → bag of MFCC frames → K-means + EM → style GMM]

Figure 3.1: The figure shows the creation of a style model from an audio corpus containing the songs which represent that style.

Of note is that a well-known problem with EM is that sometimes an outlying point gets isolated to a single Gaussian in the mixture model, sending its covariance to zero and resulting in division-by-zero errors. Though this problem did surface at one point, the Netlab toolkit I used provided an option for handling this issue [9].
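As a concrete sketch of this training step, the Netlab calls look roughly as follows. The number of Gaussians and the diagonal covariance type are illustrative assumptions; the exact settings used are not recorded in this text.

% Fit a style GMM with Netlab, assuming mfccs is an (nframes x 20) matrix of
% MFCC vectors pooled from the style's training songs. ncentres and the
% covariance type below are assumptions, not the project's recorded settings.
ncentres = 16;
mix = gmm(20, ncentres, 'diag');    % 20-dimensional mixture of diagonal Gaussians
opts = zeros(1, 18);                % Netlab-style options vector
opts(14) = 10;                      % iterations for the K-means initialization
mix = gmminit(mix, mfccs, opts);    % initialize centres with K-means
opts(5) = 1;                        % reset collapsing covariances (the option noted above)
opts(14) = 50;                      % EM iterations
mix = gmmem(mix, mfccs, opts);      % refine the mixture with EM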

[Figure 3.2 pipeline: song audio → short audio frames → MFCC vectors → per-frame log likelihoods under a style GMM → average likelihood]

Figure 3.2: The figure shows the evaluation of the likelihood that a particular song was produced by the GMM for a style, where “Average Likelihood” is the average of the log likelihood for each frame in the piece of music. To make a classification, this evaluation occurs for multiple style GMMs and the GMM which produces the greatest likelihood is chosen as the style of that song.
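A minimal sketch of this evaluation, with hypothetical variable names:

% Classify one song, assuming models is a cell array of trained style GMMs
% and song_mfccs is an (nframes x 20) matrix of the song's MFCC vectors.
avg_ll = zeros(1, length(models));
for i = 1:length(models)
    p = gmmprob(models{i}, song_mfccs);    % per-frame likelihoods under style i
    avg_ll(i) = mean(log(p + realmin));    % average log likelihood over frames
end
[best_ll, best] = max(avg_ll);             % best indexes the chosen style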

Chapter 4

Testing the Classification System

After securing the disk space and transferring the data set, I analyzed the

available style label sets and determined that some restructuring of both the provided

labels and disk structure of the data would be required. The issue was that there

were over 200 styles specified with each artist belonging to several styles. As such,

any attempt to classify the entire database wholesale would be doomed to failure. The

solution to this came in the form of using Perl to massage the provided labeling into a

useful format and to generate shell scripts which created a directory for each style and

softlinked in all the data for each artist listed as belonging to that style. Tests would

then be conducted among groups of styles representative of what the typical end user

might use.

With an eye on the School of Engineering’s recently purchased compute servers

and the desire to create a good resource base, I decided to try to model every style in

the data set – a rather computation- and memory-intensive task. These models were

later used to add flexibility to the music sorting application created as a result of the

research. The modeling was accomplished by randomly splitting the data set into half

training and half testing examples. Perl was again employed to create shell scripts, this

time creating three separate shell scripts aimed at the three appropriately configured

compute servers. Each script fired off Matlab processes one at a time, using Netlab [9] and Matlab to create a Gaussian mixture model of the songs in each style’s training set.
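The per-style split amounts to the following sketch, with hypothetical names; the actual split was scripted in Perl over file lists.

% Randomly split one style's songs in half, assuming tracks is a cell array
% of per-song MFCC matrices.
n = length(tracks);
perm = randperm(n);                        % random permutation of song indices
train = tracks(perm(1:floor(n/2)));        % half the songs train the style GMM
test  = tracks(perm(floor(n/2)+1:end));    % the rest are held out for testing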

Having created and saved models for each style and lists of songs for their

training and testing sets, I proceeded to design some groups of styles. Again using

Perl, a list of styles containing content from more than 4 artists and fewer than 10 artists was created, with the reasoning that well-represented, but not overly broad, styles were desired. This set included 24 styles, which are listed in the Appendix. Tests were then conducted on all size-three subgroups of this group of styles. Due to limited computational resources I further reduced this set to 10 styles, shown in the tables which follow. Using a distributed approach I spread the tests across multiple shell scripts. Initially I was disappointed to find that some of the tests showed fairly effective classification while others were dismal. The cause of this ended up being overlap of artists belonging to certain rock-based styles. After removing the overlapping styles,

results improved and became more consistent. Sets of four were then tested and showed

only a slight decrease in classification effectiveness. Future work will involve identifying

all style overlaps and testing a larger, non-overlapping set of styles.

Having multiple labels for some artists does produce somewhat of an issue in that traditional learning specifies that there be one label per example when training a classifier.

Styles chosen for main experiment
Country-Pop
Electronica
Grunge
House
Prog-Rock+Art Rock
Punk-Pop
Soul

Table 4.1: Narrowed from a larger set, these styles were found to be independent and distinct.

As mentioned, the solution to this comes in choosing sets of styles where

each artist is labeled by only one style in the set, producing the one-to-one mapping

required to train a classifier. While this may seem a bit awkward, it is consistent with

the intended usage of the classification system. When a user uses the system to sort

their music, they choose a small subset of styles from the larger list. Assuming the user

is rational and has a good idea of what each style sounds like, the chosen set will include

styles which are not closely related and each artist from the training corpus will map to

a unique style in the set. Creating a sorting application with a large number of styles

where artists may belong to multiple styles is something which has already been done

and is not a goal of this research. The goal is to allow a user to choose their own small

set of styles and to map each artist to a unique style, as I have found this to be the

most effective sorting mechanism for quickly and effectively finding music from one’s

own medium-sized collection.

Table 4.1 shows the styles used and Table 4.2 shows those that were excluded

due to overlap.

With results now correctly tabulated for all groups of three and four styles, Perl was again employed to find the average classification accuracy for each of these groupings.

Styles eliminated due to overlap
Rock + Roll
Blues-Rock
Funk

Table 4.2: These styles were found to have overlapping artists.

Summarizing the current method, over 200 styles were identified and artists

for each style linked into representative directories. For each style, half the songs were

designated to belong to the training set and half to the testing set. A Gaussian mixture

model was then created for each style based on the training set. A subset of styles was

then chosen (Table 4.1) and tests run on each possible group of 3 and 4 styles from this

subset, producing a number representing the likelihood that each testing example was

produced by each style in the group. When this number was highest for the style the

song belonged to, a success was tallied. When the number was highest for a different

style, a failure was tallied.
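The tallying just described can be sketched as follows, where pred and actual are hypothetical vectors of predicted and true style indices:

% Build a row-normalized confusion matrix like those shown below, assuming
% pred and actual hold style indices in 1..k for each test song.
k = max(actual);
C = zeros(k, k);
for i = 1:length(actual)
    C(actual(i), pred(i)) = C(actual(i), pred(i)) + 1;
end
C = 100 * C ./ repmat(sum(C, 2), 1, k);    % each row now sums to 100 percent
accuracy = 100 * mean(pred == actual);     % overall percent correct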

For groups of 3 styles, classification was 71.9% successful. Figure 4.1 shows a confusion matrix for the most poorly classified set of 3 styles. Figure 4.2 shows a confusion matrix for an average set of 3 styles. Figure 4.3 shows a confusion matrix for the best-classified set of 3 styles.

For groups of 4 styles, classification was 69.9% successful. Figure 4.4 shows a confusion matrix for the most poorly classified set of 4 styles. Figure 4.5 shows a confusion matrix for an average set of 4 styles. Figure 4.8 shows a confusion matrix for the best-classified set of 4 styles.

Electronica House Soul
Electronica 37.70 27.87 34.43
House 20.91 28.18 50.91
Soul 0.80 9.60 89.60

Figure 4.1: A confusion matrix where each row shows the percentage of examples with the actual label of that row that were labeled by the classifier with the style indicated at the top of that column.

Electronica House Punk-Pop
Electronica 38.52 37.70 23.77
House 25.45 47.27 27.27
Punk-Pop 2.91 9.88 87.21

Figure 4.2: A confusion matrix where each row shows the percentage of examples with the actual label of that row that were labeled by the classifier with the style indicated at the top of that column.

Though not an identical test, these values do fall within the range of genre classification accuracies achieved by participants in the MIREX 2005 contest, where accuracies ranged from 60.72% to 82.34% [3].

Country-Pop Grunge House


Country-Pop 88.35 4.85 6.80
Grunge 9.76 78.86 11.38
House 4.55 7.27 88.18

Figure 4.3: A confusion matrix where each row shows the percentage of examples with
the actual label of that row that were labeled by the classifier with the style indicated
at the top of that column.

Electronica Grunge House Soul
Electronica 31.15 9.84 27.87 31.15
Grunge 10.57 63.41 2.44 23.58
House 18.18 5.45 26.36 50.00
Soul 0.80 0.80 9.60 88.80

Figure 4.4: A confusion matrix where each row shows the percentage of examples with
the actual label of that row that were labeled by the classifier with the style indicated
at the top of that column.

Electronica Grunge House Punk-Pop


Electronica 35.25 6.56 36.07 22.13
Grunge 11.38 34.96 3.25 50.41
House 24.55 1.82 47.27 26.36
Punk-Pop 2.91 3.49 9.88 83.72

Figure 4.5: A confusion matrix where each row shows the percentage of examples with
the actual label of that row that were labeled by the classifier with the style indicated
at the top of that column.

Electronica Grunge Prog-Rock+Art Rock Soul


Electronica 52.46 8.20 23.77 15.57
Grunge 9.76 61.79 22.76 5.69
Prog-Rock+Art Rock 1.40 3.27 73.36 21.96
Soul 2.40 0.80 19.20 77.60

Figure 4.6: A confusion matrix where each row shows the percentage of examples with
the actual label of that row that were labeled by the classifier with the style indicated
at the top of that column.

Country-Pop Electronica Prog-Rock+Art Rock Soul


Country-Pop 50.49 0.00 25.24 24.27
Electronica 1.64 57.38 24.59 16.39
Prog-Rock+Art Rock 1.40 1.40 75.70 21.50
Soul 4.80 1.60 18.40 75.20

Figure 4.7: A confusion matrix where each row shows the percentage of examples with
the actual label of that row that were labeled by the classifier with the style indicated
at the top of that column.

Country-Pop Electronica Prog-Rock+Art Rock Punk-Pop
Country-Pop 63.11 0.00 33.98 2.91
Electronica 3.28 50.82 31.15 14.75
Prog-Rock+Art Rock 1.87 2.34 92.99 2.80
Punk-Pop 1.74 1.74 13.37 83.14

Figure 4.8: A confusion matrix where each row shows the percentage of examples with the actual label of that row that were labeled by the classifier with the style indicated at the top of that column.

Chapter 5

Modeling Attack

The motivation for the new feature is to improve results by incorporating a

temporal feature into the existing classification system. To this end, I have conceived

of a method that fits seamlessly into the existing modeling framework. By finding the

delta between consecutive MFCC frames and exponentiating this quantity, the large

positive deltas, which represent the attack of musical instruments (including voice), are

accentuated while still maintaining a set of coefficients which are in the general range

of the original MFCCs. For readers not familiar with the term attack, it is the short

time segment when an instrument goes from silence to producing a sound, marked by

a steep increase in energy in certain frequencies of the signal. By taking this new set of

coefficients and doubling the feature space, a new set of 40-dimensional GMMs is produced.
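One plausible reading of this construction is sketched below; the exact delta and exponentiation details are my assumption here, since only the idea is described above.

% Sketch of the attack feature under one plausible reading: exponentiate
% per-frame MFCC deltas so that large positive deltas (attacks) are
% accentuated, then stack them with the original MFCCs for 40-d features.
d = diff(mfccs);                      % deltas between consecutive frames
attack = exp(d);                      % accentuates large positive deltas
feat = [mfccs(2:end, :), attack];     % (nframes-1) x 40 combined feature matrix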

Initial tests show up to a 4% increase in classification accuracy, though some tests show poorer performance. For the four-style experiment, overall classification improved to 72.4%. The attack model seems to perform well on music containing steady beats (e.g. Punk-Pop) and does poorly when the beats are irregular (e.g. Electronica) or not prominent (e.g. Country-Pop). Further work will involve attempting to limit attack modeling to frequencies likely to contain attack-oriented beat information.

Electronica Grunge Prog-Rock+Art Rock Soul
Electronica 57.50 37.50 2.50 2.50
Grunge 0.76 58.33 14.39 26.52
Prog-Rock+Art Rock 0.76 40.91 55.30 3.03
Soul 4.62 30.77 13.85 50.77

Figure 5.1: A confusion matrix where each row shows the percentage of examples with the actual label of that row that were labeled by the classifier with the style indicated at the top of that column.

Country-Pop Electronica Prog-Rock+Art Rock Soul
Country-Pop 77.50 17.50 2.50 2.50
Electronica 3.62 54.35 37.68 4.35
Prog-Rock+Art Rock 2.86 11.43 82.86 2.86
Soul 0.00 14.74 9.47 75.79

Figure 5.2: A confusion matrix where each row shows the percentage of examples with the actual label of that row that were labeled by the classifier with the style indicated at the top of that column.

Figure 5.1 shows a confusion matrix of a classification that did poorly in com-

parison to its non-attack counterpart, Figure 4.6. Figure 5.2 shows a confusion matrix

for an average attack-based modeling classification compared to its non-attack counterpart, Figure 4.7. Figure 5.3 shows a confusion matrix where attack modeling significantly outperformed its non-attack counterpart, Figure 4.8.

I wish to thank David Cope for spawning this idea by describing how it is

necessary to base music similarity on pitch intervals and not absolute pitch in order to

relate music in different keys [4].

19
Country-Pop Electronica Prog-Rock+Art Rock Punk-Pop
Country-Pop 65.00 12.50 15.00 7.50
Electronica 5.43 60.29 34.29 0.00
Prog-Rock+Art Rock 0.00 4.29 95.71 0.00
Punk-Pop 3.19 5.32 4.51 86.98

Figure 5.3: A confusion matrix where each row shows the percentage of examples with the actual label of that row that were labeled by the classifier with the style indicated at the top of that column.

Chapter 6

A Practical Use

Using the existing models and generating MFCCs from new user-provided audio data (the user points to a directory containing MP3s), I have created a system which makes recommendations for sorting and allows the user to specify the actual style to which the new song belongs. Choices are sorted according to the average log likelihood that each song was produced by each style in the set chosen by the user, and the program moves the actual audio data to directories based on the user's choice, creating a usable music sorting system. The following example run illustrates the program being used. Notice that the style choices for each song are sorted by their likelihood. The classifier was correct in most of the cases, and where it failed, the desired style was always the second choice.

Please provide a substring (or RegEx) of desired style [Enter for Done]: Rock

0. Search Again
1. Acid_Rock
2. Adult_Alternative_Pop+Rock
3. Album_Rock
4. Alternative_Country-Rock
5. Alternative_Pop+Rock
6. American_Trad_Rock
7. Arena_Rock
8. Aussie_Rock
9. Blues-Rock
10. Boogie_Rock
11. British_Folk-Rock
12. British_Trad_Rock
13. Celtic_Rock
14. College_Rock
15. Comedy_Rock
16. Country-Rock
17. Experimental_Rock
18. Folk-Rock
19. Glam_Rock
20. Goth_Rock
21. Hard_Rock
22. Heartland_Rock
23. Indie_Rock
24. Jazz-Rock
25. Latin_Rock
26. New_Zealand_Rock
27. Noise-Rock
28. Pop+Rock
29. Prog-Rock+Art_Rock
30. Pub_Rock
31. Rap-Rock
32. Rock_+_Roll
33. Rock_en_Español
34. Rockabilly
35. Rocksteady
36. Roots_Rock
37. Soft_Rock
38. Southern_Rock
39. Space_Rock
40. Swedish_Pop+Rock
Choose a modeled style number: 3
Please provide a substring (or RegEx) of desired style [Enter for Done]: Jazz

0. Search Again
1. Acid_Jazz
2. Crossover_Jazz
3. Jazz-Funk
4. Jazz-Pop
5. Jazz-Rock
6. Smooth_Jazz
7. Vocal_Jazz
Choose a modeled style number: 5
Please provide a substring (or RegEx) of desired style [Enter for Done]: Electronic

0. Search Again
1. Electronica
2. Indie_Electronic
3. Progressive_Electronic
Choose a modeled style number: 2
Please provide a substring (or RegEx) of desired style [Enter for Done]: Reggae

0. Search Again
1. Contemporary_Reggae
2. Political_Reggae
3. Reggae
4. Reggae-Pop
5. Roots_Reggae
Choose a modeled style number: 5
Please provide a substring (or RegEx) of desired style [Enter for Done]:

Styles for Sorting


------------------
Album_Rock
Jazz-Rock
Indie_Electronic
Roots_Reggae

Music Location [/cse/grads/mdeckert/music/]:

Creating MFCCs from Air-La_Femme_dArgent.mp3


Creating MFCCs from AphexTwin-4.mp3
Creating MFCCs from Boards_of_Canada-Music_Is_Math.mp3
Creating MFCCs from Clapton-I_Shot_The_Sheriff.mp3
Creating MFCCs from Faith_No_more-The_Real_Thing.mp3
Creating MFCCs from Jaco_Pastorius-Come_On_Come_Over.mp3
Creating MFCCs from Marley-Get_Up_Stand_Up.mp3
Creating MFCCs from MMW-Last_Chance_To_Dance_Trance.mp3
Creating MFCCs from Quasimodal-Creature_of_the_Night.mp3
Creating MFCCs from The_New_Mastersounds-Aint_No_Telling.mp3
Creating MFCCs from Zappa-CatholicGirls.mp3

Comparing to models.

Ready to Sort Songs


-------------------

1. Indie_Electronic
2. Jazz-Rock
3. Album_Rock
4. Roots_Reggae
Choose directory for Air-La_Femme_dArgent.mp3 [1]:

1. Indie_Electronic
2. Jazz-Rock
3. Album_Rock
4. Roots_Reggae
Choose directory for AphexTwin-4.mp3 [1]:

1. Jazz-Rock
2. Indie_Electronic
3. Album_Rock
4. Roots_Reggae
Choose directory for Boards_of_Canada-Music_Is_Math.mp3 [1]: 2

1. Jazz-Rock
2. Album_Rock
3. Roots_Reggae
4. Indie_Electronic
Choose directory for Clapton-I_Shot_The_Sheriff.mp3 [1]: 2

1. Album_Rock
2. Jazz-Rock
3. Roots_Reggae
4. Indie_Electronic
Choose directory for Faith_No_more-The_Real_Thing.mp3 [1]:

1. Jazz-Rock
2. Indie_Electronic
3. Album_Rock
4. Roots_Reggae
Choose directory for Jaco_Pastorius-Come_On_Come_Over.mp3 [1]:

1. Roots_Reggae
2. Jazz-Rock
3. Album_Rock
4. Indie_Electronic
Choose directory for Marley-Get_Up_Stand_Up.mp3 [1]:

1. Jazz-Rock
2. Indie_Electronic
3. Roots_Reggae
4. Album_Rock
Choose directory for MMW-Last_Chance_To_Dance_Trance.mp3 [1]:

1. Roots_Reggae
2. Jazz-Rock
3. Album_Rock
4. Indie_Electronic
Choose directory for Quasimodal-Creature_of_the_Night.mp3 [1]: 2

1. Jazz-Rock
2. Indie_Electronic
3. Roots_Reggae
4. Album_Rock
Choose directory for The_New_Mastersounds-Aint_No_Telling.mp3 [1]:

1. Album_Rock
2. Jazz-Rock
3. Roots_Reggae
4. Indie_Electronic
Choose directory for Zappa-CatholicGirls.mp3 [1]:

Clean up? [yes]: yes

> ls -R music
music:
Album_Rock Indie_Electronic Jazz-Rock Roots_Reggae

music/Album_Rock:
Clapton-I_Shot_The_Sheriff.mp3 Zappa-CatholicGirls.mp3
Faith_No_more-The_Real_Thing.mp3

music/Indie_Electronic:
Air-La_Femme_dArgent.mp3 AphexTwin-4.mp3 Boards_of_Canada-Music_Is_Math.mp3

music/Jazz-Rock:
Jaco_Pastorius-Come_On_Come_Over.mp3 Quasimodal-Creature_of_the_Night.mp3
MMW-Last_Chance_To_Dance_Trance.mp3 The_New_Mastersounds-Aint_No_Telling.mp3

music/Roots_Reggae:
Marley-Get_Up_Stand_Up.mp3
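
Behind this session, the ranking step amounts to the following sketch, with hypothetical names; the real program also extracts the MFCCs and prompts the user.

% Rank style choices per song, assuming avg_ll(i,j) holds the average log
% likelihood of song i under style j, and songs/styles are cell arrays of names.
for i = 1:length(songs)
    [ll, order] = sort(avg_ll(i, :), 'descend');    % rank the style choices
    fprintf('Suggested style for %s: %s\n', songs{i}, styles{order(1)});
    % after the user confirms a choice k, the file is moved into place:
    % movefile(songs{i}, fullfile(musicdir, styles{k}));
end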

Chapter 7

Conclusion

While style classification may be elusive due to its subjectivity, I hope I have shown, via the application of existing techniques, the creation of a new feature, and the use of these techniques in a practical music classification system, that it is indeed possible to identify styles accurately enough to create valuable applications. New features and techniques will certainly serve to increase the value of this field in the future.

Bibliography

[1] Jean-Julien Aucouturier and Francois Pachet. Improving timbre similarity: How

high is the sky? Journal of Negative Results in Speech and Audio Sciences, 1(1),

2004.

[2] A. Berenzweig, B. Logan, D. Ellis, and B. Whitman. A large-scale evaluation of

acoustic and subjective music similarity measures. The Computer Music Journal,

28(2):63–76, July 2004.

[3] James Bergstra, Norman Casagrande, Dumitru Erhan, Douglas Eck, and Balazs

Kegl. Meta-features and adaboost for music classification. Submitted to Machine

Learning, December 2005.

[4] David Cope. Experiments in Musical Intelligence. A-R Editions, Madison, Wis-

consin, 1996.

[5] D. Ellis, A. Berenzweig, and B. Whitman. The uspop2002 pop music data set.

Technical report, 2003. http://labrosa.ee.columbia.edu/projects/musicsim/

uspop2002.html.

[6] B. Logan, D. Ellis, and A. Berenzweig. Toward evaluation techniques for music

similarity. In Proceedings of the 4th International Symposium on Music Information

Retrieval (ISMIR’03), Washington, D.C., USA, 2003.

[7] Beth Logan. Mel frequency cepstral coefficients for music modeling. In Proceedings

of the First International Symposium on Music Information Retrieval (ISMIR),

Plymouth, Massachusetts, October 2000.

[8] M. I. Mandel and D. P. Ellis. Song-level features and support vector machines for

music classification. MIREX genre classification contest, 2005.

[9] Ian Nabney. Netlab. Technical report, 2003. http://www.ncrg.aston.ac.uk/

netlab/.

[10] Rui Xu and Donald Wunsch II. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, May 2005.

Appendix A

Styles with Four to Ten Artists

Big_Beat
Blue-Eyed_Soul
British_Blues
British_Invasion
British_Metal
British_Psychedelia
Country-Pop
Dirty_South
Disco
East_Coast_Rap
Euro-Pop
Funk
Funk_Metal
Gangsta_Rap
Hair_Metal
Industrial_Metal
Jazz-Rock
Latin_Pop
Neo-Traditionalist_Country
New_Jack_Swing
Quiet_Storm
Rap-Rock
Ska-Punk
Third_Wave_Ska_Revival
West_Coast_Rap
