Musical Instrument Timbres Classification with Spectral Features
Giulio Agostini
Dipartimento di Scienze dell’Informazione, Università degli Studi di Milano, Via Comelico 39, 20135 Milano, Italy
Email: giulio@despammed.com
Maurizio Longari
Dipartimento di Scienze dell’Informazione, Università degli Studi di Milano, Via Comelico 39, 20135 Milano, Italy
Email: longari@dsi.unimi.it
Emanuele Pollastri
Dipartimento di Scienze dell’Informazione, Università degli Studi di Milano, Via Comelico 39, 20135 Milano, Italy
Email: pollastri@dsi.unimi.it
A set of features is evaluated for the recognition of musical instruments from monophonic musical signals. Aiming at a compact representation, the adopted features regard only spectral characteristics of sound and are limited in number. On top of these descriptors, various classification methods are implemented and tested. Over a dataset of 1007 tones from 27 musical instruments, support vector machines and quadratic discriminant analysis show comparable results, with success rates close to 70%. Canonical discriminant analysis never yielded remarkable results, while nearest neighbours performed about average among the employed classifiers. Strings have been the most misclassified instrument family, while very satisfactory results have been obtained with brass and woodwinds. The most relevant features prove to be the inharmonicity, the spectral centroid, and the energy contained in the first partial.
Keywords and phrases: timbre classification, content-based audio indexing/searching, pattern recognition, audio feature extraction.
the full scale reference value), we find a rough estimate of the boundaries of the events. A finer analysis is then conducted with a 5-ms frame to determine the actual onsets and offsets; in particular, we look for a 6-dB step around every rough estimate. Through pitch detection, we achieve a refinement of the signal segmentation, identifying notes that are not well defined by the energy curve or that are possibly played legato. Pitch is also an input to the calculation of some spectral features. The pitch-tracking algorithm employed follows the one presented in [25], so it will not be described here. The output of the pitch tracker is the average value (in hertz) of each note hypothesis, a frame-by-frame value of pitch, and a confidence value that measures the uncertainty of the estimate.

3.2. Spectral features

We collect a total of 18 descriptors for each tone isolated through the procedure just described. More precisely, we compute the mean and standard deviation of 9 features over the length of each tone. The zero-crossing rate is measured directly from the waveform as the number of sign inversions within a 46-ms window. Then, the harmonic structure of the signal is evaluated through a short-time Fourier analysis with half-overlapping windows. The size of the analysis window is variable in order to have a frequency resolution of at least 1/24 of an octave, even for the lowest tones (1024–8192 samples, for tones sampled at 44100 Hz). The signal is first analyzed at a low frequency resolution; the analysis is repeated with finer resolutions until a sufficient number of harmonics is estimated. This process is controlled by the pitch-tracking algorithm [25]. From the harmonic analysis, we calculate spectral features such as the inharmonicity and the

Harmonic energy skewness = \sum_{i=1}^{4} \frac{p_i - i \cdot f_0}{i \cdot f_0} \cdot E_{p_i},  (3)

where E_{p_i} is the percentage of energy contained in the respective partial.

4. CLASSIFICATION TECHNIQUES

In this section, we provide a brief survey of the most popular classification techniques, comparing different approaches. As an abstract task, pattern recognition aims at associating a vector y in a p-dimensional space (the feature space) to a class, given a dataset (or training set) of N vectors d_i. Since each of these observations belongs to a known class, among the c available, this is said to be a supervised classification. In our instance of the problem, the features extracted are the dimensions, or variables, and the instrument labels are the classes. The vector y represents the tone played by an unknown musical instrument.

4.1. Discriminant analysis

The multivariate statistical approach to the question [26] has a long tradition of research. Considering y and the d_i as realizations of random vectors, the probability of a misclassification of a classifier g can be expressed as a function of the probability density functions (PDFs) f_i(·) of each class,

\gamma_g = 1 - \sum_{i=1}^{c} \pi_i \int_{R_i} f_i(y) \, dy,  (4)
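The decision rule underlying (4) — assign y to the class i that maximizes the weighted density π_i f_i(y) — can be illustrated with a small plug-in sketch in which the class PDFs are Gaussians estimated from data. The data, priors, and dimensionality below are synthetic and purely illustrative, not the paper's features; with class-specific covariances the resulting boundaries are quadratic, which anticipates the QDA classifier discussed next.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic 2-D classes with different covariances (illustration only).
X0 = rng.normal([0.0, 0.0], [1.0, 0.3], size=(200, 2))
X1 = rng.normal([4.0, 4.0], [0.3, 1.0], size=(200, 2))

def fit_gaussian(X):
    """Plug-in estimates of the class mean and covariance."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def log_score(y, m, S, prior):
    """log(prior * N(y; m, S)); the quadratic term in d gives QDA its name."""
    d = y - m
    _, logdet = np.linalg.slogdet(S)
    return np.log(prior) - 0.5 * (logdet + d @ np.linalg.solve(S, d))

params = [fit_gaussian(X0), fit_gaussian(X1)]
priors = [0.5, 0.5]

def classify(y):
    """Assign y to the class with the largest weighted (log) density."""
    return int(np.argmax([log_score(y, m, S, p)
                          for (m, S), p in zip(params, priors)]))

print(classify(np.array([0.5, 0.2])))  # near the first class centre
print(classify(np.array([3.8, 4.1])))  # near the second class centre
```
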
using the N_i observations d_{ij} available for the ith class from the training sequence. It can be shown that, in this case, the hypersurfaces delimiting the regions of classification (in which the associated class is the same) are quadratic forms, hence the name of the classifier.

Although this is the optimal classifier for normal mixtures, it could lead to suboptimal error rates in practical cases for two reasons. First, classes may depart sensibly from the assumption of normality. A more subtle source of errors is the fact that, with this method, the actual distributions remain unknown, since we only have the best estimates of them, based on a finite training set.

4.1.2 Canonical discriminant analysis

The canonical discriminant analysis (CDA) is a generalization of the linear discriminant analysis, which separates two classes (c = 2) in a plane (p = 2) by means of a line. This line is found by maximizing the separation of the two one-dimensional distributions that result from the projection of the two bivariate distributions on the direction normal to the line of separation sought.

In a p-dimensional space, using a similar criterion, we can separate c ≥ 2 classes with hyperplanes by maximizing, with respect to a generic vector a, the figure of merit

D(a) = \frac{a^T S_B a}{a^T S_W a},  (7)

where

S_B = \frac{1}{N} \sum_{j=1}^{c} N_j (m_j - m)(m_j - m)^T  (8)

is the between-class scatter matrix, and

S_W = \frac{1}{N} \sum_{i=1}^{c} \sum_{j=1}^{N_i} (d_{ij} - m_i)(d_{ij} - m_i)^T  (9)

is the within-class scatter matrix.

4.2. Nearest neighbours

metric and the number of nearest samples considered (k). An important drawback is its poor ability to abstract from the data, since only local information is taken into account.

4.3. Support vector machines

The support vector machines (SVM) are a recently developed approach to the learning problem [27]. The aim is to find the hyperplane that best separates observations belonging to different classes. This is done by satisfying a generalization bound which maximizes the geometric margin between the sample data and the hyperplane, as briefly detailed below.

Suppose we have a set of linearly separable training samples d_1, ..., d_N, with d_i ∈ R^p. We refer to the simplified binary classification problem (two classes, c = 2), in which a label l_i ∈ {−1, 1} is assigned to the ith sample, indicating the class it belongs to. The hyperplane f(y) = (w · y) + b that separates the data can be found by minimizing the 2-norm of the weight vector w,

\min_{w,b} \; w \cdot w  (10)

subject to the following class separation constraints:

l_i (w \cdot d_i + b) \ge 1, \quad 1 \le i \le N.  (11)

This approach is called the maximal margin classifier. The optimal solution can be viewed in a dual form by applying the Lagrange theory and imposing the conditions of stationarity. The objective and decision functions can thus be written in terms of the Lagrange multipliers \alpha_i as

L(w, b, \alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} l_i l_j \alpha_i \alpha_j (d_i \cdot d_j),

f(y) = \sum_{i=1}^{N} l_i \alpha_i (d_i \cdot y) + b.  (12)
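A minimal numeric sketch of the dual formulation (12) on synthetic points: here the bias b is folded into an augmented constant coordinate (a common simplification that removes the equality constraint from the dual, at the cost of slightly changing the margin being maximized), and the constraint α_i ≥ 0 is enforced by projected gradient ascent. This is an illustration of the technique, not the solver used in the paper.

```python
import numpy as np

# Tiny linearly separable training set (synthetic, for illustration).
D = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
l = np.array([1.0, 1.0, -1.0, -1.0])

# Fold the bias b into w by appending a constant feature to each sample.
Da = np.hstack([D, np.ones((len(D), 1))])
K = Da @ Da.T                        # Gram matrix of dot products d_i . d_j

# Projected gradient ascent on L(alpha) = sum_i alpha_i - 1/2 alpha^T H alpha,
# with H_ij = l_i l_j K_ij and the feasibility constraint alpha_i >= 0.
H = np.outer(l, l) * K
alpha = np.zeros(len(D))
eta = 0.01
for _ in range(5000):
    grad = 1.0 - H @ alpha           # dL/dalpha_i = 1 - l_i sum_j l_j alpha_j K_ij
    alpha = np.maximum(0.0, alpha + eta * grad)   # project back onto alpha >= 0

def f(y):
    """Decision function f(y) = sum_i l_i alpha_i (d_i . y), bias included."""
    ya = np.append(y, 1.0)
    return float(np.sum(l * alpha * (Da @ ya)))

print(f(np.array([2.5, 0.5])) > 0)    # lands on the positive side
print(f(np.array([-2.5, -0.5])) < 0)  # lands on the negative side
```
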
Table 1: The musical instruments considered, grouped into pizzicati and sustained families.

Pizzicati
  Piano et al.:   Piano, Harpsichord, Classic guitar, Harp
  Rock strings:   Electric bass, Elect. bass slap, Electric guitar, Dist. elect. guitar
  Pizz. strings:  Violin pizzicato, Viola pizzicato, Cello pizzicato, Doublebass pizz.

Sustained
  Strings:        Violin bowed, Viola bowed, Cello bowed, Doublebass bowed
  Woodwinds:      Flute, Organ, Accordion, Bassoon, Oboe, English horn, E clarinet, Sax
  Brass:          C trumpet, French horn, Tuba
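The one-vs-one multiclass scheme adopted for SVM in Section 5 — c(c − 1)/2 binary classifiers whose pairwise decisions are combined by voting — can be sketched as follows. For brevity, a nearest-class-mean rule stands in for the binary SVMs, and the "instrument" classes are synthetic; only the pairing and voting structure is the point.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

# Synthetic stand-ins for instrument classes: c clusters of 2-D feature vectors.
c = 4
centers = np.array([[0, 0], [5, 0], [0, 5], [5, 5]], dtype=float)
train = {i: centers[i] + rng.normal(0, 0.5, size=(30, 2)) for i in range(c)}
means = {i: train[i].mean(axis=0) for i in range(c)}

# One binary classifier per unordered class pair: c(c-1)/2 of them.
pairs = list(combinations(range(c), 2))

def classify(y):
    """Each pairwise classifier votes; the class with most wins is returned."""
    votes = np.zeros(c, dtype=int)
    for i, j in pairs:
        winner = i if np.linalg.norm(y - means[i]) < np.linalg.norm(y - means[j]) else j
        votes[winner] += 1
    return int(np.argmax(votes))

print(len(pairs))                      # 6 binary classifiers for c = 4
print(classify(np.array([4.8, 5.2])))  # nearest cluster wins all its pairings
```
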
best linear hyperplane can always be found, but, in practice, a solution can be heuristically obtained. Thus, the problem is now to find a kernel function that separates the observations well. Not just any function is a kernel function; it must be symmetric, it must satisfy the Cauchy-Schwarz inequality, and it must satisfy the condition imposed by Mercer's theorem. The simplest example of a kernel function is the dot kernel, which maps the input space directly into the transformed space. Radial basis functions (RBF) and polynomial kernels are widely used in image recognition, speech recognition, handwritten digit recognition, and protein homology detection problems.

5. EXPERIMENT

The adopted dataset has been extracted from the MUMS (McGill University Master Samples) CDs [28], a library of isolated sample tones from a wide number of musical instruments, played with several articulation styles and covering the entire pitch range. We considered 30 musical instruments ranging from orchestral sounds (strings, woodwinds, brass) to pop/electronic instruments (bass, electric, and distorted guitar). An extended collection of musical instrument tones is essential for training and testing classifiers for two distinct reasons. First, methods that require an estimate of the covariance matrices, namely, QDA and CDA, must compute it with at least p + 1 linearly independent observations for each class, p being the number of features extracted, so that the matrices are positive definite. In addition, we need to avoid the curse of dimensionality discussed on page 6; a rich collection of samples therefore brings the expected error rate down. It follows from the first observation that we could not include musical instruments with fewer than 19 tones in the training set. This is why we collapsed the family of saxophones (alto, soprano, tenor, baritone) into a single instrument class.² Having said that, the total number of musical instruments considered was 27, but the classification results reported in Section 6 can be claimed to hold for a set of 30 instruments (Table 1).

The audio files have been analyzed by the feature extraction algorithms. If the accuracy of a pitch estimate is below a predefined threshold, the corresponding tone is rejected from the training set. Following this procedure, the number of tones accepted for training/testing is 1007 in total. Various classification techniques have been implemented and tested: CDA, QDA, k-NN, and SVM. k-NN has been tested with k = 1, 3, 5, 7 and with 3 different distance metrics (1-norm, 2-norm, 3-norm). In one experiment, we modified the input space through a kernel function. For SVM, we adopted a software tool developed at the Royal Holloway University of London [29]. A number of kernel functions have been considered (dot product, simple polynomial, RBF, linear splines, regularized Fourier). Input values have been normalized independently, and we chose a multiclass classification method that trains c(c − 1)/2 binary classifiers, where c is the number of instruments. Recognition rates in the classification of instrument families have been calculated by grouping results from the recognition of individual instruments. All error rate estimates reported in Section 6 have been computed using a leave-one-out procedure.

² We observe that the recognition of the single instrument within the sax class can be easily accomplished by inspecting the pitch, since the ranges do not overlap.

6. RESULTS

The experiments illustrated have been evaluated by means of the overall success rate and confusion matrices. In the first case, results have been calculated as the ratio of correctly estimated to actual stimuli. Confusion matrices represent a valid method for inspecting performances from a qualitative point of view. Although we put the emphasis on the instrument level, we have also grouped instruments belonging to the same family (strings, brass, woodwinds, and the like), extending the Sachs taxonomy [30] with the inclusion of rock strings (deep bass, electric guitar, distorted guitar). Figure 2 provides a graphical representation of the best results both at the instrument level (17, 20, and 27 instruments) and at the family level (pizzicato-sustained, instrument family).

SVM with an RBF kernel was the best classifier in the recognition of individual instruments, with a success rate
10 EURASIP Journal on Applied Signal Processing
of 69.7%, 78.6%, and 80.2% for, respectively, 27, 20, and 17 instruments. In comparison with the work by Marques and Moreno [15], where 8 instruments were recognized with an error rate of 30%, the SVM implemented in our experiments had an error rate of 19.8% in the classification of 17 instruments. The second best score was achieved by QDA, with success rates close to SVM's performances. In the case of instrument family recognition and sustained/pizzicato classification, QDA overcame all other classifiers with a success rate of 81%. Success rates with SVM at the family and pizzicato/sustained levels should be carefully evaluated, since we did not train a new SVM for each family (i.e., grouping instruments by family or pizzicato/sustained). Thus, we have to consider results for pizzicato/sustained discrimination with this classifier as merely indicative, although success rates with all classifiers are comparable for this task.

CDA never obtained remarkable results, ranging from 71.2% with 17 instruments to 60.3% with 27 instruments. In spite of their simplicity, the k-NN classifiers performed quite close to QDA. Among them, 1-NN with the 1-norm distance metric obtained the best performance. Since k-NN has been employed in a number of experiments, we observe that our results are similar to those previously reported, for example, in [31]. Using a kernel function to modify the input space did not bring any advantage (71% with kernel and 74.5% without kernel for 20 instruments).

A deeper analysis of the results achieved with SVM and QDA (see Figures 3, 4, 5, 6) showed that strings have been the most misclassified family, with 39.52% and 46.75% of individual instruments identified correctly on average, respectively, for SVM and QDA. Leaving out the strings samples, the success rates for the remaining 19 instruments grow to some 80% for the classification of individual instruments. Since this behaviour has been registered for both pizzicati and sustained strings, we should conclude that our features are not suitable for describing such instruments. In particular, SVM classifiers seem to be unable to recognize the doublebass and the pizzicato strings, for which results have been as low as some 7% and 30%; instead, sustained strings have been identified correctly in 64% of cases, conforming to the overall rate. QDA classifiers did not show a considerable difference in performance between pizzicato and sustained strings. Moreover, most of the misclassifications have been within the same family. This fact explains the slight advantage of QDA in the classifications at the family level.

The recognition of woodwinds, brass, and rock strings has been very successful (94%, 96%, 89% with QDA), without noticeable differences between QDA and SVM. Misclassifications within these families reveal strong and well-known subjective evidence. For example, bassoon has been estimated as tuba (21% with QDA), oboe as flute (11% with QDA), and deep bass as deep bass slap (24% with QDA). The detection of stimuli from the family of piano and other instruments is definitely more spread around the correct family, with success rates for the detection of this family close to 70% with SVM and to 64% with QDA.

We have also calculated a list of the most relevant features through the forward selection procedure detailed in [32]. The values reported are the normalized versions of the statistics on which the procedure is based, and can be interpreted as the amount of information added by each feature. They cannot be strictly decreasing, because a feature might bring more information only jointly with other features. For 27 instruments, the most informative feature has been the mean of the inharmonicity, followed by the mean and standard deviation of the spectral centroid and the mean of the energy contained in the first partial (see Table 2).

In one of our experiments, we have also introduced a machine-built decisional tree. We used a hierarchical clustering algorithm [33] to build the structure. CDA or QDA methods have been employed at each node of the hierarchy. Even with these techniques, though, we could not improve the error rates, thus confirming the previous findings [13].
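The evaluation protocol described above — leave-one-out error estimation with a confusion matrix, here paired with the well-performing 1-NN/1-norm classifier — can be sketched on synthetic data. The classes below are deliberately well separated, so the sketch's success rate is trivially high; real timbre features behave far less cleanly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic dataset: 3 well-separated "instrument" classes in 2-D.
c, n = 3, 20
centers = np.array([[0, 0], [10, 0], [0, 10]], dtype=float)
X = np.vstack([centers[k] + rng.normal(0, 0.5, size=(n, 2)) for k in range(c)])
labels = np.repeat(np.arange(c), n)

def one_nn_l1(train_X, train_y, y):
    """1-NN with the 1-norm (city-block) distance."""
    d = np.abs(train_X - y).sum(axis=1)
    return train_y[np.argmin(d)]

# Leave-one-out: each tone is classified by a model trained on all the others.
confusion = np.zeros((c, c), dtype=int)
for i in range(len(X)):
    mask = np.arange(len(X)) != i
    pred = one_nn_l1(X[mask], labels[mask], X[i])
    confusion[labels[i], pred] += 1

# Success rate: correctly estimated stimuli over actual stimuli.
success_rate = np.trace(confusion) / len(X)
print(confusion)
print(success_rate)
```
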
[Confusion matrix entries omitted: column alignment lost in extraction. Family success (%): Piano et al. 64.00, Rock strings 88.64, Pizz. strings 79.04. Pizzicato success (%): 91.27.]

Figure 3: Confusion matrix for the classification of individual instruments in the family of pizzicati with QDA.
[Confusion matrix entries omitted: column alignment lost in extraction. Family success (%): Strings 62.92, Woodwinds 94.08, Brass 95.82. Sustained success (%): 93.07.]

Figure 4: Confusion matrix for the classification of individual instruments in the family of sustained with QDA.
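The family-level rates reported with these matrices are obtained by grouping instrument-level predictions, as described in Section 5. A toy sketch (with hypothetical instrument labels) shows why confusions that stay within a family leave the family rate intact even when every instrument-level prediction is wrong:

```python
import numpy as np

# Hypothetical instrument-to-family mapping (illustrative labels only).
family = {"violin bowed": "strings", "viola bowed": "strings",
          "flute": "woodwinds", "oboe": "woodwinds",
          "c trumpet": "brass", "tuba": "brass"}

# Hypothetical stimuli and predictions: every confusion is within-family.
true_inst = ["violin bowed", "viola bowed", "flute", "oboe", "c trumpet", "tuba"]
pred_inst = ["viola bowed", "violin bowed", "oboe", "flute", "tuba", "c trumpet"]

# Instrument-level success rate: all predictions are wrong here.
inst_rate = np.mean([t == p for t, p in zip(true_inst, pred_inst)])

# Family-level success rate: group each prediction by its family first.
fam_rate = np.mean([family[t] == family[p] for t, p in zip(true_inst, pred_inst)])

print(inst_rate)  # 0.0
print(fam_rate)   # 1.0
```
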
7. DISCUSSION AND FURTHER WORK

A thorough evaluation of the performances illustrated in Section 6 reveals the power of SVM in the task of timbre classification, thus confirming its successful results in other fields (e.g., face detection, text classification). Furthermore, in our experiments we employed widely used kernel functions, so there is room for improvement by adopting dedicated kernels. However, QDA performed similarly in the recognition of individual instruments, with errors closer to the way humans classify sounds. It was highlighted that much of the QDA errors are within the correct family, while
[Confusion matrix entries omitted: column alignment lost in extraction. Family success (%): Piano et al. 69.41, Rock strings 85.24, Pizz. strings 57.09. Pizzicato success (%): 86.29.]

Figure 5: Confusion matrix for the classification of individual instruments in the family of pizzicati with SVM.
[Confusion matrix entries omitted: column alignment lost in extraction. Family success (%): Strings 68.60, Woodwinds 93.70, Brass 91.51. Sustained success (%): 91.04.]

Figure 6: Confusion matrix for the classification of individual instruments in the family of sustained with SVM.
SVM shows errors scattered throughout the confusion matrices. Since QDA is the optimal classifier under multivariate normality hypotheses, we should conclude that the features we extracted from isolated tones follow such a distribution. To validate this hypothesis, a series of statistical tests is under way on the dataset.

As anticipated, sounds that exhibit a predominantly percussive nature are not well characterized by a set of features based solely on spectral properties, while sustained sounds like brass are captured very well. Our experiments have demonstrated that classifiers are not able to overcome this difficulty. Moreover, the closeness of the performances between
Multimedia and Expo, vol. I, pp. 452–455, New York, NY, USA, August 2000.
[19] S. Pfeiffer, S. Fischer, and W. E. Effelsberg, “Automatic audio content analysis,” in Proc. ACM Multimedia, pp. 21–30, Boston, Mass, USA, November 1996.
[20] T. Zhang and C.-C. Jay Kuo, Eds., Content-Based Audio Classification and Retrieval for Audiovisual Data Parsing, Kluwer Academic Publishers, Boston, Mass, USA, February 2001.
[21] M. Casey, “General sound classification and similarity in MPEG-7,” Organized Sound, vol. 6, no. 2, pp. 153–164, 2001.
[22] L. Lu, H. Jiang, and H. Zhang, “A robust audio classification and segmentation method,” in Proc. ACM Multimedia, pp. 203–211, Ottawa, Canada, October 2001.
[23] E. Scheirer and M. Slaney, “Construction and evaluation of a robust multifeature speech/music discriminator,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. II, pp. 1331–1334, Munich, Germany, April 1997.
[24] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer-Verlag, New York, NY, USA, 1996.
[25] G. Haus and E. Pollastri, “A multimodal framework for music inputs,” in Proc. ACM Multimedia, pp. 382–384, Los Angeles, Calif, USA, November 2000.
[26] B. Flury, A First Course in Multivariate Statistics, Springer-Verlag, New York, NY, USA, 1997.
[27] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, UK, 2000.
[28] F. Opolko and J. Wapnick, McGill University Master Samples, McGill University, Montreal, Quebec, Canada, 1987.
[29] C. Saunders, M. O. Stitson, J. Weston, L. Bottou, B. Schölkopf, and A. Smola, “Support vector machine reference manual,” Tech. Rep., Royal Holloway Department of Computer Science, Computer Learning Research Centre, University of London, Egham, London, UK, 1998, http://svm.dcs.rhbnc.ac.uk/.
[30] E. M. Hornbostel and C. Sachs, “Systematik der Musikinstrumente. Ein Versuch,” Zeitschrift für Ethnologie, vol. 46, no. 4-5, pp. 553–590, 1914 [English translation by A. Baines and K. P. Wachsmann, “Classification of musical instruments,” Galpin Society Journal, vol. 14, pp. 3–29, 1961].
[31] I. Fujinaga and K. MacMillan, “Realtime recognition of orchestral instruments,” in Proc. International Computer Music Conference, Berlin, Germany, August–September 2000.
[32] G. J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, John Wiley & Sons, New York, NY, USA, 1992.
[33] H. Späth, Cluster Analysis Algorithms, E. Horwood, Chichester, UK, 1980.
[34] A. Srinivasan, D. Sullivan, and I. Fujinaga, “Recognition of isolated instrument tones by conservatory students,” in Proc. International Conference on Music Perception and Cognition, pp. 17–21, Sydney, Australia, July 2002.

Maurizio Longari was born in 1973. In 1998, he received his M.S. degree in information technology from Università degli Studi di Milano, Milan, Italy, LIM (Laboratorio di Informatica Musicale). In January 2000, he started his research activity as a Ph.D. student at the Dipartimento di Scienze dell’Informazione in the same university. His main research interests are symbolic musical representation, web/music applications, and multimedia databases. He is a member of the IEEE SA Working Group on Music Application of XML.

Emanuele Pollastri received his M.S. degree in electrical engineering from Politecnico di Milano, Milan, Italy, in 1998. He is a Ph.D. candidate in computer science at Università degli Studi di Milano, Milan, Italy, where he is expected to graduate at the beginning of 2003 with a thesis entitled “Processing singing voice for music retrieval.” His research interests include audio analysis, understanding and classification, digital signal processing, music retrieval, and music classification. He is cofounder of Erazero S.r.l., a leading Italian multimedia company. He worked as a software engineer for speech recognition applications at IBM Italia S.p.A., and he was a consultant for a number of companies in the field of professional audio equipment.