
EURASIP Journal on Applied Signal Processing 2003:1, 5–14

© 2003 Hindawi Publishing Corporation

Musical Instrument Timbres Classification with Spectral Features

Giulio Agostini
Dipartimento di Scienze dell’Informazione, Università degli Studi di Milano, Via Comelico 39, 20135 Milano, Italy
Email: guilio@despammed.com

Maurizio Longari
Dipartimento di Scienze dell’Informazione, Università degli Studi di Milano, Via Comelico 39, 20135 Milano, Italy
Email: longari@dsi.unimi.it

Emanuele Pollastri
Dipartimento di Scienze dell’Informazione, Università degli Studi di Milano, Via Comelico 39, 20135 Milano, Italy
Email: pollastri@dsi.unimi.it

Received 10 May 2002 and in revised form 29 August 2002

A set of features is evaluated for recognition of musical instruments out of monophonic musical signals. Aiming to achieve a
compact representation, the adopted features regard only spectral characteristics of sound and are limited in number. On top
of these descriptors, various classification methods are implemented and tested. Over a dataset of 1007 tones from 27 musical
instruments, support vector machines and quadratic discriminant analysis show comparable results, with success rates close to 70%.
Canonical discriminant analysis never achieved remarkable results, while nearest neighbours performed at an intermediate level
among the employed classifiers. Strings have been the most misclassified instrument family, while very satisfactory
results have been obtained with brass and woodwinds. The most relevant features are demonstrated to be the inharmonicity, the
spectral centroid, and the energy contained in the first partial.
Keywords and phrases: timbre classification, content-based audio indexing/searching, pattern recognition, audio feature extraction.

1. INTRODUCTION

This paper addresses the problem of musical instrument classification from audio sources. The need for this application strongly arises in the context of multimedia content description. A great number of commercial applications will be available soon, especially in the field of multimedia databases, such as automatic indexing tools, intelligent browsers, and search engines with querying-by-content capabilities.

The goal of automatic music-content understanding and description is not new and it is traditionally divided into two subtasks: pitch detection, or the extraction of score-like attributes from an audio signal (i.e., notes and durations), and sound-source recognition, or the description of the sounds involved in an excerpt of music [1]. The former has received a lot of attention and some recent experiments are described in [2, 3]; the latter has not been studied so much because of the lack of knowledge about human perception and cognition of sounds. This work belongs to the second area and it is devoted to a more modest, but nevertheless important, goal: automatic timbre classification of audio sources containing no more than one instrument at a time (the source must be monotimbral and monophonic).

Focusing on this area, the forthcoming MPEG-7 standard should provide a list of metadata for multimedia content [4]; nevertheless, two important aspects still need to be explored further. First, the best features for a particular task must be identified. Then, once a set of descriptors has been obtained, some classification algorithms should be employed to organize metadata into meaningful categories. All these facets will be considered by the present work with the objective of automatic timbre classification for sound databases.

This paper is organized as follows. First, we give some background information on the notion of timbre and previous related works; then, some details about feature properties and calculation are presented. A brief description of various classification techniques is followed by the experiments. Finally, results are presented and compared to previous studies on the same topic. Discussion and further work close the paper.

Figure 1: Description of the feature extraction process. (Block diagram: bandpass filter (80 Hz–5 kHz) → silence detection and rough boundary estimation → pitch tracking → harmonic estimation, using Window1 (46 ms), Window2 (5 ms), and Window3 (variable size); extracted descriptors: zero-crossing rate, centroid, bandwidth, harmonic energy percentages, inharmonicity, and harmonic energy skewness.)

2. BACKGROUND

Timbre differs from the other sound attributes, namely, pitch, loudness, and duration, because it is ill-defined; in fact, it cannot be directly associated with a particular physical quantity. The American National Standards Institute (ANSI) defines timbre as "that attribute of auditory sensation in terms of which a listener can judge that two sounds similarly presented and having the same loudness and pitch are dissimilar" [5]. The uncertainty about the notion of timbre is reflected by the huge amount of studies that have tackled this problem. Since the first studies by Grey [6], it was clear that we are dealing with a multidimensional attribute, which includes spectral and temporal features. Therefore, early works on timbre recognition focused on the exploration of possible relationships between the perceptual and the acoustic domains. The first experiments on sound classification are illustrated in [7, 8, 9], where a limited number of musical instruments (eight instruments or less) has been recognized, implementing a basic set of features. Other works explored issues about the relationship between acoustic features and sound properties [10, 11], justifying their choice in terms of musical relevance, brightness, spectral synchronicities, harmonicity, and so forth. Recently, the diffusion of multimedia databases has brought to the fore the problem of musical instrument identification out of a fragment of audio signal. In this context, deep investigations on sound classification as a pattern recognition problem began to appear in the last few years [12, 13, 14, 15, 16, 17]. These works emphasized the importance of testing different classifiers and sets of features with datasets of dimension comparable to real-world applications. Further works related to timbre classification have dealt with the more general problem of audio segmentation [18, 19], especially with the purpose of automatic (video) scene segmentation [20]. Finally, the introduction of content management applications like the ones envisioned by MPEG-7 boosted the interest in the topic [4, 21].

3. FEATURE EXTRACTION

A considerable number of features is currently available in the literature, each one describing some aspects of audio content [22, 23]. In the digital domain, features are usually calculated from a window of samples, which is normally very short compared to the total duration of a tone. Thus, we must face the problem of summarizing their temporal evolution into a small set of values. Mean, standard deviation, skewness, and autocorrelation have been the preferred strategies for their simplicity, but more advanced methods like hidden Markov models could be employed, as illustrated in [21, 22]. By combining these time-spanning statistics with the known features, an impressive number of variables can be extracted from each sound. The researcher, though, has to carefully select them in order to both keep the time required for the extraction to a minimum and, more importantly, to avoid incurring the so-called curse of dimensionality. This fanciful term refers to a well-known result of classification theory [24] which states that, as the number of variables grows, in order to maintain the same error rate, the classifier has to be trained with an exponentially growing training set. The process of feature extraction is crucial; it should perform efficient data reduction while preserving the appropriate amount of information. Thus, sound analysis techniques must be tailored to the temporal and spectral evolution of musical signals. As will be demonstrated in Section 6, a set of features related mainly to the harmonic properties of sounds allows a simplified representation of the data. However, lacking features for the discrimination between sustained sounds and percussive sounds, a classification solely based on spectral properties has some drawbacks (see Section 7 for details).

The extraction of descriptors relies on a number of preliminary steps: temporal segmentation of the signal, detection of the fundamental frequency, and the estimation of the harmonic structure (Figure 1).

3.1. Audio segmentation

The aim of the first stage is twofold. First of all, the audio signal must be segmented into a sequence of meaningful events. We do not make any assumptions about the content of each event, which corresponds to an isolated tone in the ideal case. Subsequently, a decision based on the pitch estimation is taken for a fine adjustment of event boundaries. The output of this stage is a list of nonsilent events (starting and ending points) and estimated pitch values.

In the experiment reported in this paper, we assume to deal with audio signals characterized by a low level of noise and a good dynamic range. Therefore, a simple procedure based on energy evaluation is expected to perform satisfactorily in the segmentation task. The signal is first processed with a bandpass Chebyshev filter of order five; cutoff frequencies are set to 80 Hz, to filter out noise due to unwanted vibrations (for instance, oscillation of the microphone stand), and 5000 Hz, corresponding to E8 in a tempered musical scale. After windowing the signal (46 ms Hamming), a root mean square (RMS) energy curve is computed with the same frame size. By comparing the energy to an absolute threshold empirically set to −50 dB (0 dB being the full scale reference value), we find a rough estimate of the boundaries of the events.
A finer analysis is then conducted with a 5-ms frame to determine actual on/offsets; in particular, we look for a 6-dB step around every rough estimate. Through pitch detection, we achieve a refinement of signal segmentation, identifying notes that are not well defined by the energy curve or that are possibly played legato. Pitch is also input to the calculation of some spectral features. The pitch-tracking algorithm employed follows the one presented in [25], so it will not be described here. The output of the pitch tracker is the average value (in hertz) of each note hypothesis, a frame-by-frame value of pitch, and a confidence value that measures the uncertainty of the estimate.
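As a rough illustration of this segmentation stage, the sketch below computes a frame-wise RMS energy curve on the band-limited signal and thresholds it at −50 dB to obtain coarse event boundaries. It assumes NumPy/SciPy and a 44.1 kHz mono input; the function and parameter names are illustrative rather than taken from the authors' implementation, and the 5-ms, 6-dB refinement and the pitch-based adjustment are omitted.

```python
# Minimal sketch of the energy-based segmentation described in Section 3.1.
# Assumes a mono float signal at 44.1 kHz; names and ripple value are illustrative.
import numpy as np
from scipy.signal import cheby1, sosfiltfilt

def rough_boundaries(x, sr=44100, frame_ms=46, threshold_db=-50.0):
    """Return (start, end) sample indices of non-silent events."""
    # Order-5 Chebyshev bandpass, 80 Hz - 5 kHz, as in the text.
    sos = cheby1(5, 0.5, [80, 5000], btype="bandpass", fs=sr, output="sos")
    y = sosfiltfilt(sos, x)

    # Frame-wise RMS energy in dB relative to full scale (0 dB).
    hop = int(sr * frame_ms / 1000)
    n_frames = len(y) // hop
    frames = y[: n_frames * hop].reshape(n_frames, hop)
    rms_db = 20 * np.log10(np.sqrt(np.mean(frames ** 2, axis=1)) + 1e-12)

    # Events are maximal runs of frames above the absolute threshold.
    active = rms_db > threshold_db
    edges = np.diff(active.astype(int))
    starts = np.where(edges == 1)[0] + 1
    ends = np.where(edges == -1)[0] + 1
    if active[0]:
        starts = np.r_[0, starts]
    if active[-1]:
        ends = np.r_[ends, n_frames]
    return [(s * hop, e * hop) for s, e in zip(starts, ends)]
```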
3.2. Spectral features

We collect a total of 18 descriptors for each tone isolated through the procedure just described. More precisely, we compute the mean and standard deviation of 9 features over the length of each tone. The zero-crossing rate is measured directly from the waveform as the number of sign inversions within a 46 ms window. Then, the harmonic structure of the signal is evaluated through a short-time Fourier analysis with half-overlapping windows. The size of the analysis window is variable in order to have a frequency resolution of at least 1/24 of an octave, even for the lowest tones (1024–8192 samples, for tones sampled at 44100 Hz). The signal is first analyzed at a low frequency resolution; the analysis is repeated with finer resolutions until a sufficient number of harmonics is estimated. This process is controlled by the pitch-tracking algorithm [25]. From the harmonic analysis, we calculate the spectral centroid and bandwidth according to the following equations:

  \mathrm{Centroid} = \frac{\sum_{f=f_{\min}}^{f_{\max}} f \cdot E(f)}{\sum_{f=f_{\min}}^{f_{\max}} E(f)}, \qquad
  \mathrm{Bandwidth} = \frac{\sum_{f=f_{\min}}^{f_{\max}} |\mathrm{Centroid} - f| \cdot E(f)}{\sum_{f=f_{\min}}^{f_{\max}} E(f)},     (1)

where f_min = 80 Hz and f_max = 5000 Hz, and E(f) is the energy of the spectral component at frequency f.

Since several sounds slightly deviate from the harmonic rule, a feature called inharmonicity is measured as a cumulative distance between the first four estimated partials p_i and their theoretical values i · f_0, where f_0 is the fundamental frequency of the sound:

  \mathrm{Inharmonicity} = \sum_{i=1}^{4} \frac{|p_i - i \cdot f_0|}{i \cdot f_0}.     (2)

The percentage of energy contained in each one of the first four partials is calculated for bins 1/12 oct wide, providing four different features.

Finally, we introduce a feature obtained by combining the energy confined in each partial and its respective inharmonicity:

  \text{Harmonic energy skewness} = \sum_{i=1}^{4} \frac{|p_i - i \cdot f_0|}{i \cdot f_0} \cdot E_{p_i},     (3)

where E_{p_i} is the percentage of energy contained in the respective partial.
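The descriptors in (1)–(3) could be computed along the following lines. This is a simplified sketch assuming NumPy, a precomputed magnitude spectrum, and an estimated f_0, with each partial searched in a 1/12-octave bin centred on i · f_0 instead of the paper's variable-resolution analysis.

```python
# Sketch of the spectral descriptors in eqs. (1)-(3); illustrative only.
import numpy as np

def spectral_features(mag, freqs, f0, fmin=80.0, fmax=5000.0):
    band = (freqs >= fmin) & (freqs <= fmax)
    E, f = mag[band] ** 2, freqs[band]          # energy per spectral component

    centroid = np.sum(f * E) / np.sum(E)                      # eq. (1)
    bandwidth = np.sum(np.abs(centroid - f) * E) / np.sum(E)  # eq. (1)

    total = np.sum(E)
    inharmonicity, skewness, energy_pct = 0.0, 0.0, []
    for i in range(1, 5):                       # first four partials
        # 1/12-octave bin centred on the expected partial frequency i*f0
        lo, hi = i * f0 * 2 ** (-1 / 24), i * f0 * 2 ** (1 / 24)
        idx = np.where((f >= lo) & (f <= hi))[0]
        if len(idx) == 0:
            energy_pct.append(0.0)
            continue
        k = idx[np.argmax(E[idx])]
        p_i, e_i = f[k], np.sum(E[idx]) / total  # partial frequency, energy share
        dev = abs(p_i - i * f0) / (i * f0)
        inharmonicity += dev                     # eq. (2)
        skewness += dev * e_i                    # eq. (3)
        energy_pct.append(e_i)

    return centroid, bandwidth, inharmonicity, skewness, energy_pct
```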

4. CLASSIFICATION TECHNIQUES

In this section, we provide a brief survey of the most popular classification techniques, comparing different approaches. As an abstract task, pattern recognition aims at associating a vector y in a p-dimensional space (the feature space) with a class, given a dataset (or training set) of N vectors d_i. Since each of these observations belongs to a known class, among the c available, this is said to be a supervised classification. In our instance of the problem, the features extracted are the dimensions, or variables, and the instrument labels are the classes. The vector y represents the tone played by an unknown musical instrument.

4.1. Discriminant analysis

The multivariate statistical approach to the question [26] has a long tradition of research. Considering y and the d_i as realizations of random vectors, the probability of misclassification of a classifier g can be expressed as a function of the probability density functions (PDFs) f_i(·) of each class:

  \gamma_g = 1 - \sum_{i=1}^{c} \pi_i \int_{R_i} f_i(y)\, dy,     (4)

where \pi_i is the a priori probability that an observation belongs to the ith class and R_i is the region of the feature space that g assigns to that class. It can also be proven that the optimal classifier, which is the classifier that minimizes the error rate, is the one that associates with the ith class every vector y for which

  \pi_i f_i(y) > \pi_j f_j(y), \quad \forall j \neq i.     (5)

Unfortunately, the PDFs f_i(·) are generally unknown. Nonetheless, we can make assumptions about the distributions of the classes and estimate the necessary parameters to obtain a good guess of those functions.

4.1.1 Quadratic discriminant analysis (QDA)

This technique starts from the working hypothesis that classes have multivariate normal PDFs. The only parameters characterizing those distributions are the mean vectors \mu_i and the covariance matrices \Sigma_i. We can easily estimate them by computing the traditional sample statistics

  m_i = \frac{1}{N_i} \sum_{j=1}^{N_i} d_{ij}, \qquad
  S_i = \frac{1}{N_i - 1} \sum_{j=1}^{N_i} (d_{ij} - m_i)(d_{ij} - m_i)^{T},     (6)

using the N_i observations d_{ij} available for the ith class from the training sequence. It can be shown that, in this case, the hypersurfaces delimiting the regions of classification (in which the associated class is the same) are quadratic forms, hence the name of the classifier.

Although this is the optimal classifier for normal mixtures, it could lead to suboptimal error rates in practical cases for two reasons. First, classes may depart sensibly from the assumption of normality. A more subtle source of errors is the fact that, with this method, the actual distributions remain unknown, since we only have the best estimates of them, based on a finite training set.
line of separation sought.
In a p-dimensional space, using a similar criterion, we subject to the following class separation constraints:
can separate c ≥ 2 classes with hyperplanes by maximizing,
 
with respect to a generic vector a, the figure of merit li w · di + b ≥ 1, 1 ≤ i ≤ N. (11)
a SB a This approach is called maximal margin classifier. The
D(a) = , (7)
a SW a optimal solution can be viewed in a dual form by apply-
ing the Lagrange theory and imposing the conditions of sta-
where
tionariness. The objective and decision functions can thus be
1   written in terms of the Lagrange multipliers αi as
c
 
SB = Nj mj − m mj − m (8)
N j =1

N
1 
N
L(w, b, α) = αi − li l j αi α j di · d j ,
is the between-class scatter matrix, and i=1
2 i, j =1
(12)

N
1  i 
c N
  f (y) = li αi di · y + b.
SW = di j − mi di j − mi (9)
N i=1 j =1
i=1
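Under the same notation, the discriminant directions maximizing D(a) in (7) are the leading eigenvectors of the generalized eigenproblem S_B a = λ S_W a. The following sketch, assuming NumPy/SciPy, builds the scatter matrices of (8) and (9) and extracts those directions; it is illustrative only.

```python
# Sketch of the canonical discriminant directions maximizing D(a) in eq. (7).
import numpy as np
from scipy.linalg import eigh

def cda_directions(X, labels, n_components=2):
    N, p = X.shape
    m = X.mean(axis=0)
    S_B = np.zeros((p, p))
    S_W = np.zeros((p, p))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        S_B += len(Xc) * np.outer(mc - m, mc - m)   # eq. (8), before the 1/N factor
        S_W += (Xc - mc).T @ (Xc - mc)              # eq. (9), before the 1/N factor
    S_B /= N
    S_W /= N
    # Generalized symmetric eigenproblem; S_W must be positive definite
    # (enough linearly independent samples per class, cf. Section 5).
    vals, vecs = eigh(S_B, S_W)
    return vecs[:, ::-1][:, :n_components]          # directions with largest D(a)
```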

4.2. k-nearest neighbours (k-NN)

This is one of the most popular nonparametric techniques in pattern recognition. It does not require any knowledge about the distribution of the samples and it is quite easy to implement. In fact, this method classifies y as belonging to the class which is most frequent among its k nearest observations. Thus, only two parameters are needed: a distance metric and the number of nearest samples considered (k). An important drawback is its poor ability to abstract from data, since only local information is taken into account.

4.3. Support vector machines

The support vector machines (SVM) are a recently developed approach to the learning problem [27]. The aim is to find the hyperplane that best separates observations belonging to different classes. This is done by satisfying a generalization bound which maximizes the geometric margin between the sample data and the hyperplane, as briefly detailed below.

Suppose we have a set of linearly separable training samples d_1, ..., d_N, with d_i ∈ R^p. We refer to the simplified binary classification problem (two classes, c = 2), in which a label l_i ∈ {−1, 1} is assigned to the ith sample, indicating the class it belongs to. The hyperplane f(y) = (w · y) + b that separates the data can be found by minimizing the 2-norm of the weight vector w,

  \min_{w,b} \; w \cdot w     (10)

subject to the following class separation constraints:

  l_i (w \cdot d_i + b) \geq 1, \quad 1 \leq i \leq N.     (11)

This approach is called the maximal margin classifier. The optimal solution can be viewed in a dual form by applying the Lagrange theory and imposing the stationarity conditions. The objective and decision functions can thus be written in terms of the Lagrange multipliers \alpha_i as

  L(w, b, \alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} l_i l_j \alpha_i \alpha_j (d_i \cdot d_j), \qquad
  f(y) = \sum_{i} l_i \alpha_i (d_i \cdot y) + b.     (12)

The support vectors are defined as the input samples d_i for which the respective Lagrange multiplier \alpha_i is nonzero, so they contain all the information needed to reconstruct the hyperplane. Geometrically, they are the samples closest to the hyperplane, lying on the border of the geometric margin.

In case the classes are not linearly separable, the samples are projected through a nonlinear function Φ(·) from the input space Y into a higher-dimensional space (with possibly infinite dimensions), which we will call the transformed space¹ T. The transformation Φ(y): Y → T has to be a nonlinear function so that the transformed samples can be linearly separable. Since the high number of dimensions increases the computational effort, it is possible to introduce the kernel functions K(y, z) = Φ(y) · Φ(z), which implicitly define the transformation Φ(·) and allow the solution to be found in the transformed space T by making simpler calculations in the input space Y. The theory does not guarantee that the best linear hyperplane can always be found, but, in practice, a solution can be heuristically obtained.

¹ For the sake of clarity, we will avoid the traditional name "feature space."
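For illustration, the sketch below trains a kernel SVM of the kind just described on a toy two-class problem that is not linearly separable. It uses scikit-learn's SVC with an RBF kernel as a stand-in for the tools referenced in this paper, so names, data, and parameter values are illustrative only.

```python
# Sketch of a kernel SVM: RBF kernel K(y, z) = exp(-gamma ||y - z||^2)
# plays the role of Phi(y).Phi(z) in the dual form of eq. (12).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two toy classes in R^2, one ring inside the other (not linearly separable).
radii = np.r_[rng.uniform(0, 1, 200), rng.uniform(2, 3, 200)]
theta = rng.uniform(0, 2 * np.pi, 400)
X = np.c_[radii * np.cos(theta), radii * np.sin(theta)]
labels = np.r_[-np.ones(200), np.ones(200)]          # l_i in {-1, +1}

clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, labels)

# decision_function is the kernelized f(y) of eq. (12);
# clf.support_vectors_ holds the d_i with nonzero alpha_i.
print(clf.decision_function(X[:5]), len(clf.support_vectors_))
```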

Table 1: Taxonomy of the instruments employed in the experiments.

Pizzicati
  Piano et al.: Piano, Harpsichord, Classic guitar, Harp
  Rock strings: Electric bass, Electric bass slap, Electric guitar, Distorted electric guitar
  Pizzicato strings: Violin pizzicato, Viola pizzicato, Cello pizzicato, Doublebass pizzicato

Sustained
  Strings: Violin bowed, Viola bowed, Cello bowed, Doublebass bowed
  Woodwinds: Flute, Organ, Accordion, Bassoon, Oboe, English horn, E♭ clarinet, Sax
  Brass: C trumpet, French horn, Tuba

Thus, the problem is now to find a kernel function that well separates the observations. Not just any function is a kernel function: it must be symmetric, it must satisfy the Cauchy-Schwartz inequality, and it must satisfy the condition imposed by Mercer's theorem. The simplest example of a kernel function is the dot kernel, which maps the input space directly into the transformed space. Radial basis functions (RBF) and polynomial kernels are widely used in image recognition, speech recognition, handwritten digit recognition, and protein homology detection problems.

5. EXPERIMENT

The adopted dataset has been extracted from the MUMS (McGill University Master Samples) CDs [28], which is a library of isolated sample tones from a wide number of musical instruments, played with several articulation styles and covering the entire pitch range. We considered 30 musical instruments ranging from orchestral sounds (strings, woodwinds, brass) to pop/electronic instruments (bass, electric, and distorted guitar). An extended collection of musical instrument tones is essential for training and testing classifiers for two distinct reasons. First, methods that require an estimate of the covariance matrices, namely, QDA and CDA, must compute it with at least p + 1 linearly independent observations for each class, p being the number of features extracted, so that the matrices are positive definite. In addition, we need to avoid the curse of dimensionality discussed in Section 3; therefore, a rich collection of samples brings the expected error rate down. It follows from the first observation that we could not include musical instruments with less than 19 tones in the training set. This is why we collapsed the family of saxophones (alto, soprano, tenor, baritone) into a single instrument class.² Having said that, the total number of musical instruments considered was 27, but the classification results reported in Section 6 can be claimed to hold for a set of 30 instruments (Table 1).

The audio files have been analyzed by the feature extraction algorithms. If the accuracy of a pitch estimate is below a predefined threshold, the corresponding tone is rejected from the training set. Following this procedure, the number of tones accepted for training/testing is 1007 in total. Various classification techniques have been implemented and tested: CDA, QDA, k-NN, and SVM. k-NN has been tested with k = 1, 3, 5, 7 and with 3 different distance metrics (1-norm, 2-norm, 3-norm). In one experiment, we modified the input space through a kernel function. For SVM, we adopted a software tool developed at the Royal Holloway University of London [29]. A number of kernel functions has been considered (dot product, simple polynomial, RBF, linear splines, regularized Fourier). Input values have been normalized independently and we chose a multiclass classification method that trains c(c − 1)/2 binary classifiers, where c is the number of instruments. Therefore, recognition rates in the classification of instrument families have been calculated by grouping results from the recognition of individual instruments. All error rate estimates reported in Section 6 have been computed using a leave-one-out procedure.

² We observe that the recognition of the single instrument within the sax class can be easily accomplished by inspecting the pitch, since the ranges do not overlap.

6. RESULTS

The experiments illustrated have been evaluated by means of overall success rate and confusion matrices. In the first case, results have been calculated as the ratio of estimated and actual stimuli. Confusion matrices represent a valid method for inspecting performances from a qualitative point of view. Although we put the emphasis on the instrument level, we have also grouped instruments belonging to the same family (strings, brass, woodwinds, and the like), extending the Sachs taxonomy [30] with the inclusion of rock strings (deep bass, electric guitar, distorted guitar). Figure 2 provides a graphical representation of the best results both at the instrument level (17, 20, and 27 instruments) and at the family level (pizzicato-sustained, instrument family).
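A sketch of this evaluation protocol is given below, assuming the 18 descriptors are arranged in a feature matrix. scikit-learn is used here as an illustrative stand-in for the tools adopted in the paper, with per-feature normalization, a one-vs-one SVM (which trains c(c − 1)/2 binary classifiers), a 1-NN with the 1-norm, and leave-one-out estimation of the success rate.

```python
# Sketch of the leave-one-out evaluation over the tone dataset; illustrative only.
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

def evaluate(X, labels):
    """X: (N, 18) feature matrix; labels: instrument class per tone."""
    loo = LeaveOneOut()
    models = {
        # SVC is one-vs-one for multiclass, i.e. c(c-1)/2 pairwise classifiers.
        "SVM (RBF)": make_pipeline(MinMaxScaler(),
                                   SVC(kernel="rbf", decision_function_shape="ovo")),
        # p=1 selects the 1-norm (Manhattan) distance metric.
        "1-NN (1-norm)": make_pipeline(MinMaxScaler(),
                                       KNeighborsClassifier(n_neighbors=1, p=1)),
    }
    # Mean of the per-tone 0/1 scores is the overall success rate.
    return {name: cross_val_score(m, X, labels, cv=loo).mean()
            for name, m in models.items()}
```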

Figure 2: Graphical representation of the success rates (%) for each experiment, comparing QDA, SVM, k-NN, and CDA over five tasks: 17 instruments, 20 instruments, 27 instruments, 27-instrument family discrimination, and 27-instrument pizzicato/sustained discrimination.

SVM with RBF kernel was the best classifier in the recognition of individual instruments, with a success rate of 69.7%, 78.6%, and 80.2% for, respectively, 27, 20, and 17 instruments. In comparison with the work by Marques and Moreno [15], where 8 instruments were recognized with an error rate of 30%, the SVM implemented in our experiments had an error rate of 19.8% in the classification of 17 instruments. The second best score was achieved by QDA, with success rates close to SVM's performances. In the case of instrument family recognition and sustained/pizzicato classification, QDA overcame all other classifiers with a success rate of 81%. Success rates with SVM at the family and pizzicato/sustained levels should be carefully evaluated, since we did not train a new SVM for each grouping (i.e., grouping instruments by family or by pizzicato/sustained). Thus, we have to consider the results for pizzicato/sustained discrimination with this classifier as merely indicative, although success rates with all classifiers are comparable for this task.

CDA never obtained remarkable results, ranging from 71.2% with 17 instruments to 60.3% with 27 instruments. In spite of their simplicity, the k-NN classifiers performed quite close to QDA. Among them, 1-NN with the 1-norm distance metric obtained the best performance. Since k-NN has been employed in a number of previous experiments, we observe that our results are similar to those previously reported, for example, in [31]. Using a kernel function to modify the input space did not bring any advantage (71% with kernel and 74.5% without kernel for 20 instruments).

A deeper analysis of the results achieved with SVM and QDA (see Figures 3, 4, 5, 6) showed that strings have been the most misclassified family, with 39.52% and 46.75% of individual instruments identified correctly on average, respectively, for SVM and QDA. Leaving out string samples, the success rates for the remaining 19 instruments grow up to some 80% for the classification of individual instruments. Since this behaviour has been registered for both pizzicato and sustained strings, we should conclude that our features are not suitable for describing such instruments. In particular, SVM classifiers seem to be unable to recognize the doublebass and the pizzicato strings, for which results have been as low as some 7% and 30%; instead, sustained strings have been identified correctly in 64% of cases, conforming to the overall rate. QDA classifiers did not show a considerable difference in performance between pizzicato and sustained strings. Moreover, most of the misclassifications have been within the same family. This fact explains the slight advantage of QDA in the classifications at the family level.

The recognition of woodwinds, brass, and rock strings has been very successful (94%, 96%, and 89% with QDA), without noticeable differences between QDA and SVM. Misclassifications within these families reveal strong and well-known subjective evidence. For example, bassoon has been estimated as tuba (21% with QDA), oboe as flute (11% with QDA), and deep bass as deep bass slap (24% with QDA). The detection of stimuli from the family of piano and other instruments is definitely more spread around the correct family, with success rates for the detection of this family close to 70% with SVM and to 64% with QDA.

We have also calculated a list of the most relevant features through the forward selection procedure detailed in [32]. The values reported are the normalized versions of the statistics on which the procedure is based, and can be interpreted as the amount of information added by each feature. They cannot be strictly decreasing, because a feature might bring more information only jointly with other features. For 27 instruments, the most informative feature has been the mean of the inharmonicity, followed by the mean and standard deviation of the spectral centroid and the mean of the energy contained in the first partial (see Table 2).

In one of our experiments, we also introduced a machine-built decision tree. We used a hierarchical clustering algorithm [33] to build the structure. CDA or QDA methods have been employed at each node of the hierarchy. Even with these techniques, though, we could not improve the error rates, thus confirming previous findings [13].
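For reference, a greedy forward selection of this kind could be sketched as follows. The score used here is the leave-one-out accuracy of a QDA classifier restricted to the candidate feature subset, which is a simplification of the statistic actually used in [32]; names and the stopping criterion are illustrative.

```python
# Sketch of greedy forward feature selection; illustrative, not the procedure of [32].
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

def forward_selection(X, labels, n_select=5):
    remaining = list(range(X.shape[1]))
    selected, scores = [], []
    while remaining and len(selected) < n_select:
        best_f, best_acc = None, -np.inf
        for f in remaining:
            cols = selected + [f]
            acc = cross_val_score(QuadraticDiscriminantAnalysis(),
                                  X[:, cols], labels, cv=LeaveOneOut()).mean()
            if acc > best_acc:
                best_f, best_acc = f, acc
        selected.append(best_f)       # feature adding the most information
        remaining.remove(best_f)
        scores.append(best_acc)
    return selected, scores
```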

Figure 3: Confusion matrix for the classification of individual instruments in the family of pizzicati with QDA. (Instruments: Hamburg Steinway, harpsichord, classic guitar, harp, deep electric bass, deep electric bass slap, electric guitar, distorted electric guitar, violin pizz., viola pizz., cello pizz., doublebass pizz. Family success rates: 64.00% (piano et al.), 88.64% (rock strings), 79.04% (pizzicato strings); overall pizzicato success: 91.27%.)
Figure 4: Confusion matrix for the classification of individual instruments in the family of sustained with QDA. (Instruments: violin bowed, viola bowed, cello bowed, doublebass bowed, flute, B. Plenum organ, accordion, bassoon, oboe, English horn, E♭ clarinet, sax, C trumpet, French horn, tuba. Family success rates: 62.92% (strings), 94.08% (woodwinds), 95.82% (brass); overall sustained success: 93.07%.)

7. DISCUSSION AND FURTHER WORK

A thorough evaluation of the resulting performances illustrated in Section 6 reveals the power of SVM in the task of timbre classification, thus confirming the successful results in other fields (e.g., face detection, text classification). Furthermore, in our experiments we employed widely used kernel functions, so there is room for improvement by adopting dedicated kernels. However, QDA performed similarly in the recognition of individual instruments, with errors closer to the way humans classify sounds. It was highlighted that much of the QDA errors are within the correct family, while the SVM show errors scattered throughout the confusion matrices. Since QDA is the optimal classifier under multivariate normality hypotheses, we should conclude that the features we extracted from isolated tones follow such a distribution. To validate this hypothesis, a series of statistical tests is under way on the dataset.

Figure 5: Confusion matrix for the classification of individual instruments in the family of pizzicati with SVM. (Same instruments as Figure 3. Family success rates: 69.41% (piano et al.), 85.24% (rock strings), 57.09% (pizzicato strings); overall pizzicato success: 86.29%.)
Figure 6: Confusion matrix for the classification of individual instruments in the family of sustained with SVM. (Same instruments as Figure 4. Family success rates: 68.60% (strings), 93.70% (woodwinds), 91.51% (brass); overall sustained success: 91.04%.)

As was anticipated, sounds that exhibit a predominant percussive nature are not well characterized by a set of features solely based on spectral properties, while for sustained sounds like brass such features are perfectly tailored. Our experiments have demonstrated that classifiers are not able to overcome this difficulty. Moreover, the closeness of the performances of k-NN and SVM indicates that the choice of features is more critical than the choice of a classification method. However that may be, besides a set of spectral features, it is important to introduce temporal descriptors of sounds, like the log attack slope or similar.

Table 2: Most discriminating features for 27 instruments.

Feature name: Score
Inharmonicity mean: 1.0
Centroid mean: 0.202121
Centroid standard deviation: 0.184183
Harmonic energy percentage (partial 0) mean: 0.144407
Zero-crossing mean: 0.130214
Bandwidth standard deviation: 0.141585
Bandwidth mean: 0.1388
Harmonic energy skewness standard deviation: 0.130805
Harmonic energy percentage (partial 2) standard deviation: 0.116544

The method employed in our experiments to extract features out of a tone (i.e., mean and standard deviation) does not consider the time-varying nature of sounds known as articulation. If the multivariate normality hypotheses were confirmed, a suitable model of articulation would be the continuous hidden Markov model, in which the PDF of each state is Gaussian [21].

The experiments described so far have been conducted on real acoustic instruments with relatively little influence of the reverberant field. A preliminary test with performances of trumpet and trombone has shown that our features are quite robust against the effects of room acoustics. The only weakness is their dependence on the pitch, which can be reliably estimated out of monophonic sources only. We are planning to introduce novel harmonic features that are independent of pitch estimation.

As a final remark, it is interesting to compare our results with human performances. In a recent paper [34], 88 conservatory students were asked to recognize 27 musical instruments out of a number of isolated tones randomly played by a CD player. An average of 55.7% of the tones was correctly classified. Thus, timbre recognition by computer models is able to exceed human performance under the same conditions (isolated tones).

ACKNOWLEDGMENTS

The authors are grateful to Prof. N. Cesa Bianchi, Ryan Rifkin, and Alessandro Conconi for the fruitful discussions about SVM and pattern classification. Portions of this work were presented at the Multimedia Signal Processing 2001 IEEE Workshop and the Content-Based Multimedia Indexing 2001 IEEE Workshop.

REFERENCES

[1] G. Peeters, S. McAdams, and P. Herrera, "Instrument sound description in the context of MPEG-7," in Proc. International Computer Music Conference, pp. 166–169, Berlin, Germany, August-September 2000.
[2] T. Virtanen and A. Klapuri, "Separation of harmonic sounds using linear models for the overtone series," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Orlando, Fla, USA, May 2002.
[3] P. J. Walmsley, "Polyphonic pitch tracking using joint Bayesian estimation of multiple frame parameters," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, October 1999.
[4] Moving Pictures Experts Group, "Overview of the MPEG-7 standard," Document ISO/IEC JTC1/SC29/WG11 N4509, Pattaya, Thailand, December 2001.
[5] American National Standards Institute, American National Psychoacoustical Terminology. S3.20, Acoustical Society of America (ASA), New York, NY, USA, 1973.
[6] J. M. Grey, "Multidimensional perceptual scaling of musical timbres," Journal of the Acoustical Society of America, vol. 61, no. 5, pp. 1270–1277, 1977.
[7] P. Cosi, G. De Poli, and P. Prandoni, "Timbre characterization with Mel-Cepstrum and neural nets," in Proc. International Computer Music Conference, pp. 42–45, Aarhus, Denmark, 1994.
[8] B. Feiten and S. Günzel, "Automatic indexing of a sound database using self-organizing neural nets," Computer Music Journal, vol. 18, no. 3, pp. 53–65, 1994.
[9] I. Kaminskyj and A. Materka, "Automatic source identification of monophonic musical instrument sounds," in Proc. IEEE Int. Conf. Neural Networks, vol. 1, pp. 189–194, Perth, Australia, November 1995.
[10] S. Dubnov, N. Tishby, and D. Cohen, "Polyspectra as measures of sound texture and timbre," Journal of New Music Research, vol. 26, no. 4, pp. 277–314, 1997.
[11] S. Rossignol, X. Rodet, J. Soumagne, J. L. Colette, and P. Depalle, "Automatic characterisation of musical signals: Feature extraction and temporal segmentation," Journal of New Music Research, vol. 28, no. 4, pp. 281–295, 1999.
[12] J. C. Brown, "Musical instrument identification using pattern recognition with cepstral coefficients as features," Journal of the Acoustical Society of America, vol. 105, no. 3, pp. 1933–1941, 1999.
[13] A. Eronen, "Comparison of features for musical instrument recognition," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, October 2001.
[14] P. Herrera, X. Amatriain, E. Batlle, and X. Serra, "Towards instrument segmentation for music content description: a critical review of instrument classification techniques," in International Symposium on Music Information Retrieval, pp. 23–25, Plymouth, Mass, USA, October 2000.
[15] J. Marques and P. J. Moreno, "A study of musical instrument classification using Gaussian mixture models and support vector machines," Tech. Rep., Cambridge Research Laboratory, Cambridge, Mass, USA, June 1999.
[16] K. D. Martin, Sound-Source Recognition: A Theory and Computational Model, Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, Mass, USA, 1999.
[17] E. Wold, T. Blum, D. Keislar, and J. Wheaton, "Content-based classification, search, and retrieval of audio," IEEE Multimedia, vol. 3, no. 3, pp. 27–36, Fall 1996.
[18] J. Foote, "Automatic audio segmentation using a measure of audio novelty," in Proc. IEEE International Conference on Multimedia and Expo, vol. I, pp. 452–455, New York, NY, USA, August 2000.

[19] S. Pfeiffer, S. Fischer, and W. E. Effelsberg, "Automatic audio content analysis," in Proc. ACM Multimedia, pp. 21–30, Boston, Mass, USA, November 1996.
[20] T. Zhang and C.-C. Jay Kuo, Eds., Content-Based Audio Classification and Retrieval for Audiovisual Data Parsing, Kluwer Academic Publishers, Boston, Mass, USA, February 2001.
[21] M. Casey, "General sound classification and similarity in MPEG-7," Organized Sound, vol. 6, no. 2, pp. 153–164, 2001.
[22] L. Lu, H. Jiang, and H. Zhang, "A robust audio classification and segmentation method," in Proc. ACM Multimedia, pp. 203–211, Ottawa, Canada, October 2001.
[23] E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. II, pp. 1331–1334, Munich, Germany, April 1997.
[24] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer-Verlag, New York, NY, USA, 1996.
[25] G. Haus and E. Pollastri, "A multimodal framework for music inputs," in Proc. ACM Multimedia, pp. 382–384, Los Angeles, Calif, USA, November 2000.
[26] B. Flury, A First Course in Multivariate Statistics, Springer-Verlag, New York, NY, USA, 1997.
[27] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, UK, 2000.
[28] F. Opolko and J. Wapnick, McGill University Master Samples, McGill University, Montreal, Quebec, Canada, 1987.
[29] C. Saunders, M. O. Stitson, J. Weston, L. Bottou, B. Schölkopf, and A. Smola, "Support vector machine reference manual," Tech. Rep., Royal Holloway Department of Computer Science, Computer Learning Research Centre, University of London, Egham, London, UK, 1998, http://svm.dcs.rhbnc.ac.uk/.
[30] E. M. Hornbostel and C. Sachs, "Systematik der Musikinstrumente. Ein Versuch," Zeitschrift für Ethnologie, vol. 46, no. 4-5, pp. 553–590, 1914 [English translation by A. Baines and K. P. Wachsmann, "Classification of musical instruments," Galpin Society Journal, vol. 14, pp. 3–29, 1961].
[31] I. Fujinaga and K. MacMillan, "Realtime recognition of orchestral instruments," in Proc. International Computer Music Conference, Berlin, Germany, August–September 2000.
[32] G. J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, John Wiley & Sons, New York, NY, USA, 1992.
[33] H. Späth, Cluster Analysis Algorithms, E. Horwood, Chichester, UK, 1980.
[34] A. Srinivasan, D. Sullivan, and I. Fujinaga, "Recognition of isolated instrument tones by conservatory students," in Proc. International Conference on Music Perception and Cognition, pp. 17–21, Sydney, Australia, July 2002.

Maurizio Longari was born in 1973. In 1998, he received his M.S. degree in information technology from Università degli Studi di Milano, Milan, Italy, LIM (Laboratorio di Informatica Musicale). In January 2000, he started his research activity as a Ph.D. student at the Dipartimento di Scienze dell'Informazione in the same university. His main research interests are symbolic musical representation, web/music applications, and multimedia databases. He is a member of the IEEE SA Working Group on Music Application of XML.

Emanuele Pollastri received his M.S. degree in electrical engineering from Politecnico di Milano, Milan, Italy, in 1998. He is a Ph.D. candidate in computer science at Università degli Studi di Milano, Milan, Italy, where he is expected to graduate at the beginning of 2003 with a thesis entitled "Processing singing voice for music retrieval." His research interests include audio analysis, understanding and classification, digital signal processing, music retrieval, and music classification. He is cofounder of Erazero S.r.l., a leading Italian multimedia company. He worked as a software engineer for speech recognition applications at IBM Italia S.p.A. and he was a consultant for a number of companies in the field of professional audio equipment.

Giulio Agostini received a “Laurea” in


computer science and software engineer-
ing from the Politecnico di Milano, Italy,
in February 2000. His thesis dissertation
covered the automatic recognition of musi-
cal timbres through multivariate statistical
analysis techniques. During the following
years, he has continued to study the same
subject and published his contributions to
two IEEE international workshops devoted
to multimedia signal processing. His other research interests are
combinatorics and mathematical finance.
