STATISTICS in
MUSICOLOGY
Jan Beran
Library of Congress Card Number 2003048488
This book contains information obtained from authentic and highly regarded sources. Reprinted material
is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable
efforts have been made to publish reliable data and information, but the author and the publisher cannot
assume responsibility for the validity of all materials or for the consequences of their use.
Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic
or mechanical, including photocopying, microfilming, and recording, or by any information storage or
retrieval system, without prior permission in writing from the publisher.
The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for
creating new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC
for such copying.
Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation, without intent to infringe.
Contents
Preface
1 Some mathematical foundations of music
1.1 General background
1.2 Some elements of algebra
1.3 Specific applications in music
2 Exploratory data mining in musical spaces
2.1 Musical motivation
2.2 Some descriptive statistics and plots for univariate data
2.3 Specific applications in music: univariate
2.4 Some descriptive statistics and plots for bivariate data
2.5 Specific applications in music: bivariate
2.6 Some multivariate descriptive displays
2.7 Specific applications in music: multivariate
3 Global measures of structure and randomness
3.1 Musical motivation
3.2 Basic principles
3.3 Specific applications in music
4 Time series analysis
4.1 Musical motivation
4.2 Basic principles
4.3 Specific applications in music
5 Hierarchical methods
5.1 Musical motivation
5.2 Basic principles
5.3 Specific applications in music
6 Markov chains and hidden Markov models
6.1 Musical motivation
6.2 Basic principles
Preface
An essential aspect of music is structure. It is therefore not surprising that a
connection between music and mathematics was recognized long before our
time. Perhaps best known among the ancient quantitative musicologists
are the Pythagoreans, who found fundamental connections between musical intervals and mathematical ratios. An obvious reason why mathematics
comes into play is that a musical performance results in sound waves that
can be described by physical equations. Perhaps more interesting, however,
is the intrinsic organization of these waves that distinguishes music from
ordinary noise. Also, since music is intrinsically linked with human perception, emotion, and reflection as well as the human body, the scientific
study of music goes far beyond physics. For a deeper understanding of music, a number of different sciences, such as psychology, physiology, history,
physics, mathematics, statistics, computer science, semiotics, and of course
musicology, to name only a few, need to be combined. This, together
with the lack of available data, prevented, until recently, a systematic development of quantitative methods in musicology. In the last few years,
the situation has changed dramatically. Collection of quantitative data is
no longer a serious problem, and a number of mathematical and statistical methods have been developed that are suitable for analyzing such
data. Statistics is likely to play an essential role in future developments
of musicology, mainly for the following reasons: a) statistics is concerned
with finding structure in data; b) statistical methods and structures are
mathematical, and can often be carried over to various types of data;
statistics is therefore an ideal interdisciplinary science that can link different scientific disciplines; and c) musical data are massive and complex,
and therefore basically useless unless suitable tools are applied to extract
essential features.
This book is addressed to anybody who is curious about how one may analyze music in a quantitative manner. Clearly, the question of how such an
analysis may be done is very complex, and no ultimate answer can be given
here. Instead, the book summarizes various ideas that have proven useful
in musical analysis and may provide the reader with food for thought or
inspiration to do his or her own analysis. Specifically, the methods and applications discussed here may be of interest to students and researchers in
music, statistics, mathematics, computer science, communication, and en-
CHAPTER 1
documented in Beethoven's famous sketchbooks. Similarly, the art of counterpoint that culminated in J.S. Bach's (Figure 1.2) work relies to a high
degree on intrinsically mathematical principles. A rather peculiar early account of explicit applications of mathematics is the use of permutations in
change ringing in English churches since the 10th century (Fletcher 1956,
Price 1969, Stewart 1992, White 1983, 1985, 1987, Wilson 1965). More
standard are simple symmetries, such as retrograde (e.g. Crab fugue, or
Canon cancricans), inversion, arpeggio, or augmentation. A curious example of this sort is Mozart's Spiegel-Duett (or mirror duet, Figures
1.6, 1.7; the attribution to Mozart is actually uncertain). In the 20th century, composers such as Messiaen or Xenakis (Xenakis 1971; Figure 1.15)
attempted to develop mathematical theories that would lead to new techniques of composition. From a strictly mathematical point of view, their
derivations are not always exact. Nevertheless, their artistic contributions
were very innovative and inspiring. More recent, mathematically stringent
approaches to music theory, or certain aspects of it, are based on modern tools of abstract mathematics, such as algebra, algebraic geometry,
and mathematical statistics (see e.g. Reiner 1985, Mazzola 1985, 1990a,
2002, Lewin 1987, Fripertinger 1991, 1999, 2001, Beran and Mazzola 1992,
1999a,b, 2000, Read 1997, Fleischer et al. 2000, Fleischer 2003).
The most obvious connection between music and mathematics is due to
the fact that music is communicated in the form of sound waves. Musical sounds
can therefore be studied by means of physical equations. Already in ancient
Greece (around the 5th century BC), the Pythagoreans found the relationship
between certain musical intervals and numeric proportions, and calculated
intervals of selected scales. These results were probably obtained by studying the vibration of strings. Similar studies were done in other cultures, but
are mostly not well documented. In practical terms, these studies led to
singling out specific frequencies (or frequency proportions) as musically
useful and to the development of various scales and harmonic systems.
A more systematic approach to the physics of musical sounds, music perception, and acoustics was initiated in the second half of the 19th century by
path-breaking contributions by Helmholtz (1863) and other physicists (see
e.g. Rayleigh 1896). Since then, a vast amount of knowledge has been accumulated in this field (see e.g. Backus 1969, 1977, Morse and Ingard 1968,
1986, Benade 1976, 1990, Rigden 1977, Yost 1977, Hall 1980, Berg and
Stork 1995, Pierce 1983, Cremer 1984, Rossing 1984, 1990, 2000, Johnston
1989, Fletcher and Rossing 1991, Graff 1975, 1991, Roederer 1995, Rossing
et al. 1995, Howard and Angus 1996, Beament 1997, Crocker 1998, Nederveen 1998, Orbach 1999, Kinsler et al. 2000, Raichel 2000). For a historical
account of musical acoustics see e.g. Bailhache (2001).
It may appear at first that once we have mastered modeling musical sounds
by physical equations, music is understood. This is, however, not so. Music
is not just an arbitrary collection of sounds: music is organized sound.
Physical equations for sound waves only describe the propagation of air
pressure. They do not provide, by themselves, an understanding of how
and why certain sounds are connected, nor do they tell us anything (at
least not directly) about the effect on the audience. As far as structure is
concerned, one may even argue, for the sake of argument, that music does
not necessarily need physical realization in the form of a sound. Musicians
are able to hear music just by looking at a score. Beethoven (Figures 1.3,
1.16) composed his ultimate masterpieces after he lost his hearing. Thus,
on an abstract level, music can be considered as an organized structure
that follows certain laws. This structure may or may not express feelings
of the composer. Usually, the structure is communicated to the audience
by means of physical sounds, which in turn trigger an emotional experience in the audience (not necessarily identical with the one intended by
the composer). The structure itself can be analyzed, at least partially, using suitable mathematical structures. Note, however, that understanding
the mathematical structure does not necessarily tell us anything about the
effect on the audience. Moreover, any mathematical structure used for analyzing music describes certain selected aspects only. For instance, studying
symmetries of motifs in a composition by purely algebraic means ignores
psychological, historical, perceptual, and other important issues. Ideally, all
relevant scientific disciplines would need to interact to gain a broad understanding. A further complication is that the existence of a unique truth
is by no means certain (and is in fact rather unlikely). For instance, a
composition may contain certain structures that are important for some
listeners but are ignored by others. This problem became apparent in the
early 20th century with the introduction of 12-tone music. The general
public was not ready to perceive the complex structures of dodecaphonic
music and was rather appalled by the seemingly chaotic noise, whereas a
minority of specialized listeners was enthusiastic. Another example is the
mance styles, identification and modeling of metric, melodic, and harmonic structures, quantification of similarities and differences between
compositions and performance styles, automatic identification of musical events and structures from audio signals, etc. Some of these methods
will be discussed in detail.
A mathematical discipline that is concerned specifically with abstract definitions of structures is algebra. Some elements of basic algebra are therefore
discussed in the next section. Naturally, depending on the context, other
mathematical disciplines also play an equally important role in musical
analysis, and will be discussed later where necessary. Readers who are familiar with modern algebra may skip the following section. A few examples
that illustrate applications of algebraic structures to music are presented
in Section 1.3. An extended account of mathematical approaches to music
based on algebra and algebraic geometry is given, for instance, in Mazzola
(1990a, 2002) (also see Lewin 1987 and Benson 1995-2002).
1.2 Some elements of algebra
1.2.1 Motivation
Algebraic considerations in music theory have gained increasing popularity
in recent years. The reason is that there are striking similarities between
musical and algebraic structures. Why this is so can be illustrated by a simple example: notes (or rather pitches) that differ by an octave can be considered equivalent with respect to their harmonic meaning. If an instrument is tuned according to equal temperament, then, from the harmonic
perspective, there are only 12 different notes. These can be represented as
integers modulo 12. Similarly, there are only 12 different intervals. This
means that we are dealing with the set Z12 = {0, 1, ..., 11}. The sum z = x + y of two
elements x, y ∈ Z12 is interpreted as the note/interval resulting
from increasing the note/interval x by the interval y. The set Z12 of notes
(intervals) is then an additive group (see definition below).
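The Z12 arithmetic just described is easy to experiment with; a minimal sketch (the naming convention C = 0 is an assumption for illustration, not from the text) checks the group properties directly:

```python
# Pitch classes in equal temperament form the additive group Z_12.
# Naming convention assumed here: 0 = C, 1 = C#, ..., 11 = B.

def transpose(note: int, interval: int) -> int:
    """Raise a pitch class by an interval, modulo the octave."""
    return (note + interval) % 12

# Group properties: closure and associativity are inherited from the
# integers; 0 is the identity, and every x has the inverse 12 - x.
assert transpose(7, 7) == 2                              # G up a fifth is D
assert transpose(0, 12) == 0                             # an octave acts as identity
assert all((x + (12 - x)) % 12 == 0 for x in range(12))  # inverses exist
```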
1.2.2 Definitions and results
We discuss some important concepts of algebra that are useful to describe
musical structures. A more comprehensive overview of modern algebra can
be found in standard textbooks such as those by Albert (1956), Herstein
(1975), Zassenhaus (1999), Gilbert (2002), and Rotman (2002).
The most fundamental structures in algebra are group, ring, field, module, and vector space.
Definition 1 Let G be a nonempty set with a binary operation + such that
a + b ∈ G for all a, b ∈ G and the following holds:
1. (a + b) + c = a + (b + c) (associativity)
2. a · b ∈ R for all a, b ∈ R
3. (a · b) · c = a · (b · c) (associativity)
4. a · (b + c) = a · b + a · c and (b + c) · a = b · a + c · a (distributive law)
Then (R, +, ·) is called an (associative) ring. If also a · b = b · a for all
a, b ∈ R, then R is called a commutative ring.
Further useful denitions are:
Definition 7 Let R be a commutative ring and a ∈ R, a ≠ 0, such that
there exists an element b ∈ R, b ≠ 0, with a · b = 0. Then a is called a
zero-divisor. If R has no zero-divisors, then it is called an integral domain.
Definition 8 Let R be a ring such that (R \ {0}, ·) is a group. Then R is
called a division ring. A commutative division ring is called a field.
A module is defined as follows:
Definition 9 Let (R, +, ·) be a ring and M a nonempty set with a binary
operation +. Assume that
1. (M, +) is an abelian group
2. For every r ∈ R, m ∈ M, there exists an element r · m ∈ M
3. r · (a + b) = r · a + r · b for every r ∈ R and a, b ∈ M
4. r · (s · a) = (r · s) · a for every r, s ∈ R and a ∈ M
5. (r + s) · a = r · a + s · a for every r, s ∈ R and a ∈ M
Then M is called an R-module or module over R. If R has a unit element
e and if e · a = a for all a ∈ M, then M is called a unital R-module. A
unital R-module where R is a field is called a vector space over R.
There is an enormous amount of literature on groups, rings, modules,
etc. Some of the standard results are summarized, for instance, in textbooks such as those given above. Here, we cite only a few theorems that
are especially useful in music. We start with a few more definitions.
Definition 10 Let H ⊆ G be a subgroup of G such that for every a ∈ G,
a · H · a⁻¹ ⊆ H. Then H is called a normal subgroup of G.
Definition 11 Let G be such that the only normal subgroups are H = G
and H = {e}. Then G is called a simple group.
Definition 12 Let G be a group and H1, ..., Hn normal subgroups such
that
G = H1 · H2 ⋯ Hn
(1.1)
and any a ∈ G can be written uniquely as a product
a = b1 · b2 ⋯ bn
(1.2)
and
g(r · a) = r · g(a)
(1.8)
is called a (module-)homomorphism (or a linear transformation). If g is
a one-to-one (module-)homomorphism, then it is called an isomorphism
(or module-isomorphism). Furthermore, if G1 = G2, then g is called an
automorphism (or module-automorphism).
Definition 21 Two modules M1, M2 are called isomorphic if there is an
isomorphism g : M1 → M2.
Finally, a general family of transformations is defined by
Definition 22 Let g : M1 → M2 be a (module-)homomorphism. Then a
mapping h : M1 → M2 defined by
h(a) = c + g(a)
(1.9)
Theorem 6 Let M be a direct sum of M1, ..., Mn. Then M is isomorphic to the module M′ = {(a1, a2, ..., an) : ai ∈ Mi} with the operations (a1, a2, ...) + (b1, b2, ...) = (a1 + b1, a2 + b2, ...) and r · (a1, a2, ...) =
(r · a1, r · a2, ...).
Thus, a module M = M1 + M2 + ... + Mn can be described in terms of
its coordinates with respect to Mi (i = 1, ..., n), and the structure of M is
known as soon as we know the structure of the Mi (i = 1, ..., n).
Direct products can be used, in particular, to characterize the structure
of finite abelian groups:
Theorem 7 Let (G, ·) be a finite commutative group. Then G is isomorphic to the direct product of its Sylow subgroups.
Theorem 8 Let (G, ·) be a finite commutative group. Then G is the direct
product of cyclic groups.
Similar, but slightly more involved, results can be shown for modules, but
will not be needed here.
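Theorem 8 can be checked by brute force for the musically relevant case G = Z12; the sketch below verifies that x ↦ (x mod 4, x mod 3) is an isomorphism onto the direct product Z4 × Z3 of its Sylow subgroups:

```python
# Theorem 8 for G = Z_12: the Sylow 2- and 3-subgroups have orders 4 and 3,
# and phi(x) = (x mod 4, x mod 3) is a group isomorphism Z_12 -> Z_4 x Z_3.

def phi(x):
    return (x % 4, x % 3)

# phi is a bijection: all 12 images are distinct.
assert len({phi(x) for x in range(12)}) == 12

# phi is a homomorphism: addition mod 12 maps to componentwise addition.
for x in range(12):
    for y in range(12):
        lhs = phi((x + y) % 12)
        rhs = ((phi(x)[0] + phi(y)[0]) % 4, (phi(x)[1] + phi(y)[1]) % 3)
        assert lhs == rhs
```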
1.3 Specic applications in music
In the following, the usefulness of algebraic structures in music is illustrated by a few selected examples. This is only a small selection from
the extended literature on this topic. For further reading see e.g. Graeser
(1924), Schönberg (1950), Perle (1955), Fletcher (1956), Babbitt (1960,
1961), Price (1969), Archibald (1972), Halsey and Hewitt (1978), Balzano
(1980), Rahn (1980), Götze and Wille (1985), Reiner (1985), Berry (1987),
Mazzola (1990a, 2002 and references therein), Vuza (1991, 1992a,b, 1993),
Fripertinger (1991), Lendvai (1993), Benson (1995-2002), Read (1997), Noll
(1997), Andreatta (1997), Stange-Elbe (2000), among others.
1.3.1 The Mathieu group
It can be shown that finite simple groups fall into families that can be
described explicitly, except for 26 so-called sporadic groups. One such group
is the so-called Mathieu group M12, which was discovered by the French
mathematician Mathieu in the 19th century (Mathieu 1861, 1873; also see
e.g. Conway and Sloane 1988). In their study of probabilistic properties of
(card) shuffling, Diaconis et al. (1983) show that M12 can be generated by
two permutations (which they call Mongean shuffles), namely
σ1 =
( 1  2  3  4  5  6  7  8  9  10  11  12 )
( 7  6  8  5  9  4  10 3  11  2  12   1 )
(1.10)
and
σ2 =
( 1  2  3  4  5  6  7  8  9  10  11  12 )
( 6  7  5  8  4  9  3  10 2  11   1  12 )
(1.11)
where the lower rows denote the images of the numbers 1, ..., 12. The order
of this group is o(M12) = 95040 (!). An interesting application of these
permutations can be found in Île de feu 2 by Olivier Messiaen (Berry 1987),
where σ1 and σ2 are used to generate sequences of tones and durations.
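Assuming the two permutations above are correctly transcribed, the stated order o(M12) = 95040 can be verified numerically by computing the closure of the generated group; this is a brute-force sketch, not an efficient group-theoretic algorithm:

```python
# The two Mongean shuffles written 0-based: position i (0..11) maps to s[i].
# s1: 1->7, 2->6, ..., 12->1;  s2: 1->6, 2->7, ..., 12->12 (eqs. 1.10-1.11).
s1 = (6, 5, 7, 4, 8, 3, 9, 2, 10, 1, 11, 0)
s2 = (5, 6, 4, 7, 3, 8, 2, 9, 1, 10, 0, 11)

def compose(p, q):
    # (p o q)(i) = p(q(i))
    return tuple(p[q[i]] for i in range(12))

def closure(generators):
    """All permutations reachable from the identity by the generators."""
    identity = tuple(range(12))
    seen = {identity}
    frontier = [identity]
    while frontier:
        new = []
        for g in frontier:
            for s in generators:
                h = compose(s, g)
                if h not in seen:
                    seen.add(h)
                    new.append(h)
        frontier = new
    return seen

M12 = closure([s1, s2])
assert len(M12) == 95040  # o(M12), as stated in the text
```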
1.3.2 Campanology
A rather peculiar example of group theory in action (though perhaps
rather trivial mathematically) is campanology, or change ringing (Fletcher
1956, Wilson 1965, Price 1969, White 1983, 1985, 1987, Stewart 1992). The
art of change ringing started in England in the 10th century and is still
performed today. The problem that is to be solved is as follows: there are
k swinging bells in the church tower. One starts playing a melody that
consists of a certain sequence in which the bells are played, each bell being played only once. Thus, the initial sequence is a permutation of the
numbers 1, ..., k. Since it is not interesting to repeat the same melody over
and over, the initial melody has to be varied. However, the bells are very
heavy, so that it is not easy to change the timing of the bells. Each variation
is therefore restricted, in that in each round only one pair of adjacent
bells can exchange their position. Thus, for instance, if k = 4 and the previous sequence was (1, 2, 3, 4), then the only permissible permutations are
(2, 1, 3, 4), (1, 3, 2, 4), and (1, 2, 4, 3). A further, mainly aesthetic restriction
is that no sequence should be repeated, except that the last one is identical with the initial sequence. A typical solution to this problem is, for
instance, the Plain Bob that starts with (1, 2, 3, 4), (2, 1, 4, 3), (2, 4, 1, 3), ...
and continues until all permutations in S4 are visited.
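The adjacent-swap restriction, and the search for a sequence visiting all k! rows exactly once before returning home, can be sketched as follows; `find_extent` is a naive depth-first search, not the ringers' actual Plain Bob method:

```python
import math

def allowed_changes(row):
    """Rows reachable from `row` by swapping one pair of adjacent bells."""
    out = []
    for i in range(len(row) - 1):
        nxt = list(row)
        nxt[i], nxt[i + 1] = nxt[i + 1], nxt[i]
        out.append(tuple(nxt))
    return out

def find_extent(k):
    """Depth-first search for a cycle through all k! rows."""
    start = tuple(range(1, k + 1))
    total = math.factorial(k)
    path, used = [start], {start}

    def dfs():
        if len(path) == total:
            # the last row must be one adjacent swap away from the start
            return start in allowed_changes(path[-1])
        for nxt in allowed_changes(path[-1]):
            if nxt not in used:
                used.add(nxt)
                path.append(nxt)
                if dfs():
                    return True
                path.pop()
                used.remove(nxt)
        return False

    return path if dfs() else None

# For k = 4 the permissible successors of (1,2,3,4) are exactly the three
# rows named in the text:
assert allowed_changes((1, 2, 3, 4)) == [(2, 1, 3, 4), (1, 3, 2, 4), (1, 2, 4, 3)]
extent = find_extent(4)
assert extent is not None and len(set(extent)) == 24
```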
1.3.3 Representation of music
Many aspects of music can be embedded in a suitable algebraic module
(see e.g. Mazzola 1990a). Here are some examples:
1. Apart from glissando effects, the essential frequencies in most types of
music are of the form
f = fo · ∏_{i=1}^{K} p_i^{x_i}
(1.12)
so that, on a logarithmic scale,
log f = log fo + Σ_{i=1}^{K} x_i log p_i
(1.13)
In the equal-tempered system, the frequencies are
f = 440 · 2^{p/12}
(1.14)
so that
log f = log 440 + (p/12) · log 2
(1.15)
If notes that differ by one or several octaves are considered equivalent, then we can identify the set of notes with the Z-module Z12 =
{0, 1, ..., 11}.
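Equations (1.14)-(1.15) can be sketched directly, taking A4 = 440 Hz as the reference pitch p = 0 (a conventional choice, assumed here for illustration):

```python
import math

# Equal temperament (eq. 1.14): frequency of the note p semitones above
# (or below, for negative p) the 440 Hz reference.
def freq(p: int) -> float:
    return 440.0 * 2.0 ** (p / 12.0)

# On the log scale the notes are equally spaced (eq. 1.15): each semitone
# adds exactly (log 2)/12.
gaps = [math.log(freq(p + 1)) - math.log(freq(p)) for p in range(-24, 24)]
assert all(abs(g - math.log(2.0) / 12.0) < 1e-12 for g in gaps)

# Octave equivalence identifies p with p mod 12 in the Z-module Z_12.
assert (9 + 12) % 12 == 9
```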
Definition                                          Musical meaning
Shift: f(x) = x + a                                 Transposition, repetition,
                                                    change of duration,
                                                    change of loudness
                                                    Arpeggio
                                                    Retrograde, inversion
                                                    Augmentation
Exchange of coordinates:                            Exchange of parameters
f(x) = (x2, x1, x3, x4)                             (20th century)
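Assuming a toy motif given as (onset, pitch) pairs (the MIDI-style numbers below are invented for illustration, not taken from the text), the affine maps h(a) = c + g(a) of Definition 22 realize the transformations in the table:

```python
# A hypothetical four-note motif: (onset in beats, MIDI pitch).
motif = [(0, 60), (1, 62), (2, 64), (3, 65)]

def shift_pitch(m, a):
    """Shift f(x) = x + a applied to the pitch coordinate: transposition."""
    return [(t, p + a) for t, p in m]

def retrograde(m):
    """Reverse time about the last onset."""
    T = max(t for t, _ in m)
    return sorted((T - t, p) for t, p in m)

def inversion(m, axis):
    """Mirror pitches about a fixed axis pitch."""
    return [(t, 2 * axis - p) for t, p in m]

def augmentation(m, k):
    """Stretch all onsets (and hence durations) by the factor k."""
    return [(k * t, p) for t, p in m]

assert shift_pitch(motif, 12) == [(0, 72), (1, 74), (2, 76), (3, 77)]
assert retrograde(motif)[0] == (0, 65)   # the motif now starts on its last note
```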
[Score: Spiegel-Duett (attributed to W.A. Mozart), violin part, Allegro, quarter note = 120.]
Figure 1.9 Arnold Schönberg: sketch for the piano concerto op. 42, notes with
tone row and its inversions and transpositions. (Used by permission of Belmont
Music Publishers.)
Figure 1.10 Notes of Air by Henry Purcell. (For better visibility, only a small
selection of related motifs is marked.)
Figure 1.11 Notes of Fugue No. 1 (first half) from Das Wohltemperierte
Klavier by J.S. Bach. (For better visibility, only a small selection of related
motifs is marked.)
Figure 1.12 Notes of op. 68, No. 2 from Album für die Jugend by Robert Schumann. (For better visibility, only a small selection of related motifs is marked.)
Figure 1.14 Graphical representation of pitch and onset time in Z² together with
instrumentation of polygonal areas. (Excerpt from Śānti, Piano Concerto No. 2
by Jan Beran, col legno CD 20062; courtesy of col legno, Germany.)
CHAPTER 2
[Score: piano piece by Robert Schumann, with ritardando and a tempo markings.]
[Plot: log(tempo) versus onset time for performances recorded in 1947, 1963, and 1965.]
Definition                                                      Feature measured
Empirical distribution function
  Fn(x) = n⁻¹ Σ_{i=1}^n 1{xi ≤ x}                               Proportion of obs. ≤ x
Minimum                                                         Smallest value
Maximum                                                         Largest value
Range                                                           Total spread
Sample mean  x̄ = n⁻¹ Σ_{i=1}^n xi                               Center
Sample median  M = inf{x : Fn(x) ≥ 1/2}                         Center
Sample quantile  qα = inf{x : Fn(x) ≥ α}                        Border of lower 100α%
Quartiles  Q1 = q1/4, Q2 = q3/4                                 Border of lower 25%, upper 75%
Sample variance  s² = (n − 1)⁻¹ Σ_{i=1}^n (xi − x̄)²             Variability
Sample standard deviation  s = +√s²                             Variability
Interquartile range  IQR = Q2 − Q1                              Variability
Sample skewness  m3 = n⁻¹ Σ_{i=1}^n [(xi − x̄)/s]³               Asymmetry
Sample kurtosis  m4 = n⁻¹ Σ_{i=1}^n [(xi − x̄)/s]⁴ − 3           Flat/sharp peak
almost all data are in the interval x̄ ± 3s. For a sufficiently large sample
size, these conclusions can be carried over to the population from which
the data were drawn.
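A minimal sketch computing some of the statistics in the table above (sample variance with divisor n − 1, skewness and excess kurtosis as defined there); the data are invented for illustration:

```python
import math

def describe(xs):
    """Summary statistics following the conventions of the table above."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)   # sample variance
    s = math.sqrt(var)
    z = [(x - mean) / s for x in xs]                   # standardized values
    return {
        "mean": mean,
        "sd": s,
        "skewness": sum(v ** 3 for v in z) / n,
        "kurtosis": sum(v ** 4 for v in z) / n - 3,    # excess kurtosis
    }

stats = describe([1.0, 2.0, 2.0, 3.0, 10.0])
assert stats["skewness"] > 0   # the long right tail shows up as positive skewness
```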
qα = qαᵢ + ((α − αᵢ)/(1/n)) · (qαᵢ₊₁ − qαᵢ)
(2.1)
form
f̂(x) = (nb)⁻¹ Σ_{i=1}^n K((xi − x)/b)
(2.2)
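A sketch of the kernel density estimate (2.2), here with a Gaussian kernel K; the data points and bandwidth b are invented for illustration:

```python
import math

def kde(xs, b):
    """Kernel density estimate f_hat(x) = (nb)^(-1) sum K((x_i - x)/b),
    using the standard Gaussian density as kernel K."""
    n = len(xs)

    def K(u):
        return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)

    return lambda x: sum(K((xi - x) / b) for xi in xs) / (n * b)

# Hypothetical data and bandwidth:
f = kde([-1.0, 0.0, 0.2, 1.1], b=0.5)

# The estimate is a proper density: it integrates (numerically) to one.
grid = [i * 0.01 for i in range(-600, 601)]
assert abs(sum(f(x) * 0.01 for x in grid) - 1.0) < 0.01
```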
[Figure 2.3: log(tempo) versus onset time for the 28 tempo curves; the performers are
Argerich, Arrau, Askenaze, Brendel, Bunin, Capova, Cortot (three recordings), Curzon,
Davies, Demus, Eschenbach, Gianoli, Horowitz (three recordings), Katsaris, Klien, Krust,
Kubalek, Moiseiwitsch, Ney, Novaes, Ortiz, Schnabel, Shelley, and Zak.]
sample considered here, pianists of the modern era tend to make a much
stronger distinction between A and A′ in terms of slow tempi. The only
exceptions (outliers in the left boxplot) are Moiseiwitsch and Horowitz's
first performance, and Ashkenazy (outlier in the right boxplot). The comparison of skewness and kurtosis in Figures 2.4g and h also indicates that
modern pianists seem to prefer occasional extreme ritardandi. The only
exception in the early 20th century group is Artur Schnabel, with an
extreme skewness of 2.47 and a kurtosis of 7.04.
Direct comparisons of tempo distributions are shown in Figures 2.5a
Figure 2.4 Boxplots of descriptive statistics for the 28 tempo curves in Figure
2.3.
[Figure 2.5: pairwise quantile-quantile plots of log(tempo) distributions, comparing
Argerich, Cortot (1935, 1947), Demus, Horowitz (1947, 1963), Krust, and Ortiz.]
tions in 1947 and 1963 are almost the same, except for slight changes for
very low tempi (Figure 2.5f).
were the first one (finalis, the final note) and the fifth note of the scale
(dominant). The system of 12 major and 12 minor scales was developed
later, adding more flexibility with respect to modulation and scales. The
main representatives of a major/minor scale are three triads, obtained
by adding thirds, starting at the basic note corresponding to the first
(tonic), fourth (subdominant), and fifth (dominant) note of the scale respectively.
Other triads are also, but to a lesser degree, associated with the properties
tonic, subdominant, and/or dominant. In the 20th century, and partially
already in the late 19th century, other systems of scales as well as systems
that do not rely on any specific scales were proposed (in particular 12-tone
music).
Figure 2.6 Frequencies of notes 0,1,...,11 for moving windows of onset-length 16.
A very simple illustration of this development can be obtained by counting the frequencies of notes (pitches) in the following way: consider a score
in equal temperament. Ignoring transposition by octaves, we can represent
all notes x(t1), ..., x(tn) by the integers 0, 1, ..., 11. Here, t1 ≤ t2 ≤ ... ≤ tn
Figure 2.7 Frequencies of notes 0,1,...,11 for moving windows of onset-length 16.
pj(x) = (2k + 1)⁻¹ Σ_{i=j}^{j+2k} 1{x(ti) = x}
6 and 7 by F. Martin (1890-1971). For each j = 4, 8, ..., 64, the frequencies pj(0), ..., pj(11) are joined by lines respectively. The obvious common
feature for Bach, Mozart, and Schumann is a distinct preference (local maximum) for the notes 5 and 7 (apart from 0). Note that if 0 is the root of
the tonic triad, then 5 corresponds to the root of the subdominant triad.
Similarly, 7 is the root of the dominant triad. Also relatively frequent are the
notes 3 = minor third (second note of the tonic triad in minor) and 10 = minor
seventh, which is the fourth note of the dominant seventh chord to the
subdominant. Also note that, for Schumann, the local maxima are somewhat
less pronounced. A different pattern can be observed for Scriabin, and even
more so for Martin. In Scriabin's Prélude op. 51/2, the perfect fifth almost
never occurs, but instead the major sixth is very frequent. In Scriabin's
Prélude op. 51/4, the tonal system is dissolved even further, as the clearly
dominating note is 6, which builds together with 0 the augmented fourth
(or diminished fifth), an interval that is considered highly dissonant in
tonal music. Nevertheless, even in Scriabin's compositions, the distribution
of notes does not change very rapidly, since the sixteen overlayed curves are
almost identical. This may indicate that the notion of scales or a slow harmonic development still plays a role. In contrast, in Frank Martin's Prélude
No. 6, the distribution changes very quickly. This is hardly surprising, since
Martin's style incorporates, among other influences, dodecaphonism (12-tone
music), a compositional technique that does not impose traditional
restrictions on the harmonic structure.
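The windowed relative frequencies pj(x) underlying Figures 2.6 and 2.7 can be sketched as follows; the note sequence below is a made-up C-major-style illustration, not one of the analyzed scores:

```python
from collections import Counter

def window_freqs(notes, j, k):
    """Relative frequencies p_j(x), x = 0..11, over the window of
    2k + 1 notes starting at index j (cf. the definition of p_j above)."""
    window = notes[j:j + 2 * k + 1]
    counts = Counter(n % 12 for n in window)
    m = len(window)
    return [counts[x] / m for x in range(12)]

# Hypothetical fragment, pitch classes written relative to the tonic 0:
notes = [0, 4, 7, 5, 7, 0, 2, 4, 5, 7, 11, 0, 7]
p = window_freqs(notes, 0, 6)

assert abs(sum(p) - 1.0) < 1e-12       # the frequencies sum to one
assert max(p) == p[7]                  # the dominant (7) is most frequent here
```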
2.4 Some descriptive statistics and plots for bivariate data
2.4.1 Definitions
We give a short overview of important descriptive concepts for bivariate
data. For a comprehensive treatment, we refer the reader to the standard textbooks given above (also see e.g. Plackett 1960, Ryan 1996, Srivastava and
Sen 1997, Draper and Smith 1998, and Rao 1973 for basic theoretical results).
Correlation
If each observation consists of a pair of measurements (xi, yi), then the main
objective is to investigate the relationship between x and y. Consider, for
example, the case where both variables are quantitative. The data can then
be displayed in a scatter plot (y versus x). Useful statistics are Pearson's
sample correlation
r = (1/n) Σ_{i=1}^n ((xi − x̄)/sx)((yi − ȳ)/sy) = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / √(Σ_{i=1}^n (xi − x̄)² · Σ_{i=1}^n (yi − ȳ)²)
(2.3)
where s²x = n⁻¹ Σ_{i=1}^n (xi − x̄)² and s²y = n⁻¹ Σ_{i=1}^n (yi − ȳ)², and Spearman's
rank correlation
rSp = (1/n) Σ_{i=1}^n ((ui − ū)/su)((vi − v̄)/sv) = Σ_{i=1}^n (ui − ū)(vi − v̄) / √(Σ_{i=1}^n (ui − ū)² · Σ_{i=1}^n (vi − v̄)²)
(2.4)
where ui denotes the rank of xi among the x-values and vi is the rank
of yi among the y-values. In (2.3) and (2.4) it is assumed that sx, sy,
su, and sv are not zero. Recall that these definitions imply the following
properties: a) −1 ≤ r, rSp ≤ 1; b) r = 1, if and only if yi = βo + β1 xi
with β1 > 0 (exact linear relationship with positive slope); c) r = −1, if
and only if yi = βo + β1 xi with β1 < 0 (exact linear relationship with
negative slope); d) rSp = 1, if and only if xi > xj implies yi > yj (strictly
monotonically increasing relationship); e) rSp = −1, if and only if xi >
xj implies yi < yj (strictly monotonically decreasing relationship); f) r
measures the strength (and sign) of the linear relationship; g) rSp measures
the strength (and sign) of monotonicity; h) if the data are realizations of a
bivariate random variable (X, Y), then r is an estimate of the population
correlation ρ = cov(X, Y)/√(var(X)var(Y)), where cov(X, Y) = E[XY] −
E[X]E[Y], var(X) = cov(X, X) and var(Y) = cov(Y, Y). When using
these measures of dependence one should bear in mind that each of them
measures a specific type of dependence only, namely linear and monotonic
dependence respectively. Thus, a Pearson or Spearman correlation near
or equal to zero does not necessarily mean independence. Note also that
correlation can be interpreted in a geometric way as follows: defining the
n-dimensional vectors x = (x1, ..., xn)ᵗ and y = (y1, ..., yn)ᵗ, r is equal to
the standardized scalar product between x and y, and is therefore equal to
the cosine of the angle between these two vectors.
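Pearson's r (2.3) and Spearman's rSp (2.4) can be sketched directly from the definitions; this toy implementation ignores ties among ranks for simplicity:

```python
import math

def pearson(x, y):
    """Sample correlation r as in eq. (2.3)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))
    return num / den

def ranks(x):
    """Rank of each value within x (ties ignored for simplicity)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """r_Sp as in eq. (2.4): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 8.0, 27.0, 64.0]          # monotone but nonlinear in x
assert abs(spearman(x, y) - 1.0) < 1e-12   # r_Sp detects perfect monotonicity
assert pearson(x, y) < 1.0                 # r is below 1: the relation is not linear
```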
A special type of correlation is of interest for time series. Time series are data that are taken in a specific ordered (usually temporal) sequence. If Y₁, Y₂, ..., Yₙ are random variables observed at time points i = 1, ..., n, then one would like to know whether there is any linear dependence between observations Yᵢ and Yᵢ₋ₖ, i.e. between observations that are k time units apart. If this dependence is the same for all time points i, and the expected value of Yᵢ is constant, then the corresponding population correlation can be written as a function of k only (see Chapter 4),

cov(Yᵢ, Yᵢ₊ₖ) / √(var(Yᵢ)var(Yᵢ₊ₖ)) = ρ(k)   (2.5)
The sample analogue of ρ(k) is the sample autocorrelation

r(k) = n⁻¹ Σ_{i=1}^{n−k} ((yᵢ − ȳ)/s)((yᵢ₊ₖ − ȳ)/s)   (2.6)

where s² = n⁻¹ Σ_{i=1}^{n} (yᵢ − ȳ)². Analogously, the sample cross-correlation between two series x and y is

r_XY(k) = n⁻¹ Σ_{i=1}^{n−k} ((xᵢ − x̄)/s_X)((yᵢ₊ₖ − ȳ)/s_Y)   (2.7)

which estimates the population cross-correlation

ρ_XY(k) = cov(Xᵢ, Yᵢ₊ₖ) / √(var(Xᵢ)var(Yᵢ₊ₖ))   (2.8)
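The sample autocorrelation (2.6) can be sketched directly; note the normalization by n (not n − k) and by the overall sample variance. The MA(1) test series below is my own illustration, not an example from the book.

```python
import numpy as np

def acf(y, max_lag):
    # Sample autocorrelation r(k) as in (2.6): normalize by n and by the
    # overall sample variance s^2 of the full series.
    y = np.asarray(y, dtype=float)
    n = len(y)
    yc = y - y.mean()
    s2 = (yc ** 2).sum() / n
    return np.array([(yc[: n - k] * yc[k:]).sum() / (n * s2)
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
e = rng.normal(size=500)
y = np.convolve(e, [1.0, 0.8], mode="valid")  # MA(1): dependence only at lag 1
print(np.round(acf(y, 3), 2))
```

For an MA(1) series with coefficient 0.8 the theoretical autocorrelation is 0.8/1.64 ≈ 0.49 at lag 1 and zero at larger lags, which the sample values approximate.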
Simple linear regression models y as a linear function of x, yᵢ = β₀ + β₁xᵢ + εᵢ, with residuals rᵢ(b₀, b₁) = yᵢ − b₀ − b₁xᵢ. The least squares estimates minimize the residual sum of squares

SSE(b₀, b₁) = Σ_{i=1}^{n} (yᵢ − b₀ − b₁xᵢ)² = Σ_{i=1}^{n} rᵢ²(b₀, b₁) = ||y − Xb||²   (2.10)

where ||·|| denotes the euclidean norm, or length, of a vector, and X = (1, x) is the n × 2 design matrix. It is then clear that SSE is minimized by the orthogonal projection of y on the plane spanned by 1 and x. The estimate of β = (β₀, β₁)ᵗ is therefore

β̂ = (β̂₀, β̂₁)ᵗ = (XᵗX)⁻¹Xᵗy   (2.11)

and the vector of fitted values is given by

ŷ = (ŷ₁, ..., ŷₙ)ᵗ = X(XᵗX)⁻¹Xᵗy   (2.12)
Defining the measure of the total variability of y, SST = ||y − ȳ1||² (total sum of squares), and the quantities SSR = ||ŷ − ȳ1||² (regression sum of squares = variability due to the fact that the fitted line is not horizontal) and SSE = ||y − ŷ||² (error sum of squares = variability unexplained by the regression), one obtains the decomposition

SST = SSR + SSE   (2.13)

and the coefficient of determination

R² = SSR/SST = ||ŷ − ȳ1||² / Σ_{i=1}^{n} (yᵢ − ȳ)² = 1 − SSE/SST.   (2.14)
By definition, 0 ≤ R² ≤ 1, and R² = 1 if and only if yᵢ = ŷᵢ for all i (i.e. all points lie on the regression line). Moreover, for simple regression we also have R² = r². The advantage of defining R² as above (instead of via r²) is that the definition remains valid for the multiple regression model (see below), i.e. when several explanatory variables are available. Finally, note that an estimate of σ² is obtained by σ̂² = (n − 2)⁻¹ Σ_{i=1}^{n} rᵢ²(β̂₀, β̂₁).
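The projection formulas (2.11)-(2.12) and the definition (2.14) of R² translate into a few lines of linear algebra. The small data set below is invented for illustration.

```python
import numpy as np

# Simple linear regression via the normal equations (2.11)-(2.12),
# with R^2 computed as in (2.14).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

X = np.column_stack([np.ones_like(x), x])    # design matrix with columns 1, x
beta = np.linalg.solve(X.T @ X, X.T @ y)     # (X'X)^{-1} X'y
y_hat = X @ beta                             # orthogonal projection of y
sse = ((y - y_hat) ** 2).sum()               # error sum of squares
sst = ((y - y.mean()) ** 2).sum()            # total sum of squares
r_squared = 1.0 - sse / sst
sigma2 = sse / (len(y) - 2)                  # estimate of the error variance
print(beta, r_squared, sigma2)
```

For these nearly collinear points the fitted slope is 1.99, the intercept 1.04, and R² exceeds 0.99; the same code also verifies R² = r² for simple regression.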
In analogy to the sample mean and the sample variance, the least squares estimates of the regression parameters are sensitive to the presence of outliers. Outliers in regression can occur in the y-variable as well as in the x-variable. The latter are also called influential points. Outliers may often be correct and in fact very interesting observations (e.g. telling us that the assumed model may not be correct). However, since least squares estimates are highly influenced by outliers, it is often difficult to notice that there may be a problem, since the fitted curve tends to lie close to the outliers. Alternative, robust estimates can be helpful in such situations (see Huber 1981, Hampel et al. 1986). For instance, instead of minimizing the residual sum of squares we may minimize Σ ρ(rᵢ), where ρ is a bounded function. If ρ is differentiable, then the solution can usually also be found by solving the equations

Σ_{i=1}^{n} ψ(rᵢ/σ̂) ∂rᵢ(b)/∂bⱼ = 0   (j = 0, ..., p)   (2.15)

where ψ = ρ′,
and where σ̂² is a robust estimate of σ² obtained from an additional equation. Estimates defined in this way are (up to a certain degree) robust with respect to outliers in y, not however with respect to influential points (outliers in x). To control the effect of influential points one can, for instance, solve a set of equations

Σ_{i=1}^{n} ηⱼ(rᵢ/σ̂, xᵢ) = 0   (j = 0, ..., p)   (2.16)

where ηⱼ is such that it downweighs outliers in x as well. For a comprehensive theory of robustness see e.g. Huber (1981), Hampel et al. (1986).
For more recent, efficient and highly robust methods see Yohai (1987), Rousseeuw and Yohai (1984), Gervini and Yohai (2002), and references therein.
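One standard way to solve equations of the type (2.15) is iteratively reweighted least squares. The sketch below uses Huber's ψ function and a MAD-based scale estimate; it is a minimal illustration under these assumed choices, not the book's procedure, and the data with one gross outlier are invented.

```python
import numpy as np

def huber_fit(X, y, c=1.345, n_iter=50):
    # M-estimation by iteratively reweighted least squares with Huber weights.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # start from least squares
    for _ in range(n_iter):
        r = y - X @ beta
        # Robust scale: normalized median absolute deviation of the residuals.
        s = 1.4826 * np.median(np.abs(r - np.median(r))) or 1.0
        u = r / s
        w = np.where(np.abs(u) <= c, 1.0, c / np.abs(u))  # Huber weights
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted normal eqs.
    return beta

x = np.arange(10, dtype=float)
y = 2.0 + 0.5 * x
y[9] += 20.0                                   # one gross outlier in y
X = np.column_stack([np.ones_like(x), x])
print(np.linalg.lstsq(X, y, rcond=None)[0])    # pulled toward the outlier
print(huber_fit(X, y))                         # close to (2, 0.5)
```

The least squares slope is dragged far from 0.5 by the single outlier, while the Huber fit stays close to the line through the nine clean points. As noted in the text, this protects against outliers in y only, not against influential points in x.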
The results for simple linear regression can be extended easily to the case where more than one explanatory variable is available. The multiple linear regression model with p explanatory variables is defined by y = β₀ + β₁x₁ + ... + βₚxₚ + ε. For data we write yᵢ = β₀ + β₁xᵢ₁ + ... + βₚxᵢₚ + εᵢ (i = 1, ..., n). Note that the word "linear" refers to linearity in the parameters β₀, ..., βₚ. The function itself can be nonlinear. For instance, we may have polynomial regression with y = β₀ + β₁x + ... + βₚxᵖ + ε. The same geometric arguments as above apply, so that (2.11) and (2.12) hold with β = (β₀, ..., βₚ)ᵗ and the n × (p + 1) matrix X = (x⁽¹⁾, ..., x⁽ᵖ⁺¹⁾) with columns x⁽¹⁾ = 1 and x⁽ʲ⁺¹⁾ = xⱼ = (x₁ⱼ, ..., xₙⱼ)ᵗ (j = 1, ..., p).
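Polynomial regression as a special case of the multiple linear model can be sketched by building the n × (p + 1) design matrix explicitly; the exact quadratic data below are an invented check.

```python
import numpy as np

# Polynomial regression as multiple linear regression: the same normal
# equations (2.11) apply, with design-matrix columns 1, x, x^2.
x = np.linspace(-1.0, 1.0, 21)
y = 1.0 - 2.0 * x + 3.0 * x ** 2              # exact quadratic, no noise

X = np.column_stack([np.ones_like(x), x, x ** 2])  # n x (p+1) design matrix
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta, 6))
```

Since the data lie exactly on a quadratic, the solution recovers the coefficients (1, −2, 3), illustrating that "linear" refers to the parameters, not to the fitted function.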
Regression smoothing
A more general, but more difficult, approach to modeling a functional relationship is to impose less restrictive assumptions on the function g. For instance, we may assume

y = g(x) + ε   (2.17)

with g a twice continuously differentiable function. Under suitable additional conditions on x and ε it is then possible to estimate g from observed data by nonparametric smoothing. As a special example consider observations yᵢ taken at time points i = 1, 2, ..., n. A standard model is

yᵢ = g(tᵢ) + εᵢ   (2.18)
A kernel estimate of g is the weighted average

ĝ(t) = Σ_{i=1}^{n} wᵢyᵢ   (2.19)

with weights

wᵢ = wᵢ(t; b, n) = K((t − tᵢ)/b) / Σ_{j=1}^{n} K((t − tⱼ)/b)   (2.20)

with b > 0, and a kernel function K ≥ 0 such that K(u) = K(−u), K(u) = 0 (|u| > 1) and ∫₋₁¹ K(u)du = 1. The role of b is to restrict observations that influence the estimate to a small window of neighboring time points. For instance, the rectangular kernel K(u) = ½·1{|u| ≤ 1} yields the sample mean of the observations yᵢ in the window n(t − b) ≤ i ≤ n(t + b). An even more elegant formula can be obtained by approximating the Riemann sum (nb)⁻¹ Σ_{j=1}^{n} K((t − tⱼ)/b) by the integral ∫₋₁¹ K(u)du = 1:

ĝ(t) = Σ_{i=1}^{n} wᵢyᵢ = (nb)⁻¹ Σ_{i=1}^{n} K((t − tᵢ)/b) yᵢ   (2.21)
In this case, the sum of the weights is not exactly equal to one, but asymptotically (as n → ∞ and b → 0 such that nb³ → ∞) this error is negligible. It can be shown that, under fairly general conditions on g and ε, ĝ converges to g in a certain sense that depends on the specific assumptions (see e.g. Gasser and Müller 1979, Gasser and Müller 1984, Härdle 1991, Beran and Feng 2002, Wand and Jones 1995, and references therein).
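The normalized-weight form (2.19)-(2.20) can be sketched in a few lines. The Epanechnikov kernel K(u) = 0.75(1 − u²) on [−1, 1] and the noisy sine test curve are my own choices for illustration.

```python
import numpy as np

def kernel_smooth(t_grid, t, y, b):
    # Weights as in (2.20): kernel values normalized to sum to one at each
    # evaluation point, here with the Epanechnikov kernel.
    u = (t_grid[:, None] - t[None, :]) / b
    K = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    return (K * y).sum(axis=1) / K.sum(axis=1)

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * t) + 0.3 * rng.normal(size=200)
g_hat = kernel_smooth(t, t, y, b=0.1)
print(float(np.abs(g_hat - np.sin(2 * np.pi * t)).mean()))
```

The mean absolute deviation of ĝ from the true curve is far below the noise level, although some bias remains near the boundaries, where the window is one-sided.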
An alternative to kernel smoothing is local polynomial fitting (Fan and Gijbels 1995, 1996; also see Feng 1999). The idea is to fit a polynomial locally, i.e. to the data in a small neighborhood of the point of interest. This can be formulated as a weighted least squares problem as follows:

ĝ(t) = â₀   (2.22)

where

â = (â₀, ..., âₚ)ᵗ = arg min_a Σ_{i=1}^{n} K((t − tᵢ)/b) [yᵢ − a₀ − a₁(tᵢ − t) − ... − aₚ(tᵢ − t)ᵖ]²   (2.23)
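The weighted least squares problem (2.22)-(2.23) can be sketched directly; the estimate of g(t₀) is the fitted intercept. The kernel choice and test function below are assumptions for illustration only.

```python
import numpy as np

def local_poly(t0, t, y, b, deg=1):
    # Weighted least squares fit of a polynomial in (t - t0); the estimate
    # g_hat(t0) is the fitted constant term a0, as in (2.22)-(2.23).
    u = (t - t0) / b
    w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)  # kernel weights
    X = np.vander(t - t0, deg + 1, increasing=True)             # 1, (t-t0), ...
    WX = X * w[:, None]
    a = np.linalg.solve(X.T @ WX, WX.T @ y)                     # X'WX a = X'Wy
    return a[0]

t = np.linspace(0.0, 1.0, 101)
y = (t - 0.5) ** 2
print(local_poly(0.5, t, y, b=0.2, deg=2))
```

A local quadratic fit reproduces a quadratic function exactly, so the printed value is g(0.5) = 0 up to floating-point error; with noisy data the same code returns a locally smoothed estimate.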
A further class of smoothing methods, based on wavelets and thresholding, will not be discussed here (see e.g. Daubechies 1992, Donoho and Johnstone 1995, 1998, Donoho et al. 1995, 1996, Vidakovic 1999, Percival and Walden 2000, and references therein). A related method based on wavelets is discussed in Chapter 5.
Smoothing of two-dimensional distributions, sharpening
Estimating a relationship between x and y (where x and y are realizations of random variables X and Y respectively) amounts to estimating the joint two-dimensional distribution function F(x, y) = P(X ≤ x, Y ≤ y). For continuous variables with F(x, y) = ∫_{u≤x} ∫_{v≤y} f(u, v) du dv, the density function f can be estimated, for instance, by a two-dimensional histogram. For visual and theoretical reasons, a better estimate is obtained by kernel estimation (see e.g. Silverman 1986), defined by

f̂(x, y) = (nb₁b₂)⁻¹ Σ_{i=1}^{n} K(xᵢ − x, yᵢ − y; b₁, b₂)   (2.24)

where K is a two-dimensional kernel with bandwidths b₁ and b₂. The estimated density can be used for sharpening a scatterplot: for given numbers a and b, only points with a ≤ f̂(x, y) ≤ b are drawn in the scatterplot. Alternatively, one may plot all points and highlight the points with a ≤ f̂(x, y) ≤ b.
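A sharpening step of this kind can be sketched by evaluating (2.24) at the data points themselves and keeping only the high-density ones. The product-Gaussian kernel and the two-cluster synthetic data are assumptions for illustration.

```python
import numpy as np

def kde2(points, x, y, b1, b2):
    # Two-dimensional kernel density estimate (2.24) with a product-Gaussian
    # kernel, evaluated at the rows of `points`.
    u = (points[:, 0][:, None] - x[None, :]) / b1
    v = (points[:, 1][:, None] - y[None, :]) / b2
    K = np.exp(-0.5 * (u ** 2 + v ** 2)) / (2 * np.pi)
    return K.sum(axis=1) / (len(x) * b1 * b2)

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 300)])
y = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 300)])
pts = np.column_stack([x, y])
f = kde2(pts, x, y, b1=0.5, b2=0.5)
sharp = f >= np.quantile(f, 0.75)   # keep only the highest-density quarter
print(sharp.sum(), "of", len(f), "points retained for the sharpened plot")
```

Plotting only the retained points makes the two cluster centers stand out much more clearly than the full scatterplot would.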
Interpolation
Often a process may be generated in continuous time, but is observed at discrete time points. One may then wish to guess the values of the process between the observed time points. A popular method is spline smoothing, where one minimizes, over twice continuously differentiable functions g, the penalized sum of squares

V(g) = Σ_{i=1}^{n} [yᵢ − g(tᵢ)]² + λ ∫ [g″(t)]² dt   (2.26)

with λ > 0 controlling the smoothness of the solution. The minimizing cubic smoothing spline is asymptotically equivalent to a kernel estimate with kernel proportional to exp(−|u|/√2) sin(π/4 + |u|/√2) and a bandwidth b proportional to λ^(1/4) (Silverman 1986). If tᵢ = i/n, then the bandwidth is exactly equal to λ^(1/4).
Statistical inference
In this section, correlation, linear regression, nonparametric smoothing, and interpolation were introduced in an informal way, without exact discussion of probabilistic assumptions and statistical inference. All these techniques can be used in an informal way to explore possible structures without specific model assumptions. Sometimes, however, one wishes to obtain more solid conclusions by statistical tests and confidence intervals. There is an enormous literature on statistical inference in regression, including nonparametric approaches. For selected results see the references given above. For nonparametric methods also see Wand and Jones (1995), Simonoff (1996), Bowman and Azzalini (1997), Eubank (1999) and references therein.
2.5 Specific applications in music – bivariate
2.5.1 Empirical tempo-acceleration
Consider the tempo curves in Figure 2.3. An approximate measure of tempo-acceleration may be defined by the discrete second derivative

a(tᵢ) = [v(tᵢ₊₁) − v(tᵢ)]/(tᵢ₊₁ − tᵢ), with v(tᵢ) = [y(tᵢ₊₁) − y(tᵢ)]/(tᵢ₊₁ − tᵢ)   (2.27)

where y(t) is the tempo (or log-tempo) at time t. Figures 2.10a through f
show a(t) for the three performances each by Cortot and Horowitz. From the pictures alone it is not easy to see to what extent there are similarities or differences. Consider now the pairs (aⱼ(tᵢ), aₗ(tᵢ)), where aⱼ and aₗ are the acceleration measurements of performances j and l respectively. We calculate the sample correlation for each pair (j, l) ∈ {1, ..., 28} × {1, ..., 28} (j ≠ l). Figure 2.11a shows the correlations between Cortot 1 and the other performances. As expected, Cortot correlates best with Cortot: the correlation between Cortot 1 and Cortot's other two performances is clearly highest. The analogous observation can be made for Horowitz 1 (1947) (Figure 2.11b). It is also interesting to compare how much overall resemblance there is between a selected performance and the other performances. For each of the 28 performances, the average and the maximal correlation with the other performances were calculated. Figures 2.11c and d indicate that, in terms of acceleration, Cortot's style appears to be quite unique among the pianists considered here. The overall (average and maximal) similarity between each of his three acceleration curves and the other performances is much smaller than for any other pianist.
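The comparison of performances by pairwise acceleration correlations can be sketched as follows. The random-walk "tempo curves" are synthetic stand-ins for the real measurements, and the array names are my own.

```python
import numpy as np

# Pairwise correlations between acceleration curves of several performances.
# Synthetic tempo data stand in for the 28 real tempo curves of Figure 2.3.
rng = np.random.default_rng(4)
n_perf, n_onsets = 5, 32
tempo = np.cumsum(rng.normal(size=(n_perf, n_onsets)), axis=1)

accel = np.diff(tempo, n=2, axis=1)       # discrete acceleration a(t_i)
R = np.corrcoef(accel)                    # n_perf x n_perf correlation matrix

# Mean and maximal correlation of each performance with the others,
# as in Figures 2.11c and d.
mean_corr = (R.sum(axis=1) - 1.0) / (n_perf - 1)
max_corr = np.where(np.eye(n_perf, dtype=bool), -np.inf, R).max(axis=1)
print(np.round(mean_corr, 2), np.round(max_corr, 2))
```

With real data, rows of `tempo` aligned on common onset times would replace the random walks, and the row/column indices would label the 28 pianists.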
[Figure 2.10 Acceleration a(t) plotted against onset time t: six panels for the performances by Cortot and Horowitz.]

[Figure 2.11 a) Acceleration – correlations of Cortot (1935) with the other performances; b) acceleration – correlations of Horowitz (1947) with the other performances; c), d) mean and maximal correlation of each of the 28 performances (Argerich through Zak) with the other performances.]
The tempo curves are decomposed by repeated kernel smoothing with decreasing bandwidths b₁ > b₂ > b₃:

ĝ₁(t) = (nb₁)⁻¹ Σ_{i=1}^{n} K((t − tᵢ)/b₁) yᵢ   (2.28)

ĝ₂(t) = (nb₂)⁻¹ Σ_{i=1}^{n} K((t − tᵢ)/b₂) [yᵢ − ĝ₁(tᵢ)]   (2.29)

ĝ₃(t) = (nb₃)⁻¹ Σ_{i=1}^{n} K((t − tᵢ)/b₃) [yᵢ − ĝ₁(tᵢ) − ĝ₂(tᵢ)]   (2.30)

e(tᵢ) = yᵢ − ĝ₁(tᵢ) − ĝ₂(tᵢ) − ĝ₃(tᵢ)   (2.31)
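The multiresolution decomposition (2.28)-(2.31) can be sketched by applying the same smoother to successive residuals. The kernel, bandwidths, and test signal below are my own illustrative choices.

```python
import numpy as np

def ksmooth(t, y, b):
    # Kernel smoother with Epanechnikov kernel, weights normalized to one.
    u = (t[:, None] - t[None, :]) / b
    K = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    return (K * y).sum(axis=1) / K.sum(axis=1)

rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 256)
y = 2 * t + 0.5 * np.sin(8 * np.pi * t) + 0.1 * rng.normal(size=256)

g1 = ksmooth(t, y, 0.3)             # coarse trend, as in (2.28)
g2 = ksmooth(t, y - g1, 0.06)       # medium-scale fluctuations, (2.29)
g3 = ksmooth(t, y - g1 - g2, 0.02)  # fine-scale fluctuations, (2.30)
e = y - g1 - g2 - g3                # residuals at the finest level, (2.31)
print(float(np.var(e)), float(np.var(y)))
```

Each component isolates structure at one scale: the linear trend lands mostly in g1, the sine wave in g2/g3, and the residual variance is far smaller than the variance of the raw series.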
[Figure panels: smoothed tempo-curve components ĝ₁(t), ĝ₂(t), ĝ₃(t) for the 28 performances (Argerich through Zak), each plotted against onset time t.]
Figure 2.16 Smoothed tempo curves: residuals e(tᵢ) = yᵢ − ĝ₁(tᵢ) − ĝ₂(tᵢ) − ĝ₃(tᵢ).
The tempo curves are thus decomposed into curves corresponding to a hierarchy of bandwidths. Each component reveals specific features. The first component reflects the overall tendency of the tempo. Most pianists have an essentially monotonically decreasing curve corresponding to a gradual, and towards the end emphasized, ritardando. For some performances (in particular Bunin, Capova, Gianoli, Horowitz 1, Kubalek, and Moiseiwitsch) there is a distinct initial acceleration with a local maximum in the middle of the piece. The second component ĝ₂(t) reveals tempo fluctuations that correspond to a natural division of the piece into 8 times 4 bars. Some pianists, like Cortot, greatly emphasize this 8 × 4 structure. For other pianists, such as Horowitz, the 8 × 4 structure is less evident: the smoothed tempo curve is mostly quite flat, though the main, but smaller, tempo changes do take place at the junctions of the eight parts. Also striking is the distinction between part B (bars 17 to 24) and the other parts (A, A′, A″) of the composition, in particular in Argerich's performance. The third component characterizes fluctuations at the resolution level of 2/8th. At this very local level, tempo changes frequently for pianists like Horowitz, whereas there is less local movement in Cortot's performances. Finally, the residuals e(t) consist of the remaining fluctuations at the finest resolution of 1/8th. The similarity between the three residual curves by Horowitz illustrates that even at this very fine level, the "seismic" variation of tempo is a highly controlled process that is far from random.
[Panels a), b), d), e), f): −m(t) with spans 24/32 and 8/32, first and second derivatives, melodic indicator against t.]
Figure 2.17 Melodic indicator: local polynomial fits together with first and second derivatives.
[Figure panels: first derivatives for the 28 performances (Argerich through Zak), plotted against t.]
Figure 2.18 Tempo curves (Figure 2.3): first derivatives obtained from local polynomial fits (span 24/32).
[Figure panels: second derivatives for the 28 performances (Argerich through Zak), plotted against t.]
Figure 2.19 Tempo curves (Figure 2.3): second derivatives obtained from local polynomial fits (span 8/32).
Figure 2.21b: points with high estimated joint density f̂(x, y) are marked with "O". In contrast to what one would expect from a regression model with random errors εᵢ that are independent of x, the points with highest density gather around a horizontal line rather than around the regression line(s) fitted in Figure 2.21b. Thus, a linear regression model is hardly applicable. Instead, the data may possibly be divided into three clusters: a) a cluster with low loudness and low tempo; b) a second cluster with medium loudness and low to medium tempo; and c) a third cluster with a high level of loudness and medium to high tempo.
Figure 2.21 log(amplitude) and tempo for Kinderszene No. 4: auto- and cross-correlations (Figure 2.24a), scatter plot with fitted least squares and robust lines (Figure 2.24b), time series plots (Figure 2.24c), and sharpened scatter plot (Figure 2.24d).
Figure 2.23 Horowitz's performance of Kinderszene No. 4: two-dimensional histogram of (x, y) = (log(tempo), log(amplitude)), displayed in a perspective and an image plot respectively.
Figure 2.25 R. Schumann, Träumerei op. 15, No. 7: density of the melodic indicator with sharpening region (a), and melodic curve plotted against onset time with sharpening points highlighted (b).
[Figure panels: tempo for Cortot 1-3 and Horowitz 1-3.]
Figure 2.26 R. Schumann, Träumerei op. 15, No. 7: tempo by Cortot and Horowitz at sharpening onset times.
[Figure panels: diff(tempo) for Cortot 1-3 and Horowitz 1-3.]
Figure 2.27 R. Schumann, Träumerei op. 15, No. 7: tempo derivatives for Cortot and Horowitz at sharpening onset times.
Figures 2.26 and 2.27 show the tempo y and its discrete derivative v(tᵢ) = [y(tᵢ₊₁) − y(tᵢ)]/(tᵢ₊₁ − tᵢ) for tᵢ ∈ I_sharp and the performances by Cortot and Horowitz. The pictures indicate a systematic difference between Cortot and Horowitz. A common feature is the negative derivative at the fifth and sixth sharpening onset time.
2.6 Some multivariate descriptive displays
2.6.1 Definitions
Suppose that we observe multivariate data x₁, x₂, ..., xₙ where each xᵢ is a p-dimensional vector (xᵢ₁, ..., xᵢₚ)ᵗ ∈ ℝᵖ. Obvious numerical summary statistics are the sample mean

x̄ = (x̄₁, x̄₂, ..., x̄ₚ)ᵗ, where x̄ⱼ = n⁻¹ Σ_{i=1}^{n} xᵢⱼ,

and the sample covariance matrix S with entries

Sⱼₗ = (n − 1)⁻¹ Σ_{i=1}^{n} (xᵢⱼ − x̄ⱼ)(xᵢₗ − x̄ₗ).
Most methods for analyzing multivariate data are based on these two statistics. One of the main tools is dimension reduction by suitable projections, since it is easier to find and visualize structure in low dimensions. These techniques go far beyond descriptive statistics. We therefore postpone the discussion of these methods to Chapters 8 to 11. Another set of methods consists of visualizing individual multivariate observations. The main purpose is a simple visual identification of similarities and differences between observations, as well as the search for clusters and other patterns. Typical examples are:

Faces: xᵢ = (xᵢ₁, ..., xᵢₚ)ᵗ is represented by a face with features depending on the values of the corresponding coordinates. For instance, the face function in S-Plus has the following correspondence between coordinates and feature parameters: xᵢ,₁ = area of face; xᵢ,₂ = shape of face; xᵢ,₃ = length of nose; xᵢ,₄ = location of mouth; xᵢ,₅ = curve of smile; xᵢ,₆ = width of mouth; xᵢ,₇ = location of eyes; xᵢ,₈ = separation of eyes; xᵢ,₉ = angle of eyes; xᵢ,₁₀ = shape of eyes; xᵢ,₁₁ = width of eyes; xᵢ,₁₂ = location of pupil; xᵢ,₁₃ = location of eyebrow; xᵢ,₁₄ = angle of eyebrow; xᵢ,₁₅ = width of eyebrows.

Stars: Each coordinate is represented by a ray of a star, the length of each ray corresponding to the value of the coordinate. More specifically, a star for a data vector xᵢ = (xᵢ₁, ..., xᵢₚ)ᵗ is constructed as follows:
1. Scale the data to the range [0, r]: 0 ≤ x₁ⱼ, ..., xₙⱼ ≤ r (j = 1, ..., p);
2. Draw p rays at angles θⱼ = 2π(j − 1)/p (j = 1, ..., p); for a star with origin 0 representing observation xᵢ, the end point of the jth ray has the coordinates (xᵢⱼ cos θⱼ, xᵢⱼ sin θⱼ);
3. For visual reasons, the end points of the rays may be connected by straight lines.

Profiles: An observation xᵢ = (xᵢ₁, ..., xᵢₚ)ᵗ is represented by a plot of xᵢⱼ versus j, where neighboring points xᵢ,ⱼ₋₁ and xᵢⱼ (j = 2, ..., p) are connected.

Symbol plot: The horizontal and vertical positions represent xᵢ₁ and xᵢ₂ respectively (or any other two coordinates of xᵢ). The other coordinates xᵢ₃, ..., xᵢₚ determine p − 2 characteristic shape parameters of a geometric object that is plotted at the point (xᵢ₁, xᵢ₂). Typical symbols are circles (one additional dimension), rectangles (two additional dimensions), stars (arbitrary number of additional dimensions), and faces (arbitrary number of additional dimensions).
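The geometric part of the star construction (step 2) is a short computation. This sketch assumes the coordinates are already scaled to [0, r] as in step 1; the example vector is invented.

```python
import numpy as np

def star_coordinates(xi):
    # End points of the p rays of a star for one observation xi, assuming
    # the coordinates are already scaled to [0, r] (step 1 of the text).
    p = len(xi)
    theta = 2 * np.pi * np.arange(p) / p      # angles 2*pi*(j-1)/p
    return np.column_stack([xi * np.cos(theta), xi * np.sin(theta)])

xi = np.array([1.0, 0.5, 0.75, 0.25])
print(np.round(star_coordinates(xi), 3))
```

Connecting consecutive rows of the returned array (and closing the polygon) yields the star outline described in step 3.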
2.7 Specific applications in music – multivariate
2.7.1 Distribution of notes – Chernoff faces
In music that is based on scales, pitch (modulo 12) is usually not equally distributed. Notes that belong to the main scale are more likely to occur, and within these, there are certain preferred notes as well (e.g. the roots of the tonic, subtonic and supertonic triads). To illustrate this, we consider the following compositions: 1. Saltarello (Anonymus, 13th century); 2. Prelude and Fugue No. 1 from Das Wohltemperierte Klavier (J. S. Bach, 1685-1750); 3. Kinderszene op. 15, No. 1 (R. Schumann, 1810-1856); 4. Piano piece op. 19, No. 2 (A. Schönberg, 1874-1951; Figure 2.28); 5. Rain Tree Sketch 1 (T. Takemitsu, 1930-1996). For each composition, the distribution of notes (pitches) modulo 12 is calculated and centered around the central pitch (defined as the most frequent pitch modulo 12). Thus, the central pitch is defined as zero. We then obtain five vectors of relative frequencies pⱼ = (pⱼ₀, ..., pⱼ₁₁)ᵗ (j = 1, ..., 5) characterizing the five compositions. In addition, for each of these vectors the number nⱼ of local peaks in pⱼ is calculated. We say that a local peak occurs at i ∈ {1, ..., 10} if pⱼᵢ > max(pⱼ,ᵢ₋₁, pⱼ,ᵢ₊₁). For i = 11, we say that a local peak occurs if pⱼᵢ > pⱼ,ᵢ₋₁. Figure 2.29a displays Chernoff faces of the 12-dimensional vectors vⱼ = (nⱼ, pⱼ₁, ..., pⱼ₁₁)ᵗ. In Figure 2.29b, the coordinates of vⱼ (and thus the assignment of feature variables) were permuted. The two plots illustrate the usefulness of Chernoff faces, and at the same time the difficulties in finding an objective interpretation. On the one hand, the method discovers a plausible division into two groups: both pictures show a clear distinction between classical tonal music (the first three faces) and the three representatives of avant-garde music of the 20th century. On the other hand, the exact nature of the distinction cannot be seen. In Figure 2.29a, the "classical" faces look much more friendly than the rather miserable avant-garde fellows. The judgment of conservative music lovers that avant-garde music is unbearable, depressing, or even bad for one's health, seems to be confirmed! Yet, bad temper is the response of the classical masters to a simple permutation of the variables (Figure 2.29b), whereas the grim avant-garde seems to be much more at ease. The difficulty in interpreting Chernoff faces is that the result depends on the order of the variables, whereas due to their psychological effect most feature variables are not interchangeable.
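The peak-counting rule used to build the vectors vⱼ can be sketched directly; I read the boundary case as applying to the last coordinate i = 11. The frequency vector below is invented for illustration.

```python
import numpy as np

def count_local_peaks(p):
    # Number of local peaks in a relative-frequency vector p = (p0, ..., p11):
    # a peak at i in {1, ..., 10} needs p[i] > max(p[i-1], p[i+1]);
    # at the boundary i = 11 only p[11] > p[10] is required.
    n = 0
    for i in range(1, 11):
        if p[i] > max(p[i - 1], p[i + 1]):
            n += 1
    if p[11] > p[10]:
        n += 1
    return n

p = np.array([0.30, 0.02, 0.10, 0.03, 0.12, 0.08,
              0.02, 0.15, 0.04, 0.06, 0.02, 0.06])
print(count_local_peaks(p))   # -> 5
```

The resulting count nⱼ, prepended to the frequencies pⱼ₁, ..., pⱼ₁₁, gives the 12-dimensional vector vⱼ that is mapped to face features.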
[Figure 2.29a: Chernoff faces labeled Anonymus, Bach, Schumann, Webern, Schoenberg, Takemitsu.]
Figure 2.29 b) Chernoff faces for the same compositions as in Figure 2.29a, after permuting coordinates.
isolated (most extremely for Bartók's Bagatelle No. 3) or tend to cover more or less the whole range of notes (e.g. Bartók, Prokofieff, Takemitsu, Beran). Due to the variety of styles in the 20th century, the specific shape of each of the stars would need to be discussed in detail individually. For instance, Messiaen's shape may be explained by the specific scales (Messiaen scales) he used. Generally speaking, the difference between star plots of 20th-century and earlier music reflects the replacement of the traditional tonal system with major/minor scales by other principles.
[Star-plot panels for Halle, Ockeghem, Arcadelt, Byrd, Rameau, Bach, Scarlatti, Haydn, Mozart, Clementi, Schumann, Chopin, Wagner, Debussy, Scriabin, Bartók, Prokofieff, Messiaen, Schoenberg, Webern, Takemitsu, Beran.]
Figure 2.31 Star plots of pⱼ = (pⱼ₆, pⱼ₁₁, pⱼ₄, pⱼ₉, pⱼ₂, pⱼ₇, pⱼ₁₂, pⱼ₅, pⱼ₁₀, pⱼ₃, pⱼ₈)ᵗ for compositions from the 13th to the 20th century.
Suppose a composition consists of chords (sets of simultaneously sounding notes) C₁, ..., Cₙ with onset times t₁, ..., tₙ. The lower and upper envelopes are defined by

E_low = {(tⱼ, min_{(t,x(t))∈Cⱼ} x(t)), j = 1, ..., n}

and

E_up = {(tⱼ, max_{(t,x(t))∈Cⱼ} x(t)), j = 1, ..., n}.
In other words, for each onset time, the lowest and the highest note are selected to define the lower and the upper envelope respectively. In the example below, we consider interval steps Δy(tᵢ) = [y(tᵢ₊₁) − y(tᵢ)] mod 12 for the upper envelope of a composition with onset times t₁, ..., tₙ and pitches y(t₁), ..., y(tₙ). A simple aspect of melodic and harmonic structure is the question in which sequence intervals are likely to occur. Here, we look at the empirical two-dimensional distribution of (Δy(tᵢ), Δy(tᵢ₊₁)). For each pair (i, j) (−11 ≤ i, j ≤ 11; i, j ≠ 0), we count the number nᵢⱼ of occurrences and define Nᵢⱼ = log(nᵢⱼ + 1). (The value 0 is excluded here, since repetitions of a note or transpositions by an octave are less interesting.) If only the type of interval and not its direction is of interest, then i, j assume the values 1 to 11 only. A useful representation of Nᵢⱼ can be obtained by a symbol plot. In Figures 2.32 and 2.33, the x- and y-coordinates correspond to i and j respectively. The radius of a circle with center (i, j) is proportional to Nᵢⱼ. The compositions considered here are: a) J. S. Bach: Präludium No. 1 from Das Wohltemperierte Klavier; b) W. A. Mozart: Sonata KV 545 (beginning of 2nd movement); c) A. Scriabin: Prélude op. 51, No. 4; and d) F. Martin: Prélude No. 6. For Bach's piece, there is a clear clustering in three main groups in the first plot (there are almost never two successive interval steps downwards) and a horseshoe-like pattern for absolute intervals. Remarkable is the clear negative correlation in Mozart's first plot and the concentration on a few selected interval sequences. A negative correlation in the plots of interval steps with sign can also be found for Scriabin and Martin. However, considering only the types of intervals without their sign, the number and variety of interval sequences that are used relatively frequently is much higher for Scriabin, and even more so for Martin. For Martin, the plane of absolute intervals (Figure 2.33d) is filled almost uniformly.
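The counts Nᵢⱼ behind these symbol plots can be sketched as follows. The short pitch sequence is a synthetic stand-in for a real upper envelope, and the offset-indexing scheme is my own.

```python
import numpy as np

# Counts N_ij = log(n_ij + 1) of successive interval steps in an
# upper-envelope pitch sequence (synthetic pitches for illustration).
pitches = np.array([60, 64, 67, 72, 71, 67, 64, 60, 62, 65, 69, 72])
steps = np.diff(pitches)            # interval steps y(t_{i+1}) - y(t_i)
steps = steps[steps % 12 != 0]      # drop unisons and octave transpositions

N = np.zeros((23, 23))              # index range -11..11, stored with offset +11
for a, b in zip(steps[:-1], steps[1:]):
    N[a + 11, b + 11] += 1
N = np.log(N + 1)                   # circle radii proportional to N_ij
print(int((N > 0).sum()), "distinct interval pairs observed")
```

In the symbol plot, a circle of radius proportional to N[i + 11, j + 11] would be drawn at position (i, j); taking absolute values of `steps` before counting gives the direction-free variant with i, j from 1 to 11.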
2.7.4 Pitch distribution – symbol plots with circles
Consider once more the distribution vectors pⱼ = (pⱼ₀, ..., pⱼ₁₁)ᵗ of pitch modulo 12 as in the star plot example above. The star plots show a clear distinction between modern compositions and classical tonal compositions. Symbol plots can be used to see more clearly which composers (or compositions) are close with respect to pⱼ. In Figure 2.34 the x- and y-axes correspond to pⱼ₅ and pⱼ₇. Recall that if 0 is the root of the tonic triad, then 5 is the root of the subtonic and 7 the root of the dominant triad. The radius of the circles in Figure 2.34 is proportional to pⱼ₁, the frequency of the dissonant minor second. In color Figure 2.35, the radius represents pⱼ₆, i.e. the augmented fourth. Both plots show a clear positive relationship between pⱼ₅ and pⱼ₇. Moreover, the circles tend to be larger for small values of x and y. The positioning in the plane together with the size of the circles separates (apart from a few exceptions) classical tonal compositions from more recent ones. To visualize this, four different colors are chosen for "early music" (black), "baroque and classical" (green), "romantic" (blue), and "20th/21st century" (red). The clustering of the four colors indicates that there is indeed an approximate clustering according to the four time periods. Interesting exceptions can be observed for early music, with two extreme outliers (Halle and Arcadelt). Also, one piece by Rameau is somewhat far from the rest.
[Figure 2.34: symbol plot with x = pⱼ₅, y = pⱼ₇ (both axes 0.0 to 0.20), circle radius proportional to pⱼ₁; composition labels Halle through Beran. Figure 2.35 (color): the same plot with circle radius proportional to pⱼ₆.]
In Figure 2.36, the rectangles have width pj1 (diminished second) and height pj6 (augmented fourth). Using the same colors for the names as above, a similar clustering as in the circle plot can be observed. The picture not only visualizes a clear four-dimensional relationship between pj1, pj5, pj6, and pj7, but also shows that these quantities are related to the time period.
2.7.6 Pitch distribution symbol plots with stars

Five dimensions are visualized in color Figure 2.37 with (x, y) = (pj5, pj7) and the variables pj1, pj6, and pj10 (diminished seventh) defining a star plot for each observation, the first variable starting on the right and the subsequent variables winding counterclockwise around the star (in this case a triangle). The shape of the triangle is obviously a characteristic of the time period. For tonal music composed mostly before about 1900, the stars are very narrow, with a relatively long beam in the direction of the diminished seventh. The diminished seventh is indeed an important pitch in tonal music, since it is the fourth note in the dominant seventh chord to the subtonic. In contrast, notes that are a diminished second and an
Figure 2.36 Symbol plot with x = pj5, y = pj7. The rectangles have width pj1 (diminished second) and height pj6 (augmented fourth). (Color figures follow page 152.)
augmented fourth above the root of the tonic triad form, together with the tonic root, highly dissonant intervals and are therefore less frequent in tonal music. Color Figure 2.37 shows the triangles; the names without the triangles are plotted in color Figure 2.38.
2.7.7 Pitch distribution profile plots

Finally, as an alternative to star plots, Figure 2.39 displays profile plots of pj = (pj5, pj10, pj3, pj8, pj1, pj6, pj11, pj4, pj9, pj2, pj7)^t. For compositions up to about 1900, the profiles are essentially U-shaped. This corresponds to stars with clustered long and short beams respectively, as seen previously. For modern compositions, there is a large variety of shapes different from a U-shape.
Figure 2.37 Symbol plot with x = pj5, y = pj7, and triangles defined by pj1 (diminished second), pj6 (augmented fourth), and pj10 (diminished seventh). (Color figures follow page 152.)
Figure 2.38 Names plotted at locations (x, y) = (pj5, pj7). (Color figures follow page 152.)
[Figure 2.39: profile plots of the pitch-class distribution vectors, one panel per composition, panels labeled by composer.]
CHAPTER 3

Global measures of structure and randomness

Figure 3.1 Ludwig Boltzmann (1844-1906). (Courtesy of Österreichische Post AG.)
(3.1)

I_2 = \sum_{j=1}^{k} p_j \log_2 N_j = \sum_{j=1}^{k} p_j \log_2 (N p_j)   (3.2)
Let I_1 be the information needed to identify the set V_j to which v belongs. Then the total information needed for identifying (encoding) elements of V is

\log_2 N = I_1 + I_2.   (3.3)

On the other hand, (3.2) and (3.3) lead to the famous formula

I_1 = -\sum_{j=1}^{k} p_j \log_2 p_j.   (3.4)
I_1 is also called Shannon information. Shannon information is thus the expected information about the occurrence of the sets V_1, ..., V_k contained in a randomly chosen element from V. Note that the term "information" can be used synonymously with "uncertainty": the information obtained from a random experiment diminishes uncertainty by the same amount. The derivation of Shannon information is credited to Shannon (1948) and, independently, Wiener (1948). In physics, an analogous formula is known as entropy and is a measure of the disorder of a system (see Boltzmann 1896, Figure 3.1).
Shannon's formula can also be derived by postulating the following properties for a measure of information of the outcome of a random experiment: let V_1, ..., V_k be the possible outcomes of a random experiment and denote by p_j = P(V_j) the corresponding probabilities. Then a measure of information, say I, obtained by the outcome of the random experiment should have the following properties:
1. Function of probabilities: I = I(p_1, ..., p_k), i.e. I depends on the probabilities p_j only;
2. Symmetry: I(p_1, ..., p_k) = I(p_{\pi(1)}, ..., p_{\pi(k)}) for any permutation \pi;
3. Continuity: I(p, 1 - p) is a continuous function of p (0 \le p \le 1);
4. Definition of unit: I(\frac{1}{2}, \frac{1}{2}) = 1;
5. Additivity:

I(p_1, ..., p_k) = I(p_1 + p_2, p_3, ..., p_k) + (p_1 + p_2)\, I\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right)   (3.5)
The meaning of the first four properties is obvious. The last property can be interpreted as follows: suppose the outcome of an experiment does not distinguish between V_1 and V_2, i.e. if v turns out to be in one of these two sets, we only know that v \in V_1 \cup V_2. Then the information provided by the experiment is I(p_1 + p_2, p_3, ..., p_k). If the experiment did distinguish between V_1 and V_2, then it is reasonable to assume that the information would be larger by the amount

(p_1 + p_2)\, I\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right).

Equation (3.5) tells us exactly that: the complete information I(p_1, ..., p_k) can be obtained by adding the partial and the additional information. It turns out that the only function for which the postulates hold is Shannon's information:
Theorem 9 Let I be a functional that assigns to each finite discrete distribution P (defined by probabilities p_1, ..., p_k, k \ge 1) a real number I(P), such that the properties above hold. Then

I(P) = -\sum_{j=1}^{k} p_j \log_2 p_j.   (3.6)
Shannon information has an obvious upper bound that follows from Jensen's inequality. Recall that Jensen's inequality states that for a convex function g and weights w_j \ge 0 with \sum w_j = 1 we have g(\sum w_j x_j) \le \sum w_j g(x_j). Applying this to the convex function g(x) = -\log_2 x, with weights w_j = p_j and points x_j = p_j^{-1}, yields

-\log_2 k = g\left(\sum_{j=1}^{k} p_j\, p_j^{-1}\right) \le \sum_{j=1}^{k} p_j\, g(p_j^{-1}) = \sum_{j=1}^{k} p_j \log_2 p_j = -I(P).

Hence,

I(P) \le \log_2 k   (3.7)
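The bound (3.7) is easy to verify numerically; a minimal Python sketch (the two distributions below are arbitrary examples, not taken from the text):

```python
import math

def shannon_information(p):
    """I(P) = -sum_j p_j log2 p_j, with the convention 0 * log 0 = 0."""
    return -sum(pj * math.log2(pj) for pj in p if pj > 0)

uniform = [1 / 8] * 8                       # k = 8 equally likely outcomes
skewed = [0.7, 0.1, 0.1, 0.05, 0.03, 0.02]  # k = 6, far from uniform

I_uniform = shannon_information(uniform)    # attains the bound log2 8 = 3
I_skewed = shannon_information(skewed)      # strictly below log2 6
```

The uniform distribution attains the upper bound log2 k exactly, while any unequal distribution stays strictly below it.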
For a discrete random variable X with values x_1, x_2, ..., the definition carries over directly as I(X) = -\sum_j p_j \log_2 p_j, where p_j = P(X = x_j). More subtle is the extension to continuous distributions and random variables. A nice illumination of the problem is given in Renyi (1970): for a random variable X with uniform distribution on (0,1), the digits in the binary expansion of X are infinitely many independent 0-1 random variables where 0 and 1 occur with probability 1/2 each. The information furnished by a realization of X would therefore be infinite. Nevertheless, a meaningful measure of information can be defined as a limit of discrete approximations:
Theorem 10 Let X be a random variable with density function f. Define X_N = [N X]/N, where [x] denotes the integer part of x. If I(X_1) < \infty, then the following holds:

\lim_{N \to \infty} \frac{I(X_N)}{\log_2 N} = 1   (3.8)

\lim_{N \to \infty} \left(I(X_N) - \log_2 N\right) = -\int f(x) \log_2 f(x)\, dx   (3.9)

We thus have

Definition 26 Let X be a random variable with density function f. Then

I(X) = -\int f(x) \log_2 f(x)\, dx.   (3.10)

(3.11)
This definition is plausible, because for a process with unit variance, f has the same properties as a probability distribution and can be interpreted as a distribution on frequencies. The process X_t is uncorrelated if and only if f is constant, i.e. if f is the uniform distribution on [-\pi, \pi]. Exactly in this case entropy is maximal, and knowledge of past observations does not help to predict future observations. On the other hand, if f has one or more extreme peaks, then entropy is very low (and in the limit minus infinity). This corresponds to the fact that in this case future observations can be predicted with high accuracy from past values. Thus, future observations do not contain as much new information as in the case of independence.
3.2.2 Measuring metric, melodic, and harmonic importance

General idea

Western classical music is usually structured in at least three respects: melody, metric structure, and harmony. With respect to representing the essential melodic, metric, and harmonic structures, not all notes are equally important. For a given composition K, we may therefore try to find metric, melodic, and harmonic structures and quantify them in a weight function w : K \to R^3 (which we will also call an indicator). For each note event x \in K, the three components of w(x) = (w_{melodic}(x), w_{metric}(x), w_{harmonic}(x)) quantify the importance of x with respect to the melodic, metric, and harmonic structure of the composition, respectively.
Omnibus metric, melodic, and harmonic indicators

Specific definitions of structural indicators (or weight functions) are discussed, for instance, in Mazzola et al. (1995), Fleischer et al. (2000), and Beran and Mazzola (2001). To illustrate the general approach, we give a full definition of metric weights. Melodic and harmonic weights are defined in a similar fashion, taking into account the specific nature of melodic and harmonic structures, respectively.

Metric structures characterize local periodic patterns in symbolic onset times. This can be formalized as follows: let K \subset Z^4 be a composition (with coordinates Onset Time, Pitch, Loudness, and Duration), T \subset Z its set of onset times (i.e. the projection of K on the first axis), and let t_{max} = \max\{t : t \in T\}. Without loss of generality, the smallest onset time in T is equal to one.
Definition 28 For each triple (t, l, p) \in Z \times N \times N, the set

B(t, l, p) = \{t + kp : 0 \le k \le l\}

is called a meter with starting point t, length l, and period p. The meter is called admissible if B(t, l, p) \subset T. The non-negative length l of a local meter M = B(t, l, p) is uniquely determined by the set M and is denoted by l(M).
Note that by definition, t \in B(t, l, p) for any (t, l, p) \in Z \times N \times N. The importance of events at onset time s is now measured by the meters this onset is contained in: denoting by M the set of admissible meters, a metric weight can be defined as

w_{metric}(s) = \sum_{M \in M:\; s \in M,\; l(M) \ge l_{min}} h(l(M))   (3.12)

where h is a suitable increasing function and l_{min} a minimal length.
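A simplified computational sketch in the spirit of Definition 28: all admissible meters are enumerated by brute force and every one of them contributes, whereas the definitions cited in the text restrict the sum further (e.g. to maximal local meters); the onset pattern, the function h, and the search bounds are hypothetical choices.

```python
def metric_weight(T, lmin=2, h=lambda l: l ** 2, max_period=4):
    """Simplified metric indicator: sum h(l(M)) over admissible meters
    B(t, l, p) = {t + k*p : 0 <= k <= l} that contain the onset s and
    satisfy l(M) >= lmin.  (The literature cited in the text restricts
    the sum, e.g. to maximal local meters; here every admissible meter
    is counted, and h is an arbitrary choice.)"""
    T = set(T)
    tmin, tmax = min(T), max(T)
    weights = {s: 0.0 for s in T}
    for t in range(tmin, tmax + 1):
        for p in range(1, max_period + 1):
            for l in range(lmin, (tmax - t) // p + 1):
                M = {t + k * p for k in range(l + 1)}
                if M <= T:                 # admissible: B(t, l, p) in T
                    for s in M:
                        weights[s] += h(l)
    return weights

# Hypothetical onset grid: a regular pulse of eight onsets.
w = metric_weight([1, 2, 3, 4, 5, 6, 7, 8])
```

On a regular pulse, onsets in the middle are contained in more meters than onsets at the boundary, so they receive larger weights.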
and denote by

T_u(t_i) = \{t_i + 1 \cdot u, ..., t_i + k \cdot u\} = \{s_1, ..., s_k\}

the corresponding onset times. Moreover, let

X_u(t_i) = \{x = (x(s_1), ..., x(s_k)) : (s_i, x(s_i)) \in K\}

be the set of all pitch vectors with onset set T_u(t_i). Then we define the distance

d_u(t_i) = \min_{x \in X_u(t_i)} \sum_{i=1}^{k} (x(s_i) - y_i)^2,   (3.13)

let

x^o = \arg\min_{x \in X_u(t_i)} \sum_{i=1}^{k} (x(s_i) - y_i)^2,

and define r_u(t_i) to be the sample correlation between x^o and y = (y_1, ..., y_k). If M(t_i, u) \not\subset K, then set r_u(t_i) = 0.
Disregarding the position within a motif, we can now define overall motivic indicators (or weights), for instance by

w_{d,mean}(t_i) = g\left(\sum_{u=1}^{k} d_u(t_i)\right)   (3.14)

w_{d,min}(t_i) = g\left(\min_{1 \le u \le k} d_u(t_i)\right)   (3.15)

or

w_{corr}(t_i) = \max_{1 \le u \le k} r_u(t_i).   (3.16)
Finally, given weights for p different motifs, we may combine these into one overall indicator. For instance, an overall melodic indicator based on correlations can be defined by

w_{melod}(t_i) = \sum_{j=1}^{p} h(w_{corr,j}(t_i), L_j)   (3.17)

where w_{corr,j} is the weight function for motif number j and L_j is the number of elements in the motif. Including L_j has the purpose of attributing higher weights to the presence of longer motifs.
The advantage of the motif-based definition is that one can first search for possible motifs in the score, making full use of the available information in the score as well as musicological and historical knowledge, and then incorporate these in the definition of melodic weights. Similar definitions may be obtained for metric and harmonic indicators.
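A rough sketch of a correlation-based motivic indicator in this spirit: for each onset index, the pitch segment starting there is correlated with a given motif. The melody and motif are invented, and only the simplest case (one fixed time grid, no minimization over pitch-vector candidates) is implemented.

```python
import numpy as np

def motif_correlations(pitches, motif):
    """For each onset index i, the sample correlation between the pitch
    segment starting at i and a given motif -- a simplified version of
    the correlation weights r_u.  Transposed statements of the motif
    still yield correlation 1."""
    x = np.asarray(pitches, dtype=float)
    y = np.asarray(motif, dtype=float)
    k = len(y)
    r = np.zeros(len(x))
    for i in range(len(x) - k + 1):
        seg = x[i:i + k]
        if seg.std() > 0 and y.std() > 0:
            r[i] = np.corrcoef(seg, y)[0, 1]
    return r

# Hypothetical melody; the motif is its opening four notes, which also
# reappear transposed up a fifth starting at index 4.
melody = [60, 62, 64, 60, 67, 69, 71, 67, 60]
r = motif_correlations(melody, [60, 62, 64, 60])
```

Because the sample correlation is invariant under shifts of pitch level, the transposed recurrence of the motif is detected just as well as the original statement.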
3.2.3 Measuring dimension

There are many different definitions of dimension, each measuring a specific aspect of objects. Best known is the topological dimension. In the usual euclidean space R^k with scalar product <x, y> = \sum_{i=1}^{k} x_i y_i and distances |x - y| = \sqrt{<x - y, x - y>}, the topological dimension of the space is equal to k. The dimension of an object in this space is equal to the dimension of the subspace it is contained in. The euclidean space is, however, rather special, since it is metric with a scalar product.
More generally, one can define a topological dimension in any topological (not necessarily metric) space in terms of coverings. We start with the definition of a topological space: a topological space is a nonempty set X together with a family O of so-called open subsets of X satisfying the following conditions:
1. X \in O and \emptyset \in O (\emptyset denotes the empty set);
2. If U_1, U_2 \in O, then U_1 \cup U_2 \in O;
3. If U_1, U_2 \in O, then U_1 \cap U_2 \in O.
A covering of a set S \subset X is a collection U \subset O of open sets such that S \subset \bigcup_{U \in U} U. A refinement of a covering U is a covering U' such that for each U' \in U' there exists a U \in U with U' \subset U. The definition of topological dimension is now as follows:
Definition 33 A topological space X has topological dimension m if every covering U of X has a refinement U' in which every point of X occurs in at most m + 1 sets of U', and m is the smallest such integer.

The topological dimension of a subset S \subset X is defined analogously. For instance, a straight line in a euclidean space can be divided into open intervals such that at most two intervals intersect, so that d_T = 1. Similarly, a simple geometric figure in the plane, such as a disk or a rectangle (including the inner area), can be covered with arbitrarily small circles or rectangles such that at most three such sets intersect; this number can, however, not be made smaller. Thus, the topological dimension of such an object is d_T = 3 - 1 = 2.
\mu_{U,h}(A) = \sum h(r)   (3.18)

where the sum is taken over all balls of the covering and h is some positive function. This measure depends on r, the specific covering U_r, and h. To obtain a measure that is independent of a specific covering, we define

\mu_{r,h}(A) = \inf_{U:\, \delta < r} \mu_{U,\delta,h}(A)   (3.19)

\mu_h(A) = \lim_{r \to 0} \mu_{r,h}(A)   (3.20)

Clearly, as r tends to zero, \mu_{r,h} becomes at most larger and therefore has a limit. The limit can be either zero (if \mu_{r,h} = 0 already), infinity, or a finite number. This leads to the following definition:
Definition 34 A function h for which

0 < \mu_h(A) < \infty

is called an intrinsic function of A.

Consider, for example, a simple shape in the plane such as a circle with radius R. The area of the circle A can be measured by covering it with small circles of radius r and evaluating \mu_h(A) using the function h(r) = \pi r^2. It is well known that \lim_{r \to 0} \mu_{r,h}(A) exists and is equal to \mu_h(A) = \pi R^2. On the other hand, if we took h(r) = r^\alpha with \alpha < 2, then \mu_h(A) = \infty, whereas for \alpha > 2, \mu_h(A) = 0. For standard sets, such as circles, rectangles, triangles, cylinders, etc., it is generally true that the intrinsic function for a set A with topological dimension d_T = d is given by (Hausdorff 1919)

h(r) = h_d(r) = \frac{\{\Gamma(\frac{1}{2})\}^d}{\Gamma(1 + \frac{d}{2})}\, r^d.   (3.21)
Many other more complicated sets, including randomly generated sets, have intrinsic functions of the form h(r) = L(r) r^d for some d > 0, which is not always equal to d_T, and L a function that is slowly varying at the origin (see e.g. Hausdorff 1919, Besicovitch 1935, Besicovitch and Ursell 1937, Mandelbrot 1977, 1983, Falconer 1985, 1986, Kono 1986, Telcs 1990, Devaney 1990). Here, L is called slowly varying at zero if, for any u > 0, \lim_{r \to 0} [L(ur)/L(r)] = 1. This leads to the following definition of dimension:

Definition 35 Let A be a subset of a metric space and

h(r) = L(r)\, r^d

an intrinsic function of A, where L(r) is slowly varying. Then d_H = d is called the Hausdorff-Besicovitch dimension (or Hausdorff dimension) of A.
The definition of Hausdorff dimension leads to the definition of fractals (see e.g. Mandelbrot 1977):

Definition 36 Let A be a subset of a metric space. Suppose that A has topological dimension d_T and Hausdorff dimension d_H such that

d_H > d_T.

Then A is called a fractal.

Figure 3.2 Fractal pictures (by Céline Beran, computer generated). (Color figures follow page 152.)
Examples can be found in Mandelbrot (1977) and other related books. Many phenomena, not only in nature but also in art, appear to be fractal. For instance, fractal shapes can be found in Jackson Pollock's (1912-1956) abstract drip paintings (Taylor 1999a,b,c, 2000). In music, the idea of fractals was used by some contemporary composers, though mainly as a conceptual inspiration rather than an exact algorithm (e.g. Harri Vuori, György Ligeti; Figure 3.3).
The notion of fractals is closely related to self-similarity (see Mandelbrot 1977 and references therein). Self-similar geometric objects have the property that the same shapes are repeated at infinitely many scales. By drawing recursively m smaller copies of the same shape, rescaling them by a factor s, one can construct fractals. For self-similar objects, the fractal dimension can be calculated directly from the scaling factor s and the number m of repetitions of the rescaled objects by

d_F = \frac{\log m}{\log s}   (3.23)
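Equation (3.23) can be evaluated directly for the classical self-similar sets; a minimal sketch:

```python
import math

def similarity_dimension(m, s):
    """Fractal dimension d_F = log m / log s of a self-similar set built
    from m copies of itself, each rescaled by the factor s."""
    return math.log(m) / math.log(s)

d_cantor = similarity_dimension(2, 3)       # middle-third Cantor set
d_koch = similarity_dimension(4, 3)         # Koch curve
d_sierpinski = similarity_dimension(3, 2)   # Sierpinski triangle
```

Each of these values exceeds the topological dimension of the respective set (0 for the Cantor set, 1 for the curves), which is exactly the defining property of a fractal in Definition 36.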
For many purposes, more realistic are random fractals, where instead of the shape itself, the distribution remains the same after rescaling. More specifically, we have

Definition 38 Let X_t (t \in R) be a stochastic process. The process is called self-similar with self-similarity parameter H if, for any c > 0,

X_t =_d c^{-H} X_{ct}

where =_d means equality of the two processes in distribution.
The parameter H is also called the Hurst exponent. Self-similar processes are (like their deterministic counterparts) very special models. However, they play a central role for stochastic processes, just as the normal distribution does for random variables. The reason is that, under very general conditions, the limit of partial sum processes (see Lamperti 1962, 1972) is always a self-similar process: suppose that

a_n^{-1} S_{nt} = a_n^{-1} \sum_{s=1}^{[nt]} X_s \to Z_t \quad (n = 1, 2, ...)   (3.24)

where X_1, X_2, ... is a stationary discrete time process with zero mean and a_1, a_2, ... is a sequence of positive normalizing constants such that \log a_n \to \infty. Then there exists an H > 0 such that, for any u > 0, \lim_{n \to \infty} (a_{nu}/a_n) = u^H, Z_t is self-similar with self-similarity parameter H, and Z_t has stationary increments.
The self-similarity parameter therefore also makes sense for processes that are not exactly self-similar themselves, since it is defined by the rate n^H needed to standardize partial sums. Moreover, H is related to the fractal dimension; the exact relationship between H and the fractal dimension, however, depends on some other properties of the process as well. For instance, sample paths of (univariate) Gaussian self-similar processes, so-called fractional Brownian motion (see Chapter 4), have, with probability one, a fractal dimension of 2 - H, with possible values of H in the interval (0, 1). Thus, the closer H is to 1, the more a sample path resembles a simple geometric line with dimension one. On the other hand, as H approaches zero, a typical sample path fills up most of the plane, so that the dimension approaches two. Practically, H can be determined from an observed series X_1, ..., X_n, for example by maximum likelihood estimation. For a thorough discussion of self-similar and related processes and statistical methods see e.g. Beran (1994). Further references on fractals, apart from those given above, are, for instance, Edgar (1990), Falconer (1990), Peitgen and Saupe (1988), Stoyan and Stoyan (1994), and Tricot (1995).

A cautionary remark should be made at this point: in view of Theorem 11, the fact that we do find self-similarity in aggregated time series is hardly surprising and can therefore not be interpreted as something very special that would distinguish the particular series from other data. What may be special at most is which particular value of H is obtained and which particular self-similar process the normalized aggregated series converges to.
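As an illustration of estimating H from data, the following sketch uses the simple aggregated-variance method rather than the maximum likelihood approach mentioned in the text; the input is simulated white noise, for which the true value is H = 1/2:

```python
import numpy as np

def hurst_aggvar(x, block_sizes):
    """Aggregated-variance estimate of H: the variance of block means of
    block size m scales like m^(2H - 2), so H is read off the slope of
    a log-log regression.  (A rough sketch, not the maximum likelihood
    estimator discussed in the text.)"""
    x = np.asarray(x, dtype=float)
    logm, logv = [], []
    for m in block_sizes:
        nblocks = len(x) // m
        means = x[:nblocks * m].reshape(nblocks, m).mean(axis=1)
        logm.append(np.log(m))
        logv.append(np.log(means.var()))
    slope = np.polyfit(logm, logv, 1)[0]
    return 1 + slope / 2

rng = np.random.default_rng(0)
white = rng.standard_normal(20000)   # independent noise: true H = 1/2
H = hurst_aggvar(white, [10, 20, 50, 100, 200])
```

For independent observations the block-mean variance decays like 1/m, giving a slope near -1 and hence an estimate near H = 1/2; long-memory series would give a flatter slope and H > 1/2.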
3.3 Specific applications in music

3.3.1 Entropy of melodic shapes

Let x(t_i) be the upper and y(t_i) the lower envelope of a composition at score-onset times t_i (i = 1, ..., n). To investigate the shape of the melodic envelope, consider the discrete derivatives

x^{(1)}(t_i) = \frac{\Delta x(t_i)}{\Delta t_i} = \frac{x(t_{i+1}) - x(t_i)}{t_{i+1} - t_i}   (3.25)

and

x^{(2)}(t_i) = \frac{\Delta^2 x(t_i)}{\Delta^2 t_i} = \frac{[x(t_{i+2}) - x(t_{i+1})] - [x(t_{i+1}) - x(t_i)]}{[t_{i+2} - t_{i+1}][t_{i+1} - t_i]}   (3.26)

(3.27)

and

x^{(2;12)}(t_i) =   (3.28)

The following entropy measures can then be considered:

1. E_1 = -\int f(x) \log f(x)\, dx   (3.29)

where f is obtained from the observed data x^{(2;12)}(t_1), ..., x^{(2;12)}(t_n) by kernel estimation.

2. E_2: Same as E_1, but using x^{(2)}(t_1), ..., x^{(2)}(t_n) instead.

3. E_3 =   (3.30)
Figure 3.4 Comparison of entropies 1, 2, 3, and 4 for J.S. Bach's Cello Suite No. 1 and R. Schumann's op. 15, No. 2, 3, 4, and 7, and op. 68, No. 2 and 16.

v_i = \frac{1}{41} \sum_{j=0}^{40} (y_{i+j} - y_i)^2   (3.31)
the time series z_i = \log(v_i + \frac{1}{2}) (see Chapter 4 for the definition of SEMIFAR models). The fitted spectral density f(\lambda; \hat\theta) is then used to define the spectral entropy

E_9 = -\int f(\lambda; \hat\theta) \log f(\lambda; \hat\theta)\, d\lambda   (3.32)

Similarly, consider the transformed series

z_i = \log\left(\frac{R_i + \delta}{2\pi + \delta - R_i}\right)

where \delta is a small positive number that is needed in order that -\infty < z_i < \infty even if R_i = 0 or 2\pi, respectively. Fitting a SEMIFAR model to z_i, we then define E_{10} in the same way as E_9 above.
Figure 3.6 shows a comparison of E_9 and E_{10} for the same compositions as in Section 3.3.1. In contrast to the previous measures of entropy, Bach is consistently lower than Schumann. With respect to E_{10}, this is also the case in comparison with Scriabin (Figure 3.5) and Martin. Thus, for Bach there appears to be a high degree of nonrandomness (i.e. organization) in the way the variability of interval steps changes sequentially.

Figure 3.5 Alexander Scriabin (1871-1915) (at the piano) and the conductor Serge Koussevitzky. (Painting by Robert Sterl, 1910; courtesy of Gemäldegalerie Neuer Meister, Dresden, and Robert-Sterl-House.)

Figure 3.6 Comparison of entropies 9 and 10 for Bach, Schumann, and Scriabin/Martin.
Figure 3.7 Metric, melodic, and harmonic global indicators for Bach's Canon cancricans.

The connection between the metric, melodic, and harmonic structure cannot be seen directly from the raw curves. However, smoothed weights, as shown in the figures above, reveal clear connections between the three weight functions. This is even the case for Webern, in spite of the absence of tonality.

Figure 3.9 Metric, melodic, and harmonic global indicators for Schumann's op. 15, No. 2 (upper figure), together with smoothed versions (lower figure).

Figure 3.10 Metric, melodic, and harmonic global indicators for Schumann's op. 15, No. 7 (upper figure), together with smoothed versions (lower figure).

Figure 3.11 Metric, melodic, and harmonic global indicators for Webern's Variations op. 27, No. 2 (upper figure), together with smoothed versions (lower figure).

Figure 3.12 R. Schumann, Träumerei: motifs used for specific melodic indicators.
[Figure: weight function w plotted against onset time (0 to 30).]
CHAPTER 4
(4.3)

(4.4)

X_t = \mu + \int_{-\pi}^{\pi} e^{i\lambda t}\, dZ_X(\lambda)   (4.6)

where Z_X is a random process (in \lambda) with the following properties: Z_X(0) = 0, E[Z_X(\lambda)] = 0, and for \lambda_1 > \lambda_2 \ge \eta_1 > \eta_2,

E[\overline{Z_X(\lambda_2, \lambda_1)}\, Z_X(\eta_2, \eta_1)] = 0   (4.7)

(4.8)

var(X_t) = \int_{-\pi}^{\pi} E[|dZ_X(\lambda)|^2] = \int_{-\pi}^{\pi} dF_X(\lambda)   (4.9)

f(\lambda) = \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} \gamma_X(k) e^{-ik\lambda}   (4.10)

\gamma_X(k) = \int_{-\pi}^{\pi} e^{ik\lambda} f(\lambda)\, d\lambda   (4.11)

For a process with a discrete spectrum,

X_t = \sum_{j=1}^{k} A_j e^{i\lambda_j t}, \qquad F(\lambda) = \sum_{j:\, \lambda_j \le \lambda} E[|A_j|^2]   (4.12)

so that

var(X_t) = \sum_{j=1}^{k} E[|A_j|^2]   (4.13)
This means that the variance is a sum of contributions that are due to the frequencies \lambda_j (1 \le j \le k). A sample path of X_t cannot be distinguished from a deterministic periodic function, because the randomly selected amplitudes A_j are then fixed.

Finally, it should be noted that not all frequencies are observable when observations are taken at discrete time points t = 1, 2, ..., n. The smallest identifiable period is 2, which corresponds to a highest observable frequency of 2\pi/2 = \pi. The largest identifiable period is n/2, which corresponds to the smallest frequency 4\pi/n. As n increases, the lowest frequency tends to zero; the highest, however, does not. In other words, the highest frequency resolution does not improve with increasing sample size.
To obtain more general models, one may wish to relax the condition of stationarity. An asymptotic concept of local stationarity is defined in Dahlhaus (1996a,b, 1997): a sequence of stochastic processes X_{t,n} is called locally stationary if

X_{t,n} = \mu\left(\frac{t}{n}\right) + \int_{-\pi}^{\pi} e^{i\lambda t} A_{t,n}(\lambda)\, dZ_X(\lambda)   (4.14)

with = meaning almost sure (a.s.) equality, \mu(u) continuous, and there exists a 2\pi-periodic function A : [0, 1] \times R \to C such that A(u, \lambda) is continuous in u and

\sup_{t,\lambda} |A_{t,n}(\lambda) - A(t/n, \lambda)| \le c\, n^{-1}   (4.15)

(a.s.) for some constant c < \infty. Intuitively, this means that for n large enough, the observed process can be approximated locally in a small time window around t by the stationary process \int e^{i\lambda t} A(\frac{t}{n}, \lambda)\, dZ_X(\lambda). The order n^{-1} of the approximation is chosen such that most standard estimation procedures, such as maximum likelihood estimation, can be applied locally and their usual properties (e.g. consistency, asymptotic normality) still hold. Under smoothness conditions on A one can prove that a meaningful evolving spectral density f_X(u, \lambda) (u \in (0, 1)) exists such that

f_X(u, \lambda) = \lim_{n \to \infty} \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} cov(X_{[un-k/2],n}, X_{[un+k/2],n})\, e^{-i\lambda k}   (4.16)

The function f_X(u, \lambda) is called the evolutionary spectral density. Note that, for fixed u,

\lim_{n \to \infty} cov(X_{[un-k/2],n}, X_{[un+k/2],n}) = \gamma_X(k)
Thumfart (1995) carries this concept over to series with discrete spectra. A simplified definition can be given as follows: a sequence of stochastic processes X_{t,n} (n \in N) is said to have a discrete evolutionary spectrum F_X(u, \lambda) if

X_{t,n} = \mu\left(\frac{t}{n}\right) + \sum_{j \in M} A_j\left(\frac{t}{n}\right) e^{i\lambda_j(\frac{t}{n}) t}   (4.17)
leads to information loss in the following way: let Y_\tau (\tau \in R) be a second order stationary time series in continuous time. (Stationarity in continuous time is defined in exact analogy to Definition 39.) Then Y has a spectral representation

Y_\tau = \int_{-\infty}^{\infty} e^{i\lambda \tau}\, dZ_Y(\lambda),   (4.18)

F_Y(\lambda) = \int_{-\infty}^{\lambda} E[|dZ_Y(\eta)|^2]   (4.19)

f_Y(\lambda) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-i\lambda \tau} \gamma_Y(\tau)\, d\tau   (4.20)

We also have

\gamma_Y(\tau) = cov(Y_t, Y_{t+\tau}) = \int_{-\infty}^{\infty} e^{i\lambda \tau} f_Y(\lambda)\, d\lambda.
Sampling Y at equidistant time points t\Delta (t \in Z) yields the discrete time process

X_t = Y_{t\Delta} = \sum_{u=-\infty}^{\infty} \int_{-\pi/\Delta + (2\pi/\Delta)u}^{\pi/\Delta + (2\pi/\Delta)u} e^{i\lambda \Delta t}\, dZ_Y(\lambda)   (4.22)

= \int_{-\pi/\Delta}^{\pi/\Delta} e^{i\lambda \Delta t}\, dZ_X(\lambda)   (4.23)

where

dZ_X(\lambda) = \sum_{u=-\infty}^{\infty} dZ_Y(\lambda + (2\pi/\Delta)u)   (4.24)

so that

f_X(\lambda) = \sum_{u=-\infty}^{\infty} f_Y(\lambda + (2\pi/\Delta)u)   (4.25)
Thus, the contributions of the frequencies \lambda + (2\pi/\Delta)u (u = 0, \pm 1, \pm 2, ...) to the observed function X_t (in discrete time) are confounded, i.e. they cannot be distinguished. If we observe a peak of f_X at a frequency \lambda \in (0, \pi/\Delta], then this may be due to any of the periodic components with periods 2\pi/(\lambda + (2\pi/\Delta)u), u = 0, 1, 2, ..., or a combination of these. This has, for instance, direct implications for the sampling of sound signals. Suppose that 22050 Hz (i.e. \omega = 22050 \cdot 2\pi \approx 138544.2) is the highest frequency that we want to identify (and later reproduce) correctly, instead of attributing it to a lower frequency. This would cover the range perceivable by the human ear. Then \Delta must be so small that \pi/\Delta \ge 22050 \cdot 2\pi. Thus the time gap \Delta between successive measurements of the sound wave must not exceed 1/44100 seconds.
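The confounding of the frequencies \lambda + (2\pi/\Delta)u can be demonstrated numerically: two sine waves whose frequencies differ by the sampling rate produce identical samples. The sampling rate and frequency below are arbitrary illustrative choices, not values from the text.

```python
import numpy as np

# Two sine waves whose frequencies differ by the sampling rate fs are
# confounded (aliased): sampled at t = n/fs they give identical values.
fs = 8000.0                       # samples per second (hypothetical)
n = np.arange(64)
t = n / fs
f1 = 1000.0                       # inside the Nyquist band (f1 < fs/2)
f2 = f1 + fs                      # aliases exactly onto f1
x1 = np.sin(2 * np.pi * f1 * t)
x2 = np.sin(2 * np.pi * f2 * t)
max_diff = np.max(np.abs(x1 - x2))
```

Since sin(2\pi(f_1 + f_s) n / f_s) = sin(2\pi f_1 n / f_s + 2\pi n), the two sampled series agree up to floating-point error, which is exactly the confounding described above.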
4.2.3 Linear filters

Suppose we need to extract or eliminate frequency components from a signal X_t with spectral density f_X. The aim is thus, for instance, to produce an output signal Y_t whose spectral density f_Y is zero for a frequency interval a \le \lambda \le b. The simplest, though not necessarily best, way to do this is linear filtering. A linear filter maps an input series X_t to an output series Y_t by

Y_t = \sum_{j=-\infty}^{\infty} a_j X_{t-j}   (4.26)

The coefficients must fulfill certain conditions in order that the sum is defined. If X_t is second order stationary, then we need \sum a_j^2 < \infty. The resulting spectral density of Y_t is

f_Y(\lambda) = |A(\lambda)|^2 f_X(\lambda)   (4.27)

where

A(\lambda) = \sum_{j=-\infty}^{\infty} a_j e^{-ij\lambda}.   (4.28)

To eliminate a certain frequency band [a, b] one thus needs a linear filter such that A(\lambda) \approx 0 in this interval.

Equation (4.27) also helps to construct and simulate time series models with desired spectral densities: a series with spectral density f_Y(\lambda) = (2\pi)^{-1} |A(\lambda)|^2 can be simulated by passing a series of independent observations X_t through the filter A(\lambda). Note that, in reality, one can use only a finite number of terms in the filter, so that only an approximation can be achieved.
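A minimal sketch of equation (4.27) in action: a two-point moving average has transfer function A(\lambda) = (1 + e^{-i\lambda})/2, which vanishes at \lambda = \pi, so the fastest observable oscillation is annihilated by the filter.

```python
import numpy as np

# Two-point moving average a = (1/2, 1/2): its transfer function
# A(w) = (1 + e^{-iw})/2 vanishes at w = pi, so the fastest observable
# oscillation (period 2) is removed, as f_Y = |A|^2 f_X predicts.
a = np.array([0.5, 0.5])
t = np.arange(200)
x = (-1.0) ** t                       # pure oscillation at frequency pi
y = np.convolve(x, a, mode="valid")   # Y_t = (X_t + X_{t-1}) / 2

gain_at_pi = abs(0.5 * (1 + np.exp(-1j * np.pi)))
max_output = np.max(np.abs(y))
```

The output series is identically zero (up to floating-point error), matching the zero of |A(\lambda)|^2 at \lambda = \pi.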
4.2.4 Special models

When modeling time series statistically, one may use one of the following approaches: a) parametric modeling; b) nonparametric modeling; and c) semiparametric modeling. In parametric modeling, the probability distribution of the time series is completely specified a priori, except for a finite dimensional parameter \theta = (\theta_1, ..., \theta_p)^t. In contrast, for nonparametric models, an infinite dimensional parameter is unknown and must be estimated from the data. Finally, semiparametric models have parametric and nonparametric components. A link between parametric and nonparametric models can also be established by data-based choice of the length p of the unknown parameter vector \theta, with p tending to infinity with the sample size. Some typical parametric models are:

1. White noise: X_t second order stationary, var(X_t) = \sigma^2, f_X(\lambda) = \sigma^2/(2\pi), and \gamma_X(k) = 0 (k \neq 0).

2. Moving average process of order q, MA(q):

X_t = \mu + \epsilon_t + \sum_{k=1}^{q} \beta_k \epsilon_{t-k}   (4.29)

3. Autoregressive process of order p, AR(p):

X_t - \mu = \sum_{k=1}^{p} \phi_k (X_{t-k} - \mu) + \epsilon_t   (4.30)

4. ARMA(p, q) process:

(X_t - \mu) - \sum_{k=1}^{p} \phi_k (X_{t-k} - \mu) = \epsilon_t + \sum_{k=1}^{q} \beta_k \epsilon_{t-k}   (4.31)

The process is stationary if all solutions z of the characteristic equation

1 - \sum_{k=1}^{p} \phi_k z^k = 0   (4.32)

lie outside the unit circle; its spectral density is proportional to |\psi(e^{i\lambda})|^2 / |\phi(e^{i\lambda})|^2.

5. Linear process:

X_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j \epsilon_{t-j}   (4.33)

(4.34)

6. ARIMA(p, d, q) process:

\phi(B)\{(1 - B)^d X_t\} = \psi(B)\epsilon_t   (4.35)
with d = 0, 1, 2, ..., where \phi(z) and \psi(z) are not zero for |z| \le 1. This means that the dth difference (1 - B)^d X_t is a stationary ARMA process.
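The condition on the roots of the autoregressive polynomial is easy to check numerically; a minimal sketch (the AR coefficients below are arbitrary examples):

```python
import numpy as np

def ar_is_stationary(phi):
    """Check whether an AR(p) process with coefficients phi_1..phi_p is
    (causal) stationary: all roots of 1 - phi_1 z - ... - phi_p z^p = 0
    must lie outside the unit circle."""
    # np.roots expects coefficients ordered from the highest power down.
    ascending = np.concatenate(([1.0], -np.asarray(phi, dtype=float)))
    roots = np.roots(ascending[::-1])
    return bool(np.all(np.abs(roots) > 1.0))

ok = ar_is_stationary([0.5])    # root z = 2: stationary
bad = ar_is_stationary([1.5])   # root z = 2/3: not stationary
```

For the AR(1) case the root is simply z = 1/\phi_1, so the check reduces to the familiar condition |\phi_1| < 1.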
7. Fractional ARIMA process, FARIMA(p, d, q) (Granger and Joyeux 1980, Hosking 1981, Beran 1995):

(1 - B)^\delta \phi(B)\{(1 - B)^m X_t\} = \psi(B)\epsilon_t   (4.36)

with d = m + \delta, -\frac{1}{2} < \delta < \frac{1}{2}, m = 0, 1. Here,

(1 - B)^\delta = \sum_{k=0}^{\infty} (-1)^k \binom{\delta}{k} B^k

with

\binom{\delta}{k} = \frac{\Gamma(\delta + 1)}{\Gamma(k + 1)\, \Gamma(\delta - k + 1)}.

The spectral density of (1 - B)^m X_t is

f_X(\lambda) = \frac{\sigma^2}{2\pi} \frac{|\psi(e^{i\lambda})|^2}{|\phi(e^{i\lambda})|^2}\, |1 - e^{i\lambda}|^{-2\delta}.   (4.37)
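The coefficients of the fractional difference operator (1 - B)^d can be computed by a simple recursion that follows from the Gamma-function form of the binomial coefficients; a minimal sketch (the values of d are arbitrary examples):

```python
def frac_diff_coeffs(d, n):
    """First n coefficients b_k of (1 - B)^d = sum_k (-1)^k C(d, k) B^k,
    via the recursion b_0 = 1, b_k = b_{k-1} * (k - 1 - d) / k, which
    follows from the ratio of consecutive binomial coefficients."""
    b = [1.0]
    for k in range(1, n):
        b.append(b[-1] * (k - 1 - d) / k)
    return b

coeffs_int = frac_diff_coeffs(1, 5)     # integer d: the series terminates
coeffs_frac = frac_diff_coeffs(0.4, 5)  # long-memory range 0 < d < 1/2
```

For integer d the expansion terminates after d + 1 terms (for d = 1 it is just 1 - B), whereas for fractional d the coefficients form an infinite, slowly decaying sequence, which is the source of the long-memory behavior.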
If \delta > 0, the spectral density f_X has a pole at the origin and

\sum_{k=-\infty}^{\infty} \gamma_X(k) = \infty.

This case is also known as long memory, since autocorrelations decay very slowly (see Beran 1994). On the other hand, if \delta < 0, then f_X(\lambda) converges to zero at the origin and

\sum_{k=-\infty}^{\infty} \gamma_X(k) = 0.

This is called antipersistence, since for large lags there is a negative correlation. The fractional differencing parameter \delta, or d = \delta + m, is also called the long-memory parameter, and is related to the fractal or Hausdorff dimension d_H (see Chapter 3). For an extended discussion of long-memory and antipersistent processes see e.g. Beran (1994) and references therein.
8. Fractional Gaussian noise, i.e. the increment process of fractional Brownian motion B_H:

X_t = B_H(t) - B_H(t - 1) \quad (t \in N)   (4.38)

with spectral density

f(\lambda) = 2 c_f (1 - \cos\lambda) \sum_{j=-\infty}^{\infty} |2\pi j + \lambda|^{-2H-1}, \quad \lambda \in [-\pi, \pi]   (4.40)

A polynomial trend model is given by

X_t = \sum_{j=0}^{p} \beta_j t^j + U_t   (4.41)

where U_t is stationary.
9. Harmonic or seasonal trend model:

X_t = \sum_{j=0}^{p} \alpha_j \cos \lambda_j t + \sum_{j=0}^{p} \beta_j \sin \lambda_j t + U_t   (4.42)

with U_t stationary.

10. Nonparametric trend model:

X_{t,n} = g\left(\frac{t}{n}\right) + U_t   (4.43)

with g : [0, 1] \to R a smooth function (e.g. twice continuously differentiable) and U_t stationary.
11. Semiparametric fractional autoregressive model, SEMIFAR(p, d, q) (Beran 1998, Beran and Ocker 1999, 2001, Beran and Feng 2002a,b):

(1 - B)^\delta \phi(B)\{(1 - B)^m X_t - g(s_t)\} = U_t   (4.44)

The standard approach to estimating a finite dimensional parameter \theta is maximum likelihood estimation,

\hat\theta = \arg\max_\theta h(x_1, ..., x_n; \theta)   (4.45)

where h is the joint density function of (X_1, ..., X_n). If observations are discrete, then h is the joint probability P(X_1 = x_1, ..., X_n = x_n). Equivalently, we may maximize the log-likelihood L(x_1, ..., x_n; \theta) = \log h(x_1, ..., x_n; \theta).
For Gaussian processes, estimation can be based on minimizing the Whittle approximation

\frac{1}{4\pi} \int_{-\pi}^{\pi} \left[\log f_X(\lambda; \theta) + \frac{I(\lambda)}{f_X(\lambda; \theta)}\right] d\lambda   (4.47)

In general, the actual mathematical and practical difficulty lies in defining a computationally feasible estimation procedure and also in obtaining the asymptotic distribution of the resulting estimator. For models defined in terms of innovations \epsilon_t, the maximum likelihood estimator can be written in terms of the residuals e_t(\theta):

\hat\theta = \arg\max_\theta \sum_{t=1}^{n} \log h_\epsilon(e_t(\theta))   (4.48)

In the simplest case where the \epsilon_t are normally distributed with h_\epsilon(x) = (2\pi\sigma_\epsilon^2)^{-\frac{1}{2}} \exp\{-x^2/(2\sigma_\epsilon^2)\} and \theta = (\sigma_\epsilon^2, \theta_2, ..., \theta_p) = (\sigma_\epsilon^2, \theta^*), we have e_t(\theta) = e_t(\theta^*) and

\hat\theta = \arg\min_\theta \left[\sum_{t=1}^{n} \log \sigma_\epsilon^2 + \sum_{t=1}^{n} \frac{e_t^2(\theta^*)}{\sigma_\epsilon^2}\right]   (4.49)

Differentiating with respect to \theta leads to

\hat\theta^* = \arg\min_{\theta^*} \sum_{t=1}^{n} e_t^2(\theta^*)   (4.50)

The asymptotic covariance matrix of \hat\theta^* involves the matrix B with elements

B_{ij} = (2\pi)^{-1} \int_{-\pi}^{\pi} \frac{\partial \log f(\lambda; \theta)}{\partial \theta_i}\, \frac{\partial \log f(\lambda; \theta)}{\partial \theta_j}\, d\lambda

(see e.g. Box and Jenkins 1970, Beran 1995).
The estimation method above assumes that the order of the model, i.e. the length p of the parameter vector \theta, is known. This is not the case in general, so that p has to be estimated from data. Information theoretic considerations (based on definitions discussed in Section 3.1) lead to Akaike's famous criterion (AIC; Akaike 1973a,b)

\hat p = \arg\min_p \{-2 \log(\text{likelihood}) + 2p\}   (4.51)

More generally, we may minimize AIC_\alpha = -2 \log(\text{likelihood}) + \alpha k with respect to p. This includes the AIC (\alpha = 2), the BIC (Bayesian information criterion, Schwarz 1978, Akaike 1979) with \alpha = \log n, and the HIC (Hannan and Quinn 1979) with \alpha = 2c \log\log n (c > 1). It can be shown that, if the observed process is indeed generated by a process from the postulated class of models, and if its order is p_o, then for \alpha \ge 2c \log\log n the estimated order is asymptotically correct with probability one. In contrast, if \alpha/(2c \log\log n) \to 0 as n \to \infty, then the criterion tends to choose too many parameters in the sense that P(\hat p > p_o) converges to a positive probability. This is, for instance, the case for Akaike's criterion. Thus, if identification of a correct model is the aim, and the observed process is indeed likely to be at least very close to the postulated model class, then \alpha \ge 2c \log\log n should be used. On the other hand, one may argue that no model is ever correct, so that increasing the number of parameters with increasing sample size may be the right approach. In this case, the original AIC is a good candidate. It should be noted, however, that if p \to \infty as n \to \infty, then
(4.52)
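The order selection rule (4.51) and its AIC_α generalization can be sketched numerically. This is a minimal illustration assuming a Gaussian AR fit by least squares (not the book's implementation; the simulated AR(2) series is an assumption):

```python
import numpy as np

def aic_order(x, pmax, alpha=2.0):
    """Select the AR order by minimizing -2 log-likelihood + alpha * p,
    cf. (4.51); alpha = log(n) gives the BIC variant.  The Gaussian
    -2 log-likelihood is evaluated, up to constants, via the
    innovation variance of the least squares AR fit."""
    n = len(x)
    best_p, best_crit = 0, np.inf
    for p in range(pmax + 1):
        if p == 0:
            s2 = np.var(x)
        else:
            X = np.column_stack([x[p - j - 1:n - j - 1] for j in range(p)])
            y = x[p:]
            phi, *_ = np.linalg.lstsq(X, y, rcond=None)
            s2 = np.mean((y - X @ phi) ** 2)
        crit = n * np.log(s2) + alpha * p   # -2 log L + alpha*p, up to constants
        if crit < best_crit:
            best_p, best_crit = p, crit
    return best_p

# AR(2) test series: the BIC variant should recover the order 2
rng = np.random.default_rng(1)
x = np.zeros(3000)
for t in range(2, 3000):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.standard_normal()
```

Calling `aic_order(x, 6)` (AIC) tends to choose at least the true order, while `aic_order(x, 6, alpha=np.log(len(x)))` (BIC) is consistent for the true order, in line with the discussion above.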
A kernel estimate of the trend function g at a point s_t^o is

ĝ(s_t^o) = (nb)^{−1} Σ_{t=1}^{n} K((s_t − s_t^o)/b) y_t.   (4.53)

Its integrated mean squared error can be decomposed as

∫ {E[ĝ(s)] − g(s)}² ds + ∫ var(ĝ(s)) ds = ∫ {Bias² + variance} ds.

The bias only depends on the function g, and is thus independent of the error process. The variance, on the other hand, is a function of the covariances γ_U(k) = cov(U_t, U_{t+k}), or equivalently of the spectral density f_U.
The sample autocovariances

γ̂(k) = n^{−1} Σ_{t=1}^{n−k} (x_t − x̄)(x_{t+k} − x̄)   (4.54)

can be summarized in the periodogram

I(λ) = (2π)^{−1} Σ_{k=−(n−1)}^{n−1} γ̂(k) e^{−ikλ} = (2πn)^{−1} | Σ_{t=1}^{n} (x_t − x̄) e^{−itλ} |².   (4.55)

A tapered version is obtained by replacing x_t − x̄ by w(t/n)(x_t − x̄) for a suitable taper function w, i.e. by considering | Σ_{t=1}^{n} w(t/n)(x_t − x̄) e^{−itλ} |². As an estimate of f_X(λ), however, the periodogram is not consistent: I(λ) does
not converge to f_X(λ). Instead, the following holds under mild regularity conditions: if 0 < λ_1 < ... < λ_k < π and n → ∞, then the distribution of [2I(λ_1)/f_X(λ_1), ..., 2I(λ_k)/f_X(λ_k)] converges to the distribution of (Z_1, ..., Z_k), where the Z_i are independent χ²_2-distributed random variables. This result is also true for sequences of frequencies 0 < λ_{1,n} < ... < λ_{k,n} < π, as long as the smallest distance between the frequencies, min |λ_{i,n} − λ_{j,n}|, does not converge to zero faster than n^{−1}. Because of the latter condition, and also for computational reasons (fast Fourier transform, FFT; see Cooley and Tukey 1965, Brigham 1988), one usually calculates I(λ) at the so-called Fourier frequencies λ_j = 2πj/n (j = 1, ..., m, with m = [(n − 1)/2]) only. Note that for Fourier frequencies, Σ_{t=1}^{n} e^{itλ_j} = 0, so that

I(λ_j) = (2πn)^{−1} | Σ_{t=1}^{n} x_t e^{−itλ_j} |².
Thus, the sample mean actually does not need to be subtracted. The periodogram at Fourier frequencies can also be understood as a decomposition of the variance into orthogonal components, analogous to classical analysis of variance (Scheffé 1959): for n odd,

Σ_{t=1}^{n} (x_t − x̄)² = 4π Σ_{j=1}^{m} I(λ_j),   (4.56)

and for n even,

Σ_{t=1}^{n} (x_t − x̄)² = 4π Σ_{j=1}^{m} I(λ_j) + 2π I(π).   (4.57)

This means that I(λ_j) corresponds to the (empirically observed) contribution of periodic components with frequency λ_j to the overall variability of x_1, ..., x_n.
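The Fourier-frequency periodogram and the exact decomposition (4.56) can be verified numerically. A minimal sketch (the white-noise test series and all names are illustrative assumptions):

```python
import numpy as np

def periodogram(x):
    """Periodogram I(lambda_j) at Fourier frequencies lambda_j = 2*pi*j/n,
    j = 1, ..., m with m = floor((n-1)/2), following (4.55)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    m = (n - 1) // 2
    d = np.fft.fft(x)                    # d[j] = sum_t x_t e^{-i t lambda_j}
    I = np.abs(d[1:m + 1]) ** 2 / (2 * np.pi * n)
    lam = 2 * np.pi * np.arange(1, m + 1) / n
    return lam, I

rng = np.random.default_rng(2)
x = rng.standard_normal(501)             # n odd, so (4.56) applies
lam, I = periodogram(x)
# ANOVA-type decomposition (4.56): sum (x_t - xbar)^2 = 4*pi*sum_j I(lambda_j)
lhs = np.sum((x - x.mean()) ** 2)
rhs = 4 * np.pi * np.sum(I)
```

For odd n the identity holds exactly (up to floating-point error), which is a useful check on the normalization conventions.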
A consistent estimate of f_X can be obtained by eliminating or downweighting sample autocovariances at too large lags:

f̂(λ) = (2π)^{−1} Σ_{k=−(n−1)}^{n−1} w_n(k) γ̂(k) e^{−ikλ}   (4.58)

or, equivalently, by smoothing the periodogram with a suitable spectral window W_n:

f̂(λ) = ∫_{−π}^{π} W_n(λ − μ) I(μ) dμ.   (4.59)
X_t = Σ_{j=1}^{p} [α_j cos(λ_j t) + β_j sin(λ_j t)] + U_t   (4.60)

with U_t stationary. Note that, theoretically, this model can also be understood as a stationary process with jumps in the spectral distribution F_X (see Section 4.2.1). Given λ = (λ_1, ..., λ_p)^t, the parameter vector θ = (α_1, ..., α_p, β_1, ..., β_p)^t can be estimated by the least squares or, more generally, the weighted least squares method,

θ̂ = arg min_θ Σ_{t=1}^{n} w(t/n) [ x_t − Σ_{j=1}^{p} (α_j cos λ_j t + β_j sin λ_j t) ]².   (4.61)
The frequencies themselves are estimated by maximizing the (w-tapered) periodogram:

λ̂ = arg max_{0<λ_1<...<λ_p<π} Σ_{j=1}^{p} | Σ_{t=1}^{n} w(t/n) x_t e^{−iλ_j t} |² = arg max Σ_{j=1}^{p} I_w(λ_j),   (4.62)

and, given the λ̂_j,

α̂_j = 2 Σ_{t=1}^{n} w(t/n) x_t cos(λ̂_j t) / Σ_{t=1}^{n} w(t/n)   (4.63)

and

β̂_j = 2 Σ_{t=1}^{n} w(t/n) x_t sin(λ̂_j t) / Σ_{t=1}^{n} w(t/n).   (4.64)
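A minimal sketch of this procedure with w ≡ 1: the frequency estimate is the largest periodogram peak on the Fourier grid, and the amplitudes follow from weighted means (the explicit factor 2 reflects that the average of cos² is ½; the toy signal, placed exactly on the Fourier grid, and all names are assumptions for illustration):

```python
import numpy as np

def fit_harmonic(x):
    """Estimate one frequency by the largest periodogram peak, in the
    spirit of (4.62), then the amplitudes by weighted means with w = 1."""
    n = len(x)
    t = np.arange(1, n + 1)
    m = (n - 1) // 2
    d = np.fft.fft(x - x.mean())
    j = 1 + np.argmax(np.abs(d[1:m + 1]) ** 2)   # index of peak Fourier frequency
    lam = 2 * np.pi * j / n
    alpha = 2 * np.mean(x * np.cos(lam * t))     # amplitude of the cosine term
    beta = 2 * np.mean(x * np.sin(lam * t))      # amplitude of the sine term
    return lam, alpha, beta

rng = np.random.default_rng(3)
n = 1000
t = np.arange(1, n + 1)
lam0 = 2 * np.pi * 111 / n                       # true frequency on the Fourier grid
x = 2.0 * np.cos(lam0 * t) + 1.0 * np.sin(lam0 * t) + 0.5 * rng.standard_normal(n)
lam_hat, a_hat, b_hat = fit_harmonic(x)
```

When the true frequency lies off the Fourier grid, both the frequency and the amplitude estimates degrade, which motivates the interpolation and tapering issues discussed in the text.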
Note that (4.62) means that we look for the p largest peaks in the (w-tapered) periodogram. Under quite general assumptions, the asymptotic distribution of the estimates can be shown to be as follows: the vectors

Z_{n,j} = [ √n(α̂_j − α_j), √n(β̂_j − β_j), n^{3/2}(λ̂_j − λ_j) ]^t

(j = 1, ..., p) are asymptotically mutually independent, each having a 3-dimensional normal distribution with expected value zero and covariance matrix C(λ_j) that depends on f_U(λ_j) and the weight function w. The formulas for C are as follows (Irizarry 1998, 2000, 2001, 2002):

C(λ_j) = [ 4π f_U(λ_j) / (α_j² + β_j²) ] V(λ_j)   (4.65)
where

V(λ_j) = [ c_1 α_j² + c_2 β_j²    −c_3 α_j β_j          −c_4 β_j
           −c_3 α_j β_j           c_2 α_j² + c_1 β_j²    c_4 α_j
           −c_4 β_j               c_4 α_j                c_o     ]   (4.66)

with

c_o = a_o b_o,  c_1 = U_o W_o^{−2},  c_2 = a_o b_1,   (4.67)

c_3 = a_o W_1 W_o^{−1} (W_o W_1 U_2 − W_1³ U_o − 2W_o W_2 U_1 + 2W_o W_1 W_2 U_o),   (4.68)

c_4 = a_o (W_o W_1 U_2 − W_1² U_1 − W_o W_2 U_1 + W_1 W_2 U_o),   (4.69)

a_o = (W_o W_2 − W_1²)^{−2},   (4.70)

b_n = W_n² U_2 + W_{n+1}(W_{n+1} U_o − 2W_n U_1)  (n = 0, 1),   (4.71)

U_n = ∫_0^1 s^n w²(s) ds,   (4.72)

W_n = ∫_0^1 s^n w(s) ds.   (4.73)
This result can be used to obtain tests and confidence intervals for λ_j, α_j and β_j (j = 1, 2, ..., p), with the unknown quantities α_j, β_j and f_U(λ_j) replaced by estimates. Note that this involves, in particular, estimation of the spectral density of the residual process U_t.

A quantity that is of particular interest is the difference between the partials λ_j and the corresponding multiples of the fundamental frequency λ_1,

Δ_j = λ_j − jλ_1.   (4.74)
where

W_ij = (2π)^{−1} [ ∫ (∂ log g(x; u)/∂u_i)(∂ log g(x; u)/∂u_j) dx ]|_{u=θ}  (i, j = 1, ..., p + 1).   (4.78)

Then, as n → ∞, the convergence (4.79) holds with

σ_p² = σ_p²(θ) = [g′(λ_max, θ)]^T V_p(θ) [g′(λ_max, θ)] / [g″(λ_max, θ)]².   (4.80)
various factors can play a role. For instance, the sound of a violin depends on the wood it is made of, which manufacturing procedure was used, current atmospheric conditions (temperature, humidity, air pressure), who plays the violin, which particular notes are played in which context, etc. The standard approach that makes modeling feasible is to think of a sound as the result of harmonic components that may change slowly in time, plus noise components that may be described by random models. It should be noted, however, that sound is not only produced by an instrument but also perceived by the human ear and brain. Thus, when dealing with the significance or effect of sounds, physiology, psychology and related scientific disciplines come into play. Here, we are first concerned with the actual objective modeling of the physical sound wave. This is a formidable task on its own, and far from being solved in a satisfactory manner.
The scientific study of musical sound signals by physical equations goes back to the 19th century. Helmholtz (1863) proved experimentally that musical sound signals are mainly composed of frequency components that are multiples of a fundamental frequency (also see Rayleigh 1894). Ohm conjectured that the human ear perceives sounds by analyzing the power spectrum (i.e. essentially the periodogram), without taking into account relative phases of the sounds. These conjectures have been mostly confirmed by psychological and physiological experiments (see e.g. Grey 1977, Pierce 1983/1992). Recent mathematical models of instrumental sound waves (see e.g. Fletcher and Rossing 1991) lead to the assumption that, for short time segments, a musical sound signal is stationary and can be written as a harmonic regression model with λ_1 < λ_2 < ... < λ_p. To analyze a musical sound wave, one therefore can divide time into small blocks and fit the harmonic regression model as described above. The lowest frequency λ_1 is called the fundamental frequency and corresponds to what one calls pitch in music. The higher frequencies λ_j (j ≥ 2) are called partials, overtones, or harmonics. The amplitudes of partials, and how they change gradually, are main factors in determining the timbre of a sound. For illustration, Figure 4.1 shows the sound wave (air pressure amplitudes) of a piano during 1.9 seconds where first a c and then an f are played. The signal was sampled in 16-bit format at a sampling rate of 44100 Hz. This corresponds to CD-quality and means that every second, 44100 measurements of the sound wave were taken, each of the measurements taking an integer value between −32768 and 32767 (32767 + 32768 + 1 = 2^16). Figure 4.2 shows an enlarged picture of the shaded area in Figure 4.1 (2050 measurements, corresponding to 0.046 seconds). The periodogram (in log-coordinates) of this subseries is plotted in Figure 4.3. The largest peak occurs approximately at the fundamental frequency λ_1 = 441 · 2^{−9/12} ≈ 262.22 Hz of the c. Note that, since the periodogram is calculated at Fourier frequencies only, λ_1 cannot be identified exactly (see also the remarks below). A small number of partials λ_j (j ≥ 2), and their contribution, can also be seen in Figure 4.3.
Figure 4.2 Zoomed piano sound wave (shaded area in Figure 4.1; amplitude ×10³ versus time in seconds).
The time evolution of the spectral content is displayed by calculating the periodogram on moving blocks (the so-called spectrogram):

I(t, λ) = | Σ_{j=1}^{n} W((t − j)/(nb)) x_j e^{−ijλ} |² / Σ_{j=1}^{n} W²((t − j)/(nb))   (4.81)
where W : R → R₊ is a weight function such that W(u) = 0 for |u| > 1, and b > 0 is a bandwidth that determines how large the window (block) is, i.e. how many consecutive observations are considered to correspond approximately to a harmonic regression model with fixed coefficients α_j, β_j and stationary noise U_t. This is illustrated in color Figure 4.7 for a harpsichord sound, with W(u) = 1{|u| ≤ 1}. Intense pink corresponds to high values of I(t, λ). Figures 4.6a through d show explicitly the change in I(t, λ) between four different blocks. Since the note was played staccato, the sound wave is very short, namely about 0.1 seconds. Nevertheless, there is a change in the spectrum of the sound, with some of the higher harmonics fading away.
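A moving-window periodogram in the spirit of (4.81) can be sketched as follows (a simplified illustration with a rectangular taper and a toy signal whose dominant frequency changes halfway; window sizes and all names are assumptions):

```python
import numpy as np

def spectrogram(x, half_width, step):
    """Periodogram computed on moving blocks of length 2*half_width,
    centered every `step` samples (rectangular taper)."""
    n = len(x)
    blocks, times = [], []
    for c in range(half_width, n - half_width, step):
        seg = x[c - half_width:c + half_width]
        m = (len(seg) - 1) // 2
        d = np.fft.fft(seg - seg.mean())
        blocks.append(np.abs(d[1:m + 1]) ** 2 / (2 * np.pi * len(seg)))
        times.append(c)
    return np.array(times), np.array(blocks)   # rows: time, columns: frequency

# toy signal: dominant frequency 0.3 in the first half, 1.2 in the second
t = np.arange(4000)
x = np.where(t < 2000, np.sin(0.3 * t), np.sin(1.2 * t))
times, S = spectrogram(x, half_width=128, step=64)
```

The peak of each row of S moves from the low-frequency bin to a high-frequency bin, which is exactly the kind of change between blocks visible in Figures 4.6a through d.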
[Figure: sound wave (amplitude versus time in seconds) of a note played on a harpsichord.]

Figure 4.6 Harpsichord sound: periodogram plots (log scale; blocks 1, 22, 42, and 62) for different time frames (moving windows of time points).

Figure 4.7 A harpsichord sound and its spectrogram. Intense pink corresponds to high values of I(t, λ). (Color figures follow page 152.)

Apart from the relative amplitudes of partials, most musical sounds include a characteristic nonperiodic noise component. This is a further justification, apart from possible measurement errors, to include a random deviation part in the harmonic regression equation. The properties of the stochastic process U_t are believed to be characteristic for specific instruments (see e.g. Serra and Smith 1991, Rodet 1997). Typical noise components are, for instance, transient noise in percussive instruments, breath noise in wind instruments, or bow noise of string instruments. For a discussion of statistical issues in this context see e.g. Irizarry (2001). For most instruments, not only the harmonic amplitudes but also the characteristics of the noise component change gradually. This may be modeled by smoothly changing processes as defined for instance in Ghosh et al. (1997). Other approaches are discussed in Priestley (1965) and Dahlhaus (1996a,b, 1997) (see Section 4.2.1 above).
Some interesting applications of the asymptotic results in Section 4.2.8 to questions arising in the analysis of musical sounds are discussed in the papers by Irizarry cited above. Roughly speaking, different groups of hair cells in the inner ear respond to different frequency components of the sound wave. This means that certain frequency regions correspond to certain hair groups. Frequency bands with high spectral density f (or high increments dF of the spectral distribution) activate the associated hair groups.

To obtain a simple model for the effect of a sound on the basilar membrane movement, Slaney and Lyon (1991) partition the cochlea into 86 sections, each section corresponding to a particular group of cells. Thumfart (1995) assumes that each group of cells acts like a separate linear filter Φ_j (j = 1, ..., 86). (This is a simplification compared to Slaney and Lyon, who use nonlinear models.) The wave entering the inner ear is assumed to be the original sound wave X_t, filtered by the outer ear by a linear filter A_1 and by the middle ear by a linear filter A_2. Thus, the output of the inner ear that generates the final nerve impulses consists of 86 time series

Y_{t,j} = Φ_j(B) A_2(B) A_1(B) X_t  (j = 1, ..., 86).   (4.82)

For each section j, one may compute a local periodogram I_j(u, λ) around time u and its Fourier transform

c(k, j, u) = ∫_{−π}^{π} I_j(u, λ) e^{ikλ} dλ,   (4.83)

which Slaney and Lyon call the correlogram. This is in fact an estimated local autocovariance at lag k for section j and the time segment with midpoint u. The Slaney-Lyon correlogram thus essentially characterizes the local autocovariance structure of the resulting nerve impulse series. Thumfart (1995) shows formally how, and under which conditions, this model can be defined within the framework of processes with a discrete evolutionary spectrum. He also suggests a simple method for estimating pitch (the fundamental frequency) at local time u by setting λ_1(u) = 2π/k_max(u), where k_max(u) = arg max_k C(k, u) and C(k, u) = Σ_{j=1}^{86} c(k, j, u).
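The pitch rule λ̂_1(u) = 2π/k_max(u) rests on the autocovariance of a periodic signal peaking at the fundamental period. A minimal global (non-local, single-channel) sketch with a synthetic periodic signal (the lag search range and the toy signal are assumptions):

```python
import numpy as np

def pitch_from_autocov(x, kmin, kmax):
    """Estimate pitch as 2*pi / k_max, where k_max maximizes the sample
    autocovariance over lags kmin..kmax -- a simplified, single-channel
    version of the correlogram idea."""
    x = x - x.mean()
    n = len(x)
    acov = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(kmax + 1)])
    k_star = kmin + np.argmax(acov[kmin:kmax + 1])
    return 2 * np.pi / k_star

# periodic toy "sound": fundamental period 25 samples plus two partials
t = np.arange(5000)
x = (np.sin(2 * np.pi * t / 25)
     + 0.5 * np.sin(2 * np.pi * 2 * t / 25)
     + 0.25 * np.sin(2 * np.pi * 3 * t / 25))
lam1 = pitch_from_autocov(x, kmin=5, kmax=60)
```

All partials realign at the full period, so the autocovariance peak sits at lag 25 and the estimated pitch is 2π/25.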
4.3.3 Identification of pitch, tone separation and purity of intonation

In a recent study, Weihs et al. (2001) investigate objective criteria for judging the quality of singing (also see Ligges et al. 2002). The main question asked in their analysis is how to assess purity of intonation. In an experimental setting, with standardized playback piano accompaniment in a recording studio, 17 singers were asked to sing Händel's Tochter Zion and Beethoven's Ehre Gottes aus der Natur. The audio signal of the vocal performance was recorded in CD quality, in 16-bit format at a sampling rate of 44100 Hz. For the actual statistical analysis, the data is reduced to 11000 Hz for computational reasons, and standardized to the interval [−1, 1].
The first question is how to identify the fundamental frequency (pitch) λ_1. In the harmonic regression model above, estimates of λ_1 and the partials λ_j (2 ≤ j ≤ k) are identical with the k frequencies where the k largest peaks of the periodogram occur (4.84). Because of the restriction to Fourier frequencies, the periodogram may have two adjacent peaks, and the estimate is too inaccurate in general. An empirical interpolation formula is suggested by the authors to obtain an improved estimate λ̂_1. A comparison with harmonic regression is not made, however, so that it is not clear how well the interpolation works in comparison.
Given a procedure for pitch identification, an automatic note separation procedure can be defined. This is a procedure that identifies time points in a sound signal where a new note starts. The interesting result in Weihs et al. is that automatic note separation works better for amateur singers than for professionals. The reason may be the absence of vibrato in amateur voices. In a third step, Weihs et al. address the question of how to assess computationally the purity of intonation based on a vocal time series. This is done using discriminant analysis. The discussion of these results is therefore postponed to Chapter 9.
4.3.4 Music as 1/f noise?

In the 1970s, Voss and Clarke (1975, 1978) discovered a seemingly universal law according to which music has a 1/f spectrum. By a 1/f-spectrum one means that the observed process has a spectral density f such that f(λ) ∝ λ^{−1} as λ → 0. In the sense of definition (4.10), such a density actually does not exist; however, a generalized version of the spectral density exists in the sense that the expected value of the periodogram converges to this function (see Matheron 1973, Solo 1992, Hurvich and Ray 1995). Specifically, Voss and Clarke analyzed acoustic music signals by first transforming the recorded signal X_t in the following way: a) X_t is filtered by a band-pass filter (frequencies outside the interval [10 Hz, 10000 Hz] are eliminated); and b) the instantaneous power Y_t = X_t² is filtered by a low-pass filter (frequencies above 20 Hz are eliminated). This filtering technique essentially removes higher frequencies but retains the overall shape (or envelope) of each sound wave corresponding to a note, and the relative position on the onset axis. In this sense, Voss and Clarke actually analyzed rhythmic structures. A recent, statistically more sophisticated study along this line is described in Brillinger and Irizarry (1998).
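A crude numerical check of a λ^{−γ} spectral law is a least squares regression of log I(λ_j) on log λ_j at low frequencies (a sketch only, not the SEMIFAR fit used below; the test series and the bandwidth fraction are assumptions):

```python
import numpy as np

def spectral_slope(x, frac=0.1):
    """Estimate gamma in f(lambda) ~ lambda^(-gamma) by regressing
    log-periodogram on log-frequency over the lowest `frac` of the
    Fourier frequencies."""
    n = len(x)
    m = (n - 1) // 2
    d = np.fft.fft(x - np.mean(x))
    I = np.abs(d[1:m + 1]) ** 2 / (2 * np.pi * n)
    lam = 2 * np.pi * np.arange(1, m + 1) / n
    k = max(10, int(frac * m))                 # use only low frequencies
    slope = np.polyfit(np.log(lam[:k]), np.log(I[:k]), 1)[0]
    return -slope                              # gamma

rng = np.random.default_rng(4)
wn = rng.standard_normal(4096)                 # white noise: gamma near 0
rw = np.cumsum(wn)                             # random walk: gamma near 2
g_wn = spectral_slope(wn)
g_rw = spectral_slope(rw)
```

White noise gives a slope near 0 and a random walk a slope near 2; 1/f music signals sit in between, which is what makes the Voss-Clarke finding noteworthy.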
Figure 4.8 A harpsichord sound wave (a), logarithm of squared amplitudes (b), histogram of the series (c) and its periodogram on log-scale (d) together with fitted SEMIFAR-spectrum.

One objection to this approach can be that in acoustic signals, structural
properties of the composition may be confounded with those of the instruments. Consider, for instance, the harpsichord sound wave in Figure 4.8a. The square of the wave is displayed in Figure 4.8b on logarithmic scale. The picture illustrates that, apart from obvious oscillation, the (envelope of the) signal changes slowly. Fitting a SEMIFAR model (with order p ≤ 8 chosen by the BIC) yields a good fit to the periodogram. The estimated long-memory parameter can then be compared with the 1/f law (Voss and Clarke 1975, 1978, Voss 1988; Brillinger and Irizarry 1998). Figures 4.9a and c show the log-frequencies plotted against onset time for the first movement of Bach's first Cello Suite and for Paganini's Capriccio No. 24. The estimates are similar to before, namely d̂ = 0.51 ([0.20, 0.81]) for Bach and 0.33 ([0.24, 0.42]) for Paganini.
CHAPTER 5

Hierarchical methods

5.1 Musical motivation

Musical structures are typically generated in a hierarchical manner. Most compositions can be divided approximately into natural segments (e.g. movements of a sonata); these are again divided into smaller units (e.g. exposition, development, and coda of a sonata movement). These can again be divided into smaller parts (e.g. melodic phrases), and so on. Different parts even at the same hierarchical level need not be disjoint. For instance, different melodic lines may overlap. Moreover, different parts are usually closely related within and across levels. A general mathematical approach to understanding the vast variety of possibilities can be obtained, for instance, by considering a hierarchy of maps defined in terms of a manifold (see e.g. Mazzola 1990a). The concept of hierarchical relationships and similarities is also related to self-similarity and fractals as defined in Mandelbrot (1977) (see Chapter 3). To obtain more concrete results, hierarchical regression models have been developed in the last few years (Beran and Mazzola 1999a,b, 2000, 2001).
5.2 Basic principles

5.2.1 Hierarchical aggregation and decomposition

Suppose that we have two time series Y_t, X_t and we wish to model the relationship between Y_t and X_t. The simplest model is simple linear regression,

Y_t = β_o + β_1 X_t + ε_t   (5.1)

or, with M explanatory series X_{t,j},

Y_t = β_o + Σ_{j=1}^{M} β_j X_{t,j} + ε_t.   (5.2)

For several response series Y_{t,1}, ..., Y_{t,L} this becomes

Y_{t,1} = β_{01} + Σ_{j=1}^{M} β_{j1} X_{t,j} + ε_{t,1}

Y_{t,2} = β_{02} + Σ_{j=1}^{M} β_{j2} X_{t,j} + ε_{t,2}

⋮

Y_{t,L} = β_{0L} + Σ_{j=1}^{M} β_{jL} X_{t,j} + ε_{t,L}.
The explanatory series are obtained by kernel smoothing at successively smaller bandwidths b_1 > b_2 > ... > b_M:

X_{t,1} = (nb_1)^{−1} Σ_{s=1}^{n} K((t − s)/(nb_1)) X_s   (5.3)

and, for j = 2, ..., M,

X_{t,j} = (nb_j)^{−1} Σ_{s=1}^{n} K((t − s)/(nb_j)) [ X_s − Σ_{l=1}^{j−1} X_{s,l} ].   (5.4)

The collection of time series {X_{1,j}, ..., X_{n,j}} (j = 1, ..., M) is called a hierarchical decomposition of X_t. The HIREG model is then defined by (5.2).
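The decomposition (5.3)-(5.4) can be sketched numerically. The following minimal Python illustration uses a rectangular moving average in place of the normalized kernel of the text (bandwidths are in samples; the kernel choice, test series and all names are assumptions):

```python
import numpy as np

def ksmooth(x, b):
    """Moving-average smooth with rectangular kernel K(u) = 1{|u| <= 1},
    i.e. averaging over a window of half-width b samples."""
    n = len(x)
    out = np.empty(n)
    for t in range(n):
        lo, hi = max(0, t - b), min(n, t + b + 1)
        out[t] = x[lo:hi].mean()
    return out

def hierarchical_decomposition(x, bandwidths):
    """Each level smooths what the coarser levels have not yet explained,
    as in (5.3)-(5.4); the last component is the residual level."""
    layers, acc = [], np.zeros_like(x, dtype=float)
    for b in bandwidths:
        layer = ksmooth(x - acc, b)
        layers.append(layer)
        acc = acc + layer
    layers.append(x - acc)
    return layers

rng = np.random.default_rng(5)
x = np.sin(np.arange(400) / 40.0) + 0.3 * rng.standard_normal(400)
layers = hierarchical_decomposition(x, [32, 8])
```

By construction the layers add up exactly to the original series, which is the defining property of the decomposition.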
If the ε_t (t = 1, 2, ...) are independent, then the usual techniques of multiple linear regression can be used (see e.g. Plackett 1960, Rao 1973, Ryan 1996, Srivastava and Sen 1997, Draper and Smith 1998). In the case of correlated errors ε_t, appropriate adjustments of tests, confidence intervals, and parameter selection techniques must be made. The main assumption in the HIREG model is that we know which bandwidths to use. In some cases this may indeed be true. For instance, if there is a 3/4 meter at the beginning of a musical score, then bandwidths that are divisible by three are plausible.
In the HISMOOTH model, the explanatory variables are kernel smooths of a series x at different bandwidths, with normalized kernel weights

w_b(t, t_i) = K((t − t_i)/b) / Σ_{j=1}^{n} K((t − t_j)/b).   (5.5)

The response series is modeled as

Y_{i,n} = Y(t_i) = Σ_{j=1}^{M} β_j g(t_i; b_j) + ε_i   (5.6)

where

g(t_i; b_j) = Σ_{l=1}^{n} w_{b_j}(t_i, t_l) x(t_l).   (5.7)

Denote by θ_o = (β_o, b_o)^t the true parameter vector. Then θ_o can be estimated by a nonlinear least squares method as follows: define

e_i(θ) = Y(t_i) − Σ_{j=1}^{M} β_j g(t_i; b_j)   (5.8)

and S(θ) = Σ_{i=1}^{n} e_i²(θ), and write g′ = ∂g/∂b. Then

θ̂ = arg min_θ S(θ),   (5.9)
or equivalently, θ̂ solves the estimating equations

Σ_{i=1}^{n} φ(t_i, y; θ) = 0   (5.10)

where φ = (φ_1, ..., φ_{2M})^t, with

φ_j(t, y; θ) = e_i(θ) g(t; b_j)   (5.11)

for j = 1, ..., M and

φ_j(t, y; θ) = e_i(θ) β_{j−M} g′(t; b_{j−M})   (5.12)

for j = M + 1, ..., 2M. Under suitable assumptions, the estimate θ̂ is asymptotically normal. More specifically, set

h_i(t; θ_o) = g(t; b_i)  (i = 1, ..., M),   (5.13)

h_{M+i}(t; θ_o) = β_i g′(t; b_i)  (i = 1, ..., M),   (5.14)

and consider

a_r = n^{−1} Σ_{i,j=1}^{n} γ(i − j) g(t_i; b_r) g(t_j; b_r),   (5.16)

b_{rs} = n^{−1} Σ_{i,j=1}^{n} γ(i − j) g(t_i; b_r) g(t_j; b_s).   (5.17)

Then, as n → ∞, lim inf |a_r| > 0 and lim inf |b_{rs}| > 0 for all r, s ∈ {1, ..., M}.
(A3) x(t_i) = μ(t_i) where μ : [0, T] → R is a function in C[0, T], T < ∞.

(A4) The set of time points converges to a set A that is dense in [0, T].

Then we have (Beran and Mazzola 1999b):

Theorem 12 Let Θ_1 and Θ_2 be compact subsets of R and R₊ respectively, Θ = Θ_1^M × Θ_2^M, and let γ = ½ min{1, 1 − 2d}. Suppose that (A1), (A2), (A3) and (A4) hold and θ_o is in the interior of Θ. Then, as n → ∞,

(i) θ̂ →_p θ_o;

(ii) V_n → V, where V is a symmetric positive definite 2M × 2M matrix;

(iii) n^γ (θ̂ − θ_o) →_d N(0, V).

Thus, θ̂ is asymptotically normal, but for d > 0 (i.e. long-memory errors), the rate of convergence n^{1/2−d} is slower than the usual n^{1/2} rate.

A particular aspect of HISMOOTH models is that the bandwidths b_j are fixed positive unknown parameters that are estimated from the data. This means that, in contrast to nonparametric regression models (see e.g. Gasser and Müller 1979, Simonoff 1996, Bowman and Azzalini 1997, Eubank 1999), the notion of an optimal bandwidth does not exist here. There is a fixed true bandwidth (or a vector of true bandwidths) that has to be estimated. A HISMOOTH model is in fact a semiparametric nonlinear regression rather than a nonparametric smoothing model.
Theorem 12 can be interpreted as multiple linear regression where uncertainty due to (explanatory) variable selection is taken into account. The set of possible combinations of explanatory variables is parametrized by a continuous bandwidth-parameter vector b ∈ Θ_2^M. Confidence intervals for β based on the asymptotic distribution of θ̂ take into account additional uncertainty due to variable selection from the (infinite) parametric family of M explanatory variables X = {(x_{b_1}, ..., x_{b_M}) : b_j ∈ Θ_2, b_1 > b_2 > ... > b_M}.

For the practical implementation of the model, the following algorithms, which include estimation of M, are defined in Beran and Mazzola (1999b): if M is fixed, then the algorithm consists of two basic steps: a) generation of the set of all possible explanatory variables x_s (s ∈ S), and b) selection of the M variables (bandwidths) that maximize R². This means that after step 1, the estimation problem is reduced to variable selection in multiple regression, with a fixed number M of explanatory variables. Standard regression software, such as the function leaps in S-Plus, can be used for this purpose. The detailed algorithm is as follows:

Algorithm 1 Define a sufficiently fine grid S = {s_1, ..., s_k} ⊂ Θ_2 and carry out the following steps:

Step 1: Define k explanatory time series x_s = [x_s(t_1), ..., x_s(t_n)]^t (s ∈ S) by x_s(t_i) = g(t_i, s).

Step 2: For each b = (b_1, ..., b_M) ∈ S^M with b_i > b_j (i < j), define the n × M matrix X = (x_{b_1}, ..., x_{b_M}) and let β̂ = β̂(b) = (X^t X)^{−1} X^t y. Also, denote by R²(b) the corresponding value of R² obtained from least squares regression of y on X.
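Algorithm 1 thus reduces to variable selection over a bandwidth grid. A small self-contained sketch, using a moving-average smoother and exhaustive search in place of leaps (the grid, the simulated data and all names are assumptions):

```python
import numpy as np
from itertools import combinations

def moving_avg(x, b):
    """Rectangular-kernel smooth of x with half-width b samples."""
    n = len(x)
    return np.array([x[max(0, t - b):min(n, t + b + 1)].mean() for t in range(n)])

def select_bandwidths(y, x, grid, M):
    """Step 2 of Algorithm 1 by brute force: regress y on every
    M-subset of smoothed columns and keep the one maximizing R^2."""
    cols = {b: moving_avg(x, b) for b in grid}
    best_R2, best_b = -np.inf, None
    for combo in combinations(grid, M):
        X = np.column_stack([np.ones(len(y))] + [cols[b] for b in combo])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        R2 = 1 - resid.var() / y.var()
        if R2 > best_R2:
            best_R2, best_b = R2, combo
    return best_b, best_R2

rng = np.random.default_rng(6)
x = rng.standard_normal(300).cumsum()
# response built from the bandwidth-20 and bandwidth-5 smooths of x
y = 2.0 * moving_avg(x, 20) - 1.0 * moving_avg(x, 5) + 0.1 * rng.standard_normal(300)
bands, R2 = select_bandwidths(y, x, grid=[40, 20, 10, 5, 2], M=2)
```

Because y was constructed from the bandwidth-20 and bandwidth-5 columns, the search recovers exactly those two bandwidths with a near-perfect fit.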
If M is unknown, then the algorithm can be modified, for instance by increasing M as long as all β-coefficients are significant. In order to calculate the significance, one obtains an estimated standard deviation σ̂_j(θ̂) of β̂_j and sets

p_j = 2[1 − Φ(|β̂_j| / σ̂_j(θ̂))]   (5.18)

where Φ denotes the cumulative standard normal distribution function. If max(p_j) < α, set M_o = M_o + 1 and repeat steps 1 through 5. Otherwise, stop.
⟨ψ_{jk}, φ_{ol}⟩ = 0.   (5.22)

Any function g ∈ L²(R) can then be expanded as

g(x) = Σ_k a_k φ_{ok}(x) + Σ_{j=0}^{∞} Σ_{k=−∞}^{∞} b_{jk} ψ_{jk}(x)   (5.23)

= Σ_{k=−∞}^{∞} a_k φ(x − k) + Σ_{j=0}^{∞} Σ_{k=−∞}^{∞} b_{jk} 2^{j/2} ψ(2^j x − k)   (5.24)

where

a_k = ⟨g, φ_k⟩ = ∫ g(x) φ_k(x) dx   (5.25)

and

b_{jk} = ⟨g, ψ_{jk}⟩ = ∫ g(x) ψ_{jk}(x) dx.   (5.26)

Note in particular that ∫ g²(x) dx = Σ a_k² + Σ b_{jk}². The purpose of this representation is a decomposition with respect to frequency and time. A simple wavelet, where the meaning of the decomposition can be understood directly, is the Haar wavelet with
φ(x) = 1{0 ≤ x < 1},   (5.27)

ψ(x) = 1{0 ≤ x < ½} − 1{½ ≤ x < 1}.   (5.28)

In this case,

a_k = ∫_k^{k+1} g(x) dx.   (5.29)

Thus, the coefficients of the basis functions φ_k are equal to the average value of g in the interval [k, k + 1]. For ψ_{jk} we have

b_{jk} = 2^{j/2} [ ∫_{2^{−j}k}^{2^{−j}(k+½)} g(x) dx − ∫_{2^{−j}(k+½)}^{2^{−j}(k+1)} g(x) dx ].   (5.30)
Suppose now that observations y_0, ..., y_{n−1} with n = 2^m are given, and define the step function

g_n(x) = Σ_{k=0}^{n−1} y_k 1{ k/n ≤ x < (k + 1)/n }.   (5.31)

Since g_n is a step function (like the Haar basis functions themselves) and zero outside the interval [0, 1), the Haar wavelet decomposition of g_n has only a finite number of nonzero terms:

g_n(x) = a_{oo} + Σ_{j=0}^{m−1} Σ_{k=0}^{2^j−1} b_{jk} ψ_{jk}(x).   (5.32)

Moreover, the sample variance decomposes as

n^{−1} Σ_t (y_t − ȳ)² = Σ_{j=0}^{m−1} Σ_{k=0}^{2^j−1} b_{jk}².   (5.33)
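The Haar coefficients of a step function like (5.31), and the variance identity behind (5.33), can be checked with a simple pyramid of pairwise averages and differences (a sketch; this scaling folds the 2^{j/2} normalization into per-level weights, an implementation choice rather than the book's convention):

```python
import numpy as np

def haar_transform(y):
    """Pyramid of pairwise averages and differences (a scaled Haar
    transform) for a series of length n = 2^m.  Returns the overall
    mean and the detail coefficients, coarsest level first."""
    a = np.asarray(y, dtype=float)
    m = int(np.log2(len(a)))
    details = []
    for _ in range(m):
        details.append((a[0::2] - a[1::2]) / 2)   # detail at this level
        a = (a[0::2] + a[1::2]) / 2               # averages for next level
    return a[0], details[::-1]

rng = np.random.default_rng(7)
y = rng.standard_normal(256)
ybar, details = haar_transform(y)
n = len(y)
# ANOVA-type identity: weighted squared details give the total sum of squares
total = sum((n / 2 ** j) * np.sum(d ** 2) for j, d in enumerate(details))
```

With these weights the identity is exact, mirroring (5.33): every level's squared detail coefficients account for a share of the overall variability.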
5.1b. Note that D stands for mother and S for father wavelet. Moreover, the numbering in the plot (as given in S-Plus) is opposite to the one given above: s4 and d4 in the plot correspond to the coarsest level j = 0 above. The corresponding functions at the different levels are given in Figure 5.1c. The ten and fifty largest basis contributions are given in Figures 5.1d and e respectively (together with the data on top and residuals at the bottom). Figure 5.1f shows the time-frequency plot of the squared coefficients in the wavelet decomposition of x_i. Bright shading corresponds to large coefficients. All plots emphasize the high-frequency portion with large amplitude between i = 701 and 900. Moreover, the trend at this location is visible through the coefficient values of the father wavelet (s4 in the plot) and the slightly brighter shading in the lowest frequency band of the time-frequency plot.
An alternative to HISMOOTH models can be defined via wavelets (the following definition is a slight modification of Beran and Mazzola 2001):

Definition 41 Let φ, ψ ∈ L²(R) be a father and the corresponding mother wavelet respectively, φ_k(·) = φ(· − k), ψ_{j,k} = 2^{j/2} ψ(2^j · − k) (k ∈ Z, j ∈ N) the orthogonal wavelet basis generated by φ and ψ, and u_i and ε_i (i ∈ Z) independent stationary zero mean processes satisfying suitable moment conditions. Assume X(t_i) = g(t_i) + u_i with g ∈ L²[0, T], t_i ∈ [0, T] and wavelet decomposition g(t) = Σ a_k φ_k(t) + Σ b_{j,k} ψ_{j,k}(t). For 0 = c_{M+1} < c_M < ... < c_1 < c_o = ∞ let

g(t; c_{i−1}, c_i) = Σ_{c_i ≤ |a_k| < c_{i−1}} a_k φ_k(t) + Σ_{c_i ≤ |b_{j,k}| < c_{i−1}} b_{j,k} ψ_{j,k}(t).

Then (X(t_i), Y(t_i)) (i = 1, ..., n) is a Hierarchical Wavelet Model (HIWAVE model) of order M, if there exist M ∈ N, β = (β_1, ..., β_M) ∈ R^M, η = (η_1, ..., η_M) ∈ R₊^M with 0 < η_M < ... < η_1 < η_o = ∞ such that

Y(t_i) = Σ_{l=1}^{M} β_l g(t_i; η_{l−1}, η_l) + ε_i.   (5.34)

The definition means that the time series Y(t) is decomposed into orthogonal components that are proportional to certain bands in the wavelet decomposition of the explanatory series X(t), the bands being defined by the size of the wavelet coefficients. As for HISMOOTH models, the parameter vector θ = (β, η)^t can be estimated by nonlinear least squares regression.

To illustrate how HIWAVE models may be used, consider the following simulated example: let x_i = g(t_i) (i = 1, ..., 1024) as in the previous example. The function g is decomposed into g(t) = g(t; ∞, η_1) + g(t; η_1, 0) = g_1(t) + g_2(t), where η_1 is such that 50 wavelet coefficients of g are larger than or equal to η_1. Figure 5.2 shows g, g_1, and g_2. A simulated series of response variables, defined by Y(t_i) = 2g_1(t_i) + ε_i (i = 1, ..., 1024) with independent zero-mean normal errors ε_i with variance σ² = 100, is shown in Figure 5.3b.
Figure 5.2 Simulated series x, g_1 (the first 50 wavelet components of x), and g_2 = x − g_1, plotted against i.
A comparison of the two scatter plots in Figures 5.3c and d shows a much clearer dependence between y and g_1 as compared to y versus x = g. Figure 5.3e illustrates that there is no relationship between y and g_2. Finally, the time-frequency plot in Figure 5.3f indicates that the main periodic behavior occurs for t ∈ {701, ..., 900}. The difficulty in practice is that the correct decomposition of x into g_1 and the redundant component g_2 is not known. The fitted model nevertheless identifies the relevant time span [701, 900] quite exactly, since g(t_i; ∞, η_1) corresponds to the dominant wavelet coefficients. Confidence intervals can be obtained for the parameters as in linear least squares regression. These intervals are generally too short, since they do not take into account that η_1 is estimated.

Figure 5.3 Simulated HIWAVE model: explanatory series g_1 (a), y-series (b), y versus x (c), y versus g_1 (d), y versus g_2 = x − g_1 (e), and time-frequency plot of y (f).
COLOR FIGURE 2.35 Symbol plot with x = pj5, y = pj7, and radius of circles proportional to pj6.

COLOR FIGURE 2.36 Symbol plot with x = pj5, y = pj7. The rectangles have width pj1 (diminished second) and height pj6 (augmented fourth).

COLOR FIGURE 2.37 Symbol plot with x = pj5, y = pj7, and triangles defined by pj1 (diminished second), pj6 (augmented fourth), and pj10 (diminished seventh).

COLOR FIGURE 3.2 Fractal pictures (by Céline Beran, computer generated).

COLOR FIGURE 4.7 A harpsichord sound and its spectrogram. Intense pink corresponds to high values of I(t, λ).

Figure 5.5 Hierarchical decomposition of metric, melodic, and harmonic indicators for Bach's Canon cancricans (Das Musikalische Opfer, BWV 1079) and Webern's Variation op. 27, No. 2.
dx(t_j) = [x(t_j) − x(t_{j−1})] / (t_j − t_{j−1})

and

dx^{(2)}(t_{j−1}) = [dx(t_j) − dx(t_{j−1})] / (t_j − t_{j−1}).

Each of these variables is decomposed hierarchically into four components, as described above, with the bandwidths b_1 = 4 (weighted averaging over 8 bars), b_2 = 2 (4 bars), b_3 = 1 (2 bars) and b_4 = 0 (residual, no averaging). We thus obtain 48 variables (functions):
x_metric,1    dx_metric,1    d²x_metric,1
x_metric,2    dx_metric,2    d²x_metric,2
x_metric,3    dx_metric,3    d²x_metric,3
x_metric,4    dx_metric,4    d²x_metric,4
x_melodic,1   dx_melodic,1   d²x_melodic,1
x_melodic,2   dx_melodic,2   d²x_melodic,2
x_melodic,3   dx_melodic,3   d²x_melodic,3
x_melodic,4   dx_melodic,4   d²x_melodic,4
x_hmax,1      dx_hmax,1      d²x_hmax,1
x_hmax,2      dx_hmax,2      d²x_hmax,2
x_hmax,3      dx_hmax,3      d²x_hmax,3
x_hmax,4      dx_hmax,4      d²x_hmax,4
x_hmean,1     dx_hmean,1     d²x_hmean,1
x_hmean,2     dx_hmean,2     d²x_hmean,2
x_hmean,3     dx_hmean,3     d²x_hmean,3
x_hmean,4     dx_hmean,4     d²x_hmean,4   (5.35)
Figure 5.6 Quantitative analysis of performance data is an attempt to understand objectively how musicians interpret a score without attaching any subjective judgement. (Left: Freddy by J.B.; right: J.S. Bach, woodcutting by Ernst Würtemberger, Zürich. Courtesy of Zentralbibliothek Zürich.)
The variables are summarized in an n × 57 matrix X. After orthonormalization, the following model is assumed:

y(j) = Zβ(j) + ε(j)   (5.36)

where y(j) = [y(t_1, j), y(t_2, j), ..., y(t_n, j)]^t are the tempo measurements for performance j, Z is the orthonormalized X-matrix, β(j) is the vector of coefficients (β_1(j), ..., β_p(j))^t, and ε(j) = [ε(t_1, j), ε(t_2, j), ..., ε(t_n, j)]^t is a vector of n identically distributed, but possibly correlated, zero mean random variables ε(t_i, j) (t_i ∈ T) with variance var(ε(t_i, j)) = σ²(j). Beran and Mazzola (1999a) select the most important variables for each of the 28 performances separately, by stepwise linear regression. The main aim of the analysis is to study the relationship between structural weight functions and tempo with respect to a) existence, b) type and complexity, and c) comparison of different performances. It should perhaps be emphasized at this point that quantitative analysis of performance data aims at gaining a better objective understanding of how pianists interpret a score. The selected variables can be used in the following way: using the size of |β̂_k| as a criterion for the importance of variable k, we may add the terms in the regression equation sequentially to obtain a hierarchy of tempo curves ranging from very simple to complex. This is illustrated in Figures 5.8a and b for Ashkenazy and Horowitz's third performance.
5.3.3 HISMOOTH models for the relationship between tempo and structural curves

An analysis of the relationship between a melodic curve (Chapter 3) and the 28 tempo curves for Schumann's Träumerei is discussed in Beran and Mazzola (1999). In a first step, effects of fermatas and ritardandi are subtracted from each of the 28 tempo series individually, using linear regression. The component of the melodic curve m_t orthogonal to these variables is then used. The second algorithm for HISMOOTH models is used, with a grid G that takes into account that 0 ≤ t ≤ 32 and only certain multiples of 1/8 correspond to musically interesting neighborhoods: G = {32, 30, 28, 26, 24, 22, 20, 18, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1.5, 1, 0.75, 0.5, 0.25, 0.125}.

Figure 5.7 Most important melodic curves obtained from HIREG fit to tempo curves for Schumann's Träumerei.

Figure 5.8 Adding effects for ASKENAZE (a) and HOROWITZ3 (b), plotted against onset time.
Note that since for large bandwidths the resulting curves g do not vary
much, large trial bandwidths do not need to be too close together. The
error process is modeled by a fractional AR(p, d) process, the order being
estimated from the data by the BIC. Note that, from the musicological
point of view, the fractional dierencing parameter can be interpreted as a
measure of self-similarity (see Chapter 3).
For illustration, consider the performances CORTOT1 and HOROWITZ1 (see Figures 5.9b and c). In both cases, the number M of explanatory variables estimated by Algorithm 2 turns out to be 3 (with a level of significance of α = 0.05). The estimated bandwidths (and 95%-confidence intervals) are b̂_1 = 4.0 ([2.66, 5.34]), b̂_2 = 2.0 ([1.10, 2.90]) and b̂_3 = 0.5 ([0.17, 0.83]) for CORTOT1, and b̂_1 = 4 ([2.26, 5.74]), b̂_2 = 1 ([0.39, 1.62]) and b̂_3 = 0.25 ([0.04, 0.46]) for HOROWITZ1. The estimate of β_1 is β̂_1 = 0.81 with confidence interval [0.48, 1]. For Horowitz we obtain a fractional AR(2) process with d = 0.30. Thus, for Horowitz, d > 0 (long-range dependence): while the small scale structures are explained by the melodic structure of the score, the remaining unexplained part of the performance is still coherent in the sense that there is a relatively strong (self-)similarity and positive correlations even between remote parts. On the other hand, for Cortot, d < 0 (antipersistence): while larger scale structures are explained by the melodic structure of the score, more local fluctuations are still coherent in the sense that there is a relatively strong negative autocorrelation even between remote parts; these smaller scale structures are however difficult to relate directly to the melodic structure of the score.
Figures 5.9a through d also show simplified tempo curves for all 28 performances, obtained by HISMOOTH fits with M = 3. The comparison of typical characteristics is now much easier than for the original curves. In particular, there is a strong similarity between all three performances by Horowitz on one hand, and the three performances by Cortot on the other hand. Several performers (Moiseiwitsch, Novaes, Ortiz, Krust, Schnabel, Katsaris) put even higher emphasis on global melodic features than Cortot. Striking similarities can also be seen between Horowitz, Klien, and Brendel. Another group of similar performances consists of Cortot, Argerich, Capova, Demus, Kubalek, and Shelley.
5.3.4 Digital encoding of musical sounds (CD, mpeg)

Wavelet decomposition plays an important role in modern techniques of digital sound and image processing. Digital encoding of sounds (e.g. CD, mpeg) relies on algorithms that make it possible to compress complex data into as few storage units as possible. Wavelet decomposition is one such technique: instead of storing a complete function (evaluated or measured at a very large number of time points on a fine grid), one only needs to keep a relatively small number of wavelet coefficients. There is an extensive literature on how exactly this can be done to suit particular engineering needs. Since here the focus is on genuine musical questions rather than signal processing, we do not pursue this further. The interested reader is referred to the engineering literature such as Effelsberg and Steinmetz (1998) and references therein.
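As a toy illustration of the compression idea only (not of any particular audio standard), the following sketch performs one level of a Haar wavelet transform and reconstructs the signal from only its largest coefficients:

```python
import numpy as np

def haar_compress(x, keep):
    """One level of the Haar wavelet transform for a signal of even length:
    keep only the `keep` largest-magnitude coefficients, zero the rest,
    and reconstruct the signal from this sparse representation."""
    s = (x[0::2] + x[1::2]) / np.sqrt(2)   # smooth (approximation) coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail coefficients
    coeffs = np.concatenate([s, d])
    sparse = np.zeros_like(coeffs)
    idx = np.argsort(-np.abs(coeffs))[:keep]
    sparse[idx] = coeffs[idx]              # store only `keep` numbers
    s2, d2 = sparse[:len(s)], sparse[len(s):]
    x_rec = np.empty_like(x)
    x_rec[0::2] = (s2 + d2) / np.sqrt(2)   # inverse Haar step
    x_rec[1::2] = (s2 - d2) / np.sqrt(2)
    return x_rec

# a piecewise-constant signal survives keeping only the 2 smooth coefficients
x_rec = haar_compress(np.array([4.0, 4.0, 8.0, 8.0]), keep=2)
```

For smooth signals most detail coefficients are near zero, which is why discarding them loses little information.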
5.3.5 Wavelet analysis of tempo curves

Consider the tempo curves for Schumann's Träumerei. Wavelet analysis can help one to understand some of the similarities and differences between the performances.

Figure 5.10 Time frequency plots for Cortot's and Horowitz's three performances.

The tempo curves are approximated using the two (Figure 5.12), five (Figure 5.13) and ten (Figure 5.14) most important best basis functions. The plots show interesting and plausible similarities and differences. Particularly striking are Cortot's 4-bar oscillations, Horowitz's seismic local fluctuations, the relatively unbalanced tempo with a few extreme tempo variations for Eschenbach, Klien, Ortiz, and Schnabel, the irregular shapes for Moiseiwitsch, and also a strong similarity between Horowitz1 and Moiseiwitsch with respect to the general shape (Figure 5.12).
5.3.6 HIWAVE models of the relationship between tempo and melodic curves

HIWAVE models can be used, for instance, to establish a relationship between structural curves obtained from a score and a performance of the score. Here, we consider the tempo curves by Cortot and Horowitz (Figure 5.15a), and the melodic weight function m(t) defined in Section 3.3.4. Assuming a HIWAVE-model of order 1, Figure 5.15b displays the value of R² plotted against the trial cut-off parameter.
(Panels show the inverse discrete wavelet transform (idwt) and the d1, d2, and s2 coefficient levels, plotted over onset times 0-100.)
Figure 5.11 Wavelet coefficients for Cortot's and Horowitz's three performances.
(Panels, one per performance: Argerich, Arrau, Askenaze, Brendel, Bunin, Capova, Cortot1-3, Curzon, Davies, Demus, Eschenbach, Gianoli, Horowitz1-3, Katsaris, Klien, Krust, Kubalek, Moiseiwitsch, Ney, Novaes, Ortiz, Schnabel, Shelley, Zak; onset time on the horizontal axes.)

Figure 5.12 Tempo curve approximations by the 2 most important best basis functions.
Figure 5.13 Tempo curve approximations by the 5 most important best basis functions.
Figure 5.14 Tempo curve approximations by the 10 most important best basis functions.
Figure 5.15 Tempo curves (a) by Cortot (three curves on top) and Horowitz, R² obtained in HIWAVE-fit plotted against trial cut-off parameter (b), and fitted HIWAVE-curves (c).

Figure 5.16 First derivative of tempo curves (a) by Cortot (three curves on top) and Horowitz, R² obtained in HIWAVE-fit plotted against trial cut-off parameter (b), and fitted HIWAVE-curves (c).

Figure 5.17 Second derivative of tempo curves (a) by Cortot (three curves on top) and Horowitz, R² obtained in HIWAVE-fit plotted against trial cut-off parameter (b), and fitted HIWAVE-curves (c).
CHAPTER 6
assume that the Markov chain is homogeneous in the sense that for any i, j ∈ N, the conditional probability P(X_{t+1} = j | X_t = i) does not depend on time t. The probability distribution of the process X_t (t = 0, 1, 2, ...) is then fully specified by the initial distribution

π_i = P(X_0 = i),  Σ_{i=1}^∞ π_i = 1   (6.2)

and the transition probabilities

p_ij = P(X_{t+1} = j | X_t = i),  Σ_{j=1}^∞ p_ij = 1.   (6.3)
p_ij^(n) = P(X_{t+n} = j | X_t = i) = Σ_{j_1,...,j_{n-1}=1}^∞ p_{i j_1} p_{j_1 j_2} ··· p_{j_{n-1} j} = [M^n]_ij   (6.5)

and

p_j^(n) = P(X_{t+n} = j) = [π^t M^n]_j.   (6.6)

Let

T_j = min{n ≥ 1 : X_n = j}

be the first time when the process reaches state j. The conditional probability that the process ever visits the state j can be written as

f_ij = P(T_j < ∞ | X_0 = i) = P(∪_{n=1}^∞ {X_n = j} | X_0 = i) = Σ_{n=1}^∞ f_ij^(n)   (6.7)

where f_ij^(n) is the probability that j is reached from i for the first time after exactly n steps. Let q_ij denote the probability that, starting in i, the process visits state j infinitely often.   (6.8)

This implies

q_ij = 0 for f_jj < 1

and

q_ij = 1 for f_jj = 1.
A simple way of checking whether a state is persistent or not is given by

Theorem 13 The following holds for a Markov chain:
i) A state j is transient ⟺ q_jj = 0 ⟺ Σ_{n=1}^∞ p_jj^(n) < ∞.
ii) A state j is persistent ⟺ q_jj = 1 ⟺ Σ_{n=1}^∞ p_jj^(n) = ∞.
The condition on Σ_{n=1}^∞ p_ii^(n) can be simplified further for irreducible Markov chains:

Definition 43 A Markov chain is called irreducible, if for each i, j ∈ S, p_ij^(n) > 0 for some n.

Irreducibility means that wherever we start, any state j can be reached in due time with positive probability. This excludes the possibility of being caught forever in a certain subset of S. With respect to persistent and transient states, the situation simplifies greatly for irreducible Markov chains:

Theorem 14 Suppose that X_t (t = 0, 1, ...) is an irreducible Markov chain. Then one of the following possibilities is true: either all states are transient, or all states are persistent.

Definition 44 A distribution π = (π_1, π_2, ...)^t is called a stationary distribution if

Σ_{i=1}^∞ π_i p_ij = π_j,   (6.9)

or in matrix form,

π^t M = π^t.   (6.10)

This means that if we start with distribution π, then the distribution of all subsequent X_t's is again π.
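Numerically, the stationary condition π^t M = π^t is a left-eigenvector problem for the eigenvalue 1. A minimal numpy sketch (the two-state matrix is an invented illustration, not an example from the text):

```python
import numpy as np

def stationary_distribution(M):
    """Solve pi^t M = pi^t with sum(pi) = 1 for a transition matrix M
    (rows summing to one), via the left eigenvector for eigenvalue 1."""
    vals, vecs = np.linalg.eig(M.T)
    k = np.argmin(np.abs(vals - 1.0))  # eigenvalue closest to 1
    pi = np.real(vecs[:, k])
    return pi / pi.sum()               # normalize (also fixes the sign)

# hypothetical two-state chain
M = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi = stationary_distribution(M)
```

Starting the chain in pi, one step of the chain leaves the distribution unchanged, which is exactly the stationarity property stated above.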
The next question is in how far the initial distribution influences the dynamic behavior (probability distribution) into the infinite future. A possible complication is that the process may be periodic in the sense that one may return to certain states periodically:

Definition 45 A state j is said to have period δ, if p_jj^(n) > 0 implies that n is a multiple of δ.

For an irreducible Markov chain, all states have the same period. Hence, the following definition is meaningful:

Definition 46 An irreducible Markov chain is called periodic if δ > 1, and it is called aperiodic if δ = 1.

It can be shown that for an aperiodic Markov chain, there is at most one stationary distribution and, if there is one, then the initial distribution does not play any role ultimately:
Theorem 15 If X_t (t = 0, 1, ...) is an aperiodic irreducible Markov chain for which a stationary distribution π exists, then the following holds:
(i) the Markov chain is persistent;
(ii) lim_{n→∞} p_ij^(n) = π_j for all i, j.

If no stationary distribution exists, then lim_{n→∞} p_ij^(n) = 0 for all i, j. Note that this is even the case if the Markov chain is persistent. One then can classify irreducible aperiodic Markov chains into three classes:

(i) transient chains, with lim_{n→∞} p_ij^(n) = 0 and Σ_{n=1}^∞ p_ij^(n) < ∞;

(ii) null persistent chains, with lim_{n→∞} p_ij^(n) = 0, Σ_{n=1}^∞ p_ij^(n) = ∞ and Σ_{n=1}^∞ n f_jj^(n) = ∞, so that π_j = 0;

(iii) positive persistent chains, with lim_{n→∞} p_ij^(n) = π_j > 0 for all i, j, where the average number of steps till the process returns to state j is given by μ_j = 1/π_j.
For Markov chains with a finite state space, the results simplify further:

Theorem 17 If X_t is an irreducible aperiodic Markov chain with a finite state space {1, ..., k}, then the following holds:
(i) X_t is persistent;
(ii) a unique stationary distribution π = (π_1, ..., π_k)^t exists and is the solution of

π^t (I − M) = 0,  (0 ≤ π_j ≤ 1, Σ_{j=1}^k π_j = 1).   (6.11)
The most likely successor state, given X_t = i, is

x̂_{t+1} = arg max_{j=1,...,m−1} P(X_{t+1} = j | X_t = i).   (6.14)

The transition probabilities are estimated by relative frequencies of observed transitions,

p̂_ij = Σ_{t=2}^n 1{x_{t−1} = i, x_t = j} / Σ_{t=1}^{n−1} 1{x_t = i}.   (6.15)
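The counting estimator in (6.15) is a one-liner per cell: count observed i to j transitions and divide by the number of visits to i. A small numpy sketch (the coded sequence is an invented toy):

```python
import numpy as np

def estimate_transitions(x, n_states):
    """Estimate p_ij as the relative frequency of observed i -> j
    transitions: count pairs (x_{t-1} = i, x_t = j), divide by visits to i."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(x[:-1], x[1:]):
        counts[a, b] += 1
    visits = counts.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(visits > 0, counts / visits, 0.0)

# hypothetical coded sequence with states 0 and 1
M_hat = estimate_transitions([0, 1, 0, 1, 1, 0], n_states=2)
```

Rows of states that are never visited are set to zero here; other conventions (e.g. uniform rows) are equally possible.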
The stationary distribution is then estimated by solving π̂^t (I − M̂) = 0 as described above. Figures 6.3a through l show the resulting values of π̂_j. For visual clarity, points at neighboring states j and j−1 are connected. The figures illustrate how the characteristic shape of π changed in the course of the last 500 years. The most dramatic change occurred in the 20th century with a flattening of the peaks. Starting with Scriabin, a pioneer of atonal music though still rooted in the romantic style of the late 19th century, this is most extreme for the compositions by Schönberg, Webern, Takemitsu, and Messiaen. On the other hand, Prokofiev's Visions fugitives exhibit clear peaks but at varying locations. The estimated stationary distributions can also be used to perform a cluster analysis. Figure 6.4 shows the result of the single linkage algorithm with the Manhattan norm (see Chapter 10). To make names legible, only a subsample of the data was used. An almost perfect separation between Bach and composers from the classical and romantic period can be seen.
calculated. A cluster analysis as above, but with the new probabilities, yields practically the same result as before (Figure 6.5). Since the state space contains three elements only, it is now even easier to find the patterns that determine clustering. In particular, log-odds-ratios log(π_i/π_j) (i ≠ j) appear to be characteristic. Boxplots are shown in Figures 6.6a, 6.7a and 6.8a for categories of composers defined by date of birth as follows: a) before 1600 (early music); b) [1600, 1720) (baroque); c) [1720, 1800) (classic); d) [1800, 1880) (romantic and early 20th century) (Figure 6.12); e) 1880 and later (20th century). This is a simple, though somewhat arbitrary, division with some inaccuracies; for instance, Schönberg is classified in category 4 instead of 5. The log-odds-ratio between π_1 and π_2 is high
Figure 6.3 Stationary distributions π_j (j = 1, ..., 11) of Markov chains with state space Z_12 \ {0}, estimated for the transition between successive intervals.
Figure 6.4 Cluster analysis based on stationary Markov chain distributions for compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rachmaninoff.
can be observed. The differences are even more visible when comparing individual composers. This is illustrated in Figures 6.9a and b where Bach's and Schumann's log(π_1/π_3) and log(π_2/π_3) are compared, and in Figures 6.10a through f where the median and lower and upper quartiles of π_j are plotted against j. Finally, Figure 6.11 shows the plots of log(π_1/π_3) and log(π_2/π_3) against the date of birth.
6.3.3 Classification by hidden Markov models

Chai and Vercoe (2001) study classification of folk songs using hidden Markov models. They consider, essentially, four ways of representing a melody; namely by a) a vector of pitches modulo 12; b) a vector of pitches modulo 12 together with duration (duration being represented by repeating the same pitch); c) a sequence of intervals (differenced series of pitches); and d) a sequence of intervals, with intervals being classified into only five interval classes {0}, {1, 2}, {−1, −2}, {x ≥ 3} and {x ≤ −3}. The observed data consist of 187 Irish, 200 German, and 104 Austrian homophonic melodies from folk songs. For each melody representation, the authors estimate the parameters of several hidden Markov models which differ mainly with respect to the size of the hidden state space. The models are fitted for each
country separately. Only 70% of the data are used for estimation. The remaining 30% are used for validation of a classification rule defined as follows: a melody is assigned to country j, if the corresponding likelihood (calculated using the country's hidden Markov model) is the largest. Not surprisingly, the authors conclude that the most reliable distinction can be made between Irish and non-Irish songs.
a) log(π_1/π_2) for five different periods: born before 1600; 1600-1720; 1720-1800; 1800-1880; from 1880.

Figure 6.6 Comparison of log odds ratios log(π_1/π_2) of stationary Markov chain distributions of torus distances.
a) log(π_1/π_3) for five different periods; b) log(π_1/π_3) for up to baroque vs. after baroque.

Figure 6.7 Comparison of log odds ratios log(π_1/π_3) of stationary Markov chain distributions of torus distances.
a) log(π_2/π_3) for five different periods; b) log(π_2/π_3) for up to baroque vs. after baroque.

Figure 6.8 Comparison of log odds ratios log(π_2/π_3) of stationary Markov chain distributions of torus distances.
a) log(π_1/π_3) for Bach and Schumann; b) log(π_2/π_3) for Bach and Schumann.

Figure 6.9 Comparison of log odds ratios log(π_1/π_3) and log(π_2/π_3) of stationary Markov chain distributions of torus distances.
log(π_1/π_3) and log(π_2/π_3) plotted against date of birth (panels a and b, years 1200 to 1800).

Figure 6.11 Log odds ratios log(π_1/π_3) and log(π_2/π_3) plotted against date of birth of composer.
p(θ | y_i) = f(y_i | θ) p(θ) / ∫ f(y_i | θ) p(θ) dθ

where

f(y_i | θ) = (2πσ_i²)^{−m_i/2} exp(−Σ_{t=1}^{m_i} e_i²(t)/σ_i²)   (6.16)

and e_i(t) = e_i(t; θ). How many notes and which pitches are played can then be decided, for instance, by searching for the mode of the distribution. Even if this model is assumed to be realistic, a major practical difficulty remains: the dimension of θ can be several hundred. The computation of the a posteriori distribution is therefore very difficult since calculation of ∫ f(y_i | θ) p(θ) dθ involves high-dimensional numerical integration. A further complication is that some of the parameters may be highly correlated. Walmsley et al. therefore propose to use Markov Chain Monte Carlo methods (see e.g. Gilks et al. 1996). The essential idea is to simulate the integral by a sample mean of f(y_i | θ) where θ is sampled randomly from the a priori distribution p(θ). Sampling can be done by using a Markov process whose stationary distribution is p. The simulation can be simplified further by the so-called Gibbs sampler which uses suitable one-dimensional conditional distributions (Besag 1989).
A more modest task than polyphonic pitch tracking is automatic segmentation of monophonic music. The task is as follows: given a monophonic
musical score and a sampled acoustic signal of a performance of the score,
identify for each note and rest in the score the corresponding time interval in the performance. A possible approach based on hidden Markov
processes and Bayesian models is proposed in Raphael (1999) (also see
Raphael 2001a,b). Raphael, who is a professional oboist and a mathematical statistician, also implemented his method in a computer system, called
Music Plus One, that performs the role of a musical accompanist.
CHAPTER 7
Circular statistics
7.1 Musical motivation
Many phenomena in music are circular. The best known examples are repeated rhythmic patterns, the circles of fourths and fifths, and scales modulo octave in the well-tempered system. In the circle of fourths, for example, one progresses by steps of a fourth and arrives, after 12 steps, at the initial starting point modulo octave. It is not immediately clear whether and how to calculate in such situations, and what type of statistical procedures may be used. The theory of circular statistics has been developed to analyze data on circles where angles have a meaning. Originally, this was motivated by data in biology (e.g. direction of bird flight), meteorology (e.g. direction of wind), and geology (e.g. magnetic fields). Here we give a very brief introduction, mostly to descriptive statistics. For an extended account of methods and applications of circular statistics see, for instance, Mardia (1972), Batschelet (1981), Watson (1983), Fisher (1993), and Jammalamadaka and SenGupta (2001). In music, circular methods can be applied to situations where angles measure a meaningful distance between points on the circle and arithmetic operations in the sense of circular data are well defined.
C = Σ_{i=1}^n cos θ_i,  S = Σ_{i=1}^n sin θ_i,  R = √(C² + S²).   (7.1)

The mean direction θ̄ is the angle satisfying

cos θ̄ = C/R   (7.2)

and

sin θ̄ = S/R.   (7.3)

Moreover, we have

Definition 49 The mean resultant length of θ_i (i = 1, ..., n) is equal to

R̄ = R/n.   (7.4)

Note that R is the length of the vector n x̄ obtained by adding all observed unit vectors. The circular variance is defined by

V = 1 − R̄.   (7.5)

R̄ = 0 does not necessarily imply that the data are scattered uniformly. For instance, suppose n is even, θ_{2i+1} = π and θ_{2i} = 0. Thus there are two antipodal clusters of points, yet R̄ = 0. The median direction θ̃ divides the data into two halves such that half of the x_i are closer to the point (cos θ̃, sin θ̃)^t defined by θ̃. Similarly, the lower and upper quartiles, Q_1, Q_2, can be defined by dividing each of the halves into two halves again. An alternative measure of variability is then given by IQR = Q_2 − Q_1.
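The quantities C, S, R, the mean direction and the circular variance translate directly into code; a small numpy sketch of (7.1)-(7.5) (the three identical angles are an artificial example):

```python
import numpy as np

def circular_summary(theta):
    """Mean direction, mean resultant length and circular variance of
    angles theta (in radians), following (7.1)-(7.5)."""
    C, S = np.sum(np.cos(theta)), np.sum(np.sin(theta))
    R = np.hypot(C, S)
    mean_dir = np.arctan2(S, C) % (2 * np.pi)  # angle with cos = C/R, sin = S/R
    R_bar = R / len(theta)                     # mean resultant length
    return mean_dir, R_bar, 1.0 - R_bar        # V = 1 - R_bar

# three identical angles: maximal concentration, zero circular variance
mean_dir, R_bar, V = circular_summary(np.array([np.pi / 2] * 3))
```

For the antipodal configuration mentioned above (half the angles at 0, half at π) the same function returns R̄ = 0, even though the data are far from uniform.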
Since we are dealing with vectors in the two-dimensional plane, all quantities above can be expressed in terms of complex numbers. In particular, one can define trigonometric moments by

Definition 51 For p = 1, 2, ... let

C_p = Σ_{i=1}^n cos pθ_i,  S_p = Σ_{i=1}^n sin pθ_i,  R_p = √(C_p² + S_p²)   (7.6)

C̄_p = C_p/n,  S̄_p = S_p/n,  R̄_p = R_p/n   (7.7)

and

θ̄(p) = arctan(S_p/C_p) + π·1{C_p < 0} + 2π·1{C_p > 0, S_p < 0}.   (7.8)

Then

m_p = C̄_p + iS̄_p = R̄_p e^{iθ̄(p)}   (7.9)

is called the pth trigonometric sample moment.

For p = 1, this definition yields

m_1 = C̄_1 + iS̄_1 = R̄_1 e^{iθ̄(1)}.
Definition 52 Let

C_p^o = Σ_{i=1}^n cos p(θ_i − θ̄(1)),  S_p^o = Σ_{i=1}^n sin p(θ_i − θ̄(1))   (7.10)

C̄_p^o = C_p^o/n,  S̄_p^o = S_p^o/n   (7.11)

and

θ̄^o(p) = arctan(S_p^o/C_p^o) + π·1{C_p^o < 0} + 2π·1{C_p^o > 0, S_p^o < 0}.   (7.12)

Then

m_p^o = C̄_p^o + iS̄_p^o = R̄_p^o e^{iθ̄^o(p)}   (7.13)

is called the pth centered trigonometric (sample) moment, centered relative to the mean direction θ̄(1).
A summary of descriptive measures of center and variability is given in Table 7.1.

Table 7.1 Descriptive measures of center and variability.

Sample mean (center, direction): x̄ = (C/R, S/R)^t with R = √(C² + S²)
Mean resultant length (concentration): R̄ = R/n
Mean direction (center, angle): θ̄
Median direction (center, angle): M_n minimizing g(φ) = Σ_{i=1}^n (π − |π − |θ_i − φ||)
Quartiles (center of left and right half): Q_1 = median of {θ_i : M_n − π ≤ θ_i ≤ M_n}, Q_2 = median of {θ_i : M_n ≤ θ_i ≤ M_n + π}
Modal direction (center, angle): arg max f̂(θ), where f̂ is an estimate of the density f
Principal direction (center, direction, unit vector): â = first eigenvector of Σ̂ = Σ_{i=1}^n x_i x_i^t
Concentration: λ̂_1 = first eigenvalue of Σ̂
Circular variance (variability): V_n = 1 − R̄
Circular standard deviation (variability): s_n = √(−2 log(1 − V_n))
Circular dispersion (variability): d_n = (1 − √(C̄_2² + S̄_2²))/(2R̄²)
Mean deviation (variability): D_n = n^{−1} Σ_{i=1}^n (π − |π − |θ_i − M_n||)
Interquartile range (variability): IQR = Q_2 − Q_1
For two series of angles θ_i and φ_i (i = 1, ..., n), a circular correlation can be defined by

r = Σ_{i,j=1; i≠j}^n sin(θ_i − θ_j) sin(φ_i − φ_j) / √(Σ_{i,j=1; i≠j}^n sin²(θ_i − θ_j) · Σ_{i,j=1; i≠j}^n sin²(φ_i − φ_j))   (7.15)

or

r̃ = det(n^{−1} Σ_{i=1}^n x_i y_i^t) / √(det(n^{−1} Σ_{i=1}^n x_i x_i^t) · det(n^{−1} Σ_{i=1}^n y_i y_i^t)).   (7.16)

Analogously, circular autocorrelations at lag k can be defined by

r(k) = Σ_{i,j; i≠j} sin(θ_i − θ_j) sin(θ_{i+k} − θ_{j+k}) / Σ_{i,j; i≠j} sin²(θ_i − θ_j)   (7.17)

or

r̃(k) = det(n^{−1} Σ_{i=1}^{n−k} x_i x_{i+k}^t) / det(n^{−1} Σ_{i=1}^{n−k} x_i x_i^t).   (7.18)
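A sketch of the lagged sine-based autocorrelation in the spirit of (7.17). Note one assumption: the normalization used below is the symmetric one of (7.15) applied to the series and its lag-k shift, a minor variant of the formula above:

```python
import numpy as np

def circular_autocorr(theta, k):
    """Lag-k circular autocorrelation via pairwise sine differences,
    normalized symmetrically as in the circular correlation formula."""
    n = len(theta) - k
    a, b = theta[:n], theta[k:]
    num = den_a = den_b = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                num += np.sin(a[i] - a[j]) * np.sin(b[i] - b[j])
                den_a += np.sin(a[i] - a[j]) ** 2
                den_b += np.sin(b[i] - b[j]) ** 2
    return num / np.sqrt(den_a * den_b)

# a series whose lag-3 shift is a constant rotation is perfectly correlated
theta = np.array([0.1, 1.2, 2.3, 0.4, 1.5, 2.6])
r3 = circular_autocorr(theta, 3)
```

Because only sine differences of angles enter, the statistic is invariant under a common rotation of either series, which is exactly what a circular correlation should be.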
Uniform distribution:

F(u) = P(θ ≤ u) = (u/(2π))·1{0 ≤ u < 2π},  f(u) = F'(u) = (1/(2π))·1{0 ≤ u < 2π}.

In this case, ρ_p = ρ_p^o = 0, the mean direction is not defined, and the circular standard deviation and dispersion are infinite. This expresses the fact that there is no preference for any direction and variability is therefore maximal.

Cardioid distribution:

F(u) = [u/(2π) + (ρ/π) sin(u − μ)]·1{0 ≤ u < 2π}

and

f(u) = (1/(2π))(1 + 2ρ cos(u − μ))·1{0 ≤ u < 2π}

where 0 ≤ ρ ≤ 1/2. In this case, θ̄ = μ, ρ_1 = ρ, ρ_p = 0 (p > 1) and δ = 1/(2ρ²). An interesting property is that this distribution tends to the uniform distribution as ρ → 0.
The corresponding population measures for a circular distribution F are:

pth trigonometric moment: ρ_p = ∫_0^{2π} cos(pθ) dF(θ) + i ∫_0^{2π} sin(pθ) dF(θ) = ρ_{p,C} + iρ_{p,S} = |ρ_p| e^{iθ(p)}
pth centered trigonometric moment: ρ_p^o = ∫_0^{2π} cos(p(θ − μ)) dF(θ) + i ∫_0^{2π} sin(p(θ − μ)) dF(θ) = ρ_{p,C}^o + iρ_{p,S}^o
Mean direction (center, angle): μ = θ(1)
Mean resultant length (concentration): ρ = |ρ_1|
Median direction (center, angle): M with ∫_{M−π}^{M} dF(θ) = ∫_{M}^{M+π} dF(θ) = 1/2
Quartiles: q_1 = 25%-quantile, q_2 = 75%-quantile
Modal direction (center, angle): arg max f(θ)
Principal direction (center, direction): first eigenvector of Σ = E(XX^t)
Concentration: λ_1 = first eigenvalue of Σ
Circular variance (variability): ν = 1 − ρ
Circular standard deviation (variability): σ = √(−2 log(1 − ν))
Circular dispersion (variability): δ = (1 − ρ_2)/(2ρ²)
Mean deviation (variability): ∫_0^{2π} (π − |π − |θ − M||) dF(θ)
Interquartile range (variability): IQR = q_2 − q_1
Wrapped distribution:

Let X be a random variable with distribution function F_X. The random variable θ = X (mod 2π) has a distribution F on [0, 2π) given by

F(u) = Σ_{j=−∞}^∞ [F_X(u + 2πj) − F_X(2πj)]

with density

f(u) = Σ_{j=−∞}^∞ f_X(u + 2πj).

Wrapped normal distribution:

f(u) = (1/(2π))[1 + 2 Σ_{j=1}^∞ ρ^{j²} cos j(u − μ)]·1{0 ≤ u < 2π}.

von Mises distribution:

f(u) = (1/(2π I_0(κ))) exp(κ cos(u − μ))·1{0 ≤ u < 2π}

where

I_0(κ) = Σ_{j=0}^∞ (1/(j!)²)(κ/2)^{2j}

is the modified Bessel function of the first kind and order 0. In this case, we have θ̄ = μ, ρ_1 = I_1/I_0, ν = 1 − I_1/I_0, ρ_{p,C} = I_p/I_0 and ρ_{p,S} = 0 (p ≥ 1) where

I_p = Σ_{j=0}^∞ (1/((j + p)! j!))(κ/2)^{2j+p}

is a modified Bessel function of order p. For κ → 0, the M(μ, κ)-distribution converges to U([0, 2π)), and for κ → ∞ we obtain a point mass in the direction μ.
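The von Mises density and the Bessel series for I_0 translate directly into code. A small sketch using only the standard library (the truncation at 30 series terms is an arbitrary choice, sufficient for moderate κ):

```python
import math

def bessel_i0(kappa, terms=30):
    """Modified Bessel function of the first kind and order 0, via the
    series sum_j (1/(j!)^2) (kappa/2)^(2j); 30 terms is an arbitrary cutoff."""
    return sum((kappa / 2) ** (2 * j) / math.factorial(j) ** 2
               for j in range(terms))

def von_mises_pdf(u, mu, kappa):
    """Density of the von Mises M(mu, kappa) distribution on [0, 2*pi)."""
    return math.exp(kappa * math.cos(u - mu)) / (2 * math.pi * bessel_i0(kappa))

# kappa -> 0 recovers the uniform density 1/(2*pi)
flat = von_mises_pdf(1.0, mu=0.0, kappa=0.0)
```

For κ > 0 the density is maximal at u = μ and decreases toward the antipodal direction, matching the limiting behavior described above.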
Mixture distribution:

All distributions above are unimodal. Distributions with more than one mode can be modeled, for instance, by mixture distributions

f(u) = p_1 f_{θ,1}(u) + ... + p_m f_{θ,m}(u)

where 0 ≤ p_1, ..., p_m ≤ 1, Σ p_i = 1, and the f_{θ,i} are circular probability densities.
Batschelet (1981), Watson (1983), and Fisher (1993). For recent results see e.g. Jammalamadaka and SenGupta (2001).

7.3 Specific applications in music

7.3.1 Variability and autocorrelation of notes modulo 12

Figure 7.1 Béla Bartók statue by Varga Imre in front of the Béla Bartók Memorial House in Budapest. (Courtesy of the Béla Bartók Memorial House.)
The following analysis is done for various compositions: pitch is represented in Z_12 with 0 set equal to the note (modulo 12) with the highest frequency in the composition. Given a note j in Z_12, the corresponding circular point is then x = (x_1, x_2)^t = (cos(2πj/12), sin(2πj/12))^t. The following statistics are calculated: λ̂_1, R̄, d̂ and the maximal circular autocorrelation m = max_{1≤k≤10} |r̃(k)|. The compositions considered here are:

Figure 7.4 Boxplots of λ̂_1, R̄, d̂ and log m for notes modulo 12, comparing Bach, Scarlatti, Bartók, and Prokofiev.

Figure 7.6 Boxplots of λ̂_1, R̄, d̂ and log m for note intervals modulo 12, comparing Bach, Scarlatti, Bartók, and Prokofiev.

Figure 7.8 Boxplots of λ̂_1, R̄, d̂ and log m for notes modulo 12 ordered according to the circle of fourths, comparing Bach, Scarlatti, Bartók, and Prokofiev.

Figure 7.9 Circular representation of intervals of successive notes ordered according to the circle of fourths in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokofiev (Visions fugitives No. 8).

Figure 7.10 Boxplots of λ̂_1, R̄, d̂ and log m for note intervals modulo 12 ordered according to the circle of fourths, comparing Bach, Scarlatti, Bartók, and Prokofiev.
CHAPTER 8
do not differ very much with respect to that projection, and are therefore more difficult to distinguish.

Definition via spectral decomposition of matrices

The algorithm given above has an elegant interpretation:

Theorem 18 (Spectral decomposition theorem) Let B be a symmetric p × p matrix. Then B can be written as

B = AΛA^t = Σ_{j=1}^p λ_j a^(j) [a^(j)]^t   (8.1)

where Λ = diag(λ_1, ..., λ_p) is a diagonal matrix, λ_j are the eigenvalues, and the columns a^(j) of A are the corresponding orthonormal eigenvectors of B, i.e. we have

Ba^(j) = λ_j a^(j)   (8.2)

and

A^t A = A A^t = I   (8.4)

where I denotes the identity matrix with I_jj = 1 and I_jl = 0 (j ≠ l).
This result can now be applied to the covariance matrix of a random vector X = (X_1, ..., X_p)^t:

Theorem 19 Let X be a p-dimensional random vector with expected value E(X) = μ and p × p covariance matrix Σ. Then

Σ = AΛA^t   (8.5)

with eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_p and orthonormal eigenvectors a^(j) as in Theorem 18. The covariance matrix of the rotated vector

Z = A^t (X − μ)   (8.7)

is equal to

cov(Z) = A^t Σ A = Λ.   (8.8)

Note in particular that var(Z_1) = λ_1 ≥ var(Z_2) = λ_2 ≥ ... ≥ var(Z_p) = λ_p, and the covariance matrix may be approximated by a matrix

Σ(q) = Σ_{j=1}^q λ_j a^(j) [a^(j)]^t   (8.9)

of low rank q. The jth coordinate Z_j = [a^(j)]^t (X − μ) is called the jth principal component of X. The jth column of A, i.e. the jth eigenvector a^(j), is called the vector of principal component loadings. In summary, the principal component transformation rotates the original random vector X in such a way that the new coordinates Z_1, ..., Z_p are uncorrelated (orthogonal) and they are ordered according to their importance with respect to characterizing the covariance structure of X.

The following result states that the algorithmic and the algebraic definition are indeed the same:

Theorem 21 Consider U = b^t X where b = (b_1, ..., b_p)^t and |b| = 1. Suppose that U is orthogonal (i.e. uncorrelated) to the first k principal components of X. Then var(U) is maximal, among all such projections, if and only if b = a^(k+1), i.e. if U is the (k + 1)st principal component Z_{k+1}.
8.2.2 Definition of PCA for observed data

The definition of principal components given above cannot be applied directly to data, since the expected value and covariance matrix are usually unknown. It can however be modified in an obvious way by replacing population quantities by suitable estimates. The simplest solution is to use the sample mean and the sample covariance matrix. For observed vectors x(i) = (x_1(i), ..., x_p(i))^t (i = 1, 2, ..., n) one defines

μ̂ = x̄ = (1/n) Σ_{i=1}^n x(i)   (8.10)

and

Σ̂ = (1/n) Σ_{i=1}^n (x(i) − x̄)(x(i) − x̄)^t.   (8.11)

The estimated jth vector of principal component loadings, â^(j), is the standardized eigenvector of Σ̂ corresponding to its jth largest eigenvalue λ̂_j, so that

Σ̂ = ÂΛ̂Â^t   (8.12)

where the columns of Â are equal to the orthogonal vectors â^(j). Applying the estimated rotation to the centered observations yields

z(i) = Â^t (x(i) − x̄).   (8.13)

In other words, the ith observed vector x(i) − x̄ is transformed into a rotated vector z(i) = (z_1(i), ..., z_p(i))^t with the corresponding observed principal components. In matrix form, we can define the n × p matrix of observations

X with rows x(i)^t = (x_1(i), ..., x_p(i))   (8.14)

and the n × p matrix of observed principal components

Z with rows z(i)^t = (z_1(i), ..., z_p(i))   (8.15)

so that

Z = (X − 1x̄^t)Â   (8.16)

where 1 denotes the n-dimensional vector of ones. Note that the jth column z^(j) = (z_j(1), ..., z_j(n))^t consists of the observed jth principal components. Therefore, the sample variance of the jth principal components is given by

s²_{z_j} = n^{−1} Σ_{i=1}^n z_j²(i) = λ̂_j.

If λ̂_j is large, then the observed jth principal components z_j(1), ..., z_j(n) have a large sample variance so that the observed values are scattered far apart.
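The sample PCA of (8.10)-(8.16) is only a few lines of numpy; a minimal sketch (random data stand in for real observations such as tempo measurements):

```python
import numpy as np

def pca(X):
    """Sample PCA: center the n x p data matrix, eigendecompose the
    sample covariance (divisor n) and return eigenvalues (decreasing),
    estimated loadings A-hat and observed principal components Z."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / len(X)           # sample covariance matrix
    vals, A = np.linalg.eigh(S)      # eigh: ascending eigenvalues
    order = np.argsort(vals)[::-1]   # reorder to decreasing importance
    vals, A = vals[order], A[:, order]
    Z = Xc @ A                       # observed principal components
    return vals, A, Z

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # stand-in for real observations
vals, A, Z = pca(X)
```

As stated in the text, the sample variance of the jth column of Z equals the jth eigenvalue, and the columns of Â are orthonormal.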
8.2.3 Scale invariance?

The principal component transformation is based on the covariance matrix. It is therefore not scale invariant, since variance and covariance depend on the units in which individual components X_j are measured. It is therefore often preferable to standardize each component first, dividing by its sample standard deviation s_j, where s_j² = n^{−1} Σ_{i=1}^n (x_j(i) − x̄_j)² (or s_j² = (n − 1)^{−1} Σ_{i=1}^n (x_j(i) − x̄_j)²).
8.2.4 Choosing important principal components
Since an orthogonal transformation does not change the length of vectors, the total variability of the random vector Z in (8.7) is the same as that of the original random vector X with covariance matrix Σ = (σ_ij)_{i,j=1,...,p}. More specifically, one defines total variability by

V_total = tr(Σ) = Σ_{i=1}^p σ_ii.    (8.17)

Since tr(Σ) = tr(AΛA^t) = tr(Λ), this is the same as

V_total = tr(Λ) = Σ_{i=1}^p λ_i.    (8.18)

Since the eigenvalues λ_i are ordered according to their size, we may therefore hope that the proportion of total variation

P(q) = (λ_1 + ... + λ_q) / Σ_{i=1}^p λ_i    (8.19)

is close to one for a low value of q. If this is the case, then one may reduce the dimension of the random vector considerably without losing much of the total variability. A further useful quantity is the correlation between the jth principal component Z_j and the kth original variable X_k,

ρ_{j,k} = corr(Z_j, X_k) = a_{kj} √λ_j / √σ_kk,    (8.20)

which is estimated by

ρ̂_{j,k} = â_{kj} (λ̂_j)^{1/2} / (σ̂_kk)^{1/2}.    (8.21)
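As a sketch, P(q) from (8.19) and the correlations (8.20) can be computed from any covariance matrix; the matrix below is made up for illustration:

```python
import numpy as np

# Proportion of total variation P(q) of (8.19) and the correlations
# rho_{j,k} = a_{kj} sqrt(lambda_j) / sqrt(sigma_kk) of (8.20),
# computed from an illustrative covariance matrix.
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

lam, A = np.linalg.eigh(Sigma)
order = np.argsort(lam)[::-1]
lam, A = lam[order], A[:, order]

# P(1), P(2), P(3): cumulative share of total variability
P = np.cumsum(lam) / lam.sum()

# rho[k, j] = corr(Z_j, X_k); rows are variables, columns components
rho = A * np.sqrt(lam) / np.sqrt(np.diag(Sigma))[:, None]

print(P)
```

Note that the squared correlations of each variable across all components sum to one, which is a useful numerical check.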
8.2.5 Plots
One of the main difficulties with high-dimensional data is that they cannot be represented directly in a two-dimensional display. Principal components provide a possible solution to this problem. The situation is particularly simple if the first two principal components explain most of the variability. In that case, the original data (x1(i), ..., xp(i))^t (i = 1, 2, ..., n) may be replaced by the first two principal components (z1(i), z2(i))^t (i = 1, 2, ..., n). Thus, z2(i) is plotted against z1(i). If more than two principal components are needed, then the plot of z2(i) versus z1(i) provides at least a partial view of the data structure, and further projections can be viewed by corresponding scatter plots of other components, or by symbol plots as described in Chapter 2. The scatter plots can be useful for identifying structure in the data. In particular, one may detect unusual observations (outliers) or clusters of similar observations.
8.3 Specific applications in music
8.3.1 PCA of tempo skewness
The 28 tempo curves in Figure 2.3, each consisting of measurements at p = 212 onset times, can be considered as n = 28 observations of a 212-dimensional random vector. Principal component analysis cannot be applied directly to these data. The reason is that PCA relies on estimating the p × p covariance matrix. The number of observations (n = 28) is much smaller than p. Therefore, not all elements of the covariance matrix can be estimated consistently, and an empirical PCA-decomposition would be highly unreliable. A solution to this problem is to reduce the dimension p in a meaningful way. Here, we consider the following reduction: the onset-time axis is divided into 8 disjoint blocks A1, A2, A1, A2, B1, B2, A1, A2 of 4 bars each. For each part number i (i = 1, ..., 8) and each performance j (j = 1, ..., 28), we calculate the skewness measure

β_j(i) = (x̄ − M) / (Q2 − Q1),
Figure 8.1 Tempo curves for Schumann's Träumerei: skewness for the eight parts A1, A2, A1, A2, B1, B2, A1, A2 for 28 performances, plotted against the number of the part.
where M is the median and Q1, Q2 are the lower and upper quartile respectively. Figure 8.1 shows β_j(i) plotted against i. An apparent pattern is the generally strong negative skewness in B2. (Recall that negative skewness can be created by extreme ritardandi.) Apart from that, however, Figure 8.1 is difficult to interpret directly. Principal component analysis helps to find more interesting features. Figure 8.3 shows the loadings for the first four principal components, which explain more than 80% of the variability (see Figure 8.2). The loadings can be interpreted as follows: the first component corresponds to a weighted average emphasizing the skewness values in the first half of the piece. The 28 performances apparently differ most with respect to β_j(i) during the first 16 bars of the piece (parts A1, A2, A1, A2). The second most important distinction between pianists is characterized by the second component. This component compares skewness for the A-parts with the values in B1 and B2. The third component essentially
[Figure 8.2: variances of the eight principal components (screeplot), with cumulative proportions of explained variance 0.355, 0.564, 0.709, 0.824, 0.889, 0.935, 0.971, 1 for components 1 through 8.]
compares the first with the second half. Finally, the fourth component essentially compares the odd with the even numbered parts, excluding the end A1, A2. Components two to five are displayed in Figure 8.4, with z2 and z3 on the x- and y-axis respectively and rectangles representing z4 and z5. Note in particular that Cortot and Horowitz mainly differ with respect to the third principal component. Horowitz has a more extreme difference in skewness between the first and second halves of the piece. Also striking are the "outliers" Brendel, Ortiz, and Gianoli. The overall skewness, as represented by the first component, is quite extreme for Brendel and Ortiz. For comparison, their tempo curves are plotted in Figure 8.5 together with Cortot's and Horowitz's first performances. In view of the PCA one may now indeed see that in the tempo curves by Brendel and Ortiz there is a strong contrast between small tempo variations applied most of the time and occasional strong local ritardandi.
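A minimal sketch of the skewness measure (mean minus median, divided by the interquartile distance), with made-up tempo values; the quartile interpolation scheme below is one of several possible conventions:

```python
# Quartile-based skewness measure (mean - median) / (Q2 - Q1),
# computed per block of onset times. Tempo values are made up.

def quartiles(v):
    """Lower and upper quartile via simple linear interpolation."""
    s = sorted(v)
    def q(p):
        h = p * (len(s) - 1)
        i = int(h)
        return s[i] + (h - i) * (s[min(i + 1, len(s) - 1)] - s[i])
    return q(0.25), q(0.75)

def skewness(v):
    s = sorted(v)
    n = len(s)
    median = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    q1, q2 = quartiles(v)
    return (sum(v) / n - median) / (q2 - q1)

# strong local ritardandi (occasional very slow beats) pull the mean
# below the median -> negative skewness
tempo_block = [120, 118, 121, 119, 122, 60, 117, 121, 55, 120]
print(skewness(tempo_block) < 0)   # True
```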
[Figure 8.3: loadings of the first, second, third, and fourth PCA-components of the skewness measures, plotted against the parts A1, A2, A1, A2, B1, B2, A1, A2.]
[Figure 8.4: z3 plotted against z2 for the 28 performances, labeled by pianist; rectangles represent z4 and z5.]
[Figure 8.5: tempo curves of Brendel, Gianoli, Cortot1, and Horowitz1.]
clear clustering. For clarity, only three different names (Purcell, Bach, and Schumann) are written explicitly in the plots. Schumann turns out to be completely separated from Bach. Moreover, Purcell appears to be somewhat outside the regions of Bach and Schumann, in particular in Figure 8.10. In conclusion, entropies, as defined above, do indeed seem to capture certain features of a composer's style.
Figure 8.9 Entropies: symbol plot of the first four principal components.
[Figure 8.10: Entropies: third vs. second principal component; rectangles with width = 4th component, height = 5th component.]
Figure 8.11 F. Martin (1890-1974). (Courtesy of the Société Frank Martin and Mrs. Maria Martin.)
CHAPTER 9
Discriminant analysis
9.1 Musical motivation
Discriminant analysis, often also referred to under the more general notion of pattern recognition, answers the question of which category an observed item is most likely to belong to. A typical application in music is attribution of an anonymous composition to a time period or even to a composer. Other examples are discussed below. A prerequisite for the application of discriminant analysis is that a training data set is available where the correct answers are known. We give a brief introduction to basic principles of discriminant analysis. For a detailed account see e.g. Mardia et al. (1979), Klecka (1980), Breiman (1984), Seber (1984), Fukunaga (1990), McLachlan (1992), Huberty (1994), Ripley (1995), Duda et al. (2000), and Hastie et al. (2001).
9.2 Basic principles
9.2.1 Allocation rules
Suppose that an observation x ∈ R^k is known to belong to one of p mutually exclusive categories G1, G2, ..., Gp. Associated with each category is a probability density f_i(x) of X on R^k. This means that if an individual comes from group i, then the individual's random vector X has the probability density f_i. The problem addressed by discriminant analysis is as follows: observe X = x, and try to guess which group the observation comes from. The aim is, of course, to make as few mistakes as possible. In probability terms this amounts to minimizing the probability of misclassification.
The solution is defined by a classification rule. A classification rule is a division of R^k into p disjoint regions: R^k = R1 ∪ R2 ∪ ... ∪ Rp, Ri ∩ Rj = ∅ (i ≠ j). The rule allocates an observation to group Gi, if x ∈ Ri. More generally, we may define a randomized rule by allocating an observation to group Gi with probability φ_i(x), where Σ_{i=1}^p φ_i(x) = 1 for every x. The advantage of allowing random allocation is that discriminant rules can be averaged and the set of all random rules is convex, which makes it possible to find optimal rules. Note that deterministic rules are a special case, obtained by setting φ_i(x) = 1 if x ∈ Ri and 0 otherwise.
The simplest rule is the maximum likelihood (ML) rule: allocate x to the group Gi for which the likelihood is maximal, i.e.

f_i(x) = max_{j=1,...,p} f_j(x).    (9.1)

For p = 2 groups this means: allocate x to G1 if

log (f1(x) / f2(x)) > 0.    (9.2)
In the case where all probability densities are normal with equal covariance matrices we have:
Theorem 23 Suppose that each f_i is a multivariate normal density with expected value μ_i and covariance matrix Σ_i. Suppose further that Σ1 = Σ2 = ... = Σp = Σ and det Σ > 0. Then the ML-rule is given as follows: allocate x to group Gi, if

(x − μ_i)^t Σ^{-1} (x − μ_i) = min_{j=1,...,p} (x − μ_j)^t Σ^{-1} (x − μ_j).    (9.3)
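A sketch of the ML-rule of Theorem 23 (the means and covariance matrix below are illustrative, not taken from the book):

```python
import numpy as np

# ML-rule of Theorem 23: with equal covariance matrices, allocate x
# to the group whose mean is closest in Mahalanobis distance (9.3).
mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0]), np.array([0.0, 4.0])]
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def ml_allocate(x):
    d2 = [(x - mu) @ Sigma_inv @ (x - mu) for mu in mus]
    return int(np.argmin(d2))      # index of the allocated group

print(ml_allocate(np.array([2.8, 1.1])))   # -> 1 (closest to mus[1])
```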
Thus x is allocated to the group whose mean μ_i is closest to x in the metric defined by Σ^{-1} (the Mahalanobis distance).
For a (possibly randomized) rule with allocation probabilities φ_j(x), the probability that an individual from group Gi is allocated to group Gj is

p_ij = ∫ φ_j(x) f_i(x) dx.    (9.8)

Thus, correct classification for individuals from group Gi occurs with probability p_ii and misclassification with probability 1 − p_ii. A rule r with correct-classification probabilities p_ii is said to be at least as good as a rule r′ with probabilities p′_ii, if p_ii ≥ p′_ii for all i; r is better than r′ if, in addition, at least one of these inequalities is strict.
Given prior probabilities π_1, ..., π_p, the Bayes rule allocates x to the group Gi for which π_i f_i(x) is maximal. For an arbitrary deterministic rule with regions R_i,

Σ_i π_i p_ii = Σ_i ∫_{R_i} π_i f_i(x) dx ≤ Σ_i ∫_{R_i} max_j {π_j f_j(x)} dx = ∫ max_j {π_j f_j(x)} dx,

and the Bayes rule, with probabilities p*_ii, attains this upper bound. Hence, if a rule were better than the Bayes rule, we would have

Σ_i π_i p_ii > Σ_i π_i p*_ii,

which contradicts the first inequality. The conclusion is therefore that every Bayes rule is optimal in the sense that it is admissible. If there are no a priori probabilities π_i, or more exactly if the noninformative prior π_i = 1/p is used, then this means that the ML-rule is optimal.
The second criterion is applicable if a priori probabilities are available: the probability of correct allocation is

p_correct = Σ_{i=1}^p π_i p_ii = Σ_{i=1}^p ∫_{R_i} π_i f_i(x) dx,    (9.9)

and one looks for the rule that maximizes p_correct; this again leads to the Bayes rule.
In practice the densities f_i contain unknown parameters that have to be estimated from the training data. The rule becomes particularly simple if the f_i are normal with unknown means μ_i and equal covariance matrices Σ1 = Σ2 = ... = Σ. Let x̄_i be the sample mean and Σ̂_i the sample covariance matrix for observations from group Gi. Estimating the common covariance matrix by the pooled estimate

Σ̂ = (n1 Σ̂1 + n2 Σ̂2) / (n1 + n2)    (9.12)

and setting â = Σ̂^{-1}(x̄1 − x̄2), the estimated ML-rule for p = 2 groups is: allocate x to G1, if

â^t (x − ½(x̄1 + x̄2)) > 0,    (9.13)

and the estimated Bayes rule is: allocate x to G1, if

â^t (x − ½(x̄1 + x̄2)) > log (π2/π1).    (9.14)
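The sample rule (9.13)/(9.14) can be sketched as follows; the data are simulated, and the pooled covariance estimate below uses the divisor n1 + n2 − 2, one of several common conventions:

```python
import numpy as np

# Two-group linear discriminant rule: estimate group means and a
# pooled covariance matrix, then allocate x to G1 when
# a^t (x - (xbar1 + xbar2)/2) exceeds log(pi2/pi1). Simulated data.
rng = np.random.default_rng(2)
X1 = rng.normal(loc=[0.0, 0.0], size=(60, 2))
X2 = rng.normal(loc=[3.0, 3.0], size=(40, 2))

xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
S_pooled = (((X1 - xbar1).T @ (X1 - xbar1)) +
            ((X2 - xbar2).T @ (X2 - xbar2))) / (len(X1) + len(X2) - 2)

a = np.linalg.solve(S_pooled, xbar1 - xbar2)

def allocate(x, pi1=0.5, pi2=0.5):
    score = a @ (x - 0.5 * (xbar1 + xbar2))
    return 1 if score > np.log(pi2 / pi1) else 2

print(allocate(np.array([0.2, -0.1])), allocate(np.array([3.1, 2.8])))
```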
It should be emphasized here that while a linear discriminant rule is meaningful for the normal distribution, this may not be so for other distributions. For instance, if for G1 a one-dimensional random variable X is observed with a uniform distribution on [−1, 1] and for G2 the variable X is uniformly distributed on [−3, −2] ∪ [2, 3], then the two groups can be distinguished perfectly, however not by a linear rule.
9.2.4 Case III: Population distributions completely unknown
If the population distributions f_i are completely unknown, then the search for reasonable rules is more difficult. In recent literature, some rules based on nonparametric estimation or suitable projection techniques have been proposed (see e.g. Friedman 1977, Breiman 1984, Hastie et al. 1994, Polzehl 1995, Ripley 1995, Duda et al. 2000, Hand et al. 2001).
The simplest, and historically most important, rule is based on Fisher's linear discriminant function. Fisher postulated that a linear rule may often be reasonable (see however the remark in Section 9.2.3 why this need not always be so). He proposed to find a vector a such that the linear function a^t x maximizes the ratio of the between-group variability to the within-group variability. More specifically, define
X_{n×p} = X to be the n × p matrix where each row i corresponds to an observed vector x_i = (x_{i1}, ..., x_{ip})^t. We denote the columns of X by x^{(j)} (j = 1, ..., p). The rows are assumed to be ordered according to groups, i.e. rows 1 to n1 are observations from G1, rows n1 + 1 through n1 + n2 are from G2, and so on. Moreover, define the matrix

M_{n×n} = M = I − n^{-1} 1 1^t,

where I is the identity matrix and 1 = (1, ..., 1)^t. We denote the submatrices of X and M that belong to the different groups by X^{(j)}_{n_j×p} = X^{(j)} and M^{(j)}_{n_j×n_j} = M^{(j)} respectively. For a linear combination y = Xa with components y_i = a^t x_i, the total sum of squares

SST = Σ_{i=1}^n (y_i − ȳ)² = y^t M y = a^t X^t M X a    (9.15)
can be written as

SST = SST_within + SST_between,    (9.16)

where

SST_within = Σ_{j=1}^p Σ_{i=1}^{n_j} (y_i^{(j)} − ȳ^{(j)})² = a^t W a    (9.17)

and

SST_between = Σ_{j=1}^p n_j (ȳ^{(j)} − ȳ)² = a^t B a.    (9.18)

Here,

W = Σ_{j=1}^p n_j S_j = Σ_{j=1}^p X^{(j)t} M^{(j)} X^{(j)},

B = Σ_{j=1}^p n_j (x̄^{(j)} − x̄)(x̄^{(j)} − x̄)^t,

S_j is the sample covariance matrix of group Gj, ȳ^{(j)} = n_j^{-1} Σ_i y_i^{(j)} is the mean in group Gj, and x̄^{(j)} and x̄ are the corresponding (vector) means for x. Fisher's linear discriminant function (or first canonical variate) is the linear function a^t x where a maximizes the ratio

Q(a) = SST_between / SST_within = (a^t B a) / (a^t W a).    (9.19)
The solution is given by
Theorem 24 Let a be the eigenvector of W^{-1}B that corresponds to the largest eigenvalue. Then Q(a) is maximal.
The classification rule is then: allocate x to Gi, if

|a^t x − a^t x̄^{(i)}| = min_{j=1,...,p} |a^t x − a^t x̄^{(j)}|.    (9.20)

For p = 2 groups, the matrix

B = (n1 n2 / n) (x̄^{(1)} − x̄^{(2)})(x̄^{(1)} − x̄^{(2)})^t

has rank 1 and the only non-zero eigenvalue is

tr(W^{-1}B) = (n1 n2 / n) (x̄^{(1)} − x̄^{(2)})^t W^{-1} (x̄^{(1)} − x̄^{(2)}),

with eigenvector a = W^{-1}(x̄^{(1)} − x̄^{(2)}). The discriminant rule then becomes the same as the ML-rule for normal distributions with equal covariance matrices: allocate x to G1, if

(x̄^{(1)} − x̄^{(2)})^t W^{-1} (x − ½(x̄^{(1)} + x̄^{(2)})) > 0.    (9.21)
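Fisher's criterion of Theorem 24 can be sketched numerically (the two groups are simulated):

```python
import numpy as np

# Fisher's linear discriminant: the maximizer of
# Q(a) = a^t B a / a^t W a is the leading eigenvector of W^{-1} B.
rng = np.random.default_rng(3)
groups = [rng.normal(loc=[0, 0], size=(50, 2)),
          rng.normal(loc=[2, 1], size=(50, 2))]

xbar = np.vstack(groups).mean(axis=0)
W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
B = sum(len(g) * np.outer(g.mean(axis=0) - xbar, g.mean(axis=0) - xbar)
        for g in groups)

eigval, eigvec = np.linalg.eig(np.linalg.solve(W, B))
a = np.real(eigvec[:, np.argmax(np.real(eigval))])

def Q(v):
    return (v @ B @ v) / (v @ W @ v)

# the eigenvector beats an arbitrary direction
print(Q(a) >= Q(np.array([1.0, 0.0])))   # True
```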
Figure 9.1 Discriminant analysis combined with time series analysis can be used
to judge purity of intonation (Elvira by J.B.).
data. In principle this is easy, since the corresponding estimates can simply be plugged into the formula for p_ii. The observed data that are used for estimation are also called the training sample. A problem with these estimates is, however, that the search for the optimal discriminant rule was done with the same data. Therefore, the estimated p_ii will tend to be too optimistic (i.e. too large), unless n is very large. The same is true for any method that estimates classification probabilities from the training data. A possibility to avoid this is to partition the data set randomly into a training sample that is used for estimation of the discriminant rule, and a disjoint validation sample that is used for estimation of the classification probabilities. Obviously, this can only be done for large enough data sets. For recently developed computational methods of validation, such as the bootstrap, see e.g. Efron (1979), Läuter (1985), Fukunaga (1990), Hirst (1996), LeBlanc and Tibshirani (1996), Davison and Hinkley (1997), Chernick (1999), Good (2001).
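A sketch of the training/validation idea (simulated data; for brevity, a simple nearest-mean rule stands in for the full linear discriminant rule):

```python
import numpy as np

# Estimate classification probabilities honestly: fit the rule on a
# training sample, count correct allocations on a disjoint
# validation sample. Data and split are illustrative.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=[0, 0], size=(100, 2)),
               rng.normal(loc=[2.5, 2.5], size=(100, 2))])
y = np.array([1] * 100 + [2] * 100)

perm = rng.permutation(len(y))
train, valid = perm[:120], perm[120:]

m1 = X[train][y[train] == 1].mean(axis=0)
m2 = X[train][y[train] == 2].mean(axis=0)

def allocate(x):
    # nearest-mean rule; a simplification of the linear rule (9.13)
    return 1 if np.sum((x - m1) ** 2) < np.sum((x - m2) ** 2) else 2

accuracy = np.mean([allocate(X[i]) == y[i] for i in valid])
print(accuracy)   # an honest estimate, computed on disjoint data
```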
9.3 Specific applications in music
9.3.1 Identification of pitch, tone separation, and purity of intonation
Weihs et al. (2001) investigate objective criteria for judging purity of intonation of singing. The acoustic data are as described in Chapter 4. In order to address the question of how to computationally assess purity of intonation, a vocal expert classified 132 selected tones of 17 performances (Figure 9.1) of Händel's Tochter Zion into the classes "flat", "correct", and "sharp". The opinion of the expert is assumed to be the truth. An objective measure of purity is defined by the difference log12(fobserved) − log12(fo) between the observed and the target frequency on a logarithmic (semitone) scale.
x2 = E = −Σ_i log(p_i + 0.001) p_i,

which is a slightly modified measure of entropy. We now describe each composition by a bivariate observation

x = (p5, E)^t.
The question is now whether this very simple 2-dimensional descriptive statistic can tell us anything about the time when the music was composed. In view of the somewhat naive simplicity of x, the answer is not at all obvious.
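The modified entropy is straightforward to compute; the proportion vectors below are made up for illustration:

```python
import math

# Modified entropy E = -sum_i log(p_i + 0.001) * p_i, computed from a
# vector of proportions (e.g. relative frequencies of note or chord
# categories). The proportions below are illustrative only.

def modified_entropy(p):
    return -sum(math.log(pi + 0.001) * pi for pi in p)

uniform = [1.0 / 12] * 12          # maximally "spread out"
peaked = [0.9] + [0.1 / 11] * 11   # concentrated on one category

print(modified_entropy(uniform) > modified_entropy(peaked))   # True
```

The +0.001 inside the logarithm keeps the terms finite when some proportions are exactly zero.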
To simplify the problem, composers are divided into two groups: Group 1 = composers who died before 1800, and Group 2 = composers who died after 1800 (or are still alive). Essentially, the two groups correspond to the partition into "early music to baroque" and "classical till today". The compositions considered here are those given in the star plot example (Section 2.7.2). In order to be able to check objectively how the procedure works, only a subset of n = 94 compositions is used for estimation. Applying a linear discriminant rule partitions the plane into two half planes by
Figure 9.2 Linear discriminant analysis of compositions before and after 1800, with the training sample. The data used for the discriminant rule consist of x = (p5, E).
a straight line. Figure 9.2 shows the estimated partitioning line together with the training sample (o = before 1800, x = after 1800). Apparently, the two groups can indeed be separated quite well by the estimated straight line. This is quite surprising, given the simplicity of the two variables. As expected, however, the partition is not perfect, and it does not seem to be possible to improve it by more complicated partitioning lines. To assess how well the rule may indeed classify, we consider 50 other compositions that were not used for estimating the discriminant rule. Figure 9.3 shows that the rule works well, since almost all observations in the validation sample are classified correctly. An unusual composition is Bartók's Bagatelle No. 3, which lies far on the left in the "wrong" group.
The partitioning can be improved if the time periods of the two groups are chosen farther apart. This is done in figures 9.3a and b with Group 1 = "Early Music to Baroque" and 2 = "Romantic to 20th century". (A beautiful example of early music is displayed in Figure 9.6; also see Figures 9.7 and 9.8 for portraits of Brahms and Wagner.) Figure 9.4 shows the corresponding plot of the partition together with the data (n = 72). Compositions not used in the estimation are shown in Figure 9.5. Again, the rule works well, except for Bartók's third Bagatelle.
Figure 9.3 Linear discriminant analysis of compositions before and after 1800, with the validation sample. The data used for the discriminant rule consist of x = (p5, E).
Figure 9.4 Linear discriminant analysis of "Early Music to Baroque" and "Romantic to 20th Century". The points (o and ×) belong to the training sample. The data used for the discriminant rule consist of x = (p5, E).
Figure 9.5 Linear discriminant analysis of "Early Music to Baroque" and "Romantic to 20th century". The points (o and ×) belong to the validation sample. The data used for the discriminant rule consist of x = (p5, E).
Figure 9.6 Graduale written for an Augustinian monastery of the diocese Konstanz, 13th century. (Courtesy of Zentralbibliothek Zürich.) (Color figures follow page 152.)
Figure 9.7 Johannes Brahms (1833-1897). (Photograph by Maria Fellinger, courtesy of Zentralbibliothek Zürich.)
CHAPTER 10
Cluster analysis
10.1 Musical motivation
In discriminant analysis, an optimal allocation rule between different groups is estimated from a training sample. The type and number of groups are known. In some situations, however, it is neither known whether the data can be divided into homogeneous subgroups nor how many subgroups there may be. How to find such clusters in previously ungrouped data is the purpose of cluster analysis. In music, one may for instance be interested in how far compositions or performances can be grouped into clusters representing different "styles". In this chapter, a brief introduction to basic principles of statistical cluster analysis is given. For an extended account of cluster analysis see e.g. Jardine and Sibson (1971), Anderberg (1973), Hartigan (1978), Mardia et al. (1979), Seber (1984), Blashfield et al. (1985), Hand (1986), Fukunaga (1990), Arabie et al. (1996), Gordon (1999), Höppner et al. (1999), Everitt et al. (2001), Jajuga et al. (2002), Webb (2002).
10.2 Basic principles
10.2.1 Maximum likelihood classification
Suppose that observations x1, ..., xn ∈ R^k are realizations of n independent random variables Xi (i = 1, ..., n). Assume further that each random variable comes from one of p possible groups such that if Xi comes from group j, then it is distributed according to a probability density f(x; θ_j). In contrast to discriminant analysis, it is not observed which groups the xi (i = 1, ..., n) belong to. Each observation xi is thus associated with an unobserved parameter (or label) τ_i specifying group membership. We may simply define τ_i = j if xi belongs to group j. Denote by τ = (τ_1, ..., τ_n)^t the vector of labels and, for each j = 1, ..., p, let A_j = {x_i : 1 ≤ i ≤ n, τ_i = j} be the unknown set of observations that belong to group j. Then the likelihood function of the observed data is

L(x1, ..., xn; θ_1, ..., θ_p, τ_1, ..., τ_n) = Π_{j=1}^p Π_{x_i ∈ A_j} f(x_i; θ_j).    (10.1)
Maximizing L with respect to θ_1, ..., θ_p and τ_1, ..., τ_n, we obtain ML-estimates θ̂_1, ..., θ̂_p, τ̂_1, ..., τ̂_n and estimated sets Â_j. At the maximum, moving a single observation x_{i0} from its estimated group j to another group l cannot increase the likelihood, i.e.

[f(x_{i0}; θ̂_l) / f(x_{i0}; θ̂_j)] · L(x1, ..., xn; θ̂_1, ..., θ̂_p, τ̂_1, ..., τ̂_n) ≤ L(x1, ..., xn; θ̂_1, ..., θ̂_p, τ̂_1, ..., τ̂_n)    (10.2)

or, dividing by L (assuming that it is not zero),

f(x_{i0}; θ̂_l) ≤ f(x_{i0}; θ̂_j).    (10.3)

In the important special case of multivariate normal densities with unknown mean vectors and covariance matrices, the ML-estimates for a given assignment τ are

x̄_j(τ) = n_j(τ)^{-1} Σ_{i ∈ A_j(τ)} x_i

and

Σ̂_j(τ) = n_j(τ)^{-1} Σ_{i ∈ A_j(τ)} (x_i − x̄_j(τ))(x_i − x̄_j(τ))^t,    (10.4)

where n_j(τ) is the number of observations assigned to group j. The ML-assignment τ̂ then minimizes

h(τ) = Π_{j=1}^p |Σ̂_j(τ)|^{n_j(τ)}.    (10.5)

Computationally this means that the function h(τ) is evaluated for all groupings of the observations x1, ..., xn, and the estimate τ̂ is the grouping that minimizes h. Since relabeling the groups does not change the likelihood, already for p = 2 the number of possible assignments for which h may differ is equal to 2^{n−1}, so that an exhaustive search quickly becomes infeasible as n grows. In addition, if the number of groups is not known a priori, then a suitable, and usually computationally costly, method for estimating p must be applied. From a principled point of view it should also be noted that if normal distributions, or any other distributions with overlapping domains, are assumed, then there are no "perfect" clusters. Even if the distributions were known, an observation x can be from any group with f_i(x) > 0, with positive probability, so that one can never be absolutely sure where it belongs.
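For very small data sets, the ML classification can be carried out by exhaustive search. The sketch below uses the simplifying assumption of a common spherical covariance, under which maximizing the likelihood over the labels amounts to minimizing the within-group sum of squares (the data are made up):

```python
from itertools import product

# Exhaustive ML classification for a tiny one-dimensional example.
# Under normal groups with a common spherical covariance, maximizing
# the likelihood over the labels tau is equivalent to minimizing the
# within-group sum of squares.
x = [1.0, 1.2, 0.8, 5.1, 4.9, 5.3]
p = 2   # number of groups

def within_ss(labels):
    total = 0.0
    for j in range(p):
        group = [xi for xi, l in zip(x, labels) if l == j]
        if group:
            m = sum(group) / len(group)
            total += sum((xi - m) ** 2 for xi in group)
    return total

best = min(product(range(p), repeat=len(x)), key=within_ss)
print(best)   # separates {1.0, 1.2, 0.8} from {5.1, 4.9, 5.3}
```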
A variation of ML-clustering is obtained if the groups themselves are associated with probabilities. Let π_j be the probability that a randomly sampled observation comes from group j. In analogy to the arguments above, maximization of the likelihood with respect to all parameters including π_j (j = 1, ..., p) leads to a Bayesian allocation rule with π_j as prior distribution.
10.2.2 Hierarchical clustering
ML-clustering yields a partition of observations into p groups. Sometimes it is desirable to obtain a sequence of clusters, e.g. starting with two main groups and then subdividing these into increasingly homogeneous clusters. This is particularly suitable for data where a hierarchy is expected - such as, for instance, in music. Generally speaking, a hierarchical method has the following property: a partitioning into p + 1 clusters consists of
- two clusters whose union is equal to one of the clusters from the partitioning into p groups, and
- p − 1 clusters that are identical with p − 1 clusters of the partitioning into p groups.
In a first step, data are transformed into a matrix D = (d_ij)_{i,j=1,...,n} of distances or a matrix S = (s_ij)_{i,j=1,...,n} of similarities. The definition of distance and similarity used in cluster analysis is more general than the usual definition of a metric:
Definition 54 Let X be an arbitrary set and d : X × X → R a real-valued function such that for all x, y ∈ X: (i) d(x, y) ≥ 0, (ii) d(x, x) = 0, and (iii) d(x, y) = d(y, x). Then d is called a distance.
In agglomerative hierarchical clustering one starts with n clusters, each consisting of a single observation, and successively merges clusters. Denote the clusters at step i by A_1^{(i)}, ..., A_{n_i}^{(i)}. In the complete linkage algorithm, the distance between two clusters is defined as the maximal distance between their elements,

d_jl^{(i)} = d(A_j^{(i)}, A_l^{(i)}) = max_{x ∈ A_j^{(i)}, y ∈ A_l^{(i)}} d(x, y).    (10.7)
Table 10.1 Some measures of distance and similarity between x = (x1, ..., xk)^t, y = (y1, ..., yk)^t ∈ R^k. For some of the distances, it is assumed that a data set of observations in R^k is available to calculate sample variances s_j² (j = 1, ..., k) and a k × k sample covariance matrix S.

  Euclidian distance:       d(x, y) = (Σ_{i=1}^k (xi − yi)²)^{1/2}
                            - usual distance in R^k
  Pearson distance:         d(x, y) = (Σ_{i=1}^k (xi − yi)²/s_i²)^{1/2}
                            - standardized Euclidian
  Mahalanobis distance:     d(x, y) = ((x − y)^t S^{-1} (x − y))^{1/2}
                            - standardized Euclidian
  Manhattan metric:         d(x, y) = Σ_{i=1}^k w_i |xi − yi|  (w_i ≥ 0)
                            - less sensitive to outliers
  Minkowski metric:         d(x, y) = (Σ_{i=1}^k w_i |xi − yi|^λ)^{1/λ}  (λ ≥ 1)
                            - for λ = 1: Manhattan
  Bhattacharyya distance:   d(x, y) = (Σ_{i=1}^k (√xi − √yi)²)^{1/2}
                            - for xi, yi ≥ 0 (example: proportions)
  Binary similarity:        s(x, y) = k^{-1} Σ_{i=1}^k a_i, a_i = xi yi + (1 − xi)(1 − yi)
                            - suitable for xi = 0, 1
  Gower's similarity
  coefficient:              s(x, y) = 1 − k^{-1} Σ_{i=1}^k w_i |xi − yi|,
                            with w_i = 1 if xi is qualitative and w_i = 1/R_i if
                            quantitative (R_i = range of the ith coordinate)
                            - suitable if some xi are qualitative, some quantitative

The agglomerative algorithm merges clusters step by step and stops when the remaining clusters are sufficiently far apart:
5. If the smallest cluster distance d_jl^{(i)} exceeds a given threshold d_o, then stop; otherwise merge the corresponding clusters.    (10.8)
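A few of the measures in Table 10.1, written out as plain functions (illustrative only):

```python
import math

# Selected measures from Table 10.1 for vectors given as Python lists.

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y, w=None):
    w = w or [1.0] * len(x)
    return sum(wi * abs(a - b) for wi, a, b in zip(w, x, y))

def bhattacharyya(x, y):
    # for nonnegative entries, e.g. proportions
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(x, y)))

def binary_similarity(x, y):
    # proportion of matching 0/1 entries
    k = len(x)
    return sum(a * b + (1 - a) * (1 - b) for a, b in zip(x, y)) / k

print(binary_similarity([1, 0, 1, 1], [1, 1, 1, 0]))   # -> 0.5
```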
For other algorithms and further properties see the references given at the
beginning of this chapter, and references therein.
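A minimal sketch of agglomerative clustering with the complete linkage distance (10.7) and a stopping threshold d_o (as used below with d_o = 5); the brute-force pair search is a choice of this sketch, not of the book:

```python
# Agglomerative complete linkage clustering: repeatedly merge the two
# clusters whose maximal pairwise point distance is smallest, and stop
# when the smallest between-cluster distance exceeds d_o.

def complete_linkage(points, dist, d_o):
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist(x, y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > d_o:
            break                       # all clusters far apart: stop
        d, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

pts = [0.0, 0.3, 0.1, 5.0, 5.2]
result = complete_linkage(pts, lambda x, y: abs(x - y), d_o=1.0)
print(sorted(sorted(c) for c in result))   # -> [[0.0, 0.1, 0.3], [5.0, 5.2]]
```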
10.2.3 HISMOOTH and HIWAVE clustering
HISMOOTH and HIWAVE models, as defined in Chapter 5, can be used to extract dominating features of a time series y(t) that are related to an explanatory series x(t). Suppose that we have several y-series, y_j(t) (j = 1, ..., N), that share the same explanatory series x(t). An interesting question is then in how far features related to x(t) are similar, and which series have more in common than others. One way to answer the question consists of the following clustering algorithm:
1. For each series y_j(t), fit a HISMOOTH or HIWAVE model, thus obtaining a decomposition

y_j(t) = ĝ_j(t, x_t) + e_j(t)
[Figure 10.1: complete linkage dendrogram of the compositions, with composer names (Dowland, Arcadelt, Palestrina, Hassler, Byrd, Schein, Ockeghem, Halle, Bach, Haydn, Mozart, Debussy, Schoenberg, Webern, Messiaen, Bartok, Takemitsu, and anonymous pieces) as leaves.]
[Figure 10.2: the corresponding single linkage dendrogram of the same compositions.]
1-3; Piano Sonata (2nd Mv.); 17) O. Messiaen (1908-1992): Vingts regards
de Jesu No. 3; 18) T. Takemitsu (1930-1996): Rain tree sketch No. 1.
Figure 10.1 shows the result of complete linkage clustering of the corresponding 11-dimensional feature vectors, based on the Euclidian distance and d_o = 5. The most striking feature is the clear separation of early music from the rest. Moreover, the 20th century composers considered here are in a separate cluster, except for Bartók's Bagatelle No. 3 (and Debussy, who may be considered as belonging to the 19th and 20th centuries). In contrast, clusters provided by a single linkage algorithm are less easy to interpret. Figure 10.2 illustrates a typical result of this method, namely long narrow clusters where the maximal distance within a cluster can be quite large. In our example this does
Figure 10.4 Klavierstück op. 19, No. 2 by Arnold Schönberg. (Facsimile; used by permission of Belmont Music Publishers.)
[Figures 10.5 and 10.6: dendrograms of Bach preludes and fugues from WK I (F.1, Pr.1, F.8, Pr.8) and of the 28 Träumerei performances, with pianists' names as leaves (Cortot 1-3, Moiseiwitsch, Ortiz, Ney, Novaes, Davies, Schnabel, Shelley, Curzon, Krust, Askenaze, Arrau, Brendel, Eschenbach, Argerich, Demus, Klien, Horowitz 1-3, Bunin, Kubalek, Capova, Zak, Gianoli, Katsaris).]
who are represented more than once in the sample, so that the consistency of their performances can be checked empirically. Figure 10.6 also shows that Cortot is somewhat of an "outlier", since his cluster separates from all other pianists at the top level.
10.3.4 Tempo curves and melodic structure
Cluster analysis alone does not provide any further explanation about the meaning of observed clusters. In particular, we do not know which musically meaningful characteristics determine the clustering of tempo curves. In contrast, cluster analysis based on HISMOOTH or HIWAVE models provides a way to gain more insight. The fitted HISMOOTH curves in Figures 5.9a through d extract essential features that make comparisons easier. The estimated bandwidths can be interpreted as a measure of how much emphasis a pianist puts on global and local features respectively. Figure 10.7 shows clusters based on the fitted HISMOOTH curves. In contrast to the original data, complete and single linkage turn out to yield almost the same clusters. Thus, applying the HISMOOTH fit first leads to a stabilization of results. From Figure 10.7, we may identify about six main clusters, namely:
A: KRUST, KATSARIS, SCHNABEL;
[Figure 10.7: dendrogram of the 28 performances based on the fitted HISMOOTH curves, with clusters labeled A through F.]
Figure 10.8 Symbol plot of HISMOOTH bandwidths for tempo curves. The radius of each circle is proportional to a constant plus log b3; the horizontal and vertical axes are equal to b1 and b2 respectively. The letters A-F indicate where at least one observation from the corresponding cluster occurs.
CHAPTER 11
Multidimensional scaling
11.1 Musical motivation
In some situations data consist of distances only. These distances are not necessarily euclidian, so that they do not necessarily correspond to a configuration of points in a euclidian space. The question addressed by multidimensional scaling (MDS) is in how far one may nevertheless find points in a hopefully low-dimensional euclidian space that have exactly or approximately the observed distances. The procedure is mainly an exploratory tool that helps to find structure in distance data. We give a brief introduction to the basic principles of MDS. For a detailed discussion and an extended bibliography see, for instance, Kruskal and Wish (1978), Cox and Cox (1994), Everitt and Rabe-Hesketh (1997), Borg and Groenen (1997), Schiffman (1997); also see textbooks on multivariate statistics, such as the ones given in the previous chapters. For the origins of MDS and early references see Young and Householder (1941), Guttman (1954), Shepard (1962a,b), Kruskal (1964a,b), Ramsay (1977).
11.2 Basic principles
11.2.1 Basic definitions
In MDS, any symmetric n × n matrix D = (d_ij)_{i,j=1,...,n} with d_ij ≥ 0 and d_ii = 0 is called a distance matrix. Note that this corresponds to the axioms D1, D2, and D3 in the previous chapter. If instead of distances, a similarity matrix S = (s_ij)_{i,j=1,...,n} is given, then one can define a corresponding distance matrix by a suitable transformation. One possible transformation is, for instance,

d_ij = (s_ii − 2s_ij + s_jj)^{1/2}.    (11.1)

The question addressed by metric MDS can be formulated as follows: given an n × n distance matrix D, can one find a dimension k and n points x1, ..., xn ∈ R^k such that their euclidian distance matrix D̂, with elements d̂_ij = ((x_i − x_j)^t (x_i − x_j))^{1/2}, is exactly equal to the original distance matrix D? If this is possible, then D is called euclidian. The condition under which this is possible is as follows:
Theorem 25 D = D_{n×n} = (d_ij)_{i,j=1,...,n} is euclidian if and only if the matrix

B = B_{n×n} = M A M

is positive semidefinite, where M = (I − n^{-1} 1 1^t), I = I_{n×n} is the identity matrix, 1 = (1, ..., 1)^t, and A = A_{n×n} has elements

a_ij = −(1/2) d_ij²  (i, j = 1, ..., n).

The reason for positive semidefiniteness of B is that if D is indeed a euclidian matrix corresponding to points x1, ..., xn ∈ R^k, then the elements of B turn out to be the inner products of the centered points, b_ij = (x_i − x̄)^t (x_j − x̄), so that B is positive semidefinite.
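Theorem 25 suggests the classical MDS algorithm: double-center A = (−d_ij²/2) and read coordinates off the eigendecomposition of B. A sketch with a small distance matrix (three collinear points, chosen for illustration):

```python
import numpy as np

# Classical (metric) MDS: build A = (-d_ij^2 / 2), double-center it
# with M = I - n^{-1} 1 1^t, and recover coordinates from the
# eigendecomposition of B = M A M.
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])

n = D.shape[0]
M = np.eye(n) - np.ones((n, n)) / n
B = M @ (-0.5 * D ** 2) @ M

lam, V = np.linalg.eigh(B)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

k = int(np.sum(lam > 1e-10))          # effective dimension
X = V[:, :k] * np.sqrt(lam[:k])       # recovered configuration

# the recovered points reproduce the original distances
D_hat = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
print(np.allclose(D_hat, D))          # True
```

Since the three points lie on a line, a single positive eigenvalue remains and the configuration is one-dimensional.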
[MDS configuration plots (x2 versus x1) of compositions grouped as "before 1720", "1720-1880", and "1880 or later"; Schoenberg appears as an outlier.]
Figure 11.3 Fragment of a graduale from the 14th century. (Courtesy of Zentralbibliothek Zürich.)
Figure 11.5 Freddy (by J.B.) and Johannes Brahms (1833-1897) going for a drink. (Caricature from a contemporary newspaper; courtesy of Zentralbibliothek Zürich.)
List of figures
Figure 1.1: Quantitative analysis of music helps to understand creative processes. (Pierre Boulez, photograph courtesy of Philippe Gontier, Paris; and Jim by J.B.)
Figure 1.2: J.S. Bach (1685-1750). (Engraving by L. Sichling after a painting by Elias Gottlob Haussmann, 1746; courtesy of Zentralbibliothek Zürich.)
Figure 1.3: Ludwig van Beethoven (1770-1827). (Drawing by E. Dürck after a painting by J.K. Stieler, 1819; courtesy of Zentralbibliothek Zürich.)
Figure 3.9: Metric, melodic, and harmonic global indicators for Schumann's op. 15, No. 2 (upper figure), together with smoothed versions (lower figure).
Figure 3.10: Metric, melodic, and harmonic global indicators for Schumann's op. 15, No. 7 (upper figure), together with smoothed versions (lower figure).
Figure 3.11: Metric, melodic, and harmonic global indicators for Webern's Variations op. 27, No. 2 (upper figure), together with smoothed versions (lower figure).
Figure 3.12: R. Schumann Träumerei: motifs used for specific melodic indicators.
Figure 3.13: R. Schumann Träumerei: indicators of individual motifs.
Figure 3.14: R. Schumann Träumerei: contributions of individual motifs to overall melodic indicator.
Figure 3.15: R. Schumann Träumerei: overall melodic indicator.
Figure 4.1: Sound wave of c and f played on a piano.
Figure 4.2: Zoomed piano sound wave (shaded area in Figure 4.1).
Figure 4.3: Periodogram of piano sound wave in Figure 4.2.
Figure 4.4: Sound wave of e played on a harpsichord.
with state space Z12 \{0}, estimated for the transition between successive
intervals.
Figure 6.4: Cluster analysis based on stationary Markov chain distributions for compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rachmaninoff.
Figure 6.5: Cluster analysis based on stationary Markov chain distributions of torus distances for compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rachmaninoff.
Figure 6.6: Comparison of log odds ratios log(π1/π2) of stationary Markov chain distributions of torus distances.
Figure 6.7: Comparison of log odds ratios log(π1/π3) of stationary Markov chain distributions of torus distances.
Figure 6.8: Comparison of log odds ratios log(π2/π3) of stationary Markov chain distributions of torus distances.
Figure 6.9: Comparison of log odds ratios log(π1/π3) and log(π2/π3) of stationary Markov chain distributions of torus distances.
Figure 6.10: Comparison of stationary Markov chain distributions of torus distances.
Figure 6.11: Log odds ratios log(π1/π3) and log(π2/π3) plotted against date of birth of composer.
Figure 6.12: Johannes Brahms (1833-1897). (Courtesy of Zentralbibliothek Zürich.)
Figure 7.1: Béla Bartók statue by Varga Imre in front of the Béla Bartók Memorial House in Budapest. (Courtesy of the Béla Bartók Memorial House.)
Figure 7.2: Sergei Prokofieff as a child. (Courtesy of Karadar Bertoldi Ensemble; www.karadar.net/Ensemble/.)
Figure 7.3: Circular representation of compositions by J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokofieff (Visions fugitives No. 8).
Figure 7.4: Boxplots of 1, R, d, and log m for notes modulo 12, comparing Bach, Scarlatti, Bartók, and Prokofieff.
Figure 7.5: Circular representation of intervals of successive notes in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokofieff (Visions fugitives No. 8).
Figure 7.6: Boxplots of 1, R, d, and log m for note intervals modulo 12, comparing Bach, Scarlatti, Bartók, and Prokofieff.
Figure 7.7: Circular representation of notes ordered according to circle of fourths in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokofieff (Visions fugitives No. 8).
Figure 7.8: Boxplots of 1, R, d, and log m for notes modulo 12 ordered according to circle of fourths, comparing Bach, Scarlatti, Bartók, and Prokofieff.
Figure 7.9: Circular representation of intervals of successive notes ordered according to circle of fourths in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokofieff (Visions fugitives No. 8).
Figure 7.10: Boxplots of 1, R, d, and log m for note intervals modulo 12 ordered according to circle of fourths, comparing Bach, Scarlatti, Bartók, and Prokofieff.
Figure 8.1: Tempo curves for Schumann's Träumerei: skewness for the eight parts A1, A2, A1, A2, B1, B2, A1, A2 for 28 performances, plotted against the number of the part.
Figure 8.2: Schumann's Träumerei: screeplot for skewness.
Figure 8.3: Schumann's Träumerei: loadings for PCA of skewness.
Figure 8.4: Schumann's Träumerei: symbol plot of principal components z2, ..., z5 for PCA of tempo skewness.
Figure 8.5: Schumann's Träumerei: tempo curves by Cortot, Horowitz, Brendel, and Gianoli.
Figure 8.6: Air by Henry Purcell (1659-1695).
Figure 8.7: Screeplot for PCA of entropies.
Figure 8.8: Loadings for PCA of entropies.
Figure 8.9: Entropies: symbol plot of the first four principal components.
Figure 8.10: Entropies: symbol plot of principal components no. 2-5.
Figure 8.11: F. Martin (1890-1971). (Courtesy of the Société Frank Martin and Mrs. Maria Martin.)
Figure 8.12: F. Martin (1890-1971) - manuscript from 8 Préludes. (Courtesy of the Société Frank Martin and Mrs. Maria Martin.)
Figure 9.1: Discriminant analysis combined with time series analysis can be used to judge purity of intonation (Elvira by J.B.).
Figure 9.2: Linear discriminant analysis of compositions before and after 1800, with the training sample. The data used for the discriminant rule consists of x = (p5, E).
References
Akaike, H. (1973a). Information theory and an extension of the maximum likelihood principle. In: Second International Symposium on Information Theory,
B.N. Petrow and F. Csaki (eds.), Akademiai Kiado, Budapest, 267-281.
Akaike, H. (1973b). Maximum likelihood identification of Gaussian autoregressive
moving average models. Biometrika, Vol. 60, 255-265.
Akaike, H. (1979). A Bayesian extension of the minimum AIC procedure of autoregressive model fitting. Biometrika, Vol. 66, 237-242.
Albert, A.A. (1956). Fundamental Concepts of Higher Algebra. University of
Chicago Press, Chicago.
Anderberg, M.R. (1973). Cluster Analysis for Applications. Academic Press, New
York and London.
Anderson, T.W. (1984). An Introduction to Multivariate Statistical Analysis (2nd
ed.). Wiley, New York.
Andreatta, M. (1997). Group-theoretical methods applied to music. PhD thesis,
University of Sussex.
Andreatta, M., Noll, T., Agon, C. and Assayag, G. (2001). The geometrical
groove: rhythmic canons between theory, implementation and musical experiment. In: Les Actes des 8èmes Journées d'Informatique Musicale, Bourges, 7-9
juin 2001, pp. 93-97.
Antoniadis, A. and Oppenheim, G. (1995). Wavelets and Statistics. Lecture Notes
in Statistics, No. 103, Springer, New York.
Arabie, P., Hubert, L.J. and De Soete, G. (1996). Clustering and Classification.
World Scientific Pub., London.
Archibald, B. (1972). Some thoughts on symmetry in early Webern. Persp. New
Music, 10, 159-163.
Ash, R.B. (1965). Information Theory. Wiley, New York.
Ashby, W.R. (1956). An Introduction to Cybernetics. Wiley, New York.
Babbitt, M. (1960). Twelve-tone invariants as compositional determinants. Musical
Quarterly, 46, 245-259.
Babbitt, M. (1961). Set structure as a compositional determinant. JMT, 5, No. 2,
72-94.
Babbitt, M. (1987). Words about Music. Dembski, A. and Straus, J.N. (eds.), University of Wisconsin Press, Madison.
Backus, J. (1969). The Acoustical Foundations of Music. W.W. Norton & Co.,
New York (reprinted 1977).
Bailhache, P. (2001). Une Histoire de l'Acoustique Musicale. CNRS Editions.
Balzano, G.J. (1980). The group-theoretic description of 12-fold and microtonal
pitch systems. Computer Music Journal, Vol. 4, No. 4, 66-84.
Barnard, G.A. (1951). The theory of information. J. Royal Statist. Soc., Series
B, Vol. 13, 46-69.
Bartlett, M.S. (1955). An Introduction to Stochastic Processes. Cambridge University Press, Cambridge.
Batschelet, E. (1981). Circular Statistics. Academic Press, London.
Beament, J. (1997). The Violin Explained: Components, Mechanism, and Sound.
Oxford University Press, Oxford.
Benade, A.H. (1976). Fundamentals of Musical Acoustics. Oxford University
Press, Oxford. (Reprinted by Dover in 1990).
Benson, D. (1995-2002). Mathematics and Music. Internet Lecture Notes,
Department of Mathematics, University of Georgia, USA (available at
http://www.math.uga.edu/~djb/html/math-music.html).
Beran, J. (1987). Aniseikonia. H.O.E. (Bison Records).
Beran, J. (1991). Cirri. Centaur Records, CRC 2100.
Beran, J. (1994). Statistics for Long-Memory Processes. Chapman & Hall, New
York.
Beran, J. (1995). Maximum likelihood estimation of the differencing parameter
for invertible short- and long-memory ARIMA models. J. R. Statist. Soc.,
Series B, Vol. 57, No.4, 659-672.
Beran, J. (1998). Modeling and objective distinction of trends, stationarity and
long-range dependence. Proceedings of the VIIth International Congress of
Ecology - INTECOL 98, Farina, A., Kennedy, J. and Boss, V. (Eds.), p. 41.
Beran, J. (2000). Sānti. col legno, WWE 1CD 20062 (http://www.col-legno.de).
Beran, J. and Feng, Y. (2002a). SEMIFAR models - a semiparametric framework
for modelling trends, long-range dependence and nonstationarity. Computational Statistics & Data Analysis, Vol. 40, No. 2, 393-419.
Beran, J. and Feng, Y. (2002b). Iterative plug-in algorithms for SEMIFAR models -
definition, convergence, and asymptotic properties. J. Computational Graphical Statist., Vol. 11, No. 3, 690-713.
Beran, J. and Ghosh, S. (2000). Estimation of the dominating frequency for stationary and nonstationary fractional autoregressive processes. J. Time Series
Analysis, Vol. 21, No. 5, 513-533.
Beran, J. and Mazzola, G. (1992). Immaculate Concept. SToA music, 1 CD
1002.92, Zürich.
Beran, J. and Mazzola, G. (1999). Analyzing musical structure and performance
- a statistical approach. Statistical Science, Vol. 14, No. 1, pp. 47-79.
Beran, J. and Mazzola, G. (1999). Visualizing the relationship between two time
series by hierarchical smoothing. J. Computational Graphical Statist., Vol. 8,
No. 2, pp. 213-238.
Beran, J. and Mazzola, G. (2000). Timing Microstructure in Schumann's
Träumerei as an Expression of Harmony, Rhythm, and Motivic Structure in
Music Performance. Computers Mathematics Appl., Vol. 39, No. 5-6, pp. 99-130.
Beran, J. and Mazzola, G. (2001). Musical composition and performance - statistical decomposition and interpretation. Student, Vol. 4, No. 1, 13-42.
Beran, J. and Ocker, D. (1999). SEMIFAR forecasts, with applications to foreign
Leipzig.
Borg, I. and Groenen, P. (1997). Modern Multidimensional Scaling: Theory and
Applications. Springer, New York.
Bowman, A.W. and Azzalini, A. (1997). Applied Smoothing Techniques for Data
Analysis: The Kernel Approach with S-Plus Illustrations. Oxford University
Press, Oxford.
Box, G.E.P. and Jenkins, G.M. (1970). Time Series Analysis: Forecasting and
Control. Holden-Day, San Francisco.
Breiman, L. (1984). Classification and Regression Trees. CRC Press, Boca Raton.
Bremaud, P. (1999). Markov Chains. Springer, New York.
Brillouin, L. (1956). Science and Information Theory. Academic Press, New York.
Brillinger, D. (1981). Time Series Data Analysis and Theory (expanded ed.).
Holden Day, San Francisco.
Brillinger, D. and Irizarry, R.A. (1998). An investigation of the second- and
higher-order spectra of music. Signal Processing, Vol. 65, 161-179.
Brigham, E.O. (1988). The Fast Fourier Transform and Its Applications. Prentice
Hall, New Jersey.
Brockwell, P.J. and Davis, R.A. (1991). Time series: Theory and methods (2nd
ed.). Springer, New York.
Brown, E.N. (1990). A note on the asymptotic distribution of the parameter
estimates for the harmonic regression model. Biometrika, Vol. 77, No. 3, 653-656.
Chai, W. and Vercoe, B. (2001). Folk Music Classification Using Hidden Markov
Models. Proceedings of International Conference on Artificial Intelligence, June
2001 (//web.media.mit.edu/~chaiwei/papers/chai ICAI183.pdf).
Chambers, J., Cleveland, W., Kleiner, B., and Tukey, P. (1983). Graphical Methods for Data Analysis. Wadsworth, Belmont, CA.
Ogden, R.T. (1996). Essential Wavelets for Statistical Applications and Data
Analysis. Birkhäuser, Boston.
Orbach, J. (1999). Sound and Music. University Press of America, Lanham, MD.
Parzen, E. (1962). On estimation of a probability density function and mode.
Ann. Math. Statistics, Vol. 33, 1065-1076.
Peitgen, H.-O. and Saupe, D. (1988). The Science of Fractal Images. Springer,
New York.
Percival, D.B. and Walden, A.T. (2000). Wavelet Methods for Time Series Analysis. Cambridge University Press, Cambridge, UK.
Perle, G. (1955). Symmetric formations in the string quartets of Béla Bartók.
Music Review 16, 300-312.
Pierce, J.R. (1983). The Science of Musical Sound. Scientific American Books,
New York (2nd ed. printed by W.H. Freeman & Co, 1992).
Plackett, R.L. (1960). Principles of Regression Analysis. Clarendon Press, Oxford.
Polzehl, J. (1995). Projection pursuit discriminant analysis. Computational
Statist. Data Anal., Vol. 20, 141-157.
Price, B.D. (1969). Mathematical groups in campanology. Math. Gaz., 53, 129-133.
Priestley, M.B. (1965). Evolutionary spectra and non-stationary processes. J. R.
Statist. Soc., Series B, Vol. 27, 204-237.
Priestley, M.B. (1981a). Spectral Analysis and Time Series, (Vol. 1): Univariate
Time Series. Academic Press, New York.
Priestley, M.B. (1981b). Spectral Analysis and Time Series, (Vol. 2): Multivariate
Series, Prediction and Control. Academic Press, New York.
Quinn, B.G. and Thomson, P.J. (1991) Estimating the frequency of a periodic
function. Biometrika, Vol. 78, No. 1, 65-74.
Rahn, J. (1980). Basic Atonal Theory. Longman, New York.
Raichel, D.R. (2000). The Science and Applications of Acoustics. American Inst.
of Physics, College Park, PA.
Ramsay, J.O. (1977). Maximum likelihood estimation in multidimensional scaling. Psychometrika, 42, 241-266.
Raphael, C.S. (1999). Automatic segmentation of acoustic music signals using
hidden Markov models. IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. 21, No. 4, 360-370.
Raphael, C.S. (2001a). A probabilistic expert system for automatic musical accompaniment. J. Computational Graphical Statist., Vol. 10, No. 3, 487-512.
Raphael, C.S. (2001b). Synthesizing musical accompaniment with Bayesian belief
networks. J. New Music Res., Vol. 30, No. 1, 59-67.
Rao, C.R. (1973). Linear Statistical Inference and its Applications (2nd ed.).
Wiley & Sons, New York.
Rayleigh, J.W.S. (1896). The Theory of Sound (2 vols), 2nd ed., Macmillan,
London (Reprinted by Dover, 1945).
Read, R.C. (1997). Combinatorial problems in the theory of music. Discrete Mathematics, 167/168, 543-551.
Reiner, D. (1985). Enumeration in music theory, American Math. Monthly, 92/1,
51-54.
Webb, A.R. (2002). Statistical Pattern Recognition (2nd ed.). Wiley, New York.
Wedin, L. (1972). Multidimensional scaling of emotional expression in music.
Svensk Tidskrift för Musikforskning, 54, 115-131.
Wedin, L. and Goude, G. (1972). Dimension analysis of the perception of musical
timbre. Scand. J. Psychol., 13, 228-240.
Weihs, C., Berghoff, S., Hasse-Becker, P. and Ligges, U. (2001). Assessment of
Purity of Intonation in Singing Presentations by Discriminant Analysis. In:
Mathematical Statistics and Biometrical Applications, J. Kunert and G. Trenkler (Eds.), pp. 395-410.
White, A.T. (1983). Ringing the changes. Math. Proc. Camb. Phil. Soc. 94, 203-215.
White, A.T. (1985). Ringing the changes II. Ars Combinatorica, 20-A, 65-75.
White, A.T. (1987). Ringing the cosets. American Math. Monthly 94/8, 721-746.
Whittle, P. (1953). Estimation and information in stationary time series. Ark.
Mat., Vol. 2, 423-434.
Widmer, G. (2001). Discovering Simple Rules in Complex Data: A Meta-learning
Algorithm and Some Surprising Musical Discoveries. Austrian Research Institute for Artificial Intelligence, Vienna, TR-2001-31.
Wiener, N. (1948). Cybernetics or control and communication in the animal and
the machine. Act. Sci. Indust., No. 1053, Hermann et Cie, Paris.
Wilson, W.G. (1965). Change Ringing. October House Inc., New York.
Wolfowitz, J. (1957). The coding of messages subject to chance errors. Illinois J.
Math., Vol. 1, 591-606.
Wolfowitz, J. (1958). Information theory for mathematicians. Ann. Math. Statistics, Vol. 29, 351-356.
Wolfowitz, J. (1961). Coding Theorems of Information Theory. Springer, Berlin.
Woodward, P.M. (1953). Probability and Information Theory with Applications
to Radar. Pergamon Press, London.
Xenakis, I. (1971). Formalized Music: Thought and Mathematics in Composition.
Indiana University Press, Bloomington/London.
Yaglom, A.M. and Yaglom, I.M. (1967). Wahrscheinlichkeit und Information.
Deutscher Verlag der Wissenschaften, Berlin.
Yost, W.A. (1977). Fundamentals of Hearing. An Introduction. Academic Press,
San Diego.
Yohai, V.J. (1987). High breakdown-point and high efficiency robust estimates
for regression. Ann. Statistics, Vol. 15, 642-656.
Yohai, V.J., Stahel, W.A., and Zamar, R. (1991). A procedure for robust estimation and inference in linear regression. In: Directions in robust statistics and
diagnostics, Part II, W.A. Stahel, and S.W. Weisberg (Eds.), Springer, New
York.
Young, G. and Householder, A. S. (1941). A note on multidimensional psychophysical analysis. Psychometrika, 6, 331-333.
Zassenhaus, H.J. (1999). The Theory of Groups. Dover, Mineola.
Zivot, E. and Wang, J. (2002). Modeling Financial Time Series with S-Plus.
Springer, New York.