Statistics - in - Musicology, by Jan Beran


Interdisciplinary Statistics

STATISTICS in
MUSICOLOGY

Jan Beran

CHAPMAN & HALL/CRC


A CRC Press Company
Boca Raton London New York Washington, D.C.
2004 CRC Press LLC


Library of Congress Cataloging-in-Publication Data


Beran, Jan, 1959-
  Statistics in musicology / Jan Beran.
    p. cm. (Interdisciplinary statistics series)
  Includes bibliographical references (p. ) and indexes.
  ISBN 1-58488-219-0 (alk. paper)
  1. Musical analysis--Statistical methods. I. Title. II. Interdisciplinary statistics series.
MT6.B344 2003
781.2--dc21

2003048488

This book contains information obtained from authentic and highly regarded sources. Reprinted material
is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable
efforts have been made to publish reliable data and information, but the author and the publisher cannot
assume responsibility for the validity of all materials or for the consequences of their use.
Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic
or mechanical, including photocopying, microfilming, and recording, or by any information storage or
retrieval system, without prior permission in writing from the publisher.
The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for
creating new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC
for such copying.
Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation, without intent to infringe.

Visit the CRC Press Web site at www.crcpress.com


© 2004 by Chapman & Hall/CRC
No claim to original U.S. Government works
International Standard Book Number 1-58488-219-0
Library of Congress Card Number 2003048488
Printed in the United States of America 1 2 3 4 5 6 7 8 9 0
Printed on acid-free paper


Contents
Preface
1 Some mathematical foundations of music
1.1 General background
1.2 Some elements of algebra
1.3 Specific applications in music
2 Exploratory data mining in musical spaces
2.1 Musical motivation
2.2 Some descriptive statistics and plots for univariate data
2.3 Specific applications in music: univariate
2.4 Some descriptive statistics and plots for bivariate data
2.5 Specific applications in music: bivariate
2.6 Some multivariate descriptive displays
2.7 Specific applications in music: multivariate
3 Global measures of structure and randomness
3.1 Musical motivation
3.2 Basic principles
3.3 Specific applications in music
4 Time series analysis
4.1 Musical motivation
4.2 Basic principles
4.3 Specific applications in music
5 Hierarchical methods
5.1 Musical motivation
5.2 Basic principles
5.3 Specific applications in music
6 Markov chains and hidden Markov models
6.1 Musical motivation
6.2 Basic principles
6.3 Specific applications in music
7 Circular statistics
7.1 Musical motivation
7.2 Basic principles
7.3 Specific applications in music
8 Principal component analysis
8.1 Musical motivation
8.2 Basic principles
8.3 Specific applications in music
9 Discriminant analysis
9.1 Musical motivation
9.2 Basic principles
9.3 Specific applications in music
10 Cluster analysis
10.1 Musical motivation
10.2 Basic principles
10.3 Specific applications in music
11 Multidimensional scaling
11.1 Musical motivation
11.2 Basic principles
11.3 Specific applications in music
List of figures
References


Preface
An essential aspect of music is structure. It is therefore not surprising that a connection between music and mathematics was recognized long before our time. Perhaps best known among the ancient "quantitative musicologists" are the Pythagoreans, who found fundamental connections between musical intervals and mathematical ratios. An obvious reason why mathematics comes into play is that a musical performance results in sound waves that can be described by physical equations. Perhaps more interesting, however, is the intrinsic organization of these waves that distinguishes music from ordinary noise. Also, since music is intrinsically linked with human perception, emotion, and reflection as well as the human body, the scientific study of music goes far beyond physics. For a deeper understanding of music, a number of different sciences, such as psychology, physiology, history, physics, mathematics, statistics, computer science, semiotics, and of course musicology, to name only a few, need to be combined. This, together with the lack of available data, prevented, until recently, a systematic development of quantitative methods in musicology. In the last few years, the situation has changed dramatically. Collection of quantitative data is no longer a serious problem, and a number of mathematical and statistical methods have been developed that are suitable for analyzing such data. Statistics is likely to play an essential role in future developments of musicology, mainly for the following reasons: a) statistics is concerned with finding structure in data; b) statistical methods and structures are mathematical, and can often be carried over to various types of data; statistics is therefore an ideal interdisciplinary science that can link different scientific disciplines; and c) musical data are massive and complex, and therefore basically useless unless suitable tools are applied to extract essential features.
This book is addressed to anybody who is curious about how one may analyze music in a quantitative manner. Clearly, the question of how such an analysis may be done is very complex, and no ultimate answer can be given here. Instead, the book summarizes various ideas that have proven useful in musical analysis and may provide the reader with food for thought or inspiration to do his or her own analysis. Specifically, the methods and applications discussed here may be of interest to students and researchers in music, statistics, mathematics, computer science, communication, and engineering. There is a large variety of statistical methods that can be applied in music. Selected topics are discussed in this book, ranging from simple descriptive statistics to formal modeling by parametric and nonparametric processes. The theoretical foundations of each method are discussed briefly, with references to more detailed literature. The emphasis is on examples that illustrate how to use the results in musical analysis. The methods can be divided into two groups: general classical methods and specific new methods developed to solve particular questions in music. Examples illustrate, on one hand, how standard statistical methods can be used to obtain quantitative answers to musicological questions. On the other hand, the development of more specific methodology illustrates how one may design new statistical models to answer specific questions. The data examples are kept simple in order to be understandable without extended musicological terminology. This implies many simplifications from the point of view of music theory and leaves scope for more sophisticated analysis that may be carried out in future research. Perhaps this book will inspire the reader to join the effort.
Chapters are essentially independent to allow selective reading. Since the book describes a large variety of statistical methods in a nutshell, it can be used as a quick reference for applied statistics, with examples from musicology.
I would like to thank the following libraries, institutes, and museums for their permission to print various pictures, manuscripts, facsimiles, and photographs: Zentralbibliothek Zürich (Ruth Häusler, Handschriftenabteilung; Anikó Ladányi and Michael Kotrba, Graphische Sammlung); Belmont Music Publishers (Anne Wirth); Philippe Gontier, Paris; Österreichische Post AG; Deutsche Post AG; Elisabeth von Janota-Bzowski, Düsseldorf; University Library Heidelberg; Galerie Neuer Meister, Dresden; Robert-Sterl-Haus (K.M. Mieth); Béla Bartók Memorial House (János Szirányi); Frank Martin Society (Maria Martin); Karadar-Bertoldi Ensemble (Prof. Francesco Bertoldi); col legno (Wulf Weinmann). Thanks also to B. Repp for providing us with the tempo data for Schumann's Träumerei. I would also like to thank numerous colleagues from mathematics, statistics, and musicology who encouraged me to write this book. Finally, I would like to thank my wife and my daughter for their encouragement and support, without which this book could not have been written.
Jan Beran
Konstanz, March 2003


CHAPTER 1

Some mathematical foundations of music

1.1 General background
The study of music by means of mathematics goes back several thousand years. Well documented are, for instance, mathematical and philosophical studies by the Pythagorean school in ancient Greece (see e.g. van der Waerden 1979). Advances in mathematics, computer science, psychology, semiotics, and related fields, together with technological progress (in particular computer technology), led to a revival of quantitative thinking in music in the last two to three decades (see e.g. Archibald 1972, Solomon 1973, Schnitzler 1976, Balzano 1980, Götze and Wille 1985, Lewin 1987, Mazzola 1990a, 2002, Vuza 1991, 1992a,b, 1993, Keil 1991, Lendvai 1993, Lindley and Turner-Smith 1993, Genevois and Orlarey 1997, Johnson 1997; also see Hofstadter 1999, Andreatta et al. 2001, Leyton 2001, and Babbitt 1960, 1961, 1987, Forte 1964, 1973, 1989, Rahn 1980, Morris 1987, 1995, Andreatta 1997; for early accounts of mathematical analysis of music also see Graeser 1924, Perle 1955, Norden 1964). Many recent references can be found in specialized journals such as Computing in Musicology, Music Theory Online, Perspectives of New Music, Journal of New Music Research, Intégral, Music Perception, and Music Theory Spectrum, to name a few.
Music is, to a large extent, the result of a subconscious intuitive process. The basic question of quantitative musical analysis is to what extent music may nevertheless be described or explained, at least partially, in a quantitative manner. The German philosopher and mathematician Leibniz (1646-1716) (Figure 1.5) called music the "arithmetic of the soul". This is a profound philosophical statement; however, the difficulty is to formulate what exactly it may mean. Some composers, notably in the 20th century, consciously used mathematical elements in their compositions. Typical examples are permutations, the golden section, transformations in two- or higher-dimensional spaces, random numbers, and fractals (see e.g. Schönberg, Webern, Bartók, Xenakis, Cage, Lutoslawski, Eimert, Kagel, Stockhausen, Boulez, Ligeti, Barlow; Figures 1.1, 1.4, 1.15). More generally, conscious logical construction is an inherent part of composition. For instance, the forms of sonata and symphony were developed based on reflections about well-balanced proportions. The tormenting search for logical perfection is well


Figure 1.1 Quantitative analysis of music helps to understand creative processes. (Pierre Boulez, photograph courtesy of Philippe Gontier, Paris; and Jim by J.B.)

Figure 1.2 J.S. Bach (1685-1750). (Engraving by L. Sichling after a painting by Elias Gottlob Haussmann, 1746; courtesy of Zentralbibliothek Zürich.)


documented in Beethoven's famous sketchbooks. Similarly, the art of counterpoint that culminated in J.S. Bach's (Figure 1.2) work relies to a high degree on intrinsically mathematical principles. A rather peculiar early account of explicit applications of mathematics is the use of permutations in change ringing in English churches since the 10th century (Fletcher 1956, Price 1969, Stewart 1992, White 1983, 1985, 1987, Wilson 1965). More standard are simple symmetries, such as retrograde (e.g. Crab fugue, or Canon cancricans), inversion, arpeggio, or augmentation. A curious example of this sort is Mozart's "Spiegel Duett" (or mirror duet, Figures 1.6, 1.7; the attribution to Mozart is actually uncertain). In the 20th century, composers such as Messiaen or Xenakis (Xenakis 1971; Figure 1.15) attempted to develop mathematical theories that would lead to new techniques of composition. From a strictly mathematical point of view, their derivations are not always exact. Nevertheless, their artistic contributions were very innovative and inspiring. More recent, mathematically stringent approaches to music theory, or certain aspects of it, are based on modern tools of abstract mathematics, such as algebra, algebraic geometry, and mathematical statistics (see e.g. Reiner 1985, Mazzola 1985, 1990a, 2002, Lewin 1987, Fripertinger 1991, 1999, 2001, Beran and Mazzola 1992, 1999a,b, 2000, Read 1997, Fleischer et al. 2000, Fleischer 2003).
The most obvious connection between music and mathematics is due to the fact that music is communicated in the form of sound waves. Musical sounds can therefore be studied by means of physical equations. Already in ancient Greece (around the 5th century BC), Pythagoreans found the relationship between certain musical intervals and numeric proportions, and calculated intervals of selected scales. These results were probably obtained by studying the vibration of strings. Similar studies were done in other cultures, but are mostly not well documented. In practical terms, these studies led to singling out specific frequencies (or frequency proportions) as musically useful, and to the development of various scales and harmonic systems. A more systematic approach to the physics of musical sounds, music perception, and acoustics was initiated in the second half of the 19th century by path-breaking contributions by Helmholtz (1863) and other physicists (see e.g. Rayleigh 1896). Since then, a vast amount of knowledge has been accumulated in this field (see e.g. Backus 1969, 1977, Morse and Ingard 1968, 1986, Benade 1976, 1990, Rigden 1977, Yost 1977, Hall 1980, Berg and Stork 1995, Pierce 1983, Cremer 1984, Rossing 1984, 1990, 2000, Johnston 1989, Fletcher and Rossing 1991, Graff 1975, 1991, Roederer 1995, Rossing et al. 1995, Howard and Angus 1996, Beament 1997, Crocker 1998, Nederveen 1998, Orbach 1999, Kinsler et al. 2000, Raichel 2000). For a historic account on musical acoustics see e.g. Bailhache (2001).
It may appear at first that once we have mastered modeling musical sounds by physical equations, music is understood. This is, however, not so. Music is not just an arbitrary collection of sounds: music is organized sound.


Figure 1.3 Ludwig van Beethoven (1770-1827). (Drawing by E. Dürck after a painting by J.K. Stieler, 1819; courtesy of Zentralbibliothek Zürich.)

Figure 1.4 Anton Webern (1883-1945). (Courtesy of Österreichische Post AG.)


Figure 1.5 Gottfried Wilhelm Leibniz (1646-1716). (Courtesy of Deutsche Post AG and Elisabeth von Janota-Bzowski.)

Physical equations for sound waves only describe the propagation of air pressure. They do not provide, by themselves, an understanding of how and why certain sounds are connected, nor do they tell us anything (at least not directly) about the effect on the audience. As far as structure is concerned, one may even argue, for the sake of argument, that music does not necessarily need physical realization in the form of a sound. Musicians are able to hear music just by looking at a score. Beethoven (Figures 1.3, 1.16) composed his ultimate masterpieces after he lost his hearing. Thus, on an abstract level, music can be considered as an organized structure that follows certain laws. This structure may or may not express feelings of the composer. Usually, the structure is communicated to the audience by means of physical sounds, which in turn trigger an emotional experience of the audience (not necessarily identical with the one intended by the composer). The structure itself can be analyzed, at least partially, using suitable mathematical structures. Note, however, that understanding the mathematical structure does not necessarily tell us anything about the effect on the audience. Moreover, any mathematical structure used for analyzing music describes certain selected aspects only. For instance, studying symmetries of motifs in a composition by purely algebraic means ignores psychological, historical, perceptual, and other important issues. Ideally, all relevant scientific disciplines would need to interact to gain a broad understanding. A further complication is that the existence of a unique "truth" is by no means certain (and is in fact rather unlikely). For instance, a composition may contain certain structures that are important for some listeners but are ignored by others. This problem became apparent in the early 20th century with the introduction of 12-tone music. The general public was not ready to perceive the complex structures of dodecaphonic music and was rather appalled by the seemingly chaotic noise, whereas a minority of specialized listeners was enthusiastic. Another example is the


comparison of performances. Which pianist is "the best"? This question has no unique answer, if any. There is no fixed gold standard and no unique solution that would represent the ultimate unchangeable truth. What one may hope for, at most, is a classification into types of performances that are characterized by certain quantifiable properties, without attaching a subjective judgment of quality.
The main focus of this book is statistics. Statistics is essential for connecting theoretical mathematical concepts with observed reality, to find and explore structures empirically, and to develop models that can be applied and tested in practice. Until recently, traditional musical analysis was mostly carried out in a purely qualitative, and at least partially subjective, manner. Applications of statistical methods to questions in musicology and performance research are very rare (for examples see Yaglom and Yaglom 1967, Repp 1992, de la Motte-Haber 1996, Steinberg 1995, Waugh 1996, Nettheim 1997, Widmer 2001, Stamatatos and Widmer 2002) and mostly consist of simple applications of standard statistical tools to confirm results or conjectures that had been known or derived before by musicological, historic, or psychological reasoning. An interesting overview of statistical applications in music, and many references, can be found in Nettheim (1997). The lack of quantitative analysis may be explained, in part, by the impossibility of collecting objective data. Meanwhile, however, due to modern computer technology, an increasing number of musical data are becoming available. An in-depth statistical analysis of music is therefore no longer unrealistic. On the theoretical side, the development of sophisticated mathematical tools such as algebra, algebraic geometry, and mathematical statistics, and their adaptation to the specific needs of music theory, made it possible to pursue a more quantitative path. Because of the complex, highly organized nature of music, existing, mostly qualitative, knowledge about music must be incorporated into the process of mathematical and statistical modeling. The statistical methods that will be discussed in the subsequent chapters can be divided into two categories:
1. Classical methods of mathematical statistics and exploratory data analysis: many classical methods can be applied to analyze musical structures, provided that suitable data are available. A number of examples will be discussed. The examples are relatively simple from the point of view of musicology, the purpose being to illustrate how the appropriate use of statistics can yield interesting results, and to stimulate the reader to invent his or her own statistical methods that are appropriate for answering specific musicological questions.
2. New methods developed specifically to answer concrete questions in musicology: in the last few years, questions in music composition and performance have led to the development of new statistical methods that are specifically designed to solve questions such as classification of performance styles, identification and modeling of metric, melodic, and harmonic structures, quantification of similarities and differences between compositions and performance styles, automatic identification of musical events and structures from audio signals, etc. Some of these methods will be discussed in detail.
A mathematical discipline that is concerned specifically with abstract definitions of structures is algebra. Some elements of basic algebra are therefore discussed in the next section. Naturally, depending on the context, other mathematical disciplines also play an equally important role in musical analysis, and will be discussed later where necessary. Readers who are familiar with modern algebra may skip the following section. A few examples that illustrate applications of algebraic structures to music are presented in Section 1.3. An extended account of mathematical approaches to music based on algebra and algebraic geometry is given, for instance, in Mazzola (1990a, 2002) (also see Lewin 1987 and Benson 1995-2002).
1.2 Some elements of algebra

1.2.1 Motivation

Algebraic considerations in music theory have gained increasing popularity in recent years. The reason is that there are striking similarities between musical and algebraic structures. Why this is so can be illustrated by a simple example: notes (or rather pitches) that differ by an octave can be considered equivalent with respect to their harmonic "meaning". If an instrument is tuned according to equal temperament, then, from the harmonic perspective, there are only 12 different notes. These can be represented as integers modulo 12. Similarly, there are only 12 different intervals. This means that we are dealing with the set Z12 = {0, 1, ..., 11}. The sum z = x + y of two elements x, y ∈ Z12 is interpreted as the note/interval resulting from increasing the note/interval x by the interval y. The set Z12 of notes (intervals) is then an additive group (see definition below).
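The arithmetic just described can be sketched in a few lines of code. This is an illustrative sketch, not from the text; the mapping of note names to residues (C = 0, C# = 1, ..., B = 11) is an assumed convention.

```python
# Pitch classes as integers mod 12; adding an interval is addition in
# the group (Z12, +). Note names are an assumed convention (C = 0).
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F",
              "F#", "G", "G#", "A", "A#", "B"]

def transpose(pitch_class: int, interval: int) -> int:
    """Group operation on Z12: shift a pitch class by an interval."""
    return (pitch_class + interval) % 12

# Group properties in action:
assert transpose(0, 0) == 0                 # 0 is the zero element
assert transpose(9, 3) == 0                 # 3 is the inverse of 9
assert transpose(transpose(4, 3), 5) == transpose(4, 8)  # associativity

print(NOTE_NAMES[transpose(4, 7)])  # E up a perfect fifth -> B
```

Octave equivalence is built in automatically: any multiple of 12 added to a pitch leaves its class unchanged.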
1.2.2 Definitions and results

We discuss some important concepts of algebra that are useful to describe musical structures. A more comprehensive overview of modern algebra can be found in standard text books such as those by Albert (1956), Herstein (1975), Zassenhaus (1999), Gilbert (2002), and Rotman (2002).
The most fundamental structures in algebra are group, ring, field, module, and vector space.
Definition 1 Let G be a nonempty set with a binary operation + such that a + b ∈ G for all a, b ∈ G and the following holds:
1. (a + b) + c = a + (b + c) (Associativity)
2. There exists a zero element 0 ∈ G such that 0 + a = a + 0 = a for all a ∈ G
3. For each a ∈ G, there exists an inverse element (−a) ∈ G such that (−a) + a = a + (−a) = 0
Then (G, +) is called a group. The group (G, +) is called commutative (or abelian) if for each a, b ∈ G, a + b = b + a. The number of elements in G is called the order of the group and is denoted by o(G). If the order is finite, then G is called a finite group.
In a multiplicative way this can be written as
Definition 2 Let G be a nonempty set with a binary operation ∗ such that a ∗ b ∈ G for all a, b ∈ G and the following holds:
1. (a ∗ b) ∗ c = a ∗ (b ∗ c) (Associativity)
2. There exists an identity element e ∈ G such that e ∗ a = a ∗ e = a for all a ∈ G
3. For each a ∈ G, there exists an inverse element a⁻¹ ∈ G such that a⁻¹ ∗ a = a ∗ a⁻¹ = e
Then (G, ∗) is called a group. The group (G, ∗) is called commutative (or abelian) if for each a, b ∈ G, a ∗ b = b ∗ a.
For subsets we have
Definition 3 Let (G, ∗) and (H, ∗) be groups with H ⊆ G. Then H is called a subgroup of G.
Some groups can be generated by a single element of the group:
Definition 4 Let (G, ∗) be a group with n < ∞ elements denoted by a^i (i = 0, 1, ..., n−1) and such that
1. a^0 = a^n = e
2. a^i ∗ a^j = a^(i+j) if i + j ≤ n and a^i ∗ a^j = a^(i+j−n) if i + j > n
Then G is called a cyclic group. Furthermore, if G = (a) = {a^i : i ∈ Z}, where a^i denotes the product with all i terms equal to a, then a is called a generator of G.
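As a sketch (an assumed example, not from the text), one can check by brute force which elements of the additive group Z12 are generators in the sense of Definition 4. Musically, the generator 7 produces the circle of fifths.

```python
# An element a generates (Z12, +) iff its multiples a, 2a, 3a, ...
# reach all 12 residues (equivalently, gcd(a, 12) = 1).
def generates(a: int, n: int = 12) -> bool:
    return len({(k * a) % n for k in range(n)}) == n

print([a for a in range(12) if generates(a)])  # [1, 5, 7, 11]

# a = 7 (the perfect fifth) as generator: the circle of fifths.
print([(k * 7) % 12 for k in range(12)])
# [0, 7, 2, 9, 4, 11, 6, 1, 8, 3, 10, 5]
```

The fact that 1, 5, 7, and 11 are exactly the residues coprime to 12 is a standard property of cyclic groups.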
An important notion is given in the following
Definition 5 Let G be a group that acts on a set X by assigning to each x ∈ X and g ∈ G an element g(x) ∈ X. Then, for each x ∈ X, the set G(x) = {y : y = g(x), g ∈ G} is called the orbit of x.
Note that, given a group G that acts on X, the set X is partitioned into disjoint orbits.
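A small illustration of this partition (an assumed example, not from the text): let the cyclic subgroup generated by transposition by a minor third (+3) act on Z12. Each orbit is a diminished-seventh cycle, and the three orbits partition Z12.

```python
# Orbit of x under repeated transposition by `step` (mod n).
def orbit(x: int, step: int = 3, n: int = 12) -> frozenset:
    return frozenset((x + k * step) % n for k in range(n))

orbits = {orbit(x) for x in range(12)}
print(sorted(sorted(o) for o in orbits))
# [[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]
```

Every pitch class lies in exactly one of the three orbits, as the remark above requires.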
If there are two operations + and ∗, then a ring is defined by
Definition 6 Let R be a nonempty set with two binary operations + and ∗ such that the following holds:
1. (R, +) is an abelian group
2. a ∗ b ∈ R for all a, b ∈ R
3. (a ∗ b) ∗ c = a ∗ (b ∗ c) (Associativity)
4. a ∗ (b + c) = a ∗ b + a ∗ c and (b + c) ∗ a = b ∗ a + c ∗ a (distributive law)
Then (R, +, ∗) is called an (associative) ring. If also a ∗ b = b ∗ a for all a, b ∈ R, then R is called a commutative ring.
Further useful definitions are:
Definition 7 Let R be a commutative ring and a ∈ R, a ≠ 0, such that there exists an element b ∈ R, b ≠ 0, with a ∗ b = 0. Then a is called a zero-divisor. If R has no zero-divisors, then it is called an integral domain.
Definition 8 Let R be a ring such that (R \ {0}, ∗) is a group. Then R is called a division ring. A commutative division ring is called a field.
A module is defined as follows:
Definition 9 Let (R, +, ∗) be a ring and M a nonempty set with a binary operation +. Assume that
1. (M, +) is an abelian group
2. For every r ∈ R, m ∈ M, there exists an element r ∗ m ∈ M
3. r ∗ (a + b) = r ∗ a + r ∗ b for every r ∈ R, a, b ∈ M
4. r ∗ (s ∗ a) = (r ∗ s) ∗ a for every r, s ∈ R, a ∈ M
5. (r + s) ∗ a = r ∗ a + s ∗ a for every r, s ∈ R, a ∈ M
Then M is called an R-module, or module over R. If R has a unit element e and if e ∗ a = a for all a ∈ M, then M is called a unital R-module. A unital R-module where R is a field is called a vector space over R.
There is an enormous amount of literature on groups, rings, modules,
etc. Some of the standard results are summarized, for instance, in text
books such as those given above. Here, we cite only a few theorems that
are especially useful in music. We start with a few more definitions.
Definition 10 Let H ⊆ G be a subgroup of G such that for every a ∈ G, a ∗ H ∗ a⁻¹ ⊆ H. Then H is called a normal subgroup of G.
Definition 11 Let G be such that the only normal subgroups are H = G and H = {e}. Then G is called a simple group.
Definition 12 Let G be a group and H1, ..., Hn normal subgroups such that

G = H1 ∗ H2 ∗ ... ∗ Hn        (1.1)

and any a ∈ G can be written uniquely as a product

a = b1 ∗ b2 ∗ ... ∗ bn        (1.2)

with bi ∈ Hi. Then G is said to be the (internal) direct product of H1, ..., Hn.


Definition 13 Let G1 and G2 be two groups, define G = G1 × G2 = {(a, b) : a ∈ G1, b ∈ G2} and the operation ∗ by (a1, b1) ∗ (a2, b2) = (a1 ∗ a2, b1 ∗ b2). Then the group (G, ∗) is called the (external) direct product of G1 and G2.
Definition 14 Let M be an R-module and M1, ..., Mn submodules such that every a ∈ M can be written uniquely as a sum

a = a1 + a2 + ... + an        (1.3)

with ai ∈ Mi. Then M is said to be the direct sum of M1, ..., Mn.
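Direct products can be checked concretely. The following sketch (an assumed example, not from the text) verifies that x → (x mod 3, x mod 4) is a bijective homomorphism from Z12 onto Z3 × Z4, i.e. Z12 is isomorphic to the external direct product Z3 × Z4 (an instance of the Chinese remainder theorem).

```python
# The candidate isomorphism g : Z12 -> Z3 x Z4.
g = {x: (x % 3, x % 4) for x in range(12)}

# Bijective: 12 elements map to 12 distinct pairs.
assert len(set(g.values())) == 12

# Homomorphism: g(x + y) = g(x) + g(y), componentwise mod 3 and mod 4.
for x in range(12):
    for y in range(12):
        gx, gy = g[x], g[y]
        assert g[(x + y) % 12] == ((gx[0] + gy[0]) % 3,
                                   (gx[1] + gy[1]) % 4)
print("Z12 is isomorphic to Z3 x Z4")
```

This works because 3 and 4 are coprime; Z12 is not isomorphic to Z2 × Z6 in the same elementwise way via x → (x mod 2, x mod 6), since that map is not one-to-one on sums.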


We now turn to the question which subgroups of finite groups exist.
Theorem 1 Let H be a subgroup of a finite group G. Then o(H) is a divisor of o(G).
Theorem 2 (Sylow) Let G be a group and p a prime number such that p^m is a divisor of o(G). Then G has a subgroup H with o(H) = p^m.
Definition 15 A subgroup H ⊆ G of order p^m, such that p^m is a divisor of o(G) but p^(m+1) is not, is called a p-Sylow subgroup.
The next theorems help to decide whether a ring is a field.
Theorem 3 Let R be a finite integral domain. Then R is a field.
Corollary 1 Let p be a prime number and R = Zp = {x mod p : x ∈ N} be the set of integers modulo p (with the operations + and ∗ defined accordingly). Then R is a field.
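The distinction can be made tangible by listing zero-divisors (a sketch with assumed examples, not from the text): Z7 has none and is a field, while Z12 has zero-divisors (e.g. 3 ∗ 4 = 12 ≡ 0 mod 12), so it is not an integral domain and hence not a field.

```python
# Zero-divisors of the ring Zn: nonzero a with a*b = 0 (mod n) for
# some nonzero b (cf. Definition 7).
def zero_divisors(n: int) -> list:
    return [a for a in range(1, n)
            if any((a * b) % n == 0 for b in range(1, n))]

print(zero_divisors(7))   # [] -> Z7 is an integral domain (a field)
print(zero_divisors(12))  # [2, 3, 4, 6, 8, 9, 10]
```

So the 12-note system, viewed as the ring Z12, is emphatically not a field; only the prime moduli Zp are.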
An essential way to compare algebraic structures is in terms of operation-preserving mappings. The following definitions are needed:
Definition 16 Let (G1, ∗) and (G2, ∗) be two groups. A mapping g : G1 → G2 such that

g(a ∗ b) = g(a) ∗ g(b)        (1.4)

is called a (group-)homomorphism. If g is a one-to-one (group-)homomorphism, then it is called an isomorphism (or group-isomorphism). Moreover, if G1 = G2, then g is called an automorphism (or group-automorphism).
Definition 17 Two groups G1, G2 are called isomorphic if there is an isomorphism g : G1 → G2.
Analogous definitions can be given for rings and modules:
Definition 18 Let R1 and R2 be two rings. A mapping g : R1 → R2 such that

g(a + b) = g(a) + g(b)        (1.5)

and

g(a ∗ b) = g(a) ∗ g(b)        (1.6)

is called a (ring-)homomorphism. If g is a one-to-one (ring-)homomorphism, then it is called an isomorphism (or ring-isomorphism). Furthermore, if R1 = R2, then g is called an automorphism (or ring-automorphism).


Definition 19 Two rings R1, R2 are called isomorphic if there is an isomorphism g : R1 → R2.
Definition 20 Let M1 and M2 be two modules over R. A mapping g : M1 → M2 such that for every a, b ∈ M1, r ∈ R,

g(a + b) = g(a) + g(b)        (1.7)

and

g(r ∗ a) = r ∗ g(a)        (1.8)

is called a (module-)homomorphism (or a linear transformation). If g is a one-to-one (module-)homomorphism, then it is called an isomorphism (or module-isomorphism). Furthermore, if M1 = M2, then g is called an automorphism (or module-automorphism).
Definition 21 Two modules M1, M2 are called isomorphic if there is an isomorphism g : M1 → M2.
Finally, a general family of transformations is defined by
Definition 22 Let g : M1 → M2 be a (module-)homomorphism. Then a mapping h : M1 → M2 defined by

h(a) = c + g(a)        (1.9)

with c ∈ M2 is called an affine transformation. If M1 = M2, then h is called a symmetry of M. Moreover, if h is invertible, then it is called an invertible symmetry of M.
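Affine maps on Z12 correspond to familiar musical symmetries. The sketch below is an assumed illustration (the motif is an arbitrary example, not from the text): with g(a) = a the map h is a transposition by c, and with g(a) = −a it is an inversion.

```python
# Apply the affine map h(a) = c + mult*a (mod 12) to each pitch class
# of a motif; mult = 1 gives transposition, mult = -1 inversion.
def affine(motif: list, mult: int, c: int) -> list:
    return [(c + mult * a) % 12 for a in motif]

motif = [0, 4, 7, 4]                 # hypothetical C-major figure
print(affine(motif, 1, 2))           # transposed up a whole tone
print(affine(motif, -1, 0))          # inverted about 0
```

Both maps are invertible symmetries of Z12 in the sense of Definition 22, since mult = 1 and mult = −1 are units mod 12.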
Studying properties of groups is equivalent to studying groups of automorphisms:
Theorem 4 (Cayley's theorem) Let G be a group. Then there is a set S such that G is isomorphic to a subgroup of A(S), where A(S) is the set of all one-to-one mappings of S onto itself.
Definition 23 Let S be a finite set with n elements. Then the group (A(S), ∘) (where a ∘ b denotes successive application of the functions a and b) is called the symmetric group of order n, and is denoted by Sn.
Note that Sn is isomorphic to the group of permutations of the numbers 1, 2, ..., n, and has n! elements. Another important concept is motivated by representation in coordinates, as we are used to from Euclidean geometry. The representation follows since, in terms of isomorphy, the internal and external direct product can be shown to be equivalent:
Theorem 5 Let G = H1 ∗ H2 ∗ ... ∗ Hn be the internal direct product of H1, ..., Hn and G′ = H1 × H2 × ... × Hn the external direct product. Then G and G′ are isomorphic, through the isomorphism g : G′ → G defined by g(a1, ..., an) = a1 ∗ a2 ∗ ... ∗ an.
This theorem implies that one does not need to distinguish between the internal and external direct product. The analogous result holds for modules:


Theorem 6 Let M be a direct sum of M1, ..., Mn. Then M is isomorphic to the module M′ = {(a1, a2, ..., an) : ai ∈ Mi} with the operations (a1, a2, ...) + (b1, b2, ...) = (a1 + b1, a2 + b2, ...) and r ∗ (a1, a2, ...) = (r ∗ a1, r ∗ a2, ...).
Thus, a module M = M1 + M2 + ... + Mn can be described in terms of its coordinates with respect to Mi (i = 1, ..., n), and the structure of M is known as soon as we know the structure of the Mi (i = 1, ..., n).
Direct products can be used, in particular, to characterize the structure of finite abelian groups:
Theorem 7 Let (G, ∗) be a finite commutative group. Then G is isomorphic to the direct product of its Sylow subgroups.
Theorem 8 Let (G, ∗) be a finite commutative group. Then G is the direct product of cyclic groups.
Similar, but slightly more involved, results can be shown for modules, but will not be needed here.
1.3 Specific applications in music

In the following, the usefulness of algebraic structures in music is illustrated by a few selected examples. This is only a small selection from the extended literature on this topic. For further reading see e.g. Graeser (1924), Schönberg (1950), Perle (1955), Fletcher (1956), Babbitt (1960, 1961), Price (1969), Archibald (1972), Halsey and Hewitt (1978), Balzano (1980), Rahn (1980), Götze and Wille (1985), Reiner (1985), Berry (1987), Mazzola (1990a, 2002 and references therein), Vuza (1991, 1992a,b, 1993), Fripertinger (1991), Lendvai (1993), Benson (1995-2002), Read (1997), Noll (1997), Andreatta (1997), Stange-Elbe (2000), among others.
1.3.1 The Mathieu group
It can be shown that finite simple groups fall into families that can be
described explicitly, except for 26 so-called sporadic groups. One such group
is the so-called Mathieu group M12, which was discovered by the French
mathematician Mathieu in the 19th century (Mathieu 1861, 1873; also see
e.g. Conway and Sloane 1988). In their study of probabilistic properties of
(card) shuffling, Diaconis et al. (1983) show that M12 can be generated by
two permutations (which they call Mongean shuffles), namely
π1 = ( 1  2  3  4  5  6  7  8  9 10 11 12
       7  6  8  5  9  4 10  3 11  2 12  1 )    (1.10)

and

π2 = ( 1  2  3  4  5  6  7  8  9 10 11 12
       6  7  5  8  4  9  3 10  2 11  1 12 )    (1.11)

where the lower rows denote the images of the numbers 1, ..., 12. The order
of this group is o(M12) = 95040 (!) An interesting application of these
permutations can be found in Ile de feu 2 by Olivier Messiaen (Berry 1987),
where π1 and π2 are used to generate sequences of tones and durations.
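The order claim can be checked numerically. The following sketch (our Python illustration, not from the book; the names `PI1`, `PI2`, `generated_group` are ours) closes the two Mongean shuffles under composition, starting from the identity, and counts the elements of the generated group:

```python
# Verify o(M12) = 95040 by brute-force closure of the two Mongean shuffles.
# A permutation is stored in image notation: p[i-1] is the image of i.
PI1 = (7, 6, 8, 5, 9, 4, 10, 3, 11, 2, 12, 1)   # pi_1 from (1.10)
PI2 = (6, 7, 5, 8, 4, 9, 3, 10, 2, 11, 1, 12)   # pi_2 from (1.11)

def compose(p, q):
    """Composition p after q ('first apply q, then p') in image notation."""
    return tuple(p[q[i] - 1] for i in range(12))

def generated_group(generators):
    """Breadth-first closure of the generators under composition."""
    identity = tuple(range(1, 13))
    group, frontier = {identity}, [identity]
    while frontier:
        new = []
        for g in frontier:
            for s in generators:
                h = compose(s, g)
                if h not in group:
                    group.add(h)
                    new.append(h)
        frontier = new
    return group

if __name__ == "__main__":
    print(len(generated_group([PI1, PI2])))  # order of the generated group
```

Since the group is finite, repeatedly multiplying by the generators (without explicit inverses) already yields the generated subgroup.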
1.3.2 Campanology
A rather peculiar example of group theory in action (though perhaps
rather trivial mathematically) is campanology, or change ringing (Fletcher
1956, Wilson 1965, Price 1969, White 1983, 1985, 1987, Stewart 1992). The
art of change ringing started in England in the 10th century and is still
performed today. The problem to be solved is as follows: there are
k swinging bells in the church tower. One starts playing a melody that
consists of a certain sequence in which the bells are played, each bell
being played only once. Thus, the initial sequence is a permutation of the
numbers 1, ..., k. Since it is not interesting to repeat the same melody over
and over, the initial melody has to be varied. However, the bells are very
heavy, so it is not easy to change the timing of the bells. Each variation
is therefore restricted, in that in each round only pairs of adjacent
bells can exchange their positions. Thus, for instance, if k = 4 and the previous
sequence was (1, 2, 3, 4), then among the permissible permutations are
(2, 1, 3, 4), (1, 3, 2, 4), and (1, 2, 4, 3). A further, mainly aesthetic restriction
is that no sequence should be repeated, except that the last one is identical
with the initial sequence. A typical solution to this problem is, for
instance, the Plain Bob that starts with (1, 2, 3, 4), (2, 1, 4, 3), (2, 4, 1, 3), ...
and continues until all permutations in S4 are visited.
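The Plain Bob scheme can be written out as a short program. The sketch below (our illustration, using the standard Plain Bob Minimus pattern on k = 4 bells; function names are ours) alternates the "cross" change, which swaps the two outer adjacent pairs, with a swap of the middle pair, and at every eighth change swaps the last pair instead, so that no row repeats until all 24 permutations have been rung:

```python
# Generate the rows of Plain Bob Minimus on 4 bells; every change swaps
# only adjacent pairs, as required by the heavy swinging bells.

def swap(row, i, j):
    row = list(row)
    row[i], row[j] = row[j], row[i]
    return tuple(row)

cross = lambda r: swap(swap(r, 0, 1), 2, 3)   # swaps positions 1-2 and 3-4
middle = lambda r: swap(r, 1, 2)              # swaps positions 2-3
last_pair = lambda r: swap(r, 2, 3)           # swaps positions 3-4 (lead end)

def plain_bob_minimus():
    rows = [(1, 2, 3, 4)]                     # "rounds", the initial melody
    for change in range(24):
        if change % 2 == 0:
            rows.append(cross(rows[-1]))
        elif change % 8 == 7:
            rows.append(last_pair(rows[-1]))  # avoid repeating a row
        else:
            rows.append(middle(rows[-1]))
    return rows

if __name__ == "__main__":
    rows = plain_bob_minimus()
    print(rows[:3])  # (1,2,3,4), (2,1,4,3), (2,4,1,3) as in the text
```

The 25th row equals the first, as the aesthetic rule demands, and the 24 rows in between exhaust S4.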
1.3.3 Representation of music
Many aspects of music can be embedded in a suitable algebraic module
(see e.g. Mazzola 1990a). Here are some examples:
1. Apart from glissando effects, the essential frequencies in most types of
music are of the form

ω = ω0 ∏_{i=1}^{K} p_i^{x_i}    (1.12)

where K < ∞, ω0 is a fixed basic frequency, the p_i are certain fixed prime
numbers and x_i ∈ Q. Thus,

λ = log ω = λ0 + ∑_{i=1}^{K} x_i λ_i    (1.13)

where λ0 = log ω0 and λ_i = log p_i (i ≥ 1). Let Λ = {λ : λ = λ0 + ∑_{i=1}^{K} x_i λ_i, x_i ∈ Q}
be the set of all log-frequencies generated this way. Then Λ is a
module over Q. Two typical examples are:

(a) ω0 = 440 Hz, K = 3, p1 = 2, p2 = 3, p3 = 5: This is the so-called
Euler module in which most Western music operates. An important
subset consists of the frequencies of just intonation with the pure intervals
octave (ratio of frequencies 2), fifth (ratio of frequencies 3/2)
and major third (ratio of frequencies 5/4):

λ = log ω = log 440 + x1 log 2 + x2 log 3 + x3 log 5    (1.14)

(x_i ∈ Z). The notes (frequencies) can then be represented by points
in the three-dimensional space of integers Z^3. Note that, using the notation
a = (a1, a2, a3) and b = (b1, b2, b3), the pitch obtained by the addition
c = a + b corresponds to the frequency ω0 · 2^{a1+b1} · 3^{a2+b2} · 5^{a3+b3}.

(b) ω0 = 440 Hz, K = 1, p1 = 2, and x = p/12 where p ∈ Z: This
corresponds to the well-tempered tuning, where an octave is divided
into equal intervals. Thus, the ratio 2 is decomposed into 12 ratios
2^{1/12}, so that

λ = log 440 + (p/12) log 2    (1.15)

If notes that differ by one or several octaves are considered equivalent,
then we can identify the set of notes with the Z-module Z12 =
{0, 1, ..., 11}.
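As a small numerical illustration (ours, assuming the base frequency ω0 = 440 Hz from the text), the two tunings can be compared directly; the just fifth (ratio 3/2) and the tempered fifth 2^{7/12} differ slightly:

```python
# Frequencies in the Euler module (just intonation) vs. well-tempered tuning.
OMEGA0 = 440.0  # basic frequency from the text

def euler_frequency(x1, x2, x3):
    """Frequency omega0 * 2^x1 * 3^x2 * 5^x3 for integer coordinates."""
    return OMEGA0 * (2.0 ** x1) * (3.0 ** x2) * (5.0 ** x3)

def tempered_frequency(p):
    """Frequency omega0 * 2^(p/12), i.e. p tempered semitones above omega0."""
    return OMEGA0 * 2.0 ** (p / 12.0)

if __name__ == "__main__":
    just_fifth = euler_frequency(-1, 1, 0)  # ratio 3/2 -> 660.0 Hz
    tempered_fifth = tempered_frequency(7)  # ratio 2^(7/12)
    print(just_fifth, tempered_fifth)
```

The tiny discrepancy between the two fifths is the well-known price of equal temperament.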

2. Consider a finite module of notes (frequencies), such as for instance the
well-tempered module M = Z12. Then a scale is an element of S =
{(x1, ..., xk) : k ≤ |M|, xi ∈ M, xi ≠ xj (i ≠ j)}, the set of all finite
vectors with different components.
1.3.4 Classification of circular chords and other musical objects
A central element of the classical theory of harmony is the triad. An algebraic
property that distinguishes harmonically important triads from other
chords can be described as follows: let x1, x2, x3 ∈ Z12, such that (a) xi ≠ xj
(i ≠ j) and (b) there is an inner symmetry g : Z12 → Z12 such that
{y : y = g^k(x1), k ∈ N} = {x1, x2, x3}. It can be shown that all chords
(x1, x2, x3) for which (a) and (b) hold are standard chords that are harmonically
important in the traditional theory of harmony. Consider for instance
the major triad (c, e, g) = (0, 4, 7) and the minor triad (c, e♭, g) = (0, 3, 7).
For the first triad, the symmetry g(x) = 3x + 7 yields the desired result:
g(0) = 7 = g, g(7) = 4 = e and g(4) = 7 = g. For the minor triad the
only inner symmetry is g(x) = 3x + 3, with g(7) = 0 = c, g(0) = 3 = e♭
and g(3) = 0 = c. This type of classification of chords can be carried over
to more complicated configurations of notes (see e.g. Mazzola 1990a, 2002,
Straub 1989). In particular, musical scales can be classified by comparing
their inner symmetries.
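The two triad computations can be replayed in a few lines (our sketch; `orbit` is our own helper):

```python
# Orbits {g^k(x)} of affine maps g(x) = a*x + b on Z12, reproducing the
# major- and minor-triad examples from the text.

def orbit(a, b, start):
    """Orbit of `start` under repeated application of g(x) = a*x + b (mod 12)."""
    result, x = set(), start
    while x not in result:
        result.add(x)
        x = (a * x + b) % 12
    return result

if __name__ == "__main__":
    print(sorted(orbit(3, 7, 0)))  # major triad (c, e, g) = [0, 4, 7]
    print(sorted(orbit(3, 3, 7)))  # minor triad (c, e-flat, g) = [0, 3, 7]
```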
1.3.5 Torus of thirds
Consider the group G = (Z12, +) of pitches modulo octave. Then G is
isomorphic to the direct sum of the Sylow groups Z3 and Z4, by applying
the isomorphism

g : Z12 → Z3 + Z4,    (1.16)
x ↦ y = (y1, y2) = (x mod 3, x mod 4)    (1.17)

Geometrically, the elements of Z3 + Z4 can be represented as points on
a torus, y1 representing the position on the vertical meridian and y2 the
position on the horizontal equatorial circle (Figure 1.8). This representation
has a musical meaning: a movement along a meridian corresponds to a
major third, whereas a movement along a horizontal circle corresponds to
a minor third. One can then define the torus-distance dtorus(x, y) by
equating it to the minimal number of steps needed to move from x to y.
The value of dtorus(x, y) expresses in how far there is a third-relationship
between x and y. The possible values of dtorus are 0 (if x = y), 1, 2, and
3 (smallest third-relationship). Note that dtorus can be decomposed into
d3 + d4, where d3 counts the number of meridian steps and d4 the number
of equatorial steps.
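The decomposition dtorus = d3 + d4 gives an immediate implementation (our sketch, not from the book):

```python
# Torus distance on Z3 + Z4: meridian steps (major thirds) plus
# equatorial steps (minor thirds), for pitch classes x, y in Z12.

def d_torus(x, y):
    d3 = min((x - y) % 3, (y - x) % 3)  # steps on the Z3 meridian
    d4 = min((x - y) % 4, (y - x) % 4)  # steps on the Z4 equator
    return d3 + d4

if __name__ == "__main__":
    print(d_torus(0, 4))  # major third: distance 1
    print(d_torus(0, 3))  # minor third: distance 1
    print(d_torus(0, 6))  # tritone: distance 2
```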
1.3.6 Transformations
For suitably chosen integers p1, p2, p3, p4, consider the four-dimensional
module M = Zp1 × Zp2 × Zp3 × Zp4 over Z, where the coordinates represent
onset time, pitch (well-tempered tuning if p2 = 12), duration, and
volume. Transformations in this space play an essential role in music. A selection
of historically relevant transformations used by classical composers
is summarized in Table 1.1 (also see Figure 1.13).
Generally, one may say that affine transformations are most important,
and among these the invertible ones. In particular, it can be shown that each
symmetry of Z12 can be written as a product (in the group of symmetries
Symm(Z12)) of the following musically meaningful transformations:
- Multiplication by −1 (inversion);
- Multiplication by 5 (ordering of notes according to the circle of fourths);
- Addition of 3 (transposition by a minor third);
- Addition of 4 (transposition by a major third).
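The factorization claim can be checked by brute force (our sketch, not from the book): generate the closure of the four transformations, represented by their value tables on Z12, and compare its size with the 48 invertible affine maps x → ax + b (a ∈ {1, 5, 7, 11}, b ∈ Z12):

```python
# Check that the four listed transformations generate all 48 invertible
# affine symmetries of Z12.

def as_table(a, b):
    """Represent x -> a*x + b (mod 12) by its value table on Z12."""
    return tuple((a * x + b) % 12 for x in range(12))

GENERATORS = [
    as_table(-1, 0),  # inversion
    as_table(5, 0),   # circle of fourths
    as_table(1, 3),   # transposition by a minor third
    as_table(1, 4),   # transposition by a major third
]

def closure(generators):
    """Breadth-first closure of the generators under composition."""
    identity = tuple(range(12))
    group, frontier = {identity}, [identity]
    while frontier:
        new = []
        for g in frontier:
            for s in generators:
                h = tuple(s[g[x]] for x in range(12))
                if h not in group:
                    group.add(h)
                    new.append(h)
        frontier = new
    return group

if __name__ == "__main__":
    print(len(closure(GENERATORS)))  # number of generated symmetries
```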
All these transformations have been used by composers for many centuries.
Some examples of apparent similarities between groups of notes (or motifs)
are shown in Figures 1.10 through 1.12. In order not to clutter the pictures,
only a small selection of similar motifs is marked. In dodecaphonic
and serial music, transformation groups have been applied systematically
(see e.g. Figure 1.9). For instance, in Schönberg's Orchestervariationen op.
Table 1.1 Some affine transformations used in classical music

Function                                              Musical meaning

Shift: f(x) = x + a                                   Transposition, repetition,
                                                      change of duration,
                                                      change of loudness

Shear, e.g. of x = (x1, ..., x4)^t                    Arpeggio
w.r.t. the line y = o + t · (0, 1, 0, 0):
f(x) = x + a · (0, 1, 0, 0) for x not on the line,
f(x) = x for x on the line

Reflection, e.g. w.r.t. v = (a, 0, 0, 0):             Retrograde, inversion
f(x) = (a − (x1 − a), x2, x3, x4)

Dilatation, e.g. w.r.t. pitch:                        Augmentation
f(x) = (x1, a · x2, x3, x4)

Exchange of coordinates:                              Exchange of parameters
f(x) = (x2, x1, x3, x4)                               (20th century)

31, the full orbit generated by inversion, retrograde and transposition is
used. Webern used 12-tone series that are diagonally symmetric in the
two-dimensional space spanned by pitch and onset time. Other famous examples
include Eimert's rotation by 45 degrees together with a dilatation
by √2 (Eimert 1964), and serial compositions such as Boulez's Structures
and Stockhausen's Kontra-Punkte. With advanced computer technology
(e.g. composition soft- and hardware such as Xenakis' UPIC graphics/computer
system or the recently developed Presto software by Mazzola
1989/1994), the application of affine transformations in musical spaces of
arbitrary dimension is no longer the tedious work of the early dodecaphonic
era. On the contrary, the practical ease and enormous artistic flexibility
lead to an increasing popularity of computer-aided transformations among
contemporary composers (see e.g. Iannis Xenakis, Kurt Dahlke, Wilfried
Jentzsch, Guerino Mazzola 1990b, Dieter Salbert, Karl-Heinz Schöppner,
Tamas Ungvary, Jan Beran 1987, 1991, 1992, 2000; cf. Figure 1.14).
Figure 1.6 W.A. Mozart (1756-1791) (authorship uncertain): Spiegel-Duett.
Figure 1.7 Wolfgang Amadeus Mozart (1756-1791). (Engraving by F. Müller after a painting by J.W. Schmidt; courtesy of Zentralbibliothek Zürich.)

Figure 1.8 The torus of thirds Z3 + Z4 .
Figure 1.9 Arnold Schönberg: Sketch for the piano concerto op. 42, notes with tone row and its inversions and transpositions. (Used by permission of Belmont Music Publishers.)

Figure 1.10 Notes of Air by Henry Purcell. (For better visibility, only a small
selection of related motifs is marked.)
Figure 1.11 Notes of Fugue No. 1 (first half) from Das Wohltemperierte Klavier by J.S. Bach. (For better visibility, only a small selection of related motifs is marked.)

Figure 1.12 Notes of op. 68, No. 2 from Album für die Jugend by Robert Schumann. (For better visibility, only a small selection of related motifs is marked.)
Figure 1.13 A miraculous transformation caused by high exposure to Wagner operas. (Caricature from a 19th century newspaper; courtesy of Zentralbibliothek Zürich.)

Figure 1.14 Graphical representation of pitch and onset time in Z71 × Z71, together with an instrumentation of polygonal areas. (Excerpt from Sánti, Piano concert No. 2 by Jan Beran, col legno CD 20062; courtesy of col legno, Germany.)
Figure 1.15 Iannis Xenakis (1922-1998). (Courtesy of Philippe Gontier, Paris.)

Figure 1.16 Ludwig van Beethoven (1770-1827). (Courtesy of Zentralbibliothek Zürich.)
CHAPTER 2

Exploratory data mining in musical spaces

2.1 Musical motivation
The primary aim of descriptive statistics is to summarize data by a small
set of numbers or graphical displays, with the purpose of finding typical
relevant features. An in-depth descriptive analysis explores the data as far
as possible in the hope of finding anything interesting. This activity is
therefore also called exploratory data analysis (EDA; see Tukey 1977),
or data mining. EDA does not require a priori model assumptions; the
purpose is simply free exploration. Many exploratory tools are, however,
inspired by probabilistic models and designed to detect features that may
be captured by these.
Descriptive or exploratory analysis is of special interest in music. The
reason is that in music very subtle local changes play an important role.
For instance, a good pianist may achieve a desired emotional effect by slight
local variations of tempo, dynamics, etc. Composers are able to do the same
by applying subtle variations. Extreme examples of small gradual changes
can be found, for instance, in minimal music (e.g. Reich, Glass, Riley). As a
result, observed data consist of a dominating deterministic component plus
many other very subtle (and presumably also deterministic, i.e. intended)
components. Thus, because of their subtle nature, many musically relevant
features are difficult to detect and can often be identified in a descriptive
way only, for instance by suitable graphical displays. A formal statistical
proof that these features are indeed real, and not just accidental, is then
only possible if more similar data are collected.
To illustrate this, consider the tempo curves of three performances of
Robert Schumann's (1810-1856) Träumerei by Vladimir Horowitz (1903-1989),
displayed in Figure 2.2. It is obvious that the three curves are very
similar, even with respect to small details. However, since these details are
of a local nature and we observed only three performances, it is not an easy
task to show formally (by statistical hypothesis testing or confidence intervals)
that, apart from an overall smooth trend, Horowitz's tempo variations
are not random. An even more difficult task is to explain these features,
i.e. to attach an explicit musical meaning to the local tempo changes.
Figure 2.1 Robert Schumann (1810-1856): Träumerei op. 15, No. 7.
Figure 2.2 Tempo curves (log(tempo) versus onset time) of Schumann's Träumerei performed by Vladimir Horowitz in 1947, 1963, and 1965.

2.2 Some descriptive statistics and plots for univariate data

2.2.1 Definitions
We give a brief summary of univariate descriptive statistics. For a comprehensive
discussion we refer the reader to standard text books such as
Tukey (1977), Mosteller and Tukey (1977), Hoaglin (1977), Tufte (1977),
Velleman and Hoaglin (1981), Chambers et al. (1983), and Cleveland (1985).
Suppose that we observe univariate data x1 , x2 , ..., xn . To summarize
general characteristics of the data, various numerical summary statistics
can be calculated. Essential features are in particular center (location),
variability, asymmetry, shape of distribution, and location of unusual values
(outliers). The most frequently used statistics are listed in Table 2.1.
We recall a few well known properties of these statistics:
Sample mean: The sample mean can be understood as the center of
gravity of the data, whereas the median divides the sample in two halves
Table 2.1 Simple descriptive statistics

Name                           Definition                                      Feature measured

Empirical distribution fct.    Fn(x) = n^{-1} ∑_{i=1}^{n} 1{xi ≤ x}            Proportion of obs. ≤ x
Minimum                        xmin = min{x1, ..., xn}                         Smallest value
Maximum                        xmax = max{x1, ..., xn}                         Largest value
Range                          xrange = xmax − xmin                            Total spread
Sample mean                    x̄ = n^{-1} ∑_{i=1}^{n} xi                      Center
Sample median                  M = inf{x : Fn(x) ≥ 1/2}                        Center
Sample α-quantile              qα = inf{x : Fn(x) ≥ α}                         Border of lower 100α%
Lower and upper quartile       Q1 = q_{1/4}, Q2 = q_{3/4}                      Border of lower 25%, upper 75%
Sample variance                s² = (n − 1)^{-1} ∑_{i=1}^{n} (xi − x̄)²        Variability
Sample standard deviation      s = +√s²                                        Variability
Interquartile range            IQR = Q2 − Q1                                   Variability
Sample skewness                m3 = n^{-1} ∑_{i=1}^{n} [(xi − x̄)/s]³          Asymmetry
Sample kurtosis                m4 = n^{-1} ∑_{i=1}^{n} [(xi − x̄)/s]⁴ − 3      Flat/sharp peak

with an (approximately) equal number of observations. In contrast to the
median, the mean is sensitive to outliers, since observations that are far
from the majority of the data have a strong influence on its value.
Sample standard deviation: The sample standard deviation is a measure
of variability. In contrast to the variance, s is directly comparable with
the data, since it is measured in the same unit. If observations are drawn
independently from the same normal probability distribution (or a distribution
that is similar to a normal distribution), then the following rule
of thumb applies: (a) approximately 68% of the data are in the interval
x̄ ± s; (b) approximately 95% of the data are in the interval x̄ ± 2s; (c)
almost all data are in the interval x̄ ± 3s. For a sufficiently large sample
size, these conclusions can be carried over to the population from which
the data were drawn.
Interquartile range: The interquartile range also measures variability. Its
advantage, compared to s, is that it is much less sensitive to outliers. If
the observations are drawn from the same normal probability distribution,
then IQR/1.35 (or more precisely IQR/[Φ^{-1}(0.75) − Φ^{-1}(0.25)],
where Φ^{-1} is the quantile function of the standard normal distribution)
estimates the same quantity as s, namely the population standard deviation.
Quantiles: For α = i/n (i = 1, ..., n), qα coincides with at least one observation.
For other values of α, qα can be defined as in Table 2.1 or,
alternatively, by interpolating neighboring observed values as follows: let
α1 = i/n < α < α2 = (i + 1)/n. Then the interpolated quantile q̃α is defined by

q̃α = qα1 + (α − α1)/(1/n) · (qα2 − qα1)    (2.1)

Note that a slightly different convention used by some statisticians is to
call inf{x : Fn(x) ≥ α} the (α − 0.5/n)-quantile (see e.g. Chambers et al.
1983).
Skewness: Skewness measures symmetry/asymmetry. For exactly symmetric
data, m3 = 0; for data with a long right tail, m3 > 0; for data
with a long left tail, m3 < 0.
Kurtosis: The kurtosis is mainly meaningful for unimodal distributions,
i.e. distributions with one peak. For a sample from a normal distribution,
m4 ≈ 0. The reason is that then E[(X − μ)⁴] = 3σ⁴, where μ = E(X).
For samples from unimodal distributions with a sharper or flatter peak
than the normal distribution, we then tend to have m4 > 0 and m4 < 0,
respectively.
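The statistics of Table 2.1 are easy to compute directly. The following sketch (our Python illustration; the function names are ours) implements them for a sample, using the quantile convention qα = inf{x : Fn(x) ≥ α}:

```python
import math

def quantile(xs, alpha):
    """q_alpha = inf{x : Fn(x) >= alpha}: the k-th order statistic,
    where k is the smallest integer with k/n >= alpha."""
    s = sorted(xs)
    k = math.ceil(alpha * len(s))
    return s[max(k, 1) - 1]

def summary(xs):
    """Center, variability, and shape statistics from Table 2.1."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    q1, med, q2 = (quantile(xs, a) for a in (0.25, 0.5, 0.75))
    m3 = sum(((x - mean) / s) ** 3 for x in xs) / n
    m4 = sum(((x - mean) / s) ** 4 for x in xs) / n - 3
    return {"mean": mean, "median": med, "s": s,
            "Q1": q1, "Q2": q2, "IQR": q2 - q1,
            "skewness": m3, "kurtosis": m4}

if __name__ == "__main__":
    print(summary([1, 2, 2, 3, 3, 3, 4, 10]))
```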
Simple, but very useful graphical displays are:
Histogram: 1. Divide an interval (a, b] that includes all observations into
disjoint intervals I1 = (a1, b1], ..., Ik = (ak, bk]. 2. Let n1, ..., nk be the
number of observations in the intervals I1, ..., Ik respectively. 3. Above
each interval Ij, plot a rectangle of width wj = bj − aj and height
hj = nj/wj. Instead of the absolute frequencies, one can also use relative
frequencies nj/n, where n = n1 + ... + nk. The essential point is that the
area is proportional to nj. If the data are drawn from a probability
distribution with density function f, then the histogram is an estimate
of f.
Kernel estimate of a density function: The histogram is a step function,
and in that sense does not resemble most density functions. This can be
improved as follows. If the data are realizations of a continuous random
variable X with distribution F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(u)du, then a
smooth estimate of the probability density function f can be defined by
a kernel estimate (Rosenblatt 1956, Parzen 1962, Silverman 1986) of the
form

f̂(x) = (nb)^{-1} ∑_{i=1}^{n} K((xi − x)/b)    (2.2)

where K(u) = K(−u) ≥ 0 and ∫ K(u)du = 1. Most kernels used in
practice also satisfy the condition K(u) = 0 for |u| > 1. The bandwidth
b then specifies which data in the neighborhood of x are used
to estimate f(x). In situations where one has partial knowledge of the
shape of f, one may incorporate this into the estimation procedure. For
instance, Hjort and Glad (2002) combine parametric estimation based
on a preliminary density function f(x; θ̂) with kernel smoothing of the
remaining density f/f(x; θ̂). They show that major efficiency gains
can be achieved if the preliminary model is close to the truth.
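As an illustration, the estimate (2.2) can be implemented in a few lines. The sketch below (ours, not from the book) uses the Epanechnikov kernel K(u) = 0.75(1 − u²) for |u| ≤ 1, which is symmetric, integrates to one, and vanishes for |u| > 1:

```python
def epanechnikov(u):
    """Epanechnikov kernel: symmetric, nonnegative, zero outside [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kde(data, x, b):
    """Kernel density estimate (2.2) at point x with bandwidth b."""
    n = len(data)
    return sum(epanechnikov((xi - x) / b) for xi in data) / (n * b)

if __name__ == "__main__":
    data = [0.1, 0.3, 0.35, 0.8]
    print(kde(data, 0.3, 0.2))  # smooth density estimate at x = 0.3
```

Only observations within one bandwidth of x contribute, which is exactly the role of b described above.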
Barchart: If data can assume only a few different values, or if data are
qualitative (i.e. we only record which category an item belongs to), then
one can plot the possible values or names of categories on the x-axis and
on the vertical axis the corresponding (relative) frequencies.
Boxplot (simple version): 1. Calculate Q1, M, Q2 and IQR = Q2 − Q1.
2. Draw parallel lines (in principle of arbitrary length) at the levels
Q1, M, Q2, A1 = Q1 − (3/2)·IQR, A2 = Q2 + (3/2)·IQR, B1 = Q1 − 3·IQR and
B2 = Q2 + 3·IQR. The points A1, A2 are called inner fence, and B1, B2
are called outer fence. 3. Identify the observation(s) between Q1 and A1
that is closest to A1 and draw a line connecting Q1 with this point. Do
the same for Q2 and A2. 4. Identify observation(s) between A1 and B1
and draw points (or other symbols) at those places. Do the same for
A2 and B2. 5. Draw points (or other symbols) for observations beyond
B1 and B2 respectively. The boxplot can be interpreted as follows: the
relative positions of Q1, M, Q2 and the inner and outer fences indicate
symmetry or asymmetry. Moreover, the distance between Q1 and Q2 is
the IQR and thus measures variability. The inner and outer fences help
to identify outliers, i.e. values lying unusually far from most of the other
observations.
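Step 2 of the construction can be sketched as follows (our illustration; it assumes the quartiles Q1, Q2 have already been computed):

```python
def fences(q1, q2):
    """Inner and outer fences of the boxplot from the two quartiles."""
    iqr = q2 - q1
    inner = (q1 - 1.5 * iqr, q2 + 1.5 * iqr)  # A1, A2
    outer = (q1 - 3.0 * iqr, q2 + 3.0 * iqr)  # B1, B2
    return inner, outer

if __name__ == "__main__":
    print(fences(2.0, 4.0))  # inner (-1.0, 7.0), outer (-4.0, 10.0)
```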
Q-q-plot for comparing two data sets x1, ..., xn and y1, ..., ym: 1. Define
a certain number of points 0 < p1 < ... < pk ≤ 1 (the standard choice is
pi = (i − 0.5)/N, where N = min(n, m)). 2. Plot the pi-quantiles (i = 1, ..., N)
of the y-observations versus those of the x-observations. Alternative
plots for comparing distributions are discussed e.g. in Ghosh and Beran
(2000) and Ghosh (1996, 1999).
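The quantile pairing in step 2 can be sketched as follows (our illustration, using the quantile convention q_p = inf{x : Fn(x) ≥ p} from Table 2.1):

```python
import math

def quantile(xs, p):
    """q_p = inf{x : Fn(x) >= p}."""
    s = sorted(xs)
    k = math.ceil(p * len(s))
    return s[max(k, 1) - 1]

def qq_pairs(x, y):
    """Pairs (x-quantile, y-quantile) at p_i = (i - 0.5)/N, N = min(n, m)."""
    n_points = min(len(x), len(y))
    ps = [(i - 0.5) / n_points for i in range(1, n_points + 1)]
    return [(quantile(x, p), quantile(y, p)) for p in ps]

if __name__ == "__main__":
    print(qq_pairs([1, 2, 3, 4], [10, 20, 30]))
```

Plotting these pairs against each other gives the q-q-plot; identical (marginal) distributions produce points on the diagonal.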
2.3 Specific applications in music: univariate

2.3.1 Tempo curves
Figure 2.3 displays 28 tempo curves for performances of Schumann's Träumerei
op. 15, No. 7, by 24 pianists. The names of the pianists and dates
of the recordings (in brackets) are Martha Argerich (before 1983), Claudio
Arrau (1974), Vladimir Ashkenazy (1987), Alfred Brendel (before 1980),
Stanislav Bunin (1988), Sylvia Capova (before 1987), Alfred Cortot (1935,
1947 and 1953), Clifford Curzon (about 1955), Fanny Davies (1929), Jörg
Demus (about 1960), Christoph Eschenbach (before 1966), Reine Gianoli
(1974), Vladimir Horowitz (1947, before 1963 and 1965), Cyprien Katsaris
(1980), Walter Klien (date unknown), André Krust (about 1960), Antonin
Kubalek (1988), Benno Moiseiwitsch (about 1950), Elly Ney (about 1935),
Guiomar Novaes (before 1954), Cristina Ortiz (before 1988), Artur Schnabel
(1947), Howard Shelley (before 1990), and Yakov Zak (about 1960).
Tempo is more likely to be varied in a relative rather than an absolute way.
For instance, a musician may play a certain passage twice as fast as the
previous one, but may care less about the exact absolute tempo. This suggests
considering the logarithm of tempo. Moreover, the main interest lies in
comparing the shapes of the curves. Therefore, the plotted curves consist
of standardized logarithmic tempo (each curve has sample mean zero and
variance one).
Schumann's Träumerei is divided into four main parts, each consisting
of about eight bars, the first two and the last one being almost identical
(see Figure 2.1). Thus, the structure is: A, A′, B, and A′′. Already a
very simple exploratory analysis reveals interesting features. For each pianist,
we calculate the following statistics for the four parts respectively:
x̄, M, s, Q1, Q2, m3 and m4. Figures 2.4a through e show a distinct pattern
that corresponds to the division into A, A′, B, and A′′. Tempo is much
lower in A′′ and generally highest in B. Also, A′ seems to be played at a
slightly slower tempo than A, though this distinction is not quite so clear
(Figures 2.4a,b). Tempo is varied most towards the end and considerably
less in the first half of the piece (Figure 2.4c). Skewness is generally negative,
which is due to occasional extreme ritardandi. This is most extreme
in part B and, again, least pronounced in the first half of the piece (A, A′).
A mirror image of this pattern, with the most extreme positive values in B,
is observed for kurtosis. This indicates that in B (and also in A′′), most
tempo values vary little around an average value, but occasionally extreme
tempo changes occur. Also, for A, there are two outliers with an extremely
negative skewness; these turn out to be Fanny Davies and Jörg Demus.
Figures 2.4f through h show another interesting comparison of boxplots.
In Figure 2.4f, the differences between the lower quartiles in A and A′
for performances before 1965 are compared with those from performances
recorded in 1965 or later. The clear difference indicates that, at least for the
Figure 2.3 Twenty-eight tempo curves of Schumann's Träumerei performed by 24 pianists. (For Cortot and Horowitz, three tempo curves were available.)

sample considered here, pianists of the modern era tend to make a much
stronger distinction between A and A′ in terms of slow tempi. The only
exceptions are Moiseiwitsch and Horowitz's first performance (outliers in
the left boxplot) and Ashkenazy (outlier in the right boxplot). The comparison
of skewness and kurtosis in Figures 2.4g and h also indicates that
modern pianists seem to prefer occasional extreme ritardandi. The only
exception in the early 20th century group is Artur Schnabel, with an
extreme skewness of −2.47 and a kurtosis of 7.04.
Figure 2.4 Boxplots of descriptive statistics for the 28 tempo curves in Figure 2.3.

Direct comparisons of tempo distributions are shown in Figures 2.5a
through f. The following observations can be made: a) compared to Demus
(quantiles on the horizontal axis), Ortiz has a few relatively extreme slow
tempi (Figure 2.5a); b) similarly, but in a less extreme way, Cortot's interpretation
includes occasional extremely slow tempo values (Figure 2.5b); c)
Ortiz and Argerich have practically the same (marginal) distribution (Figure
2.5c); d) Figure 2.5d is similar to 2.5a and b, though less extreme; e) the
tempo distribution of Cortot's performance (Figure 2.5e) did not change
much in 1947 compared to 1935; f) similarly, Horowitz's tempo distributions
in 1947 and 1963 are almost the same, except for slight changes for
very low tempi (Figure 2.5f).

Figure 2.5 q-q-plots of several tempo curves (from Figure 2.3): (a) Demus (1960) - Ortiz (1988); (b) Demus (1960) - Cortot (1935); (c) Ortiz (1988) - Argerich (1983); (d) Demus (1960) - Krust (1960); (e) Cortot (1935) - Cortot (1947); (f) Horowitz (1947) - Horowitz (1963).

2.3.2 Notes modulo 12
In most classical music, a central tone around which notes fluctuate can
be identified, and a small selected number of additional notes or chords
(often triads) play a special role. For instance, from about 400 to 1500
A.D., music was mostly written using so-called modes. The main notes
were the first one (finalis, the final note) and the fifth note of the scale
(dominant). The system of 12 major and 12 minor scales was developed
later, adding more flexibility with respect to modulation and scales. The
main representatives of a major/minor scale are three triads, obtained
by adding thirds, starting at the basic note corresponding to the first
(tonic), fourth (subdominant) and fifth (dominant) note of the scale respectively.
Other triads are also, but to a lesser degree, associated with the properties
tonic, subdominant and/or dominant. In the 20th century, and partially
already in the late 19th century, other systems of scales as well as systems
that do not rely on any specific scales were proposed (in particular 12-tone
music).

Figure 2.6 Frequencies of notes 0, 1, ..., 11 for moving windows of onset-length 16: (a) J.S. Bach, Fugue 1; (b) W.A. Mozart, KV 545; (c) R. Schumann, op. 15/2; (d) R. Schumann, op. 15/3. Horizontal axis: (Notes − Tonic) mod 12.

Figure 2.7 Frequencies of notes 0, 1, ..., 11 for moving windows of onset-length 16: (a) A. Scriabin, op. 51/2; (b) A. Scriabin, op. 51/4; (c) F. Martin, Prelude 6; (d) F. Martin, Prelude 7. Horizontal axis: (Notes − Tonic) mod 12.

A very simple illustration of this development can be obtained by counting
the frequencies of notes (pitches) in the following way: consider a score
in equal temperament. Ignoring transposition by octaves, we can represent
all notes x(t1), ..., x(tn) by the integers 0, 1, ..., 11. Here, t1 ≤ t2 ≤ ... ≤ tn
denote the score-onset times of the notes. To make different compositions
comparable, the notes are centered by subtracting the central note, which
is defined to be the most frequent note. Given a prespecified integer k (in
our case k = 16), we calculate the relative frequencies

p_j(x) = (2k + 1)⁻¹ Σ_{i=j}^{j+2k} 1{x(t_i) = x}

where 1{x(t_i) = x} = 1 if x(t_i) = x and zero otherwise, and j = 1, 2, ..., n − 2k. This means that we calculate the distribution of notes for a moving window of 2k + 1 notes. Figures 2.6a through d and 2.7a through d
display the distributions p_j(x) (j = 4, 8, ..., 64) for the following compositions: Fugue 1 from Das Wohltemperierte Klavier I by J.S. Bach (1685-1750), Sonata KV 545 (first movement) by W.A. Mozart (1756-1791; Figure
2.8), Kinderszenen No. 2 and 3 by R. Schumann (1810-1856; Figure 2.9),
Préludes op. 51, No. 2 and 4 by A. Scriabin (1872-1915) and Préludes No.


Figure 2.8 Johannes Chrysostomus Wolfgangus Theophilus Mozart (1756-1791)
in the house of Salomon Gessner in Zurich. (Courtesy of Zentralbibliothek
Zürich.)


Figure 2.9 R. Schumann (1810-1856), lithography by H. Bodmer. (Courtesy of
Zentralbibliothek Zürich.)


6 and 7 by F. Martin (1890-1971). For each j = 4, 8, ..., 64, the frequencies p_j(0), ..., p_j(11) are joined by lines respectively. The obvious common
feature for Bach, Mozart and Schumann is a distinct preference (local maximum) for the notes 5 and 7 (apart from 0). Note that if 0 is the root of
the tonic triad, then 5 corresponds to the root of the subdominant triad.
Similarly, 7 is the root of the dominant triad. Also relatively frequent are the
notes 3 = minor third (second note of the tonic triad in minor) and 10 = minor
seventh, which is the fourth note of the dominant seventh chord to the
subtonic. Also note that, for Schumann, the local maxima are somewhat
less pronounced. A different pattern can be observed for Scriabin and even
more for Martin. In Scriabin's Prélude op. 51/2, the perfect fifth almost
never occurs, but instead the major sixth is very frequent. In Scriabin's
Prélude op. 51/4, the tonal system is dissolved even further, as the clearly
dominating note is 6, which builds together with 0 the augmented fourth
(or diminished fifth), an interval that is considered highly dissonant in
tonal music. Nevertheless, even in Scriabin's compositions, the distribution
of notes does not change very rapidly, since the sixteen overlaid curves are
almost identical. This may indicate that the notion of scales or a slow harmonic development still plays a role. In contrast, in Frank Martin's Prélude
No. 6, the distribution changes very quickly. This is hardly surprising, since
Martin's style incorporates, among other influences, dodecaphonism (12-tone music), a compositional technique that does not impose traditional
restrictions on the harmonic structure.
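The moving-window note distributions behind Figures 2.6 and 2.7 are straightforward to recompute from a score. A plain-Python sketch (the pitch sequence below is an invented toy fragment, not taken from any of the scores discussed; the function name is our own):

```python
from collections import Counter

def moving_note_distributions(notes, k):
    """For a sequence of pitch classes (integers 0..11, already centered
    at the most frequent note), return the relative frequencies
    p_j(x) = (2k+1)^-1 * #{i in [j, j+2k] : notes[i] == x}
    for each window start j = 0, ..., n - (2k+1)."""
    w = 2 * k + 1
    n = len(notes)
    dists = []
    for j in range(n - w + 1):
        counts = Counter(notes[j:j + w])
        dists.append([counts[x] / w for x in range(12)])
    return dists

# Toy fragment alternating tonic (0), fifth (7), and fourth (5).
notes = [0, 7, 5, 0, 7, 0, 5, 7, 0, 4, 7, 0, 5, 2, 0]
dists = moving_note_distributions(notes, k=2)  # windows of 5 notes
```

Plotting each row of `dists` against the pitch classes 0, ..., 11 reproduces the kind of overlaid frequency curves shown in the figures.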
2.4 Some descriptive statistics and plots for bivariate data
2.4.1 Definitions
We give a short overview of important descriptive concepts for bivariate
data. For a comprehensive treatment we refer the reader to standard text
books given above (also see e.g. Plackett 1960, Ryan 1996, Srivastava and
Sen 1997, Draper and Smith 1998, and Rao 1973 for basic theoretical results).
Correlation
If each observation consists of a pair of measurements (x_i, y_i), then the main
objective is to investigate the relationship between x and y. Consider, for
example, the case where both variables are quantitative. The data can then
be displayed in a scatter plot (y versus x). Useful statistics are Pearson's
sample correlation

r = (1/n) Σ_{i=1}^n ((x_i − x̄)/s_x)((y_i − ȳ)/s_y) = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / √(Σ_{i=1}^n (x_i − x̄)² · Σ_{i=1}^n (y_i − ȳ)²)   (2.3)

where s_x² = n⁻¹ Σ_{i=1}^n (x_i − x̄)² and s_y² = n⁻¹ Σ_{i=1}^n (y_i − ȳ)², and Spearman's
rank correlation

r_Sp = (1/n) Σ_{i=1}^n ((u_i − ū)/s_u)((v_i − v̄)/s_v) = Σ_{i=1}^n (u_i − ū)(v_i − v̄) / √(Σ_{i=1}^n (u_i − ū)² · Σ_{i=1}^n (v_i − v̄)²)   (2.4)

where u_i denotes the rank of x_i among the x-values and v_i is the rank
of y_i among the y-values. In (2.3) and (2.4) it is assumed that s_x, s_y,
s_u and s_v are not zero. Recall that these definitions imply the following
properties: a) −1 ≤ r, r_Sp ≤ 1; b) r = 1, if and only if y_i = β_0 + β_1 x_i
and β_1 > 0 (exact linear relationship with positive slope); c) r = −1, if
and only if y_i = β_0 + β_1 x_i and β_1 < 0 (exact linear relationship with
negative slope); d) r_Sp = 1, if and only if x_i > x_j implies y_i > y_j (strictly
monotonically increasing relationship); e) r_Sp = −1, if and only if x_i >
x_j implies y_i < y_j (strictly monotonically decreasing relationship); f) r
measures the strength (and sign) of the linear relationship; g) r_Sp measures
the strength (and sign) of monotonicity; h) if the data are realizations of a
bivariate random variable (X, Y), then r is an estimate of the population
correlation ρ = cov(X, Y)/√(var(X)var(Y)) where cov(X, Y) = E[XY] −
E[X]E[Y], var(X) = cov(X, X) and var(Y) = cov(Y, Y). When using
these measures of dependence one should bear in mind that each of them
measures a specific type of dependence only, namely linear and monotonic
dependence respectively. Thus, a Pearson or Spearman correlation near
or equal to zero does not necessarily mean independence. Note also that
correlation can be interpreted in a geometric way as follows: defining the
n-dimensional vectors x = (x_1, ..., x_n)ᵗ and y = (y_1, ..., y_n)ᵗ, r is equal to
the standardized scalar product between x and y, and is therefore equal to
the cosine of the angle between these two vectors.
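As a concrete illustration of (2.3) and (2.4), the following plain-Python sketch computes both coefficients; the function names and the midrank handling of ties are our own choices, not prescribed by the text (in practice one would use a statistics library):

```python
import math

def pearson(x, y):
    """Sample correlation r as in (2.3)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(x):
    """Ranks 1..n, with midranks assigned to ties."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for t in range(i, j + 1):
            r[order[t]] = mid
        i = j + 1
    return r

def spearman(x, y):
    """Rank correlation r_Sp as in (2.4): Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))
```

For a strictly monotone but nonlinear relationship (e.g. y = x²) one gets r_Sp = 1 while |r| < 1, illustrating properties d) and f) above.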
A special type of correlation is interesting for time series. Time series are
data that are taken in a specific ordered (usually temporal) sequence. If
Y_1, Y_2, ..., Y_n are random variables observed at time points i = 1, ..., n, then
one would like to know whether there is any linear dependence between
observations Y_i and Y_{i−k}, i.e. between observations that are k time units
apart. If this dependence is the same for all time points i, and the expected
value of Y_i is constant, then the corresponding population correlation can
be written as a function of k only (see Chapter 4),

cov(Y_i, Y_{i+k}) / √(var(Y_i)var(Y_{i+k})) = ρ(k)   (2.5)

and a simple estimate of ρ(k) is the sample autocorrelation (acf)

ρ̂(k) = (1/n) Σ_{i=1}^{n−k} ((y_i − ȳ)/s)((y_{i+k} − ȳ)/s)   (2.6)

where s² = n⁻¹ Σ_{i=1}^n (y_i − ȳ)². Note that here summation stops at
n − k, because no data are available beyond (n − k) + k = n. For large lags
(large compared to n), ρ̂(k) is not a very precise estimate, since there are
only very few pairs that are k time units apart.


The definition of ρ(k) and ρ̂(k) can be extended to multivariate time
series, taking into account that dependence between different components
of the series may be delayed. For instance, for a bivariate time series (X_i, Y_i)
(i = 1, 2, ...), one considers lag-k sample cross-correlations

ρ̂_XY(k) = (1/n) Σ_{i=1}^{n−k} ((x_i − x̄)/s_X)((y_{i+k} − ȳ)/s_Y)   (2.7)

as estimates of the population cross-correlations

ρ_XY(k) = cov(X_i, Y_{i+k}) / √(var(X_i)var(Y_{i+k}))   (2.8)

where s_X² = n⁻¹ Σ_{i=1}^n (x_i − x̄)² and s_Y² = n⁻¹ Σ_{i=1}^n (y_i − ȳ)². If
|ρ̂_XY(k)| is high, then there is a strong linear dependence between X_i and
Y_{i+k}.
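A direct transcription of (2.6) and (2.7) might look as follows (plain Python; the function names `acf` and `ccf` are our own):

```python
import math

def acf(y, k):
    """Sample autocorrelation rho_hat(k) as in (2.6)."""
    n = len(y)
    my = sum(y) / n
    s2 = sum((v - my) ** 2 for v in y) / n
    return sum((y[i] - my) * (y[i + k] - my) for i in range(n - k)) / (n * s2)

def ccf(x, y, k):
    """Lag-k sample cross-correlation rho_hat_XY(k) as in (2.7)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x) / n)
    sy = math.sqrt(sum((v - my) ** 2 for v in y) / n)
    return sum((x[i] - mx) * (y[i + k] - my)
               for i in range(n - k)) / (n * sx * sy)
```

Note that `ccf(x, x, k)` coincides with `acf(x, k)`, i.e. the acf is the cross-correlation of a series with itself.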
Regression
In addition to measuring the strength of dependence between two variables,
one is often interested in finding an explicit functional relationship. For
instance, it may be possible to express the response variable y in terms of an
explanatory variable x by y = g(x, ε) where ε is a variable representing the
part of y that is unexplained. More specifically, we may have, for example,
an additive relationship y = g(x) + ε or a multiplicative equation y =
g(x)e^ε. The simplest relationship is given by the simple linear regression
equation

y = β_0 + β_1 x + ε   (2.9)

where ε is assumed to be a random variable with E(ε) = 0 (and usually
finite variance σ² = var(ε) < ∞). Thus, the data are y_i = β_0 + β_1 x_i + ε_i (i =
1, ..., n) where the ε_i's are generated by the same zero mean distribution.
Often the ε_i's are also assumed to be uncorrelated or even independent; this is
however not a necessary assumption. An obvious estimate of the unknown
parameters β_0 and β_1 is obtained by minimizing the total sum of squared
errors

SSE = SSE(b_0, b_1) = Σ_{i=1}^n (y_i − b_0 − b_1 x_i)² = Σ_{i=1}^n r_i²(b_0, b_1)   (2.10)
with respect to b_0, b_1. The solution is found by setting the partial derivatives
with respect to b_0 and b_1 equal to zero. A more elegant way to find the
solution is obtained by interpreting the problem geometrically: defining the
n-dimensional vectors 1 = (1, ..., 1)ᵗ, b = (b_0, b_1)ᵗ and the n × 2 matrix X
with columns 1 and x, we have SSE = ||y − b_0·1 − b_1·x||² = ||y − Xb||²,
where ||·|| denotes the euclidean norm, or length, of a vector. It
is then clear that SSE is minimized by the orthogonal projection of y on
the plane spanned by 1 and x. The estimate of β = (β_0, β_1)ᵗ is therefore

β̂ = (β̂_0, β̂_1)ᵗ = (XᵗX)⁻¹Xᵗy   (2.11)

and the projection, which is the vector of estimated values ŷ_i, is given by

ŷ = (ŷ_1, ..., ŷ_n)ᵗ = X(XᵗX)⁻¹Xᵗy   (2.12)
Defining the measure of the total variability of y, SST = ||y − ȳ·1||² (total
sum of squares), and the quantities SSR = ||ŷ − ȳ·1||² (regression sum of
squares = variability due to the fact that the fitted line is not horizontal)
and SSE = ||y − ŷ||² (error sum of squares = variability unexplained by the
regression line), we have by Pythagoras

SST = SSR + SSE   (2.13)

The proportion of variability explained by the regression line ŷ = β̂_0 + β̂_1 x
is therefore

R² = Σ_{i=1}^n (ŷ_i − ȳ)² / Σ_{i=1}^n (y_i − ȳ)² = ||ŷ − ȳ·1||² / ||y − ȳ·1||² = SSR/SST = 1 − SSE/SST.   (2.14)

By definition, 0 ≤ R² ≤ 1, and R² = 1 if and only if y_i = ŷ_i (i.e. all points
are on the regression line). Moreover, for simple regression we also have
R² = r². The advantage of defining R² as above (instead of via r²) is that
the definition remains valid for the multiple regression model (see below),
i.e. when several explanatory variables are available. Finally, note that an
estimate of σ² is obtained by σ̂² = (n − 2)⁻¹ Σ_{i=1}^n r_i²(β̂_0, β̂_1).
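For the simple (p = 1) case, the projection formula (2.11) reduces to closed-form expressions for the two coefficients; the hand-rolled sketch below also returns R² as in (2.14). An illustration only, not production code:

```python
def simple_ols(x, y):
    """Least squares fit y = b0 + b1*x; returns (b0, b1, R2).
    This is (2.11) written out for the simple (p = 1) case, where the
    normal equations have an explicit solution."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    yhat = [b0 + b1 * xi for xi in x]
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    sst = sum((yi - my) ** 2 for yi in y)
    return b0, b1, 1 - sse / sst  # R^2 = 1 - SSE/SST as in (2.14)
```

On data lying exactly on a line the function returns the line's intercept and slope with R² = 1, in line with the properties listed above.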

In analogy to the sample mean and the sample variance, the least squares
estimates of the regression parameters are sensitive to the presence of outliers. Outliers in regression can occur in the y-variable as well as in the
x-variable. The latter are also called influential points. Outliers may often
be correct and in fact very interesting observations (e.g. telling us that the
assumed model may not be correct). However, since least squares estimates
are highly influenced by outliers, it is often difficult to notice that there
may be a problem, since the fitted curve tends to lie close to the outliers.
Alternative, robust estimates can be helpful in such situations (see Huber
1981, Hampel et al. 1986). For instance, instead of minimizing the residual
sum of squares we may minimize Σ ρ(r_i/σ̂) where ρ is a bounded function.
If ρ is differentiable, then the solution can usually also be found by solving
the equations

Σ_{i=1}^n ψ(r_i/σ̂) (∂/∂b_j) r_i(b) = 0 (j = 0, ..., p)   (2.15)

where ψ is the derivative of ρ, σ̂² is a robust estimate of σ² obtained from an additional equation,
and p is the number of explanatory variables. This leads to estimates that
are (up to a certain degree) robust with respect to outliers in y, not however
with respect to influential points (outliers in x). To control the effect of
influential points one can, for instance, solve a set of equations

Σ_{i=1}^n ψ_j(r_i/σ̂, x_i) = 0 (j = 0, ..., p)   (2.16)

where ψ is such that it downweighs outliers in x as well. For a comprehensive theory of robustness see e.g. Huber (1981), Hampel et al. (1986).
For more recent, efficient and highly robust methods see Yohai (1987),
Rousseeuw and Yohai (1984), Gervini and Yohai (2002), and references
therein.
The results for simple linear regression can be extended easily to the case
where more than one explanatory variable is available. The multiple linear
regression model with p explanatory variables is defined by y = β_0 + β_1 x_1 +
... + β_p x_p + ε. For data we write y_i = β_0 + β_1 x_{i1} + ... + β_p x_{ip} + ε_i (i = 1, ..., n).
Note that the word "linear" refers to linearity in the parameters β_0, ..., β_p.
The function itself can be nonlinear. For instance, we may have polynomial
regression with y = β_0 + β_1 x + ... + β_p x^p + ε. The same geometric arguments
as above apply so that (2.11) and (2.12) hold with β = (β_0, ..., β_p)ᵗ, and
the n × (p + 1) matrix X = (x^(1), ..., x^(p+1)) with columns x^(1) = 1 and
x^(j+1) = x_j = (x_{1j}, ..., x_{nj})ᵗ (j = 1, ..., p).
Regression smoothing
A more general, but more difficult, approach to modeling a functional relationship is to impose less restrictive assumptions on the function g. For
instance, we may assume

y = g(x) + ε   (2.17)

with g being a twice continuously differentiable function. Under suitable
additional conditions on x and ε it is then possible to estimate g from
observed data by nonparametric smoothing. As a special example consider
observations y_i taken at time points i = 1, 2, ..., n. A standard model is

y_i = g(t_i) + ε_i   (2.18)

where t_i = i/n and the ε_i are independent identically distributed (iid) random
variables with E(ε_i) = 0 and σ² = var(ε_i) < ∞. The reason for using
standardized time t_i ∈ [0, 1] is that this way g is observed on an increasingly
fine grid. This makes it possible to ultimately estimate g(t) for all values
of t by using neighboring values t_i, provided that g is not too "wild". A
simple estimate of g can be obtained, for instance, by a weighted average
(kernel smoothing)

ĝ(t) = Σ_{i=1}^n w_i y_i   (2.19)

with suitable weights w_i ≥ 0, Σ w_i = 1. For example, one may use the
Nadaraya-Watson weights

w_i = w_i(t; b, n) = K((t − t_i)/b) / Σ_{j=1}^n K((t − t_j)/b)   (2.20)

with b > 0, and a kernel function K ≥ 0 such that K(u) = K(−u), K(u) =
0 (|u| > 1) and ∫_{−1}^{1} K(u)du = 1. The role of b is to restrict observations
that influence the estimate to a small window of neighboring time points.
For instance, the rectangular kernel K(u) = (1/2)·1{|u| ≤ 1} yields the sample
mean of observations y_i in the window n(t − b) ≤ i ≤ n(t + b). An even
more elegant formula can be obtained by approximating the Riemann sum
(nb)⁻¹ Σ_{j=1}^n K((t − t_j)/b) by the integral ∫_{−1}^{1} K(u)du = 1:

ĝ(t) = Σ_{i=1}^n w_i y_i = (nb)⁻¹ Σ_{i=1}^n K((t − t_i)/b) y_i   (2.21)

In this case, the sum of the weights is not exactly equal to one, but asymptotically (as n → ∞ and b → 0 such that nb³ → ∞) this error is negligible.
It can be shown that, under fairly general conditions on g and ε, ĝ converges to g, in a certain sense that depends on the specific assumptions (see
e.g. Gasser and Müller 1979, Gasser and Müller 1984, Härdle 1991, Beran
and Feng 2002, Wand and Jones 1995, and references therein).
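A minimal sketch of the Nadaraya-Watson estimator (2.19)-(2.20). The choice of the Epanechnikov kernel K(u) = (3/4)(1 − u²) on [−1, 1] and the toy data are our own; the text leaves K general:

```python
def nw_smooth(t, times, y, b):
    """Nadaraya-Watson estimate g_hat(t) as in (2.19)-(2.20),
    with the Epanechnikov kernel K(u) = 0.75*(1 - u^2) on [-1, 1]."""
    def K(u):
        return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0
    weights = [K((t - ti) / b) for ti in times]
    total = sum(weights)
    if total == 0:
        raise ValueError("no observations within bandwidth b of t")
    return sum(w * yi for w, yi in zip(weights, y)) / total

# Noisy observations around the constant g(t) = 5 on the grid t_i = i/n.
n = 50
times = [(i + 1) / n for i in range(n)]
y = [5.0 + (0.1 if i % 2 else -0.1) for i in range(n)]
g_hat = nw_smooth(0.5, times, y, b=0.2)
```

Because the weights are normalized explicitly, this variant corresponds to (2.20) rather than the Riemann-sum approximation (2.21); the two agree asymptotically as discussed above.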
An alternative to kernel smoothing is local polynomial fitting (Fan and
Gijbels 1995, 1996; also see Feng 1999). The idea is to fit a polynomial
locally, i.e. to data in a small neighborhood of the point of interest. This
can be formulated as a weighted least squares problem as follows:

ĝ(t) = β̂_0   (2.22)

where β̂ = (β̂_0, β̂_1, ..., β̂_p)ᵗ solves the local least squares problem defined by

β̂ = argmin_a Σ_{i=1}^n K((t_i − t)/b) r_i²(a).   (2.23)

Here r_i = y_i − [a_0 + a_1(t_i − t) + ... + a_p(t_i − t)^p], K is a kernel as above and
b > 0 is the bandwidth defining the window of neighboring observations.
It can be shown that asymptotically, a local polynomial smoother can be
written as a kernel estimator (Ruppert and Wand 1994). A difference only
occurs at the borders (t close to 0 or 1) where, in contrast to the local
polynomial estimate, the kernel smoother has to be modified. The reason
is that observations are no longer symmetrically spaced in the window
(t − b, t + b). A major advantage of local polynomials is that they automatically
provide estimates of derivatives, namely ĝ′(t) = β̂_1, ĝ″(t) = 2β̂_2 etc. Kernel
smoothing can also be used for estimation of derivatives; however different
(and rather complicated) kernels have to be used for each derivative (Gasser
and Müller 1984, Gasser et al. 1985). A third alternative, so-called wavelet
thresholding, will not be discussed here (see e.g. Daubechies 1992, Donoho
and Johnstone 1995, 1998, Donoho et al. 1995, 1996, Vidakovic 1999, and
Percival and Walden 2000 and references therein). A related method based
on wavelets is discussed in Chapter 5.
Smoothing of two-dimensional distributions, sharpening
Estimating a relationship between x and y (where x and y are realizations
of random variables X and Y respectively) amounts to estimating the joint
two-dimensional distribution function F(x, y) = P(X ≤ x, Y ≤ y). For
continuous variables with F(x, y) = ∫_{u≤x} ∫_{v≤y} f(u, v) du dv, the density
function f can be estimated, for instance, by a two-dimensional histogram.
For visual and theoretical reasons, a better estimate is obtained by kernel
estimation (see e.g. Silverman 1986) defined by

f̂(x, y) = (nb_1 b_2)⁻¹ Σ_{i=1}^n K(x_i − x, y_i − y; b_1, b_2)   (2.24)

where the kernel K is such that K(u, v) = K(−u, v) = K(u, −v) ≥ 0, and
∫∫ K(u, v) du dv = 1. Usually, b_1 = b_2 = b and K(u, v) has compact support. Examples of kernels are K(u, v) = (1/4)·1{|u| ≤ 1}1{|v| ≤ 1} (rectangular
kernel with rectangular support), K(u, v) = π⁻¹·1{u² + v² ≤ 1} (rectangular kernel with circular support), K(u, v) = 2π⁻¹·[1 − u² − v²]·1{u² + v² ≤ 1} (Epanechnikov
kernel with circular support) or K(u, v) = (2π)⁻¹ exp[−(u² + v²)/2] (normal density kernel with infinite support). In analogy to one-dimensional
density estimation, it can be shown that under mild regularity conditions,
f̂(x, y) is a consistent estimate of f(x, y), provided that b_1, b_2 → 0, and
nb_1, nb_2 → ∞.
Graphical representations of two-dimensional distribution functions are:
- 3-dimensional perspective plot: z = f(x, y) (or f̂(x, y)) is plotted against
x and y;
- contour plot: like in a geographic map, curves corresponding to equal
levels of f are drawn in the x-y-plane;
- image plot: coloring of the x-y-plane with the color at point (x, y) corresponding to the value of f.
A simple way of enhancing the visual understanding of scatterplots is so-called sharpening (Tukey and Tukey 1981; also see Chambers et al. 1983):
for given numbers a and b, only points with a ≤ f̂(x, y) ≤ b are drawn in
the scatterplot. Alternatively, one may plot all points and highlight points
with a ≤ f̂(x, y) ≤ b.
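A direct transcription of (2.24), using the normal density kernel from the list of examples above; the function name and the evaluation points are our own:

```python
import math

def kde2d(x, y, xs, ys, b1, b2):
    """Two-dimensional kernel density estimate f_hat(x, y) as in (2.24),
    with the normal density kernel K(u, v) = (2*pi)^-1 exp(-(u^2+v^2)/2)
    applied to the standardized differences (x_i - x)/b1, (y_i - y)/b2."""
    n = len(xs)
    s = 0.0
    for xi, yi in zip(xs, ys):
        u, v = (xi - x) / b1, (yi - y) / b2
        s += math.exp(-0.5 * (u * u + v * v)) / (2 * math.pi)
    return s / (n * b1 * b2)
```

Evaluating `kde2d` on a grid of (x, y) points yields the surface used for the perspective, contour, and image plots, as well as the values f̂(x_i, y_i) needed for sharpening.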
Interpolation
Often a process may be generated in continuous time, but is observed at
discrete time points. One may then wish to guess the values of the points
in between. Kernel and local polynomial smoothing provide this possibility,
since ĝ(t) can be calculated for any t ∈ (0, 1). Alternatively, if the observations are assumed to be completely without error, i.e. y_i = g(t_i), then
deterministic interpolation can be used. The most popular method is spline
interpolation. For instance, cubic splines connect neighboring observed values y_{i−1}, y_i by cubic polynomials such that the first and second derivatives
at the endpoints t_{i−1}, t_i are equal. For observations y_1, ..., y_n at equidistant
time points t_i with t_i − t_{i−1} = t_j − t_{j−1} = Δt (i, j = 1, ..., n), we have n − 1
polynomials

p_i(t) = a_i + b_i(t − t_i) + c_i(t − t_i)² + d_i(t − t_i)³ (i = 1, ..., n − 1)   (2.25)

To achieve smoothness at the points t_i where two polynomials p_{i−1}, p_i meet,
one imposes the condition that the polynomials and their first two derivatives are equal at t_i. This together with the conditions p_i(t_i) = y_i leads to
a system of 3(n − 2) + n = 4(n − 1) − 2 equations for 4(n − 1) parameters
a_i, b_i, c_i, d_i (i = 1, ..., n − 1). To specify a unique solution one therefore
needs two additional conditions at the border. A typical assumption is
p″(t_1) = p″(t_n) = 0, which defines so-called natural splines. Cubic splines
have a physical meaning, since these are the curves that form when a thin
rod is forced to pass through n knots (in our case the knots are t_1, ..., t_n),
corresponding to minimum strain energy. The term "spline" refers to the
thin flexible rods that were used in the past by draftsmen to draw smooth
curves in ship design. In spite of their "natural" meaning, interpolation
splines (and similarly other methods of interpolation) can be problematic, since the interpolated values may be highly dependent on the specific
method of interpolation and are therefore purely hypothetical, unless the
aim is indeed to build a ship.
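The construction above can be sketched concretely for equidistant knots: with the natural border conditions p″(t_1) = p″(t_n) = 0, the second derivatives at the interior knots solve a tridiagonal linear system, solved here by forward elimination and back substitution. A self-contained illustration, not the book's code:

```python
def natural_cubic_spline(ts, ys):
    """Natural cubic spline through equidistant knots (ts[i], ys[i]):
    function values and first two derivatives match at interior knots,
    and the second derivative is zero at the two borders.
    Returns a function s(t) for t in [ts[0], ts[-1]]."""
    n = len(ts)
    h = ts[1] - ts[0]
    # Tridiagonal system for interior second derivatives M[1..n-2]:
    #   M[i-1] + 4*M[i] + M[i+1] = 6*(y[i-1] - 2*y[i] + y[i+1]) / h^2
    rhs = [6 * (ys[i - 1] - 2 * ys[i] + ys[i + 1]) / h ** 2
           for i in range(1, n - 1)]
    m = n - 2                      # number of interior unknowns
    c = [0.0] * m                  # forward-elimination multipliers
    d = [0.0] * m
    for i in range(m):
        denom = 4.0 - (c[i - 1] if i > 0 else 0.0)
        c[i] = 1.0 / denom
        d[i] = (rhs[i] - (d[i - 1] if i > 0 else 0.0)) / denom
    M = [0.0] * n                  # natural conditions: M[0] = M[n-1] = 0
    for i in range(m - 1, -1, -1):
        M[i + 1] = d[i] - (c[i] * M[i + 2] if i + 1 < m else 0.0)

    def s(t):
        i = min(max(int((t - ts[0]) / h), 0), n - 2)
        t0, t1 = ts[i], ts[i + 1]
        return (M[i] * (t1 - t) ** 3 / (6 * h)
                + M[i + 1] * (t - t0) ** 3 / (6 * h)
                + (ys[i] / h - M[i] * h / 6) * (t1 - t)
                + (ys[i + 1] / h - M[i + 1] * h / 6) * (t - t0))
    return s
```

For data lying on a straight line all second derivatives are zero and the spline reduces to linear interpolation, as expected.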
Splines can also be used for smoothing purposes by removing the restriction that the curve has to go through all observed points. More specifically,
one looks for a function ĝ(t) such that

V(λ) = Σ_{i=1}^n (y_i − g(t_i))² + λ ∫ [g″(t)]² dt   (2.26)

is minimized. The parameter λ > 0 controls the smoothness of the resulting
curve. For small values of λ, the fitted curve will be rather rough but close
to the data; for large values more smoothness is achieved but the curve
is, in general, not as close to the data. The question of which λ to choose
reflects a standard dilemma in statistical smoothing: one needs to balance
the aim of achieving a small bias (λ small) against the aim of a small
variance (λ large). For a given value of λ, the solution to the minimization
problem above turns out to be a natural cubic spline (see Reinsch 1967;
also see Wahba 1990 and references therein). The solution can also be
written as a kernel smoother with a kernel function K(u) proportional
to exp(−|u|/√2) sin(π/4 + |u|/√2) and a bandwidth b proportional to λ^{1/4}
(Silverman 1986). If t_i = i/n, then the bandwidth is exactly equal to λ^{1/4}.
Statistical inference
In this section, correlation, linear regression, nonparametric smoothing,
and interpolation were introduced in an informal way, without exact discussion of probabilistic assumptions and statistical inference. All these
techniques can be used in an informal way to explore possible structures
without specific model assumptions. Sometimes, however, one wishes to
obtain more solid conclusions by statistical tests and confidence intervals.
There is an enormous literature on statistical inference in regression, including nonparametric approaches. For selected results see the references
given above. For nonparametric methods also see Wand and Jones (1995),
Simonoff (1996), Bowman and Azzalini (1997), Eubank (1999) and references therein.
2.5 Specific applications in music: bivariate
2.5.1 Empirical tempo-acceleration
Consider the tempo curves in Figure 2.3. An approximate measure of
tempo-acceleration may be defined by

a(t_i) = Δ²y(t)/Δt² = ([y(t_i) − y(t_{i−1})] − [y(t_{i−1}) − y(t_{i−2})]) / ([t_i − t_{i−1}][t_{i−1} − t_{i−2}])   (2.27)

where y(t) is the tempo (or log-tempo) at time t. Figures 2.10a through f
show a(t) for the three performances each by Cortot and Horowitz. From the
pictures it is not quite easy to see in how far there are similarities or differences. Consider now the pairs (a_j(t_i), a_l(t_i)) where a_j, a_l are acceleration measurements of performance j and l respectively. We calculate the
sample correlations for each pair (j, l) ∈ {1, ..., 28} × {1, ..., 28}, (j ≠ l).
Figure 2.11a shows the correlations between Cortot 1 (1935) and the other
performances. As expected, Cortot correlates best with Cortot: the correlation between Cortot 1 and Cortot's other two performances (1947, 1953)
is clearly highest. The analogous observation can be made for Horowitz
1 (1947) (Figure 2.11b). Also interesting is to compare how much overall
resemblance there is between a selected performance and the other performances. For each of the 28 performances, the average and the maximal
correlation with other performances were calculated. Figures 2.11c and d
indicate that, in terms of acceleration, Cortot's style appears to be quite
unique among the pianists considered here. The overall (average and maximal) similarity between each of his three acceleration curves and the other
performances is much smaller than for any other pianist.
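The second-difference measure (2.27) is easy to compute per onset time; a plain-Python sketch (the toy check uses an invented quadratic series, not real tempo data):

```python
def acceleration(times, y):
    """Empirical tempo-acceleration a(t_i) in the spirit of (2.27):
    difference of successive tempo changes, divided by the product of
    the two time steps. Returns one value per i = 2, ..., n-1."""
    a = []
    for i in range(2, len(times)):
        d1 = y[i] - y[i - 1]
        d0 = y[i - 1] - y[i - 2]
        a.append((d1 - d0) / ((times[i] - times[i - 1])
                              * (times[i - 1] - times[i - 2])))
    return a

# Toy check: for y(t) = t^2 on a unit grid the second difference is 2.
acc = acceleration([0, 1, 2, 3, 4], [0, 1, 4, 9, 16])
```

Applying `acceleration` to each of the 28 tempo curves and correlating the resulting series pairwise (e.g. with Pearson's r) reproduces the kind of comparison shown in Figure 2.11.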


[Figure 2.10, panels a) through f): acceleration a(t) versus onset time t for a) Cortot (1935), b) Cortot (1947), c) Cortot (1953), d) Horowitz (1947), e) Horowitz (1963), f) Horowitz (1965).]
Figure 2.10 Acceleration of tempo curves for Cortot and Horowitz.

2.5.2 Interpolated and smoothed tempo curves: velocity and acceleration

Conceptually it is plausible to assume that musicians control tempo in continuous time. The measure of acceleration given above is therefore a rather
crude estimate of the actual acceleration curve. Interpolation splines provide a simple possibility to guess the tempo and its derivatives between
the observed time points. One should bear in mind, however, that interpolation is always based on specific assumptions. For instance, cubic splines
assume that the curve between two consecutive time points where observations are available is, or can be well approximated by, a third degree
polynomial. This assumption can hardly be checked experimentally and
can lead to undesirable effects. Figure 2.12 shows the observed and interpolated tempo for Martha Argerich. While most of the interpolated values
seem plausible, there are a few rather doubtful interpolations (marked with
arrows) where the cubic polynomial by far exceeds each of the two observed
values at the neighboring knots.

2004 CRC Press LLC

0
5
CORTOT2

ASKENAZE

ARRAU

10

2004 CRC Press LLC


15

Performance
KRUST

20
25
0
5
10
GIANOLI

15
20
KRUST

SHELLEY
ZAK

NOVAES

20

SCHNABEL

ORTIZ

NEY

MOISEIWITSCH

15

KUBALEK

KLIEN

KATSARIS

HOROWITZ3

HOROWITZ2

10

HOROWITZ1

Performance

ESCHENBACH

DEMUS

CURZON

DAVIES

CORTOT1
CORTOT2
CORTOT3

CAPOVA

1.0

BUNIN

BRENDEL

ASKENAZE

ARRAU

0.9

25

ARGERICH

0.8

c) Mean correlations with


other pianists

0.7

mean correlation

ZAK

SHELLEY

SCHNABEL

20

ORTIZ

NOVAES

NEY

MOISEIWITSCH

15

KUBALEK

KLIEN

KATSARIS

HOROWITZ3

HOROWITZ2

0.8

10

HOROWITZ1

GIANOLI

ESCHENBACH

DEMUS

DAVIES

CURZON

CAPOVA

BRENDEL
BUNIN

0.7

0.6

0.6
ARGERICH

CORTOT1

0.5

mean correlation

CORTOT3

0.4

0.2

0.4

KATSARIS

CURZON

25

Performance

Figure 2.11 Tempo acceleration correlation with other performances.


ZAK

SHELLEY

SCHNABEL

ORTIZ

NOVAES

NEY

MOISEIWITSCH

KUBALEK

KRUST

KLIEN

KATSARIS

HOROWITZ3

HOROWITZ2

GIANOLI

ESCHENBACH

DEMUS

DAVIES

CORTOT3

CORTOT2

CORTOT1

CAPOVA

BUNIN

BRENDEL

ASKENAZE

ARRAU

ARGERICH

Correlation

ZAK

0.6

1.0

CORTOT2

SHELLEY

SCHNABEL

ORTIZ

NOVAES

NEY

MOISEIWITSCH

KUBALEK

KRUST

KLIEN

HOROWITZ3

HOROWITZ2

HOROWITZ1

GIANOLI

ESCHENBACH

DEMUS

DAVIES

CURZON

CORTOT3

CAPOVA

BUNIN

BRENDEL

ASKENAZE

ARRAU

ARGERICH

0.8

Correlation

1.2

a) Acceleration - Correlations of
Cortot (1935) with other performances
b) Acceleration- Correlations of
Horowitz (1947) with other performances

25

Performance

d) Maximal correlations with


other pianists

1.4

Figure 2.12 Martha Argerich: interpolation of tempo curve by cubic splines.

2.5.3 Tempo: hierarchical decomposition by smoothing


The tempo curve may be thought of as an aggregation of mostly smooth
tempo curves at different onset-time-scales. This corresponds to the general
structure of music as a mixture of global and local structures at various
scales. It is therefore interesting to look at smoothed tempo curves, and
their derivatives, at different scales. Reasonable smoothing bandwidths may
be guessed from the general structure of the composition such as time
signature(s), rhythmic, metric, melodic, and harmonic structure, and so on.
For tempo curves of Schumann's Träumerei (Figure 2.3), even multiples of
1/8th are plausible. Figures 2.13 through 2.16 show the following kernel-smoothed tempo curves with b_1 = 8, b_2 = 1, and b_3 = 1/8 respectively:

ĝ_1(t) = (nb_1)⁻¹ Σ K((t − t_i)/b_1) y_i   (2.28)

ĝ_2(t) = (nb_2)⁻¹ Σ K((t − t_i)/b_2)[y_i − ĝ_1(t_i)]   (2.29)

ĝ_3(t) = (nb_3)⁻¹ Σ K((t − t_i)/b_3)[y_i − ĝ_1(t_i) − ĝ_2(t_i)]   (2.30)

and the residuals

e(t_i) = y_i − ĝ_1(t_i) − ĝ_2(t_i) − ĝ_3(t_i).   (2.31)
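The decomposition scheme can be sketched generically: smooth the series at the coarsest bandwidth, subtract, and smooth the residuals at the next finer bandwidth. The rectangular kernel and the toy tempo curve below are our own stand-ins, not the kernel or data of Figures 2.13 through 2.16:

```python
def ksmooth(t, times, vals, b):
    """Kernel-smoother stand-in: weighted average over the window
    |t - t_i| <= b, with a rectangular kernel."""
    w = [1.0 if abs(t - ti) <= b else 0.0 for ti in times]
    return sum(wi * vi for wi, vi in zip(w, vals)) / sum(w)

def decompose(times, y, bandwidths):
    """Hierarchical decomposition as in (2.28)-(2.31): smooth at the
    largest bandwidth, subtract, then smooth the residuals at the next
    bandwidth, and so on. Returns the component curves (evaluated at
    the observation times) and the final residuals."""
    resid = list(y)
    components = []
    for b in bandwidths:
        g = [ksmooth(t, times, resid, b) for t in times]
        components.append(g)
        resid = [r - gi for r, gi in zip(resid, g)]
    return components, resid

# Invented toy tempo curve: slow trend plus a fast wiggle (not real data).
times = [i / 8 for i in range(64)]  # onset times in units of 1/8
y = [0.1 * t + (0.3 if i % 2 else -0.3) for i, t in enumerate(times)]
comps, e = decompose(times, y, bandwidths=[8, 1, 1 / 8])
```

By construction the three components plus the residuals reconstruct the original series exactly, which mirrors the identity behind (2.31).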

10 15 20 25 30
t

10 15 20 25 30
t

10 15 20 25 30
t

10 15 20 25 30
t

-0.6

-0.6

10 15 20 25 30
t

KRUST

10 15 20 25 30
t

10 15 20 25 30
t

NOVAES

NEY

10 15 20 25 30
t

10 15 20 25 30
t

ZAK

SHELLEY

10 15 20 25 30
t

Figure 2.13 Smoothed tempo curves g1 (t) = (nb1 )1

2004 CRC Press LLC

HOROWITZ2
-0.6

-0.6

10 15 20 25 30
t

-0.6

SCHNABEL

KLIEN

10 15 20 25 30
t

-0.6

-0.6

10 15 20 25 30
t

-0.6

10 15 20 25 30
t

ORTIZ

MOISEIWITSCH
-0.6

-0.6

10 15 20 25 30
t

10 15 20 25 30
t

DEMUS

10 15 20 25 30
t

-0.6

KUBALEK

KATSARIS

10 15 20 25 30
t

HOROWITZ1

10 15 20 25 30
t

-0.6

-0.6

10 15 20 25 30
t

-0.6

10 15 20 25 30
t

HOROWITZ3

10 15 20 25 30
t

-0.6

-0.6

CORTOT2

DAVIES

GIANOLI

ESCHENBACH

-0.6

10 15 20 25 30
t

10 15 20 25 30
t

-0.6

10 15 20 25 30
t

-0.6

-0.6

CORTOT1

CURZON

CORTOT3

10 15 20 25 30
t

-0.6

-0.6

-0.6

CAPOVA

BUNIN

10 15 20 25 30
t

-0.6

-0.4

-0.4

-0.4

-0.4

BRENDEL

ASKENAZE

ARRAU

ARGERICH

10 15 20 25 30
t

K( tti )yi (b1 = 8).


b1

10 15 20 25 30
t

10 15 20 25 30
t

-1.5

-1.5

-1.5

10 15 20 25 30
t

KRUST

10 15 20 25 30
t

NEY

10 15 20 25 30
t

NOVAES

10 15 20 25 30
t

10 15 20 25 30
t

ZAK

SHELLEY

10 15 20 25 30
t

Figure 2.14 Smoothed tempo curves g2 (t) = (nb2 )1

1).

2004 CRC Press LLC

10 15 20 25 30
t

-2.0

10 15 20 25 30
t

-1.5

SCHNABEL
-2.0

10 15 20 25 30
t

KLIEN

10 15 20 25 30
t

10 15 20 25 30
t

HOROWITZ2

-1.5

10 15 20 25 30
t

-2.0

MOISEIWITSCH

ORTIZ

10 15 20 25 30
t

-1.5

-1.5

10 15 20 25 30
t

-1.5

10 15 20 25 30
t

KUBALEK

HOROWITZ1

10 15 20 25 30
t

-1.5

-1.5

KATSARIS

HOROWITZ3

DEMUS

-1.5

10 15 20 25 30
t

10 15 20 25 30
t

DAVIES

10 15 20 25 30
t

-1.5

-1.5

GIANOLI

ESCHENBACH

CORTOT2

-1.5

10 15 20 25 30
t

10 15 20 25 30
t

-2.0

10 15 20 25 30
t

-1.5

-1.5

CORTOT1

CURZON

CORTOT3

10 15 20 25 30
t

-1.5

-1.5

-1.5

CAPOVA

BUNIN

10 15 20 25 30
t

1.0

10 15 20 25 30
t

-2.0

-1.5

-1.5

-1.5

-0.5

BRENDEL

ASKENAZE

ARRAU

ARGERICH

10 15 20 25 30
t

K( tti )[yi g1 (t)] (b2 =


b2

Figure 2.15 Smoothed tempo curves g3(t) = (nb3)^{-1} Σi K((t − ti)/b3)[yi − g1(ti) − g2(ti)] (b3 = 1/8); one panel per pianist, t on the horizontal axis.

Figure 2.16 Smoothed tempo curves: residuals e(ti) = yi − g1(ti) − g2(ti) − g3(ti); one panel per pianist, t on the horizontal axis.
The tempo curves are thus decomposed into curves corresponding to a hierarchy of bandwidths. Each component reveals specific features. The first component reflects the overall tendency of the tempo. Most pianists have an essentially monotonically decreasing curve, corresponding to a gradual, and towards the end emphasized, ritardando. For some performances (in particular Bunin, Capova, Gianoli, Horowitz 1, Kubalek, and Moisewitsch) there is a distinct initial acceleration with a local maximum in the middle of the piece. The second component g2(t) reveals tempo fluctuations that correspond to a natural division of the piece into 8 times 4 bars. Some pianists, like Cortot, greatly emphasize this 8 × 4 structure. For other pianists, such as Horowitz, the 8 × 4 structure is less evident: the smoothed tempo curve is mostly quite flat, though the main, but smaller, tempo changes do take place at the junctions of the eight parts. Striking is also the distinction between part B (bars 17 to 24) and the other parts (A, A′, A″) of the composition, in particular in Argerich's performance. The third component characterizes fluctuations at the resolution level of 2/8th. At this very local level, tempo changes frequently for pianists like Horowitz, whereas there is less local movement in Cortot's performances. Finally, the residuals e(t) consist of the remaining fluctuations at the finest resolution of 1/8th. The similarity between the three residual curves by Horowitz illustrates that even at this very fine level, the "seismic" variation of tempo is a highly controlled process that is far from random.
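The hierarchical decomposition described above can be sketched as follows. This is an illustrative reconstruction on synthetic data, not the code behind the book's figures: it uses a normalized (Nadaraya-Watson) Epanechnikov smoother, and the finest bandwidth is taken as 1/4 rather than the book's 1/8 so that each smoothing window contains more than one observation.

```python
import numpy as np

def kernel_smooth(t_grid, t_obs, values, b):
    """Normalized Epanechnikov kernel smooth of `values` (observed at
    t_obs), evaluated at the points t_grid, with bandwidth b."""
    out = np.empty(len(t_grid))
    for k, t0 in enumerate(t_grid):
        u = (t0 - t_obs) / b
        w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
        out[k] = np.sum(w * values) / np.sum(w)
    return out

# synthetic tempo curve: global ritardando plus 4-bar and 1-bar fluctuations
t = np.arange(1, 33, 0.125)
y = 60 - 0.5 * t + np.sin(2 * np.pi * t / 4) + 0.3 * np.sin(2 * np.pi * t)

g1 = kernel_smooth(t, t, y, b=8)               # overall tendency
g2 = kernel_smooth(t, t, y - g1, b=1)          # fluctuations at the 4-bar level
g3 = kernel_smooth(t, t, y - g1 - g2, b=1/4)   # local fluctuations
e = y - g1 - g2 - g3                           # finest-level residuals
```

By construction the four components add back up to the observed curve, so each level smooths only what the coarser levels leave unexplained.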

2.5.4 Tempo curves and melodic indicator

In Chapter 3, the so-called melodic indicator will be introduced. One of the aims will be to "explain" some of the variability in tempo curves by melodic structures in the score. Consider a simple melodic indicator m(t) = wmelod(t) (see Section 3.3.4) that is essentially obtained by adding all indicators corresponding to individual motifs. Figures 2.17a and d display smoothed curves obtained by local polynomial smoothing of m(t), using a large and a small bandwidth respectively. Figures 2.17b and e show the first derivatives of the two curves in 2.17a,d. Similarly, the second derivatives are given in Figures 2.17c and f. For the tempo curves, the first and second derivatives of local polynomial fits with b = 4 are given in Figures 2.18 and 2.19 respectively. A resemblance can be found in particular between the second derivative of m(t) in Figure 2.17f and the second derivatives of tempo curves in Figure 2.19. Also, there are interesting similarities and differences between the performances with respect to the local variability of the first two derivatives. Many pianists start with a very small second derivative, with strongly increased values in part B.
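The first and second derivatives from local polynomial fits can be estimated with a windowed least-squares fit evaluated at the window center. This is a minimal sketch, not the exact smoother behind Figures 2.17 to 2.19; it is checked on a sine curve whose derivatives are known.

```python
import math
import numpy as np

def local_poly_deriv(t, y, span, degree=3, deriv=1):
    """At each t0, fit a polynomial of the given degree to the points
    with |t - t0| <= span/2 and return the deriv-th derivative of the
    fitted polynomial at t0."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    est = np.empty_like(y)
    for k, t0 in enumerate(t):
        sel = np.abs(t - t0) <= span / 2
        coef = np.polyfit(t[sel] - t0, y[sel], degree)  # highest power first
        est[k] = coef[-1 - deriv] * math.factorial(deriv)
    return est

t = np.linspace(0, 32, 129)
y = np.sin(t)
d1 = local_poly_deriv(t, y, span=2.0, deriv=1)  # should approximate cos(t)
d2 = local_poly_deriv(t, y, span=2.0, deriv=2)  # should approximate -sin(t)
```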

Figure 2.17 Melodic indicator: local polynomial fits together with first and second derivatives. Panels: a) m(t) (span = 24/32); b) first derivative (span = 24/32); c) second derivative (span = 24/32); d) m(t) (span = 8/32); e) first derivative (span = 8/32); f) second derivative (span = 8/32).

Figure 2.18 Tempo curves (Figure 2.3): first derivatives obtained from local polynomial fits (span 24/32); one panel per pianist.

Figure 2.19 Tempo curves (Figure 2.3): second derivatives obtained from local polynomial fits (span 8/32); one panel per pianist.

2.5.5 Tempo and loudness

By invitation of Prince Charles, Vladimir Horowitz gave a benefit recital at London's Royal Festival Hall on May 22, 1982. It was his first European appearance in 31 years. One of the pieces played at the concert was Schumann's Kinderszene op. 15, No. 4. Figure 2.20 displays the (approximate) sound wave of Horowitz's performance sampled from the CD recording. Two variables that can be extracted quite easily by visual inspection are: a) on the horizontal axis, the times when notes are played (and, derived from this quantity, the tempo); and b) on the vertical axis, loudness. More specifically, let t1, ..., tn be the score onset times and u(t1), ..., u(tn) the corresponding performance times. Then an approximate tempo at score onset time ti can be defined by y(ti) = (ti+1 − ti)/(u(ti+1) − u(ti)). A complication with loudness is that the amplitude level of piano sounds decreases gradually in a complex manner, so that "loudness" as such is not defined exactly. For simplicity, we therefore define loudness as the initial amplitude level (or rather its logarithm). Moreover, we consider only events where the score onset time is a multiple of 1/8. For illustration, the first four events (score onset times 1/8, 2/8, 3/8, 4/8) are marked with arrows in Figure 2.20.

An interesting question is what kind of relationship there may be between the time delay y and the loudness level x. The autocorrelations of x(ti) = log(Amplitude) and y(ti), as well as the cross-autocorrelations between the two time series, are shown in Figure 2.21a. The main remarkable cross-autocorrelation occurs at lag 8. This can also be seen visually when plotting y(ti+8) against x(ti) (Figure 2.21b). There appears to be a strong relationship between the two variables, with the exception of four outliers. The three fitted lines correspond to: a) a least squares linear regression fit using all data; b) a robust high-breakdown-point and high-efficiency regression (Yohai et al. 1991); and c) a least squares fit excluding the outliers. It should be noted that the outliers all occur together in a temporal cluster (see Figure 2.21c) and correspond to a phase where tempo is at its extreme (lowest for the first three outliers and fastest for the last outlier). This indicates that these are "informative" outliers (in contrast to wrong measurements) that should not be dismissed, since they may tell us something about the intention of the performer.

Finally, Figure 2.21d displays a "sharpened" version of the scatterplot in Figure 2.21b: points with high estimated joint density f(x, y) are marked with "O". In contrast to what one would expect from a regression model with random errors εi that are independent of x, the points with highest density gather around a horizontal line rather than the regression line(s) fitted in Figure 2.21b. Thus, a linear regression model is hardly applicable. Instead, the data may possibly be divided into three clusters: a) a cluster with low loudness and low tempo; b) a second cluster with medium loudness and low to medium tempo; and c) a third cluster with a high level of loudness and medium to high tempo.

Figure 2.20 Kinderszene No. 4: sound wave of the performance by Horowitz at the Royal Festival Hall in London on May 22, 1982.

Figure 2.21 log(Amplitude) and tempo for Kinderszene No. 4: auto- and cross-correlations (a), scatter plot with fitted least squares and robust lines (b), time series plots (c), and sharpened scatter plot (d).
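The tempo definition y(ti) = (ti+1 − ti)/(u(ti+1) − u(ti)) and the lagged correlation used above can be sketched as follows; the numbers are illustrative, not the Horowitz measurements.

```python
import numpy as np

def tempo(score_onsets, perf_times):
    """Approximate tempo y(t_i) = (t_{i+1} - t_i) / (u(t_{i+1}) - u(t_i))."""
    t = np.asarray(score_onsets, dtype=float)
    u = np.asarray(perf_times, dtype=float)
    return np.diff(t) / np.diff(u)

def cross_corr(x, y, lag):
    """Sample cross-correlation between x(t_i) and y(t_{i+lag})."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.corrcoef(x[:len(x) - lag], y[lag:])[0, 1]

# equidistant score onsets played with an increasing slowdown:
# the last inter-onset interval takes twice as long as the first two
y_toy = tempo([0, 1, 2, 3], [0.0, 2.0, 4.0, 8.0])
```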

2.5.6 Loudness and tempo: two-dimensional distribution function

In the example above, the correlation between loudness and tempo, when measured at the same time, turned out to be relatively small, whereas there appeared to be quite a clear lagged relationship. Does this mean that there is indeed no immediate relationship between these two variables? Consider x(ti) = log(Amplitude) and the logarithm of tempo. The scatterplot and the boxplot in Figures 2.22a and b rather suggest that there may be a relationship, but the dependence is nonlinear. This is further supported by the two-dimensional histogram (Figure 2.23a), the smoothed density (Figure 2.24a) and the corresponding image plots (Figures 2.23b and 2.24b; the actual observations are plotted as stars). The density was estimated by a kernel estimate with the Epanechnikov kernel. Since correlation only measures linear dependence, it cannot detect this kind of highly nonlinear relationship.

Figure 2.22 Horowitz performance of Kinderszene No. 4: log(tempo) versus log(Amplitude), and boxplots of log(tempo) for three ranges of amplitude.
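A two-dimensional kernel density estimate with the Epanechnikov kernel, of the kind used for Figure 2.24, can be sketched with a product kernel; the data and bandwidths below are synthetic placeholders.

```python
import numpy as np

def kde2d(x, y, gx, gy, bx, by):
    """Two-dimensional product-Epanechnikov kernel density estimate of
    the sample (x, y), evaluated on the grid gx x gy."""
    def epa(u):
        return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
    kx = epa((gx[:, None] - x[None, :]) / bx)      # len(gx) x n
    ky = epa((gy[:, None] - y[None, :]) / by)      # len(gy) x n
    return (kx @ ky.T) / (len(x) * bx * by)        # len(gx) x len(gy)

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.5, size=500)      # nonlinear data would work too
gx = np.linspace(-3, 3, 61)
gy = np.linspace(-3, 3, 61)
dens = kde2d(x, y, gx, gy, bx=0.5, by=0.5)
```

The resulting grid can be rendered as a perspective or image plot; correlation would summarize such a density by a single linear coefficient, which is exactly what the section argues is insufficient.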

Figure 2.23 Horowitz performance of Kinderszene No. 4: two-dimensional histogram of (x, y) = (log(tempo), log(Amplitude)), displayed in a perspective plot and an image plot respectively.

Figure 2.24 Horowitz performance of Kinderszene No. 4: kernel estimate of the two-dimensional distribution of (x, y) = (log(tempo), log(Amplitude)), displayed in a perspective plot and an image plot respectively.

2.5.7 Melodic tempo-sharpening

Sharpening can also be applied by using an "external" variable. This is illustrated in Figures 2.25 through 2.27. Figure 2.25a displays the estimated density function of log(m + 1), where m(t) is the value of a melodic indicator at onset time t. The marked region corresponds to very high values of the density function f (namely f(x) > 0.793). This defines a set Isharp of corresponding "sharpening" onset times. The series m(t) is shown in Figure 2.25b, with sharpening onset times t ∈ Isharp highlighted by vertical lines. Figures 2.26 and 2.27 show the tempo y and its discrete derivative v(ti) = [y(ti+1) − y(ti)]/(ti+1 − ti) for ti ∈ Isharp, for the performances by Cortot and Horowitz. The pictures indicate a systematic difference between Cortot and Horowitz. A common feature is the negative derivative at the fifth and sixth sharpening onset times.

Figure 2.25 R. Schumann, Träumerei op. 15, No. 7: density of melodic indicator with sharpening region (a), and melodic curve plotted against onset time with sharpening points highlighted (b).

Figure 2.26 R. Schumann, Träumerei op. 15, No. 7: tempo by Cortot and Horowitz at sharpening onset times.

Figure 2.27 R. Schumann, Träumerei op. 15, No. 7: tempo derivatives for Cortot and Horowitz at sharpening onset times.
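The density-based selection of sharpening onset times can be sketched as follows. This is a stand-in rather than the book's procedure: the density of log(m + 1) is estimated by a one-dimensional Epanechnikov kernel at the observations themselves, and a top-quantile rule replaces the fixed cutoff f(x) > 0.793, which is specific to the Träumerei data.

```python
import numpy as np

def sharpening_times(onsets, m, bandwidth, frac=0.1):
    """Select onset times where the estimated density of log(m + 1) is
    high: an Epanechnikov kernel density is evaluated at each observation
    and the top `frac` fraction is kept."""
    z = np.log(np.asarray(m, dtype=float) + 1.0)
    u = (z[:, None] - z[None, :]) / bandwidth
    k = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
    dens = k.mean(axis=1) / bandwidth          # density estimate at each z_i
    thresh = np.quantile(dens, 1 - frac)
    return [t for t, d in zip(onsets, dens) if d >= thresh]
```

Applied to an indicator series that is constant on part of the piece, the selection picks out exactly that high-density phase.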
2.6 Some multivariate descriptive displays

2.6.1 Definitions

Suppose that we observe multivariate data x1, x2, ..., xn where each xi is a p-dimensional vector (xi1, ..., xip)^t ∈ R^p. Obvious numerical summary statistics are the sample mean

x̄ = (x̄1, x̄2, ..., x̄p)^t

where x̄j = n^{-1} Σ_{i=1}^{n} xij, and the p × p covariance matrix S with elements

Sjl = (n − 1)^{-1} Σ_{i=1}^{n} (xij − x̄j)(xil − x̄l).

Most methods for analyzing multivariate data are based on these two statistics. One of the main tools consists of dimension reduction by suitable projections, since it is easier to find and visualize structure in low dimensions. These techniques go far beyond descriptive statistics. We therefore postpone the discussion of these methods to Chapters 8 to 11. Another set of methods consists of visualizing individual multivariate observations. The main purpose is a simple visual identification of similarities and differences between observations, as well as the search for clusters and other patterns. Typical examples are:

Faces: xi = (xi1, ..., xip)^t is represented by a face with features depending on the values of the corresponding coordinates. For instance, the face function in S-Plus has the following correspondence between coordinates and feature parameters: xi,1 = area of face; xi,2 = shape of face; xi,3 = length of nose; xi,4 = location of mouth; xi,5 = curve of smile; xi,6 = width of mouth; xi,7 = location of eyes; xi,8 = separation of eyes; xi,9 = angle of eyes; xi,10 = shape of eyes; xi,11 = width of eyes; xi,12 = location of pupil; xi,13 = location of eyebrow; xi,14 = angle of eyebrow; xi,15 = width of eyebrows.

Stars: Each coordinate is represented by a ray in a star, the length of each ray corresponding to the value of the coordinate. More specifically, a star for a data vector xi = (xi1, ..., xip)^t is constructed as follows:
1. Scale xi to the range [0, r]: 0 ≤ x1j, ..., xnj ≤ r;
2. Draw p rays at angles θj = 2π(j − 1)/p (j = 1, ..., p); for a star with origin 0 representing observation xi, the end point of the jth ray has the coordinates r · (xij cos θj, xij sin θj);
3. For visual reasons, the end points of the rays may be connected by straight lines.

Profiles: An observation xi = (xi1, ..., xip)^t is represented by a plot of xij versus j, where neighboring points xi,j−1 and xij (j = 1, ..., p) are connected.

Symbol plot: The horizontal and vertical positions represent xi1 and xi2 respectively (or any other two coordinates of xi). The other coordinates xi3, ..., xip determine p − 2 characteristic shape parameters of a geometric object that is plotted at point (xi1, xi2). Typical symbols are circles (one additional dimension), rectangles (two additional dimensions), stars (arbitrary number of additional dimensions), and faces (arbitrary number of additional dimensions).
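Step 2 of the star construction can be written down directly; the sketch below assumes the coordinates of one observation have already been scaled to [0, 1].

```python
import math

def star_coords(x, r=1.0):
    """End points of the p rays of a star representing one observation x
    (values assumed to lie in [0, 1]): ray j points at angle
    2*pi*(j - 1)/p and has length r * x[j-1]."""
    p = len(x)
    pts = []
    for j, v in enumerate(x):
        theta = 2 * math.pi * j / p
        pts.append((r * v * math.cos(theta), r * v * math.sin(theta)))
    return pts
```

Connecting consecutive end points (step 3) then yields the familiar star polygon.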
2.7 Specific applications in music: multivariate

2.7.1 Distribution of notes: Chernoff faces

In music that is based on scales, pitch (modulo 12) is usually not equally distributed. Notes that belong to the main scale are more likely to occur, and within these, there are certain preferred notes as well (e.g. the roots of the tonic, subtonic and supertonic triads). To illustrate this, we consider the following compositions: 1. Saltarello (Anonymus, 13th century); 2. Prelude and Fugue No. 1 from Das Wohltemperierte Klavier (J. S. Bach, 1685-1750); 3. Kinderszene op. 15, No. 1 (R. Schumann, 1810-1856); 4. Piano piece op. 19, No. 2 (A. Schönberg, 1874-1951; Figure 2.28); 5. Rain Tree Sketch 1 (T. Takemitsu, 1930-1996). For each composition, the distribution of notes (pitches) modulo 12 is calculated and centered around the central pitch (defined as the most frequent pitch modulo 12). Thus, the central pitch is defined as zero. We then obtain five vectors of relative frequencies pj = (pj0, ..., pj11)^t (j = 1, ..., 5) characterizing the five compositions. In addition, for each of these vectors the number nj of local peaks in pj is calculated. We say that a local peak occurs at i ∈ {1, ..., 10} if pji > max(pj,i−1, pj,i+1). For i = 10, we say that a local peak occurs if pji > pj,i−1. Figure 2.29a displays Chernoff faces of the 12-dimensional vectors vj = (nj, pj1, ..., pj11)^t. In Figure 2.29b, the coordinates of vj (and thus the assignment of feature variables) were permuted. The two plots illustrate the usefulness of Chernoff faces, and at the same time the difficulties in finding an objective interpretation. On one hand, the method discovers a plausible division into two groups: both pictures show a clear distinction between classical tonal music (first three faces) and the three representatives of avant-garde music of the 20th century. On the other hand, the exact nature of the distinction cannot be seen. In Figure 2.29a, the classical faces look much more "friendly" than the rather "miserable" avant-garde fellows. The judgment of conservative music lovers that avant-garde music is "unbearable, depressing, or even bad for health" seems to be confirmed! Yet, "bad temper" is the response of the classical masters to a simple permutation of the variables (Figure 2.29b), whereas the grim avant-garde seems to be much more at ease. The difficulty in interpreting Chernoff faces is that the result depends on the order of the variables, whereas due to their psychological effect most feature variables are not interchangeable.
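The local-peak count nj entering the face vector vj can be computed as follows; this is one reading of the peak rule above, with interior indices using both neighbours and the last index only its left neighbour.

```python
def local_peaks(p):
    """Count local peaks in a relative-frequency vector p = (p0, ..., p11):
    an interior index i is a peak if p[i] > max(p[i-1], p[i+1]); the last
    index counts as a peak if it exceeds its left neighbour."""
    n = 0
    for i in range(1, len(p) - 1):
        if p[i] > max(p[i - 1], p[i + 1]):
            n += 1
    if p[-1] > p[-2]:
        n += 1
    return n
```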

Figure 2.28 Arnold Schönberg (1874-1951), self-portrait. (Courtesy of Verwertungsgesellschaft Bild-Kunst, Bonn.)

2.7.2 Distribution of notes: star plots

We consider once more the distribution vectors pj = (pj0, ..., pj11)^t of pitch modulo 12, where 0 is the tonal center. In contrast to Chernoff faces, permutation of coordinates in star plots is much less likely to have a subjective influence on the interpretation of the picture. Nevertheless, certain patterns can become more visible when using an appropriate ordering of the variables. From the point of view of tonal music, a natural ordering of pitch can be obtained, for instance, from the ascending circle of fourths. This leads to the following permutation: p̃j = (pj5, pj10, pj3, pj8, pj1, pj6, pj11, pj4, pj9, pj2, pj7)^t. (pj0 is omitted, since it is maximal by definition for all compositions.) Since stars are easy to look at, it is possible to compare a large number of observations simultaneously. We consider the following set of compositions:

ANONYMUS

BACH

SCHUMANN

WEBERN

SCHOENBERG

TAKEMITSU

Figure 2.29 a) Cherno faces for 1. Saltarello (Anonymus, 13th century); 2.


Prelude and Fugue No. 1 from Das Wohltemperierte Klavier (J. S. Bach, 16851750); 3. Kinderszene op. 15, No. 1 (R. Schumann, 1810-1856); 4. Piano piece
op. 19, No. 2 (A. Schnberg, 1874-1951); 5. Rain Tree Sketch 1 (T. Takemitsu,
o
1930-1996).

ANONYMUS

BACH

SCHUMANN

WEBERN

SCHOENBERG

TAKEMITSU

Figure 2.29 b) Cherno faces for the same compositions as in gure 2.29a, after
permuting coordinates.

2004 CRC Press LLC

A. de la Halle (1235?-1287): Or est Bayard en la pature, hure!;


J. de Ockeghem (1425-1495): Canon epidiatesseron;
J. Arcadelt (1505-1568): a) Ave Maria, b) La ingratitud, c) Io dico fra
noi;
W. Byrd (1543-1623): a) Ave Verum Corpus, b) Alman, c) The Queens
Alman;
J.P. Rameau (1683-1764): a) La Poplini`re, b) Le Tambourin, c) La
e
Triomphante;
J.S. Bach (1685-1750): Das Wohltemperierte Klavier Preludes und
Fuges No. 5, 6 and 7;
D. Scarlatti (1660-1725): Sonatas K 222, K 345 and K 381;
J. Haydn (1732-1809): Sonata op. 34, No. 2;
W.A. Mozart (1756-1791): 2nd movements of Sonatas KV 332, KV 545
and KV 333;
M. Clementi (1752-1832): Gradus ad Parnassum Studies 2 and 9 (Figure 11.4);
R. Schumann (1810-1856): Kinderszenen op. 15, No. 1, 2, and 3;
F. Chopin (1810-1849): a) Nocturne op. 9, No. 2, b) Nocturne op. 32,
No. 1, c) Etude op. 10, No. 6;
R. Wagner (1813-1883): a) Bridal Choir from Lohengrin, b) Ouverture
to Act 3 of Die Meistersinger;
C. Debussy (1862-1918): a) Claire de lune, b) Arabesque No. 1, c) Reections dans leau;
A. Scriabin (1872-1915): Preludes op. 2/2, op. 11/14 and op. 13/2;
B. Bartk (1881-1945): a) Bagatelle op. 11, No. 2 and 3, b) Sonata for
o
Piano;
O. Messiaen (1908-1992): Vingts regards sur lenfant de Jsus, No. 3;
e
S. Prokoe (1891-1953): Visions fugitives No. 11, 12 and 13;
A. Schnberg (1874-1951): Piano piece op. 19, No. 2;
o
T. Takemitsu (1930-1996): Rain Tree Sketch No. 1;
A. Webern (1883-1945): Orchesterst ck op. 6, No. 6;
u
anti piano concert No. 2 (beginning of 2nd Mov.)
J. Beran (*1959): S
The star plots of p̃j are given in Figure 2.31. From Halle (cf. Figure 2.30) up to about the early Scriabin, the long beams form more or less a half-circle. This means that the most frequent notes are neighbors in the circle of fourths and are much more frequent than all other notes. This is indeed what one would expect in music composed in the tonal system. The picture starts changing in the neighborhood of Scriabin, where long beams are either isolated (most extremely for Bartók's Bagatelle No. 3) or tend to cover more or less the whole range of notes (e.g. Bartók, Prokofieff, Takemitsu, Beran). Due to the variety of styles in the 20th century, the specific shape of each of the stars would need to be discussed in detail individually. For instance, Messiaen's shape may be explained by the specific scales ("Messiaen scales") he used. Generally speaking, the difference between star plots of the 20th century and earlier music reflects the replacement of the traditional tonal system with major/minor scales by other principles.

Figure 2.30 The minnesinger Burchard von Wengen (1229-1280), contemporary of Adam de la Halle (1235?-1288). (From Codex Manesse, courtesy of the University Library Heidelberg.) (Color figures follow page 152.)
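The circle-of-fourths permutation used for the star plots need not be typed by hand: ascending by a perfect fourth adds 5 semitones modulo 12, so the ordering is generated by repeated addition of 5, starting one fourth above the tonal center 0 (which is omitted).

```python
# indices of p_j reordered by the ascending circle of fourths
order = [(5 * k) % 12 for k in range(1, 12)]
# order == [5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7], matching the permutation in the text
```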


Figure 2.31 Star plots of p̃j, the distribution of notes ordered according to the ascending circle of fourths, for compositions from the 13th to the 20th century; one panel per composition, from Halle to Beran.

2.7.3 Joint distribution of interval steps of envelopes

Consider a composition consisting of onset times ti and pitch values x(ti). In a polyphonic score, several notes may be played simultaneously. To simplify analysis, we define a simplified score by considering the lower and upper envelope:

Definition 24 Let

C = {(ti, x(ti)) : ti ∈ A, x(ti) ∈ B, i = 1, 2, ..., N} = ∪_{j=1}^{n} Cj

where A = {t*1, ..., t*n} ⊂ Z+ (t*1 < t*2 < ... < t*n), B ⊂ R or Z, and Cj = {(t, x(t)) ∈ C : t = t*j}. Then the lower and upper envelope of C are defined by

Elow = {(t*j, min_{(t,x(t)) ∈ Cj} x(t)), j = 1, ..., n}

and

Eup = {(t*j, max_{(t,x(t)) ∈ Cj} x(t)), j = 1, ..., n}.

In other words, for each onset time, the lowest and highest note are selected to define the lower and upper envelope respectively. In the example below, we consider interval steps Δy(ti) = (y(ti+1) − y(ti)) mod 12 for the upper envelope of a composition with onset times t1, ..., tn and pitches y(t1), ..., y(tn). A simple aspect of melodic and harmonic structure is the question in which sequence intervals are likely to occur. Here, we look at the empirical two-dimensional distribution of (Δy(ti), Δy(ti+1)). For each pair (i, j) (−11 ≤ i, j ≤ 11; i, j ≠ 0), we count the number nij of occurrences and define Nij = log(nij + 1). (The value 0 is excluded here, since repetitions of a note or transposition by an octave are less interesting.) If only the type of interval and not its direction is of interest, then i, j assume the values 1 to 11 only. A useful representation of Nij can be obtained by a symbol plot. In Figures 2.32 and 2.33, the x- and y-coordinates correspond to i and j respectively. The radius of a circle with center (i, j) is proportional to Nij. The compositions considered here are: a) J.S. Bach: Präludium No. 1 from Das Wohltemperierte Klavier; b) W.A. Mozart: Sonata KV 545 (beginning of 2nd Movement); c) A. Scriabin: Prélude op. 51, No. 4; and d) F. Martin: Prélude No. 6. For Bach's piece, there is a clear clustering in three main groups in the first plot (there are almost never two successive interval steps downwards) and a horseshoe-like pattern for absolute intervals. Remarkable is the clear negative correlation in Mozart's first plot and the concentration on a few selected interval sequences. A negative correlation in the plots of interval steps with sign can also be found for Scriabin and Martin. However, considering only the types of intervals without their sign, the number and variety of interval sequences that are used relatively frequently is much higher for Scriabin and even more for Martin. For Martin, the plane of absolute intervals (Figure 2.33d) is filled almost uniformly.
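The upper envelope of Definition 24 and the counts nij can be sketched as follows; for brevity the interval steps are kept as signed semitone differences, without the modulo-12 reduction used in the text.

```python
from collections import defaultdict

def upper_envelope(events):
    """events: list of (onset_time, pitch) pairs. For each onset time,
    keep the highest pitch (the upper envelope of Definition 24) and
    return the pitches in onset-time order."""
    env = {}
    for t, pitch in events:
        env[t] = max(env.get(t, pitch), pitch)
    return [env[t] for t in sorted(env)]

def interval_pair_counts(pitches):
    """Counts n_ij of successive interval-step pairs (dy_i, dy_{i+1});
    steps equal to 0 (note repetitions) are skipped."""
    steps = [b - a for a, b in zip(pitches, pitches[1:]) if b != a]
    counts = defaultdict(int)
    for a, b in zip(steps, steps[1:]):
        counts[(a, b)] += 1
    return counts
```

The resulting dictionary can be fed directly into a symbol plot, with the circle radius at (i, j) proportional to log(counts[(i, j)] + 1).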
Figure 2.32 Symbol plot of the distribution of successive interval pairs (Δy(ti), Δy(ti+1)) (a, c) and their absolute values (b, d) respectively, for the upper envelopes of Bach's Präludium No. 1 (Das Wohltemperierte Klavier I) and Mozart's Sonata KV 545 (beginning of 2nd movement).

Figure 2.33 Symbol plot of the distribution of successive interval pairs (Δy(ti), Δy(ti+1)) (a, c) and their absolute values (b, d) respectively, for the upper envelopes of Scriabin's Prélude op. 51, No. 4 and F. Martin's Prélude No. 6.

2.7.4 Pitch distribution: symbol plots with circles

Consider once more the distribution vectors pj = (pj0, ..., pj11)^t of pitch modulo 12, as in the star-plot example above. The star plots show a clear distinction between "modern" compositions and classical tonal compositions. Symbol plots can be used to see more clearly which composers (or compositions) are close with respect to pj. In Figure 2.34 the x- and y-axes correspond to pj5 and pj7. Recall that if 0 is the root of the tonic triad, then 5 is the root of the subtonic and 7 the root of the dominant triad. The radius of the circles in Figure 2.34 is proportional to pj1, the frequency of the dissonant minor second. In color Figure 2.35, the radius represents pj6, i.e. the augmented fourth. Both plots show a clear positive relationship between pj5 and pj7. Moreover, the circles tend to be larger for small values of x and y. The positioning in the plane together with the size of the circles separates (apart from a few exceptions) classical tonal compositions from more recent ones. To visualize this, four different colors are chosen for "early music" (black), "baroque and classical" (green), "romantic" (blue) and "20th/21st century" (red). The clustering of the four colors indicates that there is indeed an approximate clustering according to the four time periods. Interesting exceptions can be observed for early music, with two extreme outliers (Halle and Arcadelt). Also, one piece by Rameau is somewhat far from the rest.

Figure 2.34 Symbol plot with x = pj5, y = pj7 and radius of circles proportional to pj1; composer names label the points.

2.7.5 Pitch distribution: symbol plots with rectangles

By using rectangles, four dimensions can be represented. Color Figure 2.36 shows a symbol plot with (x, y)-coordinates (pj5, pj7) and rectangles with width pj1 (diminished second) and height pj6 (augmented fourth). Using the same colors for the names as above, a similar clustering as in the circle plot can be observed. The picture not only visualizes a clear four-dimensional relationship between pj1, pj5, pj6 and pj7, but also shows that these quantities are related to the time period.

Figure 2.35 Symbol plot with x = pj5, y = pj7 and radius of circles proportional to pj6. (Color figures follow page 152.)
2.7.6 Pitch distribution symbol plots with stars
Five dimensions are visualized in color Figure 2.37 with (x, y) = (pj5, pj7) and the variables pj1, pj6, and pj10 (diminished seventh) defining a star plot for each observation, the first variable starting on the right and the subsequent variables winding counterclockwise around the star (in this case a triangle). The shape of the triangle is obviously a characteristic of the time period. For tonal music composed mostly before about 1900, the stars are very narrow with a relatively long beam in the direction of the diminished seventh. The diminished seventh is indeed an important pitch in tonal music, since it is the fourth note in the dominant seventh chord to the subtonic. In contrast, notes that are a diminished second and an


Figure 2.36 Symbol plot with x = pj5, y = pj7. The rectangles have width pj1 (diminished second) and height pj6 (augmented fourth). (Color figures follow page 152.)


augmented fourth above the root of the tonic triad form, together with the tonic root, highly dissonant intervals and are therefore less frequent in tonal music. Color Figure 2.37 shows the triangles; the names without the triangles are plotted in color Figure 2.38.
2.7.7 Pitch distribution prole plots
Finally, as an alternative to star plots, Figure 2.39 displays profile plots of pj = (pj5, pj10, pj3, pj8, pj1, pj6, pj11, pj4, pj9, pj2, pj7)t. For compositions up to about 1900, the profiles are essentially U-shaped. This corresponds to stars with clustered long and short beams respectively, as seen previously. For modern compositions, there is a large variety of shapes different from a U-shape.


Figure 2.37 Symbol plot with x = pj5, y = pj7, and triangles defined by pj1 (diminished second), pj6 (augmented fourth) and pj10 (diminished seventh). (Color figures follow page 152.)


Figure 2.38 Names plotted at locations (x, y) = (pj5, pj7). (Color figures follow page 152.)


Figure 2.39 Profile plots of pj = (pj5, pj10, pj3, pj8, pj1, pj6, pj11, pj4, pj9, pj2, pj7)t.


CHAPTER 3

Global measures of structure and randomness
3.1 Musical motivation

Essential aspects of music may be summarized under the keywords "structure", "information", and "communication". Even aleatoric pieces where events are generated randomly (e.g. Cage, Xenakis, Lutoslawski) have structure and information induced by the definition of specific random distributions. It is therefore meaningful to measure the amount of structure and information contained in a composition. Clearly, this is a nontrivial task, and many different, and possibly controversial, definitions can be invented. In this chapter, two types of measures are discussed: 1) general global measures of information or randomness, and 2) specific local measures indicating metric, melodic, and harmonic structures.
3.2 Basic principles
3.2.1 Measuring information and randomness
There is an enormous amount of literature on information measures and their applications. In this section, only some basic fundamental definitions and results are reviewed. These and other classical results can be found, in particular, in Fisher (1925, 1956), Hartley (1928), Bhattacharyya (1946a), Erdős (1946), Wiener (1948), Shannon (1948), Shannon and Weaver (1949), Barnard (1951), McMillan (1953), Mandelbrot (1953, 1956), Khinchin (1953, 1956), Goldman (1953), Bartlett (1955), Brillouin (1956), Kolmogorov (1956), Ashby (1956), Joshi (1957), Kullback (1959), Wolfowitz (1957, 1958, 1961), Woodward (1953), and Rényi (1959a,b, 1961, 1965, 1970). Also see e.g. Ash (1965) for an overview. A classical measure of information (or randomness) is entropy, which is also called Shannon information (Shannon 1948, Shannon and Weaver 1949). To explain its meaning, consider the following question: how much information is contained in a message, or, more specifically, what is the necessary number of digits to encode the message unambiguously in the binary system? For instance, if the entire vocabulary consisted only of the words "I", "hungry", "not", "very", then the words could be identified with the binary numbers 00 = "I", 01 = "hungry", 10 =



Figure 3.1 Ludwig Boltzmann (1844-1906). (Courtesy of Österreichische Post AG.)

"not" and 11 = "very". Thus, for a vocabulary V of |V| = N = 2^2 words, n = 2 digits would be sufficient. More generally, suppose that we have a set V with N = 2^n elements. Then we need n = log2 N digits for encoding the elements in the binary system. The number n is then called the information of a message from vocabulary V. Note that in the special case where V consists of one element only, n = 0, i.e. the information content of a message is zero, because we know which element of V will be contained in the message even before receiving it.
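The encoding argument can be checked directly. A minimal sketch (the four-word vocabulary is the example from the text; the helper function is ours):

```python
import math

def information(vocabulary):
    """I(V_N) = log2 N: number of binary digits needed to encode
    one message from a vocabulary of N words."""
    return math.log2(len(vocabulary))

# the four-word vocabulary from the text, coded with 2 binary digits each
words = ["I", "hungry", "not", "very"]
codes = {w: format(i, "02b") for i, w in enumerate(words)}

print(information(words))   # 2.0 bits per word
print(codes["hungry"])      # 01
print(information(["I"]))   # 0.0: a one-word vocabulary carries no information
```

A vocabulary with a single word needs no digits at all, matching the remark above that a fully predictable message carries zero information.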
An extension of this definition to integers N that are not necessarily powers of 2 can be justified as follows: consider a sequence of k elements from V. The number of sequences v1, ..., vk (vi ∈ V) is N^k. (Note that one element is allowed to occur more than once.) The number of binary digits needed to express a sequence v1, ..., vk is nk, where 2^(nk−1) < N^k ≤ 2^nk. The average number of digits needed to express an element in this sequence is nk/k, where k log2 N ≤ nk < k log2 N + 1. We then have

lim_{k→∞} nk/k = log2 N.

The following definition is therefore meaningful:

Definition 25 Let VN be a finite set with N elements. Then the information necessary to characterize the elements of VN is defined by

I(VN) = log2 N    (3.1)

This definition can also be derived by postulating the following properties a measure of information should have:

1. Additivity: If |VK| = N·M, then I(VK) = I(VN) + I(VM)
2. Monotonicity: I(VN) ≤ I(VN+1)
3. Definition of unit: I(V2) = 1.

The only function that satisfies these conditions is I(VN) = log2 N.
Consider now a more complex situation where VN = ∪_{j=1}^k Vj with Vj ∩ Vl = ∅ (j ≠ l) and |Vj| = Nj (and hence N = N1 + ... + Nk), and define pj = Nj/N. Suppose that we select an element from V randomly, each element having the same probability of being chosen. If an element v ∈ V is known to belong to a specific Vj, then the additional information needed to identify it within Vj is equal to I(Vj) = log2 Nj. The expected value of this additional information is therefore

I2 = Σ_{j=1}^k pj log2 Nj = Σ_{j=1}^k pj log2(N pj)    (3.2)

Let I1 be the information needed to identify the set Vj which v belongs to. Then the total information needed for identifying (encoding) elements of V is

log2 N = I1 + I2    (3.3)

On the other hand, Σ_{j=1}^k pj log2 N = log2 N, so that we obtain Shannon's famous formula

I1 = −Σ_{j=1}^k pj log2(pj)    (3.4)

I1 is also called Shannon information. Shannon information is thus the expected information about the occurrence of the sets V1, ..., Vk contained in a randomly chosen element from V. Note that the term "information" can be used synonymously for "uncertainty": the information obtained from a random experiment diminishes uncertainty by the same amount. The derivation of Shannon information is credited to Shannon (1948) and, independently, Wiener (1948). In physics, an analogous formula is known as entropy and is a measure of the disorder of a system (see Boltzmann 1896, Figure 3.1).
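The decomposition (3.3) of the total information into I1 and I2 is easy to verify numerically. A minimal sketch (the partition of N = 8 elements into subsets of sizes 4, 2, 2 is invented for illustration):

```python
import math

N_j = [4, 2, 2]                     # sizes of the disjoint subsets V_1, ..., V_k
N = sum(N_j)                        # N = 8 elements in total
p = [n / N for n in N_j]            # p_j = N_j / N

# Shannon information (3.4): bits needed to identify the subset
I1 = -sum(pj * math.log2(pj) for pj in p)
# expected additional information (3.2): bits to identify v within its subset
I2 = sum(pj * math.log2(nj) for pj, nj in zip(p, N_j))

print(I1, I2)                        # 1.5 1.5
print(I1 + I2 == math.log2(N))       # equation (3.3): True
```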
Shannon's formula can also be derived by postulating the following properties for a measure of information of the outcome of a random experiment: let V1, ..., Vk be the possible outcomes of a random experiment and denote by pj = P(Vj) the corresponding probabilities. Then a measure of information, say I, obtained by the outcome of the random experiment should have the following properties:

1. Function of probabilities: I = I(p1, ..., pk), i.e. I depends on the probabilities pj only;
2. Symmetry: I(p1, ..., pk) = I(pπ(1), ..., pπ(k)) for any permutation π;
3. Continuity: I(p, 1 − p) is a continuous function of p (0 ≤ p ≤ 1);
4. Definition of unit: I(1/2, 1/2) = 1;

5. Additivity and weighting by probabilities:

I(p1, ..., pk) = I(p1 + p2, p3, ..., pk) + (p1 + p2) I(p1/(p1 + p2), p2/(p1 + p2))    (3.5)

The meaning of the first four properties is obvious. The last property can be interpreted as follows: suppose the outcome of an experiment does not distinguish between V1 and V2, i.e. if v turns out to be in one of these two sets, we only know that v ∈ V1 ∪ V2. Then the information provided by the experiment is I(p1 + p2, p3, ..., pk). If the experiment did distinguish between V1 and V2, then it is reasonable to assume that the information would be larger by the amount

(p1 + p2) I(p1/(p1 + p2), p2/(p1 + p2)).

Equation (3.5) tells us exactly that: the complete information I(p1, ..., pk) can be obtained by adding the partial and the additional information. It turns out that the only function for which the postulates hold is Shannon's information:
Theorem 9 Let I be a functional that assigns to each finite discrete distribution P (defined by probabilities p1, ..., pk, k ≥ 1) a real number I(P), such that the properties above hold. Then

I(P) = I(p1, ..., pk) = −Σ_{j=1}^k pj log2 pj    (3.6)

Shannon information has an obvious upper bound that follows from Jensen's inequality: recall that Jensen's inequality states that for a convex function g and weights wj ≥ 0 with Σ wj = 1 we have

g(Σ wj xj) ≤ Σ wj g(xj).

In particular, for g(x) = x log2 x,

k^(−1) Σ g(pj) = k^(−1) Σ pj log2 pj ≥ g(k^(−1) Σ pj) = −k^(−1) log2 k.

Hence,

I(P) ≤ log2 k    (3.7)

This bound is achieved by the uniform distribution pj = 1/k. The other extreme case is pj = 1 for some j. This means that event Vj occurs with certainty and I(p1, ..., pk) = I(pj) = I(1) = I(1, 0) = I(1, 0, 0) etc. Then from the fifth property we have I(1, 0) = I(1) + I(1, 0), so that I(1) = 0. The interpretation is that, if it is clear a priori which event will occur, then a random experiment does not provide any information.
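Both the bound (3.7) and the two extreme cases can be confirmed numerically; a quick sketch with randomly drawn distributions (the seed and k = 8 are arbitrary choices):

```python
import math
import random

def shannon(p):
    """Shannon information (3.4); terms with p_j = 0 contribute zero."""
    return -sum(pj * math.log2(pj) for pj in p if pj > 0)

k = 8
random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in range(k)]
    p = [wj / sum(w) for wj in w]
    assert shannon(p) <= math.log2(k) + 1e-12   # the bound (3.7)

print(shannon([1 / k] * k))              # 3.0: uniform attains log2 k
print(shannon([1.0] + [0.0] * (k - 1)))  # a certain outcome is uninformative
```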
The notion of information can be extended in an obvious way to the case where one has an infinite but countable number of possible outcomes. The information contained in the realization of a random variable X with possible outcomes x1, x2, ... is defined by

I(X) = −Σ pj log2 pj

where pj = P(X = xj). More subtle is the extension to continuous distributions and random variables. A nice illumination of the problem is given in Rényi (1970): for a random variable with uniform distribution on (0,1), the digits in the binary expansion of X are infinitely many independent 0-1 random variables where 0 and 1 occur with probability 1/2 each. The information furnished by a realization of X would therefore be infinite. Nevertheless, a meaningful measure of information can be defined as a limit of discrete approximations:
Theorem 10 Let X be a random variable with density function f. Define XN = [NX]/N, where [x] denotes the integer part of x. If I(X1) < ∞, then the following holds:

lim_{N→∞} I(XN)/log2 N = 1    (3.8)

lim_{N→∞} (I(XN) − log2 N) = −∫ f(x) log2 f(x) dx    (3.9)

We thus have

Definition 26 Let X be a random variable with density function f. Then

I(X) = −∫ f(x) log2 f(x) dx    (3.10)

is called the information (or entropy) of X.


Note that, in contrast to discrete distributions, information can be negative. This is due to the fact that I(X) is in fact the limit of a difference of informations.
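Definition 26 can be checked against known closed forms (the examples are ours, not from the text): the uniform density on (0, 1) has entropy 0, and the standard normal has entropy (1/2) log2(2πe) ≈ 2.05 bits. A numerical sketch by midpoint integration:

```python
import math

def diff_entropy(f, a, b, n=20000):
    """-integral of f(x) log2 f(x) dx over [a, b], midpoint rule (3.10)."""
    h = (b - a) / n
    s = 0.0
    for i in range(n):
        fx = f(a + (i + 0.5) * h)
        if fx > 0:
            s -= fx * math.log2(fx) * h
    return s

uniform = lambda x: 1.0                                    # density on (0, 1)
normal = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

print(diff_entropy(uniform, 0, 1))   # 0.0: uniform(0,1) has zero entropy
print(diff_entropy(normal, -8, 8))   # 0.5*log2(2*pi*e) = 2.047... bits
```

A uniform density on a shorter interval than (0, 1) would give a negative value, illustrating the remark above.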
The notion of entropy can also be carried over to measuring randomness in stationary time series in the sense of correlations. (For the definition of stationarity and time series in general see Chapter 4.)

Definition 27 Let Xt (t ∈ Z) be a stationary process with var(Xt) = 1 and spectral density f. Then the spectral entropy of Xt is defined by

I(Xt, t ∈ Z) = −∫ f(x) log2 f(x) dx    (3.11)

This definition is plausible, because for a process with unit variance, f has the same properties as a probability distribution and can be interpreted as a distribution on frequencies. The process Xt is uncorrelated if and only if f is constant, i.e. if f is the uniform distribution on [−π, π]. Exactly in this case entropy is maximal, and knowledge of past observations does not help to predict future observations. On the other hand, if f has one or more extreme peaks, then entropy is very low (and in the limit minus infinity). This corresponds to the fact that in this case future observations can be predicted with high accuracy from past values. Thus, future observations do not contain as much new information as in the case of independence.
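As an illustration of Definition 27 (the AR(1) example is ours, not the book's): unit-variance white noise has the constant spectral density f = 1/(2π) on [−π, π] and therefore maximal spectral entropy log2(2π) ≈ 2.65, while a strongly autocorrelated AR(1) process has a sharply peaked density and a much lower entropy:

```python
import math

def spectral_entropy(f, n=20000):
    # Definition 27: -integral over [-pi, pi] of f(x) log2 f(x) dx
    h = 2 * math.pi / n
    total = 0.0
    for i in range(n):
        fx = f(-math.pi + (i + 0.5) * h)
        if fx > 0:
            total -= fx * math.log2(fx) * h
    return total

white = lambda x: 1 / (2 * math.pi)       # unit-variance uncorrelated process

def ar1_density(phi):
    # standard spectral density of a unit-variance AR(1) process
    return lambda x: (1 - phi ** 2) / (2 * math.pi) \
                     / (1 - 2 * phi * math.cos(x) + phi ** 2)

print(spectral_entropy(white))             # log2(2*pi) = 2.651..., the maximum
print(spectral_entropy(ar1_density(0.9)))  # much lower: highly predictable
```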
3.2.2 Measuring metric, melodic, and harmonic importance

General idea

Western classical music is usually structured in at least three aspects: melody, metric structure, and harmony. With respect to representing the essential melodic, metric, and harmonic structures, not all notes are equally important. For a given composition K, we may therefore try to find metric, melodic, and harmonic structures and quantify them in a weight function w : K → R^3 (which we will also call an indicator). For each note event x ∈ K, the three components of w(x) = (wmelodic(x), wmetric(x), wharmonic(x)) quantify the importance of x with respect to the melodic, metric, and harmonic structure of the composition respectively.
Omnibus metric, melodic, and harmonic indicators

Specific definitions of structural indicators (or weight functions) are discussed for instance in Mazzola et al. (1995), Fleischer et al. (2000), and Beran and Mazzola (2001). To illustrate the general approach, we give a full definition of metric weights. Melodic and harmonic weights are defined in a similar fashion, taking into account the specific nature of melodic and harmonic structures respectively.

Metric structures characterize local periodic patterns in symbolic onset times. This can be formalized as follows: let K ⊂ Z^4 be a composition (with coordinates "Onset Time", "Pitch", "Loudness", and "Duration"), T ⊂ Z its set of onset times (i.e. the projection of K on the first axis), and let tmax = max{t : t ∈ T}. Without loss of generality the smallest onset time in T is equal to one.

Definition 28 For each triple (t, l, p) ∈ Z × N × N the set

B(t, l, p) = {t + kp : 0 ≤ k ≤ l}

is called a meter with starting point t, length l, and period p. The meter is called admissible if B(t, l, p) ⊆ T. The non-negative length l of a local meter M = B(t, l, p) is uniquely determined by the set M and is denoted by l(M).
Note that by definition, t ∈ B(t, l, p) for any (t, l, p) ∈ Z × N × N. The importance of events at onset time s is now measured by the number of meters this onset is contained in. For a given triple (t, l, p), three situations can occur:

1. B(t, l, p) is admissible and there is no other admissible local meter B′ = B(t′, l′, p′) such that B ⊂ B′;
2. B(t, l, p) is not admissible;
3. B(t, l, p) is admissible, but there is another admissible local meter B′ = B(t′, l′, p′) such that B ⊂ B′.
We count only case 1. This leads to the following definition:

Definition 29 An admissible meter B(t, l, p) for a composition K ⊂ Z^4 is called a maximal local meter if and only if it is not a proper subset of another admissible local meter B(t′, l′, p′) of K. Denote by M(K) the set of maximal local meters of K and by M(K, t) the set of maximal local meters of K containing onset t.

Note that the set M(K) is always a covering of T. Metric weights can now be defined, for instance, by

Definition 30 Let x ∈ K be a note event at onset time t(x) ∈ T, M = M(K, t) the set of maximal local meters of K containing t(x), and h a nondecreasing real function on Z. Specify a minimal length lmin. Then the metric indicator (or metric weight) of x, associated with the minimal length lmin, is given by

wmetric(x) = Σ_{M∈M, l(M)≥lmin} h(l(M))    (3.12)

In a similar fashion, melodic indicators wmelodic and harmonic indicators wharmonic can be derived from a melodic and harmonic analysis respectively.
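Definitions 28-30 translate directly into a small search: enumerate the admissible meters contained in the onset set T, discard those properly contained in another admissible meter (Definition 29), and sum h(l(M)) over the maximal meters through each onset, as in equation (3.12). A sketch under these assumptions; the onset times and the choice h(l) = l are invented for illustration:

```python
def admissible_meters(T):
    """For each start t and period p, the longest admissible meter
    B(t, l, p) = {t + k*p : 0 <= k <= l} contained in T.
    (Shorter sub-meters and single onsets are never maximal, so they
    are skipped here.)"""
    Ts, meters = set(T), []
    tmax = max(T)
    for t in T:
        for p in range(1, tmax - t + 1):
            l = 0
            while t + (l + 1) * p in Ts:
                l += 1
            if l > 0:
                meters.append(frozenset(t + k * p for k in range(l + 1)))
    return meters

def maximal_meters(meters):
    # Definition 29: keep meters that are not proper subsets of another one
    return [M for M in meters if not any(M < M2 for M2 in meters)]

def metric_weight(t, maximal, h=lambda l: l, lmin=2):
    # equation (3.12): sum of h(l(M)) over maximal local meters through t
    return sum(h(len(M) - 1) for M in maximal if t in M and len(M) - 1 >= lmin)

T = [1, 2, 3, 4, 5, 7, 9]                    # hypothetical onset times
mm = maximal_meters(admissible_meters(T))
print({t: metric_weight(t, mm) for t in T})  # onset 1 gets the largest weight
```

Onsets lying in several long maximal meters (here onset 1) receive high metric weight, which is exactly the intuition behind the indicator.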
Specific indicators

A possible objection to weight functions as defined above is that only information about pitch and onset time is used. A score, however, usually contains much more symbolic information that helps musicians to read it correctly. For instance, melodic phrases are often connected by a phrasing slur, notes are grouped by beams, separate voices are made visible by suitable orientation of note stems, etc. Ideally, structural indicators should take into account such additional information. An improved indicator that takes into account knowledge about musical "motifs" can be defined for example as follows:

Definition 31 Let M = {(τ1, y1), ..., (τk, yk)}, τ1 < τ2 < ... < τk, be a motif, where y denotes pitch and τ onset time. Given a composition K ⊂ T × Z ⊂ Z^2, define for each score-onset time ti ∈ T (i = 1, ..., n) and u ∈ {1, ..., k} the shifted motif

M(ti, u) = {(ti + τ1 − τu, y1), ..., (ti + τk − τu, yk)}


and denote by

Tu(ti) = {ti + τ1 − τu, ..., ti + τk − τu} = {s1, ..., sk}

the corresponding onset times. Moreover, let

Xu(ti) = {x = (x(s1), ..., x(sk)) : (si, x(si)) ∈ K}

be the set of all pitch-vectors with onset set Tu(ti). Then we define the distance

du(ti) = min_{x∈Xu(ti)} Σ_{i=1}^k (x(si) − yi)^2    (3.13)

If Xu is empty, then du(ti) is not defined or set equal to an arbitrary upper bound D < ∞.

In this definition, it is assumed that the motif is identified beforehand by other means (e.g. "by hand" using traditional musical analysis). The distance du(ti) thus measures in how far there are notes that are similar to those in M, if ti is at the uth place of the rhythmic pattern of motif M. Note that the euclidian distance Σ (x(si) − yi)^2 could be replaced by any other reasonable distance.
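Definition 31 amounts to sliding the motif's rhythm over the score and, where all shifted onsets exist, choosing the closest available pitches; since the sum in (3.13) decouples over onsets, the minimum can be taken coordinatewise. A sketch with an invented motif and composition (K maps each onset time to the pitches sounding there):

```python
def motif_distance(K, motif, t, u, D=float("inf")):
    """d_u(t) of equation (3.13): squared pitch distance of the best-matching
    note choice when onset t plays the u-th role in the motif's rhythm.
    K maps each onset time to the list of pitches sounding at that onset."""
    taus = [tau for tau, _ in motif]
    ys = [y for _, y in motif]
    onsets = [t + tau - taus[u] for tau in taus]
    if not all(s in K for s in onsets):
        return D            # X_u(t) is empty: return the upper bound D
    # minimize independently per onset: the sum in (3.13) decouples
    return sum(min((x - y) ** 2 for x in K[s]) for s, y in zip(onsets, ys))

# hypothetical data: motif as (onset, pitch) pairs, and a tiny composition
motif = [(0, 60), (1, 62), (2, 64)]
K = {0: [60], 1: [62, 55], 2: [64], 3: [65]}
print(motif_distance(K, motif, t=0, u=0))   # 0: the motif occurs exactly
print(motif_distance(K, motif, t=1, u=0))   # (62-60)^2+(64-62)^2+(65-64)^2 = 9
```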
Analogously, distance or similarity can be measured by correlation:

Definition 32 Using the same definitions as above, let

xo = arg min_{x∈Xu(ti)} Σ_{i=1}^k (x(si) − yi)^2,

and define ru(ti) to be the sample correlation between xo and y = (y1, ..., yk). If M(ti, u) ⊄ K, then set ru(ti) = 0.
Disregarding the position within a motif, we can now define overall motivic indicators (or weights), for instance by

wd,mean(ti) = g(Σ_{u=1}^k du(ti))    (3.14)

where g is a monotonically decreasing function,

wd,min(ti) = min_{1≤u≤k} du(ti)    (3.15)

or

wcorr(ti) = max_{1≤u≤k} ru(ti)    (3.16)

Finally, given weights for p different motifs, we may combine these into one overall indicator. For instance, an overall melodic indicator based on correlations can be defined by

wmelod(ti) = Σ_{j=1}^p h(wcorr,j(ti), Li)    (3.17)

where wcorr,j is the weight function for motif number j and Li is the number of elements in the motif. Including Li has the purpose of attributing higher weights to the presence of longer motifs.
The advantage of the motif-based definition is that one can first search for possible motifs in the score, making full use of the available information in the score as well as musicological and historical knowledge, and then incorporate these in the definition of melodic weights. Similar definitions may be obtained for metric and harmonic indicators.
3.2.3 Measuring dimension

There are many different definitions of dimension, each measuring a specific aspect of objects. Best known is the topological dimension. In the usual euclidian space R^k with scalar product <x, y> = Σ_{i=1}^k xi yi and distances |x − y| = √<x − y, x − y>, the topological dimension of the space is equal to k. The dimension of an object in this space is equal to the dimension of the subspace it is contained in. The euclidian space is, however, rather special since it is metric with a scalar product.

More generally, one can define a topological dimension in any topological (not necessarily metric) space in terms of coverings. We start with the definition of a topological space: a topological space is a nonempty set X together with a family O of so-called open subsets of X satisfying the following conditions:

1. X ∈ O and ∅ ∈ O (∅ denotes the empty set)
2. If U1, U2 ∈ O, then U1 ∩ U2 ∈ O
3. If U1, U2 ∈ O, then U1 ∪ U2 ∈ O.
A covering of a set S ⊆ X is a collection U ⊆ O of open sets such that S ⊆ ∪_{U∈U} U.

A refinement of a covering U is a covering U′ such that for each U′ ∈ U′ there exists a U ∈ U with U′ ⊆ U. The definition of topological dimension is now as follows:

Definition 33 A topological space X has topological dimension m, if every covering U of X has a refinement U′ in which every point of X occurs in at most m + 1 sets of U′, and m is the smallest such integer.
The topological dimension of a subset S ⊆ X is defined analogously. For instance, a straight line in a euclidian space can be divided into open intervals such that at most two intervals intersect, so that dT = 1. Similarly, a simple geometric figure in the plane, such as a disk or a rectangle (including the inner area), can be covered with arbitrarily small circles or rectangles such that at most three such sets intersect; this number can, however, not be made smaller. Thus, the topological dimension of such an object is dT = 3 − 1 = 2.


The topological dimension is a relatively rough measure of dimension, since it can assume integer values only and thus classifies sets (in a topological space) into a finite or countable number of categories. On the other hand, dT is defined for very general spaces where a metric (i.e. distances) need not exist. A finer definition of dimension, which is however confined to metric spaces, is the Hausdorff-Besicovitch dimension. Suppose we have a set A in a metric space X. In a metric space, we can define open balls of radius r around each point x ∈ X by

U(r) = {y ∈ X : dX(x, y) < r}

where dX is the metric in X. The idea is now to measure the "size" of A by covering it with a finite number of balls Ur = {U1(r), ..., Uk(r)} of radius r and to calculate an approximate measure of A by

μ_{Ur,r,h}(A) = Σ h(r)    (3.18)

where the sum is taken over all balls and h is some positive function. This measure depends on r, the specific covering Ur, and h. To obtain a measure that is independent of a specific covering, we define the measure

μ_{r,h}(A) = inf_{Uρ : ρ<r} μ_{Uρ,ρ,h}(A)    (3.19)

This measure is still only an approximation of A. The question is now whether we can get a measure that corresponds exactly to the set A. This is done by taking the limit r → 0:

μ_h(A) = lim_{r→0} μ_{r,h}(A)    (3.20)

Clearly, as r tends to zero, μ_{r,h} becomes at most larger and therefore has a limit. The limit can be either zero (if μ_{r,h} = 0 already), infinity, or a finite number. This leads to the following definition:

Definition 34 A function h for which

0 < μ_h(A) < ∞

is called an intrinsic function of A.
Consider, for example, a simple shape in the plane such as a circle with radius R. The area of the circle A can be measured by covering it by small circles of radius r and evaluating μ_h(A) using the function h(r) = πr^2. It is well known that lim_{r→0} μ_{r,h}(A) exists and is equal to μ_h(A) = πR^2. On the other hand, if we took h(r) = r^α with α < 2, then μ_h(A) = ∞, whereas for α > 2, μ_h(A) = 0. For standard sets, such as circles, rectangles, triangles, cylinders, etc., it is generally true that the intrinsic function for a set A with topological dimension dT = d is given by (Hausdorff 1919)

h(r) = hd(r) = {Γ(1/2)}^d / Γ(1 + d/2) · r^d.    (3.21)

Many other more complicated sets, including randomly generated sets, have intrinsic functions of the form h(r) = L(r) r^d for some d > 0 which is not always equal to dT, and L a function that is slowly varying at the origin (see e.g. Hausdorff 1919, Besicovitch 1935, Besicovitch and Ursell 1937, Mandelbrot 1977, 1983, Falconer 1985, 1986, Kono 1986, Telcs 1990, Devaney 1990). Here, L is called slowly varying at zero if, for any u > 0, lim_{r→0}[L(ur)/L(r)] = 1. This leads to the following definition of dimension:

Definition 35 Let A be a subset of a metric space and

h(r) = L(r) r^d

an intrinsic function of A where L(r) is slowly varying. Then dH = d is called the Hausdorff-Besicovitch dimension (or Hausdorff dimension) of A.

The definition of Hausdorff dimension leads to the definition of fractals (see e.g. Mandelbrot 1977):

Definition 36 Let A be a subset of a metric space. Suppose that A has topological dimension dT and Hausdorff dimension dH such that

dH > dT.

Then A is called a fractal.

Figure 3.2 Fractal pictures (by Céline Beran, computer generated). (Color figures follow page 152.)

Intuitively, dH > dT means that the set A is "more complicated" than a standard set with topological dimension dT. An alternative definition of Hausdorff dimension is the fractal dimension:

Definition 37 Let A be a compact subset of a metric space. For each ε > 0, denote by N(ε) the smallest number of balls of radius r ≤ ε necessary to cover A. If

dF = lim_{ε→0} log N(ε)/log(1/ε)    (3.22)

exists, then dF is called the fractal dimension of A.
It can be shown that dF ≥ dT. Moreover, in R^k one has dF ≤ k = dT. Beautiful examples of fractal curves and surfaces (cf. Figure 3.2) can be found in Mandelbrot (1977) and other related books. Many phenomena, not only in nature but also in art, appear to be fractal. For instance, fractal shapes can be found in Jackson Pollock's (1912-1956) abstract drip paintings (Taylor 1999a,b,c, 2000). In music, the idea of fractals was used by some contemporary composers, though mainly as a conceptual inspiration rather than an exact algorithm (e.g. Harri Vuori, György Ligeti; Figure 3.3).

Figure 3.3 György Ligeti (*1923). (Courtesy of Philippe Gontier, Paris.)

The notion of fractals is closely related to self-similarity (see Mandelbrot 1977 and references therein). Self-similar geometric objects have the property that the same shapes are repeated at infinitely many scales. By drawing recursively m smaller copies of the same shape, rescaled by a factor s, one can construct fractals. For self-similar objects, the fractal dimension can be calculated directly from the scaling factor s and the number m of repetitions of the rescaled objects by

dF = log m / log s    (3.23)
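Equation (3.23) is immediate to apply. As examples (ours, not from the text): the Koch curve consists of m = 4 copies rescaled by s = 3, and the Sierpinski triangle of m = 3 copies rescaled by s = 2:

```python
import math

def self_similar_dimension(m, s):
    # equation (3.23): m rescaled copies, scaling factor s
    return math.log(m) / math.log(s)

print(self_similar_dimension(4, 3))  # Koch curve: 1.26..., a fractal curve
print(self_similar_dimension(3, 2))  # Sierpinski triangle: 1.58...
print(self_similar_dimension(4, 2))  # a filled square: 2.0, not a fractal
```

The last case shows why the square is not a fractal: dF coincides with its topological dimension.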

For many purposes more realistic are random fractals where, instead of the shape itself, the distribution remains the same after rescaling. More specifically, we have

Definition 38 Let Xt (t ∈ R) be a stochastic process. The process is called self-similar with self-similarity parameter H if, for any c > 0,

Xt =d c^(−H) Xct

where =d means equality of the two processes in distribution.

The parameter H is also called the Hurst exponent. Self-similar processes are (like their deterministic counterparts) very special models. However, they play a central role for stochastic processes, just like the normal distribution for random variables. The reason is that, under very general conditions, the limit of partial sum processes (see Lamperti 1962, 1972) is always a self-similar process:

Theorem 11 Suppose that Zt (t ∈ R+) is a stochastic process such that Z1 ≠ 0 with positive probability, and Zt is the limit in distribution of the sequence of normalized partial sums

an^(−1) Snt = an^(−1) Σ_{s=1}^{[nt]} Xs  (n = 1, 2, ...)    (3.24)

where X1, X2, ... is a stationary discrete time process with zero mean and a1, a2, ... a sequence of positive normalizing constants such that log an → ∞. Then there exists an H > 0 such that for any u > 0, lim_{n→∞}(anu/an) = u^H, Zt is self-similar with self-similarity parameter H, and Zt has stationary increments.
The self-similarity parameter therefore also makes sense for processes that are not exactly self-similar themselves, since it is defined by the rate n^H needed to standardize partial sums. Moreover, H is related to the fractal dimension; the exact relationship between H and the fractal dimension, however, depends on some other properties of the process as well. For instance, sample paths of (univariate) Gaussian self-similar processes, so-called fractional Brownian motion (see Chapter 4), have, with probability one, a fractal dimension of 2 − H, with possible values of H in the interval (0, 1). Thus, the closer H is to 1, the more a sample path is similar to a simple geometric line with dimension one. On the other hand, as H approaches zero, a typical sample path fills up most of the plane, so that the dimension approaches two. Practically, H can be determined from an observed series X1, ..., Xn, for example by maximum likelihood estimation. For a thorough discussion of self-similar and related processes and statistical methods see e.g. Beran (1994). Further references on fractals apart from those given above are, for instance, Edgar (1990), Falconer (1990), Peitgen and Saupe (1988), Stoyan and Stoyan (1994), and Tricot (1995). A cautionary remark should be made at this point: in view of Theorem 11, the fact that we do find self-similarity in aggregated time series is hardly surprising and can therefore not be interpreted as something very special that would distinguish the particular series from other data. What may be special at most is which particular value of H is obtained and which particular self-similar process the normalized aggregated series converges to.
3.3 Specific applications in music

3.3.1 Entropy of melodic shapes

Let x(ti) be the upper and y(ti) the lower envelope of a composition at score-onset times ti (i = 1, ..., n). To investigate the shape of the melodic

movement we consider the first and second discrete derivatives

x^(1)(ti) = Δx(ti)/Δti = (x(ti+1) − x(ti))/(ti+1 − ti)    (3.25)

and

x^(2)(ti) = Δ²x(ti)/Δ²ti = {[x(ti+2) − x(ti+1)] − [x(ti+1) − x(ti)]} / {[ti+2 − ti+1][ti+1 − ti]}    (3.26)

Alternatively, if octaves do not count, we define

x^(1;12)(ti) = [x(ti+1) − x(ti)]12 / (ti+1 − ti)    (3.27)

and

x^(2;12)(ti) = {[x(ti+2) − x(ti+1)]12 − [x(ti+1) − x(ti)]12} / {[ti+2 − ti+1][ti+1 − ti]}    (3.28)

where [x]k = x mod k. Thus, in this definition, intervals between successive notes x(ti), x(ti+1) and x(tj), x(tj+1) respectively are considered identical if they differ by octaves only.
The number of possible values of x^(2) and x^(2;12) is finite, but potentially very
large. To a first approximation we may therefore treat both variables as
continuous. In the following, the distribution of x^(2) and x^(2;12) is
approximated by a continuous kernel density estimate f̂ (see Chapter 2).


For illustration, we define the following measures of entropy:
1. $$E_1 = -\int \hat f(x) \log_2 \hat f(x)\,dx \qquad (3.29)$$
where f̂ is obtained from the observed data x^(2;12)(t_1), ..., x^(2;12)(t_n) by
kernel estimation.
2. E_2: same as E_1, but using x^(2)(t_1), ..., x^(2)(t_n) instead.
3. $$E_3 = -\int\!\!\int \hat f(x, y) \log_2 \hat f(x, y)\,dx\,dy \qquad (3.30)$$
where f̂(x, y) is a kernel estimate based on observations (a_i, b_i) with
a_i = x^(2)(t_{i-1}) and b_i = x^(2)(t_i). Thus, E_3 is the (empirical) entropy of
the joint distribution of two successive values of x^(2).
4. E_4: same as E_3, but using (x^(2;12)(t_{i-1}), x^(2;12)(t_i)) instead.
5. E_5: same as E_3, but using (x(t_i) - y(t_i))^(1) instead.
6. E_6: same as E_3, but using (x(t_i) - y(t_i))^(1;12) instead.
7. E_7: same as E_1, but using (x(t_i) - y(t_i))^(1) instead.
8. E_8: same as E_1, but using (x(t_i) - y(t_i))^(1;12) instead.
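A minimal sketch of how an entropy such as E_1 can be approximated in practice (hypothetical data and bandwidth; the book's computations use the kernel estimates of Chapter 2): estimate the density with a Gaussian kernel and evaluate -∫ f̂ log₂ f̂ dx by a Riemann sum.

```python
import math

def gaussian_kde(data, bandwidth):
    """Kernel density estimate with a Gaussian kernel."""
    n = len(data)
    c = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return lambda x: c * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)

def entropy_bits(f, lo, hi, steps=2000):
    """Approximate E = -integral of f(x) log2 f(x) dx by a midpoint Riemann sum."""
    dx = (hi - lo) / steps
    e = 0.0
    for k in range(steps):
        fx = f(lo + (k + 0.5) * dx)
        if fx > 0.0:
            e -= fx * math.log2(fx) * dx
    return e

derivs = [-2.0, -1.0, 0.0, 1.0, 2.0, 0.5, -0.5]   # hypothetical x^(2) values
fhat = gaussian_kde(derivs, bandwidth=0.8)
print(round(entropy_bits(fhat, -8.0, 8.0), 3))
```

A more uniform spread of the observed derivatives raises this value, which is the sense in which the entropies above measure the mixture of local melodic shapes.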


Figure 3.4 Comparison of entropies 1, 2, 3, and 4 for J.S. Bach's Cello Suite
No. I and R. Schumann's op. 15, No. 2, 3, 4, and 7, and op. 68, No. 2 and 16.


Each of these entropies characterizes the information content (or randomness)
of certain aspects of melodic patterns in the upper and lower envelope.
Figures 3.4a through d show boxplots of entropies 1 through 4 for Bach
and Schumann (Figure 3.8). The pieces considered here are: J.S. Bach, Cello
Suite No. I (each of the six movements separately), Präludium und Fuge
No. 1 and 8 from Das Wohltemperierte Klavier I (each piece separately);
R. Schumann, op. 15, No. 2, 3, 4, and 7, and op. 68, No. 2 and 16. Obviously
there is a difference between Bach and Schumann in all four entropy
measures. In Bach's pieces, entropy is higher, indicating a more uniform
mixture of local melodic shapes.
3.3.2 Spectral entropy of local interval variability
Consider the local variability of intervals y_i = x(t_{i+1}) - x(t_i) between
successive notes. Specifically, we consider a moving nearest-neighbor window
[t_i, t_{i+4}] (i = 1, ..., n-4) and define local variances

$$v_i = \frac{1}{4-1} \sum_{j=0}^{3} (y_{i+j} - \bar y_i)^2 \qquad (3.31)$$

where $\bar y_i = \frac{1}{4} \sum_{j=0}^{3} y_{i+j}$. Based on this, a SEMIFAR model is fitted to
the time series z_i = log(v_i + 1/2) (see Chapter 4 for the definition of SEMIFAR
models). The fitted spectral density f(λ; θ̂) is then used to define the
spectral entropy

$$E_9 = -\int f(\lambda; \hat\theta) \log f(\lambda; \hat\theta)\,d\lambda \qquad (3.32)$$

If octaves do not count, then intervals are circular, so that an estimate of
variability for circular data should be used. Here, we use R* = 2(1 - R̄) as
defined in Chapter 7. To transform the range [0, 2] of R* to the real line,
the logistic transformation is applied, defining

$$z_i = \log\left(\frac{R^* + \epsilon}{2 + \epsilon - R^*}\right)$$

where ε is a small positive number that is needed in order that -∞ < z_i < ∞
even if R* = 0 or 2, respectively. Fitting a SEMIFAR model to z_i, we
then define E_10 the same way as E_9 above.
Figure 3.6 shows a comparison of E_9 and E_10 for the same compositions
as in Section 3.3.1. In contrast to the previous measures of entropy, Bach is
consistently lower than Schumann. With respect to E_10 this is also the case
in comparison with Scriabin (Figure 3.5) and Martin. Thus, for Bach there
appears to be a high degree of nonrandomness (i.e., organization) in the
way the variability of interval steps changes sequentially.
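The local variance (3.31) is simple to compute directly; the sketch below (hypothetical interval data) also applies the transformation z_i = log(v_i + 1/2) used before the SEMIFAR fit (the fit itself is beyond a few lines of code):

```python
import math

def local_variances(y, w=4):
    """v_i = (1/(w-1)) * sum_{j=0}^{w-1} (y_{i+j} - ybar_i)^2 over a moving window."""
    out = []
    for i in range(len(y) - w + 1):
        win = y[i:i + w]
        ybar = sum(win) / w
        out.append(sum((v - ybar) ** 2 for v in win) / (w - 1))
    return out

intervals = [2, 2, 1, 2, 2, 2, 1]      # hypothetical y_i: a major-scale step pattern
v = local_variances(intervals)
z = [math.log(vi + 0.5) for vi in v]   # z_i = log(v_i + 1/2)
print(v)                               # [0.25, 0.25, 0.25, 0.25]
```

For this deliberately regular input the local variance is constant; sequential changes in v_i (and hence z_i) are what the spectral entropy E_9 summarizes.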


Figure 3.5 Alexander Scriabin (1871-1915) (at the piano) and the conductor
Serge Koussevitzky. (Painting by Robert Sterl, 1910; courtesy of Gemäldegalerie
Neuer Meister, Dresden, and Robert-Sterl-House.)

Figure 3.6 Comparison of entropies 9 and 10 for Bach, Schumann, and Scriabin/Martin.


3.3.3 Omnibus metric, melodic, and harmonic indicators for compositions by Bach, Schumann, and Webern

Figures 3.7 and 3.9 through 3.11 show the omnibus metric, melodic,
and harmonic weight functions for Bach's Canon cancricans, Schumann's
op. 15/2 and 7, and Webern's Variations op. 27. For Bach's composition,
the almost perfect symmetry around the middle of the composition
can be seen. Moreover, the metric curve exhibits a very regular up and
down. Schumann's curves, in particular the melodic one, show clear
periodicities. This appears to be quite typical for Schumann and becomes
even clearer when plotting a kernel-smoothed version of the curves (here
a bandwidth of 8/8 was used). Interestingly, this type of pattern can also
be observed for Webern. In view of the historic development of 12-tone
music as a logical continuation of the harmonic freedom and romantic
gesture achieved in the 19th and early 20th centuries, this similarity is not
completely unexpected. Finally, note that a relationship between metric,
melodic, and harmonic structure cannot be seen directly from the raw
curves. However, smoothed weights as shown in the figures reveal
clear connections between the three weight functions. This is even the case
for Webern, in spite of the absence of tonality.

Figure 3.7 Metric, melodic, and harmonic global indicators for Bach's Canon cancricans.


Figure 3.8 Robert Schumann (1810-1856). (Courtesy of Zentralbibliothek Zürich.)

3.3.4 Specific melodic indicators for Schumann's Träumerei

Schumann's Träumerei is rich in local motifs. Here, we consider eight of
these as indicated in Figure 3.12. Figure 3.13 displays the individual indicator
functions obtained from (3.16). The overall indicator function m(t) =
w_melod(t) displayed in Figure 3.15 is defined by (3.17) with h(w, L) =
[2 max(w, 0.5)]^L and L_j = number of notes in motif j. The contributions
h(w_corr,j(t_i), L_j) of w_corr,j (j = 1, ..., 8) are given in Figure 3.14.


Figure 3.9 Metric, melodic, and harmonic global indicators for Schumann's op. 15, No. 2 (upper figure), together with smoothed versions (lower figure).


Figure 3.10 Metric, melodic, and harmonic global indicators for Schumann's op. 15, No. 7 (upper figure), together with smoothed versions (lower figure).


Figure 3.11 Metric, melodic, and harmonic global indicators for Webern's Variations op. 27, No. 2 (upper figure), together with smoothed versions (lower figure).


Figure 3.12 R. Schumann Träumerei: motifs used for specific melodic indicators.


Figure 3.13 R. Schumann Träumerei: indicators of individual motifs.

Figure 3.14 R. Schumann Träumerei: contributions of individual motifs to overall melodic indicator.


Figure 3.15 R. Schumann Träumerei: overall melodic indicator. (Horizontal axis: onset time; vertical axis: weight w.)
CHAPTER 4

Time series analysis


4.1 Musical motivation
Musical events are ordered according to a specific temporal sequence. Time
series analysis deals with observations that are indexed by an ordered variable
(usually time). It is therefore not surprising that time series analysis is
important for analyzing musical data. Traditional applications are
concerned with raw physical data in the form of audio signals (e.g., digital
CD recording, sound analysis, frequency recognition, synthetic sounds,
modeling musical instruments). In the last few years, time series models
have been developed for modeling symbolic musical data and analyzing
higher-level structures in musical performance and composition. A few
examples are discussed in this chapter.
4.2 Basic principles
4.2.1 Deterministic and random components, basic definitions
Time series analysis in its most sophisticated form is a complex subject
that cannot be summarized in one short chapter. Here, we briefly mention
only some of the main ingredients. For a thorough systematic account of the
topic we refer the reader to standard textbooks such as Priestley (1981a,b),
Brillinger (1981), Brockwell and Davis (1991), Diggle (1990), Beran (1994),
and Shumway and Stoffer (2000).
A time series is a family of (usually, but not necessarily) real variables
X_t with an ordered index t. For simplicity, we assume that observations
are taken at equidistant discrete time points t ∈ Z (or N). Usually,
observations are random with certain deterministic components. For instance,
we may have an additive decomposition X_t = μ(t) + U_t where U_t is such
that E(U_t) = 0 and μ(t) is a deterministic function of t. One of the main
aims of time series analysis is to identify the probability model that
generated an observed time series x_1, ..., x_n. In the additive model this would
mean estimating the mean function μ(t) and the probability distribution
of the random sequence U_1, U_2, .... Note that a random sequence can also
be understood as a function mapping positive integers t to the real numbers U_t.
The main difficulties in identifying the correct distribution are:


1. The probability law has to be defined on an infinite-dimensional space of
vectors (X_1, X_2, ...). This difficulty is even more serious for continuous-time
series, where a sample path is a function on R;
2. The finite sample vector X(n) = (X_1, ..., X_n)^t has an arbitrary n-dimensional
distribution, so that it cannot be estimated from observed values x_1, ..., x_n
consistently, unless some minimal assumptions are made.
Difficulty 1 can be solved by applying appropriate mathematical techniques
and is described in detail in standard books on stochastic processes and
time series analysis (see e.g. Billingsley 1986 and the references above).
Difficulty 2 cannot be solved by mathematical arguments only. It is of course
possible to give necessary or sufficient conditions such that the probability
distribution can be estimated with arbitrary accuracy (measured in an
appropriate sense) as n tends to infinity. However, which concrete assumptions
should be used depends on the specific application. Assumptions should neither
be too general (otherwise population quantities cannot be estimated)
nor too restrictive (otherwise results are unrealistic).
A standard, and almost necessary, assumption is that X_t can be reduced
to a stationary process U_t by applying a suitable transformation. For
instance, we may have a deterministic trend μ(i) plus stationary noise U_i,

$$X_i = \mu(i) + U_i, \qquad (4.1)$$

or an integrated process of order m for which the mth difference is
stationary, i.e.

$$(1 - B)^m X_i = U_i \qquad (4.2)$$

where (1 - B)X_i = X_i - X_{i-1}. In the latter case, X_t is called m-difference
stationary. Stationarity is defined as follows:
Definition 39 A time series X_i is called strictly stationary, if for any
k, i_1, ..., i_n ∈ N,

$$P(X_{i_1} \le x_1, ..., X_{i_n} \le x_n) = P(X_{i_1+k} \le x_1, ..., X_{i_n+k} \le x_n) \qquad (4.3)$$

The time series is called weakly (or second order) stationary, if

$$\mu(i) = E(X_i) = \mu = \mathrm{const} \qquad (4.4)$$

and for any i, j ∈ N, the autocovariance depends on the lag k = |i - j| only,
i.e.

$$\mathrm{cov}(X_i, X_{i+k}) = \gamma(k) = \gamma(-k) \qquad (4.5)$$

A second order stationary process can be decomposed into uncorrelated
random components that correspond to periodic signals, via the so-called
spectral representation

$$X_t = \mu + \int_{-\pi}^{\pi} e^{it\lambda}\, dZ_X(\lambda). \qquad (4.6)$$

Here Z_X(λ) = Z_{X,1}(λ) + iZ_{X,2}(λ) ∈ C is a so-called orthogonal increment


process (in λ) with the following properties: Z_X(0) = 0, E[Z_X(λ)] = 0, and
for λ_1 > λ_2 ≥ η_1 > η_2,

$$E[\overline{Z_X(\lambda_2, \lambda_1)}\, Z_X(\eta_2, \eta_1)] = 0 \qquad (4.7)$$

where Z_X(u, v) = Z_X(u) - Z_X(v). The integral in (4.6) is defined as a
limit in mean square. It can be constructed by approximating the function
e^{itλ} by step functions

$$g_n(\lambda) = \sum_i \alpha_{i,n}\, 1\{a_{i,n} < \lambda \le b_{i,n}\}$$

(n ∈ N). For step functions we have the integrals

$$I_n = \int g_n(\lambda)\, dZ_X(\lambda) = \sum_i \alpha_{i,n} [Z(b_{i,n}) - Z(a_{i,n})].$$

As g_n → e^{itλ}, the integrals I_n converge to a random variable I, in the sense
that

$$\lim_{n\to\infty} E[(I - I_n)^2] = 0.$$

The random variable I is then denoted by ∫ exp(itλ) dZ(λ). The spectral
representation is especially useful when one needs to identify (random)
periodicities. For this purpose one defines the spectral distribution function

$$F_X(\lambda) = E[|Z_X(\lambda) - Z_X(0)|^2] = E[|Z_X(\lambda)|^2] \qquad (4.8)$$

The variance is then decomposed into frequency contributions by

$$\mathrm{var}(X_t) = \int_{-\pi}^{\pi} E[|dZ_X(\lambda)|^2] = \int_{-\pi}^{\pi} dF_X(\lambda) \qquad (4.9)$$

This means that the expected contribution (expected squared amplitude)
of components with frequencies in the interval (λ, λ + δ] to the variance of
X_t is equal to F(λ + δ) - F(λ).
Two interesting special cases can be distinguished:
Case 1 - F differentiable: In this case,

$$F(\lambda + \delta) - F(\lambda) = \delta \frac{d}{d\lambda} F(\lambda) + o(\delta) = \delta f(\lambda) + o(\delta).$$

The function f is called spectral density and can also be defined directly
by

$$f(\lambda) = \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} \gamma_X(k) e^{-ik\lambda} \qquad (4.10)$$

where γ_X(k) = cov(X_t, X_{t+k}). The inverse relationship is

$$\gamma_X(k) = \int_{-\pi}^{\pi} e^{ik\lambda} f(\lambda)\, d\lambda \qquad (4.11)$$

A high peak of f at a frequency λ_o means that the component(s) at (or in
the neighborhood of) λ_o contribute largely to the variability of X_t. Note
that the period of exp(itλ), as a function of t, is T = 2π/λ (sometimes
one therefore defines ν = λ/(2π) as frequency, in order that the period T is
directly the inverse of the frequency). Thus, a peak of f at λ_o implies that
a sample path of X_t is likely to exhibit a strong periodic component with
frequency λ_o. Periodicity is, however, random: the observed series is not a
periodic function. The meaning of random periodicity can be explained best
in the simplest case where T is an integer: if f has a peak at frequency λ_o =
2π/T, then the correlation between X_t and X_{t+jT} (j ∈ Z) is relatively high
compared to other correlations with similar lags. A further complication
that blurs periodicity is that, if f is continuous around a peak at λ_o, then
the observed signal is a weighted sum of infinitely (in fact uncountably)
many, relatively large components with frequencies that are similar to λ_o.
The sharper the peak, the less this blurring takes place and a distinct
periodicity (though still random) can be seen. In the other extreme case
where f is constant, there is no preference for any frequency, and γ_X(k) = 0
(k ≠ 0), i.e. observations are uncorrelated.
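The Fourier pair (4.10)-(4.11) can be verified numerically. As a hypothetical example (not from the text), take an MA(1) process X_t = ε_t + θ ε_{t-1} with σ_ε² = 1, whose spectral density is f(λ) = (2π)^{-1}(1 + 2θ cos λ + θ²); numerical integration of (4.11) then recovers γ(0) = 1 + θ², γ(1) = θ, and γ(2) = 0.

```python
import math

theta = 0.5   # hypothetical MA(1) coefficient

def f(lam):
    # MA(1) spectral density with sigma^2 = 1: (1/2pi) |1 + theta e^{-i lam}|^2
    return (1 + 2 * theta * math.cos(lam) + theta ** 2) / (2 * math.pi)

def gamma_from_f(k, steps=20000):
    # gamma(k) = integral over [-pi, pi] of e^{ik lam} f(lam) d lam (real part; f is even)
    dl = 2 * math.pi / steps
    total = 0.0
    for j in range(steps):
        lam = -math.pi + (j + 0.5) * dl
        total += math.cos(k * lam) * f(lam) * dl
    return total

print(round(gamma_from_f(0), 6), round(gamma_from_f(1), 6))  # 1.25 0.5
```

The midpoint rule is essentially exact here because the integrand is a trigonometric polynomial over a full period.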
Case 2 - F is a step function with a finite or countable number of jumps:
this corresponds to processes of the form

$$X_t = \sum_{j=1}^{k} A_j e^{i\lambda_j t}$$

for some k ≤ ∞, and λ_j ∈ [0, π], A_j ∈ C. We then have

$$F(\lambda) = \sum_{j:\, \lambda_j \le \lambda} E[|A_j|^2], \qquad (4.12)$$

$$\mathrm{var}(X_t) = \sum_{j=1}^{k} E[|A_j|^2] \qquad (4.13)$$

This means that the variance is a sum of contributions that are due to the
frequencies λ_j (1 ≤ j ≤ k). A sample path of X_t cannot be distinguished
from a deterministic periodic function, because the randomly selected
amplitudes A_j are then fixed.
Finally, it should be noted that not all frequencies are observable when
observations are taken at discrete time points t = 1, 2, ..., n. The smallest
identifiable period is 2, which corresponds to a highest observable frequency
of 2π/2 = π. The largest identifiable period is n/2, which corresponds to
the smallest frequency 4π/n. As n increases, the lowest frequency tends to
zero, but the highest does not. In other words, the resolution at high
frequencies does not improve with increasing sample size.
To obtain more general models, one may wish to relax the condition
of stationarity. An asymptotic concept of local stationarity is defined in
Dahlhaus (1996a,b, 1997): a sequence of stochastic processes X_{t,n}
(n ∈ N) is called locally stationary, if we have a spectral representation

$$X_{t,n} = \mu\!\left(\frac{t}{n}\right) + \int_{-\pi}^{\pi} e^{it\lambda} A_{t,n}(\lambda)\, dZ_X(\lambda), \qquad (4.14)$$

with "=" meaning almost sure (a.s.) equality, μ(u) continuous, and there
exists a 2π-periodic function A : [0, 1] × R → C such that A(u, -λ) =
$\overline{A(u, \lambda)}$, A(u, λ) is continuous in u, and

$$\sup_{t,\lambda} \left| A\!\left(\frac{t}{n}, \lambda\right) - A_{t,n}(\lambda) \right| \le c\, n^{-1} \qquad (4.15)$$

(a.s.) for some constant c < ∞. Intuitively, this means that for n large
enough, the observed process can be approximated locally, in a small time
window around t, by the stationary process ∫ exp(itλ) A(t/n, λ) dZ_X(λ). The
order n^{-1} of the approximation is chosen such that most standard
estimation procedures, such as maximum likelihood estimation, can be applied
locally and their usual properties (e.g. consistency, asymptotic normality)
still hold. Under smoothness conditions on A one can prove that a
meaningful evolving spectral density f_X(u, λ) (u ∈ (0, 1)) exists such that

$$f_X(u, \lambda) = \lim_{n\to\infty} \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} \mathrm{cov}(X_{[un-k/2],n}, X_{[un+k/2],n})\, e^{-ik\lambda} \qquad (4.16)$$

The function f_X(u, λ) is called the evolutionary spectral density. Note that, for
fixed u,

$$\lim_{n\to\infty} \mathrm{cov}(X_{[un-k/2],n}, X_{[un+k/2],n}) = \gamma_X(k) = \int_{-\pi}^{\pi} \exp(ik\lambda) f_X(u, \lambda)\, d\lambda.$$

Thumfart (1995) carries this concept over to series with discrete spectra.
A simplified definition can be given as follows: a sequence of stochastic
processes X_{t,n} (n ∈ N) is said to have a discrete evolutionary spectrum
F_X(u, λ), if

$$X_{t,n} = \mu\!\left(\frac{t}{n}\right) + \sum_{j \in M} A_j\!\left(\frac{t}{n}\right) e^{i\lambda_j(t/n)\, t} \qquad (4.17)$$

where M ⊂ Z and λ_j(u) is twice continuously differentiable. The discrete
evolutionary spectrum can be defined in analogy to the continuous case.
For other definitions of nonstationary processes see e.g. Priestley (1965,
1981), Ghosh et al. (1997) and Ghosh and Draghicescu (2002a,b).
4.2.2 Sampling of continuous-time time series
Often, time series observed at discrete time points t = jΔ (j = 1, 2, 3, ...)
actually happen in continuous time τ ∈ R. Sampling in discrete time
leads to information loss in the following way: let Y_τ be a second order
stationary time series with τ ∈ R. (Stationarity in continuous time is defined
in exact analogy to Definition 39.) Then, Y_τ has a spectral representation

$$Y_\tau = \int_{-\infty}^{\infty} e^{i\tau\lambda}\, dZ_Y(\lambda), \qquad (4.18)$$

a spectral distribution function

$$F_Y(\lambda) = \int_{-\infty}^{\lambda} E[|dZ_Y(u)|^2] \qquad (4.19)$$

and, if F' exists, a spectral density function

$$f_Y(\lambda) = F'(\lambda) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-i\tau\lambda} \gamma_Y(\tau)\, d\tau \qquad (4.20)$$

We also have

$$\gamma_Y(\tau) = \mathrm{cov}(Y_t, Y_{t+\tau}) = \int_{-\infty}^{\infty} e^{i\tau\lambda} f(\lambda)\, d\lambda.$$

The reason why the frequency range extends to (-∞, ∞), instead of [-π, π],
is that in continuous time, by definition, arbitrarily high frequencies are
observable.
Suppose now that Y_τ is observed at discrete time points t = jΔ only, i.e.
we observe

$$X_t = Y_{j\Delta} \qquad (4.21)$$

Then we can write

$$X_t = \int_{-\infty}^{\infty} e^{ij(\Delta\lambda)}\, dZ_Y(\lambda) = \sum_{u=-\infty}^{\infty} \int_{-\pi/\Delta + (2\pi/\Delta)u}^{\pi/\Delta + (2\pi/\Delta)u} e^{ij(\Delta\lambda)}\, dZ_Y(\lambda) \qquad (4.22)$$

$$= \sum_{u=-\infty}^{\infty} \int_{-\pi/\Delta}^{\pi/\Delta} e^{ij(\Delta\lambda)}\, dZ_Y(\lambda + (2\pi/\Delta)u) = \int_{-\pi/\Delta}^{\pi/\Delta} e^{it\lambda}\, dZ_X(\lambda) \qquad (4.23)$$

where

$$dZ_X(\lambda) = \sum_{u=-\infty}^{\infty} dZ_Y(\lambda + (2\pi/\Delta)u) \qquad (4.24)$$

Moreover, if Y_τ has spectral density f_Y, then the spectral density of X_t is

$$f_X(\lambda) = \sum_{u=-\infty}^{\infty} f_Y(\lambda + (2\pi/\Delta)u) \qquad (4.25)$$

for λ ∈ [-π/Δ, π/Δ]. This result can be interpreted as follows: a frequency λ >
π/Δ can be written as λ = λ_o + (2π/Δ)j for some j ∈ N, where λ_o is in
the interval [-π/Δ, π/Δ]. The contributions of the two frequencies λ and
λ_o to the observed function X_t (in discrete time) are confounded, i.e. they
cannot be distinguished. Thus, if we observe a peak of f_X at a frequency
λ ∈ (0, π/Δ], then this may be due to any of the periodic components with
periods 2π/(λ + (2π/Δ)u), u = 0, 1, 2, ..., or a combination of these. This
has, for instance, direct implications for the sampling of sound signals. Suppose
that 22050 Hz (i.e. λ = 22050 · 2π ≈ 138544.2) is the highest frequency that
we want to identify (and later reproduce) correctly, instead of attributing it
to a lower frequency. This would cover the range perceivable by the human
ear. Then Δ must be so small that π/Δ ≥ 22050 · 2π. Thus the time gap Δ
between successive measurements of the sound wave must not exceed
1/44100.
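The confounding of frequencies described above (aliasing) is easy to check numerically. In the hypothetical sketch below, a sinusoid at 5 cycles per unit time, sampled 8 times per unit time (so the highest identifiable rate is 4 cycles), produces exactly the same sample values as its alias at 8 - 5 = 3 cycles:

```python
import math

fs = 8.0                # samples per unit time, i.e. delta = 1/8 (hypothetical)
f_true = 5.0            # above the identifiable limit fs/2 = 4
f_alias = fs - f_true   # 3.0: the frequency with which it is confounded

n = 16
x_true = [math.cos(2 * math.pi * f_true * t / fs) for t in range(n)]
x_alias = [math.cos(2 * math.pi * f_alias * t / fs) for t in range(n)]
print(all(abs(a - b) < 1e-9 for a, b in zip(x_true, x_alias)))  # True
```

This is exactly why a sampling gap of at most 1/44100 is needed to keep the audible range below the confounding limit.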
4.2.3 Linear filters
Suppose we need to extract or eliminate frequency components from a
signal X_t with spectral density f_X. The aim is thus, for instance, to produce
an output signal Y_t whose spectral density f_Y is zero for a frequency interval
a ≤ λ ≤ b. The simplest, though not necessarily best, way to do this is linear
filtering. A linear filter maps an input series X_t to an output series Y_t by

$$Y_t = \sum_{j=-\infty}^{\infty} a_j X_{t-j} \qquad (4.26)$$

The coefficients must fulfill certain conditions in order that the sum is
defined. If X_t is second order stationary, then we need Σ a_j² < ∞. The
resulting spectral density of Y_t is

$$f_Y(\lambda) = |A(\lambda)|^2 f_X(\lambda) \qquad (4.27)$$

where

$$A(\lambda) = \sum_{j=-\infty}^{\infty} a_j e^{-ij\lambda}. \qquad (4.28)$$

To eliminate a certain frequency band [a, b] one thus needs a linear filter
such that A(λ) ≈ 0 in this interval.
Equation (4.27) also helps to construct and simulate time series models
with desired spectral densities: a series with spectral density f_Y(λ) =
(2π)^{-1} |A(λ)|² can be simulated by passing a series of independent
observations X_t with unit variance through the filter A(λ). Note that, in reality,
one can use only a finite number of terms in the filter so that only an
approximation can be achieved.
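As a hypothetical illustration of (4.27)-(4.28): the symmetric three-point moving average with a_{-1} = a_0 = a_1 = 1/3 has transfer function A(λ) = (1 + 2 cos λ)/3, so it passes frequency λ = 0 unchanged and annihilates λ = 2π/3:

```python
import cmath, math

def transfer(a, lam):
    """A(lambda) = sum_j a_j e^{-i j lambda}; a maps lag j to coefficient a_j."""
    return sum(c * cmath.exp(-1j * j * lam) for j, c in a.items())

a = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}   # three-point moving average
for lam in (0.0, math.pi / 2, 2 * math.pi / 3):
    print(round(abs(transfer(a, lam)) ** 2, 6))   # squared gain |A(lambda)|^2
```

The printed squared gains decrease with frequency, which is the sense in which a moving average is a low-pass filter.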
4.2.4 Special models
When modeling time series statistically, one may use one of the following
approaches: a) parametric modeling; b) nonparametric modeling; and c)
semiparametric modeling. In parametric modeling, the probability
distribution of the time series is completely specified a priori, except for a
finite-dimensional parameter θ = (θ_1, ..., θ_p)^t. In contrast, for nonparametric
models, an infinite-dimensional parameter is unknown and must be
estimated from the data. Finally, semiparametric models have parametric and
nonparametric components. A link between parametric and nonparametric
models can also be established by data-based choice of the length p of the
unknown parameter vector θ, with p tending to infinity with the sample
size. Some typical parametric models are:
1. White noise: X_t second order stationary, var(X_t) = σ²,

$$f_X(\lambda) = \sigma^2/(2\pi),$$

and γ_X(k) = 0 (k ≠ 0).
2. Moving average process of order q, MA(q):

$$X_t = \mu + \epsilon_t + \sum_{k=1}^{q} \theta_k \epsilon_{t-k} \qquad (4.29)$$

with μ ∈ R, ε_t independent identically distributed (iid) random variables,
E(ε_t) = 0, and σ_ε² = var(ε_t) < ∞. This can also be written as

$$X_t - \mu = \theta(B)\epsilon_t \qquad (4.30)$$

where B is the backshift operator with BX_t = X_{t-1}, and θ(B) = Σ_{k=0}^{q} θ_k B^k
with θ_0 = 1. If Σ_{k=0}^{q} θ_k z^k = 0 implies |z| > 1, then X_t is invertible, in the
sense that it can also be written as

$$X_t - \mu = \sum_{k=1}^{\infty} \varphi_k (X_{t-k} - \mu) + \epsilon_t.$$

3. Autoregressive process of order p, AR(p):

$$(X_t - \mu) - \sum_{k=1}^{p} \varphi_k (X_{t-k} - \mu) = \epsilon_t \qquad (4.31)$$

or φ(B)(X_t - μ) = ε_t, where φ(B) = 1 - Σ_{k=1}^{p} φ_k B^k. If 1 - Σ_{k=1}^{p} φ_k z^k = 0
implies |z| > 1, then X_t is stationary.
4. Autoregressive moving average process, ARMA(p, q):

$$\varphi(B)(X_t - \mu) = \theta(B)\epsilon_t. \qquad (4.32)$$

The spectral density is

$$f_X(\lambda) = \frac{\sigma_\epsilon^2}{2\pi}\, \frac{|\theta(e^{-i\lambda})|^2}{|\varphi(e^{-i\lambda})|^2}. \qquad (4.33)$$

5. Linear process:

$$X_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j \epsilon_{t-j} \qquad (4.34)$$

where the ψ_j depend on a finite-dimensional parameter vector θ. The spectral
density is

$$f_X(\lambda) = \frac{\sigma_\epsilon^2}{2\pi}\, |\psi(e^{-i\lambda})|^2.$$
6. Integrated ARIMA process, ARIMA(p, d, q) (Box and Jenkins 1970):

$$\varphi(B)((1 - B)^d X_t) = \theta(B)\epsilon_t \qquad (4.35)$$

with d = 0, 1, 2, ..., where φ(z) and θ(z) are not zero for |z| ≤ 1. This
means that the dth difference (1 - B)^d X_t is a stationary ARMA process.
7. Fractional ARIMA process, FARIMA(p, d, q) (Granger and Joyeux 1980,
Hosking 1981, Beran 1995):

$$(1 - B)^\delta \varphi(B)\{(1 - B)^m X_t - \mu\} = \theta(B)\epsilon_t \qquad (4.36)$$

with d = m + δ, -1/2 < δ < 1/2, m = 0, 1. Here,

$$(1 - B)^d = \sum_{k=0}^{\infty} (-1)^k \binom{d}{k} B^k$$

with

$$\binom{d}{k} = \frac{\Gamma(d + 1)}{\Gamma(k + 1)\Gamma(d - k + 1)}.$$

The spectral density of (1 - B)^m X_t is

$$f_X(\lambda) = \frac{\sigma_\epsilon^2}{2\pi}\, \frac{|\theta(e^{-i\lambda})|^2}{|\varphi(e^{-i\lambda})|^2}\, |1 - e^{-i\lambda}|^{-2\delta}. \qquad (4.37)$$

The fractional differencing parameter δ plays an important role. If δ = 0,
then (1 - B)^m X_t is an ordinary ARIMA(p, 0, q) process, with a spectral
density such that f_X(λ) converges to a finite value f_X(0) as λ → 0 and
covariances that decay exponentially, i.e. |γ_X(k)| ≤ Ca^k for some 0 < C <
∞, 0 < a < 1. The process is therefore said to have short memory. For
δ > 0, f_X has a pole at the origin of the form f_X(λ) ∼ λ^{-2δ} as λ → 0, and
γ_X(k) ∼ k^{2d-1} so that

$$\sum_{k=-\infty}^{\infty} \gamma_X(k) = \infty.$$

This case is also known as long memory, since autocorrelations decay very
slowly (see Beran 1994). On the other hand, if δ < 0, then f_X(λ) ∼ λ^{-2δ}
converges to zero at the origin and

$$\sum_{k=-\infty}^{\infty} \gamma_X(k) = 0.$$

This is called antipersistence, since for large lags there is a negative
correlation. The fractional differencing parameter δ, or d = δ + m, is also called
the long-memory parameter, and is related to the fractal or Hausdorff
dimension d_H (see Chapter 3). For an extended discussion of long-memory and
antipersistent processes see e.g. Beran (1994) and references therein.


8. Fractional Gaussian noise (Mandelbrot and van Ness 1968, Mandelbrot
and Wallis 1969): recall that a stochastic process Y_t (t ∈ R) is called
self-similar with self-similarity parameter H if, for any c > 0, Y_{ct} =_d c^H Y_t.
This definition implies that the covariances of Y_t are equal to

$$\mathrm{cov}(Y_s, Y_t) = \frac{\sigma^2}{2}\left(|t|^{2H} + |s|^{2H} - |t - s|^{2H}\right)$$

where σ² > 0. If Y_t is Gaussian (i.e. all joint distributions are normal),
then the process is fully determined by its expected value and covariance
function. Therefore, there is only one self-similar Gaussian process.
This process is called fractional Brownian motion B_H(t), with self-similarity
parameter 0 < H < 1. The discrete-time increment process

$$X_t = B_H(t) - B_H(t - 1) \quad (t \in N) \qquad (4.38)$$

is called fractional Gaussian noise (FGN). FGN is stationary with
autocovariances

$$\gamma(k) = \frac{\sigma^2}{2}\left(|k + 1|^{2H} + |k - 1|^{2H} - 2|k|^{2H}\right), \qquad (4.39)$$

and the spectral density is equal to (Sinai 1976)

$$f(\lambda) = 2c_f (1 - \cos\lambda) \sum_{j=-\infty}^{\infty} |2\pi j + \lambda|^{-2H-1}, \quad \lambda \in [-\pi, \pi] \qquad (4.40)$$

with c_f = c_f(H, σ²) = σ²(2π)^{-1} sin(πH)Γ(2H + 1) and σ² = var(X_i). For
further discussion see e.g. Beran (1994).
9. Polynomial trend model:

$$X_t = \sum_{j=0}^{p} \beta_j t^j + U_t \qquad (4.41)$$

where U_t is stationary.
10. Harmonic or seasonal trend model:

$$X_t = \sum_{j=0}^{p} \left[\alpha_j \cos \lambda_j t + \beta_j \sin \lambda_j t\right] + U_t \qquad (4.42)$$

with U_t stationary.
11. Nonparametric trend model:

$$X_{t,n} = g\!\left(\frac{t}{n}\right) + U_t \qquad (4.43)$$

with g : [0, 1] → R a smooth function (e.g. twice continuously
differentiable) and U_t stationary.
12. Semiparametric fractional autoregressive model, SEMIFAR(p, d, q) (Beran
1998, Beran and Ocker 1999, 2001, Beran and Feng 2002a,b):

$$(1 - B)^\delta \varphi(B)\{(1 - B)^m X_t - g(s_t)\} = \epsilon_t \qquad (4.44)$$

where d, δ, φ(B), ε_t, and g are as above and m = 0, 1. In this case, the centered
differenced process Y_t = (1 - B)^m X_t - g(s_t) is a fractional ARIMA(p, δ, 0)
process. The SEMIFAR model incorporates stationarity, difference
stationarity, antipersistence, short memory, and long memory, as well as an
unspecified trend. Incorporating all these components enables us to distinguish
statistically which of the components are present in an observed time
series (see Beran and Feng 2002a,b). A software implementation by Beran is
included in the S-Plus package FinMetrics and described in Zivot and
Wang (2002).
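The coefficients of the fractional difference (1 - B)^d appearing in model 7 can be generated by the recursion π_0 = 1, π_k = π_{k-1}(k - 1 - d)/k, which follows from the binomial expansion; the sketch below (hypothetical value of d) checks the recursion against the Gamma-function expression:

```python
import math

def frac_diff_coeffs(d, m):
    """First m coefficients pi_k of (1 - B)^d = sum_k pi_k B^k, pi_k = (-1)^k C(d, k)."""
    c = [1.0]
    for k in range(1, m):
        c.append(c[-1] * (k - 1 - d) / k)
    return c

d = 0.3   # hypothetical fractional differencing parameter
c = frac_diff_coeffs(d, 5)
for k, ck in enumerate(c):
    # compare with (-1)^k Gamma(d+1) / (Gamma(k+1) Gamma(d-k+1))
    ref = (-1) ** k * math.gamma(d + 1) / (math.gamma(k + 1) * math.gamma(d - k + 1))
    assert abs(ck - ref) < 1e-12
print([round(v, 4) for v in c])
```

For 0 < d < 1/2 all coefficients after π_0 are negative and decay slowly, which is the source of the long-memory behavior discussed above.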
4.2.5 Fitting parametric models
If X_t is a second order stationary model with a distribution function that
is known except for a finite-dimensional parameter θ^o = (θ_1^o, ..., θ_k^o)^t ∈
R^k, then the standard estimation technique is the maximum likelihood
method: given an observed time series x_1, ..., x_n, estimate θ by

$$\hat\theta = \arg\max_{\theta} h(x_1, ..., x_n; \theta) \qquad (4.45)$$

where h is the joint density function of (X_1, ..., X_n). If observations are
discrete, then h is the joint probability P(X_1 = x_1, ..., X_n = x_n). Equivalently,
we may maximize the log-likelihood L(x_1, ..., x_n; θ) = log h(x_1, ..., x_n; θ).
Under fairly general regularity conditions, θ̂ is asymptotically consistent, in
the sense that it converges in probability to θ^o. In other words, lim_{n→∞} P(|θ̂ -
θ^o| > ε) = 0 for all ε > 0. In the case of a Gaussian time series with spectral
density f_X(λ; θ), we have

$$L(x_1, ..., x_n; \theta) = -\frac{1}{2}\left[n \log 2\pi + \log|\Sigma_n| + \tilde x^t \Sigma_n^{-1} \tilde x\right] \qquad (4.46)$$

where x = (x_1, ..., x_n)^t, x̃ = x - μ(1, 1, ..., 1)^t, and |Σ_n| is the determinant of
the covariance matrix Σ_n of (X_1, ..., X_n)^t with elements [Σ_n]_{ij} = cov(X_i, X_j).
Since under general conditions n^{-1} log|Σ_n| converges to (2π)^{-1} times the
integral of log f_X (Grenander and Szegő 1958), and the (j, l)th element of
Σ_n^{-1} can be approximated by (2π)^{-2} ∫_{-π}^{π} f_X^{-1}(λ) exp{iλ(j - l)} dλ, an
approximation to θ̂ can be obtained by the so-called Whittle estimator (Whittle 1953;
see also e.g. Fox and Taqqu 1986, Dahlhaus 1987) that minimizes

$$L_n(\theta) = \frac{1}{4\pi} \int_{-\pi}^{\pi} \left[\log f_X(\lambda; \theta) + \frac{I(\lambda)}{f_X(\lambda; \theta)}\right] d\lambda \qquad (4.47)$$

An alternative approximation for Gaussian processes is obtained by using
an autoregressive representation of the type X_t = Σ_{j=1}^{∞} b_j X_{t-j} + ε_t, where
the ε_t are independent identically distributed zero-mean normal variables with
variance σ_ε². This leads to minimizing the sum of the squared residuals, as
explained below in Equation (4.50) (see e.g. Box and Jenkins 1970, Beran
1995).

In general, the actual mathematical and practical difficulty lies in defining
a computationally feasible estimation procedure and in obtaining
the asymptotic distribution of θ̂. There is a large variety of models for
which this has been achieved. Most results are known for linear models
X_t = Σ ψ_j ε_{t-j} with iid ε_t. (All examples given in the previous section are
linear.) The reason is that, if the distribution of ε_t is known, then the
distribution of the process can be recovered by looking at the autocovariances, or
equivalently the spectral density, only. Furthermore, if X_t is invertible, i.e.
if X_t can be written as X_t = Σ_{k=1}^{∞} φ_k X_{t-k} + ε_t, then θ^o can be estimated
by maximizing the log-likelihood of the independent variables ε_t:

$$\hat\theta = \arg\max_{\theta} \sum_{t=1}^{n} \log h_\epsilon(e_t(\theta)) \qquad (4.48)$$

where h_ε is the probability density of ε and e_t(θ) = x_t - Σ_{k=1}^{∞} φ_k x_{t-k}.
For a finite sample, e_t(θ) is approximated by ê_t(θ) = x_t - Σ_{k=1}^{t-1} φ_k x_{t-k}. In
the simplest case where the ε_t are normally distributed with h_ε(x) = (2πσ_ε²)^{-1/2}
exp{-x²/(2σ_ε²)} and θ = (σ_ε², θ_2, ..., θ_p) = (σ_ε², η), we have ê_t(θ) = ê_t(η)
and

$$\hat\theta = \arg\min_{\theta}\left[n \log \sigma_\epsilon^2 + \sum_{t=1}^{n} \frac{\hat e_t^2(\eta)}{\sigma_\epsilon^2}\right] \qquad (4.49)$$

Differentiating with respect to σ_ε² leads to

$$\hat\eta = \arg\min_{\eta} \sum_{t=1}^{n} \hat e_t^2(\eta) \qquad (4.50)$$

and σ̂_ε² = n^{-1} Σ ê_t²(η̂). Under mild regularity conditions, as n tends to
infinity, the distribution of √n(η̂ - η^o) tends to a normal distribution N(0, V)
with covariance matrix V = 2B^{-1}, where B is a p × p matrix with
elements

$$B_{ij} = (2\pi)^{-1} \int_{-\pi}^{\pi} \frac{\partial}{\partial \eta_i} \log f(\lambda; \eta)\, \frac{\partial}{\partial \eta_j} \log f(\lambda; \eta)\, d\lambda$$

(see e.g. Box and Jenkins 1970, Beran 1995).
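For an AR(1) model, (4.50) has the closed-form solution φ̂ = Σ x_t x_{t-1} / Σ x_{t-1}². The sketch below (hypothetical parameter values; an illustration, not the book's software) applies it to simulated data with known mean μ = 0:

```python
import random

random.seed(1)
phi_true, n = 0.6, 4000
x = [0.0]
for _ in range(n - 1):
    x.append(phi_true * x[-1] + random.gauss(0.0, 1.0))   # simulate AR(1)

# e_t(phi) = x_t - phi * x_{t-1}; minimizing sum_t e_t^2 over phi gives:
num = sum(x[t] * x[t - 1] for t in range(1, n))
den = sum(x[t - 1] ** 2 for t in range(1, n))
phi_hat = num / den
print(round(phi_hat, 2))
```

With n = 4000 the estimate is typically within a few hundredths of the true value 0.6, in line with the √n rate of the asymptotic normality stated above.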
The estimation method above assumes that the order of the model, i.e.
the length p of the parameter vector θ, is known. This is not the case in
general, so that p has to be estimated from the data. Information-theoretic
considerations (based on definitions discussed in Section 3.1) lead to Akaike's
famous criterion (AIC; Akaike 1973a,b):

$$\hat p = \arg\min_p \{-2 \log(\text{likelihood}) + 2p\} \qquad (4.51)$$

More generally, we may minimize AIC_α = -2 log(likelihood) + αp with
respect to p. This includes the AIC (α = 2), the BIC (Bayesian information
criterion; Schwarz 1978, Akaike 1979) with α = log n, and the HIC (Hannan
and Quinn 1979) with α = 2c log log n (c > 1). It can be shown that, if
the observed process is indeed generated by a process from the postulated
class of models, and if its order is p_o, then for α ≥ O(2c log log n) the
estimated order is asymptotically correct with probability one. In contrast, if
α/(2c log log n) → 0 as n → ∞, then the criterion tends to choose too many
parameters, in the sense that P(p̂ > p_o) converges to a positive probability.
This is, for instance, the case for Akaike's criterion. Thus, if identification
of a correct model is the aim, and the observed process is indeed likely to be
at least very close to the postulated model class, then α ≥ O(2c log log n)
should be used. On the other hand, one may argue that no model is ever
correct, so that increasing the number of parameters with increasing sample
size may be the right approach. In this case, the original AIC is a good
candidate. It should be noted, however, that if p → ∞ as n → ∞, then
the asymptotic distribution and even the rate of convergence of θ̂ changes,
since this is a kind of nonparametric modeling with an ultimately infinite-dimensional parameter.
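A sketch of data-driven order selection (hypothetical data): the Gaussian log-likelihood is approximated by -(n/2) log σ̂_p², where σ̂_p² is the innovation variance of the fitted AR(p), computed here with the Levinson-Durbin recursion; AIC then penalizes with 2p and BIC with p log n:

```python
import math, random

random.seed(7)
n = 1500
x = [0.0]
for _ in range(n - 1):
    x.append(0.8 * x[-1] + random.gauss(0.0, 1.0))   # AR(1) data, phi = 0.8 (hypothetical)

xbar = sum(x) / n
gamma = [sum((x[t] - xbar) * (x[t + k] - xbar) for t in range(n - k)) / n
         for k in range(8)]                          # sample autocovariances

def innovation_variances(gamma, pmax):
    """Levinson-Durbin: residual variance sigma2_p of the best-fitting AR(p), p = 0..pmax."""
    sigma2, phi, out = gamma[0], [], [gamma[0]]
    for p in range(1, pmax + 1):
        k = (gamma[p] - sum(phi[j] * gamma[p - 1 - j] for j in range(p - 1))) / sigma2
        phi = [phi[j] - k * phi[p - 2 - j] for j in range(p - 1)] + [k]
        sigma2 *= (1.0 - k * k)
        out.append(sigma2)
    return out

s2 = innovation_variances(gamma, 6)
aic = [n * math.log(v) + 2 * p for p, v in enumerate(s2)]            # alpha = 2
bic = [n * math.log(v) + p * math.log(n) for p, v in enumerate(s2)]  # alpha = log n
print(aic.index(min(aic)), bic.index(min(bic)))
```

The stronger BIC penalty typically recovers the true order 1 here, while AIC occasionally selects a slightly larger order, illustrating its tendency to overfit.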
4.2.6 Fitting non- and semiparametric models
Most techniques for tting nonparametric models rely on smoothing, combined with additional estimation of parameters needed for ne tuning of
the smoothing procedure. To illustrate this, consider for instance,
(1 B)m Xt = g(st ) + Ut

(4.52)

as dened above where Ut is second order stationary and st = t/n. If m is


known, then g may be estimated, for instance, by a kernel smoother
g(to ) =

1
nb

K(
t=1

st sto
)yt
b

(4.53)

as dened in Chapter 2, with xt = (1 B)m xt . However, results may


dier considerably depending on the choice of the bandwidth b (see e.g.
Gasser and M ller 1979, Beran and Feng 2002a,b). The optimal bandwidth
u
depends on the nature of the residual process Ut . A criterion for optimality
is, for instance, the integrated mean squared error
IM SE =

E{[(s) g(s)]2 }ds.


g

The IMSE can be written as


IM SE =

{E[(s)]g(s)}2 ds+
g

var((s))ds =
g

{Bias2 +variance}ds.

The Bias only depends on the function g, and is thus independent of the
error process. The variance, on the other hand, is a function of the covariances U (k) = cov(Ut , Ut+k ), or equivalently the spectral density fU .

2004 CRC Press LLC

The bandwidth that minimizes the $IMSE$ thus depends on the unknown quantities $g$ and $f_U$. Both $g$ and $f_U$, therefore, have to be estimated simultaneously in an iterative fashion. For instance, in a SEMIFAR model, the asymptotically optimal bandwidth can be shown to be equal to

$b_{opt} = C_{opt}\, n^{(2\delta-1)/(5-2\delta)}$

where $C_{opt}$ is a constant that depends on the unknown parameter vector $\theta = (\sigma^2, d, \phi_1, ..., \phi_p)^t$. Note that in this case, $m$ is also part of the unknown vector. An algorithm for estimating $g$ as well as $\theta$ can be defined by starting with an initial estimate of $\theta$, calculating the corresponding optimal bandwidth, subtracting $\hat g$ from $x_t$, reestimating $\theta$, estimating the new optimal bandwidth, and so on. Note that in addition the order $p$ is unknown, so that a model choice criterion has to be used at some stage. This complicates matters considerably, and special care has to be taken to define a reliable algorithm. Algorithms that work theoretically as well as practically for reasonably small sample sizes are discussed in Beran and Feng (2002a,b).
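The alternating structure of such an algorithm can be sketched schematically. The toy version below is not the SEMIFAR procedure: instead of estimating the long-memory parameter $d$, it re-estimates a simple AR(1) dependence in the residuals and rescales the bandwidth accordingly (all modeling choices and names are ours):

```python
import numpy as np

def iterative_bandwidth(y, b0=0.1, n_iter=5):
    """Schematic plug-in iteration: alternate between (i) smoothing the
    trend with the current bandwidth and (ii) re-estimating residual
    dependence, which rescales the bandwidth. The AR(1) step is a
    stand-in for the SEMIFAR parameter estimation described in the text.
    """
    n = len(y)
    s = np.arange(1, n + 1) / n
    b = b0
    for _ in range(n_iter):
        K = np.exp(-0.5 * ((s[:, None] - s[None, :]) / b) ** 2)
        K /= K.sum(axis=1, keepdims=True)
        g = K @ y                       # current trend estimate
        u = y - g                       # residuals
        rho = np.corrcoef(u[:-1], u[1:])[0, 1]
        # positive correlation inflates the error variance and hence
        # the optimal bandwidth (short-memory n^{-1/5}-type rule)
        vf = (1 + rho) / (1 - rho) if abs(rho) < 0.99 else 1.0
        b = b0 * vf ** 0.2
    return b, g
```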
4.2.7 Spectral estimation

Sometimes one is only interested in the spectral density $f_X$ of a stationary process or, equivalently, the autocovariances $\gamma_X(k)$, without modeling the whole distribution of the time series. The reason can be, for instance, that, as discussed above, one may be mainly interested in (random) periodicities which are identifiable as peaks in the spectral density.

A natural nonparametric estimate of $\gamma_X(k)$ is the sample autocovariance

$\hat\gamma(k) = \frac{1}{n}\sum_{t=1}^{n-k}(x_t - \bar x)(x_{t+k} - \bar x)$    (4.54)
for $k \ge 0$, and $\hat\gamma(-k) = \hat\gamma(k)$. The corresponding estimate of $f_X$ is the periodogram

$I(\lambda) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\hat\gamma(k)e^{-ik\lambda} = \frac{1}{2\pi n}\Big|\sum_{t=1}^n (x_t-\bar x)e^{-it\lambda}\Big|^2.$    (4.55)

Sometimes a so-called tapered periodogram is used:

$I_w(\lambda) = (2\pi n)^{-1}\Big|\sum_{t=1}^n w\big(\tfrac{t}{n}\big)(x_t-\bar x)e^{-it\lambda}\Big|^2$

where $w$ is a weight function. It can be shown that $E[I(\lambda)] \to f_X(\lambda)$ as $n \to \infty$. However, for lags close to $n-1$, $\hat\gamma(k)$ is very inaccurate, because one averages over $n-k$ observed pairs only. For instance, for $k = n-1$, there is only one observed pair, namely $(x_1, x_n)$, with this lag! As a result, $I(\lambda)$ does


not converge to $f_X(\lambda)$. Instead, the following holds, under mild regularity conditions: if $0 < \lambda_1 < ... < \lambda_k < \pi$, then, as $n \to \infty$, the distribution of $[2I(\lambda_1)/f_X(\lambda_1), ..., 2I(\lambda_k)/f_X(\lambda_k)]$ converges to the distribution of $(Z_1, ..., Z_k)$, where the $Z_i$ are independent $\chi^2_2$-distributed random variables. This result is also true for sequences of frequencies $0 < \lambda_{1,n} < ... < \lambda_{k,n} < \pi$ as long as the smallest distance between the frequencies, $\min|\lambda_{i,n} - \lambda_{j,n}|$, does not converge to zero faster than $n^{-1}$. Because of the latter condition, and also for computational reasons (fast Fourier transform, FFT; see Cooley and Tukey 1965, Brigham 1988), one usually calculates $I(\lambda)$ at the so-called Fourier frequencies $\lambda_j = 2\pi j/n$ $(j = 1, ..., m$, with $m = [(n-1)/2])$ only. Note that for Fourier frequencies, $\sum_{t=1}^n e^{it\lambda_j} = 0$, so that

$I(\lambda_j) = (2\pi n)^{-1}\Big|\sum_{t=1}^n x_t e^{-it\lambda_j}\Big|^2.$
Thus, the sample mean actually does not need to be subtracted. The periodogram at Fourier frequencies can also be understood as a decomposition of the variance into orthogonal components, analogous to classical analysis of variance (Scheffé 1959): for $n$ odd,

$\sum_{t=1}^n (x_t - \bar x)^2 = 4\pi \sum_{j=1}^m I(\lambda_j)$    (4.56)

and for $n$ even,

$\sum_{t=1}^n (x_t - \bar x)^2 = 4\pi \sum_{j=1}^m I(\lambda_j) + 2\pi I(\pi).$    (4.57)

This means that $I(\lambda_j)$ corresponds to the (empirically observed) contribution of periodic components with frequency $\lambda_j$ to the overall variability of $x_1, ..., x_n$.
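The decomposition (4.56) is easy to verify numerically. The snippet below computes the periodogram at Fourier frequencies via the FFT (an illustration; the simulated series and all names are ours) and checks the identity for odd $n$:

```python
import numpy as np

def fourier_periodogram(x):
    """Periodogram I(lambda_j) at the Fourier frequencies
    lambda_j = 2*pi*j/n, j = 1, ..., m = [(n-1)/2], via the FFT."""
    n = len(x)
    xc = x - x.mean()
    X = np.fft.fft(xc)
    m = (n - 1) // 2
    lam = 2 * np.pi * np.arange(1, m + 1) / n
    I = np.abs(X[1:m + 1]) ** 2 / (2 * np.pi * n)
    return lam, I

rng = np.random.default_rng(0)
n = 201                                   # odd, so (4.56) applies exactly
t = np.arange(1, n + 1)
x = np.cos(0.5 * t) + rng.normal(size=n)
lam, I = fourier_periodogram(x)
# analysis-of-variance identity (4.56)
lhs = np.sum((x - x.mean()) ** 2)
rhs = 4 * np.pi * np.sum(I)
```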
A consistent estimate of $f_X$ can be obtained by eliminating or downweighing sample autocovariances with too large lags:

$\hat f(\lambda) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1} w_n(k)\,\hat\gamma(k)\,e^{-ik\lambda}$    (4.58)

where $w_n(k) = 0$ (or becomes negligible) for $|k| > M_n$, with $M_n/n \to 0$ and $M_n \to \infty$. Equivalently, one can define a smoothed periodogram

$\tilde f(\lambda) = \int W_n(\lambda - \mu)\, I(\mu)\, d\mu$    (4.59)

for a suitable sequence of window functions $W_n$ such that $\int W_n(\lambda-\mu)f(\mu)\,d\mu$ converges to $f(\lambda)$ as $n \to \infty$. See e.g. Priestley (1981) for a detailed discussion.
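A minimal lag-window estimate of the form (4.58) can be sketched as follows (the Bartlett window and all names are our illustrative choices):

```python
import numpy as np

def lag_window_spectrum(x, M, lams):
    """Smoothed spectral estimate (4.58) with Bartlett weights
    w_n(k) = 1 - |k|/M for |k| <= M (the window choice is ours)."""
    n = len(x)
    xc = x - x.mean()
    gamma = np.array([xc[:n - k] @ xc[k:] / n for k in range(M + 1)])
    w = 1 - np.arange(M + 1) / M
    k = np.arange(1, M + 1)
    return np.array([(gamma[0] + 2 * np.sum(w[1:] * gamma[1:] * np.cos(k * lam)))
                     / (2 * np.pi) for lam in lams])
```

For white noise with unit variance the true spectral density is constant, $f(\lambda) = 1/(2\pi)$, which the estimate should approach for large $n$ with $M \ll n$.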
Finally, it should be noted that, in spite of its inconsistency, the raw periodogram is very useful for finding periodicities. In particular, in the case of deterministic periodicities with frequencies $\lambda_j$, $I(\lambda)$ diverges to infinity for $\lambda = \lambda_j$ and remains finite (proportional to a $\chi^2_2$ variable) elsewhere.
4.2.8 The harmonic regression model

An important approach to analyzing musical sounds is the harmonic regression model

$X_t = \sum_{j=1}^p [\alpha_j \cos\lambda_j t + \beta_j \sin\lambda_j t] + U_t$    (4.60)

with $U_t$ stationary. Note that, theoretically, this model can also be understood as a stationary process with jumps in the spectral distribution $F_X$ (see Section 4.2.1). Given $\lambda = (\lambda_1, ..., \lambda_p)^t$, the parameter vector $\theta = (\alpha_1, ..., \alpha_p, \beta_1, ..., \beta_p)^t$ can be estimated by the least squares or, more generally, weighted least squares method,
$\hat\theta = \arg\min_\theta \sum_{t=1}^n w\big(\tfrac{t}{n}\big)\Big[x_t - \sum_{j=1}^p(\alpha_j\cos\lambda_j t + \beta_j\sin\lambda_j t)\Big]^2$    (4.61)

where $w$ is a weight function. The solution is obtained from the usual linear regression formulas. In many applications the situation is more complex, since the frequencies $\lambda_1, ..., \lambda_p$ are also unknown. This leads to a nonlinear regression problem. A simple approximate solution can be given by (Walker 1971, Hannan 1973, Hassan 1982, Brown 1990, Quinn and Thomson 1991)

$\hat\lambda = \arg\max_{0<\lambda_1<...<\lambda_p<\pi} \sum_{j=1}^p \Big|\sum_{t=1}^n w\big(\tfrac{t}{n}\big)x_t e^{-i\lambda_j t}\Big|^2 = \arg\max \sum_{j=1}^p I_w(\lambda_j),$    (4.62)

$\hat\alpha_j = \frac{\sum_{t=1}^n w(\tfrac{t}{n})\,x_t\cos\hat\lambda_j t}{\sum_{t=1}^n w(\tfrac{t}{n})},$    (4.63)

and

$\hat\beta_j = \frac{\sum_{t=1}^n w(\tfrac{t}{n})\,x_t\sin\hat\lambda_j t}{\sum_{t=1}^n w(\tfrac{t}{n})}.$    (4.64)
Note that (4.62) means that we look for the $p$ largest peaks in the ($w$-tapered) periodogram. Under quite general assumptions, the asymptotic distribution of the estimates can be shown to be as follows: the vectors

$Z_{n,j} = [\sqrt{n}(\hat\alpha_j - \alpha_j),\ \sqrt{n}(\hat\beta_j - \beta_j),\ n^{3/2}(\hat\lambda_j - \lambda_j)]^t$

$(j = 1, ..., p)$ are asymptotically mutually independent, each having a 3-dimensional normal distribution with expected value zero and covariance matrix $C(\lambda_j)$ that depends on $f_U(\lambda_j)$ and the weight function $w$. The formulas for $C$ are as follows (Irizarry 1998, 2000, 2001, 2002):

$C(\lambda_j) = \frac{4\pi f_U(\lambda_j)}{\alpha_j^2 + \beta_j^2}\, V(\lambda_j)$    (4.65)

where

2
c1 2 + c2 j
j
V (j ) = c3 j j
c4 j

c3 j j
2
c2 2 + c1 j
j
c4 j

c4 j
c4 j ,
co

2
co = ao bo , c1 = Uo Wo , c2 = ao b1 ,

(4.66)
(4.67)

2
2
3
2
c3 = ao W1 Wo (Wo W1 U2 W1 Uo 2Wo W2 U1 + 2Wo W1 W2 Uo ), (4.68)
2
c4 = ao (Wo W1 U2 W1 U1 Wo W2 U1 + W1 W2 Uo ),
2
ao = (Wo W2 W1 )2 ,
2
bn = Wn U2 + Wn+1 (Wn+1 Uo 2Wn U1 ) (n = 0, 1),
1

Un =

sn w2 (s)ds,

(4.69)
(4.70)
(4.71)
(4.72)

o
1

Wn =

sn w(s)ds.

(4.73)

This result can be used to obtain tests and confidence intervals for $\lambda_j$, $\alpha_j$ and $\beta_j$ $(j = 1, 2, ..., p)$, with the unknown quantities $\alpha_j$, $\beta_j$ and $f_U(\lambda_j)$ then replaced by estimates. Note that this involves, in particular, estimation of the spectral density of the residual process $U_t$.

A quantity that is of particular interest is the difference between the partials $\lambda_j$ and the corresponding multiples $j\lambda_1$ of the fundamental frequency,

$\Delta_j = \lambda_j - j\lambda_1.$    (4.74)

For many musical instruments, this difference is exactly or approximately equal to zero. The asymptotic distribution given above can be used to test the null hypothesis $H_o: \Delta_j = 0$ or to construct confidence intervals for $\Delta_j$.
More specifically, $n^{3/2}(\hat\Delta_j - \Delta_j)$ is asymptotically normal with zero mean and variance

$v = 4\pi c_0\Big[\frac{f_U(\lambda_j)}{\alpha_j^2+\beta_j^2} + \frac{j^2 f_U(\lambda_1)}{\alpha_1^2+\beta_1^2}\Big].$    (4.75)

This can be generalized to any hypothesized relationship $\Delta_j = \lambda_j - g(j)\lambda_1$ (see the example of a guitar mentioned in the next section).
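The frequency step (4.62) followed by the linear step can be sketched as follows. This is illustrative code (all names are ours): frequencies are taken as the $p$ largest periodogram ordinates at Fourier frequencies, and the cosine/sine coefficients are then obtained by a general least squares solve, which for $w \equiv 1$ plays the role of (4.63)-(4.64):

```python
import numpy as np

def fit_harmonic(x, p):
    """Estimate frequencies as the p largest periodogram ordinates at
    Fourier frequencies (cf. (4.62)), then estimate the cosine/sine
    coefficients by least squares (taper w identical to 1)."""
    n = len(x)
    t = np.arange(1, n + 1)
    X = np.fft.fft(x - x.mean())
    m = (n - 1) // 2
    I = np.abs(X[1:m + 1]) ** 2
    lam = 2 * np.pi * np.arange(1, m + 1) / n
    lams = np.sort(lam[np.argsort(I)[-p:]])
    D = np.column_stack([f(l * t) for l in lams for f in (np.cos, np.sin)])
    coef, *_ = np.linalg.lstsq(D, x, rcond=None)
    return lams, coef
```

Note that restricting the search to Fourier frequencies limits the frequency resolution to $2\pi/n$; the $n^{-3/2}$ rate quoted above refers to the full nonlinear estimate.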
4.2.9 Dominating frequencies in random series

In the harmonic regression model, the main signal consists of deterministic periodic functions. For less harmonic, noisy signals, a weaker form of periodicity may be observed. This can be modeled by a purely random process whose $m$th difference $Y_t = (1-B)^m X_t$ is stationary $(m = 0, 1, ...)$, with a spectral density $f$ that has distinct local maxima. Estimation of local maxima and identification of the corresponding frequencies is considered, for instance, in Newton and Pagano (1983) and Beran and Ghosh (2000). Beran and Ghosh (2000) consider the case where $Y_t$ is a fractional ARIMA$(p, \delta, 0)$ process of unknown order $p$. Suppose we want to estimate the frequency $\lambda_{max}$ where $f$ assumes the largest local maximum. In a first step, the parameter vector $\theta = (\sigma^2, d, \phi_1, ..., \phi_p)$ (with $d = \delta + m$) is estimated by maximum likelihood and $p$ is chosen by the BIC. Let $\eta = (\sigma^2, \delta, \phi_1, ..., \phi_p) = (\sigma^2, \xi)$ and let

$f(\lambda; \eta) = \frac{\sigma^2}{2\pi}\,|\phi(e^{-i\lambda})|^{-2}\,|1 - e^{-i\lambda}|^{-2\delta} = \frac{\sigma^2}{2\pi}\, g(\lambda; \xi)$    (4.76)

be the spectral density of $Y_t = (1-B)^m X_t$. Then $\hat\lambda_{max}$ is set equal to the frequency where the estimated spectral density $f(\lambda; \hat\eta)$ assumes its maximum. Define

$V_p(\xi) = 2W^{-1}$    (4.77)

where

$W_{ij} = (2\pi)^{-1}\Big[\int \frac{\partial}{\partial u_i}\log g(x; u)\,\frac{\partial}{\partial u_j}\log g(x; u)\, dx\Big]\Big|_{u=\xi}, \quad (i, j = 1, ..., p+1).$    (4.78)

Then, as $n \to \infty$,

$\sqrt{n}(\hat\lambda_{max} - \lambda_{max}) \to_d N(0, \nu_p)$    (4.79)

with

$\nu_p = \nu_p(\eta) = \frac{1}{[g''(\lambda_{max}, \xi)]^2}\,[\dot g'(\lambda_{max}, \xi)]^T\, V_p(\xi)\,[\dot g'(\lambda_{max}, \xi)]$    (4.80)

where $\to_d$ denotes convergence in distribution, $g'$ and $g''$ denote derivatives with respect to frequency, and $\dot g$ the derivative with respect to the parameter vector. Note in particular that the order of $\mathrm{var}(\hat\lambda_{max})$ is $n^{-1}$, whereas in the harmonic regression model the frequency estimates have variances of the order $n^{-3}$. The reason is that a deterministic periodic signal is a much stronger form of periodicity and is therefore easier to identify.
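The idea of estimating a dominating frequency as the argmax of a fitted parametric spectral density can be sketched with a short-memory AR($p$) fit in place of the full fractional ARIMA machinery of the text (Yule-Walker estimation and a grid search; all names are ours):

```python
import numpy as np

def ar_spectral_peak(x, p):
    """Fit AR(p) by Yule-Walker and return the frequency in (0, pi)
    maximizing the implied spectral density (grid-search sketch)."""
    n = len(x)
    xc = x - x.mean()
    gamma = np.array([xc[:n - k] @ xc[k:] / n for k in range(p + 1)])
    G = np.array([[gamma[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(G, gamma[1:])
    s2 = gamma[0] - phi @ gamma[1:]
    lams = np.linspace(0.01, np.pi - 0.01, 2000)
    ks = np.arange(1, p + 1)
    denom = np.abs(1 - np.exp(-1j * np.outer(lams, ks)) @ phi) ** 2
    f = s2 / (2 * np.pi * denom)
    return lams[np.argmax(f)]
```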
4.3 Specific applications in music

4.3.1 Analysis and modeling of musical instruments

There is an abundance of literature on mathematical modeling of sound signals produced by musical instruments. Since a musical instrument is a very complex physical system, even if conditions are kept fixed, not only deterministic but also statistical models are important. In addition, various factors can play a role. For instance, the sound of a violin depends on the wood it is made of, which manufacturing procedure was used, current atmospheric conditions (temperature, humidity, air pressure), who plays the violin, which particular notes are played in which context, etc. The standard approach that makes modeling feasible is to think of a sound as the result of harmonic components that may change slowly in time, plus noise components that may be described by random models. It should be noted, however, that sound is not only produced by an instrument but also perceived by the human ear and brain. Thus, when dealing with the significance or effect of sounds, physiology, psychology and related scientific disciplines come into play. Here, we are first concerned with the actual objective modeling of the physical sound wave. This is a formidable task on its own, and far from being solved in a satisfactory manner.
The scientific study of musical sound signals by physical equations goes back to the 19th century. Helmholtz (1863) proved experimentally that musical sound signals are mainly composed of frequency components that are multiples of a fundamental frequency (also see Rayleigh 1894). Ohm conjectured that the human ear perceives sounds by analyzing the power spectrum (i.e. essentially the periodogram), without taking into account relative phases of the sounds. These conjectures have been mostly confirmed by psychological and physiological experiments (see e.g. Grey 1977, Pierce 1983/1992). Recent mathematical models of instrumental sound waves (see e.g. Fletcher and Rossing 1991) lead to the assumption that, for short time segments, a musical sound signal is stationary and can be written as a harmonic regression model with $\lambda_1 < \lambda_2 < ... < \lambda_p$. To analyze a musical sound wave, one therefore can divide time into small blocks and fit the harmonic regression model as described above. The lowest frequency $\lambda_1$ is called the fundamental frequency and corresponds to what one calls pitch in music. The higher frequencies $\lambda_j$ $(j \ge 2)$ are called partials, overtones, or harmonics. The amplitudes of the partials, and how they change gradually, are main factors in determining the timbre of a sound. For illustration, Figure 4.1 shows the sound wave (air pressure amplitudes) of a piano during 1.9 seconds where first a c′ and then an f′ are played. The signal was sampled in 16-bit format at a sampling rate of 44100 Hz. This corresponds to CD quality and means that every second, 44100 measurements of the sound wave were taken, each measurement taking an integer value between −32768 and 32767 ($32767 + 32768 + 1 = 2^{16}$). Figure 4.2 shows an enlarged picture of the shaded area in Figure 4.1 (2050 measurements, corresponding to 0.046 seconds). The periodogram (in log-coordinates) of this subseries is plotted in Figure 4.3. The largest peak occurs approximately at the fundamental frequency $\lambda_1 = 441 \cdot 2^{-9/12} \approx 262.22$ Hz of c′. Note that, since the periodogram is calculated at Fourier frequencies only, $\lambda_1$ cannot be identified exactly (see also the remarks below). A small number of partials $\lambda_j$ $(j \ge 2)$ can also be seen in Figure 4.3; the contribution of


Figure 4.1 Sound wave of c′ and f′ played on a piano.

higher partials is, however, relatively small. In contrast, the periodogram of an e♭ played on a harpsichord shows a large number of distinctly important partials (Figures 4.4, 4.5). There is obviously a clear difference between piano and harpsichord in terms of amplitudes of higher partials. A comprehensive study of instrumental or vocal sounds also needs to take into account different techniques in which an instrument is played, and other factors such as the particular pitch $\lambda_1$ that is played. This would, however, be beyond the scope of this introductory chapter.
A specific component that is important for timbre is the way in which the coefficients $\alpha_j$, $\beta_j$ change in time (see e.g. Risset and Mathews 1969). Readers familiar with synthesizers may recall "envelopes" that are controlled by parameters such as attack and decay. The development of $\alpha_j$, $\beta_j$ can be studied by calculating the periodogram for a moving time-window and plotting its values against time and frequency in a 3-dimensional or image plot. Thus, we plot the local periodogram (in this context also called


Figure 4.2 Zoomed piano sound wave (shaded area in Figure 4.1).

spectrogram)

$I(t, \lambda) = \frac{1}{2\pi\sum_{j=1}^n W^2\big(\frac{t-j}{nb}\big)}\,\Big|\sum_{j=1}^n W\Big(\frac{t-j}{nb}\Big)\,x_j\, e^{-ij\lambda}\Big|^2$    (4.81)

where $W: \mathbb{R} \to \mathbb{R}_+$ is a weight function such that $W(u) = 0$ for $|u| > 1$, and $b > 0$ is a bandwidth that determines how large the window (block) is, i.e. how many consecutive observations are considered to correspond approximately to a harmonic regression model with fixed coefficients $\alpha_j$, $\beta_j$ and
stationary noise $U_t$. This is illustrated in color Figure 4.7 for a harpsichord sound, with $W(u) = 1\{|u| \le 1\}$. Intense pink corresponds to high values of $I(t, \lambda)$. Figures 4.6a through d show explicitly the change in $I(t, \lambda)$ between four different blocks. Since the note was played staccato, the sound wave is very short, namely about 0.1 seconds. Nevertheless, there is a change in the spectrum of the sound, with some of the higher harmonics fading away.
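With the rectangular weight $W(u) = 1\{|u| \le 1\}$, (4.81) amounts to computing the periodogram on consecutive segments. A minimal sketch (non-overlapping blocks; all names are ours):

```python
import numpy as np

def block_spectrogram(x, block):
    """Local periodograms I(t, lambda) on non-overlapping blocks
    (rectangular window); rows = blocks, columns = frequencies."""
    S = []
    for b in range(len(x) // block):
        seg = x[b * block:(b + 1) * block]
        seg = seg - seg.mean()
        S.append(np.abs(np.fft.rfft(seg)) ** 2 / (2 * np.pi * block))
    return np.array(S)
```

Scanning the rows of the returned array over time shows, for instance, how spectral mass moves between frequencies, as in Figures 4.6 and 4.7.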
Figure 4.3 Periodogram of piano sound wave in Figure 4.2.

Figure 4.4 Sound wave of e♭ played on a harpsichord (0.25 sec at sampling rate 44100 Hz).

Figure 4.5 Periodogram of harpsichord sound wave in Figure 4.4.

Figure 4.6 Harpsichord sound: periodogram plots for different time frames (moving windows of time points; blocks 1, 22, 42, and 62).

Figure 4.7 A harpsichord sound and its spectrogram. Intense pink corresponds to high values of $I(t, \lambda)$. (Color figures follow page 152.)

Apart from the relative amplitudes of partials, most musical sounds include a characteristic nonperiodic noise component. This is a further justification, apart from possible measurement errors, to include a random deviation part in the harmonic regression equation. The properties of the stochastic process $U_t$ are believed to be characteristic for specific instruments (see e.g. Serra and Smith 1991, Rodet 1997). Typical noise components are, for instance, transient noise in percussive instruments, breath noise in wind instruments, or bow noise of string instruments. For a discussion of statistical issues in this context see e.g. Irizarry (2001). For most instruments, not only the harmonic amplitudes but also the characteristics of the noise component change gradually. This may be modeled by smoothly changing processes as defined, for instance, in Ghosh et al. (1997). Other approaches are discussed in Priestley (1965) and Dahlhaus (1996a,b, 1997) (see Section 4.2.1 above).
Some interesting applications of the asymptotic results in Section 4.2.8 to questions arising in the analysis of musical sounds are discussed in Irizarry (2001). In particular, the following experiment is described: recordings were made of a professional clarinet player trying to play concert pitch A ($\lambda_1 = 441$ Hz) and of a professional guitar player playing D ($\lambda_1 = 146.8$ Hz). For the analysis of the clarinet sound, a one-second segment was divided into non-overlapping blocks consisting of 1025 measurements (23 milliseconds) and the harmonic regression model was fitted to each block separately. For the guitar, the same was done with 60 non-overlapping intervals of 3000 observations each. Two types of results were obtained:

1. The clarinet player turned out to be always "out of tune" in the sense that the estimated fundamental frequency $\hat\lambda_1$ was always outside the 95% acceptance region $441\,\mathrm{Hz} \pm 1.96\sqrt{C_{33}(\lambda_1^o)}\,n^{-3/2}$ for the null hypothesis $H_o: \lambda_1 = \lambda_1^o = 441$ Hz. On the other hand, from the point of view of musical perception, the clarinet player was not out of tune, because the deviation from 441 Hz was less than 0.76 Hz, which corresponds to 0.03 semitones. According to experimental studies, the human ear cannot distinguish notes that are 0.03 semitones apart (Pierce 1983/1992).

2. Physical models (see e.g. Fletcher and Rossing 1991) postulate the following relationships between the fundamental frequency and the partials: for a harmonic instrument such as the clarinet, one expects

$\lambda_j = j\lambda_1,$

whereas for a plucked string instrument, such as the guitar, one should have

$\lambda_j \approx cj^2\lambda_1$

where $c$ is a constant determined by properties of the strings. The experiment described in Irizarry (2001) supports the assumption for the clarinet in the sense that, in general, the 95%-confidence intervals for the differences $\lambda_j - j\lambda_1$ contained 0. For the guitar, his findings suggest a relationship of the form $\lambda_j \approx c(a+j)^2\lambda_1$ with $a \neq 0$.
4.3.2 Licklider's theory of pitch perception

Thumfart (1995) uses the theory of discrete evolutionary spectra to derive a simple linear model for pitch perception as proposed by Licklider (1951). The general biological background is as follows (see e.g. Kelly 1991): vibrations of the ear drum caused by sound waves are transferred to the inner ear (cochlea) by three ossicles in the middle ear. The inner ear is a spiral structure that is partitioned along its length by the basilar membrane. The sound wave causes a traveling wave on the basilar membrane which in turn causes hair cells positioned at different locations to release a chemical transmitter. The chemical transmitter generates nerve impulses in the auditory nerve. At which location on the membrane the highest amplitude occurs, and thus which groups of hair cells are activated, depends on the frequency of the sound wave. This means that certain frequency regions correspond to certain hair groups. Frequency bands with high spectral density $f$ (or high increments $dF$ of the spectral distribution) activate the associated hair groups.

To obtain a simple model for the effect of a sound on the basilar membrane movement, Slaney and Lyon (1991) partition the cochlea into 86 sections, each section corresponding to a particular group of cells. Thumfart (1995) assumes that each group of cells acts like a separate linear filter $\Phi_j$ $(j = 1, ..., 86)$. (This is a simplification compared to Slaney and Lyon, who use nonlinear models.) The wave entering the inner ear is assumed to be the original sound wave $X_t$, filtered by the outer ear by a linear filter $A_1$, and by the middle ear by a linear filter $A_2$. Thus, the output of the inner ear that generates the final nerve impulses consists of 86 time series

$Y_{t,j} = \Phi_j(B)A_2(B)A_1(B)X_t \quad (j = 1, ..., 86).$    (4.82)

Calculating tapered local periodograms $I_j(u, \lambda)$ of $Y_{t,j}$ for each of the 86 sections $(j = 1, ..., 86)$, one can then define the quantity

$c(k, j, u) = \int I_j(u, \lambda)\,e^{ik\lambda}\, d\lambda$    (4.83)

which Slaney and Lyon call a correlogram. This is in fact an estimated local autocovariance at lag $k$ for section $j$ and the time segment with midpoint $u$. The Slaney-Lyon correlogram thus essentially characterizes the local autocovariance structure of the resulting nerve impulse series. Thumfart (1995) shows formally how, and under which conditions, this model can be defined within the framework of processes with a discrete evolutionary spectrum. He also suggests a simple method for estimating pitch (the fundamental frequency) at local time $u$ by setting $\hat\lambda_1(u) = 2\pi/k_{max}(u)$, where $k_{max}(u) = \arg\max_k C(k, u)$ and $C(k, u) = \sum_{j=1}^{86} c(k, j, u)$.
4.3.3 Identification of pitch, tone separation and purity of intonation

In a recent study, Weihs et al. (2001) investigate objective criteria for judging the quality of singing (also see Ligges et al. 2002). The main question asked in their analysis is how to assess purity of intonation. In an experimental setting, with standardized playback piano accompaniment in a recording studio, 17 singers were asked to sing Händel's "Tochter Zion" and Beethoven's "Ehre Gottes aus der Natur". The audio signal of the vocal performance was recorded in CD quality in 16-bit format at a sampling rate of 44100 Hz. For the actual statistical analysis, the data is reduced to 11000 Hz for computational reasons, and standardized to the interval [−1, 1].

The first question is how to identify the fundamental frequency (pitch) $\lambda_1$. In the harmonic regression model above, estimates of $\lambda_1$ and the partials $\lambda_j$ $(2 \le j \le k)$ are identical with the $k$ frequencies where the periodogram assumes its $k$ largest values. Weihs et al. suggest a simplified (though clearly suboptimal) version of this, in that they consider the periodogram at Fourier frequencies $\lambda_j = 2\pi j/n$ $(j = 1, 2, ..., m = [(n-1)/2])$ only and set

$\hat\lambda_1 = \min_{j \in \{2, ..., m-1\}}\,\{\lambda_j : I(\lambda_j) > \max[I(\lambda_{j-1}), I(\lambda_{j+1})]\}.$    (4.84)

In other words, $\hat\lambda_1$ corresponds to the Fourier frequency where the first peak of the periodogram occurs. Because of the restriction to Fourier frequencies, the periodogram may have two adjacent peaks and the estimate is too inaccurate in general. An empirical interpolation formula is suggested by the authors to obtain an improved estimate $\tilde\lambda_1$. A comparison with harmonic regression is not made, however, so that it is not clear how well the interpolation works in comparison.
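The first-peak rule (4.84), and its limited resolution at Fourier frequencies, can be sketched as follows (the sampling rate and the test tone are our illustrative choices):

```python
import numpy as np

def first_peak_pitch(x, sr):
    """Return the frequency (in Hz) of the first local peak of the
    periodogram at Fourier frequencies, as in (4.84); None if no peak."""
    xc = x - np.mean(x)
    I = np.abs(np.fft.rfft(xc)) ** 2
    for j in range(2, len(I) - 1):
        if I[j] > I[j - 1] and I[j] > I[j + 1]:
            return j * sr / len(x)
    return None
```

Because the answer is restricted to a Fourier frequency, the estimate is only accurate up to the grid spacing $sr/n$, which is what motivates the interpolation step of Weihs et al.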
Given a procedure for pitch identification, an automatic note separation procedure can be defined. This is a procedure that identifies time points in a sound signal where a new note starts. The interesting result in Weihs et al. is that automatic note separation works better for amateur singers than for professionals. The reason may be the absence of vibrato in amateur voices. In a third step, Weihs et al. address the question of how to assess computationally the purity of intonation based on a vocal time series. This is done using discriminant analysis. The discussion of these results is therefore postponed to Chapter 9.
4.3.4 Music as 1/f noise?

In the 1970s, Voss and Clarke (1975, 1978) discovered a seemingly universal law according to which music has a "1/f spectrum". By a 1/f-spectrum one means that the observed process has a spectral density $f$ such that $f(\lambda) \propto |\lambda|^{-1}$ as $\lambda \to 0$. In the sense of definition (4.10), such a density actually does not exist; however, a generalized version of the spectral density exists in the sense that the expected value of the periodogram converges to this function (see Matheron 1973, Solo 1992, Hurvich and Ray 1995). Specifically, Voss and Clarke analyzed acoustic music signals by first transforming the recorded signal $X_t$ in the following way: a) $X_t$ is filtered by a band-pass filter (frequencies outside the interval [10 Hz, 10000 Hz] are eliminated); and b) the instantaneous power $Y_t = X_t^2$ is filtered by a low-pass filter (frequencies above 20 Hz are eliminated). This filtering technique essentially removes higher frequencies but retains the overall shape (or envelope) of each sound wave corresponding to a note, and its relative position on the onset axis. In this sense, Voss and Clarke actually analyzed rhythmic structures. A recent, statistically more sophisticated study along this line is described in Brillinger and Irizarry (1998).
One objection to this approach can be that in acoustic signals, structural


Figure 4.8 A harpsichord sound wave (a), logarithm of squared amplitudes (b), histogram of the series (c), and its periodogram on log-scale (d) together with fitted SEMIFAR-spectrum.

properties of the composition may be confounded with those of the instruments. Consider, for instance, the harpsichord sound wave in Figure 4.8a. The square of the wave is displayed in Figure 4.8b on logarithmic scale. The picture illustrates that, apart from the obvious oscillation, the (envelope of the) signal changes slowly. Fitting a SEMIFAR model (with order $p \le 8$ chosen by the BIC) yields a good fit to the periodogram. The estimated fractional differencing parameter is $\hat d = 0.51$, with a 95%-confidence interval of [0.29, 0.72]. This corresponds to a spectral density (defined in the generalized sense above) that is proportional to $|\lambda|^{-1.02}$, or approximately $|\lambda|^{-1}$. Thus, even in a composition consisting of one single note one would detect 1/f noise in the resulting sound wave.
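The "1/f" exponent can be gauged crudely by regressing the log-periodogram on log-frequency near the origin, so that a $|\lambda|^{-2d}$ law gives a slope of about $-2d$. This is a GPH-style sketch (our choice), not the SEMIFAR fit used in the text:

```python
import numpy as np

def loglog_slope(x, frac=0.2):
    """Slope of log I(lambda) vs log lambda over the lowest `frac`
    of Fourier frequencies; d_hat is about -slope/2."""
    n = len(x)
    xc = x - x.mean()
    m = (n - 1) // 2
    I = np.abs(np.fft.fft(xc)[1:m + 1]) ** 2 / (2 * np.pi * n)
    lam = 2 * np.pi * np.arange(1, m + 1) / n
    k = max(4, int(frac * m))
    A = np.column_stack([np.ones(k), np.log(lam[:k])])
    coef, *_ = np.linalg.lstsq(A, np.log(I[:k]), rcond=None)
    return coef[1]
```

For white noise the spectrum is flat and the slope should be near zero, whereas 1/f noise would give a slope near $-1$ (i.e. $d \approx 1/2$).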
Instead of recorded sound waves, we therefore consider the score itself, independently of which instrument is supposed to play. This is similar but not identical to considering zero crossings of a sound signal (see Voss and Clarke 1975, 1978, Voss 1988; Brillinger and Irizarry 1998). Figures 4.9a and c show the log-frequencies plotted against onset time for the first movement of Bach's first Cello Suite and for Paganini's Capriccio No. 24. For Bach, the SEMIFAR fit yields $\hat d \approx 0.7$ with a 95%-confidence interval of [0.46, 0.93]. This corresponds to a $1/f^{1.4}$ spectrum; however, $1/f$ ($d = 1/2$) is included in the confidence interval. Thus, there is not enough evidence against the 1/f hypothesis. In contrast, for Paganini (Figure 4.11) we obtain $\hat d \approx 0.21$ with a 95%-confidence interval of [0.07, 0.35], which excludes 1/f noise. This indicates that there is a larger variety of "fractal" behavior than the 1/f law would suggest. Note also that in both cases there is a trend in the data, which is in fact an even stronger type of long memory than the stochastic one. Moreover, Bach's (and, to a lesser degree, also Paganini's) spectrum has local maxima in the spectral density, indicating periodicities (see Section 4.2.9). Thus, there is no pure 1/f behavior but instead a mixture of long-range dependence, expressed by the power law near the origin, and short-range periodicities.

Figure 4.9 Log-frequencies with fitted SEMIFAR-trend, and log-log-periodogram together with SEMIFAR-fit, for Bach's first Cello Suite (1st movement; a, b) and Paganini's Capriccio No. 24 (c, d), respectively.

Finally, consider an alternative quantity, namely the local variability of notes modulo octave. Since we are in $\mathbb{Z}_{12}$, a measure of variability for circular data should be used. Here, we use the measure $V = 1 - \bar R$ as defined in Chapter 7, or rather the transformed variable $\log[(V + 0.05)/(1.05 - V)]$. The resulting standardized time series are displayed in Figures 4.10a and c. The log-log-plots of the periodograms and fitted SEMIFAR spectra are given in Figures 4.10b and d, respectively. The estimated long-memory parameters

Figure 4.10 Local variability with fitted SEMIFAR-trend, and log-log-periodogram together with SEMIFAR-fit, for Bach's first Cello Suite (1st movement; a, b) and Paganini's Capriccio No. 24 (c, d), respectively.

are similar to before, namely $\hat d = 0.51$ ([0.20, 0.81]) for Bach and 0.33 ([0.24, 0.42]) for Paganini.
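The circular variability measure $V = 1 - \bar R$ on $\mathbb{Z}_{12}$, together with the transform used above, can be sketched as follows (illustrative code; the pitch-class inputs are our examples):

```python
import numpy as np

def circular_variability(pitch_classes):
    """V = 1 - Rbar for pitch classes in Z_12 mapped onto the unit
    circle, and the transform log[(V+0.05)/(1.05-V)] used in the text."""
    theta = 2 * np.pi * np.asarray(pitch_classes) / 12
    R = np.hypot(np.mean(np.cos(theta)), np.mean(np.sin(theta)))
    V = 1 - R
    return V, np.log((V + 0.05) / (1.05 - V))

V0, _ = circular_variability([7] * 10)         # a single pitch class: V = 0
V1, _ = circular_variability(list(range(12)))  # uniform over the octave: V = 1
```

The transform maps $V \in [0, 1]$ onto the whole real line, which is convenient before fitting Gaussian time series models such as SEMIFAR.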


Figure 4.11 Niccolò Paganini (1782-1840). (Courtesy of Zentralbibliothek Zürich.)


CHAPTER 5

Hierarchical methods

5.1 Musical motivation

Musical structures are typically generated in a hierarchical manner. Most compositions can be divided approximately into natural segments (e.g. movements of a sonata); these are again divided into smaller units (e.g. exposition, development, and coda of a sonata movement). These can again be divided into smaller parts (e.g. melodic phrases), and so on. Different parts, even at the same hierarchical level, need not be disjoint. For instance, different melodic lines may overlap. Moreover, different parts are usually closely related within and across levels. A general mathematical approach to understanding the vast variety of possibilities can be obtained, for instance, by considering a hierarchy of maps defined in terms of a manifold (see e.g. Mazzola 1990a). The concept of hierarchical relationships and similarities is also related to self-similarity and fractals as defined in Mandelbrot (1977) (see Chapter 3). To obtain more concrete results, hierarchical regression models have been developed in the last few years (Beran and Mazzola 1999a,b, 2000, 2001).
5.2 Basic principles

5.2.1 Hierarchical aggregation and decomposition

Suppose that we have two time series $Y_t$, $X_t$, and we wish to model the relationship between $Y_t$ and $X_t$. The simplest model is simple linear regression,

$Y_t = \beta_o + \beta_1 X_t + \epsilon_t$    (5.1)

where $\epsilon_t$ is a stationary zero mean process independent of $X_t$. If $Y_t$ and $X_t$ are expected to be hierarchical, then we may hope to find a more realistic model by first decomposing $X_t$ (and possibly also $Y_t$) and searching for dependence structures between $Y_t$ (or its components) and the components of $X_t$. Thus, given a decomposition $X_t = X_{t,1} + ... + X_{t,M}$, we consider the multiple regression model

$Y_t = \beta_o + \sum_{j=1}^M \beta_j X_{t,j} + \epsilon_t$    (5.2)

with t second order stationary and E(t ) = 0. Alternatively, if Yt = Yt,1 +


... + Yt,L , we may consider a system of L regressions
M

Yt,1 = 01 +

j1 Xt,j + t,1
j=1
M

Yt,2 = 02 +

j2 Xt,j + t,2
j=1

.
.
.
M

Yt,L = 0L +

jL Xt,j + t,L .
j=1

Three methods of hierarchical regression based on decompositions will be


discussed here: HIREG: hierarchical regression using explanatory variables
obtained by kernel smoothing with predetermined xed bandwidths; HISMOOTH: hierarchical smoothing models with automatic bandwidth selection; HIWAVE: hierarchical wavelet models.
5.2.2 Hierarchical regression
Given an explanatory time series X_t (t = 1, 2, ..., n), a smoothing kernel K, and a hierarchy of bandwidths b_1 > b_2 > ... > b_M > 0, define

X_{t,1} = (nb_1)^{-1} Σ_{s=1}^n K((t - s)/(nb_1)) X_s    (5.3)

and, for 1 < j ≤ M,

X_{t,j} = (nb_j)^{-1} Σ_{s=1}^n K((t - s)/(nb_j)) [X_s - Σ_{l=1}^{j-1} X_{s,l}].    (5.4)

The collection of time series {X_{1,j}, ..., X_{n,j}} (j = 1, ..., M) is called a hierarchical decomposition of X_t. The HIREG model is then defined by (5.2). If ε_t (t = 1, 2, ...) are independent, then the usual techniques of multiple linear regression can be used (see e.g. Plackett 1960, Rao 1973, Ryan 1996, Srivastava and Sen 1997, Draper and Smith 1998). In the case of correlated errors ε_t, appropriate adjustments of tests, confidence intervals, and parameter selection techniques must be made. The main assumption in the HIREG model is that we know which bandwidths to use. In some cases this may indeed be true. For instance, if there is a three-four meter at the beginning of a musical score, then bandwidths that are divisible by three are plausible.
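As a concrete illustration, the decomposition (5.3)-(5.4) can be sketched in a few lines of code. This is a minimal sketch under stated assumptions (Gaussian kernel, unit-spaced observation times), not the implementation used for the analyses in this chapter:

```python
import numpy as np

def hierarchical_decomposition(x, bandwidths, kernel=None):
    """Hierarchical kernel decomposition in the spirit of (5.3)-(5.4):
    the first component smooths the series with the largest bandwidth,
    each further component smooths what the coarser levels left over."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if kernel is None:
        # Gaussian kernel as a convenient default (an assumption)
        kernel = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    t = np.arange(1, n + 1)
    components = []
    residual = x.copy()
    for b in sorted(bandwidths, reverse=True):  # b_1 > b_2 > ... > b_M
        w = kernel((t[:, None] - t[None, :]) / (n * b)) / (n * b)
        comp = w @ residual          # smooth the current residual
        components.append(comp)
        residual = residual - comp   # what remains for the finer levels
    return np.array(components)
```

The components plus the final residual reconstruct the series, with each component capturing structure at its own scale.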


5.2.3 Hierarchical smoothing


Beran and Mazzola (1999b) consider the case where the bandwidths b_j are not known a priori. Essentially, this amounts to a nonlinear regression model Y_t = β_0 + Σ_{j=1}^M β_j X_{t,j} + ε_t where not only the coefficients β_j but also b_1, ..., b_M, and possibly the order M, have to be estimated. The following definition formalizes the idea (for simplicity it is given for the case of one explanatory series X_t only):

Definition 40 For integers M, n > 0, let β = (β_1, ..., β_M) ∈ R^M, b = (b_1, ..., b_M) ∈ R^M with b_1 > b_2 > ... > b_M > 0, t_i ∈ [0, T], 0 < T < ∞, t_1 < t_2 < ... < t_n, and θ = (β, b)^t. Denote by K: [-1, 1] → R_+ a non-negative symmetric kernel function such that ∫ K(u)du = 1 and K is twice continuously differentiable, and define, for b > 0 and t ∈ [0, T], the Nadaraya-Watson weights (Nadaraya 1964, Watson 1964)

a_b(t, t_i) = K((t - t_i)/b) / Σ_{j=1}^n K((t - t_j)/b).    (5.5)
Also, let ε_i (i ∈ Z) be a stationary zero mean process satisfying suitable moment conditions, denote by f the spectral density of ε_i, and assume ε_i to be independent of X_i. Then the sequence of bivariate time series {(X_{1,n}, Y_{1,n}), ..., (X_{n,n}, Y_{n,n})} (n = 1, 2, 3, ...) is a Hierarchical Smoothing Model (or HISMOOTH model), if

Y_{i,n} = Y(t_i) = Σ_{j=1}^M β_j g(t_i; b_j) + ε_i    (5.6)

where t_i = i/n and

g(t_i; b_j) = Σ_{l=1}^n a_{b_j}(t_i, t_l) X_{l,n}.    (5.7)

Denote by θ^o = (β^o, b^o)^t the true parameter vector. Then θ^o can be estimated by a nonlinear least squares method as follows: define

e_i(θ) = Y(t_i) - Σ_{j=1}^M β_j g(t_i; b_j)    (5.8)

as a function of θ = (β, b)^t, let S(θ) = Σ_{i=1}^n e_i²(θ), and write ∇g = (∂/∂b) g. Then

θ̂ = argmin S(θ)    (5.9)

or equivalently

Σ_{i=1}^n φ(t_i, y; θ̂) = 0    (5.10)

where φ = (φ_1, ..., φ_{2M})^t,

φ_j(t, y; θ) = e_i(θ) g(t; b_j)    (5.11)

for j = 1, ..., M, and

φ_j(t, y; θ) = e_i(θ) β_{j-M} ∇g(t; b_{j-M})    (5.12)

for j = M+1, ..., 2M. Under suitable assumptions, the estimate θ̂ is asymptotically normal. More specifically, set
h_i(t; θ^o) = g(t; b_i)   (i = 1, ..., M)    (5.13)

h_i(t; θ^o) = β_{i-M} ∇g(t; b_{i-M})   (i = M+1, ..., 2M)    (5.14)

Σ = [γ_ε(i - j)]_{i,j=1,...,n} = [cov(ε_i, ε_j)]_{i,j=1,...,n}    (5.15)

and define the 2M × n matrix

G = G_{2M×n} = [h_i(t_j; θ^o)]_{i=1,...,2M; j=1,...,n}    (5.16)

and the 2M × 2M matrix

V_n = (GG^t)^{-1} (G Σ G^t) (GG^t)^{-1}.    (5.17)
The following assumptions are sufficient to obtain asymptotic normality:

(A1) f(λ) ~ c_f |λ|^{-2d} (c_f > 0) as λ → 0, with -1/2 < d < 1/2;

(A2) Let

a_r = n^{-1} Σ_{i,j=1}^n γ_ε(i - j) g(t_i; b_r) g(t_j; b_r),

b_{rs} = n^{-1} Σ_{i,j=1}^n γ_ε(i - j) g(t_i; b_r) g(t_j; b_s).

Then, as n → ∞, lim inf |a_r| > 0 and lim inf |b_{rs}| > 0 for all r, s ∈ {1, ..., M}.

(A3) x(t_i) = ζ(t_i) where ζ: [0, T] → R is a function in C[0, T], T < ∞.

(A4) The set of time points converges to a set A that is dense in [0, T].
Then we have (Beran and Mazzola 1999b):

Theorem 12 Let Θ_1 and Θ_2 be compact subsets of R and R_+ respectively, let Θ = Θ_1^M × Θ_2^M, and let γ = (1/2) min{1, 1 - 2d}. Suppose that (A1), (A2), (A3), and (A4) hold and θ^o is in the interior of Θ. Then, as n → ∞,

(i) θ̂ →_p θ^o;
(ii) V_n → V where V is a symmetric positive definite 2M × 2M matrix;
(iii) n^γ (θ̂ - θ^o) →_d N(0, V).

Thus, θ̂ is asymptotically normal, but for d > 0 (i.e. long-memory errors), the rate of convergence n^{1/2 - d} is slower than the usual n^{1/2} rate.
A particular aspect of HISMOOTH models is that the bandwidths b_j are fixed positive unknown parameters that are estimated from the data. This means that, in contrast to nonparametric regression models (see e.g. Gasser and Müller 1979, Simonoff 1996, Bowman and Azzalini 1997, Eubank 1999), the notion of an optimal bandwidth does not exist here. There is a fixed true bandwidth (or a vector of true bandwidths) that has to be estimated. A HISMOOTH model is in fact a semiparametric nonlinear regression rather than a nonparametric smoothing model.
Theorem 12 can be interpreted in terms of multiple linear regression where uncertainty due to (explanatory) variable selection is taken into account. The set of possible combinations of explanatory variables is parametrized by a continuous bandwidth-parameter vector b ∈ Θ_2^M. Confidence intervals for β based on the asymptotic distribution of θ̂ take into account additional uncertainty due to variable selection from the (infinite) parametric family of M explanatory variables X = {(x_{b_1}, ..., x_{b_M}): b_j ∈ Θ_2, b_1 > b_2 > ... > b_M}.
For the practical implementation of the model, the following algorithms, which include estimation of M, are defined in Beran and Mazzola (1999b): if M is fixed, then the algorithm consists of two basic steps: a) generation of the set of all possible explanatory variables x_s (s ∈ S), and b) selection of M variables (bandwidths) that maximize R². This means that after step a), the estimation problem is reduced to variable selection in multiple regression, with a fixed number M of explanatory variables. Standard regression software, such as the function leaps in S-Plus, can be used for this purpose. The detailed algorithm is as follows:
Algorithm 1 Define a sufficiently fine grid S = {s_1, ..., s_k} ⊂ Θ_2 and carry out the following steps:

Step 1: Define k explanatory time series x_s = [x_s(t_1), ..., x_s(t_n)]^t (s ∈ S) by x_s(t_i) = g(t_i, s).

Step 2: For each b = (b_1, ..., b_M) ∈ S^M with b_i > b_j (i < j), define the n × M matrix X = (x_{b_1}, ..., x_{b_M}) and let β̂ = β̂(b) = (X^t X)^{-1} X^t y. Also, denote by R²(b) the corresponding value of R² obtained from least squares regression of y on X.

Step 3: Define θ̂ = (β̂, b̂)^t by b̂ = argmax_b R²(b) and β̂ = β̂(b̂).
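For a fixed M, the exhaustive search of Algorithm 1 is straightforward to sketch. The following fragment is a hypothetical stand-in for the S-Plus code mentioned in the text (Gaussian kernel assumed; a real application would use a `leaps`-style branch-and-bound search rather than full enumeration):

```python
import itertools
import numpy as np

def nw_curve(x, t, b):
    """Nadaraya-Watson smooth of x at the observation times t with
    bandwidth b (Gaussian kernel; the weights of eq. (5.5))."""
    k = np.exp(-0.5 * ((t[:, None] - t[None, :]) / b) ** 2)
    return (k / k.sum(axis=1, keepdims=True)) @ x

def hismooth_fixed_M(y, x, grid, M):
    """Algorithm 1 sketch: search over decreasing bandwidth M-tuples
    from `grid`, keeping the tuple with the largest R^2."""
    n = len(y)
    t = np.arange(1, n + 1) / n
    curves = {b: nw_curve(x, t, b) for b in grid}
    best = None
    for b in itertools.combinations(sorted(grid, reverse=True), M):
        X = np.column_stack([np.ones(n)] + [curves[bj] for bj in b])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        r2 = 1.0 - rss / np.sum((y - y.mean()) ** 2)
        if best is None or r2 > best[0]:
            best = (r2, b, beta)
    return best  # (R^2, selected bandwidths, coefficients)
```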
If M is unknown, then the algorithm can be modified, for instance by increasing M as long as all β-coefficients are significant. In order to calculate the standard deviation of β̂ at each stage, the error process ε_i needs to be modeled explicitly. Beran and Mazzola (1999b) use fractional autoregressive models together with the BIC for choosing the order of the process. This leads to

Algorithm 2 Define a sufficiently fine grid S = {s_1, ..., s_k} ⊂ Θ_2 for the bandwidths, and calculate k explanatory time series x_s (s ∈ S) by x_s(t_i) = g(t_i, s). Furthermore, define a significance level α, set M_o = 0, and carry out the following steps:

Step 1: Set M = M_o + 1.

Step 2: For each b = (b_1, ..., b_M) ∈ S^M with b_i > b_j (i < j), define the n × M matrix X = (x_{b_1}, ..., x_{b_M}) and let β̂ = β̂(b) = (X^t X)^{-1} X^t y. Also, denote by R²(b) the corresponding value of R² obtained from least squares regression of y on X.

Step 3: Define θ̂ = (β̂, b̂)^t by b̂ = argmax_b R²(b) and β̂ = β̂(b̂).

Step 4: Let e(θ̂) = [e_1, ..., e_n]^t be the vector of regression residuals. Assume that e_i is a fractional autoregressive process of unknown order p characterized by a parameter vector η = (σ_ε², d, φ_1, ..., φ_p). Estimate p and η by maximum likelihood and the BIC.

Step 5: Calculate for each j = 1, ..., M the estimated standard deviation σ_j(η̂) of β̂_j, and set

p_j = 2[1 - Φ(|β̂_j| σ_j^{-1}(η̂))]

where Φ denotes the cumulative standard normal distribution function. If max(p_j) < α, set M_o = M_o + 1 and repeat Steps 1 through 5. Otherwise, stop the iteration and set M̂ = M_o and θ̂ equal to the corresponding estimate.
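The stopping rule of Step 5 is easy to state in code. A minimal standard-library sketch (the standard deviations σ_j(η̂) are taken as given, since fitting the fractional AR model is beyond a few lines):

```python
from math import erf, sqrt

def coef_p_values(beta_hat, sd):
    """Two-sided normal p-values p_j = 2[1 - Phi(|beta_j| / sd_j)],
    as in Step 5 of Algorithm 2 (sd_j assumed already estimated)."""
    phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF
    return [2.0 * (1.0 - phi(abs(b) / s)) for b, s in zip(beta_hat, sd)]

def all_significant(beta_hat, sd, alpha=0.05):
    """Continue increasing M only while max(p_j) < alpha."""
    return max(coef_p_values(beta_hat, sd)) < alpha
```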
5.2.4 Hierarchical wavelet models
Wavelet decomposition has become very popular in statistics and in many fields of application in the last few years. This is due to its flexibility in depicting local features at different levels of resolution. There is an extensive literature on wavelets spanning a vast range from profound mathematical foundations and mathematical statistics to concrete applications such as data compression, image and sound processing, and data analysis, to name only a few. For references see for example Daubechies (1992), Meyer (1992, 1993), Kaiser (1994), Antoniadis and Oppenheim (1995), Ogden (1996), Mallat (1998), Härdle et al. (1998), Vidakovic (1999), Percival and Walden (2000), Jansen (2001), Jaffard et al. (2001). The essential principle of wavelets is to express square integrable functions in terms of orthogonal basis functions that are zero except in a small neighborhood, the neighborhoods being hierarchical in size. The set of basis functions Ψ = {φ_{ok}, k ∈ Z} ∪ {ψ_{jk}, j, k ∈ Z} is generated by two functions only, the father wavelet φ and the mother wavelet ψ, by up/downscaling and shifting of the location. If scaling is done by powers of 2 and shifting by integers, then the basis functions are:

φ_{ok}(x) = φ_{oo}(x - k) = φ(x - k)   (k ∈ Z)    (5.18)

ψ_{jk}(x) = 2^{j/2} ψ_{oo}(2^j x - k) = 2^{j/2} ψ(2^j x - k)   (j ∈ N, k ∈ Z)    (5.19)

With respect to the scalar product <g, h> = ∫ g(x)h(x)dx, these basis functions are orthonormal:

<φ_{ok}, φ_{om}> = 0 (k ≠ m),   <φ_{ok}, φ_{ok}> = ||φ_{ok}||² = 1    (5.20)

<ψ_{jk}, ψ_{lm}> = 0 (k ≠ m or j ≠ l),   <ψ_{jk}, ψ_{jk}> = ||ψ_{jk}||² = 1    (5.21)

<ψ_{jk}, φ_{ol}> = 0    (5.22)
Every function g in L²(R) (the space of square integrable functions on R) has a unique representation

g(x) = Σ_{k=-∞}^∞ a_k φ_{ok}(x) + Σ_{j=0}^∞ Σ_{k=-∞}^∞ b_{jk} ψ_{jk}(x)    (5.23)

     = Σ_{k=-∞}^∞ a_k φ(x - k) + Σ_{j=0}^∞ Σ_{k=-∞}^∞ 2^{j/2} b_{jk} ψ(2^j x - k)    (5.24)

where

a_k = <g, φ_{ok}> = ∫ g(x) φ_{ok}(x) dx    (5.25)

and

b_{jk} = <g, ψ_{jk}> = ∫ g(x) ψ_{jk}(x) dx    (5.26)

Note in particular that ∫ g²(x)dx = Σ a_k² + Σ b_{jk}². The purpose of this representation is a decomposition with respect to frequency and time. A simple wavelet, where the meaning of the decomposition can be understood directly, is the Haar wavelet with

φ(x) = 1{0 ≤ x < 1}    (5.27)

where 1{0 ≤ x < 1} = 1 for 0 ≤ x < 1 and zero otherwise, and

ψ(x) = 1{0 ≤ x < 1/2} - 1{1/2 ≤ x < 1}.    (5.28)

For the Haar basis functions φ_{ok}, we have coefficients

a_k = ∫_k^{k+1} g(x) dx    (5.29)

Thus, coefficients of the basis functions φ_{ok} are equal to the average value of g in the interval [k, k+1]. For ψ_{jk} we have

b_{jk} = 2^{j/2} [ ∫_{2^{-j}k}^{2^{-j}(k+1/2)} g(x)dx - ∫_{2^{-j}(k+1/2)}^{2^{-j}(k+1)} g(x)dx ]    (5.30)

which is proportional to the difference between the average values of g in the intervals 2^{-j}k ≤ x < 2^{-j}(k+1/2) and 2^{-j}(k+1/2) ≤ x < 2^{-j}(k+1). This can be interpreted as a (signed) measure of variability. Since each interval I_{jk} = [2^{-j}k, 2^{-j}(k+1)] has length 2^{-j} and midpoint 2^{-j}(k+1/2), the coefficients b_{jk} (or their squares b_{jk}²) characterize the variability of g at the different scales 2^{-j} (j = 0, 1, 2, ...) and on a grid of locations 2^{-j}(k+1/2) that becomes finer as the scale decreases with increasing values of j.
Suppose now that a time series (function) y_t is observed at a finite number of discrete time points t = 1, 2, ..., n with n = 2^m. To relate this to wavelet decomposition in continuous time, one can construct a piecewise constant function in continuous time by

g_n(x) = Σ_{k=0}^{n-1} y_k 1{k/n ≤ x < (k+1)/n} = Σ_{k=0}^{n-1} y_k 1{2^{-m}k ≤ x < 2^{-m}(k+1)}.    (5.31)

Since g_n is a step function (like the Haar basis functions themselves) and zero outside the interval [0, 1), the Haar wavelet decomposition of g_n has only a finite number of nonzero terms:

g_n(x) = a_{oo} + Σ_{j=0}^{m-1} Σ_{k=0}^{2^j - 1} b_{jk} ψ_{jk}(x)    (5.32)

Note that g_n assumes only a finite number of values g_n(x) = y_{nx} (x = 1/n, 2/n, ..., 1), and that for each x only a finite number of the basis functions ψ_{jk}(x) = 2^{j/2} ψ(2^j x - k) are nonzero. Therefore, Equation (5.32) can be written in matrix form and calculation of the coefficients a_{oo} and b_{jk} can be done by matrix inversion. Since matrix inversion may not be feasible for large data sets, various efficient algorithms such as the so-called discrete wavelet transform have been developed (see e.g. Percival and Walden 2000).
An interesting interpretation of wavelet decomposition can be given in terms of total variability. The total variability of an observed series can be decomposed into contributions of the basis functions by

Σ_t (y_t - ȳ)² = Σ_{j=0}^{m-1} Σ_{k=0}^{2^j - 1} b_{jk}².    (5.33)

A plot of b_{jk}² against j (or 2^j = frequency, or 2^{-j} = period) and k (location) shows for each k and j how much of the signal's variability is due to variation at the corresponding location k and frequency 2^j.
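The decomposition (5.33) of the total variability can be checked numerically with an orthonormal Haar transform. This is a generic sketch, not the S-Plus routine used for the figures:

```python
import numpy as np

def haar_dwt(y):
    """Orthonormal discrete Haar transform: returns the final smooth
    coefficient and the detail coefficients per level, coarsest first."""
    current = np.asarray(y, dtype=float)
    n = len(current)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    details = []
    while len(current) > 1:
        details.append((current[0::2] - current[1::2]) / np.sqrt(2.0))
        current = (current[0::2] + current[1::2]) / np.sqrt(2.0)
    return current[0], details[::-1]
```

Because the transform is orthonormal, the sum of the squared detail coefficients equals the total sum of squares about the mean, which is exactly the identity (5.33) up to the normalization of the coefficients.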
To illustrate how wavelet decomposition works, consider the following simulated example: let x_i = 2 cos(2πi/90) if i ∈ {1, ..., 300}, {501, ..., 700}, or {901, ..., 1024}. For 301 ≤ i ≤ 500, set x_i = (1/2) cos(2πi/10), and for 701 ≤ i ≤ 900, x_i = 15 cos(2πi/10) + (i - 200)²/10000. The observed signal thus consists of several periodic segments with different frequencies and amplitudes, the largest amplitude occurring between t = 701 and 900, together with a slight trend. Figure 5.1a displays x_i. The coefficients for the four highest levels (i.e. j = 0, 1, 2, 3) are plotted against time in Figure 5.1b. Note that "D" stands for mother and "S" for father wavelet. Moreover, the numbering in the plot (as given in S-Plus) is opposite to the one given above: s4 and d4 in the plot correspond to the coarsest level j = 0 above. The corresponding functions at the different levels are given in Figure 5.1c. The ten and fifty largest basis contributions are given in Figures 5.1d and e respectively (together with the data on top and residuals at the bottom). Figure 5.1f shows the time-frequency plot of the squared coefficients in the wavelet decomposition of x_i. Bright shading corresponds to large coefficients. All plots emphasize the high-frequency portion with large amplitude between i = 701 and 900. Moreover, the trend at this location is visible through the coefficient values of the father wavelet ("s4" in the plot) and the slightly brighter shading in the lowest frequency band of the time-frequency plot.
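For reproducibility, the simulated signal can be written down directly; the exact constants (amplitudes 2, 1/2, and 15, trend (i - 200)²/10000) are reconstructed from the garbled text and should be treated as an assumption:

```python
import numpy as np

i = np.arange(1, 1025)                       # i = 1, ..., 1024
x = 2.0 * np.cos(2.0 * np.pi * i / 90.0)     # slow oscillation elsewhere
x[300:500] = 0.5 * np.cos(2.0 * np.pi * i[300:500] / 10.0)   # i = 301..500
x[700:900] = (15.0 * np.cos(2.0 * np.pi * i[700:900] / 10.0)
              + (i[700:900] - 200.0) ** 2 / 10000.0)         # i = 701..900
```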
An alternative to HISMOOTH models can be defined via wavelets (the following definition is a slight modification of Beran and Mazzola 2001):

Definition 41 Let φ, ψ ∈ L²(R) be a father and the corresponding mother wavelet respectively, φ_k(·) = φ(· - k), ψ_{j,k} = 2^{j/2} ψ(2^j · - k) (k ∈ Z, j ∈ N) the orthogonal wavelet basis generated by φ and ψ, and u_i and ε_i (i ∈ Z) independent stationary zero mean processes satisfying suitable moment conditions. Assume X(t_i) = g(t_i) + u_i with g ∈ L²[0, T], t_i ∈ [0, T], and wavelet decomposition g(t) = Σ a_k φ_k(t) + Σ b_{j,k} ψ_{j,k}(t). For 0 = c_{M+1} < c_M < ... < c_1 < c_o = ∞ let

g(t; c_{i-1}, c_i) = Σ_{c_i ≤ |a_k| < c_{i-1}} a_k φ_k(t) + Σ_{c_i ≤ |b_{j,k}| < c_{i-1}} b_{j,k} ψ_{j,k}(t).

Then (X(t_i), Y(t_i)) (i = 1, ..., n) is a Hierarchical Wavelet Model (HIWAVE model) of order M, if there exist M ∈ N, β = (β_1, ..., β_M) ∈ R^M, λ = (λ_1, ..., λ_M) ∈ R_+^M, 0 < λ_M < ... < λ_1 < λ_o = ∞, such that

Y(t_i) = Σ_{l=1}^M β_l g(t_i; λ_{l-1}, λ_l) + ε_i.    (5.34)

The denition means that the time series Y (t) is decomposed into orthogonal components that are proportional to certain bands in the wavelet
decomposition of the explanatory series X(t) the bands being dened by
the size of wavelet coecients. As for HISMOOTH models, the parameter
vector = (, )t can be estimated by nonlinear least squares regression.
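The band components g(t; c_{i-1}, c_i) of Definition 41 can be illustrated with a Haar basis: transform the explanatory series, sort its coefficients into magnitude bands, and invert each band separately. A sketch under the assumption of a dyadic-length series (the book's actual computations use S-Plus wavelet routines):

```python
import numpy as np

def haar_fwd(y):
    """Orthonormal Haar transform: smooth coefficient plus detail
    coefficients, stored fine to coarse."""
    c = np.asarray(y, dtype=float)
    details = []
    while len(c) > 1:
        details.append((c[0::2] - c[1::2]) / np.sqrt(2.0))
        c = (c[0::2] + c[1::2]) / np.sqrt(2.0)
    return c[0], details

def haar_inv(smooth, details):
    """Invert haar_fwd (linear, so it can be applied band by band)."""
    c = np.array([smooth])
    for d in reversed(details):
        up = np.empty(2 * len(d))
        up[0::2] = (c + d) / np.sqrt(2.0)
        up[1::2] = (c - d) / np.sqrt(2.0)
        c = up
    return c

def hiwave_bands(x, thresholds):
    """Split x into orthogonal components defined by coefficient
    magnitude, one band per interval [c_i, c_{i-1}); the smooth part is
    attached to the first (largest-coefficient) band."""
    smooth, details = haar_fwd(x)
    cs = [np.inf] + sorted(thresholds, reverse=True) + [0.0]
    bands = []
    for i in range(1, len(cs)):
        sel = [np.where((np.abs(d) >= cs[i]) & (np.abs(d) < cs[i - 1]), d, 0.0)
               for d in details]
        s0 = smooth if i == 1 else 0.0
        bands.append(haar_inv(s0, sel))
    return bands
```

Since the magnitude intervals partition all coefficients, the bands sum back to the original series exactly.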
To illustrate how HIWAVE models may be used, consider the following simulated example: let x_i = g(t_i) (i = 1, ..., 1024) as in the previous example. The function g is decomposed into g(t) = g(t; ∞, λ_1) + g(t; λ_1, 0) = g_1(t) + g_2(t), where λ_1 is such that 50 wavelet coefficients of g are larger than or equal to λ_1. Figure 5.2 shows g, g_1, and g_2. A simulated series of response variables, defined by Y(t_i) = 2 g_1(t_i) + ε_i (i = 1, ..., 1024) with independent zero-mean normal errors ε_i with variance σ² = 100, is shown in Figure 5.3b.


Figure 5.1 Simulated signal (a) and wavelet coefficients (b) (panels idwt, d1-d4, s4; coefficients up to level 4, numbered in reversed order).


Figure 5.1 c and d: wavelet components of the simulated signal in a (c: components up to level 4, panels Data, D1-D4, S4; d: the largest ten components, including D3.92, D3.102, D3.112, D3.109, D3.104, and residuals).


Figure 5.1 e and f: wavelet components of the simulated signal in a, and time-frequency plot of the coefficients.


Figure 5.2 Decomposition of the x series in the simulated HIWAVE model: x and its components g1 (the first 50 wavelet components of x, left) and g2 = x - g1 (right).

A comparison of the two scatter plots in Figures 5.3c and d shows a much clearer dependence between y and g_1 as compared to y versus x = g. Figure 5.3e illustrates that there is no relationship between y and g_2. Finally, the time-frequency plot in Figure 5.3f indicates that the main periodic behavior occurs for t ∈ {701, ..., 900}. The difficulty in practice is that the correct decomposition of x into g_1 and the redundant component g_2 is not known a priori. Figure 5.4 shows y and the HIWAVE curve β̂_0 + β̂_1 g(t_i; ∞, λ̂_1) (for graphical reasons the fitted curve is shifted vertically) fitted by nonlinear least squares regression. Apparently, the algorithm identified λ_1, and hence the relevant time span [701, 900], quite exactly, since g(t_i; ∞, λ̂_1) corresponds to the sum of the largest 51 wavelet components. The estimated coefficients are β̂_0 = -0.36 and β̂_1 = 1.95. If we assume (incorrectly of course) that λ_1 had been known a priori, then we can give confidence intervals for both parameters as in linear least squares regression. These intervals are generally too short, since they do not take into account that λ_1 is estimated. However, if a null hypothesis is not rejected using these intervals, then it will not be rejected by the correct test either. In our case, the linear regression confidence intervals for β_0 and β_1 are [-0.96, 0.24] and [1.81, 2.09] respectively, and thus contain the true values β_0 = 0 and β_1 = 2.


Figure 5.3 Simulated HIWAVE model: explanatory series g1 (a), y series (b), y versus x (c), y versus g1 (d), y versus g2 = x - g1 (e), and time-frequency plot of y (f).


COLOR FIGURE 2.30 The minnesinger Burchard von Wengen (1229-1280), contemporary of Adam de la Halle (1235?-1288). (From Codex Manesse, courtesy of the University Library, Heidelberg.)

COLOR FIGURE 2.35 Symbol plot with x = pj5, y = pj7, and radius of circles proportional to pj6.

COLOR FIGURE 2.36 Symbol plot with x = pj5, y = pj7. The rectangles have width pj1 (diminished second) and height pj6 (augmented fourth).

COLOR FIGURE 2.37 Symbol plot with x = pj5, y = pj7, and triangles defined by pj1 (diminished second), pj6 (augmented fourth), and pj10 (diminished seventh).

COLOR FIGURE 2.38 Names plotted at locations (x, y) = (pj5, pj7).

COLOR FIGURE 3.2 Fractal pictures (by Céline Beran, computer generated).

COLOR FIGURE 4.7 A harpsichord sound and its spectrogram. Intense pink corresponds to high values of I(t, λ).

COLOR FIGURE 9.6 Graduale written for an Augustinian monastery of the diocese Konstanz, 13th century. (Courtesy of Zentralbibliothek Zürich.)

Figure 5.4 HIWAVE time series and fitted function g1.

5.3 Specific applications in music

5.3.1 Hierarchical decomposition of metric, melodic, and harmonic weights

Decomposition of metric, melodic, and harmonic weights as in (5.3) and (5.4) can reveal structures and relationships that are not obvious in the original series. To illustrate this, Figures 5.5a through d and 5.5e through h show a decomposition of these weights for Bach's Canon cancricans from "Das Musikalische Opfer" (BWV 1079) and Webern's Variation op. 27/2 respectively. The bandwidths were chosen based on time signature and bar grouping. Webern's piano piece is written in 2/4 signature; its formal grouping is 1 + 11 + 11 + 11 + 11; however, Webern insists on a grouping in 2-bar portions, suggesting the bandwidths 5.5 (11 bars), 1 (2 bars), and 0.5 (1 bar). Bach's canon is written in 4/4 signature; the grouping is 9 + 9 + 9 + 9. The chosen bandwidths are 9 (9 bars), 3 (3 bars), and 1 (1 bar). For both compositions, much stronger similarities between the smoothed metric, melodic, and harmonic components can be observed than for the original weights. An extended discussion of these and other examples can be found in Beran and Mazzola (1999a).


Figure 5.5 Hierarchical decomposition of metric, melodic, and harmonic indicators for Bach's Canon cancricans (Das Musikalische Opfer, BWV 1079) and Webern's Variation op. 27, No. 2.

5.3.2 HIREG models of the relationship between tempo and melodic curves

Quantitative analysis of performance data is an attempt to understand objectively how musicians interpret a score (Figure 5.6). For the analysis of tempo curves for Schumann's Träumerei (Figure 2.3), Beran and Mazzola (1999a) construct the following matrix of explanatory variables by decomposing structural weight functions into components of different smoothness: let x_1 = x_metric = metric weight, x_2 = x_melod = melodic weight, and x_3 = x_hmean = harmonic (mean) weight (see Chapter 3). Define the bandwidths b_1 = 4 (4 bars), b_2 = 2 (2 bars), and b_3 = 1 (1 bar), and denote the corresponding components in the decomposition of x_1, x_2, x_3 by x_{j,metric} = x_{j,1}, x_{j,melod} = x_{j,2}, x_{j,hmean} = x_{j,3}. More exactly, since harmonic weights are originally defined for each note, two alternative variables are considered for the harmonic aspect: x_hmean(t_l) = average harmonic weight at onset time t_l, and x_hmax(t_l) = maximal harmonic weight at onset time t_l. Thus, the decomposition of four different weight functions x_metric, x_melod, x_hmean, and x_hmax is used in the analysis. Moreover, for each curve, discrete derivatives are defined by

dx(t_j) = [x(t_j) - x(t_{j-1})] / (t_j - t_{j-1})

and

dx^(2)(t_{j-1}) = [dx(t_j) - dx(t_{j-1})] / (t_j - t_{j-1}).

Each of these variables is decomposed hierarchically into four components, as described above, with the bandwidths b_1 = 4 (weighted averaging over 8 bars), b_2 = 2 (4 bars), b_3 = 1 (2 bars), and b_4 = 0 (residual, no averaging). We thus obtain 48 variables (functions):

xmetric,1
dxmetric,1
d2 xmetric,1

xmetric,2
dxmetric,2
d2 xmetric,2

xmetric,3
dxmetric,3
d2 xmetric,3

xmetric,4
dxmetric,4
d2 xmetric,4

xmelodic,1
dxmelodic,1
d2 xmelodic,1

xmelodic,2
dxmelodic,2
d2 xmelodic,2

xmelodic,3
dxmelodic,3
d2 xmelodic,3

xmelodic,4
dxmelodic,4
d2 xmelodic,4

xhmax,1
dxhmax,1
d2 xhmax,1

xhmax,2
dxhmax,2
d2 xhmax,2

xhmax,3
dxhmax,3
d2 xhmax,3

xhmax,4
dxhmax,4
d2 xhmax,4

xhmean,1
dxhmean,1
d2 xhmean,1

xhmean,2
dxhmean,2
d2 xhmean,2

xhmean,3
dxhmean,3
d2 xhmean,3

xhmean,4
dxhmean,4
d2 xhmean,4

In addition to these variables, the following score information is modeled in a simple way:

1. Ritardandi: There are four onset intervals R_1, R_2, R_3, and R_4 with an explicitly written ritardando instruction, starting at onset times t_o(R_j) (j = 1, 2, 3, 4) respectively. This is modeled by linear functions

x_ritj(t) = 1{t ∈ R_j} (t - t_o(R_j)),   j = 1, 2, 3, 4    (5.35)

Figure 5.6 Quantitative analysis of performance data is an attempt to understand objectively how musicians interpret a score without attaching any subjective judgement. (Left: "Freddy" by J.B.; right: J.S. Bach, woodcutting by Ernst Würtemberger, Zürich. Courtesy of Zentralbibliothek Zürich.)

2. Suspensions: There are four onset intervals S_1, S_2, S_3, and S_4 with suspensions, starting at onset times t_o(S_j) (j = 1, 2, 3, 4) respectively. The effect is modeled by the variables

x_susj(t) = 1{t ∈ S_j} (t - t_o(S_j)),   j = 1, 2, 3, 4    (5.36)

3. Fermatas: There are two onset intervals F_1, F_2 with fermatas. Their effect is modeled by indicator functions

x_fermj(t) = 1{t ∈ F_j},   j = 1, 2    (5.37)

The variables are summarized in an n × 57 matrix X. After orthonormalization, the following model is assumed:

y(j) = Zβ(j) + ε(j)

where y(j) = [y(t_1, j), y(t_2, j), ..., y(t_n, j)]^t are the tempo measurements for performance j, Z is the orthonormalized X-matrix, β(j) is the vector of coefficients (β_1(j), ..., β_p(j))^t, and ε(j) = [ε(t_1, j), ..., ε(t_n, j)]^t is a vector of n identically distributed, but possibly correlated, zero mean random variables ε(t_i, j) (t_i ∈ T) with variance var(ε(t_i, j)) = σ²(j). Beran and Mazzola (1999a) select the most important variables for each of the 28 performances separately, by stepwise linear regression. The main aim of the analysis is to study the relationship between structural weight functions and tempo with respect to a) existence, b) type and complexity, and c) comparison of different performances. It should perhaps be emphasized at this point that quantitative analysis of performance data aims at gaining a better objective understanding of how pianists interpret a score, without attaching any subjective judgement. The aim is thus not to find the "ideal" performance (which may in fact not exist) or to state an opinion about the quality of a performance. The values of R², obtained for the full model with all explanatory variables, vary between 0.65 and 0.85. Note, however, that the number of potential explanatory variables is very large, so that high values of R² do not necessarily imply that the regression model is meaningful. On the other hand, musical performance is a very complex process. It is therefore not unreasonable that a large number of explanatory variables may be necessary. This is confirmed formally, in that for most performances the selected models turn out to be complex (with many variables), all variables being statistically significant (at the 5%-level) even when correlations in the errors are taken into account. For instance, for Brendel's performance (R² = 0.76), seventeen significant variables are selected (including first and second derivatives). In spite of the complexity, there is a large degree of similarity between the performances in the following sense: a) all except at most 3 of the 57 coefficients β_j have the same sign for all performances (the results are therefore hardly random), b) there are "canonical" variables that are chosen by stepwise regression for (almost) all performances, and c) the same is true if one considers (for each performance separately) the explanatory variables with the largest coefficients. Figure 5.7 shows three of these curves. The upper curve is the most important explanatory variable for 24 of the 28 performances. The exceptions are: all three Cortot performances and Krust, with a preference for the middle curve, which reflects the division of the piece into 8 parts, and the performance by Ashkenazy, with a curve similar to Cortot's. Apparently, Cortot, Krust, and Ashkenazy put special emphasis on the division into 8 parts. The results can also be used to visualize the structure of tempo curves in the following way: using the size of |β̂_k| as a criterion for the importance of variable k, we may add the terms in the regression equation sequentially to obtain a hierarchy of tempo curves ranging from very simple to complex. This is illustrated in Figures 5.8a and b for Ashkenazy and Horowitz's third performance.
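The orthonormalization step can be sketched with a QR decomposition, one standard way to obtain an orthonormalized design matrix (whether the original analysis used exactly this factorization is not stated in the text):

```python
import numpy as np

def fit_performance(X, y):
    """Regress a tempo series y on the orthonormalized columns Z of the
    explanatory matrix X (thin QR; with Z'Z = I the OLS coefficients are
    simply Z'y, so terms can be added or dropped one at a time)."""
    Z, _ = np.linalg.qr(X)   # columns of Z are orthonormal
    beta = Z.T @ y           # OLS coefficients in the orthonormal basis
    fitted = Z @ beta        # projection of y onto the column space of X
    return beta, fitted
```

Because the columns of Z are orthonormal, sequentially adding the terms with the largest |β̂_k| (as in Figure 5.8) changes neither the remaining coefficients nor the fit of the terms already included.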
5.3.3 HISMOOTH models for the relationship between tempo and structural curves

An analysis of the relationship between a melodic curve (Chapter 3) and the 28 tempo curves for Schumann's Träumerei is discussed in Beran and Mazzola (1999b). In a first step, effects of fermatas and ritardandi are subtracted from each of the 28 tempo series individually, using linear regression. The component of the melodic curve m_t orthogonal to these variables is then used. The second algorithm for HISMOOTH models is used, with a grid G that takes into account that 0 ≤ t ≤ 32 and that only certain multiples of 1/8 correspond to musically interesting neighborhoods: G = {32, 30, 28, 26, 24, 22, 20, 18, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1.5, 1, 0.75, 0.5, 0.25, 0.125}.

Figure 5.7 Most important melodic curves obtained from HIREG fit to tempo curves for Schumann's Träumerei.

Figure 5.8 Successive aggregation of HIREG components for tempo curves by Ashkenazy (a) and Horowitz, third performance (b): estimated and observed log(tempo) versus onset time.
Note that, since for large bandwidths the resulting curves g do not vary much, large trial bandwidths do not need to be too close together. The error process is modeled by a fractional AR(p, d) process, the order being estimated from the data by the BIC. Note that, from the musicological point of view, the fractional differencing parameter can be interpreted as a measure of self-similarity (see Chapter 3).

For illustration, consider the performances CORTOT1 and HOROWITZ1 (see Figures 5.9b and c). In both cases, the number M of explanatory variables estimated by Algorithm 2 turns out to be 3 (with a level of significance of α = 0.05). The estimated bandwidths (and 95%-confidence intervals) are b̂_1 = 4.0 ([2.66, 5.34]), b̂_2 = 2.0 ([1.10, 2.90]), and b̂_3 = 0.5 ([0.17, 0.83]) for CORTOT1, and b̂_1 = 4 ([2.26, 5.74]), b̂_2 = 1 ([0.39, 1.62]), and b̂_3 = 0.25 ([0.04, 0.46]) for HOROWITZ1. The estimates of β are β̂_1 = -0.81 ([-1.53, -0.10]), β̂_2 = 1.08 ([0.21, 1.95]), and β̂_3 = -0.624 ([-1.15, -0.10]) for CORTOT1, and β̂_1 = -0.42 ([-0.66, -0.18]), β̂_2 = 0.54 ([0.13, 0.95]), and β̂_3 = -0.68 ([-1.08, -0.28]) for HOROWITZ1. Finally, the fitted error process for Cortot is a fractional AR(1) process with d̂ = -0.25 ([-0.60, 0.09]) and φ̂_1 = 0.77 ([0.48, 1]). For Horowitz we obtain a fractional AR(2) process with d̂ = 0.30 ([0.14, 0.45]), φ̂_1 = 0.26 ([0.09, 0.42]), and φ̂_2 = -0.43 ([-0.55, -0.30]).
A possible interpretation of the results is as follows: the largest bandwidth b̂_1 = 4 (one bar) is the same for both performers. A relatively large portion of the shaping of the tempo happens at this level. Apart from this, however, Horowitz's bandwidths are smaller. Horowitz appears to emphasize very local melodic structures more than Cortot. Moreover, for Horowitz, d̂ > 0 (long-range dependence): while the small scale structures are explained by the melodic structure of the score, the remaining unexplained part of the performance is still "coherent" in the sense that there is a relatively strong (self-)similarity and there are positive correlations even between remote parts. On the other hand, for Cortot, d̂ < 0 (antipersistence): while larger scale structures are explained by the melodic structure of the score, more local fluctuations are still "coherent" in the sense that there is a relatively strong negative autocorrelation even between remote parts; these smaller scale structures are, however, difficult to relate directly to the melodic structure of the score.
Figures 5.9a through d also show simplified tempo curves for all 28 performances, obtained by HISMOOTH fits with M = 3. The comparison of typical characteristics is now much easier than for the original curves. In particular, there is a strong similarity between all three performances by Horowitz on one hand, and the three performances by Cortot on the other hand. Several performers (Moisewitsch, Novaes, Ortiz, Krust, Schnabel, Katsaris) put even higher emphasis on global melodic features than Cortot. Striking similarities can also be seen between Horowitz, Klien, and Brendel. Another group of similar performances consists of Cortot, Argerich, Capova, Demus, Kubalek, and Shelley.

Figure 5.9 a and b: HISMOOTH fits to tempo curves (performances 1-14).
5.3.4 Digital encoding of musical sounds (CD, mpeg)
Wavelet decomposition plays an important role in modern techniques of digital sound and image processing. Digital encoding of sounds (e.g. CD, mpeg) relies on algorithms that make it possible to compress complex data in as few storage units as possible. Wavelet decomposition is one such technique: instead of storing a complete function (evaluated or measured at a very large number of time points on a fine grid), one only needs to keep the relatively small number of wavelet coefficients. There is an extensive literature on how exactly this can be done to suit particular engineering needs. Since here the focus is on genuine musical questions rather than signal processing, we do not pursue this further. The interested reader is referred to the engineering literature such as Effelsberg and Steinmetz (1998) and references therein.
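The compression idea can be sketched in a few lines (a toy illustration under simplifying assumptions, not the scheme of any actual codec): perform a full Haar decomposition, keep only the largest-magnitude coefficients, and reconstruct an approximation from them.

```python
def haar_decompose(x):
    """Full Haar decomposition of a signal of length 2^k.

    Returns [overall average, coarsest detail, ..., finest details]."""
    approx, details = list(x), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details
        approx = [(a + b) / 2 for a, b in pairs]
    return approx + details

def haar_reconstruct(coeffs):
    """Invert haar_decompose."""
    approx, pos = coeffs[:1], 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        approx = [v for a, d in zip(approx, details) for v in (a + d, a - d)]
    return approx

def compress(x, keep):
    """Zero out all but the `keep` largest-magnitude Haar coefficients."""
    c = haar_decompose(x)
    threshold = sorted((abs(v) for v in c), reverse=True)[keep - 1]
    return [v if abs(v) >= threshold else 0.0 for v in c]

x = [4, 2, 6, 8]                       # toy "signal"
coeffs = haar_decompose(x)             # [5, -2, 1, -1]
approx = haar_reconstruct(compress(x, keep=2))
```

Keeping only two of the four coefficients reproduces the coarse trend of the signal while discarding the finest detail, which is exactly the trade-off exploited by wavelet-based compression.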
5.3.5 Wavelet analysis of tempo curves
Figure 5.9 c and d: HISMOOTH fits to tempo curves (performances 15-28).

Consider the tempo curves for Schumann's Träumerei. Wavelet analysis can help one to understand some of the similarities and differences between tempo curves. This is illustrated in Figures 5.10a through f where
time-frequency plots of the three tempo curves by Cortot are compared with those by Horowitz. (More specifically, only the first 128 observations are used here.) The obvious difference is that Horowitz has more power in the high frequency range. Figures 5.11a through f compare the wavelet coefficients of residuals obtained after subtracting a kernel-smoothed version of the tempo curves (bandwidth 1/8, i.e. averaging was done over one quarter of a bar). This provides an overview of local details of the curves. In particular, it can be seen at which level of resolution each pianist kept essentially the same profile throughout the years. For instance, for Horowitz the complete profile at level 2 (d2) remains essentially the same. An even better adaptation to data is achieved by using so-called wavelet packets, which are generalizations of wavelets, in conjunction with a best-basis algorithm. The idea of the algorithm is to find the type of basis functions best suited to approximate an observed time series with as few basis functions as possible. This is a way out of the limitation due to the very specific shape of a particular class of wavelet functions (with Haar wavelets, for example, we are confined to step functions). For detailed references on wavelet packets see e.g. Coifman et al. (1992) and Coifman and Wickerhauser (1992). Figures 5.12 through 5.14 illustrate the usefulness of this approach: the 28 tempo curves of Schumann's Träumerei are approximated by the most important


Figure 5.10 Time-frequency plots for Cortot's and Horowitz's three performances.

two (Figure 5.12), five (Figure 5.13) and ten (Figure 5.14) best basis functions. The plots show interesting and plausible similarities and differences. Particularly striking are Cortot's 4-bar oscillations, Horowitz's "seismic" local fluctuations, the relatively unbalanced tempo with a few extreme tempo variations for Eschenbach, Klien, Ortiz, and Schnabel, the irregular shapes for Moisewitsch, and also a strong similarity between Horowitz1 and Moisewitsch with respect to the general shape (Figure 5.12).
5.3.6 HIWAVE models of the relationship between tempo and melodic curves
HIWAVE models can be used, for instance, to establish a relationship between structural curves obtained from a score and a performance of the score. Here, we consider the tempo curves by Cortot and Horowitz (Figure 5.15a), and the melodic weight function m(t) defined in Section 3.3.4. Assuming a HIWAVE model of order 1, Figure 5.15b displays the value of R2


Figure 5.11 Wavelet coefficients of the residuals for Cortot's and Horowitz's three performances (levels idwt, d1, d2, s2).


Figure 5.12 Tempo curves: approximation by the most important 2 best basis functions.

Figure 5.13 Tempo curves: approximation by the most important 5 best basis functions.


Figure 5.14 Tempo curves: approximation by the most important 10 best basis functions.

for the simple linear regression model yi = β0 + β1 g(ti; θ, δ) as a function of the number of wavelet coefficients of m that are larger than or equal to the trial cut-off parameter δ. Two observations can be made: a) for almost all choices of δ, the fit for Horowitz (gray lines) is better, and b) the best value of δ is practically the same for all six performances. Figure 5.15c shows the fitted HIWAVE curves for Cortot and Horowitz separately. The result shows an amazing agreement between the three Cortot performances on one hand and the three Horowitz curves on the other hand. The HIWAVE fits seem to have extracted a major aspect of the performance styles. Horowitz appears to build blocks of almost horizontal tempo levels and adds, within these blocks, very fine tempo variations. In contrast, for Cortot, blocks have a more parabolic shape. It should be noted, of course, that, since Haar wavelets were used here, these features (in particular Horowitz's horizontal blocks) may be somewhat overemphasized. Analogous pictures are displayed in Figures 5.16a through c and 5.17a through c for the first and second difference of the tempo respectively. Particularly interesting are Figures 5.17b and c: the values of R2 are practically the same for all Horowitz performances and clearly lower than for Cortot. Moreover, as before, both pianists show an amazing consistency in their performances.


Figure 5.15 Tempo curves (a) by Cortot (three curves on top) and Horowitz, R2 obtained in the HIWAVE fit plotted against the trial cut-off parameter δ (b), and fitted HIWAVE curves (c).


Figure 5.16 First derivative of tempo curves (a) by Cortot (three curves on top) and Horowitz, R2 obtained in the HIWAVE fit plotted against the trial cut-off parameter δ (b), and fitted HIWAVE curves (c).


Figure 5.17 Second derivative of tempo curves (a) by Cortot (three curves on top) and Horowitz, R2 obtained in the HIWAVE fit plotted against the trial cut-off parameter δ (b), and fitted HIWAVE curves (c).


CHAPTER 6

Markov chains and hidden Markov models
6.1 Musical motivation
Musical events can often be classified into a finite or countable number of categories that occur in a temporal sequence. A natural question is then whether the transition between different categories can be characterized by probabilities. In particular, a successful model may be able to reproduce formally a listener's expectation of "what happens next", by giving appropriate conditional probabilities. Markov chains are simple models in discrete time that are defined by conditioning on the immediate past only. The theory of Markov chains is well developed and many beautiful results are available. More complicated, but very flexible, are hidden Markov processes. For these models, the probability distribution itself changes dynamically according to a Markov process. Many of the developments on hidden Markov models have been stimulated by problems in speech recognition. It is therefore not surprising that these models are also very useful for analyzing musical signals. Here, a very brief introduction to Markov chains and hidden Markov models is given. For an extended discussion see, for instance, Chung (1967), Isaacson and Madsen (1976), Kemeny et al. (1976), Billingsley (1986), Elliott et al. (1995), MacDonald and Zucchini (1997), Norris (1998), Brémaud (1999).
6.2 Basic principles
6.2.1 Definition of Markov chains
Let X0, X1, ... be a sequence of random variables with possible outcomes Xt = xt ∈ S. Then the sequence is called a Markov chain, if
M1. the state space S is finite or countable;
M2. for any t ∈ N,

P(Xt+1 = j | X0 = i0, X1 = i1, ..., Xt = it) = P(Xt+1 = j | Xt = it)   (6.1)

Condition M2 means that the future development of the process, given the past, depends on the most recent value only. In the following we also

2004 CRC Press LLC

assume that the Markov chain is homogeneous in the sense that for any i, j ∈ S, the conditional probability P(Xt+1 = j | Xt = i) does not depend on time t. The probability distribution of the process Xt (t = 0, 1, 2, ...) is then fully specified by the initial distribution

πi = P(X0 = i)   (6.2)

and the (finite or infinite dimensional) matrix of transition probabilities

pij = P(Xt+1 = j | Xt = i)   (i, j = 1, 2, ..., |S|)   (6.3)

where |S| = m is the number of elements in the state space S. Without loss of generality, we may assume S = {1, 2, ..., m}. Note that the vector π = (π1, ..., πm)^t and the matrix

M = (pij)i,j=1,2,...,m

have the following properties:

0 ≤ πi, pij ≤ 1,   Σ_{i=1}^m πi = 1   and   Σ_{j=1}^m pij = 1.

Probabilities of events can be obtained by matrix multiplication, since

p^(n)_ij = P(Xt+n = j | Xt = i) = Σ_{j1,...,jn−1=1}^m pij1 pj1j2 · · · pjn−1j = [M^n]ij   (6.4)

and

p^(n)_j = P(Xt+n = j) = [π^t M^n]_j   (6.5)
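Equations (6.4) and (6.5) can be checked numerically; the sketch below (with a made-up two-state chain, not an example from the text) computes n-step transition probabilities by repeated matrix multiplication.

```python
def mat_mult(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def n_step(M, n):
    """n-step transition matrix: p_ij^(n) = [M^n]_ij, cf. (6.4)."""
    P = [[float(i == j) for j in range(len(M))] for i in range(len(M))]  # identity
    for _ in range(n):
        P = mat_mult(P, M)
    return P

# hypothetical two-state chain
M = [[0.9, 0.1],
     [0.5, 0.5]]
P2 = n_step(M, 2)
pi0 = [1.0, 0.0]                       # start in state 1 with certainty
# marginal distribution after two steps, [pi0^t M^2]_j, cf. (6.5)
p2 = [sum(pi0[i] * P2[i][j] for i in range(2)) for j in range(2)]
```

Each row of M^n again sums to one, as it must for a transition matrix.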

6.2.2 Transience, persistence, irreducibility, periodicity, and stationarity
The dynamic behavior of a Markov chain can essentially be characterized by the notions of transience vs. persistence, irreducibility vs. reducibility, aperiodicity vs. periodicity, and stationarity vs. nonstationarity. These properties will be discussed now.
Consider the probability that the first visit in state j occurs at time n, given that the process started in state i,

f^(n)_ij = P(X1 ≠ j, ..., Xn−1 ≠ j, Xn = j | X0 = i)   (6.6)

Note that f^(n)_ij can also be written as

f^(n)_ij = P(Tj = n | X0 = i)

where

Tj = min{n ≥ 1 : Xn = j}

is the first time when the process reaches state j. The conditional probability that the process ever visits state j can be written as

fij = P(Tj < ∞ | X0 = i) = P(∪_{n=1}^∞ {Xn = j} | X0 = i) = Σ_{n=1}^∞ f^(n)_ij   (6.7)

We then have the following


Definition 42 A state i is called
i) transient, if fii < 1;
ii) persistent, if fii = 1.
Persistence means that we return to the same state again with certainty. For transient states it can occur, with positive probability, that we never return to the same place. As it turns out, a positive probability of never returning implies that there is indeed a point of no return, i.e. a time point after which one never returns. This can be seen as follows. Conditionally on X0 = i, the probability that state j is reached at least k + 1 times is equal to fij f^k_jj. Hence, for k → ∞, we obtain the probability of returning infinitely often

qij = P(Xn = j infinitely often | X0 = i) = fij lim_{k→∞} f^k_jj.   (6.8)

This implies qij = 0 for fjj < 1 and qij = 1 for fjj = 1.
A simple way of checking whether a state is persistent or not is given by
Theorem 13 The following holds for a Markov chain:
i) A state j is transient ⇔ qjj = 0 ⇔ Σ_{n=1}^∞ p^(n)_jj < ∞;
ii) A state j is persistent ⇔ qjj = 1 ⇔ Σ_{n=1}^∞ p^(n)_jj = ∞.
The condition on Σ_{n=1}^∞ p^(n)_ii can be simplified further for irreducible Markov chains:
Definition 43 A Markov chain is called irreducible, if for each i, j ∈ S, p^(n)_ij > 0 for some n.
Irreducibility means that wherever we start, any state j can be reached in due time with positive probability. This excludes the possibility of being caught forever in a certain subset of S. With respect to persistent and transient states, the situation simplifies greatly for irreducible Markov chains:
Theorem 14 Suppose that Xt (t = 0, 1, ...) is an irreducible Markov chain. Then one of the following possibilities is true:


i) All states are transient.
ii) All states are persistent.
Instead of speaking of transient and persistent states one therefore also uses the notion of a transient and persistent Markov chain, respectively. Another important property is stationarity of Markov chains. The word stationarity implies that the distribution remains stable in some sense. The first definition concerns initial distributions:

Definition 44 A distribution π is called stationary if

Σ_{i=1}^m πi pij = πj,   (6.9)

or in matrix form,

π^t M = π^t.   (6.10)

This means that if we start with distribution π, then the distribution of all subsequent Xt's is again π.
The next question is to what extent the initial distribution influences the dynamic behavior (probability distribution) into the infinite future. A possible complication is that the process may be periodic in the sense that one may return to certain states periodically:

Definition 45 A state j is said to have period λ, if p^(n)_jj > 0 implies that n is a multiple of λ.

For an irreducible Markov chain, all states have the same period. Hence, the following definition is meaningful:

Definition 46 An irreducible Markov chain is called periodic if λ > 1, and it is called aperiodic if λ = 1.

It can be shown that for an aperiodic Markov chain, there is at most one stationary distribution and, if there is one, then the initial distribution does not play any role ultimately:
Theorem 15 If Xt (t = 0, 1, ...) is an aperiodic irreducible Markov chain for which a stationary distribution π exists, then the following holds:
(i) the Markov chain is persistent;
(ii) lim_{n→∞} p^(n)_ij = πj > 0 for all i, j;
(iii) the stationary distribution is unique.
In the other case of an aperiodic irreducible Markov chain for which no stationary distribution exists, we have

lim_{n→∞} p^(n)_ij = 0

for all i, j. Note that this is even the case if the Markov chain is persistent. One can then classify irreducible aperiodic Markov chains into three classes:


Theorem 16 If Xt (t = 0, 1, 2, ...) is an irreducible aperiodic Markov chain, then one of the following three possibilities is true:
(i) Xt is transient,

lim_{n→∞} p^(n)_ij = 0   and   Σ_{n=1}^∞ p^(n)_ij < ∞;

(ii) Xt is persistent, but no stationary distribution exists,

lim_{n→∞} p^(n)_ij = 0,   Σ_{n=1}^∞ p^(n)_ij = ∞   and   μj = Σ_{n=1}^∞ n f^(n)_jj = ∞;

(iii) Xt is persistent, and a unique stationary distribution exists,

lim_{n→∞} p^(n)_ij = πj > 0

for all i, j, and the average number of steps till the process returns to state j is given by

μj = 1/πj.
For Markov chains with a finite state space, the results simplify further:
Theorem 17 If Xt is an irreducible aperiodic Markov chain with a finite state space, then the following holds:
(i) Xt is persistent;
(ii) a unique stationary distribution π = (π1, ..., πm)^t exists and is the solution of

π^t(I − M) = 0,   (0 ≤ πj ≤ 1, Σ_j πj = 1)   (6.11)

where I is the m × m identity matrix.
Note that Σ_j Mij = Σ_j pij = 1 so that Σ_j (I − M)ij = 0, i.e. the matrix (I − M) is singular. (If this were not the case, then the only solution to the system of linear equations would be 0, so that no stationary distribution would exist.) Thus, there are infinitely many solutions of (6.11). However, there is only one solution that satisfies the conditions 0 ≤ πj ≤ 1 and Σ_j πj = 1.
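Numerically, one can either solve the singular system (6.11) subject to the constraints, or simply iterate π^t M until convergence, which Theorem 15 guarantees for an irreducible aperiodic chain. A minimal sketch with a made-up two-state chain (not an example from the text):

```python
def stationary(M, iters=200):
    """Approximate the stationary distribution by iterating pi^t M,
    which converges for an irreducible aperiodic chain (Theorem 15)."""
    m = len(M)
    pi = [1.0 / m] * m                  # arbitrary starting distribution
    for _ in range(iters):
        pi = [sum(pi[i] * M[i][j] for i in range(m)) for j in range(m)]
    return pi

M = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary(M)                      # exact answer: (5/6, 1/6)
```

For this chain one checks directly that π = (5/6, 1/6) satisfies π^t M = π^t, and by Theorem 16(iii) the expected return time to state 2 is 1/π2 = 6 steps.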


6.2.3 Hidden Markov models


A hidden Markov model is, as the name says, a model where an underlying Markov process is not directly observable. Instead, observations Xt (t = 1, 2, ...) are generated by a series of probability distributions which in turn are controlled by an unobserved Markov chain. More specifically, the following definitions are used: let θt (t = 1, 2, ...) be a Markov chain with initial distribution δ so that P(θ1 = j) = δj, and transition probabilities

pij = P(θt+1 = j | θt = i).   (6.12)

The state of the Markov chain determines the probability distribution of the observable random variables Xt by

γij = P(Xt = j | θt = i)   (6.13)

In particular, if the state spaces of θt and Xt are finite with dimensions m1 and m2 respectively, then the probability distribution of the process Xt is determined by the m1-dimensional vector δ, the m1 × m1-dimensional transition matrix M = (pij)i,j=1,...,m1 and the m2 × m1-dimensional matrix Γ = (γij)i=1,...,m2; j=1,...,m1 that links θt with Xt. Analogous models can be defined for the case where Xt (t ∈ N) are continuous variables.

The flexibility of hidden Markov models is due to the fact that Xt can be an arbitrary quantity with an arbitrary distribution that can change in time. For instance, Xt itself can be equal to a time series Xt = (Z1, ..., Zn) = (Z1(t), ..., Zn(t)) whose distribution depends on θt. Typically, such models are used in automatic speech processing (see e.g. Levinson et al. 1983, Juang and Rabiner 1991). The variable θt may represent the unobservable state of the vocal tract at time t, which in turn produces an observable acoustic signal Z1(t), ..., Zn(t) generated by a distribution characterized by θt. Given observations Xt (t = 1, 2, ..., N), the aim is to guess which configurations θt (t = 1, 2, ..., N) the vocal tract was in. More specifically, it is sometimes assumed that there is only a finite number of possible acoustic signals. We may therefore denote by Xt the label of the observed signal and estimate θt by maximizing the a posteriori probability P(θt = j | Xt = i). Using the Bayes rule, this leads to

θ̂t = arg max_{j=1,...,m1} P(θt = j | Xt = i)
    = arg max_{j=1,...,m1} [ P(Xt = i | θt = j) P(θt = j) / Σ_{l=1}^{m1} P(Xt = i | θt = l) P(θt = l) ]   (6.14)
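Equation (6.14) is straightforward to evaluate once the prior P(θt = j) and the emission probabilities are given. The sketch below uses a made-up two-state model with a binary observation alphabet (all numbers are illustrative assumptions, not data from the text):

```python
def map_state(x, prior, emit):
    """MAP guess of the hidden state given a single observation x,
    following the Bayes rule in (6.14).

    prior[j]   = P(theta = j)
    emit[j][x] = P(X = x | theta = j)   (0-based labels)
    """
    numer = [emit[j][x] * prior[j] for j in range(len(prior))]
    total = sum(numer)                  # the denominator in (6.14)
    post = [p / total for p in numer]
    best = max(range(len(post)), key=post.__getitem__)
    return best, post

prior = [0.6, 0.4]
emit = [[0.7, 0.3],    # state 0 emits observation 0 most of the time
        [0.2, 0.8]]    # state 1 emits observation 1 most of the time
state, post = map_state(1, prior, emit)
```

Observing x = 1 here yields posterior (0.36, 0.64), so the MAP rule picks state 1 even though state 0 has the larger prior.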

6.2.4 Parameter estimation for Markov and hidden Markov models


In principle, parameter estimation for Markov chains and hidden Markov models is simple, since the likelihood function can be written down explicitly in terms of simple conditional probabilities. The main difficulties that can occur are:
1. Large number of unknown parameters: the unknown parameters for a Markov chain are the initial distribution π and the transition matrix M = (pij)i,j=1,...,m. If m is finite, then the number of unknown parameters is (m − 1) + m(m − 1). If the initial distribution does not matter, then this reduces to m(m − 1). Both numbers can be quite large compared to the available sample size, since they increase quadratically in m. The situation is even worse if the state space is infinite, since then the number of unknown parameters is infinite. A solution to this problem is to impose restrictions on the parameters or to define parsimonious models where M is characterized by a low-dimensional parameter vector.
2. Implicit solution: the maximum likelihood estimate of the unknown parameters is the solution of a system of nonlinear equations, and therefore must be found by a suitable numerical algorithm. For real-time applications with massive data input, as they typically occur in speech processing or processing of musical sound signals, fast algorithms are required.
3. Asymptotic distribution: the asymptotic distribution of maximum likelihood estimates is not always easy to derive.
6.3 Specific applications in music
6.3.1 Stationary distribution of intervals modulo 12
We consider intervals between successive notes modulo octave for the upper envelopes of the following compositions:
Anonymus: a) Saltarello (13th century); b) Saltarello (14th century); c) Alle Psallite (13th century); d) Trotto (13th century)
A. de la Halle (1235?-1287): Or est Bayard en la pature, hure!
J. de Ockeghem (1425-1495): Canon epidiatesseron
J. Arcadelt (1505-1568): a) Ave Maria, b) La Ingratitud, c) Io Dico Fra Noi
W. Byrd (1543-1623): a) Ave Verum Corpus, b) Alman, c) The Queen's Alman
J. Dowland (1562-1626): a) Come Again, b) The Frog Galliard, c) The King of Denmark's Galliard
H.L. Hassler (1564-1612): a) Galliard, b) Kyrie from Missa secunda, c) Sanctus et Benedictus from Missa secunda
G.P. Palestrina (1525-1594): a) Jesu Rex admirabilis, b) O bone Jesu, c) Pueri Hebraeorum
J.P. Rameau (1683-1764): a) La Popliniere, b) Tambourin, c) La Triomphante (Figure 6.1)
F. Couperin (1668-1733): a) Barricades mystérieuses, b) La Linotte effarouchée, c) Les Moissonneurs, d) Les Papillons
J.S. Bach (1685-1750): Das Wohltemperierte Klavier; Cello Suites I to VI (1st movements)
D. Scarlatti (1660-1725): a) Sonata K 222, b) Sonata K 345, c) Sonata K 381
J. Haydn (1732-1809): Sonata op. 34, No. 2
W.A. Mozart (1756-1791): a) Sonata KV 332, 2nd Mov., b) Sonata KV 545, 2nd Mov., c) Sonata KV 333, 2nd Mov.
F. Chopin (1810-1849): a) Nocturne op. 9, No. 2, b) Nocturne op. 32, No. 1, c) Etude op. 10, No. 6 (Figure 6.2)
R. Schumann (1810-1856): Kinderszenen op. 15
J. Brahms (1833-1897): a) Hungarian Dances No. 1, 2, 3, 6, 7, b) Intermezzo op. 117, No. 1 (Figures 6.12, 9.7, 11.5)
C. Debussy (1862-1918): a) Clair de lune, b) Arabesque No. 1, c) Reflets dans l'eau
A. Scriabin (1872-1915): Preludes a) op. 2, No. 2, b) op. 11, No. 14, c) op. 13, No. 2
S. Rachmaninoff (1873-1943): a) Prelude op. 3, No. 2, b) Preludes op. 23, No. 3, 5, 9
B. Bartók (1881-1945): a) Bagatelle op. 11, No. 2, b) Bagatelle op. 11, No. 3, c) Sonata for piano
O. Messiaen (1908-1992): Vingt regards sur l'enfant de Jésus, No. 3
S. Prokofieff (1891-1953): Visions fugitives a) No. 11, b) No. 12, c) No. 13
A. Schönberg (1874-1951): Piano piece op. 19, No. 2
T. Takemitsu (1930-1996): Rain Tree Sketch No. 1
A. Webern (1883-1945): Orchesterstück op. 6, No. 6
Since we are not interested in note repetitions, zero is excluded, i.e. the state space of Xt consists of the numbers 1, ..., 11. For the sake of simplicity, Xt is assumed to be a Markov chain. This is, of course, not really true; nevertheless, an approximation by a Markov chain may reveal certain characteristics of the composition. The elements of the transition matrix M = (pij)i,j=1,...,11 are estimated by the relative frequencies

p̂ij = [ Σ_{t=2}^n 1{xt−1 = i, xt = j} ] / [ Σ_{t=1}^{n−1} 1{xt = i} ]   (6.15)
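A minimal sketch of this estimation (with a made-up melody; pitches are given as MIDI-style numbers, which the text does not use but which serve for illustration):

```python
def interval_sequence(pitches):
    """Intervals between successive notes modulo 12, repetitions (0) excluded."""
    steps = [(b - a) % 12 for a, b in zip(pitches, pitches[1:])]
    return [s for s in steps if s != 0]

def estimate_transitions(seq, states):
    """Relative-frequency estimates of the transition probabilities, eq. (6.15)."""
    counts = {i: {j: 0 for j in states} for i in states}
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    phat = {}
    for i in states:
        total = sum(counts[i].values())
        phat[i] = {j: (counts[i][j] / total if total else 0.0) for j in states}
    return phat

pitches = [60, 62, 64, 62, 60, 60, 64]     # made-up melody
seq = interval_sequence(pitches)           # [2, 2, 10, 10, 4]
phat = estimate_transitions(seq, states=range(1, 12))
```

Each row of the estimated matrix sums to one (unless a state was never visited), so the estimate is itself a valid transition matrix on the visited states.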

Figure 6.1 Jean-Philippe Rameau (1683-1764). (Engraving by A. St. Aubin after J. J. Caffieri, Paris after 1764; courtesy of Zentralbibliothek Zürich.)

and the stationary distribution of the Markov chain with transition matrix M̂ = (p̂ij)i,j=1,...,11 is estimated by solving the system of linear equations

π̂^t(I − M̂) = 0

as described above. Figures 6.3a through l show the resulting values of π̂j (joined by lines). For each composition, the vector π̂ is plotted against j. For visual clarity, points at neighboring states j and j − 1 are connected. The figures illustrate how the characteristic shape of π changed in the course of the last 500 years. The most dramatic change occurred in the 20th century with a flattening of the peaks. Starting with Scriabin, a pioneer of atonal music though still rooted in the romantic style of the late 19th century, this is most extreme for the compositions by Schönberg, Webern, Takemitsu, and Messiaen. On the other hand, Prokofieff's Visions fugitives exhibit clear peaks, but at varying locations. The estimated stationary distributions can also be used to perform a cluster analysis. Figure 6.4 shows the result of the single linkage algorithm with the Manhattan norm (see Chapter 10). To make names legible, only a subsample of the data was used. An almost perfect separation between Bach and composers from the classical and romantic period can be seen.


Figure 6.2 Frédéric Chopin (1810-1849). (Courtesy of Zentralbibliothek Zürich.)

6.3.2 Stationary distribution of interval torus values


An analogous analysis can be carried out replacing the interval numbers by the corresponding values of the torus distance (see Chapter 1). Excluding zeroes, the state space consists of the three numbers 1, 2, 3 only. For the same compositions as above, the stationary probabilities π̂j (j = 1, 2, 3) are calculated. A cluster analysis as above, but with the new probabilities, yields practically the same result as before (Figure 6.5). Since the state space contains three elements only, it is now even easier to find the patterns that determine clustering. In particular, log-odds-ratios log(πi/πj) (i ≠ j) appear to be characteristic. Boxplots are shown in Figures 6.6a, 6.7a and 6.8a
for categories of composers defined by date of birth as follows: a) before 1600 (early music); b) [1600, 1720) (baroque); c) [1720, 1800) (classic); d) [1800, 1880) (romantic and early 20th century) (Figure 6.12); e) 1880 and later (20th century). This is a simple, though somewhat arbitrary, division with some inaccuracies; for instance, Schönberg is classified in category 4 instead of 5. The log-odds-ratio between π1 and π2 is highest in the classical period and generally tends to decrease afterwards.


Moreover, there is a distinct jump from the baroque to the classical period. This jump is also visible for log(π1/π3). Here, however, the attained level is kept in the subsequent time periods. For log(π2/π3) a gradual increase


Figure 6.3 Stationary distributions π̂j (j = 1, ..., 11) of Markov chains with state space Z12 \ {0}, estimated for the transition between successive intervals.


Figure 6.4 Cluster analysis based on stationary Markov chain distributions for compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rachmaninoff.

can be observed. The differences are even more visible when comparing individual composers. This is illustrated in Figures 6.9a and b where Bach's and Schumann's log(π1/π3) and log(π2/π3) are compared, and in Figures 6.10a through f where the median and lower and upper quartiles of π̂j are plotted against j. Finally, Figure 6.11 shows the plots of log(π1/π3) and log(π2/π3) against the date of birth.

6.3.3 Classification by hidden Markov models
Chai and Vercoe (2001) study classification of folk songs using hidden Markov models. They consider, essentially, four ways of representing a melody, namely by a) a vector of pitches modulo 12; b) a vector of pitches modulo 12 together with duration (duration being represented by repeating the same pitch); c) a sequence of intervals (differenced series of pitches); and d) a sequence of intervals, with intervals being classified into only five interval classes {0}, {1, 2}, {−1, −2}, {x ≥ 3} and {x ≤ −3}. The observed data consist of 187 Irish, 200 German, and 104 Austrian homophonic melodies from folk songs. For each melody representation, the authors estimate the parameters of several hidden Markov models which differ mainly with respect to the size of the hidden state space. The models are fitted for each


Figure 6.5 Cluster analysis based on stationary Markov chain distributions of torus distances for compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rachmaninoff.

country separately. Only 70% of the data are used for estimation. The remaining 30% are used for validation of a classification rule defined as follows: a melody is assigned to country j, if the corresponding likelihood (calculated using the country's hidden Markov model) is the largest. Not surprisingly, the authors conclude that the most reliable distinction can be made between Irish and non-Irish songs.


Figure 6.6 Comparison of log odds ratios log(π1/π2) of stationary Markov chain distributions of torus distances: a) five period categories; b) classic (birth 1720-1800) vs. not classic.

Figure 6.7 Comparison of log odds ratios log(π1/π3) of stationary Markov chain distributions of torus distances: a) five period categories; b) up to baroque (birth before 1720) vs. after baroque (birth 1720 and later).


Figure 6.8 Comparison of log odds ratios log(π2/π3) of stationary Markov chain distributions of torus distances: a) five period categories; b) up to baroque vs. after baroque.

Figure 6.9 Comparison of log odds ratios log(π1/π3) (a) and log(π2/π3) (b) of stationary Markov chain distributions of torus distances for Bach and Schumann.


Figure 6.10 Comparison of stationary Markov chain distributions of torus distances.

Figure 6.11 Log odds ratios log(π1/π3) (a) and log(π2/π3) (b) plotted against date of birth of composer.


Figure 6.12 Johannes Brahms (1833-1897). (Courtesy of Zentralbibliothek Zürich.)

6.3.4 Reconstructing scores from acoustic signals


One of the ultimate dreams of musical signal recognition is to reconstruct
a musical score from the acoustic signal of a musical performance. This
is a highly complex task that has not yet been solved in a satisfactory
manner. Consider, for instance, the problem of polyphonic pitch tracking
dened as follows: given a musical audio signal, identify the pitches of
the music. This problem is not easy for at least two reasons: a) dierent
instruments have dierent harmonics and a dierent change of the spectrum; and b) in polyphonic music, one must be able to distinguish dierent
voices (pitches) that are played simultaneously by the same or dierent
instruments. An approach based on a rather complex hierarchical model
is proposed for instance in Walmsley, Godsill, and Rayner (1999). Suppose that a maximal number N of notes can be played simultaneously and
denote by = (1 , ..., N )t the vector of 0-1-variables indicating whether
note j (j = 1, ..., N ) is played or not. Each note j is associated with a harmonic representation (see Chapter 4) with fundamental frequency j and
amplitudes b1 (j), ..., bk (j) (k = number of harmonics). Time is divided into


disjoint time intervals, so-called frames. In each frame i of length mi, the sound signal is assumed to be equal to yi(t) = μi(t) + ei(t), where μi(t) (t = 1, ..., mi) is the sum of the harmonic representations of the notes and ei(t) is random noise. Walmsley et al. assume ei to be iid (independent identically distributed) normal with zero mean and variance σi². Taking everything together, the probability distribution of the acoustic signal is fully specified by a finite dimensional parameter vector θ. In principle, given an observed signal, θ could be estimated by maximizing the likelihood (see Chapter 4). The difficulty is, however, that the dimension of θ is very high compared to the number of observations. The solution proposed by Walmsley et al. is to circumvent this problem by a Bayesian approach, in that θ is assumed to be generated by an a priori distribution. Given the data, consisting of a sound signal and an a priori distribution p(θ), the a posteriori distribution p(θ|yi) of θ is given by

p(θ|yi) = f(yi|θ)p(θ) / ∫ f(yi|θ)p(θ)dθ    (6.16)

where

f(yi|θ) = (2πσi²)^(−mi/2) exp(−(1/(2σi²)) Σ_{t=1}^{mi} ei²(t))

and ei(t) = ei(t; θ). How many notes and which pitches are played can then be decided, for instance, by searching for the mode of the distribution. Even if this model is assumed to be realistic, a major practical difficulty remains: the dimension of θ can be several hundred. The computation of the a posteriori distribution is therefore very difficult since calculation of ∫ f(yi|θ)p(θ)dθ involves high-dimensional numerical integration. A further complication is that some of the parameters may be highly correlated. Walmsley et al. therefore propose to use Markov Chain Monte Carlo methods (see e.g. Gilks et al. 1996). The essential idea is to simulate the integral by a sample mean of f(yi|θ) where θ is sampled randomly from the a priori distribution p(θ). Sampling can be done by using a Markov process whose stationary distribution is p. The simulation can be simplified further by the so-called Gibbs sampler which uses suitable one-dimensional conditional distributions (Besag 1989).
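The sampling idea can be illustrated with a minimal random-walk Metropolis sampler. This is a toy sketch of the general MCMC principle (one-dimensional target, with step size and function names chosen here for the example), not the hierarchical sampler of Walmsley et al.:

```python
import math
import random

def metropolis(log_target, x0, n_samples, step=1.0, seed=1):
    """Random-walk Metropolis: simulate a Markov chain whose
    stationary distribution has the given (unnormalized) log-density."""
    rng = random.Random(seed)
    x, lp = x0, log_target(x0)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        lp_prop = log_target(proposal)
        # accept with probability min(1, target(proposal)/target(x))
        if math.log(rng.random()) < lp_prop - lp:
            x, lp = proposal, lp_prop
        samples.append(x)
    return samples

# Toy posterior: standard normal, log-density -x^2/2 up to a constant.
draws = metropolis(lambda x: -0.5 * x * x, 0.0, 50000)
post_mean = sum(draws) / len(draws)
```

With a fixed seed, the sample mean and variance of the draws approximate the posterior mean 0 and variance 1; posterior summaries such as the mode or mean are then read off from the simulated chain instead of a high-dimensional integral.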
A more modest task than polyphonic pitch tracking is automatic segmentation of monophonic music. The task is as follows: given a monophonic musical score and a sampled acoustic signal of a performance of the score, identify for each note and rest in the score the corresponding time interval in the performance. A possible approach based on hidden Markov processes and Bayesian models is proposed in Raphael (1999) (also see Raphael 2001a,b). Raphael, who is a professional oboist and a mathematical statistician, also implemented his method in a computer system, called Music Plus One, that performs the role of a musical accompanist.


CHAPTER 7

Circular statistics

7.1 Musical motivation

Many phenomena in music are circular. The best known examples are repeated rhythmic patterns, the circles of fourths and fifths, and scales modulo octave in the well-tempered system. In the circle of fourths, for example, one progresses by steps of a fourth and arrives, after 12 steps, at the initial starting point modulo octave. It is not immediately clear whether and how to calculate in such situations, and what type of statistical procedures may be used. The theory of circular statistics has been developed to analyze data on circles where angles have a meaning. Originally, this was motivated by data in biology (e.g. direction of bird flight), meteorology (e.g. direction of wind), and geology (e.g. magnetic fields). Here we give a very brief introduction, mostly to descriptive statistics. For an extended account of methods and applications of circular statistics see, for instance, Mardia (1972), Batschelet (1981), Watson (1983), Fisher (1993), and Jammalamadaka and SenGupta (2001). In music, circular methods can be applied to situations where angles measure a meaningful distance between points on the circle and arithmetic operations in the sense of circular data are well defined.

7.2 Basic principles

7.2.1 Some descriptive statistics

Circular data are observations on a circle. In other words, observations consist of directions expressed in terms of angles. The first question is which statistics describe the data in a meaningful way or, at an even more basic level, how to calculate at all when moving on a circle. The difficulty can be seen easily by trying to determine the average direction. Suppose we observe two angles θ1 = 330° and θ2 = 10°. It is plausible to say that the average direction is 350°. However, the average is (330° + 10°)/2 = 170°, which is almost the opposite direction. Calculating the sample mean of angles is obviously not meaningful.

The simple solution is to interpret angular observations as vectors in the plane, with end points on the unit circle, and applying vector addition

instead of adding angles. Thus, we replace θi (i = 1, ..., n) by

xi = (cos θi, sin θi)^t

where θ is measured anti-clockwise relative to the horizontal axis. The following descriptive statistics can then be defined.

Definition 47 Let

C = Σ_{i=1}^n cos θi,  S = Σ_{i=1}^n sin θi,  R = √(C² + S²).    (7.1)

The (vector of the) mean direction of θi (i = 1, ..., n) is equal to

x̄ = (cos θ̄, sin θ̄)^t = (C/R, S/R)^t    (7.2)

Equivalently one may use the following

Definition 48 The (angle of the) mean direction of θi (i = 1, ..., n) is equal to

θ̄ = arctan(S/C) + π·1{C < 0} + 2π·1{C > 0, S < 0}    (7.3)

Moreover, we have

Definition 49 The mean resultant length of θi (i = 1, ..., n) is equal to

R̄ = R/n    (7.4)

Note that R is the length of the vector n·x̄ obtained by adding all observed vectors. If all angles are identical, then R = n so that R̄ = 1. In all other cases, we have 0 ≤ R̄ < 1. In the other extreme case with θi = 2πi/n (i.e. the angles are scattered uniformly over [0, 2π), there are no clusters of directions), we have R̄ = 0. In this sense, R̄ measures the amount of concentration around the mean direction. This leads to

Definition 50 The sample circular variance of θi (i = 1, ..., n) is equal to

V = 1 − R̄    (7.5)

Note, however, that R̄ is not a perfect measure of concentration, since R̄ = 0 does not necessarily imply that the data are scattered uniformly. For instance, suppose n is even, θ_{2i+1} = π and θ_{2i} = 0. Thus there are two preferred directions. Nevertheless, R̄ = 0.
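As a quick check of Definitions 47 to 50, the following sketch computes C, S, R, the mean direction, the mean resultant length, and the circular variance for the two-angle example above (the function name is mine; the quadrant correction follows (7.3) and, as a simplification, assumes C ≠ 0):

```python
import math

def circular_stats(theta):
    """Mean direction, mean resultant length, and circular variance
    for angles theta (in radians), following Definitions 47-50."""
    n = len(theta)
    C = sum(math.cos(t) for t in theta)
    S = sum(math.sin(t) for t in theta)
    R = math.hypot(C, S)              # R = sqrt(C^2 + S^2)
    theta_bar = math.atan(S / C)      # assumes C != 0
    if C < 0:
        theta_bar += math.pi          # correction for C < 0
    elif S < 0:
        theta_bar += 2 * math.pi      # correction for C > 0, S < 0
    R_bar = R / n                     # mean resultant length (7.4)
    V = 1.0 - R_bar                   # sample circular variance (7.5)
    return theta_bar, R_bar, V

# 330 and 10 degrees: mean direction 350 degrees, not (330+10)/2 = 170
theta_bar, R_bar, V = circular_stats([math.radians(330), math.radians(10)])
print(math.degrees(theta_bar))  # ~350
```

The two unit vectors lie symmetrically around 350°, so the resultant points exactly there and R̄ = cos 20° ≈ 0.94, close to the maximal concentration 1.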


Alternative measures of center and variability respectively are the median and the difference between the lower and upper quartile. The median direction is a direction Mn = φo determined as follows: a) find the axis (straight line through zero) such that the data are divided into two groups of equal size (if n is odd, then the axis passes through at least one point, otherwise through the midpoint between the two observations in the middle); b) take the direction φ on the chosen axis for which the more points xi are closer to the point (cos φ, sin φ)^t defined by φ. Similarly, the lower and upper quartiles, Q1, Q2, can be defined by dividing each of the halves into two halves again. An alternative measure of variability is then given by IQR = Q2 − Q1.
Since we are dealing with vectors in the two-dimensional plane, all quantities above can be expressed in terms of complex numbers. In particular, one can define trigonometric moments by

Definition 51 For p = 1, 2, ... let

Cp = Σ_{i=1}^n cos pθi,  Sp = Σ_{i=1}^n sin pθi,  Rp = √(Cp² + Sp²)    (7.6)

C̄p = Cp/n,  S̄p = Sp/n,  R̄p = Rp/n    (7.7)

and

θ̄(p) = arctan(Sp/Cp) + π·1{Cp < 0} + 2π·1{Cp > 0, Sp < 0}    (7.8)

Then

m̂p = C̄p + iS̄p = R̄p e^{iθ̄(p)}    (7.9)

is called the pth trigonometric sample moment.

For p = 1, this definition yields

m̂1 = C̄1 + iS̄1 = R̄1 e^{iθ̄(1)}

with C̄1 = C̄, S̄1 = S̄, R̄1 = R̄ and θ̄(1) = θ̄ as before. Similarly, we have

Definition 52 Let

Cp° = Σ_{i=1}^n cos p(θi − θ̄(1)),  Sp° = Σ_{i=1}^n sin p(θi − θ̄(1))    (7.10)

C̄p° = Cp°/n,  S̄p° = Sp°/n    (7.11)

θ̄°(p) = arctan(Sp°/Cp°) + π·1{Cp° < 0} + 2π·1{Cp° > 0, Sp° < 0}    (7.12)

Then

m̂p° = C̄p° + iS̄p° = R̄p° e^{iθ̄°(p)}    (7.13)

is called the pth centered trigonometric (sample) moment, centered relative to the mean direction θ̄(1).

Note, in particular, that Σ sin(θi − θ̄(1)) = 0 so that m̂1° = R̄1. An overview of descriptive measures of center and variability is given in Table 7.1.


Table 7.1 Some Important Descriptive Statistics for Circular Data

Name | Definition | Feature measured
Sample mean | x̄ = (C/R, S/R)^t with R = √(C² + S²) | Center (direction)
Mean resultant length | R̄ = R/n | Concentration
Mean direction | θ̄ = arctan(S/C) + π·1{C < 0} + 2π·1{C > 0, S < 0} | Center (angle)
Median direction | Mn = arg min_φ Σ_{i=1}^n (π − |π − |θi − φ||) | Center (angle)
Quartiles Q1, Q2 | Q1 = median of {θi : Mn − π ≤ θi ≤ Mn}, Q2 = median of {θi : Mn ≤ θi ≤ Mn + π} | Center of left and right half
Modal direction | M̂n = arg max f̂(θ), where f̂(θ) is an estimate of the density f | Center (angle)
Principal direction | ā = first eigenvector of S = Σ_{i=1}^n xi xi^t | Center (direction, unit vector)
Concentration | λ̂1 = first eigenvalue of S | Variability
Circular variance | Vn = 1 − R̄ | Variability
Circular stand. dev. | sn = √(−2 log(1 − Vn)) | Variability
Circular dispersion | d̂n = (1 − √(C̄2² + S̄2²))/(2R̄1²) | Variability
Mean deviation | Dn = n⁻¹ Σ_{i=1}^n (π − |π − |θi − Mn||) | Variability
Interquartile range | IQR = Q2 − Q1 | Variability

7.2.2 Correlation and autocorrelation

A model for perfect linear association between two circular random variables φ, ψ is

ψ = φ + c (mod 2π)    (7.14)

where c ∈ [0, 2π) is a fixed constant. A sample statistic that measures how close we are to this perfect association is

r_{φ,ψ} = Σ_{i,j=1; i≠j}^n sin(φi − φj) sin(ψi − ψj) / √( Σ_{i,j=1; i≠j}^n sin²(φi − φj) · Σ_{i,j=1; i≠j}^n sin²(ψi − ψj) )    (7.15)

or

r̃_{φ,ψ} = det(n⁻¹ Σ_{i=1}^n xi yi^t) / √( det(n⁻¹ Σ_{i=1}^n xi xi^t) · det(n⁻¹ Σ_{i=1}^n yi yi^t) )    (7.16)

where xi = (cos φi, sin φi)^t and yi = (cos ψi, sin ψi)^t. For a time series φt (t = 1, 2, ...) of circular data, this definition can be carried over to autocorrelations

r(k) = Σ_{i,j=1; i≠j}^n sin(φi − φj) sin(φ_{i+k} − φ_{j+k}) / Σ_{i,j=1; i≠j}^n sin²(φi − φj)    (7.17)

or

r̃(k) = det(n⁻¹ Σ_{i=1}^{n−k} xi x_{i+k}^t) / det(n⁻¹ Σ_{i=1}^{n−k} xi xi^t)    (7.18)
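A direct implementation of the O(n²) double sum in (7.15) can be sketched as follows (the function name is mine):

```python
import math

def circular_corr(phi, psi):
    """Circular correlation of two angle samples, following (7.15):
    sums of sin(phi_i - phi_j) * sin(psi_i - psi_j) over all i != j,
    normalized by the corresponding sums of squared sines."""
    n = len(phi)
    num = d_phi = d_psi = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            sp = math.sin(phi[i] - phi[j])
            sq = math.sin(psi[i] - psi[j])
            num += sp * sq
            d_phi += sp * sp
            d_psi += sq * sq
    return num / math.sqrt(d_phi * d_psi)

# Perfect association psi = phi + c (mod 2*pi) gives correlation 1,
# since the 2*pi wrap-around drops out inside the sine.
phi = [0.1, 0.7, 1.9, 3.2, 5.0]
psi = [(p + 1.3) % (2 * math.pi) for p in phi]
print(circular_corr(phi, psi))  # ~1.0
```

Reversing the orientation (ψ = −φ mod 2π) flips every sine and yields correlation −1, the circular analogue of perfect negative linear association.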

7.2.3 Probability distributions

A probability distribution for circular data is a distribution F on the interval [0, 2π). The sample statistics defined in Section 7.1 are estimates of the corresponding population counterparts in Table 7.2. Most frequently used distributions are the uniform, cardioid, wrapped, von Mises, and mixture distributions.

Uniform distribution U([0, 2π)):

F(u) = P(0 ≤ θ ≤ u) = (u/(2π)) · 1{0 ≤ u < 2π},

f(u) = F′(u) = (1/(2π)) · 1{0 ≤ u < 2π}.

In this case, ρ_{p,C} = ρ_{p,S} = 0 (p ≥ 1), the mean direction is not defined, and the circular standard deviation and dispersion are infinite. This expresses the fact that there is no preference for any direction and variability is therefore maximal.

Cardioid (or cosine) distribution C(μ, ρ):

F(u) = [ (ρ/π) sin(u − μ) + u/(2π) ] · 1{0 ≤ u < 2π}

and

f(u) = (1/(2π)) (1 + 2ρ cos(u − μ)) · 1{0 ≤ u < 2π}

where 0 ≤ ρ ≤ 1/2. In this case, the mean direction is μ, ρ1 = ρ, ρp = 0 (p > 1) and δ = 1/(2ρ²). An interesting property is that this distribution tends to the uniform distribution as ρ → 0.


Table 7.2 Some important population statistics for circular data

Name | Definition | Feature
pth trigonometric moment | φp = ∫₀^{2π} cos(pθ) dF(θ) + i ∫₀^{2π} sin(pθ) dF(θ) = ρ_{p,C} + iρ_{p,S} = ρp e^{iμ(p)} |
Mean direction | μ = arctan(ρ_{1,S}/ρ_{1,C}) + π·1{ρ_{1,C} < 0} + 2π·1{ρ_{1,C} > 0, ρ_{1,S} < 0} | Center (angle)
pth central trig. moment | φp° = ∫₀^{2π} cos(p(θ − μ)) dF(θ) + i ∫₀^{2π} sin(p(θ − μ)) dF(θ) = ρ°_{p,C} + iρ°_{p,S} |
Mean resultant length | ρ = |φ1| | Concentration
Median direction | M = {φ : ∫_{φ−π}^{φ} dF(θ) = ∫_{φ}^{φ+π} dF(θ) = 1/2} | Center (angle)
Quartiles q1, q2 | q1 = median of {θ : M − π ≤ θ ≤ M}, q2 = median of {θ : M ≤ θ ≤ M + π} | 25%-quantile, 75%-quantile
Modal direction | M* = arg max f(θ) | Center (angle)
Principal direction | α = first eigenvector of Σ = E(XX^t) | Center (direction)
Concentration | λ1 = first eigenvalue of Σ | Variability
Circular variance | ν = 1 − ρ | Variability
Circular stand. dev. | σ = √(−2 log(1 − ν)) | Variability
Circular dispersion | δ = (1 − ρ2)/(2ρ²) | Variability
Mean deviation | D = ∫₀^{2π} (π − |π − |θ − M||) dF(θ) | Variability
Interquartile range | IQR = q2 − q1 | Variability

Wrapped distribution:
Let X be a random variable with distribution function FX. The random variable ψ = X (mod 2π) has a distribution Fψ on [0, 2π) given by

Fψ(u) = Σ_{j=−∞}^{∞} [FX(u + 2πj) − FX(2πj)]

If X has a density function fX, then the density function of ψ is equal to

fψ(u) = Σ_{j=−∞}^{∞} fX(u + 2πj).

An important special example is the wrapped normal distribution. The wrapped normal distribution WN(μ, ρ) is obtained by wrapping a normal distribution with E(X) = μ and var(X) = σ² = −2 log ρ (0 < ρ ≤ 1). This yields the circular density function

f(u) = (1/(2π)) [1 + 2 Σ_{j=1}^{∞} ρ^{j²} cos j(u − μ)] · 1{0 ≤ u < 2π}

Then, the mean direction is μ, ρ1 = ρ, δ = (1 − ρ⁴)/(2ρ²), ρ_{p,C} = ρ^{p²} and ρ_{p,S} = 0 (p ≥ 1). For ρ → 0, we obtain the uniform distribution, and for ρ → 1 a distribution with point mass in the direction μ.

von Mises distribution M(μ, κ):
The most frequently used unimodal circular distribution is the von Mises distribution with density function

f(u) = (1/(2π I₀(κ))) e^{κ cos(u−μ)} · 1{0 ≤ u < 2π}

where 0 < κ < ∞, 0 ≤ μ < 2π and

I₀(κ) = (1/(2π)) ∫₀^{2π} exp(κ cos(v − μ)) dv = Σ_{j=0}^{∞} (1/(j!)²) (κ/2)^{2j}

is the modified Bessel function of the first kind and order 0. In this case, we have mean direction μ, ρ1 = I₁(κ)/I₀(κ), δ = [κ I₁(κ)/I₀(κ)]⁻¹, ρ_{p,C} = Ip(κ)/I₀(κ) and ρ_{p,S} = 0 (p ≥ 1) where

Ip(κ) = Σ_{j=0}^{∞} (1/((j + p)! j!)) (κ/2)^{2j+p}

is a modified Bessel function of order p. For κ → 0, the M(μ, κ)-distribution converges to U([0, 2π)), and for κ → ∞ we obtain a point mass in the direction μ.

Mixture distribution:
All distributions above are unimodal. Distributions with more than one mode can be modeled, for instance, by mixture distributions

f(u) = p1 fψ,1(u) + ... + pm fψ,m(u)

where 0 ≤ p1, ..., pm ≤ 1, Σ pi = 1 and the fψ,j are different circular probability densities.

7.2.4 Statistical inference

Statistical inference about population parameters is mainly known for the distributions above. Classical methods can be found in Mardia (1972), Batschelet (1981), Watson (1983), and Fisher (1993). For recent results see e.g. Jammalamadaka and SenGupta (2001).
7.3 Specific applications in music

7.3.1 Variability and autocorrelation of notes modulo 12

Figure 7.1 Béla Bartók statue by Varga Imre in front of the Béla Bartók Memorial House in Budapest. (Courtesy of the Béla Bartók Memorial House.)

The following analysis is done for various compositions: pitch is represented in Z12 with 0 set equal to the note (modulo 12) with the highest frequency in the composition. Given a note j in Z12, the corresponding circular point is then x = (x1, x2)^t = (cos(2πj/12), sin(2πj/12))^t. The following statistics are calculated: λ̂1, R̄, d̂ and the maximal circular autocorrelation m = max_{1≤k≤10} |r̃(k)|. The compositions considered here are:

Figure 7.2 Sergei Prokofiev as a child. (Courtesy of Karadar Bertoldi Ensemble; www.karadar.net/Ensemble/.)


Figure 7.3 Circular representation of compositions by J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokofiev (Visions fugitives No. 8).


J. S. Bach: Das Wohltemperierte Klavier I (all preludes and fugues)
D. Scarlatti: Sonatas Kirkpatrick No. 49, 125, 222, 345, 381, 412, 440, 541
B. Bartók (Figure 7.1): Bagatelles No. 1-13, Sonata for Piano (2nd movement)
S. Prokofiev (Figure 7.2): Visions fugitives No. 1-15.
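The mapping from pitch classes to circle points used in these plots can be sketched as follows (the function name is mine; the clockwise-from-top orientation matches the description of Figure 7.3 given below, while the formulas in the text use the plain counter-clockwise convention):

```python
import math

def note_to_point(j):
    """Map pitch class j in Z_12 to the plot convention of Figure 7.3:
    0 sits on top of the unit circle, and the points proceed clockwise
    as j increases (one semitone = 30 degrees)."""
    angle = math.pi / 2.0 - 2.0 * math.pi * j / 12.0
    return (math.cos(angle), math.sin(angle))

# j = 0 is on top; j = 3 lies a quarter turn clockwise, at (1, 0).
print(note_to_point(0))  # approximately (0, 1)
```

All 12 points lie on the unit circle; a composition is then drawn by jittering these points slightly and joining successive notes by lines, as described below.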
To simplify the analysis, the upper envelope is considered for each composition. The data set that was available consists of played music. Thus, instead of the written score we are looking at its realization by a pianist. This results in some changes of onset times. In particular, some notes with equal score onset times are not played simultaneously. Strictly speaking, the analysis thus refers to the played music rather than the original score. In Figure 7.3, four representative compositions are displayed. Z12 is represented by a circle starting on top with 0 and proceeding clockwise as j ∈ Z12 increases. A composition is thus represented by pitches j1, ..., jn ∈ Z12, each pitch being represented by a dot on the circle. In order to visualize how frequent each note is, each point xi = (cos φi, sin φi)^t (i = 1, ..., n), where φi = 2πji/12, is displaced slightly by adding a random number from a uniform distribution on [0, 0.1] to the angle φi. (This technique of exploratory data analysis is often referred to as jittering; see Chambers et al. 1983.) Moreover, to obtain an impression of the dynamic movement, successive points xi, xi+1 are joined by a line. The connections visualize which notes are likely to follow each other. Some clear differences are visible between the four plots: for Bach, the main movements take place along the edges, the main points and vertices corresponding to the D-major scale. The rather curious simple figure for Bartók's Bagatelle No. 3 stems from the continuous repetition of the same chromatic figure in the upper voice. For Prokofiev one can see two main vertices that are positioned symmetrically with respect to the middle vertical line. This is due to the repetitive nature of the upper envelope. Figure 7.4 shows boxplots of λ̂1, R̄, d̂, and log m, comparing Bach, Scarlatti, Bartók, and Prokofiev. Variability is clearly lower for Bartók and Prokofiev, independently of the specific statistic that is used. There are also some, but less extreme, differences with respect to the maximal autocorrelation m. As one may perhaps expect, Bartók has the highest values of m.
7.3.2 Variability and autocorrelation of note intervals modulo 12

The same as above can be carried out for intervals between successive notes (Figure 7.5). Figure 7.6 shows that, again, variability is much lower for Bartók and Prokofiev.



Figure 7.4 Boxplots of λ̂1, R̄, d̂ and log m for notes modulo 12, comparing Bach, Scarlatti, Bartók, and Prokofiev.


Figure 7.5 Circular representation of intervals of successive notes in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokofiev (Visions fugitives No. 8).



Figure 7.6 Boxplots of λ̂1, R̄, d̂ and log m for note intervals modulo 12, comparing Bach, Scarlatti, Bartók, and Prokofiev.


Figure 7.7 Circular representation of notes ordered according to circle of fourths in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokofiev (Visions fugitives No. 8).



Figure 7.8 Boxplots of λ̂1, R̄, d̂ and log m for notes modulo 12 ordered according to circle of fourths, comparing Bach, Scarlatti, Bartók, and Prokofiev.


Figure 7.9 Circular representation of intervals of successive notes ordered according to circle of fourths in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokofiev (Visions fugitives No. 8).



Figure 7.10 Boxplots of λ̂1, R̄, d̂ and log m for note intervals modulo 12 ordered according to circle of fourths, comparing Bach, Scarlatti, Bartók, and Prokofiev.


7.3.3 Notes and intervals on the circle of fourths

Alternatively, the analysis above can be carried out by ordering notes according to the circle of fourths. Thus, a rotation by 360°/12 = 30° corresponds to a step of one fourth. The analogous plots are given in Figures 7.7 through 7.10. This specific circular representation makes some symmetries and their harmonic meaning more visible.
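Reordering pitch classes along the circle of fourths amounts to reindexing Z12 by multiples of 5: a fourth is 5 semitones, and since gcd(5, 12) = 1, stepping by fourths visits all 12 pitch classes before returning to the start. The following sketch of this reindexing is my own (the text only describes the reordering, not an implementation):

```python
def fourths_position(j):
    """Position of pitch class j on the circle of fourths: the k with
    j = 5*k (mod 12). Because 5*5 = 25 = 1 (mod 12), the inverse of 5
    is 5 itself, so k = 5*j (mod 12). One step (k -> k+1) corresponds
    to a 30-degree rotation on the reordered circle."""
    return (5 * j) % 12

# Walking upward in fourths from 0 visits every pitch class once,
# and fourths_position recovers the step count.
print([fourths_position((5 * k) % 12) for k in range(12)])
```

Composing this reindexing with the circle plot of Section 7.3.1 produces exactly the representations of Figures 7.7 through 7.10: adjacent points now differ by a fourth instead of a semitone.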


CHAPTER 8

Principal component analysis

8.1 Musical motivation

Observations in music often consist of vectors. Consider, for instance, the tempo measurements for Schumann's Träumerei (Figure 2.3). In this case, the observational units are performances and an observation consists of a tempo curve, which is a vector of p tempo measurements x(ti) at symbolic score onset times ti (i = 1, ..., p). The main question is which similarities and differences there are between the performances. Principal component analysis (PCA) provides an answer in the sense that the most interesting, and hopefully interpretable, projections are found. In this chapter, a brief introduction to PCA is given. For a detailed account and references see e.g. Mardia et al. (1979), Anderson (1984), Dillon and Goldstein (1984), Seber (1984), Krzanowski (1988), Flury and Riedwyl (1988), Johnson and Wichern (2002).
8.2 Basic principles

8.2.1 Definition of PCA for multivariate probability distributions

Algorithmic definition

Let X = (X1, ..., Xp)^t be a random vector with expected value E(X) = μ and covariance matrix Σ. The following algorithm is defined:

Step 0. Initialization: Set j = 1 and Z^(1) = X.
Step 1. Find a direction, i.e. a vector a^(j) with |a^(j)| = 1, such that the projection Zj = [a^(j)]^t Z^(j) = a1^(j) Z1^(j) + ... + ap^(j) Zp^(j) has the largest possible variance.
Step 2. Consider the part of Z^(j) that is orthogonal to a^(1), ..., a^(j), i.e. set Z^(j+1) = Z^(j) − Zj a^(j). If j = p, or all components of Z^(j+1) have variance zero, then stop. Otherwise set j = j + 1 and go to Step 1.

The algorithm finds successively orthogonal directions a^(1), a^(2), ... such that the corresponding projections of Z have the largest variance among all projections that are orthogonal to the previous ones. A projection with a large variance is suitable for comparing, ranking, and classifying observations, since different random realizations of the projection tend to be widely scattered. In contrast, if a projection has a small variance, then individuals


do not differ very much with respect to that projection, and are therefore more difficult to distinguish.
Definition via spectral decomposition of matrices

The algorithm given above has an elegant interpretation:

Theorem 18 (Spectral decomposition theorem) Let B be a symmetric p×p matrix. Then B can be written as

B = AΛA^t = Σ_{j=1}^p λj a^(j) [a^(j)]^t    (8.1)

where Λ = diag(λ1, λ2, ..., λp) is a diagonal matrix, λj are the eigenvalues and the columns a^(j) of A the corresponding orthonormal eigenvectors of B, i.e. we have

B a^(j) = λj a^(j)    (8.2)

|a^(j)|² = [a^(j)]^t a^(j) = 1, and [a^(j)]^t a^(l) = 0 for j ≠ l    (8.3)

In matrix form, equation (8.3) means that A is an orthogonal matrix, i.e.

A^t A = I    (8.4)

where I denotes the identity matrix with Ijj = 1 and Ijl = 0 (j ≠ l).
This result can now be applied to the covariance matrix of a random vector X = (X1, ..., Xp)^t:

Theorem 19 Let X be a p-dimensional random vector with expected value E(X) = μ and p×p covariance matrix Σ. Then

Σ = AΛA^t    (8.5)

where the columns a^(j) of A are eigenvectors of Σ and Λ is a diagonal matrix with eigenvalues λ1, ..., λp ≥ 0.

In particular, we may permute the sequence of the X-components such that the eigenvalues are ordered. We thus obtain:

Theorem 20 Let X be a p-dimensional random vector with expected value E(X) = μ and a p×p covariance matrix Σ. Then there exists an orthogonal matrix A such that

Σ = AΛA^t    (8.6)

where the columns a^(j) of A are eigenvectors of Σ and Λ is a diagonal matrix with eigenvalues λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0. Moreover, the covariance matrix of the transformed vector

Z = A^t (X − μ)    (8.7)

is equal to

cov(Z) = A^t Σ A = Λ    (8.8)

Note in particular that var(Z1) = λ1 ≥ var(Z2) = λ2 ≥ ... ≥ var(Zp) = λp and the covariance matrix Σ may be approximated by a matrix

Σ(q) = Σ_{j=1}^q λj a^(j) [a^(j)]^t

for a suitably chosen value q ≤ p. If a good approximation can be achieved for a relatively small value of q, then this means that most of the random variation in X occurs in a low dimensional space spanned by the random vector Z(q) = (Z1, ..., Zq)^t.
Definition 53 The transformation defined by Z = A^t(X − μ) is called the principal component transformation. The jth component of Z,

Zj = [A^t(X − μ)]j = (X − μ)^t a^(j)    (8.9)

is called the jth principal component of X. The jth column of A, i.e. the jth eigenvector a^(j), is called the vector of principal component loadings.

In summary, the principal component transformation rotates the original random vector X in such a way that the new coordinates Z1, ..., Zp are uncorrelated (orthogonal) and they are ordered according to their importance with respect to characterizing the covariance structure of X. The following result states that the algorithmic and the algebraic definition are indeed the same:

Theorem 21 Consider U = b^t X where b = (b1, ..., bp)^t and |b| = 1. Suppose that U is orthogonal (i.e. uncorrelated) to the first k principal components of X. Then var(U) is maximal, among all such projections, if and only if b = a^(k+1), i.e. if U is the (k + 1)st principal component Z_{k+1}.
8.2.2 Definition of PCA for observed data

The definition of principal components given above cannot be applied directly to data, since the expected value and covariance matrix are usually unknown. It can however be modified in an obvious way by replacing population quantities by suitable estimates. The simplest solution is to use the sample mean and the sample covariance matrix. For observed vectors x(i) = (x1(i), ..., xp(i))^t (i = 1, 2, ..., n) one defines

μ̂ = x̄ = (1/n) Σ_{i=1}^n x(i)    (8.10)

and the estimate of the covariance matrix

Σ̂ = (1/n) Σ_{i=1}^n (x(i) − x̄)(x(i) − x̄)^t.    (8.11)

The estimated jth vector of principal component loadings, â^(j), is the standardized eigenvector corresponding to the jth-largest eigenvalue λ̂j of Σ̂. The estimated principal component transformation is then defined by

ẑ = Â^t (x − x̄) = [(x − x̄)^t Â]^t    (8.12)

where the columns of Â are equal to the orthogonal vectors â^(j). Applying this transformation to the observed vectors x(1), ..., x(n) enables us to compare observations with respect to their principal components. The jth principal component of the ith observation is equal to

ẑj(i) = (x(i) − x̄)^t â^(j)    (8.13)

In other words, the ith observed vector x(i) − x̄ is transformed into a rotated vector ẑ(i) = (ẑ1(i), ..., ẑp(i))^t with the corresponding observed principal components. In matrix form, we can define the n×p matrix of observations

X = ( x1(1) x2(1) ... xp(1)
      x1(2) x2(2) ... xp(2)
      ...
      x1(n) x2(n) ... xp(n) )    (8.14)

and the n×p matrix of observed principal components

Ẑ = ( ẑ1(1) ẑ2(1) ... ẑp(1)
      ẑ1(2) ẑ2(2) ... ẑp(2)
      ...
      ẑ1(n) ẑ2(n) ... ẑp(n) )    (8.15)

so that

Ẑ = (X − 1x̄^t) Â    (8.16)

where 1 = (1, ..., 1)^t denotes the n-dimensional column vector of ones. Note that the jth column ẑ^(j) = (ẑj(1), ..., ẑj(n))^t consists of the observed jth principal components. Therefore, the sample variance of the jth principal components is given by

s²_{ẑj} = n⁻¹ Σ_{i=1}^n ẑj²(i) = λ̂j.

If λ̂j is large, then the observed jth principal components ẑj(1), ..., ẑj(n) have a large sample variance so that the observed values are scattered far apart.
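For p = 2 the eigen-decomposition of the 2×2 sample covariance matrix has a closed form, so the estimated principal component transformation can be sketched in a few lines of pure Python. This is an illustrative sketch (function name mine; divisor n in the covariance, as in (8.11)), not a replacement for a general eigen-solver when p is large:

```python
import math

def pca_2d(data):
    """Sample PCA for 2-dimensional observations: returns eigenvalues
    (lam1 >= lam2), the loading vectors a1, a2, and the scores z_j(i)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    # sample covariance matrix (8.11), divisor n
    sxx = sum((x - mx) ** 2 for x, _ in data) / n
    syy = sum((y - my) ** 2 for _, y in data) / n
    sxy = sum((x - mx) * (y - my) for x, y in data) / n
    # eigenvalues of [[sxx, sxy], [sxy, syy]]
    tr = sxx + syy
    det = sxx * syy - sxy * sxy
    disc = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    lam1, lam2 = tr / 2.0 + disc, tr / 2.0 - disc
    # eigenvector for lam1; fall back to the axes if sxy == 0
    if abs(sxy) > 1e-12:
        v = (lam1 - syy, sxy)
    else:
        v = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(*v)
    a1 = (v[0] / norm, v[1] / norm)
    a2 = (-a1[1], a1[0])           # orthogonal unit vector
    # principal component scores z_j(i) = (x(i) - xbar)^t a^(j), cf. (8.13)
    scores = [((x - mx) * a1[0] + (y - my) * a1[1],
               (x - mx) * a2[0] + (y - my) * a2[1]) for x, y in data]
    return (lam1, lam2), (a1, a2), scores
```

For points lying exactly on a line, the second eigenvalue and all second scores vanish, and the sample variance of the first scores reproduces λ̂1, as stated above.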
8.2.3 Scale invariance?

The principal component transformation is based on the covariance matrix. It is therefore not scale invariant, since variance and covariance depend on the units in which the individual components Xj are measured. It is therefore often recommended to standardize all components. Thus, we replace each coordinate xj by (xj − x̄j)/sj where x̄j = n⁻¹ Σ_{i=1}^n xj(i) and s²j = n⁻¹ Σ_{i=1}^n (xj(i) − x̄j)² (or s²j = (n − 1)⁻¹ Σ_{i=1}^n (xj(i) − x̄j)²).
8.2.4 Choosing important principal components

Since an orthogonal transformation does not change the length of vectors, the total variability of the random vector Z in (8.7) is the same as that of the original random vector X with covariance matrix Σ = (σij)_{i,j=1,...,p}. More specifically, one defines total variability by

Vtotal = tr(Σ) = Σ_{i=1}^p σii.    (8.17)

The singular value decomposition (spectral decomposition) of Σ then implies

Theorem 22 Let Σ be a covariance matrix with spectral decomposition Σ = AΛA^t. Then

Vtotal = tr(Σ) = Σ_{i=1}^p λi    (8.18)

Since the eigenvalues λi are ordered according to their size, we may therefore hope that the proportion of total variation

P(q) = (λ1 + ... + λq) / Σ_{i=1}^p λi    (8.19)

is close to one for a low value of q. If this is the case, then one may reduce the dimension of the random vector considerably without losing much information. For data, we plot P̂(q) = (λ̂1 + ... + λ̂q)/Σ λ̂i versus q and judge by eye from which point on the increase in P̂(q) is not worth the price of adding additional dimensions. Alternatively, we may plot the contribution of each eigenvalue, λ̂j/Σ λ̂i or λ̂j itself, against j. This is the so-called scree graph. More formal tests, e.g. for testing which eigenvalues are nonzero or for comparing different eigenvalues, are available, however mostly under the rather restrictive assumption that the distribution of X is multivariate normal (see e.g. Mardia et al. 1979, Ch. 8.3.2).

In addition to the scree plot, the decision on the number of principal components is often also based on the (possibly subjective) interpretability of the components. The interpretation of principal components may be based on the coefficients a_k^(j) and/or on the correlation between Zj and the coordinates of the original random vector X = (X1, ..., Xp)^t. Note that since E(ZX^t) = E(A^t(X − μ)X^t) = A^tΣ = A^tAΛA^t = ΛA^t, var(Xk) = σkk and


var(Zj) = λj, the correlation between Zj and Xk is equal to

ρ_{j,k} = corr(Zj, Xk) = a_k^(j) √(λj/σkk)    (8.20)

Analogously, for observed data we have the empirical correlations

ρ̂_{j,k} = â_k^(j) √(λ̂j/σ̂kk)    (8.21)
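The proportion P(q) from (8.19) is a one-liner in code (eigenvalues assumed sorted in decreasing order; the numbers below are made up purely for illustration):

```python
def proportion_explained(eigvals, q):
    """P(q) = (lambda_1 + ... + lambda_q) / sum of all eigenvalues."""
    return sum(eigvals[:q]) / sum(eigvals)

lams = [5.0, 2.0, 1.5, 1.0, 0.5]        # illustrative eigenvalues
print(proportion_explained(lams, 2))    # 7/10 = 0.7
```

Plotting proportion_explained(lams, q) against q gives the cumulative version of the scree graph; the "elbow" where the curve flattens suggests how many components to keep.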

8.2.5 Plots

One of the main difficulties with high-dimensional data is that they cannot be represented directly in a two-dimensional display. Principal components provide a possible solution to this problem. The situation is particularly simple if the first two principal components explain most of the variability. In that case, the original data (x1(i), ..., xp(i))^t (i = 1, 2, ..., n) may be replaced by the first two principal components (ẑ1(i), ẑ2(i))^t (i = 1, 2, ..., n). Thus, ẑ2(i) is plotted against ẑ1(i). If more than two principal components are needed, then the plot of ẑ2(i) versus ẑ1(i) provides at least a partial view of the data structure, and further projections can be viewed by corresponding scatter plots of other components, or by symbol plots as described in Chapter 2. The scatter plots can be useful for identifying structure in the data. In particular, one may detect unusual observations (outliers) or clusters of similar observations.
8.3 Specific applications in music

8.3.1 PCA of tempo skewness

The 28 tempo curves in Figure 2.3, each consisting of measurements at p = 212 onset times, can be considered as n = 28 observations of a 212-dimensional random vector. Principal component analysis cannot be applied directly to these data. The reason is that PCA relies on estimating the p×p covariance matrix. The number of observations (n = 28) is much smaller than p. Therefore, not all elements of the covariance matrix can be estimated consistently and an empirical PCA-decomposition would be highly unreliable. A solution to this problem is to reduce the dimension p in a meaningful way. Here, we consider the following reduction: the onset-time axis is divided into 8 disjoint blocks A1, A2, A1, A2, B1, B2, A1, A2 of 4 bars each. For each part number i (i = 1, ..., 8) and each performance j (j = 1, ..., 28), we calculate the skewness measure
$$\zeta_j(i) = \frac{\bar{x} - M}{Q_2 - Q_1}$$

where M is the median and Q_1, Q_2 are the lower and upper quartile respectively.

Figure 8.1 Tempo curves for Schumann's Träumerei: skewness for the eight parts A1, A2, A1, A2, B1, B2, A1, A2 for 28 performances, plotted against the number of the part.

Figure 8.1 shows $\zeta_j(i)$ plotted against i. An apparent pattern is the
generally strong negative skewness in B2. (Recall that negative skewness can be created by extreme ritardandi.) Apart from that, however, Figure 8.1 is difficult to interpret directly. Principal component analysis helps to find more interesting features. Figure 8.3 shows the loadings for the first four principal components, which explain more than 80% of the variability (see Figure 8.2). The loadings can be interpreted as follows: the first component corresponds to a weighted average emphasizing the skewness values in the first half of the piece. The 28 performances apparently differ most with respect to the skewness values during the first 16 bars of the piece (parts A1, A2, A1, A2). The second most important distinction between pianists is characterized by the second component. This component compares skewness for the A-parts with the values in B1 and B2. The third component essentially
Figure 8.2 Schumann's Träumerei: screeplot for skewness.
compares the first with the second half. Finally, the fourth component essentially compares the odd with the even numbered parts, excluding the end A1, A2. Components two to five are displayed in Figure 8.4, with z2 and z3 on the x- and y-axis respectively and rectangles representing z4 and z5. Note in particular that Cortot and Horowitz mainly differ with respect to the third principal component: Horowitz has a more extreme difference in skewness between the first and second halves of the piece. Also striking are the outliers Brendel, Ortiz, and Gianoli. The overall skewness, as represented by the first component, is quite extreme for Brendel and Ortiz. For comparison, their tempo curves are plotted in Figure 8.5 together with Cortot's and Horowitz's first performances. In view of the PCA one may now indeed see that in the tempo curves of Brendel and Ortiz there is a strong contrast between small tempo variations applied most of the time and occasional strong local ritardandi.
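A block-wise computation of this kind can be sketched as follows, assuming the quartile-based skewness measure (mean − median)/(Q2 − Q1) used above; the simulated tempo curve is invented for the illustration:

```python
import numpy as np

def quartile_skewness(x):
    """(mean - median) / interquartile range; strongly negative values
    arise, e.g., when a few extreme ritardandi drag the mean down."""
    q1, med, q2 = np.quantile(x, [0.25, 0.5, 0.75])
    return (np.mean(x) - med) / (q2 - q1)

rng = np.random.default_rng(1)
tempo = rng.normal(loc=60, scale=5, size=212)  # one simulated tempo curve
blocks = np.array_split(tempo, 8)              # 8 parts of the piece
skew = np.array([quartile_skewness(b) for b in blocks])
# skew is one row of the 28 x 8 matrix that enters the PCA
```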

Figure 8.3 Schumann's Träumerei: loadings for PCA of skewness (four panels: loadings of the first, second, third, and fourth PCA-component, plotted against the parts A1, A2, A1, A2, B1, B2, A1, A2).
Figure 8.4 Schumann's Träumerei: symbol plot of principal components z2, ..., z5 for PCA of tempo skewness.

Figure 8.5 Schumann's Träumerei: tempo curves by Cortot, Horowitz, Brendel, and Gianoli.

8.3.2 PCA of entropies

Consider the entropy measures E1, E2, E3, E4, E9, and E10 defined in Chapter 3. We ask the following question: is there a combination of entropy measures that enables us to distinguish computationally between various styles of composition? The following compositions are included in the study: Henry Purcell: 2 Airs (Figure 8.6), Hornpipe; J.S. Bach: first movements of Cello Suites No. 1-6, Prelude and Fugue No. 1 and 8 from Das Wohltemperierte Klavier; W.A. Mozart: KV 1e, 331/1, 545/1; R. Schumann: op. 15, No. 2, 3, 4, 7; op. 68, No. 2, 16; A. Scriabin: op. 51, No. 2, 4; F. Martin: Préludes No. 6, 7 (cf. Figures 8.11, 8.12). For each composition, we define the vector x = (x_1, ..., x_6)^t = (E1, E2, E3, E4, E9, E10)^t. The results of the PCA are displayed in Figures 8.7 through 8.10. The first principal component mainly consists of an average of the first four components and a comparison with E10 (Figure 8.8). The second component essentially includes a comparison between E9 and E10, whereas the third component is mainly a weighted average of E2, E9, and E10. Finally, the fourth component compares E2, E3 with E1. According to the screeplot (Figure 8.7), the first three components already explain more than 95% of the variability. Scatterplots of the first three components (Figures 8.9 and 8.10) together with symbols representing the next two components show a


clear clustering. For clarity, only three different names (Purcell, Bach, and Schumann) are written explicitly in the plots. Schumann turns out to be completely separated from Bach. Moreover, Purcell appears to be somewhat outside the regions of Bach and Schumann, in particular in Figure 8.10. In conclusion, entropies, as defined above, do indeed seem to capture certain features of a composer's style.


Figure 8.6 Air by Henry Purcell (1659-1695).


Figure 8.7 Screeplot for PCA of entropies.

Figure 8.8 Loadings for PCA of entropies.


Figure 8.9 Entropies: symbol plot of the first four principal components (second vs. first principal component; rectangles with width = 3rd comp., height = 4th comp.).

Figure 8.10 Entropies: symbol plot of principal components no. 2-5 (third vs. second principal component; rectangles with width = 4th comp., height = 5th comp.).

Figure 8.11 F. Martin (1890-1971). (Courtesy of the Société Frank Martin and Mrs. Maria Martin.)

Figure 8.12 F. Martin (1890-1971) - manuscript from 8 Préludes. (Courtesy of the Société Frank Martin and Mrs. Maria Martin.)

CHAPTER 9

Discriminant analysis
9.1 Musical motivation
Discriminant analysis, often also referred to under the more general notion
of pattern recognition, answers the question of which category an observed
item is most likely to belong to. A typical application in music is attribution
of an anonymous composition to a time period or even to a composer.
Other examples are discussed below. A prerequisite for the application of
discriminant analysis is that a training data set is available where the
correct answers are known. We give a brief introduction to basic principles
of discriminant analysis. For a detailed account see e.g. Mardia et al. (1979), Klecka (1980), Breiman (1984), Seber (1984), Fukunaga (1990), McLachlan (1992), Huberty (1994), Ripley (1995), Duda et al. (2000), and Hastie et al. (2001).
9.2 Basic principles
9.2.1 Allocation rules
Suppose that an observation x ∈ R^k is known to belong to one of p mutually exclusive categories G_1, G_2, ..., G_p. Associated with each category is a probability density f_i(x) of X on R^k. This means that if an individual comes from group i, then the individual's random vector X has the probability distribution f_i. The problem addressed by discriminant analysis is as follows: observe X = x, and try to guess which group the observation comes from. The aim is, of course, to make as few mistakes as possible. In probability terms this amounts to minimizing the probability of misclassification.

The solution is defined by a classification rule. A classification rule is a division of R^k into p disjoint regions: R^k = R_1 ∪ R_2 ∪ ... ∪ R_p, R_i ∩ R_j = ∅ (i ≠ j). The rule allocates an observation to group G_i if x ∈ R_i. More generally, we may define a randomized rule by allocating an observation to group G_i with probability φ_i(x), where $\sum_{i=1}^p \phi_i(x) = 1$ for every x. The advantage of allowing random allocation is that discriminant rules can be averaged and the set of all random rules is convex, thus making it possible to find optimal rules. Note that deterministic rules are a special case, obtained by setting φ_i(x) = 1 if x ∈ R_i and 0 otherwise.


9.2.2 Case I: Known population distributions

Discriminant analysis without prior group probabilities: the ML-rule

Assume that it is not known a priori which of the groups is more likely to occur; however, for each group the distribution f_i is known exactly. This case is mainly of theoretical interest; it does, however, illustrate the essential ideas of discriminant analysis.

A plausible discriminant rule is the Maximum Likelihood Rule (ML-rule): allocate x to group G_i, if

$$f_i(x) = \max_{j=1,\dots,p} f_j(x) \qquad (9.1)$$

If the maximum is reached for several groups, then x is considered to be in the union of these (for continuous distributions this occurs with probability zero). In the case of two groups the ML-rule means that x is allocated to G_1, if f_1(x) > f_2(x), or, equivalently,

$$\log\frac{f_1(x)}{f_2(x)} > 0 \qquad (9.2)$$

In the case where all probability densities are normal with equal covariance matrices we have:

Theorem 23 Suppose that each f_i is a multivariate normal distribution with expected value μ_i and covariance matrix Σ_i. Suppose further that Σ_1 = Σ_2 = ... = Σ_p = Σ and det Σ > 0. Then the ML-rule is given as follows: allocate x to group G_i, if

$$(x-\mu_i)^t\Sigma^{-1}(x-\mu_i) = \min_{j=1,\dots,p}(x-\mu_j)^t\Sigma^{-1}(x-\mu_j) \qquad (9.3)$$

Note that the Mahalanobis distance $d_i = (x-\mu_i)^t\Sigma^{-1}(x-\mu_i)$ measures how far x is from the expected value μ_i, while taking into account covariances between the components of the random vector X = (X_1, ..., X_p)^t. In particular, for p = 2, x is allocated to G_1, if

$$a^t\left(x - \frac{1}{2}(\mu_1+\mu_2)\right) > 0 \qquad (9.4)$$

where $a = \Sigma^{-1}(\mu_1-\mu_2)$. Thus, we obtain a linear rule where x is compared with the midpoint between μ_1 and μ_2.
Discriminant analysis with prior group probabilities: the Bayesian rule

Sometimes one has a priori knowledge (or belief) about how likely each of the groups is to occur. Thus, it is assumed that we know the probabilities

$$\pi_i = P(\text{observation drawn from group } G_i) \quad (i = 1, \dots, p) \qquad (9.5)$$

where $0 \le \pi_i \le 1$ and $\sum \pi_i = 1$. The conditional likelihood that the observation comes from group G_i given the observed value X = x is proportional to π_i f_i(x). The natural rule is then the Bayes rule: allocate x to G_i, if

$$\pi_i f_i(x) = \max_{j=1,\dots,p} \pi_j f_j(x) \qquad (9.6)$$

For the noninformative prior π_1 = π_2 = ... = π_p = 1/p, representing complete lack of knowledge about which groups observations are more likely to come from, the Bayes rule coincides with the ML-rule. In the case of two groups, the Bayes rule is a simple modification of the ML-rule, since x is allocated to G_1, if

$$\log\frac{f_1(x)}{f_2(x)} > \log\frac{\pi_2}{\pi_1} \qquad (9.7)$$
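For two normal groups with common covariance matrix, the Bayes rule (9.7) differs from the ML-rule (9.4) only by the threshold log(π2/π1). A minimal sketch (means, covariance matrix, and priors are invented for the example):

```python
import numpy as np

mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])               # common covariance matrix
a = np.linalg.solve(sigma, mu1 - mu2)        # a = Sigma^{-1} (mu1 - mu2)

def allocate(x, pi1=0.5, pi2=0.5):
    """Allocate x to G1 iff a^t (x - (mu1 + mu2)/2) > log(pi2/pi1).
    With the noninformative prior pi1 = pi2 this is the ML-rule."""
    return 1 if a @ (x - (mu1 + mu2) / 2) > np.log(pi2 / pi1) else 2

mid = (mu1 + mu2) / 2       # lies on the ML decision boundary ...
# ... but a strong prior for G2 pushes it into group 2
print(allocate(mu1), allocate(mu2), allocate(mid, pi1=0.1, pi2=0.9))
# prints: 1 2 2
```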
Which rule is better?

The quality of a rule is judged by the probability of correct classification (or misclassification). There are two standard ways of comparing classification rules: a) comparison of individual probabilities of correct classification; and b) comparison of the overall probability of correct classification.

The first criterion can be understood as follows: for a random allocation rule with probabilities φ_i(·), the probability that a randomly chosen individual coming from group G_i is classified into group G_j is equal to

$$p_{ji} = \int \phi_j(x) f_i(x)\,dx \qquad (9.8)$$

Thus, correct classification for individuals from group G_i occurs with probability p_ii and misclassification with probability 1 − p_ii. A rule r* with correct-classification probabilities p*_ii is said to be at least as good as a rule r with probabilities p_ii, if p*_ii ≥ p_ii for all i. If there is at least one ">" sign, then r* is better. If there is no better rule than r, then r is called


admissible.

Consider now a Bayes rule r with probabilities p_ii. Is there any better rule than r? Suppose that a rule r* is better. Then

$$\sum_i \pi_i p_{ii} < \sum_i \pi_i p^*_{ii}.$$

On the other hand,

$$\sum_i \pi_i p^*_{ii} = \sum_i \pi_i \int \phi^*_i(x) f_i(x)\,dx \le \int \sum_i \phi^*_i(x)\,\max_j\{\pi_j f_j(x)\}\,dx = \int \max_j\{\pi_j f_j(x)\}\,dx.$$

Since r is a Bayes rule, we have

$$\max_j\{\pi_j f_j(x)\} = \sum_i \phi_i(x)\,\pi_i f_i(x),$$

so that finally

$$\int \max_j\{\pi_j f_j(x)\}\,dx = \sum_i \int \phi_i(x)\,\pi_i f_i(x)\,dx = \sum_i \pi_i p_{ii} \ge \sum_i \pi_i p^*_{ii},$$

which contradicts the first inequality. The conclusion is therefore that every Bayes rule is optimal in the sense that it is admissible. If there are no a priori probabilities π_i, or more exactly if the noninformative prior is used, then this means that the ML-rule is optimal.
The second criterion is applicable if a priori probabilities are available: the probability of correct allocation is

$$p_{correct} = \sum_{i=1}^p \pi_i p_{ii} = \sum_{i=1}^p \pi_i \int \phi_i(x) f_i(x)\,dx \qquad (9.9)$$

A rule is optimal if p_correct is maximal. In contrast to admissibility, all rules can be ordered according to "classification correctness". As before, it can be shown that the Bayes rule is optimal.

Both criteria can be generalized to the case where misclassification is associated with costs that may differ for different groups.
9.2.3 Case II: Population distribution form known, parameters unknown

Suppose that each f_i is known, except for a finite dimensional parameter vector θ_i. Then the rules above can be adopted accordingly, replacing parameters by their estimates. The ML-rule is then: allocate x to G_i, if

$$f_i(x; \hat\theta_i) = \max_{j=1,\dots,p} f_j(x; \hat\theta_j) \qquad (9.10)$$

The Bayes rule allocates x to G_i, if

$$\pi_i f_i(x; \hat\theta_i) = \max_{j=1,\dots,p} \pi_j f_j(x; \hat\theta_j) \qquad (9.11)$$

The rule becomes particularly simple if the f_i are normal with unknown means μ_i and equal covariance matrices Σ_1 = Σ_2 = ... = Σ. Let $\bar x_i$ be the sample mean and $\hat\Sigma_i$ the sample covariance matrix for observations from group G_i. Estimating the common covariance matrix by

$$\hat\Sigma = (n_1\hat\Sigma_1 + n_2\hat\Sigma_2 + \dots + n_p\hat\Sigma_p)/(n-p)$$

where n_i is the number of observations from G_i and n = n_1 + ... + n_p, the ML-rule allocates x to G_i, if

$$(x-\bar x_i)^t\hat\Sigma^{-1}(x-\bar x_i) = \min_{j=1,\dots,p}(x-\bar x_j)^t\hat\Sigma^{-1}(x-\bar x_j) \qquad (9.12)$$

For two groups, we have the linear ML-rule

$$\hat a^t\left(x - \frac{1}{2}(\bar x_1+\bar x_2)\right) > 0 \qquad (9.13)$$

where $\hat a = \hat\Sigma^{-1}(\bar x_1-\bar x_2)$, and the corresponding Bayes rule

$$\hat a^t\left(x - \frac{1}{2}(\bar x_1+\bar x_2)\right) > \log\frac{\pi_2}{\pi_1} \qquad (9.14)$$

It should be emphasized here that while a linear discriminant rule is meaningful for the normal distribution, this may not be so for other distributions. For instance, if for G_1 a one-dimensional random variable X is observed with a uniform distribution on [−1, 1] and for G_2 the variable X is uniformly distributed on [−3, −2] ∪ [2, 3], then the two groups can be distinguished perfectly, however not by a linear rule.
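This counterexample is easy to verify numerically; the sample sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
g1 = rng.uniform(-1, 1, size=500)            # G1: uniform on [-1, 1]
# G2: uniform on [-3, -2] union [2, 3]
g2 = rng.uniform(2, 3, size=500) * rng.choice([-1, 1], size=500)

# the nonlinear rule |x| < 1.5 classifies every observation correctly
assert np.all(np.abs(g1) < 1.5) and np.all(np.abs(g2) > 1.5)

# but no linear rule x > c works: one half of G2 lies below all of G1,
# the other half above all of G1
assert g2.min() < g1.min() < g1.max() < g2.max()
```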
9.2.4 Case III: Population distributions completely unknown

If the population distributions f_i are completely unknown, then the search for reasonable rules is more difficult. In recent literature, some rules based on nonparametric estimation or suitable projection techniques have been proposed (see e.g. Friedman 1977, Breiman 1984, Hastie et al. 1994, Polzehl 1995, Ripley 1995, Duda et al. 2000, Hand et al. 2001).

The simplest, and historically most important, rule is based on Fisher's linear discriminant function. Fisher postulated that a linear rule may often be reasonable (see however the remark in Section 9.2.3 why this need not always be so). He proposed to find a vector a such that the linear function a^t x maximizes the ratio between the variability between groups and the variability within the groups. More specifically, define
$$X_{n\times p} = X$$

to be the n × p matrix where each row i corresponds to an observed vector x_i = (x_{i1}, ..., x_{ip})^t. We denote the columns of X by x_{(j)} (j = 1, ..., p). The rows are assumed to be ordered according to groups, i.e. rows 1 to n_1 are observations from G_1, rows n_1 + 1 through n_1 + n_2 are from G_2, and so on. Moreover, define the matrix

$$M_{n\times n} = M = I - n^{-1}\,\mathbf{1}\mathbf{1}^t$$

where I is the identity matrix and $\mathbf{1} = (1, ..., 1)^t$. We denote the submatrices of X and M that belong to the different groups by $X^{(j)}_{n_j\times p} = X^{(j)}$ and $M^{(j)}_{n_j\times n_j} = M^{(j)}$ respectively. The corresponding subvectors of y = (y_1, ..., y_n)^t are denoted by y^{(j)}. Then the variability of the vector y = Xa, defined by

$$SST = \sum_{i=1}^n (y_i-\bar y)^2 = y^t M y = a^t X^t M X a \qquad (9.15)$$

can be written as

$$SST = SST_{within} + SST_{between} \qquad (9.16)$$

where

$$SST_{within} = \sum_{j=1}^p \sum_{i=1}^{n_j} (y_i^{(j)}-\bar y^{(j)})^2 = a^t W a \qquad (9.17)$$

and

$$SST_{between} = \sum_{j=1}^p n_j (\bar y^{(j)}-\bar y)^2 = a^t B a \qquad (9.18)$$

Here,

$$W = \sum_{j=1}^p n_j S_j = \sum_{j=1}^p [X^{(j)}]^t M^{(j)} X^{(j)}$$

is the "within groups" matrix and

$$B = \sum_{j=1}^p n_j (\bar x^{(j)}-\bar x)(\bar x^{(j)}-\bar x)^t$$

the "between groups" matrix; S_j is the sample covariance matrix of the observations $x_i^{(j)}$ from group G_j, $\bar y = n^{-1}\sum_{j=1}^p\sum_{i=1}^{n_j} y_i^{(j)}$ is the overall mean, $\bar y^{(j)} = n_j^{-1}\sum_i y_i^{(j)}$ the mean in group G_j, and $\bar x^{(j)}$ and $\bar x$ are the corresponding (vector) means for x. Fisher's linear discriminant function (or first canonical variate) is the linear function a^t x where a maximizes the ratio

$$Q(a) = \frac{SST_{between}}{SST_{within}} = \frac{a^t B a}{a^t W a} \qquad (9.19)$$

The solution is given by

Theorem 24 Let a be the eigenvector of $W^{-1}B$ that corresponds to the largest eigenvalue. Then Q(a) is maximal.

The classification rule is then: allocate x to G_i, if

$$|a^t x - a^t \bar x^{(i)}| = \min_{j=1,\dots,p} |a^t x - a^t \bar x^{(j)}| \qquad (9.20)$$

If there are only p = 2 groups, then

$$B = \frac{n_1 n_2}{n}(\bar x^{(1)}-\bar x^{(2)})(\bar x^{(1)}-\bar x^{(2)})^t$$

has rank 1 and the only non-zero eigenvalue is

$$\mathrm{tr}(W^{-1}B) = \frac{n_1 n_2}{n}(\bar x^{(1)}-\bar x^{(2)})^t W^{-1}(\bar x^{(1)}-\bar x^{(2)})$$

with eigenvector $a = W^{-1}(\bar x^{(1)}-\bar x^{(2)})$. The discriminant rule then becomes the same as the ML-rule for normal distributions with equal covariance matrices: allocate x to G_1, if

$$(\bar x^{(1)}-\bar x^{(2)})^t W^{-1}\left(x - \frac{1}{2}(\bar x^{(1)}+\bar x^{(2)})\right) > 0 \qquad (9.21)$$

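Theorem 24 translates directly into a few lines of linear algebra: compute W and B from the grouped data and take the leading eigenvector of W^{-1}B. A sketch on two simulated groups (data and names invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=(40, 2))                          # group 1, mean (0, 0)
x2 = rng.normal(size=(60, 2)) + np.array([3.0, 1.0])   # group 2, shifted mean
groups = [x1, x2]

xbar = np.vstack(groups).mean(axis=0)                  # overall mean
# within-groups matrix W = sum_j n_j S_j and between-groups matrix B
W = sum(len(g) * np.cov(g, rowvar=False, bias=True) for g in groups)
B = sum(len(g) * np.outer(g.mean(axis=0) - xbar, g.mean(axis=0) - xbar)
        for g in groups)

# leading eigenvector of W^{-1} B maximizes Q(a) = a^t B a / a^t W a
eigval, eigvec = np.linalg.eig(np.linalg.solve(W, B))
a = np.real(eigvec[:, np.argmax(np.real(eigval))])

def allocate(x):
    """Nearest projected group mean, as in the classification rule above."""
    return int(np.argmin([abs(a @ x - a @ g.mean(axis=0)) for g in groups]))
```

For two groups, the direction found this way coincides, up to scaling, with W^{-1} applied to the difference of the group mean vectors.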
9.2.5 How good is an empirical discriminant rule?


Figure 9.1 Discriminant analysis combined with time series analysis can be used to judge purity of intonation (Elvira by J.B.).

If the densities f_i are not known, then the classification rule as well as the probabilities p_ii of correct classification must be estimated from the given data. In principle this is easy, since the corresponding estimates can simply
be plugged into the formula for p_ii. The observed data that are used for estimation are also called the "training sample". A problem with these estimates is, however, that the search for the "optimal" discriminant rule was done with the same data. Therefore, the estimated p_11 will tend to be too optimistic (i.e. too large), unless n is very large. The same is true for any method that estimates classification probabilities from the training data. A possibility to avoid this is to partition the data set randomly into a training sample that is used for estimation of the discriminant rule, and a disjoint "validation sample" that is used for estimation of classification probabilities. Obviously, this can only be done for large enough data sets. For recently developed computational methods of validation, such as the bootstrap, see e.g. Efron (1979), Läuter (1985), Fukunaga (1990), Hirst (1996), LeBlanc and Tibshirani (1996), Davison and Hinkley (1997), Chernick (1999), Good (2001).
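The training/validation split can be sketched as follows; the data and the simple midpoint rule are invented for the illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
# simulated labelled data: one-dimensional feature, two groups
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)])
y = np.array([1] * 100 + [2] * 100)

perm = rng.permutation(len(x))        # random disjoint partition
train, valid = perm[:100], perm[100:]

# estimate a linear rule on the training sample only:
# cut at the midpoint between the two estimated group means
m1 = x[train][y[train] == 1].mean()
m2 = x[train][y[train] == 2].mean()
cut = (m1 + m2) / 2

# estimate the probability of correct classification on the
# validation sample, which was not used to choose the rule
pred = np.where(x[valid] < cut, 1, 2)
p_correct = np.mean(pred == y[valid])
```

Unlike the apparent error rate computed on the training sample itself, this estimate is not systematically optimistic.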
9.3 Specific applications in music

9.3.1 Identification of pitch, tone separation, and purity of intonation
Weihs et al. (2001) investigate objective criteria for judging purity of intonation of singing. The acoustic data are as described in Chapter 4. In order to address the question of how to computationally assess purity of intonation, a vocal expert classified 132 selected tones of 17 performances (Figure 9.1) of Händel's "Tochter Zion" into the classes "flat", "correct", and "sharp". The opinion of the expert is assumed to be the "truth". An objective measure of purity is defined by

$$\Delta = \log_{2^{1/12}}(f_{observed}) - \log_{2^{1/12}}(f_0)$$

where f_0 is the correct basic frequency, corresponding to the note in the score and adjusted to the tuning of the accompanying piano, and f_observed is the actually measured frequency; since one halftone corresponds to the frequency ratio 2^{1/12}, Δ measures the deviation in halftones. Maximum likelihood discriminant analysis leads to the following classification rule: the maximal permissible error which is accepted in order to classify a tone as "correct" is about 0.4 halftones below and above the target tone. Note that this is much higher than the 0.03 halftones which is the minimal distance between frequencies a trained ear can distinguish in principle (see Pierce 1992). If a note is considered incorrect by an expert, then the estimated probability of being nevertheless classified as "correct" by the discriminant rule turns out to be 0.174. This rather high error rate may be due to several causes. "Purity of intonation" is a phenomenon that probably depends on more than just the basic frequency. Possible factors are, for instance, amount of vibrato, loudness, pitch, context (e.g. previous and subsequent notes), timbre, etc. Thus, more variables that characterize the sound may have to be incorporated, in addition to Δ, in order to define a musically meaningful notion of purity of intonation.
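Assuming the equal-tempered halftone (frequency ratio 2^{1/12}), the purity measure and the 0.4-halftone rule described above can be sketched as follows; the example frequencies are invented:

```python
import math

def halftone_deviation(f_observed, f_target):
    """Deviation in halftones: 12 * log2 of the frequency ratio."""
    return 12 * math.log2(f_observed / f_target)

def classify(delta, tol=0.4):
    """Tones within 0.4 halftones of the target count as correct."""
    if delta < -tol:
        return "flat"
    if delta > tol:
        return "sharp"
    return "correct"

# invented example: target a' = 440 Hz, sung slightly sharp at 452 Hz
delta = halftone_deviation(452.0, 440.0)   # about 0.47 halftones
verdict = classify(delta)                  # -> "sharp"
```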
9.3.2 Identification of historic periods

For a composition, consider notes modulo octave, with 0 being set equal to the most frequent note (which we will also call the "basic tone"). The relative frequencies of the notes 0, ..., 11 are denoted by p_0, ..., p_11. We then set x_1 = p_5. Note that, if 0 is the root of the tonic triad, then 5 is the root of the subdominant. Moreover, we define

$$x_2 = E = -\sum_i \log(p_i + 0.001)\,p_i$$

which is a slightly modified measure of entropy. We now describe each composition by a bivariate observation

$$x = (p_5, E)^t.$$

The question is now whether this very simple 2-dimensional descriptive statistic can tell us anything about the time when the music was composed. In view of the somewhat naive simplicity of x, the answer is not at all obvious.
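Computing the descriptor x = (p_5, E) from a note sequence can be sketched as follows (the toy note list and the helper name are invented; pitch classes are counted relative to the most frequent note, as described above):

```python
import numpy as np

def descriptor(notes):
    """Return (p5, E) for a sequence of pitches (e.g. MIDI numbers)."""
    pcs = np.asarray(notes) % 12                       # notes modulo octave
    basic = np.argmax(np.bincount(pcs, minlength=12))  # most frequent note
    counts = np.bincount((pcs - basic) % 12, minlength=12)
    p = counts / counts.sum()                          # relative frequencies
    entropy = -np.sum(np.log(p + 0.001) * p)           # modified entropy
    return p[5], entropy                               # p5: subdominant root

# toy example: a C major scale fragment (invented)
p5, E = descriptor([60, 62, 64, 65, 67, 69, 71, 72, 60, 67, 60])
```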
To simplify the problem, composers are divided into two groups: Group 1 = composers who died before 1800, and Group 2 = composers who died after 1800 (or are still alive). Essentially, the two groups correspond to the partition into "early music to baroque" and "classical till today". The compositions considered here are those given in the star plot example (Section 2.7.2). In order to be able to check objectively how the procedure works, only a subset of n = 94 compositions is used for estimation. Applying a linear discriminant rule partitions the plane into two half planes by

Figure 9.2 Linear discriminant analysis of compositions before and after 1800, with the training sample (x-axis: entropy, y-axis: P(Subdominant)). The data used for the discriminant rule consists of x = (p_5, E).

a straight line. Figure 9.2 shows the estimated partitioning line together with the training sample (o = before 1800, x = after 1800). Apparently, the two groups can indeed be separated quite well by the estimated straight line. This is quite surprising, given the simplicity of the two variables. As expected, however, the partition is not perfect, and it does not seem to be possible to improve it by more complicated partitioning lines. To assess how well the rule may indeed classify, we consider 50 other compositions that were not used for estimating the discriminant rule. Figure 9.3 shows that the rule works well, since almost all observations in the validation sample are classified correctly. An unusual composition is Bartók's Bagatelle No. 3 which lies far on the left in the "wrong" group.

The partitioning can be improved if the time periods of the two groups are chosen farther apart. This is done in Figures 9.4 and 9.5 with Group 1 = "Early Music to Baroque" and Group 2 = "Romantic to 20th century". (A beautiful example of early music is displayed in Figure 9.6; also see Figures 9.7 and 9.8 for portraits of Brahms and Wagner.) Figure 9.4 shows the corresponding plot of the partition together with the data (n = 72). Compositions not used in the estimation are shown in Figure 9.5. Again, the rule works well, except for Bartók's third Bagatelle.

Figure 9.3 Linear discriminant analysis of compositions before and after 1800, with the validation sample (x-axis: entropy, y-axis: P(Subdominant)). The data used for the discriminant rule consists of x = (p_5, E).

Figure 9.4 Linear discriminant analysis of "Early Music to Baroque" and "Romantic to 20th Century" (x-axis: entropy, y-axis: P(Subdominant)). The points belong to the training sample. The data used for the discriminant rule consists of x = (p_5, E).

Figure 9.5 Linear discriminant analysis of "Early Music to Baroque" and "Romantic to 20th century" (x-axis: entropy, y-axis: P(Subdominant)). The points belong to the validation sample. The data used for the discriminant rule consists of x = (p_5, E).

Figure 9.6 Graduale written for an Augustinian monastery of the diocese Konstanz, 13th century. (Courtesy of Zentralbibliothek Zürich.) (Color figures follow page 152.)

Figure 9.7 Johannes Brahms (1833-1897). (Photograph by Maria Fellinger, courtesy of Zentralbibliothek Zürich.)

Figure 9.8 Richard Wagner (1813-1883). (Engraving by J. Bankel after a painting by C. Jäger, courtesy of Zentralbibliothek Zürich.)

CHAPTER 10

Cluster analysis
10.1 Musical motivation
In discriminant analysis, an optimal allocation rule between different groups is estimated from a training sample. The type and number of groups are known. In some situations, however, it is neither known whether the data can be divided into homogeneous subgroups nor how many subgroups there may be. How to find such clusters in previously ungrouped data is the purpose of cluster analysis. In music, one may for instance be interested in how far compositions or performances can be grouped into clusters representing different "styles". In this chapter, a brief introduction to basic principles of statistical cluster analysis is given. For an extended account of cluster analysis see e.g. Jardine and Sibson (1971), Anderberg (1973), Hartigan (1978), Mardia et al. (1979), Seber (1984), Blashfield et al. (1985), Hand (1986), Fukunaga (1990), Arabie et al. (1996), Gordon (1999), Höppner et al. (1999), Everitt et al. (2001), Jajuga et al. (2002), Webb (2002).
10.2 Basic principles
10.2.1 Maximum likelihood classification

Suppose that observations x_1, ..., x_n ∈ R^k are realizations of n independent random variables X_i (i = 1, ..., n). Assume further that each random variable comes from one of p possible groups such that if X_i comes from group j, then it is distributed according to a probability density f(x; θ_j). In contrast to discriminant analysis, it is not observed which groups the x_i (i = 1, ..., n) belong to. Each observation x_i is thus associated with an unobserved parameter (or label) ψ_i specifying group membership. We may simply define ψ_i = j if x_i belongs to group j. Denote by ψ = (ψ_1, ..., ψ_n)^t the vector of labels and, for each j = 1, ..., p, let A_j = {x_i : 1 ≤ i ≤ n, ψ_i = j} be the unknown set of observations that belong to group j. Then the likelihood function of the observed data is

$$L(x_1,\dots,x_n; \theta_1,\dots,\theta_p, \psi_1,\dots,\psi_n) = \prod_{j=1}^p \prod_{x_i \in A_j} f(x_i; \theta_j) \qquad (10.1)$$

Maximizing L with respect to the unknown parameters θ_1, ..., θ_p and ψ_1, ..., ψ_n, we obtain ML-estimates $\hat\theta_1,\dots,\hat\theta_p,\hat\psi_1,\dots,\hat\psi_n$ and estimated sets $\hat A_1,\dots,\hat A_p$. Denoting by m the dimension of θ_j, the number of estimated parameters is p·m + n. This is larger than the number of observations. It can therefore not be expected that all parameters are estimated consistently. Nevertheless, the ML-estimate provides a classification rule due to the following property: suppose that we change one of the $\hat A_j$'s by removing an observation $x_{i_0}$ from $\hat A_j$ and putting it into another set $\hat A_l$ (l ≠ j). Then the likelihood can at most become smaller. The new likelihood is obtained from the old one by dividing by $f(x_{i_0}; \hat\theta_j)$ and multiplying by $f(x_{i_0}; \hat\theta_l)$. We therefore have the following property:

$$L(x_1,\dots,x_n; \hat\theta_1,\dots,\hat\theta_p, \hat\psi_1,\dots,\hat\psi_n) \ge L(x_1,\dots,x_n; \hat\theta_1,\dots,\hat\theta_p, \hat\psi_1,\dots,\hat\psi_n)\,\frac{f(x_{i_0}; \hat\theta_l)}{f(x_{i_0}; \hat\theta_j)} \qquad (10.2)$$

or, dividing by L (assuming that it is not zero),

$$f(x; \hat\theta_j) \ge f(x; \hat\theta_l) \quad \text{for } x \in \hat A_j \qquad (10.3)$$

This is identical with the ML-allocation rule in discriminant analysis. The only, but essential, difference here is that ψ is unknown, i.e. our sample (training data) gives us only information about the distribution of X but not about ψ. This makes the task much more difficult. In particular, since the number of unknown parameters is too large in general, maximum likelihood clustering can not only be computationally difficult but its asymptotic performance may not stabilize sufficiently. In special cases, however, a simple method can be obtained. Suppose, for instance, that the distributions in the groups are multivariate normal with means μ_j and covariance matrices Σ_j. Then the ML-estimates of these parameters, given ψ, are the group sample means

$$\bar x_j(\psi) = \frac{1}{n_j(\psi)} \sum_{i \in A_j(\psi)} x_i$$

and group sample covariance matrices

$$\hat\Sigma_j(\psi) = \frac{1}{n_j(\psi)} \sum_{i \in A_j(\psi)} (x_i - \bar x_j(\psi))(x_i - \bar x_j(\psi))^t$$

respectively. The log-likelihood function then reduces to a constant minus $\frac{1}{2}\sum_{j=1}^p n_j \log|\hat\Sigma_j|$. Maximization with respect to ψ leads to the estimate

$$\hat\psi = \arg\min h(\psi) \qquad (10.4)$$

where

$$h(\psi) = \prod_{j=1}^p |\hat\Sigma_j(\psi)|^{n_j(\psi)} \qquad (10.5)$$

Computationally this means that the function h(ψ) is evaluated for all groupings of the observations x_1, ..., x_n, and the estimate $\hat\psi$ is the grouping that minimizes h(ψ). Clearly, this is a computationally demanding task. A simpler rule is obtained if we assume that all covariance matrices are equal to a common covariance matrix Σ. Then

$$\hat\psi = \arg\min |\hat\Sigma| = \arg\min \Big|n^{-1}\sum_{j=1}^p n_j\hat\Sigma_j\Big| = \arg\min \Big|\sum_{j=1}^p n_j\hat\Sigma_j\Big| \qquad (10.6)$$

Even in this simplified form, finding the best clustering is computationally demanding. For instance, if the data have to be divided into two groups, then the number of possible assignments for which $\sum_{j=1}^2 n_j\hat\Sigma_j$ may differ is equal to $2^{n-1}$. In addition, if the number of groups is not known a priori, then a suitable, and usually computationally costly, method for estimating p must be applied. As a matter of principle, it should also be noted that if normal distributions or any other distributions with overlapping domains are assumed, then there are no "perfect" clusters. Even if the distributions were known, an observation x can, with positive probability, be from any group with f_i(x) > 0, so that one can never be absolutely sure where it "belongs".

A variation of ML-clustering is obtained if the groups themselves are associated with probabilities. Let π_j be the probability that a randomly sampled observation comes from group j. In analogy to the arguments above, maximization of the likelihood with respect to all parameters including π_j (j = 1, ..., p) leads to a Bayesian allocation rule with $\hat\pi_j$ as prior distribution.
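For very small n, the criterion (10.6) can be minimized by brute force over all 2^{n−1} two-group assignments, as in this sketch (toy data invented; for realistic sample sizes iterative relocation algorithms are needed instead):

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
# toy data: two well-separated simulated clusters in R^2, n = 10
x = np.vstack([rng.normal(0.0, 0.3, (5, 2)),
               rng.normal(3.0, 0.3, (5, 2))])
n = len(x)

def criterion(labels):
    """det of sum_j n_j Sigma_j, the equal-covariance criterion (10.6)."""
    total = np.zeros((2, 2))
    for j in (0, 1):
        xj = x[labels == j]
        total += len(xj) * np.cov(xj, rowvar=False, bias=True)
    return np.linalg.det(total)

best, best_val = None, np.inf
# fixing observation 0 in group 0 leaves 2^(n-1) candidate assignments
for bits in itertools.product([0, 1], repeat=n - 1):
    labels = np.array((0,) + bits)
    if 0 < labels.sum() < n:                     # both groups non-empty
        val = criterion(labels)
        if val < best_val:
            best, best_val = labels, val
# best now holds the grouping that minimizes the determinant criterion
```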
10.2.2 Hierarchical clustering
ML-clustering yields a partition of observations into p groups. Sometimes it is desirable to obtain a sequence of clusters, e.g. starting with two main groups and then subdividing these into increasingly homogeneous clusters. This is particularly suitable for data where a hierarchy is expected - such as, for instance, in music. Generally speaking, a hierarchical method has the following property: a partitioning into p + 1 clusters consists of
- two clusters whose union is equal to one of the clusters from the partitioning into p groups, and
- p - 1 clusters that are identical with p - 1 clusters of the partitioning into p groups.
In a first step, data are transformed into a matrix D = (d_{ij})_{i,j=1,...,n} of distances or a matrix S = (s_{ij})_{i,j=1,...,n} of similarities. The definition of distance and similarity used in cluster analysis is more general than the usual definition of a metric:
Definition 54 Let X be an arbitrary set and d : X \times X \to R a real valued function such that for all x, y \in X


D1. d(x, y) = d(y, x)
D2. d(x, y) \geq 0
D3. d(x, x) = 0
Then d is called a distance. If in addition we also have
D4. d(x, y) = 0 \Leftrightarrow x = y
D5. d(x, z) \leq d(x, y) + d(y, z) (triangle inequality),
then d is a metric.
A measure of similarity is usually assumed to have the following properties:
Definition 55 Let X be an arbitrary set and s : X \times X \to R a real valued function such that for all x, y \in X
S1. s(x, y) = s(y, x)
S2. s(x, y) > 0
S3. s(x, y) increases with increasing similarity.
Then s is called a measure of similarity.
Axiom S3 is of course somewhat subjective, since it depends on what is meant exactly by similarity. Table 10.1 gives examples of distances and measures of similarity.
Suppose now that, for an observed data set x_1, ..., x_n, we can define a distance matrix D = (d_{ij})_{i,j=1,...,n} where d_{ij} denotes the distance between vectors x_i and x_j. A hierarchical clustering algorithm tries to group the data into a hierarchy of clusters in such a way that the distances within these clusters are generally much smaller than those between the clusters. Numerous algorithms are available in the literature. The reason for the variety of solutions is that in general the result depends on various free choices, such as the sequence in which clusters are built or the definition of distance between clusters. For illustration, we give the definition of the complete linkage (or furthest neighbor) algorithm:
1. Set a threshold d_0.
2. Start with the initial clusters A_1^{(0)} = \{x_1\}, ..., A_n^{(0)} = \{x_n\} and set i = 1. The distances between the clusters are defined by d_{jl}^{(0)} = d(A_j^{(0)}, A_l^{(0)}) = d(x_j, x_l). This gives the n \times n distance matrix D^{(0)} = (d_{jl}^{(0)})_{j,l=1,...,n}.
3. Join the two clusters for which the distance d_{jl}^{(i-1)} is minimal, thus obtaining new clusters A_1^{(i)}, ..., A_{n-i}^{(i)}.
4. Calculate the new distances between clusters by

d_{jl}^{(i)} = d(A_j^{(i)}, A_l^{(i)}) = \max_{x \in A_j^{(i)}, y \in A_l^{(i)}} d(x, y)   (10.7)

and the corresponding (n-i) \times (n-i) distance matrix D^{(i)} with elements d_{jl}^{(i)} (j, l = 1, ..., n-i).


Table 10.1 Some measures of distance and similarity between x = (x_1, ..., x_k)^t, y = (y_1, ..., y_k)^t \in R^k. For some of the distances, it is assumed that a data set of observations in R^k is available to calculate sample variances s_i^2 (i = 1, ..., k) and a k \times k sample covariance matrix S.

- Euclidian distance: d(x, y) = (\sum_{i=1}^k (x_i - y_i)^2)^{1/2}. Usual distance in R^k.
- Pearson distance: d(x, y) = (\sum_{i=1}^k (x_i - y_i)^2 / s_i^2)^{1/2}. Standardized Euclidian.
- Mahalanobis distance: d(x, y) = ((x - y)^t S^{-1} (x - y))^{1/2}. Standardized Euclidian.
- Manhattan metric: d(x, y) = \sum_{i=1}^k w_i |x_i - y_i| (w_i \geq 0). Less sensitive to outliers.
- Minkowski metric: d(x, y) = (\sum_{i=1}^k w_i |x_i - y_i|^\lambda)^{1/\lambda} (\lambda \geq 1). For \lambda = 1: Manhattan.
- Bhattacharyya distance: d(x, y) = (\sum_{i=1}^k (\sqrt{x_i} - \sqrt{y_i})^2)^{1/2}. For x_i, y_i \geq 0 (example: proportions).
- Binary similarity: s(x, y) = k^{-1} \sum_{i=1}^k x_i y_i. Suitable for x_i = 0, 1.
- Simple matching coefficient: s(x, y) = k^{-1} \sum_{i=1}^k a_i, a_i = x_i y_i + (1 - x_i)(1 - y_i). Suitable for x_i = 0, 1.
- Gower's similarity coefficient: s(x, y) = 1 - k^{-1} \sum_{i=1}^k w_i |x_i - y_i|, with w_i = 1 if the ith coordinate is qualitative and w_i = 1/R_i (R_i = range of the ith coordinate) if it is quantitative. Suitable if some x_i are qualitative, some quantitative.

5. If

\min_{j,l=1,...,n-i} d_{jl}^{(i)} > d_0   (10.8)

then stop. Otherwise, set i = i + 1 and go to step 3.


Note in particular that for the final clusters, the maximal distance within each cluster is at most d_0. As a result, the final clusters tend to be very compact. A related method is the so-called nearest neighbor (single linkage) algorithm. It is identical with the above except that the distance between clusters is defined as the minimal distance between points in the two clusters. This can lead to so-called chaining in the form of elongated clusters.


For other algorithms and further properties see the references given at the
beginning of this chapter, and references therein.
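The complete linkage steps above can be sketched directly in code. The following is a minimal illustration (not from the original text; the Euclidian distance, the simulated data, and all names are chosen for this sketch):

```python
import numpy as np

def complete_linkage(x, d0):
    """Direct implementation of steps 1-5: start from singleton clusters,
    repeatedly join the pair with minimal complete-linkage distance
    d(A, B) = max_{a in A, b in B} d(a, b), and stop as soon as the
    smallest between-cluster distance exceeds the threshold d0."""
    clusters = [[i] for i in range(len(x))]
    # pairwise Euclidian distances between the observations
    pd = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))

    def dist(a, b):
        return max(pd[i, j] for i in a for j in b)

    while len(clusters) > 1:
        pairs = [(dist(clusters[j], clusters[l]), j, l)
                 for j in range(len(clusters))
                 for l in range(j + 1, len(clusters))]
        dmin, j, l = min(pairs)
        if dmin > d0:            # stopping rule (10.8)
            break
        clusters[j] = clusters[j] + clusters[l]
        del clusters[l]
    return clusters

rng = np.random.default_rng(1)
# two tight simulated groups of five bivariate points each
x = np.vstack([rng.normal(0, .3, (5, 2)), rng.normal(4, .3, (5, 2))])
clusters = complete_linkage(x, d0=2.0)
```

Because every merge happens at a complete-linkage distance of at most d0, each final cluster has diameter at most d0, which is exactly the compactness property noted above.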
10.2.3 HISMOOTH and HIWAVE clustering
HISMOOTH and HIWAVE models, as defined in Chapter 5, can be used to extract dominating features of a time series y(t) that are related to an explanatory series x(t). Suppose that we have several y-series, y_j(t) (j = 1, ..., N), that share the same explanatory series x(t). An interesting question is then to what extent the features related to x(t) are similar, and which series have more in common than others. One way to answer the question consists of the following clustering algorithm:
1. For each series y_j(t), fit a HISMOOTH or HIWAVE model, thus obtaining a decomposition
y_j(t) = \hat{\mu}_j(t, x_t) + e_j(t)
where \hat{\mu}_j is the estimated expected value of y_j given x(t).
2. Perform a cluster analysis of the fitted curves \hat{\mu}_j(t, x_t).
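A rough numerical sketch of the two steps might look as follows. Since the HISMOOTH fit of Chapter 5 is not reproduced here, an ordinary kernel smoother stands in for the fitted curve; the simulated series, the bandwidth, and all names are assumptions made only for this illustration:

```python
import numpy as np

def kernel_smooth(t, y, b):
    """Nadaraya-Watson smoother with a Gaussian kernel -- a stand-in for
    the fitted HISMOOTH curve of step 1 (assumption: any smoother that
    extracts the dominant features of y_j(t) serves for this sketch)."""
    w = np.exp(-0.5 * ((t[:, None] - t[None, :]) / b) ** 2)
    return (w @ y) / w.sum(axis=1)

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 200)
# step 1: two families of noisy series sharing two underlying shapes
series = [np.sin(2 * np.pi * t) + rng.normal(0, .4, t.size) for _ in range(3)] \
       + [np.cos(2 * np.pi * t) + rng.normal(0, .4, t.size) for _ in range(3)]
fits = np.array([kernel_smooth(t, y, b=0.03) for y in series])
# step 2: distance matrix of the fitted curves -- the input to any of the
# clustering algorithms of Section 10.2
D = np.sqrt(((fits[:, None, :] - fits[None, :, :]) ** 2).sum(-1))
within = max(D[0, 1], D[0, 2], D[1, 2], D[3, 4], D[3, 5], D[4, 5])
across = D[:3, 3:].min()
```

Smoothing first suppresses the noise e_j(t), so distances between fitted curves separate the two families much more sharply than distances between the raw series would.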

10.3 Specific applications in music

10.3.1 Distribution of notes
Consider the distribution p_j (j = 0, 1, ..., 11) of notes modulo 12 as defined for the star plots in Chapter 2. Can the visual impression of the star plots in Figure 2.31 be confirmed by cluster analysis? We consider the transformed data vectors \lambda = (\lambda_1, ..., \lambda_{11})^t, with \lambda_j = \log(p_j/(1 - p_j)), for the following compositions: 1) Anonymus: Saltarello (13th century); Saltarello (14th century); Troto (13th century); Alle psalite (13th century); 2) A. de la Halle (1235?-1287): Or est Bayard en la pature, hure!; 3) J. Ockeghem (1425-1495): Canon epidiatesseron; 4) J. Arcadelt (1505-1568): Ave Maria; La Ingratitud; Io dico fra noi; 5) W. Byrd (1543-1623): Ave Verum Corpus; Alman; The Queen's Alman; 6) J. Dowland (1562-1626): The Frog Galliard; The King of Denmark's Galliard; Come again; 7) H.L. Hassler (1564-1612): Galliarda; Kyrie from Missa Secunda; Sanctus et Benedictus from Missa Secunda; 8) Palestrina (1525-1594): Jesu! Rex admirabilis; O bone Jesu; Pueri hebraeorum; 9) J.H. Schein (1586-1630): Banchetto musicale; 10) J.S. Bach (1685-1750): Preludes and Fugues 1-24 from Das Wohltemperierte Klavier; 11) J. Haydn (1732-1809): Sonata op. 34/3 (Figure 10.3); 12) W.A. Mozart (1756-1791): Sonata KV 545 (2nd Mv.); Sonata KV 281 (2nd Mv.); Sonata KV 332 (2nd Mv.); Sonata KV 333 (2nd Mv.); 13) C. Debussy (1862-1918): Clair de lune; Arabesque 1; Reflets dans l'eau; 14) A. Schönberg (1874-1951): op. 19/2 (Figure 10.4); 15) A. Webern (1883-1945): Orchesterstück op. 6, No. 6; 16) Bartók (1881-1945): Bagatelles No.



[Figure 10.1: dendrogram, "Distribution of notes modulo 12 - complete linkage"]

Figure 10.1 Complete linkage clustering of log-odds-ratios of note-frequencies.

[Figure 10.2: dendrogram, "Distribution of notes modulo 12 - single linkage"]

Figure 10.2 Single linkage clustering of log-odds-ratios of note-frequencies.

Figure 10.3 Joseph Haydn (1732-1809). (Title page of a biography published by the Allgemeine Musik-Gesellschaft Zürich, 1830; courtesy of Zentralbibliothek Zürich.)

1-3; Piano Sonata (2nd Mv.); 17) O. Messiaen (1908-1992): Vingt regards sur l'Enfant-Jésus No. 3; 18) T. Takemitsu (1930-1996): Rain tree sketch No. 1.
Figure 10.1 shows the result of complete linkage clustering of the vectors (\lambda_1, ..., \lambda_{11})^t, based on the Euclidian distance and d_0 = 5. The most striking feature is the clear separation of early music from the rest. Moreover, the 20th century composers considered here are in a separate cluster, except for Bartók's Bagatelle No. 3 (and Debussy, who may be considered as belonging to the 19th and 20th centuries). In contrast, clusters provided by a single linkage algorithm are less easy to interpret. Figure 10.2 illustrates a typical result of this method, namely long narrow clusters where the maximal distance within a cluster can be quite large. In our example this does


Figure 10.4 Klavierstück op. 19, No. 2 by Arnold Schönberg. (Facsimile; used by permission of Belmont Music Publishers.)


[Figure 10.5: dendrogram, "Clusters of entropies - complete linkage"; leaves: Bach, Cello Suites I-VI (1st movements), and Preludes and Fugues No. 1 and 8 from Das Wohltemperierte Klavier I]

Figure 10.5 Complete linkage clustering of entropies.

not seem appropriate, since, due to the organic historic development of music, the effect of chaining is likely to be particularly pronounced.
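The log-odds transformation \lambda_j = \log(p_j/(1 - p_j)) used above can be sketched as follows. This is a hypothetical illustration, not the book's corpus: the two toy note sequences are invented, all twelve coordinates are kept (the text uses eleven), and the eps-guard against empty pitch classes is an assumption not discussed in the text:

```python
import numpy as np

def logit_profile(midi_notes, eps=1e-3):
    """Relative frequencies p_j of the pitch classes j = 0,...,11 and their
    log-odds lambda_j = log(p_j / (1 - p_j))."""
    counts = np.bincount(np.asarray(midi_notes) % 12, minlength=12)
    p = counts / counts.sum()
    p = np.clip(p, eps, 1 - eps)   # guard against p_j = 0 (assumption)
    return np.log(p / (1 - p))

tonal = [60, 62, 64, 65, 67, 69, 71, 72, 67, 64, 60]   # C-major scale material
chromatic = list(range(60, 72))                        # all twelve pitch classes
lam_tonal = logit_profile(tonal)
lam_chromatic = logit_profile(chromatic)
```

A tonal profile concentrates mass on few pitch classes and hence has strongly varying log-odds, while a uniform chromatic profile gives a flat vector; clustering such vectors is what distinguishes the repertoires in Figures 10.1 and 10.2.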
10.3.2 Entropies
Consider the entropies as defined in Chapter 3. More specifically, we define for each composition a vector y = (E_1, ..., E_{10})^t. After standardization of each coordinate, cluster analysis is applied to the following compositions by J.S. Bach: Cello Suites No. I to VI (1st movement from each); Preludes and Fugues No. 1 and 8 from Das Wohltemperierte Klavier (each separately). The complete linkage algorithm leads to a clear separation of the Cello Suites from Das Wohltemperierte Klavier, displayed in Figure 10.5.
10.3.3 Tempo curves
One of the obvious questions with respect to the tempo curves in Figure 2.3 is whether one can find clusters of similar performances. Applying complete linkage cluster analysis (with the Euclidian distance) to the raw data yields the clusters in Figure 10.6. Cortot and Horowitz appear to have very individual styles, since they build distinct clusters on their own. It should be noted, however, that this does not imply that other pianists do not have their own styles. Cortot and Horowitz simply happen to be the lucky ones


[Figure 10.6: dendrogram, "Clusters of tempo curves - complete linkage"]

Figure 10.6 Complete linkage clustering of tempo.

who are represented more than once in the sample, so that the consistency
of their performances can be checked empirically. Figure 10.6 also shows
that Cortot is somewhat of an outlier, since his cluster separates from
all other pianists at the top level.
10.3.4 Tempo curves and melodic structure
Cluster analysis alone does not provide any further explanation about the meaning of observed clusters. In particular, we do not know which musically meaningful characteristics determine the clustering of tempo curves. In contrast, cluster analysis based on HISMOOTH or HIWAVE models provides a way to gain more insight. The fitted HISMOOTH curves in Figures 5.9a through d extract essential features that make comparisons easier. The estimated bandwidths can be interpreted as a measure of how much emphasis a pianist puts on global and local features respectively. Figure 10.7 shows clusters based on the fitted HISMOOTH curves. In contrast to the original data, complete and single linkage turn out to yield almost the same clusters. Thus, applying the HISMOOTH fit first leads to a stabilization of results. From Figure 10.7, we may identify about six main clusters, namely:
A: KRUST, KATSARIS, SCHNABEL;

[Figure 10.7: dendrogram, "Clusters of HISMOOTH fits - complete linkage"]

Figure 10.7 Complete linkage clustering of HISMOOTH-fits to tempo curves.

B: MOISEIWITSCH, NOVAES, ORTIZ;
C: DEMUS, CORTOT1, CORTOT2, CORTOT3, ARGERICH, SHELLEY, CAPOVA;
D: ARRAU, BUNIN, KUBALEK, CURZON, GIANOLI;
E: ASKENAZE, DAVIES;
F: HOROWITZ1, HOROWITZ2, HOROWITZ3, ZAK, ESCHENBACH, NEY, KLIEN, BRENDEL.
This is related to the grouping of the vector of estimated bandwidths, (\hat{b}_1, \hat{b}_2, \hat{b}_3)^t \in R_+^3. In Figure 10.8, the x- and y-coordinates correspond to \hat{b}_1 and \hat{b}_2 respectively, and the radius of a circle is proportional to \hat{b}_3. The letters A through F identify locations where one or more observations from that cluster occur. The pictures show that only a few distinct values of \hat{b}_1 and \hat{b}_2 are selected. Particularly striking are the large bandwidths for clusters A and B. Apparently, these pianists emphasize mostly larger structures of the composition. Also note that the clusters do not separate equally well in each projection. Apart from clusters A and B, one cannot order the performances in terms of large versus small bandwidths. Overall, one may conclude that HISMOOTH-clustering together with analytic indicator functions provides a better understanding of essential characteristics of musical performance (Figure 10.9).


[Figure 10.8: symbol plot of the cluster labels A-F at the estimated bandwidth coordinates]

Figure 10.8 Symbol plot of HISMOOTH bandwidths for tempo curves. The radius of each circle is proportional to a constant plus log \hat{b}_3; the horizontal and vertical axes are equal to \hat{b}_1 and \hat{b}_2 respectively. The letters A-F indicate where at least one observation from the corresponding cluster occurs.

Figure 10.9 Maurizio Pollini (*1942). (Courtesy of Philippe Gontier, Paris.)


CHAPTER 11

Multidimensional scaling
11.1 Musical motivation
In some situations data consist of distances only. These distances are not necessarily euclidian, so that they do not necessarily correspond to a configuration of points in a euclidian space. The question addressed by multidimensional scaling (MDS) is to what extent one may nevertheless find points in a, hopefully low-dimensional, euclidian space that have exactly or approximately the observed distances. The procedure is mainly an exploratory tool that helps to find structure in distance data. We give a brief introduction to the basic principles of MDS. For a detailed discussion and an extended bibliography see, for instance, Kruskal and Wish (1978), Cox and Cox (1994), Everitt and Rabe-Hesketh (1997), Borg and Groenen (1997), Schiffman (1997); also see textbooks on multivariate statistics, such as the ones given in the previous chapters. For the origins of MDS and early references see Young and Householder (1941), Guttman (1954), Shepard (1962a,b), Kruskal (1964a,b), Ramsay (1977).
11.2 Basic principles
11.2.1 Basic definitions
In MDS, any symmetric n \times n matrix D = (d_{ij})_{i,j=1,...,n} with d_{ij} \geq 0 and d_{ii} = 0 is called a distance matrix. Note that this corresponds to the axioms D1, D2, and D3 in the previous chapter. If instead of distances, a similarity matrix S = (s_{ij})_{i,j=1,...,n} is given, then one can define a corresponding distance matrix by a suitable transformation. One possible transformation is, for instance,
d_{ij} = \sqrt{s_{ii} - 2 s_{ij} + s_{jj}}   (11.1)
The question addressed by metric MDS can be formulated as follows: given an n \times n distance matrix D, can one find a dimension k and n points x_1, ..., x_n in R^k such that these points have a distance matrix \hat{D} with \hat{D} approximately, or even exactly, equal to D? Clearly one prefers low dimensions (k = 2 or 3, if possible), since it is then easy to display the points graphically. On the other hand, the dimension cannot be too low in order to obtain a good approximation of D, and hence a realistic picture of structures in the data. As an alternative to metric MDS, one may also consider


non-metric methods, where one tries to find points in a euclidian space such that the ranking of the distances remains the same, whereas their nominal values may differ.
11.2.2 Metric MDS
In the ideal case, the metric solution constructs n points x_1, ..., x_n \in R^k for some k such that their euclidian distance matrix \hat{D}, with elements \hat{d}_{ij} = \sqrt{(x_i - x_j)^t (x_i - x_j)}, is exactly equal to the original distance matrix D. If this is possible, then D is called euclidian. The condition under which this is possible is as follows:
Theorem 25 D = D_{n \times n} = (d_{ij})_{i,j=1,...,n} is euclidian if and only if the matrix
B = B_{n \times n} = M A M
is positive semidefinite, where M = (I - n^{-1} 1 1^t), I = I_{n \times n} is the identity matrix, 1 = (1, ..., 1)^t, and A = A_{n \times n} has elements
a_{ij} = -\frac{1}{2} d_{ij}^2 \quad (i, j = 1, ..., n).
The reason for the positive semidefiniteness of B is that if D is indeed a euclidian matrix corresponding to points x_1, ..., x_n \in R^k, then
b_{ij} = (x_i - \bar{x})^t (x_j - \bar{x})   (11.2)
so that B defines a centered scalar product for these points. In matrix form we have B = (M X)(M X)^t, where the n rows of X_{n \times k} correspond to the vectors x_i (i = 1, ..., n). Since for any matrix C, the matrices C^t C and C C^t are positive semidefinite, so is B.
The construction of the points x_1, ..., x_n given D = D_{n \times n} (or B_{n \times n} \geq 0) is done as follows: suppose that B is of rank k \leq n. Since B is a symmetric matrix, we have the spectral decomposition
B = C \Lambda C^t = Z Z^t   (11.3)
where \Lambda is the n \times n diagonal matrix with the eigenvalues \lambda_1 \geq \lambda_2 \geq ... \geq \lambda_k > 0 and \lambda_j = 0 (j > k) in the diagonal, C the matrix of the corresponding eigenvectors, and Z = Z_{n \times n} = (z_{ij})_{i,j=1,...,n} = C \Lambda^{1/2}, so that the first k columns z^{(j)} (j = 1, ..., k) are equal to the first k scaled eigenvectors. Then the rows
x_i = (z_{i1}, ..., z_{ik})^t \quad (i = 1, ..., n)   (11.4)
of Z are points in R^k with distance matrix D.


In practice, the following difficulties can occur: 1. D is euclidian, but k is too large to be of any use (after all, the purpose is to obtain an interpretable picture of the data); 2. D is not euclidian, with a) all \lambda_i positive, or, b) some \lambda_i negative. Because of these problems, one often uses a rough approximation of D, based on a small number of eigenvectors that correspond to positive eigenvalues.
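The construction of Theorem 25 and (11.3)-(11.4) can be sketched numerically. The point configuration and the function name below are invented for this illustration; since D here is exactly euclidian and two-dimensional, the distances are recovered exactly:

```python
import numpy as np

def classical_mds(D, k=2):
    """Metric MDS: double-center A = -(1/2) D^2 to obtain B = M A M, then
    use the eigenvectors scaled by sqrt(lambda_j) as coordinates.
    Non-positive eigenvalues (clipped here) indicate D is not euclidian."""
    n = D.shape[0]
    M = np.eye(n) - np.ones((n, n)) / n
    B = M @ (-0.5 * D ** 2) @ M
    vals, vecs = np.linalg.eigh(B)            # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:k]          # pick the k largest
    lam = np.clip(vals[idx], 0, None)
    return vecs[:, idx] * np.sqrt(lam)

# a configuration in R^2, its euclidian distance matrix, and the MDS solution
x = np.array([[0., 0.], [1., 0.], [0., 2.], [3., 1.]])
D = np.sqrt(((x[:, None] - x[None, :]) ** 2).sum(-1))
y = classical_mds(D, k=2)
D_hat = np.sqrt(((y[:, None] - y[None, :]) ** 2).sum(-1))
```

The recovered configuration y agrees with x only up to rotation, reflection, and translation, but its distance matrix D_hat equals D, which is all that MDS promises.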
Finally, note that if instead of distances, similarities are given and the similarity matrix S is positive semidefinite, then S can be transformed into a euclidian distance matrix by defining
d_{ij} = \sqrt{s_{ii} - 2 s_{ij} + s_{jj}}   (11.5)

11.2.3 Non-metric MDS
For qualitative data, or generally observations in non-metric spaces, distances can only be interpreted in terms of ranking. For instance, the subjective judgement of an audience may be that a composition by Webern is slightly more difficult than Wagner, but much more difficult than Mozart, thus defining a larger distance between Webern and Mozart than between Webern and Wagner. It may, however, not be possible to express distances between the compositions by numbers that could be interpreted directly. In such cases, D is often called a dissimilarity matrix rather than a distance matrix. Since only the relative size of distances is meaningful, various computationally demanding algorithmic methods for defining points in a euclidian space such that the ranking of the distances remains the same have been developed in the literature (e.g. Shepard 1962a,b, Kruskal 1964a,b, Guttman 1968, Lingoes and Roskam 1973).

11.2.4 Chronological ordering
Suppose a distance matrix D (or a similarity matrix S) is given and one would like to find out whether there is a natural ordering of the observational units. For instance, a listener may assign a distance matrix between various musical pieces without knowing anything about these pieces a priori. The question then may be whether the listener's distance matrix corresponds approximately to the sequence in time when the pieces were composed. This problem is also called seriation. MDS provides a possible solution in the following way: if the distances expressed the temporal (or any other) sequence exactly, then the configuration of points found by MDS would be one-dimensional. In the more realistic case that the distances are only partially due to the temporal sequence, the points in R^k should be scattered around a one-dimensional, not necessarily straight, line in R^k. In the simplest case, this may already be visible in a two-dimensional plot.
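The seriation idea can be illustrated with constructed data (not from the book): eight hypothetical "compositions" drifting along a smooth curve in feature space, so that distances mainly reflect the temporal sequence. The first MDS coordinate then recovers the ordering, up to a reversal of direction:

```python
import numpy as np

t = np.linspace(0, 1, 8)                      # "dates" of the eight pieces
x = np.column_stack([t, t ** 2])              # features drifting along a curve
D = np.sqrt(((x[:, None] - x[None, :]) ** 2).sum(-1))

# double-centering as in the metric MDS construction
n = D.shape[0]
M = np.eye(n) - np.ones((n, n)) / n
B = M @ (-0.5 * D ** 2) @ M
vals, vecs = np.linalg.eigh(B)
first = vecs[:, np.argmax(vals)] * np.sqrt(vals.max())
order = np.argsort(first)                     # recovered seriation (or its reverse)
```

The sign of an eigenvector is arbitrary, so the recovered sequence may come out reversed; only the ordering itself is identified, which matches the remark that distances cannot tell the direction of time.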


11.3 Specific applications in music

11.3.1 Seriation by simple descriptive statistics
Suppose we would like to guess which time a composition is from, without listening to the music but instead using an algorithm. There is a large amount of music theory that can be used to determine the time when a composition was written. One may wonder, however, whether there may be a very simple computational way of guessing.
Consider, for instance, the following frequencies: x_i = p_{i-1} (i = 1, ..., 12) are the relative frequencies of notes modulo 12 centered around the central tone, as defined in Section 9.3.2. Moreover, set x_{13} equal to the relative frequency of a sequence of four notes following the sequence of interval steps 3, 3, and 3. This corresponds to an arpeggio of the diminished seventh chord. Thus, we consider a vector x = (x_1, ..., x_{13})^t with coordinates corresponding to proportions. An appropriate measure of distance between proportions is the Bhattacharyya distance (Bhattacharyya 1946b) given in Table 10.1, namely

d(x, y) = \Big( \sum_{i=1}^{k} (\sqrt{x_i} - \sqrt{y_i})^2 \Big)^{1/2}
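As a quick numerical check of the formula (the proportion vectors are invented, only to exercise the computation):

```python
import numpy as np

def bhattacharyya(x, y):
    """d(x, y) = (sum_i (sqrt(x_i) - sqrt(y_i))^2)^(1/2), as in Table 10.1;
    defined for vectors with non-negative coordinates such as proportions."""
    return np.sqrt(((np.sqrt(x) - np.sqrt(y)) ** 2).sum())

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
d_pq = bhattacharyya(p, q)
```

The function is symmetric and vanishes only for identical vectors, i.e. it satisfies D1-D3 (and D4), but it is not a euclidian distance, which is exactly why MDS is needed below.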

This is not a euclidian distance, so that it is not a priori clear whether a suitable representation of the observations in a euclidian space is possible. MDS with k = 2 yields the points in Figure 11.1. Three time periods are distinguished by using different symbols for the points. The periods are defined in a very simple way, namely by date of birth of the composer: a) before 1720 ("early" to "baroque"; see e.g. Figure 11.3); b) 1720-1880 ("classical" to "romantic"); and c) 1880 or later ("20th century"). The configuration of the respective points does show an effect of time. The three time periods can be associated with regional clusters, though the regions overlap. An outlier from the middle category is Schoenberg. This is due to the crude definition of the time periods: Schoenberg (in particular his op. 19/2) clearly belongs to the 20th century - he just happens to be born a little bit too early (1874), and is therefore classified as "classical to romantic". The dependence between time period and the second MDS-coordinate can also be seen by comparing boxplots (Figure 11.2).
11.3.2 Perception and music psychology
MDS is frequently used to analyze data that consist of subjective distances between musical sounds (e.g. with respect to pitch or timbre) or compositions, obtained in controlled experiments. Typical examples are Grey and Gordon (1978), Gromko (1993), Ueda and Ohgushi (1987), Wedin (1972), Wedin and Goude (1972), Markuse and Schneider (1995). Since it is not known to what extent the cognitive metric may correspond approximately to
2004 CRC Press LLC

[Figure 11.1: scatter plot of the first two MDS coordinates x1, x2; symbols distinguish "before 1720", "1720-1880", "1880 or later"; Schoenberg labeled as outlier]

Figure 11.1 Two-dimensional multidimensional scaling of compositions ranging from the 13th to the 20th century, based on frequencies of intervals and interval sequences.

a euclidian distance, MDS is a useful method to investigate this question, to simplify high-dimensional distance data, and possibly to find interesting structures. Grey and Gordon consider perceptual effects of timbres characterized by spectra. For a related study see Wedin and Goude (1972). Gromko (1993) carries out an MDS analysis to study perceptual differences between expert and novice music listeners. Ueda and Ohgushi (1987) study perceptual components of pitch and use MDS to obtain a spatial representation of pitch.


[Figure 11.2: boxplots of the second MDS-component for the periods "birth before 1720", "1720-1880", "1880 and later"]

Figure 11.2 Boxplots of second MDS-component where compositions are classified according to three time periods.

Figure 11.3 Fragment of a graduale from the 14th century. (Courtesy of Zentralbibliothek Zürich.)

Figure 11.4 Muzio Clementi (1752-1832). (Lithography by H. Bodmer, courtesy of Zentralbibliothek Zürich.)

Figure 11.5 Freddy (by J.B.) and Johannes Brahms (1833-1897) going for a drink. (Caricature from a contemporary newspaper; courtesy of Zentralbibliothek Zürich.)


List of figures
Figure 1.1: Quantitative analysis of music helps to understand creative processes. (Pierre Boulez, photograph courtesy of Philippe Gontier, Paris; and Jim by J.B.)
Figure 1.2: J.S. Bach (1685-1750). (Engraving by L. Sichling after a painting by Elias Gottlob Haussmann, 1746; courtesy of Zentralbibliothek Zürich.)
Figure 1.3: Ludwig van Beethoven (1770-1827). (Drawing by E. Dürck after a painting by J.K. Stieler, 1819; courtesy of Zentralbibliothek Zürich.)
Figure 1.4: Anton Webern (1883-1945). (Courtesy of Österreichische Post AG.)
Figure 1.5: Gottfried Wilhelm Leibniz (1646-1716). (Courtesy of Deutsche Post AG and Elisabeth von Janota-Bzowski.)
Figure 1.6: W.A. Mozart (1756-1791) (authorship uncertain) - Spiegel-Duett.
Figure 1.7: Wolfgang Amadeus Mozart (1756-1791). (Engraving by F. Müller after a painting by J.W. Schmidt; courtesy of Zentralbibliothek Zürich.)
Figure 1.8: The torus of thirds Z_3 + Z_4.
Figure 1.9: Arnold Schönberg - Sketch for the piano concert op. 42 - notes with tone row and its inversions and transpositions. (Used by permission of Belmont Music Publishers.)
Figure 1.10: Notes of "Air" by Henry Purcell. (For better visibility, only a small selection of related motifs is marked.)
Figure 1.11: Notes of Fugue No. 1 (first half) from Das Wohltemperierte Klavier by J.S. Bach. (For better visibility, only a small selection of related motifs is marked.)
Figure 1.12: Notes of op. 68, No. 2 from Album für die Jugend by Robert Schumann. (For better visibility, only a small selection of related motifs is marked.)
Figure 1.13: A miraculous transformation caused by high exposure to Wagner operas. (Caricature from a 19th century newspaper; courtesy of Zentralbibliothek Zürich.)
Figure 1.14: Graphical representation of pitch and onset time in Z_{71}^2 together with instrumentation of polygonal areas. (Excerpt from Śānti - Piano concert No. 2 by Jan Beran, col legno CD 20062; courtesy of col legno, Germany.)
Figure 1.15: Iannis Xenakis (1922-1998). (Courtesy of Philippe Gontier, Paris.)
Figure 1.16: Ludwig van Beethoven (1770-1827). (Courtesy of Zentralbibliothek Zürich.)
Figure 2.1: Robert Schumann (1810-1856) - Träumerei op. 15, No. 7.
Figure 2.2: Tempo curves of Schumann's Träumerei performed by Vladimir Horowitz.
Figure 2.3: Twenty-eight tempo curves of Schumann's Träumerei performed by 24 pianists. (For Cortot and Horowitz, three tempo curves were available.)
Figure 2.4: Boxplots of descriptive statistics for the 28 tempo curves in Figure 2.3.
Figure 2.5: q-q-plots of several tempo curves (from Figure 2.3).
Figure 2.6: Frequencies of notes 0, 1, ..., 11 for moving windows of onset-length 16.
Figure 2.7: Frequencies of notes 0, 1, ..., 11 for moving windows of onset-length 16.
Figure 2.8: Johannes Chrysostomus Wolfgangus Theophilus Mozart (1756-1791) in the house of Salomon Gessner in Zurich. (Courtesy of Zentralbibliothek Zürich.)
Figure 2.9: R. Schumann (1810-1856) - lithography by H. Bodmer. (Courtesy of Zentralbibliothek Zürich.)
Figure 2.10: Acceleration of tempo curves for Cortot and Horowitz.
Figure 2.11: Tempo acceleration - correlation with other performances.
Figure 2.12: Martha Argerich - interpolation of tempo curve by cubic splines.
Figure 2.13: Smoothed tempo curves g_1(t) = (n b_1)^{-1} \sum K((t - t_i)/b_1) y_i (b_1 = 8).
Figure 2.14: Smoothed tempo curves g_2(t) = (n b_2)^{-1} \sum K((t - t_i)/b_2) [y_i - g_1(t)] (b_2 = 1).
Figure 2.15: Smoothed tempo curves g_3(t) = (n b_3)^{-1} \sum K((t - t_i)/b_3) [y_i - g_1(t) - g_2(t)] (b_3 = 1/8).
Figure 2.16: Smoothed tempo curves - residuals e(t_i) = y_i - g_1(t_i) - g_2(t_i) - g_3(t_i).

Figure 2.17: Melodic indicator - local polynomial fits together with first and second derivatives.
Figure 2.18: Tempo curves (Figure 2.3) - first derivatives obtained from local polynomial fits (span 24/32).
Figure 2.19: Tempo curves (Figure 2.3) - second derivatives obtained from local polynomial fits (span 8/32).
Figure 2.20: Kinderszene No. 4 - sound wave of performance by Horowitz at the Royal Festival Hall in London on May 22, 1982.
Figure 2.21: log(Amplitude) and tempo for Kinderszene No. 4 - auto- and cross-correlations (a), scatter plot with fitted least squares and robust lines (b), time series plots (c), and sharpened scatter plot (d).
Figure 2.22: Horowitz - performance of Kinderszene No. 4 - log(tempo) versus log(Amplitude) and boxplots of log(tempo) for three ranges of amplitude.
Figure 2.23: Horowitz - performance of Kinderszene No. 4 - two-dimensional histogram of (x, y) = (log(tempo), log(Amplitude)) displayed in a perspective and image plot respectively.
Figure 2.24: Horowitz - performance of Kinderszene No. 4 - kernel estimate of two-dimensional distribution of (x, y) = (log(tempo), log(Amplitude)) displayed in a perspective and image plot respectively.
Figure 2.25: R. Schumann, Träumerei op. 15, No. 7 - density of melodic indicator with sharpening region (a) and melodic curve plotted against onset time, with sharpening points highlighted (b).
Figure 2.26: R. Schumann, Träumerei op. 15, No. 7 - tempo by Cortot and Horowitz at sharpening onset times.
Figure 2.27: R. Schumann, Träumerei op. 15, No. 7 - tempo derivatives for Cortot and Horowitz at sharpening onset times.
Figure 2.28: Arnold Schönberg (1874-1951), self-portrait. (Courtesy of Verwertungsgesellschaft Bild-Kunst, Bonn.)
Figure 2.29: a) Chernoff faces for 1. Saltarello (Anonymus, 13th century); 2. Prelude and Fugue No. 1 from Das Wohltemperierte Klavier (J.S. Bach, 1685-1750); 3. Kinderszene op. 15, No. 1 (R. Schumann, 1810-1856); 4. Piano piece op. 19, No. 2 (A. Schönberg, 1874-1951); 5. Rain Tree Sketch 1 (T. Takemitsu, 1930-1996); b) Chernoff faces for the same compositions as in Figure 2.29a, after permuting coordinates.
Figure 2.30: The minnesinger Burchard von Wengen (1229-1280), contemporary of Adam de la Halle (1235?-1288). (From Codex Manesse, courtesy of the University Library Heidelberg.) (Color figures follow page 168.)

Figure 2.31: Star plots of p = (p6 , p11 , p4 , p9 , p2 , p7 , p12 , p5 , p10 , p3 , p8 )t


j
for compositions from the 13th to the 20th century.
Figure 2.32: Symbol plot of the distribution of successive interval pairs
(y(ti ), y(ti+1 )) (2.36a, c) and their absolute values (b, d) respectively,
for the upper envelopes of Bachs Prludium No. 1 (Das Wohltemperierte
a
Klavier I) and Mozart s Sonata KV 545 (beginning of 2nd movement).
Figure 2.33: Symbol plot of the distribution of successive interval pairs
(y(ti ), y(ti+1 )) (a, c) and their absolute values (b, d) respectively, for
the upper envelopes of Scriabins Prlude op. 51, No. 4 and F. Martins
e
Prlude No. 6.
e
Figure 2.34: Symbol plot with x = pj5 , y = pj7 and radius of circles
proportional to pj1 .
Figure 2.35: Symbol plot with x = pj5 , y = pj7 and radius of circles
proportional to pj6 . (Color gures follow page 168.)
Figure 2.36: Symbol plot with x = pj5 , y = pj7 . The rectangles have
width pj1 (diminished second) and height pj6 (augmented fourth). (Color
gures follow page 168.)
Figure 2.37: Symbol plot with x = pj5 , y = pj7 , and triangles dened by
pj1 (diminished second), pj6 (augmented fourth) and pj10 (diminished
seventh). (Color gures follow page 168.)
Figure 2.38: Names plotted at locations (x, y) = (pj5, pj7). (Color figures follow page 168.)
Figure 2.39: Profile plots of pj = (pj5, pj10, pj3, pj8, pj1, pj6, pj11, pj4, pj9, pj2, pj7)t.
Figure 3.1: Ludwig Boltzmann (1844-1906). (Courtesy of Österreichische Post AG.)
Figure 3.2: Fractal pictures (by Céline Beran, computer generated). (Color figures follow page 168.)
Figure 3.3: György Ligeti (*1923). (Courtesy of Philippe Gontier, Paris.)
Figure 3.4: Comparison of entropies 1, 2, 3, and 4 for J.S. Bach's Cello Suite No. I and R. Schumann's op. 15, No. 2, 3, 4, and 7, and op. 68, No. 2 and 16.
Figure 3.5: Alexander Scriabin (1871-1915) (at the piano) and the conductor Serge Koussevitzky. (Painting by Robert Sterl, 1910; courtesy of Gemäldegalerie Neuer Meister, Dresden, and Robert-Sterl-House.)
Figure 3.6: Comparison of entropies 9 and 10 for Bach, Schumann, and
Scriabin/Martin.
Figure 3.7: Metric, melodic, and harmonic global indicators for Bach's Canon cancricans.
Figure 3.8: Robert Schumann (1810-1856). (Courtesy of Zentralbibliothek Zürich.)
Figure 3.9: Metric, melodic, and harmonic global indicators for Schumann's op. 15, No. 2 (upper figure), together with smoothed versions (lower figure).
Figure 3.10: Metric, melodic, and harmonic global indicators for Schumann's op. 15, No. 7 (upper figure), together with smoothed versions (lower figure).
Figure 3.11: Metric, melodic, and harmonic global indicators for Webern's Variations op. 27, No. 2 (upper figure), together with smoothed versions (lower figure).
Figure 3.12: R. Schumann, Träumerei: motifs used for specific melodic indicators.
Figure 3.13: R. Schumann, Träumerei: indicators of individual motifs.
Figure 3.14: R. Schumann, Träumerei: contributions of individual motifs to overall melodic indicator.
Figure 3.15: R. Schumann, Träumerei: overall melodic indicator.
Figure 4.1: Sound wave of c and f played on a piano.
Figure 4.2: Zoomed piano sound wave (shaded area in Figure 4.1).
Figure 4.3: Periodogram of piano sound wave in Figure 4.2.
Figure 4.4: Sound wave of e played on a harpsichord.
Figure 4.5: Periodogram of harpsichord sound wave in Figure 4.4.
Figure 4.6: Harpsichord sound periodogram plots for different time frames (moving windows of time points).
Figure 4.7: A harpsichord sound and its spectrogram. Intense pink corresponds to high values of I(t, λ). (Color figures follow page 168.)
Figure 4.8: A harpsichord sound wave (a), logarithm of squared amplitudes (b), histogram of the series (c) and its periodogram on log-scale (d) together with fitted SEMIFAR-spectrum.
Figure 4.9: Log-frequencies with fitted SEMIFAR-trend and log-log-periodogram together with SEMIFAR-fit for Bach's first Cello Suite (1st movement; a,b) and Paganini's Capriccio No. 24 (c,d) respectively.
Figure 4.10: Local variability with fitted SEMIFAR-trend and log-log-periodogram together with SEMIFAR-fit for Bach's first Cello Suite (1st movement; a,b) and Paganini's Capriccio No. 24 (c,d) respectively.
Figure 4.11: Niccolò Paganini (1782-1840). (Courtesy of Zentralbibliothek Zürich.)
Figure 5.1: Simulated signal (a) and wavelet coefficients (b); (c) and (d): wavelet components of simulated signal in a; (e) and (f): wavelet components of simulated signal in a and frequency plot of coefficients.
Figure 5.2: Decomposition of x-series in simulated HIWAVE model.
Figure 5.3: Simulated HIWAVE model - explanatory series g1 (a), y-series (b), y versus x (c), y versus g1 (d), y versus g2 = x - g1 (e) and time frequency plot of y (f).
Figure 5.4: HIWAVE time series and fitted function g1.
Figure 5.5: Hierarchical decomposition of metric, melodic, and harmonic indicators for Bach's Canon cancricans (Das Musikalische Opfer BWV 1079) and Webern's Variation op. 27, No. 2.
Figure 5.6: Quantitative analysis of performance data is an attempt to understand objectively how musicians interpret a score without attaching any subjective judgement. (Left: Freddy by J.B.; right: J.S. Bach, woodcutting by Ernst Würtemberger, Zürich. Courtesy of Zentralbibliothek Zürich.)
Figure 5.7: Most important melodic curves obtained from HIREG fit to tempo curves for Schumann's Träumerei.
Figure 5.8: Successive aggregation of HIREG-components for tempo curves
by Ashkenazy and Horowitz (third performance).
Figure 5.9 a and b: HISMOOTH-fits to tempo curves (performances 1-14); Figure 5.9 c and d: HISMOOTH-fits to tempo curves (performances 15-28).
Figure 5.10: Time frequency plots for Cortot's and Horowitz's three performances.
Figure 5.11: Wavelet coefficients for Cortot's and Horowitz's three performances.
Figure 5.12: Tempo curves' approximation by most important 2 best basis functions.
Figure 5.13: Tempo curves' approximation by most important 5 best basis functions.
Figure 5.14: Tempo curves' approximation by most important 10 best basis functions.
Figure 5.15: Tempo curves (a) by Cortot (three curves on top) and Horowitz, R2 obtained in HIWAVE-fit plotted against trial cut-off parameter (b) and fitted HIWAVE-curves (c).
Figure 5.16: First derivative of tempo curves (a) by Cortot (three curves on top) and Horowitz, R2 obtained in HIWAVE-fit plotted against trial cut-off parameter (b) and fitted HIWAVE-curves (c).
Figure 5.17: Second derivative of tempo curves (a) by Cortot (three curves on top) and Horowitz, R2 obtained in HIWAVE-fit plotted against trial cut-off parameter (b) and fitted HIWAVE-curves (c).
Figure 6.1: Jean-Philippe Rameau (1683-1764). (Engraving by A. St. Aubin after J. J. Caffieri, Paris after 1764; courtesy of Zentralbibliothek Zürich.)
Figure 6.2: Frédéric Chopin (1810-1849). (Courtesy of Zentralbibliothek Zürich.)
Figure 6.3: Stationary distributions πj (j = 1, ..., 11) of Markov chains with state space Z12 \ {0}, estimated for the transition between successive intervals.
Figure 6.4: Cluster analysis based on stationary Markov chain distributions for compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rachmaninoff.
Figure 6.5: Cluster analysis based on stationary Markov chain distributions of torus distances for compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rachmaninoff.
Figure 6.6: Comparison of log odds ratios log(π1/π2) of stationary Markov chain distributions of torus distances.
Figure 6.7: Comparison of log odds ratios log(π1/π3) of stationary Markov chain distributions of torus distances.
Figure 6.8: Comparison of log odds ratios log(π2/π3) of stationary Markov chain distributions of torus distances.
Figure 6.9: Comparison of log odds ratios log(π1/π3) and log(π2/π3) of stationary Markov chain distributions of torus distances.
Figure 6.10: Comparison of stationary Markov chain distributions of torus
distances.
Figure 6.11: Log odds ratios log(π1/π3) and log(π2/π3) plotted against date of birth of composer.
Figure 6.12: Johannes Brahms (1833-1897). (Courtesy of Zentralbibliothek Zürich.)
Figure 7.1: Béla Bartók statue by Varga Imre in front of the Béla Bartók Memorial House in Budapest. (Courtesy of the Béla Bartók Memorial House.)
Figure 7.2: Sergei Prokofieff as a child. (Courtesy of Karadar Bertoldi Ensemble; www.karadar.net/Ensemble/.)
Figure 7.3: Circular representation of compositions by J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokofieff (Visions fugitives No. 8).
Figure 7.4: Boxplots of θ1, R, d and log m for notes modulo 12, comparing Bach, Scarlatti, Bartók, and Prokofieff.
Figure 7.5: Circular representation of intervals of successive notes in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokofieff (Visions fugitives No. 8).
Figure 7.6: Boxplots of θ1, R, d and log m for note intervals modulo 12, comparing Bach, Scarlatti, Bartók, and Prokofieff.
Figure 7.7: Circular representation of notes ordered according to circle of fourths in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokofieff (Visions fugitives No. 8).
Figure 7.8: Boxplots of θ1, R, d and log m for notes modulo 12 ordered according to circle of fourths, comparing Bach, Scarlatti, Bartók, and Prokofieff.
Figure 7.9: Circular representation of intervals of successive notes ordered according to circle of fourths in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokofieff (Visions fugitives No. 8).
Figure 7.10: Boxplots of θ1, R, d and log m for note intervals modulo 12 ordered according to circle of fourths, comparing Bach, Scarlatti, Bartók, and Prokofieff.
Figure 8.1: Tempo curves for Schumann's Träumerei: skewness for the eight parts A1, A2, A1', A2', B1, B2, A1'', A2'' for 28 performances, plotted against the number of the part.
Figure 8.2: Schumann's Träumerei: screeplot for skewness.
Figure 8.3: Schumann's Träumerei: loadings for PCA of skewness.
Figure 8.4: Schumann's Träumerei: symbol plot of principal components z2, ..., z5 for PCA of tempo skewness.
Figure 8.5: Schumann's Träumerei: tempo curves by Cortot, Horowitz, Brendel, and Gianoli.
Figure 8.6: Air by Henry Purcell (1659-1695).
Figure 8.7: Screeplot for PCA of entropies.
Figure 8.8: Loadings for PCA of entropies.
Figure 8.9: Entropies - symbol plot of the first four principal components.
Figure 8.10: Entropies - symbol plot of principal components no. 2-5.
Figure 8.11: F. Martin (1890-1971). (Courtesy of the Société Frank Martin and Mrs. Maria Martin.)
Figure 8.12: F. Martin (1890-1971) - manuscript from 8 Préludes. (Courtesy of the Société Frank Martin and Mrs. Maria Martin.)
Figure 9.1: Discriminant analysis combined with time series analysis can
be used to judge purity of intonation (Elvira by J.B.).
Figure 9.2: Linear discriminant analysis of compositions before and after
1800, with the training sample. The data used for the discriminant rule
consists of x = (p5 , E).
Figure 9.3: Linear discriminant analysis of compositions before and after 1800, with the validation sample. The data used for the discriminant rule consists of x = (p5, E).
Figure 9.4: Linear discriminant analysis of "Early Music to Baroque" and "Romantic to 20th Century". The points (o and •) belong to the training sample. The data used for the discriminant rule consists of x = (p5, E).
Figure 9.5: Linear discriminant analysis of "Early Music to Baroque" and "Romantic to 20th century". The points (o and •) belong to the validation sample. The data used for the discriminant rule consists of x = (p5, E).
Figure 9.6: Graduale written for an Augustinian monastery of the diocese Konstanz, 13th century. (Courtesy of Zentralbibliothek Zürich.) (Color figures follow page 168.)
Figure 9.7: Johannes Brahms (1833-1897). (Photograph by Maria Fellinger, courtesy of Zentralbibliothek Zürich.)
Figure 9.8: Richard Wagner (1813-1883). (Engraving by J. Bankel after a painting by C. Jäger, courtesy of Zentralbibliothek Zürich.)
Figure 10.1: Complete linkage clustering of log-odds-ratios of note-frequencies.
Figure 10.2: Single linkage clustering of log-odds-ratios of note-frequencies.
Figure 10.3: Joseph Haydn (1732-1809). (Title page of a biography published by the Allgemeine Musik-Gesellschaft Zürich, 1830; courtesy of Zentralbibliothek Zürich.)
Figure 10.4: Klavierstück op. 19, No. 2 by Arnold Schönberg. (Facsimile; used by permission of Belmont Music Publishers.)
Figure 10.5: Complete linkage clustering of entropies.
Figure 10.6: Complete linkage clustering of tempo.
Figure 10.7: Complete linkage clustering of HISMOOTH-fits to tempo curves.
Figure 10.8: Symbol plot of HISMOOTH bandwidths for tempo curves. The radius of each circle is proportional to a constant plus log b3; the horizontal and vertical axes are equal to b1 and b2 respectively. The letters A-F indicate where at least one observation from the corresponding cluster occurs.
Figure 10.9: Maurizio Pollini (*1942). (Courtesy of Philippe Gontier, Paris.)
Figure 11.1: Two-dimensional multidimensional scaling of compositions
ranging from the 13th to the 20th century, based on frequencies of intervals and interval sequences.
Figure 11.2: Boxplots of second MDS-component where compositions are classified according to three time periods.
Figure 11.3: Fragment of a graduale from the 14th century. (Courtesy of Zentralbibliothek Zürich.)
Figure 11.4: Muzio Clementi (1752-1832). (Lithography by H. Bodmer, courtesy of Zentralbibliothek Zürich.)
Figure 11.5: Freddy (by J.B.) and Johannes Brahms (1833-1897) going for a drink. (Caricature from a contemporary newspaper; courtesy of Zentralbibliothek Zürich.)
References
Akaike, H. (1973a). Information theory and an extension of the maximum likelihood principle. In: Second International Symposium on Information Theory,
B.N. Petrow and F. Csaki (eds.), Akademiai Kiado, Budapest, 267-281.
Akaike, H. (1973b). Maximum likelihood identication of Gaussian autoregressive
moving average models. Biometrika, Vol. 60, 255-265.
Akaike, H. (1979). A Bayesian extension of the minimum AIC procedure of autoregressive model tting. Biometrika, Vol. 26, 237-242.
Albert. A.A. (1956). Fundamental Concepts of Higher Algebra. University of
Chicago Press, Chicago.
Anderberg, M.R. (1973). Cluster Analysis for Applications. Academic Press, New
York and London.
Anderson, T.W. (1984). An Introduction to Multivariate Statistical Analysis (2nd
ed.). Wiley, New York.
Andreatta, M. (1997) Group-theoretical methods applied to music. PhD thesis,
University of Sussex.
M. Andreatta, M., Noll, T., Agon, C. and Assayag, G. (2001). The geometrical
groove: rhythmic canons between theory, implementation and musical experiment. In: Les Actes des 8`mes Journes dInformatique Musicale, Bourges 7-9
e
juin 2001, p. 93-97.
Antoniadis, A. and Oppenheim, G. (1995). Wavelets and Statistics. Lecture Notes
in Statistics, No. 103, Springer, New York.
Arabie, P., Hubert, L.J. and De Soete, G. (1996). Clustering and Classication.
World Scientic Pub., London.
Archibald, B. (1972). Some thoughts on symmetry in early Webern. Persp. New
Music, 10, 159-163.
Ash, R.B. (1965). Information Theory. Wiley, New York.
Ashby, W.R. (1956). An Introduction to Cybernetics. Wiley, New York.
Babbitt, M. (1960) Twelve-tone invariants as compositional determinant. Musical
Quarterly, 46, 245-259.
Babbitt, M. (1961) Set structure as a compositional determinant. JMT, 5, No. 2,
72-94.
Babbitt, M. (1987) Words about Music. Dembski A. and Straus J.N. (eds.), University of Wisconsin Press, Madison.
Backus, J. (1969). The acoustical Foundations of Music, W.W. Norton & Co.,
New York (reprinted 1977).
Bailhache, P. (2001). Une Histoire de lAcoustique Musicale, CNRS Editions.
Balzano, G.J. (1980). The group-theoretic description of 12-fold and microtonal
pitch systems. Computer Music Journal, Vol. 4, No. 4, 66-84.

2004 CRC Press LLC

Barnard, G.A. (1951). The theory of information. J. Royal Statist. Soc., Series
B, Vol. 13, 46-69.
Bartlett, M.S. (1955). An Introduction to Stochastic Processes. Cambridge University Press, Cambridge.
Batschelet, E. (1981). Circular Statistics. Academic Press, London.
Beament, J. (1997). The Violin Explained: Components, Mechanism, and Sound.
Oxford University Press, Oxford.
Benade, A.H. (1976). Fundamentals of Musical Acoustics. Oxford University
Press, Oxford. (Reprinted by Dover in 1990).
Benson, D. (1995-2002). Mathematics and Music. Internet Lecture Notes,
Department of Mathematics, University of Georgia, USA (available at
http://www.math.uga.edu/~djb/html/math-music.html).
Beran, J. (1987). Aniseikonia. H.O.E. (Bison Records).
Beran, J. (1991). Cirri. Centaur Records, CRC 2100.
Beran, J. (1994). Statistics for Long-Memory Processes. Chapman & Hall, New
York.
Beran, J. (1995). Maximum likelihood estimation of the dierencing parameter
for invertible short- and long-memory ARIMA models. J. R. Statist. Soc.,
Series B, Vol. 57, No.4, 659-672.
Beran, J. (1998) Modeling and objective distinction of trends, stationarity and
long-range dependence. Proceedings of the VIIth International Congress of
Ecology - INTECOL 98, Farina, A., Kennedy, J. and Boss, V. (Eds.), p.
u
41.
a
Beran, J. (2000). Snti. col legno, WWE 1CD 20062 (http://www.col-legno.de).
Beran, J. and Feng. Y. (2002a). SEMIFAR models a semiparametric framework
for modelling trends, long-range dependence and nonstationarity. Computational Statistics & Data Analysis, Vol. 40, No. 2, 393-419.
Beran, J. and Feng, Y. (2002b). Iterative plug-in algorithms for SEMIFAR models
denition, convergence, and asymptotic properties. J. Computational Graphical Statist., Vol. 11, No. 3, 690-713.
Beran, J. and Ghosh, S. (2000). Estimation of the dominating frequency for stationary and nonstationary fractional autoregressive processes. J. Time Series
Analysis, Vol. 21, No. 5, 513-533.
Beran, J. and Mazzola, G. (1992). Immaculate Concept. SToA music, 1 CD
1002.92, Zrich.
u
Beran, J. and Mazzola, G. (1999). Analyzing musical structure and performance
- a statistical approach. Statistical Science, Vol. 14, No. 1, pp.47-79.
Beran, J. and Mazzola, G. (1999). Visualizing the relationship between two time
series by hierarchical smoothing. J. Computational Graphical Statist., Vol. 8,
No. 2, pp.213-238.
Beran, J. and Mazzola, G. (2000). Timing Microstructure in Schumanns
Trumerei as an Expression of Harmony, Rhythm, and Motivic Structure in
a
Music Performance. Computers Mathematics Appl., Vol. 39, No. 5-6, pp.99130.
Beran, J. and Mazzola, G. (2001). Musical composition and performance statistical decomposition and interpretation. Student, Vol. 4, No.1, 13-42.
Beran, J. and Ocker, D. (1999). SEMIFAR forecasts, with applications to foreign

2004 CRC Press LLC

exchange rates. J. Statistical Planning Inference, 80, 137-153.


Beran, J. and Ocker, D. (2001). Volatility of stock market indices - an analysis
based on SEMIFAR models. J. Bus. Economic Statist., Vol. 19, No. 1, 103-116.
Berg, R.E. and Stork, D.G. (1995). The Physics of Sound (2nd ed.). Prentice
Hall, New Jersey.
Berry, W. (1987). Structural Function in Music. Dover, Mineola.
Besag, J. (1989). Towards Bayesian image analysis. J. Appl. Statistics, Vol. 16,
395-407.
Besicovitch, A.S. (1935). On the sum of digits of real numbers represented in the
dyadic system (On sets of fractional dimensions II). Mathematische Annalen,
Vol. 110, 321-330.
Besicovitch, A.S. and Ursell, H.D. (1937). Sets of fractional dimensions (V): On
dimensional numbers of some continuous curves. J. London Mathematical Society, Vol. 29, 449-459.
Bhattacharyya, A. (1946a). On some analogues of the amount of information and
their use in statistical estimation. Sankhya, Vol. 8, 1-14.
Bhattacharyya, A. (1946b). On a measure of divergence between two multinomial
populations. Sankhya, 7, 401-406.
Billingsley, P. (1986). Probability and Measure (2nd ed.). Wiley, New York.
Blasheld, R.K. and Aldenderfer, M.S. (1985). Cluster Analysis. Sage, London.
Boltzmann, L. (1896). Vorlesungen uber Gastheorie. Johann Ambrosius Barth,

Leipzig.
Borg, I. and Groenen, P. (1997). Modern Multidimensional Scaling: Theory and
Applications. Springer, New York.
Bowman, A.W. and Azzalini, A. (1997). Applied Smoothing Techniques for Data
Analysis: The Kernel Approach with S-Plus Illustrations. Oxford University
Press, Oxford.
Box, G.E.P. and Jenkins, G.M. (1970). Time Series Analysis: Forecasting and
Control. Holden-Day, San Francisco.
Breiman, L. (1984). Classication and Regression Trees. CRC Press, Boca Raton.
Bremaud, P. (1999). Markov Chains. Springer, New York.
Brillouin, L. (1956). Science and Information Theory. Academic Press, New York.
Brillinger, D. (1981). Time Series Data Analysis and Theory (expanded ed.).
Holden Day, San Francisco.
Brillinger, D. and Irizarry, R.A. (1998). An investigation of the second- and
higher-order spectra of music. Signal Processing, Vol. 65, 161-179.
Bringham, E.O. (1988). The Fast Fourier Transform and Applications. Prentice
Hall, New Jersey.
Brockwell, P.J. and Davis, R.A. (1991). Time series: Theory and methods (2nd
ed.). Springer, New York.
Brown, E.N. (1990). A note on the asymptotic distribution of the parameter
estimates for the harmonic regression model. Biometrika, Vol. 77, No. 3, 653656.
Chai, W. and Vercoe, B. (2001). Folk Music Classication Using Hidden Markov
Models. Proceedings of International Conference on Articial Intelligence, June
2001 (//web.media.mit.edu/ chaiwei/papers/chai ICAI183.pdf).
Chambers, J., Cleveland, W., Kleiner, B., and Tukey, P. (1983). Graphical Meth-

2004 CRC Press LLC

ods for Data Analysis. Wadsworth Publishing Company: Belmont, California.


Chernick, M.R. (1999). Bootstrap Methods: A Practitioners Guide. Jpssey-Bass,
New York.
Chung, K.L. (1967). Markov Chains with Stationary Transition Probabilities.
Springer, Berlin.
Cleveland, W. (1985). Elements of Graphing Data. Wadsworth Publishing Company: Belmont, California.
Coifman, R., Meyer, Y., and Wickerhauser, V. (1992). Wavelet analysis and
sinal processing. In: Wavelets and Their Applications, pp. 153-178. Jones and
Bartlett Publishers, Boston.
Coifman, R. and Wickerhauser, V. (1992). Entropy-based algorithms for best
basis selection. IEEE Transactions on Information Theory, Vol. 38, No. 2,
713-718.
Conway, J.H. and Sloane, N.J.A. (1988). Sphere packings, lattices and groups.
Grundlehren der mathematischen Wissenschaften 290, Springe, Berlin.
Cooley, J.W. and Tukey, J.W. (1965). An algorithm for the machine calculation
of complex Fourier series. Math. Comput., Vol. 19, 297-301.
Cox, T.F. and Cox, M.A.A. (1994). Multidimensional Scaling. Chapman & Hall,
London.
Cremer, L. (1984). The Physics of The Violin, MIT Press, 1984.
Crocker, M.J. (ed.) (1998). Handbook of Acoustics, Wiley Interscience: New York.
Dahlhaus, R. (1987). Ecient parameter estimation for self-similar processes.
Ann. Statist., Vol. 17, 1749-1766.
Dahlhaus, R. (1996a). Maximum likelihood estimation and model selection for
locally stationary processes. J. Nonpar. Statist., Vol. 6, 171-191.
Dahlhaus, R. (1996b) Asymptotic statistical inference for nonstationary processes
with evolutionary spectra. In: Athens Conference on Applied Probability and
Time Series, Vol. II, P.M. Robinson and M. Rosenblatt (Eds.), 145-159, Lecture Notes in Statistics, 115, Springer, New York.
Dahlhaus, R. (1997). Fitting time series models to nonstationary processes. Ann.
Statistics, Vol. 25, 1-37.
Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM, Philadelphia, PA.
Davison, A.C. and Hinkley, D.V. (1997). Bootstrap Methods and Their Application. Cambridge University Press, Cambridge.
de la Motte-Haber, H. (1996). Handbuch der Musikpsychologie (2nd ed.). Laaber
Verlag, Laaber.
Devaney, R.L. (1990). Chaos, Fractals and Dynamics. Addison-Wesley, California.
Diaconis, P., Graham, R.L., and Kantor, W.M. (1983). The mathematics of perfect shues. Adv. Appl. Math., Vol. 4, 175-196.
Diggle, P. (1990) Time Series A Biostatistical Introduction. Oxford University
Press, Ocford.
Dillon, W. R. and Goldstein, M. (1984). Multivariate Analysis, Methods and
Applications. Wiley, New York.
Donoho, D.L. and Johnstone, I.M. (1995). Adapting to unknown smoothness via
wavelet shrinkage. JASA, 90, 1200-1224.
Donoho, D.L. and Johnstone, I.M. (1998). Minimax estimation via wavelet shrink-

2004 CRC Press LLC

age. Ann. Statistics 26, 879-921.


Donoho, D.L., Johnstone, I.M., Kerkyacharian, G., and Picard, D. (1995).
Wavelet shrinkage: Asymptopia? J. R. Statist. Soc., Series B, 57, 301-337.
Donoho, D.L., Johnstone, I.M., Kerkyacharian, G., and Picard, D. (1996). Density
estimation by wavelet thresholding. Ann. Statistics, 24, 508-539.
Draper, N.R. and Smith, H. (1998). Applied Regression Analysis (3rd ed.). Wiley,
New York.
Duda, R.O., Hart, P.E. and Stork, D.G. (2000). Pattern classication (2nd ed.).
Wiley, New York.
Edgar, G.A. (1990). Measure, Topology and Fractal Geometry. Springer, New
York.
Eelsberg, W. and Steinmetz, R. (1998). Video Compression Techniques. Dpunkt
Verlag, Heidelberg.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statistics, Vol. 7, 1- 26.
Eimert, H. (1964). Grundlagen der musikalischen Reihentechnik. Universal Edition, Vienna.
Elliott, R.J., Agoun, L., and Moore, J.B. (1995). Hidden Markov Models: Estimation and Control. Springer, New York.
Erds, P. (1946). On the distribution function of additive functions. Ann. Matho
ematics, Vol. 43, 1-20.
Eubank, R.L. (1999). Nonparametric Regression and Spline Smoothing (2nd ed.).
Marcel Dekker: New York.
Everitt, B.S., Landau, S. and Leese, M. (2001). Cluster Analysis (4th ed.). Oxford
University Press, Oxford.
Everitt, B.S. and Rabe-Hesketh, S. (1997). The Analysis of Proximity Data.
Arnold, London.
Falconer, K.J. (1985). The Geometry of Fractal Sets. Cambridge University Press,
Cambridge.
Falconer, K.J. (1986). Random Fractals. Math. Proc. Cambridge Philos. Soc.,
Vol. 100, 559-582.
Falconer, K.J. (1990). Fractal Geometry. Wiley, New York.
Fan, J. and Gijbels, I. (1995). Data-driven bandwidth selection in local polynomial tting: Variable bandwidth and spatial adaptation. J. R. Statist. Soc.,
Ser. B, 57, 371394.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modeling and Its Applications.
Chapman & Hall, London.
Feng, Y. (1999). Kernel- and Locally Weighted Regression with Applications to
Time Series Decomposition. Verlag fr Wissenschaft und Forschung, Berlin.
u
Fisher, N.I. (1993). Statistical Analysis of Circular Data. Cambridge University
Press, Cambridge.
Fisher, R.A. (1925). Theory of Statistical Information. Proc. Camb. Phil. Soc.,
Vol. 22, pp. 700-725.
Fisher, R.A. (1956). Statistical Methods and Scientic Inference. Oliver & Boyd,
London.
Fleischer, A. (2003). Die analytische Interpretation. Schritte zur Erschlieung
eines Forschungsfeldes am Beispiel der Metrik. PhD dissertation, Humboldt-

2004 CRC Press LLC

University Berlin. dissertation.de, Verlag im Internet GmbH, Berlin.


Fleischer, A., Mazzola, G., Noll, Th. Zur Konzeption der Software RUBATO fr
u
musikalische Analyse und Performance. Musiktheorie, Heft 4, pp.314-325, 2000.
Fletcher, T.J. (1956). Campanological groups. American Math. Monthly, 63/9,
619-626.
Fletcher, N.H. and Rossing, T.D. (1991). The Physics of Musical Instruments.
Springer, Berlin/New York.
Flury, B. and Riedwyl, H. (1988). Multivariate Statistics: A Practical Approach.
Cambridge University Press, Cambridge, UK.
Forte, A. (1964). A theory of set-complexes for music. JMT, 8, No. 2, 136-183.
Forte, A. (1973). Structure of atonal music. Yale University Press, New Haven.
Forte, A. (1989). La set-complex theory: elevons les enjeux! Analyse musicale,
4eme trimestre, 80-86.
Fox, R. and Taqqu, M.S. (1986). Large sample properties of parameter estimates
for strongly dependent stationary Gaussian time series. Ann. Statisics., Vol.
14, 517-532.
Friedman, J.H. (1977). A recursive partitioning decision rule for nonparametric
classication. IEEE Transactions on Computers, Vol. 26, No. 4, 404-408.
Fripertinger, H. (1991). Enumeration in music theory. Sminaire Lotharingien de
e
Combinatoire, 26, 29-42.
Fripertinger, H. (1999). Enumeration and construction in music theory. Diderot
Forum on Mathematics and Music Computational and Mathematical Methods
in Music, Vienna, Austria. December 24, 1999. H. G. Feichtinger and M.
Drer, editors. sterreichische Computergesellschaft, 179-204.
Fripertinger, H. (2001). Enumeration of non-isomorphic canons. Tatra Mountains
Math. Publ., 23.
Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition (2nd ed.).
Academic Press, New York.
Gasser, T. and Mller, H.G. (1979). Kernel estimation of regression functions.
u
In: Smoothing Techniques for Curve Estimation. Gasser, T., Rosenblatt, M.
(Eds.), Springer, New York, pp. 23-68.
Gasser, T. and Mller, H.G. (1984). Estimating regression functions and their
u
derivatives by the kernel method. Scand. J. Statist., Vol. 11, 171-185.
Gasser, T., Mller, H.G., and Mammitzsch, V. (1985). Kernels for nonparametric
u
curve estimation. J. R. Statist. Soc., Ser. B, Vol. 47, 238-252.
Genevois, H. and Orlarey, Y. (1997). Musique et Mathmatiques. Alas-Grame,
e
e
Lyon.
Gervini, D. and Yohai, V.J. (2002). A class of robust and fully ecient regression
estimators. Ann. Statistics, Vol. 30, 583-616.
Ghosh, S. (1996). A new graphical tool to detect non-normality. J. R. Statist.
Society, Series B, Vol. 58, 691-702.
Ghosh, S. (1999). T3-plot. In: Encyclopedia for Statistical Sciences, Update volume 3, (S. Kotz ed.), pp. 739-744, Wiley, New York.
Ghosh, S. and Beran, J. (2000). Comparing two distributions: The two sample
T3 plot. J. Computational Graphical Statist., Vol. 9, No. 1, 167-179.
W.J. Gilbert (2002) Modern Algebra with Applications. Wiley, New York.
Ghosh, S. and Draghicescu, D.(2002a). Predicting the distribution function for

2004 CRC Press LLC

long-memory processes. Int. J. Forecasting, 18, 283-290.


Ghosh, S., Draghicescu, D. (2002b). An algorithm for optimal bandwidth selection for smooth nonparametric quantiles and distribution functions. In: Statistics in Industry and Technology: Statistical Data Analysis Based on the L1Norm and Related Methods. Dodge Y. (Ed.), Birkhuser Verlag, Basel, Switzera
land, pp. 161-168.
Ghosh, S., Beran, J. and Innes, J. (1997). Nonparametric conditional quantile
estimation in the presence of long memory. Student - Special issue on the
conference on L1-Norm and related methods, Vol. 2, 109-117.
Gilks, W.R., Richardson, S., and Spiegelhalter, D.J. (Eds.) (1996). Markov Chain
Monte Carlo in Practice. Chapman & Hall, London.
Goldman, S. (1953). Information Theory. Prentice Hall, New Jersey.
Good, P.I. (2001). Resampling Methods. Birkhuser, Basel.
a
Gordon, A.D. (1999). Classication (2nd ed.). Chapman and Hall, London.
Gtze, H. and Wille, R. (Eds.) (1985). Musik und Mathematik (Salzburger
o
Musikgesprch 1984 unter Vorsitz von Herbert von Karajan). Springer, Berlin.
Graeser, W. (1924). Bachs Kunst der Fuge. In: Bach-Jahrbuch, 1924.
Gra, K.F. (1975). Wave Motion in Elastic Solids. Oxford University Press.
(reprinted by Dover, 1991).
Granger, C.W.J. and Joyeux, R. (1980). An introduction to long-range time series
models and fractional dierencing. J. Time Series Anal., Vol. 1, 15-30.
Grenander, U. and Szeg, G. (1958). Toeplitz Forms and Their Application. Univ.
o
California Press, Berkeley.
Grey, J. (1977). Multidimensional perceptual scaling of musical timbre. J. Acoustical Soc. America, Vol. 62, 1270-1277.
Grey, J. and Gordon, J. (1978). Perceptual Eects of spectralmodications on
musical timbres. J. Acoust. Soc. America, 63, 1493-1500.
Gromko, J.E. (1993). Perceptual Dierences between expert and novice music
listeners at multidimensional scaling analysis. Psychology of Music, 21, 34-47.
Guttman, L. (1954). A new approach to factor analysis: the radex. In: Mathematical thinking in the behavioral sciences, P. Lazarsfeld (Ed.). Free Press, New
York, pp. 258-348.
Guttman, L. (1968). A general non-metric technique for nding the smallest coordinate space for a conguration of points. Psychometrika, 33, 469-506.
Hall, D.E. (1980). Musical Acoustics. Wadsworth Publishing Company: Belmont,
California.
Halsey, D. and Hewitt, E. (1978). Eine gruppentheoretische Methode in der
Musiktheorie. Jaresbericht der Duetschen Math. Vereinigung, Vol. 80.
Hampel, F.R., Ronchetti, E., Rousseeuw, P., and Stahel, W.A. (1986). Robust
Statistics: The Approach based on Inuence Functions. Wiley, New York.
Hand, D.J. (1986). Discrimination and Classication. Wiley, New York.
Hand, D.J., Mannila, H., and Smyth, P. (2001). Principles of Data Mining. MIT
Press, Cambdridge (USA).
Hannan, E.J. (1973). The estimation of frequency. J. Appl. Probab., Vol. 10,
510-519.
Hannan, E.J. and Quinn, B.G. (1979). The determination of the order of an
autoregression. J. R. Statist. Soc., Series B, Vol. 41, 190-195.

2004 CRC Press LLC

Härdle, W. (1991). Smoothing Techniques. Springer, New York.
Härdle, W., Kerkyacharian, G., Picard, D., and Tsybakov, A. (1998). Wavelets, Approximation, and Statistical Applications. Lecture Notes in Statistics, No. 129. Springer, New York.
Hartigan, J.A. (1975). Clustering Algorithms. Wiley, New York.
Hartley, R.V. (1928). Transmission of information. Bell Syst. Techn. J., 535-563.
Hassan, T. (1982). Nonlinear time series regression for a class of amplitude modulated cosinusoids. J. Time Series Analysis, Vol. 3, 109-122.
Hastie, T., Tibshirani, R., and Buja, A. (1994). Flexible discriminant analysis by
optimal scoring. JASA, Vol. 89, 1255-1270.
Hastie, T., Tibshirani, R., and Friedman, J.H. (2001). The Elements of Statistical
Learning: Data Mining, Inference, and Prediction. Springer, New York.
Hausdorff, F. (1919). Dimension und äusseres Mass. Mathematische Annalen, Vol. 79, 157-179.
von Helmholtz, H. (1863). Die Lehre von den Tonempfindungen als physiologische
Grundlage der Musik, Reprinted in Darmstadt, 1968.
Herstein, I.N. (1975). Topics in Algebra. Wiley, New York.
Hirst, D. (1996). Error-rate estimation in multiple-group linear discriminant analysis. Technometrics, Vol. 38, 389-399.
Hjort, N.L. and Glad, I.K. (1995). Nonparametric density estimation with a parametric start. Ann. Statistics, Vol. 23, No. 3, 882-904.
Hofstadter, D.R. (1999). Gödel, Escher, Bach. Basic Books, New York.
Höppner, F., Klawonn, F., Kruse, R., and Runkler, T. (1999). Fuzzy Cluster Analysis. Wiley, New York.
Hosking, J.R.M. (1981). Fractional differencing. Biometrika, Vol. 68, 165-176.
Howard, D.M. and Angus, J. (1996). Acoustics and Psychoacoustics, Focal Press.
Huber, P. (1981). Robust Statistics. Wiley, New York.
Huberty, C.J. (1994). Applied Discriminant Analysis. Wiley, New York.
Hurvich, C.M. and Ray, B.K. (1995). Estimation of the memory parameter for
nonstationary or noninvertible fractionally integrated processes. J. Time Series
Anal., Vol. 16, 17-41.
Irizarry, R.A. (1998). Statistics and music: fitting a local harmonic model to
musical sound signals. PhD thesis, University of California, Berkeley.
Irizarry, R.A. (2000). Asymptotic distribution of estimates for a time-varying
parameter in a harmonic model with multiple fundamentals. Statistica Sinica,
Vol. 10, 1041-1067.
Irizarry, R.A. (2001). Local harmonic estimation in musical sound signals. JASA,
Vol. 96, No. 454, 357-367.
Irizarry, R.A. (2002). Weighted estimation of harmonic components in a musical
sound signal. J. Time Series Anal., Vol. 23, 29-48.
Isaacson, D.L. and Madsen, R.W. (1976). Markov Chains: Theory and Applications. Wiley, New York.
Jaffard, S., Meyer, Y., and Ryan, R. (2001). Wavelets: Tools for Science and
Technology. SIAM, Philadelphia.
Jajuga, K., Sokołowski, A., and Bock, H.H. (Eds.) (2002). Statistical Pattern
Recognition. Springer, New York.
Jammalamadaka, S.R. and SenGupta, A. (2001). Topics in Circular Statistics. Series on Multivariate Analysis, Vol. 5. World Scientific, River Edge, NJ.
Jansen, M. (2001). Noise Reduction by Wavelet Thresholding. Lecture Notes in
Statistics, No. 161. Springer, New York.
Jardine, N. and Sibson, R. (1971). Mathematical Taxonomy. Wiley, New York.
Johnson, J. (1997). Graph Theoretical Methods of Abstract Musical Transformation. Greenwood Publishing Group, London.
Johnson, R.A. and Wichern, D.W. (2002). Applied Multivariate Statistical Analysis. Prentice Hall, New Jersey.
Johnston, I. (1989). Measured Tones: The Interplay of Physics and Music. Institute of Physics Publishing, Bristol and Philadelphia.
Joshi, D.D. (1957). L'information en statistique mathématique et dans la théorie des communications. PhD thesis, Faculté des Sciences de l'Université de Paris.
Juang, B.H. and Rabiner, L.R. (1991). Hidden Markov models for speech recognition. Technometrics, Vol. 33, 251-272.
Kaiser, G. (1994). A Friendly Guide to Wavelets. Birkhäuser, Boston.
Keil, W. (1991). Gibt es den Goldenen Schnitt in der Musik des 16. bis 19. Jahrhunderts? Eine kritische Untersuchung rezenter Forschungen. Augsburger Jahrbuch für Musikwissenschaft, Vol. 8, 1991, pp. 7-70. Schneider, Tutzing, Germany.
Kelly, J.P. (1991). Hearing. In: Principles of Neural Science, E.R. Kandel, J.H.
Schwarz, T.M. Jessel (Eds.), Elsevier, New York, pp. 481-499.
Kemeny, J.G., Snell, J.L., and Knapp, A.W. (1976). Denumerable Markov Chains.
Springer, New York.
Khinchin, A.I. (1953). The entropy concept in probability theory. Uspekhi Matematicheskikh Nauk, Vol. 8, No. 3 (55), 3-20 (Russian).
Khinchin, A.I. (1956). On the fundamental theorems of information theory. Uspekhi Matematicheskikh Nauk, Vol. 11, No. 1 (67), 17-75 (Russian).
Kinsler, L.E., Frey, A.R., Coppens, A.B., and Sanders, J.V. (2000) Fundamentals
of Acoustics, (4th ed.). Wiley, New York.
Klecka, W.R. (1980). Discriminant Analysis. Sage, London.
Kolmogorov, A.N. (1956). On the Shannon theory of information transmission
in the case of continuous signals. IRE Trans. on Inform. Theory, Vol. IT-2,
102-108.
Kono, N. (1986). Hausdorff dimension of sample paths for self-similar processes. In: Dependence in Probability and Statistics, E. Eberlein and M.S. Taqqu (Eds.), Birkhäuser, Boston.
Kruskal, J.B. (1964a). Multidimensional scaling by optimizing goodness of fit to
a nonmetric hypothesis. Psychometrika, 29, 1-27.
Kruskal, J.B. (1964b). Nonmetric multidimensional scaling: a numerical method.
Psychometrika, 29, 115-129.
Kruskal, J.B. and Wish, M. (1978). Multidimensional Scaling. Sage, London.
Krzanowski, W.J. (1988). Principles of Multivariate Analysis. Oxford University
Press, Oxford.
Kullback, S. (1959). Information Theory and Statistics. Wiley, New York.
Lanciani, A. (2001). Mathématiques et musique: les labyrinthes de la phénoménologie. Éditions Jérôme Millon, Grenoble.
Läuter, H. (1985). An efficient estimator for the error rate in discriminant analysis. Statistics, Vol. 16, 107-119.
Lamperti, J.W. (1962). Semi-stable stochastic processes. Trans. American Math.
Soc., Vol. 104, 62-78.
Lamperti, J.W. (1972). Semi-stable Markov processes. Z. Wahrsch. verw. Geb.,
Vol. 22, 205-225.
LeBlanc, M. and Tibshirani, R. (1996). Combining estimates in regression and
classification. JASA, Vol. 91, 1641-1650.
Lendvai, E. (1993). Symmetries of Music. Kodály Institute, Kecskemét.
Levinson, S.E., Rabiner, L.R., and Sondhi, M.M. (1983). An introduction to the
application of the theory of probabilistic functions of a Markov process to
automatic speech recognition. Bell Systems Tech. J., Vol. 62, 1035-1074.
Lewin, D. (1987). Generalized Musical Intervals and Transformations. Yale University Press, New Haven/London.
Leyton, M. (2001). A Generative Theory of Shape. Springer, New York.
Licklider, J.R.C. (1951). A duplex theory of pitch perception. Experientia, Vol. 7,
128-134.
Ligges, U., Weihs, C., Hasse-Becker, P. (2002). Detection of locally stationary
segments in time series. In: Proceedings in Computational Statistics, W. Härdle, B. Rönz (Eds.), pp. 285-290.
Lindley, M. and Turner-Smith, R. (1993). Mathematical Models of Musical Scales.
Verlag für systematische Musikwissenschaft GmbH, Bonn.
Lingoes, J.C. and Roskam, E.E. (1973). A mathematical and empirical analysis
of two multidimensional scaling algorithms. Psychometrika, 38, Monograph
Suppl. No. 19.
MacDonald, I.L. and Zucchini, W. (1997). Hidden Markov and Other Models for
Discrete-valued Time Series. Chapman & Hall, London.
Mallat, S. (1998). A Wavelet Tour of Signal Processing. Academic Press, London.
Mandelbrot, B.B. (1953). Contribution à la théorie mathématique des jeux de communication. Publs. Inst. Statist. Univ. Paris, Vol. 2, Fasc. 1 et 2, 3-124.
Mandelbrot, B.B. (1956). An outline of a purely phenomenological theory of
statistical thermodynamics: I. canonical ensembles. IRE Trans. on Inform.
Theory, Vol. IT-2, 190-203.
Mandelbrot, B.B. (1977). Fractals: Form, Chance and Dimension. Freeman &
Co., San Francisco.
Mandelbrot, B.B. (1983). The Fractal Geometry of Nature. Freeman & Co., San
Francisco.
Mandelbrot, B.B. and van Ness, J.W. (1968). Fractional Brownian motions, fractional noises and applications. SIAM Review, Vol. 10, No.4, 422-437.
Mandelbrot, B.B. and Wallis, J.R. (1969). Computer experiments with fractional
Gaussian noises. Water Resour. Res., Vol. 5, No.1, 228-267.
Mardia, K.V. (1972). Statistics of Directional Data. Academic Press, London.
Mardia, K.V., Kent, J.T. and Bibby, J.M. (1979). Multivariate Analysis. Academic Press, London.

Markuse, B. and Schneider, A. (1995). Ähnlichkeit, Nähe, Distanz: Zur Anwendung multidimensionaler Skalierung in musikwissenschaftlichen Untersuchungen. In: Festschrift für Jobst Peter Fricke zum 65. Geburtstag, W. Auhagen, B. Gätjen and K. Niemöller (Eds.), Musikwissenschaftliches Institut der Universität zu Köln (http://www.uni-koeln.de/phil-fak/muwi/publ/fs fricke/festschrift.html).
Matheron, G. (1973). The intrinsic random functions and their applications. Adv.
Appl. Prob., Vol. 5, 439-468.
Mathieu, E. (1861). Mémoire sur l'étude des fonctions de plusieurs quantités. J. Math. Pures Appl., Vol. 6, 241-243.
Mathieu, E. (1873). Sur la fonction cinq fois transitive de 24 quantités. J. Math. Pures Appl., Vol. 18, 25-46.
Mazzola, G. (1985). Gruppen und Kategorien in der Musik. Heldermann-Verlag,
Berlin.
Mazzola, G. (1990a). Geometrie der Töne. Birkhäuser, Basel.
Mazzola, G. (1990b). Synthesis. SToA music 1001.90, Zürich.
Mazzola, G. (1989/1994). Presto. SToA music, Zürich.
Mazzola, G. (2002). The Topos of Music. Birkhäuser, Basel.
Mazzola, G. and Beran, J. (1998). Rational composition of performance. In: Controlling Creative Processes in Music, W. Auhagen, R. Kopiez (Eds.), Staatliches Institut für Musikforschung (Berlin), Lang Verlag, Frankfurt/New York.
Mazzola, G., Zahorka, O. and Stange-Elbe, J. (1995). Analysis and Performance
of a Dream. In: Proceedings of the 1995 Symposium on Musical Performance,
J. Sundberg (ed.), KTH, Stockholm.
McLachlan, G.J. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York.
McMillan, B. (1953). The basic theorems of information theory. Ann. Math.
Statistics, 24, 196-219.
Meyer, Y. (1992). Wavelets and Operators. Cambridge University Press, Cambridge.
Meyer, Y. (1993). Wavelets: Algorithms and Applications. SIAM, Philadelphia,
PA.
Morris, R.D. (1987). Composition with Pitch-Classes. Yale University Press, New
Haven.
Morris, R.D. (1995). Compositional spaces and other territories. Persp. New Music, 33, 328-358.
Morse, P.M. and Ingard, K.U. (1968). Theoretical Acoustics. McGraw Hill.
(Reprinted by Princeton University Press 1986.)
Mosteller, F. and Tukey, J.W. (1977). Data Analysis and Regression. Addison-Wesley, Reading, MA.
Nadaraya, E.A. (1964). On estimating regression. Theory of Probability and its
Applications, Vol. 9, 141-142.
Nederveen, C.J. (1998). Acoustical Aspects of Woodwind Instruments. Northern
Illinois University Press, de Kalb.
Nettheim, N. (1997). A Bibliography of Statistical Applications in Musicology.
Musicology Australia, Vol. 20, 94-106.
Newton, H.J. and Pagano, M. (1983). A method for determining periods in time
series. JASA, Vol. 78, 152-157.
Noll, T. (1997). Harmonische Morpheme. Musikometrika, Vol. 8, 7-32.
Norden, H. (1964). Proportions in Music. Fibonacci Quarterly, Vol. 2, 219.
Norris, J.R. (1998). Markov Chains. Cambridge University Press, Cambridge.
Ogden, R.T. (1996). Essential Wavelets for Statistical Applications and Data Analysis. Birkhäuser, Boston.
Orbach, J. (1999). Sound and Music. University Press of America, Lanham, MD.
Parzen, E. (1962). On estimation of a probability density function and mode.
Ann. Math. Statistics, Vol. 33, 1065-1076.
Peitgen, H.-O. and Saupe, D. (1988). The Science of Fractal Images. Springer,
New York.
Percival, D.B. and Walden, A.T. (2000). Wavelet Methods for Time Series Analysis. Cambridge University Press, Cambridge, UK.
Perle, G. (1955). Symmetric formations in the string quartets of Béla Bartók. Music Review, 16, 300-312.
Pierce, J.R. (1983). The Science of Musical Sound. Scientific American Books, New York (2nd ed. printed by W.H. Freeman & Co., 1992).
Plackett, R.L. (1960). Principles of Regression Analysis. Clarendon Press, Oxford.
Polzehl, J. (1995). Projection pursuit discriminant analysis. Computational
Statist. Data Anal., Vol. 20, 141-157.
Price, B.D. (1969). Mathematical groups in campanology. Math. Gaz., 53, 129-133.
Priestley, M.B. (1965). Evolutionary spectra and non-stationary processes. J. R.
Statist. Soc., Series B, Vol. 27, 204-237.
Priestley, M.B. (1981a). Spectral Analysis and Time Series, (Vol. 1): Univariate
Time Series. Academic Press, New York.
Priestley, M.B. (1981b). Spectral Analysis and Time Series, (Vol. 2): Multivariate
Series, Prediction and Control. Academic Press, New York.
Quinn, B.G. and Thomson, P.J. (1991) Estimating the frequency of a periodic
function. Biometrika, Vol. 78, No. 1, 65-74.
Rahn, J. (1980). Basic Atonal Theory. Longman, New York.
Raichel, D.R. (2000). The Science and Applications of Acoustics. American Inst.
of Physics, College Park, PA.
Ramsay, J.O. (1977). Maximum likelihood estimation in multidimensional scaling. Psychometrika, 42, 241-266.
Raphael, C.S. (1999). Automatic segmentation of acoustic music signals using
hidden Markov models. IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. 21, No. 4, 360-370.
Raphael, C.S. (2001a). A probabilistic expert system for automatic musical accompaniment. J. Computational Graphical Statist., Vol. 10, No. 3, 487-512.
Raphael, C.S. (2001b). Synthesizing musical accompaniment with Bayesian belief
networks. J. New Music Res., Vol. 30, No. 1, 59-67.
Rao, C.R. (1973). Linear Statistical Inference and its Applications (2nd ed.).
Wiley & Sons, New York.
Rayleigh, J.W.S. (1896). The Theory of Sound (2 vols), 2nd ed., Macmillan,
London (Reprinted by Dover, 1945).
Read, R.C. (1997). Combinatorial problems in the theory of music. Discrete Mathematics, 167/168, 543-551.
Reiner, D. (1985). Enumeration in music theory, American Math. Monthly, 92/1,
51-54.
Rényi, A. (1959a). On the dimension and entropy of probability distributions. Acta Math. Acad. Sci. Hung., Vol. 10, 193-215.
Rényi, A. (1959b). On a theorem of P. Erdős and its applications in information theory. Mathematica Cluj, Vol. 1, No. 24, 341-344.
Rényi, A. (1961). On measures of entropy and information. Proc. Fourth Berkeley Symposium on Math. Stat. Prob., Vol. I, Univ. California Press, Berkeley, 547-561.
Rényi, A. (1965). On foundations of information theory. Review of the International Statistical Institute, Vol. 33, 1-14.
Rényi, A. (1970). Probability Theory. North Holland, Amsterdam.
Repp, B. (1992). Diversity and Communality in Music Performance: An Analysis of Timing Microstructure in Schumann's Träumerei. J. Acoust. Soc. Am., 92, 2546-2568.
Rigden, J.S. (1977). Physics and the Sound of Music. Wiley, New York.
Ripley, B. (1995). Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.
Rodet, X. (1997). Musical sound signals analysis/synthesis: sinusoidal+residual
and elementary waveform models. Appl. Signal Processing, 4, 131-141.
Roederer, J.G. (1995). The Physics and Psychophysics of Music. Springer,
Berlin/New York.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density
function. Ann. Math. Statistics, Vol. 27, 832-837.
Rossing, T.D. (ed.) (1984). Acoustics of Bells. Van Nostrand Reinhold, New York.
Rossing, T.D. (1990). The Science of Sound (2nd ed.). Addison-Wesley, Reading,
MA.
Rossing, T.D. (2000). Science of Percussion Instruments. World Scientific, London.
Rossing, T.D. and Fletcher, N.H. (1995). Principles of Vibration and Sound.
Springer, Berlin/New York.
Rotman, J.J. (2002). Advanced Modern Algebra. Prentice Hall, New Jersey.
Rousseeuw, P. and Yohai, V.J. (1984). Robust regression by means of S-estimators. In: Robust Nonlinear Time Series Analysis, J. Franke, W. Härdle,
and D. Martin (Eds.), Lecture Notes in Statistics, Vol. 26, 256-277, Springer,
New York.
Ruppert, D. and Wand, M.P. (1994). Multivariate locally weighted least squares
regression. Ann. Statistics, Vol. 22, 1346-1370.
Ryan, T.P. (1997). Modern Regression Methods. Wiley, New York.
Scheffé, H. (1959). The Analysis of Variance. Wiley, New York.
Schnitzler, G. (1976). Musik und Zahl. Verlag für systematische Musikwissenschaft, Bonn.
Schönberg, A. (1950). Die Komposition in 12 Tönen. In: Style and Idea, New York.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., Vol. 6,
461-464.
Seber, G.A.F. (1984). Multivariate Observations. Wiley, New York.
Serra, X. and Smith, J.O. (1991). Spectral modeling synthesis: A sound analysis/synthesis system based on deterministic plus stochastic decomposition.
Computer Music J., Vol. 14, No. 4, 12-24.
Shannon, C.E. (1948). A mathematical theory of communication. Bell Syst. Techn. J., Vol. 27, 379-423.
Shannon, C.E. and Weaver, W. (1949). The Mathematical Theory of Communication. Univ. Illinois Press, Urbana.
Shepard, R.N. (1962a). The analysis of proximities: multidimensional scaling with
unknown distance function Part I. Psychometrika, 27, 125-140.
Shepard, R.N. (1962b). The analysis of proximities: multidimensional scaling with
unknown distance function Part II. Psychometrika, 27, 219-246.
Schiffman, S. (1997). Introduction to Multidimensional Scaling: Theory, Methods, and Applications. Academic Press, New York.
Shumway, R. and Stoffer, D.S. (2000). Time Series Analysis and Its Applications.
Springer, New York.
Silverman, B. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
Simonoff, J.S. (1996). Smoothing Methods in Statistics. Springer, New York.
Sinai, Y.G. (1976). Self-similar probability distributions. Theory Probab. Appl.,
Vol. 21, 64-80.
Slaney, M. and Lyon, R.F. (1991). Apple Hearing Demo Reel. Apple Technical Report No. 25, Apple Computer Inc., Cupertino, CA.
Solo, V. (1992). Intrinsic random fluctuations. SIAM Appl. Math., Vol. 52, 270-291.
Solomon, L.J. (1973). Symmetry as a determinant of musical composition. PhD
thesis, University of West Virginia.
Srivastava, M. and Sen, A.K. (1997). Regression Analysis: Theory, Methods and
Applications. Springer, New York.
Stamatatos, E. and Widmer, G. (2002). Music performer recognition using an ensemble of simple classifiers. Austrian Research Institute for Artificial Intelligence, Vienna, TR-2002-02.
Stange-Elbe, J. (2000). Analyse- und Interpretationsperspektiven zu J.S. Bachs Kunst der Fuge mit Werkzeugen der objektorientierten Informationstechnologie. Habilitation thesis, University of Osnabrück.
Steinberg, R. (ed.) (1995). Music and the Mind Machine. Springer, Heidelberg.
Stewart, I. (1992). Another Fine Math You've Got Me Into... W.H. Freeman.
Stoyan, D. and Stoyan, H. (1994). Fractals, Random Shapes and Point Fields:
Methods of Geometrical Statistics. Wiley, New York.
Straub, H. (1989). Beiträge zur modultheoretischen Klassifikation musikalischer Motive. Diploma thesis, ETH Zürich.
Taylor, R. (1999a). Fractal analysis of Pollock's drip paintings. Nature, Vol. 399, p. 422.
Taylor, R. (1999b). Fractal Expressionism. Physics World, Vol. 12, No. 10, p. 25.
Taylor, R. (1999c). Fractal expressionism: where art meets science. In: Art and Complexity, J. Casti (Ed.), Perseus Press, Vol.
Taylor, R. (2000). The use of science to investigate Jackson Pollock's drip paintings. Art and the Brain, Journal of Consciousness Studies, Vol. 7, No. 8-9, p. 137.
Telcs, A. (1990). Spectra of graphs and fractal dimensions. Probab. Th. Rel.
Fields, Vol. 82, 435-449.
Thumfart, A. (1995). Discrete Evolutionary Spectra and their Application to a Theory of Pitch Perception. StatLab Heidelberg, Beiträge zur Statistik, No. 30.
Tricot, C. (1995). Curves and Fractal Dimension. Springer, New York.
Tufte, E. (1983). The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT.
Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley, Reading, MA.
Tukey, P.A. and Tukey, J.W. (1981). Graphical display of data sets in 3 or more
dimensions. In: Interpreting Multivariate Data, V. Barnett (ed.), Wiley, Chichester, UK.
Ueda, K. and Ohgushi, K. (1987). Perceptual components of pitch: spatial representation using a multidimensional scaling technique. J. Acoust. Soc. Am.,
82, 1193-1200.
Velleman, P. and Hoaglin, D. (1981). The ABCs of EDA: Applications, Basics,
and Computing of Exploratory Data Analysis. Duxbury, Belmont, CA.
Vidakovic, B. (1999). Statistical Modeling by Wavelets. John Wiley, New York.
Voss, R.F. and Clarke, J. (1975). 1/f noise in music and speech. Nature, Vol. 258,
317-318.
Voss, R.F. and Clarke, J. (1978). 1/f noise in music: music from 1/f noise. J.
Acoust. Soc. America, Vol. 63, 258-263.
Voss, R.F. (1988). Fractals in nature: From characterization to simulation. In:
Science of fractal images, H.-O. Peitgen and D. Saupe (Eds.), Springer, Berlin,
pp. 26-69.
Vuza, D.T. (1991). Supplementary sets and regular complementary unending
canons (part one). Persp. New Music, Vol. 29, No. 2, 22-49.
Vuza, D.T. (1992a). Supplementary sets and regular complementary unending canons (part two). Persp. New Music, Vol. 30, No. 1, 184-207.
Vuza, D.T. (1992b). Supplementary sets and regular complementary unending
canons (part three). Persp. New Music, Vol. 30, No. 2, 102-125.
Vuza, D.T. (1993). Supplementary sets and regular complementary unending
canons (part four). Persp. New Music, Vol. 31, No. 1, 270-305.
van der Waerden, B.L. (1979). Die Pythagoreer. Artemis, Zürich.
Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
Walker, A.M. (1971). On the estimation of a harmonic component in a time series
with stationary independent residuals. Biometrika, Vol. 58, 21-36.
Walmsley, P.J., Godsill, S.J., and Rayner, P.J.W. (1999). Bayesian graphical models for polyphonic pitch tracking. In: Diderot Forum on Mathematics and Music: Computational and Mathematical Methods in Music, Vienna, Austria, December 2-4, 1999, H.G. Feichtinger and M. Dörfler (Eds.), Österreichische Computergesellschaft.
Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman and Hall,
London.
Watson, G. (1964). Smooth regression analysis. Sankhya, Series A, Vol. 26, 359-372.
Watson, G. (1983). Statistics on Spheres. Wiley, New York.
Waugh, W.A. (1996). Music, probability, and statistics. In: Encyclopedia of Statistical Sciences, S. Kotz, C.B. Read, and D.L. Banks (Eds.), 6, 134-137.
Webb, A.R. (2002). Statistical Pattern Recognition (2nd ed.). Wiley, New York.
Wedin, L. (1972). Multidimensional scaling of emotional expression in music. Svensk Tidskrift för Musikforskning, 54, 115-131.
Wedin, L. and Goude, G. (1972). Dimension analysis of the perception of musical
timbre. Scand. J. Psychol., 13, 228-240.
Weihs, C., Berghoff, S., Hasse-Becker, P., and Ligges, U. (2001). Assessment of Purity of Intonation in Singing Presentations by Discriminant Analysis. In: Mathematical Statistics and Biometrical Applications, J. Kunert and G. Trenkler (Eds.), pp. 395-410.
White, A.T. (1983). Ringing the changes. Math. Proc. Camb. Phil. Soc., 94, 203-215.
White, A.T. (1985). Ringing the changes II. Ars Combinatorica, 20-A, 65-75.
White, A.T. (1987). Ringing the cosets. American Math. Monthly 94/8, 721-746.
Whittle, P. (1953). Estimation and information in stationary time series. Ark.
Mat., Vol. 2, 423-434.
Widmer, G. (2001). Discovering Simple Rules in Complex Data: A Meta-learning
Algorithm and Some Surprising Musical Discoveries. Austrian Research Institute for Artificial Intelligence, Vienna, TR-2001-31.
Wiener, N. (1948). Cybernetics or control and communication in the animal and
the machine. Act. Sci. Indust., No. 1053, Hermann et Cie, Paris.
Wilson, W.G. (1965). Change Ringing. October House Inc., New York.
Wolfowitz, J. (1957). The coding of messages subject to chance errors. Illinois J.
Math., Vol. 1, 591-606.
Wolfowitz, J. (1958). Information theory for mathematicians. Ann. Math. Statistics, Vol. 29, 351-356.
Wolfowitz, J. (1961). Coding Theorems of Information Theory. Springer, Berlin.
Woodward, P.M. (1953). Probability and Information Theory with Applications
to Radar. Pergamon Press, London.
Xenakis, I. (1971). Formalized Music: Thought and Mathematics in Composition.
Indiana University Press, Bloomington/London.
Yaglom, A.M. and Yaglom, I.M. (1967). Wahrscheinlichkeit und Information.
Deutscher Verlag der Wissenschaften, Berlin.
Yost, W.A. (1977). Fundamentals of Hearing. An Introduction. Academic Press,
San Diego.
Yohai, V.J. (1987). High breakdown-point and high efficiency robust estimates
for regression. Ann. Statistics, Vol. 15, 642-656.
Yohai, V.J., Stahel, W.A., and Zamar, R. (1991). A procedure for robust estimation and inference in linear regression. In: Directions in robust statistics and
diagnostics, Part II, W.A. Stahel, and S.W. Weisberg (Eds.), Springer, New
York.
Young, G. and Householder, A. S. (1941). A note on multidimensional psychophysical analysis. Psychometrika, 6, 331-333.
Zassenhaus, H.J. (1999). The Theory of Groups. Dover, Mineola.
Zivot, E. and Wang, J. (2002). Modeling Financial Time Series with S-Plus.
Springer, New York.