Singing Voice Synthesis: History, Current Work, and Future Directions
Author(s): Perry R. Cook
Source: Computer Music Journal, Vol. 20, No. 3 (Autumn 1996), pp. 38-46. Published by The MIT Press. Stable URL: http://www.jstor.org/stable/3680822

Perry R. Cook
Department of Computer Science and Department of Music
Princeton University
Princeton, New Jersey, USA
PRC@cs.princeton.edu

This article briefly reviews the history of singing voice synthesis and highlights some currently active projects in this area. It surveys and discusses the benefits and trade-offs of different techniques and models, and points to performance control, some attractions of composing with vocal models, and exciting directions for future research.
© 1996 Massachusetts Institute of Technology.

Basic Vocal Acoustics

The voice can be characterized as consisting of one or more sources, such as the oscillating vocal folds or turbulence noise, and a system of filters whose properties are controlled by the shape of the vocal tract. By moving various articulators, we change the ways the sources and filters behave. The spectrum of the voice is characterized by resonant peaks called formants. The locations and shapes of these resonances are strong perceptual cues that humans use to differentiate and identify vowels and consonants. For a system to generate speech-like sounds, it should allow for manipulation of the resonant peaks of the spectrum, and also for manipulation of source parameters (voice pitch, noise level, etc.) independent of the resonances of the vocal tract. Voice pitch is commonly denoted f0, and the formant frequencies are commonly denoted f1, f2, f3, etc.

Figure 1 shows a vocal tract cross-section forming the vowel /i/ (as in "beet"), where the quasi-periodic oscillations of the vocal folds are shaped by the resonant filter of the vocal tract tube. The spectrum of the vowel shows the harmonics of the voice source outlining the peaks and valleys of the vocal tract response. Figure 2 shows the vocal tract cross-section for forming the consonant /ʃ/ ("shh"), where the "source" is not the vocal folds, but turbulence noise formed by forcing air through a constriction. Also shown is the noise-like spectrum of the consonant, with two principal formant peaks corresponding to the resonances of the vocal tract upstream from the noise source.
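The source/filter view above can be made concrete with a short numerical sketch. This is an illustrative example, not from the article: a hypothetical impulse-train "glottal" source is passed through a cascade of second-order all-pole resonators placed at textbook-average formant frequencies for an /a/-like vowel; the formant and bandwidth values are assumptions chosen only for demonstration.

```python
import numpy as np

def resonator_coeffs(fc, bw, fs):
    """Denominator coefficients of a two-pole resonator centered at
    fc Hz with bandwidth bw Hz, at sampling rate fs."""
    r = np.exp(-np.pi * bw / fs)                 # pole radius from bandwidth
    a1 = -2.0 * r * np.cos(2.0 * np.pi * fc / fs)
    a2 = r * r
    return a1, a2

def synthesize_vowel(f0, formants, bandwidths, fs=16000, dur=0.25):
    """Simplest source/filter vowel: an impulse train at f0 filtered
    through a cascade of formant resonators."""
    n = int(fs * dur)
    src = np.zeros(n)
    src[::int(round(fs / f0))] = 1.0             # quasi-periodic source at f0
    y = src
    for fc, bw in zip(formants, bandwidths):
        a1, a2 = resonator_coeffs(fc, bw, fs)
        out = np.zeros(n)
        for i in range(n):
            # direct-form recursion; negative indices read still-zero
            # samples at startup, so no special-casing is needed
            out[i] = y[i] - a1 * out[i - 1] - a2 * out[i - 2]
        y = out
    return y / np.max(np.abs(y))

# /a/-like formants (illustrative values): f1, f2, f3 and bandwidths
fs = 16000
vowel = synthesize_vowel(110.0, [700.0, 1200.0, 2600.0],
                         [130.0, 100.0, 160.0], fs)
```

Harmonics of the 110-Hz source that fall near the resonators' center frequencies are boosted, while harmonics between formants are attenuated, which is exactly the formant structure described above.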
A Brief History of Digital Singing (Speech) Synthesis

The earliest computer music project, at Bell Labs in the late 1950s, yielded a number of speech synthesis systems capable of singing, one being the acoustic tube model of Kelly and Lochbaum (1962). This model was actually an early physical model. At that time it was considered too computationally expensive for commercialization as a speech synthesizer, and too expensive to be practical for musical composition. Max Mathews worked with Kelly and Lochbaum to generate some early examples of singing synthesis (Computer Music Journal 1995; Wergo 1995).

Other techniques to arise from the early legacy of speech signal processing include the channel vocoder (VOice CODER) (Dudley 1939) and linear predictive coding (LPC) (Atal 1970; Makhoul 1975). In the vocoder, the spectrum is broken into sections called sub-bands, and the information in each sub-band is analyzed; parameters are then stored or transmitted for reconstruction at another time or site. The parametric data representing the information in each sub-band can be manipulated, yielding transformations such as pitch or time shifting, or spectral shaping. The vocoder does not strictly assume that the signal is speech, and thus generalizes to other sounds. The phase vocoder, implemented using the discrete Fourier transform, has found extensive use in computer music (Moorer 1978; Dolson 1986).

[Figure 1. Vocal tract shape and spectrum of the vowel /i/ (as in "beet"), showing formants and harmonics of the periodic voice source.]
[Figure 2. Vocal tract shape (left) and spectrum (right) of the consonant /ʃ/ ("shh"), showing a noisy spectrum with two formants.]
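The sub-band analysis/resynthesis idea can be illustrated with a toy FFT-based channel vocoder. This is a sketch of the general principle only, not any particular historical implementation: per frame, the energy in each of a handful of bands of a "modulator" signal is measured and imposed on the corresponding bands of a "carrier"; all parameter values here are assumptions for demonstration.

```python
import numpy as np

def channel_vocoder(modulator, carrier, n_bands=16, frame=256):
    """Toy FFT-based channel vocoder: transplant the modulator's
    per-band envelope onto the carrier, frame by frame, with
    75%-overlap add (Hann analysis and synthesis windows)."""
    hop = frame // 4
    win = np.hanning(frame)
    n = min(len(modulator), len(carrier))
    out = np.zeros(n)
    # band edges as FFT-bin indices, evenly spaced for simplicity
    edges = np.linspace(0, frame // 2 + 1, n_bands + 1).astype(int)
    for start in range(0, n - frame + 1, hop):
        M = np.fft.rfft(win * modulator[start:start + frame])
        C = np.fft.rfft(win * carrier[start:start + frame])
        Y = np.zeros_like(C)
        for b in range(n_bands):
            lo, hi = edges[b], edges[b + 1]
            m_rms = np.sqrt(np.mean(np.abs(M[lo:hi]) ** 2))
            c_rms = np.sqrt(np.mean(np.abs(C[lo:hi]) ** 2)) + 1e-12
            Y[lo:hi] = C[lo:hi] * (m_rms / c_rms)  # impose band envelope
        out[start:start + frame] += win * np.fft.irfft(Y, frame)
    return out

# demo: impose a 500-Hz sine's band envelope onto white noise
rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
voiced = channel_vocoder(np.sin(2.0 * np.pi * 500.0 * t),
                         rng.standard_normal(fs))
```

With a voice as modulator and an instrument as carrier, this same band-envelope transplant is one simple route to the cross-synthesis discussed later in the article.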
The introduction of linear predictive coding (Atal 1970) revolutionized speech technology, and had a great impact on musical composition as well (Moorer 1979; Steiglitz and Lansky 1981; Lansky 1989). With LPC, a time-varying filter is automatically designed that predicts the next value of the signal based on past samples. An error signal is produced which, if fed back through the time-varying filter, will yield exactly the original signal. The filter models linear correlations in the signal, which correspond to spectral features such as formants. The error signal models the input to the formant filter, and typically is periodic and impulsive for voiced speech, and noise-like for unvoiced speech.

The success of LPC in speech coding is largely due to the similarity between the source/filter decomposition yielded by the mathematics of linear prediction and the source/filter model of the human vocal tract. The power of LPC as a speech compression technique (Spanias 1994) stems from its ability to parametrically code and compress the source and filter parameters. The effectiveness of LPC as a compositional tool emerges from its ability to modify the parameters before resynthesis. There are weaknesses in LPC, however, related to the assumption of linearity inherent in the filter model. Also, all spectral properties are modeled in the filter. In actuality the voice has multiple possible sources of non-linear behavior, including source-tract coupling, non-linear wall vibration losses, and aerodynamic effects. Due to these deviations from the ideal source-filter model, the result of analysis/modification/resynthesis using LPC or a sub-band vocoder often sounds "buzzy."

Cross-Synthesis and Other Compositional Attractions of Vocal Models

The compositional interest in vocal analysis/synthesis has at least three foundations.
The first is rooted in the human as a linguistic organism: it seems in the nature of humans to find interest in voice-like sounds. Any technique or device that allows independent control over pitch and spectral peaks tends to produce sounds that are vocal in nature, and such sounds catch the interest of humans. The second interest, in systems that decompose sounds in a source/filter paradigm, is that they allow for cross-synthesis. Cross-synthesis involves the analysis of two instruments, typically a voice and a non-voice instrument, with the parameters exchanged and modified on resynthesis. This allows the resonances of the voice to be imposed on the source of a non-voice instrument. The third interest comes from the fact that once pitch and resonance structure are analyzed as they evolve in time, these three dimensions are independently available, to some extent, for manipulation on resynthesis. The elusive goals of being able to stretch time without changing pitch, to change pitch without changing timbral quality, etc., are all of high interest to computer music composers.

Other Popular Synthesis Techniques

Frequency modulation (FM) proved successful for singing synthesis (Chowning 1981, 1989) as well as for the synthesis of other sounds. As described in the communications literature, FM involves modulating the frequency of one oscillator with the output of another, creating a spread spectrum consisting of side-bands surrounding the original carrier (the oscillator that is modulated) frequency. In FM sound synthesis, both the carrier and modulator oscillators typically store a sinusoidal waveform and operate in the audio band. By controlling the amount of modulation, and by using multiple carrier/modulator pairs, spectra of somewhat arbitrary shape can be constructed. This technique proved efficient yet sufficiently flexible for music composition, and became the basis for the most successful commercial music synthesizers in history.
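The side-band structure of a single carrier/modulator pair is easy to verify numerically. The sketch below uses illustrative values (not from the article): a 1,000-Hz carrier phase-modulated at 100 Hz produces energy at fc ± k·fm, with amplitudes given by Bessel functions of the modulation index, and essentially no energy in between.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs                      # 1 second -> 1-Hz FFT bins
fc, fm, index = 1000.0, 100.0, 2.0          # carrier, modulator, mod. index

# one carrier/modulator pair: the carrier's phase is driven by the modulator
y = np.sin(2.0 * np.pi * fc * t + index * np.sin(2.0 * np.pi * fm * t))

spec = np.abs(np.fft.rfft(y * np.hanning(len(y))))
freqs = np.fft.rfftfreq(len(y), 1.0 / fs)

def amp(f):
    """Spectral magnitude at the bin nearest frequency f."""
    return spec[np.argmin(np.abs(freqs - f))]

# energy sits at fc and at the side-bands fc +/- k*fm (Bessel-function
# amplitudes), but not at in-between frequencies such as 1,050 Hz
```

Raising the modulation index spreads energy into more side-bands, which is how a single pair can fill out a formant-shaped region of spectrum.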
In vocal modeling, carriers placed near formant locations in the spectrum are modulated by a common modulator oscillator operating at the voice fundamental frequency.

Sinusoidal speech modeling (McAulay and Quatieri 1986) has been improved and applied to music synthesis by Julius Smith and Xavier Serra (Smith and Serra 1987; Serra and Smith 1990), Xavier Rodet and Philippe Depalle (1992), and others. These techniques use Fourier analysis to locate and track individual sinusoidal partials. Individual trajectories (tracks) of sinusoidal amplitude, frequency, and phase as a function of time are extracted from the time-varying peaks in a series of short-time Fourier transforms. To help define tracks, heuristics regarding physical systems, and the voice in particular, are used, such as the fact that a sinusoid should not appear, disappear, or change frequency or phase instantaneously. The sinusoids can be resynthesized from the track parameters, after modification or coding, by additive synthesis. Noise can be treated as rapidly varying sinusoids, or explicitly as a non-sinusoidal component.

Formant wave functions (FOFs, an acronym from the French term) were pioneered by Xavier Rodet (1984) at the Institut de Recherche et Coordination Acoustique/Musique (IRCAM). An FOF is a time-domain waveform model of the impulse response of an individual formant, characterized as a sinusoid at the formant center frequency with an amplitude that rises rapidly upon excitation and decays exponentially. By describing a spectral region as a windowed sinusoidal oscillation in the time domain, an FOF can be viewed as a special type of wavelet. The control parameters define the center frequency and bandwidth of the formant being modeled, and the rate at which the FOFs are generated and added determines the base frequency of the voice. The synthesis system for using FOFs was dubbed CHANT, and found application in general synthesis (Rodet, Potard, and Barriere 1984).
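A minimal FOF generator can make the idea concrete. This sketch is not CHANT's code; parameters and the two-formant setup are assumptions for illustration. One smoothly attacked, exponentially decaying sine grain per formant is overlap-added at every fundamental period, so the grain envelope sets the formant bandwidth and the repetition rate sets the pitch.

```python
import numpy as np

def fof_grain(fc, bw, fs, n, attack=0.002):
    """One formant wave function: a sinusoid at formant center
    frequency fc whose exponential decay rate sets the bandwidth bw,
    with a short raised-cosine attack."""
    t = np.arange(n) / fs
    env = np.exp(-np.pi * bw * t)
    n_a = max(1, int(attack * fs))
    env[:n_a] *= 0.5 * (1.0 - np.cos(np.pi * np.arange(n_a) / n_a))
    return env * np.sin(2.0 * np.pi * fc * t)

def fof_voice(f0, formants, fs=16000, dur=0.5):
    """Overlap-add one grain per formant at every fundamental period;
    the grain repetition rate determines the perceived pitch f0."""
    n = int(fs * dur)
    y = np.zeros(n)
    period = int(round(fs / f0))
    grains = [amp * fof_grain(fc, bw, fs, 4 * period)
              for fc, bw, amp in formants]
    for start in range(0, n, period):
        for g in grains:
            end = min(start + len(g), n)
            y[start:end] += g[:end - start]
    return y / np.max(np.abs(y))

# two illustrative formants: (center Hz, bandwidth Hz, amplitude)
voice = fof_voice(200.0, [(600.0, 80.0, 1.0), (1000.0, 90.0, 0.5)])
```

Because the grains repeat exactly once per period, the output spectrum is harmonic at f0, with the harmonics near each grain's center frequency boosted into a formant peak.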
Gerald Bennett and Xavier Rodet used CHANT to produce a number of impressive singing examples and compositions (Bennett and Rodet 1989).

Formant synthesizers, in which individual formants are modeled by second-order resonant filters, have been investigated by many speech researchers (Rabiner 1968; Klatt 1980). An attractive feature of formant synthesizers is that Fourier or LPC analysis can be used to automatically extract formant frequencies and source parameters from recorded speech. Charles Dodge used such techniques in a composition in 1973 (Dodge 1989). The group that has accomplished the most in the domain of singing synthesis using formant models is the Speech Transmission Laboratory (STL) of the Royal Institute of Technology (KTH), Stockholm. The STL MUSSE DIG (MUsic and Singing Synthesis Equipment, DIGital version) synthesizer (Carlson and Neovius 1990) has been used in singing synthesis (Zera, Gauffin, and Sundberg 1984) and in studying performance synthesis-by-rule (Sundberg 1989), and has been adapted for real-time control in performance (Carlson et al. 1991). KTH has conducted and published extensive research on speech, and has arguably produced the largest body of research on singing (Sundberg 1987) and music, both acoustics and performance. Robert C. Maher (1995) recently demonstrated singing synthesis using modified forms of the second-order resonant filter that lend themselves to parallel implementation.

Acoustic Tube Models of the Vocal Tract

Acoustic tube models solve the wave equation, usually in one dimension, inside a smoothly varying tube. The one-dimensional approximation is justified by noting that the length of the vocal tract is significantly larger than any width dimension, and thus the longitudinal modes dominate the resonance structure up to about 4,000 Hz. Modal standing waves in an acoustic tube correspond to the formants.
The basic Kelly and Lochbaum model (Kelly and Lochbaum 1962) critically samples space and time by approximating the smooth vocal tract tube with cylindrical segments equal in length to the distance traveled by a sound wave in one time sample. The SPASM and Singer systems (Cook 1992) are based on a physical model of the vocal tract filter, developed using the waveguide formulation (Smith 1987). This model is a direct descendent of the Kelly and Lochbaum model, but with many enhancements, such as a nasal tract, modeling of radiation through the throat wall, various steady and pulsed noise sources (Chafe 1990), and real-time controls. Shinji Maeda's (1982) model numerically integrates the wave equation using the rectangular method in space and the trapezoidal rule in time. Wall losses are also modeled, and an articulatory layer of control modifies the basic tube shape from higher-order descriptions like tongue and jaw position. René Carré's (1992) model is based on distinctive regions (DR) arising from sensitivity analysis, noting that movements in particular regions of the vocal tract affect formant frequencies more than movements in others. Hill, Manzara, and Taube-Schock (1995) have implemented a synthesis-by-rule system using a model based on distinctive regions, with libraries that include examples of singing synthesis. Liljencrants (1985) investigated an undersampled acoustic tube model and derived rules for modifying the shape without adding unnaturally to the energy contained within the vocal tract. The computer music research group in Helsinki (Välimäki and Karjalainen 1994) has used fractional sample interpolation and truncated conical tube segments to derive an improved version of the Kelly and Lochbaum model.
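A bare-bones Kelly-Lochbaum ladder shows the critical-sampling idea. The sketch below is a generic waveguide formulation (not the SPASM code, and with illustrative boundary reflection values): right- and left-going pressure waves propagate one section per sample and scatter at each area discontinuity. A uniform tube, nearly closed at the glottis and open at the lips, then resonates near the odd quarter-wave frequencies (2k−1)·fs/(4N) for N sections.

```python
import numpy as np

def kl_tract(areas, n_samples, r_glottis=0.99, r_lips=-0.9):
    """Kelly-Lochbaum vocal tract: one cylindrical section per sample
    of travel time, with two-port scattering at each junction."""
    N = len(areas)
    # pressure reflection coefficient seen going from section i into i+1
    k = [(areas[i] - areas[i + 1]) / (areas[i] + areas[i + 1])
         for i in range(N - 1)]
    right = np.zeros(N)          # right-going (toward lips) wave per section
    left = np.zeros(N)           # left-going (toward glottis) wave per section
    out = np.zeros(n_samples)
    for n in range(n_samples):
        out[n] = (1.0 + r_lips) * right[-1]       # pressure radiated at lips
        new_right = np.empty(N)
        new_left = np.empty(N)
        # boundaries: partial inverting reflection at the open lip end,
        # near-total reflection (plus excitation) at the glottis end
        new_left[-1] = r_lips * right[-1]
        new_right[0] = r_glottis * left[0] + (1.0 if n == 0 else 0.0)
        # internal scattering junctions (one-sample delay per section)
        for i in range(N - 1):
            new_right[i + 1] = (1.0 + k[i]) * right[i] - k[i] * left[i + 1]
            new_left[i] = k[i] * right[i] + (1.0 - k[i]) * left[i + 1]
        right, left = new_right, new_left
    return out

# uniform 8-section tube: impulse response with resonances expected
# near fs/32, 3*fs/32, 5*fs/32 (quarter-wave series)
response = kl_tract(np.ones(8), 4096)
```

Replacing the uniform area function with a non-uniform one moves the resonances, which is exactly how articulation is represented in this family of models.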
Other Active Singing Synthesis Projects

Pabon (1993) has constructed a singing synthesizer with real-time formant control via spectrogram-like displays called phonetograms, and source waveform synthesis using FOF-like controls. Titze and Story (1993) have produced a super-computer tenor called "Pavarobotti" that sings duets with Titze, and is used for studying many aspects of the voice, including advanced physical models of normal and pathological vocal folds. Howard and Rossiter (Howard and Rossiter 1993; Rossiter and Howard 1994) have studied source parameters for more natural singing synthesis, as well as interactive singing analysis software for pedagogical applications.

Spectral Models vs. Physical Models

Synthesis models can be loosely broken into two groups: spectral models, which can be viewed as based on perceptual mechanisms, and physical models, which can be viewed as based on production mechanisms. Of the models and techniques discussed above, the spectrally based models include FM, FOFs, vocoders, and sinusoidal models. Acoustic tube models are physically based. Formant synthesizers are spectral models, but could be classified as pseudo-physical because of the source/filter decomposition. It is possible to interpret LPC three ways: as least-squares linear prediction in the time domain, as a least-squares matching process on the spectrum, and as a source/filter decomposition. LPC is therefore both spectral and pseudo-physical, but not strictly a physical model, because wave variables are not propagated directly and no articulation parameters go into the basic model. Since LPC can be mapped to a filter related to the acoustic tube model (Markel and Gray 1976), it may be brought into the physical camp. Both physical and spectral models have merit, and one or the other might be more suitable given a specific goal and set of computational resources.
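The time-domain view of LPC can be grounded in a few lines of code. This is a standard autocorrelation-method implementation with the Levinson-Durbin recursion, not code from any of the cited systems; the second-order test signal is a hypothetical example. Fit to a known autoregressive signal, the predictor recovers the generating filter.

```python
import numpy as np

def lpc(x, order):
    """All-pole predictor coefficients a (with a[0] = 1) by the
    autocorrelation method, solved with the Levinson-Durbin recursion."""
    n = len(x)
    r = np.array([np.dot(x[:n - i], x[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient for stage i
        k = -(r[i] + np.dot(a[1:i], r[1:i][::-1])) / err
        a[1:i + 1] += k * np.concatenate((a[1:i][::-1], [1.0]))
        err *= 1.0 - k * k
    return a, err

# known AR(2) "vocal tract": x[n] = 0.8 x[n-1] - 0.64 x[n-2] + e[n]
rng = np.random.default_rng(1)
e = rng.standard_normal(20000)
x = np.zeros_like(e)
for i in range(2, len(x)):
    x[i] = 0.8 * x[i - 1] - 0.64 * x[i - 2] + e[i]

a, err = lpc(x, 2)
# a should be close to [1, -0.8, 0.64]; filtering x through A(z) yields
# the residual (error) signal, and filtering the residual through
# 1/A(z) reconstructs x -- the basis of LPC analysis/resynthesis
```

The same coefficients, viewed in the frequency domain, define the all-pole spectral envelope, which is the least-squares spectral-matching interpretation mentioned above.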
The main attraction of physical models is that most of the control parameters are those that a human uses to control his or her own vocal system. As such, some intuition can be brought into the design and composition processes. Another motivation is that time-varying model parameters can be generated by the model itself, if the model is constructed so that it sufficiently matches the physical system. Disadvantages of physical models are that the number of control parameters can be large, and that while some parameters might have intuitive significance for humans (jaw drop), others might not (specific muscles controlling the vocal folds). Further, parameters often interact in non-obvious ways. In general there exist no exact methods for analysis/resynthesis using physical models. Parameter estimation techniques have been investigated, but for physical models of reasonable complexity, especially those involving any non-linear component, identity analysis/resynthesis is a practical, and often theoretical, impossibility (Cook 1991b; Scavone and Cook 1994).

Model Extensions and Future Work

Work remains to be done in refining techniques for spectral analysis and synthesis of the voice. For example, a spectral envelope estimation technique like that of Galas and Rodet (1990) allows more accurate formant tracking even on high female tones, which, because of their large inter-harmonic spacing, have proven difficult for analysis systems in the past. There are far more directions for research to proceed in improving physical models, and source models for pseudo-physical models of the voice. Most of them involve some significant component of non-linearity and/or higher-dimensional models.
The main research areas involve modeling of airflow in the vocal tract, development of more exact models of the inner shape of the vocal tract tube, physical models of the tongue and other articulators, more accurate models of the vocal folds, and facial animation coupled to voice synthesis.

The modeling of flow is a difficult but important task, and until recently it has been confined to theoretical explorations, occasionally verified experimentally with hot-wire anemometry or other flow measurement techniques (Teager 1980). Mico Hirschberg has begun to make advances in actually photographing flow in constructed models of musical instruments and the vocal tract (Pelorson et al. 1994). These techniques, combined with classical and new theories, should yield greater understanding of air flow and how it affects vocal acoustics. Along with more exact solutions to the flow-physics problems, efficient means for calculating the flow simulations must also emerge, allowing the inclusion of these non-linear effects in practical synthesis models (Chafe 1995; Verge 1995).

Constructing a physical model that includes more detailed simulations of the dynamics of the tongue and articulators would allow the model to calculate the time-varying parameters, rather than having the shape, etc., explicitly specified or calculated. Wilhelms-Tricarico (1995) has developed a set of models of soft tissue, and has used these to construct a tongue model. Such models can be calibrated from the results of articulation studies using X-ray pellets, magnetic resonance imaging, and other techniques. All of this can combine to yield models that "behave" correctly in a dynamical sense, and give a better picture of the fine structure of the space inside the vocal tract. This latter information is critical if flow simulations are to be accurate.
Vocal fold models continue to be the target of much research, and, as with airflow, theories are difficult to conclusively prove or disprove. More elaborate models of the vocal fold tissue are being developed (Story and Titze 1995), and theoretical and experimental studies revisiting and comparing the classic models are being conducted (Rodet 1995).

Facial animation coupled with speech synthesis is important for a number of reasons. One is pedagogy: speech synthesizers with animated displays could be used as teaching and rehabilitation tools. Another important reason involves speech perception in general, because humans use a significant amount of lip reading in understanding speech. Work has been done by Massaro (1987) and by Hill, Pearce, and Wyvill (1988), employing facial animation to study the coupling of visual and auditory information in human speech understanding (McGurk and MacDonald 1976). Musically, we know that the face of the singer can carry even more information about the meaning of music than the actual text being sung (Scotto Di Carlo and Guaitella 1995), further motivating the combination of facial animation with singing synthesis.

Modeling Performance

One of the distinguishing features of the voice is the continuous nature of pitch control, both intentional and uncontrolled. Research in random and periodic pitch deviations (Sundberg 1987; Chowning 1989; Ternstrom and Friberg 1989; Prame 1994; Cook 1995), and in the synthesis and perception of short vibrato tones (d'Allessandro and Castellengo 1993), has provided data and models for more natural-sounding voice synthesis. On the macro scale, rule systems for vocal performance and phrasing (Berndtsson 1995) and for composition (Rodet and Cointe 1984; Barriere, Iovino, and Laurson 1991) have been constructed. The Stockholm KTH rule system is available on the compact disc Information Technology and Music (KTH 1994).
These important areas of research shall remain a topic for a future survey paper.

Extended Singing and Language Systems

Investigations into non-Western, traditional, and Bel Canto singing styles, traditions, and acoustics include studies of overtone singing (Bloothooft et al. 1992), traditional Scandinavian shepherd singing (Johnson, Sundberg, and Willbrand 1983), a highly structured system of funeral laments (Ross and Lehiste 1993), and even castrati singing (Depalle, Garcia, and Rodet 1994). Language systems for the SPASM/Singer instruments include an Ecclesiastical Latin system called LECTOR (Cook 1991a) and a system for modern Greek called IGDIS (Cook et al. 1993). The IGDIS system includes support for arbitrary tuning systems, and common vocal ornaments can be called up by name, allowing traditional folk songs and Byzantine chants to be synthesized quickly.

Real-Time Voice Processing and Interactive Karaoke

Recently, commercial products have been introduced that allow real-time "smart harmonies" to be added to a vocal signal, or that implement real-time score following with accompaniment. Vocoders and LPC, by virtue of being analysis/synthesis systems, allow potential for real-time modification of voice signals under the control of rules or real-time computer processes. We will soon see systems that integrate pitch detection, score following, and sophisticated voice processing algorithms into a new generation of interactive karaoke systems. This will remain a topic for a future review paper.

References

Atal, B. 1970. "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave." Journal of the Acoustical Society of America 47:65(A).

Barriere, J. B., F. Iovino, and M. Laurson. 1991. "A New CHANT Synthesizer in C and its Control Environment in Patchwork." In Proceedings of the 1991 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 11-14.
Bennett, G., and X. Rodet. 1989. "Synthesis of the Singing Voice." In M. Mathews and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 19-44.

Berndtsson, G. 1995. "The KTH Rule System for Singing Synthesis." Computer Music Journal 20(1):76-91.

Bloothooft, G., et al. 1992. "Acoustics and Perception of Overtone Singing." Journal of the Acoustical Society of America 92(4):1827-1836.

Carlson, G., and L. Neovius. 1990. "Implementations of Synthesis Models for Speech and Singing." STL Quarterly Progress and Status Report 2/3:63-67. Stockholm: KTH.

Carlson, G., et al. 1991. "A New Digital System for Singing Synthesis Allowing Expressive Control." In Proceedings of the 1991 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 315-318.

Carre, R. 1992. "Distinctive Regions in Acoustic Tubes." Journal d'Acoustique 5:141-159.

Chafe, C. 1990. "Pulsed Noise in Self-Sustained Oscillations of Musical Instruments." In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. New York: IEEE Press, pp. 1157-1160.

Chafe, C. 1995. "Adding Vortex Noise to Wind Instrument Physical Models." In Proceedings of the 1995 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 57-60.

Chowning, J. 1981. "Computer Synthesis of the Singing Voice." In Research Aspects on Singing. Stockholm: KTH, pp. 4-13.

Chowning, J. 1989. "Frequency Modulation Synthesis of the Singing Voice." In M. Mathews and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 57-64.

Computer Music Journal. 1995. Computer Music Journal Volume 19 Compact Disc. Cambridge, Massachusetts: The MIT Press.

Cook, P. 1991a. "LECTOR: An Ecclesiastical Latin Control Language for the SPASM/Singer Instrument."
In Proceedings of the 1991 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 319-321.

Cook, P. 1991b. "Non-Linear Periodic Prediction for On-Line Identification of Oscillator Characteristics in Woodwind Instruments." In Proceedings of the International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 157-160.

Cook, P. 1992. "SPASM: A Real-Time Vocal Tract Physical Model Editor/Controller and Singer: the Companion Software Synthesis System." Computer Music Journal 17(1):30-44.

Cook, P. 1995. "A Study of Pitch Deviation in Singing as a Function of Pitch and Dynamics." In Proceedings of the 13th International Congress of Phonetic Sciences. Stockholm: KTH, vol. 1, pp. 202-205.

Cook, P., et al. 1993. "IGDIS: A Modern Greek Text to Speech/Singing Program for the SPASM/Singer Instrument." In Proceedings of the International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 387-389.

d'Allessandro, C., and M. Castellengo. 1993. "The Pitch of Short-Duration Vibrato Tones: Experimental Data and Numerical Model." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 25-30.

Depalle, P., G. Garcia, and X. Rodet. 1994. "A Virtual Castrato (!?)." In Proceedings of the 1994 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 357-360.

Dodge, C. 1989. "On Speech Songs." In M. Mathews and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 9-18.

Dolson, M. 1986. "The Phase Vocoder: A Tutorial." Computer Music Journal 10(4):14-27.

Dudley, H. 1939. "The Vocoder." Bell Laboratories Record, December.

Galas, T., and X. Rodet.
1990. "An Improved Cepstral Method for Deconvolution of Source-Filter Systems with Discrete Spectra: Application to Musical Sound Signals." In Proceedings of the 1990 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 82-84.

Hill, D., L. Manzara, and C. Taube-Schock. 1995. "Real-Time Articulatory Speech-Synthesis-by-Rules." AVIOS. San Jose, California.

Hill, D., A. Pearce, and B. Wyvill. 1988. "Animating Speech: An Automated Approach Using Speech Synthesized by Rules." The Visual Computer 3(5):277-289.

Howard, D., and D. Rossiter. 1993. "Real-Time Visual Displays for Use in Singing Training: An Overview." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 191-196.

Johnson, A., J. Sundberg, and H. Willbrand. 1983. "Kölning: A Study of Phonation and Articulation in a Type of Swedish Herding Song." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 187-202.

Kelly, J., and C. Lochbaum. 1962. "Speech Synthesis" (paper G42). In Proceedings of the Fourth International Congress on Acoustics, pp. 1-4.

Klatt, D. 1980. "Software for a Cascade/Parallel Formant Synthesizer." Journal of the Acoustical Society of America 67(3):971-995.

KTH. 1994. Information Technology and Music (a compact disc to celebrate the 75th anniversary of the Royal Swedish Academy of Engineering Science). Stockholm: KTH.

Lansky, P. 1989. "Compositional Applications of Linear Predictive Coding." In M. Mathews and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 5-8.

Liljencrants, J. 1985. Speech Synthesis with a Reflection-Type Line Analog. DS dissertation, Department of Speech Communication and Music Acoustics, KTH, Stockholm.

Maeda, S. 1982. "A Digital Simulation Method of the Vocal Tract System." Speech Communication 1:199-299.

Maher, R. 1995.
"Tunable Bandpass Filters in Music Synthesis" (paper 4098 L2). In Proceedings of the Audio Engineering Society Conference.

Makhoul, J. 1975. "Linear Prediction: A Tutorial Review." Proceedings of the IEEE 63:561-580.

Markel, J., and A. Gray. 1976. Linear Prediction of Speech. New York: Springer.

Massaro, D. 1987. Speech Perception by Ear and Eye. Hillsdale, New Jersey: Erlbaum Associates.

Mathews, M., and J. Pierce, eds. 1989. Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press.

McAulay, R., and T. Quatieri. 1986. "Speech Analysis/Synthesis Based on a Sinusoidal Representation." IEEE Transactions on Acoustics, Speech, and Signal Processing 34(4):744-754.

McGurk, H., and J. MacDonald. 1976. "Hearing Lips and Seeing Voices." Nature 264:746-748.

Moorer, A. 1978. "The Use of the Phase Vocoder in Computer Music Applications." Journal of the Audio Engineering Society 26(1/2):42-45.

Moorer, A. 1979. "The Use of Linear Prediction of Speech in Computer Music Applications." Journal of the Audio Engineering Society 27(3):134-140.

Pabon, P. 1993. "A Real-Time Singing Voice Synthesizer." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 288-293.

Pelorson, X., et al. 1994. "Theoretical and Experimental Study of Quasi-Steady Flow Separation within the Glottis during Phonation: Applications to a Modified Two-Mass Model." Journal of the Acoustical Society of America 96(6):3416-3431.

Prame, E. 1994. "Measurements of the Vibrato Rate of Ten Singers." Journal of the Acoustical Society of America 96(4):1979-1984.

Rabiner, L. 1968. "Digital Formant Synthesizer." Journal of the Acoustical Society of America 43(4):822-828.

Rodet, X. 1984. "Time-Domain Formant-Wave-Function Synthesis." Computer Music Journal 8(3):9-14.

Rodet, X. 1995. "One and Two Mass Model Oscillations for Voice and Instruments." In Proceedings of the 1995 International Computer Music Conference.
San Fran- cisco, California: International Computer Music Asso- ciation, pp. 207-210. Rodet, X., and P. Cointe. 1984. "FORMES: Composition and Scheduling of Processes." Computer Music Journal 8(3):32-50. Rodet, X., and P. Depalle. 1992. "Spectral Envelopes and Inverse FFT Synthesis" (paper 3393 H 3). In Proceed- ings of the Audio Engineering Society Conference, NY: AES. Rodet, X., Y. Potard, and J. B. Barriere. 1984. "The CH ANT Project: From the Synthesis of the Singing Voice to Synthesis in General." Computer Music Jour- nal 8(3):15-31. Ross, J., and I. Lehiste. 1993. "Estonian Laments: A Study of Their Temporal Structure." In Proceedings of the Stockholm Music Acoustics Conference. Stock- holm: KTH , pp. 244-248. Rossiter, D., and D. H oward. 1994. "Voice Source and Acoustic Output Qualities for Singing Synthesis." In Proceedings of the 1994 International Computer Mu- sic Conference. San Francisco, California: International Computer Music Association, pp. 191-196. Cook 45 Scavone, G., and P. Cook. 1994. "Combined Linear and Non-Linear Periodic Prediction in Calibrating Models of Musical Instruments to Recordings." In Proceedings of the 1994 International Computer Music Confer- ence. San Francisco, California: International Com- puter Music Association, pp. 433-434. Scotto Di Carlo, N., and I. Guaitella. 1995. "Facial Ex- pressions in Singing." In Proceedings of the 13th Inter- national Congress of Phonetic Sciences. Stockholm: KTH , pp. 1:226-229. Serra, X., and J. Smith. 1990. "Spectral Modeling Synthe- sis: A Sound Analysis/Synthesis System Based on a De- terministic plus Stochastic Decomposition." Computer Music Journal 14(4):12-24. Smith, J. 1987. "Musical Applications of Digital W ave- guides." Technical report STAN-M-39. Stanford Univer- sity Center for Computer Research in Music and Acoustics. Smith, J., and X. Serra. 1987. "PARSH L: Analysis/Synthe- sis Program for Non-H armonic Sounds Based on a Si- nusoidal Representation." 
In Proceedings of the 1987 International Computer Music Conference. San Fran- cisco, California: International Computer Music Asso- ciation, pp. 290-297. Spanias, A. 1994. "Speech Coding: A Tutorial Review." In Proceedings of the IEEE 82(10):1541-1582. Steiglitz, K., and P. Lansky. 1981. "Synthesis of Timbral Families by W arped Linear Prediction." Computer Music Journal 5(3):45-49. Story, B., and I. Titze. 1995. "Voice Simulation W ith a Body-Cover Model of the Vocal Folds." Journal of the Acoustical Society of America 97(2):3416-3431. Sundberg, J. 1987. The Science of the Singing Voice. De- kalb, Illinois: Northern Illinois University Press. Sundberg, J. 1989. "Synthesis of Singing by Rule." In Mathews, M. and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 45-56. Teager, H . 1980. "Some Observations on Oral Air Flow During Phonation." IEEE Transactions on Acoustics, Speech, and Signal Processing 28(5):599-601. Ternstrom, S., and A. Friberg. 1989. "Analysis and Simula- tion of Small Variations in the Fundamental Frequency of Sustained Vowels." STL-Quarterly Progress and Sta- tus Report 3:1-14. Titze, I., and B. Story. 1993. "The Iowa Singing Synthe- sis." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH , p. 294. Valimaki, V., and M. Karjalainen. 1994. "Improving the Kelly-Lochbaum Vocal Tract Model Using Conical Tube Sections and Fractional Delay Filtering Tech- niques." In Proceedings of the 1994 International Con- ference on Spoken Language Processing. Yokohama, Ja- pan, pp. 18-22. Verge, M. 1995. Aeroacoustics of Confined Jets, with Applications to the Physics of Recorder-Like Instru- ments. Thesis, Technical University of Eindhoven (also available from IRCAM). W ergo. 1995. The H istorical CD of Digital Sound Synthe- sis. W ER 2033-2. W ilhelms-Tricarico, R. 1995. "Physiological Modeling of Speech Production: Methods for Modeling Soft-Tissue Articulators." 
Journal of the Acoustical Society of America 97(5):3085-3098. Zera, J., J. Gauffin, and J. Sundberg. 1984. "Synthesis of Selected VCV-Syllables in Singing." In Proceedings of the 1984 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 83-86. Computer Music Journal 46