Singing Voice Synthesis: History, Current Work, and Future Directions
Author(s): Perry R. Cook
Source: Computer Music Journal, Vol. 20, No. 3 (Autumn 1996), pp. 38-46. Published by The MIT Press. Stable URL: http://www.jstor.org/stable/3680822

Perry R. Cook
Department of Computer Science and Department of Music
Princeton University
Princeton, New Jersey, USA
PRC@cs.princeton.edu

This article briefly reviews the history of singing voice synthesis and highlights some currently active projects in this area. It surveys and discusses the benefits and trade-offs of different techniques and models, and points to performance control, some attractions of composing with vocal models, and exciting directions for future research.
© 1996 Massachusetts Institute of Technology.

Basic Vocal Acoustics

The voice can be characterized as consisting of one or more sources, such as the oscillating vocal folds or turbulence noise, and a system of filters whose properties are controlled by the shape of the vocal tract. By moving various articulators, we change the ways the sources and filters behave. The spectrum of the voice is characterized by resonant peaks called formants. The locations and shapes of these resonances are strong perceptual cues that humans use to differentiate and identify vowels and consonants. For a system to generate speech-like sounds, it should allow for manipulation of the resonant peaks of the spectrum, and also for manipulation of source parameters (voice pitch, noise level, etc.) independent of the resonances of the vocal tract. Voice pitch is commonly denoted f0, and the formant frequencies are commonly denoted f1, f2, f3, etc.

Figure 1 shows a vocal tract cross-section forming the vowel /i/ (as in "beet"), where the quasi-periodic oscillations of the vocal folds are shaped by the resonant filter of the vocal tract tube. The spectrum of the vowel shows the harmonics of the voice source outlining the peaks and valleys of the vocal tract response. Figure 2 shows the vocal tract cross-section for forming the consonant /ʃ/ ("shh"), where the "source" is not the vocal folds, but turbulence noise formed by forcing air through a constriction. Also shown is the noise-like spectrum of the consonant, with two principal formant peaks corresponding to the resonances of the vocal tract upstream from the noise source.
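The source/filter view above can be made concrete with a short numerical sketch. This is an illustrative example, not from the article: a hypothetical impulse-train "glottal" source is passed through a cascade of second-order all-pole resonators placed at textbook-average formant frequencies for an /a/-like vowel; the formant and bandwidth values are assumptions chosen only for demonstration.

```python
import numpy as np

def resonator_coeffs(fc, bw, fs):
    """Denominator coefficients of a two-pole resonator centered at
    fc Hz with bandwidth bw Hz, at sampling rate fs."""
    r = np.exp(-np.pi * bw / fs)                 # pole radius from bandwidth
    a1 = -2.0 * r * np.cos(2.0 * np.pi * fc / fs)
    a2 = r * r
    return a1, a2

def synthesize_vowel(f0, formants, bandwidths, fs=16000, dur=0.25):
    """Simplest source/filter vowel: an impulse train at f0 filtered
    through a cascade of formant resonators."""
    n = int(fs * dur)
    src = np.zeros(n)
    src[::int(round(fs / f0))] = 1.0             # quasi-periodic source at f0
    y = src
    for fc, bw in zip(formants, bandwidths):
        a1, a2 = resonator_coeffs(fc, bw, fs)
        out = np.zeros(n)
        for i in range(n):
            # direct-form recursion; negative indices read still-zero
            # samples at startup, so no special-casing is needed
            out[i] = y[i] - a1 * out[i - 1] - a2 * out[i - 2]
        y = out
    return y / np.max(np.abs(y))

# /a/-like formants (illustrative values): f1, f2, f3 and bandwidths
fs = 16000
vowel = synthesize_vowel(110.0, [700.0, 1200.0, 2600.0],
                         [130.0, 100.0, 160.0], fs)
```

Harmonics of the 110-Hz source that fall near the resonators' center frequencies are boosted, while harmonics between formants are attenuated, which is exactly the formant structure described above.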
A Brief History of Digital Singing (Speech) Synthesis

The earliest computer music project, at Bell Labs in the late 1950s, yielded a number of speech synthesis systems capable of singing, one being the acoustic tube model of Kelly and Lochbaum (1962). This model was actually an early physical model. At that time it was considered too computationally expensive for commercialization as a speech synthesizer, and too expensive to be practical for musical composition. Max Mathews worked with Kelly and Lochbaum to generate some early examples of singing synthesis (Computer Music Journal 1995; Wergo 1995).

Other techniques to arise from the early legacy of speech signal processing include the channel vocoder (VOice CODER) (Dudley 1939) and linear predictive coding (LPC) (Atal 1970; Makhoul 1975). In the vocoder, the spectrum is broken into sections called sub-bands, and the information in each sub-band is analyzed; parameters are then stored or transmitted for reconstruction at another time or site. The parametric data representing the information in each sub-band can be manipulated, yielding transformations such as pitch or time shifting, or spectral shaping. The vocoder does not strictly assume that the signal is speech, and thus generalizes to other sounds. The phase vocoder, implemented using the discrete Fourier transform, has found extensive use in computer music (Moorer 1978; Dolson 1986).

[Figure 1. Vocal tract shape and spectrum of the vowel /i/ (as in "beet"), showing formants and harmonics of the periodic voice source.]
[Figure 2. Vocal tract shape (left) and spectrum (right) of the consonant /ʃ/ ("shh"), showing a noisy spectrum with two formants.]
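The sub-band analysis/resynthesis idea can be illustrated with a toy FFT-based channel vocoder. This is a sketch of the general principle only, not any particular historical implementation: per frame, the energy in each of a handful of bands of a "modulator" signal is measured and imposed on the corresponding bands of a "carrier"; all parameter values here are assumptions for demonstration.

```python
import numpy as np

def channel_vocoder(modulator, carrier, n_bands=16, frame=256):
    """Toy FFT-based channel vocoder: transplant the modulator's
    per-band envelope onto the carrier, frame by frame, with
    75%-overlap add (Hann analysis and synthesis windows)."""
    hop = frame // 4
    win = np.hanning(frame)
    n = min(len(modulator), len(carrier))
    out = np.zeros(n)
    # band edges as FFT-bin indices, evenly spaced for simplicity
    edges = np.linspace(0, frame // 2 + 1, n_bands + 1).astype(int)
    for start in range(0, n - frame + 1, hop):
        M = np.fft.rfft(win * modulator[start:start + frame])
        C = np.fft.rfft(win * carrier[start:start + frame])
        Y = np.zeros_like(C)
        for b in range(n_bands):
            lo, hi = edges[b], edges[b + 1]
            m_rms = np.sqrt(np.mean(np.abs(M[lo:hi]) ** 2))
            c_rms = np.sqrt(np.mean(np.abs(C[lo:hi]) ** 2)) + 1e-12
            Y[lo:hi] = C[lo:hi] * (m_rms / c_rms)  # impose band envelope
        out[start:start + frame] += win * np.fft.irfft(Y, frame)
    return out

# demo: impose a 500-Hz sine's band envelope onto white noise
rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
voiced = channel_vocoder(np.sin(2.0 * np.pi * 500.0 * t),
                         rng.standard_normal(fs))
```

With a voice as modulator and an instrument as carrier, this same band-envelope transplant is one simple route to the cross-synthesis discussed later in the article.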
The introduction of linear predictive coding (Atal 1970) revolutionized speech technology, and had a great impact on musical composition as well (Moorer 1979; Steiglitz and Lansky 1981; Lansky 1989). With LPC, a time-varying filter is automatically designed that predicts the next value of the signal based on past samples. An error signal is produced which, if fed back through the time-varying filter, will yield exactly the original signal. The filter models linear correlations in the signal, which correspond to spectral features such as formants. The error signal models the input to the formant filter, and typically is periodic and impulsive for voiced speech, and noise-like for unvoiced speech.

The success of LPC in speech coding is largely due to the similarity between the source/filter decomposition yielded by the mathematics of linear prediction and the source/filter model of the human vocal tract. The power of LPC as a speech compression technique (Spanias 1994) stems from its ability to parametrically code and compress the source and filter parameters. The effectiveness of LPC as a compositional tool emerges from its ability to modify the parameters before resynthesis. There are weaknesses in LPC, however, related to the assumption of linearity inherent in the filter model. Also, all spectral properties are modeled in the filter. In actuality the voice has multiple possible sources of non-linear behavior, including source-tract coupling, non-linear wall vibration losses, and aerodynamic effects. Due to these deviations from the ideal source-filter model, the result of analysis/modification/resynthesis using LPC or a sub-band vocoder often sounds "buzzy."

Cross-Synthesis and Other Compositional Attractions of Vocal Models

The compositional interest in vocal analysis/synthesis has at least three foundations.
The first is rooted in the human as a linguistic organism: it seems in the nature of humans to find interest in voice-like sounds. Any technique or device that allows independent control over pitch and spectral peaks tends to produce sounds that are vocal in nature, and such sounds catch the interest of humans. The second interest, in systems that decompose sounds in a source/filter paradigm, is that they allow for cross-synthesis. Cross-synthesis involves the analysis of two instruments, typically a voice and a non-voice instrument, with the parameters exchanged and modified on resynthesis. This allows the resonances of the voice to be imposed on the source of a non-voice instrument. The third interest comes from the fact that once pitch and resonance structure are analyzed as they evolve in time, these three dimensions are independently available, to some extent, for manipulation on resynthesis. The elusive goals of being able to stretch time without changing pitch, to change pitch without changing timbral quality, etc., are all of high interest to computer music composers.

Other Popular Synthesis Techniques

Frequency modulation (FM) proved successful for singing synthesis (Chowning 1981, 1989) as well as for the synthesis of other sounds. As described in the communications literature, FM involves modulating the frequency of one oscillator with the output of another, creating a spread spectrum consisting of side-bands surrounding the original carrier (the oscillator that is modulated) frequency. In FM sound synthesis, both the carrier and modulator oscillators typically store a sinusoidal waveform and operate in the audio band. By controlling the amount of modulation, and by using multiple carrier/modulator pairs, spectra of somewhat arbitrary shape can be constructed. This technique proved efficient yet sufficiently flexible for music composition, and became the basis for the most successful commercial music synthesizers in history.
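The side-band structure of a single carrier/modulator pair is easy to verify numerically. The sketch below uses illustrative values (not from the article): a 1,000-Hz carrier phase-modulated at 100 Hz produces energy at fc ± k·fm, with amplitudes given by Bessel functions of the modulation index, and essentially no energy in between.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs                      # 1 second -> 1-Hz FFT bins
fc, fm, index = 1000.0, 100.0, 2.0          # carrier, modulator, mod. index

# one carrier/modulator pair: the carrier's phase is driven by the modulator
y = np.sin(2.0 * np.pi * fc * t + index * np.sin(2.0 * np.pi * fm * t))

spec = np.abs(np.fft.rfft(y * np.hanning(len(y))))
freqs = np.fft.rfftfreq(len(y), 1.0 / fs)

def amp(f):
    """Spectral magnitude at the bin nearest frequency f."""
    return spec[np.argmin(np.abs(freqs - f))]

# energy sits at fc and at the side-bands fc +/- k*fm (Bessel-function
# amplitudes), but not at in-between frequencies such as 1,050 Hz
```

Raising the modulation index spreads energy into more side-bands, which is how a single pair can fill out a formant-shaped region of spectrum.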
In vocal modeling, carriers placed near formant locations in the spectrum are modulated by a common modulator oscillator operating at the voice fundamental frequency.

Sinusoidal speech modeling (McAulay and Quatieri 1986) has been improved and applied to music synthesis by Julius Smith and Xavier Serra (Smith and Serra 1987; Serra and Smith 1990), Xavier Rodet and Philippe Depalle (1992), and others. These techniques use Fourier analysis to locate and track individual sinusoidal partials. Individual trajectories (tracks) of sinusoidal amplitude, frequency, and phase as a function of time are extracted from the time-varying peaks in a series of short-time Fourier transforms. To help define tracks, heuristics regarding physical systems, and the voice in particular, are used, such as the fact that a sinusoid should not appear, disappear, or change frequency or phase instantaneously. The sinusoids can be resynthesized from the track parameters, after modification or coding, by additive synthesis. Noise can be treated as rapidly varying sinusoids, or explicitly as a non-sinusoidal component.

Formant wave functions (FOFs, an acronym from the French term) were pioneered by Xavier Rodet (1984) at the Institut de Recherche et Coordination Acoustique/Musique (IRCAM). An FOF is a time-domain waveform model of the impulse response of an individual formant, characterized as a sinusoid at the formant center frequency with an amplitude that rises rapidly upon excitation and decays exponentially. By describing a spectral region as a windowed sinusoidal oscillation in the time domain, an FOF can be viewed as a special type of wavelet. The control parameters define the center frequency and bandwidth of the formant being modeled, and the rate at which the FOFs are generated and added determines the base frequency of the voice. The synthesis system for using FOFs was dubbed CHANT, and found application in general synthesis (Rodet, Potard, and Barriere 1984).
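A minimal FOF generator can make the idea concrete. This sketch is not CHANT's code; parameters and the two-formant setup are assumptions for illustration. One smoothly attacked, exponentially decaying sine grain per formant is overlap-added at every fundamental period, so the grain envelope sets the formant bandwidth and the repetition rate sets the pitch.

```python
import numpy as np

def fof_grain(fc, bw, fs, n, attack=0.002):
    """One formant wave function: a sinusoid at formant center
    frequency fc whose exponential decay rate sets the bandwidth bw,
    with a short raised-cosine attack."""
    t = np.arange(n) / fs
    env = np.exp(-np.pi * bw * t)
    n_a = max(1, int(attack * fs))
    env[:n_a] *= 0.5 * (1.0 - np.cos(np.pi * np.arange(n_a) / n_a))
    return env * np.sin(2.0 * np.pi * fc * t)

def fof_voice(f0, formants, fs=16000, dur=0.5):
    """Overlap-add one grain per formant at every fundamental period;
    the grain repetition rate determines the perceived pitch f0."""
    n = int(fs * dur)
    y = np.zeros(n)
    period = int(round(fs / f0))
    grains = [amp * fof_grain(fc, bw, fs, 4 * period)
              for fc, bw, amp in formants]
    for start in range(0, n, period):
        for g in grains:
            end = min(start + len(g), n)
            y[start:end] += g[:end - start]
    return y / np.max(np.abs(y))

# two illustrative formants: (center Hz, bandwidth Hz, amplitude)
voice = fof_voice(200.0, [(600.0, 80.0, 1.0), (1000.0, 90.0, 0.5)])
```

Because the grains repeat exactly once per period, the output spectrum is harmonic at f0, with the harmonics near each grain's center frequency boosted into a formant peak.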
Gerald Bennett and Xavier Rodet used CHANT to produce a number of impressive singing examples and compositions (Bennett and Rodet 1989).

Formant synthesizers, in which individual formants are modeled by second-order resonant filters, have been investigated by many speech researchers (Rabiner 1968; Klatt 1980). An attractive feature of formant synthesizers is that Fourier or LPC analysis can be used to automatically extract formant frequencies and source parameters from recorded speech. Charles Dodge used such techniques in a composition in 1973 (Dodge 1989). The group that has accomplished the most in the domain of singing synthesis using formant models is the Speech Transmission Laboratory (STL) of the Royal Institute of Technology (KTH), Stockholm. The STL MUSSE DIG (MUsic and Singing Synthesis Equipment, DIGital version) synthesizer (Carlson and Neovius 1990) has been used in singing synthesis (Zera, Gauffin, and Sundberg 1984) and in studying performance synthesis-by-rule (Sundberg 1989), and has been adapted for real-time control in performance (Carlson et al. 1991). KTH has conducted and published extensive research on speech, and has arguably produced the largest body of research on singing (Sundberg 1987) and music, both acoustics and performance. Robert C. Maher (1995) recently demonstrated singing synthesis using modified forms of the second-order resonant filter that lend themselves to parallel implementation.

Acoustic Tube Models of the Vocal Tract

Acoustic tube models solve the wave equation, usually in one dimension, inside a smoothly varying tube. The one-dimensional approximation is justified by noting that the length of the vocal tract is significantly larger than any width dimension, and thus the longitudinal modes dominate the resonance structure up to about 4,000 Hz. Modal standing waves in an acoustic tube correspond to the formants.
The basic Kelly and Lochbaum model (Kelly and Lochbaum 1962) critically samples space and time by approximating the smooth vocal tract tube with cylindrical segments equal in length to the distance traveled by a sound wave in one time sample. The SPASM and Singer systems (Cook 1992) are based on a physical model of the vocal tract filter, developed using the waveguide formulation (Smith 1987). This model is a direct descendent of the Kelly and Lochbaum model, but with many enhancements, such as a nasal tract, modeling of radiation through the throat wall, various steady and pulsed noise sources (Chafe 1990), and real-time controls. Shinji Maeda's (1982) model numerically integrates the wave equation using the rectangular method in space and the trapezoidal rule in time. Wall losses are also modeled, and an articulatory layer of control modifies the basic tube shape from higher-order descriptions like tongue and jaw position. René Carré's (1992) model is based on distinctive regions (DR) arising from sensitivity analysis, noting that movements in particular regions of the vocal tract affect formant frequencies more than movements in others. Hill, Manzara, and Taube-Schock (1995) have implemented a synthesis-by-rule system using a model based on distinctive regions, with libraries that include examples of singing synthesis. Liljencrants (1985) investigated an undersampled acoustic tube model and derived rules for modifying the shape without adding unnaturally to the energy contained within the vocal tract. The computer music research group in Helsinki (Välimäki and Karjalainen 1994) has used fractional sample interpolation and truncated conical tube segments to derive an improved version of the Kelly and Lochbaum model.
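A bare-bones Kelly-Lochbaum ladder shows the critical-sampling idea. The sketch below is a generic waveguide formulation (not the SPASM code, and with illustrative boundary reflection values): right- and left-going pressure waves propagate one section per sample and scatter at each area discontinuity. A uniform tube, nearly closed at the glottis and open at the lips, then resonates near the odd quarter-wave frequencies (2k−1)·fs/(4N) for N sections.

```python
import numpy as np

def kl_tract(areas, n_samples, r_glottis=0.99, r_lips=-0.9):
    """Kelly-Lochbaum vocal tract: one cylindrical section per sample
    of travel time, with two-port scattering at each junction."""
    N = len(areas)
    # pressure reflection coefficient seen going from section i into i+1
    k = [(areas[i] - areas[i + 1]) / (areas[i] + areas[i + 1])
         for i in range(N - 1)]
    right = np.zeros(N)          # right-going (toward lips) wave per section
    left = np.zeros(N)           # left-going (toward glottis) wave per section
    out = np.zeros(n_samples)
    for n in range(n_samples):
        out[n] = (1.0 + r_lips) * right[-1]       # pressure radiated at lips
        new_right = np.empty(N)
        new_left = np.empty(N)
        # boundaries: partial inverting reflection at the open lip end,
        # near-total reflection (plus excitation) at the glottis end
        new_left[-1] = r_lips * right[-1]
        new_right[0] = r_glottis * left[0] + (1.0 if n == 0 else 0.0)
        # internal scattering junctions (one-sample delay per section)
        for i in range(N - 1):
            new_right[i + 1] = (1.0 + k[i]) * right[i] - k[i] * left[i + 1]
            new_left[i] = k[i] * right[i] + (1.0 - k[i]) * left[i + 1]
        right, left = new_right, new_left
    return out

# uniform 8-section tube: impulse response with resonances expected
# near fs/32, 3*fs/32, 5*fs/32 (quarter-wave series)
response = kl_tract(np.ones(8), 4096)
```

Replacing the uniform area function with a non-uniform one moves the resonances, which is exactly how articulation is represented in this family of models.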
Other Active Singing Synthesis Projects

Pabon (1993) has constructed a singing synthesizer with real-time formant control via spectrogram-like displays called phonetograms, and source waveform synthesis using FOF-like controls. Titze and Story (1993) have produced a super-computer tenor called "Pavarobotti" that sings duets with Titze, and is used for studying many aspects of the voice, including advanced physical models of normal and pathological vocal folds. Howard and Rossiter (Howard and Rossiter 1993; Rossiter and Howard 1994) have studied source parameters for more natural singing synthesis, as well as interactive singing analysis software for pedagogical applications.

Spectral Models vs. Physical Models

Synthesis models can be loosely broken into two groups: spectral models, which can be viewed as based on perceptual mechanisms, and physical models, which can be viewed as based on production mechanisms. Of the models and techniques discussed above, the spectrally based models include FM, FOFs, vocoders, and sinusoidal models. Acoustic tube models are physically based. Formant synthesizers are spectral models, but could be classified as pseudo-physical because of the source/filter decomposition. It is possible to interpret LPC three ways: as least-squares linear prediction in the time domain, as a least-squares matching process on the spectrum, and as a source/filter decomposition. LPC is therefore both spectral and pseudo-physical, but not strictly a physical model, because wave variables are not propagated directly and no articulation parameters go into the basic model. Since LPC can be mapped to a filter related to the acoustic tube model (Markel and Gray 1976), it may be brought into the physical camp. Both physical and spectral models have merit, and one or the other might be more suitable given a specific goal and set of computational resources.
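The time-domain view of LPC can be grounded in a few lines of code. This is a standard autocorrelation-method implementation with the Levinson-Durbin recursion, not code from any of the cited systems; the second-order test signal is a hypothetical example. Fit to a known autoregressive signal, the predictor recovers the generating filter.

```python
import numpy as np

def lpc(x, order):
    """All-pole predictor coefficients a (with a[0] = 1) by the
    autocorrelation method, solved with the Levinson-Durbin recursion."""
    n = len(x)
    r = np.array([np.dot(x[:n - i], x[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient for stage i
        k = -(r[i] + np.dot(a[1:i], r[1:i][::-1])) / err
        a[1:i + 1] += k * np.concatenate((a[1:i][::-1], [1.0]))
        err *= 1.0 - k * k
    return a, err

# known AR(2) "vocal tract": x[n] = 0.8 x[n-1] - 0.64 x[n-2] + e[n]
rng = np.random.default_rng(1)
e = rng.standard_normal(20000)
x = np.zeros_like(e)
for i in range(2, len(x)):
    x[i] = 0.8 * x[i - 1] - 0.64 * x[i - 2] + e[i]

a, err = lpc(x, 2)
# a should be close to [1, -0.8, 0.64]; filtering x through A(z) yields
# the residual (error) signal, and filtering the residual through
# 1/A(z) reconstructs x -- the basis of LPC analysis/resynthesis
```

The same coefficients, viewed in the frequency domain, define the all-pole spectral envelope, which is the least-squares spectral-matching interpretation mentioned above.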
The main attraction of physical models is that most of the control parameters are those that a human uses to control his or her own vocal system. As such, some intuition can be brought into the design and composition processes. Another motivation is that time-varying model parameters can be generated by the model itself, if the model is constructed so that it sufficiently matches the physical system. Disadvantages of physical models are that the number of control parameters can be large, and that while some parameters might have intuitive significance for humans (jaw drop), others might not (specific muscles controlling the vocal folds). Further, parameters often interact in non-obvious ways. In general there exist no exact methods for analysis/resynthesis using physical models. Parameter estimation techniques have been investigated, but for physical models of reasonable complexity, especially those involving any non-linear component, identity analysis/resynthesis is a practical, and often theoretical, impossibility (Cook 1991b; Scavone and Cook 1994).

Model Extensions and Future Work

Work remains to be done in refining techniques for spectral analysis and synthesis of the voice. For example, a spectral envelope estimation technique like that of Galas and Rodet (1990) allows more accurate formant tracking even on high female tones, which, because of their large inter-harmonic spacing, have proven difficult for analysis systems in the past. There are far more directions for research to proceed in improving physical models, and source models for pseudo-physical models of the voice. Most of them involve some significant component of non-linearity and/or higher-dimensional models.
The main research areas involve modeling of airflow in the vocal tract, development of more exact models of the inner shape of the vocal tract tube, physical models of the tongue and other articulators, more accurate models of the vocal folds, and facial animation coupled to voice synthesis.

The modeling of flow is a difficult but important task, and until recently it has been confined to theoretical explorations, occasionally verified experimentally with hot-wire anemometry or other flow measurement techniques (Teager 1980). Mico Hirschberg has begun to make advances in actually photographing flow in constructed models of musical instruments and the vocal tract (Pelorson et al. 1994). These techniques, combined with classical and new theories, should yield greater understanding of air flow and how it affects vocal acoustics. Along with more exact solutions to the flow-physics problems, efficient means for calculating the flow simulations must also emerge, allowing the inclusion of these non-linear effects in practical synthesis models (Chafe 1995; Verge 1995).

Constructing a physical model that includes more detailed simulations of the dynamics of the tongue and articulators would allow the model to calculate the time-varying parameters, rather than having the shape, etc., explicitly specified or calculated. Wilhelms-Tricarico (1995) has developed a set of models of soft tissue, and has used these to construct a tongue model. Such models can be calibrated from the results of articulation studies using X-ray pellets, magnetic resonance imaging, and other techniques. All of this can combine to yield models that "behave" correctly in a dynamical sense, and give a better picture of the fine structure of the space inside the vocal tract. This latter information is critical if flow simulations are to be accurate.
Vocal fold models continue to be the target of much research, and, as with airflow, theories are difficult to conclusively prove or disprove. More elaborate models of the vocal fold tissue are being developed (Story and Titze 1995), and theoretical and experimental studies revisiting and comparing the classic models are being conducted (Rodet 1995).

Facial animation coupled with speech synthesis is important for a number of reasons. One is pedagogy: speech synthesizers with animated displays could be used as teaching and rehabilitation tools. Another important reason involves speech perception in general, because humans use a significant amount of lip reading in understanding speech. Work has been done by Massaro (1987) and by Hill, Pearce, and Wyvill (1988), employing facial animation to study the coupling of visual and auditory information in human speech understanding (McGurk and MacDonald 1976). Musically, we know that the face of the singer can carry even more information about the meaning of music than the actual text being sung (Scotto Di Carlo and Guaitella 1995), further motivating the combination of facial animation with singing synthesis.

Modeling Performance

One of the distinguishing features of the voice is the continuous nature of pitch control, both intentional and uncontrolled. Research in random and periodic pitch deviations (Sundberg 1987; Chowning 1989; Ternstrom and Friberg 1989; Prame 1994; Cook 1995), and in the synthesis and perception of short vibrato tones (d'Allessandro and Castellengo 1993), has provided data and models for more natural-sounding voice synthesis. On the macro scale, rule systems for vocal performance and phrasing (Berndtsson 1995) and for composition (Rodet and Cointe 1984; Barriere, Iovino, and Laurson 1991) have been constructed. The Stockholm KTH rule system is available on the compact disc Information Technology and Music (KTH 1994).
These important areas of research shall remain a topic for a future survey paper.

Extended Singing and Language Systems

Investigations into non-Western, traditional, and Bel Canto singing styles, traditions, and acoustics include studies of overtone singing (Bloothooft et al. 1992), traditional Scandinavian shepherd singing (Johnson, Sundberg, and Willbrand 1983), a highly structured system of funeral laments (Ross and Lehiste 1993), and even castrati singing (Depalle, Garcia, and Rodet 1994). Language systems for the SPASM/Singer instruments include an Ecclesiastical Latin system called LECTOR (Cook 1991a) and a system for modern Greek called IGDIS (Cook et al. 1993). The IGDIS system includes support for arbitrary tuning systems, and common vocal ornaments can be called up by name, allowing traditional folk songs and Byzantine chants to be synthesized quickly.

Real-Time Voice Processing and Interactive Karaoke

Recently, commercial products have been introduced that allow real-time "smart harmonies" to be added to a vocal signal, or that implement real-time score following with accompaniment. Vocoders and LPC, by virtue of being analysis/synthesis systems, allow potential for real-time modification of voice signals under the control of rules or real-time computer processes. We will soon see systems that integrate pitch detection, score following, and sophisticated voice processing algorithms into a new generation of interactive karaoke systems. This will remain a topic for a future review paper.

References

Atal, B. 1970. "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave." Journal of the Acoustical Society of America 47:65(A).

Barriere, J. B., F. Iovino, and M. Laurson. 1991. "A New CHANT Synthesizer in C and its Control Environment in Patchwork." In Proceedings of the 1991 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 11-14.
Bennett, G., and X. Rodet. 1989. "Synthesis of the Singing Voice." In M. Mathews and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 19-44.

Berndtsson, G. 1995. "The KTH Rule System for Singing Synthesis." Computer Music Journal 20(1):76-91.

Bloothooft, G., et al. 1992. "Acoustics and Perception of Overtone Singing." Journal of the Acoustical Society of America 92(4):1827-1836.

Carlson, G., and L. Neovius. 1990. "Implementations of Synthesis Models for Speech and Singing." STL Quarterly Progress and Status Report 2/3:63-67. Stockholm: KTH.

Carlson, G., et al. 1991. "A New Digital System for Singing Synthesis Allowing Expressive Control." In Proceedings of the 1991 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 315-318.

Carre, R. 1992. "Distinctive Regions in Acoustic Tubes." Journal d'Acoustique 5:141-159.

Chafe, C. 1990. "Pulsed Noise in Self-Sustained Oscillations of Musical Instruments." In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. New York: IEEE Press, pp. 1157-1160.

Chafe, C. 1995. "Adding Vortex Noise to Wind Instrument Physical Models." In Proceedings of the 1995 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 57-60.

Chowning, J. 1981. "Computer Synthesis of the Singing Voice." In Research Aspects on Singing. Stockholm: KTH, pp. 4-13.

Chowning, J. 1989. "Frequency Modulation Synthesis of the Singing Voice." In M. Mathews and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 57-64.

Computer Music Journal. 1995. Computer Music Journal Volume 19 Compact Disc. Cambridge, Massachusetts: The MIT Press.

Cook, P. 1991a. "LECTOR: An Ecclesiastical Latin Control Language for the SPASM/Singer Instrument."
In Proceedings of the 1991 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 319-321.

Cook, P. 1991b. "Non-Linear Periodic Prediction for On-Line Identification of Oscillator Characteristics in Woodwind Instruments." In Proceedings of the International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 157-160.

Cook, P. 1992. "SPASM: A Real-Time Vocal Tract Physical Model Editor/Controller and Singer: the Companion Software Synthesis System." Computer Music Journal 17(1):30-44.

Cook, P. 1995. "A Study of Pitch Deviation in Singing as a Function of Pitch and Dynamics." In Proceedings of the 13th International Congress of Phonetic Sciences. Stockholm: KTH, vol. 1, pp. 202-205.

Cook, P., et al. 1993. "IGDIS: A Modern Greek Text to Speech/Singing Program for the SPASM/Singer Instrument." In Proceedings of the International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 387-389.

d'Allessandro, C., and M. Castellengo. 1993. "The Pitch of Short-Duration Vibrato Tones: Experimental Data and Numerical Model." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 25-30.

Depalle, P., G. Garcia, and X. Rodet. 1994. "A Virtual Castrato (!?)." In Proceedings of the 1994 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 357-360.

Dodge, C. 1989. "On Speech Songs." In M. Mathews and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 9-18.

Dolson, M. 1986. "The Phase Vocoder: A Tutorial." Computer Music Journal 10(4):14-27.

Dudley, H. 1939. "The Vocoder." Bell Laboratories Record, December.

Galas, T., and X. Rodet.
1990. "An Improved Cepstral Method for Deconvolution of Source-Filter Systems with Discrete Spectra: Application to Musical Sound Signals." In Proceedings of the 1990 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 82-84.

Hill, D., L. Manzara, and C. Taube-Schock. 1995. "Real-Time Articulatory Speech-Synthesis-by-Rules." AVIOS. San Jose, California.

Hill, D., A. Pearce, and B. Wyvill. 1988. "Animating Speech: An Automated Approach Using Speech Synthesized by Rules." The Visual Computer 3(5):277-289.

Howard, D., and D. Rossiter. 1993. "Real-Time Visual Displays for Use in Singing Training: An Overview." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 191-196.

Johnson, A., J. Sundberg, and H. Willbrand. 1983. "Kölning: A Study of Phonation and Articulation in a Type of Swedish Herding Song." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 187-202.

Kelly, J., and C. Lochbaum. 1962. "Speech Synthesis" (paper G42). In Proceedings of the Fourth International Congress on Acoustics, pp. 1-4.

Klatt, D. 1980. "Software for a Cascade/Parallel Formant Synthesizer." Journal of the Acoustical Society of America 67(3):971-995.

KTH. 1994. Information Technology and Music (a compact disc to celebrate the 75th anniversary of the Royal Swedish Academy of Engineering Science). Stockholm: KTH.

Lansky, P. 1989. "Compositional Applications of Linear Predictive Coding." In M. Mathews and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 5-8.

Liljencrants, J. 1985. Speech Synthesis with a Reflection-Type Line Analog. DS dissertation, Department of Speech Communication and Music Acoustics, KTH, Stockholm.

Maeda, S. 1982. "A Digital Simulation Method of the Vocal Tract System." Speech Communication 1:199-299.

Maher, R. 1995.
"Tunable Bandpass Filters in Music Synthesis" (paper 4098 L2). In Proceedings of the Audio Engineering Society Conference.

Makhoul, J. 1975. "Linear Prediction: A Tutorial Review." Proceedings of the IEEE 63:561-580.

Markel, J., and A. Gray. 1976. Linear Prediction of Speech. New York: Springer.

Massaro, D. 1987. Speech Perception by Ear and Eye. Hillsdale, New Jersey: Erlbaum Associates.

Mathews, M., and J. Pierce, eds. 1989. Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press.

McAulay, R., and T. Quatieri. 1986. "Speech Analysis/Synthesis Based on a Sinusoidal Representation." IEEE Transactions on Acoustics, Speech, and Signal Processing 34(4):744-754.

McGurk, H., and J. MacDonald. 1976. "Hearing Lips and Seeing Voices." Nature 264:746-748.

Moorer, A. 1978. "The Use of the Phase Vocoder in Computer Music Applications." Journal of the Audio Engineering Society 26(1/2):42-45.

Moorer, A. 1979. "The Use of Linear Prediction of Speech in Computer Music Applications." Journal of the Audio Engineering Society 27(3):134-140.

Pabon, P. 1993. "A Real-Time Singing Voice Synthesizer." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 288-293.

Pelorson, X., et al. 1994. "Theoretical and Experimental Study of Quasi-Steady Flow Separation within the Glottis during Phonation: Applications to a Modified Two-Mass Model." Journal of the Acoustical Society of America 96(6):3416-3431.

Prame, E. 1994. "Measurements of the Vibrato Rate of Ten Singers." Journal of the Acoustical Society of America 96(4):1979-1984.

Rabiner, L. 1968. "Digital Formant Synthesizer." Journal of the Acoustical Society of America 43(4):822-828.

Rodet, X. 1984. "Time-Domain Formant-Wave-Function Synthesis." Computer Music Journal 8(3):9-14.

Rodet, X. 1995. "One and Two Mass Model Oscillations for Voice and Instruments." In Proceedings of the 1995 International Computer Music Conference.
San Fran- cisco, California: International Computer Music Asso- ciation, pp. 207-210. Rodet, X., and P. Cointe. 1984. "FORMES: Composition and Scheduling of Processes." Computer Music Journal 8(3):32-50. Rodet, X., and P. Depalle. 1992. "Spectral Envelopes and Inverse FFT Synthesis" (paper 3393 H 3). In Proceed- ings of the Audio Engineering Society Conference, NY: AES. Rodet, X., Y. Potard, and J. B. Barriere. 1984. "The CH ANT Project: From the Synthesis of the Singing Voice to Synthesis in General." Computer Music Jour- nal 8(3):15-31. Ross, J., and I. Lehiste. 1993. "Estonian Laments: A Study of Their Temporal Structure." In Proceedings of the Stockholm Music Acoustics Conference. Stock- holm: KTH , pp. 244-248. Rossiter, D., and D. H oward. 1994. "Voice Source and Acoustic Output Qualities for Singing Synthesis." In Proceedings of the 1994 International Computer Mu- sic Conference. San Francisco, California: International Computer Music Association, pp. 191-196. Cook 45 Scavone, G., and P. Cook. 1994. "Combined Linear and Non-Linear Periodic Prediction in Calibrating Models of Musical Instruments to Recordings." In Proceedings of the 1994 International Computer Music Confer- ence. San Francisco, California: International Com- puter Music Association, pp. 433-434. Scotto Di Carlo, N., and I. Guaitella. 1995. "Facial Ex- pressions in Singing." In Proceedings of the 13th Inter- national Congress of Phonetic Sciences. Stockholm: KTH , pp. 1:226-229. Serra, X., and J. Smith. 1990. "Spectral Modeling Synthe- sis: A Sound Analysis/Synthesis System Based on a De- terministic plus Stochastic Decomposition." Computer Music Journal 14(4):12-24. Smith, J. 1987. "Musical Applications of Digital W ave- guides." Technical report STAN-M-39. Stanford Univer- sity Center for Computer Research in Music and Acoustics. Smith, J., and X. Serra. 1987. "PARSH L: Analysis/Synthe- sis Program for Non-H armonic Sounds Based on a Si- nusoidal Representation." 
In Proceedings of the 1987 International Computer Music Conference. San Fran- cisco, California: International Computer Music Asso- ciation, pp. 290-297. Spanias, A. 1994. "Speech Coding: A Tutorial Review." In Proceedings of the IEEE 82(10):1541-1582. Steiglitz, K., and P. Lansky. 1981. "Synthesis of Timbral Families by W arped Linear Prediction." Computer Music Journal 5(3):45-49. Story, B., and I. Titze. 1995. "Voice Simulation W ith a Body-Cover Model of the Vocal Folds." Journal of the Acoustical Society of America 97(2):3416-3431. Sundberg, J. 1987. The Science of the Singing Voice. De- kalb, Illinois: Northern Illinois University Press. Sundberg, J. 1989. "Synthesis of Singing by Rule." In Mathews, M. and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 45-56. Teager, H . 1980. "Some Observations on Oral Air Flow During Phonation." IEEE Transactions on Acoustics, Speech, and Signal Processing 28(5):599-601. Ternstrom, S., and A. Friberg. 1989. "Analysis and Simula- tion of Small Variations in the Fundamental Frequency of Sustained Vowels." STL-Quarterly Progress and Sta- tus Report 3:1-14. Titze, I., and B. Story. 1993. "The Iowa Singing Synthe- sis." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH , p. 294. Valimaki, V., and M. Karjalainen. 1994. "Improving the Kelly-Lochbaum Vocal Tract Model Using Conical Tube Sections and Fractional Delay Filtering Tech- niques." In Proceedings of the 1994 International Con- ference on Spoken Language Processing. Yokohama, Ja- pan, pp. 18-22. Verge, M. 1995. Aeroacoustics of Confined Jets, with Applications to the Physics of Recorder-Like Instru- ments. Thesis, Technical University of Eindhoven (also available from IRCAM). W ergo. 1995. The H istorical CD of Digital Sound Synthe- sis. W ER 2033-2. W ilhelms-Tricarico, R. 1995. "Physiological Modeling of Speech Production: Methods for Modeling Soft-Tissue Articulators." 
Journal of the Acoustical Society of America 97(5):3085-3098. Zera, J., J. Gauffin, and J. Sundberg. 1984. "Synthesis of Selected VCV-Syllables in Singing." In Proceedings of the 1984 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 83-86. Computer Music Journal 46