
Pitch-Synchronous Spectrogram:

Principles and Applications


C. Julian Chen

Department of Applied Physics and Applied Mathematics
May 24, 2018
Outline
• The traditional spectrogram
• Observations with the electroglottograph (EGG)
• Process of human voice production
• Pitch-synchronous segmentation of voice signals
• Pitch-synchronous spectrogram
• Display of timbre spectrum within each period
• Display of power evolution within each period
• Free evaluation version and full versions
The traditional spectrogram

The graph is always a mixture of pitch and timbre.


(A) With a wide window, the overtones of the fundamental
frequency dominate.
(B) With a narrow window, a mixture of formant peaks and
details within each pitch period dominates.
Display of timbre spectrum

The curve is always a mixture of pitch and timbre. It is very
difficult to decipher formant frequencies and peak profiles.
The Source-Filter Theory

Source: Fourier transform of the glottal airflow waveform, -12 dB per octave.
Filter: an all-pole transfer function.
Radiation factor: +6 dB per octave, which contradicts the law of
energy conservation.
Pitch-Asynchronous Speech Parameterization (1)

The speech signal is blocked into overlapping frames with
a fixed window size (25 msec) and a fixed shift (10 msec),
and then multiplied by a processing window, typically a
Hamming window. The windows often cross phoneme
boundaries. Timbre and pitch cannot be separated.
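The fixed-frame blocking described above can be sketched as follows. This is a minimal illustration, not the code of any particular toolkit; the function name frame_signal, the 16 kHz sampling rate, and the noise input are assumptions made for the example.

```python
import numpy as np

def frame_signal(x, fs, win_ms=25.0, shift_ms=10.0):
    """Block x into overlapping frames and apply a Hamming window to each."""
    win = int(round(fs * win_ms / 1000.0))      # samples per frame
    shift = int(round(fs * shift_ms / 1000.0))  # samples between frame starts
    if len(x) < win:
        return np.empty((0, win))
    n_frames = 1 + (len(x) - win) // shift
    # Row k holds the sample indices of frame k.
    idx = np.arange(win)[None, :] + shift * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(win)

fs = 16000                                          # assumed sampling rate
x = np.random.default_rng(0).standard_normal(fs)    # 1 second of noise
frames = frame_signal(x, fs)
print(frames.shape)
```

Frame boundaries here depend only on the clock, not on the glottal closing instants, which is why a frame generally spans several pitch periods.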
Pitch-Asynchronous Speech Parameterization (2)

Using an all-pole filter model from LPC analysis, the formants of
speech signals can be extracted. However, the process is not convergent.
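For readers unfamiliar with the LPC route, a minimal sketch of the autocorrelation method is given below. It is a textbook-style illustration under assumed values (a single synthetic resonance at 1000 Hz with a 100 Hz bandwidth, model order 2), not the procedure of any specific package; the function names are hypothetical.

```python
import numpy as np

def lpc_autocorr(x, order):
    """LPC coefficients [1, a1, ..., ap] by the autocorrelation method."""
    x = np.asarray(x, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz normal equations: R a = -r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], a))

def formants_from_lpc(a, fs):
    """Frequencies (Hz) of the poles in the upper half of the z-plane."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]
    return np.sort(np.angle(roots) * fs / (2.0 * np.pi))

# Synthetic input: impulse response of one resonance (assumed values).
fs = 8000
f_res, bw = 1000.0, 100.0
rad = np.exp(-np.pi * bw / fs)          # pole radius
theta = 2.0 * np.pi * f_res / fs        # pole angle
h = np.zeros(2000)
h[0] = 1.0
for n in range(1, len(h)):
    prev2 = h[n - 2] if n >= 2 else 0.0
    h[n] = 2.0 * rad * np.cos(theta) * h[n - 1] - rad ** 2 * prev2

formants = formants_from_lpc(lpc_autocorr(h, order=2), fs)
print(formants)
```

On real speech the result depends on the chosen model order and on where the frame falls, so the pole frequencies need not settle as the order is changed.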
Anatomy of voice-production organs
Observation of Speech Signals (1)

Vowel [a], King-TTS-012, 050007, 2.23-2.28 sec.


Observation of Speech Signals (2)

Vowel [i], King-TTS-012, 004419, 1.938 – 1.968 sec.


Observation of Speech Signals (3)

Vowel [u], King-TTS-012, 005044, 1.06 – 1.11 sec.


Observation of Speech Signals (4)

Vowel [e], King-TTS-012, 050053, 2.535 – 2.585 sec.


Observation of Speech Signals (5)

Vowel [o], King-TTS-012, 051022, 1.827 – 1.877 sec.


The Electroglottograph (EGG)

A non-invasive instrument that detects the change of electric
conductance between the two vocal cords, and thus monitors
the opening and closing of the glottis (circa 1956).
What Does the Correlation of EGG Signals
and Voice Signals Tell Us?

A voice waveform is triggered by a glottal closing, starting with an
impulse. The acoustic wave is strong in the closed phase and weak
in the open phase (Fig. 5.6 in D. G. Miller, Resonance in Singing).
The Handclap Analogy (Robert Sataloff)

“Sound is actually produced by the closing of the vocal folds, in a
manner similar to the sound generated by hand clapping. … (T)he
more frequent they open and close, the higher the pitch.” (Sataloff).
The Water-Hammer Analogy (Ronald Baken)

“The sharp cutoff of flow is particularly crucial, because it is this relatively sudden
stoppage of the air flow that is the raw material of voice. An impulse-like shock
wave is produced that “excites” air molecules in the vocal tract.” (R. Baken)
Principle of Superposition (Peter Ladefoged)

The voice signal is a superposition of elementary decaying
waves, each starting at a glottal closing event.
Pitch is the repetition rate of glottal closing. (Ladefoged)
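The superposition picture can be illustrated with a small synthesis sketch. All numbers below (a 125 Hz closing rate, resonances at 700 and 1200 Hz, their bandwidths, and the 16 kHz sampling rate) are assumptions chosen for the example, and the function names are hypothetical.

```python
import numpy as np

def elementary_wave(fs, freqs_hz, bandwidths_hz, dur_s):
    """One decaying wave: a sum of exponentially damped sinusoids."""
    t = np.arange(int(round(dur_s * fs))) / fs
    wave = np.zeros_like(t)
    for f, b in zip(freqs_hz, bandwidths_hz):
        wave += np.exp(-np.pi * b * t) * np.sin(2.0 * np.pi * f * t)
    return wave

def superpose(fs, closing_instants_s, wave, total_s):
    """Add one copy of the elementary wave at every glottal closing instant."""
    out = np.zeros(int(round(total_s * fs)))
    for tc in closing_instants_s:
        start = int(round(tc * fs))
        n = min(len(wave), len(out) - start)
        if n > 0:
            out[start:start + n] += wave[:n]
    return out

fs = 16000
f0 = 125.0                                  # repetition rate of closings
closings = np.arange(0.0, 0.1, 1.0 / f0)    # closing instants in seconds
wave = elementary_wave(fs, [700.0, 1200.0], [80.0, 100.0], 0.03)
voice = superpose(fs, closings, wave, 0.1)
```

Changing f0 moves the closing instants and therefore the pitch, while the elementary wave, which carries the timbre in this picture, stays the same.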
What Is Timbre Spectrum?
• As the glottis closes, the air moving in the vocal
tract at that moment maintains its momentum.
• The kinetic energy of the moving air in the vocal
tract is converted into acoustic energy.
• The impulse resonates in the vocal tract.
• The decaying elementary wave in each pitch
period is determined by the geometry of the vocal
tract; thus it represents the instantaneous timbre.
• An accurate timbre spectrum must be computed from
the waveform in each pitch period.
Process Within Each Pitch Period
• A glottal closing starts a pitch period.
• The acoustic wave decays exponentially during
the closed phase.
• A glottal opening connects the vocal tract with
the lungs, thus accelerating power decay.
• A glottal opening also generates random noise.
• The excitation at a glottal opening is mostly
weaker than that at a glottal closing.
Pitch-Synchronous Segmentation Using EGG

The sharp peaks in the EGG derivative occur about 1 msec before the
starting impulse. They fall in the weakly varying section of a pitch
period and are therefore suitable as segmentation points.
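A minimal sketch of picking segmentation points from the EGG derivative follows. It assumes that the EGG signal rises steeply at glottal closing (the polarity depends on the instrument), and it uses a synthetic EGG-like waveform; the function name, threshold, and minimum spacing are hypothetical choices.

```python
import numpy as np

def degg_segmentation_points(egg, fs, rel_threshold=0.5, min_period_s=0.002):
    """Indices of the sharp positive peaks of the EGG derivative."""
    d = np.diff(egg)
    thresh = rel_threshold * d.max()
    # Local maxima of the derivative that exceed the threshold.
    inner = d[1:-1]
    cand = np.flatnonzero((inner > d[:-2]) & (inner >= d[2:]) & (inner > thresh)) + 1
    min_gap = int(min_period_s * fs)
    marks = []
    for i in cand:
        if not marks or i - marks[-1] >= min_gap:
            marks.append(int(i))
        elif d[i] > d[marks[-1]]:
            marks[-1] = int(i)      # keep the stronger of two close peaks
    return np.array(marks)

# Synthetic EGG: steep rise over 4 samples, slow fall, repeated every 128 samples.
fs = 16000
template = np.concatenate(([0.0, 0.1, 0.6, 0.9],
                           np.linspace(1.0, 0.0, 124, endpoint=False)))
egg = np.tile(template, 8)
marks = degg_segmentation_points(egg, fs)
print(marks)
```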
Pitch-Synchronous Segmentation from Voice

By multiplying the voice signal with an asymmetric window,
an excitation profile function is generated. The peaks of the
excitation profile function serve as pitch marks.
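The idea can be sketched as below. The slide does not give the window shape or the exact operation, so both are assumptions here: the window is an odd-symmetric function chosen for illustration, it is slid along the signal with a multiply-and-sum at each position, and its sign would have to match the polarity of the recording. The function names and the synthetic input are hypothetical.

```python
import numpy as np

def asymmetric_window(half_len):
    """Odd-symmetric window on -half_len..half_len (an assumed shape)."""
    n = np.arange(-half_len, half_len + 1)
    u = np.pi * n / half_len
    return np.sin(u) * (1.0 + np.cos(u))

def excitation_profile(x, half_len):
    """Slide the window along x: p[m] = sum over n of w[n] * x[m + n]."""
    return np.correlate(x, asymmetric_window(half_len), mode="same")

def pitch_marks(profile, min_gap):
    """Indices where the profile is positive and is the largest value
    within +/- min_gap samples."""
    marks = []
    for i in range(min_gap, len(profile) - min_gap):
        neighborhood = profile[i - min_gap:i + min_gap + 1]
        if profile[i] > 0 and profile[i] == neighborhood.max():
            marks.append(i)
    return np.array(marks)

# Synthetic voice: one decaying sinusoid starting every 128 samples.
fs, period = 16000, 128
t = np.arange(480) / fs
wave = np.exp(-np.pi * 80.0 * t) * np.sin(2.0 * np.pi * 700.0 * t)
x = np.zeros(1600)
for start in range(0, len(x), period):
    n = min(len(wave), len(x) - start)
    x[start:start + n] += wave[:n]

profile = excitation_profile(x, half_len=64)
marks = pitch_marks(profile, min_gap=48)
```

Because the input repeats every 128 samples, the marks found in the steady part of the signal repeat with the same spacing.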
Ends-meeting procedure to make waveform cyclic

After an ends-meeting procedure, the waveform of each pitch
period becomes a sample of a smooth periodic function.
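One way to make the two ends of a period meet is a short linear cross-fade, sketched below. The slide does not spell out the procedure, so this is an assumed variant: the last few samples of the period are blended with the samples that precede the period start. The function name and the overlap length are hypothetical.

```python
import numpy as np

def ends_meeting(x, start, period, overlap):
    """Return one pitch period made cyclic by a linear cross-fade.

    The last `overlap` samples of the period are blended with the `overlap`
    samples just before the period start, so that the end of the returned
    segment runs smoothly into its own beginning. Requires start >= overlap.
    """
    seg = x[start:start + period].astype(float)
    pre = x[start - overlap:start]                       # samples before the period
    fade = np.arange(1, overlap + 1) / (overlap + 1.0)   # rises toward 1
    seg[period - overlap:] = (1.0 - fade) * seg[period - overlap:] + fade * pre
    return seg

# A ramp is deliberately non-periodic: its two ends differ widely.
x = np.linspace(0.0, 1.0, 400)
raw = x[100:228]
seg = ends_meeting(x, start=100, period=128, overlap=16)
print(abs(raw[-1] - raw[0]), abs(seg[-1] - seg[0]))
```

Only the last `overlap` samples are modified; the part of the period that follows the starting impulse is left untouched.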
Example of a pitch-synchronous spectrogram

For voiced sections, the vertical lines represent glottal closing
instants. In each pitch period, the amplitude timbre spectrum is
displayed. Unvoiced sections have no glottal closings.
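A per-period amplitude spectrum can be sketched with a plain FFT over each segment between consecutive pitch marks. This is an illustration under assumptions, not the program's own code: the function name, the test tone, and the evenly spaced marks are hypothetical, and each segment is treated as one cycle of a periodic function.

```python
import numpy as np

def period_amplitude_spectra(x, fs, marks):
    """Amplitude spectrum of every segment between consecutive pitch marks."""
    spectra = []
    for a, b in zip(marks[:-1], marks[1:]):
        seg = x[a:b]
        amp = np.abs(np.fft.rfft(seg)) / len(seg)
        freqs = np.fft.rfftfreq(len(seg), d=1.0 / fs)
        spectra.append((freqs, amp))
    return spectra

fs = 16000
x = np.sin(2.0 * np.pi * 1000.0 * np.arange(1024) / fs)   # 1 kHz test tone
marks = np.arange(0, 1025, 128)                            # assumed pitch marks
spectra = period_amplitude_spectra(x, fs, marks)
freqs, amp = spectra[0]
print(freqs[np.argmax(amp)])
```

The frequency spacing of each spectrum equals the reciprocal of that period's duration, so the columns of the spectrogram follow the pitch periods rather than a fixed frame clock.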
Display of Timbre Spectrum and Power Decay

By left-clicking the spectrogram at a pitch period, its timbre
spectrum is displayed. By right-clicking at a pitch period, a
graph of power decay in that pitch period is displayed.
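A power-decay curve for one period can be approximated by averaging the squared signal over short blocks, as in the sketch below. The 1 msec block length, the function name, and the synthetic decaying tone are assumptions made for the example.

```python
import numpy as np

def power_decay_db(seg, fs, block_ms=1.0):
    """Mean power in consecutive short blocks of one pitch period, in dB."""
    block = max(1, int(round(fs * block_ms / 1000.0)))
    n_blocks = len(seg) // block
    blocks = np.asarray(seg[:n_blocks * block], dtype=float).reshape(n_blocks, block)
    power = np.mean(blocks ** 2, axis=1)
    t_ms = (np.arange(n_blocks) + 0.5) * block * 1000.0 / fs   # block centers
    return t_ms, 10.0 * np.log10(power + 1e-12)

# Synthetic period: a 1 kHz tone with a 2 msec amplitude time constant.
fs = 16000
t = np.arange(128) / fs
seg = np.exp(-t / 0.002) * np.sin(2.0 * np.pi * 1000.0 * t)
t_ms, p_db = power_decay_db(seg, fs)
slope_db_per_ms = np.polyfit(t_ms, p_db, 1)[0]
print(slope_db_per_ms)
```

For a purely exponential decay the curve is a straight line in dB; a change of slope partway through the period would be consistent with the faster decay after glottal opening.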
Examples: Timbre spectra of some vowels
Examples: Consistency of Timbre Spectra

Six examples of timbre spectra of vowel [i]. All show a strong
peak at about 300 Hz and a group of peaks around 2-4 kHz.
Examples: Timbre spectra of some consonants
Examples: Power decay in a single pitch period
A Free Evaluation Version
• Includes pitch-synchronous segmentation of voice
signals, spectrogram generation, timbre spectrum
generation, and power decay computation.
• Only works on Mac OS
• Requires an installation of Tcl/Tk
• Partially open-source: the C++ program is distributed in
compiled form; the Tcl/Tk source code is open.
• Includes two sets of standard speech data: the CMU
ARCTIC databases for US English speakers, male
speaker bdl and female speaker slt
• Manually corrected phoneme label files for the two
sets of speech data are also included
Input panel of the evaluation version
The entire package is in a single directory, PSS. In that directory, type
IMAC: PSS username$ wish pss.tcl <enter>

An input panel appears:


References
1. D. G. Miller, Resonance in Singing, Inside View Press,
2008.
2. R. T. Sataloff, The Human Voice, Scientific American,
Vol. 267, No. 6, pages 108-115 (December 1992).
3. R. J. Baken, Electroglottography, Journal of Voice, Vol. 6,
pages 98-110 (1992).
4. R. J. Baken, An Overview of Laryngeal Function for
Voice Production, in Professional Voice, Third Edition,
edited by R. T. Sataloff, Plural Publishing, Vol. 1, pages
237-256 (2005).
5. P. Ladefoged, Elements of Acoustic Phonetics, University
of Chicago Press, 1966.
6. C. J. Chen, Elements of Human Voice, World Scientific
Publishing, 2016.
