
IEICE TRANS. FUNDAMENTALS, VOL.E88–A, NO.1 JANUARY 2005

PAPER  Special Section on Cryptography and Information Security

Discrimination Method of Synthetic Speech Using Pitch Frequency against Synthetic Speech Falsification

Akio OGIHARA†a), Member, Hitoshi UNNO†, Nonmember, and Akira SHIOZAKI†, Member

Manuscript received March 22, 2004. Manuscript revised June 22, 2004. Final manuscript received August 31, 2004.
† The authors are with the Department of Computer and Systems Sciences, Graduate School of Engineering, Osaka Prefecture University, Sakai-shi, 599-8531 Japan.
a) E-mail: ogi@cs.osakafu-u.ac.jp
Copyright © 2005 The Institute of Electronics, Information and Communication Engineers
SUMMARY  We propose a discrimination method for synthetic speech that uses the pitch pattern of the speech signal. By applying the proposed synthetic speech discrimination system as a pre-process before a conventional HMM speaker verification system, we can improve the safety of the conventional speaker verification system against imposture using synthetic speech. The proposed method distinguishes between synthetic speech and natural speech according to the pitch pattern, i.e., the distribution of the values of the normalized short-range autocorrelation function. We performed a user verification experiment and confirmed the validity of the proposed method.
key words: speaker verification, biometrics, synthetic speech, pitch frequency

1. Introduction

Currently, speaker verification systems are often used for user certification. Most speaker verification systems are constructed using HMMs (Hidden Markov Models), and they are tolerant of vocal mimicry by human beings.

On the other hand, imposture methods [1], [2] against speaker verification systems have been reported, and recent synthetic speech has reached sufficient quality as the fruit of various studies [3]–[12] on speech synthesis. Imposture using synthetic speech is still a minority attack at this moment, but it has an excellent ability to deceive. Consequently, conventional speaker verification systems are exposed to the menace of imposture using synthetic speech [13].

Our purpose is to improve the safety of conventional speaker verification systems, and we have already proposed a speaker verification method against imposture using synthetic speech [14]. That method provides tolerance against synthetic speech as long as the verification process is not known to the impostor; however, if the verification process is leaked, the method may be broken through.

In this paper, we propose two discrimination methods for synthetic speech. By using the proposed synthetic speech discrimination system as a pre-process before the conventional HMM speaker verification system, as in Fig. 1, we can improve the safety of the conventional speaker verification system. The proposed discrimination system has the role of rejecting synthetic speech that is generated in order to impose on the HMM speaker verification system. Hence, in this paper, the proposed discrimination methods target not all kinds of synthetic speech but the HMM-based synthetic speech [11], [12] that has the ability to impose on an HMM speaker verification system. We consider synthetic speech built from waveform segments to be outside the scope of this study, because without the imposture victim's cooperation it is difficult to collect a sufficient amount of speech samples for constructing a segment database.

The proposed discrimination methods of synthetic speech have the following merits as compared with another discrimination method [15]:
(1) The discrimination process is independent of the speaker verification process (wide adaptability).
(2) The discrimination process uses a different feature from the speaker verification process (increased imposture difficulty).

The paper is organized as follows. In Sect. 2, the outline of pitch extraction is explained. Two discrimination methods of synthetic speech are proposed in Sect. 3. In order to show the validity of the proposed methods, experiments on the discrimination of synthetic speech are presented in Sect. 4.

Fig. 1  Applying the proposed discrimination method before the conventional speaker verification system.

2. Pitch Extraction for Speech Signal

In the proposed method, we perform the discrimination of synthetic speech by using the pitch pattern of the speech signal. Therefore the pitch of the speech signal must be obtained first.
In this paper, we use the autocorrelation function proposed by Fujisaki et al. [16] for pitch extraction, and we briefly describe the pitch extraction process in this section.

The short-range autocorrelation function R(t, τ) is defined as follows:

    R(t, τ) = ∫_{−l(τ)/2}^{l(τ)/2} x(t + ξ − τ/2) x(t + ξ + τ/2) dξ        (1)

where x(t) is the speech signal, t is time, τ is the delay time, l(τ) = (m − 1)τ, and the integer m is equal to or greater than 2. Next, the normalized short-range autocorrelation function φ(t, τ) is calculated from R(t, τ) and a normalization function P(t, τ) by the following equations:

    φ(t, τ) = R(t, τ) / P(t, τ)        (2)

    P(t, τ) = (1/2) [ ∫_{−l(τ)/2}^{l(τ)/2} x(t + ξ − τ/2)² dξ + ∫_{−l(τ)/2}^{l(τ)/2} x(t + ξ + τ/2)² dξ ]        (3)

where the value of P(t, τ) is proportional to the power of x(t) in the analysis range.

The pitch extraction process using φ(t, τ) is shown in Fig. 2. Figure 2(b) shows φ(t0, τ) at time t0 of the speech signal x(t) shown in Fig. 2(a). When the maximal peak of φ(t0, τ) occurs at τ0, τ0 is regarded as the pitch (pitch period) of the speech signal x(t) at time t0. The value of φ(t, τ) at each time position is shown in Fig. 2(c), where the thick line corresponds to Fig. 2(b). In this example, the maximal peaks at each time occur at the same delay time τ0, and consequently τ0 is regarded as the pitch of this speech signal. A discrete sketch of this computation is given below.

Fig. 2  Pitch extraction process by the normalized short-range autocorrelation function: (a) speech signal x(t); (b) φ(t0, τ) at time t0; (c) φ(t, τ) of the speech signal x(t).
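To make the computation concrete, the following is a minimal discrete-time sketch of Eqs. (1)–(3) and the peak-picking step. It is illustrative only: the NumPy implementation, the function names, the integer rounding of τ/2, and the assumption that t lies well inside the signal are ours, not part of [16].

```python
import numpy as np

def phi(x, t, tau, m=2):
    """Discrete approximation of the normalized short-range
    autocorrelation phi(t, tau) of Eqs. (1)-(3).
    x: speech samples, t: analysis centre (sample index), tau: delay in
    samples, m: window factor with l(tau) = (m - 1) * tau.
    t is assumed to lie at least l(tau)/2 + tau/2 samples inside x."""
    half = (m - 1) * tau // 2                       # l(tau)/2 in samples
    xi = np.arange(-half, half + 1)                 # integration variable xi
    left = x[t + xi - tau // 2]                     # x(t + xi - tau/2)
    right = x[t + xi + tau // 2]                    # x(t + xi + tau/2)
    R = np.sum(left * right)                        # Eq. (1)
    P = 0.5 * (np.sum(left**2) + np.sum(right**2))  # Eq. (3)
    return R / P if P > 0 else 0.0                  # Eq. (2)

def pitch_period(x, t, fs):
    """Delay at which phi(t, tau) has its maximal peak, searched over
    the 2-20 msec range used in Sect. 3.1, returned in seconds."""
    taus = np.arange(int(0.002 * fs), int(0.020 * fs) + 1)
    values = [phi(x, t, tau) for tau in taus]
    return taus[int(np.argmax(values))] / fs
```

Because P(t, τ) averages the power of the two shifted segments, φ(t, τ) is bounded by 1, which is why the thresholds used later (e.g. θ1 = 0.97) are close to 1.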
3. Discrimination Method of Synthetic Speech

In the proposed method, the discrimination of synthetic speech is performed before the conventional HMM speaker verification system. The synthetic speech discrimination system distinguishes between synthetic speech and natural speech according to the pitch pattern, i.e., the distribution of the values of the normalized short-range autocorrelation function, as in Fig. 2(c).

An example of pitch patterns is shown in Fig. 3: the pitch pattern of synthetic speech in Fig. 3(a) and that of natural speech in Fig. 3(b). In this figure, the horizontal axis is time t and the vertical axis is delay time τ, and the brightness of each pixel is proportional to the value of the normalized short-range autocorrelation function. According to Fig. 3, the shape of the locus on which the maximal values of the normalized short-range autocorrelation function occur differs between synthetic speech and natural speech, and the shape of the bright area also differs between them. These differences are caused by the synthesis process, in which synthetic speech is generated from partial high-probability symbols in the HMM. Consequently, by digitizing these characteristics, synthetic speech discrimination can be performed.

In this section, we propose two discrimination methods of synthetic speech: the "discrimination method using time stability" and the "discrimination method using pitch pattern." The former is a simplified discrimination method [17] that can be performed with low CPU power and small memory, and hence it is suitable for low-cost implementation. The latter needs more CPU power and memory, but it can discriminate more accurately than the former [18], and hence it is suitable for high-security implementation.

Fig. 3  An example of pitch patterns: (a) synthetic speech; (b) natural speech.

3.1 Discrimination Method Using Time Stability

We pay attention to the fluctuation of the brightest area in Fig. 3. The brightest area of Fig. 3(b) tends to have larger fluctuation in the vertical direction than that of Fig. 3(a). In other words, synthetic speech has time stability, and its brightest area tends to form a line in the horizontal direction. This tendency exists in almost all synthetic speech, and hence we attempt to discriminate synthetic speech by detecting time stability.

The detection of time stability is performed by the following procedure; an expository figure is shown in Fig. 4, and a code sketch follows the procedure.

[Step 1] Initialization: the time stability s = 0, the delay time τ = 2 msec, and the value of the normalized short-range autocorrelation function φ(t, τ) is calculated from the input speech.

[Step 2] The delay time τ is fixed, and the time t = 0.

[Step 3] If the conditions "φ(t, τ) > threshold value θ1" and "|φ(t, τ) − φ(t + 1, τ)| < threshold value θ2" are satisfied, the coordinates (t, τ) are regarded as a stability point. If time t + 1 reaches the end of the input speech, go to Step 4; otherwise, t is increased and the procedure returns to Step 3.

[Step 4] If the number of continuous stability points is more than the threshold value θ3, the quotient "number of continuous stability points / θ3" is added to the time stability s. If the delay time τ > 20 msec, go to Step 5; if τ ≤ 20 msec, τ is increased and the procedure returns to Step 2.

[Step 5] Normalization: the time stability s is normalized by the time length of the input speech.

By using the time stability obtained by the above procedure, we can discriminate synthetic speech when the time stability exceeds a standard value.

The typical pitch frequency of a male voice is approximately 100 Hz, and that of a female voice is approximately 200 Hz. In order to cover these pitch frequencies, the above procedure is performed over the frequency range 50–500 Hz, i.e. 2 msec ≤ τ ≤ 20 msec.

Fig. 4  Detection of time stability from the pitch pattern: (a) Step 1: φ(t, τ) is calculated from the input speech; (b) Step 2: the delay time τ is fixed; (c) Step 3: φ(t0, τ0) is compared with φ(t0 + 1, τ0); (d) Step 4: the number of continuous stability points is compared with θ3.
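The following is a minimal sketch of Steps 1–5, assuming φ has been precomputed as a 2-D array phi[tau][t] over the 2–20 msec delay range; the run-length bookkeeping is our reading of the procedure, and "more than θ3" is taken as a strict inequality.

```python
def time_stability(phi, duration_sec, theta1, theta2, theta3):
    """Time-stability score s of Sect. 3.1.
    phi: 2-D array indexed as phi[tau][t], rows spanning the
    2-20 msec delays; duration_sec: length of the input speech."""
    s = 0.0
    for tau in range(len(phi)):              # Step 2: fix tau, sweep t
        run = 0                              # current run of stability points
        for t in range(len(phi[tau]) - 1):
            # Step 3: (t, tau) is a stability point if phi is high and flat
            if phi[tau][t] > theta1 and abs(phi[tau][t] - phi[tau][t + 1]) < theta2:
                run += 1
                continue
            if run > theta3:                 # Step 4: a long run adds run/theta3
                s += run / theta3
            run = 0
        if run > theta3:                     # flush a run ending at the last frame
            s += run / theta3
    return s / duration_sec                  # Step 5: normalize by speech length
```

Synthetic speech is then flagged when the returned score exceeds the standard value.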
3.2 Discrimination Method Using Pitch Pattern

First, the maximal value of the normalized short-range autocorrelation function is extracted as the "peak" at each time t. Second, the "lower half," whose value is half of the peak, is extracted by increasing τ, and the "upper half" is extracted by decreasing τ. Third, the "half bandwidth," which is the width between the lower half and the upper half, is extracted. Figure 5 shows an example of the time sequences of the peak, lower half, upper half and half bandwidth extracted from the pitch patterns shown in Fig. 3; the horizontal axis is time t and the vertical axis is delay time τ. A sketch of this extraction is given below.

Fig. 5  Characteristics of synthetic speech (upper) and natural speech (lower).
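At one time slice, the extraction might look like the following sketch; representing the slice as an array phi_t over ascending delays taus, and taking the boundary sample nearest the half value, are our assumptions.

```python
import numpy as np

def pitch_pattern_features(phi_t, taus):
    """Peak, upper half, lower half and half bandwidth (Sect. 3.2) for
    one time slice phi(t, .); taus are the delays in ascending order."""
    k = int(np.argmax(phi_t))                # delay index of the maximal value
    peak = taus[k]                           # "peak" = delay of the maximum
    half_value = phi_t[k] / 2.0
    j = k
    while j > 0 and phi_t[j - 1] >= half_value:
        j -= 1                               # decrease tau -> "upper half"
    upper = taus[j]
    j = k
    while j < len(taus) - 1 and phi_t[j + 1] >= half_value:
        j += 1                               # increase tau -> "lower half"
    lower = taus[j]
    return peak, upper, lower, lower - upper # half bandwidth = lower - upper
```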

In the registration phase, a user of the proposed discrimination system enrolls the above-mentioned characteristics (peak, lower half, upper half, half bandwidth), extracted from his or her own natural speech, in the synthetic speech discrimination system as registration data.

After registration, when the user needs to be verified, the discrimination system compares the characteristics of the input user's speech signal with the registration data. In the proposed method, the comparison is performed by calculating the distance between the input characteristics and the registration data with DP (dynamic programming) matching. If the DP distance is smaller than a threshold value determined by pre-experiment, the discrimination system regards the input speech signal as natural speech, and then the HMM speaker verification system checks whether the input speech signal belongs to the genuine user or an impostor. A minimal DP matching sketch is given below.
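As a point of reference, a plain dynamic-time-warping distance is sketched here; the paper does not specify the local path constraints or normalization of its DP matching, so those details are assumptions.

```python
import numpy as np

def dp_distance(a, b):
    """Textbook DTW distance between two characteristic sequences
    (e.g. two half-bandwidth curves)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)                 # path-length normalization (assumption)

def is_natural(features_in, features_reg, threshold):
    """Accept as natural speech when the DP distance falls below the
    threshold fixed by pre-experiment (Sect. 3.2)."""
    return dp_distance(features_in, features_reg) < threshold
```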
4. Experiments

In this section, we show that the characteristics of the pitch pattern can discriminate synthetic speech from natural speech, and we confirm the validity of the proposed methods.

4.1 Experimental Condition

100 samples of natural speech pronounced by a male test subject are used for the following experiments. In advance, 100 samples of synthetic speech are generated from these natural speech samples by the synthetic speech generation method proposed by T. Masuko et al. [13]. The outline of the generation procedure is as follows. First, phoneme HMMs are trained with features extracted from the natural speech samples by mel-cepstrum analysis; we used HTK (hidden Markov model toolkit) for training the HMMs and SPTK (speech signal processing toolkit) for the mel-cepstrum analysis. Next, we combine the phoneme HMMs according to the desired sentence (the content of the synthetic speech) and generate the mel-cepstrum sequence that has maximal likelihood against the HMM speaker verification system. Finally, the synthetic speech is obtained from the mel-cepstrum sequence with an MLSA (mel log spectrum approximation) filter; we also used SPTK for generating the mel-cepstrum sequence and for the MLSA filtering.

The natural speech and synthetic speech are recorded at a 16000 Hz sampling rate with 16-bit linear quantization, and the content of the pronunciation is "aiu" (three Japanese vowels). The normalized short-range autocorrelation function is calculated under the condition m = 2 in the equations defined in Sect. 2.

4.2 Experimental Results of Discrimination Method Using Time Stability

The experiment on the discrimination method using time stability is performed under conditions obtained by a pilot study: threshold values θ1 = 0.97, θ2 = 0.003, θ3 = 9. In the pilot study, 20 samples of natural speech and 20 samples of synthetic speech are used, and we chose the best combination of threshold values for distinguishing between synthetic speech and natural speech in the ranges 0.01 ≤ θ1 ≤ 0.99 (0.01 step), 0.001 ≤ θ2 ≤ 0.999 (0.001 step), and 1 ≤ θ3 ≤ 20 (step 1); a sketch of this search is given at the end of this subsection.

An example of the detection of time stability is shown in Fig. 6, where a highlighted point represents a detected stability point. It is confirmed from Fig. 6 that synthetic speech has more time stability than natural speech.

Figure 7 shows the distribution of time stability obtained from the 100 synthetic speech samples and the 100 natural speech samples. The horizontal axis is the time stability s, and the vertical axis is the rate of the frequency distribution; the distribution of synthetic speech is drawn with a dotted line, and that of natural speech with a solid line. From this experimental result, 90% of the synthetic speech can be discriminated by using the threshold value drawn as a thick broken vertical line in Fig. 7.

Fig. 6  An example of detection of time stability: (a) time stability of synthetic speech; (b) time stability of natural speech.
Fig. 7  Distribution of time stability.
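The exhaustive pilot-study search might look like the sketch below; the scoring interface and the separation criterion (counting pilot samples on the correct side of the best single threshold) are illustrative assumptions, and the full grid of roughly two million combinations is expensive to evaluate naively.

```python
import numpy as np
from itertools import product

def pilot_grid_search(score_fn, natural_pilot, synthetic_pilot):
    """Exhaustive search over the threshold ranges of Sect. 4.2.
    score_fn(sample, t1, t2, t3) returns the time-stability score s;
    natural_pilot and synthetic_pilot are the 20 + 20 pilot samples."""
    best, best_correct = None, -1
    for t1, t2, t3 in product(np.arange(0.01, 1.00, 0.01),
                              np.arange(0.001, 1.000, 0.001),
                              range(1, 21)):
        nat = [score_fn(x, t1, t2, t3) for x in natural_pilot]
        syn = [score_fn(x, t1, t2, t3) for x in synthetic_pilot]
        # synthetic speech should score higher; count correct decisions
        # at the best single threshold among the observed scores
        correct = max(sum(s <= th for s in nat) + sum(s > th for s in syn)
                      for th in nat + syn)
        if correct > best_correct:
            best, best_correct = (t1, t2, t3), correct
    return best
```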

4.3 Experimental Results of Discrimination Method Using Pitch Pattern

First, the pitch pattern is calculated from the natural speech samples, and then the characteristics (peak, lower half, upper half, half bandwidth) are extracted from the pitch pattern. As a result, 100 genuine data are obtained. On the other hand, 100 impostor data are obtained from the synthetic speech samples in the same manner.

Second, the DP distances of the genuine matchings, between one genuine datum and another genuine datum, are calculated; the number of genuine matchings is 100C2 = 4950. On the other hand, the DP distances of the impostor matchings, between a genuine datum and an impostor datum, are also calculated; the number of impostor matchings is 100 × 100 = 10000. (Both counts are checked in the snippet below.)
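The matching counts follow directly from elementary combinatorics:

```python
from math import comb

print(comb(100, 2))   # genuine matchings: 100 choose 2 = 4950
print(100 * 100)      # impostor matchings: 100 x 100 = 10000
```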
The distribution of these DP distances is shown in Fig. 8. Figures 8(a)–(d) represent the cases of using the peak, upper half, lower half and half bandwidth, respectively. In each figure, the horizontal axis is the value of the DP distance, and the vertical axis is the rate of the frequency distribution. If the genuine distribution, drawn with a solid line, is separated from the impostor distribution, drawn with a broken line, the discrimination of synthetic speech can be performed perfectly. In other words, the smaller the common area between the genuine distribution and the impostor distribution, the better the characteristic is for discriminating synthetic speech.

According to Fig. 8, the "half bandwidth" is the best of the four characteristics for discrimination, because its common area is the smallest. We can discriminate synthetic speech with 99% accuracy when the value of the cross point between the solid line and the broken line is used as the threshold value. Incidentally, even the "peak," which has the largest common area of the four characteristics, achieves 93% accuracy with the cross-point threshold value.

Table 1 shows the experimental results for other male test subjects (speaker 2–speaker 4). The experiment is performed using 20 natural speech samples and 20 synthetic speech samples per additional test subject. Overall, this result shows that the "half bandwidth" is the most suitable characteristic for discriminating synthetic speech.

Fig. 8  Distribution of genuine matching (natural × natural) and impostor matching (synthetic × natural): (a) peak; (b) upper half; (c) lower half; (d) half bandwidth.

Table 1  Correct rejection percentage of synthetic speech.

4.4 Consideration of Attacks on the Proposed Discrimination Methods

For instance, the "discrimination method using time stability" can be deceived by unstable synthetic speech, which is generated by adding noise to the synthetic speech at constant intervals. The unstable synthetic speech may pass the synthetic speech discrimination system in Fig. 1, but the added noise makes it more difficult than before to pass the HMM speaker verification system. Hence, the resistance of the entire system shown in Fig. 1 to impostors is increased by applying the proposed "discrimination method using time stability."

It is more difficult to impose on the "discrimination method using pitch pattern" than on the "discrimination method using time stability," because the pitch pattern feature is more complex than time stability. If synthetic speech is disfigured in order to feign a pitch pattern, it becomes hard for it to pass the HMM speaker verification system. Hence, in the same manner as the "discrimination method using time stability," the resistance of the entire system is increased by applying the "discrimination method using pitch pattern."

5. Conclusions

In this paper, we proposed two discrimination methods of synthetic speech using pitch frequency: the "discrimination method using time stability" and the "discrimination method using pitch pattern." The former is a simplified discrimination method and is suitable for low-cost implementation. The latter can discriminate more accurately than the former and is suitable for high-security implementation. By using these proposed discrimination methods as a pre-process before the conventional HMM speaker verification system, we can improve the safety of the conventional speaker verification system.

In order to show the validity of the proposed methods, experiments on the discrimination of synthetic speech were performed. From the experimental results, the discrimination method using time stability can discriminate synthetic speech with 90% accuracy, and the discrimination method using pitch pattern can discriminate synthetic speech with 99% accuracy by using the "half bandwidth." This practicable high accuracy confirms the validity of the proposed methods.

References

[1] D. Genoud and G. Chollet, “Speech pre-processing against intentional imposture in speaker recognition,” Proc. ICSLP-98, pp.105–108, 1998.
[2] B.L. Pellom and J.H.L. Hansen, “An experimental study of speaker verification sensitivity to computer voice-altered imposters,” Proc. ICASSP-99, pp.837–840, 1999.
[3] K. Hirose, “Speech synthesis technologies,” J. IPSJ, vol.38, no.11, pp.984–991, Nov. 1997.
[4] K. Hirose and H. Fujisaki, “A system for the synthesis of high-quality speech from texts on general weather conditions,” IEICE Trans. Fundamentals, vol.E76-A, no.11, pp.1971–1980, Nov. 1993.
[5] T. Kitamura, E. Hayahara, E. Ohta, and Y. Xie, “One method for improving cepstral-synthesized speech,” IEICE Trans. Fundamentals (Japanese Edition), vol.J76-A, no.9, pp.1373–1375, Sept. 1993.
[6] N. Minematsu and S. Nakagawa, “Experimental study on the quality improvement of F0 modified speech based upon PSOLA analysis-synthesis,” IEICE Trans. Inf. & Syst. (Japanese Edition), vol.J83-D-II, no.7, pp.1590–1599, July 2000.
[7] T. Toda, H. Banno, S. Kajita, K. Takeda, F. Itakura, and K. Shikano, “Improvement of STRAIGHT method under noisy conditions based on lateral inhibitive weighting,” IEICE Trans. Inf. & Syst. (Japanese Edition), vol.J83-D-II, no.11, pp.2180–2189, Nov. 2000.
[8] Y. Sagisaka and N. Campbell, “Description, acquisition and evaluation of rules and data for speech synthesis,” IEICE Trans. Inf. & Syst. (Japanese Edition), vol.J83-D-II, no.11, pp.2068–2076, Nov. 2000.
[9] T. Koyama, T. Yoshioka, J. Takahashi, and T. Nakamura, “High quality speech synthesis using reconfigurable VCV waveform segments with smaller pitch modification,” IEICE Trans. Inf. & Syst. (Japanese Edition), vol.J83-D-II, no.11, pp.2264–2275, Nov. 2000.
[10] T. Toda, H. Kawai, M. Tsuzaki, and K. Shikano, “A segment selection algorithm for Japanese concatenative speech synthesis based on both phoneme unit and diphone unit,” IEICE Trans. Inf. & Syst. (Japanese Edition), vol.J85-D-II, no.12, pp.1760–1770, Dec. 2002.
[11] T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai, “HMM-based speech synthesis using dynamic features,” IEICE Trans. Inf. & Syst. (Japanese Edition), vol.J79-D-II, no.12, pp.2184–2190, Dec. 1996.
[12] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis,” IEICE Trans. Inf. & Syst. (Japanese Edition), vol.J83-D-II, no.11, pp.2099–2107, Nov. 2000.
[13] T. Masuko, K. Tokuda, and T. Kobayashi, “Imposture against a speaker verification system using synthetic speech,” IEICE Trans. Inf. & Syst. (Japanese Edition), vol.J83-D-II, no.11, pp.2283–2290, Nov. 2000.
[14] H. Unno, A. Ogihara, and H. Shibata, “Speaker verification against falsification using synthetic speech,” Proc. 2002 Kansai-Section Joint Convention of Institute of Electrical Engineering, p.G15-9, 2002.
[15] T. Satoh, T. Masuko, T. Kobayashi, and K. Tokuda, “Discrimination of synthetic speech generated by an HMM-based speech synthesis system for speaker verification,” IPSJ Trans., vol.43, no.7, pp.2197–2204, July 2002.
[16] H. Fujisaki, K. Hirose, and S. Seto, “A method for pitch extraction of speech with reduced errors due to analysis frame positioning,” IEICE Technical Report, SP89-69, Nov. 1989.
[17] H. Unno, A. Ogihara, and A. Shiozaki, “Synthesized speech detection for speaker verification against falsification using synthesized speech,” Proc. 26th Symposium on Information Theory and Its Applications, pp.73–78, 2003.
[18] H. Unno, A. Ogihara, and A. Shiozaki, “A protection method using pitch frequency against synthesized speech falsification,” Proc. 2004 Symposium on Cryptography and Information Security, vol.I, pp.695–700, 2004.

Akio Ogihara received the B.E., M.E. and D.E. degrees in electrical engineering from Osaka Prefecture University, in 1987, 1989 and 1992, respectively. Since 1992 he has been with Osaka Prefecture University, where he is now an Associate Professor in the Graduate School of Engineering. His current research interests include digital speech processing, information security and financial engineering. Dr. Ogihara is a member of the IEEE, the Information Processing Society of Japan, and the Acoustical Society of Japan.

Hitoshi Unno received the B.E. degree in computer and systems sciences engineering from Osaka Prefecture University in 2002. He is currently working towards the M.E. degree in computer and systems sciences engineering at Osaka Prefecture University. His current research interests include digital speech processing and information security.

Akira Shiozaki received the B.E., M.E. and D.E. degrees in electrical engineering, all from Osaka Prefecture University, in 1971, 1973 and 1976, respectively. In 1976, he joined the Faculty of Engineering of Osaka Electro-Communication University, where he was a Lecturer from 1976 to 1978, an Associate Professor from 1978 to 1985 and a Professor from 1985 to 1997. Since 1997 he has been with Osaka Prefecture University, where he is now a Professor in the Graduate School of Engineering. His research has mainly been concerned with coding theory, information security, signal processing and image processing. Dr. Shiozaki is a member of the IEEE, the Information Processing Society of Japan, the Institute of Image Information and Television Engineers, the Japan Society of Medical Electronics and Biological Engineering, the Society of Information Theory and its Applications, and the Japan Society of Ultrasonics in Medicine.
