Paper Deteccion Clonado Por Pitch
1 JANUARY 2005
1. Introduction
Copyright © 2005 The Institute of Electronics, Information and Communication Engineers
OGIHARA et al.: DISCRIMINATION METHOD OF SYNTHETIC SPEECH USING PITCH FREQUENCY
nal. Therefore, the pitch of the speech signal must be obtained first. In this paper, we use the autocorrelation function proposed by Fujisaki et al. [16] for pitch extraction, and we briefly describe the pitch extraction process in this section.

The short-range autocorrelation function R(t, τ) is defined as follows:

$$R(t, \tau) = \int_{-l(\tau)/2}^{l(\tau)/2} x(t + \xi - \tau/2)\, x(t + \xi + \tau/2)\, d\xi \qquad (1)$$

where x(t) is the speech signal, t is time, τ is the delay time, l(τ) = (m − 1)τ, and m is an integer equal to or greater than 2. Next, the normalized short-range autocorrelation function φ(t, τ) is calculated from R(t, τ) and P(t, τ) by the following equations:

$$\phi(t, \tau) = \frac{R(t, \tau)}{P(t, \tau)} \qquad (2)$$

$$P(t, \tau) = \frac{1}{2} \left[ \int_{-l(\tau)/2}^{l(\tau)/2} x(t + \xi - \tau/2)^2\, d\xi + \int_{-l(\tau)/2}^{l(\tau)/2} x(t + \xi + \tau/2)^2\, d\xi \right] \qquad (3)$$

where P(t, τ) is a normalization function whose value is proportional to the power of x(t) in the analysis range.

The pitch extraction process using φ(t, τ) is shown in Fig. 2. Figure 2(b) shows φ(t0, τ) at time t0 of the speech signal x(t) shown in Fig. 2(a). When the maximal peak of φ(t0, τ) occurs at τ0, τ0 is regarded as the pitch (pitch period) of the speech signal x(t) at time t0. The value of φ(t, τ) at each time position is shown in Fig. 2(c), and the thick line corresponds to Fig. 2(b). In this example, the maximal peaks at each time occur at the same delay time τ0, and consequently τ0 is regarded as the pitch of this speech signal.

3. Discrimination Method of Synthetic Speech

In the proposed method, synthetic speech discrimination is performed before the conventional HMM speaker verification system. The synthetic speech discrimination system distinguishes between synthetic speech and natural speech according to the pitch pattern, which is the distribution of the values of the normalized short-range autocorrelation function, as in Fig. 2(c).

An example of pitch patterns is shown in Fig. 3. The pitch pattern of synthetic speech is shown in Fig. 3(a), and the pitch pattern of natural speech is shown in Fig. 3(b). In this figure, the horizontal axis is time t and the vertical axis is delay time τ. The brightness of each pixel is proportional to the value of the normalized short-range autocorrelation function. According to Fig. 3, the shape of the locus on which the maximal values of the normalized short-range autocorrelation function occur differs between synthetic speech and natural speech. The shape of the bright area also differs between synthetic speech and natural speech. These differences arise from the synthesis process, in which synthetic speech is generated from only the partial, high-probability symbols in the HMM. Consequently, by quantifying these characteristics, synthetic speech discrimination can be performed.
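As an illustration of how the pitch pattern could be computed from sampled speech, the following is a minimal sketch of a discrete-time version of Eqs. (1)-(3). It is not the authors' implementation. The discretization is an assumption: the lag is given in samples, the integrals become sums, and the two windows are centred around the analysis sample with integer rounding. The function names `normalized_autocorr`, `pitch_pattern`, and `pitch_track`, and the parameters `lags` and `hop`, are hypothetical names introduced here.

```python
import numpy as np

def normalized_autocorr(x, n, k, m=2):
    """Approximate phi(t, tau) at sample n and lag k (in samples), after Eqs. (1)-(3)."""
    length = (m - 1) * k                 # l(tau) = (m - 1) * tau, in samples
    start = n - (length + k) // 2        # assumption: centre both windows around n
    if start < 0 or start + k + length > len(x):
        return 0.0                       # not enough samples around n
    left = x[start:start + length]              # x(t + xi - tau/2)
    right = x[start + k:start + k + length]     # x(t + xi + tau/2)
    r = np.dot(left, right)                                   # Eq. (1)
    p = 0.5 * (np.dot(left, left) + np.dot(right, right))     # Eq. (3)
    return float(r / p) if p > 0 else 0.0                     # Eq. (2)

def pitch_pattern(x, lags, hop, m=2):
    """Return analysis times (in samples) and phi with shape (len(lags), len(times))."""
    times = np.arange(0, len(x), hop)
    phi = np.array([[normalized_autocorr(x, n, k, m) for n in times] for k in lags])
    return times, phi

def pitch_track(phi, lags):
    """Lag of the maximal peak of phi at each analysis time (the locus of maxima)."""
    return np.asarray(lags)[np.argmax(phi, axis=0)]
```

For a strictly periodic signal the two windows coincide at a lag of one period, so φ reaches its upper bound of 1 there, which is why the maximal peak marks the pitch period. Displaying `phi` as an image with time on the horizontal axis and lag on the vertical axis gives a picture of the kind described for Fig. 3.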
[Figure panel (a): Speech signal x(t).]

In this section, we propose two discrimination methods for synthetic speech: “discrimination method using time stability” and “discrimination method using pitch pattern.” The former is a simplified discrimination method [17] that can be performed with low CPU power and a small amount of memory, and hence it is suitable for low-cost implementation. The latter requires more CPU power and memory, but it can distinguish more accurately than the former [18], and hence it is suitable for high-security implementation.
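The placement of the discrimination stage ahead of the HMM speaker verification system can be sketched as a simple gate. This is only an illustration of the arrangement described in this section: `is_synthetic` and `hmm_verify` are hypothetical callables standing in for either proposed discrimination method and for the conventional verifier, and neither is defined by the paper in this form.

```python
from typing import Callable, Sequence

def verify_speaker(
    utterance: Sequence[float],
    is_synthetic: Callable[[Sequence[float]], bool],  # hypothetical discriminator
    hmm_verify: Callable[[Sequence[float]], bool],    # hypothetical HMM verifier
) -> bool:
    """Reject synthetic speech first; only natural speech reaches the verifier."""
    if is_synthetic(utterance):
        return False          # rejected in the pre-process, verifier is not consulted
    return hmm_verify(utterance)
```

With this arrangement the choice between the two discrimination methods only changes which callable is passed as `is_synthetic`; the verifier itself is left unmodified.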
tern.”

5. Conclusions

In this paper, we propose two discrimination methods for synthetic speech using pitch frequency: “discrimination method using time stability” and “discrimination method using pitch pattern.” The former is a simplified discrimination method and is suitable for low-cost implementation. The latter can distinguish more accurately than the former and is suitable for high-security implementation. By using the proposed discrimination methods as a pre-process before the conventional HMM speaker verification system, we can improve the security of the conventional speaker verification system.

To show the validity of the proposed methods, experiments on the discrimination of synthetic speech have been performed. According to the experimental results, the discrimination method using time stability can discriminate synthetic speech with 90% accuracy, and the discrimination method using pitch pattern can discriminate synthetic speech with 99% accuracy by using “half bandwidth.” Practically high accuracy has been obtained in the experiments, and hence the validity of the proposed methods has been confirmed.

References

speech synthesis using dynamic features,” IEICE Trans. Inf. & Syst. (Japanese Edition), vol.J79-D-II, no.12, pp.2184–2190, Dec. 1996.
[12] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis,” IEICE Trans. Inf. & Syst. (Japanese Edition), vol.J83-D-II, no.11, pp.2099–2107, Nov. 2000.
[13] T. Masuko, K. Tokuda, and T. Kobayashi, “Imposture against a speaker verification system using synthetic speech,” IEICE Trans. Inf. & Syst. (Japanese Edition), vol.J83-D-II, no.11, pp.2283–2290, Nov. 2000.
[14] H. Unno, A. Ogihara, and H. Shibata, “Speaker verification against falsification using synthetic speech,” Proc. 2002 Kansai-Section Joint Convention of Institute of Electrical Engineering, p.G15-9, 2002.
[15] T. Satoh, T. Masuko, T. Kobayashi, and K. Tokuda, “Discrimination of synthetic speech generated by an HMM-based speech synthesis system for speaker verification,” IPSJ Trans., vol.43, no.7, pp.2197–2204, July 2002.
[16] H. Fujisaki, K. Hirose, and S. Seto, “A method for pitch extraction of speech with reduced errors due to analysis frame positioning,” IEICE Technical Report, SP89-69, Nov. 1989.
[17] H. Unno, A. Ogihara, and A. Shiozaki, “Synthesized speech detection for speaker verification against falsification using synthesized speech,” Proc. 26th Symposium on Information Theory and Its Applications, pp.73–78, 2003.
[18] H. Unno, A. Ogihara, and A. Shiozaki, “A protection method using pitch frequency against synthesized speech falsification,” Proc. 2004 Symposium on Cryptography and Information Security, vol.I, pp.695–700, 2004.