Fund Freq Det 05 Pitch en
Jan Černocký, Valentina Hubeika
{cernocky|ihubeika}@fit.vutbr.cz
DCGM FIT BUT Brno
Agenda
Fundamental frequency characteristics.
Issues.
Introduction
The fundamental frequency (pitch), $F_0$, is the frequency at which the vocal cords oscillate.
The period of the fundamental frequency (the pitch period) is $T_0 = \frac{1}{F_0}$.
The term lag denotes the pitch period expressed in samples: $L = T_0 F_s$, where $F_s$ is the sampling frequency.
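As a small worked example of the relation between $F_0$ and the lag, the sketch below converts a fundamental frequency to a lag in samples. The function name and the values 100 Hz and 8000 Hz are illustrative assumptions, not taken from the slides.

```python
def lag_in_samples(f0_hz: float, fs_hz: float) -> float:
    """Pitch period in samples: L = T0 * Fs = Fs / F0."""
    return fs_hz / f0_hz


# Hypothetical values: F0 = 100 Hz sampled at Fs = 8000 Hz.
print(lag_in_samples(100.0, 8000.0))  # 80.0
```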
coding
During transmission over a telephone land line (300-3400 Hz), the fundamental harmonic of the pitch is not present, only its multiples (higher harmonics). Simple filtering to capture the pitch would therefore not work.
$$R(m) = \sum_{n=0}^{N-1-m} s(n)\,s(n+m) \qquad (1)$$
or, equivalently,
$$R(m) = \sum_{n=m}^{N-1} s(n)\,s(n-m) \qquad (2)$$
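The autocorrelation sum in equation (1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the function name `acf` and the `max_lag` parameter are assumptions.

```python
import numpy as np


def acf(s, max_lag):
    """R(m) = sum_{n=0}^{N-1-m} s(n) s(n+m) for m = 0..max_lag."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    # For each shift m the overlapping segment shortens to N - m samples.
    return np.array([np.dot(s[:n - m], s[m:]) for m in range(max_lag + 1)])


print(acf([1, 2, 3, 4], 3))
```

For the four-sample toy frame the sums are 30, 20, 11 and 4, which also shows how fewer and fewer products contribute as m grows.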
Shift illustration
[Figure: autocorrelation function R(m)]
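$$R(m) = \sum_{n=0}^{N-1-m} c[s(n)]\,c[s(n+m)] \qquad (3)$$
$$R_{max} < \alpha R(0) \Rightarrow \text{unvoiced}, \qquad R_{max} \ge \alpha R(0) \Rightarrow \text{voiced} \qquad (4)$$
where $\alpha$ is a threshold constant.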
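The lag search and the voiced/unvoiced decision can be sketched as follows. This is a hedged illustration: the clipping function c[.] is omitted (the raw signal is used), and the search range `l_min`, `l_max` and the threshold `alpha = 0.3` are assumed values that would have to be tuned experimentally.

```python
import numpy as np


def acf(s, max_lag):
    """R(m) = sum_{n=0}^{N-1-m} s(n) s(n+m) for m = 0..max_lag."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    return np.array([np.dot(s[:n - m], s[m:]) for m in range(max_lag + 1)])


def estimate_lag(s, l_min=20, l_max=160, alpha=0.3):
    """Return (lag, voiced): position of the ACF maximum in [l_min, l_max]
    and the decision R_max >= alpha * R(0). The frame must be longer than l_max."""
    r = acf(s, l_max)
    lag = l_min + int(np.argmax(r[l_min:l_max + 1]))
    voiced = bool(r[0] > 0 and r[lag] >= alpha * r[0])
    return lag, voiced


# Synthetic periodic frame with a period of 80 samples (hypothetical test signal).
n = np.arange(320)
frame = np.sin(2 * np.pi * n / 80) + 0.5 * np.sin(4 * np.pi * n / 80)
print(estimate_lag(frame))
```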
[Figure: R(m) with its maximum marked; the position of the maximum gives the lag]
AMDF
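Earlier, when multiplication was computationally expensive, the autocorrelation function was replaced by the AMDF (Average Magnitude Difference Function):
$$R_D(m) = \sum_{n=0}^{N-1-m} |s(n) - s(n+m)| \qquad (5)$$
In contrast to the ACF, the lag is found at the minimum of this function.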
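A minimal sketch of the AMDF, assuming the standard definition as a sum of absolute differences. The function name and the search range used in the example are assumptions; note that the lag is taken at the minimum, not the maximum.

```python
import numpy as np


def amdf(s, max_lag):
    """R_D(m) = sum_{n=0}^{N-1-m} |s(n) - s(n+m)| for m = 0..max_lag."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    return np.array([np.sum(np.abs(s[:n - m] - s[m:])) for m in range(max_lag + 1)])


# Hypothetical frame with a period of 80 samples; search lags 20..120.
n = np.arange(320)
frame = np.sin(2 * np.pi * n / 80)
r_d = amdf(frame, 120)
lag = 20 + int(np.argmin(r_d[20:121]))
print(lag)
```

The search range is kept below twice the period here, because the AMDF is also close to zero at multiples of the true lag.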
Cross-Correlation Function
The drawback of the ACF is the gradual shortening of the segment from which the coefficients are computed. Here, we want to use the whole signal, which gives the cross-correlation function (CCF). b indicates the beginning of the frame:
$$CCF(m) = \sum_{n=b}^{b+N-1} s(n)\,s(n-m) \qquad (6)$$
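The CCF of equation (6) might be computed as below. This is a sketch under the assumption that the signal buffer holds at least `max_lag` samples of history before the frame start `b`; the function and parameter names are illustrative.

```python
import numpy as np


def ccf(s, b, n_frame, max_lag):
    """CCF(m) = sum_{n=b}^{b+N-1} s(n) s(n-m) for m = 0..max_lag."""
    s = np.asarray(s, dtype=float)
    if b < max_lag or b + n_frame > len(s):
        raise ValueError("need max_lag samples before b and a full frame after it")
    frame = s[b:b + n_frame]
    # Every shift uses all N samples; the second factor reaches into the past signal.
    return np.array([np.dot(frame, s[b - m:b - m + n_frame])
                     for m in range(max_lag + 1)])


print(ccf([1, 2, 3, 4, 5, 6], b=2, n_frame=3, max_lag=2))
```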
[Figure: CCF shift illustration showing the frame and the past signal, shift k = 87]
The shift in the CCF is a problem when the shifted signal has much higher energy!
[Figure: frame and shifted past signal; the shifted signal contains big values]
$$NCCF(m) = \frac{\sum_{n=b}^{b+N-1} s(n)\,s(n-m)}{\sqrt{E_1 E_2}} \qquad (7)$$
$$E_1 = \sum_{n=b}^{b+N-1} s^2(n), \qquad E_2 = \sum_{n=b}^{b+N-1} s^2(n-m) \qquad (8)$$
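Normalizing the cross-correlation by the energies of the frame and of the shifted segment can be sketched as follows. The normalization by the square root of the product of the two energies is the usual NCCF form and is assumed here; names and the test signal are illustrative.

```python
import numpy as np


def nccf(s, b, n_frame, max_lag):
    """Cross-correlation divided by sqrt(E1 * E2), for m = 0..max_lag."""
    s = np.asarray(s, dtype=float)
    if b < max_lag or b + n_frame > len(s):
        raise ValueError("need max_lag samples before b and a full frame after it")
    frame = s[b:b + n_frame]
    e1 = np.dot(frame, frame)              # energy of the frame
    out = np.zeros(max_lag + 1)
    for m in range(max_lag + 1):
        past = s[b - m:b - m + n_frame]
        e2 = np.dot(past, past)            # energy of the shifted segment
        denom = np.sqrt(e1 * e2)
        out[m] = np.dot(frame, past) / denom if denom > 0 else 0.0
    return out


# Hypothetical signal: the past is 10x louder than the current frame.
n = np.arange(200)
sig = np.sin(2 * np.pi * n / 50)
sig[:100] *= 10.0
values = nccf(sig, b=100, n_frame=100, max_lag=100)
print(values[0], values[100])
```

Both printed values are close to 1 even though the past segment has 100 times the energy, which is the effect the normalization is meant to achieve.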
Drawback: these methods do not suppress the influence of the formants, which results in additional extrema in the ACF or AMDF.
Center Clipping
- signal preprocessing before the ACF: we are interested only in the signal peaks. Define a so-called clipping level $c_L$. We can either leave out the values from the interval $\langle -c_L, +c_L \rangle$, or substitute the signal values by +1 and -1 where the signal exceeds the levels $c_L$ and $-c_L$, respectively:
$$c_1[s(n)] = \begin{cases} s(n) - c_L & \text{for } s(n) > c_L \\ 0 & \text{for } -c_L \le s(n) \le c_L \\ s(n) + c_L & \text{for } s(n) < -c_L \end{cases} \qquad (9)$$
$$c_2[s(n)] = \begin{cases} +1 & \text{for } s(n) > c_L \\ 0 & \text{for } -c_L \le s(n) \le c_L \\ -1 & \text{for } s(n) < -c_L \end{cases} \qquad (10)$$
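The two clipping variants of equations (9) and (10) translate directly into NumPy. The function names `clip1` and `clip2` are assumptions chosen to mirror $c_1$ and $c_2$.

```python
import numpy as np


def clip1(s, c_l):
    """c1: zero inside <-cL, +cL>, otherwise the part of the signal beyond the level."""
    s = np.asarray(s, dtype=float)
    return np.where(s > c_l, s - c_l, np.where(s < -c_l, s + c_l, 0.0))


def clip2(s, c_l):
    """c2: +1 above cL, -1 below -cL, 0 in between."""
    s = np.asarray(s, dtype=float)
    return np.where(s > c_l, 1.0, np.where(s < -c_l, -1.0, 0.0))


x = [-5, -2, 0, 2, 5]
print(clip1(x, 3))
print(clip2(x, 3))
```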
The figures illustrate clipping of a speech frame with the clipping level 9562:
[Figure: three panels plotted against samples: original, clip 1, clip 2]
$$c_L = k \max_{n=0 \ldots N-1} |x(n)|, \qquad (11)$$
where the constant k is selected between 0.6 and 0.8. Further, the frame can be subdivided into several micro-frames, for instance $x_1(n)$, $x_2(n)$, $x_3(n)$, each one third of the original frame length. The clipping level is then given by the lowest maximum from the micro-frames:
$$c_L = k \min \{\max |x_1(n)|, \max |x_2(n)|, \max |x_3(n)|\} \qquad (12)$$
Issue: clipping of noise in pauses, where a pitch can subsequently be detected by mistake. The method should therefore be preceded by an estimation of the silence level $s_L$. If the maximum of the signal is $< s_L$, the frame is not processed further.
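The choice of the clipping level and the silence check can be sketched as below. The default `k = 0.7` is one assumed value from the stated range 0.6 to 0.8, and `is_silent` with its `s_l` argument is a hypothetical helper; how $s_L$ itself is estimated is not covered here.

```python
import numpy as np


def clipping_level(x, k=0.7, n_micro=3):
    """cL = k * (lowest of the micro-frame maxima of |x(n)|).
    With n_micro=1 this reduces to cL = k * max |x(n)| over the whole frame."""
    parts = np.array_split(np.abs(np.asarray(x, dtype=float)), n_micro)
    return k * min(p.max() for p in parts)


def is_silent(x, s_l):
    """True if the frame maximum is below the silence level sL (frame is skipped)."""
    return bool(np.max(np.abs(x)) < s_l)


frame = [1, -4, 2, 3, -1, 0, 5, 2, -6]
print(clipping_level(frame, k=0.5))             # micro-frame maxima 4, 3, 6
print(clipping_level(frame, k=0.5, n_micro=1))  # whole-frame maximum 6
```

The frame is assumed to contain at least `n_micro` samples, otherwise an empty micro-frame would have no maximum.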
$$e(n) = s(n) + \sum_{i=1}^{P} a_i s(n-i) \qquad (13)$$
The signal e(n) contains no information about the formants and is therefore better suited to lag estimation. The lag can be estimated from the error signal using the ACF method, among others.
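A minimal sketch of computing the error signal e(n) of equation (13). The coefficients are estimated here with the autocorrelation method by solving the normal equations directly; this is an assumed, simple way to obtain $a_i$ (a Levinson-Durbin recursion would be the usual faster choice), and all names are illustrative.

```python
import numpy as np


def lpc_coefficients(s, order):
    """Solve sum_i a_i R(|i-j|) = -R(j), j = 1..P (autocorrelation method)."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    r = np.array([np.dot(s[:n - m], s[m:]) for m in range(order + 1)])
    big_r = np.array([[r[abs(i - j)] for i in range(order)] for j in range(order)])
    return np.linalg.solve(big_r, -r[1:])


def lpc_residual(s, a):
    """e(n) = s(n) + sum_{i=1}^{P} a_i s(n-i); samples before the frame count as zero."""
    s = np.asarray(s, dtype=float)
    e = s.copy()
    for i, a_i in enumerate(a, start=1):
        e[i:] += a_i * s[:-i]
    return e
```

The residual can then be passed to any of the lag estimators in place of the speech frame. An all-zero frame makes the normal equations singular and would need to be skipped beforehand.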
[Figure: three panels plotted against the shift m: R(m), Rcl1(m), Re(m)]
$$E = \sum_{n=0}^{N-1} ee^2(n) \qquad (19)$$
The approach is similar to that for the LPC coefficients; the coefficients $\beta_1$ and $\beta_2$ are:
$$\beta_1 = \frac{r_e(1)\,r_e(m) - r_e(m-1)}{1 - r_e^2(1)}, \qquad \beta_2 = \frac{r_e(1)\,r_e(m-1) - r_e(m)}{1 - r_e^2(1)}, \qquad (20)$$
where $r_e(m)$ are the normalized autocorrelation coefficients of the error signal e(n). After substituting these coefficients into the energy equation (19), the energy can be expressed as a function of the shift m:
$$E(m) = 1 - \frac{K(m)}{1 - r_e^2(1)} \qquad (21)$$
$$K(m) = r_e^2(m-1) + r_e^2(m) - 2\,r_e(1)\,r_e(m-1)\,r_e(m) \qquad (22)$$
The lag can be determined by finding either the minimum of the energy or the maximum of the function K(m) (note that the denominator $1 - r_e^2(1)$ does not depend on m):
$$L = \arg\min_{m \in [L_{min}, L_{max}]} E(m) = \arg\max_{m \in [L_{min}, L_{max}]} K(m) \qquad (23)$$
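The search for the lag through K(m) can be sketched as below. The expression used for K(m) is an assumption: it is what follows from a two-tap predictor of the form ee(n) = e(n) + β1 e(n-m+1) + β2 e(n-m) with normalized autocorrelation coefficients, namely K(m) = r_e²(m-1) + r_e²(m) - 2 r_e(1) r_e(m-1) r_e(m). Function names are illustrative.

```python
import numpy as np


def normalized_acf(e, max_lag):
    """r_e(m) = R(m) / R(0) for m = 0..max_lag (R(0) must be non-zero)."""
    e = np.asarray(e, dtype=float)
    n = len(e)
    r = np.array([np.dot(e[:n - m], e[m:]) for m in range(max_lag + 1)])
    return r / r[0]


def ltp_lag(e, l_min, l_max):
    """Return (lag, K) with lag = argmax of K(m) over m in [l_min, l_max], l_min >= 1."""
    r = normalized_acf(e, l_max)
    m = np.arange(l_min, l_max + 1)
    k = r[m - 1] ** 2 + r[m] ** 2 - 2.0 * r[1] * r[m - 1] * r[m]
    return int(m[np.argmax(k)]), k


# Hypothetical error signal: an impulse train with a period of 60 samples.
e = np.zeros(300)
e[::60] = 1.0
print(ltp_lag(e, 20, 150)[0])
```

Because the assumed predictor has taps at both m-1 and m, a sharp peak at the true lag contributes equally to K at that lag and at the next one, so the result may be off by one sample.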
$$c(m) = \mathcal{F}^{-1}\left[\ln |\mathcal{F} s(n)|^2\right] \qquad (24)$$
In the cepstrum, it is possible to separate the coefficients representing the vocal tract (low indices) from the coefficients carrying information about the fundamental frequency (high indices). The lag can be estimated by finding the maximum of c(m) within the range of plausible lag values.
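Cepstral lag estimation can be sketched with NumPy's FFT routines. The small constant added before the logarithm, the optional zero-padding `n_fft`, and the function names are assumptions made for a runnable example.

```python
import numpy as np


def real_cepstrum(s, n_fft=None):
    """c(m) = IFFT( ln |FFT(s)|^2 ); a tiny constant guards against log(0)."""
    spectrum = np.fft.fft(np.asarray(s, dtype=float), n_fft)
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)
    return np.real(np.fft.ifft(log_power))


def cepstral_lag(s, l_min, l_max, n_fft=None):
    """Index of the cepstral maximum within the plausible lag range."""
    c = real_cepstrum(s, n_fft)
    return l_min + int(np.argmax(c[l_min:l_max + 1]))


# Hypothetical voiced frame: impulse train (period 64) through a one-pole filter.
excitation = np.zeros(512)
excitation[::64] = 1.0
voiced = np.zeros(512)
for n in range(512):
    voiced[n] = excitation[n] + (0.9 * voiced[n - 1] if n > 0 else 0.0)
frame = voiced * np.hamming(512)
print(cepstral_lag(frame, 30, 100, n_fft=1024))
```

Restricting the search to indices above `l_min` is what separates the excitation peak from the low-index coefficients that describe the filter.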
[Figure: cepstrum c(m); c0 is the log-energy, the low indices describe the filter, the high indices the excitation, with peaks at the 1st and 2nd multiple of the lag]
Sort the items by value and pick the middle item (the median). The lag values from the above example are therefore corrected to the sequence: 50, 50, 50, 50, 50.
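Median filtering of a lag track can be sketched as follows. The window width of 5, the edge padding, and the input sequence with one doubled lag value are assumptions for illustration; they are not the example from the slides.

```python
import numpy as np


def median_filter(lags, width=5):
    """Replace each lag by the median of a window of `width` (odd) neighbours.
    The sequence is padded at both ends by repeating the edge values."""
    lags = np.asarray(lags)
    half = width // 2
    padded = np.pad(lags, half, mode="edge")
    return np.array([int(np.median(padded[i:i + width])) for i in range(len(lags))])


# Hypothetical lag track with one doubled value.
print(median_filter([50, 50, 100, 50, 50]))
```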
Decimal Sampling
To improve the accuracy of F0 detection, we can super-sample (upsample) the signal and then filter it. This operation does not have to be carried out physically; it can instead be built into the estimation of the autocorrelation coefficients. Super-sampling often prevents detection of a doubled lag value.
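A rough sketch of lag estimation with sub-sample resolution. Here the signal is physically upsampled, and linear interpolation (`np.interp`) stands in for the interpolation filter; both are simplifying assumptions, as are the factor `up = 4` and the function name.

```python
import numpy as np


def refined_lag(s, l_min, l_max, up=4):
    """Estimate the lag on a signal upsampled by `up`; result is in original samples."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    # Linear interpolation onto a grid that is `up` times denser.
    t_fine = np.arange((n - 1) * up + 1) / up
    s_up = np.interp(t_fine, np.arange(n), s)
    length = len(s_up)
    lo, hi = l_min * up, l_max * up
    r = np.array([np.dot(s_up[:length - m], s_up[m:]) for m in range(lo, hi + 1)])
    return (lo + int(np.argmax(r))) / up


# Hypothetical signal whose period (40.5 samples) is not an integer.
n = np.arange(400)
print(refined_lag(np.cos(2 * np.pi * n / 40.5), 20, 60))
```

With `up = 4` the lag is resolved in steps of a quarter of a sample, so a period of 40.5 samples no longer has to be rounded to 40 or 41.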