
MEL-SCALED DISCRETE WAVELET COEFFICIENTS FOR SPEECH RECOGNITION

J.N. Gowdy, Z. Tufekci
Department of Electrical and Computer Engineering
Clemson University, Clemson, SC 29634, USA
jgowdy@ces.clemson.edu, ztufekc@ces.clemson.edu
ABSTRACT

In this paper we propose a new feature vector consisting of Mel-Frequency Discrete Wavelet Coefficients (MFDWC). The MFDWC are obtained by applying the Discrete Wavelet Transform (DWT) to the mel-scaled log filterbank energies of a speech frame. The purpose of using the DWT is to benefit from its localization property in the time and frequency domains. MFDWC are similar to subband-based (SUB) features and multi-resolution (MULT) features in that all three attempt to achieve good time and frequency localization; however, MFDWC have better time/frequency localization than SUB and MULT features. We evaluated the performance of the new features for clean and noisy speech and compared MFDWC with Mel-Frequency Cepstral Coefficients (MFCC), SUB features, and MULT features. Experimental results on a phoneme recognition task showed that an MFDWC-based recognizer gave better results than recognizers based on MFCC, SUB features, and MULT features for white Gaussian noise, band-limited white Gaussian noise, and clean speech.

1. INTRODUCTION

Mel-Frequency Cepstral Coefficients (MFCC) have been the most widely used speech features for speech recognition since Davis and Mermelstein [1] showed that an MFCC-based speech recognizer outperforms recognizers based on other features (Linear Prediction Cepstral Coefficients (LPCC), Linear Prediction Coefficients (LPC), Reflection Coefficients (RC), and Linear Filter Cepstral Coefficients (LFCC)). MFCC, which are calculated by taking the Discrete Cosine Transform (DCT) of mel-scaled log filterbank energies, have two drawbacks:

1. Since the basis vectors of the DCT cover all frequency bands, corruption of a frequency band of speech by noise affects all MFCC.
However, if we find a transform for which corruption of a frequency band affects only a few coefficients, we can decrease the effect of noise on the performance of the recognizer by weighting the corrupted coefficients or by eliminating them completely.

2. A frame of speech may contain information from two adjacent phonemes. If one of these phonemes is voiced and the other is unvoiced, then the low-frequency spectrum may be dominated by voiced phoneme information and the high-frequency spectrum by unvoiced phoneme information. In the MFCC case, we inherently assume that a speech frame conveys information about only one phoneme at a time. However, this asynchrony can be taken into account by dividing the frequency band into subbands and processing each subband separately.

(This work was partially supported by Izmir Institute of Technology.)

As pointed out by Fletcher [2] (and reviewed by Allen in [3]), the Human Speech Recognition (HSR) system works with partial recognition information across frequency, probably in the form of speech features that are local in frequency. The HSR system does not assume that the timing of features is synchronous. Hermansky et al. [4] and Bourlard et al. [5] proposed subband-based speech recognition systems to overcome the problems associated with MFCC-based recognizers. They divided the frequency spectrum into subbands and developed techniques for merging the individual subband recognition scores, such as a multi-layer perceptron (MLP), weighting subband scores by the recognition rate of each subband, and weighting subband scores by the SNR of each subband. The basic idea behind dividing the frequency spectrum into subbands is to keep the error caused by a noise-corrupted frequency band local. They also investigated asynchrony between different subbands of the speech spectrum. Asynchrony between frequency bands was studied in [5-8], and it has been shown that accommodating asynchrony between frequency bands improves performance. It has also been shown [9, 10] that subband-based recognizers are robust to band-limited noise but perform poorly in the white-noise case. Since cepstral coefficients give better results than the raw speech spectrum, the LPCC or MFCC of each subband are typically used as features. The resulting features are therefore a cosine series expansion of the preprocessed log magnitude spectrum of each subband, or the DCT of the mel-scaled log filterbank energies of each subband. Subband-based recognizers have two drawbacks:

1.
The basis vectors of the DCT (CT) have approximately the same resolution in time and frequency, since we use same-length windows to calculate the cepstral coefficients. However, to capture changes in time and frequency we may need some basis vectors with good time resolution and some with good frequency resolution.

2. The basis vectors of the DCT and CT are not the best basis vectors for localization in time and frequency.

To overcome the former problem, Vaseghi et al. [11] suggested multi-resolution features. However, since these use the same basis vectors as the DCT, the latter problem is still present. We propose to use the DWT, which has good time and frequency resolution, instead of the DCT to solve both problems.
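Drawback 1 above can be illustrated numerically: perturbing a single input value (one "band") moves every DCT coefficient, while a wavelet transform confines the damage to a few coefficients. The sketch below is illustrative only; it uses a toy Haar DWT rather than the biorthogonal wavelet adopted later in the paper, and the function names are our own.

```python
import math

def dct2(x):
    """Unnormalized DCT-II: every output coefficient mixes every input sample."""
    size = len(x)
    return [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * size))
                for n in range(size))
            for k in range(size)]

def haar_dwt(x):
    """Full Haar DWT (toy stand-in for a general wavelet): each output
    coefficient depends only on a local, dyadic slice of the input."""
    out = []
    while len(x) > 1:
        approx = [(p + q) / math.sqrt(2) for p, q in zip(x[::2], x[1::2])]
        detail = [(p - q) / math.sqrt(2) for p, q in zip(x[::2], x[1::2])]
        out = detail + out
        x = approx
    return x + out

# Corrupt one "band" of an 8-point log-spectrum-like vector.
clean = [0.3, 1.0, -0.5, 0.2, 0.8, -0.1, 0.4, 0.6]
noisy = clean[:]
noisy[0] += 1.0

n_dct = sum(abs(a - b) > 1e-9 for a, b in zip(dct2(clean), dct2(noisy)))
n_dwt = sum(abs(a - b) > 1e-9 for a, b in zip(haar_dwt(clean), haar_dwt(noisy)))
# All 8 DCT coefficients change, but only 4 of the 8 Haar coefficients do.
```

A recognizer that can down-weight the 4 affected wavelet coefficients keeps half of the feature vector untouched by the corruption; no such subset exists for the DCT features.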

2. LOCALIZATION PROPERTY OF A LINEAR TRANSFORMATION

In general, we take the transformation of a signal to get a better representation of the signal. However, a "better" representation has different meanings for different purposes. For coding, better means representing the signal with fewer coefficients. For recognition, better means having more ability to separate signals belonging to different categories in the new domain than in the original domain. Let \{g_k(t)\} be a set of functions in L^2(\mathbb{R}) (the space of square-integrable functions) associated with a transformation G; k may be a multidimensional index. Then define the transformation of f(t) as

    F(k) = \langle f, g_k \rangle = \int f(t)\, g_k^*(t)\, dt    (1)

where \langle \cdot,\cdot \rangle denotes the inner product and ^* denotes the complex conjugate. Parseval's theorem states that

    \langle f, g_k \rangle = \frac{1}{2\pi} \langle \hat{f}, \hat{g}_k \rangle    (2)

where \hat{f}(\omega) and \hat{g}_k(\omega) are the Fourier Transforms (FT) of f(t) and g_k(t), respectively. As seen, the transform of the signal depends on both g_k(t) and the FT of g_k(t). So the locality of F(k) in the time and frequency domains depends on the spread of g_k(t) in time and of \hat{g}_k(\omega) in frequency, respectively. Figure 1 shows a cosine basis function and a wavelet basis function in the time and frequency domains. As seen from Figure 1, the wavelet basis function is more concentrated than the cosine basis function in both time and frequency; therefore a wavelet coefficient is more local than a cosine coefficient in both domains. The wavelet shown is Cohen, Daubechies, and Feauveau's [12] compactly supported biorthogonal wavelet with similar-length filters (the high-pass filter has 9 taps and the low-pass filter has 7 taps).

[Figure 1: The spreads of the basis functions of the Wavelet Transform and the Cosine-I Transform. The panels plot a wavelet basis function and a cosine basis function (k=4) against t (sec), and their Fourier transforms against w (radian/sec).]

The locality of the transformation of a signal is important in two ways for pattern recognition. First, different parts of the signal may convey different amounts of information. When our coefficients represent local information, we can adjust the contribution of each coefficient to the total recognition score depending on the information that it conveys. Second, when our signal is corrupted by noise that is local in time and/or frequency, the noise affects only a few coefficients if our coefficients represent local information in time and frequency. Therefore, we can decrease the contribution of noise-corrupted coefficients to the overall recognition score depending on their SNR. For these reasons, we want the transformation of a signal to represent information that is as local as possible in both the time and frequency domains. However, Heisenberg's Uncertainty Theorem states that the transformed signal cannot represent arbitrarily local information in both domains simultaneously. Although there is a tradeoff between locality in the time and frequency domains, we can keep the joint measure of locality, \Delta_t \Delta_\omega, as small as possible.

In our applications we do not use continuous transforms; instead we use discrete transforms, since our signals are discrete. Therefore, it is better to define the localization property for the discrete transform. Let \{g_k[n]\} be a set of vectors in l^2(\mathbb{Z}) (the set of square-summable sequences) associated with a transformation G; again, k may be a multidimensional index. One can define the measures of spread of g_k in time and frequency as shown below. Assume g_k has unit energy. The mean of |g_k[n]|^2 is

    m_t = \sum_n n\, |g_k[n]|^2    (3)

and the measure of resolution of g_k in the time domain is

    \Delta_t^2 = \sum_n (n - m_t)^2\, |g_k[n]|^2    (4)

The mean of |\hat{g}_k(\omega)|^2 is

    m_\omega = \frac{1}{2\pi} \int_{-\pi}^{\pi} \omega\, |\hat{g}_k(\omega)|^2\, d\omega    (5)

and the measure of resolution of g_k in the frequency domain is

    \Delta_\omega^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} (\omega - m_\omega)^2\, |\hat{g}_k(\omega)|^2\, d\omega    (6)
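Equations (3)-(6) can be checked numerically. The sketch below (our own helper, not from the paper) normalizes a sequence to unit energy and approximates the frequency-domain integrals by sampling the DTFT on [-pi, pi):

```python
import math

def spreads(g, nfreq=4096):
    """Return (time spread, frequency spread) of a sequence g[n], i.e. the
    second moments of Eqs. (4) and (6), after normalizing g to unit energy.
    The integrals in Eqs. (5)-(6) are approximated by a Riemann sum over
    nfreq samples of the DTFT on [-pi, pi)."""
    energy = math.sqrt(sum(x * x for x in g))
    g = [x / energy for x in g]
    # Eq. (3): time-domain mean; Eq. (4): time-domain spread.
    m_t = sum(n * x * x for n, x in enumerate(g))
    var_t = sum((n - m_t) ** 2 * x * x for n, x in enumerate(g))
    # Sample |G(w)|^2 on [-pi, pi).
    ws = [-math.pi + 2.0 * math.pi * j / nfreq for j in range(nfreq)]
    mag2 = []
    for w in ws:
        re = sum(x * math.cos(w * n) for n, x in enumerate(g))
        im = sum(x * math.sin(w * n) for n, x in enumerate(g))
        mag2.append(re * re + im * im)
    # Eq. (5): frequency-domain mean; Eq. (6): frequency-domain spread.
    m_w = sum(w * p for w, p in zip(ws, mag2)) / nfreq
    var_w = sum((w - m_w) ** 2 * p for w, p in zip(ws, mag2)) / nfreq
    return var_t, var_w
```

For a one-sample impulse the time spread is zero while the frequency spread approaches pi^2/3 (a flat spectrum); a longer, smoother sequence trades the other way, illustrating the uncertainty tradeoff.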

3. SPEECH FEATURES AND COMPARISON OF LOCALITY OF SPEECH FEATURES

In this section we give a brief description of the speech features whose performance we compared with MFDWC.

3.1. MFCC

The MFCC are calculated by taking the DCT of the mel-scaled log filterbank energies:

    c(k) = \sum_{n=1}^{N} S_n \cos\left( \frac{k \pi}{N} (n - 0.5) \right)    (7)

where N is the number of filterbanks, c(k) represents the kth MFCC, and S_n represents the log-energy output of the nth filter.

3.2. Subband Features

The first step in extracting subband features is to divide the frequency band into subbands. Speech features are then extracted from each subband; these may be MFCC, LPCC, or other features. In this paper the MFCC of each subband are used as the subband features. A likelihood is then calculated for each subband, and finally the recognition scores from the subbands are merged to give the total score. The most critical part of a subband-based recognizer is the merging algorithm, which weights the partial recognition scores of the subbands based on the information conveyed in each subband or on the signal-to-noise ratio of each subband. In this paper, all subband scores are weighted equally.

3.3. Multi-resolution Features

Multi-resolution features were proposed by Vaseghi et al. [11] to obtain different time-frequency resolutions of the speech spectrum. In this method cepstral coefficients from all resolutions are combined, i.e., 4 coefficients from the full band, 2 coefficients from each half band, 2 coefficients from each quarter band, etc. In this paper we calculated MFCC at the different resolutions.

3.4. MFDWC

The MFDWC are obtained by applying the Discrete Wavelet Transform to the mel-scaled log filterbank energies of a speech frame. For background on the wavelet transform and its implementations, interested readers may refer to [13, 14]. The wavelet transform uses short windows to measure the high-frequency content of the signal and long windows to measure the low-frequency content. This property distinguishes the Wavelet Transform from the Short-Time Fourier Transform and the Fourier Transform. We chose Cohen, Daubechies, and Feauveau's [12] compactly supported biorthogonal spline wavelet with three taps and five taps for the high-pass filter and the low-pass filter, respectively.

3.5.
Comparison of the Localization Property of DWT Basis Vectors with DCT Basis Vectors in Time and Frequency

In this section we compare the localization properties of the DWT and the DCT, which were used to calculate the MFDWC, MFCC, subband features, and multi-resolution features described above. The basis vectors of the DCT are given by:

    g_k[n] = \cos\left( \frac{k \pi}{N} (n + 0.5) \right), \quad n = 0, 1, \ldots, N-1    (8)

where k is a natural number and 0 \le k < N. Since we get similar values of the spread measures for different N and k, the measures of spread of the DCT basis vectors are calculated only for N = 8, which corresponds to the measure of spread of the DCT basis vectors used for calculating the SUB-4 features (see Section 3.2 for SUB-4 features). The DWT is implemented using a pair of perfect reconstruction filters [13]. Discrete wavelet coefficients of the mel-scaled filterbank energies are calculated at scales 4 (8 coefficients), 8 (4 coefficients), 16 (2 coefficients), and 32 (1 coefficient). Table 1 gives the spreads of the DCT basis vectors in time, in frequency, and jointly for N = 8. Table 2 gives the spreads of the discrete wavelet basis vectors (the wavelet with the seven-tap low-pass filter and nine-tap high-pass filter) in time, in frequency, and jointly for scales 4, 8, 16, and 32.

Table 1: The measures of resolution of the discrete cosine basis vectors in time, in frequency, and jointly.

  k    time      frequency   joint
  1    8.4043    0.2129      1.8413
  2    5.9571    0.2517      1.4995
  3    5.4742    0.3297      1.8051
  4    5.2500    0.4422      2.3216

Table 2: The measures of resolution of the discrete wavelet basis vectors in time, in frequency, and jointly.

  Scale   time        frequency   joint
  4       2.7398      0.2204      0.6038
  8       11.0501     0.0587      0.6487
  16      43.7323     0.0149      0.6525
  32      174.2714    0.0038      0.6567

As seen from Table 1 and Table 2, the discrete wavelet basis vectors have better joint resolution than the discrete cosine basis vectors. A second observation is that all basis vectors of the DCT have about the same resolution in the time and frequency domains, whereas the DWT basis vectors used to calculate the high-frequency content of the signal have better resolution in the time domain, and the DWT basis vectors used to calculate the low-frequency content have better resolution in the frequency domain. This property is important, as explained in Section 2. From these two observations we can speculate that the proposed features yielded better results than the MFCC, SUB features, and MULT features (see Section 5) because the proposed features have better joint resolution. Furthermore, the proposed features have different time and frequency resolutions, better representing both the low-frequency and the high-frequency content of the signal.

4. EXPERIMENTAL SETUP AND TASK

We used the TIMIT database to evaluate and compare the performance of the proposed features with MFCC, subband features, and multi-resolution features on a phoneme recognition task. Of the 61 quasi-phonemic labels defined in the TIMIT database, 22 phone labels were merged into the remaining 39 as in [15]. Confusions within the same categories are not counted in calculating classification accuracy. The sa sentences, which are common to all speakers, are not used, to avoid possible bias towards certain phones. Three-state, left-to-right, no-skip context-independent HMMs were constructed for each phoneme category. The output probability distribution of each state is modeled by a mixture of five multivariate Gaussian density functions with diagonal covariance matrices. The speech signal, sampled at 16 kHz, is analyzed with a 32 ms

Hamming window every 10 ms. The FFT of each frame is taken to calculate the power spectrum of the signal. For the computation of the mel-scaled log energies, 32 triangular mel-scaled band-pass filters were designed. The definitions of the feature vectors used in this paper are given below.

1. MFCC: The MFCC are computed by taking the DCT of the log filterbank energies. The first fifteen MFCC are used.

2. SUB-4, SUB-8 (subband features): The mel-scaled filterbank outputs are divided into four (eight) equally mel-spaced subbands for calculating the SUB-4 (SUB-8) features. The MFCC of each subband are calculated independently, and the first four (two) MFCC are taken from each subband for the SUB-4 (SUB-8) features.

3. MULT (multi-resolution features): Four MFCC are taken from the full band, two MFCC from each half band, and two MFCC from each quarter band, for a total of sixteen.

4. MFDWC: Eight coefficients at scale four, four coefficients at scale eight, two coefficients at scale sixteen, and one coefficient at scale thirty-two are taken. The total number of static coefficients is therefore fifteen.

All feature vectors also include delta coefficients and delta energy in order to capture the dynamic features. Since it is not possible to have an arbitrary number of coefficients (e.g., for SUB-4 we cannot have 17 coefficients; the number of coefficients has to be a multiple of 4), we used fifteen coefficients for some features and sixteen for others.

5. EXPERIMENTAL RESULTS

We conducted a series of experiments under different noise conditions, and Table 3 shows the results. The first column shows the SNR values: 20-db and 10-db correspond to the SNR for white noise, while 20-db lp, 10-db lp, and 5-db lp correspond to the SNR for low-pass filtered white noise with a cut-off frequency of 1.2 kHz.

Table 3: Phoneme recognition rates for different features for clean and noisy speech.

            MFCC    SUB-4   SUB-8   MULT    MFDWC
            (1-15)  (1-16)  (1-16)  (1-16)  (1-15)
  Clean     59.64   58.11   54.00   56.20   61.15
  20-db     47.02   47.55   42.33   45.23   52.14
  10-db     22.10   24.57   20.14   16.86   28.32
  20-db lp  54.43   54.66   49.15   49.40   57.82
  10-db lp  45.25   44.44   39.08   37.28   50.02
  5-db lp   36.74   28.89   29.31   28.86   37.59

The proposed features yielded better performance than all the other features considered. Furthermore, if we use a merging algorithm as proposed in [4, 5, 10, 16], we may get additional improvements.

6. CONCLUSION

In this paper we introduced a new speech feature extraction method which takes advantage of the time-frequency localization property of the wavelet transform. The proposed speech features, MFDWC, yielded better recognition rates than SUB, MULT, and MFCC features on a phoneme recognition task on the TIMIT database. Further improvement may be achieved by using a recombination method to combine the subband probabilities. Future work will include investigating different wavelets, real noise, and continuous speech recognition.

7. REFERENCES

[1] S. B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, 1980.
[2] H. Fletcher, Speech and Hearing in Communication, 1953.
[3] J. B. Allen, "How Do Humans Process and Recognize Speech?," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, 1994.
[4] H. Hermansky, S. Tibrewala, and M. Pavel, "Towards ASR on Partially Corrupted Speech," in Proceedings of ICSLP, 1996.
[5] H. Bourlard and S. Dupont, "A New ASR Approach Based on Independent Processing and Recombination of Partial Frequency Bands," in Proceedings of ICSLP, 1996.
[6] N. Mirghafori and N. Morgan, "Transmission and Transition in Multi-band ASR," in Proceedings of ICASSP, 1998.
[7] C. Cerisara, J. P. Haton, J. F. Mari, and D. Fohr, "A Recombination Model for Multi-band Speech Recognition," in Proceedings of ICASSP, 1998.
[8] M. J. Tomlinson, M. J. Russell, R. K. Moore, A. P. Buckland, and M. A. Fawley, "Modelling Asynchrony in Speech Using Elementary Single-Signal Decomposition," in Proceedings of ICASSP, 1997.
[9] S. Okawa, E. Bocchieri, and A. Potamianos, "Multi-band Speech Recognition in Noisy Environments," in Proceedings of ICASSP, 1998.
[10] S. Tibrewala and H. Hermansky, "Sub-band Based Recognition of Noisy Speech," in Proceedings of ICASSP, 1997.
[11] S. Vaseghi, N. Harte, and B. Milner, "Multi Resolution Phonetic/Segmental Features and Models for HMM Based Speech Recognition," in Proceedings of ICASSP, 1997.
[12] A. Cohen, I. Daubechies, and J. Feauveau, "Biorthogonal Bases of Compactly Supported Wavelets," Communications on Pure and Applied Mathematics, 1992.
[13] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, 1998.
[14] M. Vetterli and J. Kovacevic, Wavelets and Subband Coding, Prentice Hall, 1995.
[15] R. Chengalvarayan and L. Deng, "Use of Generalized Dynamic Feature Parameters for Speech Recognition," IEEE Transactions on Speech and Audio Processing, vol. 5, May 1997.
[16] H. Bourlard and S. Dupont, "Subband-Based Speech Recognition," in Proceedings of ICASSP, 1997.
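For concreteness, the static MFDWC layout used in the experiments (8 + 4 + 2 + 1 coefficients from 32 mel-scaled log filterbank energies, Sections 3.4 and 4) can be sketched as below. This is an illustrative reconstruction, not the authors' code: it substitutes a Haar filter pair for the biorthogonal 3/5-tap spline wavelet of [12], and the function names are our own.

```python
import math

def haar_step(x):
    """One analysis step: split x into half-length approximation and detail."""
    approx = [(p + q) / math.sqrt(2) for p, q in zip(x[::2], x[1::2])]
    detail = [(p - q) / math.sqrt(2) for p, q in zip(x[::2], x[1::2])]
    return approx, detail

def mfdwc_like(log_energies):
    """DWT of the 32 mel-scaled log filterbank energies of one frame, keeping
    8 + 4 + 2 + 1 = 15 coefficients (scales 4, 8, 16, 32), mirroring the
    static MFDWC layout; the 16-coefficient finest level is discarded."""
    assert len(log_energies) == 32
    a1, _ = haar_step(log_energies)  # finest level (16 coeffs) not kept
    a2, d2 = haar_step(a1)           # 8 coefficients, scale 4
    a3, d3 = haar_step(a2)           # 4 coefficients, scale 8
    a4, d4 = haar_step(a3)           # 2 coefficients, scale 16
    _, d5 = haar_step(a4)            # 1 coefficient, scale 32
    return d2 + d3 + d4 + d5
```

Whether the single scale-32 coefficient is the last detail or the final approximation is an assumption here; the paper does not spell it out.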
