Voice Analysis Using Short Time Fourier Transform and Cross Correlation Methods
Abstract
The objective of this paper is to examine the human voice by applying the Short Time Fourier Transform (STFT) and the cross-correlation technique. In both methods, the input query is compared with each sample voice stored in the database, and the best-matching voice is returned as the result. The first method has the higher hit ratio.
1. INTRODUCTION
The aim of the following experiment is to apply different methods to examine the human voice and to visualize the differences between words or sounds in the output of the same method. The method that stood out and was incorporated in this project is the Short Time Fourier Transform (STFT) and its three-dimensional plot, the spectrogram. A simple but practical recognizer that could be applied in an automatic speech recognition system was built in the project. The best way of presenting the results is to step through the whole procedure, understand the use of each method, and compare the outcomes.
2. SYSTEM ANALYSIS
2.1 Introduction
Developing this project requires basic knowledge of the MATLAB environment and of the various voice-related functions available in MATLAB.
3. ANALYSIS
In the analysis phase, the project is divided into several phases, each carried out in a sequential fashion.
3.1 Constraints, Dependencies and Assumptions
The constraints and dependencies of the system are as follows. A database containing the voices of the users must be maintained, and another database containing the queries must be maintained. It is assumed that there is always a match between the query and the stored database. No interactive voice input is required for matching.
Proceedings of the International Conference on Cognition and Recognition
Fig. 1: The classic (waterfall) life cycle model
The figure illustrates that the analysis phase comes first, followed by the design phase, then coding (the implementation), and finally the testing phase. The model proceeds in a sequential manner. The classic life cycle paradigm has a definite and simple structure: it provides a template into which methods for analysis, design, coding, testing and support can be placed.
4. DESIGN
Two design techniques can be used; the techniques and their advantages and disadvantages are as follows.
In the STFT, the transform of the windowed segment starting at sample n0 is

X[k, n0] = sum_{n=0}^{N-1} x[n0 + n] · w[n] · e^(-j2πkn/N)

which can be treated as taking the FFT of { x[n0 + n] · w[n] }, whose length is controlled by the applied window w[n]. The window length is chosen as a trade-off: the shorter the window, the more time detail is captured, but the less detail is available in the frequency domain. After the STFT is done, X[k, n0] can be plotted on the k-n0 coordinates, i.e. the combined time-frequency plane. The various steps performed are:
Step 1 : Voices for the numbers one to ten were recorded and saved as the database.
Step 2 : Another set of the numbers one to ten was recorded and saved as the queries. (This set was recorded at a different time and is distinct from the set above.)
Step 3 : Once a query is chosen, or input, from the query set, the following steps are applied to both the query and each sample in the database.
Step 4 : The parameters returned by the MATLAB 'wavread' command are the sampling rate (rs) and the sampled data (d). The sampled data is what we want to work on in the time domain.
Step 5 : First, a program called "clipspeech.m", which performs signal clipping, was applied to the data (d) to extract the major piece that contains the most information.
Step 6 : The resulting data (d') was passed to the STFT with a window size of 512 points and a 256-point overlap.
Step 7 : A mask was applied to the resulting data (after the STFT the data consists of Fourier series coefficients) to suppress the white noise, in this case components with energies less than 5 dB.
Step 8 : After applying the mask and taking the inverse STFT, the data was converted back to the time domain, and the MATLAB built-in command "spectrogram" was applied to plot the subject wave signal. Now let us look at the spectrograms obtained for the database.
Step 9 : Three frequency levels that best classify the energies were chosen as the points of comparison. The time axis was broken down into four regions, and the total energy in each region was calculated.
Step 10 : The three energy levels on the frequency axis and the four energy regions on the time axis of each sample in the database were compared with the respective values of the query. A score index with ten entries was generated to record the similarity scores.
Step 11 : The sample in the database with the highest score is picked out and reported.
In Step 6, each window has 50% overlap with its successor. When a series of such overlapping windows is added together, the overall gain is unity; therefore, when the signal is retrieved, it is affected neither by the side lobes of the window nor by any change in magnitude. This is shown in Figure 2.
Figure 2: The overall sum of a series of windows with 50% overlap is 1.
In Step 7, the mask zeroes every STFT coefficient whose energy lies below the 5 dB threshold:

M[k, n0] = 1 if |X[k, n0]|^2 >= threshold, and 0 otherwise.
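The scoring in Steps 9-11 can be sketched in Python with NumPy (the original work used MATLAB). The equal-width band/region split and the absolute-difference score below are illustrative assumptions; the paper does not specify the exact band edges or distance metric.

```python
import numpy as np

def energy_features(spec, n_bands=3, n_regions=4):
    """Collapse a spectrogram into three frequency-band energies and four
    time-region energies (Steps 9-10), normalized to sum to 1."""
    bands = np.array_split(spec, n_bands, axis=0)      # split the frequency axis
    regions = np.array_split(spec, n_regions, axis=1)  # split the time axis
    feats = np.array([(b ** 2).sum() for b in bands]
                     + [(r ** 2).sum() for r in regions])
    return feats / feats.sum()  # normalize so loudness differences cancel

def similarity(q, s):
    """Score two feature vectors: a smaller energy-distribution
    distance gives a higher score (assumed metric)."""
    return -np.abs(q - s).sum()

# Toy usage (Step 11): ten random "spectrograms" stand in for the database;
# the query is sample 3 plus a little noise and should score highest.
rng = np.random.default_rng(0)
db = [np.abs(rng.normal(size=(257, 40))) for _ in range(10)]
query = db[3] + 0.01 * rng.normal(size=(257, 40))
qf = energy_features(query)
best = max(range(10), key=lambda i: similarity(qf, energy_features(db[i])))
```

On real data, `spec` would be the masked STFT magnitude produced in Steps 6-8 rather than a random matrix.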
The delay d has been chosen such that the cross-correlation is maximal. However, the accuracy of the result obtained by using cross-correlation alone is still far below our expectation; therefore, the spectrogram from approach I has been exploited to improve the situation. Several steps are required to achieve sound recognition in the sound recognition system.
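The delay search can be sketched in Python with NumPy; `best_lag` is a hypothetical helper name, and the sine signals below are synthetic stand-ins for the recorded voices.

```python
import numpy as np

def best_lag(a, b):
    """Return the delay d that maximizes the cross-correlation of a and b,
    i.e. the d for which sum_n a[n + d] * b[n] is largest."""
    r = np.correlate(a, b, mode="full")  # lags run from -(len(b)-1) to len(a)-1
    return int(np.argmax(r)) - (len(b) - 1)

# Toy usage: y is x delayed by 5 samples, and the search recovers d = 5.
x = np.sin(2 * np.pi * 0.02 * np.arange(200))
y = np.zeros(200)
y[5:] = x[:-5]
d = best_lag(y, x)
```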
After the procedure described above, the output was largely improved; the results and the reasons are explained in the next section. The normalization chosen for the cross-correlation, Eq. (4), is

Rxy_norm[d] = Rxy[d] / sqrt(Rxx[0] · Ryy[0])

where Rxx[0] and Ryy[0] are the zero-lag autocorrelations of the two signals.
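A minimal Python/NumPy sketch of this normalization, dividing the cross-correlation peak by the zero-lag autocorrelations Rxx[0] and Ryy[0] referred to in Eq. (4); the function name is illustrative.

```python
import numpy as np

def normalized_score(x, y):
    """Peak cross-correlation of x and y, normalized by the zero-lag
    autocorrelations Rxx[0] and Ryy[0]; the result lies in [-1, 1]."""
    rxy = np.correlate(x, y, mode="full")
    rxx0 = np.dot(x, x)  # Rxx[0]
    ryy0 = np.dot(y, y)  # Ryy[0]
    return float(rxy.max() / np.sqrt(rxx0 * ryy0))

# Identically shaped signals score 1 even at very different loudness
# levels, which is exactly why the normalization is needed.
s = np.sin(2 * np.pi * 0.01 * np.arange(400))
same = normalized_score(s, 3.0 * s)
```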
5. IMPLEMENTATION
The proposed system is implemented in MATLAB environment. The implementation includes the following built-in functions
and user defined functions. The built-in functions used are,
§ Wavread ()
WAVREAD reads Microsoft WAVE (".wav") sound files. Y = WAVREAD(FILE) reads a WAVE file specified by the string FILE, returning the sampled data in Y. The ".wav" extension is appended if no extension is given. Amplitude values are in the range [-1, +1].
§ Wavplay ()
WAVPLAY plays sound using the Windows audio output device. WAVPLAY(Y, FS) sends the signal in vector Y, with sample frequency FS Hertz, to the Windows WAVE audio device. Standard audio rates are 8000, 11025, 22050, and 44100 Hz.
§ Specgram ()
SPECGRAM calculates the spectrogram of a given signal. B = SPECGRAM(A, NFFT, Fs, WINDOW, NOVERLAP) calculates the spectrogram for the signal in vector A. SPECGRAM splits the signal into overlapping segments, windows each segment with the WINDOW vector, and forms the columns of B from their zero-padded, length-NFFT discrete Fourier transforms. Thus each column of B contains an estimate of the short-term, time-localized frequency content of the signal A. Time increases linearly across the columns of B, from left to right. Frequency increases linearly down the rows, starting at 0.
The user-defined functions used in this system are:
§ STFT ()
D = STFT(X, F, W, H) computes the short-time Fourier transform, returning the frames of the short-term Fourier transform of X. Each column of the result is one F-point FFT; each successive frame is offset by H points until X is exhausted. Data is Hamming-windowed at W points (or W is used as the window if it is a vector).
§ ISTFT ()
X = ISTFT(D, F, W, H) computes the inverse short-time Fourier transform, performing overlap-add resynthesis from the short-time Fourier transform data in D. Each column of D is taken as the result of an F-point FFT; each successive frame is offset by H points. Data is Hamming-windowed at W points, or W is used as the window if it is a vector.
§ Clipspeech ()
Clipspeech is used to clip the region carrying the maximum information from the voice signals in the database and in the query base. This helps to concentrate on the information-rich part of the signal, rather than the sparse parts, when determining the energy distribution of the input query and the stored database.
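A rough Python/NumPy analogue of what `clipspeech.m` presumably does (the original code is not listed): keep the span of frames whose short-time energy stays within some margin of the loudest frame. The 256-sample frame and the 20 dB margin are assumptions, not taken from the original.

```python
import numpy as np

def clip_speech(d, frame=256, keep_db=20.0):
    """Keep the span of frames whose short-time energy lies within
    keep_db of the loudest frame (assumed behaviour of clipspeech.m)."""
    n = len(d) // frame
    e = (d[:n * frame].reshape(n, frame) ** 2).sum(axis=1)  # frame energies
    thresh = e.max() / (10 ** (keep_db / 10))               # -keep_db relative
    active = np.flatnonzero(e >= thresh)
    lo, hi = active[0], active[-1] + 1                      # first..last loud frame
    return d[lo * frame: hi * frame]

# Toy usage: silence, a burst, silence -> only the burst survives.
sig = np.concatenate([1e-4 * np.ones(2048),
                      np.sin(2 * np.pi * 0.05 * np.arange(2048)),
                      1e-4 * np.ones(2048)])
clipped = clip_speech(sig)
```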
6. PERFORMANCE ANALYSIS
6.1 Approach I
Table 1 below shows the correct detections for each number in a set of ten trials. Note that for each trial, in both approaches, the query was re-recorded for the purpose of generalization, but in the demo section (as the program is currently set up) the queries were pre-recorded and saved.
Table 1: The testing results for the number recognition by using the approach above
6.2 Approach II
The simulation results obtained using cross-correlation in the time domain are shown in Table 2, and the results obtained with the help of the spectrogram are shown in Table 3. Correct recognition percentages were computed and listed in both tables. Comparing the tables, it is easy to see that the spectrogram dramatically improved the recognition accuracy.
Table 2: Recognition using only cross-correlation in the time domain
Table 3: Recognition using both cross-correlation and the spectrogram
7. RESULTS
8. CONCLUSION
8.1 Approach I
It has been shown above that the spectrogram is a very useful and powerful tool for visualizing the three-dimensional features of sounds. Approach I showed the difficulty of spotting human voices: the result was not very consistent, because many factors affect the detection. For example, for the same word the energy differs from person to person, and even for the same person under different conditions.
8.2 Approach II
At the beginning of approach II, we compared two voice samples using cross-correlation in the time domain. However, the result was totally unacceptable: the probability of correct recognition was close to zero. The reason is that a sound is a combination of many signals at different frequencies; the human voice in particular is a combination of hundreds of frequencies. To obtain a more accurate sound recognition result, it is not enough to use information from the time domain alone. Our approach was therefore modified to concentrate on exploiting the spectrogram. As mentioned in the first approach, the spectrogram shows the useful characteristics of sounds in the frequency domain, and recognition accuracy improved dramatically, as shown in Table 3. There was an interesting result in the same table: although the overall accuracy was low, the correct recognitions for "k" and "t" were still higher than for the others under the same conditions. The reason, as illustrated in Figure 2, is that these two sounds are consonants, which have very different energy distributions and shorter durations than vowels. By focusing on the energy in the frequency domain, the output improved considerably. The spectrogram shows the useful characteristics by combining frequency and time variation in the same plot, while cross-correlation can more efficiently find the similarity between two sounds. Regarding the different sound examples: (i) "e" and "o" are vowel sounds, more intense in energy, and both echo over a long period, so they easily cause wrong recognition of each other; (ii) compared with "e" and "o", "k" and "t" are the opposite extreme, and they also cause faults in sound recognition; (iii) the other words have no particular peer with which they are confused. Normalization achieved more accurate recognition. Different sounds have different energies because of their diverse frequency components, so the cross-correlations of two signals can only be compared under the normalized condition, which is to divide the cross-correlation by the autocorrelations of both signals, denoted Rxx[0] and Ryy[0] in Eq. (4).
ACKNOWLEDGEMENT
The Author Thanks Professor Dr.Kumaravel Anna University, Chennai For His Many Helpful Lectures, Instructive Guidance
And The valuable Comments. The Author’s Appreciation Is Also Extended To The Fellow Students For Their Support. And In
This Paper, Most Of The Simulations Are Carried Out With The Software Mat Lab And C. The Important Resources in
Websites Are Gratefully acknowledged.