
Proceedings of the International Conference on Cognition and Recognition

Voice Analysis using Short Time Fourier Transform and Cross Correlation Methods
N. Rajalakshmi, A. Anitha and K. Narayanan
Lecturer, Dept of CSE, PSG College of Technology

Abstract

The objective of this paper is to examine the human voice by applying the Short Time Fourier Transform (STFT) and the cross-correlation technique. In both methods, we take the input query and compare it with each sample voice stored in the database. The voice with the best match is returned as the result. The first method has the higher hit ratio.

1. INTRODUCTION

The aim of this work is to apply different methods to examine the human voice and to visualize the differences between words or sounds in the output of each method. The method that stood out, and that was incorporated in this project, is the Short Time Fourier Transform (STFT) together with its three-dimensional plot, the spectrogram. A simple but practical recognizer that could be applied in an automatic speech recognition system was built in the project. The best way of showing the results is to step through the whole procedure, understand the usage of each method, and compare the outcomes.

2. SYSTEM ANALYSIS

2.1 Introduction
Developing this project requires basic knowledge of the MATLAB environment and of the various voice-related functions that MATLAB provides.

2.2 Existing Service


The existing voice recognition system uses the Fast Fourier Transform (FFT) to find the energy distribution of the input query signal and of the stored database signals. The FFT captures only the frequency content of the voice signal, but there may also be variation in amplitude over time when two signals are compared. Hence the match obtained between two signals is not very accurate.

2.3 Proposed Service


The proposed service uses the Short Time Fourier Transform (STFT) to find the energy distribution of the input query signal and of the stored database signals. The STFT captures both the time-domain and the frequency-domain behaviour of the voice signal, so the effect of amplitude variation between the signals during matching is reduced. Hence the match obtained between the signals is more accurate.
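The difference between the two services can be illustrated with a short sketch. The paper's implementation is in MATLAB; the following is a minimal Python/SciPy equivalent, with the 440 Hz / 880 Hz test tones chosen here purely for illustration. A single FFT over the whole signal reports both frequencies, while the STFT (512-point windows with 256-point overlap, the sizes used later in Section 4.1) also shows when each one occurs.

```python
import numpy as np
from scipy.signal import stft

fs = 8000                      # assumed sampling rate in Hz
t = np.arange(0, 1.0, 1.0 / fs)
# Two tones in different halves of the signal: a single FFT reports
# both frequencies but not when each one occurs.
x = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t),
                      np.sin(2 * np.pi * 880 * t))

# FFT over the whole signal: frequency content only.
spectrum = np.abs(np.fft.rfft(x))

# STFT: 512-point windows with 256-point overlap, giving an energy
# distribution over both time and frequency.
f, seg_t, Z = stft(x, fs=fs, nperseg=512, noverlap=256)
energy = np.abs(Z) ** 2        # time-frequency energy distribution
```

In the `energy` array, columns near the start peak at the 440 Hz bin and columns near the end peak at the 880 Hz bin, which is exactly the time-localized information the plain FFT discards.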

3. ANALYSIS

In the analysis phase, the project is divided into several stages, and each stage is carried out in a sequential fashion.
3.1 Constraints, Dependencies and Assumptions
The constraints and dependencies of the system are as follows. A database containing the voices of the users must be maintained, and another database containing the queries must be maintained. It is assumed that there is always a match between the query and the stored database. No interactive voice input is required for matching.


3.2 Software Process Model


Linear Sequential Model
The software process model we followed to develop the project is the linear model. Sometimes called the classic life cycle or waterfall model, the linear sequential model suggests a systematic, sequential approach to software development that begins at the system level and progresses through analysis, design, coding, testing and support. Fig. 1 illustrates the linear sequential model for software engineering.

Fig. 1: The linear sequential model (analysis, design, coding, testing and support)

The figure shows that the analysis phase comes first, followed by the design phase. Then comes coding, i.e. the implementation, and the process is completed with the testing phase. The model proceeds in a strictly sequential manner. The classic life cycle paradigm has a definite and simple structure, and it provides a template into which methods for analysis, design, coding, testing and support can be placed.

4. DESIGN

Two design techniques can be used. The techniques, their advantages and their disadvantages are as follows.

4.1 Approach I (Spectrogram)


The FFT shows only the frequency response of a signal; but what if the signal's magnitude varies with time? If one wants to show this aspect together with frequency on the same plot, the best way is to plot the spectrogram of that signal. The idea of the STFT is to apply a short window to a long signal and take the FFT of the segment of the signal covered by the window. Repeating this yields multiple frequency responses of the signal; in each frequency response, the magnitude reflects the amplitude of the original signal in that particular window slot. The formula takes the form

    X[k, n0] = SUM over n of x[n0 + n] * w[n] * e^(-j*2*pi*k*n/N)

The equation above can be treated as taking the FFT of {x[n0 + n] * w[n]}, where the length of {x[n0 + n] * w[n]} is controlled by the applied window w[n]. The window length is chosen according to the aspect of interest: the shorter the window, the more time detail is captured but the less frequency detail. After the STFT is done, X[k, n0] can be plotted on the k-n0 coordinates, which combine frequency and time. The various steps performed are:

Step 1 : Voices saying the numbers one to ten were recorded and saved as the database.
Step 2 : Another set of numbers, also from one to ten, was recorded and saved as the queries. (This set was recorded at a different time and is distinct from the set above.)
Step 3 : Once a query is chosen, or input from the query set, the following steps are applied to both the query and each sample in the database.
Step 4 : The parameters returned by the MATLAB 'wavread' command are the sampling rate (fs) and the sampled data (d). The sampled data is the time-domain data that we want to work on.
Step 5 : First, a program called "clipspeech.m", which performs signal clipping, was applied to the data (d) to extract the major piece that contains the most information.
Step 6 : The resultant data (d') was used to compute the STFT with a window size of 512 points and a 256-point overlap.
Step 7 : A mask was applied to the resultant data (after the STFT the data are the Fourier coefficients) to suppress the white noise, which in this case had energies less than 5 dB.
Step 8 : After applying the mask and taking the inverse STFT, the data were converted back to the time domain, and the MATLAB built-in command 'specgram' was applied to plot the subject wave signal. Now, let us take a look at the resulting spectrograms of the database.
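The steps above can be sketched in Python/SciPy (the paper's own code is MATLAB). The paper does not reproduce clipspeech.m, so clip_speech below is a hypothetical stand-in based on frame energy; the synthetic 300 Hz burst merely stands in for a recorded digit, and the mask floor of 40 dB below the peak is likewise an assumption, since the paper's mask equation is not shown in full.

```python
import numpy as np
from scipy.signal import stft, istft

def clip_speech(d, frame=256, keep_db=-30.0):
    """Hypothetical stand-in for the paper's clipspeech.m: keep the
    span of frames whose energy is within keep_db of the loudest frame."""
    n = len(d) // frame
    e = np.array([np.sum(d[i * frame:(i + 1) * frame] ** 2) for i in range(n)])
    active = np.where(e > e.max() * 10 ** (keep_db / 10))[0]
    return d[active[0] * frame:(active[-1] + 1) * frame]

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(0, 1.0, 1.0 / fs)
# Synthetic "digit": silence, a 300 Hz burst, silence, plus faint noise.
d = np.where((t > 0.3) & (t < 0.7), np.sin(2 * np.pi * 300 * t), 0.0)
d = d + 0.001 * rng.standard_normal(len(t))

d2 = clip_speech(d)                                     # Step 5
f, tt, Z = stft(d2, fs=fs, nperseg=512, noverlap=256)   # Step 6
# Step 7: suppress coefficients far below the peak (floor is an assumption).
mask = 20 * np.log10(np.abs(Z) + 1e-12) > 20 * np.log10(np.abs(Z).max()) - 40
_, d_rec = istft(Z * mask, fs=fs, nperseg=512, noverlap=256)  # Step 8
```

The clipped signal d2 retains only the 300 Hz burst, and the masked reconstruction d_rec is still dominated by that frequency, which is the behaviour the pipeline relies on.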


Fig. 2: The spectrograms of the sample voice "1 ~ 10"

Step 9 : Three frequency levels that best separate the energies were chosen as the points of comparison. The time axis was broken down into four regions, and in each region the total energy was calculated.
Step 10: All three energy levels on the frequency axis and all four energy regions on the time axis of each sample in the database were compared with the respective values of the query. A score index with ten entries was generated to record the similarity scores.
Step 11: The sample in the database with the highest score is picked out and reported.

In Step 6, each window has a 50% overlap with its successor. This is because, when a series of such overlapping windows is summed, the overall gain is unity; therefore, when the signal is retrieved, it is affected neither by the side lobes of the window nor by a change in magnitude. The overall sum of a series of windows with 50% overlap is 1. In Step 7, the mask sets a coefficient to zero when its energy is below the 5 dB threshold and leaves it unchanged otherwise.
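Steps 9-11 amount to comparing a small energy "fingerprint" per recording. A possible reading in Python/SciPy follows; equal band and region splits are assumed here because the paper does not give the exact band edges, and the function names are illustrative.

```python
import numpy as np
from scipy.signal import stft

def band_region_energies(x, fs, n_bands=3, n_regions=4):
    """Total spectrogram energy in n_bands frequency bands x n_regions
    time regions (Steps 9-10). Equal splits are an assumption; the
    paper does not specify the band edges."""
    _, _, Z = stft(x, fs=fs, nperseg=512, noverlap=256)
    E = np.abs(Z) ** 2
    feats = [region.sum()
             for band in np.array_split(E, n_bands, axis=0)
             for region in np.array_split(band, n_regions, axis=1)]
    return np.asarray(feats)

def similarity(query, sample, fs=8000):
    """Higher is more similar; energies are normalized so that overall
    loudness differences between recordings matter less."""
    fq = band_region_energies(query, fs)
    fd = band_region_energies(sample, fs)
    return -np.abs(fq / fq.sum() - fd / fd.sum()).sum()
```

The database entry with the highest similarity score is then reported, as in Step 11.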

4.2 Approach II (Cross Correlation)


It is well known that cross correlation can be efficiently implemented in the transform domain. The correlation between two signals (cross correlation) is a standard approach to feature detection. Textbook presentations of correlation show the convolution theorem and the attendant possibility of efficiently computing correlation in the time domain, but give few details about how to use the normalized form of correlation efficiently for template matching in the frequency domain. In this approach, for feature-matching applications, the sound recognizer has been designed using the normalized form of cross correlation. Seven sample sounds ("e", "i", "u", "o", "ar", "k" and "t"), which are common consonants and vowels in English, have been used in the simulation. The procedure and results are described in the following. The first part was to compare two sound samples using the cross correlation in the time domain.

In the cross-correlation formula

    Rxy[d] = SUM over n of x[n] * y[n + d]

d is chosen such that the correlation is maximal. However, the accuracy obtained using cross correlation alone is still far below our expectation. Therefore, the spectrogram from Approach I has been exploited to improve the situation. Several steps are needed to achieve sound recognition in this system.

Step 1 : Generate the spectrograms for the query and database sounds.
Step 2 : Extract the sound signal within an assigned boundary (by its energy distribution).
Step 3 : Normalize and compute the cross correlation between the query sample and each sample in the database.
Step 4 : Compare and decide the closest sound after the recognition.
Step 5 : Report the result.

After the procedure described above, the output result was largely improved. The result and the reason are explained in the next section. The normalization chosen for the cross correlation is

    Rxy[d] / sqrt(Rxx[0] * Ryy[0])     (4)
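A minimal Python/NumPy sketch of the normalized cross correlation follows, dividing the peak of Rxy[d] by sqrt(Rxx[0] * Ryy[0]) as described; the function name is illustrative.

```python
import numpy as np

def normalized_xcorr_peak(x, y):
    """Peak of the cross correlation Rxy[d] over all lags d, divided
    by sqrt(Rxx[0] * Ryy[0]) as in Eq. (4)."""
    rxy = np.correlate(x, y, mode='full')   # Rxy[d] for every lag d
    return rxy.max() / np.sqrt(np.dot(x, x) * np.dot(y, y))
```

Because of the normalization, scaling one signal's amplitude leaves the score unchanged; this is exactly why raw, unnormalized correlations of recordings with different loudness cannot be compared directly.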

5. IMPLEMENTATION

The proposed system is implemented in the MATLAB environment. The implementation includes the following built-in and user-defined functions. The built-in functions used are:

§ wavread()
WAVREAD reads Microsoft WAVE (".wav") sound files. Y = WAVREAD(FILE) reads a WAVE file specified by the string FILE, returning the sampled data in Y. The ".wav" extension is appended if no extension is given. Amplitude values are in the range [-1, +1].

§ wavplay()
WAVPLAY plays sound using the Windows audio output device. WAVPLAY(Y, FS) sends the signal in vector Y, with sample frequency FS Hertz, to the Windows WAVE audio device. Standard audio rates are 8000, 11025, 22050 and 44100 Hz.

§ specgram()
SPECGRAM calculates the spectrogram of a given signal. B = SPECGRAM(A, NFFT, Fs, WINDOW, NOVERLAP) calculates the spectrogram for the signal in vector A. SPECGRAM splits the signal into overlapping segments, windows each segment with the WINDOW vector, and forms the columns of B from their zero-padded, length-NFFT discrete Fourier transforms. Thus each column of B contains an estimate of the short-term, time-localized frequency content of the signal A. Time increases linearly across the columns of B, from left to right. Frequency increases linearly down the rows, starting at 0.

The user-defined functions used in this system are:

§ stft()
D = STFT(X, F, W, H) computes the short-time Fourier transform, returning the frames of the short-term Fourier transform of X. Each column of the result is one F-point FFT; each successive frame is offset by H points until X is exhausted. Data are Hamming-windowed at W points (or W is used as the window if it is a vector).

§ istft()
X = ISTFT(D, F, W, H) computes the inverse short-time Fourier transform, performing overlap-add resynthesis from the short-time Fourier transform data in D. Each column of D is taken as the result of an F-point FFT; each successive frame is offset by H points. Data are Hamming-windowed at W points, or W is used as the window if it is a vector.

§ clipspeech()
CLIPSPEECH clips the region of a voice signal, in the database and in the query base, that carries the most information. This helps to concentrate on the information-rich region rather than the sparse regions when determining the energy distributions of the input query and the stored database.
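For readers working outside MATLAB, the wavread/specgram pair maps directly onto SciPy. The sketch below writes a short synthetic file first so that it is self-contained; the file name and the 440 Hz tone are illustrative and not from the paper.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

# Write a short synthetic .wav so the sketch needs no external file.
fs = 8000
t = np.arange(0, 0.5, 1.0 / fs)
tone = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
path = os.path.join(tempfile.gettempdir(), "sample.wav")
wavfile.write(path, fs, tone)

# Counterpart of wavread: returns the sampling rate and sampled data.
rate, data = wavfile.read(path)

# Counterpart of specgram(A, NFFT, Fs, WINDOW, NOVERLAP).
f, tt, Sxx = spectrogram(data, fs=rate, nperseg=512, noverlap=256)
```

As with specgram, each column of Sxx is a time-localized spectral estimate, and the dominant row sits at the tone's frequency.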

6. PERFORMANCE ANALYSIS

6.1 Approach I
The following Table 1 shows the correct detections for each number over a set of ten trials. Note that for each trial in both approaches the query was re-recorded for the purpose of generalization, but in the demo section (as the program is set now) the queries were pre-recorded and saved.


Table 1: The testing results for the number recognition by using the approach above

6.2 Approach II
The simulation results using cross correlation in the time domain are shown in Table 2, and the results obtained with the help of the spectrogram are shown in Table 3. Correct recognition percentages were computed and listed in these tables. Comparing the tables, it is easy to see that the spectrogram dramatically improved the recognition accuracy.

Table 2: Recognition by only using cross correlation in the time domain
Table 3: Recognition by using both cross correlation and spectrogram

7. RESULTS

Fig. 3: Comparison of voice signal
Fig. 4: Spectrogram of query voice signal

Fig. 5: Matched voice file


8. CONCLUSION

8.1 Approach I
It has been shown above that the spectrogram is a very useful and powerful tool for visualizing the three-dimensional features of sounds. Approach I also showed the difficulty of spotting human voices: the result was not very consistent. This is due to the many factors affecting detection; for example, the energy of the same word differs from person to person, and even for the same person under different conditions.

8.2 Approach II
At the beginning of Approach II, we compared two voice samples using cross correlation in the time domain. However, the result was totally unacceptable: the probability of correct recognition was close to zero. The reason is that a sound is a combination of many signals at different frequencies; the human voice in particular is a combination of hundreds of frequencies. To obtain an accurate sound recognition result, it is not enough to use information from the time domain only. So our approach was modified to concentrate on exploiting the spectrogram. As mentioned in the first approach, the spectrogram shows the useful characteristics of sounds in the frequency domain. The recognition accuracy was improved dramatically, as shown in Table 3.

There was an interesting result in the same table. Although the overall accuracy was low, the correct recognition rates for "k" and "t" were still higher than for the others under the same conditions. The reason, as illustrated in Figure 2, is that these two sounds are consonants, which have very different energy distributions and shorter periods than vowels. By focusing on the energy in the frequency domain, the output result improved a lot. The spectrogram shows the useful characteristics by combining frequency and time variations in the same plot, and cross correlation can then find the similarity between two sounds more efficiently.

In terms of the different sound examples: (i) "e" and "o", "k" and "t", and "u" and "ar" easily caused wrong recognition of each other; "e" and "o" are vowel sounds that are more intense in energy and both echo over a long period. (ii) Compared to "e" and "o", "k" and "t" are the extreme case; they also cause faults in sound recognition. (iii) The other sounds have no particular peer among the wrong recognitions. Normalization achieved a more accurate recognition: different sounds have different energies due to their diverse frequency components, and the cross correlations of two signals can only be compared under the normalized condition, which is to divide the cross correlation by the autocorrelations of both signals, denoted Rxx[0] and Ryy[0] in Eq. (4).

ACKNOWLEDGEMENT

The authors thank Professor Dr. Kumaravel, Anna University, Chennai, for his many helpful lectures, instructive guidance and valuable comments. The authors' appreciation is also extended to their fellow students for their support. Most of the simulations in this paper were carried out with MATLAB and C. The important resources on the websites are gratefully acknowledged.

REFERENCES

[1] Ronald M. Aarts, Roy Irwan and Augustus J. E. M. Janssen, "Efficient tracking of the cross-correlation coefficient", IEEE Transactions on Speech and Audio Processing, vol. 10, no. 6, September 2002.
[2] M. R. Portnoff, "Time-frequency representation of digital signals and systems based on short-time Fourier analysis", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-28, no. 1, pp. 55-69, 1980.
[3] J. B. Allen and L. R. Rabiner, "A unified approach to short-time Fourier analysis and synthesis", Proceedings of the IEEE, vol. 65, no. 11, pp. 1558-1564, 1977.
[4] Sanjit K. Mitra, Digital Signal Processing: A Computer-Based Approach, McGraw-Hill, 2000.
[5] Dupont, Ris, Deroo, Fontaine, Boite and Zanoni, "Context independent and context dependent hybrid HMM/ANN systems for vocabulary independent tasks", 1997.
[6] Renals and Hochberg, "Start-synchronous search for large vocabulary continuous speech recognition", 1999.
[7] Mazin Rahim, Yoshua Bengio and Yann LeCun, "Discriminative feature and model design for automatic speech recognition", 1997.
[8] R. W. Schafer and L. R. Rabiner, "Design and simulation of a speech analysis-synthesis system based on short-time Fourier analysis", IEEE Transactions on Audio and Electroacoustics, vol. AU-21, no. 3, pp. 165-174, 1973.
