Speech Recognition UTHM
Design of Automatic Speech Recognition Systems for Voice Identification and Analysis
Supervisors: Dr. Mohammed Faiz Liew bin Abdullah, Dr. Rozlan bin Alias, En. Ariffuddin bin Joret
DINESHWARAN GUNALAN (HE120104), TUAN MUSTAQIM (HE120105)
Speech Recognition
Speech recognition is a technology of particular interest because it supports direct communication between humans and computers through a mode of communication that humans commonly use among themselves and at which they are highly skilled.
Articulation produces sound waves, which the ear conveys to the brain for processing.
Acoustic waveform → acoustic signal → speech recognition
Challenges Faced
The overall goal is getting a computer to understand spoken language. Open challenges include:
- Ease of use
- Robust performance
- Automatic learning of new words and sounds
- Grammar for spoken language
- Control of synthesized voice quality
- Integrated learning for speech recognition and synthesis
The main processing problems are:
- Digitization: converting the analogue signal into a digital representation
- Signal processing: separating speech from background noise
- Phonetics: handling variability in human speech
- Phonology: recognizing individual sound distinctions (similar phonemes)
Digitization
- Analogue-to-digital conversion: sampling and quantizing
- Use filters to measure energy levels at various points on the frequency spectrum
- Knowing the relative importance of different frequency bands (for speech) makes this process more efficient
- E.g. high-frequency sounds are less informative, so they can be sampled using a broader bandwidth (log scale)
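As a minimal sketch of the sampling-and-quantizing step, the snippet below digitizes a pure tone standing in for speech. The sampling rate, bit depth, and test frequency are illustrative assumptions, not the project's actual parameters.

```python
import numpy as np

fs = 8000          # sampling rate in Hz (telephone-quality speech); illustrative
duration = 0.01    # 10 ms of signal
f0 = 440.0         # a pure tone standing in for a speech signal

# Sampling: evaluate the continuous signal at discrete instants n / fs.
t = np.arange(int(fs * duration)) / fs
analogue = np.sin(2 * np.pi * f0 * t)

# Quantizing: map each sample in [-1, 1] onto one of 2**bits discrete levels.
bits = 8
levels = 2 ** bits
digital = np.round((analogue + 1) / 2 * (levels - 1)).astype(np.uint8)

print(len(digital), digital.dtype)
```

With 8 kHz sampling, 10 ms of signal yields 80 one-byte samples; a real front end would also band-limit the signal with an anti-aliasing filter before sampling.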
Identifying Phonemes
Differences between some phonemes are sometimes very small
May be reflected in the speech signal (e.g. vowels have more or less distinctive F1 and F2); often shows up in coarticulation effects (the transition to the next sound)
e.g. aspiration of voiceless stops in English
Allophonic variation
Disambiguating Homophones
Humans mostly resolve such differences from context and from what makes sense:
"It's hard to wreck a nice beach" (it's hard to recognize speech)
"What dimes a necks drain to stop port?" (what time's the next train to Stockport?)
- Systems can only recognize words that are in their lexicon, so limiting the lexicon is an obvious ploy
- Some ASR systems include a grammar, which can help disambiguation
Pitch
Pitch contour can be extracted from speech signal
But pitch differences are relative: one man's high is another woman's low, and pitch range is variable.
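One common way to extract a pitch contour from the signal is the autocorrelation method: the lag at which a voiced frame best matches a shifted copy of itself gives the pitch period. The sketch below (the function name and search range are illustrative, not taken from this project) estimates the pitch of a single frame.

```python
import numpy as np

def estimate_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Estimate pitch (Hz) of one voiced frame by autocorrelation (sketch)."""
    frame = frame - frame.mean()
    # Autocorrelation for non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Restrict the peak search to plausible pitch periods.
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

fs = 8000
t = np.arange(400) / fs
frame = np.sin(2 * np.pi * 200 * t)        # a 200 Hz "voiced" frame
print(round(estimate_pitch(frame, fs)))    # → 200
```

Restricting the lag search to a 50–400 Hz band is what keeps the estimator from locking onto harmonics or slow drift; real systems also need a voiced/unvoiced decision before trusting the estimate.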
Length
Length is easy to measure but difficult to interpret. Again, length is relative. It is phonemic in many languages. Speech rate is not constant: it slows down at the end of a sentence.
Loudness
Loudness is easy to measure but difficult to interpret. Again, loudness is relative.
Speech Recognition
Identification
Verification
Hardware
GUI Software
zero
There is silence at the beginning and at the end of the signal. The word "zero" consists of two syllables.
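The silent stretches at both ends of a recording can be located with a simple short-term-energy threshold. This is a hedged sketch of such endpoint detection; the frame size, threshold, and function name are illustrative assumptions, not the project's actual detector.

```python
import numpy as np

def trim_silence(signal, frame_len=160, threshold=0.01):
    """Drop leading/trailing frames whose mean energy is below threshold."""
    n_frames = len(signal) // frame_len
    energy = np.array([
        np.mean(signal[i * frame_len:(i + 1) * frame_len] ** 2)
        for i in range(n_frames)
    ])
    voiced = np.where(energy > threshold)[0]
    if len(voiced) == 0:
        return signal[:0]                       # all silence
    start = voiced[0] * frame_len
    end = (voiced[-1] + 1) * frame_len
    return signal[start:end]

# Silence, then a tone, then silence.
sig = np.concatenate([np.zeros(800), np.sin(np.arange(1600) * 0.2), np.zeros(800)])
print(len(trim_silence(sig)))   # → 1600
```

A fixed threshold is fragile against background noise; practical systems adapt it to the noise floor or combine energy with zero-crossing rate.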
Frame Blocking
Continuous speech → Frame Blocking → frame → Windowing → FFT → spectrum → Final Output
The continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N).
- Frame 1 consists of the first N samples.
- Frame 2 begins M samples after the first frame and overlaps it by N − M samples.
- Frame 3 begins 2M samples after the first frame and overlaps it by N − 2M samples.
The speech signals were blocked into frames of N samples with overlap.
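The frame-blocking scheme described above can be sketched as follows. The values N = 256 and M = 100 are illustrative assumptions; the project's actual frame and step sizes are not stated here.

```python
import numpy as np

def frame_blocking(signal, N=256, M=100):
    """Block a signal into frames of N samples spaced M samples apart,
    so each frame overlaps the next by N - M samples."""
    n_frames = 1 + (len(signal) - N) // M
    return np.stack([signal[i * M:i * M + N] for i in range(n_frames)])

signal = np.arange(1000.0)
frames = frame_blocking(signal)
print(frames.shape)    # → (8, 256)
print(frames[1][0])    # frame 2 begins M = 100 samples after frame 1
```

Using a ramp signal makes the overlap easy to verify: the last N − M = 156 samples of frame 1 are exactly the first 156 samples of frame 2.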
Windowing
Each individual frame is windowed. A Hamming window is used in this project.
[Figure: a frame before and after Hamming windowing]
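The windowing step can be sketched as below: each frame is multiplied sample-by-sample by a Hamming window, which tapers the frame edges toward zero and so reduces spectral leakage in the subsequent FFT. The frame size and the placeholder all-ones frames are illustrative; only the use of a Hamming window is taken from the text above.

```python
import numpy as np

N = 256
window = np.hamming(N)       # w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
frames = np.ones((8, N))     # placeholder frames; real frames come from blocking
windowed = frames * window   # broadcasting applies the window to every frame

print(windowed[0][0])        # edge samples are attenuated to about 0.08
print(windowed[0][N // 2])   # centre samples are almost unchanged
```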
Convert each frame of N samples from the time domain into the frequency domain.
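The time-to-frequency conversion can be sketched with NumPy's real FFT. The sampling rate, frame size, and 1 kHz test tone are illustrative assumptions; a real frame would come from the blocking and windowing steps above.

```python
import numpy as np

fs, N = 8000, 256
t = np.arange(N) / fs
frame = np.sin(2 * np.pi * 1000 * t) * np.hamming(N)   # windowed test frame

spectrum = np.abs(np.fft.rfft(frame))     # magnitudes of bins 0 .. N/2
freqs = np.fft.rfftfreq(N, d=1 / fs)      # bin centre frequencies in Hz

peak = freqs[np.argmax(spectrum)]
print(peak)   # strongest bin sits at the 1000 Hz tone
```

With N = 256 and fs = 8000 Hz, the bins are 31.25 Hz apart, so the 1000 Hz tone falls exactly on bin 32; off-bin frequencies would spread across neighbouring bins, which is why the window matters.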
Spectrum