Speech Recognition UTHM
Design of Automatic Speech Recognition Systems for Voice Identification and Analysis
Supervisors: Dr. Mohammed Faiz Liew bin Abdullah, Dr. Rozlan bin Alias, En. Ariffuddin bin Joret
DINESHWARAN GUNALAN (HE120104), TUAN MUSTAQIM (HE120105)
Speech Recognition
Speech recognition is a technology of particular interest because it supports direct communication between humans and computers through a mode of communication that humans commonly use among themselves and at which they are highly skilled.
Articulation produces sound waves, which the ear conveys to the brain for processing.
Acoustic waveform → acoustic signal → speech recognition
Challenges Faced
The overall goal is getting a computer to understand spoken language. Open challenges include:
- Ease of use
- Robust performance
- Automatic learning of new words and sounds
- Grammar for spoken language
- Control of synthesized voice quality
- Integrated learning for speech recognition and synthesis
The main processing problems are:
- Digitization: converting the analogue signal into a digital representation
- Signal processing: separating speech from background noise
- Phonetics: handling variability in human speech
- Phonology: recognizing individual sound distinctions (similar phonemes)
Digitization
- Analogue-to-digital conversion: sampling and quantizing
- Use filters to measure energy levels at various points on the frequency spectrum
- Knowing the relative importance of different frequency bands (for speech) makes this process more efficient
- E.g. high-frequency sounds are less informative, so they can be sampled using a broader bandwidth (log scale)
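As a minimal sketch of the sampling-and-quantizing step, the snippet below digitizes a pure tone standing in for speech. The sampling rate, bit depth, and test frequency are illustrative assumptions, not the project's actual parameters.

```python
import numpy as np

fs = 8000          # sampling rate in Hz (telephone-quality speech); illustrative
duration = 0.01    # 10 ms of signal
f0 = 440.0         # a pure tone standing in for a speech signal

# Sampling: evaluate the continuous signal at discrete instants n / fs.
t = np.arange(int(fs * duration)) / fs
analogue = np.sin(2 * np.pi * f0 * t)

# Quantizing: map each sample in [-1, 1] onto one of 2**bits discrete levels.
bits = 8
levels = 2 ** bits
digital = np.round((analogue + 1) / 2 * (levels - 1)).astype(np.uint8)

print(len(digital), digital.dtype)
```

With 8 kHz sampling, 10 ms of signal yields 80 one-byte samples; a real front end would also band-limit the signal with an anti-aliasing filter before sampling.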
Identifying Phonemes
Differences between some phonemes are sometimes very small
May be reflected in the speech signal (e.g. vowels have more or less distinctive F1 and F2); often shows up in coarticulation effects (the transition to the next sound)
e.g. aspiration of voiceless stops in English
Allophonic variation
Disambiguating Homophones
Humans mostly resolve such differences from context and from what makes sense:
"It's hard to wreck a nice beach" (it's hard to recognize speech)
"What dimes a necks drain to stop port?" (what time's the next train to Stockport?)
- Systems can only recognize words that are in their lexicon, so limiting the lexicon is an obvious ploy
- Some ASR systems include a grammar, which can help disambiguation
Pitch
Pitch contour can be extracted from speech signal
But pitch differences are relative: one man's high is another woman's low, and pitch range is variable.
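One common way to extract a pitch contour from the signal is the autocorrelation method: the lag at which a voiced frame best matches a shifted copy of itself gives the pitch period. The sketch below (the function name and search range are illustrative, not taken from this project) estimates the pitch of a single frame.

```python
import numpy as np

def estimate_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Estimate pitch (Hz) of one voiced frame by autocorrelation (sketch)."""
    frame = frame - frame.mean()
    # Autocorrelation for non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Restrict the peak search to plausible pitch periods.
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

fs = 8000
t = np.arange(400) / fs
frame = np.sin(2 * np.pi * 200 * t)        # a 200 Hz "voiced" frame
print(round(estimate_pitch(frame, fs)))    # → 200
```

Restricting the lag search to a 50–400 Hz band is what keeps the estimator from locking onto harmonics or slow drift; real systems also need a voiced/unvoiced decision before trusting the estimate.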
Length
Length is easy to measure but difficult to interpret. Again, length is relative. It is phonemic in many languages. Speech rate is not constant: it slows down at the end of a sentence.
Loudness
Loudness is easy to measure but difficult to interpret. Again, loudness is relative.
Speech Recognition
Identification
Verification
Hardware
GUI Software
zero
There is silence at the beginning and at the end of the signal. The word "zero" consists of two syllables.
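The silent stretches at both ends of a recording can be located with a simple short-term-energy threshold. This is a hedged sketch of such endpoint detection; the frame size, threshold, and function name are illustrative assumptions, not the project's actual detector.

```python
import numpy as np

def trim_silence(signal, frame_len=160, threshold=0.01):
    """Drop leading/trailing frames whose mean energy is below threshold."""
    n_frames = len(signal) // frame_len
    energy = np.array([
        np.mean(signal[i * frame_len:(i + 1) * frame_len] ** 2)
        for i in range(n_frames)
    ])
    voiced = np.where(energy > threshold)[0]
    if len(voiced) == 0:
        return signal[:0]                       # all silence
    start = voiced[0] * frame_len
    end = (voiced[-1] + 1) * frame_len
    return signal[start:end]

# Silence, then a tone, then silence.
sig = np.concatenate([np.zeros(800), np.sin(np.arange(1600) * 0.2), np.zeros(800)])
print(len(trim_silence(sig)))   # → 1600
```

A fixed threshold is fragile against background noise; practical systems adapt it to the noise floor or combine energy with zero-crossing rate.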
Frame Blocking
Continuous speech → Frame Blocking → frame → Windowing → FFT → spectrum → Final Output
The continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N).
- Frame 1 consists of the first N samples.
- Frame 2 begins M samples after the first frame and overlaps it by N − M samples.
- Frame 3 begins 2M samples after the first frame and overlaps it by N − 2M samples.
The speech signals were blocked into frames of N samples with overlap.
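The frame-blocking scheme described above can be sketched as follows. The values N = 256 and M = 100 are illustrative assumptions; the project's actual frame and step sizes are not stated here.

```python
import numpy as np

def frame_blocking(signal, N=256, M=100):
    """Block a signal into frames of N samples spaced M samples apart,
    so each frame overlaps the next by N - M samples."""
    n_frames = 1 + (len(signal) - N) // M
    return np.stack([signal[i * M:i * M + N] for i in range(n_frames)])

signal = np.arange(1000.0)
frames = frame_blocking(signal)
print(frames.shape)    # → (8, 256)
print(frames[1][0])    # frame 2 begins M = 100 samples after frame 1
```

Using a ramp signal makes the overlap easy to verify: the last N − M = 156 samples of frame 1 are exactly the first 156 samples of frame 2.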
Windowing
Each individual frame is windowed. A Hamming window is used in this project.
[Figure: a frame before and after Hamming windowing]
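The windowing step can be sketched as below: each frame is multiplied sample-by-sample by a Hamming window, which tapers the frame edges toward zero and so reduces spectral leakage in the subsequent FFT. The frame size and the placeholder all-ones frames are illustrative; only the use of a Hamming window is taken from the text above.

```python
import numpy as np

N = 256
window = np.hamming(N)       # w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
frames = np.ones((8, N))     # placeholder frames; real frames come from blocking
windowed = frames * window   # broadcasting applies the window to every frame

print(windowed[0][0])        # edge samples are attenuated to about 0.08
print(windowed[0][N // 2])   # centre samples are almost unchanged
```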
Convert each frame of N samples from the time domain into the frequency domain.
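The time-to-frequency conversion can be sketched with NumPy's real FFT. The sampling rate, frame size, and 1 kHz test tone are illustrative assumptions; a real frame would come from the blocking and windowing steps above.

```python
import numpy as np

fs, N = 8000, 256
t = np.arange(N) / fs
frame = np.sin(2 * np.pi * 1000 * t) * np.hamming(N)   # windowed test frame

spectrum = np.abs(np.fft.rfft(frame))     # magnitudes of bins 0 .. N/2
freqs = np.fft.rfftfreq(N, d=1 / fs)      # bin centre frequencies in Hz

peak = freqs[np.argmax(spectrum)]
print(peak)   # strongest bin sits at the 1000 Hz tone
```

With N = 256 and fs = 8000 Hz, the bins are 31.25 Hz apart, so the 1000 Hz tone falls exactly on bin 32; off-bin frequencies would spread across neighbouring bins, which is why the window matters.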
Spectrum