All content following this page was uploaded by Ali Mansour Al-madani on 07 July 2019.
Project Report
On
AI Speech Recognition System
Submitted by
Ali Mansour Almadani
Email:
abounazek2012@gmail.com
csit.amm@bamu.ac.in
Guided by
Mr. Ashish Bhalerao
Assistant Professor
MAHATMA GANDHI MISSION
Certificate
Seat No: _
INDEX
Sr.No. Contents
1. Introduction to Project
2. System Requirement
3. Feasibility Study
4. Requirement Analysis
7. E-R Diagram
8. Database Design
10. Reports
11. Conclusion
13. Enhancement
14. Bibliography
Acknowledgement
Chapter 1
1.1. Introduction
Speech Recognition (SR) is the ability to translate dictation or spoken words into text. It is also known as "automatic speech recognition" (ASR) or speech-to-text (STT).
Speech recognition is the process of converting an acoustic signal, captured by a microphone or another peripheral, into a set of words. To achieve speech understanding, linguistic processing can be applied.
The recognized words can be an end in themselves, as in applications such as command and control, data entry, and document preparation.
In society, everyone, whether human or animal, wishes to interact with others and tries to convey a message. The receiver may grasp the sender's exact and full idea, may get only a partial idea, or sometimes may not understand anything at all.
In some cases there is a gap in communication (e.g. when a child conveys a message, the mother can understand it easily while others cannot).
Project overview
1.2. Existing System (History)
After the five decades of research, the speech recognition technology has finally
entered marketplace, benefiting the users in variety of ways. The challenge of
designing a machine that truly functions like an intelligent human is still a major one
going forward.
2
o Speech generation (converting text to voice)
o Text editing (copy, paste, select)
Design and development of an interactive, user-friendly text editor that allows the user to enter, manipulate, and format text, all through familiar commands.
Development of software for speech recognition (speech-to-text conversion).
Development of advanced technology incorporating these ideas.
Development of a model that compares the wave data against a phoneme database and displays the characters (sentences) on the screen.
Speech recognition is a technology that enables a computer to capture the words spoken by a human with the help of a microphone (embedded in the computer or external).
1.4. Abstract
Speech recognition technology is one of the fastest-growing engineering technologies. This project was designed and developed with that fact in mind, and a modest effort has been made to achieve the aim.
It has applications in a number of different areas and provides potential benefits. Nearly 20% of the world's population suffers from various disabilities; many people are blind or unable to use their hands effectively. In those particular cases a speech recognition system provides significant help, so that they can share information with people by operating a computer through voice input.
Consider the thousands of people in the world who are unable to use their hands, making typing impossible. Our project is for these people who cannot type or see, and even for those of us who are simply disinclined to type. The project is capable of recognizing speech and converting the input audio into text; it also enables a user to perform operations such as open, close, exit, and read on an application or a file by providing voice input, for example opening a word processor, Google Chrome, Notepad, or the calculator.
The project is also capable of reading aloud text written by anyone, or text entered by the user himself.
CDMA - Code Division Multiple Access
CELP - Code Excited Linear Prediction
DCT - Discrete Cosine Transform
DFT - Discrete Fourier Transform
DSP - Digital Signal Processing
FEC - Forward Error Correction
FIR - Finite Impulse Response
GSM - Global System for Mobile communications
IIR - Infinite Impulse Response
IDCT - Inverse Discrete Cosine Transform
IDFT - Inverse Discrete Fourier Transform
LPC - Linear Predictive Coding
LSP - Line Spectrum Pair
IMBE - Improved Multi-Band Excitation
MBE - Multi-Band Excitation
MSE - Mean Square Error
NLP - Non-Linear Pitch
PCM - Pulse Code Modulation
PSTN - Public Switched Telephone Network
RMS - Root Mean Square
RPE - Regular Pulse Excitation
SD - Spectral Distortion
SEGSNR- Segmental Signal to Noise Ratio
SNR - Signal to Noise Ratio
VSELP - Vector Sum Excited Linear Prediction
AMDF - Average Magnitude Difference Function
F0 - Fundamental Frequency of Speech
STE - Short Term Energy
ZCR - Zero Crossing Rate
ITU - Upper Energy threshold
ITL - Lower Energy threshold
IZCT - Zero Crossing Rate Threshold
C-V - Consonant Vowel
FFT - Fast Fourier Transform
DFFT - Discrete Fast Fourier Transform
STFT - Short-Time Fourier Transform
MFCC - Mel Frequency Cepstral Coefficients
Continuous speech: when the user speaks in a normal, fluid manner without having to pause between words, the speech is referred to as continuous.
Discrete speech: when the user pauses between each word, the speech is referred to as discrete.
This project has both speech recognition and speech synthesis capabilities. Though it is not a complete replacement for what we call Notepad, it is still a good text editor to be used through voice; the software can also open Windows-based applications such as Notepad and Google Chrome.
Streamlined access to application controls and large lists enables a user to
speak any one item from a list or any command from a potentially huge set
of commands without having to navigate through several dialog boxes or
cascading menus.
Speech activated macros let a user speak a natural word or phrase rather than
use the keyboard or a command to activate a macro.
Speech recognition offers game and edutainment developers the potential to
bring their applications to a new level of play. It enhances the realism and
fun in many computer games, it also provides a useful alternative to
keyboard-based control, and voice commands provide new freedom for the
user in any sort of application, from entertainment to office productivity.
There are many situations in which hands are not available to issue commands to a device; speech is a natural alternative interface to computers for people with limited mobility in their arms and hands, or for those with sight limitations.
Speech can be saved in appropriate format, so that the speaker or a third party
can replay recorded speech to facilitate correction.
Users can switch between dictation and typing without any extra effort.
Hands-free computing as an alternative to the keyboard, or to allow the
application to be used in environments where a keyboard is impractical (i.e.
small mobile devices, Auto PCs, or in mobile phones)
Voice responses to message boxes and wizard screens can easily be designed
into an application.
A more "human" computer, one a user can talk to, may make educational and entertainment applications seem more friendly and realistic.
Applications that require users to key paper-based data into the computer are good candidates for a speech recognition application. Reading data directly to the computer is much easier for most users and can significantly speed up data entry. Some recognizers can even handle spelling fairly well. If an application has fields with mutually exclusive data types (e.g. sex, age, and city), the speech recognition engine can process the command and automatically determine which field to fill in.
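That last behaviour can be illustrated with a minimal sketch (in Python for brevity; the project itself targets C#/Windows Forms, and the field names and vocabularies below are invented for the example). A recognized token is routed to the field whose value set contains it:

```python
# Hypothetical field vocabularies -- not from the project's database.
FIELD_VALUES = {
    "sex":  {"male", "female"},
    "city": {"pune", "mumbai", "aurangabad"},
}

def route_token(token):
    """Return the name of the field whose value set contains the token."""
    for field, values in FIELD_VALUES.items():
        if token.lower() in values:
            return field
    return None  # token matches no field
```

A real engine does this with per-field grammars rather than plain sets, but the routing idea is the same.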
Chapter 2
2. System Requirements
[Block diagram: audio input → analog-to-digital conversion → acoustic model and language model → speech engine → display]
The speech recognition process is easy for a human but a difficult task for a machine. Compared with a human mind, speech recognition programs seem less intelligent; the capability of thinking, understanding, and reacting is natural for the human mind, while for a computer program it is a complicated task. First the program needs to understand the spoken words with respect to their meanings, and it has to strike a sufficient balance between the words, noise, and silences. A human has a built-in capability of filtering noise from speech, while a machine requires training; a computer needs help separating the speech sound from other sounds.
2.4. Factors affecting speech recognition:
2.4.1. Homonyms: words that are spelled differently and have different meanings but sound the same, for example "there" and "their", "be" and "bee". It is a challenge for a machine to distinguish between such phrases that sound alike.
2.4.2. Overlapping speech: a second challenge in the process is to understand speech uttered by different users; current systems have difficulty separating simultaneous speech from multiple users.
2.4.3. Noise: the program requires hearing the words uttered by a human distinctly and clearly. Any extra sound can create interference, so the system should first be placed away from noisy environments and then spoken to clearly, or the machine will get confused and mix up the words.
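In practice the noise problem is attacked with endpoint detection, using the short-term energy (STE) and zero-crossing rate (ZCR) listed among the abbreviations earlier. A rough Python sketch for illustration; the frame length and thresholds are made-up values, not tuned ones:

```python
import numpy as np

def endpoint_detect(signal, frame_len=256, energy_thresh=0.1, zcr_thresh=0.25):
    """Mark a frame as speech when its short-term energy exceeds a threshold,
    or when the zero-crossing rate suggests an unvoiced sound (fricative)."""
    n_frames = len(signal) // frame_len
    flags = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        ste = np.mean(frame ** 2)                            # short-term energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2   # zero-crossing rate
        flags.append(bool(ste > energy_thresh or zcr > zcr_thresh))
    return flags
```

Real systems use adaptive thresholds (the ITU/ITL/IZCT values in the abbreviation list) estimated from the leading silence instead of fixed constants.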
2.5. The future of speech recognition:
Dictation speech recognition will gradually become accepted.
Accuracy will become better and better.
Microphones and sound systems will be designed to adapt more quickly to changing background noise levels and different environments, with better recognition of extraneous material to be discarded.
Greater use will be made of "intelligent systems" which will attempt to guess what the speaker intended to say, rather than what was actually said, as people often misspeak and make unintentional mistakes.
Methodology
As speech recognition is an emerging technology, not all developers are familiar with it. While the basic functions of both speech synthesis and speech recognition take only a few minutes to understand (after all, most people learn to speak and listen by age two), there are subtle and powerful capabilities provided by computerized speech that developers will want to understand and utilize.
An understanding of the capabilities and limitations of speech technology is also important for developers in deciding whether a particular application will benefit from the use of speech input and output.
System Requirements:
CPU:
Our application depends on the efficiency of the CPU (central processing unit), because a large amount of digital filtering and signal processing takes place in ASR (automatic speech recognition).
Chapter 3
3. Feasibility Study:
Through these studies, the following conclusions and proposals for the project were obtained:
Component | Minimum | Recommended
CPU       | 1.6 GHz | 2.53 GHz
RAM       | 2 GB    | 4 GB
Visual Studio 2015: for building the project, creating all the Windows Forms applications, and designing the interfaces.
MySQL: for managing the database (creating tables, storing the data).
Word processor: for writing the project report.
Programming language:
The programming language is C Sharp (C#). It is easy to learn, is used to create Windows Forms applications, and is a well-known, high-level programming language.
The Microsoft Speech SDK is one of the many tools that enable a developer to add speech capability to an application.
Planning: 7 days (12/10/2017 to 23/10/2017)
Costs:
The costs that will be incurred by the team to complete the project, as we will discuss.
Profits:
The profits that the team can achieve after implementing the project. In the beginning it will be a trial version that any user can use for free. After the version is improved there will be a product key, so that no one can use the application without it; the application will then be sold to users, and every version will have a different product key.
3.4. Operational Feasibility
3.4.1. Performance (throughput):
Increasing recognition throughput in batch processing of speech data, and reducing recognition latency in real-time usage scenarios.
Improved throughput allows batch processing of the speech recognition task to execute as efficiently as possible, thereby increasing its utility for multimedia search and retrieval.
3.4.2. Information
Audio is input to the system with the help of a microphone, and the PC sound card produces the equivalent digital representation of the received audio.
4. Requirement Analysis
Analyze
- Identify opportunities for speech & outline project strategy
- Review business requirements & processes
- Review of existing IVR
- Interview subject matter experts
- Voice user interface & technical requirements (use-case scenarios)
- Define success criteria
- Map out solution
- Client review & sign-off on requirements
4.1. Fundamentals of speech recognition
Speech recognition is basically the science of talking with the computer and having what was said correctly recognized. To elaborate, we have to understand the following terms.
4.1.1. Utterance
When a user says something, that is an utterance; in other words, speaking a word or a combination of words that means something to the computer is called an utterance. Utterances are then sent to the speech engine to be processed.
4.1.2. Pronunciation
The pronunciation of a word is what the speech engine thinks the word should sound like. Words can have multiple pronunciations associated with them.
4.1.3. Grammar
A grammar uses a particular set of rules to define the words and phrases that will be recognized by the speech engine; more concisely, a grammar defines the domain within which the speech engine works. A grammar can be as simple as a list of words, or flexible enough to support various degrees of variation.
4.1.4. Accuracy
The performance of a speech recognition system is measurable; the ability of the recognizer can be measured by calculating its accuracy, i.e. how reliably it identifies an utterance.
4.1.5. Vocabularies
Vocabularies are the lists of words that can be recognized by the speech recognition engine. Generally, smaller vocabularies are easier for a speech recognition engine to identify, while a large list of words is a more difficult task.
4.1.6. Training
Training can be used by users who have difficulty speaking or pronouncing certain words; speech recognition systems with training should be able to adapt.
4.2. Tools
1- Visual Studio 2015 (coding)
2- Office 2016 (word processor, for the documentation)
3- PowerPoint (for the presentation)
Prosody Analysis
The remaining steps convert the written text to speech.
Text-to-phoneme conversion: convert each word to phonemes. A phoneme is a basic unit of sound in a language. US English has around 45 phonemes, including the consonant and vowel sounds. For example, "times" is spoken as four phonemes: "t ay m s". Different languages have different sets of sounds (different phonemes); for example, Japanese has fewer phonemes, and includes sounds not found in English, such as the "ts" in "tsunami".
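The conversion just described can be sketched as a dictionary lookup. The toy lexicon below reuses the "times" example from above; the other entries are invented and this is not a real pronunciation dictionary:

```python
# Toy pronunciation dictionary; "times" matches the example in the text,
# the other entries are illustrative guesses, not a real lexicon.
PHONEME_DICT = {
    "times": ["t", "ay", "m", "s"],
    "open":  ["ow", "p", "ax", "n"],
    "file":  ["f", "ay", "l"],
}

def text_to_phonemes(text):
    """Look each word up in the dictionary; unknown words raise KeyError
    (a real system would fall back to letter-to-sound rules)."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(PHONEME_DICT[word])
    return phonemes
```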
Chapter 5
5. Software Requirement Specifications (SRS)
When one thinks about speaking to computers, the first image is usually speech
recognition, the conversion of an acoustic signal to a stream of words. After
many years of research, speech recognition technology is beginning to pass the
threshold of practicality. The last decade has witnessed dramatic improvement
in speech recognition technology, to the extent that high performance
algorithms and systems are becoming available.
A wide variety of techniques, at different levels, is used to perform speech recognition. The speech recognition process is performed by a software component known as the speech recognition engine. The primary function of the speech recognition engine is to process spoken input and translate it into text that an application can understand. The application can work in two different modes: command and control mode (sometimes referred to as voice navigation) and dictation mode.
In command and control mode the application interprets the result of the recognition as a command. This mode offers developers the easiest implementation of a speech interface in an existing application. In this mode the grammar (or list of recognized words) can be limited to the list of available commands. This provides better accuracy and performance, and reduces the processing overhead required by the application. An example of a command and control application is one in which the caller says "open file", and the application asks for the name of the file to be opened.
Functions of the speech recognizer:
o Filter the raw signal into frequency bands.
o Cut the utterance into a fixed number of segments.
o Average the data for each band in each segment.
o Store this pattern with its name.
o Collect a training set of about three repetitions of each pattern (utterance).
o Recognize an unknown utterance by comparing its pattern against all patterns in the training set and returning the name of the pattern closest to the unknown.
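The steps above amount to simple template matching. A compact Python sketch for illustration (the band and segment counts are arbitrary choices, not the project's values):

```python
import numpy as np

def make_pattern(signal, n_bands=8, n_segments=10):
    """Band/segment pattern: cut the signal into n_segments slices, take the
    magnitude spectrum of each, and average it over n_bands frequency bands."""
    seg_len = len(signal) // n_segments
    pattern = np.zeros((n_segments, n_bands))
    for s in range(n_segments):
        seg = signal[s * seg_len:(s + 1) * seg_len]
        spec = np.abs(np.fft.rfft(seg))
        band_len = len(spec) // n_bands
        for b in range(n_bands):
            pattern[s, b] = spec[b * band_len:(b + 1) * band_len].mean()
    return pattern

def recognize(unknown, training_set):
    """Return the name of the stored pattern closest to the unknown utterance."""
    pat = make_pattern(unknown)
    return min(training_set, key=lambda name: np.linalg.norm(training_set[name] - pat))
```

With three repetitions per word, as the text suggests, `training_set` would simply hold three named patterns per utterance.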
The process of conversion from speech to words is complex and varies slightly between systems. It consists of three steps:
(1) Feature extraction – pre-processing of the speech signal, extracting the important features into feature vectors.
(2) Phoneme recognition – based on a statistically trained phoneme model (HMM), the most likely sequence of phonemes is calculated.
(3) Word recognition – based on a statistically trained language model, similar to the phoneme model, the most likely sequence of words is calculated.
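Steps (2) and (3) both reduce to finding the most likely sequence through a statistically trained model, which is typically done with the Viterbi algorithm. A minimal sketch over log-probabilities (the toy model sizes are purely illustrative):

```python
import numpy as np

def viterbi(obs_loglik, log_trans, log_init):
    """Most likely state (phoneme) sequence given per-frame log-likelihoods.
    obs_loglik: (T, N) log P(frame_t | state), log_trans: (N, N), log_init: (N,)."""
    T, N = obs_loglik.shape
    delta = log_init + obs_loglik[0]        # best score ending in each state
    back = np.zeros((T, N), dtype=int)      # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans     # scores[i, j]: from i to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + obs_loglik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):           # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```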
Speech dictation process:
After the preparation of the master database of features of Gujarati alphabets, the researcher has proposed the dictation model, from which the actual human-machine interaction starts in the form of speech dictation. The researcher has divided the model into five different steps (Fig: 9.4), i.e. (1) input acquisition, (2) front end, (3) feature extractor, (4) local match, and (5) character printing.
The entire process is summarized in the following steps:
10) Flowchart of a speech recognition system
[Flowchart: speech → feature vectors → comparison against reference models → recognition result]
11) Use case diagram:
12) Diagrams:
Writing Text
Opening Document
Closing Document
Opening system software
13) E-R Diagram
8. Database Design
Here is the part where basic to intermediate experience with MS Access comes in handy, because we will not go into every detail of this process. I generally use MS Access 2007, so my instructions will be geared toward that version. To begin, select the Windows icon in the upper left, then click New. Name the database whatever you would like; for this tutorial I will be naming mine VR.accdb. Create a table that your project can use. I named mine CustomCommands and included the following fields:
ID
CommonField
Command
Result
Save the database so that we can use it during the next step.
After you have connected your program to the MS Access database you have created, we will add that database to our program forms. You will see that you now have your DataSet under your Data Sources.
This is where you can click and hold a given field, such as "Command" or "Result", and drag it onto your forms. Just make sure the field is set to TextBox, and you should have a form that looks something like this:
The "CommonField" field is what the computer will speak to you, the "Command" field is what you would speak to the computer, and the "Result" field is the program that would be executed. To help keep these straight mentally, I will rename my labels to that effect. We just walked through connecting a database to only one of our forms. You can follow the appropriate steps listed in this section to connect the same database to your second form.
On our main form we will not need a data grid view; on our second form, however, we will. Having the grid is not entirely necessary for the operation of the program, but it does help you keep things organized as you are managing your commands. Under the Data Sources explorer select CustomCommands, and from the dropdown menu select DataGridView. Then simply grab CustomCommands with your mouse and drag it onto your form. You can arrange your data grid view however you would like from there.
Now we can create all of the buttons and text boxes we will need on our forms. We'll just do this in steps to keep things simple.
35
Chapter 9
9. User Interface
Code for the main window and movement between windows by voice:
Option Strict On
36
Dim a As New Speech.Synthesis.SpeechSynthesizer
Private WithEvents OutputListBox As New ListBox With {.Dock = DockStyle.Fill, .IntegralHeight = False, .ForeColor = Color.AntiqueWhite, .BackColor = Color.Green}
Private WithEvents SpeechEngine As New System.Speech.Recognition.SpeechRecognitionEngine(System.Globalization.CultureInfo.GetCultureInfo("en-us"))
Dim tms As Integer = 0
Dim st As String
Dim WithEvents recognizer As SpeechRecognitionEngine

SkinOb.LoadSkinFromFile("C:\Users\fathail\Desktop\vb\project\A_67.skf")
SkinOb.ApplySkin()
Me.Text = "Speech recognition, by Doc Oc, version:" & My.Application.Info.Version.ToString
Controls.Add(OutputListBox)
SpeechEngine.LoadGrammar(New System.Speech.Recognition.DictationGrammar)
SpeechEngine.SetInputToDefaultAudioDevice()
SpeechEngine.RecognizeAsync(Speech.Recognition.RecognizeMode.Multiple)
recognizer = New SpeechRecognitionEngine()
recognizer.SetInputToDefaultAudioDevice()
End Sub
MyForm.Show()
Case "close"
Me.Close()
Case "maximize"
Me.WindowState = FormWindowState.Maximized
Case "minimize"
Me.WindowState = FormWindowState.Minimized
End Select
' End If
End Sub
End Class
9.2. Add Commands
Imports System.Drawing.Drawing2D
Imports System.ComponentModel
Imports DMSoft
Dim CnString As String = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Users\fathail\Desktop\vb\project\project\db2.accdb;Persist Security Info=False;"
' ////////////////////////////////////////////////////////
SkinOb.LoadSkinFromFile("C:\Users\fathail\Desktop\vb\project\A_67.skf")
SkinOb.ApplySkin()
Try
Conn.Open()
Dim DataAdapter1 As New OleDbDataAdapter(SQLstr, Conn)
DataAdapter1.Fill(DataSet1, "CustomCommands")
dataGridView1.DataSource = DataSet1
dataGridView1.DataMember = "CustomCommands"
dataGridView1.Refresh()
Conn.Close()
Catch e1 As Exception
Console.WriteLine(e1)
End Try
End Sub
Else
MsgBox("Please enter the field data", MsgBoxStyle.Critical, "Data entry error")
Exit Sub
End If
End Sub
btnOpen.Visible = False
text2.Text = "http://www."
text2.Focus()
End If
End Sub
btnOpen.Visible = True
text2.Text = ""
End If
End Sub
End Sub
MsgBox("Delete is done", MsgBoxStyle.Information + MsgBoxStyle.MsgBoxRight, "Delete")
End Sub
Code to view commands:
Imports System.Data.OleDb
Imports System.IO.StreamReader
9.4. Grammar loading
Imports System.Speech.Recognition ' Add reference: Assemblies > Framework > System.Speech
Imports System.Speech.Recognition.SrgsGrammar ' Adding this is unnecessary on my PC

Controls.Add(OutputListBox)
SpeechEngine.LoadGrammar(New System.Speech.Recognition.DictationGrammar)
SpeechEngine.SetInputToDefaultAudioDevice()
SpeechEngine.RecognizeAsync(Speech.Recognition.RecognizeMode.Multiple)
Dim ReadLines As New System.IO.StreamReader("C:\Users\fathail\Desktop\vb\project\Command.txt")
Do Until ReadLines.EndOfStream
    ' Each line of Command.txt becomes a one-phrase grammar.
    Dim NewGrammar As New Grammar(New Choices(ReadLines.ReadLine()))
    recognizer.LoadGrammarAsync(NewGrammar)
Loop
ReadLines.Close()
recognizer.RecognizeAsync(RecognizeMode.Multiple)
End Sub
Case Is = "open facebook"
a.SpeakAsync("now")
' Shell("notepad.exe", AppWinStyle.NormalFocus, False)
System.Diagnostics.Process.Start("http://www.facebook.com")
Case Is = "RESTART"
a.Speak("restart")
System.Diagnostics.Process.Start("shutdown", "-r")
Timer1.Enabled = True
Timer1.Start()
System.Diagnostics.Process.Start("https://www.google.com/webhp?sourceid=chrome-instant&ion=1&ie=UTF-8#output=search&sclient=psy-ab&q=weather&oq=&gs_l=&pbx=1&bav=on.2,or.r_cp.r_qf.&bvm=bv.47008514,d.eWU&fp=6c7f8a5fed4db490&biw=1366&bih=643&ion=1&pf=p&pdl=300")
a.Speak("Searching for local weather")
Case Is = "HELLO"
    a.Speak("Hello sir")
Case Is = "GOODBYE"
    a.Speak("Until next time")
Me.Close()
Case Is = "OPEN DISK DRIVE"
'Case Is = "NINE"
a.Speak("Its now open")
Dim oWMP = CreateObject("WMPlayer.OCX.7")
Dim CDROM = oWMP.cdromCollection
If CDROM.Count = 2 Then
CDROM.Item(1).Eject()
End If
End
End Select
End If
End Sub
Private Sub recognizer_LoadGrammarCompleted(sender As Object, e As LoadGrammarCompletedEventArgs) Handles recognizer.LoadGrammarCompleted
    Label1.Text = "Grammar " & grammarName & " " & If(grammarLoaded, "Is", "Is Not") & " loaded."
End Sub
Label2.Text = "Grammar " & e.Result.Grammar.Name & " " & e.Result.Text
End Sub
' Case Is = "white"
OutputListBox.BackColor = Color.White
' Case Is = "Yellow"
'End Select
OutputListBox.BackColor = Color.Red
End Sub
End Class
Imports System.Speech
Imports System.Speech.Recognition
Imports System.Speech.Recognition.SrgsGrammar
Try
colorRule.Add(colorsList)
gram.Rules.Add(colorRule)
gram.Root = colorRule
reco.LoadGrammarAsync(New Recognition.Grammar(gram))
reco.SetInputToDefaultAudioDevice()
reco.RecognizeAsync(RecognizeMode.Multiple)
Catch s As Exception
MessageBox.Show(s.Message)
End Try
End Sub
Case "green"
SetColor(Color.Lime)
Case "Yellow"
SetColor(Color.Yellow)
Case "black"
SetColor(Color.Black)
Case "blue"
SetColor(Color.Blue)
End Select
End Sub
End Class
10. Report
Storage of speech files and their features in traditional flat-file format
The process of data storage in a traditional flat-file format involves two or more types of files. Each prompted utterance is stored in a separate file, in any valid audio file format. The stored speech file for each utterance is processed with speech processing tools (i.e. software), the corresponding features for each utterance are extracted, and the processed outcome is stored in another flat file accompanying each utterance file. For the storage of the features we may use several different approaches, as follows:
(1) One may use a separate file for each feature, i.e. one file for the pitch of all utterances, one file for the frequency of all utterances, and so on. In each feature file, each row represents a different utterance. The affiliation of each row with the accompanying utterance must be determined in advance, and for every feature file this affiliation remains the same. Suppose there are 36 utterances and 10 features; then there are 46 files.
(2) All the features (i.e. pitch, frequency, and so on) for one utterance are stored in one file; for the second utterance all the features are again stored in another file, and so on. In this approach every file is named so that it accompanies its utterance. Suppose there are 36 utterances; then there are 72 files (36 for the utterances and 36 for the features of the accompanying utterances).
(3) All the features (i.e. pitch, frequency, and so on) for one utterance are stored in one line of the flat file, separated by commas or spaces; the second line stores the same features for the second utterance, and so on. In this file format each column represents the same feature for all utterances, and each row represents the different features of one utterance. The affiliation of each feature with a column, and of each utterance with a row, must be determined in advance. In this approach, if there are 36 utterances then there are 37 files.
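Approach (3) can be sketched with Python's csv module. Storing the utterance name in the first column is an addition for convenience, so the row-to-utterance affiliation need not be kept externally; the file name and feature values are invented for the example:

```python
import csv

def save_features(rows, path):
    """Approach (3): one line per utterance, features separated by commas."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for name, feats in rows:
            writer.writerow([name] + list(feats))

def load_features(path):
    """Rebuild {utterance_name: [feature values]} from the flat file."""
    with open(path) as f:
        return {row[0]: [float(x) for x in row[1:]] for row in csv.reader(f)}
```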
11. Conclusion
This project work on speech recognition started with a brief introduction to the application and to the technology in the computer (desktop applications). The project is able to write text through both keyboard and voice, to recognize different notepad commands such as open, save, select, copy, clear, and close, and to open different Windows software depending on the voice input.
One challenge is to develop ways in which our knowledge of the speech
signal, and of speech production and perception, can be incorporated more
effectively into recognition methods. For example, the fact that speakers have
different vocal tract lengths could be used to develop more compact models for
improved speaker-independent recognition.
12. System limitations
A speech signal is a highly redundant, non-stationary signal. These attributes make the signal very challenging to characterise. It should be possible to recognize speech directly from the digitized waveform; however, because of the large variability of the speech signal, it is a good idea to perform some form of feature extraction to reduce that variability. Applications that need voice processing (such as coding, synthesis, and recognition) require specific representations of speech information. For instance, the main requirement for speech recognition is the extraction of voice features that can distinguish different phonemes of a language. From a statistical point of view, this procedure is equivalent to finding a sufficient statistic to estimate phonemes. Other information not required for this aim, such as the dimensions of the phonatory apparatus (which are speaker dependent), the speaker's mood, sex, age, dialect inflexions, and background noise, should be overlooked. To decrease the ambiguity of the vocal message, speech is therefore filtered before it arrives at the automatic recognizer. Hence, the filtering procedure can be considered the first stage of speech analysis. Filtering is performed on discrete-time, quantized speech signals; the first procedure therefore consists of analog-to-digital signal conversion. Then the extraction of the significant features of the speech signal is performed.
When captured by a microphone, speech signals are seriously distorted by background noise and reverberation. Fundamentally, speech is made up of discrete units. A unit can be a word, a syllable, or a phoneme. Each stored unit of speech includes details of the characteristics that differentiate it from the others. Apart from the message content, the speech signal also carries variability such as speaker characteristics, emotions, and background noise. Speech recognition must cope with differences in accent, dialect, age, gender, emotional state, rate of speech, and environmental noise. According to Rosen (1992), the temporal features of speech signals can be partitioned into three categories: envelope (2-50 Hz), periodicity (50-500 Hz), and fine structure (500-10000 Hz). A method of generating feature signals from speech signals comprises the following steps:
Receive the speech signals.
Block the speech signals into frames.
Form frequency-domain representations of the blocked speech signals.
Pass the frequency-domain representations through mel filter banks.
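The four steps above can be sketched in Python (the sample rate, frame length, and filter count are illustrative choices, not the project's values):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank_energies(signal, sr=8000, frame_len=256, n_filters=12):
    """Frame the signal, take the magnitude spectrum of each frame,
    and pass it through triangular filters spaced evenly on the mel scale."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    spec = np.abs(np.fft.rfft(frames * np.hamming(frame_len), axis=1))
    n_bins = spec.shape[1]
    # Filter edge frequencies: equally spaced in mel, mapped back to FFT bins.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bin_pts = np.floor(mel_to_hz(mel_pts) / (sr / 2) * (n_bins - 1)).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        l, c, r = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        if c > l:
            fbank[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising slope
        if r > c:
            fbank[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling slope
    return spec @ fbank.T   # (n_frames, n_filters) filterbank energies
```

Taking the log of these energies followed by a DCT would yield the MFCCs listed in the abbreviations.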
During speech, altering the size and shape of the vocal tract, mostly by moving the tongue, results in frequency and intensity changes that emphasize some harmonics and suppress others. The resulting waveform has a series of peaks and valleys. Each peak is called a formant, and it is the manipulation of formant frequencies that facilitates the recognition of different vowel sounds. Speech has a number of features that need to be taken into account. A combination of linear predictive coding and cepstral recursion analysis is performed on the blocked speech signals to produce the various features of the signals.
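The linear predictive coding part of that last step can be sketched with the standard Levinson-Durbin recursion (the model order below is an illustrative choice):

```python
import numpy as np

def lpc(frame, order=8):
    """LPC coefficients [1, a1, ..., a_order] for one speech frame,
    via Levinson-Durbin on the frame's autocorrelation."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update previous coefficients
        a[i] = k
        err *= (1.0 - k * k)                 # residual prediction error
    return a
```

For an AR(1) signal x[t] = 0.9 x[t-1] + noise, a first-order fit should recover a1 close to -0.9.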
13. Enhancement
The goal of speech enhancement is to find an optimal estimate of the clean speech (i.e., one preferred by a human listener), given a noisy measurement. The relative unimportance of phase for speech quality has given rise to a family of speech enhancement algorithms based on spectral magnitude estimation. These are frequency-domain estimators in which an estimate of the clean-speech spectral magnitude is recombined with the noisy phase before resynthesis with a standard overlap-add procedure.
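Such an estimator can be sketched in a few lines of Python: subtract a noise-magnitude estimate from each frame's spectrum, keep the noisy phase, and resynthesize by overlap-add. The plain subtraction rule and the fixed parameters are illustrative simplifications, not a tuned implementation:

```python
import numpy as np

def spectral_subtract(noisy, noise_mag, frame_len=256):
    """Spectral-magnitude estimation: clean magnitude = max(|X| - noise, 0),
    recombined with the noisy phase, resynthesized with 50%-overlap add."""
    hop = frame_len // 2
    window = np.hanning(frame_len)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, hop):
        frame = noisy[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)   # magnitude estimate
        clean = mag * np.exp(1j * np.angle(spec))         # keep noisy phase
        out[start:start + frame_len] += np.fft.irfft(clean, frame_len)
    return out
```

`noise_mag` would normally be estimated per frequency bin from speech-free frames; a scalar works for the sketch.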
14. Bibliography
[1] "Speech Recognition - The Next Revolution", 5th edition.
[2] Ksenia Shalonova, "Automatic Speech Recognition", 7 Dec 2007.
[3] Source: http://www.cs.bris.ac.uk/Teaching/Resources/COMS12303/lectures/Ksenia_Shalonova-Speech_Recognition.pdf
[4] L. Rabiner & B. Juang, "Fundamentals of Speech Recognition", 1993. ISBN 0130151572.
[5] http://www.abilityhub.com/speech/speech-description.htm
[6] Charu Joshi, "Speech Recognition". Source: http://www.scribd.com/doc/2586608/speechrecognition.pdf