Speech Recognition


Project Report On AI Speech Recognition System

Thesis · April 2018


DOI: 10.13140/RG.2.2.31037.20965


1 author:

Ali Mansour Al-madani


Dr. Babasaheb Ambedkar Marathwada University


All content following this page was uploaded by Ali Mansour Al-madani on 07 July 2019.



MAHATMA GANDHI MISSION

DR. G. Y. PATHRIKAR COLLEGE OF COMPUTER SCIENCE


AND INFORMATION TECHNOLOGY, AURANGABAD
Affiliated to Dr. Babasaheb Ambedkar Marathwada University, Aurangabad

Project Report
On
AI Speech Recognition System

Submitted by
Ali Mansour Almadani
Email:
abounazek2012@gmail.com
csit.amm@bamu.ac.in

Guided by
Mr. Ashish Bhalerao
Assistant Professor

M.Sc. (Information Technology) fourth Semester,


Academic Year 2017-2018


Certificate

This is to certify that, ALI MANSOUR ALMADANI has


successfully completed the Project Report on “AI Speech Recognition
System” for partial fulfillment of the course M.Sc.(Information Technology)
Fourth Semester, affiliated to Dr. Babasaheb Ambedkar Marathwada
University Aurangabad, during the Academic Year 2017-2018.

Seat No: _

Mr. Ashish Bhalerao                          Dr. Satish Sankaye


Project Guide Head of the Department
Examiner

INDEX

Sr.No. Contents

1. Introduction to Project
   1.1 Existing System
   1.2 Need and Significance of Proposed System
   1.3 Objectives and Motivation
2. System Requirement
3. Feasibility Study
4. Requirement Analysis
5. Software Requirement Specifications (SRS)
6. Data Flow Diagram
7. E-R Diagram
8. Database Design
9. User Interface Design
10. Reports
11. Conclusion
12. System Limitations
13. Enhancement
14. Bibliography

Acknowledgement

• I extend my sincere thanks to Dr. Satish Sankaye, Professor and Head of the Department of Information Technology (HOD, IT). I take this opportunity to convey my deepest gratitude to him for his valuable advice, comforting encouragement, keen interest, constructive criticism, scholarly guidance and wholehearted support.

• I express my special thanks to Mr. Bharat Naiknaware, Associate Professor, Department of Education, MGM's Dr. G. Y. Pathrikar College of Computer Science and Information Technology, for his constant encouragement, elderly advice and for providing the required moral support during the work. His mentorship has been a rewarding experience which I will treasure my whole life.

• My special thanks also go to Mr. Ashish Bhalerao, Assistant Professor at MGM's Dr. G. Y. Pathrikar College of Computer Science and Information Technology, who permitted me to work on this project under his guidance.

• I have no words to express my sincere thanks to all the teaching and non-teaching staff of my department, i.e. Information Technology, MGM's Dr. G. Y. Pathrikar College of Computer Science and Information Technology.

• I also cannot forget all the current as well as past students of M.Sc. IT & Computer Science, who were always eagerly seeking and praying to God for the completion of my work as per my ambition and desire. I am afraid the list of names would be a very long one; I am grateful for their encouragement and convictions, and again thank them all with due regard.

• I must express my indebtedness to my brothers and my lovely mother. Without their kind and silent support this work could not have been completed.

• Above all, I offer my heartiest regard to Allah (God) for giving me the strength and inspiration to work.

Chapter 1

1.1. Introduction
Speech Recognition (SR) is the ability to translate dictation or spoken words into text. Speech recognition is also known as automatic speech recognition (ASR) or speech-to-text (STT).
 Speech recognition is the process of converting an acoustic signal, captured by a microphone or another peripheral, into a set of words.
 To achieve speech understanding, linguistic processing can be applied.
 The recognized words can be an end in themselves, as in applications such as command & control, data entry and document preparation.

In society, everyone, whether human or animal, wishes to interact with others and tries to convey their own message. The receiver of a message may get the exact and full idea of the sender, may get a partial idea, or sometimes cannot understand anything at all. In some cases there is a gap in communication (e.g. when a child conveys a message, the mother can understand easily while others cannot).

Project overview

This report presents an overview of speech recognition technology, the software developed, and its applications. The first section covers descriptions of the speech recognition process, its applications in different sectors, its flaws and, finally, the future of the technology. The later part of the report covers the speech recognition process, the code for the software and its working. Finally, the report concludes with the different potential uses of the application and further improvements and considerations.

1.2. Existing System (History)

The concept of speech recognition started somewhere in the 1940s; practically, the first speech recognition program appeared in 1953 at Bell Labs, and it recognized a digit in a noise-free environment.
The 1940s and 1950s are considered the foundational period of speech recognition technology; in this period, work was done on the foundations of speech recognition, namely automation and information-theoretic models.
In the 1960s, small vocabularies (on the order of 10-100 words) of isolated words could be recognized, based on simple acoustic-phonetic properties of speech sounds. The key technologies developed during this decade were filter banks and time-normalization methods.

In the 1970s, medium vocabularies (on the order of 100-1,000 words) were recognized using simple template-based pattern recognition methods.

In the 1980s, large vocabularies (1,000 words to unlimited) were used, and speech recognition problems based on statistical methods, with a large range of networks for handling language structures, were addressed. The key inventions of this era were the hidden Markov model (HMM) and the stochastic language model, which together enabled powerful new methods for handling the continuous speech recognition problem efficiently and with high performance.
In the 1990s, the key technologies developed were methods for stochastic language understanding, statistical learning of acoustic and language models, and methods for implementing large-vocabulary speech understanding systems.

After five decades of research, speech recognition technology has finally entered the marketplace, benefiting users in a variety of ways. The challenge of designing a machine that truly functions like an intelligent human is still a major one going forward.

1.3. Project objective

 Its applications work in different areas.
 It is implemented as a desktop application.
 The application is software that can be used for:
o Speech recognition (converting voice to text)
o Speech generation (converting text to voice)
o Text editing (copy, paste, select)
 Designing and developing an interactive, user-friendly text editor which allows the user to enter, manipulate and format text, all by familiar commands.
 Developing software for speech recognition (speech-to-text conversion).
 Developing advanced technology incorporating these ideas.
 Developing a model that compares the wave data with a phoneme database and displays the characters (sentences) on the screen.
Speech recognition is a technology that enables a computer to capture the words spoken by a human with the help of a microphone (embedded in the computer or external).

1.4. Abstract
Speech recognition technology is one of the fastest-growing engineering technologies. This project is designed and developed with that fact in mind, and a little effort is made to achieve this aim.
It has a number of applications in different areas and provides potential benefits. Nearly 20% of the world's people suffer from various disabilities; many of them are blind or unable to use their hands effectively. In those particular cases, a speech recognition system provides significant help, so that they can share information with people by operating a computer through voice input.
Consider the thousands of people in the world who are not able to use their hands, making typing impossible. Our project is for these people who cannot type or see, and even for those of us who are lazy and do not feel like it. Our project is capable of recognizing speech and converting the input audio into text; it also enables a user to perform operations such as open, close, exit and read on a program, application or file by providing voice input, for example opening a word processor, Google Chrome, Notepad or the calculator.
Our project is also capable of reading text written by anyone, or text entered by the user himself.

Glossary of terms (key words):


ASR - Automatic Speech Recognition
Dictation - the mode in which the user enters data by reading directly to the computer.
AR - Auto Regressive
ARMA - Auto Regressive Moving Average
CD - Cepstral Distortion

CDMA - Code Division Multiple Access
CELP - Code Excited Linear Prediction
DCT - Discrete Cosine Transform
DFT - Discrete Fourier Transform
DSP - Digital Signal Processing
FEC - Forward Error Correction
FIR - Finite Impulse Response
GSM - Global System for Mobile telecommunications
IIR - Infinite Impulse Response
IDCT - Inverse Discrete Cosine Transform
IDFT - Inverse Discrete Fourier Transform
LPC - Linear Predictive Coding
LSP - Line Spectrum Pair
IMBE - Improved Multi-Band Excitation
MBE - Multi-Band Excitation
MSE - Mean Square Error
NLP - Non-Linear Pitch
PCM - Pulse Code Modulation
PSTN - Public Switched Telephone Network
RMS - Root Mean Square
RPE - Regular Pulse Excitation
SD - Spectral Distortion
SEGSNR- Segmental Signal to Noise Ratio
SNR - Signal to Noise Ratio
VSELP - Vector Sum Excited Linear Prediction
AMDF - Average Magnitude Difference Function
F0 - Fundamental Frequency of Speech
STE - Short Term Energy
ZCR - Zero Crossing Rate
ITU - Upper Energy threshold
ITL - Lower Energy threshold
IZCT - Zero Crossing Rate Threshold
C-V - Consonant Vowel
FFT - Fast Fourier Transform
DFFT - Discrete Fast Fourier Transform
STFT - Short-Time Fourier Transform
MFCC - Mel Frequency Cepstral Coefficients

Continuous speech: when the user speaks in a normal, fluid manner without having to pause between words, this is referred to as continuous speech.
Discrete speech: when the user speaks with a pause between each word, this is referred to as discrete speech.
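Several of the glossary terms above (STE, ZCR, and the thresholds ITU, ITL, IZCT) come from classical endpoint detection, which separates speech from silence before recognition begins. A minimal sketch of the two underlying measures, written in Python for illustration with invented sample frames:

```python
# Short-Term Energy (STE) and Zero Crossing Rate (ZCR) of one frame.
# These are the raw measures on which threshold-based endpoint detection
# (upper/lower energy thresholds ITU/ITL, ZCR threshold IZCT) is built.

def short_term_energy(frame):
    """Sum of squared samples in the frame."""
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

# Invented frames: low-amplitude noise vs. a voiced-speech-like segment.
silence = [0.0, 0.01, -0.01, 0.0, 0.02, -0.02, 0.01, -0.01]
voiced  = [0.5, 0.8, 0.9, 0.7, 0.4, 0.1, -0.2, -0.4]

# Voiced speech: high energy, relatively low zero crossing rate.
print(short_term_energy(voiced) > short_term_energy(silence))    # True
print(zero_crossing_rate(voiced) < zero_crossing_rate(silence))  # True
```

A real endpoint detector would compare these per-frame values against the ITU/ITL/IZCT thresholds to decide where an utterance starts and ends.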

1.5. Project Scope

This project has speech recognition and speech synthesis capabilities. Though it is not a complete replacement for what we call a notepad, it is still a good text editor to be used through voice. The software can also open Windows-based programs such as Notepad, Google Chrome, etc.

Statement of the problem

The title of the present study was: "Speech Pattern Recognition for Speech-to-Text Conversion".
When the user speaks alphabets into the microphone, the different patterns of the alphabets are identified and compared with the corresponding patterns stored in the standard phoneme database, and the highest-matching alphabet of the Gujarati language is returned as text on the screen.
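The matching step described above can be sketched as a nearest-template search. This is an illustrative Python sketch, not the project's actual implementation: the phoneme labels, feature vectors and Euclidean distance measure are all assumptions.

```python
# Toy nearest-template matcher: compare an input feature vector against
# stored phoneme templates and return the best-matching label.
# Templates and feature values below are invented for illustration.

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def best_match(features, templates):
    """Return the label whose stored template is closest to `features`."""
    return min(templates, key=lambda label: euclidean(features, templates[label]))

# Hypothetical phoneme database: label -> averaged feature vector.
templates = {
    "ka":  [0.9, 0.1, 0.3],
    "kha": [0.2, 0.8, 0.5],
    "ga":  [0.4, 0.4, 0.9],
}

print(best_match([0.85, 0.15, 0.25], templates))  # closest to "ka"
```

A real system would extract far richer features (e.g. MFCCs, listed in the glossary) and use probabilistic models rather than raw distances, but the compare-and-pick-the-closest structure is the same.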

1.6. Rationale of the study

The importance of the study can be identified as follows:
 The software helps professionals manage their workload by dictation in spoken language.
 The software can operate transparently behind an application, benefiting users who are unfamiliar with speech recognition or who might become confused by multiple applications.
 It prepares a standard file that can be used in another speech dictation system or in any familiar editor.
 It enables end users to manage their user files (i.e. opening, saving, backing up, restoring, renaming and deleting).
 It eliminates per-user training of the software; that is, it is speaker-independent speech dictation software. In addition, the software's enhanced runtimes include support for customizable command and control functions, transcription and dictation playback.
 Speech can be saved in an appropriate format, so that the speaker or a third party can replay recorded speech to facilitate correction.
 The user can switch between dictation and typing modes without any extra effort.
 The software uses its own changing fonts, and also supports other well-known changing fonts.
 Hands-free computing as an alternative to the keyboard, or to allow the application to be used in environments where a keyboard is impractical (e.g. small mobile devices, Auto PCs, or mobile phones).
 Voice responses to message boxes and wizard screens can easily be designed into an application.
 A more "human" computer, one a user can talk to, may make educational and entertainment applications seem more friendly and realistic.

 Streamlined access to application controls and large lists enables a user to
speak any one item from a list or any command from a potentially huge set
of commands without having to navigate through several dialog boxes or
cascading menus.
 Speech-activated macros let a user speak a natural word or phrase rather than use the keyboard or a command to activate a macro.
 Speech recognition offers game and edutainment developers the potential to bring their applications to a new level of play. It enhances the realism and fun in many computer games, provides a useful alternative to keyboard-based control, and voice commands give the user new freedom in any sort of application, from entertainment to office productivity.
 There are many situations in which hands are not available to issue commands to a device; speech is a natural alternative interface to computers for people with limited mobility in their arms and hands, or for those with sight limitations.

 Applications that require users to key paper-based data into the computer are good candidates for a speech recognition application. Reading data directly to the computer is much easier for most users and can significantly speed up data entry. Some recognizers can even handle spelling fairly well. If an application has fields with mutually exclusive data types (e.g. sex, age and city), the speech recognition engine can process the command and automatically determine which field to fill in.

 Document editing, in which one or both modes of speech recognition could be used to dramatically improve productivity. Dictation would allow users to dictate entire documents without typing. Command and control mode would allow users to modify formatting or change views without using the mouse or keyboard.
 The earliest use of spoken language technologies was the reading machine, allowing blind people to read books. This field now has a whole array of technologies that also help hearing-impaired children with interactive audio-visual software, and many other assistive approaches.

Chapter 2

2. System Requirements:

2.1. An overview of Speech Recognition


Speech recognition is a technology that enables a computer to capture the words spoken by a human with the help of a microphone. These words are later recognized by a speech recognizer and, in the end, the system outputs the recognized words.
The ideal situation in the speech recognition process is that the engine recognizes all words uttered by a human; in practice, however, the performance of a speech recognition engine depends on a number of factors. Vocabulary size, multiple users and noisy environments are the major factors on which a speech recognition engine's performance depends.
2.2. Types of speech recognition
Speech recognition systems can be divided into a number of classes based on their ability to recognize words and the list of words they have. A few classes of speech recognition are described below:
2.2.1. Isolated Speech
Isolated words usually involve a pause between two utterances; this doesn't mean that the system only accepts a single word, but rather that it requires one utterance at a time.
2.2.2. Connected Speech
Connected words, or connected speech, is similar to isolated speech but allows separate utterances with minimal pause between them.
2.2.3. Continuous speech
Continuous speech allows the user to speak almost naturally; it is also called computer dictation.
2.2.4. Spontaneous Speech
At a basic level, it can be thought of as speech that is natural sounding and
not rehearsed. An ASR system with spontaneous speech ability should be able to
handle a variety of natural speech features such as words being run together,
“ums” and “ahs”, and even slight stutters.

[Fig: 2.1 Speech Recognition Process — audio captured by the microphone passes through analog-to-digital conversion to the speech engine, which uses an acoustic model and a language model to produce the recognized text on the display.]

2.3. Speech recognition weaknesses:

Despite all these advantages and benefits, a hundred-percent-perfect speech recognition system has yet to be developed. There are a number of factors that can reduce the accuracy and performance of a speech recognition program.

Speech recognition is easy for a human but a difficult task for a machine. Compared with a human mind, speech recognition programs seem less intelligent. This is due to the fact that for a human the capability of thinking, understanding and reacting is natural, while for a computer program it is a complicated task: first it needs to understand the spoken words with respect to their meanings, and it has to strike a sufficient balance between the words, noise and spaces. A human has a built-in capability of filtering noise from speech, while a machine requires training; the computer needs help separating speech sounds from other sounds.
2.4. Factors affecting speech recognition:
2.4.1. Homonyms: words that are spelled differently and have different meanings but sound the same, for example "there" and "their", or "be" and "bee". It is a challenge for a machine to distinguish between such phrases that sound alike.
2.4.2. Overlapping speech: a second challenge is to understand speech uttered by different users; current systems have difficulty separating simultaneous speech from multiple users.

2.4.3. Noise factor: the program needs to hear the words uttered by a human distinctly and clearly. Any extra sound can create interference; the system should be placed away from noisy environments and spoken to clearly, otherwise the machine will get confused and mix up the words.
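The homonym problem in 2.4.1 is usually resolved with a language model that scores candidate words in context. A minimal Python sketch (the bigram counts below are invented for illustration; the project itself is a C# application):

```python
# Toy language-model disambiguation of words that sound alike.
# The recognizer proposes homophone candidates; a bigram count table
# picks the one that is likelier after the previous word.

BIGRAM_COUNTS = {
    ("over", "there"): 50, ("over", "their"): 1,
    ("in", "their"): 40,   ("in", "there"): 2,
}

def choose(previous_word, candidates):
    """Pick the candidate with the highest bigram count after `previous_word`."""
    return max(candidates, key=lambda w: BIGRAM_COUNTS.get((previous_word, w), 0))

print(choose("over", ["there", "their"]))  # there
print(choose("in", ["there", "their"]))    # their
```

Real recognizers use the same idea at much larger scale: the stochastic language model mentioned in the history section assigns probabilities to word sequences, so acoustically identical candidates are separated by context.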
2.5. The future of speech recognition:
 Dictation speech recognition will gradually become accepted.
 Accuracy will become better and better.
 Microphone and sound systems will be designed to adapt more quickly to changing background noise levels and different environments, with better recognition of extraneous material to be discarded.
 Greater use will be made of "intelligent systems" which will attempt to guess what the speaker intended to say, rather than what was actually said, as people often misspeak and make unintentional mistakes.

Methodology
As speech recognition is an emerging technology, not all developers are familiar with it. While the basic functions of both speech synthesis and speech recognition take only a few minutes to understand (after all, most people learn to speak and listen by age two), there are subtle and powerful capabilities provided by computerized speech that developers will want to understand and utilize.
An understanding of the capabilities and limitations of speech technology is also important for developers in deciding whether a particular application will benefit from the use of speech input and output.
System Requirements:

Component    Minimum            Recommended
CPU          1.6 GHz            2.53 GHz
RAM          2 GB               4 GB
Microphone   Any microphone     High-quality microphone
Sound card   Any sound card     Sound card with very clear signal

CPU:
Our application depends on the efficiency of the CPU (central processing unit), because a large amount of digital filtering and signal processing takes place in ASR (automatic speech recognition).

Chapter 3

3. Feasibility Study :

Feasibility Study has several aspects:

3.1. Technical Feasibility


3.2. Scheduling Feasibility
3.3. Financial Feasibility
3.4. Operational Feasibility
3.5. Social and Ethical Considerations
3.6. Legal Feasibility

Through these studies, the following conclusions and proposals for the project were obtained:

3.1. Technical Feasibility

There are many components that build up our system: hardware, software and human components.
3.1.1.1. Hardware components
Network communication: a modem for connecting to the internet, and connecting wires.

3.1.1.2. Computer components:
The computer devices which are used to implement the application (Speech Recognition System).

Component   Minimum   Recommended
CPU         1.6 GHz   2.53 GHz
RAM         2 GB      4 GB

3.1.2. Human Components

Programmers, analysts, designers, etc.

3.1.3. Software Components

Visual Studio 2015: for building the project, creating all the Windows Forms applications and designing the interfaces.
MySQL: for managing the database (creating tables, storing the data).
Word processor: for writing the project report.
Programming language:
The programming language is C# (C Sharp). It is easy to learn and is used to create Windows Forms applications; it is also a well-known, high-level programming language. The Microsoft Speech SDK is one of the many tools that enable a developer to add speech capability to an application.

C# is an open-source language and runs on Windows, Mac and Linux. The language helps you develop Windows Store applications, Android apps and iOS apps. It can also be used to build back-end and middle-tier frameworks and libraries. It supports language interoperability, which means that C# can access code written in any .NET-compliant language. C# runs on a variety of computing platforms, so a developer can easily reuse code. C# supports operator overloading and pre-processor directives, which help with speech recognition grammars. With this language we can easily handle speech recognition events. One can also find freelance jobs online in this sector.

3.2. Scheduling Feasibility

3.2.1. Initiation phase

Table ( ): Scheduling feasibility for the planning phase

Task        Duration   Start  Finish
Selection   14 days    12/9/2017  28/9/2017
Planning    7 days     12/10/2017  23/10/2017

3.2.2. Analysis phase

Table ( ): Scheduling feasibility for the analysis phase

Task                                     Duration   Start  Finish
Requirements Gathering                   10 days    24/10/2017  2/11/2017
Requirement Analysis and Specification   10 days    5/11/2017  15/11/2017

3.2.3. Implementation phase

Table ( ): Scheduling feasibility for the implementation phase

Task                    Duration   Start  Finish
Coding                  39 days    6/1/2018  14/2/2018
Testing & Maintenance   13 days    15/2/2018  28/2/2018
Installation            7 days     23/3/2018  30/3/2018
Documentation           8 days     10/4/2018  20/4/2018
Support                 9 days     30/4/2018  8/5/2018

3.3. Financial Feasibility

Costs:
The costs that will be incurred by the team to complete the project, as we will discuss.
Profits:
The profits the team can achieve after implementing the project. In the beginning it will be a trial version that any user can use for free. After the version is improved there will be a product key, without which no one can use the application; the application will then be sold to users, and every version will have a different product key.

3.4. Operational Feasibility
3.4.1. Performance (throughput):
The goals are increasing recognition throughput in batch processing of speech data, and reducing recognition latency in real-time usage scenarios.
Improve throughput: allow batch processing of the speech recognition task to execute as efficiently as possible, thereby increasing its utility for multimedia search and retrieval.

Batch speech transcription can be "embarrassingly parallel": different speech utterances can be distributed to different machines. However, there is significant value in improving compute efficiency, which is increasingly relevant in today's energy-limited and form-factor-limited devices and compute facilities.

Response time (RT) is widely used in the study of human speech recognition as a measure of relative processing difficulty at all levels, including the sentence, word and phoneme levels.
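The "embarrassingly parallel" batch mode above can be sketched with a worker pool. The `transcribe()` function here is a hypothetical stand-in for a real recognition engine (shown in Python for brevity; the project itself is a C# application):

```python
# "Embarrassingly parallel" batch transcription sketch: independent
# utterances are farmed out to a pool of workers, since each utterance
# can be recognized without reference to the others.
from concurrent.futures import ThreadPoolExecutor

def transcribe(utterance_id):
    """Placeholder for recognizing one recorded utterance."""
    return f"transcript-of-{utterance_id}"

def transcribe_batch(utterance_ids, workers=4):
    """Run recognition on many utterances concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transcribe, utterance_ids))

print(transcribe_batch(["u1", "u2", "u3"]))
# ['transcript-of-u1', 'transcript-of-u2', 'transcript-of-u3']
```

`pool.map` preserves input order, so the transcripts line up with the submitted utterances; across machines the same pattern applies with a job queue instead of a thread pool.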

3.4.2. Information

With the help of a microphone, audio is input to the system; the PC sound card produces the equivalent digital representation of the received audio.

An acoustic model is created by taking audio recordings of speech, and their


text transcriptions, and using software to create statistical representations of the
sounds that make up each word. It is used by a speech recognition engine to
recognize speech.

Speech is used in transactional applications to navigate around the


application or to conduct a transaction. For example, speech can be used to purchase
stock, reserve an airline itinerary, or transfer bank account balances. It can also be
used to follow links on the web or move from application to application on one's
desktop. Most often, but not exclusively, this category of speech applications
involves the use of a telephone. The user speaks into a phone, the signal is interpreted
by a computer (not the phone), and an appropriate response is produced. A custom,
application-specific vocabulary is usually used; this means that the system can only
"hear" the words in the vocabulary. This implies that the user can only speak what
the system can "hear." These applications require careful attention to what the
system says to the user since these prompts are the only way to cue the user as to
which words can be used for a successful outcome.
3.5. Social and Ethical Considerations
3.6. Legal Feasibility

4. Requirement Analysis

Analysis → Design → Build → Deploy

Analyze
- Identify opportunities for speech & outline project strategy
- Review business requirements & processes
- Review of existing IVR
- Interview Subject Matter Experts
- Voice User Interface & technical requirements (use-case scenarios)
- Define success criteria
- Map out solution
- Client review & sign-off on requirements

Design
- Persona/Branding
- Call Flow Diagrams
- Sample Dialogs
- Vision Clip
- Detailed Dialog Design
- Early Usability Research
- Architecture
- Client review & sign-off
- Dialog Traversal Testing
- Voice User Interface Review

Build
- Application code
- Recognition grammars
- Prompt recordings
- Integration
- Logging & reporting
- Test plan
- Unit & system testing

Deploy
- Limited pilot deployment
- Whole-call analysis
- Recognition tuning
- Full deployment
- Recognition tuning (R2)
- Whole-call analysis (R2)
- Finalize documentation
- Project handoff

4.1. Fundamentals of speech recognition
Speech recognition is basically the science of talking with the computer and having it correctly recognize what was said. To elaborate, we have to understand the following terms.

4.1.1. Utterance
When the user says something, that is an utterance. In other words, speaking a word or a combination of words that means something to the computer is called an utterance. Utterances are then sent to the speech engine to be processed.

4.1.2. Pronunciation
What a speech recognition engine uses to process a word is its pronunciation, which represents what the speech engine thinks a word should sound like. Words can have multiple pronunciations associated with them.

4.1.3. Grammar:
A grammar uses a particular set of rules to define the words and phrases that will be recognized by the speech engine; more concisely, the grammar defines the domain within which the speech engine works. A grammar can be as simple as a list of words, or flexible enough to support various degrees of variation.
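A command grammar of the kind the project uses for voice commands (open, close, etc.) can be sketched as a simple rule table. The vocabulary below is a hypothetical example, shown in Python for brevity rather than the project's C#:

```python
# A grammar as a simple set of rules: each command word may be followed
# by one target from a fixed list. Utterances outside the grammar are
# rejected, which is what keeps small-vocabulary recognition reliable.

GRAMMAR = {
    "open":  {"notepad", "chrome", "calculator"},
    "close": {"notepad", "chrome", "calculator"},
}

def parse_command(utterance):
    """Return (action, target) if the utterance fits the grammar, else None."""
    words = utterance.lower().split()
    if len(words) == 2 and words[0] in GRAMMAR and words[1] in GRAMMAR[words[0]]:
        return words[0], words[1]
    return None

print(parse_command("Open Notepad"))       # ('open', 'notepad')
print(parse_command("delete everything"))  # None
```

Constraining the engine to such a grammar is what lets command-and-control systems stay accurate: the recognizer only has to choose among a handful of legal phrases.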
4.1.4. Accuracy
The performance of a speech recognition system is measurable; the ability of the recognizer can be measured by calculating its accuracy, that is, how reliably it identifies an utterance.
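A common way to measure that accuracy is the word error rate (WER): the word-level edit distance between the reference transcript and the recognizer's output, divided by the reference length. A self-contained Python sketch:

```python
# Word error rate (WER): edit distance over words (substitutions,
# insertions, deletions) between reference and hypothesis, divided by
# the number of reference words. Lower is better; 0.0 is a perfect match.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") out of four reference words.
print(word_error_rate("open the notepad now", "open a notepad now"))  # 0.25
```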
4.1.5. Vocabularies
Vocabularies are the lists of words that can be recognized by the speech recognition engine. Generally, smaller vocabularies are easier for a speech recognition engine to identify, while a large list of words is a more difficult task for the engine.

4.1.6. Training
Training can be used by users who have difficulty speaking or pronouncing certain words; speech recognition systems with training should be able to adapt to them.

4.2. Tools
1- Visual Studio 2015 (coding)
2- Office 2016 (word processor, for the documentation)
3- PowerPoint (for the presentation)

4.3. Speech Synthesis

[Fig: 4.1 Speech synthesis — pipeline: structure analysis → text-to-phoneme conversion → prosody analysis → waveform production.]

A speech synthesizer converts written text into spoken language. Speech synthesis is also referred to as text-to-speech (TTS) conversion.

The major steps in producing speech from text are as follows:


 Structure analysis: process the input text to determine where
paragraphs,sentences and other structures start and end. For most
languages, punctuation and formatting data are used in this stage.
 Text pre-processing: analyze the input text for special constructs of the
language. In English, special treatment is required for abbreviations,
acronyms, dates, times, numbers, currency amounts, email addresses and
many other forms. Other languages need special processing for these forms
and most languages have other specialized requirements.

The remaining steps convert the processed text to speech.
 Text-to-phoneme conversion: convert each word to phonemes. A phoneme is a basic unit of sound in a language. US English has around 45 phonemes, including the consonant and vowel sounds. For example, "times" is spoken as four phonemes: "t ay m s". Different languages have different sets of sounds (different phonemes); Japanese, for example, has fewer phonemes but includes sounds not found in English, such as the "ts" in "tsunami".
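The conversion can be sketched as a lexicon lookup; real engines combine a large pronunciation dictionary with letter-to-sound rules for unknown words. The entries below are illustrative, not taken from a real dictionary:

```python
# A toy pronunciation lexicon; entries are illustrative only.
LEXICON = {
    "times": ["t", "ay", "m", "s"],
    "open": ["ow", "p", "ax", "n"],
    "file": ["f", "ay", "l"],
}

def to_phonemes(text):
    """Look each word up in the lexicon. Unknown words would normally fall
    back to letter-to-sound rules; here they are simply marked."""
    result = []
    for word in text.lower().split():
        result.extend(LEXICON.get(word, ["<unk>"]))
    return result

print(to_phonemes("open times"))  # ['ow', 'p', 'ax', 'n', 't', 'ay', 'm', 's']
```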

 Prosody analysis: process the sentence structure, words and phonemes to determine appropriate prosody for the sentence. Prosody includes many of the features of speech other than the sounds of the words being spoken: the pitch (or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words and many other features. Correct prosody is important for making speech sound right and for correctly conveying the meaning of a sentence.

 Waveform production: finally, the phoneme and prosody information are used to produce the audio waveform for each sentence. There are many ways in which speech can be produced from phoneme and prosody information. Most current systems do it in one of two ways: concatenation of chunks of recorded human speech, or formant synthesis using signal processing techniques based on knowledge of how phonemes sound and how prosody affects those phonemes. The details of waveform generation are not typically important to application developers.
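As a toy illustration of this last step (nothing like a production concatenative or formant synthesizer), one can render a voiced segment as a pure tone at the pitch chosen by prosody analysis:

```python
import math

def synthesize_tone(pitch_hz, duration_s, sample_rate=8000):
    """Crude stand-in for waveform production: render a voiced segment as a
    pure tone at the prosody-chosen pitch, with a linear fade-in/fade-out so
    that concatenated segments do not click."""
    n = int(duration_s * sample_rate)
    fade = max(1, n // 10)
    samples = []
    for i in range(n):
        s = math.sin(2 * math.pi * pitch_hz * i / sample_rate)
        env = min(1.0, i / fade, (n - 1 - i) / fade)  # attack/decay envelope
        samples.append(s * env)
    return samples

wave = synthesize_tone(pitch_hz=120, duration_s=0.2)
print(len(wave))  # 1600 samples at 8 kHz
```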

Chapter 5
5. Software Requirement Specifications (SRS)

General working mechanism of speech recognition

When one thinks about speaking to computers, the first image is usually speech recognition: the conversion of an acoustic signal to a stream of words. After many years of research, speech recognition technology is beginning to pass the threshold of practicality. The last decade has witnessed dramatic improvement in speech recognition technology, to the extent that high-performance algorithms and systems are becoming available.
A wide variety of techniques, at different levels, are used to perform speech recognition. The speech recognition process is performed by a software component known as the speech recognition engine. The primary function of the speech recognition engine is to process spoken input and translate it into text that an application can understand. The application can work in two different modes: command-and-control mode, sometimes referred to as voice navigation, and dictation mode.
In command-and-control mode the application interprets the result of the recognition as a command. This mode offers developers the easiest implementation of a speech interface in an existing application. In this mode the grammar (or list of recognized words) can be limited to the list of available commands. This provides better accuracy and performance, and reduces the processing overhead required by the application. An example of a command-and-control application is one in which the caller says "open file", and the application asks for the name of the file to be opened.

The entire speech recognition process is summarized as follows:

 Speech recognition starts with the digital sampling of speech.
 The ASR program separates the words from the noise, and from the words it parses the phonemes, which are the smallest sound units.
 The program's database maps sounds to character groups and converts each sound into the appropriate character group.
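The first step, digital sampling, can be sketched as follows; the 440 Hz test tone stands in for the analog microphone signal:

```python
import math

def sample_and_quantize(freq_hz, duration_s, sample_rate=16000, bits=16):
    """First ASR step: sample the analog signal at discrete instants and
    quantize each sample to a signed integer (here 16-bit PCM)."""
    max_amp = 2 ** (bits - 1) - 1
    n = int(duration_s * sample_rate)
    return [round(max_amp * math.sin(2 * math.pi * freq_hz * i / sample_rate))
            for i in range(n)]

pcm = sample_and_quantize(440, 0.01)  # 10 ms of a 440 Hz tone
print(len(pcm), min(pcm), max(pcm))
```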

20
Functions of the speech recognizer
o Filter the raw signal into frequency bands.
o Cut the utterance into a fixed number of segments.
o Average the data for each band in each segment.
o Store this pattern with its name.
o Collect a training set of about three repetitions of each pattern (utterance).
o Recognize an unknown utterance by comparing its pattern against all patterns in the training set and returning the name of the pattern closest to the unknown.
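These functions amount to simple template matching, which can be sketched as follows (the frames and band values below are invented for illustration):

```python
def make_pattern(frames, n_segments=4):
    """Average the per-band data over a fixed number of segments, producing a
    small fixed-size template regardless of utterance length."""
    seg_len = max(1, len(frames) // n_segments)
    pattern = []
    for s in range(n_segments):
        seg = frames[s * seg_len:(s + 1) * seg_len] or frames[-1:]
        n_bands = len(seg[0])
        pattern.extend(sum(f[b] for f in seg) / len(seg) for b in range(n_bands))
    return pattern

def recognize(unknown, training_set):
    """Return the name of the stored pattern closest to the unknown pattern."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training_set, key=lambda name: dist(unknown, training_set[name]))

yes_frames = [[1.0, 0.2]] * 8   # high energy in band 0
no_frames = [[0.2, 1.0]] * 8    # high energy in band 1
training = {"yes": make_pattern(yes_frames), "no": make_pattern(no_frames)}
print(recognize(make_pattern([[0.9, 0.3]] * 8), training))  # closest to "yes"
```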

A prototype model for speech recognition

Speech recognition, or speech-to-text, involves capturing and digitizing the sound waves, converting them into basic language units or phonemes, constructing words from phonemes, and contextually analyzing the words to ensure correct spelling for words that sound alike. Figure 3.1 outlines the entire speech processing mechanism.

Normally a speech recognition system consists of three subsystems: a microphone for translating spoken words into analog signals, an analog-to-digital signal processor, and software and hardware for translating the digital signal back into words.

21
The process of conversion from speech to words is complex and varies slightly between systems. It consists of three steps:
(1) Feature extraction – pre-processing of the speech signal, extracting the important features into feature vectors.
(2) Phoneme recognition – based on a statistically trained phoneme model (an HMM), the most likely sequence of phonemes is calculated.
(3) Word recognition – based on a statistically trained language model, similar to the phoneme model, the most likely sequence of words is calculated.
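Step (2) is classically solved with the Viterbi algorithm. The sketch below uses a toy two-phoneme HMM with invented probabilities, not a trained model:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Step (2) in miniature: given per-frame acoustic observations, find the
    most likely hidden phoneme sequence under a toy HMM."""
    # best[t][s] = (probability, best previous state) for state s at time t
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states)
            row[s] = (prob, prev)
        best.append(row)
    # backtrack from the most probable final state
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return list(reversed(path))

states = ["t", "ay"]
start_p = {"t": 0.8, "ay": 0.2}
trans_p = {"t": {"t": 0.3, "ay": 0.7}, "ay": {"t": 0.2, "ay": 0.8}}
emit_p = {"t": {"burst": 0.9, "voiced": 0.1}, "ay": {"burst": 0.1, "voiced": 0.9}}
print(viterbi(["burst", "voiced", "voiced"], states, start_p, trans_p, emit_p))
```

Word recognition in step (3) applies the same decoding idea one level up, with a language model in place of the phoneme model.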

Fig: 5.3 Model for speech file preparation

Speech dictation process:
After the preparation of the master database of features of Gujarati alphabets, the researcher proposed the dictation model, where the actual human-machine interaction starts in the form of speech dictation. The researcher divided the model into five steps (Fig. 5.4): (1) input acquisition, (2) front end, (3) feature extractor, (4) local match and (5) character printing.

Fig: 5.4 speech dictation process

The entire process is summarized in the following steps:

 The user speaks into the microphone.
 The microphone accepts the voice and generates electrical impulses.
 The sound card converts the acoustic signal to a digital signal.
 The speech engine compares the digital signal's attributes with the stored voice files (the master database) and, using a probability model, returns the most likely matching word.

10) Flowchart of a speech recognition system

Speech → robust analysis and processing → speech vectors → feature vectors → matching against the reference model → recognition result.

11) Use case diagram:

Fig: 6.1 use case diagram

12) Diagrams:

Saving the document

Fig: 7.1 Saving the document

Writing Text

Fig: 7.2 Writing Text

Opening Document

Fig: 7.3 Opening Document

Closing Document

Fig: 7.4 closing document

Opening system software

Fig: 7.5 Opening System Software

13) E-R Diagram

Fig: E-R Diagram

8. Database Design

Code for designing the database:

Creating an Access Database

Here is the part where basic to intermediate experience with MS Access comes in handy, because we will not go into every detail of this process. I generally use MS Access 2007, so my instructions are geared toward that version. To begin, select the Windows icon in the upper left, then click on New. Name the database whatever you would like; for this tutorial I will name mine VR.accdb. Create a table that your project can use. I named mine CustomCommands and included the following fields:

 ID
 CommonField
 Command
 Result

Save the database so that we can use it during the next step.

Connecting the Pieces

 Connecting the Database to Your Form


 Creating Data Grids
 Creating Text Boxes and Buttons

Connecting the Database to Your Form

Thanks to MS Visual Studio you can drag and drop a lot of things, and it will add many of the connection strings for you. Let's go ahead and connect the database to your forms now. To begin, go to Data Sources and select Add New Data Source…; this will open the Data Source Configuration Wizard, which will walk you through the steps to add your database connection.

After you have connected your program to the MS Access database you created, we will add that database to our program forms. You will see that you now have your DataSet under your Data Sources.

This is where you can click and hold a given field, such as "Command" or "Result", and drag it onto your forms. Just make sure the field is set to TextBox, and you should have a form that looks something like this:

The "CommonField" field is what the computer will speak to you, the "Command" field is what you speak to the computer, and the "Result" field is the program that will be executed. To help keep these straight, I will rename my labels to that effect. We just walked through connecting a database to only one of our forms. You can follow the same steps to connect this database to your second form.

Creating Data Grids

On our main form we will not need a data grid view; however, on our second form we will. Having the grid is not strictly necessary for the operation of the program, but it does help you keep things organized as you are managing your commands. Under the Data Sources explorer select CustomCommands and, from its dropdown menu, select DataGridView. Then simply grab CustomCommands with your mouse and drag it onto your form. You can arrange your data grid view however you would like from there.

Creating Text Boxes and Buttons

Now, we can create all of the buttons and text boxes we will need on our forms.
We'll just do this in steps to keep things simple.

For Form1 do the following:

1. Step 1 - Rename the following labels:


 "Command" field will now be "Your Spoken Command:"
 "Result" field will now be "Program to Launch:"
 "Common Field" field will now be "What the PC Speaks"
2. Step 2 - Add button1 and change the text to "Close this window"
3. Step 3 - Add button2 and change the text to "Open Search Box"
4. Step 4 - Add groupBox1 and remove text
5. Step 5 - Add radioButton1 inside of groupBox1 and change the text to
"Website"
6. Step 6 - Add radioButton2 inside of groupBox1 and change the text to
"Program"
7. Step 7 - Add button3 and change the name to btnSave and change the text
to "Save"
8. Step 8 - Add button4 and change the name to btnOpen and change the text
to "Open"
9. Step 9 - Add button5 and change the name to btnToListbox and change the
text to "to listbox"

For Form4 do the following:

1. Step 1 - Add button1 and name it button4


2. Step 2 - Add text box and call it textBox1
3. Step 3 - Drag CustomCommands into this form so you have all of the
textboxes from your database. Rename the labels so you don't forget which
text box is linked to which thing (this will help you keep things organized).

Chapter 9

9. User Interface

9.1. Main Form

Code for the main window and voice-driven movement between windows
Option Strict On

Imports System.Speech.Recognition ' Add a reference to the System.Speech assembly
Imports System.Speech.Recognition.SrgsGrammar ' Adding this is unnecessary on my PC
Imports System.Runtime.InteropServices 'For the monitor command
Imports System.Speech
Imports System.Drawing.Drawing2D
Imports System.ComponentModel
Imports DMSoft
Public Class Form5

    Dim a As New Speech.Synthesis.SpeechSynthesizer
    Private WithEvents OutputListBox As New ListBox With {.Dock = DockStyle.Fill, .IntegralHeight = False, .ForeColor = Color.AntiqueWhite, .BackColor = Color.Green}
    Private WithEvents SpeechEngine As New System.Speech.Recognition.SpeechRecognitionEngine(System.Globalization.CultureInfo.GetCultureInfo("en-us"))
    Dim tms As Integer = 0
    Dim st As String
    Dim WithEvents recognizer As SpeechRecognitionEngine

    Dim Neuro As New Speech.Synthesis.SpeechSynthesizer

    Public SkinOb As DMSoft.SkinCrafter

    Private Sub Form5_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        Me.CenterToScreen()
        a.SpeakAsync("my name is report")
        a.SpeakAsync("welcome, tell your computer what you want")
        ' The code in CODE SECTION has to be called before the InitializeComponent() function
        ' --------- Begin of CODE SECTION ----------- '
        DMSoft.SkinCrafter.Init()
        SkinOb = New DMSoft.SkinCrafter
        'These function parameters are used for the SkinCrafter DEMO
        SkinOb.InitLicenKeys("SKINCRAFTER", "SKINCRAFTER.COM", "support@skincrafter.com", "DEMOSKINCRAFTERLICENCE")
        SkinOb.InitDecoration(CBool(1))
        ' --------- End of CODE SECTION ---------- '

        SkinOb.LoadSkinFromFile("C:\Users\fathail\Desktop\vb\project\A_67.skf")
        SkinOb.ApplySkin()
        Me.Text = "Speech recognition, by Doc Oc, version:" & My.Application.Info.Version.ToString

        Controls.Add(OutputListBox)
        SpeechEngine.LoadGrammar(New System.Speech.Recognition.DictationGrammar)
        SpeechEngine.SetInputToDefaultAudioDevice()
        SpeechEngine.RecognizeAsync(Speech.Recognition.RecognizeMode.Multiple)

        recognizer = New SpeechRecognitionEngine()
        recognizer.SetInputToDefaultAudioDevice()

    End Sub

    Private Sub Form5_FormClosing(sender As Object, e As EventArgs) Handles Me.FormClosing
        Try
            SpeechEngine.RecognizeAsyncCancel()
            SpeechEngine.Dispose()
        Catch ex As Exception
        End Try
    End Sub

    Private Sub SpeechEngine_SpeechRecognized(sender As Object, e As System.Speech.Recognition.SpeechRecognizedEventArgs) Handles SpeechEngine.SpeechRecognized
        OutputListBox.Items.Add("You said: " & e.Result.Text)
        ' a.SpeakAsync("You said: " & e.Result.Text)
        ' Disabled keyword filter:
        ' If e.Result.Text.ToLower.Contains("add") Or e.Result.Text.ToLower.Contains("her") Or e.Result.Text.ToLower.Contains("hello") Or e.Result.Text.ToLower.Contains("o") Or e.Result.Text.ToLower.Contains("who") Or e.Result.Text.ToLower.Contains("will") Or e.Result.Text.ToLower.Contains("hole") Or e.Result.Text.ToLower.Contains("whole") Or e.Result.Text.ToLower.Contains("hold") Or e.Result.Text.ToLower.Contains("se") Or e.Result.Text.ToLower.Contains("c") Then
        st = e.Result.Text
        Label2.Text = e.Result.Text

        Select Case e.Result.Text
            Case "add"
                Dim MyForm As New Form1
                MyForm.Show()
            Case "view"
                Dim MyForm As New Form2
                MyForm.Show()
            Case "speech"
                Dim MyForm As New Form3
                MyForm.Show()
            Case "close"
                Me.Close()
            Case "maximize"
                Me.WindowState = FormWindowState.Maximized
            Case "minimize"
                Me.WindowState = FormWindowState.Minimized
        End Select
        ' End If
    End Sub

    Private Sub AddCommandsToolStripMenuItem_Click(sender As Object, e As EventArgs) Handles AddCommandsToolStripMenuItem.Click
        Dim MyForm As New Form1
        MyForm.Show()
    End Sub

    Private Sub ViewCommandsToolStripMenuItem_Click(sender As Object, e As EventArgs) Handles ViewCommandsToolStripMenuItem.Click
        Dim MyForm As New Form2
        MyForm.Show()
    End Sub

    Private Sub SpeachRecognationToolStripMenuItem_Click(sender As Object, e As EventArgs) Handles SpeachRecognationToolStripMenuItem.Click
        Dim MyForm As New Form3
        MyForm.Show()
    End Sub

    Private Sub ExitToolStripMenuItem_Click(sender As Object, e As EventArgs) Handles ExitToolStripMenuItem.Click
        Me.Close()
    End Sub

End Class
9.2. Add Commands

Code to add commands to the database:


Imports System.Data.OleDb
Imports System.IO.StreamReader

Imports System.Drawing.Drawing2D
Imports System.ComponentModel
Imports DMSoft

Public Class Form1

    Dim CnString As String = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Users\fathail\Desktop\vb\project\project\db2.accdb;Persist Security Info=False;"

    Dim Conn As New OleDbConnection(CnString)

    Dim DataSet1 As New DataSet
    Dim DataAdapter1 As OleDbDataAdapter
    Dim CMD As New OleDbCommand
    Dim SQLstr As String = "SELECT * FROM CustomCommands"
    Dim sqcomand As OleDbCommand
    Public SkinOb As DMSoft.SkinCrafter

    Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
        Me.CenterToScreen()

        ' The code in CODE SECTION has to be called before the InitializeComponent() function
        ' --------- Begin of CODE SECTION ----------- '
        DMSoft.SkinCrafter.Init()
        SkinOb = New DMSoft.SkinCrafter
        'These function parameters are used for the SkinCrafter DEMO
        SkinOb.InitLicenKeys("SKINCRAFTER", "SKINCRAFTER.COM", "support@skincrafter.com", "DEMOSKINCRAFTERLICENCE")
        SkinOb.InitDecoration(1)
        ' --------- End of CODE SECTION ---------- '

        SkinOb.LoadSkinFromFile("C:\Users\fathail\Desktop\vb\project\A_67.skf")
        SkinOb.ApplySkin()
        Try
            Conn.Open()
            Dim DataAdapter1 As New OleDbDataAdapter(SQLstr, Conn)
            DataAdapter1.Fill(DataSet1, "CustomCommands")

            dataGridView1.DataSource = DataSet1
            dataGridView1.DataMember = "CustomCommands"
            dataGridView1.Refresh()
            Conn.Close()
        Catch e1 As Exception
            Console.WriteLine(e1)
        End Try

    End Sub

    Private Sub btnSave_Click(sender As Object, e As EventArgs) Handles btnSave.Click

        If Trim(text1.Text) <> "" And Trim(text2.Text) <> "" And Trim(text3.Text) <> "" Then
            sqcomand = New OleDbCommand("insert into CustomCommands (CommonField, Command, Result) values ('" & text1.Text & "','" & text2.Text & "','" & text3.Text & "')", Conn)
            Conn.Open()
            sqcomand.ExecuteNonQuery()
            Conn.Close()
            MsgBox("save is done", MsgBoxStyle.Information + MsgBoxStyle.MsgBoxRight, "save done")
        Else
            MsgBox("please enter the field data", MsgBoxStyle.Critical, "wrong data entered")
            Exit Sub
        End If

    End Sub

    Private Sub radioButton1_CheckedChanged(sender As Object, e As EventArgs) Handles radioButton1.CheckedChanged
        If radioButton1.Checked = True Then
            btnOpen.Visible = False
            text2.Text = "http://www."
            text2.Focus()
        End If
    End Sub

    Private Sub radioButton2_CheckedChanged(sender As Object, e As EventArgs) Handles radioButton2.CheckedChanged
        If radioButton2.Checked = True Then
            btnOpen.Visible = True
            text2.Text = ""
        End If
    End Sub

    Private Sub btnOpen_Click(sender As Object, e As EventArgs) Handles btnOpen.Click
        Dim dlg As New OpenFileDialog
        If dlg.ShowDialog() = Windows.Forms.DialogResult.OK Then
            Dim fileName As String
            fileName = dlg.FileName
            MsgBox(fileName)
        End If
    End Sub

    Private Sub btnToListbox_Click(sender As Object, e As EventArgs) Handles btnToListbox.Click
        Dim MyForm As New Form2
        MyForm.Show()
    End Sub

    Private Sub button1_Click(sender As Object, e As EventArgs) Handles button1.Click
        Me.Close()
    End Sub

    Private Sub Button2_Click(sender As Object, e As EventArgs) Handles Button2.Click
        sqcomand = New OleDbCommand("delete from CustomCommands", Conn)
        Conn.Open()
        sqcomand.ExecuteNonQuery()
        Conn.Close()
        MsgBox("delete is done", MsgBoxStyle.Information + MsgBoxStyle.MsgBoxRight, "delete")
    End Sub

    Private Sub Button3_Click(sender As Object, e As EventArgs) Handles Button3.Click
        text1.Text = ""
        text2.Text = "http://www."
        text3.Text = ""
        text1.Focus()
    End Sub
End Class

9.3. View commands
Code view commands:
Imports System.Data.OleDb
Imports System.IO.StreamReader

Public Class Form2

    Dim CnString As String = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Users\fathail\Desktop\vb\project\project\db2.accdb;Persist Security Info=False;"

    Dim Conn As New OleDbConnection(CnString)

    Dim DataSet1 As New DataSet
    Dim DataAdapter1 As OleDbDataAdapter
    Dim CMD As New OleDbCommand

    Private Sub Form2_Load(sender As Object, e As EventArgs) Handles MyBase.Load
        Conn.Open()

        Dim dt1 As New DataTable
        Dim dlebgrab As String = "SELECT * FROM CustomCommands"
        Dim cmd As New OleDbCommand(dlebgrab, Conn)
        Dim adtp As New OleDbDataAdapter(cmd)
        adtp.SelectCommand = cmd
        adtp.Fill(dt1)
        ListBox1.DataSource = dt1
        ListBox1.DisplayMember = "CommonField"
        ListBox2.DataSource = dt1
        ListBox2.DisplayMember = "Command"
        ListBox3.DataSource = dt1
        ListBox3.DisplayMember = "Result"
        Conn.Close() ' release the connection once the DataTable is filled
    End Sub
End Class

45
9.4. Loading the Grammar
Imports System.Speech.Recognition ' Add a reference to the System.Speech assembly
Imports System.Speech.Recognition.SrgsGrammar ' Adding this is unnecessary on my PC
Imports System.Runtime.InteropServices 'For the monitor command
Imports System.Speech

Public Class Form3

    Dim a As New Speech.Synthesis.SpeechSynthesizer
    Private WithEvents OutputListBox As New ListBox With {.Dock = DockStyle.Fill, .IntegralHeight = False, .ForeColor = Color.AntiqueWhite, .BackColor = Color.Green}
    Private WithEvents SpeechEngine As New System.Speech.Recognition.SpeechRecognitionEngine(System.Globalization.CultureInfo.GetCultureInfo("en-us"))
    Dim tms As Integer = 0
    Dim st As String
    Dim WithEvents recognizer As SpeechRecognitionEngine

    Dim Neuro As New Speech.Synthesis.SpeechSynthesizer

    Private Sub Form3_Load(sender As Object, e As EventArgs) Handles MyBase.Load

        Me.Text = "Speech recognition, by Doc Oc, version:" & My.Application.Info.Version.ToString

        Controls.Add(OutputListBox)
        SpeechEngine.LoadGrammar(New System.Speech.Recognition.DictationGrammar)
        SpeechEngine.SetInputToDefaultAudioDevice()
        SpeechEngine.RecognizeAsync(Speech.Recognition.RecognizeMode.Multiple)

        recognizer = New SpeechRecognitionEngine()
        recognizer.SetInputToDefaultAudioDevice()

        ' Voice commands in Command.txt are "Hello mister monkeyboy", "Goodbye" and "What is up".
        Dim ReadLines As New System.IO.StreamReader("C:\Users\fathail\Desktop\vb\project\Command.txt")

        Do Until ReadLines.EndOfStream
            ' Load each line of the file as a one-phrase grammar.
            Dim NewGrammar As New Grammar(New Choices(ReadLines.ReadLine()))
            recognizer.LoadGrammarAsync(NewGrammar)
        Loop

        ReadLines.Close()

        recognizer.RecognizeAsync(RecognizeMode.Multiple)

    End Sub

    Private Sub Form3_FormClosing(sender As Object, e As EventArgs) Handles Me.FormClosing
        Try
            SpeechEngine.RecognizeAsyncCancel()
            SpeechEngine.Dispose()
        Catch ex As Exception
        End Try
    End Sub
    Private Sub SpeechEngine_SpeechRecognized(sender As Object, e As System.Speech.Recognition.SpeechRecognizedEventArgs) Handles SpeechEngine.SpeechRecognized
        OutputListBox.Items.Add("You said: " & e.Result.Text)
        ' a.SpeakAsync("You said: " & e.Result.Text)
        If e.Result.Text.ToLower.Contains("home") Or e.Result.Text.ToLower.Contains("her") Or e.Result.Text.ToLower.Contains("hello") Or e.Result.Text.ToLower.Contains("o") Or e.Result.Text.ToLower.Contains("who") Or e.Result.Text.ToLower.Contains("will") Or e.Result.Text.ToLower.Contains("hole") Or e.Result.Text.ToLower.Contains("whole") Or e.Result.Text.ToLower.Contains("hold") Or e.Result.Text.ToLower.Contains("se") Or e.Result.Text.ToLower.Contains("c") Then
            st = e.Result.Text
            ' Compare in upper case so that every branch can actually match.
            Select Case e.Result.Text.ToUpper

                Case "OPEN FACEBOOK"
                    a.SpeakAsync("now")
                    ' Shell("notepad.exe", AppWinStyle.NormalFocus, False)
                    System.Diagnostics.Process.Start("http://www.facebook.com")

                Case "OPEN GOOGLE"
                    ' Application.Exit()
                    System.Diagnostics.Process.Start("http://www.google.com")
                Case "OPEN YAHOO", "YAHOO"
                    System.Diagnostics.Process.Start("http://www.yahoo.com")
                Case "OPEN PANDORA"
                    System.Diagnostics.Process.Start("http://www.pandora.com")
                Case "OPEN THEPIRATEBAY"
                    System.Diagnostics.Process.Start("http://www.thepiratebay.se")
                Case "SIX"
                    'System.Diagnostics.Process.Start("http://www.youtube.com")
                    OutputListBox.Text = " "

                Case "OPEN NOTEPAD"
                    a.Speak("Running Notepad.")
                    Shell("notepad.exe", AppWinStyle.NormalFocus, False)
                Case "EIGHT"
                    a.Speak("CLOSE Notepad.")
                    ' Application.Exit()
                Case "SHUTDOWN"
                    a.Speak("shutdown")
                    System.Diagnostics.Process.Start("shutdown", "-s")

                Case "RESTART"
                    a.Speak("restart")
                    System.Diagnostics.Process.Start("shutdown", "-r")

                Case "LOG OFF"
                    a.Speak("log off")
                    System.Diagnostics.Process.Start("shutdown", "-l")
                Case "WHAT TIME IS IT"
                    a.Speak(Format(Now, "Short Time"))

                Case "C"
                    a.Speak("which color do you want")
                    Timer1.Enabled = True
                    Timer1.Start()

                Case "HOWS THE WEATHER"
                    System.Diagnostics.Process.Start("https://www.google.com/webhp?sourceid=chrome-instant&ion=1&ie=UTF-8#output=search&sclient=psy-ab&q=weather&oq=&gs_l=&pbx=1&bav=on.2,or.r_cp.r_qf.&bvm=bv.47008514,d.eWU&fp=6c7f8a5fed4db490&biw=1366&bih=643&ion=1&pf=p&pdl=300")
                    a.Speak("Searching for local weather")
                Case "HELLO"
                    a.Speak("Hello sir")
                Case "GOODBYE"
                    a.Speak("Until next time")
                    Me.Close()
                Case "OPEN DISK DRIVE"
                    a.Speak("It is now open")
                    Dim oWMP = CreateObject("WMPlayer.OCX.7")
                    Dim CDROM = oWMP.cdromCollection
                    If CDROM.Count = 2 Then
                        CDROM.Item(1).Eject()
                    End If

            End Select
        End If
    End Sub

    Private Sub recognizer_LoadGrammarCompleted(sender As Object, e As LoadGrammarCompletedEventArgs) Handles recognizer.LoadGrammarCompleted

        Dim grammarName As String = e.Grammar.Name
        Dim grammarLoaded As Boolean = e.Grammar.Loaded

        If e.[Error] IsNot Nothing Then
            ' Add exception handling code here.
            Label1.Text = "LoadGrammar for " & grammarName & " failed with a " & e.[Error].[GetType]().Name & "."
        Else
            Label1.Text = "Grammar " & grammarName & " " & If(grammarLoaded, "Is", "Is Not") & " loaded."
        End If

    End Sub

    Private Sub recognizer_SpeechRecognized(sender As Object, e As SpeechRecognizedEventArgs) Handles recognizer.SpeechRecognized

        Label2.Text = "Grammar " & e.Result.Grammar.Name & " " & e.Result.Text

        Select Case e.Result.Text.ToUpper
            Case "HELLO MISTER MONKEYBOY"
                Neuro.SpeakAsync("Hello Mister Monkeyboy!")
                Process.Start("Notepad")
            Case "GOODBYE"
                Neuro.SpeakAsync("Goodbye!")
                Me.Close()
            Case "WHAT IS UP"
                Neuro.SpeakAsync("Whaas up?")
        End Select

    End Sub

    Private Sub Timer1_Tick(sender As Object, e As EventArgs) Handles Timer1.Tick
        ' The commented-out Select Case below was meant to pick the color from
        ' the last recognized text (st); as written, the handler ends on red.
        ' Select Case st
        '     Case "red" : OutputListBox.BackColor = Color.Red
        '     Case "white" : OutputListBox.BackColor = Color.White
        '     Case "Yellow" : OutputListBox.BackColor = Color.Yellow
        ' End Select
        OutputListBox.BackColor = Color.Red
    End Sub
End Class

9.5. Changing the theme for all windows (forms)

Code for changing the themes:

Option Strict On

Imports System.Speech
Imports System.Speech.Recognition
Imports System.Speech.Recognition.SrgsGrammar

Public Class Form4

    Dim WithEvents reco As New SpeechRecognitionEngine

    Dim synth As New Synthesis.SpeechSynthesizer

    Private Sub Form4_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load

        Try
            Dim gram As New SrgsDocument
            Dim colorRule As New SrgsRule("color")
            ' "blue" is included so that every branch of the Select Case below
            ' can actually be spoken.
            Dim colorsList As New SrgsOneOf("red", "green", "Yellow", "Black", "blue")

            colorRule.Add(colorsList)
            gram.Rules.Add(colorRule)
            gram.Root = colorRule
            reco.LoadGrammarAsync(New Recognition.Grammar(gram))
            reco.SetInputToDefaultAudioDevice()
            reco.RecognizeAsync(RecognizeMode.Multiple)

        Catch s As Exception
            MessageBox.Show(s.Message)
        End Try

    End Sub

    Private Sub reco_SpeechDetected(ByVal sender As Object, ByVal e As System.Speech.Recognition.SpeechDetectedEventArgs) Handles reco.SpeechDetected
        Label1.Text = "Speech Detected"
    End Sub

    Private Sub reco_SpeechRecognized(ByVal sender As Object, ByVal e As System.Speech.Recognition.SpeechRecognizedEventArgs) Handles reco.SpeechRecognized
        Label2.Text = e.Result.Text
        Select Case e.Result.Text
            Case "red"
                SetColor(Color.Red)
            Case "green"
                SetColor(Color.Lime)
            Case "Yellow"
                SetColor(Color.Yellow)
            Case "Black"
                SetColor(Color.Black)
            Case "blue"
                SetColor(Color.Blue)
        End Select
    End Sub

    Private Sub SetColor(ByVal color As System.Drawing.Color)
        reco.RecognizeAsyncCancel()
        reco.RecognizeAsyncStop()
        synth.Speak("setting the back color to " & color.ToString)
        Me.BackColor = color
        reco.RecognizeAsync(RecognizeMode.Multiple)
    End Sub

End Class

10. Report

Technique to extract speech features from speech signals

Speech recognition is highly affected by the type of speech, i.e. isolated words vs. continuous speech. One of the hardest problems in speech recognition is determining when one word ends and the next one begins. To sidestep this problem, most systems force the user to issue single-word-at-a-time commands. Typically, words must be separated by a gap on the order of 300 milliseconds. Since this is unnatural, speech recognition systems that require multi-word commands may require special training on the part of the users. One perspective on this is presented in Biermann et al. (1985).
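The single-word constraint reduces segmentation to endpoint detection: find where the energy rises above a threshold, and treat a sufficiently long stretch of silence (roughly the 300 ms gap mentioned above) as the end of a word. A minimal sketch, with invented frame energies:

```python
def find_words(frames, energy_threshold=0.1, min_gap_frames=30):
    """Locate isolated words by energy: a word ends once the signal stays
    below the threshold for min_gap_frames consecutive frames (with 10 ms
    frames, 30 frames is roughly the 300 ms gap)."""
    words, start, silence = [], None, 0
    for i, energy in enumerate(frames):
        if energy >= energy_threshold:
            if start is None:
                start = i          # word onset
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_gap_frames:
                words.append((start, i - silence + 1))  # word offset
                start, silence = None, 0
    if start is not None:
        words.append((start, len(frames) - silence))
    return words

# Two bursts of energy separated by a long pause:
frames = [0.0] * 10 + [0.8] * 20 + [0.0] * 40 + [0.9] * 15 + [0.0] * 40
print(find_words(frames))  # [(10, 30), (70, 85)]
```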

Various approaches are used by recognition systems to extract the features of speech. The recognition system described here classifies the different features of Gujarati alphabets produced by a speaker. The system has a number of characteristic features. The researcher performed explicit segmentation of the speech signal into phonetic categories. Explicit segmentation makes it possible to use segment duration to discriminate between letters, and to extract features from specific regions of the signal. Finally, speech knowledge is used to design a set of features that works best for Gujarati letters.

Speech separation from noise, given a priori information, can be viewed as a subspace estimation problem. Some conventional speech enhancement methods are spectral subtraction, Wiener filtering, blind signal separation and hidden Markov modeling.
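Of these, spectral subtraction is the simplest to sketch: subtract an estimate of the noise magnitude spectrum from each frame of the noisy spectrum, clamping the result so magnitudes never go negative. The spectra below are invented for illustration:

```python
def spectral_subtraction(noisy_mag, noise_mag, floor=0.01):
    """Classic spectral subtraction: subtract the estimated noise magnitude
    spectrum from every frame, keeping a small spectral floor to avoid
    negative magnitudes (which cause 'musical noise' artifacts)."""
    return [[max(n - d, floor * d) for n, d in zip(frame, noise_mag)]
            for frame in noisy_mag]

# One frame with speech energy in bin 1, plus broadband noise of magnitude 0.5:
noisy = [[0.5, 3.5, 0.5]]
noise = [0.5, 0.5, 0.5]
print(spectral_subtraction(noisy, noise))  # [[0.005, 3.0, 0.005]]
```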

Storage of speech files and their features in traditional flat file format

The process of data storage in traditional flat file format involves two or more types of files. Each prompted utterance is stored in a separate file in any valid audio file format. The stored speech file for each utterance is processed with speech processing tools (i.e. software), the corresponding features for each utterance are extracted, and the processed outcome is stored in another flat file accompanying each utterance file. For the storage of the features we may use several different approaches, as follows:

(1) One may use a separate file for each feature, i.e. one file for the pitch of all the utterances, one file for the frequency of all the utterances, and so on. In each feature file, each row represents a different utterance. The affiliation of each row with its utterance must be determined in advance, and this affiliation remains the same for every feature file. If there are 36 utterances and 10 features, then there are 46 files (36 utterance files and 10 feature files).
(2) All the features (pitch, frequency and so on) for one utterance are stored in one file; for the second utterance all the features are again stored in another file, and so on. In this approach every file is named so that it can be matched with its utterance. If there are 36 utterances, then there are 72 files (36 for the utterances and 36 for the features of the accompanying utterances).
(3) All the features (pitch, frequency and so on) for one utterance are stored in one line of a flat file, separated by either commas or spaces; the second line stores the same features for the second utterance, and so on. In this file format each column represents the same feature across all utterances, and each row represents the different features of one utterance. The affiliation of each feature with a column, and of each utterance with a row, must be determined in advance. In this approach, if there are 36 utterances then there are 37 files.
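Approach (3) can be sketched as a small CSV writer; the feature names and values below are illustrative placeholders, not the project's real feature set:

```python
import csv
import io

# Approach (3): one feature file, one row per utterance, one column per feature.
FEATURES = ["pitch", "frequency", "duration"]  # illustrative feature names

def write_feature_file(utterances):
    """utterances: {utterance_name: {feature_name: value}} -> CSV text with a
    header row, so the feature-to-column affiliation is stored in the file."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["utterance"] + FEATURES)
    for name in sorted(utterances):
        writer.writerow([name] + [utterances[name][f] for f in FEATURES])
    return buf.getvalue()

data = {
    "ka": {"pitch": 118.0, "frequency": 730.0, "duration": 0.21},
    "kha": {"pitch": 122.0, "frequency": 690.0, "duration": 0.25},
}
print(write_feature_file(data))
```

With 36 utterance audio files, this yields exactly one additional file, i.e. the 37 files of approach (3).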

11. Conclusion
This project work on speech recognition started with a brief introduction to the application and to the technology on the computer (desktop applications). The project is able to write text through both the keyboard and the voice, to recognize different notepad commands such as open, save, select, copy, clear and close, and to open different Windows software depending on the voice input.
One challenge is to develop ways in which our knowledge of the speech signal, and of speech production and perception, can be incorporated more effectively into recognition methods. For example, the fact that speakers have different vocal tract lengths could be used to develop more compact models for improved speaker-independent recognition.

12. System limitations
A speech signal is a highly redundant non-stationary signal. These attributes
make this signal very challenging to characterise. It should be possible to
recognize speech directly from the digitized waveform. However, because of the
large variability of the speech signal, it is a good idea to perform some form of
feature extraction that would reduce that variability. Applications that need voice
processing (such as coding, synthesis, recognition) require specific representations
of speech information. For instance, the main requirement for speech recognition is
the extraction of voice features, which may distinguish different phonemes of a
language. From a statistical point of view, this procedure is equivalent to finding a
sufficient statistic for estimating the phonemes. Other information not required for
this purpose, such as the dimensions of the phonatory apparatus (which are speaker
dependent), the speaker's mood, sex, age, dialect inflexions, background noise and
so on, should be disregarded. To decrease vocal message ambiguity, speech is therefore filtered
before it arrives at the automatic recognizer. The filtering procedure can therefore
be considered the first stage of speech analysis. Filtering is performed on
discrete-time, quantized speech signals, so the first step consists of
analog-to-digital conversion. The extraction of the significant features of the
speech signal is then performed.
When captured by a microphone, speech signals are seriously distorted by
background noise and reverberation. Fundamentally speech is made up of
discrete units. The units can be a word, a syllable or a phoneme. Each stored
unit of speech includes details of the characteristics that differentiate it from
the others. Apart from the message content, the speech signal also carries
variability such as speaker characteristics, emotions and background noise.
A speech recognizer must therefore cope with differences in accent, dialect, age,
gender, emotional state, speaking rate and environmental noise. According to Rosen
(1992), the temporal features of speech signals can be partitioned into three
categories, i.e., envelope (2-50 Hz), periodicity (50-500 Hz), and fine structure
(500-10000 Hz). A method of generating feature signals from speech signals
comprises the following steps:
 Receive the speech signals.
 Block the speech signals into frames.
 Form frequency-domain representations of the blocked speech signals.
 Pass these frequency-domain representations through mel-filter banks.
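The framing and mel-filter-bank steps can be sketched in pure Python. The frame length, hop size, and the 2595 * log10(1 + f/700) mel mapping below are common textbook choices, not parameters taken from this project:

```python
import math

def hz_to_mel(f):
    # A common mel-scale mapping (one of several variants in the literature)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def frame_signal(samples, frame_len, hop):
    """Block the waveform into overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular mel filters laid over the FFT bin centre frequencies."""
    lo, hi = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(lo + i * (hi - lo) / (n_filters + 1))
             for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * f / sample_rate) for f in edges]
    fbank = [[0.0] * (n_fft // 2 + 1) for _ in range(n_filters)]
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, centre):          # rising slope of the triangle
            fbank[i][k] = (k - left) / (centre - left)
        for k in range(centre, right):         # falling slope of the triangle
            fbank[i][k] = (right - k) / (right - centre)
    return fbank

# e.g. 25 ms frames with a 10 ms hop at a 16 kHz sampling rate (assumed values)
frames = frame_signal(list(range(400)), frame_len=160, hop=80)
fbank = mel_filterbank(10, 512, 16000)
```

Multiplying a frame's FFT magnitude spectrum by each row of `fbank` then gives one mel-band energy per filter, which is the input to the cepstral analysis discussed below.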
During speech, altering the size and shape of the vocal tract, mostly by moving
the tongue, results in frequency and intensity changes that emphasize some
harmonics and suppress others. The resulting waveform has a series of peaks and
valleys. Each of the peaks is called a formant, and it is the manipulation of formant
frequencies that facilitates the recognition of different vowel sounds. Speech has a
number of features that need to be taken into account. A combination of linear
predictive coding and cepstral recursion analysis is performed on the blocked
speech signals to produce the various features of the signals.
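The LPC-to-cepstrum step mentioned above has a standard recursion, c(n) = a(n) + (1/n) * sum over k of k * c(k) * a(n-k). The sketch below assumes the all-pole convention H(z) = G / (1 - sum of a(k) z^-k) and uses an illustrative single-pole example, not coefficients from this project:

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC coefficients to cepstral coefficients by the standard
    recursion. `a` is indexed so that a[0] = 1.0 and a[1..p] are the
    predictor coefficients of H(z) = G / (1 - sum_k a[k] z^-k)."""
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)          # c[0] unused here (it would hold ln G^2)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]

# Single-pole example: H(z) = 1 / (1 - 0.5 z^-1) has cepstrum c(n) = 0.5**n / n
ceps = lpc_to_cepstrum([1.0, 0.5], 3)
```

The recursion is attractive in practice because it derives the cepstral features directly from the LPC coefficients without any additional Fourier transforms.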

13. Enhancement

Speech enhancement in the past decades has focused on the suppression of
additive background noise. From a signal processing point of view, additive noise is
easier to deal with than convolutive noise or nonlinear disturbances. Moreover, due
to the bursty nature of speech, it is possible to observe the noise by itself during
speech pauses, which can be of great value.
Speech enhancement is a very special case of signal estimation, as speech is
non-stationary and the human ear, the final judge, does not accept a simple
mathematical error criterion. Therefore subjective measurements of intelligibility
and quality are required.

Thus the goal of speech enhancement is to find an optimal estimate (i.e., one
preferred by a human listener) given a noisy measurement. The relative
unimportance of phase for speech quality has given rise to a family of speech
enhancement algorithms based on spectral magnitude estimation. These are
frequency-domain estimators in which an estimate of the clean-speech spectral
magnitude is recombined with the noisy phase before resynthesis with a standard
overlap-add procedure.
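A minimal single-frame sketch of this idea is spectral subtraction: subtract a noise-magnitude estimate from the noisy magnitude, keep the noisy phase, and invert. The naive DFT, frame length and spectral floor below are illustrative choices only; a real enhancer would use an FFT and overlap-add across windowed frames:

```python
import cmath

def dft(x):
    """Naive O(N^2) DFT, standing in for an FFT."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def spectral_subtract(noisy_frame, noise_mag, floor=0.01):
    """Estimate the clean magnitude, then recombine it with the noisy phase."""
    X = dft(noisy_frame)
    cleaned = []
    for Xk, Nk in zip(X, noise_mag):
        mag = max(abs(Xk) - Nk, floor * abs(Xk))   # spectral floor avoids negative magnitudes
        cleaned.append(mag * cmath.exp(1j * cmath.phase(Xk)))
    return idft(cleaned)

# With a zero noise estimate the frame passes through unchanged
frame = [1.0, 2.0, 3.0, 4.0]
out = spectral_subtract(frame, [0.0] * 4, floor=0.0)
```

In a full enhancer the `noise_mag` estimate would be updated during the speech pauses mentioned above, and the per-frame outputs would be recombined by overlap-add.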

A speech file contains a variety of characteristics, and the extraction of all such
characteristics is the key factor in speech processing. It is extremely difficult for
any one piece of software to extract all the characteristics of a speech file. The
researcher has identified nearly 32 different speech processing software packages.

14. Bibliography
[1] “Speech Recognition - The Next Revolution”, 5th edition.
[2] Ksenia Shalonova, “Automatic Speech Recognition”, 7 December 2007.
[3] Source: http://www.cs.bris.ac.uk/Teaching/Resources/COMS12303/lectures/Ksenia_Shalonoa-Speech_Recognition.pdf
[4] L. Rabiner & B. Juang, “Fundamentals of Speech Recognition”, 1993. ISBN: 0130151572.
[5] http://www.abilityhub.com/speech/speech-description.htm
[6] Charu Joshi, “Speech Recognition”. Source: http://www.scribd.com/doc/2586608/speechrecognition.pdf
[7] John Kirriemuir, “Speech Recognition Technologies”.
[8] http://electronics.howstuffworks.com/gadgets/high-tech-gadgets/speechrecognition.htm
