
Developing Speech to Text Messaging System Using Android Platform

Supervised By
Dr. Wint Pa Pa Kyaw
Associate Professor
Candidate
Ma Htet Yi Zaw
3PhDCom - 2
Department of Computer Studies
University of Yangon
6-March-2020
Main Title :

MYANMAR SPEECH TO TEXT SYSTEM ON ANDROID

1PhD Regular Title :

Study of Myanmar Language Acoustics Signal to Strings

1PhD Credit Title :

Modeling Approaches for Myanmar Language Speech Recognizer

2PhD Regular Title :

Compatible Methods and Models for Myanmar Continuous Speech Recognition System

2PhD Credit Title :

Compatible Models on Speech to Text SMS Messaging System

3PhD Regular Title :

Developing the Speech Recognition Models for Myanmar Language

3PhD Credit Title :

Developing Speech to Text Messaging System Using Android Platform

Contents

1. Introduction
2. Data Preparation for Building Models
3. Setting Up the System Environment
4. Training an Acoustic Model
5. Building a Phonetic Dictionary
6. Building a Language Model
7. Conclusion

Introduction

 A speech-to-text messaging system is a mobile application for writing hands-free SMS (Short Message Service) messages.
 It helps smartphone users send messages faster and gives people who are unable to type a way to write their messages.
 To build the Myanmar language speech recognizer, the Sphinx tools were chosen after a study of different tools.
 The data, files, and scripts needed for training must then be studied and prepared.

Limitations of Speech Recognition Models

 Vocabulary Size and Confusability
 Speaker Dependence and Independence
 Isolated, Discontinuous, or Continuous Speech
 Read and Spontaneous Speech
 Real-time Recognition and Recorded Samples

Data Preparation for Building Models

 An acoustic model contains the acoustic properties of each state of a phone.
 A phonetic dictionary contains a mapping from words to phones.
 A language model is used to restrict the word search.

These three components are combined in an engine to recognize speech.

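One standard way to state how the three components combine is the decision rule below. It is a sketch added for illustration and is not taken from the slides; the symbols W (a candidate word sequence) and O (the observed acoustic features) are notation assumed here.

```latex
\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} P(O \mid W)\, P(W)
```

In this reading, P(O | W) is scored by the acoustic model, with the phonetic dictionary expanding each word of W into phones, and P(W) is supplied by the language model.
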
Data Preparation for Building Models

 The general data items of a typical speech recognizer are:
• Text Preparation
• Speech Corpus
• Transcription File
• Pronunciation Dictionary
• Language Model
• Phone File

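Data Preparation for Building Models

 Text Preparation
• A list of phrases likely to be used in messaging is selected:

မင်္ဂလာပါ (Hello)
သူငယ်ချင်း (Friend)
နေကောင်းရဲ့လား (How are you?)
ဒီဟာဘယ်လောက်ကျလဲ (How much is this?)
ကားဂိတ်က ဘယ်မှာလဲ (Where is the bus station?)
ကျေးဇူးတင်ပါတယ် (Thank you)
ထမင်း စားပြီးပြီလား (Have you eaten?)
မနက်ဖြန် တွေ့ရအောင် (Let's meet tomorrow)
ဒီစာရရင် ဖုန်းပြန်ဆက်ပါ (Call me back when you get this message)
အခု မအားလို့ နောက်မှ ဖုန်းပြန်ခေါ်လိုက်မယ် (I'm busy now, I'll call you back later)
အခု ဘယ်မှာလဲ (Where are you now?)
အစည်းအဝေးခန်းမှာ (In the meeting room)
ဖုန်းပြန်ခေါ်လိုက်မယ် (I'll call you back)
အိပ်ပြီလား (Have you gone to bed?)
မနက်တွေ့မယ်လေ (See you in the morning)
လဘက်ရည်ဆိုင်သွားရအောင် (Let's go to the tea shop)
ဒီနေ့ကျောင်းလာမှာလား (Are you coming to school today?)
လိပ်စာလေးပို့ပေးပါ (Please send the address)
ခဏနေရောက်မယ် (I'll be there shortly)
မုန့်ဝယ်ခဲ့ပါ (Please buy some snacks)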

Data Preparation for Building Models

 Speech Corpus
• One approach is to gather speech that has already been recorded and transcribe it into text manually.
• The other is to create the text corpus first and then record speakers reading the collected text.
• The latter method is used here to collect daily conversational data.
• Speakers: 4 male and 4 female.
• Recordings: 20 sentences of general messaging dialogue.

Data Preparation for Building Models

 Transcription File
• Gives the words spoken in each recording.
• Contains one line for each audio file used in training.
• Each line contains the text of the words spoken, followed by the file name without its extension (such as .wav).
• The speaker's words are written exactly as they were recorded, with silence tags (start tag <s>, end tag </s>), followed by the file ID that represents the utterance. For example:
သူငယ်ချင်းရေ<s> ငါတို့ </s>,<s> မနက်ဖြန် </s>, ဆုံရအောင်

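The layout described above can be sketched in Python as follows. This is a minimal illustration, not the author's script: it assumes the usual CMUSphinx training convention of a file-ID list plus a transcription file whose lines read `<s> words </s> (utterance_name)`, and the function name and the `speaker_1/utt_01` naming scheme are hypothetical.

```python
from pathlib import PurePosixPath


def build_training_lists(utterances):
    """Return (fileids, transcription) lines for (file_id, text) pairs.

    file_id is the path of a recording relative to the audio folder,
    without its extension, e.g. "speaker_1/utt_01" (hypothetical naming).
    """
    fileids = []
    transcription = []
    for file_id, text in utterances:
        # The transcription refers to the utterance by its bare name.
        utterance_name = PurePosixPath(file_id).name
        words = " ".join(text.split())  # collapse stray whitespace
        fileids.append(file_id)
        transcription.append(f"<s> {words} </s> ({utterance_name})")
    return fileids, transcription


if __name__ == "__main__":
    ids, lines = build_training_lists([
        ("speaker_1/utt_01", "မင်္ဂလာပါ"),
        ("speaker_1/utt_02", "ကျေးဇူးတင်ပါတယ်"),
    ])
    print("\n".join(ids))
    print("\n".join(lines))
```

Keeping the two lists in one function makes it harder for the file-ID list and the transcription to drift out of step, since every recording produces exactly one line in each.
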
Data Preparation for Building Models
 Pronunciation Dictionary
• Maps words to their pronunciations.
• A dictionary can also contain alternative pronunciations: a single word may have more than one.
အပြုံး a pjoun:
တောင် ပြုံး taun pjoun: => taun bjoun:

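A small Python sketch of how alternative pronunciations might be written out is given below. It assumes the CMUSphinx dictionary convention in which the second and later variants of a word are marked `word(2)`, `word(3)`, and so on; the function name is hypothetical, and the romanized syllables from the slide stand in for phone symbols.

```python
def dictionary_lines(pronunciations):
    """Format {word: [phones, phones, ...]} as dictionary lines.

    The first pronunciation of a word is written as "word PH PH";
    later ones are written as "word(2) PH PH", "word(3) PH PH", ...
    """
    lines = []
    for word in sorted(pronunciations):
        for index, phones in enumerate(pronunciations[word], start=1):
            key = word if index == 1 else f"{word}({index})"
            lines.append(f"{key} {' '.join(phones)}")
    return lines


if __name__ == "__main__":
    # Example entry from the slide, written without internal spaces
    # because a dictionary key cannot contain whitespace.
    sample = {"တောင်ပြုံး": [["taun", "pjoun:"], ["taun", "bjoun:"]]}
    print("\n".join(dictionary_lines(sample)))
```
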
Data Preparation for Building Models

 Language Model
• A language model is used to restrict the word search.
• Sample texts are collected for language model training.
• The difficulty with such a collection is converting existing documents (such as PDFs, web pages, and scans) into spoken text form: tags and headings must be removed, and numbers and abbreviations must be expanded to their spoken forms.

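The clean-up described above might look like the following Python sketch. It is only an illustration under simplifying assumptions: the abbreviation and digit tables are placeholders supplied by the caller, the function name is hypothetical, and digits are read one at a time, which is not the proper spoken form of larger numbers.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # markup such as <p> or </h1>


def normalize_for_lm(raw, abbreviations, digit_names):
    """Turn raw document text into a rough spoken-form line."""
    text = TAG_RE.sub(" ", raw)
    # Expand abbreviations using a caller-supplied table.
    for short, spoken in abbreviations.items():
        text = text.replace(short, spoken)
    # Read digits one by one; unknown digits are left untouched.
    text = re.sub(
        r"\d",
        lambda m: f" {digit_names.get(m.group(), m.group())} ",
        text,
    )
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_for_lm(
        "<p>Call Dr. Aung at 9</p>",
        {"Dr.": "doctor"},
        {"9": "nine"},
    ))
```

For Myanmar text the same structure would apply, with tables of Myanmar abbreviations and number words in place of the English placeholders.
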
Setting Up the System Environment

 Hardware Requirements
• Android phone running version 2.2 or later.
• Processor of at least 500 MHz.
• At least 170 MB of RAM.
• SD card of at least 512 MB.
• USB debugging enabled on the device.
 Software Requirements
• Android mobile operating system, version 2.2 or later.
• IDE tools: Eclipse or Android Studio.
• User interface: XML.
• Code behind: Java and XML.
• Internet: Yes.

CMUSphinx Toolkit

 State-of-the-art speech recognition algorithms for efficient speech recognition.
 The CMUSphinx toolkit is well suited to practical application development.
 Support for several languages, such as US English, UK English, French, and German, and the ability to build models for low-resourced languages such as Myanmar.
 A wide range of tools for many speech recognition purposes.

CMUSphinx Toolkit

 CMUSphinx contains a number of packages for different tasks and applications.
• Pocketsphinx — lightweight recognizer library written in C.
• Sphinxbase — support library required by Pocketsphinx
• Sphinx4 — adjustable, modifiable recognizer written in Java
• Sphinxtrain — acoustic model training tools

Training an Acoustic Model

 The acoustic model is trained by analyzing large corpora of labeled Myanmar language speech.
 The Sphinxtrain tool is chosen to build the acoustic model for a new language, Myanmar.
 The trainer learns the parameters of the sound-unit models from a set of sample speech signals.
 This set is called a training database.
 The database contains the information required to extract statistics from the speech in the form of the acoustic model.

Example Sentences for the Acoustic Model

မင်္ဂလာပါ
သူငယ်ချင်း
နေကောင်းရဲ့လား
ဒီဟာဘယ်လောက်ကျလဲ
ကားဂိတ်က ဘယ်မှာလဲ
ကျေးဇူးတင်ပါတယ်
ထမင်း စားပြီးပြီလား
မနက်ဖြန် တွေ့ရအောင်

 The speech corpus is created by recording speech of the texts above.

Building a Phonetic Dictionary

 A phonetic dictionary provides the system with a mapping from vocabulary words to sequences of phonemes.
 A dictionary can also contain alternative pronunciations.
က/ka
က/ga
စား/sa:
က စား/ga za:

 The dictionary should contain every word to be recognized; otherwise the recognizer will not be able to recognize it.
 The recognizer looks for a word in both the dictionary and the language model.
 A word that is missing from the language model will not be recognized, even if it is present in the dictionary.

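Because every training word has to appear in the dictionary, a coverage check along the following lines can be useful before training. This is a hedged sketch, not part of the presented system: it assumes transcription lines of the form `<s> words </s> (file_id)` and dictionary lines of the form `word PH PH` with optional `word(2)` variants, and the function name is hypothetical.

```python
import re


def missing_words(transcription_lines, dictionary_lines):
    """Return the sorted words used in transcripts but absent from the dictionary."""
    known = set()
    for line in dictionary_lines:
        parts = line.split()
        if parts:
            # Treat "word(2)" as another pronunciation of "word".
            known.add(re.sub(r"\(\d+\)$", "", parts[0]))

    missing = set()
    for line in transcription_lines:
        # Drop the trailing "(file_id)" before splitting into words.
        text = re.sub(r"\([^)]*\)\s*$", "", line)
        for word in text.split():
            if word not in ("<s>", "</s>") and word not in known:
                missing.add(word)
    return sorted(missing)


if __name__ == "__main__":
    dictionary = ["hello HH AH L OW", "hello(2) HH EH L OW", "world W ER L D"]
    transcripts = ["<s> hello world </s> (f1)", "<s> hello there </s> (f2)"]
    print(missing_words(transcripts, dictionary))
```

An empty result means the dictionary covers the transcripts; any word listed needs a pronunciation entry before training.
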
Example of Phonetic Dictionary (Lexicon)

Part of the phonetic dictionary:

က ka.
ကာ ka
ကား ka:
ကိ ki
ကီ ki
ကောင်း kaun:
လည်းကောင်း le` kaun: => la gaun:
တောင် ပြုံး taun pjoun: => taun bjoun:
နတ် က တော် na’ ka. to => na’ ga do

Building a Language Model

 Language models help guide and constrain the search among alternative word hypotheses during recognition.
 The language model is an important component of the configuration: it tells the decoder which sequences of words are possible to recognize.
 There are several types of models:
• keyword lists
• grammars
• statistical language models
 They have different capabilities and performance properties.

Statistical Language Models
 Statistical language models contain probabilities of words and word combinations.
 These probabilities are estimated from sample data, which automatically gives the model some flexibility.
 Every combination of words from the vocabulary is possible, although the probability of each combination varies.
 They also require far less engineering effort than grammars.
 A language model can be stored and loaded in three formats: text ARPA format, binary BIN format, and binary DMP format.

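To make the idea of estimated probabilities concrete, here is a minimal bigram sketch in Python. It is an illustration only: add-one smoothing is chosen here for simplicity so that unseen word pairs still receive a small probability, whereas real toolkits typically use more refined discounting, and the function name is hypothetical.

```python
from collections import Counter


def train_bigram(sentences):
    """Return prob(prev, word) estimated with add-one smoothing."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocabulary = {"</s>"}
    for sentence in sentences:
        words = sentence.split()
        vocabulary.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        context_counts.update(tokens[:-1])          # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))

    def prob(prev, word):
        # Add one to every count so no combination has zero probability.
        return (bigram_counts[(prev, word)] + 1) / (
            context_counts[prev] + len(vocabulary)
        )

    return prob, vocabulary


if __name__ == "__main__":
    prob, vocab = train_bigram(["a b", "a c"])
    print(prob("a", "b"), prob("a", "a"))
```

A word pair seen in the sample data gets a higher probability than an unseen one, yet the unseen pair is never ruled out, which matches the flexibility described on the slide.
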
How to Build a Statistical Language Model
 Prepare a reference text that will be used to generate the language model: a set of sentences, each bounded by the start and end markers <s> and </s>. More data generally produces a better language model.
 Generate the vocabulary file from the reference text.
 Edit the vocabulary file to remove unwanted words (numbers, misspellings, names).
 Generate the language model from the reference text and the vocabulary.
 Finally, convert the model to a binary format for faster loading.

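The first three steps can be sketched in Python as below. In practice the language modeling toolkit provides its own utilities for these steps; this sketch only illustrates what they do, and the function names, the `min_count` threshold, and the `exclude` list are assumptions made for the example.

```python
from collections import Counter


def mark_sentences(sentences):
    """Wrap each non-empty sentence in the <s> ... </s> markers."""
    return [f"<s> {' '.join(s.split())} </s>" for s in sentences if s.strip()]


def build_vocabulary(sentences, min_count=1, exclude=()):
    """Return the sorted vocabulary of a reference text.

    Words seen fewer than min_count times, words listed in exclude
    (for example misspellings or names) and pure numbers are dropped.
    """
    counts = Counter(word for s in sentences for word in s.split())
    banned = set(exclude)
    return sorted(
        word
        for word, count in counts.items()
        if count >= min_count and word not in banned and not word.isdigit()
    )


if __name__ == "__main__":
    reference = ["a b 12", "a c"]
    print(mark_sentences(reference))
    print(build_vocabulary(reference, exclude=["c"]))
```
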
How to Build a Statistical Language Model
 There are many approaches and tools for creating a statistical language model.
 The CMU language modeling toolkit will be used to create the n-gram language model.
 The output language model file is in ARPA or binary format.

Overview of Speech Recognizer

Selecting Next Set of States

 Uses the grammar to select the next set of possible words
 Uses the dictionary to collect pronunciations for those words
 Uses the acoustic model to collect HMMs (hidden Markov models) for each pronunciation
 Uses the transition probabilities in the HMMs to select the next set of states

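The last step, choosing states from HMM transition probabilities, is commonly done with a Viterbi search. The Python sketch below shows the idea on a toy model and is not the decoder used by the toolkit: it assumes discrete observation symbols and hand-written probability tables, whereas a real acoustic model scores continuous feature vectors, and all names are hypothetical.

```python
import math


def _log(p):
    """Logarithm that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_state_path, log_probability) for the observations."""
    # best[s] = (log score of the best path ending in s, that path)
    best = {
        s: (_log(start_p[s]) + _log(emit_p[s][observations[0]]), [s])
        for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor whose path plus transition scores highest.
            score, path = max(
                ((best[p][0] + _log(trans_p[p][s]), best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            new_best[s] = (score + _log(emit_p[s][obs]), path + [s])
        best = new_best
    score, path = max(best.values(), key=lambda item: item[0])
    return path, score


if __name__ == "__main__":
    states = ["s1", "s2"]  # a two-state left-to-right toy model
    start_p = {"s1": 1.0, "s2": 0.0}
    trans_p = {"s1": {"s1": 0.5, "s2": 0.5}, "s2": {"s1": 0.0, "s2": 1.0}}
    emit_p = {"s1": {"x": 0.9, "y": 0.1}, "s2": {"x": 0.1, "y": 0.9}}
    print(viterbi(["x", "x", "y"], states, start_p, trans_p, emit_p))
```

Working in log probabilities avoids numerical underflow on long utterances, and zero-probability transitions simply drop out of the search.
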
Conclusion

 The work presented above is a step toward the development of a Myanmar speech-to-text system on the Android platform.
 By incorporating three features (large vocabulary, continuous speech capability, and speaker independence), a Myanmar speech-to-text system on Android will be developed.
 Speech recognition systems require fast machines with ample storage and memory for complex recognition tasks.
 In addition, the size of the vocabulary strongly affects the accuracy of the recognition.

QUESTIONS & ANSWERS

THANK YOU SO MUCH
