Andrew Nathan Lee MR Tp059923 Fyp Apu3f2209 Csda
By
TP059923
APU3F2209 CS(DA)
May-2023
Acknowledgement
First and foremost, I would like to express my deepest gratitude to my FYP supervisor, Mr.
Amardeep, for providing step-by-step guidance throughout FYP semesters 1 and 2. I really
appreciate his efforts in reviewing my documentation, ranging from the Project Proposal Form
(PPF) and Project Specification Form (PSF) to the Investigation Report (IR), and in providing
relevant suggestions on how to improve my work. He also asked me several in-depth questions
that prompted me to view my project from an outsider's point of view. In addition, he kept track
of my progress from time to time to ensure that I was on the right track.
Next, I would like to extend my sincere thanks to our FYP lecturer, Mr. Dhason Padmakumar, for
briefing us on the series of tasks required in the FYP module. During lecture classes, he
provided detailed explanations of the guidelines, requirements and formatting of each document,
and even gave us sample documents from a previous batch of students as a reference. Mr. Dhason
also highlighted tips on how to score a good grade in FYP, which helped me a great deal with my
documentation.
I am also grateful to the lecturers who taught me during my three-year degree at APU. Thanks to
the lecturers who conducted modules such as Introduction to Database (IDB), Data Mining and
Predictive Modelling (DMPM), Research Methods for Computing and Technology (RMCT) and Text
Analytics and Sentiment Analysis (TXSA), I attained sufficient knowledge in the corresponding
domains to apply in the implementation of this project.
Last but not least, I am extremely thankful to my parents and friends. Without their
unconditional support and assistance throughout this journey, I would not have completed my
project on time. I would also like to take this opportunity to thank all participants who
provided constructive feedback in the questionnaire survey. I promise to use this feedback to
improve the accuracy and functionality of my system so that it satisfies all requirements from
the users' end.
Table of Contents
Acknowledgement.......................................................................................................................................2
Table of Contents........................................................................................................................................3
List of Figures.............................................................................................................................................7
List of Tables.............................................................................................................................................14
CHAPTER 1: INTRODUCTION TO THE STUDY.................................................................................16
1.1 Background of the Project...............................................................................................................16
1.2 Problem Context..............................................................................................................................19
1.3 Rationale..........................................................................................................................................21
1.4 Potential Benefits.............................................................................................................................22
1.4.1 Tangible Benefits......................................................................................................................22
1.4.2 Intangible Benefits....................................................................................................................23
1.5 Target Users....................................................................................................................................23
1.6 Scopes & Objectives........................................................................................................................24
1.6.1 Aim...........................................................................................................................................24
1.6.2 Objectives.................................................................................................................................24
1.7 Overview of this Investigation Report.............................................................................................26
1.8 Project Plan......................................................................................................................................29
CHAPTER 2: LITERATURE REVIEW...................................................................................................30
2.1 Introduction.....................................................................................................................................30
2.2 Domain Research.............................................................................................................................31
2.2.1 Classification of ASR...............................................................................................................31
2.2.2 General Overview of ASR System...........................................................................................33
2.2.3 Machine Learning Models........................................................................................................38
2.2.4 Performance Evaluation of ASR...............................................................................................47
2.3 Similar Systems...............................................................................................................................48
2.3.1 Automatic Speech Recognition for Bangla Digits....................................................................48
2.3.2 Dynamic Time Warping (DTW) Based Speech Recognition System for Isolated Sinhala Words
...........................................................................................................................................................50
2.3.3 Convolutional Neural Network (CNN) based Speech Recognition System for Punjabi
Language...........................................................................................................................................51
List of Figures
Figure 38: Word Recognition Rate (WRR) for 4 speakers in 3 respective sessions (Priyadarshani et al.,
2012).........................................................................................................................................................51
Figure 39: Parameter setup for CNN based Speech Recognition System for Punjabi language (Dua et al.,
2022).........................................................................................................................................................52
Figure 40: Framework for CNN based Speech Recognition System for Tonal Speech Signals (Dua et al.,
2022).........................................................................................................................................................53
Figure 41: Word Recognition Rate (WRR) of different speakers (Dua et al., 2022)...................................53
Figure 42: Overall Word Recognition Rate (WRR) compared to other speech recognition systems (Dua et
al., 2022)....................................................................................................................................................54
Figure 43: Overview of Operating System (Understanding Operating Systems - University of Wollongong
– UOW, 2022)............................................................................................................................................67
Figure 44: Overview of CRISP-DM methodology (Wirth, R., & Hipp, J., 2000, April)..................................74
Figure 45: Contents of LJ speech dataset..................................................................................................79
Figure 46: list of "wav" audio files in "wavs" folder...................................................................................79
Figure 47: Code and output for data extraction........................................................................................81
Figure 48: Code and output for adding column names.............................................................................81
Figure 49: Code and output for dimension of DataFrame.........................................................................81
Figure 50: Code and output for information of each attribute..................................................................82
Figure 51: Code and output for total and unique word count in "Normalized Transcript" column...........82
Figure 52: Code and output to display frequencies and number of samples for an audio file..................83
Figure 53: Code and output for dropping rows that contain empty values...............................................84
Figure 54: Code and output for dropping "Transcript" column.................................................................84
Figure 55: Code and output for dropping rows that contain non-ASCII characters in "Normalized
Transcript" column....................................................................................................................................85
Figure 56: Code and Output for computing word frequency distribution.................................................86
Figure 57: Code for data sampling.............................................................................................................87
Figure 58: Output for data sampling..........................................................................................................87
Figure 59: Code and output for total and unique word count in "Normalized Transcript" column after
Data Sampling............................................................................................................................................88
Figure 60: Code Snippet to Create list of dictionary from DataFrame Object............................................89
Figure 61: Code Snippet to generate Spectrogram for audio file..............................................................89
Figure 62: Code Snippet for Vocabulary Set with Encoder and Decoder...................................................90
Figure 63: Code Snippet to generate Transcript Label...............................................................................90
Figure 64: Code Snippet to Merge Spectrogram and Transcript Label......................................................91
Figure 65: Code Snippet to Construct Keras Dataset Object......................................................................91
Figure 66: Code Snippet to plot bar chart for token frequency dictionary................................................93
Figure 67: Bar Chart for common tokens versus frequency......................................................................93
Figure 68: Code Snippet to plot histogram for individual transcript's token count versus frequency.......94
Figure 69: Histogram for individual transcript's token count versus frequency........................................94
Figure 70: Code Snippet to plot histogram for individual transcript's token count versus frequency after
data sampling............................................................................................................................................95
Figure 71: Histogram for individual transcript's token count versus frequency after data sampling..........95
Figure 72: Code Snippet to plot waveform, MFCC and Mel Coefficients...................................................96
Figure 73: Raw Waveform for Amplitude Against Time for Audio File......................................................98
Figure 74: Heatmap for MFCC Coefficients against Windows for Audio File.............................................98
Figure 75: Heatmap for Mel Coefficients against Windows for Audio File................................................99
Figure 76: Code Snippet to create function for data partitioning............................................................100
Figure 77: Code Snippet to perform data partitioning.............................................................................100
Figure 78: Code Snippet for CTC loss function.........................................................................................101
Figure 79: Code Snippet for constructing a CNN-GRU model..................................................................102
Figure 80: Summary of CNN-GRU model.................................................................................................104
Figure 81: Code Snippet to train CNN-GRU model..................................................................................105
Figure 82: Code Snippet to construct regularized CNN-GRU model........................................................106
Figure 83: Summary of regularized CNN-GRU model..............................................................................107
Figure 84: Code Snippet to train regularized CNN-GRU model................................................................108
Figure 85: Code Snippet to construct regularized GRU model................................................................109
Figure 86: Summary of regularized GRU model.......................................................................................111
Figure 87: Code Snippet for Training regularized GRU model.................................................................112
Figure 88: Code Snippet to construct regularized CNN-LSTM model.......................................................113
Figure 89: Summary of regularized CNN-LSTM model.............................................................................115
Figure 90: Code Snippet for training regularized CNN-LSTM model........................................................116
Figure 91: Code Snippet for CTC Decoding..............................................................................................119
Figure 92: Code Snippet to generate metrics for Validation and Testing................................................120
Figure 93: Code Snippet to visualize Train and Validation Loss for CNN-GRU model..............................121
Figure 94: Relationship between Train and Validation Loss against Epoch for CNN-GRU model............121
Figure 95: Code Snippet to visualize Train and Validation Loss for regularized CNN-GRU model............122
Figure 96: Relationship between Train and Validation Loss against Epoch for regularized CNN-GRU model
................................................................................................................................................................ 123
Figure 97: Code Snippet to visualize Train and Validation Loss for regularized GRU model....................124
Figure 98: Relationship between Train and Validation Loss against Epoch for regularized GRU model....124
Figure 99: Code Snippet to visualize Train and Validation Loss for regularized CNN-LSTM model..........125
Figure 100: Relationship between Train and Validation Loss against Epoch for regularized CNN-LSTM
model......................................................................................................................................................125
Figure 101: Code Snippet to Generate Line Graph for WER Over Epoch for CNN-GRU model................127
Figure 102: Line Graph for WER Over Epoch for CNN-GRU model..........................................................127
Figure 103: Code Snippet to Generate Line Graph for WER Over Epoch for regularized CNN-GRU model
................................................................................................................................................................ 128
Figure 104: Line Graph for WER Over Epoch for regularized CNN-GRU model........................................128
Figure 105: Code Snippet to Generate Line Graph for WER Over Epoch for regularized GRU model......129
Figure 106: Line Graph for WER Over Epoch for regularized GRU model................................................129
Figure 107: Code Snippet to Generate Line Graph for WER Over Epoch for regularized CNN-LSTM model
................................................................................................................................................................ 130
Figure 108: Line Graph for WER Over Epoch for regularized CNN-LSTM model......................................130
Figure 109: Code Snippet to Generate Line Graph for CER Over Epoch for regularized CNN-GRU model
................................................................................................................................................................ 132
Figure 110: Line Graph for CER Over Epoch for regularized CNN-GRU model.........................................132
Figure 111: Code Snippet to Generate Line Graph for CER Over Epoch for regularized GRU model.......133
Figure 112: Line Graph for CER Over Epoch for regularized GRU model.................................................133
Figure 113: Code Snippet to Generate Line Graph for CER Over Epoch for regularized CNN-LSTM model
................................................................................................................................................................ 134
Figure 114: Line Graph for CER Over Epoch for regularized CNN-LSTM model.......................................134
Figure 115: Code Snippet to find minimum point for CER on Validation Set of regularized CNN-GRU and
CNN-LSTM model....................................................................................................................................135
Figure 116: Code Snippet to Generate Metrics for Unregularized CNN-GRU model’s Evaluation on Testing
Set...........................................................................................................................................................135
Figure 117: Code Snippet to Generate Metrics for Regularized CNN-GRU model’s Evaluation on Testing
Set...........................................................................................................................................................136
Figure 118: Code Snippet to Generate Metrics for Regularized GRU model’s Evaluation on Testing Set 137
Figure 119: Code Snippet to Generate Metrics for Regularized CNN-LSTM model’s Evaluation on Testing
Set...........................................................................................................................................................138
Figure 120: Code Snippet to Generate Line Graph for WER Over Epoch for Unregularized CNN-GRU
model on Testing Set...............................................................................................................................139
Figure 121: Line Graph for WER Over Epoch for Unregularized CNN-GRU model on Testing Set............139
Figure 122: Code Snippet to Generate Line Graph for WER Over Epoch for regularized CNN-GRU model
on Testing Set..........................................................................................................................................140
Figure 123: Line Graph for WER Over Epoch for regularized CNN-GRU model on Testing Set................140
Figure 124: Code Snippet to Generate Line Graph for WER Over Epoch for regularized GRU model on
Testing Set...............................................................................................................................................141
Figure 125: Line Graph for WER Over Epoch for regularized GRU model on Testing Set.........................141
Figure 126: Code Snippet to Generate Line Graph for WER Over Epoch for regularized CNN-LSTM model
on Testing Set..........................................................................................................................................142
Figure 127: Line Graph for WER Over Epoch for regularized CNN-LSTM model on Testing Set...............142
Figure 128: Code Snippet to Generate Line Graph for CER Over Epoch for regularized CNN-GRU model on
Testing Set...............................................................................................................................................143
Figure 129: Line Graph for CER Over Epoch for regularized CNN-GRU model on Testing Set..................143
Figure 130: Code Snippet to Generate Line Graph for CER Over Epoch for regularized GRU model on
Testing Set...............................................................................................................................................144
Figure 131: Line Graph for CER Over Epoch for regularized GRU model on Testing Set..........................144
Figure 132: Code Snippet to Generate Line Graph for CER Over Epoch for regularized CNN-LSTM model
on Testing Set..........................................................................................................................................145
Figure 133: Line Graph for CER Over Epoch for regularized CNN-LSTM model on Testing Set................145
Figure 134: Summary of Evaluation Metrics for 4 Models.......................................................................147
Figure 135: Use Case Diagram.................................................................................................................151
Figure 136: Activity Diagram for register function...................................................................................161
Figure 137: Activity Diagram for login function.......................................................................................162
Figure 178: Property Viewer Interface when User Clicks the Expander for Displaying Mel Filterbank
Heatmap..................................................................................................................................................205
Figure 179: Interface for Resampling Page..............................................................................................206
Figure 180: Resampling Interface when User Checks Audio File's Frequency.........................................207
Figure 181: Resampling Interface when User Enters A Sample Rate Lower than 8,000 Hz.....................207
Figure 182: Resampling Interface when User Enters a Sample Rate Higher than 48,000 Hz...................208
Figure 183: Resampling Interface when User Enters a Sample Rate Between 8,000 to 48,000 Hz.........208
Figure 184: Resampling Interface when User Clicks 'Resample Audio File' Button..................................209
Figure 185: Interface for Transcript Page................................................................................................210
Figure 186: Interface when User Clicks 'Generate Transcript' Button.....................................................211
Figure 187: Interface when User Clicks 'Download Transcript' Button....................................................211
Figure 188: Interface when User Selects Language and Clicks 'Translate Transcript' Button..................212
Figure 189: Interface when User Clicks 'Download Translated Transcript' Button..................................212
Figure 190: Code Snippet for Home Page................................................................................................213
Figure 191: Code Snippet for Registration Page......................................................................................214
Figure 192: Code Snippet for Login Page.................................................................................................216
Figure 193: Code Snippet for Password Changing Page...........................................................................217
Figure 194: Code Snippet for File Uploader Page....................................................................................219
Figure 195: Code Snippet for Viewing Property of Audio File..................................................................221
Figure 196: Code Snippet for Resampling Audio File...............................................................................223
Figure 197: Code Snippet for Generating Transcript (1)..........................................................................225
Figure 198: Code Snippet for Generating Transcript (2)..........................................................................226
Figure 199: FYP TURNITIN Report (1).......................................................................................................260
Figure 200: FYP TURNITIN Report (2).......................................................................................................261
Figure 201: Library Form.........................................................................................................................262
Figure 202: Confidentiality Document.....................................................................................................263
Figure 203: FYP Poster.............................................................................................................................264
Figure 204: Project Log Sheet Semester 1 (1)..........................................................................................265
Figure 205: Project Log Sheet Semester 1 (2)..........................................................................................266
Figure 206: Project Log Sheet Semester 1 (3)..........................................................................................267
Figure 207: Project Log Sheet Semester 2 (1)..........................................................................................268
Figure 208: Project Log Sheet Semester 2 (2)..........................................................................................269
Figure 209: Project Log Sheet Semester 2 (3)..........................................................................................270
Figure 210: PPF (1)...................................................................................................................................271
Figure 211: PPF (2)...................................................................................................................................272
Figure 212: PPF (3)...................................................................................................................................273
Figure 213: PPF (4)...................................................................................................................................274
Figure 214: PPF (5)...................................................................................................................................275
Figure 215: PPF (6)...................................................................................................................................276
Figure 216: PPF (7)...................................................................................................................................277
Figure 217: PPF (8)...................................................................................................................................278
Figure 218: PSF (1)...................................................................................................................................279
List of Tables
CHAPTER 1: INTRODUCTION TO THE STUDY
1.1 Background of the Project
The research domain for the current project is e-learning. Unlike traditional learning, which
requires physical interaction between tutors and students in classrooms, e-learning is a more
recently established learning paradigm that utilizes Information and Communication Technologies
(ICT) and relevant electronic devices to deliver knowledge. As claimed by many researchers, the
increasing popularity of e-learning follows the spread of such technologies, which has boosted
creativity and innovation within the educational environment. Also known as distance learning,
e-learning can help reduce expenditure and travel time for those living in distant places, and
even the administrative workload of school staff (Maatuk et al., 2022). Using the university
(APU) as an example, it has integrated e-learning platforms with its existing school systems in
the curriculum. Despite having a large international community from over 130 countries, it
manages to provide all students with diverse yet suitable learning materials, supplementary
courses and assessments; as a result, international or out-of-state students do not need to pay
travel or accommodation fees to pursue their tertiary education. As for tutors, presentation,
coordination and grading tools such as Microsoft Teams, Moodle and Turnitin allow them to focus
more on teaching methods instead of preparing learning resources, which are already available in
the system.
However, the main factor contributing to the rise of e-learning is the unexpected Covid-19
outbreak in early 2020. Many countries enforced strict health protocols and lockdown
regulations, limiting many social, recreational and economic activities to reduce physical
interaction and promote social distancing. Correspondingly, the education sector has also fallen
victim to this crisis. Previous studies on the impact of Covid-19 have acknowledged the tendency
of higher learning institutions to respond by shutting down campuses and experimenting with
e-learning as a teaching and learning alternative. Although the e-learning domain has multiple
issues in terms of connectivity, academic transition, and efficacy in teaching and learning,
many educational institutions are eager to introduce or implement their own e-learning systems
with the aim of resolving, or at least minimizing, the disruption to the education sector
(Mseleku, 2020).
In the solution context of this project, Automatic Speech Recognition (ASR) can be interpreted
as a speech-to-text conversion tool whereby human speech is captured by a receiver such as a
microphone or other transducer, processed accordingly and converted into a sequence of text, or
a transcript, by means of algorithms. Being the most natural and convenient mode of
communication among humans, speech plays a pivotal role in our daily lives (S & E, 2016).
Existing as an acoustic waveform in the air, speech signals are transmitted from the speaker and
perceived by the listener's ears, then converted into electrical signals to be interpreted by
the brain. The brain then formulates a meaningful message through a speech model to be
transmitted, and the process repeats, as illustrated in figure 1 (Kanabur et al., 2019a). Using
a similar technique, ASR establishes a communication link between the computer interface and the
natural human voice in a flexible yet convenient way. It has been proven to ease the lives of
people with physical or learning disabilities who are unable to receive, transmit or convey
vocal signals appropriately.
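To make the front end of this pipeline concrete, the sketch below frames, windows and Fourier-transforms an audio signal into a magnitude spectrogram, which is the typical input representation fed to an acoustic model. It is a minimal illustration using only NumPy on a synthetic 440 Hz tone; the spectrogram helper and its frame and hop sizes are illustrative assumptions, not the exact configuration used in this project.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: slice into overlapping frames, apply a
    Hann window to each frame, then take the real FFT per frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequency bins (frame_len//2 + 1)
    return np.abs(np.fft.rfft(frames, axis=1))

# Illustrative input: 1 second of a 440 Hz tone sampled at 16 kHz,
# standing in for a real microphone recording
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)

spec = spectrogram(audio)
print(spec.shape)  # (124, 129): (time frames, frequency bins)
```

As a quick sanity check, the peak energy in every frame falls in the frequency bin nearest 440 Hz (bin 7, given the 16000/256 = 62.5 Hz resolution), confirming the framing and FFT are wired correctly.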
In the spoken language context of this project, English has been acknowledged as the global
language and is dominant in many areas of world development today, including politics,
technology, international relations, education and travel. The findings of Reddy (n.d.) have
also revealed that tens of thousands of specialized terms have been added to the English lexicon
with the advancement of science and technology. The evidence from this study also highlighted
that approximately 80 percent of the world's digital information is written in English, which
includes data stored individually by firms, institutions and libraries as well as information
readily available on the World Wide Web. As several browsers still lack multilingual
presentation, proficiency in English has thus become a big advantage for those browsing the
Internet (Reddy, n.d.). To provide a more comprehensive picture of the widespread use of the
English language, studies conducted by Ahmad also revealed that around 5,000 newspapers, more
than half of those published worldwide, are written in English, and that countries where English
is a second language publish at least one English newspaper (Ahmad, 2016).
With the basis of introduction on e-learning, ASR and usability of English in various fields, this
paper aims to analyse previous research on existing statistical and machine learning techniques
in implementing ASR, identify and evaluate appropriate acoustic models and construct such
models into an English-based speech recognition system to be integrated with existing e-learning
systems within the education sector.
Despite its high applicability in the field of education, e-learning suffers from several major drawbacks, such as creating learning barriers for students. According to studies conducted by Wald, students tend to spend considerable time and mental effort jotting down notes during online lecture or tutorial classes. This evidently happens more often when lecturers speak at a rapid pace or when students are unfamiliar with the spoken language (English in this case) or the content of the course they are taking. While taking notes, students have to perform a series of tasks: listening to the tutor's speech, comprehending it, reading the slide content displayed on screen, relating the speech to that content and noting everything down in a simplified yet readable manner. This poses great difficulty for students who are unable to attend classes or to multitask efficiently, such as listening while taking down notes. In relation to this, the same study also highlighted that students are unable to grasp the meaning of module content due to poor oral explanation and teaching skills from tutors. Hence, students easily lose concentration during lessons and their absence rates tend to increase gradually. (Wald, 2006)
Furthermore, a considerable amount of the literature has tended to focus on students' requirements rather than teaching proficiency as a precondition for the success of e-learning. In a study of e-learning challenges in higher academia, Islam (2015) observed that the vast majority of such research discusses recommendations and solutions for enhancing the student learning experience from a technological point of view. In contrast, these studies are limited in terms of analyzing post-teaching feedback from academic staff. Generally, each student has a unique learning style shaped by cultural influences and other factors. Hence, academic studies should also suggest improvements in content delivery, speaking rate, teaching quality and other factors from the tutors' perspective in order to boost individual learning outcomes. (Islam et al., 2015) Research conducted by Shao & Wang (2008) has also acknowledged that many e-learning systems do not use automated methods to process large volumes of video and audio data. The cost in terms of time spent and resources required is enormous due to the inefficient methods used to process learning resources. The two researchers also found that ASR is more widely used in the military and broadcast fields, through news and meeting transcription, than in the education sector. (Shao & Wang, 2008)
Thirdly and lastly, current implementations of e-learning systems fail to address the learning disabilities of users proactively. This is especially evident for students who are deaf or have hearing disorders, whether congenital or acquired. According to a study by Wald (2006), such students face a very steep learning curve in following tutors' speech and taking down notes, as they rely heavily on lip-reading or sign-language interpretation, which can hardly be achieved in e-learning. Relevant evidence in the study also revealed that such students at the Rochester Institute of Technology preferred to use ASR re-voicing techniques to simulate real-time text displays, similar to captions and transcripts, due to their high literacy level. (Wald, 2006) The term "literacy" is generally understood as the ability to read and write text. Studies by Walsh & Meade (2003) have also implied that the learning bar rises further when learners with low literacy levels are exposed to Information Technology (IT) and its supporting hardware for the first time. This is because most e-learning systems operate via text-based input commands as opposed to acoustic inputs, requiring such learners to practice using the technology at hand instead of speaking naturally. (Walsh & Meade, 2003)
1.3 Rationale
Considering that e-learning systems have become a major part of students' and teaching staff's lives, there is indeed an urge to develop an Automatic Speech Recognition (ASR) system that can be integrated into e-learning systems with English as the main communication medium. As the issues highlighted above show, many e-learning systems are student-oriented, in that they focus only on improving students' academic performance rather than tutors' teaching proficiency. It is also noted that many e-learning systems lack effective aiding tools, which hinders students' concentration during teaching sessions, especially for those with learning disabilities. As such, current e-learning systems require a speech-to-text mechanism, in the form of transcripts, as an aiding tool for learning and teaching. Whether in traditional or digital learning, speech remains the prominent medium for transmitting information and knowledge. On this basis, transcripts obtained from an ASR system can not only ease content interpretation for students, but also act as a reference for tutors to reflect on their content-delivery performance in terms of teaching speed, point accuracy, clarity of utterance and so forth. One can also store, visualize, edit, delete and duplicate a transcript more easily than audio files, which is a crucial feature for any form of teaching-learning activity. Such transcripts can also help educational institutions reduce their operational and administrative costs in teaching and assessment.
This section discusses the potential tangible and intangible benefits of the project to target users of the system. Tangible benefits are quantifiable and measurable using specific metrics or indicators. Intangible benefits, in contrast, are benefits subjective to the project's improvements that cannot be consistently measured using a quantifiable unit.
Tangible benefits:
i) Save time for tutors and students, as tutors can use transcripts to reflect on their main teaching points easily while students can directly use them as additional notes.
ii) Reduce administration cost with online transcripts being used instead of paper-written
ones which are stored in the form of text files as opposed to physical space.
iii) Minimize the workload of tutors in conducting different teaching methods for
learners with low literacy level.
Intangible benefits:
i) Sharpen the English language skills of students and teachers during lecture or tutorial sessions
ii) Boost confidence of students in their learning capabilities as post-learning can be
conducted more effectively
iii) Improve teaching capabilities of tutors with more self-reflection on their teaching
methods.
iv) Enhance students’ and teachers’ satisfaction and productivity in using the e-learning
system
The implementation of ASR in this project is applicable to all levels of education sector, ranging
from primary schools to universities. The target users of the ASR system are tutors who will be
conducting the module or course delivery and students who will be learning new knowledge
from tutors. The system will be utilized by these users to conduct teaching and learning sessions
more effectively.
1.6.1 Aim
To develop a decisive and fully functional English-based Automatic Speech Recognition (ASR) system using appropriate machine learning techniques, to be integrated with existing e-learning systems in the education sector.
1.6.2 Objectives
⁕ To evaluate existing acoustic modelling techniques within the scope of machine learning
⁕ To evaluate the effectiveness and accuracy of the selected technique implemented in the ASR
1.6.3 Deliverables
i) A statistical or machine learning based acoustic model with the highest recognition
accuracy which is trained using audio datasets in the form of lecture videos.
ii) A transcript which is able to convert utterances of speech from lecture videos into text
to be displayed in a panel.
iii) A Graphical User Interface (GUI) which allows users to register, log in, view properties of audio files, resample audio files, generate and download transcripts, and log out of the system.
One of the most onerous challenges in this project was coming up with a suitable title. The purpose of conducting a project is essentially to address existing issues within a domain, and finding such a domain can be difficult given the vast advancement of technologies. In addition, apart from areas of study such as weather prediction models, stock trading systems and customer segmentation analysis, it is difficult to find a domain related to my course that has yet to gain much popularity in the research field. Another challenge is understanding the mathematical paradigms and algorithms at each stage of ASR, which consist of feature extraction, acoustic modelling, machine learning techniques and so forth. Searching for similar systems of comparable quality and results that fit within the e-learning domain is also challenging. Moreover, I also faced difficulty in selecting the appropriate programming language for the task, since there exist many programming languages alongside relevant libraries, packages and modules deemed relevant to the implementation of ASR.
This section gives a brief introduction on the project background regarding e-learning domain,
ASR and the usage of English language. It also outlines the critical issues that exist in such
domain alongside the importance of conducting this project, tangible and intangible benefits
obtained by users, target users of the ASR system, aims, objectives, deliverables of the project as
well as nature of challenge faced by the developer themselves. The Investigation Report’s
Project Plan documented using Microsoft Excel is also shown in the next sub-section.
Firstly, a brief idea of the Literature Review and its purpose is highlighted. Detailed research in the domain is then conducted, focusing mainly on the classification and system architecture of ASR and the statistical or machine learning models to be considered in this project. Similar systems utilizing the same system architecture but with different chosen models available in the domain are compared to provide insights during the implementation stage of the project.
This section focuses on the technical aspect of the project, including the comparison of different
programming languages, Integrated Development Environment (IDE), programming libraries
and Operating Systems. The best option in each of these categories is then chosen to implement the ASR system, along with detailed justifications.
This section starts off by providing a general overview of system development methodology,
then proceeds to elaborate several methodologies done in previous research. The most
appropriate methodology to be utilized in this project is then chosen and justified accordingly
along with a detailed explanation on the course of action in each phase.
This section starts off by giving a general concept regarding data analysis and the sub-processes
involved. Further elaboration on the metadata of the chosen dataset is given. The inner workings
of Exploratory Data Analysis (EDA), data cleaning, data visualization, data partitioning and
modelling are also explained with code snippets and output display.
This section evaluates the 3 deep learning models constructed in the previous section. Several
evaluation metrics are written and compared among these models, out of which the model with
the most optimal performance across each metric is chosen.
This section illustrates the general overview of the system, including the system's features from the end users' perspective, through graphical and tabular representation in the form of a Use Case Diagram, Use Case Specification and Activity Diagram. To provide an initial overview for system designers and developers, an interface design of each page of the web-based system will also be illustrated.
This section gives detailed explanation on each feature of the system. It also highlights the
release plan, Unit Testing plan and User Acceptance Testing (UAT) plan of the system in which
the latter two will be designed for system validation.
Chapter 9: Implementation
This chapter focuses on the features of the web-based system, which include front-end or design-
wise and back-end or coding-wise implementations. Screenshots of pages as well as source code
will be documented in a detailed manner.
The Unit Testing plan and User Acceptance Testing (UAT) plan documented in Chapter 8 will
be given to system testers and end users or clients respectively to verify and validate if all
features of the system are working as expected from the business requirements.
In this chapter, a summary of the FYP Report is provided in terms of modelling and deployment. Whether the current research has achieved the desired goals and objectives is analysed. Any research gap or limitation in the design of this project is also explored, with corresponding improvements identified to be implemented in the future.
Overall timeline and planning of the proposed project will be listed out in a Gantt Chart with the
aid of Microsoft Excel application. The entire pipeline diagram will be displayed in the appendix
section.
Generally, there are four criteria for classifying ASR, namely speech utterance, speaker model, vocabulary size and environmental condition, as shown in the figure above. (Bhardwaj et al., 2022) By understanding these classifications, the developer can decide which acoustic properties are most suitable for training and validating the ASR models in this project.
In broad terms, a speech utterance can be described as the vocalization of one or more pronunciations, in the form of words or sentences, that can be interpreted by the computer. Firstly, Isolated Words are the easiest and most structured speech utterance type because the system accepts a single utterance at a time. It requires pauses between utterances but does not limit input to single words only, which provides a clear pronunciation to the listener. (An, n.d.) Similar to Isolated Words, Connected Words require a much smaller pause between utterances, such as reading out the numerical representation of a number ("1,298,350"). (Bhardwaj et al., 2022) In a more difficult context, Continuous Speech allows computers to determine the content voiced by natural speakers. Typically, it involves multiple words run together without pauses or distinct boundaries between utterances, which elevates the computational difficulty as the vocabulary range grows. (An, n.d.) Also utilizing natural speech as a medium, Spontaneous Speech accepts any form of acoustic input, including speech produced in noisy environments or filled with pronunciation errors and false starts by the speaker. (Bhardwaj et al., 2022)
With regard to speaker dependency, a speaker-dependent system is trained on a specific speaker's voice characteristics. Although such systems are not flexible enough to be used across different speakers, they are easier to develop, with a lower cost and higher accuracy in speech identification. In contrast, a speaker-independent system is designed for a large group of speakers with distinct speech patterns. Such systems are more costly and difficult to develop, with relatively low recognition accuracy despite exhibiting great flexibility. (An, n.d.) As for speaker-adaptive systems, they adapt a speaker-independent system by utilizing a portion of the specific speaker's acoustic characteristics. (Bhardwaj et al., 2022)
Published studies have also identified that vocabulary size affects the processing requirements, time complexity and accuracy of the ASR. A dictionary, or so-called lexicon, with a small, medium, large or very large vocabulary size can have tens, hundreds, thousands or tens of thousands of words respectively. (An, n.d.) Environmental variability in terms of noise level can also have a detrimental impact on the accuracy of the ASR. (Bhardwaj et al., 2022)
In other research, Wald (2006) emphasizes that ASR systems used in education normally have an unrestricted vocabulary range and are of the speaker-dependent type. In other words, the ASR system has to be trained with read-aloud transcripts, written documents as well as pre-recorded lecture videos filled with specialized vocabulary not available in its own dictionary. Another alternative is to utilize a pre-trained voice model provided by a specific speech recognition engine, which may guarantee higher accuracy in terms of vocabulary and spontaneous speech structures. (Wald, 2006) Despite this, little progress has been made in demonstrating the implementation of connected-word ASR systems, which are the basis of communication in the English language.
Prior to implementing a fully functional ASR system, one must thoroughly grasp the main components of its system architecture, namely the acoustic front-end, acoustic model, language model, lexicon and decoder, as shown in figure 4. A large volume of studies on ASR architecture also highlights that there are two processes in its implementation, namely the front-end and the back-end, in which the latter can only be initiated once the former has been completed. (Jamal et al., 2017)
Front-end Process
The main aim of the front-end process is to convert the analogue speech signal into digital form by parameterizing the unique acoustic characteristics of the speech. This is achieved by performing signal processing and feature extraction. (Jamal et al., 2017) Previous studies have each selected distinct features for their applications, but have established several profound principles as extraction criteria. One of those criteria is the ability to construct acoustic models automatically from a small amount of training data. Another prominent property is that the extracted feature must exhibit little to no variation across speakers and surrounding environments, maintaining great stability of utterance over time. (S & E, 2016) Another study by An (n.d.) has also presented characteristics of extracted speech features such as high measurability and insusceptibility to mimicry. (An, n.d.)
In technical terms, feature extraction can be defined as the pre-processing of analogue speech signals by removing irrelevant observation vectors and filtering a set of correlated voice properties into several quantifiable metrics that are meaningful at the model construction stage. Although a considerable number of feature extraction techniques have been developed over the past few decades, a full discussion of each method lies beyond the scope of this project; this study therefore only provides an overview of the Mel Frequency Cepstral Coefficient (MFCC) technique, which is the most prominent, efficient and simple compared to other methods. (Jamal et al., 2017) The process for the MFCC technique is summarized in figure 5.
Figure 5: Speech Frequency Graph for Hamming Window (Kanabur et al., 2019b)
First and foremost, pre-emphasis is conducted to amplify the high frequencies of the speech signal to enhance model training and recognition accuracy. (S & E, 2016) Next, frame blocking is performed to segment the continuous speech signal into small discrete frames, whereby each frame consists of N samples and is separated by M samples from the adjacent frame, with N − M samples overlapping between them. This process continues until the whole speech signal is segmented into frames. In Gupta's (2013) case study of MFCC, he claimed that the standard values used in much research are N = 256 and M = 100, with M < N. This ensures that sufficient acoustic information is stored inside each frame and is not susceptible to change. (Gupta et al., 2013) Windowing is then initiated to multiply the speech signal frames with windows of varied shape, typically a Hamming window, as shown in figure 6. (Kanabur et al., 2019b) This reduces the disruption at the start and end of each frame. The Hamming window's equation is shown in figure 7, whereby N_m represents the number of samples in each frame. The output signal after this operation is X(m) · W_n(m), whereby X(m) represents the input signal. (Gupta et al., 2013)
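The pre-emphasis, frame blocking and Hamming windowing steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's actual implementation: the input below is random noise standing in for one second of 16 kHz audio, and N = 256, M = 100 follow the values reported by Gupta (2013).

```python
import numpy as np

def preemphasize(signal, alpha=0.97):
    # y[t] = x[t] - alpha * x[t-1]; boosts the high-frequency content
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_blocking(signal, N=256, M=100):
    # frames of N samples with a hop of M samples, so N - M samples overlap
    num_frames = 1 + max(0, (len(signal) - N) // M)
    return np.stack([signal[i * M : i * M + N] for i in range(num_frames)])

def apply_hamming(frames):
    # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)); tapers the edges of each frame
    return frames * np.hamming(frames.shape[1])

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for 1 s of 16 kHz audio
frames = apply_hamming(frame_blocking(preemphasize(speech)))
print(frames.shape)  # → (158, 256)
```

Each row of `frames` is then ready for the DFT/FFT stage described next.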
Figure 9: Graph for relationship between frequency and Mel frequency (Gupta et al., 2013)
From the windowed speech signal frames, the Discrete Fourier Transform (DFT) or Fast Fourier Transform (FFT) is used to compute the frequency or magnitude spectrum X(k) for the next stage, where k ranges from 0 to N_m − 1; the equation to derive this spectrum is shown in figure 8. Analysis from the study also highlighted that the DFT and FFT are the same by definition but have different computational time complexities. From the results obtained, a positive frequency f in the range 0 ≤ f ≤ F_s/2 corresponds to the lower half of the samples, 0 ≤ m ≤ N_m/2 − 1, whereas a negative frequency in the range −F_s/2 < f < 0 corresponds to the upper half, N_m/2 + 1 ≤ m ≤ N_m − 1, whereby F_s represents the sampling frequency. (Gupta et al., 2013) From the calculated
spectrum frequency above, the spectrum is warped using a logarithmic Mel scale and the Mel frequencies are computed through the equation shown in figure 9. (S & E, 2016) A set of triangular overlapping windows, also known as triangular filter banks, is constructed whereby the filters are spaced linearly below 1000 Hz and logarithmically above 1000 Hz, as illustrated in figure 10. Such a mapping principle makes it easier to identify the spacing between filters and thus estimate the approximate energy at each spot of the spectrum. Last but not least, a Discrete Cosine Transform (DCT) is performed to convert the Mel spectrum from the frequency domain back to the time domain. Gupta (2013) noted that the vocal output of the DCT can contain more energy when compared with the DFT: as the DCT is used in data compression, energy is concentrated in a few coefficients, whereas the DFT is used in spectral analysis. The equation is shown in figure 11, whereby C_n represents the Mel Frequency Cepstral Coefficients (MFCC), m represents the number of coefficients and k represents numbers from 0 to m − 1. (Gupta et al., 2013)
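The Mel warping above follows the widely used formula mel(f) = 2595 · log10(1 + f/700). The short sketch below shows the warp, its inverse, and how equal spacing on the Mel axis yields filter centres that are roughly linear below 1000 Hz and logarithmic above it; the filter count (26) and 8 kHz range are illustrative assumptions, not values from the cited studies.

```python
import numpy as np

def hz_to_mel(f):
    # standard Mel-scale warping: mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # inverse warping, back from Mel to Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# centre frequencies of 26 triangular filters between 0 Hz and 8 kHz:
# equally spaced on the Mel axis, hence denser at low Hz frequencies
mels = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 26 + 2)
centres_hz = mel_to_hz(mels)

print(round(float(hz_to_mel(1000.0))))  # → 1000 (1000 Hz maps to ~1000 mel)
```

The log Mel filterbank energies produced with such filters are what the DCT then converts into cepstral coefficients.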
Back-end Process
Being the core of ASR, the acoustic model is a container that stores statistical representations of segmented speech signals, known as the basic speech units that constitute the pronunciation of a word. The acoustic model is established based on sequences of feature vectors computed from speech waveforms. (S & E, 2016) Such basic speech units can also be referred to as phones, phonemes, syllables or feature-exclusive acoustic observations extracted during the pre-processing stage. In order to recognize such phonemes, a language model consisting of linguistic and grammatical properties is required. (Jamal et al., 2017) Similar to how humans recognize utterances in conversations, a language model helps the acoustic model to distinguish between valid and invalid words in utterances, as well as their sequences, by providing some form of context. Such context is measured in terms of the probability distribution of the word that will be voiced next by the speaker. This probability can be deduced from a large text corpus based on the speaker's previously spoken words. Common language models are the bigram and the trigram, whereby the former and latter group two and three consecutive words together respectively, after which the probability of the sequence can be computed. The pronunciation model, also known as the lexicon or dictionary, provides the mapping between words and phonemes in order to form optimal word sequences. (S & E, 2016)
Figure 11: General equation for statistical computation of ASR (Jiang, 2003)
With the aid of the acoustic model, language model and lexicon, the decoder is able to compute the most likely word sequence W from the observed acoustic input sequence X. Such computation can be understood as maximizing the posterior probability for the observation X, indicated by Ŵ = argmax_W P(W | X), whereby the posterior probability is denoted by P(W | X). (Jiang, 2003) Through Bayes' Theorem, the second part of the equation is derived as shown in figure 11: Ŵ = argmax_W P(X | W) P(W) / P(X). (S & E, 2016) In this context, P(W) refers to the probability of deriving a word sequence from the language model, P(X) refers to the marginal probability of the observed sequence, and P(X | W) refers to the probability, given by the acoustic model, of deriving the observed sequence on the basis that the underlying word sequence is W. As mentioned in most previous ASR studies, the term P(X) can be treated as a constant across different word sequences, thus it can be ignored in the calculation, which results in the third part of the equation in figure 12: Ŵ = argmax_W P(X | W) P(W). (Jiang, 2003)
Figure 13: Equation for transition probability (a_ij) (Rupali et al., 2013)
Figure 14: Equation for summation of transition probability (Σ_j a_ij) (Rupali et al., 2013)
Before explaining the inner workings of the HMM, one must fully grasp the concept of a Markov Chain. Makhoul & Schwartz (1995) generalize a Markov Chain as a simple network consisting of a finite number of states with transitions among them. Each state is represented by an alphabetic symbol and each transition carries a probability. For instance, a_ij symbolizes the transition probability from state i to state j; in figure 13 the two states are associated with the symbols A and B respectively and the output symbol is always B. (Makhoul & Schwartz, 1995) The equation for a_ij is shown in figure 14, a_ij = p(q_{t+1} = j | q_t = i), whereby p represents probability, q_t represents the current state, q_{t+1} represents the next state and N_n represents the number of hidden states in the model. Since states are always numbered from 1 up to N_n, the i-th and j-th states lie within this range with both ends inclusive. The summation of the transition probabilities a_ij over j is always 1, as shown in figure 15. (Rupali et al., 2013) The relevance of such findings implies that the transitions among states are probabilistic whereas the final output is deterministic.
Figure 15: Equation for observational symbol probability (b_j(k)) (Rupali et al., 2013)
Figure 16: Equation for summation of observational symbol probability (Σ_k b_j(k)) (Rupali et al., 2013)
In contrast, the output symbols of an HMM are probabilistic, in that each state can be associated with an arbitrary number of symbols of any form instead of a single defined one. The selection of such a symbol depends on the transition probabilities among the states. Since both the symbols and the transitions are probabilistic, the HMM is generally known as a doubly stochastic statistical model. Owing to the non-deterministic nature of the transitions among states, it is impossible to derive the sequence of states from the final output symbols. As such, the model is known as a Hidden Markov Model, whereby the state sequence is hidden and observers can only see the final output symbols. (Makhoul & Schwartz, 1995) The equation for deriving the probability distribution of the output symbol in state j, b_j(k) = p(o_t = V_k | q_t = j), is shown in figure 16, whereby V_k represents the k-th observational symbol in the list of alphabets, o_t represents the current parameter vector and M represents the total number of unique observation symbols per state. Similar to the previously highlighted constraints, the index k for each state must lie within 1 to M with both ends inclusive, and the summation of the probability distribution over each state's observational symbols is always 1, as shown in figure 17. (Rupali et al., 2013)
Figure 17: Equation for initial state distribution (π_i) (Rupali et al., 2013)
In the case of the initial state distribution for state i, π_i, as shown in figure 18, π_i denotes the probability that the model begins in state i. By defining the 5 main terminologies of an HMM, which are listed as follows: the number of hidden states (N), the number of unique observational output symbols per state (M), the state transition probability distribution (A = {a_ij}), the observation symbol probability distribution per state (B = {b_j(k)}) and the initial state distribution (π = {π_i}), the complete parameter set of the model can be compactly defined as λ = (A, B, π). (Rupali et al., 2013)
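Under the definition λ = (A, B, π), the probability of an observation sequence given the model can be computed with the standard forward algorithm. The sketch below uses a hypothetical 2-state, 2-symbol HMM whose numbers are invented for illustration; note that each row of A and B sums to 1, matching the constraints above.

```python
import numpy as np

# a tiny hypothetical 2-state, 2-symbol HMM, lambda = (A, B, pi)
A = np.array([[0.7, 0.3],    # A[i, j] = a_ij: transition from state i to j
              [0.4, 0.6]])   # each row sums to 1
B = np.array([[0.9, 0.1],    # B[j, k] = b_j(k): P(symbol V_k | state j)
              [0.2, 0.8]])
pi = np.array([0.6, 0.4])    # pi_i: probability of starting in state i

def forward(obs):
    # forward algorithm: P(observation sequence | lambda)
    alpha = pi * B[:, obs[0]]          # initialisation at t = 1
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction step
    return float(alpha.sum())          # termination: sum over final states

p = forward([0, 1, 0])
print(round(p, 5))  # → 0.10893
```

Training such a model (estimating A, B and π from data) would use the Baum-Welch algorithm, which builds on this same forward pass.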
Figure 18: Warping between two non-linear time series (Muda et al., 2010)
A key step in DTW is to compute the minimum distance between two time series using their pair-wise coordinates. The function involved in this procedure is often called the "distance function" or "cost function". First and foremost, the distance or local cost matrix C is constructed using the equation in figure 20. In the equation, N denotes the height of the matrix, M denotes the width of the matrix, C_ij denotes the local cost at coordinate (i, j), and x_i and y_j denote the coordinates of the time series sequences X and Y respectively. (Senin, 2008) The cost for each cell is therefore computed as the Euclidean distance between the two corresponding coordinates. (Muda et al., 2010)
Figure 20: Local path alternatives for grid point (i, j) (Saleh, n.d.)
After the matrix has been constructed, the alignment path, warping path or warping function must be computed. Collectively, multiple previous studies have outlined that computing all possible warping paths P between the sequences X and Y is infeasible, as the time complexity of the algorithm grows exponentially while the length of each sequence grows linearly. (Senin, 2008) According to Saleh (n.d.), the problem is approached by restricting the time warping between the two vector sequences through several boundary conditions. One of them is that the first and last vectors or coordinates of sequence X must be assigned to their counterparts in sequence Y. For the coordinates in between, repetitive forward and backward leaping between points that may already have been visited is prevented by "reusing" preceding vectors to perform the time warping operations. For a clearer visualization, the local path alternatives for grid point (i, j) with all possible predecessors are illustrated in figure 21. (Saleh, n.d.)
Figure 21: Equation to compute the minimum accumulated distance for optimal path δ(i,j) (Saleh, n.d.)
Another interesting observation in figure 21 is that the coordinate (i−1, j−1) can reach coordinate
(i, j) directly via a diagonal transition without passing through the vertical coordinate (i, j−1) or
the horizontal coordinate (i−1, j). Since only one vector distance is computed along this
transition, the local cost C(i, j) is added twice to keep the paths comparable. Moreover, it is
evident that there are only 3 possible predecessors or partial paths leading to (i, j), namely paths
from (0, 0) to (i−1, j), (i, j−1) and (i−1, j−1). As such, one can apply Bellman's Principle, which
states that if there exists an optimal path P starting at point (0, 0), ending at point (N−1, M−1)
and passing through grid point (i, j), then the partial path from (0, 0) to (i, j) is also part of P.
Based on this principle, the minimum accumulated distance δ(i, j) of the globally optimal path
from (0, 0) to (i, j) can be derived using the equation shown in figure 22. (Saleh, n.d.)
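The recurrence above can be sketched in Python. This is an illustrative sketch, not the project's implementation: it uses the weighting described in the text (vertical and horizontal steps add C(i, j) once, the diagonal adds it twice), and the boundary weight at (0, 0) is an assumption.

```python
import numpy as np

def dtw_distance(x, y):
    """Minimum accumulated distance δ(N-1, M-1) under the recurrence
    described above: vertical and horizontal steps add the local cost
    C(i, j) once, while the diagonal step adds it twice (Saleh, n.d.)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    N, M = len(x), len(y)
    C = np.abs(x[:, None] - y[None, :])        # local cost matrix
    D = np.full((N, M), np.inf)
    D[0, 0] = 2 * C[0, 0]                      # boundary weight (assumed)
    for i in range(N):
        for j in range(M):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append(D[i - 1, j] + C[i, j])          # vertical
            if j > 0:
                candidates.append(D[i, j - 1] + C[i, j])          # horizontal
            if i > 0 and j > 0:
                candidates.append(D[i - 1, j - 1] + 2 * C[i, j])  # diagonal
            D[i, j] = min(candidates)
    return D[N - 1, M - 1]
```

Two identical sequences yield an accumulated distance of zero, since every local cost along the diagonal path is zero.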
Figure 22: CNN architecture for speech recognition (Huang et al., n.d.)
Figure 23: Stages of CNN architecture for speech recognition (Palaz et al., 2014)
Despite HMM being the most traditional and widely used model for speech recognition, a large
body of literature has applied CNN, originally an image processing technique, to spectrum
images generated from acoustic inputs. According to Musaev (2019), a study by Ossama in 2014
also implemented CNN to perform adaptive dialogue recognition across various accents in call
centre environments. In addition, a paper published by Dennis in 2010 presented findings on
using CNN to classify sound events based on visual signatures extracted from acoustic inputs.
(Musaev et al., 2019) Being a deep learning model used in various applications, a CNN can
transform a sequence of acoustic signals into segments of frames, then output a score for each
class within each frame. The general architecture of a CNN is illustrated in figure 23. (Palaz et
al., 2014) The network architecture has two stages, namely the filter extraction or feature
learning stage and the classification or modelling stage, as shown in figure 24. The convolutional
and pooling layers correspond to the former stage whereas the fully connected and SoftMax
layers correspond to the latter. The usage of the tanh() function will be discussed in the
following sections.
Being the core component of a CNN, the convolution layer consists of several filters or kernels
that process fragments of the previous layer by computing the summation of each fragment's
element-wise matrix multiplication. (Musaev et al., 2019) The previous fragments are given as
raw waveforms in most research papers, but Wang (2019) chose Mel-Frequency Cepstral
Coefficients as inputs. Such inputs are denoted as X = {X_1, X_2, …, X_T} whereby
X_i ∈ R^(b×c). In this context, T represents the time step or length of the X-axis, b represents
the bandwidth or length of the Y-axis and c represents the number of channels. The output is a
2-dimensional feature map or matrix o consisting of elements o(i, j), with i denoting the width
term and j denoting the height term, calculated as shown in figure 25. In terms of the equation,
sw_c represents the convolution stride's width, sh_c represents the convolution stride's height,
w represents the kernel's width, h represents the kernel's height and k represents the kernel
whereby k ∈ R^(w×h×c). The convolutional stride is the number of units the kernel slides
between positions when mapping the input layer to the output layer. (Wang et al., 2019)
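The element-wise computation of o(i, j) described above can be sketched for the single-channel case. This is an illustrative sketch under assumed names, mirroring the T = 3, b = 3, c = 1, w = 2, h = 2 example discussed in the text; it is not the paper's actual code.

```python
import numpy as np

def conv2d_valid(x, k, stride=(1, 1)):
    """Sketch of the convolution described above: each output element
    o(i, j) is the sum of the element-wise product between the kernel k
    and the w*h input window at that position (single channel, 'valid'
    padding; names and stride handling are illustrative)."""
    sw, sh = stride
    w, h = k.shape
    out_w = (x.shape[0] - w) // sw + 1
    out_h = (x.shape[1] - h) // sh + 1
    o = np.empty((out_w, out_h))
    for i in range(out_w):
        for j in range(out_h):
            window = x[i * sw:i * sw + w, j * sh:j * sh + h]
            o[i, j] = np.sum(window * k)
    return o

x = np.arange(9, dtype=float).reshape(3, 3)   # T = 3, b = 3, c = 1
k = np.ones((2, 2))                           # w = 2, h = 2
o = conv2d_valid(x, k)                        # 2x2 feature map
```

With a multi-channel input, each o(i, j) would instead sum over all w×h×c input elements, as the text notes.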
Figure 26: Equations for time span (t_c) and time shift window (window_c) (Wang et al., 2019)
Further analysis of the equation reveals that each element o(i, j) of the matrix is the by-product
of w×h elements in each input feature map (channel), which means that an MFCC sequence with
c channels requires w×h×c input elements to derive the output at the corresponding position. A
more detailed visualization of the equation's workings can be seen in figure 26, whereby T = 3,
b = 3, c = 1, w = 2, h = 2, sw_c = 1 and sh_c = 1. To simplify the analysis, the time span across
the resulting matrix (t_c) and the time shift window between adjacent elements of the matrix
(window_c) are computed as shown in figure 27. In this context, w_c represents the kernel's
width, sw_c represents the convolution stride's width, t_i represents the time scope term and
window_i represents the shift window term. These equations will act as inputs to the processes
or layers discussed in the next section. (Wang et al., 2019)
Figure 27: Equation for ReLU activation function (Wang et al., 2019)
Figure 28: Equation for ClippedReLU activation function (Wang et al., 2019)
The scalar results of the convolution layer are then passed to a pre-determined activation or non-
linear function. Although the activation layer is commonly merged with the convolution layer, a
more comprehensive study, such as the research conducted by Musaev, discusses them separately
due to the complexity they possess. According to him, the non-linearity functions used
traditionally are the hyperbolic tangent function tanh(x), the absolute hyperbolic tangent
function |tanh(x)| and the sigmoid function (1 + e^(−x))^(−1). Later research by Glorot found
that the ReLU activation function is more reliable in terms of speeding up the learning process of
each neuron, in addition to simplifying the computation by trimming negative scalars in the
matrix. (Musaev et al., 2019) This function is computed using the equation in figure 28. From
the equation, if an element of the input matrix X is greater than 0, the function outputs the
element itself; otherwise it outputs 0. As for the revision of ReLU called the Clipped ReLU
activation function, it adds a new parameter α and takes the minimum between α and the result
of the ReLU function, thus bounding the output to [0, α] as shown in figure 29. (Wang et al.,
2019)
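Both activation functions described above are one-liners in NumPy. The sketch below is illustrative; the function names are assumptions.

```python
import numpy as np

def relu(x):
    """ReLU as in figure 28: max(x, 0) applied element-wise."""
    return np.maximum(x, 0)

def clipped_relu(x, alpha):
    """Clipped ReLU as in figure 29: min(max(x, 0), alpha),
    bounding every output to the range [0, alpha]."""
    return np.minimum(relu(x), alpha)

x = np.array([-2.0, 0.5, 3.0])
```

Applying both to the sample vector shows the clipping effect: ReLU trims the negative scalar, while Clipped ReLU additionally caps the large positive one at α.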
Figure 29: Equations for time span (t_p) and time shift window (window_p) of max pooling (Wang et al., 2019)
Figure 30: Final equations for time span (t_p) and time shift window (window_p) of max pooling (Wang et al., 2019)
Another example of a non-linear operation is the pooling layer. It takes the group of pixels from
each region of the previous convolutional layer and compresses them into one pixel. In this
scenario, the max-pooling function is typically used, which selects the maximum element out of
each group of pixels. (Musaev et al., 2019) Carrying this computation forward requires the
concept of the time span and time shift window described earlier, which has yet to account for
pooling operations. With t_p and window_p denoting the time span and time shift window of the
max pooling layer respectively, the corresponding equations are shown in figure 30. The layer
consists of max pooling of size w_p × h_p and pooling strides of size sw_p × sh_p. The final
equations result from substituting the equations in figure 27 into those in figure 30, as shown in
figure 31. Not only are the significant acoustic features retained as maxima, but the
corresponding time spans are also enlarged with fewer computational steps to follow. (Huang et
al., n.d.)
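The region-wise compression described above can be sketched as follows. This is an illustrative, non-overlapping max-pooling sketch with assumed parameter names, not the cited papers' code.

```python
import numpy as np

def max_pool2d(x, size=(2, 2), stride=(2, 2)):
    """Sketch of max pooling as described above: each w_p x h_p region
    of the input is compressed into its maximum element (non-overlapping
    pooling here; parameter names are illustrative)."""
    wp, hp = size
    swp, shp = stride
    out_w = (x.shape[0] - wp) // swp + 1
    out_h = (x.shape[1] - hp) // shp + 1
    o = np.empty((out_w, out_h))
    for i in range(out_w):
        for j in range(out_h):
            o[i, j] = x[i * swp:i * swp + wp, j * shp:j * shp + hp].max()
    return o

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 1., 2., 3.],
              [9., 2., 1., 0.]])
o = max_pool2d(x)   # 2x2 map holding the maximum of each region
```

Each 2×2 block of the 4×4 input collapses to its largest element, which illustrates how pooling down-samples the feature map while preserving the strongest activations.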
After extensively explaining the feature learning stage, we proceed to the classification stage,
which starts with a linear transformation in the fully connected or so-called dense layers. Using
the flattened 1-dimensional vector of sequences after down-sampling by the pooling layers, each
element of the vector is connected to every output neuron by a specific weight, which finalizes
the mapping within the network. (Yamashita et al., 2018) Moving on, a non-linear
transformation for Deep Neural Networks called Batch Normalization (BN) is performed to
minimize the effect of Internal Covariate Shift. This is a phenomenon whereby the distribution
of inputs to each layer changes due to changes in the preceding layers' network parameters. As
the number of layers increases (i.e., the network becomes deeper), the amplification of such
changes becomes more apparent. As implied by its name, BN introduces a normalization step
over each batch at each layer instead of viewing all layers as a whole, thus improving training
rates and performance during the testing phase. (Wang et al., 2019)
Figure 32: Equations for each neuron's BN in every vector of a batch (Wang et al., 2019)
For a batch of layer inputs X = {X_1, X_2, …, X_m}, there are m flattened vectors. Each vector
contains d neurons. Hence, the BN of a vector can be summarized as the set of its neurons with
their corresponding BN values, as shown in figure 32. For every neuron k, its BN is computed
using the equations in figure 33. (Wang et al., 2019)
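The per-neuron normalization over a batch can be sketched as follows. This is a minimal sketch of the normalization step only; the learnable scale and shift parameters (γ, β) of the full BN method are omitted, and the function name is an assumption.

```python
import numpy as np

def batch_norm(X, eps=1e-5):
    """Sketch of Batch Normalization over a batch of m flattened vectors
    with d neurons each: every neuron (column) is normalized to zero
    mean and unit variance across the batch. The learnable scale/shift
    parameters (gamma, beta) of the full method are omitted."""
    mu = X.mean(axis=0)                 # per-neuron batch mean
    var = X.var(axis=0)                 # per-neuron batch variance
    return (X - mu) / np.sqrt(var + eps)

X = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [5.0, 50.0]])            # m = 3 vectors, d = 2 neurons
Xn = batch_norm(X)
```

After normalization, each neuron's values across the batch have approximately zero mean and unit variance, which is what stabilizes the input distribution between layers.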
Figure 33: Equation for Softmax activation function (M. Rammo & N. Al-Hamdani, 2022)
Last but not least, to apply a classification technique on the normalized features, an activation
function called Softmax is implemented. This function is optimal for multiclass, single-label
classification, as it normalizes the values from the last fully connected layer into class
probabilities ranging between 0 and 1 whose sum equals 1. (Yamashita et al., 2018) The
computation of this function is shown in figure 34. When there are n neuron values in the input
vector x of the layer, the n outputs form a probability distribution over the n classes. (M. Rammo
& N. Al-Hamdani, 2022)
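The Softmax computation can be sketched directly from its definition. This is an illustrative sketch; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the equation in figure 34.

```python
import numpy as np

def softmax(x):
    """Softmax as in figure 34: exponentiate each neuron value and
    normalize by the sum, yielding class probabilities that sum to 1.
    Subtracting max(x) first is a standard numerical-stability trick."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
```

The largest input value receives the largest probability, and the n outputs always sum to 1, forming a valid probability distribution over the classes.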
Figure 34: Equation for calculation of Word Error Rate (WER) (du Simplon et al., 2005)
Figure 35: Equation for calculation of Word Recognition Rate (WRR) (du Simplon et al., 2005)
One of the most common evaluation techniques for speech recognition accuracy is Word Error
Rate (WER). It applies the Levenshtein distance or edit distance algorithm, which finds the
minimum number of insertion, deletion and substitution operations needed to transform one
string into the other. In particular, it computes the minimum edit distance between the reference
transcript and the automatic transcript generated by the developed model, then normalizes by the
length of the reference. The equation is shown in figure 35, whereby N_r represents the total
number of words in the reference transcript, and S, D and I represent the number of words
substituted, deleted and inserted respectively. A related metric is Word Recognition Rate
(WRR), which is computed using the equation in figure 36.
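The WER computation described above can be sketched with a standard dynamic-programming edit distance over word sequences. This is an illustrative sketch with assumed names, not a production metric implementation.

```python
def word_error_rate(reference, hypothesis):
    """WER sketch: the Levenshtein distance between the two word
    sequences (substitutions, deletions, insertions) divided by the
    number of words in the reference transcript."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(r)][len(h)] / len(r)

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Here the hypothesis drops one word out of a six-word reference, so the minimum edit distance is one deletion and the WER is 1/6.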
Figure 36: 10 Bangla digit representation, pronunciation and IPA (Muhammad et al., 2009)
First and foremost, a previous research effort by Muhammad (2009) made a significant
contribution to Bangla-based ASR systems by analysing Bangla digits for constructing a speech
recognition model. Due to the lack of a relevant Bangla digit speech corpus in previous
literature, a medium-sized Bangla digit speech corpus consisting of the 10 digits written in
Arabic numerals was developed. Their corresponding pronunciations in Bangla and the
International Phonetic Alphabet (IPA) are shown in the figure above. For data collection, a total
of 100 native Bangladeshi speakers aged between 16 and 60 with an equal gender distribution
were chosen. Each speaker performed 10 trials for each digit, with half conducted in a quiet
room and half in an office room, both exhibiting similar environmental properties. (Muhammad
et al., 2009)
Figure 37: Digit correct rate (%) of Bangla digits in ASR (Muhammad et al., 2009)
Being one of the more practical ways of extracting acoustic features from speech, the MFCC
technique was chosen by the author. Among the 100 speakers, 37 males and 37 females were
selected for the training set while the remainder formed the testing set. The parameters chosen
are as follows: a sampling rate of 11.025 kHz with 16-bit sample resolution, a Hamming window
of 25 ms with a step size of 10 ms, and 13 features with a pre-emphasis coefficient of 0.97. As
such, there are a total of 13 hidden states in the HMM, with a varying number of mixture
components to be tested. Since the vocabulary size is limited to only 10 words for this research,
the word model used to recognize the 10 digits is an HMM with left-to-right orientation. The
results are then evaluated based on digit correct rate (%). From the training and testing results in
the figure above, it is apparent that the first 6 digits have a digit correct rate of over 95% whereas
the remaining 4 fall below 90%. Moreover, digit 2 has the highest correct rate of 100% whilst
digit 8 has the lowest at only 84%. Another significant finding is that the 8-mixture component
setting appears to be the most optimal, with more than half of the digits achieving their highest
correct rate in this mixture category. (Muhammad et al., 2009)
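The windowing parameters quoted above (25 ms Hamming window, 10 ms step) determine how many analysis frames a given utterance yields. The sketch below applies the standard framing calculation; the formula and function name are not taken from the paper itself.

```python
def frame_count(duration_s, win_ms=25, hop_ms=10):
    """Number of full analysis frames for a signal of the given
    duration, using the 25 ms window and 10 ms step cited above
    (standard framing calculation, assumed rather than quoted)."""
    return int((duration_s * 1000 - win_ms) // hop_ms) + 1

n = frame_count(1.0)   # frames in one second of speech
```

One second of speech under these settings produces 98 overlapping frames, each of which contributes one 13-dimensional MFCC vector.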
2.3.2 Dynamic Time Warping (DTW) Based Speech Recognition System for Isolated
Sinhala Words
In the implementation of this system proposed by Priyadarshani (2012), the author highlighted
that research on speech recognition paradigms for the Sinhala language in Sri Lanka is still at an
initial stage, with little to no useful information available. Moreover, small vocabulary sizes in
the lexicon have been a critical issue in most ASR systems developed using the DTW technique
in particular. This is likely because there is a greater probability of similar-sounding words
appearing in the speech corpus whose sub-word pronunciation durations differ from one another,
making it difficult to parse acoustic inputs into accurate phrases. Hence, this research attempts to
use a relatively large Sinhala vocabulary with a total of 1055 frequently used words to develop
an efficient speech recognition system. To achieve this, feature extraction using the MFCC
technique and feature matching, i.e., comparing the test pattern with a preloaded reference for
word identification through DTW, are performed. (Priyadarshani et al., 2012)
Figure 38: Word Recognition Rate (WRR) for 4 speakers in 3 respective sessions (Priyadarshani et al., 2012)
The acoustic inputs are gathered from four native Sinhala speakers into audio files. Three
sessions are conducted with each speaker, whereby the 2nd and 3rd sessions take place 3 months
and 1 year after the 1st session respectively. In each session, one utterance of each word is used
to train the model whereas a 2nd utterance is used as the testing set. The entire simulation is
done in MATLAB 7.0 and Word Recognition Rate (WRR) is used as the evaluation metric. An
overall WRR of 93.92% is achieved based on the results in the figure above. A clear declining
trend in recognition accuracy across sessions is also observed, due to variation in the speakers'
voices over time. Considering that a large speech corpus is involved, DTW has successfully
identified varying Sinhala speech with unique acoustic properties from different speakers.
(Priyadarshani et al., 2012)
2.3.3 Convolutional Neural Network (CNN) based Speech Recognition System for
Punjabi Language
Figure 39: Parameter setup for CNN based Speech Recognition System for Punjabi language (Dua et al., 2022)
In a systematic study of speech recognition for a less-studied language, Punjabi, Dua (2022)
observed that most current literature uses HMM, GMM and ANN techniques to recognize
speech inputs. Further analysis of such studies points out that CNN is becoming a more
prominent modelling paradigm in speech and pattern recognition as well as artificial intelligence
and machine learning research, due to its enhanced model training speed and applicability to
systems with large-vocabulary datasets. On this basis, Dua (2022) implemented a CNN-based
approach to recognize tonal Punjabi cues with additional background noise. As shown in figure
40, the vocal data was collected from 11 Punjabi speakers of different ages and accents in
different environments, each speaking up to 38 stanzas in a continuous mode of speech. Hence,
there are a total of 418 sentences (38 * 11) to be recognized. The audio was recorded at a
sampling rate of 44.1 kHz and stored in .wav format. The author also set out to develop a large
corpus of Gurbani hymns for the system due to the absence of a tonal speech dataset in the
current domain. (Dua et al., 2022)
Figure 40: Framework for CNN based Speech Recognition System for Tonal Speech Signals (Dua et al., 2022)
The proposed speech recognition system's framework is shown in figure 41. Firstly, Praat
software version 6.1.49 was used to generate Mel spectrogram waveforms from the input speech
signals. Since the chosen programming language is Python, the LIBROSA library was used to
perform MFCC feature extraction. For feature learning, six 2D convolution layers along with
two fully connected layers were used. A flattening layer was inserted between the 2D
convolutional layers and the 256-unit dense layer. The non-linear Softmax activation function
was used to activate neurons in the form of vector sequences and classify them accordingly. The
processes after feature extraction are handled by TensorFlow, the Kaldi toolkit and other back-
end libraries. The model is then trained using Google Cloud Services and the Keras Sequential
API. (Dua et al., 2022)
Figure 41: Word Recognition Rate (WRR) of different speakers (Dua et al., 2022)
Figure 42: Overall Word Recognition Rate (WRR) compared to other speech recognition systems (Dua et al., 2022)
The results of the trained model can be seen in figure 42, which uses Word Recognition Rate
(WRR) as the evaluation metric. It is apparent that speaker 7 has the highest WRR of 90.911%
whereas speaker 9 has the lowest WRR of 86.765%. Such a difference in WRR relates to the
varying acoustic patterns, tonal speech frequencies and timing between each speaker's
utterances. From these data, an average WRR of 89.15% can be derived. When compared with
other speech recognition systems using different modelling techniques, this result emerged as
the highest, as shown in figure 43. Overall, these results suggest that CNN is the optimal
modelling paradigm for handling large tonal speech datasets, with MFCC being the best feature
extraction technique. Thus, as pointed out by the author, studies experimenting with more
speakers in varying environments and different speech classifications should be conducted more
extensively in the future. (Dua et al., 2022)
Feature Extraction: MFCC (all three studies)
Nature of Speaker:
- Bangla study: 100 native Bangladeshi speakers with equal gender distribution, aged 16 to 60
- Sinhala study: 4 native Sinhala speakers of different gender and age
- Punjabi study: 11 native Punjabi speakers speaking in a continuous mode of speech
2.4 Summary
This chapter begins by introducing the context of a literature review, followed by a discussion of
the technical information in the domain research section. In that section, the classification of
ASR is first described, whereby the sub-categories in terms of speech, speaker, vocabulary and
environment are explained in detail. In the next sub-section, the overall architecture of an ASR
system, consisting of front-end and back-end processes, is portrayed. For the front-end process,
feature extraction is introduced with an in-depth explanation of the MFCC technique. As for the
back-end process, acoustic models, language models, the lexicon and their inter-relations are
outlined. What follows is an extensive analysis of 3 machine learning models along with their
mathematical paradigms, namely HMM (an acoustic model), DTW (a dynamic programming
algorithm) and CNN (a deep learning model). The section ends with a brief highlight of the
speech recognition evaluation metrics, WER and WRR. After establishing a comprehensive
understanding of the specific domain knowledge within machine learning and speech
recognition, a comparison between 3 past research works on similar speech recognition systems
across various aspects is demonstrated in table 1.
Programming Language
Data Management: Excellent data handling capacity and can perform parallel computation
(Python vs R vs SAS | Difference between Python, R and SAS, 2019)
GUI support:
- Python: GUI libraries such as PyQT5, PySide 2, Tkinter, Kivy, wxPython etc. (Top 5 Best
Python GUI Libraries - AskPython, 2020)
- R: GUI libraries such as R Package Explorer, Conference Tweet Dashboard, Bue Dashboard
(Singh, 2021) and gWidgets (Creating GUIs in R with GWidgets | R-Bloggers, 2010)
- SAS: Highly customizable GUIs can be created using frame entities and SCL code; the macro
QCKGUI is used to insert parameters into frame controls (Jain & Hanley, n.d.)
In the context of this project, Jupyter Notebook is chosen as the IDE. It is a powerful tool that
integrates code with other media such as narrative text, visualizations, mathematical equations,
videos, images, etc. It is a free, open-source and standalone software that is part of the Anaconda
data science toolkit. (Pryke, 2020) It can also run in web browsers like Firefox and Chrome. The
popularity of this application owes to its ability to strike a balance between a simple text editor
and feature-rich IDEs that require complicated initial setup. Hence, it is handy for solving
problems in data exploration, data pre-processing and modelling. Developers can also
understand and debug code easily through the text descriptions that explain the functionality of
the respective code blocks. It also supports segmentation into code cells, output cells and
markdown cells. A code cell displays the Python source code written by developers, an output
cell displays command-line output, images or other visualizations, whereas a markdown cell
contains headings alongside images and links. (Kazarinoff, 2022) In the context of this project,
the Jupyter Notebook extension for Visual Studio Code (VSC) will be used in order to provide
more RAM for complex machine learning operations within the local environment, instead of
the limited RAM and disk space allocated by default.
Numerical Python, generally referred to as NumPy, is a Python library that works on multi-
dimensional array objects. Unlike Python lists, which are one-dimensional, NumPy arrays can
be multi-dimensional; an n-dimensional array is called an "ndarray". Moreover, NumPy arrays
are homogeneous, in that they only accept elements of the same data type, whereas Python lists
are heterogeneous. In terms of efficiency, NumPy arrays can perform element-wise
mathematical operations such as addition and multiplication, and are faster than Python lists.
(Great Learning Team, 2022) In this project, NumPy will be used to perform the Fourier
transformation during MFCC extraction.
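As a small illustration of the Fourier transformation role mentioned above, the sketch below uses NumPy's real FFT to locate the dominant frequency of a synthetic tone; the sampling rate and signal are illustrative assumptions, not the project's actual pipeline.

```python
import numpy as np

# Illustrative sketch: locate the dominant frequency of a synthetic
# tone with NumPy's real FFT, the same operation that underlies the
# spectral step of MFCC extraction.
sr = 8000                                  # sampling rate in Hz (assumed)
t = np.arange(sr) / sr                     # one second of samples
signal = np.sin(2 * np.pi * 440 * t)       # 440 Hz sine wave
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)
peak_hz = freqs[np.argmax(spectrum)]       # dominant frequency bin
```

Because the one-second window contains a whole number of cycles, the spectrum's peak falls exactly on the 440 Hz bin.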
SciPy
Being one of the most comprehensive tools in Python, Scikit-learn provides statistical modelling
and machine learning functionalities such as supervised learning algorithms.
Keras is chosen as the deep learning library because it is a high-level API that supports neural
network computation. It has a relatively low learning curve as its front-end is Python-based,
allowing common code patterns and providing clear error messages upon execution failure.
Supporting almost every neural network model, it can run on top of multiple frameworks such as
TensorFlow, MXNet, CNTK, etc. (Simplilearn, 2021) In this project, Keras will be used to
perform CNN modelling on the extracted features.
A key component of this project is observing the changes in the graphs of the vocal waveform
during the feature extraction stage. Hence, Matplotlib is used as a visualization extension library
for NumPy to provide visual access to multi-dimensional arrays. Plots offered by Matplotlib
include the scatter plot, pie chart, line plot, histogram, etc. It can be installed on Windows,
macOS and Linux. (Python | Introduction to Matplotlib - GeeksforGeeks, 2018)
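A waveform inspection of the kind described above might look like the following sketch. The signal, labels and filename are illustrative assumptions; the non-interactive Agg backend is selected so the plot renders in a script without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # render without a display (for scripts)
import matplotlib.pyplot as plt

# Illustrative sketch: plot a short synthetic waveform the way a
# recorded vocal signal could be inspected before MFCC extraction.
sr = 8000
t = np.arange(sr // 10) / sr                   # 100 ms of samples
waveform = 0.5 * np.sin(2 * np.pi * 220 * t)   # placeholder 220 Hz tone

fig, ax = plt.subplots()
ax.plot(t, waveform, linewidth=0.8)
ax.set_xlabel("Time (s)")
ax.set_ylabel("Amplitude")
ax.set_title("Synthetic waveform (illustration)")
fig.savefig("waveform.png")
```

The same line-plot pattern applies to a real recording: substitute the array loaded from the audio file for the synthetic tone.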
3.3.5 GUI
Streamlit
Since this project aims to construct a simple yet functional GUI in the shortest possible time,
Streamlit is an optimal choice. It is a web application framework that can construct effective and
intuitive user interfaces quickly. Being an open-source Python library, it is compatible with
Python's data science and deep learning libraries and involves no front-end coding such as
HTML, CSS or JS. Moreover, it allows images, audio and videos to be uploaded, with additional
widget support such as sliders, buttons, checkboxes, radio buttons, selection boxes, etc.
Visualization on Streamlit using charts, graphs, maps and plots is also possible, which makes it
the optimal choice for data science based projects like the current implementation. (mhadhbi,
2021)
Figure 43: Overview of Operating System (Understanding Operating Systems - University of Wollongong – UOW, 2022)
Operating System (OS) is a frequent term in the technology field referring to a program that acts
as an interface between the user and the hardware components after it is loaded by a bootstrap. It
offers an environment for users to execute programs or applications and hides specific hardware
details through abstraction. (What Is Operating System? Explain Types of OS, Features and
Examples, 2020) To perform file and memory management, I/O operations and other tasks,
application programs running in the background need access to the Central Processing Unit
(CPU), memory and storage, with resources shared fairly among them. Thus, the OS facilitates
the interaction between the hardware and the application and system software, enabling the user
to interact with the programs as illustrated in the figure above. The most common OSs are
Microsoft Windows, which comes preloaded on most non-Apple laptops, macOS, which is
preloaded on all Apple laptops, and Linux, which is not preloaded but can be downloaded
voluntarily for free. Hence, Microsoft Windows 11 Pro has been chosen as the OS for this
project. (Understanding Operating Systems - University of Wollongong – UOW, 2022)
The minimum hardware requirements for this project are stated as follows:
a. Mouse
b. Keyboard
c. Monitor
d. Microphone (2 channels, 16 bits, 48000Hz)
e. Speaker (16 bits, 48000Hz)
f. Router (RJ45 / Wireless Fidelity (Wi-Fi))
The minimum software requirements for this project are stated as follows:
3.5 Summary
To sum up this chapter, a comparison between Python, R and SAS in applicability and other
aspects has been made. After such comparison, Python has been chosen and justified to be the
programming language for this project. This has to do with the variety of data pre-processing and
machine learning libraries available in Python. Next, Jupyter Notebook has been chosen to be the
IDE with Visual Studio Code (VSC) as the local environment to support Python and such
selection has been justified accordingly. Various Python libraries used for data pre-processing,
machine learning, deep learning, data visualization and GUI creation purposes have been
documented with their features highlighted to be applied during modelling phase, which is the
coding stage during FYP Semester 2. Lastly, Microsoft Windows 11 Pro has been chosen as the
OS for this project along with minimum hardware and software requirements stated clearly.
CHAPTER 4: METHODOLOGY
4.1 Introduction
In the context of research, the broad use of the term “methodology” refers to a set of guidelines
or framework used to solve a problem in specific domains. Since this is a data science based
project, data mining methodology will be imposed. This chapter will compare the three data
mining methodologies, namely: KDD, CRISP-DM and SEMMA, discuss the reasoning in
choosing one of these methodologies and give an in-depth explanation about the activities to be
carried out in each phase of the methodology chosen.
Stages: KDD consists of a total of 5 phases, CRISP-DM consists of 6 phases, and SEMMA
consists of 5 phases. (Quantum, 2019)
4.4 CRISP-DM
Figure 44: Overview of CRISP-DM methodology (Wirth, R., & Hipp, J., 2000, April)
Phase 1: Business Understanding
The project team or developers must have a thorough understanding of the project background to
formulate a well-defined analytic and business strategy. In order to achieve this, the business aim
or primary goal must be identified, which is the development of a fully functional ASR system
for e-learning sector in this case. (Great Learning Team, 2020) Having established aims, the
current situation must be assessed whereby the tangible benefits, non-tangible benefits,
functional requirements, non-functional requirements, budget allocated, project completion time,
constraints alongside potential risks are identified. Most importantly, resources of the project
must be documented through fact-finding techniques. These resources are as follows: hardware,
data mining tools, technical expertise and operational data. The business or data mining success
criteria can then be determined which are the pre-defined objects in helping the project team to
achieve the aim. Lastly, a project plan and Gantt Chart are documented using Microsoft Excel
and Microsoft Project respectively. (Crisp DM Methodology - Smart Vision Europe, 2020)
Phase 2: Data Understanding
After understanding the domain background, the initial dataset used to train the model must be
collected and loaded into the chosen IDE, Jupyter Notebook within local VSC environment in
this case. For this project, the dataset will be retrieved from an online source comprising of audio
files and their corresponding transcript as reference. The surface properties of the dataset are
then examined extensively by listing out the number of observations, attributes, data type, range
and meaning of attribute values in business terms. (Great Learning Team, 2020) Data exploration
is then performed by the developers to find correlation between acoustic properties, identify
target variable, calculate simple aggregations and compute statistical analysis. The data’s quality
is also verified to ensure that there are no missing fields, noisy data or erroneous values. (Crisp
DM Methodology - Smart Vision Europe, 2020)
Phase 3: Data Preparation
The dataset used to train the acoustic model for ASR in the next stage is selected and the reason
for the choice justified. To ensure that the output is accurate and not prone to misleading
information, data cleaning practices are adhered by developers. This includes filling missing
values with a general indicator, eliminating erroneous values and formatting values of specific
field into proper data types. (Great Learning Team, 2020) As explained earlier, feature extraction
will be performed at this stage whereby the flow of speech signals will be converted into
numerical vectorized form, i.e., MFCC or FFT to act as an input to the acoustic model. With
more features being retrieved, the deep learning model deduced will have a higher speech
recognition accuracy.
Phase 4: Modelling
As pointed out in the previous section, several statistical or machine learning techniques will be
implemented as potential acoustic models in formulating a fully functional ASR and assumptions
about data or tools to integrate with the model are made. Next, to validate each model's quality, a
test design is conducted by separating the dataset into training, validation and testing segments
whereby the former is used to construct the model, the middle is used to validate the model
during training and the latter is used to estimate the model’s quality after training. Upon running
the model using tools, the parameter settings and their reasoning are justified. The model is then
run using the pre-processed signals as input, with the obtained results evaluated using WER and
WRR as explained in section 2.2. Moving on, the special features and potential issues are derived
from interpretations of results. The model is then assessed continuously by incorporating
business success criteria, previous findings and other metrics into consideration and previous
steps are repeated multiple times until the best model is found. (Great Learning Team, 2020)
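The test design described above, separating the dataset into training, validation and testing segments, can be sketched as follows. The 70/15/15 ratios, the fixed seed and the function name are illustrative assumptions, not values fixed by the project.

```python
import numpy as np

def split_dataset(n_samples, ratios=(0.7, 0.15, 0.15), seed=0):
    """Sketch of the test design described above: shuffle sample
    indices and split them into training, validation and testing
    segments. Ratios and seed are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(ratios[0] * n_samples)
    n_val = int(ratios[1] * n_samples)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(100)
```

Shuffling before splitting avoids ordering bias, and the three index sets are disjoint, so no sample used to construct the model also contributes to its quality estimate.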
Phase 5: Evaluation
Unlike the previous stage, which evaluated the model's accuracy, developers start off by assessing the general aspects of the project. In this regard, the models are evaluated on the basis of whether they meet the business objectives. If all models satisfy these requirements, the ASR model with the greatest applicability in solving the issues of existing e-learning systems within educational institutions is approved. To this end, the developers conduct a comprehensive review of the previous stages to highlight quality assurance issues that may have been overlooked. The next step is then determined: whether the project team should proceed to the next phase if modelling results are deemed up to par, or re-iterate previous stages to refine the model within pre-defined constraints (Crisp DM Methodology - Smart Vision Europe, 2020).
Phase 6: Deployment
Before releasing the final product, the ASR system, to be incorporated into e-learning systems, implementation strategies are summarized in a deployment plan. The model will be deployed as a Streamlit web application with other functionalities available. Monitoring and maintenance activities are to be carried out from time to time to retain the system's performance and determine the threshold at which the model is deemed inapplicable within the system. The final report is then documented to provide target users (students and tutors) with a comprehensive summary of the deliverables. Last but not least, a project retrospective is conducted, either through interviews or questionnaires, to gather information from end users regarding the system's features, drawbacks, potential enhancements and other experiences (Great Learning Team, 2020).
4.5 Summary
To sum things up, out of the three data mining methodologies discussed in section 4.2, CRISP-
DM is chosen to be the most suitable methodology for this project. Generally speaking, it is the
standard methodology in a variety of industries worldwide especially in the data science sector.
Its well-defined process, in-depth documentation and high flexibility in transitioning between phases make it widely used in many applications. Moreover, developers do not need much prerequisite data mining knowledge to grasp the main activities in each stage.
There are a total of 6 stages in CRISP-DM. Firstly, in the business understanding stage, developers must understand the existing issues within the e-learning and corresponding domains, as well as the aims and objectives of integrating an English-language ASR system into e-learning systems. In the data understanding phase, developers should find a suitable speech dataset in the form of online tutorial videos. In the data preparation phase, the speech signals must be segmented, with background noise removed, so that acoustic characteristics or features can be extracted through
the MFCC technique. The project team must then utilize the filtered acoustic inputs to train and validate multiple machine learning models, with backtracking involved, to obtain the best model. The models' performances are then evaluated using WER and CER as the two indicators, after which the model with the highest speech recognition accuracy is chosen. Lastly, the acoustic model is deployed into an ASR system featuring a Streamlit GUI and other functionalities for users to interact with the e-learning system. A contingency and maintenance plan is also documented to monitor the product's effectiveness in the long run.
5.2 Metadata
As shown in the figure, the chosen LJ Speech dataset for the project consists of 3 components, namely a “wavs” folder that stores individual “.wav” format audio files as illustrated in figure 47, a README file that stores metadata about the dataset and a “.csv” file. Each audio file in the “wavs” folder has a sample rate of 22,050 Hz, consists of single-channel 16-bit Pulse-Code Modulation (PCM) samples and ranges from 1.11 to 10.1 seconds in duration. Within the “.csv” file, there are 13,100 observations, each having 3 attributes that are tabulated in table 4. The metadata also mentions that a total of 225,715 words were spoken by readers while reading passages from 7 non-fiction books, of which 13,821 are unique words (The LJ Speech Dataset, 2016).
The figure above shows the first 5 rows of the dataset being displayed. The code uses the “pandas” library's “read_csv()” function so that the information in the “metadata.csv” dataset is stored in a DataFrame object. The parameter “sep” indicates that the pipe character (“|”) is the separator between columns, and “header”, set to None, indicates that default numbering starting from 0 will be assigned to each column.
The default numbering of column names is replaced with corresponding names similar to that in
the metadata as shown in the figure above with the first 5 rows displayed.
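The loading and renaming steps described above can be sketched as follows, using a hypothetical in-memory sample in the same pipe-separated, headerless format (the sample rows are illustrative, not actual dataset content):

```python
import io

import pandas as pd

# Hypothetical two-row sample mimicking the pipe-separated, headerless
# layout of the LJ Speech "metadata.csv" file.
sample = (
    "LJ001-0001|Printing, in the only sense|Printing, in the only sense\n"
    "LJ001-0002|in being comparatively modern.|in being comparatively modern."
)

df = pd.read_csv(io.StringIO(sample), sep="|", header=None)
df.columns = ["File Name", "Transcript", "Normalized Transcript"]
print(df.shape)  # (2, 3)
```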
The figure above shows the number of rows and columns available in the form of a tuple,
indicating there are 13,100 rows and 3 attributes in the original dataset. This complies with the
information presented in the metadata section.
The figure above shows an overview of the DataFrame structure derived from the dataset. It provides information on the data type of each attribute, the number of non-null values and the total memory usage needed to load the data. It is observed that the datatype of all attributes is “object”, which indicates a string or a mixture of strings and other data types. There are no missing or null values in any of the 3 attributes except for the “Normalized Transcript” column, which has 16 missing values. To address this, data cleaning and pre-processing will be carried out accordingly in the next section.
Figure 51: Code and output for total and unique word count in "Normalized Transcript" column
By iterating through each row of the DataFrame object, word tokenization is applied to the sentence in the “Normalized Transcript” column. The total number of tokens, excluding special characters, is computed. Each token is then added to a “set” object, from which the total unique token count across the corresponding column of all rows is computed, as shown in the figure above.
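The counting procedure just described can be sketched as follows, using two illustrative transcripts and a simple regular-expression tokenizer (an assumption standing in for the report's tokenizer):

```python
import re

# Two illustrative transcripts standing in for "Normalized Transcript" rows.
transcripts = [
    "The quick brown fox.",
    "The lazy dog sleeps!",
]

total, unique = 0, set()
for sentence in transcripts:
    # Keep only word tokens, dropping punctuation and special characters.
    tokens = re.findall(r"[A-Za-z']+", sentence.lower())
    total += len(tokens)
    unique.update(tokens)

print(total, len(unique))  # 8 7
```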
Figure 52: Code and output to display frequencies and number of samples for an audio file
The figure above shows 5 randomized audio files selected from the dataset. The sample rate (frequency) and signal of each audio file are obtained by reading the file's path using the “read” function of SciPy's “wavfile” module. The names of the 5 randomized audio files are then stored in a list called “eda” for further usage in visualization. The frequency of each audio file is displayed, and all of them share the same sample rate of 22,050 Hz. The number of samples in each audio file differs from one to another due to different speaking speeds and durations.
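A minimal sketch of reading a WAV file with SciPy follows; since the dataset's files are not bundled here, a synthetic 16-bit tone at the same 22,050 Hz rate is written first and then read back:

```python
import numpy as np
from scipy.io import wavfile

# Synthesize a half-second, 16-bit mono tone at the dataset's 22,050 Hz
# rate (a stand-in for one of the LJ Speech "wavs" files).
sr = 22050
t = np.linspace(0, 0.5, int(sr * 0.5), endpoint=False)
tone = (0.3 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
wavfile.write("sample.wav", sr, tone)

# wavfile.read returns the sample rate and the raw signal array.
rate, signal = wavfile.read("sample.wav")
print(rate, signal.shape[0])  # 22050 11025
```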
Figure 53: Code and output for dropping rows that contain empty values
When the “dropna()” function is called on the DataFrame object in the figure above, all rows that contain one or more missing values in any column are removed. As such, the 16 rows with missing values in “Normalized Transcript” have been removed, leaving 13,084 non-missing rows.
The code in the figure above is used to drop the “Transcript” column from the DataFrame object by calling the “drop” method. The “axis” parameter is set to 1, indicating a column-wise drop. The column is redundant given the “Normalized Transcript” column, since the former keeps abbreviations in their short form whereas the latter expresses them in full word form. The first 5 rows of the DataFrame are displayed, confirming the column has indeed been dropped.
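Both cleaning steps can be sketched together on a toy frame (the rows below are illustrative, not actual dataset content):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing normalized transcript.
df = pd.DataFrame({
    "File Name": ["a", "b", "c"],
    "Transcript": ["Dr. Smith", "No. 5", "ten"],
    "Normalized Transcript": ["Doctor Smith", np.nan, "ten"],
})

df = df.dropna()                    # remove rows with any missing value
df = df.drop("Transcript", axis=1)  # column-wise drop of the redundant column
print(df.shape)  # (2, 2)
```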
Figure 55: Code and output for dropping rows that contain non-ASCII characters in "Normalized Transcript" column
The figure above demonstrates the removal of rows that contain non-ASCII characters, with the corresponding output displayed: 23 out of the original 13,084 rows have been dropped, leaving 13,061 rows. A regular expression check on whether the normalized transcript contains any characters beyond the 7-bit (ASCII) range is applied using the “str.contains” method, which returns a Boolean Series. Then, Boolean indexing is performed with the “loc[]” accessor to select only the rows that do not satisfy the condition above. This process is crucial since non-ASCII characters are not valid English characters or words and are considered erroneous data.
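The filtering logic can be sketched as follows on illustrative rows, using the same `str.contains` and `loc` pattern:

```python
import pandas as pd

df = pd.DataFrame({
    "Normalized Transcript": ["plain ascii text", "café au lait", "more text"],
})

# True where the transcript has any character outside the 7-bit ASCII range.
mask = df["Normalized Transcript"].str.contains(r"[^\x00-\x7F]", regex=True)
df = df.loc[~mask]
print(len(df))  # 2
```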
Figure 56: Code and Output for computing word frequency distribution
The code in the figure above produces a word frequency distribution of each word in the “Normalized Transcript” column in the form of a dictionary, with each key-value pair representing a word and its frequency count respectively. The dictionary is then sorted in descending order based on the frequency count of each word, and words with a frequency count greater than or equal to 20 are filtered to be displayed to the console. From the output, it appears that conjunctions, prepositions and pronouns are the most common words in the context of this project.
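The sort-then-filter pattern can be sketched as follows, with a lower cut-off of 2 standing in for the report's threshold of 20 on these toy transcripts:

```python
from collections import Counter

transcripts = ["the cat and the dog", "the dog and the bird"]
freq = Counter()
for sentence in transcripts:
    freq.update(sentence.split())

# Sort descending by count and keep words at or above the cut-off.
common = {w: c for w, c in sorted(freq.items(), key=lambda kv: -kv[1]) if c >= 2}
print(common)  # {'the': 4, 'and': 2, 'dog': 2}
```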
The code in the figure above creates an empty DataFrame object (“filtered_df”) with 2 columns, “File Name” and “Normalized Transcript”, assigned to it. The original DataFrame object (“df”) is then iterated to perform word tokenization on the transcript column. If the proportion of recognized words in the tokenized list is greater than or equal to 90%, the corresponding row is added to “filtered_df” and the counter is incremented by one. It is expected that the deep learning model is more likely to recognize speech when the word-found percentage in each audio file is greater. From the output in the figure below, there are a total of 2,612 audio files that satisfy the condition above, with the first 5 rows displayed in the console.
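A sketch of the proportion-based filter follows; the reference vocabulary and rows here are purely hypothetical stand-ins for the word list the report checks against:

```python
import pandas as pd

# Hypothetical reference vocabulary and rows, for illustration only.
vocab = {"the", "cat", "sat", "on", "mat"}
df = pd.DataFrame({
    "File Name": ["a1", "a2"],
    "Normalized Transcript": ["the cat sat on the mat", "quantum flux capacitor"],
})

rows = []
for _, row in df.iterrows():
    tokens = row["Normalized Transcript"].split()
    found = sum(tok in vocab for tok in tokens) / len(tokens)
    if found >= 0.9:  # keep rows where at least 90% of words are recognized
        rows.append(row)

filtered_df = pd.DataFrame(rows)
print(len(filtered_df))  # 1
```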
Figure 59: Code and output for total and unique word count in "Normalized Transcript" column after Data Sampling
After performing data sampling, the code above is used to compute the total and unique word count in the “Normalized Transcript” column of the filtered DataFrame object. By iterating through each row of the DataFrame object, word tokenization is applied to the sentence in the “Normalized Transcript” column. The total number of tokens, excluding special characters, is computed. Each token is then added to a “set” object, from which the total unique token count across all rows is computed. It is observed that the total word count has been reduced from 223,964 to 44,140, whereas the unique word count has been reduced from 14,364 to 3,201. Within the constraint of limited processing resources, such a smaller subset of data can help train the model within a shorter time frame.
Figure 60: Code Snippet to Create list of dictionary from DataFrame Object
The function above first creates an empty list which will store the dictionary for each input. Then, the DataFrame object is iterated using the “iterrows” method to obtain the file path stored in the local directory and the transcription, by accessing the “File Name” and “Normalized Transcript” columns respectively. A dictionary is then created with 2 key-value pairs, for audio and transcript, and inserted into the list. This facilitates the retrieval of data from the DataFrame object, as it is easier to retrieve data from a list of dictionaries.
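A minimal reconstruction of this helper is sketched below; the “wavs” root path and the “audio”/“transcript” key names are assumptions based on the dataset's folder layout and the description above:

```python
import pandas as pd

# Hypothetical reconstruction of the "create_dct_from_df" helper.
def create_dct_from_df(df, root="wavs"):
    records = []
    for _, row in df.iterrows():
        records.append({
            "audio": f"{root}/{row['File Name']}.wav",
            "transcript": row["Normalized Transcript"],
        })
    return records

df = pd.DataFrame({"File Name": ["LJ001-0001"],
                   "Normalized Transcript": ["printing in the only sense"]})
print(create_dct_from_df(df))
```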
With easily accessible inputs ready, each audio file must be converted into a spectrogram, which is a representation of the audio signal in the time-frequency domain, to be accepted and trained on by the models. As such, the code above starts off by reading the audio file using the “tf.io.read_file” function. The file is then decoded using the “tf.audio.decode_wav” function with a single channel, and squeezed by removing the last axis. The Short-Time Fourier Transform (STFT) of the audio signal is computed using the “tf.signal.stft” function. It accepts a floating-point signal Tensor (“sig”), the number of samples per frame (“frame_length”), the number of samples between successive frames (“frame_step”) and the number of points in the FFT (“fft_length”) as input and produces the spectrogram. (Tf.signal.stft | TensorFlow V2.12.0, 2023) To normalize the spectrogram, a power transformation is applied to its absolute value, its mean is subtracted using the “tf.math.reduce_mean” function, and the result is divided by the standard deviation using the “tf.math.reduce_std” function. This produces a final 2D matrix whose dimensions represent the number of frames in the audio and the number of frequency bins respectively, with values standardized to zero mean and unit standard deviation.
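The STFT and normalization steps can be sketched on a synthetic signal as follows; the frame and FFT values are assumptions (an fft_length of 384 yields the 193 frequency bins seen later in the model summary):

```python
import tensorflow as tf

# Synthetic one-second mono signal standing in for a decoded ".wav" tensor.
sig = tf.sin(tf.linspace(0.0, 100.0, 22050))

# Assumed STFT settings: frame_length=256, frame_step=160, fft_length=384.
spec = tf.signal.stft(sig, frame_length=256, frame_step=160, fft_length=384)
spec = tf.math.pow(tf.abs(spec), 0.5)  # power transformation of the magnitude

# Standardize each frame: subtract the mean, divide by the standard deviation.
mean = tf.math.reduce_mean(spec, 1, keepdims=True)
std = tf.math.reduce_std(spec, 1, keepdims=True)
spec = (spec - mean) / (std + 1e-10)
print(spec.shape)  # (frames, fft_length // 2 + 1) = (frames, 193)
```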
Figure 62: Code Snippet for Vocabulary Set with Encoder and Decoder
The code above is used to define the vocabulary set, with an encoder and decoder, for the “normalized_transcript” column. First, a list of characters is defined to be used as the encoder and decoder's vocabulary. Keras's “StringLookup” layer is then called to build the encoder. It accepts the vocabulary list and an “oov_token”, set to an empty string, to specify out-of-vocabulary characters. The decoder is built using the same function with the same parameters, with an additional “invert” parameter set to True. The mapping from indices to characters is thereby inverted, which facilitates the decoding of vectorized integer sequences back into character sequences. The “get_vocabulary()” and “vocabulary_size()” functions are then called to display the list of vocabulary entries and their total count in the output panel. (Tf.keras.layers.StringLookup | TensorFlow V2.12.0, 2023)
The code above converts the inputted text or transcript into lowercase form using the “tf.strings.lower” function. Then, “tf.strings.unicode_split” is applied to split the TensorFlow string object into a list of Unicode characters by specifying the “input_encoding” as “UTF-8”. (Module: Tf.strings | TensorFlow V2.12.0, 2023) The “encoder” object defined in section 5.4.4.3 is then applied to map the Unicode characters into a sequence of integers.
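Both the lookup layers and the encoding step can be sketched together; the exact character list below is an assumption (it happens to give 31 vocabulary entries including the out-of-vocabulary slot):

```python
import tensorflow as tf

# Assumed character vocabulary: 26 letters plus apostrophe, "?", "!" and space.
vocab = list("abcdefghijklmnopqrstuvwxyz'?! ")
encoder = tf.keras.layers.StringLookup(vocabulary=vocab, oov_token="")
decoder = tf.keras.layers.StringLookup(vocabulary=vocab, oov_token="", invert=True)

text = tf.strings.lower("Hello")
chars = tf.strings.unicode_split(text, input_encoding="UTF-8")
ids = encoder(chars)                                   # characters -> integers
roundtrip = tf.strings.reduce_join(decoder(ids)).numpy().decode()
print(encoder.vocabulary_size(), roundtrip)  # 31 hello
```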
The code above is used to construct a tuple by taking the path of an audio file and its
corresponding transcript as arguments.
The function above takes a DataFrame object and the batch size of the TensorFlow dataset as arguments. The list of dictionaries is first created by calling the “create_dct_from_df” function. The audio paths and transcripts are then stored in respective lists. A TensorFlow dataset is then created using the “tf.data.Dataset.from_tensor_slices” function, which accepts a tuple in the form of audio paths and transcripts as input. The “merger” function is then called within the “map” function to convert each tuple into a tuple of spectrogram and label. Next, the “padded_batch” method is used to compile the tuple data into batches, with “batch_size” specifying the number of samples in each batch. Each sample within a batch must have the same shape, which is achieved by padding shorter samples with zeros. The “prefetch” method then performs asynchronous operations on the current and next batch dynamically, by assigning the “buffer_size” parameter to “tf.data.AUTOTUNE”, to improve modelling performance. The Dataset object is then formulated, with its size displayed using the “cardinality” function. (Tf.data.Dataset | TensorFlow V2.12.0, 2023)
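The padding and prefetching behaviour can be sketched on toy variable-length sequences standing in for encoded transcripts:

```python
import tensorflow as tf

# Variable-length integer sequences standing in for encoded transcripts.
labels = [[1, 2], [3, 4, 5], [6]]
ds = tf.data.Dataset.from_generator(
    lambda: iter(labels),
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.int32),
)

# padded_batch zero-pads each sample to the longest sample in its batch;
# prefetch overlaps producing the next batch with consuming the current one.
ds = ds.padded_batch(3).prefetch(tf.data.AUTOTUNE)
batch = next(iter(ds))
print(batch.numpy().tolist())  # [[1, 2, 0], [3, 4, 5], [6, 0, 0]]
```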
Figure 66: Code Snippet to plot bar chart for token frequency dictionary
The code above utilizes the dictionary of individual token frequencies to plot a vertical bar chart. First, the key-value pairs are converted into respective lists. Tokens with a frequency of fewer than 1,200 are grouped under the label “others”, whereas tokens with a frequency greater than or equal to 1,200 are labelled as themselves. The plot is then formed with the “subplots” function, with the X-axis's tokens rotated by 45 degrees using the “xticks” function's rotation parameter to prevent overlapping of tokens upon display. (Matplotlib.pyplot.xticks — Matplotlib 3.7.1 Documentation, 2023)
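The “others” bucketing logic (separate from the plotting itself) can be sketched with hypothetical frequencies, keeping the report's threshold of 1,200:

```python
# Hypothetical token frequencies; the 1,200 threshold follows the report.
freq = {"the": 4000, "of": 2000, "rare": 300, "zebra": 100}

tokens, counts, others = [], [], 0
for tok, count in freq.items():
    if count >= 1200:
        tokens.append(tok)
        counts.append(count)
    else:
        others += count  # fold low-frequency tokens into a single bar
tokens.append("others")
counts.append(others)
print(tokens, counts)  # ['the', 'of', 'others'] [4000, 2000, 400]
```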
From the bar chart, we can infer that the token “the” has the highest frequency count, roughly twice that of the 2nd highest token “of”. We can also observe that apart from the “others” group, all other common words are either prepositions (‘of’, ‘to’, ‘in’, ‘on’, ‘with’, ‘by’, ‘at’), conjunctions (‘and’, ‘for’, ‘as’), pronouns (‘that’, ‘he’, ‘his’, ‘it’, ‘which’), determiners (‘the’, ‘a’) or finite verbs (‘was’, ‘had’, ‘were’).
Figure 68: Code Snippet to plot histogram for individual transcript's token count versus frequency
The code above creates a histogram plot using seaborn's “histplot” method. The plot shows the frequency distribution of the number of tokens in each transcript. Upon adding the “kde=True” parameter, a kernel density estimation line is added to the plot, which estimates the probability density of the underlying data's distribution. The “axvline” function is used to draw a red vertical line representing the mean number of tokens per transcript. (Matplotlib.pyplot.axvline — Matplotlib 3.7.1 Documentation, 2023)
Figure 69: Histogram for individual transcript's token count versus frequency
The histogram above, with the density curve overlaid on top of it, shows that the distribution is relatively normal, with a peak around the 20 mark. This implies that the number of tokens in most transcripts is between 15 and 25, with the mean value around the 17 to 18 mark. The token count for the majority of transcripts falls between 0 and 35. Moving away from this range, the frequency tapers off.
Figure 70: Code Snippet to plot histogram for individual transcript's token count versus frequency after data sampling
The code above creates a histogram plot using seaborn's “histplot” method, showing the frequency distribution of the number of tokens in each transcript after data sampling. Upon adding the “kde=True” parameter, a kernel density estimation line is added to the plot, which estimates the probability density of the underlying data's distribution. The “axvline” function is used to draw a red vertical line representing the mean number of tokens per transcript after data sampling. (Matplotlib.pyplot.axvline — Matplotlib 3.7.1 Documentation, 2023)
Figure 71: Histogram for individual transcript's token count versus frequency after data sampling
The histogram above, with the density curve overlaid on top of it, shows that the distribution is bimodal: the middle bar is lower than its two neighbouring bars, indicating that there are two groups of filtered transcripts, one with fewer than 15 tokens and the other with more than 15 tokens. With two such groups, the set of filtered transcripts is more generalized and does not concentrate solely near the median or mean of the graph. Moreover, the mean token count per transcript after data sampling is similar to that before data sampling, suggesting that the data sampling technique applied is suitable.
5.5.2 Audio
Figure 72: Code Snippet to plot waveform, MFCC and Mel Coefficients
The code above plots the waveform, MFCC and Mel coefficients of the 5 randomized audio files that were stored in the “eda” list in figure 52. First, the list of audio names is iterated to form each audio path by merging it with the root path. The frequency and signal of the audio file are then obtained using the “wav.read” function, which accepts the audio path as input. To plot the raw waveform of the audio file, the number of samples is obtained by accessing the first element of the signal's shape. Then, the X-axis, which represents time in seconds, is computed by dividing the sequence of sample indices from 0 to “total – 1” by the sampling frequency, alongside the Y-axis, which represents the signal. The two axes are then plotted using the “matplotlib” module's “plot” function. Horizontal red lines indicating the minimum and maximum audio signals are also plotted using the “axhline” function.
During the computation of the Mel Frequency Cepstral Coefficients (MFCC) of the audio signal, the “mfcc” function from the “python_speech_features” module is applied. It accepts the audio signal, the sampling frequency, NFFT (the number of data points in the Fast Fourier Transform (FFT)) and a Hamming window function to be applied to each frame. (Welcome to Python_speech_features's Documentation! — Python_speech_features 0.1.0 Documentation, 2013) The NFFT is set to 1024 instead of the default value of 512, since some audio frames exceed 512 data points and would otherwise be trimmed, failing to capture the audio features within each frame. The NFFT must also be a power of 2, since this results in the most efficient algorithm execution. (Søndergaard, n.d.) Normalization is then performed by computing the mean and standard deviation of the MFCC coefficients across the frames. The computed mean is subtracted from each MFCC coefficient, and the result is divided by the standard deviation. Finally, the MFCC matrix is transposed so that each row corresponds to a coefficient. The transposed matrix is then displayed in the form of a heatmap using the “matplotlib” module's “imshow()” function. (Matplotlib.pyplot.imshow — Matplotlib 3.7.1 Documentation, 2023)
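The normalize-and-transpose step can be sketched with a stand-in matrix (real values would come from python_speech_features' mfcc(), which is not bundled here):

```python
import numpy as np

# Stand-in MFCC matrix of 4 frames x 13 coefficients, for illustration.
rng = np.random.default_rng(0)
mfcc_feat = rng.normal(size=(4, 13))

# Normalize per coefficient across frames, then transpose so that rows are
# coefficients and columns are frames, ready for imshow().
norm = (mfcc_feat - mfcc_feat.mean(axis=0)) / mfcc_feat.std(axis=0)
heat = norm.T
print(heat.shape)  # (13, 4)
```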
In visualizing the Mel filterbank energies of an audio file, the “python_speech_features” module's “logfbank” function is used, which accepts an audio signal, sample rate and NFFT as inputs. The output is a 2D NumPy array whose shape consists of the number of overlapping windows in the FFT and the number of triangular filters in the Mel-scale filterbank. The array is then transposed so that the X-axis corresponds to the windows and the Y-axis corresponds to the Mel filterbank coefficients. (Søndergaard, n.d.) The transposed matrix is then displayed in the form of a heatmap using the “matplotlib” module's “imshow()” function. (Matplotlib.pyplot.imshow — Matplotlib 3.7.1 Documentation, 2023)
Figure 73: Raw Waveform for Amplitude Against Time for Audio File
The raw waveform plot above shows the amplitude of the audio signal over time in seconds for one audio file. The amplitude refers to the audio's magnitude, shown here in raw 16-bit sample units. The audio file above has a total duration of just over 2.5 seconds. An amplitude peak is recorded around the 1-second mark, with a peak value just below 20,000. As time progresses, the amplitude of the audio file decreases, with several intervals of near-complete silence containing minimal background noise. This implies that the main content or vocal focus of the speech is in the front-to-middle section of the audio file.
Figure 74: Heatmap for MFCC Coefficients against Windows for Audio File
The heatmap above represents the intensity of energy across the different coefficients of each window. For warmer colours such as red or brown, the magnitude of the coefficient or energy at that particular window is stronger, whereas for lighter colours such as orange or yellow, it is weaker. Moreover, dark-coloured regions in a specific window are more likely to be distinguishable from other speech patterns. From the heatmap above, we can observe that most of the red or brown-coloured window-MFCC pairs are concentrated in the start-to-middle section of the audio file, ranging from the 0th to the 150th window. This is aligned with our findings from the raw waveform plot in figure 73.
Figure 75: Heatmap for Mel Coefficients against Windows for Audio File
The heatmap above represents the intensity of energy across the different Mel filterbank coefficients of each window. We can see that in the starting-to-middle section of the audio file, between the 0th and 150th windows, the dark-coloured sections are concentrated in the first few coefficients only. This implies that the audio in this region has more energy in the lower frequency range than in the higher frequency range. However, in the last section of the audio file, between the 175th and 250th windows, the dispersion of energy throughout all frequency ranges is relatively high, with an even spread of dark-coloured regions across all coefficients.
A function is defined which accepts a DataFrame object, a batch size for the TensorFlow Dataset with a default value of 16, a training proportion with a default value of 0.7 (70%), a validation proportion with a default value of 0.15 (15%) and a testing proportion with a default value of 0.15 (15%). First, the “formulate_dataset” function is called to create a TensorFlow Dataset object with the specified batch size. The size of the Dataset object is then computed and multiplied by the training and validation proportions to obtain the training and validation intervals respectively. The training set (“train_ds”) is created by taking the first “train_interval” elements; the validation set (“val_ds”) is created by skipping the first “train_interval” elements and taking the next “val_interval” elements; the testing set (“test_ds”) is created by skipping both intervals' elements and taking the remaining ones. The cardinality of each set is displayed, with the training, validation and testing TensorFlow Datasets returned.
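The take/skip partitioning can be sketched on a dataset of 105 dummy batches, matching the batch count used in this project:

```python
import tensorflow as tf

# A dataset of 105 dummy batches to illustrate the 70/15/15 split.
ds = tf.data.Dataset.range(105)
n = int(ds.cardinality().numpy())
train_n, val_n = int(n * 0.7), int(n * 0.15)

train_ds = ds.take(train_n)
val_ds = ds.skip(train_n).take(val_n)
test_ds = ds.skip(train_n + val_n)
sizes = [int(d.cardinality().numpy()) for d in (train_ds, val_ds, test_ds)]
print(sizes)  # [73, 15, 17]
```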
By calling the function above, data partitioning is performed using the filtered DataFrame object, a batch size of 25 and the default proportions as arguments. As such, each dataset's batches consist of 25 samples, and the merged dataset has a total of 105 batches. Of these, 70% or 73 batches belong to the training set, 15% or 15 batches belong to the validation set and the remaining 15% or 17 batches belong to the testing set, as shown in the figure above.
5.7 Modelling
After performing Exploratory Data Analysis (EDA), data cleaning and visualization, we have a thorough understanding of the acoustic and transcript properties of our dataset. In this section, we will construct a standalone GRU, a CNN-GRU and a CNN-LSTM model. Across these models, we will experiment with combinations of different hyperparameters, epochs, optimization techniques and RNN units to determine the optimal model under the constraint of limited processing resources.
The CTC loss function above accepts the actual and predicted vector sequences as input. It first computes the number of batches in the input sequence by converting the first element of the Tensor's shape into the “int64” data type through the “tf.cast” function. Then, it computes the lengths of the actual and predicted vector sequences by multiplying the number of time steps in each sequence with a 2-D Tensor of ones that has the same batch dimension as computed previously. Finally, the “ctc_batch_cost” function is called with the sequences and their corresponding lengths as input to compute the CTC loss between the actual and predicted sequences.
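A sketch of this wrapper follows; the toy label and probability tensors are illustrative stand-ins for real encoded transcripts and model outputs:

```python
import tensorflow as tf

# Sketch of the CTC loss wrapper described above. Shapes are assumed to be
# (batch, label_len) for y_true and (batch, time_steps, classes) for y_pred.
def ctc_loss(y_true, y_pred):
    batch = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_len = tf.cast(tf.shape(y_pred)[1], dtype="int64") * tf.ones((batch, 1), dtype="int64")
    label_len = tf.cast(tf.shape(y_true)[1], dtype="int64") * tf.ones((batch, 1), dtype="int64")
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

y_true = tf.constant([[1, 2]])                                # one 2-label sample
y_pred = tf.nn.softmax(tf.random.normal([1, 5, 4]), axis=-1)  # 5 steps, 4 classes
loss = ctc_loss(y_true, y_pred)
print(loss.shape)  # (1, 1)
```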
5.7.2 CNN-GRU
5.7.2.1 Non-regularized CNN-GRU
The code above constructs a Convolutional Neural Network (CNN) with Gated Recurrent Units (GRU) model. It consists of 2 convolutional layers, 3 bidirectional GRU layers and 2 dense layers. Since the spectrogram, i.e., the input vector sequence obtained using the Short-Time Fourier Transform (STFT) technique, is symmetrical, only the first half of the values are required to represent the frequency domain. As such, the input dimension is half the size of the Fast Fourier Transform (FFT). The output dimension corresponds to the number of characters in the pre-defined vocabulary list.
To make the input layer compatible with the 4D shape expected by the CNN, it is reshaped using Keras's “Reshape” layer. The 1st convolutional layer is defined with 32 filters, a kernel size of 11x41, a stride of 2x2 and a Rectified Linear Unit (ReLU) activation function using Keras's “Conv2D” layer. To normalize the previous layer's outputs to have a mean of 0 and a standard deviation of 1, a Batch Normalization layer is added. (Brownlee, 2019) The 2nd convolutional layer is then defined with 32 filters, a kernel size of 11x21, a stride of 1x2 and a ReLU activation function, with a Batch Normalization layer applied afterwards.
Before proceeding to the bidirectional GRU layers, the previous layer must be flattened into a 3D Tensor using the “Reshape” layer. It is then passed through a loop of three iterations, each feeding it into a GRU layer with 256 RNN units and the “return_sequences” and “reset_after” parameters set to True. With “return_sequences” set to True, the full sequence output in the form of batch size, timesteps of the input sequence and RNN units is returned, instead of only the batch size and RNN units. With the “reset_after” parameter set to True, the reset gate is applied after the matrix multiplication instead of before, which can better capture the features in speech signals. (Team, 2014) Each GRU layer is wrapped inside a Bidirectional layer so that it can incorporate timesteps of vector sequences from both the past and future. A Dropout layer is added after each Bidirectional GRU layer except the last one to prevent overfitting, by randomly dropping half of the input units for each batch of the training set. (Team, 2023)
The previous layer is then passed to a Dense layer with twice the number of RNN units as its output dimensionality and a ReLU activation function. This computes the dot product between the inputs and the layer's kernel, resulting in a Tensor shape comprising batch size, timesteps and double the RNN units. (Team, 2023) The resulting layer is then followed by a Dropout layer that randomly drops half of the input units. Using the results from the Dropout layer, the output layer is obtained in the form of a Dense layer with 1 more unit than the output dimension and a softmax activation function. To account for the inconsistent length of output sequences relative to input sequences during computation of the CTC loss function, a blank label is added to the original output dimension. (Graves et al., 2013)
The model is then compiled using the CTC loss function and the “Adam” optimizer with a learning rate of 0.001. The “Adam” optimizer is an extension of the Stochastic Gradient Descent (SGD) algorithm that utilizes both the first and second moments of the gradient during training. It is preferable over other optimizers due to its adaptive learning rate in updating each network weight individually, lower memory requirements, faster computation time and reduced tuning effort. (Gupta, 2021) Finally, the model summary is displayed in the console using the “summary()” method, as shown in the figure below.
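A structural sketch of the architecture just described follows. Layer sizes are taken from the text; the “same” padding is inferred from the 1568-unit reshape reported in the summary, and the 0.5 dropout rate reflects “dropping half of the input units”:

```python
import tensorflow as tf
from tensorflow.keras import layers

INPUT_DIM = 193   # half the FFT size, plus one frequency bin
OUTPUT_DIM = 31   # vocabulary size; one blank label is added at the output

inp = layers.Input(shape=(None, INPUT_DIM))
x = layers.Reshape((-1, INPUT_DIM, 1))(inp)  # 3D -> 4D for Conv2D
x = layers.Conv2D(32, (11, 41), strides=(2, 2), padding="same", activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Conv2D(32, (11, 21), strides=(1, 2), padding="same", activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Reshape((-1, x.shape[-2] * x.shape[-1]))(x)  # flatten to 3D: 49*32=1568
for i in range(3):
    x = layers.Bidirectional(
        layers.GRU(256, return_sequences=True, reset_after=True))(x)
    if i < 2:  # dropout after every bidirectional GRU except the last
        x = layers.Dropout(0.5)(x)
x = layers.Dense(2 * 256, activation="relu")(x)
x = layers.Dropout(0.5)(x)
out = layers.Dense(OUTPUT_DIM + 1, activation="softmax")(x)  # +1 blank label

model = tf.keras.Model(inp, out)
print(model.output_shape, model.count_params())
```

With these assumptions, the parameter count lands at roughly 5.7 million, consistent with the summary discussed below.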
The summary of the CNN-GRU model in the figure above shows that the starting input layer has a 3D shape, with 193 representing the input dimension (INPUT_DIM), which is then reshaped into a 4D layer to be fed into the 2D convolutional layer. After passing through the 2nd Batch Normalization layer, the previous layer's dimensions are multiplied together to obtain the value 1568. Finally, the output layer produces 32 units, which corresponds to 1 more unit than the predefined output dimension (OUTPUT_DIM). Of all the layers, the Bidirectional GRU layers have the highest number of parameters. This is due to the fact that such a layer processes input sequences from previous layers in both the forward and backward directions (i.e., accounting for past and future timesteps). Not only does the layer require forward and backward weights, it also requires recurrent forward and backward weights to store information from previous timesteps of the forward and backward sequences respectively. (Dobilas, 2022) It is also observed that there are a total of 5.7 million parameters, 128 of which are deemed non-trainable as they are not updated through backpropagation, such as the moving mean and standard deviation statistics of the Batch Normalization layers.
The code in the figure above is used to train the CNN-GRU model on the training set for 50 epochs.
The batch size is set to the validation dataset’s size and the “callbacks” parameter will be used to
call the “Metrics” function that accepts the validation dataset and CNN-GRU model as input
after each epoch. Details on the callback function and analysis conducted based on the stored
model’s history information will be discussed further in section 6.2.
The code above is constructing a regularized Convolutional Neural Network (CNN) with Gated
Recurrent Units (GRU) model. This model is similar to the above one except it utilizes 400 RNN
units instead of the original 256. It also implements extensive regularization techniques by
including the “kernel_regularizer” parameter in the Dense layers to apply penalty on the layer’s
kernel after performing 3 cycles of Bidirectional GRU operations. In the first Dense layer, both
L1 and L2 regularizations are applied using Keras’s “l1_l2” function with penalty rate “l1” and
“l2” set to 0.001. The L1 regularizer computes the sum of the absolute values of the weights
whereas the L2 regularizer computes the sum of their squared values. As for the second Dense
layer, an L1 regularizer is applied with a penalty rate of 0.001.
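The penalty these regularizers add to the loss can be sketched in plain Python; the weight values below are illustrative, not taken from the model:

```python
def l1_l2_penalty(weights, l1=0.001, l2=0.001):
    """Regularization term that Keras's l1_l2 adds to the loss:
    l1 * sum(|w|) + l2 * sum(w**2)."""
    return l1 * sum(abs(w) for w in weights) + l2 * sum(w * w for w in weights)

weights = [0.5, -0.25, 1.0]               # illustrative kernel weights
both = l1_l2_penalty(weights)             # first Dense layer: L1 and L2
l1_only = l1_l2_penalty(weights, l2=0.0)  # second Dense layer: L1 only
```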
The summary of regularized CNN-GRU model in figure above shows that the layers involved
are the same as the CNN-GRU model, each having the same dimensionality of shapes. The only
thing that differs is that the total number of parameters involved in constructing the model has
increased to 11.4 million. This is due to an increase in the number of RNN units passed to the
GRU and Dense layers from 256 to 400 and from 512 to 800 respectively. This also explains the
need to incorporate more regularization and optimization techniques to prevent the model from
overfitting as the model becomes more complex and possesses greater capacity in capturing
audio features.
The code in figure above is used to train the regularized CNN-GRU model on the training set for
60 epochs. The batch size is set to the validation dataset’s size and the “callbacks” parameter will
be used to call the “Metrics” function that accepts the validation dataset and regularized CNN-
GRU model as input after each epoch. Details on the callback function and analysis conducted
based on the stored model’s history information will be discussed further in section 6.2.
5.7.3 GRU
The code above is constructing a Gated Recurrent Units (GRU) model. It consists of 3
bidirectional GRU layers and 2 dense layers. Since the spectrogram or input vector sequences
obtained using Short-Time Fourier Transformation (STFT) technique is symmetrical, only the
first half of the values are required to represent the frequency domain. As such, the input
dimension is half the size of Fast Fourier Transform (FFT). The output dimension corresponds to
the number of characters in the pre-defined vocabulary list.
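As a quick sanity check, the 193-dimensional input reported in the model summaries is consistent with keeping only the unique bins of a real-valued STFT, assuming an FFT window size of 384 (the window size itself is an assumption here):

```python
# A real-valued signal has a symmetric spectrum, so only the first
# n_fft // 2 + 1 frequency bins carry unique information.
n_fft = 384                     # assumed FFT window size
input_dim = n_fft // 2 + 1
print(input_dim)                # 193, matching INPUT_DIM
```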
Before proceeding to the bidirectional GRU layers, the input layer must be reshaped into a 3D
Tensor using the “Reshape” layer. It is then passed through a loop of three iterations to be fed
into a GRU layer which accepts 400 RNN units with the “return_sequences” and “reset_after” parameters set
to True. Upon setting “return_sequences” to True, the full sequence output in the form of batch
size, timesteps of input sequence and RNN units will be displayed instead of the original form of
batch size and RNN units only. Upon setting “reset_after” parameter to True, reset gate will be
applied after matrix multiplication instead of before which can better capture the features in
speech signals. (Team, 2014) This GRU layer is then wrapped inside a Bidirectional layer so that
it can incorporate timesteps of vector sequences from both past and future. A Dropout layer is
then added after each Bidirectional GRU layer except the last one to prevent overfitting by
randomly dropping half of the input units for each batch of training set. (Team, 2023)
The previous layer is then passed to a Dense layer which accepts twice the number of RNN units
for dimensionality inputs with a ReLU activation function and a Kernel Regularizer using
“l1_l2” function. The regularizer has its “l1” and “l2” penalty rates set to 0.001 each. This layer
computes the dot product between the inputs and the kernel defined in CNN, resulting in a
Tensor shape comprising of batch size, timesteps and RNN units doubled. (Team, 2023)
The resulting layer is then passed through a Dropout layer to randomly drop half of the input units.
As a result of the Dropout layer, the output layer is obtained in the form of a Dense layer which
accepts 1 more unit than the output dimension, a softmax activation function and a Kernel
Regularizer using “l1” function with a penalty rate of 0.001. To account for inconsistent length
in output sequences with input sequences during computation of CTC Loss function, the blank
label is added to the original output dimension. (Graves et al., 2013)
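The role of the blank label becomes clearer with a minimal sketch of greedy CTC collapsing, the rule that `ctc_decode` applies when Greedy search is selected: consecutive repeats are merged first, then blanks are removed, so the decoded output can be shorter than the number of input timesteps:

```python
def ctc_greedy_collapse(frame_labels, blank=0):
    """Collapse per-timestep labels the way greedy CTC decoding does:
    merge consecutive repeats, then drop the blank label."""
    collapsed, previous = [], None
    for label in frame_labels:
        if label != previous:          # merge runs of the same label
            collapsed.append(label)
        previous = label
    return [l for l in collapsed if l != blank]

# The blank (class 0 here, purely illustrative) separates genuine
# repeated characters from stuttered per-frame outputs.
print(ctc_greedy_collapse([3, 3, 0, 3, 7, 7, 0]))  # [3, 3, 7]
```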
The model is then compiled using CTC Loss function and “Adam” optimizer with a learning rate
of 0.001. The “Adam” optimizer is an extension of Stochastic Gradient Descent (SGD) algorithm
whereby it utilizes both the first and second moments of the gradient during training stages. It is
preferable over other optimizers due to its adaptive learning rate in updating each network
weight individually, lower memory requirements, faster computation time and less tuning effort.
(Gupta, 2021) Finally, the model summary is displayed in the console using “summary()”
method as shown in the figure below.
The summary of regularized GRU model in figure above shows that the starting input layer has a
3D shape with 193 representing the input dimension (INPUT_DIM). Since there is no 2D
Convolutional layer involved, the dimensional shape is fixed at 3D throughout the model training
phase. Finally, the output layer will produce 32 units which corresponds to 1 more unit than the
predefined output dimension (OUTPUT_DIM). Out of all the layers, the Bidirectional GRU
layer has the highest number of parameters. This is due to the fact that the layer processes input
sequences from previous layers in both forward and backward direction (i.e., accounting past and
future timesteps). Not only does the layer require a forward and backward weight, but it also
requires a recurrent forward and backward weight to store information from previous timesteps
of forward and backward sequence respectively. (Dobilas, 2022) It is also observed that there are
a total of 7.8 million parameters with none of them deemed non-trainable because there are no
Batch Normalization layer’s parameters transmitting the mean and standard deviation values
from one layer to the other. This also explains the reduction of parameters from 11.4 million
when compared with the regularized CNN-GRU model.
The code in figure above is used to train the regularized GRU model on the training set for 60
epochs. The batch size is set to the validation dataset’s size and the “callbacks” parameter will be
used to call the “Metrics” function that accepts the validation dataset and regularized GRU
model as input after each epoch. Details on the callback function and analysis conducted based
on the stored model’s history information will be discussed further in section 6.2.
5.7.4 CNN-LSTM
The code above is constructing a Convolutional Neural Network (CNN) with Long Short Term
Memory (LSTM) model. It consists of 2 convolutional layers, 3 bidirectional LSTM layers and 2
dense layers. Since the spectrogram or input vector sequences obtained using Short-Time Fourier
Transformation (STFT) technique is symmetrical, only the first half of the values are required to
represent the frequency domain. As such, the input dimension is half the size of Fast Fourier
Transform (FFT). The output dimension corresponds to the number of characters in the pre-
defined vocabulary list.
To make the input layer compatible with the expected 4D shape for CNN, it is reshaped using
Keras’s “Reshape” layer. The 1st convolutional layer is defined with 32 filters, a kernel size of
11x41, a stride of 2x2 and a Rectified Linear Unit (ReLU) activation function using Keras’s
“Conv2D” layer. To normalize the previous layers to have a mean of 0 and a standard deviation
of 1, a Batch Normalization layer is added. (Brownlee, 2019) The 2nd convolutional layer is then
defined with 32 filters, a kernel size of 11x21, a stride of 1x2 and a ReLU activation function
with Batch Normalization layer applied afterwards.
Before proceeding to the bidirectional LSTM layers, the previous layer must be flattened into a
3D Tensor using the “Reshape” layer. It is then passed through a loop of three iterations to be
fed into an LSTM layer which accepts 400 RNN units with the “return_sequences” parameter set to True.
Unlike a GRU, an LSTM layer does not have a “reset_after” parameter. This is because GRU has a
reset gate that can ignore past states based on the parameter entered whereas LSTM has a forget
gate that will completely discard specific information in the past. (Vijaysinh Lendave, 2021)
Upon setting “return_sequences” to True, the full sequence output in the form of batch size,
timesteps of input sequence and RNN units will be displayed instead of the original form of
batch size and RNN units only. (Team, 2014) This LSTM layer is then wrapped inside a
Bidirectional layer so that it can incorporate timesteps of vector sequences from both past and
future. A Dropout layer is then added after each Bidirectional LSTM layer except the last one to
prevent overfitting by randomly dropping half of the input units for each batch of training set.
(Team, 2023)
The previous layer is then passed to a Dense layer which accepts twice the number of RNN units
for dimensionality inputs with a ReLU activation function and a Kernel Regularizer using
“l1_l2” function. The regularizer has its “l1” and “l2” penalty rates set to 0.001 each. This layer
computes the dot product between the inputs and the kernel defined in CNN, resulting in a
Tensor shape comprising of batch size, timesteps and RNN units doubled. (Team, 2023)
The resulting layer is then passed through a Dropout layer to randomly drop half of the input units.
As a result of the Dropout layer, the output layer is obtained in the form of a Dense layer which
accepts 1 more unit than the output dimension, a “softmax” activation function and a Kernel
Regularizer using “l1” function with a penalty rate of 0.001. To account for inconsistent length
in output sequences with input sequences during computation of CTC Loss function, the blank
label is added to the original output dimension. (Graves et al., 2013)
The model is then compiled using CTC Loss function and “Adam” optimizer with a learning rate
of 0.001. The “Adam” optimizer is an extension of Stochastic Gradient Descent (SGD) algorithm
whereby it utilizes both the first and second moments of the gradient during training stages. It is
preferable over other optimizers due to its adaptive learning rate in updating each network
weight individually, lower memory requirements, faster computation time and less tuning effort.
(Gupta, 2021) Finally, the model summary is displayed in the console using “summary()”
method as shown in the figure below.
The summary of regularized CNN-LSTM model in figure above shows that the starting input
layer has a 3D shape with 193 representing the input dimension (INPUT_DIM) which is then
reshaped into a 4D layer to be fed into the 2D Convolutional layer. After going through the 2nd
Batch Normalization layer, the previous layer’s dimensions are multiplied together to obtain
the value 1568. Finally, the output layer will produce 32 units which corresponds to 1 more unit
than the predefined output dimension (OUTPUT_DIM). Out of all the layers, the Bidirectional
LSTM layer has the highest number of parameters. This is due to the fact that the layer processes
input sequences from previous layers in both forward and backward direction (i.e., accounting
past and future timesteps). Not only does the layer require a forward and backward weight, but it
also requires a recurrent forward and backward weight to store information from previous
timesteps of forward and backward sequence respectively. (Dobilas, 2022)
It is also observed that there are a total of 14.9 million parameters with 128 of them deemed non-
trainable as they are not updated when transmitted from one layer to the next, such as parameters
involved in transferring mean and standard deviation during the Batch Normalization layer.
Further observation revealed that the number of parameters has risen when compared with the
11.4 million count of the regularized CNN-GRU model. This is due to the fact that an LSTM layer
has 3 gates, namely an input gate which controls what information is stored in long-term memory,
a forget gate that controls what information to discard and an output gate that controls what
information to pass to the next timestep. In contrast, a GRU layer only has an update gate that
controls what information to carry to the next state and a reset gate that controls what information
to ignore from the past. (Vijaysinh Lendave, 2021)
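The gate count translates directly into parameter count. A rough sketch of the standard Keras formulas (the 1568 input size is taken from the model summary; the comparison holds for any input size):

```python
def lstm_params(input_dim, units):
    # 4 weight sets: input, forget and output gates plus the cell candidate
    return 4 * (input_dim * units + units * units + units)

def gru_params(input_dim, units):
    # 3 weight sets, with doubled bias vectors when reset_after=True
    return 3 * (input_dim * units + units * units + 2 * units)

# With 400 RNN units, the LSTM layer needs roughly 4/3 of the GRU's
# parameters, which accounts for the rise from 11.4M to 14.9M overall.
print(lstm_params(1568, 400) / gru_params(1568, 400))
```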
The code in the figure above is used to train the regularized CNN-LSTM model on the training
set for 60 epochs. The batch size is set to the validation dataset’s size and the “callbacks”
parameter will be used to call the “Metrics” function that accepts the validation dataset and
regularized CNN-LSTM model as input after each epoch. Details on the callback function and
analysis conducted based on the stored model’s history information will be discussed further in
section 6.2.
5.8 Summary
This chapter first introduces the concept of data analysis and its relevance to the current Web-
based ASR data science project. Then, the metadata of the chosen dataset is tabulated in which
the details regarding properties of audio file and transcript attributes are explained clearly. To
have a thorough understanding on the dataset and verify if the metadata given is accurate, Initial
Data Exploration or Exploratory Data Analysis (EDA) is carried out by obtaining information
about frequency and signal waves of audio file as well as the total token count, unique token
count and token distribution for corresponding and overall transcript. Utilizing such information,
data cleaning or pre-processing is performed to remove empty rows and erroneous data, down-
sample the data to enhance model training capabilities and transform the data into a
Tensorflow Dataset object to be fed into each deep learning model. Data visualization on both
the audio file and transcript is displayed in the form of bar graphs, heatmaps, histograms etc.
Before constructing our models, data partitioning is initiated to divide our dataset into training,
validation and testing set. Once the dataset is transformed and partitioned, we can construct a
CNN-GRU, GRU and CNN-LSTM model with experimentation of regularizers, optimizers,
epochs, RNN units and other hyperparameters.
To convert the Tensor 3D shape in the form of a tuple back to character sequences, the CTC
Decoding function is defined. Each value in the tuple represents batch size, timesteps and
number of classes respectively. In order to obtain the length of input sequence, the batch size
must be multiplied with the time step along each input sequence. Tensorflow Keras’s
“ctc_decode” function which accepts the predicted matrix, length of input sequence and a flag on
whether to use Greedy or Beam search as parameters will be used to perform decoding. The
“decoder” function defined in the previous section will then be applied on the individual strings
in the list to map the integer class labels to corresponding character in the vocabulary. Then,
“tf.strings.reduce_join” is used to join the characters into one chunk, then convert to Numpy
array using “numpy()” method and decode into “UTF-8” string.
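A plain-Python analogue of the final mapping-and-joining step may make it concrete; the vocabulary below is an assumed example, not the project’s actual list:

```python
# Assumed vocabulary for illustration: 26 letters, apostrophe and space.
vocabulary = list("abcdefghijklmnopqrstuvwxyz' ")

def labels_to_text(labels):
    """Map integer class labels to characters and join them into one
    string, mirroring the decoder lookup plus tf.strings.reduce_join."""
    return "".join(vocabulary[i] for i in labels)

print(labels_to_text([7, 4, 11, 11, 14]))  # "hello"
```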
Figure 92: Code Snippet to generate metrics for Validation and Testing
Upon reaching the end of each epoch, the “on_epoch_end” function will be called which will
generate the predicted output by passing the spectrogram into the chosen model, then apply CTC
Decoding technique to return a sequence of characters. The actual output is obtained by passing
the target label into the predefined decoder, joining character sequences into chunks, converting
to Numpy array and decoding into “UTF-8” string. The Word Error Rate (WER) of the audio is
computed using the imported “wer” function whereas the Character Error Rate (CER) is computed
through the “editdistance” package’s “eval” function, both of which are stored in a list. 5
randomized predicted and actual results will then be displayed as each epoch ends.
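The two metrics differ only in the unit of comparison. A minimal stdlib sketch approximating what the “wer” function and “editdistance.eval” compute (words versus characters, each normalized by the reference length):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("the cat sat", "the cat mat"))  # one substituted word out of 3
```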
Figure 93: Code Snippet to visualize Train and Validation Loss for CNN-GRU model
The values of “loss” and “val_loss” are obtained from the CNN-GRU training dictionary as
shown in the figure above. A range of 50, which is equal to the number of epochs executed to train
the CNN-GRU model is created and assigned as the X-axis while the Y-axis represents the
training loss in blue color and validation loss in red color. The line plot is displayed in the figure
below.
Figure 94: Relationship between Train and Validation Loss against Epoch for CNN-GRU model
The line graph above shows that the validation loss is greater than the training loss throughout
the training phase consisting of 50 epochs. It is observed that the training loss stops decreasing
upon reaching the 60-70 mark around the 45th epoch, then it increases slightly to the 150 mark
before dropping below the 100 mark. As for the validation loss, the period at which it stops
decreasing is relatively earlier at around the 35th epoch. It then rises marginally beyond the 150
mark, followed by a steady fall before surging drastically beyond the 300-value mark. This
implies that the surge rate of validation loss between the 45th to 50th epoch is more intense
compared to the training loss. The final training and validation loss are estimated to be around
125 and 75 respectively. All these signify that the CNN-GRU model is overfitting on the
training dataset as it is memorizing the training data points rather than learning to generalize
patterns across training and validation sets, causing it to perform much better on the specialized
training set than on the unfamiliar validation set.
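The plotting pattern used in Figure 93 (and repeated for the other models) can be sketched as follows; the history dictionary here is a stand-in for the one returned by model.fit:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Stand-in for the Keras History dictionary produced by model.fit(...)
history = {"loss": [310, 180, 120, 90, 80],
           "val_loss": [330, 230, 175, 150, 160]}

epochs = range(len(history["loss"]))
fig, ax = plt.subplots()
ax.plot(epochs, history["loss"], color="blue", label="Training loss")
ax.plot(epochs, history["val_loss"], color="red", label="Validation loss")
ax.set_xlabel("Epoch")
ax.set_ylabel("CTC loss")
ax.legend()
fig.savefig("loss_over_epochs.png")
```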
Figure 95: Code Snippet to visualize Train and Validation Loss for regularized CNN-GRU model
The values of “loss” and “val_loss” are obtained from the regularized CNN-GRU training
dictionary as shown in the figure above. A range of 60, which is equal to the number of epochs
executed to train the regularized CNN-GRU model is created and assigned as the X-axis while
the Y-axis represents the training loss in blue color and validation loss in red color. The line plot
is displayed in the figure below.
Figure 96: Relationship between Train and Validation Loss against Epoch for regularized CNN-GRU model
For the line graph above, we will be executing 10 more epochs up to 60 because in the non-
regularized CNN-GRU model, we were unable to observe whether the training and validation loss
would drop to a new low point or remain constant beyond the 50-epoch range. Based on the
graph, we can observe that at the initial stage of around 5 epochs, the validation loss is slightly
greater than the training loss. After that point, both the training and validation loss decrease
gradually below the 50 mark and reach a relatively constant state after the 50th epoch.
Although there is minor fluctuation for the validation loss between 20th to 50th epoch, it still
remains lower than the training loss. Additionally, the gap between training and validation loss is
minimizing as the epoch is executed. This implies that the model is performing well on both the
training and validation sets. It is not overfitting to the training dataset and is generalizing patterns
obtained from unfamiliar validation data well. It is also not underfitting on the training dataset as
it is able to capture the underlying audio features, supported by the fact that both training and
validation sets achieve lower loss value compared to the unregularized CNN-GRU model. Such
an improvement is likely associated with the increase in RNN units which increases the model’s
capacity in learning audio features as well as the insertion of regularizers that can prevent
overfitting.
Figure 97: Code Snippet to visualize Train and Validation Loss for regularized GRU model
The values of “loss” and “val_loss” are obtained from the GRU training dictionary as shown in
figure above. A range of 60, which is equal to the number of epochs executed to train the GRU
model is created and assigned as the X-axis while the Y-axis represents the training loss in blue
color and validation loss in red color. The line plot is displayed in the figure below.
Figure 98: Relationship between Train and Validation Loss against Epoch for regularized GRU model
The line graph above shows that there has been an inconsistent fluctuation for both training and
validation loss. At the start of execution, the training loss is significantly higher than the
validation loss, after which the training loss drops significantly until it falls below the
validation loss around the 15th epoch. However, a steep surge beyond the 300 and 250 marks for
validation and training loss respectively can be observed between the 15th to 20th epoch. Beyond
this point, the validation loss has been remaining around the 100 mark whereas the training loss
continues to decrease towards the 50 mark. This implies that the GRU model is overfitting on
the training set as it is constantly memorizing the training data rather than learning underlying
patterns to be applied on the validation set. This is likely associated with a lack of complexity of
the GRU model in extracting features from acoustic data, insufficient well-tuned
hyperparameters that can optimize model performance and lack of Batch Normalization layer to
prevent overfitting.
Figure 99: Code Snippet to visualize Train and Validation Loss for regularized CNN-LSTM model
The values of “loss” and “val_loss” are obtained from the CNN-LSTM training dictionary as
shown in the figure above. A range of 60, which is equal to the number of epochs executed to train
the CNN-LSTM model is created and assigned as the X-axis while the Y-axis represents the
training loss in blue color and validation loss in red color. The line plot is displayed in the figure
below.
Figure 100: Relationship between Train and Validation Loss against Epoch for regularized CNN-LSTM model
The figure above shows that at the 1st epoch, the training loss is slightly higher than the
validation loss. However, the training loss has been decreasing substantially since then until it
reaches a constant pace below the 50 loss-mark. On the other hand, the validation loss’s
reduction rate stopped before the 20th epoch, causing it to fluctuate unstably just beyond the 100
loss-mark. This implies that the CNN-LSTM model is overfitting on the training set as it is
constantly memorizing the training data rather than learning underlying patterns to be applied on
the validation set. One explanation for such an occurrence is the internal working of the LSTM,
which consists of 1 extra gate compared to the GRU, thus requiring more RNN units to transfer,
discard and store information between layers. Moreover, our training dataset is not big since it only focuses
on common spoken words that have been filtered accordingly due to limited memory constraints.
Figure 101: Code Snippet to Generate Line Graph for WER Over Epoch for CNN-GRU model
The code in the figure above plots a line graph with the WER on the Y-axis and number of
epochs on the X-axis using a blue line for the CNN-GRU model. A horizontal line to display the
minimum value of WER is drawn using a red dashed line.
Figure 102: Line Graph for WER Over Epoch for CNN-GRU model
The line graph above illustrates that the pattern of WER on validation set for CNN-GRU model
is similar to the corresponding validation line graph that illustrates loss over epochs as shown in
figure 94. This is because upon reaching the new low point just below 0.6 before the 35th epoch,
WER rises marginally beyond the 0.7 mark, followed by a steady fall to a new low point before
drastically going beyond the 0.9 mark. The fluctuation in WER implies that the validation set is
not generalized well for the CNN-GRU model.
Figure 103: Code Snippet to Generate Line Graph for WER Over Epoch for regularized CNN-GRU model
The code in the figure above plots a line graph with the WER on the Y-axis and number of
epochs on the X-axis using a blue line for the regularized CNN-GRU model. A horizontal line to
display the minimum value of WER is drawn using a red dashed line.
Figure 104: Line Graph for WER Over Epoch for regularized CNN-GRU model
The line graph above illustrates that the pattern of WER on validation set for regularized CNN-
GRU model is similar to the corresponding validation line graph that illustrates loss over epochs
as shown in figure 96. This is because despite experiencing strong fluctuation between the 20th to
60th epoch, the WER remains at the low point around 0.4 without going through drastic rises
unlike the line graph for unregularized CNN-GRU. In addition to that, the new WER low point
for regularized CNN-GRU is around 0.4 whereas it is 0.1 to 0.2 or 10% to 20% higher for
unregularized CNN-GRU, implying that regularized CNN-GRU is able to generalize patterns for
unfamiliar validation set much better than unregularized CNN-GRU. Based on such observation,
we will be applying regularization technique throughout the following models.
Figure 105: Code Snippet to Generate Line Graph for WER Over Epoch for regularized GRU model
The code in the figure above plots a line graph with the WER on the Y-axis and number of
epochs on the X-axis using a blue line for the regularized GRU model. A horizontal line to display
the minimum value of WER is drawn using a red dashed line.
Figure 106: Line Graph for WER Over Epoch for regularized GRU model
The line graph above illustrates that the pattern of WER on validation set for regularized GRU
model is similar to the corresponding validation line graph that illustrates loss over epochs as
shown in figure 98. This is because upon reaching the first new low point just below 0.6 around
the 15th epoch, a steep surge beyond the 0.9 or 90% mark can be observed between the 15th to
20th epoch. Beyond this point, the WER continues to decrease below the 0.6 mark around the 30th
epoch until it reaches a new low point just below the 0.5 mark around the 50th epoch. Despite only
having 1 intense surge and decline in WER between the 15th to 20th epoch compared to the double
surge and decline in WER of the unregularized CNN-GRU model, the regularized GRU model is
still not generalizing well with the validation set. This can also be supported by the fact that the
new WER low point of 0.5 is still 0.1 or 10% higher than that of the WER for regularized CNN-
GRU model.
Figure 107: Code Snippet to Generate Line Graph for WER Over Epoch for regularized CNN-LSTM model
The code in the figure above plots a line graph with the WER on the Y-axis and number of
epochs on the X-axis using a blue line for the regularized CNN-LSTM model. A horizontal line to
display the minimum value of WER is drawn using a red dashed line.
Figure 108: Line Graph for WER Over Epoch for regularized CNN-LSTM model
The line graph above illustrates that the pattern of WER on validation set for regularized CNN-
LSTM model is slightly different compared to the corresponding validation line graph that
illustrates loss over epochs as shown in figure 100. This is because the validation loss line in
figure 100 stops decreasing before the 20th epoch whereas the WER line only stops decreasing
after the 50th epoch. Another major difference is that the minimum point of the validation loss line
is around 100, whereas the corresponding WER reaches a minimum point of 0.3 or
30%. This implies that the learning capability of the regularized CNN-LSTM model on the
validation set is better than that of the regularized CNN-GRU model which only has a minimum
WER of 0.4 or 40%. Despite being able to predict validation data more accurately, the subtle
difference between validation loss and WER implies that the model might be memorizing patterns
in the training data and fitting them into the validation set instead of generalizing the relevant
acoustic features from the
validation set. Several factors that might lead to such an observation include the model’s
architecture whereby CNN-LSTM cannot capture temporal dependencies as precisely as CNN-
GRU and over-regularization whereby the model’s capacity to generalize underlying patterns of
the validation set is constrained due to excessive rigidness and inflexibility of the model.
Figure 109: Code Snippet to Generate Line Graph for CER Over Epoch for regularized CNN-GRU model
The code in the figure above plots a line graph with the CER on the Y-axis and number of
epochs on the X-axis using a blue line for the regularized CNN-GRU model. A horizontal line to
display the minimum value of CER is drawn using a red dashed line.
Figure 110: Line Graph for CER Over Epoch for regularized CNN-GRU model
The line graph above illustrates that the pattern of CER on validation set for regularized CNN-
GRU model is similar to the corresponding validation line graph that illustrates loss over epochs
and WER line graph as shown in figure 96 and figure 104 respectively. This is because despite
experiencing minor fluctuation between the 10th to 30th epoch, the CER remains at the low point
below 0.2 until it reaches the final epoch. In addition to that, the new CER low point for
regularized CNN-GRU is around 0.1 or 10%, implying that regularized CNN-GRU is able to
generalize patterns for unfamiliar validation set.
Figure 111: Code Snippet to Generate Line Graph for CER Over Epoch for regularized GRU model
The code in the figure above plots a line graph with the CER on the Y-axis and number of
epochs on the X-axis using a blue line for regularized GRU model. A horizontal line to display
the minimum value of CER is drawn using a red dashed line.
Figure 112: Line Graph for CER Over Epoch for regularized GRU model
The line graph above illustrates that the pattern of CER of validation set for regularized GRU
model is similar to the corresponding validation line graph that illustrates loss over epochs as
shown in figure 98. This is because upon reaching the first new low point at 0.2 around the 15th
epoch, a steep surge beyond the 0.9 or 90% mark can be observed between the 15th to 20th
epoch. Beyond this point, the CER continues to decrease until it reaches a new minimum point
just below the 0.2 mark around the 50th epoch. Despite the minimum point being close to the low
point of the CER line graph in regularized CNN-GRU as shown in figure 110, the regularized GRU
model is still not generalizing well with the validation set because there is 1 intense surge and decline in
CER between 15th to 20th epoch whereas such pattern is not observed in the CER line graph in
figure 110.
Figure 113: Code Snippet to Generate Line Graph for CER Over Epoch for regularized CNN-LSTM model
The code in the figure above plots a line graph with the CER on the Y-axis and number of
epochs on the X-axis using a blue line for regularized CNN-LSTM model. A horizontal line to
display the minimum value of CER is drawn using a red dashed line.
Figure 114: Line Graph for CER Over Epoch for regularized CNN-LSTM model
The line graph above illustrates that the pattern of CER on validation set for regularized CNN-
LSTM model is slightly different compared to the corresponding validation line graph that
illustrates loss over epochs as shown in figure 100. This is because the validation loss line in
figure 100 stops decreasing before the 20th epoch whereas the CER line only stops decreasing
after the 30th epoch. Another major difference is that the minimum point of the validation loss line
is around 100, whereas the corresponding CER reaches a minimum point of 0.1 or
10%. From the visualization in figure 110, we can see that both graphs’ minimum points for CER
seem to be around the 0.1 or 10% range. Therefore, we will be displaying their respective
minimum points as shown in the figure below.
Figure 115: Code Snippet to find minimum point for CER on Validation Set of regularized CNN-GRU and CNN-LSTM model
The figure above implies that the learning capability of the regularized CNN-LSTM model on
the validation set, with a minimum CER of 0.08 or 8%, is better than that of the regularized
CNN-GRU model, which only reaches a minimum CER of 0.11 or 11%. Despite being able to
predict the validation data more accurately, the subtle difference between validation loss and
CER implies that the model might be memorizing patterns in the training data and fitting them to
the validation set instead of generalizing the relevant acoustic features. Several factors that might
lead to such an observation include the model's architecture, whereby CNN-LSTM cannot
capture temporal dependencies as precisely as CNN-GRU, and over-regularization, whereby the
model's capacity to generalize the underlying patterns of the validation set is constrained due to
excessive rigidness and inflexibility of the model.
Figure 116: Code Snippet to Generate Metrics for Unregularized CNN-GRU model’s Evaluation on Testing Set
The code in the figure above evaluates the unregularized CNN-GRU model on the testing set
based on the WER metric only. This is because findings from sections 6.2.3 and 6.2.4 suggest
that the line graphs for WER and CER follow similar patterns, and findings will only be
critically analysed on the more optimal regularized CNN-GRU model rather than the
unregularized one. First, the code generates the predicted output by passing the spectrogram into
the chosen model, then applies the CTC decoding technique to return a sequence of characters.
The actual output is obtained by passing the target label into the predefined decoder, joining the
character sequences into chunks, converting them to a NumPy array and decoding them into a
"UTF-8" string. The Word Error Rate (WER) of each audio is computed using the imported
"wer" function and stored in a list. Three randomly selected pairs of predicted and actual results
for the testing set are then displayed.
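The imported "wer" function comes from an external package (likely jiwer, though the source is not named in the report). For clarity, the metric itself can be sketched in pure Python as the word-level edit distance divided by the number of reference words:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word tokens
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A WER of 0.37, for example, therefore means that roughly 37 word-level edits are needed per 100 reference words to turn the prediction into the ground truth.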
Figure 117: Code Snippet to Generate Metrics for Regularized CNN-GRU model’s Evaluation on Testing Set
The code in the figure above evaluates the regularized CNN-GRU model on the testing set based
on both WER and CER metrics. It follows the same procedure as figure 116: the predicted output
is generated by passing the spectrogram into the chosen model and applying the CTC decoding
technique, while the actual output is obtained through the predefined decoder, joined into
chunks, converted to a NumPy array and decoded into a "UTF-8" string. The Word Error Rate
(WER) of each audio is computed using the imported "wer" function, whereas the Character
Error Rate (CER) is computed through the "editdistance" package's "eval" function; both are
stored in lists. Three randomly selected pairs of predicted and actual results for the testing set
are then displayed.
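The CER obtained via the "editdistance" package's "eval" function is simply the character-level Levenshtein distance divided by the reference length; a self-contained sketch of the equivalent computation:

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character-level Levenshtein distance / length of reference,
    equivalent to editdistance.eval(reference, hypothesis) / len(reference)."""
    # Rolling single-row dynamic programming over characters
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / len(reference)

# One deleted "l" over an 11-character reference
print(character_error_rate("hello world", "helo world"))
```

Because a single word can contain several characters, CER is typically much lower than WER for the same prediction, which matches the gap between the ~40% WER and ~11% CER reported for the regularized CNN-GRU model.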
Figure 118: Code Snippet to Generate Metrics for Regularized GRU model’s Evaluation on Testing Set
The code in the figure above evaluates the regularized GRU model on the testing set based on
both WER and CER metrics, following the same procedure described for figure 117: the
predicted output is generated by passing the spectrogram into the chosen model and applying the
CTC decoding technique, the actual output is decoded from the target label into a "UTF-8"
string, the WER is computed using the imported "wer" function, the CER is computed through
the "editdistance" package's "eval" function, and three randomly selected pairs of predicted and
actual results are then displayed.
Figure 119: Code Snippet to Generate Metrics for Regularized CNN-LSTM model’s Evaluation on Testing Set
The code in the figure above evaluates the regularized CNN-LSTM model on the testing set
based on both WER and CER metrics, following the same procedure described for figure 117:
the predicted output is generated by passing the spectrogram into the chosen model and applying
the CTC decoding technique, the actual output is decoded from the target label into a "UTF-8"
string, the WER is computed using the imported "wer" function, the CER is computed through
the "editdistance" package's "eval" function, and three randomly selected pairs of predicted and
actual results are then displayed.
Figure 120: Code Snippet to Generate Line Graph for WER Over Epoch for Unregularized CNN-GRU model on Testing Set
The code in the figure above plots a line graph for the unregularized CNN-GRU model on the
testing set, with the WER on the Y-axis and the batch number on the X-axis using a blue line. A
horizontal line marking the minimum value of WER is drawn using a red dashed line.
Figure 121: Line Graph for WER Over Epoch for Unregularized CNN-GRU model on Testing Set
The line graph above shows that the WER of the unregularized CNN-GRU model across the 16
batches in the testing set, each consisting of 25 audio files, ranges between 0.55 and 0.58 or 55%
to 58%. This aligns with the findings derived from the validation set in figure 102, whereby the
minimum point of WER is just below 0.6 or 60%. This relatively high WER implies that the
unregularized CNN-GRU model is only capable of memorizing patterns from the training set
rather than generalizing to the unseen testing set. Moreover, there are slight fluctuations at the
8th and 15th batch, indicating that the model performance can be improved further with
hyperparameter tuning and the incorporation of different regularization or optimization
techniques.
Figure 122: Code Snippet to Generate Line Graph for WER Over Epoch for regularized CNN-GRU model on Testing Set
The code in the figure above plots a line graph for the regularized CNN-GRU model on the
testing set, with the WER on the Y-axis and the batch number on the X-axis using a blue line. A
horizontal line marking the minimum value of WER is drawn using a red dashed line.
Figure 123: Line Graph for WER Over Epoch for regularized CNN-GRU model on Testing Set
The line graph above shows that the WER of the regularized CNN-GRU model across the 16
batches in the testing set, each consisting of 25 audio files, ranges between 0.37 and 0.4 or 37%
to 40%. This aligns with the findings derived from the validation set in figure 104, whereby the
minimum point of WER is around 0.4 or 40%. When compared with the WER testing results for
the unregularized CNN-GRU model in figure 121 above, the regularized CNN-GRU model is
more capable of learning generalizable patterns from the training set that can be applied to the
unseen testing set. Even so, there are slight fluctuations at the 7th and 15th batch, indicating that
the model performance can be improved further with hyperparameter tuning and the
incorporation of different regularization or optimization techniques.
Figure 124: Code Snippet to Generate Line Graph for WER Over Epoch for regularized GRU model on Testing Set
The code in the figure above plots a line graph for the regularized GRU model on the testing set,
with the WER on the Y-axis and the batch number on the X-axis using a blue line. A horizontal
line marking the minimum value of WER is drawn using a red dashed line.
Figure 125: Line Graph for WER Over Epoch for regularized GRU model on Testing Set
The line graph above shows that the WER of the regularized GRU model across the 16 batches
in the testing set, each consisting of 25 audio files, ranges between 0.43 and 0.46 or 43% to 46%.
This aligns with the findings derived from the validation set in figure 106, whereby the minimum
point of WER is just below 0.5 or 50%. Although the WER range is slightly higher than that of
the regularized CNN-GRU model in figure 123, the regularized GRU model can still apply
patterns learned from the training set to the unseen testing set reasonably well. Even so, there are
slight fluctuations at the 7th and 15th batch, indicating that the model performance can be
improved further with hyperparameter tuning and the incorporation of different regularization or
optimization techniques.
Figure 126: Code Snippet to Generate Line Graph for WER Over Epoch for regularized CNN-LSTM model on Testing Set
The code in the figure above plots a line graph for the regularized CNN-LSTM model on the
testing set, with the WER on the Y-axis and the batch number on the X-axis using a blue line. A
horizontal line marking the minimum value of WER is drawn using a red dashed line.
Figure 127: Line Graph for WER Over Epoch for regularized CNN-LSTM model on Testing Set
The line graph above shows that the WER of the regularized CNN-LSTM model across the 16
batches in the testing set, each consisting of 25 audio files, ranges between 0.27 and 0.3 or 27%
to 30%. This aligns with the findings derived from the validation set in figure 108, whereby the
minimum point of WER is at 0.3 or 30%. The WER range is lower than those of the regularized
CNN-GRU and regularized GRU models in figure 123 and figure 125 respectively, implying that
the regularized CNN-LSTM model has learned generalizable patterns from the training set that
can be applied to the unseen testing set. Even so, there are slight fluctuations at the 7th and 15th
batch, indicating that the model performance can be improved further with hyperparameter
tuning and the incorporation of different regularization or optimization techniques.
Figure 128: Code Snippet to Generate Line Graph for CER Over Epoch for regularized CNN-GRU model on Testing Set
The code in the figure above plots a line graph for the regularized CNN-GRU model on the
testing set, with the CER on the Y-axis and the batch number on the X-axis using a blue line. A
horizontal line marking the minimum value of CER is drawn using a red dashed line.
Figure 129: Line Graph for CER Over Epoch for regularized CNN-GRU model on Testing Set
The line graph above shows that the CER of the regularized CNN-GRU model across the 16
batches in the testing set, each consisting of 25 audio files, ranges between 0.1 and 0.12 or 10%
to 12%. This aligns with the findings derived from the validation set in figure 110, whereby the
minimum point of CER is around 0.1 or 10%. This implies that the regularized CNN-GRU
model has learned generalizable patterns from the training set that can be applied to the unseen
testing set. Even so, there are slight fluctuations at the 7th and 15th batch, indicating that the
model performance can be improved further with hyperparameter tuning and the incorporation
of different regularization or optimization techniques.
Figure 130: Code Snippet to Generate Line Graph for CER Over Epoch for regularized GRU model on Testing Set
The code in the figure above plots a line graph for the regularized GRU model on the testing set,
with the CER on the Y-axis and the batch number on the X-axis using a blue line. A horizontal
line marking the minimum value of CER is drawn using a red dashed line.
Figure 131: Line Graph for CER Over Epoch for regularized GRU model on Testing Set
The line graph above shows that the CER of the regularized GRU model across the 16 batches in
the testing set, each consisting of 25 audio files, ranges between 0.13 and 0.15 or 13% to 15%.
This aligns with the findings derived from the validation set in figure 112, whereby the minimum
point of CER is just below 0.2 or 20%. Although the CER range is slightly higher than that of
the regularized CNN-GRU model in figure 129, the regularized GRU model can still apply
patterns learned from the training set to the unseen testing set reasonably well. Even so, there are
slight fluctuations at the 7th and 15th batch, indicating that the model performance can be
improved further with hyperparameter tuning and the incorporation of different regularization or
optimization techniques.
Figure 132: Code Snippet to Generate Line Graph for CER Over Epoch for regularized CNN-LSTM model on Testing Set
The code in the figure above plots a line graph for the regularized CNN-LSTM model on the
testing set, with the CER on the Y-axis and the batch number on the X-axis using a blue line. A
horizontal line marking the minimum value of CER is drawn using a red dashed line.
Figure 133: Line Graph for CER Over Epoch for regularized CNN-LSTM model on Testing Set
The line graph above shows that the CER of the regularized CNN-LSTM model across the 16
batches in the testing set, each consisting of 25 audio files, ranges between 0.07 and 0.08 or 7%
to 8%. This aligns with the findings derived from the validation set in figure 114 and the
statistics in figure 115, whereby the minimum point of CER is around 0.08 or 8%. The CER
range is lower than those of the regularized CNN-GRU and regularized GRU models in figure
129 and figure 131 respectively, implying that the regularized CNN-LSTM model has learned
generalizable patterns from the training set that can be applied to the unseen testing set. Even so,
there are slight fluctuations at the 7th and 15th batch, indicating that the model performance can
be improved further with hyperparameter tuning and the incorporation of different regularization
or optimization techniques.
6.3 Summary
Based on the evaluation of training and validation loss, WER and CER on the validation data, as
well as WER and CER on the testing set, the results are summarized in the following table.

Metric        Unregularized CNN-GRU   Regularized CNN-GRU   Regularized GRU   Regularized CNN-LSTM
Testing WER   55% to 58%              37% to 40%            43% to 46%        27% to 30%
Testing CER   Not evaluated           10% to 12%            13% to 15%        7% to 8%
Based on the evaluation of the five metrics above for the four models, we can deduce that the
regularized CNN-GRU model performs the best in terms of training and validating the dataset,
as it is the only one with a training loss greater than its validation loss. It also has the lowest
training and validation loss, with the slightest gap between the two out of the four models, as
both values are below 50. This implies that the regularized CNN-GRU model is not just capable
of memorizing the training data, but is also able to generalize the feature patterns learned from
the training set and apply them to the validation set.
In terms of validation WER and CER, the regularized CNN-LSTM model performs the best,
with the lowest WER of 30% and CER of 8%. It is worth noting that CER is not evaluated for
the unregularized CNN-GRU model because the values obtained from WER already indicate
that the model has a lower learning capability than the regularized one. However, the model with
the best training and learning capacity (i.e., the regularized CNN-GRU model) is also not far
behind, with a WER of around 40% and a CER of around 11% respectively. This indicates that
the validation set may be too small, such that many similar data are generated, resulting in a
lower WER and CER for the regularized CNN-LSTM model.
As for the testing WER and CER, their percentages are lower than or equal to the corresponding
results derived from the validation set, with differences of only 1 to 4 percentage points. This
implies that the testing and validation sets have rather similar speech features, such that they can
be captured and learned by the model precisely.
To sum up, the most optimal model is the regularized CNN-GRU model because it is neither
overfitting nor underfitting the training set. It also exhibits no intense fluctuations, meaning that
the training and validation loss of the model are rather stable and do not change drastically over
time. Moreover, the model also has a relatively medium to low WER and CER, considering it
was trained on only a small sample of the dataset after the down-sampling technique was
applied. It is able to predict a large portion of the characters accurately, and more than half of the
tokens or words were predicted correctly as well. If this model were trained on a larger dataset,
such as the original dataset without the data sampling technique applied, it should be able to
generalize well to more complex patterns, resulting in even lower training and validation loss
and, subsequently, lower WER and CER as well.
Actors: Tutor, Student
Pre-conditions User must not have a registered Email Address in the system
Basic Workflow 1. User enters their school Email Address and password
2. User clicks on “Register” button
3. System validates Email Address and password
4. A message pops up with “Account created successfully!”
Alternative Workflow 1. When user’s Email Address is not detected within the
system in login interface, user will be given the option to
redirect back to registration interface
Post-conditions User successfully registered into the system and was given the
option to redirect to login interface
Description Allows users to login into the system based on their credential
Pre-conditions User must already have a registered Email Address in the system
for login credentials to be verified
Basic Workflow 1. User enters their registered school Email Address and
corresponding password
2. System verifies Email Address and password
3. A message pops up with “All your credentials are correct!
You may login now.” if all credentials are entered correctly
4. User clicks on “Login” button
Post-conditions User successfully logged into the system and immediately got
redirected to main interface
Pre-conditions User must already have a registered Email Address in the system
for password to be updated
updated password
3. User clicks on “Change Password” button
4. System verifies Email Address and updated password
5. A message pops up with “Password changed successfully!”
Post-conditions User successfully changed their password and was given the
option to redirect to login interface
Pre-conditions N/A
Post-conditions User can adjust playback speed, adjust audio volume and
download the playable audio
Description Allows users to upload audio file from their local directory into
Streamlit environment
Basic Workflow 1. Upon logging in, click the “Browse Files” icon
2. Select an audio file of “wav” format
3. Key in the corresponding file path
4. A message pops up with “File ‘filename’ has been
uploaded!”
Post-conditions The “wav” format audio file is displayed and users will be
prompted to select the next operations
Pre-conditions User must be already logged into the system and uploaded an
audio file
Basic Workflow 1. Upon uploading an audio file, click the “View properties of
audio file” icon
2. User is redirected to the “Property Viewer” page
3. User clicks on any of the expanders or button to visualize
specific properties of the audio file
Alternative Workflow Upon landing on the resampling and transcript page, users can
navigate back to the property viewer page by clicking “View
Properties of Audio File” button
Table 11: Use Case Specification for viewing audio file's properties
Pre-conditions User must be already logged into the system and uploaded an
audio file
Basic Workflow 1. Upon uploading an audio file, click the “Resample Audio
File” button
2. User is redirected to the “Resampling” page
3. User enters the new sample rate in the numeric text box
4. User clicks “Resample audio file” button
Alternative Workflow Upon landing on the property viewer and transcript page, users can
navigate back to the sampling page by clicking “Resample Audio
File” button
Post-conditions The resampled audio file will be displayed and users will be
prompted to download the file
Description Allows users to download audio file with frequency, pitch and
speed modified according to their needs
Pre-conditions User must have already logged into the system, uploaded an audio
file and generated a resampled version of the original audio file
Post-conditions Open the audio file through the download footer of browser or
within user’s ‘download’ folder in local directory
Table 13: Use Case Specification for downloading resampled audio files
Pre-conditions User must have already logged into the system and uploaded an
audio file
Basic Workflow 1. After uploading audio file, click the “Generate Transcript”
button
2. User is redirected to the “Transcript” page
3. User clicks on “Generate Transcript” button
4. An info panel indicating the transcript for the chosen audio
file will pop up
Alternative Workflow Upon landing on the property viewer and resampling page, users
can navigate back to the transcript page by clicking “Generate
Transcript” button
Post-conditions User can view the transcript in the info panel and perform further
operations with it
Description Allows users to translate the generated transcript into their chosen
language
Pre-conditions User must have already logged into the system, uploaded an audio
file and generated a transcript
Post-conditions User can view the translated transcript in the info panel and
prompted to download it
Post-conditions Open the transcript in the form of text file through the download
represent complex business workflows. The internal structure of an Activity Diagram is similar
to a flowchart, as it depicts sequential and control flow mechanisms (What Is Activity
Diagram?, 2022). In order to describe the intended behavior of every feature or process of the
web-based ASR system extensively, several activity diagrams are illustrated below.
Figure 142: Activity Diagram for viewing properties of audio file function
Figure 144: Activity Diagram for downloading resampled audio file function
8.1.2 Login
Users will be prompted to the login interface if they are registered users. The user then types
their registered institution Email Address and corresponding password. The user's credentials
will then be verified, and if any credential is invalid, a corresponding warning message will
appear in the interface. Invalid credentials include empty entries, an Email Address or password
that does not satisfy the system's requirements, and a non-registered Email Address. Once all
credentials are filled in correctly, a message indicating success will pop up and the user will be
redirected to the main interface.
8.1.4 Logout
After logging into the system, users can logout by clicking the “Logout” button at the top left
corner of the interface. It is worth noting that the “Logout” button is different from the “Back to
Home Page” button within the login, register and password changing interface as users are not
considered as logged in upon reaching the latter 3 interfaces.
addition to that, they must also copy the corresponding audio file’s path into a text area in order
for Streamlit to retrieve its content. Once done, a message indicating success and buttons for next
operations will pop up.
achieved an accuracy beyond 60%, i.e., the Word Error Rate (WER) derived from the model
should be less than 40% and the Character Error Rate (CER) should be less than 20%.
Several unit testing plans have been documented below regarding the features and interfaces of
the system.
Tester No:
Tester Name:
Tester Job:
Date:

No | Test Case with Acceptance Criteria | Test Rating (tick √ in one of the boxes from 1 – 5,
i.e. from least satisfied to most satisfied: 1 2 3 4 5)
1 Meeting Objectives
resampled
Transcript can be generated
2 User Interface
4 Functionalities
Completeness and
correctness of features
Application will not crash
when users navigate
through each feature
Successful and error
messages are displayed
clearly
5 Performance
up to an optimum level of
accuracy
Able to display computed
results of WER and CER
System has significant
processing speed
Less loading time upon
navigating between pages
CHAPTER 9: IMPLEMENTATION
9.1 Screenshots
9.1.1 Home Page
Upon opening the local host application, users will be directed to the home page as shown in the
figure above. An introductory text giving an overview of ASR, the purpose of this application
and how users can put it to good use will be shown. Then, users will be asked whether they are
first-time users of the system. If they are, they should click the ‘Yes’ button to be redirected to
the registration page; otherwise, they should click the ‘No’ button to be redirected to the login
page. Users who require extensive aid can click on the "🔊" icon at the top right corner of the
interface, or the playable audio icons below the title and sub-headings, to listen to audio-based
commands on how to navigate through the application.
After selecting the “Yes” option on the home page, the user is redirected to the interface shown
in the figure above. At this point, the user has not filled in any Email Address or password, so a
pop-up message indicating that both fields are empty will appear. Users can choose to go back to
the home page by clicking the “Back to Home Page” button at the top left corner. Users who
require extensive aid can click on the "🔊" icon at the top right corner of the interface, or the
playable audio icons below each warning message, to listen to audio-based commands on how to
register.
Figure 158: Registration Page Showing Error Message for invalid Email Address and Password
As shown in the figure above, if the user enters an invalid school-use Email Address and clicks
the “Register” button, an error message will pop up. The message informs users that the Email
Address must start with “TP” for students or “LS” for tutors and must be followed by 6 digits,
with the school’s abbreviation in between. A valid example of an Email Address is
“TP059923@mail.PTg.edu.my”. If a user enters an invalid password and clicks the “Register”
button, an error message will also pop up, informing users that the password must have a length
between 7 and 20 characters and contain at least 1 digit, 1 lowercase letter, 1 uppercase letter
and 1 symbol.
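A sketch of how such validation rules could be expressed with regular expressions (the exact domain string is an assumption inferred from the example above, and the accepted symbol set is assumed to be any non-alphanumeric character; neither is taken from the system's actual code):

```python
import re

# Assumed reading of the rules above: "TP" or "LS" prefix, 6 digits,
# then the institutional domain taken from the report's example address.
EMAIL_PATTERN = re.compile(r"^(TP|LS)\d{6}@mail\.PTg\.edu\.my$")

def is_valid_password(pw: str) -> bool:
    """Length 7-20, with at least 1 digit, 1 lowercase letter,
    1 uppercase letter and 1 symbol (any non-alphanumeric character)."""
    return (7 <= len(pw) <= 20
            and re.search(r"\d", pw) is not None
            and re.search(r"[a-z]", pw) is not None
            and re.search(r"[A-Z]", pw) is not None
            and re.search(r"[^A-Za-z0-9]", pw) is not None)

print(bool(EMAIL_PATTERN.match("TP059923@mail.PTg.edu.my")))
print(is_valid_password("Abc123!x"))
```

In the Streamlit interface, a failed match on either pattern would trigger the corresponding error message described above.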
Figure 159: Text File to store Email Address and Password entries
Figure 160: Registration Page Showing Error Message for Duplicate Email Address's Entry
As shown in the figure above, if the user enters an already-registered Email Address and then
clicks the “Register” button, an error message will pop up. The message informs the user that
they have entered a duplicate Email Address, as the Email Address entered is the same as the
first entry in the text file.
As shown in the figure above, if the user enters their credentials in the specified format and then
clicks the “Register” button, a message indicating that account creation was successful will
appear along with a playable audio. At this point, the user’s credentials will be stored in the text
file with the password encrypted using a hashing algorithm. The user will then be prompted back
to the login page via the “Proceed to login interface” button that pops up.
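The report does not specify which hashing algorithm the system uses; as one possible realization, a salted PBKDF2-SHA256 scheme from Python's standard library could store and verify passwords like this:

```python
import hashlib
import os
from typing import Optional

def hash_password(password: str, salt: Optional[bytes] = None) -> str:
    """Illustrative salted hash; the actual algorithm used by the
    system is not specified in the report."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                 salt, 100_000)
    # Store salt and digest together, e.g. as one line in the text file
    return salt.hex() + "$" + digest.hex()

def verify_password(password: str, stored: str) -> bool:
    """Re-hash the login attempt with the stored salt and compare."""
    salt_hex, digest_hex = stored.split("$")
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                 bytes.fromhex(salt_hex), 100_000)
    return digest.hex() == digest_hex

record = hash_password("Abc123!x")
print(verify_password("Abc123!x", record))  # matching password
print(verify_password("wrong", record))     # non-matching password
```

Storing only the salted hash (never the plain password) means the text file in figure 159 reveals nothing useful even if read directly.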
After selecting the “No” option on the home page, or upon registration, users will be redirected
to the login page as shown in the figure above. At this point, the user has not filled in any Email
Address or password, so a pop-up message indicating that both fields are empty will appear.
Users can click on the “Change Password” icon to be directed to the password changing
interface. Users can also choose to go back to the home page by clicking the “Back to Home
Page” button at the top left corner. Users who require extensive aid can click on the "🔊" icon at
the top right corner of the interface, or the playable audio icons below each warning message, to
listen to audio-based commands on how to login.
Figure 163: Login Interface Showing Error Message for invalid Email Address and Password
As shown in the figure above, if the user enters an unregistered school-use Email Address, an
error message will pop up. The message indicates that the Email Address has yet to be registered
in the web-based system. Thus, a pop-up “Go to Registration Page” button will prompt the user
back to the registration interface in case the user mis-clicked into the login page before
registering. If a user enters an invalid password, an error message will also pop up, informing
the user that the entered password is incorrect.
As shown in the figure above, if the user enters their credentials in the specified format, a
message indicating that all credentials are correct will pop up. The user will then be prompted to
log in via the “Login” button, which redirects the user to the File Uploader page upon clicking.
After clicking the “Change Password” button on the login page, the user is redirected to the
interface shown in the figure above. At this point, the user has not filled in any Email Address or
password, so a pop-up message indicating that both fields are empty will appear. Users can
choose to go back to the home page by clicking the “Back to Home Page” button at the top left
corner. Users who require extensive aid can click on the "🔊" icon at the top right corner of the
interface, or the playable audio icons below each warning message, to listen to audio-based
commands on how to change their password.
Figure 166: Password Changing Interface Showing Error Message for invalid Email Address and Password
As shown in the figure above, if the user enters an unregistered school-use Email Address, an
error message will pop up. The message indicates that the Email Address has yet to be registered
in the web-based system. Thus, a pop-up “Go to Registration Page” button will prompt the user
back to the registration interface. If a user enters an invalid password, an error message will also
pop up, informing the user that the entered password must have a length between 7 and 20
characters and contain at least 1 digit, 1 lowercase letter, 1 uppercase letter and 1 symbol.
As shown in the figure above, if the user enters their credentials in the specified format and then
clicks the “Change Password” button, a message indicating that the password was modified
successfully will pop up. The user will then be prompted to redirect back to the login page via
the pop-up “Back to Login Page” button.
After clicking the “Login” button on the login page, the user is redirected to the file uploader
interface shown in the figure above. At this point, the user has not filled in any file path nor
uploaded any file to the file uploader panel, so a pop-up message indicating that the user has not
uploaded an audio file or entered a file path will appear. In order to proceed with other
operations, users must copy a valid file path of their uploaded “wav” format audio file. Users can
also choose to go back to the home page by clicking the “Logout” button at the top left corner.
Users who require extensive aid can click on the "🔊" icon at the top right corner of the
interface, or the playable audio icons below each warning message, to listen to audio-based
commands on how to upload an audio file.
Figure 169: File Uploader Interface showing Error Message for Invalid File Type
If the user drags and drops an audio file that is not in “wav” format, the system considers it an
invalid file type and displays the corresponding error message, as shown in the figure above.
Figure 170: File Uploader Interface showing Error Message for Invalid File Path
The file path entered by the user in the figure above is invalid because it ends with a “\”. Hence,
the corresponding error message pops up.
Figure 171: File Uploader Interface showing Error Message for Incorrect File Path
Although the file path entered by the user is valid, it does not match the file name of the
uploaded audio file. Hence, an error message indicating that the file path does not match the
audio file pops up.
Figure 172: File Uploader Interface Showing Message to Remove Quotation Marks
Upon copying the file path from local folders, leading and trailing quotation marks are included
by default. These prevent the back-end file path reader function from reading the contents of the
file. As such, an error message informing users to remove the starting and ending quotation
marks will pop up.
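The quote-stripping step described above can be sketched as a small helper (the function name is hypothetical; Windows' "Copy as path" feature is the typical source of the surrounding quotes):

```python
def clean_file_path(raw_path: str) -> str:
    """Strip surrounding whitespace and the leading/trailing quotation
    marks that Windows' 'Copy as path' adds around a copied path."""
    return raw_path.strip().strip('"').strip("'")

print(clean_file_path('"C:\\Users\\me\\audio\\sample.wav"'))
```

Rather than rejecting quoted paths with an error message, the system could also apply such a helper silently before passing the path to the file reader.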
As shown in the figure above, if the user enters the correct file path that matches their uploaded
audio file, the recording of the audio file will be displayed below, along with a message
indicating that the file upload was successful. The user is then prompted to continue with the
next operations via pop-up text and audio-based commands, as well as 3 buttons connecting to
the Property Viewer, Resampling and Transcript interfaces respectively.
Upon clicking the “View Properties of Audio File” button in the file uploading, resampling or transcript interface, the user is redirected to the property viewer page as shown in the figure above. Users are presented with a recording of their uploaded audio file. They can check its frequency by clicking the “Check Frequency of Audio File” button, and can visualize properties of the audio in the form of raw waveforms, MFCC and Mel filterbank by clicking on the corresponding expanders. Users can also return to the home page by clicking the “Logout” button at the top left corner, or to the file uploader page by clicking the “Back to File Uploader Page” button. Users who require extra aid can click the "🔊" icon at the top right corner of the interface or alongside each text-based command. They can also proceed to the resampling and transcript pages through the buttons at the lower section of the UI.
Figure 175: Property Viewer Interface when User Checks Frequency of Audio File
Upon clicking the “Check Frequency of Audio File” button, the original frequency of the uploaded audio file is displayed as shown in the figure above.
Figure 176: Property Viewer Interface when User Clicks the Expander for Displaying Audio's Waveform
Upon clicking the expander for the raw waveform of the audio file, the interface above pops up. It gives a brief introduction to the X-axis, the Y-axis and an overview of the waveform graph. Users can also hover their cursor over different time frames to view the corresponding amplitude.
Figure 177: Property Viewer Interface when User Clicks the Expander for Displaying MFCC Heatmap
Upon clicking the expander for the MFCC coefficients of the audio file, the interface above pops up. It gives a brief introduction to the X-axis, the Y-axis, the color intensity and an overview of the heatmap. Users can also hover their cursor over different window frames to view the corresponding MFCC coefficient and color intensity level.
Figure 178: Property Viewer Interface when User Clicks the Expander for Displaying Mel Filterbank Heatmap
Upon clicking the expander for the Mel filterbank coefficients of the audio file, the interface above pops up. It gives a brief introduction to the X-axis, the Y-axis, the color intensity and an overview of the heatmap. Users can also hover their cursor over different window frames to view the corresponding Mel filterbank coefficient and color intensity level.
Upon clicking the “Resample Audio File” button in the file uploading, property viewer or transcript interface, the user is redirected to the resampling page as shown in the figure above. Users are presented with a recording of their uploaded audio file. They can check its frequency by clicking the “Check Frequency of Audio File” button, after which they can select a suitable sample rate between 8,000 and 48,000 Hz. Users can also return to the home page by clicking the “Logout” button at the top left corner, or to the file uploader page by clicking the “Back to File Uploader Page” button. Users who require extra aid can click the "🔊" icon at the top right corner of the interface or alongside each text-based command. They can also proceed to the property viewer and transcript pages through the buttons at the lower section of the UI.
Figure 180: Resampling Interface when User Checks Audio File's Frequency
Upon clicking the “Check Frequency of Audio File” button, the original frequency of the uploaded audio file is displayed as shown in the figure above.
Figure 181: Resampling Interface when User Enters A Sample Rate Lower than 8,000 Hz
Figure 182: Resampling Interface when User Enters a Sample Rate Higher than 48,000 Hz
As shown in the two figures above, if users attempt to enter a sample rate lower than 8,000 Hz or higher than 48,000 Hz, the corresponding error message pops up, reminding them to enter a sample rate between 8,000 and 48,000 Hz.
Figure 183: Resampling Interface when User Enters a Sample Rate Between 8,000 to 48,000 Hz
Once the user selects a valid frequency within the specified range, a “Resample Audio File” button pops up.
Figure 184: Resampling Interface when User Clicks 'Resample Audio File' Button
Upon clicking the “Resample Audio File” button, a recording of the resampled audio file is
displayed. A “Download Resampled Audio File” button also pops up, prompting users to
download the resampled audio file. Upon clicking it, users can open the resampled audio file in
the pop-up footer panel of the browser or retrieve the file in their local ‘Downloads’ folder.
Upon clicking the “Generate Transcript” button in the file uploading, property viewer or resampling interface, the user is redirected to the transcript page as shown in the figure above. Users are presented with a recording of their uploaded audio file. They can then generate the transcript by clicking the “Generate Transcript” button, after which they can download the transcript to their local ‘Downloads’ folder. Users can also return to the home page by clicking the “Logout” button at the top left corner, or to the file uploader page by clicking the “Back to File Uploader Page” button. Users who require extra aid can click the "🔊" icon at the top right corner of the interface or alongside each text-based command. They can also proceed to the property viewer and resampling pages through the buttons at the lower section of the UI.
Upon clicking the ‘Generate Transcript’ button, a message pops up indicating that the generated transcript will be shown below. The transcript, highlighted in green, is then displayed.
After the transcript is generated, the ‘Download Transcript’ button pops up, prompting users to click it. After doing so, a download footer pops up in the browser and users can retrieve the downloaded transcript from their local ‘Downloads’ folder.
Figure 188: Interface when User Selects Language and Clicks 'Translate Transcript' Button
In the language selection box, users can choose the desired language to translate their transcript into. Once the user clicks the ‘Translate Transcript’ button, the transcript in the chosen language is displayed in green text.
Figure 189: Interface when User Clicks 'Download Translated Transcript' Button
After the translated version of the transcript is generated, the ‘Download Translated Transcript’ button pops up, prompting users to click it. After doing so, a download footer pops up in the browser and users can retrieve the downloaded transcript from their local ‘Downloads’ folder.
The code above generates the home page of the web-based ASR system. First, headings and images are displayed with markdown formatting using Streamlit’s “markdown” and “image” functions respectively. The second section of the code consists of three question-and-answer pairs of informational text that give users a brief understanding of ASR and the purpose of the application. The last section asks whether users are new to the system: if they select “Yes”, they are directed to the registration page; otherwise they are directed to the login page. This is achieved through the “page_switcher” function, which takes the corresponding operation’s function as an argument within the button’s “on_click” parameter; Streamlit’s “session_state” is changed in the “page_switcher” function based on the page the user chooses to navigate to. All text areas, warning messages and command-wise instructions are also supported with a "🔊" icon or playable audio recordings displayed using Streamlit’s “button” widget, specifically for users with low literacy levels. The alignment of buttons, text, images and markdown is achieved with Streamlit’s “column” function by specifying the number of columns and the ratio each column occupies.
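Although the actual source appears in the figures, a minimal runnable sketch of the navigation mechanism described above might look as follows. The `page_switcher` name comes from the text; the plain dictionary standing in for Streamlit's `session_state`, the page names and the helper callbacks are assumptions for illustration:

```python
# A plain dict stands in for Streamlit's st.session_state so the
# navigation logic can run outside a Streamlit app.
session_state = {"page": "home"}

def page_switcher(target_page):
    """Record the page chosen by the user; in Streamlit, the script
    reruns after the callback and renders session_state['page']."""
    session_state["page"] = target_page

def go_to_register():
    page_switcher("register")

def go_to_login():
    page_switcher("login")

# In the real app these callbacks are attached to buttons, e.g.
# st.button("Yes", on_click=go_to_register)
go_to_register()
print(session_state["page"])  # register
```

Because Streamlit reruns the whole script on every interaction, keeping the current page name in `session_state` is what lets the app remember where the user is between reruns.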
The code above generates the registration interface of the web-based ASR application. The first section displays the header of the interface, which includes the “Back to Home Page” button and the page title. Then, the Email Address and password entry sections are created. Using the inputs entered by users, the next section checks whether the Email Address and password are empty. Once the user clicks the “Register” button, the validity of the Email Address and password is checked against the pre-defined requirements. The Email Address of a new user must also be unique among the Email Addresses already registered in the system. Corresponding error messages are displayed if invalid credentials or an already-registered (i.e., duplicate) Email Address is detected. A success message along with a pop-up “Proceed to Login Interface” button only appears when both the Email Address and password are validated correctly. Upon clicking the button, users are directed to the login page. This is achieved through the “page_switcher” function, which takes the corresponding operation’s function as an argument within the button’s “on_click” parameter; Streamlit’s “session_state” is changed in the “page_switcher” function based on the page the user chooses to navigate to. All text areas, warning messages and command-wise instructions are also supported with a "🔊" icon or playable audio recordings displayed using Streamlit’s “button” widget, specifically for users with low literacy levels. The alignment of buttons, text, images and markdown is achieved with Streamlit’s “column” function by specifying the number of columns and the ratio each column occupies.
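The validation flow described above can be sketched roughly as follows. The concrete rules (the e-mail regex and the 8-character password minimum) and the error strings are assumptions, since the report refers only to “pre-defined requirements”:

```python
import re

def validate_registration(email, password, registered_emails):
    """Return an error message, or None when the input is acceptable."""
    if not email or not password:
        return "Email Address and password columns must not be empty"
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
        return "Invalid Email Address"
    if len(password) < 8:  # assumed minimum length
        return "Password does not meet the pre-defined requirements"
    if email in registered_emails:
        return "This Email Address is already registered"
    return None  # success: show "Proceed to Login Interface" button

print(validate_registration("new@user.com", "secret123", {"old@user.com"}))  # None
```

Returning an error string (or `None` on success) keeps the checks in one place, so the interface code only needs to decide whether to render an error message or the success button.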
The code above generates the login interface of the web-based ASR application. The first section displays the header of the interface, which includes the “Back to Home Page” button and the page title. Then, the Email Address and password entry sections are created, along with a button widget for users to change their password. Using the inputs entered by users, the validity of the Email Address and password is checked against the registered credentials stored in CSV files. Corresponding error messages are displayed if invalid credentials or an unregistered Email Address is detected. A success message along with a pop-up “Login” button only appears when both the Email Address and password are correct. Upon clicking, users are directed to the file uploader page. This is achieved through the “page_switcher” function, which takes the corresponding operation’s function as an argument within the button’s “on_click” parameter; Streamlit’s “session_state” is changed in the “page_switcher” function based on the page the user chooses to navigate to. All text areas, warning messages and command-wise instructions are also supported with a "🔊" icon or playable audio recordings displayed using Streamlit’s “button” widget, specifically for users with low literacy levels. The alignment of buttons, text, images and markdown is achieved with Streamlit’s “column” function by specifying the number of columns and the ratio each column occupies.
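The credential check against CSV-stored users could look roughly like this sketch. The two-column email/password layout and the function name are assumptions; the report states only that registered credentials are stored in CSV files:

```python
import csv
import io

def check_credentials(email, password, csv_text):
    """Return True/False for a registered e-mail, or None if the
    e-mail is not registered at all."""
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 2 and row[0] == email:
            return row[1] == password
    return None

users_csv = "alice@mail.com,pass1234\nbob@mail.com,hunter22\n"
print(check_credentials("alice@mail.com", "pass1234", users_csv))  # True
```

The three-way result lets the interface distinguish between a wrong password and an unregistered e-mail, matching the two error messages described above. In practice, passwords would normally be hashed before storage; the plain-text comparison here simply mirrors the flow the report describes.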
The code above generates the password-changing interface of the web-based ASR application. The first section displays the header of the interface, which includes the “Back to Home Page” button and the page title. Then, the Email Address and password entry sections are created. Using the inputs entered by users, the next section checks whether the Email Address and password are empty. Once the user clicks the “Change Password” button, the validity of the Email Address and password is checked against the pre-defined requirements. The Email Address entered must be one already registered in the system. Corresponding error messages are displayed if invalid credentials or an unregistered Email Address is detected. A success message indicating a successful password change, along with a pop-up “Back to Login Page” button, only appears when both the Email Address and password are validated correctly. Upon clicking the button, users are directed to the login page. This is achieved through the “page_switcher” function, which takes the corresponding operation’s function as an argument within the button’s “on_click” parameter; Streamlit’s “session_state” is changed in the “page_switcher” function based on the page the user chooses to navigate to. All text areas, warning messages and command-wise instructions are also supported with a "🔊" icon or playable audio recordings displayed using Streamlit’s “button” widget, specifically for users with low literacy levels. The alignment of buttons, text, images and markdown is achieved with Streamlit’s “column” function by specifying the number of columns and the ratio each column occupies.
The code above generates the file uploader interface of the web-based ASR application. The first section displays the header of the interface, which includes the “Logout” button, a brief GIF visualization of how to upload an audio file and the page title. The task description is then formulated, and a text area for entering the audio file path and a file uploader panel are rendered using Streamlit’s “text_input” and “file_uploader” widgets. In this case, only “wav” format audio files are allowed, with only a single upload permitted at any point in time. Corresponding error messages are displayed if the file path entered by the user does not match the uploaded file or contains leading and trailing quotation marks. Once the file path and the uploaded file match, the playable audio file is displayed using Streamlit’s “audio” widget, along with a message indicating that the file was uploaded successfully.
The user is then prompted to choose among the next three operations, namely viewing the audio file’s properties, resampling the audio file and generating a transcript, by clicking the corresponding buttons. Upon clicking the buttons, users are directed to the respective pages. This is achieved through the “page_switcher” function, which takes the corresponding operation’s function as an argument within the button’s “on_click” parameter; Streamlit’s “session_state” is changed in the “page_switcher” function based on the page the user chooses to navigate to. All text areas, warning messages and command-wise instructions are also supported with a "🔊" icon or playable audio recordings displayed using Streamlit’s “button” widget, specifically for users with low literacy levels. The alignment of buttons, text, images and markdown is achieved with Streamlit’s “column” function by specifying the number of columns and the ratio each column occupies.
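The path checks described for the uploader (quotation marks, a trailing separator, the “wav” restriction and the match against the uploaded file name) can be sketched as below. The exact error strings are illustrative, not the application's actual messages:

```python
import os

def validate_file_path(path, uploaded_filename):
    """Return an error message mirroring the uploader checks, or None."""
    if path[:1] in ('"', "'") or path[-1:] in ('"', "'"):
        return "Please remove the starting and ending quotation marks"
    if path.endswith(("\\", "/")):
        return "Invalid file path"
    if not path.lower().endswith(".wav"):
        return "Only 'wav' format audio files are supported"
    # Normalise Windows separators so basename works on any platform.
    if os.path.basename(path.replace("\\", "/")) != uploaded_filename:
        return "File path does not match the uploaded audio file"
    return None

print(validate_file_path("C:/audio/lecture.wav", "lecture.wav"))  # None
```

Ordering the checks from syntactic (quotes, trailing slash) to semantic (extension, file-name match) means the user always sees the most fundamental problem first.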
The code above generates the property viewer interface of the web-based ASR application. The first section displays the header of the interface, which includes the “Logout” button, the “Back to File Uploader Page” button and the page title. The playable audio file is then displayed using Streamlit’s “audio” widget. Next, the user is prompted to click a button, displayed using Streamlit’s “button” widget, to check the frequency of the uploaded audio file. Once clicked, a description of the audio file’s name and its frequency in Hz is displayed. Users are then prompted to view the visualization contents inside expanders created with Streamlit’s “expander” widget. These contents include the raw waveform, MFCC and Mel filterbank coefficients of the uploaded audio file, and are displayed by reusing the code from section 5.5.2.
The user is then prompted to choose between the next two operations, namely resampling the audio file and generating a transcript, by clicking the corresponding buttons. Upon clicking the buttons, users are directed to the respective pages. This is achieved through the “page_switcher” function, which takes the corresponding operation’s function as an argument within the button’s “on_click” parameter; Streamlit’s “session_state” is changed in the “page_switcher” function based on the page the user chooses to navigate to. All text areas, warning messages and command-wise instructions are also supported with a "🔊" icon or playable audio recordings displayed using Streamlit’s “button” widget, specifically for users with low literacy levels. The alignment of buttons, text, images and markdown is achieved with Streamlit’s “column” function by specifying the number of columns and the ratio each column occupies.
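As background to the Mel filterbank heatmap shown in the property viewer, the triangular filterbank matrix itself can be computed from scratch as in this generic sketch. This is the standard textbook construction, not the project's actual visualization code (which reuses section 5.5.2); the filter count, FFT size and sample rate below are illustrative defaults:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=10, n_fft=512, sr=16000):
    """Triangular Mel filterbank of shape (n_filters, n_fft // 2 + 1)."""
    # Filter centres are equally spaced on the Mel scale, then mapped
    # back to Hz and on to FFT bin indices.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):          # rising slope
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling slope
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

fb = mel_filterbank()
print(fb.shape)  # (10, 257)
```

Multiplying a power spectrogram by this matrix gives the Mel filterbank energies whose log values appear as the heatmap; taking a DCT of those log energies yields the MFCCs shown in the other expander.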
The code above generates the resampling interface of the web-based ASR application. The first section displays the header of the interface, which includes the “Logout” button, the “Back to File Uploader Page” button and the page title. The playable audio file is then displayed using Streamlit’s “audio” widget. Next, the user is prompted to click a button, displayed using Streamlit’s “button” widget, to check the frequency of the uploaded audio file. Once clicked, a description of the audio file’s name and its frequency in Hz is displayed.
Users are then prompted to enter a new sample rate for the audio file in an input box displayed using Streamlit’s “number_input” widget; the default value of the input box is always 0 Hz. If the user does not enter a sample rate between 8,000 and 48,000 Hz inclusive, a warning message pops up. Otherwise, they are prompted to resample the audio file by clicking the corresponding button. Upon clicking, the file’s signal is converted to floating-point type using the “astype” function, resampled based on the original frequency and the size of the signal using the “resample” function, and written to memory. The recording of the resampled audio file is then displayed alongside another button that prompts the user to download the resampled audio file. Upon clicking, users can retrieve the file from the pop-up footer of the browser or from their local ‘Downloads’ folder.
The user is then prompted to choose between the next two operations, namely viewing the audio file’s properties and generating a transcript, by clicking the corresponding buttons. Upon clicking the buttons, users are directed to the respective pages. This is achieved through the “page_switcher” function, which takes the corresponding operation’s function as an argument within the button’s “on_click” parameter; Streamlit’s “session_state” is changed in the “page_switcher” function based on the page the user chooses to navigate to. All text areas, warning messages and command-wise instructions are also supported with a "🔊" icon or playable audio recordings displayed using Streamlit’s “button” widget, specifically for users with low literacy levels. The alignment of buttons, text, images and markdown is achieved with Streamlit’s “column” function by specifying the number of columns and the ratio each column occupies.
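The resampling step described above (convert the signal to floating point with `astype`, resample it with a `resample` function based on the original rate and signal size, and write the result to memory) might look roughly like this SciPy-based sketch. The exact functions and parameters used in the project are assumptions:

```python
import io

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample

def resample_audio(signal, orig_sr, target_sr):
    """Resample a 1-D PCM signal and write it to an in-memory WAV file."""
    signal = signal.astype(np.float32)                     # 'astype' step
    n_out = int(round(len(signal) * target_sr / orig_sr))  # new signal size
    resampled = resample(signal, n_out)                    # Fourier-method resampling
    buffer = io.BytesIO()                                  # write to memory, not disk
    wavfile.write(buffer, target_sr, resampled.astype(np.int16))
    buffer.seek(0)
    return resampled, buffer

# A 440 Hz test tone, one second at 16 kHz, resampled down to 8 kHz.
tone = (1000 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)).astype(np.int16)
resampled, wav_bytes = resample_audio(tone, orig_sr=16000, target_sr=8000)
print(len(resampled))  # 8000
```

Writing to an in-memory buffer rather than a temporary file is what allows the resampled audio to be handed directly to a playback widget and a download button.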
The code in the two figures above generates the transcript interface of the web-based ASR application. The first section displays the header of the interface, which includes the “Logout” button, the “Back to File Uploader Page” button and the page title. The playable audio file is then displayed using Streamlit’s “audio” widget. Next, the user is prompted to click a button, displayed using Streamlit’s “button” widget, to generate the transcript for the uploaded audio file. Once clicked, the regularized CNN-GRU model, which is the optimal model, is loaded from the local folder using the “load_model” function of the ‘tensorflow.keras.models’ package. The audio path is then read and passed to a “spec” function to obtain its spectrogram, which is fed into the model for prediction. The output is a sequence of integers, which is passed to the “CTC_decode” function to be converted into a string sequence.
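The decoding step can be illustrated with a simplified greedy version of CTC decoding: take the most probable class per frame, collapse consecutive repeats, and drop the blank symbol. The vocabulary and blank index below are assumptions, and the project's actual “CTC_decode” function most likely wraps `keras.backend.ctc_decode` rather than this hand-rolled loop:

```python
import numpy as np

VOCAB = list("abcdefghijklmnopqrstuvwxyz' ")  # index -> character (assumed)
BLANK = len(VOCAB)                            # CTC blank is the last class

def ctc_greedy_decode(frame_probs):
    """Greedy CTC decode of a (time, classes) probability matrix."""
    best = frame_probs.argmax(axis=-1)
    chars, prev = [], None
    for idx in best:
        if idx != prev and idx != BLANK:  # collapse repeats, skip blanks
            chars.append(VOCAB[idx])
        prev = idx
    return "".join(chars)

# Frames whose argmaxes are h, h, blank, i, i should decode to "hi":
probs = np.zeros((5, len(VOCAB) + 1))
for t, c in enumerate([7, 7, BLANK, 8, 8]):
    probs[t, c] = 1.0
print(ctc_greedy_decode(probs))  # hi
```

The blank between the two runs is what allows genuine repeated letters to survive decoding; without it, collapsing repeats would merge them.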
The transcript is then displayed with a markdown size of 25 in green colour and stored in a temporary file called “TR” for later retrieval. Next, users are prompted to download the transcript by clicking the corresponding button. Upon clicking, the transcript is saved into the user’s local ‘Downloads’ folder as “transcript_” plus the name of the uploaded audio file, in ‘.txt’ format, through Streamlit’s “download_button” widget.
Next, users are prompted to select a language to translate into through Streamlit’s “selectbox” widget. They then click the pop-up ‘Translate Transcript’ button to translate the transcript. Upon clicking, the original transcript is retrieved from “TR.txt” and passed to the Google Translator API via the identified language code. The translated transcript is displayed with a markdown size of 25 in green colour. Finally, users are prompted to download the translated transcript, which is saved into the user’s local ‘Downloads’ folder as “{lang}_transcript_” plus the name of the uploaded audio file, in ‘.txt’ format, whereby ‘lang’ represents the chosen language.
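The language-code lookup and download file-naming scheme described in the last two paragraphs can be sketched as follows. The helper name, the subset of language codes and the choice to strip the audio extension before appending ‘.txt’ are all assumptions for illustration:

```python
# Assumed subset of Google Translate language codes used for the lookup.
LANGUAGE_CODES = {"Malay": "ms", "Chinese": "zh-CN", "Tamil": "ta"}

def transcript_filename(audio_filename, lang=None):
    """Build 'transcript_<audio>.txt' or '<lang>_transcript_<audio>.txt'."""
    base = audio_filename.rsplit(".", 1)[0]  # drop the '.wav' extension
    prefix = f"{lang}_transcript_" if lang else "transcript_"
    return f"{prefix}{base}.txt"

print(transcript_filename("lecture.wav"))           # transcript_lecture.txt
print(transcript_filename("lecture.wav", "Malay"))  # Malay_transcript_lecture.txt
```

Deriving both file names from the uploaded audio's name keeps the original transcript and its translations visibly paired in the user's ‘Downloads’ folder.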
The user is then prompted to choose between the next two operations, namely viewing the audio file’s properties and resampling the audio file, by clicking the corresponding buttons. Upon clicking the buttons, users are directed to the respective pages. This is achieved through the “page_switcher” function, which takes the corresponding operation’s function as an argument within the button’s “on_click” parameter; Streamlit’s “session_state” is changed in the “page_switcher” function based on the page the user chooses to navigate to. All text areas, warning messages and command-wise instructions are also supported with a "🔊" icon or playable audio recordings displayed using Streamlit’s “button” widget, specifically for users with low literacy levels. The alignment of buttons, text, images and markdown is achieved with Streamlit’s “column” function by specifying the number of columns and the ratio each column occupies.
Test case 1.1. Steps: (1) leave all text fields empty; (2) click the “Register” button. Expected result: error messages pop up indicating that both the Email Address and password columns are empty. Actual result: as expected. Status: Pass.
Test case 2.1. Steps: (1) leave all text fields empty. Expected result: error messages pop up indicating that both the Email Address and password columns are empty. Actual result: as expected. Status: Pass.
Test case 3.1. Steps: (1) leave all text fields empty; (2) click the “Change Password” button. Expected result: error messages pop up indicating that both the Email Address and password columns are empty. Actual result: as expected. Status: Pass.
Test case 4.1. Steps: (1) click the “Logout” button. Expected result: the user is logged out and redirected back to the home page of the application. Actual result: as expected. Status: Pass.
Test case 6.4. Steps: (1) click the log filterbank expander for the chosen audio file; (2) adjust the contents of the chart through the task bar at the top right panel of the expander. Expected result: the user can identify the intensity of the log filterbank coefficient at specific window frames. Actual result: as expected. Status: Pass.
Test case 7.2. Steps: (1) enter a new frequency lower than 8,000 Hz in the text box. Expected result: an error message pops up indicating that the frequency cannot be lower than 8,000 Hz. Actual result: as expected. Status: Pass.
Tester No: 1
Date: 1/5/2023
Ratings (√ placed in one of the boxes from 1 to 5, i.e., from least satisfied to most satisfied):
1. Meeting Objectives: √
2. User Interface: √ (information is well structured; crucial information can be displayed clearly)
4. Functionalities: √ (completeness and correctness of features; application will not crash when users navigate through each feature; successful and error messages are displayed clearly)
5. Performance: √
Comments: The overall web application is smooth and user-friendly. However, some problems occur while running it, such as the long waiting time while going through the model process. Besides that, the web application needs to refresh after each entry, which degrades the browsing experience.
Tester No: 2
Date: 2/5/23
Ratings (√ placed in one of the boxes from 1 to 5, i.e., from least satisfied to most satisfied):
1. Meeting Objectives: √
2. User Interface: √
4. Functionalities: √ (completeness and correctness of features; application will not crash when users navigate through each feature; successful and error messages are displayed clearly)
5. Performance: √ (… preloaded model; system has significant processing speed; less loading time upon navigating between pages)
Comments: The objective is not stated very clearly, as the system has more functionalities than the intended objectives. Also, the panels and widgets of the system are somewhat scattered, such that navigating through them can be a bit unpleasant for newcomers.
Tester No: 3
Date: 2/5/23
Ratings (√ placed in one of the boxes from 1 to 5, i.e., from least satisfied to most satisfied):
1. Meeting Objectives: √ (… resampled; transcript can be generated)
2. User Interface: √
4. Functionalities: √ (completeness and correctness of features; application will not crash when users navigate through each feature; successful and error messages are displayed clearly)
5. Performance: √ (… up to an optimum level of accuracy using the preloaded model; system has significant processing speed; less loading time upon navigating between pages)
Comments: The error and success messages are a bit too long and may confuse users with low literacy levels. Some messages originate from the same error but use different wording.
Tester No: 4
Date: 2/5/23
Ratings (√ placed in one of the boxes from 1 to 5, i.e., from least satisfied to most satisfied):
1. Meeting Objectives: √
2. User Interface: √
4. Functionalities: √ (completeness and correctness of features; application will not crash when users navigate through each feature; successful and error messages are displayed clearly)
5. Performance: √
Comments: The audio-based commands are a bit out of order, as they are not aligned with the original text commands on some pages. Users may also be unable to distinguish between the audio-based commands and the audio file they uploaded.
10.3 Summary
After documenting the Unit Testing sheet and User Acceptance Testing (UAT) sheet in section 8.3, these documents were assigned to the internal system testing team and several clients, including both students and tutors. According to the System Testing results, all 11 features of the system work as intended, i.e., they produce the results the developers expect. Hence, all individual functions or testing units within each feature passed their tests successfully.
As for UAT, 5 criteria were assessed, namely “Meeting Objectives”, “User Interface”, “Design and aesthetics”, “Functionalities” and “Performance”. The UAT results, collected from 2 students and 2 tutors, show that the overall feedback is quite constructive. Based on their feedback, it can be deduced that the system is largely bug-free, as all scores for functionalities are greater than or equal to 3. However, the most problematic aspect of the system is the User Interface design: several pieces of feedback note that the UI is not very beginner-friendly, with button widgets scattered unevenly, leaving users unable to distinguish between audio files and audio commands. Another piece of feedback worth noting concerns the system’s performance, which stems from the nature of Streamlit, whereby the entire application reruns whenever a user clicks a button or enters a new text input. These considerations will be brought to the software development team, and corresponding improvements will be made in the following release versions.
Prior to this project’s completion, an extensive amount of research was done, not limited to previous studies but also covering technical aspects of the project such as the programming language, IDE, libraries and other hardware or software specifications. In terms of previous research, the general architecture of ASR, the front-end feature extraction process, the back-end implementation, various machine learning models and evaluation techniques for speech recognition were explored thoroughly, without restricting the scope of research to English-language e-learning systems only. Side-by-side comparisons in various aspects between 2 programming languages, Java and Python, as well as 3 data mining methodologies, KDD, CRISP-DM and SEMMA, were also made to choose the most suitable tool and framework for this project: Python and CRISP-DM respectively.
In terms of the implementation aspects of the project, the chosen dataset has an audio file column and a corresponding transcript column as reference. Exploration, pre-processing, visualization and partitioning of the dataset were performed to convert the acoustic inputs into vectorized input sequences of spectrograms representing the underlying speech features. Then, non-regularized CNN-GRU, regularized CNN-GRU, regularized GRU and regularized CNN-LSTM models were developed by applying knowledge of hyperparameter tuning, regularization, optimization and loss computation. Evaluation metrics such as training and validation loss, and WER and CER for both the testing and validation sets, were then prepared and evaluated to decide which model is best overall. Finally, it is concluded that the regularized CNN-GRU model is the most performant of all 4 model variations.
Along with basic functionalities such as login, logout, registration and password change, and advanced functionalities such as viewing an audio file’s properties, resampling an audio file and downloading such files, the model is deployed in the web-based ASR system implemented in the Streamlit environment. The deployed model is used to generate the transcript, which users can then translate and download to be utilized for self-learning and conducting lessons.
11.2 Reflection
Reflecting as the developer, there is a subtle difference between theoretical understanding and coding-wise implementation. This is justified by the fact that several techniques introduced in the Literature Review section could not be implemented within the context of this project. A speech recognition system has no boundary on word limit, as it has a huge, effectively infinite text corpus, whereas some if not all of the studies reviewed limited themselves to a very small corpus, presumably fewer than 100 tokens; one example is the study on recognizing 10 Bangla digits from a limited number of speakers. As a result, several theoretically addressed models such as HMM, GMM and hybrid HMM-GMM models could not be implemented within the context of this project: these machine-learning-based models, which do not require the extensive training of deep learning models such as RNNs and hybrid CNN-RNN models, are suited to far smaller vocabularies.
Hardware configuration must also be taken into full consideration when implementing deep learning projects. Deep learning requires extensive training with a massive number of hyperparameters in place, which consumes a great deal of memory; CPU memory is typically insufficient for such tasks, and studies confirm that a GPU can process such inputs 30 times faster than a CPU. Due to the absence of an up-to-date graphics card, with the current one having only 2 GB of GPU memory, several attempts at running the modelling code resulted in BSOD, hence the decision to switch back to the CPU.
Furthermore, to minimize the research gap, more research must be conducted on how to utilize other modelling techniques, specifically hybrid models, to generate text from speech. Further studies on hyperparameter tuning, selection, optimization and combinations of all of these should be conducted more extensively. Additionally, comparisons with more ASR projects practically implemented using a similar approach, ideally within the e-learning domain, should be made. To cope with the learning environment of an online classroom, corresponding models with more hyperparameters that can capture more precise speech features should be studied. Developers can also start by performing analysis and modelling of speech recognition accuracy on audio data with higher complexity, in the form of spontaneous speech, speaker adaptation, large speech corpora and noisy environments. The situation becomes even more complicated when multiple speakers are talking at the same time, which can occur periodically in e-learning sessions. Researchers should also analyse the recognition accuracy of different machine learning models, of the previously described models with more feature extraction layers added, and of changes to the default parameter settings, to derive the most optimal model in future studies.
APPENDICES
FYP TURNITIN Report (First 2 Pages)
Library Form
Confidentiality Document
FYP Poster
Ethics Form