Andrew Nathan Lee MR Tp059923 Fyp Apu3f2209 Csda

Andrew Nathan Lee TP059923
Automatic Speech Recognition (ASR) system for e-learning in English Language
By
ANDREW NATHAN LEE
TP059923
APU3F2209 CS(DA)
A project submitted in partial fulfilment of the requirements of Asia Pacific University of

Technology and Innovation for the degree of
BSc (Hons) in Computer Science specialism in Data Analytics
Supervised by Mr. AMARDEEP
2nd Marker: Ms. HEMA LATHA KRISHNA NAIR
May-2023
P a g e | 1 of 294 Asia Pacific University of Technology and Innovation

Acknowledgement
First and foremost, I would like to express my deepest gratitude to my FYP supervisor, Mr.
Amardeep for providing step-by-step guidance throughout my FYP semester 1 and 2. I really
appreciate his efforts in reviewing my documentation ranging from Project Proposal Form (PPF),
Project Specification Form (PSF) to Investigation Report (IR) and providing relevant suggestions
on how to improvise my work. Not to mention that he also asked me several in-depth questions
that triggered my inner thoughts on how to view my project from an external or outsider point of
view. In addition, he also kept track of my progress from time to time to ensure that I was on the
right track.
Next, I would like to extend my sincere thanks to our FYP lecturer, Mr. Dhason Padmakumar for
briefing us on the series of tasks that we are required to do in FYP module. During lecture
classes, he provided a detailed explanation on the guidelines, requirements and formatting of
each documentation and even gave us sample documents from previous batch of students as a
reference. Tips on how to score a good grade in FYP have also been highlighted by Mr. Dhason
which have helped me a lot in doing my documentation.
I am also grateful to the lecturers that have taught me during my 3 years degree period in APU.
Thanks to lecturers who conduct modules such as Introduction to Database (IDB), Data Mining
and Predictive Modelling (DMPM), Research Methods for Computing and Technology (RMCT),
Text Analytics and Sentiment Analysis (TXSA) etc., I have attained sufficient knowledge in the
corresponding domain to be applied into the implementation of this project.
Last but not least, I am also extremely thankful to my parents and friends. Without their
unconditional support and assistance throughout this journey, I would not have completed my
project on time. I would also like to take this opportunity to appreciate all participants who have
provided their constructive feedback in the questionnaire survey. I promised that I would use
these feedback to improve the accuracy and functionality of my system to satisfy all
requirements from the users’ end.

Table of Contents
Acknowledgement.......................................................................................................................................2
Table of Contents........................................................................................................................................3
List of Figures.............................................................................................................................................7
List of Tables.............................................................................................................................................14
CHAPTER 1: INTRODUCTION TO THE STUDY.................................................................................16
1.1 Background of the Project...............................................................................................................16
1.2 Problem Context..............................................................................................................................19
1.3 Rationale..........................................................................................................................................21
1.4 Potential Benefits.............................................................................................................................22
1.4.1 Tangible Benefits......................................................................................................................22
1.4.2 Intangible Benefits....................................................................................................................23
1.5 Target Users....................................................................................................................................23
1.6 Scopes & Objectives........................................................................................................................24
1.6.1 Aim...........................................................................................................................................24
1.6.2 Objectives.................................................................................................................................24
1.7 Overview of this Investigation Report.............................................................................................26
1.8 Project Plan......................................................................................................................................29
CHAPTER 2: LITERATURE REVIEW...................................................................................................30
2.1 Introduction.....................................................................................................................................30
2.2 Domain Research.............................................................................................................................31
2.2.1 Classification of ASR...............................................................................................................31
2.2.2 General Overview of ASR System...........................................................................................33
2.2.3 Machine Learning Models........................................................................................................38
2.2.4 Performance Evaluation of ASR...............................................................................................47
2.3 Similar Systems...............................................................................................................................48
2.3.1 Automatic Speech Recognition for Bangla Digits....................................................................48
2.3.2 Dynamic Time Warping (DTW) Based Speech Recognition System for Isolated Sinhala Words
...........................................................................................................................................................50
2.3.3 Convolutional Neural Network (CNN) based Speech Recognition System for Punjabi
Language...........................................................................................................................................51

2.3.4 Comparison of Similar Systems................................................................................................54

2.4 Summary.........................................................................................................................................57
CHAPTER 3: TECHNICAL RESEARCH................................................................................................58
3.1 Programming Language Chosen......................................................................................................58
3.1.1 Justification of Programming Language Chosen......................................................................62
3.2 IDE (Integrated Development Environment) Chosen......................................................................63
3.3 Libraries / Tools Chosen..................................................................................................................64
3.3.1 Data Pre-processing..................................................................................................................64
3.3.2 Machine Learning.....................................................................................................................64
3.3.3 Deep Learning..........................................................................................................................64
3.3.4 Data Visualization....................................................................................................................65
3.3.5 GUI...........................................................................................................................................65
3.4 Operating System Chosen................................................................................................................66
3.5 Summary.........................................................................................................................................68
CHAPTER 4: METHODOLOGY.............................................................................................................69
4.1 Introduction.....................................................................................................................................69
4.2 Comparison of Data Mining Methodologies....................................................................................70
4.3 Justification on Data Mining Methodology chosen..........................................................................72
4.4 CRISP-DM......................................................................................................................................73
4.5 Summary.........................................................................................................................................76
CHAPTER 5: DATA ANALYSIS............................................................................................................77
5.1 Introduction.....................................................................................................................................77
5.2 Metadata..........................................................................................................................................78
5.3 Initial Data Exploration...................................................................................................................80
5.4 Data Cleaning..................................................................................................................................83
5.4.1 Handling Missing Values..........................................................................................................83
5.4.2 Removing Erroneous Data........................................................................................................84
5.4.3 Data Sampling..........................................................................................................................85
5.4.4 Data Transformation.................................................................................................................88
5.4.4.1 Generate List of Dictionary................................................................................................88
5.4.4.2 Generate Spectrogram........................................................................................................88
5.4.4.3 Generate Vocabulary..........................................................................................................89

5.4.4.4 Generate Transcript Label..................................................................................................89

5.4.4.5 Generate Tensorflow Dataset.............................................................................................90
5.5 Data Visualization...........................................................................................................................92
5.5.1 Transcript..................................................................................................................................92
5.5.2 Audio........................................................................................................................................95
5.6 Data Partition...................................................................................................................................99
5.7 Modelling......................................................................................................................................100
5.7.1 Model Preparation..................................................................................................................100
5.7.2 CNN-GRU..............................................................................................................................101
5.7.2.1 Non-regularized CNN-GRU............................................................................................101
5.7.2.2 Regularized CNN-GRU...................................................................................................105
5.7.3 GRU........................................................................................................................................108
5.7.4 CNN-LSTM............................................................................................................................112
5.8 Summary.......................................................................................................................................116
CHAPTER 6: RESULTS AND DISCUSSIONS.....................................................................................117
6.1 Introduction...................................................................................................................................117
6.2 Results and Discussions.................................................................................................................118
6.2.1 Metrics Preparation.................................................................................................................118
6.2.2 Training and Validation Loss..................................................................................................120
6.2.3 WER of Validation Set...........................................................................................................126
6.2.4 CER of Validation Set............................................................................................................131
6.2.5 Code For Evaluation on Testing Set.......................................................................................134
6.2.6 WER of Testing Set................................................................................................................138
6.2.7 CER of Testing Set.................................................................................................................142
6.3 Summary.......................................................................................................................................146
CHAPTER 7: SYSTEM ARCHITECTURE...........................................................................................148
7.1 Introduction...................................................................................................................................148
7.2 Abstract Architecture.....................................................................................................................150
7.2.1 System Design........................................................................................................................150
7.2.1.1 Use Case Diagram...........................................................................................................150
7.2.1.2 Use Case Specification....................................................................................................151
7.2.1.3 Activity Diagram..............................................................................................................160

7.2.2 Interface Design......................................................................................................................166

CHAPTER 8: PROJECT PLAN..............................................................................................................170
8.1 Features.........................................................................................................................................170
8.1.1 Register...................................................................................................................................170
8.1.2 Login......................................................................................................................................170
8.1.3 Change Password....................................................................................................................170
8.1.4 Logout....................................................................................................................................171
8.1.5 Listen to audio-based commands............................................................................................171
8.1.6 Upload Audio File..................................................................................................................171
8.1.7 View Properties of Audio File................................................................................................171
8.1.8 Resample Audio File..............................................................................................................171
8.1.9 Download Resampled Audio File...........................................................................................172
8.1.10 Generate Transcript...............................................................................................................172
8.1.11 Translate Transcript..............................................................................................................172
8.1.12 Download Transcript............................................................................................................172
8.2 Details of the Release Plan............................................................................................................173
8.2.1 Version 1.0.............................................................................................................................173
8.2.2 Version 2.0.............................................................................................................................173
8.2.3 Version 3.0.............................................................................................................................173
8.3 Test Plan........................................................................................................................................175
8.3.1 Unit Testing............................................................................................................................175
8.3.2 User Acceptance Testing........................................................................................................185
CHAPTER 9: IMPLEMENTATION......................................................................................................188
9.1 Screenshots....................................................................................................................................188
9.1.1 Home Page..............................................................................................................................188
9.1.2 Registration Page....................................................................................................................190
9.1.3 Login Page..............................................................................................................................193
9.1.4 Password Changing Page........................................................................................................195
9.1.5 File Uploader Page.................................................................................................................197
9.1.6 Property Viewer Page.............................................................................................................201
9.1.7 Resampling Page....................................................................................................................205
9.1.8 Transcript Page.......................................................................................................................209

9.2 Sample Codes................................................................................................................................212

9.2.1 Home Page..............................................................................................................................212
9.2.2 Registration Page....................................................................................................................213
9.2.3 Login Page..............................................................................................................................215
9.2.4 Password Changing Page........................................................................................................216
9.2.5 File Uploader Page.................................................................................................................218
9.2.6 Property Viewer Page.............................................................................................................220
9.2.7 Resampling Page....................................................................................................................222
9.2.8 Transcript Page.......................................................................................................................224
CHAPTER 10: SYSTEM VALIDATION...............................................................................................227
10.1 Unit Testing Result......................................................................................................................227
10.2 User Acceptance Testing.............................................................................................................237
10.3 Summary.....................................................................................................................................245
CHAPTER 11: CONCLUSIONS AND REFLECTIONS.......................................................................246
11.1 Conclusion...................................................................................................................................246
11.2 Reflection....................................................................................................................................247
REFERENCES........................................................................................................................................249
APPENDICES.........................................................................................................................................259
FYP TURNITIN Report (First 2 Pages)..............................................................................................259
Library Form.......................................................................................................................................261
Confidentiality Document...................................................................................................................262
FYP Poster...........................................................................................................................................263
Project Log Sheets...............................................................................................................................264
Project Proposal Form (PPF)...............................................................................................................270
Project Specification Form (PSF)........................................................................................................278
Ethics Form.........................................................................................................................................288
Gantt Chart for FYP............................................................................................................................292
List of Figures
Figure 1: Speech transmission and formulation process (Varol et al., n.d.)...............................................18

Figure 2: ASR system classification (Bhardwaj et al., 2022........................................................................32

Figure 3: ASR system architecture (Jamal et al., 2017...............................................................................34
Figure 4: Process of MFCC Feature Extraction (S & E, 2016)......................................................................35
Figure 5: Speech Frequency Graph for Hamming Window (Kanabur et al., 2019b)..................................35
Figure 6: Equation for Hamming Window graph (Gupta et al., 2013)........................................................35
Figure 7: Equation for DFT computation (Gupta et al., 2013)....................................................................36
Figure 8: Equation to calculate Mel frequency (Gupta et al., 2013)..........................................................36
Figure 9: Graph for relationship between frequency and Mel frequency (Gupta et al., 2013)..................36
Figure 10: Equation to calculate MFCC (Gupta et al., 2013)......................................................................36
Figure 11: General equation for statistical computation of ASR (Jiang, 2003)...........................................38
Figure 12: Three-state Markov chain (Makhoul & Schwartz, 1995)...........................................................39
Figure 13: Equation for transition probability (aij ) (Rupali et al., 2013)....................................................39
Figure 14: Equation for summation of transition probability ( aij (k )) (Rupali et al., 2013).....................39
Figure 15: Equation for observational symbol probability ( bj (k ) ) (Rupali et al., 2013)............................39
Figure 16: Equation for summation of observational symbol probability ( bj(k ))) (Rupali et al., 2013). .40
Figure 17: Equation for initial state distribution ( πi) (Rupali et al., 2013).................................................40
Figure 18: Warping between two non-linear time series (Muda et al., 2010)...........................................41
Figure 19: Equation to construct a local cost matrix (Senin, 2008)............................................................41
Figure 20: Local path alternatives for grid point (i, j) (Saleh, n.d.).............................................................42
Figure 21: Equation to compute the minimum accumulated distance for optimal path δ(i,j) (Saleh, n.d.)
.................................................................................................................................................................. 42
Figure 22: CNN architecture for speech recognition (Huang et al., n.d.)...................................................43
Figure 23: Stages of CNN architecture for speech recognition (Palaz Mathew Magimai-Doss Ronan
Collobert et al., 2014)................................................................................................................................43
Figure 24: Equation for convolution calculation (Wang et al., 2019).........................................................44
Figure 25: Example for convolution calculation (Wang et al., 2019).........................................................44
Figure 26: Equation for time span (tc) and time shift window( windowc) (Wang et al., 2019)................45
Figure 27: Equation for ReLU activation function (Wang et al., 2019).......................................................45
Figure 28: Equation for ClippedReLU activation function (Wang et al., 2019)...........................................45
Figure 29: Equations for time span (tp) and time shift window(windowp) of max pooling (Wang et al.,
2019).........................................................................................................................................................46
Figure 30: Final equations for time span (tp ) and time shift window(windowp) of max pooling (Wang et
al., 2019)....................................................................................................................................................46
Figure 31:Equation for every vector's BN in a batch (Wang et al., 2019)..................................................47
Figure 32: Equations for each neuron's BN in every vector of a batch (Wang et al., 2019).......................47
Figure 33: Equation for Softmax activation function (M. Rammo & N. Al-Hamdani, 2022).......................47
Figure 34: Equation for calculation of Word Error Rate (WER) (du Simplon et al., 2005)..........................48
Figure 35: Equation for calculation of Word Recognition Rate (WRR) (du Simplon et al., 2005)...............48
Figure 36: 10 Bangla digit representation, pronunciation and IPA (Muhammad et al., 2009)...................49
Figure 37: Digit correct rate (%) of Bangla digits in ASR (Muhammad et al., 2009)...................................50

Figure 38: Word Recognition Rate (WRR) for 4 speakers in 3 respective sessions (Priyadarshani et al.,
2012).........................................................................................................................................................51
Figure 39: Parameter setup for CNN based Speech Recognition System for Punjabi language (Dua et al.,
2022).........................................................................................................................................................52
Figure 40: Framework for CNN based Speech Recognition System for Tonal Speech Signals (Dua et al.,
2022).........................................................................................................................................................53
Figure 41: Word Recognition Rate (WRR) of different speakers (Dua et al., 2022)...................................53
Figure 42: Overall Word Recognition Rate (WRR) compared to other speech recognition systems (Dua et
al., 2022)....................................................................................................................................................54
Figure 43: Overview of Operating System (Understanding Operating Systems - University of Wollongong
– UOW, 2022)............................................................................................................................................67
Figure 44: Overview of CRISP-DM methodology (Wirth, R., & Hipp, J., 2000, April)..................................74
Figure 45: Contents of LJ speech dataset..................................................................................................79
Figure 46: list of "wav" audio files in "wavs" folder...................................................................................79
Figure 47: Code and output for data extraction........................................................................................81
Figure 48: Code and output for adding column names.............................................................................81
Figure 49: Code and output for dimension of DataFrame.........................................................................81
Figure 50: Code and output for information of each attribute..................................................................82
Figure 51: Code and output for total and unique word count in "Normalized Transcript" column...........82
Figure 52: Code and output to display frequencies and number of samples for an audio file..................83
Figure 53: Code and output for dropping rows that contain empty values...............................................84
Figure 54: Code and output for dropping "Transcript" column.................................................................84
Figure 55: Code and output for dropping rows that contain non-ASCII characters in "Normalized
Transcript" column....................................................................................................................................85
Figure 56: Code and Output for computing word frequency distribution.................................................86
Figure 57: Code for data sampling.............................................................................................................87
Figure 58: Output for data sampling..........................................................................................................87
Figure 59: Code and output for total and unique word count in "Normalized Transcript" column after
Data Sampling............................................................................................................................................88
Figure 60: Code Snippet to Create list of dictionary from DataFrame Object............................................89
Figure 61: Code Snippet to generate Spectrogram for audio file..............................................................89
Figure 62: Code Snippet for Vocabulary Set with Encoder and Decoder...................................................90
Figure 63: Code Snippet to generate Transcript Label...............................................................................90
Figure 64: Code Snippet to Merge Spectrogram and Transcript Label......................................................91
Figure 65: Code Snippet to Construct Keras Dataset Object......................................................................91
Figure 66: Code Snippet to plot bar chart for token frequency dictionary................................................93
Figure 67: Bar Chart for common tokens versus frequency......................................................................93
Figure 68: Code Snippet to plot histogram for individual transcript's token count versus frequency.......94
Figure 69: Histogram for individual transcript's token count versus frequency........................................94
Figure 70: Code Snippet to plot histogram for individual transcript's token count versus frequency after
data sampling............................................................................................................................................95
Figure 71 Histogram for individual transcript's token count versus frequency after data sampling..........95

Figure 72: Code Snippet to plot waveform, MFCC and Mel Coefficients...................................................96
Figure 73: Raw Waveform for Amplitude Against Time for Audio File......................................................98
Figure 74: Heatmap for MFCC Coefficients against Windows for Audio File.............................................98
Figure 75: Heatmap for Mel Coefficients against Windows for Audio File................................................99
Figure 76: Code Snippet to create function for data partitioning............................................................100
Figure 77: Code Snippet to perform data partitioning.............................................................................100
Figure 78: Code Snippet for CTC loss function.........................................................................................101
Figure 79: Code Snippet for constructing a CNN-GRU model..................................................................102
Figure 80: Summary of CNN-GRU model.................................................................................................104
Figure 81: Code Snippet to train CNN-GRU model..................................................................................105
Figure 82: Code Snippet to construct regularized CNN-GRU model........................................................106
Figure 83: Summary of regularized CNN-GRU model..............................................................................107
Figure 84: Code Snippet to train regularized CNN-GRU model................................................................108
Figure 85: Code Snippet to construct regularized GRU model................................................................109
Figure 86: Summary of regularized GRU model.......................................................................................111
Figure 87: Code Snippet for Training regularized GRU model.................................................................112
Figure 88: Code Snippet to construct regularized CNN-LSTM model.......................................................113
Figure 89: Summary of regularized CNN-LSTM model.............................................................................115
Figure 90: Code Snippet for training regularized CNN-LSTM model........................................................116
Figure 91: Code Snippet for CTC Decoding..............................................................................................119
Figure 92: Code Snippet to generate metrics for Validation and Testing................................................120
Figure 93: Code Snippet to visualize Train and Validation Loss for CNN-GRU model..............................121
Figure 94: Relationship between Train and Validation Loss against Epoch for CNN-GRU model............121
Figure 95: Code Snippet to visualize Train and Validation Loss for regularized CNN-GRU model............122
Figure 96: Relationship between Train and Validation Loss against Epoch for regularized CNN-GRU model
................................................................................................................................................................ 123
Figure 97: Code Snippet to visualize Train and Validation Loss for regularized GRU model....................124
Figure 98: Relationship between Train and Validation Loss against Epoch for regularized GRU model. .124
Figure 99: Code Snippet to visualize Train and Validation Loss for regularized CNN-LSTM model..........125
Figure 100: Relationship between Train and Validation Loss against Epoch for regularized CNN-LSTM
model......................................................................................................................................................125
Figure 101: Code Snippet to Generate Line Graph for WER Over Epoch for CNN-GRU model................127
Figure 102: Line Graph for WER Over Epoch for CNN-GRU model..........................................................127
Figure 103: Code Snippet to Generate Line Graph for WER Over Epoch for regularized CNN-GRU model
................................................................................................................................................................ 128
Figure 104: Line Graph for WER Over Epoch for regularized CNN-GRU model........................................128
Figure 105: Code Snippet to Generate Line Graph for WER Over Epoch for regularized GRU model......129
Figure 106: Line Graph for WER Over Epoch for regularized GRU model................................................129
Figure 107: Code Snippet to Generate Line Graph for WER Over Epoch for regularized CNN-LSTM model
................................................................................................................................................................ 130
Figure 108: Line Graph for WER Over Epoch for regularized CNN-LSTM model......................................130

Figure 109: Code Snippet to Generate Line Graph for CER Over Epoch for regularized CNN-GRU model
................................................................................................................................................................ 132
Figure 110: Line Graph for CER Over Epoch for regularized CNN-GRU model.........................................132
Figure 111: Code Snippet to Generate Line Graph for CER Over Epoch for regularized GRU model.......133
Figure 112: Line Graph for CER Over Epoch for regularized GRU model.................................................133
Figure 113: Code Snippet to Generate Line Graph for CER Over Epoch for regularized CNN-LSTM model
................................................................................................................................................................ 134
Figure 114: Line Graph for CER Over Epoch for regularized CNN-LSTM model.......................................134
Figure 115: Code Snippet to find minimum point for CER on Validation Set of regularized CNN-GRU and
CNN-LSTM model....................................................................................................................................135
Figure 116: Code Snippet to Generate Metrics for Unregularized CNN-GRU model’s Evaluation on Testing
Set...........................................................................................................................................................135
Figure 117: Code Snippet to Generate Metrics for Regularized CNN-GRU model’s Evaluation on Testing
Set...........................................................................................................................................................136
Figure 118: Code Snippet to Generate Metrics for Regularized GRU model’s Evaluation on Testing Set 137
Figure 119: Code Snippet to Generate Metrics for Regularized CNN-LSTM model’s Evaluation on Testing
Set...........................................................................................................................................................138
Figure 120: Code Snippet to Generate Line Graph for WER Over Epoch for Unregularized CNN-GRU
model on Testing Set...............................................................................................................................139
Figure 121: Line Graph for WER Over Epoch for Unregularized CNN-GRU model on Testing Set............139
on Testing Set..........................................................................................................................................140
Figure 123: Line Graph for WER Over Epoch for regularized CNN-GRU model on Testing Set................140
Figure 124: Code Snippet to Generate Line Graph for WER Over Epoch for regularized GRU model on
Testing Set...............................................................................................................................................141
Figure 125: Line Graph for WER Over Epoch for regularized GRU model on Testing Set.........................141
on Testing Set..........................................................................................................................................142
Figure 127: Line Graph for WER Over Epoch for regularized CNN-LSTM model on Testing Set...............142
Figure 128: Code Snippet to Generate Line Graph for CER Over Epoch for regularized CNN-GRU model on
Testing Set...............................................................................................................................................143
Figure 129: Line Graph for CER Over Epoch for regularized CNN-GRU model on Testing Set..................143
Figure 130: Code Snippet to Generate Line Graph for CER Over Epoch for regularized GRU model on
Testing Set...............................................................................................................................................144
Figure 131: Line Graph for CER Over Epoch for regularized GRU model on Testing Set..........................144
on Testing Set..........................................................................................................................................145
Figure 133: Line Graph for CER Over Epoch for regularized CNN-LSTM model on Testing Set................145
Figure 134: Summary of Evaluation Metrics for 4 Models.......................................................................147
Figure 135: Use Case Diagram.................................................................................................................151
Figure 136: Activity Diagram for register function...................................................................................161
Figure 137: Activity Diagram for login function.......................................................................................162

Figure 138: Activity Diagram for password changing function................................................................162

Figure 139: Activity Diagram for logout function.....................................................................................163
Figure 140: Activity Diagram for listening audio-based commands function..........................................163
Figure 141: Activity Diagram for uploading audio file function...............................................................163
Figure 142: Activity Diagram for viewing properties of audio file function.............................................164
Figure 143: Activity Diagram for resampling audio file function..............................................................164
Figure 144: Activity Diagram for downloading resampled audio file function.........................................165
Figure 145: Activity Diagram for generating transcript function.............................................................165
Figure 146: Activity Diagram for translating transcript function..............................................................166
Figure 147: Activity Diagram for downloading transcript function..........................................................166
Figure 148: Wireframe for home page....................................................................................................167
Figure 149: Wireframe for registration page...........................................................................................167
Figure 150: Wireframe for login page......................................................................................................168
Figure 151: Wireframe for password changing page...............................................................................168
Figure 152: Wireframe for file uploader page.........................................................................................168
Figure 153: Wireframe for property viewer page....................................................................................169
Figure 154: Wireframe for resampling page............................................................................................169
Figure 155: Wireframe for transcript page..............................................................................................170
Figure 156: Interface for Home Page.......................................................................................................189
Figure 157: Registration Interface...........................................................................................................191
Figure 158: Registration Page Showing Error Message for invalid Email Address and Password............191
Figure 159: Text File to store Email Address and Password entries.........................................................192
Figure 160: Registration Page Showing Error Message for Duplicate Email Address's Entry...................192
Figure 161: Interface for Successful Registration....................................................................................193
Figure 162: Interface for Login Page........................................................................................................194
Figure 163: Login Interface Showing Error Message for invalid Email Address and Password................195
Figure 164: Interface for Successful Login...............................................................................................195
Figure 165: Interface for Password Changing Page.................................................................................196
Figure 166: Password Changing Interface Showing Error Message for invalid Email Address and Password
................................................................................................................................................................ 196
Figure 167: Interface for successful change in password........................................................................197
Figure 168: Interface for File Uploader Page...........................................................................................198
Figure 169: File Uploader Interface showing Error Message for Invalid File Type...................................199
Figure 170: File Uploader Interface showing Error Message for Invalid File Path...................................199
Figure 171: File Uploader Interface showing Error Message for Incorrect File Path................................200
Figure 172: File Uploader Interface Showing Message to Remove Quotation Marks..............................200
Figure 173: File Uploader Interface for successful upload.......................................................................201
Figure 174: Interface for Audio File's Property Viewer Page...................................................................202
Figure 175: Property Viewer Interface when User Checks Frequency of Audio File................................203
Figure 176: Property Viewer Interface when User Clicks the Expander for Displaying Audio's Waveform
................................................................................................................................................................ 203
Figure 177: Property Viewer Interface when User Clicks the Expander for Displaying MFCC Heatmap. .204

Figure 178: Property Viewer Interface when User Clicks the Expander for Displaying Mel Filterbank
Heatmap..................................................................................................................................................205
Figure 179: Interface for Resampling Page..............................................................................................206
Figure 180: Resampling Interface when User Checks Audio File's Frequency.........................................207
Figure 181: Resampling Interface when User Enters A Sample Rate Lower than 8,000 Hz.....................207
Figure 182: Resampling Interface when User Enters a Sample Rate Higher than 48,000 Hz...................208
Figure 183: Resampling Interface when User Enters a Sample Rate Between 8,000 to 48,000 Hz.........208
Figure 184: Resampling Interface when User Clicks 'Resample Audio File' Button..................................209
Figure 185: Interface for Transcript Page................................................................................................210
Figure 186: Interface when User Clicks 'Generate Transcript' Button.....................................................211
Figure 187: Interface when User Clicks 'Download Transcript' Button....................................................211
Figure 188: Interface when User Selects Language and Clicks 'Translate Transcript' Button..................212
Figure 189: Interface when User Clicks 'Download Translated Transcript' Button..................................212
Figure 190: Code Snippet for Home Page................................................................................................213
Figure 191: Code Snippet for Registration Page......................................................................................214
Figure 192: Code Snippet for Login Page.................................................................................................216
Figure 193: Code Snippet for Password Changing Page...........................................................................217
Figure 194: Code Snippet for File Uploader Page....................................................................................219
Figure 195: Code Snippet for Viewing Property of Audio File..................................................................221
Figure 196: Code Snippet for Resampling Audio File...............................................................................223
Figure 197: Code Snippet for Generating Transcript (1)..........................................................................225
Figure 198: Code Snippet for Generating Transcript (2)..........................................................................226
Figure 199: FYP TURNITIN Report (1).......................................................................................................260
Figure 200: FYP TURNITIN Report (2).......................................................................................................261
Figure 201: Library Form.........................................................................................................................262
Figure 202: Confidentiality Document.....................................................................................................263
Figure 203: FYP Poster.............................................................................................................................264
Figure 204: Project Log Sheet Semester 1 (1)..........................................................................................265
Figure 210: PPF (1)...................................................................................................................................271
Figure 211: PPF (2)...................................................................................................................................272
Figure 212: PPF (3)...................................................................................................................................273
Figure 213: PPF (4)...................................................................................................................................274
Figure 214: PPF (5)...................................................................................................................................275
Figure 215: PPF (6)...................................................................................................................................276
Figure 216: PPF (7)...................................................................................................................................277
Figure 217: PPF (8)...................................................................................................................................278
Figure 218: PSF (1)...................................................................................................................................279

Figure 219: PSF (2)...................................................................................................................................280

Figure 220: PSF (3)...................................................................................................................................281
Figure 221: PSF (4)...................................................................................................................................282
Figure 222: PSF (5)...................................................................................................................................283
Figure 223: PSF (6)...................................................................................................................................284
Figure 224: PSF (7)...................................................................................................................................285
Figure 225: PSF (8)...................................................................................................................................286
Figure 226: PSF (9)...................................................................................................................................287
Figure 227: PSF (10).................................................................................................................................288
Figure 228: Fast Track Ethnic Form (1)....................................................................................................289
Figure 232: Gantt Chart for FYP (1)..........................................................................................................293
Figure 233: Gantt Chart for FYP (2)..........................................................................................................294
List of Tables

Table 1: Comparison of similar speech recognition systems.....................................................................55

Table 2: Comparison of programming languages......................................................................................59
Table 3: Comparison of data mining methodologies.................................................................................71
Table 4: Attributes of dataset....................................................................................................................80
Table 5: Use Case Specification for registration......................................................................................152
Table 6: Use case specification for login..................................................................................................153
Table 7: Use case specification for changing password...........................................................................154
Table 8: Use case specification for changing password...........................................................................154
Table 9: Use Case Specification for listening to audio-based command..................................................155
Table 10: Use case specification for uploading audio file........................................................................155
Table 11: Use Case Specification for viewing audio file's properties.......................................................156
Table 12: Use Case Specification for resampling audio file......................................................................157
Table 13: Use Case Specification for downloading resampled audio files...............................................157
Table 14: Use Case Specification for generating transcript.....................................................................158
Table 15: Use Case Specification for translating transcript......................................................................159
Table 16: Use Case Specification for downloading transcript..................................................................160
Table 17: Test plan for registration..........................................................................................................177
Table 18: Test plan for Login...................................................................................................................178
Table 19: Test plan for changing password.............................................................................................180
Table 20: Test plan for logout..................................................................................................................180
Table 21: Test plan for uploading audio file............................................................................................182
Table 22: Test plan for viewing properties of audio file..........................................................................183
Table 23: Test plan for resampling audio file...........................................................................................183
Table 24: Test plan for downloading resampled audio file......................................................................184
Table 25: Test plan for generating transcript..........................................................................................184
Table 26: Test plan for translating transcript...........................................................................................185
Table 27: Test plan for downloading transcript.......................................................................................185
Table 28: User Acceptance Testing Form.................................................................................................188
Table 29: Test Output for registration.....................................................................................................229
Table 30: Test output for Login................................................................................................................230
Table 31: Test output for changing password.........................................................................................231
Table 32: Test output for logout..............................................................................................................232
Table 33: Test ouptut for uploading audio file........................................................................................234
Table 34: Test output for viewing properties of audio file......................................................................235
Table 35: Test output for resampling audio file.......................................................................................235
Table 36: Test output for downloading resampled audio file..................................................................236
Table 37: Test output for generating transcript.......................................................................................236
Table 38: Test output for translating transcript.......................................................................................237
Table 39: Test Output for downloading transcript..................................................................................237
Table 40: Result for User Acceptance Testing (1)....................................................................................240

CHAPTER 1: INTRODUCTION TO THE STUDY
1.1 Background of the Project
The research domain for the current project is e-learning. Unlike traditional learning which
acquires physical interactions among tutors and students in classrooms, e-learning is the recently

established learning paradigm that utilizes Information and Communication Technologies (ICT)
and relevant electronic devices to deliver knowledge. The increasing popularity of e-learning is
in accordance with the usage of such technologies which has resulted in creativity and innovation
boost within the educational environment as claimed by many researchers. Also known as
distance learning, e-learning can help reduce expenditure, time for those living at distant places
and even administrative workload of school staffs. (Maatuk et al., 2022) Using the university
(APU) as an example, it has incorporated other e-learning platforms with their existing school
systems in the curriculum paradigm. Despite having a large international community from over
130 countries, it managed to provide all students with diverse yet suitable learning materials,
supplementary courses and assessments, thus international or out-of-state students do not need to
pay for travel or accommodation fees to pursue their tertiary education. As for tutors, with the
help of presentation, work coordination and grading tools such as Microsoft Teams, Moodle,
Turnitin etc., they can focus more on teaching methods instead of preparing for the learning
resources which are available in the system.
However, the main factor that constitute to the rise of e-learning is the unexpected Covid-19
outbreak in early 2020. As such, many countries have enforced strict health protocols and lock-
down regulations. This involves the limitation of many social, recreational and economical
activities to reduce physical interactions and promote social distancing among people.
Correspondingly, education sectors have also become a victim of this crisis. Previous studies on
the impact of Covid-19 have acknowledged the tendency that higher learning institutions respond
by shutting down campus and switching to experiment with e-learning as teaching and learning
alternatives. Despite the e-learning domain having multiple issues in terms of connectivity,
academic transition, efficacy in teaching and learning etc., many educational institutions are
eager to introduce or implement their own e-learning systems with the aim of resolving or at
least minimizing the disruption towards the education sector. (Mseleku, 2020)

Figure 1: Speech transmission and formulation process (Varol et al., n.d.)
In the solution context of this project, Automatic Speech Recognition (ASR) can be interpreted
as a speech-to-text conversion tool whereby human’s utterance of speech is captured by a vocal
receiver such as microphone or other transducers, processed accordingly and converted into a
sequence of texts or transcripts by means of algorithms. Being the most natural and convenient
communication mode among humans, speech plays a pivotal role in our daily lives. (S & E,
2016) Existing as acoustic wave form in the air, the speech signals are transmitted from the
speaker and perceived by the listener’s ears, then converted into electrical signals to be
interpreted by the brain. The brain then formulates a meaningful message through a speech
model to be transmitted and the process repeats again as illustrated in figure 1. (Kanabur et al.,
2019a) Using the similar technique, ASR establishes communications links between the
computer interface and the natural human voice in a flexible yet convenient way. It has been
proven to ease the life of people with physical or learning disabilities in which they are incapable
of receiving, transmitting and conveying vocal signals appropriately.
In the spoken language context of this project, English has been acknowledged as the global
language and being dominant in many developments of the world nowadays including politics,
technology, international relation, education, travel etc. . The findings of Reddy (n.d.) have also
revealed that tens of thousands of specific terms have been added to the English lexicon with the
advancement of Science and Technology. The evidence from this study also highlighted that
approximately 80 percent of digital information in the world is written in English which include
individually stored data by firms, institutions and libraries as well as those information readily
available in the World Wide Web. As several browsers are still unable to possess multilingual
presentation, being proficient in English has thus become a big advantage for those browsing

through the Internet. (Reddy, n.d.) To provide a more comprehensive picture on the wide spread
of English language, studies conducted by Ahmad also revealed that 5000 newspapers more than
half of the newspaper published worldwide are written in English whereby countries that place
English as their second language have at least 1 English newspaper published. (Riyaz Ahmad
Assistant Professor of English & Riyaz Ahmad, 2016)
With the basis of introduction on e-learning, ASR and usability of English in various fields, this
paper aims to analyse previous research on existing statistical and machine learning techniques
in implementing ASR, identify and evaluate appropriate acoustic models and construct such
models into an English-based speech recognition system to be integrated with existing e-learning
systems within the education sector.

1.2 Problem Context
Despite having high applicability in the field of education, e-learning suffers from several major
drawbacks such as creating learning barriers for students. According to studies conducted by
Wald, students tend to spend much time and mental effort in jotting down notes during online
lecture or tutorial classes. It is evident that this happens more often when lecturers are speaking
at a much rapid pace or when students are unfamiliar with the spoken language (English in this
case) or course content that they are currently taking. During the process of taking notes,
students have to perform a series of tasks which consist of listening to the tutor’s speech,
comprehending the speech, reading slide’s content displayed on screen, relating the speech to
slide’s content and noting everything down in a simplified yet readable manner. This has posed
great difficulty for students who are unable to attend classes and multitask efficiently, such as in
listening while taking down notes. In relation to this, the current study has also highlighted that
students are unable to grasp the meaning of module content due to poor oral explanation and
teaching skills from tutors. Hence, students easily lose concentration during the lessons and their
absent rates tend to increase gradually. (Wald, 2006)
Furthermore, A considerable amount of literature study has tended to focus more on students’
requirements rather than teaching proficiency prior to the success of e-learning. In a study related
to e-learning challenges in higher academics, Islam (2015) has justified that these vast majority
of research have been discussing on how to provide recommendations and solutions to enhance
student learning experiences from technologies’ point of view. In contrast, these studies are
limited in terms of analyzing post-teaching feedback from academic staff. Generally, each
student has their unique learning style due to cultural influences and other factors. Hence,
academic studies should also suggest improvements on content delivery, speaking rate, teaching
quality and other factors from the tutors’ perspective in order to boost individual learning
outcomes. (Islam et al., 2015) Research conducted by Shao & Wang (2008) has also
acknowledged that a lot of e-learning systems are not using automated methods in processing
large volumes of video and audio data. The amount of cost in terms of time spent and resources
required are enormous due to inefficient methods used in processing learning resources. The two
researchers also found that ASR is more widely used in military and broadcast fields with the

implementation of news and meetings transcript when compared to education sectors. (Shao &
Wang, 2008)
Thirdly and lastly, the current implementation of e-learning systems did not manage to address
the learning disabilities of users proactively. This is especially evident for students who are deaf
or possess hearing disorders regardless of it being congenital or acquired. According to a study
by Wald (2006), such group of students face a very steep learning curve in following tutors’
speech and taking down notes as they rely heavily on lip-reading or sign-language interpretation
which can hardly be achieved in e-learning. Relevant evidence in the study also revealed that
such group of students at Rochester Institute of Technology preferred to use ASR re-voicing
techniques to stimulate real-time text displays, similar to captions and transcripts due to their
high literacy level. (Wald, 2006) The term “literacy” is generally understood as the inability to
read or write text data. Studies of Walsh & Meade (2003) have also implied that the learning bar
continues to elevate upon learners with low literacy level experience first-time exposure to
Information Technology (IT) and their supporting hardware. This is due to the fact that most e-
learning systems operate via text-based input commands as opposed to acoustic inputs which
require them to practice using the technology at hand instead of speaking naturally. (Walsh &
Meade, 2003)

1.3 Rationale
Considering the fact that e-learning system has become a major part of students’ and teaching
staffs’ life, there is indeed an urge to develop an Automatic Speech Recognition (ASR) system
that can be integrated into e-learning systems with English being the main communication
medium. According to the issues highlighted above, many e-learning systems are student-
oriented, as in they only focus on improving students’ academic performance rather than tutors’
teaching proficiencies. It is also notified that many e-learning systems lack effective aiding tools
which hinder students’ concentration during teaching sessions especially those with learning
disabilities. As such, current e-learning systems require a speech-to-text mechanism in the form
of transcripts as an aiding tool for learning and teaching. Regardless of traditional or digital
learning, speech is still the prominent medium in information or knowledge transmissions. From
such a basis, such transcripts obtained from an ASR system can not only ease content
interpretations for students, but also act as a reference for tutors to reflect on their content-
delivery performance in terms of teaching speed, point-accuracy, clarity of utterance and so
forth. One can also easily store, visualize, edit, delete and duplicate a transcript than audio files
which is a crucial feature for any form of teaching-learning activities. Such transcripts can also
help educational institutions to reduce their operational and administrative costs in teaching and
assessing aspects.

1.4 Potential Benefits
This section discusses the potential tangible and intangible benefits of the project to target users
of the system. Tangible benefits are benefits which are quantifiable and measurable using
specific metrics or indicators. As for intangible benefits, it is the benefits subjective to the
projects’ improvement that cannot be consistently measured using a quantifiable unit.
1.4.1 Tangible Benefits
i) Save time for tutors and students as tutors can use transcripts to reflect on their main
teaching points easily while students can directly use them as their additional notes.
ii) Reduce administration cost with online transcripts being used instead of paper-written
ones which are stored in the form of text files as opposed to physical space.
iii) Minimize the workload of tutors in conducting different teaching methods for
learners with low literacy level.
1.4.2 Intangible Benefits
i) Sharpen English language skills of students and teachers during lecture or tutorial
sessions
ii) Boost confidence of students in their learning capabilities as post-learning can be
conducted more effectively
iii) Improve teaching capabilities of tutors with more self-reflection on their teaching
methods.
iv) Enhance students’ and teachers’ satisfaction and productivity in using the e-learning
system

1.5 Target Users
The implementation of ASR in this project is applicable to all levels of education sector, ranging
from primary schools to universities. The target users of the ASR system are tutors who will be
conducting the module or course delivery and students who will be learning new knowledge
from tutors. The system will be utilized by these users to conduct teaching and learning sessions
more effectively.

1.6 Scopes & Objectives
1.6.1 Aim
To develop a decisive and fully functional English based Automatic Speech Recognition (ASR)
system using appropriate machine learning techniques to be integrated with existing e-learning
systems in the education sectors.
1.6.2 Objectives
⁕ To evaluate existing acoustic modelling technique within the scope of machine learning
⁕ To incorporate mathematical notation, statistical and machine learning paradigms in

automating a fully functional ASR
⁕ To evaluate the effectiveness and accuracy of the selected technique implemented in the ASR
⁕ To enhance the learning experience of students and teaching experience of lecturers
1.6.3 Deliverables
The key deliverables of the ASR system are stated as follows:
i) A statistical or machine learning based acoustic model with the highest recognition
accuracy which is trained using audio datasets in the form of lecture videos.
ii) A transcript which is able to convert utterances of speech from lecture videos into text
to be displayed in a panel.
iii) A Graphical User Interface (GUI) which allows users to register, log in, view
properties of audio files, resample audio files, generate, download transcript and log
out of the system.
1.6.4 Nature of Challenges
One of the most onerous challenge in this project is to come up with a suitable title. The purpose
of conducting a project is to essentially address existing issues within a domain, for which
finding such domain can be difficult with the vast advancement of technologies. In addition,
apart from area of studies such as weather prediction models, stock trading systems, customer

segmentation analysis etc, it is difficult to find a domain in relation to my course that has yet to
gain much popularity in the research field. Another challenge is understanding the mathematical
paradigms and algorithms at each stage of ASR which consist of Feature Extraction, acoustic
modelling, machine learning technique and so forth. Searching for quality-wise similar systems
with comparable results that can fit within the e-learning domain is also another challenging task.
Moreover, I also face difficulty in selecting the appropriate programming language for the task
since there exists many programming languages alongside with their relevant libraries, packages
and modules that are deemed relevant to the implementation of ASR.

1.7 Overview of this Investigation Report
In the current Investigation Report, there are a total of 11 chapters.
Chapter 1: Introduction to the Study
This section gives a brief introduction on the project background regarding e-learning domain,
ASR and the usage of English language. It also outlines the critical issues that exist in such
domain alongside the importance of conducting this project, tangible and intangible benefits
obtained by users, target users of the ASR system, aims, objectives, deliverables of the project as
well as nature of challenge faced by the developer themselves. The Investigation Report’s
Project Plan documented using Microsoft Excel is also shown in the next sub-section.
Chapter 2: Literature Review
Firstly, a brief idea on Literature Review and its purpose is highlighted. A detailed research in
the domain is then conducted, focusing mainly on the classification, system architecture of ASR
and statistical or machine learning models to be considered in this project. Similar systems
utilizing the same system architecture but with different chosen models available in the domain
are compared to provide insights during the implementation stage of the project.
Chapter 3: Technical Research
This section focuses on the technical aspect of the project, including the comparison of different
programming languages, Integrated Development Environment (IDE), programming libraries
and Operating Systems. The best option out of each of these categories are then chosen to
implement the ASR system along with detailed justifications.
Chapter 4: System Development Methodology
This section starts off by providing a general overview of system development methodology,
then proceeds to elaborate several methodologies done in previous research. The most
appropriate methodology to be utilized in this project is then chosen and justified accordingly
along with a detailed explanation on the course of action in each phase.
Chapter 5: Data Analysis

This section starts off by giving a general concept regarding data analysis and the sub-processes
involved. Further elaboration on the metadata of the chosen dataset is given. The inner workings
of Exploratory Data Analysis (EDA), data cleaning, data visualization, data partitioning and
modelling are also explained with code snippets and output display.
Chapter 6: Results and Discussion
This section evaluates the 3 deep learning models constructed in the previous section. Several
evaluation metrics are written and compared among these models, out of which the model with
the most optimal performance across each metric is chosen.
Chapter 7: System Architecture
This section illustrates the general overview of the system including the system’s feature from
end users’ perspective through graphical and tabular representation in the form of a Use Case
Diagram, Use Case Specification and Activity Diagram. To provide an initial overview to system
designers and developers, an interface design of each page of the web-based system will also be
illustrated.
Chapter 8: Project Plan
This section gives detailed explanation on each feature of the system. It also highlights the
release plan, Unit Testing plan and User Acceptance Testing (UAT) plan of the system in which
the latter two will be designed for system validation.
Chapter 9: Implementation
This chapter focuses on the features of the web-based system, which include front-end or design-
wise and back-end or coding-wise implementations. Screenshots of pages as well as source code
will be documented in a detailed manner.
Chapter 10: System Validation
The Unit Testing plan and User Acceptance Testing (UAT) plan documented in Chapter 8 will
be given to system testers and end users or clients respectively to verify and validate if all
features of the system are working as expected from the business requirements.

Chapter 11: Conclusion and Reflections
In this chapter, a summary on the FYP Report is provided in terms of modelling and deployment.
Reflection on whether the current research has achieved the desired goal and objectives is
analysed. Any form of research gap or limitation in the design of this project is also explored
with corresponding improvements identified to be implemented in the future.

1.8 Project Plan
Overall timeline and planning of the proposed project will be listed out in a Gantt Chart with the
aid of Microsoft Excel application. The entire pipeline diagram will be displayed in the appendix
section.

CHAPTER 2: LITERATURE REVIEW

2.1 Introduction
In the research domain, literature review can be expressed as a well-formatted academic
documentation that demonstrates the knowledge of the writer in a specific domain. As indicated
by its name, not only does the writer need to report on the key points of the specific domain, but
one also has to perform critical evaluation on the materials read. The main goal of conducting a
literature review is to familiarize oneself with findings that are known and unknown within the
field of studies before diving into a new project. In this chapter, research on statistical or
acoustic-based machine learning model are summarized with the gaps between each of them and
their integration accuracy with existing systems are highlighted to have some familiarity with
limitations in past studies. (Literature Review, 2022)

2.2 Domain Research

2.2.1 Classification of ASR
Figure 2: ASR system classification (Bhardwaj et al., 2022
Generally, there are four criteria in classifying ASR, namely speech utterance, speaker model,
size of vocabulary and environmental condition as shown in the figure above. (Bhardwaj et al.,
2022) By understanding such classifications, the developer can decide on what form of acoustic
properties are most suitable to train and validate the ASR models in this project.
In broad terms, speech utterance can be described as the vocalization of one or more
pronunciation in the form of words or sentences that can be interpreted by the computer. Firstly,
Isolated Words are the easiest and most structured speech utterance type because it accepts single
utterances at one point of time. It requires pauses between each utterance but does not limit itself
to single words input only which provides a clear pronunciation to the listener. (An, n.d.) Similar
to Isolated Words, Connected Words require much smaller pause between each utterances such
as reading the numerical representation of a number (“1,298,350”). (Bhardwaj et al., 2022) In a
more difficult context, Continuous Speech allows computers to determine the content voiced out
by natural speakers. Typically, it involves multiple words run together without pause or distinct
boundaries between each utterance, thus elevating the computation difficulty when the
vocabulary range grows. (An, n.d.) Also utilizing natural speech as a medium, Spontaneous

Speech accepts any form of acoustic inputs including those spoken in noisy environments or
filled with pronunciation errors, false starts by the speaker. (Bhardwaj et al., 2022)
Regards to speaker dependency, a speaker-dependent system will be trained using the specific
speakers’ voice characteristics. Although such systems are not flexible to be used across different
speakers, it is easier to develop with a lower cost and higher accuracy in speech identification. In
contrast, a speaker-independent system is designed for large group of speakers with exclusive
speech patterns. It is apparent that such systems are more costly and difficult to develop with a
relatively low recognition accuracy despite exhibiting great flexibility. (An, n.d.) As for speaker
adaptive systems, it adapts the speaker-independent system by utilizing portion of the specific
speakers’ acoustic characters. (Bhardwaj et al., 2022)
Published studies have also identified that vocabulary size will affect the processing
requirements, time complexity and accuracy of the ASR. A dictionary or so called, lexicon with
a small, medium, large and very large vocabulary size can have tens, hundreds, thousands and
tens of thousands of words respectively. (An, n.d.) The environment variability in terms of noise
level can also have a detrimental impact on the accuracy of the ASR. (Bhardwaj et al., 2022)
In another research proposed by Wald (2006), he emphasizes that ASR systems used in
education normally consists of unrestricted vocabulary range and are of speaker-dependent type.
In other words, the ASR system has to be trained with read-aloud transcripts, written documents
as well as pre-recorded lecture videos filled with special vocabularies not available in its own
dictionary. Another alternative is to utilize pre-trained voice model provided by specific speech
recognition engines which might guarantee a higher accuracy in terms of vocabularies and
spontaneous speech structures. (Wald, 2006) Despite this, little progress has been made in
demonstrating the implementation of connected-word type ASR systems which is the basis of
communication in English language.

2.2.2 General Overview of ASR System
Figure 3: ASR system architecture (Jamal et al., 2017
Prior to implementing a fully functional ASR system, one must thoroughly grasp the main
component of its system architecture, namely acoustic front-end, acoustic model, language
model, lexicon and decoder as shown in figure 4. A large volume of studies on ASR architecture
also highlights that there are two process in its implementation, namely front-end and back-end
in which the latter can only be initiated if the former has been done accordingly. (Jamal et al.,
2017)
Front-end Process
The main aim of the front-end process is to convert the analogue speech signals into digital form
by parameterizing the unique acoustic characteristics of the speech. This can be achieved by
performing signal processing and feature extraction. (Jamal et al., 2017) According to previous
studies, each of them has selected distinct features for their application but has established
several profound principles in extraction criteria. One of those criteria being the ability to
construct acoustic models automatically with small amount of training data provided. Another
prominent property is the extracted feature must exhibit little fluctuation to no variant among
speaker and the surrounding environment to maintain great stability of utterance over time. (S &
E, 2016) Another researched proposed by An (n.d.) has also presented characteristic of extracted
speech features such as great measurability and insusceptible to mimicry. (An, n.d.)

Figure 4: Process of MFCC Feature Extraction (S & E, 2016)
In technical terms, feature extraction can be defined as the pre-processing of analogue speech
signals by removing irrelevant observation vectors and filtering a set of correlated voice
properties into several quantifiable metrics that are deemed meaningful in the model construction
stage. Despite a considerable amount of feature extraction techniques have been developed over
the past few decades, the full discussion on each method lies beyond the scope of this project,
thus this study will only provide an overview of Mel Frequency Cepstral Coefficient (MFCC)
technique which is the most prominent, efficient and simple compared to other methods. (Jamal
et al., 2017) The process for MFCC technique is summarized as shown in figure 5.
Figure 5: Speech Frequency Graph for Hamming Window (Kanabur et al., 2019b)
Figure 6: Equation for Hamming Window graph (Gupta et al., 2013)
First and foremost, pre-emphasis is conducted to amplify the frequencies of speech signals to
enhance model training and recognition accuracies. (S & E, 2016) Next, frame blocking is
performed to segment the continuous speech signals into discrete small frames whereby each
frame consisting of N samples are separated by M samples from the adjacent frames, with
N−M samples overlapping between them. This process will continue until the whole speech

signal is segmented into frames. In Gupta’s (2013) case study of MFCC, he also claimed that the
standard values used in much research for N and M are 256 and 100 respectively whereby M < N
. This is to ensure that sufficient acoustic information is stored inside each frame and are not
susceptible to change. (Gupta et al., 2013) Windowing is then initiated to multiply the speech
signal frames with windows of varied shape, typically a hamming window as shown in figure 6.
(Kanabur et al., 2019b) This is to reduce the disruption at the start and end point of each frame.
The hamming window’s equation is shown in figure 7 whereby N m represents the number of
samples in each frame. The output signal after this operation is X ( m ) · W n ( m) whereby X (m)
represents the input signal. (Gupta et al., 2013)
Figure 7: Equation for DFT computation (Gupta et al., 2013)
Figure 8: Equation to calculate Mel frequency (Gupta et al., 2013)
Figure 9: Graph for relationship between frequency and Mel frequency (Gupta et al., 2013)
Figure 10: Equation to calculate MFCC (Gupta et al., 2013)
Prior to the windowed speech signals, Discrete Fourier Transform (DFT) or Fast Fourier
Transform (FFT) are used to compute the frequency or magnitude spectrum ¿) for the next stage.
The equation to derive this spectrum ¿) shown in figure 8 whereby k represents 0 to N m −1.

Analysis from the study also highlighted that DFT and FFT are the same by definition but have
different computational time complexity. From the results obtained, a positive frequency ¿) in
Fs N
the range 0 ≤ f ≤ corresponds to the lower half of the sample size 0 ≤ m≤ m −1 whereas a
2 2
−F s Nm
negative frequency in the range < f <0 corresponds to the upper half +1 ≤ m≤ N m−1
2 2
whereby F s represents the sampling frequency. (Gupta et al., 2013) From the calculated
spectrum frequency above, it is wrapped using a logarithmic Mel scale and the Mel frequencies
are computed through the equation as shown in figure 9. (S & E, 2016) A set of triangular
overlapping window, also known as triangular filter banks, is constructed whereby filters are
scattered linearly below the 1000 Hz range and spaced logarithmically beyond 100 Hz as
illustrated in figure 10. Such mapping principle makes it easier to identify the spacing between
filters, thus estimate the approximate energy at each spot of the spectrum. Last but not least,
Discrete Cosine Transformation (DCT) is performed to convert the Mel spectrum from
frequency domain back to spatial domain. The study presented by Gupta (2013) presented that
the vocal output of DCT can contain more energy when compared with DFT as DCT is used in
data compression, resulting in higher concentration of energy in few section of coefficients
whereas DFT is used in spectral analysis. The equation is shown in figure 11 whereby C n
represents Mel Frequency Spectrum Coefficient (MFCC), m represents number of coefficients
and k represents numbers from 0 to m−1. (Gupta et al., 2013)
Back-end Process
Being the core of ASR, acoustic model is a container that stores statistical representation of
segmented speech signals known as basic speech units that constitute the pronunciation of a
word. The acoustic model is established based on sequences of feature vectors computed from
speech waveforms. (S & E, 2016) Such basic speech units can also be referred as phones,
phenomes, syllables or feature-exclusive acoustic observations extracted during pre-processing
stage. In order to recognize such phenomes, a language model consisting of linguistic and
grammatic properties is required. (Jamal et al., 2017) Similar to how humans recognize
utterances in conversations, a language model can help the acoustic model to distinguish between
valid and invalid words in utterances as well as their sequences by providing some form of

context. Such context is measured in terms of probability distribution of word that will be voiced
out by the speaker. This probability can be deduced from a large text corpus based on the
speaker’s previously spoken words. Some common language models are bigram and trigram
whereby the former and latter group two and three consecutive words together respectively, only
then the probability of sequence can be computed. The pronunciation model, also known as
lexicon or dictionary provides mapping between words and phenomes in order to form optimal
word sequences (S & E, 2016)
Figure 11: General equation for statistical computation of ASR (Jiang, 2003)
With the aid of acoustic model, language model and lexicon, the decoder can only be able to
compute the most likely word sequence (W ) from the acoustic input sequence observed ( X ).
Such computation can be understood as the maximum posterior probability for any observation (
^˙ whereby the posterior probability is denoted by P ( W | X ¿. (Jiang,
X ), indicated by W
2003)Through implementing Bayes’ Theorem, the second part of the equation is derived as
shown in figure 11. (S & E, 2016) In this context, P(W ) refers to the probability of deriving a
word from the language model, P(X ) refers to the probability of deriving the observed
sequences from acoustic model and P ( X|W ) refers to the probability of deriving the observed
sequence on the basis that the underlying word sequences is W . As mentioned in most previous
ASR studies, the term P ( X ) can be roughly estimated as a constant across different word
sequences, thus it can be ignored in calculation which results in the third part of the equation in
figure 12. (Jiang, 2003)

2.2.3 Machine Learning Models

Hidden Markov Model (HMM)
Figure 12: Three-state Markov chain (Makhoul & Schwartz, 1995)
Figure 13: Equation for transition probability (a ij) (Rupali et al., 2013)
Figure 14: Equation for summation of transition probability ( ∑ aij (k )) (Rupali et al., 2013)
Before explaining the inner workings of HMM, one must fully grasp the concept of a Markov
Chain. Makhoul & Schwartz (1995) generalizes a Markov Chain as a simple network consisting
of a finite number of states with transitions among them. Each state is represented as an
alphabetic symbol and the transition represents probability. For instance, a ij symbolizes the
transition probability from state 1 to state 2 whereby each state is associated by A and B
respectively and the output symbol is always B as shown in figure 13. (Makhoul & Schwartz,
1995) The equation for a ij is shown in figure 14 whereby p represents probability, q t represents
current state, q t +1 represents next state and N n represents the number of hidden states in the
model. Since the naming of states always starts from 1 up to N n, the i th and j th states are within
this range with both ends inclusive. The summation of transition probability a ij is always 1 as
shown in figure 15. (Rupali et al., 2013) The relevance of such findings has implied that the
transition among states is probabilistic whereas the final output is deterministic. ‘

Figure 15: Equation for observational symbol probability (b j (k )) (Rupali et al., 2013)
Figure 16: Equation for summation of observational symbol probability ( ∑ b j (k ))) (Rupali et al., 2013)
In contrast, the output symbols of HMM are probabilistic, as in each state can be associated with
arbitrary number of symbols in any form instead of given a defined one. The selection of such
symbol is dependent on the transition probability among each states. Since both the symbols and
transition are probabilistic, HMM is generally known as a doubly stochastic statistical model. In
relation to the non-deterministic nature of transition among states, it is impossible to derive the
sequence of states based on the final output symbols. As such, the model is known as Hidden
Markov Model whereby the states’ sequences are hidden and observers can only see the final
outputs symbols. (Makhoul & Schwartz, 1995) The equation for deriving the probability
distribution of the output symbol in state j (b j (k )) is shown in figure 16 whereby V k represents
the k th observational symbol in the list of alphabets, o t represents the current parameter vector
and M represents the total number of unique observation symbols for each state. Similar to
previously highlighted constraints, the k th observed symbol for each state must be within 1 to M
with both ends inclusive and the summation of probability distribution for each state’s
observational symbol is always 1 as shown in figure 17. (Rupali et al., 2013)
Figure 17: Equation for initial state distribution ( π i) (Rupali et al., 2013)
In the case of initial state distribution matrix for state i ( π i) as shown in figure 18, there is only 1
observational symbol for each state, making it equivalent to the probability of deriving that
particular output symbol. By defining the 5 main terminologies for HMM which are listed as
follows: number of hidden states ( N ), number of unique observational output symbols per state (
M ), state transition probability distribution ( A={a ij }), observation symbol probability
distribution per state ( B={b j (k )}) and initial state distribution ( π={π i }), the observation
sequence of the model can be defined as λ=( A , B , π ). (Rupali et al., 2013)

Dynamic Time Wrapping (DTW)
Figure 18: Warping between two non-linear time series (Muda et al., 2010)
By contrast of a statistical-based HMM model, DTW is a dynamic programming algorithm that

computes the similarities between two time-series datasets. The study conducted by Muda (2010)
also highlighted that DTW is used to perform optimal warping and alignment operations for non-
linear time-series datasets through repetitive shrinking and expansion along the time-axis (x-
axis). (Muda et al., 2010) In the context of this project, the speech signals’ datasets will be
represented as a series of non-linear feature vectors depending on the utterance speed of the
speaker. (Saleh, n.d.) An illustration of this scenario can be shown in figure 19. The upper and
lower segmented lines each represent distinct time series. The starting and end points of each
vertical line represent the corresponding matching coordinates for the two-time series. On top of
that, the vertical line also indicates the level of variety between two time series in terms of speed
or time. Facts have also revealed that the summation of distance between two pair of points for
each of the vertical lines equates with the wrap path distance between two time series. (Muda et
al., 2010)
Figure 19: Equation to construct a local cost matrix (Senin, 2008)
A key implication regarding DTW is to compute the minimum distance between two time-series
using existing pair-wise coordinates. This terminology involved in this procedure is often called
“distance function” or “cost function”. First and foremost, the distance or local cost matrix ( C ) is
constructed using the equation in figure 20. From the equation, N denotes the height of the
matrix, M denotes the width of the matrix, C ij denotes the local cost of the (i th, j th) coordinate,
x i and y j denote the coordinates for X and Y time series sequences respectively. (Senin, 2008) It

can therefore be stated that cost computation for each vertical line is done by calculating
Euclidean distance between the two coordinates. (Muda et al., 2010)
Figure 20: Local path alternatives for grid point (i, j) (Saleh, n.d.)
After the matrix has been constructed, the alignment path, warping path or warping function
must be computed. Collectively, multiple previous studies outlined that computation of all
possible warping path P between both sequences of X and Y are redundant as the time
complexity of the algorithm will grow exponentially when the length of each sequence grows
linearly. (Senin, 2008) According to Saleh (n.d.), the problem is approached by restricting time
warping between two vector sequences through several boundaries. One of them being the first
and last vector or coordinates of sequence X must be assigned to its corresponding sequence Y .
As for other coordinates in between, the repetitive forward and backward leaping between points
that may have been visited is prevented with the “reuse” of preceding vectors to perform time
warping operations. For a clearer visualization, the local path alternative for grid points(i , j)with
all possible predecessors is illustrated in figure 21. (Saleh, n.d.)
Figure 21: Equation to compute the minimum accumulated distance for optimal path δ(i,j) (Saleh, n.d.)
Another interesting finding in figure 21 is the coordinate ( i−1 , j−1 ) can reach coordinate (i , j)
directly via diagonal transition without going through vertical coordinate ( i , j−1 ) and horizontal
coordinate ( i−1 , j ). With only one vector distance computed, the local distance or cost function
C ij must be added twice. Moreover, it is evident that there are only 3 possible predecessors or

partial path leading to (i , j), namely paths from (0,0) to (i−1 , j), (i , j−1) and (i−1 , j−1). As
such, one can apply the Bellman’s Principle which states that if there exists an optimal path P
starting at point ( 0,0 ) , ending at point (N −1, M −1) and having grid point (i , j) in between, then
the partial path from ( 0,0 ) to (i , j) is also part of P . Based on such fundamentals, the minimum
accumulated distance δ (i , j) for globally optimal path from ( 0,0 ) to (i , j) can be derived using
the equation as shown in figure 22. (Saleh, n.d.)
Convolutional Neural Network (CNN)
Figure 22: CNN architecture for speech recognition (Huang et al., n.d.)
Figure 23: Stages of CNN architecture for speech recognition (Palaz Mathew Magimai-Doss Ronan Collobert et al., 2014)
Despite HMM being the most traditional and widely used model for speech recognition, a large
volume of literature research has incorporated CNN in image processing techniques to generate
spectrum images from acoustic phonemes. According to Musaev (2019), a research by Ossama
in 2014 also implemented CNN to perform adaptive dialogue recognition consisting of various
accents in call centre environments. In addition, a paper published by Dennis in 2010 presented
his findings on using CNN to classify sound events based on visual signature extracted from
acoustic inputs. (Musaev et al., 2019) Being a deep learning model used in various applications,
CNN can transform a sequence of acoustic signals into segments of frames, then output a score
value for each class among each frame. The general architecture of CNN is illustrated in figure
23. (Palaz Mathew Magimai-Doss Ronan Collobert et al., 2014) There are two stages for the

network architecture, namely filter extraction or feature learning stage and classification or
modelling stage as shown in figure 24. The convolutional and pooling layer correspond to the
former stage whereas the Fully connected and SoftMax layer correspond to the latter stage. The
usage of tanh() function will be discussed in the following few sections.
Figure 24: Equation for convolution calculation (Wang et al., 2019)
Being the core component of CNN, the convolution layer consists of several filters or kernels
that processes previous fragment of layers by computing the summation of each fragment’s
matrix multiplication. (Musaev et al., 2019) Previous fragments are given in the form of raw
waveform for most research papers, but Wang (2019) has chosen Mel-Frequency Cepstral
Coefficients as inputs. Such inputs are denoted as X ={X 1 , X 2 , … X T } whereby X i ∈ R b∗c . In this
context, T represents the time step or length of the X-axis, b represents the bandwidth or length
of the Y-axis and c represents channels. The output will be a 2-dimensional feature map or
matrix (o ¿ consisting of elements ( o i , j ) with i denoting width term and j denoting height term
calculated as shown in figure 25. In terms of the equation, swc represents the convolution stride’s
width, shc represents the convolution stride’s height, w represents the kernel’s width, h
represents the kernel’s height and k represents the kernel whereby k ∈ R w∗h∗c . The convolutional
stride is the amount or unit of sliding movement for each kernel from the input layer to the
output layer. (Wang et al., 2019)

Figure 25: Example for convolution calculation (Wang et al., 2019)
Figure 26: Equation for time span (t ¿¿ c) ¿ and time shift window(window¿¿ c) ¿ (Wang et al., 2019)
Further analysis of the equation revealed that each element ( o i , j ) in the matrix is the by-product
of w∗h elements in each input feature map (channel), which means that a MFCC sequence with
c channels require w∗h∗c input elements to derive the output in their corresponding position. A
more detailed visualization on the equation workings can be seen in figure 26 whereby T = 3, b
= 3, c = 1, w = 2, h = 2, swc = 1 and swh = 1. To simplify such analysis, the time span across the
resulting matrix (t ¿¿ c) ¿and time shift window between each adjacent element of the matrix
(window¿¿ c) ¿ are computed as shown in figure 27. In this context, w c represents the kernel’s width,
swc represents the convolution stride’s width, t i represents the term for time scope and window i
represents the term for shift window. These equations will act as an input to the processes or
layers to be discussed in the next section. (Wang et al., 2019)
Figure 27: Equation for ReLU activation function (Wang et al., 2019)

Figure 28: Equation for ClippedReLU activation function (Wang et al., 2019)
The scalar results of the convolution layer are then passed to a pre-determined activation or non-
linear function. Despite the activation layer being genuinely known to merge with the
convolution layer, a more comprehensive study, like a research conducted by Musaev has
discussed them separately due to the complexity they possess. According to him, the non-
linearity functions used traditionally are hyperbolic tangent function ¿, absolute hyperbolic
−1
tangent function ¿and sigmoid function ( ( 1+e− x ) ) . Further research in early 2000s by Glorot
has discovered that ReLU activation function is more reliable in terms of speeding up learning
process of each neuron, in addition to simplifying the computation process by trimming negative
scalars in the matrix. (Musaev et al., 2019) This function is computed using the equation in
figure 28. From the equation, if the input matrix’s ( X ) element is greater than 0, it outputs itself,
otherwise it outputs 0. Whilst for the revision of ReLU called Clipped ReLU activation function,
it adds a new parameter (α ) to the tally which finds the minimum between it and the result from
ReLU function, thus ensuring the output in {0 , α } as shown in figure 29. (Wang et al., 2019)
Figure 29: Equations for time span (t ¿¿ p) ¿ and time shift window( window¿¿ p)¿ of max pooling (Wang et al., 2019)
Figure 30: Final equations for time span (t ¿¿ p) ¿ and time shift window(window¿¿ p)¿ of max pooling (Wang et al.,
2019)
Another example of a non-linearity operation is the pooling layer. It takes the group of pixels
from each region in the previous convolutional layer and compresses them into one pixel. In this
scenario, the max-pooling function is typically used which finds the maximum element out of
each group of pixels. (Musaev et al., 2019) The continuous progression of this computation
requires the concept of time span and time shift window described earlier which has yet to
involve pooling operations. With t p andwindow p indicating time span and time shift window of

max pooling layer respectively, the corresponding equation is shown in figure 30. The layer
consists of max pooling of size w p ¿ h p and pooling strides of size sw p ¿ sh p. The final equations
are a result of deriving equations in figure 30 from the equations in figure 27 as shown in figure
31. Not only did the significant acoustic features are maximized, but also the corresponding time
spans are enlarged with lesser computational steps to follow. (Huang et al., n.d.)
After extensively explaining the feature learning stage, we proceed to the classification stage
which starts off with linear transformation in fully connected or so-called dense layers. Utilizing
the flattened 1-dimensional vector of sequences after down-sampled by the pooling layers, each
element of the vector is connected to every output neutron by a specific weight, which finalizes
the mapping among the networks. (Yamashita et al., 2018) Moving on, a non-linear
transformation for Deep Neural Network called Batch Normalization (BN) is performed to
minimize the effect of Internal Covariate Shift. With respect to this, it is a phenomenon whereby
the input parameters to each layer changes due to a change in the preceding layers’ network
parameters. As the number of layers increases (i.e., the network becomes deeper), the
amplification of change becomes more apparent. As implied by its name, it introduces
normalization step after each batch of layers instead of viewing all layers as a whole, thus
enhancing training rates and improving performance during testing phase. (Wang et al., 2019)
Figure 31:Equation for every vector's BN in a batch (Wang et al., 2019)
Figure 32: Equations for each neuron's BN in every vector of a batch (Wang et al., 2019)
For a batch of layer ( X ={X 1 , X 2 , … X m }¿, there are m numbers of flattened vectors. For each
vector, there are d number of neurons. Hence, the BN of a vector can be summarized as a set of

neurons with their corresponding BN as shown in figure 32. For every neuron k , their BN is
computed using the equations in figure 33. (Wang et al., 2019)
Figure 33: Equation for Softmax activation function (M. Rammo & N. Al-Hamdani, 2022)
Last but not least, to apply classification technique on the normalized features, an activation
function called Softmax is implemented. This function is optimal for multiclass single-class
classification operations which can normalize any values from the last fully connected layer into
class probabilities with a range between 0 to 1, whereby its summation equals to 1. (Yamashita et
al., 2018) The computation of such function is shown in figure 34. It is statistically
acknowledged that when there are n neuron values in each of the input vector x within the layer,
there are n possibilities of probability distribution. (M. Rammo & N. Al-Hamdani, 2022)
2.2.4 Performance Evaluation of ASR
Figure 34: Equation for calculation of Word Error Rate (WER) (du Simplon et al., 2005)
Figure 35: Equation for calculation of Word Recognition Rate (WRR) (du Simplon et al., 2005)
One of the most common evaluation technique for speech recognition accuracy is Word Error
Rate (WER). It implements the Levenshtein distance or Edit distance algorithm which finds the
minimum number of insertions, deletions and substitutions operations to transform one input
string to the other. In particular, it computes the minimum distance between the referenced
transcript sequence and the automatic transcript generated by the developed model, then
normalizes by length of reference. The equation can be shown in figure 35 whereby N r
represents total words in the generated transcript, S, D , I represent number of words substituted,
inserted and deleted in the generated transcript respectively. Another metric used in-line is Word
Recognition Rate (WRR) which is computed using the equation in figure 36. Since the

summation of probability is equal to 1, WRR is the complement of WER whereby it is the

number of correctly recognized words (W ¿ excluding the number of words inserted into the
generated transcript (I ) normalized by length of reference N r . (du Simplon et al., 2005)
2.3 Similar Systems

This section will discuss about several studies conducted by researchers for the past few years on
the implementation of Automatic Speech Recognition (ASR) software or systems and their
acoustic models in relation to the education sector, but not limited in using English as the one
and only communication language.
2.3.1 Automatic Speech Recognition for Bangla Digits
Figure 36: 10 Bangla digit representation, pronunciation and IPA (Muhammad et al., 2009)
First and foremost, a previous research by Muhammad (2009) had made a significant
contribution in the area of Bangla-based ASR system by exploring the analysis of Bangla digit in

constructing a speech recognition model. Due to insufficient Bangla digit speech corpus and
relevancy in previous literature studies, a medium-sized Bangla digit speech corpus consisting of
10 digits written in Arabic numerals is developed. Their corresponding pronunciation in Bangla
language and International Phonetic Alphabet (IPA) are shown in the figure above. As for data
collection, a total of 100 Bangladesh native speakers aged between 16 to 60 years old with equal
gender distribution is chosen. Each speaker commends 10 trials for each digit whereby half of
them are conducted in quiet and office rooms respectively, both of which exhibiting similar
environmental properties. (Muhammad et al., 2009)
Figure 37: Digit correct rate (%) of Bangla digits in ASR (Muhammad et al., 2009)
Being one of the more practical ways of extracting acoustic features from speech, MFCC
technique was chosen by the author. to Among the 100 speakers, 37 of male and female
respectively were selected as training sets and the remaining is for testing sets. The parameters
chosen are as follows: sampling rate of 11.025 kHz with 16-bit sample resolution, Hamming
Window of 25 ms with step size of 10 ms and 13 features with a pre-emphasis coefficient of
0.97. As such, there are a total of 13 hidden states in the HMM with a varying number of mixture
components to be tested. Since vocabulary size is limited to 10 only for this research, the word
model used to recognize these 10 digits will be HMM with left-to-right orientation. The
evaluation of the result is then conducted based on digit correct rate (%). Prior to the training and
testing results obtained in the figure above, it is apparent that the first 6 digits have a digit correct
rate of over 95% whereas the remaining 4 have a digit correct rate of below 90%. Moreover, it
can be seen that digit '~' (2) has the highest correct rate of 100% whilst '",' (8) has the lowest
correct rate of 84% only. Another significant finding is that the 8-mixture component seems to

be the most optimal one with more than half of the digits having the highest correct rate for this
mixture category. (Muhammad et al., 2009)
2.3.2 Dynamic Time Warping (DTW) Based Speech Recognition System for Isolated
Sinhala Words
In the implementation of this system proposed by Priyadarshani (2012), he highlighted that the
research to date regarding speech recognition paradigms in Sinhala language is still at an initial
stage in Sri Lanka with insufficient to zero useful information available. Moreover, small
vocabulary sizes in the lexicon have been a critical issue in most ASR systems developed
especially using DTW technique. This is likely due to the fact that there is a greater probability
of similar sounding words appearing in the speech corpus in which their sub-word’s
pronunciation duration differ from one another, making it difficult to parse acoustic inputs into
accurate phrases. Hence, this research attempts to use a relatively large Sinhala vocabulary with
a total of 1055 frequently used words to develop an efficient speech recognition system. To
achieve this, feature extraction using MFCC technique and feature matching, as in comparing the
test pattern with preloaded reference for word identification through DTW are done.
(Priyadarshani et al., 2012)
Figure 38: Word Recognition Rate (WRR) for 4 speakers in 3 respective sessions (Priyadarshani et al., 2012)
The acoustic inputs are gathered from four native Sinhala speakers into an audio file. Three
sessions are conducted on each speaker whereby the 2nd and 3rd session are 3 months and 1 year
after the 1st session respectively. In each session, one utterance of each word is used to train the

model whereas the 2nd utterance is used as a testing set. The entire simulation is done in
MATLAB 7.0 and Word Recognition Rate (WRR) is used as the evaluation technique. An
overall WRR accuracy of 93.92% is achieved based on the results in the figure above. It is also
observed that there is a clear declining trend in the recognition accuracies throughout sessions
due to variation of speakers’ vocal. Considering the fact that large speech corpus is involved,
DTW has successfully identified varying Sinhala speech with unique acoustic properties from
different speakers. (Priyadarshani et al., 2012)
2.3.3 Convolutional Neural Network (CNN) based Speech Recognition System for
Punjabi Language
Figure 39: Parameter setup for CNN based Speech Recognition System for Punjabi language (Dua et al., 2022)
In a systematic study of uncommon speech recognition using Punjabi language, Dua (2022)
observed that most of the current literature studies use HMM, GMM and ANN techniques in
recognizing speech inputs. Further analysis on such studies have pointed out that CNN is
becoming a more prominent modelling paradigm in speech, pattern recognition and artificial
intelligence, machine learning related research due to its enhanced model training speed and
applicability in systems with large-vocabulary datasets. On such a basis, Dua (2022)
implemented a CNN-based approach to recognize tonal Punjabi cues with additional background
noises in his research. As shown in figure 40, the vocal data has been collected from 11 different
Punjabi speakers of different ages, accents in different environments speaking up to 38 additional
stanzas in a continuous mode of speech. Hence, there are a total of 418 sentences (38 * 11) to be
recognized. A sampling rate of 44.1kHz was recorded and the audio file was stored in .wav
format. The author also signified to develop a large corpus of Gurbani hymns for the system due
to an absence of tonal speech dataset in the current domain. (Dua et al., 2022)

Figure 40: Framework for CNN based Speech Recognition System for Tonal Speech Signals (Dua et al., 2022)
The proposed speech recognition system’s framework is shown in figure 41. Firstly, a Praat
software version 6.1.49 was used to generate the Mel spectrogram waveforms from the input
speech signals. Since the programming language chosen is Python, LIBROSA library was used
to perform MFCC feature extraction. For the purpose of feature learning, six-layer 2D
convolution layers along with two Fully Connected layers were used. Then, a flattened layer was
inserted between the 2D convolutional layers and the 256 Dense Layer units. The non-linear
Softmax activation function was used to activate neurons in the form of vector sequences and
classify them accordingly. The processes after feature extraction are completed by TensorFlow,
Kaldi toolkit and other back-end libraries. The model is then trained using Google Cloud
Services and Keras Sequential API. (Dua et al., 2022)
Figure 41: Word Recognition Rate (WRR) of different speakers (Dua et al., 2022)

Figure 42: Overall Word Recognition Rate (WRR) compared to other speech recognition systems (Dua et al., 2022)
The results from the training model can be seen in figure 42 which uses Word Recognition Rate
(WRR) as the evaluation metric. It is apparent that speaker 7 has the highest WRR of 90.911%
whereas speaker 9 has the lowest WRR of 86.765%. Such a drastic difference in WRR is related
to the varying acoustic pattern, tonal speech frequencies and timing between each utterance of
the speaker. From these data, an average WRR of 89.15% can be derived. When compared with
other speech recognition systems using different modelling technique, such result has strikingly
emerged to be the highest as shown in figure 43. Overall, these results suggest that CNN is the
optimal modelling paradigm in handling large tonal-based speech datasets with MFCC being the
best feature extraction technique. Thus, as pointed out by the author, studies in experimenting
more speakers in varying environments with different speech classifications should be conducted
more extensively in the future. (Dua et al., 2022)

2.3.4 Comparison of Similar Systems

This section compares the 3 speech recognition systems explained in the previous section. The
comparison of spoken language, feature extraction techniques, modelling techniques, speakers’
nature, nature of audio data, software used, strengths and weaknesses are as follows:
Table 1: Comparison of similar speech recognition systems
Speech Recognition Systems
Automatic Speech Dynamic Time Warping Convolutional Neural

Recognition for (DTW) Based Speech Network (CNN) based
Bangla Digits Recognition System for Speech Recognition
Aspects
(Muhammad et al., Isolated Sinhala Words System for Punjabi
2009) (Priyadarshani et al., Language (Dua et al.,
2012) 2022)
Spoken Bangla Sinhala Punjabi

Language
Feature MFCC
Extraction
Acoustic HMM DTW CNN

Modeling
Nature of 100 native Bangladesh 4 native Sinhala speakers 11 native Punjabi speakers
Speaker speakers of equal with different gender and
gender distribution age speaking using
ranging from 16 to 60 continuous mode of speech
years old

Nature of 10 trials for each of 3 sessions are conducted 38 stanzas or sentences

Audio Data the 10 digits whereby on each speaker whereby spoken by each speaker
half of them are the 2nd and 3rd session are with instrumental
conducted in quiet and 3 months and 1 year after background music and a
office rooms the 1st session sampling rate of 44.1 kHz
respectively with a respectively in .wav format
sampling rate of
11.025 kHz
Software  GoldWave  Praat  Praat

 MATLAB 7.0  NCH Suite-
WavePad,
Australia
Results First 6 digits have a Overall WRR accuracy An average WRR of

digit correct rate of of 93.92% with a 89.15% which is the
over 95% whereas decrease in WRR as the highest compared to other
remaining 4 have a session goes on 5 speech recognition
digit correct rate of systems
below 90%
Strengths  Being the first  The speech  Compares WRR

published corpus has a large with results from
paper on vocabulary up to other speech
researching 1055 words recognition
Bangla digit  DTW can cope systems that use
ASR with utterances of different acoustic
 An analysis on different speed modelling
IPA for each and supports techniques
digit and their recognition  CNN possess
corresponding accuracy excellent training
result is speed with the

documented usage of built-in

libraries
 CNN is optimal for
recognition models
with large speech
corpus
 More audio  Nature of  Requires large

data, not speakers and memory to store
limited to 10 audio data are not training data and
Bangla digits described clearly speech corpus
should be  No comparison  Slightly
tested with results from insufficient
 Average digit other case studies speakers with
correct rate is  Very few different dialect in
not computed speakers with Punjabi language
 No comparison different dialect involved
with results in Sinhala
from other language
literature involved
studies

2.4 Summary
This chapter begins by introducing the context of a literature review, followed by discussing the
technical information in the domain research section. In this section, classification for ASR is
first described whereby the sub-categories in terms of speech, speaker, vocabulary and
environment are explained in detail. In the next sub-section, overview architecture of ASR
system consisting of front-end and back-end process is portraited. During front-end process,
feature extraction is briefly introduced with an in-depth explanation about MFCC technique. As
for back-end process, acoustic models, language models, lexicon and their inter-relations are
outlined. What follows is the extensive analysis on 3 machine learning models along with their
mathematical paradigms namely HMM (an acoustic model), DTW (a dynamic programming
algorithm) and CNN (a deep learning model). The section then ends with a brief highlight on
speech recognition or evaluation metrics which are WER and WRR. After having a
comprehensive understanding on specific domain knowledge within machine learning and
speech recognition, a comparison between 3 past research on similar speech recognition systems
in various aspects is done and demonstrated in table 1.

CHAPTER 3: TECHNICAL RESEARCH

3.1 Programming Language Chosen
The term programming language can be defined as the language used by programmers or
developers to interact with computers. Ranging from low-level to high-level programming
languages, most web application, software and programs today are designed using High-level
programming language (HLL) due to its simplicity, sustainability and user-friendliness. HLL
uses a compiler or interpreter to translate the input program into machine language which is
object code comprising of 0s and 1s. Among HLL, each programming language has their own
unique features and functionalities. (Programming Language | What Is Programming Language
- Javatpoint, 2021) Since the current research is a data analytics project, three of the most
popular programming language in the domain are compared, namely Python, R and SAS as
shown in the table below, in which one will be chosen for the implementation of this project.
Table 2: Comparison of programming languages
Programming Language
Aspects Python R SAS
Overview ⁕ Interpreted object- ⁕ Counterpart of SAS ⁕ SAS: Statistical

oriented high level (Ravindra Savaram, Analytics System
programming language 2021) (Team, 2020)
(Ravindra Savaram,
⁕ Used in academic and ⁕ Used by large IT
2021)
research sector companies such as
⁕ Used by many (Ravindra Savaram, Nestle, HSBC, Volvo
worldwide organization 2021) etc. (Ravindra Savaram,
such as Google, Reddit, 2021)
⁕ Data science oriented
Quora etc. (Ravindra
language which ⁕ One of the market
Savaram, 2021)

⁕ Widely used in data performs complex leaders in the field of

science as well as statistical computations data analytics and
software development (Team, 2020) science (Jain, 2017)
fields (Team, 2020)
Cost ⁕ Open-source language ⁕ Expensive

Effectiveness commercial software
⁕ Free to download
(Jain, 2017)
(Team, 2020)
⁕ Although a University
Edition is introduced,
institutions still have to
pay certain amount to
access its full
functionality (Python vs
R vs SAS | Difference
between Python, R and
SAS, 2019)
Ease of ⁕ Beginner-friendly ⁕ Steepest learning ⁕ Easiest to learn

Learning programming language curve with simple code because those with no
difficult to interpret prior knowledge in
⁕ High level
SQL can still
programming language ⁕ Low level
understand SAS
programming language
⁕ Enhanced simplicity (Ravindra Savaram,
and versatility (Jain, 2017) 2021), uses PROC SQL
(Jain, 2017)
(Ravindra Savaram,
2021) ⁕ Comprehensive
documentation and
many tutorials available
(Jain, 2017)

⁕ Stable GUI interface

in repository (Jain,
2017)
Data ⁕ Excellent data handling capacity and can perform parallel computation
Management (Python vs R vs SAS | Difference between Python, R and SAS, 2019)
Data ⁕ Slightly more complex ⁕ Most dynamic and ⁕ Functional graphical

Visualization graphical representation interactive graphical features
representation
⁕ VisPy, Matplotlib, ⁕ Requires
Seaborn and other ⁕ ggplot, RGIS, Lattice programmers to
graphical packages and other graphical understand SAS Graph
packages Package to customize
(Ravindra Savaram,
plots
2021) (Ravindra Savaram,
2021) (Ravindra Savaram,
2021)
Community ⁕ Lack of customer ⁕ Lack of customer ⁕ Excellent customer

Support service support (Jain, service support service support and
2017) community that provide
⁕ Huge online
great amount of
⁕ Huge online community
technical assistance
community with mailing
(Jain, 2017) (Ravindra Savaram,
lists, Stack Overflow etc.
2021)
(SAS vs. R vs. Python -
Javatpoint, 2021)
Machine ⁕ Vast collection of ⁕ Packages such as ⁕ SAS Visual Data

Learning libraries for machine dplyr, ggplot2, Mining and Machine
learning such as data.table, Learning provides
PyTorch, Numpy, Scipy, randomForest etc. are open-source algorithms
Scikit-learn etc. (Best used to construct in constructing machine
Python Libraries for machine learning learning pipelines and

Machine Learning - models (7 Best R automated modelling

GeeksforGeeks, 2019) Packages for Machine API (SAS Visual Data
Learning - Mining and Machine
GeeksforGeeks, 2020) Learning Software,
2020)
Deep ⁕ Advancements in deep ⁕ Basic support in deep ⁕ A lot of

Learning learning with packages learning with packages improvements to be
such as TensorFlow and such as KerasR and made since deep
Keras (Jain, 2017) Keras (Jain, 2017) learning is at early
stages (Jain, 2017)
GUI support ⁕ GUI libraries such as ⁕ GUI libraries such as ⁕ Highly customizable
PyQT5, PySide 2, R Package Explorer, GUI can be created
Tkinter, Kivy, wxPython Conference Tweet using frame entities and
etc. (Top 5 Best Python Dashboard, Bue SCL code
GUI Libraries - Dashboard (Singh,
⁕ Macro QCKGUI is
AskPython, 2020) 2021) and gWidgets
used to insert
(Creating GUIs in R
parameters into frame
with GWidgets | R-
controls
Bloggers, 2010)
(Jain & Hanley, n.d.)

3.1.1 Justification of Programming Language Chosen

Based on the comparison table among Python, R and SAS above, it is evident that Python is the
most suitable programming language to be implemented in this project. In comparison with SAS,
despite SAS having a dedicated customer service platform and a learning curve which is less
steep, python outweighs SAS in terms of being a free, open-source programming language in
which their libraries, modules, functionalities and features are accessible by anyone who
downloads their software. This has provided ample convenience for starters in the programming
industry as well as developers or learners who want to build projects in the field of data science.
When compared with R, despite R having simpler graphical representation toolkits, Python’s
code syntaxes are more readable. It is acknowledged being a low-level programming language, R
acquires learners’ proficiency in basic programming paradigms in order to understand its
syntaxes. (Ravindra Savaram, 2021) Overall, In terms of domain-specific library support, Python
has offered a variety of modules and packages with exclusive features to perform data
visualization, statistical, machine learning, deep learning computations as well as GUI libraries
to build interactive front-end interfaces for users.

3.2 IDE (Integrated Development Environment) Chosen
A standalone programming language is insufficient to interact with the computer without an

appropriate platform or mechanism. Hence, Integrated Development Environment (IDE) is an
inevitable part of every project development plan. It is generally referred to a software
application that helps developers to write source code effectively. Without having to manually
integrate and configure different software, IDE acts as a centralized development tool
management application which facilitates the software development process for developers.
(What Is an IDE? IDE Explained - AWS, 2022) The 3 key components of IDE can be listed as
follows: code editor, compiler and debugger. Code editor being distinguished from other text
editors is used to write and edit source code. Compiler is used to transform source code in the
form of human language into binary or object code that is interpretable by the machine. As for
debugger, it is used to detect source code and output’s error during testing. (What Is IDE or
Integrated Development Environments? | Veracode, 2020) Most IDEs are also featured with
syntax highlighting, code completion and refactoring support which enhance codes’ readability.
(What Is an IDE? IDE Explained - AWS, 2022)
In the context of this project, Jupyter Notebook is chosen to be the IDE. It is a powerful tool that
integrates code with other media such as narrative text, visualizations, mathematical equations,
videos, images etc. . It is a free, open-source and standalone software that is part of the
Anaconda data science toolkit. (Pryke, 2020) It can also be run on web browsers like Firefox and
Chrome. The popularity of this application has to dealt with its ability in achieving a balance
between a simple text-editor and feature-rich IDEs that require complicated initial startups.
Hence, it becomes handy to solve problems regarding data exploration, data pre-processing and
modelling. Developers can also understand and debug code easily through the text description
section that describes the functionality of respective code blocks. It is also supported with
segmentation of section between code cells, output cells and markdown cells. The code cell
displays Python’s source code written by developers, output cell displays command line, images

or other visualizations whereas markdown cell contains headings alongside images and links.
(Kazarinoff, 2022) In the context of this project, Jupyter Notebook extension on Visual Studio
Code (VSC) will be used, in order to cater greater RAM for complex machine learning
operations within the local environment instead of the default allocated limited RAM and disk
space.
3.3 Libraries / Tools Chosen

This section describes various libraries in Python for data pre-processing, machine learning, deep
learning, data visualization and GUI support to construct the ASR system.
3.3.1 Data Pre-processing

NumPy
Numerical Python generally referred as NumPy is a Python library that works on multi-
dimensional array objects. Unlike Python lists which are one-dimensional array, NumPy arrays
are multi-dimensional in which the n-dimensional array is called “ndarray”. Moreover, NumPy
arrays are homogeneous, as in they can only accept elements with the same data type whereas
python lists are heterogeneous. In terms of efficiency, NumPy arrays can perform mathematical
operation among arrays such as addition, multiplication etc. and are faster than Python lists.
(Great Learning Team, 2022) In this project, NumPy will be used to perform Fourier
transformation during MFCC.
SciPy
Similar to NumPy, Scientific Python is an open-source, BSD-licensed extension library from

NumPy which has higher level syntaxes applicable across multiple domain of knowledge. As an
additional tool for performing operations on multi-dimensional arrays, it provides algorithms for
linear algebra, integration, optimization and other advanced statistical problems. (SciPy, 2022) In
this project, SciPy will be used read and manage the audio files.
3.3.2 Machine Learning

Scikit-learn
Being one of the most comprehensive tools in Python, Scikit-learn consists of statistical
modelling and machine learning functionalities such as supervised learning algorithms,

unsupervised learning algorithms, feature extractions etc. . It is also featured as an open-source

tool and supported by a huge community across the globe. (Jain, 2015) In this project, Scikit-
learn will be used to perform HMM and other modelling techniques after MFCC.
3.3.3 Deep Learning

Keras
Keras is chosen as the deep learning library because it is a high-level API which supports neural
network computation. It has a relatively low learning curve as its front-end is Python-based, thus
utilizing common code and providing clear error messages upon execution failure. In regard to
supporting almost every neural network mode, it is supported by multiple frameworks such as
TensorFlow, MXNet, CNTK etc. . (Simplilearn, 2021) In this project, Keras will be used to
perform CNN modelling on the extracted features.
3.3.4 Data Visualization

Matplotlib
A key component of this project is to observe the changes in graphs of vocal wave form during
the feature extraction stage. Hence, Matplotlib is used as a visualization extension library for
NumPy to provide visual access for multi-dimensional arrays. Several plots offered by
Matplotlib include scatter plot, pie chart, line plot, histogram etc. . It can be installed in
Windows, MacOs and Linux. (Python | Introduction to Matplotlib - GeeksforGeeks, 2018)
3.3.5 GUI
Streamlit
Since this project aims to construct a simple yet functional GUI in the shortest possible time,
Streamlit is an optimal choice. It is a simple web application that can construct effective and
intuitive user interface quickly. Being an open-source Python library, it is compatible with
Python’s Data Science and Deep Learning libraries. has no front-end coding involved such as
HTML, CSS, JS. Moreover, it allows images, audios and videos to be uploaded with additional
widget support such as sliders, buttons, check box, radio, selection box etc. Visualization on
Streamlit using charts, graphs, maps and plots can also be done which is the optimal choice for
data science based projects like the current implementation. (mhadhbi, 2021)

3.4 Operating System Chosen
Figure 43: Overview of Operating System (Understanding Operating Systems - University of Wollongong – UOW, 2022)
Operating System (OS) is a frequent term used in the technology field that refers to a program
acting as an interface between the user and hardware components after it is loaded by a
bootstrap. It offers an environment for users to execute programs or applications and hides
specific details of hardware through abstraction. (What Is Operating System? Explain Types of
OS, Features and Examples, 2020) To perform file, memory management, I/O operations and
other tasks, all application programs running in the background need to access Central
Processing Unit (CPU), memory and storage to gain equal number of resources. Thus, the OS
facilitates the interaction route between the hardware and application as well as system software
for the user to be able to interact with the programs as illustrated in the figure above. The most
common OS are Microsoft Windows, which is preloaded on all laptops except Apple products,
Mac OS which is preloaded in all Apple laptops and Linux which is not preloaded but users can
voluntarily download it for free. Hence, Microsoft Windows 11 Pro has been chosen as the OS
for this project. (Understanding Operating Systems - University of Wollongong – UOW, 2022)
The minimum hardware requirements for this project are stated as follows:
i) Processor: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz 1.50 GHz

ii) Random Access Memory (RAM): 16GB
iii) Peripheral devices:
a. Mouse

b. Keyboard
c. Monitor
d. Microphone (2 channels, 16 bits, 48000Hz)
e. Speaker (16 bits, 48000Hz)
f. Router (RJ45 / Wireless Fidelity (Wi-Fi))
The minimum software requirements for this project are stated as follows:
i) Operating System: Windows 11 Pro

ii) Scripting Language: Python 3.10
iii) Documentation and Planning: Microsoft Word 2007, Microsoft Excel 2007,
Microsoft Project 2021

3.5 Summary
To sum up this chapter, a comparison between Python, R and SAS in applicability and other
aspects has been made. After such comparison, Python has been chosen and justified to be the
programming language for this project. This has to do with the variety of data pre-processing and
machine learning libraries available in Python. Next, Jupyter Notebook has been chosen to be the
IDE with Visual Studio Code (VSC) as the local environment to support Python and such
selection has been justified accordingly. Various Python libraries used for data pre-processing,
machine learning, deep learning, data visualization and GUI creation purposes have been
documented with their features highlighted to be applied during modelling phase, which is the
coding stage during FYP Semester 2. Lastly, Microsoft Windows 11 Pro has been chosen as the
OS for this project along with minimum hardware and software requirements stated clearly.

CHAPTER 4: METHODOLOGY
4.1 Introduction
In the context of research, the broad use of the term “methodology” refers to a set of guidelines
or framework used to solve a problem in specific domains. Since this is a data science based
project, data mining methodology will be imposed. This chapter will compare the three data
mining methodologies, namely: KDD, CRISP-DM and SEMMA, discuss the reasoning in
choosing one of these methodologies and give an in-depth explanation about the activities to be
carried out in each phase of the methodology chosen.

4.2 Comparison of Data Mining Methodologies

The table below demonstrates the comparison between KDD, CRISP-DM and SEMMA in which
one data mining methodology will be chosen for the implementation of this project.
Table 3: Comparison of data mining methodologies
Data Mining Methodologies
Aspects KDD CRISP-DM SEMMA
⁕ Knowledge ⁕ Cross-Industry ⁕ Developed by SAS

Discovery in Standard Process Institute
Databases
⁕ Developed by IBM for ⁕ Emphasizes model
⁕ Emphasizes data mining and science development in SAS
Overview knowledge retrieval tasks Enterprise Mine

from large amount of software
(Quantum, 2019)
data
(Dåderman et al., n.d.)
Stages There are a total of 5 There are a total of 6 There are a total of 5
phases: phases: phases:
⁕ Selection ⁕ Business ⁕ Sample

Understanding
⁕ Pre-processing ⁕ Explore
⁕ Data Understanding
⁕ Transformation ⁕ Modify
⁕ Data Preparation
⁕ Data Mining ⁕ Model
⁕ Modelling
⁕ Interpretation / ⁕ Access
Evaluation ⁕ Evaluation
(Quantum, 2019)
(Quantum, 2019) ⁕ Deployment

(Quantum, 2019)
⁕ Previous stages are ⁕ Not limited to the ⁕ Has a stricter cyclic

repeated if insufficient evaluation stage, each nature
knowledge is attained stage can be transitioned
⁕ Will repeat the cycle
from the evaluation to the previous ones if it
until the data mining
Flexibility stage is deemed incomplete
goal is reached
(Quantum, 2019) ⁕ Need not to cycle back
(Quantum, 2019)
to the starting phase
(Quantum, 2019)
⁕ Has a pre-KDD ⁕ Pays off for large ⁕ No phase that

process which allows projects as it is one of the determines projects’
Business developers to gather phases for the goal from business
Understanding information from methodology (Dåderman point of view
business users et al., n.d.) (Dåderman et al., n.d.)
⁕ Background ⁕ Easy to understand ⁕ Requires extensive

knowledge regarding knowledge in SAS
Prerequisite ⁕ Requires general
data mining is Enterprise MinerTM
knowledge understanding on the
required tool and data mining
structured processes and
techniques
data mining approaches
⁕ No website with ⁕ One main task along ⁕ Full documentation

detailed instruction with several subtasks are available in SAS
provided explained thoroughly in Enterprise MinerTM
Documentation each stage (Dåderman et tool (Dåderman et al.,
⁕ Guidelines are based
al., n.d.) n.d.)
on a scientific article

4.3 Justification on Data Mining Methodology chosen

Based on the comparison made above, it is concluded that CRISP-DM will be chosen for the
current projects’ implementation. As indicated by its name, it can be implemented in any data
science project regardless of the projects’ domain, size, budget and other factors. Furthermore,
this methodology is also capable of allowing several phases to perform in a non-sequential order.
In other words, it is possible to backtrack to previous phases as the subsequent phase depends on
the precedent action’s outcome and thus has the greatest flexibility among the three
methodologies. With a rich number of documentation and guidelines available, CRISP-DM
guarantees a simple and uniform framework for developers to uphold best practices in managing
and replicating projects. (Great Learning Team, 2020) Another key feature is that CRISP-DM
allows developers to establish long-term strategies by constructing a simple model upon first
iteration, then improvising it in further iterations. (LinkedIn, 2022) In the context of this project,
it is the best data mining methodology because the ASR acoustic model may not be the most
accurate one. As such, the acoustic inputs, dataset chosen, proportional percentage given to
training and testing operations as well as other criteria have to be adjusted to achieve the most
performant model. The project also requires developers to relate back to the aims, objectives,
potential benefits and functionalities provided to target users when evaluating the effectiveness
of the model to be incorporated into the e-learning systems.

4.4 CRISP-DM
Figure 44: Overview of CRISP-DM methodology (Wirth, R., & Hipp, J., 2000, April)
Phase 1: Business Understanding
The project team or developers must have a thorough understanding on the project background to
formulate a well-defined analytic and business strategy. In order to achieve this, the business aim
or primary goal must be identified, which is the development of a fully functional ASR system
for e-learning sector in this case. (Great Learning Team, 2020) Having established aims, the
current situation must be assessed whereby the tangible benefits, non-tangible benefits,
functional requirements, non-functional requirements, budget allocated, project completion time,
constraints alongside potential risks are identified. Most importantly, resources of the project
must be documented through fact-finding techniques. These resources are as follows: hardware,
data mining tools, technical expertise and operational data. The business or data mining success
criteria can then be determined which are the pre-defined objects in helping the project team to
achieve the aim. Lastly, a project plan and Gantt Chart are documented using Microsoft Excel
and Microsoft Project respectively. (Crisp DM Methodology - Smart Vision Europe, 2020)
Phase 2: Data Understanding
After understanding the domain backgrounds, the initial dataset used to train the model must be
collected and loaded into the chosen IDE, Jupyter Notebook within local VSC environment in
this case. For this project, the dataset will be retrieved from an online source comprising of audio
files and their corresponding transcript as reference. The surface properties of the dataset are

then examined extensively by listing out the number of observations, attributes, data type, range
and meaning of attribute values in business terms. (Great Learning Team, 2020) Data exploration
is then performed by the developers to find correlation between acoustic properties, identify
target variable, calculate simple aggregations and compute statistical analysis. The data’s quality
is also verified to ensure that there are no missing fields, noisy data or erroneous values. (Crisp
DM Methodology - Smart Vision Europe, 2020)
Phase 3: Data Preparation
The dataset used to train the acoustic model for ASR in the next stage is selected alongside the
reason for such choice justified. To ensure that the output is accurate and not prone to misleading
information, data cleaning practices are adhered by developers. This includes filling missing
values with a general indicator, eliminating erroneous values and formatting values of specific
field into proper data types. (Great Learning Team, 2020) As explained earlier, feature extraction
will be performed at this stage whereby the flow of speech signals will be converted into
numerical vectorized form, i.e., MFCC or FFT to act as an input to the acoustic model. With
more features being retrieved, the deep learning model deduced will have a higher speech
recognition accuracy.
Phase 4: Modelling
As pointed out in the previous section, several statistical or machine learning techniques will be
implemented as potential acoustic models in formulating a fully functional ASR and assumptions
about data or tools to integrate with the model are made. Next, to validate each models’ quality, a
test design is conducted by separating the dataset into training, validation and testing segments
whereby the former is used to construct the model, the middle is used to validate the model
during training and the latter is used to estimate the model’s quality after training. Upon running
the model using tools, the parameter settings and their reasoning are justified. The model is then
run using the pre-processed signals as input with the obtained results evaluated using WER and
CER as explained in section 2.2. Moving on, the special features and potential issues are derived
from interpretations of results. The model is then assessed continuously by incorporating
business success criteria, previous findings and other metrics into consideration and previous
steps are repeated multiple times until the best model is found. (Great Learning Team, 2020)

Phase 5: Evaluation
Unlike evaluating the model’s accuracy in the previous stage, developers start off by assessing
the general aspect of the project. In terms of this, the models are evaluated on the basis of
whether they meet the business objectives. If all models satisfy such requirements, the ASR
model with the greatest applicability in solving existing e-learning systems’ issues within
educational institutions is approved. With that being said, the developers conduct a
comprehensive review on previous stages to highlight quality assurance issues that have been
overlooked. The next step is then determined, as in whether the project team should proceed to
the next phase if modelling results are deemed up-to-par or re-iterate previous stages to refine the
model within pre-defined constraints. (Crisp DM Methodology - Smart Vision Europe, 2020)
Phase 6: Deployment
Before releasing the final product, which is the ASR system, to be incorporated with e-learning
systems, implementation strategies are summarized in a deployment plan. The model will be
deployed using Streamlit web application with other functionalities available. Monitoring and
maintenance activities are to be carried out from time to time to retain the system’s performance
and determine the threshold in which the model is deemed to be unapplicable within the system.
The final report is then documented to provide target users (students and tutors) a comprehensive
summary on the deliverables. Last but not least, a project retrospective is conducted either
through interviews or questionnaires to gather information from end users regarding the system’s
features, drawbacks, potential enhancements and other experiences. (Great Learning Team,
2020)

4.5 Summary
To sum things up, out of the three data mining methodologies discussed in section 4.2, CRISP-
DM is chosen to be the most suitable methodology for this project. Generally speaking, it is the
standard methodology in a variety of industries worldwide especially in the data science sector.
Its well-defined process, in-depth documentation and high flexibility in terms of transitioning
between phases make it so widely used in many applications. Moreover, developers do not need
much pre-requisite data mining knowledge in order to grasp the main activities in each stage.
There are a total of 6 stages in CRISP-DM. Firstly, in business understanding stage, developers
must understand the existing issues within the e-learning and corresponding domains as well as
the aims and objectives of integrating an ASR system into e-learning systems in English
language. In data understanding phase, developers should find a suitable speech dataset in the
form of online tutorial videos. In data preparation phase, the speech signals must be segmented
with background noise removed for acoustic characteristics or features to be extracted through
MFCC technique. The project team must then utilize the filtered acoustic inputs to train and
validate multiple machine learning models with backtracking involved to obtain the finest model.
The models’ performances are then evaluated using WER and CER as the two indicators, only
then the model with the highest speech recognition accuracy can be chosen. Lastly, the acoustic
model is deployed into an ASR system which is featured with Streamlit GUI implementation and
other functionalities for the users to interact within the e-learning system. Contingency and
maintenance plan is also documented to monitor the products’ effectiveness in the long run.


CHAPTER 5: DATA ANALYSIS

5.1 Introduction
Conceptually speaking, data analysis is the process of collecting, extracting and analysing raw
data in order to extract actionable insights that can facilitate decision making procedures. It is an
inevitable process for enterprises across all industries as it identifies targeted customer segments
for their marketing departments to advertise their product as well as reduces operational and
functional costs when it comes to allocating resources to departments based on priority level and
other demographic attributes of a segment. (Kelley, 2020) In the context of ASR, quantitative
analysis will be performed through various statistical methods as audio data will first be
converted into vectorized numerical form before being mapped into characters to form words. As
such, this section will present information about the metadata of dataset used, Exploratory Data
Analysis (EDA) or data exploration, data cleaning, data visualization and modelling with
snapshots of code snippets and outputs.

5.2 Metadata
Figure 45: Contents of LJ speech dataset
Figure 46: list of "wav" audio files in "wavs" folder
As shown in figure , the chosen LJ speech dataset for the project consists of 3 components,
namely a “wavs” folder that stores individual “.wav” format audio files illustrated in figure 47, a
README file that stores metadata about the dataset and a “csv” file. Each audio file in the
“wavs” folder has a sample rate of 22,050 Hz, consists of a single-channel 16-bit Pulse-Code
Modulation (PCM) and ranges from 1.11 to 10.1 seconds. Within the “csv” file, there are 13,100
observations, each having 3 attributes that are tabulated in table 4. The metadata also mentions
there are a total of 225,715 words spoken by readers whilst glimpsing through 7 non-fiction
books in which 13,821 of them are unique words. (The LJ Speech Dataset, 2016)
No Attribute Type Description
1 ID Nominal Unique name given

to each “.wav” file
2 Transcript Nominal Correct words spoken

by reader
3 Normalized Nominal Transcript with

Transcript numbers, ordinals
and monetary units
written in full form
Table 4: Attributes of dataset

5.3 Initial Data Exploration
Figure 47: Code and output for data extraction
The figure above shows the first 5 rows of the dataset being displayed. The code uses “pandas”
library’s “read_csv()” function so that information in the “metadata.csv” dataset is stored in a
DataFrame object. The parameter “sep” indicates the pipe character (“|”) will be the separator
between columns and “header” indicates default numbering starting from 0 will be assigned to
each column.
Figure 48: Code and output for adding column names
The default numbering of column names is replaced with corresponding names similar to that in
the metadata as shown in the figure above with the first 5 rows displayed.
Figure 49: Code and output for dimension of DataFrame
The figure above shows the number of rows and columns available in the form of a tuple,
indicating there are 13,100 rows and 3 attributes in the original dataset. This complies with the
information presented in the metadata section.

Figure 50: Code and output for information of each attribute
The figure above shows an overview of the DataFrame structure derived from the dataset. It
provides information on the data types for each attribute, the number of non-null values and total
memory usage to load the data. It is observed that the datatype for all attributes is “object” which
indicates a string or mixture of string and other data types. There are also no missing or null
values in the 3 attributes except for the “Normalized Transcript” column with 16 missing values.
To address this, data cleaning and pre-processing will be carried out accordingly in the next
section.
Figure 51: Code and output for total and unique word count in "Normalized Transcript" column
By iterating through each row of the DataFrame object, word tokenization is applied on the
sentence in “Normalized Transcript” column. The total number of tokens, excluding special
characters are computed. Each token is then added to a “set” object in which the total unique
count of tokens across all rows’ corresponding column are computed as shown in the figure
above.

Figure 52: Code and output to display frequencies and number of samples for an audio file
The figure above shows 5 randomized audio files are selected from the dataset. The sample rate
(frequency) and signal of the audio file is obtained by reading the file’s path using SciPy’s
“wav”s module “read” function. The name of the 5 randomized audio files are then stored in a
list called “eda” for further usage in visualization. The frequency of each audio file is displayed
in which all of them have the same sample rate of 22,050 Hz. The signal or number of samples in
the audio file differs from one another due to different vocal speeds.

5.4 Data Cleaning

5.4.1 Handling Missing Values
Figure 53: Code and output for dropping rows that contain empty values
When the function “dropna()” is called on the DataFrame object in figure above, all rows that
contain one or more missing values in any column will be removed. As such, the 16 rows with
missing values in the “Normalized Transcript” have been removed, leaving only 13,084 non-
missing rows.
Figure 54: Code and output for dropping "Transcript" column
The code in figure above is used to drop the “Transcript” column from the DataFrame object by
calling the “drop” method. The “axis” parameter is set to 1, indicating a column-wise drop. The
column is redundant with the “Normalized Transcript” column since the former represents
simple form of abbreviations whereas the latter expresses those in full word form. The first 5
rows of the DataFrame are displayed, indicating the column has indeed been dropped.

5.4.2 Removing Erroneous Data
Figure 55: Code and output for dropping rows that contain non-ASCII characters in "Normalized Transcript" column
The figure above demonstrates the removal of rows that contain non-ASCII characters with
corresponding output displayed, as in 23 out of the original 13,084 rows have been dropped,
leaving only 13,061 rows. A regular expression check on whether the normalized transcript
contains any characters beyond the range of 7-bit (ASCII) characters is applied to the
“str.contains” method which will return a Boolean Series. Then, Boolean indexing is performed
with the “loc[]” method to select only rows that do not satisfy the condition above. This process
is crucial since non-ASCII characters are not valid English characters nor words and are
considered as erroneous data.

5.4.3 Data Sampling

Generally speaking, the amount of memory required to train deep learning models can never be
lower than 12GB for a GPU. To satisfy RAM requirements even further, an extra 25% RAM
more than the available GPU memory must be allocated for deep learning models that can
become more complex over training time. (Montantes, 2021) However the current hardware
configuration only offers 2GB of GPU memory with multiple Blue Screen Of Death (BSOD)
experienced upon training the deep learning model with 70% of the audio data available. An
alternative is to apply data-sampling method, which selects a subset of the dataset based on
specific criteria to predict the accuracy of findings.
Figure 56: Code and Output for computing word frequency distribution
The code in figure above produces a word frequency distribution of each word in the
“Normalized Transcript” column in the form of a dictionary, with key-value pair representing the
word and frequency count respectively. The dictionary is then sorted in descending order based
on the frequency count of each word and filters words with a frequency count greater than or
equal to 20 to be displayed to the console. From the output, it appears that conjunctions,
prepositions and pronouns are the common words in the context of this project.

Figure 57: Code for data sampling
The code in the figure above creates an empty DataFrame object (“filtered_df”) with 2 columns:
“File Name” and “Normalized Transcript” assigned to it. The original DataFrame object (“df”) is
then iterated to perform word tokenization on the transcript column. If the proportion of words
found in the tokenized list comprehension is greater than or equal to 90%, the corresponding row
is added to “filtered_df” with the counter incremented by one. It is estimated that the deep
learning model is more likely to recognize speech if the word-found percentage in each audio file
is greater. From the output in figure below, there are a total of 2612 audio files that satisfy the
condition above with the first 5 rows displayed in the console.
Figure 58: Output for data sampling

Figure 59: Code and output for total and unique word count in "Normalized Transcript" column after Data Sampling
After performing data sampling, the code above is used to compute the total and unique word
count in the “Normalized Transcript” column of the filtered DataFrame object. By iterating
through each row of the DataFrame object, word tokenization is applied on the sentence in
“Normalized Transcript” column. The total number of tokens, excluding special characters are
computed. Each token is then added to a “set” object in which the total unique count of tokens
across all rows’ corresponding column are computed. It is observed that the total word count has
been reduced from 223,964 to 44,140 whereas the unique word count has been reduced from
14,364 to 3,201. Within the constraint of limited processing resources, such smaller subsets of
data can help train our model within a shorter time frame.

5.4.4 Data Transformation

To feed the audio signals into the model, they must be converted into speech features in the form
of vectorized inputs. To facilitate this operation, the original DataFrame object has to undergo
several changes so that the vectorized sequence and the transcript label can be merged
accordingly.
5.4.4.1 Generate List of Dictionary
Figure 60: Code Snippet to Create list of dictionary from DataFrame Object
The function above first creates an empty list which will store the dictionary for each input.
Then, the DataFrame object is iterated using the “iterrows” method to obtain the file path stored
in local directory and transcription by accessing the “File Name” and “Normalized Transcript”
column respectively. The dictionary is then created with 2 key-value pairs for audio and
transcript, then inserted into the list. This facilitates the retrieval of data from DataFrame Object
as it is easier to retrieve data from a list of dictionary.
5.4.4.2 Generate Spectrogram
Figure 61: Code Snippet to generate Spectrogram for audio file
With easily accessible inputs ready, the audio file must be converted into a spectrogram which is
a representation of audio signals in time-frequency domain to be accepted and trained by the

models. As such, the code above starts off with reading the audio file using “tf.io.read_file”
function. The file is then decoded using “tf.audio.decode_wav” function with a single channel
and compressed by removing the last axis. The Short-time Fourier Transform (STFT) of an
audio signal is computed using “tf.signal.stft” function. It accepts a Tensor floating-point signal
(“sig”), the number of samples for each frame (“frame_length”), the number of samples between
intervals of frame (“frame_step”) and the number of points in FFT (“fft_length”) as input and
produces the spectrogram. (Tf.signal.stft | TensorFlow V2.12.0, 2023) To normalize the
spectrogram, it is applied with a power transformation with its absolute value, subtracted from its
mean using the “tf.math.reduce_mean” function, then divided by the standard deviation using the
“tf.math.reduce_std” function, producing a final matrix with a 2D shape representing number of
frames in audio and number of frequency bins respectively with all values between -1 to 1.
5.4.4.3 Generate Vocabulary
Figure 62: Code Snippet for Vocabulary Set with Encoder and Decoder
The code above is used to define the vocabulary set with encoder and decoder for
“normalized_transcript” column. First, a list of characters is listed out to be used as the encoder
and decoder’s vocabulary. Keras’s “StringLookup” function is then called to build the encoder. It
accepts the vocabulary list and an “oov_token” which resembles an empty string to specify out-
of-vocabulary characters. The decoder is also built using the same function with same parameters
with an additional “invert” parameter set to True. The mapping from indices to characters will be
inverted, which facilitates the decoding operation of vectorized integer sequences into character
sequences. The “get_vocabulary()” and “vocabulary_size()” function are then called to display
the list of vocabularies and its total count in the output panel. (Tf.keras.layers.StringLookup |
TensorFlow V2.12.0, 2023)

5.4.4.4 Generate Transcript Label
Figure 63: Code Snippet to generate Transcript Label
The code above will convert the inputted text or transcript into lowercase form using
“tf.strings.lower” function. Then, “tf.strings.unicode_split” is applied to split the Tensorflow
String object into a list of Unicode characters by specifying the “input_encoding” be “UTF-8”.
(Module: Tf.strings | TensorFlow V2.12.0, 2023) The “encoder” object defined in section
5.4.4.3 will then be applied to map the Unicode characters into sequence of integers.
5.4.4.5 Generate Tensorflow Dataset
Figure 64: Code Snippet to Merge Spectrogram and Transcript Label
The code above is used to construct a tuple by taking the path of an audio file and its
corresponding transcript as arguments.
Figure 65: Code Snippet to Construct Keras Dataset Object
The function above takes a DataFrame object and the batch size of Tensorflow dataset as an
argument. The list of dictionaries is first created by calling the “create_dct_from_df” function.
The list of audio path and transcript are then stored in respective lists. A Tensorflow dataset is
then created using “tf.data.Dataset.from_tensor_slices” function which accepts a tuple in the
form of audio path and transcript as input. The “merger” function is then called within the “map”

function to convert the tuple into a tuple of spectrogram and label. Next, the “padded_batch”
method is used to compile the tuple data into batches with “batch_size” specifying the number of
samples in each batch. Each sample within the batch must have the same shape, which can be
achieved by padding shorter samples with zeros. The “pre_fetch” function will then perform
asynchronous operations on the current and next batch dynamically through assigning the
“buffer_size” parameter to “tf.data.AUTOTUNE” to improve modeling performance. The
Dataset object is then formulated with its size displayed using “cardinality” function.
(Tf.data.Dataset | TensorFlow V2.12.0, 2023)

5.5 Data Visualization

5.5.1 Transcript
Figure 66: Code Snippet to plot bar chart for token frequency dictionary
The code above utilizes the dictionary for frequency of individual tokens to plot a vertical bar
chart. First, the key-value pairs are converted into respective list. For tokens with a frequency of
fewer than 1200, they are labelled as “others” whereas tokens with a frequency more than or
equal to 1200 are labelled as their respective tokens. The plot is then formed with the “subplots”
function with the X-axis’s tokens rotated by 45 degrees using the “xticks” function with a
rotation parameter of 45 to prevent overlapping of tokens upon display. (Matplotlib.pyplot.xticks
— Matplotlib 3.7.1 Documentation, 2023)
Figure 67: Bar Chart for common tokens versus frequency

From the bar chart, we can infer that the token “the” has the highest frequency count and is twice
as high than the frequency of 2 nd highest token “of”. We can also observe that apart from the
word “others”, all other common words are either prepositions (‘of’, ‘to’, ‘in’, ‘on’, ‘with’, ‘by’,
‘at’), conjunctions (‘and’, ‘for’, ‘as’), pronouns (‘that’, ‘he’, ‘his’, ‘it’, ‘which’), determiners
(‘the’, ‘a’) and finite verbs (‘was’, ‘had’, ‘were’).
Figure 68: Code Snippet to plot histogram for individual transcript's token count versus frequency
The code above creates a histogram plot using seaborn’s “histplot” method. The plot shows the
frequency distribution for the number of tokens in each transcript. Upon adding the “kde = True”
parameter, a density estimation line is added to the plot which estimates the probability of the
underlying data’s distribution. The “axvline” is used to draw a red vertical line representing the
mean value for number of tokens in each transcript. (Matplotlib.pyplot.axvline — Matplotlib
3.7.1 Documentation, 2023)
Figure 69: Histogram for individual transcript's token count versus frequency
The histogram above with density curve overlaid on top of it shows that the distribution is
relatively normal, with a peak around the 20 mark. This implies that the number of tokens in
common transcripts are between 15 to 25 with the mean value around the 17 to 18 mark. Token

count for majority of transcript fall between 0 and 35. As the line moves away from the range,
the frequency tapers off.
Figure 70: Code Snippet to plot histogram for individual transcript's token count versus frequency after data sampling
The code above creates a histogram plot using seaborn’s “histplot” method. The plot shows the
frequency distribution for the number of tokens in each transcript after data sampling. Upon
adding the “kde = True” parameter, a density estimation line is added to the plot which estimates
the probability of the underlying data’s distribution. The “axvline” is used to draw a red vertical
line representing the mean value for number of tokens in each transcript after data sampling.
(Matplotlib.pyplot.axvline — Matplotlib 3.7.1 Documentation, 2023)
Figure 71 Histogram for individual transcript's token count versus frequency after data sampling
The histogram above with density curve overlaid on top of it shows that the distribution is
bimodal. This is because the middle peak value is lower than the two neighbouring bars,
indicating that there are two groups of filtered transcript, one with fewer than 15 tokens and the
other with more than 15 tokens. With such two groups, the set of filtered transcripts can be more
generalized and do not always concentrate near the median or mean of the graph. Moreover, the
mean value for token frequency count in individual transcript after data sampling is similar with
that of before data sampling, suggesting that the data sampling technique applied is optimal.

5.5.2 Audio
Figure 72: Code Snippet to plot waveform, MFCC and Mel Coefficients
The code above will plot the waveform, MFCC and Mel Coefficients of 5 randomized audio files
that are stored in “eda” list in figure 52. First, the list of audio names is iterated to form an audio
path by merging with the root path. The frequency and signal of the audio file are then obtained
using “wav.read” function which accepts the audio path as input. To plot the raw frequency wave
of audio file, the number of samples is obtained by accessing the signal’s first shape’s element.
Then, the X-axis which resembles time in seconds is computed by dividing the sequence of
samples from 0 to “total – 1” by the sampling frequency alongside the Y-axis which resembles
the signal. The two axis are then plotted using “matplotlib” module’s “plot” function. Horizontal
red lines indicating the minimum and maximum audio signals are also plotted using “axhline”
function.
During computation of Mel Frequency Cepstral Coefficients (MFCC) of audio signal, the “mfcc”
function from “python_speech_module” is applied. It accepts the audio signal, sampling

frequency, NFFT or number of data points in Fast Fourier Transform (FFT) and a Hamming
window function to be applied on each frame. (Welcome to Python_speech_features’s
Documentation! — Python_speech_features 0.1.0 Documentation, 2013) The NFFT is set to
1024 instead of the default value of 512 since some of the audio file’s signal has exceeded 512
data points and has to be trimmed to process capture audio features within each frame. The
NFFT must also be a power of 2 since it results in the most efficient output upon algorithm
execution. (Søndergaard, n.d.) Normalization is then performed by computing the mean and
standard deviation of MFCC coefficients across each frame. The MFCC coefficient is then
subtracted from the computed mean and divided by standard deviation. Finally, the MFCC
matrix is transposed to adjust reach row to be a frame that corresponds to a coefficient. The
transposed matrix is then displayed in the form of a heatmap using “matplotlib” module’s
“imshow()” function. (Matplotlib.pyplot.imshow — Matplotlib 3.7.1 Documentation, 2023)
In visualizing the Mel filterbank energies of an audio file, “mfcc” module’s “logfbank” function
is used which accepts an audio signal, sample rate and NFFT as inputs. The output is a 2D
numpy array with a shape consisting of the number of overlapping windows in FFT and number
of triangular filters in Mel-scale filterbank. The array is then transposed so that the X-axis
corresponds to the windows and Y-axis corresponds to the Mel filterbank coefficients.
(Søndergaard, n.d.) The transposed matrix is then displayed in the form of a heatmap using
“matplotlib” module’s “imshow()” function. (Matplotlib.pyplot.imshow — Matplotlib 3.7.1
Documentation, 2023)

Figure 73: Raw Waveform for Amplitude Against Time for Audio File
The raw waveform plot above shows the amplitude of the audio signal over time in seconds for
an audio file. The amplitude refers to the audio’s magnitude and is measured using decibels scale
(“dB”). The audio file above has a total duration over 2.5 seconds. An amplitude peak is recoded
around the 1 second mark with a peak value just below 20,000. As time propagates, the
amplitude of the audio file decreases with several intervals of complete silence with minimal
background noises. This implies that the main content or vocal point of the speech is at the front
to middle section of the audio file.
Figure 74: Heatmap for MFCC Coefficients against Windows for Audio File

The heatmap above represents the intensity of energy signals across different coefficients of
respective window. For warmer colours such as red or brown, the magnitude of coefficient or
energy at that particular window is stronger whereas for lighter colours such as orange or yellow,
the loudness of audio at that particular window is weaker. Moreover, dark-coloured regions in a
specific window are more prone to be distinguishable from other speech patterns. From heatmap
above, we can observe that most of the red or brown-coloured window-MFCC pairs are
concentrated at the start to middle section of the audio file, ranging from the 0 th to 150th window.
This is aligned with our findings from the raw waveform plot in figure 73.
Figure 75: Heatmap for Mel Coefficients against Windows for Audio File
The heatmap above represents the intensity of energy signals across different Mel filterbank
coefficients of respective window. We can see that at the starting to middle section of the audio
file between 0th to 150th window, the dark-coloured sections are concentrated at the first few
coefficients only. This implies that the audio file at this region has more energy in the lower
frequency range than that of the high frequency range. However, in the last section of audio file
between 175th to 250th window, we can see that the dispersion of energy throughout all frequency
ranges are relatively high with an even spread of dark-coloured regions across all coefficients.

5.6 Data Partition
Figure 76: Code Snippet to create function for data partitioning
A function is defined which accepts a DataFrame object, a batch size for the Tensorflow Dataset
with a default value of 16, a training proportion with a default value of 0.7 or 70%, a validation
proportion with a default value of 0.15 or 15% and a testing proportion with a default value of
0.15 or 15%. First, the “formulate_dataset” function is called to create a Tensorflow Dataset
object with the specific batch size. The size of the Dataset object is then computed and
multiplied with the training and validation proportion to obtain the training and validation
interval respectively. The training set (“train_ds”) is created by taking the first “train_interval”
elements; validation set (“val_ds”) is created by skipping the first “train_interval” elements and
taking the next “val_interval” elements; testing set (“test_ds”) is created by skipping both
interval’s elements and taking the remaining ones. The cardinality of each set is displayed with
the training, validation and testing Tensorflow Dataset returned.
Figure 77: Code Snippet to perform data partitioning
By calling the function above, data partitioning is done using the filtered DataFrame object, a
batch size of 25 and the default proportions as arguments. As such, each dataset’s batch consists
of 25 samples whereby the merged dataset has a total of 105 batches. Out of this, 70% or 73
batches belong to the training set, 15% or 15 belong to the validation set and remaining 15% or
17 belong to the testing set as shown in figure above.

5.7 Modelling
After performing Exploratory Data Analysis (EDA), Data Cleaning and Visualization, we have a
thorough understanding on the acoustic and transcript properties for our dataset. In this section,
we will be constructing a standalone GRU, a CNN-GRU and a CNN-LSTM model. Out of these
models, we will be experimenting with combinations of different hyperparameters, epochs,
optimization techniques and RNN units to determine the most optimum model under the
constraint of limited processing resources.
5.7.1 Model Preparation

Before constructing the models, a Connectionist Temporal Classification (CTC) loss function
that calculates the difference between the actual and predicted result of the audio signal’s vector
sequences will be defined. This function will then be fed to the optimizer to update weights of
the model during training phase by computing the probability of all output vector sequences and
choosing the one with best score (i.e, highest probability). (Harald Scheidl, 2018)
Figure 78: Code Snippet for CTC loss function
The CTC loss function above accepts the actual and predicted vector sequences as input. It first
computes the number of batches in the input sequence by converting the 1-D Tensor shape into
“int64” data type through “tf.cast” function. Then, it computes the length of the actual and
predicted vector sequence by multiplying the number of time steps in the sequence with a 2-D
Tensor of ones that have the same shape as the previous computed batch sequence. Finally, the
“ctc_batch_cost” function is called with the sequences and their corresponding length as input to
automatically compute the CTC loss between actual and predicted sequences.

5.7.2 CNN-GRU
5.7.2.1 Non-regularized CNN-GRU
Figure 79: Code Snippet for constructing a CNN-GRU model
The code above is constructing a Convolutional Neural Network (CNN) with Gated Recurrent
Units (GRU) model. It consists of 2 convolutional layers, 3 bidirectional GPU layers and 2 dense
layers. Since the spectrogram or input vector sequences obtained using Short-Time Fourier
Transformation (STFT) technique is symmetrical, only the first half of the values are required to
represent the frequency domain. As such, the input dimension is half the size of Fast Fourier
Transform (FFT). The output dimension corresponds to the number of characters in the pre-
defined vocabulary list.
To make the input layer compatible with the expected 4D shape for CNN, it is reshaped using
Keras’s “Reshape” layer. The 1st convolutional layer is defined with 32 filters, a kernel size of
11x41 and a stride of 2x2 and a Rectified Linear Unit (ReLU) activation function using Keras’s
“Conv2D” layer. To normalize the previous layers to have a mean of 0 and a standard deviation
of 1, a Batch Normalization layer is added. (Brownlee, 2019) The 2nd convolutional layer is then

defined with 32 filters, a kernel size of 11x21 and a stride of 1x2 and a ReLU activation function
with Batch Normalization layer applied afterwards.
Before proceeding to the bidirectional GRU layers, the previous layer must be flattened into a 3D
Tensor using the “Reshape” layer. It is then passed into a three-time loop to be fed into a GRU
layer which accepts 256 RNN units with the “return_sequences” and “reset_after” parameters set
to True. Upon setting “return_sequences” to True, the full sequence output in the form of batch
size, timesteps of input sequence and RNN units will be displayed instead of the original form of
batch size and RNN units only. Upon setting “reset_after” parameter to True, reset gate will be
applied after matrix multiplication instead of before which can better capture the features in
speech signals. (Team, 2014) This GRU layer is then wrapped inside a Bidirectional layer so that
it can incorporate timesteps of vector sequences from both past and future. A Dropout layer is
then added after each Bidirectional GRU layer except the last one to prevent overfitting by
randomly dropping half of the input units for each batch of training set. (Team, 2023)
The previous layer is then passed to a Dense layer which accepts twice the number of RNN units
for dimensionality inputs with a ReLU activation function. This computes the dot product
between the inputs and the kernel defined in CNN, resulting in a Tensor shape comprising of
batch size, timesteps and RNN units doubled. (Team, 2023) Resulting layer is then applied with
a Dropout layer to randomly drop half of the input units. Using results from the Dropout layer,
the output layer is obtained in the form of a Dense layer which accepts 1 more unit than the
output dimension and a softmax activation function. To account for inconsistent length in output
sequences with input sequences during computation of CTC Loss function, the blank label is
added to the original output dimension. (Graves et al., 2013)
The model is then compiled using CTC Loss function and “Adam” optimizer with a learning rate
of 0.001. The “Adam” optimizer is an extension of Stochastic Gradient Descent (SGD) algorithm
whereby it utilizes both first and second moment of gradient during training stages. It is
preferrable over other optimizers due to its adaptive learning rate in updating each network
weight individually, less memory requirement, faster computation time and less tuning efforts.
(Gupta, 2021) Finally, the model summary is displayed in the console using “summary()”
method as shown in the figure below.

Figure 80: Summary of CNN-GRU model
The summary of CNN-GRU model in figure above shows that the starting input layer has a 3D
shape with 193 representing the input dimension (INPUT_DIM) which is then reshaped into a
4D layer to be fed into the 2D Convolutional layer. After going through the 2 nd Batch
Normalization layer, the previous layer’s dimension is multiplied with each other to obtain the
value 1568. Finally, the output layer will produce 32 units which corresponds to 1 more unit than
the predefined output dimension (OUTPUT_DIM). Out of all the layers, the Bidirectional GRU
layer has the highest number of parameters. This is due to the fact that the layer processes input
sequences from previous layers in both forward and backward direction (i.e., accounting past and
future timesteps). Not only does the layer require a forward and backward weight, but it also
requires a recurrent forward and backward weight to store information from previous timesteps
of forward and backward sequence respectively. (Dobilas, 2022) It is also observed that there are

a total of 5.7 million parameters with 128 of them deemed non-trainable as they are not updated
when transmitted from one layer to the next, such as parameters involved in transferring mean
and standard deviation during the Batch Normalization layer.
Figure 81: Code Snippet to train CNN-GRU model
The code in figure above is used to train the CNN-GRU model on the training set for 50 epochs.
The batch size is set to the validation dataset’s size and the “callbacks” parameter will be used to
call the “Metrics” function that accepts the validation dataset and CNN-GRU model as input
after each epoch. Details on the callback function and analysis conducted based on the stored
model’s history information will be discussed further in section 6.2.

5.7.2.2 Regularized CNN-GRU
Figure 82: Code Snippet to construct regularized CNN-GRU model
The code above is constructing a regularized Convolutional Neural Network (CNN) with Gated
Recurrent Units (GRU) model. This model is similar to the above one except it utilizes 400 RNN
units instead of the original 256. It also implements extensive regularization techniques by
including the “kernel_regularizer” parameter in the Dense layers to apply penalty on the layer’s
kernel after performing 3 cycles of Bidirectional GRU operations. In the first Dense layer, both
L1 and L2 regularizations are applied using Keras’s “l1_l2” function with penalty rate “l1” and
“l2” set to 0.001. The L1 Regularizer computes the sum of the absolute values of weight whereas
L2 Regularizers computes the sum of squared values of weight. As for the second Dense layer,
L1 Regularizer is applied with a penalty rate of 0.001.

Figure 83: Summary of regularized CNN-GRU model
The summary of regularized CNN-GRU model in figure above shows that the layers involved
are the same as the CNN-GRU model, each having the same dimensionality of shapes. The only
thing that differs is that the total number of parameters involved in constructing the model has
increased to 11.4 million. This is due to an increase in the number of RNN units passed to the
GRU and Dense layers from 256 to 400 and from 512 to 800 respectively. This also explains the
need to incorporate more regularization and optimization techniques to prevent the model from
overfitting as the model becomes more complex and possesses greater capacity in capturing
audio features.

Figure 84: Code Snippet to train regularized CNN-GRU model
The code in figure above is used to train the regularized CNN-GRU model on the training set for
60 epochs. The batch size is set to the validation dataset’s size and the “callbacks” parameter will
be used to call the “Metrics” function that accepts the validation dataset and regularized CNN-
GRU model as input after each epoch. Details on the callback function and analysis conducted
based on the stored model’s history information will be discussed further in section 6.2.

5.7.3 GRU
Figure 85: Code Snippet to construct regularized GRU model
The code above is constructing a Gated Recurrent Units (GRU) model. It consists of 3
bidirectional GPU layers and 2 dense layers. Since the spectrogram or input vector sequences
obtained using Short-Time Fourier Transformation (STFT) technique is symmetrical, only the
first half of the values are required to represent the frequency domain. As such, the input
dimension is half the size of Fast Fourier Transform (FFT). The output dimension corresponds to
the number of characters in the pre-defined vocabulary list.
Before proceeding to the bidirectional GRU layers, the input layer must be reshaped into a 3D
Tensor using the “Reshape” layer. It is then passed into a three-time loop to be fed into a GRU
layer which accepts 400 RNN units with the “return_sequences” and “reset_after” parameters set
to True. Upon setting “return_sequences” to True, the full sequence output in the form of batch
size, timesteps of input sequence and RNN units will be displayed instead of the original form of
batch size and RNN units only. Upon setting “reset_after” parameter to True, reset gate will be
applied after matrix multiplication instead of before which can better capture the features in
speech signals. (Team, 2014) This GRU layer is then wrapped inside a Bidirectional layer so that
it can incorporate timesteps of vector sequences from both past and future. A Dropout layer is

then added after each Bidirectional GRU layer except the last one to prevent overfitting by
randomly dropping half of the input units for each batch of training set. (Team, 2023)
for dimensionality inputs with a ReLU activation function and a Kernel Regularizer using
“l1_l2” function. The regularizer has a penalty rate for “l1” and “l2” respectively. This layer
computes the dot product between the inputs and the kernel defined in CNN, resulting in a
Tensor shape comprising of batch size, timesteps and RNN units doubled. (Team, 2023)
Resulting layer is then applied with a Dropout layer to randomly drop half of the input units. As
a result from the Dropout layer, the output layer is obtained in the form of a Dense layer which
accepts 1 more unit than the output dimension, a softmax activation function and a Kernel
Regularizer using “l1” function with a penalty rate of 0.001. To account for inconsistent length
in output sequences with input sequences during computation of CTC Loss function, the blank
label is added to the original output dimension. (Graves et al., 2013)

Figure 86: Summary of regularized GRU model
The summary of regularized GRU model in figure above shows that the starting input layer has a
3D shape with 193 representing the input dimension (INPUT_DIM). Since there is no 2D
Convolutional layer involved, the dimensional shape is fixed at 3D throughout the model training
phase. Finally, the output layer will produce 32 units which corresponds to 1 more unit than the
predefined output dimension (OUTPUT_DIM). Out of all the layers, the Bidirectional GRU
layer has the highest number of parameters. This is due to the fact that the layer processes input
sequences from previous layers in both forward and backward direction (i.e., accounting past and
future timesteps). Not only does the layer require a forward and backward weight, but it also
requires a recurrent forward and backward weight to store information from previous timesteps
of forward and backward sequence respectively. (Dobilas, 2022) It is also observed that there are
a total of 7.8 million parameters with none of them deemed non-trainable because there are no
Batch Normalization layer’s parameters transmitting the mean and standard deviation values
from one layer to the other. This also explains the reduction of parameters from 11.4 million
when compared with the regularized CNN-GRU model.

Figure 87: Code Snippet for Training regularized GRU model
The code in figure above is used to train the regularized GRU model on the training set for 60
epochs. The batch size is set to the validation dataset’s size and the “callbacks” parameter will be
used to call the “Metrics” function that accepts the validation dataset and regularized GRU
model as input after each epoch. Details on the callback function and analysis conducted based
on the stored model’s history information will be discussed further in section 6.2.

5.7.4 CNN-LSTM
Figure 88: Code Snippet to construct regularized CNN-LSTM model
The code above is constructing a Convolutional Neural Network (CNN) with Long Short Term
Memory (LSTM) model. It consists of 2 convolutional layers, 3 bidirectional LSTM layers and 2
dense layers. Since the spectrogram or input vector sequences obtained using Short-Time Fourier
Transformation (STFT) technique is symmetrical, only the first half of the values are required to
represent the frequency domain. As such, the input dimension is half the size of Fast Fourier
Transform (FFT). The output dimension corresponds to the number of characters in the pre-
defined vocabulary list.
To make the input layer compatible with the expected 4D shape for CNN, it is reshaped using
Keras’s “Reshape” layer. The 1st convolutional layer is defined with 32 filters, a kernel size of
11x41 and a stride of 2x2 and a Rectified Linear Unit (ReLU) activation function using Keras’s
“Conv2D” layer. To normalize the previous layers to have a mean of 0 and a standard deviation
of 1, a Batch Normalization layer is added. (Brownlee, 2019) The 2nd convolutional layer is then
defined with 32 filters, a kernel size of 11x21 and a stride of 1x2 and a ReLU activation function
with Batch Normalization layer applied afterwards.

Before proceeding to the bidirectional LSTM layers, the previous layer must be flattened into a
3D Tensor using the “Reshape” layer. It is then passed into a three-time loop to be fed into a
LSTM layer which accepts 400 RNN units with the “return_sequences” parameter set to True.
Unlike a GRU, LSTM layer does not have a “reset_after” parameter. This is because GRU has a
reset gate that can ignore past states based on the parameter entered whereas LSTM has a forget
gate that will completely discard specific information in the past. (Vijaysinh Lendave, 2021)
Upon setting “return_sequences” to True, the full sequence output in the form of batch size,
timesteps of input sequence and RNN units will be displayed instead of the original form of
batch size and RNN units only. (Team, 2014) This LSTM layer is then wrapped inside a
Bidirectional layer so that it can incorporate timesteps of vector sequences from both past and
future. A Dropout layer is then added after each Bidirectional LSTM layer except the last one to
prevent overfitting by randomly dropping half of the input units for each batch of training set.
(Team, 2023)
for dimensionality inputs with a ReLU activation function and a Kernel Regularizer using
“l1_l2” function. The regularizer has a penalty rate for “l1” and “l2” respectively. This layer
computes the dot product between the inputs and the kernel defined in CNN, resulting in a
Tensor shape comprising of batch size, timesteps and RNN units doubled. (Team, 2023)
Resulting layer is then applied with a Dropout layer to randomly drop half of the input units. As
a result from the Dropout layer, the output layer is obtained in the form of a Dense layer which
accepts 1 more unit than the output dimension, a “softmax” activation function and a Kernel
Regularizer using “l1” function with a penalty rate of 0.001. To account for inconsistent length
in output sequences with input sequences during computation of CTC Loss function, the blank
label is added to the original output dimension. (Graves et al., 2013)

Figure 89: Summary of regularized CNN-LSTM model
The summary of regularized CNN-LSTM model in figure above shows that the starting input
layer has a 3D shape with 193 representing the input dimension (INPUT_DIM) which is then
reshaped into a 4D layer to be fed into the 2D Convolutional layer. After going through the 2 nd
Batch Normalization layer, the previous layer’s dimension is multiplied with each other to obtain
the value 1568. Finally, the output layer will produce 32 units which corresponds to 1 more unit
than the predefined output dimension (OUTPUT_DIM). Out of all the layers, the Bidirectional
LSTM layer has the highest number of parameters. This is due to the fact that the layer processes
input sequences from previous layers in both forward and backward direction (i.e., accounting

past and future timesteps). Not only does the layer require a forward and backward weight, but it
also requires a recurrent forward and backward weight to store information from previous
timesteps of forward and backward sequence respectively. (Dobilas, 2022)
It is also observed that there are a total of 14.9 million parameters with 128 of them deemed non-
trainable as they are not updated when transmitted from one layer to the next, such as parameters
involved in transferring mean and standard deviation during the Batch Normalization layer.
Further observation revealed that the number of parameters has risen when compared with the
11.4 million count of regularized CNN-GRU model. This is due to the fact that LSTM layer has
3 gates namely an input gate which controls information stored in long term memory, a forget
gate that controls what information to discard and an output gate that controls what information
to pass to the next time step sequence. Whilst a GPU layer only has an update gate that updates
information on the next state gate that controls what information to ignore from the past.
(Vijaysinh Lendave, 2021)
Figure 90: Code Snippet for training regularized CNN-LSTM model
The code in the figure above is used to train the regularized CNN-LSTM model on the training
set for 60 epochs. The batch size is set to the validation dataset’s size and the “callbacks”
parameter will be used to call the “Metrics” function that accepts the validation dataset and
regularized CNN-LSTM model as input after each epoch. Details on the callback function and
analysis conducted based on the stored model’s history information will be discussed further in
section 6.2.

5.8 Summary
This chapter first introduces the concept of data analysis and its relevance to the current Web-
based ASR data science project. Then, the metadata of the chosen dataset is tabulated in which
the details regarding properties of audio file and transcript attributes are explained clearly. To
have a thorough understanding on the dataset and verify if the metadata given is accurate, Initial
Data Exploration or Exploratory Data Analysis (EDA) is carried out by obtaining information
about frequency and signal waves of audio file as well as the total token count, unique token
count and token distribution for corresponding and overall transcript. Utilizing such information,
data cleaning or pre-processing is performed to remove empty rows, erroneous data, down-
sample the data to enhance model training capabilities and transforming the data into a
Tensorflow Dataset object to be fed into each deep learning model. Data visualization on both
the audio file and transcript are displayed in the form of bar graphs, heatmaps, histograms etc. .
Before constructing our models, data partitioning is initiated to divide our dataset into training,
validation and testing set. Once the dataset is transformed and partitioned, we can construct a
CNN-GRU, GRU and CNN-LSTM model with experimentation of regularizers, optimizers,
epochs, RNN units and other hyperparameters.

CHAPTER 6: RESULTS AND DISCUSSIONS

6.1 Introduction
This section will explain a step-by-step procedure on how to prepare the metrics to evaluate the
validation and testing results of partitioned datasets when applied on the corresponding models.
Some of these metrics include training loss, validation loss, Word Error Rate (WER) and
Character Error Rate (CER). After finalizing the coding aspect, the results obtained are evaluated
and interpreted thoroughly by accounting multiple factors through comparison between different
metrics. Such interpretations will then be finalized to determine the most optimum model in
predicting speech signals from audio files, with all metric’s results taken into consideration.

6.2 Results and Discussions

6.2.1 Metrics Preparation
Figure 91: Code Snippet for CTC Decoding
To convert the Tensor 3D shape in the form a tuple back to character sequences, the CTC
Decoding function is defined. Each value in the tuple represents batch size, timesteps and
number of classes respectively. In order to obtain the length of input sequence, the batch size
must be multiplied with the time step along each input sequence. Tensorflow Keras’s
“ctc_decode” function which accepts the predicted matrix, length of input sequence and a flag on
whether to use Greedy or Beam search as parameters will be used to perform decoding. The
“decoder” function defined in the previous section will then be applied on the individual strings
in the list to map the integer class labels to corresponding character in the vocabulary. Then,
“tf.strings.reduce_join” is used to join the characters into one chunk, then convert to Numpy
array using “numpy()” method and decode into “UTF-8” string.

Figure 92: Code Snippet to generate metrics for Validation and Testing
Upon reaching the end of each epoch, the “on_epoch_end” function will be called which will
generate the predicted output by passing the spectrogram into the chosen model, then apply CTC
Decoding technique to return a sequence of characters. The actual output is obtained by passing
the target label into the predefined decoder, joining character sequences into chunks, converting
to Numpy array and decoding into “UTF-8” string. The Word Error Rate (WER) of the audio is
computed using “wer” imported function whereas the Character Error Rate (CER) is computed
through the “editdistance” package “eval” function, both of which are stored in a list. 5
randomized predicted and actual results will then be displayed upon each epoch ends.

6.2.2 Training and Validation Loss

In this section, we will be illustrating line graphs for training and validation loss against the
number of epochs and interpreting findings based on it for all the models constructed in section
5, including the unregularized CNN-GRU model.
Figure 93: Code Snippet to visualize Train and Validation Loss for CNN-GRU model
The values of “loss” and “val_loss” are obtained from the CNN-GRU training dictionary as
shown in figure above. A range of 50, which is equal to the number of epoch executed to train
the CNN-GRU model is created and assigned as the X-axis while the Y-axis represents the
training loss in blue color and validation loss in red color. The line plot is displayed in the figure
below.
Figure 94: Relationship between Train and Validation Loss against Epoch for CNN-GRU model

The line graph above shows that the validation loss is greater than the training loss throughout
the training phase consisting of 50 epochs. It is observed that the training loss stops decreasing
upon reaching the 60-70 mark around the 45th epoch, then it increases slightly to the 150 mark
before dropping below the 100 mark. As for the validation loss, the period at which it stops
decreasing is relatively earlier at around the 35th epoch. It then rises marginally beyond the 150
mark, followed by a steady fall before surging drastically beyond the 300-value mark. This
implies that the surge rate of validation loss between the 45 th to 50th epoch is more intense
compared to the training loss. The final training and validation loss are estimated to be around
125 and 75 respectively. All these signify that the CNN-GRU model is overfitting with the
training dataset as it is memorizing the training data points rather than learning how to generalize
patterns across training and validation sets., causing them to perform much better with
specialized training set rather than unfamiliar validation set.
Figure 95: Code Snippet to visualize Train and Validation Loss for regularized CNN-GRU model
The values of “loss” and “val_loss” are obtained from the regularized CNN-GRU training
dictionary as shown in figure above. A range of 60, which is equal to the number of epoch
executed to train the regularized CNN-GRU model is created and assigned as the X-axis while
the Y-axis represents the training loss in blue color and validation loss in red color. The line plot
is displayed in the figure below.

Figure 96: Relationship between Train and Validation Loss against Epoch for regularized CNN-GRU model
For the line graph above, we will be executing 10 more epochs up to 60 because in the non-
regularized CNN-GRU model, we are unable to observe whether the training and validation loss
will drop to a new lowing point or remain constant beyond the 50-epoch range. Based on the
graph, we can observe that at the initial stage around 5 epochs, the validation loss is slightly
greater than the validation loss. After that point, both the training and validation loss decrease
gradually below the 50 mark and have reached a relatively constant state after the 50 th epoch.
Although there is minor fluctuation for the validation loss between 20th to 50th epoch, it still
remains lower than the training loss. Additionally, the gap between training and validation loss is
minimizing as the epoch is executed. This implies that the model is performing well on both the
training and validation sets. It is not overfitting to the training dataset and is generalizing patterns
obtained from unfamiliar validation data well. It is also not underfitting on the training dataset as
it is able to capture the underlying audio features, supported by the fact that both training and
validation sets achieve lower loss value compared to the unregularized CNN-GRU model. Such
an improvement is likely associated with the increase in RNN units which increases the model
capacity in learning audio features as well as the insertion of regularizes that can prevent
overfitting.

Figure 97: Code Snippet to visualize Train and Validation Loss for regularized GRU model
The values of “loss” and “val_loss” are obtained from the GRU training dictionary as shown in
figure above. A range of 60, which is equal to the number of epoch executed to train the GRU
model is created and assigned as the X-axis while the Y-axis represents the training loss in blue
color and validation loss in red color. The line plot is displayed in the figure below.
Figure 98: Relationship between Train and Validation Loss against Epoch for regularized GRU model
The line graph above shows that there has been an inconsistent fluctuation for both training and
validation loss. At the start of execution, the training loss is significantly higher than the
validation loss, after which it the training loss dropped significantly until it is lower than the
validation loss around the 15th epoch. However, a steep surge beyond the 300 and 250 mark for
validation and training loss respectively can be observed between the 15th to 20th epoch. Beyond
this point, the validation loss has been remaining around the 100 mark whereas the training loss
continues to decrease towards the 50 mark. This implies that the GRU model is overfitting with

the training set as it is constantly memorizing the training data rather than learning underlying
patterns to be applied on the validation set. This is likely associated with a lack of complexity of
the GRU model in extracting features from acoustic data, insufficient well-tuned
hyperparameters that can optimize model performance and lack of Batch Normalization layer to
prevent overfitting.
Figure 99: Code Snippet to visualize Train and Validation Loss for regularized CNN-LSTM model
The values of “loss” and “val_loss” are obtained from the CNN-LSTM training dictionary as
shown in figure above. A range of 60, which is equal to the number of epoch executed to train
the CNN-LSTM model is created and assigned as the X-axis while the Y-axis represents the
training loss in blue color and validation loss in red color. The line plot is displayed in the figure
below.
Figure 100: Relationship between Train and Validation Loss against Epoch for regularized CNN-LSTM model

The figure above shows that at the 1st epoch, the training loss is slightly higher than the
validation loss. However, the training loss has been decreasing substantially since then until it
reaches a constant pace below the 50 loss-mark. On the other hand, the validation loss’s
reduction rate stopped before the 20th epoch, causing it to fluctuate unstably just beyond the 100
loss-mark. This implies that the CNN-LSTM model is overfitting with the training set as it is
constantly memorizing the training data rather than learning underlying patterns to be applied on
the validation set. One explanation to such occurrence being the internal working of LSTM
which consists of 1 extra gate than GRU, thus requiring more RNN units to transfer, discard and
store information between layers. Moreover, our training dataset is not big since it only focuses
on common spoken words that have been filtered accordingly due to limited memory constraints.

6.2.3 WER of Validation Set

In this section, we will be illustrating line graphs for Word Error Rate (WER) against the number
of epochs on validation set and interpreting findings based on it for all the models constructed in
section 5, including the unregularized CNN-GRU model.
Figure 101: Code Snippet to Generate Line Graph for WER Over Epoch for CNN-GRU model
The code in the figure above plots a line graph with the WER on the Y-axis and number of
epochs on the X-axis using a blue line for CNN-GRU model. A horizontal line to display the
minimum value of WER is drawn using a red deashed line.
Figure 102: Line Graph for WER Over Epoch for CNN-GRU model

The line graph above illustrates that the pattern of WER on validation set for CNN-GRU model
is similar to the corresponding validation line graph that illustrates loss over epochs as shown in
figure 94. This is because upon reaching the new low point just below 0.6 before the 35 th epoch,
WER rises marginally beyond the 0.7 mark, followed by a steady fall to a new low point before
drastically going beyond the 0.9 mark. The fluctuation in WER implies that the validation set is
not generalized well for the CNN-GRU model.
epochs on the X-axis using a blue line for regularized CNN-GRU model. A horizontal line to
display the minimum value of WER is drawn using a red deashed line.
Figure 104: Line Graph for WER Over Epoch for regularized CNN-GRU model

The line graph above illustrates that the pattern of WER on validation set for regularized CNN-
GRU model is similar to the corresponding validation line graph that illustrates loss over epochs
as shown in figure 96. This is because despite experiencing strong fluctuation between the 20 th to
60th epoch, the WER remains at the low point around 0.4 without going through drastic rises
unlike the line graph for unregularized CNN-GRU. In addition to that, the new WER low point
for regularized CNN-GRU is around 0.4 whereas it is 0.1 to 0.2 or 10% to 20% higher for
unregularized CNN-GRU, implying that regularized CNN-GRU is able to generalize patterns for
unfamiliar validation set much better than unregularized CNN-GRU. Based on such observation,
we will be applying regularization technique throughout the following models.
Figure 105: Code Snippet to Generate Line Graph for WER Over Epoch for regularized GRU model
epochs on the X-axis using a blue line for regularized GRU model. A horizontal line to display
the minimum value of WER is drawn using a red deashed line.
Figure 106: Line Graph for WER Over Epoch for regularized GRU model

The line graph above illustrates that the pattern of WER on validation set for regularized GRU
model is similar to the corresponding validation line graph that illustrates loss over epochs as
shown in figure 98. This is because upon reaching the first new low point just below 0.6 around
the 15th epoch, a steep surge beyond the 0.9 or 90% mark can be observed between the 15th to
20th epoch. Beyond this point, the WER continues to decrease below the 0.6 mark around 30 th
epoch until it reaches a new low point just below 0.5 mark around 50 th epoch. Despite only
having 1 intense surge and decline in WER between 15 th to 20th mark compared to the double
surge and decline in WER of unregularized CNN-GRU model, the regularized GRU model is
still not generalizing well with the validation set. This can also be supported by the fact that the
new WER low point of 0.5 is still 0.1 or 10% higher than that of the WER for regularized CNN-
GRU model.

epochs on the X-axis using a blue line for regularized CNN-LSTM model. A horizontal line to
display the minimum value of WER is drawn using a red deashed line.
Figure 108: Line Graph for WER Over Epoch for regularized CNN-LSTM model
The line graph above illustrates that the pattern of WER on validation set for regularized CNN-
LSTM model is slightly different compared to the corresponding validation line graph that
illustrates loss over epochs as shown in figure 100. This is because the validation loss line in
figure 100 stops decreasing before the 20th epoch whereas the WER line only stops decreasing
after the 50th epoch. Another major difference is the minimum point of validation loss line is
around 100 which is marginally higher than its corresponding WER’s minimum point at 0.3 or
30%. This implies that the learning capability of regularized CNN-LSTM model on validation
set is better than the regularized CNN-GRU model which only has a minimum WER of 0.4 or
40%. Despite able to predict validation data more accurately, the subtle difference between
validation loss and WER imply that the model might be memorizing patterns in the training data
and fitting them into validation set instead of generalizing the relevant acoustic features from the
validation set. Several factors that might lead to such an observation include the model’s
architecture whereby CNN-LSTM cannot capture temporal dependencies precisely than CNN-
GRU and over-regularization whereby the model’s capacity to generalize underlying patterns of
validation set are constrained due to excessive rigidness and inflexibility of the model.

6.2.4 CER of Validation Set

In this section, we will be illustrating line graphs for Character Error Rate (CER) against the
number of epochs on validation set and interpreting findings based on it for all the models
constructed in section 5 excluding the unregularized CNN-GRU model. This is because the
validation loss and WER findings from section 6.2.2 and 6.2.3 suggest that regularized CNN-
GRU is a more optimal model than the unregularized one.
Figure 109: Code Snippet to Generate Line Graph for CER Over Epoch for regularized CNN-GRU model
The code in the figure above plots a line graph with the CER on the Y-axis and number of
epochs on the X-axis using a blue line for regularized CNN-GRU model. A horizontal line to
display the minimum value of CER is drawn using a red deashed line.
Figure 110: Line Graph for CER Over Epoch for regularized CNN-GRU model

The line graph above illustrates that the pattern of CER on validation set for regularized CNN-
GRU model is similar to the corresponding validation line graph that illustrates loss over epochs
and WER line graph as shown in figure 96 and figure 104 respectively. This is because despite
experiencing minor fluctuation between the 10th to 30th epoch, the CER remains at the low point
below 0.2 until it reaches the final epoch. In addition to that, the new CER low point for
regularized CNN-GRU is around 0.1 or 10%, implying that regularized CNN-GRU is able to
generalize patterns for unfamiliar validation set.
Figure 111: Code Snippet to Generate Line Graph for CER Over Epoch for regularized GRU model
epochs on the X-axis using a blue line for regularized GRU model. A horizontal line to display
the minimum value of CER is drawn using a red deashed line.
Figure 112: Line Graph for CER Over Epoch for regularized GRU model
The line graph above illustrates that the pattern of CER of validation set for regularized GRU
model is similar to the corresponding validation line graph that illustrates loss over epochs as
shown in figure 98. This is because upon reaching the first new low point at 0.2 around the 15 th

epoch, a steep surge beyond the 0.9 or 90% mark can be observed between the 15th to 20th
epoch. Beyond this point, the CER continues to decrease until it reaches a new minimum point
just below 0.2 mark around 50th epoch. Despite the minimum point is close to the low point for
CER line graph in regularized CNN-GRU as shown in figure 110, the regularized GRU model is
still not generalizing well with the validation set because there is 1 intense surge and decline in
CER between 15th to 20th epoch whereas such pattern is not observed in the CER line graph in
figure 110.
epochs on the X-axis using a blue line for regularized CNN-LSTM model. A horizontal line to
display the minimum value of CER is drawn using a red deashed line.
Figure 114: Line Graph for CER Over Epoch for regularized CNN-LSTM model

The line graph above illustrates that the pattern of CER on validation set for regularized CNN-
LSTM model is slightly different compared to the corresponding validation line graph that
illustrates loss over epochs as shown in figure 100. This is because the validation loss line in
figure 100 stops decreasing before the 20 th epoch whereas the CER line only stops decreasing
after the 30th epoch. Another major difference is the minimum point of validation loss line is
around 100 which is marginally higher than its corresponding CER’s minimum point at 0.1 or
10%. From the visualization in figure 110, we can see that both graphs’ minimum point for CER
seems to be around the 0.1 or 10% range. Therefore, we will be displaying their respective
minimum points as shown in the figure below.
Figure 115: Code Snippet to find minimum point for CER on Validation Set of regularized CNN-GRU and CNN-LSTM model
The figure above implies that the learning capability of regularized CNN-LSTM model on
validation set with a minimum CER of 0.08 or 8% is better than the regularized CNN-GRU
model which only has a minimum CER of 0.11 or 11%. Despite able to predict validation data
more accurately, the subtle difference between validation loss and CER imply that the model
might be memorizing patterns in the training data and fitting them into validation set instead of
generalizing the relevant acoustic features from the validation set. Several factors that might lead
to such an observation include the model’s architecture whereby CNN-LSTM cannot capture
temporal dependencies precisely than CNN-GRU and over-regularization whereby the model’s
capacity to generalize underlying patterns of validation set are constrained due to excessive
rigidness and inflexibility of the model.
6.2.5 Code For Evaluation on Testing Set

In this section, we will be writing code snippets for evaluation metrics on testing set for all the
models constructed in section 5, including the unregularized CNN-GRU model.

Figure 116: Code Snippet to Generate Metrics for Unregularized CNN-GRU model’s Evaluation on Testing Set
The code in the figure above is used to evaluate the testing set for unregularized CNN-GRU
model based on WER metrics only. This is because findings from section 6.2.3 and 6.2.4 suggest
that line graphs for WER and CER follow similar patterns and we will only be critically
analysing findings on the more optimal regularized CNN-GRU model than the unregularized
one. First, the code generates the predicted output by passing the spectrogram into the chosen
model, then apply CTC Decoding technique to return a sequence of characters. The actual output
is obtained by passing the target label into the predefined decoder, joining character sequences
into chunks, converting to Numpy array and decoding into “UTF-8” string. The Word Error Rate
(WER) of the audio is computed using “wer” imported function and stored in a list. 3
randomized predicted and actual results for the testing set will then be displayed.
Figure 117: Code Snippet to Generate Metrics for Regularized CNN-GRU model’s Evaluation on Testing Set

The code in the figure above is used to evaluate the testing set for regularized CNN-GRU model
based on WER and CER metrics. First, the code generates the predicted output by passing the
spectrogram into the chosen model, then apply CTC Decoding technique to return a sequence of
characters. The actual output is obtained by passing the target label into the predefined decoder,
joining character sequences into chunks, converting to Numpy array and decoding into “UTF-8”
string. The Word Error Rate (WER) of the audio is computed using “wer” imported function
whereas the Character Error Rate (CER) is computed through the “editdistance” package “eval”
function, both of which are stored in a list. 3 randomized predicted and actual results for the
testing set will then be displayed.
Figure 118: Code Snippet to Generate Metrics for Regularized GRU model’s Evaluation on Testing Set
The code in the figure above is used to evaluate the testing set for regularized GRU model based
on WER and CER metrics. First, the code generates the predicted output by passing the
spectrogram into the chosen model, then apply CTC Decoding technique to return a sequence of
characters. The actual output is obtained by passing the target label into the predefined decoder,
joining character sequences into chunks, converting to Numpy array and decoding into “UTF-8”
string. The Word Error Rate (WER) of the audio is computed using “wer” imported function
whereas the Character Error Rate (CER) is computed through the “editdistance” package “eval”
function, both of which are stored in a list. 3 randomized predicted and actual results for the
testing set will then be displayed.

Figure 119: Code Snippet to Generate Metrics for Regularized CNN-LSTM model’s Evaluation on Testing Set
The code in the figure above is used to evaluate the testing set for regularized CNN-LSTM
model based on WER and CER metrics. First, the code generates the predicted output by passing
the spectrogram into the chosen model, then apply CTC Decoding technique to return a sequence
of characters. The actual output is obtained by passing the target label into the predefined
decoder, joining character sequences into chunks, converting to Numpy array and decoding into
“UTF-8” string. The Word Error Rate (WER) of the audio is computed using “wer” imported
function whereas the Character Error Rate (CER) is computed through the “editdistance”
package “eval” function, both of which are stored in a list. 3 randomized predicted and actual
results for the testing set will then be displayed.

6.2.6 WER of Testing Set

In this section, we will be illustrating line graphs for Word Error Rate (WER) against the number
of epochs on testing set and interpreting findings based on it for all the models constructed in
section 5, including the unregularized CNN-GRU model.
Figure 120: Code Snippet to Generate Line Graph for WER Over Epoch for Unregularized CNN-GRU model on Testing Set
epochs on the X-axis using a blue line for unregularized CNN-GRU model on testing set. A
horizontal line to display the minimum value of WER is drawn using a red deashed line.
Figure 121: Line Graph for WER Over Epoch for Unregularized CNN-GRU model on Testing Set
The line graph above shows that the WER on unregularized CNN-GRU model for the 16 batches
in the testing set, each consisting of 25 audio files, is ranging between 0.55 to 0.58 or 55% to
58%. This is aligned with the findings derived from the validation set in figure 102 whereby the
minimum point of WER is just below 0.6 or 60% The higher minimum point of WER implies the
unregularized CNN-GRU model is only capable of memorizing patterns from training set to be

applied on unseen testing set. Moreover, there has been slight fluctuation at the 8th and 15th
epoch, indicating that the model performance can be improved further with tuning on
hyperparameters and incorporation of different regularization or optimization techniques.
Figure 122: Code Snippet to Generate Line Graph for WER Over Epoch for regularized CNN-GRU model on Testing Set
epochs on the X-axis using a blue line for regularized CNN-GRU model on testing set. A
Figure 123: Line Graph for WER Over Epoch for regularized CNN-GRU model on Testing Set
The line graph above shows that the WER on regularized CNN-GRU model for the 16 batches in
the testing set, each consisting of 25 audio files, is ranging between 0.37 to 0.4 or 37% to 40%.
This is aligned with the findings derived from the validation set in figure 104 whereby the
minimum point of WER is around 0.4 or 40%. When compared with the findings from WER’s
testing result for unregularized CNN-GRU model in figure 121 above, the regularized CNN-
GRU model is more capable in learning generalizable patterns from training set that can be

applied on unseen testing set. Despite so, there has been slight fluctuation at the 7 th and 15th
Figure 124: Code Snippet to Generate Line Graph for WER Over Epoch for regularized GRU model on Testing Set
epochs on the X-axis using a blue line for regularized GRU model on testing set. A horizontal
line to display the minimum value of WER is drawn using a red deashed line.
Figure 125: Line Graph for WER Over Epoch for regularized GRU model on Testing Set
The line graph above shows that the WER on regularized GRU model for the 16 batches in the
testing set, each consisting of 25 audio files, is ranging between 0.43 to 0.46 or 43% to 46%.
minimum point of WER is just below 0.5 or 50%. In-spite the WER range is slightly higher
compared to the regularized CNN-GRU model in figure 123, the regularized GRU model can
still relatively memorize patterns from training set that can be applied on unseen testing set.
Despite so, there has been slight fluctuation at the 7 th and 15th epoch, indicating that the model

performance can be improved further with tuning on hyperparameters and incorporation of

different regularization or optimization techniques.
Figure 126: Code Snippet to Generate Line Graph for WER Over Epoch for regularized CNN-LSTM model on Testing Set
epochs on the X-axis using a blue line for regularized CNN-LSTM model on testing set. A
Figure 127: Line Graph for WER Over Epoch for regularized CNN-LSTM model on Testing Set
The line graph above shows that the WER on regularized CNN-LSTM model for the 16 batches
in the testing set, each consisting of 25 audio files, is ranging between 0.27 to 0.3 or 27% to
30%. This is aligned with the findings derived from the validation set in figure 108 whereby the
minimum point of WER is at 0.3 or 30%. The WER range is lower compared to the regularized
CNN-GRU and regularized GRU model in figure 123 and figure 125 respectively, implying that
the regularized GRU model has learned generalizable patterns from training set that can be
applied on unseen testing set. Despite so, there has been slight fluctuation at the 7 th and 15th

6.2.7 CER of Testing Set

In this section, we will be illustrating line graphs for Character Error Rate (CER) against the
number of epochs on testing set and interpreting findings based on it for all the models
constructed in section 5 excluding the unregularized CNN-GRU model. This is because the
validation loss and WER findings in section 6.2.2 and 6.2.3 suggest that regularized CNN-GRU
is a more optimal model than the unregularized one.
Figure 128: Code Snippet to Generate Line Graph for CER Over Epoch for regularized CNN-GRU model on Testing Set
epochs on the X-axis using a blue line for regularized CNN-GRU model on testing set. A
Figure 129: Line Graph for CER Over Epoch for regularized CNN-GRU model on Testing Set
The line graph above shows that the CER on regularized CNN-GRU model for the 16 batches in
the testing set, each consisting of 25 audio files, is ranging between 0.1 to 0.12 or 10% to 12%.

minimum point of CER is around 0.1 or 10%. This implies that the regularized CNN-GRU
model has learned generalizable patterns from training set that can be applied on unseen testing
set. Despite so, there has been slight fluctuation at the 7 th and 15th epoch, indicating that the
model performance can be improved further with tuning on hyperparameters and incorporation
of different regularization or optimization techniques.
Figure 130: Code Snippet to Generate Line Graph for CER Over Epoch for regularized GRU model on Testing Set
epochs on the X-axis using a blue line for regularized GRU model on testing set. A horizontal
line to display the minimum value of WER is drawn using a red deashed line.
Figure 131: Line Graph for CER Over Epoch for regularized GRU model on Testing Set
The line graph above shows that the CER on regularized GRU model for the 16 batches in the
testing set, each consisting of 25 audio files, is ranging between 0.13 to 0.15 or 13% to 15%.

minimum point of CER is just below 0.2 or 20%. In-spite the CER range is slightly higher
compared to the regularized CNN-GRU model in figure 129, the regularized GRU model can
still relatively memorize patterns from training set that can be applied on unseen testing set.
Despite so, there has been slight fluctuation at the 7 th and 15th epoch, indicating that the model
performance can be improved further with tuning on hyperparameters and incorporation of
different regularization or optimization techniques.
Figure 132: Code Snippet to Generate Line Graph for CER Over Epoch for regularized CNN-LSTM model on Testing Set
epochs on the X-axis using a blue line for regularized CNN-LSTM model on testing set. A
Figure 133: Line Graph for CER Over Epoch for regularized CNN-LSTM model on Testing Set
The line graph above shows that the CER on regularized CNN-LSTM model for the 16 batches
in the testing set, each consisting of 25 audio files, is ranging between 0.07 to 0.08 or 7% to 8%.
This is aligned with the findings derived from the validation set in figure 114 and statistics in
figure 115 whereby the minimum point of CER is around 0.08 or 8%. The CER range is lower
compared to the regularized CNN-GRU and regularized GRU model in figure 129 and figure

131 respectively, implying that the regularized CNN-LSTM model has learned generalizable
patterns from training set that can be applied on unseen testing set. Despite so, there has been
slight fluctuation at the 7th and 15th epoch, indicating that the model performance can be
improved further with tuning on hyperparameters and incorporation of different regularization or
optimization techniques.

6.3 Summary
Based on the evaluation on Training and Validation Loss, WER and CER for validation data as well as WER and CER for testing set,
the following table is summarized.
Evaluation Metrics Unregularized CNN- Regularized CNN- Regularized GRU Regularized

GRU model GRU model model CNN-LSTM
model
Training Number of Intense 2 0 1 0

and Fluctuations
Validation
Higher Loss Validation Training Validation Validation
Loss
Minimum Training and Around 125 and 75 Both below 50 100 and slightly Slightly higher
Validation Loss higher than 50 than 100 and
below 50
Validation WER 55% to 58% Around 40% Around 50% 30%
Validation CER - 11% Around 20% 8%
Testing WER 55% to 58% 37% to 40% 43% to 46% 27% to 30%
Testing CER - 10% to 12%. 13% to 15% 7% to 8%
Figure 134: Summary of Evaluation Metrics for 4 Models

Based on the evaluation on 5 metrics above for 4 models, we can deduce that regularized CNN-
GRU model performs the best in terms of training and validating the dataset as it is the only one
with a training loss greater than validation loss. It also has the lowest training and validation loss
with the slightest gap between training and validation loss out of the 4 models as both the values
are below 50. This implies that the regularized CNN-GRU model is not just capable of
memorizing the training data, but also able to generalize the features’ patterns learned from the
training set and applied to the validation set.
In terms of validation WER and CER, we can see that regularized CNN-LSTM perform the best
with the lowest WER of 30% and CER of 8%. It is worth noting that CER is not evaluated for
the unregularized CNN-GRU model because the values obtained from WER already indicated
that the model has a lower learning capability compared to the regularized one. However, model
which has the best training and learning capacity (i.e, regularized CNN-GRU model) is also not
far from behind in terms of WER and CER, with a probability around 40% and 11%
respectively. This indicates that the validation set may be too small such that many similar data
are generated, resulting in a lower WER and CER for regularized CNN-LSTM model.
As for the testing CER and WER, their percentages are lower than or equal to corresponding
results derived from the validation set with a minimum difference between 1 to 4%. This implies
that the validation and training sets have rather similar speech features such that they can be
captured and learned by the model precisely.
To sum it up, the most optimal model is the regularized CNN-GRU model because it is neither
overfitting nor underfitting to the training set. It also has 0 intense fluctuations, meaning that the
training and validation loss of the model are rather stable and do not change overtime. Moreover,
the model also has a relatively medium to low WER and CER considering it is trained on a small
sample of datasets prior to the down-sampling technique applied. It is able to predict a big
portion of the characters accurately and more than half of the tokens or words were predicted
correctly as well. If such model were to train on larger datasets such as the original dataset
without data sampling technique applied, it is able to generalize well with more complex
patterns, thus resulting in even lower training and validation loss, subsequently lower WER and
CER as well.

CHAPTER 7: SYSTEM ARCHITECTURE

7.1 Introduction
The proposed Web-based ASR System for E-learning in English Language is seen to be a
solution to many end users including tutors and students especially in tackling aspects of
teaching and learning. The proposed system will be deployed using Streamlit to become a web
application in which users can access it by connecting to a local host. Some functionalities of the
proposed system are as such:
Tutor
* Tutor is able to register an account
* Tutor is able to login
* Tutor is able to change password
* Tutor is able to logout
* Tutor is able to access audio-based commands
* Tutor is able to view properties of audio file
* Tutor is able to resample audio file
* Tutor is able to download resampled audio file
* Tutor is able to generate transcript
* Tutor is able to translate transcript
* Tutor is able to download original and translated transcript
Student
* Student is able to register an account
* Student is able to login
* Student is able to change password
* Student is able to logout

* Student is able to access audio-based commands
* Student is able to view properties of audio file
* Student is able to resample audio file
* Student is able to download resampled audio file
* Student is able to generate transcript
* Student is able to translate transcript
* Student is able to download original and translated transcript

7.2 Abstract Architecture

7.2.1 System Design
7.2.1.1 Use Case Diagram
A Use Case Diagram is a type of behavioural diagram under the Unified Modelling Language
(UML) hierarchy that summarizes the functional requirements of the system and its interaction
with end users. Instead of listing out a step-by-step procedure on how an event occurs, it utilizes
textual and visual representations to communicate the expected behaviour of the system from
end users’ point of view towards developers’ end. (What Is Use Case Diagram?, 2022) Hence, a
Use Case Diagram for the ASR web-based system for E-learning in English language is
illustrated in the figure below with target users being tutors and students within any educational
institutions.
Figure 135: Use Case Diagram

7.2.1.2 Use Case Specification

A Use Case Specification is an extension of the Use Case Diagram whereby it outlines details on
how end users interact with the system to achieve the desired goal. (What Is Use Case
Specification?, 2022) As such, different use case specifications mentioned in the Use Case
Diagram above for ASR web-based system for E-learning in English language are detailed in the
tables below.
Use Case ID 1.0
Use Case Name Register
Business Actor Tutors and students
Description Allows users to sign up before using the system
Pre-conditions User must not have a registered Email Address in the system
Basic Workflow 1. User enters their school Email Address and password
2. User clicks on “Register” button
3. System validates Email Address and password
4. A message pops up with “Account created successfully!”
Alternative Workflow 1. When user’s Email Address is not detected within the
system in login interface, user will be given the option to
redirect back to registration interface
Post-conditions User successfully registered into the system and was given the
option to redirect to login interface
Table 5: Use Case Specification for registration
Use Case ID 2.0
Use Case Name Login

Description Allows users to login into the system based on their credential
Pre-conditions User must already have a registered Email Address in the system
for login credentials to be verified
Basic Workflow 1. User enters their registered school Email Address and
corresponding password
2. System verifies Email Address and password
3. A message pops up with “All your credentials are correct!
You may login now.” if all credentials are entered correctly
4. User clicks on “Login” button
Alternative Workflow Upon a user successfully registered an account in the registration

interface, user will be given the option to redirect back to login
interface
Post-conditions User successfully logged into the system and immediately got
redirected to main interface
Table 6: Use case specification for login
Use Case ID 3.0
Use Case Name Change Password
Description Allows users to change password based on their credential
Pre-conditions User must already have a registered Email Address in the system
for password to be updated
Basic Workflow 1. In the login interface, user clicks “Change Password”

button to be directed to password-changing interface
2. User enters their registered school Email Address and new

updated password
3. User clicks on “Change Password” button
4. System verifies Email Address and updated password
5. A message pops up with “Password changed successfully!”
Alternative Workflow N/A
Post-conditions User successfully changed their password and was given the
option to redirect to login interface
Table 7: Use case specification for changing password
Use Case ID 4.0
Use Case Name Logout
Description Allows users to log out of the system
Pre-conditions User must be already logged into the system
Basic Workflow 1. User clicks the “Back to Home Page” button

2. User logs out of the system
Post-conditions User successfully logs out of the system
Table 8: Use case specification for changing password
Use Case ID 5.0
Use Case Name Listen to audio-based command
Description Allows users to listen to audio commands when navigating

through the web application

Pre-conditions N/A
Basic Workflow 1. User clicks the “🔊” button

2. A recording of the playable audio is displayed
3. User can play the audio and perform other operations
Post-conditions User can adjust playback speed, adjust audio volume and
download the playable audio
Table 9: Use Case Specification for listening to audio-based command
Use Case ID 6.0
Use Case Name Upload Audio File
Description Allows users to upload audio file from their local directory into
Streamlit environment
Pre-conditions User must be already logged into the system
Basic Workflow 1. Upon logging in, click the “Browse Files” icon
2. Select an audio file of “wav” format
3. Key in the corresponding file path
4. A message pops up with “File ‘filename’ has been
uploaded!”
Post-conditions The “wav” format audio file is displayed and users will be
prompted to select the next operations
Table 10: Use case specification for uploading audio file
Use Case ID 7.0

Use Case Name View Properties of Audio File
Description Allows users to visualize properties of the uploaded file within

Streamlit environment
Pre-conditions User must be already logged into the system and uploaded an
audio file
Basic Workflow 1. Upon uploading an audio file, click the “View properties of
audio file” icon
2. User is redirected to the “Property Viewer” page
3. User clicks on any of the expanders or button to visualize
specific properties of the audio file
Alternative Workflow Upon landing on the resampling and transcript page, users can
navigate back to the property viewer page by clicking “View
Properties of Audio File” button
Post-conditions Corresponding graphs and information regarding the uploaded

audio file are displayed
Table 11: Use Case Specification for viewing audio file's properties
Use Case ID 8.0
Use Case Name Resample Audio File
Description Allows users to change sample rate or frequency of audio files,

subsequently changing its speed and pitch
Pre-conditions User must be already logged into the system and uploaded an
audio file
Basic Workflow 1. Upon uploading an audio file, click the “Resample Audio

File” button
2. User is redirected to the “Resampling” page
3. User enters the new sample rate in the numeric text box
4. User clicks “Resample audio file” button
Alternative Workflow Upon landing on the property viewer and transcript page, users can
navigate back to the sampling page by clicking “Resample Audio
File” button
Post-conditions The resampled audio file will be displayed and users will be
prompted to download the file
Table 12: Use Case Specification for resampling audio file
Use Case ID 9.0
Use Case Name Download Resampled Audio File
Description Allows users to download audio file with frequency, pitch and
speed modified according to their needs
Pre-conditions User must have already logged into the system, uploaded an audio
file and generated a resampled version of the original audio file
Basic Workflow 1. After resampling, click the “Download Resampled Audio

File” button in the same page
Post-conditions Open the audio file through the download footer of browser or
within user’s ‘download’ folder in local directory
Table 13: Use Case Specification for downloading resampled audio files
Use Case ID 10.0
Use Case Name Generate Transcript

Description Allows users to view the transcript generated by the back-end

model based on the chosen audio file in a panel
Pre-conditions User must have already logged into the system and uploaded an
audio file
Basic Workflow 1. After uploading audio file, click the “Generate Transcript”
button
2. User is redirected to the “Transcript” page
3. User clicks on “Generate Transcript” button
4. An info panel indicating the transcript for the chosen audio
file will pop up
Alternative Workflow Upon landing on the property viewer and resampling page, users
can navigate back to the transcript page by clicking “Generate
Transcript” button
Post-conditions User can view the transcript in the info panel and perform further
operations with it
Table 14: Use Case Specification for generating transcript
Use Case ID 11.0
Use Case Name Translate Transcript
Description Allows users to translate the generated transcript into their chosen
language
Pre-conditions User must have already logged into the system, uploaded an audio
file and generated a transcript

Basic Workflow 1. After generating the transcript, choose a language to

translate into from the selector box
2. Click the “ Translate Transcript” button
3. An info panel indicating the translated transcript for the
chosen audio file will pop up
Post-conditions User can view the translated transcript in the info panel and
prompted to download it
Table 15: Use Case Specification for translating transcript
Use Case ID 12.0
Use Case Name Download Transcript
Description Allows users to download the original and translated transcript

into their local directory
Pre-conditions 1. User must have already logged into the system,

uploaded an audio file and generated a transcript if they
intend to download the original transcript
2. User must have already logged into the system,
uploaded an audio file, generated a transcript and
translated it into a chosen language if they intend to
download the translated transcript
Basic Workflow 1. After generating the transcript, click the “Download

Transcript” or “Download Translated Transcript” button in
the same page depending on user’s intent
Post-conditions Open the transcript in the form of text file through the download

footer of browser or within user’s ‘download’ folder in local

directory
Table 16: Use Case Specification for downloading transcript
7.2.1.3 Activity Diagram

Also classified as a type of behavioral diagram under the UML hierarchy, Activity Diagram
describes how sequences of events or activities are coordinated in terms of abstraction level to

represent complex business workflows. The internal structure of an Activity Diagram is similar
to flowchart diagrams whereby it depicts sequential and control flow mechanisms. (What Is
Activity Diagram?, 2022) In order to describe the intended behavior of every feature or process
of the web-based ASR system extensively, several activity diagrams are illustrated below.
Figure 136: Activity Diagram for register function

Figure 137: Activity Diagram for login function
Figure 138: Activity Diagram for password changing function

Figure 139: Activity Diagram for logout function
Figure 140: Activity Diagram for listening audio-based commands function
Figure 141: Activity Diagram for uploading audio file function

Figure 142: Activity Diagram for viewing properties of audio file function
Figure 143: Activity Diagram for resampling audio file function

Figure 144: Activity Diagram for downloading resampled audio file function
Figure 145: Activity Diagram for generating transcript function

Figure 146: Activity Diagram for translating transcript function
Figure 147: Activity Diagram for downloading transcript function

7.2.2 Interface Design

The interface design of all pages within the web-based ASR system must be simple and concise.
This allows students across all hierarchy of the educational institution to navigate through the
application at ease with fewer steps possible. To enhance User Experience (UX) in interacting
with the system’s components even further, the buttons, text boxes, headings and sub-headings
are made as big as possible and designed to correlate with the dark background theme of the
interfaces. Apart from text-based commands, audio and animation-wise commands on how to
navigate through certain section of the pages act as an alternative for students and tutors
especially for those who possess low literacy levels, visual impairment and hearing disorders.
Therefore, simple interface design for each webpage of the system will be displayed as shown in
figures below.
Figure 148: Wireframe for home page

Figure 149: Wireframe for registration page
Figure 150: Wireframe for login page
Figure 151: Wireframe for password changing page

Figure 152: Wireframe for file uploader page
Figure 153: Wireframe for property viewer page

Figure 154: Wireframe for resampling page
Figure 155: Wireframe for transcript page

CHAPTER 8: PROJECT PLAN

8.1 Features
8.1.1 Register
Before putting the web-based ASR system of great use, students and tutors must register an
account beforehand. Users will be prompted to the registration interface if they considered
themselves as new (non-registered) users. The user then types their institution’s assigned Email
Address and a password of their choice. The user’s credentials will then be validated and if any
of the credential is invalid, corresponding warning messages will appear in the interface. Invalid
credentials include empty entries, Email Address or password that do not satisfy system
requirements and duplicate Email Address that have been registered into the system. Upon all
credentials are filled in correctly, the system will store them in a CSV file with the password
encrypted for security purposes. Afterwards user can click the “register” button a message
indicating success popping up and they can proceed to the login interface by clicking on the pop-
up button.
8.1.2 Login
Users will be prompted to the login interface if they considered themselves as registered users.
The user then types their registered institution’s Email Address and corresponding password.

The user’s credentials will then be verified and if any of the credential is invalid, corresponding
warning messages will appear in the interface. Invalid credentials include empty entries, Email
Address or password that do not satisfy system requirements and non-registered Email Address.
Upon all credentials are filled in correctly, the message indicating success will pop up and users
will be redirected to the main interface.
8.1.3 Change Password

Users will be prompted to the login interface if they considered themselves as registered users.
The user then clicks on the “Change Password” button to proceed to password changing
interface. They must type their registered institution’s Email Address and updated password. The
user’s credentials will then be verified and if any of the credential is invalid, corresponding
warning messages will appear in the interface. Invalid credentials include empty entries, Email
Address or password that do not satisfy system requirements and non-registered Email Address.
Upon all credentials are filled in correctly, a message indicating success will pop up and users
can click on the pop-up button to be redirected back to the login interface.
8.1.4 Logout
After logging into the system, users can logout by clicking the “Logout” button at the top left
corner of the interface. It is worth noting that the “Logout” button is different from the “Back to
Home Page” button within the login, register and password changing interface as users are not
considered as logged in upon reaching the latter 3 interfaces.
8.1.5 Listen to audio-based commands

Upon landing on each interface including the home page, there will be at least one audio-based
command to aid users with learning disabilities or low literacy level. Users can click on the “🔊”
button and a playable audio will pop up. Users can then play, pause the audio, adjust audio
speed, volume and download audio into their local directory for further learning and teaching
use.
8.1.6 Upload Audio File

Upon logging in, users will be directed to this page. In order to proceed with other operations,
they must upload an audio file from their local directory into Streamlit environment. To do so,
they must click the corresponding icon and select an audio file of “wav” format from folders. In

addition to that, they must also copy the corresponding audio file’s path into a text area in order
for Streamlit to retrieve its content. Once done, a message indicating success and buttons for next
operations will pop up.
8.1.7 View Properties of Audio File

User will be redirected to the property viewer interface after uploading an audio file and clicking
on the corresponding button. Upon landing, they can view the can view the playable audio file
and its frequency. Visualizations on waveform, MFCC and log filter banks of the audio file can
also be viewed in expanders. At this point, they may proceed with other operations or return
back to the file uploader page or logout of the system.
8.1.8 Resample Audio File

User will be redirected to the resampling interface after uploading an audio file and clicking on
the corresponding button. Upon landing, they can view the playable audio file and its frequency.
They can then decide on what frequency to adjust to by entering the new sample rate between
8.000 to 48,000 Hz. Upon clicking the corresponding button, they can view the playable
resampled audio file appearing in the interface.
8.1.9 Download Resampled Audio File

After resampling an audio file, users can choose whether they want to download it by clicking on
the corresponding button. They can retrieve the downloaded audio file either through the pop up
footer of the browser or the local ‘downloads’ folder in their directory with a default name of
“resampled_ ‘original name of audio file’ ”. They can opt to modify the file name in their local
directory later on.
8.1.10 Generate Transcript

User will be redirected to the transcript interface after uploading an audio file and clicking on the
corresponding button. Upon landing, they can view the playable audio file. They can then click
on the corresponding button to generate a transcript which will appear in an information panel
below.

8.1.11 Translate Transcript

In the transcript interface, users must first choose the language to translate into through the
selector box. Only then, they can click the corresponding button to translate the transcript listed
above. The transcript written in the language chosen by the user will then be displayed in the
information panel below.
8.1.12 Download Transcript

After generating or translating a transcript, one can download it by clicking the corresponding
button. They can retrieve the downloaded text file either through the pop up footer of the
browser or the local ‘downloads’ folder in their directory.
8.2 Details of the Release Plan

A release plan is a must for each software or system development project. It is a tactical
document to identify features planned for future releases. Typically, this document is only for
internal use as its purpose is to provide a blueprint for software development team on what
features to release and how to release them into the customer environment. (Release Plan, 2020)
For this project, 3 release plans, updated functionalities included within and release date range of
each version will be specified below.
8.2.1 Version 1.0

The version 1.0 of web-based ASR System for E-learning in English Language will be released
in the first week of July 2023. This version is supported with basic functionalities of the system
such as register, login, change password, logout, resample audio files and generate transcripts.
The GUI deployed using Streamlit is supported with big toggle buttons, text inputs and the
ability to navigate between pages. The deep learning model used to generate transcript should

achieved an accuracy beyond 60%, i.e., the Word Error Rate (WER) deprived from the model
should be less than 40% and the Character Error Rate (CER) should be less than 20%.
8.2.2 Version 2.0

in the first week of August 2023. This version is supported with other advanced functionalities of
the system such as viewing properties of audio file and translating transcripts. The GUI deployed
using Streamlit in this version is supported with audio-wise commands that are deemed
beneficial to tutors and students with language barriers or low literacy levels. The deep learning
model used to generate transcript should achieved an accuracy beyond 70%, i.e., the Word Error
Rate (WER) deprived from the model should be less than 30% and the Character Error Rate
(CER) should be less than 15%.
8.2.3 Version 3.0

in the first week of September 2023. This version is supported with more advanced
functionalities of the system such as downloading resampled audio files as well as downloading
both transcript and translated transcripts. The GUI deployed using Streamlit is supported with
background themes and more visualization or audio-based commands that aid users in navigating
through the application. The deep learning model used to generate transcript should achieved an
accuracy beyond 80%, i.e., the Word Error Rate (WER) deprived from the model should be less
than 20% and the Character Error Rate (CER) should be less than 10%.

8.3 Test Plan

A test plan is a critical document that ensures the software development project meets the
Quality Assurance (QA) standards established by the QA team and test managers. It describes
the testing approach as well as the initial setup of testing environments, resource allocation,
objectives, schedules and the testing criteria for each approach in a detailed manner. (Bose,
2022) Since the system consists of several features with basic and alternative workflow
available, a comprehensive test plan for Unit Testing and User Acceptance Testing (UAT) will
be documented to ensure the system is functioning as expected.
8.3.1 Unit Testing

Unit Testing is a testing technique applied for individual components or units of a functional
application. Being the first level of testing, it is a fundamental procedure as it can detect coding
errors at an early stage of development. This greatly reduces efforts of testers in tracking origins
of error during Integration or System Testing as isolated components are guaranteed to execute
properly and they can focus on other root cause of errors. (Unit Testing - Javatpoint, 2021)

Several Uunit Testing plans has been documented below regarding features or interfaces of the
system.
Test Case: Filling in credentials for registration
ID Test Condition Expected Result Actual Status

Result (Pass /
Fail)
1.1 1. Leave all text fields empty Error messages popped up

indicating both Email
2. Click “Register” button
Address and password
columns are empty
1.2 1. Enters Email Address with Error messages popped up

wrong format and empty indicating password column
password is empty and Email Address
is invalid
1.3 1. Enters password with wrong Error messages popped up

format and empty Email Address indicating Email Address
column is empty and
password is invalid
1.4 1. Enters Email Address and Error messages popped up

password with wrong format indicating both Email
entries are invalid
1.5 1. Enters registered Email Error message popped up

Address and password with indicating duplicate entry of
correct format Email Address

1.6 1. Enters non-registered Email “Account created

Address and password with successfully!” messaged
correct format popped up and user will be
prompted to click the
“Proceed to login interface”
Table 17: Test plan for registration
Test Case: Filling in credentials for login

Result (Pass /
Fail)

indicating both Email Address
and password columns are
empty

wrong format and empty indicating password column is
password empty as well as Email Address
is not registered. A “Go to
Registration Page” button will
appear, prompting the user to
register an account before
proceeding
2.3 1. Enters password with Error messages popped up

wrong format and empty indicating Email Address
Email Address column is empty and password
is incorrect


password with wrong format indicating Email Address is not
registered and password is
incorrect. A “Go to Registration
Page” button will appear,
prompting the user to register an
account before proceeding

Address and wrong password indicating password is incorrect
2.6 1. Enters registered Email A message with “All your

Address and correct password credentials are correct! You may
login now.” and “Login” button
popped up, directing the user to
the main interface
Table 18: Test plan for Login
Test Case: Filling in credentials for changing password

Result (Pass /
Fail)

indicating both Email Address
2. Click “Change Password”
button
empty

wrong format and empty indicating password column is
password empty as well as Email

2. Click “Change Password” Address is not registered. A

button “Go to Registration Page”
button will appear, prompting
the user to register an account
before proceeding
3.3 1. Enters password with wrong Error messages popped up

format and empty Email indicating Email Address
Address column is empty and password
is invalid
button

password with wrong format indicating Email Address is
not registered and password is
invalid. A “Go to Registration
button
prompting the user to register
an account before proceeding

Address and password with indicating password is invalid
wrong format

button
3.6 1. Enters registered Email A message with “Password

Address and correct password changed successfully!” and
“Back to Login Page” button
popped up, prompting the user
button
to the login interface
Table 19: Test plan for changing password

Test Case: Logout

Result (Pass /
Fail)
4.1 1. Clicks “Logout” User is logged out and redirected back

button to home page of application
Table 20: Test plan for logout
Test Case: Upload Audio File

Result (Pass /
Fail)
5.1 1. File path column is empty An error message popped up

notifying the user to upload an
2. File uploader panel is
audio file and enter
empty
corresponding file path
5.2 1. Enters invalid file path An error message popped up

empty
5.3 1. File path column is empty 1. An error message popped up

2. Clicks “Browse File”
icon
3. Select a file that is not of
2. An error message popped up
“wav” format
indicating the file type is not
allowed

5.4 1. File path column is empty An error message popped up

icon
3. Select a “wav” audio file
5.5 1. Enters invalid file path 1. An error message popped up

icon
“wav” format
allowed
5.6 1. Enters invalid file path An error message popped up

indicating file path does not
match with uploaded audio file
icon
5.7 1. Enters valid file path with 1. An error message popped up

leading and trailing indicating file path does not
quotation mark (“”) match with uploaded audio file
2. Clicks “Browse File” 2. An error message popped up

icon notifying user to remove starting
and ending quotation marks
5.8 1. Enters valid file path with 1. Playable recording of the

no leading and trailing uploaded audio file is displayed
quotation mark (“”)
2. A message indicating the file
2. Clicks “Browse File” has been uploaded successfully is

icon displayed
3. Select a “wav” audio file 3. Buttons that prompt user to

perform next operations popped
up
Table 21: Test plan for uploading audio file
Test Case: View Properties of Audio File

Result (Pass /
Fail)
6.1 1. Clicks “Check Frequency of Frequency of audio file

Audio File” button is displayed
6.2 1. Clicks waveform expander for User can identify

chosen audio file amplitude of the audio
file at specific time
2. Adjust contents of chart through
frames
task bar at top right panel of
expander
3. Hover over contents of the chart
6.3 1. Clicks MFCC coefficient expander User can identify

for chosen audio file intensity of MFCC
coefficient at specific
window frames
expander
6.4 1. Clicks log filterbank expander for User can identify

chosen audio file intensity of log

2. Adjust contents of chart through filterbank coefficient at

task bar at top right panel of specific window frames
expander
Table 22: Test plan for viewing properties of audio file
Test Case: Resampling Audio File

Result (Pass /
Fail)
7.1 1. Clicks “Check Frequency of Frequency of audio file

Audio File” button is displayed
7.2 1. Enter a new frequency lower than Error message popped

8,000 Hz in the text box up indicating frequency
cannot be lower than
8,000 Hz
7.3 1. Enter a new frequency greater than Error message popped

48,000 Hz in the text box up indicating frequency
cannot be greater than
48,000 Hz
7.4 1. Clicks “Resample audio file” A playable resampled

button audio file is displayed
Table 23: Test plan for resampling audio file
Test Case: Downloading Resampled Audio File

Result (Pass /

Fail)
8.1 1. Clicks “Download Resampled Audio file is played

Audio File” button using Windows Media
Player or other default
2. Click on the pop-up downloaded
audio tools
file in the browser’s footer
Table 24: Test plan for downloading resampled audio file
Test Case: Generate Transcript

Result (Pass /
Fail)
9.1 1. Clicks “Generate 1. Information panel with transcript for the

Transcript” button chosen audio file is displayed
2. Button to prompt user to download

transcript popped up
Table 25: Test plan for generating transcript
Test Case: Translate Transcript

Result (Pass /
Fail)
10. 1. Did not click on 1. The transcript generated previously

1 selector box to will be translated into the default
choose language Afrikaans language and displayed in an
information panel
2. Clicks “Translate
Transcript” button 2. Button to prompt user to download
translated transcript popped up

10. 1. Choose a 1. The transcript generated previously

2 language from the will be translated into the chosen
selector box language and displayed in an information
panel
Table 26: Test plan for translating transcript
Test Case: Download Transcript

Result (Pass /
Fail)
11. 1. Click on “Download Transcript” or Transcript or translated

1 “Download Translated Transcript” transcript is displayed
button in Notepad

Table 27: Test plan for downloading transcript

8.3.2 User Acceptance Testing

Also Known as beta or end-user testing, User Acceptance Testing (UAT) is the final stage in
software development project whereby users or clients provide feedback on the functional
application. It is an inevitable stage before deploying the system to the customers’ environment
as it can validate if all business requirements are fulfilled. Although internal developers and beta
testers have pre-requisite domain knowledge in software development, they may interpret several
business requirements that are deemed inappropriate or incompatible from the users’ perspective,
hence the need of UAT. (Vijay, 2023) The UAT for ASR web-based system for E-learning in
English language will be carried out by at least 3 testers through the web application and an
online form with no instructions provided by developers beforehand. The table below shows the
UAT form for the developed system.
Title User Acceptance Testing (UAT) for ASR web-

based system for E-learning in English Language
Description To collect feedback from target users on the web

application and make corresponding
improvements, if necessary
Tester No
Tester Name
Tester Job
Date
No Test Case with Acceptance Test Rating (Use √ on one of the box from 1 – 5,
Criteria i.e. from least satisfied to most satisfied)
1 2 3 4 5
1 Meeting Objectives
 Audio Files can be

resampled
 Transcript can be generated
2 User Interface
 Smooth interaction with UI

 Information is well
structured
 Crucial information can be
displayed clearly
3 Design and Aesthetic
 Good image resolution and

audio quality
 Big toggle buttons and
widgets
 Theme is aligned with font
style and color
4 Functionalities
 Completeness and
correctness of features
 Application will not crash
when users navigate
through each feature
 Successful and error
messages are displayed
clearly
5 Performance
 Able to generate transcript

up to an optimum level of
accuracy
 Able to display computed
results of WER and CER
 System has significant
processing speed
 Less loading time upon
navigating between pages
Feedbacks and Suggestions:
Table 28: User Acceptance Testing Form

CHAPTER 9: IMPLEMENTATION
9.1 Screenshots
9.1.1 Home Page
Figure 156: Interface for Home Page

Upon opening the local host application, users will be directed to the home page as shown in
figure above. An introductory text on an overview of ASR, the purpose of this application and
how users can put it to good use will be shown. Then, users will be prompted as to whether they
are first-time users of the system. If they are, they should click the ‘Yes’ button to be redirected
to registration page; otherwise, they should click the ‘No’ button to be redirected to login page.
For users that require extensive aid, they can click on the "🔊" icon at top right corner of
interface or the playable audio icons below title and sub-headings to listen to audio-based
commands on how to navigate through the application.

9.1.2 Registration Page
Figure 157: Registration Interface
After selecting “Yes” option in home page, user is redirected to the interface as shown in figure
above. At this point, the user did not fill in any Email Address or Password, so the pop-up
message indicating that both Email Address and password are empty will appear. Users can
choose to go back to home page by clicking the “Back to Home Page” button at top left corner.
For users that require extensive aid, they can click on the "🔊" icon at top right corner of
interface or the playable audio icons below each warning message to listen to audio-based
commands on how to register.

Figure 158: Registration Page Showing Error Message for invalid Email Address and Password
As shown in figure above, if user enters an invalid school-use Email Address and clicks the
“Register” button, an error message will pop up. The message implies users that the Email
Address must start with “TP” for students or “LS” for tutors and must be followed by 6 digits
with the school’s abbreviation in between. A valid example of an Email Address is
“TP059923@mail.PTg.edu.my”. If a user enters an invalid password and clicks the “Register”
button, an error message will also pop up implying users that password must have a length
between 7 to 20 characters, contain at least 1 digit, 1 lowercased letter, 1 uppercased letter and 1
symbol.
Figure 159: Text File to store Email Address and Password entries

Figure 160: Registration Page Showing Error Message for Duplicate Email Address's Entry
As shown in figure above, if user enters a registered Email Address in the system, then clicks the
“Register” button, an error message will pop up. The message implies users have entered
duplicate entry of Email Address as the Email Address entered by user is the same as the first
entry in the text file.
Figure 161: Interface for Successful Registration
As shown in figure above, if user enters their credentials based on the specified format, then
click the “Register” button, a message indicating account creation was successful will appear
along with the playable audio. At this point, the user’s credentials will be stored in the text file
with password encrypted using hashing algorithms. The user will then be prompted back to the
login page with the “Proceed to login interface” button popping up.

9.1.3 Login Page
Figure 162: Interface for Login Page
After selecting the “No” option in home page or upon registration, users will be redirected back
to the login page as shown in figure above. At this point, the user did not fill in any Email
Address or Password, so the pop-up message indicating that both Email Address and password
are empty will appear. Users can click on the “Change Password” icon to be directed to the
password changing interface. Users can also choose to go back to home page by clicking the
“Back to Home Page” button at top left corner. For users that require extensive aid, they can
click on the "🔊" icon at top right corner of interface or the playable audio icons below each
warning message to listen to audio-based commands on how to login.

Figure 163: Login Interface Showing Error Message for invalid Email Address and Password
As shown in figure above, if user enters an invalid school-use Email Address, an error message
will pop up. The message implies the Email Address has yet to be registered into the web-based
system. Thus, a pop-up “Go to Registration Page” button will prompt the user back to the
registration interface in case the user mis-clicked into the login page before registering. If a user
enters an invalid password, an error message will also pop up implying users that the entered
password is incorrect.
Figure 164: Interface for Successful Login
As shown in figure above, if user enters their credentials based on the specified format, a
message indicating account all credentials are correct will pop up. The user will then be
prompted to login with the “Login” button popping up, afterwards will redirect user to the File
Uploader page upon clicking it.

9.1.4 Password Changing Page
Figure 165: Interface for Password Changing Page
After clicking “Change Password” button in login page, user is redirected to the interface as
shown in figure above. At this point, the user did not fill in any Email Address or Password, so
the pop-up message indicating that both Email Address and password are empty will appear.
Users can choose to go back to home page by clicking the “Back to Home Page” button at top
left corner. For users that require extensive aid, they can click on the "🔊" icon at top right corner
of interface or the playable audio icons below each warning message to listen to audio-based
commands on how to register.
Figure 166: Password Changing Interface Showing Error Message for invalid Email Address and Password

As shown in figure above, if user enters an invalid school-use Email Address, an error message
will pop up. The message implies the Email Address has yet to be registered into the web-based
system. Thus, a pop-up “Go to Registration Page” button will prompt the user back to the
registration interface. If a user enters an invalid password, an error message will also popped up
implying users that the entered password must have a length between 7 to 20 characters, contain
at least 1 digit, 1 lowercased letter, 1 uppercased letter and 1 symbol.
Figure 167: Interface for successful change in password
As shown in figure above, if user enters their credentials based on the specified format, then
click the “Change Password” button, a message indicating successful modification on password
will pop up. The user will then be prompted to redirect back to the login page with the pop-up
“Back to Login Page” button.

9.1.5 File Uploader Page
Figure 168: Interface for File Uploader Page
After clicking “Login” button in login page, user is redirected to the file uploader interface as
shown in figure above. At this point, the user did not fill in any file path nor upload any file to
the file uploader panel, so the pop-up message indicating that user has not uploaded audio file or
entered file path will appear. In order to proceed further with other operations, users must copy a
valid file path of their uploaded “wav” format audio file. Users can also choose to go back to
home page by clicking the “Logout” button at top left corner. For users that require extensive
aid, they can click on the "🔊" icon at top right corner of interface or the playable audio icons
below each warning message to listen to audio-based commands on how to upload an audio file.

Figure 169: File Uploader Interface showing Error Message for Invalid File Type
If the user drags or drops an audio file that is not of “wav” format, the system will consider it as
invalid file type with the corresponding error message displayed as shown in figure above.
Figure 170: File Uploader Interface showing Error Message for Invalid File Path
The file path entered by user in figure above is invalid as it is ending with a “\”. Hence, the
corresponding error message pops up.

Figure 171: File Uploader Interface showing Error Message for Incorrect File Path
Despite the file path uploaded by user is a valid one, it does not match with the file name of the
uploaded audio file. Hence, the error message indicating unmatched file path with audio file pops
up.
Figure 172: File Uploader Interface Showing Message to Remove Quotation Marks
Upon copying the file path from local folders, a trailing and leading quotation mark will be there
by default. This causes the back-end file path reader function to be unable to read contents of the
file. As such, an error message informing users to remove starting and ending quotation marks
will pop up.

Figure 173: File Uploader Interface for successful upload
As shown in figure above, if user enters the correct file path that matches with their uploaded
audio file, the recording of the audio file will be displayed below along with a message
indicating file upload was successful. The user is then prompted to continue next operations with
the pop-up text and audio-based command as well as 3 buttons, connecting to the Property
Viewer, Resampling and Transcript interface respectively.

9.1.6 Property Viewer Page
Figure 174: Interface for Audio File's Property Viewer Page
Upon clicking the “View Properties of Audio File” button in file uploading, resampling or
transcript interface, user is redirected to the property viewer page as shown in figure above.
Users will be presented with a recording of their uploaded audio file. They can check the
frequency of uploaded audio file by clicking the “Check Frequency of Audio File” button. They
can also visualize properties of audio in the form of raw waveforms, MFCC and Mel filterbank
by clicking on the corresponding expanders. Users can also choose to go back to home page by
clicking the “Logout” button at top left corner or go back to the file uploader page by clicking
the “Back to File Uploader Page” button. For users that require extensive aid, they can click on
the "🔊" icon at top right corner of interface or alongside each text-based commands. They can
also proceed with resampling and transcript page through the buttons at the lower section of the
UI respectively.

Figure 175: Property Viewer Interface when User Checks Frequency of Audio File
Upon clicking the “Check Frequency of Audio File” button, the original frequency of uploaded
audio file will be displayed as shown in figure above.
Figure 176: Property Viewer Interface when User Clicks the Expander for Displaying Audio's Waveform
Upon clicking the expander for raw waveform of audio file, the interface above will pop up. It
gives a brief introduction on the X-axis, Y-axis and an overview of the waveform graph. Users
can also hover their cursor over the different time frames to view their corresponding amplitude.

Figure 177: Property Viewer Interface when User Clicks the Expander for Displaying MFCC Heatmap
Upon clicking the expander for MFCC coefficient of audio file, the interface above will pop up.
It gives a brief introduction on the X-axis, Y-axis, color intensity and an overview of the
heatmap. Users can also hover their cursor over different window frames to view their
corresponding MFCC coefficient and color intensity level.

Figure 178: Property Viewer Interface when User Clicks the Expander for Displaying Mel Filterbank Heatmap
Upon clicking the expander for Mel filterbank coefficient of audio file, the interface above will
pop up. It gives a brief introduction on the X-axis, Y-axis, color intensity and an overview of the
heatmap. Users can also hover their cursor over different window frames to view their
corresponding Mel filterbank coefficient and color intensity level.

9.1.7 Resampling Page
Figure 179: Interface for Resampling Page
Upon clicking the “Resample Audio File” button in file uploading, property viewer or transcript
interface, user is redirected to the resampling page as shown in figure above. Users will be
presented with a recording of their uploaded audio file. They can check the frequency of
uploaded audio file by clicking the “Check Frequency of Audio File” button, after which they
can select a suitable sample rate between 8,000 to 48,000 Hz. Users can also choose to go back
to home page by clicking the “Logout” button at top left corner or go back to the file uploader
page by clicking the “Back to File Uploader Page” button. For users that require extensive aid,
they can click on the "🔊" icon at top right corner of interface or alongside each text-based
commands. They can also proceed with property viewer and transcript page through the buttons
at the lower section of the UI respectively.

Figure 180: Resampling Interface when User Checks Audio File's Frequency
Upon clicking the “Check Frequency of Audio File” button, the original frequency of uploaded
audio file will be displayed as shown in figure above.
Figure 181: Resampling Interface when User Enters A Sample Rate Lower than 8,000 Hz

Figure 182: Resampling Interface when User Enters a Sample Rate Higher than 48,000 Hz
As shown in the 2 figures above, if users attempt to enter a sample rate of lower than 8,000 Hz or
higher than 48,000 Hz, the corresponding error messages will pop up, reminding users to enter a
sample rate between 8,000 to 48,000 Hz.
Figure 183: Resampling Interface when User Enters a Sample Rate Between 8,000 to 48,000 Hz
Once the user selects a valid frequency between the specified range, a “Resample Audio File”
button will pop up.

Figure 184: Resampling Interface when User Clicks 'Resample Audio File' Button
Upon clicking the “Resample Audio File” button, a recording of the resampled audio file is
displayed. A “Download Resampled Audio File” button also pops up, prompting users to
download the resampled audio file. Upon clicking it, users can open the resampled audio file in
the pop-up footer panel of the browser or retrieve the file in their local ‘Downloads’ folder.

9.1.8 Transcript Page
Figure 185: Interface for Transcript Page
Upon clicking the “Generate Transcript” button in file uploading, property viewer or resampling
interface, user is redirected to the transcript page as shown in figure above. Users will be
presented with a recording of their uploaded audio file. They can then generate the transcript by
clicking the “Generate Transcript” button, after which they download the transcript to their local
‘Downloads’ folder. Users can also choose to go back to home page by clicking the “Logout”
button at top left corner or go back to the file uploader page by clicking the “Back to File
Uploader Page” button. For users that require extensive aid, they can click on the "🔊" icon at top
right corner of interface or alongside each text-based commands. They can also proceed with
property viewer and resampling page through the buttons at the lower section of the UI
respectively.

Figure 186: Interface when User Clicks 'Generate Transcript' Button
Upon clicking the ‘Generate Transcript’ button, a message indicating the generated transcript
will be shown below pops up. Then, the transcript highlighted in green will be displayed.
Figure 187: Interface when User Clicks 'Download Transcript' Button
After generating transcript, the ‘Download Transcript’ button will pop up which prompts users to
click it. After doing so, a download footer will pop up in the browser and users can retrieve the
downloaded transcript in local ‘Downloads’ folder.

Figure 188: Interface when User Selects Language and Clicks 'Translate Transcript' Button
In the language selection box, users can choose the desired language to translate their transcript
into. Once user clicks the ‘Translate Transcript’ button, the corresponding transcript in the
chosen language will be displayed in green text.
Figure 189: Interface when User Clicks 'Download Translated Transcript' Button
After generating the translated version of a transcript, the ‘Download Translated Transcript’
button will pop up which prompts users to click it. After doing so, a download footer will pop up
in the browser and users can retrieve the downloaded transcript in local ‘Downloads’ folder.

9.2 Sample Codes

9.2.1 Home Page
Figure 190: Code Snippet for Home Page
The code above generates the home page for web-based ASR system. Firstly, headings and
images are displayed with markdown formatting using Streamlit’s “markdown” and “image”
functions respectively. The 2nd section of the code consists of 3 question and answer pair-wise
information text that gives users a brief understanding in ASR and the purpose of the application.
The last section prompts if users are new to the system. If users select “Yes”, they will be
prompted to the registration page, otherwise they will be directed to the login page. This is
achieved through the “page_switcher” function that takes corresponding operation’s function as
argument within the button’s “on_click” parameter. At this point, Streamlit’s “session_state” will
be changed in the “page_switcher” function based on the page that the user chooses to navigate
to. All text areas, warning messages and command-wise instructions are also supported with a
"🔊" icon or playable audio recordings displayed using Streamlit’s “button” widget specifically
for users with low literacy levels. The alignment of buttons, text, images and markdowns is
achieved with Streamlit’s “column” function by specifying the number of columns and the ratio
that each column will be occupying.

9.2.2 Registration Page
Figure 191: Code Snippet for Registration Page
The code above generates the registration interface for web-based ASR application. The first
section displays the header of the interface which includes “Back to Home Page” button and
page title. Then, the Email Address and password entry sections are created. Using inputs
entered by users, the next section checks the emptiness of Email Address and password. Once
user clicks the “Register” button, the validity of Email Address and password are checked
against the pre-defined requirements. The Email Address for new users must also be unique from
any registered Email Address in the system. Corresponding error messages will be displayed if
invalid credentials or registered (i.e., duplicated) Email Address are detected. A success message
along with a pop-up “Proceed to Login Interface” button will only appear when both Email
Address and password are validated correctly. Upon clicking the button, users will be directed to
login page. This is achieved through the “page_switcher” function that takes corresponding
operation’s function as argument within the button’s “on_click” parameter. At this point,
Streamlit’s “session_state” will be changed in the “page_switcher” function based on the page
that the user chooses to navigate to. All text areas, warning messages and command-wise

instructions are also supported with a "🔊" icon or playable audio recordings displayed using
Streamlit’s “button” widget specifically for users with low literacy levels. The alignment of
buttons, text, images and markdowns is achieved with Streamlit’s “column” function by
specifying the number of columns and the ratio that each column will be occupying.

9.2.3 Login Page
Figure 192: Code Snippet for Login Page
The code above generates login interface for web-based ASR application. The first section
displays the header of the interface which includes “Back to Home Page” button and page title.
Then, the Email Address and password entry sections are created along with button widget for
users to change password. Using inputs entered by users, the validity of Email Address and
password are checked against registered credentials stored in CSV files. Corresponding error
messages will be displayed if invalid credentials or unregistered Email Address are detected. A
success message along with a pop-up “Login” button will only appear when both Email Address
and password are correct. Upon clicking, users will be directed to file uploader page. This is
achieved through the “page_switcher” function that takes corresponding operation’s function as
argument within the button’s “on_click” parameter. At this point, Streamlit’s “session_state” will
be changed in the “page_switcher” function based on the page that the user chooses to navigate
to. All text areas, warning messages and command-wise instructions are also supported with a
"🔊" icon or playable audio recordings displayed using Streamlit’s “button” widget specifically
for users with low literacy levels. The alignment of buttons, text, images and markdowns is

achieved with Streamlit’s “column” function by specifying number of columns and ratio that
each column will be occupying.
9.2.4 Password Changing Page
Figure 193: Code Snippet for Password Changing Page
The code above generates the password changing interface for web-based ASR application. The
first section displays the header of the interface which includes “Back to Home Page” button and
page title. Then, the Email Address and password entry sections are created. Using inputs
entered by users, the next section checks the emptiness of Email Address and password. Once
user clicks the “Change Password” button, the validity of Email Address and password are
checked against the pre-defined requirements. The Email Address for new users must be a
registered Email Address in the system. Corresponding error messages will be displayed if
invalid credentials or non-registered Email Address are detected. A success message indicating
successful password change along with a pop-up “Back to Login Page” button will only appear
when both Email Address and password are validated correctly. Upon clicking the button, users
will be directed to login page. This is achieved through the “page_switcher” function that takes

corresponding operation’s function as argument within the button’s “on_click” parameter. At this
point, Streamlit’s “session_state” will be changed in the “page_switcher” function based on the
page that the user chooses to navigate to. All text areas, warning messages and command-wise
buttons, text, images and markdowns is achieved with Streamlit’s “column” function by

9.2.5 File Uploader Page
Figure 194: Code Snippet for File Uploader Page
The code above generates the file uploader interface for web-based ASR application. The first
section displays the header of the interface which includes “Logout” button, a GIF visualization
on how to upload an audio file briefly and page title. The task description is formulated, a text
area to enter audio file path and a file uploader panel is rendered using Streamlit’s “text_input”
and “file_uploader” widget. In this case, only “wav” format audio files are allowed with one

single upload permitted at any point of time. Corresponding error messages are displayed if the
file path entered by user does not match with the uploaded one or consists of leading and trailing
quotation marks. Once there is a match between file path and uploaded file, the playable audio
file is displayed using Streamlit’s “audio” widget and a message indicating file was uploaded
successfully will be displayed.
The user is then prompted to choose the next 3 operations, namely viewing audio file’s property,
resampling audio files and generating transcript by clicking on corresponding buttons. Upon
clicking the buttons, users will be directed to respective pages. This is achieved through the
“page_switcher” function that takes corresponding operation’s function as argument within the
button’s “on_click” parameter. At this point, Streamlit’s “session_state” will be changed in the
“page_switcher” function based on the page that the user chooses to navigate to. All text areas,
warning messages and command-wise instructions are also supported with a "🔊" icon or
playable audio recordings displayed using Streamlit’s “button” widget specifically for users with
low literacy levels. The alignment of buttons, text, images and markdowns is achieved with
Streamlit’s “column” function by specifying the number of columns and the ratio that each
column will be occupying.

9.2.6 Property Viewer Page
Figure 195: Code Snippet for Viewing Property of Audio File
The code above generates the property viewer interface for web-based ASR application. The first
section displays the header of the interface which includes “Logout” button, a “Back to File
Uploader Page” button and page title. The playable audio file is then displayed using Streamlit

“audio” widget. Then, user will be prompted to click on a button displayed using Streamlit
“button” widget to check the frequency of uploaded audio file. Once clicked, a description of the
audio file’s name and frequency in ‘Hz’ will be displayed. Users are then prompted to view the
visualization contents in the expander through Streamlit “expander” widget. These contents
include a raw waveform, MFCC and Mel filterbank coefficient of the uploaded audio file. Such
contents are displayed by utilizing the code from section 5.5.2.
The user is then prompted to choose the next 2 operations, namely resampling and generating
transcript by clicking on corresponding buttons. Upon clicking the buttons, users will be directed
to respective pages. This is achieved through the “page_switcher” function that takes
corresponding operation’s function as argument within the button’s “on_click” parameter. At this
point, Streamlit’s “session_state” will be changed in the “page_switcher” function based on the
page that the user chooses to navigate to. All text areas, warning messages and command-wise
buttons, text, images and markdowns is achieved with Streamlit “column” function by specifying
the number of columns and the ratio that each column will be occupying.

9.2.7 Resampling Page
Figure 196: Code Snippet for Resampling Audio File

The code above generates the resampling interface for web-based ASR application. The first
section displays the header of the interface which includes “Logout” button, a “Back to File
Uploader Page” button and page title. The playable audio file is then displayed using Streamlit
“audio” widget. Then, user will be prompted to click on a button displayed using Streamlit
“button” widget to check the frequency of uploaded audio file. Once clicked, a description of the
audio file’s name and frequency in ‘Hz’ will be displayed.
Users are then prompted to enter the new sample rate for the audio file in the input box displayed
using Streamlit “number_input” widget. The default value of the input box will always be 0 Hz.
If the user does not enter a sample rate between 8,000 to 48,000 Hz with both ends inclusive, a
warning message will pop up. Otherwise, they will be prompted to resample the audio file by
clicking the corresponding button. Upon click, the file’s signal will be transformed into floating
type using “astype” function, resampled based on original frequency and size of signal using
“resample” function and written to memory. The recording of the resampled audio file is then
displayed alongside another button that prompts the user to download the resampled audio file.
Upon clicking, users can retrieve the file in the pop-up footer of browser or in their local
‘Downloads’ folder.
and generating transcript by clicking on corresponding buttons. Upon clicking the buttons, users
will be directed to respective pages. This is achieved through the “page_switcher” function that
takes corresponding operation’s function as argument within the button’s “on_click” parameter.
At this point, Streamlit’s “session_state” will be changed in the “page_switcher” function based
on the page that the user chooses to navigate to. All text areas, warning messages and command-
wise instructions are also supported with a "🔊" icon or playable audio recordings displayed
using Streamlit’s “button” widget specifically for users with low literacy levels. The alignment
of buttons, text, images and markdowns is achieved with Streamlit “column” function by

9.2.8 Transcript Page
Figure 197: Code Snippet for Generating Transcript (1)

Figure 198: Code Snippet for Generating Transcript (2)
The code in the 2 figures above generates the transcript interface for web-based ASR application.
The first section displays the header of the interface which includes “Logout” button, a “Back to
File Uploader Page” button and page title. The playable audio file is then displayed using
Streamlit “audio” widget. Then, user will be prompted to click on a button to generate transcript
for the uploaded audio file using Streamlit “button” widget. Once clicked, the regularized CNN-
GRU model, which is the optimum model will be loaded from the local folder using the
‘tensorflow.keras.models’ package’s “load_model” function. The audio path is then read and
passed to a “spec” function to obtain its spectrogram to be fed into the model for prediction. The
output will be a sequence of integers, which is passed to the “CTC_decode” function to be
converted into a string sequence.
Then, the transcript is displayed with a markdown size of 25 in green colour. The transcript is
then stored in a temporary file called “TR” for later retrieval. Next, users will be prompted to
download the transcript by clicking on corresponding button. Upon clicking, the download
transcript will be saved into the users’ local ‘downloads’ folder in the form of “transcript_” plus
the name of uploaded audio file in ‘.txt’ format through Streamlit “download_button” widget.
Next, users will be prompted to select a language to translate into through Streamlit “selectbox”
widget. They will then have to click on the popped up ‘Translate Transcript’ button to translate
the transcript. Upon clicking, the original transcript will be retrieved from “TR.txt”, then applied

to Google Translator API through identification of language code. The translated transcript is
displayed with a markdown size of 25 in green colour. Finally, users are prompted to download
the translated transcript which will be saved into the users’ local ‘downloads’ folder in the form
of “{lang}_transcript_” plus the name of uploaded audio file in ‘.txt’ format whereby ‘lang’
represents the chosen language.
and resampling audio file by clicking on corresponding buttons. Upon clicking the buttons, users
will be directed to respective pages. This is achieved through the “page_switcher” function that
takes corresponding operation’s function as argument within the button’s “on_click” parameter.
At this point, Streamlit’s “session_state” will be changed in the “page_switcher” function based
on the page that the user chooses to navigate to. All text areas, warning messages and command-
wise instructions are also supported with a "🔊" icon or playable audio recordings displayed
using Streamlit’s “button” widget specifically for users with low literacy levels. The alignment
of buttons, text, images and markdowns is achieved with Streamlit “column” function by

CHAPTER 10: SYSTEM VALIDATION

10.1 Unit Testing Result
This section will finalize the Unit Testing result for each feature of the system carried out by
internal system testers based on the test plan documented in section 8.2.
Test Case: Filling in credentials for registration

Result (Pass /
Fail)
1.1 1. Leave all text fields empty Error messages popped up As Pass
indicating both Email expected
columns are empty
1.2 1. Enters Email Address with Error messages popped up As Pass

wrong format and empty indicating password expected
password column is empty and Email
Address is invalid
1.3 1. Enters password with wrong Error messages popped up As Pass

format and empty Email Address indicating Email Address expected
column is empty and
password is invalid
1.4 1. Enters Email Address and Error messages popped up As Pass

password with wrong format indicating both Email expected
entries are invalid
1.5 1. Enters registered Email Error message popped up As Pass

Address and password with indicating duplicate entry expected
correct format of Email Address

1.6 1. Enters non-registered Email “Account created As Pass

Address and password with successfully!” messaged expected
correct format popped up and user will be
prompted to click the
“Proceed to login
interface”
Table 29: Test Output for registration
Test Case: Filling in credentials for login

Result (Pass /
Fail)
2.1 1. Leave all text fields Error messages popped up indicating As Pass
empty both Email Address and password expected
columns are empty
2.2 1. Enters Email Error messages popped up indicating As Pass

Address with wrong password column is empty as well as expected
format and empty Email Address is not registered. A
password “Go to Registration Page” button
will appear, prompting the user to
register an account before
proceeding

2.3 1. Enters password Error messages popped up indicating As Pass

with wrong format Email Address column is empty and expected
and empty Email password is incorrect
Address
2.4 1. Enters Email Error messages popped up indicating As Pass

Address and password Email Address is not registered and expected
with wrong format password is incorrect. A “Go to
Registration Page” button will
appear, prompting the user to register
2.5 1. Enters registered Error message popped up indicating As Pass

Email Address and password is incorrect expected
wrong password
2.6 1. Enters registered A message with “All your credentials As Pass

Email Address and are correct! You may login now.” expected
correct password and “Login” button popped up,
directing the user to the main
interface
Table 30: Test output for Login
Test Case: Filling in credentials for changing password

Result (Pass /
Fail)
3.1 1. Leave all text fields empty Error messages popped up As Pass
indicating both Email Address expected
button

empty
3.2 1. Enters Email Address with Error messages popped up As Pass

wrong format and empty indicating password column is expected
password empty as well as Email
Address is not registered. A
“Go to Registration Page”
button
button will appear, prompting
the user to register an account
before proceeding
3.3 1. Enters password with Error messages popped up As Pass

wrong format and empty indicating Email Address expected
Email Address column is empty and
password is invalid
button
3.4 1. Enters Email Address and Error messages popped up As Pass

password with wrong format indicating Email Address is expected
not registered and password is
invalid. A “Go to Registration
button
prompting the user to register
3.5 1. Enters registered Email Error message popped up As Pass

Address and password with indicating password is invalid expected
wrong format

button
3.6 1. Enters registered Email A message with “Password As Pass

changed successfully!” and

Address and correct password “Back to Login Page” button expected

popped up, prompting the
user to the login interface
button
Table 31: Test output for changing password
Test Case: Logout

Result (Pass /
Fail)
4.1 1. Clicks “Logout” User is logged out and redirected back As Pass
button to home page of application expecte
d
Table 32: Test output for logout
Test Case: Upload Audio File

Result (Pass /
Fail)
5.1 1. File path column is An error message popped up As Pass

empty notifying the user to upload an expected
empty
5.2 1. Enters invalid file path An error message popped up As Pass

notifying the user to upload an expected
empty
5.3 1. File path column is 1. An error message popped up As Pass


icon
“wav” format
allowed
5.4 1. File path column is An error message popped up As Pass

icon
5.5 1. Enters invalid file path 1. An error message popped up As Pass

notifying the user to upload an expected
icon
“wav” format
allowed
5.6 1. Enters invalid file path An error message popped up As Pass

indicating file path does not expected
match with uploaded audio file
icon
5.7 1. Enters valid file path 1. An error message popped up As Pass

with leading and trailing indicating file path does not expected
quotation mark (“”) match with uploaded audio file

icon 2. An error message popped up

notifying user to remove starting
and ending quotation marks
5.8 1. Enters valid file path 1. Playable recording of the As Pass

with no leading and trailing uploaded audio file is displayed expected
quotation mark (“”)
2. A message indicating the file
2. Clicks “Browse File” has been uploaded successfully
icon is displayed
3. Select a “wav” audio file 3. Buttons that prompt user to

perform next operations popped
up
Table 33: Test ouptut for uploading audio file
Test Case: View Properties of Audio File

Result (Pass /
Fail)
6.1 1. Clicks “Check Frequency of Frequency of audio file As Pass

Audio File” button is displayed expected
6.2 1. Clicks waveform expander for User can identify As Pass

chosen audio file amplitude of the audio expected
file at specific time
frames
expander
6.3 1. Clicks MFCC coefficient User can identify As Pass

expander for chosen audio file intensity of MFCC expected

2. Adjust contents of chart through coefficient at specific

task bar at top right panel of window frames
expander
6.4 1. Clicks log filterbank expander for User can identify As Pass
chosen audio file intensity of log expected
filterbank coefficient at
specific window frames
expander
Table 34: Test output for viewing properties of audio file
Test Case: Resampling Audio File

Result (Pass /
Fail)
7.1 1. Clicks “Check Frequency of Frequency of audio As Pass

Audio File” button file is displayed expected
7.2 1. Enter a new frequency lower than Error message popped As Pass
8,000 Hz in the text box up indicating expected
frequency cannot be
lower than 8,000 Hz
7.3 1. Enter a new frequency greater Error message popped As Pass

than 48,000 Hz in the text box up indicating expected
frequency cannot be
greater than 48,000 Hz
7.4 1. Clicks “Resample audio file” A playable resampled As Pass

button audio file is displayed expected
Table 35: Test output for resampling audio file
Test Case: Downloading Resampled Audio File

Result (Pass /
Fail)
8.1 1. Clicks “Download Resampled Audio file is played As Pass

Audio File” button using Windows Media expecte
Player or other default d
audio tools
Table 36: Test output for downloading resampled audio file
Test Case: Generate Transcript

Result (Pass /
Fail)
9.1 1. Clicks “Generate 1. Information panel with transcript for As Pass

Transcript” button the chosen audio file is displayed expecte
d
transcript popped up
Table 37: Test output for generating transcript
Test Case: Translate Transcript

Result (Pass /
Fail)
10. 1. Did not click on 1. The transcript generated previously As Pass

1 selector box to will be translated into the default expecte

choose language Afrikaans language and displayed in d
an information panel
10. 1. Choose a language 1. The transcript generated previously As Pass

2 from the selector box will be translated into the chosen expecte
language and displayed in an d
information panel
Transcript” button
Table 38: Test output for translating transcript
Test Case: Download Transcript

Result (Pass /
Fail)
11. 1. Click on “Download Transcript” Transcript or translated As Pass

1 or “Download Translated transcript is displayed expecte
Transcript” button in Notepad d

Table 39: Test Output for downloading transcript

10.2 User Acceptance Testing


Tester No 1
Tester Name Kwan De Liang
Tester Job Student
Date 1/5/2023
Criteria i.e., from least satisfied to most satisfied)
1 2 3 4 5
1 Meeting Objectives √

resampled
2 User Interface √

structured
displayed clearly
3 Design and Aesthetic √

audio quality
widgets
style and color
4 Functionalities √
when users navigate
clearly
5 Performance √

accuracy using the
preloaded model
processing speed


The overall web application is smooth and user-friendly. However, some problems occur
while running the web application such as the long waiting time while going through the
model process. Besides that, the web application will need to refresh after one entry is done
towards the web application, which makes the browsing experience to be decreased.
Table 40: Result for User Acceptance Testing (1)


Tester No 2
Tester Name Wu Ing I
Tester Job Lecturer
Date 2/5/23
1 2 3 4 5

resampled


structured
displayed clearly

audio quality
widgets
style and color
when users navigate
clearly
5 Performance √

accuracy using the

preloaded model
processing speed
The objective is not stated too clearly as the system has more functionalities than the intended
objectives. Also, the panels and widgets of the system are kind of scattered such that
navigating through them can be a bit unpleasant for new-comers.


Tester No 3
Tester Name Lai Guan Leong
Tester Job Student
Date 2/5/23
1 2 3 4 5

resampled

structured
displayed clearly

audio quality
widgets
style and color
when users navigate
clearly
5 Performance √

accuracy using the
preloaded model
processing speed
The error and successful messages are a bit too long and may cause confusion for users that
have low literacy levels. Some messages originate from the same error, but use different
wording to represent.


Tester No 4
Tester Name Ng Kit Ying
Tester Job Tutor
Date 2/5/23
1 2 3 4 5


resampled

structured
displayed clearly

audio quality
widgets
style and color
when users navigate
clearly
5 Performance √


accuracy using the
preloaded model
processing speed
The audio-based commands are a bit out of order as they are not aligned with the original text
command in some pages. Users may also not be able to distinguish between audio-based
commands and the audio file they uploaded.
10.3 Summary
After documenting the Unit Testing sheet and User Acceptance Testing (UAT) sheet in section
8.3, these documents have been assigned to the internal system testing team and several clients
including both students and tutors. According to the result from System Testing, we can see that
all 11 features of the system are working as intended, as in they produce the result as what is
expected from developers. Hence, all individual functions or testing units within each feature
have passed the test successfully.
As for UAT, 5 criteria are being assessed namely, “Meeting Objectives”, “User Interface”,
“Design and aesthetics”, “Functionalities” and “Performance”. The UAT result conducted on 2
students and 2 tutors respectively show that the overall feedback is quite constructive. Based on
their feedback, it can be deduced that the system is bug-free as all scoring for functionalities are
greater than or equal to 3. However, the most problematic aspect of the system is the User
Interface design. This is due to the fact that there are several feedbacks regarding the UI is not so
beginner friendly with the button widgets scattered unevenly, causing users unable to distinguish
between audio files and audio commands. Another feedback worth take note of is about system’s
performance. This is due to the nature of Streamlit whereby the entire application will rerun

again once a user clicks a button or enters a new text input. Such consideration will be brought to
the software development team and corresponding improvements will be made in following
release versions.
CHAPTER 11: CONCLUSIONS AND REFLECTIONS

11.1 Conclusion
Collectively, this project aims to develop a fully functional web-based ASR in English Language
with the aid of existing mathematical paradigms and machine learning techniques to be
integrated into e-learning systems utilized by educational institutions. As of now, the e-learning
sector has faced several critical issues, which are difficulty for learners to concentrate during
sessions, lack of emphasis on tutors’ teaching strategies tutors and failure to address learning
disabilities of specific users. To overcome these issues, a detailed analysis on past literature
studies in the knowledge area of ASR, e-learning, statistical paradigms and machine learning
techniques have been initiated. Incorporating these knowledges, multiple acoustic models are
constructed using machine learning libraries in Python, among which the most performant model
is selected. An ASR system can then be developed, with the model generating transcripts,
resampling audio files, viewing corresponding properties and GUI available to put this into
visual reality.
Prior to this project’s success, extensive amount of research has been done, not only limited to
previous studies, but also on technical aspects of the project such as Programming Language,
IDE, libraries and other hardware or software specifications. In terms of previous research, the
classification general architecture of ASR, front-end feature extraction process, back-end
implementation, various machine learning models and evaluation technique for speech
recognition are explored thoroughly without restricting the scope of research to e-learning
systems in English language only. A side-by-side comparison in various aspects between 2
programming languages: Java and Python, as well as 3 data mining methodologies: KDD,
CRISP-DM and SEMMA are also done to choose the most suitable tool and framework for this
project, which are Python and CRISP-DM respectively.
In terms of implementation-wise aspects of the project, the dataset chosen has an audio file and a
corresponding transcript column as reference. Exploration, pre-processing, visualization and

partitioning on the dataset is performed to convert the acoustic inputs into vectorized input
sequences of spectrograms which represents the underlying speech features. Then, a non-
regularized CNN-GRU, regularized CNN-GRU, regularized GRU and regularized CNN-LSTM
model are developed by applying knowledge regarding hyperparameter tuning, regularization,
optimization and loss computation. Moving on, the evaluation metrics such as training and
validation, WER, CER for both testing and validation set are prepared and evaluated to decideon
which model is best in terms of all aspects. Finally, it is concluded that the regularized CNN-
GRU model is the most performant model out of all 4 model variations.
Along with basic functionalities such as login, logout, register, change password and advanced
functionalities such as view audio file’s property, resample audio file and download such files,
the model is deployed into the web-based ASR system functioning through the implementation
of Streamlit environment. The deployment model will be used to generate the transcript, after
which users can translate the transcript as well as download to be utilized for self-learning and
conducting lessons.
11.2 Reflection
As a self-reflection on the developer, there is a subtle difference between theoretical thoughts
and coding-wise implementations. This can be justified by the fact several techniques introduced
in the Literature Review section cannot be implemented within the context of the project. This is
due to the fact that speech recognition system does not have a boundary for word limit as it has a
huge to infinite text corpus. As for studies conducted in Literature Review section, some if not
all have limited themselves to a very small range of corpus, presumably less than 100 tokens.
One example being the study regarding recognition for 10 Bangla digits from a limited number
of speakers. This has caused several theoretical addressed models such as HMM, GMM, HMM-
GMM hybrid models etc. cannot be implemented within the context of this project as these
models are machine learning-based models that do not require extensive training when compared
to deep learning models like RNN, CNN-RNN hybrid models etc. .
Hardware configuration must also be taken into full consideration upon implementing projects
related to deep learning. This is due to the fact that deep learning requires extensive training with
massive number of hyperparameters in-place. Such activity consumes a wholesome of memory
in which CPU memory is typically insufficient for such tasks as studies confirm that GPU can

process such inputs 30 times faster than CPU. Due to the absence of up-to-date graphic card with
the current one only having 2 GB for GPU, several attempts on running the modelling code
result in BSOD, hence the decision to switch back to CPU.
Furthermore, to minimize the research gap, more research must be conducted on how to utilize
other modelling techniques specifically hybrid models to generate text from speech. Further
studies on hyperparameter tuning, selection, optimization and a combination of all these should
be done more extensively. Additionally, comparison with more ASR projects that are practically
implemented using similar approach, if wise, within the E-learning domain, should be done. In
terms of coping with the learning environment in an online classroom , corresponding models
with more hyperparameters that are able to capture more precise speech features should bes
tudied. Develoeprs can also start off with perform analysis and modelling on speech recognition
accuracy for audio data with higher complexity in the form of spontaneous speech, speaker
adaptive, large speech corpus and within a noisy environment as a starting point. The situation
becomes even more complicated when multiple speakers are voicing out at one point of time,
which can occur periodically in e-learning sessions. The researcher should also conduct more
analysis on the recognition accuracy of applying different machine learning models, previously
described models with more feature extraction layers added or changes in default parameter
settings to derive the most optimal model in future studies.

REFERENCES
7 Best R Packages for Machine Learning GeeksforGeeks. (2020, November 22).
https://www.geeksforgeeks.org/7-best-r-packages-for-machine-learning/
An, G. (n.d.). A Review on Speech Recognition Challenges and Approaches Related papers
ISOLAT ED WORD SPEECH RECOGNIT ION SYST EM USING HT K TJPRC Publicat
ion Speech Recognit ion Technology: A Survey on Indian Languages Hemakumar Gopal
Aut omat ic Speech Recognit ion Syst em for Isolat ed & Connect ed Words of Hindi
Language By Using H… A Review on Speech Recognition Challenges and Approaches. In
World of Computer Science and Information Technology Journal (WCSIT) (Vol. 2, Number
1).
Best Python libraries for Machine Learning - GeeksforGeeks. (2019, January 18).
GeeksforGeeks. https://www.geeksforgeeks.org/best-python-libraries-for-machine-learning/
Bhardwaj, V., Kukreja, V., Othman, M. T. ben, Belkhier, Y., Bajaj, M., Goud, B. S., Rehman, A.
U., Shafiq, M., & Hamam, H. (2022). Automatic Speech Recognition (ASR) Systems for
Children: A Systematic Literature Review. In Applied Sciences (Switzerland) (Vol. 12,
Number 9). MDPI. https://doi.org/10.3390/app12094419
Bose, S. (2022, December 20). Test Planning: A Detailed Guide | BrowserStack. BrowserStack;

BrowserStack. https://www.browserstack.com/guide/test-planning
Brownlee, J. (2019, January 15). A Gentle Introduction to Batch Normalization for Deep Neural
Networks - MachineLearningMastery.com. MachineLearningMastery.com.
https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-
networks/
Creating GUIs in R with gWidgets | R-bloggers. (2010, October 6). R-Bloggers. https://www.r-
bloggers.com/2010/10/creating-guis-in-r-with-gwidgets/
Crisp DM methodology - Smart Vision Europe. (2020, June 17). Smart Vision Europe.
https://www.sv-europe.com/crisp-dm-methodology/

Dåderman, A., Rosander, S., Skolan, K., Elektroteknik, F., & Datavetenskap, O. (n.d.).
Evaluating Frameworks for Implementing Machine Learning in Signal Processing A
Comparative Study of CRISP-DM, SEMMA and KDD.
Dobilas, S. (2022, February 21). GRU Recurrent Neural Networks — A Smart Way to Predict
Sequences in Python. Medium; Towards Data Science.
https://towardsdatascience.com/gru-recurrent-neural-networks-a-smart-way-to-predict-
sequences-in-python-80864e4fe9f6
Dua, S., Kumar, S. S., Albagory, Y., Ramalingam, R., Dumka, A., Singh, R., Rashid, M., Gehlot,
A., Alshamrani, S. S., & Alghamdi, A. S. (2022). Developing a Speech Recognition System
for Recognizing Tonal Speech Signals Using a Convolutional Neural Network. Applied
Sciences (Switzerland), 12(12). https://doi.org/10.3390/app12126223
Graves, A., Mohamed, A., & Hinton, G. E. (2013). Speech recognition with deep recurrent
neural networks. International Conference on Acoustics, Speech, and Signal Processing.
https://doi.org/10.1109/icassp.2013.6638947
‌Great Learning Team. (2022, September 19). Python NumPy Tutorial - 2023. Great Learning
Blog: Free Resources What Matters to Shape Your Career!
https://www.mygreatlearning.com/blog/python-numpy-tutorial/
Gupta, A. (2021, October 7). A Comprehensive Guide on Optimizers in Deep Learning.

Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-
guide-on-deep-learning-optimizers/#Gradient_Descent_Deep_Learning_Optimizer
Gupta, S., Jaafar, J., wan Ahmad, W. F., & Bansal, A. (2013). Feature Extraction Using Mfcc.
Signal & Image Processing : An International Journal, 4(4), 101–108.
https://doi.org/10.5121/sipij.2013.4408
Harald Scheidl. (2018, June 10). An Intuitive Explanation of Connectionist Temporal

Classification. Medium; Towards Data Science.
https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-
classification-3797e43a86c

Huang, J.-T., Li, J., & Gong, Y. (n.d.). AN ANALYSIS OF CONVOLUTIONAL NEURAL
NETWORKS FOR SPEECH RECOGNITION.
Islam, N., Beer, M., & Slack, F. (2015). E-Learning Challenges Faced by Academics in Higher
Education: A Literature Review. Journal of Education and Training Studies, 3(5).
https://doi.org/10.11114/jets.v3i5.947
Jain, K. (2017, September 11). Python vs R vs SAS | Which Data Analysis Tool should I
Learn? Analytics Vidhya. https://www.analyticsvidhya.com/blog/2017/09/sas-vs-vs-
python-tool-learn/
Jain, K. (2015, January 5). SKLearn | Scikit-Learn In Python | SciKit Learn Tutorial. Analytics
Vidhya. https://www.analyticsvidhya.com/blog/2015/01/scikit-learn-python-machine-
learning-tool/
Jain, V., & Hanley, S. (n.d.). Quickly and Easily Make Your SAS Programs Interactive with
Macro QCKGUI.
Jamal, N., Shanta, S., Mahmud, F., & Sha’Abani, M. (2017). Automatic speech recognition
(ASR) based approach for speech therapy of aphasic patients: A review. AIP Conference
Proceedings, 1883. https://doi.org/10.1063/1.5002046
Jiang, H. (2003). Conndence Measures for Speech Recognition: A Survey.
Kanabur, V., Harakannanavar, S. S., & Torse, D. (2019a). An Extensive Review of Feature
Extraction Techniques, Challenges and Trends in Automatic Speech Recognition.
International Journal of Image, Graphics and Signal Processing, 11(5), 1–12.
https://doi.org/10.5815/ijigsp.2019.05.01
Kanabur, V., Harakannanavar, S. S., & Torse, D. (2019b). An Extensive Review of Feature
Extraction Techniques, Challenges and Trends in Automatic Speech Recognition.
International Journal of Image, Graphics and Signal Processing, 11(5), 1–12.
https://doi.org/10.5815/ijigsp.2019.05.01

Kazarinoff, P. D. (2022). What is a Jupyter Notebook? - Problem Solving with Python.

Problemsolvingwithpython.com. https://problemsolvingwithpython.com/02-Jupyter-
Notebooks/02.01-What-is-a-Jupyter-Notebook/
Kelley, K. (2020, May 26). What is Data Analysis? Methods, Process and Types Explained.
Simplilearn.com; Simplilearn. https://www.simplilearn.com/data-analysis-methods-
process-types-article
LinkedIn. (2022). Linkedin.com. https://www.linkedin.com/pulse/managing-data-science-

project-aleksey-zavgorodniy/
Literature review. (2022, August 29). The University of Edinburgh.

https://www.ed.ac.uk/institute-academic-development/study-hub/learning-resources/
literature-review
M. Rammo, F., & N. Al-Hamdani, M. (2022). Detecting The Speaker Language Using CNN
Deep Learning Algorithm. Iraqi Journal for Computer Science and Mathematics, 43–52.
https://doi.org/10.52866/ijcsm.2022.01.01.005
Maatuk, A. M., Elberkawi, E. K., Aljawarneh, S., Rashaideh, H., & Alharbi, H. (2022). The
COVID-19 pandemic and E-learning: challenges and opportunities from the perspective of
students and instructors. Journal of Computing in Higher Education, 34(1), 21–38.
https://doi.org/10.1007/s12528-021-09274-2
Makhoul, J., & Schwartz, R. (1995). State of the art in continuous speech recognition (Vol. 92).
https://www.pnas.org
Module: tf.strings | TensorFlow v2.12.0. (2023). TensorFlow.

https://www.tensorflow.org/api_docs/python/tf/strings
Montantes, J. (2021, December 8). Understanding Memory Requirements for Deep Learning

and Machine Learning. Medium; Medium.
https://james-montantes-exxact.medium.com/understanding-memory-requirements-for-
deep-learning-and-machine-learning-7a04567341d2

Mseleku, Z. (2020). A Literature Review of E-Learning and E-Teaching in the Era of Covid-19
Pandemic. In International Journal of Innovative Science and Research Technology (Vol. 5,
Number 10). www.ijisrt.com
Muda, L., Begam, M., & Elamvazuthi, I. (2010). Voice Recognition Algorithms using Mel
Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques
(Vol. 2).
Muhammad, G., Alotaibi, Y. A., & Huda, M. N. (2009). Automatic speech recognition for
Bangla digits. ICCIT 2009 - Proceedings of 2009 12th International Conference on
Computer and Information Technology, 379–383.
https://doi.org/10.1109/ICCIT.2009.5407267
Musaev, M., Khujayorov, I., & Ochilov, M. (2019, September 25). Image Approach to Speech
Recognition on CNN. ACM International Conference Proceeding Series.
https://doi.org/10.1145/3386164.3389100
Palaz Mathew Magimai-Doss Ronan Collobert, D., Palaz, D., Magimai-Doss, M., & Collobert,
R. (2014). I CONVOLUTIONAL NEURAL NETWORKS-BASED CONTINUOUS SPEECH
RECOGNITION USING RAW SPEECH SIGNAL Convolutional Neural Networks-based
Continuous Speech Recognition using Raw Speech Signal.
Priyadarshani, P. G. N., Dias, N. G. J., & Punchihewa, A. (2012). Dynamic time warping based
speech recognition for isolated Sinhala words. Midwest Symposium on Circuits and
Systems, 892–895. https://doi.org/10.1109/MWSCAS.2012.6292164
Programming Language | What is Programming Language - Javatpoint. (2021). Www.javatpoint.com.

https://www.javatpoint.com/programming-language
Pryke, B. (2020, August 24). How to Use Jupyter Notebook: A Beginner’s Tutorial. Dataquest.
https://www.dataquest.io/blog/jupyter-notebook-tutorial/
Python | Introduction to Matplotlib - GeeksforGeeks. (2018, May 14). GeeksforGeeks.

https://www.geeksforgeeks.org/python-introduction-matplotlib/

Python Vs R Vs SAS | Difference Between Python, R and SAS. (2019, April 23). Besant
Technologies. https://www.besanttechnologies.com/python-vs-r-vs-sas
Quantum. (2019, August 20). Data Science project management methodologies -

DataDrivenInvestor. Medium; DataDrivenInvestor.
https://medium.datadriveninvestor.com/data-science-project-management-methodologies-
f6913c6b29eb
Ravindra Savaram. (2021, April 22). Python vs SAS vs R. Mindmajix; Mindmajix.

https://mindmajix.com/python-vs-sas-vs-r
Reddy, P. S. (n.d.). Importance of English and Different Methods of Teaching English Editor
BORJ Importance of English and Different Methods of Teaching English. Journal of
Business Management & Social Sciences Research. www.borjournals.com
Release Plan. (2020, May 11). Productplan.com. https://www.productplan.com/glossary/release-

plan/
Riyaz Ahmad Assistant Professor of English, S., & Riyaz Ahmad, S. (2016). Impact Factor: 5.2
IJAR. 2(3), 478–480. www.allresearchjournal.com
Rupali, M., Chavan, S., & Sable, G. S. (2013). An Overview of Speech Recognition Using
HMM International Journal of Computer Science and Mobile Computing An Overview of
Speech Recognition Using HMM. In IJCSMC (Vol. 2, Number 6).
https://www.researchgate.net/publication/335714660
S, K., & E, C. (2016). A Review on Automatic Speech Recognition Architecture and

Approaches. International Journal of Signal Processing, Image Processing and Pattern
Recognition, 9(4), 393–404. https://doi.org/10.14257/ijsip.2016.9.4.34
Saleh, A. (n.d.). Speech Recognition with Dynamic Time Warping using MATLAB Related
papers Analyt ical Review of Feat ure Ext ract ion Techniques for Aut omat ic Speech
Recognit ion IOSR Journals Mobilit y Enhancement for Elderly Ramviyas Parasuraman
Analyt ical Review of Feat ure Ext ract ion Technique for Aut omat ic Speech Recognit ion
himanshu chaurasiya Download a PDF Pack of t he best relat ed papers .

SAS Visual Data Mining and Machine Learning Software. (2020). Sas.com.
https://www.sas.com/en_nz/software/visual-data-mining-machine-learning.html
SAS vs. R vs. Python - Javatpoint. (2021). Www.javatpoint.com.

https://www.javatpoint.com/sas-vs-r-vs-python
SciPy. (2022). Scipy.org. https://scipy.org/
Senin, P. (2008). Dynamic Time Warping Algorithm Review.
Shao, Y., & Wang, L. (2008). E-Seminar: an audio-guide e-learning system. 2008 International
Workshop on Education Technology and Training and 2008 International Workshop on
Geoscience and Remote Sensing, ETT and GRS 2008, 1, 80–83.
https://doi.org/10.1109/ETTandGRS.2008.292
Simplilearn. (2021, March 4). What Is Keras: The Best Introductory Guide To Keras.
Simplilearn.com; Simplilearn. https://www.simplilearn.com/tutorials/deep-learning-
tutorial/what-is-keras
Singh, M. (2021, August 15). How to build beautiful GUI’s with R - Manpreet Singh - Medium.
Medium; Medium. https://preettheman.medium.com/how-build-beautiful-guis-with-r-
1392997133e9
Søndergaard, P. (n.d.). Next fast FFT size. Retrieved May 5, 2023, from

https://ltfat.org/notes/ltfatnote017.pdf
Team, K. (2023). Keras documentation: Dense layer. Keras.io.

https://keras.io/api/layers/core_layers/dense/
Team, K. (2023). Keras documentation: Dropout layer. Keras.io.

https://keras.io/api/layers/regularization_layers/dropout/
Team, K. (2014). Keras documentation: GRU layer. Keras.io.

https://keras.io/api/layers/recurrent_layers/gru/
Team, T. (2020, February 24). SAS vs R vs Python - The Battle for Data Science! - TechVidvan.
TechVidvan. https://techvidvan.com/tutorials/sas-vs-r-vs-python/

Techopedia. (2011, June 10). Database Management System (DBMS). Techopedia.com;

Techopedia. https://www.techopedia.com/definition/24361/database-management-
systems-dbms
‌The LJ Speech Dataset. (2016). Keithito.com. https://keithito.com/LJ-Speech-Dataset/
‌Top 5 Best Python GUI Libraries - AskPython. (2020, August 20). AskPython.
https://www.askpython.com/python-modules/top-best-python-gui-libraries
Understanding operating systems - University of Wollongong – UOW. (2022). Uow.edu.au.

https://www.uow.edu.au/student/learning-co-op/technology-and-software/operating-
systems/
‌Unit Testing - javatpoint. (2021). Www.javatpoint.com. https://www.javatpoint.com/unit-testing
‌Vijay. (2023, March 20). What is User Acceptance Testing (UAT): A Complete Guide. Software
Testing Help. https://www.softwaretestinghelp.com/what-is-user-acceptance-testing-uat/
Vijaysinh Lendave. (2021, August 28). LSTM Vs GRU in Recurrent Neural Network: A

Comparative Study. Analytics India Magazine. https://analyticsindiamag.com/lstm-vs-gru-
in-recurrent-neural-network-a-comparative-study/
Varol, A., Institute of Electrical and Electronics Engineers. Lebanon Section., & Institute of
Electrical and Electronics Engineers. (n.d.). 8th International Symposium on Digital
Forensics and Security : 1-2 June 2020, Beirut, Lebanon.
Wald, M. (2006). An exploration of the potential of Automatic Speech Recognition to assist and
enable receptive communication in higher education. ALT-J, 14(1), 9–20.
https://doi.org/10.1080/09687760500479977
Walsh, P., & Meade, J. (2003). Speech Enabled E-Learning for Adult Literacy Tutoring.
www.literacytools.ie/speech/.
Wang, D., Wang, X., & Lv, S. (2019). End-to-end Mandarin speech recognition combining CNN
and BLSTM. Symmetry, 11(5). https://doi.org/10.3390/sym11050644

Welcome to python_speech_features’s documentation! — python_speech_features 0.1.0

documentation. (2013). Readthedocs.io.
https://python-speech-features.readthedocs.io/en/latest/
What is Activity Diagram? (2022). Visual-Paradigm.com.

https://www.visual-paradigm.com/guide/uml-unified-modeling-language/what-is-activity-
diagram/
‌What is an IDE? IDE Explained - AWS. (2022). Amazon Web Services, Inc.
https://aws.amazon.com/what-is/ide/
‌What is IDE or Integrated Development Environments? | Veracode. (2020, December 8).

Veracode. https://www.veracode.com/security/integrated-development-environment
What is Operating System? Explain Types of OS, Features and Examples. (2020, December 21).
Guru99. https://www.guru99.com/operating-system-tutorial.html
What is Use Case Diagram? (2022). Visual-Paradigm.com.

https://www.visual-paradigm.com/guide/uml-unified-modeling-language/what-is-use-case-
diagram/
‌What is Use Case Specification? (2022). Visual-Paradigm.com. https://www.visual-

paradigm.com/guide/use-case/what-is-use-case-specification/
Yamashita, R., Nishio, M., Do, R. K. G., & Togashi, K. (2018). Convolutional neural networks:
an overview and application in radiology. In Insights into Imaging (Vol. 9, Number 4, pp.
611–629). Springer Verlag. https://doi.org/10.1007/s13244-018-0639-9
du Simplon, R., McCowan, I., Moore, D., Dines, J., Gatica-Perez, D., Flynn, M., Wellner, P., &
Bourlard, H. (2005). I D I A P On the Use of Information Retrieval Measures for Speech
Recognition Evaluation. www.idiap.ch
matplotlib.pyplot.axvline — Matplotlib 3.7.1 documentation. (2023). Matplotlib.org.

https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.axvline.html

matplotlib.pyplot.imshow — Matplotlib 3.7.1 documentation. (2023). Matplotlib.org.

https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.imshow.html
matplotlib.pyplot.xticks — Matplotlib 3.7.1 documentation. (2023). Matplotlib.org.

https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xticks.html
mhadhbi, N. (2021, December 21). Python Tutorial: Streamlit. Datacamp.com; DataCamp.

https://www.datacamp.com/tutorial/streamlit
tf.data.Dataset | TensorFlow v2.12.0. (2023). TensorFlow.

https://www.tensorflow.org/api_docs/python/tf/data/Dataset
tf.keras.layers.StringLookup | TensorFlow v2.12.0. (2023). TensorFlow.

https://www.tensorflow.org/api_docs/python/tf/keras/layers/StringLookup
‌tf.signal.stft | TensorFlow v2.12.0. (2023). TensorFlow.

https://www.tensorflow.org/api_docs/python/tf/signal/stft

APPENDICES
FYP TURNITIN Report (First 2 Pages)
Figure 199: FYP TURNITIN Report (1)

Figure 200: FYP TURNITIN Report (2)

Library Form
Figure 201: Library Form

Confidentiality Document
Figure 202: Confidentiality Document

FYP Poster

Figure 203: FYP Poster

Project Log Sheets
Figure 204: Project Log Sheet Semester 1 (1)






Project Proposal Form (PPF)
Figure 210: PPF (1)

Figure 211: PPF (2)

Figure 212: PPF (3)

Figure 213: PPF (4)

Figure 214: PPF (5)

Figure 215: PPF (6)

Figure 216: PPF (7)

Figure 217: PPF (8)

Project Specification Form (PSF)
Figure 218: PSF (1)

Figure 219: PSF (2)

Figure 220: PSF (3)

Figure 221: PSF (4)

Figure 222: PSF (5)

Figure 223: PSF (6)

Figure 224: PSF (7)

Figure 225: PSF (8)

Figure 226: PSF (9)

Figure 227: PSF (10)

Ethics Form
Figure 228: Fast Track Ethnic Form (1)




Gantt Chart for FYP
Figure 232: Gantt Chart for FYP (1)

Figure 233: Gantt Chart for FYP (2)

Andrew Nathan Lee MR Tp059923 Fyp Apu3f2209 Csda

Uploaded by

Copyright:

Available Formats

You might also like

Andrew Nathan Lee MR Tp059923 Fyp Apu3f2209 Csda

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Andrew Nathan Lee MR Tp059923 Fyp Apu3f2209 Csda

Uploaded by

Copyright:

Available Formats

Andrew Nathan Lee TP059923

Automatic Speech Recognition (ASR) system for e-learning in English Language

ANDREW NATHAN LEE

A project submitted in partial fulfilment of the requirements of Asia Pacific University of

BSc (Hons) in Computer Science specialism in Data Analytics

Supervised by Mr. AMARDEEP

2nd Marker: Ms. HEMA LATHA KRISHNA NAIR

P a g e | 1 of 294 Asia Pacific University of Technology and Innovation

P a g e | 2 of 294 Asia Pacific University of Technology and Innovation

P a g e | 3 of 294 Asia Pacific University of Technology and Innovation

2.3.4 Comparison of Similar Systems................................................................................................54

P a g e | 4 of 294 Asia Pacific University of Technology and Innovation

5.4.4.4 Generate Transcript Label..................................................................................................89

P a g e | 5 of 294 Asia Pacific University of Technology and Innovation

7.2.2 Interface Design......................................................................................................................166

P a g e | 6 of 294 Asia Pacific University of Technology and Innovation

9.2 Sample Codes................................................................................................................................212

Figure 1: Speech transmission and formulation process (Varol et al., n.d.)...............................................18

P a g e | 7 of 294 Asia Pacific University of Technology and Innovation

Figure 2: ASR system classification (Bhardwaj et al., 2022........................................................................32

P a g e | 8 of 294 Asia Pacific University of Technology and Innovation

P a g e | 9 of 294 Asia Pacific University of Technology and Innovation

P a g e | 10 of 294 Asia Pacific University of Technology and Innovation

P a g e | 11 of 294 Asia Pacific University of Technology and Innovation

Figure 138: Activity Diagram for password changing function................................................................162

P a g e | 12 of 294 Asia Pacific University of Technology and Innovation

P a g e | 13 of 294 Asia Pacific University of Technology and Innovation

Figure 219: PSF (2)...................................................................................................................................280

P a g e | 14 of 294 Asia Pacific University of Technology and Innovation

Table 1: Comparison of similar speech recognition systems.....................................................................55

P a g e | 15 of 294 Asia Pacific University of Technology and Innovation

Table 43: Result for User Acceptance Testing (4)....................................................................................246

CHAPTER 1: INTRODUCTION TO THE STUDY

1.1 Background of the Project

P a g e | 16 of 294 Asia Pacific University of Technology and Innovation

P a g e | 17 of 294 Asia Pacific University of Technology and Innovation

Figure 1: Speech transmission and formulation process (Varol et al., n.d.)

P a g e | 18 of 294 Asia Pacific University of Technology and Innovation

P a g e | 19 of 294 Asia Pacific University of Technology and Innovation

1.2 Problem Context

P a g e | 20 of 294 Asia Pacific University of Technology and Innovation

P a g e | 21 of 294 Asia Pacific University of Technology and Innovation

P a g e | 22 of 294 Asia Pacific University of Technology and Innovation

1.4 Potential Benefits

1.4.1 Tangible Benefits

1.4.2 Intangible Benefits

P a g e | 23 of 294 Asia Pacific University of Technology and Innovation

1.5 Target Users

P a g e | 24 of 294 Asia Pacific University of Technology and Innovation

1.6 Scopes & Objectives

⁕ To incorporate mathematical notation, statistical and machine learning paradigms in

⁕ To enhance the learning experience of students and teaching experience of lecturers

The key deliverables of the ASR system are stated as follows:

1.6.4 Nature of Challenges

P a g e | 25 of 294 Asia Pacific University of Technology and Innovation

P a g e | 26 of 294 Asia Pacific University of Technology and Innovation

1.7 Overview of this Investigation Report