Sharifuddin 2019
Voice Control Intelligent Wheelchair Movement Using CNNs
Mohammad Shahrul Izham Sharifuddin, Sharifalillah Nordin, Azliza Mohd Ali
Faculty of Computer and Mathematical Sciences,
Universiti Teknologi MARA
Shah Alam, Selangor, Malaysia
mohammadshahrulizham@gmail.com, {sharifa, azliza}@tmsk.uitm.edu.my
Abstract— In this paper, we introduce voice-controlled intelligent wheelchair movement using Convolutional Neural Networks (CNNs). The intelligent wheelchair uses four voice commands, stop, go, left, and right, to help disabled people move. Data are collected from Google in WAV format, and Mel-Frequency Cepstral Coefficients (MFCC) are applied to extract features from the voice commands. The hardware used to deploy the system is a Raspberry Pi 3B+. The proposed method uses CNNs to classify the voice commands and achieves an excellent result of 95.30% accuracy. Therefore, the method can be commercialized and will hopefully benefit the disabled community.

Ever since that time, voice identification has developed rapidly and has been implemented in various domains [2]. Several voice identification products are currently available on the market, such as Siri, Google Assistant, Alexa, and Cortana. This software is designed to help users apply voice recognition on a device.

Different voice properties require different extraction and classification methods. Table I shows the five properties of voice characteristics used in this project: type of speech signal, type of speakers, type of language, type of vocabulary, and background noise.

TABLE I. PROPERTIES OF VOICE RECOGNITION
Authorized licensed use limited to: Auckland University of Technology. Downloaded on June 06,2020 at 16:11:03 UTC from IEEE Xplore. Restrictions apply.
voice recognition and convolutional neural networks. The proposed methodology is demonstrated in Section III. Experimentation details and results are presented in Section IV, and, finally, the paper is concluded in Section V.

II. RELATED WORKS

A voice-controlled wheelchair enables people with disabilities to move their wheelchair using their voice. An electric wheelchair is better than a manual one when a person wants to move alone, but the user still needs their hands to control it. With the advancement of technology, especially the application of artificial intelligence, a wheelchair can be moved using voice recognition. Barriuso [4] proposed an agent-based intelligent interface for wheelchair movement control with a smartphone application. Another application, by Nasrin [5], applied voice recognition together with GPS so that wheelchair users can track where they are; they also produced a mobile app for wheelchair users. However, the user must have a Wi-Fi connection to activate the GPS features. Meanwhile, Avutu et al. [1] proposed a voice recognition wheelchair with a low-cost map that can only be applied to a local area; the cost is therefore much lower because the application can be used without a Wi-Fi connection. In another application, proposed by [6], the system used an Arduino Mega 2560 as the controller and additional devices to enhance security, for example sending emergency messages to designated contacts. Ultrasonic sensors were applied to detect and avoid collisions with obstacles. The best part of this system is that it also works without an internet connection.

Recently, much research related to voice recognition has applied CNNs. CNNs are among the neural network techniques that are currently popular, and they achieve better results than other supervised neural network techniques such as backpropagation. The backpropagation neural network (BPNN) is a second-generation neural network that, similar to CNNs, calculates error to produce learning [7]. A BPNN consists of three layers (input, hidden, and output), while a CNN involves one or more convolutional layers, pooling layers, and a fully connected layer [8]. CNNs are feedforward neural networks and generalise better than networks with full connectivity between adjacent layers [9]. CNNs have been applied with great success in many other applications such as image recognition [10], surveillance [11], and person identification [12][13]. Lei and She [14] applied CNNs to authenticate voices in a noisy environment; their method reduces the equal error rate and produces better accuracy in voice authentication. Guan [15] applied CNNs to optimise speech recognition performance and improved recognition with an error rate of 13.88%.

III. PROPOSED METHOD

CNNs are among the most powerful techniques used in voice recognition research. They provide high accuracy, especially in solving image recognition and speech recognition problems [16]. CNNs filter noise well and highlight important points in the dataset. The proposed CNN is divided into two parts, where the first is the feature extraction layer and the second is the classification layer.

A. Feature Extraction Layer

The feature extraction layer consists of two kinds of layers, i.e., convolutional layers and pooling layers. Three 2D convolutional layers and three 2D max pooling layers are used in this project.

B. Classification Layer

The classification layer is used to classify the class. Three layers need to be set in this project: the input layer, the hidden layer, and the output layer.

Fig. 1. Flow of the system

Fig. 1 shows the flow of the system. First, the user, i.e., the disabled person, pushes a button while recording the voice command with the microphone. The recorded voice is then processed on the Raspberry Pi 3B+ using MFCC and the CNN. After the system recognizes the voice, an output signal is sent to the motor driver, which determines the movement of the motors and moves the robot car.

IV. EXPERIMENTATIONS AND RESULTS

A. Data Sample

In this paper, we collected raw voice data for four commands: go, left, right, and stop. In addition, two types of noise, urban noise and white noise, were also collected. The data were downloaded from Google, except for the white noise, which was self-recorded because white noise audio data were not available online. The raw voice data are saved in WAV format, and each clip is one second long. Table II shows the number of command samples downloaded and recorded in this research. The total amount of collected data is 14,145: 11,845 samples were collected from Google and 2,300 were self-recorded.

B. Mel Frequency Cepstral Coefficients (MFCC)

MFCC feature extraction is very close to the human auditory system, which makes it the most suitable technique to be used here.
MFCC places the frequency bands on a logarithmic scale, which allows speech to be processed better [17]. In this step, feature extraction obtains the signal features, which are later used for feature matching. The MFCC processing is done with the Librosa library, one of the well-known libraries for extracting features from audio signals. In this process, 20 MFCC coefficients are extracted from the raw audio signal. The raw MFCC data are two-dimensional; therefore, the data preparation differs for different algorithms.

TABLE II. NUMBER OF SAMPLES PER VOICE COMMAND

Sources            Command       Total of data
Google             Go            2,372
                   Stop          2,380
                   Left          2,353
                   Right         2,367
                   Urban Noise   2,373
Self-Recorded      White Noise   2,300
Total of the data                14,145

C. Data Preparation

A 2-dimensional CNN is used in this research; therefore, the dataset needs to be prepared in 2D. Since a 2D CNN already takes 2D input, no compression is needed, and the data can be taken directly from the raw MFCC output. Before that, we also need to make sure that every sample in the dataset has the same shape. For this project, the shape is set to (44, 20); because each clip is only one second long, the data shape is not large. If a sample exceeds the set shape, the excess is cut off, and if it is smaller, it is padded with zeros. Lastly, one channel is used, i.e., grayscale. Figure 2 shows the flow of the data preparation for the CNN.

Fig. 2. Flow of the data preparation for CNNs: extract the signal using MFCC (output shape (44, 20)), normalize the data to 0–1 (output shape (44, 20)), pad with zeros or cut off to keep the shape (44, 20), and feed the data into the 2D CNN.

D. BPNN Parameter Tuning

We also trained the data using a backpropagation neural network to compare BPNN with CNNs. In the BPNN parameter tuning, 12 experiments were done. We set the input layer to 20 nodes, with 6 hidden layers of 20 nodes each and 6 output nodes. Table III shows the best model achieved after the 12 experiments were conducted. In this model, the best accuracy achieved is 74.87%, in 0.61270 minutes.

TABLE III. BPNN BEST MODEL

Layers          Nodes   Activation   Bias
Input           20      -            -
Hidden Layer    20      relu         True
Hidden Layer    20      relu         True
Hidden Layer    20      relu         True
Hidden Layer    20      relu         True
Hidden Layer    20      relu         True
Hidden Layer    20      relu         True
Output          6       softmax      True
Optimizer       Adam (lr = 0.001)
Loss Function   Logcosh
Epochs          200
Batch Size      40

E. CNNs Parameter Tuning

In the CNN parameter tuning, 12 experiments were done. Table IV shows the best model achieved after the 12 experiments were conducted. In this model, the best accuracy achieved is 95.30%, in 4.1672 minutes.

TABLE IV. CNNS BEST MODEL

Layers           Kernel Size   Stride   Filters   Padding   Nodes     Bias    Activation
Input            -             -        -         -         20,44,1   -       -
2D Conv          2             1        16        valid     -         False   relu
2D Max Pooling   2             2        -         valid     -         -       -
2D Conv          2             1        32        valid     -         False   relu
2D Max Pooling   2             2        -         valid     -         -       -
2D Conv          2             1        64        valid     -         False   relu
2D Max Pooling   2             2        -         valid     -         -       -
Flatten          -             -        -         -         -         -       -
Output           -             -        -         -         6         False   softplus
Optimizer        Adam
Loss             mean_squared_error
Epochs           50
Batch Size       10

F. Comparison between BPNN and CNNs

In this project, we can see that the CNN produced 95.30% accuracy, higher than the BPNN, which achieved only 74.87%.
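To make the CNN configuration in Table IV concrete, the feature-map shapes can be traced layer by layer using the standard output-size formula for 'valid' padding. The plain-Python sketch below is an illustration, not the authors' code, and it assumes the table's kernel size 2 means a 2×2 window applied to the (20, 44, 1) MFCC input:

```python
def out_size(n, kernel, stride):
    # Output length for 'valid' padding: floor((n - kernel) / stride) + 1.
    return (n - kernel) // stride + 1

def trace_table_iv(height=20, width=44):
    """Trace feature-map shapes through the Table IV stack (shapes only)."""
    layers = [
        ("2D Conv (16 filters)", 2, 1, 16),
        ("2D Max Pooling",       2, 2, None),
        ("2D Conv (32 filters)", 2, 1, 32),
        ("2D Max Pooling",       2, 2, None),
        ("2D Conv (64 filters)", 2, 1, 64),
        ("2D Max Pooling",       2, 2, None),
    ]
    channels = 1
    for name, kernel, stride, filters in layers:
        height = out_size(height, kernel, stride)
        width = out_size(width, kernel, stride)
        if filters is not None:
            channels = filters
        print(f"{name}: ({height}, {width}, {channels})")
    return height * width * channels  # units produced by the Flatten layer

flat_units = trace_table_iv()
print("Flatten feeds", flat_units, "units into the 6-node softplus output")  # 256 units
```

Under this reading, the final pooling layer leaves a (1, 4, 64) feature map, so the Flatten layer passes 256 units to the six-node output, one node per command or noise class in Table II.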
In the time comparison, the CNN took 4.1672 minutes to run while the BPNN took only 0.61270 minutes. This is because the CNN has a feature extraction layer that is able to filter the noise from the audio before it is fed into the classification layer.

TABLE V. MODEL COMPARISON

Model   Accuracy   Time (m)
BPNN    74.87%     0.61270
CNNs    95.30%     4.1672

G. Hardware Tuning

Table VI shows the logic levels used with the LN298N motor driver; these logic levels control the direction of the motors and move the demo car according to the required command.

TABLE VI. MOTOR DIRECTION

Motor Driver Pin LN298N        Motor Direction
In1   In2   In3   In4
0     0     0     0            Stop
0     1     0     1            Go
1     0     0     1            Left
0     1     1     0            Right

V. CONCLUSION

In this paper, we developed a prototype that is able to classify commands based on features extracted with MFCC. With the highest accuracy of 95.30%, this prototype can be considered reliable and efficient. It is our hope that the prototype can help disabled people move around without another person's help. However, there are several limitations in this study that can be considered in future enhancements/research, such as the quality of the data, the number of vocabulary items used, and the type of language used.

ACKNOWLEDGEMENT

The authors would like to thank the Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Malaysia, for the support throughout this research.

REFERENCES

[1] S. R. Avutu, "Voice control module for low cost local-map navigation based intelligent wheelchair," in 2017 IEEE 7th Int. Adv. Comput. Conf., 2017, pp. 609–613.
[2] S. Khan, H. Akmal, I. Ali, and N. Naeem, "Efficient and unique learning of in-car voice control for engineering education," 2017, pp. 1–6.
[3] R. Lang, M. Lescisin, and Q. H. Mahmoud, "Selecting a development board for your capstone or course project," IEEE Potentials, vol. 37, pp. 6–14, 2018.
[4] A. L. Barriuso, J. Pérez-Marcos, D. M. Jiménez-Bravo, G. V. González, and J. F. De Paz, "Agent-based intelligent interface for wheelchair movement control," Sensors, vol. 18, no. 1511, pp. 1–31, 2018.
[5] N. Aktar, I. Jahan, and B. Lala, "Voice recognition based intelligent wheelchair and GPS tracking system," in 2019 International Conference on Electrical, Computer and Communication Engineering (ECCE), 2019, pp. 7–9.
[6] Z. Raiyan, "Design of an Arduino based voice-controlled automated wheelchair," pp. 21–23, 2017.
[7] T. Bouwmans, S. Javed, M. Sultana, and S. Ki, "Deep neural network concepts for background subtraction: A systematic review and comparative evaluation," Neural Networks, vol. 117, pp. 8–66, 2019.
[8] A. A. Amri, A. R. Ismail, and A. A. Zarir, "Comparative performance of deep learning and machine learning algorithms on imbalanced handwritten data," Int. J. Adv. Comput. Sci. Appl., vol. 9, no. 2, pp. 258–264, 2018.
[9] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, 2015.
[10] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 512–519.
[11] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. F. Li, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
[12] X. Wang, A. Mohd Ali, and P. Angelov, "Gender and age classification of human faces for automatic detection of anomalous human behaviour," in International Conference on Cybernetics (CYBCONF 2017), 2017, pp. 1–6.
[13] A. M. Ali and P. Angelov, "Anomalous behaviour detection based on heterogeneous data and data fusion," Soft Comput., 2018.
[14] L. Lei and K. She, "Identity vector extraction by perceptual wavelet packet entropy and convolutional neural network," 2018.
[15] W. Guan, "Performance optimization of speech recognition system with deep neural network model," vol. 27, no. 4, pp. 272–282, 2018.
[16] A. Incze, H.-B. Jancso, Z. Szilagyi, A. Farkas, and C. Sulyok, "Bird sound recognition using a convolutional neural network," in IEEE 16th International Symposium on Intelligent Systems and Informatics, 2018, pp. 295–300.
[17] S. T. S. and C. Lingam, "Speaker based language independent isolated speech recognition system," in 2015 International Conference on Communication, Information & Computing Technology (ICCICT), 2015, pp. 1–7.