
Voice Control Intelligent Wheelchair Movement

Using CNNs
Mohammad Shahrul Izham Sharifuddin¹, Sharifalillah Nordin², Azliza Mohd Ali³
¹,²,³Faculty of Computer and Mathematical Sciences,
Universiti Teknologi MARA
Shah Alam, Selangor, Malaysia
¹mohammadshahrulizham@gmail.com, {²sharifa, ³azliza}@tmsk.uitm.edu.my

Abstract— In this paper, we introduce voice-controlled intelligent wheelchair movement using Convolutional Neural Networks (CNNs). The intelligent wheelchair uses four voice commands, namely stop, go, left, and right, to assist disabled people to move. Data are collected from Google in WAV format. Mel-Frequency Cepstral Coefficients (MFCC) are applied to extract features from the command voice. The hardware used to deploy the system is a Raspberry PI 3B+. The proposed method uses CNNs to classify the voice commands and achieves an excellent result of 95.30% accuracy. Therefore, the method can be commercialized and will hopefully benefit the disabled community.

Keywords— Classification, CNNs, Voice Command, MFCC

I. INTRODUCTION

The wheelchair is an important tool required by people with mobility impairments due to illness, injury, or disability. There are two types of wheelchair: motorized and manual. A motorized wheelchair is integrated with an electric motor and usually has a joystick at the armrest, while a manual wheelchair is operated manually by a person behind the chair. Elderly people and children with quadriplegia, spinal cord injuries (SCI), or amputation require a motorized wheelchair rather than a manual one [1]. Much current research focuses on creating motorized wheelchairs to help impaired people. Researchers have conducted many experiments to increase the accuracy of voice recognition using different feature extraction and classification techniques such as Support Vector Machines (SVM), Hidden Markov Models (HMM), Linear Prediction Coefficients (LPC), and Mel-Frequency Cepstral Coefficients (MFCC).

Voice recognition is a technique that enables the human voice to control a device. Buttons and switches are no longer needed; people can easily control appliances with their voice while doing other tasks. Voice command devices (VCD) can be found in many home appliances; lights and washing machines, for example, can be controlled using voice. In the 1960s, Texas Instruments started to improve voice, or speech, identification technology. Since then, voice identification has developed aggressively and has been implemented in various domains [2]. Several voice identification products are currently available in the market, such as Siri, Google Assistant, Alexa, and Cortana. This software is designed to help users apply voice recognition on a device.

Different voice properties require different extraction and classification methods. Table I shows the five properties of the voice characteristics used in this project: type of speech signal, type of speakers, type of language, type of vocabulary, and background noise.

TABLE I.  PROPERTIES OF VOICE RECOGNITION

Properties             Characteristic
Type of speech signal  Isolated words
Type of speakers       Independent speakers
Type of language       English
Type of vocabulary     Small
Background noise       True

A development board is a small, compact circuit board that contains a microprocessor, a microcontroller, or both. It supplies the necessary hardware and software components for bottom-up design and programming [3]. The board is used to process the signal or command and to control the device. Various development boards are available in the market, each with its own advantages and disadvantages; examples include the Raspberry PI, Banana PI, Arduino, and SK40C. The development board used in this project is the Raspberry PI 3B+, because it is easy to program, cheaper than other Single Board Computers (SBC), and faster in processing.

In this paper, we propose an approach that recognizes a voice command given by the user and sends a signal to the motor to control the movement of the wheelchair. The rest of this paper is organised as follows.


Authorized licensed use limited to: Auckland University of Technology. Downloaded on June 06,2020 at 16:11:03 UTC from IEEE Xplore. Restrictions apply.
Section II reviews the related work on voice recognition and convolutional neural networks. The proposed methodology is presented in Section III. Experimental details and results are presented in Section IV, and, finally, the paper is concluded in Section V.

II. RELATED WORKS

A voice-controlled wheelchair enables disabled people to move their wheelchair using voice. An electric wheelchair is better than a manual one when people want to move alone, but it still requires a hand to control it. With the advancement of technology, especially artificial intelligence, a wheelchair can be moved using voice recognition. Barriuso [4] proposed an agent-based intelligent interface for wheelchair movement control with a smartphone application. Another application, proposed by Nasrin [5], applied voice recognition together with GPS so that wheelchair users can track their location; they also produced a mobile app for the wheelchair user. However, the user must have a Wi-Fi connection to activate the GPS features. Meanwhile, Avutu et al. [1] proposed a voice recognition wheelchair with a low-cost map that applies only to the local area; the cost is therefore much lower because the application can be used without a Wi-Fi connection. Another system, proposed by [6], applied an Arduino Mega 2560 as the controller and added devices to enhance security, for example by sending emergency messages to important contacts. They applied ultrasonic sensors to detect and avoid any obstacles. The best part of this system is that it also works without an internet connection.

Recently, much research related to voice recognition has applied CNNs. CNNs are among the currently popular neural network techniques and achieve better results than other supervised neural network techniques such as backpropagation. The backpropagation neural network (BPNN) is the second generation of neural networks and, like CNNs, calculates error to produce learning [7]. A BPNN consists of three layers (input, hidden, and output), while a CNN involves one or more convolutional layers, pooling layers, and fully connected layers [8]. CNNs are feedforward neural networks and have better generalisation than networks with full connectivity between adjacent layers [9]. CNNs have been applied with great success in many other applications such as image recognition [10], surveillance [11], and person identification [12][13]. Lei and She [14] applied CNNs to authenticate voice in a noisy environment; the proposed method reduces the equal error rate and produces better accuracy in voice authentication. Guan [15] applied CNNs to optimise performance in speech recognition, improving recognition performance to an error rate of 13.88%.

III. PROPOSED METHOD

CNNs are one of the most powerful techniques used in voice recognition research. CNNs provide high accuracy, especially in solving image recognition and speech recognition problems [16]. CNNs operate well in filtering noise and highlight important points in the dataset. The CNN is divided into two parts, where the first is the feature extraction layer and the second is the classification layer.

A. Feature Extraction Layer

The feature extraction layer consists of two layer types, i.e. the convolutional layer and the pooling layer. Three 2D convolutional layers and three 2D max pooling layers are used in this project.

B. Classification Layer

The classification layer is used to classify the class. Three layers need to be set in this project: the input layer, the hidden layer, and the output layer.

Fig. 1. Flow of the system

Fig. 1 shows the flow of the system. First, the user, i.e. the disabled person, pushes the button while recording the voice command using the microphone. The recorded voice is then processed inside the Raspberry PI 3B+ using MFCC and the CNN. After the system recognizes the voice, the output signal is sent to the motor driver, which determines the movement of the motor and moves the robot car.

IV. EXPERIMENTATIONS AND RESULTS

A. Data Sample

In this paper, we collected raw voice data for four commands: go, left, right, and stop. In addition, two types of noise, urban noise and white noise, were also collected. The data were downloaded from Google, except for the white noise, which was self-recorded because white noise audio data were not available online. The raw voice data are saved in WAV format, and each voice clip is one second long. Table II shows the number of commands downloaded and recorded in this research. The total collected data are 14,145 samples: 11,845 collected from Google and 2,300 self-recorded.

B. Mel Frequency Cepstral Coefficients (MFCC)

MFCC feature extraction is very close to the human auditory system and is therefore the most suitable technique for this task.
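MFCC is considered close to human hearing partly because it spaces its analysis bands on the mel scale. As an aside not in the paper, the standard Hz-to-mel formula makes this concrete; a minimal numerical sketch (the actual pipeline delegates this to a feature-extraction library):

```python
import numpy as np

# Standard mel-scale mapping (not given in the paper): m = 2595 * log10(1 + f/700).
def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# Ten band edges spaced evenly on the mel axis between 0 Hz and 8 kHz:
# equal perceptual steps correspond to ever-wider steps in Hz.
edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 10))
widths = np.diff(edges_hz)
print(np.round(widths).astype(int))  # successive band widths grow with frequency
```

Low frequencies, where speech carries most of its discriminative detail, therefore get finer resolution than high frequencies.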

MFCC places the frequency bands logarithmically, which allows speech to be processed better [17]. In this step, feature extraction obtains the signal features, which are later used in feature matching. The MFCC process is done using the Librosa library, one of the well-known libraries for extracting features from an audio signal. In this process, 20 MFCCs are extracted from the raw audio signal. The raw MFCC data are 2-dimensional; therefore, the data preparation differs for different algorithms.

TABLE II.  NUMBER OF SAMPLES PER VOICE COMMAND

Sources        Command      Total of data
Google         Go           2,372
               Stop         2,380
               Left         2,353
               Right        2,367
               Urban Noise  2,373
Self-Recorded  White Noise  2,300
Total of the data           14,145

C. Data Preparation

A 2-dimensional CNN is used in this research; therefore, the dataset needs to be prepared as 2D. Since a 2D CNN already takes a 2D dataset, no compression is needed and the raw MFCC output can be used directly. We also need to make sure that the whole dataset has the same shape. For this project, the shape of the data is set to (44, 20); our clips are only one second long, so the shape of the data is not large. If the shape of the data exceeds the set shape, the excess is cut off, and if the data shape is not enough, it is padded with zeros. Lastly, the channel used is 1, which means grayscale. Fig. 2 shows the flow of the data preparation for the CNN: extract the signal using MFCC (output shape (44, 20)); normalize the data to the range 0-1 (output shape (44, 20)); cut off the excess if the output shape is more than (44, 20) and pad with zeros if it is less than that; then feed the data into the 2D CNN.

Fig. 2. Flow of the data preparation for CNNs

D. BPNN Parameter Tuning

We also trained the data using a backpropagation neural network to compare BPNN and CNNs. In the BPNN parameter tuning, 12 experiments were done. We set the input layer to 20 nodes, with 6 hidden layers of 20 nodes each and 6 outputs. Table III shows the best model achieved after the 12 experiments were conducted. In this model, the best accuracy achieved is 74.87%, with a training time of 0.61270 minutes.

TABLE III.  BPNN BEST MODEL

Layers         Nodes  Activation  Bias
Input          20     -           -
Hidden Layer   20     relu        True
Hidden Layer   20     relu        True
Hidden Layer   20     relu        True
Hidden Layer   20     relu        True
Hidden Layer   20     relu        True
Hidden Layer   20     relu        True
Output         6      softmax     True
Optimizer      Adam (lr = 0.001)
Loss Function  Logcosh
Epochs         200
Batch Size     40

E. CNNs Parameter Tuning

In the CNN parameter tuning, 12 experiments were done. Table IV shows the best model achieved after the 12 experiments were conducted. In this model, the best accuracy achieved is 95.30%, with a training time of 4.1672 minutes.

TABLE IV.  CNNS BEST MODEL

Layers   Kernel Size  Stride  Filters  Padding  Nodes    Bias   Activation
Input    -            -       -        -        20,44,1  -      -
2D       2            1       16       valid    -        False  relu
2D Max   2            2       -        valid    -        -      -
2D       2            1       32       valid    -        False  relu
2D Max   2            2       -        valid    -        -      -
2D       2            1       64       valid    -        False  relu
2D Max   2            2       -        valid    -        -      -
Flatten  -            -       -        -        -        -      -
Output   -            -       -        -        6        False  softplus
Optimizer   Adam
Loss        mean_squared_error
Epochs      50
Batch Size  10

F. Comparison between BPNN and CNNs

In this project, the CNN produced 95.30% accuracy, higher than the BPNN, which achieved only 74.87%. In terms of time, the CNN took 4.1672 minutes to run while the BPNN took only 0.61270 minutes. This is because the CNN has feature extraction layers that filter the noise of the audio before feeding it into the classification layer.
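The fixed-shape preparation described in Section IV-C can be sketched as follows. This is an illustrative sketch, not the authors' code: the MFCC matrix is simulated with random values, where the real pipeline would obtain it from Librosa, e.g. `librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=20).T`.

```python
import numpy as np

TARGET_SHAPE = (44, 20)  # (frames, coefficients), as set in Section IV-C

def prepare(mfcc):
    """Normalize to 0-1, then trim/zero-pad to TARGET_SHAPE and add a channel dim."""
    mfcc = np.asarray(mfcc, dtype=float)
    # Min-max normalization to the 0-1 range.
    lo, hi = mfcc.min(), mfcc.max()
    if hi > lo:
        mfcc = (mfcc - lo) / (hi - lo)
    # Cut off any excess beyond the target shape...
    mfcc = mfcc[:TARGET_SHAPE[0], :TARGET_SHAPE[1]]
    # ...and zero-pad if the clip produced too few frames/coefficients.
    pad = [(0, TARGET_SHAPE[0] - mfcc.shape[0]),
           (0, TARGET_SHAPE[1] - mfcc.shape[1])]
    mfcc = np.pad(mfcc, pad, mode="constant")
    # Single channel ("grayscale"), giving the CNN input shape (44, 20, 1).
    return mfcc[..., np.newaxis]

short = prepare(np.random.randn(39, 20))   # too few frames -> zero-padded
long_ = prepare(np.random.randn(50, 20))   # too many frames -> trimmed
```

Either way, every clip reaches the network with the same (44, 20, 1) shape.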

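The Table IV stack can be sanity-checked with the standard "valid" convolution/pooling arithmetic, out = floor((in - kernel) / stride) + 1. This small sketch (not the training code) traces the feature maps from the (44, 20, 1) input of Section IV-C (Table IV writes the dimensions as 20,44,1; the flattened size is identical either way) down to the vector feeding the 6-node output layer:

```python
# Trace feature-map shapes through the Table IV stack (kernel 2, 'valid' padding).
def out_len(n, kernel, stride):
    return (n - kernel) // stride + 1

shape = (44, 20, 1)  # MFCC input: 44 frames x 20 coefficients x 1 channel
for filters in (16, 32, 64):
    h, w, _ = shape
    shape = (out_len(h, 2, 1), out_len(w, 2, 1), filters)  # 2D conv, stride 1
    h, w, _ = shape
    shape = (out_len(h, 2, 2), out_len(w, 2, 2), filters)  # 2D max pool, stride 2
flat = shape[0] * shape[1] * shape[2]
print(shape, flat)  # final feature map and the Flatten size feeding the output layer
```

The three conv/pool pairs shrink the map to (4, 1, 64), so the Flatten layer hands a 256-element vector to the 6-way softplus output.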

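Gluing the classifier to the drive hardware, a hypothetical dispatch step (function, class names, and class order are illustrative, not from the paper) maps the six model outputs to the L298N input-pin states listed in Table VI, with noise classes falling back to Stop:

```python
import numpy as np

# Illustrative class order; the model has 6 outputs
# (four commands plus the two noise classes).
CLASSES = ["stop", "go", "left", "right", "urban_noise", "white_noise"]

# L298N In1..In4 logic levels from Table VI.
PIN_STATES = {
    "stop":  (0, 0, 0, 0),
    "go":    (0, 1, 0, 1),
    "left":  (1, 0, 0, 1),
    "right": (0, 1, 1, 0),
}

def dispatch(probabilities):
    """Pick the most likely class; noise (or anything unmapped) means Stop."""
    label = CLASSES[int(np.argmax(probabilities))]
    return PIN_STATES.get(label, PIN_STATES["stop"])

# e.g. a confident "left" prediction drives In1 and In4 high:
print(dispatch([0.01, 0.02, 0.9, 0.05, 0.01, 0.01]))
```

Defaulting to the Stop state for any unmapped class is a conservative choice: a misheard command leaves the chair stationary rather than moving it.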
TABLE V.  MODEL COMPARISON

Model  Accuracy  Time (m)
BPNN   74.87%    0.61270
CNNs   95.30%    4.1672

G. Hardware Tuning

Table VI shows the logic levels used with the L298N motor driver; these levels control the direction of the motors and move the demo car according to the required command.

TABLE VI.  MOTOR DIRECTION

Motor Driver Pin L298N    Motor
In1  In2  In3  In4        Direction
0    0    0    0          Stop
0    1    0    1          Go
1    0    0    1          Left
0    1    1    0          Right

V. CONCLUSION

In this paper, we developed a prototype that is able to classify commands based on features extracted with MFCC. With the highest accuracy of 95.30%, this prototype can be considered reliable and efficient. It is our hope that the prototype can help disabled people move around without another person's help. However, there are several limitations in the study that can be considered in future enhancements/research, such as the quality of the data, the number of vocabularies used, and the type of language used.

ACKNOWLEDGEMENT

The authors would like to thank the Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Malaysia for the support throughout this research.

REFERENCES

[1] S. R. Avutu, "Voice control module for low cost local-map navigation based intelligent wheelchair," in 2017 IEEE 7th Int. Adv. Comput. Conf., 2017, pp. 609–613.
[2] S. Khan, H. Akmal, I. Ali, and N. Naeem, "Efficient and unique learning of in-car voice control for engineering education," 2017, pp. 1–6.
[3] R. Lang, M. Lescisin, and Q. H. Mahmoud, "Selecting a development board for your capstone or course project," IEEE Potentials, vol. 37, pp. 6–14, 2018.
[4] A. L. Barriuso, J. Pérez-Marcos, D. M. Jiménez-Bravo, G. V. González, and J. F. De Paz, "Agent-based intelligent interface for wheelchair movement control," Sensors, vol. 18, no. 1511, pp. 1–31, 2018.
[5] N. Aktar, I. Jahan, and B. Lala, "Voice recognition based intelligent wheelchair and GPS tracking system," in 2019 Int. Conf. on Electrical, Computer and Communication Engineering (ECCE), 2019, pp. 7–9.
[6] Z. Raiyan, "Design of an Arduino based voice-controlled automated wheelchair," pp. 21–23, 2017.
[7] T. Bouwmans, S. Javed, M. Sultana, and S. Ki, "Deep neural network concepts for background subtraction: A systematic review and comparative evaluation," Neural Networks, vol. 117, pp. 8–66, 2019.
[8] A. A. Amri, A. R. Ismail, and A. A. Zarir, "Comparative performance of deep learning and machine learning algorithms on imbalanced handwritten data," Int. J. Adv. Comput. Sci. Appl., vol. 9, no. 2, pp. 258–264, 2018.
[9] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, 2015.
[10] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," in IEEE Computer Society Conf. on Computer Vision and Pattern Recognition Workshops, 2014, pp. 512–519.
[11] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. F. Li, "Large-scale video classification with convolutional neural networks," in Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
[12] X. Wang, A. Mohd Ali, and P. Angelov, "Gender and age classification of human faces for automatic detection of anomalous human behaviour," in Int. Conf. on Cybernetics (CYBCONF 2017), 2017, pp. 1–6.
[13] A. M. Ali and P. Angelov, "Anomalous behaviour detection based on heterogeneous data and data fusion," Soft Comput., 2018.
[14] L. Lei and K. She, "Identity vector extraction by perceptual wavelet packet entropy and convolutional neural network," 2018.
[15] W. Guan, "Performance optimization of speech recognition system with deep neural network model," vol. 27, no. 4, pp. 272–282, 2018.
[16] A. Incze, H.-B. Jancso, Z. Szilagyi, A. Farkas, and C. Sulyok, "Bird sound recognition using a convolutional neural network," in IEEE 16th Int. Symposium on Intelligent Systems and Informatics, 2018, pp. 295–300.
[17] S. T. S and C. Lingam, "Speaker based language independent isolated speech recognition system," in 2015 Int. Conf. on Communication, Information & Computing Technology (ICCICT), 2015, pp. 1–7.
