
Voice Activation Using Speaker Recognition for Controlling Humanoid Robot

1st Dyah Ayu Anggreini Tuasikal
School of Electrical Engineering and Informatics
Bandung Institute of Technology
Bandung, Indonesia
anggreiniayu@students.itb.ac.id

2nd Hanif Fakhrurroja
School of Electrical Engineering and Informatics
Bandung Institute of Technology
Bandung, Indonesia
hani002@lipi.go.id

3rd Carmadi Machbub
School of Electrical Engineering and Informatics
Bandung Institute of Technology
Bandung, Indonesia
carmadi@lskk.ee.itb.ac.id

Abstract— Voice activation and speaker recognition are needed in many applications today. Speaker recognition is the process of automatically recognizing who speaks based on the voice signal. Speaker recognition is generally required in systems that involve security and privacy. One application of this paper is activation and security in controlling humanoid robots. The voice recording process uses Kinect 2.0. The first step in the recognition process is feature extraction. This paper uses the Mel Frequency Cepstrum Coefficient (MFCC) for the feature extraction process and Dynamic Time Warping (DTW) as the feature matching technique. The test was performed by 5 different speakers, with 2 types of words ("aktifkan", meaning activate, and "hello slim") and with different recording distances (0.5 m, 2 m, 4 m). Robot activation using the two different types of words has an average accuracy of 91.5%. At the next difficulty level, testing the recording distance, accuracy decreased from 97.5% to 85% to 65%.

Keywords— Speaker Recognition, Dynamic Time Warping (DTW), Mel Frequency Cepstrum Coefficient (MFCC), Kinect 2.0, Humanoid Robot, Bioloid GP.

I. INTRODUCTION

Speaker recognition is the process of automatically recognizing who speaks based on the individual information contained in the voice signal. Speaker recognition allows a speaker's voice to be used to verify identity and control access[1]. Speaker recognition has two important functions, namely identification and verification. Speaker identification is the process of determining a speaker's identity by comparing the speaker's voice features with the features of every speaker in a database. Speaker verification is the process of accepting or rejecting a claimed identity that is already known, based on data previously entered into the database[2]. The two main modules in speaker recognition are feature extraction and feature matching. The first step is feature extraction using the Mel Frequency Cepstrum Coefficient (MFCC). For the feature matching process, the most popular method for measuring the similarity between two time series that may vary in time and speed, dynamic time warping (DTW), is used[3].

Speaker recognition research has been done with several methods, one aspect of which is feature extraction. The most commonly used methods for voice feature extraction are Mel-frequency cepstral coefficients (MFCC), Linear Prediction Cepstrum Coefficients (LPCC), Modified Mel-frequency Cepstral Coefficients (MMFCC), Bark Frequency Cepstrum Coefficients (BFCC), and Revised Perceptual Linear Prediction (RPLP). In a study comparing these five methods [4], the MFCC method achieved 99.87% accuracy for speaker recognition. Several studies have also used the Dynamic Time Warping (DTW) method: combining MFCC and DTW worked well for text-dependent speaker verification [5], using both MFCC and DTW improved voice recognition performance [6], and a study using MFCC and DTW achieved a 92% success rate at a 0.25 threshold [7].

This paper shows that MFCC and DTW, which have been widely used by previous researchers for speaker verification, can work well together. Speaker recognition is implemented for voice activation of the Bioloid GP robot so that it can receive voice commands, using the DTW method for the speaker recognition process and the MFCC method for voice feature extraction. The recording process uses Kinect 2.0, whose captured audio has noise resistance [8]. We test the accuracy of the system using two different types of pronounced words and different recording distances between the sensor and the speaker.

II. THEORY

A. Feature Extraction – Mel Frequency Cepstrum Coefficient (MFCC)

MFCC is a popular feature extraction technique for voice signals. The main purpose of MFCC is to imitate the perception of human hearing, which cannot resolve frequencies above 1 kHz well. MFCC is based on the variation of the human ear's critical bandwidth with frequency. MFCC uses two types of filter spacing: linear at low frequencies below 1000 Hz and logarithmic above 1000 Hz[9]. The block diagram in Figure 1 summarizes the processes of MFCC.

Fig. 1. Block Diagram of MFCC Process

1) Pre-emphasis Filtering
This filter maintains the high frequencies of the spectrum, which are generally attenuated during sound production. The purpose of pre-emphasis filtering is to reduce the noise ratio of the signal, thus improving signal quality and balancing the spectrum of voiced sound:

Y[n] = X[n] − aX[n−1]    (1)

where Y[n] is the pre-emphasized signal, X[n] is the signal before pre-emphasis, and a is a constant with 0.9 ≤ a ≤ 1.0 [5].

2) Frame Blocking
The frame blocking function divides the signal into multiple frames. Sound signals must be processed in short segments (short frames) because voice signals change continuously due to the articulation shift of the vocal cords. For signal processing, the commonly used frame length is between 10-30 ms [10]. The windowing parameters, namely the width of the window, the distance between windows, and the shape of the window, determine the frame size (M) and frame shift (N). The frame blocking process is illustrated in Figure 2.

Fig. 2. Frame Blocking Process

3) Windowing
The next process is windowing, whose purpose is to reduce the signal discontinuities introduced by the frame blocking process at the beginning and end of each frame. A window is defined as Wn, 0 ≤ n ≤ N−1, where N is the number of samples in each frame. The windowing process is calculated as

Yn = Xn × Wn    (2)

where Yn is the windowed signal at sample n, Xn is the n-th sample value, and Wn is the window value. The type of window used is the Hamming window [6]. The Hamming window equation is

Wn = 0.54 − 0.46 cos(2πn / (M − 1))    (3)

where n = 0, 1, …, M − 1 and M is the frame length.

Fig. 3. Hamming Window

4) Fast Fourier Transform (FFT)
The FFT converts the sampled sound signal (frame N) from the time domain to the frequency domain. The signal in the frame is treated as periodic when the FFT is applied to it. The FFT is the fast algorithm for implementing the DFT [11]. The FFT equation is

Xn = Σ_{k=0}^{N−1} Xk e^{−2πjkn/N},  n = 0, 1, 2, …, N − 1    (4)

where Xn is the n-th frequency component of the frame and N is the number of samples.

5) Mel Filterbank
The human ear is not equally sensitive to all frequency bands because of the distinctive shape of the human ear; it becomes less sensitive at frequencies above approximately 1000 Hz. The mel filterbank is used to model this behavior; its shape is shown in Figure 4. The mel scale equation is

F(Mel) = 2595 log10(1 + f / 700)    (5)

where f is the frequency in Hz.

Fig. 4. Mel Filterbank

6) Discrete Cosine Transform (DCT)
The final step of the MFCC feature extraction process is the DCT, which yields the desired feature vector. This step takes only the cosine part of the complex exponential of the Fourier transform applied to the discrete signal function:

F(k) = Σ_{r=0}^{Nf−1} f(r) cos(2πrk / Nf)    (6)

where F(k) is the discrete cosine signal function and f(r) is the discrete signal function.

The DCT result is purely real, without imaginary parts, which simplifies the calculation: with the DCT, the magnitude value is the magnitude of the DCT itself, regardless of phase [6].
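As a concrete illustration, the six MFCC steps above can be sketched in NumPy. Only the frame length (2048 samples), the filterbank size (20 filters), and the 48 kHz sampling rate come from this paper; the frame shift, the 13 retained coefficients, and the pre-emphasis constant a = 0.97 are illustrative assumptions, and the final transform is written in the standard DCT-II form rather than exactly as equation (6).

```python
import numpy as np

def mfcc(signal, sr=48000, frame_len=2048, frame_shift=1024,
         n_filters=20, n_coeffs=13, a=0.97):
    """Sketch of the MFCC pipeline of equations (1)-(6)."""
    # 1) Pre-emphasis (eq. 1): Y[n] = X[n] - a*X[n-1], with 0.9 <= a <= 1.0
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])

    # 2) Frame blocking: overlapping frames (frame shift is an assumption)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    frames = np.stack([emphasized[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])

    # 3) Windowing (eqs. 2-3): multiply each frame by a Hamming window
    frames = frames * np.hamming(frame_len)

    # 4) FFT (eq. 4): magnitude spectrum of each frame
    spectrum = np.abs(np.fft.rfft(frames, frame_len))

    # 5) Mel filterbank (eq. 5): triangular filters evenly spaced in mel
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    energies = np.log(spectrum @ fbank.T + 1e-10)

    # 6) DCT (eq. 6, here the common DCT-II variant): cepstral coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1)
                   / (2 * n_filters))
    return energies @ basis.T
```

Each row of the returned matrix is the feature vector of one frame; these vectors are what the DTW stage later compares.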

B. Dynamic Time Warping (DTW)
Dynamic Time Warping uses a dynamic-programming technique that is quite popular in speech signal processing. The method calculates the distance between two time series. The basic principle is to allow a range of 'steps' in the space of (time frames in the sample, time frames in the template) and to find the path through that space that best matches corresponding time frames. It can be used both to determine the similarity between two time series and to find the corresponding regions between them. A constraint usually found in speaker recognition is that recordings differ in duration even when the word or phrase is the same; this method is needed to overcome that [12].

The advantage of this method is that it can calculate the distance between two vectors of different lengths [3]. How well the template matches the sample sound is determined by the total similarity cost (the result of the pattern matching of the two voices). The total 'similarity cost' obtained by this algorithm indicates how much the sample and the templates have in common, and the best-matching template is then selected.

The DTW distance between two vectors is calculated from the optimal warping path of the two vectors. Matching with the DTW method is illustrated in Figure 5 [13].

Fig. 5. Illustration of Matching Two Time Series with the DTW Method

The technique used in DTW is dynamic programming. The DTW distance can be calculated with the following equations. Given two data sets Q and C, with lengths m and n respectively:

Q = q1, q2, …, qm    (7)
C = c1, c2, …, cn    (8)

To measure the similarity of both data sets with the DTW method, an m × n matrix is formed whose element (i, j) contains the distance value

d(qi, cj) = (qi − cj)²    (9)

Next, the warping path, the path with the lowest cost, is determined. The criteria for a valid warping path are as follows [6].
1) Boundary Condition: so that the processed data runs from the beginning to the end, the warping path is formed from the starting point to the end point of the data set.
2) Monotonic Condition: to maintain the ordering of the time series, the path advances monotonically in time, avoiding loops.
3) Continuity Condition: so that the path does not jump to distant data points.

After determining the warping path, the DTW matrix is built by calculating the accumulated distance with the following equation:

D(i, j) = d(i, j) + min{ D(i−1, j−1), D(i−1, j), D(i, j−1) }    (10)

III. SYSTEM DESIGN AND IMPLEMENTATION

Speaker recognition is used to enable the Bioloid GP robot to receive voice commands in the next stage in real time. The speaker recognition system configuration is shown in Figure 6.

Fig. 6. Speaker Recognition System Configuration

The Bioloid GP robot becomes active only with its owner's voice. In the early stages, a voice recording process was conducted by five different speakers, consisting of 3 women and 2 men, to test the accuracy of the system. The voice data to be processed come from the 5 speakers' recordings. The Bioloid GP robot owner and the 4 other speakers each recorded training data 20 times, then tested with the single word "aktifkan" (meaning activate), with the two words "hello slim", and with recordings at distances of 0.5 m, 2 m, and 4 m.

The recording process uses the Kinect 2.0 4-mic array with a 24-bit analog-to-digital converter (ADC) and a 48 kHz sampling frequency, processed through Visual Studio 2017, with a duration of 2 seconds for every spoken word. The sound is then feature-extracted using MFCC.
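Equations (7)-(10) above translate almost directly into code. The sketch below is a minimal dynamic-programming implementation of the accumulated-distance recurrence (10): the boundary condition is enforced by the initialization of D(0, 0), and the monotonicity and continuity conditions by restricting each cell to the three predecessors inside the min term.

```python
import numpy as np

def dtw_distance(q, c):
    """DTW distance between sequences Q (length m) and C (length n),
    following equations (7)-(10)."""
    m, n = len(q), len(c)
    # D holds accumulated distances; the infinite border plus D[0,0] = 0
    # enforces the boundary condition of the warping path.
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = (q[i - 1] - c[j - 1]) ** 2            # local distance, eq. (9)
            # eq. (10): only the three adjacent predecessor cells are allowed,
            # which keeps the path monotonic and continuous.
            D[i, j] = d + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[m, n]
```

In practice q and c would be the MFCC feature sequences of the template and the sample, with d(i, j) a vector distance per frame; scalar sequences keep the sketch short.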

At the MFCC stage, the recorded sound is framed at up to 2048 samples per frame. The processed sound is first converted into the frequency domain using the FFT before passing through the filter stage. At this stage a bank of 20 filters is used, and in the next stage the cepstrum is formed by converting back to the time domain with the DCT. The feature extraction results are then used in the matching process with the DTW method, performed according to the flow chart in Figure 7.

Fig. 7. Flow Chart of the Dynamic Time Warping Method

A. Serial Communication Design
Serial communication in this research connects the microcontroller with other devices in an embedded system. The serial port pins on the microcontroller are RxD and TxD: RxD receives data from a computer or other equipment, while TxD sends data to a computer or other equipment. Communication between the PC and the Arduino uses full-duplex serial communication with 2 data lines, 1 transmit (pinTX) and 1 receive (pinRX); however, Dynamixel motors require only 1 data line to communicate. To connect the Arduino Mega with the Dynamixel motors, another interface is required: the IC 74LS241N serves as a serial data multiplexer so that one communication line can be used to communicate with more than one Dynamixel.

B. Design of Bioloid GP Robot Movement
The movement of the Bioloid GP robot, which has 18 DOF, is designed with the Robo Plus application, which provides several features for creating the desired movements. The responses implemented on the Bioloid GP robot are shown in Table I.

TABLE I. LIST OF IMPLEMENTED COMMANDS
Respond              | Robot Movement
Verified Speaker     | Robot stands up and raises his hands to the right and left
Not Verified Speaker | Robot stands and doesn't move

The robot movement design can be previewed with simulated movements programmed in the Robo Plus application. The simulation results for each speaker recognition response are shown in Figure 8.

Fig. 8. Robot Position for Speaker Recognition Response

After designing the robot movement in simulation, it is implemented directly on the robot using the servo angle positions from Robo Plus, which are then embedded in the Arduino Mega. The Bioloid GP robot consists of Dynamixel AX-12A and AX-18A motors driven by the Arduino Mega.

IV. EXPERIMENTAL RESULTS

This experiment extracts the voice characteristics of a spoken word using MFCC feature extraction. After obtaining the features to be compared, they are processed with DTW to verify the speaker. The speaker matching process stores the feature extraction of a single speaker as a reference for comparison with the reference speaker and the other four speakers. Speaker recognition with DTW produces a similarity cost used to distinguish one speaker's voice from the other four speakers.

The system verifies the speaker if the compared sound matches the previously stored sound; if the sound differs from the reference, the system refuses (not verified). The testing process of the speaker recognition implementation on the Bioloid GP robot is shown in Figure 9. The Bioloid GP robot then responds according to Table I. The results of the robot movement implementation are shown in Figure 10.
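The accept/reject decision described above reduces to comparing the DTW similarity cost against a threshold. The sketch below is only an assumption about how that decision could be wired to the responses in Table I; the paper does not state the threshold it used, so the 0.25 in the usage comment is merely the value reported by [7] for a different task.

```python
def verify_speaker(similarity_cost, threshold):
    """Map a DTW similarity cost to one of the Table I responses.

    A cost at or below the (experimentally tuned) threshold means the
    sample matches the stored reference closely enough to accept.
    """
    if similarity_cost <= threshold:
        return "Verified Speaker"      # robot stands and raises its hands
    return "Not Verified Speaker"      # robot stands and does not move
```

For example, `verify_speaker(dtw_cost, 0.25)` would accept any sample whose normalized DTW cost to the reference is at most 0.25.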

Fig. 9. Testing Process of the Speaker Recognition Implementation on the Bioloid GP Robot

Fig. 10. Implementation of the Robot Motion Response during Speaker Verification

The five speakers' voices differ in age, gender, and speech accent; the speech signals of the pronounced words are shown in Figure 11 and Figure 12.

Fig. 11. Speech Signal from "Aktifkan" Pronunciation

Fig. 12. Speech Signal from "Hello Slim" Pronunciation

The speaker recognition test was conducted by 5 speakers, each performing 40 pronunciation tests for the word "aktifkan" and the words "hello Slim", for a total of 200 tests of the system's accuracy. The experimental results for the five speakers are shown in Figure 13.

Fig. 13. Level of Accuracy from Single Word and Two Words Data Testing

As the results show, speaker recognition can be used to control robots through voice activation. The robot becomes active when commanded by speaker 1 and stays inactive when commanded by the other four speakers, as intended. From the graph in Figure 13, the word "aktifkan" is more recognizable than the words "hello Slim". Accuracy is 97.5% for speaker 1, 95% for speaker 2, 100% for speaker 3, 80% for speaker 4, and 85% for speaker 5, an average accuracy of 91.5%. According to the experimental results, the average accuracy rate is 93% for the single word and 90% for the two words.

The test results obtained by varying the distance between the Kinect 2.0 and the speaker are shown in Figure 14. In this test, speaker 1 spoke 20 times for the 2 pronunciation words at each of the three recording distances, for a total of 120 tests.
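The averaged accuracies reported for these experiments can be reproduced directly from the per-speaker and per-distance figures quoted in the paper:

```python
# Per-speaker accuracies (%) from the word test, speakers 1-5 (figure 13)
word_acc = [97.5, 95.0, 100.0, 80.0, 85.0]
# Accuracies (%) at recording distances 0.5 m, 2 m, 4 m (figure 14)
dist_acc = [97.5, 85.0, 65.0]

print(sum(word_acc) / len(word_acc))   # 91.5 -> average word-test accuracy
print(sum(dist_acc) / len(dist_acc))   # 82.5 -> average distance-test accuracy
```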

Fig. 14. Level of Accuracy from Recording Distance Data Testing

As Figure 14 shows, the recording distance parameter affects the speaker recognition results. For a 0.5-meter recording distance between the Kinect and the speaker, the accuracy is 97.5%; accuracy decreases to 85% at a distance of 2 meters and to 65% at a distance of 4 meters, for an average accuracy of 82.5% in the recording distance test. The accuracy of speaker recognition decreases as the recording distance increases. This is because increasing the distance reduces the resulting amplitude, which makes the feature extraction process inaccurate and can interfere with speaker verification in the DTW method.

V. CONCLUSION

Voice activation using speaker recognition to control the Bioloid GP robot with the MFCC and DTW methods can be implemented well on humanoid robots. The test was performed by 5 different speakers, with 2 types of words ("aktifkan" and "hello slim") and with different recording distances (0.5 m, 2 m, 4 m). Robot activation using the two different types of words has an average accuracy of 91.5%. At the next difficulty level, testing the recording distance, accuracy decreased from 97.5% to 85% to 65% due to the increased spacing between the sensor and the speakers, which affects the amplitude of the captured signal.

The MFCC parameter values used affect the success rate of matching with DTW. The experimental results show that speaker recognition to control the Bioloid GP robot can be achieved with DTW. The number of words spoken and the recording distance affect the accuracy of the recognition.

ACKNOWLEDGMENT

This work was supported by the Program of Post Graduate Team Research 2018 from The Ministry of Research, Technology and Higher Education, Republic of Indonesia.

REFERENCES
[1] A. R. G, "Real Time Speaker Recognition Using MFCC and VQ," National Institute of Technology, Rourkela, 2008.
[2] M. Limkar, "Speaker Recognition using VQ and DTW," Int. Conf. Adv. Commun. Comput. Technol., pp. 18–20, 2012.
[3] D. Vashisht, S. Sharma, and L. Dogra, "Design of MFCC and DTW for Robust Speaker Recognition," Int. J. Electr. Electron. Eng., vol. 2, no. 3, pp. 12–17, 2015.
[4] M. G. Sumithra and A. K. Devika, "A study on feature extraction techniques for text independent speaker identification," 2012 Int. Conf. Comput. Commun. Informatics, pp. 1–5, 2012.
[5] K. B. Joshi and V. V. Patil, "Text-dependent Speaker Recognition and Verification using Mel Frequency Cepstral Coefficient and Dynamic Time Warping," Int. J. Electron. Commun. Technol., vol. 7109, pp. 150–154, 2015.
[6] L. Muda, M. Begam, and I. Elamvazuthi, "Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques," J. Comput., Mar. 2010.
[7] S. Verma, T. Gulati, and R. Lamba, "Recognizing Voice for Numerics Using MFCC and DTW," Int. J. Appl. Innov. Eng. Manag., vol. 2, no. 5, pp. 127–130, 2013.
[8] M. H. Tambunan, Martin, H. Fakhruroja, and C. Machbub, "Indonesian Speech Recognition Grammar Using Kinect 2.0 for Controlling Humanoid Robot," Int. Conf. Signals Syst., pp. 59–63, 2018.
[9] A. Bala, "Voice command recognition system based voice command recognition," Int. J. Eng. Sci. Technol., Dec. 2010.
[10] R. Hasan, M. Jamil, G. Rabbani, and S. Rahman, "Speaker Identification Using Mel Frequency Cepstral Coefficients," 3rd Int. Conf. Electr. Comput. Eng. (ICECE 2004), pp. 28–30, Dec. 2004.
[11] D. Handaya, H. Fakhruroja, E. M. I. Hidayat, and C. Machbub, "Comparison of Indonesian speaker recognition using Vector Quantization and Hidden Markov Model for unclear pronunciation problem," in 2016 6th Int. Conf. on System Engineering and Technology (ICSET), 2016, pp. 39–45.
[12] B. Priya and S. Kaur, "Comparative Study of Male and Female Voices Using MFCC and DTW Algorithm," Int. J. Adv. Res. Electron. Commun. Eng., vol. 3, no. 8, pp. 2–5, 2014.
[13] A. Mueen and E. Keogh, "Extracting Optimal Performance from Dynamic Time Warping," Int. Conf. Knowl. Discov. Data Min., pp. 2129–2130, 2016.
