Voice Activation Using Speaker Recognition For Controlling Humanoid Robot
Abstract— Voice activation and speaker recognition are needed in many applications today. Speaker recognition is the process of automatically recognizing who speaks based on the voice signal. Recognizing speakers is generally required in systems that involve security and privacy. One application of this paper is activation and security in controlling humanoid robots. The voice recording process uses Kinect 2.0. The first step in the speech recognition process is feature extraction. This paper uses the Mel Frequency Cepstrum Coefficient (MFCC) for the feature extraction process and Dynamic Time Warping (DTW) as the feature matching technique. The test was performed by 5 different speakers, with 2 types of words ("aktifkan", meaning activate, and "hello slim"), and with different recording distances (0.5 m, 2 m, 4 m). Robot activation using the two different types of words has an average accuracy of 91.5%. At the next difficulty level, testing the recording distance, accuracy decreased from 97.5% to 85% to 65%.

Keywords— Speaker Recognition, Dynamic Time Warping (DTW), Mel Frequency Cepstrum Coefficient (MFCC), Kinect 2.0, Humanoid Robot, Bioloid GP.

I. INTRODUCTION

Speaker recognition is the process of automatically recognizing who speaks based on the individual information contained in the voice signal. Speaker recognition allows a speaker's voice to verify their identity and control access [1]. Speaker recognition has two important functions, namely identification and verification. Speaker identification is the process of obtaining the speaker's identity by comparing the speaker's voice features with the features of every speaker in the database. Speaker verification is the process of accepting or rejecting an identity, where the speaker's identity is already known from data previously entered in the database [2]. The two main modules in speaker recognition are feature extraction and feature matching. The first step is feature extraction using the Mel Frequency Cepstrum Coefficient (MFCC). For the feature matching process, the most popular method for measuring the similarity between two time series that may vary in time and speed, Dynamic Time Warping (DTW), is used [3].

Speaker recognition research has been done with several methods, one aspect of which is the feature extraction method. The most commonly used methods for voice feature extraction are Mel-frequency cepstral coefficients (MFCC), Linear Prediction Cepstrum Coefficients (LPCC), Modified Mel-frequency Cepstral Coefficients (MMFCC), Bark Frequency Cepstrum Coefficients (BFCC), and Revised Perceptual Linear Prediction (RPLP). In a study by [4] comparing these five methods, the MFCC method achieved 99.87% accuracy for speaker recognition. Research using the Dynamic Time Warping (DTW) method has shown that combining MFCC and DTW works well for text-dependent speaker verification [5], that using the MFCC and DTW algorithms together can improve voice recognition performance [6], and that MFCC and DTW achieved a 92% success rate at a threshold of 0.25 [7].

This paper will show that the MFCC and DTW methods, which have been widely used by previous researchers for speaker verification, can work well together. Speaker recognition is implemented for voice activation of the Bioloid GP robot so that it can receive voice commands, using the Dynamic Time Warping (DTW) method for the speaker recognition process and the MFCC method for voice feature extraction. The recording process uses Kinect 2.0, whose captured audio has noise resistance [8]. We test the accuracy of the system using two different types of spoken words and different recording distances between the sensor and the speaker.

II. THEORY

A. Feature Extraction – Mel Frequency Cepstrum Coefficient (MFCC)

MFCC is a popular feature extraction technique for voice signals. The main purpose of MFCC is to imitate the perception of human hearing, which cannot resolve frequencies above 1 kHz. MFCC is based on the variation of the human ear's critical bandwidth with frequency: its filters are spaced linearly at low frequencies, below 1000 Hz, and logarithmically above 1000 Hz [9]. The block diagram in Figure 1 summarizes the processes of MFCC.

Fig. 1. Block Diagram of MFCC Process
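The identification and verification functions described in the introduction can be sketched as follows. This is an illustrative sketch only: the Euclidean distance on fixed-length feature vectors, the toy templates, and the 0.25 threshold are stand-ins (the paper itself matches MFCC features with DTW).

```python
import numpy as np

def identify(sample, enrolled):
    """Identification: return the enrolled speaker whose template is
    closest to the sample (Euclidean distance on fixed-length vectors,
    an illustrative stand-in for DTW matching)."""
    return min(enrolled, key=lambda name: np.linalg.norm(sample - enrolled[name]))

def verify(sample, enrolled, claimed, threshold=0.25):
    """Verification: accept or reject a claimed identity by comparing
    the distance to that speaker's template against a threshold
    (0.25 here is an assumed value, echoing the threshold in [7])."""
    return np.linalg.norm(sample - enrolled[claimed]) <= threshold

# Hypothetical enrolled templates (3-dimensional toy feature vectors).
enrolled = {"alice": np.array([0.1, 0.2, 0.3]),
            "bob":   np.array([0.9, 0.8, 0.7])}
sample = np.array([0.12, 0.21, 0.33])
print(identify(sample, enrolled))          # closest enrolled speaker
print(verify(sample, enrolled, "alice"))   # distance within threshold?
```

Identification searches all templates; verification checks only the claimed speaker's template, which is why it is the operation used for access control.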
1) Pre-emphasis Filtering

This filter maintains the high frequencies of the spectrum, which are generally attenuated during sound production. The purpose of pre-emphasis filtering is to reduce the noise ratio of the signal, thus improving signal quality and balancing the spectrum of voiced sound.

Y[n] = X[n] − aX[n − 1]   (1)

2) Frame Blocking

The pre-emphasized signal is divided into frames of N samples, with adjacent frames shifted by M samples so that they overlap, as illustrated in figure 2.

3) Windowing

Each frame is windowed to minimize signal discontinuities at the frame edges, typically with a Hamming window:

w(n) = 0.54 − 0.46 cos(2πn / (M − 1))

where n is 0, 1, …, M − 1 and M is the frame length.

4) Fast Fourier Transform (FFT)

The FFT converts each frame of N samples of the sound signal from the time domain to the frequency domain. The signal in the frame is treated as periodic when the FFT is applied to it. The FFT is the fast algorithm for implementing the DFT [11]. The FFT equation is

X(n) = Σ_{k=0}^{N−1} X(k) e^(−2πjkn/N),  n = 0, 1, 2, …, N − 1   (4)

After the signal has passed through the mel filterbank, the Discrete Cosine Transform (DCT) converts the log mel spectrum back to the time domain:

F(k) = Σ_{n=0}^{M−1} f(n) cos(πk(2n + 1) / (2M))

where F(k) is the discrete cosine signal function and f(n) is the discrete signal function. The DCT results are purely real, without imaginary parts, which simplifies the calculation. With the DCT process, the magnitude value is the magnitude of the DCT itself, regardless of phase [6].
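The MFCC steps above (pre-emphasis, frame blocking, windowing, FFT, and DCT) can be sketched with NumPy. This is a minimal sketch: the mel filterbank stage is omitted for brevity, non-overlapping frames are used, and the coefficient a = 0.97 and the 13 output coefficients are commonly used values rather than parameters stated in this paper.

```python
import numpy as np

def mfcc_steps(x, frame_len=256, a=0.97, n_coeffs=13):
    # (1) Pre-emphasis: Y[n] = X[n] - a*X[n-1]
    y = np.append(x[0], x[1:] - a * x[:-1])
    # Frame blocking: split into frames of frame_len samples
    n_frames = len(y) // frame_len
    frames = y[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Windowing: Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(M-1))
    frames = frames * np.hamming(frame_len)
    # (4) FFT: magnitude spectrum of each frame
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    # Log spectrum, then DCT back to the time (cepstral) domain
    logspec = np.log(spectrum + 1e-10)
    n = np.arange(logspec.shape[1])
    k = np.arange(n_coeffs).reshape(-1, 1)
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * logspec.shape[1]))
    return logspec @ dct_basis.T  # shape: (n_frames, n_coeffs)

x = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)  # toy 440 Hz tone
feats = mfcc_steps(x)
print(feats.shape)  # (16, 13)
```

Note that, as the text says, the DCT output is purely real, so the cepstral features need no complex arithmetic downstream.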
B. Dynamic Time Warping (DTW)

Dynamic Time Warping uses a dynamic-programming technique that is quite popular in speech signal processing. The method is used to calculate the distance between two time series. The basic principle is to provide a range of 'steps' in the space of (time frames in the sample, time frames in the template) and to use them to match a path that shows the similarity between aligned time frames. It can be used to determine the similarity between two time series, as well as to find corresponding regions between them. A constraint commonly found in speaker recognition is that recordings differ in duration, even when the word or phrase is stipulated to be the same, and this method is needed to overcome that [12].

The advantage of this method is that it can calculate the distance between two vectors of different lengths [3]. How well the template and the sample sound match is determined by the total similarity cost (the result of pattern matching of the two voices). The total 'similarity cost' obtained with this algorithm indicates how much the sample and the templates have in common, and the best-matching template is then selected.

The DTW distance between two vectors is calculated from the optimal warping path of the two vectors. An illustration of matching with the DTW method is shown in figure 5 [13]. The warping path is subject to three conditions:

1) Boundary Condition
So that the processed data run from the beginning to the end, the warping path is formed from the starting point to the end point of the data set.

2) Monotonic Condition
To maintain the time order of the series, the process must advance monotonically in time, avoiding loops.

3) Continuity Condition
The processed data must not jump to distant data points.

After obtaining the warping path, the DTW matrix is built by calculating the accumulated distance with the following equation:

D(i, j) = d(i, j) + min{D(i − 1, j − 1), D(i − 1, j), D(i, j − 1)}   (10)

III. SYSTEM DESIGN AND IMPLEMENTATION

Speaker recognition is used to enable the Bioloid GP robot to receive voice commands in the next stage in real time. The speaker recognition system configuration is shown in figure 6.
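The accumulated-distance recurrence of equation (10) can be sketched as a small dynamic program; this is a generic sketch over 1-D sequences with absolute difference as the local distance d(i, j), not the authors' implementation.

```python
import numpy as np

def dtw_distance(s, t):
    """DTW accumulated distance:
    D(i,j) = d(i,j) + min(D(i-1,j-1), D(i-1,j), D(i,j-1)).
    The min over the three predecessors enforces monotonicity and
    continuity; s and t may have different lengths."""
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0  # boundary condition: path starts at the beginning
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s[i - 1] - t[j - 1])  # local distance d(i, j)
            D[i, j] = d + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]  # boundary condition: path ends at the end points

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0: warping absorbs the repeat
print(dtw_distance([1, 2, 3], [2, 2, 2]))     # 2.0
```

In the paper's setting, s and t would be sequences of MFCC frame vectors and d(i, j) a vector distance; the recurrence is unchanged.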
At the MFCC stage, the recorded sound is framed at 2048 samples per frame. The processed sound is first converted into the frequency domain using the FFT before passing through the filter stage. At this stage, the mel warping uses 20 filterbanks, so for the next stage the cepstrum formation can convert the result back to the time domain with the DCT. The feature extraction results are then used for the matching process with the DTW method. The DTW method is performed according to the flow chart of figure 7.

TABLE I. LIST OF IMPLEMENTED COMMANDS

Respond                 Robot Movement
Verified Speaker        Robot stands up and raises his hand to the right and left
Not Verified Speaker    Robot stands and doesn't move

The robot movement design can be seen from simulated movements that have been programmed in the RoboPlus application. The simulation results for each speaker recognition response are shown in figure 8.
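The 20-filter bank mentioned above can be sketched as triangular filters with edges spaced evenly on the mel scale (linear below roughly 1 kHz, logarithmic above, as Section II describes). This is a standard textbook construction, not the authors' code; the 16 kHz sample rate is an assumption, and the 2048-point FFT matches the frame length stated here.

```python
import numpy as np

def mel_filterbank(n_filters=20, n_fft=2048, sr=16000):
    """Build triangular mel-spaced filters over the FFT bins."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_filters + 2 edge points, evenly spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)  # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)  # falling edge
    return fbank

fb = mel_filterbank()
print(fb.shape)  # (20, 1025): one row per filter, one column per FFT bin
```

Multiplying a frame's magnitude spectrum by this matrix (transposed) gives the 20 filterbank energies whose logs feed the DCT stage.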
Fig. 9. Testing process of speaker recognition implementation on Bioloid GP robot

Fig. 13. Level of Accuracy from Single Word and Two Words Data Testing

Fig. 14. Level of Accuracy from Recording Distance Data Testing

As figure 14 shows, the recording distance parameter affects the speaker recognition experimental results. At a recording distance of 0.5 meters between the Kinect and the speakers, accuracy is 97.5%; accuracy decreased to 85% at a distance of 2 meters and to 65% at a distance of 4 meters, giving an average accuracy of 82.5% for the recording distance tests. The accuracy of speaker recognition decreases as the recording distance increases. This is because increasing the distance reduces the amplitude of the captured signal, which makes the feature extraction process inaccurate and can interfere with speaker verification in the DTW method.

V. CONCLUSION

Voice activation using speaker recognition to control the Bioloid GP robot with the MFCC and DTW methods can be implemented well on humanoid robots. The test was performed by 5 different speakers, with 2 types of words ("aktifkan" and "hello slim") and with different recording distances (0.5 m, 2 m, 4 m). Robot activation using the two different types of words has an average accuracy of 91.5%. At the next difficulty level, testing the recording distance, accuracy decreased from 97.5% to 85% to 65% due to the increased spacing between the sensor and the speaker, which affects the amplitude of the captured signal.

The MFCC parameter values used affect the success rate of matching by DTW. Experimental results show that speaker recognition to control the Bioloid GP robot can be solved by DTW. The number of words spoken and the recording distance affect the accuracy of the recognition.

ACKNOWLEDGMENT

This work was supported by the Program of Post Graduate Team Research 2018 from The Ministry of Research, Technology and Higher Education, Republic of Indonesia.

REFERENCES

[1] A. R. G, "Real Time Speaker Recognition Using MFCC and VQ," National Institute of Technology, Rourkela, 2008.
[2] M. Limkar, "Speaker Recognition using VQ and DTW," Int. Conf. Adv. Commun. Comput. Technol., pp. 18–20, 2012.
[3] D. Vashisht, S. Sharma, and L. Dogra, "Design of MFCC and DTW for Robust Speaker Recognition," Int. J. Electr. Electron. Eng., vol. 2, no. 3, pp. 12–17, 2015.
[4] M. G. Sumithra and A. K. Devika, "A study on feature extraction techniques for text independent speaker identification," 2012 Int. Conf. Comput. Commun. Informatics, pp. 1–5, 2012.
[5] K. B. Joshi and V. V. Patil, "Text-dependent Speaker Recognition and Verification using Mel Frequency Cepstral Coefficient and Dynamic Time Warping," Int. J. Electron. Commun. Technol., vol. 7109, pp. 150–154, 2015.
[6] L. Muda, M. Begam, and I. Elamvazuthi, "Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques," J. Comput., no. March, 2010.
[7] S. Verma, T. Gulati, and R. Lamba, "Recognizing Voice for Numerics Using MFCC and DTW," Int. J. Appl. or Innov. Eng. Manag., vol. 2, no. 5, pp. 127–130, 2013.
[8] M. H. Tambunan, Martin, H. Fakhruroja, and C. Machbub, "Indonesian Speech Recognition Grammar Using Kinect 2.0 for Controlling Humanoid Robot," Int. Conf. Signals Syst., no. 978, pp. 59–63, 2018.
[9] A. Bala, "Voice command recognition system based voice command recognition," Int. J. Eng. Sci. Technol., no. December, 2010.
[10] R. Hasan, M. Jamil, G. Rabbani, and S. Rahman, "Speaker Identification Using Mel Frequency Cepstral Coefficients," 3rd Int. Conf. Electr. Comput. Eng. ICECE 2004, no. December, pp. 28–30, 2004.
[11] D. Handaya, H. Fakhruroja, E. M. I. Hidayat, and C. Machbub, "Comparison of Indonesian speaker recognition using Vector Quantization and Hidden Markov Model for unclear pronunciation problem," in 2016 6th International Conference on System Engineering and Technology (ICSET), 2016, pp. 39–45.
[12] B. Priya and S. Kaur, "Comparative Study of Male and Female Voices Using MFCC and DTW Algorithm," Int. J. Adv. Res. Electron. Commun. Eng., vol. 3, no. 8, pp. 2–5, 2014.
[13] A. Mueen and E. Keogh, "Extracting Optimal Performance from Dynamic Time Warping," Int. Conf. Knowl. Discov. Data Min., pp. 2129–2130, 2016.