
OBJECTIVE:

To design and develop a speech-signal-based control system for rehabilitative aids (e.g. a
wheelchair).
ABSTRACT:

Over the last few decades, there has been an increase in experimental research aimed at helping people
with different levels of disability who need mobility, movement, or transport aids. In this study, we have
developed a system that can classify human speech into corresponding commands for triggering a
rehabilitative aid such as a wheelchair, helping people who need to move from one place to another. For
this purpose, we introduce a classification model based on a hidden Markov-Gaussian mixture model for
classifying speech, specifically the words used to drive a wheelchair. To improve the efficiency of the
proposed classification model, classification was performed on features extracted from the speech signal.
Ten volunteers took part in this study to validate the results, and the system was found to classify the
words into their corresponding commands with an accuracy of about 98%. We therefore propose this
model as a useful tool for assisting differently-abled people.

1. INTRODUCTION:

Mapping words to their corresponding speech signal is an important task in the development of any
speech/pattern recognition system. Many researchers have worked on developing systems that can
recognize and map speech without any loss [1]. For automatic pattern recognition, unique features,
termed 'acoustic features', are extracted from the processed speech signal. The acoustic features
represent the speech signal; if the acoustic feature vectors fail to capture the underlying information,
the representation of the speech becomes mismatched. Hence, system performance depends on the
properties and choice of the acoustic features selected. For this reason, many researchers have worked
over the past few decades on developing proper acoustic feature extraction methods that reduce the
complexity of classifying/representing a particular speech signal or sequence of words [2]. In general,
speech recognition (SR) involves two main steps: extracting features and classifying them into the
corresponding speech using an appropriate pattern recognition method; the classification accuracy,
robustness, and complexity of the developed algorithm are used to measure the performance of such a
method [3-5]. Feature extraction is the process by which the unique properties of a particular signal are
analysed/measured; however, reconstructing the original signal from the features is not possible, as
feature extraction is a non-invertible (lossy) transformation. Feature extraction techniques are generally
based on either temporal or spectral analysis. Temporal analysis uses the input signal/waveform itself,
whereas spectral analysis uses the spectral representation of the given input signal/waveform [1]. In SR
systems, Linear Predictive Coding (LPC) [6], Mel-Frequency Cepstral Coefficients (MFCC) [7], and
Perceptual Linear Prediction (PLP) [8] are widely used to extract acoustic features.
2. LITERATURE SURVEY

2.1. Speech Signal:

The process of translating thoughts into speech/words is usually termed "speech production" [9]. Speech
is produced when air is forced from the lungs through the vocal cords and along the vocal tract. The vocal
tract extends from the opening in the vocal cords (called the glottis) to the mouth, and in an average man
is about 17 cm long. It introduces short-term correlations (of the order of 1 ms) into the speech signal,
and can be thought of as a filter with broad resonances called formants. The frequencies of these formants
are controlled by varying the shape of the tract, for example by moving the position of the tongue. An
important part of many speech codecs is the modelling of the vocal tract as a short term filter. As the
shape of the vocal tract varies relatively slowly, the transfer function of its modelling filter needs to be
updated only relatively infrequently (typically every 20 ms or so).

The vocal tract filter is excited by air forced into it through the vocal cords. Speech sounds can be broken
into three classes depending on their mode of excitation.

 Voiced sounds are produced when the vocal cords vibrate open and closed, thus interrupting the
flow of air from the lungs to the vocal tract and producing quasiperiodic pulses of air as the
excitation. The rate of the opening and closing gives the pitch of the sound. This can be adjusted
by varying the shape of, and the tension in, the vocal cords, and the pressure of the air behind
them. Voiced sounds show a high degree of periodicity at the pitch period, which is typically
between 2 and 20 ms.
 Unvoiced sounds result when the excitation is a noise-like turbulence produced by forcing air at
high velocities through a constriction in the vocal tract while the glottis is held open. Such sounds
show little long-term periodicity, although short-term correlations due to the vocal tract are still
present.
 Plosive sounds result when a complete closure is made in the vocal tract, and air pressure is built
up behind this closure and released suddenly.

Some sounds cannot be considered to fall into any one of the three classes above, but are a mixture. For
example, voiced fricatives result when both vocal cord vibration and a constriction in the vocal tract are
present.

Although there are many possible speech sounds which can be produced, the shape of the vocal tract and
its mode of excitation change relatively slowly, and so speech can be considered to be quasi-stationary
over short periods of time (of the order of 20 ms). Speech signals show a high degree of predictability,
due sometimes to the quasiperiodic vibrations of the vocal cords and also due to the resonances of the
vocal tract.

There are several important reasons why the speech produced for a given word differs between utterances:
(1) the nature of the physiological organs, (2) differences in accent (vowels and consonants), (3) differences
in gender, weight and/or height, and (4) the stress (emphasis) applied to the word, all of which appear as
variability in the speech waveform [3, 4]. The human speech signal falls below 10 kHz, so a sampling
frequency of 20 kHz is needed to satisfy the Nyquist criterion. Telephone speech, in contrast, falls below
4 kHz, so a sampling frequency of 8 kHz is generally used by researchers to satisfy the Nyquist criterion [7].
Speech coders attempt to exploit this predictability in order to reduce the data rate necessary for good
quality voice transmission.

2.2. Speech Recognition Types and Styles

Voice-enabled devices basically use the principle of speech recognition. It is the process of electronically
converting a speech waveform (as the realization of a linguistic expression) into words (as a best-decoded
sequence of linguistic units).

Converting a speech waveform into a sequence of words involves several essential steps:

1. A microphone picks up the signal of the speech to be recognized and converts it into an electrical signal.
A modern speech recognition system also requires that the electrical signal be represented digitally by
means of an analog-to-digital (A/D) conversion process, so that it can be processed with a digital computer
or a microprocessor.

2. This speech signal is then analyzed (in the analysis block) to produce a representation consisting of
salient features of the speech. The most prevalent feature of speech is derived from its short-time
spectrum, measured successively over short-time windows of length 20–30 milliseconds overlapping at
intervals of 10–20 ms. Each short-time spectrum is transformed into a feature vector, and the temporal
sequence of such feature vectors thus forms a speech pattern.

3. The speech pattern is then compared to a store of phoneme patterns or models through a dynamic
programming process in order to generate a hypothesis (or a number of hypotheses) of the phonemic unit
sequence. (A phoneme is a basic unit of speech and a phoneme model is a succinct representation of the
signal that corresponds to a phoneme, usually embedded in an utterance.) A speech signal inherently has
substantial variations along many dimensions.

Speech recognition is classified into two categories, speaker dependent and speaker independent.

 Speaker-dependent systems are trained by the individual who will be using the system. These
systems are capable of achieving a high command count and better than 95% accuracy for word
recognition. The drawback of this approach is that the system responds accurately only to
the individual who trained it. This is the most common approach employed in software
for personal computers.
 Speaker-independent systems are trained to respond to a word regardless of who speaks.
Therefore the system must respond to a large variety of speech patterns, inflections and
enunciations of the target word. The command word count is usually lower than for speaker-
dependent systems; however, high accuracy can still be maintained within processing limits.
Industrial applications more often need speaker-independent voice systems, such as the AT&T
system used in telephone networks.

A more general form of voice recognition is available through feature analysis and this technique usually
leads to "speaker-independent" voice recognition. Instead of trying to find an exact or near-exact match
between the actual voice input and a previously stored voice template, this method first processes the
voice input using "Fourier transforms" or "linear predictive coding (LPC)", then attempts to find
characteristic similarities between the expected inputs and the actual digitized voice input. These
similarities will be present for a wide range of speakers, and so the system need not be trained by each
new user. The types of speech differences that the speaker-independent method can deal with, but which
pattern matching would fail to handle, include accents, and varying speed of delivery, pitch, volume, and
inflection. Speaker-independent speech recognition has proven to be very difficult, with some of the
greatest hurdles being the variety of accents and inflections used by speakers of different nationalities.
Recognition accuracy for speaker-independent systems is somewhat lower than for speaker-dependent
systems, usually between 90 and 95 percent. Speaker-independent systems have the advantage of not
requiring training by each user, but their recognition quality is lower. These systems find applications in
telephony, such as dictating a number or a word where many different speakers are involved. However,
speaker-independent systems need a well-constructed training database.

2.2.1. Recognition Style:

Speech recognition systems have another constraint concerning the style of speech they can recognize.
There are three styles of speech: isolated, connected and continuous.

 Isolated speech recognition systems can only handle words that are spoken separately. This is
the most common type of speech recognition system available today. The user must pause
between each word or command spoken. The speech recognition circuit is set up to identify
isolated words of 0.96 second length.
 Connected speech recognition is a halfway point between isolated-word and continuous speech
recognition. It allows users to speak multiple words. The HM2007 can be set up to identify words
or phrases 1.92 seconds in length, which reduces the word recognition vocabulary to 20.
 Continuous speech is the natural conversational speech we are used to in everyday life. It is
extremely difficult for a recognizer to sift through the speech, as the words tend to merge
together. For instance, "Hi, how are you doing?" sounds like "Hi, howyadoin". Continuous speech
recognition systems are on the market and are under continual development.

2.2.2. Approaches of Statistical Speech Recognition:

a. Hidden Markov model (HMM)-based speech recognition

Modern general-purpose speech recognition systems are generally based on hidden Markov models
(HMMs). This is a statistical model which outputs a sequence of symbols or quantities.

One possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as
a piece-wise stationary or short-time stationary signal. That is, over a short time on the order of 10
milliseconds, speech can be approximated as a stationary process. Speech can thus be thought of as a
Markov model over many stochastic processes (known as states).

Another reason why HMMs are popular is because they can be trained automatically and are simple and
computationally feasible to use. In speech recognition, to give the very simplest setup possible, the hidden
Markov model would output a sequence of n dimensional real-valued vectors with n around, say, 13,
outputting one of these every 10 milliseconds. The vectors, again in the very simplest case, would consist
of cepstral coefficients, which are obtained by taking a Fourier transform of a short-time window of
speech and de-correlating the spectrum using a cosine transform, then taking the first (most significant)
coefficients. The hidden Markov model will tend to have, in each state, a statistical distribution called a
mixture of diagonal covariance Gaussians which will give likelihood for each observed vector. Each word,
or (for more general speech recognition systems), each phoneme, will have a different output distribution;
a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual
trained hidden Markov models for the separate words and phonemes.

The above is a very brief introduction to some of the more central aspects of speech recognition. Modern
speech recognition systems use a host of standard techniques which it would be too time consuming to
properly explain, but just to give a flavor, a typical large-vocabulary continuous system would probably
have the following parts. It would need context dependency for the phones (so phones with different left
and right context have different realizations); to handle unseen contexts it would need tree clustering of
the contexts; it would of course use cepstral normalization to normalize for different recording conditions,
and depending on the length of time that the system had to adapt on different speakers and conditions it
might use cepstral mean and variance normalization for channel differences, vocal tract length
normalization (VTLN) for male-female normalization, and maximum likelihood linear regression (MLLR)
for more general speaker adaptation. The features would have delta and delta-delta coefficients to
capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA);
or might skip the delta and delta-delta coefficients and use LDA followed perhaps by heteroscedastic
linear discriminant analysis or a global semi-tied covariance transform (also known as maximum likelihood
linear transform (MLLT)). A serious company with a large amount of training data would probably want
to consider discriminative training techniques like maximum mutual information (MMI), MPE, or (for
short utterances) MCE, and if a large amount of speaker-specific enrollment data was available a more
wholesale speaker adaptation could be done using MAP or, at least, tree-based maximum likelihood linear
regression. Decoding of the speech (the term for what happens when the system is presented with a new
utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm
to find the best path, but there is a choice between dynamically creating combination hidden Markov
models, which include both the acoustic and language model information, or combining them statically
beforehand (the AT&T approach, for which their FSM toolkit might be useful). Those who value their
sanity might consider the AT&T approach, but be warned that it is memory hungry.

b. Neural network-based speech recognition

Another approach in acoustic modeling is the use of neural networks. They are capable of solving much
more complicated recognition tasks, but do not scale as well as HMMs when it comes to large
vocabularies. Rather than being used in general-purpose speech recognition applications, they are better
suited to handling low-quality, noisy data and to speaker independence. Such systems can achieve greater
accuracy than HMM-based systems, as long as there is training data and the vocabulary is limited. A more general approach
using neural networks is phoneme recognition. This is an active field of research, but generally the results
are better than for HMMs. There are also NN-HMM hybrid systems that use the neural network part for
phoneme recognition and the hidden Markov model part for language modeling.
c. Dynamic time warping (DTW)-based speech recognition

Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in
time or speed. For instance, similarities in walking patterns would be detected, even if in one video the
person was walking slowly and if in another they were walking more quickly, or even if there were
accelerations and decelerations during the course of one observation. DTW has been applied to video,
audio, and graphics -- indeed, any data which can be turned into a linear representation can be analyzed
with DTW.

A well-known application has been automatic speech recognition, to cope with different speaking speeds.
In general, it is a method that allows a computer to find an optimal match between two given sequences
(e.g. time series) with certain restrictions, i.e. the sequences are "warped" non-linearly to match each
other. This sequence alignment method is often used in the context of hidden Markov models.
3. MATERIALS USED

A SAMSUNG headset with microphone (Model: EHS61ASFWE, Samsung Electronics Co. Ltd, South Korea),
an HP Pavilion dv6 laptop (2nd-gen Core i5, 4 GB RAM, Windows 10), and MATLAB software (ver. 2010,
MathWorks) were used in this study.

4. RESULTS AND DISCUSSION

A. Acquisition of Speech Signal:

The speech signal was acquired using the Samsung headset (microphone) connected to MATLAB, with the
sampling frequency kept at 8 kHz. The acquired speech signals of the various words (Forward, Reverse,
Right, Left, and Stop) are shown in figure 1.

Figure 1: Original speech signals of the words (a) forward, (b) reverse, (c) right, (d) left and (e) stop
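
As an illustration of this acquisition step, the following Python sketch records a short utterance at the
same 8 kHz sampling rate. The original work used MATLAB; the sounddevice library, the 2 s duration and
the output file name below are assumptions made purely for illustration.

import numpy as np
import sounddevice as sd
from scipy.io import wavfile

FS = 8000          # sampling frequency (Hz), as used in this study
DURATION = 2.0     # recording length in seconds (illustrative choice)

# Record a single-channel utterance from the default microphone
speech = sd.rec(int(DURATION * FS), samplerate=FS, channels=1, dtype='float64')
sd.wait()                     # block until the recording is finished
speech = speech.flatten()     # shape (N,) instead of (N, 1)

# Save the raw waveform for later processing (file name is illustrative)
wavfile.write('forward.wav', FS, (speech * 32767).astype(np.int16))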
B. Feature extraction:

The speech signal is segmented into overlapping fixed-length frames over short periods of time. For each
frame, the corresponding parameters are derived in either the frequency or the cepstral domain to form
the feature vector.

1) Pre-emphasis (Spectral Flattening):

Pre-emphasis is a very simple signal processing method used to artificially increase the amplitude of the
higher-frequency bands and suppress the amplitude of the lower-frequency bands. It uses a first-order FIR
filter (equation 1) to improve the signal-to-noise ratio [3, 12]:

X(z) = 1 - \alpha z^{-1}, \quad 0.9 \le \alpha \le 1    (1)

The output of the pre-emphasis network is then expressed as equation 2:

y'(m) = y(m) - \alpha \, y(m-1)    (2)

where y(m) is the input signal, y'(m) is the pre-emphasized output signal, and α is the pre-emphasis
parameter. The spectrum magnitude of y'(m) is increased by about 20 dB at the upper frequencies, with an
increase of up to 32 dB at the Nyquist frequency [1, 3]. Figure 2(c) shows the pre-emphasized output of the
processed speech signal.
Figure 2: The word 'forward': (a) original speech signal, (b) after silence removal and (c) pre-emphasized
speech signal
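
A minimal sketch of the pre-emphasis filter of equation 2, assuming NumPy and α = 0.97 (a value within
the stated range; the exact value used in the study is not specified):

import numpy as np

def pre_emphasis(y, alpha=0.97):
    """Apply y'(m) = y(m) - alpha * y(m - 1) to flatten the spectrum."""
    # The first sample has no predecessor, so it is passed through unchanged
    return np.append(y[0], y[1:] - alpha * y[:-1])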
2) Frame blocking & Windowing:

The pre-emphasized speech signal is non-stationary. The signal is therefore segmented into 25
millisecond frames (frame length of 200 samples) with a frame shift of 10 milliseconds (80 samples), which
allows the frames to overlap so that information at the frame boundaries is captured. Each frame is then
multiplied by a fixed-length window function, the frame being expressed as in equation 3:

\{ x(m) \}_{m=0}^{M_0 - 1}    (3)

where M_0 is the frame length.

Windowing the blocked speech signal y'(m) produces spectral leakage, i.e. non-zero spectral values at
frequencies other than ω. The leakage is highest near ω and low at frequencies far from ω. In general,
spectral analysis involves a trade-off between frequency resolution and time resolution: nearly
equal-strength components at the same frequencies are compared with unequal components at different
frequencies [7]. This trade-off occurs at the beginning and end of each frame. The Hamming and Hanning
windows are the two most widely used moderate windows in narrowband applications, such as telephone
channels, to minimize the signal discontinuities at the edges of each frame [13].

The Hamming window is used as the windowing function, defined as (equation 4):

x(m) = 0.54 - 0.46 \cos(2\pi m / M_0), \quad 0 \le m \le M_0; \qquad x(m) = 0 \text{ otherwise}    (4)

The output of this stage is expressed as equation 5:

X(m) = x(m) \cdot y'(m), \quad 0 \le m \le M_0    (5)


Figure 3: The 7th frame of the pre-emphasized speech signal (a-b) and the corresponding windowed frame (c)
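
The framing and windowing steps described above can be sketched as follows (illustrative Python,
assuming an 8 kHz signal so that 25 ms corresponds to 200 samples and 10 ms to 80 samples, as stated in
the text):

import numpy as np

def frame_and_window(signal, frame_len=200, frame_shift=80):
    """Split the pre-emphasized signal into overlapping frames and
    multiply each frame by a Hamming window (equations 3-5)."""
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hamming(frame_len)          # Hamming window, equation (4)
    frames = np.empty((num_frames, frame_len))
    for i in range(num_frames):
        start = i * frame_shift
        frames[i] = signal[start:start + frame_len] * window
    return frames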
3) Discrete Fourier Transform (DFT) and its Spectral analysis:

The DFT is used to convert a finite sequence of equally-spaced samples of a signal into an equal-length
sequence of frequency-domain samples of its Discrete-Time Fourier Transform (DTFT). The DFT of
equation 6 converts the speech data from the time domain to the frequency domain, whereas the DTFT
itself is a continuous complex-valued function of frequency:

Y(k) = \sum_{m=0}^{M-1} x(m) \, e^{-j 2\pi m k / M}, \quad 0 \le k \le M-1    (6)

Here, a 200-point FFT algorithm is used to transform each frame of 200 samples into its corresponding
DFT. The FFT output contains both real and imaginary parts, from which the magnitude spectrum is
computed using equation 7; the phase information is discarded. The magnitude spectrum of each frame
is stored in a single matrix, where each row represents a frame and the number of columns equals the
total number of samples (200):

|Y(k)| = \sqrt{ (\mathrm{Re}\,Y(k))^2 + (\mathrm{Im}\,Y(k))^2 }    (7)

Figure 4: Spectral representation of the speech signal (7th frame)
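
A sketch of the per-frame magnitude spectrum computation of equations 6 and 7, assuming NumPy and
the 200-point FFT mentioned above (the original implementation was in MATLAB):

import numpy as np

def magnitude_spectrum(frames, n_fft=200):
    """Compute |Y(k)| for every windowed frame (equations 6 and 7).
    Returns a matrix with one row per frame, as described in the text."""
    spectrum = np.fft.fft(frames, n=n_fft, axis=1)   # complex DFT of each frame
    return np.abs(spectrum)                          # sqrt(Re^2 + Im^2)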


4) Critical Band Filter Bank Analysis:

The human auditory system is more sensitive to low frequencies than to high frequencies. The critical
band filter bank analysis was therefore proposed and built on the basis of this initial transduction
phenomenon. To extract features from the spectral representation of the input speech signal, linear-phase
FIR bandpass filters (BPF) arranged on the Bark or Mel scale of a simple filter bank are used [7, 12, 13].
The perceived frequency of a speech signal does not follow the linear frequency scale used in the FFT;
rather, it is proportional to the logarithm of the linear frequency. Equation 8 relates the linear-scale
frequency to the corresponding Mel-scale frequency:

M_f(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)    (8)

Triangular overlapping filters are then used to create a spectral envelope representing the dominant
components present in the speech signal. These filters are spaced non-uniformly on the linear frequency
scale but uniformly on the Mel scale, with more filters placed in the lower-frequency region than in the
higher-frequency region (between the lower limit f_l and the upper limit f_h). Each filter is given by
equation (9) and must satisfy equation (10):

S_n(k) =
  \begin{cases}
    0 & \text{for } k < f(n-1) \\
    \dfrac{k - f(n-1)}{f(n) - f(n-1)} & \text{for } f(n-1) \le k \le f(n) \\
    \dfrac{f(n+1) - k}{f(n+1) - f(n)} & \text{for } f(n) \le k \le f(n+1) \\
    0 & \text{for } k > f(n+1)
  \end{cases}    (9)

\sum_{n=0}^{N-1} S_n(k) = 1    (10)

where n is the filter index in the filter bank. Twenty filters are used here to compute the Mel spectrum.
The centre frequencies f_c of the filter bank are defined by equation 11:

f(n) = \left(\frac{N+1}{f_s}\right) \mathrm{Mel}^{-1}\!\left( \mathrm{Mel}(f_l) + n \, \frac{\mathrm{Mel}(f_h) - \mathrm{Mel}(f_l)}{M+1} \right)    (11)

where f_s is the sampling frequency in Hz, M is the number of filters and N is the FFT size. The inverse
Mel-scale frequency is defined by equation 12:

\mathrm{Mel}^{-1}(a) = 700 \, (10^{a/2595} - 1)    (12)


Figure 5: Mel-scale frequency analysis of the speech signal (7th frame)
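
The 20-filter Mel filter bank of equations 8-12 can be constructed as in the sketch below, assuming NumPy,
an 8 kHz sampling rate and a 200-point FFT; the lower and upper frequency limits f_l and f_h are not
stated in the text, so 0 Hz and f_s/2 are assumed here:

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)          # equation (8)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)        # equation (12)

def mel_filter_bank(n_filters=20, n_fft=200, fs=8000, f_low=0.0, f_high=None):
    """Build triangular filters spaced uniformly on the Mel scale
    (equations 9-11). Returns an (n_filters, n_fft//2 + 1) matrix."""
    f_high = f_high if f_high is not None else fs / 2.0
    # n_filters + 2 boundary points, uniform in Mel, converted back to Hz
    mel_points = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    # Map boundary frequencies to the nearest FFT bin index
    bins = np.floor((n_fft + 1) * hz_points / fs).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for n in range(1, n_filters + 1):
        left, centre, right = bins[n - 1], bins[n], bins[n + 1]
        for k in range(left, centre):                  # rising slope
            fbank[n - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):                 # falling slope
            fbank[n - 1, k] = (right - k) / max(right - centre, 1)
    return fbank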

5) Log Energy Estimation of the Filter Bank:

The log energy E(s) resembles the smoothing of the spectrum on an approximately logarithmic scale, as
performed by the human ear. Equation 13 is used to compute E(s), the logarithm of the weighted sum of
the spectral energies of the components falling within each filter of the bank (i.e. one value per filter per
frame). Here, 20 numerical values are obtained for each frame at the output of the filter bank:

E(s) = \log_{10}\!\left( \sum_{k=0}^{K-1} |Y(k)|^2 \, S_n(k) \right), \quad 0 \le s \le S    (13)
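
Applying the filter bank to the magnitude spectrum and taking the logarithm (equation 13) reduces each
frame to 20 log-energy values; a minimal sketch continuing the functions above (the small floor value is an
assumption added to avoid taking the log of zero):

import numpy as np

def log_filter_bank_energies(mag_spectrum, fbank):
    """E(s) = log10( sum_k |Y(k)|^2 * S_n(k) ) for each frame and filter."""
    power = mag_spectrum[:, :fbank.shape[1]] ** 2      # |Y(k)|^2, one row per frame
    energies = power @ fbank.T                         # weighted sum per filter
    return np.log10(np.maximum(energies, 1e-12))       # floor avoids log10(0)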
6) Discrete Cosine Transformation:

The DCT collects the maximum information from the signal into its lower-order coefficients, allowing the
higher-order coefficients to be ignored, which significantly reduces the computational cost. Equivalently,
an inverse DFT (IDFT) converts the Mel-frequency log power spectrum back into the time (cepstral)
domain; the resulting coefficients are largely decorrelated, which allows diagonal covariance matrices to
be used in the Gaussian functions and reduces the complexity and computational cost. The DCT is
especially useful for speech recognition systems because of its simplicity. The DCT expresses a finite
sequence of data points as a sum of cosine functions oscillating at different frequencies, as in equation 14:

c(n) = \sum_{x=0}^{X-1} E(x) \cos\!\left( \frac{\pi n (x + \tfrac{1}{2})}{X} \right), \quad 0 \le n < N    (14)

where N is typically chosen between 8 and 13; in our case N = 13, giving 13 coefficients for each frame.
Cepstral analysis is thus used to calculate the Mel-frequency cepstral coefficient (MFCC) feature vector
from the Mel spectrum.
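
The DCT of equation 14, which keeps the first 13 coefficients per frame, can be sketched using SciPy's
type-II DCT, whose kernel matches the cosine term of equation 14 up to a normalisation factor:

from scipy.fftpack import dct

def mfcc_from_log_energies(log_energies, num_ceps=13):
    """Keep the first 13 DCT coefficients of the 20 log filter-bank
    energies for each frame (equation 14)."""
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :num_ceps]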

7) Delta & Acceleration coefficient estimation:

In addition to the MFCC coefficients, time-derivative approximations are important feature vectors for
representing the dynamic characteristics of the speech signal. To incorporate these dynamic
characteristics, the first-order (delta) and second-order (delta-delta, or acceleration) differences of the
MFCC coefficients are used. Including these two sets of coefficients further enhances the performance of
the speech recognition system. The delta (d_{t,k}) and acceleration (a_{t,k}) coefficients are defined by
equations 15 and 16 respectively:

 n  c(t  n, k )  c(t  n, k ) 
N

dt ,k  n 1

2   n 1 n 2 
N

  (15)

 n  d (t  n, k )  d (t  n, k ) 
N

at ,k  n 1

2  n1 n2 
N

  (16)

where k indexes the cepstral coefficient at frame t obtained after the DCT. In total, 39 features per frame
are obtained (13 MFCC, 13 delta and 13 acceleration coefficients).
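
The delta and acceleration coefficients of equations 15 and 16 are weighted differences over neighbouring
frames; the sketch below assumes a window of N = 2 frames and edge padding, neither of which is
specified in the text:

import numpy as np

def delta(features, N=2):
    """d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2 * sum_n n^2), per coefficient.
    Frames are padded at the edges by repeating the first/last frame."""
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(features, ((N, N), (0, 0)), mode='edge')
    T = features.shape[0]
    out = np.zeros_like(features)
    for t in range(T):
        for n in range(1, N + 1):
            out[t] += n * (padded[t + N + n] - padded[t + N - n])
    return out / denom

# mfcc: (num_frames, 13) matrix from the previous step;
# full 39-dimensional feature vector per frame: MFCC + delta + acceleration
# features = np.hstack([mfcc, delta(mfcc), delta(delta(mfcc))])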
C. Classifier:

The hidden Markov model (HMM) is the most common approach to the problem of classifying speech
signals. HMMs were originally applied to speech recognition by researchers at CMU and IBM in the early
1970s [14]. An HMM uses a statistical approach to recognize a signal/input. HMMs can be modified and
extended, and this flexibility has led to their use in a number of alternative models, for example in
combination with artificial neural networks (ANN-HMM) [15]. In this work, a hidden Markov-Gaussian
mixture model is used to develop the automatic speech recognition system.

1) HMM model for Speech Recognition [16]:

P = p_1, p_2, ..., p_N    the set of states, corresponding to the phones of each word.

Q = q_{01}, q_{02}, ..., q_{m1}, ..., q_{mn}    the transition probability matrix, where each q_{ij} is the
probability of phone i taking a self-loop or moving on to the next phone j.

Together, P and Q implement a pronunciation lexicon; the HMM state-graph structure for each word that
the system is capable of recognizing is shown in figure 6.

A = a_i(o_t)    the set of likelihoods, called emission probabilities, giving for each phone state i the
probability that the cepstral feature vector o_t is generated.

Figure 6: Graphical representation of the pronunciation lexicon of each word


2) Forward Algorithm:

Each cell of the forward trellis, w_t(j), represents the probability of being in state j after the first t
observations, given the model λ:

w_t(j) = P(o_1, o_2, o_3, \ldots, o_t, q_t = j \mid \lambda)

The value w_t(j) is computed for a given state j at time t as:

w_t(j) = \sum_{i=1}^{N} w_{t-1}(i) \, p_{ij} \, a_j(o_t)

where w_{t-1}(i) is the forward path probability at the previous time step, p_{ij} is the transition
probability from the previous state i to the current state j, and a_j(o_t) is the likelihood of the
observation o_t given the current state j.

To estimate the likelihood a_j(o_t) of a D-dimensional feature vector o_t for a given state j, the
diagonal-covariance multivariate Gaussian of equation 17 is used:

a_j(o_t) = \prod_{z=1}^{Z} \frac{1}{\sqrt{2\pi \sigma_{jz}^2}} \exp\!\left( -\frac{(o_{tz} - \mu_{jz})^2}{2 \sigma_{jz}^2} \right)    (17)

where μ_j represents the mean cepstral vector and σ_j² the variance cepstral vector of state j.

To avoid numerical underflow, the log of the likelihood is calculated using equations 18 and 19:

\log a_j(o_t) = -\frac{1}{2} \sum_{z=1}^{Z} \left[ \log(2\pi) + 2 \log \sigma_{jz} + \frac{(o_{tz} - \mu_{jz})^2}{\sigma_{jz}^2} \right]    (18)

\log a_j(o_t) = D - \frac{1}{2} \sum_{z=1}^{Z} \frac{(o_{tz} - \mu_{jz})^2}{\sigma_{jz}^2}    (19)

D = -\frac{1}{2} \sum_{z=1}^{Z} \left[ \log(2\pi) + 2 \log \sigma_{jz} \right]    (20)

where D is the precomputed constant given in equation (20).
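
A sketch of the diagonal-covariance Gaussian log-likelihood of equations 18-20 (illustrative Python; the
mean and variance vectors of each state would come from training):

import numpy as np

def log_emission_likelihood(o_t, mu_j, var_j):
    """log a_j(o_t) for a diagonal-covariance Gaussian state
    (equations 18-20). o_t, mu_j and var_j are length-Z vectors."""
    # D: the observation-independent part, precomputed per state (equation 20)
    D = -0.5 * np.sum(np.log(2.0 * np.pi) + np.log(var_j))
    # Observation-dependent part (equation 19)
    return D - 0.5 * np.sum((o_t - mu_j) ** 2 / var_j)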
The phonation model for the commands used to operate the wheelchair prototype is shown in Table 1,
and the corresponding LED indicators are shown in figure 7.

Table 1: Phonation model of the commands with their corresponding states

Command    Phonation          States
FORWARD    F|AO|R|W|ER|D      6
REVERSE    R|IH|V|ER|S        5
RIGHT      R|AY|T             3
LEFT       L|EH|F|T           4
STOP       S|T|AA|P           4

Figure 7: LED indication of each command/word (a) Forward (b) Reverse (c) Right (d) Left (e) Stop
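
For illustration, the pronunciation lexicon of Table 1 can be written directly in code; the dictionary below
simply lists the phone sequence, and hence the number of HMM states, for each command:

# Phone sequences from Table 1; each phone corresponds to one HMM state
LEXICON = {
    "FORWARD": ["F", "AO", "R", "W", "ER", "D"],   # 6 states
    "REVERSE": ["R", "IH", "V", "ER", "S"],        # 5 states
    "RIGHT":   ["R", "AY", "T"],                   # 3 states
    "LEFT":    ["L", "EH", "F", "T"],              # 4 states
    "STOP":    ["S", "T", "AA", "P"],              # 4 states
}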
The forward algorithm for computing the likelihood of an observed sequence is given below (expressed in
Python):

import numpy as np

def forward(observations, G, start_prob, trans_prob, final_prob, emission):
    """Forward algorithm: total likelihood of an observation sequence.

    observations     : list of L feature vectors o_1 ... o_L
    G                : number of emitting states in the word's state graph
    start_prob[y]    : probability of entering state y from the start state
    trans_prob[i][j] : transition probability a_{i,j} between emitting states
    final_prob[y]    : probability of leaving state y to the final state p_F
    emission(y, o)   : emission likelihood a_y(o) of observation o in state y
    """
    L = len(observations)
    fwd = np.zeros((G, L))

    # Initialization: forward[y, 1] = x_{0,y} * a_y(o_1)
    for y in range(G):
        fwd[y, 0] = start_prob[y] * emission(y, observations[0])

    # Recursion: forward[y, t] = sum_{y'} forward[y', t-1] * a_{y',y} * a_y(o_t)
    for t in range(1, L):
        for y in range(G):
            fwd[y, t] = sum(fwd[yp, t - 1] * trans_prob[yp][y]
                            for yp in range(G)) * emission(y, observations[t])

    # Termination: forward[p_F, L] = sum_y forward[y, L] * a_{y, p_F}
    return sum(fwd[y, L - 1] * final_prob[y] for y in range(G))

D. Hardware Implementation:

For real-time hardware implementation of the wheelchair (prototype) control, the following materials were used:

1. Arduino UNO Microcontroller

2. Arduino Motor Shield

3. Wheelchair Prototype

Arduino is an open-source electronics platform used to control various actuators such as motors, lights,
etc. It is a board with a microcontroller embedded on it. The ATmega328 microcontroller has 28 pins in
total, comprising digital and analog pins. The digital pins work with digital (binary) values. The board
needs a 5 V power supply for its operation and receives serial values from the laptop over the USB
connection. The microcontroller is programmed according to the application: the libraries for the
particular application are downloaded and the program is uploaded to the microcontroller using the
Arduino IDE. The Arduino board is shown in Fig. 8.
Figure 8. Arduino UNO microcontroller.

The Arduino IDE is used to program the Arduino microcontroller according to the requirement. The
Arduino IDE is a desktop application written in Java. Its programs are divided into three parts: the first
declares the pins, the second specifies each pin's mode as 'input' or 'output', and the third writes the
values to them. Fig. 2.2 shows the Arduino software used to program the microcontroller.

In this work, the Arduino UNO microcontroller board is used to control the movement signals of the
wheelchair. It communicates with the computing platform to receive signals from the computer and
sends them to the hardware platform to perform the actual motion.

The motor shield (shown in Fig. 9) is responsible for receiving the signals from the Arduino microcontroller
and sending them to the DC motors attached to the rear wheels of the wheelchair prototype. It is the
motor shield that allows both motors to rotate simultaneously in either the clockwise or anticlockwise
direction. It takes the signals from the microcontroller, amplifies them to increase their power, and then
sends them to the motors. The shield itself runs from the 5 V supplied by the microcontroller; however,
since the microcontroller supplies only 5 V while the motors need a 12 V supply, the motors cannot be
driven directly by the microcontroller. The shield therefore provides the 12 V supply to the motors from
an external source.

Figure 9. Arduino Motor Shield


The hardware implementation process is accomplished in three steps:

1) Interfacing of MATLAB with Arduino Board

This is done to send commands from the MATLAB application to the Arduino board in order to set or reset
its pins. For serial communication between the software and the hardware, two programs need to be
executed: the first runs on the Arduino and the second in MATLAB.
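
The original interfacing was between MATLAB and the Arduino over a serial link; purely as an illustration of
the same idea, the sketch below sends a one-character command using Python's pyserial. The port name,
baud rate and command characters are assumptions, not values taken from the original work:

import serial   # pyserial
import time

# Open the serial link to the Arduino (port name and baud rate are assumptions)
arduino = serial.Serial('COM3', 9600, timeout=1)
time.sleep(2)   # give the board time to reset after the port is opened

def send_command(word):
    """Map a recognized command word to a single character and send it."""
    codes = {"FORWARD": b'F', "REVERSE": b'B', "RIGHT": b'R',
             "LEFT": b'L', "STOP": b'S'}
    arduino.write(codes[word])

send_command("FORWARD")   # the Arduino program would decode 'F' and drive the motors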

2) Interfacing of Arduino board with Motor Shield

The pins of the Arduino motor shield align with those of the Arduino UNO (shown in Fig. 10), so the motor
shield pins are inserted into the sockets of the Arduino UNO. The motor shield has 2 channels, which allows
two DC motors to be controlled. With an external power supply, the motor shield can safely supply up to
12 V and 2 A per motor channel. Certain pins on the Arduino are always in use by the shield; by addressing
these pins we can select a motor channel, specify the motor direction (polarity), set the motor speed
(PWM), and stop and start the motor.

Figure 10. Arduino Motor Shield interfaced with Arduino UNO

3) Interfacing of Motor shield with DC Motors

The DC motors are connected to the two channels A and B with wires, and power is supplied to the motor
shield by an external battery. For forward and backward motion, both motors rotate simultaneously but in
opposite directions. For a left or right turn, one motor remains stopped while the other moves in the
forward direction; for instance, for a right turn the motor on the right side is at rest while the motor on the
left side moves forward, resulting in a movement to the right.

To communicate with the wheelchair, the computer sends the command as digital data through the serial
port to the Arduino UNO, which then transmits the signal to the motor shield, which in turn drives the DC
motors of the wheelchair prototype.
5. CONCLUSION

We observed that introducing a hidden Markov-Gaussian mixture model for classifying human speech
improved the performance of the system to a large extent, but the reliability of the proposed model
depends strongly on the maximum-likelihood probabilities, which may occasionally mislead the
rehabilitative aid (wheelchair). To improve classification efficiency, different feature extraction methods
need to be explored to improve the reliability of classifying the speech signal. An LED system was first used
to validate the results: with the words Forward, Reverse, Right, Left and Stop, the system was found to
work well with an efficiency of about 98%. At a later stage, a hardware implementation was carried out to
control a wheelchair prototype using the speech signal in real time. Hence, we propose this model as a
good and efficient one that can be used to drive a wheelchair (a rehabilitative aid) to help differently-abled
people who need assistance with movement/transportation, as in other techniques [17-20]. Wavelet-based
feature extraction methods can also be employed to improve the discrimination of the extracted speech
features and thus enhance the probability functions [21].

6. FUTURE WORK

The focus of future work will be on following points:

 Combining multiple classifiers with different features for more robust speech recognition.
 Implementing Principal Component Analysis (PCA) with an ensemble classifier for the best outcome.
 Making the whole system wireless.
REFERENCES

[1] J. H. Martin and D. Jurafsky, "Speech and language processing," International Edition, vol. 710, 2000.

[2] C. C. Chibelushi, F. Deravi, and J. S. Mason, "A review of speech-based bimodal recognition," IEEE
transactions on multimedia, vol. 4, pp. 23-37, 2002.

[3] J. W. Picone, "Signal modeling techniques in speech recognition," Proceedings of the IEEE, vol. 81,
pp. 1215-1247, 1993.

[4] M. Benzeghiba, R. De Mori, O. Deroo, S. Dupont, T. Erbes, D. Jouvet, et al., "Automatic speech
recognition and speech variability: A review," Speech communication, vol. 49, pp. 763-786, 2007.

[5] M. Singh, R. Verma, G. Kumar, and S. Singh, "Machine perception in biomedical applications: an
introduction and review," Journal of Biological Engineering Research and Review, vol. 1, pp. 20-25, 2014.

[6] F. Itakura, "Line spectrum representation of linear predictor coefficients of speech signals," The Journal
of the Acoustical Society of America, vol. 57, pp. S35-S35, 1975.

[7] M. A. Hossan, S. Memon, and M. A. Gregory, "A novel approach for MFCC feature extraction," in
Signal Processing and Communication Systems (ICSPCS), 2010 4th International Conference on, 2010,
pp. 1-5.

[8] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," the Journal of the Acoustical
Society of America, vol. 87, pp. 1738-1752, 1990.

[9] P. Mermelstein, "Articulatory model for the study of speech production," The Journal of the Acoustical
Society of America, vol. 53, pp. 1070-1082, 1973.

[10] W. V. Summers, D. B. Pisoni, R. H. Bernacki, R. I. Pedlow, and M. A. Stokes, "Effects of noise on
speech production: Acoustic and perceptual analyses," The Journal of the Acoustical Society of America,
vol. 84, pp. 917-928, 1988.

[11] P. Lieberman, "Some effects of semantic and grammatical context on the production and perception
of speech," Language and speech, vol. 6, pp. 172-187, 1963.

[12] C. Nadeu, D. Macho, and J. Hernando, "Time and frequency filtering of filter-bank energies for robust
HMM speech recognition," Speech Communication, vol. 34, pp. 93-114, 2001.

[13] W. Han, C.-F. Chan, C.-S. Choy, and K.-P. Pun, "An efficient MFCC extraction method in speech
recognition," in Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE International
Symposium on, 2006, p. 4 pp.

[14] M. J. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition,"
Computer speech & language, vol. 12, pp. 75-98, 1998.

[15] G. Rigoll, "Maximum mutual information neural networks for hybrid connectionist-HMM speech
recognition systems," IEEE Transactions on Speech and Audio Processing, vol. 2, pp. 175-184, 1994.

[16] A. Acero, L. Deng, T. T. Kristjansson, and J. Zhang, "HMM adaptation using vector taylor series for
noisy speech recognition," in INTERSPEECH, 2000, pp. 869-872.

[17] K. Uvanesh, S. K. Nayak, B. Champaty, G. Thakur, B. Mohapatra, D. Tibarewala, et al., "Classification
of Surface Electromyogram Signals Acquired from the Forearm of a Healthy Volunteer," in Classification
and Clustering in Biomedical Signal Processing, ed: IGI Global, 2016, pp. 315-333.

[18] K. Uvanesh, S. K. Nayak, B. Champaty, G. Thakur, B. Mohapatra, D. Tibarewala, et al., "Development
of a Surface EMG-Based Control System for Controlling Assistive Devices: A Study on Robotic Vehicle,"
in Classification and Clustering in Biomedical Signal Processing, ed: IGI Global, 2016, pp. 335-355.

[19] L. Y. Deng, C.-L. Hsu, T.-C. Lin, J.-S. Tuan, and S.-M. Chang, "EOG-based Human–Computer
Interface system development," Expert Systems with Applications, vol. 37, pp. 3337-3343, 2010.

[20] F. Lotte, M. Congedo, A. Lécuyer, F. Lamarche, and B. Arnaldi, "A review of classification algorithms
for EEG-based brain–computer interfaces," Journal of neural engineering, vol. 4, p. R1, 2007.

[21] Y. Hu and P. C. Loizou, "Speech enhancement based on wavelet thresholding the multitaper spectrum,"
IEEE transactions on Speech and Audio processing, vol. 12, pp. 59-67, 2004.
