
Ethiopian Sign Language Recognition Using Artificial Neural Network

Yonas Fantahun Admasu
Department of Electrical and Computer Engineering
Addis Ababa University
Addis Ababa, Ethiopia
e-mail: yofad2000@yahoo.com

Kumudha Raimond
Department of Electrical and Computer Engineering
Addis Ababa University
Addis Ababa, Ethiopia
e-mail: kumudharaimond@yahoo.co.in

Abstract— Pattern recognition is a challenging, multidisciplinary research area that attracts both researchers and practitioners. Gesture recognition is a specialized pattern recognition task whose goal is to interpret human gestures via mathematical models. One application of gesture recognition is sign language recognition, sign language being the basic communication method among deaf people. Because proficient sign language teachers are lacking at schools for the deaf, the teaching and learning process remains affected. A system is therefore required to overcome the communication barriers facing the deaf community. In this paper, a hand gesture detection and recognition system for Ethiopian Sign Language (ESL) is proposed. Gabor Filter (GF) together with Principal Component Analysis (PCA) is used to extract features from digital images of hand gestures, while an Artificial Neural Network (ANN) is used to recognize the ESL from the extracted features and to translate it into Amharic voice. The experimental results show that the system achieves a recognition rate of 98.53%.

Keywords--sign language recognition; artificial neural network; gabor filter; principal component analysis

I. INTRODUCTION
Hand gesture is one of the typical methods used in sign language for non-verbal communication. A sign language is a language which, instead of acoustically conveying sound patterns, uses visually transmitted sign patterns to convey meaning. Sign language is most commonly used by deaf people to communicate among themselves or with hearing people [9].

According to the 1994 Housing and Population Census of Ethiopia, there were 190,220 deaf and hard of hearing people in the country, most of whom are young. The majority of deaf people reside in rural areas where there are no schools for the deaf. Because of this, the great majority are illiterate and cannot interact adequately with their peers. In addition, there are no proficient sign language teachers and interpreters at schools, and as a result the teaching and learning process is handicapped.

A system is therefore required which can act as an interpreter between the deaf and hearing communities, besides training sign language teachers to communicate efficiently with deaf people.

So, in this paper, a hand gesture detection and recognition system is proposed to detect and recognize the ESL using ANN and to translate it into Amharic voice. The system has been verified using 34 letters of the Ethiopian Manual Alphabets (EMA). The system will act as a unidirectional interpreter between the deaf and hearing communities and is expected to alleviate the communication barriers facing the deaf community in Ethiopia and to improve the learning skill of interpreters and ESL teachers.

II. RELATED WORKS
Sign recognition has been studied extensively by different communities. Research on hand gestures can be classified into three categories: glove based, vision based and analysis of drawing gestures, as stated in [7].

A hand gesture detection and recognition technique for international sign language was proposed in [9], in which a neural network was employed as a knowledge base for sign language. The system yielded a recognition rate of 78.6% for 70 input images, but its preprocessing and recognition stages need to be improved for a better recognition rate. The Korean Manual Alphabet recognition system reported in [2] recognized skin-colored human hands by implementing a fuzzy min-max neural network algorithm and produced a recognition rate of 96.7%.

Four very simple but efficient methods were proposed in [10] to implement hand gesture recognition, namely Subtraction, Gradient, PCA and Rotation Invariant. The methods were successful in retrieving the correct matches. A new method for hand gesture recognition was proposed in [4]. The system was based on an innovative Self-Growing and Self-Organized Neural Gas (SGONG) network and achieved a recognition rate of 90.45%. A hierarchical gesture recognition algorithm was proposed in [1] to recognize a large number of gestures using a Kalman filtering process, hidden Markov models and graph matching, and achieved a recognition rate of 89.6%. A vision-based gesture recognition system was proposed in [17]. The system used a neural network and PCA as well as clustering and encoding techniques.

After closely reviewing the reported works on hand gesture detection and recognition, the system proposed in this paper was designed using image preprocessing and feature extraction techniques (GF together with PCA) that have not been used in the previous works.

III. DESIGN OF SYSTEM ARCHITECTURE
The block diagram of the proposed system is illustrated in Fig.1. The details of the system are explained below.

978-1-4244-8136-1/10/$26.00 © 2010 IEEE
Figure 1. Proposed System Architecture

A. Data Collection
ESL images of size 640 X 480 were collected from five ESL trained students using a Sony DSC-S730 digital camera in jpg format for learning and recognition purposes. Five samples were collected for each of the 34 EMA letters, one from each ESL student, for a total of 170 images. All images were captured in a classroom against a blackboard background, with the letters signed by the right hand held in front of the blackboard. Further, in order to reduce the difficulty of segmentation caused by the high variation of skin color among the signers, a white glove was used. All captured images were preprocessed to have a uniform background color using Adobe Photoshop Creative Suite 4. Both the original images (Fig. 2a) and the preprocessed images (Fig. 2b) were used for further processing, so the total number of images became 340. Samples of the processed EMA and the corresponding ESL are shown in Fig.2c.

Figure 2. Ethiopian Manual Alphabets

B. Image Pre-processing
The original and processed images were subjected to the image pre-processing steps shown in Fig.3. The steps are explained below and the corresponding outputs for a sample image are shown in Fig.4.

1) Image Size Normalization: This step resizes the image. In this case, the acquired image is resized to 128 X 96 as shown in Fig. 4a.

2) Image-background Subtraction: In the captured image, the background illumination is not uniform. A morphological opening (erosion followed by dilation), as explained in [15], was used to estimate the background illumination. To create a more uniform background, the estimated background image was subtracted from the original image (Fig. 4b).

3) Image Adjustment: This step is usually applied to images that are too dark or too bright in order to enhance image quality and to improve hand recognition performance. It modifies the dynamic range (contrast range) of the image and, as a result, some important features become more apparent (Fig. 4c).

4) Image Segmentation: Image segmentation is the process of dividing an image into non-overlapping, connected image areas, called regions, on the basis of criteria governing similarity and homogeneity (Fig. 4d).

5) Image Filtering: The median filter is a nonlinear filter used to remove impulsive noise from an image. It is more robust than traditional linear filtering because it simultaneously reduces noise and preserves sharp edges (Fig. 4e).

Figure 4. Image Pre-Processing Results

C. Feature Extraction
For most applications, it is necessary first to transform the data into some representation before training a neural network. In practice, it is always advantageous to apply pre-processing transformations, as shown above, to the input data before it is presented to a network, and the choice of pre-processing is one of the most significant factors in determining the performance of the final system. Another important way in which a network's performance can be improved, sometimes dramatically, is through reduction of the dimensionality of the input data. Dimensionality reduction involves forming linear or non-linear combinations of the original variables to generate inputs for the network. Such combinations of inputs are sometimes called features, and the process of generating them is called feature extraction. The principal motivation for dimensionality reduction is that it helps to alleviate the worst effects of high dimensionality.

In this work, a combination of GF together with PCA is used for feature extraction.
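The median filtering of step 5 in Section B can be illustrated with a short sketch. The 3 x 3 window, the border rule (coordinates are clamped to the image edge) and the function name `median_filter` are assumptions made for illustration; the paper does not state the window size that was used.

```python
from statistics import median


def median_filter(image, size=3):
    """Apply a size x size median filter to a 2-D list of gray levels.

    Border pixels are handled by clamping coordinates to the image edge
    (an assumption; other border rules are equally common).
    """
    rows, cols = len(image), len(image[0])
    half = size // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = []
            for dr in range(-half, half + 1):
                for dc in range(-half, half + 1):
                    rr = min(max(r + dr, 0), rows - 1)
                    cc = min(max(c + dc, 0), cols - 1)
                    window.append(image[rr][cc])
            # The median discards isolated outliers instead of averaging them in.
            out[r][c] = median(window)
    return out


# A dark region next to a bright region, with one impulsive pixel (255).
noisy = [
    [10, 10, 10, 200, 200],
    [10, 255, 10, 200, 200],
    [10, 10, 10, 200, 200],
]
clean = median_filter(noisy)
```

In this toy image the isolated value 255 is replaced by its neighbourhood median while the vertical edge between the dark and bright regions stays in place, which is the behaviour the text attributes to the median filter.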
Figure 3. Image Pre-Processing Steps

1) Gabor Filter: GFs have been used for several years to extract features from images. The Gabor wavelet representation of an image is the convolution of

996 2010 10th International Conference on Intelligent Systems Design and Applications
the image with a family of Gabor kernels as defined by Eq. (1). A 2-D Gabor filter is the product of an elliptical Gaussian at an arbitrary rotation and a complex exponential representing a sinusoidal plane wave. The sharpness of the filter along the major and minor axes is controlled by γ and η respectively [5].

$$\psi(x, y) = \frac{f_0^2}{\pi \gamma \eta} \exp\left(-\left(\frac{f_0^2}{\gamma^2} x'^2 + \frac{f_0^2}{\eta^2} y'^2\right)\right) \exp\left(j 2 \pi f_0 x'\right), \quad x' = x \cos\theta + y \sin\theta, \quad y' = -x \sin\theta + y \cos\theta \qquad (1)$$

where f0 is the central frequency of the filter, θ is the rotation angle of both the Gaussian major axis and the plane wave, γ is the sharpness along the major axis and η is the sharpness along the minor axis (perpendicular to the wave).

To facilitate the GF representation, the images were scaled to 128 X 128 using bicubic interpolation. A number of GFs at different scales and orientations are usually used [16]. So, in this paper, a filter bank of GFs with 5 spatial frequencies (v) and 8 distinct orientations (μ) was designed, giving 40 different GFs as represented in Fig 5(a) and (b). Once the GFs have been designed, image features at different locations, frequencies and orientations can be extracted by convolving the image I(x, y) with the filters using Eq. (2) [8]. The results are shown in Fig 5(c) and 5(d).

$$O_{\mu,v}(x, y) = I(x, y) * \psi_{\mu,v}(x, y) \qquad (2)$$

where * denotes 2-D convolution and O_{μ,v}(x, y) is the response of the filter with orientation index μ and frequency index v.

The feature vector consists of both the real and the imaginary parts of the Gabor transform, and its length was 655360 (128 * 128 * 40). Each output was first down sampled by a factor of ρ=16 to reduce the dimensionality of the original vector space [3]. The down sampling factor was selected experimentally based on the available computational power and the memory requirements of the calculation. The output of the GF is shown in Fig.6. The resulting feature vector of length 40960 (655360 / 16) from GF was given to PCA.

Figure 6. Output of GF (a) Original image, (b) Output image, (c) Down sampled image

2) Principal Component Analysis
PCA is a useful statistical technique that has found applications in fields such as pattern recognition and image compression, and is a common technique for finding patterns in data of high dimension.

The mean and the covariance of the vector population X (from GF), of dimension 40960 by 340, were calculated. Then the eigenvalues and eigenvectors of the covariance matrix were computed. Once the eigenvectors are found, the next step is to order them by eigenvalue, highest to lowest, which arranges the components in order of significance. The eigenvectors with the highest eigenvalues are the principal components of the data set. The optimal number of eigenvectors is selected based on the reconstruction mean square error (MSE). The first 200 eigenvectors were selected for further work, as they provide sufficient dimensionality reduction and the best MSE when used to reconstruct the original input to PCA (Fig 7(a) and 7(b)).

Figure 7. (a) Input to PCA, (b) Reconstructed Output

D. Data representation
The input data for the ANN is the real-valued representation provided by the PCA technique. One-of-N code is the most common representation for ANN output data. It has a length of 34 (the number of EMA categories), where every element in the code vector is 0 except for a single element (set to 1), which represents the EMA category number [6].
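The one-of-N output code described in Section D can be sketched as follows. The helper names `one_of_n` and `decode`, and the choice of 1-based category numbers, are assumptions made for illustration; only the vector length of 34 comes from the paper.

```python
NUM_CLASSES = 34  # number of EMA categories stated in the paper


def one_of_n(category, num_classes=NUM_CLASSES):
    """Return the one-of-N target vector for a 1-based category number."""
    if not 1 <= category <= num_classes:
        raise ValueError("category out of range")
    vector = [0] * num_classes
    vector[category - 1] = 1  # the single element set to 1 marks the category
    return vector


def decode(vector):
    """Recover the 1-based category number from a one-of-N vector."""
    return vector.index(1) + 1


target = one_of_n(5)
```

Each training image is paired with such a vector, so the network has one output neuron per EMA letter.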
Figure 5. Gabor Filter (a) Real Part (b) Magnitude; Convolution output of sample image (c) Real part and (d) Magnitude

E. Training, Testing and Validation Data sets
The most common approach is to randomly divide the source data into two or more data sets for training and testing. To improve generalization, it is better to split the data into three sets (training, validation and testing) based on their standard deviation. So, the 10 images in each EMA category were sorted in increasing order of their standard deviation. Then, the representative vectors were selected experimentally. According to the experiments, image numbers 3, 7 and 9 were selected for testing (30%), number 4 for validation (10%) and numbers 1, 2, 5, 6, 8 and 10 for training (60%). A sample graph of the standard deviation is shown in Fig.8.

Figure 8. Image index and their standard deviation

F. Neural Network Modeling

1) Network Architecture Selection: Once the data representation and selection are complete, the next step is to model and train the ANN. A Multilayer Perceptron (MLP) feed forward ANN is used for recognition. Fig. 9 shows the functional block diagram of the MLP classifier. Details of MLP are found in [13].

The network consists of three layers: input, hidden and output. The number of neurons in the input layer is equal to the number of features retained by PCA, and the number of output layer neurons is equal to the number of categories to be recognized. The number of hidden layer nodes was selected experimentally. So, the architecture consists of 200, 100 and 34 neurons in the input, hidden and output layers respectively.

The neural network with this architecture was trained by back propagation with momentum. A number of parameters are associated with this training. Some of them, such as the learning rate (η), the momentum constant (α) and the number of epochs, were determined experimentally. Initial weights were generated randomly between -0.5 and 0.5. The stopping criterion was set based on the validation data performance. The training MSE at the output layer was set to various trial values and the optimal value was selected experimentally.

Figure 9. Functional block diagram of MLP

2) Selection of MSE: The training MSE is the objective function of the MLP that has to be minimized. In this work, the optimum target value was selected experimentally as shown in Fig. 10. Experiments were conducted for various MSE values, and the optimal value was selected based on the recognition rate on the testing and validation sets and on the number of epochs. In this case, an MSE of 1e-006 was selected (pointed to by the arrow).

Figure 10. Effect of MSE on Recognition Rate

3) Selection of Network Parameters: Using an MLP with 100 hidden neurons, different combinations of η ∈ {0.01, 0.1, 0.5, 0.9} and α ∈ {0, 0.1, 0.5, 0.7, 0.9} were simulated to observe their effect on network convergence and recognition.

From the results, it was observed that with a small η and a high α the network settles to a global solution. However, smaller values of η take a larger number of epochs to converge and may also get stuck in local minima. To address this problem, learning was implemented with an adaptive η: during training, as the training error increased, the value of η was decreased, and vice versa. As the value of α increased, the performance of the network improved considerably, because momentum can help the network escape local minima. The optimum values of η and α were found to be 0.1 and 0.7 respectively. Sample experimental results for η=0.1 and α ∈ {0, 0.1, 0.5, 0.7, 0.9} are shown in Fig.11.

Figure 11. Effect of Momentum on Network Performance
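The training scheme of Section F (back propagation with momentum, initial weights in [-0.5, 0.5], and a learning rate that shrinks when the error rises and grows otherwise) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy 3-4-3 network and data, the batch update, the adjustment factors 0.7 and 1.05, and the bounds on η are all assumptions; only η=0.1, α=0.7 and the weight range come from the paper, whose network is 200-100-34.

```python
import math
import random


def sigmoid(x):
    # Numerically safe logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def init(rows, cols, rng):
    # Initial weights drawn uniformly from [-0.5, 0.5], as stated in the paper.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def forward(x, w1, w2):
    # The constant 1.0 appended to each layer input acts as a bias term.
    h = [sigmoid(sum(w * a for w, a in zip(row, x + [1.0]))) for row in w1]
    y = [sigmoid(sum(w * a for w, a in zip(row, h + [1.0]))) for row in w2]
    return h, y


def mse(data, w1, w2):
    total = 0.0
    for x, t in data:
        _, y = forward(x, w1, w2)
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
    return total / (len(data) * len(data[0][1]))


def train(data, n_hidden, epochs, eta=0.1, alpha=0.7, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(data[0][0]), len(data[0][1])
    w1, w2 = init(n_hidden, n_in + 1, rng), init(n_out, n_hidden + 1, rng)
    vel1 = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
    vel2 = [[0.0] * (n_hidden + 1) for _ in range(n_out)]
    history = [mse(data, w1, w2)]
    for _ in range(epochs):
        g1 = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        g2 = [[0.0] * (n_hidden + 1) for _ in range(n_out)]
        for x, t in data:  # accumulate the batch gradient by back propagation
            h, y = forward(x, w1, w2)
            d_out = [(yi - ti) * yi * (1 - yi) for yi, ti in zip(y, t)]
            d_hid = [hj * (1 - hj) * sum(d_out[k] * w2[k][j] for k in range(n_out))
                     for j, hj in enumerate(h)]
            for k in range(n_out):
                for j, a in enumerate(h + [1.0]):
                    g2[k][j] += d_out[k] * a
            for j in range(n_hidden):
                for i, a in enumerate(x + [1.0]):
                    g1[j][i] += d_hid[j] * a
        for w, vel, g in ((w1, vel1, g1), (w2, vel2, g2)):
            for r in range(len(w)):
                for c in range(len(w[r])):
                    # Momentum: the new step reuses a fraction alpha of the last one.
                    vel[r][c] = alpha * vel[r][c] - eta * g[r][c]
                    w[r][c] += vel[r][c]
        history.append(mse(data, w1, w2))
        # Adaptive learning rate: shrink eta if the error rose, otherwise grow it.
        if history[-1] > history[-2]:
            eta = max(eta * 0.7, 1e-4)
        else:
            eta = min(eta * 1.05, 0.5)
    return w1, w2, history


# Toy data: three input patterns, each mapped to its own one-of-N target.
data = [([1.0, 0.0, 0.0], [1, 0, 0]),
        ([0.0, 1.0, 0.0], [0, 1, 0]),
        ([0.0, 0.0, 1.0], [0, 0, 1])]
w1, w2, history = train(data, n_hidden=4, epochs=300)
```

The list `history` holds the MSE before training and after every epoch, so it can be inspected to see the effect of different α values in the spirit of Fig. 11.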

4) Recognition / Testing Phase: During the testing phase, if one neuron at the output layer is activated (value greater than 0.5) and the remaining neurons are deactivated (less than 0.5), then the test image is recognized as belonging to the class determined by the position of the activated neuron; otherwise the image is unrecognized.

If more than one neuron has a value greater than 0.5 (the threshold value) and the difference between them is less than 0.0010, the image is considered unrecognized. If all the neurons are deactivated (less than 0.5), the image is also considered unrecognized. The threshold value and the difference between the nodes' outputs stated above were selected experimentally.

During testing, the test image was subjected to the preprocessing steps and projected onto the eigenspace, and the feature vector obtained was used as the input to the network. The testing set is composed of three subsets:

a) EMA Images with Glove (30% of original data set): The first subset consists of 30% of the 340 samples, selected through standard deviation. The trained network was tested on it and the recognition rate was found to be 98.53%.

b) EMA Images without Glove: The second subset consists of 15 patterns collected from five people not trained in ESL, without hand gloves, as shown in Fig. 12. The proposed system recognized all the images except images 12 and 15, which were unrecognized and wrongly recognized respectively. The average recognition rate was found to be 86.67%.

Figure 12. Images Without Glove

c) Non-EMA Images without Glove: The third subset consists of unknown non-EMA images (Fig. 13) collected from websites, research papers and a random set of users to evaluate the performance of the network. The network correctly rejected those non-EMA images.

Figure 13. Non EMA Images

5) The Tolerance of the System for Varying Environments: The network's performance was good when it was presented with images of various pixel sizes, different backgrounds, different image brightness and images captured from different persons. The network also performed well for a certain level of image orientation. However, it did not perform well as the orientation level increased, because a rotated sign is highly likely to resemble other EMAs. Gestures with different orientations usually indicate different objects in sign language [11]. Also, the network performed well up to a noise level of 30-40%. Overall, a satisfactory result was obtained.

6) False Rejection Rate (FRR) and False Acceptance Rate (FAR): It is necessary to measure accuracy in order to evaluate the proposed system. During testing, the image obtains some value at each output node. If the value obtained is above or equal to some threshold, the image is considered recognized; otherwise the image is not recognized (as stated in Section F.4 above). So, it is important to select the threshold value experimentally. The selection is done through two probabilities [12, 14]:

False Acceptance Rate (FAR): The chance that a non-EMA image will be erroneously recognized (obtain a matching score equal to or higher than the threshold).

False Rejection Rate (FRR): The chance that a correct EMA image will not obtain a score equal to or above the threshold.

To determine the FAR of the proposed system, the non-EMA images shown in Fig.13 were used. The results show the potential of the proposed system: it rejected all the new images, giving a FAR of 0% for thresholds up to 0.9 when the difference between the two highest output values was 0.0010, as shown in Table 1. The table also shows the FRR corresponding to the test images (102 test images and EMA without glove) of the data sets collected. So, the best values for FRR and FAR of the proposed system are 3.921% and 0% respectively. Accordingly, the threshold value and the difference between the two highest output values were set to 0.5 and 0.0010 respectively in order to achieve good rejection and acceptance rates.

G. Translation to Voice in Real-time

Once the image is recognized, it is presented to the user (a hearing person) as audio.
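The decision rule of Section F.4 and the two error rates of Section F.6 can be sketched as below. The function names and the toy output vectors are hypothetical. The sketch also assumes that when two neurons exceed the threshold but differ by at least the minimum gap, the larger one is accepted; the paper does not spell this case out. Misclassified EMA images are not counted here, since the FRR definition above concerns rejection only.

```python
def classify(outputs, threshold=0.5, min_gap=0.0010):
    """Return the 0-based index of the recognized class, or None if rejected."""
    ranked = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    best = ranked[0]
    if outputs[best] <= threshold:
        return None  # no output neuron is activated
    if len(ranked) > 1:
        second = ranked[1]
        if outputs[second] > threshold and outputs[best] - outputs[second] < min_gap:
            return None  # two activated neurons are too close to separate
    return best  # assumption: a clear winner is accepted


def false_rejection_rate(ema_outputs, **kwargs):
    """Percentage of genuine EMA output vectors that are rejected."""
    rejected = sum(classify(o, **kwargs) is None for o in ema_outputs)
    return 100.0 * rejected / len(ema_outputs)


def false_acceptance_rate(non_ema_outputs, **kwargs):
    """Percentage of non-EMA output vectors that are accepted as some class."""
    accepted = sum(classify(o, **kwargs) is not None for o in non_ema_outputs)
    return 100.0 * accepted / len(non_ema_outputs)


# Toy network outputs (three classes instead of 34, for readability).
ema = [[0.1, 0.9, 0.2], [0.8, 0.1, 0.1], [0.3, 0.4, 0.2], [0.1, 0.2, 0.7]]
non_ema = [[0.2, 0.3, 0.1], [0.9000, 0.9005, 0.1]]

frr = false_rejection_rate(ema)
far = false_acceptance_rate(non_ema)
```

Sweeping `threshold` and `min_gap` over a grid and recording both rates gives a table of the same shape as Table 1.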

TABLE 1. FAR and FRR values at different threshold values

Threshold   Difference Between the Two Highest Numbers   FRR (%)   FAR (%)
0.9         0.0001                                       7.843     23.08
0.9         0.0006                                       7.843     7.69
0.9         0.0010                                       7.843     0.00
0.5         0.0001                                       3.921     23.08
0.5         0.0006                                       3.921     7.69
0.5         0.0010                                       3.921     0.00
0.2         0.0001                                       6.862     23.08
0.2         0.0006                                       6.862     7.69
0.2         0.0010                                       6.862     0.00

IV. CONCLUSIONS

A hand gesture detection and recognition system for EMA was developed using GF together with PCA for feature extraction and ANN for recognizing the ESL of 34 letters of EMA. The experiments conducted showed that, with sufficient data, the ANN approach produces very good results and is able to recognize an unknown input sign very fast. Also, the rejection capability and the fault tolerance of the system were observed using non-EMA images and transformed EMA images. Overall, the system is tolerant to most of the transformations. Further, the recognition capability of the system was tested in real time by capturing images through a camera. The results show the feasibility of the approach followed in this work for recognition.

REFERENCES
[1] A. Shamaie and A. Sutherland, "Accurate Recognition of Large Number of Hand Gestures," 2nd Iranian Conference on Machine Vision and Image Processing, K.N. Toosi University of Technology, Tehran, Iran, 13-15 February 2003.
[2] C. Lee, G. Park and J. Kim, "Real-Time Recognition System of KSL based on Elementary Components," Journal of the Korea Institute of Telematics and Electronics, vol. 35, no. 6, pp. 76-87, 1998.
[3] C. Liu, "Independent Component Analysis of Gabor Features for Face Recognition," IEEE Transactions on Neural Networks, vol. 14, no. 4, pp. 919-928, July 2003.
[4] E. Stergiopoulou and N. Papamarkos, "A New Technique for Hand Gesture Recognition," Proc. of the IEEE International Conference on Image Processing, 2006, pp. 2657-2660.
[5] J. Kämäräinen, "Feature Extraction using Gabor Filters," Ph.D Thesis, Lappeenranta University of Technology, Finland, 2003.
[6] J. P. Bigus, "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support," McGraw Hill, 1996.
[7] K. Symeonidis, "Hand Gesture Recognition Using Neural Networks," Master Thesis, Surrey University, August 2000.
[8] L. Shen and L. Bai, "Information Theory for GF Selection for Face Recognition," EURASIP Journal on Applied Signal Processing, article ID 30274, pp. 1-11, 2006.
[9] Oi M. Foong, T. J. Low, and S. Wibowo, "Hand Gesture Recognition: Sign to Voice System," Proc. of World Academy of Science, Engineering and Technology, August 2008, vol. 32, pp. 2070 – 3740.
[10] P. Chakraborty, P. Sarawgi, A. Mehrotra, G. Agarwal, R. Pradhan, "Hand Gesture Recognition: A Comparative Study," Proc. of the International MultiConference of Engineers and Computer Scientists, March 2008, vol. I, pp. 388-393.
[11] R. Liang and M. Ouhyoung, "A Real-time Continuous Gesture Recognition system for Sign Language," Third IEEE International Conference on Automatic Face and Gesture Recognition, April 14-16, 1998, pp. 558-567.
[12] S. E. Wolde and K. Raimond, "Face Recognition Using Artificial Neural Network," Zede: Journal of the Ethiopian Engineers and Architects, vol. 25, pp. 43-52, December 2008.
[13] S. Haykin, "Neural Networks - A Comprehensive Foundation," 2nd Edition, Pearson Education, 2005.
[14] S. Lim, K. Lee, O. Byeon, and T. Kim, "Efficient Iris Recognition through Improvement of Feature Vector and Classifier," ETRI Journal, vol. 23, no. 2, pp. 61-70, June 2001.
[15] The MathWorks, "Image Processing Toolbox User's Guide," The MathWorks, Inc., 2008.
[16] Y. B. Jemaa and S. Khanfir, "Automatic Local Gabor Features Extraction for Face Recognition," (IJCSIS) International Journal of Computer Science and Information Security, vol. 3, no. 1, 2009.
[17] Y. Zhang and J. Yuan, "Gesture Recognition for Human-Computer Interaction Using Neural Networks," 8th International Conference on Neural Information Processing, Shanghai, China, November 14-18, 2001, pp. 735-740.
