Survey paper on Sign Language Recognition System

Jyoti Deshmukh, Akshay Kedar, Mangesh Jadhav, Pranali Khuspe, Rohit Singh
Department of Computer Engineering
Rajiv Gandhi Institute of Technology, Andheri, Mumbai
University of Mumbai.

Abstract— Sign Language (SL) is a language that uses manual, facial, and other means of communication. It is the main means of communication for many deaf and hard-of-hearing people. The main focus of an SL recognition system is to create a vision-based system that is able to identify different signs from captured images or video. In this paper, we provide a review of current research to identify the problem areas and challenges faced by researchers while building an SL recognition system.

Keywords— Sign language, Deep Learning, Stochastic Gradient Descent (SGD), Hidden Markov Model (HMM), Multi-class Support Vector Machine (MSVM)

I. INTRODUCTION

The Sign Language recognition system has gained a lot of importance. Sign Language is the primary means of communication in the deaf and hard-of-hearing community. In this age of technology, it is essential to make these people feel part of society by helping them communicate smoothly. Hence, an intelligent computer system needs to be developed and trained. Such a system intends to close the communication gap for persons who are deaf or hard of hearing, allowing them to converse more smoothly with the outside world by translating hand gestures, taken as the primary input, into understandable words.

In this paper, research and development in the field of Sign Language (SL) recognition systems are reviewed and thoroughly analyzed. The paper examines the shortcomings of previous studies in the area of sign language recognition. This list of identified shortcomings is intended to serve as a reference for future work and for what researchers should focus on while conducting studies in this field.

Researchers are increasingly interested in Sign Language Recognition because of its applications in a variety of fields, including communication systems for deaf people, virtual reality, human-computer interaction, industrial machine control, and many more.

According to the type of input required, there are two kinds of sign language recognition systems:

1. Static SL Recognition: this system accepts images as input. Because the image input is not particularly dense, the accuracy is usually greater than 97.00 %.

2. Dynamic SL Recognition: this system accepts video as input. It is less accurate, but it is more suitable for use in the real world.

II. A REVIEW ON EXISTING SL RECOGNITION SYSTEMS

Deaf and hard-of-hearing persons use sign language to communicate with one another and with others in their communities. The process of computer recognition of sign language begins with the acquisition of sign gestures and ends with the synthesis of text or voice. Until now, the majority of study has been done on static signs. Currently, advancements in the field of Machine Learning have led to the emergence of dynamic SL recognition systems.

The dictionary of signs, segmentation, feature selection, and classification techniques are the most significant aspects of a sign language recognition system. The following summary covers trends in all four of these aspects of sign language recognition systems.

Jose L. Flores C. et al. [6] used the alphabet of the SL of Peru (LSP). Different image processing techniques were used to remove noise, improve contrast, and detect and crop out the region containing the hand gesture. Two CNN architectures with varied numbers of layers and parameters per layer were developed. The first architecture, based on LeNet-5, showed an accuracy of 95.37 %, whereas the second, the proposed CNN architecture, showed an accuracy of 96.20 % in recognizing the 24 static hand gestures.

Surejya Suresh et al. [1] developed a system whose primary goal was a vision-based Convolutional Neural Network (CNN) model that could recognise distinct signs from recorded pictures. Two CNN models were created, trained with the SGD and Adam optimizers respectively. Using these optimizers, the models categorised 6 different signs with accuracies of 99.12 % (SGD) and 99.51 % (Adam).

Jie Huang et al. [2] introduced a new 3D convolutional neural network (CNN) that automatically extracts discriminative spatial-temporal characteristics from raw video streams without the need for prior information, removing the need for hand-crafted features. Multiple channels of video streams, comprising color information, depth clues, and body joint locations, are used as input to the 3D CNN to integrate color, depth, and trajectory information in order to improve performance. The average accuracy of the 3D CNN with gray-channel features is 88.5 %, whereas the average accuracy with multi-channel features is 94.2 %.

Yogeshwar I. Rokade et al. [14] developed a novel method for recognising Indian sign language using an artificial neural network (ANN) and a Support Vector Machine (SVM). The shape of the hand region is obtained using sign segmentation, the grey-level picture is obtained through Euclidean distance transformation, and the feature vector is obtained through feature extraction. The ANN achieves an average accuracy of 94.37 %, while the SVM achieves an average accuracy of 92.12 %.

Karishma Dixit and Anand Singh Jalal [13] evaluated only manual signs, i.e. hand movements of isolated signs. Each class is trained using an MSVM, and a feature vector for sign identification is built using combinational parameters of invariant moments and structural shape descriptors. A global thresholding algorithm is used to segment the input image of hand gestures. According to the findings, combining invariant moments with shape descriptors yields a better result of 96.23 %.

P. Subha Rajam et al. [12] proposed a method that provides a basis for the development of a Sign Language Recognition system for one of the south Indian languages. A collection of 32 signs, i.e. Tamil letters with 12 vowels and 18 consonants, is defined in the suggested approach, each reflecting the binary UP and DOWN positions of the five fingers. This hand pattern recognition procedure includes five major phases: a) Data Acquisition, b) Palm Image Extraction, c) Sign Detection, d) Training Phase, and e) Binary to Text Conversion. When trained with 320 photos and evaluated with 160 images, the system was able to recognise images with 98.125 % accuracy.

Tanzania Ferdous Ayeshee, Sadia Afrin Raka et al. [7] used fuzzy logic. Fuzzy logic, or fuzzification, is done in two phases: the data is first processed from raw photos, and then the rules are found using angle measurements. The Harris technique was also used for point specification and corner detection of the hand. This approach was effective in determining only the two letters that the researchers tested, and might not work with other Bengali letters. Furthermore, improved computer vision technologies could be used to reduce the suggested system's limitations.

T. Shanableh and K. Assaleh [5] used gloves to recognize Arabic SL gestures, with color segmentation making the process of segmenting out the signer's hands easier. Bounded images are then transformed into the frequency domain using the Discrete Cosine Transform, followed by zonal coding to form the feature vectors. The proposed solution was validated using KNN and polynomial networks, resulting in a classification rate of 87 % in user-independent mode.

Thad Starner et al. [8] presented two real-time HMM-based systems for identifying continuous American Sign Language (ASL) at the sentence level. The first device uses a desk-mounted camera to view the user and achieves 92 % word accuracy. The second embeds the camera in the user's headgear and obtains a 98 % accuracy rate (97 % with an unrestricted grammar).

Britta Bauer and Hermann Hienz [9] employed a system based on continuous-density HMMs, where each sign is represented by a separate model. A feature vector reflecting manual sign parameters serves as training and recognition input. The main goal of this system is to detect SL sentences automatically based on a lexicon of 97 signs in German Sign Language (GSL). Experiments have shown that this system understands sign language with a 94 % accuracy rate based on 52 lexical signs and all available attributes. However, with a vocabulary of 97 signs, recognition accuracy drops to 91.7 %.

Pinar Santemiz et al. [4] used broadcast news for the hearing-impaired, with the help of the speech that coexists with the signs, to develop their database. For sign segmentation, multiple alignment-based techniques are applied. They used a variety of alignment methods, including feature concatenation with Dynamic Time Warping (DTW) and HMMs, modelling using linked and parallel HMMs, and sequential DTW and HMM fusion. In conclusion, modelling the signs using HMMs over intervals discovered previously by DTW yields the maximum accuracy.

Soma Shrenika and Myneni Madhu Bala [10] used the average method to convert the sign image to a grayscale image. The resultant image is processed through a Canny edge detection algorithm to determine edges, which outputs a template image. Lastly, the template image is processed through a template matching algorithm, in which the template image is compared with the images present in the dataset using the Sum of Absolute Differences (SAD) method. The image that produces the least SAD value is output with its corresponding label.

Nursuriati Jamil et al. [3] used an image retrieval system made up of two parts: a database model and a retrieval model. The images are retrieved using eccentricity, compactness, convexity, rectangularity, and solidity. Recall and precision are obtained by examining the retrieved photos. Simple shape descriptors used to retrieve songket motifs are remarkably precise, with a precision rate of 97 % at 10 % recall and 70 % at 20 % recall. The study was an experiment to see which picture retrieval techniques were the most efficient and fast.

Ignacio Rocco et al. [11] used a convolutional neural network for geometric matching; the contribution of this work is threefold. A correspondence between two images is found by searching for an affine, thin-plate-spline, or homography transformation in agreement with a geometric model and estimating its parameters. This method has also proven useful for instance-level alignment, obtaining reasonable alignment on challenging graffiti benchmarks.

Nobuyuki Otsu [15] used an unsupervised and nonparametric method of automatic threshold selection. A discriminant criterion, the separability of the resultant classes in grey level, is maximised in order to determine the best threshold. The method uses only the zeroth- and first-order cumulative moments of the grey-level histogram, which makes it a relatively simple process. Taking this into consideration, the proposed approach may be regarded as the standard and most straightforward way of automatic threshold selection that can be used for a variety of real problems. It is a broad technique that covers a large range of unsupervised decision-making scenarios.

III. A REVIEW TABLE ON EXISTING SIGN LANGUAGE RECOGNITION SYSTEMS

Table 1. Each entry lists: Ref. No. | Classifiers/Recognition | Feature Extraction | Dataset | Results/Objectives/Applications.
[16] (2015) | Random forest | Gabor filter-based hand shape features, with (a) an Evenly Distributed Scheme (EDS) and (b) a Distance Adaptive Scheme (DAS) | 24 static alphabet signs using depth data obtained from Kinect sensors | 92 %

[25] (2011) | Multi-class random forest | Color patch and depth-based hand tracking, combined vectors | 26 English language alphabets | 75 % mean precision

[22] | HMM | 8-element feature vector | 494 sentences formed by a 40-word vocabulary | 99.2 % without explicitly modelling the fingers

[18] | CRF threshold model | Appearance-based hand/face detection and tracking | 48 signs: 18 one-handed and 30 two-handed (ASL) | 93.5 % distinguishing between signs and non-sign patterns

[24] (2008) | SDTW+CDFD and SDTW+QDFFM | Block tracking and skin segmentation | 120 signs (Dutch sign language) | SDTW+CDFD and SDTW+QDFFM outperform HMM in single model

[19] | CNN | Gradient Kernel Descriptor and Scale Invariant Feature Transform (SIFT) | 36 Indian Sign Language (ISL) alphabets and numbers | 98.81 % based on RGB and depth image training, 99.08 % based on 1080-video training

[23] | Multi-stream HMM | Hand position and movements | 21,960 Sign Language Word Data | 75.6 % based on appropriate weights and 70.6 % based on same weights

[17] | HMM | Three-dimensional motion of a subject's arms and hands | 53-sign vocabulary of American signs | 87.71 % for 3D context independent, 89.91 % for 3D context dependent, and 83.63 % for 2D context dependent

[21] | Linear SVM | Depth image sequence, body skeleton joints, facial landmarks, hand shapes, facial expressions/attributes | 61 video sequences based on American Sign Language (ASL) | 28.77 % for DMM, 15.98 % Pair-1, 24.66 % Pair-2, 36.07 % overall

[20] | Transition Movement Model (TMM) | Hand shape, position and movement | 5113-sign Chinese vocabulary | 91.9 % average for TMM

IV. SIGN LANGUAGE RECOGNITION CHALLENGES

A critical examination of previous research has revealed that the field of SL Recognition has numerous flaws and inherent problems. The flaws and challenges are generic in the sense that this review does not focus on the issues associated with a specific classification technique, because there is already a lot of literature on that subject. Rather, the focus is on the broader issues that arise throughout the development of an entire SL recognition system. The issues are stated below:

A. Segmentation

Techniques for segmentation based on skin color have a number of flaws. Many researchers have turned to external aids to overcome the issues that come with skin-color-based approaches; several employ colored gloves. This has simplified the segmentation process, but it is not well received by many users, because having to first train oneself to use a system is undesirable, and it severely limits the system's general applicability.
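Several of the surveyed systems rely on global thresholding of a grey-level hand image before feature extraction (e.g. Dixit and Jalal [13], and Otsu's criterion [15] described above). As an illustrative sketch only, not code from any surveyed paper, Otsu's threshold selection can be written directly from the zeroth- and first-order cumulative moments of the histogram:

```python
import numpy as np

def otsu_threshold(gray):
    """Pick the threshold maximising between-class variance,
    following Otsu's discriminant criterion [15]."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()                # grey-level probabilities
    omega = np.cumsum(p)                 # zeroth-order cumulative moment
    mu = np.cumsum(p * np.arange(256))   # first-order cumulative moment
    mu_t = mu[-1]                        # global mean grey level
    # Between-class variance for every candidate threshold k.
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b2 = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b2[np.isnan(sigma_b2)] = 0.0   # empty classes carry no variance
    return int(np.argmax(sigma_b2))

# Toy bimodal image: dark background with a bright "hand" region.
img = np.full((64, 64), 30, dtype=np.uint8)
img[16:48, 16:48] = 200
t = otsu_threshold(img)
mask = img > t   # binary foreground (hand) mask
```

Applying the returned threshold yields the binary mask that the segmentation stage passes to feature extraction; in practice, skin-color cues, gloves, or depth data are combined with such thresholding to cope with complex backgrounds.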
B. Sign Language Databases

A big SL corpus should be used to develop a consistent SL recognition system. The majority of current SL databases were created in a lab setting, which imposes a variety of limitations. They also have a small number of signers, which makes it difficult for the system to generalise. It takes a long time to build a large annotated SL database for recognition purposes. To build an annotated dataset, an electronic tool is required to speed up the annotation process. ELAN software [26] is one of the most popular programmes, and it is ideal for annotating SL videos.

C. Feature Extraction

Once features are selected, another important step is to extract them from the image of the signer. Feature extraction has to overcome challenges such as complex backgrounds and colored gloves. A complex background definitely makes it difficult to extract the required features accurately. To facilitate the process of feature extraction, many researchers have used colored gloves; this, however, is not desirable for the general applicability of the system.

V. CONCLUSION

Sign Language Recognition systems have developed from classifying only static signs and alphabets to systems that can successfully recognize dynamic movements that come in continuous sequences of images. Nowadays, building a large vocabulary is more important for SL recognition systems, yet many researchers are developing their SL recognition systems using small vocabularies and self-made databases. A large database built for Sign Language Recognition is still not available for some of the countries involved in developing such systems. The study of sign language recognition systems in a variety of languages and locales over the years has yielded significant outcomes in terms of improving performance. To accomplish this review, over 25 of the most extensively cited scholarly papers were examined. The authors address a few of the most difficult challenges, and a discussion of future potential is presented after an assessment of existing feature extraction and classification methodologies.

VI. FUTURE WORK

Advancements in acquisition devices have provided a promising outlook for practical recognition systems, and researchers nowadays use more sophisticated technology. The majority of research, on the other hand, continues to focus on isolated sign language recognition; as a result, continuous signing problems have been given less attention. Real-world sign language databases should be developed, based on real-life scenarios, for recognition purposes.

Techniques for classification and segmentation that do not impose significant restrictions on the signer's surroundings should be used. For future applications, feature extraction challenges should be addressed effectively: features should be chosen that are unaffected by the signer's background, so that they can be extracted properly.

Many factors should be considered while developing assistive technology for Deaf and hard-of-hearing persons, one of which is the end users' ease of use: for example, an SLR system should be usable on a portable device, such as a mobile phone, which Deaf users can seamlessly carry anywhere with them.

REFERENCES

[1] Surejya Suresh, H. T. P. Mithun and M. H. Supriya, "Sign Language Recognition System Using Deep Neural Network," 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS), 2019, pp. 614-618, doi: 10.1109/ICACCS.2019.8728411.
[2] Jie Huang, Wengang Zhou, Houqiang Li, and Weiping Li, "Sign Language Recognition using 3D convolutional neural networks," IEEE International Conference on Multimedia and Expo (ICME), Turin, 2015, doi: 10.1109/ICME.2015.7177428.
[3] N. Jamil, Z. A. Bakar and T. M. Tengku Sembok, "Image Retrieval of Songket Motifs Using Simple Shape Descriptors," Geometric Modeling and Imaging--New Trends (GMAI'06), IEEE, 2006, pp. 171-176, doi: 10.1109/GMAI.2006.29.
[4] Pinar Santemiz, Oya Aran, Murat Saraclar and Lale Akarun, "Automatic Sign Segmentation from Continuous Signing via Multiple Sequence Alignment," IEEE 12th International Conference on Computer Vision Workshops, 2009, doi: 10.1109/ICCVW.2009.5457527.
[5] T. Shanableh and K. Assaleh, "Arabic sign language recognition in user-independent mode," International Conference on Intelligent and Advanced Systems, 2007, doi: 10.1109/ICIAS.2007.4658457.
[6] Jose L. Flores C., E. Gladys Cutipa A. and Lauro Enciso R., "Application of Convolutional Neural Networks for Static Hand Gestures Recognition Under Different Invariant Features," IEEE XXIV International Conference on Electronics, Electrical Engineering and Computing (INTERCON), 2017, doi: 10.1109/INTERCON.2017.8079727.
[7] T. F. Ayshee, S. A. Raka, Q. R. Hasib, M. Hossain and R. M. Rahman, "Fuzzy rule-based hand gesture recognition for Bengali characters," 2014 IEEE International Advance Computing Conference (IACC), 2014, pp. 484-489, doi: 10.1109/IAdCC.2014.6779372.
[8] Thad Starner, Alex Pentland and Joshua Weaver, "Real-time American sign language recognition using desk and wearable computer based video," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1371-1375, Dec. 1998, doi: 10.1109/34.735811.
[9] Britta Bauer and Hermann Hienz, "Relevant features for video-based continuous sign language recognition," Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat. No. PR00580), 2000, pp. 440-445, doi: 10.1109/AFGR.2000.840672.
[10] Soma Shrenika and Myneni Madhu Bala, "Sign Language Recognition using Template Matching Techniques," International Conference on Computer Science, Engineering and Applications (ICCSEA), 2020, doi: 10.1109/ICCSEA49143.2020.9132899.
[11] I. Rocco, R. Arandjelović and J. Sivic, "Convolutional Neural Network Architecture for Geometric Matching," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 11, pp. 2553-2567, Nov. 2019, doi: 10.1109/TPAMI.2018.2865351.
[12] P. Subha Rajam and G. Balakrishnan, "Real time Indian Sign Language Recognition System to aid deaf-dumb people," 2011 IEEE 13th International Conference on Communication Technology, 2011, pp. 737-742, doi: 10.1109/ICCT.2011.6157974.
[13] Karishma Dixit and Anand Singh Jalal, "Automatic Indian Sign Language recognition system," 2013 3rd IEEE International Advance Computing Conference (IACC), 2013, pp. 883-887, doi: 10.1109/IAdCC.2013.6514343.
[14] Rokade, Yogeshwar & Jadav, Prashant (2017). Indian Sign Language Recognition System. International Journal of Engineering and
Technology. 9. 189-196. 10.21817/ijet/2017/v9i3/170903S030.
[15] N. Otsu, "A Threshold Selection Method from Gray-Level Histograms,"
in IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1,
pp. 62-66, Jan. 1979, doi: 10.1109/TSMC.1979.4310076.
[16] Cao Dong, M. C. Leu and Z. Yin, "American Sign Language alphabet
recognition using Microsoft Kinect," 2015 IEEE Conference on
Computer Vision and Pattern Recognition Workshops (CVPRW), 2015,
pp. 44-52, doi: 10.1109/CVPRW.2015.7301347.
[17] C. Vogler and D. Metaxas, "ASL recognition based on a coupling
between HMMs and 3D motion analysis," Sixth International
Conference on Computer Vision (IEEE Cat. No.98CH36271), 1998, pp.
363-369, doi: 10.1109/ICCV.1998.710744.
[18] R. Yang, S. Sarkar and B. Loeding, "Handling Movement Epenthesis
and Hand Segmentation Ambiguities in Continuous Sign Language
Recognition Using Nested Dynamic Programming," in IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no.
3, pp. 462-477, March 2010, doi: 10.1109/TPAMI.2009.26.
[19] N. K. Bhagat, Y. Vishnusai and G. N. Rathna, "Indian Sign Language
Gesture Recognition using Image Processing and Deep Learning," 2019
Digital Image Computing: Techniques and Applications (DICTA), 2019,
pp. 1-8, doi: 10.1109/DICTA47822.2019.8945850.
[20] G. Fang, W. Gao and D. Zhao, "Large-Vocabulary Continuous Sign
Language Recognition Based on Transition-Movement Models," in
IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems
and Humans, vol. 37, no. 1, pp. 1-9, Jan. 2007, doi:
10.1109/TSMCA.2006.886347.
[21] C. Zhang, Y. Tian and M. Huenerfauth, "Multi-modality American Sign
Language recognition," 2016 IEEE International Conference on Image
Processing (ICIP), 2016, pp. 2881-2885, doi:
10.1109/ICIP.2016.7532886.
[22] T. Starner and A. Pentland, "Real-time American Sign Language
recognition from video using hidden Markov models," Proceedings of
International Symposium on Computer Vision - ISCV, 1995, pp.
265-270, doi: 10.1109/ISCV.1995.477012.
[23] M. Maebatake, I. Suzuki, M. Nishida, Y. Horiuchi and S. Kuroiwa,
"Sign Language Recognition Based on Position and Movement Using
Multi-Stream HMM," 2008 Second International Symposium on
Universal Communication, 2008, pp. 478-481, doi:
10.1109/ISUC.2008.56.
[24] J. F. Lichtenauer, E. A. Hendriks and M. J. T. Reinders, "Sign Language
Recognition by Combining Statistical DTW and Independent
Classification," in IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 30, no. 11, pp. 2040-2046, Nov. 2008, doi:
10.1109/TPAMI.2008.123.
[25] N. Pugeault and R. Bowden, "Spelling it out: Real-time ASL
fingerspelling recognition," 2011 IEEE International Conference on
Computer Vision Workshops (ICCV Workshops), 2011, pp. 1114-1119,
doi: 10.1109/ICCVW.2011.6130290.
[26] ELAN annotator, Max Planck Institute for Psycholinguistics, available
at: http://www.mpi.nl/tools/elan.html
