Dual Mode Sign Language Recognizer-An Android Based CNN and LSTM Prediction Model



Prof. Rubi Mandal, Dept. of Information Technology, SVKM’s IoT Dhule, mandalruby@gmail.com
Dhiraj Patil, Dept. of Information Technology, SVKM’s IoT Dhule, dhirajpatil.1434@gmail.com
Sanket Gadhe, Dept. of Information Technology, SVKM’s IoT Dhule, sanketgadhe366@gmail.com
Gaurav Birari, Dept. of Information Technology, SVKM’s IoT Dhule, gauravbirari07@gmail.com
Tejaswini Buwa, Dept. of Information Technology, SVKM’s IoT Dhule, tejbuwa06@gmail.com

2023 3rd International Conference on Artificial Intelligence and Signal Processing (AISP) | 979-8-3503-2074-9/23/$31.00 ©2023 IEEE | DOI: 10.1109/AISP57993.2023.10134768

Abstract - Communication is a very important part of our life. By communicating with each other, we can interact and share our ideas and emotions. However, the mute and deaf people of our society face many problems while communicating with normal people. Previous research focused only on one-sided communication, i.e., speech-to-sign conversion, using only static datasets (alphabets and digits) and only for specific languages. Keeping in mind the lacunae of current systems, this research presents an application-based system that serves as an interpreter for sign language, enabling two-way communication between hearing-impaired people and normal people while working on dynamic gestures and providing a centralized system for everyone. The system works through two modules: one converts sign gestures to speech, and the other converts speech to 3-D animated sign gestures. For this purpose, a dataset of approximately 1800 words was collected from a deaf and mute school in Dhule, Maharashtra, and from the official ISL site. Sign-gesture-to-speech conversion is implemented using a CNN+LSTM deep learning algorithm, while speech-to-3-D-sign-gesture conversion is implemented by extracting keywords from live speech and presenting 3-D animated gestures for the extracted keywords. This system will bridge the gap between deaf, mute and normal people.

Keywords - sign recognition, deaf and mute, LSTM, deep learning, CNN, 3D model, OpenCV, Python.

I. INTRODUCTION

According to the World Health Organization (WHO), about 5% of the world's population, i.e., 430 million people, are deaf and mute. Communication plays a very important role in our life, and these people communicate with each other with the help of sign language. Sign language is essential for them; after all, it has the ability to join the two communities. However, only a small fraction of our society knows sign language. It is the language used by specially abled people who are hearing impaired and unable to speak: they use hand gestures instead of spoken words. Sign language is not the same everywhere; there are various sign languages across the world. It is very difficult for normal people to understand sign language, because we do not know the actual meaning of a particular sign. If we could understand sign language, it would benefit not only us but also these special people, but it is not as simple as we think. That is why we are proposing a sign recognition solution based on hand gestures, which focuses on reducing the communication gap between normal and deaf/mute people. Some solutions are already available, but they are not centralized, work only one way and use static gestures only. So, to remove the lacunae of the existing systems, we are developing an application-based system which consists of two modules: the first is sign language to speech, and the second is speech to 3-D animated sign gestures. The sign gestures are captured through the OpenCV Python library. In the first module, a convolutional neural network (CNN) and an LSTM are used to train on the dataset. The system is dynamic, meaning it can also recognize words and sentences.

Our system will have a deep impact on society. It will benefit not only deaf/mute people but also normal people in communicating with each other. We conducted a real-time survey at the Anand Nivas Deaf/Mute School, Dhule, Maharashtra. The purpose of the survey was to collect real-time sign gestures for creating a new dataset. By meeting the deaf/mute students and interacting with the teachers, we collected videos and images for the dataset. There are about 300 different sign gestures, and for uniformity and better accuracy we also took data from the official ISL website, from which we obtained about 15000 different sign gesture videos. In addition, we recorded signs ourselves from different angles to build a larger dataset, so that the system gives more accurate results.

II. LITERATURE REVIEW

Many academics have proposed various strategies that allow sign language to be translated into text in order to narrow the communication gap between hearing- and speech-impaired people.

One approach is based on continuous real-time action detection on visual frames to identify the activity carried out by the user. After locating critical points using MediaPipe Holistic, this research focuses only on the key points of the person [1].

Sign language recognition has also been implemented using a combination of two algorithms, CNN + LSTM; real-time sign language translation is possible because YOLOv5 processes the task, but the limitation is that this system is static [2]. Another device makes use of a camera to record different hand motions. The image is then processed using a variety of techniques: the first step is pre-processing the image, then an edge detection algorithm is employed to identify the edges; the text is shown and the sign is identified by a template-matching algorithm, so it takes more time to display the result [3]. A deep multi-layered convolutional network has also been proposed, but it is applicable only to images [4]. Another model only converts signs from ASL into English characters, which is why it is static in nature [5].

III. ARCHITECTURE AND METHODOLOGY

Architecture:

Our system mainly provides dual communication between deaf/mute and normal people, helping to establish two-way communication. The architecture is divided into two parts.

Fig 1. Architecture of Sign to Speech module

Sign to Speech module: Here sign gestures are converted into speech through different stages, from frame collection to speech generation. It consists of different phases through which the input data has to pass before being fed to the sign gesture model, which finally predicts the words for a particular sign gesture.

Speech to Sign module: Here speech data is converted into text, followed by removing stop words from the text, replacing synonyms and presenting a 3-D animated gesture for each word.

Fig 2. Architecture of Speech to Sign module

Methodology:

Our methodology consists of the sections listed below.

I. Dataset Creation

Fig 3. Dataset of research

For this purpose, we did a live survey at the Anand Nivasi deaf and mute school, met the teachers and students, and understood their problems in depth. We took their help in collecting the dataset, but it is not possible to collect everything from them, so we currently refer to the ISL (Indian Sign Language) government portal to learn the sign action for a particular word and shoot the video of that sign ourselves. The same videos are used for training the sign-to-speech module.

After generating a collection of approximately 1800 videos, for the speech-to-3-D-animated-sign-gesture module we create the 3-D animated sign gesture from each 2-D video using the following tools: 1) DeepMotion (creating a 3-D animated character performing the sign gesture), 2) Blender (fine adjustment of fingers and body part movement) and 3) iClone (for facial expressions).

After generating the 3-D animated gestures, we collect them in one place and label each of them with a particular name in the dataset, so that the desired output is shown when the word of that label is given by the user.

Sign Gestures Classifier training model: To predict the words for specific sign gestures, the proposed model consists of a Convolutional Neural Network + LSTM model.

A convolutional neural network is a deep learning algorithm used for image classification and is considered one of the best algorithms for this task because of its high prediction accuracy, low computation cost and short prediction time. The CNN consists of two sections through which the input data is passed:

1. Feature extraction: this section of the CNN consists of different layers of filters. These filters are used for feature extraction; different layers extract different features of the input image, and at each level the number of filters is increased to extract more information.

2. Max pooling: max pooling is added to reduce the size of the input array. It is added at each layer for size reduction, which is largely responsible for reducing the computation work.

Finally, the compressed and extracted features are converted into a flattened array, which is given as input to the LSTM model.

Long Short-Term Memory (LSTM) is a special type of recurrent neural network (RNN) that is primarily used to classify time-series data. In a conventional RNN, the output of a given layer is influenced by both the previous output and the current input.

At last, the output from the LSTM is passed as input to a dense neural network for classification.

Fig 4. Working of CNN and LSTM for real-time sign and speech prediction
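The exact layer configuration is not listed in this section, so the following is only a minimal sketch of a CNN + LSTM classifier of the kind described above, assuming TensorFlow/Keras and illustrative values for the frame count, frame size and number of classes:

```python
# Minimal CNN + LSTM sketch (assumed TensorFlow/Keras; all sizes are illustrative).
# A stack of frames per gesture passes through a per-frame CNN (feature extraction
# with max pooling), is flattened, fed to an LSTM, and classified by a dense layer.
from tensorflow.keras import layers, models

NUM_FRAMES, IMG_SIZE, NUM_CLASSES = 30, 64, 1800  # assumed values, not from the paper

def build_sign_classifier():
    frame_cnn = models.Sequential([
        layers.Conv2D(16, 3, activation="relu", input_shape=(IMG_SIZE, IMG_SIZE, 1)),
        layers.MaxPooling2D(),                      # size reduction after each conv block
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),                           # flattened features for the LSTM
    ])
    model = models.Sequential([
        layers.TimeDistributed(frame_cnn,
                               input_shape=(NUM_FRAMES, IMG_SIZE, IMG_SIZE, 1)),
        layers.LSTM(64),                            # temporal modelling across the frames
        layers.Dense(NUM_CLASSES, activation="softmax"),  # one class per sign word
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```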
Our proposed system is basically Android-application driven, consisting of two major modules for establishing intercommunication between deaf/mute and normal people:

1. Sign Gestures to Speech Conversion module

2. Speech to 3-D Animated Gestures module

1) Sign Gestures to Speech Conversion module:

A) Live Data Collection:

This module focuses on data gathering. The OpenCV Python library is used for data collection, and the application's camera is used as the source for capturing the live video data. A video is nothing but a collection of frames, so n frames are captured for a particular word; the frames for each word are stored as a separate stack and pre-processed before being given as input to the trained sign classifier.
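As an illustration of this step, the sketch below captures a fixed-length stack of frames per gesture with OpenCV; the camera index and frame count are assumptions, not values taken from the paper:

```python
# Frame-capture sketch using OpenCV; camera index and frame count are assumed values.
import cv2
import numpy as np

def capture_gesture(num_frames=30, cam_index=0):
    """Capture a stack of frames for one sign gesture from the device camera."""
    cap = cv2.VideoCapture(cam_index)
    frames = []
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break                                   # camera unavailable or stream ended
        frames.append(frame)
    cap.release()
    return np.stack(frames) if frames else None    # shape: (num_frames, H, W, 3)
```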
B) Data Preprocessing:

This module focuses on processing the data. The frames which are collected at the time of data collection are pre-processed in the following steps:

I. Gray Scale Conversion: images are naturally captured in the RGB (Red, Green, Blue) format and are converted into gray scale using MATLAB.

II. Extraction of Features: Canny edge detection is used to detect the edges in order to obtain correct findings, and specific features are extracted using filters such as Gaussian filters, while the remaining noise is discarded.

III. Normalization of Input and Output Data: an image cannot be fed as-is into the system; it needs to be normalized into a NumPy array and converted into the range (-1, 1) for linearity in the input data.

This normalized input is fed into the sign classifier model.
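A minimal sketch of this preprocessing chain is shown below, assuming OpenCV and NumPy for all three steps (the grayscale conversion above is described as being done in MATLAB, so the cv2 call here is a substitution for illustration):

```python
# Preprocessing sketch: grayscale conversion, Gaussian filtering, Canny edge detection
# and normalization to the range (-1, 1). OpenCV/NumPy are assumed here.
import cv2
import numpy as np

def preprocess_frame(frame, size=(64, 64)):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # colour frame -> gray scale
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)      # Gaussian filter suppresses noise
    edges = cv2.Canny(blurred, 100, 200)             # Canny edge detection
    resized = cv2.resize(edges, size)                # fixed input size for the classifier
    return resized.astype(np.float32) / 127.5 - 1.0  # pixel values scaled into (-1, 1)
```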
Prediction Module:

Depending on the input provided, the specific class of words is predicted by the sign classifier. This is done using a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM), which consist of a feature-extraction part and a dense neural network part for class classification. Words are predicted depending on the confidence of the prediction; words with higher confidence are stored in an array of words.

C) Sentence Formation and Speech as Output:

Here the array of collected words is converted into a sentence, and the sentence is converted into speech using the Python pyttsx3 library.
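A minimal sketch of this step using pyttsx3 is given below; joining the word array with spaces is an assumption about how the sentence is formed:

```python
# Sentence formation and text-to-speech sketch using pyttsx3.
import pyttsx3

def speak_words(word_array):
    sentence = " ".join(word_array)   # assumed: simple space-join of the predicted words
    engine = pyttsx3.init()
    engine.say(sentence)
    engine.runAndWait()               # blocks until the sentence has been spoken

# speak_words(["hello", "father", "how", "are", "you"])
```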
Fig 6. Sentence breaking
2) Speech to 3-D animated gestures module: 4] Removing stop words:
Speech to 3-D animated gestures module: In this module After getting list of words, each word will be checked in
speech is feeded as an input and will generate 3-D sign the dictionary of sign language. The word which has no sign
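The specific speech-recognition library is not named here, so the sketch below assumes the Python SpeechRecognition package, whose recognizers bundle the acoustic model, signal analysis and language model behind a single call:

```python
# Speech-to-text sketch assuming the SpeechRecognition package (not named in the paper).
import speech_recognition as sr

def speech_to_text():
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)   # reduce background noise
        audio = recognizer.listen(source)             # signal capture
    try:
        return recognizer.recognize_google(audio)     # acoustic + language model decoding
    except sr.UnknownValueError:
        return ""                                     # speech was unintelligible
```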
2] Grammatical Text Formatting for the Particular Sign Language:

For the sign language of a particular country there are predefined grammatical rules. For example, in ISL the structure of a sentence follows subject + object + verb, whereas the user speaks the sentence in the form subject + verb + object. Sign languages do not use gerunds, suffixes or other inflected forms; they focus only on the root words.

E.g., spoken sentence: Go to school (verb + object)

Grammatically framed: school to go (object + verb)

(Note: the exact ordering depends on the sign language.)
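As a rough illustration only, the sketch below reorders a sentence using NLTK part-of-speech tags; it handles just the simple verb + object pattern from the example above, whereas a real implementation would need the full ISL grammar rules:

```python
# Simplified word-order sketch using NLTK POS tags (assumes the 'punkt' and
# 'averaged_perceptron_tagger' data are downloaded). Only the basic verb + object
# pattern is handled; full ISL grammar rules would need a proper parser.
import nltk

def reorder_for_sign_language(sentence):
    tokens = nltk.word_tokenize(sentence.lower())
    tagged = nltk.pos_tag(tokens)
    verbs = [w for w, t in tagged if t.startswith("VB")]
    others = [w for w, t in tagged if not t.startswith("VB")]
    return " ".join(others + verbs)   # push verbs to the end (object + verb order)

# reorder_for_sign_language("Go to school")  -> "to school go" (approx. "school to go")
```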

3] Breaking Sentences into Words:

The grammatically framed sentence is broken down into a list of words using a sentence-splitting function, and an array of words is generated.

For example:

Fig 6. Sentence breaking

4] Removing Stop Words:

After getting the list of words, each word is checked against the dictionary of the sign language. Words which have no sign gesture are considered stop words and are removed from the list of words.

For example:

Fig 7. Stop word removal
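A combined sketch of steps 3 and 4 is shown below; the gesture dictionary is a hypothetical stand-in for the labelled gesture dataset described earlier:

```python
# Sentence splitting and stop-word removal sketch. GESTURE_DICT is a hypothetical
# stand-in for the labelled gesture dataset; words without a gesture are stop words.
GESTURE_DICT = {"school", "go", "hello", "father", "how", "you"}  # illustrative subset

def split_and_filter(sentence):
    words = sentence.lower().split()                 # break the sentence into words
    return [w for w in words if w in GESTURE_DICT]   # drop words with no sign gesture

# split_and_filter("school to go")  -> ["school", "go"]   ("to" has no gesture)
```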
5] Synonym Replacement + Lemmatization:

The dataset contains labelled 3-D animated gestures. Every word is searched in the dataset, and if a word does not match any labelled gesture, we obtain the list of synonyms of that word; each synonym is then searched in the dataset, and the synonym that is found replaces the original word.

For example:

Fig 8. Synonym replacement

For the user input "Hello dad, how are you?", the dataset contains a labelled animated gesture for father, so dad is replaced by its synonym father, giving "Hello father, how are you?".

In this step we also lemmatize the words by removing the v1, v2, v3, v4 forms of the verbs; for example, talking, talked and talks are all replaced by talk by removing -ing, -ed and -s.
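A sketch of this step is given below, assuming NLTK's WordNet synonyms (plus direct hypernyms, which is what maps dad to father) and the WordNet lemmatizer; the paper does not name the tools actually used:

```python
# Synonym replacement + lemmatization sketch assuming NLTK WordNet
# (requires the 'wordnet' corpus to be downloaded).
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

GESTURE_DICT = {"hello", "father", "how", "are", "you", "talk"}  # illustrative subset
lemmatizer = WordNetLemmatizer()

def map_to_gesture_word(word):
    base = lemmatizer.lemmatize(word.lower(), pos="v")   # talking/talked/talks -> talk
    if base in GESTURE_DICT:
        return base
    for syn in wordnet.synsets(base):                    # search synonyms and hypernyms
        candidates = list(syn.lemma_names())
        for hyper in syn.hypernyms():
            candidates.extend(hyper.lemma_names())       # e.g. dad -> father
        for lemma in candidates:
            if lemma.lower() in GESTURE_DICT:
                return lemma.lower()
    return None                                          # no matching gesture label

# map_to_gesture_word("dad") -> "father"    map_to_gesture_word("talking") -> "talk"
```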

6] Showing the 3-D Animated Sign Gesture:

Now we have the desired words to search in our dataset; we fetch the animated sign gestures from the dataset and show them one by one according to the sequence of the words in the array. Following the above sequence, we obtain a 3-D sign gesture as output.
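A sketch of sequential playback is shown below, assuming the labelled gesture clips are stored as one video file per word; the directory layout and OpenCV-based display are illustrative assumptions:

```python
# Sequential playback sketch for the 3-D animated gesture clips. The clips/ layout
# (one video file per labelled word) and OpenCV display are illustrative assumptions.
import cv2

def play_gesture_sequence(words, clip_dir="clips"):
    for word in words:                                    # play gestures in word order
        cap = cv2.VideoCapture(f"{clip_dir}/{word}.mp4")  # labelled clip for this word
        while True:
            ok, frame = cap.read()
            if not ok:
                break                                     # end of this gesture clip
            cv2.imshow("Sign gesture", frame)
            if cv2.waitKey(30) & 0xFF == ord("q"):        # ~30 ms/frame; 'q' to quit
                break
        cap.release()
    cv2.destroyAllWindows()
```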
IV. RESULT

A) Sign to Speech Results:

Fig 9. Predicted word for the sign "love"

Fig 10. Predicted word for the sign "hello"

Fig 11. Model accuracy

Fig 12. Graph between validation accuracy and total accuracy

Fig 13. Graph between validation total loss and validation accuracy

B) Speech to Sign Results:

I) Grammatically corrected sentence result: according to the ISL grammar rules, the result of speech-to-text conversion and of the grammatically framed sentence is shown.

Fig 14. Grammatically changed sentence result
II) Breaking sentence result: the sentence spoken by the user is broken down into chunks of words as a set of words in an array, as shown in the figure.

Fig 15. Sentence breaking result

III) Removing the stop words result: the words which have no sign gestures are removed from the set of words, as shown in the figure.

Fig 16. Removed stop words result

IV) Synonym replacement and lemmatization result: the words which are not in the dataset are replaced by synonyms and the verb forms are removed; the results are shown in the figure.

Fig 17. Synonym replacement result

V) Desired 3-D animated sign gestures: for each word, the desired 3-D gesture is played in sequential order according to the array of videos; the result is shown in the figure.

Fig 18. 3-D animated sign gestures output

V. CONCLUSION

Our system works well for intercommunication between deaf and normal people. Previous systems have the limitations of being static and one-way. Our system overcomes these limitations and hence supports static as well as dynamic gestures, and it provides two-way communication between deaf and normal people. Development is done using a CNN as it is efficient and more accurate: it requires less computation compared to an ANN and performs classification based on feature extraction. Our system has an accuracy of 92.62%. Our system is beneficial for normal people and hearing-impaired people as it overcomes the communication gap between these two communities. As we are developing an application-based system, it will be portable.

In future we look forward to making a centralized interpretation application for everyone with a large dynamic dataset. We also aim to make the app available in many languages.

VI. REFERENCES

[1] Vishwa Hariharam Iyer, U. M. Prakash, Ashrut Vijay and P. Satishkumar, "Sign Language Detection using Action Recognition", 2nd International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), 2022.

[2] Tengfei Li, Yongmeng Yan and Wenqing Du, "Sign Language Recognition Based on Computer Vision", IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), 2022.

[3] Sona Shrenika and Myneni Madhu Bala, "Sign Language Recognition Using Template Matching Technique", IEEE, 2022.

[4] Rajarshi Bhadra and Subhajit Kar, "Sign Language Detection from Hand Gesture Images using Deep Multi-layered Convolution Neural Network", IEEE Second International Conference on Control, Measurement and Instrumentation (CMI), India, 2021.

[5] Yogya Tewari, Payal Soni, Shubham Singh, Murali Saketh Turlapati and Prof. Avani Bhuva, "Real Time Sign Language Recognition Framework for Two Way Communication", IEEE International Conference on Communication Information and Computing Technology (ICCICT), June 25-27, 2021, Mumbai, India.

[6] G. Anantha, K. Syamala, P. V. V. Kishore and A. S. C. S. Sastry, "Deep Convolutional Neural Networks for Sign Language Recognition", 2018.

[7] Setiawardhana, Rizky Yuniar Hakkun and Achmad Baharuddin, "Sign Language Learning based on Android for Deaf and Speech Impaired People", International Electronics Symposium (IES), 2015.

[8] S. M. Kamrul Hasan and Mohiuddin Ahmad, "A New Approach of Sign Language Recognition System for Bilingual Users", 1st International Conference on Electrical & Electronic Engineering (ICEEE), 04-06 November 2015.

[9] Vishal Pande, Parminadar Singh and Vishal Munnaluri, "Machine Learning based Approach for Indian Sign Language Recognition", Proceedings of the Seventh International Conference on Communication and Electronics Systems (ICCES), 2022.

[10] R. Kurdyumov, P. Ho and J. K. Ng, "Sign Language Classification Using Webcam Images", 2011.

