Dual Mode Sign Language Recognizer-An Android Based CNN and LSTM Prediction Model
this system is static [2]. The device makes use of a camera to record different hand motions, and the image is then processed using a variety of techniques: the image is first pre-processed, an edge detection algorithm is employed to identify the edges, and the sign is identified and its text displayed by a template-matching algorithm, so it takes more time to show the result [3]. A deep multi-layered convolutional network is used in [4], but it is applicable only to images. The model in [5] only converts signs from ASL into English characters, which is why it is static in nature.

III. ARCHITECTURE AND METHODOLOGY

Architecture:

Our system mainly provides dual communication between Deaf/Mute and normal people, helping to establish two-way communication. The architecture is divided into two modules:

Fig 1. Architecture of Sign to Speech module

Sign to Speech module: Here sign gestures are converted into speech by passing through different stages, from frame collection to speech generation. The input data passes through several phases before being fed to the sign gesture model, which finally predicts the word for a particular sign gesture.

Speech to Sign module: Here speech data is converted into text, followed by removing stop words from the text, replacing synonyms, and presenting a 3-D animated gesture for each and every word.

Fig 2. Architecture of Speech to Sign module

Methodology:

Our methodology consists of the different sections listed below.

I. Dataset Creation

For this purpose we conducted a live survey at the Anand Nivasi Mute and Dumb school, met its teachers and students, and understood their problems in depth. We took their help in collecting the dataset, but it is not possible to collect everything from them, so we currently refer to the ISL (Indian Sign Language) government portal to learn the sign action for a particular word and shoot the video of that sign ourselves. The same videos are used to train the sign to speech module.

After generating a collection of approximately 1800 videos, for the speech to 3-D animated sign gesture module we create a 3-D animated sign gesture from each 2-D video using the following tools: 1) DeepMotion (creating a 3-D animated character performing the sign gesture), 2) Blender (fine adjustment of finger and body-part movement), and 3) iClone (facial expressions).

After generating the 3-D animated gestures, we collect them in one place and then label each of them with its particular word in the dataset, so that the desired gesture is produced when the user gives the word for that label.

Sign Gestures Classifier training model: To predict the words for specific sign gestures, the proposed model consists of a Convolutional Neural Network (CNN) combined with an LSTM.

A convolutional neural network is a deep learning algorithm used for image classification. It is considered one of the best algorithms for this task because of its high prediction accuracy, low computation cost, and short prediction time.

A CNN consists of two sections through which the input data is passed:

1. Feature Extraction: This section of the CNN consists of different layers of filters used for feature extraction. Different layers extract different features of the input image, and the number of filters is increased at each level to extract more information.

2. Max Pooling: Max pooling is added to reduce the size of the input array. It is applied at each layer, and this size reduction is largely responsible for lowering the computation work.

Finally, the compressed, extracted features are flattened into an array which is given as input to the LSTM model.

A special type of recurrent neural network (RNN) called Long Short-Term Memory (LSTM) is primarily used to classify time-series data. In a conventional RNN, the output of a given layer is influenced by both the previous output and the current input.

At last, the output of the LSTM is passed as input to a dense neural network for classification.
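The classifier described above can be written as a single stacked network. Below is a minimal sketch of such a CNN + LSTM model in Keras; the clip length, frame size, filter counts and number of classes are illustrative assumptions rather than the exact values used in our experiments.

```python
# Minimal CNN + LSTM sketch (illustrative values, not the tuned network).
from tensorflow.keras import layers, models

N_FRAMES, H, W, C = 30, 64, 64, 1   # assumed frames per clip and frame size (grayscale)
N_CLASSES = 50                      # assumed number of sign words

model = models.Sequential([
    # Feature extraction: convolution filters applied to every frame of the clip
    layers.TimeDistributed(layers.Conv2D(16, 3, activation="relu"),
                           input_shape=(N_FRAMES, H, W, C)),
    layers.TimeDistributed(layers.MaxPooling2D(2)),      # max pooling shrinks each feature map
    layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D(2)),
    layers.TimeDistributed(layers.Flatten()),             # flatten the per-frame features
    # Temporal modelling: the LSTM reads the frame features in order
    layers.LSTM(64),
    # Dense classifier: one softmax unit per sign word
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

Each frame is passed through the same convolution and pooling stack, the per-frame features are flattened, the LSTM consumes them in temporal order, and the final dense layer outputs a probability for every word in the vocabulary.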
1) Sign Gestures to Speech Conversion module:

A) Live Data Collection:

This module focuses on data gathering. The OpenCV Python library is used for data collection, and the application's camera is used as the source for capturing the live video data. Since a video is nothing but a collection of frames, n frames are captured for a particular word; the frames for each word are stored separately as a stack and pre-processed before being given as input to the trained sign classifier.
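A minimal sketch of this capture step is shown below; the camera index and the number of frames recorded per word are assumed values, not fixed by the system.

```python
# Capture a fixed-length stack of frames for one sign gesture (illustrative sketch).
import cv2
import numpy as np

FRAMES_PER_WORD = 30          # assumed clip length

def capture_word_clip(camera_index=0, frames_per_word=FRAMES_PER_WORD):
    cap = cv2.VideoCapture(camera_index)   # open the application's camera
    frames = []
    while len(frames) < frames_per_word:
        ok, frame = cap.read()             # grab one frame of the live video
        if not ok:
            break
        frames.append(frame)
    cap.release()
    # stack of frames for one word, shape (n, height, width, 3)
    return np.stack(frames) if frames else None
```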
B) Data Preprocessing:

This module focuses on the important task of processing the data. The frames collected at the time of data collection are pre-processed in the following steps:

I. Gray Scale Conversion: Images are naturally captured in the RGB (Red, Green, Blue) format and are converted into grayscale using MATLAB.

II. Extraction of Features: Canny edge detection is used to detect edges in order to obtain correct findings; specific features are extracted using filters such as Gaussian filters, while the remaining noise is discarded.

III. Normalization of Input and Output data: An image cannot be fed to the system as it is; it needs to be normalized into a NumPy array and converted into the range (-1 to 1) for linearity in the input data.
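A compact sketch of these three steps is given below; it uses OpenCV and NumPy rather than MATLAB, and the target size and Canny thresholds are assumptions chosen only for illustration.

```python
# Grayscale -> Gaussian blur -> Canny edges -> normalize to the range -1..1 (sketch).
import cv2
import numpy as np

def preprocess_frame(frame, size=(64, 64)):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # I. grayscale conversion
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)      # Gaussian filter to suppress noise
    edges = cv2.Canny(blurred, 100, 200)             # II. Canny edge detection
    edges = cv2.resize(edges, size)
    # III. normalize pixel values from 0..255 into -1..1 for the network
    return edges.astype(np.float32) / 127.5 - 1.0
```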
Classification is then done using the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) model described above, which consists of the feature extraction part and a dense neural network part for class classification. Words are predicted depending upon the accuracy of the prediction, and the words having higher accuracy are stored in the array of words.

2) Speech to 3-D Animated Gestures module:

This module consists of the following steps:

1. Converting speech to text
2. Grammatical text formatting for the particular sign language
3. Breaking sentences into words
4. Removing stop words
5. Synonyms replacement
6. 3-D animation

1] Converting Speech to text:

The speech recognizer consists of three models: 1) an acoustic model, 2) a signal analyser and 3) a language model. In the acoustic model the speech is analysed on the basis of mathematical probability. A hidden Markov model (HMM) is a statistical Markov model in which the system being modelled, call it X, is assumed to be a Markov process with unobservable states; the definition of an HMM requires an observable process Y whose outcomes are "affected" in a known manner by those of X. The signal analyser converts the analog signal into a digital signal and removes unwanted data such as background noise from the audio; the language model then uses dictionary files and grammar files to identify the words and sentences spoken. After all this, the array of words is assembled into text.
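The recognizer above is described in terms of its acoustic, signal-analysis and language models; in practice this step can be delegated to an off-the-shelf recognizer. The sketch below is one possible implementation using the third-party SpeechRecognition package (not named in the original text) with Google's free web recognizer.

```python
# One possible speech-to-text step using the SpeechRecognition package
# (pip install SpeechRecognition).
import speech_recognition as sr

def speech_to_text():
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)   # reduce background noise
        audio = recognizer.listen(source)              # record one utterance
    return recognizer.recognize_google(audio)          # recognized sentence as a string
```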
2] Grammatical text formatting for the particular sign language:

Each country's sign language has predefined grammatical rules. For example, in ISL the sentence structure follows subject + object + verb, whereas the user speaks the sentence in the form subject + verb + object. Sign languages do not use gerunds, suffixes or other inflected forms; they focus only on the root words. (Note: this will depend on the sign language.)
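As a toy illustration of this reordering (a full system would need part-of-speech tagging to locate the subject, verb and object), a sentence whose constituents are already known can be rearranged like this:

```python
# Rearrange a spoken subject-verb-object sentence into ISL subject-object-verb order (toy example).
def svo_to_sov(subject, verb, obj):
    return f"{subject} {obj} {verb}"

print(svo_to_sov("I", "eat", "apples"))   # "I apples eat"
```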
3] Breaking sentences into words:

The grammatically formatted sentence is broken down into a list of words using a sentence-splitting function, and an array of the words is generated.

In the synonyms replacement step, words are replaced by the synonyms under which the gestures are labelled in the dataset. For example, for the user input "Hello dad, how are you?", the animated gesture in our dataset is labelled father, so dad is replaced by its synonym father, giving "Hello father, how are you?". In this step we also lemmatize the words by removing the v1, v2, v3 and v4 forms of the verbs; for example, Talking, Talked and Talks are all replaced by talk by removing ing, ed and s.
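A small sketch of this word-level processing (splitting, stop-word removal, synonym replacement and the suffix-based lemmatization) is given below; the stop-word list and synonym dictionary are illustrative samples, not the labels actually used in our dataset.

```python
# Split a sentence, drop stop words, map synonyms to dataset labels, strip verb suffixes (sketch).
SYNONYMS = {"dad": "father", "mom": "mother"}        # sample mapping to dataset labels
STOP_WORDS = {"a", "an", "the", "is", "am", "are"}   # sample stop words

def lemmatize(word):
    # crude rule from the text: strip "ing", "ed" and a trailing "s" to reach the root word
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def prepare_words(sentence):
    words = sentence.lower().replace(",", " ").replace("?", " ").split()
    words = [w for w in words if w not in STOP_WORDS]   # 4. remove stop words
    words = [SYNONYMS.get(w, w) for w in words]          # 5. replace synonyms
    return [lemmatize(w) for w in words]

print(prepare_words("Hello dad, how are you?"))   # ['hello', 'father', 'how', 'you']
```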
Fig 8. Removing synonyms
Fig 12. Graph between validation accuracy and total accuracy
Fig 17. Synonyms replacement

2) Desired 3-D Animated sign gestures: For each particular word the desired 3-D gesture is played, in sequential order according to the array of words; the result is shown in the figure.
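A hypothetical sketch of this playback step is given below; the mapping from words to pre-rendered gesture clips (including the file names) is an assumption made purely for illustration.

```python
# Play the pre-rendered 3-D gesture clip for each word in sequence (illustrative sketch).
import cv2

GESTURE_CLIPS = {"hello": "gestures/hello.mp4", "father": "gestures/father.mp4"}  # assumed layout

def play_gestures(words, delay_ms=30):
    for word in words:
        path = GESTURE_CLIPS.get(word)
        if path is None:
            continue                          # no gesture available for this word
        cap = cv2.VideoCapture(path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break                         # end of this clip
            cv2.imshow("gesture", frame)      # display one frame of the animation
            cv2.waitKey(delay_ms)
        cap.release()
    cv2.destroyAllWindows()
```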
The analysis of some of the existing systems is given below in detail:

[5] Yogya Tewari, Payal Soni, Shubham Singh, Murali Saketh Turlapati and Prof. Avani Bhuva, "Real Time Sign Language Recognition Framework For Two Way Communication", IEEE International Conference on Communication Information and Computing Technology (ICCICT), June 25-27, 2021, Mumbai, India.
[6] G. Anantha, K. Syamala, P.V.V. Kishore and A.S.C.S. Sastry, "Deep Convolutional Neural Networks For Sign Language Recognition", 2018.
[7] Setiawardhana, Rizky Yuniar Hakkun and Achmad Baharuddin, "Sign Language Learning based on Android For Deaf and Speech Impaired People", International Electronics Symposium (IES), 2015.