
Conversion of Sign Language to Text and Audio Using Deep Learning Techniques

K. Kaviyadharshini, S. P. Abirami, R. Nivetha, M. Ramyaa, and M. Vaseegaran

K. Kaviyadharshini · R. Nivetha · M. Ramyaa · M. Vaseegaran
Department of Computer Science and Engineering, Coimbatore Institute of Technology, Coimbatore, Tamil Nadu, India

S. P. Abirami (B)
Coimbatore Institute of Technology, Coimbatore, Tamil Nadu, India
e-mail: abirami.sp@cit.edu.in

Abstract Sign languages rely on manual gestures to communicate and were devised as an effective means of communication for the deaf. An automatic recognition system is of great help both to people with hearing and speech impairments and to people who have no knowledge of sign language, since it eliminates the need for an intermediary as a translation medium. The number of sign languages used worldwide is unknown; typically, each nation has its own native sign languages. The primary goal of this study is to recognize the poses and hand gestures of the sign languages available worldwide, convert them into text and audio, and make the system bilingual. This system tries to build a communication bridge between people with hearing and speech impairments and the rest of society. Many researchers have already studied this area in pursuit of better results. In this research work, the conversion of sign language to text and audio is carried out using techniques such as MediaPipe Holistic, landmark drawing, OpenCV, an LSTM neural network, Google Translate, and GTTS, achieving a good accuracy of 98%. This model can be further improved by adding more language conversions for better performance.

Keywords Gesture recognition · Sign language recognition · Key points · MediaPipe Holistic · OpenCV · LSTM neural network

1 Introduction

This system’s major objective is to aid those with speech and hearing impairments in connecting with other people. Sign language is a method of communication without words; in its structured form, each gesture denotes a certain concept or character. With the development of science and technology, numerous researchers
are working on strategies that could raise the level of human–computer interaction. Once trained, the computer can translate both static and dynamic frames into text. The system is set up and intended to recognize postures and movements used in sign language and to instantly display the relevant text and play the relevant audio for a given action. Videos are pre-processed after being recorded. Key points are extracted using the MediaPipe Holistic approach [2], which gathers important information from the hands, face, and body. The video is accessed, recorded, and split into N frames using the OpenCV webcam interface. Key point values are gathered to train and test the model, and the model is trained and evaluated using the LSTM (Long Short-Term Memory) method [1]. With the help of Google Translate, the corresponding English word is translated into Tamil. The Tamil word is then converted to audio using a library called GTTS and played immediately; by importing GTTS, any Tamil word can be converted into the corresponding audio format. Thus, this model converts a sign into its respective English text and, as a bilingual model, into its respective Tamil audio.

2 State of the Art

Many approaches have been proposed for employing machine learning and deep learning to translate sign languages. This literature review covers a variety of such related work.
P. Ushasri et al. (2022) proposed a method that can identify hand poses and gestures from sign language in real time using CNN and RNN. Using this methodology, the system is able to achieve a good model for gesture prediction.
R. Harini et al. (2020) suggested a computer vision system that can instantly translate user-provided signs into text. Four modules, covering image capture, pre-processing, classification, and prediction, are included in the suggested system. A CNN is employed by the proposed system to transform sign gestures into text. The model's accuracy is 99.91%.
Ashok Kumar Sahoo (2021) describes a system for automatic recognition of static ISL numeric signs. Only a normal digital camera was used to acquire the signs; no wearable devices were necessary to capture electrical signals. Each submitted sign image must contain exactly one numeric sign so that the system can translate single-digit signs into text. The system employs two classifiers, k-Nearest Neighbor and Naive Bayes, and compares them in terms of classification accuracy.
From this review of sign language conversion to text using machine learning and deep learning techniques, it is observed that deep learning algorithms are a good choice for converting sign language to text.

3 Proposed System

The primary goal of the research project is to use deep learning and machine learning algorithms to interpret sign language as text and audio. The input to this model is a real-time sign video, which is treated as the dataset for the model. Initially, the model was tested with four words: "Hello", "Thank you", "Please", and "Help". This research work incorporates the MediaPipe Holistic library for finding and extracting the key points from the face, hands, and body. Another important library, OpenCV, is employed; it supports a variety of applications, including facial recognition, object tracking, landmark recognition, and more [6]. In this work it is used to access the web camera in order to detect the human face. The work is trained, tested, and evaluated using the LSTM (Long Short-Term Memory) deep learning method. Further, the model incorporates Google Translate to convert the corresponding English word into a Tamil word, making the model bilingual [4]. The GTTS library is used to convert the Tamil word into Tamil audio.
Figure 1 depicts the flow diagram of the proposed approach. Initially, a video is fed into the application via a webcam. The video is then divided into frames, the key points are extracted from each frame, and those key points are stored as vectors in a database. The data from the database is partitioned into training and testing sets. The training dataset is given to the LSTM model, which is trained until the accuracy level is satisfactory; if not, the model architecture is fine-tuned to reach the required accuracy level. Classification is then done using the trained model, which outputs the words corresponding to the signs in the video. After delivering the words, the system plays the respective Tamil audio for the signs [9].

3.1 MediaPipe Holistic

The MediaPipe Holistic library is used to extract the important details from the hands, face, and body. In this model, holistic tracking is leveraged to track 33 pose landmarks, 21 landmarks per hand, and 468 face landmarks simultaneously and in a semantically coherent manner. With the help of MediaPipe, a total of 1662 key point values are captured and processed for each frame and fed into the model to get the exact word for the corresponding sign.
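To make this step concrete, the following is a minimal sketch (not taken from the paper) of how the 1662 key point values per frame could be assembled with the MediaPipe Holistic API; the helper name extract_keypoints and the sample image file are illustrative assumptions.

```python
# Minimal sketch (not from the paper): assembling the 1662 key point values
# per frame with MediaPipe Holistic. The file name "sample_frame.jpg" and the
# helper name extract_keypoints are illustrative.
import cv2
import mediapipe as mp
import numpy as np

mp_holistic = mp.solutions.holistic

def extract_keypoints(results):
    """Flatten pose (33 x 4), face (468 x 3) and hand (21 x 3 each) landmarks
    into one vector of 132 + 1404 + 63 + 63 = 1662 values, using zeros when a
    body part is not detected in the frame."""
    pose = (np.array([[p.x, p.y, p.z, p.visibility]
                      for p in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    face = (np.array([[p.x, p.y, p.z]
                      for p in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    lh = (np.array([[p.x, p.y, p.z]
                    for p in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[p.x, p.y, p.z]
                    for p in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, face, lh, rh])  # shape: (1662,)

with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    frame = cv2.imread("sample_frame.jpg")                  # any BGR frame
    results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    keypoints = extract_keypoints(results)                  # fed to the model
```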

3.2 OpenCV

OpenCV is a cross-platform library used to create applications for real-time image processing. The OpenCV library is capable of reading and writing images, recording and saving videos, processing images (filtering, transforming), performing feature detection, identifying particular objects (such as faces, eyes, or cars) in images or videos, and estimating motion in videos by removing the background and tracking objects in them [3]. In this model, OpenCV is utilized to record video and to find important regions on the body, hands, and face. With the help of this library, the computer's webcam is accessed and the final result can be shown to the users in a clear way.

Fig. 1 Flow diagram of the proposed approach
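As an illustration of this step, a minimal OpenCV sketch for accessing the webcam and collecting a fixed number of frames might look as follows; the frame count of 30 per sign follows the description in this paper, while the window title and the exit key are assumptions.

```python
# Minimal sketch (not from the paper): accessing the webcam with OpenCV and
# collecting 30 frames for one sign video; the window title and the 'q' exit
# key are assumptions.
import cv2

cap = cv2.VideoCapture(0)                     # default webcam
frames = []
while cap.isOpened() and len(frames) < 30:    # 30 frames per sign video
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(frame)
    cv2.imshow("Sign capture", frame)         # live feedback for the signer
    if cv2.waitKey(10) & 0xFF == ord('q'):    # allow early exit
        break
cap.release()
cv2.destroyAllWindows()
```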

3.3 LSTM (Long Short-Term Memory)

LSTM is a variant of the RNN (Recurrent Neural Network). It is a sequential network that enables the persistence of information and can deal with the long-term dependency problem, which arises when RNNs need to remember information for extended periods of time. An LSTM in general has a cell state and three gates, which selectively learn, unlearn, or preserve information in each unit. The cell state enables information to pass through the units without being altered. Each unit includes an input gate, a forget gate, and an output gate; information can be added to the cell state and removed from it with the help of these gates. The forget gate decides which information from the previous cell state has to be forgotten, and it does so by using the sigmoid function [8]. It looks at h_{t-1} and x_t and outputs a number between 0 and 1 for every value in the cell state C_{t-1}; a 1 represents "completely keep this" and a 0 represents "completely discard this". The formula used to calculate the information to be forgotten is given below:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

Here the old cell state is C_{t-1} and the new cell state is C_t. The next step is to decide what new information will be stored in the cell state. The input gate layer (a sigmoid layer) decides which values to update, and a tanh layer creates a vector of candidate new values that could be added to the state. The formulas for the sigmoid layer and the tanh layer are given below:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

The old cell state is then updated: the information selected by the forget gate is removed and the new candidate information is added, as follows:

C_t = f_t * C_{t-1} + i_t * C̃_t

Finally, the output must be determined. This output is a filtered version of the cell state. First, a sigmoid layer chooses which parts of the cell state will be output. The cell state is then passed through tanh (to push the values between −1 and 1) and multiplied by the output of the sigmoid gate, so that only the selected parts are output:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

h_t = o_t * tanh(C_t)
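For readers who prefer code to formulas, the sketch below implements a single LSTM time step in NumPy directly from the gate equations above; the input size of 1662 and the hidden size of 64 are chosen to match this work, and all weights are random placeholders rather than trained values.

```python
# Illustrative sketch: one LSTM time step in NumPy, following the gate
# equations above. Weights are random placeholders; D and H are chosen to
# match this work (1662 key point values per frame, 64 hidden units).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)               # forget gate
    i_t = sigmoid(W_i @ z + b_i)               # input gate
    c_tilde = np.tanh(W_c @ z + b_c)           # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde         # updated cell state
    o_t = sigmoid(W_o @ z + b_o)               # output gate
    h_t = o_t * np.tanh(c_t)                   # updated hidden state
    return h_t, c_t

D, H = 1662, 64
rng = np.random.default_rng(0)
W = lambda: rng.standard_normal((H, H + D)) * 0.01   # placeholder weights
b = lambda: np.zeros(H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.standard_normal(D), h, c,
                 W(), b(), W(), b(), W(), b(), W(), b())
```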

3.4 Google Text to Speech (GTTS)

The Python package GTTS can turn entered text into audio and create an MP3 file from it. English, Hindi, Tamil, French, German, and a host of other languages are among the many that the GTTS API supports. In this model it is used to convert Tamil text to Tamil audio, and the audio is played at the time of detection, making the system a real-time model.
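A minimal sketch of the translation and speech step might look as follows; the paper names Google Translate and GTTS, while the specific googletrans and playsound packages used here, and the output file name, are assumptions.

```python
# Minimal sketch: translating a recognised English word to Tamil and playing
# Tamil audio. The paper names Google Translate and GTTS; the googletrans and
# playsound packages used here are assumptions, as is the file name.
from googletrans import Translator
from gtts import gTTS
from playsound import playsound

word = "Hello"                                      # output of the sign classifier
tamil_word = Translator().translate(word, dest="ta").text
gTTS(text=tamil_word, lang="ta").save("word.mp3")   # Tamil speech as an MP3 file
playsound("word.mp3")                               # play right after detection
```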

Fig. 2 Result of OpenCV and MediaPipe Holistic

4 Overall Findings and Discussion

4.1 Input Dataset

For model training, an input dataset was recorded. The input dataset contains seven action words: 'HELP', 'THANKS', 'PLEASE', 'HELLO', 'MILK', 'FOOD', and 'WANT'. These words can be merged to form complete, meaningful sentences. A collection of 30 videos is taken as the training dataset for each sign and then converted to array format; each video is divided into 30 frames, and the key points are extracted from each frame.
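Assuming each frame's key points were saved to disk as NumPy arrays, the training arrays described above could be assembled roughly as in the sketch below; the MP_Data directory layout and file names are illustrative, not specified in the paper.

```python
# Sketch under assumptions: each frame's key points were saved earlier as a
# (1662,) NumPy array under MP_Data/<sign>/<video>/<frame>.npy. The directory
# layout and file names are illustrative, not specified in the paper.
import os
import numpy as np

SIGNS = ["HELP", "THANKS", "PLEASE", "HELLO", "MILK", "FOOD", "WANT"]
VIDEOS_PER_SIGN, FRAMES_PER_VIDEO = 30, 30

sequences, labels = [], []
for label, sign in enumerate(SIGNS):
    for video in range(VIDEOS_PER_SIGN):
        window = [np.load(os.path.join("MP_Data", sign, str(video), f"{frame}.npy"))
                  for frame in range(FRAMES_PER_VIDEO)]
        sequences.append(window)                  # one (30, 1662) sequence
        labels.append(label)                      # integer class of the sign

X = np.array(sequences)                           # shape: (7 * 30, 30, 1662)
y = np.array(labels)
```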

4.2 OpenCV and MediaPipe Holistic

The pose landmarks involve 33 key points, the face landmarks involve 468 key points, and the left and right hands involve 21 landmarks each; in total, 1662 landmark (key point) values are collected for each frame, and the results are stored in an array [7, 10]. Figure 2 displays the outcome of the model discussed previously.

4.3 LSTM Model

The input given to this model is X = np.array(sequences), with shape (30, 1662) per sequence, fed to the first layer. All 1662 features (the key points of the 30 frames for each video, for all the signs) are connected to the 64 units of the first layer. The output of this layer is a 30 × 64 matrix, because return_sequences is set to True in the first LSTM layer and the returned sequence is fed to the subsequent LSTM layer [5]. Similarly, the output of the second layer is a 30 × 128 matrix, which is given to the next layer. In the final LSTM layer the return sequence is False, so the output of this layer is a vector of 64 values, which is given to the dense layers, where a similar transformation takes place. The diagrammatic representation of the LSTM model, as employed in this research, is depicted in Fig. 3.

Fig. 3 Diagrammatic representation of the LSTM model
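A sketch of a stacked LSTM with the layer sizes described above (64, 128, and 64 units followed by dense layers) is given below in Keras; the dense layer sizes before the final softmax, the activations, and the training settings are assumptions rather than details from the paper.

```python
# Sketch of a stacked LSTM matching the layer sizes described above; the dense
# layer sizes before the final softmax, the activations and the training
# settings are assumptions, not taken from the paper.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

num_signs = 7
model = Sequential([
    LSTM(64, return_sequences=True, activation="relu", input_shape=(30, 1662)),
    LSTM(128, return_sequences=True, activation="relu"),
    LSTM(64, return_sequences=False, activation="relu"),  # last LSTM returns a vector
    Dense(64, activation="relu"),
    Dense(num_signs, activation="softmax"),               # one probability per sign
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])           # labels must be one-hot
model.summary()
```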

4.4 Google Text to Speech

The Google Text-to-Speech translator is implemented in the system. It is used to convert the Tamil word to Tamil audio, and the converted audio is played at the time of prediction. The form of the audio output produced at prediction time is shown in Fig. 4. The result of applying all the functions and libraries to the dataset, which is trained and tested using the LSTM model in order to convert the signs to English text and Tamil audio, is the final output of the project shown in Fig. 5. The output of the model that gives the respective Tamil audio with the help of GTTS is shown in Fig. 6.
Here the output is received as both English and Tamil words: the Tamil words are delivered as audio using GTTS, while the English words are printed on the screen at the top of the window. This happens simultaneously, with the English words displayed in sequence at the top of the window while the audio for the Tamil words is played.
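Putting the pieces together, the real-time behaviour described above could be organised as a sliding window over the most recent 30 frames, as in the hedged sketch below; the helpers extract_keypoints and speak_tamil are assumed to wrap the MediaPipe and translation/GTTS steps sketched earlier, and the confidence threshold is an assumption.

```python
# Hedged sketch of the real-time loop implied above: keep a sliding window of
# the last 30 frames of key points, classify it, display the English word and
# speak the Tamil translation. extract_keypoints and speak_tamil are assumed
# to wrap the MediaPipe and translation/GTTS steps sketched earlier; the
# confidence threshold is an assumption.
import numpy as np

THRESHOLD = 0.8
sequence, sentence = [], []

def on_new_frame(results, model, signs):
    """Called once per webcam frame with the MediaPipe results."""
    global sequence
    sequence.append(extract_keypoints(results))      # (1662,) vector per frame
    sequence = sequence[-30:]                        # keep only the last 30 frames
    if len(sequence) == 30:
        probs = model.predict(np.expand_dims(sequence, axis=0))[0]
        if probs[np.argmax(probs)] > THRESHOLD:
            word = signs[int(np.argmax(probs))]
            if not sentence or sentence[-1] != word:
                sentence.append(word)                # shown at the top of the window
                speak_tamil(word)                    # translate + GTTS, as sketched above
```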

Fig. 4 Audio output

Fig. 5 Output of the model

Fig. 6 Output of audio for sign

5 Conclusion and Future Work

Sign language conversion to text and audio is an important concern from which many people will benefit. The proposed system is implemented using the MediaPipe Holistic library for detecting key points in the video; OpenCV is used to access the web camera during dataset collection and testing; the LSTM neural network is the main model of this project and is used to train on the dataset; a translator is used to translate the English word into a Tamil word; and GTTS is used to convert the Tamil word into its corresponding Tamil audio. The sign video is given as input to the model, which is trained and tested using the LSTM algorithm, and 98% accuracy is achieved with this model. In this system, the corresponding English word for a specific sign is displayed along with its Tamil audio. Further, multiple language conversions can be incorporated into this model, making it a multilingual model.

References

1. Abraham A, Rohini V (2018) Real time conversion of sign language to speech and prediction
of gestures using artificial neural network. Procedia Comput Sci 143:587–594
2. Bharathi CU, Ragavi G, Karthika K (2021) Sign language to text and speech conversion.
In: 2021 International conference on advancements in electrical, electronics, communication,
computing and automation (ICAECA). IEEE, pp 1–4
3. Camgoz NC, Koller O, Hadfield S, Bowden R (2020) Sign language transformers: Joint end-to-end sign language recognition and translation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10023–10033
4. Chai X, Li G, Lin Y, Xu Z, Tang Y, Chen X, Zhou M (2013) Sign language recognition and
translation with kinect. In: IEEE conference on AFGR, vol 655, p 4
5. Chandra MM, Rajkumar S, Kumar LS (2019) Sign languages to speech conversion prototype
using the SVM classifier. In: TENCON 2019–2019 IEEE region 10 conference (TENCON).
IEEE, pp 1803–1807
6. Cooper H, Holt B, Bowden R (2011) Sign language recognition. In: Visual analysis of humans.
Springer, London, pp 539–562
7. Dutta KK, Anil Kumar GS (2015) Double handed Indian sign language to speech and text.
In: 2015 Third international conference on image information processing (ICIIP). IEEE, pp
374–377
8. Elmahgiubi M, Ennajar M, Drawil N, Elbuni MS (2015) Sign language translator and gesture
recognition. In: 2015 Global summit on computer & information technology (GSCIT). IEEE,
pp 1–6
9. Hosain AA, Santhalingam PS, Pathak P, Rangwala H, Kosecka J (2021) Hand pose guided
3d pooling for word-level sign language recognition. In: Proceedings of the IEEE/CVF winter
conference on applications of computer vision, pp 3429–3439
10. Kunjumon J, Megalingam RK (2019) Hand gesture recognition system for translating Indian
sign language into text and speech. In: 2019 International conference on smart systems and
inventive technology (ICSSIT). IEEE, pp 14–18
