Conversion of Sign Language To Text and Audio Using Deep Learning Techniques
1 Introduction
The major objective of this system is to help people with speech and hearing impairments communicate with everyone else. Sign language is a method of communication without words: in its structured form, each gesture denotes a certain concept or character. With the development of science and technology, numerous researchers are working on strategies that could raise the level of human–computer interaction. Thanks to its training, the computer can translate both static and dynamic frames into text. The system is designed to recognize the postures and movements used in sign language and to instantly display the relevant text and play the audio for a given action. Videos are pre-processed after being recorded, and key points are extracted using the MediaPipe Holistic approach [2]. MediaPipe Holistic gathers important landmark information from the hands, face, and body. The video is accessed and recorded with the OpenCV webcam and transformed into N frames, from which key point values are gathered to train and test the model. The model is trained and evaluated using the LSTM (Long Short-Term Memory) method [1]. With the help of Google Translate, the corresponding English word is translated into Tamil; the Tamil word is then converted to audio using a library called GTTS and played immediately. By importing GTTS, any Tamil word can be converted into the corresponding audio format. Thus, this bilingual model converts a sign into its respective English text and its respective Tamil audio.

K. Kaviyadharshini et al., in: V. Goar et al. (eds.), Advances in Information Communication Technology and Computing, Lecture Notes in Networks and Systems 628. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. https://doi.org/10.1007/978-981-19-9888-1_21
2 State of the Art
Many approaches have been proposed for translating sign language using machine learning and deep learning. This literature review covers a wide variety of such related work.
P. Ushasri et al. (2022) proposed a method that identifies hand poses and gestures from sign language in real time using CNN and RNN. With this methodology, the system achieves a good model for gesture prediction.
R. Harini et al. (2020) suggested a computer-vision system that instantly translates user-provided signs into text. The proposed system comprises four modules: image capture, pre-processing, classification, and prediction. It employs a CNN to transform sign gestures into text, and the model’s accuracy is 99.91%.
Ashok Kumar Sahoo (2021) describes a system for automatic recognition of static ISL numeric signs. A normal digital camera was used to acquire the signs; no wearable devices were necessary to capture electrical signals. Since the system translates single-digit signs into text, each submitted sign image must contain exactly one numeric sign. It employs two classifiers, k-Nearest Neighbor and Naive Bayes, compared in terms of classification accuracy.
From this review of sign-language-to-text conversion using deep learning and machine learning techniques, it is observed that deep learning algorithms are a good choice for converting sign language to text.
3 Proposed System
The primary goal of this research is to use deep learning and machine learning algorithms to interpret sign language as text and audio. The input to the model is a real-time sign video, which is treated as the dataset. Initially the model was tested with four words: “Hello”, “Thank you”, “Please” and “Help”. This work incorporates the MediaPipe Holistic library for finding and extracting the key points from the face, hands and body. Another important library, OpenCV, supports a variety of applications, including facial recognition, object tracking, landmark recognition, and more [6]; here it is used to access the web camera in order to detect the human face. The work is trained, tested and evaluated using the LSTM (Long Short-Term Memory) deep learning method. Google Translate is further incorporated to convert the recognized English word into the corresponding Tamil word, making the model bilingual [4], and the GTTS library is used to convert the Tamil word into Tamil audio.
Figure 1 depicts the flow diagram of the system. Initially a video is fed into the application via a webcam. The video is divided into frames, the key points are taken from each frame, and those key points are stored as vectors in a database. The data from the database is partitioned into training and testing sets. The training dataset is then given to the LSTM model, which is trained until the accuracy reaches a satisfactory level; if it does not, the model architecture is fine-tuned until the required accuracy is reached. Classification with the trained model then outputs the words that correspond to the sign in the video. After delivering the words, the system plays the respective Tamil audio of the signs [9].
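The flow above can be sketched in Python. The function names, the dummy video, and the key point extractor are illustrative stand-ins (the real extractor is MediaPipe Holistic, described below); only the frame count and the key point vector length follow the setup in this work.

```python
import numpy as np

# Constants from this work's setup: 30 frames per clip,
# 1662 key point values per frame.
FRAMES_PER_CLIP = 30
KEYPOINTS_PER_FRAME = 1662

def video_to_frames(video, n_frames=FRAMES_PER_CLIP):
    """Sample n_frames evenly spaced frames from a sequence of images."""
    step = max(len(video) // n_frames, 1)
    return list(video)[::step][:n_frames]

def extract_keypoints(frame):
    """Stand-in for MediaPipe Holistic: one 1662-value key point vector."""
    return np.zeros(KEYPOINTS_PER_FRAME)

def clip_to_sequence(video):
    """Frames -> (30, 1662) array, the per-clip shape stored for the LSTM."""
    return np.stack([extract_keypoints(f) for f in video_to_frames(video)])

# A dummy 90-frame "video" of blank 64x64 images.
seq = clip_to_sequence([np.zeros((64, 64)) for _ in range(90)])
print(seq.shape)  # (30, 1662)
```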
3.1 Media Pipe Holistic
The MediaPipe Holistic library is used to extract the important details from the hands, face, and body. In this model, holistic tracking is leveraged to track 33 pose landmarks, 21 landmarks per hand, and 468 facial landmarks simultaneously and in a semantically coherent manner. With the help of MediaPipe, a total of 1662 key point values per frame are captured, processed, and fed into the model to obtain the exact word for the corresponding sign.
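A common way to flatten a Holistic result into the 1662-value vector described above is sketched below. The attribute names follow MediaPipe's result object (`pose_landmarks`, `face_landmarks`, `left_hand_landmarks`, `right_hand_landmarks`); missing detections are zero-filled so the vector length stays constant.

```python
import numpy as np

def extract_keypoints(results):
    """Flatten a MediaPipe Holistic result into one 1662-value vector.
    Missing detections are zero-filled so the length is always 1662."""
    # 33 pose landmarks x (x, y, z, visibility) = 132 values
    pose = (np.array([[p.x, p.y, p.z, p.visibility]
                      for p in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    # 468 face landmarks x (x, y, z) = 1404 values
    face = (np.array([[p.x, p.y, p.z]
                      for p in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    # 21 landmarks per hand x (x, y, z) = 63 values each
    lh = (np.array([[p.x, p.y, p.z]
                    for p in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[p.x, p.y, p.z]
                    for p in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, face, lh, rh])  # 132 + 1404 + 63 + 63 = 1662
```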
3.2 Open CV
OpenCV is an open-source computer vision library that supports processing images and videos and estimating the motion in videos by removing the background and tracking objects in them [3]. In this model, OpenCV is utilized to record video and locate the important regions on the body, hands, and face. With the help of this library, the computer’s webcam is accessed and the final result can be shown to the users in an obvious way.
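A minimal capture helper along these lines, assuming OpenCV's `cv2.VideoCapture` with device index 0 as the webcam; the import is deferred so the sketch can be read without OpenCV installed.

```python
def record_frames(n_frames=30, device=0):
    """Grab up to n_frames frames from the webcam with OpenCV (sketch)."""
    import cv2  # deferred import; OpenCV only needed at call time
    cap = cv2.VideoCapture(device)
    frames = []
    try:
        while cap.isOpened() and len(frames) < n_frames:
            ok, frame = cap.read()  # ok is False when no frame is available
            if not ok:
                break
            frames.append(frame)
    finally:
        cap.release()  # always free the camera
    return frames
```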
3.3 LSTM
The first step in an LSTM is to decide what information to throw away from the cell state. This decision is made by a sigmoid layer called the forget gate layer [8]. It looks at h_{t−1} and x_t and outputs a number between 0 and 1 for every value in the cell state C_{t−1}: a 1 represents “completely keep this” and a 0 represents “completely get rid of this”. The formula used to calculate the information to be forgotten is given below:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
The old cell state here is C_{t−1}, and the new cell state is C_t. The next step is to decide what new data will be stored in the cell state. The value to be updated is decided by the input gate layer (a sigmoid layer), and a vector of candidate new values that could be added to the state, C̃_t, is created by a tanh layer. The formulas for the sigmoid layer and the tanh layer are given below:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
The new data must then be added to the cell state after the old data has been removed. The calculation formula for this step is given below:

C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t
Finally, it must be decided what to output. This output will be a filtered version of the cell state. First, a sigmoid layer chooses which parts of the cell state will be output. Then the cell state is put through tanh (to push the values between −1 and 1) and multiplied by the result of the sigmoid gate, so that only the chosen parts are output.

o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ∗ tanh(C_t)
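The gate equations above can be checked with a small NumPy implementation of a single LSTM step; the weight layout and the toy dimensions are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step implementing the gate equations above.
    Each W[k] has shape (hidden, hidden + inputs) and acts on [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W['f'] @ z + b['f'])      # forget gate
    i_t = sigmoid(W['i'] @ z + b['i'])      # input gate
    c_tilde = np.tanh(W['c'] @ z + b['c'])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde      # new cell state C_t
    o_t = sigmoid(W['o'] @ z + b['o'])      # output gate
    h_t = o_t * np.tanh(c_t)                # new hidden state h_t
    return h_t, c_t

# Toy dimensions: 4 hidden units, 3 input features.
rng = np.random.default_rng(0)
hidden, inputs = 4, 3
W = {k: rng.standard_normal((hidden, hidden + inputs)) for k in 'fico'}
b = {k: np.zeros(hidden) for k in 'fico'}
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.standard_normal(inputs), h, c, W, b)
print(h.shape)  # (4,)
```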
3.4 GTTS
The Python package GTTS turns entered text into audio and creates an MP3 file. The GTTS API supports many languages, including English, Hindi, Tamil, French, and German. In this model it is used to convert Tamil text to Tamil audio; the audio is played at the time of detection, making the system a real-time model.
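A minimal wrapper around gTTS along these lines, assuming the `gtts` package is installed; the empty-text guard and the output file name are illustrative additions, and the actual synthesis needs network access at call time.

```python
def tamil_to_audio(text, path="sign_output.mp3"):
    """Save Tamil text as an MP3 using gTTS (requires network when called)."""
    if not text.strip():
        raise ValueError("nothing to speak")
    from gtts import gTTS  # deferred so the sketch reads without gTTS installed
    gTTS(text=text, lang="ta").save(path)  # "ta" selects Tamil
    return path
```

Playing the saved MP3 at prediction time (for example with a player library) is left to the caller.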
4 Results and Discussion
The input dataset has been provided for model training. It contains seven action words: ‘HELP’, ‘THANKS’, ‘PLEASE’, ‘HELLO’, ‘MILK’, ‘FOOD’ and ‘WANT’; these words can be merged to form complete, meaningful sentences. A collection of 30 videos per sign is taken as the training dataset and converted to array format. Each video is divided into 30 frames, from which the key points are extracted.
The pose landmarks involve 33 key points, the face landmarks involve 468 key points, and the left and right hands involve 21 landmarks each. In total, 1662 key point values are collected for each frame, and the results are stored in an array [7, 10]. Figure 2 displays the outcome of the model discussed previously.
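The per-frame key point count and the resulting dataset shapes can be verified directly; the arrays below hold random placeholder values, not real key points.

```python
import numpy as np

# Per-frame key point values described above:
# 33 pose x (x, y, z, visibility) + 468 face x (x, y, z)
# + 21 left hand x (x, y, z) + 21 right hand x (x, y, z)
n_keypoints = 33 * 4 + 468 * 3 + 21 * 3 + 21 * 3
print(n_keypoints)  # 1662

# 7 signs, 30 training videos per sign, 30 frames per video.
n_signs, videos_per_sign, frames = 7, 30, 30
X = np.random.rand(n_signs * videos_per_sign, frames, n_keypoints)
y = np.repeat(np.arange(n_signs), videos_per_sign)  # one label per video
print(X.shape, y.shape)  # (210, 30, 1662) (210,)
```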
The input given to this model is X = np.array(sequences), with one array of shape (30, 1662) per video fed to the first layer. All 1662 features (the key points of the 30 frames of each video, for all the signs) are connected to the 64 units of the first layer. The output of this layer is a 30 × 64 matrix, because the return sequence is set to True in the first LSTM layer and is fed to the subsequent LSTM layer [5]. Similarly, the output of the second layer is a 30 × 128 matrix, which is given to the next layer. In the final LSTM layer the return sequence is False, so the output of this layer is a vector of 64 values, which is given to the dense layer; the same function happens in the dense layer. The LSTM model as employed in this research is depicted in Fig. 3.
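The shape flow just described can be traced with a small helper, followed by a hedged Keras version of the stack; the dense layer width and activations are assumptions where the text does not specify them.

```python
def lstm_output_shape(in_shape, units, return_sequences):
    """Output shape of one LSTM layer given a (timesteps, features) input."""
    timesteps, _ = in_shape
    return (timesteps, units) if return_sequences else (units,)

shape = (30, 1662)  # one clip: 30 frames x 1662 key point values
for units, ret in [(64, True), (128, True), (64, False)]:
    shape = lstm_output_shape(shape, units, ret)
    print(shape)
# (30, 64) -> (30, 128) -> (64,)

def build_model(n_signs):
    """Keras sketch of the stack described above (TensorFlow assumed installed)."""
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense
    return Sequential([
        LSTM(64, return_sequences=True, input_shape=(30, 1662)),
        LSTM(128, return_sequences=True),
        LSTM(64, return_sequences=False),
        Dense(64, activation="relu"),            # assumed dense width
        Dense(n_signs, activation="softmax"),    # one probability per sign
    ])
```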
The Google Text-to-Speech translator is implemented with improved accuracy: it converts the Tamil word to Tamil audio, and the converted audio is played at the time of prediction. The form of this output is shown in Fig. 4. The result of applying all the functions and libraries to the dataset, which is trained and tested with the LSTM model in order to convert the sign to English text and Tamil audio, is shown in Fig. 5; the output of the model that delivers the respective Tamil audio with the help of GTTS is shown in Fig. 6.

The output is received as both an English word and a Tamil word: the Tamil word is delivered as audio using GTTS, while the English words are printed in sequence at the top of the window. This happens simultaneously, with the English words displayed as the audio generated from the Tamil words is played.
5 Conclusion
Sign language conversion to text and audio is an important concern from which many people will benefit. The proposed system is implemented using the MediaPipe Holistic library for detecting key points in the video; OpenCV is used to access the web camera during dataset collection and testing; an LSTM neural network, the main model of this project, is used to train on the dataset; a translator converts the English word into a Tamil word; and GTTS converts the Tamil word into its corresponding Tamil audio. The sign video is given as input to the model, which is trained and tested using the LSTM algorithm, and 98% accuracy is achieved. The system displays the corresponding English word for a specific sign along with its Tamil audio. Further, multiple language conversions can be incorporated into this model, making it a multilingual model.
References
1. Abraham A, Rohini V (2018) Real time conversion of sign language to speech and prediction
of gestures using artificial neural network. Procedia Comput Sci 143:587–594
2. Bharathi CU, Ragavi G, Karthika K (2021) Sign language to text and speech conversion.
In: 2021 International conference on advancements in electrical, electronics, communication,
computing and automation (ICAECA). IEEE, pp 1–4
3. Camgoz NC, Koller O, Hadfield S, Bowden R (2020) Sign language transformers: joint end-to-end sign language recognition and translation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10023–10033
4. Chai X, Li G, Lin Y, Xu Z, Tang Y, Chen X, Zhou M (2013) Sign language recognition and
translation with kinect. In: IEEE conference on AFGR, vol 655, p 4
5. Chandra MM, Rajkumar S, Kumar LS (2019) Sign languages to speech conversion prototype
using the SVM classifier. In: TENCON 2019–2019 IEEE region 10 conference (TENCON).
IEEE, pp 1803–1807
6. Cooper H, Holt B, Bowden R (2011) Sign language recognition. In: Visual analysis of humans.
Springer, London, pp 539–562
7. Dutta KK, Anil Kumar GS (2015) Double handed Indian sign language to speech and text.
In: 2015 Third international conference on image information processing (ICIIP). IEEE, pp
374–377
8. Elmahgiubi M, Ennajar M, Drawil N, Elbuni MS (2015) Sign language translator and gesture
recognition. In: 2015 Global summit on computer & information technology (GSCIT). IEEE,
pp 1–6
9. Hosain AA, Santhalingam PS, Pathak P, Rangwala H, Kosecka J (2021) Hand pose guided
3d pooling for word-level sign language recognition. In: Proceedings of the IEEE/CVF winter
conference on applications of computer vision, pp 3429–3439
10. Kunjumon J, Megalingam RK (2019) Hand gesture recognition system for translating Indian
sign language into text and speech. In: 2019 International conference on smart systems and
inventive technology (ICSSIT). IEEE, pp 14–18