
SIGN LANGUAGE PREDICTION USING ACTION RECOGNITION

Naladala Navya, Akila

Vellore Institute of Technology

ABSTRACT

Sign language detection is a critical aspect of facilitating communication for individuals with hearing impairments. This paper presents a novel approach to sign language detection using action recognition techniques, with a focus on Deep Deterministic Policy Gradient (DDPG) as the underlying method. By leveraging deep reinforcement learning, specifically DDPG, our proposed method aims to accurately interpret sign language gestures in real time. We describe the methodology, which involves training a deep neural network to recognize actions associated with sign language gestures and optimizing the policy through deterministic policy gradients. Through extensive experimentation and evaluation, we demonstrate the efficacy of our approach in achieving high accuracy and robustness in sign language detection tasks. This research contributes to advancing accessibility and inclusivity for the deaf and hard of hearing community by providing a reliable method for interpreting sign language gestures in various contexts.

Fig. 1. Overview of the proposed method: sign language features are extracted from video data, and a Transformer model translates the extracted feature sequence into the intended target sentence. The Transformer is trained directly on the word error rate (WER) using reinforcement learning (RL).

Index Terms— sign language recognition, reinforcement learning, self-critic

1. INTRODUCTION

In the vibrant tapestry of human communication, sign language stands as a testament to the resilience and adaptability of expression. For the millions with hearing impairments, it serves not just as a mode of communication, but as a gateway to connection and understanding. However, bridging the gap between sign language and technology remains a formidable challenge, with existing systems often falling short in capturing its fluidity and nuances.

In this paper, we present an approach that intertwines action recognition and reinforcement learning in pursuit of robust and adaptive sign language detection. Our methodology is rooted in the belief that by harnessing the power of these dynamic fields, we can unlock new levels of comprehension and inclusivity for the deaf and hard of hearing community.

Central to our methodology is the concept of action recognition, which serves as the bedrock for deciphering the intricate choreography of sign language gestures. By treating sign language as a sequence of actions, our model transcends mere identification, delving into the temporal intricacies and spatial nuances that define fluent signing. This holistic approach enables our system not only to recognize individual signs but also to discern the subtle cues and context that imbue each gesture with meaning.

Complementing our action recognition framework is the integration of reinforcement learning, an adaptive learning paradigm inspired by the principles of human behavior. Through iterative interactions with its environment, our model refines its understanding of sign language, continuously adapting and evolving to capture the diverse array of signing styles and environmental conditions. Reinforcement learning imbues our system with the flexibility and resilience needed to thrive in dynamic communication contexts, ensuring robust performance across diverse scenarios.

Beyond the realm of technology, our contributions resonate with the broader imperative of inclusion and accessibility. By empowering individuals with hearing impairments to communicate effortlessly with technology, we endeavor to foster a world where barriers dissolve and connections flourish. In the ensuing sections, we detail our methodology, unravel its theoretical underpinnings, describe our experimental setup, present empirical findings, and discuss the implications and future directions of our research.
Our research is driven by three overarching objectives:

1. Robustness: Develop a sign language detection system capable of robustly interpreting a wide spectrum of sign gestures, transcending individual differences and environmental variations.

2. Real-time Performance: Achieve the speed and responsiveness needed for seamless integration into real-world applications, facilitating fluid communication experiences for users.

3. Adaptability: Cultivate a flexible framework capable of learning and adapting to the idiosyncrasies of different signing styles, speeds, and environmental factors, minimizing the need for manual intervention and customization.

2. RELATED WORK

In this section, we briefly review continuous sign language recognition (CSLR) methods and compactly introduce sequence generation tasks that are closely related to our work.

Fig. 2. Architecture of our sign language detection system, integrating action recognition and reinforcement learning. The figure illustrates how raw video inputs are processed, features are extracted, and actions are recognized; its depiction of the neural network layers and reinforcement learning loops highlights the system's adaptability and robustness in interpreting sign language gestures.

The quest for robust and effective sign language detection systems has spurred significant research efforts in recent years, leading to the emergence of diverse methodologies and approaches. Below we provide an overview of relevant literature, highlighting key contributions and insights that have shaped the landscape of sign language recognition.

1. Traditional Approaches: Early endeavors in sign language recognition predominantly relied on handcrafted feature extraction techniques coupled with pattern recognition algorithms. These approaches, while foundational, often struggled to capture the temporal dynamics and variability inherent in sign language gestures. Nonetheless, they laid the groundwork for subsequent advancements in the field.

2. Machine Learning Techniques: With the advent of machine learning, particularly deep learning, researchers began exploring more data-driven approaches to sign language recognition. Convolutional neural networks (CNNs), recurrent neural networks (RNNs), and their variants have been leveraged to extract spatiotemporal features from sign language videos, enabling more robust and accurate recognition performance.

3. Gesture-based Recognition: Many existing systems focus on recognizing individual gestures or isolated signs, treating sign language as a collection of discrete symbols. While effective for basic communication tasks, these approaches often fall short in capturing the fluidity and contextuality of sign language expressions, limiting their practical utility in real-world scenarios.

4. Sequence Modeling: Recent advancements in sequence modeling have facilitated the transition from gesture-based recognition to more comprehensive sign language understanding. Models capable of capturing sequential dependencies, such as long short-term memory (LSTM) networks and transformers, have shown promise in deciphering complex sign language phrases and sentences.

5. Multi-modal Fusion: Recognizing the multi-modal nature of sign language, researchers have explored the fusion of visual and linguistic cues to enhance recognition accuracy and robustness. Techniques such as multi-modal attention mechanisms and cross-modal learning enable systems to leverage complementary information from both visual and linguistic modalities, improving overall performance.

6. Transfer Learning and Adaptation: Recognizing the need for adaptable systems, researchers have explored techniques for transfer learning and domain adaptation in sign language recognition. By leveraging pre-trained models and fine-tuning on domain-specific data, these approaches enable systems to generalize better across diverse signing styles and contexts.

7. Interactive Learning Paradigms: Interactive learning paradigms, such as reinforcement learning, have gained traction in sign language recognition, offering a principled framework for model adaptation and refinement. By providing feedback signals based on system performance, reinforcement learning enables models to iteratively improve and adapt to evolving communication dynamics.

3. OUR METHODS

Our proposed methodology for sign language detection represents a fusion of action recognition and reinforcement learning, orchestrated to achieve robustness, real-time performance, and adaptability in sign language interpretation. In this section, we provide a detailed exposition of our approach, encompassing data preprocessing, model architecture, training procedure, and evaluation methodology.

3.1. Data Preprocessing:
We begin by curating a comprehensive dataset comprising annotated sign language videos, encompassing diverse signing styles, gestures, and environmental conditions. Preprocessing steps include video segmentation, hand region detection, and normalization to mitigate variations in lighting, background clutter, and hand orientation. Data augmentation techniques, including random cropping, rotation, and translation, are employed to enhance model generalization and resilience to variations. A more technical description of these steps follows:

Video Segmentation: Given a video sequence \( V \) consisting of frames \( \{F_1, F_2, \ldots, F_N\} \), we segment it into individual sign gestures \( \{S_1, S_2, \ldots, S_M\} \).
Hand Region Detection: Employing techniques such as background subtraction or hand detection algorithms, we isolate the hand regions within each frame \( F_i \).
Normalization: The hand region data is normalized to mitigate variations in lighting and hand orientation, typically using techniques such as mean subtraction and scaling.
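As an illustration of this pipeline, the following is a minimal Python sketch of the preprocessing stage using OpenCV. The frame size, the MOG2 background subtractor standing in for hand region detection, and the per-frame normalization are illustrative assumptions, not the exact choices used in our experiments; gesture-level segmentation into \( \{S_1, \ldots, S_M\} \) would be applied on top of the returned frames, e.g., using annotated boundaries.

```python
import cv2
import numpy as np

def preprocess_video(path, frame_size=(112, 112)):
    """Read a sign language video and return normalized hand-region frames.

    Sketch only: background subtraction stands in for hand region detection,
    and per-frame mean subtraction / scaling performs normalization.
    """
    cap = cv2.VideoCapture(path)
    bg_subtractor = cv2.createBackgroundSubtractorMOG2()  # assumed hand/motion detector
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = bg_subtractor.apply(frame)                  # isolate moving hand regions
        hand_region = cv2.bitwise_and(frame, frame, mask=mask)
        hand_region = cv2.resize(hand_region, frame_size).astype(np.float32)
        hand_region = (hand_region - hand_region.mean()) / (hand_region.std() + 1e-6)
        frames.append(hand_region)
    cap.release()
    return np.stack(frames) if frames else np.empty((0, *frame_size, 3))
```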
3.2. Model Architecture:
Our model architecture consists of two primary components: an action recognition module and a reinforcement learning agent. The action recognition module comprises a deep neural network, typically based on convolutional neural networks (CNNs) or spatiotemporal networks such as 3D convolutional neural networks (3D CNNs). The CNN extracts spatial and temporal features from input sign language videos, capturing the dynamics and spatial configurations of sign gestures. The reinforcement learning agent operates in tandem with the action recognition module, leveraging feedback signals to refine model predictions and adapt to varying contexts.

Action Recognition Module: We employ a 3D Convolutional Neural Network (3D CNN) architecture to extract spatial-temporal features from sign language videos. The model is represented by:

\[ f_{\text{CNN}}(V) = \text{CNN}(V; \theta_{\text{CNN}}) \]

Reinforcement Learning Agent: The agent operates based on a policy \( \pi(a \mid s) \), where \( a \) denotes the action (sign gesture) and \( s \) represents the state (video frame). The policy is parameterized by \( \theta_{\text{RL}} \):

\[ \pi(a \mid s; \theta_{\text{RL}}) \]
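To make the two components concrete, here is a minimal PyTorch sketch of a 3D CNN feature extractor paired with a policy head. The layer sizes, number of gesture classes, and module names are illustrative assumptions rather than the exact configuration used in our system.

```python
import torch
import torch.nn as nn

class ActionRecognitionModule(nn.Module):
    """3D CNN mapping a video clip (B, C, T, H, W) to a feature vector f_CNN(V)."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),          # global spatio-temporal pooling
        )
        self.fc = nn.Linear(64, feature_dim)

    def forward(self, video):
        x = self.backbone(video).flatten(1)
        return self.fc(x)                      # f_CNN(V; theta_CNN)

class PolicyHead(nn.Module):
    """RL agent's policy pi(a | s; theta_RL) over gesture classes."""
    def __init__(self, feature_dim=256, num_gestures=50):
        super().__init__()
        self.logits = nn.Linear(feature_dim, num_gestures)

    def forward(self, state_features):
        return torch.softmax(self.logits(state_features), dim=-1)
```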

3.3. Training Procedure:
The training process unfolds in two stages: pre-training and reinforcement learning. During the pre-training phase, the action recognition module is trained using supervised learning on labeled data to recognize individual sign gestures. The reinforcement learning agent is then integrated with the pre-trained action recognition module, forming a hybrid model capable of learning and adapting through interaction with the environment. The reinforcement learning agent explores the action space, selecting sign language interpretations and receiving feedback signals based on the correctness of predictions. Through iterative exploration and exploitation, the agent refines its policy, optimizing sign language detection performance over time.

Pre-training Phase: The CNN is trained using supervised learning with cross-entropy loss:

\[ L_{\text{supervised}} = -\sum_{i=1}^{N} y_i \log\bigl(f_{\text{CNN}}(V_i)\bigr) \]

Reinforcement Learning: The RL agent aims to maximize the expected cumulative reward \( J(\theta_{\text{RL}}) \) over episodes:

\[ J(\theta_{\text{RL}}) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right] \]
where \( r_t \) is the reward received at time \( t \) and \( \gamma \) is the discount factor. The agent updates its policy using techniques such as policy gradient methods or Q-learning.
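The two training stages can be sketched as follows. This is a simplified illustration assuming a REINFORCE-style policy gradient update and a binary reward (1 for a correct interpretation, 0 otherwise); both are assumptions for the example rather than the exact update rule and reward used in our system.

```python
import torch
import torch.nn.functional as F

def pretrain_step(cnn, classifier, optimizer, videos, labels):
    """Stage 1: supervised pre-training with the cross-entropy loss L_supervised."""
    logits = classifier(cnn(videos))
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def reinforce_step(cnn, policy, optimizer, videos, labels):
    """Stage 2: REINFORCE-style update that increases the expected reward J(theta_RL)."""
    probs = policy(cnn(videos))                  # pi(a | s; theta_RL)
    dist = torch.distributions.Categorical(probs)
    actions = dist.sample()                      # sampled sign interpretations
    rewards = (actions == labels).float()        # assumed reward: 1 if correct, else 0
    # Single-step episodes, so the discounted return reduces to the immediate reward.
    loss = -(dist.log_prob(actions) * rewards).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```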

3.4. Evaluation Methodology:
We evaluate the performance of our sign language detection system using standard metrics such as accuracy, precision, recall, and F1 score. Evaluation datasets comprise annotated sign language videos collected from diverse sources, ensuring coverage of different signing styles, gestures, and environmental conditions. We conduct extensive experiments to assess the robustness, real-time performance, and adaptability of our system across varied scenarios and user demographics.

Qualitative analysis, including visual inspection of model predictions and user feedback, supplements quantitative metrics, providing insights into system behavior and performance nuances. By integrating action recognition and reinforcement learning, our methodology offers a principled framework for sign language detection that transcends the limitations of traditional approaches.

Precision, Recall, F1 Score: These metrics evaluate the performance of the system in terms of true positives, false positives, and false negatives, providing insights into the model's effectiveness in recognizing sign gestures.

\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]
\[ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]
\[ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
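A minimal sketch of how these metrics can be computed with scikit-learn follows; the label values and the macro averaging across gesture classes are assumptions for illustration, not details from our experiments.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical ground-truth and predicted gesture labels.
y_true = ["hello", "thanks", "one", "hello", "red"]
y_pred = ["hello", "thanks", "two", "hello", "red"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Accuracy: {accuracy:.2%}, Precision: {precision:.2f}, "
      f"Recall: {recall:.2f}, F1: {f1:.2f}")
```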

By integrating these components and methodologies, our approach achieves robust sign language detection, leveraging the power of action recognition and reinforcement learning to interpret sign language gestures accurately and adaptively across diverse contexts and user demographics.

Accuracy calculation: Accuracy is a fundamental metric for evaluating a classification model, including a sign language detection system. It measures the proportion of correctly recognized sign gestures out of the total number of gestures evaluated:

\[ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \times 100\% \]

Accuracy is computed as follows:

1. Count the number of correct predictions: Compare the predicted labels generated by the sign language detection system with the ground-truth labels from the dataset. For each instance, if the predicted label matches the actual label, it is counted as a correct prediction.

2. Count the total number of predictions: This is the total number of instances for which the model has made predictions.

3. Apply the formula: With the number of correct predictions and the total number of predictions, compute the accuracy percentage using the formula above.

4. Example calculation: Suppose the sign language detection system has made predictions for 200 instances, of which 180 are correct. Then

\[ \text{Accuracy} = \frac{180}{200} \times 100\% = 90\% \]

so the system achieved an accuracy of 90% on the dataset.

Accuracy is a crucial metric, but it is essential to also consider precision, recall, and F1 score to gain a more comprehensive understanding of model performance, especially in scenarios with imbalanced classes or varying costs associated with false positives and false negatives.

An accuracy table provides a comprehensive overview of the performance of a sign language detection system across the different classes or categories of sign gestures. It can be structured as follows:

Explanation of Columns:

Gesture Category: Specifies the type or category of sign gestures being evaluated.
Total Instances: The total number of instances or occurrences of each gesture category in the dataset.
Correctly Recognized: The number of instances correctly recognized by the sign language detection system.
Incorrectly Recognized: The number of instances incorrectly recognized by the system.
Accuracy (%): The accuracy of recognition, calculated as the percentage of correctly recognized instances out of the total instances for each gesture category.

The accuracy table provides a clear and concise summary of the system's performance across different categories of sign gestures, allowing for easy identification of areas that may require improvement.
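As a sketch of how such a table can be produced from raw predictions, the snippet below tallies per-category totals, correct counts, and accuracy; the category names and labels are hypothetical placeholders, not the actual experimental data.

```python
from collections import defaultdict

def per_category_accuracy(y_true, y_pred, categories):
    """Build the rows of an accuracy table: per-category totals, correct/incorrect counts, accuracy (%)."""
    counts = defaultdict(lambda: {"total": 0, "correct": 0})
    for label, pred, category in zip(y_true, y_pred, categories):
        counts[category]["total"] += 1
        counts[category]["correct"] += int(label == pred)
    rows = []
    for category, c in counts.items():
        incorrect = c["total"] - c["correct"]
        accuracy = 100.0 * c["correct"] / c["total"]
        rows.append((category, c["total"], c["correct"], incorrect, round(accuracy, 2)))
    return rows

# Hypothetical example with one instance per category.
rows = per_category_accuracy(
    y_true=["hello", "one", "red"],
    y_pred=["hello", "two", "red"],
    categories=["Greetings", "Numbers", "Colors"],
)
for row in rows:
    print(row)
```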

4. EXPERIMENT

Dataset: We collect a dataset comprising videos of sign language gestures across various categories, including greetings, numbers, alphabet, colors, and actions.
Methodology: We implement our proposed methodology, combining action recognition with reinforcement learning, to develop the sign language detection system.
Evaluation Metrics: We evaluate the system's performance using accuracy, precision, recall, and F1 score.

Precision, Recall, and F1 Score:

Analysis:

Overall Performance: The sign language detection system demonstrates strong overall performance, achieving an accuracy of 94.00%. This indicates that the system effectively recognizes sign language gestures across diverse categories.

Category-wise Analysis:

Greetings and Alphabet: These categories exhibit the highest accuracy and balanced precision and recall scores, suggesting that the system performs exceptionally well in recognizing basic gestures and letters.
Colors: While the accuracy remains high, the precision and recall scores are slightly lower compared to other categories. This may indicate challenges in distinguishing subtle color-related gestures.
Numbers and Actions: These categories demonstrate solid performance but exhibit slightly lower precision and recall scores compared to greetings and alphabet. Further optimization may be needed to enhance recognition accuracy for complex gestures and actions.

Room for Improvement: Despite the overall strong performance, there is room for improvement, especially in categories with lower precision and recall scores. Fine-tuning the model parameters, increasing the diversity of training data, and implementing advanced techniques for handling complex gestures could help improve performance further.

Real-world Applicability: The results underscore the potential of the sign language detection system in real-world applications, such as assistive technologies for individuals with hearing impairments and interactive communication interfaces.

5. CONCLUSION

In conclusion, our journey towards developing a robust and adaptive sign language detection system has yielded promising results, underscoring the transformative potential of integrating action recognition with reinforcement learning. Through a series of experiments and analyses, we have demonstrated the efficacy and versatility of our methodology in recognizing sign language gestures across diverse categories and contexts.

Our findings reveal several key insights:

1. Effective Recognition: The sign language detection system showcases commendable accuracy and performance, effectively deciphering a wide range of sign language gestures, including greetings, numbers, alphabet, colors, and actions.
2. Adaptability and Robustness: By leveraging reinforcement learning, our system exhibits adaptability and resilience in diverse communication environments, capable of learning and evolving to accommodate variations in signing styles, speeds, and environmental factors.
3. Room for Improvement: While our methodology demonstrates strong overall performance, there are areas where further optimization and refinement are warranted, particularly in categories with lower precision and recall scores.
4. Real-world Implications: The implications of our research extend beyond academia, holding promise for real-world applications in assistive technologies, interactive communication interfaces, and inclusive communication environments.

Looking ahead, our focus shifts towards continuous refinement and enhancement of the sign language detection system. Fine-tuning model parameters, expanding the diversity of training data, and exploring advanced techniques for handling complex gestures will be instrumental in elevating the system's accuracy, robustness, and accessibility.
