
POKHARA UNIVERSITY

UNIVERSAL ENGINEERING AND SCIENCE COLLEGE

“SIGN LANGUAGE DETECTION USING DEEP LEARNING”

BY
Aadarsh B.k.
Sandhya Thapa
Gitanjali Shah
Madan Shahi

A FINAL TERM PROJECT REPORT


SUBMITTED TO THE DEPARTMENT OF COMPUTER
ENGINEERING IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF BACHELOR IN
COMPUTER ENGINEERING

DEPARTMENT OF COMPUTER ENGINEERING


LALITPUR, NEPAL

JUNE, 2023
ACKNOWLEDGEMENT

First and foremost, we extend our deepest gratitude to our project supervisor
Er. Santosh Bhattarai for his invaluable guidance, expertise, and constant
motivation. His insightful feedback, constructive criticism, and patience have
played a pivotal role in shaping the direction and quality of this project.

We would also like to acknowledge the faculty members of the Department of Computer
Engineering for their dedication to teaching and their commitment to nurturing our academic
development. Their wealth of knowledge, passion for their subjects, and willingness
to assist and inspire students have been instrumental in our project’s success.

We would also like to thank our classmates and fellow students who provided
valuable insights, engaging discussions, and a collaborative environment that
fostered learning and growth. Their willingness to share ideas, exchange feedback,
and offer assistance has been truly invaluable.

Lastly, we would like to thank all the participants who willingly volunteered their
time and expertise to contribute to this project. Their cooperation and willingness
to share their experiences and insights have greatly enriched the findings and
outcomes of this study.

I
ABSTRACT

The project aims to build a machine learning model that can classify the various
hand gestures used for fingerspelling in sign language. In this user-independent
model, classification algorithms are trained on one set of image data and tested
on a completely different set. Depth images are used for the image dataset, which
gave better results than some of the previous literature [4], owing to the reduced
pre-processing time. Various machine learning algorithms are applied to the dataset,
including a Convolutional Neural Network (CNN). An attempt was made to increase the
accuracy of the CNN model by pre-training it on the ImageNet dataset; however, only
a small dataset was used for pre-training, which gave an accuracy of 15 percent
during training.

Keywords: Deep learning, Convolutional Neural Network, Sign language, Fingerspelling,
Gesture recognition

II
TABLE OF CONTENTS

ACKNOWLEDGEMENT I
ABSTRACT II
TABLE OF CONTENTS III
LIST OF FIGURES IV
LIST OF ABBREVIATIONS V
1 OVERVIEW 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5 Project Scope and Applications . . . . . . . . . . . . . . . . . . . . 3
2 LITERATURE REVIEW 4
3 REQUIREMENT ANALYSIS 7
3.1 Hardware Requirements . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Software Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 7
4 METHODOLOGY 9
4.1 System Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 9
5 RESULTS AND ANALYSIS 11
5.1 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
5.2 CNN: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
5.3 Creating the Model: . . . . . . . . . . . . . . . . . . . . . . . . . . 12
5.4 ANN: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
5.5 Training vs Validation Accuracy: . . . . . . . . . . . . . . . . . . . 13
6 FUTURE ENHANCEMENTS 15
6.1 Future Enhancements . . . . . . . . . . . . . . . . . . . . . . . . . . 15
7 CONCLUSION 16
References 17

III
LIST OF FIGURES

4.1 Block diagram of the system . . . . . . . . . . . . . . . . . . . . . . 9

5.1 Dataset for ASL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11


5.2 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . 12
5.3 Artificial Neural Network . . . . . . . . . . . . . . . . . . . . . . . . 13
5.4 Training And validation . . . . . . . . . . . . . . . . . . . . . . . . 14

IV
LIST OF ABBREVIATIONS

AI Artificial Intelligence
ANN Artificial Neural Network
ASL American Sign Language
CNN Convolutional Neural Network
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
GPU Graphics Processing Unit
HMM Hidden Markov Model
ML Machine Learning
PIL Python Imaging Library
RAM Random Access Memory
ReLU Rectified Linear Unit
RNN Recurrent Neural Network
TPU Tensor Processing Unit
VM Virtual Machine

V
CHAPTER 1

OVERVIEW

1.1 Introduction

American Sign Language is a visual communication system that utilizes gestures,
hand movements, and facial expressions to convey meaning and information. It is
primarily used by individuals with hearing impairments as a means of communication.
However, interpreting sign language can be a challenging task for those unfamiliar
with the language.

In recent years, there has been growing interest in developing automated systems
that can detect and interpret sign language gestures. These systems aim to bridge
the communication gap between hearing-impaired individuals and the general
population, enabling more inclusive interactions and accessibility.

Deep learning, a subset of machine learning, has emerged as a powerful approach


for solving complex pattern recognition tasks.

In the context of sign language detection, deep learning techniques offer promising
potential. By training neural networks on large datasets of sign language gestures,
these models can learn to recognize and classify different hand and body move-
ments accurately. The availability of annotated sign language datasets, such as
videos or images of signers performing various gestures, has further facilitated the
development of deep learning-based sign language detection systems.

The significance of this project lies in its potential to enhance communication


accessibility for individuals with hearing impairments. By developing an accurate
and efficient sign language detection system, we aim to facilitate seamless commu-
nication between sign language users and non-signers, promoting inclusivity and
breaking down barriers.

1
1.2 Motivation

The motivation behind our project proposal lies in the transformative impact
that sign language detection can have on the lives of individuals with hearing
impairments. By developing an advanced and efficient system that can accurately
interpret sign language gestures, we aim to empower the deaf and hard-of-hearing
community by providing them with a means to communicate effortlessly and
naturally with the broader society. This technology holds immense potential
in numerous domains, such as education, healthcare, public services, and social
interactions.

1.3 Problem definition

The problem of sign language detection involves developing a system or algorithm


that can accurately interpret and understand sign language gestures performed by
individuals. Sign language is a visual means of communication used primarily by
deaf or hard-of-hearing individuals to convey their thoughts, emotions, and ideas.
However, the interpretation of sign language poses several challenges, including
the complex nature of hand movements, variations in gestures across different sign
languages, and the need for real-time and accurate recognition.

1.4 Objective

Objectives for Sign Language Detection:

i. Accurate Gesture Recognition: Develop an algorithm or system that can


accurately recognize and classify a wide range of sign language gestures, considering
the complexity of hand movements, facial expressions, and body postures involved.

ii. Real-Time Processing: The objective is to minimize processing delays and


ensure efficient utilization of computational resources, making the system suitable
for real-time applications on various devices.

iii. Adaptability to Different Sign Languages: Create a system that can adapt
and recognize different sign languages, considering the variations and differences in

2
signs across regions and cultures.

iv. User-Friendly Interface: Develop a user-friendly interface that allows indi-


viduals to use sign language naturally without the need for specialized equipment
or markers. The objective is to design an intuitive and accessible system that
can be easily used by individuals with varying levels of sign language proficiency,
promoting independence and inclusivity.

1.5 Project Scope and Applications

This project aims to develop a robust sign language detection system capable of
accurately recognizing and interpreting a wide range of sign language gestures. The
scope includes designing efficient algorithms for gesture recognition, optimizing
real-time processing, accommodating variations across different sign languages, and
creating a user-friendly interface. The system’s primary goal is to enhance commu-
nication accessibility and inclusivity for the deaf and hard-of-hearing community
in various domains and applications.

3
CHAPTER 2

LITERATURE REVIEW

American Sign Language is a visual language used by deaf communities worldwide
as a means of communication. The literature on sign language encompasses various
aspects, including its linguistic structure, cultural significance, and the develop-
ment of technology for sign language recognition and interpretation. Studies have
highlighted the linguistic complexity and richness of sign languages, demonstrating
their grammatical structures, phonology, and syntax. Research has shown that sign
languages have the same expressive power and complexity as spoken languages,
challenging earlier misconceptions that sign languages are mere gestures or visual
representations of spoken languages. Additionally, the literature emphasizes the
cultural importance of sign language, as it forms a vital part of deaf identity and
community cohesion.[1]
Advancements in technology have fueled the development of automated sign-
language recognition systems. Researchers have explored diverse approaches,
including computer vision, machine learning, and sensor-based techniques, to
recognize and interpret sign language gestures. These systems aim to bridge the
communication gap between sign language users and non-sign language users,
making communication more accessible and inclusive. The literature highlights
challenges in sign language recognition, such as dealing with variations across
different sign languages, handling dynamic and continuous gestures, and addressing
environmental factors that may affect recognition accuracy. Researchers have
proposed and evaluated various methodologies, including rule-based approaches,
hidden Markov models (HMMs), and deep learning techniques like convolutional
neural networks (CNNs) and recurrent neural networks (RNNs), to improve the
performance and robustness of sign language recognition systems.[2]
Researchers have delved into the sociocultural aspects of sign language, highlighting
its role in identity formation, cultural heritage, and community cohesion among
deaf individuals. Sign language serves as a medium for self-expression, storytelling,

4
and artistic forms such as sign poetry and sign dance. The literature discusses the
importance of recognizing and valuing sign languages as unique and independent
languages, promoting deaf rights and cultural inclusivity. Efforts to document and
preserve sign languages have resulted in sign language dictionaries, corpora, and
linguistic databases, providing valuable resources for linguistic research, language
documentation, and language revitalization initiatives.[3]
In recent years, the literature has witnessed increased attention on multimodal
communication, involving the integration of sign language with other modalities
such as speech, text, and haptic feedback. This interdisciplinary field explores
methods for enabling effective communication between deaf and hearing individuals
through the use of sign-language interpreters, speech-to-sign translation systems,
and sign-language avatars. The literature highlights challenges in achieving seam-
less multimodal communication, such as synchronization, context awareness, and
cultural nuances. Researchers have investigated the design and evaluation of
multimodal interfaces and communication technologies to support inclusive commu-
nication and equal access to information and services for deaf individuals.[4] The
goal of this project was to build a neural network able to classify which letter of
the American Sign Language (ASL) alphabet is being signed, given an image of a
signing hand. This project is a first step towards building a possible sign language
translator, which can take communications in sign language and translate them
into written and oral language. Such a translator would greatly lower the barrier
for many deaf and mute individuals to be able to better communicate with others in
day-to-day interactions. This goal is further motivated by the isolation felt
within the deaf community. Loneliness and depression exist at higher rates among
the deaf population, especially when they are immersed in a hearing world.[5] Sign
language translation is a promising application for vision-based gesture recognition
methods, in which highly structured combinations of static and dynamic gestures
correlate to a given lexicon. Machine learning techniques can be used to create
interactive educational tools or to help a hearing-impaired person communicate
more effectively with someone who does not know sign language. In this paper,
the development of an online sign language recognizer is described. The scope
of the project is limited to static letters in the American Sign Language (ASL)

5
alphabet.[6] In this technologically advanced world, we must utilize the power of
artificial intelligence to solve some challenging real-life problems. One of the major
issues that the world is still trying to cope with is establishing an efficient way
for communication between people. Between 6 and 8 million people in the United
States have some form of language impairment.[7]

6
CHAPTER 3

REQUIREMENT ANALYSIS

3.1 Hardware Requirements

i. Camera: A high-resolution camera capable of capturing video footage is


essential. The camera should have good low-light performance and a suitable frame
rate to capture fast hand movements accurately.

ii. Google Colab’s CPU: Google Colab provides a virtual machine (VM) environment
for running Python code and carrying out machine learning operations. The VMs
include CPUs, and the runtime type can be selected to best meet the project’s
requirements.

iii. Memory: Sufficient RAM is required to store and manipulate video frames,
as well as perform complex computations. At least 8 GB of RAM is recommended,
but more may be necessary depending on the specific requirements of your system.

iv. Tensor Processing Unit (TPU): A TPU (Tensor Processing Unit) is a


specialized hardware accelerator developed by Google for accelerating machine
learning workloads. TPUs are designed to efficiently execute tensor operations,
which are fundamental to deep learning computations.

v. Graphics Processing Unit (GPU): While not strictly necessary, a dedicated


GPU can significantly accelerate the processing of video data and machine learning
algorithms. GPUs with CUDA support, such as those from NVIDIA, are commonly
used for deep-learning tasks.
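
As a quick way to check which of these accelerators a given Colab session actually
exposes, the following minimal sketch can be run in a notebook cell. It assumes
TensorFlow (which the rest of this project already uses) is available in the runtime.

# Minimal sketch: list the accelerators visible to TensorFlow in the current runtime.
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
print("GPUs visible to TensorFlow:", gpus)

try:
    # Succeeds only when the Colab runtime type is set to TPU.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print("TPU detected:", tpu.master())
except ValueError:
    print("No TPU runtime detected; falling back to CPU/GPU.")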

3.2 Software Requirements

i. Google Colab: Google Colab is a cloud-based platform that provides a


Python programming environment for data analysis, machine learning, and other
computational tasks. It is designed to be accessible through a web browser without

7
requiring any software installation.

ii. Python Libraries:

a. TensorFlow: TensorFlow is an open-source machine learning framework created by
Google. It offers a whole ecosystem of resources, tools, and libraries for creating
and implementing machine learning models.

b. Pandas: pandas is a popular open-source library in Python for data manipulation,


analysis, and exploration. It provides powerful data structures and data analysis
tools.

c. Numpy: Numpy is a fundamental library in Python for numerical computations


and working with arrays. It provides powerful mathematical functions and efficient
data structures for handling multi-dimensional arrays and matrices.

d. Matplotlib: Matplotlib is a popular Python data visualization library. It offers
a large selection of tools for making different kinds of plots, charts, and graphs.

e. PIL: PIL (Python Imaging Library) is a popular library in Python for image
processing and manipulation. It provides a wide range of functions and methods
for opening, manipulating, and saving various image file formats.

f. Pickle: Pickle is a built-in module in Python that allows you to serialize Python
objects into a binary format and deserialize them back into Python objects. It’s
commonly used for saving and loading complex data structures, such as lists,
dictionaries, and custom objects.
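
To illustrate how these libraries could fit together for this project, the sketch
below loads a folder of gesture images with PIL, converts them to NumPy arrays, and
serializes the result with Pickle. The folder layout (one sub-folder per letter),
the folder name asl_dataset, and the image size are illustrative assumptions, not
values taken from the report.

# Sketch: preprocess gesture images with PIL/NumPy and cache them with pickle.
import os
import pickle

import numpy as np
from PIL import Image

DATA_DIR = "asl_dataset"   # hypothetical folder: one sub-folder per letter
IMG_SIZE = (64, 64)        # assumed working resolution

samples, labels = [], []
for label in sorted(os.listdir(DATA_DIR)):
    class_dir = os.path.join(DATA_DIR, label)
    if not os.path.isdir(class_dir):
        continue
    for name in os.listdir(class_dir):
        img = Image.open(os.path.join(class_dir, name)).convert("RGB").resize(IMG_SIZE)
        samples.append(np.asarray(img, dtype=np.float32) / 255.0)  # scale pixels to [0, 1]
        labels.append(label)

# Serialize the preprocessed arrays so later runs can skip image decoding.
with open("asl_preprocessed.pkl", "wb") as f:
    pickle.dump({"images": np.stack(samples), "labels": labels}, f)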

8
CHAPTER 4

METHODOLOGY

4.1 System Block Diagram

Figure 4.1: Block diagram of the system

The following elements are commonly included in the block diagram for American Sign
Language detection (a code-level sketch of this pipeline is given after the list):

1. User: Represents the person using or signing in front of the sign language
recognition software.

2. Interpreter: This icon represents the system-based sign language interpreter


who interacts with the user.

3. Hand Detection: The goal of the hand detection procedure is to locate and
isolate the hand region within the video frames.

4. Gesture Tracking: To capture the temporal dynamics of the sign language


gestures, the tracked hand region is examined across time.

5. Feature Extraction: In this procedure, pertinent features, such as the form,


motion, and locations of the fingers, are extracted from the monitored hand region.

6. Gesture Classification: A machine learning or deep learning model is used to


classify the sign language motions using the extracted features.

9
7. Output Mapping: This step maps the identified gestures to their sign language
equivalents or corresponding meanings.

8. Text/Speech Output: For the user or interpreter, the final identified gestures
are output as text or speech.
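
To make the flow above concrete, the sketch below mirrors the block diagram in code.
OpenCV (cv2) and the placeholder helper functions are assumptions for illustration;
the report does not specify which hand detector, tracker, or classifier implementation
was used.

# Schematic pipeline corresponding to the block diagram (placeholders, not the real system).
import cv2
import numpy as np

def detect_hand(frame):
    """Placeholder hand detection: return a fixed central crop of the frame."""
    h, w = frame.shape[:2]
    return frame[h // 4: 3 * h // 4, w // 4: 3 * w // 4]

def extract_features(hand_region):
    """Placeholder feature extraction: resized, normalized pixels."""
    return cv2.resize(hand_region, (64, 64)).astype(np.float32) / 255.0

def classify_gesture(features):
    """Placeholder classifier: the trained CNN/ANN would be invoked here."""
    return "A"

cap = cv2.VideoCapture(0)            # 1-2. user signs in front of the camera
for _ in range(100):                 # limit the sketch to a fixed number of frames
    ok, frame = cap.read()
    if not ok:
        break
    hand = detect_hand(frame)        # 3-4. hand detection and tracking (simplified)
    feats = extract_features(hand)   # 5. feature extraction
    letter = classify_gesture(feats) # 6-7. gesture classification and output mapping
    print("Recognized:", letter)     # 8. text output
cap.release()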

10
CHAPTER 5

RESULTS AND ANALYSIS

5.1 Data Set

We created our own dataset for training and testing. It consists of images and
videos depicting the alphabet of American Sign Language.

Figure 5.1: Dataset for ASL
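
The report does not show how this dataset is read in; one plausible way, assuming
the images are arranged in one folder per class, is the Keras utility below. The
folder name asl_dataset, the image size, and the 80/20 train/validation split are
illustrative assumptions.

# Sketch: build training/validation splits from a directory of labelled images.
import tensorflow as tf

img_height, img_width = 180, 180   # assumed working resolution

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "asl_dataset",
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=32,
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "asl_dataset",
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=32,
)
print("Classes:", train_ds.class_names)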

5.2 CNN:

A Convolutional Neural Network is a type of deep learning model commonly used for
image classification, object detection, and other computer vision tasks. CNNs are
designed to automatically learn spatial hierarchies and patterns from image data.

11
Figure 5.2: Convolutional Neural Network

5.3 Creating the Model:

The model consists of three convolution blocks, each followed by a max-pooling
layer. On top of these is a fully connected layer with 128 units activated by a
ReLU activation function.

num_classes = 2

model = Sequential([
    layers.experimental.preprocessing.Rescaling(1./255,
        input_shape=(img_height, img_width, 3)),
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
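
The compile and training steps are not shown in the report. A minimal sketch is
given below; it assumes binary cross-entropy (matching the single sigmoid output
above) and the train_ds/val_ds datasets sketched in Section 5.1, and the optimizer
and epoch count are illustrative choices rather than the values actually used.

# Sketch: compile and train the CNN defined above.
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',   # matches the single sigmoid output unit
    metrics=['accuracy'],
)

history = model.fit(
    train_ds,                     # datasets as sketched in Section 5.1
    validation_data=val_ds,
    epochs=10,
)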

5.4 ANN:

An Artificial Neural Network is a type of machine learning model inspired by the
biological neural networks in the human brain. It consists of interconnected nodes,
called artificial neurons or units, organized in layers.
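
The exact ANN architecture used is not listed in the report; the snippet below is a
minimal fully connected Keras baseline on flattened pixels, with layer widths chosen
purely for illustration.

# Sketch: a simple fully connected (ANN) baseline on flattened images.
import tensorflow as tf
from tensorflow.keras import Sequential, layers

img_height, img_width = 180, 180   # assumed resolution, as in Section 5.1

ann = Sequential([
    layers.Flatten(input_shape=(img_height, img_width, 3)),  # flatten raw pixels
    layers.Dense(256, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid'),   # same binary output as the CNN
])
ann.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])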

12
Figure 5.3: Artificial Neural Network

5.5 Training vs Validation Accuracy:

We tested our preprocessing on a test set made up of photos from the original
dataset and our own collected images to see whether it really did produce a more
robust model. On this test set, the model trained on the retouched (filtered) images
performs noticeably better; because it overfits less, it outperforms the model
trained on the original photos. We also reviewed the confusion matrix for the
filtered model on the Kaggle test set.
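
Curves like those in Figure 5.4 can be produced directly from the Keras History
object. A minimal sketch follows, assuming the model was trained as in the snippet
of Section 5.3 so that a history variable is available.

# Sketch: plot training vs. validation accuracy from the Keras History object.
import matplotlib.pyplot as plt

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, label='Training accuracy')
plt.plot(epochs, val_acc, label='Validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training vs Validation Accuracy')
plt.legend()
plt.show()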

13

Figure 5.4: Training And validation

14
CHAPTER 6

FUTURE ENHANCEMENTS

6.1 Future Enhancements

Because of the size of our initial dataset, using it requires a server with plenty
of RAM and disk space. Potential remedies are to split only the file names into
training, validation, and test sets and to load the photos dynamically in the
dataset class. With such a loading strategy we could train the model on more samples
in the dataset; a sketch of this idea is given below.
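
One way to realize this loading strategy, assuming Keras is used for training, is a
tf.keras.utils.Sequence that reads images from disk only when a batch is requested.
The file-list construction, image size, and batch size below are illustrative
assumptions.

# Sketch: lazy (dynamic) image loading so the full dataset never has to sit in RAM.
import math

import numpy as np
import tensorflow as tf
from PIL import Image

class LazyImageSequence(tf.keras.utils.Sequence):
    def __init__(self, file_paths, labels, batch_size=32, img_size=(180, 180)):
        self.file_paths = file_paths
        self.labels = labels
        self.batch_size = batch_size
        self.img_size = img_size

    def __len__(self):
        return math.ceil(len(self.file_paths) / self.batch_size)

    def __getitem__(self, idx):
        start = idx * self.batch_size
        paths = self.file_paths[start:start + self.batch_size]
        labels = self.labels[start:start + self.batch_size]
        images = [
            np.asarray(Image.open(p).convert("RGB").resize(self.img_size),
                       dtype=np.float32) / 255.0
            for p in paths
        ]
        return np.stack(images), np.asarray(labels)

# Train, validation, and test splits are then just lists of file names, e.g.
# model.fit(LazyImageSequence(train_files, train_labels), ...)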

15
CHAPTER 7

CONCLUSION

In conclusion, the field of American Sign Language recognition has shown promise,
utilizing computer vision and machine learning methods to promote effective
communication between sign language users and non-sign language users. These systems
are designed to bridge the communication gap and support inclusivity for the deaf
and hard-of-hearing community by recognizing and translating sign language gestures
in real time.

Convolutional neural networks (CNNs) and artificial neural networks (ANNs), two
types of deep learning algorithms, have demonstrated significant promise for
American Sign Language recognition. They can detect both spatial and temporal
connections in sign language motions by automatically learning discriminative
features from raw data.

16
References

[1] A. Moryossef, I. Tsochantaridis, R. Aharoni, S. Ebling, and S. Narayanan,


“Real-time sign language detection using human pose estimation,” in Computer
Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings,
Part II 16. Springer, 2020, pp. 237–248.

[2] F. Zhang and K. Kim, “Id-based blind signature and ring signature from
pairings,” in Advances in Cryptology — ASIACRYPT 2002, Y. Zheng, Ed.
Berlin, Heidelberg: Springer Berlin Heidelberg, 2002, pp. 533–547.

[3] S. K. Saksamudre, P. Shrishrimal, and R. Deshmukh, “A review on different


approaches for speech recognition system,” International Journal of Computer
Applications, vol. 115, no. 22, 2015.

[4] A. O. M. Salih, “Audio noise reduction using low pass filters,” Open Access
Library Journal, vol. 4, no. 11, pp. 1–7, 2017.

[5] D. Liu, P. Smaragdis, and M. Kim, “Experiments on deep learning for speech
denoising,” in Fifteenth Annual Conference of the International Speech Com-
munication Association, 2014.

[6] J. Zhang, J.-g. Yao, and X. Wan, “Towards constructing sports news from live
text commentary,” in Proceedings of the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1361–1371.

[7] C. van der Lee, E. Krahmer, and S. Wubben, “Pass: A dutch data-to-text
system for soccer, targeted towards specific audiences,” in Proceedings of
the 10th International Conference on Natural Language Generation, 2017, pp.
95–104.

17
