Final Minor Report


SIGN LANGUAGE DETECTION

MINOR PROJECT SYNOPSIS

Submitted in partial fulfilment of the


degree of
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
By

Ayush Dixit Harshit Khatter Manish Sharma

01611503120 03311503120 04411503120

Guided by
Dr. Prakhar Priyadarshi
Head of Department (IT)

Department of Information Technology


BHARATI VIDYAPEETH’S COLLEGE OF ENGINEERING
PASCHIM VIHAR, NEW DELHI
October 2023
CANDIDATES’ DECLARATION

It is hereby certified that the work which is being presented in the B. Tech Minor project Report
entitled "SIGNO-FY" in partial fulfilment of the requirements for the award of the degree of
Bachelor of Technology and submitted in the Department of Information Technology of BHARATI
VIDYAPEETH’S COLLEGE OF ENGINEERING, New Delhi (Affiliated to Guru Gobind Singh
Indraprastha University, Delhi) is an authentic record of our own work carried out under the
guidance of Dr. Prakhar Priyadarshi, HOD-IT.

The matter presented in the B. Tech Minor Project Report has not been submitted by us for the
award of any other degree of this or any other Institute.

Ayush Dixit Harshit Khatter Manish Sharma

01611503120 03311503120 04411503120

This is to certify that the above statement made by the candidates is correct to the best of my knowledge. They
are permitted to appear in the External Minor Project Examination.

Dr. Prakhar Priyadarshi


Head of Department, IT
TABLE OF CONTENTS

CHAPTER 1: 1.1 ABSTRACT


1.2 INTRODUCTION
1.2.1 DATA SET GENERATION
1.2.2 HAND SEGMENTATION
1.2.3 FEATURE EXTRACTION PROCESS
1.3 AIMS AND OBJECTIVES
CHAPTER 2: 2.1 TECHNOLOGY USED
2.1.1 ARTIFICIAL INTELLIGENCE
2.1.2 MACHINE LEARNING
2.1.3 DEEP LEARNING
2.1.4 ML vs DL
2.1.5 TENSORFLOW
2.1.6 MEDIAPIPE
2.1.7 CNN

CHAPTER 3: 3.1 METHODOLOGY


3.1.1 GESTURE CLASSIFICATION
3.1.2 FINGER SPELLING SENTENCE FORMATION
3.2 LITERATURE SURVEY

CHAPTER 4: 4.1 CONCLUSION


4.1.1 CONCLUSION
4.1.2 CHALLENGES FACED
4.1.3 FUTURE SCOPE
4.1.4 APPLICATIONS
4.2 APPENDIX
4.2.1 CONVOLUTIONAL NEURAL NETWORK
4.2.2 TENSORFLOW
4.3 REFERENCES
CHAPTER 1
1.1 ABSTRACT

More than 5% of the world's population is affected by hearing impairment. To overcome the
challenges faced by these individuals, various sign languages have been developed as an easy and
efficient means of communication. Sign language relies on signs and gestures to convey meaning
during communication. Researchers are actively investigating methods to develop sign
language recognition systems, but they face many challenges during implementation, including
the recognition of hand poses and gestures. Furthermore, some signs have similar
appearances, which adds to the complexity of building recognition systems. This paper focuses on a
sign language alphabet recognition system because letters are the core of any language.
Moreover, the system presented here can be considered a starting point for developing more
complex systems.

People with hearing or speech impairments use Sign Language for effective
communication. Sign Language uses finger-spelling and word-level gestures for communication.
Interpreters can be scarce, and a lack of knowledge of sign language can pose a
communication barrier for signers. Sensor-based approaches that have been developed were only
partially successful because of the hardware components involved. To overcome this, machine learning,
image classification and object detection techniques have been employed over the years for
recognizing sign language. Combining a 3D CNN with an RNN enabled sequential modelling but
required a large number of pre-processing steps for training. This research paper proposes to implement a
solution that recognizes hand gestures used in sign language by leveraging LSTM and object
detection techniques while taking into consideration the shortcomings of previous algorithms.
1.2 INTRODUCTION

Effective communication is a fundamental aspect of human interaction, and for individuals dealing
with hearing or speech impairment, Sign Language becomes a crucial medium for expressing
thoughts and messages. However, the challenge arises when those unfamiliar with Sign Language
find it difficult to comprehend, hindering communication with the hearing and speech impaired. A
survey conducted by the National Association of the Deaf (NAD) revealed that approximately 18 million
Indians are grappling with hearing loss, highlighting the significance of addressing communication
barriers in this community. Even for those well-versed in Sign Language, interpreters are scarce and
often come with a substantial financial burden.

In the realm of communication for the Deaf and Mute (D&M) community, American Sign Language is
widely used. Because the primary disability of D&M individuals is communication-related and they
cannot rely on spoken languages, Sign Language is their most viable means of expression. Sign
Language is a visual language in which individuals use hand
gestures, facial expressions, and body language to convey their ideas and thoughts.

The communication process for D&M individuals revolves around the exchange of thoughts and
messages through visual means, including hand gestures, signals, behavior, and visual cues. These
non-verbal expressions, collectively known as sign language, play a pivotal role in bridging the
communication gap for the Deaf and Mute community. Rather than traditional spoken languages,
Sign Language becomes the primary avenue for meaningful interaction.

Sign language comprises three major components that contribute to its rich and nuanced
communication:

Finger Spelling: This component involves signing out individual letters to spell a word. Each letter is
expressed sequentially, contributing to the formation of complete words. While it is a methodical
process, it serves as a crucial element in Sign Language communication.

Gestures, Vocabulary, and Word-level Sign Language: In this component, each gesture made by the
signer represents an entire word. This method is faster and more commonly used, facilitating
efficient communication within the Deaf and Mute community. It allows for the expression of ideas
and thoughts with relative ease.

Facial Expressions: Beyond hand gestures, facial expressions play a vital role in sign language
interpretation. These external features add layers of meaning and context to the communicated
message, enhancing the overall expressiveness of Sign Language.

Recognizing the significance of Fingerspelling in Sign Language, our project focuses on developing a
robust model capable of accurately recognizing Fingerspelling-based hand gestures. The objective is
to create a system that can seamlessly interpret each gesture, allowing for the formation of
complete words and promoting more accessible and effective communication for individuals in the
Deaf and Mute community. By leveraging technology to bridge communication gaps, we aim to
empower individuals with hearing or speech impairment to engage meaningfully with the broader
society.
1.2.1 Data Set Generation

In this phase of the project, we generated the dataset for training our model using the webcams of
our team members. The dataset covers seven distinct gestures: "Hello," "I Love You," "Yes," "No,"
"Thank You," "Sorry," and "Please." To make the dataset diverse and robust, we captured 30 videos
for each gesture class. Each video comprises 30 frames capturing the continuity of the same gesture.
Variation was added by recording with alternate hands, varying angles, and diverse poses, reflecting
the variability of real-world sign language expressions. After data acquisition, each image was
assigned its corresponding gesture label. This dataset forms the basis of our model training.
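
The collection plan described above can be sketched as a simple directory layout. This is only an illustration: the root folder name `dataset`, the `.npy` file extension, and the one-file-per-frame layout are assumptions, not the project's documented storage format.

```python
from pathlib import PurePosixPath

# Gesture classes named in the text.
GESTURES = ["Hello", "I Love You", "Yes", "No", "Thank You", "Sorry", "Please"]
VIDEOS_PER_GESTURE = 30   # videos recorded per class
FRAMES_PER_VIDEO = 30     # frames kept per video


def frame_paths(root="dataset"):
    """List one (hypothetical) path per frame: root/gesture/video/frame.npy."""
    paths = []
    for gesture in GESTURES:
        for video in range(VIDEOS_PER_GESTURE):
            for frame in range(FRAMES_PER_VIDEO):
                paths.append(PurePosixPath(root, gesture, str(video), f"{frame}.npy"))
    return paths


paths = frame_paths()
total_frames = len(paths)
```

With seven classes, this gives 7 x 30 = 210 videos and 210 x 30 = 6,300 frames in total.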

1.2.2 Hand Segmentation in images

The segmentation step in our proposed solution plays a critical role in the accuracy of the
subsequent image processing steps. Breaking the image into smaller, more manageable segments
makes it easier to capture fine image attributes. One common segmentation approach separates the
background from the object within the image, which improves clarity. The quality of the
segmentation directly affects the accuracy of the later recognition stage, which is why this
preprocessing stage matters.

To annotate and label the segmented images, we use Python's Labeling Library. This tool
streamlines annotation by letting us label object bounding boxes within the images, which gives a
structured and consistent labeling process. After segmentation, the tool also lets us
generate characteristics of the chosen segments, including the size of the bounding box.
This helps in understanding the segmented elements and in integrating the labeled data into our
training dataset.

1.2.3 Feature Extraction Process

For dataset generation, we used the Open Source Computer Vision (OpenCV) library to capture
the images needed for both the training and testing phases of model development. Approximately
800 images were captured for each ASL symbol for training, providing varied visual
representations of each symbol. For testing, around 200 images per symbol were
selected, balancing dataset size against the need for diverse testing scenarios.
OpenCV streamlined the image capture process and supplied the data needed to train and
evaluate our real-time, vision-based sign language recognition system.

1.3 Aims and Objectives

The primary aim of our project is to develop a real-time sign language recognition system that can
effectively interpret and translate sign language gestures into text or speech. By achieving this goal,
we intend to improve accessibility for the deaf and hard of hearing communities, thereby reducing
communication barriers and enhancing human-machine interaction. Our overarching aim includes
creating a user-friendly interface, supporting multiple sign languages, ensuring high accuracy and
low latency, exploring educational and training applications, and facilitating communication in
emergency situations. We are committed to continuous improvement, incorporating user feedback
and adhering to ethical and inclusive design principles throughout the development process.
Ultimately, our project seeks to empower individuals who use sign language as their primary mode
of communication by providing them with an effective and reliable means of expressing themselves
in various contexts.

1. Creation of a custom image dataset using a webcam.

2. Hand segmentation for the object detection model. This is a data preprocessing step in which
the image dataset is segmented and labelled with annotations
according to American Sign Language.

3. Feature extraction for the LSTM-based approach: in this step, important features are extracted
into a NumPy array suitable for LSTM model input. Face, hand and pose landmarks are captured
and then converted to a form suitable for the model.

4. Use of deep learning algorithms such as LSTM to train our ASL gesture recognition model.

5. Use of TensorBoard to monitor model training and perform iterations to improve prediction
accuracy for certain classes.

6. Use of the SSD MobileNet model (object detection) to train the ASL gesture recognition model.

7. Test both models to predict the gestures in real time and compare the results.
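
The feature extraction step in objective 3 can be sketched as follows. The report does not state how many landmarks are captured, so the counts below (33 pose landmarks with a visibility value, 468 face landmarks, 21 landmarks per hand) are assumptions based on the usual MediaPipe Holistic output. The `Landmark` tuple and the function names are hypothetical, and plain Python lists stand in for the NumPy array so that the sketch has no dependencies.

```python
from collections import namedtuple

# Stand-in for one detected landmark; a real pipeline would receive these from the detector.
Landmark = namedtuple("Landmark", ["x", "y", "z", "visibility"])

# Assumed landmark counts and values per landmark (not stated in the report).
POSE = (33, 4)   # x, y, z, visibility
FACE = (468, 3)  # x, y, z
HAND = (21, 3)   # x, y, z, for each hand


def flatten(landmarks, count, dims):
    """Flatten a landmark list to count * dims floats, or zeros if nothing was detected."""
    if landmarks is None:
        return [0.0] * (count * dims)
    fields = ("x", "y", "z", "visibility")[:dims]
    return [float(getattr(lm, f)) for lm in landmarks for f in fields]


def frame_features(pose, face, left_hand, right_hand):
    """Concatenate pose, face and both hands into one fixed-length feature vector."""
    return (flatten(pose, *POSE) + flatten(face, *FACE)
            + flatten(left_hand, *HAND) + flatten(right_hand, *HAND))


# Example: only the right hand is visible in this frame.
right = [Landmark(0.5, 0.5, 0.0, 1.0)] * 21
features = frame_features(None, None, None, right)
```

Under these assumed counts every frame yields 1,662 values, so a 30-frame video becomes a 30 x 1662 sequence. Filling undetected parts with zeros keeps the length fixed, which an LSTM input requires.
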
CHAPTER 2
2.1 Technologies used
2.1.1 Artificial Intelligence

Artificial Intelligence (AI) represents a revolutionary frontier in computer science, focusing on
endowing machines with capabilities that mimic human intelligence. At its core, AI leverages
machine learning, a subset where algorithms learn from data patterns, improving performance
without explicit programming. Natural Language Processing (NLP) enables machines to understand
and generate human language, facilitating applications like chatbots and language translation.
Computer vision empowers machines to interpret visual information, finding applications in facial
recognition and autonomous vehicles.

AI's impact spans a spectrum of industries, from healthcare, where it aids in diagnostics and drug
discovery, to finance, optimizing trading strategies and fraud detection. Education benefits from AI-
driven personalized learning, adapting content to individual student needs. However, ethical
considerations arise, demanding responsible AI development. Concerns about algorithmic bias
underscore the importance of fair and unbiased AI systems. Privacy issues also necessitate careful
handling of sensitive data.

As technology evolves, the ethical deployment of AI becomes paramount. Striking a balance
between innovation and ethical considerations is vital. Researchers, policymakers, and developers
collaborate to establish guidelines for responsible AI development, ensuring transparency and
accountability. Ongoing dialogues about the societal implications of AI aim to address concerns and
build trust in this transformative technology.

The future of AI holds promises and challenges. Achieving general intelligence remains a goal,
requiring advancements in deep learning and cognitive computing. The quest for ethical AI
necessitates continual refinement of algorithms and policies to mitigate unintended consequences.
As AI shapes our technological landscape, the collaborative effort to harness its potential responsibly
reflects a commitment to shaping a future where machines augment human capabilities while
upholding ethical standards.

2.1.2 Machine learning


Machine Learning (ML) is a subfield of Artificial Intelligence (AI) that focuses on creating systems
capable of learning and making predictions or decisions without being explicitly programmed. At its
core, machine learning algorithms leverage patterns and statistical structures within data to improve
their performance over time. This learning process enables machines to adapt to new information,
providing a level of autonomy and versatility.

There are several types of machine learning approaches, each serving different purposes. In
supervised learning, models are trained on labeled datasets, where the algorithm learns to map
input data to corresponding output labels. This approach is commonly used in tasks like image
recognition, language translation, and spam filtering. Unsupervised learning deals with unlabeled
data, aiming to discover patterns or relationships within the data. Clustering and dimensionality
reduction are examples of unsupervised learning techniques. Reinforcement learning involves
training an agent to make decisions in an environment by receiving feedback in the form of rewards
or penalties. This approach is prevalent in applications like game playing and robotic control.

The success of machine learning relies heavily on data quality and diversity. The more varied and
representative the training data, the better the model's generalization to new, unseen data. Feature
engineering, the process of selecting and transforming relevant data attributes, is also crucial for
improving model performance.

Machine learning finds applications across various domains, from healthcare and finance to
marketing and entertainment. In healthcare, ML aids in disease diagnosis, personalized treatment
plans, and drug discovery. In finance, it is used for risk assessment, fraud detection, and algorithmic
trading. Recommendation systems in online platforms and predictive maintenance in manufacturing
are additional examples of machine learning applications.

As machine learning continues to advance, the field explores deep learning, a subset that involves
neural networks with multiple layers, inspired by the human brain's structure. Deep learning excels
in tasks like image and speech recognition, pushing the boundaries of what machines can achieve.
The ongoing evolution of machine learning promises to revolutionize industries, enhance decision-
making processes, and contribute to the development of increasingly intelligent and adaptive
systems.

2.1.3 Deep learning

Deep Learning is a specialized subfield of machine learning that involves neural networks with
multiple layers (deep neural networks). The "deep" in deep learning refers to the depth of the neural
networks, which have an increasing number of hidden layers. These layers transform input data into
progressively more abstract and complex representations.

The fundamental building blocks of deep learning are artificial neural networks, inspired by the
structure and functioning of the human brain. Neural networks consist of interconnected nodes, or
"neurons," organized into layers—input, hidden, and output layers. Each connection between nodes
has a weight, and during training, the network adjusts these weights to optimize its performance on
a specific task.

Deep learning excels at automatically learning hierarchical representations of data. In traditional
machine learning, feature engineering, the process of selecting relevant input features, is crucial for
model performance. Deep learning largely removes the need for explicit feature engineering, as the
network learns to extract relevant features from the raw data.

Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are two popular
architectures within deep learning. CNNs are effective for tasks involving grid-like data, such as
images, and have been instrumental in image recognition and computer vision. RNNs are designed
for sequential data, making them suitable for tasks like natural language processing and speech
recognition.

Deep learning has achieved remarkable success in various domains. Image and speech recognition,
language translation, and autonomous vehicles are just a few examples of applications where deep
learning has outperformed traditional machine learning approaches. The ability of deep learning
models to automatically learn intricate patterns and representations from massive amounts of data
makes them powerful tools for complex tasks.

However, deep learning also comes with challenges. Training deep neural networks requires
substantial computational resources, and overfitting (where the model performs well on training
data but poorly on new data) can be a concern. Interpretability of deep learning models is another
challenge, as they often function as "black boxes," making it challenging to understand the reasoning
behind their decisions.

In conclusion, deep learning leverages deep neural networks to automatically learn complex
representations from data, eliminating the need for manual feature engineering. Its success in tasks
like image and speech recognition has propelled it to the forefront of artificial intelligence research
and applications, paving the way for advancements in various fields.

2.1.4 ML VS DL

Machine Learning (ML) and Deep Learning (DL) are related fields within artificial intelligence, but
they differ in their approaches and applications.

Machine Learning is a broader concept that involves developing algorithms and models that allow
computers to learn from data and make predictions or decisions without explicit programming. ML
encompasses various techniques, including supervised learning, unsupervised learning, and
reinforcement learning. It often requires feature engineering, where relevant input features are
manually selected to improve model performance.

On the other hand, Deep Learning is a subset of machine learning that focuses on neural networks
with multiple layers (deep neural networks). DL excels in automatically learning hierarchical
representations from raw data, eliminating the need for extensive feature engineering.
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are common
architectures within deep learning.

In summary, while ML is a broader field that includes various learning techniques, DL is a specific
approach within ML that involves deep neural networks. DL's strength lies in its ability to
automatically learn intricate patterns and representations from data, making it particularly effective
in tasks like image and speech recognition, where complex features can be automatically extracted.

2.1.5 TensorFlow

TensorFlow is an open-source machine learning framework developed by Google. It is widely used
for building and training various machine learning models, particularly deep learning
models. TensorFlow provides a comprehensive platform for developers and researchers to create
and deploy machine learning applications, from simple models to complex neural networks.

At its core, TensorFlow revolves around the concept of tensors, which are multi-dimensional arrays
representing data. These tensors flow through a computational graph, a series of mathematical
operations, forming the basis of the framework's name. TensorFlow allows users to define, train, and
deploy machine learning models efficiently, offering flexibility and scalability.

One of TensorFlow's key strengths is its support for deep learning. It simplifies the development of
neural networks with high-level APIs like Keras, making it accessible to a broader audience.
TensorFlow's versatility is evident in its ability to handle tasks such as image and speech recognition,
natural language processing, and reinforcement learning.

TensorFlow supports both CPU and GPU acceleration, enabling the efficient execution of complex
computations. TensorFlow Extended (TFX) extends its capabilities for deploying production-ready
machine learning pipelines. TensorFlow Lite is a version designed for mobile and embedded devices,
allowing models to run on edge devices with limited resources.

The TensorFlow ecosystem includes a vast community, extensive documentation, and pre-built
models through TensorFlow Hub, facilitating collaborative learning and model sharing. TensorFlow's
popularity and continuous development make it a go-to framework for machine learning
practitioners, researchers, and developers seeking to harness the power of machine learning and
deep learning in their applications.

2.1.6 MediaPipe

MediaPipe is an open-source framework developed by Google that provides tools and solutions for
building real-time perception applications. It is designed to simplify the development of multimedia
processing pipelines, making it easier for developers to incorporate machine learning models into
their applications. MediaPipe is particularly known for its capabilities in computer vision and gesture
recognition.

One of the key features of MediaPipe is its modular and customizable pipeline architecture.
Developers can assemble a series of pre-built components, known as calculators, to create a
processing pipeline tailored to their specific needs. These calculators encapsulate functionalities
such as image processing, feature extraction, and machine learning inference.

MediaPipe is widely used for tasks like face detection, facial landmark tracking, hand tracking, and
pose estimation. The framework comes equipped with pre-trained models that can be easily
integrated into applications, reducing the complexity of building these functionalities from scratch.
The use of machine learning models enables real-time analysis of video streams or image sequences.

The framework supports both desktop and mobile platforms, making it versatile for a range of
applications, from augmented reality experiences to gesture-based control systems. Additionally,
MediaPipe provides APIs for various programming languages, including Python and C++, making it
accessible to a broad community of developers.

With its focus on real-time, scalable, and customizable multimedia processing, MediaPipe has gained
popularity in the fields of computer vision, human-computer interaction, and augmented reality,
offering a powerful toolset for creating innovative applications that leverage machine learning and
computer vision techniques.

2.1.7 CNN

A Convolutional Neural Network (CNN) is a specialized type of artificial neural network designed for
processing grid-like data, particularly images. CNNs have become a cornerstone in computer vision
and image recognition due to their ability to automatically learn and extract hierarchical features
from raw pixel data.

The architecture of a CNN consists of several layers, each serving a specific purpose. Convolutional
layers use filters or kernels to scan input data for patterns, detecting features like edges, textures,
and more complex structures. These layers are crucial for capturing spatial hierarchies in the data.
Following convolutional layers, pooling layers reduce the spatial dimensions of the data, retaining
the most essential information while decreasing computational complexity and aiding in preventing
overfitting. Fully connected layers at the end of the network combine the learned features to make
predictions or classifications.
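
To make the convolution step concrete, the sketch below slides a small filter over a toy image using plain Python lists. The image and the filter values are made up for illustration; in a trained CNN the filter weights are learned, and no padding with a stride of 1 is assumed here.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image with no padding and stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# A 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A 3x3 vertical-edge filter (illustrative values, not learned weights).
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = conv2d_valid(image, kernel)
```

The 4x4 input shrinks to a 2x2 feature map, and every entry is 3 because each 3x3 window straddles the edge, which is the kind of pattern a convolutional filter responds to.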

CNNs leverage the concept of weight sharing and local connectivity, reducing the number of
parameters and enabling the network to learn spatial hierarchies of features efficiently. This design
is inspired by the organization of the visual cortex in animals, emphasizing the extraction of features
at different scales.

The success of CNNs is evident in their performance on image-related tasks, such as image
classification, object detection, and segmentation. They have achieved remarkable results in
competitions like ImageNet, showcasing their effectiveness in large-scale visual recognition
challenges. Additionally, transfer learning, a technique where pre-trained CNN models are fine-
tuned for specific tasks, has become a common practice, especially when dealing with limited
datasets.

Beyond images, CNNs have found applications in diverse domains, including natural language
processing and speech recognition, underscoring their adaptability and efficiency in learning
hierarchical representations from various types of structured grid data. Overall, CNNs have become
a foundational tool in the realm of artificial intelligence, playing a pivotal role in advancing the
capabilities of machines to interpret and understand visual information.
CHAPTER 3
3.1 Methodology

3.1.1 GESTURE CLASSIFICATION

Our approach uses two layers of algorithms to predict the final symbol shown by the user.

Algorithm Layer 1:

1. Apply a Gaussian blur filter and a threshold to the frame captured with OpenCV to obtain the
processed image after feature extraction.

2. Pass this processed image to the CNN model for prediction. If a letter is detected for
more than 50 frames, it is printed and taken into consideration for forming the word.

3. The space between words is represented using the blank symbol.
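
The frame-count rule in steps 2 and 3 can be sketched as below. This is one possible reading, with assumptions: the frames are counted consecutively, a held symbol is emitted only once per run, and the blank class is labelled `"blank"`. The function name and the label are hypothetical.

```python
THRESHOLD = 50  # a symbol must be seen for more than this many frames in a row


def build_sentence(predictions, threshold=THRESHOLD, blank="blank"):
    """Turn per-frame predictions into text, accepting a symbol once per stable run."""
    text = ""
    current, run = None, 0
    for symbol in predictions:
        if symbol == current:
            run += 1
        else:
            current, run = symbol, 1
        # Emit exactly once, at the moment the run first exceeds the threshold.
        if run == threshold + 1:
            if symbol == blank:
                if text and not text.endswith(" "):
                    text += " "
            else:
                text += symbol
    return text


# 60 frames of H, a 10-frame flicker of A, then I, a blank, and A again.
frames = ["H"] * 60 + ["A"] * 10 + ["I"] * 51 + ["blank"] * 55 + ["A"] * 51
sentence = build_sentence(frames)
```

The short run of 10 "A" frames is ignored as noise, so the example yields "HI A". Requiring a long stable run trades responsiveness for fewer spurious letters.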

Algorithm Layer 2:

1. We detect the sets of symbols that produce similar results when detected.

2. We then classify between the symbols in each set using classifiers built for that set only.

Layer 1:

CNN Model:

1. In the intricate architecture of our convolutional neural network (CNN), the journey
commences with the 1st Convolution Layer. The initial input, a picture boasting a resolution
of 128x128 pixels, undergoes meticulous processing in this layer. Thirty-two filter weights,
each measuring 3x3 pixels, are applied, resulting in the generation of a 126x126 pixel image.
Each filter weight contributes to the creation of a distinct 126x126 pixel image, collectively
establishing the foundation for subsequent layers.

2. Progressing through the network, the 1st Pooling Layer takes center stage. Employing max
pooling with a 2x2 window, the images are strategically downsampled. This involves
retaining the highest value within each 2x2 square of the array. Consequently, the original
image undergoes a substantial reduction, culminating in a condensed 63x63 pixel
representation. This pooling operation serves the dual purpose of simplifying computations
and preserving essential features.
3. The journey continues with the 2nd Convolution Layer, where the 63x63 pixel output from
the first pooling layer becomes the input. Thirty-two additional filter weights, mirroring the
dimensions of 3x3 pixels each, are employed. The result is a refined 60x60 pixel image,
paving the way for enhanced feature extraction and abstraction.

4. Persevering in our architectural odyssey, the 2nd Pooling Layer comes into play. The
resultant images from the second convolutional layer undergo another round of
downsampling through max pooling with a 2x2 window. This transformative process reduces
the resolution to a compact 30x30, consolidating the extracted features for subsequent
layers.

5. Signifying a pivotal juncture, the 1st Densely Connected Layer takes the torch. The 30x30x32
array derived from the output of the second convolutional layer becomes the input for a
fully connected layer boasting 128 neurons. Simultaneously, the output from the second
convolutional layer is reshaped into an array comprising 30x30x32, amounting to a
formidable 28,800 values. A dropout layer with a judicious value of 0.5 is strategically
deployed to mitigate the risks of overfitting, ensuring the robustness of the network.

6. In the 2nd Densely Connected Layer, the output of the 1st Densely Connected Layer serves as
input to a fully connected layer with 96 neurons. This layer further refines the
hierarchical representations learned in the earlier stages.

7. In the Final Layer, the output of the 2nd Densely Connected Layer is fed to a layer with as
many neurons as there are classes to be recognized. In our case, these classes are the
letters of the alphabet together with the blank symbol.
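
The layer sizes listed above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a 128x128 input, unpadded stride-1 convolutions, and 2x2 pooling that rounds down. The helper names are hypothetical and are not taken from the project code.

```python
def conv_out(size: int, kernel: int) -> int:
    """Output width of an unpadded, stride-1 convolution (assumed setting)."""
    return size - kernel + 1


def pool_out(size: int, window: int) -> int:
    """Output width of non-overlapping max pooling, rounding down."""
    return size // window


def trace_shapes(input_size: int = 128, kernel: int = 3, window: int = 2, filters: int = 32):
    """Follow the image width through conv -> pool -> conv -> pool -> flatten."""
    sizes = {"input": input_size}
    sizes["conv1"] = conv_out(sizes["input"], kernel)
    sizes["pool1"] = pool_out(sizes["conv1"], window)
    sizes["conv2"] = conv_out(sizes["pool1"], kernel)
    sizes["pool2"] = pool_out(sizes["conv2"], window)
    # The dense layer receives one value per pixel per filter.
    sizes["flatten"] = sizes["pool2"] ** 2 * filters
    return sizes


shapes = trace_shapes()
print(shapes)
```

Under these assumptions the first pooling layer yields 63x63, the second convolution yields 61x61, the second pooling layer rounds this down to 30x30, and flattening gives the 28,800 values fed to the first dense layer.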

Activation Function:

We have used ReLU (Rectified Linear Unit) in each of the layers (convolutional as well as fully
connected). ReLU calculates max(x, 0) for each input value. This adds nonlinearity to the model
and helps it learn more complicated features. It also helps mitigate the vanishing gradient
problem and speeds up training by reducing the computation time.

Pooling Layer: We apply max pooling with a pool size of (2, 2) to the ReLU-activated feature maps.
This reduces the size of the feature maps, and with it the number of parameters in the later
layers, thus lessening the computation cost and reducing overfitting.
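
As a small illustration of the two operations described above, the following sketch applies ReLU and then 2x2 max pooling to a toy 4x4 feature map. It uses plain Python lists rather than the project's Keras layers, and the function names and input values are made up for the example.

```python
def relu(feature_map):
    """Apply max(x, 0) to every value of a 2D feature map."""
    return [[max(value, 0) for value in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the largest value of each non-overlapping 2x2 window."""
    rows = len(feature_map) // 2
    cols = len(feature_map[0]) // 2
    return [
        [
            max(
                feature_map[2 * r][2 * c],
                feature_map[2 * r][2 * c + 1],
                feature_map[2 * r + 1][2 * c],
                feature_map[2 * r + 1][2 * c + 1],
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


fmap = [
    [1, -2, 3, 0],
    [-1, 5, -6, 2],
    [0, -3, -4, -1],
    [7, 2, -8, -9],
]
pooled = max_pool_2x2(relu(fmap))
print(pooled)  # [[5, 3], [7, 0]]
```

The 4x4 map shrinks to 2x2, which is the same halving of width and height that takes the network's feature maps from 126x126 to 63x63.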

Dropout Layers: Overfitting occurs when, after training, the weights of the network are so
tuned to the training examples they are given that the network doesn’t perform well when given
new examples. A dropout layer counters this by “dropping out” a random set of activations in that
layer, setting them to zero. The network should be able to provide the right classification or
output for a specific example even if some of the activations are dropped out.
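
A minimal sketch of the idea, assuming a dropout rate of 0.5 as used in the network. Scaling the surviving activations by 1 / (1 - rate), as common implementations do, is an addition that the text does not state; the function name is hypothetical.

```python
import random


def dropout(activations, rate=0.5, rng=None):
    """Zero each activation with probability `rate`; rescale the survivors."""
    rng = rng or random.Random()
    keep = 1.0 - rate
    # Assumption: "inverted dropout", so the expected activation stays unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


acts = [0.2, 1.5, 0.7, 3.0]
dropped = dropout(acts, rate=0.5, rng=random.Random(0))
print(dropped)
```

Dropout is applied only during training; at prediction time all activations are kept.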

Optimizer: We have used the Adam optimizer for updating the model in response to the output of
the loss function. Adam combines the advantages of two extensions of stochastic gradient descent,
namely the adaptive gradient algorithm (AdaGrad) and root mean square propagation (RMSProp).
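
To make the optimizer less abstract, here is a sketch of a single Adam update for one scalar weight, written in plain Python. The hyperparameter defaults are the commonly used ones and are not taken from the project; the toy loss function is only for demonstration.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the updated parameter and moment estimates after step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy problem: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    w, m, v = adam_step(w, 2 * (w - 3), m, v, t, lr=0.01)
print(w)
```

The running mean of gradients plays the role of momentum, while dividing by the root of the running squared gradient gives each weight its own step size, as in RMSProp.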

Layer 2

We use two layers of algorithms to verify and predict symbols that closely resemble each other,
so that the displayed symbol is detected as accurately as possible. In our testing, the following
symbols were not recognized reliably and were confused with other symbols:

1. For D: confused with R and U.
2. For U: confused with D and R.
3. For I: confused with T, D, and K.
4. For S: confused with M and N.

To handle these cases and make our symbol recognition more robust, we built three separate
classifiers, one for each of these sets:

1. Classifier for {D, R, U}: This specialized algorithm focuses on precisely distinguishing between the
symbols D, R, and U, mitigating any potential confusion arising from their visual similarities.

2. Classifier for {T, K, D, I}: This classifier is meticulously designed to navigate the subtle distinctions
between the symbols T, K, D, and I. It ensures accurate classification within this set, acknowledging
the challenges posed by their visual intricacies.

3. Classifier for {S, M, N}: Recognizing the complexities in differentiating symbols S, M, and N, this
classifier is strategically devised to discern and accurately classify within this particular symbol set.

By introducing these targeted classifiers, we fortify our system against potential ambiguities,
elevating the precision of symbol recognition in scenarios where visual similarities could lead to
misclassifications. This multi-layered approach not only enhances the accuracy of our symbol
prediction but also underscores our commitment to creating a robust and adaptable framework
capable of navigating the nuanced intricacies of symbol recognition in diverse visual contexts.
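
The two-layer idea can be sketched as a simple dispatch: the main model predicts a symbol, and if that symbol belongs to one of the look-alike sets, the matching specialised classifier decides instead. This is an illustrative sketch only. The function names, the order in which the sets are checked, and the stand-in lambda "models" are all assumptions, not the project's code.

```python
from typing import Callable, Dict, FrozenSet

Classifier = Callable[[object], str]

# Sets of look-alike symbols named in the text.
DRU = frozenset("DRU")
TKDI = frozenset("TKDI")
SMN = frozenset("SMN")


def predict_symbol(image, main: Classifier,
                   specialists: Dict[FrozenSet[str], Classifier]) -> str:
    """Layer 1 predicts a symbol; layer 2 re-checks it if it is in a look-alike set."""
    label = main(image)
    # Assumption: the sets are checked in this order, so a 'D' returned by the
    # {D, R, U} classifier is checked again by the {T, K, D, I} classifier.
    for group in (DRU, TKDI, SMN):
        if label in group:
            label = specialists[group](image)
    return label


# Hypothetical stand-ins for trained models: each maps an image to a label.
specialists = {
    DRU: lambda img: "R",
    TKDI: lambda img: "K",
    SMN: lambda img: "S",
}
print(predict_symbol("frame", lambda img: "U", specialists))  # R
print(predict_symbol("frame", lambda img: "A", specialists))  # A
```

Symbols outside the three sets pass through unchanged, so the second layer only adds work for the cases that were found to be ambiguous.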

3.1.2 Finger spelling sentence formation Implementation:

Our implementation uses the following steps to manage the recognition and interpretation of
symbols as they arrive from the live video stream:

3.1.2.1. Dynamic Letter Recognition Thresholding:

A letter is accepted and added to the current word whenever its detection count exceeds a
predefined threshold value, which is set to 50 in our code.
As an additional check, the system verifies that no other letter is close to it: the count of
every other letter must differ from it by more than a threshold of 20. This prevents a letter from
being accepted while a competing letter is being detected almost as often.

3.1.2.2. Clearance of Detection Dictionary:

When the detected letter does not pass these checks, because its count falls short of the
threshold or another symbol is too close to it, the letter is not accepted.
The system then clears the current dictionary, which tallies the count of detections for the
present symbol. This reduces the likelihood of predicting the wrong letter.

3.1.2.3. Strategic Handling of Blank Detection:

Blanks (plain background) in the visual input stream are handled separately.
If the count of blank detections exceeds a predetermined threshold and the current buffer is
empty, the system does not emit a space. This ensures that a space is only inserted after at
least one symbol has been detected.

3.1.2.4. Dynamic Space Prediction and Sentence Formation:

The algorithm predicts the end of a word by emitting a space when specific conditions hold. If
the count of blank detections exceeds the threshold and the current buffer is empty, no space is
emitted.
If the count of blank detections exceeds the threshold and the buffer is not empty, the system
predicts the end of the word by printing a space, and the current buffer, which holds the
detected symbols, is appended to the sentence displayed below.

Together, these rules ensure not only the accurate recognition of symbols but also the coherent
formation of sentences from real-world visual data.
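
The rules in sections 3.1.2.1 to 3.1.2.4 can be sketched as a small state machine that receives one predicted symbol per video frame. This is a hedged illustration: the class name, the blank threshold of 50, and the choice to clear the counts after every decision are assumptions; only the letter threshold of 50 and the margin of 20 come from the text.

```python
from collections import Counter
from string import ascii_uppercase

LETTER_THRESHOLD = 50  # value given in the text
MARGIN = 20            # value given in the text
BLANK_THRESHOLD = 50   # assumption: the text does not give this value


class SentenceBuilder:
    """Turns per-frame predictions ('A'..'Z' or 'blank') into words and a sentence."""

    def __init__(self):
        self.counts = Counter()
        self.word = ""
        self.sentence = ""

    def update(self, symbol):
        self.counts[symbol] += 1
        if symbol == "blank":
            if self.counts["blank"] > BLANK_THRESHOLD:
                if self.word:  # non-empty buffer: end of word, emit a space
                    self.sentence += self.word + " "
                    self.word = ""
                self.counts.clear()  # empty buffer: no space is emitted
            return
        if self.counts[symbol] > LETTER_THRESHOLD:
            rivals = (c for c in ascii_uppercase if c != symbol)
            # Accept only if every other letter trails by more than MARGIN.
            if all(self.counts[symbol] - self.counts[c] > MARGIN for c in rivals):
                self.word += symbol
            self.counts.clear()  # start counting afresh either way


builder = SentenceBuilder()
for symbol in ["H"] * 51 + ["I"] * 51 + ["blank"] * 51:
    builder.update(symbol)
print(repr(builder.sentence))  # 'HI '
```

If two letters alternate from frame to frame, neither gains a lead of more than 20, so nothing is appended and the counts are reset.
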
3.2 Autocorrect Feature:

A Python library, Hunspell suggest, is used to suggest correct alternatives for each (incorrect)
input word. We display a set of words matching the current word, from which the user can select a
word to append to the current sentence. This helps reduce spelling mistakes and assists in
predicting complex words.
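
The suggestion step can be illustrated without Hunspell. The sketch below swaps in `difflib.get_close_matches` from the Python standard library as a stand-in, so it shows the idea of offering close matches rather than the project's actual Hunspell call. The tiny vocabulary and the function name are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical mini-dictionary; a real system would load a full word list.
VOCABULARY = ["hello", "help", "world", "thanks", "please"]


def suggest(word, vocabulary=VOCABULARY, limit=3):
    """Return up to `limit` vocabulary words that closely match `word`."""
    return get_close_matches(word.lower(), vocabulary, n=limit, cutoff=0.6)


print(suggest("HELO"))  # ['hello', 'help']
```

The returned list is what would be shown to the user, who then picks the intended word.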

3.3 Training and Testing:

We convert our input images (RGB) into grayscale and apply Gaussian blur to remove unnecessary
noise. We then apply an adaptive threshold to extract the hand from the background and resize the
images to 128 x 128. After this preprocessing, the images are fed to our model for training and
testing.

The prediction layer estimates how likely it is that the image falls under each of the classes.
The output is therefore normalized so that each value lies between 0 and 1 and the values across
all classes sum to 1. We have achieved this using the softmax function.

At first the output of the prediction layer will be somewhat far from the actual value. To make it
better, we have trained the network using labeled data. Cross-entropy is a performance measure
used in classification. It is a continuous function which is positive when the prediction is not
the same as the labeled value and is zero exactly when it is equal to the labeled value. We
therefore minimize the cross-entropy, bringing it as close to zero as possible, by adjusting the
weights of our neural network. TensorFlow has an inbuilt function to calculate the cross-entropy,
and we have minimized it using gradient descent, specifically with the Adam optimizer.
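
As a worked illustration of the two functions named above, the sketch below computes softmax and cross-entropy for a three-class toy example in plain Python. The project itself relies on TensorFlow's built-in function; the names and scores here are made up for the example.

```python
import math


def softmax(scores):
    """Turn raw scores into values between 0 and 1 that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot_label, probabilities):
    """Loss is -log of the probability assigned to the labeled class."""
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probabilities) if y > 0)


probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy([1, 0, 0], probs)
print(probs, loss)
```

The loss shrinks toward zero as the probability of the labeled class approaches 1, which is why minimizing it pushes the network toward the correct class.
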
3.2 Literature Survey

CHAPTER 4
4.1 CONCLUSION

4.1.1 Conclusion:

Within the context of this comprehensive report, we present the culmination of our efforts in
crafting a highly functional real-time vision-based American Sign Language (ASL) recognition system
tailored specifically for individuals facing challenges in hearing and speech, often referred to as D&M
(Deaf and Mute) individuals. Our focus revolves around the recognition of ASL alphabets, a pivotal
aspect of facilitating effective communication for this demographic.

The core achievement of our endeavor lies in the attainment of a remarkable final accuracy rate of
98.0% on the meticulously curated dataset that underpins our research and development. This
outstanding accuracy is a testament to the efficacy of our approach in accurately interpreting and
recognizing ASL alphabets in real-time scenarios.

One of the key innovations behind this accuracy is a two-layered algorithmic framework, which
allows us not only to verify but also to predict symbols that closely resemble each other. This
layered approach improves our predictions for symbols that are visually similar.

Our system can detect a comprehensive range of ASL symbols, provided the conditions are suitable:
the symbols must be presented appropriately, the background must be free of noise, and the
lighting must be adequate.

In summary, our developed real-time vision-based ASL recognition system stands as a testament to
technological innovation and inclusivity. With a robust accuracy rate, strategic algorithmic layers,
and stringent condition requirements, our system paves the way for improved communication and
interaction for D&M individuals. This report encapsulates our journey, methodologies, and
outcomes, showcasing the potential of vision-based technologies in enhancing accessibility and
fostering inclusivity for diverse communities.

4.1.2 Challenges Faced:

The expedition through our project journey was not without its share of formidable challenges, each
serving as a crucible for learning, innovation, and refinement. The initial hurdle that loomed
prominently was the scarcity of a suitable dataset aligned with our project requirements. The
preference for raw, square images, conducive for the convolutional neural network (CNN)
architecture in Keras, added an extra layer of complexity. Faced with the absence of an existing
dataset meeting these criteria, we took the initiative to curate our own dataset, a labor-intensive yet
indispensable endeavor to propel our project forward.

The next frontier of challenges unfurled in the realm of filter selection, a critical decision influencing
the extraction of pertinent features from our images. Given the diverse array of available filters,
ranging from binary thresholding to Canny edge detection and Gaussian blur, the quest for the
optimal filter was a meticulous exploration. After thorough experimentation, the Gaussian blur filter
emerged as the preferred choice, harmonizing with our objective of capturing essential image
features for subsequent CNN model input.

A further concern in the early phases of the project was model accuracy, which we improved
iteratively. Notably, a significant gain was achieved by increasing the input image size, which
gives the model more detail for feature recognition. Refinements to the dataset also improved the
model's ability to discriminate between symbols.

Together, these challenges attest to the iterative nature of the work. From dataset curation to
filter selection and model accuracy refinement, each challenge led to continual improvement and
strengthened the foundations of our project.

4.1.3 Future Scope:

Our commitment to elevating the accuracy and robustness of our system is unwavering, prompting
strategic initiatives to surmount challenges posed by complex backgrounds and low light conditions.
A pivotal facet of our future endeavors involves the exploration and implementation of diverse
background subtraction algorithms. By harnessing the potential of these algorithms, we aim to
enhance our system's adaptability to intricate visual scenarios, where complex backgrounds may
pose challenges to accurate gesture recognition.

In tandem with our pursuit of superior accuracy, we are strategically focusing on refining the
preprocessing stages to bolster our system's proficiency in predicting gestures under low light
conditions. Recognizing the real-world variability in lighting scenarios, this initiative aims to optimize
our model's performance in conditions where illumination may be suboptimal. By fine-tuning
preprocessing techniques, we aspire to achieve a higher accuracy rate even in challenging lighting
environments.

This forward-looking strategy aligns with our overarching goal of creating a versatile and resilient
system capable of delivering accurate sign language recognition across a spectrum of real-world
conditions. As we delve into the realm of background subtraction algorithms and preprocessing
enhancements, our vision is to fortify our system's adaptability and ensure its efficacy in diverse and
challenging visual contexts.

The iterative nature of our approach, characterized by ongoing experimentation and refinement,
underscores our dedication to pushing the boundaries of technological innovation. By continually
striving for higher accuracy and adaptability, we aim to make meaningful strides in enhancing the
accessibility and effectiveness of our real-time vision-based sign language recognition system.

4.1.4 Applications

The applications of our real-time vision-based sign language recognition system extend across
diverse domains, showcasing its potential for positive societal impact. In the realm of accessibility,
our technology stands as a beacon, empowering individuals with hearing and speech impairments by
providing them with an intuitive means of communication through American Sign Language (ASL).
Education and communication channels can leverage our system to facilitate seamless interactions
for the Deaf and Mute community, fostering inclusivity and reducing communication barriers.
Furthermore, the versatility of our model is underscored by its potential in human-computer
interaction scenarios, enabling a more natural and intuitive interface for gesture-based commands.
Beyond individual empowerment, our system holds promise in educational settings, serving as a
valuable tool for ASL learners and educators alike. By continually refining our technology to navigate
challenges such as complex backgrounds and low light conditions, we aspire to pave the way for a
more inclusive and accessible future, where innovative applications of sign language recognition
technology can make a meaningful difference in people's lives.

4.2 APPENDIX
4.2.1 Convolutional Neural network
CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing.
They are also known as shift invariant or space invariant artificial neural networks (SIANN),
based on their shared-weights architecture and translation invariance characteristics.
Convolutional networks were inspired by biological processes in that the connectivity
pattern between neurons resembles the organization of the animal visual cortex.

Individual cortical neurons respond to stimuli only in a restricted region of the visual field
known as the receptive field. The receptive fields of different neurons partially overlap such
that they cover the entire visual field. CNNs use relatively little pre-processing compared to
other image classification algorithms. This means that the network learns the filters that in
traditional algorithms were hand-engineered. This independence from prior knowledge and
human effort in feature design is a major advantage. They have applications in image and
video recognition, recommender systems, image classification, medical image analysis, and
natural language processing.

4.2.2 TensorFlow
TensorFlow is an open-source software library for dataflow programming across a range of
tasks. It is a symbolic math library, and is also used for machine learning applications such as
neural networks. It is used for both research and production at Google. TensorFlow was
developed by the Google Brain team for internal Google use. It was released under the
Apache 2.0 open-source license on November 9, 2015.

TensorFlow is Google Brain's second-generation system. Version 1.0.0 was released on
February 11, 2017. While the reference implementation runs on single devices, TensorFlow
can run on multiple CPUs and GPUs (with optional CUDA and SYCL extensions for
general-purpose computing on graphics processing units). TensorFlow is available on 64-bit Linux,
macOS, Windows, and mobile computing platforms including Android and iOS. Its flexible
architecture allows for the easy deployment of computation across a variety of platforms
(CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices.
