
TRIBHUVAN UNIVERSITY

LALITPUR ENGINEERING COLLEGE

FINAL PROJECT REPORT


ON
“SIGN LANGUAGE DETECTION”

SUBMITTED BY:
Amrit Sapkota (076 BCT 05)
Asmit Oli (076 BCT 43)
Nischal Maharjan (076 BCT 20)
Sakshyam Aryal (076 BCT 29)

SUBMITTED TO:
DEPARTMENT OF COMPUTER ENGINEERING
LALITPUR ENGINEERING COLLEGE
LALITPUR, NEPAL

August, 2023
TRIBHUVAN UNIVERSITY
INSTITUTE OF ENGINEERING
LALITPUR ENGINEERING COLLEGE
DEPARTMENT OF COMPUTER ENGINEERING

FOURTH YEAR PROJECT REPORT ON:


“SIGN LANGUAGE DETECTION”
IN PARTIAL FULFILLMENT FOR THE
AWARD OF
BACHELOR’S DEGREE IN COMPUTER ENGINEERING

SUBMITTED BY
Amrit Sapkota (076 BCT 05)
Asmit Oli (076 BCT 43)
Nischal Maharjan (076 BCT 20)
Sakshyam Aryal (076 BCT 29)

August, 2023
ABSTRACT

There is an undeniable communication problem between the Deaf community and
the hearing majority. Communication becomes hard for deaf people because many
hearing people do not understand sign language. Through innovation in sign
language recognition, we tried to tear down this communication barrier. This
report shows how Artificial Intelligence can play a key role in providing a
solution. Using the dataset and the front camera of a laptop, the translation
of sign language into text can be seen on the screen in real time: the input is
in video format and the output is in text format. Extracting complex head and
hand movements, along with their constantly changing shapes, for the recognition
of sign language is considered a difficult problem in computer vision. MediaPipe
provides the necessary key points, or landmarks, of the hand, face, and pose.
The model is then trained using an LSTM neural network, and the trained model
is used to recognize sign language.

Keywords: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN),
Deep Learning, Gesture Recognition, Sign Language Recognition

i
TABLE OF CONTENTS

ABSTRACT i
TABLE OF CONTENTS ii
LIST OF FIGURES iv
1 INTRODUCTION 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5 System Requirement . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5.1 Functional Requirements . . . . . . . . . . . . . . . . . . . 2
1.5.2 Non-functional Requirement . . . . . . . . . . . . . . . . . . 3
2 LITERATURE REVIEW 4
3 FEASIBILITY ANALYSIS 6
3.1 Technical Feasibility . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.2 Economic Feasibility . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.3 Operational Feasibility . . . . . . . . . . . . . . . . . . . . . . . . . 6
4 BLOCK DIAGRAM 7
4.1 System Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.2 Use Case Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.3 Level 0 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.4 Level 1 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.5 Activity Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
5 METHODOLOGY 13
5.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
5.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
5.2.1 Video Acquisition . . . . . . . . . . . . . . . . . . . . . . . . 13
5.2.2 Video Segmentation . . . . . . . . . . . . . . . . . . . . . . . 13
5.2.3 Frame Extraction . . . . . . . . . . . . . . . . . . . . . . . . 14
5.2.4 Preprocessing Techniques . . . . . . . . . . . . . . . . . . . 14

ii
5.2.5 Hand Segmentation . . . . . . . . . . . . . . . . . . . . . . . 15
5.2.6 Temporal Alignment . . . . . . . . . . . . . . . . . . . . . . 15
5.2.7 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . 15
6 IMPLEMENTATION PLAN 16
6.1 Gantt Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
7 REQUIREMENT ANALYSIS 17
7.1 Hardware Requirements . . . . . . . . . . . . . . . . . . . . . . . . 17
7.2 Software Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 17
7.3 User Requirement Definition . . . . . . . . . . . . . . . . . . . . . . 18
8 EXPECTED OUTCOME 19
9 RESULT AND OUTCOMES 20
REFERENCES 21

iii
LIST OF FIGURES

4.1 System Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 7


4.2 Use Case Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.3 Level 0 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.4 Level 1 DFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.5 Activity Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

6.1 Gantt Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

iv
CHAPTER 1

INTRODUCTION

1.1 Background

With the rapid growth of technology around us, Machine Learning and Artificial
Intelligence have been used in various sectors to support mankind, including
gesture, object, and face detection. With the help of Deep Learning, a machine
imitates the way humans gain certain types of knowledge. An Artificial Neural
Network simulates the human brain, and convolution layers extract the important
parts of an image to make computation easier. “Sign Language Detection”: the
name itself conveys the gist of the project. Sign language recognition addresses
a major communication barrier faced by deaf and mute people in the community:
most people do not understand sign language, and it is also difficult for them
to learn it. Apart from the academic grades for this project, the core idea is
to make communication easy for deaf people. We set the bar of the project such
that it would be beneficial to society as well. The main reason we chose this
project is to aid people using Artificial Intelligence.

1.2 Problem Definition

Though a lot of research is going on regarding sign language recognition, there
is very little implementation in practical life. From our team's research, we
found that although many sign language recognition software and hardware
products exist, people who do not understand sign language, or cannot read, may
still face problems in communication; moreover, recognition using gloves is
neither portable in practice nor cheap. This made us decide to use image
processing and deep learning for sign language recognition and to provide the
output as voice so that everyone can understand.

1
1.3 Scope

The field of sign language recognition includes the development and application of
techniques for recognizing and interpreting sign language gestures. This involves
using computer vision and machine learning techniques to analyze video input and
identify gestures of sign language users. Sign language recognition has a wide range
of potential applications, including communication aids for deaf people, automatic
translation of sign language into spoken or written language, and an interactive
platform for learning sign language. The scope also extends to improving the
accuracy and efficiency of sign language recognition systems through advances in
algorithms, sensor technology and data collection. Additionally, this scope also
includes addressing challenges related to sign language diversity, gestural variation,
lighting conditions, and the need for robust real-time performance in a variety of
environments.

1.4 Objectives

The following are the objectives for sign language detection.


• To design and implement a system that can understand the sign language of
hearing-impaired people.
• To train the model with a variety of datasets using MediaPipe and CNN, and
provide the output in real-time.
• To recognize sign language and provide the output as voice or text.

1.5 System Requirement

The following is the desired functionality that the proposed system would cover.

1.5.1 Functional Requirements

• Real-time Output.

• Accurate detection of gestures.

• Dataset collection and management.

2
1.5.2 Non-functional Requirement

• Performance Requirement.

• Design Constraints.

• Reliability.

• Usability.

• Maintainability.

3
CHAPTER 2

LITERATURE REVIEW

There are many articles and papers that have been published regarding Sign
Language Detection. Many of them used different algorithms and data sets of
their own. In 1992, researchers developed a camera that could focus on a person’s
hand because the signer wore a glove with markings on the tip of each finger and
later, in 1994, on a ring of color around each joint on the signer’s hand (Starner,
1996).In 1995, Starner began the development of a system that initially involved
the signer wearing two different colored gloves, although eventually no gloves were
required. A camera was placed on a desk or mounted in a cap worn by a signer in
order to capture the movements (Starner, 1996). More recently, a wearable system
has been developed that can function as a limited interpreter (Brashear, Starner,
Lukowicz, and Junker, 2003). To this end, they used a camera vision system
along with wireless accelerometers mounted in a bracelet or watch to measure
hand rotation. Since the early 2000s, ConvNets have been applied with great
success to the detection, segmentation, and recognition of objects and regions
in images. These were all tasks in which labeled data was relatively abundant,
such as traffic sign recognition, the segmentation of biological images
(particularly for connectomics), and the detection of faces, text, pedestrians,
and human bodies in natural images.
A major recent practical success of ConvNets is face recognition. Toshev and
Szegedy proposed a deep learning-based method, which localizes body joints by
solving a regression problem and further improves on estimation precision by using
a cascade of these pose regressors. Their work demonstrates that a general deep
learning-based network originally formed for a classification problem can be fine-
tuned and used to solve localization and detection problems.

5
CHAPTER 3

FEASIBILITY ANALYSIS

3.1 Technical Feasibility

The accuracy of the project depends on how much we have trained the model: the
more data we provide in the dataset, the higher the accuracy that can be
observed. We might face challenges along the way, but we have enough datasets
and problem-solving skills, making the project technically feasible. For the
voice output of the software, we would use Google's voice translation service.

3.2 Economic Feasibility

Since the project is entirely software-based, the only expenditure is
computational power. The datasets we will be using are easily available, and the
computational power would come from our personal computers and smartphones, so
this project is economically feasible.

3.3 Operational Feasibility

The project is operationally feasible since, after completion, it can be
operated as intended by the user to solve the problems for which it has been
developed.

6
CHAPTER 4

BLOCK DIAGRAM

4.1 System Block Diagram

Figure 4.1: System Block Diagram

Description of the working flow of the proposed system


The overall workflow of the system is shown in the above block diagram. The
dataset is like the memory of the system: every detection that we view in real
time is the result of the dataset. Data are captured in real time from the front
camera of the laptop. Using MediaPipe, live perception of simultaneous human
pose, face landmarks, and hand tracking in real time enables various modern
applications, including sign language detection. With the landmarks, or key
points, of the features (face, pose, and hands) that we get from MediaPipe, we
train our model. All the data collected from the datasets and from the deep
learning models are considered training data. These data are provided to the
system so that it can detect sign language in real time. The input to this
system is real-time (live) video from the front

7
camera of the laptop. As the real-time input, i.e. sign language, is provided
through the front camera of the laptop, live output can simultaneously be seen
on the screen in text format. The system acts as an interface for sign language,
providing an environment for input data to be processed into output.
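As a sketch of how the per-frame landmarks could feed the model, the snippet below flattens pose, face, and hand key points into one fixed-length feature vector, zero-filling any part that was not detected. The arrays here are NumPy stand-ins for real MediaPipe output, and the exact vector layout is an illustrative assumption rather than the final implementation:

```python
import numpy as np

# Landmark counts used by MediaPipe Holistic (x, y, z per point;
# pose landmarks also carry a visibility score).
N_POSE, N_FACE, N_HAND = 33, 468, 21

def flatten_keypoints(pose, face, left_hand, right_hand):
    """Concatenate all landmark arrays into one per-frame feature vector.
    A missing part (e.g. a hand out of frame) is replaced by zeros so
    the vector length stays constant for the LSTM."""
    parts = [
        pose.flatten() if pose is not None else np.zeros(N_POSE * 4),
        face.flatten() if face is not None else np.zeros(N_FACE * 3),
        left_hand.flatten() if left_hand is not None else np.zeros(N_HAND * 3),
        right_hand.flatten() if right_hand is not None else np.zeros(N_HAND * 3),
    ]
    return np.concatenate(parts)

# Dummy landmark arrays standing in for real MediaPipe output.
pose = np.random.rand(N_POSE, 4)   # x, y, z, visibility
face = np.random.rand(N_FACE, 3)
left = None                        # left hand not detected
right = np.random.rand(N_HAND, 3)

vec = flatten_keypoints(pose, face, left, right)
print(vec.shape)   # (1662,) -- 33*4 + 468*3 + 21*3 + 21*3
```

A sequence of such vectors, one per frame, would form the input window that the LSTM is trained on.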

8
4.2 Use Case Diagram

Figure 4.2: Use Case Diagram

9
4.3 Level 0 DFD

Figure 4.3: Level 0 DFD

10
4.4 Level 1 DFD

Figure 4.4: Level 1 DFD

11
4.5 Activity Diagram

Figure 4.5: Activity Diagram

12
CHAPTER 5

METHODOLOGY

5.1 Data Collection

In this project, we will collect sign language data from multiple sources to develop
a robust sign language detection system. We intend to work with sign language
schools, collect recorded videos of sign language interpreters, and make use of the
sign language datasets already available to the scientific community. High-quality
video recording tools, including cameras and lighting setups that allow a good
view of hand motions, will be used to collect the data.

5.2 Data Preprocessing

Data preprocessing is the process of preparing raw data and making it suitable
for a machine-learning model. It is the first and most crucial step when
creating a machine-learning model. Real-world data generally contains noise and
missing values, and may be in an unusable format that cannot be used directly by
machine-learning models. Data preprocessing is a required task for cleaning the
data and making it suitable for a machine-learning model, which also increases
the accuracy and efficiency of the model.

5.2.1 Video Acquisition

The video data for sign language detection will be captured using a high-definition
camera with a resolution of 1920x1080 pixels and a frame rate of 30 FPS. The
camera will be positioned to capture the frontal view of the signer’s upper body,
focusing on the hand region.

5.2.2 Video Segmentation

The acquired video data will then be segmented into individual sign language
gestures. We will employ an automatic gesture detection algorithm based on motion

13
and hand region analysis. This algorithm will then detect significant changes in
motion and will use hand-tracking techniques to separate consecutive gestures from
video sequences.
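As an illustrative sketch of this step (the threshold value and the simple frame-difference rule are assumptions, not the final algorithm), contiguous runs of high frame-to-frame motion can be grouped into candidate gesture segments:

```python
import numpy as np

def segment_gestures(frames, threshold=10.0):
    """Split a frame sequence into gesture segments.
    A frame is 'moving' when its mean absolute difference from the
    previous frame exceeds `threshold`; contiguous moving runs are
    returned as (start, end) index pairs."""
    diffs = [np.abs(frames[i].astype(float) - frames[i - 1]).mean()
             for i in range(1, len(frames))]
    segments, start = [], None
    for i, d in enumerate(diffs, start=1):
        if d > threshold and start is None:
            start = i                    # motion begins
        elif d <= threshold and start is not None:
            segments.append((start, i))  # motion ends
            start = None
    if start is not None:
        segments.append((start, len(frames)))
    return segments

# Synthetic clip: still frames, then a burst of motion, then still again.
still = np.zeros((4, 8, 8), dtype=np.uint8)
moving = np.stack([np.full((8, 8), 60 * k, dtype=np.uint8) for k in range(1, 4)])
clip = np.concatenate([still, moving, still])
print(segment_gestures(clip))   # [(4, 8)]
```

A production version would replace the raw pixel difference with the hand-tracking cue described above, but the run-grouping logic stays the same.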

5.2.3 Frame Extraction

From the segmented video data, frames will be extracted at a rate of one frame
per second to capture the key moments of each gesture, guaranteeing a
representative set of frames for further analysis.
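The one-frame-per-second sampling reduces to simple index arithmetic; the 30 FPS figure follows the acquisition setup described in Section 5.2.1:

```python
def sample_frame_indices(n_frames, fps=30, rate_hz=1):
    """Indices of the frames to keep when sampling `rate_hz` frames
    per second from a clip recorded at `fps` frames per second."""
    step = fps // rate_hz
    return list(range(0, n_frames, step))

# A 5-second clip at 30 FPS sampled at 1 frame per second:
print(sample_frame_indices(150))   # [0, 30, 60, 90, 120]
```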

5.2.4 Preprocessing Techniques

Resizing and Cropping

To maintain consistency in the input data, the extracted frames will be resized
to a small resolution, for example 224x224 pixels. Additionally, we will crop
each frame to focus on the hand region, ensuring that irrelevant background
information is eliminated.

Color Conversion

We will convert the frames from RGB to grayscale to reduce the computational
complexity and focus solely on hand shape and motion.

Noise Reduction

Gaussian smoothing with a kernel size of 3x3 will be applied to reduce noise in
the grayscale frames. This will help enhance the clarity of hand contours and
minimize the impact of minor variations caused by lighting conditions.

Contrast Enhancement

Histogram equalization will be applied to the grayscale frames to improve the
visibility of hand features. This will enhance the contrast and increase the
overall dynamic range of pixel intensities.

Normalization

We will use min-max scaling to translate the intensity values from the [0, 255]
range to [0, 1], standardizing the pixel values across frames. By ensuring that
the input data has consistent ranges, this normalization step will help
convergence during model training.
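A minimal NumPy sketch of these steps (grayscale conversion, 3x3 Gaussian smoothing, histogram equalization, and min-max normalization); the luma weights and padding choices are standard textbook values used here for illustration, not a fixed specification of our pipeline:

```python
import numpy as np

def to_grayscale(rgb):
    # ITU-R BT.601 luma weights for RGB -> grayscale.
    return rgb @ np.array([0.299, 0.587, 0.114])

def gaussian_smooth_3x3(img):
    """3x3 Gaussian blur via two separable [1, 2, 1]/4 passes,
    with edge-replicated padding to preserve the image size."""
    k = np.array([1.0, 2.0, 1.0]) / 4.0
    p = np.pad(img, 1, mode="edge")
    rows = p[:, :-2] * k[0] + p[:, 1:-1] * k[1] + p[:, 2:] * k[2]
    return rows[:-2] * k[0] + rows[1:-1] * k[1] + rows[2:] * k[2]

def equalize_histogram(gray):
    """Histogram equalization of a uint8 image via the CDF mapping."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalize to [0, 1]
    return (cdf[gray] * 255).astype(np.uint8)

def min_max_normalize(img):
    # Map [0, 255] intensities to [0, 1] for training stability.
    return img.astype(np.float32) / 255.0

frame = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
gray = to_grayscale(frame).astype(np.uint8)
smooth = gaussian_smooth_3x3(gray.astype(float))
eq = equalize_histogram(gray)
norm = min_max_normalize(eq)
print(norm.min() >= 0.0 and norm.max() <= 1.0)   # True
```

In practice a library routine (e.g. an OpenCV equivalent) would replace each hand-rolled function, but the operations are the same.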

14
5.2.5 Hand Segmentation

Because hand movements are important in sign language, we will use a hand
segmentation technique based on color and region analysis. To separate the hands
from the background and other unimportant items, this technique will use
background subtraction and skin color modeling.
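A rough sketch of combining the two cues; the RGB skin rule and the difference threshold below are illustrative placeholders, not a tuned skin model:

```python
import numpy as np

def skin_mask_rgb(frame):
    """Very rough skin-color threshold in RGB space (an illustrative
    rule, not a production model): skin pixels tend to have R > G > B
    with sufficient brightness."""
    r = frame[..., 0].astype(int)
    g = frame[..., 1].astype(int)
    b = frame[..., 2].astype(int)
    return (r > 95) & (g > 40) & (b > 20) & (r > g) & (g > b) & ((r - b) > 15)

def subtract_background(frame, background, threshold=25):
    """Foreground mask: pixels that differ from a static background
    frame by more than `threshold` in any channel."""
    diff = np.abs(frame.astype(int) - background.astype(int)).max(axis=-1)
    return diff > threshold

# Synthetic scene: a dark background with one skin-colored patch.
background = np.full((16, 16, 3), 30, dtype=np.uint8)
frame = background.copy()
frame[4:10, 4:10] = (200, 120, 80)   # skin-like 6x6 patch

hand = skin_mask_rgb(frame) & subtract_background(frame, background)
print(hand.sum())   # 36 -- the 6x6 patch
```

Intersecting the two masks keeps only moving, skin-colored regions, which is the intent of the combined technique described above.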

5.2.6 Temporal Alignment

Dynamic time warping (DTW) will be used to synchronize frames across different
sign language motions. Because hand movements are temporally similar, DTW will
enable us to align the frames and account for changes in gesture duration and
speed.
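For reference, a textbook DTW distance over 1-D feature sequences; real gestures would use per-frame feature vectors and a vector distance, but the alignment idea is the same:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D feature
    sequences, allowing gestures of different durations to be compared."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

# The same gesture traced at two speeds aligns with zero cost:
fast = [0, 1, 2, 1, 0]
slow = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]
print(dtw_distance(fast, slow))   # 0.0
```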

5.2.7 Data Augmentation

We will use data augmentation techniques to broaden the variety and size of the
training dataset. These will consist of random cropping, rotation, translation,
and flipping of the frames. Data augmentation will strengthen the model's
ability to recognize sign gestures in a variety of situations.
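A small sketch of such augmentations; to stay dependency-free it approximates rotation with shifts and flips (an assumption for illustration, not the final pipeline), and every output keeps the original frame size:

```python
import numpy as np

def augment(frame, rng):
    """Apply one random augmentation: horizontal flip, small translation
    (via roll), or a random crop padded back to the original size."""
    choice = rng.integers(3)
    if choice == 0:                          # horizontal flip
        return frame[:, ::-1]
    if choice == 1:                          # shift up to 3 px each axis
        dy, dx = rng.integers(-3, 4, size=2)
        return np.roll(frame, (dy, dx), axis=(0, 1))
    top, left = rng.integers(0, 5, size=2)   # random 4-px crop margin
    cropped = frame[top:top + frame.shape[0] - 4,
                    left:left + frame.shape[1] - 4]
    return np.pad(cropped, 2, mode="edge")   # pad back to original size

rng = np.random.default_rng(0)
frame = np.arange(64, dtype=np.uint8).reshape(8, 8)
batch = [augment(frame, rng) for _ in range(5)]
print(all(f.shape == frame.shape for f in batch))   # True
```

Keeping the output shape fixed means augmented frames can be fed to the model without any further resizing.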

15
CHAPTER 6

IMPLEMENTATION PLAN

6.1 Gantt Chart

Figure 6.1: Gantt Chart

16
CHAPTER 7

REQUIREMENT ANALYSIS

7.1 Hardware Requirements

The project is fully based on software so there are no hardware requirements except
laptops for coding as well as preparing documents.

7.2 Software Requirements

The software required for the projects are:


Python
Python is a high level language which is used for general purpose programming. It
was developed by Guido van Rossum. The first release of Python was in the year
1991 as Python 0.9.0. Programming paradigms such as structured, object oriented,
and functional programming are supported in Python.

TensorFlow
TensorFlow is a free, open-source library that can be used in the fields of
machine learning and artificial intelligence. Among many other tasks, it can be
used for training deep learning models.

Mediapipe
MediaPipe offers cross-platform, customizable machine learning solutions for
live and streaming media, i.e. real-time video. Its features include end-to-end
acceleration, build-once-deploy-anywhere portability, ready-to-use solutions,
and free and open-source licensing.

17
7.3 User Requirement Definition

The user requirements for this system are to make the system fast, feasible,
less prone to error, and time-saving, and to bridge the communication gap
between hearing people and deaf people.

• The system should have a user-friendly interface.

• The system can translate sign language into text.

18
CHAPTER 8

EXPECTED OUTCOME

• The system must take real-time data input and produce fairly accurate output.

• It helps communication within the community of those who need it.

We expect the outcome to be such that, with real-time video of sign language
input through the front camera of the laptop, we get text output in real time.

19
CHAPTER 9

RESULT AND OUTCOMES

20
REFERENCES

21
