
D – TALK

A Sign Language Recognition System for People with Disability

A MINI-PROJECT REPORT

Computer Science and Engineering


by

Rida Mumtaz (19268) Samiksha Rathore (19251)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


KNIT SULTANPUR
UTTAR PRADESH, INDIA
4TH Semester, 2020
SULTANPUR, UTTAR PRADESH, 228001

Department of Computer Science and Engineering

CERTIFICATE

Certified that the mini-project work entitled “D-Talk” is a bona fide work carried out by

Samiksha Rathore 19251

Rida Mumtaz 19268

The report has been approved as it satisfies the academic requirements in respect of mini-project
work prescribed for the course.

……………...…………………………

Mr Rafeeq Ahmad

Head of the Department


ACKNOWLEDGEMENTS

We are pleased to acknowledge Prof. Rafeeq Ahmed for his invaluable guidance
during the course of this project work; without his guidance, this project would
have been an uphill task.
We would also like to thank Google for releasing the very useful Inception V3
model for image analysis and detection under an open-source licence, which
greatly helped us in writing the training part. Inception V3 is the third edition of
Google's Inception convolutional neural network, originally introduced during
the ImageNet Recognition Challenge.

Rida Mumtaz (19268) Samiksha Rathore (19251)


CONTENTS

1. INTRODUCTION
   1.1 Overview
   1.2 Objective
   1.3 Introduction
   1.4 Motivation

2. DESIGN
   2.1 Methodology

3. REQUIREMENTS SPECIFICATIONS
   3.1 Software Used
   3.2 Libraries Used

4. FLOWCHART
5. CODING
6. TRAINING
7. TESTING
8. RESULT
9. CONCLUSION
10. REFERENCES
OVERVIEW

This report discusses the results of the work done in the development of "D-Talk", with
the aim of promoting transfer learning and deep learning for the betterment of society.
It is part of the 2nd-year project submission in the Computer Science Department,
KNIT Sultanpur.

OBJECTIVE

The final goal of the project was twofold:


1. To work with real-life implementations of image and object recognition using one of
the most vital design methodologies of deep learning, TRANSFER LEARNING, and to
use transfer learning as a method that allows us to apply knowledge gained from other
tasks in order to tackle new but similar problems quickly and effectively.
2. To improve the model accuracy by evaluating the model on test data.

Along with the above main goals, we aimed to design the target platform ourselves in
order to put forward a sign-language recognition system for people with
disability.
INTRODUCTION

This project aims to use machine learning, a sub-field of artificial intelligence, to help people who
are unable to do what most people do in their everyday lives, i.e., communicate.

Speech impairment is a disability which affects one's ability to speak and hear. Such individuals
use sign language to communicate with other people. Sign language is an important part of life
for deaf and mute people. They rely on it for everyday communication with their peers. A sign
language consists of a well-structured code of signs and gestures, each of which has a particular
meaning assigned to it. Sign languages have their own grammar and lexicon, and they combine
hand positions, shapes and movements of the hand. People who know sign language can
communicate with each other efficiently. Although it is an effective form of communication,
it remains a challenge for people who do not understand sign language to communicate with
speech-impaired people.

The aim of this project is to develop an application which will translate sign language into English
text, thus aiding communication with sign language users. The application acquires
image data using the webcam of the computer; the images are then preprocessed using a combinational
algorithm, and recognition is done using the InceptionV3 model as the basis. The translation is
output as text. We feel that a disability should not become a hindrance to achieving one's
goals. Adding these users to the workforce will only improve the socio-economic development of the
country.

There is no universal sign language for deaf and mute people. India's National Association of the
Deaf estimates that there are 18 million people in India with hearing impairment. This project
implements a system which translates Indian Sign Language gestures into their English language
interpretation.
MOTIVATION

Communication plays a significant role in making the world a better place. Communication
creates bonds and relations among people, whether personal, social, or political. Disability is an
emotive human condition that limits the individual to a certain level of performance. Being deaf
and mute can push a person towards isolation and introversion. In a world of inequality, this
community needs empowerment, and harnessing technology to improve their welfare is
necessary. This community has long used sign language; however, most people are unable to
understand this language, making communication intricate.

Deaf and mute people usually depend on sign language interpreters for communication.
However, finding a good interpreter is difficult and often expensive.

This application aims to reduce the cost and effort of finding a good enough interpreter and
makes the whole process of communication easy and much more comfortable for its users. Also,
providing them with this much-needed basic conversational tool will make them feel
independent and confident.
METHODOLOGY

The proposed system consists of three main stages:


1. Preparing Dataset
2. Retraining the model
3. Sign recognition and displaying output.

We are using the following model as our basis:


Inception-v3 - Inception V3 is a convolutional neural network, 48 layers deep, that has been
pretrained on the ImageNet dataset.
We will use transfer learning. With transfer learning, when we build a new model to classify
our own dataset, we reuse the feature-extraction part of the pretrained network and retrain only
the classification part on our dataset. Since we do not have to train the feature-extraction part
(which is the most complex part of the model), we can train the model with less computational
resources and training time.
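The following is a minimal sketch of this transfer-learning setup using tf.keras (an illustration only, not the project's actual train.py, which follows TensorFlow's image-retraining approach; the class count of 28 is an assumption covering 26 letters plus the space and delete gestures):

import tensorflow as tf

# Load Inception v3 pretrained on ImageNet, dropping its original classification layer.
base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False  # freeze the feature-extraction part

NUM_CLASSES = 28  # assumed: 26 letters plus "space" and "delete"
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # new classification part
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

Only the weights of the new Dense layer are updated during training; the frozen base acts purely as a feature extractor.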

1. Preparing the Dataset

First, we prepare a dataset consisting of the alphabet and other hand gestures required for
communication, e.g. space and delete (for sentence formation). The dataset consists of around
3,000 images for each sub-directory of alphabets and other signs. The test data is 10 percent of
the whole dataset; the rest is used to train the model.
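As a rough illustration of this split (the directory layout and names are assumptions), the per-label 90/10 division could be done like this:

import os
import random

IMAGE_DIR = "dataset"     # assumed layout: one sub-directory per label (a/ ... z/, space/, delete/)
TEST_FRACTION = 0.10      # 10 percent of each label is held out for testing

splits = {}
for label in sorted(os.listdir(IMAGE_DIR)):
    images = sorted(os.listdir(os.path.join(IMAGE_DIR, label)))
    random.shuffle(images)                       # shuffle so the held-out set is not biased
    n_test = int(len(images) * TEST_FRACTION)
    splits[label] = {"testing": images[:n_test], "training": images[n_test:]}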

2. Retraining the Model

On the basis of the dataset we have collected, we fine-tune the last layer of the model.
Fine-tuning is the process of taking a network model that has already been trained for a given
task and making it perform a second, similar task. For the dataset mentioned, it takes around 4-5
hours to retrain the model.
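Retraining is typically launched from the command line. A hypothetical invocation, with flags modelled on TensorFlow's image-retraining example (the flags actually accepted by train.py may differ):

python train.py --image_dir=dataset --learning_rate=0.01 --how_many_training_steps=4000 --bottleneck_dir=bottlenecks --output_graph=retrained_graph.pb --output_labels=retrained_labels.txt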
3. Sign Recognition and Displaying Output

The camera is accessed using the cv2.VideoCapture() function of the cv2 module, and a
rectangular frame is drawn inside the camera window when the application is running. On
detecting a hand inside the rectangular box, the model calculates the probability of it being each
particular letter/sign. The letter with the highest probability is shown in the display window. The
application stops detection when either the Esc key or CTRL+C is pressed.
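A minimal sketch of this capture loop with OpenCV (the box coordinates and window name are assumptions; the classification call itself is omitted here):

import cv2

cap = cv2.VideoCapture(0)                    # open the default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    x1, y1, x2, y2 = 100, 100, 300, 300      # region the hand must be placed in (assumed coordinates)
    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    roi = frame[y1:y2, x1:x2]                # this crop is what would be passed to the classifier
    cv2.imshow("D-Talk", frame)
    if cv2.waitKey(1) & 0xFF == 27:          # 27 is the Esc key
        break
cap.release()
cv2.destroyAllWindows()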
SOFTWARE AND LIBRARIES USED

• Prerequisites

• A local development environment for Python 3 with at least 1 GB of RAM.

• A working webcam to do real-time image detection.

• Python 3.5
Python is an interpreted, high-level, general-purpose programming language. We
chose this language for our project because it is nicely implemented and easy to debug,
which makes it easier to build models for machine learning. It also has various
supporting libraries useful for image processing and training a convolutional neural
network.

• InceptionV3 model
Inception v3 is a convolutional neural network for classification that is 48 layers deep.
This pretrained network can classify images into 1000 object categories (e.g. keyboard,
mouse, pencil, and many animals) with a high accuracy rate.

• Python Libraries or Modules

a. NumPy - NumPy is a library for the Python programming language, adding
   support for large, multi-dimensional arrays and matrices, along with a large
   collection of high-level mathematical functions to operate on these arrays.

b. TensorFlow - TensorFlow is a free and open-source software library for dataflow
   and differentiable programming across a range of tasks. It is a symbolic math
   library, and is also used for machine learning applications such as neural
   networks.

c. OpenCV-Python - OpenCV (Open Source Computer Vision Library) is an open
   source computer vision and machine learning software library. OpenCV is used to
   provide a common infrastructure for computer vision applications and to
   accelerate the use of machine learning.
FLOWCHART
CODE EXPLANATION

The project contains two modules: train.py and classify_webcam.py.

TRAIN.PY

• IMPORTING THE INCEPTIONV3 MODEL


In the first part of the train.py module we import the requisite libraries for
the model; next, we download the InceptionV3 model from the internet.

• CREATING IMAGE LISTS

Brief:
Builds a list of training images from the file system. Analyzes the sub-folders in the
image directory, splits them into training, testing, and validation sets, and returns a data
structure describing the lists of images for each label and their paths.
Args:
image_dir: String path to a folder containing subfolders of images.
testing_percentage: Integer percentage of the images to reserve for tests.
validation_percentage: Integer percentage of images reserved for validation.
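For illustration, the returned structure has roughly the following shape (file names and exact keys are assumptions based on the description above):

image_lists = {
    "a": {
        "dir": "a",                                  # sub-folder name under image_dir
        "training":   ["a_0001.jpg", "a_0002.jpg"],  # bulk of the images
        "testing":    ["a_2901.jpg"],                # reserved testing_percentage
        "validation": ["a_2951.jpg"],                # reserved validation_percentage
    },
    "space": {
        "dir": "space",
        "training":   ["space_0001.jpg"],
        "testing":    ["space_0300.jpg"],
        "validation": ["space_0310.jpg"],
    },
}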

• GETTING THE IMAGE PATH

Brief:
Returns a path to an image for a label at the given index.
Args:
image_lists: Dictionary of training images for each label.
label_name: Label string we want to get an image for.
index: Int offset of the image we want. This will be moduloed by the
available number of images for the label, so it can be arbitrarily large.
image_dir: Root folder string of the subfolders containing the training images.
category: Name string of set to pull images from - training, testing, or validation.
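A sketch of this lookup (names taken from the brief above; the project's actual implementation may differ slightly):

import os

def get_image_path(image_lists, label_name, index, image_dir, category):
    label_lists = image_lists[label_name]
    category_list = label_lists[category]        # 'training', 'testing' or 'validation'
    mod_index = index % len(category_list)       # index is moduloed, so it can be arbitrarily large
    base_name = category_list[mod_index]
    return os.path.join(image_dir, label_lists["dir"], base_name)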

• DOWNLOADING AND EXTRACTING THE INCEPTIONV3 MODEL

Brief:
Download and extract model tar file.
If the pretrained model we're using doesn't already exist, this function
downloads it from the TensorFlow.org website and unpacks it into a directory.
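A minimal sketch of this step (the URL is the one used by TensorFlow's classic Inception examples and is assumed here; directory names are placeholders):

import os
import tarfile
import urllib.request

DATA_URL = "http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz"
MODEL_DIR = "imagenet"

def maybe_download_and_extract():
    os.makedirs(MODEL_DIR, exist_ok=True)
    filepath = os.path.join(MODEL_DIR, os.path.basename(DATA_URL))
    if not os.path.exists(filepath):                       # skip the download if already present
        urllib.request.urlretrieve(DATA_URL, filepath)
    tarfile.open(filepath, "r:gz").extractall(MODEL_DIR)   # unpack the .tgz archive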


• CREATING THE BOTTLENECKS

Brief:

Retrieves or calculates bottleneck values for an image. If a cached version of the bottleneck data
exists on-disk, return that, otherwise calculate the data and save it to disk for future use.

Args:

sess: The current active TensorFlow Session.


image_lists: Dictionary of training images for each label.
label_name: Label string we want to get an image for.
index: Integer offset of the image we want. This will be modulo-ed by the
available number of images for the label, so it can be arbitrarily large.
image_dir: Root folder string of the subfolders containing the training
images.
category: Name string of which set to pull images from - training, testing,
or validation.
bottleneck_dir: Folder string holding cached files of bottleneck values.
jpeg_data_tensor: The tensor to feed loaded jpeg data into.
bottleneck_tensor: The output tensor for the bottleneck values.
Returns:
Numpy array of values produced by the bottleneck layer for the image.
• CACHING BOTTLENECKS

Ensures all the training, testing, and validation bottlenecks are cached. Because we're likely to
read the same image multiple times (if there are no distortions applied during training) it can
speed things up a lot if we calculate the bottleneck layer values once for each image during
preprocessing, and then just read those cached values repeatedly during training. Here we go
through all the images we've found, calculate those values, and save them off.
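The caching behaviour described above can be sketched as follows (a hypothetical helper; the real train.py stores bottlenecks in its own format):

import os
import numpy as np

def get_or_create_bottleneck(bottleneck_path, image, feature_extractor):
    # Reuse the cached bottleneck values if they exist on disk,
    # otherwise compute them once and cache them for later epochs.
    if os.path.exists(bottleneck_path):
        with open(bottleneck_path) as f:                       # cached as comma-separated text
            return np.array([float(x) for x in f.read().split(",")])
    values = np.asarray(feature_extractor(image)).ravel()      # run the image through the bottleneck layer
    with open(bottleneck_path, "w") as f:
        f.write(",".join(str(v) for v in values))
    return values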
• ADDING DISTORTIONS TO IMAGES

Creates the operations to apply the specified distortions. During training it can help to improve
the results if we run the images through simple distortions like crops, scales, and flips. These
reflect the kind of variations we expect in the real world, and so can help train the model to cope
with natural data more effectively. Here we take the supplied parameters and construct a network
of operations to apply them to an image.
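These distortions can be sketched with tf.image operations (an illustration only; the distortion parameters are assumptions, and 299x299 is Inception v3's input size):

import tensorflow as tf

def distort_image(image):
    # image: a float32 tensor of shape [height, width, 3]
    image = tf.image.random_flip_left_right(image)             # horizontal flip
    image = tf.image.random_brightness(image, max_delta=0.2)   # lighting variation
    image = tf.image.resize(image, [320, 320])                 # upscale slightly...
    image = tf.image.random_crop(image, size=[299, 299, 3])    # ...then crop to vary scale/position
    return image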
• RETRAINING THE TOP LAYER OF THE MODEL

Adds a new softmax and fully-connected layer for training. We need to retrain the top layer to
identify our new classes, so this function adds the right operations to the graph, along with some
variables to hold the weights, and then sets up all the gradients for the backward pass.
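In TF2-style code, the new fully-connected + softmax layer and one gradient step could look like this (a sketch assuming 2048-dimensional bottlenecks and 28 classes; the actual train.py is written against the TF1 graph API):

import tensorflow as tf

BOTTLENECK_SIZE = 2048   # size of Inception v3's bottleneck (pool_3) output
NUM_CLASSES = 28         # assumed: 26 letters plus "space" and "delete"

weights = tf.Variable(tf.random.truncated_normal([BOTTLENECK_SIZE, NUM_CLASSES], stddev=0.001))
biases = tf.Variable(tf.zeros([NUM_CLASSES]))
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

def train_step(bottlenecks, labels):
    # bottlenecks: float32 [batch, 2048], labels: int [batch]
    with tf.GradientTape() as tape:
        logits = tf.matmul(bottlenecks, weights) + biases      # new fully-connected layer
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    grads = tape.gradient(loss, [weights, biases])             # gradients for the backward pass
    optimizer.apply_gradients(zip(grads, [weights, biases]))
    return loss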
• MAIN FUNCTION - THE DRIVER FUNCTION OF THE TRAIN.PY MODULE
CLASSIFY_WEBCAM.PY

• Importing the requisite libraries and disabling the TensorFlow warnings.

• Predicting the image label: the script shows the predicted labels in order of confidence.

• User interface and showing the output for the image: the script feeds the image_data as input
and gets the top prediction. It runs a TensorFlow session until the Esc key is pressed and
displays two windows: the first contains the webcam feed, while the other displays the
predicted output for the image shown inside the rectangular box of the webcam window.
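Ordering the class scores by confidence can be sketched as follows (a hypothetical helper; predictions is a 1-D array of softmax scores and labels the list of class names):

import numpy as np

def top_prediction(predictions, labels):
    order = np.argsort(predictions)[::-1]          # indices sorted by descending confidence
    for idx in order[:5]:                          # report the top-5 guesses
        print(f"{labels[idx]}: {predictions[idx]:.3f}")
    return labels[order[0]]                        # best guess, shown in the display window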
RESULTS

We tried different learning rates so as to minimize the loss function and maximize the accuracy of
the predictions. The model showed the highest accuracy when the learning rate was equal to 0.01.
It was able to predict the hand gestures correctly, with some minor errors due to environmental
factors such as lighting, background, etc.

Printing the validation accuracies


When there is no hand gesture inside the rectangular box

Showing results according to the hand gesture in the rectangular box


CONCLUSION

Environmental conditions such as lighting can play a role in prediction. Light that is either
too bright or too dim results in inaccurate hand segmentation and, in turn, inaccurate gesture
prediction. Inaccuracy can also emerge from the user's peripherals, such as poor webcam
performance. Apart from these minor error factors, the model serves its purpose well.

The development of technology is essential, and its deployment in sign language is highly
critical. It will bring efficiency to communication, not only for the deaf and mute but also for
those with the ability to hear and speak. In addition to creating opportunities for their career
growth, it will enhance their social life through effective communication.

Making an impact and changing the lives of the deaf and mute through technology is an
innovation worth the time and resources.
REFERENCES

• MIT Deep Learning Course 2020
• Google for problem solving
• https://stackoverflow.com/
• Coursera Deep Learning course by Andrew Ng
• Python Machine Learning book by Sebastian Raschka
• Purva C. Badhe, Vaishali Kulkarni, "Indian Sign Language
  Translator using Gesture Recognition Algorithm", 2015
  IEEE International Conference on Computer Graphics,
  Vision and Information Security (CGVIS)
