
HAND SIGN LANGUAGE CLASSIFICATION IN REAL TIME

Major Project Progress Report

Submitted in the partial fulfillment of the Degree of

Bachelor of Technology
in

COMPUTER SCIENCE & ENGINEERING


by
JAIVARDHAN DESHWAL JITENDRA SHARMA KARTIKEY GOEL
04515002718 04715002718 04915002718

Guided by

Dr. Kavita Sheoran

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

MAHARAJA SURAJMAL INSTITUTE OF TECHNOLOGY


(AFFILIATED TO GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY, DELHI)
DELHI – 110089

2018-2022
Abstract
The aim of this project is to recognize hand sign language in real time, convert the recognized signs to text, and finally convert that text to speech. In this model, classification machine learning algorithms are trained on a set of images of different hand gestures. Even a small increase in the accuracy of this model can drastically improve its feasibility in a real-world scenario.

As we know, vision-based hand gesture recognition is an important part of human-computer interaction (HCI). For decades, the keyboard and mouse have played the dominant role in human-computer interaction. However, owing to the rapid development of hardware and software, new types of HCI methods are required. In particular, technologies such as speech recognition and gesture recognition are receiving great attention in the field of HCI.

Introduction
Sign Language:

Sign language is a visual way of communicating through hand signals, gestures, facial
expressions, and body language.

Sign language is the primary form of communication for the deaf and hard of hearing
community, but sign language can be useful for other groups of people as well. People with
disabilities, including autism, apraxia of speech, cerebral palsy, and Down syndrome, may also
find sign language beneficial for communication.

Sign language is a visual language. It mainly consists of three major components:

Fingerspelling: words are spelled out character by character using static hand shapes; the static image dataset is used for this purpose.

Word-level sign vocabulary: an entire gesture conveys a word or letter and is recognized through video classification (dynamic input / video classification).

Non-manual features: facial expressions, tongue and mouth movements, and body positions.

Impact:

There have been several advancements in technology, and a lot of research has been done to help people who are deaf or mute. Aiding this cause, deep learning and computer vision can also be used to make an impact.

This can be very helpful for deaf and mute people in communicating with others, since sign language is not commonly known. Moreover, the idea can be extended to automatic editors, where a person can write simply by using hand gestures.

Literature Survey
Outcomes of the research paper survey:

Sign language is an essential tool to bridge the communication gap between hearing and hearing-impaired people. However, the diversity of over 7000 present-day sign languages, with variability in motion, hand shape, and position of body parts, makes automatic sign language recognition (ASLR) a complex problem [1].

The literature review shows the importance of incorporating intelligent solutions into sign language recognition systems and reveals that a perfect intelligent system for sign language recognition is still an open problem [2].

Sign language involves the use of the upper part of the body, such as hand gestures, facial expressions, lip movements, head nodding, and body postures, to convey information, which makes it quite difficult to incorporate every aspect at the same time [3].

Slight changes in a hand gesture can significantly affect the accuracy of the model. This can even result from the shape and size of the user's hand or the distance at which the image is captured [4].

Surface electromyography (sEMG) sensors with wearable hand-gesture devices were the most common acquisition tool in the works studied; the Artificial Neural Network (ANN) was the most widely applied classifier; the most popular application was using hand gestures for sign language; the dominant environmental factor affecting accuracy was the background color; and overfitting on the datasets was frequently encountered [5].

Objectives
1. Prepare the dataset properly so that it can be used efficiently.

2. Build the model using Convolutional Neural Networks (CNN) and Histogram of Oriented Gradients (HOG).

3. Test the model against the dataset.

4. Predict recognized hand signs in real time.

5. Convert the recognized sign language to text.

6. Convert the text to speech using the Google Text-to-Speech API.

Research Methodology
Data used for the project:

• Model training uses a large ASL dataset.

• We use a self-made dataset of different hand gestures, containing at least 15 classes with 1500-2000 images per class, for testing.

Processing of Data

• The data is collected in the form of images, which are then cropped, reduced in resolution, and converted to grayscale to obtain better results.

Algorithms Used

• Convolutional Neural Networks (CNN)

• Histogram of Oriented Gradients (HOG) (a feature-extraction sketch follows this list)

• Google Text-to-Speech (gTTS)
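The report only names the HOG technique, not a specific implementation, so the following is a minimal sketch of how HOG features might be extracted with scikit-image; the file name and parameter values are illustrative assumptions.

```python
# Hedged sketch: HOG feature extraction with scikit-image (an assumed library
# choice; the report names the HOG technique but not an implementation).
import cv2
from skimage.feature import hog

# "gesture.png" is an illustrative file name, not part of the actual dataset.
image = cv2.imread("gesture.png", cv2.IMREAD_GRAYSCALE)
assert image is not None, "image could not be read"

# Common default HOG parameters, not the project's tuned settings.
features = hog(
    image,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    block_norm="L2-Hys",
)
print(features.shape)  # a 1-D feature vector a downstream classifier can consume
```

The resulting feature vector could be fed to a conventional classifier as an alternative or complement to the CNN.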

Methodology of Study:

Work done:

1. Creating the Dataset: To recognize the hand signs, we need to build a model with as high an accuracy as possible, and the major factor behind accuracy is always the dataset. So we are building our own dataset from scratch. The steps involved are:

1.1 Clicking Photos: A simple Python script runs every few seconds to capture photos (a rough sketch is given at the end of section 1.1). We use the OpenCV and NumPy libraries to process the images.

1.1.1. OpenCV-Python is a library of Python bindings designed to solve computer vision problems. The cv2.imread() method loads an image from the specified file; if the image cannot be read (because of a missing file, improper permissions, or an unsupported or invalid format), the method returns an empty result (None) instead of an image matrix.

1.1.2. NumPy is the fundamental package for scientific computing in Python. It provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, and shape-manipulation operations, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistics, random simulation, and much more. At the core of the NumPy package is the ndarray object, which encapsulates n-dimensional arrays of homogeneous data types, with many operations performed in compiled code for performance.
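A rough sketch of such a capture script is shown below; the capture interval, number of photos, and output folder are assumptions made for illustration rather than the exact values used in the project.

```python
# Hedged sketch of the photo-clicking script (step 1.1): grab a webcam frame
# every few seconds and save it to disk. Interval, photo count, and folder
# name are illustrative assumptions.
import os
import time
import cv2

os.makedirs("raw", exist_ok=True)   # assumed folder for unprocessed frames

cap = cv2.VideoCapture(0)           # default webcam
interval_seconds = 3                # assumed capture interval
num_photos = 100                    # assumed number of photos per session

try:
    for count in range(num_photos):
        ok, frame = cap.read()
        if not ok:                  # camera unavailable or stream ended
            break
        cv2.imwrite(os.path.join("raw", f"frame_{count:04d}.png"), frame)
        time.sleep(interval_seconds)
finally:
    cap.release()
```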

1.2 Cropping Photos: We only want the hand sign in the frame, so a crop of 300 pixels in width by 400 pixels in height was sufficient to capture the full hand against a clear background.

1.3 Blurring the Image: When we blur an image, we make the color transition from one side of an edge to the other smooth rather than sudden; the effect is to average out rapid changes in pixel intensity. A blur is a very common operation performed before other tasks such as thresholding. There are several different blurring functions, and we use the Gaussian blur.

1.4 Converting the Image to Grayscale: A grayscale (or gray-level) image is one in which the only colors are shades of gray. The reason for distinguishing such images from other color images is that less information needs to be provided for each pixel.

1.4.1. In fact, a 'gray' color is one in which the red, green, and blue components all have equal intensity in RGB space, so it is only necessary to specify a single intensity value for each pixel, as opposed to the three intensities needed to specify each pixel in a full-color image. Often, the grayscale intensity is stored as an 8-bit integer, giving 256 possible shades of gray from black to white.

1.4.2. If the levels are evenly spaced, then the difference between successive gray levels is significantly better than the gray-level resolving power of the human eye. Grayscale images are very common, in part because much of today's display and image-capture hardware can only support 8-bit images. In addition, grayscale images are entirely sufficient for many tasks, so there is no need to use more complicated and harder-to-process color images.

1.5 Saving the Image to a Directory: We split the dataset into training and testing data, and the photos we capture and process are saved accordingly. When we press a key, the image captured at that instant is saved into the directory labelled with that key. This keeps the dataset clean and easy to use.
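Steps 1.2 to 1.5 can be summarised in a short preprocessing routine like the sketch below; the crop window, blur kernel size, and directory layout are illustrative assumptions rather than the project's exact settings.

```python
# Hedged sketch of steps 1.2-1.5: crop the hand region, apply a Gaussian blur,
# convert to grayscale, and save into a directory labelled with the pressed key.
import os
import cv2

def preprocess_and_save(frame, label, index, out_root="dataset/train"):
    roi = frame[0:400, 0:300]                          # assumed 300x400 hand region
    blurred = cv2.GaussianBlur(roi, (5, 5), 0)         # smooth rapid intensity changes
    gray = cv2.cvtColor(blurred, cv2.COLOR_BGR2GRAY)   # one 8-bit value per pixel

    label_dir = os.path.join(out_root, label)          # folder named after the key pressed
    os.makedirs(label_dir, exist_ok=True)
    cv2.imwrite(os.path.join(label_dir, f"{index:04d}.png"), gray)
```

For example, a call such as preprocess_and_save(frame, "A", 12) would store the processed image as dataset/train/A/0012.png.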

Figure: Final working of the scripts used to generate the dataset.

2. Creating a Model: Our own dataset is not yet fully prepared; it still requires several changes in image intensity and contrast, as well as augmentation to multiply the number of samples. We therefore start with an existing dataset.

2.1 Getting an ASL classification dataset from Kaggle to train a basic model initially. This is done to build a model that can detect the hand signs in dim lighting without the preprocessing done on the images, for better results.
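One plausible way to feed the Kaggle ASL images to the model is sketched below with Keras' ImageDataGenerator; the directory name, image size, batch size, and validation split are illustrative assumptions, not the project's actual configuration.

```python
# Hedged sketch: loading the Kaggle ASL images for the initial model.
# Assumed layout: one sub-folder per class inside "asl_alphabet_train".
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)

train_gen = datagen.flow_from_directory(
    "asl_alphabet_train",
    target_size=(64, 64),        # assumed input size, matching the model sketch below
    color_mode="grayscale",
    class_mode="categorical",
    batch_size=32,
    subset="training",
)
val_gen = datagen.flow_from_directory(
    "asl_alphabet_train",
    target_size=(64, 64),
    color_mode="grayscale",
    class_mode="categorical",
    batch_size=32,
    subset="validation",
)
```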

2.2 Making the Model: The steps to make the model include the following (a hedged Keras sketch of the full layer sequence is given after step 2.2.4):

2.2.1. Convolution Layer: In the convolution layer we take a small window (typically 5×5) that extends through the depth of the input matrix. The layer consists of learnable filters of this window size. During every iteration we slide the window by the stride size (typically 1) and compute the dot product of the filter entries and the input values at each position. As we continue this process, we create a 2-dimensional activation map that gives the response of that filter at every spatial position. That is, the network will learn filters that activate when they see some type of visual feature, such as an edge of some orientation or a blotch of some color.

2.2.2. Pooling Layer: We use a pooling layer to decrease the size of the activation matrix and ultimately reduce the number of learnable parameters.

2.2.3. Fully Connected Layer: In a convolution layer, neurons are connected only to a local region, whereas in a fully connected layer we connect all the inputs to every neuron.

2.2.4. Final Output Layer: After getting values from the fully connected layer, we connect them to a final layer of neurons (with a count equal to the total number of classes) that predicts the probability of each image belonging to each class.
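Under the assumption of 64×64 grayscale inputs, roughly 15 classes, and Keras as the framework (none of which is fixed by this report), the layer sequence of 2.2.1-2.2.4 could look like the sketch below.

```python
# Hedged sketch of the layer sequence in 2.2.1-2.2.4 (Keras assumed; input size,
# filter counts, and class count are illustrative, not final hyperparameters).
from tensorflow.keras import layers, models

num_classes = 15  # assumed number of gesture classes

model = models.Sequential([
    layers.Conv2D(32, (5, 5), activation="relu", input_shape=(64, 64, 1)),  # convolution layer
    layers.MaxPooling2D((2, 2)),                                            # pooling layer
    layers.Conv2D(64, (5, 5), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),                # fully connected layer
    layers.Dense(num_classes, activation="softmax"),     # final output layer: class probabilities
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```

Training then amounts to something like model.fit(train_gen, validation_data=val_gen, epochs=10) using the data generators sketched in section 2.1.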

The current model, without any tweaks or optimization, gives 78% accuracy.

Work left to do:


• Multiplying the dataset by varying intensity, contrast, and lighting.

• Training the model on the self-made dataset of hand signs.

• Tweaking the model's parameters to improve prediction accuracy.

• Applying the model in real-time software to generate results.

• Converting the recognized text to speech (a minimal gTTS sketch follows this list).
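A minimal sketch of the planned text-to-speech step with gTTS is given below; writing the audio to an MP3 file and playing it back with an external audio library are assumptions for illustration, since the report only names the Google Text-to-Speech API.

```python
# Hedged sketch of the text-to-speech step using gTTS. The output file name
# and the playback library are illustrative assumptions.
from gtts import gTTS

recognized_text = "HELLO"                      # text produced by the sign classifier
speech = gTTS(text=recognized_text, lang="en")
speech.save("output.mp3")                      # write the synthesised audio to disk

# Playback could then be handled by any audio library, for example:
# from playsound import playsound
# playsound("output.mp3")
```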

Gantt chart

Gantt charts are useful for planning and scheduling projects. They help you assess how long a
project should take, determine the resources needed, and plan the order in which you'll complete
tasks.

References

[1] R. Z. Khan and N. Ibraheem, "Hand Gesture Recognition: A Literature Review," International Journal of Artificial Intelligence & Applications (IJAIA), vol. 3, pp. 161-174, 2012, doi: 10.5121/ijaia.2012.3412.

[2] J. H. Sun, T. T. Ji, S. B. Zhang, J. K. Yang and G. R. Ji, "Research on the Hand Gesture
Recognition Based on Deep Learning," 2018 12th International Symposium on Antennas,
Propagation and EM Theory (ISAPE), 2018, pp. 1-4, doi: 10.1109/ISAPE.2018.8634348.

[3] H. Y. Chung, Y. L. Chung and W. F. Tsai, "An Efficient Hand Gesture Recognition System
Based on Deep CNN," 2019 IEEE International Conference on Industrial Technology (ICIT),
2019, pp. 853-858, doi: 10.1109/ICIT.2019.8755038.

[4] P. Choudhary and S. N. Tazi, "An Adaptive System of Yogic Gesture Recognition for
Human Computer Interaction," 2020 IEEE 15th International Conference on Industrial and
Information Systems (ICIIS), 2020, pp. 399-402, doi: 10.1109/ICIIS51140.2020.9342678.

[5] Yuhui Z., Shuo J. and Peter B. S., "Wrist-worn hand gesture recognition based on barometric pressure sensing," 2018 International Conference on Wearable and Implantable Body Sensor Networks (BSN), Piscataway, NJ: IEEE, 2018, pp. 181-184.
