
SIGN LANGUAGE RECOGNITION

ASL Recognition with MediaPipe and Recurrent
Neural Networks

Bachelor Thesis

Author:
Antonio Domènech L.

Date:
28. July 2020

Tutor:
Prof. Dr. Alexander Ferrein
Abstract

The recognition of sign language has been a challenge for more than twenty years, and in the last decade some solutions, like translating gloves or complex systems with several cameras, have been able to accomplish partial or full recognition.

Contrary to previous technologies, this research shows that nowadays there is no need for complex and expensive hardware in order to recognize sign language; only a modern mobile phone or a computer camera is required. This is accomplished by using Google's MediaPipe framework, released in 2019, and recurrent neural networks (RNN).

Therefore, this paper proves that it is possible to recognize four different gestures (hello, no, sign and understand) with an accuracy of 92%, in real time, and with a mobile phone or computer camera.

Acknowledgments

First of all, this project could not have been possible without the help of several friends and family members who patiently recorded videos for the database and encouraged other volunteers to help as well. In particular, I would like to mention my colleagues, Evgeny and Saida, and my mom, Montse, who have made more than 200 videos in total and asked other people to collaborate.

I also thank my tutor, Prof. Dr. Alexander Ferrein, who, although I contacted him from another university, accepted my proposal and gave me complete freedom and support to develop the research.

Finally, I would also like to mention the importance of open source technologies. This research has been possible thanks to the creators of MediaPipe, who developed a complete hand tracking system and made it freely available, and to the programming community that constantly writes projects and code changes for everyone to read on the official MediaPipe GitHub page.

Contents

List of Figures vi

List of Tables viii

Glossary x

1 Introduction 1

2 Related work 3

2.1 Contour detection approach . . . . . . . . . . . . . . . . . . . . . . . 3

2.2 RNN combined with another AI approach . . . . . . . . . . . . . . . 5

2.3 HMM (Hidden Markov Model) approach . . . . . . . . . . . . . . . . 6

2.4 Glove approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.5 Other approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Technology Background 11

3.1 Coding languages . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.2 AutoKeras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.3 MediaPipe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.4 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . 16

3.5 RNN with Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17


4 Sign language recognition 19

4.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.2 American Sign Language (ASL) . . . . . . . . . . . . . . . . . . . . . 20

4.3 Sign selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.4 Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.4.1 Video recording . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.4.2 Video augmentation . . . . . . . . . . . . . . . . . . . . . . . 22

4.4.3 Hand landmarks extraction . . . . . . . . . . . . . . . . . . . 22

4.5 RNN with Keras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.6 Using the model in real time . . . . . . . . . . . . . . . . . . . . . . . 26

5 Evaluation 29

5.1 No - Understand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

5.2 Hello - Understand . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5.3 Hello - Sign . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5.4 Hello - No - Understand . . . . . . . . . . . . . . . . . . . . . . . . . 32

5.5 Hello - No - Sign - Understand . . . . . . . . . . . . . . . . . . . . . . 32

5.6 Performance in real time . . . . . . . . . . . . . . . . . . . . . . . . . 33

5.7 Parameters of the AI . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.8 Comparison with other approaches . . . . . . . . . . . . . . . . . . . 34

6 Summary and Outlook 37

6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

6.2 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

6.3 Further research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

Bibliography 39

List of Figures

2.1 Contour detection with OpenCV . . . . . . . . . . . . . . . . . . . . 4

2.2 Contour detection with OpenCV (multiple hands) . . . . . . . . . . . 4

2.3 Hand detection with background removal . . . . . . . . . . . . . . . . 5

2.4 Gloves for gesture recognition . . . . . . . . . . . . . . . . . . . . . . 8

2.5 SignAll system in an office . . . . . . . . . . . . . . . . . . . . . . . . 9

2.6 Accuracy comparison of lip recognition methods . . . . . . . . . . . . 9

3.1 AutoKeras system architecture . . . . . . . . . . . . . . . . . . . . . . 12

3.2 Google MediaPipe detection examples . . . . . . . . . . . . . . . . . . 13

3.3 Application of a kernel in a layer . . . . . . . . . . . . . . . . . . . . 14

3.4 Google MediaPipe’s graph for Multi-hand tracking . . . . . . . . . . 15

3.5 Architecture of a RNN . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.6 Comparison between Adam and other optimizers . . . . . . . . . . . 18

4.1 Steps followed to build a gesture recognition model. . . . . . . . . . . 19

4.2 Video augmentation possibilities from the code used . . . . . . . . . . 23

4.3 Input folder architecture for a correct reading of videos. . . . . . . . . 24

4.4 Output folder architecture created by MediaPipe. . . . . . . . . . . . 24

4.5 Position of the landmarks in a text file. . . . . . . . . . . . . . . . . . 24

4.6 RNN architecture with 3 hidden layers . . . . . . . . . . . . . . . . . 26


4.7 Block diagram of the real time recognition process . . . . . . . . . . 28

5.1 Accuracy and loss: signs no and understand . . . . . . . . . . . . . . 30

5.2 Accuracy and loss: signs hello and understand . . . . . . . . . . . . . 30

5.3 Accuracy and loss: signs hello and sign . . . . . . . . . . . . . . . . . 31

5.4 Accuracy and loss: signs hello, no and understand . . . . . . . . . . . 32

5.5 Accuracy and loss: signs hello, no, sign and understand . . . . . . . . 33

List of Tables

2.1 Accuracy comparison between LSTM and other methods . . . . . . . 5

2.2 Accuracy of different tests with a HMM . . . . . . . . . . . . . . . . 6

2.3 Accuracy of different methods with a glove approach . . . . . . . . . 7

4.1 Characteristics of the cameras . . . . . . . . . . . . . . . . . . . . . . 21

4.2 Parameter values of the AI . . . . . . . . . . . . . . . . . . . . . . . . 25

4.3 Computer characteristics . . . . . . . . . . . . . . . . . . . . . . . . . 27

5.1 Input and output comparison . . . . . . . . . . . . . . . . . . . . . . 34

5.2 AI parameters for each test . . . . . . . . . . . . . . . . . . . . . . . 34

5.3 Comparison of key features from previous research and this paper . . 35

Glossary

accuracy Ratio of correct predictions from the total. 2, 5–9, 12, 25–31, 33

convolutional neural network Type of neural network mainly used for image
recognition. 12, 19

database Package with all the information extracted from the videos in order to
train the AI. 2, 5–7, 11, 14, 19, 21–23, 25–31, 33, 34
dropout A parameter of a neural network that drops/deletes the values of some
random neurons depending on a defined probability between 0 and 1. 25, 31,
34

epoch Training step of a neural network. 27, 30

Keras Library extension of TensorFlow to build deep neural networks with Python.
vi, 17, 19, 25
kernel Filter that extracts a value from a group of pixels. A kernel is part of a
convolutional layer. 13

loss Factor calculated by the loss function in a neural network. The higher the
number, the more correction it requires. 17, 25–31

MediaPipe Framework created by Google developers with the capability of tracking
21 landmarks of a hand. 1, 3, 5, 11–15, 19, 22–24, 33, 34

node A node is equivalent to a neuron of a layer. It stores a value. 25, 31

overfitting When the neural network needs a larger number and variety of exam-
ples in the database. This can be seen when the training accuracy is much
higher than the test accuracy. 13, 25–31

recurrent neural network Type of neural network used for sequence recognition.
1, 16, 19, 25

underfitting When the neural network can be trained much further to obtain better
results. This can be seen when both accuracy and loss keep improving but
the neural network stops training. 26

Chapter 1

Introduction

The aim of this project is to build and implement an AI model capable of recognizing
sign language with computer vision in real time. The selected sign language is Amer-
ican Sign Language (ASL) because it is the most used among the Deaf community
and it is easily translated into spoken or written English.

According to the World Health Organization [20], around 466 million people worldwide have disabling hearing loss, and it is estimated that by 2050 the number will rise to over 900 million. In order to satisfy the need for communication between sign language speakers and non sign language speakers, people usually use text or a translator. Both methods raise some problems: in the first case, text conversations are not as comfortable as spoken ones, they are slower and expressions are not visible. In the second case, needing a person to communicate everything someone says can be expensive and eliminates privacy. The recognition of signs with artificial vision could solve both problems.

Until now, there have been some technological limitations, like finger tracking and real time image processing (this is studied in more detail in chapter 2). Nowadays, thanks to Google MediaPipe, we can track hands from a computer or mobile phone camera in real time. MediaPipe is an open source framework released by Google in August 2019 that includes real time artificial vision technologies such as object tracking, face detection and multi-hand tracking.

Taking advantage of MediaPipe, the selected approach is to track the hands and their fingers from a webcam or mobile phone camera and then detect gestures with a recurrent neural network in real time. The hand detection depends on MediaPipe; therefore, this research focuses on building the AI that does the gesture recognition, without the possibility of improving the hand tracking, and on finding out whether MediaPipe is a good enough tool to recognize sign language. The main differences with respect to previous studies are the use of only one camera from a computer or mobile phone available nowadays and the use of a computer application capable of recognizing signs in real time.


Due to time and computing power limitations, this project does not intend to recognize ASL in its entirety. This paper only proves that it is possible to distinguish between two, three and four different signs, and it also supports the hypothesis that, provided a large enough database, it is possible to recognize any number of signs.

The gestures studied correspond to the meanings of hello, no, understand and sign. These signs have been selected in order to detect differences in the performance of the AI depending on the movement and number of hands in use; this will be explained in detail in chapter 4. It is important to emphasize that this research uses a low number of videos, and results could be improved with a larger database. Some ideas about how to obtain more videos are discussed in this paper.

There are some existing datasets with sign language videos. They have not been used for two main reasons. The first and most important one is that the videos are of very high quality and sometimes even edited so that there is no background. This would not allow checking whether this project is possible with standard computer and mobile phone cameras. The second reason is that the databases do not have more than 5 examples per sign, which is not enough to train the RNN.

Despite the fact that this research is promising for the future of sign language
recognition, there are still some important limitations which are not dealt with in
this paper. For example, it is possible to say different things with the same gestures
by changing the expression of the face. So in order to recognize sign language, it is
not enough to track the hands.

After trying different parameters for the AI built with RNNs and modifying the database to maximize results, the AI has reached an accuracy of 92%. This accuracy applies to the recognition of individual gestures. In order to recognize sentences successfully, a second AI or corrector is required; this project only focuses on recognizing individual signs. Methods to improve these results are discussed in chapter 6.

This research and further extensions in the area have many applications. The most obvious one is sign language translation into any spoken language. Nevertheless, another important application could be sign language teaching: signing in front of a device with a camera and getting feedback on gesture accuracy, the same way that Duolingo (a free application to learn different languages) uses voice recognition to check if the user is pronouncing words correctly. In fact, sign language recognition opens up many new opportunities for technologies not related to languages. Any controllable device could be instructed with hand signs, which is useful for home automation, robotics, video games, etc.

In the following chapters, the technology, methodology, results and conclusions will be presented without showing any code, to make sure that the research can be understood without the need for coding knowledge. However, because the code is a very important part of this project, it is all stored at the following link:

https://github.com/AntonioDomenech/ASL-recognition-training

Chapter 2

Related work

This chapter summarizes previous research by other authors who tried to recognize sign language with different approaches.

The new features that this project presents in comparison to previous works are the
use of Google MediaPipe, which enables the possibility of real time sign language
recognition, and the use of cameras found in people’s everyday life.

2.1 Contour detection approach

Contour detection has probably been the most commonly used approach in the past, since it is one of the first technologies that gave the possibility of object detection. Moreover, before MediaPipe, if one wanted to rely on open source technology, this could only be accomplished with OpenCV, a programming library developed by Intel that focuses on real time computer vision.

Contour detection generates a series of problems. First of all, it is not possible to detect fingers that are behind other fingers. Second, for shapes like a fist, only the outline of the fist can be detected, but not the position of each finger. In addition, the processing time of each video frame is longer. As can be seen in Gurav [8], figure 2.1, only clear and steady signs can be detected and, according to the author, the maximum speed for these simple signs is 30 fps (note: this is with some specific computer properties and the use of only one hand). Furthermore, this only works with a plain, contrasted background and with the hand kept close to the camera.

Another paper that studies the possibility of real time hand detection with OpenCV is Mittal [18]. It can be seen in figure 2.2 that when the upper half of the body appears, the detection becomes less accurate, and although the paper claims to be able to detect multiple hands in an image, it doesn't detect their orientation or fingers, only their position in the image.


Figure 2.1: Contour detection of one hand with OpenCV. Gurav [8].

(a) Original. (b) Detection.

Figure 2.2: Example of hand detection using contour detection with multiple
hands and complex background. Mittal [18].


2.2 RNN combined with another AI approach

This approach is very similar to the one used in this paper because of the use of an RNN. RNNs are not enough to recognize sign language on their own; they only handle the temporal aspect. Therefore, a spatial recognition AI is still required. For example, MediaPipe is the technology that provides the spatial tracking of the hands for this research, using CNNs.

The combination of spatial and temporal recognition allows a machine to understand movement, which is what is needed to recognize sign language.

In Masood [16], the authors apply a CNN to detect the position of the hands and an RNN to detect patterns of the signs over time. The authors accomplish the recognition of 46 signs with an accuracy of 95.2 percent. The results are very good but are only possible with choices that eliminate the possibility of real time recognition: in order to detect the hands, the authors remove the background so that only the hands appear in the image. The quality of the camera used by the authors is not specified.

Figure 2.3: Example of hand detection with background removal. Masood [16].

Another interesting paper is Liu [15] from the University of Science and Technology of China, which studies different LSTM approaches (see LSTM in chapter 3) with RNNs. After comparing LSTMs with methods used by other papers, the authors claim that an LSTM-based model works better than Hidden Markov Models (HMM).

Table 2.1 shows the comparison made by the authors with a large database. Both LSTM approaches show better results than the other models tried in other papers.

Table 2.1: Accuracy obtained using different methods. Liu [15].


The authors that made this comparison created their own database and tested mod-
els from other papers with it. In the next section, it can be seen that different authors
obtained very high results with HMM.

2.3 HMM (Hidden Markov Model) approach

HMMs have been used for a long time and much research has been done in the area. In 1997, 23 years prior to this research, the authors of Grobel [7] were able to build an HMM that recognized 262 independent signs with an accuracy of 91 percent.

Table 2.2: Accuracy obtained for different tests. Grobel [7].

The difference in results between the tests in table 2.2 has an explanation. The databases were made by volunteers named A and B. In training 1, the database only had examples from A and it was tested on videos of A (test 1) and videos of B (test 2). The same procedure, but switching A and B, was done for training 2. The final training consists of training on videos from both A and B, and then testing the model with videos from A (test 1), B (test 2) and both (test 3).

When both the training and the test are done with the same volunteer, the accuracy is very high, but when the test is done with a different volunteer than the training, results like 56 and 47 percent are obtained. This proves that, to obtain a good and reliable accuracy, many different examples should be used for the database.

Many papers using HMMs have been published since 1997; one of the latest is Kumar [13]. The authors try different hand tracking approaches to later predict 25 different signs with an HMM.

To track the hands, three methods were used with two sensors: Kinect, Leap Motion and a combination of both. The authors conclude that the combination of both (Kinect and Leap Motion) works best.

The database used in the paper consisted of 2000 word samples from 25 different
Indian sign language gestures. Some of the samples are the gesture of the word and
some are made by spelling the letters of the word. The maximum accuracy reached
is 90 percent.

2.4 Glove approach

Prior to this paper, hand tracking with gloves was studied by Toti Moragas, Aniol Solé and the author of this paper, Antonio Domènech. This approach consists of putting a glove with sensors on the person who is signing in order to recognize the movement of the hands and display on a screen what the user is saying.

Each finger of the glove has a flex sensor which detects if the finger is stretched or bent, and each hand has a gyroscope to detect its orientation.

For each sign, the values of all sensors are stored in a database in order to later train a classification model with machine learning. For a small database of approximately 200 examples, the results in table 2.3 were obtained, depending on the classification model. Only signs without movement are recognized in that project; the addition of movement requires a larger database.

Table 2.3: Accuracy obtained for different classification methods in the project
GloveWord.

The glove approach has been studied many times before with different technologies.
It’s actually one of the first approaches studied since it doesn’t rely on computer
vision which is computationally demanding.

In Mehdi [17], a glove with seven sensors was used: one for each finger plus two for the hand (tilt and rotation). The authors recognize the alphabet of American Sign Language without taking into account words with movement. The maximum accuracy obtained is 88 percent.


In Galka [4], the authors use multiple inertial motion sensors in a glove and later process the data with an HMM. The recognition results reach an accuracy of 99 percent for 40 selected gestures.

This approach has the problem of requiring dedicated hardware and focuses only on one-way conversation: the Deaf person can speak to the hearing person, but not the other way around. Another problem is the cost and complexity of the hardware. In general, all gloves are big and uncomfortable, and the total price of all the sensors is very high. Below there are some images of the gloves:

(a) GloveWord. (b) Kuroda [14]. (c) Galka [4].

Figure 2.4: Three different models of gloves designed to recognize the different
gestures and positions of the hands.

2.5 Other approaches

Many companies developed their own technology to detect sign language with dif-
ferent approaches from the ones mentioned before. Unfortunately, the specific tech-
nologies used are not fully explained.

Another approach is the one from the company SignAll [21]. This company uses artificial vision, but with a very complex recognition system: it uses 4 cameras, specific illumination and gloves for the hands in order to track all the required gestures. In figure 2.5 the final installation of the whole system can be seen.

So far, SignAll is the only available technology that completely translates sign language to voice or text. Moreover, the communication system is bidirectional: hearing people can also speak into the device, which will translate voice to text for the Deaf person.

Yannis [1] created LipNet, a lip-reading AI with deep learning technologies. This is a
very different approach from the previous ones since it doesn’t rely on hand tracking.
Usually deaf people learn how to speak and are able to read lips to understand what
hearing people are saying. This technology enables the possibility of breaking the
communication barrier between deaf and hearing people through lip recognition.


Figure 2.5: Example of the SignAll system in an office. SignAll [21].

LipNet is a neural network architecture for lipreading that maps variable-length sequences of video frames to text sequences. The authors claim to reach an accuracy of 95 percent. In the figure below, a comparison with previous lip tracking systems can be seen.

Figure 2.6: Accuracy comparison with previous lip recognition technologies. Yannis [1].

Chapter 3

Technology Background

In order to build the AI for this project and conduct the necessary research, different technologies have been used and studied. This chapter will explain the technological resources used for the hand tracking, the database preparation, the AI and the results analysis.

3.1 Coding languages

All code created for this project is written in Python. Only the Google MediaPipe
open source framework and all the modifications that have been made to it are in
C++.

Python is frequently used in academic teaching and also has a very active community and reliable published documentation. According to the TIOBE programming community index, Python is the third most used programming language in 2020 and was the programming language of the year in 2007, 2010 and 2018.

Some of its features include multi-paradigm programming, object-oriented programming and structured programming. Furthermore, Python has a lot of efficient and well supported options for machine learning and deep learning, which are the core of this project.

C++, although more efficient than Python, is much more complex, and for this project it has only been used for the MediaPipe technology developed by Google. For future research, if a fully real time version of this project is intended, C++ is a better option since it is better optimized and can work directly in the MediaPipe framework. For this project, Python simplifies a lot of tedious work and its performance is fast enough.


3.2 AutoKeras

AutoKeras [10], developed in early 2019, is the open source alternative to AutoML
from Google. Both AutoKeras and AutoML are AI tools to design the optimal AI
architecture for a given problem.

AutoKeras uses a search algorithm described in [10] to select an AI architecture and then trains the model to obtain accuracy and loss values. Each trained architecture is stored in order to finally select the one that gave the best results. The working process of AutoKeras can be seen in figure 3.1.

Figure 3.1: AutoKeras system architecture. AutoKeras [10].

For this paper, only RNN architectures were tested, with the possibility of modifying the number of layers, the cell type (LSTM or GRU) and the dropout.

Apart from AutoKeras, an RNN has been developed by the author and several parameters have been tried, reaching combinations that perform better than the AutoKeras models, as seen in section 4.5.

3.3 MediaPipe

The hand tracking is done with the open source Google MediaPipe technology. As
mentioned before, this is a framework that provides real time computer vision tech-
nologies such as hand detection, hand tracking, face detection or object detection.
Launched in 2019, MediaPipe is the latest technology advancement that makes this
project possible.

The Google MediaPipe technology provides detailed real time finger tracking for multiple hands. Although Google has not released details about frame rate or power consumption, the accuracy of the palm detection, according to the documentation page, is 95 percent on average. The images below show how the finger detection works.

Figure 3.2: Top: detection of real hands; bottom: detection of synthetic hands.
Google [6].

MediaPipe detects the hand using the two following convolutional neural network models: palm detection and finger detection. First, MediaPipe detects the palm; if the palm doesn't change position, palm detection is not required again, which improves efficiency. The finger detection is done only on the area of the palm previously detected; this way, false finger detections are avoided.

A CNN is a type of neural network very useful for image recognition. It has other applications, but since it is only used for image processing by the MediaPipe framework, the following explanation will focus on that.

An image, if it is not in grayscale, has two pixel dimensions and a third dimension for color. CNNs are capable of processing information in one, two or three dimensions, coming from pixels that are near one another.

CNNs are composed of convolutional layers which connect neurons that only corre-
spond to neighboring pixels of the image. This reduces the amount of connections
between neurons and therefore, reduces the number of weights to be processed.

A convolutional layer is the result of applying a set of filters to all areas of the data
being processed. Each filter (also named kernel) is an operation applied to a group
of pixels. The size of the filter is arbitrary, but typical groups of pixels are 2x2, 3x3,
4x4 or 5x5. The purpose of this filter is to highlight an edge or a line. In figure 3.3
a good visualization of a kernel can be seen.
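As an illustration of this operation (this is not part of the thesis code), the following NumPy sketch slides a single 3x3 edge-highlighting kernel over a toy grayscale image, which is what one filter of a convolutional layer computes at every position:

```python
import numpy as np

def apply_kernel(image, kernel):
    """Slide a kernel over a grayscale image and return the feature map."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product of the kernel with one group of pixels.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)                    # toy grayscale image
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])          # highlights edges
print(apply_kernel(image, edge_kernel).shape)   # (6, 6) feature map
```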

This convolutional layer system has another advantage: a kernel always has the same weights across the whole image. Each neuron of a hidden layer is composed of an operation between the input and the weights of the kernel. This means that all hidden neurons share the same weights, which reduces memory usage and prevents overfitting (see overfitting in the Glossary or chapter 5).

Figure 3.3: Visual representation of the application of a kernel to a group of pixels (5x5 in this case). Image from the book Deep Learning, from MIT [5].

Each CNN model of MediaPipe constitutes a calculator in the framework, and a combination of calculators constitutes a graph. MediaPipe allows users to create their own graphs and calculators as well as to use the ones provided by Google. For this project, the multi-hand tracking graph example shown in figure 3.4 was used and modified to satisfy the project's necessities.

The hand-tracking graph includes MultiHandDetection, a sub-graph that contains the palm and finger detection mentioned above, MultiHandLandmark, which extracts the landmarks for each frame, and MultiHandRender, which overlays the hand landmarks on the output video.

For this project, MultiHandRender could have been removed since it is not used by the AI, but because the output videos needed to be checked to ensure the hand tracking was successful, the sub-graph has not been eliminated.

Three modifications have been applied to the graph. First, everything was changed to run on the GPU instead of the CPU to reduce processing time. Before this change was applied, each video would take between 30 and 50 seconds to process, which means it would take close to 80 hours to process all the videos. On the GPU it only takes 10 hours.

Second, the input was changed to be an mp4 video file instead of the webcam. This way, all videos recorded for this project could be processed. A program in Python was created so that MediaPipe could take as input all videos stored in a folder.
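A minimal sketch of such a driver script is shown below. The binary name, graph file path and command-line flags are illustrative assumptions about a compiled MediaPipe desktop demo, not the exact files used in this project; those are documented in the repository linked in chapter 1.

```python
import glob
import subprocess

# Hypothetical paths and flags: adjust to the compiled MediaPipe demo being used.
MEDIAPIPE_BIN = "bazel-bin/mediapipe/examples/desktop/multi_hand_tracking/multi_hand_tracking_gpu"
GRAPH_CONFIG = "mediapipe/graphs/hand_tracking/multi_hand_tracking_mobile.pbtxt"

for video_path in sorted(glob.glob("input_videos/*/*.mp4")):
    output_path = video_path.replace("input_videos", "output_videos")
    # Run the (modified) MediaPipe graph on one video at a time.
    subprocess.run([
        MEDIAPIPE_BIN,
        f"--calculator_graph_config_file={GRAPH_CONFIG}",
        f"--input_video_path={video_path}",
        f"--output_video_path={output_path}",
    ], check=True)
```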

Finally, the landmarks were extracted and organized in text files (one per sign). These text files are later used to build the final database for the AI.


Figure 3.4: Graph for Multi-hand tracking provided by MediaPipe. Bazarevsky [2].


3.4 Recurrent Neural Networks

The AI of this project has to make a prediction depending on a sequence of frames. In this case, each frame holds the positions of the landmarks of the hands, and the combination of the frames of a video forms a sign. Recurrent neural networks are capable of solving this type of problem.

An RNN is a type of neural network which can process sequential data of variable length. RNNs apply the same function over a sequence recurrently. Unlike regular networks, where the state only depends on the current input (and network weights), RNNs also depend on the previous states.

An RNN can be defined as a recurrence relation:

$$s_t = f(s_{t-1}, x_t) \qquad (3.1)$$

where $f$ is a differentiable function, $s_t$ is a vector of values called the internal network state (at step $t$), and $x_t$ is the network input at step $t$. As shown in equation 3.1, the internal network state at a given moment $s_t$ depends on the previous internal network state $s_{t-1}$. At the same time, $s_{t-1}$ depends on $s_{t-2}$, and so on. Therefore, the current state depends on all the previous ones.

Once the internal weights of the RNN are added, the output is as shown in equation 3.3:

$$s_t = f(s_{t-1} W + x_t U) \qquad (3.2)$$

$$y_t = s_t V \qquad (3.3)$$

where $W$ and $U$ transform the previous state $s_{t-1}$ and the input $x_t$ respectively, $V$ transforms the current state $s_t$, and $y_t$ is the output of the RNN. The figure below shows the typical architecture of an RNN and its weights:

Figure 3.5: Typical architecture of a RNN with weights. Vasilev [23].
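To make equations 3.1 to 3.3 concrete, the following NumPy sketch (illustrative only, with random weights and tanh assumed as the function $f$) steps a plain RNN over a 20-frame sequence of landmark vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, state_dim, output_dim = 42, 16, 4   # e.g. 21 landmarks * 2 coordinates

U = rng.normal(size=(input_dim, state_dim))    # transforms the input x_t
W = rng.normal(size=(state_dim, state_dim))    # transforms the previous state s_{t-1}
V = rng.normal(size=(state_dim, output_dim))   # transforms the current state s_t

def rnn_forward(sequence):
    s = np.zeros(state_dim)                    # initial internal state
    outputs = []
    for x_t in sequence:                       # one step per frame
        s = np.tanh(s @ W + x_t @ U)           # equation 3.2 with f = tanh
        outputs.append(s @ V)                  # equation 3.3
    return np.array(outputs)

frames = rng.normal(size=(20, input_dim))      # 20 frames of landmark positions
print(rnn_forward(frames).shape)               # (20, 4)
```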


In each training iteration, each of the neural network's weights receives an update proportional to the partial derivative of an error function with respect to the current weight. In some cases, the changes contributed by earlier states will be very small compared to those of recent states, which effectively removes the influence of the early states on the RNN.

To solve this, the RNN of this project uses LSTM (Long Short-Term Memory) units. An LSTM unit is composed of a cell with an input gate, an output gate and a forget gate. This cell contains the temporal state, which can handle long-term dependencies. Depending on the gate activations, the LSTM cell can erase the temporal state and allow a new one to be stored, or keep the same state it had before. Therefore, while the states of a normal RNN may have forgotten old dependencies, LSTM cells still hold them.

This composes the architecture of the RNN as explained in Vasilev [23]. In the
section below the training parameters are explained in detail.

3.5 RNN with Python

To build the RNN, Python has some very good tools for AI. Created by Google in 2015, TensorFlow is a well known and well supported library for Python. It includes tools for deep learning and machine learning: models, datasets, sub-libraries, extensions, etc.

With TensorFlow, it is already possible to build neural networks, but the programming is at a very low level, which makes it very difficult to code. To reduce the difficulty, one of the extensions of TensorFlow is Keras: it is specially built for Python, has very straightforward commands to build AI models and is completely free, with published documentation.

For this project, the RNN has been built following the instructions of pythonpro-
gramming [9] and the official Keras documentation.

In order to train the model for this research, there are two parameters that need to
be defined in a RNN: a loss function and an optimizer. The loss function defines
the ”cost” of outputting a wrong answer given an input and a current state, and
the optimizer defines how the weights and learning rates change.

For the loss function, cross-entropy has been selected since this AI solves a classification problem. As explained in Brownlee [3], cross-entropy loss increases as the predicted probability diverges from the actual output. The function is shown in equation 3.4:


$$E = -\sum_{i=1}^{n} Y_i \log_2(P_i) \qquad (3.4)$$

where $E$ is the error, $Y_i$ is 1 or 0 (1 if $i$ is the correct label and 0 otherwise) and $P_i$ is the probability calculated by the network for class $i$.

The following example explains how cross-entropy works in practice. Suppose the RNN (any other neural network could be used for this example) has to predict between 3 different signs; the output of the RNN gives a probability for each sign. The RNN outputs a probability of 63% for sign 1, 27% for sign 2 and 10% for sign 3. The prediction is wrong and the correct sign is 2. The AI applies equation 3.4 as follows:

$$E = -[0 \cdot \log_2(0.63) + 1 \cdot \log_2(0.27) + 0 \cdot \log_2(0.10)] \qquad (3.5)$$

$$E = -\log_2(0.27) \approx 1.89 \qquad (3.6)$$
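The same calculation can be reproduced with a few lines of Python (an illustrative sketch using base-2 logarithms, as in equation 3.4):

```python
import math

def cross_entropy(labels, probs):
    """Categorical cross-entropy with a base-2 logarithm."""
    return -sum(y * math.log2(p) for y, p in zip(labels, probs))

labels = [0, 1, 0]            # the correct sign is sign 2
probs = [0.63, 0.27, 0.10]    # probabilities predicted by the RNN
print(round(cross_entropy(labels, probs), 2))   # 1.89
```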

For the optimizer, Adam has been selected. Adam has the advantage of being fast, and although it is computationally costly, this is not a problem for this research.

The graph below shows that Adam learns faster than other optimizers and, as mentioned in Kingma [11], it performs consistently independently of the problem. That is why Adam is one of the most used optimizers for deep learning problems.

Figure 3.6: Comparison between Adam and other optimizers. Kingma [11].

Chapter 4

Sign language recognition

Now that all the basic technologies and previous research have been explained, everything is set to describe the process followed in this research in order to recognize ASL. The major steps of the process are as follows: first, since it is neither possible nor intended to translate all of ASL within the time schedule of this project, a sign selection had to be made to maximize the understanding of the results. Second, the database for the AI had to be created properly to obtain good results. Finally, the AI had to be trained and the results evaluated.

4.1 Approach

This paper follows the approach of section 2.2, where the predicting model is a recurrent neural network and the tracking of the hand is done with a convolutional neural network. The sequence of steps is shown in figure 4.1.

Figure 4.1: Steps followed in order to obtain a gesture recognition model.

The CNN is provided by the MediaPipe framework which, once the necessary mod-
ifications are applied, outputs the position of 21 landmarks for each hand.

The RNN built with Keras takes the first 20 frames of each sign. The signs have different time lengths depending on the gesture and the person executing it; therefore, the AI doesn't read the sign from beginning to end. This option has been selected over padding because of its simplicity.

Padding consists of adding zeros to the beginning or end of a sequence in order to make all sequences the same length. This is required for variable-length input because RNNs don't allow variable input lengths.
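A minimal sketch of both options is shown below, assuming each sign is stored as a list of per-frame landmark vectors (the feature size of 42 is an illustrative assumption): the approach used here simply keeps the first 20 frames, while padding would append zero-frames instead.

```python
import numpy as np

SEQ_LEN = 20    # number of frames fed to the RNN
FEATURES = 42   # e.g. 21 landmarks * 2 coordinates per frame (illustrative)

def first_n_frames(sign_frames, n=SEQ_LEN):
    """Approach used here: keep only the first n frames of the sign."""
    return np.asarray(sign_frames[:n], dtype=np.float32)

def zero_pad(sign_frames, n=SEQ_LEN):
    """Alternative (padding): append zero-frames until the sequence has length n."""
    frames = np.asarray(sign_frames, dtype=np.float32)
    padded = np.zeros((n, FEATURES), dtype=np.float32)
    padded[:min(n, len(frames))] = frames[:n]
    return padded

sign = np.random.rand(34, FEATURES)        # a sign recorded over 34 frames
print(first_n_frames(sign).shape)          # (20, 42)
print(zero_pad(sign[:12]).shape)           # (20, 42): a short sign padded with zeros
```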

4.2 American Sign Language (ASL)

According to the National Institute on Deafness and Other Communication Disorders (NIDCD) [19], ASL is a complete natural language that has the same linguistic properties as spoken languages, with grammar that differs from English. It is expressed by the movement of the hands and face, it is the primary language of many North Americans who are deaf and hard of hearing, and it is used by many hearing people as well.

ASL is a language completely separate and distinct from English. It has its own
rules for pronunciation, word formation, and word order. While every language has
ways of signaling different functions, such as asking a question rather than making a
statement, languages differ in how this is done. For example, English speakers may
ask a question by raising the pitch of their voices and by adjusting word order; ASL
users ask a question by raising their eyebrows, widening their eyes, and tilting their
bodies forward.

Just as with other languages, specific ways of expressing ideas in ASL vary as much as ASL users themselves. ASL also has regional accents and dialects, such as regional variations in the rhythm of signing, pronunciation, slang, and the signs used. Other sociological factors, including age and gender, can affect ASL usage and contribute to its variation, just as with spoken languages.

As mentioned in the introduction, this paper won’t take into account any ASL
features other than the hand movement. Therefore, facial or body expressions won’t
be tracked.

4.3 Sign selection

As mentioned in the introduction chapter, the chosen signs for the research are hello, no, understand and sign. These signs have been selected on purpose to study how the AI performs at recognizing different movements and numbers of hands.

The signs no and understand are very similar: both have very little vertical movement and only use the right hand. They have been selected to see if the AI is capable of distinguishing signs that have the same type of movement and use the same hand. Hello is similar to no and understand since it also uses only the right hand, but the movement is horizontal. This will show whether the AI is capable of clearly recognizing the change between horizontal and vertical movement, but still with the same hand.

Finally, the word sign uses both hands and is clearly different from the rest of the signs, so it is expected that the combination of any word with sign will give better results than the rest of the possible combinations.

The tests done with these four signs are: [no - understand], [hello - understand], [hello - sign], [hello - no - understand] and [hello - no - sign - understand]. The number of videos for each sign is around 1500 and all of them will be used for each test; a larger database is probably required as more signs are added to the tests. Therefore, better results are expected from tests with only two signs.

4.4 Database

In order to build an AI, a good and large database is needed. If the database is not large enough, or the videos are not set up or recorded properly, the final AI result may not be as good as it could be.

There are three key steps to build the database: video recording, video augmentation and hand landmarks extraction. Each of these steps is explained in detail in the next few sections.

4.4.1 Video recording

This project does not intend to prove the possibility of sign language recognition at any technological expense. Instead, it proves that it is possible with technology accessible to most people, such as the cameras of mobile phones. For this purpose, the volunteers who recorded the signs were told to use the maximum quality they had available on their phones, regardless of brand or type of camera. Table 4.1 shows the camera properties used for the recordings.

It must be noted that all videos were checked, and those which were cut, had bad illumination or bad contrast, or were simply signed too fast or too slow, have been removed. Therefore, even though all qualities have been accepted (except those which were clearly too low), a filter has been applied to ensure the correctness of the signs.

Table 4.1: Different cameras used for the final AI.


The videos consist of one person performing one sign. All videos last between 3 and 5 seconds. Each video starts exactly at the point the sign starts, but the end has not been modified since only the first 20 frames of the video will be used by the AI. The videos needed to be edited like this because the hands can come from different positions, which could confuse the AI. Finally, the videos that were flipped were rotated so that the people in the videos would always be standing up. This had to be done for more than 800 videos (approximately 200 for each of the four signs).

Once the editing was finished, the videos were set and ready for video augmentation.

4.4.2 Video augmentation

To be able to identify signs, a very large database is required. The 800 videos provided are not enough material to build the AI, or at least better results are expected with a larger database (if the videos are correctly set up). In order to increase the number of videos provided, a video augmentation program was built to multiply them by 9.

The video augmentation has been done with Python. The program takes each original video and reads its frames. Then it applies to the video, 8 times, a random resize (which makes the video longer or wider) and a random angle rotation. After the process is finished, there are 8 new videos plus the original one. Once this is done for all the original videos, the final database consists of more than 7200 videos.

Image augmentation is very standardized and common, but video augmentation is trickier because the same changes need to be applied to all frames of a video. For the data augmentation, some modifications were applied to the open source code found in [12]. The whole program works with Python.
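The idea can be sketched with OpenCV as follows (a simplified illustration, not the exact code from [12]; the scale and angle ranges are illustrative assumptions): one random scale factor and rotation angle are drawn per copy and then applied identically to every frame of that copy.

```python
import random
import cv2

def augment_video(frames, copies=8):
    """Create `copies` new versions of a video, one random resize + rotation each."""
    augmented = []
    for _ in range(copies):
        scale = random.uniform(0.8, 1.2)          # random resize factor (assumed range)
        angle = random.uniform(-10, 10)           # random rotation in degrees (assumed range)
        new_frames = []
        for frame in frames:
            h, w = frame.shape[:2]
            # The same transform is applied to every frame of this copy.
            resized = cv2.resize(frame, (int(w * scale), int(h * scale)))
            matrix = cv2.getRotationMatrix2D((w * scale / 2, h * scale / 2), angle, 1.0)
            rotated = cv2.warpAffine(resized, matrix, (int(w * scale), int(h * scale)))
            new_frames.append(rotated)
        augmented.append(new_frames)
    return augmented
```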

4.4.3 Hand landmarks extraction

Once the video augmentation was stored and ready, all videos were processed through the Google MediaPipe hand tracking technology explained in section 3.3. In order to input the videos, a Python script was created. This script searches for videos in a folder and sends them to MediaPipe (see the input folder in figure 4.3). The final modified version of MediaPipe is responsible for storing the outputs in new folders.

The specific files from MediaPipe that were edited, as well as the explanation of how to use them, can be found in the code link presented in chapter 1. The modifications were extracted from the issues section of MediaPipe's GitHub site [22]. See the output folder created by MediaPipe in figure 4.4.


Figure 4.2: Examples of video augmentation from Köpüklü [12].

The output from MediaPipe, once edited, is a text file for each video containing the positions of the hand landmarks (normalized between 0 and 1) for each frame, and a video with the landmarks rendered on it (see the figures below). The code to obtain these files is in C++ and was created by Kim [22].

To group all the landmark positions in a format that the AI understands, a program
was created using Python. This program reads all the positions stored in each text
file and organizes them as a list. Once all the signs are stored and set in the form
of landmark positions per frame, then the database is ready for use.
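A simplified sketch of such a reader is shown below, assuming one text file per video grouped in one folder per sign, with one line of normalized landmark coordinates per frame (the folder layout and fixed feature count are illustrative assumptions; the exact file format is shown in figure 4.5):

```python
import glob
import numpy as np

SEQ_LEN = 20   # frames per sign used by the RNN

def load_sign_database(folder, labels):
    """Read MediaPipe landmark text files and build (samples, labels) arrays."""
    samples, targets = [], []
    for label_index, label in enumerate(labels):
        for path in glob.glob(f"{folder}/{label}/*.txt"):
            with open(path) as f:
                # One row of floats per frame: normalized landmark coordinates.
                frames = [list(map(float, line.split())) for line in f if line.strip()]
            if len(frames) >= SEQ_LEN:
                samples.append(frames[:SEQ_LEN])
                targets.append(label_index)
    return np.array(samples, dtype=np.float32), np.array(targets)

X, y = load_sign_database("landmarks", ["hello", "no", "sign", "understand"])
print(X.shape, y.shape)   # e.g. (n_videos, 20, n_features), (n_videos,)
```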


Figure 4.3: Input folder architecture for a correct reading of videos.

Figure 4.4: Output folder architecture created by MediaPipe.

Figure 4.5: Position of the landmarks normalized between 0 and 1 in a text file.


4.5 RNN with Keras

The RNN built in this project depends on a set of parameters, and the optimal values of those change depending on the size and type of database. For each test, different options were tried and the combination of parameters that delivered the best result was selected. The variables that were tested with different values were: the number of hidden layers, the number of nodes in each hidden layer, the number of nodes in the last layer and the dropout in each layer.

Table 4.2: Values tried for each parameter in the RNN.

The result analysis and graphic visualization have been done with TensorBoard from TensorFlow. TensorBoard is a visualization toolkit created by TensorFlow that offers very useful capabilities for AI study, such as the tracking and visualization of loss and accuracy while training. Figure 4.6 shows the architecture of the recurrent neural network for the case of 3 hidden layers.

Each combination of an LSTM layer with a dropout forms an RNN layer, except for the last one, which outputs the final prediction of the RNN and therefore doesn't need any dropout.
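As an illustration, a Keras model of this kind could be sketched as follows. The layer sizes, dropout value and feature count are example values within the ranges tried in table 4.2, not the exact final configuration selected for each test:

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_SIGNS = 4   # hello, no, sign, understand
SEQ_LEN = 20    # frames per sign
FEATURES = 42   # landmark coordinates per frame (illustrative)

model = keras.Sequential([
    # Each RNN layer = LSTM + dropout, as described above.
    layers.LSTM(128, return_sequences=True, input_shape=(SEQ_LEN, FEATURES)),
    layers.Dropout(0.2),
    layers.LSTM(128, return_sequences=True),
    layers.Dropout(0.2),
    layers.LSTM(64),            # last recurrent layer: no dropout afterwards
    layers.Dense(NUM_SIGNS, activation="softmax"),
])

# Cross-entropy loss and the Adam optimizer, as motivated in chapter 3.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Training with a held-out test set and TensorBoard logging would then look like:
# model.fit(X_train, y_train, epochs=300, validation_data=(X_test, y_test),
#           callbacks=[keras.callbacks.TensorBoard(log_dir="logs")])
```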

A common problem in deep neural networks, and especially in RNNs, is overfitting. This occurs when the neural network starts learning the specific examples with which it is being trained and therefore won't correctly predict new, unknown examples.

One of the possible measures to prevent overfitting is to add a dropout at each layer (except the last one). A dropout simply ignores some of the neurons of a layer with a specified probability (between 0 and 1). It can be seen in table 4.2 that for this research both a dropout of 0.2 and one of 0.5 have been tried.

The training accuracy, test accuracy, training loss and test loss were the parameters analyzed to evaluate the results. Training accuracy and loss are the values that the AI obtains from the videos in the training database, while test accuracy and loss are the values that the AI calculates on videos that are not part of the training database. The separation of the data between training and test has been done beforehand: the test part is 10 percent of the total videos provided and the training part is the rest. The test values (accuracy and loss) give information about how well the AI works in the real world and therefore determine whether the project is successful or not.

Figure 4.6: RNN architecture with 3 hidden layers. Extracted from TensorBoard.

The most important value is the test accuracy since it is the one that gives the percentage of right predictions on videos that were not used for the training (i.e. new videos). The loss is used to determine whether the RNN has to be trained further or whether it has reached its best value. The training accuracy and loss only serve as comparative tools against the test accuracy and loss to detect overfitting or underfitting.

Overfitting occurs when the database does not have enough data, and therefore the test accuracy and loss start to degrade while the training accuracy and loss don't. Underfitting occurs when the AI stops training while the accuracy or loss could still be improved. In other words, overfitting means that the database is not large enough and needs more videos or a larger variety of examples.

All graphic results displayed in this document are created in Excel with the data
extracted from TensorBoard.

4.6 Using the model in real time

At this point, the final model built with Keras is set and ready for use in an application. In order to do that, MediaPipe needs to be modified again and a program has to be created to run a webcam video in real time and recognize the signs.

The application developed for this paper runs on a computer (see specifications in table 4.3). The program that does the recognition of signs works asynchronously from MediaPipe, and the only connection between them is a text file where the information is written/read.

The real time factor is limited by MediaPipe, since writing to and reading from a text file is very fast compared to the processing time of a frame. MediaPipe has been modified to store the data from each frame in the same text file. This text file always holds the information of only one frame, so when the next frame is stored, the previous one is deleted.


Table 4.3: Computer characteristics used in this paper.

The program that does the recognition of signs is developed with Python and works as follows: first, the program reads the information in the text file and stores it. Once it has done that, it repeats the task. The Python program only reads and stores the data from the text file, so it works much faster than MediaPipe. Therefore, in order not to store the same information several times, the program compares the new information with the previous one, and if it's the same it doesn't store it. Every 20 frames stored, the Python program gives the data to the AI model, and the AI model returns the sign it detected.
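A simplified sketch of this polling loop is shown below; the file name, model file, label order and frame format are illustrative assumptions, not the exact implementation:

```python
import numpy as np
from tensorflow import keras

SEQ_LEN = 20
LABELS = ["hello", "no", "sign", "understand"]
FRAME_FILE = "current_frame.txt"   # hypothetical file written by the modified MediaPipe graph

model = keras.models.load_model("sign_model.h5")   # hypothetical saved Keras model

last_frame, buffer = None, []
while True:
    try:
        with open(FRAME_FILE) as f:
            frame = f.read().strip()
    except FileNotFoundError:
        continue
    # Skip the frame if MediaPipe has not written a new one yet.
    if not frame or frame == last_frame:
        continue
    last_frame = frame
    buffer.append([float(v) for v in frame.split()])
    # Every 20 stored frames, ask the model for a prediction.
    if len(buffer) == SEQ_LEN:
        x = np.array(buffer, dtype=np.float32)[np.newaxis, ...]
        prediction = model.predict(x, verbose=0)[0]
        print(LABELS[int(np.argmax(prediction))])
        buffer = []
```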

To understand clearly how the recognition process works, see the block diagram in figure 4.7. As mentioned, the tracking of the hand and the recognition of gestures work independently, and MediaPipe is the speed bottleneck, given that the Python code that recognizes gestures works much faster.

Figure 4.7: Block diagram of the real time recognition process.

The program described recognizes individual signs in real time, which must not be confused with the recognition of sentences in real time. The difference is that the first one outputs purely what the recognition AI detects for every given input, while the second one depends on a second AI that interprets the output of the first AI.

This can be better understood with the analogy of voice recognition: voice recognition detects letters and, once it has all the letters detected, an additional AI interprets the meaning.

For example, if a user says "Hello, my name is Robert", the voice recognition system may detect "elllouuu maineeimmiis Rooperrt". Obviously, this is not very clear: some letters have no sound, others are duplicated, some can change because of the accent, and finally some spaces are not detected. An additional AI is used to correct this and obtain the proper sentence in English.

In the case of ASL recognition, words are recognized instead of letters. If a user signs "Hello, my name is Robert", the ASL recognition system may detect "hellobread myname mynamewall Robert for". The AI can detect additional words between the correct ones: in the example mentioned, "bread", "wall" and "for" appear, incorrectly detected between one sign and another. We can also obtain repetitions or wrong spaces.

In order to solve this problem, in the same way that the voice recognition has an AI to interpret the letters, the ASL recognition needs an AI that interprets the words.

The interest of this paper is to obtain the words recognized from the gestures in real time, and the gestures studied (hello, no, sign and understand) can't form any logical sentence together; therefore, this problem is left open for future research.

Chapter 5

Evaluation

In this chapter, all results and the related discussion for each combination of signs will be presented. The results include graphs that show the accuracy and loss for each epoch, key values, and the AI parameters used for each test.

5.1 No - Understand

The two signs are very similar: both use the same hand with the same orientation and a vertical movement. The only notable difference is the movement of the fingers.

As shown in figure 5.1a, the maximum accuracy obtained is 92 percent. When comparing the training and test sets for both accuracy and loss, it is clear that there is overfitting starting approximately at epoch number 200. Therefore, better results could be obtained with a larger database.

Considering how little difference there is between the two gestures, a predicting capacity of 91 percent, with the possibility of improvement through database enlargement, is a better result than expected.


(a) Training and test accuracy. (b) Training and test loss.

Figure 5.1: Accuracy and loss for the recognition between no and understand.

5.2 Hello - Understand

The next test exposed, which is the recognition between the meanings hello and
understand , presents the best possible performance of recognition between vertical
and horizontal gestures.

(a) Training and test accuracy. (b) Training and test loss.

Figure 5.2: Accuracy and loss for the recognition between hello and understand.

The maximum accuracy obtained for this test is 95 percent, which is a clear improvement over the previous test. This validates the supposition that the AI recognizes horizontal vs vertical better than vertical vs vertical. Furthermore, comparing the accuracy and loss differences between training and validation, the overfitting has been reduced with respect to the no - understand test. Since both databases consisted of approximately the same number of videos, the reduction in overfitting can only mean that the AI is more precise at distinguishing between a horizontal and a vertical gesture.

Another important detail to extract from the curves presented in figure 5.2 is that they are exactly what one would expect from the evolution of an AI training. Accuracy starts at 50 percent, which is the expected accuracy for an untrained AI choosing between two signs, and rises until the AI can no longer improve the recognition.

5.3 Hello - Sign

This is the third test made between only two different signs. As mentioned before, this is to check the difference between gestures with one hand and gestures with two hands. Furthermore, the movement and position of the hands are very different between both signs, which makes this test a comparison between two signs that are completely different in all senses (movement, position and number of hands).

The output accuracy of the AI, which reaches 99 percent, is clearly the highest among all tests. This fits the expectations regarding the results from tests made between two different signs. Moreover, the performance of the AI has been the most consistent across the runs with different AI parameters. While the performance of tests with more than two signs could change drastically with different AI parameters, the results for the tests between the signs hello and understand were more or less the same.

(a) Training and test accuracy. (b) Training and test loss.

Figure 5.3: Accuracy and loss for the recognition between hello and sign.

As shown in figure 5.3, the accuracy value rises very quickly at the beginning and is capable of improving to the point of almost reaching 100 percent. This test is the only one that doesn't present any characteristics that suggest overfitting. Probably, the AI training could continue for a while longer with the database provided.


5.4 Hello - No - Understand

After the results obtained from the tests with two signs, the goal is to observe how the addition of one or more signs affects the performance of the AI.

Figure 5.4: Accuracy and loss for the recognition between hello, no and understand. (a) Training and test accuracy; (b) training and test loss.

In the case of the recognition between hello, no and understand, the AI reached an accuracy of 92 percent. This accuracy lies between the 95 percent of the hello-understand test and the 91 percent of the no-understand test. It must also be taken into account that the number of videos per sign is the same in all tests. Therefore, this test with three signs is expected to perform worse than a test with only two signs, since the task is more complex while the number of videos per sign in the database stays the same. This is reflected in the overfitting, which is larger in this test than in the tests with only two signs.

Another important fact is that the test with three signs requires a larger number of epochs to reach the highest possible accuracy, which, depending on the AI parameters, also increases the training time.

5.5 Hello - No - Sign - Understand

In this final test, the videos of all four recorded signs were combined into a single database used for the final AI test.

The validation accuracy reached a value of 92 percent, which is the same as in the previous test despite the increase in the number of signs that the AI has to recognize.


Figure 5.5: Accuracy and loss for the recognition between hello, no, sign, and understand. (a) Training and test accuracy; (b) training and test loss.

This could be caused by the addition of the gesture sign. As seen in the hello-sign test, this sign is so different from the rest that it improves the recognition quality of the AI.

The overfitting of this test is clearly seen in figure 5.5b, where the validation loss stops at higher values than the training loss. This indicates that better results could be obtained with a larger database.

5.6 Performance in real time

Once the model is trained, it is possible to create a program that recognizes the gestures mentioned above in real time. The created program recognizes gestures individually; it is not capable of recognizing sentences.

In general, the program recognizes the signs that a user performs, but it also detects spurious signs in between, as explained in section 4.6.

The four signs used in this paper cannot be combined into a sentence that makes sense, so it is not useful to build an additional AI that orders the words to form English sentences. For future research aiming to recognize sentences, an additional AI is needed to interpret the output of the recognition system.

The program runs at an average of 21 frames per second (about 0.048 s per frame). Table 5.1 shows some examples of the output of the program for each input. The signs that the user performs are detected although, as mentioned before, incorrect signs are also detected between gestures.
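
As a rough illustration of how such a real-time loop can be organized, the sketch below keeps a sliding window of per-frame hand landmarks and feeds it to the trained model. The window length, the label list, the model file name and the helper extract_landmarks are assumptions for illustration, not the exact code of this project.

    # Illustrative sketch of a real-time recognition loop (not the exact project code).
    # Assumes a trained Keras model saved as "asl_rnn.h5" (hypothetical file name) and a
    # helper extract_landmarks(frame) that returns one flat vector of MediaPipe hand
    # landmark coordinates per frame.
    from collections import deque
    import time

    import cv2
    import numpy as np
    from tensorflow.keras.models import load_model

    WINDOW = 30                                    # frames per prediction (assumed)
    LABELS = ["hello", "no", "sign", "understand"]

    model = load_model("asl_rnn.h5")
    frames = deque(maxlen=WINDOW)

    cap = cv2.VideoCapture(0)
    while cap.isOpened():
        start = time.time()
        ok, frame = cap.read()
        if not ok:
            break

        frames.append(extract_landmarks(frame))   # one landmark vector per frame

        if len(frames) == WINDOW:
            batch = np.expand_dims(np.array(frames), axis=0)   # (1, WINDOW, n_features)
            probs = model.predict(batch, verbose=0)[0]
            print(LABELS[int(np.argmax(probs))],
                  f"{1.0 / (time.time() - start):.1f} fps")

    cap.release()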


Table 5.1: Comparison between the signs performed by the user and the output of the recognition program. Note: und. = understand.

5.7 Parameters of the AI

In table 5.2, the parameters that gave the best results for each test are listed.

According to the data extracted from the AI configurations, for the cases tried in this research, a dropout of 0.2 gives the best results. This held as a general rule across all tests, including the ones not shown in this paper.

The remaining parameters appear to vary independently from one test to another. This was expected, since there is no general rule about the number of layers and nodes per layer that works best for every AI.
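
For reference, a minimal sketch of a recurrent classifier using the dropout value of 0.2 discussed above is shown below. The number of layers, the layer sizes and the input shape are placeholder assumptions; table 5.2 lists the values actually used for each test.

    # Minimal sketch of an LSTM classifier with dropout 0.2 (layer sizes and input
    # shape are placeholder assumptions, not the parameters listed in table 5.2).
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense, Dropout

    N_FRAMES, N_FEATURES, N_SIGNS = 30, 126, 4   # assumed: frames, landmark values, classes

    model = Sequential([
        LSTM(64, return_sequences=True, input_shape=(N_FRAMES, N_FEATURES)),
        Dropout(0.2),                            # the dropout value that worked best here
        LSTM(32),
        Dropout(0.2),
        Dense(N_SIGNS, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()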

Table 5.2: Final values used for the AI parameters in each case.

5.8 Comparison with other approaches

This paper develops an approach not studied before by using MediaPipe as a hand-tracking tool for sign language recognition. This approach shows that it is possible to recognize ASL in real time with standard cameras accessible to most people, which before was only possible with a very restricted background (Masood [16]), by focusing on only one hand (Gurav [8]), or by building a complex system with special illumination and several cameras (SignAll [21]).

In terms of accuracy, some papers obtain better results (as shown in table 5.3). Compared to the 92% reached for 4 signs in this paper, Masood [16] reaches an accuracy of 95% for 46 signs and Kumar [13] reaches an accuracy of 90% with 25 signs.

The use of a CNN or other neural network to classify the position of the hand in a frame (the method used in previous research), compared to extracting 21 landmarks for each hand with MediaPipe (the methodology used in the current research), could be one of the reasons why other papers obtain better results.
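
To make the difference concrete, a per-frame feature vector in the landmark-based approach might be built as in the sketch below; the two-hands-by-21-landmarks-by-three-coordinates layout is an assumed flattening of the MediaPipe output, not necessarily the exact format used in this project.

    # Sketch: flattening per-frame MediaPipe hand landmarks into a feature vector
    # for the RNN, instead of feeding raw pixels to a CNN.
    # Assumed layout: 2 hands x 21 landmarks x (x, y, z) = 126 values per frame.
    import numpy as np

    def landmarks_to_vector(hands):
        """hands: list of up to two hands, each a list of 21 (x, y, z) tuples."""
        vec = np.zeros(2 * 21 * 3, dtype=np.float32)   # missing hands stay zero
        for h, hand in enumerate(hands[:2]):
            for i, (x, y, z) in enumerate(hand):
                vec[(h * 21 + i) * 3:(h * 21 + i) * 3 + 3] = (x, y, z)
        return vec

    # Example with one dummy hand: the resulting vector always has 126 entries.
    dummy_hand = [(0.5, 0.5, 0.0)] * 21
    print(landmarks_to_vector([dummy_hand]).shape)     # (126,)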

Another important factor could be the quality of the video, which is lower in this research than in others. Moreover, other papers preprocess the videos to remove the background and facilitate the recognition of the hands, which is not possible in this research in order to maintain the hypothesis that ASL can be recognized with a normal camera in real time.

The present research improves over previous works by recognizing ASL gestures in real time with standard cameras and open-source technology, and, since this is a new approach, there is room for improvement.

Table 5.3: Comparison of key features from previous research and this paper.

Chapter 6

Summary and Outlook

6.1 Conclusions

Even if this research is still far from a final product that people can use, it opens the door to a new and more accessible way to recognize sign language. In other words, full ASL recognition is not yet possible, but this work encourages further study of this approach in order to achieve it.

The conclusions extracted from the research in this paper are:

• Gestures of ASL can be recognized in real time with a simple mobile phone camera or webcam.
• The hand tracking tool provided by Google MediaPipe is good enough to be used for gesture recognition.
• From easiest to hardest, the differences detected by the AI are: number of hands, vertical vs. horizontal movement, and vertical vs. vertical movement.
• Videos with a higher frame rate (more frames per second) work better.
• There is room for improvement in the AI, the video augmentation and the variety of gestures.
• An additional AI is required to recognize sentences properly.

6.2 Observations

There are some other conclusions that this research hints at but that cannot be stated firmly, since they have not been studied properly. For example, one thing that can be observed is that the accuracy of the test with all four signs does not drop below the accuracy of the test with three signs, even though the number of signs increased while the number of videos per sign did not. One possible reason is that the gesture sign was recorded by a larger variety of volunteers. The number of videos is the same, but the variety of examples is larger. That could mean that, in order to improve the performance of the AI, a database with a greater variety of videos is preferable.

Another observation is that a dropout value of 0.2 probably works best for sign language recognition, since it is the value that performed best in the majority of cases.

6.3 Further research

There are many ways in which this research can be continued. The AI and the database used in this paper can be largely improved. Neural networks commonly used for speech recognition, a problem very similar to sign language recognition, such as CNNs or attention-based models, could also be studied for this task.

Moreover, the database created for this paper was not recorded by professional sign language interpreters. In addition, much more research can be done on how to use video augmentation to improve the results.

The following steps could improve the current results and add new features to the recognition of ASL:

First, a larger number of signs should be recognized, and for this a larger database is required. One way of acquiring more videos is to use the AI developed in this research to teach sign language. For example, if a user is learning the signs hello, no, understand and sign, the computer could ask him or her for one of them. If the user performs the wrong sign, the computer notifies him or her. Meanwhile, each video recorded by the user can be stored and used for future databases.

Second, the use of RNNs through AutoKeras did not improve the results compared to the configurations the author built by hand with Keras. Nevertheless, for future research, AutoKeras could be used to search for RNN architectures different from the ones tried here.
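
If AutoKeras were revisited, a search over recurrent architectures could be set up roughly as follows. This sketch assumes the AutoKeras 1.x AutoModel/RNNBlock API, and the data shapes and trial count are placeholders rather than the configuration tried in this work.

    # Sketch of an AutoKeras search over recurrent architectures (assumes the
    # AutoKeras 1.x AutoModel API; data shapes and trial count are placeholders).
    import autokeras as ak
    import numpy as np

    # Placeholder data: 100 sequences of 30 frames with 126 landmark values each.
    x = np.random.rand(100, 30, 126).astype("float32")
    y = np.random.randint(0, 4, size=(100,))

    input_node = ak.Input()
    output_node = ak.RNNBlock()(input_node)        # search over LSTM/GRU depth and width
    output_node = ak.ClassificationHead()(output_node)

    auto_model = ak.AutoModel(inputs=input_node, outputs=output_node,
                              max_trials=10, overwrite=True)
    auto_model.fit(x, y, epochs=50)
    best_model = auto_model.export_model()         # export the best Keras model found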

Third, an additional AI should be developed to recognize sentences properly, as explained in section 4.6.

Finally, other questions not studied in this paper have to be taken into account, like the fact that some gestures share the same hand movement and their meaning changes only with the expression of the face. A possible solution is to build an AI that also tracks the key points of the face in addition to the hand landmarks. This could be done with the Python library dlib, whose standard facial landmark model tracks 68 points of the face.
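
As a rough illustration, facial key points could be extracted with dlib as sketched below and concatenated with the per-frame hand landmarks; this pipeline is only an assumption for future work, and the standard pretrained 68-point shape predictor file is assumed to be available locally.

    # Sketch: extracting facial key points with dlib to complement the hand landmarks.
    # Assumes the standard pretrained 68-point shape predictor file is available.
    import cv2
    import dlib

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    frame = cv2.imread("frame.jpg")                # hypothetical input frame
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    for face in detector(gray, 1):                 # detect faces in the frame
        shape = predictor(gray, face)              # 68 facial landmarks
        face_points = [(shape.part(i).x, shape.part(i).y)
                       for i in range(shape.num_parts)]
        # face_points could be flattened and appended to the hand-landmark vector
        print(len(face_points))                    # 68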

Bibliography

[1] Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. LipNet: End-to-End Sentence-level Lipreading. 2016. arXiv: 1611.01599 [cs.LG].
[2] Valentin Bazarevsky and Fan Zhang. Google AI Blog: On-Device, Real-Time Hand Tracking with MediaPipe. Aug. 2019. url: https://ai.googleblog.com/2019/08/on-device-real-time-hand-tracking-with.html (visited on 07/20/2020).
[3] Jason Brownlee. A Gentle Introduction to Cross-Entropy for Machine Learning. 2019. url: https://machinelearningmastery.com/cross-entropy-for-machine-learning/ (visited on 07/13/2020).
[4] Jakub Galka, Mariusz Masior, Mateusz Zaborski, and Katarzyna Barczewska.
“Inertial motion sensing glove for sign language gesture acquisition and recog-
nition”. In: IEEE Sensors Journal 16.16 (2016), pp. 6310–6316.
[5] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT
Press, 2016. url: http://www.deeplearningbook.org.
[6] Google. Home - mediapipe. 2020. url: https://google.github.io/mediapipe/
(visited on 07/08/2020).
[7] Kirsti Grobel and Marcell Assan. “Isolated sign language recognition using
hidden Markov models”. In: 1997 IEEE International Conference on Systems,
Man, and Cybernetics. Computational Cybernetics and Simulation. Vol. 1.
IEEE. 1997, pp. 162–167.
[8] Ruchi Manish Gurav and Premanand K. Kadbe. “Real time finger tracking
and contour detection for gesture recognition using OpenCV”. In: 2015 Inter-
national Conference on Industrial Instrumentation and Control, ICIC 2015.
Institute of Electrical and Electronics Engineers Inc., July 2015, pp. 974–977.
isbn: 9781479971657. doi: 10.1109/IIC.2015.7150886.
[9] Harrison. Python Programming Tutorials. 2018. url: https://pythonprogramming.net/, https://pythonprogramming.net/web-development-tutorials/, https://pythonprogramming.net/loading-custom-data-deep-learning-python-tensorflow-keras/, https://pythonprogramming.net/convolutional-neural-network-deep-learning-python-tensorflow-keras/ (visited on 07/02/2020).


[10] Haifeng Jin, Qingquan Song, and Xia Hu. “Auto-keras: An efficient neural
architecture search system”. In: Proceedings of the 25th ACM SIGKDD Inter-
national Conference on Knowledge Discovery & Data Mining. 2019, pp. 1946–
1956.
[11] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic opti-
mization”. In: arXiv preprint arXiv:1412.6980 (2014).
[12] Okan Köpüklü and Ahmet Gündüz. GitHub - okankop/vidaug: Effective Video
Augmentation Techniques for Training Convolutional Neural Networks. May
2019. url: https://github.com/okankop/vidaug (visited on 07/23/2020).
[13] Pradeep Kumar, Himaanshu Gauba, Partha Pratim Roy, and Debi Prosad Do-
gra. “Coupled HMM-based multi-sensor data fusion for sign language recog-
nition”. In: Pattern Recognition Letters 86 (2017), pp. 1–8.
[14] T Kuroda, Y Tabata, A Goto, H Ikuta, M Murakami, et al. “Consumer price
data-glove for sign language recognition”. In: Proc. of 5th Intl Conf. Disability,
Virtual Reality Assoc. Tech., Oxford, UK. 2004, pp. 253–258.
[15] T. Liu, W. Zhou, and H. Li. “Sign language recognition with long short-
term memory”. In: 2016 IEEE International Conference on Image Processing
(ICIP). 2016, pp. 2871–2875.
[16] Sarfaraz Masood, Adhyan Srivastava, Harish Chandra Thuwal, and Musheer
Ahmad. “Real-Time Sign Language Gesture (Word) Recognition from Video
Sequences Using CNN and RNN”. In: Intelligent Engineering Informatics. Ed.
by Vikrant Bhateja, Carlos A. Coello Coello, Suresh Chandra Satapathy, and
Prasant Kumar Pattnaik. Singapore: Springer Singapore, 2018, pp. 623–632.
isbn: 978-981-10-7566-7.
[17] S. A. Mehdi and Y. N. Khan. “Sign language recognition using sensor gloves”.
In: Proceedings of the 9th International Conference on Neural Information
Processing, 2002. ICONIP ’02. Vol. 5. 2002, 2204–2206 vol.5.
[18] Arpit Mittal, Andrew Zisserman, and Philip HS Torr. “Hand detection using
multiple proposals”. In: BMVC. Vol. 40. Citeseer. 2011, pp. 75–1.
[19] NIDCD. American Sign Language — NIDCD. May 2019. url: https://www.
nidcd.nih.gov/health/american-sign-language (visited on 07/10/2020).
[20] World Health Organization. “Deafness and hearing loss”. 2020. url: https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss (visited on 06/11/2020).
[21] SignAll. SignAll Media Kit. 2018. url: https://www.signall.us/about-us/
(visited on 07/10/2020).
[22] Jiuqiang Tang and Chuoling. Issues · google/mediapipe. url: https://github.
com/google/mediapipe/issues (visited on 07/23/2020).
[23] Ivan Vasilev, Daniel Slater, Gianmario Spacagna, Peter Roelants, and Valentino
Zocca. Python Deep Learning: Exploring deep learning techniques and neural
network architectures with Pytorch, Keras, and TensorFlow. Packt Publishing
Ltd, 2019.
