

Article
American Sign Language Alphabet Recognition
Using a Neuromorphic Sensor and an Artificial
Neural Network
Miguel Rivera-Acosta 1, Susana Ortega-Cisneros 1,*, Jorge Rivera 2 and Federico Sandoval-Ibarra 1


1 Advanced Studies and Research Center (CINVESTAV), National Polytechnic Institute (IPN),
Zapopan 45019, Mexico; marivera@gdl.cinvestav.mx (M.R.-A.); sandoval@gdl.cinvestav.mx (F.S.-I.)
2 CONACYT-Advanced Studies and Research Center (CINVESTAV), National Polytechnic Institute (IPN),
Zapopan 45019, Mexico; jriverado@conacyt.mx
* Correspondence: sortega@gdl.cinvestav.mx; Tel.: +52-333-777-3600 (ext. 1069)

Received: 24 July 2017; Accepted: 13 September 2017; Published: 22 September 2017

Abstract: This paper reports the design and analysis of an American Sign Language (ASL) alphabet
translation system implemented in hardware using a Field-Programmable Gate Array. The system
process consists of three stages, the first being the communication with the neuromorphic camera
(also called Dynamic Vision Sensor, DVS) using the Universal Serial Bus protocol. The second stage is
the feature extraction from the events generated by the DVS, carried out by digital image processing
algorithms developed in software, which reduce redundant information and prepare the data for the
third stage. The last stage of the system process
is the classification of the ASL alphabet, achieved with a single artificial neural network implemented
in digital hardware for higher speed. The overall result is the development of a classification system
using the ASL signs contour, fully implemented in a reconfigurable device. The experimental
results consist of a comparative analysis of the recognition rate among the alphabet signs using
the neuromorphic camera in order to prove the proper operation of the digital image processing
algorithms. In the experiments performed with 720 samples of 24 signs, a recognition accuracy of
79.58% was obtained.

Keywords: feature extraction; contour detection; hand posture recognition

1. Introduction
Deafness is a very important problem that affects people both economically and socially, because
all communication between individuals is through language, whose main objective is to share information
between a transmitter and a receiver using various channels (speech, writing, or gestures). Most of
the population uses speech communication; thus, the integration of deaf people into society is very
difficult because of the general lack of knowledge of sign languages. According to the World Health
Organization, there are 360 million people worldwide with hearing loss, of which 32 million are
children [1], which represents approximately 5% of the population. Therefore, it is a global problem.
Research into sign language translation could solve many of the problems that deaf people face every
day, allowing a flow of communication between any person and those with hearing disabilities.
For the hand postures recognition problem, the hand contour is often used because it simplifies
real-time performance. All of the current related work can be globally sorted into software and
hardware based solutions. Several works can be found for the software based platforms, such as [2],
which uses contours to obtain characteristic points to reduce the computation time, which is a similar
approach to the system presented in this work; however, their interesting points extraction algorithms

Sensors 2017, 17, 2176; doi:10.3390/s17102176 www.mdpi.com/journal/sensors



require many steps, such as distance calculations between all of the pixels and the object contour, and some
other full-frame scanning operations. Ghassem et al. [3] proposed a method to extract global hand
features using the external shape, obtaining contour points between two consecutive convex hull
vertices. However, their feature extraction method is not robust. In [4], an algorithm is used to
represent objects in images with a fixed number of sample points, but they use 100 sample points to
generate the descriptor of every contour, making their classification step very computationally complex.
Lahamy and Lichti [5] presented a real-time hand posture recognition using a range camera that gives
visual and distance data, using the closest pixels to the camera for hand area segmentation and a voxel
representation on every incoming image to compare it with a database for classification. They reported
a recognition rate of 93.88%; however, they used only 12 postures, avoiding consideration of the most
similar American Sign Language (ASL) signs, e.g., “A”, “B”, “E”, among others. In addition, it was
implemented on a desktop computer with software (sequential nature) and their hand descriptors
contain a lot of data, which makes it computationally inefficient. Some researchers use more than one
digital camera to obtain a 3D perception [6]. Although it has an accuracy of 95%, the high computer
requirements and the marked gloves make it uncomfortable for real applications.
Recently, the Microsoft Kinect® has been widely used [7–9] because of its ability to obtain
depth information, which is very valuable for cropping the hand areas, if they overlap the face,
allowing real-time applications; however, only 53% accuracy in ASL alphabet classification is reported,
which is low by current standards. Additionally, the Kinect device requires a personal computer with the proper
drivers installed. Cao Dong et al. [10] proposed a system that uses depth images with a Kinect
device to create a hand skeleton for ASL alphabet recognition using a Random Forest learning model,
reaching an accuracy near 90%. However, they require a colored glove to generate the training
dataset. In [11], an FSL (French Sign Language) translation system is implemented as a web platform,
whose application is wide because of the current availability of the internet, but their system uses a sign
search engine based on an encoding of the FSL signs and is not based on image processing. Besides the
power consumption and the many computational resources required, these programmed systems
have a sequential nature; accordingly, the system could present a higher latency, making efficient
real-time applications difficult. These research projects represent a huge advance in hand posture
recognition tasks, but they are not intended to be applied in embedded platforms as most of them
require a full computer, special colored gloves, and more than one sensor. Increasing the parallelism
in the algorithms decreases the execution time. A common method for parallelizing the algorithms
in software is to implement them in a Graphics Processing Unit (GPU) [12,13], although managing
a GPU requires many computer resources and it cannot be used without an operating system with the
proper drivers installed.
For the hardware based solution, the algorithm is custom designed in hardware such as the
Field-Programmable Gate Array (FPGA) used in this work, offering the possibility of parallelizing the
algorithm and improving the execution time without dramatically increasing the resources needed [14].
This device allows for designing and developing complex systems with high parallelization and
pipelining levels, while simultaneously implementing a microprocessor that executes software, all in
a single chip. Hence, the hardware-based solution is more suitable for improving algorithm execution
and gaining a significant reduction in computational complexity. In [15,16], systems are proposed to
recognize 24 ASL alphabet signs implemented in a FPGA using frame-based Red-Green-Blue (RGB)
cameras; nevertheless, the need for a colored glove makes them uncomfortable for practical applications,
and the usage of periodic cameras means that these systems require more operations for a single
frame, including the need for full-frame scanning operations. Wang et al. [17] propose different models
of neural networks using stochastic logic for the implementation of human postures classification
systems in FPGA.
There is another option in camera sensors that can be used for ASL recognition—the neuromorphic
camera, which has an event-based behavior that only upgrades the pixel value when there is
a light change in the scene. This particularity offers the advantage of not requiring periodical
operations; in other words, if there is no movement in the scene, the system does not process any
data. There are previous approaches for gesture detection using the Dynamic Vision Sensor (DVS)
(the same used in the system presented in this manuscript), such as [18], which presents an
event-based gesture recognition system that processes the event stream using a natively event-based
processor from International Business Machines called TrueNorth. They use a temporal filter cascade
to create spatio-temporal frames that are the input data to a CNN (Convolutional Neural Network)
implemented in the event-based processor, and they report an accuracy of 96.46%. In [19], a method
is proposed to recognize the three hand poses in the game of Rock-Paper-Scissors, and the DVS128
sensor is used to capture frames. For the classification, they use a Bayesian network reporting an
accuracy of 89%. Jun Haeng Lee et al. [20] proposed a gesture classification method with two DVSs to
obtain a stereo-vision system; they use spiking neurons to process the incoming events and an HMM
(Hidden Markov Model) based classification stage, and they distinguish between 11 different gestures
involving movement.
Taking advantage of the non-periodical behavior of the DVS, it is viable to increase the parallelism
of the algorithms, improving the system speed when implemented in an FPGA. The novel signature
algorithm presented here is more suitable for implementation in digital hardware and does not
require a full scan of the image, unlike the algorithms used in [16].
For the classification of the signs, a feedforward neural network was designed, an approach that has
proven to be very effective for solving pattern recognition problems.
Additionally, this Artificial Neural Network (ANN) is easily implemented in digital hardware because
it is mainly composed of simple operations such as addition and multiplication.
Therefore, in this work, we propose an ASL alphabet recognition system implemented on an Altera
DE2-115 board, which provides the USB host port required for the communication with the DVS and
a suitable video output port. Only one event-based neuromorphic camera is used. The proposed
system does not require any special gloves, a color-specific background, or a desktop computer. Moreover,
it includes a novel algorithm to obtain the signatures using the contour. This algorithm is designed
for ASL alphabet classification tasks, and the ANN is fully implemented in the
FPGA. This makes our system suitable for embedded applications, resulting in greater portability
and high-speed processing with low computational cost through customized hardware.
As an overall result, this system could be targeted as a smartphone accessory.
The remainder of this paper is organized as follows: Section 2 contains a brief explanation of the
system presented in this work and the requirements to replicate the experiments. The neuromorphic
camera behavior is presented in Section 3. All of the image processing algorithms applied in this
work are fully explained in Section 4. The classification system, including the hardware required to
implement all of the ANN, is presented in Section 5, followed by the synergy between the hardware
accelerators and the processor designed in the FPGA in Section 6. In Section 7, the results obtained
in the simulations of the proposed algorithms for each of the 24 ASL signs and the response of the
experiments performed with the overall system in a confusion matrix are presented. Section 8 contains
the conclusions of the obtained results as well as a comparison with previously proposed hand posture
recognition systems, including the future work and its relevance.

2. Experimental Methods
The system presented in this paper is fully implemented in a FPGA. It classifies 24 of the ASL
alphabet signs using the external contour of the hand in images received from a neuromorphic sensor.
It consists of three main sub-systems: the first is dedicated to communicating with the camera sensor
and receiving the events; the second comprises the image processing algorithms, including noise filtering and
edge sorting; and the last is the classification stage, composed of dynamic neurons,
a static ANN, data-adjusting algorithms, and a comparator. As can be seen in Figure 1, the experiments
have been performed in a controlled environment, with the sensor placed at approximately 30 cm from
the user's hand. Since the neuromorphic camera detects changes in the scene luminosity, a white light
lamp synchronized with the system and focused on the user's hand has an on–off controller to ensure
light changes on the hand. This guarantees that all of the pixels of the hand will exceed their internal
threshold, sending the event.

Figure 1. Experimental setup.

3. Neuromorphic Sensor

The neuromorphic sensor, also called DVS [21], is a sensorial system that consists of a silicon retina formed by pixels in a matrix. The DVS used in this research has a size of 128 × 128 independent pixels. This matrix array is a sensorial retina and every pixel contains asynchronous neuromorphic circuitry. These pixels are sensitive to dynamic luminosity changes in the scene presented to the DVS; in other words, their neuromorphic circuitry allows them to not send data until an event occurs (a change in the scene). Such behavior gives the sensor the peculiarity of very low power consumption, since data traffic only exists when an event occurs. Therefore, the DVS sends little redundant information, in contrast to conventional digital cameras, which periodically transmit images in frames. Each event carries information such as the event polarity, the coordinates in the matrix, and a timestamp value.

The event polarity indicates the type of transition occurring in the pixel, where a dark-to-light transition is called a positive event (ON), and a light-to-dark transition is a negative event (OFF). Every event is represented by 32 bits, which are arranged in four bytes. These bytes contain information about pixel position, event polarity and timestamp.
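To make the event format concrete, the following minimal Python sketch decodes a 32-bit event word into its fields; the bit layout used here (7-bit coordinates, 1-bit polarity, 16-bit timestamp) is only an illustrative assumption, since the exact packing is defined by the DVS driver [22] rather than stated in the text.

# Illustrative only: the real packet layout is fixed by the DVS driver [22].
def decode_event(word: int):
    """Unpack one 32-bit event word into (x, y, polarity, timestamp) under an assumed layout."""
    x = word & 0x7F                      # bits 0-6: column, 0..127
    y = (word >> 7) & 0x7F               # bits 7-13: row, 0..127
    polarity = (word >> 14) & 0x1        # bit 14: 1 = ON (dark-to-light), 0 = OFF
    timestamp = (word >> 16) & 0xFFFF    # bits 16-31: time value
    return x, y, polarity, timestamp

print(decode_event(0x000143A5))          # -> (37, 7, 1, 1)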
Neuromorphic Sensor Communication System

The neuromorphic sensor used in this research can be controlled through the USB protocol [22], although the DVS developers only provide Windows and Linux drivers; thus, part of this research was to migrate the driver into C code for a processor implemented in the FPGA. In order to communicate with the DVS and manage the whole system, a softcore processor was implemented with an SDRAM (Synchronous Dynamic Random Access Memory) as data and instruction memory that shares a 100 MHz oscillator with the softcore microprocessor.

The sensorial sub-system consists of a processor with its SDRAM embedded into the FPGA platform used to implement the system. The On-The-Go USB controller generates the differential signals and some communication stages imposed by the USB protocol and the neuromorphic sensor, as can be seen in Figure 2.

Figure 2. Events reception system block diagram.

4. Image Generation and Processing

All of the information sent by the DVS consists of a set of basic events generated by the retina chip as noted above; therefore, the system needs to decode every event as well as create and process the image. When there is only background noise, the period between incoming events is longer because few dynamic luminosity changes are detected by the silicon retina. In contrast, when there is movement in the hand, many events are generated by the retina, representing a major increase in the USB bus traffic and reducing the period between events. When this happens, the system stores and counts all of the incoming events within a period of 50 ms, which represents a theoretical rate of 20 fps. Then, it creates the image and processes it.

4.1. Noise Filter

The first step in the image processing stage is to reduce the noise as much as possible in order to decrease useless information, and every image received by the system is submitted to a filter. This process eliminates pixels with few neighbors around the image. These events are usually unwanted noise, and this filter is performed using Equation (1) (see Figure 3):

I(x, y) = 1 if Σ Σ I(x + i, y + j) ≥ 5, i, j ∈ {−1, 0, 1}. (1)

Figure 3. Pixel noise filter neighborhood.
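A minimal Python/NumPy sketch of this stage is given below: events are accumulated into a 128 × 128 binary frame over the 50 ms window and then filtered with the neighbor count of Equation (1). The event tuple format and the helper names are illustrative assumptions, not the actual FPGA implementation.

import numpy as np

WINDOW_US = 50_000  # 50 ms accumulation window, as stated in the text

def events_to_frame(events, t_start):
    """Mark every pixel that fired at least once inside the 50 ms window.
    `events` is an iterable of (x, y, timestamp) tuples (illustrative format)."""
    frame = np.zeros((128, 128), dtype=np.uint8)
    for x, y, t in events:
        if t_start <= t < t_start + WINDOW_US:
            frame[y, x] = 1
    return frame

def noise_filter(frame):
    """Equation (1): keep a pixel only if its 3 x 3 neighborhood holds at least 5 active pixels."""
    out = np.zeros_like(frame)
    for y in range(1, frame.shape[0] - 1):
        for x in range(1, frame.shape[1] - 1):
            if frame[y, x] and frame[y - 1:y + 2, x - 1:x + 2].sum() >= 5:
                out[y, x] = 1
    return out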
The resulting image in Figure 4c has very few noisy events and its shape is clear; therefore, it must be tracked and every contour pixel sorted.

Figure 4. Image to event transition and event filtering of the "A" sign, (a) ASL alphabet sign; (b) raw data sent by the neuromorphic sensor; (c) image after filtering.

This filter eliminates few events, but it is useful for decreasing the time required to execute the next algorithm for edge tracing.

4.2. Contour Detection

The second image processing stage consists of an edge tracing algorithm, used to decrease redundant information. In this step, most of the irrelevant events are eliminated. Figure 5 shows that the only information preserved is the ASL sign external shape. Prior to this stage, an ASL sign is represented by approximately 7000 events, but, after this, it is represented by approximately 500 pixels.

Figure 5. "A" ASL alphabet sign process: (a) ASL alphabet sign; (b) Dynamic Vision Sensor events; (c) resulting image.

The process used to obtain the external shape of every image received by the system is called the Moore–Neighbor algorithm [23]. The idea behind its process is simple, but the implementation is complex. An important term for the understanding of this algorithm is the Moore Neighborhood of a pixel, consisting of the eight pixels that share a side or corner with the pixel, which are called P1, P2, P3, P4, P5, P6, P7, and P8, as can be seen in Figure 6c.

The first step of this algorithm is to find the Start Pixel P; it can be found by checking every pixel from any image corner in any constant direction until the pointer reaches the first object pixel, shown in the first step of Figure 6(a1).

Once the Start Pixel is found, a set of steps must be performed. Depending on the Moore–Neighbor from which the last contour pixel was accessed, the pointer must check every Moore–Neighbor in a constant direction (clockwise or anticlockwise). For example, for step (a1) in Figure 6, the Start Pixel was accessed from its Moore–Neighbor P6; thus, the Moore–Neighbors P7, P8, P1, P2, P3, P4, and P5 must be checked in that order until the next object pixel is found. Then, the process repeats until a stopping criterion is met.

Figure 6. Moore–Neighbor (MN) algorithm demonstration, (a) starting steps for the MN algorithm; (b) last steps for the MN algorithm including the stopping criterion; (c) Moore neighborhood of a pixel P.
The stopping criterion of this algorithm is when the Start Pixel is reached for the second time,
as can be seen in the last step of Figure 6b1, or when the only object pixel is that from which it was
accessed. This last stopping criterion represents an open contour.
The resulting image is represented by very few pixels, as can be seen in Figure 5c. This represents
a huge reduction in the digital hardware required to implement the ANN to carry out the ASL alphabet
classification, but is not enough to design an efficient classification system.
There is still redundant information that must be eliminated to properly perform the classification
through the ANN.
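The following Python sketch illustrates the tracing loop described above, assuming a binary NumPy image containing a single object; the corner-scan order and the clockwise neighbor ordering are assumptions of this illustration, not a transcription of the hardware design.

import numpy as np

# Moore neighborhood offsets in clockwise order (dy, dx), starting from the upper-left neighbor.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
             (1, 1), (1, 0), (1, -1), (0, -1)]

def trace_contour(img):
    """Return the ordered external contour of the single object in a binary image."""
    h, w = img.shape
    # 1) Find the Start Pixel by scanning from a corner in a constant direction.
    start = next(((y, x) for y in range(h) for x in range(w) if img[y, x]), None)
    if start is None:
        return []
    contour = [start]
    prev_dir = 0          # assumed entry direction for the Start Pixel
    current = start
    while True:
        found = False
        # 2) Check the Moore neighbors clockwise, starting after the entry direction.
        for k in range(1, 9):
            d = (prev_dir + k) % 8
            dy, dx = NEIGHBORS[d]
            y, x = current[0] + dy, current[1] + dx
            if 0 <= y < h and 0 <= x < w and img[y, x]:
                current = (y, x)
                prev_dir = (d + 4) % 8   # direction pointing back to the previous pixel
                found = True
                break
        if not found:                    # isolated pixel: open contour
            break
        if current == start:             # stopping criterion: Start Pixel reached again
            break
        contour.append(current)
    return contour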

4.3. Characteristic Pixel Extraction


In order to design an efficient system with low computational cost, the quantity of events representing each ASL sign must be decreased without degrading the shape of the object and its characteristics. Therefore, a set of processing stages was developed to reduce the number of pixels representing every contour, which will be called characteristic pixels. With the purpose of reducing the characteristic pixels of every shape, an elimination criterion that allows deleting irrelevant information without affecting the contour features must be selected.

The final characteristic pixel quantity directly affects the complexity of the digital hardware implemented in the classification stage. Thus, some experiments were performed in order to find the minimum pixel quantity while maintaining the features of all the ASL alphabet signs. The first step is to find the elimination criterion; therefore, it is necessary to detect a pattern that appears in all ASL alphabet signs. Most of the sign shapes have vertical features, for example, the arm direction, finger contours, angles, and the distance between them, as can be seen in Figure 7.

Figure 7. American Sign Language alphabet community used, including letters from A to Y excluding letter J.

This is the reason for selecting the vertical derivative as the first evaluation criterion; by calculating the vertical slope of the entire contour, all of the vertical vertex pixels of any shape can be found. The contour slope is calculated using Equation (2), where P represents the characteristic pixel array and n is the size of the characteristic array:

Slope = (Pi+1(y) − Pi(y))/(Pi+1(x) − Pi(x)), 1 ≤ i < n. (2)

In Figure 8, where violet and blue arrows represent negative and positive slopes, respectively, the main vertices are characterized by being found at vertical slope sign changes; thus, by storing these points, the quantity of characteristic pixels is reduced without losing the main shape of the line. With this elimination criterion, the quantity of characteristic pixels is reduced to approximately 15. Using this compact characteristic array of pixels, it is feasible to classify the ASL signs.

Figure 8. Vertical slope of a curved line.
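As an illustration of this criterion, the sketch below keeps only the contour points where the sign of the slope of Equation (2) changes; the contour is assumed to be the ordered list of (y, x) pixels produced by the edge tracing stage, and the sign-change test is one possible reading of the criterion.

def slope(p, q):
    """Equation (2) between consecutive contour points; infinity for vertical steps."""
    dx = q[1] - p[1]
    dy = q[0] - p[0]
    return float('inf') if dx == 0 else dy / dx

def vertical_vertices(contour):
    """Return the characteristic pixels where the slope changes sign along the contour."""
    keep = [contour[0]]
    prev_s = slope(contour[0], contour[1])
    for i in range(1, len(contour) - 1):
        s = slope(contour[i], contour[i + 1])
        if (s >= 0) != (prev_s >= 0):      # sign change -> vertical vertex
            keep.append(contour[i])
        prev_s = s
    keep.append(contour[-1])
    return keep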
Despite this elimination methodology being effective in reducing unnecessary information, it does not guarantee the quantity of characteristic pixels, which ranged between 5 and 30 pixels in the experiments, 13 in the case of a "U" sign, shown in Figure 9. Since this data is the input for the classification system, it is necessary that every characteristic pixel array be of the same size, a requirement of this ANN.

Figure 9. Shape reduction result using the vertical slope algorithm, (a) American Sign Language alphabet sign; (b) output image of edge tracing algorithm; (c) data representation after vertical slope algorithm.
4.4. Characteristic Array Adjusting

In order to select the best size for the characteristic array, a series of experiments was performed with different characteristic pixel quantities, using between 10 and 20 characteristic points. The size selection criterion is the minimum quantity of points necessary to maintain the main features of all the ASL alphabet signs. The artificial neural network will be implemented in digital hardware in the next stage; thus, a larger characteristic array needs a larger ANN, requiring more hardware for its implementation and, as a consequence, consuming more power; therefore, redundant information is particularly harmful to this system.

As can be seen in Figure 10b, using a size of 20, regions with a high density of characteristic pixels are created (surrounded in green), representing redundant information, as only one of those pixels gives the information that this point is part of the main shape. The hardware resources required for 20 characteristic pixels do not fit in our FPGA with our ANN architecture. Furthermore, after the vertical slope algorithm, most of the ASL symbols' characteristic pixel (CP) arrays contain no more than 20 CP; therefore, more time would be required to add pixels in order to fulfill a larger size requirement (e.g., 20 to 50). On the other hand, using 10 characteristic pixels, the main shape of all the signs is maintained and well distributed, with very few crowded areas observed in some signs, and so this is the selected size. As noted above, the resulting size of the characteristic array is not guaranteed; nevertheless, only three possible cases may occur: the quantity of characteristic pixels is greater than, less than, or equal to 10.

Figure 10. Size selection: (a) "B" ASL alphabet sign; (b) image processing result with 20 characteristic pixels; (c) image processing result with 10 characteristic pixels.

In the first case, less representative pixels should be eliminated, as they belong to high-density areas. For this, an algorithm calculates the distance between all subsequent pixels, and then one of the two closest characteristic points is eliminated. This is repeated until a size of 10 is reached. Figure 11 graphically represents the resulting array in each cycle of the previously mentioned reducing algorithm. The eliminated points are surrounded by red, the shape on the left represents the result of the vertical slope algorithm with 12 characteristic pixels, and the shape on the right, Figure 11c, represents the input for the ANN.

Figure 11. Characteristic array down adjusting, (a) characteristic array of size 12; (b) characteristic array of size 11; (c) final characteristic array of size 10.

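A minimal sketch of this reduction loop is shown below, assuming the characteristic pixels are (y, x) tuples; the choice of which of the two closest points to delete is an illustrative simplification.

import math

def shrink_to(points, target=10):
    """Drop one of the two closest subsequent characteristic pixels until `target` remain."""
    pts = list(points)
    while len(pts) > target:
        dists = [math.dist(pts[i], pts[i + 1]) for i in range(len(pts) - 1)]
        i = dists.index(min(dists))
        del pts[i + 1]          # remove one of the two closest points
    return pts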
For the second case, where the array size is smaller than 10, pixels must be added in order to fulfill the size requirement imposed by the ANN. For this case, the image is separated into a finite number of sections, and the frontiers between those sections are marked with violet vertical lines (see Figure 12). Although the quantity of sections can be changed, the more sections there are, the more operations are needed.

Figure 12 represents the Y sign of ASL. It was sectioned into three parts, with the purpose of recognizing which part has the lowest quantity of characteristic pixels. Then, pixels are added to this section. Added pixels are selected from the array obtained in the contour detection algorithm explained in Section 4.2, and the only requirement is that the added pixel belongs to the section with the fewest characteristic pixels. This allows the system to drastically reduce the probability of creating high-density characteristic pixel areas. During the development of this algorithm, it was discovered that some main features lost in the vertical slope algorithm were restored in this pixel addition process. A sketch of this addition step is given after Remark 1 below.

Figure 12. Characteristic array adjustment, (a) characteristic array of size 8; (b) characteristic array of size 9; (c) final characteristic array of size 10.

Figure 13 shows the process used to adjust the characteristic array size of the L sign to 10, adding pixels to the section with fewer points. As noted above, the main feature of the L sign is also restored during this process, increasing the importance of this pixel addition criterion.

Figure 13. Feature restoration, (a) characteristic array of size 8; (b) characteristic array of size 9; (c) final characteristic array of size 10.

Remark 1. This algorithm can be used to extract the signature of any object with vertical features, e.g., the ASL alphabet symbols and standing people, provided a proper segmentation stage is applied beforehand.
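The sketch below illustrates the pixel-addition step referenced above for the second case, assuming three vertical sections over the 128-pixel-wide DVS image; the candidate-selection rule is an illustrative simplification of the criterion described in the text.

def grow_to(points, contour, target=10, sections=3, width=128):
    """Add contour pixels to the emptiest vertical section until `target` points remain."""
    pts = list(points)
    bounds = [width * k / sections for k in range(sections + 1)]
    while len(pts) < target:
        # count characteristic pixels per vertical section (points are (y, x))
        counts = [sum(b0 <= p[1] < b1 for p in pts) for b0, b1 in zip(bounds, bounds[1:])]
        s = counts.index(min(counts))
        candidate = next((c for c in contour
                          if bounds[s] <= c[1] < bounds[s + 1] and c not in pts), None)
        if candidate is None:       # no free contour pixel left in the emptiest section
            break
        pts.append(candidate)
    return pts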
5. Classification System

The main purpose of this system is to classify the ASL alphabet based on the characteristic array of the image; thus, a pattern recognition network is designed in hardware and implemented in a FPGA. ANNs consist of a finite number of basic artificial neurons that exchange data. The static neuron is the most basic block of the ANN implemented in this system, and its structure is presented in Figure 14. The output function f depends primarily on the application of the ANN and the connections of the neuron.

Figure 14. Implemented static neuron architecture.

The output equation of a static neuron is:

a = f[Σ(Wk × Pk) + b]. (3)

Another type of artificial neuron implemented in this system is called the dynamic neuron (DN), which has a delayed behavior and the capacity of storing information; the input of the DN is sequential, as can be seen in Figure 15.

Figure 15. Implemented dynamic neuron architecture.

The output equation of a dynamic neuron is:

a(t) = W1 × P(t) + W2 × P(t − 1). (4)
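Equations (3) and (4) can be transcribed directly; in the sketch below the activation f and the weight values are placeholders, since the text does not specify them.

def static_neuron(p, w, b, f=lambda v: v):
    """a = f( sum_k(W_k * P_k) + b )  -- Equation (3)."""
    return f(sum(wk * pk for wk, pk in zip(w, p)) + b)

def dynamic_neuron(inputs, w1, w2):
    """a(t) = W1 * P(t) + W2 * P(t-1)  -- Equation (4), applied to a sequential input stream."""
    prev = 0
    outputs = []
    for p in inputs:
        outputs.append(w1 * p + w2 * prev)
        prev = p
    return outputs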
5.1. Classification System Specification

In order to reduce the size of the static neural network and the number of digital ports of the main processor, a dynamic neuron was implemented preceding the static ANN. Note that the input for the dynamic neuron must be sequential (a property that allows controlling the dynamic neuron using only one 32-bit data bus).

In Figure 16, an example is presented where the first row represents the number of characteristic pixels, the second and third rows represent the X- and Y-coordinates of each pixel, respectively, and the last row, DNI (Dynamic Neuron Input), represents the input data for the dynamic neuron.

Figure 16. Dynamic neuron input arrangement.

Table 1 shows a time representation of the process performed by the dynamic neuron, where valid data ticks represent the product of a valid X–Y coordinate pair for each of the 10 characteristic points. The resulting characteristic array is presented in Table 1. As can be seen, the quantity of data was reduced by half, and, consequently, the hardware necessary to implement the static pattern recognition ANN will be drastically lower.
Table 1. Dynamic neuron time representation.

Time (t)   P(t)   P(t−1)   a(t)     Valid Data
1          0      0        0
2          465    0        1658.7   √
3          18     465      1103.5
4          76     18       311.33   √
5          89     76       487.33
6          45     89       359.43   √
7          124    45       542.90
8          53     124      466.19   √
9          153    53       664.23
10         42     153      491.77   √
11         215    42       860.81
12         29     215      583.97   √
13         257    29       981.58
14         44     257      731.34   √
15         296    44       1154.2
16         0      296      661.55   √
17         327    0        1166.5
18         153    327      1276.6   √
19         239    153      1194.5
20         464    239      2189.3   √
The implemented pattern recognition ANN consists of two layers with 10 and 24 neurons.
Figure 17 shows the structure of the ANN implemented in the system presented in this paper, and all
of the outputs
Sensors of the first layer are connected to all of the 24 neurons of the output layer.
2017, 17, 2176 12 of 17

Figure 17.
Figure 17. Implemented
Implemented Artificial
Artificial Neural
Neural Network
Network architecture.
architecture.

The second layer is the output connection of the ANN, which consists of 24 static neurons,
The second layer is the output connection of the ANN, which consists of 24 static neurons,
where each neuron output represents the similarity of the input to each of the 24 ASL alphabet signs.
where each neuron output represents the similarity of the input to each of the 24 ASL alphabet signs.
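A behavioural sketch of this network is given below (10 first-layer neurons, 24 output neurons, one per sign); the random weights and the tanh activation are placeholders for illustration only, not the trained parameters or the activation actually implemented in hardware.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(10, 10)), rng.normal(size=10)   # first layer
W2, b2 = rng.normal(size=(24, 10)), rng.normal(size=24)   # output layer, 24 ASL signs

def classify(dn_outputs):
    """Forward pass: every first-layer output feeds all 24 output neurons."""
    hidden = np.tanh(W1 @ dn_outputs + b1)
    scores = W2 @ hidden + b2
    return int(np.argmax(scores))        # index of the most similar ASL sign

print(classify(rng.normal(size=10)))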
5.2. ANN Implementation in Digital Hardware
Usually, the image processing algorithms do not require a very high resolution in the data; thus,
floating-point representation is avoided in order to reduce hardware requirements. This is why
fixed-point representation is commonly used in image processing FPGA implementations; it allows
carrying out many fixed-point arithmetic operations with relatively little hardware. The word length
used is 32 bits, where the most significant bit is the sign bit and the 16 least significant bits are
fraction bits. The number of bits in the integer section will be decreased because this system does not
require large values.
The dynamic neuron consists mainly of two 32-bit registers connected in series in order to obtain the
delayed input, as can be seen in Figure 18.
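As a concrete illustration of this word format (1 sign bit, 15 integer bits and 16 fraction bits), the
following sketch emulates the fixed-point arithmetic in software; the helper names are chosen for this
example only.

```c
#include <stdint.h>

/* Q15.16 fixed point: 1 sign bit, 15 integer bits, 16 fraction bits.   */
typedef int32_t fix32_t;
#define FRAC_BITS 16

static fix32_t fix_from_float(float x) { return (fix32_t)(x * (1 << FRAC_BITS)); }
static float   fix_to_float(fix32_t x) { return (float)x / (1 << FRAC_BITS); }

/* Multiplication needs a 64-bit intermediate so the 16 fraction bits of
 * both operands can be rescaled back to a single Q15.16 result.         */
static fix32_t fix_mul(fix32_t a, fix32_t b)
{
    return (fix32_t)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}

/* Addition and subtraction work directly on the raw representation,
 * e.g., fix_from_float(1.5f) == 0x00018000.                             */
static fix32_t fix_add(fix32_t a, fix32_t b) { return a + b; }
```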
The logic behind static neurons is simpler because of their combinational behavior, but the hardware
required to implement them is substantially larger. Figure 19 shows the block diagram of a static
neuron, composed of combinational hardware; some tests will be performed to analyze the consequences of
reducing the parallelism in these neurons.
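Assuming the Q15.16 helpers sketched above and a weight/bias set produced by the training stage, the
behavior of one static neuron reduces to the multiply-accumulate shown below; the activation (sigmoid)
is applied later in software on the processor.

```c
#define N_INPUTS 10   /* outputs of the first layer feeding each static neuron */

/* Combinational behavior of a static neuron: a multiply-accumulate over
 * its 10 inputs plus a bias, all in Q15.16.  The sigmoid activation is
 * applied afterwards in software on the processor.                       */
static fix32_t static_neuron(const fix32_t in[N_INPUTS],
                             const fix32_t weight[N_INPUTS],
                             fix32_t bias)
{
    fix32_t acc = bias;
    for (int i = 0; i < N_INPUTS; i++)
        acc = fix_add(acc, fix_mul(in[i], weight[i]));
    return acc;
}
```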

Figure 18. Dynamic neuron hardware.


Figure 19. Static neuron hardware.

5.3. ANN Training
The training of the ANN was performed using a desktop computer, and the resulting parameters were
stored in registers inside the FPGA.
For the training of the pattern recognition ANN, a set of 720 input images was used, 30 for every ASL
sign, with all images reduced to 10 characteristic pixels represented by 10 X-coordinates and
10 Y-coordinates, using the abovementioned algorithms.
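A minimal sketch of how such a training set can be laid out is shown below; the array names and the use
of floating point on the desktop side are assumptions for this illustration.

```c
#define N_SAMPLES   720   /* 30 samples for each of the 24 ASL signs            */
#define N_FEATURES   20   /* signature in the format (X1, Y1, ..., X10, Y10)    */
#define N_CLASSES    24

/* Each row holds one signature and the one-hot label of its ASL sign.  */
static float train_x[N_SAMPLES][N_FEATURES];
static float train_y[N_SAMPLES][N_CLASSES];
```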
6. Hardware–Software Interface

The processor used to communicate with the DVS is also used to control the ANN implemented in
hardware. This reduces the complexity of using the network and also increases the performance of the
whole system. The connection between the ANN and the processor is achieved using only four digital
buses: two address buses and two data buses. The inputs and the outputs of the ANN are redirected using
a demultiplexer and a multiplexer, respectively, as shown in Figure 20.

Figure 20. System interconnection.

This allows controlling all of the neurons in a relatively simple manner, and the software required to
manipulate them is substantially simpler as well. Figure 21 shows the block diagram of the
classification system implemented in the FPGA, which contains the software-implemented functions,
including the data adjusting, sigmoid function, and data compression (inside the processor block in
Figure 21), and the hardware blocks (outside the processor block in Figure 21).
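The sketch below illustrates, under the assumption of a memory-mapped register interface (the base
address, offsets, and helper names are invented for this example), how the processor could drive the
address and data buses to load the first-layer inputs through the demultiplexer, start the network, and
read the 24 scores back through the multiplexer before applying the sigmoid in software.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical memory map for the ANN block; the addresses and offsets
 * below are placeholders for this sketch, not the real design.           */
#define ANN_BASE      0x40000000u
#define ANN_INPUT(i)  (*(volatile int32_t *)(ANN_BASE + 0x00 + 4u * (i)))  /* demux */
#define ANN_OUTPUT(k) (*(volatile int32_t *)(ANN_BASE + 0x40 + 4u * (k)))  /* mux   */
#define ANN_CTRL      (*(volatile uint32_t *)(ANN_BASE + 0xA0))
#define ANN_START     0x1u

static float q16_to_float(int32_t v) { return (float)v / 65536.0f; }  /* Q15.16 */

static void classify(const int32_t input_q16[10], float score[24])
{
    for (int i = 0; i < 10; i++)        /* write the 10 inputs through the demux */
        ANN_INPUT(i) = input_q16[i];

    ANN_CTRL = ANN_START;               /* trigger the hardware neurons          */

    for (int k = 0; k < 24; k++) {      /* read each accumulator through the mux */
        float a  = q16_to_float(ANN_OUTPUT(k));
        score[k] = 1.0f / (1.0f + expf(-a));   /* sigmoid applied in software    */
    }
}
```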

Figure 21. Classification system data flow.



7. Experimental Results
In order to know the precision of the image processing algorithms, we performed some experiments with
a total of 720 images (30 for every ASL sign image), different from the training set, under controlled
conditions with no background movements and using the method detailed in Section 2 of this paper.


Figure 22 shows graphics with the signatures obtained in the image processing stage for the ASL
alphabet signs, whose vectors are the input for the classification system. As shown in Figure 22, the
letters "Q" and "L" have few similarities with other ASL sign arrays.

Figure 22. ASL alphabet sign signatures used, including letters from A to Y excluding letter J, where
the X-axis represents the 20 X–Y coordinates of the 10 characteristic pixels in the format (X1, Y1, X2,
Y2, …, X10, Y10) and the Y-axis represents the value of the coordinates.

These graphs represent the results obtained in simulations of the 20 X–Y pair coordinates for a single
ASL alphabet sign.
In the confusion matrix (Table 2), where columns represent the target and rows represent the system
response, we can see that the best cases are the letters "L", "Q", and "T", mainly because these signs
have very few similarities with the others and because of their vertical features. Conversely, the
worst case is the letter "B", with only 30% accuracy.

Table 2. Confusion matrix.

A B C D E F G H I K L M N O P Q R S T U V W X Y
A 25 0 1 0 0 1 0 0 2 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0
B 0 9 0 3 0 0 1 1 0 0 0 3 0 2 0 0 0 0 0 3 0 0 0 0
C 0 0 29 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D 0 5 0 20 0 0 0 1 0 0 0 0 2 0 0 0 0 1 0 4 0 0 0 0
E 0 0 0 0 22 0 0 0 0 0 0 1 0 2 0 0 0 0 0 1 0 0 0 0
F 0 0 0 1 1 23 0 0 0 0 0 1 0 2 0 0 0 0 0 0 0 0 0 1
G 0 0 0 0 0 0 27 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
H 0 0 0 0 0 0 1 24 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0
I 0 1 0 0 0 0 0 0 22 0 0 0 3 0 0 0 1 0 0 0 0 0 0 0
K 0 0 0 0 0 0 0 0 0 27 0 0 0 0 1 0 0 0 0 0 7 0 0 0
L 0 0 0 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0 0 0 0 0 0 0
M 1 7 0 1 4 2 0 0 1 0 0 23 0 3 0 0 0 1 0 0 0 0 0 1
N 0 0 0 0 1 0 0 0 1 0 0 0 17 0 0 0 0 0 0 1 0 0 4 0
O 1 4 0 0 2 3 0 0 1 0 0 0 0 19 0 0 0 0 0 0 0 1 0 0
P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 29 0 0 0 0 0 0 0 0 0
Q 0 0 0 0 0 0 1 3 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0 0
R 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 26 0 0 3 0 0 4 0
S 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 27 0 0 0 0 0 0
T 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 30 0 0 0 0 0
U 0 3 0 4 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 16 0 0 1 0
V 0 0 0 0 0 0 0 0 1 3 0 0 0 0 0 0 0 0 0 0 22 1 0 0
W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 28 1 0
X 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 0 20 0
Y 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 28

The overall accuracy percentage is around 80%, which is an acceptable result for the simplicity and
easy implementation of the algorithms and the low resolution of the available neuromorphic sensor.
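For reference, the overall figure can be reproduced from Table 2 by summing the diagonal (correct
classifications) and dividing by the 720 test samples, which gives 573/720 ≈ 79.58%; a minimal sketch:

```c
#define N_CLASSES 24

/* Overall accuracy from the confusion matrix: correct classifications lie
 * on the diagonal (response == target); 720 samples were evaluated in total. */
static float overall_accuracy(const int confusion[N_CLASSES][N_CLASSES])
{
    int correct = 0, total = 0;
    for (int r = 0; r < N_CLASSES; r++)
        for (int c = 0; c < N_CLASSES; c++) {
            total += confusion[r][c];
            if (r == c)
                correct += confusion[r][c];
        }
    return (float)correct / (float)total;   /* Table 2 gives 573/720 = 0.7958 */
}
```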
The results obtained with the whole classification system are very dependent on the sign presented.
The letter “B” was misclassified with many different signs, which we attribute to two factors: the low
resolution of the DVS and the need for more robust characteristic pixel classification algorithms.
Nevertheless, this is the first work that uses a Dynamic Vision Sensor with the proposed algorithms
for the classification task, while proving that good results can be obtained using a non-frame-based
sensor, reducing the need to periodically process images.
Some previously presented systems [8] reported an accuracy of around 53% using a Microsoft Kinect for
depth information. Kumar et al. achieved an accuracy as high as 90.80%, but they used two different
sensors and their system was implemented in software requiring a computer, while other presented
systems require specially colored gloves [15].

8. Conclusions
Hearing loss is a problem that affects 5% of the world population. Many efforts have been made to
improve both the accuracy and the performance of hand posture recognition tasks, but almost all of them
require computers with an operating system, making them unsuitable for embedded applications, which
represent a more realistic approach because of the mobility of those platforms. For this reason, this
work focuses on the development of a system in an FPGA, which can be implemented in an embedded
platform with very low computational requirements. Additionally, the usage of the DVS with the
abovementioned algorithms is a novel approach for ASL alphabet recognition, which significantly reduces
redundant processing.
The novel algorithms developed for this system were specially designed for American Sign
Language classification using contours, which have proven to be very efficient for obtaining the feature
vector for classifying the ASL alphabet using artificial neural networks. These vectors are also very
suitable for use as a training set for the ANN, so few neurons are required for its implementation.
The system accuracy is acceptable, considering the few operations required by its algorithms;
the process performed to classify every image requires only one full image scan in the noise filtering
step, which drastically reduces the operations required and the time taken to process an input image.
In contrast, RGB image processes such as gray scaling, morphological operations, and kernel-based
algorithms require full scanning and processing of the image, which has repercussions not only on the
time needed by their algorithms, but also on the memory, which must be resized according to the
input image.
The accuracy of the overall system can be improved once the aforementioned main weaknesses are solved.
The noise filter stage is not robust enough to ensure a fast edge tracing process, which is the main
reason for a longer response delay. Most of the lost symbol features are eliminated in the vertical
slope algorithm because they are not vertical features. This results in a weak signature extraction for
some ASL symbols, which affects the ANN training performance as well as the overall accuracy. Due to
the nature of the signature extraction algorithm, it is not invariant to moderate orientation changes
of the ASL symbol. Nevertheless, the images obtained for testing the system presented an angular
variation of approximately ±8°. However, if this algorithm is applied to more rotation-variant
applications, it should be improved to decrease its rotation sensitivity.
There are still some tests and modifications to be made in order to increase the accuracy while
maintaining the system simplicity. For example, changing the fixed-point format could reduce the
required hardware. Adding more characteristic pixel selection criteria could lead to a more robust
event processing stage, so horizontal features can be captured, yielding an algorithm that is robust
against hand rotations. An algorithm for detecting the time of the initial and final frames in the
neuromorphic sensor for real-time application is another issue that remains.
Other ANN architectures are being tested, including the Convolutional Neural Network (CNN), which
processes the input data without requiring pre-processing, thereby avoiding the noise filter, and the
Recurrent Neural Network (RNN), whose sequential nature makes it more suitable for the asynchronous
spike stream sent by the DVS.

Acknowledgments: M.R-A. would like to thank CONACYT for supporting his work with grant number 336595.
Author Contributions: M.R.-A. and S.O.-C. conceived and designed the experiments; M.R.-A. and F.S.-I.
performed the experiments; M.R.-A. and J.R. analyzed the data; M.R.-A. wrote the paper.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. World Health Organization. Deafness and Hearing Loss. Available online: http://www.who.int/
mediacentre/factsheets/fs300/en/ (accessed on 5 July 2017).
2. Cong, Y.; Christian, F. Shape-based object matching using interesting points and high-order graphs. PRL
2016, 83, 251–260. [CrossRef]
3. Ghassem, T.S.; Amirhassan, M.; Nasser, G.A. Rapid hand posture recognition using adaptive histogram
template of skin and hand edge contour. In Proceedings of the 2010 6th Iranian MVIP, Isfahan, Iran,
27–28 October 2010.
4. Junwei, W.; Xiang, B. Shape matching and classification using height functions. PRL 2012, 33, 134–143.
[CrossRef]
5. Hervé, L.; Derek, D.L. Towards real-time and rotation-invariant American Sign Language alphabet
recognition using a range camera. Sensors 2012, 12, 14416–14441. [CrossRef]
6. Watcharin, T.; Suchin, A.; Chuchart, P. American Sign Language recognition using 3D geometric
invariant feature and ANN classification. In Proceedings of the 2014 7th BMEiCON, Fukuoka, Japan,
26–28 November 2014. [CrossRef]
7. Kin, F.L.; Kylee, L.; Stephen, L. A web-based sign language translator using 3D video processing.
In Proceedings of the 2011 14th International Conference on NBiS, Tirana, Albania, 7–9 September 2011.
[CrossRef]
8. Nicolas, P.; Richard, B. Spelling it out: Real-time ASL fingerspelling recognition. In Proceedings of the 2011
IEEE International Conference on ICCVW, Barcelona, Spain, 6–13 November 2011.
9. Pradeep, K.; Himaanshu. Coupled HMM-based multi-sensor data fusion for sign language recognition. PRL
2016, 86, 1–8. [CrossRef]
10. Cao, D.; Ming, C.L.; Zhaozheng, Y. American Sign Language alphabet recognition using microsoft kinect.
In Proceedings of the 2015 IEEE Conference on CVPRW, Boston, MA, USA, 7–12 June 2015. [CrossRef]
11. Mohammed, Z.; Zehira, H. An online reversed French Sign Language dictionary based on a learning
approach for signs classification. PRL 2015, 67, 28–38. [CrossRef]
12. Jaime, S.; Andrew, F.; Mat, C. Real-time human pose recognition in parts from single depth images.
In Proceedings of the 2011 IEEE Conference on CVPR, Colorado Springs, CO, USA, 20–25 June 2011.
[CrossRef]
13. Cem, K.; Furkan, K.; Yunus, E.K. Randomized decision forests for static and dynamic hand shape
classification. In Proceedings of the 2012 IEEE Computer Society Conference on CVPRW, Providence,
RI, USA, 16–21 June 2012. [CrossRef]
14. Byungik, A. Real-time video object recognition using convolutional neural network. In Proceedings of the
2015 International Joint Conference on ICJNN, Killarney, Ireland, 12–17 July 2015. [CrossRef]
15. Sridevi, K.; Sundarambal, M.; Muralidharan, K.; Josephine, R.L. FPGA implementation of hand gesture
recognition system using neural networks. In Proceedings of the 2017 11th International Conference on ISCO,
Coimbatore, India, 5–6 January 2017. [CrossRef]
16. Hiroomi, H.; Keishi, K. Novel FPGA implementation of hand sign recognition system with SOM-Hebb
classifier. IEEE Trans. Circuits Syst. Video Technol. 2015, 25, 153–166. [CrossRef]
17. Wang, X.; Chen, W.; Ji, Y.; Ran, F. Gesture recognition based on parallel hardware neural network
implemented with stochastic logics. In Proceedings of the 2016 International Conference on ICALIP, Shanghai,
China, 11–12 July 2016. [CrossRef]
18. Arnon, A.; Dharmendra, M.; Tobi, D. A low power, fully event-based gesture recognition system.
In Proceedings of the CVPRW, Honolulu, HI, USA, 22–25 July 2017; pp. 7243–7252.
19. Eun, Y.; Jun, H.; Tracy, M. Dynamic vision sensor camera based bare hand gesture recognition. In Proceedings
of the 2011 IEEE Symposium on CIMSIVP, Paris, France, 11–15 April 2011. [CrossRef]
20. Jun, H.; Tobi, D.; Michael, P. Real-time gesture interface based on event-driven processing from stereo silicon
retinas. IEEE TNNLS 2014, 25. [CrossRef]
21. IniLabs. Available online: http://siliconretina.ini.uzh.ch/wiki/doku.php?id=userguide#user_guides_for_
dvs128_and_dvs128_paer_dynamic_vision_sensors (accessed on 23 May 2016).
22. Universal Serial Bus Specification. Available online: http://www.usb.org/developers/docs/usb20_docs/
#usb20spec (accessed on 18 September 2017).
23. McGill University. Available online: http://www.imageprocessingplace.com/downloads_V3/root_
downloads/tutorials/contour_tracing_Abeer_George_Ghuneim/index.html (accessed on 14 November 2016).

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
