
Article
Multi‑Stroke Thai Finger‑Spelling Sign Language Recognition
System with Deep Learning
Thongpan Pariwat and Pusadee Seresangtakul *

Natural Language and Speech Processing Laboratory, Department of Computer Science, Faculty of Science,
Khon Kaen University, Khon Kaen 40002, Thailand; thongpan.par@kkumail.com
* Correspondence: pusadee@kku.ac.th; Tel.: +66‑80‑414‑0401

Abstract: Sign language is a type of language for the hearing impaired that people in the general
public commonly do not understand. A sign language recognition system, therefore, represents an
intermediary between the two sides. As a communication tool, a multi‑stroke Thai finger‑spelling
sign language (TFSL) recognition system featuring deep learning was developed in this study. This
research uses a vision‑based technique on a complex background with semantic segmentation per‑
formed with dilated convolution for hand segmentation, hand strokes separated using optical flow,
and learning feature and classification done with convolution neural network (CNN). We then com‑
pared the five CNN structures that define the formats. The first format was used to set the number
of filters to 64 and the size of the filter to 3 × 3 with 7 layers; the second format used 128 filters, each
filter 3 × 3 in size with 7 layers; the third format used the number of filters in ascending order with
7 layers, all of which had an equal 3 × 3 filter size; the fourth format determined the number of filters
in ascending order and the size of the filter based on a small size with 7 layers; the final format was
a structure based on AlexNet. As a result, the average accuracy was 88.83%, 87.97%, 89.91%, 90.43%,
and 92.03%, respectively. We implemented the CNN structure based on AlexNet to create models
for multi-stroke TFSL recognition systems. The experiment was performed using an isolated video of 42 Thai alphabets, which are divided into three categories consisting of one stroke, two strokes, and three strokes. The results presented an 88.00% average accuracy for one stroke, 85.42% for two strokes, and 75.00% for three strokes.

Keywords: TFSL recognition system; deep learning; semantic segmentation; optical flow; complex background

1. Introduction
Hearing-impaired people around the world use sign language as their medium for communication. However, sign language is not universal, with one hundred varieties used around the world [1]. One of the most widely used types of sign language is American Sign Language (ASL), which is used in the United States, Canada, West Africa, and Southeast Asia and influences Thai Sign Language (TSL). Typically, global sign language is divided into two forms: gesture language and finger-spelling. Gesture language, or sign language, involves the use of hand gestures, facial expressions, and the use of the mouth and nose to convey meanings and sentences. This type is used for communication between deaf people in everyday life, focusing on terms such as eating, ok, sleep, etc. Finger-spelling sign language is used for spelling letters related to the written language with one's fingers and is used to spell the names of people, places, animals, and objects.
ASL is the foundation of Thai finger-spelling sign language (TFSL). TFSL was invented in 1953 by Khunying Kamala Krairuek using American finger-spelling as a prototype to represent the 42 Thai consonants, 32 vowels, and 6 intonation marks [2]. All forty-two Thai letters can be presented with a combination of twenty-five hand gestures. For this purpose, the number signs are combined with alphabet signs to create additional meanings.


For example, the K sign combined with the 1 sign (K+1) means ‘ข’ (/kʰ/). Consequently, TFSL strokes can be organized into three groups (one-stroke, two-stroke, and three-stroke groups) with 15 letters, 24 letters, and 3 letters, respectively. However, ‘ฃ’ (/kʰ/) and ‘ฅ’ (/kʰ/) have not been used since Thailand’s Royal Institute Dictionary announced their discontinuation [3], and they are no longer used in Thai finger-spelling sign language.
This study is part of a research series on TFSL recognition that focuses on one stroke
performed by a signer standing in front of a blue background. Global and local features us‑
ing support vector machine (SVM) on the radial basis function (RBF) kernel were applied
and were able to achieve 91.20% accuracy [4]; however, there were still problems with
highly similar letters. Therefore, a new TFSL recognition system was developed by ap‑
plying pyramid histogram of oriented gradient (PHOG) and local features with k‑nearest
neighbors (KNN), providing an accuracy up to 97.6% [3]. However, past research has been
conducted only using one‑stroke, 15‑character TFSL under a simple background, which
does not cover all types of TFSL. There is also research on TFSL recognition systems that
uses a variety of techniques, as shown in Table 1.

Table 1. Research on Thai finger‑spelling sign language (TFSL) recognition systems.

Researchers | No. of Signs | Method | Dataset | Accuracy (%)
Adhan and Pintavirooj [5] | 42 | Black glove with 6 sphere markers, geometric invariant and ANN | 1050 | 96.19
Saengsri et al. [6] | 16 | Glove with motion tracking | N/A | 94.44
Nakjai and Katanyukul [2] | 25 | CNN | 1375 | 91.26
Chansri and Srinonchat [7] | 16 | HOG and ANN | 320 | 83.33
Silanon [8] | 16 | HOG and ANN | 2100 | 78.00

The study of TFSL recognition systems for practical applications requires using a va‑
riety of techniques, since TFSL has multi‑stroke gestures and combines hand gestures to
convey letters as well as complex background handling and a wide range of light volumes.
Deep learning is a tool that is increasingly being used in sign language
recognition [2,9,10], face recognition [11], object recognition [12], and others. This tech‑
nology is used for solving complex problems such as object detection [13], image segmen‑
tation [14], and image recognition [12]. With this technique, the feature learning method is
implemented instead of feature extraction. A convolution neural network (CNN) is a feature-learning process that can be applied to recognition and can provide high performance.
However, deep learning requires much data for training, and such data could use simple
background or complex background images. In the case of a picture with a simple back‑
ground, the object and background color are clearly different. Such an image can be used
as training data without cutting out the background. On the other hand, a complex back‑
ground image features objects and backgrounds that have similar colors. Cutting out the
background image will obtain input images for training with deep learning, featuring only
objects of interest without distraction. This can be done using the semantic segmentation
method, which uses deep learning to apply image segmentation. This method requires
labeling objects of interest to separate them from the background. Autonomous driving
is one application of semantic segmentation used to identify objects on the road, and can
also be used for a wide range of other applications.
This study is based on the challenges of the TFSL recognition system. There are up
to 42 Thai letters that involve spelling with one’s fingers—i.e., spelling letters with a com‑
bination of multi‑stroke sign language gestures. The gestures for spelling several letters,
however, are similar, so a sign language recognition system under a complex background
should instead be used. There are many studies about the TFSL recognition system. How‑
ever, none of them study the TFSL multi‑stroke recognition system using the vision‑based
technique with complex background. The main contributions of this research are the ap‑
plication of a new framework for a multi‑stroke Thai finger‑spelling sign language recog‑
nition system to videos under a complex background and a variety of light intensities by
separating people’s hands from the complex background via semantic segmentation meth‑
ods, detecting changes in the strokes of hand gestures with optical flow, and learning fea‑
tures with CNN under the structure of AlexNet. This system supports recognition that
covers all 42 characters in TFSL.
The proposed method focuses on developing a multi‑stroke TFSL recognition system
with deep learning that can act as a communication medium between the hearing impaired
and the general public. Semantic segmentation is then applied to hand segmentation for
complex background images, and optical flow is used to separate the strokes of sign lan‑
guage. The processes of feature learning and classification use CNN.

2. Review of Related Literature


Research on sign language recognition systems has been developed in sequence with
a variety of techniques to obtain a system that can be used in the real world. Deep learn‑
ing is one of the most popular methods effectively applied to sign language recognition
systems. Nakjai and Katanyukul [2] used CNN to develop a recognition system for TFSL
with a black background image and compared 3‑layer CNN, 6‑layer CNN, and HOG, find‑
ing that the 3‑layer CNN with 128 filters on every layer had an average precision (mAP)
of 91.26%. Another study that applied deep learning to a sign language recognition sys‑
tem was published by Lean Karlo S. et al. [9], who developed an American sign language
recognition system with CNN under a simple background. The dataset was divided into
3 groups: alphabet, number, and static word. The results show that alphabet recogni‑
tion had an average accuracy of 90.04%, number recognition had an accuracy average of
93.44%, and static word recognition had an average accuracy of 97.52%. The total aver‑
age of the system was 93.67%. Rahim et al. [15] applied deep learning to a non‑touch sign
word recognition system using hybrid segmentation and CNN feature fusion. This system
used SVM to recognize sign language. The research results under a real‑time environment
provided an average accuracy of 97.28%.
Various sign language studies aimed to develop a vision base by using, e.g., image
and video processing, object detection, image segmentation, and recognition systems. Sign
language recognition systems using vision can also be divided into two types of back‑
grounds: simple backgrounds [3,4,9,16] and complex backgrounds [17–19]. A simple back‑
ground entails the use of a single color such as green, blue, or white. This can help hand segmentation work more easily, and the system can achieve recognition accuracy at a high level.
Pariwat et al. [4] used sign language images on a blue background while the signer wore a
black jacket in a TFSL recognition system with SVM on RBF, including global and local fea‑
tures. The average accuracy was 91.20%. Pariwat et al. [3] also used a simple background
in a system that was developed by combining PHOG and local features with KNN. This
combination enhanced the average accuracy to levels as high as 97.80%. Anh et al. [10]
presented a Vietnamese language recognition system using deep learning in a video se‑
quence format. A simple background, like a white background, was used in training and
testing, providing an accuracy of 95.83%. Using a simple background makes it easier to
develop a sign language recognition system and provides high‑accuracy results, but this
method is not practical due to the complexity of the backgrounds in real‑life situations.
Studying sign language recognition systems with complex backgrounds is challeng‑
ing. Complex backgrounds consist of a variety of elements that are difficult to distinguish,
which requires more complex methods. Chansri et al. [7] developed a Thai sign language
recognition system using Microsoft Kinect to help locate a person’s hand in a complex back‑
ground. This study used the fusion of depth and color video, HOG features, and a neural
network, resulting in accuracy of 84.05%. Ayman et al. [17] proposed the use of Microsoft
Kinect for Arabic sign language recognition systems with complex backgrounds with his‑
togram of oriented gradients—principal component analysis (HOG‑PCA) and SVM. The
resulting accuracy value was 99.2%. In summary, research on sign language recognition
systems using vision‑based techniques remains a challenge for real‑world applications.

3. The Proposed Method
The proposed method is a development of the multi-stroke TFSL recognition system covering 42 letters. The system consists of four parts: (1) creating a SegNet model via semantic segmentation using dilated convolutions; (2) creating a hand-segmented library with the SegNet model; (3) creating the TFSL model with CNN; and (4) classification, using optical flow for hand stroke detection and the CNN model for classification. Figure 1 shows an overview of the system.

Figure 1. Multi-stroke Thai finger-spelling sign language recognition framework.

3.1. Hand Segmentation Model Creation
Segmentation is one of the most important processes for recognition systems. In particular, sign language recognition systems require the separation of the hands from the background image. Complex backgrounds are a challenging task for hand segmentation because they feature a wide range of colors and lighting. Semantic segmentation has been effectively applied to complex background images, and it is used to describe images at a pixel level with a class label [20]. For example, in an image containing people, trees, cars, and signs, semantic segmentation can find separate labels for the objects depicted in the image; it is applied to autonomous vehicles, robotics, human–computer interaction, etc.
Dilated convolution is used in segmentation tasks and has consistently improved accuracy performance [21]. Dilated convolution provides a way to exponentially increase the receptive view (global view) of the network while adding parameters only linearly. In general, dilated convolution applies the filter to the input with a spacing defined by the dilation rate. For example, in Figure 2, a 1-dilated convolution (a) refers to the normal convolution with a 3 × 3 receptive field, a 2-dilated convolution (b) refers to one-pixel spacing with a receptive field of 7 × 7, and a 4-dilated convolution (c) refers to a 3-pixel spacing with a receptive field of 15 × 15. The red dots represent the 3 × 3 input filter, and the green area is the receptive area of these inputs.
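To make the receptive-field growth concrete, the short PyTorch sketch below (an illustration only; the paper does not publish code, and the framework, channel width, and input size used here are our assumptions) stacks 3 × 3 convolutions with dilation rates 1, 2, and 4 and prints how the receptive field of the stack grows to 3, 7, and 15 pixels, matching Figure 2.

import torch
import torch.nn as nn

layers = []
receptive_field = 1
for dilation in (1, 2, 4):
    # padding = dilation keeps the spatial size of a 3 x 3 kernel unchanged
    layers += [nn.Conv2d(8, 8, kernel_size=3, padding=dilation, dilation=dilation),
               nn.BatchNorm2d(8),
               nn.ReLU(inplace=True)]
    receptive_field += 2 * dilation          # each layer adds (kernel_size - 1) * dilation
    print(f"dilation {dilation}: receptive field {receptive_field} x {receptive_field}")

net = nn.Sequential(*layers)
out = net(torch.randn(1, 8, 150, 250))       # e.g. one 8-channel 150 x 250 feature map
print(out.shape)                             # spatial size preserved: torch.Size([1, 8, 150, 250])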

Figure 2. Dilated convolution on a 2D image: (a) the normal convolution uses a 3 × 3 receptive field; (b) one-pixel spacing gives a 7 × 7 receptive field; and (c) three-pixel spacing uses a receptive field size of 15 × 15.
This study employs semantic segmentation using dilated convolutions to create a hand-segmented library. The process for creating a segmented library is shown in Figure 3. First, image files and image labels for the hand signs are created as training data. Second, a semantic segmentation network model (SegNet model) is created. Each convolution block consists of three layers: dilated convolution, batch normalization, and a rectified linear unit (ReLU), with the final layer used for softmax and pixel classification. Once data training is completed, the resulting SegNet model is ready to be used in the next step, where it is applied in the process of creating the hand-segmented library.
Figure 3. Hand segmentation model creation via semantic segmentation.
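As a rough illustration of the model described above (the architecture below is an assumption for illustration, not the authors' exact SegNet; the layer count and channel widths are arbitrary), a dilated-convolution network ending in per-pixel class scores can be trained with a pixel-wise cross-entropy loss to separate the hand from the background:

import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        blocks = []
        in_ch = 3
        for dilation in (1, 2, 4):                      # dilated conv + batch norm + ReLU blocks
            blocks += [nn.Conv2d(in_ch, 32, 3, padding=dilation, dilation=dilation),
                       nn.BatchNorm2d(32),
                       nn.ReLU(inplace=True)]
            in_ch = 32
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Conv2d(32, num_classes, kernel_size=1)  # per-pixel class scores

    def forward(self, x):
        return self.classifier(self.features(x))        # (N, num_classes, H, W)

model = TinySegNet()
criterion = nn.CrossEntropyLoss()                        # softmax + pixel classification
images = torch.randn(4, 3, 240, 320)                     # RGB frames, 320 x 240 as in the training data
labels = torch.randint(0, 2, (4, 240, 320))              # pixel labels: 0 = background, 1 = hand
loss = criterion(model(images), labels)
loss.backward()
print(loss.item())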
3.2. Hand-Segmented Library Creation
Creating a hand-segmented library is done as preparation for training, with the input hand sign images used to create a library consisting of the 25 gestures of TFSL. Each gesture has 5000 images, for a total of 125,000 images. This process includes the following steps, as shown in Figure 4. The first step is to perform semantic segmentation by applying the SegNet model from the previous process to separate the hands from the background. The second step is to label the position of the hand as a binary image. The third step is to blur the image with a Gaussian filter to reduce noise and to remove small objects from the binary image, leaving only large areas. The fourth step sorts the areas in ascending order, selects the largest area (the area of the hand), and creates a bounding box at the position of the hand to frame only the features of interest. The last step is to crop the hand by applying the cropped binary image to the input image, keeping only the region of interest, and then to resize the image to 150 × 250 pixels. Examples of images from the hand-segmented library are shown in Figure 5.

Figure 4. The process of creating a hand-segmented library.

Figure 5. Examples of the hand-segmented library files.
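The post-processing steps in Figure 4 can be sketched roughly as follows (an assumed OpenCV/NumPy implementation; the function name crop_hand and the kernel size are illustrative, not taken from the paper):

import cv2
import numpy as np

def crop_hand(frame_bgr: np.ndarray, hand_mask: np.ndarray) -> np.ndarray:
    """frame_bgr: original image; hand_mask: binary mask (0/255) from the SegNet model."""
    blurred = cv2.GaussianBlur(hand_mask, (5, 5), 0)          # 3) Gaussian filter to reduce noise
    _, binary = cv2.threshold(blurred, 127, 255, cv2.THRESH_BINARY)

    # 4) keep only the largest connected region (the hand) and find its bounding box
    n_labels, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    if n_labels < 2:
        raise ValueError("no hand region found")
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])      # skip label 0 (background)
    x, y, w, h = (stats[largest, cv2.CC_STAT_LEFT], stats[largest, cv2.CC_STAT_TOP],
                  stats[largest, cv2.CC_STAT_WIDTH], stats[largest, cv2.CC_STAT_HEIGHT])

    # 5) mask out the background, crop the bounding box and resize to 150 x 250 pixels
    hand_only = cv2.bitwise_and(frame_bgr, frame_bgr,
                                mask=(labels == largest).astype(np.uint8) * 255)
    crop = hand_only[y:y + h, x:x + w]
    return cv2.resize(crop, (150, 250))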
3.3. TFSL Model Creation
This process uses data training methods to learn features and create a CNN classification model, as shown in Figure 6. The training starts with introducing data from the hand-segmented library into deep learning via CNN, which consists of multiple layers in the feature detection process, each with convolution (Conv), ReLU, and pooling [22].

Figure 6. Training process with convolution neural network (CNN).

These three processes are repeated for each cycle of layers, with each layer learning to detect different features. Convolution is a layer that learns features from the input image. For convolution, small squares of input data are used to learn the image features while preserving the relationship between pixels. The image matrix and the filter (or kernel) are combined mathematically by the dot product, as shown in Figure 7.

Figure 7. Learning features with a convolution filter [23].

ReLU is an activation function that allows an algorithm to work faster and more effectively. The function returns 0 if it receives any negative input, but for any positive value x it returns that value, as presented in Figures 8 and 9. Thus, ReLU can be written as Equation (1):

f(x) = max(0, x)   (1)

Figure 8. The ReLU function.

Figure 9. An example of the values after using the ReLU function [20].
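As a quick numeric check of Equation (1), the following NumPy lines (illustrative only) apply ReLU to a small feature map such as the one in Figure 9, zeroing the negative entries and passing the positive ones through unchanged:

import numpy as np

feature_map = np.array([[ 18, -12,   5],
                        [ -7,  31, -20],
                        [  2,  -3,  44]])
relu_output = np.maximum(0, feature_map)   # negatives become 0, positives pass through
print(relu_output)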
Pooling resizes the data to a smaller size, with the details of the input remaining intact. It also has the advantage of increasing the sensitivity to calculations and solving the problem of overfitting. There are two types of pooling layers, max and average pooling, as illustrated in Figure 10.

Figure 10. An example of average and max pooling [23].
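Putting the pieces together, the sketch below (assumed layer sizes and depth; it is not the AlexNet-based structure finally adopted in the experiments) shows how convolution + ReLU + max-pooling blocks feed a fully connected layer and a softmax output, the classification sub-layers described in the next paragraph; the 25 output classes correspond to the 25 TFSL hand gestures.

import torch
import torch.nn as nn

class TFSLNet(nn.Module):
    def __init__(self, num_classes: int = 25):
        super().__init__()
        blocks = []
        in_ch = 3
        for out_ch in (32, 64, 128):                        # feature detection: Conv + ReLU + Pool
            blocks += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(kernel_size=2)]
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(                    # classification: FC + softmax
            nn.Flatten(),
            nn.Linear(128 * 31 * 18, num_classes))          # 250 x 150 input pooled three times

    def forward(self, x):                                    # x: (N, 3, 250, 150) hand crops
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=1)                  # class probabilities per frame

probs = TFSLNet()(torch.randn(2, 3, 250, 150))
print(probs.shape)                                           # torch.Size([2, 25])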
The final layer of the CNN architecture is the classification layer, consisting of two sub-layers: the fully connected (FC) and softmax layers. The fully connected layer uses the feature map matrix, in the form of a vector derived from the feature detection process, to create predictive models. The softmax function layer provides the classification output.
3.4. Classification
In the classification process, video files showing isolated TFSL were used as the input. The video input files consisted of one stroke, two strokes, and three strokes. These files were taken with a digital camera and required the signer, wearing a black jacket, to act with the right hand while standing in front of a complex background. The whole process of testing involved the following steps: motion segmentation by optical flow, splitting the strokes of the hand gestures, hand segmentation via the SegNet model, classification, combining signs, and displaying the text, as shown in Figure 11.

Figure 11. Classification and display text process.
Motion segmentation used Lucas and Kanade's optical flow calculation method, which tracks the whole image using the pyramid algorithm. The tracking starts at the top layer of the pyramid and runs down to the bottom. Motion segmentation was performed by optical flow using the orientation and magnitude of the vector.
Calculating the orientation and magnitude of the optical flow was accomplished using frame-by-frame calculations considering 2D motion. In each frame, the angle of the motion at each point varied, depending on the magnitude and direction of the motion observed from the vector relative to the axes (x, y). The direction of motion can be obtained from Equation (2) [24]:

θ = tan⁻¹(y/x)   (2)

where θ is the direction of the vector v = [x, y]ᵀ, x is coordinate x, and y is coordinate y, binned into the range −π/2 + (b − 1)π/B ≤ θ < −π/2 + bπ/B, where b is the number of the block and B is the number of bins.
The magnitude of optical flow calculates the vector length by using the linear equation to find the length of each vector between the previous frame and the current frame. The magnitude can be calculated from the equation below [24]:

m = √(x² + y²)   (3)

where m is the magnitude, x is coordinate x, and y is coordinate y.
In this research, motion was distinguished by finding the mean of all magnitudes within each frame of the split hand sign. The mean of the magnitude was high when the hand signals were changed, as shown in Figure 12.

Figure 12. The mean value in each frame of magnitude of the video files.
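The per-frame motion measure can be sketched as follows (an assumed OpenCV implementation; the paper tracks with the pyramidal Lucas–Kanade method, whereas dense Farnebäck flow is substituted here simply because it returns a flow vector at every pixel, and the threshold of 4 anticipates the stroke-splitting rule described next):

import cv2
import numpy as np

THRESHOLD = 4.0                                   # mean magnitude >= 4 marks a stroke change

def stroke_flags(video_path: str) -> list[int]:
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    flags = []                                    # s = 0 (meaningful pose), s = 1 (transition)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Equations (2) and (3): theta = atan2(y, x), m = sqrt(x^2 + y^2) at every pixel
        magnitude, theta = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        flags.append(1 if magnitude.mean() >= THRESHOLD else 0)
        prev_gray = gray
    cap.release()
    return flags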
Splitting the strokes of the hand was achieved by comparing each stroke with a threshold value. Testing various threshold values, it was found that a magnitude value of 1–3 means slight movement, while a value equal to 4 or more indicates a change in the stroke of the hand signal. The threshold value in this experiment was therefore set to 4. If the mean magnitude of a frame was lower than the threshold value, then s = 0 (meaningful); if the value was higher than the threshold value, then s = 1 (meaningless), as shown in Figure 13. After that, 10 frames between the start and the end of an s = 0 (meaningful) segment represent the hand sign.

Figure 13. The strokes of the hand sign.
After extracting the images from the hand stroke separation, the next step was the hand segmentation process. This process involves the separation of hands from the background using the SegNet model, as illustrated in Figure 14.

Figure 14. Image of hand segmentation after splitting the strokes of the hand.
The next step was to classify the hand segmentation images with the CNN model to predict the 10 frames of each hand sign. The system votes on the results of the predictions in descending order and selects the most frequently predicted hand sign. As shown in Figure 11, after passing the classification process, the results can be predicted as the K sign and the 3 sign. After that, the hand signs are combined and then translated into Thai letters. The rules for the combination of signs are shown in Table 2.
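The voting and sign-combination step can be sketched as follows (an assumed helper, not the authors' code; only a few of the Table 2 rules are shown):

from collections import Counter

COMBINATION_RULES = {          # excerpt of Table 2
    "K": "ก", "K+1": "ข", "K+2": "ค", "K+3": "ฆ",
    "T": "ต", "T+H": "ท", "C+H+1": "ช",
}

def vote(frame_predictions: list[str]) -> str:
    """Most frequently predicted sign among the 10 frames of one stroke."""
    return Counter(frame_predictions).most_common(1)[0][0]

def combine(strokes: list[list[str]]) -> str:
    """Map the voted sign sequence of the strokes to a Thai letter via Table 2."""
    key = "+".join(vote(frames) for frames in strokes)
    return COMBINATION_RULES.get(key, "?")

# e.g. two strokes voted as 'K' and '1' -> 'K+1' -> 'ข'
print(combine([["K"] * 9 + ["R"], ["1"] * 10]))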
Table 2. Combination rules for TFSL.

No. | Rule | Alphabet          No. | Rule | Alphabet
1 | K | ก (/k/)                22 | R | ร (/r/)
2 | K + 1 | ข (/kʰ/)           23 | W | ว (/w/)
3 | K + 2 | ค (/kʰ/)           24 | D | ด (/d/)
4 | K + 3 | ฆ (/kʰ/)           25 | D + 1 | ฎ (/d/)
5 | T | ต (/t/)                26 | F | ฟ (/f/)
6 | T + 1 | ถ (/tʰ/)           27 | F + 1 | ฝ (/f/)
7 | T + 2 | ฐ (/tʰ/)           28 | L | ล (/l/)
8 | T + 3 | ฒ (/tʰ/)           29 | L + 1 | ฬ (/l/)
9 | T + 4 | ฑ (/tʰ/)           30 | J1 + J2 | จ (/t͡ɕ/)
10 | T + 5 | ฏ (/t/)           31 | Y | ย (/j/)
11 | S | ส (/s/)               32 | Y + 1 | ญ (/j/)
12 | S + 1 | ศ (/s/)           33 | M | ม (/m/)
13 | S + 2 | ษ (/s/)           34 | N | น (/n/)
14 | S + Q | ซ (/s/)           35 | N + 1 | ณ (/n/)
15 | P | พ (/pʰ/)              36 | N + G | ง (/ŋ/)
16 | P + 1 | ป (/p/)           37 | T + H | ท (/tʰ/)
17 | P + 2 | ผ (/pʰ/)          38 | T + H + 1 | ธ (/tʰ/)
18 | P + 3 | ภ (/pʰ/)          39 | C + H | ฉ (/t͡ɕʰ/)
19 | H | ห (/h/)               40 | C + H + 1 | ช (/t͡ɕʰ/)
20 | H + 1 | ฮ (/h/)           41 | C + H + 2 | ฌ (/t͡ɕʰ/)
21 | B | บ (/b/)               42 | A | อ (/ʔ/)

4. Experiments and Results
4.1. Data Collection
Data collection for this research is divided into two types: images from 6 locations for training, and videos for testing at 3 locations (Library, Computer Lab2, and Office1), which have a slightly different camera view. The videos were taken with a digital camera and required the signer to use his or her right hand to make a hand signal while standing in front of a complex background at the 6 locations presented in Figure 15.

Figure 15. The six locations of data collection: (a) Library; (b) Computer Lab1; (c) Computer Lab2; (d) Canteen; (e) Office1; and (f) Office2.
The hand sign images used for the training were divided into 2 groups: the training data for the SegNet model and the training data for the CNN model, with 25 signs in total. The training data for the SegNet model used original images and image labels, 320 × 240 pixels in size, totaling 10,000 images. The second group contained images that were hand-segmented via semantic segmentation and were contained in the hand-segmented library. These images were 150 × 250 pixels in size. The training dataset contained 25 signs, as presented in Figure 16, each with 5000 images for a total of 125,000 images.
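One way to organize the hand-segmented library for CNN training is a per-sign folder layout loaded with torchvision. This is only a sketch under assumptions that are not in the paper: the folder name, the file format, and the (height, width) orientation of the 150 × 250 crops are all hypothetical.

```python
import torch
from torchvision import datasets, transforms

# Hypothetical layout: hand_segmented_library/<sign_name>/*.png  (25 signs x 5000 images each).
transform = transforms.Compose([
    transforms.Resize((250, 150)),  # the paper states 150 x 250 pixels; (height, width) order is an assumption
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("hand_segmented_library", transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)
print(len(dataset.classes), len(dataset))  # expected: 25 classes, 125,000 images
```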

Figure 16. Twenty-five hand signs for 42 Thai letters.

The video used in the testing process was isolated TFSL featuring 42 Thai letters.
These videos were taken from a hand signer group consisting of 4 people, and each letter
was recorded 5 times, totaling (42 × 4 × 5) 840 files.


4.2. Experimental Results
The experiments in this study were divided into three parts. The first part involved the evaluation to measure performance using intersection over union areas (IoU); the second part was the experiment used to determine the efficiency of the CNN model; and the last part was the testing of the multi-stroke TFSL recognition system.
4.2.1. Intersection Over Union Areas (IoU)
IoU is a statistical method that measures the consistency between a predicted bounding box and the ground truth by dividing the area of overlap between the prediction and the ground truth by the area of their union, as in Equation (4) and shown in Figure 17 [18]. A high IoU value (closer to 1) indicates a bounding box with high accuracy. A threshold of 0.5 determines whether the predicted bounding box is considered accurate, as shown in Figure 18. According to the standard Pascal Visual Object Classes Challenge 2007, acceptable IoU values must be greater than 0.5 [13].
IoU = Area of overlap between bounding boxes / Area of union between bounding boxes. (4)
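For concreteness, Equation (4) can be computed directly from box coordinates. The short function below is an illustration only, not code from the paper; the (x1, y1, x2, y2) box format is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the overlap rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    overlap = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0

# Example: a prediction whose IoU (0.5625) is above the 0.5 acceptance threshold.
print(iou((10, 10, 60, 60), (20, 15, 70, 65)))
```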

Figure 17. The intersection over union areas (IoU) [25].

Figure 18. IoU is evaluated as follows: the left IoU is 0.4034, which is poor; the middle IoU = 0.7330, which is good; and the right IoU = 0.9264, which is excellent [13].

The research used sign language gestures on a complex background along with label images, all of which were 320 × 240 pixels in size. A total of 10,000 images passed through semantic segmentation training using dilated convolutions. The dilation rate configuration and the convolution blocks were structured as follows. The structure of the training process consists of five blocks of convolution; each block includes 32 convolution filters (3 × 3 in size) with different dilation factors, batch normalization, and ReLU. The dilation factors are 1, 2, 4, 8, 16, and 1, respectively. The training options use MaxEpochs = 500 and MiniBatchSize = 64. According to the IoU performance assessment, the average IoU was 0.8972, which is excellent.
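As an illustration of the segmentation backbone described above, the sketch below stacks 3 × 3 convolutions with 32 filters and dilation factors 1, 2, 4, 8, 16, and 1, each followed by batch normalization and ReLU. It is a minimal PyTorch approximation, not the authors' implementation; the two-class 1 × 1 output head and the exact number of layers are assumptions.

```python
import torch.nn as nn

def dilated_block(in_ch, out_ch, dilation):
    # 3x3 convolution; padding equal to the dilation keeps the spatial size unchanged.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class DilatedSegmenter(nn.Module):
    """Hand/background segmentation network with dilation factors 1, 2, 4, 8, 16, 1."""
    def __init__(self, num_classes=2):
        super().__init__()
        dilations = [1, 2, 4, 8, 16, 1]
        blocks, in_ch = [], 3
        for d in dilations:
            blocks.append(dilated_block(in_ch, 32, d))
            in_ch = 32
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Conv2d(32, num_classes, kernel_size=1)  # per-pixel class scores

    def forward(self, x):
        return self.classifier(self.features(x))
```

Training such a network with a pixel-wise cross-entropy loss for 500 epochs and mini-batches of 64 would mirror the MaxEpochs and MiniBatchSize options quoted above.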
4.2.2. Experiments on the CNN Models
Experiments to determine the effectiveness of the CNN models in this research used the 5-fold cross validation method. The input hand sign images used in this experiment contained 125,000 images from the hand-segmented library, which consisted of 25 finger-spelling signs, each with 5000 gestures. This dataset was divided into 5 groups, each with 25,000 images. Each round of 5-fold cross validation used 100,000 images for training and 25,000 images for testing. This experiment compared the effectiveness of the configuration, the number of filters, and the filter sizes of five different CNN structures.
The first format used seven layers of 64 filters and a 3 × 3 filter size. The second format set the number of filters to 128 and the size of the filter to 3 × 3 with 7 layers. The third format sorted the number of filters in ascending order up to 7 layers, starting from 2, 5, 10, 20, 40, 80, and 160, respectively, and all filter sizes were 3 × 3. The fourth format determined the number of filters in ascending order and set the sizes of the filters from large to small with 7 layers (number of filters/filter size: 4/11 × 11, 8/5 × 5, 16/5 × 5, 32/3 × 3, 64/3 × 3, 128/3 × 3, 256/3 × 3). The final format was a structure based on AlexNet, which defined the numbers and sizes of the filters as follows (number of filters/filter size): 96/11 × 11, 256/5 × 5, 384/3 × 3, 384/3 × 3, 256/3 × 3, 256/3 × 3 [26]. The structure details are shown in Figure 19.

Figure 19. The five CNN formats for training the model.
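The fifth, AlexNet-based format can be written down directly from the filter specification above. The sketch below is an interpretation, not the authors' implementation; the pooling positions, strides, and the fully connected head are assumptions made only to produce a runnable 25-class classifier.

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, k, **kw):
    # Convolution followed by batch normalization and ReLU.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, **kw),
                         nn.BatchNorm2d(out_ch),
                         nn.ReLU(inplace=True))

class Format5CNN(nn.Module):
    """AlexNet-style format 5: 96/11x11, 256/5x5, 384/3x3, 384/3x3, 256/3x3, 256/3x3."""
    def __init__(self, num_classes=25):
        super().__init__()
        self.features = nn.Sequential(
            conv_bn_relu(3, 96, 11, stride=4, padding=2), nn.MaxPool2d(3, 2),
            conv_bn_relu(96, 256, 5, padding=2), nn.MaxPool2d(3, 2),
            conv_bn_relu(256, 384, 3, padding=1),
            conv_bn_relu(384, 384, 3, padding=1),
            conv_bn_relu(384, 256, 3, padding=1),
            conv_bn_relu(256, 256, 3, padding=1), nn.MaxPool2d(3, 2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, num_classes),  # one score per finger-spelling sign
        )

    def forward(self, x):
        return self.head(self.features(x))
```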

The CNN model experiment used the 5-fold cross validation method, and the results are shown in Table 3. This experiment found that the accuracy of the structure based on the AlexNet format was 92.03%. The fourth format, with 7 layers using an ascending number of filters and filter sizes decreasing from large to small, offered an average accuracy of 90.04%. Format 3, which uses seven layers with an ascending number of filters of equal size, achieved an average accuracy of 89.91%. The first and second formats, which use seven layers with a fixed number of filters (64 and 128 filters, respectively), had average accuracies of 88.23% and 87.97%.
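The evaluation protocol above, five equal folds of 25,000 images with 100,000 images for training and 25,000 for testing per round, can be mirrored with scikit-learn's KFold. This is only a sketch of the splitting step; the model training itself is omitted and nothing here is the authors' code.

```python
import numpy as np
from sklearn.model_selection import KFold

# 25 finger-spelling signs x 5000 images each = 125,000 samples in the hand-segmented library.
labels = np.repeat(np.arange(25), 5000)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kfold.split(labels), start=1):
    # Each round: 100,000 images for training and 25,000 for testing, as described above.
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
```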

Table 3. Five-fold cross validation for CNN model testing.

Five-Fold Cross Validation Format 1 Accuracy (%) Format 2 Accuracy (%) Format 3 Accuracy (%) Format 4 Accuracy (%) Format 5 Accuracy (%)
1st 80.12 80.60 83.43 84.17 87.72
2nd 87.12 85.90 86.28 86.15 88.96
3rd 90.06 90.34 92.44 92.53 93.98
4th 93.84 92.82 95.52 94.87 97.00
5th 90.00 90.20 91.86 92.47 92.50
Average accuracy 88.23 87.97 89.91 90.04 92.03

Comparing the five CNN accuracy results, it was found that the fifth format provided the highest accuracy. Therefore, we used this format for training to create models for the multi-stroke TFSL recognition system, with 112,500 training images representing 90% of the data and 12,500 testing images representing 10%. The accuracy of the model in this experiment was 99.98%.

4.2.3. The Experiment of the Multi-Stroke TFSL Recognition System
This recognition system focuses on user-independent testing via the isolated TFSL video format. The accuracy of each alphabet was tested 20 times. The experiment was divided into three groups: the one-stroke results, which are shown in Table 4; the two-stroke results, which are presented in Table 5; and the three-stroke results, which are shown in Table 6.

Table 4. The results of the one-stroke TFSL recognition system.

No. Thai Letters Rule Correct Incorrect Accuracy (%)
1 ก (/k/) K 13 7 65.00
2 ต (/t/) T 18 2 90.00
3 ส (/s/) S 18 2 90.00
4 พ (/pʰ/) P 20 0 100.00
5 ห (/h/) H 20 0 100.00
6 บ (/b/) B 17 3 85.00
7 ร (/r/) R 15 5 75.00
8 ว (/w/) W 17 3 85.00
9 ด (/d/) D 20 0 100.00
10 ฟ (/f/) F 20 0 100.00
11 ล (/l/) L 20 0 100.00
12 ย (/j/) Y 20 0 100.00
13 ม (/m/) M 18 2 90.00
14 น (/n/) N 11 9 55.00
15 อ (/ʔ/) A 17 3 85.00
Average accuracy 88.00

Table 5. The results of the two-stroke TFSL recognition system.

No. Thai Letters Rule Classification: Correct Incorrect Accuracy (%) Strokes Detection: Correct Incorrect Error Rate (%)
1 ข (/kʰ/) K + 1 17 3 85.00 19 1 5
2 ค (/kʰ/) K + 2 16 4 80.00 16 4 20
3 ฆ (/kʰ/) K + 3 20 0 100.00 20 0 0
4 ถ (/tʰ/) T + 1 17 3 85.00 20 0 0
5 ฐ (/tʰ/) T + 2 15 5 75.00 20 0 0
6 ฒ (/tʰ/) T + 3 20 0 100.00 20 0 0
7 ฑ (/tʰ/) T + 4 20 0 100.00 20 0 0
8 ฏ (/t̚/) T + 5 17 3 85.00 20 0 0
9 ศ (/s/) S + 1 11 9 55.00 16 4 20
10 ษ (/s/) S + 2 15 5 75.00 18 2 10
11 ซ (/s/) S + Q 15 5 75.00 20 0 0
12 ป (/p/) P + 1 20 0 100.00 20 0 0
13 ผ (/pʰ/) P + 2 18 2 90.00 20 0 0
14 ภ (/pʰ/) P + 3 18 2 90.00 18 2 10
15 ฮ (/h/) H + 1 18 2 90.00 20 0 0
16 ฎ (/d/) D + 1 20 0 100.00 20 0 0
17 ฝ (/f/) F + 1 18 2 90.00 20 0 0
18 ฬ (/l/) L + 1 17 3 85.00 17 3 15
19 จ (/t͡ɕ/) J1 + J2 20 0 100.00 20 0 0
20 ญ (/j/) Y + 1 19 1 95.00 20 0 0
21 ณ (/n/) N + 1 11 9 55.00 20 0 0
22 ง (/ŋ/) N + G 12 8 60.00 20 0 5
23 ท (/tʰ/) T + H 17 3 85.00 20 0 0
24 ฉ (/t͡ɕʰ/) C + H 19 1 95.00 20 0 0
Average accuracy 85.42 Average of error rate 3.54

Table 6. The results of the three-stroke TFSL recognition system.

No. Thai Letters Rule Classification: Correct Incorrect Accuracy (%) Stroke Detection: Correct Incorrect Error Rate (%)
1 ธ (/tʰ/) T + H + 1 17 3 85.00 20 0 0
2 ช (/t͡ɕʰ/) C + H + 1 13 7 65.00 19 1 5
3 ฌ (/t͡ɕʰ/) C + H + 2 15 5 75.00 18 2 10
Average accuracy 75.00 Average of error rate 5.00

In terms of performance, the one-stroke TFSL recognition system shown in Table 4 had an average accuracy of 88.00%. According to the results, the system also encountered discrimination problems among very similar sign language gestures in three groups: (a) gestures that differ mainly in the position of the thumb, used for the letters T, S, M, N, and A, yielding a lower accuracy for the letter N at 55%; (b) gestures with two fingers, consisting of the index finger and middle finger, which differ only slightly in the position of the thumb; this group consisted of the letter K and the number 2 and lowered the accuracy of the letter K to 65%; and (c) gestures that raise a single finger: the R gesture involves crossing the index finger and ring finger, which is similar to the gesture for the number 1, which involves holding up one index finger. This resulted in the accuracy of the letter R being 75%, as shown in Figure 20. These three groups of similarities among the sign language gestures resulted in a low accuracy average.

Figure 20. A group of one-stroke TFSL signs with many similarities: (a) T, S, N, M, and A gestures; (b) K and 2 gestures; (c) R and 1 gestures.

Table 5 shows the results of the TFSL two‑stroke recognition system. The results of
the experiment show that groups with very similar signs, such as T, S, M, N, and A, af‑
fected the accuracy rates in two‑stroke sign language recognition systems, such as T + 2,
S + 2, and S + Q providing an accuracy of 75%, N + G providing an accuracy of 60%, and
S + 1 and N + 1 offering an accuracy of 55%. Sign language gesture number 2 was similar
to K gestures. This affected the accuracy of K + 2, which had an accuracy rate of 80%. The
similarity between sign language gestures affected the average accuracy of the two‑stroke
sign language recognition system, which was equal to 85.42%. In the sign language recog‑
nition system for two‑stroke, we measured the error rate of stroke detection. The results
show that sign language gestures with slight hand movements, such as K + 2 and S + 1, had
a high error rate of 20%, affecting the overall accuracy of the system. The average error
rate was 3.54%.
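The stroke-detection errors above come from deciding where one hand posture ends and the next begins. As a rough illustration of the kind of optical-flow-based stroke separation used in this work, the sketch below thresholds the mean Farnebäck flow magnitude between consecutive frames. The threshold value and the pause/move heuristic are assumptions, not the authors' parameters; low-movement transitions such as K + 2 and S + 1 are exactly the cases where such a threshold can miss a stroke boundary.

```python
import cv2
import numpy as np

def stroke_boundaries(frames, motion_threshold=0.8):
    """Return frame indices where the hand comes to rest (candidate stroke ends).

    frames: list of BGR images of the signing hand.
    """
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    boundaries, moving = [], False
    for i, frame in enumerate(frames[1:], start=1):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude = np.linalg.norm(flow, axis=2).mean()  # average per-pixel motion
        if magnitude > motion_threshold:
            moving = True                  # the hand is travelling between postures
        elif moving:
            boundaries.append(i)           # motion just stopped: a stroke has settled
            moving = False
        prev = gray
    return boundaries
```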
Three-stroke TFSL combines three sign language gestures to display one letter. Thus, three-stroke TFSL consists of three letters: T + H + 1, C + H + 1, and C + H + 2. The results showed that the accuracy of T + H + 1 was 85%, that of C + H + 2 was 75%, and that of C + H + 1 was 65%, indicating a low accuracy value. This low accuracy was due to several reasons, such as faulty stroke detection caused by the use of multi-stroke sign language and by sign language similarities. The average overall accuracy was 75%. The three-stroke letters have distinct posture movements, so the stroke-detection error rate is slight, with an average error rate of 5%, as shown in Table 6.
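Once the strokes of a letter have been separated and classified, the final letter is obtained by looking up the stroke sequence in the combination rules of Table 2. The dictionary below is a minimal sketch of that lookup built from a few of the rules listed in the table; it is illustrative only, and the string keys are my own choice rather than a data structure from the paper.

```python
# A few of the Table 2 combination rules: stroke sequence -> Thai letter.
COMBINATION_RULES = {
    ("K",): "ก", ("T",): "ต", ("S",): "ส",              # one-stroke letters
    ("K", "1"): "ข", ("T", "2"): "ฐ", ("N", "G"): "ง",   # two-stroke letters
    ("T", "H", "1"): "ธ", ("C", "H", "2"): "ฌ",          # three-stroke letters
}

def combine_strokes(strokes):
    """Map a sequence of classified hand signs to a Thai letter, or None if no rule matches."""
    return COMBINATION_RULES.get(tuple(strokes))

print(combine_strokes(["T", "H", "1"]))  # -> ธ
```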
Table 7 shows that the multi-stroke TFSL recognition system has an overall average accuracy of 85.60%. The experiment used the training images for CNN modeling and tested with multi-stroke isolated TFSL. The results show that one-stroke letters were tested 300 times and were correct 264 times and incorrect 36 times, with an average accuracy of 88%. For two-stroke letters, the average accuracy was 85.42% from 480 tests, with 410 correct and 70 incorrect. Three-stroke letters were tested 60 times and were correct 45 times and incorrect 15 times, with an average accuracy of 75%. Overall stroke detection of the system showed an average error rate of 3.70%. The results show that sign language gestures with little movement affected stroke detection and caused misclassifications.

Table 7. The overall results of the multi‑stroke TFSL recognition system experiment.

No. Stroke Group Classification: Correct Incorrect Accuracy (%) Stroke Detection: Correct Incorrect Error Rate (%)
1 One‑stroke 264 36 88.00 ‑ ‑ ‑
2 Two‑stroke 410 70 85.42 464 16 3.54
3 Three‑stroke 45 15 75.00 57 3 5.00
Average accuracy 85.60 Average of error rate 3.70

5. Conclusions
This research focused on the development of a multi-stroke TFSL recognition system that supports complex backgrounds with deep learning. The study compared the performance of five CNN models. According to the results, the first and second structure formats are static CNN architectures in which the number and size of the filters are the same for all layers, providing the lowest accuracy. The third format determines the number of filters in ascending order with the same filter size, which results in increased accuracy. The fourth format uses an ascending number of filters; the first layer uses large filters for learning global features, whereas the following layers use smaller filters for learning local features, which results in higher accuracy. The fifth format is a mixed architectural style designed based on the AlexNet structure. Its first and second convolution layers increase the number of filters, the third and fourth layers keep a static, larger number of filters carried over from the second layer, and the fifth and sixth layers use a fixed number of filters that drops from the two previous layers. For the same purpose, large filters are used in the first layer to learn global features, while smaller filters are used in the following layers to learn local features. The results show that this mixed architecture, with global feature learning followed by local feature learning, achieves the best accuracy. Using the fifth format to train the system, the overall results indicate that the factors affecting the system's average accuracy include (1) very similar sign language gestures that negatively affect classification, resulting in a lower average accuracy; (2) low-movement finger-spelling gestures that affect the detection of multiple spelling gestures, causing faulty stroke detection; and (3) the spelling of multi-stroke sign language, where too much movement can sometimes lead to motion being detected between gestural changes, resulting in recognition errors. A solution could be to improve the motion detection system to make the stroke separation more accurate or to apply a long short-term memory network (LSTM) to enhance the recognition system's accuracy. In conclusion, the study results demonstrate that similarities in the gestures and strokes of Thai finger-spelling caused decreases in accuracy. For future studies, further TFSL recognition system development is needed.

Author Contributions: Conceptualization, T.P. and P.S.; methodology, T.P.; validation, P.S.; investi‑
gation, T.P. and P.S.; data curation, T.P.; writing—original draft preparation, T.P. and P.S.; writing—
review and editing, T.P. and P.S.; supervision, P.S.; All authors have read and agreed to the published
version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data presented in this study are available on request from the cor‑
responding author. The data are not publicly available due to the copyright of the Natural Language
and Speech Processing Laboratory, Department of Computer Science, Faculty of Science, Khon Kaen
University, Khon Kaen.
Acknowledgments: This study was funded by the Computer Science Department, Faculty of Sci‑
ence, Khon Kaen University, Khon Kaen, Thailand. The authors would like to extend our thanks to
Sotsuksa Khon Kaen School for their assistance and support in the development of the Thai finger‑
spelling sign language data.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Zhang, C.; Tian, Y.; Huenerfauth, M. Multi‑Modality American Sign Language Recognition. In Proceedings of the 2016 IEEE
International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 2881–2885. [CrossRef]
2. Nakjai, P.; Katanyukul, T. Hand Sign Recognition for Thai Finger Spelling: An Application of Convolution Neural Network. J.
Signal Process. Syst. 2019, 91, 131–146. [CrossRef]
3. Pariwat, T.; Seresangtakul, P. Thai Finger‑Spelling Sign Language Recognition Employing PHOG and Local Features with KNN.
Int. J. Adv. Soft Comput. Appl. 2019, 11, 94–107.
4. Pariwat, T.; Seresangtakul, P. Thai Finger‑Spelling Sign Language Recognition Using Global and Local Features with SVM. In
Proceedings of the 2017 9th International Conference on Knowledge and Smart Technology (KST), Chonburi, Thailand, 1–4
February 2017; pp. 116–120. [CrossRef]
5. Adhan, S.; Pintavirooj, C. Thai Sign Language Recognition by Using Geometric Invariant Feature and ANN Classification. In
Proceedings of the 2016 9th Biomedical Engineering International Conference (BMEiCON), Laung Prabang, Laos, 7–9 December
2016; pp. 1–4. [CrossRef]
6. Saengsri, S.; Niennattrakul, V.; Ratanamahatana, C.A. TFRS: Thai Finger‑Spelling Sign Language Recognition System. In Pro‑
ceedings of the 2012 Second International Conference on Digital Information and Communication Technology and It’s Applica‑
tions (DICTAP), Bangkok, Thailand, 16–18 May 2012; pp. 457–462. [CrossRef]
7. Chansri, C.; Srinonchat, J. Reliability and Accuracy of Thai Sign Language Recognition with Kinect Sensor. In Proceedings of
the 2016 13th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information
Technology (ECTI‑CON), Chiang Mai, Thailand, 28 June–1 July 2016; pp. 1–4. [CrossRef]
8. Silanon, K. Thai Finger‑Spelling Recognition Using a Cascaded Classifier Based on Histogram of Orientation Gradient Features.
Comput. Intell. Neurosci. 2017, 2017, 1–11. [CrossRef] [PubMed]
9. Tolentino, L.K.S.; Juan, R.O.S.; Thio‑ac, A.C.; Pamahoy, M.A.B.; Forteza, J.R.R.; Garcia, X.J.O. Static Sign Language Recognition
Using Deep Learning. Int. J. Mach. Learn. Comput. 2019, 9, 821–827. [CrossRef]
10. Vo, A.H.; Pham, V.‑H.; Nguyen, B.T. Deep Learning for Vietnamese Sign Language Recognition in Video Sequence. Int. J. Mach.
Learn. Comput. 2019, 9, 440–445. [CrossRef]
11. Goel, T.; Murugan, R. Deep Convolutional—Optimized Kernel Extreme Learning Machine Based Classifier for Face Recognition.
Comput. Electr. Eng. 2020, 85, 106640. [CrossRef]
12. Wang, N.; Wang, Y.; Er, M.J. Review on Deep Learning Techniques for Marine Object Recognition: Architectures and Algorithms.
Control Eng. Pract. 2020, 104458. [CrossRef]
13. Cowton, J.; Kyriazakis, I.; Bacardit, J. Automated Individual Pig Localisation, Tracking and Behaviour Metric Extraction Using
Deep Learning. IEEE Access 2019, 7, 108049–108060. [CrossRef]
14. Chen, L.‑C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Con‑
volutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
[CrossRef] [PubMed]
15. Rahim, M.A.; Islam, M.R.; Shin, J. Non‑Touch Sign Word Recognition Based on Dynamic Hand Gesture Using Hybrid Segmen‑
tation and CNN Feature Fusion. Appl. Sci. 2019, 9, 3790. [CrossRef]
16. Koller, O.; Forster, J.; Ney, H. Continuous Sign Language Recognition: Towards Large Vocabulary Statistical Recognition Sys‑
tems Handling Multiple Signers. Comput. Vis. Image Underst. 2015, 141, 108–125. [CrossRef]

17. Hamed, A.; Belal, N.A.; Mahar, K.M. Arabic Sign Language Alphabet Recognition Based on HOG‑PCA Using Microsoft Kinect
in Complex Backgrounds. In Proceedings of the 2016 IEEE 6th International Conference on Advanced Computing (IACC),
Bhimavaram, India, 27–28 February 2016; pp. 451–458. [CrossRef]
18. Dahmani, D.; Larabi, S. User‑Independent System for Sign Language Finger Spelling Recognition. J. Vis. Commun. Image Repre‑
sent. 2014, 25, 1240–1250. [CrossRef]
19. Auephanwiriyakul, S.; Phitakwinai, S.; Suttapak, W.; Chanda, P.; Theera‑Umpon, N. Thai Sign Language Translation Using
Scale Invariant Feature Transform and Hidden Markov Models. Pattern Recognit. Lett. 2013, 34, 1291–1298. [CrossRef]
20. Yu, F.; Koltun, V. Multi‑Scale Context Aggregation by Dilated Convolutions. In Proceedings of the 2016 International Conference
on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
21. Li, Y.; Zhang, X.; Chen, D. CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes.
In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23
June 2018; pp. 1091–1100. [CrossRef]
22. Hosoe, H.; Sako, S.; Kwolek, B. Recognition of JSL Finger Spelling Using Convolutional Neural Networks. In Proceedings of
the 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA), Nagoya, Japan, 8–12 May 2017; pp.
85–88. [CrossRef]
23. Bunrit, S.; Inkian, T.; Kerdprasop, N.; Kerdprasop, K. Text‑Independent Speaker Identification Using Deep Learning Model of
Convolution Neural Network. Int. J. Mach. Learn. Comput. 2019, 9, 143–148. [CrossRef]
24. Colque, R.V.H.M.; Junior, C.A.C.; Schwartz, W.R. Histograms of Optical Flow Orientation and Magnitude to Detect Anomalous
Events in Videos. In Proceedings of the 2015 28th SIBGRAPI Conference on Graphics, Patterns and Images, Salvador, Bahia,
Brazil, 26–29 August 2015; pp. 126–133. [CrossRef]
25. Cheng, S.; Zhao, K.; Zhang, D. Abnormal Water Quality Monitoring Based on Visual Sensing of Three‑Dimensional Motion
Behavior of Fish. Symmetry 2019, 11, 1179. [CrossRef]
26. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM
2017, 60, 84–90. [CrossRef]
