Article
Multi‑Stroke Thai Finger‑Spelling Sign Language Recognition
System with Deep Learning
Thongpan Pariwat and Pusadee Seresangtakul *
Natural Language and Speech Processing Laboratory, Department of Computer Science, Faculty of Science,
Khon Kaen University, Khon Kaen 40002, Thailand; thongpan.par@kkumail.com
* Correspondence: pusadee@kku.ac.th; Tel.: +66‑80‑414‑0401
Abstract: Sign language is a language used by the hearing impaired that the general public commonly does not understand; a sign language recognition system therefore serves as an intermediary between the two sides. As a communication tool, a multi-stroke Thai finger-spelling sign language (TFSL) recognition system featuring deep learning was developed in this study. This research uses a vision-based technique on a complex background, with semantic segmentation performed with dilated convolution for hand segmentation, hand strokes separated using optical flow, and feature learning and classification done with a convolutional neural network (CNN). We then compared five CNN structures that define the formats. The first format set the number of filters to 64 and the filter size to 3 × 3 with 7 layers; the second format used 128 filters, each 3 × 3 in size, with 7 layers; the third format used a number of filters in ascending order with 7 layers, all with an equal 3 × 3 filter size; the fourth format used a number of filters in ascending order and small filter sizes with 7 layers; the final format was a structure based on AlexNet. The resulting average accuracies were 88.83%, 87.97%, 89.91%, 90.43%, and 92.03%, respectively. We implemented the CNN structure based on AlexNet to create models for multi-stroke TFSL recognition systems. The experiment was performed using isolated videos of 42 Thai letters, which are divided into three categories consisting of one stroke, two strokes, and three strokes. The results presented an 88.00% average accuracy for one stroke, 85.42% for two strokes, and 75.00% for three strokes.

Citation: Pariwat, T.; Seresangtakul, P. Multi-Stroke Thai Finger-Spelling Sign Language Recognition System with Deep Learning. Symmetry 2021, 13, 262. https://doi.org/10.3390/sym13020262

Keywords: TFSL recognition system; deep learning; semantic segmentation; optical flow; complex background
meanings. For example, the K sign combined with the 1 sign (K+1) means 'ข' (/kʰ/). Consequently, TFSL strokes can be organized into three groups (one-stroke, two-stroke, and three-stroke groups) with 15 letters, 24 letters, and 3 letters, respectively. However, 'ฃ' (/kʰ/) and 'ฅ' (/kʰ/) have not been used since Thailand's Royal Institute Dictionary announced their discontinuation [3], and are no longer used in Thai finger-spelling sign language.
This study is part of a research series on TFSL recognition that focuses on one stroke
performed by a signer standing in front of a blue background. Global and local features us‑
ing support vector machine (SVM) on the radial basis function (RBF) kernel were applied
and were able to achieve 91.20% accuracy [4]; however, there were still problems with
highly similar letters. Therefore, a new TFSL recognition system was developed by ap‑
plying pyramid histogram of oriented gradient (PHOG) and local features with k‑nearest
neighbors (KNN), providing an accuracy up to 97.6% [3]. However, past research has been
conducted only using one‑stroke, 15‑character TFSL under a simple background, which
does not cover all types of TFSL. There is also research on TFSL recognition systems that
uses a variety of techniques, as shown in Table 1.
The study of TFSL recognition systems for practical applications requires a variety of techniques, since TFSL has multi-stroke gestures and combines hand gestures to convey letters, and also requires handling of complex backgrounds and a wide range of light intensities.
Deep learning is a tool that is increasingly being used in sign language
recognition [2,9,10], face recognition [11], object recognition [12], and others. This tech‑
nology is used for solving complex problems such as object detection [13], image segmen‑
tation [14], and image recognition [12]. With this technique, the feature learning method is
implemented instead of feature extraction. A convolutional neural network (CNN) performs feature learning, can be applied to recognition, and can provide high performance. However, deep learning requires a large amount of training data, and such data may consist of simple-background or complex-background images. In a picture with a simple background, the object and background colors are clearly different. Such an image can be used
as training data without cutting out the background. On the other hand, in a complex-background image, the objects and the background have similar colors. Cutting out the background yields input images for deep-learning training that contain only the objects of interest, without distraction. This can be done using the semantic segmentation
method, which uses deep learning to apply image segmentation. This method requires
labeling objects of interest to separate them from the background. Autonomous driving
is one application of semantic segmentation used to identify objects on the road, and can
also be used for a wide range of other applications.
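As a concrete illustration of the dilated convolutions used in such segmentation networks, the sketch below (a minimal NumPy implementation, not the authors' code) shows how spacing the kernel taps by a dilation factor enlarges the receptive field without adding parameters:

```python
import numpy as np

def dilated_conv2d(image, kernel, dilation=1):
    """2D convolution whose kernel taps are spaced `dilation` pixels apart,
    enlarging the receptive field without adding parameters."""
    kh, kw = kernel.shape
    # effective kernel extent after inserting (dilation - 1) gaps between taps
    eh = (kh - 1) * dilation + 1
    ew = (kw - 1) * dilation + 1
    H, W = image.shape
    out = np.zeros((H - eh + 1, W - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + eh:dilation, j:j + ew:dilation]
            out[i, j] = np.sum(patch * kernel)
    return out

img = np.arange(36, dtype=float).reshape(6, 6)
k = np.ones((3, 3))
print(dilated_conv2d(img, k, dilation=1).shape)  # (4, 4)
print(dilated_conv2d(img, k, dilation=2).shape)  # (2, 2): same 9 taps, wider view
```

With dilation 2, the same 3 × 3 kernel covers a 5 × 5 region, which is why dilated convolutions help per-pixel labeling tasks such as hand segmentation.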
This study is based on the challenges of the TFSL recognition system. There are up
to 42 Thai letters that involve spelling with one’s fingers—i.e., spelling letters with a com‑
bination of multi‑stroke sign language gestures. The gestures for spelling several letters,
however, are similar, so a sign language recognition system under a complex background
should instead be used. There are many studies on TFSL recognition systems; however, none of them addresses multi-stroke TFSL recognition using a vision-based technique with a complex background. The main contributions of this research are the application of a new framework for a multi-stroke Thai finger-spelling sign language recognition system to videos under a complex background and a variety of light intensities by
separating people’s hands from the complex background via semantic segmentation meth‑
ods, detecting changes in the strokes of hand gestures with optical flow, and learning fea‑
tures with CNN under the structure of AlexNet. This system supports recognition that
covers all 42 characters in TFSL.
The proposed method focuses on developing a multi‑stroke TFSL recognition system
with deep learning that can act as a communication medium between the hearing impaired
and the general public. Semantic segmentation is then applied to hand segmentation for
complex background images, and optical flow is used to separate the strokes of sign lan‑
guage. The processes of feature learning and classification use CNN.
[Figure: overview of the proposed system in four parts — Part 1: hand segmentation model creation (SegNet trained with dilated convolution, batch normalization, ReLU, and softmax pixel classification); Part 2: hand-segmented library creation; Part 3: TFSL model creation (training with a convolutional neural network); Part 4: classification with combination rules and text display.]

[Figure: steps of library creation — (1) semantic segmentation via SegNet; (2) label overlay to binary; (4) select large area and bounding box; (5) hand mark.]

Figure 4. The process of creating a hand-segmented library.
[Figure: classification stage of the CNN — fully connected layer (FC) followed by softmax, producing the TFSL model.]

Figure 9. An example of the values after using the ReLU function [20].
Pooling resizes the data to a smaller size, with the details of the input remaining intact. It also has the advantages of reducing the computation required and of alleviating overfitting. There are two types of pooling layers, max and average pooling, as illustrated in Figure 10.
Figure 10. An example of average and max pooling [23].
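Max and average pooling as in Figure 10 can be expressed with a short NumPy sketch (illustrative, not the authors' implementation):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling: shrink a feature map by `size`, keeping the
    max (or the mean) of each size-by-size window."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]            # trim to a multiple of size
    blocks = x.reshape(H // size, size, W // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

fmap = np.array([[1, 3, 2, 4],
                 [5, 7, 6, 8],
                 [9, 2, 1, 0],
                 [3, 4, 5, 6]], dtype=float)
print(pool2d(fmap, 2, "max"))  # [[7. 8.] [9. 6.]]
print(pool2d(fmap, 2, "avg"))  # [[4.  5. ] [4.5 3. ]]
```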
The final layer of the CNN architecture is the classification layer, consisting of 2 sub-layers: the fully connected (FC) and softmax layers. The fully connected layer uses the feature map matrix, in the form of a vector derived from the feature detection process, to create predictive models. The softmax function layer provides the classification output.
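A minimal sketch of this FC-plus-softmax stage (toy sizes chosen for illustration; the paper's actual layer dimensions are not assumed here):

```python
import numpy as np

def fc_softmax(feature_vector, W, b):
    """Fully connected layer followed by softmax: maps a flattened feature
    vector to a probability distribution over the sign classes."""
    logits = W @ feature_vector + b
    e = np.exp(logits - logits.max())   # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
features = rng.normal(size=8)                   # toy flattened feature map
W, b = rng.normal(size=(25, 8)), np.zeros(25)   # 25 hand-sign classes
probs = fc_softmax(features, W, b)
print(round(probs.sum(), 6))                    # 1.0: a valid distribution
print(int(probs.argmax()))                      # index of the predicted sign
```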
3.4. Classification
In the classification process, video files showing isolated TFSL were used as the input. The video input files consisted of one stroke, two strokes, and three strokes. These files were taken with a digital camera and required the signer to act with the right hand in a black jacket standing in front of a complex background. The whole process of testing involved the following steps: motion segmentation by optical flow, splitting the strokes of the hand gestures, hand segmentation via the SegNet model, classification, combining signs, and displaying the text, as shown in Figure 11.

[Figure: classification pipeline — video hand sign → motion segmentation by optical flow → splitting the strokes of the hand gestures (1st stroke, 2nd stroke) → hand segmentation via the SegNet model → classification → combining signs → display text.]

Figure 11. Classification and display text process.

Motion segmentation used Lucas and Kanade's optical flow calculation method, which is a method of tracking the whole image using the pyramid algorithm. The track starts with the top layer of the pyramid and runs down to the bottom. Motion segmentation was performed by optical flow using the orientation and magnitude of the vector.

Calculating the orientation and magnitude of the optical flow was accomplished using frame-by-frame calculations considering 2D motion. In each frame, the angle of the motion at each point varied, depending on the magnitude and direction of the motion observed from the vector relative to the axis (x, y). The direction of motion can be obtained from Equation (2) [24]:

θ = tan⁻¹(y/x) (2)

where θ is the direction of vector v = [x, y]ᵀ, x is coordinate x, and y is coordinate y, in the range −π/2 + (b − 1)π/B ≤ θ < −π/2 + bπ/B, where b is the block number and B is the number of bins.
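Under these definitions, the per-pixel direction and magnitude of the flow vectors, with directions quantized into B orientation bins, might be computed as in the NumPy sketch below (illustrative, not the authors' code):

```python
import numpy as np

def flow_orientation_magnitude(vx, vy, n_bins=8):
    """Direction (Eq. 2) and magnitude (Eq. 3) of optical-flow vectors,
    with directions quantized into n_bins orientation bins."""
    theta = np.arctan2(vy, vx)          # theta = atan(y/x), full-circle version
    m = np.hypot(vx, vy)                # m = sqrt(x^2 + y^2)
    bins = ((theta + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    return theta, m, bins

# toy 2x2 flow field (vx, vy components per pixel)
vx = np.array([[1.0, 0.0], [-1.0, 3.0]])
vy = np.array([[1.0, 2.0], [0.0, 4.0]])
theta, m, bins = flow_orientation_magnitude(vx, vy)
print(float(m[1, 1]))   # 5.0, length of the (3, 4) vector
print(float(m.mean()))  # per-frame mean magnitude, used for stroke detection
```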
The magnitude of optical flow calculates the vector length by using the linear equation to find the length of each vector between the previous frame and the current frame. The magnitude can be calculated from the equation below [24]:

m = √(x² + y²) (3)

where m is the magnitude, x is coordinate x, and y is coordinate y.

In this research, motion was distinguished by finding the mean of all magnitudes within each frame of the split hand sign. The mean of the magnitude was high when the hand signals were changed, as shown in Figure 12.

Figure 12. The mean value in each frame of magnitude of the video files.
Splitting the stroke of the hand was achieved by comparing each stroke with the threshold value. Testing the various threshold values, it was found that a magnitude value of 1–3 means slight movement, and a value equal to 4 or more indicates a change in the stroke of the hand signal. The threshold value in this experiment was considered as 4. If the mean value of magnitude in each frame was lower than the threshold value, then s = 0 (meaning), but if the value was more than the threshold value, then s = 1 (meaningless), as shown in Figure 13. After that, 10 frames between the start and the end of s = 0 (meaning) represent the hand sign.

After extracting the image from the hand stroke separation, the next step was the hand segmentation process. This process involves the separation of hands from the background using the SegNet model, as illustrated in Figure 14.
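The thresholded stroke-splitting procedure described above (threshold 4, s = 0 for held-sign frames, 10 frames sampled per stroke) can be sketched as follows; the function and magnitude trace are illustrative, not the authors' code:

```python
import numpy as np

def split_strokes(mean_magnitudes, threshold=4.0, n_frames=10):
    """Split a sign video into strokes: frames whose mean flow magnitude is
    below the threshold are 'still' (s = 0, a held hand sign); each run of
    still frames becomes one stroke, from which n_frames are sampled."""
    s = (np.asarray(mean_magnitudes) >= threshold).astype(int)  # 1 = moving
    strokes, start = [], None
    for i, moving in enumerate(list(s) + [1]):   # sentinel closes the last run
        if moving == 0 and start is None:
            start = i                            # a still run begins
        elif moving == 1 and start is not None:
            idx = np.linspace(start, i - 1, n_frames).round().astype(int)
            strokes.append(idx)                  # sample n_frames from the run
            start = None
    return strokes

# toy magnitude trace: still sign, movement burst, second still sign
trace = [1, 2, 1, 2, 1, 6, 7, 5, 2, 1, 2, 1, 3]
print(len(split_strokes(trace)))  # 2 strokes detected
```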
Figure 14. Image of hand segmentation after splitting the strokes of the hand.

The next step was to classify the hand segmentation images with the CNN model to predict 10 frames of each hand sign. The system votes on the results of the prediction in descending order and selects the most predictable hand sign. As shown in Figure 11, when passing the classification process, the results can be predicted as the K sign and the 3 sign. After that, the signs are combined and then translated into Thai letters. The rules for the combination of signs are shown in Table 2.

Table 2. Combination rules for TFSL.

No. Rule Alphabet
1 K ก (/k/)
2 K + 1 ข (/kʰ/)
3 K + 2 ค (/kʰ/)
4 K + 3 ฆ (/kʰ/)
5 T ต (/t/)
6 T + 1 ถ (/tʰ/)
7 T + 2 ฐ (/tʰ/)
8 T + 3 ฒ (/tʰ/)
9 T + 4 ฑ (/tʰ/)
10 T + 5 ฏ (/t̚/)
11 S ส (/s/)
12 S + 1 ศ (/s/)
13 S + 2 ษ (/s/)
14 S + Q ซ (/s/)
15 P พ (/pʰ/)
16 P + 1 ป (/p/)
17 P + 2 ผ (/pʰ/)
18 P + 3 ภ (/pʰ/)
19 H ห (/h/)
20 H + 1 ฮ (/h/)
21 B บ (/b/)
22 R ร (/r/)
23 W ว (/w/)
24 D ด (/d/)
25 D + 1 ฎ (/d/)
26 F ฟ (/f/)
27 F + 1 ฝ (/f/)
28 L ล (/l/)
29 L + 1 ฬ (/l/)
30 J1 + J2 จ (/t͡ɕ/)
31 Y ย (/j/)
32 Y + 1 ญ (/j/)
33 M ม (/m/)
34 N น (/n/)
35 N + 1 ณ (/n/)
36 N + G ง (/ŋ/)
37 T + H ท (/tʰ/)
38 T + H + 1 ธ (/tʰ/)
39 C + H ฉ (/t͡ɕʰ/)
40 C + H + 1 ช (/t͡ɕʰ/)
41 C + H + 2 ฌ (/t͡ɕʰ/)
42 A อ (/ʔ/)

4. Experiments and Results

4.1. Data Collection

Data collection for this research is divided into two types: images from 6 locations for training, and videos for testing at 3 locations (Library, Computer Lab2, and Office1), which take a slightly different view of the camera. They were taken with a digital camera.
Figure 15. The six locations of data collection: (a) Library; (b) Computer Lab1; (c) Computer Lab2; (d) Canteen; (e) Office1; and (f) Office2.
The hand sign images used for the training were divided into 2 groups: the training data for the SegNet model and the training data for the CNN model, with 25 signs in total. The training data for the SegNet model used original images and image labels, 320 × 240 pixels in size, totaling 10,000 images. The second group contained images that were hand-segmented via semantic segmentation and were contained in the hand-segmented library. These images were 150 × 250 pixels in size. The training dataset contained 25 signs, as presented in Figure 16, each with 5000 images for a total of 125,000 images.
Figure 16. Twenty-five hand signs for 42 Thai letters.
The video used in the testing process was isolated TFSL featuring 42 Thai letters.
These videos were taken from a hand signer group consisting of 4 people, and each letter
was recorded 5 times, totaling (42 × 4 × 5) 840 files.
4.2. Experimental Results
The experiments in this study were divided into three parts. The first part involved the evaluation to measure performance using intersection over union areas (IoU); the second part was the experiment used to determine the efficiency of the CNN model; and the last part was the testing of the multi-stroke TFSL recognition system.
4.2.1. Intersection Over Union Areas (IoU)
IoU is a statistical method that measures the consistency between a predicted bounding box and the ground truth by dividing the overlapping area between the prediction and the ground truth by the union area between the prediction and the ground truth, as in Equation (4) and shown in Figure 17 [18]. A high IoU value (closer to 1) indicates a bounding box with high accuracy. A threshold of 0.5 determines whether the predicted bounding box is accurate, as shown in Figure 18. According to the standard Pascal Visual Object Classes Challenge 2007, acceptable IoU values must be greater than 0.5 [13].
IoU = Area of overlap between bounding boxes / Area of union between bounding boxes. (4)
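Equation (4) can be sketched in a few lines; this is a minimal illustration, not the paper's implementation, and the corner-coordinate box format is an assumption:

```python
def iou(box_a, box_b):
    # Boxes given as (x1, y1, x2, y2) corner coordinates.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Area of overlap between the bounding boxes (zero if disjoint).
    overlap = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    # Area of union between the bounding boxes.
    union = area_a + area_b - overlap
    return overlap / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / 150 ≈ 0.333, "poor" under the 0.5 threshold
```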
Figure 17. The intersection over union areas (IoU) [25].
Figure 18. IoU is evaluated as follows: the left IoU is 0.4034, which is poor; the middle IoU = 0.7330, which is good; and the high IoU = 0.9264, which is excellent [13].
The research used sign language gestures on a complex background along with label images, all of which were 320 × 240 pixels in size. A total of 10,000 images passed through semantic segmentation training using dilated convolutions. The results of the dilation rate configuration test and a variety of convolutions can be structured according to the following details. The structure of the training process consists of five blocks of convolution. Each block includes 32 convolution filters (3 × 3 in size), with different dilation factors, batch normalization, and ReLU. The dilations are 1, 2, 4, 8, 16, and 1, respectively. The training option uses MaxEpochs = 500 and MiniBatchSize = 64. According to the IoU performance assessment, the average IoU was 0.8972, which is excellent.
4.2.2. Experiments on the CNN Models
Experiments to determine the effectiveness of the CNN models in this research used the 5-fold cross validation method. All input hand sign images used in this experiment comprised 125,000 images from the hand-segmented library, consisting of 25 finger-spelling signs with 5000 images each. This dataset was divided into 5 groups, each with 25,000 images. Each round of 5-fold cross validation used 100,000 images for training and 25,000 images for testing. This experiment compared the effectiveness of the configuration, the number of filters, and the sizes of the filters across five different CNN structures.
The first format used seven layers of 64 filters, each 3 × 3 in size. The second format set the number of filters to 128, each 3 × 3 in size, with 7 layers. The third format sorted the number of filters in ascending order over 7 layers, starting from 2, 5, 10, 20, 40, 80, and 160, respectively; all filter sizes were 3 × 3. The fourth format determined the number of filters in ascending order and set the sizes of the filters from large to small over 7 layers (number of filters/filter size: 4/11 × 11, 8/5 × 5, 16/5 × 5, 32/3 × 3, 64/3 × 3, 128/3 × 3, 256/3 × 3). The final format was a structure based on AlexNet, which defined the numbers and sizes of the filters as follows (number of filters/filter size): 96/11 × 11, 256/5 × 5, 384/3 × 3, 384/3 × 3, 256/3 × 3, 256/3 × 3 [26]. The structure details are shown in Figure 19.
Figure 19. The five CNN formats for training the model.
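For reference, the five filter configurations can be written down as plain data. The layer lists are transcribed from the text above; representing each layer as a (filters, kernel size) pair is our own convention, not the paper's:

```python
# (number_of_filters, kernel_size) per convolution layer, per the text.
CNN_FORMATS = {
    1: [(64, 3)] * 7,
    2: [(128, 3)] * 7,
    3: [(2, 3), (5, 3), (10, 3), (20, 3), (40, 3), (80, 3), (160, 3)],
    4: [(4, 11), (8, 5), (16, 5), (32, 3), (64, 3), (128, 3), (256, 3)],
    5: [(96, 11), (256, 5), (384, 3), (384, 3), (256, 3), (256, 3)],  # AlexNet-based [26]
}

for fmt, layers in CNN_FORMATS.items():
    # Formats 1-4 list seven convolution layers; the AlexNet-based format lists six.
    assert len(layers) == (6 if fmt == 5 else 7)
```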
The CNN model experiment used the 5-fold cross validation method, and the results are shown in Table 3. This experiment found that the accuracy of the structure based on the AlexNet format was 92.03%. The fourth format, with 7 layers, an ascending number of filters, and filter sizes sorted from larger to smaller, offered an average accuracy of up to 90.04%. Format 3, which uses seven layers of filters in ascending order, had the next-highest accuracy average of 89.91%. The first and second formats use seven static filter layers, with 64 and 128 filters, respectively, and average accuracies of 88.23% and 87.97%.
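The cross-validation rotation behind these averages can be sketched without any framework; the group size (25,000) and the 100,000/25,000 split are from the text, while the index-based partitioning is illustrative:

```python
def five_fold_splits(n_images=125000, folds=5):
    # Partition image indices into 5 groups of 25,000 and rotate the test fold.
    size = n_images // folds
    groups = [range(i * size, (i + 1) * size) for i in range(folds)]
    for test_fold in range(folds):
        train = [i for f, g in enumerate(groups) if f != test_fold for i in g]
        test = list(groups[test_fold])
        yield train, test

for train, test in five_fold_splits():
    assert len(train) == 100000 and len(test) == 25000  # per round, as in the text
```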
Table 3. Five-fold cross validation accuracy of the five CNN formats.

Five-Fold Cross Validation   Format 1 (%)   Format 2 (%)   Format 3 (%)   Format 4 (%)   Format 5 (%)
1st                          80.12          80.60          83.43          84.17          87.72
2nd                          87.12          85.90          86.28          86.15          88.96
3rd                          90.06          90.34          92.44          92.53          93.98
4th                          93.84          92.82          95.52          94.87          97.00
5th                          90.00          90.20          91.86          92.47          92.50
Average accuracy             88.23          87.97          89.91          90.04          92.03
Comparing the five CNN accuracy results, it was found that the fifth format provided the highest accuracy. Therefore, we used this format for training to create models for the multi-stroke TFSL recognition system, divided into 112,500 training images, representing 90%, and 12,500 testing images, representing 10%. The accuracy of the model in the experiment was 99.98%.

4.2.3. The Experiment of the Multi-Stroke TFSL Recognition System
This recognition system focuses on user-independent testing via the isolated TFSL video format. The accuracy of each alphabet was tested 20 times. The experiment was divided into three groups: the one-stroke results, as shown in Table 4; the two-stroke results, which are presented in Table 5; and the three-stroke results, which are shown in Table 6.

Table 4. The results of the one-stroke TFSL recognition system.

No.   Thai Letters   Rule   Correct   Incorrect   Accuracy (%)
1     ก (/k/)        K      13        7           65.00
2     ต (/t/)        T      18        2           90.00
3     ส (/s/)        S      18        2           90.00
4     พ (/pʰ/)       P      20        0           100.00
5     ห (/h/)        H      20        0           100.00
6     บ (/b/)        B      17        3           85.00
7     ร (/r/)        R      15        5           75.00
8     ว (/w/)        W      17        3           85.00
9     ด (/d/)        D      20        0           100.00
10    ฟ (/f/)        F      20        0           100.00
11    ล (/l/)        L      20        0           100.00
12    ย (/j/)        Y      20        0           100.00
13    ม (/m/)        M      18        2           90.00
14    น (/n/)        N      11        9           55.00
15    อ (/ʔ/)        A      17        3           85.00
Average accuracy                                  88.00
The next step was to classify the hand segmentation images with the CNN model to predict 10 frames of each hand sign. The system votes on the results of the prediction in descending order and selects the most predictable hand sign. As shown in Figure 11, when passing the classification process, the results can be predicted as the K sign and the 3 sign. After that, the signs are combined and then translated into Thai letters. The rules for the combination of signs are shown in Table 2.
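The per-stroke vote and rule lookup described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code; COMBINATION_RULES holds three sample entries from Table 2 (K → ก, K + 3 → ฆ, T + H + 1 → ธ), and the remaining rules follow the same pattern:

```python
from collections import Counter

# Sample of the Table 2 combination rules (stroke sequence -> Thai letter).
COMBINATION_RULES = {("K",): "ก", ("K", "3"): "ฆ", ("T", "H", "1"): "ธ"}

def vote(frame_predictions):
    # Majority vote over the 10 per-frame CNN predictions of one stroke.
    return Counter(frame_predictions).most_common(1)[0][0]

def translate(strokes):
    # Each stroke is a list of frame-level predictions; the voted signs index Table 2.
    return COMBINATION_RULES[tuple(vote(f) for f in strokes)]

# The K sign followed by the 3 sign, as in the Figure 11 example.
two_strokes = [["K"] * 8 + ["T"] * 2, ["3"] * 9 + ["1"]]
print(translate(two_strokes))  # ฆ (rule K + 3 in Table 2)
```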
Table 2. Combination rules for TFSL.

No.   Rule       Alphabet        No.   Rule        Alphabet
1     K          ก (/k/)         22    R           ร (/r/)
2     K + 1      ข (/kʰ/)        23    W           ว (/w/)
3     K + 2      ค (/kʰ/)        24    D           ด (/d/)
4     K + 3      ฆ (/kʰ/)        25    D + 1       ฎ (/d/)
5     T          ต (/t/)         26    F           ฟ (/f/)
6     T + 1      ถ (/tʰ/)        27    F + 1       ฝ (/f/)
7     T + 2      ฐ (/tʰ/)        28    L           ล (/l/)
8     T + 3      ฒ (/tʰ/)        29    L + 1       ฬ (/l/)
9     T + 4      ฑ (/tʰ/)        30    J1 + J2     จ (/t͡ɕ/)
10    T + 5      ฏ (/t̚/)         31    Y           ย (/j/)
11    S          ส (/s/)         32    Y + 1       ญ (/j/)
12    S + 1      ศ (/s/)         33    M           ม (/m/)
13    S + 2      ษ (/s/)         34    N           น (/n/)
14    S + Q      ซ (/s/)         35    N + 1       ณ (/n/)
15    P          พ (/pʰ/)        36    N + G       ง (/ŋ/)
16    P + 1      ป (/p/)         37    T + H       ท (/tʰ/)
17    P + 2      ผ (/pʰ/)        38    T + H + 1   ธ (/tʰ/)
18    P + 3      ภ (/pʰ/)        39    C + H       ฉ (/t͡ɕʰ/)
19    H          ห (/h/)         40    C + H + 1   ช (/t͡ɕʰ/)
20    H + 1      ฮ (/h/)         41    C + H + 2   ฌ (/t͡ɕʰ/)
21    B          บ (/b/)         42    A           อ (/ʔ/)

In terms of performance, the one-stroke TFSL recognition system shown in Table 4 had an average accuracy of 88.00%. According to the results, the system also encountered very similar sign language gesture discrimination problems in three groups: (a) a handful of gestures that use the thumb position ingress in different sign gestures for the letters T, S, M, N, and A, yielding lower accuracy for the letter N at 55%; (b) group gestures with two fingers, consisting of the index finger and middle finger, which are slightly different from the position of the thumb. This group consisted of letters K and 2 and lowered the accuracy of the letter K to 65%; and (c) a sign language gesture similar to raising one finger. The R gesture involves crossing between the index finger and ring finger, which is similar to the gesture for number 1, which involves holding up one index finger. This resulted in the accuracy of the letter R being 75%, as shown in Figure 20. These three groups of similarities among the sign language gestures resulted in a low accuracy average.

Table 5. The results of the two-stroke TFSL recognition system.

                              Classification                        Strokes Detection
No.   Thai Letters   Rule     Correct   Incorrect   Accuracy (%)   Correct   Incorrect   Error Rate (%)
1     ข (/kʰ/)       K + 1    17        3           85.00          19        1           5
2     ค (/kʰ/)       K + 2    16        4           80.00          16        4           20
3     ฆ (/kʰ/)       K + 3    20        0           100.00         20        0           0
4     ถ (/tʰ/)       T + 1    17        3           85.00          20        0           0
5     ฐ (/tʰ/)       T + 2    15        5           75.00          20        0           0
6     ฒ (/tʰ/)       T + 3    20        0           100.00         20        0           0
7     ฑ (/tʰ/)       T + 4    20        0           100.00         20        0           0
8     ฏ (/t̚/)        T + 5    17        3           85.00          20        0           0
9     ศ (/s/)        S + 1    11        9           55.00          16        4           20
10    ษ (/s/)        S + 2    15        5           75.00          18        2           10
11    ซ (/s/)        S + Q    15        5           75.00          20        0           0
12    ป (/p/)        P + 1    20        0           100.00         20        0           0
13    ผ (/pʰ/)       P + 2    18        2           90.00          20        0           0
14    ภ (/pʰ/)       P + 3    18        2           90.00          18        2           10
15    ฮ (/h/)        H + 1    18        2           90.00          20        0           0
16    ฎ (/d/)        D + 1    20        0           100.00         20        0           0
17    ฝ (/f/)        F + 1    18        2           90.00          20        0           0
18    ฬ (/l/)        L + 1    17        3           85.00          17        3           15
19    จ (/t͡ɕ/)       J1 + J2  20        0           100.00         20        0           0
20    ญ (/j/)        Y + 1    19        1           95.00          20        0           0
21    ณ (/n/)        N + 1    11        9           55.00          20        0           0
22    ง (/ŋ/)        N + G    12        8           60.00          20        0           5
23    ท (/tʰ/)       T + H    17        3           85.00          20        0           0
24    ฉ (/t͡ɕʰ/)      C + H    19        1           95.00          20        0           0
Average accuracy                                    85.42          Average of error rate 3.54

Table 6. The results of the three-stroke TFSL recognition system.

                               Classification                        Stroke Detection
No.   Thai Letters   Rule      Correct   Incorrect   Accuracy (%)   Correct   Incorrect   Error Rate (%)
1     ธ (/tʰ/)       T + H + 1 17        3           85.00          20        0           0
2     ช (/t͡ɕʰ/)      C + H + 1 13        7           65.00          19        1           5
3     ฌ (/t͡ɕʰ/)      C + H + 2 15        5           75.00          18        2           10
Average accuracy                                     75.00          Average of error rate 5
Table 5 shows the results of the TFSL two‑stroke recognition system. The results of
the experiment show that groups with very similar signs, such as T, S, M, N, and A, af‑
fected the accuracy rates in two‑stroke sign language recognition systems, such as T + 2,
S + 2, and S + Q providing an accuracy of 75%, N + G providing an accuracy of 60%, and
S + 1 and N + 1 offering an accuracy of 55%. Sign language gesture number 2 was similar
to K gestures. This affected the accuracy of K + 2, which had an accuracy rate of 80%. The
similarity between sign language gestures affected the average accuracy of the two‑stroke
sign language recognition system, which was equal to 85.42%. In the sign language recog‑
nition system for two‑stroke, we measured the error rate of stroke detection. The results
show that sign language gestures with slight hand movements, such as K + 2 and S + 1, had
a high error rate of 20%, affecting the overall accuracy of the system. The average error
rate was 3.54%.
Three-stroke TFSL combines three sign language gestures to display one letter. Thus, three-stroke TFSL consists of three letters: T + H + 1, C + H + 1, and C + H + 2. The results showed that the accuracy of T + H + 1 was 85%, that of C + H + 2 was 75%, and that of C + H + 1 was 65%, indicating a low accuracy value. This low accuracy was due to several reasons, such as the detection of faulty strokes due to the use of multi-stroke sign language and sign language similarities. The average overall accuracy was 75%. The three-stroke gestures have distinct posture movements, so there is only a slight error rate; the average error rate is 5%, as shown in Table 6.
Table 7 shows that the multi-stroke TFSL recognition system has an overall average accuracy of 85.60%. The experiment used training images for CNN modeling and testing with multi-stroke isolated TFSL. The results show that one-stroke was tested 300 times and was correct 264 times and incorrect 36 times, with an average accuracy of 88%. For two-strokes, the average accuracy was 85.42% from 480 tests, with 410 correct and 70 incorrect. The use of three-strokes was tested 60 times and was found to be correct 45 times and incorrect 15 times, with an average accuracy of 75%. Overall stroke detection of the system showed an average error rate of 3.70%. The results show that sign language gestures with little movement affected stroke detection and caused misclassification.
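The overall figure follows directly from the per-group counts given above, as a quick arithmetic check confirms:

```python
# Correct / total test counts per stroke group, from the text.
groups = {"one-stroke": (264, 300), "two-stroke": (410, 480), "three-stroke": (45, 60)}

correct = sum(c for c, _ in groups.values())
total = sum(t for _, t in groups.values())
print(round(100 * correct / total, 2))  # 85.6: the 85.60% overall average accuracy
```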
Table 7. The overall results of the multi‑stroke TFSL recognition system experiment.
5. Conclusions
The research focused on the development of a multi-stroke TFSL recognition system that supports complex backgrounds with deep learning. This study compared the performance of five CNN models. According to the results, the first and second structure formats are static CNN architectures: the number and size of the filters were the same for all layers, providing minimal accuracy. The third format determines the number of filters in ascending order with the same filter size, which results in increased accuracy. The fourth format uses an ascending number of filters; the first layer uses large filters for learning global features, whereas the following layers use smaller filters for learning local features, resulting in higher accuracy. The fifth format is a mixed architectural style designed based on the AlexNet structure. Its first and second convolution layers increase the number of filters in ascending order, the third and fourth layers use a static number of filters raised from the second layer, and the fifth and sixth layers use a fixed number of filters dropped from the two previous layers. For a similar purpose, large filters are used in the first layer to learn global features, while smaller filters are used in the following layers to learn local features. The results show that a mixed architecture with global feature learning followed by local feature learning achieves the best accuracy. By using the fifth format to train the
system, the results of the overall study indicate that factors affecting the system’s accuracy
average include (1) very similar sign language gestures that negatively affect classification,
resulting in lower accuracy average results; (2) low‑movement sign language spelling ges‑
tures that affect the detection of multiple spelling gestures, causing faulty stroke detection;
and (3) the spelling of multiple‑stroke sign language, which affects the average accuracy
since too much movement can sometimes lead to movement being detected between ges‑
tural changes, resulting in system recognition errors. A solution could be to improve the
motion detection system to make the strokes of the sign language more accurate or to apply
a long short‑term memory network (LSTM) to enhance the recognition system’s accuracy.
In conclusion, the study results demonstrate that similarities in the gestures and strokes
when finger‑spelling Thai sign language caused decreases in accuracy. For future studies,
further TFSL recognition system development is needed.
Author Contributions: Conceptualization, T.P. and P.S.; methodology, T.P.; validation, P.S.; investi‑
gation, T.P. and P.S.; data curation, T.P.; writing—original draft preparation, T.P. and P.S.; writing—
review and editing, T.P. and P.S.; supervision, P.S.; All authors have read and agreed to the published
version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data presented in this study are available on request from the cor‑
responding author. The data are not publicly available due to the copyright of the Natural Language
and Speech Processing Laboratory, Department of Computer Science, Faculty of Science, Khon Kaen
University, Khon Kaen.
Acknowledgments: This study was funded by the Computer Science Department, Faculty of Sci‑
ence, Khon Kaen University, Khon Kaen, Thailand. The authors would like to extend our thanks to
Sotsuksa Khon Kaen School for their assistance and support in the development of the Thai finger‑
spelling sign language data.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Zhang, C.; Tian, Y.; Huenerfauth, M. Multi‑Modality American Sign Language Recognition. In Proceedings of the 2016 IEEE
International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 2881–2885. [CrossRef]
2. Nakjai, P.; Katanyukul, T. Hand Sign Recognition for Thai Finger Spelling: An Application of Convolution Neural Network. J.
Signal Process. Syst. 2019, 91, 131–146. [CrossRef]
3. Pariwat, T.; Seresangtakul, P. Thai Finger‑Spelling Sign Language Recognition Employing PHOG and Local Features with KNN.
Int. J. Adv. Soft Comput. Appl. 2019, 11, 94–107.
4. Pariwat, T.; Seresangtakul, P. Thai Finger‑Spelling Sign Language Recognition Using Global and Local Features with SVM. In
Proceedings of the 2017 9th International Conference on Knowledge and Smart Technology (KST), Chonburi, Thailand, 1–4
February 2017; pp. 116–120. [CrossRef]
5. Adhan, S.; Pintavirooj, C. Thai Sign Language Recognition by Using Geometric Invariant Feature and ANN Classification. In
Proceedings of the 2016 9th Biomedical Engineering International Conference (BMEiCON), Laung Prabang, Laos, 7–9 December
2016; pp. 1–4. [CrossRef]
6. Saengsri, S.; Niennattrakul, V.; Ratanamahatana, C.A. TFRS: Thai Finger‑Spelling Sign Language Recognition System. In Proceedings of the 2012 Second International Conference on Digital Information and Communication Technology and It's Applications (DICTAP), Bangkok, Thailand, 16–18 May 2012; pp. 457–462. [CrossRef]
7. Chansri, C.; Srinonchat, J. Reliability and Accuracy of Thai Sign Language Recognition with Kinect Sensor. In Proceedings of
the 2016 13th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information
Technology (ECTI‑CON), Chiang Mai, Thailand, 28 June–1 July 2016; pp. 1–4. [CrossRef]
8. Silanon, K. Thai Finger‑Spelling Recognition Using a Cascaded Classifier Based on Histogram of Orientation Gradient Features.
Comput. Intell. Neurosci. 2017, 2017, 1–11. [CrossRef] [PubMed]
9. Tolentino, L.K.S.; Juan, R.O.S.; Thio‑ac, A.C.; Pamahoy, M.A.B.; Forteza, J.R.R.; Garcia, X.J.O. Static Sign Language Recognition
Using Deep Learning. Int. J. Mach. Learn. Comput. 2019, 9, 821–827. [CrossRef]
10. Vo, A.H.; Pham, V.‑H.; Nguyen, B.T. Deep Learning for Vietnamese Sign Language Recognition in Video Sequence. Int. J. Mach.
Learn. Comput. 2019, 9, 440–445. [CrossRef]
11. Goel, T.; Murugan, R. Deep Convolutional—Optimized Kernel Extreme Learning Machine Based Classifier for Face Recognition.
Comput. Electr. Eng. 2020, 85, 106640. [CrossRef]
12. Wang, N.; Wang, Y.; Er, M.J. Review on Deep Learning Techniques for Marine Object Recognition: Architectures and Algorithms.
Control Eng. Pract. 2020, 104458. [CrossRef]
13. Cowton, J.; Kyriazakis, I.; Bacardit, J. Automated Individual Pig Localisation, Tracking and Behaviour Metric Extraction Using
Deep Learning. IEEE Access 2019, 7, 108049–108060. [CrossRef]
14. Chen, L.‑C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [CrossRef] [PubMed]
15. Rahim, M.A.; Islam, M.R.; Shin, J. Non‑Touch Sign Word Recognition Based on Dynamic Hand Gesture Using Hybrid Segmentation and CNN Feature Fusion. Appl. Sci. 2019, 9, 3790. [CrossRef]
16. Koller, O.; Forster, J.; Ney, H. Continuous Sign Language Recognition: Towards Large Vocabulary Statistical Recognition Systems Handling Multiple Signers. Comput. Vis. Image Underst. 2015, 141, 108–125. [CrossRef]
17. Hamed, A.; Belal, N.A.; Mahar, K.M. Arabic Sign Language Alphabet Recognition Based on HOG‑PCA Using Microsoft Kinect
in Complex Backgrounds. In Proceedings of the 2016 IEEE 6th International Conference on Advanced Computing (IACC),
Bhimavaram, India, 27–28 February 2016; pp. 451–458. [CrossRef]
18. Dahmani, D.; Larabi, S. User‑Independent System for Sign Language Finger Spelling Recognition. J. Vis. Commun. Image Represent. 2014, 25, 1240–1250. [CrossRef]
19. Auephanwiriyakul, S.; Phitakwinai, S.; Suttapak, W.; Chanda, P.; Theera‑Umpon, N. Thai Sign Language Translation Using
Scale Invariant Feature Transform and Hidden Markov Models. Pattern Recognit. Lett. 2013, 34, 1291–1298. [CrossRef]
20. Yu, F.; Koltun, V. Multi‑Scale Context Aggregation by Dilated Convolutions. In Proceedings of the 2016 International Conference
on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
21. Li, Y.; Zhang, X.; Chen, D. CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes.
In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23
June 2018; pp. 1091–1100. [CrossRef]
22. Hosoe, H.; Sako, S.; Kwolek, B. Recognition of JSL Finger Spelling Using Convolutional Neural Networks. In Proceedings of
the 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA), Nagoya, Japan, 8–12 May 2017; pp.
85–88. [CrossRef]
23. Bunrit, S.; Inkian, T.; Kerdprasop, N.; Kerdprasop, K. Text‑Independent Speaker Identification Using Deep Learning Model of
Convolution Neural Network. Int. J. Mach. Learn. Comput. 2019, 9, 143–148. [CrossRef]
24. Colque, R.V.H.M.; Junior, C.A.C.; Schwartz, W.R. Histograms of Optical Flow Orientation and Magnitude to Detect Anomalous
Events in Videos. In Proceedings of the 2015 28th SIBGRAPI Conference on Graphics, Patterns and Images, Salvador, Bahia,
Brazil, 26–29 August 2015; pp. 126–133. [CrossRef]
25. Cheng, S.; Zhao, K.; Zhang, D. Abnormal Water Quality Monitoring Based on Visual Sensing of Three‑Dimensional Motion
Behavior of Fish. Symmetry 2019, 11, 1179. [CrossRef]
26. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM
2017, 60, 84–90. [CrossRef]