
Article

Hand Gesture Recognition via Lightweight VGG16 and Ensemble Classifier
Edmond Li Ren Ewe 1, Chin Poo Lee 2,*, Lee Chung Kwek 1 and Kian Ming Lim 2

1 Faculty of Engineering and Technology, Multimedia University, Melaka 75450, Malaysia; 1201400034@student.mmu.edu.my (E.L.R.E.); lckwek@mmu.edu.my (L.C.K.)
2 Faculty of Information Science and Technology, Multimedia University, Melaka 75450, Malaysia; kmlim@mmu.edu.my
* Correspondence: cplee@mmu.edu.my

Abstract: Gesture recognition has long been studied within the fields of computer vision and pattern recognition. A gesture can be defined as a meaningful physical movement of the fingers, hands, arms, or other parts of the body made with the purpose of conveying information for interaction with the environment. For instance, hand gesture recognition (HGR) can be used to recognize sign language, which is the primary means of communication for the deaf and mute. Vision-based HGR is critical in this application; however, there are challenges that need to be overcome, such as variations in background, illumination, hand orientation and size, and similarities among gestures. The traditional machine learning approach has been widely used in vision-based HGR in recent years, but the complexity of its processing has been a major challenge, especially for handcrafted feature extraction. The effectiveness of handcrafted feature extraction techniques has not been proven across various datasets in comparison to deep learning techniques. Therefore, a hybrid network architecture dubbed Lightweight VGG16 and Random Forest (Lightweight VGG16-RF) is proposed for vision-based hand gesture recognition. The proposed model performs feature extraction via a convolutional neural network (CNN) while using a machine learning method to perform classification. Experiments were carried out on publicly available datasets, namely the American Sign Language (ASL), ASL Digits and NUS Hand Posture datasets. The experimental results demonstrate that the proposed model, a combination of lightweight VGG16 and random forest, outperforms other methods.

Keywords: sign language recognition; hand gesture recognition; convolutional neural network (CNN); ensemble classifier; lightweight VGG16; random forest; transfer learning

Citation: Ewe, E.L.R.; Lee, C.P.; Kwek, L.C.; Lim, K.M. Hand Gesture Recognition via Lightweight VGG16 and Ensemble Classifier. Appl. Sci. 2022, 12, 7643. https://doi.org/10.3390/app12157643
Academic Editor: João M. F. Rodrigues
Received: 27 June 2022; Accepted: 18 July 2022; Published: 29 July 2022

1. Introduction
Communication, whether verbal or through gestures, is a necessity in one's life for conveying messages and for interaction. When deaf and mute persons interact with hearing people who are not familiar with sign language, a communication barrier arises. This communication gap can be overcome with the presence of interpreters who convert sign language into spoken language and vice versa. However, an interpreter, whether a person or a device, is extremely expensive, and it may not be available for the rest of a deaf person's life. As a result, advancements in the hand gesture recognition of sign languages will benefit the deaf and mute community by bridging the communication gap that currently exists.

Most sign language lexicons are made up of hand gestures, which are usually combined with facial expressions and body movements that emphasize the words or phrases. Due to this inherent trait, a hand gesture can be either static or dynamic in nature. Dynamic gestures are made up of a series of hand gestures that move, whereas static
hand gestures are made up of various forms and hand orientations that do not reflect any
motion information.
The vision-based approach can be divided into two categories: the handcrafted machine learning approach and the deep learning approach (depicted in Figure 1). The handcrafted approach, also known as traditional machine learning, defines and extracts features in a separate stage before passing them to the machine learning algorithm. Some examples of pre-defined features are edges, corners, histograms, etc. On the other hand, the deep learning approach, such as CNN, does not need a specific manual feature extraction process, as the algorithm itself searches for the features that best discriminate the images. The major difference between deep learning and machine learning techniques is the problem-solving approach. Deep learning techniques tend to solve the problem end to end, whereas machine learning techniques break the problem down into different parts that are solved first; their results are then combined in the final stage.

Figure 1. Different categories of vision-based approaches.

Lately, more studies have been carried out to propose models that can classify datasets with different conditions, such as varying illumination levels and complex backgrounds, through CNN. By employing CNN, the hand-crafted feature extraction stage can be avoided, especially when the dataset comes with complex backgrounds. However, whenever CNN is involved, the dataset size is one of the crucial considerations for classification. Generally, deep neural networks require a very large amount of training data to avoid overfitting, whereas traditional machine learning approaches are more robust due to their hierarchical structure and have a shorter execution time. In order to achieve better accuracy, researchers have tried using deeper convolution layers but have reported that computational resources such as computer memory are a major stumbling block, not to mention the time taken to perform training.
In this paper, a hybrid hand gesture recognition model based on CNN as part of the
deep learning and ensemble classifier is introduced. The performance of a model heavily
depends on the features studied and extracted accurately. Hence, feature extraction via the
CNN approach avoids complex methods in manual feature extraction, especially when it is
required to be crafted according to each individual dataset. However, the dataset size and
execution time, which have constantly been a source of worry with regard to CNN, will be
addressed using machine learning methods for classification. The key contributions of this paper are as follows:
1. A hybrid model using deep learning techniques for feature extraction and an ensemble
classifier for classification (Lightweight VGG16 and Random Forest) is devised for
hand gesture recognition;
2. Reduced burden on the computation resources required for VGG16 feature extraction
through architecture depth optimization;
3. Execution time improvement in the comparison of lightweight VGG16-RF to a full-
fledged deep learning architecture for hand gesture recognition.

The remainder of this paper is organized as follows: Section 2 reviews the related work
pertaining to hand gesture recognition; Section 3 presents the proposed model; Section 4
covers the datasets used, the experiments carried out and the results recorded; and Section 5
concludes the paper.

2. Related Works
Before the deep learning approach became popular, the hand-crafted approach was the
way to go for image recognition, particularly vision-based. A hand-crafted approach often
consists of several sections of image pre-processing and specific crafted feature extractions
modules. Vishwakarma (2017) [1] proposed a hand gesture recognition using the shape
and texture evidence in complex backgrounds. The National University of Singapore (NUS)
Hand Posture dataset used was subjected to segmentation and morphological operation for
image pre-processing. The pre-processing mainly targeted the internal noises prior to using
the Gabor filter to retrieve the texture features of the images. A differentiable intensity
profile was created through the Gabor filter and smoothened through the Gaussian filter, and the intensity information was then fed into the classifier.
Sadeddine et al. (2018) [2] proposed an implementation of hand posture recognition
using several descriptors on three different databases, namely American Sign Language
(ASL), Arabic Sign Language (ArSL) and the NUS Hand Posture dataset. The system
architecture was categorized into three phases, namely hand detection, feature extraction
and classification. Several descriptors such as Hu’s Moment Descriptor (Hu’s MD), Zernike
Moments Descriptor (ZMD), Generic Fourier Descriptor (GFD) and Local Binary Pattern
Descriptor (LBPD) were used to detect the hand posture region. In Hu’s MD, the moment
invariants were computed based on the information provided by both the external shape
and internal edges. While for LBPD, the image was divided into several non-overlapping
blocks; LBP histograms were then computed for each individual block. Finally, the local
binary pattern (LBP) histograms were concatenated into a single vector. As for ZMD, a
statistical measure of pixel distribution around the centre of gravity of the shape was used
to detect the hand in the image and constructed a bounding box around it to eliminate the
unwanted surrounding background.
Zhang et al. (2018) [3] proposed a hand gesture recognition system based on the
Histogram of Oriented Gradients (HOG) and LBP using the NUS dataset. In the proposed algorithm, feature extraction was performed separately and in parallel via HOG and LBP; the collected features were then fused into one and passed to a Support Vector Machine (SVM) for classification. HOG features were used to acquire edge and local shape information, while LBP features were used to extract texture features that are robust to grey-level transforms as well as to rotational changes. In the final stage, SVM with the radial basis function (RBF) kernel was used to classify the features obtained.
Gajalakshmi and Sharmila (2019) [4] proposed a hand gesture recognition using SVM
with the chain code histogram (CCH) used for feature extraction on the NUS dataset. The
process began with a thresholding process as part of the pre-processing to produce binary
images of hand posture for feature extraction. Ridler and Calvard thresholding (RCT)
was used to segment the region of interest. RCT thresholding worked by considering
the average value of the intensity pixels as an initial threshold and the foreground and
background classes were first separated based on the computed average foreground mean
and background mean. For feature vector extraction, CCH segregated the binary image ac-
cording to cluster-based thresholds into grid blocks and the histogram was then calculated
based on the frequently occurring discrete values.
In order to obtain good classification, feature extraction becomes an even more crucial task, especially with complex or noisy backgrounds. Researchers then started to adopt deep learning approaches to ease the creation of feature extraction modules.
Gao et al. (2017) [5] proposed a static hand gesture recognition model with parallel CNNs
for space human–robot interaction on the ASL dataset. This experiment proposed a parallel
CNNs method where the network includes two subnetworks, the RGB-CNN subnetwork
and the Depth-CNN subnetwork, which ran in parallel and merged to obtain the result
for the final model. There are seven layers in each of the RGB-CNN and Depth-CNN subnetworks. Convolution layers made up the first four layers of each CNN, whereas the fully connected layers had 144 and 72 neurons, respectively. The prediction probabilities were
created in the SoftMax classification layers at the conclusion of the subnetwork. The RGB-
CNN and Depth-CNN subnetwork achieved an accuracy of 90.3% and 81.4% individually,
but when combined, the CNN network can achieve a test accuracy of 93.3%.
Adithya and Rajesh (2020) [6] proposed a method for the automatic recognition of
hand postures using convolutional neural networks with deep parallel architectures. The
proposed model avoided the need for hand segmentation, which was a very difficult task
in images with cluttered backgrounds. In the proposal, two datasets were used, namely the
National University of Singapore (NUS) dataset and ASL dataset. The images for training
were subjected to three layers of convolutional operation with different filter sizes for
feature extraction with proper zero padding applied in each layer to ensure that the size of
the input and output remained the same. The dimension of the feature map was reduced
through the max pooling layer for each convolutional layer. In this model, stochastic
gradient descent with the momentum (SGDM) optimization function was used as well.
The proposed model achieved an accuracy of 99.96% and 94.7% for the ASL and NUS
dataset, respectively.
Bheda and Radpour (2017) [7] presented a method to classify both letters and digits
in ASL using deep convolutional networks. Three datasets were used in the research, self-
acquired dataset, ASL Alphabets, and ASL Digits. The author proposed a common CNN
architecture which consisted of three groups of two convolutional layers followed by a
max-pool layer with a dropout layer and connected to two groups of fully connected layers
followed by a dropout layer. The authors noticed that the size of the training data was
critical in ensuring better accuracy at the validation stage. Data augmentation techniques such as rotation and transformations including flipping were applied to the self-acquired dataset in an effort to increase the sample size, yielding an improvement of 20% in overall performance. On top of that, the background of each image was removed using a background-subtraction method to minimize the impact of noise on the overall accuracy. An accuracy of 82.5% was recorded for ASL Alphabets and 97% for ASL Digits, while the self-acquired dataset only recorded 67% and 70%, respectively, for ASL Alphabets and Digits.
In the deep learning-based approach, as researchers have started to realize, the size
of the dataset plays a role in determining a good classification rate. Hence, researchers
are now either performing data augmentation to the datasets or importing weights from
a pre-trained model which was trained on a larger dataset. Ozcan and Basturk (2019) [8]
proposed a hand gesture recognition method for digits using a transfer learning-based
CNN structure with heuristic optimization. Two datasets were used in this proposal, the
ASL Digits dataset and ASL dataset. In this model, the datasets were loaded into the system
together with AlexNet, a pre-trained CNN model with eight learnable layers, of which the first five are convolutional layers and the last three are fully connected layers, as part of the transfer learning. The final three layers of the CNN were modified and optimized using the Artificial Bee Colony (ABC) algorithm.
Tan et al. (2021) [9] proposed a customized network architecture called Enhanced
Densely Connected Convolutional Neural Network (EDenseNet). In the experiment, the
ASL dataset and NUS Hand Posture dataset were used. The datasets were subjected to nine
data augmentation techniques as a mitigation plan towards the effect of data scarcity. The
proposed model had three dense blocks, where each block contained four convolutional layers, and transition layers connected each of the dense blocks. The dense block was set up with three layers at a growth rate of 24 (the number of feature maps to be produced) and a filter size of three; within a single dense block, the feature maps of the preceding convolutional layers were concatenated and served as input to the succeeding convolutional layer. As for the transition layer, it was made up of a bottleneck layer of four convolutional layers, also with a growth rate of 24 and a filter size of three, followed by a pooling layer. Max pooling was deployed in the first two transition layers to extract extreme features such as curves and edges, while average pooling was used in the final transition layer to extract and smoothen out the features.
To further optimize the approach for image classification, combinations of deep learning models with machine learning models have been explored. Wang et al. (2021) [10]
proposed a gesture image recognition method based on transfer learning called MobileNet-
RF. The proposed model’s structure was a combination of CNN for feature extraction and
machine learning for classification. The structure worked by processing images through
a standard convolution and continued with stacking depth-wise convolution and point-
wise convolution for feature extraction. Batch normalization (BN) and a ReLU activation function are added after each depth-wise and point-wise convolution, where BN mitigates the slow convergence of the neural network while ReLU offers computational advantages that allow the network design to go deeper. The entire
MobileNet has 28 layers as the depth-wise and point-wise convolution are calculated
separately. The first 28 layers of the proposed network are used to extract the gesture image
features and they are then directly input into the random forest model for classification.
Table 1 presents the summary of the related works.

Table 1. Summary of related works.

Author and Publication Year | Feature Extraction Method | Classification Method
Vishwakarma (2017) [1] | Gabor filter | Support vector machine (SVM)
Sadeddine et al. (2018) [2] | Hu's invariant moments + LBPD + Zernike moments + GFD | Probabilistic neural network (PNN)
Zhang et al. (2018) [3] | Histogram of oriented gradients (HOG) + local binary pattern (LBP) | Support vector machine (SVM)
Gajalakshmi and Sharmila (2019) [4] | Chain code histogram (CCH) | Support vector machine (SVM)
Gao et al. (2017) [5] | CNN | -
Adithya and Rajesh (2020) [6] | CNN | -
Bheda and Radpour (2017) [7] | CNN | -
Ozcan and Basturk (2019) [8] | AlexNet + ABC | -
Tan et al. (2021) [9] | EDenseNet | -
Wang et al. (2021) [10] | MobileNet | Random forest
Sahoo et al. (2022) [11] | AlexNet, VGG16 | -
Wang et al. (2022) [12] | E-MobileNetv2 | -
Gadekallu et al. (2022) [13] | CNN + HHO | -
Li et al. (2022) [14] | HOG, 9ULBP | SVM

Sahoo et al. (2022) [11] proposed a score-level fusion technique between AlexNet
and VGG16 for hand gesture recognition. In the effort of fine-tuning both CNN models,
weights are transferred from the pre-trained model for initialization instead of starting
from scratch. The vector score generated from both fine-tuned CNN models are first
normalized and then fused together using the sum-ruled-based method to form a single
output vector. Through the Massey University (MU) dataset and HUST American Sign
Language (HUST-ASL) dataset, the accuracy of the proposed model was recorded at 90.26%
and 56.18%, respectively.
Wang et al. (2022) [12] proposed an improved lightweight CNN model by adding
adaptive channel attention (ECA module) to an existing MobileNetv2 CNN model called
E-MobileNetv2. The newly added module helped reduce the interference of unrelated
information so as to enhance the model’s feature refining ability, especially in capturing
the cross-channel interactions. The proposed model is further equipped with a new activation function, R6-SELU (instead of ReLU6), for better feature extraction ability and to prevent the loss of negative-valued feature information. The newly proposed model achieved an accuracy of 96.82% while reducing the number of parameters by 30%.

Furthering the effort to improve classification of hand gesture, Gadekallu et al. (2022) [13]
proposed a method where Harris hawks optimization (HHO) algorithm was utilized to
fine-tune the hyperparameters of the CNN model. The algorithm mimics how an actual Harris hawk hunts prey in nature. HHO has two phases, consisting of two exploration stages and four exploitation stages, in the effort to locate the optimal solution within a given search space. The effectiveness of the algorithm contributed to the 100% accuracy achieved when tested on the hand gesture dataset from Kaggle, which consists of non-alphabetical hand gesture actions such as fist, palm and thumb.
Li et al. (2022) [14] proposed a multi-scale and multi-angle algorithm that is robust against complex backgrounds during feature extraction. Features are extracted from the complex background using a Gaussian model and the K-means algorithm and then subjected to HOG and 9ULBP. The fused features are not only invariant to scale and rotation but also rich in texture information. The proposed method used SVM for classification to locate the optimal separation between classes and achieved accuracies of 99.01%, 97.5% and 98.72% when tested on a self-collected dataset, the NUS dataset and the MU Hand Images ASL dataset, respectively.

3. Hand Gesture Recognition Technique


The proposed system is designed to perform hand gesture recognition directly from the
input image without further sectioning of the hand region, even with complex background
conditions. The only pre-processing step is resizing the RGB input image to 48 × 48 to standardize the image size across datasets. The following subsections detail the architecture, in which VGG16 is used for feature extraction while random forest is used for classification. Results from the pre-experimental studies are also shared to explain the reasoning behind this selection.

3.1. Lightweight VGG16 as Feature Extractor


While a network such as ResNet and its variant improves gradient flow and feature
propagation through the summation of an identity function, a recent variant of ResNet [15]
shows that a huge number of convolutional layers contribute very little to the results.
Needless to say, this induces a huge amount of trainable parameters. On the other hand,
DenseNet emphasizes feature reuse via dense connectivity to mitigate the problem and
improve parameter efficiency. In the meantime, dense connectivity further improves the
gradient flow and feature propagation.
Figure 2 illustrates the proposed lightweight VGG16 network architecture for feature
extraction. Figure 3 shows a comparison of the original VGG16 model and the proposed
lightweight VGG16 model. The architecture consists of an input layer, four convolution
blocks and a single batch normalization layer. The first two convolution blocks consist of
two convolution layers and a max pooling layer, whereas the third and fourth convolution
blocks have three convolution layers and one max pooling layer. The number of channels
of the image increases continuously after passing through each of the convolution blocks
while the width and height are halved. The channel first increases from 3 to 64 and then
128, 256 and 512 channels, respectively. As the number of channels increases, the size of
the width and height reduces by half continuously as the image goes through the series
of max pooling layers. The max pooling operation can be denoted as taking the input channel's width, N_w, and height, N_h, and partitioning it into adjacent 2 × 2 pixel regions, which results in an output of N_w/2 × N_h/2. The proposed VGG16 architecture
was slightly modified compared to the original VGG16 structure where the 5th convolution
block was removed from the pre-trained VGG16 model to ease the memory load as the
image size was increased from 32 × 32 × 3 to 48 × 48 × 3 at the input layer.
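
As a quick sanity check of these dimensions, the following small illustrative script (not part of the original paper) traces the spatial size through the four convolution blocks for a 48 × 48 input:

```python
# Each block preserves the spatial size in its convolutions and halves it at max pooling.
width = height = 48
channels_per_block = [64, 128, 256, 512]

for block, channels in enumerate(channels_per_block, start=1):
    width, height = width // 2, height // 2   # 2x2 max pooling halves each side
    print(f"after block {block}: {width} x {height} x {channels}")

# Flattened feature length fed to the classifier: 3 * 3 * 512 = 4608,
# which matches the feature length reported for 4 conv blocks in Table 4.
print("feature length:", width * height * channels_per_block[-1])
```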
Weights for the VGG16 layers were adopted from the pre-trained VGG16 model on
ImageNet which consists of millions of images instead of training the model from scratch.
This allows us to utilize much more optimized weights for feature extraction. Figure 4
provides an illustration of the feature extracted from different convolution layers. It can
be seen that, as the layers go deeper, more specific features are extracted from the image.
However, a good architecture needs to be defined to ensure an appropriate learning capacity
without overfitting.
A layer of batch normalization is added after the 4th convolution block to normalize
the output of the layers and prevent the model overfitting. The batch normalization process
can be applied to any layer of the neural network with the main purpose of having a stable
activation value distribution, which will reduce the internal covariate shift and suppress the
over-fitting problem [16]. With the d-dimensional input, $x = (x^{(1)}, \ldots, x^{(d)})$, each dimension is normalized by

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}\big[x^{(k)}\big]}{\sqrt{\mathrm{Var}\big[x^{(k)}\big]}} \quad (1)$$

where $x^{(k)}$ represents each particular activation, $\mathrm{E}\big[x^{(k)}\big] = \frac{1}{m}\sum_{i=1}^{m} x_i^{(k)}$ and $\mathrm{Var}\big[x^{(k)}\big] = \frac{1}{m}\sum_{i=1}^{m} \big(x_i^{(k)} - \mathrm{E}\big[x^{(k)}\big]\big)^2 + \epsilon$ represent the mini-batch mean and variance, and $\epsilon$ is a constant added for numerical stability. To ensure that the input value is not limited to a narrow range, the normalized value is generally multiplied with the scaling amount $\gamma^{(k)}$ and shifted by the offset amount $\beta^{(k)}$:

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)} \quad (2)$$
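
As a small illustration of Equations (1) and (2), the following hedged NumPy sketch normalizes, scales and shifts each feature dimension of a mini-batch; in practice the framework's batch normalization layer performs this, so this is not the implementation used in the experiments:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Eq. (1): normalize each dimension k over the mini-batch of m samples.
    mean = x.mean(axis=0)   # E[x^(k)]
    var = x.var(axis=0)     # Var[x^(k)]; eps is added for numerical stability
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Eq. (2): scale by gamma^(k) and shift by beta^(k).
    return gamma * x_hat + beta

# Toy mini-batch of 4 samples with 3 feature dimensions.
x = np.array([[1.0, 2.0, 3.0],
              [2.0, 0.0, 1.0],
              [0.0, 1.0, 2.0],
              [3.0, 1.0, 0.0]])
y = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 mean, ~unit variance per dimension
```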
Conventionally, a typical VGG16 model will have three fully connected layers at the
end of the VGG16 architecture for classification, but none are used in this case. The features
extracted from the convolution layers are prepared and fed into the ensemble classifier.
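
A minimal sketch of how the lightweight VGG16 feature extractor described above could be assembled with Keras/TensorFlow is shown below. The layer name block4_pool is Keras's name for the fourth pooling layer of the pre-trained VGG16, and freezing the transferred weights is an assumption; the exact implementation used by the authors may differ:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# VGG16 pre-trained on ImageNet, without the fully connected head.
base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(48, 48, 3))

# Truncate after the 4th convolution block, i.e., drop the 5th block entirely.
truncated = models.Model(inputs=base.input,
                         outputs=base.get_layer("block4_pool").output)
truncated.trainable = False  # keep the transferred ImageNet weights fixed (assumption)

# Batch normalization after the 4th block, then flatten into a feature vector.
inputs = layers.Input(shape=(48, 48, 3))
x = truncated(inputs)
x = layers.BatchNormalization()(x)
outputs = layers.Flatten()(x)          # 3 x 3 x 512 = 4608-dimensional features
feature_extractor = models.Model(inputs, outputs)

# Usage (images: RGB images resized to 48 x 48):
# feats = feature_extractor.predict(tf.keras.applications.vgg16.preprocess_input(images))
```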

Figure 2. Architecture of the lightweight VGG16 feature extractor model.

Figure 3. The model architecture of (left) the original VGG16 and (right) the proposed
lightweight VGG16.

Figure 4. Feature map visualization.

3.2. Random Forest as Ensemble Classifier


A machine learning classifier is designed into the proposed model instead of using VGG16 to perform the classification task. Ensemble learning is a method where better predictive performance is achieved by combining the predictions from several models. There are several methods in ensemble learning, among which is bagging. The bagging method, of which Random Forest is an example, accumulates the averaged results by fitting many decision trees on different samples drawn from the same dataset. The features extracted from the convolution layers are passed through the batch normalization layer and are then fed into the Random Forest. Random forest is an ensemble learning method for classification in which the final class is the one most often selected by the many decision trees. Figure 5 illustrates how the classification is performed. Random forest handles large numbers of input variables well, so in this case the features generated from the convolution layers of VGG16 can be handled without variable deletion or downscaling. No cross-validation set is required to obtain an unbiased estimate, which makes any data pool size suitable.
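
As a sketch of the classification stage, assuming scikit-learn and the hyperparameters reported later in Table 5, the following uses random placeholder vectors that merely stand in for the actual lightweight VGG16 features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Placeholder data: in the real pipeline these are the 4608-dimensional feature
# vectors produced by the lightweight VGG16 extractor and their gesture labels.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(200, 4608))
train_labels = rng.integers(0, 10, size=200)
test_feats = rng.normal(size=(50, 4608))
test_labels = rng.integers(0, 10, size=50)

# Bagging ensemble of 100 decision trees; the final class is the majority vote.
rf = RandomForestClassifier(n_estimators=100, random_state=50)
rf.fit(train_feats, train_labels)
print("accuracy:", accuracy_score(test_labels, rf.predict(test_feats)))
```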

Figure 5. Random forest.



4. Experiments
4.1. Datasets
Three datasets were tested in this experiment: ASL dataset; ASL Digits dataset; and
NUS Hand Posture dataset.

4.1.1. ASL Dataset


This dataset is a collection of American Sign Language Alphabet images obtained
from Kaggle. It has 29 classes in total, which comprises 26 ASL Alphabets and 3 extra
gesture signs (Space, Delete and Nothing). The dataset contains a total of 87,000 pictures in
200 × 200 pixel size. Alphabet J and Z are excluded from the dataset as they are not static.
The dataset is split into 69,600 images for training and 8700 images each for testing and validation. Figure 6 shows some sample images from the ASL dataset.

Figure 6. Sample images from ASL dataset.

4.1.2. ASL Digits Dataset


The ASL Digits dataset [17] consists of 26 classes of alphabets (A–Z) and 10 classes
of digits (0–9). A total of 12,600 images, each of 28 × 28 pixel size, is split into three subsets: a training set, a test set, and a validation set. The training set consists of 10,080 images, while the test and validation sets consist of 1260 images each. Figure 7 shows some
sample images from the ASL Digits dataset.

Figure 7. Sample images from ASL Digits dataset.

4.1.3. NUS Hand Posture Dataset


The NUS Hand Posture dataset [18] contains 10 classes of postures obtained by altering
the placement and size of the hand in the camera frame. The poses were captured in the
National University of Singapore (NUS), with a variety of backdrops and hands. The poses
were carried out by 40 people of various nationalities and origins. This dataset has ten
classes with letters ranging from A to J. There are 10,000 images in total, divided into 8000 images for training and 1000 images each for testing and validation. Figure 8 shows some sample images from the NUS Hand Posture dataset.

Figure 8. Sample images from NUS Hand Posture dataset.

4.2. Pre-Work in Feature Extractor and Classifier Selection


Prior to selecting VGG16 as the feature extractor, studies were carried out to test
out several potential options, namely MobileNetV2, VGG16, ResNet50V2, InceptionV3,
InceptionResNetV2, DenseNet169, Xception, and NASNetMobile. Hyperparameter set-
tings used in the studies are shown in Table 2. The majority of the models performed at
approximately 99%. Hence, the selection of the best model was based on the duration
required for classification, given that the performance of the models is quite close to each
other, as shown in Table 3. The VGG16 model was picked for feature extraction because of the accuracy achieved and because it required the fewest epochs to complete the overall training.

Table 2. Settings for pre-work investigation.

Hyperparameter | Settings
Input size | 48 × 48
Batch size | 16
Learning rate | 0.0001
Optimizer | Adam
No. of epochs | 100
Early stopping function: patience | 15
Early stopping function: mode | Validation accuracy
Loss function | Sparse categorical crossentropy
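
Assuming a standard Keras training loop, the Table 2 settings could be wired up roughly as follows; the classification head and the build_candidate helper shown here are illustrative placeholders rather than the authors' exact code:

```python
import tensorflow as tf

def build_candidate(backbone_fn, num_classes):
    """Wrap a candidate backbone (e.g., VGG16, MobileNetV2) with a softmax head."""
    base = backbone_fn(include_top=False, weights="imagenet",
                       input_shape=(48, 48, 3), pooling="avg")
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(base.output)
    model = tf.keras.Model(base.input, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Early stopping on validation accuracy with a patience of 15 epochs (Table 2).
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=15)

# Example usage with integer-encoded labels:
# model = build_candidate(tf.keras.applications.VGG16, num_classes=29)
# model.fit(train_images, train_labels, validation_data=(val_images, val_labels),
#           batch_size=16, epochs=100, callbacks=[early_stop])
```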

Table 4 shows the comparison of different ensemble learning methods in classifying the training dataset based on the feature map generated by the VGG16 model. ASL Digits
dataset (denoted as “D2”) and the NUS Hand Posture dataset (denoted as “D3”) constantly
showed a high classification rate regardless of the type of ensemble classifier used. For
the ASL dataset (denoted as “D1”), however, Random Forest proved to have the highest classification rate with its number of estimators set at 100, and its processing time was substantially shorter than that of the XGBoost and LightGBM models. Hence, random forest is picked as the ensemble classifier, combined with VGG16 as the feature extractor.
Table 4 also shows the number of parameters and the feature length for the different
numbers of convolutional blocks. In this case, VGG16 with only four convolution blocks
(the original VGG16 has five convolution blocks) is the most optimized setting for these
experimental datasets from the view of the classification rate and the processing time taken.
The number of parameters of VGG16 is proportional to the number of convolution blocks used. However, with the removal of convolution blocks and their max pooling layers, the feature length is no longer as small as in the full VGG16 model, and hence the processing time increases as the number of blocks is reduced.
A summary of the hyperparameters and the range of values tested is presented in Table 5. Table 6 reports the recognition accuracy for different numbers of estimators and random state values. The number of estimators determines the number of trees in the random forest, while the random state controls the randomness of both the bootstrapping of the samples used when building trees and the sampling of the features considered when looking for the best split. The experimental results show that the highest accuracy is obtained when the number of estimators is set to 100 and the random state is set to 50.
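
The sweep over the Table 5 values could look like the following scikit-learn sketch; the grid and helper function are illustrative, not the authors' exact tuning script:

```python
from sklearn.ensemble import RandomForestClassifier

def sweep_rf(train_feats, train_labels, val_feats, val_labels):
    """Evaluate the estimator counts and random states listed in Table 5,
    leaving max_depth and max_features at their scikit-learn defaults."""
    results = {}
    for n_estimators in (50, 75, 100, 200):
        for random_state in (42, 50, 60):
            rf = RandomForestClassifier(n_estimators=n_estimators,
                                        random_state=random_state)
            rf.fit(train_feats, train_labels)
            results[(n_estimators, random_state)] = rf.score(val_feats, val_labels)
    # Best setting reported in the paper: 100 estimators, random state 50.
    return max(results, key=results.get), results
```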

Table 3. Results of the pre-work for feature extractor selection.

Model | Dataset | Epochs Completed | Accuracy | Average Accuracy
MobileNetV2 | D1 | 91 | 99.95% | 99.93%
MobileNetV2 | D2 | 58 | 99.84% |
MobileNetV2 | D3 | 88 | 100.00% |
VGG16 | D1 | 22 | 100.00% | 99.97%
VGG16 | D2 | 20 | 99.92% |
VGG16 | D3 | 23 | 100.00% |
ResNet50V2 | D1 | 100 | 99.82% | 99.87%
ResNet50V2 | D2 | 61 | 100.00% |
ResNet50V2 | D3 | 83 | 99.80% |
InceptionV3 | D1 | 100 | 96.84% | 84.94%
InceptionV3 | D2 | 100 | 98.89% |
InceptionV3 | D3 | 100 | 59.10% |
InceptionResNetV2 | D1 | 59 | 99.95% | 99.91%
InceptionResNetV2 | D2 | 40 | 99.76% |
InceptionResNetV2 | D3 | 65 | 100.00% |
DenseNet169 | D1 | 100 | 99.20% | 84.99%
DenseNet169 | D2 | 100 | 95.39% |
DenseNet169 | D3 | 100 | 60.40% |
Xception | D1 | 92 | 99.92% | 99.97%
Xception | D2 | 82 | 100.00% |
Xception | D3 | 100 | 100.00% |
NASNetMobile | D1 | 100 | 92.37% | 73.38%
NASNetMobile | D2 | 100 | 88.08% |
NASNetMobile | D3 | 100 | 39.70% |

Table 4. Results of the pre-work for classifier selection.

Conv Blocks | Total Parameters | Feature Length | Dataset | Random Forest: Accuracy % (Time) | XGBoost: Accuracy % (Time) | LightGBM: Accuracy % (Time)
5 conv blocks | 14,716,736 | 512 | D1 | 99.74 (75.1 s) | 99.53 (1365.9 s) | 99.74 (133.3 s)
5 conv blocks | | | D2 | 100 (4.6 s) | 100 (54.1 s) | 100 (24.2 s)
5 conv blocks | | | D3 | 100 (3.5 s) | 100 (35.5 s) | 100 (14.3 s)
4 conv blocks | 7,637,312 | 4608 | D1 | 99.98 (330 s) | 99.98 (9968 s) | 99.99 (2789 s)
4 conv blocks | | | D2 | 100 (17.6 s) | 100 (328.4 s) | 100 (534.7 s)
4 conv blocks | | | D3 | 100 (12.8 s) | 100 (220.9 s) | 100 (241.9 s)
3 conv blocks | 1,736,512 | 9216 | D1 | 99.99 (603.5 s) | 99.95 (25,938 s) | 100 (5375 s)
3 conv blocks | | | D2 | 100 (29.8 s) | 100 (637.5 s) | 100 (1099.7 s)
3 conv blocks | | | D3 | 100 (20.2 s) | 100 (415.2 s) | 100 (509.6 s)
2 conv blocks | 268,672 | 18,432 | D1 | Unable to run due to out of memory (all three classifiers)
2 conv blocks | | | D2 | 100 (53.9 s) | 100 (1544 s) | 100 (2145 s)
2 conv blocks | | | D3 | 100 (38.4 s) | 100 (1078 s) | 100 (933 s)

Table 5. Summary of Optimal Hyperparameter Settings for Ensemble Classifier.

Hyperparameter | Tested Values | Optimal Setting
Number of estimators | 50, 75, 100, 200 | 100
Random state | 42, 50, 60 | 50
Max depth | - | Set to default: none
Max features | - | Set to default: sqrt(number of features)

Table 6. Recognition Accuracy at Different Estimators and Random State Value.

No. of Estimators | Random State | Dataset 1 Accuracy (%) | Dataset 2 Accuracy (%) | Dataset 3 Accuracy (%)
100 | 42 | 99.81 | 100 | 100
100 | 50 | 99.98 | 100 | 100
100 | 60 | 99.82 | 100 | 100
50 | 50 | 99.70 | 100 | 100
75 | 50 | 99.78 | 100 | 100
200 | 50 | 99.89 | 100 | 100

4.3. Experimental Results and Discussion


The evaluation of the proposed model’s performance in comparison with other re-
searchers’ models is summarized in Table 7 based on the datasets. In all three instances,
the proposed model outperforms other methods from an accuracy perspective. Further
investigation shows that the CNN model used to classify the ASL dataset can address the
different illumination conditions of the image. However, the complex and inconsistent
background in the NUS dataset cannot solely be handled by the CNN model.

Table 7. Performance comparison among other models and the proposed lightweight VGG16-
RF model.

Method | ASL Dataset Accuracy (%) | ASL Digits Dataset Accuracy (%) | NUS Hand Posture Dataset Accuracy (%)
Hu's Moment + LBPD + Zernike moments + GFD + PNN [2] | 93.33 | 93.33 | 93.33
CNN [6] | 99.96 | 94.70 | 94.70
CNN [5] | 93.30 | - | -
EDenseNet [9] | 98.50 | 98.50 | 98.50
MobileNet-Random Forest [10] | 98.12 | - | -
CCH + SVM [4] | 90.00 | 90.00 | 90.00
HOG + LBP + SVM [3] | - | 97.80 | 97.80
Gabor + SVM [1] | - | 94.60 | 94.60
Proposed Lightweight VGG16-RF | 99.98 | 100 | 100

Feature extraction is relatively crucial to the accuracy of the hand gesture recognition
model. However, certain features can be harder to extract due to factors such as illumination,
feature selection and the number of features extracted. Each convolution layer extracts
different types of features from the input image; hence, the deeper the convolution layers go, the more features can be extracted. The many convolutional layers of VGG16 can extract low-level features (generic features) up to high-level features (more specific features). However, the associated drawback is the computational resources (computer memory and training time)
required for model training. The lightweight VGG16 as a feature extractor adopts the transfer
learning approach by transferring the weights from the pre-trained VGG16 model, which was
trained on a large-scale dataset (ImageNet). The depth of the proposed model is optimized
based on two criteria, namely accuracy and training time. It was found that at the fourth
convolutional block, the optimum balance of accuracy versus training time is achieved. Hence,
the fifth convolutional block of the original VGG16 model is removed.
Random forest is an ensemble meta-algorithm which consists of many decision trees and is trained through the bagging method. Bagging involves the use of different, randomly selected subsets of features for each tree, which mitigates the issue of overfitting. The proposed classifier establishes its final decision based on the majority of the decisions produced by the individual decision trees; in this case, the features are classified up to 100 times (once per tree). This property of the random forest gives a very high confidence level in classifying the features into the right class.
In the comparison of model performance, the other researchers achieved the highest
accuracy of 99.96% on the ASL dataset but the proposed model outperformed them by
achieving 99.98%. Figures 9–11 show the confusion matrices of the proposed lightweight
VGG16-RF model on the datasets. There were two samples of the alphabet “U” wrongly categorized as the alphabet “S”, probably due to illumination problems (Figure 12). As for the ASL Digits dataset and the NUS Hand Posture dataset, the recorded performance was 100%, whereas the closest performance recorded by other researchers was 99.64% and 98.50%, respectively. The main reason the ASL Digits dataset and the NUS Hand Posture dataset achieved 100% compared to the ASL dataset is the uniformity of the images' illumination.

Figure 9. The confusion matrix of the lightweight VGG16-RF on the ASL dataset.

Figure 10. The confusion matrix of the lightweight VGG16-RF on the ASL Digits dataset.

Figure 11. The confusion matrix of lightweight VGG16-RF on the NUS Hand Posture dataset.

Figure 12. The alphabet “U” and alphabet “S” from the ASL dataset that are wrongly classified.

Classification time is the most crucial factor in real-time applications. The proposed model clocked an average of 0.09 seconds to classify a single image.

5. Conclusions
In this paper, a lightweight VGG16 feature extractor combined with Random Forest as an ensemble classifier is proposed for hand gesture recognition. The emphasis of this model is on the VGG16 layers, where the transfer learning technique is adopted to ensure that underfitting does not happen. As a result, optimized weights for each VGG16 layer are obtained, as the model was pre-trained on a very large dataset (ImageNet), while the random forest classifier grows its ensemble up to 100 trees to ensure classification at its best performance. This is evidenced by the accuracy levels achieved in comparison with the other studies, as the ASL dataset, ASL Digits dataset and NUS Hand Posture dataset achieved 99.98%, 100%, and 100%, respectively.

Author Contributions: Conceptualization, E.L.R.E. and C.P.L.; methodology, E.L.R.E. and C.P.L.;
software, E.L.R.E. and C.P.L.; validation, E.L.R.E. and C.P.L.; formal analysis, E.L.R.E.; investigation,
E.L.R.E.; resources, E.L.R.E.; data curation, E.L.R.E. and C.P.L.; writing—original draft preparation,
E.L.R.E.; writing—review and editing, C.P.L., L.C.K. and K.M.L.; visualization, E.L.R.E.; supervision,
C.P.L. and L.C.K.; project administration, C.P.L.; funding acquisition, C.P.L. All authors have read
and agreed to the published version of the manuscript.
Funding: The research in this work was supported by the Fundamental Research Grant Scheme
of the Ministry of Higher Education under award number FRGS/1/2021/ICT02/MMU/02/4 and
Multimedia University Internal Research Grant with award number MMUI/220021.

Institutional Review Board Statement: Not applicable.


Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

HGR Hand Gesture Recognition


Lightweight VGG16-RF Lightweight VGG16 and Random Forest
CNN Convolutional Neural Network
ASL American Sign Language
ArSL Arabic Sign Language
Hu’s MD Hu’s Moment Descriptor
ZMD Zernike Moments Descriptor
GFD Generic Fourier Descriptor
LBPD Local Binary Pattern Descriptor
LBP Local Binary Pattern
HOG Histogram of Oriented Gradients
SVM Support Vector Machine
RBF Radial Basis Function
CCH Chain Code Histogram
RCT Ridler and Calvard Thresholding
NUS National University of Singapore
SGDM Stochastic Gradient Descent with Momentum
EDenseNet Enhanced Densely Connected Convolutional Neural Network
BN Batch Normalization

References
1. Vishwakarma, D.K. Hand gesture recognition using shape and texture evidences in complex background. In Proceedings of
the 2017 International Conference on Inventive Computing and Informatics (ICICI), Coimbatore, India, 23–24 November 2017;
pp. 278–283.
2. Sadeddine, K.; Djeradi, R.; Chelali, F.Z.; Djeradi, A. Recognition of static hand gesture. In Proceedings of the 2018 6th International
Conference on Multimedia Computing and Systems (ICMCS), Rabat, Morocco, 10–12 May 2018; pp. 1–6.
3. Zhang, F.; Liu, Y.; Zou, C.; Wang, Y. Hand gesture recognition based on HOG-LBP feature. In Proceedings of the 2018 IEEE
International Instrumentation and Measurement Technology Conference (I2MTC), Houston, TX, USA, 14–17 May 2018; pp. 1–6.
4. Gajalakshmi, P.; Sharmila, T.S. Hand gesture recognition by histogram based kernel using density measure. In Proceedings of
the 2019 2nd International Conference on Power and Embedded Drive Control (ICPEDC), Chennai, India, 21–23 August 2019;
pp. 294–298.
5. Gao, Q.; Liu, J.; Ju, Z.; Li, Y.; Zhang, T.; Zhang, L. Static hand gesture recognition with parallel CNNs for space human-robot
interaction. In Proceedings of the International Conference on Intelligent Robotics and Applications, Wuhan, China, 15–18
August 2017; pp. 462–473.
6. Adithya, V.; Rajesh, R. A deep convolutional neural network approach for static hand gesture recognition. Procedia Comput. Sci.
2020, 171, 2353–2361.
7. Bheda, V.; Radpour, D. Using deep convolutional networks for gesture recognition in American sign language. arXiv 2017,
arXiv:1710.06836.
8. Ozcan, T.; Basturk, A. Transfer learning-based convolutional neural networks with heuristic optimization for hand gesture
recognition. Neural Comput. Appl. 2019, 31, 8955–8970. [CrossRef]
9. Tan, Y.S.; Lim, K.M.; Lee, C.P. Hand gesture recognition via enhanced densely connected convolutional neural network. Expert
Syst. Appl. 2021, 175, 114797. [CrossRef]
10. Wang, F.; Hu, R.; Jin, Y. Research on gesture image recognition method based on transfer learning. Procedia Comput. Sci. 2021,
187, 140–145. [CrossRef]
11. Sahoo, J.P.; Prakash, A.J.; Pławiak, P.; Samantray, S. Real-Time Hand Gesture Recognition Using Fine-Tuned Convolutional
Neural Network. Sensors 2022, 22, 706. [CrossRef] [PubMed]
12. Wang, W.; He, M.; Wang, X.; Ma, J.; Song, H. Medical Gesture Recognition Method Based on Improved Lightweight Network.
Appl. Sci. 2022, 12, 6414. [CrossRef]
13. Gadekallu, T.R.; Srivastava, G.; Liyanage, M.; Iyapparaja, M.; Chowdhary, C.L.; Koppu, S.; Maddikunta, P.K.R. Hand gesture
recognition based on a Harris hawks optimized convolution neural network. Comput. Electr. Eng. 2022, 100, 107836. [CrossRef]
14. Li, J.; Li, C.; Han, J.; Shi, Y.; Bian, G.; Zhou, S. Robust Hand Gesture Recognition Using HOG-9ULBP Features and SVM Model.
Electronics 2022, 11, 988. [CrossRef]
15. Huang, G.; Sun, Y.; Liu, Z.; Sedra, D.; Weinberger, K.Q. Deep networks with stochastic depth. In Proceedings of the European
Conference on Computer Vision, Munich, Germany, 8–14 September 2016; pp. 646–661.
16. Zheng, J.; Sun, H.; Wang, X.; Liu, J.; Zhu, C. A Batch-Normalized Deep Neural Networks and its Application in Bearing Fault
Diagnosis. In Proceedings of the 2019 11th International Conference on Intelligent Human-Machine Systems and Cybernetics
(IHMSC), Hangzhou, China, 24–25 August 2019; Volume 1, pp. 121–124. [CrossRef]
17. Barczak, A.; Reyes, N.; Abastillas, M.; Piccio, A.; Susnjak, T. A New 2D Static Hand Gesture Colour Image Dataset for ASL Gestures;
Massey University: Palmerston North, New Zealand, 2011.
18. Pisharady, P.K.; Vadakkepat, P.; Loh, A.P. Attention based detection and recognition of hand postures against complex back-
grounds. Int. J. Comput. Vis. 2013, 101, 403–419. [CrossRef]
