Article
Hand Gesture Recognition via Lightweight VGG16 and
Ensemble Classifier
Edmond Li Ren Ewe 1 , Chin Poo Lee 2, * , Lee Chung Kwek 1 and Kian Ming Lim 2
Abstract: Gesture recognition has been studied for a while within the fields of computer vision and
pattern recognition. A gesture can be defined as a meaningful physical movement of the fingers,
hands, arms, or other parts of the body with the purpose of conveying information for interaction
with the environment. For instance, hand gesture recognition (HGR) can be used to recognize sign
language, which is the primary means of communication by the deaf and mute. Vision-based HGR is
critical in its application; however, there are challenges that need to be overcome, such as variations
in the background, illumination, hand orientation and size, and similarities among gestures. The
traditional machine learning approach has been widely used in vision-based HGR in recent years
but the complexity of its processing has been a major challenge—especially on the handcrafted
feature extraction. The effectiveness of the handcrafted feature extraction technique was not proven
across various datasets in comparison to deep learning techniques. Therefore, a hybrid network
architecture dubbed as Lightweight VGG16 and Random Forest (Lightweight VGG16-RF) is proposed
for vision-based hand gesture recognition. The proposed model adopts feature extraction techniques
via the convolutional neural network (CNN) while using the machine learning method to perform
classification. Experiments were carried out on publicly available datasets such as American Sign
Language (ASL), ASL Digits and NUS Hand Posture dataset. The experimental results demonstrate
that the proposed model, a combination of lightweight VGG16 and random forest, outperforms
other methods.
Keywords: sign language recognition; hand gesture recognition; convolutional neural network
(CNN); ensemble classifier; lightweight VGG16; random forest; transfer learning
Citation: Ewe, E.L.R.; Lee, C.P.; Kwek, L.C.; Lim, K.M. Hand Gesture Recognition via Lightweight VGG16
and Ensemble Classifier. Appl. Sci. 2022, 12, 7643. https://doi.org/10.3390/app12157643
Academic Editor: João M. F. Rodrigues
hand gestures are made up of various forms and hand orientations that do not reflect any
motion information.
The vision-based approach can be divided into two categories: the handcrafted machine
learning approach and the deep learning approach (depicted in Figure 1). The handcrafted
approach, also known as traditional machine learning, defines and extracts its features in a
separate stage before passing them to the machine learning algorithm. Examples of such
pre-defined features include edge detection, corner detection and histograms. The deep
learning approach, on the other hand, needs no manual feature extraction process because the
algorithm itself, such as a CNN, learns which features best classify the images. The major
difference between the two lies in the problem-solving approach. Deep learning techniques
tend to solve the problem end to end, whereas traditional machine learning techniques break
the problem down into parts that are solved separately, and their results are then combined
in the final stage.
Lately, more studies have been carried out to propose a model which can classify
datasets of different conditions such as the illuminations level and complex backgrounds
through CNN. By employing CNN, the hand-crafted feature extraction portion can be
avoided, especially when the dataset comes with complex backgrounds. However, whenever
a CNN is involved, the dataset size is a crucial consideration for classification.
Generally, deep neural networks require a very large amount of training
data to avoid overfitting, whereas traditional machine learning approaches are more robust
due to their hierarchical structure and have a shorter execution time. To achieve better
accuracy, researchers have tried deeper convolutional architectures but have reported
that computation resources such as computer memory are a major stumbling block, as is
the time taken to perform training.
In this paper, a hybrid hand gesture recognition model that combines a CNN for deep
feature extraction with an ensemble classifier is introduced. The performance of a model
heavily depends on how accurately the features are learned and extracted. Feature extraction
via the CNN approach avoids complex manual feature extraction methods, especially those that
must be crafted for each individual dataset. The dataset size and execution time, which have
constantly been a source of worry with regard to CNNs, are addressed by using machine
learning methods for classification. The key contributions of this paper are as follows:
1. A hybrid model using deep learning techniques for feature extraction and an ensemble
classifier for classification (Lightweight VGG16 and Random Forest) is devised for
hand gesture recognition;
2. Reduced burden on the computation resources required for VGG16 feature extraction
through architecture depth optimization;
3. Execution time improvement in the comparison of lightweight VGG16-RF to a full-
fledged deep learning architecture for hand gesture recognition.
Appl. Sci. 2022, 12, 7643 3 of 16
The remainder of this paper is organized as follows: Section 2 reviews the related work
pertaining to hand gesture recognition; Section 3 presents the proposed model; Section 4
covers the datasets used, the experiments carried out and the results recorded; and Section 5
concludes the paper.
2. Related Works
Before the deep learning approach became popular, the hand-crafted approach was the
standard for vision-based image recognition. A hand-crafted approach often
consists of several image pre-processing stages and purpose-built feature extraction
modules. Vishwakarma (2017) [1] proposed a hand gesture recognition method using shape
and texture evidence in complex backgrounds. The National University of Singapore (NUS)
Hand Posture dataset was subjected to segmentation and morphological operations for
image pre-processing. The pre-processing mainly targeted internal noise prior to using
the Gabor filter to retrieve the texture features of the images. A differentiable intensity
profile was created through the Gabor filter and smoothed through the Gaussian filter,
and the intensity information was then fed into the classifier.
Sadeddine et al. (2018) [2] proposed an implementation of hand posture recognition
using several descriptors on three different databases, namely American Sign Language
(ASL), Arabic Sign Language (ArSL) and the NUS Hand Posture dataset. The system
architecture was categorized into three phases, namely hand detection, feature extraction
and classification. Several descriptors such as Hu’s Moment Descriptor (Hu’s MD), Zernike
Moments Descriptor (ZMD), Generic Fourier Descriptor (GFD) and Local Binary Pattern
Descriptor (LBPD) were used to detect the hand posture region. In Hu’s MD, the moment
invariants were computed based on the information provided by both the external shape
and internal edges. For LBPD, the image was divided into several non-overlapping
blocks and a local binary pattern (LBP) histogram was computed for each individual block.
Finally, the LBP histograms were concatenated into a single vector. As for ZMD, a
statistical measure of the pixel distribution around the centre of gravity of the shape was
used to detect the hand in the image and to construct a bounding box around it, eliminating
the unwanted surrounding background.
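The block-wise LBP descriptor described above can be illustrated with a minimal sketch. It is not the implementation used in [2]; the 8-neighbour comparison rule, the 256-bin histogram and the function names `lbp_code` and `lbp_block_histograms` are assumptions chosen for illustration.

```python
def lbp_code(img, r, c):
    """8-neighbour LBP code of pixel (r, c); a neighbour >= centre sets one bit."""
    centre = img[r][c]
    # Clockwise neighbour offsets, starting at the top-left pixel (assumed order).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code


def lbp_block_histograms(img, block):
    """Concatenate 256-bin LBP histograms of non-overlapping block x block tiles."""
    rows, cols = len(img), len(img[0])
    # Codes are computed for interior pixels only; border pixels lack a full neighbourhood.
    codes = [[lbp_code(img, r, c) for c in range(1, cols - 1)]
             for r in range(1, rows - 1)]
    feature = []
    for r0 in range(0, len(codes) - block + 1, block):
        for c0 in range(0, len(codes[0]) - block + 1, block):
            hist = [0] * 256
            for r in range(r0, r0 + block):
                for c in range(c0, c0 + block):
                    hist[codes[r][c]] += 1
            feature.extend(hist)  # one histogram per block, joined into one vector
    return feature
```

For a grayscale image given as a list of rows, `lbp_block_histograms(img, 2)` returns one vector whose length is 256 times the number of blocks.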
Zhang et al. (2018) [3] proposed a hand gesture recognition system based on the
Histogram of Oriented Gradients (HOG) and LBP using the NUS dataset. In the proposed
algorithm, features were extracted separately and in parallel via HOG and LBP, fused into
a single feature vector, and then classified by a Support Vector Machine (SVM). HOG
features were used to acquire the edge and local shape information, while LBP features
were used to extract texture features that are robust to grey-level transformation as well
as rotational change. In the final stage, an SVM with the radial basis function (RBF)
kernel was used to classify the features obtained.
Gajalakshmi and Sharmila (2019) [4] proposed a hand gesture recognition using SVM
with the chain code histogram (CCH) used for feature extraction on the NUS dataset. The
process began with a thresholding process as part of the pre-processing to produce binary
images of hand posture for feature extraction. Ridler and Calvard thresholding (RCT)
was used to segment the region of interest. RCT thresholding worked by considering
the average value of the intensity pixels as an initial threshold and the foreground and
background classes were first separated based on the computed average foreground mean
and background mean. For feature vector extraction, CCH segregated the binary image ac-
cording to cluster-based thresholds into grid blocks and the histogram was then calculated
based on the frequently occurring discrete values.
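The iterative thresholding idea behind RCT can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of [4]: the stopping tolerance `tol`, the convention that brighter pixels are foreground, and the function name are all hypothetical.

```python
def ridler_calvard_threshold(pixels, tol=0.5):
    """Iteratively refine a threshold from the foreground and background means."""
    # Initial threshold: the average intensity of all pixels.
    threshold = sum(pixels) / len(pixels)
    while True:
        foreground = [p for p in pixels if p > threshold]   # assumed: brighter = foreground
        background = [p for p in pixels if p <= threshold]
        if not foreground or not background:
            return threshold  # degenerate image: a single class only
        # New threshold: midpoint of the two class means.
        new_threshold = (sum(foreground) / len(foreground)
                         + sum(background) / len(background)) / 2
        if abs(new_threshold - threshold) < tol:
            return new_threshold
        threshold = new_threshold


def binarize(pixels, threshold):
    """Map each pixel to 1 (foreground) or 0 (background)."""
    return [1 if p > threshold else 0 for p in pixels]
```

The binary output of `binarize` corresponds to the kind of binary hand-posture image that the feature extraction stage then operates on.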
Feature extraction becomes a much more crucial task for good classification,
especially with complex or noisy backgrounds. Researchers therefore started
to adopt deep learning approaches to ease the creation of the feature extraction module.
Gao et al. (2017) [5] proposed a static hand gesture recognition model with parallel CNNs
for space human–robot interaction on the ASL dataset. This experiment proposed a parallel
CNNs method where the network includes two subnetworks, the RGB-CNN subnetwork
and the Depth-CNN subnetwork, which ran in parallel and merged to obtain the result
for the final model. There are seven layers in RGB-CNN and Depth-CNN subnetwork
each. The convolution layers made up the first four layers of the CNN, whereas the fully
connected layers had 144 and 72 neurons, respectively. The prediction probabilities were
created in the SoftMax classification layers at the conclusion of the subnetwork. The RGB-
CNN and Depth-CNN subnetwork achieved an accuracy of 90.3% and 81.4% individually,
but when combined, the CNN network can achieve a test accuracy of 93.3%.
Adithya and Rajesh (2020) [6] proposed a method for the automatic recognition of
hand postures using convolutional neural networks with deep parallel architectures. The
proposed model avoided the need for hand segmentation, which was a very difficult task
in images with cluttered backgrounds. In the proposal, two datasets were used, namely the
National University of Singapore (NUS) dataset and ASL dataset. The images for training
were subjected to three layers of convolutional operation with different filter sizes for
feature extraction with proper zero padding applied in each layer to ensure that the size of
the input and output remained the same. The dimension of the feature map was reduced
through the max pooling layer for each convolutional layer. In this model, stochastic
gradient descent with the momentum (SGDM) optimization function was used as well.
The proposed model achieved an accuracy of 99.96% and 94.7% for the ASL and NUS
dataset, respectively.
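The effect of zero padding and max pooling on the feature map size can be checked with the standard output-size formula, out = floor((n + 2p - k) / s) + 1. The sketch below is only an illustration: the kernel sizes 3, 5 and 7, the 64-pixel input and the 2 x 2 pooling window are assumed values, since the exact filter sizes of [6] are not given here.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def same_padding(kernel):
    """Zero padding that keeps the size unchanged for stride 1 and an odd kernel."""
    return (kernel - 1) // 2


size = 64  # assumed input width, for illustration only
for kernel in (3, 5, 7):  # assumed filter sizes
    # With "same" zero padding the convolution preserves the spatial size ...
    size = conv_output_size(size, kernel, padding=same_padding(kernel))
    # ... and the 2 x 2 max pooling with stride 2 halves it.
    size = conv_output_size(size, kernel=2, stride=2)
```

After the three convolution and pooling stages in this example, the 64-pixel side is reduced to 8 pixels, with all of the reduction coming from the pooling layers.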
Bheda and Radpour (2017) [7] presented a method to classify both letters and digits
in ASL using deep convolutional networks. Three datasets were used in the research: a
self-acquired dataset, ASL Alphabets and ASL Digits. The authors proposed a common CNN
architecture which consisted of three groups of two convolutional layers followed by a
max-pool layer and a dropout layer, connected to two groups of fully connected layers
followed by a dropout layer. The authors noticed that the size of the training data was
critical in ensuring better accuracy at the validation stage. Data augmentation techniques
such as rotation and transformations including flipping were applied to the self-acquired
dataset to increase the sample size, which yielded an improvement of 20% in overall
performance. In addition, the background of each image was removed using a
background-subtraction method to minimize the impact of noise on the overall accuracy.
An accuracy of 82.5% was recorded for ASL Alphabets and 97% for ASL Digits, while the
self-acquired dataset only recorded 67% and 70% for alphabets and digits, respectively.
In the deep learning-based approach, as researchers have started to realize, the size
of the dataset plays a role in determining a good classification rate. Hence, researchers
are now either performing data augmentation to the datasets or importing weights from
a pre-trained model which was trained on a larger dataset. Ozcan and Basturk (2019) [8]
proposed a hand gesture recognition method for digits using a transfer learning-based
CNN structure with heuristic optimization. Two datasets were used in this proposal, the
ASL Digits dataset and ASL dataset. As part of the transfer learning, the datasets were
loaded into the system together with AlexNet, a pre-trained CNN model with eight learnable
layers, of which the first five are convolutional layers and the last three are fully
connected layers. The final three layers of the CNN were modified and optimized
using the Artificial Bee Colony (ABC) algorithm.
Tan et al. (2021) [9] proposed a customized network architecture called Enhanced
Densely Connected Convolutional Neural Network (EDenseNet). In the experiment, the
ASL dataset and NUS Hand Posture dataset were used. The datasets were subjected to nine
data augmentation techniques as a mitigation plan towards the effect of data scarcity. The
proposed model had three dense blocks where each block contained four convolutional
layers and transition layers connected each of the dense blocks. The dense block was
setup with three layers at a growth rate of 24 (amount of feature maps to be produced)
with the filter size of three and within a single dense block, the feature map of preceding
convolutional layers was concatenated and served as input to the succeeding convolutional
layer. As for the transition layer, it was made up of a bottleneck layer of four convolutional
layers, growth rate of 24 with filter size of 3 as well and followed by a pooling layer. Max
pooling was also deployed in the first two transition layers to extract all extreme features
such as curves and edges while average pooling was used in the final transition layer to
only extract and smoothen out features.
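The difference between max pooling and average pooling in the transition layers can be seen on a small feature map. The sketch below assumes a 2 x 2 window with stride 2, which is not stated for EDenseNet here; `pool2x2` and `mean` are hypothetical helper names.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def pool2x2(fmap, reduce_fn):
    """Apply reduce_fn to each non-overlapping 2 x 2 window of a 2D feature map."""
    return [
        [reduce_fn([fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1]])
         for c in range(0, len(fmap[0]) - 1, 2)]
        for r in range(0, len(fmap) - 1, 2)
    ]


feature_map = [
    [1, 2, 0, 0],
    [3, 4, 0, 8],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
max_pooled = pool2x2(feature_map, max)    # keeps the strongest response per window
avg_pooled = pool2x2(feature_map, mean)   # smooths the responses per window
```

In the top-right window, max pooling keeps the isolated peak of 8, whereas average pooling reduces it to 2.0, which illustrates why the former preserves extreme features and the latter smooths them.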
To further optimize image classification, deep learning models are being combined
with machine learning models. Wang et al. (2021) [10]
proposed a gesture image recognition method based on transfer learning called MobileNet-
RF. The proposed model’s structure was a combination of CNN for feature extraction and
machine learning for classification. The structure worked by processing images through
a standard convolution and continued with stacking depth-wise convolution and point-
wise convolution for feature extraction. Batch normalization (BN) and ReLU activation
functions are added for each of the depth-wise and point-wise convolution where BN
accommodates the slow convergence speed of the neural network while ReLU has great
computing advantages which can make the network design more in-depth. The entire
MobileNet has 28 layers as the depth-wise and point-wise convolution are calculated
separately. The first 28 layers of the proposed network are used to extract the gesture image
features and they are then directly input into the random forest model for classification.
Table 1 presents the summary of the related works.
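The saving obtained by replacing a standard convolution with a depth-wise convolution followed by a point-wise convolution, as in MobileNet, can be estimated by counting weights. The sketch below ignores biases and batch normalization parameters, and the layer sizes in the example (3 x 3 kernel, 32 input channels, 64 output channels) are assumed values for illustration.

```python
def standard_conv_weights(kernel, c_in, c_out):
    """Weights of a standard convolution: one k x k x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_weights(kernel, c_in, c_out):
    """Weights of a depth-wise convolution followed by a 1 x 1 point-wise convolution."""
    depthwise = kernel * kernel * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


standard = standard_conv_weights(3, 32, 64)
separable = depthwise_separable_weights(3, 32, 64)
reduction = standard / separable
```

For this example layer the separable form needs 2336 weights instead of 18432, roughly an eightfold reduction.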
Sahoo et al. (2022) [11] proposed a score-level fusion technique between AlexNet
and VGG16 for hand gesture recognition. In the effort of fine-tuning both CNN models,
weights are transferred from the pre-trained model for initialization instead of starting
from scratch. The vector score generated from both fine-tuned CNN models are first
normalized and then fused together using the sum-ruled-based method to form a single
output vector. Through the Massey University (MU) dataset and HUST American Sign
Language (HUST-ASL) dataset, the accuracy of the proposed model was recorded at 90.26%
and 56.18%, respectively.
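Score-level fusion with the sum rule can be sketched as follows. The min-max normalization used here is an assumption, since the normalization scheme of [11] is not specified in this text, and the two score vectors are made-up values standing in for the outputs of the two fine-tuned CNNs.

```python
def min_max_normalize(scores):
    """Rescale a score vector to the range [0, 1] (assumed normalization scheme)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def sum_rule_fusion(scores_a, scores_b):
    """Normalize both score vectors, add them, and return (fused scores, class index)."""
    fused = [a + b for a, b in zip(min_max_normalize(scores_a),
                                   min_max_normalize(scores_b))]
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted


# Hypothetical scores for three classes from the two models.
fused_scores, predicted_class = sum_rule_fusion([1.0, 3.0, 2.0], [0.0, 2.0, 8.0])
```

In this example the first model favours class 1 and the second favours class 2; after fusion the combined evidence selects class 2.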
Wang et al. (2022) [12] proposed an improved lightweight CNN model by adding
adaptive channel attention (ECA module) to an existing MobileNetv2 CNN model called
E-MobileNetv2. The newly added module helped reduce the interference of unrelated
information so as to enhance the model’s feature refining ability, especially in capturing
the cross-channelled interactions. The proposed model also adopts a new activation
function, R6-SELU (instead of ReLU6), for better feature extraction ability and to prevent
the loss of negative-valued feature information. The newly proposed model achieved an
accuracy of 96.82% while reducing the number of parameters by 30%.
Furthering the effort to improve classification of hand gesture, Gadekallu et al. (2022) [13]
proposed a method where Harris hawks optimization (HHO) algorithm was utilized to
fine tune the hyperparameters of the CNN model. The algorithm mimics how a
Harris hawk hunts for prey in nature. The HHO has two phases, an exploration phase
with two stages and an exploitation phase with four stages, which together
locate the optimal solution within a given location. The effectiveness of the algorithm
contributed to the 100% accuracy achieved when tested on a hand gesture dataset from
Kaggle, which consists of non-alphabetical hand gestures such as a fist, palm
and thumb.
Li et al. (2022) [14] proposed a multi-scale and multi-angle algorithm that is robust
against complex backgrounds during feature extraction. Features are
extracted from the complex background using the Gaussian model and the K-means algorithm
and are then passed through HOG and 9ULBP. The fused features are
not only invariant in scale and rotation but also rich in texture information. The proposed
method used SVM for classification to locate the optimal separation between classes and
achieved accuracies of 99.01%, 97.5% and 98.72% when tested on a self-collected dataset,
the NUS dataset and the MU Hand Images ASL dataset, respectively.
provides an illustration of the feature extracted from different convolution layers. It can
be seen that, as the layers go deeper, more specific features are extracted from the image.
However, a good architecture needs to be defined to ensure an appropriate learning capacity
without overfitting.
A layer of batch normalization is added after the 4th convolution block to normalize
the output of the layers and prevent the model overfitting. The batch normalization process
can be applied to any layer of the neural network with the main purpose of having a stable
activation value distribution, which will reduce the internal covariate shift and suppress the
over-fitting problem [16]. With the d-dimensional input, $x = (x^{(1)} \ldots x^{(d)})$, each dimension
is normalized by
$$\hat{x}^{(k)} = \frac{x^{(k)} - E\left[x^{(k)}\right]}{\sqrt{\mathrm{Var}\left[x^{(k)}\right]}} \tag{1}$$
where $x^{(k)}$ represents each particular activation, while
$E\left[x^{(k)}\right] = \frac{1}{m}\sum_{i=1}^{m} x_i^{(k)}$ and
$\mathrm{Var}\left[x^{(k)}\right] = \frac{1}{m}\sum_{i=1}^{m}\left(x_i^{(k)} - E\left[x^{(k)}\right]\right)^2 + \epsilon$
represent the mean and variance, and $\epsilon$ is a constant added for
numerical stability. To ensure that the input value is not limited to a narrow range, the
normalized value is generally multiplied with the scaling amount $\gamma^{(k)}$ and the offset
amount $\beta^{(k)}$:
$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)} \tag{2}$$
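Equations (1) and (2) can be followed step by step in a small pure-Python sketch. It assumes a mini-batch of m samples with d activations each and a stability constant of 1e-5; it illustrates the arithmetic only and is not the layer implementation used in the proposed model.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each dimension k over the batch (Eq. 1), then scale and shift (Eq. 2)."""
    m, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(m)]
    for k in range(d):
        column = [sample[k] for sample in batch]
        mean = sum(column) / m                                 # E[x^(k)]
        var = sum((x - mean) ** 2 for x in column) / m + eps   # Var[x^(k)], eps included
        for i in range(m):
            x_hat = (column[i] - mean) / math.sqrt(var)        # Eq. (1)
            out[i][k] = gamma[k] * x_hat + beta[k]             # Eq. (2)
    return out


# Two samples with two activations each; gamma and beta are assumed values.
normalized = batch_norm([[1.0, 10.0], [3.0, 10.0]], gamma=[1.0, 2.0], beta=[0.0, 5.0])
```

The first dimension is mapped to approximately -1 and +1, while the constant second dimension collapses to its offset of 5.0, since its normalized value is zero.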
Conventionally, a typical VGG16 model will have three fully connected layers at the
end of the VGG16 architecture for classification, but there is none for this case. The features
extracted from the convolution layers are prepared and fed into the ensemble classifier.
Figure 3. The model architecture of (left) the original VGG16 and (right) the proposed
lightweight VGG16.
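A rough count of learnable parameters shows how much is saved by dropping the fifth convolutional block and the fully connected layers. The sketch below uses the standard VGG16 layer configuration with 3 x 3 kernels and assumes the original 224 x 224 ImageNet input for the fully connected layers; the batch normalization layer of the lightweight model is not counted.

```python
# (input channels, output channels) of each 3 x 3 convolution in standard VGG16.
VGG16_BLOCKS = [
    [(3, 64), (64, 64)],
    [(64, 128), (128, 128)],
    [(128, 256), (256, 256), (256, 256)],
    [(256, 512), (512, 512), (512, 512)],
    [(512, 512), (512, 512), (512, 512)],
]
# Fully connected layers of the original classifier head (224 x 224 input assumed).
VGG16_DENSE = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]


def conv_params(c_in, c_out, kernel=3):
    """Weights plus biases of one convolution layer."""
    return kernel * kernel * c_in * c_out + c_out


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def block_params(blocks):
    """Total parameters of the given convolutional blocks."""
    return sum(conv_params(c_in, c_out) for block in blocks for c_in, c_out in block)


full_vgg16 = block_params(VGG16_BLOCKS) + sum(dense_params(i, o) for i, o in VGG16_DENSE)
lightweight = block_params(VGG16_BLOCKS[:4])  # blocks 1 to 4 only, no dense layers
```

Under these assumptions the original network has about 138.4 million parameters, whereas the four retained convolutional blocks account for about 7.6 million.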
4. Experiments
4.1. Datasets
Three datasets were tested in this experiment: ASL dataset; ASL Digits dataset; and
NUS Hand Posture dataset.
divided into 8000 images for training and 1000 images for test and validation. Figure 8
shows some sample images from the NUS Hand Posture dataset.
| Hyperparameter | Settings |
|---|---|
| Input size | 48 × 48 |
| Batch size | 16 |
| Learning rate | 0.0001 |
| Optimizer | Adam |
| No. of epochs | 100 |
| Early stopping function: patience | 15 |
| Early stopping function: mode | Validation accuracy |
| Loss function | Sparse categorical crossentropy |
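The interaction between the epoch limit and the early stopping settings listed above can be illustrated with a small simulation. The function below is a hypothetical sketch of the stopping rule (stop once the validation accuracy has not improved for `patience` consecutive epochs); it takes a pre-computed list of validation accuracies instead of training a real network.

```python
def epochs_until_stop(val_accuracies, max_epochs=100, patience=15):
    """Return (epochs completed, best validation accuracy) under early stopping."""
    best_acc, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies[:max_epochs], start=1):
        if acc > best_acc:
            # Improvement: remember it and reset the waiting period.
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop training.
            return epoch, best_acc
    return min(len(val_accuracies), max_epochs), best_acc


# Simulated run: accuracy peaks at epoch 2 and then stays flat.
simulated = [0.50, 0.60] + [0.60] * 30
stopped_at, best = epochs_until_stop(simulated)
```

In this simulated run training stops at epoch 17, that is, 15 epochs after the last improvement, which is why some models complete far fewer than the maximum of 100 epochs.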
in the random forest and is also indirectly determining the number of features to consider
when looking for the best split (branching to a new tree) while the random state value
controls the randomness of the bootstrapping of the samples used when building
trees. The experimental results show that the highest accuracy is obtained when the number of
estimators is set to 100 and the random state is set to 50.
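| Model | Dataset | Epochs Completed | Accuracy | Average Accuracy |
|---|---|---|---|---|
| MobileNetV2 | D1 | 91 | 99.95% | 99.93% |
| | D2 | 58 | 99.84% | |
| | D3 | 88 | 100.00% | |
| VGG16 | D1 | 22 | 100.00% | 99.97% |
| | D2 | 20 | 99.92% | |
| | D3 | 23 | 100.00% | |
| ResNet50V2 | D1 | 100 | 99.82% | 99.87% |
| | D2 | 61 | 100.00% | |
| | D3 | 83 | 99.80% | |
| InceptionV3 | D1 | 100 | 96.84% | 84.94% |
| | D2 | 100 | 98.89% | |
| | D3 | 100 | 59.10% | |
| InceptionResNetV2 | D1 | 59 | 99.95% | 99.91% |
| | D2 | 40 | 99.76% | |
| | D3 | 65 | 100.00% | |
| DenseNet169 | D1 | 100 | 99.20% | 84.99% |
| | D2 | 100 | 95.39% | |
| | D3 | 100 | 60.40% | |
| Xception | D1 | 92 | 99.92% | 99.97% |
| | D2 | 82 | 100.00% | |
| | D3 | 100 | 100.00% | |
| NASNetMobile | D1 | 100 | 92.37% | 73.38% |
| | D2 | 100 | 88.08% | |
| | D3 | 100 | 39.70% | |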
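| No. of Estimators | Random State | Accuracy (%), Dataset 1 | Accuracy (%), Dataset 2 | Accuracy (%), Dataset 3 |
|---|---|---|---|---|
| 100 | 42 | 99.81 | 100 | 100 |
| 100 | 50 | 99.98 | 100 | 100 |
| 100 | 60 | 99.82 | 100 | 100 |
| 50 | 50 | 99.70 | 100 | 100 |
| 75 | 50 | 99.78 | 100 | 100 |
| 200 | 50 | 99.89 | 100 | 100 |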
Table 7. Performance comparison among other models and the proposed lightweight VGG16-RF model.

| Method | Accuracy (%), ASL Dataset | Accuracy (%), ASL Digits Dataset | Accuracy (%), NUS Hand Posture Dataset |
|---|---|---|---|
| Hu’s Moment + LBPD + Zernike moments + GFD + PNN [2] | 93.33 | 93.33 | 93.33 |
| CNN [6] | 99.96 | 94.70 | 94.70 |
| CNN [5] | 93.30 | - | - |
| EDenseNet [9] | 98.50 | 98.50 | 98.50 |
| MobileNet-Random Forest [10] | 98.12 | - | - |
| CCH + SVM [4] | 90.00 | 90.00 | 90.00 |
| HOG + LBP + SVM [3] | - | 97.80 | 97.80 |
| Gabor + SVM [1] | - | 94.60 | 94.60 |
| Proposed Lightweight VGG16-RF | 99.98 | 100 | 100 |
Feature extraction is crucial to the accuracy of the hand gesture recognition
model. However, certain features can be harder to extract due to factors such as illumination,
feature selection and the number of features extracted. Each convolution layer extracts
different types of features from the input image, and deeper layers extract increasingly
specific features. The many convolutional layers of VGG16 extract features ranging from
low-level (generic) to high-level (more specific). The drawback is the computational
resources (computer memory and training time) required for model training. The lightweight
VGG16 as a feature extractor adopts the transfer learning approach by transferring the
weights from the pre-trained VGG16 model, which was trained on a large-scale dataset
(ImageNet). The depth of the proposed model is optimized based on two criteria, namely
accuracy and training time. It was found that the optimum balance of accuracy versus
training time is achieved at the fourth convolutional block. Hence,
the fifth convolutional block of the original VGG16 model is removed.
Random forest is an ensemble meta-algorithm which consists of many decision trees
and is trained through the bagging method. Bagging trains each tree on a random bootstrap
sample of the training data, and random forest additionally considers a randomly selected
subset of features at each split, which mitigates overfitting.
The proposed classifier establishes its final decision based on the majority of
the decisions produced by the individual decision trees; in this case, each feature vector
is classified up to 100 times, once by each tree. This aggregation gives the random
forest a very high confidence level in classifying the features into the right class.
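The majority decision over the individual trees can be sketched in a few lines. The votes below are made-up labels standing in for the outputs of 100 trees; note that some library implementations (for example scikit-learn's random forest) average the per-tree class probabilities instead of counting hard votes, so this is an illustration of the principle and not of a specific library.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def vote_share(tree_predictions, label):
    """Fraction of trees that voted for the given label."""
    return tree_predictions.count(label) / len(tree_predictions)


# Hypothetical votes of 100 trees for one feature vector.
votes = ["U"] * 60 + ["S"] * 40
decision = majority_vote(votes)
confidence = vote_share(votes, decision)
```

Here 60 of the 100 trees vote for "U", so the forest outputs "U" with a vote share of 0.6.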
In the comparison of model performance, the highest accuracy achieved by other researchers
on the ASL dataset was 99.96%, which the proposed model outperformed by
achieving 99.98%. Figures 9–11 show the confusion matrices of the proposed lightweight
VGG16-RF model on the datasets. Two samples of the letter “U” were wrongly categorized as
the letter “S”, probably due to illumination problems (Figure 12). On the
ASL Digits dataset and NUS Hand Posture dataset, the recorded performance was 100%,
whereas the closest performance recorded by other researchers was 99.64% and 98.50%,
respectively. The main reason the ASL Digits dataset and NUS Hand Posture dataset
achieved 100% while the ASL dataset did not is the uniformity of their image illumination.
Figure 9. The confusion matrix of the lightweight VGG16-RF on the ASL dataset.
Figure 10. The confusion matrix of the lightweight VGG16-RF on the ASL Digits dataset.
Figure 11. The confusion matrix of lightweight VGG16-RF on the NUS Hand Posture dataset.
Figure 12. The alphabet “U” and alphabet “S” from the ASL dataset that are wrongly classified.
Classification time is a crucial factor in real-time applications.
The proposed model took an average of 0.09 s to classify a single image.
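An average per-image classification time of this kind can be measured with a simple timing loop. The sketch below is an assumption about how such a figure might be obtained, not the authors' measurement procedure; `classify` stands for any callable that maps one image to a label, and the lambda used in the example is a placeholder.

```python
import time


def average_classification_time(classify, images):
    """Average wall-clock seconds spent classifying one image."""
    start = time.perf_counter()
    for image in images:
        classify(image)
    return (time.perf_counter() - start) / len(images)


# Placeholder classifier and inputs, for illustration only.
seconds_per_image = average_classification_time(lambda image: 0, [[0] * 48] * 10)
```

In practice `classify` would wrap both the feature extraction and the random forest prediction, so that the measured time covers the whole pipeline.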
5. Conclusions
In this paper, a lightweight VGG16 feature extractor combined with Random Forest as an
ensemble classifier is proposed for hand gesture recognition. The emphasis of this model is
on the VGG16 layers, where the transfer learning technique is adopted to ensure that
underfitting does not happen. The weights of each VGG16 layer are already optimized
because the model was pre-trained on a very large dataset (ImageNet), while the random
forest classifier grows up to 100 trees to ensure the best classification
performance. This is evidenced by the accuracy achieved in comparison with
other studies: 99.98%, 100% and 100% on the ASL dataset, ASL Digits dataset and NUS Hand
Posture dataset, respectively.
Author Contributions: Conceptualization, E.L.R.E. and C.P.L.; methodology, E.L.R.E. and C.P.L.;
software, E.L.R.E. and C.P.L.; validation, E.L.R.E. and C.P.L.; formal analysis, E.L.R.E.; investigation,
E.L.R.E.; resources, E.L.R.E.; data curation, E.L.R.E. and C.P.L.; writing—original draft preparation,
E.L.R.E.; writing—review and editing, C.P.L., L.C.K. and K.M.L.; visualization, E.L.R.E.; supervision,
C.P.L. and L.C.K.; project administration, C.P.L.; funding acquisition, C.P.L. All authors have read
and agreed to the published version of the manuscript.
Funding: The research in this work was supported by the Fundamental Research Grant Scheme
of the Ministry of Higher Education under award number FRGS/1/2021/ICT02/MMU/02/4 and
Multimedia University Internal Research Grant with award number MMUI/220021.
Abbreviations
The following abbreviations are used in this manuscript:
References
1. Vishwakarma, D.K. Hand gesture recognition using shape and texture evidences in complex background. In Proceedings of
the 2017 International Conference on Inventive Computing and Informatics (ICICI), Coimbatore, India, 23–24 November 2017;
pp. 278–283.
2. Sadeddine, K.; Djeradi, R.; Chelali, F.Z.; Djeradi, A. Recognition of static hand gesture. In Proceedings of the 2018 6th International
Conference on Multimedia Computing and Systems (ICMCS), Rabat, Morocco, 10–12 May 2018; pp. 1–6.
3. Zhang, F.; Liu, Y.; Zou, C.; Wang, Y. Hand gesture recognition based on HOG-LBP feature. In Proceedings of the 2018 IEEE
International Instrumentation and Measurement Technology Conference (I2MTC), Houston, TX, USA, 14–17 May 2018; pp. 1–6.
4. Gajalakshmi, P.; Sharmila, T.S. Hand gesture recognition by histogram based kernel using density measure. In Proceedings of
the 2019 2nd International Conference on Power and Embedded Drive Control (ICPEDC), Chennai, India, 21–23 August 2019;
pp. 294–298.
5. Gao, Q.; Liu, J.; Ju, Z.; Li, Y.; Zhang, T.; Zhang, L. Static hand gesture recognition with parallel CNNs for space human-robot
interaction. In Proceedings of the International Conference on Intelligent Robotics and Applications, Wuhan, China, 15–18
August 2017; pp. 462–473.
6. Adithya, V.; Rajesh, R. A deep convolutional neural network approach for static hand gesture recognition. Procedia Comput. Sci.
2020, 171, 2353–2361.
7. Bheda, V.; Radpour, D. Using deep convolutional networks for gesture recognition in American sign language. arXiv 2017,
arXiv:1710.06836.
8. Ozcan, T.; Basturk, A. Transfer learning-based convolutional neural networks with heuristic optimization for hand gesture
recognition. Neural Comput. Appl. 2019, 31, 8955–8970.
9. Tan, Y.S.; Lim, K.M.; Lee, C.P. Hand gesture recognition via enhanced densely connected convolutional neural network. Expert
Syst. Appl. 2021, 175, 114797.
10. Wang, F.; Hu, R.; Jin, Y. Research on gesture image recognition method based on transfer learning. Procedia Comput. Sci. 2021,
187, 140–145.
11. Sahoo, J.P.; Prakash, A.J.; Pławiak, P.; Samantray, S. Real-Time Hand Gesture Recognition Using Fine-Tuned Convolutional
Neural Network. Sensors 2022, 22, 706.
12. Wang, W.; He, M.; Wang, X.; Ma, J.; Song, H. Medical Gesture Recognition Method Based on Improved Lightweight Network.
Appl. Sci. 2022, 12, 6414.
13. Gadekallu, T.R.; Srivastava, G.; Liyanage, M.; Iyapparaja, M.; Chowdhary, C.L.; Koppu, S.; Maddikunta, P.K.R. Hand gesture
recognition based on a Harris hawks optimized convolution neural network. Comput. Electr. Eng. 2022, 100, 107836.
14. Li, J.; Li, C.; Han, J.; Shi, Y.; Bian, G.; Zhou, S. Robust Hand Gesture Recognition Using HOG-9ULBP Features and SVM Model.
Electronics 2022, 11, 988.
15. Huang, G.; Sun, Y.; Liu, Z.; Sedra, D.; Weinberger, K.Q. Deep networks with stochastic depth. In Proceedings of the European
Conference on Computer Vision, Munich, Germany, 8–14 September 2016; pp. 646–661.
16. Zheng, J.; Sun, H.; Wang, X.; Liu, J.; Zhu, C. A Batch-Normalized Deep Neural Networks and its Application in Bearing Fault
Diagnosis. In Proceedings of the 2019 11th International Conference on Intelligent Human-Machine Systems and Cybernetics
(IHMSC), Hangzhou, China, 24–25 August 2019; Volume 1, pp. 121–124.
17. Barczak, A.; Reyes, N.; Abastillas, M.; Piccio, A.; Susnjak, T. A New 2D Static Hand Gesture Colour Image Dataset for ASL Gestures;
Massey University: Palmerston North, New Zealand, 2011.
18. Pisharady, P.K.; Vadakkepat, P.; Loh, A.P. Attention based detection and recognition of hand postures against complex
backgrounds. Int. J. Comput. Vis. 2013, 101, 403–419.