Development and Experimental Evaluation of Machine Learning Techniques For An Intelligent Hairy Scalp Detection System

See discussions, stats, and author profiles for this publication at: https://www.researchgate.
net/publication/325264930
Development and Experimental Evaluation of Machine Learning Techniques

for an Intelligent Hairy Scalp Detection System
Article in Applied Sciences · May 2018

DOI: 10.3390/app8060853
CITATIONS READS
15 346
3 authors:
Wei-Chien Wang Liang-Bi Chen

The University of Sydney National Penghu University
6 PUBLICATIONS 78 CITATIONS 142 PUBLICATIONS 492 CITATIONS
SEE PROFILE SEE PROFILE
Wan-Jung Chang
Southern Taiwan University of Science and Technology
67 PUBLICATIONS 409 CITATIONS
SEE PROFILE
Some of the authors of this publication are also working on these related projects:
A Framework to Security-Enabled and AIoT-Based System for Intelligent Aquaculture View project
Design and Implementation of Intelligent Assistive Systems and Devices for Visually Impaired People Walking Safety View project
All content following this page was uploaded by Liang-Bi Chen on 23 May 2018.
The user has requested enhancement of the downloaded file.

applied
sciences
Article
Development and Experimental Evaluation of
Machine-Learning Techniques for an Intelligent
Hairy Scalp Detection System
Wei-Chien Wang 1 , Liang-Bi Chen 2, * ID
and Wan-Jung Chang 2, * ID
1 Science and Engineering Faculty, Queensland University of Technology, Brisbane 4000, Australia;
cchain.5@gmail.com
2 Department of Electronic Engineering, Southern Taiwan University of Science and Technology,
Tainan 71005, Taiwan
* Correspondence: liangbi.chen@gmail.com (L.-B.C.); allenchang@stust.edu.tw (W.-J.C.)

Received: 14 May 2018; Accepted: 21 May 2018; Published: 23 May 2018
Featured Application: Deep learning, decision tree, linear discriminant analysis (LDA), support
vector machines (SVMs), k-nearest neighbors algorithm (K-NN), and ensemble learning are
evaluated for detecting hairy scalp problems. To the best of our knowledge, we are the first
case study to apply modern machine learning to the diagnosis and analysis of hairy scalp issues.
Abstract: Deep learning has become the most popular research subject in the fields of artificial
intelligence (AI) and machine learning. In October 2013, MIT Technology Review commented that
deep learning was a breakthrough technology. Deep learning has made progress in voice and
image recognition, image classification, and natural language processing. Prior to deep learning,
decision tree, linear discriminant analysis (LDA), support vector machines (SVM), k-nearest neighbors
algorithm (K-NN), and ensemble learning were popular in solving classification problems. In this
paper, we applied the previously mentioned and deep learning techniques to hairy scalp images.
Hairy scalp problems are usually diagnosed by non-professionals in hair salons, and people with
such problems may be advised by these non-professionals. Additionally, several common scalp
problems are similar; therefore, non-experts may provide incorrect diagnoses. Hence, scalp problems
have worsened. In this work, we implemented and compared the deep-learning method, the
ImageNet-VGG-f model Bag of Words (BOW), with machine-learning classifiers, and histogram of
oriented gradients (HOG)/pyramid histogram of oriented gradients (PHOG) with machine-learning
classifiers. The tools from the classification learner apps were used for hairy scalp image classification.
The results indicated that deep learning can achieve an accuracy of 89.77% when the learning rate is
1 × 10−4 , and this accuracy is far higher than those achieved by BOW with SVM (80.50%) and PHOG
with SVM (53.0%).
Keywords: deep learning; machine learning; support vector machine (SVM); images classification;
images recognition; hairy scalp diagnosis and analysis
1. Introduction
In recent years, machine-learning techniques have been widely used in computer vision, image
recognition, stock market analysis, medical diagnosis, natural language processing, voice/speech
recognition, etc. Machine learning is an aspect of artificial intelligence (AI) that represents another
widely used term for AI. Therefore, AI research has shifted from reasoning as the most vital aspect to
knowledge and then learning as the most important aspects, which represents a natural and distinct
Appl. Sci. 2018, 8, 853; doi:10.3390/app8060853 www.mdpi.com/journal/applsci

Appl. Sci. 2018, 8, 853 2 of 28
sequence of events. Machine learning has become a methodology of realizing AI, and it also represents
a method of resolving issues associated with AI. Historically, huge quantities of data had to be
analyzed to accomplish many different tasks, and such analyses represented a methodology used for
policy making and modeling; however, machine learning provides a more effective and productive
replacement methodology for acquiring knowledge.
Machine learning can progressively improve forecast modeling functions and capabilities and
utilize relevant data to assist with policy formulation. Therefore, this approach has become an
increasingly important tool in computer science-related research and has played a progressively vital
role in people’s daily lives. Machine learning has provided more advanced forms of junk email filters,
better software for voice/speech recognition, and reliable internet search engines, among other benefits.
Deep learning has attracted increasing attention in academia and industry and represents the most
popular research direction in the fields of AI and machine learning. Even MIT Technology Review
commented in October 2013 that deep learning was a breakthrough technology [1]. Deep learning has
made progress in voice and image recognition, image classification, and natural language processing.
Prior to deep learning, support vector machines (SVM) represented a popular approach to solving
classification problems.
In this paper, we use the diagnosis and analysis of hairy scalps as a case study for machine-learning
techniques. In this case study, we implemented, evaluated, and compared the deep learning method
with the ImageNet-VGG-f model Bag of Words (BOW), with machine learning classifiers, and histogram
of oriented gradients (HOG)/pyramid histogram of oriented gradients (PHOG) with machine-learning
classifiers. We selected scalp state detection as our case study because people have many hairy scalp
lesions caused by work pressure, long working hours, and a lack of scalp care, among other reasons.
In the hair salon industry, dermatology clinics and medical clinics, human methods of hairy
scalp state detection are frequently employed. The professional education and training costs in
these industries are very high. Moreover, the accuracy of hairy scalp state recognition varies among
individuals and does not follow a set of criteria.
In this paper, we test whether machine learning technology can be applied to hairy scalp detection.
A machine learning-based hairy scalp detector can automatically recognize the state of the hairy scalp.
Moreover, machine learning can continue to train learning algorithms to increase the accuracy of scalp
detection. We believe that machine learning-based AI image processing methods should be able to
effectively solve the aforementioned hairy scalp detection problem. By installing machine-learning
technology into a scalp state detector, the use of human-based assessments and the resulting errors can
be reduced. To the best of our knowledge, this is the first study to apply modern machine-learning
techniques to the diagnosis and analysis of hairy scalp problems.
The remainder of this paper is organized as follows. Section 2 introduces the preliminary
assessment of machine-learning techniques. Section 3 contains a review of previous works on the
design and development of image classification and recognition using machine-learning techniques.
Section 4 presents the machine-learning techniques evaluated for diagnosing and analyzing hairy
scalps. Section 5 describes the experimental results and provides a related discussion. Section 6
presents our conclusions and plans for future work.
2. Preliminaries
Figure 1 describes the principles [2] of building a machine-learning system. Machine-learning
technology as a predictive modeling system can be divided into four parts: preprocessing, learning,
evaluation, and prediction. Generally, the raw data and format of images cannot be directly used for
calculations because such data could contain a considerable amount of useless information that may
have a negative impact on the performance of the learning algorithms. Hence, the preprocessing of
raw data plays an import role within every application of machine learning systems. In this work, the
raw data are scalp images, and we try to capture the key features.
Appl. Sci. 2018, 8, x 3 of 27
datasets are usually automatically divided into two parts (training and testing datasets). The training
Appl. Sci.dataset
2018, 8, 853
is for training and optimizing the models, and the testing dataset is for evaluating the 3 of 28
efficiency of the model.
Appl. Sci. 2018, 8, x 3 of 27
datasets are usually automatically divided into two parts (training and testing datasets). The training
dataset is for training and optimizing the models, and the testing dataset is for evaluating the
efficiency of the model.
Figure 1. Typical machine-learning system [2].

Figure 1. Typical machine-learning system [2].
Each machine-learning algorithm has its own advantages for solving specific problems, and one
The key features
learning algorithm arecannot
usually the color,
manage luminance,
all problems. Hence,and to density
train andofdiscover
hair. Features
the best are selected to
models,
meet particular restrictions
different learning so thatcould
algorithms theybe perform
appliedwell to thewhen
same using
dataset.machine-learning
Comparisons of thealgorithms.
training dataA good
results for each
machine-learning learning algorithm
algorithm predictscan good be used to identify
results in both thetraining
most suitable
and learning model. The datasets
new datasets.
In the 1980s, multilayer perceptron (MLP) was a very popular machine learning technique,
are usually automatically divided into two parts (training and testing datasets). The training dataset
especially for speech recognition and image recognition. However, since the 1990s, MLP has
is for training and optimizing the models, and the testing dataset is for evaluating the efficiency of
encountered strong competition from the simpler support vector machines (SVM). Recently, due to
the model.
the success of deep learning, Figure
MLP 1. Typical machine-learning
has regained attention. Assystem
we [2].
know, modern deep-learning
Each machine-learning
techniques were based on algorithm has its own advantages for solving specific problems, and one
the MLP structure.
learning Each MLPmachine-learning
algorithm cannot
includes least algorithm
at manage one all
hidden has
problems. its own
layer (in advantages
Hence,
additionto to
train for
one andsolving
layerspecific
discover
input and problems,
theone
best models,
output anddifferent
layer). one
learning
Compared algorithm
to the cannot
single-layer manage all
perceptron problems.
(SLP), which Hence,
can to
only train
learn and
linear
learning algorithms could be applied to the same dataset. Comparisons of the training data results for discover
functions, the
MLP best
can models,
learn
non-linear functions. Figure 2could
showsbe
eachdifferent
learning learning
algorithm algorithms
can be used toaidentify
simple
applied structure
to
thethe of an MLP
same
most withComparisons
dataset.
suitable a hidden layer.ofPlease
learning note thatdata
model. the training
all connections
results are weighted,
for each learning but only
algorithm canthree
be usedweights (w0, w1,the
to identify andmost
w2) are marked
suitable in Figuremodel.
learning 2. A simple
In the 1980s, multilayer perceptron (MLP) was a very popular machine learning technique,
structure of an MLP
In the 1980s, is introduced
multilayer as follows.
perceptron (MLP) was a very popular machine learning technique,
encountered strong competition from the simpler support vector machines (SVM). Recently, due
encountered strong competition from the simpler support vector machines (SVM). Recently, due to
to the
thesuccess
successofofdeep deep learning,
learning, MLP MLP has has regained
regainedattention.
attention.AsAsweweknow, know,modern
modern deep-learning
deep-learning
techniques were based on the MLP
techniques were based on the MLP structure. structure.
MLP MLPincludes
includesat at
least
leastoneonehidden
hiddenlayer layer (in(in addition
addition to toone
oneinput
inputlayer
layer andandoneone output
output layer).
layer).
Compared
Compared to the single-layer
to the single-layer perceptron
perceptron(SLP),(SLP), which
which cancan only
onlylearn
learnlinear
linearfunctions,
functions, MLPMLP cancan learn
learn
non-linear
non-linearfunctions.
functions. Figure
Figure22shows
shows aasimple
simplestructure
structure of MLP
of an an MLP withwith a hidden
a hidden layer. note
layer. Please Pleasethatnote
that all
allconnections
connections areare
weighted,
weighted, but only
but three
only weights (w0, w1,(w0,
three weights and w2)
w1,areand marked
w2) are in Figure
marked 2. Ainsimple
Figure 2.
structure of an MLP is introduced as
A simple structure of an MLP is introduced as follows. follows.
Figure 2. A simple structure of a multilayer perceptron (MLP) with a hidden layer.
2. A
Figure
Figure 2. simple structure
A simple structureofofaamultilayer perceptron(MLP)
multilayer perceptron (MLP)with
with a hidden
a hidden layer.
layer.
Appl. Sci. 2018, 8, 853 4 of 28
Input layer: The input layer has three nodes. The offset node value is 1. The other two nodes take
external inputs from X1 and X2 (all are digital values based on the input data set). As discussed above,
no calculation is performed on the input layer; thus, the output of the input layer node is 1, and X1
and X2 are passed to the hidden layer.
Hidden layer: The hidden layer also has three nodes. The offset node output is 1. The output of
the other two nodes of the hidden layer depends on the output of the input layer (1, X1, and X2) and
the weight attached to the connection (boundary). Figure 2 shows the calculation of an output in a
hidden layer (highlighted). The output calculations of the other hidden nodes are the same. Please
note that “f” refers to the activation function. These outputs are passed to the nodes of the output layer.
Output layer: The output layer has two nodes, receives input from the hidden layer, and performs
calculations similar to the highlighted hidden layer. These calculated values (Y1 and Y2), which are
the result of the calculation, are the outputs of the MLP.
As a result, given a set of features X = (x1, x2, . . . ) and a goal Y, an MLP can learn the relationship
between features and goals for the purpose of classification or regression.
To assess the effectiveness of a given model, the accuracy of different models is compared. In
this work, we have applied several popular machine-learning algorithms to analyze and diagnose
hairy scalp images. First, we adopt a manual method of classifying hairy scalps and comparing the
classification of the model accuracy to determine the quality of recognition.
Consequently, test datasets can be used to test the proposed models. Then, suitable models can
be selected based on the training dataset. The performance of the test dataset compared with that of
an untested dataset must be determined along with the error rate and performance. Subsequently,
optimal models can be applied to predict new and future data.
3. Related Works
Machine learning-based techniques are widely discussed, studied and applied for image
classification, image recognition, and object detection in many fields [3–21]. The related application
cases of machine learning-based image detection and classification are introduced as follows.
For traffic applications [3–8,19], Lousier and Abdelkrim [3] proposed a bag of features
(Bove)-based machine learning framework for image classification, and this assessed the performance
of training models using different image classification algorithms on the Caltech 101 images [4].
These authors also adopted the proposed BoF-based machine-learning framework to identify stop
sign images for applying the trained classifier in a robotic system. Ahmed et al. [5] presented text
recognition that adopted convolutional neural networks (CNNs) as their deep-learning classifier for
detecting and recognizing Arabic text. The error rate of their proposed recognition methodology was
15% using cursive script scene data.
Jagannathan et al. [6] implemented an embedded system-based object detection and classification
mechanism that adopted a commercial SoC solution. The main components of this commercial
SoC solution are fixed/floating-point dual DSP cores, a fully programmable VisionAccelerationPac
(EVE), dual ARM-based Cortex M4 cores, and an image signal processor. In this work, the authors
combined two methodologies, an AdaBoost cascade classifier with 10 HOG features for object detection
and a 7-layer CNN classifier for objects classification. The implemented methodologies can achieve
accuracies of 74.6% for pedestrian object detection, 79.4% for vehicle object detection, 79.6% for traffic
sign object detection, and 89.6% for traffic sign objects classification.
Du et al. [7] proposed an end-to-end deep learning model that adopted the Convolutional
Neural Network-Long Short-Term Memory (CNN-LSTM) architecture to predict real-time vehicular
ego-motion for an autonomous driving system. The error rate of their proposed CNN-LSTM-based
model for multiple ego-motion classification was 0.0417. Pop et al. [8] proposed different cross-modality
CNN-based deep-learning training approaches for pedestrian recognition, and they consisted of a
correlated model, an incremental model, and a particular cross-modality model.
Appl. Sci. 2018, 8, 853 5 of 28
For hyperspectral image classification applications [9–14], Ermushev and Balashov [9] developed
a complex machine-learning technique for target detection from ground radar images called CTDCM
(complex ground radar target detection and classification method), which was based on a three-step
analysis that included a learning procedure, data pre-processing examination, and target-labeling step
coupled with classification. The classification error rate of their proposed CTDCM was 12%.
Zhao and Du [10] proposed a spectral-spatial feature-based classification framework that
integrated deep-learning techniques with dimension reductions for hyperspectral image classification.
This work overcomes the insufficient discovery of objects that present considerable variations of
shape in fixed detection windows. Zhong et al. [11] designed a supervised deep-learning framework
that alleviated the declining accuracy of deep-learning models. The proposed work classified many
agricultural and urban hyperspectral imagery data sets (Indian Pines, Kennedy Space Center, and
University of Pavia).
Qader et al. [12] classified the types of vegetation extracted by satellite-based phonological
characteristics in Iraq and achieved an overall accuracy of 85%. Chen et al. [13] applied exclusive
and hierarchical relationships to enhance the classification accuracy of multiple-label scenes. In this
work, the authors combined these two relationships with a LSTM to form an accurate CNN-based
scene classifier.
For the other applications [15–21], Zhang et al. [15] proposed a covariance descriptor that
combined visual and geometric information. Moreover, this work integrated a classification framework
with dictionary learning for the object recognition of 3D point clouds. Remez et al. [16] presented
a full CNN-based deep learning architecture for image de-noising that uses the splitting scheme to
achieve sub-optimization. Nasr et al. [21] used multi-class SVMs that classified facial images for robotic
applications. This work adopted BoF as a face representation. In their work, scale-invariant feature
transform (SIFT) features were replaced by speeded up robust features (SURF) for rapid and accurate
extractions. Moreover, SURF was also used for selecting interesting points.
Besides, a few hairy scalp issues were also discussed and researched [22–24], Shih and Lin [22]
developed hair segmentation and counting algorithms which were based on an unsupervised
mechanism for diagnosing person’s hair of health condition. Nakajima and Sasaki [23] proposed an
automatic health monitoring system which two subjects had thinning hair due to aging. In their work,
hair whirl was recognized by the close circle. Lee et al. [24] developed and manufactured a carbon
nanotube/adhesive polydimethylsiloxane-based electroencephalograph (EEG) electrode which EEG
signals can be read and recorded from the hairy scalp.
4. Machine-Learning Techniques for Diagnosing and Analyzing Hairy Scalps

Figure 3 shows the system architecture of the intelligent scalp detection system (ISDS) [25]. The
ISDS consists of a scalp detector, an app running on a tablet, machine-leaning techniques [2,26], and
a cloud management platform. The scalp detector will be connected with the tablet through a Wi-Fi
wireless network. Thus, a scalp photo can be captured via the scalp detector. The scalp photo will be
taken by the scalp detector, and the recognized result of the scalp will also be sent and displayed to the
tablet. Then, we can obtain quantitative data on scalps. Furthermore, based on ISDS, we will compare
three state-of-the-art machine-learning methods using scalp image data: deep learning, BOW [27] with
machine-learning classifiers, and PHOG [28] with machine-learning classifiers.
A pre-trained model and transfer learning are included in the deep-learning method as shown
Figure 4. The ImageNet-VGG-f model [29] is a pre-trained model that has been trained using
1,200,000 images. The results of this pre-trained model are used as the initial parameters with our scale
image data set to perform the fine-tuning function, which can reduce the training time compared with
training all images and increases the accuracy compared with choosing random initializations.
Cloud Platform
Data upload
Data Training Process
base
Scalps Inspection
Appl. Sci. 2018, 8, 853 Update 6 of 28
Appl. Sci. 2018, 8, x 6 of 27
4G New predictor
Train a new predictor with new data
Wifi Tablet Cloud Platform
Data upload Machine Learning Techniques Development
Data Training Process
base
Bag of Words (BOW)
Scalps Inspection Result
Scalp Detector Update
Display
Pyramid Histogram of
4G
Take photographs of scalp New predictor Oriented Gradient
(PHOG/HOF)
Train a new predictor with new data
Recognize
Wifi Tablet Deep Learning –
Machine Learning Techniques
Predictor Development
ImageNet-VGG-f
This Work (An Evaluation of Machine Learning

Bag of Words (BOW) Tech.)
Result
Scalp Detector Display
Pyramid Histogram of
Figure 3. System architecture of the intelligent scalp detection system (ISDS).
Oriented Gradient
Take photographs of scalp
(PHOG/HOF)
A pre-trained model and transfer learning are included in the Deep
Recognize deep-learning
Learning – method as shown
Predictor
Figure 4. The ImageNet-VGG-f model [29] is a pre-trained model thatImageNet-VGG-f
has been trained using 1,200,000
images. The results of this pre-trained model are Thisused as Evaluation
Work (An the initial parameters
of Machine Learning with
Tech.) our scale image
data set to perform the fine-tuning function, which can reduce the training time compared with training
System
Figure 3.the
all images and increases architecture
accuracy compared of thewith
intelligent
choosing scalprandom
scalp detectioninitializations.
detection system (ISDS).
system (ISDS).
A pre-trained model and transfer learning are included in the deep-learning method as shown
ImageNet-VGG-f model
Figure 4. The ImageNet-VGG-f model [29] is a pre-trained Inputmodel

Images that has been trained using 1,200,000
images. The results of this pre-trained model are used as ReLU,
Layer1: Conv, theLRN,initial
Pooling parameters with our scale image
Layer2: Conv, ReLU, LRN, Pooling

data set to perform the fine-tuning function, which can reduce the training time compared with training
TrainLayer3: Conv, ReLU
all images and increases the accuracy compared with choosing random initializations.
Layer4: Conv, ReLU
Layer5: Conv, ReLU, Pooling
Layer6: FC, ReLU
ImageNet-VGG-f model
Layer7:
InputFC, ReLU
Images
Layer8:
Layer1: Conv, FCLRN, Pooling
ReLU,
Softmax
Layer2: Conv, ReLU, LRN, Pooling
Train Layer3: Conv, ReLU

Use the obtained parameters as
Layer4: Conv, ReLU
initialization to retrain a new model
Layer5: Conv, ReLU, Pooling
Fine-tune
Layer6: FC, ReLU
Input Images
Layer7: FC, ReLU
Fine-tune on Layer1: Conv, ReLU, FC
Layer8: LRN, Pooling
scalp dataset Layer2: Conv,Softmax

ReLU, LRN, Pooling
Data augmentation
(Crop, Flip) Layer3: Conv, ReLU
Use the obtained parameters as
Layer4: Conv, ReLU
initialization to retrain a new model
Training data set Layer5: Conv, ReLU, Pooling
Fine-tune
Layer6: FC, ReLU
Input Images
Layer7: FC, ReLU Category 1( Bacteria_1) : No
Fine-tune on Layer1: Conv, ReLU, LRN, Pooling
Layer8: FC Category 2( Bacteria_2) : No
scalp dataset Layer2: Conv, ReLU, LRN, Pooling
Softmax Category 1( Allergy) : No
Data augmentation
(Crop, Flip) Layer3: Conv, ReLU
Category 1( Dandruff) : Yes
Layer4: Conv, ReLU
A testing image
Training data set Layer5: Conv, ReLU, Pooling
Figure
Figure4.4.Training
Trainingand
andtesting
testing of theLayer6:
of the FC, ReLU
deep-learning
deep-learning structure.
structure.
Layer7: FC, ReLU Category 1( Bacteria_1) : No
Layer8: FC Category 2( Bacteria_2) : No
Figure
Figure5 5shows
showsthe thesecond
secondmethod,
method,which
whichisisthethe combination
combination
Softmax of
of BOW
BOW and
and SVM.
Category
SVM. For
1( Allergy)
For : No the BOW,
the BOW,
both
boththe
thetraining
trainingandand testing
testingimage
image features
features are
are obtained
obtained via
via the SIFT
the SIFT [30]
[30] method.
method.Then, Then,we weuse
Category 1( Dandruff) : Yes
usethe
the
A testing image
K-means [31] method based on SIFT features to train the images to create a codebook.
K-means [31] method based on SIFT features to train the images to create a codebook. Histograms [32] Histograms
[32]
areare built
built according
according to
to the the
Figure codebook
4. Training
codebook produced
and testing
produced for each
of the
for each test image.structure.
deep-learning
test image.
Finally, those histograms are used to train a SVM classifier, and the SVM classifier is used to
Figure
predict 5 shows
the test image.the
Thesecond method, which
third method, whichisisshown
the combination
in Figure 6,ofused
BOW and SVM.
a HOG For the
algorithm BOW,
to obtain
both the training and testing image features are obtained via the SIFT [30] method. Then,
the training and testing image features. This method trained a SVM classifier based on the HOG we use the
K-meansof[31]
features the method
training based
images.on SIFT features to train the images to create a codebook. Histograms
[32] are built according to the codebook produced for each test image.
Appl. Sci. 2018, 8, x 7 of 27
Appl. Sci. 2018, 8, 853 7 of 28

Appl. Sci. 2018, 8, x 7 of 27
Figure 5. Bag of Words (BOW) with the support vector machine (SVM) method.
predict the test image. The third method, which is shown in Figure 6, used a HOG algorithm to obtain
the training and testing image features. This method trained a SVM classifier based on the HOG
features of theFigure
training
Figure images.
BagofofWords
5.5.Bag Words(BOW)
(BOW) with the
the support
supportvector
vectormachine
machine(SVM) method.
(SVM) method.
predict the test image. The third method, which is shown in Figure 6, used a HOG algorithm to obtain
the training and testing image features. This method trained a SVM classifier based on the HOG
features of the training images.
Figure
Figure6.6.Pyramid histogram
Pyramid histogramofoforiented
orientedgradients
gradients(PHOG)/histogram of of
(PHOG)/histogram oriented gradients
oriented (HOG)
gradients (HOG)
with the
with theSVMSVMmethod.
method.
4.1.
4.1.Deep
DeepLearning
Learning
Deep
Deep learning
learning has
hasbecome
become a popular
a popular method
method forfor
image
image processing
processing and
andenables automatic
enables automatic feature
feature
extraction
extractionasas a type
a typeofof
feature
featurelearning.
learning.TheTheprocess
processis is
improved
improved bybyreplacing feature
replacing engineering
feature engineering that
that
requires
Figure the6.analysis
Pyramid of knowledgeable
histogram of and
oriented experienced
gradients
requires the analysis of knowledgeable and experienced specialists. specialists.
(PHOG)/histogram of oriented gradients (HOG)
The
withThethree
the general
SVM
three steps
method.
general performed
steps performed toto
complete
complete training
trainingonon
thethedeep-learning
deep-learningframework
framework include
include
defining
defining a network structure, defining a learning target and using a numerical method. Thestep
a network structure, defining a learning target and using a numerical method. The first first
4.1. Deep
is step
to Learning
identify
is to a network
identify structure
a network to choose
structure several
to choose severalpossible functions.
possible With
functions. a proper
With a propernetwork
network
structure,
structure, ananefficient
efficientdeep-learning model can
canbebebuilt through thethetraining process. The second step
Deep learning has deep-learning
become a popular model method built through
for image processingtraining
andprocess.
enablesThe second
automatic step
feature
is istotodefine a learning target by choosing objective functions, such as the mean square
define a learning target by choosing objective functions, such as the mean square error (MSE) and error (MSE)
extraction as a type of feature learning. The process is improved by replacing feature engineering that
andcrosscross entropy.
entropy.
requires the analysis of knowledgeable and experienced specialists.
Finally, during the training process, we use numerical methods to discover the best combination
The three general steps performed to complete training on the deep-learning framework include
of parameters, including weights and bias, to reduce the size of the learning target as much as possible.
defining a network structure, defining a learning target and using a numerical method. The first step
Backpropagation (BP) is usually used for minimizing an objective function. Deep learning is composed
is to identify a network structure to choose several possible functions. With a proper network
structure, an efficient deep-learning model can be built through the training process. The second step
is to define a learning target by choosing objective functions, such as the mean square error (MSE)
and cross entropy.
function can be obtained, we can predict the new input data through those functions. In the following
section, we describe how the parameters were updated using BP.
4.1.1. Backpropagation (BP)

Appl. Sci. 2018, 8, 853 8 of 28
BP is utilized to identify the suitable parameters for the deep network. We use the partial
derivatives (∂J/∂w
of a set and ∂J/∂b)
of functions of be
that can theused
objective function
to describe data. IfJ the
regarding every weight
proper parameters w andcan
of a function bias
be b in the
deep neural network.
obtained, we canEquation
predict the(1) shows
new input the
dataobjective function.
through those functions. In the following section, we
describe how the parameters were updated using BP.
1 1
J= ∑ (𝑡𝑎𝑟𝑔𝑒𝑡 − 𝑜𝑢𝑡𝑝𝑢𝑡)2 = ∑ (𝑡 − 𝑜)2 (1)
2 2
4.1.1. Backpropagation (BP)
where t represents the real
BP is utilized to class of the
identify images parameters
the suitable and o represents the output
for the deep of the
network. We deep-learning
use the partial model.
The purpose of BP(∂J/∂w
derivatives is to and
minimize
∂J/∂b) ofJ the
as much
objectiveasfunction
possible by finding
J regarding every the w. Therefore,
weight w and bias bBP lessens the
in the
difference between
deep the realEquation
neural network. value and the output
(1) shows of the
the objective model. To minimize J and obtain all proper
function.
weights, we calculate the partial derivatives 1
of J on w in each 1
layer. We use two layers (as shown in
Figure 7) to describe the BP process.∑First, ∑partial
2 2
J = ( target − output )
we calculate the = ( t − o )derivatives (1)
of J from Equation (1)
2 2
(2)
with respect to the weight 𝑤 of the output layer (as given in Equation (2)). This partial derivative
where t represents the real class of the images and o represents the output of the deep-learning model.
is shownThe
in purpose
Equation (3).is to minimize J as much as possible by finding the w. Therefore, BP lessens the
of BP
difference between the real value and the output of the (2)model. To minimize J and obtain all proper
𝑤
weights, we calculate the partial derivatives of J on w in11each layer. We use two layers (as shown in
(2) (2)
𝑤calculate
Figure 7) to describe the BP process. First, we = [𝑤the ]
21 partial derivatives of J from Equation (1) (2)
( 2 )
with respect to the weight w of the output layer (as given(2) in Equation (2)). This partial derivative is
𝑤31
shown in Equation (3).
 (2) 
w11
∂𝐽
(2) 
w(2) =  w21 (2)

(2)
(2)𝑤

w31 11
∂𝐽  ∂J ∂𝐽 
= (2) (2) (3)
∂𝑤 (2) 𝑤 
w11
∂J ∂J 21 
=  w(2)∂𝐽 

 (3)
∂w(2) 21

 ∂J (2)
(𝑤
w[31 31 ]
2)
Input x
Output o
Figure 7.7.Two-layer
Figure neural
Two-layer neural network.
network.
Target tTarget t consists

consists of constants,
of constants, and andthe
theresults
results obtained
obtained after performing
after partialpartial
performing derivatives are
derivatives are
∂∑1 21 (t − o)22 ∂ 21 (1
t − o )2
∑ =2−(t − o ) (2)
∂J ∂o
∂𝐽∂w(2) ∂
=
2
(𝑡 −
( 2 )
𝑜) = ∑ ∂
2
(𝑡
( 2 )
− 𝑜) ∂𝑜 (4)
(4)
= ∂w = ∑ ∂w = −(𝑡∂w − 𝑜)
∂𝑤 (2) ∂𝑤 (2) ∂𝑤 (2) ∂𝑤 (2)
The chain rule is utilized in Equation (5).
Appl. Sci. 2018, 8, 853 9 of 28
The chain rule is utilized in Equation (5).
∂J ∂o ∂z(3)
= −( t − o ) (5)
∂w(2) ∂z(3) ∂w(2)
Because o = f (z(3) ), the partial derivatives of o with respect to z(3) are equal to f 0 (z(3) ) such that
∂o
∂z(3)
= f 0 (z(3) ). Hence, the replacement of ∂o(3) by f 0 (z(3) ) is shown in Equation (6).
∂z
∂J (3)
0 (3) ∂z
= −( t − o ) f ( z ) (6)
∂w(2) ∂w(2)
According to the neural network rule, z(3) (Figure 7) is the result of the multiple a(2) and w(2) such
that z(3) = a(2) w(2) . Hence, the partial derivative of z(3) regarding the weight w(2) is a(2) as shown in
Equation (7).
∂z(3)
= a (2) (7)
∂w(2)
For mathematical convenience, replace −(t − o ) f 0 (z(3) ) with δ(3) as shown in Equation (8).
δ(3) = −(t − o ) f 0 (z(3) ) (8)
∂J
Therefore, according to Equations (7) and (8), can be represented by Equation (9).
∂w(2)
∂J T
( 2 )
= ( a (2) ) δ(3) (9)
∂w
Second, the partial derivatives of J regarding the weight w(1) Equation (10) of the layer before the
output layer are represented by Equations (11) and (12). The chain rule is used in Equation (13).
" #
(1) (1) (1)
(1) w11 w12 w13
w = (2) (2) (3) (10)
w21 w22 w23
 ∂J ∂J ∂J

(1) (1) (1)
∂J w11 w12 w13
= (11)
 
∂w(1) ∂J ∂J ∂J

(1) (1) (1)
w21 w22 w23
∂J ∂ ∑ 12 (t − o )2 ∂ 21 (t − o )2
∂w(1)
=
∂w(1)
= ∑ ∂w(1)
(12)
∂J ∂o ∂o ∂z(3)
= −( t − o ) = −( t − o ) (13)
∂w(1) ∂w(1) ∂z(3) ∂w(1)
Based on ∂o
∂z(3)
= f 0 (z(3) ) and Equation (8), ∂J
∂w(1)
can simply be represented by Equation (14).
∂J (3) (3)
0 (3) ∂z (3) ∂z
= −( t − o ) f ( z ) = δ (14)
∂w(1) ∂w(1) ∂w(1)
(3)
Because ∂z(2) is equal to w(2) , the results of the replacement after performing the chain rule in
∂a
Equation (14) are shown in Equation (15).
(3) ∂a(2) (2)

∂J (3) ∂z (3) (2) T ∂a
= δ = δ ( w ) (15)
∂w(1) ∂a(2) ∂w(1) ∂w(1)
Appl. Sci. 2018, 8, 853 10 of 28
∂a(2) ∂a(2) ∂z(2) ∂a(2)

=
∂w(1) ∂z(2) ∂w(1)
and
∂z(2)
= f 0 (z(2) ), and Equation (16) displays the result of those substitutions.
(2) ∂z(2) (2)

∂J (3) (2) T ∂a (3) (2) T 0 (2) ∂z
= δ ( w ) = δ ( w ) f ( z ) (16)
∂w(1) ∂z(2) ∂w(1) ∂w(1)
∂z(2) ∂z(2)
Also,
Appl. Sci. 2018,∂w
8, (x1) = x, and Equation (17) is the result of substituting x for ∂w(1)
. 10 of 27
∂J T
∂𝐽 = x T δ(3) (w(2) ) f 0 (z(2) ) (17)
∂w(1)(1) = 𝑥 𝑇 δ(3) (𝑤 (2) )𝑇 𝑓′(𝑧 (2) ) (17)
∂𝑤
(2(2) T
Equation
Equation(18)
(18) shows
shows the short representation
representationafter settingδδ(2(2)
aftersetting =δ(3(3)
) =δ
(w(2(2)
) (𝑤 ) ) )𝑇f𝑓
0 (′z(𝑧 ) ).).
∂J∂𝐽 = T𝑥 𝑇(2δ) (2) (18)

=x δ
(1)(1)
(18)
∂𝑤
∂w
For the partial derivatives of J, the biases 𝑏((1) and 𝑏 (2) are shown in Equation (19).
For the partial derivatives of J, the biases b 1) and b(2) are shown in Equation (19).
∂𝐽

(1)
=(2δ)(2)
 ∂J
{ ∂𝑏
( 1 ) = δ
(19)
∂𝐽
∂b
(19)
 ∂J
(2) (2)== δ(3δ)(3)
∂b∂𝑏
Finally,
Finally, we
we use
use Equations
Equations (20) and (21) to update
update the
the parameters,
parameters,including
includingthe
theweights
weightsand
andbias.
bias.
(𝑙) (𝑙) ∂𝐽(𝑤, 𝑏)
wij𝑤𝑖𝑗= = wij𝑤𝑖𝑗− − −α
(l ) (l ) ∂J (w, b) (20)
−α (𝑙) (20)
∂𝑤
(l )
∂wij 𝑖𝑗
(𝑙) (𝑙) ∂𝐽(𝑤, 𝑏)

bi 𝑏𝑖 = = bi 𝑏𝑖− − α α ∂𝑏
(l ) (l ) ∂J (w, b) (21)
(𝑙) (21)
(l )
∂bi 𝑖
where
whereijijrepresents
represents the
the location of the
location of the neuron
neuronon thel 𝑙layer
onthe andααisisthe
layerand thelearning
learningrate.
rate.
4.1.2.
4.1.2.Convolution
Convolution
Convolution
Convolutionlayers
layersare areusually
usuallycomposed
composed of of
a wide
a widerange of filters
range as shown
of filters as shown in Figure 8. Those
in Figure 8.
Thosecan
filters filters can enhance
enhance the features
the features of images.
of images. For For example,
example, thethe topofofFigure
top Figure 8 shows showsthe thepixel
pixel
representationof
representation ofaaline
line filter.
filter. After convoluting
convoluting input
input images
imageswith
withthis
thisline
linefilter,
filter,a avertical
verticalline
linefeature
feature
mapcan
map canbebeobtained.
obtained. Through
Through a vertical line feature
feature map,
map,we wecan
canextract
extractthetheline
linefeatures
featuresincluded
included inin
those images. After several layers, the CNN can learn to grab more complex
those images. After several layers, the CNN can learn to grab more complex features, features, such as objects.
Figure
Figure 8.
8. Pixel
Pixel representation andvisualization
representation and visualizationofoffilters.
filters.
The purpose of training in a CNN is to discover the optimization filters. Through their
calculations, testing images can be well represented so that the accuracy of classification can be
enhanced. Shared weights and sparse connectivity are two advantages of CNNs that can reduce the
number of training parameters.
Appl. Sci. 2018, 8, 853 11 of 28
The purpose of training in a CNN is to discover the optimization filters. Through their calculations,
testing images can be well represented so that the accuracy of classification can be enhanced.
Shared weights and sparse connectivity are two advantages of CNNs that can reduce the number of
training parameters.
For shared weights, all neurons in the same hidden layers use the same filter to detect the same
feature within an image, such as the line or edge feature. Therefore, different parts of an image use the
same filter that includes weights and bias. Sparse connectivity is related to a neuron on the layer being
Appl. Sci. 2018, 8, x 11 of 27
connected to several (not all) neurons of the previous layer.
4.1.3. Rectified
4.1.3. Linear Unit
Rectified Linear Unit (ReLU)
(ReLU)
ReLU is
ReLU is aa linear
linear activation
activation function
function and
and shown
shown inin Equation
Equation (22).
(22). If
If the
the input
input of
of ReLU, x, is
ReLU, x, is less
less
than 0, then the value of ReLU is zero. If the input of ReLU, x, is larger than 0, then the value
than 0, then the value of ReLU is zero. If the input of ReLU, x, is larger than 0, then the value of ReLU of ReLU
is still
is still x.
x. Compared
Compared withwith non-linear
non-linear activation
activation functions,
functions, sigmoid
sigmoid andand tanh
tanh can
can perform
perform astronomical
astronomical
calculations because
calculations because both are exponential
both are exponential functions.
functions. Most
Most importantly, sigmoid and
importantly, sigmoid and tanh
tanh functions
functions
have gradient vanishing problems when implementing BP, which causes
have gradient vanishing problems when implementing BP, which causes missing information. missing information.
𝑅𝑒𝐿𝑈 = 𝑚𝑎𝑥 (0, 𝑥) (22)
ReLU = max (0, x ) (22)
4.1.4. Max-Pooling
After performing
performing convolution,
convolution,thethenumber
numberof ofparameters
parametersisistoo
toohigh
hightototrain
traina aclassifier,
classifier,such asas
such a
Softmax.
a Softmax.Thousands of parameters
Thousands create an
of parameters overfitting
create problem and
an overfitting cause astronomical
problem calculations.
and cause astronomical
As shown in As
calculations. Figure 9, the
shown in highest
Figure 9,value is obtained
the highest valuefrom a small area
is obtained fromina the
smallprevious
area in layer. Hence,
the previous
Max-pooling
layer. Hence, is proposed toisreduce
Max-pooling the number
proposed to reduce ofthe
parameters
number of and prevent overfitting.
parameters and prevent overfitting.
Figure 9. Max-pooling.
4.1.5. Fully Connected Layers (FC)

For the ImageNet-VGG-f model, layers 6 to 8 are fully connected layers (FC). Each neuron of the
connectsto
FC connects toevery
everyneuron
neuronofofthethe previous
previous layer.
layer. Hence,
Hence, the the number
number of parameters
of parameters of theofFC
the
is FC is
large.
large. Layers six and seven include activate functions. High-level features can be calculated
Layers six and seven include activate functions. High-level features can be calculated from layer five,from
layer
the five, the convolutional
convolutional layer,and
layer, the ReLU the the
ReLU and the
pooling pooling layer.
layer.
The main concept of FC is to transform high-level features into one dimension for classification
purposes. Therefore, by using matrix multiplication, FC condenses all dimensions into one vector for
classifier.
Softmax to train a classifier.
4.1.6. Softmax
The Softmax
Softmax layer
layerisisbased
basedonona logical
a logicalregression
regressionto manage
to manage multiclass problems.
multiclass Therefore,
problems. this
Therefore,
layer is also
this layer is called a multinomial
also called logisticlogistic
a multinomial regression. In this In
regression. research, four classes
this research, four are used:are
classes bacteria
used:
type 1, bacteria
bacteria type 1,type 2, allergy
bacteria typeand dandruff.
2, allergy The
and output ofThe
dandruff. Softmax
output presents four probabilities
of Softmax of
presents four
belonging to aofclass,
probabilities and the
belonging to sum of the
a class, andfour probabilities
the sum is equal
of the four to 1.
probabilities is equal to 1.
4.1.7. Data Augmentation

Before performing fine tuning through the ImageNet-VGG-f pre-trained model, four categories
are included with a different number of images in each. Hence, the data augmentation method is
Appl. Sci. 2018, 8, 853 12 of 28
4.1.7. Data Augmentation

Before performing fine tuning through the ImageNet-VGG-f pre-trained model, four categories
are included with a different number of images in each. Hence, the data augmentation method is
applied to equalize the number of images in those four classifications. Flip and Crop are two data
augmentation methods.
We use the fliplr function of MATLAB to exchange the columns in order in the horizontal direction.
For
Appl.example,
Appl. Sci. 2018,8,8,the
Sci.2018, xx first column of the image is exchanged with the last a column of the image. Figure
12 10a
12of
of27
27
shows the original image, and Figure 10b shows the result of flipping the original image. The Crop
method
The
The Crop
Crop crops
method
method part of the original
crops
crops part ofimages.
part of the The data
the original
original augmentation
images.
images. The data method
The data can help
augmentation
augmentation to reduce
method
method canthe
can effect
help
help to
to
of overfitting.
reduce
reduce the
theeffect
effectof ofoverfitting.
overfitting.
(a)
(a) (b)
(b)
Figure
Figure10.
Figure 10.(a)
10. (a)Scalp
(a) Scalpbefore
Scalp beforeflipping;
before flipping;(b)
flipping; (b)scalp
(b) scalpafter
scalp afterflipping.
after flipping.
flipping.
4.2.
4.2.Bag
4.2. Bagof
Bag ofWords
of Words(BOW)
Words (BOW)
(BOW)
BOW
BOWwas
BOW wasoriginally
was originallyused
originally usedto
used todetect
to detectwords.
detect words.The
words. Thegeneral
The generalpurpose
general purposeof
purpose ofthis
of thismethod
this methodis
method isto
is toobtain
to obtainvisual
obtain visual
visual
words
words from
from all
alldocuments
documents and
and then
then calculate
calculate the
the number
number of
ofoccurrences
occurrences of
ofthose
those
words from all documents and then calculate the number of occurrences of those visual words based visual
visual words
words based
based
on
on
on each
each document.
document. Subsequently,
Subsequently, the
theoccurrences
occurrences are
arepresented
presented by
byaafeature
feature vector
vector
each document. Subsequently, the occurrences are presented by a feature vector in each document. Thein
in each
eachdocument.
document.
The
The flow
flow chartchart
flow chart in
in Figure
in Figure Figure 11
11 shows
11 shows shows the four
thekey
the four four key
keyof
parts parts
the of
parts of the
BOW BOW
BOW method:
themethod: method: (1)
(1) feature
(1) feature feature detection;
detection;detection; (2)
(2)
(2) feature
feature
feature description;
description; (3)
(3)cluster;
cluster; and
and
description; (3) cluster; and (4) histogram.(4)
(4)histogram.
histogram.
Figure
Figure11.
Figure 11.BOW
11. BOWstructure.
structure.
structure.
4.2.1.
4.2.1.Feature
4.2.1. Feature Detection
Detection
Feature
Featuredetection
Feature detectiondetects
detection detectsthe
detects theinteresting
the interestingpoints
interesting pointsof
points ofimages
of images andandcan
canbe
bedivided
dividedinto
intotwotwocategories:
categories:
global
globalfeatures
global features and
features and local
andlocal features.
localfeatures. For
features.For global
Forglobal features,
globalfeatures, the
features,the information
theinformation contained
informationcontained
contained ininanan
in an entire
entire
entire image
image
image is
is used,
is used,
used, and
andand the
thethe most
mostmost popular
popular
popular global
global
global feature
feature
feature is GIST
is GIST
is GIST [33].
[33].
[33].
For
For local
local features,
features, the
the images
images are
are divided
divided into
into many
many sub-images
sub-images oror segmented
segmented according
according to to
objects
objectswithin
withinthe
theimages.
images.The Thefeature
featuredetection
detectionofofBOW
BOWuses useslocal
localfeatures
featuresto
todetect
detectthethecritical
criticalpoints
points
of
ofimages.
images.Many
Manymethods
methodsare areused
usedtotodetect
detectkey
keypoints,
points,such
suchas asHarris–Affine,
Harris–Affine,Hessian–Affine
Hessian–Affine[34], [34],
maximally
maximally stable
stable extremal
extremal regions
regions (MSER)
(MSER) [35],
[35], Lowe’s
Lowe’s difference-of-Gaussian
difference-of-Gaussian (DOG),
(DOG), edge-based
edge-based
regions
regions (EBRs),
(EBRs), intensity-based
intensity-based regions
regions (IBRs)
(IBRs) [36],
[36], and
and salient
salient regions
regions [37].
[37]. Tuytelaars
Tuytelaars and and
Appl. Sci. 2018, 8, 853 13 of 28
For local features, the images are divided into many sub-images or segmented according
to objects within the images. The feature detection of BOW uses local features to detect
Appl. Sci. 2018, 8, x
the
13 of 27
critical points of images. Many methods are used to detect key points, such as Harris–Affine,
Hessian–Affine [34], maximally stable extremal regions (MSER) [35], Lowe’s difference-of-Gaussian
4.2.2. Feature Description
(DOG), edge-based regions (EBRs), intensity-based regions (IBRs) [36], and salient regions [37].
After detecting
Tuytelaars the key[38]
and Mikolajczyk points of images,
surveyed many we must
local accurately
feature describe those key points. In this
detectors.
work, we use VLFeat [39], which combines Lowes DOG and SIFT. By exploiting the features of SIFT,
such as
4.2.2. its invariance
Feature Descriptionto image scale and rotation, suitable information can be obtained.
Thus, the same SIFT features can be obtained regardless of the rotation and scale of the image.
After detecting the key points of images, we must accurately describe those key points. In this
Through SIFT, we can finally obtain a 128-dimension feature vector from each key point. Four steps
work, we use VLFeat [39], which combines Lowes DOG and SIFT. By exploiting the features of SIFT,
are performed to obtain the SIFT features: (a) scale-space peak selection; (b) key point localization; (c)
such as its invariance to image scale and rotation, suitable information can be obtained.
orientation assignment; and (d) key point description.
Thus, the same SIFT features can be obtained regardless of the rotation and scale of the image.
(a) Scale-space
Through SIFT, wePeak can Selection
finally obtain a 128-dimension feature vector from each key point. Four steps
are performed to obtain the SIFT features:
The SIFT algorithm identifies (a) scale-space
the interesting points in peak selection;
different (b) keyBy
scale spaces. point localization;
multiplying the
(c) orientation
Gaussian kernelassignment;
with inputand (d) key
images, we point
can description.
obtain different scale spaces. Equation (23) shows the
two-dimensional
(a) variable
Scale-space Peak scale space’s Gaussian function and Equation (24) shows the scale-space.
Selection
1 2 +𝑦 2 )/2𝜎 2
The SIFT algorithm identifies the𝐺(𝑥, 𝑦, 𝜎) = points
interesting 𝑒 −(𝑥
in different scale spaces. By multiplying(23)
the
2𝜋𝜎 2
Gaussian kernel with input images, we can obtain different scale spaces. Equation (23) shows the
𝐿(𝑥,
two-dimensional variable scale space’s 𝑦, 𝜎) = 𝐺(𝑥,
Gaussian 𝑦, 𝜎)and
function ∗ 𝐼(𝑥, 𝑦)
Equation (24) shows the scale-space.
(24)
where L(x,y,𝜎) is the result of the convolution operation 1 −( x2 +with
y2 )/2σ2
an input image I(x,y) and the two-
G ( x, y, σ ) = e (23)
dimensional space’s Gaussian function and σ is2πσ the2 standard deviation. If the value of σ increases,
then the image will be blurred, and vice versa. Multiple Gaussian scale spaces can be obtained by
L( x, y, σ ) = G ( x, y, σ ) ∗ I ( x, y) (24)
using the convolution operation on an original image and varying 𝜎.
whereFigure 12 shows
L(x,y,σ) is thethat the Gaussian
result pyramid is constructed
of the convolution operation with by different
an input octaves.
imageFigure
I(x,y) 13 shows
and the
the four octaves of
two-dimensional a dandruff
space’s image
Gaussian and five
function andstandard deviations
σ is the standard in each octave.
deviation. The of
If the value images in the
σ increases,
samethe
then octave
imagearewill
the same size but
be blurred, have
and vice different standard Gaussian
versa. Multiple deviations. scale spaces can be obtained by
using the convolution operation on an original image and varying σ.and the upper layer is k times σ.
For example, at the bottom of octave1, the σ value is the lowest
As σFigure
increases, a variety
12 shows of scale
that the spaces
Gaussian can be isobtained.
pyramid constructed For by different octaves,
different octaves.the next 13
Figure octave
shows is
obtained by down-sampling the previous octave. Based on the Gaussian pyramid,
the four octaves of a dandruff image and five standard deviations in each octave. The images in the the DOG can be
generated
same byare
octave subtracting
the same two
size adjacent
but haveimages
different instandard
the samedeviations.
octave.
Figure 12.
Figure 12. Difference-of-Gaussian (DOG).
Difference-of-Gaussian (DOG).
Appl. Sci. 2018, 8, 853 14 of 28
Appl. Sci. 2018, 8, x 14 of 27
Figure 13. Dandruff image is displayed at four octaves with five standard deviations.
Comparing 26 neighbor pixels of the candidate pixel (as shown in the right side of Figure 10)
For example, at the bottom of octave1, the σ value is the lowest and the upper layer is k times
can obtain a feature point as the local maxima and minima in the DOG space. The 26 neighbor pixels
σ. As σ increases, a variety of scale spaces can be obtained. For different octaves, the next octave is
include the 18 pixels of the up and down layer of DOG and the 8 pixels at the current layer of DOG.
obtained by down-sampling the previous octave. Based on the Gaussian pyramid, the DOG can be
To identify stable key points within the scale space, the DOG scale-space is obtained by convoluting
generated by subtracting two adjacent images in the same octave.
different DOG kernels and original images as shown in Equation (25).
Comparing 26 neighbor pixels of the candidate pixel (as shown in the right side of Figure 10)
𝐷(𝑥,point
can obtain a feature 𝑦, 𝜎) as
= the
(𝐺(𝑥, 𝑦, 𝑘maxima
local 𝜎) − 𝐺(𝑥, 𝑦, 𝜎)
and ∗ 𝐼(𝑥, 𝑦))
minima = 𝐿(𝑥,
in the DOG 𝑦, space.
𝑘 𝜎) − The
𝐿(𝑥, 26
𝑦, 𝜎)
neighbor pixels
(25)
include the 18 pixels of the up and down layer of DOG and the 8 pixels at the current layer of DOG.
(b) identify
To Key Point Localization
stable key points within the scale space, the DOG scale-space is obtained by convoluting
different DOG kernels and original
Because all local maxima images in
and minima as the
shown
DOG inspace
Equation (25).feature points, many redundant
are not
points must be removed. Brown and Lowe [40] proposed using the Taylor expansion of the scale-
D ( x, y, σ ) = ( G ( x, y, k σ ) − G ( x, y, σ ) ∗ I ( x, y)) = L( x, y, k σ) − L( x, y, σ ) (25)
space function (as shown in Equation (26)) to increase the accuracy of identifying feature points by
removing
(b) candidate
Key Point points that have low contrast.
Localization
∂𝐷 𝑇 1 𝑇 ∂2 𝐷
Because all local maxima and 𝐷(𝑥,minima
𝑦, 𝜎) = 𝐷(𝑥,
in the𝑦, DOG
𝜎) + space𝑥+are 𝑥not feature
𝑋 points, many redundant
(26)
∂𝑥 2 ∂𝑥 2
points must be removed. Brown and Lowe [40] proposed using the Taylor expansion of the scale-space
where x = (𝑥, 𝑦, σ) 𝑇 is the offset from the local maxima or minima sampled. Then, the derivative of
function (as shown in Equation (26)) to increase the accuracy of identifying feature points by removing
Equation (26) is applied and set to zero. The accurate location of the local maxima or minima sampled
candidate points that have low contrast.
can be obtained as 𝑥̂ as shown in Equation (27).
∂2 𝐷 −1 ∂𝐷
∂D T 1 ∂2 D
D ( x, y, σ) =𝑥̂D=( x,
−y, σ)2+ x + xT 2 X (27)
(26)
∂𝑥 ∂𝑥 ∂x 2 ∂x
Then, we place Equation (27) into Equation (26), and the result of the first two items is shown in
where
Equation ( x, y, σ)T is the offset from the local maxima or minima sampled. Then, the derivative of
x =(28).
Equation (26) is applied and set to zero. The accurate location of𝑇 the local maxima or minima sampled
1 ∂𝐷
can be obtained as x̂ as shown in Equation 𝐷(𝑥̂)(27).
= 𝐷(𝑥, 𝑦, 𝜎) + 𝑥̂ (28)
2 ∂𝑥
Finally, the value of D(𝑥̂) at the minima or maxima
∂2 D −1 ∂D
must be larger than the threshold as shown
x̂ = − (27)
in Equation (29). ∂x 2 ∂x
Then, we place Equation (27) into Equation |𝐷(𝑥̂)|(26),
> threshold
and the result of the first two items is shown(29)
in
Equation (28).
If the value of |D(𝑥̂)| is less than the threshold, then Tthose points are unstable and can be
1 ∂D
removed. Removing the interestingDpoints( x̂ ) = D with
( x, y,low contrast x̂is insufficient because of the strong
σ) + (28)
2 ∂x
response at the image edges within the difference of the Gaussian function. Two situations were
Finally,
observed: the value
when of D(x̂) at
the principle the minima
curvature or maxima
is large across themust be larger
edge than the
and when the threshold as shown in
principle curvature is
Equation (29).
small on the perpendicular edge. Therefore, Lowe used the 2 × 2 Hessian matrix to create the principle
curvatures. The Hessian matrix is shown| D in(Equation
x̂ )| > threshold
(30). (29)
𝐷𝑥𝑥 (𝑥, 𝑦) 𝐷𝑥𝑦 (𝑥, 𝑦)

H=[ ] (30)
𝐷𝑥𝑦 (𝑥, 𝑦) 𝐷𝑦𝑦 (𝑥, 𝑦)
Appl. Sci. 2018, 8, 853 15 of 28
If the value of |D(x̂)| is less than the threshold, then those points are unstable and can be removed.
Removing the interesting points with low contrast is insufficient because of the strong response at
the image edges within the difference of the Gaussian function. Two situations were observed: when
the principle curvature is large across the edge and when the principle curvature is small on the
perpendicular edge. Therefore, Lowe used the 2*2 Hessian matrix to create the principle curvatures.
The Hessian matrix is shown in Equation (30).
" #
Dxx ( x, y) Dxy ( x, y)
H= (30)
Dxy ( x, y) Dyy ( x, y)
where Dxx ( x, y) is the second-order partial derivative of D(x,y,σ) in the x direction at a sample point in
DOG. Then, Lowe calculated the individual eigenvalues and only considered the ratio in Equation (31).
The numerator of this ratio is the square of the sum of the eigenvalues from the value of the diagonal
of H as shown in Equation (32). The denominator of this ratio is the product from the determinant as
Tr(H)2 ( α + β )2
= (31)
Det(H) αβ
Tr(H)2 = Dxx ( x, y) + Dyy ( x, y) = α + β (32)
Det(H) = Dxx ( x, y) ∗ Dyy ( x, y) − ( Dxy ( x, y)2 ) = αβ (33)
where α is the largest of the eigenvalues and β is the smallest of the eigenvalues. After setting α to γβ,
the ratio is as shown in Equation (34).
Tr(H)2 ( α + β )2 (γβ + β)2 ( γ + 1)2

= = = (34)
Det(H) αβ γβ2 γ
( γ +1)2
The smallest value of γ is obtained when α and β are the same, and the ratio increases when
γ increases. Hence, we check (35) instead of comparing the ratio of principle curvatures that are less
than the threshold γ.
Figure 14a,b show the Bacteria_1 image before and after the threshold method. The numbers of
the points before and after the threshold are counted in Figure 14a,b as 1198 and 610, respectively.
Tr(H)2 ( γ + 1)2
< (35)
Det(H) γ
(c) Orientation Assignment
After finding the locations of key points, we assign an orientation to those key points to obtain
the benefits of orientation invariance. By calculating Equations (36) and (37) on adjacent pixels around
the key points, we then obtain their gradient magnitudes Msi f t and orientation θsi f t . An orientation
histogram with 10 degrees for each bin is built according to this statistic. The main directions of
the local gradients are high bins. Except for the highest bin in this orientation histogram, the other
key points can also be generated when their bins are more than 80% of the highest bin. Figure 15a
shows the gradient magnitude and orientation of a key point. Similarly, Figure 15b shows the gradient
magnitude and orientation for all key points of an image.
q
Msi f t = ( L x+1,y − L x−1,y )2 + ( L x,y+1 − L x,y−1 )2 (36)
θsi f t = tan−1 (( L x,y+1 − L x,y−1 )/( L x+1,y − L x−1,y )) (37)

magnitude and orientation for all key points of an image.
𝑀𝑠𝑖𝑓𝑡 = √(𝐿𝑥+1,𝑦 − 𝐿𝑥−1,𝑦 )2 + (𝐿𝑥,𝑦+1 − 𝐿𝑥,𝑦−1 )2 (36)
Appl. Sci. 2018, 8, 853 θ𝑠𝑖𝑓𝑡 = 𝑡𝑎𝑛−1 ((𝐿𝑥,𝑦+1 − 𝐿𝑥,𝑦−1 )/(𝐿𝑥+1,𝑦 − 𝐿𝑥−1,𝑦 )) 16 of 28
(37)
(a) (b)
Figure
Figure14.
14.(a)
(a)Before
Beforethe
theoperating
operating threshold method; (b)
threshold method; (b)after
afterthe
theoperating
operatingthreshold
threshold method.
method.
Appl. Sci. 2018, 8, x 16 of 27
(a) (b)
Figure 15.
Figure 15. (a)
(a) Gradient
Gradient magnitude
magnitude and
and orientation
orientation of
of aa key
key point;
point; (b)
(b) gradient
gradient magnitude
magnitude and
and
orientation of
orientation of all
all key
key points
points found.
found.
(d) Key Point Description

(d) Key Point Description
The key point descriptor primarily describes each key point through a 128-dimension vector.
AfterThe key point
rotating thedescriptor primarily
main direction of describes
a key point,eachLowe
key point throughthe
calculated a 128-dimension vector. After
gradient magnitude and
rotating the main direction of a key point, Lowe calculated the gradient magnitude and
orientation of 16*16 adjacent pixels of a key point. The key point as shown in Figure 16a is at the orientation of
16*16
centeradjacent pixels ofThe
of the square. a key point.of
length The keyarrowhead
each point as shown in Figure
represents the16a is at themagnitude,
gradient center of theand
square.
the
The lengthofofeach
direction eacharrowhead
arrowheadisrepresents the gradient magnitude, and the direction of each arrowhead
the orientation.
is the orientation.
Then, Lowe divided those 16*16 grids into 4*4 subregions as shown in Figure 16b. Each sub-region
presents eight directions, such as 45◦ , 90◦ , 135◦ , 180◦ , 225◦ , 270◦ , 315◦ , and 360◦ . Finally, we can obtain
a 4*4*8-dimension vector to describe each key point.
(a) (b)
Figure 16. (a) Image gradient; (b) key point descriptor.
Then, Lowe divided those 16*16 grids into 4 × 4 subregions as shown in Figure 16b. Each sub-
region presents eight directions, such as 45°, 90°, 135°, 180°, 225°, 270°, 315°, and 360°. Finally, we can
obtain a 4*4*8-dimension vector to describe each key point.
The key point descriptor primarily describes each key point through a 128-dimension vector.
After rotating the main direction of a key point, Lowe calculated the gradient magnitude and
orientation of 16*16 adjacent pixels of a key point. The key point as shown in Figure 16a is at the
center of the square. The length of each arrowhead represents the gradient magnitude, and
Appl. Sci. 2018, 8, 853
the
17 of 28
direction of each arrowhead is the orientation.
(a) (b)

Then, Lowe divided those 16*16 grids into 4 × 4 subregions as shown in Figure 16b. Each sub-
4.2.3. Cluster (K-Means)
region presents eight directions, such as 45°, 90°, 135°, 180°, 225°, 270°, 315°, and 360°. Finally, we can
obtainK-means clustering isvector
a 4*4*8-dimension an unsupervised
to describe learning
each keymethod
point. performed to gather similar objects into
same groups. Therefore, the minimum of Equation (38) finds the shortest distance for each datum
4.2.3. Cluster
from x1 . . . . . .(K-Means)
.x j that belongs to centroid Ci .
K-means clustering is an unsupervised learning
k
method performed to gather similar objects into
same groups. Therefore, the minimum ∑ ∑ (38)
SSEof=Equation µi k2 the shortest distance for each datum
k x j −finds (38)
from 𝑥1 … … . 𝑥𝑗 that belongs to centroid 𝐶𝑖 . i=1 x j ∈Ci
𝑘
where SSE represents the sum of squares error. 2
SSE = ∑ ∑ ‖𝑥𝑗 − 𝜇𝑖 ‖ (38)
Five steps are performed to achieve K-means clustering:
𝑖=1 𝑥𝑗 ∈𝐶𝑖
Step
where Randomly
1: SSE represents k number
pickthe of cluster
sum of squares centroids, µ1 ∼ µk .
error.
Step 2: Calculate the closest distance between the remaining data and those k centroids. Then, assign
each datum to the nearest cluster centroid.
Step 3: Recalculate each cluster centroid using the meaning of each cluster.
Step 4: Regroup all data according to the new cluster centroids.
Step 5: Repeat Step 4 until certain conditions are met, including a lack of change in centroids, a small
change of SSE and no data movement.
4.2.4. Histogram
A histogram calculates the number of occurrences of training images based on the codebook
obtained via K-means. Then, we normalize the histograms for the training SVM. For the testing image
data set, we calculate the normalized histograms based on the same codebook.
4.3. Histogram of Oriented Gradient (HOG)

Compared with SIFT feature extraction, we also use the global feature extraction method HOG
and its extension PHOG. In this research, the VLFeat [39] function vl_hog is used. The essential
purpose of HOG is to divide an image into many blocks, and each block contains cells. For each cell,
Dalal built a histogram based on the gradient direction and orientation of each pixel.
The combination of histograms of 2*2 cells can represent the descriptor of the image. Then,
to improve the performance of HOG, Dalal normalized all cells inside each block according to the
intensity of each block, which is called contrast normalization. This contrast normalization process can
reduce the effects of shadowing and light changes. The general scheme of HOG is described as follows
and shown in Figure 17.
Appl. Sci. 2018, 8, 853 18 of 28
Stage A: Centered gradients of the horizontal and vertical directions are calculated. Through
Equations (39) and (40), we can obtain the center horizontal and vertical gradients, respectively,
where
 [−1 0 1] is the 1D centered-point discrete derivative mask of the horizontal direction and
1
 0  is the same mask in the vertical direction.
 
−1
Ix = I ∗ [−1 0 1] (39)
 
1
Iy = I ∗  0  (40)
 
−1
Stage B: The distribution of the intensity gradient and the orientation of an image can well represent
the local object appearance and shape. Equation (41) is used to obtain the gradient magnitude,
and Equation (42) is utilized to obtain the orientation.
q
Mhog = Ix2 + Iy2 (41)

Ix
θhog= tan−1 (42)
Iy
Stage C: A 64*128 image is split into blocks. Dalal considered 2*2 close cells (as shown in Figure 15) to
represent a block. Each cell consists of four 8*8 pixels. Hence, a total of 105 blocks are within a
64*128 image and a 50% overlap occurs between two blocks.
Stage D: Histograms are built. For this stage, 180 degrees are divided into nine bins of 20 degrees/each.
Then, the calculation of the orientation of each bin within a cell produces a histogram. The
combination of four histograms of four adjacent cells is a histogram of the block.
Stage E: All histograms are combined. Finally, for an image, 3780 (105(blocks)*4(cells)*9(bins)) features
can be obtained.
Appl. Sci. 2018, 8, x 18 of 27
Centered: Ix = I x -1 0 1
Stage A
Iy = I x -1
Gradient 0
S S S
2 2
1
Magnitude: x y
θ Stage B
Orientation:   arctan  S y 
 
Sx 8x8 8x8
Block 1 Block 2 Block 7
cell cell
8x8 8x8
Stage C cell cell
Stage D
Total (105 Blocks)
1 Block
Stage E
Figure
Figure 17.
17. General
General scheme
scheme of
of HOG.
HOG.
Stage C: A 64*128 image is split into blocks. Dalal considered 2*2 close cells (as shown in Figure 15)
to represent a block. Each cell consists of four 8*8 pixels. Hence, a total of 105 blocks are within
a 64*128 image and a 50% overlap occurs between two blocks.
Stage D: Histograms are built. For this stage, 180 degrees are divided into nine bins of 20 degrees/each.
Then, the calculation of the orientation of each bin within a cell produces a histogram. The
combination of four histograms of four adjacent cells is a histogram of the block.
Appl. Sci. 2018, 8, 853 19 of 28
4.4. Pyramid Histogram of Oriented Gradient (PHOG)

PHOG [41] is an extension of HOG. First, we calculate the different scales of an image. Then,
on the same scale, we divide the image into several patches, such as 2*2, 4*4, etc. Subsequently,
we calculate the HOF features of those different patches and place all HOF feature results into a
one-dimensional array. Finally, we perform the same approach for all other scales of this image and
combine the results into the same array. Compared with the HOG method, PHOG can detect features
at a variety of scales of an image and can more actively represent the image.
4.5. Machine-Learning Classifiers

In this work, we apply Classification Learner Apps [42], which is a tool included in Matlab.
Classification Learner Apps includes a variety of machine-learning classifiers such as SVM, decision
tree, linear discriminant analysis (LDA), KNN, and ensemble learning.
4.5.1. Support Vector Machine (SVM)

SVM is a classification algorithm proposed by Vapnik [27], and it is based on statistical learning
theory. Classification plays a significant role in data mining. The main purpose of the SVM method is to
train a classifier via supervised learning. Then, the data can be classified using this model. The purpose
of SVM is to calculate the optimum separation hyperplane whose margin is the largest distance from
the closest data as shown in Figure 18a. As we may obtain many hyperplanes, the optimum separation
hyperplane Equation (43) can reduce the effect of noise and decrease the possibility of overfitting. The
largest margin separates the pluses from the minuses as much as possible.
Appl. Sci. 2018, 8, x → → 19 of 27

w· x + b = 0 (43)
where
where𝑤
⃗⃗ →
wis is
thethe
perpendicular
perpendiculartotothe theoptimum
optimumseparation
separationhyperplane
hyperplaneand andisisthe theinner
innerproduct.
product.A
→ →
number 𝑤
⃗⃗
A number of w are perpendicular to the optimum separation hyperplane because w can belength
of are perpendicular to the optimum separation hyperplane because ⃗⃗⃗⃗
𝑤 can be any any
→
and insufficient
length constraints
and insufficient are available
constraints to fixtoafixparticular
are available b or
a particular particular ⃗⃗⃗⃗
b ora aparticular 𝑤. Hence,
w. Hence,additional
additional
→
constraints
constraintsarearerequired
requiredto
tocalculate and 𝑤
calculate bb and ⃗w.
⃗ . The
The super hyperplaneswith
super hyperplanes with++labeled
labeleddata dataand
andthe
thesuper
super
hyperplanes
hyperplaneswith with− −
labeled
labeleddata
dataare
arepresented
presentedin inEquations
Equations (44) and (45),(45), respectively.
respectively.
𝑤
⃗⃗ ∙→⃗⃗⃗⃗
→ 𝑥 +𝑏 =
w · x+++ b = 1
1 (44)
(44)
→𝑤
w⃗⃗· x∙−⃗⃗⃗⃗
→𝑥 + 𝑏 = −1 (45)
−
+ b = −1 (45)
Support
+
Vector
+
op
+
tim
al + +
- s ep
ar
at
in
g
- hy
pe
+
rp
la +
ne
-
-
- Margin
- -
Support Hyper-plans
(a) (b)
Figure 18.18.(a)(a)Margin
Figure Marginbetween
betweentwo
twosupport
support hyperplanes;
hyperplanes; (b) margin isisthe
(b) margin theprojection
projectionofofthe
thedifferent
different
vector ononthe
vector thenormal
normalunit.
unit.
Hence, if the data belong to the 𝑦𝑖 = 1 class, then 𝑥+ data can be described by Equation (46).
Similarly, if the data are located in the 𝑦𝑖 = −1 class, then 𝑥− data can be described by Equation (47).
𝑤
⃗⃗ ∙ ⃗⃗⃗⃗
𝑥+ + 𝑏 ≥ 1, & if 𝑦𝑖 = 1 (46)
𝑤
⃗⃗ ∙ ⃗⃗⃗⃗
𝑥− + 𝑏 ≤ −1 , & if 𝑦𝑖 = −1 (47)
Appl. Sci. 2018, 8, 853 20 of 28
Hence, if the data belong to the yi = 1 class, then x+ data can be described by Equation (46).
Similarly, if the data are located in the yi = −1 class, then x− data can be described by Equation (47).
→ →
w · x+ + b ≥ 1, & if yi = 1 (46)
→ →
w · x− + b ≤ −1, & if yi = −1 (47)
Then, multiply yi = 1 by Equation (46) and multiply yi = −1 by Equation (47) to obtain the result
( → →
yi ( w · x+ + b) ≥ 1, yi = 1 For + samples → →
→ → => yi ( w · xi + b) − 1 ≥ 0 (48)
yi ( w · x− + b) ≥ 1, y i = −1 For − samples
→ →
where yi ( w · xi + b) − 1 = 0 for xi in the support hyperplanes. The margin is the projection of
→ → →
w
the different vector x+ − x− on the normal unit kW k
, which is normal to the optimum separation
hyperplane as shown in Figure 18b.
Therefore, the dot product of this different vector with the normal unit is the distance between
→
those two support hyperplanes as shown in Equation (49), where x− in Figure 18b is a vector
→
originating from Equation (45) and x+ is a vector originating from Equation (44). According to
→
→ →→
Equations (44) and (45), w x+ = 1 − b and ( w x− ) = −1 − b are obtained. Finally, by inserting the right
2
side of those two equalities into Equation (49), the margin is kW k
.
→ →→ →→
→ → w x+ w − ( x− w ) 1 − b − (−1 − b) 2
Margin = ( x+ − x− )· = = = (49)
kW k kW k kW k kW k
2
For the sake of mathematical convenience, maximizing the margin kW k
is equivalent to
1 1 2
maximizing kW k
. Also, minimizing kW k is equivalent to minimizing 2 kW k as shown in Equation (50).
2 1 1
Max ≈ Max ≈ Min kW k ≈ Min kW k2 (50)
kW k kW k 2
Hence, the solution that identifies the optimum separation hyperplane can solve Equation (51).
(
Minimize : 12 kW k2 ,
→ → (51)
Subject to yi ( w · xi + b) − 1 ≥ 0
The solution for the extremum of a function with constraints can use Lagrange multipliers. The
benefit of using Lagrange multipliers is to maximize or minimize the equation without considering
the constraints. Using the Lagrange multiplier method, Equation (51) can be transformed into the
quadratic equation shown in Equation (52).
1 h →→ i
L= kW k 2 − ∑ α i y i ( w · x i + b ) − 1 (52)
2
where αi is the Lagrange multiplier.
By calculating the derivatives of L and setting the results to zero, the extremum of equation L are
identified. Equations (53) and (54) show the derivative of L with respect to w and b, respectively. The
result of Equation (53) shows that vector w is the linear sum of all or certain samples.
∂L → → → →
= 0 => w − ∑ αi yi xi = 0 => w = ∑ αi yi xi (53)
∂w
Appl. Sci. 2018, 8, 853 21 of 28
∂L
= 0 => − ∑ αi yi = 0 => ∑ αi yi = 0 (54)
∂b
Equation (55) shows the results of plugging the w expression Equation (53) into L in Equation (52).
1 → → → →
L= (∑ αi yi xi )·(∑ α j y j x j ) − (∑ αi yi xi )·( ∑ α j y j x j ) − ∑ α i y i ·b + ∑ α i (55)
2
L can be expressed in Equation (56) because ∑ αi yi = 0.
1 → →
L= ∑ αi − 2 ∑ ∑ αi α j yi y j xi · x j (56)
i j
The objective is to identify the maxima of expression L, and this maximization process depends
→ →
on x sample vectors. The optimization depends on only the dot product of pairs of samples xi · x j .
4.5.2. Decision Tree

A decision tree is a supervised machine-learning model with simple process intuition and
high execution efficiency. It is suitable for the prediction of classification and regression data types.
Compared with other machine-learning models, the execution speed is a major advantage.
In addition, one feature of the decision tree is that each decision stage is fairly clear (YES or NO).
In contrast, logistic regression and SVMs are similar to black boxes. It is difficult for us to predict or
understand their internal complexities and operational details. Additionally, the decision tree has
provided instructions for us to actually simulate and draw a decision-making process from the root, to
each leaf, and to the final node.
4.5.3. Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a classification approach in which the data of different
classes can be generated based on different Gaussian distributions. The LDA-based classification
approach attempts to find a linear combination of the characteristics of two types of objects
or events to be able to characterize or distinguish the characteristics or events. The resulting
combination can be used as a linear classifier or, more commonly, for dimensionality reduction
for subsequent classifications.
4.5.4. k-Nearest Neighbor Algorithm (K-NN)

The k-Nearest Neighbor algorithm (K-NN) is a statistical method for classification and regression.
In both cases, the input contains the k closest training samples in the feature space as follows. In the
K-NN classification, the output is a classification group. The classification of an object is determined
by a “majority vote” of its neighbors, and the most common categories in the k nearest neighbors (k is
a positive integer, typically small) determine the category assigned to the object. If k = 1, the object’s
category is directly given by the nearest node. In K-NN regression, the output is the attribute value of
the object. This value is the average of the values of its k nearest neighbors.
Therefore, the K-NN algorithm uses the vector space model to classify the concepts into similar
class cases. The similarity between the cases is high, and it is possible to evaluate the possible
classifications of the unknown class cases by calculating the similarity to the known class cases.
neighbors (k is a positive integer, typically small) determine the category assigned to the object. If k =
1, the object's category is directly given by the nearest node. In K-NN regression, the output is the
attribute value of the object. This value is the average of the values of its k nearest neighbors.
Therefore, the K-NN algorithm uses the vector space model to classify the concepts into similar
class cases.
Appl. Sci. The
2018, 8, 853 similarity between the cases is high, and it is possible to evaluate the possible
22 of 28
classifications of the unknown class cases by calculating the similarity to the known class cases.
4.5.5. Ensemble Learning

A simple understanding of ensemble learning refers to the use of multiple classifiers to predict
data sets, thereby improving the generalization
generalization capabilities of the overall
overall classifier. We use the
classifier. We
classification problem as an explanation. The classification problem refers to the use of some sort of
rule for classification.
classification.The
Theproblem
problemisisininfinding
finding a certain
a certain function.
function. TheThe idea
idea of ensemble
of ensemble learning
learning can
can generally be understood in this manner: when classifying new data instances, a
generally be understood in this manner: when classifying new data instances, a plurality of classifiersplurality of
classifiers
are trainedareandtrained and the classification
the classification results
results of these of these are
classifiers classifiers
combinedare (for
combined (forvoting)
example, example,
to
voting)
determineto determine the classification
the classification results results to obtain
to obtain betterbetter results
results andandto to improvethe
improve the generalization
generalization
capabilities of the classifier using multiple decision makers together to determine the classification of
an instance.
5. Measurements and
5. Measurements and Experimental
Experimental Results
Results
In
In this
this work,
work, we
we use
use aa 200
200×× magnification
magnification camera
camera to
to take
take scalp
scalp images
images as
as shown
shown in in Figure
Figure 19.
19.
The scalp examples are sorted into four groups: bacteria type 1, bacteria type 2, allergy, and dandruff
The scalp examples are sorted into four groups: bacteria type 1, bacteria type 2, allergy, and dandruff
groups
groups as
as shown
shown in
in Figure
Figure 20a–d,
20a–d, respectively.
respectively.
Figure 19.
Figure 19. Taking
Taking scalp
scalp images
images using
using aa 200
200×× magnification
magnification camera.
camera.
Bacteria_1 is the condition in which the skin on the scalp presents blisters or boils. Bacteria type
Bacteria_1 is the condition in which the skin on the scalp presents blisters or boils. Bacteria type 2
2 is the condition in which the skin on the scalp has many red spots. An allergy scalp problem usually
is the condition in which the skin on the scalp has many red spots. An allergy scalp problem usually
has a considerable number of red patches where the hair root attaches. Dandruff is caused by the
has a considerable number of red patches where the hair root attaches. Dandruff is caused by the
shedding of dead scalp skin cells.
shedding of dead scalp skin cells.
For each group, we use an 8:2 rate to divide them into training data and testing data. Thus,
44 images are used for testing and 176 images are used for training. A variety of sizes is observed
within those images. The largest image is 2000*15,000, whereas the smallest one is 186*63. Each group
contains 220 test images and 880 training images.
Appl. Sci. 2018, 8, 853 23 of 28
Appl. Sci. 2018, 8, x 22 of 27
(a) (b)
(c) (d)
Figure
Figure20.
20.(a)
(a)Bacteria
Bacteriatype
type11image;
image;(b)
(b)bacteria
bacteriatype
type22image;
image;(c)
(c)allergy
allergyimage;
image;(d)
(d)dandruff
dandruff image.
image.
For each group, we use an 8:2 rate to divide them into training data and testing data. Thus, 44
5.1. Experimental Results of Deep Learning
images are used for testing and 176 images are used for training. A variety of sizes is observed within
thoseAs shownThe
images. in Table 1, the
largest accuracy
image increases whereas
is 2000*15,000, as the learning rate decreases.
the smallest The three
one is 186*63. Eachlearning
group
rates set in this research are 10 −4 , 10−5 and 10−6 .
contains 220 test images and 880 training images.
Tableof
5.1. Experimental Results Time Learning
1. Deep for deep learning of training and testing data (seconds).
As shown in Table 1, the accuracy

Spendingincreases as the learning
on Training rate
Spending on decreases.
Testing The three learning
Learning Rate Accuracy
Time
rates set in this research are 10−4, 10−5 and 10−6(s)
. Time (s)
1 × 10−4 52,798.313 19.36 89.77%
1 × 10 −5
Table 1. Time for deep56,821.985
learning of training and testing data (seconds).
20.826 88.06%
1 × 10−6 55,571.778 19.451 78.40%
Spending on Spending on
Learning Rate Accuracy
Training Time (S) Testing Time (S)
The accuracy assesses
1 × 10 −4 the number of correctly predicted 19.36
52,798.313 images from the deep-learning
89.77% model
with respect to1the total
× 10 −5 number of images.
56,821.985 The validation (Val)
20.826 shown in Figure 21
88.06% represents the
validation error1 based× 10−6 on the validation data
55,571.778 set. The validation
19.451 set has been used
78.40% to estimate the
predictive error of the selected model. Similarly, training in Figure 21 represents the training errors
created
Thefrom the training
accuracy assessesdata sets. When
the number the correct
of correctly label ofimages
predicted a testingfromimage is not within the
the deep-learning five
model
highest
with possible
respect labels
to the totalthat the model
number predictsThe
of images. for this testing image,
validation the topin5 Figure
(Val) shown err will 21
be represents
high. Similarly,
the
the top 1 err indicates that the correct label of a testing image is not the label
validation error based on the validation data set. The validation set has been used to estimate the that the model predicts
for this testing
predictive errorimage. A smaller
of the selected top 1 Similarly,
model. err and toptraining
5 err indicates
in Figure better results. Asthe
21 represents shown in Table
training errors1,
the accuracy − 6 − 4
created from increases
the training when
datathesets.
learning
Whenratethe decreases from
correct label of10 to 10 image
a testing . is not within the five
highest possible labels that the model predicts for this testing image, the top5 err will be high.
Similarly, the top1 err indicates that the correct label of a testing image is not the label that the model
Appl. Sci. 2018, 8, x 23 of 27
predicts for this testing image. A smaller top1 err and top5 err indicates better results. As shown in
Appl. Sci. 2018, 8, 853 24 of 28
Table 1, the accuracy increases when the learning rate decreases from 10−6 to 10−4.
(a) (b) (c)
Figure21.
Figure 21.(a)(a)Learning
Learning rate
rate is is
1× 1 ×1010 4 ;; (b)
−−4 (b) learning
learningrate
rateisis11×× 10
10−−5; (c) learning rate is 1 × 10−6.
5 ; (c) learning rate is 1 × 10−6 .
5.2. Experimental Results of BOW with Machine-Learning Classifiers

5.2. Experimental Results of BOW with Machine-Learning Classifiers
Table
Table22shows
showsthe
thetime
timespent
spentononobtaining
obtaining the
the SIFT
SIFT features,
features, calculating K-means and
calculating K-means and generating
generating
histograms for the training data and obtaining the SIFT features and calculating the histograms
histograms for the training data and obtaining the SIFT features and calculating the histograms for the for
the testing
testing data.
data. ThisThis discrepancy
discrepancy is because
is because the training
the training datathe
data used used the K-means
K-means results
results for for the
the training
training data to create its own
data to create its own histogram.histogram.
Table
Table2.2.Time
Timefor
forBOW
BOW with
with the
the training
training and testing data
and testing data (in
(in seconds).
seconds).
Different Time to Obtain

Different Centers Time Time toto
Obtain
Obtain Time to Obtain Time Time totoCalculate
Calculate Time Time to
to Calculate
Calculate
Centers (Using
(Using K-Means SIFT SIFT Features for
SIFT
SIFT
Features
Features for Histograms for
Features for Histograms for Histogramsfor
Histograms for
K-Means
Method) Training Data for Training
Training Data Training Data Training Data
Training Data Training Data Training Data
Method)
10 centers 472,358.918 Data
2157.486 4253.557 834.538
10 50
centers
centers 472,358.918
440,101.268 2157.486
2415.997 4253.557
4533.5 834.538
908.185
50100 centers
centers 526,735.405
440,101.268 2854.442
2415.997 4851.46
4533.5 963.334
908.185
300 centers 481,344.417 3307.601 4791.243 917.323
100 centers 526,735.405 2854.442 4851.46 963.334
500 centers 444,823.379 4040.969 4748.997 874.902
300 centers 481,344.417 3307.601 4791.243 917.323
500 centers 444,823.379 4040.969 4748.997 874.902
The experimental results show that the time spent obtaining the SIFT features from the training
dataThedoes not increaseresults
experimental as the number
show thatof centers
the timechosen for the K-means
spent obtaining the SIFT method increases.
features from the However,
training
the time spent on the other operations increases as the number of centers increases.
data does not increase as the number of centers chosen for the K-means method increases. However,
the time Table 3 shows
spent on thethe accuracies
other basedincreases
operations on different numbers
as the number of of
centers,
centersincluding 10, 50, 100, 300 and
increases.
500.Table
As shown in Table 3, the best accuracy of all scenarios was 80.5% and
3 shows the accuracies based on different numbers of centers, including 10, 50, was achieved with SVM.
100, 300
In Table 3, five machine-learning methods, decision tree, LDA, SVM, K-NN
and 500. As shown in Table 3, the best accuracy of all scenarios was 80.5% and was achieved withand ensemble learning, are
implemented
SVM. In Tableusing themachine-learning
3, five features resulting from BOWdecision
methods, based ontree,
different
LDA, types
SVM,of centers
K-NN selected when
and ensemble
using K-means. In Table 3, for those five machine-learning methods, the accuracies increased when
learning, are implemented using the features resulting from BOW based on different types of centers
selecting 10 centers, 50 centers and 100 centers when using the K-means method. The highest accuracy
selected when using K-means. In Table 3, for those five machine-learning methods, the accuracies
was achieved when using SVM.
increased when selecting 10 centers, 50 centers and 100 centers when using the K-means method. The
highest accuracy was achieved when using SVM.
Table 3. Accuracy of BOW based on different machine-learning classifiers.
TableLearners
Classification 3. Accuracy of BOW based on different
10 Centers 50 Centersmachine-learning
100 Centers classifiers.
300 Centers 500 Centers
Classification
Decision Learners
Tree 10 Centers
53.8% 50 Centers
59.7% 100 Centers
62.1% 30060.7%
Centers 50060.9%
Centers
Linear Discriminant Analysis
Decision Tree (LDA) 55.0%
53.8% 64.2%
59.7% 67.0%
62.1% 59.7%
60.7% 33.9%
60.9%
Support vector machine (SVM) 68.9% 75.9% 77.4% 80.5% 80.0%
Linear Discriminant Analysis
K-Nearest Neighbor (K-NN) 55.0%
67.9% 64.2%
74.9% 67.0%
76.3% 59.7%
74.7% 33.9%
76.4%
(LDA)
Ensemble Learning 67.9% 75.4% 78.1% 75.9% 77.3%
Support vector machine (SVM) 68.9% 75.9% 77.4% 80.5% 80.0%
K-Nearest Neighbor (K-NN) 67.9% 74.9% 76.3% 74.7% 76.4%
Ensemble Learning 67.9% 75.4% 78.1% 75.9% 77.3%
Appl. Sci. 2018, 8, 853 25 of 28
5.3. Experimental Results of PHOG/HOG with SVM

Compared with the deep learning and BOW methods, the accuracies of PHOG and HOG are far
lower (less than 45%) as shown in Table 4.
Table 4. Training time, testing time and accuracy for PHOG and HOG.
Accuracy without Accuracy with

Training Time Testing Time
Normalization Normalization
PHOG 6715.337 1517.237 39.77% (70/176) 44.31% (78/176)
HOG 106,267.868 6294.642 34.65% (61/176) 37.5% (66/176)
Although PHOG is an extension of HOG, the time required to generate PHOG results is far less
than that of HOG. Also, the accuracy of applying PHOG is better than HOG, including for the results
with and without normalization.
The accuracy of PHOG without normalization is approximately 39.77%, which is 5% higher than
that of HOG. Furthermore, after normalization, the accuracy of PHOG is approximately 44%, which is
approximately 7% higher than that of HOG. In addition, the time spent on PHOG is 15 times higher
than that of HOG. As seen in Table 5, the greatest accuracy was achieved using SVM based on the
PHOG features. Compared to the BOG features, when using the same five machine-learning methods,
the accuracy of the PHOG features is lower.
Table 5. Accuracy of PHOG based on different machine-learning classifiers.
Classification Learners PHOG

Decision Tree 39.3%
LDA 33.2%
SVM 53.0%
K-NN 41.3%
Ensemble 50.3%
5.4. Summary
As shown in Table 6, deep learning has the highest accuracy at 89.77%, and the lowest algorithm
(PHOG) is at 53.4%. In terms of the time spent on the training and testing data, the PHOG is the quickest
algorithm and the BOW is the slowest algorithm, which require 8232.574 and 447,958.95 s, respectively.
Table 6. Running time and accuracy comparisons among the four methods.
Methods Time Spent Accuracy

(Second Fastest) (Best)
Deep Learning
52,817.67 89.77%
(Slowest) (Worst)
BOW
447,959 80.50%
(Faster) (Second Best)
PHOG
8232.574 53.30%
6. Conclusions and Future Works
6.1. Conclusions
In this paper, we analyzed scalp images using machine-learning algorithms, including deep
learning, BOW with machine-learning classifiers and HOG/PHOG with machine-learning classifiers.
The results show the good performance of deep learning, which presented an accuracy rate of 89.77%.
Appl. Sci. 2018, 8, 853 26 of 28
This value is far higher than that achieved by the other two combination methods and higher than the
human inspection accuracy of 70.2% of human inspection.
6.2. Future Works

In the future, we hope to use machine-learning algorithms in scalp detectors to automatically
detect scalp problems. We will continue collecting additional scalp images to help train the
deep-learning model and increase the training accuracy. In addition, we will consider evaluating
some possible hybrid deep-learning/non-deep learning methodologies [43,44] for suitable application
evaluations to continuously optimize the performance of scalp image recognition. Moreover, other
image classification applications will also be trained and evaluated.
Author Contributions: W.-C.W., L.-B.C., and W.-J.C. conceived and designed the experiments; L.-B.C. and W.-J.C.
surveyed the related works; W.-C.W. performed the experiments; L.-B.C. and W.-J.C. analyzed the data; W.-J.C.
contributed materials and tools; W.-C.W. and L.-B.C. wrote the paper.
Acknowledgments: This work was partially supported by the Ministry of Science and Technology (MOST),
Taiwan, under grants MOST-106-2632-E-218-002 and MOST 106-2218-E-218-004.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. The 10 Breakthrough Technologies of 2013. MIT Technology Review. 2013. Available online: https://
www.technologyreview.com/s/513981/the-10-breakthrough-technologies-of-2013/#comments (accessed
on 10 May 2017).
2. Raschka, S. Python Machine Learning, 1st ed.; Packet Publishing: Birmingham, UK, 2015.
3. Loussaief, S.; Abdelkrim, A. Machine learning framework for image classification. In Proceedings of the 7th
International Conference on Sciences of Electronics, Technologies of Information and Telecommunications,
Zilina, Slovakia, 5–7 July 2017; pp. 58–61.
4. Caltech 101-Pictures of Objects Belonging to 101 Categories, Computational Vision at Clatech. Available
online: http://www.vision.caltech.edu/Image_Datasets/Caltech101/ (accessed on 5 June 2017).
5. Ahmed, S.B.; Naz, S.; Razzak, M.I.; Yousaf, R. Deep learning based isolated Arabic scene character recognition.
In Proceedings of the 2017 IEEE International Workshop on Arabic Script Analysis and Recognition, Nancy,
France, 3–5 April 2017; pp. 46–51.
6. Jagannathan, S.; Desappan, K.; Swami, P.; Mathew, M.; Nagori, S.; Chitnis, K.; Marathe, Y.; Poddar, D.;
Narayanan, S.; Jain, A. Efficient object detection and classification on low power embedded systems.
In Proceedings of the 2017 IEEE International Conference Consumer Electronics, Las Vegas, NV, USA,
8–10 January 2017; pp. 233–234.
7. Du, L.; Jiang, W.; Zhao, Z.; Su, F. Ego-motion classification for driving vehicle. In Proceedings of the 2017
IEEE Third International Conference on Multimedia Big Data, Laguna Hills, CA, USA, 19–21 April 2017;
pp. 276–279.
8. Pop, D.O.; Rogozan, A.; Nashashibi, F.; Bensrhair, A. Pedestrian recognition through different cross-modality
deep learning methods. In Proceedings of the 2017 IEEE International Conference on Vehicular Electronics
and Safety, Vienna, Austria, 27–28 June 2017; pp. 133–138.
9. Ermushev, S.A.; Balashov, A.G. A complex machine learning technique for ground target detection and
classification. Int. J. Appl. Eng. Res. 2016, 11, 158–161.
10. Zhao, W.; Du, S. Spectral-spatial feature extraction for hyperspectral image classification: A dimension
reduction and deep learning approach. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4544–4554. [CrossRef]
11. Zhong, Z.; Li, J.; Luo, Z.; Chapman, M. Spectral-spatial residual network for hyperspectral image
classification: A 3-D deep learning framework. IEEE Trans. Geosci. Remote Sens. 2017, 56, 847–858. [CrossRef]
12. Qader, S.H.; Dash, J.; Atkinson, P.M.; Rodriguez-Galiano, V. Classification of vegetation type in Iraq using
satellite-based phrenological parameters. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 414–424.
[CrossRef]
Appl. Sci. 2018, 8, 853 27 of 28
13. Chen, P.-J.; Ding, J.-J.; Hsu, H.-W.; Wang, C.-Y.; Wang, J.-C. Improved convolutional neural network
based scene classification using long short-term memory and label relations. In Proceedings of the
IEEE International Conference on Multimedia and Expo Workshops, Hong Kong, China, 10–14 July 2017;
pp. 429–434.
14. Singhal, V.; Aggarwal, H.K.; Tariyal, S.; Majumdar, A. Discriminative robust deep dictionary learning for
hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 5274–5283. [CrossRef]
15. Zhang, H.; Zhuang, B.; Liu, Y. Object classification based on 3D points clouds covariance descriptor.
In Proceedings of the 2017 IEEE International Conference on Computational Science and Engineering and
IEEE International Conference on Embedded and Ubiquitous Computing, Guangzhou, China, 21–24 July
2017; pp. 234–237.
16. Remez, T.; Litany, O.; Giryes, R.; Bronstein, A.M. Deep class-aware image denosing. In Proceedings of the
International Conference on Sampling Theory and Applications (SampTA’17), Allin, Estonia, 3–7 July 2017;
pp. 138–142.
17. Yang, B.; Wang, Y.; Li, J. A spiking-timing-based model for classification. In Proceedings of the 2016 13th
International Computer Conference on Wavelet Active Media Technology and Information Processing
(ICCWAMTIP’16), Chengdu, China, 16–18 December 2016; pp. 99–102.
18. Zhou, W.; Wu, C.; Du, W. Automatic optic disc detection in retinal images via group sparse regularization
extreme learning machine. In Proceedings of the 36th Chinese Control Conference, Dalian, China, 26–28 July
2017; pp. 11053–11058.
19. Chen, T.; Lu, S.; Fan, J. S-CNN: Subcategory-aware convolutional networks for object detection. IEEE Trans.
Pattern Anal. Mach. Intell. 2017. [CrossRef] [PubMed]
20. Wang, X.; Chen, C.; Cheng, Y.; Wang, Z.J. Zero-shot image classification based on deep feature extraction.
IEEE Trans. Cogn. Dev. Syst. 2017. [CrossRef]
21. Nasr, S.; Bouallegue, K.; Shoaib, M.; Mekki, H. Face recognition system using bag of features and multi-class
SVM for robot applications. In Proceedings of the 2017 International Conference on Control, Automation
and Diagnosis (ICCAD’17), Hammamet, Tunisia, 19–21 January 2017; pp. 263–268.
22. Shih, H.-C.; Lin, B.-S. Hair segmentation and counting algorithms in microscopy image. In Proceedings of the
2015 IEEE International Conference on Consumer Electronics (ICCE’15), Las Vegas, NV, USA, 9–12 January
2015; pp. 612–613.
23. Nakajima, K.; Sasaki, K. Personal recognition using head-top image for health-monitoring system in the
home. In Proceedings of the 26th Annual International Conference of the IEEE EMBS, San Francisco, CA,
USA, 1–5 September 2004; pp. 3147–3150.
24. Lee, S.M.; Kim, J.H.; Park, C.; Hwang, J.-Y.; Hong, J.S.; Lee, K.H.; Lee, S.H. Self-adhesive and capacitive
carbon nanotube-based electrode to record electroencephalograph signals from the hairy scalp. IEEE Trans.
Biomed. Eng. 2016, 63, 138–147. [CrossRef] [PubMed]
25. Chen, L.-B.; Hsu, C.-H.; Su, J.-P.; Chang, W.-J.; Hu, W.-W.; Lee, D.-H. A portable wireless scalp inspector and
its automatic diagnosis system based on deep learning techniques. In Proceedings of the 16th International
Symposium on Advanced Technology, Tokyo, Japan, 1–2 November 2017.
26. Vapnik, V. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995.
27. Sivic, J.; Zisserman, A. Video Google: A text retrieval approach to object matching in videos. In Proceedings
of the IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003; pp. 1470–1477.
28. Bosch, A.; Zisserman, A.; Munoz, X. Representing shape with a spatial pyramid kernel. In Proceedings of the
6th ACM International Conference on Image and Video Retrieval (CIVR’07), Amsterdam, The Netherlands,
9–11 July 2007; pp. 401–408.
29. Chatfield, K.; Somonyan, K.; Vedaldi, A.; Zisserman, A. Return of the devil in the details: Delving deep into
convolutional nets. In Proceedings of the British Machine Vision Conference (BMVC), Nottingham, UK,
1–5 September 2014.
30. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
[CrossRef]
31. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [CrossRef]
32. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA,
20–25 June 2005.
Appl. Sci. 2018, 8, 853 28 of 28
33. Oliva, A.; Torralba, A. Modeling the shape of the scene: A holistic representation of the spatial envelope.
Int. J. Comput. Vis. 2001, 42, 145–175. [CrossRef]
34. Mikolajczyk, K.; Schmid, C. Scale and affine invariant interest point detectors. Int. J. Comput. Vis. 2004, 60,
63–86. [CrossRef]
35. Matas, J.; Chum, O.; Urban, M.; Pajdla, T. Robust wide baseline stereo from maximally stable extremal
regions. In Proceedings of the British Machine Vision Conference (BMVC), Cardiff, UK, 2–5 September 2002;
pp. 384–393.
36. Tuytelaars, T.; Van Gool, L. Matching widely separated views based on affine invariant regions. Int. J.
Comput. Vis. 2004, 59, 61–85. [CrossRef]
37. Kadir, T.; Zisserman, A.; Brady, M. An affine invariant salient region detector. In Proceedings of the European
Conference on Computer Vision, Prague, Czech Republic, 11–14 May 2004; pp. 404–416.
38. Tuytelaars, T.; Mikolajczyk, K. Local invariant feature detectors: A survey. Comput. Graph. Vis. 2008, 3,
177–280. [CrossRef]
39. Vedaldi, A.; Fulkerson, B. VLFeat. An Open and Portable Library of Computer Vision Algorithms. 2008.
Available online: http://www.vlfeat.org (accessed on 5 June 2017).
40. Brown, M.; Lowe, D.G. Recognising panoramas. In Proceedings of the IEEE International Conference on
Computer Vision, Nice, France, 13–16 October 2003; pp. 1–8.
41. Bosch, A.; Zisserman, A. PHOG Descriptor. Available online: http://www.robots.ox.ac.uk/~vgg/research/
caltech/phog.htm (accessed on 15 June 2017).
42. Classification Learner Apps of Matlab. MathWork. Available online: https://www.mathworks.com/help/
stats/classification-learner-app.html (accessed on 24 April 2018).
43. Nanni, L.; Ghidoni, S.; Brahnam, S. Handcrafted vs. non-handcrafted features for computer vision
classification. Pattern Recognit. 2017, 71, 158–172. [CrossRef]
44. Nanni, L.; Ghidono, S. How could a subcellular image, or a painting by Van Gogh, be similar to a great white
shark or to a pizza? Pattern Recognit. 2017, 85, 1–7. [CrossRef]
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
View publication stats

Development and Experimental Evaluation of Machine Learning Techniques For An Intelligent Hairy Scalp Detection System

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Development and Experimental Evaluation of Machine Learning Techniques For An Intelligent Hairy Scalp Detection System

Uploaded by

Copyright:

Available Formats

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

Development and Experimental Evaluation of Machine Learning Techniques

Article in Applied Sciences · May 2018

Wei-Chien Wang Liang-Bi Chen

SEE PROFILE SEE PROFILE

The user has requested enhancement of the downloaded file.

Appl. Sci. 2018, 8, 853; doi:10.3390/app8060853 www.mdpi.com/journal/applsci

Appl. Sci. 2018, 8, x 3 of 27

Figure 1. Typical machine-learning system [2].

Figure 2. A simple structure of a multilayer perceptron (MLP) with a hidden layer.

4. Machine-Learning Techniques for Diagnosing and Analyzing Hairy Scalps

This Work (An Evaluation of Machine Learning

Figure 4. The ImageNet-VGG-f model [29] is a pre-trained Inputmodel

Layer2: Conv, ReLU, LRN, Pooling

Train Layer3: Conv, ReLU

scalp dataset Layer2: Conv,Softmax

Appl. Sci. 2018, 8, 853 7 of 28

4.1.1. Backpropagation (BP)

Target tTarget t consists

The chain rule is utilized in Equation (5).

δ(3) = −(t − o ) f 0 (z(3) ) (8)

(3) ∂a(2) (2)

∂a(2) ∂a(2) ∂z(2) ∂a(2)

(2) ∂z(2) (2)

∂J∂𝐽 = T𝑥 𝑇(2δ) (2) (18)

(𝑙) (𝑙) ∂𝐽(𝑤, 𝑏)

4.1.5. Fully Connected Layers (FC)

4.1.7. Data Augmentation

4.1.7. Data Augmentation

𝐷𝑥𝑥 (𝑥, 𝑦) 𝐷𝑥𝑦 (𝑥, 𝑦)

Tr(H)2 = Dxx ( x, y) + Dyy ( x, y) = α + β (32)

Det(H) = Dxx ( x, y) ∗ Dyy ( x, y) − ( Dxy ( x, y)2 ) = αβ (33)

Tr(H)2 ( α + β )2 (γβ + β)2 ( γ + 1)2

(c) Orientation Assignment

θsi f t = tan−1 (( L x,y+1 − L x,y−1 )/( L x+1,y − L x−1,y )) (37)

𝑀𝑠𝑖𝑓𝑡 = √(𝐿𝑥+1,𝑦 − 𝐿𝑥−1,𝑦 )2 + (𝐿𝑥,𝑦+1 − 𝐿𝑥,𝑦−1 )2 (36)

(d) Key Point Description

Figure 16. (a) Image gradient; (b) key point descriptor.

Figure 16. (a) Image gradient; (b) key point descriptor.

4.3. Histogram of Oriented Gradient (HOG)

4.4. Pyramid Histogram of Oriented Gradient (PHOG)

4.5. Machine-Learning Classifiers

4.5.1. Support Vector Machine (SVM)

Appl. Sci. 2018, 8, x → → 19 of 27

4.5.2. Decision Tree

4.5.3. Linear Discriminant Analysis (LDA)

4.5.4. k-Nearest Neighbor Algorithm (K-NN)

4.5.5. Ensemble Learning

As shown in Table 1, the accuracy

(a) (b) (c)

5.2. Experimental Results of BOW with Machine-Learning Classifiers

Different Time to Obtain

5.3. Experimental Results of PHOG/HOG with SVM

Accuracy without Accuracy with

Table 5. Accuracy of PHOG based on different machine-learning classifiers.

Classification Learners PHOG

Methods Time Spent Accuracy

6. Conclusions and Future Works

6.2. Future Works

View publication stats

You might also like