
Biomedical Signal Processing and Control 62 (2020) 102137


Classification of glaucoma using hybrid features with machine learning approaches

Niharika Thakur, Mamta Juneja *
Computer Science and Engineering, UIET, Panjab University, Chandigarh, India

Keywords: Glaucoma; Retinal fundus image; Structural features; Non-structural features; Classification

Abstract

Glaucoma, also known as the silent thief of sight, is a leading cause of blindness across the globe. It is mainly caused by damage to the optic nerve of the eye, leading to permanent blindness. Traditional approaches used by ophthalmologists for diagnosis include assessment of intraocular pressure by tonometry, pachymetry, etc. All of these evaluations are time-consuming, require human interaction and may be prone to subjective errors. To overcome these challenges, researchers in the field of medical imaging are analysing retinal images for glaucoma diagnosis, and computer aided diagnosis (CAD) systems can be developed using machine learning approaches to classify retinal images as 'abnormal' or 'normal'. This paper presents a new reduced set of hybrid features, derived from structural and non-structural features, to classify retinal fundus images, which could serve as a second opinion for ophthalmologists. The structural features extracted are the disc damage likelihood scale (DDLS) and the cup to disc ratio (CDR), whereas the non-structural features include the grey level run length matrix (GLRM), grey level co-occurrence matrix (GLCM), first order statistical (FoS), higher order spectra (HOS), higher order cumulant (HOC) and wavelet features. Finally, the paper presents a comparative analysis of k nearest neighbor (k-NN), neural network (NN), random forest (RF), support vector machine (SVM) and naïve Bayes (NB) classifiers using metrics such as accuracy, specificity, precision and sensitivity.

1. Introduction

Glaucoma is an abnormality of the retina that originates in the optic nerve of the eye, commonly due to high pressure. It arises when the front of the eye fills with a clear, transparent fluid termed "aqueous humor". The fluid is produced continuously and flows constantly out of the eye, resulting in a steady pressure; in glaucoma, however, the fluid does not flow out of the eye properly, as a result of which the pressure inside the eye increases [1]. According to recent studies carried out by the Glaucoma Research Foundation in 2017, glaucoma is the second most common cause of blindness after cataract. It is estimated that 3 million people in the Americas suffer from glaucoma, but only a fraction of them are aware of it. Also, the cases of blindness in the United States exceed 120,000, accounting for 9%–12% of the total cases across the world. Further, visual impairment is 15 times, and blindness 6–8 times, more prevalent in African Americans than in Caucasians. Glaucoma suspects are estimated to number around 60 million across the world [1].

Abbreviations: CAD, computer aided diagnosis; DDLS, disc damage likelihood scale; CDR, cup to disc ratio; GLRM, grey level run length matrix; GLCM, grey level co-occurrence matrix; FoS, first order statistical; HOS, higher order spectra; HOC, higher order cumulant; k-NN, k nearest neighbor; NN, neural network; RF, random forest; SVM, support vector machine; NB, naive Bayes; ROI, region of interest; SO, sequential optimization; MLP, multilayer perceptron; RBF, radial basis function; PCA, principal component analysis; EAS, evolutionary attribute selection; OS, operating system; LARKIFCM, level set based adaptively regularized kernel based intuitionistic fuzzy c means; DWT, discrete wavelet transform; IG, information gain; GR, gain ratio; SRE, short run emphasis; LRE, long run emphasis; GLN, grey level non-uniformity; RP, run percentage; RLN, run length non-uniformity; autoc, autocorrelation; contr, contrast; corrp, correlation; cprom, cluster prominence; cshad, cluster shade; dissi, dissimilarity; entro, entropy; homop, homogeneity; maxpr, maximum probability; sosvh, sum of squares variance; savgh, sum average; svarh, sum variance; senth, sum entropy; dvarh, difference variance; denth, difference entropy; inf1h, information measure of correlation; inf2h, information measure of correlation 2; indnc, inverse difference normalized; idmnc, inverse difference moment normalized; var, variance; kr, kurtosis; sk, skewness; db, daubechies; bior, biorthogonal; Tn, true negative; Tp, true positive; Fn, false negative; Fp, false positive.
* Corresponding author.
E-mail addresses: niharikathakur04@gmail.com (N. Thakur), mamtajuneja@pu.ac.in (M. Juneja).

https://doi.org/10.1016/j.bspc.2020.102137
Received 2 November 2019; Received in revised form 28 July 2020; Accepted 4 August 2020
Available online 26 August 2020
1746-8094/© 2020 Elsevier Ltd. All rights reserved.

Fig. 1. Generalized CAD System.

Fig. 2. (a) Normal fundus retinal image (b) Abnormal fundus retinal image.

In India, 12 million people are estimated to be glaucoma suspects, and the cases of blindness are approximated at 1.2 million [2]. Screening for glaucoma is usually carried out by skilled ophthalmologists using measurements of the fluid pressure in the eye, examination of the drainage angle or outflow channel, and visual analysis of the retinal fundus image. All of these manual tasks, like the visual analysis of any medical image, are time-consuming, may take approximately 20–45 min, and are prone to subjective evaluation. Thus, there is a need for a computer aided diagnosis (CAD) system [3] for glaucoma, which can assist ophthalmologists by analysing medical images comprehensively in a short duration. The CAD system comprises pre-processing, segmentation and classification of medical images for diagnosis of glaucoma. Pre-processing is the initial step, which removes the noise and outliers in the input image that hinder further analysis, using various filters, morphological operations, etc. Segmentation then divides the input image into multiple regions to extract the required region of interest (ROI), which carries more meaning than the other regions, so that the abnormality can be analysed easily. Finally, classification is a decision-making approach that labels images as 'abnormal' or 'normal'. This work emphasizes the classification module of CAD systems for retinal images using machine learning approaches. Fig. 1 shows the generalized CAD system with its different modules.

In the area of medical image processing, glaucoma can also be diagnosed by analysis of retinal images acquired using retinal fundus cameras. Fig. 2 presents retinal images with absence of glaucoma (normal) and presence of glaucoma (abnormal). It can be seen that an eye with glaucoma contains an enlarged optic cup occupying the optic disc [4].

Various approaches have been proposed to date for classification of glaucoma using different types of features extracted from retinal images [5]. Some of them include the following.

Rajendra et al. presented a computerized approach for diagnosis of glaucoma using higher order spectra (HOS) and textural features. The classification was performed using support vector machine (SVM), sequential optimization (SO), naïve Bayes (NB) and random forest (RF) classifiers. Comparison of the approaches showed that RF outperforms the others with accuracy of 90% [6]. Thereafter, Sumeet et al. extracted wavelet features for classification of glaucoma using SVM, SO, NB and RF; based on the analysis, they found that SVM outperforms all other classifiers with accuracy of 93% [7]. Also, Rama Krishnan et al. used SVM to classify glaucoma based on wavelet and HOS features, with an accuracy of 95% [8]. Further, Noronha et al. extracted HOS features and classified glaucoma using SVM and NB classifiers; amongst the two, the NB classifier performed better than the SVM, with accuracy of 92.65% [9]. Thereafter, Rajendra et al. used SVM and NB classifiers to classify glaucoma based on Gabor features, and observed that the SVM classifier performs better than NB, with accuracy of 93.13% [10]. Further, Issac et al. extracted structural features such as the cup to disc ratio (CDR) and disc damage likelihood scale (DDLS) after segmentation of the disc and cup from the retinal image; classification was then performed using SVM and a neural network (NN), with SVM found to be better than NN, with accuracy of 94.11% and specificity of 100% [11]. Later, Salem et al. extracted structural, textural and intensity-based features for detection of glaucoma using an SVM classifier; the sensitivity, specificity and accuracy were 100%, 87% and 92%, respectively [12]. Haleem et al. presented an approach for classification between abnormal and normal cases of glaucoma using an SVM classifier with accuracy of 94.4%; the features extracted for classification were structural, Gaussian, Gabor and wavelet features [13]. Further, Claro et al. classified retinal images using multilayer perceptron (MLP), RF, random committee and radial basis function (RBF) classifiers, extracting structural and textural features to test on the different classifiers; based on the analysis, the MLP classifier was found to outperform the others, with accuracy of 93.03% [14]. Similarly, Singh et al. extracted wavelet features for glaucoma classification using RF, NB, k nearest neighbor (k-NN), ANN and SVM. The extracted features were then selected using principal component analysis (PCA) and evolutionary attribute selection (EAS). Testing with the different classifiers showed that the features selected by EAS and classified using RF/NN give accuracy of 94.7%, and the features selected using PCA and classified using SVM/k-NN also give accuracy of 94.7% [15].


Sousa et al. extracted texture-based geostatistical functions, namely the semimadogram, semivariogram, correlogram and covariogram, to classify glaucoma using SVM; based on the experimental evaluation, the SVM classifier achieved 91.2% accuracy, with sensitivity of 95% and specificity of 88.23% [16]. Thereafter, Koh et al. extracted wavelet-based features from retinal fundus images and classified them using a random forest classifier; the accuracy, sensitivity and specificity in this case were 92.4%, 89.3% and 95.5% [17]. Recently, Septiarini et al. extracted statistical features from the desired ROI in retinal images and performed classification using k-NN, with accuracy of 95.24% [18]. Also, Selvathi et al. extracted wavelet-based features for classification using NN, with 95.8% accuracy, 91.6% specificity and 100% sensitivity [19]. More recently, Shubhangi et al. extracted cross-entropy features from wavelet-transformed input fundus images and applied statistical tests to rank the features; classification was then performed using SVM with improved performance [20]. Further, Renukalatha used structural parameters such as CDR and non-structural parameters such as texture for classification of glaucoma; these features were classified using a multi-level SVM, with accuracy of 94% [21].

The general observations for each of the studies discussed above are given in Table 1.

Table 1
Analysis of the existing studies.

Authors | Year | Observations | Classifier used
Rajendra et al. [6] | 2011 | Only non-structural HOS and textural features were taken into consideration. | RF
Sumeet et al. [7] | 2012 | Only non-structural wavelet features were taken into consideration. | SVM
Rama Krishnan et al. [8] | 2012 | Only non-structural wavelet and HOS features were taken into consideration. | SVM
Noronha et al. [9] | 2014 | Only non-structural HOS features were taken into consideration. | NB
Rajendra et al. [10] | 2015 | Only non-structural Gabor features were used. | SVM
Issac et al. [11] | 2015 | Only structural features such as CDR and DDLS were used. | SVM
Salem et al. [12] | 2016 | Structural, texture and intensity features were taken into consideration. | SVM
Haleem et al. [13] | 2016 | Structural, Gaussian, Gabor and wavelet features were taken into consideration. | SVM
Claro et al. [14] | 2016 | Structural and textural features were used. | MLP
Singh et al. [15] | 2016 | Only wavelet features were extracted. Ranking of features was performed. | SVM
Sousa et al. [16] | 2017 | Only geostatistical features were extracted. | SVM
Koh et al. [17] | 2017 | Only wavelet features were extracted. | RF
Septiarini et al. [18] | 2018 | Only statistical features were used. | k-NN
Selvathi et al. [19] | 2018 | Only wavelet features were extracted. | NN
Shubhangi et al. [20] | 2019 | Entropy features from wavelet-transformed images were extracted. An appropriate approach for ranking of features was performed. | SVM
Renukalatha [21] | 2019 | Structural and non-structural features such as texture were used. | SVM

As per the literature, the major drawbacks are:

• The features used for classification vary from study to study. In most studies they are either structural or non-structural; very few studies have used a combined feature set.
• Feature selection using appropriate ranking schemes has not been performed in most of the approaches.
• The classifiers used vary from study to study.
• The datasets used for analysis vary from study to study.

Thus, this study aims to:

• use both structural and non-structural feature sets for classification;
• employ five different kinds of feature selection methods for ranking of features, thereby creating a reduced significant feature set;
• propose a reduced significant novel feature set, consisting of both structural and non-structural features, for classification of normal and abnormal retinal images;
• employ different classifiers for comparative analysis and use the best suited one for classification;
• perform all experimental analysis on the same dataset to avoid bias.

The rest of the paper is organized as follows. Section 2 describes the experimental set-up and the datasets used. Section 3 presents the methodology, comprising pre-processing, segmentation, feature extraction, feature ranking/selection and classification. Section 4 presents the results and discussion, comprising performance metrics, parameter settings of the experimentation and performance analysis. Finally, Section 5 concludes the study.
2. Materials

This section highlights the datasets and software used for the experimental analysis.

2.1. Datasets used

(i) DRISHTI-GS [22]: This dataset comprises 101 retinal images with labels "glaucoma" or "non-glaucoma", acquired using retinal fundus cameras. The ground truth for comparison of the implemented approaches comprises soft segmentation maps of the disc and cup created by researchers of IIIT Hyderabad in alliance with Aravind Eye Hospital, Madurai, India. It also includes a .txt file for each retinal image containing its CDR value, which is a significant diagnostic parameter for glaucoma. The images in the dataset were gathered from people of random age groups visiting the hospital and were acquired under varying brightness and contrast.

(ii) RIM-One [23]: This dataset comprises 159 retinal fundus images with labels "glaucoma" or "non-glaucoma", together with ground truth of the cup and disc for diagnosis of glaucoma. The images in the dataset were acquired under different brightness and contrast using a Nidek AFC-210 retinal fundus camera.

2.2. Experimental setup

The experimentation was carried out on a total of 260 (101 + 159) images gathered from both datasets, on MATLAB version 2016b under Windows 10 basic edition, on a system with an Intel(R) Core(TM) i7-7500U CPU at 2.90 GHz, 16 GB RAM and a 64-bit operating system (OS). The images used for testing were cropped to size 512 × 512 to make the dataset homogeneous. The cropping was performed using the MATLAB function 'imcrop' with input parameters 'image' and 'size of the crop'. The cropping reduced the computation time and made the input parameters required for the experimental analysis uniform.
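As a minimal sketch of this preparation step (the centred crop window and the file name are our assumptions; the paper specifies only 'imcrop' and the 512 × 512 size):

    % Crop a fundus image to 512 x 512 around the image centre (assumed window).
    I = imread('fundus_001.png');                            % hypothetical file name
    [h, w, ~] = size(I);
    % imcrop returns a (height+1) x (width+1) block, hence 511 below.
    rect = [floor((w - 512)/2) floor((h - 512)/2) 511 511];  % [xmin ymin width height]
    Ic = imcrop(I, rect);                                    % 512 x 512 test image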
3. Methodology

This section introduces the methodology used for classifying a fundus image as abnormal or normal. Initially, the input image is pre-processed to remove outliers, and segmentation is performed for extraction of structural features such as CDR and DDLS. Thereafter, non-structural features such as grey level co-occurrence matrix (GLCM), grey level run length matrix (GLRM), HOS, first order statistical (FoS), higher order cumulant (HOC) and discrete wavelet transform (DWT) features are extracted from the grayscale-converted retinal image. Further, the most significant features are selected by information gain (IG), gain ratio (GR), correlation, relief and wrapper ranking, and classification is performed using k-NN, NN, SVM, RF and NB. The flow diagram of the methodology is shown in Fig. 3, and the procedure is discussed in Sections 3.1 to 3.5.


Fig. 3. Flow diagram of Methodology.

Fig. 4. Vessels removed image.

3.1. Pre-processing

This is the initial step, comprising channel separation and vessel removal. The input images were first separated into red, green and blue channels to identify the most suitable channel for further processing. The red and green channels were selected for segmentation of the optic disc and optic cup, after removal of vessels using morphological closing with a structuring element of type 'disc'. Fig. 4 shows the input image, the grayscale image, and the vessels-removed red and green channels used for segmentation of the optic disc/optic cup and classification of retinal images as normal or abnormal.
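A minimal sketch of this step, assuming a disk radius of 10 pixels (the paper names only the 'disc'-type structuring element, not its size):

    % Channel separation followed by morphological closing to suppress vessels.
    rgb   = imread('fundus_001.png');     % hypothetical input image
    red   = rgb(:, :, 1);                 % red channel
    green = rgb(:, :, 2);                 % green channel
    se = strel('disk', 10);               % disk-shaped structuring element (radius assumed)
    redClosed   = imclose(red, se);       % vessels-removed red channel
    greenClosed = imclose(green, se);     % vessels-removed green channel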

Fig. 5. Segmented regions.


3.2. Segmentation

This is the primary step for detection of glaucoma using CDR and DDLS, and it requires segmentation of the optic disc and optic cup. The segmentation of the disc and cup was performed by the level set based adaptively regularized kernel based intuitionistic fuzzy c means (LARKIFCM) approach [24]. Fig. 5 shows the segmented disc, the segmented cup and the rim region (i.e. the region between disc and cup).

3.3. Feature extraction

Features are the significant details used for classification. The retinal image is classified by extraction of different types of structural and non-structural features. Structural features such as CDR and DDLS are obtained from the segmented regions of the retinal images produced by LARKIFCM, whereas non-structural features such as GLRM, GLCM, FoS, HOC, HOS and DWT are obtained directly from the grayscale-converted retinal image shown in Fig. 4(b). The details of the structural and non-structural features are given below.

3.3.1. Structural features
These are the features which describe the external boundary of the detected ROI, i.e. area, volume, length, breadth, perimeter, etc. The structural features used here are CDR and DDLS.

(i) Cup to disc ratio (CDR): This is the ratio of the optic cup diameter to the optic disc diameter, used in ophthalmology to assess the progression of glaucoma; an increasing value of CDR signifies progression of the disease. If this ratio is 0.3 or less, the eye is considered normal; if it exceeds 0.3, the eye is considered a glaucoma suspect [25]. CDR is calculated using Eq. (1):

CDR = \frac{Dia_{cup}}{Dia_{disc}}   (1)

Here, Dia_{cup} is the diameter of the optic cup and Dia_{disc} is the diameter of the optic disc, both calculated from the segmented optic cup and optic disc shown in Fig. 5 using the 'EquivDiameter' property of the MATLAB function 'regionprops'. The values of Dia_{cup} and Dia_{disc} must be in the same unit.

(ii) Disc damage likelihood scale (DDLS): DDLS is the ratio of the rim width to the disc diameter, used for prediction of glaucoma. It provides a quantitative assessment of glaucomatous damage, with staging and monitoring of the rate of change, followed by the severity or degree of progression. The value of DDLS predicts the stage of glaucoma from 1 to 10 on the basis of the chart shown in Fig. 6 [26]. If this ratio is more than 0.4, the eye is considered normal; if it is less than 0.4, the eye is considered a glaucoma suspect. DDLS is calculated using Eq. (2):

DDLS = \frac{RIM_{width}}{Dia_{disc}}   (2)

Here, DDLS is the disc damage likelihood scale, RIM_{width} is the width of the region between the disc and the cup (i.e. the rim) shown in Fig. 5, and Dia_{disc} is the disc diameter used in Eq. (1) above. RIM_{width} and Dia_{disc} are likewise calculated using the 'EquivDiameter' property of 'regionprops' in MATLAB, with both values in the same units.

Fig. 6. DDLS Chart.
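A minimal sketch of Eqs. (1) and (2), assuming binary masks discMask and cupMask obtained from the LARKIFCM segmentation, each containing a single region:

    % Equivalent diameters of the segmented regions (Eqs. (1) and (2)).
    rimMask = discMask & ~cupMask;                 % rim = disc minus cup
    sDisc = regionprops(discMask, 'EquivDiameter');
    sCup  = regionprops(cupMask,  'EquivDiameter');
    sRim  = regionprops(rimMask,  'EquivDiameter');
    CDR  = sCup(1).EquivDiameter / sDisc(1).EquivDiameter;   % Eq. (1)
    DDLS = sRim(1).EquivDiameter / sDisc(1).EquivDiameter;   % Eq. (2)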
3.3.2. Non-structural features
These are the features which describe the internal properties of the object, i.e. texture, intensity, etc. The non-structural features discussed here are GLRM, GLCM, FoS, HOC, HOS and wavelet features.

(i) Grey level run length matrix (GLRM) [6]

GLRM is a matrix that captures textural information for analysing the texture of an object. A run is defined as a line of pixels in a certain direction having the same intensity value; the number of such pixels is termed the gray level run length, and the number of occurrences of such runs is its run length value. In the GLRM P(i,j|θ), the (i,j)th element is the number of runs with grey level 'i' and length 'j' in the image along angle θ. Let N_g be the number of discrete intensity values in the image, N_r the number of discrete run lengths, and N_p the number of voxels in the image. The number of runs along angle θ is

N_r(\theta) = \sum_{i=1}^{N_g} \sum_{j=1}^{N_r} P(i,j|\theta), \quad 1 \le N_r(\theta) \le N_p,

and the normalized run length matrix is p(i,j|\theta) = P(i,j|\theta)/N_r(\theta). The values of θ used here are 0° (horizontal), 45° (diagonal) and 90° (vertical); each GLRM feature is computed for each angle and the mean of these values is returned. The commonly extracted GLRM features are SRE, LRE, GLN, RP and RLN, described below; a computation sketch follows this list.

(a) Short run emphasis (SRE): a measure of the distribution of short run lengths, calculated using Eq. (3); a greater value of SRE indicates finer texture.

SRE = \frac{\sum_{i=1}^{N_g} \sum_{j=1}^{N_r} P(i,j|\theta)/j^2}{N_r(\theta)}   (3)

(b) Long run emphasis (LRE): a measure of the distribution of long run lengths, with higher values indicating coarser texture; LRE is computed using Eq. (4).

LRE = \frac{\sum_{i=1}^{N_g} \sum_{j=1}^{N_r} P(i,j|\theta)\, j^2}{N_r(\theta)}   (4)


(c) Grey level non-uniformity (GLN): a measure of the similarity of intensity values across the image, computed using Eq. (5); a lower value of GLN indicates higher similarity of intensity values.

GLN = \frac{\sum_{i=1}^{N_g} \left( \sum_{j=1}^{N_r} P(i,j|\theta) \right)^2}{N_r(\theta)}   (5)

(d) Run percentage (RP): a measure of the coarseness of the texture, given by the ratio of the number of runs to the number of voxels in the ROI; it is calculated using Eq. (6), and higher values of RP indicate finer texture.

RP = \frac{N_r(\theta)}{N_p}   (6)

Here, the values of RP lie between 1/N_p and 1.

(e) Run length non-uniformity (RLN): a measure of the similarity of run lengths throughout the image, calculated using Eq. (7); lower values of RLN signify more homogeneity.

RLN = \frac{\sum_{j=1}^{N_r} \left( \sum_{i=1}^{N_g} P(i,j|\theta) \right)^2}{N_r(\theta)}   (7)
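Since MATLAB has no built-in run-length matrix function, the following sketch computes P(i,j|0°) and the SRE of Eq. (3); the uniform 8-level quantization is our assumption (requires R2016b+ for the implicit expansion in the last line):

    % Horizontal run-length matrix and short run emphasis (Eq. (3)).
    Iq = floor(double(I) / 32) + 1;               % quantize uint8 image to Ng = 8 levels
    Ng = 8; maxRun = size(Iq, 2);
    P = zeros(Ng, maxRun);                        % P(i,j | theta = 0)
    for r = 1:size(Iq, 1)
        row = Iq(r, :); runLen = 1;
        for c = 2:numel(row) + 1
            if c <= numel(row) && row(c) == row(c-1)
                runLen = runLen + 1;              % run continues
            else                                  % run ends: count it and reset
                P(row(c-1), runLen) = P(row(c-1), runLen) + 1;
                runLen = 1;
            end
        end
    end
    Nr0 = sum(P(:));                              % N_r(0), total number of runs
    SRE = sum(sum(P ./ ((1:maxRun).^2), 2)) / Nr0;   % Eq. (3)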
(ii) Grey level co-occurrence matrix (GLCM) [27–29]

This is an approach for characterizing the texture of an image by counting pairs of pixels with specific values in a given spatial relationship within the image. It is a matrix of size N_g × N_g, denoted P(i,j|δ,θ), which describes the second-order joint probability of an image region constrained by the mask. The (i,j)th element of the matrix represents the number of times the grey levels 'i' and 'j' occur in two pixels of the image separated by distance δ along angle θ; here δ = 1 is used, so each pixel is compared with its immediate neighbours along each angle. Now, let ε be a small positive number, P(i,j) the co-occurrence matrix for δ = 1 and angle θ, and p(i,j) = P(i,j)/\sum P(i,j) the normalized co-occurrence matrix. Further:

p_x(i) = \sum_{j=1}^{N_g} p(i,j) is the marginal row probability and p_y(j) = \sum_{i=1}^{N_g} p(i,j) is the marginal column probability;
\mu_x = \sum_{i=1}^{N_g} p_x(i)\, i and \mu_y = \sum_{j=1}^{N_g} p_y(j)\, j are the mean grey level intensities of p_x and p_y;
\sigma_x and \sigma_y are the standard deviations of p_x and p_y;
p_{x+y}(k) = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i,j), where i + j = k and k = 2, 3, …, 2N_g;
p_{x-y}(k) = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i,j), where |i − j| = k and k = 0, 1, …, N_g − 1;
HX = -\sum_{i=1}^{N_g} p_x(i)\log(p_x(i)+\varepsilon) and HY = -\sum_{j=1}^{N_g} p_y(j)\log(p_y(j)+\varepsilon) are the entropies of p_x and p_y;
HXY = -\sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i,j)\log(p(i,j)+\varepsilon) is the entropy of p(i,j);
HXY1 = -\sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i,j)\log(p_x(i)p_y(j)+\varepsilon);
HXY2 = -\sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p_x(i)p_y(j)\log(p_x(i)p_y(j)+\varepsilon).

Here also, each GLCM feature is computed for each angle, with θ taking the values 0° (horizontal), 45° (diagonal) and 90° (vertical), and the mean of these values is returned. The features discussed below are calculated on the resultant matrix; a computation sketch follows the list.

(a) Autocorrelation (autoc) [27]: a measure of the coarseness and fineness of the texture, calculated using Eq. (8); a higher magnitude of autoc indicates more fineness.

autoc = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i,j)\, i j   (8)

(b) Contrast (contr) [27]: a measure of the variation in intensity, computed using Eq. (9); a larger value of contr signifies greater disparity in the intensity values of the image.

contr = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} (i-j)^2 p(i,j)   (9)

(c) Correlation (corrp) [27]: corrp lies between 0 and 1 and indicates the dependency of grey level values on those of the neighbouring voxels, calculated by Eq. (10); a larger value of corrp signifies higher dependency.

corrp = \frac{\sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i,j)\, i j - \mu_x \mu_y}{\sigma_x \sigma_y}   (10)

(d) Cluster prominence (cprom) [27]: defines the level of asymmetry and skewness, calculated by Eq. (11); a higher value signifies more asymmetry around the mean, while a lower value signifies a peak adjacent to the mean with little variation around it.

cprom = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} (i + j - \mu_x - \mu_y)^4 p(i,j)   (11)

(e) Cluster shade (cshad) [27]: a measure of uniformity and skewness, computed using Eq. (12); a higher value of cshad signifies asymmetry about the mean.

cshad = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} (i + j - \mu_x - \mu_y)^3 p(i,j)   (12)

(f) Dissimilarity (dissi) [27]: a measure of the relationship between pairs of pixels with similar and different intensities, calculated using Eq. (13); the higher the value of dissi, the greater the dissimilarity between pairs of pixels.

dissi = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} |i-j|\, p(i,j)   (13)

(g) Energy (energy) [27]: a measure of the magnitude of voxel values in the image, computed using Eq. (14); higher values of energy signify a greater sum of squares.

energy = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} (p(i,j))^2   (14)


(h) Entropy (entro) [27]: a measure of the randomness or uncertainty in the image, computed using Eq. (15); higher values signify an increase in randomness and uncertainty.

entro = -\sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i,j)\log(p(i,j)+\varepsilon)   (15)

(i) Homogeneity (homop) [27]: the metric used to analyse the homogeneity of an image, calculated using Eq. (16); an increase in homop signifies an increase in homogeneity.

homop = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} \frac{p(i,j)}{1 + |i-j|^2}   (16)

(j) Maximum probability (maxpr) [27]: signifies the occurrence of the predominant pair of neighbouring intensity values, extracted using Eq. (17); higher values of maxpr specify a higher occurrence of the predominant pair.

maxpr = \max\, p(i,j)   (17)

(k) Sum of squares variance (sosvh) [28]: a measure of the distribution of neighbouring intensity level pairs around the mean intensity, computed using Eq. (18); an increase in sosvh specifies a wider distribution of pixels around the mean.

sosvh = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} (i-\mu_x)^2 p(i,j)   (18)

(l) Sum average (savgh) [28]: measures the relationship between pairs occurring with higher intensity values and pairs with lower intensity values, computed using Eq. (19); a higher value of savgh signifies greater differences between pairs of pixels.

savgh = \sum_{k=2}^{2N_g} p_{x+y}(k)\, k   (19)

(m) Sum variance (svarh) [28]: a measure of groups of voxels with similar grey-level values, calculated using Eq. (20), where SA denotes the sum average of Eq. (19); a higher value signifies more similarity between the grey-level values.

svarh = \sum_{k=2}^{2N_g} (k - SA)^2 p_{x+y}(k)   (20)

(n) Sum entropy (senth) [28]: the entropy of the sums of neighbourhood intensity values, computed using Eq. (21); an increase in senth indicates an increase in the intensity differences.

senth = -\sum_{k=2}^{2N_g} p_{x+y}(k)\log(p_{x+y}(k)+\varepsilon)   (21)

(o) Difference variance (dvarh) [28]: a heterogeneity measure that places higher weights on intensity level pairs that deviate more from the mean, given by Eq. (22), where DA denotes the difference average of p_{x-y}; an increase in dvarh signifies an increase in heterogeneity.

dvarh = \sum_{k=0}^{N_g-1} (k - DA)^2 p_{x-y}(k)   (22)

(p) Difference entropy (denth) [28]: a measure of the randomness/variability of the neighbourhood intensity differences, computed by Eq. (23); higher values of denth mean more randomness.

denth = -\sum_{k=0}^{N_g-1} p_{x-y}(k)\log(p_{x-y}(k)+\varepsilon)   (23)

(q) Information measure of correlation (inf1h) [28]: inf1h analyses the correlation between the probability distributions over 'i' and 'j' through mutual information, using Eq. (24); in case of no mutual information, the value of inf1h becomes 0.

inf1h = \frac{HXY - HXY1}{\max(HX, HY)}   (24)

(r) Information measure of correlation 2 (inf2h) [28]: inf2h also analyses the correlation between the probability distributions of 'i' and 'j' by quantifying the complexity of the texture, using Eq. (25).

inf2h = \left(1 - \exp[-2(HXY2 - HXY)]\right)^{1/2}   (25)

(s) Inverse difference normalized (indnc) [29]: indnc is another measure of the local homogeneity of an image, calculated by Eq. (26); it normalizes the difference between neighbouring intensity values by dividing by the total number of distinct intensities. Here also, an increase in indnc indicates increased homogeneity.

indnc = \sum_{k=0}^{N_g-1} \frac{p_{x-y}(k)}{1 + k/N_g}   (26)

(t) Inverse difference moment normalized (idmnc) [29]: the degree of local homogeneity of an image, using weights that are the inverse of the contrast weights, with the squared difference between neighbouring intensity values normalized by the square of the total number of distinct intensity values. idmnc is calculated using Eq. (27), and an increase in its value signifies an increase in homogeneity.

idmnc = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} \frac{p(i,j)}{1 + |i-j|^2/N_g^2}   (27)
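Four of the features above map directly onto MATLAB's graycomatrix/graycoprops; the remaining GLCM features require custom sums over p(i,j) following their defining equations. A minimal sketch, with the number of grey levels (8) our assumption:

    % Co-occurrence matrices at 0, 45 and 90 degrees (delta = 1), then averaging.
    offsets = [0 1; -1 1; -1 0];                       % 0, 45, 90 degrees
    glcm = graycomatrix(I, 'Offset', offsets, 'Symmetric', true, 'NumLevels', 8);
    stats = graycoprops(glcm, {'Contrast','Correlation','Energy','Homogeneity'});
    contr  = mean(stats.Contrast);                     % mean over the three angles
    corrp  = mean(stats.Correlation);
    energy = mean(stats.Energy);
    homop  = mean(stats.Homogeneity);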
(iii) First order statistical (FoS) features [30]

FoS features are also used to analyse the texture of an image, by computing the histogram of the image, which gives the probability of occurrence of each pixel value. These features depend only on the individual pixel values and not on the interaction between neighbouring pixels. The first order histogram estimate is given as p(b) = N(b)/M, where b represents the grey level in the image, M is the total number of pixels in the neighbourhood window centred on the pixel of interest, and N(b) is the number of pixels of grey value b in the window, with 0 ≤ b ≤ L − 1. The commonly used FoS features are the mean, standard deviation, variance, kurtosis, skewness and entropy, as described below:

(a) Mean: the average grey level of the region, computed using Eq. (28).

\bar{b} = \sum_{b=0}^{L-1} b\, p(b)   (28)

(b) Standard deviation (std): the measure of variation or dispersion from the mean, defined using Eq. (29).

\sigma_b = \sqrt{\sum_{b=0}^{L-1} (b-\bar{b})^2 p(b)}   (29)

(c) Variance (var): the measure of the spread of the grey levels around the mean value, computed using Eq. (30).

\sigma_b^2 = \sum_{b=0}^{L-1} (b-\bar{b})^2 p(b)   (30)

(d) Kurtosis (kr): a measure of how the 'peak' values are distributed in the ROI of the image; a higher value of kr signifies mass concentrated in the tails rather than around the mean, while a lower value implies mass concentrated in a spike near the mean.

kr = \frac{1}{\sigma_b^4} \sum_{b=0}^{L-1} (b-\bar{b})^4 p(b) - 3   (31)

(e) Skewness (sk): the degree of asymmetry of the image values about the mean; it can be positive or negative depending on the elongation of the tail and the distribution of mass.

sk = \frac{1}{\sigma_b^3} \sum_{b=0}^{L-1} (b-\bar{b})^3 p(b)   (32)

(f) Entropy (ent): the measure of randomness/uncertainty in the image values, calculated using Eq. (33); it measures the average information required to encode the image values, and higher values of ent indicate more randomness.

ent = -\sum_{b=0}^{L-1} p(b)\log p(b)   (33)
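A minimal sketch of Eqs. (28)–(33) from the normalized grey-level histogram of a uint8 image I:

    p  = imhist(I) / numel(I);                    % first order histogram p(b)
    b  = (0:255)';
    mu    = sum(b .* p);                          % mean, Eq. (28)
    sigma = sqrt(sum((b - mu).^2 .* p));          % standard deviation, Eq. (29)
    vr    = sigma^2;                              % variance, Eq. (30)
    kr    = sum((b - mu).^4 .* p) / sigma^4 - 3;  % kurtosis, Eq. (31)
    sk    = sum((b - mu).^3 .* p) / sigma^3;      % skewness, Eq. (32)
    pnz   = p(p > 0);                             % drop empty bins to avoid log(0)
    ent   = -sum(pnz .* log2(pnz));               % entropy, Eq. (33)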


(iv) Higher order cumulant (HOC) features [31]

HOC features are used to analyse the non-stationary, non-linear and non-Gaussian characteristics hidden in the image. Cumulants have certain properties: they keep the same values regardless of permutations of their arguments; the cumulant of a scaled random variable equals the product of all the scale factors times the cumulant; and the cumulant of a sum of independent random processes is the sum of their cumulants. Also, the third and higher order cumulants of a Gaussian are identically zero. Further, the cumulants follow a special symmetry which makes it sufficient to compute them in a sub-domain. HOC uses higher order statistics derived from the moments/correlations of a given image, where moments are particular weights assigned to the image pixels or intensities by a generating function. For an image with random variable X, the moments are given as \mu_n = E(X^n), n = 1, 2, …, ∞, and the moment generating function of X is:

\varphi_X(\lambda) = E(e^{\lambda X}) = 1 + \sum_{n=1}^{\infty} \mu_n \frac{\lambda^n}{n!}, \quad \lambda \in \mathbb{R}   (34)

From Eq. (34) it can be observed that \left[ d^n \varphi_X(\lambda)/d\lambda^n \right]_{\lambda=0} = \mu_n. Further, the cumulant generating function is given as:

\psi_X(\lambda) = \ln \varphi_X(\lambda) = \sum_{n=1}^{\infty} K_n \frac{\lambda^n}{n!}   (35)

Here K_n represents the nth cumulant of the variable X and is given as K_n = \left[ d^n \psi_X(\lambda)/d\lambda^n \right]_{\lambda=0}. On comparing Eqs. (34) and (35), the cumulants can be expressed as K_1 = \mu_1, K_2 = \mu_2 - (\mu_1)^2, etc. Further, for a Gaussian variable X with mean μ and variance σ², \varphi_X(\lambda) = \exp(\lambda\mu + \lambda^2\sigma^2/2), which implies K_n = 0 for n ≥ 3. In this work, the cumulants of order three at orientations of 10°, 50°, 90°, 130° and 180° are used as the features y_cum10, y_cum50, y_cum90, y_cum130 and y_cum180. These orientations were determined by repeated testing at different angles; only these orientations were used, as the differences between the 'normal' and 'abnormal' cases were significant at them, while at other angles the differences were not significant.

(v) Higher order spectral (HOS) features [6]

HOS features are likewise used to analyse the non-stationary, non-linear and non-Gaussian characteristics hidden in the image; they use the phase and amplitude information of a given signal and can be applied to both random processes and deterministic signals. These features are derived from the third-order statistics of the signal, namely the bispectrum, given as B(f_1, f_2) = E[X(f_1) X(f_2) X^*(f_1+f_2)], where X(f) is the Fourier transform of the signal x(nT), n is the integer index, T is the sampling interval and E[·] denotes the expectation operation. Features are calculated by integrating the bispectrum along f_2 = a f_1, where a is the slope. The bispectral invariant P(a), the phase of the integrated bispectrum, is given as P(a) = \arctan(I_i(a)/I_r(a)), where I_r and I_i refer to the real and imaginary parts of the integrated bispectrum. The bispectrum contains information about the shape of the waveform within the window; it is invariant to amplification and shift, and robust to changes in time scale. These features are distributed uniformly and symmetrically about zero in the interval [−π, +π]. The commonly used HOS features are as follows:

(a) Entropy: the measure of the distribution of the spectral power of the image or signal, computed using Eq. (36). It is derived from the probability distribution in the frequency domain and calculates the entropy, also termed the Shannon entropy.

P_h = -\sum_n p(\psi_n)\log p(\psi_n)   (36)

where p(\psi_n) = \frac{1}{L} \sum_{\Omega} 1(\phi(B(f_1,f_2)) \in \psi_n), \psi_n = \{\phi \mid -\pi + 2\pi n/N \le \phi \le -\pi + 2\pi(n+1)/N\}, n = 0, 1, …, N−1; L is the number of points in the region Ω, φ is the phase angle of the bispectrum, and 1(·) is the indicator function, which takes the value 1 when the phase angle lies within the range ψ_n.

(b) Mean: similar to the mean defined in Eq. (28), but here defined in the context of spectral analysis, using Eq. (37).

Mean = \frac{1}{L} \sum_{\Omega} |B(f_1, f_2)|   (37)

(c) Entropy1: the measure of entropy of degree 1, computed using Eq. (38).
Ent1 = -\sum_{n} t_n \log t_n   (38)

where the probability distribution for the entropy of degree 1 is t_n = |B(f_1,f_2)| / \sum_{\Omega} |B(f_1,f_2)|.

(d) Entropy2: the measure of entropy of degree 2, computed using Eq. (39),

Ent2 = -\sum_{n} q_n \log q_n   (39)

where the probability distribution for the entropy of degree 2 is q_n = |B(f_1,f_2)|^2 / \sum_{\Omega} |B(f_1,f_2)|^2.

(e) Entropy3: the measure of entropy of degree 3, computed using Eq. (40),

Ent3 = -\sum_{n} r_n \log r_n   (40)

where the probability distribution for the entropy of degree 3 is r_n = |B(f_1,f_2)|^3 / \sum_{\Omega} |B(f_1,f_2)|^3.

(vi) Discrete wavelet transform (DWT) features [32,33]

The wavelet transform is the division of data, operators or functions into different frequency components, studying each with a resolution matched to its scale. Here the wavelet transform decomposes the image into four components, namely the approximation, horizontal, vertical and diagonal components; the approximation component captures scaling, and the other three capture translation. For images, a separable wavelet transform with a one-dimensional filter bank is applied to the rows and columns of each channel. The scaling function \varphi_{j,m,n}(x,y) and wavelet functions \psi^i_{j,m,n}(x,y) used here for m rows and n columns of an image are given in Eqs. (41) and (42); the three high-pass channels correspond to the horizontal \psi^H(x,y), vertical \psi^V(x,y) and diagonal \psi^D(x,y) functions, extracted along with the scaling function \varphi(x,y). The high-pass channels capture the high-frequency components, i.e. abrupt changes, in the horizontal, vertical and diagonal directions of the image.

\varphi_{j,m,n}(x,y) = 2^{j/2}\, \varphi(2^j x - m,\, 2^j y - n)   (41)

\psi^i_{j,m,n}(x,y) = 2^{j/2}\, \psi^i(2^j x - m,\, 2^j y - n), \quad i = \{H, V, D\}   (42)

Here D, H and V signify the diagonal, horizontal and vertical components of the wavelet transform. The DWT of a function f(x,y) of size M × N is given in Eqs. (43) and (44):

W_\varphi(j_0, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y)\, \varphi_{j_0,m,n}(x,y)   (43)

W^i_\psi(j, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y)\, \psi^i_{j,m,n}(x,y), \quad i = \{H, V, D\}   (44)

Here j_0 is an arbitrary starting scale, W_\varphi(j_0, m, n) denotes the approximation coefficients of f(x,y) at scale j_0, and W^i_\psi(j, m, n) adds the H, V and D coefficients for scales j ≥ j_0. The types of DWT used here, namely daubechies 3 (db3), symlets 3 (sym3), biorthogonal 3.3 (bior3.3), biorthogonal 3.5 (bior3.5), biorthogonal 3.7 (bior3.7) and haar, each with H, V and D components, are given below.

(a) Daubechies (db3: db3H, db3V, db3D)

db3 is the third order Daubechies wavelet filter, characterized through the squared modulus of its transfer function with coefficients h_k. Consider P(y) = \sum_{k=0}^{N-1} C^{N-1+k}_k y^k, wherein C^{N-1+k}_k signifies the binomial coefficients. Then the function m_0(\omega) is given in Eq. (45):

|m_0(\omega)|^2 = \left(\cos^2\frac{\omega}{2}\right)^N P\left(\sin^2\frac{\omega}{2}\right)   (45)

Here m_0(\omega) = \frac{1}{\sqrt{2}} \sum_{k=0}^{2N-1} h_k e^{-ik\omega}, the support length of ψ and φ is 2N − 1, and the number of vanishing moments of ψ is N = 3.

(b) Symlets (sym3: sym3H, sym3V, sym3D)

sym3 is a third order wavelet filter with modifications of db3 to increase its symmetry. The function m_0 is re-used by considering |m_0(\omega)|^2 from Eq. (45) as a function W of z = e^{i\omega}; W has the form W(z) = U(z)U(1/z), and U is selected such that all its roots have modulus greater than or equal to 1.

(c) Biorthogonal (bior3.3: bior3.3H, bior3.3V, bior3.3D; bior3.5: bior3.5H, bior3.5V, bior3.5D; bior3.7: bior3.7H, bior3.7V, bior3.7D)

bior is an extension of the wavelet transform which considers two wavelets instead of one, to overcome the incompatibility between symmetry and exact reconstruction. The first wavelet \tilde{\psi} is used for analysis and the second wavelet ψ performs synthesis, with the coefficients given in Eqs. (46) and (47):

c_{j,k} = \int s(x)\, \tilde{\psi}_{j,k}(x)\, dx   (46)

s = \sum_{j,k} c_{j,k}\, \psi_{j,k}   (47)

The wavelets ψ and \tilde{\psi}, related by duality, satisfy Eqs. (48) and (49):

\int \psi_{j,k}(x)\, \tilde{\psi}_{j',k'}(x)\, dx = 0 \quad \text{if } j \ne j' \text{ or } k \ne k'   (48)

\int \varphi_{0,k}(x)\, \tilde{\varphi}_{0,k'}(x)\, dx = 0 \quad \text{if } k \ne k'   (49)

(d) Haar (haarH, haarV, haarD)

The haar wavelet is the simplest wavelet, comprising the scaling function φ(t) and wavelet function ψ(t) given in Eqs. (50) and (51):

\varphi(t) = \begin{cases} 1 & 0 \le t \le 1 \\ 0 & \text{else} \end{cases}   (50)

\psi(t) = \begin{cases} 1 & 0 \le t \le 1/2 \\ -1 & 1/2 \le t \le 1 \\ 0 & \text{elsewhere} \end{cases}   (51)

Table 2 lists the number and types of features extracted in each category.

Table 2
Number of features.

S.no. | Type | Name | Number
1 | Structural features | CDR, DDLS | 2
2 | GLRM | SRE, LRE, GLN, RP, RLN | 5
3 | GLCM | contr, corrp, autoc, cprom, cshad, dissi, maxpr, sosvh, energy, entro, homop, savgh, svarh, denth, senth, dvarh, inf2h, inf1h, indnc, idmnc | 20
4 | FoS | mean, std, var, kr, sk, ent | 6
5 | HOC | y_cum10, y_cum50, y_cum90, y_cum130, y_cum180 | 5
6 | HOS | Entropy, Mean, Entropy1, Entropy2, Entropy3 | 5
7 | Wavelet | db3H, db3V, db3D, sym3H, sym3V, sym3D, bior3.3H, bior3.3V, bior3.3D, bior3.5H, bior3.5V, bior3.5D, bior3.7H, bior3.7V, bior3.7D, haarH, haarV, haarD | 18
TOTAL number of features | | | 61
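A minimal sketch of the 18 wavelet features of Table 2; the per-subband statistic (mean absolute detail coefficient) is our assumption, since the paper names only the H, V and D subbands of each wavelet:

    names = {'db3', 'sym3', 'bior3.3', 'bior3.5', 'bior3.7', 'haar'};
    feat = zeros(1, 3 * numel(names));
    for k = 1:numel(names)
        [~, cH, cV, cD] = dwt2(double(I), names{k});   % single-level 2-D DWT
        feat(3*k-2 : 3*k) = [mean(abs(cH(:))), mean(abs(cV(:))), mean(abs(cD(:)))];
    end
    % feat now holds the db3H ... haarD entries of Table 2 (3 per wavelet family).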


3.4. Feature ranking/selection

This is the process of extracting the significant features from a large feature set, using different ranking and selection criteria before classification, so as to remove redundant and less significant features from the feature set. We have employed filter techniques, namely IG, GR, correlation and relief feature ranking, followed by a wrapper approach for selection of the significant features. The resulting ranks are summarized in Table 3, and each of the techniques is explained in Sections 3.4.1–3.4.5.

Table 3
Ranking of features.

S.no. | Features | IG | GR | Correlation | Relief | Wrapper
1 | CDR | 1 | 1 | 2 | 2 | 1
2 | DDLS | 2 | 2 | 1 | 1 | 2
3 | homom | 3 | 8 | 8 | 9 | 7
4 | dvarh | 4 | 4 | 7 | 5 | 6
5 | dissi | 5 | 5 | 5 | 6 | 4
6 | contr | 6 | 6 | 6 | 7 | 3
7 | homop | 7 | 7 | 9 | 8 | 5
8 | denth | 8 | 3 | 10 | 10 | 10
9 | idmnc | 9 | 9 | 4 | 3 | 8
10 | indnc | 10 | 10 | 3 | 4 | 9

3.4.1. Information gain (IG) [34]
IG ranks features by the information they carry, i.e. the reduction in entropy (the uncertainty of a random feature) for the predicted class given the corresponding class distributions. For a given set 'S' with 'x' samples and corresponding labels 'y', the IG of the ith feature x_i is computed using Eq. (52):

IG(S, x_i) = H(S) - \sum_{v \in values(x_i)} \frac{|S_{x_i=v}|}{|S|} H(S_{x_i=v})   (52)

where H(S) represents the Shannon entropy, calculated using Eq. (53), and |S_{x_i=v}|/|S| is the fraction of samples whose ith feature takes the value v, used to calculate the conditional entropy:

H(S) = -p_+(S)\log_2 p_+(S) - p_-(S)\log_2 p_-(S)   (53)

Here p_\pm(S) signifies the probability of a sample in set S having label 'y' = ±. The IG scores of the features (i = 1–61) are calculated, and the ones with the highest scores are retained: the ten features with scores of at least 0.291 are considered significant, while the others are discarded. Thus, the significant features according to the IG scores are CDR, DDLS, homom, dvarh, dissi, contr, homop, denth, idmnc and indnc; their ranks are given in Table 3.
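A minimal computational sketch of Eq. (52), assuming a function file info_gain.m, where 'feat' holds a discretized feature column and 'labels' the normal/abnormal class labels:

    function ig = info_gain(feat, labels)
    % INFO_GAIN  Information gain of one feature (Eq. (52)).
        ig = shannon(labels);                     % H(S)
        for v = unique(feat)'                     % subtract conditional entropy
            idx = (feat == v);
            ig = ig - mean(idx) * shannon(labels(idx));
        end
    end

    function h = shannon(y)
    % Shannon entropy of a label vector, with 0*log2(0) taken as 0 (Eq. (53)).
        p = histcounts(categorical(y)) / numel(y);
        p = p(p > 0);
        h = -sum(p .* log2(p));
    end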
triplet sample with < instance Y, NH, NM>, here the NM and NH are
3.4.2. Gain ratio (GR) [35]
Gain ratio is an enhancement of IG which overcomes the bias due to the selection of attributes at each node. It also considers a set 'S' with 'x' samples and corresponding labels 'y', with 'v' distinct values of feature 'i'. The information expected for set S is given by Eq. (54):

I(S) = -\sum_{i=1}^{v} p_i \log_2(p_i)   (54)

where p_i signifies the probability of a random sample. Now, for a specific feature with v distinct values taken into consideration, the entropy is computed using Eq. (55):

H(S) = \sum_{i=1}^{v} \frac{|S_i|}{|S|}\, I(S_i)   (55)

Further, the information gained is given by Eq. (56), and its normalization is performed using Eq. (57):

Gain(S) = I(S) - H(S)   (56)

SplitInfo_S(S) = -\sum_{i=1}^{v} \frac{|S_i|}{|S|} \log_2\!\left(\frac{|S_i|}{|S|}\right)   (57)

Finally, the GR is given by Eq. (58):

GR(S) = Gain(S)/SplitInfo_S(S)   (58)

Thereafter, the GR scores of the features (i = 1–61) are calculated, and the ones with the highest scores are retained; here also, the ten features with scores of at least 0.291 are considered significant, while the others are discarded. Hence, the significant features according to the GR scores are CDR, DDLS, denth, dvarh, dissi, contr, homop, homom, idmnc and indnc; their ranks are given in Table 3.

3.4.3. Correlation [36]
Correlation measures the interconnections between features in a feature set: the relevance of a feature increases with the correlation between class and feature, and decreases with increasing inter-correlation among the features. It is computed using Eq. (59):

r_{zc} = \frac{k\, \bar{r}_{zi}}{\sqrt{k + k(k-1)\, \bar{r}_{ii}}}   (59)

where r_{zc} represents the correlation between the class variable (z) and the feature set (c), k signifies the number of features, \bar{r}_{zi} is the average correlation between z and the features (i) in (c), determined from their sample covariances, and \bar{r}_{ii} is the average inter-correlation between the features in (c). Similar to the above two selection criteria, the features with scores of at least 0.53 are considered significant, whereas the rest, with scores below 0.53, are discarded. The ten significant features accordingly are DDLS, CDR, indnc, idmnc, dissi, contr, dvarh, homom, homop and denth; their ranks are given in Table 3.

3.4.4. Relief feature ranking [37]
Relief feature ranking ranks features based on instance-based learning. It uses a relevancy metric γ with a threshold between 0 and 1 to estimate the numerical stability of attributes with respect to the instances belonging to a class. The method uses the parameters near hit (NH) and near miss (NM) to define the closeness of a given sample to a subset of samples in a class: a sample is an NH if it fits the neighbourhood of instance Y and lies in the same class as Y, whereas it is an NM if it fits the neighbourhood of Y but lies in a class dissimilar to Y. A triplet sample <instance Y, NH, NM> is chosen, with the NH and NM selected using the Euclidean distance. Once the NH and NM are identified, the feature weight vector WV is updated using Eq. (60):

WV_i = WV_i - diff(x_i, NH_i)^2 + diff(x_i, NM_i)^2   (60)

The ranking vector R is then estimated from the triplet samples and the weight vector WV, to signify the rank of each feature, using Eq. (61):

R = \frac{1}{m'}\, WV   (61)

Here, m' is the number of iterations. Similar to the above selection criteria, the features with scores of at least 0.59 are considered significant, whereas the rest, with scores below 0.59, are discarded. The ten significant features accordingly are DDLS, CDR, indnc, idmnc, dissi, contr, dvarh, homom, homop and denth; their ranks are given in Table 3.
3.4.5. Wrapper selection [38]
Wrapper selection evaluates feature subsets using a learning algorithm together with a search strategy, such as forward feature selection, backward feature selection, exhaustive feature selection or bidirectional search, to explore the space of possible feature subsets. A greedy algorithm is evaluated on each subset to determine the best possible combination. The wrapper approach detects interactions between variables and finds the optimal feature subset. In this case, the search strategy employed is bidirectional search, due to its better performance than the other search techniques. The features selected using the wrapper approach are CDR, DDLS, homom, dvarh, dissi, contr, homop, denth, idmnc and indnc; Table 3 lists the order in which the 10 features are selected.
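A minimal sketch of wrapper-style selection; MATLAB's sequentialfs (a sequential, not bidirectional, search) stands in for the search described above, and the wrapped k-NN criterion is our choice, not the paper's:

    % Wrapper selection with a wrapped classifier as the subset criterion.
    critfun = @(Xtr, ytr, Xte, yte) ...
        sum(~strcmp(predict(fitcknn(Xtr, ytr), Xte), yte));   % misclassification count
    selected = sequentialfs(critfun, X, y);   % logical mask over the 61 features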

Further, Table 4 presents the ranges of the selected features for normal and abnormal cases.

Table 4
Range of features.

S.no. | Features | Normal range | Abnormal range
1 | CDR | 0.1–0.3 | 0.3–0.8
2 | DDLS | 0.3–0.5 | 0.01–0.3
3 | indnc | 0.990–0.993 | 0.993–0.997
4 | idmnc | 0.9987–0.9989 | 0.9989–0.9990
5 | dissi | 0.06–0.09 | 0.02–0.06
6 | contr | 0.05–0.09 | 0.02–0.05
7 | dvarh | 0.06–0.09 | 0.03–0.07
8 | homom | 0.95–0.96 | 0.96–0.98
9 | homop | 0.951–0.965 | 0.966–0.977
10 | denth | 0.25–0.35 | 0.10–0.25

3.5. Classification

Classification is a decision-making method which identifies features from images and assigns them labels. It is broadly categorized into unsupervised and supervised classification: unsupervised classification derives the outcome without pre-defined (human) input, whereas supervised classification is based on pre-defined (human) input [39]. The extracted features are ranked and fed into classifiers for categorization into normal and abnormal cases. The classifiers used in this study are as follows.

3.5.1. Naive Bayes (NB) [40]
In the field of medical diagnosis using machine learning, naive Bayes belongs to simple probabilistic classification, which works on the principle of Bayes' theorem with a strong assumption of independence among the features. NB requires a number of parameters linear in the size of the learning problem, as it assigns a label to each problem instance, represented as a vector of features, with the labels drawn from a finite set. It also requires only a small amount of training data for estimation of the parameters significant for classification. It is considered a conditional probability model: a vector x = (x_1, …, x_n) with n features is assigned probabilities t(CC_k | x_1, …, x_n) for each of the 'k' possible outcomes, the classes CC_k. According to Bayes' theorem, the conditional probability is given by Eq. (62):

t(CC_k \mid x) = \frac{t(CC_k)\, t(x \mid CC_k)}{t(x)}   (62)

Here, t(x|CC_k) specifies the likelihood, i.e. the probability of the feature vector 'x' in class 'CC_k'; t(CC_k) is the prior probability of class 'CC_k'; and t(x) is the prior probability of the feature vector 'x'. It can be seen from Eq. (62) that the denominator does not depend on CC and is thus effectively constant. Further, the numerator is identical to the joint probability model t(CC_k, x_1, …, x_n), which can be written as in Eq. (63) using the chain rule for conditional probability:

t(CC_k, x_1, \ldots, x_n) = t(x_1 \mid x_2, \ldots, x_n, CC_k)\, t(x_2 \mid x_3, \ldots, x_n, CC_k) \cdots t(x_{n-1} \mid x_n, CC_k)\, t(x_n \mid CC_k)\, t(CC_k)   (63)

Now, assuming every feature x_i to be conditionally independent of every other feature x_j for j ≠ i within the category CC_k, i.e. t(x_i | x_{i+1}, …, x_n, CC_k) = t(x_i | CC_k), the joint model is given by Eq. (64):

t(CC_k \mid x_1, \ldots, x_n) \propto t(CC_k, x_1, \ldots, x_n) = t(CC_k) \prod_{i=1}^{n} t(x_i \mid CC_k)   (64)

Under the independence assumptions, the conditional probability distribution over the class variable CC is given by Eq. (65):

t(CC_k \mid x_1, \ldots, x_n) = \frac{1}{Z}\, t(CC_k) \prod_{i=1}^{n} t(x_i \mid CC_k)   (65)

Here Z = t(x) = \sum_k t(CC_k)\, t(x \mid CC_k) is a scaling factor dependent on x_1, …, x_n. Finally, the Bayes classifier picks the class with the highest probability, which marks the label \hat{y} = CC_k given by Eq. (66):

\hat{y} = \arg\max_k\, t(CC_k) \prod_{i=1}^{n} t(x_i \mid CC_k)   (66)

Here, the likelihood t(x|CC_k) can be calculated "with kernel density estimation" or "without kernel density estimation". With kernel density estimation it is given as

t(x \mid CC_k) = \frac{1}{N_c h} \sum_{j=1}^{N_c} K(x, x_{j|CC_k}), \quad K(a,b) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(a-b)^2}{2h^2}}

where K signifies a kernel function with mean 0 and variance 1, N_c is the number of input samples x belonging to class CC, x_{j|CC_k} is the value of the jth sample in class CC_k, and h is the smoothing parameter optimized on the training dataset. Without kernel density estimation, for a normal distribution the likelihood is given as

t(x_i \mid CC_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}}\, e^{-\frac{(x_i-\mu_k)^2}{2\sigma_k^2}}

where μ_k and σ_k² are the mean and variance of class k, respectively.
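A minimal sketch comparing this classifier with those of Sections 3.5.2–3.5.4 under 10-fold cross-validation; X is the N × 10 matrix of selected features and y the label vector, and the MATLAB fitting functions and their settings are our illustrative choices (the paper's own parameter settings are given in Section 4):

    mdls = {fitcnb(X, y), ...                                  % naive Bayes
            fitcsvm(X, y, 'KernelFunction', 'polynomial', ...
                    'PolynomialOrder', 2), ...                 % SVM with the kernel of Eq. (70)
            fitcknn(X, y, 'NumNeighbors', 5)};                 % k-NN (k assumed)
    for m = 1:numel(mdls)
        cv = crossval(mdls{m}, 'KFold', 10);
        fprintf('%s accuracy: %.2f%%\n', class(mdls{m}), 100 * (1 - kfoldLoss(cv)));
    end
    % Random forest via bagged decision trees:
    rf = TreeBagger(100, X, y, 'OOBPrediction', 'on');
    fprintf('RF OOB accuracy: %.2f%%\n', 100 * (1 - mean(oobError(rf))));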


3.5.2. Support vector machine (SVM) [41]

SVM uses an optimal linear hyperplane to separate the images of the data set within a feature space. The ideal hyperplane is achieved by maximizing the margin between the two sets; thus, the final hyperplane depends on the border training patterns known as support vectors. It mainly performs two operations: first, the nonlinear mapping of the input vector into a high-dimensional feature space hidden from both input and output; second, the construction of the ideal hyperplane for separating the discovered features.

It considers a vector x drawn from an input space of dimension m0. {φj(x)} for j = 1 to m1 signifies the non-linear transformations from the input space to the feature space, where m1 is the dimension of the feature space. Also, {wj} for j = 1 to m1 denotes a set of linear weights connecting the feature space to the output space, φj(x) represents the input supplied to weight wj through the feature space, b represents the bias, αi is the Lagrange coefficient and di is the corresponding output.

Further, the steps involved in designing the SVM are as follows:

Step 1: Define the hyperplane H acting as the decision surface:

H = Σ_{i=1}^{N} αi di K(x, xi) = 0 (67)

where K(x, xi) = φT(x) φ(xi) signifies the inner product of two vectors in the feature space, using the input pattern xi and the vector x for the ith feature in the feature space. This term is known as the inner-product kernel, where

w = Σ_{i=1}^{N} αi di φ(xi) (68)

φ(x) = [φ0(x), φ1(x), ......, φm1(x)]T (69)

Here N is the total number of features, φ0(x) = 1 for every x, w0 signifies the bias b, and T signifies the transpose operation. Thus, the hyperplane defined in Eq. (67) can also be written as wT φ(x) = 0.

Step 2: The kernel K(x, xi) is required to satisfy Mercer's theorem, and the kernel function is therefore selected as a polynomial for machine learning:

K(x, xi) = (1 + xT xi)^2 (70)

Mercer's theorem is satisfied if K(x, xi) = K(xi, x).

Step 3: The Lagrange multipliers {αi} with i = 1, ...., N maximize the objective function F(α); the optimal values αo,i are obtained from

F(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj K(xi, xj) (71)

subject to the constraints given in Eq. (72) and Eq. (73):

Σ_{i=1}^{N} αi di = 0 (72)

0 ≤ αi ≤ C for i = 1, 2, ...., N (73)

Here C = 1/(2πh), where h determines the trade-off between the hyperplanes and ensures that xi lies on the optimal hyperplane.

Step 4: Finally, the linear weight w0 with the optimal values αo,i is given as:

w0 = Σ_{i=1}^{N} αo,i di φ(xi) (74)

Here, φ(xi) represents the image in the feature space due to xi, while w0 signifies the optimal bias b0. Finally, the x with values of w0 ≥ 1 are considered in one hyperplane and those with w0 < 1 in the other.
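As a hedged illustration of Steps 1–4, the sketch below trains scikit-learn's SVC with the polynomial kernel of Eq. (70); the arrays are hypothetical stand-ins for the extracted hybrid features, not the study's data:

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical feature vectors (rows = images) and outputs di in {-1, +1}
    X = np.array([[0.62, 0.30], [0.58, 0.28], [0.91, 0.55], [0.88, 0.60]])
    d = np.array([-1, -1, 1, 1])

    # Polynomial kernel K(x, xi) = (gamma*<x, xi> + coef0)^degree;
    # Eq. (70) corresponds to gamma = 1, coef0 = 1, degree = 2
    clf = SVC(kernel="poly", gamma=1.0, coef0=1.0, degree=2)
    clf.fit(X, d)

    # The border training patterns retained as support vectors define the hyperplane
    print(clf.support_vectors_)
    print(clf.predict([[0.90, 0.58]]))  # -> [1]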

3.5.3. Random forest (RF) [42]

RF is an ensemble learning approach for classification that constructs a multitude of decision trees at training time and outputs the class given by the mode (or, for regression, the mean) of the individual trees. The decision tree is a popular approach for machine learning, serving as an off-the-shelf process for data mining. The algorithm for training an RF uses the general technique of bagging, or bootstrap aggregation, to make the trees learn. For a training set X = x1, ......, xn with labels Y = y1, ......, yn, bagging repeatedly (B times) selects a random sample with replacement from the training set and fits trees to these samples as given below:

i. For b = 1 to B:
 a. Create a bootstrap sample X* of size N from the training data.
 b. Grow a random forest tree fb from the bootstrapped data by repeating the following steps at each terminal node of the tree, until the minimum node size nmin is achieved:
 • Select m random variables from x.
 • Pick the best split-point variable among the m.
 • Split the node into sub-nodes.
ii. Output the ensemble of trees {fb}1^B.

Now, to make a prediction for a new variable x′, regression is performed by averaging the individual predictions of the ensemble trees:

f̂ = (1/B) Σ_{b=1}^{B} fb(x′) (75)

Further, for the estimation of uncertainty, a prediction can be accompanied by the standard deviation of the individual predictions on x′:

σ = √( Σ_{b=1}^{B} (fb(x′) − f̂)² / (B − 1) ) (76)

Finally, classification is performed using CBrf(x) = majority vote {Cb(x)}1^B, where Cb(x) is the class prediction of the bth tree in the random forest.
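A compact sketch of the bagging procedure above, using scikit-learn with hypothetical arrays (30 trees, matching the setting later reported in Table 5):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical features and labels (0 = normal, 1 = glaucoma)
    X = np.array([[0.62, 0.30], [0.58, 0.28], [0.91, 0.55], [0.88, 0.60]])
    y = np.array([0, 0, 1, 1])

    # Each tree fb is grown on a bootstrap sample; m features are tried at each split
    rf = RandomForestClassifier(n_estimators=30, max_features="sqrt",
                                bootstrap=True, random_state=0)
    rf.fit(X, y)

    # Classification is the majority vote over the ensemble {Cb(x)}
    print(rf.predict([[0.90, 0.58]]))  # -> [1]
    # Per-tree predictions, whose mean and spread mirror Eqs. (75)-(76)
    votes = np.array([tree.predict([[0.90, 0.58]])[0] for tree in rf.estimators_])
    print(votes.mean(), votes.std())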
3.5.4. K nearest neighbor (k-NN) [43]

In pattern recognition, k-nearest neighbor is a common non-parametric approach used for the recognition of statistical patterns, which forms a finite partition of the sample space for any observation x′ classified into the jth class. The procedural steps followed for the classification of x′ are:

i. Find the euclidean distance between x′ and all other observations x in the feature set X as √((x − x′)²).
ii. Choose the k minimum distances.
iii. Choose the most common class among these k distances.
iv. Assign this common class to the new item.

For a point x′ in the d-dimensional space of the feature set X, the function fx′(x) : ℝd → ℝ is computed based on the euclidean metric fx′(x) = ||x − x′||. Further, the entire training set comprising the X samples is ordered with

12
N. Thakur and M. Juneja Biomedical Signal Processing and Control 62 (2020) 102137

respect to x′, corresponding to a reordering function rx′ : {1, ......, N} → {1, ......, N} over the different distances, which reorders the indexing of the N training points. Thus, the function is defined recursively to obtain the minimum distances for nearest neighbours j = 1 to N using Eq. (77):

rx′(1) = argmin_i fx′(xi) with i ∈ {1, ......, N}
rx′(j) = argmin_i fx′(xi) with i ∈ {1, ......, N} and i ≠ rx′(1), ......, i ≠ rx′(j − 1), for j = 2, ......, N (77)

Here, xrx′(j) is the point of the set X in jth position in terms of distance from x′, known as the jth nearest neighbor, while fx′(xrx′(j)) = ||xrx′(j) − x′|| is its distance from x′. In other words, it can be stated that j < k ⇒ fx′(xrx′(j)) ≤ fx′(xrx′(k)). Hence, the decision rule for the k-NN classifier to perform binary classification is given as:

kNN(x′) = sign( Σ_{i=1}^{k} yrx′(i) ) (78)

Here, yrx′(i) ∈ {−1, +1} signifies the class label of the ith nearest sample in the training set, and 'sign' assigns to x′ the class label with the maximum number of occurrences amongst the k nearest samples. If the k-NN is instead performing regression, the average of the k nearest responses is computed.
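The ordering of Eq. (77) and the voting rule of Eq. (78) reduce to a few lines of NumPy; a sketch with hypothetical data:

    import numpy as np

    def knn_predict(X, y, x_new, k=5):
        # Eq. (77): rank training points by euclidean distance from x_new
        order = np.argsort(np.linalg.norm(X - x_new, axis=1))
        # Eq. (78): sign of the sum of the k nearest labels in {-1, +1}
        return int(np.sign(y[order[:k]].sum()))

    X = np.array([[0.62, 0.30], [0.58, 0.28], [0.91, 0.55], [0.88, 0.60], [0.93, 0.52]])
    y = np.array([-1, -1, 1, 1, 1])
    print(knn_predict(X, y, np.array([0.90, 0.58]), k=3))  # -> 1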
3.5.5. Neural network (NN) [44]

Neural network is a computational approach inspired by the function and structure of biological neurons. Information flows through the structure of the NN in such a way that the network learns from the correspondence between input and output. This model is commonly used for nonlinear statistical data with complex relations between input and output. A multilayer perceptron (MLP) includes three interconnected layers: the initial input layer, the hidden layer and the final output layer. Every node or neuron uses a non-linear activation function, and a supervised approach of learning is utilized for training the classifier. The activation functions used here are the sigmoids, given as

y(vi) = tanh(vi) and y(vi) = (1 + e^(−vi))^(−1) (79)

The first activation is the hyperbolic tangent, lying in the range −1 to 1, whereas the second is the logistic function, which lies in the range 0–1. Here, yi signifies the output of the ith node and vi represents the weighted sum of the input connections. As the MLP is a fully connected architecture, each individual node in a layer connects with a weight wij to each node of the following layer. Learning is performed by changing the weights on the basis of corrections that minimize the error over the entire output, using

ε(n) = (1/2) Σ_j ej²(n) (80)

Here, ej(n) = dj(n) − yj(n) is the degree of error at output node j for the nth training sample, where d is the expected output and y is the predicted output. Further, the change in each weight under gradient descent is given as Δwji(n) = −η (∂ε(n)/∂vj(n)) yi(n), where yi = wij * xi + b (with b the bias) is the output of the previous neuron and η is the learning rate, chosen so that the weights converge quickly to a response without oscillations. The weight update can thus be performed as wij = wij + η * ej * x.

Now, the derivative to be calculated depends on the induced local field vj, and for an output node it can be computed using Eq. (81):

−∂ε(n)/∂vj(n) = ej(n) φ′(vj(n)) (81)

Here, φ′ represents the derivative of the activation function. But, as the analysis of the change in weights for the hidden layers is more complex, the derivative there can be represented as:

−∂ε(n)/∂vj(n) = φ′(vj(n)) Σ_k (−∂ε(n)/∂vk(n)) wkj(n) (82)

Finally, it depends on the change in weights of the kth nodes representing the output layer.
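The update of Eqs. (80)–(82) for a single output neuron with the logistic activation can be sketched as follows (toy values, not the trained network of this study):

    import numpy as np

    def logistic(v):
        return 1.0 / (1.0 + np.exp(-v))

    eta = 0.3                          # learning rate, as chosen later in Table 5
    w, b = np.array([0.1, -0.2]), 0.05
    x, d = np.array([0.9, 0.6]), 1.0   # hypothetical input and expected output

    v = w @ x + b                      # induced local field vj
    y_hat = logistic(v)                # predicted output yj
    e = d - y_hat                      # error ej(n) of Eq. (80)
    delta = e * y_hat * (1.0 - y_hat)  # ej(n) * phi'(vj(n)), Eq. (81)
    w = w + eta * delta * x            # gradient-descent weight change
    b = b + eta * delta
    print(w, b)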

3.6. Performance metrics [45]

The performance of the extracted features is evaluated on the different classifiers using performance metrics such as accuracy, precision, sensitivity and specificity. Further, a matrix consisting of the True negative (Tn), True positive (Tp), False negative (Fn) and False positive (Fp) counts, with the labels glaucoma and non-glaucoma as shown in Fig. 6, is created for the computation of the performance metrics.

Here, Tp signifies that a retinal image is diseased, and is also predicted as diseased.
Fp signifies that a retinal image is non-diseased, but is predicted as diseased.
Tn signifies that a retinal image is non-diseased, and is also predicted as non-diseased.
Fn signifies that a retinal image is diseased, but is predicted as non-diseased.

The performance parameters, namely accuracy, precision, sensitivity and specificity, calculated using the matrix shown in Fig. 6 are as follows:

(i) Accuracy: Accuracy is the proportion of correctly predicted images among all images and is given as:

Accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn) × 100 (83)

(ii) Precision: Precision is the probability that images predicted as diseased actually belong to the diseased class and is given as:

Precision = Tp / (Tp + Fp) (84)


(iii) Sensitivity: Sensitivity is the probability of correctly predicted diseased occurrences to the total number of diseased occurrences and is given as:

Sensitivity = Tp / (Tp + Fn) (85)

(iv) Specificity: Specificity is the probability of correctly predicted non-diseased occurrences to the total number of non-diseased occurrences and is given as:

Specificity = Tn / (Tn + Fp) (86)
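Eqs. (83)–(86) follow directly from the four counts of the confusion matrix; a minimal sketch with hypothetical counts:

    # Hypothetical confusion-matrix counts on a test set
    Tp, Tn, Fp, Fn = 17, 18, 1, 0

    accuracy = 100.0 * (Tp + Tn) / (Tp + Tn + Fp + Fn)  # Eq. (83)
    precision = Tp / (Tp + Fp)                          # Eq. (84)
    sensitivity = Tp / (Tp + Fn)                        # Eq. (85)
    specificity = Tn / (Tn + Fp)                        # Eq. (86)
    print(accuracy, precision, sensitivity, specificity)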
Fig. 7. Accuracy of each algorithm with varying parameters.

Table 5
Parameters used for the classifiers.

S. No. | Classifier | Parameter setting
1 | NB | Without kernel estimation function
2 | SVM | With polynomial kernel (gamma*u*v + coef0)^degree, gamma = 1.25
3 | RF | With 30 trees
4 | k-NN | With k = 5 and the euclidean distance function
5 | NN | Multilayer perceptron with learning rate = 0.3, validation threshold = 20, momentum = 0.2 and 6 hidden layers

4. Results and discussions

This section highlights the performance analysis of the selected features and of the classifiers used. To optimize the performance of each classifier model, different parameters were tuned to attain better classification accuracy, and the appropriate values were selected. It presents a comprehensive explanation of the results obtained from the different feature selection/feature ranking experiments and records the outcomes for the distinct performance metrics.

Table 6
Performance analysis for split method.
Classifier/Performance Accuracy (in%) Specificity Precision Sensitivity

61 features 10 features 61 features 10 features 61 features 10 features 61 features 10 features

k-NN 83.3 ± 4.6 91.6 ± 3.3 0.83 ± 0.02 0.90 ± 0.02 0.84 ± 0.03 0.91 ± 0.03 0.83 ± 0.02 0.91 ± 0.02
NN 88.8 ± 5.2 94.4 ± 4.1 0.87 ± 0.03 0.93 ± 0.03 0.88 ± 0.04 0.95 ± 0.04 0.88 ± 0.02 0.94 ± 0.02
SVM 91.6 ± 3.3 97.2 ± 3.1 0.89 ± 0.02 0.96 ± 0.02 0.91 ± 0.03 0.97 ± 0.03 0.91 ± 0.01 0.97 ± 0.01
RF 88.8 ± 5.2 94.4 ± 4.1 0.87 ± 0.03 0.93 ± 0.03 0.88 ± 0.04 0.95 ± 0.04 0.88 ± 0.02 0.94 ± 0.02
NB 76.3 ± 4.9 89.6 ± 4.3 0.74 ± 0.03 0.89 ± 0.03 0.76 ± 0.03 0.89 ± 0.03 0.76 ± 0.01 0.88 ± 0.01

Fig. 8. Plots of Accuracy with split method for different classifiers.


Fig. 9. Plots of Specificity, Precision, Sensitivity for different classifiers using split method.

Table 7
Performance analysis for 3-fold cross validation.

Classifier | Accuracy (in %) [61 feat. / 10 feat.] | Specificity [61 / 10] | Precision [61 / 10] | Sensitivity [61 / 10]
k-NN | 79.9 ± 5.5 / 88.6 ± 5.1 | 0.76 ± 0.04 / 0.88 ± 0.04 | 0.78 ± 0.03 / 0.87 ± 0.03 | 0.78 ± 0.02 / 0.87 ± 0.02
NN | 83.3 ± 4.6 / 93.1 ± 4.2 | 0.81 ± 0.03 / 0.92 ± 0.03 | 0.82 ± 0.03 / 0.91 ± 0.03 | 0.81 ± 0.02 / 0.91 ± 0.02
SVM | 90.3 ± 3.8 / 96.5 ± 3.3 | 0.89 ± 0.02 / 0.95 ± 0.02 | 0.91 ± 0.02 / 0.96 ± 0.02 | 0.90 ± 0.01 / 0.95 ± 0.01
RF | 83.3 ± 4.6 / 93.1 ± 4.2 | 0.81 ± 0.03 / 0.92 ± 0.03 | 0.82 ± 0.03 / 0.91 ± 0.03 | 0.81 ± 0.02 / 0.91 ± 0.02
NB | 75.8 ± 5.8 / 85.3 ± 5.3 | 0.73 ± 0.04 / 0.85 ± 0.04 | 0.75 ± 0.04 / 0.84 ± 0.04 | 0.74 ± 0.03 / 0.84 ± 0.03

Table 8
Performance analysis for 5-fold cross validation.

Classifier | Accuracy (in %) [61 feat. / 10 feat.] | Specificity [61 / 10] | Precision [61 / 10] | Sensitivity [61 / 10]
k-NN | 76.5 ± 6.1 / 85.6 ± 5.7 | 0.75 ± 0.04 / 0.84 ± 0.04 | 0.74 ± 0.03 / 0.84 ± 0.03 | 0.74 ± 0.02 / 0.83 ± 0.02
NN | 79.2 ± 5.6 / 88.6 ± 5.1 | 0.78 ± 0.03 / 0.87 ± 0.03 | 0.78 ± 0.03 / 0.86 ± 0.03 | 0.77 ± 0.02 / 0.86 ± 0.02
SVM | 87.7 ± 4.2 / 93.2 ± 3.9 | 0.87 ± 0.02 / 0.92 ± 0.02 | 0.86 ± 0.02 / 0.92 ± 0.02 | 0.86 ± 0.01 / 0.91 ± 0.01
RF | 79.2 ± 5.6 / 88.6 ± 5.1 | 0.78 ± 0.03 / 0.87 ± 0.03 | 0.78 ± 0.03 / 0.86 ± 0.03 | 0.77 ± 0.02 / 0.86 ± 0.02
NB | 72.9 ± 6.3 / 79.8 ± 6.1 | 0.71 ± 0.04 / 0.79 ± 0.04 | 0.70 ± 0.04 / 0.78 ± 0.04 | 0.71 ± 0.03 / 0.79 ± 0.03

Fig. 10. Plots of Accuracy with cross validation method for different classifiers.

Fig. 11. Plots of Specificity, Precision, Sensitivity for different classifiers using 3-fold cross method.

Fig. 12. Plots of Specificity, Precision, Sensitivity for different classifiers using 5-fold cross method.

4.1. Parameter selection for classifiers

For the parameter selection of the classifiers, different parameters were varied to check for optimal performance. The parameters used in NB were "with kernel estimation" and "without kernel estimation". The accuracy of NB "with kernel estimation" was 86.32% and that "without kernel estimation" was 89.65%. Further, the kernels used in SVM were the radial basis function exp(-gamma*|u-v|^2), the polynomial function (gamma*u*v + coef0)^degree and the sigmoid function tanh(gamma*u*v + coef0), where gamma = 1.25. The accuracy with the radial basis function was 94.4%, with the polynomial function 97.2% and with the sigmoid function 94.4%. Thereafter, the parameter in the case of random forest was the number of trees (n), whose value was varied from 10 to 50. The accuracy at n = 10 was observed to be 91.6%, at n = 30 it was 94.4% and at n = 50 it was 90.2%. Thereafter, in the case of k-NN, the value of k was varied from 3 to 7. The accuracy for k = 3 was 86.9%, for k = 4 was 89.6%, for k = 5 was 91.6%, for k = 6 was 89.6% and for k = 7 was 87.3%. Whereas, in the case of NN, the number of hidden layers (l) was varied from 1 to 10; the accuracy for l = 1–5 was 92.6%, for l = 6–8 was 94.4% and for l = 9–10 was 91.3%. Fig. 7 shows the accuracy of each classifier with changing parameters. Finally, Table 5 presents the optimal values used for classification.
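Each sweep described above is a small one-dimensional grid search; the sketch below illustrates the k-NN sweep with scikit-learn on hypothetical data (the exact tooling used in the study is not specified here):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.random((40, 10))        # hypothetical feature matrix
    y = rng.integers(0, 2, 40)      # hypothetical labels

    # Vary k from 3 to 7, as in the text, and keep the best-scoring value
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=3).mean()
              for k in range(3, 8)}
    print(scores, max(scores, key=scores.get))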
4.2. Performance analysis

This section presents the analysis of the different classifiers, namely NB, SVM, RF, k-NN and NN, on two types of feature sets. The first is the full set of 61 features consisting of structural and non-structural features. The second is the reduced set of 10 significant features shown in Table 3, extracted from the 61 features using the different ranking and selection criteria discussed in section 3.2. Further, the analysis has been performed with a split method and with k-fold cross validation for both sets of features.

4.2.1. Split method

The dataset was here divided into training and testing sets using a split method: 70% of the dataset was used for training and 30% for testing, without any overlap between the training and testing sets. The 70% training images comprise 50% normal cases, i.e. ones without glaucoma, and 50% abnormal cases, i.e. ones with glaucoma. Similarly, the 30% testing images comprise 50% glaucoma cases and 50% non-glaucoma cases. Analysis was then performed for both feature sets using the different classifiers. Table 6 shows the values of accuracy, specificity, precision and sensitivity of the k-NN, NN, SVM, RF and NB classifiers for the full set of 61 features and the reduced set of 10 features. Further, Fig. 8 shows the plots of accuracy for the full set of 61 hybrid features and the reduced set of 10 hybrid features, while Fig. 9 presents the plots of specificity, precision and sensitivity for both sets of features.
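The class-balanced 70/30 partition described above corresponds to a stratified split; a sketch with hypothetical arrays:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(100, 10)   # hypothetical feature matrix
    y = np.repeat([0, 1], 50)     # 50 normal, 50 glaucoma labels

    # 70% training / 30% testing, stratified so both parts keep the 50/50 balance
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)
    print(np.bincount(y_tr), np.bincount(y_te))  # -> [35 35] [15 15]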
Thus, from Table 6 and Figs. 8–9 it can be observed that the accuracy of k-NN for 61 features was 83.3% and that for 10 features was 91.6%, while the values of specificity, precision and sensitivity for 61 features were 0.83, 0.84 and 0.83, and those for 10 features were 0.90, 0.91 and 0.91.

Similarly, the accuracy of NN and RF for 61 features was 88.8% and that for 10 features was 94.4%. The values of specificity, precision and sensitivity for 61 features were 0.87, 0.88 and 0.88, and those for 10 features were 0.93, 0.95 and 0.94.

Further, the accuracy of SVM for 61 features was 91.6% and that for 10 features was 97.2%, while the values of specificity, precision and sensitivity for 61 features were 0.89, 0.91 and 0.91, and those for 10 features were 0.96, 0.97 and 0.97.


Table 9
Comparison of the proposed feature set with state-of-the-art feature sets (dataset: RIM-ONE and Drishti-GS; classifier: SVM).

Approach | Features extracted | Classification accuracy (split approach) | Classification accuracy (5-fold cross validation)
Sumeet et al. [7] | Wavelet features | 93% | 87.4%
Rama Krishnan et al. [8] | Wavelet + HOS features | 95% | 91.6%
Issac et al. [11] | Structural features | 94.1% | 89.2%
Salem et al. [12] | GLCM, structural features | 92% | 87.1%
Haleem et al. [13] | Spatial, texture, Gaussian, Gabor and wavelet features | 94.4% | 89.8%
Singh et al. [15] | Wavelet features | 94.7% | 90.3%
De Sousa et al. [16] | Geostatistical texture features | 91.2% | 85.3%
Renukalata et al. [21] | GLCM, structural features | 94% | 89.2%
Proposed feature set | Hybrid features | 97.22% | 93.2%

Fig. 13. Plots of accuracy for different feature sets on SVM classifier.

Also, the accuracy of NB for 61 features was 76.3% and that for 10 features was 89.6%, whereas the values of specificity, precision and sensitivity for 61 features were 0.74, 0.76 and 0.76, and those for 10 features were 0.89, 0.89 and 0.88 respectively.

Thus, it can be concluded that the proposed set of reduced hybrid features outperforms the full set of 61 features for all the classifiers. Also, the SVM classifier is found to outperform the other classifiers, with the highest accuracy for both sets of features.
4.2.2. K-fold cross validation method

Here, the dataset is divided into k parts, with k−1 parts used for training and the remaining part for testing in each iteration. The complete process is iterated k times, with every fold selected arbitrarily so as to avoid bias from the selection of specific samples. Both 3-fold and 5-fold cross validation were performed on the entire dataset, and the analyses performed for both feature sets using the different classifiers are shown in Tables 7 and 8, followed by the graph plots in Figs. 10–12.
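The 3-fold and 5-fold protocols can be sketched as below; StratifiedKFold keeps the class balance within every fold (hypothetical data, not the study's images):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.svm import SVC

    X = np.random.rand(100, 10)   # hypothetical features
    y = np.repeat([0, 1], 50)     # hypothetical balanced labels

    for folds in (3, 5):
        cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
        acc = cross_val_score(SVC(kernel="poly", degree=2, coef0=1.0), X, y, cv=cv)
        print("%d-fold accuracy: %.3f +/- %.3f" % (folds, acc.mean(), acc.std()))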
On the basis of the experimental analysis in Tables 7–8 and Figs. 10–12, it can be observed that the accuracy of k-NN for 61 features was 79.9% with 3-fold cross validation and 76.5% with 5-fold cross validation, and that for 10 features was 88.6% with 3-fold cross validation and 85.6% with 5-fold cross validation. The values of specificity, precision and sensitivity for 3-fold cross validation with 61 features were 0.76, 0.78 and 0.78, and with 10 features were 0.88, 0.87 and 0.87. Similarly, the values of specificity, precision and sensitivity for 5-fold cross validation with 61 features were 0.75, 0.74 and 0.74, and with 10 features were 0.84, 0.84 and 0.83.

Also, the accuracy of NN and RF for 61 features was observed to be 83.3% with 3-fold cross validation and 79.2% with 5-fold cross validation, and that for 10 features was 93.1% with 3-fold cross validation and 88.6% with 5-fold cross validation. The values of specificity, precision and sensitivity for 3-fold cross validation with 61 features were 0.81, 0.82 and 0.81, and with 10 features were 0.92, 0.91 and 0.91. Similarly, the values of specificity, precision and sensitivity for 5-fold cross validation with 61 features were 0.78, 0.78 and 0.77, and with 10 features were 0.87, 0.86 and 0.86.

Thereafter, the accuracy of SVM for 61 features was 90.3% with 3-fold cross validation and 87.7% with 5-fold cross validation, and that for 10 features was 96.5% with 3-fold cross validation and 93.2% with 5-fold cross validation. The values of specificity, precision and sensitivity for 3-fold cross validation with 61 features were 0.89, 0.91 and 0.90, and with 10 features were 0.95, 0.96 and 0.95. Similarly, the values of specificity, precision and sensitivity for 5-fold cross validation with 61 features were 0.87, 0.86 and 0.86, and with 10 features were 0.92, 0.92 and 0.91.

Finally, the accuracy of NB for 61 features was observed to be 75.8% with 3-fold cross validation and 72.9% with 5-fold cross validation, and that for 10 features was 85.3% with 3-fold cross validation and 79.8% with 5-fold cross validation. The values of specificity, precision and sensitivity for 3-fold cross validation with 61 features were 0.73, 0.75 and 0.74, and with 10 features were 0.85, 0.84 and 0.84. Similarly, the values of specificity, precision and sensitivity for 5-fold cross validation with 61 features were 0.71, 0.70 and 0.71, and with 10 features were 0.79, 0.78 and 0.79. Thus, it can be concluded that the proposed set of reduced hybrid features outperforms the full set of 61 features for all the classifiers, and that the SVM classifier outperforms the other classifiers with the highest accuracy for both sets of features.

Now, as the SVM classifier is found to outperform all the other classifiers, a comparison of the state-of-the-art feature sets with the proposed feature set for the SVM classifier is presented in Table 9 and Fig. 13, with plots of accuracies to analyse the proposed feature set comprising 10 features.


On the basis of the results presented in Table 9 and Fig. 13, it can be observed that the proposed feature set gives an accuracy of 97.22% using the split method and 93.2% using 5-fold cross validation on the SVM classifier. Thus, the proposed feature set gives higher accuracy than the other state-of-the-art feature sets, and is therefore the most suitable for classification of retinal images as 'normal' or 'abnormal'.

5. Conclusion

This work presented a machine learning based classification using a reduced set of proposed hybrid features, comprising CDR, DDLS, homom, dvarh, dissi, contr, homop, denth, idmnc and indnc, extracted from a set of 61 hybrid features for better diagnosis of glaucoma using retinal images. In the existing studies, the features, classifiers and datasets used vary from study to study, and very few studies have employed feature selection using appropriate ranking schemes. Thus, this study extracts all possible feature sets based on the state of the art, selects the significant features using five feature ranking schemes, and performs classification with varied classifiers on the same dataset. The methodology used comprises pre-processing to remove the outliers; segmentation of the optic disc and optic cup to find CDR and DDLS; feature extraction to obtain all the possible features; and feature ranking/selection to retain only the significant features, followed by classification of retinal images as 'normal' or 'abnormal' based on training with the features as input and the labels as output. A total of 61 features consisting of structural and non-structural features such as DDLS, CDR, GLRM, GLCM, HOS, FoS, HOC and Wavelets was extracted from the input retinal images. These features were then reduced using the IG, GR, Correlation, Relief and Wrapper approaches, and were fed into the k-NN, NN, SVM, RF and NB classifiers for classifying the input images as abnormal or normal. Based on the experimental analysis, the proposed feature set was found to outperform the full set of 61 features on all classifiers. Also, the ranking/selection reduced the training time of the classifiers. Further, the performance of SVM was found to be the most suitable, as it offered better accuracy than k-NN, NN, RF and NB. Thus, the study conducted can also be used as a second opinion by medical practitioners in glaucoma diagnosis for large-scale clinical testing. In the near future, it can be further extended and automated to improve accuracy using deep learning approaches with a larger number of images.

Author statement

It is certified that all the authors have contributed equally to the manuscript.

Declaration of Competing Interest

The authors report no declarations of interest.

References

[1] Glaucoma Facts and Stats, available at: https://www.glaucoma.org/glaucoma/glaucoma-facts-and-stats.php (Accessed 20 October 2018).
[2] A. Narula, V. Rajshekhar, S. Singh, S. Chakarvarty, An epidemiological study (cross-sectional study) of glaucoma in a semi-urban population of Delhi, J. Clin. Exp. Ophthalmol. 8 (2017) 686, https://doi.org/10.4172/2155-9570.1000686.
[3] R. Kaur, M. Juneja, A.K. Mandal, Computer-aided diagnosis of renal lesions in CT images: a comprehensive survey and future prospects, Comput. Electr. Eng. (August) (2018).
[4] Glaucoma Research Foundation: Optic nerve cupping, available at: http://www.glaucoma.org/treatment/optic-nerve-cupping.php (Accessed 20 October 2018).
[5] N. Thakur, M. Juneja, Survey of classification approaches for glaucoma diagnosis from retinal images, in: Advanced Computing and Communication Technologies, Springer, Singapore, 2018, pp. 91–99.
[6] U.R. Acharya, S. Dua, X. Du, C.K. Chua, Automated diagnosis of glaucoma using texture and higher order spectra features, IEEE Trans. Inf. Technol. Biomed. 15 (May (3)) (2011) 449–455.
[7] S. Dua, U.R. Acharya, P. Chowriappa, S.V. Sree, Wavelet-based energy features for glaucomatous image classification, IEEE Trans. Inf. Technol. Biomed. 16 (January (1)) (2012) 80–87.
[8] M.R. Mookiah, U.R. Acharya, C.M. Lim, A. Petznick, J.S. Suri, Data mining technique for automated diagnosis of glaucoma using higher order spectra and wavelet energy features, Knowledge Based Syst. 33 (September) (2012) 73–82.
[9] K.P. Noronha, U.R. Acharya, K.P. Nayak, R.J. Martis, S.V. Bhandary, Automated classification of glaucoma stages using higher order cumulant features, Biomed. Signal Process. Control 10 (March) (2014) 174–183.
[10] U.R. Acharya, E.Y. Ng, L.W. Eugene, K.P. Noronha, L.C. Min, K.P. Nayak, S.V. Bhandary, Decision support system for the glaucoma using Gabor transformation, Biomed. Signal Process. Control 15 (January) (2015) 18–26.
[11] A. Issac, M.P. Sarathi, M.K. Dutta, An adaptive threshold-based image processing technique for improved glaucoma detection and classification, Comput. Methods Programs Biomed. 122 (November (2)) (2015) 229–244.
[12] A.A. Salam, T. Khalil, M.U. Akram, A. Jameel, I. Basit, Automated detection of glaucoma using structural and non structural features, Springerplus 5 (December (1)) (2016) 1519.
[13] M.S. Haleem, L. Han, J. Van Hemert, A. Fleming, L.R. Pasquale, P.S. Silva, B.J. Song, L.P. Aiello, Regional image features model for automatic classification between normal and glaucoma in fundus and scanning laser ophthalmoscopy (SLO) images, J. Med. Syst. 40 (June (6)) (2016) 132.
[14] M. Claro, L. Santos, W. Silva, F. Araújo, N. Moura, A. Macedo, Automatic glaucoma detection based on optic disc segmentation and texture feature extraction, CLEI Electron. J. 19 (August (2)) (2016) 5.
[15] A. Singh, M.K. Dutta, M. ParthaSarathi, V. Uher, R. Burget, Image processing based automatic diagnosis of glaucoma using wavelet features of segmented optic disc from fundus image, Comput. Methods Programs Biomed. 124 (February) (2016) 108–120.
[16] J.A. de Sousa, A.C. de Paiva, J.D. de Almeida, A.C. Silva, G.B. Junior, M. Gattass, Texture based on geostatistic for glaucoma diagnosis from fundus eye image, Multimed. Tools Appl. 76 (September (18)) (2017) 19173–19190.
[17] J.E. Koh, U.R. Acharya, Y. Hagiwara, U. Raghavendra, J.H. Tan, S.V. Sree, S.V. Bhandary, A.K. Rao, S. Sivaprasad, K.C. Chua, A. Laude, Diagnosis of retinal health in digital fundus images using continuous wavelet transform (CWT) and entropies, Comput. Biol. Med. 84 (May) (2017) 89–97.
[18] A. Septiarini, D.M. Khairina, A.H. Kridalaksana, H. Hamdani, Automatic glaucoma detection method applying a statistical approach to fundus images, Healthc. Inform. Res. 24 (January (1)) (2018) 53–60.
[19] D. Selvathi, N.B. Prakash, V. Gomathi, G.R. Hemalakshmi, Fundus image classification using wavelet based features in detection of glaucoma, Biomed. Pharmacol. J. 11 (June (2)) (2018) 795–805.
[20] D.C. Shubhangi, N. Parveen, A dynamic ROI based glaucoma detection and region estimation technique, Int. J. Computer Sci. Mobile Comput. 8 (August (8)) (2019) 82–86.
[21] S. Renukalatha, K.V. Suresh, Classification of glaucoma using simplified-multiclass support vector machine, Biomed. Eng. Appl. Basis Commun. 31 (October (05)) (2019) 1950039.
[22] J. Sivaswamy, S.R. Krishnadas, G.D. Joshi, M. Jain, A.U. Tabish, Drishti-GS: retinal image dataset for optic nerve head (ONH) segmentation, in: 2014 IEEE 11th International Symposium on Biomedical Imaging (ISBI), IEEE, 2014, pp. 53–56.
[23] F. Fumero, S. Alayón, J.L. Sanchez, J. Sigut, M. Gonzalez-Hernandez, RIM-ONE: an open retinal image database for optic nerve evaluation, in: 2011 24th International Symposium on Computer-Based Medical Systems (CBMS), IEEE, 2011, pp. 1–6.
[24] N. Thakur, M. Juneja, Optic disc and optic cup segmentation from retinal images using hybrid approach, Expert Syst. Appl. (March) (2019).
[25] N. Harizman, C. Oliveira, A. Chiang, C. Tello, M. Marmor, R. Ritch, J.M. Liebmann, The ISNT rule and differentiation of normal from glaucomatous eyes, Arch. Ophthalmol. 124 (November (11)) (2006) 1579–1583.
[26] G.L. Spaeth, P. Ichhpujani, The ethics of treating or not treating glaucoma, J. Curr. Glaucoma Pract. 3 (3) (2009) 7–12.
[27] L.K. Soh, C. Tsatsoulis, Texture analysis of SAR sea ice imagery using gray level co-occurrence matrices, IEEE Trans. Geosci. Remote. Sens. 37 (March (2)) (1999) 780–795.
[28] R.M. Haralick, K. Shanmugam, Textural features for image classification, IEEE Trans. Syst. Man Cybern. (November (6)) (1973) 610–621.
[29] D.A. Clausi, An analysis of co-occurrence texture statistics as a function of grey level quantization, Can. J. Remote. Sens. 28 (January (1)) (2002) 45–62.
[30] Statistical Features, available at: http://shodhganga.inflibnet.ac.in/bitstream/10603/10111/9/09_chapter%203.pdf (Accessed 15 December 2018).
[31] S.R. Jammalamadaka, T.S. Rao, G. Terdik, Higher order cumulants of random vectors and applications to statistical inference and time series, Sankhyā: Indian J. Stat. (May) (2006) 326–356.
[32] R.C. Gonzalez, R.E. Woods, Digital Image Processing, Publishing House of Electronics Industry, 2002.
[33] Wavelets, available at: http://matlab.izmiran.ru/help/toolbox/wavelet/ch06_a31,32,34.htm (Accessed 15 December 2019).
[34] T. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[35] H. Uğuz, A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm, Knowledge Based Syst. 24 (October (7)) (2011) 1024–1032.
[36] M. Doshi, Correlation based feature selection (CFS) technique to predict student performance, Int. J. Comput. Netw. Commun. 6 (May (3)) (2014) 197.


[37] K. Kira, L.A. Rendell, A practical approach to feature selection, in: Machine Learning Proceedings 1992, Morgan Kaufmann, 1992, pp. 249–256.
[38] M.A. Hall, L.A. Smith, Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper, in: FLAIRS Conference, 1999, pp. 235–239.
[39] J.R. Jensen, K. Lulla, Introductory digital image processing: a remote sensing perspective.
[40] I. Rish, An empirical study of the naive Bayes classifier, in: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 3, 2001, pp. 41–46.
[41] L. Vanitha, A.R. Venmathi, Classification of medical images using support vector machine, in: International Conference on Information and Network Technology, Singapore, vol. 4, 2011, pp. 63–67.
[42] T. Hastie, R. Tibshirani, J.H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, 2009.
[43] E. Blanzieri, F. Melgani, Nearest neighbor classification of remote sensing images with the maximal margin principle, IEEE Trans. Geosci. Remote. Sens. 46 (May (6)) (2008) 1804–1811.
[44] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall PTR, 1994.
[45] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett. 27 (June (8)) (2006) 861–874.
