An Efficient Data Mining Classification Approach for Detecting Lung Cancer Disease

Divya Chauhan, Varun Jaiswal
Computer Science & Engineering
Shoolini University
Solan, India

Abstract- Background: Automated disease classification using machine learning often relies on features derived from segmenting individual objects, which can be difficult to automate. The proposed model is an efficient classification-based approach in which machine learning concepts are used for the detection of lung cancer. The algorithm obtained encouraging results but requires considerable computational expertise to execute. Furthermore, some benchmark sets are shown for comparison with the proposed work model.
Results: We developed a user-friendly disease prediction model based on PCA and LDA. To validate the method, it is applied in MATLAB 2014a to achieve high accuracy, and a comparison is then made with ICA and SURF.
Conclusions: The proposed approach offers improved user-friendliness, as feature extraction is performed in an easily editable form. As a direct implication, intermediate results are more easily accessible.

Keywords- Data Mining, PCA, Feature Extraction, Classification, Lung cancer

I. INTRODUCTION

Data mining is the extraction of hidden predictive information and unknown data, patterns, relationships and knowledge by exploring large data sets in which they are difficult to find and detect with traditional statistical methods. Data mining is a powerful technology that discovers the most important information in the data warehouses of organizations [1] [2] [6] [14]. It is a crucial step that collectively examines large amounts of routinely collected data. Various interactive and scalable data mining methods exist for finding the latest patterns in the healthcare industry. Data mining is a quantitative approach that makes reports easier to read, reduces errors and controls quality more uniformly. An important task of data mining is data pre-processing.

Data mining tools are used for decision making. Prediction and classification techniques are used, in which the classification technique predicts the unknown values with respect to the generated model. An assortment of data mining techniques can be applied to find associations and regularities in data, extract knowledge in the form of rules and predict the value of the dependent variables. Common data mining techniques used in almost all sectors include Naive Bayes, Decision Tree, Artificial Neural Network (ANN), the Bagging algorithm, K-nearest neighbor (KNN), Support Vector Machine (SVM), etc. Data mining is an important step of knowledge discovery in databases (KDD), which is an iterative process of data cleaning, data integration, data selection, pattern recognition and knowledge recognition. The terms KDD and data mining are also used interchangeably. Data mining encompasses association, classification, clustering, statistical analysis and prediction.

Data mining has been widely used in communication, credit assessment, stock market prediction, marketing, banking, education, health and medicine, hazard forecasting, knowledge acquisition, scientific discovery, fraud detection, etc. It holds a significant presence in the medical field for the diagnosis of several diseases such as diabetes, skin cancer, lung cancer, breast cancer, heart disease, kidney failure, kidney stone, liver disorder, hepatitis, etc. Data mining applications include analysis of data for better policy making in health, prevention of various errors in hospitals, detection of fraudulent insurance claims, early detection and prevention of various diseases, better value for money, saving costs and saving more lives by reducing death rates. Automated medical diagnosis helps doctors identify the correct disease in less time [15]. Table 1 highlights the foremost objectives of the authors working in the field of predicting medical disease(s) using data mining methodology. Knowledge gained by pursuing the aims of data mining can be used to make effective decisions that will improve the success of healthcare organizations and the health of patients [17].
TABLE.1 Related Work

Author | Year of Publication | Disease Considered | Objectives
Dursun Delen et al [3], Bellaachia et al [4] | 2005, 2006 | Breast cancer | Analysis of the prediction of breast cancer survivability data mining methods.
Asha Rajkumar et al [9] | 2010 | Heart Disease | To achieve high accuracy by classifying
D. Senthil Kumar et al [16] | 2011 | Diabetes, Heart, Hepatitis | Development and evaluation of a clinical decision support system for the treatment of patients with heart disease, diabetes and
Jyoti Soni [22] | 2011 | Heart Disease | Predictive data mining for medical diagnosis: An overview of heart disease
Akhil Jabbar et al [11] | 2012 | Heart Disease | Proposed a system for heart disease prediction using data mining techniques
DSVGK Kaladhar et al [19] | 2012 | Kidney Stone | Statistical and data mining aspects on kidney stones: a systematic review and meta-analysis
Mai Shouman [12] | 2012 | Heart Disease | Applying K-nearest neighbor in diagnosing heart disease patients
Abhishek Taneja [13] | 2013 | Heart Disease | To design a predictive model for heart disease detection to enhance the reliability of heart disease diagnosis.
Kawsar Ahmed et al [7],[23] | 2013 | Lung Cancer, Skin Cancer | Early prevention and detection of skin cancer and lung cancer risk using data mining
Syeda Farha Shazmeen et al [21] | 2013 | Liver Disorder | Performance evaluation of different data mining classification algorithm and predictive
V. Krishnaiah et al [8] | 2013 | Lung Cancer | Diagnosis of lung cancer prediction system using data mining classification techniques
Vikram Kumar Gupta et al [18] | 2013 | Drug Addiction | A study of profile of patients admitted in the drug de-addiction centers in the state of
K R Lakshmi et al [20] | 2014 | Kidney dialysis | Performance comparison of three data mining techniques for predicting kidney dialysis survivability
Various authors have worked on this problem earlier; in the same way, we have worked on the prediction of various diseases such as lung cancer, silicosis and infective images using PCA as well as neural networks.
The rest of the paper is organized as follows: Section 2 gives the comparison of various data mining methods, Section 3 reviews the PCA feature extraction method, Section 4 gives an overview of the LDA method and Section 5 gives a detailed view of the proposed work model. Section 6 shows the results and analysis, the next section presents the comparison between ICA and SURF, and the conclusion is given at the end.

II. DATA MINING METHODS

Different types of mining algorithms in the healthcare field have been proposed by different researchers in recent years. A particular algorithm may not be applicable to all applications, because of the complexity of matching the algorithm to appropriate data types. Consequently, the choice of an acceptable data mining algorithm depends not only on the purpose of an application, but also on the compatibility of the data set. Table 2 presents a comparative analysis of the different data mining techniques and algorithms that have been used by most researchers in medical data mining.

TABLE.2 Comparison between Methods

Authors Name | Year | ANN | DTrees | Logistic Regression | KNN | NB | SVM | Other
Dursun Delen et al [3] | 2005 | √ | √ | √ | × | × | × | -
Bellaachia et al [4] | 2006 | √ | √ | × | × | √ | × | -
Asha Rajkumar et al [9] | 2010 | × | √ | × | √ | √ | × | -
D. Senthil Kumar [16] | 2011 | × | √ | × | × | × | × | -
Jyoti Soni [22] | 2011 | √ | √ | × | √ | √ | × | -
Akhil Jabbar et al [11] | 2012 | √ | √ | × | × | √ | × | -
DSVGK Kaladhar et al [19] | 2012 | × | √ | × | √ | √ | √ | Random Forest, Bagging
Mai Shouman | 2012 | × | × | × | √ | × | × | -
Abhishek Taneja [13] | 2013 | √ | √ | × | × | √ | × | -
Kawsar Ahmed et al [23] | 2013 | × | × | × | × | × | × | Mafia
Kawsar Ahmed et al [7] | 2013 | × | √ | × | × | × | × | Apriori Algorithm
Syeda Farha Shazmeen et al | 2013 | √ | √ | × | √ | √ | √ | -
K R Lakshmi et al. [20] | 2014 | √ | √ | √ | × | × | × | -


Principal component analysis (PCA) is a classic method used to compress higher-dimensional data sets to lower-dimensional ones for data analysis, visualization, feature extraction, or data compression. PCA involves the calculation of the eigenvalue decomposition of a data covariance matrix or the singular value decomposition of a data matrix, usually after mean-centering the data for each attribute [24].

Step 1: Get normalized data from the iris regions. A 2-D iris image is represented as a 1-D vector by concatenating each row (or column) into a long vector.
Step 2: Subtract the mean image from each image vector. The mean should be computed row-wise.
Step 3: Compute the covariance matrix, which is needed for calculating the eigenvectors and eigenvalues.
Step 4: Compute the eigenvectors and eigenvalues of the covariance matrix.
Step 5: Sort the eigenvectors from high to low according to their corresponding eigenvalues. Choose components and form a feature vector.
Step 6: Derive the new data set. Once we have chosen the components, we simply take the transpose of the feature vector and multiply it on the left of the original data set, transposed [25].

Final Dataset = RowFeatureVector x RowMeanAdjust

where RowFeatureVector is the matrix with the eigenvectors in the columns, transposed so that the eigenvectors are now in the rows with the most significant eigenvector at the top, and RowMeanAdjust is the mean-adjusted data, transposed. The data items are in the columns, with each row holding a separate dimension. Principal component analysis is mainly useful for reducing the number of variables that make up a dataset while retaining the variance in the data, and for identifying hidden patterns in the data and classifying them according to how much of the information stored in the data they account for.

PCA allows computing a linear transformation that maps data from a high-dimensional space to a lower-dimensional space:

\[
\begin{aligned}
b_1 &= t_{11} a_1 + \cdots + t_{1N} a_N \\
b_2 &= t_{21} a_1 + \cdots + t_{2N} a_N \\
&\;\;\vdots \\
b_K &= t_{K1} a_1 + \cdots + t_{KN} a_N
\end{aligned}
\tag{1}
\]

III. LINEAR DISCRIMINANT

Linear Discriminant Analysis (LDA) is most commonly used as a dimensionality reduction technique in the pre-processing stage for machine learning and pattern-classification applications. The main objective is to project a dataset onto a lower-dimensional space with good class separability, in order to reduce computational costs and avoid overfitting. The original linear discriminant was first described for a two-class problem, and it was later generalized as "Multiple Discriminant Analysis" or "multi-class LDA" by C. R. Rao in 1948. Linear Discriminant Analysis is "supervised" and computes the directions ("linear discriminants") that represent the axes maximizing the separation between multiple classes. Below are the five basic steps for implementing the LDA technique:
Step 1 : Compute the d-dimensional mean vectors for the different classes from the dataset.
Step 2 : Compute the scatter matrices, i.e. the between-class and within-class scatter matrices.
Step 3 : Compute the eigenvectors (e1, e2, ..., ed) and corresponding eigenvalues (λ1, λ2, ..., λd) for the scatter matrices.
Step 4 : Sort the eigenvectors by decreasing eigenvalues and select the k eigenvectors with the largest eigenvalues to form a d×k-dimensional matrix W, where every column represents an eigenvector.
Step 5 : Use this d×k eigenvector matrix to transform the samples onto the new subspace. This can be summarized by the equation Y = X × W, where X is an n×d-dimensional matrix whose ith row represents the ith sample, and Y is the transformed n×k-dimensional matrix of the n samples projected into the new subspace.

IV. PROPOSED WORK

In the proposed work, the prediction and prevention of various medical diseases is done using PCA and the Canny edge operator, along with some pre-processing and post-processing steps. First, edge detection is performed; then feature extraction is done to obtain an optimized number of features for classifying between infected and non-infected cases. The following steps are followed to obtain the proposed disease prediction model.
The proposed system has been fully implemented (in MATLAB 2010) and tested with real CT scan images. The objective is to support efficient image data processing and feature extraction. To deal with real image data, the image processing tool must possess important characteristics such as being noise-tolerant, efficient, practical, and convenient to use. The aim of this research was to detect features for accurate images.

[Flowchart boxes: Start; Load all images for the training process; Detect edges of the uploaded category; Apply principal component analysis; Perform selection; Load feature vector saved in the database for cancerous and; Perform testing phase by loading test; Get disease classification using LDA]
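The PCA steps listed in this paper (flatten each image into a vector, subtract the mean, build the covariance matrix, take its eigenvectors sorted by eigenvalue, and project) can be sketched as follows. This is a minimal illustration in plain Python rather than the paper's MATLAB, under stated assumptions: the 2x2 "images" are made-up toy data, all function names (`mean_center`, `covariance`, `top_eigenpairs`, `pca_features`) are hypothetical, and power iteration with deflation is swapped in for a library eigen-solver such as MATLAB's `eig`.

```python
import random


def mean_center(samples):
    """Subtract the per-dimension mean from every sample vector."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    centered = [[s[j] - mean[j] for j in range(d)] for s in samples]
    return centered, mean


def covariance(centered):
    """Sample covariance matrix (d x d) of mean-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_eigenpairs(matrix, k, iterations=500, seed=0):
    """Largest k eigenpairs of a symmetric positive semi-definite matrix.

    Assumption: power iteration with deflation stands in for a library
    eigen-solver; it is adequate for small illustrative inputs only.
    """
    rng = random.Random(seed)
    d = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = mat_vec(work, v)
            norm = sum(x * x for x in w) ** 0.5
            if norm < 1e-12:  # remaining eigenvalues are numerically zero
                break
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, mat_vec(work, v)))
        pairs.append((lam, v))
        # Deflate: remove the component that was just found.
        work = [[work[i][j] - lam * v[i] * v[j] for j in range(d)]
                for i in range(d)]
    return pairs


def pca_features(samples, k):
    """Project mean-centered samples onto the k leading eigenvectors."""
    centered, _ = mean_center(samples)
    pairs = top_eigenpairs(covariance(centered), k)
    eigenvalues = [lam for lam, _ in pairs]
    components = [vec for _, vec in pairs]  # rows are eigenvectors
    features = [mat_vec(components, row) for row in centered]
    return features, eigenvalues, components


# Toy 2x2 "images" (made-up values), flattened row by row into 4-D vectors.
images = [
    [[1, 2], [3, 4]],
    [[2, 3], [4, 5]],
    [[3, 4], [5, 7]],
    [[4, 5], [6, 6]],
]
flattened = [[float(p) for row in img for p in row] for img in images]
features, eigenvalues, components = pca_features(flattened, k=2)
```

Here `components` plays the role of RowFeatureVector (eigenvectors in rows, most significant first) and each row of `features` is the reduced feature vector of one image. The variance of the first feature column equals the largest eigenvalue, which is a quick sanity check on any PCA implementation.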
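The five LDA steps can be illustrated in the same way. The sketch below is a plain-Python illustration under stated assumptions, not the paper's implementation (which calls MATLAB's `classify`): it handles only 2-D feature vectors so that the within-class scatter matrix can be inverted in closed form, it keeps a single discriminant (k = 1), the sample points are made up, and every function name is hypothetical.

```python
def class_means(X, y):
    """Step 1: mean vector of each class."""
    means = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return means


def scatter_matrices(X, y):
    """Step 2: within-class (Sw) and between-class (Sb) scatter matrices."""
    d = len(X[0])
    means = class_means(X, y)
    overall = [sum(col) / len(X) for col in zip(*X)]
    Sw = [[0.0] * d for _ in range(d)]
    Sb = [[0.0] * d for _ in range(d)]
    for x, lab in zip(X, y):
        diff = [a - b for a, b in zip(x, means[lab])]
        for i in range(d):
            for j in range(d):
                Sw[i][j] += diff[i] * diff[j]
    for lab, m in means.items():
        n_c = y.count(lab)
        diff = [a - b for a, b in zip(m, overall)]
        for i in range(d):
            for j in range(d):
                Sb[i][j] += n_c * diff[i] * diff[j]
    return Sw, Sb, means


def inverse_2x2(m):
    """Closed-form inverse; assumes a non-singular 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def leading_eigenvector(M, iterations=200):
    """Steps 3-4 for k = 1: dominant eigenvector by power iteration.

    Assumes the start vector is not orthogonal to the class-mean difference.
    """
    v = [1.0] * len(M)
    for _ in range(iterations):
        w = [sum(a * b for a, b in zip(row, v)) for row in M]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def fit_lda(X, y):
    """Return the discriminant direction w and the projected class means."""
    Sw, Sb, means = scatter_matrices(X, y)
    w = leading_eigenvector(mat_mul(inverse_2x2(Sw), Sb))
    projected_means = {lab: sum(a * b for a, b in zip(m, w))
                       for lab, m in means.items()}
    return w, projected_means


def predict(x, w, projected_means):
    """Step 5 plus a nearest-projected-mean decision rule (an assumption)."""
    score = sum(a * b for a, b in zip(x, w))
    return min(projected_means, key=lambda lab: abs(score - projected_means[lab]))


# Made-up 2-D feature vectors for two classes (0 and 1).
X = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]]
y = [0, 0, 0, 1, 1, 1]
w, projected_means = fit_lda(X, y)
labels = [predict(x, w, projected_means) for x in X]
```

In a pipeline like the one described here, the 2-D inputs would be PCA features rather than raw points. For more than two dimensions the closed-form inverse would have to be replaced by a general linear solver.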
Fig.1 Proposed Work Model

Lemma: 1 Diseases Prediction Working Module
Initialize = all images;
Load images = early to tb;
Load features = early to tb;
Get file to be uploaded;
Do conversion to gray scale images;
Calculate basic features of the image using PCA like Eigen, covariance etc;
Save PCA features;
Load PCA data;
Training will be done using PCA only;
Testing will be done using PCA only;
Find basic features again like SD, Average value etc;
Get Detection rate with type of disease;
End

Lemma: 2 PCA Working Module for feature extraction
data = edge_image;
c = size(data, 2);                  % number of columns in the data matrix
m = mean(data')';                   % calculating the mean of each row
d = data - repmat(m, 1, c);
% Compute the covariance matrix
cov_mat = d * d';
% Compute the eigenvalues and eigenvectors of the covariance matrix
[eigvector, eigvl] = eig(cov_mat);
% Sort the eigenvectors according to the eigenvalues
eigvalue = diag(eigvl);
[number, index] = sort(-eigvalue);
eigvalue = eigvalue(index);
eigvector = eigvector(:, index);
% Compute the number of eigenvalues that are greater than zero
% Compute the eigenvectors whose corresponding eigenvalues are greater than zero

Lemma: 3 PCA Working Module for Training and Testing
load data
[r1, c1] = size(pca_train_data_advance, early, ild, tb, silicosis, infected);
for i = 1:r1
if test_data == pca_train_data_advance(i), early, ild, tb, silicosis, infected;
training = pca_train_data_advance, early, ild, tb, silicosis, infected;
lda_class = classify(test_data, training, species_cat1);
error = sqrt(sum(sum(pca_features_cancer_advance, early, ild, tb, silicosis, infected))
sum(sum(pca_features_test))) / numel(pca_features_cancer_advance, early, ild, tb, silicosis, infected);
end
end

V. RESULTS AND ANALYSIS

Our main goal with the proposed model is to offer an equally efficient but more user-friendly version. We performed several experiments to ensure that good performance could be obtained with our algorithm on the proposed dataset. The data set consists of five classes, and each class has around 120 cancerous as well as non-cancerous images of about 200 KB each. In order to allow for objective comparison, we implemented the algorithm following the description given above.

Fig.2 Proposed Dataset Images

The accuracies obtained on various diseases using different data mining techniques are shown below. Accuracy is computed from the formula to find the number of affected cases with respect to the values of the given parameters. Commonly used techniques are Decision Trees (DTrees), Artificial Neural Network (ANN) and Naïve Bayes (NB) [31, 32]. These techniques are compared year-wise on different diseases, together with the results obtained from the proposed work model. Table 3 shows the comparison of various diseases on ANN versus the proposed model.

Table 3. Accuracies Applied on Artificial Neural Networks

Disease Considered | Ref. | Authors of Publication | Year of Publication | Accuracy in ANN
Breast | [26] | Dursun Delen et al. | 2005 | 91.21%
Heart | | Andreeva, P | 2006 | 82.77%
Breast | | Bellaachia et al | 2006 | 86.50%
Heart | [28] | Palaniappan, et al. | 2007 | 93.54%
Heart | [29] | De Beule, et al. | 2007 | 82.00%
Heart | [30] | Tantimongcolwata, et al. | 2008 | 74.50%
Heart | | Hara, et al. | 2008 | 82.30%
Heart | | Akhil Jabbar et al | 2012 | 82.00%
Heart | | Abhishek Taneja | 2013 | 93.83%
Liver | [33] | Syeda Farha Shazmeen et al | 2013 | 67.59%
Kidney | [34] | K R Lakshmi et al. | 2014 | 93.85%
Lung | | Proposed Model | 2016 | 97.94%

Table 3 shows a maximum accuracy of 93.85% (kidney) in previous works; with the proposed model it has been enhanced to 97.97%. Figure 3 shows the graph of accuracies with range of percentage to diseases.

Table 4. Accuracies Applied on Different diseases

Authors | Images | Classification | Accuracy
Disha Sharma | CT | Diagnostic Indicators | 80%
Anam Tariq | CT | Neuro-Fuzzy | 95%
Yang Liu | CT | SVM (GRBF kernel type) | 87.82%
Dr. K. Usha Rani | CT | Feed forward, Back propagation | 92%
Afzan Adam | CT | GA | 83.36%
S.K. Vijai Anand | CT | Back propagation | 86.30%
JR Marsilin | CT | SVM | 78.00%
F Eddaoudi | CT | SVM | 95%
Aparna Kanakatte | PET | k-NN, SVM | 97%
S. Sivakumar | CT | SVM (RBF kernel type) | 80.36%
Hiram Mad | CT | SVM | 84%
Fatma Taher | Sputum | Bayesian | 88.62%
Kesav Kancherla | Sputum | Random | 87%
Tuba Kiyan | CT | Radial basis function | 96.81%
In this section, three feature extraction methods (ICA, SURF and PCA) are compared on the basis of accuracy, standard deviation and miscellaneous error.
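As a rough illustration of how such comparison criteria might be computed, the sketch below derives accuracy, its standard deviation across runs, and an error rate from predicted labels. It rests on assumptions: the paper does not define its error measure, so error is taken here as 1 minus mean accuracy; the label lists are made up; and the names `accuracy` and `summarize` are hypothetical.

```python
import statistics


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def summarize(runs):
    """Summarize several (true, predicted) runs of one method.

    Assumption: 'error' is 1 - mean accuracy; the paper's own error
    definition is not given.
    """
    accuracies = [accuracy(t, p) for t, p in runs]
    mean_acc = statistics.mean(accuracies)
    return {
        "accuracy": mean_acc,
        "std": statistics.stdev(accuracies),
        "error": 1.0 - mean_acc,
    }


# Two made-up test runs of a hypothetical method.
runs = [
    ([1, 0, 1, 1], [1, 0, 0, 1]),  # 3 of 4 correct
    ([0, 1, 1, 0], [0, 1, 1, 0]),  # 4 of 4 correct
]
summary = summarize(runs)
```

Computing the same summary for each feature extraction method on identical test splits keeps the comparison like for like.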
a. For ICA

Fig.3 Accuracy using ICA

Fig.4 ICA Feature extraction

b. For SURF

Fig.5 Accuracy using SURF

Fig.7 PCA Feature extraction

Fig.8 Accuracy using PCA

From the graphical representation, it can be seen that the proposed work based on PCA has good accuracy as well as lower error and standard deviation than the ICA and SURF methods.

In this paper, our contributions include, first, the presentation of a disease prediction model, an easy-to-use image-based classification algorithm inspired by the PCA feature extraction algorithm. It also generalizes across a wide range of disease prediction applications, an evident need in medical processing. In addition, the proposed approach offers improved user-friendliness, as feature extraction is performed in an easily editable form. As a direct implication, intermediate results are more easily accessible. We have also demonstrated that the proposed algorithm is more efficient than other traditional methods like SURF and ICA.

