DS CP Paper

Lung Cancer Prediction Using Machine Learning
Harshal Dhande, Manoj Dohale, Ayush Doshi, Ojas Dudhabaware, Abha Marathe
Department of Engineering, Sciences and Humanities (DESH)

Vishwakarma Institute of Technology, Pune

Abstract - For more than a decade, machine learning techniques have been applied in cancer research. Machine Learning Algorithms (MLa) can now make a significant contribution to lung cancer (LC) research. Because LC has the highest death rate among cancers worldwide, early detection and classification of cancer cells can significantly improve survival rates. Though numerous algorithms for LC prediction are employed in the fields of neurology, radiology, and oncology, MLa outperform them in accuracy and efficiency. This research first looks at the MLa workflow for early LC prediction and categorization: selecting the input data, preparing the data, feature selection and extraction, training and evaluating the models, and selecting the optimal ML approach. Second, a survey of the ML algorithms employed in LC research is offered, along with their methodology. Third, performance metrics such as Accuracy, Sensitivity, Specificity, Precision, F1 Score, Root Mean Square Error (RMSE), and the Confusion Matrix are investigated for various MLa. Finally, the parameters used in creating an efficient and accurate ML model for the early prediction of LC are discussed in this paper.

Keywords - Machine learning, Lung cancer, Prediction, Classification, Methodologies, Survey

I. INTRODUCTION
Lung cancer (LC) continues to be the leading cause of cancer incidence and mortality worldwide, with millions of new cases and deaths. Estimated new LC cases and mortality around the world from 1999 to 2020 are shown in Fig. 1. Though LC is the most life-threatening disease, the best approach to survival is to obtain a diagnosis and a prognosis as soon as possible [1]. Smoking is the leading cause of lung cancer, and a variety of other factors increase the risk of LC in both smokers and nonsmokers [2]. Over 12,000 fatalities may be avoided if half of the high-risk LC patients were examined.

Chest X-rays, sputum cytology, low-dose spiral or helical CT scans, and low-dose computed tomography (LDCT) are all used to detect LC. LDCT screening, among other tests, can reduce LC mortality by 14 to 20% in high-risk populations; LDCT scans also detect smaller lung cancers at an earlier stage [6]. Not all nodules discovered in the lungs are malignant. Small cell lung cancer (SCLC), non-small cell lung cancer (NSCLC), and lung carcinoid tumor are the three types of lung cancer.

In lung cancer, there are two types of staging systems: the number system (Stage I, Stage II, Stage III, Stage IV) and the TNM (Tumor, Nodes, Metastases) system. Stage I and stage II are determined by the size of the tumor. Stage III involves lymph nodes, while stage IV indicates that the cancer has spread to other regions of the body, such as the liver and brain.

In LC screening, the term Volume Doubling Time (VDT) is employed. VDT is the time it takes for a nodule to double its volume. According to most studies, a nodule with a VDT of less than 400 days has a significant risk of being malignant, whereas a nodule with a VDT of more than 500 days is likely to be normal or benign. Thirteen of the 48 LCs discovered by CT screening had a VDT of more than 400 days.

Machine learning approaches have been used in LC and other cancer studies since the 1990s. According to PubMed statistics, around 650 research articles on detecting LC using machine learning algorithms have been published.

The main purpose of this paper is to perform feature analysis and exploratory data analysis to find the leading factors associated with lung cancer in different age groups, based on the rules identified by tree-based learning models. The paper suggests a way to break down the entire set of symptoms and causes and tries to find the best subset that will allow medical practitioners to diagnose whether a patient has lung cancer based on textual information and lifestyle preferences.

II. LITERATURE REVIEW
[1] Nikita Banerjee and Subhalaxmi Das, in "Prediction Lung Cancer - In Machine Learning Perspective", give an overview of predicting lung cancer at an early stage. After predicting whether a tumor is malignant or benign, the model generates a confusion matrix for each machine learning technique and, from the confusion matrix, calculates accuracy, recall, precision and F1 score. The results show that the model can distinguish between benign and malignant tumors. The artificial neural network provides the highest accuracy on both texture-based and vision-based features, and its recall value shows that it correctly identified the largest number of malignant tumors. Technologies used: edge detection, segmentation, SVM, Random Forest, ANN.

[2] Timor Kadir and Fergus Gleeson, in "Lung cancer prediction using machine learning and advanced imaging techniques", give an overview of the main approaches used for nodule classification and lung cancer prediction from CT imaging data. When evaluating performance, it is important to know whether the patients were smokers or non-smokers, and whether patients with a current or prior history of malignancy were included. Given an apparently acceptable level of performance, the next stage is to test such CADx systems in a clinical setting. Before this can be done, the way in which the output of the CADx system should be used in clinical decision making must first be defined.

[3] M. Siddardha Kumar and K. Venkata Rao, in "Prediction of lung cancer using machine learning technique: a survey", note that the methodology used to find the disease has a fundamental influence on discovering tumor cells. Detecting and predicting a lung tumor at an early stage is a central task. The work is divided into the following areas: image enhancement, image segmentation, feature extraction, and clustering. A new feature selection model is proposed for cancer microarray recognition. Most conventional classification methods handle only a limited number of attributes and small datasets. The random forest classifier is an ensemble learning model that can deal with datasets with a large number of attributes. After the removal process, segmentation divides the image into groups of pixels arranged according to the intensity values present in the image. This framework is useful only for recognizing lung tumor growth and not for diagnosing lung disease.

[4] V. Nisha Jenipher and S. Radhika, in "A Study on Early Prediction of Lung Cancer Using Machine Learning Techniques", describe how available data can be used to make predictions or decisions about LC. The study presents a proposed system and the MLa used for predicting early LC, which gives researchers a better understanding of ML techniques for early LC prediction. To make ML approaches easier to apply in the field of LC, the types of dataset used, the data preprocessing methods implemented, and the essential features selected and extracted are explained in detail. The performance of different MLa is also evaluated, and the parameters used in constructing an efficient and accurate ML model for early prediction of lung cancer are given as additional information.

[5] Atharva Bankar, Kewal Padamwar and Aditi Jahagirdar, in "Symptom Analysis using a Machine Learning approach for Early Stage Lung Cancer", focus on pinpointing the key determinants that can cause, or are key symptoms of, lung cancer in various age groups. Tree-based models can be of great use when the need is to identify and understand the rules and patterns hidden in complex data. This made it easy to categorize patients by age and to identify which lifestyle choices need to be kept under control to reduce the risk of lung cancer. The accuracies obtained on the data also support the conclusion that cancer can be predicted using machine learning techniques and that, given more credible data, cancers might be identified even without CT scans, from textual data alone.

[6] D. Jayaraj and S. Sathiamoorthy, in "Random Forest based Classification Model for Lung Cancer Prediction on Computer Tomography Images", develop a new automated computer-aided model for identifying lung cancer in CT images. The model consists of a series of processes. Once the input image is pre-processed, it is segmented by the watershed segmentation algorithm, which produces a segmented image in binary form. Next, a collection of important features is extracted from the segmented image. Classification is then carried out with an RF classifier, which labels each image as 'normal' or 'abnormal'. The model is evaluated on a set of images from the LIDC dataset, where it attains a maximum accuracy of 89.90, sensitivity of 90.85 and specificity of 88.32.

III. PROPOSED WORK
A. DATASET
The data used in this paper to train the models was retrieved from the Kaggle dataset 'Survey Lung Cancer'. It consists of 309 instances with 16 attributes (15 independent variables and 1 dependent variable). The data mostly contains ordinal values, which makes it well suited to a relative analysis of the independent variables. All 16 attributes are described in Table 1. In the target attribute LUNG_CANCER, the value 'Yes' indicates a person suffering from lung cancer and 'No' indicates a healthy person. Fig. 1 plots a histogram of the age distribution, showing the number of people of each age in the whole dataset. It can be inferred that people aged 50 to 80 have a relatively higher chance of being diagnosed with lung cancer.
B. METHODOLOGY
In this research, we have developed a model to predict lung cancer in patients. The performance of the model was tested on both the full set of attributes and on selected features. The machine learning classifiers logistic regression, linear support vector machine (LSVM), radial support vector machine (RSVM), decision tree (DT) and random forest were used for training the model. Validation and performance metrics were computed for each classifier. The methodology includes the following three stages:
(i) dataset preprocessing,
(ii) classifier application,
(iii) analyzing the performance of the classifier.
The methodology is described in Fig. 2.
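Stages (ii) and (iii) rely on holding out part of the data for testing; Section IV reports an 80%/20% partition of the 360 balanced rows. The following is a minimal sketch of such a hold-out split, not the authors' code: the function name `train_test_split`, the fixed seed, and the use of placeholder row ids are all assumptions made for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows and return (train, test) lists. Hypothetical helper."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 360 placeholder row ids stand in for the balanced dataset (assumption).
train, test = train_test_split(range(360))
print(len(train), len(test))  # 288 72
```

A plain shuffle does not guarantee the same class ratio in both parts; whether the original split was stratified is not stated in the paper.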
Fig.1 Age vs count (Lung cancer patients)

TABLE 1. Description of Attributes in the Dataset.
Sr. No  Attribute Name          Description
1       GENDER
2       AGE
3       SMOKING
4       YELLOW FINGER           Patient having yellow finger or not (Two values 2 and 1)
5       ANXIETY                 Patient having anxiety problem or not (Two values 2 and 1)
6       PEER PRESSURE           Patient having peer pressure or not (Two values 2 and 1)
7       CHRONIC DISEASE         Patient having chronic disease or not (Two values 2 and 1)
8       FATIGUE                 Patient having fatigue or not (Two values 2 and 1)
9       ALLERGY                 Patient having any kind of allergy or not (Two values 2 and 1)
10      WHEEZING                Patient having wheezing problem or not (Two values 2 and 1)
11      ALCOHOL CONSUMING       Patient consumes alcohol or not (Two values 2 and 1)
12      COUGHING                Patient having cough or not (Two values 2 and 1)
13      SHORTNESS OF BREATH     Patient having breath problem or not (Two values 2 and 1)
14      SWALLOWING DIFFICULTY   Patient having swallowing difficulty or not (Two values 2 and 1)
15      CHEST PAIN              Patient having chest pain problem or not (Two values 2 and 1)
16      LUNG CANCER             Target variable (Lung cancer or Not, Two values YES and NO)

Fig. 2. Work-Flow

C. PREPROCESSING OF DATA
Data preprocessing is a technique used to convert raw information into a clean dataset. It is a basic step in training every machine learning classifier. In the proposed models, the first step in data preprocessing is a binary transformation that converts the values into 2 and 1. The second step is converting all categorical variables into dummy variables, i.e., converting their datatype into factor. As many machine learning models require numbers as input, dummy variables provide that feature.

A further problem with the data is class imbalance: the dataset "survey lung cancer.csv" contains 309 instances, of which 270 have "YES" as the target variable while only 39 have "NO". To handle this class imbalance problem, the ovun.sample() function from the ROSE package is used, which creates possibly balanced samples by random over-sampling of minority examples, under-sampling of majority examples, or a combination of over- and under-sampling. After balancing, the modified data contains 173 instances of "NO" and 187 instances of "YES", 360 in total. This data frame is written to a csv file, and this file is used for all algorithms.
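The balancing step can be illustrated with a short sketch. This is not the ROSE implementation and not the authors' code; it is a plain-Python approximation of combined over- and under-sampling, in which minority rows are drawn with replacement and majority rows without replacement. The paper does not say which method or arguments were passed to ovun.sample(); the reported counts (173 "NO" and 187 "YES" from 39 and 270) suggest the combined option with a total of 360, but that is an inference. The function name `resample_both`, the probability `p`, and the seed are assumptions.

```python
import random
from collections import Counter

def resample_both(rows, label_of, minority_label, total, p=0.5, seed=1):
    """Draw `total` rows: minority with replacement, majority without.

    Hypothetical helper; each draw is assigned to the minority class
    with probability p, so class sizes are only approximately equal.
    """
    rng = random.Random(seed)
    minority = [r for r in rows if label_of(r) == minority_label]
    majority = [r for r in rows if label_of(r) != minority_label]
    n_min = sum(rng.random() < p for _ in range(total))
    n_maj = total - n_min
    if n_maj > len(majority):
        raise ValueError("too few majority rows to under-sample without replacement")
    sample = [rng.choice(minority) for _ in range(n_min)]  # over-sampling
    sample += rng.sample(majority, n_maj)                  # under-sampling
    rng.shuffle(sample)
    return sample

# Placeholder rows with the class sizes reported for the raw data.
rows = [("NO", i) for i in range(39)] + [("YES", i) for i in range(270)]
balanced = resample_both(rows, label_of=lambda r: r[0],
                         minority_label="NO", total=360)
print(Counter(label for label, _ in balanced))
```

Because many "NO" rows in the result are duplicates, a random train/test split made after balancing can place copies of the same patient in both parts, which tends to inflate test scores.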
D. CLASSIFICATION ALGORITHMS
Classification is an important technique in supervised learning. Classifiers learn from the training dataset and are applied to the testing dataset to predict the target attribute. The classification techniques used in the proposed model are described below.

A) Decision Tree
Decision Tree is an efficient machine learning algorithm used in predictive analysis. Decision trees allow us to easily interpret the data with high accuracy by creating decision rules (stumps). The dataset is then split according to the decision rules created, which enables us to predict the target variable. The ID3 algorithm is used to build the decision trees. With decision trees, feature importance is clearly obtained and relations can be viewed easily.

Fig. 3. Decision Tree

B) Random Forest
Random Forest is an ensemble model with Decision Trees as its base model. It consists of many Decision Trees which operate individually and each provide a prediction for the target variable. To build each individual tree, the model uses bagging and feature randomness, which helps make it an uncorrelated forest of trees. This helps the model achieve a prediction accuracy higher than that of any individual tree.

Fig. 4. Random Forest

C) Linear SVM
Support Vector Machine (SVM) is one of the most popular supervised learning algorithms and is used for classification as well as regression problems, though primarily for classification. SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Linear SVM is used for linearly separable data: if a dataset can be classified into two classes by a single straight line, the data is termed linearly separable and the classifier is called a Linear SVM classifier.

D) Radial SVM
In machine learning, the radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms. In particular, it is commonly used in support vector machine classification.

Fig. 5. SVM radial

E) Logistic Regression
Logistic regression is a statistical method for binary classification that can be generalized to multiclass classification. It is a classification model that is very easy to implement and achieves very good performance with linearly separable classes. Logistic regression has been extensively employed as a classification algorithm in industry.
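Two of the functions named in subsections D) and E) are compact enough to write out. The sketch below shows the standard RBF kernel, exp(-gamma * ||x - z||^2), and the logistic function applied to a linear score. It is only an illustration of the formulas: the function names, the value of `gamma`, and the example weights are assumptions, and the paper does not report the hyperparameters it used.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel value between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def logistic(weights, bias, x):
    """Probability of the positive class under a logistic regression model."""
    score = bias + sum(w * a for w, a in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-score))

print(rbf_kernel([1, 0], [1, 0]))         # identical points give 1.0
print(logistic([0.0, 0.0], 0.0, [1, 0]))  # a zero score gives 0.5
```

The kernel value falls towards 0 as two points move apart, which is what lets a radial SVM draw non-linear boundaries where a linear SVM can only place a single hyperplane.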
Fig. 6. Logistic Regression

E. PERFORMANCE EVALUATION MEASURE
Various evaluation metrics were used for checking the performance of the classifiers. For this purpose, the confusion matrix was used. It is a 2 × 2 matrix because the dataset has two classes. The confusion matrix gives two types of correct prediction and two types of incorrect prediction of the classifier. The confusion matrix is presented in Table 2.

TABLE 2. Confusion Matrix.

1) CONFUSION MATRIX DESCRIPTION
TP: True Positive means the output is positive and the predicted result is correctly classified.
TN: True Negative means the output is negative and the predicted result is correctly classified.
FP: False Positive means the output is positive and the predicted result is incorrectly classified.
FN: False Negative means the output is negative and the predicted result is incorrectly classified.

2) CLASSIFICATION ACCURACY
Classification accuracy shows the rate of correct predictions. It is computed from the confusion matrix by equation 1:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \tag{1}$$

3) CLASSIFICATION ERROR
Classification error shows the rate of incorrect predictions. It is computed from the confusion matrix by equation 2:
$$\text{Error} = \frac{FP + FN}{TP + TN + FP + FN} \times 100 \tag{2}$$

4) PRECISION
Precision is an important model performance evaluation metric. It is the fraction of relevant instances among the total retrieved instances, also called the positive predictive value. Precision is calculated by equation 3:
$$\text{Precision} = \frac{TP}{TP + FP} \times 100 \tag{3}$$

5) RECALL
Recall is also an important model performance evaluation metric. It is the fraction of all relevant instances that were actually retrieved. Recall is calculated by equation 4:
$$\text{Recall} = \frac{TP}{TP + FN} \times 100 \tag{4}$$

6) SPECIFICITY
Specificity is also an important model performance evaluation metric. It is the proportion of truly negative cases that were classified as negative; it measures how well the classifier identifies negative cases and is also called the true negative rate. Specificity is calculated by equation 5:
$$\text{Specificity} = \frac{TN}{TN + FP} \times 100 \tag{5}$$

7) F-MEASURE
The F-measure is also known as the F score. It summarizes the accuracy of a test and is calculated from precision and recall by equation 6:
$$\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6}$$

IV. RESULT AND DISCUSSION
After training the five algorithms, we now present a descriptive analysis of the data used in the research as well as the experimental results. The data file has a class imbalance problem. To handle it, the ovun.sample() function from the ROSE package is used, which creates possibly balanced samples by random over-sampling of minority examples, under-sampling of majority examples, or a combination of over- and under-sampling. The resulting data frame is written to a csv file, and this file is used for all algorithms. With this, preprocessing is complete.

To build a model, we separate our dataset into two parts:
• Training Dataset
• Testing Dataset

We used a 4:1 ratio for preparing our model: 80% of the dataset is treated as the training dataset and the remaining 20% as the testing dataset. The data set includes 360 data points and 16 attributes: 15 input variables and one output variable. The output variable is a class with two categories: Yes (lung cancer) and No (no lung cancer). To predict lung cancer, five different analytic methods were used. The following performance metrics are used to evaluate the models: Confusion Matrix, Accuracy, Precision, Recall (Sensitivity), Error, Specificity and F-measure. A positive classification means the person has lung cancer and a negative classification means the person does not have lung cancer. For both training and testing data, the accuracy of the prediction models is high, and the model with the highest accuracy is the radial SVM.

A. Experimental Results Between Different Algorithms
Support Vector Machine with a linear kernel (LSVM) is a discriminative classifier that works by finding a separating hyperplane. This algorithm can be used for both regression and classification. For the LSVM algorithm, we obtained 92.13% accuracy and a recall of 93.33. Table III shows all the metric values.

Measures      SVM linear Values
Accuracy      92.13483
Error         7.865169
Precision     91.30435
Recall        93.333333
Specificity   90.909090
F-measure     92.30769
Table III

Support Vector Machine with a radial kernel (RSVM) is a discriminative classifier that works by finding a separating hyperplane. This algorithm can be used for both regression and classification. For the RSVM algorithm, we obtained 96.66% accuracy and a recall of 100. Table IV shows all the metric values.

Measures      SVM radial Values
Accuracy      96.66667
Error         3.33333
Precision     93.61702
Recall        100
Specificity   93.47826
F-measure     96.7033
Table IV

In the Random Forest algorithm, the 'forest' is an ensemble of decision trees. This algorithm uses the bagging method for training. Random Forest attains 91.666% accuracy on this dataset with a recall of 100, matching the highest recall observed. Table V lists the other evaluation metrics.

Measures      RF Values
Accuracy      91.66667
Error         8.33333
Precision     85.71429
Recall        100
Specificity   83.333333
F-measure     92.30769
Table V

A decision tree is a tree-like structure in which each internal node represents a feature, each branch represents a decision rule, and each leaf node represents the final output. We obtained 75% accuracy for lung cancer prediction with a recall of 72.41. Table VI lists the other metrics.

Measures      DT Values
Accuracy      75.0
Error         25.0
Precision     67.74194
Recall        72.41379
Specificity   76.74419
F-measure     70
Table VI

Logistic Regression is one of the most popular classifiers in machine learning. We obtained 94.38% accuracy and a recall of 93.47. Table VII has the other details.

Measures      Logistic Values
Accuracy      94.38202
Error         5.617978
Precision     95.55556
Recall        93.47826
Specificity   95.34884
F-measure     94.50549
Table VII
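The values in Tables III to VII can be recomputed from confusion-matrix counts using equations (1) to (6). The sketch below does this for one set of counts. Specificity is taken as the true negative rate, TN / (TN + FP), matching its verbal definition. The paper does not report raw confusion matrices, so the counts used here (TP = 21, TN = 33, FP = 10, FN = 8) are an assumption: they were chosen because they sum to 72, which is 20% of 360, and reproduce the decision tree figures in Table VI. The function name is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return the six metrics, in percent, from confusion-matrix counts.

    Raises ZeroDivisionError if a denominator is zero, for example when
    the classifier never predicts the positive class.
    """
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": 100 * (tp + tn) / total,
        "error": 100 * (fp + fn) / total,
        "precision": 100 * precision,
        "recall": 100 * recall,
        "specificity": 100 * tn / (tn + fp),
        "f_measure": 100 * 2 * precision * recall / (precision + recall),
    }

# Assumed counts, consistent with Table VI but not taken from the paper.
metrics = classification_metrics(tp=21, tn=33, fp=10, fn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.5f}")
# accuracy: 75.00000
# error: 25.00000
# precision: 67.74194
# recall: 72.41379
# specificity: 76.74419
# f_measure: 70.00000
```

Since error and accuracy always sum to 100, the same function gives a quick consistency check on any reported row.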
On comparing all five techniques, Logistic Regression, Random Forest and radial SVM show a high accuracy rate, as shown in Fig. 7. The data used in this study includes only two classes for the output variable: lung cancer and not lung cancer.

Fig. 7. Classifier Accuracy on Different Algorithms (accuracy by classifier: SVM Linear, SVM Radial, RF, LGR, Tree)

Fig. 8 shows the variation of accuracy across the five classifiers.

Fig. 8. Classifier Accuracy variation on Different Algorithms (accuracy by classifier: SVM Linear, SVM Radial, RF, LGR, Tree)

Fig. 9 compares accuracy, precision and recall across the classifiers.

Fig. 9. Comparison of Accuracy, Precision and Recall for various classifiers (SVM Linear, SVM Radial, RF, LGR, Tree)

Among the applied classifiers, RSVM gives the highest accuracy of 96.66% with a recall of 100, LSVM gives an accuracy of 92.13% with a recall of 93.33, logistic regression gives an accuracy of 94.38% with a recall of 93.47, Random Forest gives an accuracy of 91.667% with a recall of 100, and Decision Tree has the lowest accuracy of 75% with a recall of 72.41.

Fig. 10. Performance of F-Measure for all classifiers (SVM Linear, SVM Radial, RF, LGR, Tree)

V. CONCLUSION
VI. REFERENCES
[1] Nikita Banerjee, Subhalaxmi Das, "Prediction Lung Cancer - In Machine Learning Perspective," IEEE Xplore, July 4, 2020.
[2] Timor Kadir, Fergus Gleeson, "Lung cancer prediction using machine learning and advanced imaging techniques," Transl Lung Cancer Res, 2018.
[3] M. Siddardha Kumar, K. Venkata Rao, "Prediction of Lung Cancer Using Machine Learning Technique: A Survey," 2021 International Conference on Computer Communication and Informatics (ICCCI-2021), Jan. 27-29, 2021, Coimbatore, India.
[4] V. Nisha Jenipher, S. Radhika, "A Study on Early Prediction of Lung Cancer Using Machine Learning Techniques," Proceedings of the Third International Conference on Intelligent Sustainable Systems (ICISS 2020), IEEE Xplore.
[5] Atharva Bankar, Kewal Padamwar, Aditi Jahagirdar, "Symptom Analysis using a Machine Learning approach for Early Stage Lung Cancer," Proceedings of the Third International Conference on Intelligent Sustainable Systems (ICISS 2020), IEEE Xplore.
[6] Jayaraj D., Sathiamoorthy S., "Random Forest based Classification Model for Lung Cancer Prediction on Computer Tomography Images," IEEE 2019 International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India.
[7] Pankaj Chittora, Sandeep Chaurasia, Prasun Chakrabarti, Gaurav Kumawat, Tulika Chakrabarti, Zbigniew Leonowicz, Michał Jasiński, Łukasz Jasiński, Radomir Gono, Elżbieta Jasińska, Vadim Bolshev, "Prediction of Chronic Kidney Disease - A Machine Learning Perspective," IEEE Access, vol. 9, January 2021.
