Batch 36
By
SITHESHWAR.T(312417104104)
PRAVEEN.M(312417104067)
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
ANNA UNIVERSITY : CHENNAI 600 025
BONAFIDE CERTIFICATE
SIGNATURE SIGNATURE
Chennai-600119.
ACKNOWLEDGEMENT
We take this opportunity to thank our honourable Chairman Dr. B. Babu Manoharan, M.A.,
M.B.A., Ph.D. for the guidance he offered during our tenure in this institution.
We extend our heartfelt gratitude to our honourable Managing Director Mrs. B. Jessie Priya,
M.Com. and Director Mr. B. Sashisekar, M.Sc. for providing us with the required resources
to carry out this project.
We are indebted to our Principal Dr. P. Ravichandran, M.Tech., Ph.D. for granting us
permission to undertake this project.
We would like to express our earnest gratitude to our Head of the Department Dr. J. Dafni
Rose, M.E., Ph.D. for her commendable support and encouragement for the completion of the
project with perfection.
We also take this opportunity to express our profound gratitude to our guide
Mr. C. Gowtham, M.E., Assistant Professor, for his guidance, constant encouragement,
immense help and valuable advice for the completion of this project.
We wish to convey our sincere thanks to all the teaching and non-teaching staff of the
department of Computer Science and Engineering, St. Joseph’s Institute of Technology
without whose co-operation this venture would not have been a success.
CERTIFICATE OF EVALUATION
Semester : VIII
LEARNING Technology
2. PRAVEEN M
(312417104067)
The report of the project work submitted by the above students in partial fulfillment
for the award of Bachelor of Engineering Degree in Computer Science and
Engineering of Anna University was evaluated and confirmed to be the report of the
work done by the above students.
ABSTRACT
Lung cancer occurs in both men and women and is caused by the uncontrolled growth of
cells in the lungs. It leads to serious breathing problems during both inhalation and
exhalation. According to the World Health Organization, cigarette smoking and passive
smoking are the principal contributors to lung cancer. Mortality due to lung cancer is
rising among both young and old people compared with other cancers. Despite the
availability of high-tech medical facilities for careful diagnosis and effective treatment,
the mortality rate has not yet been brought under control. It is therefore necessary to take
precautions at the initial stage, so that symptoms and effects are found early enough for
better diagnosis. Machine learning is a type of artificial intelligence (AI) that gives
computers the ability to learn without being explicitly programmed. It focuses on the
development of computer programs that can change when exposed to new data. Machine
learning now has a great influence on the health care sector because of its computational
capability for early prediction of diseases through accurate data analysis. In this work we
analyze various machine learning classifiers for classifying the available lung cancer data.
The aim is to investigate machine learning based techniques for lung cancer prediction.
The dataset is analyzed with a supervised machine learning technique (SMLT) to capture
information such as variable identification, uni-variate, bi-variate and multi-variate
analysis, and missing value treatment; data validation, data cleaning/preparation and data
visualization are carried out on the entire dataset. We propose a machine learning-based
method to predict lung cancer accurately using supervised classification algorithms.
Additionally, we compare and discuss the performance of various machine learning
algorithms on the given lung cancer dataset, and evaluate a GUI based user interface for
lung cancer prediction by attributes.
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 OVERVIEW
1.2 EXISTING SYSTEM
1.2.1 ADVANTAGE/DISADVANTAGE OF THE EXISTING SYSTEM
1.3 PROPOSED SYSTEM
1.3.1 BENEFIT OF USING EXPLORATORY DATA ANALYSIS
1.3.2 BENEFIT OF USING DATA COLLECTION
1.4 ADVANTAGES OF THE SOLUTION
2 LITERATURE SURVEY
3 SYSTEM DESIGN
3.1 UNIFIED MODELLING LANGUAGE
3.1.1 USE CASE DIAGRAM FOR LUNG CANCER PREDICTION
3.1.2 CLASS DIAGRAM FOR DIAGNOSE OF LUNG CANCER PREDICTION
3.1.3 SEQUENCE DIAGRAM FOR DIAGNOSE OF LUNG CANCER PREDICTION
3.1.4 ACTIVITY DIAGRAM FOR DIAGNOSE OF LUNG CANCER PREDICTION
3.1.5 COLLABORATION DIAGRAM FOR DIAGNOSE OF LUNG CANCER PREDICTION
3.1.6 ER DIAGRAM FOR DIAGNOSE OF LUNG CANCER PREDICTION
3.1.7 WORKFLOW DIAGRAM FOR DIAGNOSE OF LUNG CANCER PREDICTION
4 SYSTEM ARCHITECTURE
4.1 ARCHITECTURE DESCRIPTION
5 SYSTEM IMPLEMENTATION
5.1 MODULE DESCRIPTION
5.1.1 DATA VALIDATION PROCESS
5.1.2 EXPLORATION DATA ANALYSIS OF VISUALIZATION
5.1.3 LOGISTIC REGRESSION
5.1.4 GUI BASED PREDICTION RESULT OF LUNG
5.2 ACCURACY CALCULATION
5.3 VALIDATION OF THE OUTPUT
5.3.1 SENSITIVITY
5.3.2 SPECIFICITY
5.4 WORKFLOW DIAGRAM FOR LUNG CANCER PREDICTION
LIST OF FIGURES
LIST OF FIGURES NAME OF THE FIGURE PAGE NO.
LIST OF ABBREVIATIONS
SMLT Supervised Machine Learning Technique
CT Computed Tomography
MRRN Multiple Resolution Residually Connected Network
NSCLC Non-Small Cell Lung Cancer
LIDC Lung Image Database Consortium
CAD Coronary Artery Disease
DSC Dice Similarity Coefficient
RP Radiation Pneumonitis
CTCAE Common Terminology Criteria for Adverse Events
LDCT Low-dose Computed Tomography
UML Unified Modelling Language
ERD Entity Relationship Diagram
GUI Graphical User Interface
FP False Positive
FN False Negative
TP True Positive
TN True Negative
CHAPTER 1
INTRODUCTION
1.1 OVERVIEW
1.2.1 ADVANTAGE/DISADVANTAGE OF THE EXISTING SYSTEM
The experimental results of the existing system show that all of its
metrics improve on those of the algorithms it was compared with. The model has
a clear anti-interference ability. It is stable and identifies lung nodules
effectively, and is expected to provide auxiliary diagnosis for early screening
of lung cancer.
However, it works on CT images, requires a large amount of data to classify
accurately, and takes a long time to train. Any modification requires
considerable effort to change the model.
The aim is to investigate machine learning based techniques for lung cancer
prediction. The dataset is analyzed with a supervised machine learning
technique (SMLT) to capture information such as variable identification,
uni-variate, bi-variate and multi-variate analysis, and missing value treatment;
data validation, data cleaning/preparation and data visualization are carried
out on the entire dataset. We propose a machine learning-based method to
predict lung cancer accurately using supervised classification algorithms.
Additionally, we compare and discuss the performance of various machine
learning algorithms on the given lung cancer dataset, and evaluate a GUI based
user interface for lung cancer prediction by attributes. The approach relies on
data analysis and data collection.
or hypothesis testing task. Supervised machine learning classification
algorithms are applied to the given dataset to extract patterns that help
predict whether a patient is likely to be affected, and so support better
decisions in the future.
The dataset collected for predicting lung cancer is split into a training
set and a test set, generally in a 7:3 ratio. The model built by the machine
learning algorithm is fitted on the training set, and its prediction accuracy
is then measured on the test set.
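The 7:3 split described above can be sketched as follows. This is a minimal illustration using only the standard library; the function name `split_70_30`, the seed and the stand-in records are assumptions made for the example, not part of the project code, which uses scikit-learn's `train_test_split` for the same purpose.

```python
import random

def split_70_30(rows, seed=1):
    """Shuffle a copy of rows and split it into 70% training and 30% test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    cut = len(shuffled) * 7 // 10           # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))                  # stand-in for 100 patient records
train, test = split_70_30(records)
print(len(train), len(test))
```

Every record ends up in exactly one of the two sets, so the test set stays unseen during training.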
CHAPTER 2
LITERATURE SURVEY
Sanjukta Rani Jena, Dr. Thomas George, Dr. Narain Ponraj (2019) - Lung
cancer is one of the most life-threatening diseases, and its treatment must be
a primary goal of scientific research. Early recognition of the cancer can help
cure the disease entirely. Numerous techniques for detecting lung cancer are
found in the literature, and several investigators have contributed findings on
cancer prediction. This paper largely deals with the prevailing lung cancer
detection techniques available in the literature. A number of methodologies
have been devised to improve the efficiency of detection. Applications such as
support vector machines, neural networks and image processing techniques are
extensively used for cancer detection, and these are elaborated in the work.
The early discovery of lung malignancy is a challenge because of the structure
of tumour cells, most of which overlap one another. The paper surveys numerous
strategies for identifying a lung tumour in its early stages. Manual
examination of the samples is tedious and inaccurate, and requires intensively
trained personnel to eliminate diagnostic errors. From the results obtained,
the authors conclude that the Local Binary Pattern performs better than other
basic textural patterns, as the histogram features obtained were greater than
those of the latter.
features across multiple image resolution and feature levels through residual
connections to detect and segment lung tumors. They evaluated their method on
a total of 1210 non-small cell lung cancer (NSCLC) tumors and nodules from three
datasets: 377 tumors from the open-source Cancer Imaging Archive (TCIA), 304
advanced stage NSCLC treated with anti-PD-1 checkpoint immunotherapy from the
internal institution MSKCC dataset, and 529 lung nodules from the Lung Image
Database Consortium (LIDC). The algorithm was trained using the 377 tumors
from the TCIA dataset, validated on the MSKCC dataset and tested on the LIDC
dataset. The segmentation accuracy compared to expert delineations was
evaluated by computing the Dice Similarity Coefficient (DSC), Hausdorff
distances, sensitivity and precision metrics. Their best performing
incremental-MRRN method produced the highest DSC of 0.74±0.13 for TCIA,
0.75±0.12 for MSKCC and 0.68±0.23 for the LIDC datasets. There was no
significant difference in the estimations of volumetric tumor changes computed
using the incremental-MRRN method compared with expert segmentation. The
authors proposed two neural networks to segment lung tumors from CT images by
adding multiple residual streams of varying resolutions. Their results
demonstrate improved segmentation accuracy across multiple datasets.
two problems jointly. They tested the system on the LUNA16 dataset and achieved
an average Dice similarity coefficient (DSC) of 91% as segmentation accuracy and
a score of nearly 92% for FP reduction. In support of their hypothesis, they
showed improvements in the segmentation and FP reduction tasks over two
baselines. They proposed a 3D deep multi-task CNN for simultaneously performing
segmentation and FP reduction, and showed that sharing some underlying
features for these tasks and training a single model on the shared features can
improve the results for both tasks, which are critical for lung cancer screening.
Furthermore, they showed that a semi-supervised approach can improve the results
without the need for a large amount of labeled training data.
Tianle Shen, Liming Sheng, Ying Chen, Lei Cheng and Xianghui
Du (2020) - Silica is an independent risk factor for lung cancer in addition to
smoking. Chronic silicosis is one of the most common and serious occupational
diseases and is associated with poor prognosis. However, the role of
radiotherapy in patients with chronic silicosis is unclear. The authors
conducted a retrospective study to evaluate the efficacy and safety of
radiotherapy in lung cancer patients with chronic silicosis, focusing
especially on the incidence of radiation pneumonitis (RP). Lung cancer patients
with chronic silicosis who had been treated with radiotherapy at their hospital
from 2005 to 2018 were enrolled. RP was graded according to the National Cancer
Institute’s Common Terminology Criteria for Adverse Events (CTCAE), version
3.0. Of the 22 patients, ten (45.5%) developed RP ≥2. Two RP-related deaths
(9.1%) occurred within 3 months after radiotherapy. Dosimetric factors V5, V10,
V15, V20 and mean lung dose (MLD) were significantly higher in patients who had
RP >2 (P < 0.05). The median overall survival (OS) times in patients with RP ≤2
and RP >2 were 11.5 months and 7.1 months, respectively. The study showed that
radiotherapy is associated with excessive and potentially fatal pulmonary
toxicity, with a high incidence of lethal RP, in lung cancer patients with
chronic silicosis, and that RP is related to OS.
of lung cancer. Despite the life-saving benefit of early detection by LDCT,
this imaging modality has many limitations, including high rates of detection
of indeterminate pulmonary nodules. Radiomics is the process of extracting
image-based, quantitative features from a region of interest, which can then be
analyzed to develop decision support tools that improve lung cancer screening.
Although prior published research has shown that delta radiomics (i.e., changes
in features over time) has utility in predicting treatment response, limited
work has been conducted using delta radiomics in lung cancer screening. The
authors therefore assessed the performance of combining delta features with
conventional (non-delta) features, using machine learning to predict lung
nodule malignancy. They found that the best area under the receiver operating
characteristic curve (AUC) was 0.822 when delta features were combined with
conventional features, versus an AUC of 0.773 for conventional features only.
Overall, the study demonstrated the utility of combining delta radiomics
features with conventional radiomics features to improve the performance of
models in the lung cancer screening setting. The paper investigated the impact
of combining delta radiomics features with conventional (non-delta) features
for diagnostic discrimination and for predicting future nodule malignancy. The
experiments confirm that delta features can improve the performance of models
derived from machine learning. An important finding is the improvement in model
performance specifically among Rider features when delta and conventional
(non-delta) features were combined.
CHAPTER 3
SYSTEM DESIGN
Figure 3.1 Use case diagram for lung cancer prediction
Use case diagrams are used for high-level requirement analysis of a
system. When the requirements of a system are analyzed, its functionalities
are captured in use cases. Use cases are therefore the system functionalities
written in an organized manner. The second element relevant to use cases is
the actors, here the Patient and the Doctor.
3.1.2 CLASS DIAGRAM FOR DIAGNOSE OF LUNG CANCER
PREDICTION USING MACHINE LEARNING
A class diagram is a graphical representation of the static view of
the system and represents different aspects of the application. A collection of
class diagrams therefore represents the whole system. The class diagram is the
main building block of object-oriented modelling. It is used for general
conceptual modelling of the structure of the application, and for detailed
modelling that translates the models into programming code. Class diagrams can
also be used for data modelling. The classes in a class diagram represent the
main elements and interactions in the application as well as the classes to be
programmed. Each class is drawn with three compartments: the top compartment
contains the name of the class, the middle compartment contains its attributes,
and the bottom compartment contains the operations the class can execute. The
class diagram can be complemented by a state diagram or UML state machine. In
the design of a system, a number of classes are identified and grouped together
in a class diagram that helps to determine the static relations between them.
With detailed modelling, the classes of the conceptual design are often split
into a number of subclasses.
The name of the class diagram should be meaningful and describe the
aspect of the system it covers. Each element and its relationships should be
identified in advance. The responsibility (attributes and methods) of each
class should be clearly identified, and each class should specify the minimum
number of properties. These specifications for the system are displayed as a
class diagram in Figure 3.2.
Figure 3.3 Sequence diagram for lung cancer prediction
The various actions that take place in the application, in the correct
sequence, are shown in Figure 3.3. Sequence diagrams are the most popular UML
diagrams for dynamic modelling.
Figure 3.4 Activity diagram for lung cancer prediction
An activity diagram does not show any message flow from one activity to
another. It is sometimes considered a flow chart; although it looks like one,
it is not. It shows different kinds of flow, such as parallel, branched,
concurrent and single. Figure 3.4 shows the activity diagram of the developed
application. The only thing missing from an activity diagram is the message
part. An application can have multiple systems; an activity diagram also
captures these systems and describes the flow from one system to another. This
specific usage is not available in other diagrams. These systems can be
databases, external queues, or any other system. The activity diagram is
suitable for modelling the activity flow of the system.
To choose between these two diagrams, the main emphasis is on the type
of requirement. If the time sequence is important, a sequence diagram is used;
if organization is required, a collaboration diagram is used.
still serve as a referral point, should any debugging or business process
re-engineering be needed later. An ER diagram in the Unified Modelling Language
(UML) is a diagram that shows a complete or partial view of the structure of a
modelled system at a specific time.
On its own, a workflow diagram can be extremely helpful for analysis.
By seeing how the process works from a top-down perspective, one can identify
its potential flaws, weaknesses, and areas for improvement. In addition, the
workflow diagram can help a doctor follow how lung cancer is detected. In the
workflow diagram, the source data moves to data processing and cleaning. It is
then split into two datasets, a training dataset and a testing dataset. The
training dataset is used to fit the ML algorithm to the source data. The
testing dataset is used to find the model with the best accuracy by evaluating
the ML algorithms. The selected model is then used for lung cancer prediction.
CHAPTER 4
SYSTEM ARCHITECTURE
In this chapter, the system architecture for the diagnosis of lung cancer using
machine learning techniques is presented and the modules are explained.
Figure 4.1 System architecture for diagnose of lung cancer
prediction using machine learning
The system proposes a machine learning-based method to predict lung
cancer accurately using supervised classification algorithms. Additionally,
it compares the performance of various machine learning algorithms on the
given lung cancer dataset, and evaluates a GUI based user interface for lung
cancer prediction by attributes.
CHAPTER 5
SYSTEM IMPLEMENTATION
The implementation of lung cancer prediction using machine learning aims to
find the accuracy rate. In this system the algorithms are compared and the
accuracy rate of each is found. After prediction, the result is presented
through a GUI application.
Validation techniques in machine learning are used to estimate the error
rate of the Machine Learning (ML) model, which can be considered close to the
true error rate on the population. If the data volume is large enough to be
representative of the population, validation techniques may not be needed. In
real-world scenarios, however, one works with samples of data that may not be
truly representative of the population. This module therefore finds missing
values and duplicate values, and describes the data type of each variable, for
example whether it is a float or an integer. A validation set is a sample of
data used to provide an unbiased evaluation of a model fit on the training
dataset while tuning model hyperparameters. The evaluation becomes more biased
as skill on the validation dataset is incorporated into the model
configuration. The validation set is used to evaluate a given model
frequently, and machine learning engineers use this data to fine-tune the
model hyperparameters. Data collection, data analysis, and the process of
addressing data content, quality, and structure can add up to a time-consuming
to-do list. During data identification, it helps to understand the data and
its properties; this knowledge helps in choosing which algorithm to use to
build the model. For example, time series data can be analyzed by regression
algorithms, while classification algorithms can be used to analyze discrete
data. (An example is the data type format of the given dataset.)
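The checks described above (missing values, duplicate rows and data types) can be sketched as follows. This is a minimal standard-library illustration under stated assumptions: the function name `validate` and the three sample rows are hypothetical, and the project itself performs these checks with pandas (`df.info()`, `df.duplicated()`).

```python
def validate(rows):
    """Return missing-value counts, the number of duplicate rows, and observed types per column."""
    missing, types = {}, {}
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(sorted(row.items()))    # hashable fingerprint of the whole row
        if key in seen:
            duplicates += 1
        seen.add(key)
        for column, value in row.items():
            if value is None:               # None stands for a missing value
                missing[column] = missing.get(column, 0) + 1
            else:
                types.setdefault(column, set()).add(type(value).__name__)
    return missing, duplicates, types

# Hypothetical rows using column names that appear in the dataset
rows = [
    {"Age": 35, "Smokes": 3, "AreaQ": 5.0},
    {"Age": 62, "Smokes": None, "AreaQ": 4.0},
    {"Age": 35, "Smokes": 3, "AreaQ": 5.0},
]
missing, duplicates, types = validate(rows)
print("missing:", missing)
print("duplicates:", duplicates)
print("types:", types)
```

A column whose type set contains more than one entry would be a sign of mixed data that needs cleaning before modelling.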
Data Pre-processing
gathered from different sources it is collected in raw format, which is not
feasible for analysis. To achieve better results from the applied machine
learning model, the data has to be in a proper format. Some machine learning
models need information in a specified format; for example, the Random Forest
algorithm does not support null values. Therefore, to execute the Random
Forest algorithm, null values have to be handled in the original raw dataset.
5.1.3 Logistic Regression
It is a statistical method for analysing a data set in which there are one or
more independent variables that determine an outcome. The outcome is
measured with a dichotomous variable (in which there are only two possible
outcomes). The goal of logistic regression is to find the best fitting model to
describe the relationship between the dichotomous characteristic of interest
(dependent variable = response or outcome variable) and a set of independent
(predictor or explanatory) variables. Logistic regression is a Machine Learning
classification algorithm that is used to predict the probability of a categorical
dependent variable. In logistic regression, the dependent variable is a binary
variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).
For a binary regression, the factor level 1 of the dependent variable should
represent the desired outcome.
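The idea of fitting a probability to a binary outcome can be illustrated with a small sketch. It is not the project's implementation (which uses scikit-learn's `LogisticRegression`); the single feature, the toy data, the learning rate and the function names are all assumptions chosen for the example.

```python
import math

def sigmoid(z):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def predict(w, b, x):
    """Return class 1 when the estimated probability is at least 0.5, else class 0."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Toy data: one predictor, outcome coded as 0 (no) or 1 (yes)
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
predictions = [predict(w, b, x) for x in xs]
print(predictions)
```

The fitted weight is positive, so larger values of the predictor raise the estimated probability of the outcome coded as 1.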
Decision Tree
fitted generating too many branches and may reflect anomalies due to noise or
outliers.
False Positives (FP): A patient who does not have lung cancer is predicted as
having it. The actual class is no and the predicted class is yes.
False Negatives (FN): A patient who has lung cancer is predicted as not having
it. The actual class is yes but the predicted class is no.
True Positives (TP): A patient who has lung cancer is predicted as having it.
These are the correctly predicted positive values: the actual class is yes and
the predicted class is also yes.
True Negatives (TN): A patient who does not have lung cancer is predicted as
not having it. These are the correctly predicted negative values: the actual
class is no and the predicted class is also no.
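The four outcomes defined above can be counted directly from paired actual and predicted labels. The sketch below is illustrative only; the function name `confusion_counts` and the label lists are made up for the example, with 1 standing for the positive class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true positives, false positives, true negatives and false negatives."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1        # predicted yes, actually yes
            else:
                fp += 1        # predicted yes, actually no
        else:
            if a == positive:
                fn += 1        # predicted no, actually yes
            else:
                tn += 1        # predicted no, actually no
    return tp, fp, tn, fn

actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(actual, predicted)
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)
```

The four counts always add up to the number of test cases, which is a quick sanity check on any confusion matrix.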
Precision, recall, true positive rate (TPR), and false positive rate (FPR)
were obtained for each classification technique, as shown in the above tables,
together with a confusion matrix for each technique; the confusion matrix
shows the classification performance of each classifier. A confusion matrix is
used to compute the accuracy rate of each class. For each class, it shows how
instances from that class receive the various classifications. The next table
shows the instances that are correctly and incorrectly classified, along with
the overall accuracy of each classification technique. All classifiers perform
similarly well with respect to the number of correctly classified instances.
Comparing algorithms by prediction, in the form of best accuracy result:
In the example below, the following algorithms are compared:
Logistic Regression
Random Forest
Decision tree
• The new feature values are placed in a numpy array called ‘n’. To predict
the class of these features, the predict method is used; it takes this
array as input and returns the predicted target value as output.
• Here the predicted target value comes out to be 0. The test score is then
found as the ratio of the number of correct predictions to the total number
of predictions made; the accuracy score method compares the actual values
of the test set with the predicted values.
• The test set, or unseen examples, is a subset of the dataset used to assess
the likely future performance of a model. If a model fits the training set
much better than it fits the test set, overfitting is probably the cause.
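The test score and the overfitting check described in the list above can be sketched as follows. The label lists are hypothetical and the helper `accuracy` is written out by hand for illustration; the project code relies on scikit-learn's scoring functions instead.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Hypothetical labels: perfect on the training set, weaker on the test set
train_actual, train_predicted = [0, 1, 1, 0, 1], [0, 1, 1, 0, 1]
test_actual, test_predicted = [1, 0, 1, 1], [1, 0, 0, 1]

train_score = accuracy(train_actual, train_predicted)
test_score = accuracy(test_actual, test_predicted)
print("train accuracy:", train_score)
print("test accuracy:", test_score)
print("gap:", train_score - test_score)
```

A large positive gap between the training and test scores is the usual symptom of overfitting.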
5.3.1 Sensitivity
Sensitivity is a measure of the proportion of actual positive cases that are
predicted as positive (true positives). Sensitivity is also termed Recall. The
remaining proportion of actual positive cases is predicted incorrectly as
negative (false negatives), and can be expressed as the false negative rate.
The sum of sensitivity and false negative rate is 1. Consider a model used for
predicting whether a person is suffering from the disease. Sensitivity is the
proportion of people suffering from the disease who are correctly predicted as
suffering from it. In other words, a person who is unhealthy is predicted as
unhealthy.
Sensitivity = True Positive / (True Positive + False Negative)
A higher value of sensitivity means a higher number of true positives and a
lower number of false negatives. A lower value of sensitivity means a lower
number of true positives and a higher number of false negatives. In the
healthcare and financial domains, models with high sensitivity are desired.
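As a small worked example of the definition above, the following sketch computes sensitivity and the false negative rate from counts of true positives and false negatives. The counts are made up for illustration and the function names are not taken from the project code.

```python
def sensitivity(tp, fn):
    """Sensitivity (recall, true positive rate) = TP / (TP + FN)."""
    return tp / (tp + fn)

def false_negative_rate(tp, fn):
    """False negative rate = FN / (TP + FN)."""
    return fn / (tp + fn)

tp, fn = 3, 1   # hypothetical counts: 3 patients detected, 1 missed
print("sensitivity:", sensitivity(tp, fn))
print("false negative rate:", false_negative_rate(tp, fn))
```

With these counts, three of the four actual positive cases are detected, and the two rates add up to 1.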
5.3.2 Specificity
Specificity = True Negative / (True Negative + False Positive)
The following are the details of True Negative and False Positive used in the
above equation.
• True Negative = Persons predicted as not suffering from the disease (or
healthy) are actually found to be not suffering from the disease (healthy);
In other words, the true negative represents the number of persons who
are healthy and are predicted as healthy.
• False Positive = Persons predicted as suffering from the disease (or
unhealthy) are actually found to be not suffering from the disease
(healthy). In other words, the false positive represents the number of
persons who are healthy and got predicted as unhealthy.
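Specificity can be computed from the true negative and false positive counts in the same way as sensitivity. The sketch below uses made-up counts and hypothetical function names, purely to illustrate the relationship between specificity and the false positive rate.

```python
def specificity(tn, fp):
    """Specificity (true negative rate) = TN / (TN + FP)."""
    return tn / (tn + fp)

def false_positive_rate(tn, fp):
    """False positive rate = FP / (TN + FP)."""
    return fp / (tn + fp)

tn, fp = 3, 1   # hypothetical counts: 3 healthy persons cleared, 1 wrongly flagged
print("specificity:", specificity(tn, fp))
print("false positive rate:", false_positive_rate(tn, fp))
```

Specificity and the false positive rate are complements, so a model with high specificity raises few false alarms among healthy persons.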
5.4 WORKFLOW DIAGRAM FOR DIAGNOSE OF LUNG CANCER
PREDICTION USING MACHINE LEARNING
A workflow diagram is a visual representation of a business process (or
workflow), usually done through a flowchart. It uses standardized symbols to
describe the exact steps needed to complete a process, as well as pointing out
individuals responsible for each step.
CHAPTER 6
RESULTS AND CODING
6.3 PACKAGES USED
import pandas as p
df = p.read_csv("lung_cancer1.csv")

# Inspect the structure of the dataset
df.columns
df.info()
df.duplicated()

# Range of the Age column
df.Age.unique()
print("Minimum value of age is:", df.Age.min())
print("Maximum value of age is:", df.Age.max())
print("Age values :", sorted(df['Age'].unique()))
df.AreaQ.unique()
df.columns
p.Categorical(df['family history']).describe()
df['Smokes'].value_counts()
df.corr(numeric_only=True)   # correlation of the numeric columns only
df.head()
df.columns

# Encode every column as integer labels
from sklearn.preprocessing import LabelEncoder
var_mod = ['Name', 'Member_ID', 'Diagnosis', 'Age', 'Smokes', 'Smokes (years)',
           'Smokes (packs/year)', 'AreaQ', 'Alkhol', 'family history', 'Result']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i]).astype(int)
df.head()
df['Age'].unique()
df['Smokes'].unique()
6.5 LOGISTIC REGRESSION AND NAIVE BAYES ALGORITHMS
import pandas as p
import matplotlib.pyplot as plt
import seaborn as s
import numpy as n
import warnings
warnings.filterwarnings("ignore")

# Load the dataset and drop duplicate rows
df = p.read_csv("lung_cancer1.csv")
df.shape
df = df.drop_duplicates()
df.shape
df.columns

# Encode every column as integer labels
from sklearn.preprocessing import LabelEncoder
var_mod = ['Name', 'Member_ID', 'Diagnosis', 'Age', 'Smokes', 'Smokes (years)',
           'Smokes (packs/year)', 'AreaQ', 'Alkhol', 'family history', 'Result']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i]).astype(int)
df.head()
df.columns

from sklearn.metrics import (confusion_matrix, classification_report,
                             matthews_corrcoef, cohen_kappa_score, accuracy_score,
                             average_precision_score, roc_auc_score)
# Assumption: 'Result' is the target column; the listing does not show how X and y were built
X = df.drop(columns=['Result'])
y = df['Result']

# 70/30 train/test split, stratified on the target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1, stratify=y)
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
logR = LogisticRegression()
logR.fit(X_train, y_train)
predictR = logR.predict(X_test)
print("")
print('Classification report of Logistic Regression Results:')
print("")
print(classification_report(y_test, predictR))
accuracy = cross_val_score(logR, X, y, cv=70)
print('Cross validation test results of accuracy:')
print(accuracy)
print("")
print("Accuracy result of Logistic Regression is:", accuracy.mean() * 100)
print("")
cm1 = confusion_matrix(y_test, predictR)
print('Confusion Matrix result of Logistic Regression is:\n', cm1)
print("")
# Rows of cm1 are actual classes, columns are predicted classes;
# class 1 is taken as the positive class
sensitivity1 = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Sensitivity : ', sensitivity1)
print("")
specificity1 = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Specificity : ', specificity1)
print("")
TN = cm1[0][0]
FN = cm1[1][0]
TP = cm1[1][1]
FP = cm1[0][1]
print("True Positive :", TP)
print("True Negative :", TN)
print("False Positive :", FP)
print("False Negative :", FN)
print("")
TPR = TP/(TP+FN)
TNR = TN/(TN+FP)
FPR = FP/(FP+TN)
FNR = FN/(TP+FN)
print("True Positive Rate :", TPR)
print("True Negative Rate :", TNR)
print("False Positive Rate :", FPR)
print("False Negative Rate :", FNR)
print("")
PPV = TP/(TP+FP)
NPV = TN/(TN+FN)
print("Positive Predictive Value :", PPV)
print("Negative predictive value :", NPV)

# Gaussian Naive Bayes on the same train/test split
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
predictR = gnb.predict(X_test)
print("")
print('Classification report of Naive Bayes Results:')
print("")
print(classification_report(y_test, predictR))
accuracy = cross_val_score(gnb, X, y, cv=100)
print('Cross validation test results of accuracy:')
print(accuracy)
print("")
print("Accuracy result of Naive Bayes is:", accuracy.mean() * 100)
print("")
cm1 = confusion_matrix(y_test, predictR)
print('Confusion Matrix result of Naive Bayes is:\n', cm1)
print("")
# Rows of cm1 are actual classes, columns are predicted classes;
# class 1 is taken as the positive class
sensitivity1 = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Sensitivity : ', sensitivity1)
print("")
specificity1 = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Specificity : ', specificity1)
print("")
TN = cm1[0][0]
FN = cm1[1][0]
TP = cm1[1][1]
FP = cm1[0][1]
print("True Positive :", TP)
print("True Negative :", TN)
print("False Positive :", FP)
print("False Negative :", FN)
print("")
TPR = TP/(TP+FN)
TNR = TN/(TN+FP)
FPR = FP/(FP+TN)
FNR = FN/(TP+FN)
print("True Positive Rate :", TPR)
print("True Negative Rate :", TNR)
print("False Positive Rate :", FPR)
print("False Negative Rate :", FNR)
print("")
PPV = TP/(TP+FP)
NPV = TN/(TN+FN)
6.8 DATA ANALYSIS AND DISCUSSION
The application reads the input and applies the prediction algorithm to it
to generate the output. The output consists of the accuracy, and is generated
for the given input.
The lung cancer data is taken as input, analysed and preprocessed, and the
machine learning algorithms are compared on it. Finally, the accuracy rate is
displayed.
CHAPTER 7
CONCLUSION AND FUTURE WORK
7.1 CONCLUSION
The accuracy score is improved by comparing popular machine learning
algorithms and selecting the best one.
This report investigates the applicability of machine learning
techniques for detecting cancer in operational conditions by attribute
prediction.