
DIAGNOSE OF LUNG CANCER PREDICTION

USING MACHINE LEARNING

By

SITHESHWAR.T(312417104104)
PRAVEEN.M(312417104067)

in partial fulfilment for the award of the degree of

BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING

St. JOSEPH’S INSTITUTE OF TECHNOLOGY


CHENNAI 600 119

ANNA UNIVERSITY: CHENNAI 600 025


APRIL 2021

ANNA UNIVERSITY : CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project report “DIAGNOSE OF LUNG CANCER PREDICTION
USING MACHINE LEARNING” is the bonafide work of SITHESHWAR T
(312417104104) and PRAVEEN M (312417104067), who carried out the project under my
supervision.

SIGNATURE                                        SIGNATURE

Dr. J. DAFNI ROSE, M.E., Ph.D.,                  Mr. C. GOWTHAM, M.E.,
Professor,                                       Assistant Professor,
Head of the Department,                          Computer Science and Engineering,
Computer Science and Engineering,                St. Joseph’s Institute of Technology,
St. Joseph’s Institute of Technology,            Old Mamallapuram Road,
Old Mamallapuram Road,                           Chennai-600119.
Chennai-600119.

ACKNOWLEDGEMENT

We take this opportunity to thank our honourable Chairman Dr. B. Babu Manoharan, M.A.,
M.B.A., Ph.D. for the guidance he offered during our tenure in this institution.

We extend our heartfelt gratitude to our honourable Managing Director Mrs. B. Jessie Priya,
M.Com. and Director Mr. B. Sashisekar, M.Sc. for providing us with the required resources
to carry out this project.

We are indebted to our Principal Dr. P. Ravichandran, M.Tech., Ph.D. for granting us
permission to undertake this project.

We would like to express our earnest gratitude to our Head of the Department Dr. J. Dafni
Rose, M.E., Ph.D. for her commendable support and encouragement for the completion of the
project with perfection.

We also take this opportunity to express our profound gratitude to our guide
Mr.C.GOWTHAM, M.E., Assistant professor for his guidance, constant encouragement,
immense help and valuable advice for the completion of this project.

We wish to convey our sincere thanks to all the teaching and non-teaching staff of the
department of Computer Science and Engineering, St. Joseph’s Institute of Technology
without whose co-operation this venture would not have been a success.

CERTIFICATE OF EVALUATION

College Name : St. JOSEPH’S INSTITUTE OF TECHNOLOGY

Branch Name : COMPUTER SCIENCE AND ENGINEERING

Semester : VIII

Sl.No  Name of the Students            Title of the Project           Name of the Supervisor with Designation

1.     SITHESHWAR T (312417104104)     DIAGNOSE OF LUNG CANCER        Mr. C. GOWTHAM, M.E.,
                                       PREDICTION USING MACHINE       Assistant Professor,
2.     PRAVEEN M (312417104067)        LEARNING                       Department of CSE,
                                                                      St. Joseph’s Institute of Technology

The report of the project work submitted by the above students in partial fulfillment
for the award of the Bachelor of Engineering degree in Computer Science and
Engineering of Anna University was evaluated and confirmed to be a report of the
work done by the above students.

Submitted for project review and viva voce exam held on

INTERNAL EXAMINER EXTERNAL EXAMINER

ABSTRACT

Lung cancer occurs in both males and females due to the uncontrollable growth of cells
in the lungs. This causes serious breathing problems during both inhalation and exhalation.
According to the World Health Organization, cigarette smoking and passive smoking are the
principal contributors to lung cancer. The mortality rate due to lung cancer is increasing day
by day among the young as well as the old, compared with other cancers. Despite the
availability of high-tech medical facilities for careful diagnosis and effective medical
treatment, the mortality rate has not yet been controlled to a satisfactory extent. It is
therefore highly necessary to take precautions at the initial stage, so that symptoms and
effects can be found early for better diagnosis. Machine learning is a type of artificial
intelligence (AI) that provides computers with the ability to learn without being explicitly
programmed; it focuses on the development of computer programs that can change when
exposed to new data. Machine learning nowadays has a great influence on the health care
sector because of its high computational capability for the early prediction of diseases with
accurate data analysis. In this work we analyse various machine learning classification
techniques to classify the lung cancer data available in the dataset. The aim is to apply
machine learning based techniques for lung cancer prediction. The dataset is analysed with
supervised machine learning techniques (SMLT) to capture information such as variable
identification, uni-variate analysis, bi-variate and multi-variate analysis, and missing-value
treatment; data validation, data cleaning/preparation and data visualization are carried out
on the entire given dataset. We propose a machine learning based method to accurately
predict lung cancer using supervised classification algorithms, and additionally compare and
discuss the performance of various machine learning algorithms on the given dataset, with
evaluation of a GUI based user interface for lung cancer prediction by attributes.

TABLE OF CONTENT

CHAPTER NO. TITLE PAGE NO.

ABSTRACT v

LIST OF FIGURES 9

LIST OF ABBREVIATIONS 10

1 INTRODUCTION 11
1.1 OVERVIEW 11
1.2 EXISTING SYSTEM 12
1.2.1 ADVANTAGE/DISADVANTAGES
OF EXISTING SYSTEM 13
1.3 PROPOSED SYSTEM 13
1.3.1 BENEFIT OF USING
EXPLORATORY DATA ANALYSIS 13
1.3.2 BENEFIT OF USING DATA
COLLECTION 14
1.4 ADVANTAGES OF THE SOLUTION 14

2 LITERATURE SURVEY 15

3 SYSTEM DESIGN 21
3.1 UNIFIED MODELLING LANGUAGE 21
3.1.1 USE CASE DIAGRAM FOR
LUNG CANCER PREDICTION 21
3.1.2 CLASS DIAGRAM FOR DIAGNOSE
OF LUNG CANCER PREDICTION 23
3.1.3 SEQUENCE DIAGRAM FOR

DIAGNOSE OF LUNG
CANCER PREDICTION 24
3.1.4 ACTIVITY DIAGRAM FOR
DIAGNOSE OF LUNG
CANCER PREDICTION 26
3.1.5 COLLABORATION DIAGRAM FOR
DIAGNOSE OF LUNG
CANCER PREDICTION 27
3.1.6 ER DIAGRAM FOR DIAGNOSE
OF LUNG CANCER PREDICTION 28
3.1.7 WORKFLOW DIAGRAM FOR
DIAGNOSE OF LUNG
CANCER PREDICTION 30

4 SYSTEM ARCHITECTURE 31
4.1 ARCHITECTURE DESCRIPTION 31

5 SYSTEM IMPLEMENTATION 33
5.1 MODULE DESCRIPTION 33
5.1.1 DATA VALIDATION
PROCESS 33
5.1.2 EXPLORATORY DATA ANALYSIS
AND VISUALIZATION 35
5.1.3 LOGISTIC REGRESSION 36
5.1.4 GUI BASED PREDICTION
RESULT OF LUNG CANCER 38
5.2 ACCURACY CALCULATION 38
5.3 VALIDATION OF THE OUTPUT 40
5.3.1 SENSITIVITY 40
5.3.2 SPECIFICITY 42
5.4 WORKFLOW DIAGRAM FOR
LUNG CANCER PREDICTION 43

6 RESULT AND CODING 44


6.1 LANGUAGE USED 44
6.2 TOOL USED 44
6.3 PACKAGE USED 45
6.4 LUNG CANCER PREDICTION USING
DATA VALIDATION PROCESS 45
6.5 LOGISTIC REGRESSION AND NAIVE
BAYES ALGORITHM 47
6.6 DATA ANALYSIS AND DISCUSSION 53
6.7 RESULT ANALYSIS 53

7 CONCLUSION AND FUTURE WORK 55


7.1 CONCLUSION 55
7.2 FUTURE WORK 55
REFERENCES 56

LIST OF FIGURES
FIGURE NO. NAME OF THE FIGURE PAGE NO.

3.1 Use case diagram of lung cancer prediction 22

3.2 Class diagram of lung cancer prediction 24

3.3 Sequence diagram of lung cancer prediction 25

3.4 Activity diagram of lung cancer prediction 26

3.5 Collaboration diagram of lung cancer prediction 28

3.6 ER diagram of lung cancer prediction 29

3.7 Workflow diagram of lung cancer prediction 30

4.1 System architecture for diagnose of lung cancer prediction 32

5.1 Workflow diagram for lung cancer prediction 43

6.1 Lung cancer prediction 53

LIST OF ABBREVIATIONS
SMLT Supervised Machine Learning Technique
CT Computed Tomography
MRRN Multiple Resolution Residually Connected Network
NSCLC Non-Small Cell Lung Cancer
LIDC Lung Image Database Consortium
CAD Computer-Aided Diagnosis
DSC Dice Similarity Coefficient
RP Radiation Pneumonitis
CTCAE Common Terminology Criteria for Adverse Events
LDCT Low-dose Computed Tomography
UML Unified Modelling Language
ERD Entity Relationship diagram
GUI Graphical User Interface
FP False Positive
FN False Negative
TP True Positive
TN True Negative

CHAPTER 1
INTRODUCTION

Lung cancer is caused by the uncontrollable growth of cells in the lungs. It causes
serious breathing problems during both inhalation and exhalation. According to the
World Health Organization, cigarette smoking and passive smoking are the principal
contributors to lung cancer. The aim of this work is to apply machine learning based
techniques to lung cancer prediction, analysing the dataset with supervised machine
learning techniques (SMLT) to capture several kinds of information.

1.1 OVERVIEW

Lung cancer occurs in both males and females due to the uncontrollable
growth of cells in the lungs. This causes serious breathing problems during both
inhalation and exhalation. According to the World Health Organization, cigarette
smoking and passive smoking are the principal contributors to lung cancer. The
mortality rate due to lung cancer is increasing day by day among the young as well
as the old, compared with other cancers. Despite the availability of high-tech
medical facilities for careful diagnosis and effective medical treatment, the
mortality rate has not yet been controlled to a satisfactory extent. It is therefore
highly necessary to take precautions at the initial stage, so that symptoms and
effects can be found early for better diagnosis. Machine learning is a type of
artificial intelligence (AI) that provides computers with the ability to learn without
being explicitly programmed; it focuses on the development of computer programs
that can change when exposed to new data. Machine learning nowadays has a great
influence on the health care sector because of its high computational capability for
the early prediction of diseases with accurate data analysis. In this work we analyse
various machine learning classification techniques to classify the lung cancer data
available in the dataset. The aim is to apply machine learning based techniques for
lung cancer prediction. The dataset is analysed with supervised machine learning
techniques (SMLT) to capture information such as variable identification,
uni-variate analysis, bi-variate and multi-variate analysis, and missing-value
treatment; data validation, data cleaning/preparation and data visualization are
carried out on the entire given dataset. We propose a machine learning based
method to accurately predict lung cancer using supervised classification
algorithms, and additionally compare and discuss the performance of various
machine learning algorithms on the given dataset, with evaluation of a GUI based
user interface for lung cancer prediction by attributes.
Keywords: Dataset, Machine Learning Classification Method, Python,
Prediction of Accuracy Result.

1.2 EXISTING SYSTEM

Detecting lung nodules with low-dose computed tomography (CT) can
predict the future risk of suffering from lung cancer. At present there are only a
few studies on lung nodules with low-dose CT, and the detection rate is very low.
In order to accurately detect lung nodules with low-dose CT, this paper proposes a
solution based on an integrated deep learning algorithm. The CT images are
pre-processed via image clipping, normalization and segmentation, and the positive
samples are expanded to balance the number of positive and negative samples.
The features of candidate lung nodule samples are learned using a convolutional
neural network and a residual network, and then imported into a long short-term
memory network, respectively. These features are then fused, the network
parameters are continuously optimized during the training process, and finally the
model with the best performance is obtained.
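The image clipping and normalization steps described above can be sketched as follows. The Hounsfield-unit window used here is a common choice for lung CT and is an assumption for illustration, not a value specified by the work being described.

```python
import numpy as np

def preprocess_ct(slice_hu, lo=-1000.0, hi=400.0):
    """Clip a CT slice to a Hounsfield-unit window and rescale it to [0, 1]."""
    clipped = np.clip(slice_hu, lo, hi)       # image clipping
    return (clipped - lo) / (hi - lo)         # normalization

demo = np.array([[-2000.0, -500.0],
                 [0.0,     1000.0]])          # toy 2x2 "slice" in HU
out = preprocess_ct(demo)
print(out.min(), out.max())                   # 0.0 1.0
```

Rescaling every slice to the same fixed range is what lets samples from different scanners be fed to one network consistently.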

1.2.1 ADVANTAGE/DISADVANTAGE OF THE EXISTING SYSTEM
The experimental results prove that, compared with other algorithms, all
metrics of the proposed algorithm are improved. The model has an obvious
anti-interference ability; it is stable and can identify lung nodules effectively, which
is expected to provide auxiliary diagnosis for early screening of lung cancers.
However, the system uses CT images, a large amount of data is required to classify
accurately, and it takes a lot of time to train the model. Any modification made
takes a lot of effort to change the model.

1.3 PROPOSED SYSTEM

The aim is to apply machine learning based techniques for lung cancer
prediction. The dataset is analysed with supervised machine learning techniques
(SMLT) to capture information such as variable identification, uni-variate analysis,
bi-variate and multi-variate analysis, and missing-value treatment; data validation,
data cleaning/preparation and data visualization are carried out on the entire given
dataset. We propose a machine learning based method to accurately predict lung
cancer using supervised classification algorithms. Additionally, the performance of
various machine learning algorithms on the given dataset is compared and
discussed, with evaluation of a GUI based user interface for lung cancer prediction
by attributes. This approach uses data analysis and data collection.

1.3.1 BENEFITS OF USING EXPLORATORY DATA ANALYSIS


Exploratory data analysis (EDA) is an approach to analysing data sets to
summarize their main characteristics, often using statistical graphics and other
data visualization methods. A statistical model may or may not be used, but
primarily EDA is for seeing what the data can tell us beyond the formal modeling
or hypothesis-testing task. Supervised classification algorithms are then given the
dataset to extract patterns, which help in predicting whether a patient is likely to
be affected or not, thereby supporting better decisions in the future.
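The EDA steps named in this report (variable identification, uni-variate analysis, missing-value treatment, bi-variate analysis) can be sketched with pandas. The column names and values below are hypothetical stand-ins for the actual lung cancer dataset.

```python
import pandas as pd

# Toy stand-in for the lung cancer dataset (hypothetical columns).
df = pd.DataFrame({
    "age": [63, 55, None, 71, 48],
    "smoking": [1, 1, 0, 1, 0],
    "lung_cancer": [1, 0, 0, 1, 0],
})

# Variable identification: column names and inferred types.
print(df.dtypes)

# Uni-variate analysis: per-column summary statistics.
print(df.describe())

# Missing-value treatment: count nulls, then impute age with its median.
print(df.isnull().sum())
df["age"] = df["age"].fillna(df["age"].median())

# Bi-variate analysis: correlation of each attribute with the target.
print(df.corr()["lung_cancer"])
```

In practice each of these prints would guide a cleaning or visualization decision before any classifier is trained.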

1.3.2 BENEFITS OF USING DATA COLLECTION

The data set collected for predicting lung cancer is split into a training set
and a test set. Generally, a 7:3 ratio is applied to split the training set and the test
set. The data model created by the machine learning algorithm is fitted on the
training set and, based on the resulting accuracy, predictions are made on the
test set.
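The 7:3 split described above can be sketched with scikit-learn's train_test_split; the feature rows here are illustrative only.

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]   # ten illustrative single-feature rows
y = [0, 1] * 5                 # ten binary labels

# 7:3 split of the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(len(X_train), len(X_test))  # 7 3
```

Fixing `random_state` makes the split reproducible, which matters when comparing several algorithms on the same partition.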

1.4 ADVANTAGES OF THE SOLUTION

It improves the accuracy score by comparing popular machine learning
algorithms. This work reports on the investigation of the applicability of machine
learning techniques for detecting cancer in operational conditions by attribute
prediction.
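The accuracy comparison described above can be sketched as follows, using Logistic Regression and Naive Bayes, the two algorithms named elsewhere in this report; the synthetic data stands in for the real lung cancer dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic binary-classification data standing in for the real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Naive Bayes", GaussianNB())]:
    model.fit(X_tr, y_tr)                       # train on the 70% split
    results[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {results[name]:.3f}")
```

The same loop extends naturally to any other classifiers being compared, since every scikit-learn estimator exposes the same fit/predict interface.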

CHAPTER 2
LITERATURE SURVEY

Sanjukta Rani Jena, Dr. Thomas George, Dr. Narain Ponraj (2019) - Lung
cancer is among the most life-threatening diseases, and its treatment must be a
primary goal of scientific research. Early recognition of cancer can be helpful in
curing the disease entirely. Numerous techniques for detecting lung cancer are
found in the literature, and several investigators have contributed their findings
on cancer prediction. These papers largely deal with the prevailing lung cancer
detection techniques available in the literature. A number of methodologies have
been developed to improve detection efficiency, and diverse approaches such as
support vector machines, neural networks and image processing techniques are
extensively used for cancer detection, as elaborated in this work. The early
discovery of lung malignancy is a challenge because of the structure of the tumour
cells, where the greater part of the cells overlap each other. This paper surveyed
numerous strategies to distinguish a lung tumour in its beginning periods. Manual
examination of the samples is tedious, inaccurate and requires an intensively
trained person to eliminate diagnostic errors. From the results obtained, it could be
concluded that the Local Binary Pattern performs better than other basic textural
patterns, as the histogram features obtained were greater than those of the latter.

Jue Jiang, Yu-chi Hu, Chia-Ju Liu, Darragh Halpenny (2018) -
Volumetric lung tumor segmentation and accurate longitudinal tracking of
tumor volume changes from computed tomography (CT) images are essential
for monitoring tumor response to therapy. Hence, they developed two multiple
resolution residually connected network (MRRN) formulations called
incremental-MRRN and dense-MRRN. The networks simultaneously combine
features across multiple image resolutions and feature levels through residual
connections to detect and segment lung tumors. They evaluated the method on
a total of 1210 non-small cell lung cancer (NSCLC) tumors and nodules from
three datasets, consisting of 377 tumors from the open-source Cancer Imaging
Archive (TCIA), 304 advanced-stage NSCLC cases treated with anti-PD-1
checkpoint immunotherapy from an internal MSKCC dataset, and 529 lung
nodules from the Lung Image Database Consortium (LIDC). The algorithm was
trained using the 377 tumors from the TCIA dataset, validated on the MSKCC
dataset and tested on the LIDC dataset. The segmentation accuracy compared
with expert delineations was evaluated by computing the Dice Similarity
Coefficient (DSC), Hausdorff distances, sensitivity and precision. The
best-performing incremental-MRRN method produced the highest DSC:
0.74±0.13 for TCIA, 0.75±0.12 for MSKCC and 0.68±0.23 for the LIDC
datasets. There was no significant difference between the volumetric tumor
changes estimated by the incremental-MRRN method and those derived from
expert segmentation. In summary, they proposed two neural networks that
segment lung tumors from CT images by adding multiple residual streams of
varying resolutions, and the results clearly demonstrate the improvement in
segmentation accuracy across multiple datasets.
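The Dice Similarity Coefficient (DSC) used above to score segmentations is 2|A∩B| / (|A| + |B|) for binary masks A and B; a minimal sketch with hypothetical masks:

```python
import numpy as np

def dice(a, b):
    """Dice Similarity Coefficient for two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

pred  = np.array([[1, 1, 0], [0, 1, 0]])   # hypothetical predicted mask
truth = np.array([[1, 0, 0], [0, 1, 1]])   # hypothetical expert mask
print(dice(pred, truth))   # 2*2 / (3+3) = 0.666...
```

A DSC of 1.0 means a perfect overlap with the expert delineation, and 0.0 means no overlap at all, which is why the reported scores of 0.68–0.75 indicate substantial agreement.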

Naji Khosravan and Ulas Bagci (2018) - Early detection of lung nodules
is of great importance in lung cancer screening. Existing research recognizes the
critical role played by computer-aided diagnosis (CAD) systems in the early
detection and diagnosis of lung nodules. However, many CAD systems used as
cancer detection tools produce a large number of false positives (FP) and require
a further FP-reduction step. Furthermore, guidelines for the early diagnosis and
treatment of lung cancer consist of different shape and volume measurements of
abnormalities. To support this hypothesis they proposed a 3D deep multi-task
CNN to tackle these two problems jointly. They tested the system on the
LUNA16 dataset and achieved an average Dice similarity coefficient (DSC) of
91% as segmentation accuracy and a score of nearly 92% for FP reduction. As
proof of the hypothesis, they showed improvements on both the segmentation and
FP-reduction tasks over two baselines. The proposed 3D deep multi-task CNN
simultaneously performs segmentation and FP reduction; sharing some underlying
features across these tasks and training a single model using the shared features
improved the results for both tasks, which are critical for lung cancer screening.
Furthermore, they showed that a semi-supervised approach can improve the
results without the need for a large amount of labeled training data.

Nidhi S. Nadkarni, Prof. Sangam Borkar (2019) - Cancer is one of the
most serious and widespread diseases and is responsible for a large number of
deaths every year. Among all the different types of cancer, lung cancer is the most
prevalent and has the highest mortality rate. Computed tomography (CT) scans
are used to identify lung cancer, as they provide a detailed picture of the tumor in
the body and track its growth. Although CT is preferred over other imaging
modalities, visual interpretation of CT scan images can be an error-prone task and
can delay lung cancer detection. Therefore, image processing techniques are
widely used in medical fields for early-stage detection of lung tumors. This paper
presents an automated approach for detecting lung cancer in CT scan images. The
proposed algorithm uses median filtering for image preprocessing, followed by
segmentation of the lung region of interest using mathematical morphological
operations. The system for automatic detection of lung cancer in CT images was
successfully developed using image processing techniques. The adopted
methodology performs well in enhancing, segmenting and extracting features
from CT images. The median filtering technique was effective in eliminating
impulse noise from the images without blurring them, and mathematical
morphological operations enable accurate segmentation of the lung and tumor
regions.
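Median filtering, which the work above uses for impulse-noise removal, replaces each pixel with the median of its neighbourhood; a small sketch using SciPy (the 3×3 window and toy image are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import median_filter

img = np.array([[10, 10, 10],
                [10, 255, 10],    # a single impulse-noise pixel
                [10, 10, 10]])

# Each pixel is replaced by the median of its 3x3 neighbourhood,
# removing the impulse while leaving flat regions untouched.
filtered = median_filter(img, size=3)
print(filtered)
```

Because the median ignores extreme outliers, the 255 impulse vanishes, whereas a mean filter would smear it into its neighbours; this is why median filtering removes impulse noise without blurring edges.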

Tianle Shen, Liming Sheng, Ying Chen, Lei Cheng and Xianghui
Du (2020) - Silica is an independent risk factor for lung cancer in addition to
smoking. Chronic silicosis is one of the most common and serious occupational
diseases and is associated with a poor prognosis; however, the role of radiotherapy
is unclear in patients with chronic silicosis. They conducted a retrospective study
to evaluate efficacy and safety in lung cancer patients with chronic silicosis,
focusing especially on the incidence of radiation pneumonitis (RP). Lung cancer
patients with chronic silicosis who had been treated with radiotherapy from
2005 to 2018 in their hospital were enrolled in this retrospective study. RP was
graded according to the National Cancer Institute’s Common Terminology
Criteria for Adverse Events (CTCAE), version 3.0. Of the 22 patients, ten
(45.5%) developed RP ≥2. Two RP-related deaths (9.1%) occurred within 3
months after radiotherapy. The dosimetric factors V5, V10, V15, V20 and mean
lung dose (MLD) were significantly higher in patients who had RP >2 (P <
0.05). The median overall survival (OS) times in patients with RP ≤2 and RP >2
were 11.5 months and 7.1 months, respectively. This retrospective study showed
that radiotherapy is associated with a high incidence of excessive and lethal RP
in lung cancer patients with chronic silicosis, and that RP is related to OS.

Saeed S. Alahmari, Dmitry Cherezov, Dmitry B. Goldgof (2018) -
Low-dose computed tomography (LDCT) plays a critical role in the early detection
of lung cancer. Despite the life-saving benefit of early detection by LDCT, this
imaging modality has many limitations, including high rates of detection of
indeterminate pulmonary nodules. Radiomics is the process of extracting and
analyzing image-based, quantitative features from a region of interest, which can
then be analyzed to develop decision-support tools that improve lung cancer
screening. Although prior published research has shown that delta radiomics
(i.e., changes in features over time) has utility in predicting treatment response,
limited work has been conducted using delta radiomics in lung cancer screening.
As such, they conducted analyses to assess the performance of incorporating delta
features alongside conventional (non-delta) features, using machine learning to
predict lung nodule malignancy. They found that the best improved area under
the receiver operating characteristic curve (AUC) was 0.822 when delta features
were combined with conventional features, versus an AUC of 0.773 for
conventional features only. Overall, this study demonstrated the utility of
combining delta radiomics features with conventional radiomics features to
improve model performance in the lung cancer screening setting. The paper
investigated the impact of combining delta radiomics features with conventional
(non-delta) features for diagnostic discrimination and for predicting future nodule
malignancy. The experiments confirm that delta features can improve the
performance of models derived from machine learning; an important finding is
the improvement of model performance, specifically among Rider features, when
delta and conventional (non-delta) features were combined.

Yanbo Wang, Weikang Qian, Bo Yuan (2016) - Smoking is the major
cause of lung cancer and the leading cause of cancer-related death in the world.
The current view of lung cancer is no longer limited to individual genes being
mutated by carcinogenic insults from smoking. Instead, tumorigenesis is a
phenotype conferred by many systematic and global alterations, leading to
extensive heterogeneity and variation in both the genotypes and phenotypes of
individual cancer cells. Thus, it is strategically important to develop a
methodology that captures any consistent and global alterations presumably
shared by most of the cancerous cells in a given population. This is particularly
true because almost all data collected from solid cancers (including lung cancers)
are usually far apart over a large span of temporal or even spatial contexts. They
report a multiple non-Gaussian graphical model to reconstruct the gene
interaction network using two previously published gene expression datasets. The
graphical model aims to selectively detect gross structural changes at the level of
gene interaction networks. The methodology is extensively validated,
demonstrating good robustness, as well as the selectivity and specificity expected
based on biological insights. In summary, gene regulatory networks remain
relatively stable during what is presumably the early stage of neoplastic
transformation.

CHAPTER 3
SYSTEM DESIGN

3.1 UNIFIED MODELLING LANGUAGE

Unified Modelling Language (UML) is a standardized modelling


language enabling developers to specify, visualize, construct and document
artifacts of a software system. Thus, UML makes these artifacts scalable, secure
and robust in execution. It uses graphic notation to create visual models of
software systems. UML is designed to enable users to develop an expressive,
ready to use visual modelling language. In addition, it supports high-level
development concepts such as frameworks, patterns and collaborations. Some
of the UML diagrams are discussed.

3.1.1 USE CASE DIAGRAM FOR LUNG CANCER PREDICTION


Use case diagrams are considered for high-level requirement analysis of
a system. When the requirements of a system are analysed, the functionalities
are captured in use cases; it can be said that use cases are nothing but the
system functionalities written in an organized manner. The second thing
relevant to use cases is the actors. Actors can be defined as something that
interacts with the system: human users, some internal applications, or perhaps
some external applications. Use case diagrams are used to gather the
requirements of a system, including internal and external influences. These
requirements are mostly design requirements. Hence, when a system is
analysed to gather its functionalities, use cases are prepared and actors are
identified.

Figure 3.1 Use case diagram of lung cancer prediction

As shown in Figure 3.1, the functionalities captured during requirement
analysis are written as use cases in an organized manner, and the actors relevant
to those use cases here are the Patient and the Doctor.
3.1.2 CLASS DIAGRAM FOR DIAGNOSE OF LUNG CANCER
PREDICTION USING MACHINE LEARNING
A class diagram is basically a graphical representation of the static view of
the system and represents different aspects of the application, so a collection of
class diagrams represents the whole system. The class diagram is the main
building block of object-oriented modelling. It is used for general conceptual
modelling of the structure of the application, and for detailed modelling
translating the models into programming code. Class diagrams can also be used
for data modelling. The classes in a class diagram represent both the main
elements and interactions in the application, and the classes to be programmed.
The top compartment contains the name of the class, the middle compartment
contains the attributes of the class, and the bottom compartment contains the
operations the class can execute. The class diagram can be complemented by a
state diagram or UML state machine. In the design of a system, a number of
classes are identified and grouped together in a class diagram that helps to
determine the static relations between them. With detailed modelling, the classes
of the conceptual design are often split into a number of subclasses.

Figure 3.2 Class diagram for lung cancer prediction

The name of the class diagram should be meaningful and describe the
aspect of the system. Each element and its relationships should be identified in
advance; the responsibility (attributes and methods) of each class should be
clearly identified; and for each class, a minimum number of properties should be
specified. All of these specifications for the system are displayed as a class
diagram in Figure 3.2.

3.1.3 SEQUENCE DIAGRAM FOR DIAGNOSE OF LUNG CANCER


PREDICTION USING MACHINE LEARNING
UML sequence diagrams model the flow of logic within the system in a
visual manner, enabling one to both document and validate the logic, and are
commonly used for both analysis and design purposes. They show object
interactions arranged in time sequence, depicting the objects and classes
involved in the scenario and the sequence of messages exchanged between the
objects needed to carry out the functionality of the scenario. Sequence
diagrams are also called event diagrams or event scenarios. This diagram
shows the different processes or objects that live simultaneously and, as
horizontal arrows, the messages exchanged between them in the order in which
they occur. This allows the specification of simple runtime scenarios in a
graphical manner.

Figure 3.3 Sequence diagram for lung cancer prediction

The various actions that take place in the application, in the correct
sequence, are shown in Figure 3.3. Sequence diagrams are the most popular
UML diagrams for dynamic modelling.

3.1.4 ACTIVITY DIAGRAM FOR DIAGNOSE OF LUNG CANCER


PREDICTION USING MACHINE LEARNING
An activity is a particular operation of the system. The activity diagram is
suitable for modelling the activity flow of the system. Activity diagrams are not
only used for visualizing the dynamic nature of a system; they are also used to
construct the executable system by using forward and reverse engineering
techniques.

Figure 3.4 Activity diagram for lung cancer prediction

It does not show any message flow from one activity to another. An
activity diagram is sometimes considered a flow chart; although the diagram
looks like a flow chart, it is not. It shows different flows such as parallel,
branched, concurrent and single. Figure 3.4 shows the activity diagram of the
developed application. The only thing missing in an activity diagram is the
message part. An application can have multiple systems; the activity diagram
also captures these systems and describes the flow from one system to another,
a specific usage not available in other diagrams. These systems can be
databases, external queues, or any other systems. The activity diagram is
suitable for modelling the activity flow of the system.

3.1.5 COLLABORATION DIAGRAM FOR DIAGNOSE OF LUNG


CANCER PREDICTION USING MACHINE LEARNING
The next interaction diagram is the collaboration diagram, which shows
the object organization. In a collaboration diagram the method-call sequence is
indicated by a numbering technique: the number indicates how the methods are
called one after another. The method calls are similar to those of a sequence
diagram, but the difference is that the sequence diagram does not describe the
object organization, whereas the collaboration diagram does. The various
objects involved and their collaboration are shown in Figure 3.5.
A collaboration diagram, also known as a communication diagram, is an
illustration of the relationships and interactions among software objects in the
Unified Modelling Language (UML). These diagrams can be used to portray the
dynamic behaviour of a particular use case and define the role of each object.

Figure 3.5 Collaboration diagram for lung cancer prediction

To choose between these two diagrams, the main emphasis is on the type
of requirement: if the time sequence is important, a sequence diagram is used,
and if the object organization is important, a collaboration diagram is used.

3.1.6 ER DIAGRAM FOR DIAGNOSE OF LUNG CANCER PREDICTION USING MACHINE LEARNING
An entity relationship diagram (ERD), also known as an entity
relationship model, is a graphical representation of an information system that
depicts the relationships among people, objects, places, concepts or events
within that system. An ERD is a data modelling technique that can help define
business processes and be used as the foundation for a relational database.

Figure 3.6 ER diagram for lung cancer prediction

Entity relationship diagrams provide a visual starting point for database
design that can also be used to help determine information system requirements
throughout an organization. After a relational database is rolled out, an ERD can
still serve as a reference point, should any debugging or business process
re-engineering be needed later. An ER diagram in the Unified Modelling Language
(UML) shows a complete or partial view of the structure of a modelled system at
a specific time.

3.1.7 WORKFLOW DIAGRAM FOR DIAGNOSE OF LUNG CANCER PREDICTION USING MACHINE LEARNING
A workflow diagram is a visual representation of a business process (or
workflow), usually done through a flowchart. It uses standardized symbols to
describe the exact steps needed to complete a process, as well as pointing out
individuals responsible for each step.

Figure 3.7 Workflow diagram for lung cancer prediction

On its own, a workflow diagram can be extremely helpful for analysis.
By seeing how the system works from a top-down perspective, you can
identify its potential flaws, weaknesses, and areas for improvement. On top of
that, the workflow diagram can be helpful to a doctor in detecting lung cancer.
In the workflow diagram, the source data moves to data processing and cleaning.
It is then split into two datasets: a training dataset and a testing dataset.
The training dataset is used to fit the machine learning algorithms to the
source data, and the testing dataset is used to find the best model by
comparing the accuracy of the algorithms. Using this method, the lung cancer
prediction is made.

CHAPTER 4
SYSTEM ARCHITECTURE

In this chapter, the system architecture for the diagnosis of lung cancer
prediction using machine learning techniques is presented and the modules are explained.

4.1 ARCHITECTURE DESCRIPTION

In the system architecture, a detailed description of the system modules
and the working of each module is given, as shown in Figure 4.1. First the
dataset is created from the input information and the lung cancer data. The
dataset is then pre-processed and compared across machine learning algorithms.
Supervised classification algorithms are applied to the dataset to extract
patterns, which help in predicting whether a patient is likely to be affected
or not, thereby supporting better decisions in the future. The dataset
collected for predicting lung cancer is split into a training set and a test
set; generally, a 7:3 ratio is applied for the split. The data model created
with an ensemble learning approach is applied to the training set, and based
on the resulting accuracy, predictions are made on the test set. After the
accuracy rate is found for the given input, the result is finally shown to the
user through a GUI application. These reports support the investigation of the
applicability of machine learning techniques for detecting cancer in
operational conditions by attribute prediction.
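The 7:3 split described above can be sketched with scikit-learn's train_test_split; the arrays here are synthetic stand-ins for the lung cancer dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 toy records with 2 attributes each, and balanced binary labels
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

# test_size=0.3 gives the 7:3 training/test ratio; stratify keeps the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

print(len(X_train), len(X_test))  # 35 15
```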

Figure 4.1 System architecture for diagnose of lung cancer prediction using machine learning
The aim is to propose a machine learning-based method to accurately predict
lung cancer using supervised classification algorithms, and additionally to
compare and discuss the performance of the various machine learning algorithms
on the given lung cancer dataset, with evaluation through a GUI-based user
interface for lung cancer prediction by attributes.

CHAPTER 5
SYSTEM IMPLEMENTATION

The diagnosis of lung cancer is predicted using machine learning to find the
accuracy rate. In this system, algorithms are compared, and the accuracy rate of
each is found. After prediction, the system is visualised using a GUI
application.

5.1 MODULE DESCRIPTION

This system is implemented in the following modules:

• Data validation process and EDA (Exploratory Data Analysis) (Module 01)
• Data visualization (EDA) (Module 02)
• Training a model on the given dataset using the sklearn package, comparison
of the algorithms, and accuracy results of the algorithms (Modules 03 and 04)
• GUI based prediction results of lung cancer (Module 05)

5.1.1 Data Validation Process

Validation techniques in machine learning are used to estimate the error rate
of the machine learning (ML) model, which can be considered close to the true
error rate on the dataset. If the data volume is large enough to be
representative of the population, you may not need validation techniques.
However, in real-world scenarios, we often work with samples of data that may
not be truly representative of the population. This step finds missing values
and duplicate values and describes each data type, whether it is a float or an
integer variable. The validation set is a sample of data used to provide an
unbiased evaluation of a model fit on the training dataset while tuning the
model hyperparameters. The evaluation becomes more biased as skill on the
validation dataset is incorporated into the model configuration. The validation
set is used to evaluate a given model frequently, and machine learning
engineers use this data to fine-tune the model hyperparameters. Data
collection, data analysis, and the process of addressing data content, quality,
and structure can add up to a time-consuming to-do list. During the process of
data identification, it helps to understand your data and its properties; this
knowledge will help you choose which algorithm to use to build your model. For
example, time series data can be analyzed by regression algorithms, while
classification algorithms can be used to analyze discrete data (for example,
the data type format of the given dataset).

Data Validation / Cleaning / Preparing Process

The library packages are imported and the given dataset is loaded. Variable
identification is analyzed through the data shape and data types, and missing
and duplicate values are evaluated. A validation dataset is a sample of data
held back from training your model that is used to give an estimate of model
skill while tuning the model; there are procedures you can use to make the best
use of validation and test datasets when evaluating your models. Data cleaning
and preparation involve renaming columns of the given dataset, dropping
columns, and so on, in order to analyze the data with uni-variate, bi-variate
and multi-variate processes. The steps and techniques for data cleaning vary
from dataset to dataset. The primary goal of data cleaning is to detect and
remove errors and anomalies to increase the value of the data in analytics and
decision making.
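The cleaning steps above (dropping duplicates and missing values, renaming and dropping columns) can be sketched in pandas; the columns here are illustrative, not the full dataset schema:

```python
import pandas as pd

# Small illustrative frame; the real dataset has more attributes
df = pd.DataFrame({
    "Name": ["A", "B", "B", "C"],
    "Age": [35, 62, 62, None],
    "Smokes": [1, 0, 0, 1],
})

df = df.drop_duplicates()                      # remove the duplicate record
df = df.dropna()                               # drop the row with a missing Age
df = df.rename(columns={"Smokes": "smokes"})   # rename a column
df = df.drop(columns=["Name"])                 # drop an identifier column
print(df.shape)  # (2, 2)
```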

Data Pre-processing

Pre-processing refers to the transformations applied to our data before
feeding it to the algorithm. Data pre-processing is a technique used to convert
raw data into a clean dataset. In other words, whenever data is gathered from
different sources, it is collected in a raw format that is not feasible for
analysis. To achieve better results from the applied model, the data has to be
in a proper format. Some machine learning models need information in a
specified format; for example, the Random Forest algorithm does not support
null values. Therefore, to execute the Random Forest algorithm, null values
have to be handled in the original raw dataset.
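One way to handle the null values mentioned above is sketched below with scikit-learn's SimpleImputer; mean imputation is an illustrative choice, not the project's mandated strategy:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny matrix with missing entries, standing in for raw attribute data
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0]])

imputer = SimpleImputer(strategy="mean")  # replace NaN with the column mean
X_clean = imputer.fit_transform(X)
print(X_clean)  # the NaNs become 2.0 (column 0) and 5.0 (column 1)
```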

5.1.2 Exploratory Data Analysis and Visualization


Data visualization is an important skill in applied statistics and machine
learning. Statistics does indeed focus on quantitative descriptions and
estimations of data. Data visualization provides an important suite of tools
for gaining a qualitative understanding. This can be helpful when exploring and
getting to know a dataset and can help with identifying patterns, corrupt data,
outliers, and much more. With a little domain knowledge, data visualizations
can be used to express and demonstrate key relationships in plots and charts
that are more visceral to stakeholders than measures of association or
significance. Data visualization and exploratory data analysis are whole fields
in themselves, and a deeper dive into some of the books on the subject is
recommended. Sometimes data does not make sense until you can look at it in a
visual form, such as with charts and plots. Being able to quickly visualize
data samples is an important skill both in applied statistics and in applied
machine learning. This section covers the many types of plots you will need to
know when visualizing data in Python and how to use them to better understand
your own data.
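As a minimal sketch of such a plot, a histogram of a synthetic age attribute can be drawn with matplotlib; the column and file names are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic ages standing in for the Age attribute of the dataset
ages = np.random.default_rng(0).normal(55, 10, 200)

fig, ax = plt.subplots()
ax.hist(ages, bins=20)           # distribution of the attribute
ax.set_xlabel("Age")
ax.set_ylabel("Count")
fig.savefig("age_hist.png")      # saved figure; plt.show() would display it
```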

5.1.3 Logistic Regression
It is a statistical method for analysing a data set in which there are one or
more independent variables that determine an outcome. The outcome is
measured with a dichotomous variable (in which there are only two possible
outcomes). The goal of logistic regression is to find the best fitting model to
describe the relationship between the dichotomous characteristic of interest
(dependent variable = response or outcome variable) and a set of independent
(predictor or explanatory) variables. Logistic regression is a Machine Learning
classification algorithm that is used to predict the probability of a categorical
dependent variable. In logistic regression, the dependent variable is a binary
variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).

In other words, the logistic regression model predicts P(Y=1) as a
function of X. Logistic regression assumptions:

 Binary logistic regression requires the dependent variable to be binary.

 For a binary regression, the factor level 1 of the dependent variable should
represent the desired outcome.

 Only the meaningful variables should be included.

 The independent variables should be independent of each other; that is,
the model should have little or no multicollinearity.

 The independent variables are linearly related to the log odds.

 Logistic regression requires quite large sample sizes.
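A minimal sketch of fitting a logistic regression with scikit-learn, on synthetic data rather than the project's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data standing in for the lung cancer attributes
X, y = make_classification(n_samples=200, n_features=5, random_state=1)

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(Y=0), P(Y=1)] for each record
proba = model.predict_proba(X[:1])
print(proba.shape)  # (1, 2)
```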

Decision Tree

The decision tree is one of the most powerful and popular algorithms. The
decision-tree algorithm falls under the category of supervised learning
algorithms. It works for both continuous as well as categorical output
variables.
Assumptions of Decision tree
 At the beginning, we consider the whole training set as the root.

 Attributes are assumed to be categorical for information gain and, for
the Gini index, attributes are assumed to be continuous.
 On the basis of attribute values records are distributed recursively.

 We use statistical methods for ordering attributes as root or internal node.

Decision tree builds classification or regression models in the form of a
tree structure. It breaks down a dataset into smaller and smaller subsets
while, at the same time, an associated decision tree is incrementally
developed. A decision node has two or more branches, and a leaf node represents
a classification or decision. The topmost decision node in a tree, which
corresponds to the best predictor, is called the root node. Decision trees can
handle both categorical and numerical data. The tree utilizes an if-then rule
set which is mutually exclusive and exhaustive for classification. The rules
are learned sequentially using the training data, one at a time. Each time a
rule is learned, the tuples covered by the rules are removed. This process
continues on the training set until a termination condition is met. The tree is
constructed in a top-down, recursive, divide-and-conquer manner. All the
attributes should be categorical; otherwise, they should be discretized in
advance. Attributes at the top of the tree have more impact on the
classification and are identified using the information gain concept. A
decision tree can easily be over-fitted, generating too many branches, and may
reflect anomalies due to noise or outliers.
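A minimal decision tree classifier in scikit-learn; limiting max_depth is one common way to curb the over-fitting mentioned above (the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=1)

# A shallow tree tends to generalize better than an unbounded one
tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(X, y)
print(tree.get_depth())  # no deeper than 3
```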

5.1.4 GUI Based Prediction Results of Lung Cancer

Tkinter is a Python library for developing GUIs (Graphical User
Interfaces). We use the Tkinter library to create the user interface of the
application, that is, the windows and all other graphical user-interface
elements. Tkinter comes with Python as a standard package, and the interface
can be used to enforce security for each user or accountant. There are two
kinds of pages: one for user registration and one for user login.
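A minimal Tkinter sketch of such a login window; the widget layout and labels are illustrative, not the project's exact screens:

```python
import tkinter as tk

def build_login_window():
    """Build (but do not display) a minimal login window."""
    root = tk.Tk()
    root.title("Lung Cancer Prediction - Login")
    tk.Label(root, text="Username").grid(row=0, column=0)
    tk.Entry(root).grid(row=0, column=1)
    tk.Label(root, text="Password").grid(row=1, column=0)
    tk.Entry(root, show="*").grid(row=1, column=1)   # mask the password field
    tk.Button(root, text="Login").grid(row=2, column=1)
    return root

# build_login_window().mainloop() would display the window.
```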

5.2 Accuracy calculation

False Positives (FP): the actual class is no but the predicted class is yes.
For example, the patient does not actually have lung cancer, but the model
predicts that the patient does.
False Negatives (FN): the actual class is yes but the predicted class is no.
For example, the patient actually has lung cancer, but the model predicts that
the patient does not.
True Positives (TP): the correctly predicted positive values, where the actual
class is yes and the predicted class is also yes. For example, the patient has
lung cancer and the model predicts the same.
True Negatives (TN): the correctly predicted negative values, where the actual
class is no and the predicted class is also no. For example, the patient does
not have lung cancer and the model predicts the same.
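These four counts can be read directly off a scikit-learn confusion matrix; the labels below are toy values, with 1 meaning the patient has lung cancer:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # actual classes
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]  # predicted classes

cm = confusion_matrix(y_true, y_pred)
# scikit-learn orders the 2x2 matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```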

Precision, recall, the true positive rate (TPR), and the false positive rate
(FPR) are obtained for each classification technique, as shown in the tables,
along with a confusion matrix for each technique; the classification
performance of each classifier can be seen with the help of its confusion
matrix. We use a confusion matrix to compute the accuracy rate of each severity
class. For each class, it demonstrates how instances from that class receive
the various classifications. The instances that are correctly and incorrectly
classified are shown in accordance with the overall accuracy of each
classification technique. All classifiers perform similarly well with respect
to the number of correctly classified instances.
Comparing algorithms to find the best accuracy result:

It is important to compare the performance of multiple different machine
learning algorithms consistently; this section shows how to create a test
harness to compare multiple machine learning algorithms in Python with
scikit-learn. You can use this test harness as a template for your own machine
learning problems and add more and different algorithms to compare. Each model
will have different performance characteristics. Using resampling methods such
as cross-validation, you can get an estimate of how accurate each model may be
on unseen data. You need to be able to use these estimates to choose one or two
of the best models from the suite of models that you have created. When you
have a new dataset, it is a good idea to visualize the data using different
techniques in order to look at it from different perspectives. The same idea
applies to model selection: you should use a number of different ways of
looking at the estimated accuracy of your machine learning algorithms in order
to choose the one or two to finalize. One way to do this is to use different
visualization methods to show the average accuracy, variance and other
properties of the distribution of model accuracies.

In the example below, four different algorithms are compared:

 Logistic Regression

 Random Forest

 Decision tree

 Support Vector Machines

• The new feature values are placed in a NumPy array, and the target class
for these features is predicted using the predict method, which takes this
array as input and returns the predicted target value as output.
• The predicted target value comes out to be 0. Finally, the test score,
which is the ratio of the number of predictions found correct to the total
predictions made, is found using the accuracy score method, which compares
the actual values of the test set with the predicted values.
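A minimal test harness comparing the four algorithms with 5-fold cross-validation might look like this; the dataset here is synthetic, not the project's lung cancer data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the real dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=1),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "SVM": SVC(),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```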

5.3 VALIDATION OF THE OUTPUT

Once we acquire a dataset, we intend to divide it into two subsets:

• The training set is the subset of the dataset used to build predictive models.

• The test set, or unseen examples, is the subset of the dataset used to assess
the likely future performance of a model. If a model fits the training set much
better than it fits the test set, overfitting is probably the cause.

5.3.1 Sensitivity
Sensitivity is a measure of the proportion of actual positive cases that got
predicted as positive (or true positive). Sensitivity is also termed as Recall. This
implies that there will be another proportion of actual positive cases, which
would get predicted incorrectly as negative (and, thus, could also be termed as
the false negative). This can also be represented in the form of a false negative
rate. The sum of sensitivity and false negative rate would be 1. Let's try and
understand this with the model used for predicting whether a person is suffering
from the disease. Sensitivity is a measure of the proportion of people suffering
from the disease who got predicted correctly as the ones suffering from the
disease. In other words, the person who is unhealthy actually got predicted as
unhealthy.

Mathematically, sensitivity can be calculated as the following:

Sensitivity = (True Positive) / (True Positive + False Negative)

The following are the details of the True Positive and False Negative terms
used in the above equation.

• True Positive = Persons predicted as suffering from the disease (or


unhealthy) are actually suffering from the disease (unhealthy); In other
words, the true positive represents the number of persons who are
unhealthy and are predicted as unhealthy.
• False Negative = Persons who are actually suffering from the disease (or
unhealthy) but are predicted to be not suffering from the disease
(healthy). In other words, the false negative represents the number of
persons who are unhealthy but got predicted as healthy. Ideally, we
would want the model to have low false negatives, as they might prove to be
life-threatening or business-threatening.

A higher value of sensitivity means a higher value of true positives and a
lower value of false negatives. A lower value of sensitivity means a lower
value of true positives and a higher value of false negatives. For the
healthcare and financial domains, models with high sensitivity are desired.
5.3.2 Specificity

Specificity is defined as the proportion of actual negatives which got
predicted as negative (or true negative). This implies that there will be
another proportion of actual negatives which got predicted as positive and
could be termed false positives. This proportion could also be called the false
positive rate. The sum of specificity and the false positive rate would always
be 1. Let's try to understand this with the model used for predicting whether a
person is suffering from the disease. Specificity is a measure of the
proportion of people not suffering from the disease who got predicted correctly
as the ones who are not suffering from the disease. In other words, specificity
measures how often a person who is healthy is correctly predicted as healthy.

Mathematically, specificity can be calculated as the following:

Specificity = (True Negative) / (True Negative + False Positive)

The following are the details of the True Negative and False Positive terms
used in the above equation.

• True Negative = Persons predicted as not suffering from the disease (or
healthy) are actually found to be not suffering from the disease (healthy);
In other words, the true negative represents the number of persons who
are healthy and are predicted as healthy.
• False Positive = Persons predicted as suffering from the disease (or
unhealthy) are actually found to be not suffering from the disease
(healthy). In other words, the false positive represents the number of
persons who are healthy and got predicted as unhealthy.
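Both measures can be computed directly from a confusion matrix; the counts below are made up for illustration:

```python
import numpy as np

# Confusion matrix in scikit-learn's layout: [[TN, FP], [FN, TP]]
cm = np.array([[50, 5],
               [10, 35]])
tn, fp = cm[0]
fn, tp = cm[1]

sensitivity = tp / (tp + fn)  # true positive rate (recall)
specificity = tn / (tn + fp)  # true negative rate
print(round(sensitivity, 3), round(specificity, 3))  # 0.778 0.909
```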

5.4 WORKFLOW DIAGRAM FOR DIAGNOSE OF LUNG CANCER
PREDICTION USING MACHINE LEARNING
A workflow diagram is a visual representation of a business process (or
workflow), usually done through a flowchart. It uses standardized symbols to
describe the exact steps needed to complete a process, as well as pointing out
individuals responsible for each step.

Figure 5.1 Workflow diagram for lung cancer prediction

On its own, a workflow diagram can be extremely helpful for analysis.
By seeing how the system works from a top-down perspective, you can
identify its potential flaws, weaknesses, and areas for improvement. On top of
that, the workflow diagram can be helpful to a doctor in detecting lung cancer.

CHAPTER 6
RESULTS AND CODING

The following coding corresponds to the prediction of lung cancer using
machine learning techniques. The input to the code is the lung cancer dataset,
and the output corresponds to the prediction result together with the accuracy
of each algorithm.

6.1 LANGUAGE USED

Anaconda is a free and open-source distribution of the Python and R
programming languages for scientific computing (data science, machine learning
applications, large-scale data processing, predictive analytics, etc.) that
aims to simplify package management and deployment. Python is a widely used
high-level, general-purpose, interpreted, dynamic programming language. Its
design philosophy emphasizes code readability, and its syntax allows
programmers to express concepts in fewer lines of code than would be possible
in languages such as C++ or Java. The language provides constructs intended to
enable clear programs on both a small and large scale.

6.2 TOOL USED

The Jupyter Notebook is an open-source web application that allows you to
create and share documents that contain live code, equations, visualizations
and narrative text. Uses include data cleaning and transformation, numerical
simulation, statistical modeling, data visualization, machine learning, and
much more.

6.3 PACKAGES USED

Package versions are managed by the package management system Conda.
The Anaconda distribution is used by over 12 million users and includes more
than 1,400 popular data science packages suitable for Windows, Linux, and
macOS. The Anaconda distribution therefore comes with more than 1,400 packages
as well as the Conda package and virtual environment manager, called Anaconda
Navigator, which eliminates the need to learn to install each library
independently. The open-source packages can be individually installed from the
Anaconda repository with the conda install command or using the pip install
command that is installed with Anaconda. Pip packages provide many of the
features of conda packages, and in most cases they can work together. Custom
packages can be made using the conda build command and can be shared with
others by uploading them to Anaconda Cloud, PyPI or other repositories. The
default installation of Anaconda2 includes Python 2.7 and Anaconda3 includes
Python 3.7. However, you can create new environments that include any version
of Python packaged with conda.

6.4 LUNG CANCER PREDICTION USING DATA VALIDATION PROCESS
import pandas as p
import numpy as n
import warnings
warnings.filterwarnings("ignore")

data = p.read_csv("lung_cancer1.csv")
data.head(10)
data.shape

df = data.dropna()
df.head(10)
df.shape
df.columns
df.info()
df.duplicated()

df.Age.unique()
print("Minimum value of Age is:", df.Age.min())
print("Maximum value of Age is:", df.Age.max())
print("Age range:", sorted(df['Age'].unique()))
df.AreaQ.unique()

p.Categorical(df['family history']).describe()
df['Smokes'].value_counts()
df.corr()
df.head()

from sklearn.preprocessing import LabelEncoder
var_mod = ['Name', 'Member_ID', 'Diagnosis', 'Age', 'Smokes', 'Smokes (years)',
           'Smokes (packs/year)', 'AreaQ', 'Alkhol', 'family history', 'Result']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i]).astype(str)
df.head()
df['Age'].unique()
df['Smokes'].unique()

6.5 LOGISTIC REGRESSION AND NAIVE BAYES ALGORITHMS

import pandas as p
import matplotlib.pyplot as plt
import seaborn as s
import numpy as n
import warnings
warnings.filterwarnings("ignore")

df = p.read_csv("lung_cancer1.csv")
df.shape
df = df.drop_duplicates()
df.shape
df.columns

from sklearn.preprocessing import LabelEncoder
var_mod = ['Name', 'Member_ID', 'Diagnosis', 'Age', 'Smokes', 'Smokes (years)',
           'Smokes (packs/year)', 'AreaQ', 'Alkhol', 'family history', 'Result']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i]).astype(str)
df.head()

from sklearn.metrics import confusion_matrix, classification_report, \
    matthews_corrcoef, cohen_kappa_score, accuracy_score, \
    average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split, cross_val_score

X = df.drop(columns=['Result'])  # feature attributes (target column assumed to be 'Result')
y = df['Result']                 # target label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1, stratify=y)

from sklearn.linear_model import LogisticRegression
logR = LogisticRegression()
logR.fit(X_train, y_train)
predictR = logR.predict(X_test)
print("")
print('Classification report of Logistic Regression Results:')
print("")
print(classification_report(y_test, predictR))

accuracy = cross_val_score(logR, X, y, cv=70)
print('Cross validation test results of accuracy:')
print(accuracy)
print("")
print("Accuracy result of Logistic Regression is:", accuracy.mean() * 100)
print("")

cm1 = confusion_matrix(y_test, predictR)
print('Confusion Matrix result of Logistic Regression is:\n', cm1)
print("")
sensitivity1 = cm1[0, 0] / (cm1[0, 0] + cm1[0, 1])
print('Sensitivity : ', sensitivity1)
print("")
specificity1 = cm1[1, 1] / (cm1[1, 0] + cm1[1, 1])
print('Specificity : ', specificity1)
print("")

TN = cm1[0][0]
FN = cm1[1][0]
TP = cm1[1][1]
FP = cm1[0][1]
print("True Positive :", TP)
print("True Negative :", TN)
print("False Positive :", FP)
print("False Negative :", FN)
print("")

TPR = TP / (TP + FN)
TNR = TN / (TN + FP)
FPR = FP / (FP + TN)
FNR = FN / (TP + FN)
print("True Positive Rate :", TPR)
print("True Negative Rate :", TNR)
print("False Positive Rate :", FPR)
print("False Negative Rate :", FNR)
print("")

PPV = TP / (TP + FP)
NPV = TN / (TN + FN)
print("Positive Predictive Value :", PPV)
print("Negative predictive value :", NPV)

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
predictR = gnb.predict(X_test)
print("")
print('Classification report of Naive Bayes Results:')
print("")
print(classification_report(y_test, predictR))

accuracy = cross_val_score(gnb, X, y, cv=100)
print('Cross validation test results of accuracy:')
print(accuracy)
print("")
print("Accuracy result of Naive Bayes is:", accuracy.mean() * 100)
print("")

cm1 = confusion_matrix(y_test, predictR)
print('Confusion Matrix result of Naive Bayes is:\n', cm1)
print("")
sensitivity1 = cm1[0, 0] / (cm1[0, 0] + cm1[0, 1])
print('Sensitivity : ', sensitivity1)
print("")
specificity1 = cm1[1, 1] / (cm1[1, 0] + cm1[1, 1])
print('Specificity : ', specificity1)
print("")

TN = cm1[0][0]
FN = cm1[1][0]
TP = cm1[1][1]
FP = cm1[0][1]
print("True Positive :", TP)
print("True Negative :", TN)
print("False Positive :", FP)
print("False Negative :", FN)
print("")

TPR = TP / (TP + FN)
TNR = TN / (TN + FP)
FPR = FP / (FP + TN)
FNR = FN / (TP + FN)
print("True Positive Rate :", TPR)
print("True Negative Rate :", TNR)
print("False Positive Rate :", FPR)
print("False Negative Rate :", FNR)
print("")

PPV = TP / (TP + FP)
NPV = TN / (TN + FN)
print("Positive Predictive Value :", PPV)
print("Negative predictive value :", NPV)

6.8 DATA ANALYSIS AND DISCUSSION

The application reads the input and applies the prediction algorithm to it
to generate the output. The output consists of the computed accuracy. This
output is generated for the given input.

6.9 RESULT ANALYSIS

For a close analysis of the lung cancer output, the input is taken, the data
is analysed and pre-processed, and the machine learning algorithms are
compared. Finally, the accuracy rate is displayed.

Figure 6.1 Lung Cancer Prediction Output

CHAPTER 7
CONCLUSION AND FUTURE WORK

7.1 CONCLUSION

In this project, it is observed that many machine learning algorithms are
sensitive to the range and distribution of attribute values in the input data.
Outliers in input data can skew and mislead the training process of machine
learning algorithms, resulting in longer training times, less accurate models
and ultimately poorer results. Even before predictive models are prepared on
training data, outliers can result in misleading representations and, in turn,
misleading interpretations of collected data. Outliers can skew the summary
distribution of attribute values in descriptive statistics like the mean and
standard deviation, and in plots such as histograms and scatterplots,
compressing the body of the data. Finally, outliers can represent examples of
data instances that are relevant to the problem, such as anomalies in the case
of fraud detection and computer security. Simply fitting the model on the
training data is not enough to say that the model will work accurately on real
data. For this, we must ensure that our model learns the correct patterns from
the data and does not pick up too much noise. Cross-validation is a technique
in which we train our model using a subset of the dataset and then evaluate it
using the complementary subset.

7.2 FUTURE WORK

 To automate this process by showing the prediction result in a web
application or desktop application.
 To optimize the work to implement it in an Artificial Intelligence
environment.
 To improve the accuracy score by comparing popular machine learning
algorithms.
 These reports support the investigation of the applicability of machine
learning techniques for detecting cancer in operational conditions by
attribute prediction.
REFERENCES

1. S. G. Armato, G. McLennan, L. Bidaut, M. F. McNitt‐Gray, C. R.


Meyer, A. P. Reeves, et al., "The lung image database consortium (LIDC)
and image database resource initiative (IDRI): a completed reference
database of lung nodules on CT scans," Medical physics, vol. 38, pp. 915-
931, 2011.

2. W. Shen, M. Zhou, F. Yang, C. Yang, and J. Tian, "Multi-scale


convolutional neural networks for lung nodule classification," in
International Conference on Information Processing in Medical Imaging,
pp. 588-599, 2015

3. R. C. Hardie, S. K. Rogers, T. Wilson, and A. Rogers, "Performance


analysis of a new computer aided detection system for identifying lung
nodules on chest radiographs," Medical Image Analysis, vol. 12, pp. 240-
258, 2008.

4. M. U. Dalmış, A. Gubern‐Mérida, S. Vreemann, N. Karssemeijer, R.
Mann, and B. Platel, "A computer‐aided diagnosis system for breast DCE‐
MRI at high spatiotemporal resolution," Medical Physics, vol. 43, pp. 84-
94, 2016.

5. B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J.
Kirby, et al., "The multimodal brain tumor image segmentation benchmark
(BRATS)," IEEE Transactions on Medical Imaging, vol. 34, pp. 1993-2024,
2015.

6. B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Hypercolumns
for object segmentation and fine-grained localization," in Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pp.
447-456, 2015.

7. G. Lin, A. Milan, C. Shen, and I. Reid, "RefineNet: Multi-path
refinement networks with identity mappings for high-resolution semantic
segmentation," arXiv preprint arXiv:1611.06612, 2016.

8. R. K. Srivastava, K. Greff, and J. Schmidhuber, "Highway networks,"
arXiv preprint arXiv:1505.00387, 2015.

9. K. Greff, R. K. Srivastava, and J. Schmidhuber, "Highway and
residual networks learn unrolled iterative estimation," arXiv preprint
arXiv:1612.07771, 2016.

10. K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep
residual networks," in European Conference on Computer Vision, pp. 630-
645, 2016.

11. F. Milletari, N. Navab, and S.-A. Ahmadi, "V-Net: Fully convolutional
neural networks for volumetric medical image segmentation," in 3D Vision
(3DV), Fourth International Conference on, pp. 565-571, 2016.

12. O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional
networks for biomedical image segmentation," in International Conference
on Medical Image Computing and Computer-Assisted Intervention, pp. 234-
241, 2015.

13. K. Simonyan and A. Zisserman, "Very deep convolutional networks
for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.

14. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based
learning applied to document recognition," Proceedings of the IEEE, vol.
86, pp. 2278-2324, 1998.

15. H. Veeraraghavan and J. V. Miller, "Active learning guided
interactions for consistent image segmentation with reduced user
interactions," in Biomedical Imaging: From Nano to Macro, IEEE
International Symposium on, pp. 1645-1648, 2011.

16. Y. Gu, V. Kumar, L. O. Hall, D. B. Goldgof, C.-Y. Li, R. Korn, et al.,
"Automated delineation of lung tumors from CT images using a single click
ensemble segmentation approach," Pattern Recognition, vol. 46, pp. 692-
702, 2013.

17. Y. Tan, L. H. Schwartz, and B. Zhao, "Segmentation of lung lesions on
CT scans using watershed, active contours, and Markov random field,"
Medical Physics, vol. 40, p. 043502, 2013.
