Lung Cancer Detection Final Report

VISVESVARAYA TECHNOLOGICAL UNIVERSITY
"JNANASANGAMA", BELAGAVI - 590 018
KARNATAKA
REPORT OF INTERNSHIP/PROFESSIONAL PRACTICE
CARRIED OUT IN
EUNOIA LABS, BANGALORE
SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE AWARD OF THE DEGREE OF
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE ENGINEERING
Submitted by:
BHARATH KALYAN S
[1CG18CS017]
INTERNAL GUIDE
Mrs. Rashmi C.R, M.Tech
Asst. Professor, Dept. of CSE,
C.I.T, Gubbi, Tumkur
EXTERNAL GUIDE
Mrs. Roopa K S
HR Manager,
Eunoia Labs, Bangalore
HOD
Dr. Shantala C P, Ph.D.
Professor & Head
Dept. of CSE
C.I.T, Gubbi, Tumakuru
Channabasaveshwara Institute of Technology
(Affiliated to VTU, Belgaum & Approved by AICTE, New Delhi)
(NAAC Accredited & ISO 9001:2015 Certified Institution)
NH 206 (B.H. Road), Gubbi, Tumkur 572216, Karnataka
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
2021-2022
CERTIFICATE
This is to certify that the Internship Project entitled "Lung Cancer Detection" has been carried out
by Bharath Kalyan S [1CG18CS017], a bonafide student of CHANNABASAVESHWARA INSTITUTE OF
TECHNOLOGY, GUBBI, TUMKUR, in partial fulfillment of the requirement for the award of the degree
Bachelor of Engineering in Computer Science Engineering from the Visvesvaraya Technological
University, Belagavi during the year 2021-2022. It is certified that all corrections/suggestions
indicated during Internal Assessment have been incorporated into the report. The Internship report
has been approved as it satisfies the academic requirements in respect of the Technical Seminar
prescribed for the said degree.
Signature of Guide
Mrs. Rashmi C.R, M.Tech
Asst. Professor, Dept. of CSE,
C.I.T, Gubbi.
Signature of Internship Co-ordinator
Mrs. Rashmi C.R, M.Tech
Asst. Professor, Dept. of CSE,
C.I.T, Gubbi.
Signature of HOD
Dr. Shantala C P, Ph.D.
Professor & Head, Dept. of CSE,
C.I.T, Gubbi.
Signature of Principal
Dr. Suresh D S
Director & Principal,
C.I.T, Gubbi.
External Viva
Examiners Name                    Signature with Date
1.
2.
UNDERTAKING
I, BHARATH KALYAN S, bearing 1CG18CS017, student of VIII Semester B.E. in Computer Science and
Engineering, C.I.T, GUBBI, TUMKUR, hereby declare that the internship was carried out in Eunoia
Labs, Bangalore and is submitted in partial fulfillment of the requirements for the award of the
degree Bachelor of Engineering in Computer Science and Engineering by Visvesvaraya Technological
University, Belgaum during the academic year 2021-2022.
Place: GUBBI
Date:
BHARATH KALYAN S
[1CG18CS017]
BONAFIDE CERTIFICATE
This is to certify that the Internship carried out in Eunoia Labs, Bangalore is a bonafide work of
Bharath Kalyan S - 1CG18CS017, student of VIII semester B.E. in Computer Science and Engineering
from Channabasaveshwara Institute of Technology, Gubbi, Tumkur, in partial fulfillment of the
requirements for the award of degree B.E. in Computer Science and Engineering of Visvesvaraya
Technological University, Belgaum during the academic year 2021 - 2022. It is certified that the
Internship work carried out was under my supervision and guidance.
Guide
Mrs. Rashmi C.R, M.Tech
Asst. Prof, Dept. of CSE
CIT, Gubbi, Tumakuru.
ACKNOWLEDGEMENT
Several special people have contributed significantly to this effort. First of all, I am grateful to
my institution, Channabasaveshwara Institute of Technology, Gubbi, which provided me with the
opportunity to fulfil my most cherished desire of reaching my goal.
I acknowledge and express my sincere thanks to our beloved Director & Principal, Dr. Suresh D S,
for his many valuable suggestions and continued encouragement in supporting me in my academic
endeavors.
I express my sincere gratitude to Dr. Shantala C P, Professor and Head, Department of Computer
Science and Engineering, for providing her constructive criticisms and suggestions.
I extend my gratitude to my Internship guide, Mrs. Rashmi C R, Assistant Professor, Department of
Computer Science and Engineering, for her guidance, support and suggestions throughout the period
of this Internship.
I express my deep sense of gratitude to Eunoia Labs, Bangalore for giving me the opportunity to
carry out the internship in their esteemed organization.
I sincerely thank Mrs. Roopa K S, HR Manager, Eunoia Labs, Bangalore for her exemplary guidance
and supervision.
Finally, I would like to thank all the individuals who supported me directly and indirectly in the
successful completion of this internship work.
BHARATH KALYAN S
[1CG18CS017]
ABSTRACT
Cancer is among the most dangerous diseases and a leading cause of death in both men and women,
and lung cancer is especially deadly. Lung cancer becomes very hard to control once it affects both
lungs. Early diagnosis of lung cancer saves many lives; without it, the disease may lead to further
severe problems and a sudden fatal end. Detecting cancer in advance is not an easy process, but if
it is detected early, it is curable. Lung cancer has one of the highest mortality rates in the
world. Based on 2012 data from the International Agency for Research on Cancer (IARC), lung cancer
accounts for 19.4% of cancer deaths worldwide.
We analyzed lung cancer prediction using classification algorithms, namely Naive Bayes and the
Random Forest Classifier. Initially, data from 100 cancer and non-cancer patients were collected,
pre-processed and analyzed using a classification algorithm to predict lung cancer. The dataset has
309 instances and 16 attributes. The main aim of this project is to give users an advance warning
and to analyze the performance of the classification algorithms. Details of both cancer and
non-cancer patients are collected so that the data, with its different instances and attributes,
can be processed and analyzed.
TABLE OF CONTENTS
CONTENTS
1. INTRODUCTION
1.1 PROBLEM STATEMENT
1.2 OBJECTIVE
2. LITERATURE SURVEY
3. TRAINING
4. SYSTEM MODELLING & DESIGN
5. METHODOLOGY
6. RESULTS
7. CONCLUSION
REFERENCES
Lung Cancer Detection 2021 - 2022
CHAPTER 1
INTRODUCTION
The cause of lung cancer remains obscure, which makes prevention impossible; hence early detection
of lung cancer is the only way to cure it. The rapid growth of machine learning has attracted wide
interest because of its numerous applications in areas such as fraud detection, computer vision,
bioinformatics and medical image diagnosis. It is used to predict cancer from medical reports such
as CT scans, X-rays and MRI, and the various machine learning techniques have made it easier for
doctors to predict the disease at the right stage.
Cancer is a leading cause of death globally; the World Health Organization estimated 9.6 million
cancer deaths in 2018. The most common cancer is lung cancer, and its death rate is higher than
that of all other types of cancer. Lung cancer is one of the leading causes of cancer death in both
men and women. There are various causes of lung cancer, such as smoking and exposure to radon gas,
but it is not only people who smoke who suffer from lung cancer; it can also occur due to
secondhand smoke. This project uses machine learning techniques to predict cancer from both image
data, that is, CT scan reports through which we can predict the location or the size of a tumor,
and CSV files which contain data such as age, gender and smoking rate.
Dept. of CSE, CIT, GUBBI Page 1

1.1 PROBLEM STATEMENT :
Identifying patients with Lung Cancer just by looking at them is an impossible task. It is the job
of doctors to study their symptoms and then come to a conclusion. Lung Cancer is a very serious
issue and must be dealt with care and precision. Thus there is a need for a reliable automated
system. Variations such as uncommon symptoms and side effects pose a challenge to the existing
methodologies and technologies.
1.2 OBJECTIVE :
To design and build a system that can help patients suffering from Lung Cancer symptoms.
• To provide an easy user interface to input the symptoms.
• The system should be able to preprocess the given input.
• The system should present the result to the user after preprocessing and computation.

CHAPTER 2
LITERATURE SURVEY
1. Diego Riquelme and Moulay A. Akhloufi, “Lung Cancer Detection and Classification in
CSV Datasets using Naive Bayes Classification”
The authors put forward a set of classification cases that use Naive Bayes classification to
analyse the dataset. The system is capable of detecting lung cancer at an early stage, which
increases the survival rate of the patient. The authors also discuss the advantages of using Naive
Bayes, which is computed directly from probability formulas. [1]
2. Anita Chaudhary, Sonit Sukhraj Singh, “Lung Cancer Detection using Random Forest
Classifier”
The authors discuss how the Random Forest Classifier can be the best supervised algorithm for CSV
datasets. The Random Forest Classifier was applied to a dataset containing the symptoms of the
patients. According to the authors, Random Forest gives the most accurate and robust results for
this type of problem. [2]
3. Kanchan Pradhan, Priyanka Chawla, “Medical internet of things using machine learning
algorithms for lung cancer detection”
This paper explores the possibilities of using IoT with machine learning models developed for the
highest possible accuracy, making it easier for doctors to assess patients. It discusses using
smart objects to collect statements directly from the patient and running an assessment check to
help the doctors. [3]

4. Mr. Sandeep A. Dwivedi, Mr. R.P. Borse, “Lung Cancer Detection and Classification by using
Machine Learning & Bernoulli Bayesian”
The objective of this paper is to recognize the patterns that form in the datasets when machine
learning techniques are applied, and then to train a Bernoulli Naive Bayes model on them. According
to the paper, Bernoulli NB gives the best and most accurate results for datasets that are small and
precise. [4]
5. GAP Singh, PK Gupta, “Performance analysis of various machine learning-based
approaches for detection and classification of lung cancer in humans”
This paper gives an overview of work proposed by different researchers from 2013 to 2019 using
machine learning algorithms, either on digital images or through a statistical approach. Most of
the researchers took their datasets from the Lung Image Database Consortium image collection
(LIDC-IDRI), The Cancer Genome Atlas (TCGA) and the SEER Database. [5]

CHAPTER 3
TRAINING
In the 1st week of the internship, we were assigned a project based on “Machine Learning” and
introduced to a few basics of Python, NumPy, Pandas and Matplotlib for developing the project.
Day 1 :
On the first day of our internship we were addressed by our allotted guide. She assigned us a
project on “Machine Learning” and explained the outlook and working of the project to us.
Day 2 :
We decided to develop the project assigned to us using Python, since we already had some requisite
knowledge of its concepts. The guide further explained the concepts of Python to us in detail. We
practiced the concepts in a hands-on session.
Day 3 :
On the third day we were introduced to the concepts of NumPy, Pandas and Matplotlib in detail.
NumPy is an open source numerical Python library. NumPy provides multi-dimensional array and matrix
data structures. It can be used to perform a number of mathematical operations on arrays, such as
trigonometric, statistical and algebraic routines. Data manipulation in Python is nearly synonymous
with NumPy array manipulation: even newer tools like Pandas are built around the NumPy array.
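The kinds of array routines described above can be sketched as follows. The numbers are made-up illustrative values, not data from the project, and the variable names are our own.

```python
import numpy as np

# Rows are hypothetical patients, columns are hypothetical numeric features.
data = np.array([[60.0, 1.0, 2.0],
                 [45.0, 2.0, 1.0],
                 [70.0, 2.0, 2.0]])

col_means = data.mean(axis=0)                  # statistical routine: mean of each column
centered = data - col_means                    # broadcasting: subtract the means from every row
sines = np.sin(np.array([0.0, np.pi / 2]))     # trigonometric routine
gram = data.T @ data                           # algebraic routine: 3x3 matrix product

print(col_means)
print(centered)
```

Each operation works on the whole array at once, which is why no explicit Python loop is needed.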
Pandas is a Python library used for working with data sets. It has functions for analyzing,
cleaning, exploring and manipulating data. Pandas allows us to analyze large data sets and draw
conclusions based on statistical theories. Pandas can clean messy data sets and make them readable
and relevant.

Matplotlib is a plotting library for the Python programming language and its numerical mathematics
extension NumPy. It provides an object-oriented API for embedding plots into applications using
general-purpose GUI toolkits like Tkinter, wxPython, Qt or GTK.
Day 4 :
On the fourth day we were assigned the task of working on machine learning algorithms such as Naïve
Bayes Classification, Linear Regression, Decision Tree, CNN, KNN and Random Forest. Later we
decided which algorithm best suits the project with the highest accuracy.
Day 5 :
On the fifth day we were introduced to the actual software required for the development of the
project. We were trained in the basics of Anaconda and Jupyter Notebook, which are required for
developing the actual project. Anaconda Navigator is a desktop graphical user interface (GUI)
included in the Anaconda distribution that allows you to launch applications and easily manage
conda packages, environments and channels without using command-line commands. Navigator can search
for packages on Anaconda Cloud or in a local Anaconda Repository. It is available for Windows,
macOS and Linux.
Week 2nd to Week 4th :
From the second week we collected the datasets required for the project and started implementing
the project using supervised algorithms to build our model. In the final stages of our internship
we implemented a data app using Streamlit, running in the cloud from a GitHub repository, with
which patients can check whether they may have Lung Cancer by filling in simple common symptoms.

CHAPTER 4
SYSTEM MODELLING & DESIGN
4.1 PURPOSE
The purpose of this design document is to explore the logical view of the architecture design,
sequence diagram, data flow diagram and user interface design of the software for performing
operations such as pre-processing, extracting features and displaying the text present in the
images.
4.2 SCOPE
The scope of this design document is to achieve the features of the system, such as pre-processing
the images, feature extraction, segmentation and displaying the text present in the image.
4.3 ARCHITECTURE
Input DataSet -> Pre Processing & Cleaning Data -> Data Visualisation -> Splitting DataSet ->
Training and Building Model -> Output Generation
Fig.4.1 Architecture of Proposed System

4.4 DATA FLOW DIAGRAM
Fig4.2 DataFlow Diagram

CHAPTER 5
METHODOLOGY
To be used efficiently, all computer software needs certain hardware components or other
software resources to be present on a computer. These prerequisites are known as (computer) system
requirements and are often used as a guideline as opposed to an absolute rule. Most software
defines two sets of system requirements: minimum and recommended. With increasing demand for
higher processing power and resources in newer versions of software, system requirements tend to
increase over time. Industry analysts suggest that this trend plays a bigger part in driving upgrades
to existing computer systems than technological advancements.
The proposed methodology comprises the following phases:
a. Exploring Dataset
The data used for this system is in Comma Separated Values (CSV) format. We use this dataset to
train our model. The dataset comprises 16 attributes and the data of around 309 patients. Such data
can easily be collected using survey forms.
Fig. 5.1 Patient Data in the form of CSV
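A minimal sketch of this exploration step with Pandas is shown below. The report does not list the column names, so the columns and the four rows here are hypothetical stand-ins; the real file has 16 attributes and about 309 rows.

```python
import io
import pandas as pd

# Hypothetical stand-in for the survey CSV (column names and values are assumptions).
csv_text = """GENDER,AGE,SMOKING,COUGHING,LUNG_CANCER
M,69,1,2,YES
F,59,2,1,NO
M,63,2,2,YES
F,59,2,1,NO
"""

# For a file on disk, pass its path to read_csv instead of the StringIO buffer.
df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape                         # instances and attributes
class_counts = df["LUNG_CANCER"].value_counts()   # patients per class

print(df.head())      # first rows of the table
print(df.dtypes)      # data type of each attribute
print(class_counts)
```

Checking the shape and the class counts first shows whether the file was read as expected and how balanced the two classes are.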

b. Pre-processing and Cleaning Data
Pre-processing is a series of operations performed on our dataset. It refers to the transformations
applied to our data before the data is provided to the algorithm. Data pre-processing is used to
convert the raw data into an understandable data set. Cleaning of data refers to removing the
outliers, duplicates and null values which might decrease our model's performance.
Fig 5.2 Checking for Nulls. Fig 5.3 Removing Duplicates
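The null and duplicate checks named in the figure captions can be sketched with Pandas as below. The small table is invented for illustration, and outlier removal is left out because the report does not say which rule was used.

```python
import numpy as np
import pandas as pd

# Hypothetical records: one missing AGE value and one exact duplicate row.
df = pd.DataFrame({
    "AGE": [69, 59, 59, np.nan, 63],
    "SMOKING": [1, 2, 2, 1, 2],
    "LUNG_CANCER": ["YES", "NO", "NO", "YES", "YES"],
})

null_counts = df.isnull().sum()              # missing values per column
n_duplicates = int(df.duplicated().sum())    # rows identical to an earlier row

# Drop rows with missing values, then drop exact duplicates.
clean = df.dropna().drop_duplicates().reset_index(drop=True)

print(null_counts)
print("duplicates:", n_duplicates)
print(clean)
```

Dropping rows is the simplest choice; filling missing values instead would be an alternative when the dataset is too small to lose records.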
c. Classification
A classifier in machine learning is an algorithm that automatically orders or categorizes data into
one or more of a set of “classes.” A classifier is given training data, from which it constructs a
model. It is then supplied with testing data, and the accuracy of the model is calculated. The
classifiers used in this report are Gaussian NB and Bernoulli NB.
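The train-then-test procedure described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic, the labelling rule is hypothetical, and the split ratio and random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(0)

# Synthetic stand-in for the survey data: 300 rows of 6 binary symptom flags.
X = rng.integers(0, 2, size=(300, 6))
# Hypothetical label rule: positive when at least 3 of the first 4 symptoms are present.
y = (X[:, :4].sum(axis=1) >= 3).astype(int)

# Hold out 25% of the rows for testing, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

scores = {}
for name, model in [("GaussianNB", GaussianNB()), ("BernoulliNB", BernoulliNB())]:
    model.fit(X_train, y_train)                                   # build the model
    scores[name] = accuracy_score(y_test, model.predict(X_test))  # accuracy on unseen rows

print(scores)
```

The accuracies printed here come from synthetic data and say nothing about the results reported in Chapter 6; the sketch only shows how the two classifiers are trained and compared on the same split.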

• Gaussian NB:
The Gaussian model assumes that features follow a normal distribution. This means that if
predictors take continuous values instead of discrete ones, the model assumes that these values are
sampled from a Gaussian distribution, which is also called a Normal distribution. When plotted, it
gives a bell-shaped curve which is symmetric about the mean of the feature values, as shown below.
Fig.5.4 Graph depicting the Normal Distribution w.r.t. the function f(x).
The likelihood of the features is assumed to be Gaussian; hence, the conditional probability is given by
Fig.5.5 The Conditional Probability of the Gaussian NB
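The figure is not reproduced in this text. The standard textbook form of the Gaussian Naive Bayes likelihood, which we assume is what the figure shows, is given below. The symbols are our own notation: x_i is the value of feature i, y is the class, and \mu_y and \sigma_y^2 are the mean and variance of that feature among the training samples of class y.

```latex
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^{2}}}
                \exp\!\left(-\frac{(x_i - \mu_y)^{2}}{2\sigma_y^{2}}\right)
```

The classifier multiplies these per-feature likelihoods with the class prior P(y) and predicts the class for which the product is largest.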

• Bernoulli NB:
The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables
are independent Boolean (binary) variables describing the inputs, such as whether a particular word
is present in a document or not. Like the multinomial model, this model is popular for document
classification tasks, where binary term occurrence features (i.e. whether a word occurs in a
document or not) are used rather than term frequencies (i.e. the frequency of a word in the
document).
Fig.5.6 Bernoulli Distribution Graph
Fig.5.7 The Conditional Probability of Bernoulli NB
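The figure is not reproduced in this text. The usual form of the Bernoulli Naive Bayes likelihood, which we assume is what the figure shows, is given below in our own notation: x_i is a binary feature taking the value 0 or 1, y is the class, and P(x_i = 1 \mid y) is the fraction of class-y training samples in which feature i is present.

```latex
P(x_i \mid y) = P(x_i = 1 \mid y)\, x_i
              + \bigl(1 - P(x_i = 1 \mid y)\bigr)\,(1 - x_i),
\qquad x_i \in \{0, 1\}
```

Unlike the multinomial model, this form explicitly penalises the absence of a feature: when x_i = 0, the second term contributes the probability that the feature is absent in class y.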

CHAPTER 6
RESULTS
SCREENSHOTS
Fig 6.1 Number of people diagnosed with Lung Cancer

Fig 6.2 Depicting the ratio of people who are diagnosed with
Lung Cancer with their Attributes.

Fig 6.3 HeatMap showing the Correlation between the Variables.
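A heat map of this kind is drawn from a correlation matrix. The sketch below shows how such a matrix could be computed with Pandas; the columns and values are hypothetical, and the YES/NO to 1/0 encoding is our assumption about how the label was made numeric.

```python
import pandas as pd

# Hypothetical records; the real dataset has 16 attributes.
df = pd.DataFrame({
    "SMOKING": [1, 2, 2, 1, 2, 1],
    "COUGHING": [1, 2, 2, 1, 2, 2],
    "LUNG_CANCER": ["NO", "YES", "YES", "NO", "YES", "NO"],
})

# Encode the label numerically so it can take part in the correlation.
encoded = df.assign(LUNG_CANCER=df["LUNG_CANCER"].map({"NO": 0, "YES": 1}))

corr = encoded.corr()   # pairwise Pearson correlation between all columns

# Attributes ranked by their correlation with the label.
ranked = corr["LUNG_CANCER"].drop("LUNG_CANCER").sort_values(ascending=False)

print(corr.round(2))
print(ranked)
```

The resulting matrix can then be passed to a plotting function such as seaborn's heatmap to obtain a figure like the one above.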

Fig 6.4 Prediction Result using GaussianNB
Fig 6.5 Prediction Result using BernoulliNB

Fig 6.6 DataApp predicting results according
to the Symptoms given by user.

CHAPTER 7
CONCLUSION
The main focus of this project is to show how various machine learning algorithms can be used to
predict lung cancer at an early stage from the symptoms of the patient. The survey has been carried
out using a CSV dataset, which is a statistical dataset. After data pre-processing and cleaning, we
use data visualization libraries to gain more insight into the data. We apply Naive Bayes
classification algorithms, the Random Forest Classifier and the Gradient Boost Classifier to
classify the data. Based on the classification, Bernoulli Naive Bayes gives higher accuracy than
the other algorithms mentioned. In future, other machine learning techniques can be used along with
the mentioned techniques to build a model that yields higher accuracy for the prediction of not
only lung cancer but other cancers as well.

REFERENCES
[1] Diego Riquelme and Moulay A. Akhloufi, “Lung Cancer Detection and Classification in CSV
Datasets using Naive Bayes Classification”, www.mdpi.com, 2020.
[2] Anita Chaudhary, Sonit Sukhraj Singh, “Lung Cancer Detection using Random Forest
Classifier”, IEEE, 2012.
[3] Kanchan Pradhan, Priyanka Chawla, “Medical internet of things using machine learning
algorithms for lung cancer detection”, Journal of Management Analytics, 2020.
[4] Mr. Sandeep A. Dwivedi, Mr. R.P. Borse, “Lung Cancer Detection and Classification by using
Machine Learning & Bernoulli Bayesian”, (IOSR-JECE), 2014.
[5] GAP Singh, PK Gupta, “Performance analysis of various machine learning-based approaches
for detection and classification of lung cancer in humans”, Springer.com, 2017.
