Lung Cancer Detection Final Report
KARNATAKA
REPORT OF INTERNSHIP/PROFESSIONAL PRACTICE
CARRIED OUT IN
EUNOIA LABS
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE ENGINEERING
Submitted by:
BHARATH KALYAN S
[1CG18CS017]
HOD
Dr. Shantala C P, Ph.D.
Professor & Head
Dept. of CSE
C.I.T, Gubbi, Tumakuru
2021-2022
CERTIFICATE
This is to certify that the Internship Project entitled "Lung Cancer Detection" has been carried out by
Bharath Kalyan S [1CG18CS017], bonafide student of CHANNABASAVESHWARA INSTITUTE OF TECHNOLOGY,
GUBBI, TUMKUR, in partial fulfillment of the requirement for the award of the
C.I.T, Gubbi.
External Viva Signature with Date
Examiners Name
1.
2.
Channabasaveshwara Institute of Technology
C.I.T (Affiliated to VTU, Belgaum & Approved by AICTE, New Delhi)
2021-2022
UNDERTAKING
I, BHARATH KALYAN S, bearing 1CG18CS017, student of VIII
Semester B.E. in Computer Science and Engineering, C.I.T, GUBBI,
BHARATH KALYAN S
Place: GUBBI
[1CG18CS017]
Date:
Channabasaveshwara Institute of Technology
CIT (Affiliated to VTU, Belgaum & Approved by AICTE, New Delhi)
(NAAC Accredited & ISO 9001:2015 Certified Institution)
NH 206 (B.H. Road), Gubbi, Tumkur - 572 216, Karnataka.
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
2021-2022
BONAFIDE CERTIFICATE
Guide
Mrs.Rashmi C.R, M.Tech
Asst. Prof, Dept. of CSE
CIT,Gubbi, Tumakuru.
ACKNOWLEDGEMENT
Several special people have contributed significantly to this effort. First of all,
I am grateful to my institution, Channabasaveshwara Institute of Technology, Gubbi,
which provided me with the opportunity to fulfill my most cherished desire of reaching my
goal.
I express my deep sense of gratitude to for giving such an opportunity to carry out
the internship in their esteemed industry/organization.
Finally, I would like to thank all the individuals who supported me directly and
indirectly for the successful completion of this internship work.
BHARATH KALYAN S
[1CG18CS017]
ABSTRACT
Cancer is among the most dangerous diseases and a leading cause of death for both men and women,
lung cancer in particular. Lung cancer becomes very difficult to control once it affects both lungs. Early
diagnosis of lung cancer saves many lives; without it, the disease can lead to further severe problems
and a sudden fatal end. Detecting cancer in advance is not an easy process, but if it is detected early,
it is curable. Lung cancer has one of the highest mortality rates in the world. Based on data from
the International Agency for Research on Cancer (IARC) in 2012, there are 19.4% of people in the
We analyzed lung cancer prediction using two classification algorithms, Naive Bayes
and the Random Forest Classifier. Initially, data from 100 cancer and non-cancer patients were
collected, pre-processed, and analyzed using a classification algorithm to predict lung cancer.
The dataset has 309 instances and 16 attributes. The main aim of this project is to provide
advance warning to users and a performance analysis of the classification algorithms.
Here both cancer and non-cancer patient details are collected for processing and
TABLE OF CONTENTS
CONTENTS
1. INTRODUCTION
1.1 OBJECTIVE
3. TRAINING
5. METHODOLOGY
6. RESULTS
7. CONCLUSION
REFERENCES
Lung Cancer Detection 2021 - 2022
CHAPTER 1
INTRODUCTION
The cause of lung cancer remains obscure and prevention is therefore difficult, so early
detection of lung cancer offers the best chance of a cure. The rapid growth of machine learning has
attracted wide interest because of its numerous applications in areas such as
fraud detection, computer vision, bioinformatics, and medical image diagnosis. Machine learning is used to
predict cancer from medical reports such as CT scans, X-rays, and MRI, and
its various techniques have made it easier for doctors to identify the
disease at the right stage.
Cancer is a leading cause of death globally; the World Health Organization estimated 9.6 million
cancer deaths in 2018. Lung cancer is the most common
cancer, and its death rate is higher than that of any other type of
cancer. It is one of the leading causes of cancer death in both men and women. Lung cancer has
various causes, such as smoking and exposure to radon gas, but it is not only
people who smoke who suffer from lung cancer; it can also result from secondhand
smoke. This project uses machine learning techniques to predict cancer from two kinds of data:
image data, that is, CT scan reports from which the location or
size of a tumor can be predicted, and CSV files containing data such as age, gender, and smoking rate.
Identifying patients with lung cancer just by looking at them is impossible. It is
the job of doctors to study their symptoms and then come to a conclusion. Lung cancer is a very
serious issue and must be dealt with care and precision. Thus there is a need for a reliable automated
system. Variations such as uncommon symptoms and side effects pose a challenge to
existing methodologies and technologies.
1.2 OBJECTIVE
To design and build a system that can help patients suffering from lung cancer
symptoms.
• The system should present the result to the user after preprocessing and computation.
CHAPTER 2
LITERATURE SURVEY
1. Diego Riquelme and Moulay A. Akhloufi, "Lung Cancer Detection and Classification in
CSV Datasets using Naive Bayes Classification"
The authors put forward a set of classification cases using Naive Bayes classification for
analysis of the dataset. The system is capable of detecting lung cancer at an early stage, which
increases the survival rate of patients. The authors also discuss the advantages of using
Naive Bayes, noting that its predictions are computed mathematically. [1]
2. Anita Chaudhary, Sonit Sukhraj Singh, "Lung Cancer Detection using Random Forest
Classifier"
The authors discuss how the Random Forest Classifier can be the best supervised algorithm for
CSV datasets. The Random Forest Classifier was applied to a dataset containing the symptoms of
the patients. According to the authors, Random Forest gives the most accurate and robust results
for this type of problem. [2]
This paper explores the possibilities of using IoT with machine learning models that have been
developed with the highest accuracy possible, making it easier for doctors to assess
patients. It discusses using smart objects to collect statements directly from the patient and running an
assessment check to help the doctors. [3]
The objective of this paper is to recognize the patterns that form in the data sets when various
machine learning techniques are applied, and then to train a Bernoulli Naive Bayes model on them. Bernoulli
NB gives the best and most accurate results for datasets that are small and precise. [4]
This paper gives an overview of work proposed by different researchers from 2013 to 2019 using
machine learning algorithms, either on digital images or through a statistical approach. Most of the
researchers took their data sets from the Lung Image Database Consortium image collection (LIDC-IDRI),
The Cancer Genome Atlas (TCGA), and the SEER database. [5]
CHAPTER 3
TRAINING
In the 1st week of internship, we were assigned a project based on "Machine Learning" and
introduced to a few basics of Python, NumPy, Pandas, and Matplotlib for developing the project.
Day 1 :
On the first day of our internship we were addressed by our allotted guide. She assigned us a project on
"Machine Learning" and explained the outlook and working of the project.
Day 2 :
We decided to develop the project assigned to us using Python, since we already had some
prerequisite knowledge of its concepts. The guide further explained the concepts of
Python in detail. We practiced the concepts in a hands-on session.
Day 3 :
On the third day we were introduced to the concepts of NumPy, Pandas, and Matplotlib in detail.
NumPy is an open-source numerical Python library. It provides multi-dimensional array and
matrix data structures and can be used to perform a number of mathematical operations on arrays,
such as trigonometric, statistical, and algebraic routines. Data manipulation in Python is nearly
synonymous with NumPy array manipulation: even newer tools like Pandas are built around the
NumPy array.
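The kinds of array routines mentioned above can be sketched as follows. This is a minimal illustration on made-up numbers; the array contents and variable names are assumptions and do not come from the project dataset.

```python
import numpy as np

# Hypothetical 2-D array: one row per patient, columns are (age, smoking score).
patients = np.array([[61.0, 2.0],
                     [45.0, 1.0],
                     [70.0, 2.0]])

# Statistical routine: mean of each column.
column_means = patients.mean(axis=0)

# Element-wise arithmetic with broadcasting: subtract each column's mean.
centered = patients - column_means

# Trigonometric routine applied element-wise.
sines = np.sin(np.array([0.0, np.pi / 2]))

# Algebraic routine: matrix product of the centered data with itself.
gram = centered.T @ centered
```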
Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning,
exploring, and manipulating data. Pandas allows us to analyze big data and draw conclusions based
on statistical theories. Pandas can clean messy data sets and make them readable and relevant.
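As a rough illustration of exploring a data set with Pandas, the sketch below reads a tiny in-memory CSV and summarizes it. The column names (GENDER, AGE, SMOKING, LUNG_CANCER) and the four rows are hypothetical stand-ins, not the project's actual data.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory so the sketch needs no file on disk.
raw_csv = io.StringIO(
    "GENDER,AGE,SMOKING,LUNG_CANCER\n"
    "M,61,2,YES\n"
    "F,45,1,NO\n"
    "M,70,2,YES\n"
    "F,52,1,NO\n"
)
df = pd.read_csv(raw_csv)

# Explore: size of the table and how many rows fall in each class.
n_rows, n_cols = df.shape
class_counts = df["LUNG_CANCER"].value_counts()

# Analyze: mean age per class.
mean_age_by_class = df.groupby("LUNG_CANCER")["AGE"].mean()
```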
Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It provides an object-oriented API for embedding plots into
applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK.
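A small sketch of the object-oriented Matplotlib API is shown below. The class counts are invented for illustration, and the non-interactive "Agg" backend is chosen here only so that the example runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless use
import matplotlib.pyplot as plt

# Hypothetical counts of patients per diagnosis class.
labels = ["YES", "NO"]
counts = [3, 2]

# Object-oriented API: create a Figure and an Axes, then call methods on them.
fig, ax = plt.subplots()
bars = ax.bar(labels, counts)
ax.set_xlabel("Lung cancer diagnosis")
ax.set_ylabel("Number of patients")
ax.set_title("Hypothetical class balance")

# Render to an in-memory PNG instead of writing a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```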
Day 4 :
On the fourth day we were assigned the task of working on machine learning algorithms
such as Naïve Bayes classification, Linear Regression, Decision Tree, CNN, KNN, and
Random Forest. Later we decided which algorithm suited the project best, with the highest
accuracy.
Day 5 :
On the fifth day we were introduced to the actual software required for the development of the project.
We were trained in the basics of Anaconda and Jupyter Notebook, which are required for the development of
the actual project. Anaconda Navigator is a desktop graphical user interface (GUI) included in the
Anaconda distribution that allows you to launch applications and easily manage conda packages,
environments, and channels without using command-line commands. Navigator can search for
packages on Anaconda Cloud or in a local Anaconda Repository. It is available for Windows,
macOS, and Linux.
From the second week, we collected the datasets required for the project and started implementing
the project using supervised algorithms to build our model. In the final stages of our
internship we implemented a data app using Streamlit that runs on the GitHub cloud, where patients
can check whether they may have lung cancer by filling in simple, common symptoms.
CHAPTER 4
SYSTEM MODELLING & DESIGN
4.1 PURPOSE
The purpose of this design document is to explore the logical view of the architecture design,
sequence diagram, data flow diagram, and user interface design of the software for performing
operations such as pre-processing, extracting features, and displaying the text present in the images.
4.2 SCOPE
The scope of this design document is to achieve the features of the system, such as
pre-processing the images, feature extraction, segmentation, and displaying the text present in the image.
4.3 ARCHITECTURE
CHAPTER 5
METHODOLOGY
To be used efficiently, all computer software needs certain hardware components or other
a. Exploring Dataset
The data used for this system is in Comma-Separated Values (CSV) format. We use this dataset to train
our model. The dataset comprises 16 attributes and data for around 309 patients. Such data can be
easily collected using survey forms.
b. Pre-processing
Pre-processing is a series of operations performed on our dataset. It refers
to the transformations applied to the data before it is provided to the algorithm. Data
pre-processing is used to convert the raw data into an understandable data set. Cleaning the
data refers to removing outliers, duplicates, and null values, which might decrease our model's
performance.
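The cleaning steps described above could be sketched as follows. The data frame is a made-up example, and the 1.5 × IQR rule for outliers is one common choice assumed here; the report does not state which outlier rule was actually used.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one duplicate row, one null value, and one outlier.
raw = pd.DataFrame({
    "AGE":     [61, 45, 61, np.nan, 58, 52, 300],
    "SMOKING": [2,  1,  2,  1,      2,  1,  2],
})

# Remove exact duplicate rows, then rows containing null values.
deduplicated = raw.drop_duplicates()
non_null = deduplicated.dropna()

# Remove outliers in AGE using the 1.5 * IQR rule (an assumed convention).
q1, q3 = non_null["AGE"].quantile([0.25, 0.75])
iqr = q3 - q1
within_range = non_null["AGE"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = non_null[within_range].reset_index(drop=True)
```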
c. Classification
A classifier in machine learning is an algorithm that automatically orders or categorizes data
into one or more of a set of "classes". A classifier is given training data and constructs a model. It
is then supplied with testing data, and the accuracy of the model is calculated. The classifiers used in this project
are Gaussian NB and Bernoulli NB.
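The train, test, and score workflow described above might look like the following sketch using scikit-learn. The data here is synthetic: 309 rows of 15 random binary "symptom" features with an invented labelling rule, standing in for the real survey data. The 80/20 split and the random seeds are also assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(0)

# Synthetic stand-in data: 309 patients, 15 binary features (hypothetical).
n_samples, n_features = 309, 15
X = rng.integers(0, 2, size=(n_samples, n_features))
# Invented labelling rule: positive when at least 8 symptoms are present.
y = (X.sum(axis=1) >= 8).astype(int)

# Hold out part of the data for testing (the 80/20 split is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit each classifier on the training data and score it on the test data.
accuracies = {}
for name, model in [("GaussianNB", GaussianNB()), ("BernoulliNB", BernoulliNB())]:
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
```

The accuracies obtained on this synthetic data say nothing about the results on the real dataset; the sketch only shows the shape of the procedure.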
• Gaussian NB:
The Gaussian model assumes that features follow a normal distribution. This means that if
predictors take continuous values instead of discrete ones, the model assumes that these values are
sampled from a Gaussian distribution. A Gaussian
distribution is also called a Normal distribution. When plotted, it gives a bell-shaped curve which is
symmetric about the mean of the feature values as shown below.
Fig. 5.4 Graph depicting the Normal distribution with respect to the function f(x).
The likelihood of the features is assumed to be Gaussian, hence, conditional probability is given by
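In its standard textbook form, the Gaussian likelihood of a feature value given a class can be written as below. The symbols are the usual ones and are assumed here: x_i is the value of feature i, y is the class, and \mu_y and \sigma_y^2 are the mean and variance of that feature estimated from the training samples of class y.

```latex
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^{2}}}
                \exp\!\left( -\frac{(x_i - \mu_y)^{2}}{2\sigma_y^{2}} \right)
```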
• Bernoulli NB:
The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor
variables are independent Boolean (binary) variables, for example whether a particular word is present in a
document or not. In the multivariate Bernoulli event model, these binary features
describe the inputs. Like the multinomial model, this model is popular for document classification
tasks, where binary term occurrence features (i.e. whether a word occurs in a document or not) are used
rather than term frequencies (i.e. the frequency of a word in the document).
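The difference between term occurrence and term frequency can be illustrated with scikit-learn's BernoulliNB. This is a toy sketch: the four "documents" and their word counts are invented, and the default smoothing (alpha = 1) is assumed. With `binarize=0.0`, any count above zero is treated as "present", so fitting on raw counts gives the same model as fitting on explicit 0/1 presence flags.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical term counts: rows are documents, columns are three words.
term_counts = np.array([[3, 0, 1],
                        [2, 0, 0],
                        [0, 4, 1],
                        [0, 1, 0]])
labels = np.array([1, 1, 0, 0])

# Model A: let BernoulliNB threshold the counts itself (count > 0 means present).
model_on_counts = BernoulliNB(binarize=0.0).fit(term_counts, labels)

# Model B: binarize by hand and tell BernoulliNB the input is already binary.
presence = (term_counts > 0).astype(int)
model_on_presence = BernoulliNB(binarize=None).fit(presence, labels)

# A new document in which only the first word occurs (five times).
new_doc = np.array([[5, 0, 0]])
proba_counts = model_on_counts.predict_proba(new_doc)
proba_presence = model_on_presence.predict_proba((new_doc > 0).astype(int))
predicted_class = model_on_counts.predict(new_doc)[0]
```

Both models assign the same class probabilities, because only the presence of a word matters and not how many times it occurs.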
CHAPTER 6
RESULTS
SCREENSHOTS
Fig 6.2 Depicting the ratio of people who are diagnosed with
CHAPTER 7
CONCLUSION
The main focus of this project is to show how various machine learning algorithms can be used for
the prediction of lung cancer at an early stage from the symptoms of the patient. The study was
carried out using a CSV dataset, which is a statistical dataset. After data pre-processing and
cleaning, we used data visualization libraries to gain more insight into the data. We applied
Naive Bayes classification algorithms, the Random Forest Classifier, and the Gradient Boost Classifier to
classify the data. Based on the classification results, Bernoulli Naive Bayes
yields higher accuracy than the other algorithms mentioned. In future, other machine learning
techniques could be combined with the techniques mentioned to build a model that yields
higher accuracy for the prediction of not only lung cancer but other cancers as well.
REFERENCES
[1] Diego Riquelme and Moulay A. Akhloufi, "Lung Cancer Detection and Classification in CSV
Datasets using Naive Bayes Classification", www.mdpi.com, 2020.
[2] Anita Chaudhary, Sonit Sukhraj Singh, "Lung Cancer Detection using Random Forest
Classifier", IEEE, 2012.