
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

"JNANA SANGAMA", BELAGAVI - 590 018

KARNATAKA

REPORT OF INTERNSHIP/PROFESSIONAL PRACTICE
CARRIED OUT IN
EUNOIA LABS, BANGALORE

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE AWARD OF THE DEGREE OF

BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE ENGINEERING

Submitted by:
BHARATH KALYAN S
[1CG18CS017]

INTERNAL GUIDE
Mrs. Rashmi C.R, M.Tech
Asst. Professor, Dept. of CSE,
C.I.T, Gubbi, Tumkur.

EXTERNAL GUIDE
Mrs. Roopa K S
HR Manager,
Eunoia Labs, Bangalore

HOD
Dr. Shantala C P, Ph.D.
Professor & Head,
Dept. of CSE,
C.I.T, Gubbi, Tumakuru

Channabasaveshwara Institute of Technology


(Affiliated to VTU, Belgaum & Approved by AICTE, New Delhi)
(NAAC Accredited & ISO 9001:2015 Certified Institution)
NH 206 (B.H. Road), Gubbi, Tumkur 572216. Karnataka
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
2021-2022

CERTIFICATE
This is to certify that the Internship Project entitled "Lung Cancer Detection" has been carried out
by Bharath Kalyan S [1CG18CS017], bonafide student of CHANNABASAVESHWARA INSTITUTE OF
TECHNOLOGY, GUBBI, TUMKUR, in partial fulfillment of the requirement for the award of the degree
Bachelor of Engineering in Computer Science Engineering from the Visvesvaraya Technological
University, Belagavi during the year 2021-2022. It is certified that all corrections/suggestions
indicated during Internal Assessment have been incorporated into the report. The Internship report
has been approved as it satisfies the academic requirements in respect of the Technical Seminar
prescribed for the said degree.

Signature of Guide
Mrs. Rashmi C.R, M.Tech
Asst. Professor,
Dept. of CSE,
C.I.T, Gubbi.

Signature of Internship Co-ordinator
Mrs. Rashmi C.R, M.Tech
Asst. Professor,
Dept. of CSE,
C.I.T, Gubbi.

Signature of HOD
Dr. Shantala C P, Ph.D.
Professor & Head,
Dept. of CSE,
C.I.T, Gubbi.

Signature of Principal
Dr. Suresh D S
Director & Principal,
C.I.T, Gubbi.

External Viva
Examiners Name                Signature with Date

1.
2.

UNDERTAKING
I, BHARATH KALYAN S bearing 1CG18CS017, student of VIII Semester B.E. in Computer Science and
Engineering, C.I.T, GUBBI, TUMKUR hereby declare that the internship has been carried out in
Eunoia Labs, Bangalore and is submitted in partial fulfillment of the requirements for the award of
the degree Bachelor of Engineering in Computer Science and Engineering by Visvesvaraya
Technological University, Belgaum during the academic year 2021-2022.

Place: GUBBI
Date:

BHARATH KALYAN S
[1CG18CS017]

BONAFIDE CERTIFICATE

This is to certify that the Internship carried out in Eunoia Labs, Bangalore is a bonafide work of
Bharath Kalyan S - 1CG18CS017, student of VIII semester B.E. in Computer Science and Engineering
from Channabasaveshwara Institute of Technology, Gubbi, Tumkur, in partial fulfillment of the
requirements for the award of degree B.E. in Computer Science and Engineering of Visvesvaraya
Technological University, Belgaum during the academic year 2021 - 2022. It is certified that the
Internship work was carried out under my supervision and guidance.

Guide
Mrs. Rashmi C.R, M.Tech
Asst. Prof, Dept. of CSE
CIT, Gubbi, Tumakuru.

ACKNOWLEDGEMENT

Several special people have contributed significantly to this effort. First of all, I am grateful
to my institution, Channabasaveshwara Institute of Technology, Gubbi, which provided me an
opportunity to fulfil my most cherished desire of reaching my goal.

I acknowledge and express my sincere thanks to our beloved Director & Principal, Dr. Suresh D S,
for his many valuable suggestions and continued encouragement in supporting my academic endeavors.

I express my sincere gratitude to Dr. Shantala C P, Professor and Head, Department of Computer
Science and Engineering, for providing her constructive criticisms and suggestions.

I extend my gratitude to my Internship guide, Mrs. Rashmi C R, Assistant Professor, Department of
Computer Science and Engineering, for her guidance, support and suggestions throughout the period
of this Internship.

I express my deep sense of gratitude to Eunoia Labs for giving me the opportunity to carry out the
internship in their esteemed organization.

I sincerely thank Mrs. Roopa K S, HR Manager, Eunoia Labs, Bangalore for her exemplary guidance
and supervision.

Finally, I would like to thank all the individuals who supported me directly and indirectly in the
successful completion of this internship work.

BHARATH KALYAN S
[1CG18CS017]

ABSTRACT

Cancer is among the most dangerous diseases and a leading cause of death for both men and women,
and lung cancer in particular. Lung cancer becomes very difficult to control once it affects both
lungs. Early diagnosis saves many lives; failing to diagnose it early can lead to further severe
problems and a sudden fatal end. Detecting cancer in advance is not an easy process, but when it is
detected early it is far more treatable. Lung cancer has one of the highest mortality rates in the
world: based on data from the International Agency for Research on Cancer (IARC) for 2012, lung
cancer accounted for 19.4% of cancer deaths worldwide.

We analyzed lung cancer prediction using classification algorithms, namely Naive Bayes and the
Random Forest Classifier. Initially, data from 100 cancer and non-cancer patients were collected,
pre-processed and analyzed using a classification algorithm for predicting lung cancer. The dataset
has 309 instances and 16 attributes. The main aim of this project is to provide advance warning to
users and a performance analysis of the classification algorithms.

TABLE OF CONTENTS

CONTENTS

1. INTRODUCTION

1.1 PROBLEM STATEMENT

1.2 OBJECTIVE

2. LITERATURE SURVEY

3. TRAINING

4. SYSTEM MODELLING & DESIGN

5. METHODOLOGY

6. RESULTS

7. CONCLUSION

REFERENCES
Lung Cancer Detection 2021 - 2022

CHAPTER 1

INTRODUCTION

The exact cause of lung cancer often remains obscure, which makes prevention difficult; early
detection is therefore the most reliable route to a cure. The rapid growth of machine learning has
drawn wide interest because of its numerous applications in areas such as fraud detection, computer
vision, bioinformatics and medical image diagnosis. It is used for the prediction of cancer based on
medical reports such as CT scans, X-rays and MRI, and machine learning techniques have made it
easier for doctors to predict the disease at the right stage.
Cancer is a leading cause of death globally: the World Health Organization estimated about 9.6
million cancer deaths in 2018. Lung cancer is the most common cancer, and its death rate is higher
than that of any other type of cancer; it is one of the leading causes of cancer death in both men
and women. There are various causes of lung cancer, such as smoking and exposure to radon gas, but
it is not only smokers who suffer from lung cancer; it can also occur due to secondhand smoke. This
project uses machine learning techniques for the prediction of cancer on both image data, that is,
CT scan reports through which we can predict the location or size of a tumor, and CSV files which
contain data such as age, gender and smoking rate.

Dept. of CSE, CIT, GUBBI Page 1



1.1 PROBLEM STATEMENT :

Identifying patients with lung cancer just by looking at them is an impossible task. It is the job
of doctors to study their symptoms and then come to a conclusion. Lung cancer is a very serious
issue and must be dealt with care and precision. Thus there is a need for a reliable automated
system. Variations such as uncommon symptoms and side effects pose a challenge to the existing
methodologies and technologies.

1.2 OBJECTIVE :
To design and build a system that can help patients who show symptoms of lung cancer.

• To provide an easy-to-use interface for entering the symptoms.

• The system should be able to preprocess the given input.

• The system should present the result to the user after preprocessing and computation.


CHAPTER 2
LITERATURE SURVEY

1. Diego Riquelme and Moulay A. Akhloufi, "Lung Cancer Detection and Classification in
CSV Datasets using Naive Bayes Classification"

The authors put forward a set of classification cases using Naive Bayes classification for analysis
of the dataset. The system is capable of detecting lung cancer at an early stage, which increases
the patient's survival rate. The authors also discuss the advantages of using Naive Bayes, whose
predictions are computed mathematically from probabilities. [1]

2. Anita Chaudhary, Sonit Sukhraj Singh, "Lung Cancer Detection using Random Forest
Classifier"

The authors discuss how the Random Forest Classifier can be the best supervised algorithm for CSV
datasets. The Random Forest Classifier was applied to a dataset containing the symptoms of the
patients. According to the authors, Random Forest gives the most accurate and robust results for
this type of problem. [2]

3. Kanchan Pradhan, Priyanka Chawla, "Medical internet of things using machine learning
algorithms for lung cancer detection"

This paper explores the possibilities of using IoT with machine learning models developed for the
highest possible accuracy, making it easier for doctors to assess patients. It discusses using
smart objects to collect statements directly from the patient and running an assessment check to
help the doctors. [3]


4. Mr. Sandeep A. Dwivedi, Mr. R. P. Borse, "Lung Cancer Detection and Classification by using
Machine Learning & Bernoulli Bayesian"

The objective of this paper is to recognize the patterns that form in the datasets when the
machine learning techniques are applied, and then to train a Naive Bayes Bernoulli model on them.
According to the paper, Bernoulli NB gives the best and most accurate results for datasets that are
small and precise. [4]

5. GAP Singh, PK Gupta, "Performance analysis of various machine learning-based
approaches for detection and classification of lung cancer in humans"

This paper gives an overview of work proposed by different researchers from 2013 to 2019 using
machine learning algorithms, either on digital images or through a statistical approach. Most of
the researchers took their datasets from the Lung Image Database Consortium image collection
(LIDC-IDRI), The Cancer Genome Atlas (TCGA) and the SEER Database. [5]


CHAPTER 3
TRAINING
In the 1st week of the internship, we were assigned a project based on "Machine Learning" and
introduced to a few basics of Python, NumPy, Pandas and Matplotlib for developing the project.

Day 1 :

On the first day of our internship we were addressed by our allotted guide. She assigned us a
project on "Machine Learning" and explained the outlook and working of the project.

Day 2 :

We decided to develop the project assigned to us using Python, since we already had some
prerequisite knowledge of its concepts. The guide further explained the concepts of Python in
detail. We practiced the concepts in a hands-on session.

Day 3 :

On the third day we were introduced to the concepts of NumPy, Pandas and Matplotlib in detail.

NumPy is an open source numerical Python library. NumPy provides multi-dimensional array and
matrix data structures. It can be used to perform a number of mathematical operations on arrays,
such as trigonometric, statistical and algebraic routines. Data manipulation in Python is nearly
synonymous with NumPy array manipulation: even newer tools like Pandas are built around the
NumPy array.

Pandas is a Python library used for working with data sets. It has functions for analyzing,
cleaning, exploring and manipulating data. Pandas allows us to analyze large data sets and draw
conclusions based on statistical theories. Pandas can clean messy data sets and make them readable
and relevant.


Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It provides an object-oriented API for embedding plots into
applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt or GTK.

Day 4 :

On the fourth day we were assigned the task of working on machine learning algorithms such as
Naive Bayes Classification, Linear Regression, Decision Tree, CNN, KNN and Random Forest. Later we
decided which algorithm suited the project best, based on accuracy.

Day 5 :

On the fifth day we were introduced to the actual software required for the development of the
project. We were trained in the basics of Anaconda and Jupyter Notebook. Anaconda Navigator is a
desktop graphical user interface (GUI) included in the Anaconda distribution that allows you to
launch applications and easily manage conda packages, environments and channels without using
command-line commands. Navigator can search for packages on Anaconda Cloud or in a local Anaconda
Repository. It is available for Windows, macOS and Linux.

Week 2 to Week 4 :

From the second week we collected the datasets required for the project and started implementing
the project using supervised algorithms to build our model. In the final stages of our internship
we implemented a data app using Streamlit, deployed to the cloud through GitHub, in which patients
can check whether they may have lung cancer by filling in simple common symptoms.


CHAPTER 4
SYSTEM MODELLING & DESIGN

4.1 PURPOSE

The purpose of this design document is to explore the logical view of the architecture design,
sequence diagram, data flow diagram and user interface design of the software for performing
operations such as pre-processing and cleaning the data, training the model and displaying the
prediction result.

4.2 SCOPE

The scope of this design document is to describe the features of the system, such as
pre-processing and cleaning the dataset, data visualisation, splitting the dataset, training the
model and generating the output.

4.3 ARCHITECTURE

Input DataSet -> Pre Processing & Cleaning Data -> Data Visualisation
-> Splitting DataSet -> Training and Building Model -> Output Generation

Fig.4.1 Architecture of Proposed System


4.4 DATA FLOW DIAGRAM

Fig.4.2 Data Flow Diagram


CHAPTER 5
METHODOLOGY
To be used efficiently, all computer software needs certain hardware components or other
software resources to be present on a computer. These prerequisites are known as (computer)
system requirements and are often used as a guideline as opposed to an absolute rule. Most software
defines two sets of system requirements: minimum and recommended. With increasing demand for
higher processing power and resources in newer versions of software, system requirements tend to
increase over time. Industry analysts suggest that this trend plays a bigger part in driving upgrades
to existing computer systems than technological advancements.

The proposed methodology comprises the following phases :

a. Exploring Dataset

The data used for this system is in Comma Separated Values (CSV) format. We use this dataset to
train our model. The dataset comprises 16 attributes and data for around 309 patients. Such data
can easily be collected using survey forms.
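
As a minimal sketch of this exploration step, the snippet below parses a few CSV rows with Python's standard csv module and reports the number of records and attributes. The column names and values are made-up placeholders, not the project's actual 16 attributes, and the project itself may well have used Pandas for this step.

```python
import csv
import io

# Hypothetical sample: column names and rows are illustrative only,
# not the real 309-record dataset described in the report.
SAMPLE_CSV = """GENDER,AGE,SMOKING,COUGHING,LUNG_CANCER
M,69,2,2,YES
F,59,1,1,NO
M,63,2,1,YES
"""


def load_records(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def summarize(records):
    """Return (number of records, list of attribute names)."""
    columns = list(records[0].keys()) if records else []
    return len(records), columns


records = load_records(SAMPLE_CSV)
n_rows, columns = summarize(records)
print(f"{n_rows} records, {len(columns)} attributes: {columns}")
```

Note that csv.DictReader returns every value as a string, so numeric attributes such as age would need converting before they are passed to a classifier.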

Fig. 5.1 Patient Data in the form of CSV


b. Pre-processing and Cleaning Data

Pre-processing is a series of operations performed on our dataset. It refers to the
transformations applied to the data before it is provided to the algorithm, and is used to convert
the raw data into an understandable data set. Cleaning the data refers to removing the outliers,
duplicates and null values, which might otherwise decrease our model's performance.

Fig 5.2 Checking for Nulls. Fig 5.3 Removing Duplicates
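
The two cleaning checks named in the captions above can be sketched as follows. This is an illustrative standard-library version, assuming each record is a dict of column name to value; the helper names and sample rows are hypothetical, and outlier removal is not shown.

```python
def is_null(value):
    """Treat None and blank strings as missing values (an assumption of this sketch)."""
    return value is None or str(value).strip() == ""


def count_nulls(records):
    """Count missing values per column."""
    counts = {}
    for row in records:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if is_null(value) else 0)
    return counts


def drop_nulls(records):
    """Keep only rows in which no value is missing."""
    return [row for row in records if not any(is_null(v) for v in row.values())]


def drop_duplicates(records):
    """Keep the first occurrence of each distinct row, preserving order."""
    seen = set()
    unique = []
    for row in records:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows: one exact duplicate and one row with a missing AGE.
rows = [
    {"AGE": "69", "SMOKING": "2", "LUNG_CANCER": "YES"},
    {"AGE": "69", "SMOKING": "2", "LUNG_CANCER": "YES"},
    {"AGE": "", "SMOKING": "1", "LUNG_CANCER": "NO"},
    {"AGE": "59", "SMOKING": "1", "LUNG_CANCER": "NO"},
]
null_counts = count_nulls(rows)
cleaned = drop_duplicates(drop_nulls(rows))
print(null_counts, len(cleaned))
```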

c. Classification
A classifier in machine learning is an algorithm that automatically orders or categorizes data
into one or more of a set of "classes". A classifier is given training data, from which it
constructs a model. It is then supplied with testing data, and the accuracy of the model is
calculated. The classifiers used in this project are Gaussian NB and Bernoulli NB.
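
The train-then-test procedure described above can be sketched in plain Python. The split ratio, the seed and the MajorityClassifier stand-in are assumptions made for illustration; the stand-in is not one of the project's classifiers and is there only so that the fit, predict and accuracy steps can be run end to end.

```python
import random
from collections import Counter


def train_test_split(rows, labels, test_ratio=0.25, seed=0):
    """Shuffle indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(rows) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


class MajorityClassifier:
    """Placeholder model: always predicts the most frequent training label."""

    def fit(self, rows, labels):
        self.label_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, rows):
        return [self.label_ for _ in rows]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical toy data: each row is [smoking, coughing] encoded as 0/1.
X = [[1, 1], [1, 0], [1, 1], [0, 1], [1, 1], [0, 0], [1, 0], [0, 0]]
y = ["YES", "YES", "YES", "YES", "YES", "NO", "YES", "NO"]

x_train, x_test, y_train, y_test = train_test_split(X, y)
model = MajorityClassifier().fit(x_train, y_train)
score = accuracy(y_test, model.predict(x_test))
print(f"train={len(x_train)} test={len(x_test)} accuracy={score:.2f}")
```

Any object with the same fit and predict methods can replace the placeholder without changing the evaluation code.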


• Gaussian NB:

The Gaussian model assumes that features follow a normal distribution: if predictors take
continuous values instead of discrete ones, the model assumes that the values of each feature
within a class are sampled from a Gaussian distribution. A Gaussian distribution is also called a
Normal distribution. When plotted, it gives a bell-shaped curve that is symmetric about the mean of
the feature values, as shown below.

Fig.5.4 Graph depicting the Normal Distribution w.r.t. the function f(x).

The likelihood of the features is assumed to be Gaussian, hence, conditional probability is given by
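
The figure itself is not reproduced in this text. The standard Gaussian Naive Bayes likelihood can be written as follows, where the symbols are assumed notation rather than taken from the report: $x_i$ is the value of feature $i$, and $\mu_y$ and $\sigma_y^{2}$ are the mean and variance of that feature among the training samples of class $y$.

```latex
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^{2}}}
                \exp\!\left(-\frac{(x_i - \mu_y)^{2}}{2\sigma_y^{2}}\right)
```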

Fig.5.5 The Conditional Probability of the Gaussian NB
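
A small from-scratch sketch of Gaussian Naive Bayes may make the idea concrete. It estimates a per-class mean and variance for each feature and picks the class with the highest log-posterior. The class name, the tiny variance floor and the toy data are assumptions for illustration; the project itself refers to a GaussianNB classifier and its code is not shown in this report.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Illustrative Gaussian NB over numeric feature rows (not the project's code)."""

    def fit(self, X, y):
        grouped = defaultdict(list)
        for row, label in zip(X, y):
            grouped[label].append(row)
        self.priors_ = {}
        self.stats_ = {}
        for label, rows in grouped.items():
            self.priors_[label] = len(rows) / len(X)
            stats = []
            for column in zip(*rows):
                mean = sum(column) / len(column)
                var = sum((v - mean) ** 2 for v in column) / len(column)
                # Small floor (an assumption) so a constant feature never gives zero variance.
                stats.append((mean, var + 1e-9))
            self.stats_[label] = stats
        return self

    @staticmethod
    def _log_likelihood(x, mean, var):
        """Log of the Gaussian density with the given mean and variance."""
        return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

    def predict_one(self, row):
        scores = {}
        for label, stats in self.stats_.items():
            score = math.log(self.priors_[label])
            for x, (mean, var) in zip(row, stats):
                score += self._log_likelihood(x, mean, var)
            scores[label] = score
        return max(scores, key=scores.get)

    def predict(self, X):
        return [self.predict_one(row) for row in X]


# Hypothetical continuous features: [age, symptom score].
X_train = [[65, 2.0], [70, 2.0], [68, 1.8], [30, 1.0], [35, 1.2], [28, 1.0]]
y_train = ["YES", "YES", "YES", "NO", "NO", "NO"]

gnb = GaussianNaiveBayes().fit(X_train, y_train)
gaussian_predictions = gnb.predict([[66, 1.9], [32, 1.1]])
print(gaussian_predictions)
```

Working in log space avoids numerical underflow when many small per-feature probabilities are multiplied together.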


• Bernoulli NB:

The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor
variables are independent Boolean variables, such as whether a particular word is present in a
document or not. This model is also well known for document classification tasks.

In the multivariate Bernoulli event model, features are independent Booleans (binary variables)
describing inputs. Like the multinomial model, this model is popular for document classification
tasks, where binary term occurrence features (i.e. whether a word occurs in a document or not) are
used rather than term frequencies (i.e. the frequency of a word in the document).

Fig.5.6 Bernoulli Distribution Graph
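
The likelihood used by the Bernoulli model can be written as follows in its standard form. The notation is assumed rather than taken from the report: $x_i \in \{0, 1\}$ indicates whether feature $i$ is present, and $P(i \mid y)$ is the probability that feature $i$ is present in samples of class $y$.

```latex
P(x_i \mid y) = P(i \mid y)\, x_i + \bigl(1 - P(i \mid y)\bigr)\,(1 - x_i)
```

Unlike the multinomial model, this form explicitly penalizes the absence of a feature that is usually present in class $y$.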

Fig.5.7 The Conditional Probability of Bernoulli NB
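
The following is a from-scratch sketch of Bernoulli Naive Bayes on binary symptom flags. The class name, the Laplace smoothing value alpha=1.0 and the toy data are assumptions for illustration, not the project's actual implementation or dataset.

```python
import math
from collections import defaultdict


class BernoulliNaiveBayes:
    """Illustrative Bernoulli NB for 0/1 features (not the project's code)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength (assumed default)

    def fit(self, X, y):
        grouped = defaultdict(list)
        for row, label in zip(X, y):
            grouped[label].append(row)
        self.log_priors_ = {}
        self.feature_probs_ = {}
        for label, rows in grouped.items():
            self.log_priors_[label] = math.log(len(rows) / len(X))
            # Smoothed estimate of P(feature present | class).
            self.feature_probs_[label] = [
                (sum(column) + self.alpha) / (len(rows) + 2 * self.alpha)
                for column in zip(*rows)
            ]
        return self

    def predict_one(self, row):
        scores = {}
        for label, probs in self.feature_probs_.items():
            score = self.log_priors_[label]
            for x, p in zip(row, probs):
                # Present features contribute p, absent features contribute 1 - p.
                score += math.log(p if x == 1 else 1.0 - p)
            scores[label] = score
        return max(scores, key=scores.get)

    def predict(self, X):
        return [self.predict_one(row) for row in X]


# Hypothetical binary symptoms: [smoking, coughing, wheezing].
X_bin = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 0, 0], [0, 1, 0], [0, 0, 1]]
y_bin = ["YES", "YES", "YES", "NO", "NO", "NO"]

bnb = BernoulliNaiveBayes().fit(X_bin, y_bin)
bernoulli_predictions = bnb.predict([[1, 1, 1], [0, 0, 0]])
print(bernoulli_predictions)
```

With smoothing, no estimated probability is exactly 0 or 1, so the logarithms are always defined even when a symptom never appears in one class of the training data.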


CHAPTER 6
RESULTS
SCREENSHOTS

Fig 6.1 Number of people diagnosed with Lung Cancer


Fig 6.2 Depicting the ratio of people who are diagnosed with Lung Cancer with their Attributes.


Fig 6.3 HeatMap showing the Correlation between the Variables.
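
The values behind such a heatmap can be computed as pairwise correlation coefficients. The sketch below assumes the Pearson coefficient, which the report does not state explicitly, and uses made-up binary columns rather than the project's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical 0/1 columns, for illustration only.
data = {
    "SMOKING": [1, 0, 1, 1, 0, 0],
    "COUGHING": [1, 0, 1, 0, 0, 1],
    "LUNG_CANCER": [1, 0, 1, 1, 0, 0],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Each cell of the heatmap corresponds to one entry of this matrix; values near 1 or -1 indicate a strong linear relationship, and values near 0 a weak one.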


Fig 6.4 Prediction Result using GaussianNB

Fig 6.5 Prediction Result using BernoulliNB


Fig 6.6 DataApp predicting results according to the Symptoms given by user.


CHAPTER 7
CONCLUSION

The main focus of this project is to show how various machine learning algorithms can be used for
the prediction of lung cancer at an early stage using the symptoms of the patient. The study was
carried out using a CSV dataset, which is a statistical dataset. After data pre-processing and
cleaning, we use data visualization libraries to get more insight into the data. We apply Naive
Bayes classification algorithms, the Random Forest Classifier and the Gradient Boost Classifier to
classify the data. Based on the classification results, Bernoulli Naive Bayes achieves higher
accuracy than the other algorithms mentioned. In future, other machine learning techniques could be
used alongside these to build a model that yields higher accuracy for the prediction of not only
lung cancer but other cancers as well.


REFERENCES
[1] Diego Riquelme and Moulay A. Akhloufi, "Lung Cancer Detection and Classification in CSV
Datasets using Naive Bayes Classification", www.mdpi.com, 2020.

[2] Anita Chaudhary, Sonit Sukhraj Singh, "Lung Cancer Detection using Random Forest
Classifier", IEEE, 2012.

[3] Kanchan Pradhan, Priyanka Chawla, "Medical internet of things using machine learning
algorithms for lung cancer detection", Journal of Management Analytics, 2020.

[4] Mr. Sandeep A. Dwivedi, Mr. R. P. Borse, "Lung Cancer Detection and Classification by using
Machine Learning & Bernoulli Bayesian", IOSR-JECE, 2014.

[5] GAP Singh, PK Gupta, "Performance analysis of various machine learning-based approaches
for detection and classification of lung cancer in humans", Springer.com, 2017.
