CANCER CELL DETECTION

USING
LOGISTIC REGRESSION
A PROJECT REPORT
In partial fulfillment of the requirements for the award of the degree
BACHELOR OF COMPUTER APPLICATION
Under the guidance of
Sofikul Mullick
Submitted by
ADITYA BASAK 31401218057
DIPTARUP DAS 31401218044
RONENDRANATH ROY 31401218026
SOHAM JASHU 31401218014
SOUMYADEEP SARKAR 31401218012

TECHNO INTERNATIONAL NEWTOWN

In association with
Engineering Study Centre (ESC)
1. Title of the Project: CANCER CELL DETECTION USING LOGISTIC REGRESSION

2. Project Members: ADITYA BASAK
                    DIPTARUP DAS
                    RONENDRANATH ROY
                    SOHAM JASHU
                    SOUMYADEEP SARKAR

3. Name of the guide: SOFIKUL MULLICK

Project Version Control History

Version: Final
Primary Author: Aditya Basak, Diptarup Das, Ronendranath Roy, Soham Jashu, Soumyadeep Sarkar
Description of Version: Project Report
Date Version Completed: 16th July, 2021

Signature of Team Member                    Signature of Approver
Date:                                       Date:

For Office Use Only:
Approved / Not Approved                     SOFIKUL MULLICK
                                            Project Proposal Evaluator

DECLARATION
We hereby declare that the project work presented in the project proposal
entitled “CANCER CELL DETECTION USING LOGISTIC REGRESSION”,
in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF COMPUTER APPLICATION at TECHNO
INTERNATIONAL NEW TOWN, is an authentic work carried out under the
guidance of SOFIKUL MULLICK. The matter embodied in this project work
has not been submitted elsewhere for the award of any other degree, to the best
of our knowledge and belief.

Date:

Name of the Students: Signature of the students:


Aditya Basak
Diptarup Das
Ronendranath Roy
Soham Jashu
Soumyadeep Sarkar

CERTIFICATE FROM SUPERVISOR

This is to certify that this proposal of the minor project entitled “CANCER CELL
DETECTION USING LOGISTIC REGRESSION” is a record of bona fide
work carried out by ADITYA BASAK, DIPTARUP DAS, RONENDRANATH
ROY, SOHAM JASHU and SOUMYADEEP SARKAR under my guidance at
TECHNO INTERNATIONAL NEW TOWN. In my opinion, the report in its
present form is in partial fulfillment of the requirements for the award of the
degree of BACHELOR OF COMPUTER APPLICATION and as per the
regulations of TECHNO INTERNATIONAL NEW TOWN. To the best of
my knowledge, the results embodied in this report are original in nature and
worthy of incorporation in the present version of the report.

Guide / Supervisor
------------------------------------------------
SOFIKUL MULLICK

ACKNOWLEDGEMENT

The success of any project depends largely on the encouragement and guidance of
many others. We take this opportunity to express our sincere gratitude to the people
who have been instrumental in the successful completion of this project work.

We would like to show our greatest appreciation to SOFIKUL MULLICK as
the guide of this project. We have felt motivated and encouraged at every stage by
his valuable advice and constant inspiration; without his encouragement and
guidance this project would not have materialized.

Words are inadequate in offering our thanks to the other trainees, project assistants
and faculty members at Techno International New Town for their encouragement and
cooperation in carrying out this project work. The guidance and support received
from all the members who contributed to this project were vital to its success.

INDEX

Sl No   Topic
1.      Abstract
2.      Introduction
3.      Problem Definition
4.      Project Goal
5.      Methodology
6.      Project Objective
7.      Project Workflow
8.      Project Implementation
9.      Step-by-Step Working
10.     Project Limitations
11.     Future Scope
12.     Summary
13.     Bibliography
14.     Code

ABSTRACT

Breast cancer is the most common malignancy in women, affecting 2.1
million women every year and causing the largest number of cancer deaths
among women. It occurs as a result of the abnormal growth of
cells in the breast tissue, generally referred to as a tumor. A
tumor does not necessarily signify cancer: it may be non-cancerous (benign),
pre-cancerous (pre-malignant), or cancerous (malignant). Various types of
tests, such as mammograms, MRI, ultrasound, and biopsy, are frequently
used to identify breast cancer. Early detection and treatment help to
improve breast cancer outcomes as well as survival. Therefore, this report
presents a comparative study of breast cancer prediction using different
supervised machine learning algorithms, namely Logistic Regression, K-
Nearest Neighbors, Decision Tree Classifier, Gaussian Naive Bayes, and Support
Vector Machine, on the UCI repository dataset. To compare the
performance of the models, the accuracy score, precision, recall, and
F-score of each model were examined. After evaluating the various models,
we observed that Logistic Regression is a well-suited algorithm for breast
cancer prediction and produced better accuracy and other performance
indices than the other models.

INTRODUCTION

Python is an interpreted, high-level, general-purpose
programming language. Python's design philosophy
emphasizes code readability with its notable use of
significant whitespace. Its language constructs and object-
oriented approach aim to help programmers write clear,
logical code for small and large-scale projects.

Machine learning (ML) is the study of computer
algorithms that improve automatically through
experience. It is seen as a subset of artificial intelligence.
Machine learning algorithms build a model based on
sample data, known as "training data", in order to make
predictions or decisions without being explicitly
programmed to do so. Machine learning algorithms are
used in a wide variety of applications, such as email
filtering and computer vision, where it is difficult or
infeasible to develop conventional algorithms to perform
the needed tasks.

Logistic regression is a supervised learning classification
algorithm used to predict the probability of a target
variable. The nature of the target or dependent variable is
dichotomous, which means there are only two
possible classes.
In simple words, the dependent variable is binary
in nature, having data coded as either 1 (stands for
success/yes) or 0 (stands for failure/no).
Mathematically, a logistic regression model
predicts P(Y=1) as a function of X. It is one of the simplest
ML algorithms and can be used for various classification
problems such as spam detection, diabetes prediction,
cancer detection, etc.
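As a minimal illustration of how a logistic regression model turns a linear combination of the inputs X into P(Y=1), consider the sketch below; the feature values and coefficients are made up purely for illustration and are not taken from the project.

import numpy as np

def predict_probability(x, weights, bias):
    # P(Y=1 | x) = sigmoid(w.x + b)
    z = np.dot(weights, x) + bias        # linear combination of the features
    return 1.0 / (1.0 + np.exp(-z))      # the sigmoid squashes z into (0, 1)

# illustrative values only
x = np.array([2.0, 1.5])
weights = np.array([0.8, -0.4])
p = predict_probability(x, weights, bias=-0.5)
print(p)                                 # probability of the positive class (e.g. malignant)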

Types of Logistic Regression:

On the basis of the categories of the dependent variable, logistic
regression can be classified into three types:
● Binomial: In binomial logistic regression, there can be
  only two possible types of the dependent variable, such
  as 0 or 1, Pass or Fail, etc.

● Multinomial: In multinomial logistic regression, there can
  be 3 or more possible unordered types of the dependent
  variable, such as "cat", "dog", or "sheep".

● Ordinal: In ordinal logistic regression, there can be 3 or
  more possible ordered types of the dependent variable, such
  as "low", "medium", or "high".

PROBLEM DEFINITION

Breast cancer is one of the leading cancers in many
countries, including India. The survival rate is high with
early diagnosis: 97% of women can survive for more than
5 years. Statistically, however, the death toll due to this disease has
increased drastically in the last few decades. The main issue
pertaining to its cure is early recognition. Hence, apart from
medicinal solutions, data science solutions need to be
integrated to address this fatal disease. This analysis
aims to observe which features are most helpful in predicting
malignant or benign cancer and to identify general trends that may
aid us in model selection and hyperparameter selection. The
goal is to classify whether the breast cancer is benign or
malignant. To achieve this, we have used machine learning
classification methods to fit a function that can predict the
discrete class of new input.

PROJECT GOAL

The goal is to predict whether a person has cancer or not. This
goal is achieved by using logistic regression in machine
learning. Breast cancer (BC) is a common cancer among women
around the world. Early detection of BC can greatly improve
prognosis and survival chances by enabling timely clinical
treatment of patients.

METHODOLOGY

• Data Selection: Data is the foundation of any machine
learning project. The job is to find ways and sources of
collecting relevant and comprehensive data, interpreting it,
and analyzing results with the help of statistical techniques.

• Data Visualization: A large amount of information
represented in graphic form is easier to understand and
analyze. Some companies specify that a data analyst must
know how to create slides, diagrams, charts, and templates.

• Data Cleaning: This set of procedures allows for removing
noise and fixing inconsistencies in the data. A data scientist can
fill in missing data using imputation techniques and detect
outliers, i.e., observations that deviate significantly from the
rest of the distribution.

• Data Splitting: A dataset used for machine learning should
be partitioned into three subsets: training, validation, and
test sets (a minimal sketch of such a split is shown after this list).

• Model Selection: After a data scientist has preprocessed
the collected data and split it into three subsets, he or she can
proceed with model training. This process entails “feeding”
the algorithm with training data. The algorithm processes the
data and outputs a model that is able to find a target value in
new data.

• Model Evaluation: The goal of this step is to develop the
simplest model able to predict a target value quickly and accurately
enough, and to check its accuracy.
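A minimal sketch of the three-way split described in the Data Splitting step, using the Wisconsin diagnostic data bundled with scikit-learn so that the snippet is self-contained; the 60/20/20 proportions are only an example, and the project code in the appendix uses a simpler train/test split.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% as the test set, then take 25% of the remainder as the
# validation set, giving roughly a 60 / 20 / 20 split overall.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))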

PROJECT OBJECTIVE
The main objective of this project is cancer cell detection using
logistic regression.

PROJECT WORKFLOW
This is the detailed work architecture showing the process of
cancer cell detection using logistic regression.

PROJECT IMPLEMENTATION

• SELECTION OF DATA: The process of selecting data depends on the
type of project we wish to build. The data set can be collected from various
sources such as a file, a database, a sensor, and many other such sources.

• VISUALIZATION OF DATA: Data visualization is the graphical
representation of information and data. By using visual elements like
charts, graphs, and maps, data visualization tools provide an accessible
way to see and understand trends, outliers, and patterns.

• DATA PRE-PROCESSING: Data pre-processing is the process of
turning raw data into clean data that can be used to train the model.
We therefore need data pre-processing to achieve good results from the
applied model in machine learning and deep learning projects.

• SELECTION OF DEPENDENT AND INDEPENDENT DATA:
We need to select the dependent and independent data and store them in
y and X respectively.

• SPLITTING OF THE DATA: We train the classifier using the training
data set, then test the performance of the classifier on the unseen test data
set. We split the data for training and testing by using
‘train_test_split’.

• FITTING THE MODEL: In a data set, the training set is used
to build up a model. Once the model is trained, we can use the same
trained model to predict on the testing data, i.e., the unseen data. Once
this is done, we can compute a confusion matrix, which tells us how well our
model is trained.

• MODEL EVALUATION: It is an integral part of the model
development process. It helps to find the best model that represents our
data and shows how well the chosen model will work in the future.

STEP-BY-STEP WORKING:

SELECTION OF DATA:
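The code for this step, condensed from the CODE appendix: the dataset is read from cancer_data.csv with pandas and inspected.

import pandas as pd

cancer_data = pd.read_csv('cancer_data.csv')   # Wisconsin breast cancer data
cancer_data.head()                             # first few records
print(len(cancer_data.index))                  # number of records
cancer_data.info()                             # column types and missing-value counts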

CLEANING DATA:
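Condensed from the CODE appendix: missing values are inspected, the unused id column is dropped, and any incomplete rows are removed.

import seaborn as sns

cancer_data.isnull().sum()                     # missing values per column
sns.heatmap(cancer_data.isnull())              # visual check for missing values
cancer_data.drop('id', axis=1, inplace=True)   # the id column carries no predictive information
cancer_data = cancer_data.dropna()             # drop any remaining incomplete rows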

DEPENDENT & INDEPENDENT DATA:

VISUALIZATION OF DATA:
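Condensed from the CODE appendix: a count plot of the two diagnosis classes and pairwise scatter plots of the ten selected features (the appendix writes the plotting blocks out one by one; a nested loop is equivalent).

import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(x='diagnosis', data=cancer_data)   # class balance: benign vs malignant

# Scatter plot of every selected feature against every later one.
for k in range(9):
    plt.figure(figsize=(15, 20))
    for j, i in enumerate(range(k + 1, 10), start=1):
        plt.subplot(5, 2, j)
        plt.scatter(X.iloc[:, k], X.iloc[:, i])
        plt.xlabel(X.columns[k])
        plt.ylabel(X.columns[i])
plt.show()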

SPLITTING THE DATA:
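Condensed from the CODE appendix: 80% of the records are used for training and 20% are held out for testing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)             # sizes of the training and test sets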

MODEL TRAINING:
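Condensed from the CODE appendix: a logistic regression model is fitted on the training set and used to predict the unseen test set.

from sklearn.linear_model import LogisticRegression

logmodl = LogisticRegression(max_iter=1000)    # max_iter raised so the solver converges
logmodl.fit(X_train, y_train)                  # learn the coefficients from the training data
y_pred = logmodl.predict(X_test)               # predicted diagnoses for the test data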

MODEL EVALUATION:
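Condensed from the CODE appendix: the accuracy score, the confusion matrix and the full classification report are computed on the test set.

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print(accuracy_score(y_test, y_pred) * 100)    # percentage of correct predictions
cm = confusion_matrix(y_test, y_pred)
print(cm)                                      # rows: actual classes, columns: predicted classes
print(classification_report(y_test, y_pred))   # precision, recall and F1-score per class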

PROJECT LIMITATIONS

• We worked on the backend part of the system; there is no
frontend work associated, which could have given a more realistic look
and a focus on user experience.
• When using a dataset larger than 1 GB, the project will not work
properly.

FUTURE SCOPE

In this Python project, we learned to build a breast cancer cell predictor
on the Wisconsin dataset and created graphs and results for the same. It
has been observed that a good dataset provides better accuracy. Selection
of appropriate algorithms together with a good dataset will lead to the
development of better prediction systems. These systems can assist in proper
treatment methods for a patient diagnosed with breast cancer. There are
many treatments for a patient based on the breast cancer stage; data mining
and machine learning can be of great help in deciding the line of
treatment to be followed, by extracting knowledge from suitable
databases.

SUMMARY
Logistic regression is a powerful machine learning tool, and we can use
it successfully for predicting categorical outputs of biomedical data. Data
wrangling and data mining benefit from the excellent performance
offered by Python and its libraries, which are well supported by the community.
Linear algebra programming has intrinsic advantages in avoiding, where
possible, ‘while’ and ‘for’ loops. It is enabled by NumPy, a
package that vectorizes matrices. NumPy makes working with them
more comfortable and guarantees better control over the operations,
especially for large arrays.
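A small illustration of this point (not taken from the project code): the same element-wise operation written with an explicit Python loop and as a vectorized NumPy expression.

import numpy as np

values = np.random.rand(1_000_000)

# Explicit Python loop
squared_loop = [v ** 2 for v in values]

# Vectorized NumPy equivalent: shorter and much faster on large arrays
squared_vec = values ** 2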

Moreover, the machine learning scenario in Python is enriched by the
presence of many powerful packages (e.g., scikit-learn) which provide
excellently optimized classification and prediction on data.

The diagnostic dataset, with its 569 patients and 30 features, offers an
exhaustive assortment of parameters for classification and for this reason
represents a good example for machine learning applications. However,
many of these features seem to be redundant, and the definite impact of
some of them on classification and prediction remains unknown.

BIBLIOGRAPHY

• https://www.wikipedia.org/

• https://www.kaggle.com/leemun1/predicting-breast-cancer-logistic-regression

• https://www.researchgate.net/publication/8251094_Cancer_classification_and_prediction_using_logistic_regression_with_Bayesian_gene_selection

CODE

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
%matplotlib inline
cancer_data = pd.read_csv('cancer_data.csv')   # load the Wisconsin breast cancer dataset
cancer_data.head()
print(len(cancer_data.index))                  # number of records
cancer_data.info()
sns.countplot(x='diagnosis', data=cancer_data) # class balance of the diagnosis column
cancer_data.isnull().sum()                     # missing values per column
sns.heatmap(cancer_data.isnull())
cancer_data.drop('id', axis=1, inplace=True)   # the id column carries no predictive information
cancer_data.head(n=1)
cancer_data.diagnosis.unique()
cancer_data.describe()
cancer_data = cancer_data.dropna()             # drop incomplete rows
y = cancer_data['diagnosis']                   # dependent (target) variable
y.head()
X = cancer_data.drop('diagnosis', axis=1)      # independent variables (features)
from sklearn.feature_selection import RFE
model = LogisticRegression(solver='lbfgs', max_iter=3000)
rfe = RFE(model, n_features_to_select=10)      # recursive feature elimination keeps the 10 best features
fit = rfe.fit(X, y)
dropColumns = []
for i in range(0, 30):
    if fit.ranking_[i] != 1:                   # rank 1 marks a selected feature
        dropColumns.append(X.columns[i])
X.drop(dropColumns, axis=1, inplace=True)
X.head()
# Pairwise scatter plots of the ten selected features: each feature on the
# x-axis against every later feature on the y-axis (one figure per x-axis feature).
for k in range(9):
    plt.figure(figsize=(15, 20))
    for j, i in enumerate(range(k + 1, 10), start=1):
        plt.subplot(5, 2, j)
        plt.scatter(X.iloc[:, k], X.iloc[:, i])
        plt.xlabel(X.columns[k])
        plt.ylabel(X.columns[i])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
y_test
logmodl = LogisticRegression(max_iter=1000)    # max_iter raised so the lbfgs solver converges
logmodl.fit(X_train, y_train)                  # train on 80% of the data
X_test
y_test
y_pred = logmodl.predict(X_test)               # predict the held-out 20%

y_pred
df=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})
df
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
accuracy_score(y_test, y_pred) * 100           # accuracy as a percentage
cm = confusion_matrix(y_test, y_pred)
((cm[0][0] + cm[1][1]) / (cm[0][0] + cm[0][1] + cm[1][0] + cm[1][1])) * 100   # accuracy recomputed from the confusion matrix
print(classification_report(y_test, y_pred))   # precision, recall and F1-score per class
predict={0:'Benign',1:'Malignant'}
print("Actual Data and answer:")
print(X_test.iloc[-2].values)
print(y_test.iloc[-2],",",predict[y_test.iloc[-2]])
print("Model's Answer:")
value = X_test.iloc[[-2]].values
z = logmodl.predict(value)
print(z)
predict[z[0]]

