CANCER CELL DETECTION
USING
LOGISTIC REGRESSION

A PROJECT REPORT
In partial fulfillment of the requirements for the award of the degree of
BACHELOR OF COMPUTER APPLICATION
Under the guidance of
Sofikul Mullick
Submitted by
ADITYA BASAK 31401218057
DIPTARUP DAS 31401218044
RONENDRANATH ROY 31401218026
SOHAM JASHU 31401218014
SOUMYADEEP SARKAR 31401218012
In association with
Engineering Study Centre (ESC)
1. Title of the Project: CANCER CELL DETECTION USING LOGISTIC REGRESSION
2. Project Members: ADITYA BASAK, DIPTARUP DAS, RONENDRANATH ROY, SOHAM JASHU, SOUMYADEEP SARKAR
3. Name of the guide: SOFIKUL MULLICK
Final Project Report
16th July, 2021

Aditya Basak
Diptarup Das
Ronendranath Roy
Soham Jashu
Soumyadeep Sarkar
DECLARATION
We hereby declare that the project work presented in the project proposal
entitled “CANCER CELL DETECTION USING LOGISTIC REGRESSION”,
in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF COMPUTER APPLICATION at TECHNO
INTERNATIONAL NEW TOWN, is an authentic work carried out under the
guidance of SOFIKUL MULLICK. To the best of our knowledge and belief,
the matter embodied in this project work has not been submitted elsewhere
for the award of any degree.
Date:
CERTIFICATE FROM SUPERVISOR
This is to certify that this minor project proposal entitled “CANCER CELL
DETECTION USING LOGISTIC REGRESSION” is a record of bona fide
work, carried out by ADITYA BASAK, DIPTARUP DAS, RONENDRANATH
ROY, SOHAM JASHU and SOUMYADEEP SARKAR under my guidance at
TECHNO INTERNATIONAL NEW TOWN. In my opinion, the report in its
present form is in partial fulfillment of the requirements for the award of the
degree of BACHELOR OF COMPUTER APPLICATION and conforms to the
regulations of TECHNO INTERNATIONAL NEW TOWN. To the best of
my knowledge, the results embodied in this report are original in nature and
worthy of incorporation in the present version of the report.
Guide / Supervisor
------------------------------------------------
SOFIKUL MULLICK
ACKNOWLEDGEMENT
Words are inadequate to offer our thanks to the other trainees, project assistants
and faculty members at Techno International New Town for their encouragement
and cooperation in carrying out this project work. The guidance and support
received from everyone who contributed to this project were vital to its success.
INDEX
1. Abstract
2. Introduction
3. Problem Definition
4. Project Goal
5. Methodology
6. Project Objective
7. Project Workflow
8. Project Implementation
9. Step-by-Step Working
10. Project Limitations
11. Future Scope
12. Summary
13. Bibliography
ABSTRACT
INTRODUCTION
● Binomial: In binomial Logistic Regression, there can be only two possible
types of the dependent variable, such as 0 or 1, Pass or Fail, or, in this
project, Malignant or Benign.
PROBLEM DEFINITION
PROJECT GOAL
METHODOLOGY
PROJECT OBJECTIVE
The main objective of this project is Cancer Cell Detection using Logistic
Regression, i.e., classifying tumour data as malignant or benign from
measured cell features.
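At the core of the objective above is the sigmoid function, which logistic regression uses to turn a weighted sum of features into a class probability. A minimal sketch, using hypothetical weight and bias values (not ones learned from the project's data):

```python
import numpy as np

def sigmoid(z):
    # Maps any real value into (0, 1), interpretable as a probability
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights and bias for two features
w = np.array([0.8, -0.5])
b = 0.1

# One sample with two feature values
x = np.array([2.0, 1.0])

# Predicted probability of the positive (malignant) class
p = sigmoid(np.dot(w, x) + b)
print(p)                      # a probability between 0 and 1
print(1 if p >= 0.5 else 0)   # class decision at the 0.5 threshold
```

In practice the weights are fitted from the training data; scikit-learn's LogisticRegression, used later in this report, handles that optimization internally.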
PROJECT WORKFLOW
This section presents the detailed work architecture, showing the process of
Cancer Cell Detection using Logistic Regression.
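The workflow (data selection, cleaning, train/test split, training, evaluation) can be sketched end to end. The snippet below uses scikit-learn's built-in breast cancer dataset as a stand-in, since the project's own cancer_data.csv is not bundled with this report:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a public breast-cancer dataset (569 samples, 30 features)
data = load_breast_cancer()
X, y = data.data, data.target

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train the logistic regression classifier
model = LogisticRegression(max_iter=3000)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
acc = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {acc:.3f}")
```

The same five stages reappear, step by step, in the project implementation and code sections below.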
PROJECT IMPLEMENTATION
STEP BY STEP WORKING:
SELECTION OF DATA:
CLEANING DATA:
DEPENDENT & INDEPENDENT DATA:
VISUALIZATION OF DATA:
SPLITTING THE DATA:
MODEL TRAINING:
MODEL EVALUATION:
PROJECT LIMITATIONS
FUTURE SCOPE
SUMMARY
Logistic Regression is a powerful Machine Learning tool, and we can use it
successfully to predict categorical outputs of biomedical data. Data
wrangling and data mining benefit from the excellent performance offered
by Python and its libraries, which are well supported by the community.
Linear-algebra programming has intrinsic advantages in avoiding, where
possible, ‘while’ and ‘for’ loops. NumPy, a package that vectorizes
matrices, makes working with them more comfortable and guarantees
better control over the operations, especially for large arrays.
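As a small illustration of the vectorization point above, the same dot product written with an explicit loop and with a single NumPy call:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Explicit 'for' loop version
total = 0.0
for i in range(len(a)):
    total += a[i] * b[i]

# Vectorized version: one call, executed in optimized compiled code
vec_total = np.dot(a, b)

print(total, vec_total)  # both 32.0
```

For arrays of three elements the difference is invisible, but for the large feature matrices used in this project the vectorized form is both faster and less error-prone.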
The Diagnostic Data Set, with its 569 patients and 30 features, offers an
exhaustive assortment of parameters for classification and for this reason
represents a good example for Machine Learning applications. However,
many of these features appear to be redundant, and the definite impact of
some of them on classification and prediction remains unknown.
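The redundancy noted above is what motivates the recursive feature elimination step used in the project code. A compact sketch on the public scikit-learn copy of the dataset (feature names here come from that copy, not from the project's CSV):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X, y = data.data, data.target

# Ask RFE to keep the 10 most informative of the 30 features
rfe = RFE(LogisticRegression(max_iter=3000), n_features_to_select=10)
rfe.fit(X, y)

# Features ranked 1 survive; the rest are considered dispensable
kept = [name for name, r in zip(data.feature_names, rfe.ranking_) if r == 1]
print(len(kept), "features retained:")
print(kept)
```

The exact surviving set depends on the estimator and data, which is consistent with the observation that the individual impact of many features remains unclear.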
BIBLIOGRAPHY
• https://www.wikipedia.org/
• https://www.kaggle.com/leemun1/predicting-breast-cancer-logistic-regression
• https://www.researchgate.net/publication/8251094_Cancer_classification_and_prediction_using_logistic_regression_with_Bayesian_gene_selection
CODE
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
%matplotlib inline
# Load the dataset
cancer_data = pd.read_csv('cancer_data.csv')
cancer_data.head()
print(len(cancer_data.index))
cancer_data.info()

# Distribution of the target classes
sns.countplot(x='diagnosis', data=cancer_data)

# Check for missing values
cancer_data.isnull().sum()
sns.heatmap(cancer_data.isnull())

# The 'id' column carries no predictive information
cancer_data.drop('id', axis=1, inplace=True)
cancer_data.head(n=1)
cancer_data.diagnosis.unique()
cancer_data.describe()

# Drop rows with missing values
cancer_data = cancer_data.dropna()
# Independent (feature) and dependent (target) variables;
# y must be defined before it is passed to rfe.fit below
X = cancer_data.drop('diagnosis', axis=1)
y = cancer_data['diagnosis']

# Recursive Feature Elimination: rank features and keep the 10 best
from sklearn.feature_selection import RFE
model = LogisticRegression(solver='lbfgs', max_iter=3000)
rfe = RFE(model, n_features_to_select=10)
fit = rfe.fit(X, y)

# Drop every column that RFE did not rank first
dropColumns = []
for i in range(0, 30):
    if fit.ranking_[i] != 1:
        dropColumns.append(X.columns[i])
X.drop(dropColumns, axis=1, inplace=True)
X.head()
y.head()
# Pairwise scatter plots: each selected feature against the later ones
for base in range(9):
    plt.figure(figsize=(15, 20))
    j = 1
    for i in range(base + 1, 10):
        plt.subplot(5, 2, j)
        plt.scatter(X.iloc[:, base], X.iloc[:, i])
        plt.xlabel(X.columns[base])
        plt.ylabel(X.columns[i])
        j += 1
# Split the data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
y_test

# Train the logistic regression model on the training set
logmodl = LogisticRegression(max_iter=1000)
logmodl.fit(X_train, y_train)
X_test
y_test

# Predict labels for the held-out test set
y_pred = logmodl.predict(X_test)
y_pred
# Compare actual and predicted labels side by side
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df

# Model evaluation (accuracy as a percentage)
accuracy_score(y_test, y_pred) * 100

cm = confusion_matrix(y_test, y_pred)
# Accuracy recomputed by hand from the confusion matrix, as a percentage
((cm[0][0] + cm[1][1]) / (cm[0][0] + cm[0][1] + cm[1][0] + cm[1][1])) * 100

print(classification_report(y_test, y_pred))
# Human-readable names for the two encoded classes
predict = {0: 'Benign', 1: 'Malignant'}
print("Actual Data and answer:")
print(X_test.iloc[-2].values)
print(y_test.iloc[-2], ",", predict[y_test.iloc[-2]])
print("Model's Answer:")
value = X_test.iloc[[-2]].values
z = logmodl.predict(value)
print(z)
predict[z[0]]