Exploratory Data Analysis on Haberman Dataset


3/17/2020 Exploratory Data Analysis on Haberman Dataset

Data Set Information:

The dataset contains cases from a study that was conducted between 1958 and 1970 at the
University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for
breast cancer.

Attribute Information:

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died
within 5 years

Source: https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival

In [1]: import pandas as pd


import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings

warnings.filterwarnings("ignore")

habermandf = pd.read_csv("haberman.csv")

In [2]: # (Q) how many data-points and features?


habermandf.shape

Out[2]: (306, 4)

The dataset contains 306 data points (observations) and 4 attributes (characteristics).

In [3]: habermandf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age 306 non-null int64
year 306 non-null int64
nodes 306 non-null int64
status 306 non-null int64
dtypes: int64(4)
memory usage: 9.6 KB

All four columns hold integers, and there is no missing data: every column has 306 non-null values.

localhost:8888/notebooks/Subahani/Study/Applied AI/Assignments/Mandatory/Exploratory Data Analysis on Haberman Dataset.ipynb 1/12



In [4]: #(Q) What are the column names in our dataset?


habermandf.columns

Out[4]: Index(['age', 'year', 'nodes', 'status'], dtype='object')

In [5]: #(Q) How many data points for each class are present?
habermandf["status"].value_counts()

Out[5]: 1 225
2 81
Name: status, dtype: int64

This is an imbalanced dataset: the two classes have very different numbers of data points
(225 with status 1 versus 81 with status 2).
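To put a number on the imbalance, the class counts can be turned into shares of the whole dataset. The sketch below uses only the two counts printed in Out[5] and plain Python, so it runs without the CSV file; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output in Out[5].
class_counts = {1: 225, 2: 81}

total = sum(class_counts.values())  # 306 rows in all

# Share of each class in the dataset.
class_share = {label: count / total for label, count in class_counts.items()}
for label, share in class_share.items():
    print(f"status {label}: {share:.1%}")

# Ratio of the majority class to the minority class.
imbalance_ratio = class_counts[1] / class_counts[2]
print(f"imbalance ratio: {imbalance_ratio:.2f} : 1")
```

Roughly 73.5% of the patients fall in class 1 and 26.5% in class 2, so a model that always predicts class 1 would already look about 73.5% accurate.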

2-D Scatter Plot


In [6]: habermandf.plot(kind = "scatter", x = "age", y = "year")
plt.grid()
plt.show()


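In [7]: # 2-D scatter plot with color-coding for each survival status/class.
# How many combinations exist? 3C2
# Relabel status: 1 -> "Positive"; every other value -> "Negative"
# (the label used for the second class in In [9]).
habermandf["status"] = habermandf["status"].apply(lambda x: "Positive" if x == 1 else "Negative")

sns.set_style("whitegrid");
# `height` replaces the `size` argument, which seaborn renamed in version 0.9.
sns.FacetGrid(habermandf, hue="status", height=4) \
.map(plt.scatter, "age", "year") \
.add_legend();
plt.show();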

Observation(s):
1. Positive: the patient survived 5 years or longer
2. Negative: the patient died within 5 years

It is very hard to separate the two classes using age and year, as the data points overlap.

Pair-plot

In [8]: # pairwise scatter plot: Pair-Plot

plt.close();
sns.set_style("whitegrid");
sns.pairplot(habermandf, hue="status", height=3);
plt.show()

Histogram, PDF, CDF


In [9]: # What about a 1-D scatter plot using just one feature?
haberman_pos=habermandf.loc[habermandf["status"]=='Positive'];
haberman_neg=habermandf.loc[habermandf["status"]=='Negative'];

# All points sit at y = 0, so only the spread along the nodes axis is visible.
plt.plot(haberman_pos['nodes'],np.zeros_like(haberman_pos['nodes']),'o',label='Positive')
plt.plot(haberman_neg['nodes'],np.zeros_like(haberman_neg['nodes']),'o',label='Negative')
plt.ylabel("Counts")
plt.xlabel("Nodes")
plt.title("Haberman")
plt.legend()
plt.show()


In [10]: # Nodes
sns.FacetGrid(habermandf,hue='status',height=5)\
.map(sns.distplot,"nodes")\
.add_legend()
plt.title('Haberman')
plt.show()


In [11]: # Year
sns.FacetGrid(habermandf,hue='status',height=5)\
.map(sns.distplot,"year")\
.add_legend()
plt.title('Haberman')
plt.show()


In [12]: # Age
sns.FacetGrid(habermandf,hue='status',height=5)\
.map(sns.distplot,"age")\
.add_legend()
plt.title('Haberman')
plt.show()


In [13]: # Plot PDF and CDF of year for the Positive class
counts,bin_edges=np.histogram(haberman_pos['year'],bins=5,density=True)
# Normalise so the per-bin values sum to 1.
pdf=counts/(sum(counts))
print(pdf)
print(bin_edges)
cdf=np.cumsum(pdf)
# Both curves come from haberman_pos, so label them by curve type.
plt.plot(bin_edges[1:],pdf,label='PDF')
plt.plot(bin_edges[1:],cdf,label='CDF')
plt.xlabel('year')
plt.ylabel('probability')
plt.title("Haberman")
plt.legend()
plt.show()

[0.29333333 0.17333333 0.2 0.16444444 0.16888889]


[58. 60.2 62.4 64.6 66.8 69. ]
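The CDF plotted above is simply the running total of the per-bin probabilities. As a small sketch, the cumulative values can be reproduced from the numbers printed in the output, without loading the CSV; `itertools.accumulate` stands in for `np.cumsum` here only so the snippet needs no third-party packages.

```python
from itertools import accumulate

# Per-bin probabilities and bin edges copied from the printed output above.
pdf = [0.29333333, 0.17333333, 0.2, 0.16444444, 0.16888889]
bin_edges = [58.0, 60.2, 62.4, 64.6, 66.8, 69.0]

# Running total of the bin probabilities (the same operation as np.cumsum).
cdf = list(accumulate(pdf))

# Pair each right bin edge with the cumulative probability up to that edge.
for edge, value in zip(bin_edges[1:], cdf):
    print(f"year <= {edge}: {value:.3f}")
```

Read this way, about two thirds of the patients who survived were operated on in the first three bins (up to 64.6), and the last value reaches 1 because every patient falls at or below the final edge.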


In [14]: # Plots of PDF and CDF of year for each status (Positive/Negative)
counts,bin_edges=np.histogram(haberman_pos['year'],bins=5,density=True)
pdf=counts/(sum(counts))
print(pdf)
print(bin_edges)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='Positive PDF')
plt.plot(bin_edges[1:],cdf,label='Positive CDF')

counts,bin_edges=np.histogram(haberman_neg['year'],bins=5,density=True)
pdf=counts/(sum(counts))
print(pdf)
print(bin_edges)

cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='Negative PDF')
plt.plot(bin_edges[1:],cdf,label='Negative CDF')
plt.xlabel('year')
plt.ylabel('probability')
plt.title("Haberman")
plt.legend()
plt.show()

[0.29333333 0.17333333 0.2 0.16444444 0.16888889]


[58. 60.2 62.4 64.6 66.8 69. ]
[0.30864198 0.12345679 0.19753086 0.2345679 0.13580247]
[58. 60.2 62.4 64.6 66.8 69. ]


Box plot and Whiskers


In [15]: sns.boxplot(x='status',y='year',data=habermandf)
plt.title("Haberman")
plt.show()
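For readers unfamiliar with what a box plot draws, the sketch below computes the same quantities by hand: the quartiles, the interquartile range (IQR), and whiskers that reach the furthest points within 1.5 × IQR of the box, which is the default rule in seaborn and matplotlib. The `years` list is made up for illustration and is NOT taken from haberman.csv.

```python
from statistics import quantiles

# Hypothetical operation years, invented for illustration only
# (NOT taken from haberman.csv).
years = [58, 59, 60, 61, 63, 64, 65, 66, 67, 69]

# method="inclusive" uses linear interpolation between data points,
# which matches numpy's default percentile behaviour.
q1, median, q3 = quantiles(years, n=4, method="inclusive")
iqr = q3 - q1

# Whiskers extend to the most extreme data points inside the 1.5 * IQR fences.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
whisker_low = min(y for y in years if y >= lower_fence)
whisker_high = max(y for y in years if y <= upper_fence)

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}")
```

Any point outside the fences would be drawn as an individual outlier marker rather than being covered by a whisker.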

Violin Plots

In [16]: sns.violinplot(x='status',y='year',data=habermandf)
plt.title("Haberman")
plt.show()

Conclusion:
No feature separates the two classes cleanly; the classes overlap heavily and the dataset is imbalanced.

1. Patients younger than 35 appear to survive 5 years or longer.

2. Patients older than 75 appear not to survive 5 years.
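Age-based statements like the two above can be checked directly by computing the survival share inside an age range. The helper below is a minimal sketch, not part of the original notebook: `survival_rate` and its parameters are hypothetical names, and the `sample` records are invented purely to show the call, so the printed values say nothing about the real dataset.

```python
def survival_rate(records, min_age=None, max_age=None):
    """Share of records labelled "Positive" within an optional age range.

    Returns None when no record falls inside the range.
    """
    selected = [
        status
        for age, status in records
        if (min_age is None or age >= min_age)
        and (max_age is None or age <= max_age)
    ]
    if not selected:
        return None
    return selected.count("Positive") / len(selected)


# Hypothetical (age, status) pairs, invented for illustration only
# (NOT taken from haberman.csv).
sample = [
    (30, "Positive"),
    (33, "Positive"),
    (52, "Negative"),
    (60, "Positive"),
    (78, "Negative"),
]

print(survival_rate(sample, max_age=34))  # patients younger than 35
print(survival_rate(sample, min_age=76))  # patients older than 75
```

With the real data, the records could be built as `list(zip(habermandf["age"], habermandf["status"]))` once the status column has been relabelled as in In [7]; a rate of exactly 1.0 or 0.0 in a band would support the corresponding statement for this dataset only.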