Exploratory Data Analysis On Haberman Dataset PDF

3/17/2020 Exploratory Data Analysis on Haberman Dataset
Data Set Information:
The dataset contains cases from a study that was conducted between 1958 and 1970 at the
University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for
breast cancer.
Attribute Information:
1. Age of patient at time of operation (numerical)

2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute): 1 = the patient survived 5 years or longer, 2 = the patient died
within 5 years
Source: https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival
In [1]: import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings("ignore")
habermandf = pd.read_csv("haberman.csv")
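The raw `haberman.data` file distributed by UCI has no header row, so if the CSV was saved straight from it the column names must be supplied when reading. A minimal sketch under that assumption; the three inline rows are illustrative stand-ins for the file, and `sample_df` is a hypothetical name.

```python
import io
import pandas as pd

# Illustrative rows in the headerless "age,year,nodes,status" layout.
# Replace the StringIO object with the real file path when loading the dataset.
raw = io.StringIO("30,64,1,1\n30,62,3,1\n34,59,0,2\n")

columns = ["age", "year", "nodes", "status"]
# header=None stops pandas from consuming the first data row as column names.
sample_df = pd.read_csv(raw, header=None, names=columns)

print(sample_df.shape)
print(sample_df.dtypes)
```

If the file already carries a header row, as the column listing later in this notebook suggests, the plain `pd.read_csv("haberman.csv")` call is enough.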
In [2]: # (Q) how many data-points and features?

habermandf.shape
Out[2]: (306, 4)
The dataset contains 306 data points (observations) and 4 attributes (characteristics): 3 features plus the class label.
In [3]: habermandf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age 306 non-null int64
year 306 non-null int64
nodes 306 non-null int64
status 306 non-null int64
dtypes: int64(4)
memory usage: 9.6 KB
All four columns hold integers, and there is no missing data: every column has 306 non-null values.
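The same two checks can also be made programmatically rather than read off the `info()` printout. A small sketch on a made-up frame (`toy` is illustrative, not the Haberman data) that contains one deliberate gap:

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing age, so the checks have something to find.
toy = pd.DataFrame({"age": [30, 34, np.nan], "nodes": [1, 0, 3]})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# True only if every column has an integer dtype; a NaN forces float64.
all_integer = all(pd.api.types.is_integer_dtype(t) for t in toy.dtypes)

print(missing_per_column)
print(all_integer)
```

On `habermandf` the expectation, given the output above, is zero missing values per column and an all-integer result.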
localhost:8888/notebooks/Subahani/Study/Applied AI/Assignments/Mandatory/Exploratory Data Analysis on Haberman Dataset.ipynb 1/12

In [4]: #(Q) What are the column names in our dataset?

habermandf.columns
Out[4]: Index(['age', 'year', 'nodes', 'status'], dtype='object')
In [5]: #(Q) How many data points for each class are present?
habermandf["status"].value_counts()
Out[5]: 1 225
2 81
Name: status, dtype: int64
This is an imbalanced dataset: class 1 has 225 data points and class 2 has only 81, so the two
status classes are far from equally represented.
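The size of the imbalance is easier to judge as proportions. A short sketch using only the two counts printed above; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({1: 225, 2: 81}, name="status")

# Share of each class, and how many times larger the majority class is.
proportions = counts / counts.sum()
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(3))
print(round(imbalance_ratio, 2))
```

Roughly 73.5% of patients are in class 1, so a model that always predicts survival would already reach that accuracy; any classifier built on this data should be compared against that baseline.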
2-D Scatter Plot

In [6]: habermandf.plot(kind="scatter", x="age", y="year")  # plot kinds are lowercase
plt.grid()
plt.show()

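In [7]: # 2-D scatter plot with color-coding for each survival status class.
# How many feature combinations exist? 3C2 = 3
# Relabel the class codes: 1 -> "Positive" (survived), anything else -> "Negative".
habermandf["status"] = habermandf["status"].apply(lambda x: "Positive" if x == 1 else "Negative")
sns.set_style("whitegrid")
# `height` is the current name of the older `size` argument.
sns.FacetGrid(habermandf, hue="status", height=4) \
.map(plt.scatter, "age", "year") \
.add_legend()
plt.show()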
Observation(s):
1. Positive: the patient survived 5 years or longer
2. Negative: the patient died within 5 years
Age and year of operation do not separate the two classes: the data points overlap heavily.
Pair-plot
In [8]: # pairwise scatter plot: Pair-Plot
plt.close();
sns.set_style("whitegrid");
sns.pairplot(habermandf, hue="status", height=3);  # `height` replaces the older `size`
plt.show()
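The "3C2" count mentioned in the scatter-plot cell is the number of unordered pairs that can be drawn from the three numeric features, which is what the off-diagonal panels of the pair plot show. A quick sketch that enumerates them:

```python
from itertools import combinations
from math import comb

features = ["age", "year", "nodes"]

# Every unordered pair of features, i.e. every distinct 2-D scatter plot.
pairs = list(combinations(features, 2))

print(comb(len(features), 2))
print(pairs)
```

The pair plot draws each pair twice (once above and once below the diagonal) with the axes swapped, so only three of the six off-diagonal panels carry distinct information.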
Histogram, PDF, CDF

In [9]: # What about 1-D scatter plot using just one feature?
haberman_pos=habermandf.loc[habermandf["status"]=='Positive'];
haberman_neg=habermandf.loc[habermandf["status"]=='Negative'];
plt.plot(haberman_pos['nodes'],np.zeros_like(haberman_pos['nodes']),'o',label='Positive')
plt.plot(haberman_neg['nodes'],np.zeros_like(haberman_neg['nodes']),'o',label='Negative')
plt.ylabel("Counts")
plt.xlabel("Nodes")
plt.title("Haberman")
plt.legend()
plt.show()

In [10]: # Nodes
sns.FacetGrid(habermandf,hue='status',height=5)\
.map(sns.distplot,"nodes")\
.add_legend()
plt.title('Haberman')
plt.show()

In [11]: # Year
sns.FacetGrid(habermandf,hue='status',height=5)\
.map(sns.distplot,"year")\
.add_legend()
plt.show()

In [12]: # Age
sns.FacetGrid(habermandf,hue='status',height=5)\
.map(sns.distplot,"age")\
.add_legend()
plt.show()

In [13]: #Plot CDF

counts,bin_edges=np.histogram(haberman_pos['year'],bins=5,density=True)
pdf=counts/(sum(counts))
print(pdf)
print(bin_edges)
cdf=np.cumsum(pdf)
# Both curves describe the Positive class: one is its PDF, the other its CDF.
plt.plot(bin_edges[1:],pdf,label='PDF (Positive)')
plt.plot(bin_edges[1:],cdf,label='CDF (Positive)')
plt.xlabel('year')
plt.ylabel('probability')
plt.legend()
plt.show()
[0.29333333 0.17333333 0.2 0.16444444 0.16888889]

[58. 60.2 62.4 64.6 66.8 69. ]
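To make the relation between the two curves concrete: the CDF is just the running sum of the per-bin probabilities, read off at the right edge of each bin. A small sketch that recomputes it from the values printed above (copied by hand, so the final value is 1 only up to rounding):

```python
import numpy as np

# Values copied from the printed output above (Positive class, 5 bins over year).
pdf = np.array([0.29333333, 0.17333333, 0.2, 0.16444444, 0.16888889])
bin_edges = np.array([58.0, 60.2, 62.4, 64.6, 66.8, 69.0])

# Running total of the bin probabilities.
cdf = np.cumsum(pdf)

# cdf[i] is the share of patients whose year falls at or below bin_edges[i + 1].
for right_edge, share in zip(bin_edges[1:], cdf):
    print(f"year <= {right_edge:.1f}: {share:.3f}")
```

Read this way, about two thirds of the surviving patients were operated on in the first three bins, that is, up to year value 64.6.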

In [14]: # Plots of CDF of Year for Status (Positive/Negative)

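# Positive class: normalise the bin counts to a PDF, then accumulate into a CDF.
counts,bin_edges=np.histogram(haberman_pos['year'],bins=5,density=True)
pdf=counts/(sum(counts))
print(pdf)
print(bin_edges)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='PDF (Positive)')
plt.plot(bin_edges[1:],cdf,label='CDF (Positive)')
# Negative class: recompute the PDF so the Positive values are not reused.
counts,bin_edges=np.histogram(haberman_neg['year'],bins=5,density=True)
pdf=counts/(sum(counts))
print(pdf)
print(bin_edges)
cdf=np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf,label='PDF (Negative)')
plt.plot(bin_edges[1:],cdf,label='CDF (Negative)')
plt.xlabel('year')
plt.ylabel('probability')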
plt.legend()
plt.show()
[0.29333333 0.17333333 0.2 0.16444444 0.16888889]

[58. 60.2 62.4 64.6 66.8 69. ]
[0.30864198 0.12345679 0.19753086 0.2345679 0.13580247]
[58. 60.2 62.4 64.6 66.8 69. ]

Box plot and Whiskers

In [15]: sns.boxplot(x='status',y='year',data=habermandf)
plt.show()
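The box in a box plot spans the first to third quartile, the line inside it marks the median, and with the default settings the whiskers reach to the most extreme points within 1.5 times the interquartile range of the box. The numbers behind the picture can be sketched as follows; `values` is a made-up array, not taken from the dataset.

```python
import numpy as np

# Illustrative year values only.
values = np.array([58, 60, 61, 63, 63, 65, 66, 69])

# Quartiles define the box; the median is the line inside it.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points outside these fences would be drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(q1, median, q3, iqr)
print(lower_fence, upper_fence)
```

Applying the same calculation to `haberman_pos['year']` and `haberman_neg['year']` gives the values that the two boxes above display.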
Violin Plots
In [16]: sns.violinplot(x='status',y='year',data=habermandf)
plt.show()
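Conclusion:
No single feature cleanly separates the two classes: the distributions overlap heavily and the dataset is imbalanced.
1. Patients younger than 35 appear likely to survive 5 years or longer.
2. Patients older than 75 appear unlikely to survive 5 years or longer.
Both points rest on very few patients at the extremes of the age range, so they are tendencies in this sample rather than rules.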
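Age-band statements like the two above can be checked by tabulating status within each band instead of reading them off a plot. A sketch of the method on made-up rows (`toy` is illustrative and not the real data; the band edges follow the thresholds in the conclusion):

```python
import pandas as pd

# Made-up rows, used only to show the tabulation.
toy = pd.DataFrame({
    "age": [31, 33, 34, 52, 60, 77, 83],
    "status": ["Positive", "Positive", "Negative", "Positive",
               "Negative", "Positive", "Negative"],
})

# Left-closed bands: [0, 35), [35, 76), [76, 120).
bands = pd.cut(toy["age"], bins=[0, 35, 76, 120], right=False,
               labels=["under 35", "35 to 75", "over 75"])

# Count of each status within each age band.
summary = pd.crosstab(bands, toy["status"])
print(summary)
```

Running the same two steps on `habermandf` shows how many patients fall in each band and whether any of them contradict the stated pattern.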
