Exploratory Data Analysis On Haberman Dataset PDF
The dataset contains cases from a study that was conducted between 1958 and 1970 at the
University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for
breast cancer.
Attribute Information:
Source: https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival
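import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
habermandf = pd.read_csv("haberman.csv")
habermandf.shape
Out[2]: (306, 4)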
In [3]: habermandf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age 306 non-null int64
year 306 non-null int64
nodes 306 non-null int64
status 306 non-null int64
dtypes: int64(4)
memory usage: 9.6 KB
The dataset contains only integer columns. There is no missing data: all four columns have 306 non-null values.
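The same two checks can also be made programmatically instead of reading the info() output. The following is a minimal sketch on a hypothetical three-row frame (the values are made up and only the column names match the notebook):

```python
import pandas as pd

# Hypothetical miniature frame with the same four columns as the notebook;
# the values are made up for illustration only.
sample = pd.DataFrame({
    "age": [30, 45, 62],
    "year": [64, 60, 66],
    "nodes": [1, 0, 9],
    "status": [1, 1, 2],
})

# Number of missing (NaN) values in each column.
missing_per_column = sample.isnull().sum()

# True only if every column has an integer dtype.
all_integer = all(pd.api.types.is_integer_dtype(t) for t in sample.dtypes)

print(missing_per_column)
print("all integer columns:", all_integer)
```

Running the same two expressions on habermandf should agree with the info() output above.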
In [5]: #(Q) How many data points for each class are present?
habermandf["status"].value_counts()
Out[5]: 1 225
2 81
Name: status, dtype: int64
This is an imbalanced dataset: status 1 has 225 data points while status 2 has only 81, so the two classes are far from equally represented.
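To put a number on the imbalance, the class shares can be computed with value_counts(normalize=True). This sketch rebuilds the label column from the two counts printed above rather than reading haberman.csv; the variable names are illustrative:

```python
import pandas as pd

# Rebuild the label column from the counts printed above (225 and 81).
status = pd.Series([1] * 225 + [2] * 81, name="status")

counts = status.value_counts()
shares = status.value_counts(normalize=True)   # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()  # majority count / minority count

print(shares.round(3))
print(round(imbalance_ratio, 2))
```

Roughly three out of four patients fall in class 1, which is worth keeping in mind when comparing the per-class plots.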
In [7]: # 2-D scatter plot with color-coding for each survival status.
# How many feature combinations exist? 3C2 = 3
habermandf["status"] = habermandf["status"].apply(lambda x: "Positive" if x == 1 else "Negative")
sns.set_style("whitegrid")
# 'height' replaces the 'size' argument of older seaborn versions
sns.FacetGrid(habermandf, hue="status", height=4) \
.map(plt.scatter, "age", "year") \
.add_legend()
plt.show()
Observation(s):
1. Status 1 ("Positive"): the patient survived 5 years or longer.
2. Status 2 ("Negative"): the patient died within 5 years.
The two classes are very hard to distinguish using age and year, because their data points overlap.
Pair-plot
localhost:8888/notebooks/Subahani/Study/Applied AI/Assignments/Mandatory/Exploratory Data Analysis on Haberman Dataset.ipynb 3/12
3/17/2020 Exploratory Data Analysis on Haberman Dataset
plt.close()
sns.set_style("whitegrid")
# 'height' replaces the 'size' argument of older seaborn versions
sns.pairplot(habermandf, hue="status", height=3)
plt.show()
In [9]: # What about a 1-D scatter plot using just one feature?
haberman_pos = habermandf.loc[habermandf["status"] == 'Positive']
haberman_neg = habermandf.loc[habermandf["status"] == 'Negative']
# Plot every point at y = 0 so that only the 'nodes' value varies
plt.plot(haberman_pos['nodes'], np.zeros_like(haberman_pos['nodes']), 'o', label='Positive')
plt.plot(haberman_neg['nodes'], np.zeros_like(haberman_neg['nodes']), 'o', label='Negative')
plt.ylabel("Counts")
plt.xlabel("Nodes")
plt.title("Haberman")
plt.legend()
plt.show()
In [10]: # Nodes (sns.distplot is deprecated since seaborn 0.11; histplot/displot replace it)
sns.FacetGrid(habermandf, hue='status', height=5) \
.map(sns.distplot,"nodes")\
.add_legend()
plt.title('Haberman')
plt.show()
In [11]: # Year
sns.FacetGrid(habermandf, hue='status', height=5) \
.map(sns.distplot,"year")\
.add_legend()
plt.title('Haberman')
plt.show()
In [12]: # Age
sns.FacetGrid(habermandf, hue='status', height=5) \
.map(sns.distplot,"age")\
.add_legend()
plt.title('Haberman')
plt.show()
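# PDF and CDF of 'year' for the Negative class
counts, bin_edges = np.histogram(haberman_neg['year'], bins=5, density=True)
pdf = counts / counts.sum()   # fraction of Negative patients in each bin
print(pdf)
print(bin_edges)
cdf = np.cumsum(pdf)          # cumulative fraction up to each bin's right edge
plt.plot(bin_edges[1:], pdf, label='PDF (Negative)')
plt.plot(bin_edges[1:], cdf, label='CDF (Negative)')
plt.xlabel('year')
plt.ylabel('probability')
plt.title("Haberman")
plt.legend()
plt.show()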
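The PDF/CDF computation above can be wrapped in a small helper so it can be reused for other features or for the other class. This is a sketch only: pdf_cdf is a hypothetical name, and the input values are made up rather than taken from haberman.csv. It uses raw counts instead of density=True; because the bins have equal width, dividing by the sum gives the same per-bin fractions.

```python
import numpy as np

def pdf_cdf(values, bins=5):
    """Return (bin_edges, pdf, cdf); pdf holds the fraction of values in each bin."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()
    cdf = np.cumsum(pdf)
    return bin_edges, pdf, cdf

# Made-up values, for illustration only (not taken from haberman.csv).
years = np.array([58, 59, 60, 60, 61, 63, 64, 65, 66, 69])
edges, pdf, cdf = pdf_cdf(years, bins=5)

print(edges)
print(pdf)
print(cdf)
```

Each cdf entry is the fraction of values at or below the corresponding right bin edge, so the last entry is always 1.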
Violin Plots
In [16]: sns.violinplot(x='status', y='year', data=habermandf)
plt.title("Haberman")
plt.show()
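A violin plot shows the median and quartiles of each class together with a density estimate. The same summary numbers can be read directly with a group-wise quantile call. The frame below is hypothetical (made-up rows using the notebook's column names), so the printed values say nothing about the real data:

```python
import pandas as pd

# Hypothetical rows, for illustration only (not taken from haberman.csv).
df = pd.DataFrame({
    "status": ["Positive"] * 4 + ["Negative"] * 4,
    "year":   [58, 60, 62, 64, 59, 61, 65, 67],
})

# 25th, 50th and 75th percentile of 'year' within each status group;
# unstack() turns the quantile level into columns.
quartiles = df.groupby("status")["year"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```

Applied to habermandf, this gives one row per status with the three quartiles that the violin plot draws as its inner box.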
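Conclusion:
1. No clear relationship between the features and the survival status could be found, as the dataset is imbalanced and the classes overlap.
2. The plots suggest that patients older than 75 years are unlikely to survive 5 years or longer.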
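A claim about an age threshold like the one above can be checked with a cross-tabulation rather than by eye. The sketch below uses hypothetical rows (made up, not from haberman.csv) and the threshold of 75 from the conclusion. On the real data, the same two lines applied to habermandf would show how many patients fall on each side.

```python
import pandas as pd

# Hypothetical rows, for illustration only (not taken from haberman.csv).
df = pd.DataFrame({
    "age":    [34, 52, 61, 70, 76, 78, 83],
    "status": ["Positive", "Positive", "Negative", "Positive",
               "Positive", "Negative", "Negative"],
})

# Tabulate survival status for patients above and below the age threshold.
over_75 = (df["age"] > 75).rename("age > 75")
table = pd.crosstab(over_75, df["status"])
print(table)
```

The row labelled True holds the counts for patients older than 75. If that row has very few patients in total, any conclusion drawn from it should be treated with caution.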