EDA On Haberman Survival Data


In [1]:

#Performing Exploratory data analysis on haberman survival data set

In [2]:
#importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [3]:
#reading the data from csv file
haberman_data = pd.read_csv('haberman.csv')
#mapping status codes to labels: 1 -> Survived, 2 -> Dead
status_labels = {1:'Survived',2:'Dead'}
haberman_data['status'] = haberman_data['status'].map(status_labels)
#haberman_data.drop('survival_status',axis=1,inplace=True)

In [4]:
haberman_data

Out[4]:

     age  year  nodes    status
0     30    64      1  Survived
1     30    62      3  Survived
2     30    65      0  Survived
3     31    59      2  Survived
4     31    65      4  Survived
..   ...   ...    ...       ...
301   75    62      1  Survived
302   76    67      0  Survived
303   77    65      3  Survived
304   78    65      1      Dead
305   83    58      2      Dead

306 rows × 4 columns

In [5]:
haberman_data.status[haberman_data.nodes.isin([0,1])].value_counts()

Out[5]:
Survived 150
Dead 27
Name: status, dtype: int64

In [6]:
haberman_data.status[~(haberman_data.nodes.isin([0,1]))].value_counts()
Out[6]:
Survived 75
Dead 54
Name: status, dtype: int64

In [7]:
haberman_data['status'].value_counts()

Out[7]:
Survived 225
Dead 81
Name: status, dtype: int64

Objective:
1. The Haberman dataset contains records of cancer patients, and the goal is to classify from the given features whether a patient survived or died.
2. It is a classification task with the following status values:
2.1) status value 1 (renamed to Survived) means the patient survived 5 years or longer
2.2) status value 2 (renamed to Dead) means the patient died within 5 years

KeyFindings-1: From the counts above, patients with a node value of 0 or 1 mostly survived (150 of 177, about 85%), while for 2 or more nodes the split is much less clear (75 survived, 54 died).
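To put numbers on KeyFindings-1, the counts from Out[5] and Out[6] can be turned into per-group proportions. This is a minimal sketch that hard-codes those counts rather than reading haberman.csv; the names counts and rates are illustrative only.

```python
import pandas as pd

# Counts copied from Out[5] (nodes in [0,1]) and Out[6] (all other node values)
counts = pd.DataFrame(
    {'Survived': [150, 75], 'Dead': [27, 54]},
    index=['nodes in [0,1]', 'nodes >= 2'],
)

# Divide each row by its row total to get the share of each status per group
rates = counts.div(counts.sum(axis=1), axis=0).round(3)
print(rates)
```

On the full DataFrame, pd.crosstab(haberman_data.nodes.isin([0,1]), haberman_data.status, normalize='index') should give the same proportions in one step.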
In [8]:
#haberman_data.rename(columns={'stauts':'status'},inplace=True)

In [9]:
#plotting scatter plot between age and nodes
sns.scatterplot(x='age',y='nodes',hue='status',data=haberman_data).set_title("Scatter plot between age and nodes")
plt.show()

KeyFindings-2: 1. From the above plot, the two classes are not well separated by these two features.
In [10]:
#creating a pair plot to identify important features
sns.pairplot(haberman_data,hue='status',height=3)
plt.show()
KeyFindings-3:
1. The distributions of the given features are not clearly separable.
2. No single feature clearly distinguishes the Survived class from the Dead class.
3. Age and nodes look more useful than year.
4. Between age and nodes, nodes appears to be the more important feature.
In [11]:
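#drawing histogram for Nodes feature
#sns.distplot is deprecated since seaborn 0.11; histplot with kde=True draws a comparable plot
grid = sns.FacetGrid(haberman_data,hue='status',height=5)
grid.map(sns.histplot,'nodes',kde=True,stat='density')
grid.add_legend()
grid.fig.suptitle("Histogram for Node")
plt.show()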

In [12]:
grid = sns.FacetGrid(haberman_data,hue='status',height=4)
grid.map(sns.histplot,'age',kde=True,stat='density')
grid.add_legend()
grid.fig.suptitle("Distribution plot for Age")
plt.show()
In [13]:
grid = sns.FacetGrid(haberman_data,hue='status',height=3)
grid.map(sns.histplot,'year',kde=True,stat='density')
grid.add_legend()
grid.fig.suptitle("Distribution plot for Year")
plt.show()

Together with KeyFindings-1, these plots suggest that patients with a node value of 0 or 1 are more likely to have survived than died. The age plot suggests that patients younger than 34 survived, and the year feature is not helpful for separating the classes.
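The age < 34 claim can be checked directly by counting statuses among the rows that match a condition. The sketch below is an assumption-laden illustration: it uses only the ten rows printed in Out[4], and status_counts is a hypothetical helper, so it shows the mechanics rather than confirming the claim for all 306 rows.

```python
import pandas as pd

# Only the ten rows printed in Out[4]; pass the full haberman_data to check the claim properly
sample = pd.DataFrame({
    'age':    [30, 30, 30, 31, 31, 75, 76, 77, 78, 83],
    'year':   [64, 62, 65, 59, 65, 62, 67, 65, 65, 58],
    'nodes':  [1, 3, 0, 2, 4, 1, 0, 3, 1, 2],
    'status': ['Survived'] * 8 + ['Dead'] * 2,
})

def status_counts(df, mask):
    """Count Survived/Dead among the rows selected by a boolean mask."""
    counts = df.loc[mask, 'status'].value_counts()
    # Keep both labels in the result even when one of them has no rows
    return counts.reindex(['Survived', 'Dead'], fill_value=0)

young = status_counts(sample, sample['age'] < 34)
print(young)
```

Calling status_counts(haberman_data, haberman_data['age'] < 34) runs the same check on the whole dataset.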

In [14]:
#creating boxplots, violin plots and contour plots

In [15]:
#setting the style once, before the axes are created, so it applies to both subplots
sns.set_style('whitegrid')
plt.figure(1)
plt.subplot(121)
sns.boxplot(x='status',y='age',data=haberman_data)
plt.title('AgeVsStatus')
plt.subplot(122)
sns.boxplot(x='status',y='nodes',data=haberman_data)
plt.title('NodesVsStatus')
plt.show()
In [16]:
sns.set_style('whitegrid')
plt.figure(1)
plt.subplot(121)
sns.violinplot(x='status',y='age',data=haberman_data)
plt.title('AgeVsStatus')
plt.subplot(122)
sns.violinplot(x='status',y='nodes',data=haberman_data)
plt.title('NodesVsStatus')
plt.show()

In [17]:
grid = sns.jointplot(x='age',y='nodes',kind='kde',data=haberman_data)
grid.fig.suptitle("Kernel density estimate")
plt.show()

Conclusions:
1. From the above analysis, age and nodes are the important features.
2. With these two features we can build a simple if-else model: if nodes is in [0,1] or age < 34, classify the patient as Survived, otherwise as Dead.
3. Most survivors have a node value of 0 or 1, and most node values are scattered below 10.
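The if-else model from the conclusions could be sketched as follows. The function name rule_predict is hypothetical, and the accuracy is computed only on the ten rows shown in Out[4], so it is not an estimate for the full dataset.

```python
import numpy as np
import pandas as pd

# Ten rows printed in Out[4] (age, nodes, status); not the full dataset
sample = pd.DataFrame({
    'age':    [30, 30, 30, 31, 31, 75, 76, 77, 78, 83],
    'nodes':  [1, 3, 0, 2, 4, 1, 0, 3, 1, 2],
    'status': ['Survived'] * 8 + ['Dead'] * 2,
})

def rule_predict(df):
    """Predict Survived if nodes is 0 or 1, or age < 34; otherwise Dead."""
    survived = df['nodes'].isin([0, 1]) | (df['age'] < 34)
    return pd.Series(np.where(survived, 'Survived', 'Dead'), index=df.index)

predicted = rule_predict(sample)
# Fraction of rows where the rule agrees with the recorded status
accuracy = (predicted == sample['status']).mean()
print(predicted.tolist())
print(accuracy)
```

Running rule_predict(haberman_data) and comparing against haberman_data['status'] would show how far this rule gets on all 306 rows; a held-out split would be needed before trusting it as a classifier.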
