
PRACTICAL EXERCISES

1. Download, install and explore the features of Python for data analytics

i) Installation of the Python data analytics tools NumPy, pandas, Matplotlib, and seaborn.

Step 1 - Install latest Python from python.org website


Step 2 - Install latest NumPy using the following command
pip install numpy
Step 3 - Install latest pandas using the following command
pip install pandas
Step 4 - Install latest matplotlib using the following command
pip install matplotlib
Step 5 - Install latest seaborn using the following command
pip install seaborn
Step 6 - Install latest scikit-learn (imported in programs as sklearn) using the following command
pip install scikit-learn
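
After the installation steps, it can help to confirm which packages are actually present. The following is a minimal sketch using only the standard library; the helper name installed_versions and the PACKAGES list are illustrative assumptions, not part of the exercise.

```python
from importlib import metadata

# Distribution names as published on PyPI. The package that is imported
# as `sklearn` is distributed under the name `scikit-learn`.
PACKAGES = ["numpy", "pandas", "matplotlib", "seaborn", "scikit-learn"]


def installed_versions(names):
    """Return {name: version string, or None if the package is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print one line per package so that missing installs are easy to spot.
for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be installed by repeating the matching pip command.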

ii) Exploring the features of the Python data analytics tools NumPy

Exercise 1: Write a python program to create an array using NumPy tool and print the
array values

Program:
import numpy as np
arr = np.array([1,2,3,4,5])
print(arr)

Output
Exercise 2: Write a python program to create a 1-D array consisting of 50 values, where
each value has to be only 2, 4, 6, 8 or 9

a) The probability for the value to be 2 is set to be 0.2


b) The probability for the value to be 4 is set to be 0.3
c) The probability for the value to be 6 is set to be 0.3
d) The probability for the value to be 8 is set to be 0.1
e) The probability for the value to be 9 is set to be 0.1

Program:
from numpy import random as rd
x = rd.choice([2,4,6,8,9], p=[0.2,0.3,0.3,0.1,0.1], size=(50))
print(x)
Output
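
With only 50 draws, the observed share of each value will usually differ somewhat from the requested probability. The sketch below, which assumes NumPy's newer Generator interface (np.random.default_rng) and an arbitrary seed, counts the drawn values so the two can be compared.

```python
import numpy as np

values = [2, 4, 6, 8, 9]
probs = [0.2, 0.3, 0.3, 0.1, 0.1]

# A seeded generator makes the draw reproducible; the seed value is arbitrary.
rng = np.random.default_rng(seed=0)
x = rng.choice(values, p=probs, size=50)

# Count how often each value was drawn and convert the counts to shares.
drawn, counts = np.unique(x, return_counts=True)
observed = dict(zip(drawn.tolist(), (counts / x.size).tolist()))

for value, prob in zip(values, probs):
    print(f"value {value}: requested {prob:.2f}, observed {observed.get(value, 0.0):.2f}")
```

Increasing size brings the observed shares closer to the requested probabilities.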

Exercise 3: Write a python program to create a 2-D array with 4 rows and 4 columns,
where each value has to be only 2, 4, 6, 8 or 9

a) The probability for the value to be 2 is set to be 0.2
b) The probability for the value to be 4 is set to be 0.3
c) The probability for the value to be 6 is set to be 0.3
d) The probability for the value to be 8 is set to be 0.1
e) The probability for the value to be 9 is set to be 0.1
Program:
from numpy import random as rd
x = rd.choice([2,4,6,8,9], p=[0.2,0.3,0.3,0.1,0.1], size=(4,4))
print(x)
Output

iii) Exploring the features of the Python data analytics tools Seaborn and Matplotlib

Exercise 4: Write a python program which takes the distribution of points as an array
input and plots a curve corresponding to the distribution of points
Program:
import matplotlib.pyplot as plt
import seaborn as sns

sns.distplot([0, 1, 2, 3, 4, 5])
plt.show()

Output

Exercise 5: Write a python program which takes the distribution of points generated in
the Exercise 2 as an array input and plots a curve corresponding to the distribution of
points
Program:
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import random as rd

x = rd.choice([2,4,6,8,9], p=[0.2,0.3,0.3,0.1,0.1], size=(50))

sns.distplot(x)

plt.show()

Output

Exercise 7: Write a python program to generate an array containing random Normal
distribution points and plot a curve corresponding to the random Normal distribution
points

Program:

from numpy import random


import matplotlib.pyplot as plt
import seaborn as sns

sns.distplot(random.normal(size=1000), hist=False)

plt.show()
Output

Exercise 8: Write a python program to generate an array containing random Binomial
distribution points and plot a curve corresponding to the random Binomial distribution
points
Program:
from numpy import random
import matplotlib.pyplot as plt
import seaborn as sns

sns.distplot(random.binomial(n=20, p=0.3, size=1000), hist=True, kde=False)

plt.show()

Output
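
A quick numerical check can complement the plot: for a Binomial(n, p) variable the mean is n*p and the variance is n*p*(1-p). The sketch below reuses the parameters of the exercise; the seeded generator is an assumption added for reproducibility.

```python
import numpy as np

n, p, size = 20, 0.3, 1000  # same parameters as the exercise above

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility
sample = rng.binomial(n=n, p=p, size=size)

# Theoretical moments of a Binomial(n, p) variable.
expected_mean = n * p
expected_var = n * p * (1 - p)

print(f"mean: sample {sample.mean():.3f}, theory {expected_mean:.3f}")
print(f"variance: sample {sample.var(ddof=1):.3f}, theory {expected_var:.3f}")
```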

Exercise 9: Write a python program to generate an array containing random Poisson
distribution points and plot a curve corresponding to the random Poisson distribution
points
Program:

from numpy import random


import matplotlib.pyplot as plt
import seaborn as sns

sns.distplot(random.poisson(lam=2, size=1000), kde=False)

plt.show()

Output

Exercise 10: Write a python program to generate an array containing random Uniform
distribution points and plot a curve corresponding to the random Uniform distribution
points
Program:

from numpy import random


import matplotlib.pyplot as plt
import seaborn as sns

sns.distplot(random.uniform(size=1000), hist=False)

plt.show()
Output

Exercise 11: Write a python program to generate an array containing random Logistic
distribution points and plot a curve corresponding to the random Logistic distribution
points

Program:
from numpy import random
import matplotlib.pyplot as plt
import seaborn as sns

sns.distplot(random.logistic(size=1000), hist=False)

plt.show()

Output
Exercise 12: Write a python program to generate an array containing random Chi
Square distribution points and plot a curve corresponding to the random Chi Square
distribution points

Program:
from numpy import random
import matplotlib.pyplot as plt
import seaborn as sns

sns.distplot(random.chisquare(df=1, size=1000), hist=False)

plt.show()

Output

iv) Exploring the features of the Python data analytics tools pandas

Exercise 13: Write a python program to read the attached csv data into a data frame
using pandas python tool

data.csv

Program:
import pandas as pd

df = pd.read_csv('data.csv')

print(df.to_string())
Output

Exercise 14: Write a python program to read the attached csv data into a data frame
using pandas python tool and print the first 10 rows

data.csv

Program:
import pandas as pd

df = pd.read_csv('data.csv')

print(df.head(10))

Output
Exercise 15: Write a python program to read the attached csv data into a data frame
using pandas python tool and print the last 10 rows

data.csv

Program:
import pandas as pd

df = pd.read_csv('data.csv')

print(df.tail(10))

Output

Exercise 16: Write a python program to read the attached csv data into a data frame
using pandas python tool and remove all rows with NULL values

data_not_proper.csv

Program:
import pandas as pd
df = pd.read_csv('data_not_proper.csv')
df.dropna(inplace = True)

print(df.to_string())
Output
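
Since the attached file is not reproduced here, the following sketch illustrates the same step on a small inline CSV. The column names and values are hypothetical and may not match data_not_proper.csv.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the attached file.
csv_text = """Duration,Pulse,Calories
60,110,409.1
45,,282.4
30,100,
"""

df = pd.read_csv(io.StringIO(csv_text))

# Empty fields are read as NaN; count them per column before cleaning.
print("Missing values per column:")
print(df.isna().sum())

# dropna() returns a new frame; df itself is left unchanged here.
cleaned = df.dropna()
print(f"Rows before: {len(df)}, rows after: {len(cleaned)}")
```

Unlike dropna(inplace = True) in the program above, this version keeps the original frame available for comparison.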

Exercise 17: Write a python program to read the attached csv data into a data frame
using pandas python tool and convert the wrongly formatted data in rows 22 and 26
into the correct format

data_not_proper.csv

Program:

import pandas as pd

df = pd.read_csv('data_not_proper.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())

Output
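
How a whole column of mixed date formats is parsed can vary between pandas versions. As a hedged alternative, the sketch below parses each value on its own and turns anything unparseable into NaT. The sample values and the helper name parse_date are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical values: one well-formed date, one compact date without
# separators, one missing entry and one string that is not a date at all.
dates = pd.Series(["2020/12/22", "20201226", np.nan, "not a date"])


def parse_date(value):
    """Parse a single value; anything that cannot be parsed becomes NaT."""
    return pd.to_datetime(value, errors="coerce")


parsed = dates.apply(parse_date)
print(parsed)

# Rows that could not be converted can then be inspected or dropped.
print("Unparseable rows:", parsed.isna().sum())
```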
Exercise 18: Write a python program to read the attached csv data into a data frame
using pandas python tool and remove duplicate records.

data_not_proper.csv

Program:

import pandas as pd

df = pd.read_csv('data_not_proper.csv')
df.drop_duplicates(inplace = True)
print(df.to_string())

Output
Exercise 19: Write a python program to read the attached csv data into a data frame
using pandas python tool and determine the correlations between columns in the csv
file.

data.csv

Program:

import pandas as pd

df = pd.read_csv('data.csv')
x = df.corr()
print(x)

Output
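
Correlation is only defined for numeric columns, so a file that also contains text columns may need filtering first. The sketch below uses made-up data and select_dtypes to keep the numeric columns before calling corr(); the column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: two numeric columns that are exactly proportional,
# one unrelated numeric column and one text column.
df = pd.DataFrame(
    {
        "duration": [30, 45, 60, 90],
        "calories": [150.0, 225.0, 300.0, 450.0],  # 5 * duration
        "pulse": [110, 95, 120, 100],
        "date": ["2020/12/01", "2020/12/02", "2020/12/03", "2020/12/04"],
    }
)

# Keep only numeric columns so that text columns cannot break the computation.
numeric = df.select_dtypes(include="number")
corr = numeric.corr()  # Pearson correlation by default
print(corr.round(3))
```

A value close to 1 or -1 indicates a strong linear relationship; a value close to 0 indicates a weak one.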

2. Use the diabetes data set from UCI and Pima Indians for
performing the following:
a)
i) Univariate Analysis: Frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis using UCI diabetes data set

dia_dataset_uci.csv

Program:
import pandas as pd
df = pd.read_csv('dia_dataset_uci.csv')

# Frequency of different codes
freq = df['code'].value_counts()

print("The Frequency of occurrence of different codes \n")

print(freq)

#calculate mean of 'insulin values'


mean = df['insulin_value'].mean()

print("The Mean of Insulin Values : ", mean)

#calculate median of 'insulin values'


median = df['insulin_value'].median()

print("The Median of Insulin Values : ", median)

#calculate mode of 'insulin values'


mode = df['insulin_value'].mode()

print("The Mode of Insulin Values : ", mode)

#calculate variance of 'insulin values'


var = df['insulin_value'].var()

print("The Variance of Insulin Values : ", var)

#calculate standard deviation of 'insulin values'


std = df['insulin_value'].std()

print("The Standard Deviation of Insulin Values : ", std)

#calculate Skewness of 'insulin values'


skew = df['insulin_value'].skew()

print("The Skewness of Insulin Values : ", skew)

#calculate Kurtosis of 'insulin values'


kurt = df['insulin_value'].kurt()

print("The Kurtosis of Insulin Values : ", kurt)

Output
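
Two details of these pandas methods are easy to miss: mode() returns a Series because several values can tie, and var() and std() use the sample formula (division by n - 1) by default. In addition, kurt() reports excess kurtosis, which is 0 for a normal distribution. A small sketch on made-up numbers:

```python
import pandas as pd

# Small made-up sample with two equally frequent values (2 and 3).
s = pd.Series([1, 2, 2, 3, 3])

modes = s.mode()                # a Series, here containing both 2 and 3
sample_var = s.var()            # divides by n - 1 (ddof=1), the pandas default
population_var = s.var(ddof=0)  # divides by n

print("modes:", modes.tolist())
print("sample variance:", sample_var)
print("population variance:", population_var)
```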
ii) Univariate Analysis: Frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis using Pima Indians diabetes data set

dia_dataset_prima_india.csv

Program:

import pandas as pd

df = pd.read_csv('dia_dataset_prima_india.csv')

# Frequency of the number of pregnancies
freq = df['pregency_no_of_times'].value_counts()

print("The Frequency of Pregnancy \n")

print(freq)

#calculate mean of 'Serum insulin values'


mean = df['ser_insulin'].mean()

print("The Mean of Serum Insulin Values : ", mean)

#calculate median of 'Serum insulin values'


median = df['ser_insulin'].median()

print("The Median of Serum Insulin Values : ", median)

#calculate mode of 'Serum insulin values'


mode = df['ser_insulin'].mode()

print("The Mode of Serum Insulin Values : ", mode)

#calculate variance of 'Serum insulin values'


var = df['ser_insulin'].var()

print("The Variance of Serum Insulin Values : ", var)

#calculate standard deviation of 'Serum insulin values'


std = df['ser_insulin'].std()

print("The Standard Deviation of Serum Insulin Values : ", std)

#calculate Skewness of 'Serum insulin values'


skew = df['ser_insulin'].skew()

print("The Skewness of Serum Insulin Values : ", skew)

#calculate Kurtosis of 'Serum insulin values'


kurt = df['ser_insulin'].kurt()

print("The Kurtosis of Serum Insulin Values : ", kurt)

Output

b)
i) Bivariate Analysis: Linear regression modelling using UCI diabetes data set

dia_dataset_uci.csv

Program:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m", marker = "o", s = 30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color = "g")

    # putting labels
    plt.xlabel('code')
    plt.ylabel('insulin_value')

    # function to show plot
    plt.show()

def main():
    df = pd.read_csv('dia_dataset_uci.csv')

    # observations / data
    x = np.array(df['code'])
    y = np.array(df['insulin_value'])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
Output
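
One way to gain confidence in the closed-form coefficients is to compare them with NumPy's own least-squares fit. The sketch below repeats the estimate_coef formulas on made-up observations and cross-checks them against np.polyfit; the data values are illustrative only.

```python
import numpy as np


def estimate_coef(x, y):
    """Closed-form simple linear regression, same formulas as the program above."""
    n = np.size(x)
    m_x, m_y = np.mean(x), np.mean(y)
    ss_xy = np.sum(y * x) - n * m_y * m_x
    ss_xx = np.sum(x * x) - n * m_x * m_x
    b_1 = ss_xy / ss_xx
    b_0 = m_y - b_1 * m_x
    return b_0, b_1


# Made-up observations lying close to the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8, 11.0])

b_0, b_1 = estimate_coef(x, y)

# np.polyfit returns coefficients from the highest degree down: [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

print(f"closed form: b_0={b_0:.4f}, b_1={b_1:.4f}")
print(f"np.polyfit : b_0={intercept:.4f}, b_1={slope:.4f}")
```

Both approaches minimise the same sum of squared residuals, so the two lines of output should agree up to rounding.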

ii) Bivariate Analysis: Linear regression modelling using Pima Indians diabetes
data set

dia_dataset_prima_india.csv

Program:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m", marker = "o", s = 30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color = "g")

    # putting labels
    plt.xlabel('pregency_no_of_times')
    plt.ylabel('ser_insulin')

    # function to show plot
    plt.show()

def main():
    df = pd.read_csv('dia_dataset_prima_india.csv')

    # observations / data
    x = np.array(df['pregency_no_of_times'])
    y = np.array(df['ser_insulin'])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
Output

iii) Bivariate Analysis: Logistic regression modelling using UCI diabetes data set

dia_dataset_uci.csv

Program:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

df = pd.read_csv('dia_dataset_uci.csv')
# input
x = df.iloc[:, [2, 3]].values
# output
y = df.iloc[:, 4].values
# Training Data Set
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
sc_x = StandardScaler()
xtrain = sc_x.fit_transform(X_train)
xtest = sc_x.transform(X_test)
print (xtrain[0:10, :])
classifier = LogisticRegression(random_state = 0)
classifier.fit(xtrain, y_train)
y_pred = classifier.predict(xtest)
cm = confusion_matrix(y_test, y_pred)
print ("Confusion Matrix : \n", cm)
print ("Accuracy : ", accuracy_score(y_test, y_pred))

Output
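
The accuracy printed by the program can be derived directly from the confusion matrix. The sketch below uses a hypothetical 2x2 matrix in scikit-learn's layout (rows are actual classes, columns are predicted classes) and also derives precision and recall, which the program does not print.

```python
import numpy as np

# Hypothetical 2x2 confusion matrix in scikit-learn's layout:
# rows are actual classes, columns are predicted classes.
cm = np.array([[50, 5],
               [10, 35]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # correct predictions / all predictions
precision = tp / (tp + fp)       # of the predicted positives, how many are right
recall = tp / (tp + fn)          # of the actual positives, how many are found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```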
iv) Bivariate Analysis: Logistic regression modelling using Pima Indians diabetes
data set

dia_dataset_prima_india.csv

Program:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

df = pd.read_csv('dia_dataset_prima_india.csv')
# input
x = df.iloc[:, [0, 4]].values
# output
y = df.iloc[:, 8].values
# Training Data Set
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
sc_x = StandardScaler()
xtrain = sc_x.fit_transform(X_train)
xtest = sc_x.transform(X_test)
print (xtrain[0:10, :])
classifier = LogisticRegression(random_state = 0)
classifier.fit(xtrain, y_train)
y_pred = classifier.predict(xtest)
cm = confusion_matrix(y_test, y_pred)
print ("Confusion Matrix : \n", cm)
print ("Accuracy : ", accuracy_score(y_test, y_pred))

Output
c)
i) Multiple Regression Analysis: Multiple Regression Analysis using UCI diabetes
data set

dia_dataset_uci.csv

Program:

import numpy as np
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('dia_dataset_uci.csv')
# observations / data
x = np.array(df['code'])
y = np.array(df['insulin_value'])
mpl.rcParams['legend.fontsize'] = 12

fig = plt.figure()
ax = fig.add_subplot(111, projection ='3d')

ax.scatter(x, x, y, label ='Diabetes Range', s = 5)


ax.legend()
ax.view_init(45, 0)

plt.show()

Output
ii) Multiple Regression Analysis: Multiple Regression Analysis using Pima Indian
diabetes data set

dia_dataset_prima_india.csv

Program:

import numpy as np
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('dia_dataset_prima_india.csv')
# observations / data
x = np.array(df['pregency_no_of_times'])
y = np.array(df['ser_insulin'])
z = np.array(df['age'])
mpl.rcParams['legend.fontsize'] = 12

fig = plt.figure()
ax = fig.add_subplot(111, projection ='3d')

ax.scatter(x, y, z, label ='Diabetes Range', s = 5)


ax.legend()
ax.view_init(45, 0)

plt.show()

Output
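
The programs above visualise the variables as a 3-D scatter plot. If regression coefficients are also wanted, one possible approach is ordinary least squares with np.linalg.lstsq. The sketch below is an illustration on synthetic data with known coefficients, not the method used in the original programs; the variable names x1, x2 and z are assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed

# Synthetic stand-in data: two predictors and a response built from known
# coefficients plus a little noise, so the fitted values can be checked.
x1 = rng.uniform(0, 10, size=200)
x2 = rng.uniform(20, 60, size=200)
z = 3.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=200)

# Design matrix with a leading column of ones for the intercept.
A = np.column_stack([np.ones_like(x1), x1, x2])

# Ordinary least squares: minimise ||A @ coef - z||^2.
coef, residuals, rank, _ = np.linalg.lstsq(A, z, rcond=None)
b0, b1, b2 = coef

print(f"intercept={b0:.3f}, slope_x1={b1:.3f}, slope_x2={b2:.3f}")
```

With real data, x1, x2 and z would be replaced by the chosen columns of the data frame.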
d)
i) Write a python program to compare the analysis results of UCI and Pima Indian
diabetes data set

dia_dataset_uci.csv dia_dataset_prima_india.csv

Program:

import pandas as pd
dfu = pd.read_csv('dia_dataset_uci.csv')
dfp = pd.read_csv('dia_dataset_prima_india.csv')

#calculate mean of 'insulin values'
meanu = dfu['insulin_value'].mean()
meanp = dfp['ser_insulin'].mean()
print("The Mean of Insulin Values of UCI Dataset: {0}, Pima DataSet: {1}".format(meanu,meanp))

#calculate median of 'insulin values'
medianu = dfu['insulin_value'].median()
#calculate median of 'Serum insulin values'
medianp = dfp['ser_insulin'].median()
print("The Median of Insulin Values of UCI Dataset: {0}, Pima DataSet: {1}".format(medianu,medianp))

#calculate mode of 'insulin values'
modeu = dfu['insulin_value'].mode()
#calculate mode of 'Serum insulin values'
modep = dfp['ser_insulin'].mode()
print("The Mode of Insulin Values of UCI Dataset: {0}, Pima DataSet: {1}".format(modeu,modep))

#calculate variance of 'insulin values'
varu = dfu['insulin_value'].var()
varp = dfp['ser_insulin'].var()
print("The Variance of Insulin Values of UCI Dataset: {0}, Pima DataSet: {1}".format(varu,varp))

#calculate standard deviation of 'insulin values'
stdu = dfu['insulin_value'].std()
#calculate standard deviation of 'Serum insulin values'
stdp = dfp['ser_insulin'].std()
print("The Standard Deviation of Insulin Values of UCI Dataset: {0}, Pima DataSet: {1}".format(stdu,stdp))

#calculate Skewness of 'insulin values'
skewu = dfu['insulin_value'].skew()
skewp = dfp['ser_insulin'].skew()
print("The Skewness of Insulin Values of UCI Dataset: {0}, Pima DataSet: {1}".format(skewu,skewp))

#calculate Kurtosis of 'insulin values'
kurtu = dfu['insulin_value'].kurt()
kurtp = dfp['ser_insulin'].kurt()
print("The Kurtosis of Insulin Values of UCI Dataset: {0}, Pima DataSet: {1}".format(kurtu,kurtp))
Output
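
The same comparison can be collected into a single table, which is often easier to read than separate print statements. The sketch below uses made-up numbers in place of the two insulin columns; the names uci, pima and summary are assumptions for illustration.

```python
import pandas as pd

# Made-up numbers standing in for dfu['insulin_value'] and dfp['ser_insulin'].
uci = pd.Series([9, 13, 16, 7, 10, 24, 8])
pima = pd.Series([0, 94, 168, 88, 543, 846, 175])

stats = ["mean", "median", "var", "std", "skew", "kurt"]

# agg() with a list of names returns one value per statistic, indexed by name.
summary = pd.DataFrame({"UCI": uci.agg(stats), "Pima": pima.agg(stats)})
print(summary.round(2))
```

The mode is left out of the table because it can return more than one value per column.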
3. Apply Bayesian and SVM techniques on Iris and Diabetes data
set

i) Write a python program to implement Naive Bayes classification technique on Iris
data set

Program:

# load the iris dataset


from sklearn.datasets import load_iris
iris = load_iris()

# store the feature matrix (X) and response vector (y)


X = iris.data
y = iris.target

# splitting X and y into training and testing sets


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
random_state=1)

# training the model on training set


from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# making predictions on the testing set


y_pred = gnb.predict(X_test)

# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy(in %):",
metrics.accuracy_score(y_test, y_pred)*100)
Output

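ii) Write a python program to implement Naive Bayes classification technique on
diabetes data set
Program:
# load the diabetes dataset bundled with scikit-learn
# (note: its target is a continuous disease-progression score rather than a
# class label, so the classification accuracy reported below is not very meaningful)
from sklearn.datasets import load_diabetes
dia = load_diabetes()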

# store the feature matrix (X) and response vector (y)


X = dia.data
y = dia.target
# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
random_state=1)
# training the model on training set
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# making predictions on the testing set
y_pred = gnb.predict(X_test)
# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy(in %):",
metrics.accuracy_score(y_test, y_pred)*100)
Output

iii) Write a python program to implement SVM classification technique on Iris data
set

iris_uci_data_set.csv

Program:

# importing required libraries


import numpy as np
import pandas as pd

#Define the col names


colnames=["sepal_length_in_cm",
"sepal_width_in_cm","petal_length_in_cm","petal_width_in_cm", "class"]

#Read the dataset


dataset = pd.read_csv("iris_uci_data_set.csv", header = None, names= colnames )

#Data
dataset.head()

#Encoding the categorical column


dataset = dataset.replace({"class": {"Iris-setosa":1,"Iris-versicolor":2, "Iris-virginica":3}})
#Visualize the new dataset
dataset.head()

X = dataset.iloc[:,:-1]
y = dataset.iloc[:, -1].values

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

#Create the SVM model


from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
#Fit the model for the data

classifier.fit(X_train, y_train)

#Make the prediction


y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix


cm = confusion_matrix(y_test, y_pred)
print(cm)

from sklearn.model_selection import cross_val_score


accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
iv) Write a python program to implement SVM classification technique on diabetes
data set

dia_dataset_uci.csv

Program:

# importing required libraries


import numpy as np
import pandas as pd

#Read the dataset


dataset = pd.read_csv("dia_dataset_uci.csv")

#Data
print(dataset.head())

X = dataset.iloc[:,:-1]
y = dataset.iloc[:, -1].values

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

#Create the SVM model


from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
#Fit the model for the data

classifier.fit(X_train, y_train)

#Make the prediction


y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix


cm = confusion_matrix(y_test, y_pred)
print(cm)

from sklearn.model_selection import cross_val_score


accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
Output

4. Apply and explore various plotting functions on Pima Indian
diabetes data set

dia_dataset_prima_india.csv

Program:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes =True)

diab=pd.read_csv("d:\\dia_dataset_prima_india.csv")
print(diab.head())
## To check if data contains null values
print(diab.isnull().values.any())
print(diab.describe())
## Count the zero entries in each column
print((diab.Pregnancies==0).sum(), (diab.Glucose==0).sum(), (diab.BloodPressure==0).sum(),
      (diab.SkinThickness==0).sum(), (diab.Insulin==0).sum(), (diab.BMI==0).sum(),
      (diab.DiabetesPedigreeFunction==0).sum(), (diab.Age==0).sum())

## Creating a dataset called 'dia' from original dataset 'diab' which excludes all rows that
## have zeros for Glucose, BP, SkinThickness, Insulin or BMI, as other columns
## can contain zero values.
drop_Glu=diab.index[diab.Glucose == 0].tolist()
drop_BP=diab.index[diab.BloodPressure == 0].tolist()
drop_Skin = diab.index[diab.SkinThickness==0].tolist()
drop_Ins = diab.index[diab.Insulin==0].tolist()
drop_BMI = diab.index[diab.BMI==0].tolist()
c=drop_Glu+drop_BP+drop_Skin+drop_Ins+drop_BMI
dia=diab.drop(diab.index[c])

print(dia.info())

print(dia.describe())

dia1 = dia[dia.Outcome==1]
dia0 = dia[dia.Outcome==0]

print(dia1)

print(dia0)

## creating count plot with title using seaborn


sns.countplot(x=dia.Outcome)
plt.title("Count Plot for Outcome")
plt.show()

## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome
## and 3rd for representing same segmentation using boxplot
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.set_style("dark")
plt.title("Histogram for Pregnancies")
sns.distplot(dia.Pregnancies,kde=False)
plt.subplot(1,3,2)
sns.distplot(dia0.Pregnancies,kde=False,color="Blue", label="Preg for Outcome=0")
sns.distplot(dia1.Pregnancies,kde=False,color = "Gold", label = "Preg for Outcome=1")
plt.title("Histograms for Preg by Outcome")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome,y=dia.Pregnancies)
plt.title("Boxplot for Preg by Outcome")
plt.show()

## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome
## and 3rd for representing same segmentation using boxplot
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
plt.title("Histogram for Glucose")
sns.distplot(dia.Glucose, kde=False)
plt.subplot(1,3,2)
sns.distplot(dia0.Glucose,kde=False,color="Gold", label="Gluc for Outcome=0")
sns.distplot(dia1.Glucose, kde=False, color="Blue", label = "Gluc for Outcome=1")
plt.title("Histograms for Glucose by Outcome")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome,y=dia.Glucose)
plt.title("Boxplot for Glucose by Outcome")
plt.show()
## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome
## and 3rd for representing same segmentation using boxplot

plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.BloodPressure, kde=False)
plt.title("Histogram for Blood Pressure")
plt.subplot(1,3,2)
sns.distplot(dia0.BloodPressure,kde=False,color="Gold",label="BP for Outcome=0")
sns.distplot(dia1.BloodPressure,kde=False, color="Blue", label="BP for Outcome=1")
plt.legend()
plt.title("Histogram of Blood Pressure by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome,y=dia.BloodPressure)
plt.title("Boxplot of BP by Outcome")
plt.show()

## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome
## and 3rd for representing same segmentation using boxplot

plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.SkinThickness, kde=False)
plt.title("Histogram for Skin Thickness")
plt.subplot(1,3,2)
sns.distplot(dia0.SkinThickness, kde=False, color="Gold", label="SkinThick for Outcome=0")
sns.distplot(dia1.SkinThickness, kde=False, color="Blue", label="SkinThick for Outcome=1")
plt.legend()
plt.title("Histogram for SkinThickness by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome, y=dia.SkinThickness)
plt.title("Boxplot of SkinThickness by Outcome")
plt.show()

## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome
## and 3rd for representing same segmentation using boxplot

plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.Insulin,kde=False)
plt.title("Histogram of Insulin")
plt.subplot(1,3,2)
sns.distplot(dia0.Insulin,kde=False, color="Gold", label="Insulin for Outcome=0")
sns.distplot(dia1.Insulin,kde=False, color="Blue", label="Insulin for Outcome=1")
plt.title("Histogram for Insulin by Outcome")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome, y=dia.Insulin)
plt.title("Boxplot for Insulin by Outcome")
plt.show()

## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome
## and 3rd for representing same segmentation using boxplot

plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.BMI, kde=False)
plt.title("Histogram for BMI")
plt.subplot(1,3,2)
sns.distplot(dia0.BMI, kde=False,color="Gold", label="BMI for Outcome=0")
sns.distplot(dia1.BMI, kde=False, color="Blue", label="BMI for Outcome=1")
plt.legend()
plt.title("Histogram for BMI by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome, y=dia.BMI)
plt.title("Boxplot for BMI by Outcome")
plt.show()

## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome
## and 3rd for representing same segmentation using boxplot

plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.DiabetesPedigreeFunction,kde=False)
plt.title("Histogram for Diabetes Pedigree Function")
plt.subplot(1,3,2)
sns.distplot(dia0.DiabetesPedigreeFunction, kde=False, color="Gold",
label="PedFunction for Outcome=0")
sns.distplot(dia1.DiabetesPedigreeFunction, kde=False, color="Blue",
label="PedFunction for Outcome=1")
plt.legend()
plt.title("Histogram for DiabetesPedigreeFunction by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome, y=dia.DiabetesPedigreeFunction)
plt.title("Boxplot for DiabetesPedigreeFunction by Outcome")
plt.show()

## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome
## and 3rd for representing same segmentation using boxplot

plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.Age,kde=False)
plt.title("Histogram for Age")
plt.subplot(1,3,2)
sns.distplot(dia0.Age,kde=False,color="Gold", label="Age for Outcome=0")
sns.distplot(dia1.Age,kde=False, color="Blue", label="Age for Outcome=1")
plt.legend()
plt.title("Histogram for Age by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome,y=dia.Age)
plt.title("Boxplot for Age by Outcome")
plt.show()

## importing stats module from scipy


from scipy import stats
## retrieving p value from normality test function
PregnanciesPVAL=stats.normaltest(dia.Pregnancies).pvalue
GlucosePVAL=stats.normaltest(dia.Glucose).pvalue
BloodPressurePVAL=stats.normaltest(dia.BloodPressure).pvalue
SkinThicknessPVAL=stats.normaltest(dia.SkinThickness).pvalue
InsulinPVAL=stats.normaltest(dia.Insulin).pvalue
BMIPVAL=stats.normaltest(dia.BMI).pvalue
DiaPeFuPVAL=stats.normaltest(dia.DiabetesPedigreeFunction).pvalue
AgePVAL=stats.normaltest(dia.Age).pvalue
## Printing the values
print("Pregnancies P Value is " + str(PregnanciesPVAL))
print("Glucose P Value is " + str(GlucosePVAL))
print("BloodPressure P Value is " + str(BloodPressurePVAL))
print("Skin Thickness P Value is " + str(SkinThicknessPVAL))
print("Insulin P Value is " + str(InsulinPVAL))
print("BMI P Value is " + str(BMIPVAL))
print("Diabetes Pedigree Function P Value is " + str(DiaPeFuPVAL))
print("Age P Value is " + str(AgePVAL))

sns.pairplot(dia, vars=["Pregnancies",
"Glucose","BloodPressure","SkinThickness","Insulin",
"BMI","DiabetesPedigreeFunction", "Age"],hue="Outcome")
plt.title("Pairplot of Variables by Outcome")
plt.show()

cor = dia.corr(method ='pearson')


print(cor)

sns.heatmap(cor)
plt.show()

Output
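
The program above removes rows with zero readings by collecting index lists column by column. An equivalent and more compact option is a single boolean mask. The sketch below shows this on a tiny made-up frame that reuses some of the column names; the data values and the list name zero_means_missing are assumptions.

```python
import pandas as pd

# Tiny made-up frame using column names from the program above.
diab = pd.DataFrame(
    {
        "Pregnancies": [0, 1, 3, 2],
        "Glucose": [148, 0, 120, 99],
        "BloodPressure": [72, 66, 0, 70],
        "Insulin": [94, 130, 88, 120],
        "Outcome": [1, 0, 1, 0],
    }
)

# Columns in which a zero is taken to mean "not measured".
zero_means_missing = ["Glucose", "BloodPressure", "Insulin"]

# True for every row that has a zero in at least one of those columns.
has_zero = (diab[zero_means_missing] == 0).any(axis=1)

dia = diab[~has_zero]
print(f"dropped {has_zero.sum()} of {len(diab)} rows")
print(dia)
```

Zeros in Pregnancies and Outcome are legitimate values, so those columns are deliberately left out of the mask.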
