Data science Laboratory
i) Installation of the Python data analytics tools NumPy, pandas, Matplotlib, and seaborn.
ii) Exploring the features of the Python data analytics tools NumPy
Exercise 1: Write a python program to create an array using NumPy tool and print the
array values
Program:
import numpy as np
arr = np.array([1,2,3,4,5])
print(arr)
Output
Exercise 2: Write a python program to create a 1-D array consisting of 50 values, where
each value has to be only 2, 4, 6, 8 or 9
Program:
from numpy import random as rd
x = rd.choice([2,4,6,8,9], p=[0.2,0.3,0.3,0.1,0.1], size=(50))
print(x)
Output
Exercise 3: Write a python program to create a 2-D array with 4 rows and 4 columns,
where each value has to be only 2, 4, 6, 8 or 9
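No program is listed for Exercise 3. One possible sketch, assuming equal probability for each allowed value and an arbitrary seed (neither is specified by the exercise), is:

```python
import numpy as np

# Allowed values taken from the exercise statement
allowed = [2, 4, 6, 8, 9]

# Seeded generator so the run is repeatable (the seed value is an arbitrary choice)
rng = np.random.default_rng(0)

# Draw a 4 x 4 array; each cell is picked uniformly from the allowed values
grid = rng.choice(allowed, size=(4, 4))
print(grid)
```

Passing a p= list, as in Exercise 2, would weight the values unequally instead.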
iii) Exploring the features of the Python data analytics tools Seaborn and MatPlotlib
Exercise 4: Write a python program which takes the distribution of points as an array
input and plots a curve corresponding to the distribution of points
Program:
import matplotlib.pyplot as plt
import seaborn as sns
sns.distplot([0, 1, 2, 3, 4, 5])
plt.show()
Output
Exercise 5: Write a python program which takes the distribution of points generated in
the Exercise 2 as an array input and plots a curve corresponding to the distribution of
points
Program:
import matplotlib.pyplot as plt
import seaborn as sns
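from numpy import random as rd
# regenerate the Exercise 2 array so that x is defined in this program
x = rd.choice([2,4,6,8,9], p=[0.2,0.3,0.3,0.1,0.1], size=(50))
sns.distplot(x)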
plt.show()
Output
Program:
from numpy import random
import matplotlib.pyplot as plt
import seaborn as sns
sns.distplot(random.normal(size=1000), hist=False)
plt.show()
Output
plt.show()
Output
plt.show()
Output
Exercise 10: Write a python program to generate an array containing random Uniform
distribution points and plot a curve corresponding to the random Uniform distribution
points
Program:
from numpy import random
import matplotlib.pyplot as plt
import seaborn as sns
sns.distplot(random.uniform(size=1000), hist=False)
plt.show()
Output
Exercise 11: Write a python program to generate an array containing random Logistic
distribution points and plot a curve corresponding to the random Logistic distribution
points
Program:
from numpy import random
import matplotlib.pyplot as plt
import seaborn as sns
sns.distplot(random.logistic(size=1000), hist=False)
plt.show()
Output
Exercise 12: Write a python program to generate an array containing random Chi
Square distribution points and plot a curve corresponding to the random Chi Square
distribution points
Program:
from numpy import random
import matplotlib.pyplot as plt
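import seaborn as sns
# df (degrees of freedom) is required by chisquare; df=1 is an assumed value
sns.distplot(random.chisquare(df=1, size=1000), hist=False)
plt.show()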
Output
iv) Exploring the features of the Python data analytics tools pandas
Exercise 13: Write a python program to read the attached csv data into a data frame
using pandas python tool
data.csv
Program:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
Output
Exercise 14: Write a python program to read the attached csv data into a data frame
using pandas python tool and print the first 10 rows
data.csv
Program:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))
Output
Exercise 15: Write a python program to read the attached csv data into a data frame
using pandas python tool and print the last 10 rows
data.csv
Program:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.tail(10))
Output
Exercise 16: Write a python program to read the attached csv data into a data frame
using pandas python tool and remove all rows with NULL values
data_not_proper.csv
Program:
import pandas as pd
df = pd.read_csv('data_not_proper.csv')
df.dropna(inplace = True)
print(df.to_string())
Output
Exercise 17: Write a python program to read the attached csv data into a data frame
using pandas python tool and convert the wrongly formatted data in rows 22 and 26
into the correct format
data_not_proper.csv
Program:
import pandas as pd
df = pd.read_csv('data_not_proper.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())
Output
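Depending on the installed pandas version, a single pd.to_datetime call over a column that mixes date layouts may raise an error instead of converting every row. A minimal sketch of one version-tolerant alternative is shown here; the three-row column is made up for illustration and is not the content of data_not_proper.csv:

```python
import numpy as np
import pandas as pd

# Made-up column mixing two date layouts and one missing value
df = pd.DataFrame({"Date": ["2020/12/01", "20201226", np.nan]})

# Convert each entry on its own so every value is parsed independently;
# the missing value becomes NaT (pandas' "not a time" marker)
df["Date"] = df["Date"].apply(pd.to_datetime)
print(df)
```

Rows left as NaT can then be dropped with df.dropna(subset=['Date']) if required.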
Exercise 18: Write a python program to read the attached csv data into a data frame
using pandas python tool and remove duplicate records.
data_not_proper.csv
Program:
import pandas as pd
df = pd.read_csv('data_not_proper.csv')
df.drop_duplicates(inplace = True)
print(df.to_string())
Output
Exercise 19: Write a python program to read the attached csv data into a data frame
using pandas python tool and determine the correlations between columns in the csv
file.
data.csv
Program:
import pandas as pd
df = pd.read_csv('data.csv')
x = df.corr()
print(x)
Output
2. Use the diabetes data set from UCI and Pima Indians for
performing the following:
a)
i) Univariate Analysis: Frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis using UCI diabetes data set
dia_dataset_uci.csv
Program:
import pandas as pd
df = pd.read_csv('dia_dataset_uci.csv')
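# frequency of each value in the 'code' column (the column choice is an assumption)
freq = df['code'].value_counts()
print(freq)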
Output
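The listed program only prints a frequency table. The remaining univariate measures named in the task can be sketched with pandas as follows; the sample values are made up for illustration and stand in for one numeric column of the data set:

```python
import pandas as pd

# Made-up sample standing in for one numeric column of the data set
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9], name="sample")

freq = values.value_counts()      # frequency of each distinct value
mean = values.mean()              # arithmetic mean
median = values.median()          # middle value of the sorted data
mode = values.mode().tolist()     # most frequent value(s)
variance = values.var()           # sample variance (divides by n - 1)
std_dev = values.std()            # sample standard deviation
skewness = values.skew()          # asymmetry of the distribution
kurtosis = values.kurt()          # excess kurtosis (tail weight)

print(freq)
print("Mean:", mean, "Median:", median, "Mode:", mode)
print("Variance:", variance, "Std Dev:", std_dev)
print("Skewness:", skewness, "Kurtosis:", kurtosis)
```

With a real file, values would be replaced by a column such as df['code'].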
ii) Univariate Analysis: Frequency, Mean, Median, Mode, Variance, Standard
Deviation, Skewness and Kurtosis using Pima Indians diabetes data set
dia_dataset_prima_india.csv
Program:
import pandas as pd
df = pd.read_csv('dia_dataset_prima_india.csv')
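# frequency of each value in the 'pregency_no_of_times' column (the column choice is an assumption)
freq = df['pregency_no_of_times'].value_counts()
print(freq)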
Output
b)
i) Bivariate Analysis: Linear regression modelling using UCI diabetes data set
dia_dataset_uci.csv
Program:
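import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# assumed helper (not in the original listing): least-squares estimates for y = b_0 + b_1 * x
def estimate_coef(x, y):
    n = np.size(x)
    m_x, m_y = np.mean(x), np.mean(y)
    # cross-deviation of x and y, and deviation of x about its mean
    ss_xy = np.sum(y * x) - n * m_y * m_x
    ss_xx = np.sum(x * x) - n * m_x * m_x
    b_1 = ss_xy / ss_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)
def main():
    df = pd.read_csv('dia_dataset_uci.csv')
    # observations / data
    x = np.array(df['code'])
    y = np.array(df['insulin_value'])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting the observations and the fitted line
    plt.scatter(x, y)
    plt.plot(x, b[0] + b[1] * x)
    # putting labels
    plt.xlabel('code')
    plt.ylabel('insulin_value')
    plt.show()
if __name__ == "__main__":
    main()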
Output
ii) Bivariate Analysis: Linear regression modelling using Pima Indians diabetes
data set
dia_dataset_prima_india.csv
Program:
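import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# assumed helper (not in the original listing): least-squares estimates for y = b_0 + b_1 * x
def estimate_coef(x, y):
    n = np.size(x)
    m_x, m_y = np.mean(x), np.mean(y)
    # cross-deviation of x and y, and deviation of x about its mean
    ss_xy = np.sum(y * x) - n * m_y * m_x
    ss_xx = np.sum(x * x) - n * m_x * m_x
    b_1 = ss_xy / ss_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)
def main():
    df = pd.read_csv('dia_dataset_prima_india.csv')
    # observations / data
    x = np.array(df['pregency_no_of_times'])
    y = np.array(df['ser_insulin'])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting the observations and the fitted line
    plt.scatter(x, y)
    plt.plot(x, b[0] + b[1] * x)
    # putting labels
    plt.xlabel('pregency_no_of_times')
    plt.ylabel('ser_insulin')
    plt.show()
if __name__ == "__main__":
    main()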
Output
iii) Bivariate Analysis: Logistic regression modelling using UCI diabetes data set
dia_dataset_uci.csv
Program:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
df = pd.read_csv('dia_dataset_uci.csv')
# input
x = df.iloc[:, [2, 3]].values
# output
y = df.iloc[:, 4].values
# Training Data Set
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
sc_x = StandardScaler()
xtrain = sc_x.fit_transform(X_train)
xtest = sc_x.transform(X_test)
print (xtrain[0:10, :])
classifier = LogisticRegression(random_state = 0)
classifier.fit(xtrain, y_train)
y_pred = classifier.predict(xtest)
cm = confusion_matrix(y_test, y_pred)
print ("Confusion Matrix : \n", cm)
print ("Accuracy : ", accuracy_score(y_test, y_pred))
Output
iv) Bivariate Analysis: Logistic regression modelling using Pima Indians diabetes
data set
dia_dataset_prima_india.csv
Program:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
df = pd.read_csv('dia_dataset_prima_india.csv')
# input
x = df.iloc[:, [0, 4]].values
# output
y = df.iloc[:, 8].values
# Training Data Set
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
sc_x = StandardScaler()
xtrain = sc_x.fit_transform(X_train)
xtest = sc_x.transform(X_test)
print (xtrain[0:10, :])
classifier = LogisticRegression(random_state = 0)
classifier.fit(xtrain, y_train)
y_pred = classifier.predict(xtest)
cm = confusion_matrix(y_test, y_pred)
print ("Confusion Matrix : \n", cm)
print ("Accuracy : ", accuracy_score(y_test, y_pred))
Output
c)
i) Multiple Regression Analysis: Multiple Regression Analysis using UCI diabetes
data set
dia_dataset_uci.csv
Program:
import numpy as np
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('dia_dataset_uci.csv')
# observations / data
x = np.array(df['code'])
y = np.array(df['insulin_value'])
mpl.rcParams['legend.fontsize'] = 12
fig = plt.figure()
ax = fig.add_subplot(projection ='3d')
plt.show()
Output
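The listing above only prepares a 3-D figure and does not fit a model. A minimal sketch of a multiple regression fit with two predictors, using NumPy least squares, is given here. The data are made up and generated exactly from z = 1 + 2x + 3y, so the recovered coefficients can be checked by eye; the variable names are illustrative and not taken from the data set.

```python
import numpy as np

# Made-up predictors and a response generated exactly from z = 1 + 2*x + 3*y
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
z = 1.0 + 2.0 * x + 3.0 * y

# Design matrix: a column of ones for the intercept, then one column per predictor
A = np.column_stack([np.ones_like(x), x, y])

# Least-squares solution of A @ coef = z
coef, *_ = np.linalg.lstsq(A, z, rcond=None)
b0, b1, b2 = coef
print("Intercept:", b0)
print("Coefficient of x:", b1)
print("Coefficient of y:", b2)
```

For a real data set, x, y and z would be replaced by columns of the data frame, for example np.array(df['code']).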
ii) Multiple Regression Analysis: Multiple Regression Analysis using Pima Indian
diabetes data set
dia_dataset_prima_india.csv
Program:
import numpy as np
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('dia_dataset_prima_india.csv')
# observations / data
x = np.array(df['pregency_no_of_times'])
y = np.array(df['ser_insulin'])
z = np.array(df['age'])
mpl.rcParams['legend.fontsize'] = 12
fig = plt.figure()
ax = fig.add_subplot(projection ='3d')
ax.scatter(x, y, z)  # plot the three variables as points in 3-D
plt.show()
Output
d)
i) Write a python program to compare the analysis results of UCI and Pima Indian
diabetes data set
dia_dataset_uci.csv dia_dataset_prima_india.csv
Program:
import pandas as pd
dfu = pd.read_csv('dia_dataset_uci.csv')
dfp = pd.read_csv('dia_dataset_prima_india.csv')
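The listing stops after the two files are read. One simple way to compare the two data sets, sketched under the assumption that the insulin columns are the ones of interest, is to place their summary statistics side by side. The small data frames below are made up and stand in for the two CSV files.

```python
import pandas as pd

# Made-up stand-ins for the two data sets (values are illustrative only)
dfu = pd.DataFrame({"insulin_value": [100, 119, 123, 216]})
dfp = pd.DataFrame({"ser_insulin": [0, 94, 168, 88]})

# describe() returns count, mean, std, min, quartiles and max for a column;
# concat with a dict puts one summary per data set into labelled columns
summary = pd.concat(
    {"UCI": dfu["insulin_value"].describe(),
     "Pima": dfp["ser_insulin"].describe()},
    axis=1,
)
print(summary)
```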
Program:
iii) Write a python program to implement SVM classification technique on Iris data
set
iris_uci_data_set.csv
Program:
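import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
dataset = pd.read_csv('iris_uci_data_set.csv')  # Data
print(dataset.head())
X = dataset.iloc[:,:-1]
y = dataset.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)  # assumed: default 75/25 split
classifier = SVC()  # assumed: default SVC settings
classifier.fit(X_train, y_train)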
dia_dataset_uci.csv
Program:
#Data
print(dataset.head())
X = dataset.iloc[:,:-1]
y = dataset.iloc[:, -1].values
classifier.fit(X_train, y_train)
dia_dataset_prima_india.csv
Program:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes =True)
diab=pd.read_csv("d:\\dia_dataset_prima_india.csv")
print(diab.head())
print(diab.isnull().values.any())
## To check if data contains null values
print(diab.describe())
## Counting the zero values in each column
print((diab.Pregnancies == 0).sum(), (diab.Glucose == 0).sum(),
      (diab.BloodPressure == 0).sum(), (diab.SkinThickness == 0).sum(),
      (diab.Insulin == 0).sum(), (diab.BMI == 0).sum(),
      (diab.DiabetesPedigreeFunction == 0).sum(), (diab.Age == 0).sum())
## Creating a dataset called 'dia' from the original dataset 'diab', which excludes all rows
## that have zeros for Glucose, BP, SkinThickness, Insulin or BMI, as the other columns
## can legitimately contain zero values.
drop_Glu=diab.index[diab.Glucose == 0].tolist()
drop_BP=diab.index[diab.BloodPressure == 0].tolist()
drop_Skin = diab.index[diab.SkinThickness==0].tolist()
drop_Ins = diab.index[diab.Insulin==0].tolist()
drop_BMI = diab.index[diab.BMI==0].tolist()
c=drop_Glu+drop_BP+drop_Skin+drop_Ins+drop_BMI
dia=diab.drop(diab.index[c])
print(dia.info())
print(dia.describe())
dia1 = dia[dia.Outcome==1]
dia0 = dia[dia.Outcome==0]
print(dia1)
print(dia0)
## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome
## and 3rd for representing same segmentation using boxplot
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.set_style("dark")
plt.title("Histogram for Pregnancies")
sns.distplot(dia.Pregnancies,kde=False)
plt.subplot(1,3,2)
sns.distplot(dia0.Pregnancies,kde=False,color="Blue", label="Preg for Outcome=0")
sns.distplot(dia1.Pregnancies,kde=False,color = "Gold", label = "Preg for Outcome=1")
plt.title("Histograms for Preg by Outcome")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome,y=dia.Pregnancies)
plt.title("Boxplot for Preg by Outcome")
plt.show()
## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome
## and 3rd for representing same segmentation using boxplot
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
plt.title("Histogram for Glucose")
sns.distplot(dia.Glucose, kde=False)
plt.subplot(1,3,2)
sns.distplot(dia0.Glucose,kde=False,color="Gold", label="Gluc for Outcome=0")
sns.distplot(dia1.Glucose, kde=False, color="Blue", label = "Gluc for Outcome=1")
plt.title("Histograms for Glucose by Outcome")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome,y=dia.Glucose)
plt.title("Boxplot for Glucose by Outcome")
plt.show()
## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome
## and 3rd for representing same segmentation using boxplot
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.BloodPressure, kde=False)
plt.title("Histogram for Blood Pressure")
plt.subplot(1,3,2)
sns.distplot(dia0.BloodPressure,kde=False,color="Gold",label="BP for Outcome=0")
sns.distplot(dia1.BloodPressure,kde=False, color="Blue", label="BP for Outcome=1")
plt.legend()
plt.title("Histogram of Blood Pressure by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome,y=dia.BloodPressure)
plt.title("Boxplot of BP by Outcome")
plt.show()
## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome
## and 3rd for representing same segmentation using boxplot
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.SkinThickness, kde=False)
plt.title("Histogram for Skin Thickness")
plt.subplot(1,3,2)
sns.distplot(dia0.SkinThickness, kde=False, color="Gold", label="SkinThick for Outcome=0")
sns.distplot(dia1.SkinThickness, kde=False, color="Blue", label="SkinThick for Outcome=1")
plt.legend()
plt.title("Histogram for SkinThickness by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome, y=dia.SkinThickness)
plt.title("Boxplot of SkinThickness by Outcome")
plt.show()
## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome
## and 3rd for representing same segmentation using boxplot
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.Insulin,kde=False)
plt.title("Histogram of Insulin")
plt.subplot(1,3,2)
sns.distplot(dia0.Insulin,kde=False, color="Gold", label="Insulin for Outcome=0")
sns.distplot(dia1.Insulin,kde=False, color="Blue", label="Insulin for Outcome=1")
plt.title("Histogram for Insulin by Outcome")
plt.legend()
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome, y=dia.Insulin)
plt.title("Boxplot for Insulin by Outcome")
plt.show()
## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome
## and 3rd for representing same segmentation using boxplot
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.BMI, kde=False)
plt.title("Histogram for BMI")
plt.subplot(1,3,2)
sns.distplot(dia0.BMI, kde=False,color="Gold", label="BMI for Outcome=0")
sns.distplot(dia1.BMI, kde=False, color="Blue", label="BMI for Outcome=1")
plt.legend()
plt.title("Histogram for BMI by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome, y=dia.BMI)
plt.title("Boxplot for BMI by Outcome")
plt.show()
## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome
## and 3rd for representing same segmentation using boxplot
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.DiabetesPedigreeFunction,kde=False)
plt.title("Histogram for Diabetes Pedigree Function")
plt.subplot(1,3,2)
sns.distplot(dia0.DiabetesPedigreeFunction, kde=False, color="Gold",
label="PedFunction for Outcome=0")
sns.distplot(dia1.DiabetesPedigreeFunction, kde=False, color="Blue",
label="PedFunction for Outcome=1")
plt.legend()
plt.title("Histogram for DiabetesPedigreeFunction by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome, y=dia.DiabetesPedigreeFunction)
plt.title("Boxplot for DiabetesPedigreeFunction by Outcome")
plt.show()
## Creating 3 subplots - 1st for histogram, 2nd for histogram segmented by Outcome
## and 3rd for representing same segmentation using boxplot
plt.figure(figsize=(20, 6))
plt.subplot(1,3,1)
sns.distplot(dia.Age,kde=False)
plt.title("Histogram for Age")
plt.subplot(1,3,2)
sns.distplot(dia0.Age,kde=False,color="Gold", label="Age for Outcome=0")
sns.distplot(dia1.Age,kde=False, color="Blue", label="Age for Outcome=1")
plt.legend()
plt.title("Histogram for Age by Outcome")
plt.subplot(1,3,3)
sns.boxplot(x=dia.Outcome,y=dia.Age)
plt.title("Boxplot for Age by Outcome")
plt.show()
sns.pairplot(dia, vars=["Pregnancies",
"Glucose","BloodPressure","SkinThickness","Insulin",
"BMI","DiabetesPedigreeFunction", "Age"],hue="Outcome")
plt.title("Pairplot of Variables by Outcome")
plt.show()
cor = dia.corr()  # correlation matrix of the cleaned data
sns.heatmap(cor)
plt.show()
Output