

AD3411 - DATA SCIENCE AND ANALYTICS LABORATORY

LIST OF EXPERIMENTS

Tools: Python, NumPy, SciPy, Matplotlib, Pandas, Statsmodels, Seaborn, Plotly, Bokeh; working with NumPy arrays

1. Working with Pandas data frames
2. Basic plots using Matplotlib
3. Frequency distributions, averages, variability
4. Normal curves, correlation and scatter plots, correlation coefficient
5. Regression
6. Z-test
7. T-test
8. ANOVA
9. Building and validating linear models
10. Building and validating logistic models
11. Time series analysis



Experiment No: 1

WORKING WITH PANDAS DATA FRAMES

Program:
import pandas as pd
data = {"calories": [420, 380, 390], "duration": [50, 40,
45]}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print (df.loc[0])

Output:
calories 420
duration 50
Name: 0, dtype: int64
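
As a small extension to the listing above (not part of the original program), the same DataFrame can be given custom row labels and inspected as a whole; this is a minimal sketch using only standard pandas calls.

import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
# give the rows custom labels instead of the default 0, 1, 2 index
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

print(df)              # the full table with the custom row labels
print(df.describe())   # summary statistics for each numeric column
print(df.loc["day2"])  # a single row selected by its label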



Experiment No: 2

BASIC PLOTS USING MATPLOTLIB

Program:
import matplotlib.pyplot as plt
a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
plt.plot(a)
# o is for circles and r is
# for red
plt.plot(b, "or")
plt.plot(list(range(0, 22, 3)))
# naming the x-axis
plt.xlabel('Day ->')
# naming the y-axis
plt.ylabel('Temp ->')
c = [4, 2, 6, 8, 3, 20, 13, 15]
plt.plot(c, label = '4th Rep')
# get current axes command
ax = plt.gca()
# get command over the individual
# boundary line of the graph body
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
# set the range or the bounds of
# the left boundary line to fixed range
ax.spines['left'].set_bounds(-3, 40)
# set the interval by which
# the x-axis set the marks
plt.xticks(list(range(-3, 10)))
# set the intervals by which
# the y-axis sets the marks
plt.yticks(list(range(-3, 20, 3)))
# legend denotes that what color
# signifies what
ax.legend(['1st Rep', '2nd Rep', '3rd Rep', '4th Rep'])
# annotate command helps to write
# ON THE GRAPH any text xy denotes
# the position on the graph
plt.annotate('Temperature vs Days', xy=(1.01, -2.15))
# gives a title to the Graph
plt.title('All Features Discussed')
plt.show()

Output:


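As a short companion sketch (an addition, not part of the original program), the other basic Matplotlib plot types can be drawn side by side with subplots; the sample data below is made up for illustration.

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 7, 5]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(x, y)         # bar chart
axes[0].set_title('Bar')
axes[1].scatter(x, y)     # scatter plot
axes[1].set_title('Scatter')
axes[2].hist(y, bins=5)   # histogram
axes[2].set_title('Histogram')
plt.tight_layout()
plt.show()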

Experiment No: 3

FREQUENCY DISTRIBUTIONS, AVERAGES, VARIABILITY

Program:
# Python program to get average of a list
# Importing the NumPy module
import numpy as np
# Taking a list of elements
list = [2, 40, 2, 502, 177, 7, 9]
# Calculating average using average()
print(np.average(list))
Output:
105.57142857142857
# Python program to get variance of a list
# Importing the NumPy module
import numpy as np
# Taking a list of elements
list = [2, 4, 4, 4, 5, 5, 7, 9]
# Calculating variance using var()
print(np.var(list))
Output:
4.0
# Python program to get standard deviation of a list
# Importing the NumPy module
import numpy as np
# Taking a list of elements
list = [290, 124, 127, 899]
# Calculating standard deviation using std()
print(np.std(list))

Output:
318.35750344541907
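
The programs above cover averages and variability; for the frequency-distribution part of this experiment, a minimal added sketch using collections.Counter (with made-up marks data) is:

# Frequency distribution of a list of values
from collections import Counter

marks = [60, 75, 60, 90, 75, 60, 85, 90, 75, 60]
freq = Counter(marks)

# print each distinct value and how often it occurs
for value, count in sorted(freq.items()):
    print(value, "occurs", count, "times")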


Experiment No: 4

NORMAL CURVES, CORRELATION AND SCATTER PLOTS, CORRELATION COEFFICIENT

Program:
#Normal curves
import matplotlib.pyplot as plt
import numpy as np
mu, sigma = 0.5, 0.1
s = np.random.normal(mu, sigma, 1000)
# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, density=True)
plt.show()

Output:

# Correlation and scatter plots
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

y = pd.Series([1, 2, 3, 4, 3, 5, 4])
x = pd.Series([1, 2, 3, 4, 5, 6, 7])

# scatter plot of the two series
plt.scatter(x, y)
plt.show()

# Pearson correlation between the two series
correlation = y.corr(x)
print(correlation)

Output:
0.8603090020146067
# Correlation coefficient
import math

# function that returns the correlation coefficient
def correlationCoefficient(X, Y, n):
    sum_X = 0
    sum_Y = 0
    sum_XY = 0
    squareSum_X = 0
    squareSum_Y = 0
    i = 0
    while i < n:
        # sum of elements of array X
        sum_X = sum_X + X[i]
        # sum of elements of array Y
        sum_Y = sum_Y + Y[i]
        # sum of X[i] * Y[i]
        sum_XY = sum_XY + X[i] * Y[i]
        # sum of squares of array elements
        squareSum_X = squareSum_X + X[i] * X[i]
        squareSum_Y = squareSum_Y + Y[i] * Y[i]
        i = i + 1
    # use the formula for the correlation coefficient
    corr = (n * sum_XY - sum_X * sum_Y) / \
           math.sqrt((n * squareSum_X - sum_X * sum_X) *
                     (n * squareSum_Y - sum_Y * sum_Y))
    return corr

# Driver code
X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]

# find the size of the arrays
n = len(X)

# function call to correlationCoefficient
print('{0:.6f}'.format(correlationCoefficient(X, Y, n)))

Output:
0.953463
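
As a cross-check (an addition, not part of the original program), the same coefficient can be obtained with NumPy's built-in corrcoef; it should agree with the hand-computed value above.

import numpy as np

X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation coefficient between X and Y
r = np.corrcoef(X, Y)[0, 1]
print('{0:.6f}'.format(r))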


Experiment No: 5

REGRESSION

Program:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Output:

Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437
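
For validation (an added sketch, not part of the original program), the closed-form estimates can be cross-checked against scipy.stats.linregress on the same data.

import numpy as np
from scipy import stats

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# linregress returns the slope, intercept, r-value, p-value and standard error
result = stats.linregress(x, y)
print("intercept (b_0) =", result.intercept)
print("slope     (b_1) =", result.slope)
print("r-squared       =", result.rvalue ** 2)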



Experiment No: 6

Z-TEST

Program:
# imports
import math
import numpy as np
from numpy.random import randn
from statsmodels.stats.weightstats import ztest

# Generate a random array of 50 numbers with mean 110; the spread used
# here is 15/sqrt(50), the standard error of IQ scores (sd 15) for n = 50
mean_iq = 110
sd_iq = 15/math.sqrt(50)
alpha = 0.05
null_mean = 100
data = sd_iq*randn(50) + mean_iq

# print mean and standard deviation
print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))

# Now we perform the test. We pass the data as the first argument,
# the mean under the null hypothesis in the value parameter, and as
# the alternative hypothesis we check whether the mean is larger.
ztest_Score, p_value = ztest(data, value=null_mean, alternative='larger')

# The function returns a z-score and the corresponding p-value.
# We compare the p-value with alpha: if it is greater than alpha we
# fail to reject the null hypothesis, otherwise we reject it.
if p_value < alpha:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")

Output:
Reject Null Hypothesis
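
To make the test statistic explicit (an added sketch, not part of the original program), the one-sample z-score can also be computed by hand as z = (sample mean - hypothesised mean) / (sample sd / sqrt(n)); the sample is regenerated with a fixed seed so the sketch is self-contained.

import math
import numpy as np
from scipy import stats

np.random.seed(0)                                 # fixed seed for reproducibility
data = (15 / math.sqrt(50)) * np.random.randn(50) + 110

null_mean = 100
n = len(data)
sample_mean = data.mean()
sample_sd = data.std(ddof=1)

# one-sample z statistic
z = (sample_mean - null_mean) / (sample_sd / math.sqrt(n))
# one-sided p-value for the "larger" alternative
p = 1 - stats.norm.cdf(z)
print("z =", z, "p =", p)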



Experiment No: 7

T-TEST

Program:
# Importing the required libraries and packages
import numpy as np
from scipy import stats
# Defining two random distributions
# Sample Size
N = 10
# Gaussian distributed data with mean = 2 and var = 1
x = np.random.randn(N) + 2
# Gaussian distributed data with mean = 0 and var = 1
y = np.random.randn(N)
# Calculating the Standard Deviation
# Calculating the variance to get the standard deviation
var_x = x.var(ddof = 1)
var_y = y.var(ddof = 1)
# Standard Deviation
SD = np.sqrt((var_x + var_y) / 2)
print("Standard Deviation =", SD)
# Calculating the T-Statistics
tval = (x.mean() - y.mean()) / (SD * np.sqrt(2 / N))
# Comparing with the critical T-Value
# Degrees of freedom
dof = 2 * N - 2
# p-value after comparison with the T-Statistics
pval = 1 - stats.t.cdf( tval, df = dof)
print("t = " + str(tval))
print("p = " + str(2 * pval))
## Cross checking using the built-in function from the SciPy package
tval2, pval2 = stats.ttest_ind(x, y)
print("t = " + str(tval2))
print("p = " + str(pval2))

Output:
Standard Deviation = 0.7642398582227466
t = 4.87688162540348
p = 0.0001212767169695983
t = 4.876881625403479
p = 0.00012127671696957205
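
When the two samples cannot be assumed to have equal variances, Welch's t-test is the safer choice; SciPy supports it through equal_var=False (an added sketch, not part of the original program).

import numpy as np
from scipy import stats

np.random.seed(1)                 # fixed seed for reproducibility
x = np.random.randn(10) + 2       # sample with mean around 2
y = np.random.randn(10)           # sample with mean around 0

# Welch's t-test does not pool the variances of the two samples
tval, pval = stats.ttest_ind(x, y, equal_var=False)
print("t =", tval)
print("p =", pval)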



Experiment No: 8

ANOVA

Program:
# Installing the package
install.packages("dplyr")
# Loading the package
library(dplyr)
# Variance in mean within group and between group
boxplot(mtcars$disp~factor(mtcars$gear),
xlab = "gear", ylab = "disp")
# Step 1: Setup Null Hypothesis and Alternate Hypothesis
# H0 = mu = mu01 = mu02 (There is no difference
# between average displacement for different gear)
# H1 = Not all means are equal
# Step 2: Calculate test statistics using aov function
mtcars_aov <- aov(mtcars$disp~factor(mtcars$gear))
summary(mtcars_aov)
# Step 3: Calculate the F-critical value
# For a 0.05 significance level, alpha = 0.05
# Step 4: Compare the test statistic with the F-critical value
# and conclude: if p < alpha, reject the null hypothesis

Output:

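Since the rest of this manual uses Python, an equivalent one-way ANOVA can be run with scipy.stats.f_oneway; this added sketch uses made-up displacement samples standing in for the three mtcars gear groups.

from scipy import stats

# hypothetical displacement values for three gear groups
gear3 = [472, 460, 440, 318, 304, 350, 400, 258]
gear4 = [160, 140, 168, 121, 147, 108, 146, 95]
gear5 = [120, 95, 351, 145, 301]

# one-way ANOVA: are the group means equal?
f_stat, p_value = stats.f_oneway(gear3, gear4, gear5)
print("F =", f_stat, "p =", p_value)

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: not all group means are equal")
else:
    print("Fail to reject the null hypothesis")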


Experiment No: 9

BUILDING AND VALIDATING LINEAR MODELS

Program
# Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston
sns.set(style="ticks", color_codes=True)
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['figure.dpi'] = 150
# loading the data
boston = load_boston()
You can check those keys with the following code.
print(boston.keys())
The output will be as follows:
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
print(boston.DESCR)

You will find these details in the output:


Attribute Information (in order):
- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- RAD      index of accessibility to radial highways
- TAX      full-value property-tax rate per $10,000
- PTRATIO  pupil-teacher ratio by town
- B        1000(Bk - 0.63)² where Bk is the proportion of blacks by town
- LSTAT    % lower status of the population
- MEDV     Median value of owner-occupied homes in $1000's

Missing Attribute Values: None
df = pd.DataFrame(boston.data, columns=boston.feature_names)
# add the target (MEDV, the median home value) as a column so it can be plotted below
df['MEDV'] = boston.target
df.head()
# print the columns present in the dataset
print(df.columns)
# print the top 5 rows in the dataset
print(df.head())

First five records from the data set


# plotting a heatmap for the overall data set
sns.heatmap(df.corr(), square=True, cmap='RdYlGn')

Heat map of the overall data set


So let's plot a regression plot to see the correlation between RM and MEDV.
sns.lmplot(x='RM', y='MEDV', data=df)

Regression plot with RM and MEDV
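
The listing above stops at exploratory plots. A minimal sketch of actually building and validating the linear model (an addition, assuming the df built above with the MEDV target column) could look like this:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# use the average number of rooms (RM) to predict the median home value (MEDV)
X = df[['RM']]
y = df['MEDV']

# hold out 20% of the data for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# validate on the held-out test set
y_pred = model.predict(X_test)
print("R^2 =", r2_score(y_test, y_pred))
print("MSE =", mean_squared_error(y_test, y_pred))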



Experiment No: 10

BUILDING AND VALIDATING LOGISTIC MODELS

Program

Building the Logistic Regression model:


# importing libraries
import statsmodels.api as sm
import pandas as pd
# loading the training dataset
df = pd.read_csv('logit_train1.csv', index_col = 0)
# defining the dependent and independent variables
Xtrain = df[['gmat', 'gpa', 'work_experience']]
ytrain = df[['admitted']]
# building the model and fitting the data
log_reg = sm.Logit(ytrain, Xtrain).fit()

Output :
Optimization terminated successfully.
Current function value: 0.352707
Iterations 8
# printing the summary table
print(log_reg.summary())

Output :
                           Logit Regression Results
==============================================================================
Dep. Variable:               admitted   No. Observations:                   30
Model:                          Logit   Df Residuals:                       27
Method:                           MLE   Df Model:                            2
Date:                Wed, 15 Jul 2020   Pseudo R-squ.:                  0.4912
Time:                        16:09:17   Log-Likelihood:                -10.581
converged:                       True   LL-Null:                       -20.794
Covariance Type:            nonrobust   LLR p-value:                 3.668e-05
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
gmat               -0.0262      0.011     -2.383      0.017      -0.048      -0.005
gpa                 3.9422      1.964      2.007      0.045       0.092       7.792
work_experience     1.1983      0.482      2.487      0.013       0.254       2.143

Predicting on New Data :

# loading the testing dataset


df = pd.read_csv('logit_test1.csv', index_col = 0)
# defining the dependent and independent variables
Xtest = df[['gmat', 'gpa', 'work_experience']]
ytest = df['admitted']
# performing predictions on the test dataset
yhat = log_reg.predict(Xtest)
prediction = list(map(round, yhat))
# comparing original and predicted values of y
print('Actual values', list(ytest.values))
print('Predictions :', prediction)

Output :
Optimization terminated successfully.
Current function value: 0.352707
Iterations 8
Actual values [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
Predictions : [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

Testing the accuracy of the model :

from sklearn.metrics import (confusion_matrix, accuracy_score)
# confusion matrix
cm = confusion_matrix(ytest, prediction)
print ("Confusion Matrix : \n", cm)
# accuracy score of the model
print('Test accuracy = ', accuracy_score(ytest, prediction))

Output :
Confusion Matrix :
[[6 0]
[2 2]]
Test accuracy = 0.8
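
Beyond raw accuracy (an added sketch, not part of the original program), per-class precision and recall, together with the ROC AUC computed from the predicted probabilities, give a fuller picture of the model's validity.

from sklearn.metrics import classification_report, roc_auc_score

# precision, recall and F1-score for each class on the test set
print(classification_report(ytest, prediction))

# ROC AUC uses the predicted probabilities (yhat) rather than the rounded labels
print("ROC AUC =", roc_auc_score(ytest, yhat))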



Experiment No: 11

TIME SERIES ANALYSIS

Program
We are using the Superstore sales data.
import warnings
import itertools
import numpy as np
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight')
import pandas as pd
import statsmodels.api as sm
import matplotlib
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'

We start with time series analysis and forecasting of furniture sales.
df=pd.read_excel("Superstore.xls")
furniture = df.loc[df['Category'] == 'Furniture']
We have a good four years of furniture sales data.
furniture['Order Date'].min(), furniture['Order Date'].max()
(Timestamp('2014-01-06 00:00:00'), Timestamp('2017-12-30 00:00:00'))
Data Preprocessing
This step includes removing columns we do not need, checking for missing values,
aggregating sales by date, and so on.
cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID',
'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region', 'Product
ID', 'Category', 'Sub-Category', 'Product Name', 'Quantity', 'Discount', 'Profit']

furniture.drop(cols, axis=1, inplace=True)
furniture = furniture.sort_values('Order Date')
furniture.isnull().sum()
furniture = furniture.groupby('Order Date')['Sales'].sum().reset_index()

Order Date 0
Sales 0
dtype: int64

Figure 1

Indexing with Time Series Data


furniture = furniture.set_index('Order Date')
furniture.index

Figure 2
We will use the average daily sales value for each month instead, and we will use
the start of each month as the timestamp.

y = furniture['Sales'].resample('MS').mean()

Have a quick peek at the 2017 furniture sales data.

y['2017':]

Figure 3

Visualizing Furniture Sales Time Series Data


y.plot(figsize=(15, 6))
plt.show()
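
The analysis above ends at visualization. A minimal forecasting sketch with a seasonal ARIMA model from statsmodels is added below; the (1, 1, 1)x(1, 1, 0, 12) order is an illustrative choice, not one selected by a grid search.

import statsmodels.api as sm

# fit a seasonal ARIMA model to the monthly series y built above
mod = sm.tsa.statespace.SARIMAX(y,
                                order=(1, 1, 1),
                                seasonal_order=(1, 1, 0, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])

# forecast the next 12 months, with confidence intervals
forecast = results.get_forecast(steps=12)
print(forecast.predicted_mean)
print(forecast.conf_int())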


