
DATA MINING THROUGH PYTHON LAB MANUAL

Statistics is the science of collecting, tabulating, and interpreting numerical data. It is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation. With statistics, we can see how data can be used to solve complex problems.

Understanding Descriptive Statistics


In layman’s terms, descriptive statistics generally means describing the data with the help
of some representative methods like charts, tables, Excel files, etc. The data is described in
such a way that it can express some meaningful information that can also be used to find
some future trends.
• Univariate Analysis: Describing and summarizing a single variable is
called univariate analysis.
• Bivariate Analysis: Describing a statistical relationship between two variables is
called bivariate analysis.
• Multivariate Analysis: Describing the statistical relationship between multiple
variables is called multivariate analysis.

There are two types of Descriptive Statistics:


• Measure of central tendency
• Measure of variability

Measure of Central Tendency


The measure of central tendency is a single value that attempts to describe the whole set of data. The main measures of central tendency are:
• Mean
• Median
• Median Low
• Median High
• Mode

Dr. C. Krishna Priya


Mean

It is the sum of the observations divided by the total number of observations, i.e., the average (sum divided by count).

The mean() function returns the mean, or average, of the data passed in its arguments. If the passed argument is empty, StatisticsError is raised.
Example Program 1: Python code to calculate mean
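The listing for Program 1 does not survive in this copy; a minimal sketch in the style of the later programs, using the statistics module, might look like this:

```python
# Python code to demonstrate the working of mean()

# importing the statistics module
from statistics import mean, StatisticsError

# tuple of positive integer numbers
data1 = (2, 3, 4, 5, 7, 9, 11)

print("Mean of data set 1 is %s" % (mean(data1)))

# mean() raises StatisticsError when the data is empty
try:
    mean([])
except StatisticsError:
    print("mean() raised StatisticsError on empty data")
```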

Median

It is the middle value of the data set; it splits the data into two halves. If the number of elements in the data set is odd, the center element is the median; if it is even, the median is the average of the two central elements. The data is first sorted, and then the median is computed.
For an odd number of observations n:
Median = value of the ((n + 1) / 2)th term (after sorting)

For an even number of observations n:
Median = average of the (n / 2)th and ((n / 2) + 1)th terms (after sorting)

Program 2: Median
# Python code to demonstrate the working of median() on various ranges of data sets

# importing the statistics module
from statistics import median

# importing fractions module as fr
from fractions import Fraction as fr

# tuple of positive integer numbers
data1 = (2, 3, 4, 5, 7, 9, 11)

# tuple of floating point values
data2 = (2.4, 5.1, 6.7, 8.9)

# tuple of fractional numbers
data3 = (fr(1, 2), fr(44, 12),
         fr(10, 3), fr(2, 3))

# tuple of negative integers
data4 = (-5, -1, -12, -19, -3)

# tuple of positive and negative integers
data5 = (-1, -2, -3, -4, 4, 3, 2, 1)

# printing the median of the above data sets
print("Median of data-set 1 is %s" % (median(data1)))
print("Median of data-set 2 is %s" % (median(data2)))
print("Median of data-set 3 is %s" % (median(data3)))
print("Median of data-set 4 is %s" % (median(data4)))
print("Median of data-set 5 is %s" % (median(data5)))

Output:

The median() function is used to calculate the median, i.e., the middle element of the data. If the passed argument is empty, StatisticsError is raised.

Median Low


The median_low() function returns the median of the data when the number of elements is odd; when it is even, it returns the lower of the two middle elements. If the passed argument is empty, StatisticsError is raised.
Median High
The median_high() function returns the median of the data when the number of elements is odd; when it is even, it returns the higher of the two middle elements. If the passed argument is empty, StatisticsError is raised.
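No program accompanies these two functions in this copy; a short sketch contrasting them on an even-length data set (reusing data2 from Program 2) might look like this:

```python
# Python code to demonstrate median_low() and median_high()
from statistics import median, median_low, median_high

# tuple with an even number of elements
data = (2.4, 5.1, 6.7, 8.9)

print("Median is %s" % (median(data)))            # average of the two middle values
print("Median low is %s" % (median_low(data)))    # lower of the two middle values
print("Median high is %s" % (median_high(data)))  # higher of the two middle values
```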

Mode
It is the value with the highest frequency in the given data set. A data set may have no unique mode if all data points occur with the same frequency, and it may have more than one mode if two or more values share the highest frequency.
The mode() function returns the value with the maximum number of occurrences (since Python 3.8, the first such value encountered). If the passed argument is empty, StatisticsError is raised.
Program 3:

# Python code to demonstrate the working of mode() on various ranges of data types
from statistics import mode

# importing fractions module as fr, so that mode() can be demonstrated on Fraction values
from fractions import Fraction as fr

# tuple of positive integer numbers
data1 = (2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7)

# tuple of floating point values
data2 = (2.4, 1.3, 1.3, 1.3, 2.4, 4.6)

# tuple of fractional numbers
data3 = (fr(1, 2), fr(1, 2), fr(10, 3), fr(2, 3))

# tuple of negative integers
data4 = (-1, -2, -2, -2, -7, -7, -9)

# tuple of strings
data5 = ("red", "blue", "black", "blue", "black", "black", "brown")

# printing the mode of the above data sets
print("Mode of data set 1 is %s" % (mode(data1)))
print("Mode of data set 2 is %s" % (mode(data2)))
print("Mode of data set 3 is %s" % (mode(data3)))
print("Mode of data set 4 is %s" % (mode(data4)))
print("Mode of data set 5 is %s" % (mode(data5)))

Output:

Measure of Variability
The measure of variability is known as the spread of data or how well our data is
distributed. The most common variability measures are:
• Range
• Variance
• Standard deviation

Range


The difference between the largest and smallest data points in a data set is known as the range. The larger the range, the more spread out the data, and vice versa.
Range = largest data value – smallest data value
Program 4: We can calculate the maximum and minimum values using the max() and min() functions respectively.
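The listing for Program 4 is not included in this copy; a minimal sketch (reusing sample1 from Program 5) might look like this:

```python
# Python code to calculate the range of a data set using max() and min()
data = (1, 2, 5, 4, 8, 9, 12)

largest = max(data)
smallest = min(data)

# Range = largest data value - smallest data value
print("Range of the data set is %s" % (largest - smallest))  # prints: Range of the data set is 11
```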

Variance

It is defined as the average squared deviation from the mean. It is calculated by finding the difference between every data point and the mean, squaring those differences, adding them all, and dividing by the number of data points in the data set.

σ² = Σ (xᵢ − μ)² / N

where N = number of terms and μ = mean.
The statistics module provides the variance() method, which does all the maths behind the scenes. Note that variance() computes the sample variance (dividing by N − 1); use pvariance() for the population variance. If the passed argument is empty, StatisticsError is raised.
Program 5:
# Python code to demonstrate variance() on varying ranges of data types
from statistics import variance
from fractions import Fraction as fr

sample1 = (1, 2, 5, 4, 8, 9, 12)
sample2 = (-2, -4, -3, -1, -5, -6)
sample3 = (-9, -1, -0, 2, 1, 3, 4, 19)
sample4 = (fr(1, 2), fr(2, 3), fr(3, 4), fr(5, 6), fr(7, 8))
sample5 = (1.23, 1.45, 2.1, 2.2, 1.9)

# print the variance of each sample
print("Variance of Sample1 is %s " % (variance(sample1)))
print("Variance of Sample2 is %s " % (variance(sample2)))
print("Variance of Sample3 is %s " % (variance(sample3)))
print("Variance of Sample4 is %s " % (variance(sample4)))
print("Variance of Sample5 is %s " % (variance(sample5)))
Output:

Standard Deviation

It is defined as the square root of the variance. It is calculated by finding the mean, subtracting it from each value, squaring the results, summing them, dividing by the number of terms, and taking the square root.

σ = √( Σ (xᵢ − μ)² / N )

where N = number of terms and μ = mean.
The stdev() method of the statistics module returns the (sample) standard deviation of the data. If the passed argument is empty, StatisticsError is raised.

Program 6:
# Python code to demonstrate stdev()
from statistics import stdev

sample1 = (1, 2, 5, 4, 8, 9, 12)
sample2 = (-2, -4, -3, -1, -5, -6)
sample3 = (-9, -1, -0, 2, 1, 3, 4, 19)
sample4 = (1.23, 1.45, 2.1, 2.2, 1.9)

print("The Standard Deviation of Sample1 is %s" % (stdev(sample1)))
print("The Standard Deviation of Sample2 is %s" % (stdev(sample2)))
print("The Standard Deviation of Sample3 is %s" % (stdev(sample3)))
print("The Standard Deviation of Sample4 is %s" % (stdev(sample4)))

Quartiles:
Quartiles and percentiles are measures of variation, which describe how spread out the data is. Both are types of quantiles. Quartiles are values that separate the data into four equal parts.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing
the quartiles:

The quartiles (Q0,Q1,Q2,Q3,Q4) are the values that separate each quarter.

Between Q0 and Q1 are the 25% lowest values in the data. Between Q1 and Q2 are the next
25%. And so on.

• Q0 is the smallest value in the data.
• Q1 is the value separating the first quarter from the second quarter of the data.
• Q2 is the middle value (median), separating the bottom half from the top half.
• Q3 is the value separating the third quarter from the fourth quarter.
• Q4 is the largest value in the data.

Calculating Quartiles with Programming

• Quartiles can easily be found with many programming languages.
• Using software and programming to calculate statistics is more common for bigger data sets, as computing them manually becomes difficult.

Example:


With Python, use the NumPy library's quantile() method to find the quartiles of the values 13, 21, 21, 40, 42, 48, 55, 72:
import numpy
values = [13,21,21,40,42,48,55,72]
x = numpy.quantile(values, [0,0.25,0.5,0.75,1])
print(x)
Output:
[13. 21. 41. 49.75 72. ]

Program 7: Linear Regression

# Making a simple linear regression in Python is very easy using scikit-learn and seaborn's regplot.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_regression

sns.set()

X, y = make_regression(n_samples=200, n_features=1, n_targets=1, noise=30, random_state=0)

sns.regplot(x=X, y=y)
plt.title('Simple Linear Regression', fontsize=20)
plt.show()

Output:

# linear regression with real-world data
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes

sns.set()

X, y = load_diabetes(return_X_y=True, as_frame=True)
X = X[['bmi']]

sns.regplot(x=X, y=y)
plt.title('Simple Linear Regression on Diabetes', fontsize=20)
plt.show()

Output:

Program 8: Program on Cross Tabulation:


Cross Tabulation:

• Cross-tabulation is an analysis that uses a table to describe or summarize the relationship between two different variables.
• The table used in cross-tabulation is called a cross table, a two-way table, a contingency table (i.e., a frequency table), or a pivot table.
Syntax:
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)

For this program, create a CSV file named Dataimage.csv with the following data:

Name        Nationality  Sex     Age  Handedness
Kathy       Canada       Female  23   Right
nina        Canada       Female  18   Right
Peter       Canada       Male    19   Right
John        Canada       Male    22   Left
Fatima      London       Female  31   Left
Elizhabeth  London       Female  25   Left
Dhaval      India        Male    35   Left
Sudhir      India        Male    31   Left
Sanju       India        Male    18   Right
Yan         China        Female  52   Right
Juan        China        Female  58   Left
Liang       China        Male    43   Left


import pandas as pd
df = pd.read_csv(r'F:\Datascience\Dataimage.csv')

# crosstab() is called with two parameters and the resulting table is stored in a variable
crosstb = pd.crosstab(df.Nationality, df.Handedness)
print(crosstb)

Output:

# creating a bar plot
barplot = crosstb.plot.bar(rot=0)

# creating a stacked bar plot
pl = crosstb.plot(kind="bar", stacked=True, rot=0)


# We can also create a crosstab with more than two variables.
import pandas as pd
df = pd.read_csv(r'F:\Datascience\Dataimage.csv')

# crosstab with three variables
crosstb = pd.crosstab(df.Sex, [df.Nationality, df.Handedness])

# bar plotting
a = crosstb.plot(kind='bar', rot=0)
a.legend(title='Handedness', bbox_to_anchor=(1, 1.02), loc='upper left')

Correlation
A correlation matrix is a normalized covariance matrix (the covariance matrix is also known as the auto-covariance matrix, dispersion matrix, variance matrix, or variance-covariance matrix). It is a matrix in which the i-j position gives the correlation between the ith and jth variables of the given data set. When the data points follow a roughly straight-line trend, the variables are said to have an approximately linear relationship. In some cases, the data points fall close to a straight line, but more often there is quite a bit of variability of the points around the straight-line trend. A summary measure called correlation describes the strength of the linear association.
Correlation in Python
Correlation summarizes the strength and direction of the linear (straight-line) association between two quantitative variables. Denoted by r, it takes values between -1 and +1. A positive value of r indicates a positive association and a negative value a negative association. The closer r is to ±1, the closer the data points fall to a straight line and the stronger the linear association; the closer r is to 0, the weaker the linear association.
Correlation is a statistical measure that defines to what extent two variables are linearly related to each other. In statistics, correlation is defined by the Pearson correlation formula:

r = Σ (xᵢ − x̄)(yᵢ − ȳ) / √( Σ (xᵢ − x̄)² × Σ (yᵢ − ȳ)² )

where,
• r: correlation coefficient
• xᵢ: ith value of the first dataset X
• x̄: mean of the first dataset X
• yᵢ: ith value of the second dataset Y
• ȳ: mean of the second dataset Y
Condition: the datasets X and Y must have the same length.
The correlation value can be positive, negative, or zero.

Program 9 : Correlation

import numpy as np

# define the datasets
x = np.array([1, 3, 5, 7, 8, 9, 10, 15])
y = np.array([10, 20, 30, 40, 50, 60, 70, 80])

def Pearson_correlation(X, Y):
    if len(X) == len(Y):
        Sum_xy = sum((X - X.mean()) * (Y - Y.mean()))
        Sum_x_squared = sum((X - X.mean()) ** 2)
        Sum_y_squared = sum((Y - Y.mean()) ** 2)
        corr = Sum_xy / np.sqrt(Sum_x_squared * Sum_y_squared)
        return corr

print(Pearson_correlation(x, y))
print(Pearson_correlation(x, x))
Output:


# We can also find the correlation by using the numpy corrcoef function
print(np.corrcoef(x, y))

Output:

Program 10: Program to find the correlation coefficient and visualize it through a heatmap on the Diabetes dataset

import pandas as pd
from sklearn.datasets import load_diabetes
import seaborn as sns
import matplotlib.pyplot as plt

# load the dataset as a frame
df = load_diabetes(as_frame=True)
# convert into a pandas dataframe
df = df.frame
# print the first 5 rows
df.head()

Output:

Finding the Pearson correlation matrix by using the pandas command df.corr()
Syntax
df.corr(method, min_periods, numeric_only)
method: any one of {'pearson', 'kendall', 'spearman'}; 'pearson', the standard correlation coefficient, is the default.
min_periods: int, optional. Defines the minimum number of observations required per pair of columns.
numeric_only: default is False. Defines whether to include only numeric columns or to also attempt categorical/object columns.

# find the Pearson correlation matrix
corr = df.corr(method='pearson')
corr

Output:

The above table represents the correlations between each pair of columns of the data frame. The correlation of a column with itself is 1.0. A negative correlation indicates a negative relationship: as one column's value increases, the other decreases, and vice versa. A zero correlation indicates no (neutral) relationship, and a positive correlation indicates a positive relationship: as one column's value increases, the other also increases, and vice versa.
# We can also find the correlations between two columns using NumPy
# correlations between the age and sex columns
import numpy as np
c = np.corrcoef(df['age'], df['sex'])
print('Correlations between age and sex\n', c)

Output:

A heatmap is a graphical representation of data that uses colors to visualize the values of a matrix. Typically, brighter colors (basically reddish colors) represent more common values or higher activity, while darker colors represent less common values or lower activity. A heatmap is also known by the name shading matrix. Heatmaps in Seaborn can be plotted by using the seaborn.heatmap() function.

seaborn.heatmap()

Syntax: seaborn.heatmap(data, *, vmin=None, vmax=None, cmap=None, center=None, annot=None, fmt='.2g', annot_kws=None, linewidths=0, linecolor='white', cbar=True, **kwargs)
Important parameters:
• data: 2D dataset that can be coerced into an ndarray.
• vmin, vmax: values to anchor the colormap; otherwise they are inferred from the data and other keyword arguments.
• cmap: the mapping from data values to color space.
• center: the value at which to center the colormap when plotting divergent data.
• annot: if True, write the data value in each cell.
• fmt: string formatting code to use when adding annotations.
• linewidths: width of the lines that will divide each cell.
• linecolor: color of the lines that will divide each cell.
• cbar: whether to draw a colorbar.
All the parameters except data are optional.
# plotting the correlation matrix with the seaborn heatmap
plt.figure(figsize=(10, 8), dpi=500)
sns.heatmap(corr, annot=True, fmt=".2f", linewidths=.5)
plt.show()

Output:

Program 11: Program for finding the correlation coefficient and heatmap on the Iris dataset

import pandas as pd
from sklearn.datasets import load_iris
import seaborn as sns
import matplotlib.pyplot as plt

# load the dataset as a frame
df = load_iris(as_frame=True)

# convert into a pandas dataframe
df = df.frame

# print the first 5 rows
df.head()

# find the Pearson correlation matrix
corr = df.corr(method='pearson')
corr

# correlations between sepal length and sepal width using numpy
import numpy as np
c = np.corrcoef(df['sepal length (cm)'], df['sepal width (cm)'])
print('Correlations between sepal length and sepal width\n', c)
Output:

plt.figure(figsize=(5, 5), dpi=100)
sns.heatmap(corr, annot=True, fmt=".2f", linewidths=.5)
plt.show()


Program 12: Write a program on Univariate, Bivariate & Multivariate Analysis in Data Visualisation

Univariate Analysis
Univariate analysis is a type of data visualization where we visualize only a single variable at a time. It helps us to analyze the distribution of each variable present in the data so that we can perform further analysis.

import pandas as pd
import seaborn as sns
data = pd.read_csv(r'F:\Datascience\Employee_dataset.csv')
print(data.head())
Output:


# visualizing a numerical variable
sns.histplot(data['age'])

# visualizing a categorical variable
sns.countplot(x=data['gender_full'])

# A pie chart helps us to visualize the percentage of the data belonging to each category.
# import libraries
from matplotlib import pyplot as plt
import numpy as np
x = data['STATUS_YEAR'].value_counts()
plt.pie(x.values, labels=x.index, autopct='%1.1f%%')
plt.show()

# bivariate data analysis

# Categorical vs. Numerical
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 5))
sns.barplot(x=data['department_name'], y=data['length_of_service'])
plt.xticks(rotation=90)


# Numerical vs. Numerical
sns.scatterplot(x=data['length_of_service'], y=data['age'])

# Categorical vs. Categorical
sns.countplot(x=data['STATUS_YEAR'], hue=data['STATUS'])


Multivariate Analysis
It is an extension of bivariate analysis, involving multiple variables at the same time to find correlations between them. Multivariate analysis is a set of statistical models that examine patterns in multidimensional data by considering several data variables at once.
#multivariate analysis
from sklearn import datasets, decomposition
iris = datasets.load_iris()
X = iris.data
y = iris.target
pca = decomposition.PCA(n_components=2)
X = pca.fit_transform(X)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y)

# The values of correlation can vary from -1 to 1, where -1 means a strong negative and +1 a strong positive correlation.
# numeric_only=True is required because this dataset also contains non-numeric columns
sns.heatmap(data.corr(numeric_only=True), annot=True)
