
DATA MINING THROUGH PYTHON LAB MANUAL

Statistics is the science of collecting, tabulating, and interpreting numerical data. It is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation. With statistics, we can see how data can be used to solve complex problems.

Understanding Descriptive Statistics


In layman’s terms, descriptive statistics generally means describing the data with the help
of some representative methods like charts, tables, Excel files, etc. The data is described in
such a way that it can express some meaningful information that can also be used to find
some future trends.
• Univariate Analysis: Describing and summarizing a single variable is
called univariate analysis.
• Bivariate Analysis: Describing a statistical relationship between two variables is
called bivariate analysis.
• Multivariate Analysis: Describing the statistical relationship between multiple
variables is called multivariate analysis.

There are two types of Descriptive Statistics:


• Measure of central tendency
• Measure of variability

Measure of Central Tendency


The measure of central tendency is a single value that attempts to describe the whole set of data. The main measures of central tendency are:
• Mean
• Median
• Median Low
• Median High
• Mode

Dr. C. Krishna Priya


Mean

It is the sum of the observations divided by the total number of observations, i.e., the average (sum divided by count).

The mean() function returns the mean, or average, of the data passed in its arguments. If the passed argument is empty, StatisticsError is raised.
Example Program 1: Python code to calculate mean
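The listing for Program 1 does not survive in this copy; a minimal sketch in the style of the later programs, using the statistics module, might look like this:

```python
# Python code to demonstrate the working of mean()

# importing the statistics module
from statistics import mean, StatisticsError

# tuple of positive integer numbers
data1 = (2, 3, 4, 5, 7, 9, 11)

print("Mean of data set 1 is %s" % (mean(data1)))

# mean() raises StatisticsError when the data is empty
try:
    mean([])
except StatisticsError:
    print("mean() raised StatisticsError on empty data")
```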

Median

It is the middle value of the data set; it splits the data into two halves. If the number of elements in the data set is odd, the center element is the median; if it is even, the median is the average of the two central elements. The data is first sorted, and then the median is computed.
For an odd number of observations n:
Median = value of the ((n + 1) / 2)th term (after sorting)

For an even number of observations n:
Median = average of the (n / 2)th and ((n / 2) + 1)th terms (after sorting)

Program 2: Median
# Python code to demonstrate the working of median() on various ranges of data sets

# importing the statistics module
from statistics import median

# importing fractions module as fr
from fractions import Fraction as fr

# tuple of positive integer numbers
data1 = (2, 3, 4, 5, 7, 9, 11)

# tuple of floating point values
data2 = (2.4, 5.1, 6.7, 8.9)

# tuple of fractional numbers
data3 = (fr(1, 2), fr(44, 12),
         fr(10, 3), fr(2, 3))

# tuple of negative integers
data4 = (-5, -1, -12, -19, -3)

# tuple of positive and negative integers
data5 = (-1, -2, -3, -4, 4, 3, 2, 1)

# printing the median of the above data sets
print("Median of data-set 1 is %s" % (median(data1)))
print("Median of data-set 2 is %s" % (median(data2)))
print("Median of data-set 3 is %s" % (median(data3)))
print("Median of data-set 4 is %s" % (median(data4)))
print("Median of data-set 5 is %s" % (median(data5)))

Output:

The median() function is used to calculate the median, i.e., the middle element of the data. If the passed argument is empty, StatisticsError is raised.

Median Low


The median_low() function returns the median of the data when the number of elements is odd; when it is even, it returns the lower of the two middle elements. If the passed argument is empty, StatisticsError is raised.
Median High
The median_high() function returns the median of the data when the number of elements is odd; when it is even, it returns the higher of the two middle elements. If the passed argument is empty, StatisticsError is raised.
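No program accompanies these two functions in this copy; a short sketch contrasting them on an even-length data set (reusing data2 from Program 2) might look like this:

```python
# Python code to demonstrate median_low() and median_high()
from statistics import median, median_low, median_high

# tuple with an even number of elements
data = (2.4, 5.1, 6.7, 8.9)

print("Median is %s" % (median(data)))            # average of the two middle values
print("Median low is %s" % (median_low(data)))    # lower of the two middle values
print("Median high is %s" % (median_high(data)))  # higher of the two middle values
```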

Mode
It is the value with the highest frequency in the given data set. A data set may have no unique mode if all data points occur with the same frequency, and it may have more than one mode if two or more values share the highest frequency.
The mode() function returns the value with the maximum number of occurrences (since Python 3.8, the first such value encountered). If the passed argument is empty, StatisticsError is raised.
Program 3:

# Python code to demonstrate the working of mode() on various ranges of data types
from statistics import mode

# importing fractions module as fr, so that mode() can be demonstrated on Fraction values
from fractions import Fraction as fr

# tuple of positive integer numbers
data1 = (2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7)

# tuple of floating point values
data2 = (2.4, 1.3, 1.3, 1.3, 2.4, 4.6)

# tuple of fractional numbers
data3 = (fr(1, 2), fr(1, 2), fr(10, 3), fr(2, 3))

# tuple of negative integers
data4 = (-1, -2, -2, -2, -7, -7, -9)

# tuple of strings
data5 = ("red", "blue", "black", "blue", "black", "black", "brown")

# printing the mode of the above data sets
print("Mode of data set 1 is %s" % (mode(data1)))
print("Mode of data set 2 is %s" % (mode(data2)))
print("Mode of data set 3 is %s" % (mode(data3)))
print("Mode of data set 4 is %s" % (mode(data4)))
print("Mode of data set 5 is %s" % (mode(data5)))

Output:

Measure of Variability
The measure of variability is known as the spread of data or how well our data is
distributed. The most common variability measures are:
• Range
• Variance
• Standard deviation

Range


The difference between the largest and smallest data points in a data set is known as the range. The larger the range, the more spread out the data, and vice versa.
Range = largest data value – smallest data value
Program 4: We can calculate the maximum and minimum values using the max() and min() functions respectively.
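The listing for Program 4 is not included in this copy; a minimal sketch (reusing sample1 from Program 5) might look like this:

```python
# Python code to calculate the range of a data set using max() and min()
data = (1, 2, 5, 4, 8, 9, 12)

largest = max(data)
smallest = min(data)

# Range = largest data value - smallest data value
print("Range of the data set is %s" % (largest - smallest))  # prints: Range of the data set is 11
```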

Variance

It is defined as the average squared deviation from the mean. It is calculated by finding the difference between every data point and the mean, squaring those differences, adding them all, and dividing by the number of data points in the data set.

σ² = Σ (xᵢ − μ)² / N

where N = number of terms and μ = mean.
The statistics module provides the variance() method, which does all the maths behind the scenes. Note that variance() computes the sample variance (dividing by N − 1); use pvariance() for the population variance. If the passed argument is empty, StatisticsError is raised.
Program 5:
# Python code to demonstrate variance() on varying ranges of data types
from statistics import variance
from fractions import Fraction as fr

sample1 = (1, 2, 5, 4, 8, 9, 12)
sample2 = (-2, -4, -3, -1, -5, -6)
sample3 = (-9, -1, -0, 2, 1, 3, 4, 19)
sample4 = (fr(1, 2), fr(2, 3), fr(3, 4), fr(5, 6), fr(7, 8))
sample5 = (1.23, 1.45, 2.1, 2.2, 1.9)

# print the variance of each sample
print("Variance of Sample1 is %s " % (variance(sample1)))
print("Variance of Sample2 is %s " % (variance(sample2)))
print("Variance of Sample3 is %s " % (variance(sample3)))
print("Variance of Sample4 is %s " % (variance(sample4)))
print("Variance of Sample5 is %s " % (variance(sample5)))
Output:

Standard Deviation

It is defined as the square root of the variance. It is calculated by finding the mean, subtracting it from each value, squaring the results, summing them, dividing by the number of terms, and taking the square root.

σ = √( Σ (xᵢ − μ)² / N )

where N = number of terms and μ = mean.
The stdev() method of the statistics module returns the (sample) standard deviation of the data. If the passed argument is empty, StatisticsError is raised.

Program 6:
# Python code to demonstrate stdev()
from statistics import stdev

sample1 = (1, 2, 5, 4, 8, 9, 12)
sample2 = (-2, -4, -3, -1, -5, -6)
sample3 = (-9, -1, -0, 2, 1, 3, 4, 19)
sample4 = (1.23, 1.45, 2.1, 2.2, 1.9)

print("The Standard Deviation of Sample1 is %s" % (stdev(sample1)))
print("The Standard Deviation of Sample2 is %s" % (stdev(sample2)))
print("The Standard Deviation of Sample3 is %s" % (stdev(sample3)))
print("The Standard Deviation of Sample4 is %s" % (stdev(sample4)))

Quartiles:
Quartiles and percentiles are measures of variation, which describe how spread out the data is. Both are types of quantiles. Quartiles are values that separate the data into four equal parts.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing
the quartiles:

The quartiles (Q0,Q1,Q2,Q3,Q4) are the values that separate each quarter.

Between Q0 and Q1 are the 25% lowest values in the data. Between Q1 and Q2 are the next
25%. And so on.

• Q0 is the smallest value in the data.
• Q1 is the value separating the first quarter from the second quarter of the data.
• Q2 is the middle value (median), separating the bottom half from the top half.
• Q3 is the value separating the third quarter from the fourth quarter.
• Q4 is the largest value in the data.

Calculating Quartiles with Programming

• Quartiles can easily be found with many programming languages.
• Using software and programming to calculate statistics is more common for bigger data sets, as computing them manually becomes difficult.

Example:


With Python, use the NumPy library's quantile() method to find the quartiles of the values 13, 21, 21, 40, 42, 48, 55, 72:
import numpy
values = [13,21,21,40,42,48,55,72]
x = numpy.quantile(values, [0,0.25,0.5,0.75,1])
print(x)
Output:
[13. 21. 41. 49.75 72. ]

Program 7: Linear Regression

# Making a simple linear regression in Python is very easy using scikit-learn and seaborn's regplot.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_regression

sns.set()

X, y = make_regression(n_samples=200, n_features=1, n_targets=1, noise=30, random_state=0)

sns.regplot(x=X, y=y)
plt.title('Simple Linear Regression', fontsize=20)
plt.show()

Output:

# linear regression with real-world data
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes

sns.set()

X, y = load_diabetes(return_X_y=True, as_frame=True)
X = X[['bmi']]

sns.regplot(x=X, y=y)
plt.title('Simple Linear Regression on Diabetes', fontsize=20)
plt.show()

Output:

Program 8: Program on Cross Tabulation:


Cross Tabulation:

• Cross-tabulation is an analysis that uses a table to describe or summarize the relationship between two different variables.
• The table used in cross-tabulation is called a cross table, a two-way table, a contingency table (i.e., a frequency table), or a pivot table.
Syntax:
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)

For this program, create a CSV file named Dataimage.csv with the following data:

Name        Nationality  Sex     Age  Handedness
Kathy       Canada       Female  23   Right
nina        Canada       Female  18   Right
Peter       Canada       Male    19   Right
John        Canada       Male    22   Left
Fatima      London       Female  31   Left
Elizhabeth  London       Female  25   Left
Dhaval      India        Male    35   Left
Sudhir      India        Male    31   Left
Sanju       India        Male    18   Right
Yan         China        Female  52   Right
Juan        China        Female  58   Left
Liang       China        Male    43   Left


import pandas as pd
df = pd.read_csv(r'F:\Datascience\Dataimage.csv')

# crosstab() is called with two parameters and the resulting table is stored in a variable
crosstb = pd.crosstab(df.Nationality, df.Handedness)
print(crosstb)

Output:

# creating a bar plot
barplot = crosstb.plot.bar(rot=0)

# creating a stacked bar plot
pl = crosstb.plot(kind="bar", stacked=True, rot=0)


# We can also create a crosstab with more than two variables.
import pandas as pd
df = pd.read_csv(r'F:\Datascience\Dataimage.csv')

# crosstab with three variables
crosstb = pd.crosstab(df.Sex, [df.Nationality, df.Handedness])

# bar plotting
a = crosstb.plot(kind='bar', rot=0)
a.legend(title='Handedness', bbox_to_anchor=(1, 1.02), loc='upper left')

Correlation
A correlation matrix is a normalized covariance matrix (the covariance matrix is also known as the auto-covariance matrix, dispersion matrix, variance matrix, or variance-covariance matrix). It is a matrix in which the i-j position gives the correlation between the ith and jth variables of the given data set. When the data points follow a roughly straight-line trend, the variables are said to have an approximately linear relationship. In some cases, the data points fall close to a straight line, but more often there is quite a bit of variability of the points around the straight-line trend. A summary measure called correlation describes the strength of the linear association.
Correlation in Python
Correlation summarizes the strength and direction of the linear (straight-line) association between two quantitative variables. Denoted by r, it takes values between -1 and +1. A positive value of r indicates a positive association and a negative value a negative association. The closer r is to ±1, the closer the data points fall to a straight line and the stronger the linear association; the closer r is to 0, the weaker the linear association.
Correlation is a statistical measure that defines to what extent two variables are linearly related to each other. In statistics, correlation is defined by the Pearson correlation formula:

r = Σ (xᵢ − x̄)(yᵢ − ȳ) / √( Σ (xᵢ − x̄)² × Σ (yᵢ − ȳ)² )

where,
• r: correlation coefficient
• xᵢ: ith value of the first dataset X
• x̄: mean of the first dataset X
• yᵢ: ith value of the second dataset Y
• ȳ: mean of the second dataset Y
Condition: the datasets X and Y must have the same length.
The correlation value can be positive, negative, or zero.

Program 9 : Correlation

import numpy as np

# define the datasets
x = np.array([1, 3, 5, 7, 8, 9, 10, 15])
y = np.array([10, 20, 30, 40, 50, 60, 70, 80])

def Pearson_correlation(X, Y):
    if len(X) == len(Y):
        Sum_xy = sum((X - X.mean()) * (Y - Y.mean()))
        Sum_x_squared = sum((X - X.mean()) ** 2)
        Sum_y_squared = sum((Y - Y.mean()) ** 2)
        corr = Sum_xy / np.sqrt(Sum_x_squared * Sum_y_squared)
        return corr

print(Pearson_correlation(x, y))
print(Pearson_correlation(x, x))
Output:


# We can also find the correlation by using the numpy corrcoef function
print(np.corrcoef(x, y))

Output:

Program 10: Program to find the correlation coefficient and visualize it through a heatmap on the Diabetes dataset

import pandas as pd
from sklearn.datasets import load_diabetes
import seaborn as sns
import matplotlib.pyplot as plt

# load the dataset as a frame
df = load_diabetes(as_frame=True)
# convert into a pandas dataframe
df = df.frame
# print the first 5 rows
df.head()

Output:

Finding the Pearson correlation matrix by using the pandas command df.corr()
Syntax
df.corr(method, min_periods, numeric_only)
method: any one of {'pearson', 'kendall', 'spearman'}; 'pearson', the standard correlation coefficient, is the default.
min_periods: int, optional. Defines the minimum number of observations required per pair of columns.
numeric_only: default is False. Defines whether to include only numeric columns or to also attempt categorical/object columns.

# find the Pearson correlation matrix
corr = df.corr(method='pearson')
corr

Output:

The above table represents the correlations between each pair of columns of the data frame. The correlation of a column with itself is 1.0. A negative correlation indicates a negative relationship: as one column's value increases, the other decreases, and vice versa. A zero correlation indicates no (neutral) relationship, and a positive correlation indicates a positive relationship: as one column's value increases, the other also increases, and vice versa.
# We can also find the correlations between two columns using NumPy
# correlations between the age and sex columns
import numpy as np
c = np.corrcoef(df['age'], df['sex'])
print('Correlations between age and sex\n', c)

Output:

A heatmap is a graphical representation of data that uses colors to visualize the values of a matrix. Typically, brighter colors (basically reddish colors) represent more common values or higher activity, while darker colors represent less common values or lower activity. A heatmap is also known by the name shading matrix. Heatmaps in Seaborn can be plotted by using the seaborn.heatmap() function.

seaborn.heatmap()

Syntax: seaborn.heatmap(data, *, vmin=None, vmax=None, cmap=None, center=None, annot=None, fmt='.2g', annot_kws=None, linewidths=0, linecolor='white', cbar=True, **kwargs)
Important parameters:
• data: 2D dataset that can be coerced into an ndarray.
• vmin, vmax: values to anchor the colormap; otherwise they are inferred from the data and other keyword arguments.
• cmap: the mapping from data values to color space.
• center: the value at which to center the colormap when plotting divergent data.
• annot: if True, write the data value in each cell.
• fmt: string formatting code to use when adding annotations.
• linewidths: width of the lines that will divide each cell.
• linecolor: color of the lines that will divide each cell.
• cbar: whether to draw a colorbar.
All the parameters except data are optional.
# plotting the correlation matrix with the seaborn heatmap
plt.figure(figsize=(10, 8), dpi=500)
sns.heatmap(corr, annot=True, fmt=".2f", linewidths=.5)
plt.show()

Output:

Program 11: Program for finding the correlation coefficient and heatmap on the Iris dataset

import pandas as pd
from sklearn.datasets import load_iris
import seaborn as sns
import matplotlib.pyplot as plt

# load the dataset as a frame
df = load_iris(as_frame=True)

# convert into a pandas dataframe
df = df.frame

# print the first 5 rows
df.head()

# find the Pearson correlation matrix
corr = df.corr(method='pearson')
corr

# correlations between sepal length and sepal width using numpy
import numpy as np
c = np.corrcoef(df['sepal length (cm)'], df['sepal width (cm)'])
print('Correlations between sepal length and sepal width\n', c)
Output:

plt.figure(figsize=(5, 5), dpi=100)
sns.heatmap(corr, annot=True, fmt=".2f", linewidths=.5)
plt.show()


Program 12: Write a program on Univariate, Bivariate & Multivariate Analysis in Data Visualisation

Univariate Analysis
Univariate analysis is a type of data visualization where we visualize only a single variable at a time. It helps us to analyze the distribution of each variable present in the data so that we can perform further analysis.

import pandas as pd
import seaborn as sns
data = pd.read_csv(r'F:\Datascience\Employee_dataset.csv')
print(data.head())
Output:


# visualizing a numerical variable
sns.histplot(data['age'])

# visualizing a categorical variable
sns.countplot(x=data['gender_full'])

# A pie chart helps us to visualize the percentage of the data belonging to each category.
# import libraries
from matplotlib import pyplot as plt
import numpy as np
x = data['STATUS_YEAR'].value_counts()
plt.pie(x.values, labels=x.index, autopct='%1.1f%%')
plt.show()

# bivariate data analysis

# Categorical vs. Numerical
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 5))
sns.barplot(x=data['department_name'], y=data['length_of_service'])
plt.xticks(rotation=90)


# Numerical vs. Numerical
sns.scatterplot(x=data['length_of_service'], y=data['age'])

# Categorical vs. Categorical
sns.countplot(x=data['STATUS_YEAR'], hue=data['STATUS'])


Multivariate Analysis
It is an extension of bivariate analysis, involving multiple variables at the same time to find correlations between them. Multivariate analysis is a set of statistical models that examine patterns in multidimensional data by considering several data variables at once.
#multivariate analysis
from sklearn import datasets, decomposition
iris = datasets.load_iris()
X = iris.data
y = iris.target
pca = decomposition.PCA(n_components=2)
X = pca.fit_transform(X)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y)

# The values of correlation can vary from -1 to 1, where -1 means a strong negative and +1 a strong positive correlation.
# numeric_only=True is required because this dataset also contains non-numeric columns
sns.heatmap(data.corr(numeric_only=True), annot=True)
