Data Mining Lab Manual Through Python 031023
Dr. C. Krishna Priya
DATAMINING THROUGH PYTHON LAB MANUAL
Mean
The mean is the sum of the observations divided by the total number of observations. It is
also called the average.
The mean() function of the statistics module returns the mean of the data passed as its
argument. If the passed argument is empty, StatisticsError is raised.
Example Program 1: Python code to calculate mean
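A minimal sketch of what such a program might look like, assuming the standard-library statistics module. The data list is illustrative and not taken from the manual.

```python
from statistics import mean, StatisticsError

# Illustrative data (an assumption, not taken from the manual)
data = [1, 3, 4, 5, 7, 9, 2]

# mean() returns the sum of the values divided by their count
average = mean(data)
print("Mean is:", average)

# An empty argument raises StatisticsError
try:
    mean([])
except StatisticsError as err:
    print("Empty data:", err)
```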
Median
It is the middle value of the data set. It splits the data into two halves. If the number of
elements in the data set is odd, the center element is the median; if it is even, the median
is the average of the two central elements. The data is sorted first, and the median is then
taken from the sorted values.
For Odd Numbers:
Output:
The median() function is used to calculate the median, i.e. the middle element of the data.
If the passed argument is empty, StatisticsError is raised.
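The odd and even cases described above can be sketched as follows, assuming the standard-library statistics module. Both lists are illustrative values, not data from the manual.

```python
from statistics import median

# Illustrative data (assumptions, not taken from the manual)
odd_data = [5, 1, 9, 3, 7]        # sorted: 1, 3, 5, 7, 9
even_data = [5, 1, 9, 3, 7, 11]   # sorted: 1, 3, 5, 7, 9, 11

# Odd count: the single middle element of the sorted data
median_odd = median(odd_data)
# Even count: the average of the two middle elements
median_even = median(even_data)

print("Median (odd count):", median_odd)
print("Median (even count):", median_even)
```

Note that median() sorts a copy of the data internally, so the input lists do not need to be sorted beforehand.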
Median Low
The median_low() function returns the median of the data when the number of elements is odd.
When the number of elements is even, it returns the lower of the two middle elements. If the
passed argument is empty, StatisticsError is raised.
Median High
The median_high() function returns the median of the data when the number of elements is odd.
When the number of elements is even, it returns the higher of the two middle elements. If the
passed argument is empty, StatisticsError is raised.
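A short sketch comparing the three median functions on one even-length data set, assuming the standard-library statistics module. The values are illustrative and not taken from the manual.

```python
from statistics import median, median_low, median_high

# Illustrative even-length data (an assumption, not taken from the manual)
data = [1, 3, 5, 7]

low = median_low(data)    # lower of the two middle elements
high = median_high(data)  # higher of the two middle elements
mid = median(data)        # average of the two middle elements

print("median_low:", low)
print("median:", mid)
print("median_high:", high)
```

Unlike median(), both median_low() and median_high() always return a value that is actually present in the data.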
Mode
It is the value that has the highest frequency in the given data set. The data set may have no
mode if the frequency of all data points is the same. We can also have more than one mode
if two or more data points share the same highest frequency.
The mode() function returns the value with the maximum number of occurrences. If the
passed argument is empty, StatisticsError is raised.
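A minimal sketch of both situations, assuming Python 3.8 or later, where mode() returns the first mode encountered when several values tie and multimode() returns all of them. The data sets are illustrative and not taken from the manual.

```python
from statistics import mode, multimode

# Illustrative data (assumptions, not taken from the manual)
numbers = [1, 2, 2, 3, 3, 3, 4]
ties = ["a", "b", "b", "a", "c"]

# Single most frequent value
most_common = mode(numbers)
# Every value that shares the highest frequency, in order of first appearance
all_modes = multimode(ties)

print("Mode of numbers:", most_common)
print("Modes of ties:", all_modes)
```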
Program 3:
# Python code to demonstrate the working of the mode() function on various data types
# tuple of strings
data5 = ("red", "blue", "black", "blue", "black", "black", "brown")
Output:
Measure of Variability
A measure of variability describes the spread of the data, that is, how widely the values
are distributed. The most common variability measures are:
• Range
• Variance
• Standard deviation
Range
The difference between the largest and smallest data point in our data set is known as the
range. The range grows with the spread of the data: the bigger the range, the more spread
out the data, and vice versa.
Range = Largest data value – smallest data value
Program 4: We can calculate the maximum and minimum values using
the built-in max() and min() functions respectively.
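A minimal sketch of such a program, using only the built-in max() and min() functions. The data list is illustrative and not taken from the manual.

```python
# Illustrative data (an assumption, not taken from the manual)
data = [4, 9, 1, 12, 7]

largest = max(data)
smallest = min(data)

# Range = largest data value - smallest data value
data_range = largest - smallest

print("Maximum =", largest, "Minimum =", smallest, "Range =", data_range)
```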
Variance
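It is defined as the average squared deviation from the mean. It is calculated by finding the
difference between every data point and the mean, squaring each difference, adding the
squares, and then dividing by the number of data points in our data set. Dividing by the
number of data points gives the population variance. The variance() function of the
statistics module computes the sample variance instead, which divides by one less than the
number of data points.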
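The calculation described above can be sketched step by step and checked against the statistics module, assuming pvariance() for the population variance and variance() for the sample variance. The data list is illustrative and not taken from the manual.

```python
from statistics import mean, pvariance, variance

# Illustrative data (an assumption, not taken from the manual)
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: mean of the data
m = mean(data)
# Step 2: squared deviation of every data point from the mean
squared_deviations = [(x - m) ** 2 for x in data]
# Step 3: divide the sum by n (population) or by n - 1 (sample)
population_var = sum(squared_deviations) / len(data)
sample_var = sum(squared_deviations) / (len(data) - 1)

print("Population variance:", population_var, "pvariance():", pvariance(data))
print("Sample variance:", sample_var, "variance():", variance(data))
```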
from fractions import Fraction as fr
from statistics import variance

# sample1, sample2 and sample3 are defined in a part of the listing that is not reproduced here.
sample4 = (fr(1, 2), fr(2, 3), fr(3, 4), fr(5, 6), fr(7, 8))
sample5 = (1.23, 1.45, 2.1, 2.2, 1.9)
# Print the variance of each sample
print("Variance of Sample1 is %s" % variance(sample1))
print("Variance of Sample2 is %s" % variance(sample2))
print("Variance of Sample3 is %s" % variance(sample3))
print("Variance of Sample4 is %s" % variance(sample4))
print("Variance of Sample5 is %s" % variance(sample5))
Output:
Standard Deviation
It is defined as the square root of the variance. It is calculated by finding the mean,
subtracting the mean from each value, squaring each result, adding the squares, dividing by
the number of terms, and finally taking the square root.
Program 6:
# Python code to demonstrate stdev()
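A minimal sketch of what such a demonstration might look like, assuming the standard-library statistics module. Here stdev() gives the sample standard deviation and pstdev() the population standard deviation. The data list is illustrative and not taken from the manual.

```python
from math import sqrt
from statistics import pstdev, stdev, variance

# Illustrative data (an assumption, not taken from the manual)
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Sample standard deviation: square root of the sample variance (divides by n - 1)
sample_sd = stdev(data)
# Population standard deviation: square root of the population variance (divides by n)
population_sd = pstdev(data)

print("Sample standard deviation:", sample_sd)
print("Population standard deviation:", population_sd)
print("Square root of variance():", sqrt(variance(data)))
```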
Quartiles:
Quartiles and percentiles are measures of variation, which describe how spread out the data
is.
Quartiles and percentiles are both types of quantiles.
Quartiles are values that separate the data into four equal parts.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing
the quartiles:
The quartiles (Q0,Q1,Q2,Q3,Q4) are the values that separate each quarter.
Between Q0 and Q1 are the 25% lowest values in the data. Between Q1 and Q2 are the next
25%. And so on.
Example:
In Python, use the NumPy quantile() function to find the quartiles of the values
13, 21, 21, 40, 42, 48, 55, 72:
import numpy
values = [13,21,21,40,42,48,55,72]
x = numpy.quantile(values, [0,0.25,0.5,0.75,1])
print(x)
Output:
[13. 21. 41. 49.75 72. ]
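# A simple linear regression plot can be made in Python using Scikit-learn and seaborn's regplot.
from sklearn.datasets import load_diabetes
import seaborn as sns
import matplotlib.pyplot as plt

# Load the diabetes data as pandas objects and keep only the BMI feature
X, y = load_diabetes(return_X_y=True, as_frame=True)
X = X[['bmi']]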
sns.regplot(x=X, y=y)
plt.title('Simple Linear Regression on Diabetes', fontsize=20)
plt.show()
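The plot draws the fitted line but does not report its coefficients. As a hedged sketch, the same line can be fitted numerically with scikit-learn's LinearRegression. The choice of LinearRegression and the variable names slope, intercept and r_squared are assumptions, not part of the manual's listing.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Load the diabetes data and keep only the BMI feature, as in the listing above
X, y = load_diabetes(return_X_y=True, as_frame=True)
X = X[['bmi']]

# Fit y = intercept + slope * bmi by ordinary least squares
model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_
# Coefficient of determination, computed on the same data used for fitting
r_squared = model.score(X, y)

print("Slope:", slope)
print("Intercept:", intercept)
print("R squared:", r_squared)
```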
Output:
For this program, create a CSV file named Dataimage.csv with the following data:
Name        Nationality  Sex     Age  Handedness
Kathy       Canada       Female  23   Right
nina        Canada       Female  18   Right
Peter       Canada       Male    19   Right
John        Canada       Male    22   Left
Fatima      London       Female  31   Left
Elizhabeth  London       Female  25   Left
Dhaval      India        Male    35   Left
Sudhir      India        Male    31   Left
Sanju       India        Male    18   Right
Yan         China        Female  52   Right
Juan        China        Female  58   Left
Liang       China        Male    43   Left
import pandas as pd
# A raw string keeps the backslashes in the Windows path from being read as escape sequences
df = pd.read_csv(r'F:\Datascience\Dataimage.csv')
# The crosstab function is called with two parameters, and the table is stored in a variable
Output:
# Creating barplot
barplot = crosstb.plot.bar(rot=0)
# We can also create a crosstab with more than two variables.
import pandas as pd
df = pd.read_csv(r'F:\Datascience\Dataimage.csv')
# Bar plotting
a = crosstb.plot(kind='bar', rot=0)
a.legend(title='Handedness', bbox_to_anchor=(1, 1.02), loc='upper left')
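Both listings plot a variable named crosstb without showing how it is built. A minimal sketch of that step follows, assuming pd.crosstab() with Nationality as rows and Handedness as columns. The row and column choice is an assumption, suggested only by the legend title. The data from the table is built in memory instead of being read from Dataimage.csv, and crosstb_multi is a hypothetical name for the variant with more than two variables.

```python
import pandas as pd

# The columns needed from the table above, built in memory (row order matches the table)
df = pd.DataFrame({
    "Nationality": ["Canada"] * 4 + ["London"] * 2 + ["India"] * 3 + ["China"] * 3,
    "Sex": ["Female", "Female", "Male", "Male", "Female", "Female",
            "Male", "Male", "Male", "Female", "Female", "Male"],
    "Handedness": ["Right", "Right", "Right", "Left", "Left", "Left",
                   "Left", "Left", "Right", "Right", "Left", "Left"],
})

# Two variables: count the rows for each Nationality / Handedness combination
crosstb = pd.crosstab(df["Nationality"], df["Handedness"])
print(crosstb)

# More than two variables: passing a list produces hierarchical column labels
crosstb_multi = pd.crosstab(df["Nationality"], [df["Sex"], df["Handedness"]])
print(crosstb_multi)
```

Either table can then be passed to the plotting calls shown in the listings, for example crosstb.plot.bar(rot=0).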
Correlation
A correlation matrix is a covariance matrix computed on standardized variables. (The
covariance matrix itself is also known as the auto-covariance matrix, dispersion matrix,
variance matrix, or variance-covariance matrix.) It is a matrix in which the (i, j) position
holds the correlation between the ith and jth variables of the given data set. When the data
points follow a roughly straight-line trend, the variables are said to have an approximately
linear relationship. In some cases, the data points fall close to a straight line, but more
often there is quite a bit of variability of the points around the straight-line trend. A
summary measure called correlation describes the strength of the linear association.
Correlation in Python
Correlation summarizes the strength and direction of the linear (straight-line) association
between two quantitative variables. Denoted by r, it takes values between -1 and +1. A
positive value for r indicates a positive association, and a negative value for r indicates a
negative association. The closer r is to +1 or -1, the closer the data points fall to a
straight line and the stronger the linear association. The closer r is to 0, the weaker the
linear association.
Correlation
Correlation is the statistical measure of the extent to which two variables are linearly
related to each other. In statistics, correlation is defined by the Pearson correlation
formula:
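\[ r = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i} (x_i - \bar{x})^2 \, \sum_{i} (y_i - \bar{y})^2}} \]
where,
• $r$: Correlation coefficient
• $x_i$: i-th value of the first dataset X
• $\bar{x}$: Mean of the first dataset X
• $y_i$: i-th value of the second dataset Y
• $\bar{y}$: Mean of the second dataset Y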
Condition: The datasets X and Y must have the same length.
The correlation value can be positive, negative, or zero.
Program 9 : Correlation
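import numpy as np

def Pearson_correlation(X, Y):
    # X and Y must be NumPy arrays of the same length
    if len(X) == len(Y):
        Sum_xy = sum((X - X.mean()) * (Y - Y.mean()))
        Sum_x_squared = sum((X - X.mean()) ** 2)
        Sum_y_squared = sum((Y - Y.mean()) ** 2)
        corr = Sum_xy / np.sqrt(Sum_x_squared * Sum_y_squared)
        return corr
    raise ValueError("X and Y must have the same length")

# x and y are NumPy arrays whose definitions are not reproduced in this listing
print(Pearson_correlation(x, y))
print(Pearson_correlation(x, x))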
Output:
# We can also find the correlation by using the numpy corrcoef function
print(np.corrcoef(x, y))
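A self-contained sketch that computes Pearson's r from its definition and compares it with np.corrcoef(). The arrays x and y here are illustrative values, not the data used in the manual.

```python
import numpy as np

# Illustrative data (an assumption, not taken from the manual)
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Pearson's r from its definition: deviations from the mean of each array
dx = x - x.mean()
dy = y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal entry
r_matrix = np.corrcoef(x, y)
r_numpy = r_matrix[0, 1]

print("Manual r:", r_manual)
print("NumPy r:", r_numpy)
```

The diagonal entries of the matrix returned by np.corrcoef() are the correlations of x with x and of y with y, so they equal 1.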
Output:
import pandas as pd
from sklearn.datasets import load_diabetes
import seaborn as sns
import matplotlib.pyplot as plt
Output:
Finding the Pearson correlation matrix by using the pandas method df.corr()
Syntax
df.corr(method, min_periods, numeric_only)
method : One of {'pearson', 'kendall', 'spearman'}. 'pearson', the standard correlation
coefficient, is the default.
min_periods : int, optional. Defines the minimum number of observations required per pair.
numeric_only : bool. If True, only float, int, and boolean columns are included.
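A minimal sketch of calling df.corr(), assuming the diabetes data set from scikit-learn. This choice is inferred from the imports above and from the age and sex columns used later, and is not confirmed by the manual. The name corr_matrix is hypothetical.

```python
from sklearn.datasets import load_diabetes

# Load the diabetes data as one DataFrame: 10 feature columns plus the target column
df = load_diabetes(as_frame=True).frame

# Pearson correlation between every pair of columns
corr_matrix = df.corr(method='pearson')

print(corr_matrix.round(2))
```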
Output:
The above table shows the correlation between each pair of columns in the data frame. The
correlation of a column with itself is 1.0. A negative correlation indicates a negative
relationship: as one column's values increase, the other's tend to decrease, and vice versa.
A correlation of zero indicates no linear relationship. A positive correlation indicates a
positive relationship: as one column's values increase, the other's tend to increase as well.
# We can also find the correlation between two columns using numpy
# Correlation between the age and sex columns
c = np.corrcoef(df['age'],df['sex'])
print('Correlations between age and sex\n',c)
Output:
darker colors are preferred. Heatmap is also defined by the name of the shading matrix.
Heatmaps in Seaborn can be plotted by using the seaborn.heatmap() function.
seaborn.heatmap()
Output:
Program 11: Program for finding the correlation coefficients and a heatmap on the Iris data set
import pandas as pd
from sklearn.datasets import load_iris
import seaborn as sns
import matplotlib.pyplot as plt
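A hedged sketch of how such a program might continue, assuming the Iris data is loaded as a DataFrame with load_iris(as_frame=True). The names iris_df and iris_corr, the colour map, and the plot title are assumptions, not taken from the manual.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Load the four Iris measurements as a DataFrame
iris_df = load_iris(as_frame=True).data

# Pairwise Pearson correlation coefficients between the measurements
iris_corr = iris_df.corr()
print(iris_corr)

# Draw the matrix as a heatmap, writing each coefficient in its cell
ax = sns.heatmap(iris_corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
ax.set_title('Correlation heatmap of the Iris data set')
plt.show()
```

Fixing vmin and vmax to -1 and +1 keeps the colour scale symmetric, so that a correlation of zero maps to the neutral middle colour.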
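# A pie chart helps us visualize the percentage of the data belonging to each category.
# Import libraries
from matplotlib import pyplot as plt
import numpy as np
# data is a pandas DataFrame loaded in a part of the listing that is not reproduced here
# Count the rows in each STATUS_YEAR category
x = data['STATUS_YEAR'].value_counts()
# Draw one slice per category, labelled with its percentage
plt.pie(x.values, labels=x.index, autopct='%1.1f%%')
plt.show()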
Multivariate Analysis
It is an extension of bivariate analysis: it involves multiple variables at the same time in
order to find correlations among them. Multivariate analysis is a set of statistical models
that examine patterns in multidimensional data by considering several variables at once.
# Multivariate analysis
import seaborn as sns
from sklearn import datasets, decomposition

iris = datasets.load_iris()
X = iris.data
y = iris.target
# Project the four Iris features onto the first two principal components
pca = decomposition.PCA(n_components=2)
X = pca.fit_transform(X)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y)
# The values of correlation can vary from -1 to 1, where -1 means a strong negative and
# +1 means a strong positive correlation.
# data is a pandas DataFrame loaded in a part of the listing that is not reproduced here
sns.heatmap(data.corr(), annot=True)