
MACHINE LEARNING AND DATA MINING

TEAM 27
P Viswa Teja Reddy CSE18095
Sai Charan K M CSE18109
Sai Teja Prasanth V CSE18111
S Vikas Reddy CSE18113
S Sura Reddy CSE18114
DATASET EXPLORATION

• Abstract:
Experimental data are used for binary classification to predict whether the
temperature at a particular place in the ocean is greater than the average
temperature or less.
• The dataset used is CalCOFI, which contains over 60 years of oceanographic data.
• The attributes used from this dataset are:
• Depth
• Temperature
• Salinity
• O2 Saturation Level
• No of instances: 2000
• No of attributes: 5
BASIC IMPORTS
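The imports on this slide are not recoverable from the extracted text; a plausible reconstruction covering what the later slides use (the file name bottle.csv is an assumption):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split, KFold

# CalCOFI bottle data, assumed to be available as a local CSV
df = pd.read_csv('bottle.csv')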
DBSCAN

• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm commonly used in machine learning.


• Two attributes, temperature and salinity, and 3000 instances are considered.
• After removing NaN values, 2722 instances remain for clustering.
• MinMaxScaler is used to normalise the data, as sketched below.
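A minimal sketch of this preprocessing, assuming the CalCOFI column names 'Salnty' and 'T_degC' that appear later in the deck, and df as loaded earlier:

from sklearn.preprocessing import MinMaxScaler

# first 3000 instances with the two attributes of interest
data = df[['Salnty', 'T_degC']].head(3000)

# drop rows with NaN values; the slide reports 2722 remaining instances
data = data.dropna()

# scale both attributes into [0, 1]
X = MinMaxScaler().fit_transform(data)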

DBSCAN

• To find a suitable value of eps, a graph of sorted nearest-neighbour distances is plotted.
• From the graph it is clear that an epsilon value of 0.025 is optimal.
• Different values of min_samples were tried and the Silhouette Coefficient was computed for each; min_samples = 25 returned the highest value (see the sketch below).
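A sketch of the eps selection and clustering steps, assuming X is the scaled array from the previous slide; using k = min_samples for the distance graph is an assumption:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# k-distance graph: sorted distance of each point to its 25th neighbour;
# the "knee" of this curve suggests a good eps
distances, _ = NearestNeighbors(n_neighbors=25).fit(X).kneighbors(X)
plt.plot(np.sort(distances[:, -1]))
plt.ylabel('distance to 25th nearest neighbour')
plt.show()

# cluster with the values reported on the slide and score the result
labels = DBSCAN(eps=0.025, min_samples=25).fit_predict(X)
print(silhouette_score(X, labels))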
KNN CLASSIFICATION

• Finds the K points in the training set that are nearest to the given test input and counts how many members of each class are in this set.
• For classification it assigns the majority class; for regression it returns the mean of the neighbours' target values.
• KNN is a lazy learner: it stores the training data and defers all computation to prediction time.
KNN CLASSIFICATION

• The first 2000 instances of the dataset and only the attributes which
determine temperature are taken.
• These are classified into 2 classes.
• Class 1, where the temperature of the ocean is greater than or equal to the average.
• Class 2, where the temperature of the ocean is less than the average.
• KNN classifier is used to predict which class the temperature falls under.
• The observed min, mean, and max temperatures are 2.78 °C, 9.26 °C, and 19.76 °C.
• There are 925 records of class 1 and 1022 records of class 2.
• A graph for k vs accuracy is plotted.
• Maximum accuracy is obtained for k=5.
• The observed precision, accuracy, recall, and F1 score are as below.
• The confusion matrix is as follows:

 202    5
   2  181
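A minimal sketch of this experiment; the slide does not name the temperature-determining feature columns, so Depthm, Salnty, and STheta are assumptions:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# first 2000 instances; the feature columns here are an assumption
cols = ['Depthm', 'Salnty', 'STheta', 'T_degC']
data = df[cols].head(2000).dropna()

# class 1: temperature >= average; class 2: temperature < average
y = (data['T_degC'] < data['T_degC'].mean()).astype(int) + 1
X = data.drop(columns='T_degC')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# accuracy over a range of k; the slide reports the maximum at k = 5
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, accuracy_score(y_test, knn.predict(X_test)))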
LINEAR REGRESSION

• The first 1000 instances of the dataset are taken.


• Temperature is predicted with the help of linear regression.
• StandardScaler is used to normalize the dataset.
SINGLE VARIABLE

Two attributes were considered here:
Salinity and Temperature.
Temperature is predicted from salinity.

salt_temp = df[['Salnty', 'T_degC']]

RMSE: 0.53
Mean absolute error: 0.41
R2-score: 0.72
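A sketch of the single-variable fit and its metrics, assuming salt_temp from the snippet above with NaNs dropped (the slide also mentions StandardScaler, omitted here for brevity):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

salt_temp = df[['Salnty', 'T_degC']].head(1000).dropna()
X = salt_temp[['Salnty']]
y = salt_temp['T_degC']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
pred = LinearRegression().fit(X_train, y_train).predict(X_test)

print('RMSE:', np.sqrt(mean_squared_error(y_test, pred)))
print('MAE :', mean_absolute_error(y_test, pred))
print('R2  :', r2_score(y_test, pred))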
MULTI VARIABLE

Four attributes were considered here:

col_list = ["Depthm", "Salnty", "STheta", "T_degC"]

Temperature is predicted from depth, salinity, and STheta.

RMSE: 0.23
Mean absolute error: 0.18
R2-score: 0.94
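The multi-variable version differs only in the feature matrix; a sketch under the same assumptions as above:

from sklearn.linear_model import LinearRegression

col_list = ["Depthm", "Salnty", "STheta", "T_degC"]
multi = df[col_list].head(1000).dropna()

# three predictors instead of one; the target remains temperature
X = multi[["Depthm", "Salnty", "STheta"]]
y = multi["T_degC"]
model = LinearRegression().fit(X, y)
# evaluate with RMSE / MAE / R2 exactly as in the single-variable sketch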
NAIVE BAYES CLASSIFICATION

• Naïve Bayes is a supervised machine learning algorithm that uses probability to predict which class an instance belongs to.
• It computes the probability of an instance belonging to each of several classes.
• The algorithm assumes that the attributes are independent of each other, which gives the factored posterior below.
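In symbols, the independence assumption lets the class posterior factor into per-attribute terms. For a class C and attribute values x_1, ..., x_n:

$$P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$$

The predicted class is the one that maximises this product.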
NAIVE BAYES CLASSIFICATION

• The first 2000 instances of the dataset and only the attributes which determine
temperature are taken.
• These are classified into 2 classes.
• Class 1, where the temperature of the ocean is greater than or equal to the average.
• Class 2, where the temperature of the ocean is less than the average.
• Naïve Bayes is used to predict which class the temperature falls under.
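A sketch with scikit-learn's GaussianNB, assuming the same X and y built in the KNN sketch earlier:

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# X, y as constructed in the KNN sketch above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

nb = GaussianNB().fit(X_train, y_train)
pred = nb.predict(X_test)

print(classification_report(y_test, pred))   # precision, recall, F1, accuracy
print(confusion_matrix(y_test, pred))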
NAIVE BAYES CLASSIFICATION WITH
K-FOLD

• The observed precision, accuracy, recall, and F1 score are as below.
• The confusion matrix is as follows:

 968   54
  34  891
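The fold-wise confusion matrices appear to be summed here (the totals cover all ~1947 usable instances); a sketch of that pattern, with n_splits = 5 as an assumption:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# sum the confusion matrix across folds so that every instance is
# counted exactly once as a test point
cm = np.zeros((2, 2), dtype=int)
for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(X):
    nb = GaussianNB().fit(X.iloc[train_idx], y.iloc[train_idx])
    cm += confusion_matrix(y.iloc[test_idx], nb.predict(X.iloc[test_idx]))
print(cm)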
NAIVE BAYES CLASSIFICATION WITH MINMAX
SCALER

• The observed precision, accuracy, recall, and F1 score are as below.
• The confusion matrix is as follows:

 204   12
   7  167
NAIVE BAYES CLASSIFICATION WITH MINMAX
AND K-FOLD

• The observed precision, accuracy, recall, and F1 score are as below.
• The confusion matrix is as follows:

 969   53
  32  893
DECISION TREE

• A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

DECISION TREE

• The first 2000 instances of the dataset and only the attributes which
determine temperature are taken.
• These are classified into 2 classes.
• Class 1, where the temperature of the ocean is greater than or equal to the average.
• Class 2, where the temperature of the ocean is less than the average.
• A decision tree classifier is used to predict which class the temperature falls under, as sketched below.
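A minimal sketch, again assuming the X and y from the KNN sketch; reading the slide's "0.8" as the training fraction is an assumption:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# X, y as constructed in the KNN sketch above
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

tree = DecisionTreeClassifier().fit(X_train, y_train)
print(confusion_matrix(y_test, tree.predict(X_test)))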
DECISION TREE WITH K-FOLD AND MINMAX SCALER

• The observed precision, accuracy, recall, and F1 score are as below.
• The confusion matrix is as follows:

 1543     0
    1  1378
DECISION TREE WITH A TRAIN/TEST SPLIT OF 0.8

• The observed precision, accuracy, recall, and F1 score are as below.
• The confusion matrix is as follows:

 479    0
   0  398
SVM CLASSIFICATION

• A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes.
• SVM divides the dataset into classes by searching for a maximum-margin hyperplane.
• The data points closest to the hyperplane are called support vectors.
• The distance between the hyperplane and the support vectors is called the margin.
• The goal of SVM is therefore to find the hyperplane with the maximum margin.
SVM CLASSIFICATION

• The first 2000 instances of the dataset and only the attributes which
determine temperature are taken.
• These are classified into 2 classes.
• Class 1, where the temperature of the ocean is greater than or equal to the average.
• Class 2, where the temperature of the ocean is less than the average.
• SVM is used to predict which class the temperature falls under, as sketched below.
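A sketch with scikit-learn's SVC, assuming the same X and y as before; the later slides repeat this with different kernels:

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# X, y as constructed in the KNN sketch above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# the linear-kernel variant; swap kernel='sigmoid' or kernel='rbf'
# for the comparisons on the following slides
svm = SVC(kernel='linear').fit(X_train, y_train)
print(confusion_matrix(y_test, svm.predict(X_test)))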
SVM CLASSIFICATION WITH LINEAR KERNEL

• The observed precision, accuracy, recall, and F1 score are as below.
• The confusion matrix is as follows:

 205    0
   1  184
SVM CLASSIFICATION WITH MINMAX
NORMALIZATION

• The observed precision, accuracy, recall, and F1 score are as below.
• The confusion matrix is as follows:

 198    3
   1  188
SVM CLASSIFICATION WITH MINMAX
NORMALIZATION AND K-FOLD

• The confusion matrices for the three kernels are as follows:

LINEAR          SIGMOID         RBF
 998   24       1015    7       999   23
  27  898         10  915        32  893
COMPARISON OF CLASSIFICATION
ALGORITHMS FOR THE DATASET

• The accuracies of all three classification models are similar.
• The confusion matrix helps differentiate their performance.

KNN: ACCURACY 98.2%
NAÏVE BAYES: ACCURACY 97.6%
SVM: ACCURACY 93%

SVM confusion matrix:
 168   18
   9  195


PRINCIPAL COMPONENT ANALYSIS(PCA)

• PCA is a dimensionality-reduction method that is often used to reduce the dimensionality of large datasets.
• Reducing attributes loses some accuracy, but the idea of dimensionality reduction is to trade a little accuracy for simplicity.
• The attributes used here for dimensionality reduction are:
Depthm, T_degC, Salnty, STheta, R_Depth, R_TEMP
STEPS IN PCA

1. STANDARDIZATION

2. COVARIANCE MATRIX COMPUTATION

3. COMPUTE THE EIGENVECTORS AND EIGENVALUES

4. FINDING THE PRINCIPAL COMPONENTS


STANDARDIZATION

StandardScaler is used to standardise the data.

from sklearn.preprocessing import StandardScaler

# standardise every attribute to zero mean and unit variance
# (df is assumed to hold the six selected attributes with NaNs dropped)
scaler = StandardScaler()
x = scaler.fit_transform(df)
COVARIANCE MATRIX COMPUTATION
AND COMPUTING EIGENVALUES,
EIGENVECTORS

import numpy as np

# computing the covariance matrix (np.cov expects features as rows,
# hence the transpose)
features = x.T
cov_matrix = np.cov(features)

# computing eigenvalues and eigenvectors of the covariance matrix
eigen_values, eigen_vectors = np.linalg.eig(cov_matrix)
FINDING THE PRINCIPAL COMPONENTS

# calculating the fraction of total variance explained by each component
variances = []
for i in range(len(eigen_values)):
    variances.append(eigen_values[i] / np.sum(eigen_values))
print(variances)

# top 3 principal components, together explaining about 98.7% of the variance:
# 0.8705379494684057, 0.06510912776420882, 0.05158786438850016
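A sketch of how the data can then be projected onto the top components for the plots on the next slide, assuming x, eigen_values, and eigen_vectors from the snippets above:

import numpy as np

# np.linalg.eig returns eigenvectors as columns and does not sort them,
# so order them by decreasing eigenvalue before projecting
order = np.argsort(eigen_values)[::-1]
W = eigen_vectors[:, order[:3]]   # six attributes -> three components
projected = x @ W                 # shape: (n_samples, 3)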
VISUALISING THE DATA

Three scatter plots of the projected data: using the first and second, the second and third, and the first and third principal components.
K MEANS

K-means is one of the most popular unsupervised machine learning algorithms.

K-means segregates unlabeled data into clusters based on similar features and
common patterns.

3000 instances of the dataset are considered here, along with two attributes:
Salinity and Temperature.

After removing the NaN values, 2972 instances are left for clustering.

MinMaxScaler is used to normalise the data.
K MEANS

A plot of SSE vs k (the elbow method) is shown here.

The elbow of the curve makes it clear that the optimal value for k is 3.
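A minimal sketch of this workflow with scikit-learn's KMeans; the range of k values tried is an assumption:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

data = df[['Salnty', 'T_degC']].head(3000).dropna()
X = MinMaxScaler().fit_transform(data)

# SSE (inertia) for a range of k values, for the elbow plot
sse = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in range(1, 10)]
plt.plot(range(1, 10), sse, marker='o')
plt.xlabel('k')
plt.ylabel('SSE')
plt.show()

# final clustering with the elbow value k = 3
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)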
THANK YOU
