
MACHINE LEARNING AND DATA MINING

TEAM 27
P Viswa Teja Reddy CSE18095
Sai Charan K M CSE18109
Sai Teja Prasanth V CSE18111
S Vikas Reddy CSE18113
S Sura Reddy CSE18114
DATASET EXPLORATION

• Abstract:
Experimental data are used for binary classification to predict whether the
temperature at a particular place in the ocean is greater than the average
temperature or less.
• The dataset used is CalCOFI, which contains over 60 years of oceanographic data.
• The attributes used from this dataset are:
• Depth
• Temperature
• Salinity
• O2 Saturation Level
• No of instances: 2000
• No of attributes: 5
BASIC IMPORTS
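The imports on this slide are not recoverable from the extracted text; a plausible reconstruction covering what the later slides use (the file name bottle.csv is an assumption):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split, KFold

# CalCOFI bottle data, assumed to be available as a local CSV
df = pd.read_csv('bottle.csv')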
DBSCAN

• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm commonly used in machine learning.


• Two attributes, temperature and salinity, and 3000 instances are considered.
• After removing NaN values, 2722 instances remain for clustering.
• MinMaxScaler is used to normalise the data, as sketched below.
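A minimal sketch of this preprocessing, assuming the CalCOFI column names 'Salnty' and 'T_degC' that appear later in the deck, and df as loaded earlier:

from sklearn.preprocessing import MinMaxScaler

# first 3000 instances with the two attributes of interest
data = df[['Salnty', 'T_degC']].head(3000)

# drop rows with NaN values; the slide reports 2722 remaining instances
data = data.dropna()

# scale both attributes into [0, 1]
X = MinMaxScaler().fit_transform(data)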

DBSCAN

• To find a suitable value of eps, a graph of sorted nearest-neighbour distances is plotted.
• From the graph it is clear that an epsilon value of 0.025 is optimal.
• Different values of min_samples were tried and the Silhouette Coefficient was computed for each; min_samples = 25 returned the highest value (see the sketch below).
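A sketch of the eps selection and clustering steps, assuming X is the scaled array from the previous slide; using k = min_samples for the distance graph is an assumption:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# k-distance graph: sorted distance of each point to its 25th neighbour;
# the "knee" of this curve suggests a good eps
distances, _ = NearestNeighbors(n_neighbors=25).fit(X).kneighbors(X)
plt.plot(np.sort(distances[:, -1]))
plt.ylabel('distance to 25th nearest neighbour')
plt.show()

# cluster with the values reported on the slide and score the result
labels = DBSCAN(eps=0.025, min_samples=25).fit_predict(X)
print(silhouette_score(X, labels))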
KNN CLASSIFICATION

• Finds the K points in the training set that are nearest to the given test input and counts how many members of each class are in this set.
• For classification it assigns the majority class; for regression it returns the mean of the neighbours' target values.
• KNN is a lazy learner: it stores the training data and defers all computation to prediction time.
KNN CLASSIFICATION

• The first 2000 instances of the dataset and only the attributes which
determine temperature are taken.
• These are classified into 2 classes.
• Class 1, where the temperature of the ocean is greater than or equal to the average.
• Class 2, where the temperature of the ocean is less than the average.
• KNN classifier is used to predict which class the temperature falls under.
• The observed min, mean, and max temperatures are 2.78 °C, 9.26 °C, and 19.76 °C.
• There are 925 records of class 1 and 1022 records of class 2.
• A graph for k vs accuracy is plotted.
• Maximum accuracy is obtained for k=5.
• The observed precision, accuracy, recall, and F1 score are as below.
• The confusion matrix is as follows:

 202    5
   2  181
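A minimal sketch of this experiment; the slide does not name the temperature-determining feature columns, so Depthm, Salnty, and STheta are assumptions:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# first 2000 instances; the feature columns here are an assumption
cols = ['Depthm', 'Salnty', 'STheta', 'T_degC']
data = df[cols].head(2000).dropna()

# class 1: temperature >= average; class 2: temperature < average
y = (data['T_degC'] < data['T_degC'].mean()).astype(int) + 1
X = data.drop(columns='T_degC')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# accuracy over a range of k; the slide reports the maximum at k = 5
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, accuracy_score(y_test, knn.predict(X_test)))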
LINEAR REGRESSION

• The first 1000 instances of the dataset are taken.


• Temperature is predicted with the help of linear regression.
• StandardScaler is used to normalize the dataset.
SINGLE VARIABLE

Two attributes were considered here:
Salinity and Temperature.
Temperature is predicted from salinity.

salt_temp = df[['Salnty', 'T_degC']]

RMSE: 0.53
Mean absolute error: 0.41
R2-score: 0.72
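A sketch of the single-variable fit and its metrics, assuming salt_temp from the snippet above with NaNs dropped (the slide also mentions StandardScaler, omitted here for brevity):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

salt_temp = df[['Salnty', 'T_degC']].head(1000).dropna()
X = salt_temp[['Salnty']]
y = salt_temp['T_degC']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
pred = LinearRegression().fit(X_train, y_train).predict(X_test)

print('RMSE:', np.sqrt(mean_squared_error(y_test, pred)))
print('MAE :', mean_absolute_error(y_test, pred))
print('R2  :', r2_score(y_test, pred))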
MULTI VARIABLE

Four attributes were considered here:

col_list = ["Depthm", "Salnty", "STheta", "T_degC"]

Temperature is predicted from depth, salinity, and STheta.

RMSE: 0.23
Mean absolute error: 0.18
R2-score: 0.94
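The multi-variable version differs only in the feature matrix; a sketch under the same assumptions as above:

from sklearn.linear_model import LinearRegression

col_list = ["Depthm", "Salnty", "STheta", "T_degC"]
multi = df[col_list].head(1000).dropna()

# three predictors instead of one; the target remains temperature
X = multi[["Depthm", "Salnty", "STheta"]]
y = multi["T_degC"]
model = LinearRegression().fit(X, y)
# evaluate with RMSE / MAE / R2 exactly as in the single-variable sketch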
NAIVE BAYES CLASSIFICATION

• Naïve Bayes is a supervised machine learning algorithm that uses probability to predict which class an instance belongs to.
• It computes the probability of an instance belonging to each of several classes.
• The algorithm assumes that the attributes are independent of each other, which gives the factored posterior below.
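In symbols, the independence assumption lets the class posterior factor into per-attribute terms. For a class C and attribute values x_1, ..., x_n:

$$P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$$

The predicted class is the one that maximises this product.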
NAIVE BAYES CLASSIFICATION

• The first 2000 instances of the dataset and only the attributes which determine
temperature are taken.
• These are classified into 2 classes.
• Class 1, where the temperature of the ocean is greater than or equal to the average.
• Class 2, where the temperature of the ocean is less than the average.
• Naïve Bayes is used to predict which class the temperature falls under.
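A sketch with scikit-learn's GaussianNB, assuming the same X and y built in the KNN sketch earlier:

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# X, y as constructed in the KNN sketch above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

nb = GaussianNB().fit(X_train, y_train)
pred = nb.predict(X_test)

print(classification_report(y_test, pred))   # precision, recall, F1, accuracy
print(confusion_matrix(y_test, pred))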
NAIVE BAYES CLASSIFICATION WITH
K-FOLD

• The observed precision, accuracy, recall, and F1 score are as below.
• The confusion matrix is as follows:

 968   54
  34  891
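The fold-wise confusion matrices appear to be summed here (the totals cover all ~1947 usable instances); a sketch of that pattern, with n_splits = 5 as an assumption:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# sum the confusion matrix across folds so that every instance is
# counted exactly once as a test point
cm = np.zeros((2, 2), dtype=int)
for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(X):
    nb = GaussianNB().fit(X.iloc[train_idx], y.iloc[train_idx])
    cm += confusion_matrix(y.iloc[test_idx], nb.predict(X.iloc[test_idx]))
print(cm)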
NAIVE BAYES CLASSIFICATION WITH MINMAX
SCALER

• The observed precision, accuracy, recall, and F1 score are as below.
• The confusion matrix is as follows:

 204   12
   7  167
NAIVE BAYES CLASSIFICATION WITH MINMAX
AND K-FOLD

• The observed precision, accuracy, recall, and F1 score are as below.
• The confusion matrix is as follows:

 969   53
  32  893
DECISION TREE

• A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

DECISION TREE

• The first 2000 instances of the dataset and only the attributes which
determine temperature are taken.
• These are classified into 2 classes.
• Class 1, where the temperature of the ocean is greater than or equal to the average.
• Class 2, where the temperature of the ocean is less than the average.
• A decision tree classifier is used to predict which class the temperature falls under, as sketched below.
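A minimal sketch, again assuming the X and y from the KNN sketch; reading the slide's "0.8" as the training fraction is an assumption:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# X, y as constructed in the KNN sketch above
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

tree = DecisionTreeClassifier().fit(X_train, y_train)
print(confusion_matrix(y_test, tree.predict(X_test)))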
DECISION TREE WITH K-FOLD AND MINMAX SCALER

• The observed precision, accuracy, recall, and F1 score are as below.
• The confusion matrix is as follows:

 1543     0
    1  1378
DECISION TREE WITH A TRAIN/TEST SPLIT OF 0.8

• The observed precision, accuracy, recall, and F1 score are as below.
• The confusion matrix is as follows:

 479    0
   0  398
SVM CLASSIFICATION

• A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes.
• SVM divides the dataset into classes by searching for a maximum-margin hyperplane.
• The data points closest to the hyperplane are called support vectors.
• The distance between the hyperplane and the support vectors is called the margin.
• The goal of SVM is therefore to find the hyperplane with the maximum margin.
SVM CLASSIFICATION

• The first 2000 instances of the dataset and only the attributes which
determine temperature are taken.
• These are classified into 2 classes.
• Class 1, where the temperature of the ocean is greater than or equal to the average.
• Class 2, where the temperature of the ocean is less than the average.
• SVM is used to predict which class the temperature falls under, as sketched below.
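A sketch with scikit-learn's SVC, assuming the same X and y as before; the later slides repeat this with different kernels:

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# X, y as constructed in the KNN sketch above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# the linear-kernel variant; swap kernel='sigmoid' or kernel='rbf'
# for the comparisons on the following slides
svm = SVC(kernel='linear').fit(X_train, y_train)
print(confusion_matrix(y_test, svm.predict(X_test)))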
SVM CLASSIFICATION WITH LINEAR KERNEL

• The observed precision, accuracy, recall, and F1 score are as below.
• The confusion matrix is as follows:

 205    0
   1  184
SVM CLASSIFICATION WITH MINMAX
NORMALIZATION

• The observed precision, accuracy, recall, and F1 score are as below.
• The confusion matrix is as follows:

 198    3
   1  188
SVM CLASSIFICATION WITH MINMAX
NORMALIZATION AND K-FOLD

• The confusion matrices for the three kernels are as follows:

LINEAR          SIGMOID         RBF
 998   24       1015    7       999   23
  27  898         10  915        32  893
COMPARISON OF CLASSIFICATION
ALGORITHMS FOR THE DATASET

• The accuracies of all three classification models are similar.
• The confusion matrix helps differentiate their performance.

KNN: ACCURACY 98.2%
NAÏVE BAYES: ACCURACY 97.6%
SVM: ACCURACY 93%

SVM confusion matrix:
 168   18
   9  195


PRINCIPAL COMPONENT ANALYSIS(PCA)

• PCA is a dimensionality-reduction method that is often used to reduce the dimensionality of large datasets.
• Reducing attributes loses some accuracy, but the idea of dimensionality reduction is to trade a little accuracy for simplicity.
• The attributes used here for dimensionality reduction are:
Depthm, T_degC, Salnty, STheta, R_Depth, R_TEMP
STEPS IN PCA

1. STANDARDIZATION

2. COVARIANCE MATRIX COMPUTATION

3. COMPUTE THE EIGENVECTORS AND EIGENVALUES

4. FINDING THE PRINCIPAL COMPONENTS


STANDARDIZATION

StandardScaler is used to standardise the data.

from sklearn.preprocessing import StandardScaler

# standardise every attribute to zero mean and unit variance
# (df is assumed to hold the six selected attributes with NaNs dropped)
scaler = StandardScaler()
x = scaler.fit_transform(df)
COVARIANCE MATRIX COMPUTATION
AND COMPUTING EIGENVALUES,
EIGENVECTORS

import numpy as np

# computing the covariance matrix (np.cov expects features as rows,
# hence the transpose)
features = x.T
cov_matrix = np.cov(features)

# computing eigenvalues and eigenvectors of the covariance matrix
eigen_values, eigen_vectors = np.linalg.eig(cov_matrix)
FINDING THE PRINCIPAL COMPONENTS

# calculating the fraction of total variance explained by each component
variances = []
for i in range(len(eigen_values)):
    variances.append(eigen_values[i] / np.sum(eigen_values))
print(variances)

# top 3 principal components, together explaining about 98.7% of the variance:
# 0.8705379494684057, 0.06510912776420882, 0.05158786438850016
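A sketch of how the data can then be projected onto the top components for the plots on the next slide, assuming x, eigen_values, and eigen_vectors from the snippets above:

import numpy as np

# np.linalg.eig returns eigenvectors as columns and does not sort them,
# so order them by decreasing eigenvalue before projecting
order = np.argsort(eigen_values)[::-1]
W = eigen_vectors[:, order[:3]]   # six attributes -> three components
projected = x @ W                 # shape: (n_samples, 3)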
VISUALISING THE DATA

Three scatter plots of the projected data: using the first and second, the second and third, and the first and third principal components.
K MEANS

K-means is one of the most popular unsupervised machine learning algorithms.

K-means segregates unlabeled data into clusters based on similar features and
common patterns.

3000 instances of the dataset are considered here, along with two attributes:
Salinity and Temperature.

After removing the NaN values, 2972 instances are left for clustering.

MinMaxScaler is used to normalise the data.
K MEANS

A plot of SSE vs k (the elbow method) is shown here.

The elbow of the curve makes it clear that the optimal value for k is 3.
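A minimal sketch of this workflow with scikit-learn's KMeans; the range of k values tried is an assumption:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

data = df[['Salnty', 'T_degC']].head(3000).dropna()
X = MinMaxScaler().fit_transform(data)

# SSE (inertia) for a range of k values, for the elbow plot
sse = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in range(1, 10)]
plt.plot(range(1, 10), sse, marker='o')
plt.xlabel('k')
plt.ylabel('SSE')
plt.show()

# final clustering with the elbow value k = 3
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)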
THANK YOU
