
DATA SCIENCE ASSIGNMENT - 1

SWAPNIL SAURAV

Email: ssaurav2@gitam.in
ROLL NO. HR21CSEN0114029
PhD Part Time – CSE Department (GITAM, Hyderabad)

2023

ASSIGNMENT – 1: SONAR DATA

1. About the Dataset

The Sonar dataset is a collection of 208 labeled samples of sonar signals collected by a
naval mine detection system. Each sample is composed of 60 input variables that
represent the strength of the sonar signal in various frequency bands (continuous numeric
data). The output variable is a categorical string that indicates whether the sample is
a ROCK (R) or a MINE (M).

In this assignment, we have been asked to build a logistic regression model on this
dataset to classify objects as either "R" or "M" based on the 60 attributes.

NOTE: CODE ADDED AT THE END OF THIS DOCUMENT

2. Exploratory Data Analysis

1. Correlation Matrix for 60 input variables

Figure 1: Correlation matrix for 60 input variables

Figure 1 shows the correlation matrix between all 60 input variables, plotted to
understand how they are correlated. Ideally these attributes should be uncorrelated or
only weakly correlated, but the figure shows some dark cells (moderate positive
correlation) as well as whitish cells (moderate negative correlation). Highly correlated
attributes should be removed, keeping only one from each correlated group for the
analysis.
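
As a possible follow-up (a minimal sketch, not part of the submitted code; the 0.9
cutoff and the helper name drop_correlated are assumptions chosen for illustration),
highly correlated attributes could be pruned like this:

import numpy as np
import pandas as pd

def drop_correlated(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    # Drop one attribute from every pair whose |correlation| exceeds the threshold
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)

# e.g. reduced = drop_correlated(dataset.iloc[:, :-1])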

2. Bar chart of Rock and Mine counts

Figure 2: Bar chart of Rock and Mine counts in the data

Out of the 208 observations, 111 have Mine as the outcome and the remaining 97 have
Rock. The dataset is reasonably balanced, so no additional balancing steps are needed.

3. Histogram of each attribute

Figure 3: Histogram of data of each attribute


The height of each bar represents the frequency, i.e. the number of data points that fall
within that range. Histogram plots are useful for seeing the overall pattern or shape of
the data and are often used to compare data sets. As Figure 3 shows, the bulk of the
attributes are positively skewed; a few, such as the attributes with indices 30 to 35, are
negatively skewed.
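
This visual impression can be checked numerically with pandas' skew(); a minimal
sketch, assuming the dataset variable from the code section at the end of this document
(60 numeric attribute columns followed by the label column):

skewness = dataset.iloc[:, :-1].skew()
print("Positively skewed:", int((skewness > 0).sum()), "attributes")
print("Negatively skewed:", list(skewness[skewness < 0].index))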

3. Principal Component Analysis (PCA)

Following is the explained variance ratio from the PCA (one value per principal
component, in decreasing order):

variance ratio = [3.36178655e-01 2.09779464e-01 8.32708744e-02 6.24268373e-02
 5.04292090e-02 4.15829528e-02 3.83488399e-02 2.43051915e-02
 2.18704490e-02 1.87415946e-02 1.52565081e-02 1.29500940e-02
 1.14982420e-02 9.29021862e-03 8.52312983e-03 6.72627570e-03
 6.31178804e-03 4.96758002e-03 4.78424565e-03 3.98239274e-03
 3.61197940e-03 2.99174444e-03 2.58630997e-03 2.37907813e-03
 2.17290875e-03 1.89641782e-03 1.55629303e-03 1.28218703e-03
 1.17821730e-03 1.10509792e-03 9.80045702e-04 7.49242191e-04
 6.91496182e-04 6.45631399e-04 6.36023810e-04 5.74066224e-04
 5.50277443e-04 5.16568464e-04 4.11082703e-04 3.81975636e-04
 3.75648041e-04 3.24604308e-04 2.61878204e-04 2.04150636e-04
 1.78988775e-04 1.39255681e-04 9.30331450e-05 7.32757984e-05
 6.76506693e-05 4.09874856e-05 2.43932374e-05 2.32746619e-05
 1.88647189e-05 1.30954136e-05 1.04843943e-05 8.46924231e-06
 7.25432833e-06 6.51511596e-06 4.20959141e-06 2.78155017e-06]

We see that (verified numerically in the sketch below):

1. The first two principal components alone explain more than 54% of the variance
2. The first eleven components explain more than 90% of the variance
3. The first 18 components explain more than 95% of the variance
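
A minimal check of these three claims, assuming var holds pca.explained_variance_ratio_
as computed in the code section at the end:

import numpy as np

cum = np.cumsum(var)
print(cum[1])   # ~0.546: the first 2 components explain >54% of the variance
print(cum[10])  # ~0.902: the first 11 components explain >90%
print(cum[17])  # ~0.957: the first 18 components explain >95%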

4. Logistic Regression

We need to run binary logistic regression. But before that, we need a statistical test to
check whether the attributes affect the output variable at all. Since there are more than
two dependent attributes, a plain ANOVA is not sufficient; we use its multivariate
extension, the MANOVA test.

MANOVA Test:

The null hypothesis (H0) of ANOVA is that there is no difference among group means. The
alternative hypothesis (Ha) is that at least one group differs significantly from the overall
mean of the dependent variable. If we accept the null hypothesis, there is no difference
among group means; logistic regression would then not give us significant results, and we
could not trust the regression.

Rejecting the null hypothesis means accepting the alternative hypothesis, which gives us
confidence to apply machine learning techniques (including logistic regression) in this
case.

Figure 4: MANOVA Analysis

Because the p-value for the independent variables (the 60 attributes) is statistically
significant (p < 0.05), it is likely that the attributes do have a significant effect on
identifying Mine or Rock.

So we now perform logistic regression (LR) on this dataset.

The result from LR is shown in Figure 5 below:


Figure 5: Coefficient and intercept values given by logistic regression

The model gives us the coefficients for all 60 attributes plus the constant value
(intercept); together these make the predictions. Accuracy on the training data is 84.6%
and on the testing data about 75%.
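
As a sketch of how these numbers turn into a prediction (assuming the fitted model and
the X_test matrix from the code section): the log-odds are a linear combination of the 60
attributes plus the intercept, mapped to a probability by the sigmoid.

import numpy as np

z = X_test[0] @ model.coef_[0] + model.intercept_[0]
p = 1.0 / (1.0 + np.exp(-z))               # P(class = model.classes_[1]), i.e. 'R'
print(p)
print(model.predict_proba(X_test[:1]))     # second column should agree with p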

5. Learning Curves

The learning curve is shown in Figure 6:

Figure 6: Learning curve


Figure 6 shows that as the number of training samples increases, training accuracy
decreases while test accuracy increases. With only a few samples the model effectively
memorizes the training data (high training accuracy but poor generalization); as the
training set grows, the model is tuned more realistically and the two curves converge
toward the best achievable performance.

ROC Curve of the given model is shown in Figure 7:

Figure 7: ROC Curve, Logistic Regression

A receiver operating characteristic (ROC) curve is a graphical representation of the
performance of a classification model at various threshold settings. The greater the area
under the curve (AUC), the better the model is at distinguishing between the positive and
negative classes. It is used to compare the performance of different models and select
the one with the best predictive accuracy.
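
The AUC summarizing Figure 7 can also be computed directly; a minimal sketch, assuming
the fitted model, X_test and Y_test from the code section (for binary string labels,
scikit-learn treats the alphabetically larger class 'R' as positive):

from sklearn.metrics import roc_auc_score

scores = model.predict_proba(X_test)[:, 1]   # probability of class 'R'
print("AUC =", roc_auc_score(Y_test, scores))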

6. Error Analysis on Test Data

Error analysis is done by computing the confusion matrix and then calculating performance
metrics such as precision, recall, and F1-score.

Confusion Matrix: A confusion matrix is a table that is used to evaluate the performance of a
classification model. It is a table of the predicted classes compared to the actual classes in the
test data set. It allows you to see where the model is making correct predictions, and where it
is making mistakes. It also helps you to identify potential areas for improvement in the model.
The rows of the matrix represent the actual classes and the columns represent the predicted
classes. The matrix is filled with counts of the number of times an instance was predicted to
be in a certain class. The diagonal elements represent the number of correct predictions,
while the off-diagonal elements indicate the number of incorrect predictions.


The table below is the confusion matrix for the test data. It shows that 14 records were
correctly classified as Mine and 10 were correctly classified as Rock; the remaining
8 (5 + 3) were misclassified.

Classification Report: Precision, recall, F1 score and support are all metrics used to
evaluate the performance of a machine learning model (a worked example follows the list).

• Precision is the fraction of true positives among all predicted positives. It tells
us how accurate the model is when it predicts a positive outcome.
• Recall is the fraction of true positives among all actual positives. It tells us
how many of the positive outcomes the model is able to find.
• F1 score is the harmonic mean of precision and recall, so it takes both false
positives and false negatives into account.
• Support is the number of occurrences of each class in the test set; it is used
when averaging the other metrics across classes.
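
As a worked example of these formulas (purely illustrative: the 14 true positives come
from the confusion matrix above, but the 5/3 split of the 8 errors into false positives
and false negatives is an assumption for demonstration):

def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=14, fp=5, fn=3))   # (0.737, 0.824, 0.778)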

***

CODE
import numpy as np # linear algebra
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt


loc = "D:\\datasets\\sonar_csv.csv"

dataset = pd.read_csv(loc, header=None)

# Convert the 60 signal columns to numeric; the last column stays as the R/M label
cols = dataset.columns
dataset[cols[:-1]] = dataset[cols[:-1]].apply(pd.to_numeric, errors='coerce')

# Drop the first row, which held the original header text
dataset = dataset.iloc[1:, :]
print(dataset.shape)
print(dataset.dtypes)
print(dataset[60].value_counts())

X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values

print(X)
print(Y)

#EDA

#create correlation plot


print("Correlation")

print(dataset.iloc[: , :-1].corr())
import seaborn as sb
# plotting correlation heatmap
dataplot = sb.heatmap(dataset.iloc[: , :-1].corr(), cmap="YlGnBu", annot=False)

# displaying heatmap
plt.show()

# Bar chart of class counts (column 60 is the R/M label)
dataset.groupby(60)[60].count().plot.bar()
plt.show()

# histograms
dataset.hist(sharex=False, sharey=False, xlabelsize=1, ylabelsize=1,
             figsize=(12, 12), fill=True, color='#F5AC03')
plt.suptitle("Histogram of data for each attribute")  # title for the whole grid
plt.show()

# MODEL
print("=======> \n\n", Y)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.15,
                                                    random_state=1)
print(X.shape, X_train.shape, X_test.shape)

#PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=60)
principalComponents = pca.fit_transform(X_train)


principalDf = pca.transform(X_test)
var = pca.explained_variance_ratio_
print("variance ratio = ",var)

## Performing the MANOVA (multivariate ANOVA) test


from statsmodels.multivariate.manova import MANOVA

# Read the file again, this time keeping the header row; index_col=0 makes
# attribute_1 the index, which is why the formula below starts at attribute_2
df = pd.read_csv(loc, index_col=0)
print("Columns: ", df.columns)
print(df.head())

# Build the long MANOVA formula programmatically: all attributes as dependent
# variables, with the R/M Class as the explanatory factor
dep_vars = " + ".join(f"attribute_{i}" for i in range(2, 61))
maov = MANOVA.from_formula(dep_vars + " ~ Class", data=df)

print(maov.mv_test())

#####################
# Fit on the training split only; fitting on all of X would leak the test data
model = LogisticRegression()
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)
print("LR Coefficient = ", model.coef_)
print("Intercept = ", model.intercept_)

#TEST
model = LogisticRegression()
model.fit(X_train, Y_train)
x_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, x_train_prediction)
print('Accuracy on training data :', training_data_accuracy)
x_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(Y_test, x_test_prediction)
print('Accuracy on testing data :', testing_data_accuracy)

# Learning Curves
print("Now plotting Learning Curve")

from sklearn.model_selection import learning_curve, KFold
cv = KFold(166, shuffle=True)   # 166 folds leaves only 1-2 samples per test fold
model3 = LogisticRegression()

train_sizes, train_scores, test_scores = learning_curve(
    estimator=model3,
    X=X,
    y=Y,
    cv=cv,
    scoring="accuracy",   # accuracy, since the labels are the strings 'R'/'M'
    train_sizes=[1, 15, 35, 66, 99, 125, 150, 166],
    verbose=0,
)

# Plot the mean score at each training-set size
plt.plot(train_sizes, train_scores.mean(axis=1), marker='o', label='Training accuracy')
plt.plot(train_sizes, test_scores.mean(axis=1), marker='o', label='Validation accuracy')
plt.title("Learning Curve: LOGISTIC REGRESSION")
plt.xlabel("Training Set Size")
plt.ylabel("Accuracy")
plt.legend(loc="best")
plt.show()
from sklearn.model_selection import LearningCurveDisplay, ShuffleSplit

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 6), sharey=True)

common_params = {
"X": X,
"y": Y,
"train_sizes": np.linspace(0.1, 1.0, 5),
"cv": ShuffleSplit(n_splits=50, test_size=0.2, random_state=0),
"score_type": "both",
"n_jobs": 4,
"line_kw": {"marker": "o"},
"std_display_style": "fill_between",
"score_name": "Accuracy",
}

for ax_idx, estimator in enumerate([model, model3]):
    LearningCurveDisplay.from_estimator(estimator, **common_params, ax=ax[ax_idx])
    handles, label = ax[ax_idx].get_legend_handles_labels()
    ax[ax_idx].legend(handles[:2], ["Training Score", "Test Score"])
    ax[ax_idx].set_title(f"Learning Curve for {estimator.__class__.__name__}")
plt.show()
## ROC Curve
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(model, X_test, Y_test)
plt.show()

#RESULT ANALYSIS
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(Y_test, y_pred))
print(classification_report(Y_test,y_pred))
