
DATA SCIENCE ASSIGNMENT - 1

SWAPNIL SAURAV

Email: ssaurav2@gitam.in
ROLL NO. HR21CSEN0114029
PhD Part Time – CSE Department (GITAM, Hyderabad)

2023

ASSIGNMENT – 1: SONAR DATA

1. About the Dataset

The Sonar dataset is a collection of 208 labeled samples of sonar signals collected by a
naval mine detection system. Each sample is composed of 60 input variables that
represent the strength of the sonar signal in various frequency bands (continuous numeric
data). The output variable is a categorical string that indicates whether the sample is
a ROCK (R) or a MINE (M).

In this assignment, we have been asked to build a logistic regression model on this
dataset to classify objects as either "R" or "M" based on the 60 attributes.

NOTE: CODE ADDED AT THE END OF THIS DOCUMENT

2. Exploratory Data Analysis

1. Correlation Matrix for 60 input variables

Figure 1: Correlation matrix for 60 input variables

Figure 1 shows the correlation matrix between all 60 input variables, plotted to
understand how they are correlated. Ideally these attributes should be uncorrelated or
only weakly correlated, but the figure shows some dark cells (moderate positive
correlation) as well as whitish cells (moderate negative correlation). Highly correlated
attributes should be removed, keeping only one from each correlated group for the
analysis.
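
As a possible follow-up (a minimal sketch, not part of the submitted code; the 0.9
cutoff and the helper name drop_correlated are assumptions chosen for illustration),
highly correlated attributes could be pruned like this:

import numpy as np
import pandas as pd

def drop_correlated(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    # Drop one attribute from every pair whose |correlation| exceeds the threshold
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)

# e.g. reduced = drop_correlated(dataset.iloc[:, :-1])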

2. Bar chart of Rock and Mine counts

Figure 2: Bar chart of Rock and Mine counts in the data

Out of the 208 observations, 111 have Mine as the outcome and the remaining 97 have
Rock. The dataset is reasonably balanced, so no additional balancing steps are needed.

3. Histogram of each attribute

Figure 3: Histogram of data of each attribute


The height of each bar represents the frequency, i.e. the number of data points that fall
within that range. Histogram plots are useful for seeing the overall pattern or shape of
the data and are often used to compare data sets. As Figure 3 shows, the bulk of the
attributes are positively skewed; a few, such as the attributes with indices 30 to 35, are
negatively skewed.
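
This visual impression can be checked numerically with pandas' skew(); a minimal
sketch, assuming the dataset variable from the code section at the end of this document
(60 numeric attribute columns followed by the label column):

skewness = dataset.iloc[:, :-1].skew()
print("Positively skewed:", int((skewness > 0).sum()), "attributes")
print("Negatively skewed:", list(skewness[skewness < 0].index))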

3. Principal Component Analysis (PCA)

Following is the explained variance ratio from the PCA (one value per principal
component, in decreasing order):

variance ratio = [3.36178655e-01 2.09779464e-01 8.32708744e-02 6.24268373e-02
 5.04292090e-02 4.15829528e-02 3.83488399e-02 2.43051915e-02
 2.18704490e-02 1.87415946e-02 1.52565081e-02 1.29500940e-02
 1.14982420e-02 9.29021862e-03 8.52312983e-03 6.72627570e-03
 6.31178804e-03 4.96758002e-03 4.78424565e-03 3.98239274e-03
 3.61197940e-03 2.99174444e-03 2.58630997e-03 2.37907813e-03
 2.17290875e-03 1.89641782e-03 1.55629303e-03 1.28218703e-03
 1.17821730e-03 1.10509792e-03 9.80045702e-04 7.49242191e-04
 6.91496182e-04 6.45631399e-04 6.36023810e-04 5.74066224e-04
 5.50277443e-04 5.16568464e-04 4.11082703e-04 3.81975636e-04
 3.75648041e-04 3.24604308e-04 2.61878204e-04 2.04150636e-04
 1.78988775e-04 1.39255681e-04 9.30331450e-05 7.32757984e-05
 6.76506693e-05 4.09874856e-05 2.43932374e-05 2.32746619e-05
 1.88647189e-05 1.30954136e-05 1.04843943e-05 8.46924231e-06
 7.25432833e-06 6.51511596e-06 4.20959141e-06 2.78155017e-06]

We see that (verified numerically in the sketch below):

1. The first two principal components alone explain more than 54% of the variance
2. The first eleven components explain more than 90% of the variance
3. The first 18 components explain more than 95% of the variance
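
A minimal check of these three claims, assuming var holds pca.explained_variance_ratio_
as computed in the code section at the end:

import numpy as np

cum = np.cumsum(var)
print(cum[1])   # ~0.546: the first 2 components explain >54% of the variance
print(cum[10])  # ~0.902: the first 11 components explain >90%
print(cum[17])  # ~0.957: the first 18 components explain >95%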

4. Logistic Regression

We need to run binary logistic regression. But before that, we need a statistical test to
check whether the attributes affect the output variable at all. Since there are more than
two dependent attributes, a plain ANOVA is not sufficient; we use its multivariate
extension, the MANOVA test.

MANOVA Test:

The null hypothesis (H0) of ANOVA is that there is no difference among group means. The
alternative hypothesis (Ha) is that at least one group differs significantly from the overall
mean of the dependent variable. If we accept the null hypothesis, there is no difference
among group means; logistic regression would then not give us significant results, and we
could not trust the regression.

Rejecting the null hypothesis means accepting the alternative hypothesis, which gives us
confidence to apply machine learning techniques (including logistic regression) in this
case.

Figure 4: MANOVA Analysis

Because the p-value for the independent variables (the 60 attributes) is statistically
significant (p < 0.05), it is likely that the attributes do have a significant effect on
identifying Mine or Rock.

So we now perform logistic regression (LR) on this dataset.

The result from LR is shown in Figure 5 below:


Figure 5: Coefficient and intercept values given by logistic regression

The model gives us the coefficients for all 60 attributes plus the constant value
(intercept); together these make the predictions. Accuracy on the training data is 84.6%
and on the testing data about 75%.
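
As a sketch of how these numbers turn into a prediction (assuming the fitted model and
the X_test matrix from the code section): the log-odds are a linear combination of the 60
attributes plus the intercept, mapped to a probability by the sigmoid.

import numpy as np

z = X_test[0] @ model.coef_[0] + model.intercept_[0]
p = 1.0 / (1.0 + np.exp(-z))               # P(class = model.classes_[1]), i.e. 'R'
print(p)
print(model.predict_proba(X_test[:1]))     # second column should agree with p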

5. Learning Curves

The learning curve is shown in Figure 6:

Figure 6: Learning curve


Figure 6 shows that as the number of training samples increases, training accuracy
decreases while test accuracy increases. With only a few samples the model effectively
memorizes the training data (high training accuracy but poor generalization); as the
training set grows, the model is tuned more realistically and the two curves converge
toward the best achievable performance.

ROC Curve of the given model is shown in Figure 7:

Figure 7: ROC Curve, Logistic Regression

A receiver operating characteristic (ROC) curve is a graphical representation of the
performance of a classification model at various threshold settings. The greater the area
under the curve (AUC), the better the model is at distinguishing between the positive and
negative classes. It is used to compare the performance of different models and select
the one with the best predictive accuracy.
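
The AUC summarizing Figure 7 can also be computed directly; a minimal sketch, assuming
the fitted model, X_test and Y_test from the code section (for binary string labels,
scikit-learn treats the alphabetically larger class 'R' as positive):

from sklearn.metrics import roc_auc_score

scores = model.predict_proba(X_test)[:, 1]   # probability of class 'R'
print("AUC =", roc_auc_score(Y_test, scores))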

6. Error Analysis on Test Data

Error analysis is done by computing the confusion matrix and then calculating performance
metrics such as precision, recall, and F1-score.

Confusion Matrix: A confusion matrix is a table that is used to evaluate the performance of a
classification model. It is a table of the predicted classes compared to the actual classes in the
test data set. It allows you to see where the model is making correct predictions, and where it
is making mistakes. It also helps you to identify potential areas for improvement in the model.
The rows of the matrix represent the actual classes and the columns represent the predicted
classes. The matrix is filled with counts of the number of times an instance was predicted to
be in a certain class. The diagonal elements represent the number of correct predictions,
while the off-diagonal elements indicate the number of incorrect predictions.


The table below is the confusion matrix for the test data. It shows that 14 records were
correctly classified as Mine and 10 were correctly classified as Rock; the remaining
8 (5 + 3) were misclassified.

Classification Report: Precision, recall, F1 score and support are all metrics used to
evaluate the performance of a machine learning model (a worked example follows the list).

• Precision is the fraction of true positives among all predicted positives. It tells
us how accurate the model is when it predicts a positive outcome.
• Recall is the fraction of true positives among all actual positives. It tells us
how many of the positive outcomes the model is able to find.
• F1 score is the harmonic mean of precision and recall, so it takes both false
positives and false negatives into account.
• Support is the number of occurrences of each class in the test set; it is used
when averaging the other metrics across classes.
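
As a worked example of these formulas (purely illustrative: the 14 true positives come
from the confusion matrix above, but the 5/3 split of the 8 errors into false positives
and false negatives is an assumption for demonstration):

def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=14, fp=5, fn=3))   # (0.737, 0.824, 0.778)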

***

CODE
import numpy as np # linear algebra
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt


loc = "D:\\datasets\\sonar_csv.csv"

dataset = pd.read_csv(loc, header=None)

# Convert the 60 signal columns to numeric; the last column stays as the R/M label
cols = dataset.columns
dataset[cols[:-1]] = dataset[cols[:-1]].apply(pd.to_numeric, errors='coerce')

# Drop the first row, which held the original header text
dataset = dataset.iloc[1:, :]
print(dataset.shape)
print(dataset.dtypes)
print(dataset[60].value_counts())

X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values

print(X)
print(Y)

#EDA

#create correlation plot


print("Correlation")

print(dataset.iloc[: , :-1].corr())
import seaborn as sb
# plotting correlation heatmap
dataplot = sb.heatmap(dataset.iloc[: , :-1].corr(), cmap="YlGnBu", annot=False)

# displaying heatmap
plt.show()

# Bar chart of class counts (column 60 is the R/M label)
dataset.groupby(60)[60].count().plot.bar()
plt.show()

# histograms
dataset.hist(sharex=False, sharey=False, xlabelsize=1, ylabelsize=1,
             figsize=(12, 12), fill=True, color='#F5AC03')
plt.suptitle("Histogram of data for each attribute")  # title for the whole grid
plt.show()

# MODEL
print("=======> \n\n", Y)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.15,
                                                    random_state=1)
print(X.shape, X_train.shape, X_test.shape)

#PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=60)
principalComponents = pca.fit_transform(X_train)


principalDf = pca.transform(X_test)
var = pca.explained_variance_ratio_
print("variance ratio = ",var)

## Performing the MANOVA (multivariate ANOVA) test


from statsmodels.multivariate.manova import MANOVA

# Read the file again, this time keeping the header row; index_col=0 makes
# attribute_1 the index, which is why the formula below starts at attribute_2
df = pd.read_csv(loc, index_col=0)
print("Columns: ", df.columns)
print(df.head())

# Build the long MANOVA formula programmatically: all attributes as dependent
# variables, with the R/M Class as the explanatory factor
dep_vars = " + ".join(f"attribute_{i}" for i in range(2, 61))
maov = MANOVA.from_formula(dep_vars + " ~ Class", data=df)

print(maov.mv_test())

#####################
# Fit on the training split only; fitting on all of X would leak the test data
model = LogisticRegression()
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)
print("LR Coefficient = ", model.coef_)
print("Intercept = ", model.intercept_)

#TEST
model = LogisticRegression()
model.fit(X_train, Y_train)
x_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, x_train_prediction)
print('Accuracy on training data :', training_data_accuracy)
x_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(Y_test, x_test_prediction)
print('Accuracy on testing data :', testing_data_accuracy)

# Learning Curves
print("Now plotting Learning Curve")

from sklearn.model_selection import learning_curve, KFold
cv = KFold(166, shuffle=True)   # 166 folds leaves only 1-2 samples per test fold
model3 = LogisticRegression()

train_sizes, train_scores, test_scores = learning_curve(
    estimator=model3,
    X=X,
    y=Y,
    cv=cv,
    scoring="accuracy",   # accuracy, since the labels are the strings 'R'/'M'
    train_sizes=[1, 15, 35, 66, 99, 125, 150, 166],
    verbose=0,
)

# Plot the mean score at each training-set size
plt.plot(train_sizes, train_scores.mean(axis=1), marker='o', label='Training accuracy')
plt.plot(train_sizes, test_scores.mean(axis=1), marker='o', label='Validation accuracy')
plt.title("Learning Curve: LOGISTIC REGRESSION")
plt.xlabel("Training Set Size")
plt.ylabel("Accuracy")
plt.legend(loc="best")
plt.show()
from sklearn.model_selection import LearningCurveDisplay, ShuffleSplit

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 6), sharey=True)

common_params = {
"X": X,
"y": Y,
"train_sizes": np.linspace(0.1, 1.0, 5),
"cv": ShuffleSplit(n_splits=50, test_size=0.2, random_state=0),
"score_type": "both",
"n_jobs": 4,
"line_kw": {"marker": "o"},
"std_display_style": "fill_between",
"score_name": "Accuracy",
}

for ax_idx, estimator in enumerate([model, model3]):
    LearningCurveDisplay.from_estimator(estimator, **common_params, ax=ax[ax_idx])
    handles, label = ax[ax_idx].get_legend_handles_labels()
    ax[ax_idx].legend(handles[:2], ["Training Score", "Test Score"])
    ax[ax_idx].set_title(f"Learning Curve for {estimator.__class__.__name__}")
plt.show()
## ROC Curve
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(model, X_test, Y_test)
plt.show()

#RESULT ANALYSIS
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(Y_test, y_pred))
print(classification_report(Y_test,y_pred))
