Underwater Mine & Rock Prediction by Evaluation of Machine Learning Algorithms

Soft Computing Sem VI

UNDERWATER MINE & ROCK PREDICTION BY EVALUATION OF MACHINE LEARNING ALGORITHMS

TY B.TECH

SOFT COMPUTING

EXAM REPORT

SUBMITTED BY
1. Aditya Sirsat 202101070057
2. Abhishek Kumbhar 202101070060
3. Swastik Gaikwad 202101070062
4. Pratik Pisal 202101070073

SCHOOL OF ELECTRONICS AND TELECOMMUNICATION ENGINEERING

MIT ACADEMY OF ENGINEERING, ALANDI (D), PUNE-412105

MAHARASHTRA (INDIA)

MAY, 2024

MAY - 2024 Page|1


PROBLEM STATEMENT AND DATASET

▪ PROBLEM STATEMENT

"Underwater Mine & Rock Prediction by Evaluation of Machine Learning Algorithms".

The aim of this project is to develop machine learning models that can
accurately classify underwater objects detected by sonar return data as either rocks or mines.
Using the provided dataset, the goal is to train models that can effectively differentiate between
the two classes based on the 60 numerical features representing energy within specific
frequency bands integrated over time periods. The objective is to create robust classification
algorithms capable of distinguishing between harmless rocks and potentially hazardous mines
in underwater environments, thereby aiding in underwater surveillance and navigation tasks.

▪ DATASET

The Sonar Mines vs Rocks dataset consists of sonar return data used to predict whether an
object detected underwater is a metal cylinder (mine) or a rock. Each instance in the dataset
contains 60 numerical features representing signal strength within specific frequency bands for
different angles. The features range from 0.0 to 1.0. The labels 'R' and 'M' indicate rocks and
mines, respectively. The goal is to accurately classify objects based on their sonar return
characteristics, enabling the differentiation between underwater rocks and potentially
hazardous mines. This dataset serves as a valuable resource for developing machine learning
models for underwater object classification and detection tasks.

ABSTRACT

The project aims to develop and evaluate machine learning models for predicting underwater
objects as either mines or rocks using sonar return data. Initially, the dataset is imported and
subjected to exploratory data analysis (EDA) tools such as `head()` and `isnull()` to understand its
structure and check for missing values. A pie chart is generated to visualize the distribution of
output classes, highlighting the percentage of mines and rocks. One-hot encoding is applied to
categorical columns, facilitating the transformation of categorical variables into numerical format
for model training. The dataset is split into training and testing sets using train-test split
methodology. Various machine learning models including Logistic Regression, Support Vector
Machines (SVM), Random Forest, and K-Nearest Neighbors (KNN) are employed, with
hyperparameter tuning conducted using GridSearchCV to optimize model performance.

Evaluation metrics including accuracy and confusion matrix are calculated for each model to
assess their predictive capabilities. The accuracy measures the proportion of correctly classified
instances, while the confusion matrix provides insights into the model's classification performance
across different classes. A comparison of the models' accuracies and of their measured training
and testing times is presented in tabular format. This comparative analysis enables the selection of the
most suitable machine learning algorithm(s) based on their performance and computational
efficiency for underwater mine and rock prediction tasks.

In summary, the project encompasses the entire machine learning pipeline from data preprocessing
to model evaluation, with a focus on optimizing model performance and selecting the best-suited
algorithm for underwater object classification.

EXPLANATION OF THE LIBRARIES USED

1. Pandas:
Utilized for data manipulation and analysis, including importing the dataset, performing
exploratory data analysis, and preprocessing tasks such as handling missing values.

2. Matplotlib:
Employed for data visualization, including the creation of plots such as pie charts to
represent the distribution of output classes and confusion matrices to visualize model
performance.

3. Seaborn:
Enhanced data visualization library that complements Matplotlib, used for creating
visually appealing and informative statistical graphics, including heatmaps for visualizing
confusion matrices.

4. Scikit-learn:
Widely-used machine learning library in Python, providing efficient tools for data
preprocessing, model selection, training, and evaluation, as well as hyperparameter tuning
using techniques like GridSearchCV.

5. Tabulate:
Used for formatting and displaying data in tabular form, facilitating the comparison of
accuracies and time complexities of different machine learning models.

6. Time:
Standard Python library used for measuring and reporting the time taken for model training
and testing, enabling the assessment of computational efficiency.

DATA EXPLORATION

1. Displaying the first few rows of the DataFrame:

Utilizing the `head()` function to showcase the initial rows of the dataset, aiding in
understanding its structure and contents.

2. Checking the structure of the DataFrame:

Employing the `info()` function to inspect the DataFrame's structure, encompassing column
names, data types, and non-null counts, facilitating a comprehensive overview of the
dataset.

3. Checking for missing values in the DataFrame:

Utilizing `isnull().sum()` to count the missing values in each column of the
DataFrame, crucial for ensuring data integrity and completeness.

4. Summary statistics for numeric columns:

Utilizing the `describe()` function to generate summary statistics such as mean, standard
deviation, and quartiles (including the median) for numeric columns, offering insights into
their distribution and variability.

5. Counting occurrences of each output class:

Employing the `value_counts()` function on the target variable to tally the frequency of each
output class, aiding in assessing class distribution and potential biases.

6. Displaying the shape of the dataset:

Utilizing the `shape` attribute to retrieve the dimensions (rows and columns) of the dataset,
offering a concise overview of its size and structure.
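
The six exploration steps above can be sketched as follows. This is only an illustration, not the report's actual code: it builds a small synthetic stand-in DataFrame so that it runs without the real sonar CSV file. The feature column names `f0` to `f59` and the row count are assumptions; the 'Outcome' label column follows the name used in the Data Visualization section.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sonar data (assumption: the real CSV is not loaded here).
rng = np.random.default_rng(0)
n_rows, n_features = 20, 60
df = pd.DataFrame(
    rng.random((n_rows, n_features)),
    columns=[f"f{i}" for i in range(n_features)],  # hypothetical column names
)
df["Outcome"] = ["M", "R"] * (n_rows // 2)  # 'M' = mine, 'R' = rock

print(df.head())                              # 1. first five rows
df.info()                                     # 2. column names, dtypes, non-null counts
missing_per_column = df.isnull().sum()        # 3. missing values per column
print(missing_per_column.sum())
summary = df.describe()                       # 4. count, mean, std, min, quartiles, max
print(summary.loc[["mean", "50%"]].iloc[:, :3])
class_counts = df["Outcome"].value_counts()   # 5. number of rows per class
print(class_counts)
print(df.shape)                               # 6. (rows, columns)
```

Note that `describe()` only summarizes the numeric columns by default, so the 'Outcome' column does not appear in `summary`.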

Data Visualization

This step visualizes the distribution of outcomes within the categorical column 'Outcome' in the
DataFrame. By counting the occurrences of each outcome class and representing them
proportionally in a pie chart, it provides a clear visual representation of the relative frequencies of
different outcomes. This visualization aids in understanding the class distribution, which is
essential for tasks such as assessing the balance of the dataset and identifying potential biases in
the data. Additionally, presenting this information graphically enhances interpretability and
facilitates insights into the dataset's composition.
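
A pie chart of this kind could be produced roughly as shown below. The label values are invented for illustration and are not the report's class counts; the `Agg` backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical label column; these values are made up for illustration only.
outcome = pd.Series(["M"] * 6 + ["R"] * 4, name="Outcome")
counts = outcome.value_counts()

fig, ax = plt.subplots(figsize=(4, 4))
ax.pie(
    counts.values,
    labels=list(counts.index),
    autopct="%1.1f%%",  # print each slice's share as a percentage
    startangle=90,
)
ax.set_title("Distribution of output classes")
```

In an interactive session, `plt.show()` would display the figure.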

PROBLEM STATEMENT IMPLEMENTATION

▪ DATASET SPLITTING

The process of splitting the data into training and testing sets is essential for effectively
evaluating the performance of machine learning models. By partitioning the dataset into
two distinct subsets, one for training and the other for testing, we ensure that the model's
performance is assessed on unseen data. This helps in estimating how well the model will
generalize to new, unseen instances in real-world scenarios. Setting aside a portion of the
data for testing allows us to validate the model's predictions and assess its accuracy and
robustness. Additionally, specifying a random seed ensures reproducibility, enabling
consistent results across different runs of the code.
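
A minimal sketch of such a split with scikit-learn's `train_test_split` is given below. The 80/20 ratio, the seed value 42, and the use of `stratify` are assumptions, since the report does not state them; the data are random placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with 60 features and balanced 'M'/'R' labels.
rng = np.random.default_rng(42)
X = rng.random((100, 60))
y = np.array(["M", "R"] * 50)

# Hold out 20% of the rows for testing; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Using the same seed again reproduces exactly the same partition.
X_train_again, _, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
print(np.array_equal(X_train, X_train_again))
```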

▪ MODELS USED

1. Logistic Regression (Default):

- Logistic Regression is a linear classification algorithm used for binary classification
tasks.

- It calculates the probability of an instance belonging to a particular class using the
logistic function.

- In this context, "default" refers to the model trained with default hyperparameters
without tuning.

2. SVM (Default):

- Support Vector Machines (SVM) is a powerful classification algorithm that finds the
optimal hyperplane to separate classes.

- It aims to maximize the margin between classes while minimizing classification errors.

- The "default" SVM model is trained with default hyperparameters.

3. SVM (GridSearchCV):

- In GridSearchCV (Cross-Validation Grid Search), hyperparameters are systematically
tested using cross-validation to find the optimal combination that yields the best
performance.

- SVM with GridSearchCV involves tuning hyperparameters like C (regularization
parameter) and gamma (kernel coefficient) to optimize model performance.

4. Random Forest (Default):

- Random Forest is an ensemble learning method consisting of multiple decision trees
trained on random subsets of the data.

- It aggregates the predictions of individual trees to make more accurate classifications.

- The "default" Random Forest model is trained with default settings, including the
number of trees in the forest (n_estimators).

5. Random Forest (GridSearchCV):

- Random Forest with GridSearchCV involves tuning hyperparameters such as the
number of trees (n_estimators), maximum depth of trees (max_depth), and minimum
samples required to split a node (min_samples_split).

- GridSearchCV systematically explores different combinations of hyperparameters to
identify the optimal configuration that maximizes model performance.

6. K-Nearest Neighbors (Default):

- K-Nearest Neighbors (KNN) is a simple and intuitive classification algorithm that
assigns a class label based on the majority class of its k-nearest neighbors.

- In the "default" KNN model, the value of k is typically set to 5, which means it considers
the class labels of the five nearest neighbors for classification.

7. K-Nearest Neighbors (GridSearchCV):

- KNN with GridSearchCV involves tuning hyperparameters such as the number of
neighbors (n_neighbors) and the distance metric used for calculating proximity.

- GridSearchCV systematically searches for the optimal combination of
hyperparameters, such as the value of k, to improve model performance.
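
The four "default" models listed above could be trained and compared along the following lines. This is a hedged sketch, not the report's code: the data come from scikit-learn's `make_classification` as a synthetic stand-in for the sonar features, and the split ratio and seeds are assumptions. `random_state` on the forest is set only for reproducibility and is not a tuned hyperparameter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary problem with 60 features (stand-in for the sonar data).
X, y = make_classification(
    n_samples=200, n_features=60, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Each estimator is created with its library defaults.
models = {
    "Logistic Regression (Default)": LogisticRegression(),
    "SVM (Default)": SVC(),
    "Random Forest (Default)": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbors (Default)": KNeighborsClassifier(),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in accuracies.items():
    print(f"{name}: {acc:.3f}")
```

The accuracies printed here describe the synthetic data only and say nothing about the sonar results reported later.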

▪ GridSearchCV

GridSearchCV (Grid Search Cross-Validation) is a scikit-learn utility for
hyperparameter optimization that scores every candidate hyperparameter setting using
cross-validation.

In machine learning, hyperparameters are parameters that are not learned during the
training process but are set prior to training and affect the learning process. Examples of
hyperparameters include the learning rate in a neural network or the regularization
parameter in a support vector machine.

GridSearchCV works by exhaustively searching through a specified grid of
hyperparameter values for a given model. It evaluates each combination of hyperparameters
using cross-validation, which involves splitting the training dataset into multiple subsets
(folds), training the model on all but one fold, and validating it on the remaining fold, with
each fold serving once as the validation fold. This process is repeated for each combination
of hyperparameters.
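
The search procedure described above can be sketched for an SVM as follows. The grid values, the choice of 5 folds, and the synthetic data are illustrative assumptions; the report does not list the grids it actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic training data (stand-in for the sonar training split).
X, y = make_classification(
    n_samples=120, n_features=60, n_informative=10, random_state=0
)

# Illustrative grid: 3 values of C times 3 values of gamma = 9 combinations.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)  # fits every combination on every fold, then refits the best one

n_candidates = len(search.cv_results_["params"])
print(n_candidates)
print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated accuracy of the best setting
```

After fitting, `search.best_estimator_` is a model refit on all of `X` with the winning hyperparameters, so it can be used directly for prediction on the held-out test set.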

GridSearchCV is used for two primary reasons:

1. Hyperparameter Tuning:
It helps in finding the optimal combination of hyperparameters for a given model and
dataset. By systematically searching through a grid of hyperparameter values, it can
identify the combination that results in the best performance (e.g., highest accuracy,
lowest error) on a validation set.

2. Preventing Overfitting:
Using cross-validation reduces the risk of selecting hyperparameters that overfit one
particular validation split, because it estimates how the model will perform on unseen
data. By evaluating the model's performance on multiple validation sets, GridSearchCV
provides a more robust estimate of how well the model will generalize to new data
compared to using a single validation set.
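
To illustrate why several validation folds give a steadier estimate than one, the sketch below scores a single model on five folds with `cross_val_score`. The data, the KNN model, and the fold settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (stand-in for the sonar training split).
X, y = make_classification(
    n_samples=150, n_features=60, n_informative=10, random_state=1
)

# Five stratified folds; each fold serves once as the validation set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
fold_scores = cross_val_score(
    KNeighborsClassifier(), X, y, cv=cv, scoring="accuracy"
)

print(fold_scores)         # one accuracy per fold; individual folds can differ
print(fold_scores.mean())  # the averaged estimate that GridSearchCV compares
print(fold_scores.std())   # spread across folds
```

Any single entry of `fold_scores` is what a lone validation split would have reported; the mean smooths out that split-to-split variation.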

RESULTS
1. ACCURACY

The table presents the accuracy scores of various classification models, with the highest accuracy
achieved by SVM after hyperparameter tuning using GridSearchCV. This indicates that SVM,
when optimized through GridSearchCV, outperformed other models in accurately classifying the
data into the correct categories.
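
For reference, the accuracy and confusion matrix mentioned in the abstract can be computed as in the small sketch below. The true and predicted labels are invented for illustration and are not the report's results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels for ten objects ('M' = mine, 'R' = rock).
y_true = ["M", "M", "M", "M", "R", "R", "R", "R", "R", "M"]
y_pred = ["M", "M", "R", "M", "R", "R", "M", "R", "R", "M"]

acc = accuracy_score(y_true, y_pred)  # fraction of correct predictions: 8 of 10
cm = confusion_matrix(y_true, y_pred, labels=["M", "R"])

print(acc)  # 0.8
print(cm)   # rows = true class, columns = predicted class, in the order M, R
# [[4 1]
#  [1 4]]
```

Here one mine was predicted as a rock (row 1, column 2) and one rock was predicted as a mine (row 2, column 1). For this application the first kind of error, a missed mine, is the more hazardous one, which accuracy alone does not reveal.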

2. TIME COMPLEXITIES

The table displays the measured training and testing times for various classification models.
Training time refers to the duration taken to train the model on the training data, while testing time
indicates the time required to make predictions on the testing data.

- Logistic Regression and SVM (Support Vector Machines) exhibit relatively low training and
testing times, indicating their computational efficiency.

- Random Forest, although slower to train compared to logistic regression and SVM, still
demonstrates reasonable training times. Testing time for Random Forest is slightly higher due to
its ensemble nature, but it remains manageable.

- K-Nearest Neighbors (KNN), in both its default and GridSearchCV versions, shows the lowest
training times, owing to its simplicity and lazy learning approach. However, KNN incurs higher
testing times compared to other models due to its instance-based nature, where predictions involve
computing distances to all training instances.
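
Training and testing times of this kind can be measured with Python's standard `time` module, roughly as sketched below. The helper `time_model` is hypothetical, the use of `time.perf_counter` is an assumption (the report only names the `time` library), and the data are synthetic. The values are wall-clock measurements on one machine, not asymptotic complexities.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def time_model(model, X_train, y_train, X_test):
    """Hypothetical helper: return (train_seconds, test_seconds) for one model."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X_test)
    test_seconds = time.perf_counter() - start
    return train_seconds, test_seconds


# Synthetic data (stand-in for the sonar features).
X, y = make_classification(
    n_samples=200, n_features=60, n_informative=10, random_state=0
)
X_train, X_test, y_train, _ = train_test_split(
    X, y, test_size=0.2, random_state=0
)

timings = {}
for name, model in [
    ("Logistic Regression", LogisticRegression()),
    ("KNN", KNeighborsClassifier()),
]:
    timings[name] = time_model(model, X_train, y_train, X_test)

# Plain-text table of the measurements.
print(f"{'Model':<22}{'Train (s)':>12}{'Test (s)':>12}")
for name, (train_s, test_s) in timings.items():
    print(f"{name:<22}{train_s:>12.5f}{test_s:>12.5f}")
```

On a dataset this small the measurements are dominated by overhead and vary from run to run, so averaging several repetitions gives a fairer comparison.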

CONCLUSION
In conclusion, this project effectively explored and evaluated multiple machine learning
algorithms for classifying underwater objects as mines or rocks based on sonar data. The accuracy
of the Support Vector Machine (SVM) model, particularly the one tuned with GridSearchCV,
emerged as the highest among the models tested. This superior performance can be attributed to
SVM's ability to find optimal decision boundaries, maximizing the margin between classes while
minimizing classification errors. Additionally, the careful selection and tuning of hyperparameters
through GridSearchCV likely contributed to enhancing SVM's predictive accuracy. The project
underscores the importance of methodical model selection, hyperparameter tuning, and rigorous
evaluation techniques in achieving superior classification performance for real-world applications.
