Underwater Mine & Rock Prediction by Evaluation of Machine Learning Algorithms
TY B.TECH
SOFT COMPUTING
EXAM REPORT
SUBMITTED BY
1. Aditya Sirsat 202101070057
2. Abhishek Kumbhar 202101070060
3. Swastik Gaikwad 202101070062
4. Pratik Pisal 202101070073
MAHARASHTRA (INDIA)
MAY, 2024
▪ PROBLEM STATEMENT
The problem statement for this project is to develop machine learning models that can
accurately classify underwater objects detected by sonar return data as either rocks or mines.
Using the provided dataset, the goal is to train models that can effectively differentiate between
the two classes based on the 60 numerical features representing energy within specific
frequency bands integrated over time periods. The objective is to create robust classification
algorithms capable of distinguishing between harmless rocks and potentially hazardous mines
in underwater environments, thereby aiding in underwater surveillance and navigation tasks.
▪ DATASET
The Sonar Mines vs Rocks dataset consists of sonar return data used to predict whether an
object detected underwater is a metal cylinder (mine) or a rock. Each instance in the dataset
contains 60 numerical features representing signal strength within specific frequency bands for
different angles. The features range from 0.0 to 1.0. The labels 'R' and 'M' indicate rocks and
mines, respectively. The goal is to accurately classify objects based on their sonar return
characteristics, enabling the differentiation between underwater rocks and potentially
hazardous mines. This dataset serves as a valuable resource for developing machine learning
models for underwater object classification and detection tasks.
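Because the labels are the letters 'R' and 'M', they have to be converted to numbers before a model can be trained. The following is a minimal sketch of that encoding step, assuming the label column is named 'Outcome' and using a hypothetical four-row frame with two stand-in feature columns (`f0`, `f1`) instead of the real 60 features.

```python
import pandas as pd

# Hypothetical miniature frame: two stand-in feature columns plus the label.
df = pd.DataFrame({
    "f0": [0.02, 0.45, 0.10, 0.31],
    "f1": [0.04, 0.28, 0.17, 0.06],
    "Outcome": ["R", "M", "R", "M"],
})

# One-hot encode the label column. drop_first=True keeps a single
# indicator column (Outcome_R) instead of two redundant ones.
encoded = pd.get_dummies(df, columns=["Outcome"], drop_first=True)

# Cast to int so the indicator is 0/1 regardless of the pandas version.
encoded["Outcome_R"] = encoded["Outcome_R"].astype(int)

print(encoded)
```

With this encoding a value of 1 marks a rock and 0 marks a mine. The feature columns are left unchanged.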
ABSTRACT
The project aims to develop and evaluate machine learning models for predicting underwater
objects as either mines or rocks using sonar return data. Initially, the dataset is imported and
explored with exploratory data analysis (EDA) calls such as `head()` and `isnull()` to understand its
structure and check for missing values. A pie chart is generated to visualize the distribution of
output classes, highlighting the percentage of mines and rocks. One-hot encoding is applied to
categorical columns, facilitating the transformation of categorical variables into numerical format
for model training. The dataset is split into training and testing sets using train-test split
methodology. Various machine learning models including Logistic Regression, Support Vector
Machines (SVM), Random Forest, and K-Nearest Neighbors (KNN) are employed, with
hyperparameter tuning conducted using GridSearchCV to optimize model performance.
Evaluation metrics including accuracy and confusion matrix are calculated for each model to
assess their predictive capabilities. The accuracy measures the proportion of correctly classified
instances, while the confusion matrix provides insights into the model's classification performance
across different classes. The accuracies and the training and testing times of the models are
compared in tabular format. This comparative analysis enables the selection of the
most suitable machine learning algorithm(s) based on their performance and computational
efficiency for underwater mine and rock prediction tasks.
In summary, the project encompasses the entire machine learning pipeline from data preprocessing
to model evaluation, with a focus on optimizing model performance and selecting the best-suited
algorithm for underwater object classification.
1. Pandas:
Utilized for data manipulation and analysis, including importing the dataset, performing
exploratory data analysis, and preprocessing tasks such as handling missing values.
2. Matplotlib:
Employed for data visualization, including the creation of plots such as pie charts to
represent the distribution of output classes and confusion matrices to visualize model
performance.
3. Seaborn:
Enhanced data visualization library that complements Matplotlib, used for creating
visually appealing and informative statistical graphics, including heatmaps for visualizing
confusion matrices.
4. Scikit-learn:
Widely-used machine learning library in Python, providing efficient tools for data
preprocessing, model selection, training, and evaluation, as well as hyperparameter tuning
using techniques like GridSearchCV.
5. Tabulate:
Used for formatting and displaying data in tabular form, facilitating the comparison of
accuracies and time complexities of different machine learning models.
6. Time:
Standard Python library used for measuring and reporting the time taken for model training
and testing, enabling the assessment of computational efficiency.
DATA EXPLORATION
Data Visualization
This step visualizes the distribution of outcomes within the categorical column 'Outcome' in the
DataFrame. By counting the occurrences of each outcome class and representing them
proportionally in a pie chart, it provides a clear visual representation of the relative frequencies of
different outcomes. This visualization aids in understanding the class distribution, which is
essential for tasks such as assessing the balance of the dataset and identifying potential biases in
the data. Additionally, presenting this information graphically enhances interpretability and
facilitates insights into the dataset's composition.
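The counting step behind the pie chart can be sketched as follows. The label values here are hypothetical and only illustrate how the class percentages are obtained from the 'Outcome' column.

```python
import pandas as pd

# Hypothetical labels; the real column holds one label per sonar return.
df = pd.DataFrame({"Outcome": ["M", "R", "M", "M", "R"]})

# Absolute number of rows per class.
counts = df["Outcome"].value_counts()

# Relative frequency per class, expressed as a percentage.
shares = df["Outcome"].value_counts(normalize=True) * 100

print(counts.to_dict())
print(shares.round(1).to_dict())
```

If Matplotlib is installed, `counts.plot.pie(autopct="%1.1f%%")` should draw the same proportions as a pie chart with percentage labels.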
▪ DATASET SPLITTING
The process of splitting the data into training and testing sets is essential for effectively
evaluating the performance of machine learning models. By partitioning the dataset into
two distinct subsets, one for training and the other for testing, we ensure that the model's
performance is assessed on unseen data. This helps in estimating how well the model will
generalize to new, unseen instances in real-world scenarios. Setting aside a portion of the
data for testing allows us to validate the model's predictions and assess its accuracy and
robustness. Additionally, specifying a random seed ensures reproducibility, enabling
consistent results across different runs of the code.
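A minimal sketch of such a split with scikit-learn is shown here. The data is synthetic, and the 80/20 ratio, the seed value 42, and the use of `stratify` are assumptions made for illustration. The report does not state which values were used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples with 60 features in [0, 1] and balanced labels.
rng = np.random.default_rng(0)
X = rng.random((100, 60))
y = np.array([0, 1] * 50)

# Hold out 20% of the rows for testing. random_state fixes the shuffle,
# and stratify=y keeps the class ratio the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Repeating the call with the same seed reproduces the same partition.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```

Stratification is optional, but on a small dataset it avoids a test set that happens to contain mostly one class.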
▪ MODELS USED
1. Logistic Regression (Default):
- In this context, "default" refers to the model trained with default hyperparameters
without tuning.
2. SVM (Default):
- Support Vector Machines (SVM) is a powerful classification algorithm that finds the
optimal hyperplane to separate classes.
- It aims to maximize the margin between classes while minimizing classification errors.
3. SVM (GridSearchCV):
4. Random Forest (Default):
- The "default" Random Forest model is trained with default settings, including the
number of trees in the forest (n_estimators).
5. KNN (Default):
- In the "default" KNN model, the value of k is set to 5, which means it considers
the class labels of the five nearest neighbors for classification.
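The default settings mentioned above can be inspected directly in scikit-learn. This is a small sketch that assumes the models correspond to the classes `LogisticRegression`, `SVC`, `RandomForestClassifier`, and `KNeighborsClassifier`. The report does not name the exact classes, and the default values shown depend on the installed scikit-learn version.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# "Default" models: constructed without passing any hyperparameters.
default_models = {
    "Logistic Regression (Default)": LogisticRegression(),
    "SVM (Default)": SVC(),
    "Random Forest (Default)": RandomForestClassifier(),
    "KNN (Default)": KNeighborsClassifier(),
}

# Print a few hyperparameters of interest for each model.
keys_of_interest = ("C", "kernel", "n_estimators", "n_neighbors")
for name, model in default_models.items():
    params = model.get_params()
    shown = {k: params[k] for k in keys_of_interest if k in params}
    print(name, shown)
```

In recent scikit-learn releases this reports `n_neighbors=5` for KNN and `n_estimators=100` for Random Forest.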
▪ GridSearchCV
In machine learning, hyperparameters are parameters that are not learned during the
training process but are set prior to training and affect the learning process. Examples of
hyperparameters include the learning rate in a neural network or the regularization
parameter in a support vector machine.
1. Hyperparameter Tuning:
GridSearchCV helps in finding the optimal combination of hyperparameters for a given model and
dataset. By systematically searching through a grid of hyperparameter values, it can
identify the combination that results in the best performance (e.g., highest accuracy,
lowest error) on a validation set.
2. Preventing Overfitting:
Using cross-validation helps prevent overfitting by providing an estimate of how the
model will perform on unseen data. By evaluating the model's performance on multiple
validation sets, GridSearchCV provides a more robust estimate of how well the model
will generalize to new data compared to using a single validation set.
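The search can be sketched as follows for an SVM. The data is synthetic, and the parameter grid, the 5-fold setting, and the accuracy scoring are illustrative assumptions rather than the values used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the 60-feature sonar data.
X, y = make_classification(
    n_samples=120, n_features=60, n_informative=10, random_state=0
)

# Hypothetical grid: 3 values of C times 2 kernels = 6 candidate settings.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Each candidate is scored by 5-fold cross-validation on accuracy.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)           # best combination found in the grid
print(round(search.best_score_, 3))  # its mean cross-validated accuracy
```

After fitting, `search.best_estimator_` holds a model refit on all the supplied data with the best combination, so it can be used directly for prediction on the test set.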
MAY - 2024 P a g e | 10
Soft Computing Sem VI
RESULTS
1. ACCURACY
The table presents the accuracy scores of various classification models, with the highest accuracy
achieved by SVM after hyperparameter tuning using GridSearchCV. This indicates that SVM,
when optimized through GridSearchCV, outperformed other models in accurately classifying the
data into the correct categories.
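For reference, the two metrics reported for each model can be computed as in the sketch below. The true and predicted labels are hypothetical and are not taken from the project's results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for eight test objects.
y_true = ["M", "M", "M", "R", "R", "R", "R", "M"]
y_pred = ["M", "M", "R", "R", "R", "M", "R", "M"]

# Accuracy: fraction of predictions that match the true label (6 of 8 here).
acc = accuracy_score(y_true, y_pred)

# Confusion matrix: rows are true classes, columns are predicted classes,
# both in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=["M", "R"])

print(acc)  # 0.75
print(cm)   # [[3 1]
            #  [1 3]]
```

The first row shows that one mine was predicted as a rock, which is the more costly kind of error in this application and is not visible from accuracy alone.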
2. TIME COMPLEXITIES
The table displays the measured training and testing times (referred to here as time complexities) for various classification models.
Training time refers to the duration taken to train the model on the training data, while testing time
indicates the time required to make predictions on the testing data.
- Logistic Regression and SVM (Support Vector Machines) exhibit relatively low training and
testing times, indicating their computational efficiency.
- Random Forest, although slower to train compared to logistic regression and SVM, still
demonstrates reasonable training times. Testing time for Random Forest is slightly higher due to
its ensemble nature, but it remains manageable.
- K-Nearest Neighbors (KNN), in both its default and GridSearchCV versions, shows the lowest
training times, owing to its simplicity and lazy learning approach. However, KNN incurs higher
testing times compared to other models due to its instance-based nature, where predictions involve
computing distances to all training instances.
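Training and testing times of this kind can be measured as in the following sketch. It uses synthetic data and `time.perf_counter()`, which is an assumption: the report only states that the `time` library was used. The helper name `time_model` is hypothetical.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, split by position into train and test parts.
X, y = make_classification(n_samples=200, n_features=60, random_state=0)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]


def time_model(model):
    """Return (training seconds, testing seconds) for one model."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X_test)
    test_seconds = time.perf_counter() - start
    return train_seconds, test_seconds


models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
}

timings = {name: time_model(model) for name, model in models.items()}
for name, (train_seconds, test_seconds) in timings.items():
    print(f"{name}: train {train_seconds:.4f} s, test {test_seconds:.4f} s")
```

Single measurements on a dataset this small are noisy, so averaging over several runs gives a fairer comparison between models.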
CONCLUSION
In conclusion, this project effectively explored and evaluated multiple machine learning
algorithms for classifying underwater objects as mines or rocks based on sonar data. The accuracy
of the Support Vector Machine (SVM) model, particularly the one tuned with GridSearchCV,
emerged as the highest among the models tested. This superior performance can be attributed to
SVM's ability to find optimal decision boundaries, maximizing the margin between classes while
minimizing classification errors. Additionally, the careful selection and tuning of hyperparameters
through GridSearchCV likely contributed to enhancing SVM's predictive accuracy. The project
underscores the importance of methodical model selection, hyperparameter tuning, and rigorous
evaluation techniques in achieving superior classification performance for real-world applications.