
Comparative Analysis of Machine Learning Algorithms for Breast Cancer Classification: A Performance Evaluation
Introduction to Breast Cancer Diagnosis and Classification

Breast cancer classification using machine learning algorithms is an application of supervised learning. The goal
is to predict whether a given breast mass is malignant or benign based on a set of features extracted from fine
needle aspirate (FNA) images.
Know Your Dataset

The features in this dataset are derived from a digital image of a fine needle aspirate (FNA) sample taken from a breast mass. These features are computed using various measurements and calculations performed on the digitized image data.
Data Collection and Preprocessing for Breast Cancer Classification

Data collection and preprocessing for breast cancer classification involve obtaining the dataset from a public repository on Kaggle, which contains diagnostic features computed from digitized images of fine needle aspirates (FNA) of breast masses. The dataset consists of 569 samples with 30 features, and preprocessing steps such as missing-value imputation, feature scaling, and a train-test split are applied to prepare the data for machine learning modeling.
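As a minimal, self-contained sketch of this step, the example below uses scikit-learn's bundled copy of the same 569-sample diagnostic dataset instead of a Kaggle CSV download; that substitution is an assumption made here purely for reproducibility.

```python
# Load the breast cancer diagnostic dataset (569 samples, 30 features).
# scikit-learn ships a copy, so no local Kaggle file path is needed; this
# copy also has no missing values, so no imputation step is shown here.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # y: 0 = malignant, 1 = benign

# Hold out 20% of the samples for testing, stratified to keep class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Standardize features to zero mean and unit variance; fit the scaler on the
# training split only, then apply the same transform to the test split.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```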
Methodology: Model Selection, Training, and Evaluation

The first step is to select an appropriate machine learning model to use for the classification task. There are many
different algorithms that can be used for this purpose, such as logistic regression, k-nearest neighbors (KNN),
support vector classifier (SVC), decision tree classifier, random forest classifier, and XGBoost.

Once the model has been selected, the next step is to train it using the breast cancer dataset. The dataset is typically
divided into two subsets: a training set and a testing set.

The final step in the methodology is to evaluate the performance of the trained model on the testing set. Several performance metrics can be used to assess the model, including accuracy, precision, recall, and F1 score.
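A sketch of the select-train-evaluate workflow across the candidate algorithms, assuming the X_train/X_test splits from the preprocessing sketch above and that the xgboost package is installed; the default hyperparameters are illustrative, not tuned.

```python
# Train each candidate model on the training split and score it on the
# held-out test split for a first side-by-side comparison.
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    "logistic regression": LogisticRegression(max_iter=10000),
    "knn": KNeighborsClassifier(),
    "svc": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "xgboost": XGBClassifier(eval_metric="logloss"),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```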
Performance Evaluation Metrics for Breast Cancer Classification Models

Performance evaluation metrics are used to measure the accuracy and effectiveness of machine learning models
for breast cancer classification.

1. Accuracy:

Accuracy is a metric that measures the overall performance of a classification model. It is the ratio of the number
of correct predictions to the total number of predictions made by the model.

2. Precision:

Precision is a metric that measures the proportion of true positive predictions out of all positive predictions made
by the model.
3. F1 Score:

F1 score is the harmonic mean of precision and sensitivity (recall). It provides a balanced measure of the model's performance.

4. Confusion Matrix:

A confusion matrix is a table that summarizes the performance of a classification model by comparing the
predicted values against the true values. It shows the number of true positives, false positives, true negatives, and
false negatives.

5. Sensitivity:

Sensitivity, also known as recall or true positive rate (TPR), is a metric that measures the proportion of actual
positive cases that are correctly identified by the model.
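All five of these metrics can be computed from the test-set predictions; a minimal sketch, reusing the fitted models dictionary and test split from the methodology sketch above:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Predictions from any fitted model, e.g. the logistic regression above.
y_pred = models["logistic regression"].predict(X_test)

# Confusion matrix layout for these binary labels: [[TN, FP], [FN, TP]].
print(confusion_matrix(y_test, y_pred))
print("accuracy   :", accuracy_score(y_test, y_pred))   # (TP+TN) / all predictions
print("precision  :", precision_score(y_test, y_pred))  # TP / (TP+FP)
print("sensitivity:", recall_score(y_test, y_pred))     # recall: TP / (TP+FN)
print("f1 score   :", f1_score(y_test, y_pred))         # harmonic mean of P and R
```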
Results and Comparative Analysis of Machine Learning Models
Logistic Regression

It is a binary classification algorithm that predicts the probability of occurrence of an event by fitting the
data to a logistic function. The logistic function is a type of S-shaped curve, and it maps any real-valued
number to a value between 0 and 1, which represents the probability of occurrence of an event. In
logistic regression, the independent variables can be continuous or categorical, and the dependent
variable is binary.

Advantages: simple, interpretable, flexible, and performs well on small datasets.

Disadvantages: assumes linearity, assumes independence, can overfit, and is sensitive to outliers.
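A minimal logistic regression sketch on the preprocessed splits from earlier; predict_proba exposes the probabilities produced by the logistic function described above:

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)

# The logistic function maps each sample to a probability in (0, 1);
# predict() simply thresholds these probabilities at 0.5.
probs = clf.predict_proba(X_test)[:, 1]
print("first five probabilities:", probs[:5])
print("test accuracy:", clf.score(X_test, y_test))
```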
K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm that can be used for both
classification and regression tasks. In KNN, the output for an unseen data point is determined by
finding the k-nearest neighbors to that point from the training set, and then using those neighbors
to determine the output value.
Advantages: simple, easy to understand, non-parametric, and usable for both classification and regression tasks.

Disadvantages: computationally expensive, sensitive to the choice of k, and sensitive to the scale of the data.

In summary, KNN is a simple and flexible algorithm that can be used for both classification and regression tasks. However, it can be computationally expensive and sensitive to both the choice of k and the scale of the data.
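A short KNN sketch on the same scaled splits; the n_neighbors value (k) is illustrative and would normally be tuned, since results are sensitive to it:

```python
from sklearn.neighbors import KNeighborsClassifier

# k = 5 is a common default; because KNN classifies by distance to the
# nearest training points, the feature scaling applied earlier matters.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```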
SVC (Support Vector Classifier)

SVC (Support Vector Classifier) is a type of supervised learning algorithm used in machine learning for classification tasks. The main goal of SVC is to create a decision boundary, or hyperplane, that separates different classes in the data. This hyperplane is created by finding the optimal margin, the largest gap between classes. The algorithm tries to maximize the margin between the two classes while minimizing the misclassification error. The support vectors are the data points closest to the decision boundary, and they determine the position of the decision boundary. SVC works well with both linearly separable and non-linearly separable datasets. For non-linear datasets, SVC uses the kernel trick to map the data to a higher-dimensional space where a linear boundary can be found.

Overall, SVC is a powerful classification algorithm that can handle complex datasets and achieve high
accuracy. However, it can be computationally expensive for large datasets and requires careful tuning of
hyperparameters for optimal performance.
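A short sketch using the RBF kernel, a common choice for non-linearly separable data; the C and gamma values shown are scikit-learn defaults, not tuned settings:

```python
from sklearn.svm import SVC

# The RBF kernel implicitly maps samples to a higher-dimensional space
# where a linear separating hyperplane can be found (the kernel trick).
svc = SVC(kernel="rbf", C=1.0, gamma="scale")
svc.fit(X_train, y_train)
print("support vectors per class:", svc.n_support_)
print("test accuracy:", svc.score(X_test, y_test))
```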
Decision Tree
A decision tree is a type of supervised learning algorithm used in machine learning for both classification and regression tasks. A decision tree represents a hierarchy of decisions and their possible consequences. It consists of nodes that represent decision points and edges that represent the possible outcomes of those decisions. The topmost node is called the root node, and the final nodes are called leaf nodes.
The decision tree is constructed using a top-down approach, where the dataset is recursively split into subsets based
on the most important features, until a stopping criterion is reached. The goal of the decision tree is to create a tree
that predicts the target variable with high accuracy for new and unseen data.
Decision trees are easy to understand and interpret, which makes them useful for explaining the reasoning behind a
prediction. They can also handle both numerical and categorical data and are capable of capturing non-linear
relationships between variables.

However, decision trees can be prone to overfitting, especially when the tree is too complex or the dataset is noisy.
To mitigate this, techniques such as pruning and limiting the maximum depth of the tree can be used. Additionally,
ensemble methods such as random forests and gradient boosting can be used to improve the accuracy and
robustness of the decision tree.
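A sketch of the depth-limiting control mentioned above; max_depth=4 is an illustrative value, not a recommendation:

```python
from sklearn.tree import DecisionTreeClassifier

# Limiting tree depth is one of the pruning-style controls for keeping the
# tree from memorizing noise; comparing train vs. test accuracy gives a
# quick read on whether the tree is overfitting.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy :", tree.score(X_test, y_test))
```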
Random Forest
Random forest is a type of ensemble learning method used in machine learning for both classification and regression tasks. It is based on the idea of creating multiple decision trees and combining their outputs to make a final prediction.

The basic idea behind a random forest is to create multiple decision trees using different subsets of the training data and features. Each tree is trained using a random subset of the data and a random subset of the features. This randomness helps to reduce overfitting and increase the diversity of the trees. The final prediction of a random forest is made by aggregating the predictions of all the individual trees: for classification tasks it is usually the majority vote of all the trees, while for regression tasks it is the average of all the tree outputs.

Random forests have several advantages over a single decision tree. They are less prone to overfitting, can handle high-dimensional data and non-linear relationships between features, and are generally more accurate than a single decision tree. They are also able to provide an estimate of feature importance, which can be useful in feature selection.

However, random forests can be computationally expensive and may not perform well on small datasets. They also have a black-box nature, making it difficult to interpret how the model makes predictions.
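A sketch of a random forest on the same splits, including the feature-importance estimate mentioned above; 100 trees is scikit-learn's default and is illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# 100 trees, each trained on a bootstrap sample of the data with a random
# subset of features considered at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))

# Impurity-based importances; indices of the five most important features.
top5 = np.argsort(forest.feature_importances_)[::-1][:5]
print("top feature indices:", top5)
```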
XGBoost (Extreme Gradient Boosting)

XGBoost (Extreme Gradient Boosting) is a type of gradient boosting algorithm used in machine learning for both regression and classification tasks. It is a popular and powerful algorithm that has won several Kaggle competitions and is widely used in industry.

XGBoost works by iteratively adding decision trees to the model, where each new tree is trained to correct the errors of the previous trees. During the training process, the algorithm calculates the gradients of the loss function with respect to the predictions and uses them to update the model parameters. One of the key features of XGBoost is that it uses a regularized objective function that combines a loss function and a penalty term to prevent overfitting. The algorithm also allows for parallel processing and is optimized for speed and memory efficiency.

XGBoost has several advantages over other gradient boosting algorithms. It is highly scalable and can handle large datasets with millions of samples and thousands of features. It also performs well on a wide range of tasks and is able to capture complex non-linear relationships between features. In addition, XGBoost has several advanced features such as early stopping to prevent overfitting, tree pruning to improve generalization, and automatic feature selection.

Overall, XGBoost is a powerful and versatile algorithm that is widely used in machine learning for its performance and
flexibility.
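A sketch with the regularization and early-stopping features described above. The placement of early_stopping_rounds (constructor vs. fit call) varies across xgboost versions, so this assumes a recent release; the hyperparameter values are illustrative, and using the test split as the eval set is for brevity only, as a separate validation split would be used in practice.

```python
from xgboost import XGBClassifier

# reg_lambda contributes the L2 penalty term of the regularized objective;
# early stopping halts boosting once the eval-set loss stops improving.
model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.1,
    reg_lambda=1.0,
    eval_metric="logloss",
    early_stopping_rounds=20,  # constructor placement assumes xgboost >= 1.6
)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
print("best iteration:", model.best_iteration)
print("test accuracy :", model.score(X_test, y_test))
```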
