
Machine Learning

Selecting a Classification Model

Author
S Sanjayakumar

Table of Contents
1. Introduction

2. Dataset Introduction

3. Data Preprocessing and Feature Selection

4. Data Analysis

5. Conclusion

6. References

1. Introduction
In this report I analyze a breast cancer dataset using four classification models: Logistic Regression, Decision Tree, Random Forest, and Support Vector Machine (SVM). The main purpose of the analysis is to identify the most suitable classification model for the given dataset.

➢ Required libraries.
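The report does not list the libraries themselves (the original screenshot is missing), so the following is a plausible set for the analysis described, assuming the standard pandas/scikit-learn stack:

```python
# Hypothetical imports for this analysis; the report's actual list is not shown.
import pandas as pd                                   # data loading and cleaning
import numpy as np                                    # numerical helpers
from sklearn.model_selection import train_test_split  # 70/30 split
from sklearn.preprocessing import MinMaxScaler        # scaling to [0, 1]
from sklearn.linear_model import LogisticRegression   # model I
from sklearn.tree import DecisionTreeClassifier       # model II + feature selection
from sklearn.ensemble import RandomForestClassifier   # model III
from sklearn.svm import SVC                           # model IV
from sklearn.metrics import (confusion_matrix,        # evaluation
                             classification_report, roc_auc_score)
```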

2. Dataset Introduction
I imported the dataset into Python using pandas. Checking its dimensions showed 1050 rows and 15 columns. When I checked the data types, every variable appeared as a continuous value, but according to the variable description many of them are binary, such as gender, hereditary_history, marital_status, marital_length, pregnency_experience, giving_birth, abortion, smoking, alcohol, and breast_pain. Some other categorical variables are stored as numerical values, such as blood and Condition. The remaining three variables are genuinely continuous. Following the requirement, I chose chest pain as the dependent variable: I needed to run the classification techniques and identify which one performs well, and this variable seemed reasonable for that purpose.
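The inspection above can be sketched as follows. The real file is not provided, so this uses a tiny synthetic stand-in with made-up values; the column names follow the variable description in the text:

```python
import pandas as pd

# Tiny synthetic stand-in for the report's dataset (1050 rows x 15 columns);
# values here are invented purely for illustration.
df = pd.DataFrame({
    "age": [34.0, 45.0, 51.0],
    "gender": [0.0, 1.0, 0.0],       # binary, but stored with a numeric dtype
    "chest_pain": [1.0, 0.0, 1.0],   # the chosen dependent variable
})
print(df.shape)    # the real dataset is reported as (1050, 15)
print(df.dtypes)   # every column reads as float64, even the binary ones

# Recasting binary/nominal columns as categorical, as done in section 3
df["gender"] = df["gender"].astype("category")
```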

3. Data Preprocessing and Feature Selection

First, I checked every variable for null values; there are none in the dataset. Next, I checked whether the dataset is balanced, because an imbalanced dataset needs different handling when we calculate model accuracy. To do that, I created a bar chart of the dependent variable, which indicated that the dataset is barely balanced. I then converted the binary variables and the nominal categorical variables back to categorical types for the further analysis. After that, I checked whether the numerical variables have outliers, because such extreme points can affect the overall accuracy of the models. A boxplot showed that the age and weight variables have outliers.

To identify the lower and upper outlier bounds I used the interquartile range (IQR) rule: lower bound = Q1 − 1.5 × IQR and upper bound = Q3 + 1.5 × IQR, where IQR = Q3 − Q1 (Outlier Calculator - Calculator Academy (2023)).

Using these bounds, I removed the outliers from the age and weight variables. Next, I used a decision tree classifier to score feature importance, so that the independent variables most related to the dependent variable could be selected. From the feature-importance results we keep the variables with the highest scores, so I dropped the lowest-scoring variables from the dataset.
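Both steps can be sketched as below. The values and feature counts are synthetic stand-ins, not the report's actual data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# --- IQR outlier bounds on a synthetic 'age' column (illustrative values) ---
age = pd.Series([25, 30, 32, 35, 38, 40, 42, 45, 120])   # 120 is an outlier
q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
age_clean = age[(age >= lower) & (age <= upper)]
print(lower, upper, list(age_clean))   # 120 falls outside the upper bound

# --- Feature scoring with a decision tree (synthetic data) ---
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.feature_importances_)   # keep the highest-scoring features
```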

4. Data Analysis
In this part I analyze the dataset with the four models mentioned above to see which one predicts most accurately. First, the dataset is split into training and testing sets: the model is trained on the training set and then makes predictions about the dependent variable on the test set. By comparing accuracy on the training and test sets we can judge whether the model performs well. In this analysis I split the dataset 70% for training and 30% for testing, and I compare Logistic Regression, Decision Tree, Random Forest, and Support Vector Machine on their accuracy measurements to identify which one makes the best predictions.
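The 70/30 split described above looks like this with `train_test_split`; the data here is a synthetic stand-in for the cleaned dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (sizes are illustrative)
X, y = make_classification(n_samples=1000, random_state=0)

# 70% training / 30% testing, as in the text; stratify keeps the class ratio
# similar in both splits, which matters for a barely balanced dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y)
print(X_train.shape, X_test.shape)   # (700, 20) and (300, 20)
```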

➢ Required libraries.

I. Logistic Regression
Logistic regression is a supervised machine learning technique designed specifically for classification. It uses the probability of an instance belonging to a specific class to make predictions: the instance is assigned to the class with the higher predicted probability. Unlike linear regression, the predicted values in logistic regression follow a sigmoid (S-shaped, nonlinear) curve.
Before fitting the logistic regression I applied scaling to the x variables. Scaling simply maps the values into a fixed range; here I used the MinMax scaler to bring the values into the interval between 0 and 1. I picked this scaler because the standard scaler is more sensitive to outliers. After that I fitted the logistic regression model with x and y, which produced the following results.

This is a confusion matrix showing the true positive, true negative, false negative and false positive counts, from which the exact accuracy of the model can be computed. According to it, 148 values are correctly predicted as patients having chest pain and 35 are correctly predicted as patients not having chest pain. Conversely, 22 values are wrongly predicted as the patient not having chest pain and 70 are wrongly predicted as the patient having chest pain.

These results show that, according to the F1 score, the logistic regression model predicts whether a patient has chest pain with 76% accuracy, and the overall prediction accuracy is 66%. Therefore, this model performs well with this dataset (both the F1 score and the accuracy should be above 50%).
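These metrics can be read off scikit-learn's confusion matrix directly; the toy labels below stand in for the model's real predictions:

```python
from sklearn.metrics import confusion_matrix, f1_score

# Toy true/predicted labels standing in for the model's output
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)            # 2 1 1 4
print(f1_score(y_true, y_pred))  # 0.8
```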

In a ROC curve, the AUC indicates the overall performance of the model as a probability on a scale from 0 to 1. The curve itself plots the true positive rate against the false positive rate at different classification thresholds. According to this ROC curve, the area under the curve is 0.6, so the probability of the model predicting correctly overall is 0.6 (60%).

II. Decision Tree
This is also a supervised machine learning model; it is an eager-learner classification technique, building its full model during training rather than deferring work to prediction time. A decision tree has three main kinds of node: the root node, decision nodes and leaf nodes. The root node covers all features and values of the dataset, decision nodes test features (variables), and leaf nodes hold the values or classes that cannot be divided further. The tree splits the dataset on the features, and keeps splitting until there is nothing left to split; in the end we can read off the best feature and class. As before, I fitted the model to the dataset.
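Fitting the tree is a one-liner in scikit-learn; again this is a sketch on synthetic stand-in data, and the `max_depth=5` cap is an assumption, not the report's setting:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)   # synthetic stand-in
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# With no max_depth the tree splits until leaves are pure; capping the depth
# is a common way to limit overfitting (the cap here is illustrative)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)
print(tree.score(X_test, y_test))   # test-set accuracy
```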

According to this, the model correctly predicted 115 true positive and 49 true negative values, which is good compared with the 56 false positives and 53 false negatives. Fewer false predictions means higher accuracy, since accuracy is the share of correct predictions (true positives plus true negatives) out of the total.
According to this report, the F1 score for the true positives is 68% and the overall accuracy comes out at 60%, so this model also performs reasonably with the dataset.
According to this ROC curve, the probability of good overall performance on the dataset is 0.58 (58%), whereas the classification report shows a model accuracy of 60%. The reason for this difference may be the class balance of the dataset, which is why the ROC curve shows a lower percentage.

III. Random Forest
Random Forest is a supervised machine learning technique that builds a set of decision trees internally, runs each of them, and takes the class chosen by the majority of trees as the final prediction. Here we have to decide how many decision trees to use and how many features each split considers.
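The two choices just mentioned map onto `n_estimators` and `max_features`; the values below are illustrative defaults, not the report's settings, and the data is a synthetic stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)   # synthetic stand-in
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# n_estimators = number of trees that vote; max_features = features per split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0).fit(X_train, y_train)
print(rf.score(X_test, y_test))   # majority-vote accuracy on the test set
```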

According to the confusion matrix, this model produced 135 true positives and 39 true negatives, against 66 false positives and 33 false negatives. Since the false predictions are comparatively fewer than the true ones, the accuracy is correspondingly higher: the overall accuracy of the model is 64%, and 73% of the true positives were predicted accurately. By these results the model seems to perform well, but the ROC curve gives an AUC of 0.59, which suggests the model is underperforming relative to the classification report. This might be because of the imbalanced nature of the dataset.

IV. Support Vector Machine (SVM)
The Support Vector Machine is a supervised machine learning model used for both regression and classification. An SVM finds the optimal hyperplane in N-dimensional space that separates the data points of the different classes. The dimension of that space is determined by the number of independent variables: as the number of independent variables increases, the dimension of the hyperplane increases as well.
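A minimal SVM sketch on synthetic stand-in data. SVMs are sensitive to feature scale, so the features are scaled first, consistent with section I; the RBF kernel choice here is an assumption, as the report does not state which kernel was used:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)   # synthetic stand-in
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# Scale to [0, 1] first; SVM margins depend on feature scale
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

svm = SVC(kernel="rbf").fit(X_train_s, y_train)   # kernel choice is assumed
print(svm.score(X_test_s, y_test))   # test-set accuracy
```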

According to the confusion matrix, there are 150 true positives and 31 true negatives, while the model's wrong predictions amount to 18 false negatives and 74 false positives. In this model, too, the true positives and true negatives outnumber the false negatives and false positives. The overall accuracy is 66% and 77% of the true positives were predicted accurately, so the model again seems to perform well, but the ROC curve gives an AUC of 0.59 (59%). This difference, too, likely comes from the imbalanced nature of the dataset.

Conclusion
In this report I applied four important classification techniques to identify which one suits the given dataset. Based on the results, I suggest that logistic regression is the best model for this dataset: it shows an overall accuracy of 60% according to the ROC curve, a model accuracy of 66%, and 76% accuracy in predicting the true positives. Although the Support Vector Machine also performs well, with 66% overall accuracy and 77% accuracy in predicting the true positives, its ROC-based accuracy is comparatively lower than that of logistic regression.

References
➢ Outlier Calculator - Calculator Academy (2023) [online] available from <https://calculator.academy/outlier-calculator/>
➢ GeeksforGeeks (2023) Support Vector Machine SVM Algorithm [online] available from <https://www.geeksforgeeks.org/support-vector-machine-algorithm/>
➢ GeeksforGeeks (2023b) Guide to AUC ROC Curve in Machine Learning [online] available from <https://www.geeksforgeeks.org/auc-roc-curve/>
➢ GeeksforGeeks (2023c) Getting Started with Classification [online] available from <https://www.geeksforgeeks.org/getting-started-with-classification/>
