
Training Cost-Sensitive Classifiers to Tackle Imbalanced Data

Utkarsh Kejriwal1[0009-0004-4952-3117] and Komal Arora2[1111-2222-3333-4444]


1 Lovely Professional University, Phagwara, Punjab, India
2 Lovely Professional University, Phagwara, Punjab, India

Abstract. The amount of data generated worldwide continues to grow rapidly. Classifying such large volumes of data presents multiple issues, including skewness and sparsity, which may result in a significant imbalance between classes. Several techniques, such as sampling, cost-sensitive learning, and ensemble methods, have been investigated to tackle this problem of class imbalance. Cost-sensitive learning is a potent approach to dealing with imbalanced data that can enhance the performance of a machine learning model, lower the risk of false negatives, and improve the interpretability of the model. This paper presents an empirical study that applies cost-sensitive learning to several predictive models: binary logistic regression, decision tree, and support vector machine classifiers. We investigate how classification accuracy differs when each of these models is trained with and without misclassification costs assigned to the classes of the dataset. The experiment conducted to validate the research yields mixed results: the accuracy of the classification models decreases when a cost is applied, but a significant increase in recall is observed once misclassification costs are added to the model.

Keywords: Data Imbalance, Classification, Cost-Sensitive Learning (CSL).

1 Introduction

Massive volumes of data have been continuously generated over the last few years. In most real-time applications, such as credit card fraud detection, intrusion detection, and disease diagnosis, the data produced is highly skewed [2,6]. This skewed, or imbalanced, data presents a major problem in data mining research [6]. Classifying such data using traditional machine learning models might not produce accurate results for the minority, or less represented, classes in the dataset [3].
Imbalance in data refers to the biased distribution of classes in the data [1-4]. The class with the higher count of observations in the dataset is said to be the majority class, whereas the class with the lower frequency is considered the minority class [9]. In most situations, the minority classes possess greater value than the majority classes [1,4]. Hence, the misclassification of minority classes may bring forward more problems, or cost, than the misclassification of majority classes [7].
Various approaches have been implemented to tackle this problem, broadly categorized into two groups: modifying the data during preprocessing and altering the predictive algorithms according to the user's needs [4]. Data-level techniques include resampling, such as oversampling (upsampling) and undersampling (downsampling), carried out during data preprocessing [2,4,7]. Algorithm-level approaches employ techniques such as recognition-based one-class learning, cost-sensitive learning, and ensemble methods, which learn directly from the class imbalance ratio [10]. The dataset's unique properties and the needs of the situation, however, determine the strategy that should be used [7]. Experimentation and rigorous evaluation are crucial to finding the best approach for a given classification problem.
Cost-sensitive learning is one of the main strategies for addressing the issue of class disparity. In this approach, the classification models are altered by assigning a penalty to the misclassification of minority classes [6]. This method not only helps in dealing with skewed data, but also helps with classification problems where the misclassification of one class can result in real loss to society [5]. Such problems include the misclassification of a cancer diagnosis, which could result in the loss of a life [11]. This paper deals with such problems by employing cost-sensitive learning and using different costs to optimize the classification task.
This paper aims to provide a better understanding of the cost-sensitive learning technique by experimenting on an imbalanced UCI dataset comprising two classes and classifying it using Logistic Regression, a Decision Tree Classifier, and a Support Vector Machine (SVM). It presents the individual performance of the classifiers without CSL and their performance after modification of the algorithm with CSL. The performance of the classifiers with and without misclassification costs is thoroughly compared to obtain the results.

2 Background

2.1 Resampling

Resampling refers to the addition or removal of samples from the dataset [12]. Replicating observations from the minority class or removing observations from the majority class modifies the distribution of the target classes [14]. Resampling techniques aim to create a state of balanced classes, and resampling is continued until this equilibrium is reached. Oversampling entails the addition or replication of minority-class samples, whereas undersampling removes samples from the majority class [17].
Several algorithms, such as SMOTE, ADASYN, and Borderline-SMOTE, have been developed to support the oversampling strategy [16]. All of these techniques create artificial samples for the positive class, expanding the dataset's size and possibly lengthening the classifier's computation and training time [15]. Oversampling methods also tend to overfit the classification model and produce biased results [16].
Undersampling can be carried out using a number of previously investigated techniques, including random undersampling, Tomek links, and Cluster Centroids [13]. Nevertheless, these techniques frequently eliminate samples that contain crucial information, which leads to data loss, underfitting of the model, and inaccurate outcomes [15].
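For illustration, the following sketch shows both resampling strategies on toy data using the imbalanced-learn library; the tooling is an assumption, since the studies cited above do not prescribe a specific implementation.

```python
# Minimal resampling sketch with imbalanced-learn (illustrative only).
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
# Toy imbalanced data: 700 majority (0) vs 300 minority (1) samples.
X = rng.normal(size=(1000, 4))
y = np.array([0] * 700 + [1] * 300)

# Oversampling: SMOTE synthesizes new minority samples.
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_over))   # [700 700]

# Undersampling: randomly discard majority samples.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(np.bincount(y_under))  # [300 300]
```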

2.2 Cost Sensitive Learning

The algorithmic method for addressing data skewness combines the classifier with cost-sensitive learning [7]. The fundamental principle of cost-sensitive learning is to penalize the misclassification of the minority (positive) class as a false negative more heavily than the misclassification of the majority (negative) class as a false positive [7].

Table 1. Fundamental Design of a Cost Matrix

                            Predicted Class
                            +ve         -ve
Actual Class    +ve         C(P,P)      C(P,N)
                -ve         C(N,P)      C(N,N)

Table 1 represents the cost matrix of a classification problem with two classes. Here,
C(P,P) = cost of correctly predicting the positive class,
C(P,N) = cost of misclassifying an actual positive as negative (a false negative),
C(N,P) = cost of misclassifying an actual negative as positive (a false positive),
C(N,N) = cost of correctly predicting the negative class.
Hence, the total cost (Tc) of the classification problem is

Tc = Freq(FN) × C(P,N) + Freq(FP) × C(N,P)          (1)

The main objective of the classification model is to minimize the total cost (Tc) by reducing the count of false positive and false negative predictions [8]. Generally, the true positive and true negative costs are set to 0, whereas the false positive and false negative predictions are assigned charges according to the problem at hand [5].
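As an illustration, Eq. (1) can be computed directly from a confusion matrix; the sketch below uses scikit-learn with hypothetical labels and cost values.

```python
# Illustrative computation of the total cost in Eq. (1); the labels and
# costs below are hypothetical, not taken from the paper's experiment.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]

# scikit-learn orders the matrix as [[TN, FP], [FN, TP]] for labels (0, 1).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

cost_fn = 5  # C(P,N): penalty for missing a positive (minority) case
cost_fp = 1  # C(N,P): penalty for a false alarm on the negative class
total_cost = fn * cost_fn + fp * cost_fp
print(total_cost)  # 2 * 5 + 1 * 1 = 11
```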

2.3 Classification Algorithms

In the context of data mining and predictive analytics, classification is a technique used to determine the class or grouping to which a set of data examples belongs [19]. Of the several machine learning techniques available, classification is the most frequently used [19]. In this section, the working of a few classification models is discussed.

Logistic Regression. This classification function creates classes using a single multinomial logistic regression model and estimator [18]. Logistic regression establishes class boundaries and calculates class probabilities based on the distance from the boundary [20]. These probabilities converge to the extremes (0 and 1) more quickly as the dataset size grows [20]. Because of these probability-based features, logistic regression is more than just a classifier: it makes precise and reliable predictions possible, although these predictions can still be wrong [19]. Logistic regression is one of the most frequently used methods for discrete data analysis and practical statistical analysis.
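For instance, the probability assigned to the positive class is the logistic (sigmoid) function of the signed linear score; the sketch below uses hypothetical weights rather than values fitted to any real data.

```python
# How logistic regression turns the signed distance-like score w·x + b
# into a class probability (weights are hypothetical, not fitted).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([1.5, -0.8]), 0.2
x = np.array([0.4, 1.1])

score = w @ x + b            # linear score relative to the boundary
p_positive = sigmoid(score)  # far positive side -> close to 1, far negative -> close to 0
print(round(float(p_positive), 3))
```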

Decision Tree Classifier. Decision trees maximize the separation between classes by iteratively segmenting the data into subsets based on the values of particular attributes [21]. At each node of the tree, the data is split into two or more subgroups based on the values of a certain feature [21]. The process is repeated until the data has been divided into subsets that are homogeneous with regard to the target variable [21]. The resulting tree can be used to categorize fresh data by traversing the path from the root to a leaf node, where the class label is established based on the majority class of the training cases that reach that node [21].

Support Vector Machine. Many scientific disciplines use support vector machines, a powerful and trustworthy method for classification and regression [19]. SVMs function by identifying the hyperplane that best separates the classes. The hyperplane is chosen so that the margin, the distance between it and the closest data points from each class, is maximized [22]. SVMs can take a long time to compute, especially for large datasets, hence numerous methods to speed up training, including data reduction and chunking, have been proposed [18].

3 Methodology

3.1 Dataset

The Statlog (German Credit Data) dataset is taken from the UCI data repository [23]. The 1000 records provided are characterized by 20 features, 7 numerical and 13 categorical. This dataset categorizes individuals as good or bad credit risks based on a set of attributes. Of the 1000 observations, 700 records represent good credits and form the majority class, while 300 represent bad credits and form the minority class.
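The dataset can be loaded as sketched below; the UCI file location and the 1 = good / 2 = bad target coding follow the repository's documentation [23], while the generic column names are placeholders.

```python
# Loading the Statlog (German Credit Data) set; column names are
# placeholders, and the URL/encoding are taken from the UCI docs.
import pandas as pd

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "statlog/german/german.data")
columns = [f"attr{i}" for i in range(1, 21)] + ["credit_risk"]
df = pd.read_csv(URL, sep=" ", header=None, names=columns)

# Remap the target so the minority "bad" class is the positive class (1).
df["credit_risk"] = (df["credit_risk"] == 2).astype(int)
print(df["credit_risk"].value_counts())  # expected: 700 good, 300 bad
```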

3.2 Cost Setting Strategy

In this paper, two different cost setting strategies are applied: the imbalance ratio and the cost matrix provided by the authors of the dataset used in the experiment [23]. Table 2 presents the cost matrix derived from the imbalance ratio (Cost Matrix A), whereas Table 3 presents the cost matrix provided by the authors (Cost Matrix B).

Table 2. Cost Matrix using Imbalance Ratio (Cost Matrix A)

                            Predicted Class
                            Good        Bad
Actual Class    Good        0           3
                Bad         7           0

Table 3. Cost Matrix using Domain Knowledge (Cost Matrix B)

                            Predicted Class
                            Good        Bad
Actual Class    Good        0           1
                Bad         5           0

The cost matrix provided by the authors follows the fundamental logic that it is worse to predict a customer with a bad credit score as a good customer than to predict a customer with a good credit score as a bad customer. The cost matrix using the imbalance ratio, on the other hand, is based on the imbalance between the classes: it penalizes the model more for classifying the minority (positive) class as the majority class than for classifying the majority (negative) class as the minority class.
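Since both matrices assign zero cost to correct predictions, each reduces to one misclassification penalty per class. A common way to express this in scikit-learn is the class_weight parameter; this mapping is an assumption about tooling, as the paper does not specify its implementation.

```python
# Hypothetical translation of the two cost matrices into per-class
# penalties under scikit-learn's class_weight convention.
# Keys: 0 = good (majority), 1 = bad (minority, positive class).
COST_MATRIX_A = {0: 3, 1: 7}  # imbalance-ratio costs (Table 2)
COST_MATRIX_B = {0: 1, 1: 5}  # author-provided domain costs (Table 3)
```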

3.3 Classification Models

The dataset is classified using three machine learning models: Logistic Regression, Decision Tree, and Support Vector Machine. Each of these classifiers is evaluated using three different strategies: (i) without a cost matrix, (ii) with Cost Matrix A (CMA), and (iii) with Cost Matrix B (CMB).
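A minimal sketch of this setup follows, again assuming scikit-learn and its class_weight mechanism as the cost-sensitive device; the paper does not name its actual framework.

```python
# The three classifiers under the three strategies (illustrative sketch).
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

STRATEGIES = {"No Cost": None, "CMA": {0: 3, 1: 7}, "CMB": {0: 1, 1: 5}}

def make_models(weights):
    """Return the three classifiers configured with one cost strategy."""
    return {
        "Logistic Regression": LogisticRegression(max_iter=1000,
                                                  class_weight=weights),
        "Decision Tree": DecisionTreeClassifier(random_state=0,
                                                class_weight=weights),
        "SVM": SVC(random_state=0, class_weight=weights),
    }
```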

3.4 Evaluation Measures

Every model and strategy combination is assessed using four metrics: accuracy (recognition rate), precision, recall, and the F1 score, where precision and recall evaluate the models based on how they handle the positive class. The objective is to determine the best possible classification model based on these evaluation metrics.
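Assuming the loading and model sketches above, plus a hypothetical 80/20 stratified split and one-hot encoding of the categorical attributes, the evaluation loop could look as follows.

```python
# Evaluation loop; reuses df, STRATEGIES, and make_models from the
# sketches above. The split size and encoding are assumptions.
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split

X = pd.get_dummies(df.drop(columns="credit_risk"))  # one-hot categoricals
y = df["credit_risk"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

for label, weights in STRATEGIES.items():
    for name, model in make_models(weights).items():
        y_pred = model.fit(X_train, y_train).predict(X_test)
        print(f"{label:8s} {name:20s} "
              f"acc={accuracy_score(y_test, y_pred):.3f} "
              f"prec={precision_score(y_test, y_pred):.3f} "
              f"rec={recall_score(y_test, y_pred):.3f} "
              f"f1={f1_score(y_test, y_pred):.3f}")
```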

4 Experiment

4.1 Results

The following tables provide information about the performance of the several classi-
fication models that were employed for this investigation. The assessment results of
Logistic Regression, Decision Tree Classifier and Support Vector Machine for vari-
ous cost matrices are shown in Tables 4, 5, and 6, respectively.

Table 4. Performance of Logistic Regression (values in %)


Performance Metric No Cost Cost Matrix A Cost Matrix B
Accuracy 71.5 68.5 53.5
Precision 55.17 48.23 38.46
Recall 26.66 68.33 91.66
F1 Score 35.95 56.55 54.18

Table 5. Performance of Decision Tree Classifier (values in %)


Performance Metric No Cost Cost Matrix A Cost Matrix B
Accuracy 67.5 71 71.5
Precision 46.03 51.19 52.63
Recall 48.33 71.66 50
F1 Score 47.15 59.72 51.28

Table 6. Performance of Support Vector Machine (values in %)


Performance Metric No Cost Cost Matrix A Cost Matrix B
Accuracy 72 70 67.5
Precision 66.66 50 45.61
Recall 13.33 25 43.33
F1 Score 22.22 33.33 44.44

4.2 Discussion

The results show the differences in the performance of the classification methods when different costs are applied to the classes of the dependent variable. The Logistic Regression and Support Vector Machine classifiers show a decrease in both overall accuracy and precision for Cost Matrix A and Cost Matrix B compared with the no-cost setting. However, there is a significant rise in the recall and F1 scores of these models when a cost is assigned to incorrect predictions of the minority class. The Decision Tree Classifier, by contrast, performs slightly better on every evaluation metric when a cost matrix is provided to the model.
A similar observation is made when comparing the performance under Cost Matrix A and Cost Matrix B. Cost Matrix A yields higher accuracy and precision for Logistic Regression and the Support Vector Machine, but significantly lower recall in both models than Cost Matrix B. The Decision Tree Classifier portrays a different narrative: Cost Matrix A gives slightly inferior accuracy and precision than Cost Matrix B, but significantly higher recall and F1 score.

5 Conclusion

In machine learning and data science, the problem of data imbalance is crucial, particularly in the context of classification tasks. Addressing this issue is of utmost importance and can be done using several techniques, such as resampling methods and algorithms designed to handle imbalanced data, including ensemble methods and cost-sensitive learning.
This research provides a glimpse of the effects of cost-sensitive learning on different classification algorithms. Two different cost matrices are employed and compared with the unmodified classification models. The results are mixed: the overall accuracy of the models decreases when a misclassification cost is added, but the recall values increase significantly. This indicates that, with a misclassification cost in place, the models produce fewer false negatives and recognize more of the pertinent positive cases in the dataset.
As evidenced by the results, cost-sensitive learning reduces the bias of the classification models and, unlike resampling techniques, maintains the integrity of the original data. It also supports better generalization when the correct misclassification costs are derived. Further studies can be conducted to optimize cost setting strategies for different datasets and to improve accuracy as well as the other evaluation metrics using those costs.

References
1. Kaur, Gagandeep, Veerpal Kaur, Yashika Sharma, and Vishnu Bansal. "Analyzing various
Machine Learning Algorithms with SMOTE and ADASYN for Image Classification hav-
ing Imbalanced Data." In 2022 IEEE International Conference on Current Development in
Engineering and Technology (CCET), pp. 1-7. IEEE, 2022.
2. Longadge, Rushi, and Snehalata Dongre. "Class imbalance problem in data mining re-
view." arXiv preprint arXiv:1305.1707 (2013).
3. Wang, Shuo, and Xin Yao. "Multiclass imbalance problems: Analysis and potential solu-
tions." IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 42, no.
4 (2012): 1119-1130.
4. Punsin, Parinya, and Jakramate Bootkrajang. "A Comparative Study of Misclassification
Cost Assignment Strategies for Cost-sensitive AdaBoost in Imbalance Data Classifica-
tion." In 2022 19th International Conference on Electrical Engineering/Electronics, Com-
puter, Telecommunications and Information Technology (ECTI-CON), pp. 1-4. IEEE,
2022.
5. Elkan, Charles. "The foundations of cost-sensitive learning." In International joint confer-
ence on artificial intelligence, vol. 17, no. 1, pp. 973-978. Lawrence Erlbaum Associates
Ltd, 2001.
6. Thai-Nghe, Nguyen, Zeno Gantner, and Lars Schmidt-Thieme. "Cost-sensitive learning
methods for imbalanced data." In The 2010 International joint conference on neural net-
works (IJCNN), pp. 1-8. IEEE, 2010.
7. Chen, You. "Research on Cost-sensitive Classification Methods for Imbalanced Data."
In 2021 International Conference on Artificial Intelligence, Big Data and Algorithms
(CAIBDA), pp. 224-228. IEEE, 2021.

8. Zhou, Zhi-Hua, and Xu-Ying Liu. "Training cost-sensitive neural networks with methods
addressing the class imbalance problem." IEEE Transactions on knowledge and data
engineering 18, no. 1 (2005): 63-77.
9. McCarthy, Kate, Bibi Zabar, and Gary Weiss. "Does cost-sensitive learning beat sampling
for classifying rare classes?." In Proceedings of the 1st international workshop on Utility-
based data mining, pp. 69-77. 2005.
10. Seiffert, Chris, Taghi M. Khoshgoftaar, Jason Van Hulse, and Amri Napolitano. "A com-
parative study of data sampling and cost sensitive learning." In 2008 IEEE international
conference on data mining workshops, pp. 46-52. IEEE, 2008.
11. Galar, Mikel, Alberto Fernandez, Edurne Barrenechea, Humberto Bustince, and Francisco
Herrera. "A review on ensembles for the class imbalance problem: bagging-, boosting-,
and hybrid-based approaches." IEEE Transactions on Systems, Man, and Cybernetics,
Part C (Applications and Reviews) 42, no. 4 (2011): 463-484.
12. Qazi, N. "Effect of Feature Selection, Synthetic Minority Over-sampling (SMOTE) and
Under-sampling on Class Imbalance Classification." 2012.
13. Xu, Huan. "Hierarchical cost-sensitive techniques for class imbalance learning." In 2021
4th International Conference on Artificial Intelligence and Big Data (ICAIBD), pp. 604-
609. IEEE, 2021.
14. Choirunnisa, Shabrina, and Joko Lianto. "Hybrid method of undersampling and oversam-
pling for handling imbalanced data." In 2018 International Seminar on Research of Infor-
mation Technology and Intelligent Systems (ISRITI), pp. 276-280. IEEE, 2018.
15. Sabha, Saqib Ul, Assif Assad, Nusrat Mohi Ud Din, and Muzafar Rasool Bhat. "Compara-
tive Analysis of Oversampling Techniques on Small and Imbalanced Datasets Using Deep
Learning." In 2023 3rd International conference on Artificial Intelligence and Signal
Processing (AISP), pp. 1-5. IEEE, 2023.
16. Shamsudin, Haziqah, Umi Kalsom Yusof, Andal Jayalakshmi, and Mohd Nor Akmal
Khalid. "Combining oversampling and undersampling techniques for imbalanced classifi-
cation: A comparative study using credit card fraudulent transaction dataset." In 2020
IEEE 16th International Conference on Control & Automation (ICCA), pp. 803-808.
IEEE, 2020.
17. Shelke, Mayuri S., Prashant R. Deshmukh, and Vijaya K. Shandilya. "A review on imbal-
anced data handling using undersampling and oversampling technique." Int. J. Recent
Trends Eng. Res 3, no. 4 (2017): 444-449.
18. Osisanwo, F. Y., J. E. T. Akinsola, O. Awodele, J. O. Hinmikaiye, O. Olakanmi, and J.
Akinjobi. "Supervised machine learning algorithms: classification and comparison." Inter-
national Journal of Computer Trends and Technology (IJCTT) 48, no. 3 (2017): 128-138.
19. Soofi, Aized Amin, and Arshad Awan. "Classification techniques in machine learning:
applications and issues." Journal of Basic & Applied Sciences 13, no. 1 (2017): 459-465.
20. Cokluk, Omay. "Logistic Regression: Concept and Application." Educational Sciences:
Theory and Practice 10, no. 3 (2010): 1397-1407.
21. Safavian, S. Rasoul, and David Landgrebe. "A survey of decision tree classifier methodol-
ogy." IEEE transactions on systems, man, and cybernetics 21, no. 3 (1991): 660-674.
22. Cervantes, Jair, Farid Garcia-Lamont, Lisbeth Rodríguez-Mazahua, and Asdrubal Lopez.
"A comprehensive survey on support vector machine classification: Applications, chal-
lenges and trends." Neurocomputing 408 (2020): 189-215.
23. Hofmann, Hans. "Statlog (German Credit Data) data set." UCI Repository of Machine
Learning Databases 53 (1994).
