An Extension of Synthetic Minority Oversampling Technique Based On

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 12

Machine Learning with Applications 8 (2022) 100267

Contents lists available at ScienceDirect

Machine Learning with Applications


journal homepage: www.elsevier.com/locate/mlwa

An extension of Synthetic Minority Oversampling Technique based on


Kalman filter for imbalanced datasets
Thejas G.S. a ,∗, Yashas Hariprasad b , S.S. Iyengar b , N.R. Sunitha c , Prajwal Badrinath b ,
Shasank Chennupati d
a
Tarleton State University, Texas A&M University System, Department of Computer Science and Electrical Engineering, Stephenville, TX, 76401, USA
b
Florida International University, Discovery Lab, Knight Foundation School of Computing and Information Sciences, Miami, FL, 33199, USA
c Siddaganga Institute of Technology, Department of Computer Science and Engineering, Tumakuru, Karnataka, 572103, India
d University of North Carolina at Chapel Hill, School of Medicine, Chapel Hill, NC, 27599, USA

ARTICLE INFO ABSTRACT


Keywords: More often than not, data collected in real-time tends to be imbalanced i.e., the samples belonging to a
Imbalanced data particular class are significantly more than the others. This degrades the performance of the predictor. One of
Oversampling the most notable algorithms to handle such an imbalance in the dataset by fabricating synthetic data, is the
SMOTE
‘‘Synthetic Minority Oversampling Technique (SMOTE)’’. However, data imbalance is not solely responsible
Noise filter
for the poor performance of the classifier. Certain research works have demonstrated that noisy samples can
have a significant role in misclassifying the dataset. Also, handling large data is computationally expensive.
Hence, data reduction is imperative. In this work, we put forth a novel extension of SMOTE by integrating it
with the Kalman filter. The proposed method, Kalman-SMOTE (KSMOTE), filters out the noisy samples in the
final dataset after SMOTE, which includes both the raw data and the synthetically generated samples, thereby
reducing the size of the dataset. Our model is validated with a wide range of datasets. An experimental analysis
of the results shows that our model outperforms the presently available techniques.

1. Introduction SMOTE have been proposed. We provide a thorough review of all the
variants in Section 2.
Data is categorized into different classes, where a few of them might Past research shows that an imbalance in the class samples is
have an excessive number of samples, leading to an imbalance. Since not the only concern, as other factors such as noise and borderline
the data has a minority class, the result of the predictor tends to be samples may hinder the performance of the learning algorithm (García,
biased. This becomes a detriment to the performance of the learning Sánchez, & Mollineda, 2007; Japkowicz, 2003; Napierała, Stefanowski,
model. Most of the real-time datasets associated with medicine (Tek, & Wilk, 2010). Applying SMOTE on imbalanced datasets gives bet-
Dempster, & Kale, 2010), text classification (Liu, Loh, & Sun, 2009), ter results, but it generates synthetic samples, resulting in a notable
intrusion detection (Khor, Ting, & Phon-Amnuaisuk, 2012), and click
increase in the size of the data. Presently, the data collected in real-
fraud detection (Fawcett & Provost, 1997) are not balanced. A binary
time is extremely large (BIG DATA) (Zikopoulos, Eaton, et al., 2011).
dataset having 95% positive samples may obtain an extremely high
SMOTE can be considerably improved by performing certain modifica-
classification accuracy. However, this accuracy may be incorrect due
tions (Borderline-SMOTE(blSMOTE) 1,2 (Han, Wang, & Mao, 2005),
to over-fitting.
Extensive research has been performed in the recent past to formu- Safe-level SMOTE (Bunkhumpornpat, Sinapiromsaran, & Lursinsap,
late a solution for the issue of handling imbalanced data, and several 2009)) or by adding some extensions (SMOTE-IPF (SMOTE-Iterative-
solutions have also been suggested (He & Garcia, 2008). Several re- Partitioning Filter) (Sáez, Luengo, Stefanowski, & Herrera, 2015), SMO-
sampling techniques (Bunkhumpornpat, Sinapiromsaran, & Lursinsap, TEENN (SMOTE Edited Nearest Neighbor) or SMOTETOMEK (SMOTE
2011; Chawla, Bowyer, Hall, & Kegelmeyer, 2002; Kubat, Matwin, Tomek Links) (Batista, Prati, & Monard, 2004)).
et al., 1997; Stefanowski & Wilk, 2008) are available to balance the Filters have not been commonly used in combination with SMOTE
dataset, among which SMOTE (Chawla et al., 2002) is a widely rec- until now. We propose an extension called KSMOTE which employs
ognized technique. In the recent years, around eighty-five versions of the Kalman Filter (Bishop, Welch, et al., 2001) to remove noisy data

∗ Corresponding author.
E-mail addresses: sadashiva@tarleton.edu (Thejas G.S.), yhari001@fiu.edu (Y. Hariprasad), iyengar@cis.fiu.edu (S.S. Iyengar), nrsunitha@sit.ac.in
(N.R. Sunitha), prajwal2495@gmail.com (P. Badrinath), shasankchennupati@gmail.com (S. Chennupati).

https://doi.org/10.1016/j.mlwa.2022.100267
Received 2 July 2021; Received in revised form 22 January 2022; Accepted 23 January 2022
Available online 31 January 2022
2666-8270/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).
Thejas G.S., Y. Hariprasad, S.S. Iyengar et al. Machine Learning with Applications 8 (2022) 100267

samples. The use of the Kalman filter as a data-reduction method In Cieslak, Chawla, and Striegel (2006), the authors use repeated in-
improves the efficacy of the classifier which is clearly evident in the cremental pruning to produce error reduction (RIPPER) as an underly-
results we have obtained. We use three Click-Fraud datasets and a few ing rule classifier and also a clustering-based method for oversampling
UCI (Blake & Merz, 1998) datasets that are considered by previous is proposed. In De La Calleja and Fuentes (2007), the method proposed
researchers, to evaluate our work. Furthermore, we make comparisons averages the neighbors to obtain the mean example to oversample
with several, existing variants of SMOTE. Metrics such as Recall (Rcl), the data, also the author only considers the positive dataset to locate
Accuracy (Acry), F1-Score (F1scr) and Precision (Pres) have been used the nearest occurrences utilizing the weighted distance. In He et al.
to provide comparisons. Area Under the Curve (AUC) is another very (2008), the authors have proposed to alter the decision boundary in
good metric to verify overfitting caused by SMOTE (Bradley, 1997). the direction close to the difficult samples by using some techniques.
We portray Receiver Operating Characteristic Curve (ROC Curve) to In De La Calleja, Fuentes, and González (2008), the authors use a
demonstrate the comparison with the other models. collection of classifiers to select the samples from the dataset, and
The proposed extension to the SMOTE algorithm emphasizes on fil- weighted distance is used to balance the dataset. In Gazzah and Amara
tering the noisy data samples that may be generated. SMOTE generates (2008), the authors use 4 different topologies to oversample the minor-
synthetic data for the minority class samples to balance the dataset. ity class using a polynomial fitting function. In Tang and Chen (2008),
Synthetic samples are generated along the line segment joining the the authors have proposed to readjust the direction of the synthetic
minority class nearest neighbors (NN). We can note that for the datasets minority samples by generating data along the first component axis.
which have a mixed class distribution where the classes overlap each In Stefanowski and Wilk (2008), the authors have applied amplifi-
other, we can see that the synthetic samples generated along the line cation methods and selective preprocessing techniques and compared
of NN for the minority class might be in the wrong class space. This SMOTE with NCR. Before generating the data samples, positive instance
can lead to the induction of noisy data samples. Removing these noisy is assigned to the safe level and regenerates the DP around the line
samples plays an important role and this will increase the performance with various weights (Bunkhumpornpat et al., 2009). In Hu, Liang,
of the classifier. We use Kalman filter to identify the noisy samples. Ma, and He (2009), the authors classify the outnumbered samples into
We obtain better results on the classifier when KSMOTE is employed, 3 different groups as noise, border and security samples using the
which can be seen in Section 4 distance and then balance the data according to the groups. In Gu,
Summary of contribution: Cai, and Zhu (2009), the authors have used Isometric feature mapping
algorithm (Isomap) to map the training data, later SMOTE is applied,
• In this work, we have extended SMOTE by applying Kalman and the data is reduced by applying Neighborhood Cleaning Rule (NCR)
Filter. method. In Chen, Cai, Chen, and Gu (2010), the authors have used
• As SMOTE generates synthetic data, it also induces some noise differential evolution clustering algorithm along with SVM and SMOTE
in the dataset. We have used Kalman Filter to remove unwanted with SVM, and a hybrid approach is proposed. In Chen, Guo, and Chen
noisy data samples. (2010), the authors generate partitions using k-means and samples are
• We have used Random Forest classifier (RF) to classify the data clustered. Later, a threshold is defined and samples with cluster index
and verify the results using AUC and other metrics. lesser than the threshold are regenerated.
• Comparison with other methods such as SMOTE, Adaptive Syn- In Kang and Won (2010), the authors decide the count of samples
thetic (ADASYN) (He, Bai, Garcia, & Li, 2008), blSMOTE, SMO- to be generated from every data point, and generate samples to balance
TEENN and SMOTETOMEK has been given and it is evident from the dataset. In Cao and Wang (2011), the authors obtain the distri-
the comparison that better results are obtained by employing bution report and the density report of the DP and balance the data.
KSMOTE over the other methods. In Cateni, Colla, and Vannucci (2011), the authors have proposed a
new under-sampling and oversampling technique to resample the data
Organization of the paper: In Section 2, we discuss the related works
and balance the dataset so that there is no loss of data and addition of
in terms of the research we have carried out on SMOTE. In Section 3,
too many samples. In Fan, Tang, and Weise (2011), the authors use a
we demonstrate the preliminary concepts behind this work, and our
margin-based rule to sample the synthetic data; this process overcomes
novel, proposed approach followed by an in-depth explanation of our
the over-generalization of data samples. In Ramentol, Caballero, Bello,
concept. In the 4th Section, we discuss the experimental setup and the
and Herrera (2012), Rough-Set-Theory(RST) and SMOTE are used to-
details of the datasets we have used. We also present the evaluation
gether to handle the imbalance in the dataset. In Maciejewski and
metrics used in our work, analysis methodology and discuss our results
Stefanowski (2011), the method considers the more local neighborhood
that improve the state-of-art methods concerning various metrics. In
of the minority sample (Considers next k+1 neighbor) and gives better
the 5th Section, our work is concluded.
approximations.
In Barua, Islam, and Murase (2011), the authors have incorpo-
2. Related work rated unsupervised clustering in the generation of synthetic data. This
method ensures that the samples generated always lie inside the mi-
To overcome the imbalance, various resampling techniques have nority region and avoid wrong samples. In Deepa and Punithavalli
been presented (Bunkhumpornpat et al., 2011; Chawla et al., 2002; (2011), the authors have combined SMOTE with Evolutionary Sampling
Kubat et al., 1997; Stefanowski & Wilk, 2008). SMOTE was proposed Technique (EST) to over-sample or under-sample the data. In Dong
by Chawla in the year 2002 (Chawla et al., 2002), which makes use of and Wang (2011), the method randomly generates data points in
K Nearest Neighbor (KNN) graphs to generate synthetic data. the minority region unlike SMOTE. In Zhang and Wang (2011), the
authors check whether the samples are crossed or not and are grouped
2.1. Non-filter based accordingly, then new data points are generated based on the different
groups. In Fernández-Navarro, nez, and Gutiérrez (2011), the authors
In Han et al. (2005), only the borderline samples are considered have proposed a 2 stage algorithm. In 1st stage the data is balanced
and are oversampled. In Cohen, Hilario, Sax, Hugonnet, and Geiss- and in 2nd phase different patterns are generated and the data is over
buhler (2006), authors have used different Prototype-based resampling sampled. In Farquad and Bose (2012), the authors have proposed to
methods like KNN and Support Vector Machine (SVM) to balance the pre-process the data by making use of SVM and the data is balanced
dataset. In Wang, Xu, Wang, and Zhang (2006), the authors have with the SVM predictions.
applied the LLE algorithm to process the dataset and then oversampled In Puntumapon and Waiyamai (2012), the authors remove the data
the dataset by using SMOTE. points from the minority region that are not relevant, this will give

2
Thejas G.S., Y. Hariprasad, S.S. Iyengar et al. Machine Learning with Applications 8 (2022) 100267

a precise minority region. In Bunkhumpornpat, Sinapiromsaran, and used to propose a new approach. This will balance the dataset and is
Lursinsap (2012), the data is generated along the shortest path and an alternative to SMOTE. In Gazzah, Hechkel, and Essoukri Ben Amara
the newly generated samples lie near the centroid. In Wang, Li, Chao, (2015), Principal Component Analysis (PCA) and multiclass SVM are
and Cao (2012), the proposed method dynamically generates different combined together and a hybrid approach is proposed to sample the
data points around the negative class data point. This will eliminate data.
noise and make the boundary more distinct. Also, smoothing techniques In Tang and He (2015), kernel density is estimated and the difficulty
are proposed by the author. According to Barua, Islam, and Murase level is found, based on which the samples are adaptively generated
(2013), weights for the negative class samples are found depending to balance the dataset. In Xie, Jiang, Ye, and Li (2015), the authors
on the distance from the positive class which will generate accurately proposed Minority Oversampling Technique based on Local Densities
balanced dataset. In Bunkhumpornpat and Subpaiboonkit (2013), the in Low-Dimensional Space (MOT2LD), that creates clusters by mapping
authors propose a tool for selecting a variant of SMOTE, either safe the samples. Weights are assigned based on the importance and the
level or borderline; synthetic samples are generated in the safe region dataset is balanced accordingly. In Young, Nykl, Weckman, and Chel-
determined by a mechanism. In Hu and Li (2013), the authors propose berg (2015), Voronoi diagram is generated and the data points that
a 3 step algorithm where at first, positive samples in the lower decision lie on the border of the 2 classes are found and based on these data
and the negative samples around the boundary is calculated. In the points, the dataset is balanced. In Lee, Kim, and Lee (2015), the authors
second step, SMOTE is applied on the dataset. Next, the data is balanced generate the samples and then decide whether to keep the sample or
and processed. not based on the location of the sample. This method will take care
In Nakamura, Kajiwara, Otsuka, and Kimura (2013), codebooks are of the noisy data and the issue of over fitting. In Dang, Tran, Hirose,
obtained from Learning Vector Quantization (LVQ) technique and the and Satou (2015), the authors change the label of the data samples and
data is balanced based on the codebooks obtained. In Sanchez, Morales, then uses SPY method that they propose to balance the data.
and Gonzalez (2013), the authors have proposed a method Synthetic In Li, Fong, and Zhuang (2015), two metaheuristics are combined
Oversampling of Instances (SOI) to resample data inside the clusters. together to obtain the best value for the parameter. The value of the
These clusters are from the minority class instances, 2 methods are accuracy depends on the value of the kappa specified by the user.
presented in the paper, SOI by Clustering and Jittering (SOI-CJ), and In Rivera and Xanthopoulos (2016), the authors propose to use Over-
SOI by Clustering (SOI-C). In Zhou, Yang, Guo, and Hu (2013), the sampling Using Propensity Scores (OUPS) that performs oversampling
authors have proposed a hybrid method combining quasi-linear SVM based on the requirement. The probability of group membership is
and assembled SMOTE. In Koto (2014), author has showcased 3 differ- found and the data points are resampled based on propensity rate.
ent variants of SMOTE; SMOTE-out is a strategy to handle very close In Torres, Carrasco-Ochoa, and Martínez-Trinidad (2016), the method
vectors by creating samples outside the area of the dashed line. SMOTE- handles the imbalance by generating data corresponding to every data
cosine-Euclidian formula and the cosine similarity are consolidated point. In Borowska and Stepaniuk (2016), the authors have proposed
together to obtain the new NN. Selected SMOTE — certain attributes to balance the dataset by using SMOTE and to handle the oversampled
are synthesized based on feature selection emphasizing the dimension data by using Rough Set Theory(RST). In Yun, Ha, and Lee (2016), the
of significant attributes. In Li, Zou, Wang, and Xia (2013), the nega- authors have proposed a method to restrict the neighborhood size. The
tive samples are over sampled using Improved SMOTE (ISMOTE) and method will determine the value for every minority instance and as-
the positive samples are under sampled using distance-based under- sures safety for generating synthetic data. In Jiang, Lu, and Xia (2016),
sampling (DUS) technique. Both the methods are combined to obtain the authors have proposed genetic-algorithm-based-SMOTE (GAST).
a balanced dataset. In Barua, Islam, Yao, and Murase (2014), the Optimal sampling rates are estimated and their optimal combination
proposed method assigns different weights to the samples depending is found. The dataset is then balanced by generating new samples.
on the Euclidean distance from positive class sample. This weight is In Nekooeimehr and Lai-Yuen (2016), clustering technique is used
then used for balancing the dataset. and each cluster is oversampled based on the Euclidean distance.
In Gao, Hong, Chen, Harris, and Khalaf (2014), the authors use In Ramentol et al. (2016), fuzzy rough set theory (Dubois & Prade,
kernel density to over sample the dataset and balance it. According 1990) is used as a pre-processing tool. A threshold is then defined and if
to López, Triguero, Carmona, a, and Herrera (2014), the method pro- a sample does not cross it, it will be deleted. In Cervantes et al. (2017),
posed additively generates new data until an appropriate dataset is the authors use support vector to generate new samples. Particle swarm
obtained. In Zhang and Li (2014), the raw data will have a probability optimization (SMOTE-PSO) is used to handle the noise in the dataset.
distribution which is unknown. The newly generated data should also In Ma and Fan (2017), the authors have proposed to cluster the samples
have the same probability distribution, then the data will be accurate using Clustering Using Representatives (CURE-SMOTE) and then get rid
and precise. In Li, Zhang, Lu, and Fang (2014), to handle the imbalance of noise and outliers from the dataset. The dataset is then balanced
in the dataset, the boundary samples are selected and resampled. The by resampling. In Rivera (2017), the authors propose a method to
author says that this will improve the quality of the dataset. In Mah- eliminate the noise prior to the resampling of the dataset.
moudi, Moradi, Akhlaghian, and Moradi (2014), the authors have In Lee, Kim, and Kim (2017), authors have showcased a method
proposed Diversity and Separable Metrics in Over-Sampling Technique where Gaussian Probability Distribution in the feature space is com-
(DSMOTE) method that improves the accuracy. Anomalous samples bined and new data is sampled, diverged from the line. In Koziarski
are removed from the negative class. The top three samples are then and Wożniak (2017), the method is proposed in 2 phases. Firstly,
considered based on a criteria and synthetic data is generated based on the neighborhoods are cleaned. Secondly, synthetic samples are gen-
these samples. erated selectively. In Siriseriwan and Sinapiromsaran (2017), Adaptive
In Jiang, Qiu, and Li (2015), the minority class data samples are Neighbor Synthetic Minority Oversampling Technique (ANS) is pro-
resampled by finding the similarity between the samples using Mi- posed which dynamically adjusts the number of neighbors needed to
nority Cloning Technique (MCT). In Xu, Li, Le, and Tian (2014), the oversample the minority regions. In Douzas, Bacao, and Last (2018),
authors have proposed to combine triangular area sampling and NN k-means has been combined with SMOTE which generates samples in
with SMOTE and the dataset is balanced. In Rong, Gong, and Ng the deficient minority area and the class label is not considered while
(2014), Gaussian distribution in Q-union is used to resample the data generating the synthetic data.
and balance it. In Hu et al. (2014), a supervised method is used to In Douzas and Bacao (2019) the authors propose a technique where
balance the dataset by generating new samples. Also, TargetSOS, a the synthetic samples are generated in the region around each of input
new predictor is proposed by the authors. In Bellinger, Japkowicz, and minority data points. Eighty five versions of SMOTE are documented
Drummond (2015), modeling efficiency of denoising autoencoders are and implemented in python and is made available to public by Kovács

3
Thejas G.S., Y. Hariprasad, S.S. Iyengar et al. Machine Learning with Applications 8 (2022) 100267

(2019). In Xu, Shen, Nie, and Kou (2020), an hybrid implementation that gives a proficient solution of the least-squares technique. It is an
for medical data is given. It is divided into 3 sub-parts where Modified efficient algorithm that can support the past, future, and the present
SMOTE (MSMOTE) algorithm is used, post which ENN is used to reduce estimations. It is basically a two stage algorithm. In the first step,
the outlier data points and then the author uses RF for the classification. estimates of the current state variables are calculated. In the next
In Douzas, Rauch, and Bacao (2021), an extension named G-SOMO stage, weighted averages are used to update the estimates. We use the
is presented it identifies the optimal region for new data samples to covariance grids 𝑇 and 𝑃 to determine noisy data. Consider 𝑔 ∈ 𝑃 𝑥
be generated which will increase the variability. (Li, Zhu, Wu & Fan, as the measurements and 𝑎 ∈ 𝑃 𝑖 as the state of the discrete controlled
2021) Natural Neighbors SMOTE algorithm uses an adaptive k value, linear system. At time 𝑛, the process and measurement equations are
the samples in the center of the class has higher number of synthetic shown in Eqs. (1) and (2) respectively.
samples and the ones at the borders have lesser in number and also the
outlier samples are reduced. In Li et al. (2021) the synthetic samples 𝑎𝑛 = 𝐵𝑛 𝑎𝑛−1 + 𝑐𝑛 𝑓𝑛 + 𝑑𝑛 (1)
are generated using the SMOTE based algorithm which then the newly
generated data is used to detect the noisy samples using error detection 𝑔𝑛 = 𝑦𝑛 𝑎𝑛 + 𝑒𝑛 (2)
which is then, not removed from the dataset, but it is iteratively
re-positioned using the differential evolution. Where 𝑓𝑛 ∈ 𝑃 𝑗 is the control-input model at time 𝑛, 𝑎𝑛 is the state
at 𝑛, 𝑔𝑛 is the measurement at 𝑛. 𝐵𝑛 is 𝑖 × 𝑖 matrix relating state at 𝑛 − 1
2.2. Filter based & state at 𝑛. 𝑐𝑛 is 𝑖 × 𝑗 matrix relating control input at 𝑛 & state at 𝑛.
𝑦𝑛 is 𝑥 × 𝑖 matrix relating state and measurement at 𝑛. x, i and j are the
In Batista et al. (2004), the author proposes to extend SMOTE by variables of the matrices. 𝑒𝑛 and 𝑑𝑛 are measurement noise and process
integrating it with ENN and Tomek noise filters. Tomek links are used noise respectively with covariance matrix 𝑃𝑛 and 𝑇𝑛 and with a mean
to remove the noisy samples or samples lying on the decision boundary of 0.
after SMOTE is applied, whereas ENN removes more samples than
Tomek and it provides a cleaner data by removing the samples from 3.1.3. Our approach
both the classes that are misclassified by its three nearest neighbors. Input: Data C
In Sáez et al. (2015), the authors have presented a filtering method Output: Train data appended with mean and covariance
using ensemble based Iterative-Partitioning Filter (IPF) noise filter to column KalD
manage the samples at the borderline and the samples that are noisy. 1 Split the data C into Train D and Test T randomly with 8:2
In Almogahed and Kakadiaris (2015), the authors propose a filter ratio.
approach named NEATER using the Game Theory (GT). (Hussein, Li,
2 D_res = SMOTE(D) // apply SMOTE algorithm on train data D
Yohannese, & Bashir, 2019) is an extension of SMOTE where the
3 Apply Kalman filter
authors use a technique to identify the location of the newly generated
4 Obtain the number of columns (NC) (Label (or) output column
synthetic sample. The synthetic data points that are in close propinquity
need not be considered)
to the majority samples than the minority samples are rejected and
5 Create Kalman Filter object with ISM and NDO
filtered out. In Cheng et al. (2019) the authors propose grouped SMOTE
6 msr = D_res
algorithm with noise filtering mechanism (GSMOTE-NFM). At first,
7 With train data as input, call the EM algorithm using the
Gaussian-Mixture Model (GMM) is used to identify the distribution and
Kalman Filter object.
then noisy samples are filtered out by using the probability density of
8 Calculate the mean and covar
the samples in different classes. In Liang, Jiang, Li, Xue, and Wang
9 KalD = Append mean and covar to D_res.
(2020) a combination of K-means and SVM is used to filter out the noise
10 return 𝐾𝑎𝑙𝐷
in the actual data post which the algorithm is modified to generate
Algorithm 1: Kalman filter application and obtaining mean and
synthetic data points over the connection line or extension line of the
neighbor data point and actual data sample. covariance for each row of the data

3. Proposed approach Input: D, D_res, Number of Iterations Q


Output: rem, the 𝑛𝑜𝑠 to be removed in each iteration.
3.1. Preliminary concept 1 B = 𝑛𝑜𝑠 in D
2 A = 𝑛𝑜𝑠 in D_res
3.1.1. SMOTE 3 N = percentage of data increased after SMOTE.
If the classes are not proportionally distributed, then the data is 4 𝑀= 𝑁 𝑄
said to be imbalanced. Most of the real-time datasets suffer from 𝑀
data imbalance, where normal samples have many more occurrences 5 𝑌 = 100
×𝐴
𝑌
when compared to abnormal samples. The classifiers running on these 6 𝑟𝑒𝑚 = 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑐𝑙𝑎𝑠𝑠𝑒𝑠
datasets are generally overfitted or underfitted. There are numerous 7 return 𝑟𝑒𝑚
resampling techniques that have been proposed to handle this. SMOTE Algorithm 2: Calculating number of samples(𝑛𝑜𝑠) to remove in each
is one such algorithm to handle imbalanced data effectively. SMOTE iteration from each class
over-samples the data to achieve better results. Negative class samples
are resampled to handle the imbalance. Depending on the degree of
oversampling that needs to be performed, neighboring data points
using the kNN algorithm are chosen. Typically, k is assigned with 5 Our proposed model extends SMOTE by incorporating the Kalman
to oversample the data. Initially, the distance between a sample and filter. Noisy samples may cause the classifier to misclassify the data,
its NN is calculated. In the next step, the distance is multiplied with an hence such samples should be removed. Kalman filter is used to sift
arbitrary number ranging between 0 and 1 following which, it is added through the data and eliminate such noisy samples.
to the sample. An arbitrary point is selected along the line segment Our proposed approach follows a three stage procedure. Stages
between the two specified samples. 1, 2 and 3 are depicted by Algorithms 1, 2 and 3 respectively. We
take dataset ‘C’ as an input and Q iterations are to be performed
3.1.2. Kalman Filter to remove the data samples. The noisy samples can be removed in
The Kalman filter (Kalman, 1960) was proposed to solve the Wiener a varying percentage based on the iterations specified by the user.
problem. Fundamentally, this filter is a union of numerical conditions SMOTE is applied on the training dataset, after which, the Kalman

4
Thejas G.S., Y. Hariprasad, S.S. Iyengar et al. Machine Learning with Applications 8 (2022) 100267

Input: KalD, rem, Q this process, the Random Forest is applied. We have considered 𝑄=5
Output: Classifier result on the datasets for each iteration for our experiments and the results are computed for each iteration
1 if most of the rows have the same covariance value. then and the finest result of all the iterations is considered. The dataset
2 for 𝑗 = 1 to 𝑄 do corresponding to that result gives the best performance when used with
3 for 𝑖 = 1 to 𝑛𝑢𝑚𝑏𝑒𝑟𝑜𝑓 𝑐𝑙𝑎𝑠𝑠𝑒𝑠 do the classifier. Fig. 1 depicts the working of KSMOTE.
4 𝑑𝑎𝑡𝑎𝑖+1 = 𝑑𝑎𝑡𝑎𝑖 [𝑚𝑎𝑥 𝑣𝑎𝑙𝑢𝑒 𝑐𝑜𝑣𝑎𝑟] 𝐴𝑁𝐷 𝑑𝑎𝑡𝑎𝑖 [𝑙𝑎𝑏𝑒𝑙]
5 𝑑𝑎𝑡𝑎𝑖+1 = 𝑑𝑎𝑡𝑎𝑖+1 [𝑟𝑒𝑚𝑜𝑣𝑒 𝑟𝑒𝑚 𝑟𝑎𝑛𝑑𝑜𝑚 𝑟𝑜𝑤𝑠]
4. Experimental results and discussions
6 𝑑𝑎𝑡𝑎𝑗 = 𝑑𝑎𝑡𝑎𝑗 .𝑎𝑝𝑝𝑒𝑛𝑑(𝑑𝑎𝑡𝑎𝑖+1 )
7 end
4.1. Experimental setup
8 𝑑𝑎𝑡𝑎𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑐𝑙𝑎𝑠𝑠𝑒𝑠+1 = 𝑑𝑎𝑡𝑎𝑖+1 [𝑐𝑜𝑣𝑎𝑟! = 𝑚𝑎𝑥𝑐𝑜𝑣𝑎𝑟]
9 𝑑𝑎𝑡𝑎𝑗 = 𝑑𝑎𝑡𝑎𝑗 .𝑎𝑝𝑝𝑒𝑛𝑑(𝑑𝑎𝑡𝑎𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑐𝑙𝑎𝑠𝑠𝑒𝑠+1 )
10 𝑑𝑎𝑡𝑎𝑡𝑟𝑎𝑖𝑙 = 𝑑𝑟𝑜𝑝𝑚𝑒𝑎𝑛𝑎𝑛𝑑𝑐𝑜𝑣𝑎𝑟𝑐𝑜𝑙𝑢𝑚𝑛𝑠 Here, we describe the datasets, the analysis methodology and the
11 Separate X_train having data attributes and y_train having evaluation metrics used in our experiments. Furthermore, an in-depth
output label from 𝑑𝑎𝑡𝑎𝑡𝑟𝑎𝑖𝑙 comparison of the classifier’s results is provided. We have validated
12 Apply Random Forest Classifier the experimental outcomes with other contemporary techniques viz.,
13 end SMOTE, ADASYN, BorderlineSMOTE (blSMOTE), SMOTEENN and
14 end SMOTETOMEK. The NN parameter’s value is initialized as 5. In each
15 else iteration, our model removes data proportionally from all the classes
16 for 𝑖 = 1 to 𝑛𝑢𝑚𝑏𝑒𝑟𝑜𝑓 𝑐𝑙𝑎𝑠𝑠𝑒𝑠 do thereby maintaining the class balance. The random forest classifier is
17 for 𝑖 = 1 to 𝑛𝑢𝑚𝑏𝑒𝑟𝑜𝑓 𝑐𝑙𝑎𝑠𝑠𝑒𝑠 do employed after which, multiple metrics are computed. We have also
18 𝑑𝑎𝑡𝑎𝑖+1 = 𝑑𝑎𝑡𝑎𝑖 [𝑙𝑎𝑏𝑒𝑙] used Decision Tree Classifier and Extra Tree Classifier to demonstrate
19 𝑑𝑎𝑡𝑎𝑖+1 = 𝑑𝑎𝑡𝑎𝑖+1 [remove rem samples starting the results. The test–train partition was done in 80:20 ratio for all the
from highest covariance value] datasets. The experiments were run three times and the average of the
20 𝑑𝑎𝑡𝑎𝑗 = 𝑑𝑎𝑡𝑎𝑗 .𝑎𝑝𝑝𝑒𝑛𝑑(𝑑𝑎𝑡𝑎𝑖+1 ) results are considered. The experiments were performed on server -
21 end ‘‘Flounder’’ at Florida International University which has AMD Opteron
22 𝑑𝑎𝑡𝑎𝑡𝑟𝑎𝑖𝑙 = 𝑑𝑟𝑜𝑝𝑚𝑒𝑎𝑛𝑎𝑛𝑑𝑐𝑜𝑣𝑎𝑟𝑐𝑜𝑙𝑢𝑚𝑛𝑠 Processor 6380 (64 cores) along with 504 GB RAM. Implementation of
23 Separate X_train having data attributes and y_train having our work is available on GitHub.1
output label from 𝑑𝑎𝑡𝑎𝑡𝑟𝑎𝑖𝑙
24 Apply Random Forest Classifier 4.2. Datasets
25 end
26 end Several real-time datasets regarding click-fraud and UCI benchmark
Algorithm 3: Removing data samples and classifying the result datasets are used. All of these datasets have disproportional class
distributions. The string values wherever present in the datasets, were
converted to numbers with the help of label encoding. Under the
filter object is created with 𝑖𝑛𝑖𝑡𝑖𝑎𝑙_𝑠𝑡𝑎𝑡𝑒_𝑚𝑒𝑎𝑛(𝐼𝑆𝑀) which is the ’initial domain of click fraud, the Talking Data (kag, 2018) (Thejas et al.,
condition for the model’ as 0 and 𝑛_𝑑𝑖𝑚_𝑜𝑏𝑠(𝑁𝐷𝑂) which is the ’size 2019), the Avazu Click-Through Rate (CTR) (kag, 2015) and the Criteo
of the state or observation space’ as 𝑁𝐶 and applied on the oversam- datasets (cri, 2017) have been considered. The Talking Data dataset
pled data. A part of ‘pykalman’ package (pyp, 2013) is modified and contains 9 attributes. Data preprocessing was performed by separating
used for the same purpose. In the Kalman Filter object, we call the the attribute ’click_time’ into different four attributes, ‘day’, ‘hour’,
Expectation–Maximization (EM) algorithm (Moon, 1996). The training ‘min’ and ‘sec’. A million random samples were considered from the
data is considered as measurements (msr) and given as an input to entire dataset which originally had 184,903,890 entries. The Criteo
the filter. In our experiments, we have considered the 𝑛𝑖𝑡𝑒𝑟 to be dataset was randomly sampled prior to usage and all the rows with NaN
5 for the EM algorithm because it avoids overfitting as explained values were dropped. The attribute ’hour of click’ in the Avazu dataset
in pyp (2013). This algorithm calculates the maximum likelihood of was split into different columns. The data was collected over a period
parameters iteratively. On applying the filter, the mean and covariance of ten days and is chronologically ordered. We made use of a million
values are obtained corresponding to each row of the dataset. random samples for the experiment. Datasets from the UCI Archive are
We employ some notations to describe the second stage of the used. For the CMC dataset, we have combined the class attributes ’Long-
method. 𝐵 indicates the number of rows before applying SMOTE and term’ and ’Short-term’ as ‘Use’. These datasets are commonly used in
𝐴 denotes the number of rows after applying SMOTE. 𝑁 portrays the multiple papers and are now treated as benchmark datasets. Table 1
percentage of increase in the data (synthetic samples generated). The provides a brief outline of the datasets used in our experiments. It
percentage of data to be removed is calculated as shown in Algorithm specifies the Alias Name (AN) for the datasets, the number of samples
2. The count of data samples to be removed in each iteration is also (NOS), the type of class distribution and the number of attributes of the
calculated. To maintain data balance, Y samples are discarded propor- dataset (#Attr).
tionally from all classes. In Stage 3, the train dataset is referred to as
‘data’ and the ‘numberofclasses’ is the number of class attributes for
4.3. Analysis methodology
the dataset. We obtain a count of all unique values of covariances that
were calculated. If a considerably large number of rows have the same
To test the proposed model, we considered the contemporary tech-
covariance, then, depending on the 𝑛𝑜𝑠 calculated, the data samples
niques, namely, SMOTE, SMOTEENN, SMOTETOMEK, ADASYN and
with the highest covariance are dropped from each class equally. The
BorderlineSMOTE (blSMOTE). To compare the different models, pa-
Random Forest (Breiman, 1999) classifier is run on the new dataframe.
If the covariance is different for most of the rows, then, the data is rameters such as Acry, AUC, Pres, Rcl and F1scr were made use of. A
sorted according to the covariance values. The data samples with the mathematical definition of these metrics is presented in the following
highest covariances are dropped iteratively from every class proportion- subsection.
ally. The values in the covariance matrix is the amount of noise present,
1
hence we filter the noise according to the covariance values. After https://github.com/thejasgs/SMO

5
Thejas G.S., Y. Hariprasad, S.S. Iyengar et al. Machine Learning with Applications 8 (2022) 100267

Fig. 1. The Flowchart Representing our Approach KSMOTE.

Table 1 evaluate our work. The AUC is one such parameter which is not greatly
Dataset description.
affected by overfitting.
Dataset AN NOS #Attr Class We have also used Pres, Rcl and F1scr which are calculated as given
Haberman D1 306 4 Binary in Eqs. (4), (5) and (6).
Hepatitis D2 115 20 Binary
Statlog (German Credit Data) D3 1000 20 Binary 𝑇𝑃
𝑃 𝑟𝑒𝑠 = (4)
Talking Data D4 1,000,000 9 Binary 𝑇𝑃 + 𝐹𝑃
Avazu D5 1,000,000 24 Binary
𝑇𝑃
Display Ad. Challenge-Criteo Labs D6 756,554 40 Binary 𝑅𝑐𝑙 = (5)
Credit Card Fraud Data D7 284,807 30 Binary
𝑇𝑃 + 𝐹𝑁
Z-Alizadeh sani dataset D8 302 55 Binary 𝑃 𝑟𝑒𝑠 × 𝑅𝑐𝑙
Contraceptive Method Choice CMC D9 1178 10 Binary 𝐹 1𝑠𝑐𝑟 = 2 × (6)
𝑃 𝑟𝑒𝑠 + 𝑅𝑐𝑙
Pima D10 579 9 Binary
4.5. Results

Table 2 portrays the outcome of running the RF classifier on the raw


4.4. Evaluation metrics
binary datasets. Table 3 shows the result of running different classifier
on the raw dataset. #0’s indicates the negative samples, # 1’s denotes
One of the most popular and commonly used metrics, accuracy, is
the number of positive samples and %min represents the percentage of
computed for all the datasets. We make use of the following notations:
the minority class samples. We can observe that there is a significant
True Positive as TP, True Negative as TN, False Positive as FP and False
imbalance in the data as the AUC values indicate that the model is
Negative as FN. The accuracy of the predictor is defined in Eq. (3).
overfitted. For the same reason, we balance the datasets before using
𝑇𝑃 + 𝑇𝑁 them.
𝐴𝑐𝑟𝑦 = (3)
𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁 We conducted the experiments by applying the current SMOTE
The biggest drawback of using accuracy as a metric is that it might algorithms on all the datasets, and the results are tabulated in Table 4.
be incorrect due to overfitting of the learning model. Overfitting is an Table 4 demonstrates a comparison of KSMOTE (our method) with
undesirable trait that occurs when the algorithm predicts values based the existent methods using binary datasets. Table 5 portrays a compar-
on erroneous data. Hence, we have also made use of other metrics to ison of KSMOTE with the existent methods with ET and DT.

6
Thejas G.S., Y. Hariprasad, S.S. Iyengar et al. Machine Learning with Applications 8 (2022) 100267

Table 2
Results on Raw data — Random Forest.
Dataset #1’s #0’s %min Acry AUC Rcl F1scr Pres
D1 118 56 32.18 0.69355 0.55430 0.88637 0.80413 0.73585
D2 26 94 21.66 0.83871 0.6875 0.375 0.54546 0.99
D3 558 242 24.2 0.775 0.64821 0.95744 0.85714 0.77586
D4 197411 602589 24.67 0.85609 0.72235 0.45765 0.61141 0.92075
D5 136122 663878 17.01 0.83265 0.51087 0.0249 0.04785 0.61522
D6 190488 414755 31.47 0.71167 0.56449 0.1663 0.26664 0.67224
D7 391 227454 0.172 0.99943 0.86220 0.72448 0.81609 0.93421
D8 70 172 28.92 0.83606 0.75452 0.55555 0.66666 0.83333
D9 686 492 41.7 0.70847 0.67983 0.92121 0.77948 0.67555
D10 201 378 34.71 0.8 0.75022 0.62791 0.65061 0.675

Table 3 Table 4 (continued).


Results on Raw data — DT: Decision Tree and ET: Extra Tree. Dataset Method Acry AUC Rcl F1scr Pres
Dataset Model Acry AUC Rcl F1scr Pres
SMOTE 0.85245 0.80748 0.90909 0.89887 0.88888
D1 DT 0.70967 0.56565 0.22222 0.30769 0.5 ADASYN 0.85245 0.78943 0.93181 0.90109 0.87234
D1 ET 0.64516 0.52020 0.22222 0.26666 0.33333 D8 blSMOTE 0.86885 0.81885 0.93181 0.91111 0.89130
D9 DT 0.65762 0.65477 0.67878 0.68923 0.7 SMOTEENN 0.63934 0.71390 0.54545 0.68571 0.92307
D9 ET 0.67118 0.66282 0.73333 0.71386 0.69540 SMOTETOMEK 0.88524 0.86631 0.90909 0.91954 0.93023
D10 DT 0.79310 0.78568 0.76744 0.6875 0.62264 KSMOTE 0.86885 0.85494 0.88636 0.90697 0.92857
D10 ET 0.77241 0.72389 0.60465 0.61176 0.61904
SMOTE 0.68474 0.67756 0.77848 0.72566 0.67955
ADASYN 0.68813 0.67830 0.81645 0.73714 0.67187
D9 blSMOTE 0.68135 0.67827 0.72151 0.70807 0.69512
Table 4
SMOTEENN 0.67796 0.67899 0.66455 0.68852 0.71428
Results of re-sampled data and comparison of our model — Random Forest.
SMOTETOMEK 0.68813 0.68072 0.78481 0.72941 0.68131
Dataset Method Acry AUC Rcl F1scr Pres KSMOTE 0.69492 0.68851 0.77848 0.73214 0.69101
SMOTE 0.59678 0.55838 0.75676 0.69136 0.63637 SMOTE 0.73103 0.75687 0.83333 0.67226 0.56338
ADASYN 0.61291 0.58487 0.72973 0.69231 0.65854 ADASYN 0.72413 0.76224 0.875 0.67741 0.55263
D1 blSMOTE 0.99 0.61244 0.86487 0.75295 0.66667 D10 blSMOTE 0.72413 0.74645 0.8125 0.66101 0.55714
SMOTEENN 0.64517 0.62487 0.72973 0.71053 0.69231 SMOTEENN 0.68965 0.73120 0.85416 0.64566 0.51898
SMOTETOMEK 0.62904 0.59838 0.75676 0.70887 0.66667 SMOTETOMEK 0.71724 0.72551 0.75 0.63716 0.55384
KSMOTE 0.6613 0.65136 0.70271 0.71233 0.72223 KSMOTE 0.75862 0.77749 0.83333 0.69565 0.59701
SMOTE 0.90323 0.83797 0.75 0.66667 0.6
ADASYN 0.90323 0.83797 0.75 0.66667 0.6
D2 blSMOTE 0.83871 0.69445 0.5 0.44445 0.4 Table 5
SMOTEENN 0.87097 0.81945 0.75 0.6 0.5 Results of re-sampled data and comparison of our model — DT and ET.
SMOTETOMEK 0.83871 0.80093 0.75 0.44445 0.42858 Dataset Model Method Acry AUC Rcl F1scr Pres
KSMOTE 0.87097 0.87097 0.75 0.6 0.5
SMOTE 0.51612 0.52324 0.56 0.48275 0.42424
SMOTE 0.71 0.70398 0.71830 0.77862 0.85 ADASYN 0.59677 0.57135 0.44 0.46808 0.5
ADASYN 0.725 0.71964 0.73239 0.79087 0.85950 D1 DT blSMOTE 0.64516 0.63783 0.6 0.57692 0.55555
D3 blSMOTE 0.715 0.69730 0.73943 0.78651 0.84 SMOTEENN 0.61290 0.58486 0.44 0.47826 0.52380
SMOTEENN 0.7 0.71223 0.68309 0.76377 0.86607 SMOTETOMEK 0.51612 0.51675 0.52 0.46428 0.41935
SMOTETOMEK 0.675 0.65383 0.70422 0.75471 0.81300 KSMOTE 0.69354 0.67189 0.56 0.59574 0.63636
KSMOTE 0.735 0.72668 0.74647 0.81 0.86178
SMOTE 0.64516 0.63783 0.6 0.57692 0.55555
SMOTE 0.9094 0.88079 0.82404 0.81852 0.81307 ADASYN 0.56451 0.53135 0.36 0.39999 0.45
ADASYN 0.9049 0.88118 0.83414 0.81306 0.79303 D1 ET blSMOTE 0.58064 0.54486 0.36 0.40909 0.47368
D4 blSMOTE 0.90387 0.87781 0.82612 0.80995 0.7944 SMOTEENN 0.62903 0.60486 0.48 0.51063 0.54545
SMOTEENN 0.88013 0.86797 0.84384 0.77732 0.72053 SMOTETOMEK 0.62903 0.61783 0.56 0.54901 0.53846
SMOTETOMEK 0.9103 0.88152 0.82442 0.82006 0.81575 KSMOTE 0.69354 0.66540 0.52 0.57777 0.65
KSMOTE 0.91049 0.88359 0.83023 0.82141 0.81277
SMOTE 0.64406 0.64007 0.69620 0.67692 0.65868
SMOTE 0.7853 0.56224 0.22478 0.26195 0.31384 ADASYN 0.63389 0.63251 0.65189 0.65605 0.66025
ADASYN 0.78313 0.56551 0.23629 0.26972 0.31418 D9 DT blSMOTE 0.66101 0.66026 0.67088 0.67948 0.68831
D5 blSMOTE 0.78739 0.56005 0.21611 0.25627 0.31476 SMOTEENN 0.70169 0.70260 0.68987 0.71241 0.73648
SMOTEENN 0.78987 0.55746 0.20585 0.24929 0.31599 SMOTETOMEK 0.64745 0.64614 0.66455 0.66878 0.67307
SMOTETOMEK 0.7859 0.56288 0.22546 0.26308 0.31575 KSMOTE 0.70169 0.70260 0.68987 0.71241 0.73648
KSMOTE 0.77993 0.56483 0.23983 0.26926 0.30693
SMOTE 0.64406 0.64201 0.67088 0.66876 0.66666
SMOTE 0.64626 0.62729 0.57605 0.50636 0.45171 ADASYN 0.63728 0.63568 0.65822 0.66031 0.66242
ADASYN 0.63509 0.62353 0.59229 0.50554 0.44096 D9 ET blSMOTE 0.64745 0.64566 0.67088 0.67088 0.67088
D6 blSMOTE 0.63968 0.62313 0.57842 0.50278 0.44464 SMOTEENN 0.70508 0.70382 0.72151 0.72380 0.72611
SMOTEENN 0.54036 0.60692 0.78675 0.51881 0.38701 SMOTETOMEK 0.63728 0.63471 0.67088 0.66457 0.65838
SMOTETOMEK 0.64509 0.62737 0.57949 0.50703 0.45067 KSMOTE 0.66101 0.65977 0.67721 0.68152 0.68589
KSMOTE 0.64634 0.62797 0.57834 0.50741 0.45197
SMOTE 0.73103 0.70425 0.625 0.60606 0.58823
SMOTE 0.99362 0.93256 0.87128 0.32653 0.20091 ADASYN 0.73103 0.73582 0.75 0.64864 0.57142
ADASYN 0.99287 0.93712 0.88118 0.30479 0.18426 D10 DT blSMOTE 0.60689 0.58515 0.52083 0.46728 0.42372
D7 blSMOTE 0.99919 0.93041 0.86138 0.7909 0.73109 SMOTEENN 0.73103 0.75161 0.8125 0.66666 0.56521
SMOTEENN 0.99327 0.93733 0.88118 0.31729 0.19347 SMOTETOMEK 0.71034 0.70457 0.6875 0.61111 0.55
SMOTETOMEK 0.99378 0.93264 0.87128 0.33207 0.20512 KSMOTE 0.70344 0.67837 0.60416 0.57425 0.54716
KSMOTE 0.99485 0.93318 0.87128 0.37526 0.23913
(continued on next page)

7
Thejas G.S., Y. Hariprasad, S.S. Iyengar et al. Machine Learning with Applications 8 (2022) 100267

Fig. 2. ROC curve plot for D1 data, (a) represents the ROC curve of Random Forest Classifier, (b) represents the ROC curve of Decision Tree Classifier, (c) represents the ROC
curve of Extra Tree Classifier.

Table 5 (continued). • The performance of our model is better for the click fraud datasets
Dataset Model Method Acry AUC Rcl F1scr Pres and the benchmark datasets.
SMOTE 0.75172 0.74602 0.72916 0.66037 0.60344 • In the dataset D10 run on ET model, the AUC is close to the best
ADASYN 0.69655 0.69426 0.6875 0.6 0.53225 performing model SMOTE.
D10 ET blSMOTE 0.72413 0.72540 0.72916 0.63636 0.56451
• To summarize, the proposed model shows good results when
SMOTEENN 0.70344 0.73625 0.83333 0.65040 0.53333
SMOTETOMEK 0.74482 0.73560 0.70833 0.64761 0.59649 tested with different datasets. Although the accuracy might de-
KSMOTE 0.73793 0.73571 0.72916 0.64814 0.58333 crease, the AUC scores obtained are higher, denoting a better
model.

We also compare our results with few of the recent methods, as


The following observations are made based on the outcomes. shown in Tables 6 and 7. GSMOTE-NFM and CURE-SMOTE use RF
for classification, hence we have compared the results with our RF
• We make use of AUC as the comparative metric as it is a standard results. SMOTE-PSO uses SVM classifier, hence we compare the results
measure and is not affected by overfitting. of KSMOTE on SVM classifier with SMOTE-PSO.
• The AUC of our model is closest to ADASYN, the best performing The Receiver Operating Characteristic (ROC) Curve for D1 dataset
model for the D5 Dataset. is portrayed in Fig. 2. In Fig. 2(a) the ROC curve for Random Forest
• In the Dataset D7, the AUC is close to the best performing model. Classifier can be observed. Fig. 2(b) portrays the plot of ROC Curve
• For D8, the AUC of our model is closest to the best performing for Decision Tree Classifier. Fig. 2(c) represents the ROC Curve for
model SMOTETOMEK. Extra Trees Classifier. From all these plots we can infer that our model

8
Thejas G.S., Y. Hariprasad, S.S. Iyengar et al. Machine Learning with Applications 8 (2022) 100267

Fig. 3. ROC curve plot for D9 data, (a) represents the ROC curve of Random Forest Classifier, (b) represents the ROC curve of Decision Tree Classifier, (c) represents the ROC
curve of Extra Tree Classifier.

Table 6 Classifier can be observed. Fig. 3(b) portrays the plot of ROC Curve for
F1scr comparison — Random Forest.
Decision Tree Classifier. Fig. 3(c) represents the ROC Curve for Extra
Dataset KSMOTE GSMOTE-NFM CURE-SMOTE Trees Classifier.
D1 0.71233 0.4821 0.5000
D10 0.69565 0.6726 𝑁𝐴
4.6. Statistical significance test

Table 7 In order to compare different techniques, this study supports our


AUC comparison — SVM Classifier. assumptions with statistical significance testing. We compared the pre-
Dataset KSMOTE SMOTE-PSO dictions of datasets oversampled by proposed KSMOTE vs five oversam-
D1 0.65136 0.622 pling (OS) techniques (SMOTE, ADASYN, blSMOTE, SMOTEENN and
D10 0.7515 0.742 SMOETOMEK). The prediction output from these datasets was tested by
a non-parametric test (Demšar, 2006; López, Fernández, García, Palade,
& Herrera, 2013). The reasons for using a non-parametric statistical test
KSMOTE significantly outperforms the other state of the art models we is as follows
have used to present the comparison. • To overcome one of the limitations of parametric tests — which
The Receiver Operating Characteristic (ROC) Curve for D9 dataset is to have a normally distributed data
is portrayed in Fig. 3. In Fig. 3(a) the ROC curve for Random Forest • Non-parametric tests share the reliability of a parametric test, and

9
Thejas G.S., Y. Hariprasad, S.S. Iyengar et al. Machine Learning with Applications 8 (2022) 100267

Table 8 draft, Supervision, Project administration. Yashas Hariprasad: Soft-


Results obtained by the Wilcoxon signed-rank test for KSMOTE at a 95% confidence ware, Validation, Formal analysis, Investigation, Data curation, Writing
interval.
– original draft. S.S. Iyengar: Supervision, Writing – review & editing,
Metric Vs R+ R- P-value
Project administration. N.R. Sunitha: Supervision, Writing – review &
Raw data 55 0 0.001953 editing. Prajwal Badrinath: Writing – review & editing, Visualization.
SMOTE 55 0 0.001953
AUC ADASYN 51 4 0.01367
Shasank Chennupati: Formal analysis, Writing – review & editing.
blSMOTE 55 0 0.001953
SMOTEENN 54 1 0.003906 Declaration of competing interest
SMOTETOMEK 49 6 0.02734
The authors declare that they have no known competing finan-
cial interests or personal relationships that could have appeared to
• Parametric tests are only reliable to reject null-hypothesis if their influence the work reported in this paper.
assumptions are not violated.
References
A Wilcoxon signed-rank test (Wilcoxon, 1992) to check the signif-
icance of KSMOTE on the AUCs (because it represents the measure Almogahed, B. A., & Kakadiaris, I. A. (2015). NEATER: filtering of over-sampled
data using non-cooperative game theory. Soft Computing, 19, 3301–3322. http:
of performance of a classifier in terms of evaluating which model is
//dx.doi.org/10.1007/s00500-014-1484-5.
better on average). The Wilcoxon signed-rank test is a non-parametric Barua, S., Islam, M. M., & Murase, K. (2011). A novel synthetic minority oversampling
statistical hypothesis test to compare two related instances, matched technique for imbalanced data set learning. In B.-L. Lu, L. Zhang, & J. Kwok (Eds.),
instances, or repeated measurements on a single instance by assessing Neural information processing (pp. 735–744). Berlin, Heidelberg: Springer Berlin
Heidelberg.
whether their population mean ranks differ. The null hypothesis of the
Barua, S., Islam, M. M., & Murase, K. (2013). ProWSyn: Proximity weighted synthetic
test was that KSMOTE’s performance was similar to the compared over- oversampling technique for imbalanced data set learning. In Pacific-asia conference
sampler (i.e., the oversamplers used followed a symmetric distribution on knowledge discovery and data mining (pp. 317–328). Springer.
around zero). Barua, S., Islam, M. M., Yao, X., & Murase, K. (2014). MWMOTE–majority weighted
Notably, as shown in Table 8, the 𝑝-value of all methods is less minority oversampling technique for imbalanced data set learning. IEEE Transac-
tions On Knowledge And Data Engineering, 26, 405–425. http://dx.doi.org/10.1109/
than 0.05, which demonstrates that KSMOTE produces a significant TKDE.2012.232.
improvement in the AUC results in comparison with other techniques. Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several
Evidently, the results of KSMOTE are not always the best performer as methods for balancing machine learning training data. ACM SIGKDD Explorations
seen from Table 4. However, the 𝑝-value is less than 0.05 of KSMOTE Newsletter, 6, 20–29.
Bellinger, C., Japkowicz, N., & Drummond, C. (2015). Synthetic oversampling for
versus all approaches. Therefore, KSMOTE is superior compared to
advanced radioactive threat detection. In: 2015 IEEE 14th international conference
other resampling methods. on machine learning and applications (pp. 948–953). 10.1109/ICMLA.2015.58.
Bishop, G., Welch, G., et al. (2001). An introduction to the kalman filter. Proc Of
5. Conclusion SIGGRAPH, Course, 8, 41.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases, 1998.
Borowska, K., & Stepaniuk, J. (2016). Imbalanced data classification: A novel re-
The removal of noisy samples is the main focus of our research. sampling approach combining versatile improved SMOTE and rough sets. In
This reduces the computational overhead by reducing the size of the K. Saeed, & W. Homenda (Eds.), Computer information systems and industrial
data and increases the efficacy of the classifier. We have proposed management (pp. 31–42). Cham: Springer International Publishing.
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of
an extension of SMOTE by integrating it with the Kalman filter. By
machine learning algorithms. Pattern Recognition, 30, 1145–1159.
incorporating the Kalman filter, KSMOTE can filter the noisy data Breiman, L. (1999). Random forests. UC Berkeley TR567.
effectively by dropping the erroneous samples in the original and Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2009). Safe-level-smote: Safe-
fabricated data. The Kalman Filter made use of EM algorithm, to which level-synthetic minority over-sampling technique for handling the class imbalanced
problem. In Pacific-asia conference on knowledge discovery and data mining (pp.
we set 𝑛𝑖𝑡𝑒𝑟 value to 5 as it prevents overfitting. Based on the number
475–482). Springer.
of iterations specified by the user, we calculated the number of samples Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2011). Mute: Majority
to be removed from each of the class dynamically, which will maintain under-sampling technique. In 2011 8th international conference on information,
the balance in the dataset. The data samples are removed depending communications & signal processing (pp. 1–4). IEEE.
on the covariance values. If the number of iterations are very small, Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2012). DBSMOTE: Density-
based synthetic minority over-sampling technique. Applied Intelligence: The Interna-
then a large number of samples will be deleted at the same time, which tional Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving
may result in the removal of the required data. We may not be able to Technologies, 36, 664–684. http://dx.doi.org/10.1007/s10489-011-0287-y.
get an optimized dataset. Conversely, if the number of iterations are Bunkhumpornpat, C., & Subpaiboonkit, S. (2013). Safe level graph for synthetic
very large, the classifier will run many times and it is an unwanted minority over-sampling techniques. In 2013 13th international symposium on
communications and information technologies (pp. 570–575). IEEE.
time overhead. We considered 5 iterations as optimal as we can obtain
Cao, Q., & Wang, S. (2011). Applying over-sampling technique based on data density
better results and there is no time overhead. and cost-sensitive svm to imbalanced learning. In 2011 international conference on
A wide range of real-time binary datasets associated with different information management, innovation management and industrial engineering, vol. 2 (pp.
fields are considered. Multiple evaluation techniques have been em- 543–548). IEEE.
Cateni, S., Colla, V., & Vannucci, M. (2011). Novel resampling method for the
ployed to compare our models with various oversampling techniques
classification of imbalanced datasets for industrial and other real-world problems.
like SMOTE, ADASYN, BordelineSMOTE, SMOTETOMEK and SMO- In 2011 11th international conference on intelligent systems design and applications (pp.
TEENN. We have achieved notable results compared to the other mod- 402–407). IEEE.
els. The AUC score is primarily given importance to, for the comparison Cervantes, J., Garcia-Lamont, F., Rodriguez, L., López, A., Castilla, J. R., & Trueba, A.
of the models. The limitation of this method is that the computational (2017). PSO-based method for SVM classification on skewed data sets. Neuro-
computing, 228, 187–197. http://dx.doi.org/10.1016/j.neucom.2016.10.041, URL:
complexity is cubic in size for kalman filter which makes the model http://www.sciencedirect.com/science/article/pii/S0925231216312668. Advanced
slower for large datasets, the time required is higher. intelligent computing: theory and applications.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: syn-
CRediT authorship contribution statement thetic minority over-sampling technique. Journal Of Artificial Intelligence Research,
16, 321–357.
Chen, L., Cai, Z., Chen, L., & Gu, Q. (2010). A novel differential evolution-clustering
Thejas G.S.: Conceptualization, Methodology, Software, Valida- hybrid resampling algorithm on imbalanced datasets. In 2010 third international
tion, Formal analysis, Investigation, Data curation, Writing – original conference on knowledge discovery and data mining (pp. 81–85). IEEE.

10
Thejas G.S., Y. Hariprasad, S.S. Iyengar et al. Machine Learning with Applications 8 (2022) 100267

Chen, S., Guo, G., & Chen, L. (2010). A new over-sampling method based on cluster Hu, J., He, X., Yu, D.-J., Yang, X.-B., Yang, J.-Y., & Shen, H.-B. (2014). A new supervised
ensembles. In: 2010 IEEE 24th international conference on advanced information over-sampling algorithm with application to protein-nucleotide binding residue
networking and applications workshops (pp. 599–604). 10.1109/WAINA.2010.40. prediction. PLOS ONE, 9, 1–10. http://dx.doi.org/10.1371/journal.pone.0107676.
Cheng, K., Zhang, C., Yu, H., Yang, X., Zou, H., & Gao, S. (2019). Grouped SMOTE Hu, F., & Li, H. (2013). A novel boundary oversampling algorithm based on neighbor-
with noise filtering mechanism for classifying imbalanced data. IEEE Access, 7, hood rough set model: Nrsboundary-SMOTE. Mathematical Problems In Engineering,
170668–170681. 2013.
Cieslak, D. A., Chawla, N. V., & Striegel, A. (2006). Combating imbalance in network Hu, S., Liang, Y., Ma, L., & He, Y. (2009). MSMOTE: improving classification perfor-
intrusion datasets. In GrC (pp. 732–737). mance when training data is imbalanced. In: 2009 second international workshop on
Cohen, G., Hilario, M., Sax, H., Hugonnet, S., & Geissbuhler, A. (2006). Learning from computer science and engineering, vol. 2, IEEE (pp. 13–17).
imbalanced data in surveillance of nosocomial infection. Artificial Intelligence In Hussein, A. S., Li, T., Yohannese, C. W., & Bashir, K. (2019). A-SMOTE: A new
Medicine, 37, 7–18. preprocessing approach for highly imbalanced datasets by improving SMOTE.
Kaggle display advertising challenge dataset. (2017). URL: https://labs.criteo.com/ International Journal Of Computational Intelligence Systems, 12, 1412–1422.
2014/02/kaggle-display-advertising-challenge-dataset/. Japkowicz, N. (2003). Class imbalances: are we focusing on the right issue. In: Workshop
Dang, X. T., Tran, D. H., Hirose, O., & Satou, K. (2015). SPY: A novel resampling on learning from imbalanced data sets II, vol. 1723 (pp. 63).
method for improving classification performance in imbalanced data. In: 2015 Jiang, K., Lu, J., & Xia, K. (2016). A novel algorithm for imbalance data classification
seventh international conference on knowledge and systems engineering (pp. 280–285). based on genetic algorithm improved SMOTE. Arabian Journal For Science And
10.1109/KSE.2015.24. Engineering, 41, 3255–3266. http://dx.doi.org/10.1007/s13369-016-2179-2.
De La Calleja, J., & Fuentes, O. (2007). A distance-based over-sampling method for Jiang, L., Qiu, C., & Li, C. (2015). A novel minority cloning technique for cost-sensitive
learning from imbalanced data sets. In: FLAIRS conference (pp. 634–635). learning. International Journal Of Pattern Recognition And Artificial Intelligence, 29,
De La Calleja, J., Fuentes, O., & González, J. (2008). Selecting minority examples from Article 1551004. http://dx.doi.org/10.1142/S0218001415510040.
misclassified data for over-sampling. In: FLAIRS conference (pp. 276–281). Click-through rate prediction. (2015). URL: https://www.kaggle.com/c/avazu-ctr-
prediction/data.
Deepa, T., & Punithavalli, M. (2011). An E-SMOTE technique for feature
selection in high-dimensional imbalanced dataset. In: 2011 3rd interna- Talkingdata AdTracking fraud detection challenge. (2018). URL: https://www.kaggle.
tional conference on electronics computer technology, vol. 2 (pp. 322–324). com/c/talkingdata-adtracking-fraud-detection.
10.1109/ICECTECH.2011.5941710. Kalman, R. E. (1960). A new approach to linear filtering and prediction problems.
Journal Of Basic Engineering, 82, 35–45.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The
Kang, Y., & Won, S. (2010). Weight decision algorithm for oversampling technique
Journal Of Machine Learning Research, 7, 1–30.
on class-imbalanced learning. In ICCAS 2010 (pp. 182–186). http://dx.doi.org/10.
Dong, Y., & Wang, X. (2011). A new over-sampling approach: Random-SMOTE for
1109/ICCAS.2010.5669889.
learning from imbalanced data sets. In H. Xiong, & W. B. Lee (Eds.), Knowledge
Khor, K.-C., Ting, C.-Y., & Phon-Amnuaisuk, S. (2012). A cascaded classifier approach
science, engineering and management (pp. 343–352). Berlin, Heidelberg: Springer
for improving detection rates on rare attack categories in network intrusion
Berlin Heidelberg.
detection. Applied Intelligence: The International Journal of Artificial Intelligence, Neural
Douzas, G., & Bacao, F. (2019). Geometric SMOTE a geometrically enhanced drop-in
Networks, and Complex Problem-Solving Technologies, 36, 320–329.
replacement for SMOTE. Information Sciences, 501, 118–135.
Koto, F. (2014). SMOTE-out, SMOTE-cosine, and selected-SMOTE: An enhancement
Douzas, G., Bacao, F., & Last, F. (2018). Improving imbalanced learning through a
strategy to handle imbalance in data level. In 2014 international conference on
heuristic oversampling method based on k-means and SMOTE. Information Sci-
advanced computer science and information system (pp. 280–284). IEEE.
ences, 465, 1–20. http://dx.doi.org/10.1016/j.ins.2018.06.056, URL: http://www.
Kovács, G. (2019). Smote-variants: A python implementation of 85 minority
sciencedirect.com/science/article/pii/S0020025518304997.
oversampling techniques. Neurocomputing, 366, 352–354.
Douzas, G., Rauch, R., & Bacao, F. (2021). G-SOMO: An oversampling approach based
Koziarski, M., & Wożniak, M. (2017). CCR: A combined cleaning and resampling algo-
on self-organized maps and geometric SMOTE. Expert Systems With Applications,
rithm for imbalanced data classification. International Journal Of Applied Mathematics
Article 115230.
And Computer Science, 27, 727–736.
Dubois, D., & Prade, H. (1990). Rough fuzzy sets and fuzzy rough sets*. Inter- Kubat, M., Matwin, S., et al. (1997). Addressing the curse of imbalanced training sets:
national Journal Of General Systems, 17, 191–209. http://dx.doi.org/10.1080/ one-sided selection. In Icml, vol. 97 (pp. 179–186). Nashville, USA.
03081079008935107.
Lee, H., Kim, J., & Kim, S. (2017). Gaussian-based SMOTE algorithm for solving skewed
Fan, X., Tang, K., & Weise, T. (2011). Margin-based over-sampling method for learning class distributions. International Journal of Fuzzy Logic and Intelligent Systems, 17,
from imbalanced datasets. In Pacific-asia conference on knowledge discovery and data 229–234.
mining (pp. 309–320). Springer. Lee, J., Kim, N.-r., & Lee, J.-H. (2015). An over-sampling technique with rejection
Farquad, M., & Bose, I. (2012). Preprocessing unbalanced data using support vector for imbalanced class learning. In Proceedings of the 9th international conference on
machine. Decision Support Systems, 53, 226–233. ubiquitous information management and communication (pp. 102:1–102:6). New York,
Fawcett, T., & Provost, F. (1997). Adaptive fraud detection. Data Mining And Knowledge NY, USA: ACM, http://dx.doi.org/10.1145/2701126.2701181.
Discovery, 1, 291–316. Li, J., Fong, S., & Zhuang, Y. (2015). Optimizing SMOTE by metaheuristics with neural
Fernández-Navarro, F., nez, C. H.-M., & Gutiérrez, P. A. (2011). A dynamic network and decision tree. In: 2015 3rd international symposium on computational
over-sampling procedure based on sensitivity for multi-class problems. Pattern and business intelligence (pp. 26–32). 10.1109/ISCBI.2015.12.
Recognition, 44, 1821–1833. http://dx.doi.org/10.1016/j.patcog.2011.02.019, URL: Li, K., Zhang, W., Lu, Q., & Fang, X. (2014). An improved SMOTE imbalanced data
http://www.sciencedirect.com/science/article/pii/S0031320311000823. classification method based on support degree. In: 2014 international conference
Gao, M., Hong, X., Chen, S., Harris, C. J., & Khalaf, E. (2014). PDFOS: PDF on identification, information and knowledge in the internet of things (pp. 34–38).
estimation based over-sampling for imbalanced two-class problems. Neurocomput- 10.1109/IIKI.2014.14.
ing, 138, 248–259. http://dx.doi.org/10.1016/j.neucom.2014.02.006, URL: http: Li, J., Zhu, Q., Wu, Q., & Fan, Z. (2021). A novel oversampling technique for class-
//www.sciencedirect.com/science/article/pii/S0925231214002501. imbalanced learning based on SMOTE and natural neighbors. Information Sciences,
García, V., Sánchez, J., & Mollineda, R. (2007). An empirical study of the behavior 565, 438–455.
of classifiers on imbalanced and overlapped data sets. In Iberoamerican congress on Li, J., Zhu, Q., Wu, Q., Zhang, Z., Gong, Y., He, Z., et al. (2021). SMOTE-NaN-DE:
pattern recognition (pp. 397–406). Springer. Addressing the noisy and borderline examples problem in imbalanced classification
Gazzah, S., & Amara, N. E. B. (2008). New oversampling approaches based on by natural neighbors and differential evolution. Knowledge-Based Systems, 223,
polynomial fitting for imbalanced data sets. In: 2008 the eighth IAPR international Article 107056.
workshop on document analysis systems, IEEE (pp. 677–684). Li, H., Zou, P., Wang, X., & Xia, R. (2013). A new combination sampling method for
Gazzah, S., Hechkel, A., & Essoukri Ben Amara, N. (2015). A hybrid sampling method imbalanced data. In Z. Sun, & Z. Deng (Eds.), Proceedings of 2013 Chinese intelligent
for imbalanced data. In: 2015 IEEE 12th international multi-conference on systems, automation conference (pp. 547–554). Berlin, Heidelberg: Springer Berlin Heidelberg.
signals devices (pp. 1–6). 10.1109/SSD.2015.7348093. Liang, X., Jiang, A., Li, T., Xue, Y., & Wang, G. (2020). LR-SMOTE—An improved
Gu, Q., Cai, Z., & Zhu, L. (2009). Classification of imbalanced data sets by using the unbalanced data set oversampling based on K-means and SVM. Knowledge-Based
hybrid re-sampling algorithm based on isomap. In Z. Cai, Z. Li, Z. Kang, & Y. Liu Systems, 196, Article 105845.
(Eds.), Advances in computation and intelligence (pp. 287–296). Berlin, Heidelberg: Liu, Y., Loh, H. T., & Sun, A. (2009). Imbalanced text classification: A term weighting
Springer Berlin Heidelberg. approach. Expert Systems With Applications, 36, 690–701.
Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: a new over-sampling López, V., Fernández, A., García, S., Palade, V., & Herrera, F. (2013). An insight into
method in imbalanced data sets learning. In International conference on intelligent classification with imbalanced data: Empirical results and current trends on using
computing (pp. 878–887). Springer. data intrinsic characteristics. Information Sciences, 250, 113–141.
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). Adasyn: Adaptive synthetic sampling López, V., Triguero, I., Carmona, C. J., a, S. G., & Herrera, F. (2014). Addressing
approach for imbalanced learning. In 2008 IEEE international joint conference on imbalanced classification with instance generation techniques: IPADE-ID. Neurocom-
neural networks (pp. 1322–1328). IEEE. puting, 126, 15–28. http://dx.doi.org/10.1016/j.neucom.2013.01.050, URL: http://
He, H., & Garcia, E. A. (2008). Learning from imbalanced data. IEEE Transactions On www.sciencedirect.com/science/article/pii/S0925231213006887, Recent trends in
Knowledge & Data Engineering, 1263–1284. intelligent data analysis online data processing.

11
Thejas G.S., Y. Hariprasad, S.S. Iyengar et al. Machine Learning with Applications 8 (2022) 100267

Ma, L., & Fan, S. (2017). CURE-SMOTE algorithm and hybrid algorithm for feature Stefanowski, J., & Wilk, S. (2008). Selective pre-processing of imbalanced data for
selection and parameter optimization based on random forests. BMC Bioinformatics, improving classification performance. In International conference on data warehousing
18, 169. http://dx.doi.org/10.1186/s12859-017-1578-z. and knowledge discovery (pp. 283–292). Springer.
Maciejewski, T., & Stefanowski, J. (2011). Local neighbourhood extension of SMOTE Tang, S., & Chen, S.-p. (2008). The generation mechanism of synthetic minority class
for mining imbalanced data. In: 2011 IEEE symposium on computational intelligence examples. In 2008 international conference on information technology and applications
and data mining (pp. 104–111). 10.1109/CIDM.2011.5949434. in biomedicine (pp. 444–447). IEEE.
Mahmoudi, S., Moradi, P., Akhlaghian, F., & Moradi, R. (2014). Diversity and separable Tang, B., & He, H. (2015). KernelADASYN: Kernel based adaptive synthetic data gen-
metrics in over-sampling technique for imbalanced data classification. In: 2014 eration for imbalanced learning. In: 2015 IEEE congress on evolutionary computation
4th international conference on computer and knowledge engineering (pp. 152–158). (pp. 664–671). 10.1109/CEC.2015.7256954.
10.1109/ICCKE.2014.6993409. Tek, F. B., Dempster, A. G., & Kale, I. (2010). Parasite detection and identification
for automated thin blood film malaria diagnosis. Computer Vision And Image
Moon, T. K. (1996). The expectation-maximization algorithm. IEEE Signal Processing
Understanding, 114, 21–32.
Magazine, 13, 47–60. http://dx.doi.org/10.1109/79.543975.
Thejas, G. S., Boroojeni, K. G., Chandna, K., Bhatia, I., Iyengar, S. S., & Sunitha, N. R.
Nakamura, M., Kajiwara, Y., Otsuka, A., & Kimura, H. (2013). Lvq-smote–learning vec-
(2019). Deep learning-based model to fight against ad click fraud. In Proceedings
tor quantization based synthetic minority over–sampling technique for biomedical
of the 2019 ACM southeast conference (pp. 176–181). New York, NY, USA: ACM,
data. BioData Mining, 6, 16.
http://dx.doi.org/10.1145/3299815.3314453.
Napierała, K., Stefanowski, J., & Wilk, S. (2010). Learning from imbalanced data in
Torres, F. R., Carrasco-Ochoa, J. A., & Martínez-Trinidad, J. F. (2016). SMOTE-D a
presence of noisy and borderline examples. In International conference on rough sets deterministic version of SMOTE. In J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa,
and current trends in computing (pp. 158–167). Springer. V. Ayala Ramirez, J. A. Olvera-López, & X. Jiang (Eds.), Pattern recognition (pp.
Nekooeimehr, I., & Lai-Yuen, S. K. (2016). Adaptive semi-unsupervised weighted 177–188). Cham: Springer International Publishing.
oversampling (A-SUWO) for imbalanced datasets. Expert Systems With Applications, Wang, S., Li, Z., Chao, W., & Cao, Q. (2012). Applying adaptive over-sampling
46, 405–416. http://dx.doi.org/10.1016/j.eswa.2015.10.031, URL: http://www. technique based on data density and cost-sensitive SVM to imbalanced learn-
sciencedirect.com/science/article/pii/S0957417415007356. ing. In: The 2012 international joint conference on neural networks (pp. 1–8).
Puntumapon, K., & Waiyamai, K. (2012). A pruning-based approach for searching 10.1109/IJCNN.2012.6252696.
precise and generalized region for synthetic minority over-sampling. In Pacific-asia Wang, J., Xu, M., Wang, H., & Zhang, J. (2006). Classification of imbalanced data by
conference on knowledge discovery and data mining (pp. 371–382). Springer. using the SMOTE algorithm and locally linear embedding. In 2006 8th international
Pykalman. (2013). URL: https://pypi.org/project/pykalman/. conference on signal processing, vol. 3. IEEE.
Ramentol, E., Caballero, Y., Bello, R., & Herrera, F. (2012). SMOTE-RSB*: a hybrid Wilcoxon, F. (1992). Individual comparisons by ranking methods. In Breakthroughs in
preprocessing approach based on oversampling and undersampling for high imbal- statistics (pp. 196–202). Springer.
anced data-sets using SMOTE and rough sets theory. Knowledge And Information Xie, Z., Jiang, L., Ye, T., & Li, X. (2015). A synthetic minority oversampling method
Systems, 33, 245–265. based on local densities in low-dimensional space for imbalanced learning. In
Ramentol, E., Gondres, I., Lajes, S., Bello, R., Caballero, Y., Cornelis, C., et al. (2016). M. Renz, C. Shahabi, X. Zhou, & M. A. Cheema (Eds.), Database systems for advanced
Fuzzy-rough imbalanced learning for the diagnosis of high voltage circuit breaker applications (pp. 3–18). Cham: Springer International Publishing.
maintenance: The SMOTE-FRST-2T algorithm. Engineering Applications Of Artificial Xu, Y. H., Li, H., Le, L. P., & Tian, X. Y. (2014). Neighborhood triangular synthetic
Intelligence, 48, 134–139. http://dx.doi.org/10.1016/j.engappai.2015.10.009, URL: minority over-sampling technique for imbalanced prediction on small samples of
Chinese tourism and hospitality firms. In: 2014 seventh international joint conference
http://www.sciencedirect.com/science/article/pii/S0952197615002389.
on computational sciences and optimization (pp. 534–538). 10.1109/CSO.2014.104.
Rivera, W. A. (2017). Noise reduction a priori synthetic over-sampling for class
Xu, Z., Shen, D., Nie, T., & Kou, Y. (2020). A hybrid sampling algorithm combining
imbalanced data sets. Information Sciences, 408, 146–161. http://dx.doi.org/10.
M-SMOTE and ENN based on Random Forest for medical imbalanced data. Journal
1016/j.ins.2017.04.046, URL: http://www.sciencedirect.com/science/article/pii/
Of Biomedical Informatics, 107, Article 103465.
S0020025517307089.
Young, W. A., Nykl, S. L., Weckman, G. R., & Chelberg, D. M. (2015). Using voronoi di-
Rivera, W. A., & Xanthopoulos, P. (2016). A priori synthetic over-sampling methods for
agrams to improve classification performances when modeling imbalanced datasets.
increasing classification sensitivity in imbalanced data sets. Expert Systems With Ap- Neural Computing And Applications, 26, 1041–1054. http://dx.doi.org/10.1007/
plications, 66, 124–135. http://dx.doi.org/10.1016/j.eswa.2016.09.010, URL: http: s00521-014-1780-0.
//www.sciencedirect.com/science/article/pii/S0957417416304882. Yun, J., Ha, J., & Lee, J.-S. (2016). Automatic determination of neighborhood size in
Rong, T., Gong, H., & Ng, W. W. Y. (2014). Stochastic sensitivity oversampling SMOTE. In Proceedings of the 10th international conference on ubiquitous information
technique for imbalanced data. In X. Wang, W. Pedrycz, P. Chan, & Q. He (Eds.), management and communication (pp. 100:1–100:8). New York, NY, USA: ACM,
Machine learning and cybernetics (pp. 161–171). Berlin, Heidelberg: Springer Berlin http://dx.doi.org/10.1145/2857546.2857648.
Heidelberg. Zhang, H., & Li, M. (2014). RWO-sampling: A random walk over-sampling approach to
Sáez, J. A., Luengo, J., Stefanowski, J., & Herrera, F. (2015). SMOTE–IPF: Ad- imbalanced data classification. Information Fusion, 20, 99–116. http://dx.doi.org/
dressing the noisy and borderline examples problem in imbalanced classification 10.1016/j.inffus.2013.12.003, URL: http://www.sciencedirect.com/science/article/
by a re-sampling method with filtering. Information Sciences, 291, 184–203. pii/S1566253514000025.
http://dx.doi.org/10.1016/j.ins.2014.08.051, URL: http://www.sciencedirect.com/ Zhang, L., & Wang, W. (2011). A re-sampling method for class imbalance learning
science/article/pii/S0020025514008561. with credit data. In: 2011 international conference of information technology, computer
Sanchez, A. I., Morales, E. F., & Gonzalez, J. A. (2013). Synthetic oversampling of engineering and management sciences, vol. 1 (pp. 393–397). 10.1109/ICM.2011.34.
instances using clustering. International Journal On Artificial Intelligence Tools, 22, Zhou, B., Yang, C., Guo, H., & Hu, J. (2013). A quasi-linear SVM combined with
Article 1350008. assembled SMOTE for imbalanced data classification. In: The 2013 international
Siriseriwan, W., & Sinapiromsaran, K. (2017). Adaptive neighbor synthetic minority joint conference on neural networks (pp. 1–7). 10.1109/IJCNN.2013.6707035.
over-sampling technique under 1NN outcast handling. Songklanakarin Journal of Zikopoulos, P., Eaton, C., et al. (2011). Understanding Big Data: Analytics For Enterprise
Science and Technology, 39, 565–576. Class Hadoop And Streaming Data. McGraw-Hill Osborne Media.

12

You might also like