An Extension of Synthetic Minority Oversampling Technique Based On
1. Introduction

Data is categorized into different classes, where a few of them might have an excessive number of samples, leading to an imbalance. Since the data has a minority class, the result of the predictor tends to be biased. This becomes a detriment to the performance of the learning model. Most of the real-time datasets associated with medicine (Tek, Dempster, & Kale, 2010), text classification (Liu, Loh, & Sun, 2009), intrusion detection (Khor, Ting, & Phon-Amnuaisuk, 2012), and click fraud detection (Fawcett & Provost, 1997) are not balanced. A binary dataset having 95% positive samples may obtain an extremely high classification accuracy. However, this accuracy may be misleading due to over-fitting.

Extensive research has been performed in the recent past to formulate a solution for the issue of handling imbalanced data, and several solutions have been suggested (He & Garcia, 2008). Several resampling techniques (Bunkhumpornpat, Sinapiromsaran, & Lursinsap, 2011; Chawla, Bowyer, Hall, & Kegelmeyer, 2002; Kubat, Matwin, et al., 1997; Stefanowski & Wilk, 2008) are available to balance the dataset, among which SMOTE (Chawla et al., 2002) is a widely recognized technique. In recent years, around eighty-five versions of SMOTE have been proposed. We provide a thorough review of all the variants in Section 2.

Past research shows that an imbalance in the class samples is not the only concern, as other factors such as noise and borderline samples may hinder the performance of the learning algorithm (García, Sánchez, & Mollineda, 2007; Japkowicz, 2003; Napierała, Stefanowski, & Wilk, 2010). Applying SMOTE on imbalanced datasets gives better results, but it generates synthetic samples, resulting in a notable increase in the size of the data. Presently, the data collected in real time is extremely large (BIG DATA) (Zikopoulos, Eaton, et al., 2011). SMOTE can be considerably improved by performing certain modifications (Borderline-SMOTE (blSMOTE) 1,2 (Han, Wang, & Mao, 2005), Safe-level SMOTE (Bunkhumpornpat, Sinapiromsaran, & Lursinsap, 2009)) or by adding some extensions (SMOTE-IPF (SMOTE Iterative-Partitioning Filter) (Sáez, Luengo, Stefanowski, & Herrera, 2015), SMOTEENN (SMOTE Edited Nearest Neighbor) or SMOTETOMEK (SMOTE Tomek Links) (Batista, Prati, & Monard, 2004)).

Filters have not been commonly used in combination with SMOTE until now. We propose an extension called KSMOTE which employs the Kalman Filter (Bishop, Welch, et al., 2001) to remove noisy data
∗ Corresponding author.
E-mail addresses: sadashiva@tarleton.edu (Thejas G.S.), yhari001@fiu.edu (Y. Hariprasad), iyengar@cis.fiu.edu (S.S. Iyengar), nrsunitha@sit.ac.in
(N.R. Sunitha), prajwal2495@gmail.com (P. Badrinath), shasankchennupati@gmail.com (S. Chennupati).
https://doi.org/10.1016/j.mlwa.2022.100267
Received 2 July 2021; Received in revised form 22 January 2022; Accepted 23 January 2022
Available online 31 January 2022
2666-8270/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).
Thejas G.S., Y. Hariprasad, S.S. Iyengar et al. Machine Learning with Applications 8 (2022) 100267
samples. The use of the Kalman filter as a data-reduction method improves the efficacy of the classifier, which is clearly evident in the results we have obtained. We use three click-fraud datasets and a few UCI (Blake & Merz, 1998) datasets that are considered by previous researchers to evaluate our work. Furthermore, we make comparisons with several existing variants of SMOTE. Metrics such as Recall (Rcl), Accuracy (Acry), F1-Score (F1scr) and Precision (Pres) have been used to provide comparisons. Area Under the Curve (AUC) is another very good metric to verify overfitting caused by SMOTE (Bradley, 1997). We portray the Receiver Operating Characteristic Curve (ROC Curve) to demonstrate the comparison with the other models.

The proposed extension to the SMOTE algorithm emphasizes filtering the noisy data samples that may be generated. SMOTE generates synthetic data for the minority class samples to balance the dataset. Synthetic samples are generated along the line segment joining the minority class nearest neighbors (NN). For datasets with a mixed class distribution where the classes overlap each other, the synthetic samples generated along the line of NN for the minority class might lie in the wrong class space. This can lead to the induction of noisy data samples. Removing these noisy samples plays an important role and will increase the performance of the classifier. We use the Kalman filter to identify the noisy samples. We obtain better results on the classifier when KSMOTE is employed, which can be seen in Section 4.

Summary of contributions:

• In this work, we have extended SMOTE by applying the Kalman Filter.
• As SMOTE generates synthetic data, it also induces some noise in the dataset. We have used the Kalman Filter to remove unwanted noisy data samples.
• We have used the Random Forest classifier (RF) to classify the data and verify the results using AUC and other metrics.
• A comparison with other methods such as SMOTE, Adaptive Synthetic (ADASYN) (He, Bai, Garcia, & Li, 2008), blSMOTE, SMOTEENN and SMOTETOMEK has been given, and it is evident from the comparison that better results are obtained by employing KSMOTE over the other methods.

Organization of the paper: In Section 2, we discuss the related works in terms of the research we have carried out on SMOTE. In Section 3, we demonstrate the preliminary concepts behind this work and our novel, proposed approach, followed by an in-depth explanation of our concept. In Section 4, we discuss the experimental setup and the details of the datasets we have used. We also present the evaluation metrics used in our work and the analysis methodology, and discuss our results that improve on the state-of-the-art methods concerning various metrics. In Section 5, our work is concluded.

2. Related work

To overcome the imbalance, various resampling techniques have been presented (Bunkhumpornpat et al., 2011; Chawla et al., 2002; Kubat et al., 1997; Stefanowski & Wilk, 2008). SMOTE was proposed by Chawla in the year 2002 (Chawla et al., 2002), which makes use of K Nearest Neighbor (KNN) graphs to generate synthetic data.

2.1. Non-filter based

In Han et al. (2005), only the borderline samples are considered and are oversampled. In Cohen, Hilario, Sax, Hugonnet, and Geissbuhler (2006), the authors have used different prototype-based resampling methods like KNN and Support Vector Machine (SVM) to balance the dataset. In Wang, Xu, Wang, and Zhang (2006), the authors have applied the LLE algorithm to process the dataset and then oversampled the dataset by using SMOTE. In Cieslak, Chawla, and Striegel (2006), the authors use repeated incremental pruning to produce error reduction (RIPPER) as an underlying rule classifier, and a clustering-based method for oversampling is also proposed. In De La Calleja and Fuentes (2007), the proposed method averages the neighbors to obtain the mean example to oversample the data; the authors also consider only the positive dataset to locate the nearest occurrences utilizing the weighted distance. In He et al. (2008), the authors have proposed to alter the decision boundary in the direction close to the difficult samples by using some techniques. In De La Calleja, Fuentes, and González (2008), the authors use a collection of classifiers to select the samples from the dataset, and a weighted distance is used to balance the dataset. In Gazzah and Amara (2008), the authors use 4 different topologies to oversample the minority class using a polynomial fitting function. In Tang and Chen (2008), the authors have proposed to readjust the direction of the synthetic minority samples by generating data along the first component axis. In Stefanowski and Wilk (2008), the authors have applied amplification methods and selective preprocessing techniques and compared SMOTE with NCR. Before generating the data samples, each positive instance is assigned a safe level, and the data points (DP) are regenerated around the line with various weights (Bunkhumpornpat et al., 2009). In Hu, Liang, Ma, and He (2009), the authors classify the outnumbered samples into 3 different groups as noise, border and security samples using the distance, and then balance the data according to the groups. In Gu, Cai, and Zhu (2009), the authors have used the Isometric feature mapping algorithm (Isomap) to map the training data; later SMOTE is applied, and the data is reduced by applying the Neighborhood Cleaning Rule (NCR) method. In Chen, Cai, Chen, and Gu (2010), the authors have used a differential evolution clustering algorithm along with SVM and SMOTE with SVM, and a hybrid approach is proposed. In Chen, Guo, and Chen (2010), the authors generate partitions using k-means and samples are clustered. Later, a threshold is defined and samples with a cluster index lesser than the threshold are regenerated.

In Kang and Won (2010), the authors decide the count of samples to be generated from every data point, and generate samples to balance the dataset. In Cao and Wang (2011), the authors obtain the distribution report and the density report of the DP and balance the data. In Cateni, Colla, and Vannucci (2011), the authors have proposed a new under-sampling and oversampling technique to resample the data and balance the dataset so that there is no loss of data and no addition of too many samples. In Fan, Tang, and Weise (2011), the authors use a margin-based rule to sample the synthetic data; this process overcomes the over-generalization of data samples. In Ramentol, Caballero, Bello, and Herrera (2012), Rough-Set-Theory (RST) and SMOTE are used together to handle the imbalance in the dataset. In Maciejewski and Stefanowski (2011), the method considers the more local neighborhood of the minority sample (considers the next k+1 neighbors) and gives better approximations.

In Barua, Islam, and Murase (2011), the authors have incorporated unsupervised clustering in the generation of synthetic data. This method ensures that the samples generated always lie inside the minority region and avoids wrong samples. In Deepa and Punithavalli (2011), the authors have combined SMOTE with Evolutionary Sampling Technique (EST) to over-sample or under-sample the data. In Dong and Wang (2011), the method randomly generates data points in the minority region, unlike SMOTE. In Zhang and Wang (2011), the authors check whether the samples are crossed or not and group them accordingly; then new data points are generated based on the different groups. In Fernández-Navarro, nez, and Gutiérrez (2011), the authors have proposed a 2-stage algorithm: in the 1st stage the data is balanced, and in the 2nd stage different patterns are generated and the data is oversampled. In Farquad and Bose (2012), the authors have proposed to pre-process the data by making use of SVM, and the data is balanced with the SVM predictions.

In Puntumapon and Waiyamai (2012), the authors remove the data points from the minority region that are not relevant; this will give
a precise minority region. In Bunkhumpornpat, Sinapiromsaran, and Lursinsap (2012), the data is generated along the shortest path and the newly generated samples lie near the centroid. In Wang, Li, Chao, and Cao (2012), the proposed method dynamically generates different data points around the negative class data point. This will eliminate noise and make the boundary more distinct. Smoothing techniques are also proposed by the authors. According to Barua, Islam, and Murase (2013), weights for the negative class samples are found depending on the distance from the positive class, which will generate an accurately balanced dataset. In Bunkhumpornpat and Subpaiboonkit (2013), the authors propose a tool for selecting a variant of SMOTE, either safe level or borderline; synthetic samples are generated in the safe region determined by a mechanism. In Hu and Li (2013), the authors propose a 3-step algorithm where, at first, the positive samples in the lower decision region and the negative samples around the boundary are calculated. In the second step, SMOTE is applied on the dataset. Next, the data is balanced and processed.

In Nakamura, Kajiwara, Otsuka, and Kimura (2013), codebooks are obtained from the Learning Vector Quantization (LVQ) technique and the data is balanced based on the codebooks obtained. In Sanchez, Morales, and Gonzalez (2013), the authors have proposed a method, Synthetic Oversampling of Instances (SOI), to resample data inside the clusters. These clusters are from the minority class instances; 2 methods are presented in the paper, SOI by Clustering and Jittering (SOI-CJ) and SOI by Clustering (SOI-C). In Zhou, Yang, Guo, and Hu (2013), the authors have proposed a hybrid method combining quasi-linear SVM and assembled SMOTE. In Koto (2014), the author has showcased 3 different variants of SMOTE: SMOTE-out is a strategy to handle very close vectors by creating samples outside the area of the dashed line; in SMOTE-Cosine, the Euclidean formula and the cosine similarity are consolidated together to obtain the new NN; in Selected SMOTE, certain attributes are synthesized based on feature selection, emphasizing the dimension of significant attributes. In Li, Zou, Wang, and Xia (2013), the negative samples are oversampled using Improved SMOTE (ISMOTE) and the positive samples are undersampled using the distance-based under-sampling (DUS) technique. Both methods are combined to obtain a balanced dataset. In Barua, Islam, Yao, and Murase (2014), the proposed method assigns different weights to the samples depending on the Euclidean distance from the positive class sample. This weight is then used for balancing the dataset.

In Gao, Hong, Chen, Harris, and Khalaf (2014), the authors use kernel density to oversample the dataset and balance it. According to López, Triguero, Carmona, a, and Herrera (2014), the proposed method additively generates new data until an appropriate dataset is obtained. In Zhang and Li (2014), the raw data will have a probability distribution which is unknown. The newly generated data should also have the same probability distribution; then the data will be accurate and precise. In Li, Zhang, Lu, and Fang (2014), to handle the imbalance in the dataset, the boundary samples are selected and resampled. The authors say that this will improve the quality of the dataset. In Mahmoudi, Moradi, Akhlaghian, and Moradi (2014), the authors have proposed the Diversity and Separable Metrics in Over-Sampling Technique (DSMOTE) method that improves the accuracy. Anomalous samples are removed from the negative class. The top three samples are then considered based on a criterion, and synthetic data is generated based on these samples.

In Jiang, Qiu, and Li (2015), the minority class data samples are resampled by finding the similarity between the samples using the Minority Cloning Technique (MCT). In Xu, Li, Le, and Tian (2014), the authors have proposed to combine triangular area sampling and NN with SMOTE, and the dataset is balanced. In Rong, Gong, and Ng (2014), a Gaussian distribution in Q-union is used to resample the data and balance it. In Hu et al. (2014), a supervised method is used to balance the dataset by generating new samples. Also, TargetSOS, a new predictor, is proposed by the authors. In Bellinger, Japkowicz, and Drummond (2015), the modeling efficiency of denoising autoencoders is used to propose a new approach. This will balance the dataset and is an alternative to SMOTE. In Gazzah, Hechkel, and Essoukri Ben Amara (2015), Principal Component Analysis (PCA) and multiclass SVM are combined together and a hybrid approach is proposed to sample the data.

In Tang and He (2015), kernel density is estimated and the difficulty level is found, based on which the samples are adaptively generated to balance the dataset. In Xie, Jiang, Ye, and Li (2015), the authors proposed the Minority Oversampling Technique based on Local Densities in Low-Dimensional Space (MOT2LD), which creates clusters by mapping the samples. Weights are assigned based on importance and the dataset is balanced accordingly. In Young, Nykl, Weckman, and Chelberg (2015), a Voronoi diagram is generated, the data points that lie on the border of the 2 classes are found, and based on these data points, the dataset is balanced. In Lee, Kim, and Lee (2015), the authors generate the samples and then decide whether to keep each sample or not based on its location. This method will take care of the noisy data and the issue of overfitting. In Dang, Tran, Hirose, and Satou (2015), the authors change the label of the data samples and then use the SPY method that they propose to balance the data.

In Li, Fong, and Zhuang (2015), two metaheuristics are combined together to obtain the best value for the parameter. The value of the accuracy depends on the value of the kappa specified by the user. In Rivera and Xanthopoulos (2016), the authors propose to use Oversampling Using Propensity Scores (OUPS), which performs oversampling based on the requirement. The probability of group membership is found and the data points are resampled based on the propensity rate. In Torres, Carrasco-Ochoa, and Martínez-Trinidad (2016), the method handles the imbalance by generating data corresponding to every data point. In Borowska and Stepaniuk (2016), the authors have proposed to balance the dataset by using SMOTE and to handle the oversampled data by using Rough Set Theory (RST). In Yun, Ha, and Lee (2016), the authors have proposed a method to restrict the neighborhood size. The method will determine the value for every minority instance and assures safety for generating synthetic data. In Jiang, Lu, and Xia (2016), the authors have proposed genetic-algorithm-based SMOTE (GAST). Optimal sampling rates are estimated and their optimal combination is found. The dataset is then balanced by generating new samples. In Nekooeimehr and Lai-Yuen (2016), a clustering technique is used and each cluster is oversampled based on the Euclidean distance. In Ramentol et al. (2016), fuzzy rough set theory (Dubois & Prade, 1990) is used as a pre-processing tool. A threshold is then defined, and if a sample does not cross it, it will be deleted. In Cervantes et al. (2017), the authors use support vectors to generate new samples. Particle swarm optimization (SMOTE-PSO) is used to handle the noise in the dataset. In Ma and Fan (2017), the authors have proposed to cluster the samples using Clustering Using Representatives (CURE-SMOTE) and then get rid of noise and outliers from the dataset. The dataset is then balanced by resampling. In Rivera (2017), the authors propose a method to eliminate the noise prior to the resampling of the dataset.

In Lee, Kim, and Kim (2017), the authors have showcased a method where a Gaussian probability distribution in the feature space is combined and new data is sampled, diverged from the line. In Koziarski and Wożniak (2017), the method is proposed in 2 phases. Firstly, the neighborhoods are cleaned. Secondly, synthetic samples are generated selectively. In Siriseriwan and Sinapiromsaran (2017), the Adaptive Neighbor Synthetic Minority Oversampling Technique (ANS) is proposed, which dynamically adjusts the number of neighbors needed to oversample the minority regions. In Douzas, Bacao, and Last (2018), k-means has been combined with SMOTE, which generates samples in the deficient minority area; the class label is not considered while generating the synthetic data.

In Douzas and Bacao (2019), the authors propose a technique where the synthetic samples are generated in the region around each of the input minority data points. Eighty-five versions of SMOTE are documented, implemented in Python and made available to the public by Kovács
(2019). In Xu, Shen, Nie, and Kou (2020), a hybrid implementation for medical data is given. It is divided into 3 sub-parts: the Modified SMOTE (MSMOTE) algorithm is used, post which ENN is used to reduce the outlier data points, and then the authors use RF for the classification. In Douzas, Rauch, and Bacao (2021), an extension named G-SOMO is presented; it identifies the optimal region for new data samples to be generated, which will increase the variability. The Natural Neighbors SMOTE algorithm (Li, Zhu, Wu, & Fan, 2021) uses an adaptive k value; the samples in the center of the class have a higher number of synthetic samples and the ones at the borders have fewer, and the outlier samples are also reduced. In Li et al. (2021), the synthetic samples are generated using a SMOTE-based algorithm; the newly generated data is then used to detect the noisy samples using error detection, which are then not removed from the dataset but are iteratively re-positioned using differential evolution.

2.2. Filter based

In Batista et al. (2004), the authors propose to extend SMOTE by integrating it with ENN and Tomek noise filters. Tomek links are used to remove the noisy samples or samples lying on the decision boundary after SMOTE is applied, whereas ENN removes more samples than Tomek and provides cleaner data by removing the samples from both classes that are misclassified by their three nearest neighbors. In Sáez et al. (2015), the authors have presented a filtering method using the ensemble-based Iterative-Partitioning Filter (IPF) noise filter to manage the samples at the borderline and the samples that are noisy. In Almogahed and Kakadiaris (2015), the authors propose a filter approach named NEATER using Game Theory (GT). (Hussein, Li, Yohannese, & Bashir, 2019) is an extension of SMOTE where the authors use a technique to identify the location of the newly generated synthetic sample. The synthetic data points that are in closer propinquity to the majority samples than to the minority samples are rejected and filtered out. In Cheng et al. (2019), the authors propose a grouped SMOTE algorithm with a noise filtering mechanism (GSMOTE-NFM). At first, a Gaussian-Mixture Model (GMM) is used to identify the distribution, and then noisy samples are filtered out by using the probability density of the samples in the different classes. In Liang, Jiang, Li, Xue, and Wang (2020), a combination of K-means and SVM is used to filter out the noise in the actual data, post which the algorithm is modified to generate synthetic data points over the connection line or extension line of the neighbor data point and the actual data sample.

that gives a proficient solution of the least-squares technique. It is an efficient algorithm that can support past, future, and present estimations. It is basically a two-stage algorithm. In the first step, estimates of the current state variables are calculated. In the next stage, weighted averages are used to update the estimates. We use the covariance grids T and P to determine noisy data. Consider g ∈ P^x as the measurements and a ∈ P^i as the state of the discrete controlled linear system. At time n, the process and measurement equations are shown in Eqs. (1) and (2) respectively.

    a_n = B_n a_{n-1} + c_n f_n + d_n    (1)

    g_n = y_n a_n + e_n    (2)

where f_n ∈ P^j is the control input at time n, a_n is the state at n, and g_n is the measurement at n. B_n is the i × i matrix relating the state at n − 1 to the state at n. c_n is the i × j matrix relating the control input at n to the state at n. y_n is the x × i matrix relating the state and the measurement at n. x, i and j are the dimensions of the matrices. e_n and d_n are the measurement noise and process noise respectively, with covariance matrices P_n and T_n and a mean of 0.

3.1.3. Our approach

Input: Data C
Output: Train data appended with mean and covariance column KalD
1 Split the data C into Train D and Test T randomly with 8:2 ratio.
2 D_res = SMOTE(D) // apply SMOTE algorithm on train data D
3 Apply Kalman filter
4 Obtain the number of columns (NC) (the label (or) output column need not be considered)
5 Create a Kalman Filter object with ISM and NDO
6 msr = D_res
7 With train data as input, call the EM algorithm using the Kalman Filter object.
8 Calculate the mean and covar
9 KalD = Append mean and covar to D_res.
10 return KalD
Algorithm 1: Kalman filter application and obtaining mean and covariance for each row of the data
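To make the two-stage predict/update cycle behind Eqs. (1) and (2) concrete, the following is a minimal one-dimensional sketch. It is not the paper's pykalman-based implementation: the scalar coefficients B, c, y and the noise covariances T (process) and P (measurement) are illustrative values only.

```python
# Scalar analogue of a_n = B a_{n-1} + c f_n + d_n (process) and
# g_n = y a_n + e_n (measurement), with process noise T and measurement noise P.
def kalman_1d(measurements, B=1.0, c=0.0, y=1.0, T=1e-3, P=0.1):
    """Return per-step state means and covariances for a scalar system."""
    a, V = 0.0, 1.0          # initial state mean and covariance
    means, covs = [], []
    for g in measurements:
        # Stage 1: predict the current state from the previous estimate.
        a_pred = B * a + c * 0.0          # no control input in this sketch
        V_pred = B * V * B + T
        # Stage 2: update the prediction via a weighted average (Kalman gain).
        K = V_pred * y / (y * V_pred * y + P)
        a = a_pred + K * (g - y * a_pred)
        V = (1 - K * y) * V_pred
        means.append(a)
        covs.append(V)
    return means, covs

means, covs = kalman_1d([1.0, 1.2, 0.9, 1.1])
```

Running the filter over a short sequence of noisy readings shows the estimated covariance shrinking as evidence accumulates, which is the quantity KSMOTE later thresholds to flag noisy rows.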
filter object is created with initial_state_mean (ISM), which is the 'initial condition for the model', as 0, and n_dim_obs (NDO), which is the 'size of the state or observation space', as NC, and is applied on the oversampled data. A part of the 'pykalman' package (pyp, 2013) is modified and used for this purpose. In the Kalman Filter object, we call the Expectation–Maximization (EM) algorithm (Moon, 1996). The training data is considered as measurements (msr) and given as an input to the filter. In our experiments, we have considered niter to be 5 for the EM algorithm because this avoids overfitting, as explained in pyp (2013). This algorithm calculates the maximum likelihood of the parameters iteratively. On applying the filter, the mean and covariance values are obtained corresponding to each row of the dataset.

We employ some notations to describe the second stage of the method. B indicates the number of rows before applying SMOTE and A denotes the number of rows after applying SMOTE. N portrays the percentage of increase in the data (synthetic samples generated). The percentage of data to be removed is calculated as shown in Algorithm 2. The count of data samples to be removed in each iteration is also calculated. To maintain data balance, Y samples are discarded proportionally from all classes. In Stage 3, the train dataset is referred to as 'data' and 'numberofclasses' is the number of class attributes for the dataset. We obtain a count of all unique values of the covariances that were calculated. If a considerably large number of rows have the same covariance, then, depending on the nos calculated, the data samples with the highest covariance are dropped from each class equally. The Random Forest (Breiman, 1999) classifier is run on the new dataframe. If the covariance is different for most of the rows, then the data is sorted according to the covariance values. The data samples with the highest covariances are dropped iteratively from every class proportionally. The values in the covariance matrix represent the amount of noise present; hence we filter the noise according to the covariance values. After this process, the Random Forest is applied. We have considered Q=5 for our experiments; the results are computed for each iteration and the finest result of all the iterations is considered. The dataset corresponding to that result gives the best performance when used with the classifier. Fig. 1 depicts the working of KSMOTE.

Input: KalD, rem, Q
Output: Classifier result on the datasets for each iteration
1 if most of the rows have the same covariance value then
2   for j = 1 to Q do
3     for i = 1 to numberofclasses do
4       data_{i+1} = data_i[max value covar] AND data_i[label]
5       data_{i+1} = data_{i+1}[remove rem random rows]
6       data_j = data_j.append(data_{i+1})
7     end
8     data_{numberofclasses+1} = data_{i+1}[covar != max covar]
9     data_j = data_j.append(data_{numberofclasses+1})
10    data_trail = drop mean and covar columns
11    Separate X_train having data attributes and y_train having output label from data_trail
12    Apply Random Forest Classifier
13  end
14 end
15 else
16   for j = 1 to Q do
17     for i = 1 to numberofclasses do
18       data_{i+1} = data_i[label]
19       data_{i+1} = data_{i+1}[remove rem samples starting from highest covariance value]
20       data_j = data_j.append(data_{i+1})
21     end
22     data_trail = drop mean and covar columns
23     Separate X_train having data attributes and y_train having output label from data_trail
24     Apply Random Forest Classifier
25   end
26 end
Algorithm 3: Removing data samples and classifying the result

4. Experimental results and discussions

4.1. Experimental setup

Here, we describe the datasets, the analysis methodology and the evaluation metrics used in our experiments. Furthermore, an in-depth comparison of the classifiers' results is provided. We have validated the experimental outcomes against other contemporary techniques, viz., SMOTE, ADASYN, BorderlineSMOTE (blSMOTE), SMOTEENN and SMOTETOMEK. The NN parameter's value is initialized as 5. In each iteration, our model removes data proportionally from all the classes, thereby maintaining the class balance. The random forest classifier is employed, after which multiple metrics are computed. We have also used the Decision Tree Classifier and Extra Tree Classifier to demonstrate the results. The test–train partition was done in 80:20 ratio for all the datasets. The experiments were run three times and the average of the results is considered. The experiments were performed on the server ''Flounder'' at Florida International University, which has an AMD Opteron Processor 6380 (64 cores) along with 504 GB RAM. The implementation of our work is available on GitHub.1

4.2. Datasets

Several real-time datasets regarding click fraud and UCI benchmark datasets are used. All of these datasets have disproportional class distributions. The string values, wherever present in the datasets, were converted to numbers with the help of label encoding. Under the domain of click fraud, the Talking Data (kag, 2018) (Thejas et al., 2019), the Avazu Click-Through Rate (CTR) (kag, 2015) and the Criteo datasets (cri, 2017) have been considered. The Talking Data dataset contains 9 attributes. Data preprocessing was performed by separating the attribute 'click_time' into four different attributes, 'day', 'hour', 'min' and 'sec'. A million random samples were considered from the entire dataset, which originally had 184,903,890 entries. The Criteo dataset was randomly sampled prior to usage and all the rows with NaN values were dropped. The attribute 'hour of click' in the Avazu dataset was split into different columns. The data was collected over a period of ten days and is chronologically ordered. We made use of a million random samples for the experiment. Datasets from the UCI Archive are used. For the CMC dataset, we have combined the class attributes 'Long-term' and 'Short-term' as 'Use'. These datasets are commonly used in multiple papers and are now treated as benchmark datasets. Table 1 provides a brief outline of the datasets used in our experiments. It specifies the Alias Name (AN) for the datasets, the number of samples (NOS), the type of class distribution and the number of attributes of the dataset (#Attr).

4.3. Analysis methodology

To test the proposed model, we considered the contemporary techniques, namely, SMOTE, SMOTEENN, SMOTETOMEK, ADASYN and BorderlineSMOTE (blSMOTE). To compare the different models, parameters such as Acry, AUC, Pres, Rcl and F1scr were made use of. A mathematical definition of these metrics is presented in the following subsection.

1 https://github.com/thejasgs/SMO
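As an illustration of the preprocessing described in Section 4.2, the sketch below splits a 'click_time'-style timestamp into 'day', 'hour', 'min' and 'sec' fields and label-encodes a string attribute. The function names and the timestamp format are our own assumptions, not the exact pipeline used in the experiments.

```python
from datetime import datetime

def split_click_time(ts):
    """Split one 'click_time' timestamp into day/hour/min/sec fields."""
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return {"day": t.day, "hour": t.hour, "min": t.minute, "sec": t.second}

def label_encode(values):
    """Map each distinct string to an integer, as a stand-in for label encoding."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]
```

For example, split_click_time("2017-11-07 09:30:38") yields {'day': 7, 'hour': 9, 'min': 30, 'sec': 38}, and label_encode maps repeated string values to consistent integer codes.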
Table 1
Dataset description.

Dataset                             AN   NOS        #Attr  Class
Haberman                            D1   306        4      Binary
Hepatitis                           D2   115        20     Binary
Statlog (German Credit Data)        D3   1000       20     Binary
Talking Data                        D4   1,000,000  9      Binary
Avazu                               D5   1,000,000  24     Binary
Display Ad. Challenge-Criteo Labs   D6   756,554    40     Binary
Credit Card Fraud Data              D7   284,807    30     Binary
Z-Alizadeh Sani dataset             D8   302        55     Binary
Contraceptive Method Choice (CMC)   D9   1178       10     Binary
Pima                                D10  579        9      Binary

evaluate our work. The AUC is one such parameter which is not greatly affected by overfitting.

We have also used Pres, Rcl and F1scr, which are calculated as given in Eqs. (4), (5) and (6).

Pres = TP / (TP + FP)                      (4)

Rcl = TP / (TP + FN)                       (5)

F1scr = 2 × (Pres × Rcl) / (Pres + Rcl)    (6)
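Eqs. (4)–(6) translate directly into code. The sketch below is a minimal illustration computing the three metrics from raw confusion-matrix counts; the function names mirror the paper's abbreviations and are not from any library.

```python
def pres(tp, fp):
    """Precision, Eq. (4): TP / (TP + FP)."""
    return tp / (tp + fp)

def rcl(tp, fn):
    """Recall, Eq. (5): TP / (TP + FN)."""
    return tp / (tp + fn)

def f1scr(p, r):
    """F1 score, Eq. (6): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Example counts: 8 true positives, 2 false positives, 4 false negatives.
p, r = pres(8, 2), rcl(8, 4)   # p = 0.8, r = 2/3
```

Note that F1 is driven toward the smaller of the two components, which is why it is more informative than accuracy on imbalanced data.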
4.5. Results
Table 2
Results on Raw data — Random Forest.
Dataset #1’s #0’s %min Acry AUC Rcl F1scr Pres
D1 118 56 32.18 0.69355 0.55430 0.88637 0.80413 0.73585
D2 26 94 21.66 0.83871 0.6875 0.375 0.54546 0.99
D3 558 242 24.2 0.775 0.64821 0.95744 0.85714 0.77586
D4 197411 602589 24.67 0.85609 0.72235 0.45765 0.61141 0.92075
D5 136122 663878 17.01 0.83265 0.51087 0.0249 0.04785 0.61522
D6 190488 414755 31.47 0.71167 0.56449 0.1663 0.26664 0.67224
D7 391 227454 0.172 0.99943 0.86220 0.72448 0.81609 0.93421
D8 70 172 28.92 0.83606 0.75452 0.55555 0.66666 0.83333
D9 686 492 41.7 0.70847 0.67983 0.92121 0.77948 0.67555
D10 201 378 34.71 0.8 0.75022 0.62791 0.65061 0.675
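A raw-data baseline of the kind reported in Table 2 can be reproduced in outline with scikit-learn. This is a hedged sketch: the synthetic dataset stands in for D1–D10, and the authors' exact splits and hyperparameters are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for one of the paper's datasets.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# The same five metrics used in Table 2.
scores = {
    "Acry": accuracy_score(y_te, pred),
    "AUC": roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]),
    "Rcl": recall_score(y_te, pred),
    "F1scr": f1_score(y_te, pred),
    "Pres": precision_score(y_te, pred),
}
```

AUC is computed from the predicted class-1 probabilities rather than the hard labels, which is what makes it less sensitive to the decision threshold than accuracy.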
Fig. 2. ROC curve plots for the D1 data: (a) the ROC curve of the Random Forest Classifier, (b) the ROC curve of the Decision Tree Classifier, (c) the ROC curve of the Extra Tree Classifier.
Table 5 (continued).

Dataset  Model  Method      Acry     AUC      Rcl      F1scr    Pres
D10      ET     SMOTE       0.75172  0.74602  0.72916  0.66037  0.60344
                ADASYN      0.69655  0.69426  0.6875   0.6      0.53225
                blSMOTE     0.72413  0.72540  0.72916  0.63636  0.56451
                SMOTEENN    0.70344  0.73625  0.83333  0.65040  0.53333
                SMOTETOMEK  0.74482  0.73560  0.70833  0.64761  0.59649
                KSMOTE      0.73793  0.73571  0.72916  0.64814  0.58333

• The performance of our model is better for the click fraud datasets and the benchmark datasets.
• For the dataset D10 run on the ET model, the AUC is close to that of the best-performing model, SMOTE.
• To summarize, the proposed model shows good results when tested with different datasets. Although the accuracy might decrease, the AUC scores obtained are higher, denoting a better model.
Fig. 3. ROC curve plots for the D9 data: (a) the ROC curve of the Random Forest Classifier, (b) the ROC curve of the Decision Tree Classifier, (c) the ROC curve of the Extra Tree Classifier.
In Fig. 3(a), the ROC curve of the Random Forest Classifier can be observed. Fig. 3(b) portrays the ROC curve of the Decision Tree Classifier. Fig. 3(c) represents the ROC curve of the Extra Trees Classifier.

Table 6
F1scr comparison — Random Forest.

Dataset  KSMOTE   GSMOTE-NFM  CURE-SMOTE
D1       0.71233  0.4821      0.5000
D10      0.69565  0.6726      NA

4.6. Statistical significance test
References

Chen, S., Guo, G., & Chen, L. (2010). A new over-sampling method based on cluster ensembles. In 2010 IEEE 24th international conference on advanced information networking and applications workshops (pp. 599–604). http://dx.doi.org/10.1109/WAINA.2010.40.
Cheng, K., Zhang, C., Yu, H., Yang, X., Zou, H., & Gao, S. (2019). Grouped SMOTE with noise filtering mechanism for classifying imbalanced data. IEEE Access, 7, 170668–170681.
Cieslak, D. A., Chawla, N. V., & Striegel, A. (2006). Combating imbalance in network intrusion datasets. In GrC (pp. 732–737).
Cohen, G., Hilario, M., Sax, H., Hugonnet, S., & Geissbuhler, A. (2006). Learning from imbalanced data in surveillance of nosocomial infection. Artificial Intelligence In Medicine, 37, 7–18.
Kaggle display advertising challenge dataset. (2017). URL: https://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/.
Dang, X. T., Tran, D. H., Hirose, O., & Satou, K. (2015). SPY: A novel resampling method for improving classification performance in imbalanced data. In 2015 seventh international conference on knowledge and systems engineering (pp. 280–285). http://dx.doi.org/10.1109/KSE.2015.24.
De La Calleja, J., & Fuentes, O. (2007). A distance-based over-sampling method for learning from imbalanced data sets. In FLAIRS conference (pp. 634–635).
De La Calleja, J., Fuentes, O., & González, J. (2008). Selecting minority examples from misclassified data for over-sampling. In FLAIRS conference (pp. 276–281).
Deepa, T., & Punithavalli, M. (2011). An E-SMOTE technique for feature selection in high-dimensional imbalanced dataset. In 2011 3rd international conference on electronics computer technology, vol. 2 (pp. 322–324). http://dx.doi.org/10.1109/ICECTECH.2011.5941710.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The Journal Of Machine Learning Research, 7, 1–30.
Dong, Y., & Wang, X. (2011). A new over-sampling approach: Random-SMOTE for learning from imbalanced data sets. In H. Xiong, & W. B. Lee (Eds.), Knowledge science, engineering and management (pp. 343–352). Berlin, Heidelberg: Springer Berlin Heidelberg.
Douzas, G., & Bacao, F. (2019). Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE. Information Sciences, 501, 118–135.
Douzas, G., Bacao, F., & Last, F. (2018). Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Information Sciences, 465, 1–20. http://dx.doi.org/10.1016/j.ins.2018.06.056.
Douzas, G., Rauch, R., & Bacao, F. (2021). G-SOMO: An oversampling approach based on self-organized maps and geometric SMOTE. Expert Systems With Applications, Article 115230.
Dubois, D., & Prade, H. (1990). Rough fuzzy sets and fuzzy rough sets. International Journal Of General Systems, 17, 191–209. http://dx.doi.org/10.1080/03081079008935107.
Fan, X., Tang, K., & Weise, T. (2011). Margin-based over-sampling method for learning from imbalanced datasets. In Pacific-Asia conference on knowledge discovery and data mining (pp. 309–320). Springer.
Farquad, M., & Bose, I. (2012). Preprocessing unbalanced data using support vector machine. Decision Support Systems, 53, 226–233.
Fawcett, T., & Provost, F. (1997). Adaptive fraud detection. Data Mining And Knowledge Discovery, 1, 291–316.
Fernández-Navarro, F., Hervás-Martínez, C., & Gutiérrez, P. A. (2011). A dynamic over-sampling procedure based on sensitivity for multi-class problems. Pattern Recognition, 44, 1821–1833. http://dx.doi.org/10.1016/j.patcog.2011.02.019.
Gao, M., Hong, X., Chen, S., Harris, C. J., & Khalaf, E. (2014). PDFOS: PDF estimation based over-sampling for imbalanced two-class problems. Neurocomputing, 138, 248–259. http://dx.doi.org/10.1016/j.neucom.2014.02.006.
García, V., Sánchez, J., & Mollineda, R. (2007). An empirical study of the behavior of classifiers on imbalanced and overlapped data sets. In Iberoamerican congress on pattern recognition (pp. 397–406). Springer.
Gazzah, S., & Amara, N. E. B. (2008). New oversampling approaches based on polynomial fitting for imbalanced data sets. In 2008 the eighth IAPR international workshop on document analysis systems (pp. 677–684). IEEE.
Gazzah, S., Hechkel, A., & Essoukri Ben Amara, N. (2015). A hybrid sampling method for imbalanced data. In 2015 IEEE 12th international multi-conference on systems, signals devices (pp. 1–6). http://dx.doi.org/10.1109/SSD.2015.7348093.
Gu, Q., Cai, Z., & Zhu, L. (2009). Classification of imbalanced data sets by using the hybrid re-sampling algorithm based on isomap. In Z. Cai, Z. Li, Z. Kang, & Y. Liu (Eds.), Advances in computation and intelligence (pp. 287–296). Berlin, Heidelberg: Springer Berlin Heidelberg.
Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International conference on intelligent computing (pp. 878–887). Springer.
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE international joint conference on neural networks (pp. 1322–1328). IEEE.
He, H., & Garcia, E. A. (2008). Learning from imbalanced data. IEEE Transactions On Knowledge & Data Engineering, 1263–1284.
Hu, J., He, X., Yu, D.-J., Yang, X.-B., Yang, J.-Y., & Shen, H.-B. (2014). A new supervised over-sampling algorithm with application to protein-nucleotide binding residue prediction. PLOS ONE, 9, 1–10. http://dx.doi.org/10.1371/journal.pone.0107676.
Hu, F., & Li, H. (2013). A novel boundary oversampling algorithm based on neighborhood rough set model: NRSBoundary-SMOTE. Mathematical Problems In Engineering, 2013.
Hu, S., Liang, Y., Ma, L., & He, Y. (2009). MSMOTE: improving classification performance when training data is imbalanced. In 2009 second international workshop on computer science and engineering, vol. 2 (pp. 13–17). IEEE.
Hussein, A. S., Li, T., Yohannese, C. W., & Bashir, K. (2019). A-SMOTE: A new preprocessing approach for highly imbalanced datasets by improving SMOTE. International Journal Of Computational Intelligence Systems, 12, 1412–1422.
Japkowicz, N. (2003). Class imbalances: are we focusing on the right issue. In Workshop on learning from imbalanced data sets II, vol. 1723 (pp. 63).
Jiang, K., Lu, J., & Xia, K. (2016). A novel algorithm for imbalance data classification based on genetic algorithm improved SMOTE. Arabian Journal For Science And Engineering, 41, 3255–3266. http://dx.doi.org/10.1007/s13369-016-2179-2.
Jiang, L., Qiu, C., & Li, C. (2015). A novel minority cloning technique for cost-sensitive learning. International Journal Of Pattern Recognition And Artificial Intelligence, 29, Article 1551004. http://dx.doi.org/10.1142/S0218001415510040.
Click-through rate prediction. (2015). URL: https://www.kaggle.com/c/avazu-ctr-prediction/data.
TalkingData AdTracking fraud detection challenge. (2018). URL: https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal Of Basic Engineering, 82, 35–45.
Kang, Y., & Won, S. (2010). Weight decision algorithm for oversampling technique on class-imbalanced learning. In ICCAS 2010 (pp. 182–186). http://dx.doi.org/10.1109/ICCAS.2010.5669889.
Khor, K.-C., Ting, C.-Y., & Phon-Amnuaisuk, S. (2012). A cascaded classifier approach for improving detection rates on rare attack categories in network intrusion detection. Applied Intelligence, 36, 320–329.
Koto, F. (2014). SMOTE-out, SMOTE-cosine, and selected-SMOTE: An enhancement strategy to handle imbalance in data level. In 2014 international conference on advanced computer science and information system (pp. 280–284). IEEE.
Kovács, G. (2019). Smote-variants: A python implementation of 85 minority oversampling techniques. Neurocomputing, 366, 352–354.
Koziarski, M., & Wożniak, M. (2017). CCR: A combined cleaning and resampling algorithm for imbalanced data classification. International Journal Of Applied Mathematics And Computer Science, 27, 727–736.
Kubat, M., Matwin, S., et al. (1997). Addressing the curse of imbalanced training sets: one-sided selection. In ICML, vol. 97 (pp. 179–186). Nashville, USA.
Lee, H., Kim, J., & Kim, S. (2017). Gaussian-based SMOTE algorithm for solving skewed class distributions. International Journal of Fuzzy Logic and Intelligent Systems, 17, 229–234.
Lee, J., Kim, N.-r., & Lee, J.-H. (2015). An over-sampling technique with rejection for imbalanced class learning. In Proceedings of the 9th international conference on ubiquitous information management and communication (pp. 102:1–102:6). New York, NY, USA: ACM. http://dx.doi.org/10.1145/2701126.2701181.
Li, J., Fong, S., & Zhuang, Y. (2015). Optimizing SMOTE by metaheuristics with neural network and decision tree. In 2015 3rd international symposium on computational and business intelligence (pp. 26–32). http://dx.doi.org/10.1109/ISCBI.2015.12.
Li, K., Zhang, W., Lu, Q., & Fang, X. (2014). An improved SMOTE imbalanced data classification method based on support degree. In 2014 international conference on identification, information and knowledge in the internet of things (pp. 34–38). http://dx.doi.org/10.1109/IIKI.2014.14.
Li, J., Zhu, Q., Wu, Q., & Fan, Z. (2021). A novel oversampling technique for class-imbalanced learning based on SMOTE and natural neighbors. Information Sciences, 565, 438–455.
Li, J., Zhu, Q., Wu, Q., Zhang, Z., Gong, Y., He, Z., et al. (2021). SMOTE-NaN-DE: Addressing the noisy and borderline examples problem in imbalanced classification by natural neighbors and differential evolution. Knowledge-Based Systems, 223, Article 107056.
Li, H., Zou, P., Wang, X., & Xia, R. (2013). A new combination sampling method for imbalanced data. In Z. Sun, & Z. Deng (Eds.), Proceedings of 2013 Chinese intelligent automation conference (pp. 547–554). Berlin, Heidelberg: Springer Berlin Heidelberg.
Liang, X., Jiang, A., Li, T., Xue, Y., & Wang, G. (2020). LR-SMOTE — An improved unbalanced data set oversampling based on K-means and SVM. Knowledge-Based Systems, 196, Article 105845.
Liu, Y., Loh, H. T., & Sun, A. (2009). Imbalanced text classification: A term weighting approach. Expert Systems With Applications, 36, 690–701.
López, V., Fernández, A., García, S., Palade, V., & Herrera, F. (2013). An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, 250, 113–141.
López, V., Triguero, I., Carmona, C. J., García, S., & Herrera, F. (2014). Addressing imbalanced classification with instance generation techniques: IPADE-ID. Neurocomputing, 126, 15–28. http://dx.doi.org/10.1016/j.neucom.2013.01.050.
Ma, L., & Fan, S. (2017). CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests. BMC Bioinformatics, 18, 169. http://dx.doi.org/10.1186/s12859-017-1578-z.
Maciejewski, T., & Stefanowski, J. (2011). Local neighbourhood extension of SMOTE for mining imbalanced data. In 2011 IEEE symposium on computational intelligence and data mining (pp. 104–111). http://dx.doi.org/10.1109/CIDM.2011.5949434.
Mahmoudi, S., Moradi, P., Akhlaghian, F., & Moradi, R. (2014). Diversity and separable metrics in over-sampling technique for imbalanced data classification. In 2014 4th international conference on computer and knowledge engineering (pp. 152–158). http://dx.doi.org/10.1109/ICCKE.2014.6993409.
Moon, T. K. (1996). The expectation-maximization algorithm. IEEE Signal Processing Magazine, 13, 47–60. http://dx.doi.org/10.1109/79.543975.
Nakamura, M., Kajiwara, Y., Otsuka, A., & Kimura, H. (2013). LVQ-SMOTE: Learning vector quantization based synthetic minority over-sampling technique for biomedical data. BioData Mining, 6, 16.
Napierała, K., Stefanowski, J., & Wilk, S. (2010). Learning from imbalanced data in presence of noisy and borderline examples. In International conference on rough sets and current trends in computing (pp. 158–167). Springer.
Nekooeimehr, I., & Lai-Yuen, S. K. (2016). Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets. Expert Systems With Applications, 46, 405–416. http://dx.doi.org/10.1016/j.eswa.2015.10.031.
Puntumapon, K., & Waiyamai, K. (2012). A pruning-based approach for searching precise and generalized region for synthetic minority over-sampling. In Pacific-Asia conference on knowledge discovery and data mining (pp. 371–382). Springer.
Pykalman. (2013). URL: https://pypi.org/project/pykalman/.
Ramentol, E., Caballero, Y., Bello, R., & Herrera, F. (2012). SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowledge And Information Systems, 33, 245–265.
Ramentol, E., Gondres, I., Lajes, S., Bello, R., Caballero, Y., Cornelis, C., et al. (2016). Fuzzy-rough imbalanced learning for the diagnosis of high voltage circuit breaker maintenance: The SMOTE-FRST-2T algorithm. Engineering Applications Of Artificial Intelligence, 48, 134–139. http://dx.doi.org/10.1016/j.engappai.2015.10.009.
Rivera, W. A. (2017). Noise reduction a priori synthetic over-sampling for class imbalanced data sets. Information Sciences, 408, 146–161. http://dx.doi.org/10.1016/j.ins.2017.04.046.
Rivera, W. A., & Xanthopoulos, P. (2016). A priori synthetic over-sampling methods for increasing classification sensitivity in imbalanced data sets. Expert Systems With Applications, 66, 124–135. http://dx.doi.org/10.1016/j.eswa.2016.09.010.
Rong, T., Gong, H., & Ng, W. W. Y. (2014). Stochastic sensitivity oversampling technique for imbalanced data. In X. Wang, W. Pedrycz, P. Chan, & Q. He (Eds.), Machine learning and cybernetics (pp. 161–171). Berlin, Heidelberg: Springer Berlin Heidelberg.
Sáez, J. A., Luengo, J., Stefanowski, J., & Herrera, F. (2015). SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Information Sciences, 291, 184–203. http://dx.doi.org/10.1016/j.ins.2014.08.051.
Sanchez, A. I., Morales, E. F., & Gonzalez, J. A. (2013). Synthetic oversampling of instances using clustering. International Journal On Artificial Intelligence Tools, 22, Article 1350008.
Siriseriwan, W., & Sinapiromsaran, K. (2017). Adaptive neighbor synthetic minority over-sampling technique under 1NN outcast handling. Songklanakarin Journal of Science and Technology, 39, 565–576.
Stefanowski, J., & Wilk, S. (2008). Selective pre-processing of imbalanced data for improving classification performance. In International conference on data warehousing and knowledge discovery (pp. 283–292). Springer.
Tang, S., & Chen, S.-p. (2008). The generation mechanism of synthetic minority class examples. In 2008 international conference on information technology and applications in biomedicine (pp. 444–447). IEEE.
Tang, B., & He, H. (2015). KernelADASYN: Kernel based adaptive synthetic data generation for imbalanced learning. In 2015 IEEE congress on evolutionary computation (pp. 664–671). http://dx.doi.org/10.1109/CEC.2015.7256954.
Tek, F. B., Dempster, A. G., & Kale, I. (2010). Parasite detection and identification for automated thin blood film malaria diagnosis. Computer Vision And Image Understanding, 114, 21–32.
Thejas, G. S., Boroojeni, K. G., Chandna, K., Bhatia, I., Iyengar, S. S., & Sunitha, N. R. (2019). Deep learning-based model to fight against ad click fraud. In Proceedings of the 2019 ACM southeast conference (pp. 176–181). New York, NY, USA: ACM. http://dx.doi.org/10.1145/3299815.3314453.
Torres, F. R., Carrasco-Ochoa, J. A., & Martínez-Trinidad, J. F. (2016). SMOTE-D a deterministic version of SMOTE. In J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, V. Ayala Ramirez, J. A. Olvera-López, & X. Jiang (Eds.), Pattern recognition (pp. 177–188). Cham: Springer International Publishing.
Wang, S., Li, Z., Chao, W., & Cao, Q. (2012). Applying adaptive over-sampling technique based on data density and cost-sensitive SVM to imbalanced learning. In The 2012 international joint conference on neural networks (pp. 1–8). http://dx.doi.org/10.1109/IJCNN.2012.6252696.
Wang, J., Xu, M., Wang, H., & Zhang, J. (2006). Classification of imbalanced data by using the SMOTE algorithm and locally linear embedding. In 2006 8th international conference on signal processing, vol. 3. IEEE.
Wilcoxon, F. (1992). Individual comparisons by ranking methods. In Breakthroughs in statistics (pp. 196–202). Springer.
Xie, Z., Jiang, L., Ye, T., & Li, X. (2015). A synthetic minority oversampling method based on local densities in low-dimensional space for imbalanced learning. In M. Renz, C. Shahabi, X. Zhou, & M. A. Cheema (Eds.), Database systems for advanced applications (pp. 3–18). Cham: Springer International Publishing.
Xu, Y. H., Li, H., Le, L. P., & Tian, X. Y. (2014). Neighborhood triangular synthetic minority over-sampling technique for imbalanced prediction on small samples of Chinese tourism and hospitality firms. In 2014 seventh international joint conference on computational sciences and optimization (pp. 534–538). http://dx.doi.org/10.1109/CSO.2014.104.
Xu, Z., Shen, D., Nie, T., & Kou, Y. (2020). A hybrid sampling algorithm combining M-SMOTE and ENN based on Random Forest for medical imbalanced data. Journal Of Biomedical Informatics, 107, Article 103465.
Young, W. A., Nykl, S. L., Weckman, G. R., & Chelberg, D. M. (2015). Using voronoi diagrams to improve classification performances when modeling imbalanced datasets. Neural Computing And Applications, 26, 1041–1054. http://dx.doi.org/10.1007/s00521-014-1780-0.
Yun, J., Ha, J., & Lee, J.-S. (2016). Automatic determination of neighborhood size in SMOTE. In Proceedings of the 10th international conference on ubiquitous information management and communication (pp. 100:1–100:8). New York, NY, USA: ACM. http://dx.doi.org/10.1145/2857546.2857648.
Zhang, H., & Li, M. (2014). RWO-sampling: A random walk over-sampling approach to imbalanced data classification. Information Fusion, 20, 99–116. http://dx.doi.org/10.1016/j.inffus.2013.12.003.
Zhang, L., & Wang, W. (2011). A re-sampling method for class imbalance learning with credit data. In 2011 international conference of information technology, computer engineering and management sciences, vol. 1 (pp. 393–397). http://dx.doi.org/10.1109/ICM.2011.34.
Zhou, B., Yang, C., Guo, H., & Hu, J. (2013). A quasi-linear SVM combined with assembled SMOTE for imbalanced data classification. In The 2013 international joint conference on neural networks (pp. 1–7). http://dx.doi.org/10.1109/IJCNN.2013.6707035.
Zikopoulos, P., Eaton, C., et al. (2011). Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Osborne Media.