
PEOPLE’S DEMOCRATIC REPUBLIC OF ALGERIA

MINISTRY OF HIGHER EDUCATION AND SCIENTIFIC RESEARCH


HIGHER SCHOOL OF COMPUTER SCIENCE

2nd Year Superior Cycle (2CS)


2022-2023

Big Data Mining

TP N°6: Imbalanced Classes

Realized by:

- Mohammed Abderrahmane
Bensalem

- Safa Zakaria Abdellah

Contents

1 Problem Description

2 Algorithms to handle the imbalanced classes problem
  2.1 Evaluation Metrics
    2.1.1 F1-score
    2.1.2 ROC-AUC
    2.1.3 Balanced Accuracy Score
  2.2 Resampling
    2.2.1 SMOTE (Synthetic Minority Over-sampling Technique)
  2.3 Transformers
  2.4 Ensemble methods
    2.4.1 CatBoost
    2.4.2 Extremely Randomized Trees

3 Data set: Credit Card

4 Testing ET and CatBoost on the Credit Card dataset
  4.1 Notes and Difficulties
  4.2 The test

5 Testing ET and CatBoost on the Credit Card dataset After Balancing
Introduction
Classification problems are quite common in the machine learning world. Classification algorithms predict a class label by studying the input data or predictors, where the target or output variable is categorical in nature. When designing a predictive model, there is usually the assumption that the classes in the training data are balanced, that is, that there is a roughly equal number of instances for each class. The majority of machine learning algorithms make this assumption, but it leads to problems when dealing with highly imbalanced data, resulting in poor performance and predictions for the minority classes; and the reality of our physical world tends to be imbalanced.

1 Problem Description
Imbalanced classes is the problem that occurs when dealing with data sets that have a highly imbalanced distribution of categorical values, meaning that one class has an overwhelming number of members compared to the others. This leads to biased predictions and poor performance on the minority classes. For example, a commonly used dataset for fraud detection is the Credit Card Fraud Detection dataset, which contains transactions made by credit cards in September 2013 by European cardholders. The dataset is imbalanced, as only 0.17% of the transactions are fraudulent: it contains 284,807 transactions, out of which 492 are fraudulent.

Such a data set could lead to a biased model that predicts 100% of transactions as non-fraudulent. Following the classic approach for evaluation using the confusion matrix,

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (0 + 99.83) / (0 + 99.83 + 0 + 0.17) = 99.83%

the model performs very poorly on the minority class, yet its accuracy is 99.83%! This is misleading. Before tackling the problem, there are a few points to consider:

- Class imbalance is more common than balanced classes.

- The imbalanced data itself isn't as much of a problem as the metric used to measure accuracy; as discussed previously, an error rate of 0.17% sounds good on paper.

- Possible overfitting may occur on the minority class: if the minority class has only 100 samples, the model may easily memorize them, leading to an overfit model that generalizes poorly.
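
To make the accuracy paradox concrete, here is a minimal sketch in which labels are synthesized to match the dataset's 0.17% fraud rate (an illustration, not the actual data) and a classifier that always predicts the majority class reaches 99.83% accuracy while catching no fraud at all:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# synthetic labels with the same 0.17% fraud rate as the Credit Card dataset
y = np.zeros(284807, dtype=int)
y[:492] = 1
X = np.zeros((284807, 1))   # features are irrelevant to a majority-class predictor

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)

print("Accuracy:", accuracy_score(y, y_pred))        # ~0.9983, looks excellent
print("Recall on fraud:", recall_score(y, y_pred))   # 0.0, every fraud is missed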

2 Algorithms to handle the imbalanced classes problem
This section is divided into three main parts: the evaluation metrics, since the classic accuracy test is misleading; resampling methods, which focus on balancing the dataset; and ensemble methods. In addition, we tackle a newer technique, transformers.

2.1 Evaluation Metrics

One of the key problems related to handling imbalanced datasets is using inappropriate metrics. For this we list a few evaluation metrics that are better suited for measuring accuracy on imbalanced datasets.

2.1.1 F1-score
The F1-score was already seen in the course. By combining precision and recall, the F-score provides a balanced measure of the model's performance on both classes, which is useful for imbalanced datasets. It penalizes the model for predicting the majority class too often and for not identifying the minority class correctly. In this way, the F-score helps to evaluate the model's ability to identify the minority class and achieve a balance between precision and recall on both classes. It is calculated using the following formula:

F1 = 2 ∗ (recall ∗ precision) / (recall + precision)   (1)

2.1.2 ROC-AUC
The ROC (Receiver Operating Characteristic) curve is a graphical representation of the
performance of a binary classification model at different classification thresholds. It is created
by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various
threshold settings.

Figure 1: ROC Curve Example

In imbalanced datasets, where the positive class is a minority, the ROC curve is especially useful because it allows the user to visualize how well the model performs across different threshold settings. This is important because the choice of threshold can have a significant impact on the model's performance. For example, if the goal is to maximize the TPR while keeping the FPR low, the ROC curve can help identify the threshold that achieves this trade-off. The AUC, on the other hand, is the area under the ROC curve, with a value ranging from 0 to 1; the closer its value is to 1, the more accurate the model is. The AUC under the ROC curve can be calculated using the following formula:

AUC = ∫₀¹ TP / (TP + FN) d(FPR) = ∫₀¹ TPR d(FPR)   (2)

2.1.3 Balanced Accuracy Score

The balanced accuracy score is a metric used to evaluate the performance of a binary classification model on imbalanced datasets. It is similar to the traditional accuracy metric, but it takes the class imbalance into account by considering the average of the true positive rate (TPR) for both classes. More specifically, the balanced accuracy score is defined as the arithmetic mean of the TPR for each class:

BalancedAccuracy = (TPR₁ + TPR₀) / 2   (3)

where TPR₁ is the true positive rate for the positive class (i.e., the class of interest), and TPR₀ is the true positive rate for the negative class. The balanced accuracy score ranges from 0 to 1, with a higher score indicating better performance. A score of 0.5 indicates a random classifier, and a score of 1 indicates perfect classification performance.
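
To make these metrics concrete, the sketch below computes all three with scikit-learn on a small invented set of predictions:

from sklearn.metrics import f1_score, roc_auc_score, balanced_accuracy_score

# invented ground truth and predictions for a 10:2 imbalanced problem
y_true   = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred   = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]                          # hard labels
y_scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.3, 0.6, 0.9, 0.4]  # probabilities

print("F1:", f1_score(y_true, y_pred))                                 # Eq. (1)
print("ROC-AUC:", roc_auc_score(y_true, y_scores))                     # Eq. (2)
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))   # Eq. (3)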

2.2 Resampling
Using clear, descriptive metrics isn't enough to handle imbalanced classes. The first naive approach would be to modify the dataset itself: this includes decreasing the sample size of the majority class or increasing the sample size of the minority class.

2.2.1 SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE is a popular oversampling technique used to handle imbalanced datasets. It was inspired by a method used in handwritten digit recognition (Ha & Bunke, 1997) in which the minority class samples were shifted. In SMOTE, new examples are generated synthetically using the k nearest neighbors: if the amount of over-sampling needed is 200%, then for each sample 2 nearest neighbors are selected and one sample is generated in the direction of each. The pseudo-code for this algorithm [2] is as follows:

Algorithm SMOTE(T, N, k)
Input: Number of minority class samples T; Amount of SMOTE N%; Number of nearest neighbors k
Output: (N/100) * T synthetic minority class samples
1. ( If N is less than 100%, randomize the minority class samples as only a random percent of them will be SMOTEd. )
2. if N < 100
3.   then Randomize the T minority class samples
4.   T = (N/100) * T
5.   N = 100
6. endif
7. N = (int)(N/100) ( The amount of SMOTE is assumed to be in integral multiples of 100. )
8. k = Number of nearest neighbors
9. numattrs = Number of attributes
10. Sample[][]: array for original minority class samples
11. newindex: keeps a count of number of synthetic samples generated, initialized to 0
12. Synthetic[][]: array for synthetic samples
( Compute k nearest neighbors for each minority class sample only. )
13. for i ← 1 to T
14.   Compute k nearest neighbors for i, and save the indices in the nnarray
15.   Populate(N, i, nnarray)
16. endfor

Populate(N, i, nnarray) ( Function to generate the synthetic samples. )
17. while N ≠ 0
18.   Choose a random number between 1 and k, call it nn. This step chooses one of the k nearest neighbors of i.
19.   for attr ← 1 to numattrs
20.     Compute: dif = Sample[nnarray[nn]][attr] − Sample[i][attr]
21.     Compute: gap = random number between 0 and 1
22.     Synthetic[newindex][attr] = Sample[i][attr] + gap * dif
23.   endfor
24.   newindex++
25.   N = N − 1
26. endwhile
27. return ( End of Populate. )
End of Pseudo-Code.

Example: suppose we have an original sample S = (6, 4) and a neighbor K = (4, 3). The generated sample F is

F = S + rand(0, 1) ∗ (K − S) = (6, 4) + rand(0, 1) ∗ (−2, −1)   (4)
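
A minimal NumPy sketch of this interpolation step (just the synthetic-sample generation, not the full algorithm) could look like this:

import numpy as np

rng = np.random.default_rng(42)

def smote_sample(s, k):
    """Generate one synthetic sample between minority sample s and neighbor k."""
    gap = rng.random()            # random number in [0, 1)
    return s + gap * (k - s)      # step 22 of the pseudo-code above

s = np.array([6.0, 4.0])
k = np.array([4.0, 3.0])
print(smote_sample(s, k))         # a point on the segment between S and K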

2.3 Transformers
Transformers are a type of neural network architecture widely used in natural language processing (NLP) tasks such as language translation. Transformers have been shown to be highly effective at NLP tasks, and many pre-trained transformer models, such as BERT and GPT, are widely used as the basis for fine-tuning on specific downstream tasks. A transformer is made of an encoder and a decoder: the encoder maps an input vector X to a continuous representation Z, and the decoder then maps Z to an output Y, i.e. (x1, ..., xn) ↦ (z1, ..., zn) ↦ (y1, ..., ym). Both the encoder and the decoder are composed of a stack of 6 layers; each encoder layer has two sub-layers, while each decoder layer has a third sub-layer. These sub-layers implement an attention mechanism, which is the most important part.
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. To classify a numeric dataset with transformers, we first need to convert the numeric features into text data and then use the transformers for text classification. One common approach is to convert the numeric data into text by binning the values into discrete categories or ranges (a common algorithm for this is BERT tokenization) and then using these categories or ranges as text features.
Suppose we want to map a certain phrase to another one, Hello ↦ World. We begin by encoding the message: first we tokenize it, meaning it is turned into a vector using an embedding layer. The purpose of the embedding layer is to create dense, low-dimensional vector representations of the input tokens, which capture the semantic relationships between them. Once this is done, we move on to a positional layer, which adds a position vector to the token vector to preserve position; this is needed because transformers on their own do not preserve order. To calculate position, a clever sine-based method is used, with this formula [1]:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))   (5)

PE is essentially the frequency of the output word; the exponent in i is used to handle the problem of two words having the same frequency.
Once this is done, the result is fed to the attention head. Attention is a mechanism that allows the model to focus on different parts of the input sequence during encoding and decoding. It is calculated from three inputs (query, keys, values), similarly to databases, where we use the query against the keys to find the value. It is computed using the formula [1]:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V   (6)
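
As an illustration, here is a minimal NumPy sketch of this scaled dot-product attention (the shapes and values are invented):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (6)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # compatibility of each query with each key
    return softmax(scores) @ V        # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))           # 3 queries of dimension d_k = 4
K = rng.normal(size=(5, 4))           # 5 keys
V = rng.normal(size=(5, 8))           # 5 values of dimension 8
print(attention(Q, K, V).shape)       # (3, 8)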
Once this is done, the result is fed to the decoder, but not directly: first a special start token is fed to the decoder and goes through the embedding and positional layers. Then, in the attention part, the outputs of the encoder are fed as queries and keys for the new attention, with this start token as the value. In the end everything is fed into a simple feed-forward neural network, a final residual layer, and a softmax. The softmax assigns a probability to each possible output, and the one with the highest probability is emitted. The drawing below simplifies it a bit.

Figure 2: Illustration of a transformer

2.4 Ensemble methods

In ensemble methods we combine multiple learners. These learners don't have to be homogeneous (one can be kNN, another linear regression, ...) and each learner is trained separately. Once training is done, plugging an input X into each learner yields a separate output Y; say we have 3 learners, each producing their own output: X → {Y1, Y2, Y3}. If we are doing classification, we then vote to determine the global output Y; in the case of regression we take the mean. There are 3 common approaches in ensemble methods (a minimal code sketch of bagging and boosting follows the list):

• Stacking. Stacking is a different paradigm: the point of stacking is to explore a space of different models for the same problem. The idea is that you can attack a learning problem with different types of models, each capable of learning some part of the problem but not its whole space. So you build multiple different learners and use them to produce an intermediate prediction, one prediction per learned model; then you add a new model which learns the same target from those intermediate predictions. Whereas in bagging multiple models are trained independently on different subsets of the training data, in stacking the output of one layer of models becomes the input of the next, creating a sequential learning process. Neural networks can be thought of as a type of stacking model.

• Bagging (Bootstrap aggregating) refers to splitting the training data across subsets called "bags" and distributing them across the learners. The samples for each bag are chosen at random with replacement, so the same data point may be duplicated across bags. An example of such an algorithm is Random Forest, already seen in the classroom, where different instances of the data are distributed across multiple weak learners.

Figure 3: Bagging illustration

• Boosting. Boosting is similar to bagging; the main difference is that in bagging the manner in which samples are chosen for each bag is completely independent of the choices for the other bags (it is essentially done in a parallel manner), whereas in boosting the data samples that were badly classified by the first learner are the most likely to be chosen for the second one: a weight is assigned to the badly classified data. Some popular boosting algorithms include AdaBoost, XGBoost and CatBoost.

Figure 4: Boosting illustration
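
As mentioned above, here is a minimal scikit-learn sketch contrasting the two paradigms (the dataset and parameters are invented for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# a synthetic 90/10 imbalanced classification problem
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# bagging: each tree sees an independent bootstrap sample, trained in parallel
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# boosting: each new learner focuses on samples the previous ones misclassified
boost = AdaBoostClassifier(n_estimators=50, random_state=0)

for model in (bag, boost):
    print(type(model).__name__, model.fit(X, y).score(X, y))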

For the sake of this TP, we will focus on the two ensemble methods that will be used later on: CatBoost and Extremely Randomized Trees.

2.4.1 CatBoost
CatBoost is based on gradient boosting, an algorithm similar to the one used in AdaBoost. In AdaBoost we generate "stumps", which are trees with only two leaves; these trees have different importance, which plays a role in the voting during classification. Each stump builds on the errors of the previous stump, and each stump contains decisions regarding one feature at a time. In gradient boosting the same is done but with a different catch: the resulting trees aren't stumps but larger trees of bounded size, and these trees can differ in size and in importance.
CatBoost is short for Categorical Boosting and was developed specifically to handle categorical data. The algorithm starts with Ordered Target Encoding (OTE). Traditional encoding methods such as one-hot encoding carry a risk of overfitting and of badly handling rare categories, but this is not the case with ordered target encoding: it handles the rows sequentially, one at a time, using the formula

OTE = (OptionCount + 0.05) / (n + 1)   (7)

where OptionCount is the number of times the observed categorical value has previously appeared together with class 1, 0.05 is a guess factor (usually 0.05, used instead of the mean), and n is the number of times the categorical value has previously appeared.

Car-Color          Price (100k$)   Sold
Blue  ↦ 0.05       1.77            1
Blue  ↦ 0.525      1.32            0
Green ↦ 0.05       1.81            1
Blue  ↦ 0.35       1.56            0
Green ↦ 0.525      1.64            1
Red   ↦ 0.05       1.64            1

In this example we apply the OTE to each row:

• Row 1: (0 + 0.05)/(0 + 1) = 0.05. Blue didn't appear before, so both n and OptionCount are 0.

• Row 2: (1 + 0.05)/(1 + 1) = 0.525. Blue has appeared once before, so n = 1; it appeared with class 1, so OptionCount = 1.

• Row 3: (0 + 0.05)/(0 + 1) = 0.05. Green didn't appear before.

• Row 4: (1 + 0.05)/(2 + 1) = 0.35. Blue appeared twice, so n = 2; it appeared once with class 1, so OptionCount = 1.

• Row 5: (1 + 0.05)/(1 + 1) = 0.525. Green appeared once before, with class 1.

• Row 6: (0 + 0.05)/(0 + 1) = 0.05. Red didn't appear before.

If there are only 2 categories, they are simply replaced by 1s and 0s. A small code sketch of this encoding follows.
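
Below is a minimal Python sketch reproducing this ordered target encoding on the table above (a simplified illustration; CatBoost's actual implementation also works over random permutations of the rows):

def ordered_target_encode(categories, targets, prior=0.05):
    """Encode each row using only the rows seen before it (Eq. 7)."""
    seen = {}                                   # category -> (OptionCount, n)
    encoded = []
    for cat, y in zip(categories, targets):
        option_count, n = seen.get(cat, (0, 0))
        encoded.append((option_count + prior) / (n + 1))
        seen[cat] = (option_count + y, n + 1)   # update counts after encoding
    return encoded

colors = ["Blue", "Blue", "Green", "Blue", "Green", "Red"]
sold   = [1, 0, 1, 0, 1, 1]
print(ordered_target_encode(colors, sold))
# [0.05, 0.525, 0.05, 0.35, 0.525, 0.05]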


From here CatBoost takes a similar path to gradient boosting,but with few modifications

• In CatBoost, base predictors are oblivious decision trees, also called decision tables. The term oblivious means that the same splitting criterion is used across an entire level of the tree. Such trees are balanced, less prone to overfitting, and allow speeding up execution at testing time significantly. [3]

• CatBoost applies L1 and L2 regularization (as in Lasso and Ridge regression) to the leaf weights of its decision trees to avoid overfitting.

• When making predictions on new data with missing values, CatBoost uses the same approach as during training, treating missing values as a separate category. Specifically, for each feature with missing values, CatBoost creates two new categories: one for the missing values and one for the non-missing values. Then, for each tree in the boosting process, the model calculates the probability of the target variable for both categories and uses these probabilities to make the final prediction.

• For classification, CatBoost uses a dynamic learning rate that changes with each iteration. At the beginning of each boosting round, CatBoost selects a subset of the training instances based on their gradients; this subset is called the "one-side sampled set". Its size depends on the gradients of the instances in the training data: instances with high gradients are more likely to be included.

• Unlike classic gradient boosting, CatBoost uses random permutations in a technique called "ordered boosting": in each boosting round the training examples are randomly permuted, and the statistics for each example are computed using only the examples that precede it in the permutation. This helps to prevent the model from overfitting to any one ordering of the data. [3]
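
A minimal usage sketch (the toy data mirrors the car example above; the iteration count and other parameters are illustrative choices):

from catboost import CatBoostClassifier

# hypothetical toy data mirroring the car example above
X = [["Blue", 1.77], ["Blue", 1.32], ["Green", 1.81],
     ["Blue", 1.56], ["Green", 1.64], ["Red", 1.64]]
y = [1, 0, 1, 0, 1, 1]

model = CatBoostClassifier(iterations=100, verbose=False)
model.fit(X, y, cat_features=[0])    # column 0 is categorical: CatBoost target-encodes it
print(model.predict([["Red", 1.50]]))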

2.4.2 Extremely Randomized Trees


Extremely Randomized Trees (also known as Extra Trees) is an ensemble learning algorithm
that is based on decision trees. Like Random Forest, it builds a large number of decision
trees and aggregates their predictions to make a final prediction. However, it differs from
Random Forest in two key ways :
• Random feature selection: In Random Forest, each tree is built using a random
subset of the features. However, in Extremely Randomized Trees, the splitting thresholds
are selected randomly from a uniform distribution over the feature range. This means
that the splitting criteria of each tree are not only based on a random subset of the
features but also on random thresholds within the selected features.

• Randomized node optimization: In Extremely Randomized Trees, the decision tree


splits are chosen randomly among a set of candidate splits. In other words, rather than
choosing the best split based on a metric such as information gain, it selects a split at
random from a set of candidate splits.
Figure 5: ET illustration

One of the main reasons Extra Trees perform well on imbalanced data is that they tend to create decision boundaries that are less biased towards the majority class, compared to other tree-based ensemble methods like Random Forest. This is because Extra Trees randomly select the splitting thresholds for each feature, rather than finding the optimal threshold based on a criterion like Gini impurity or information gain. The Gini index is a measure of impurity often used in decision trees and random forests; the formula for Gini impurity is

Gini = 1 − Σᵢ₌₁ᵏ pᵢ²

where p₁, p₂, ..., pₖ are the proportions of each class in a given node.
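
As a small illustration, a minimal sketch computing the Gini impurity of a node:

import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(gini([0, 0, 0, 1]))   # 0.375: a fairly impure node
print(gini([0, 0, 0, 0]))   # 0.0: a pure node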

3 Data set: Credit Card
Sticking to the same example mentioned previously: it is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced; the positive class (frauds) accounts for 0.172% of all transactions. It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data aren't provided. Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Figure 6: Credit card fraud detection result

The dataset contains 31 numerical variables; a quick R script using the skimr library shows the following info about the dataset:

Data Summary
                          Values
Name                      data
Number of rows            284807
Number of columns         31
_______________________
Column type frequency:
  numeric                 31
________________________
Group variables           None

Variable type: numeric
   skim_variable  n_missing  complete_rate  mean      sd      p0     p25     p50
1  Time           0          1              9.48e+4   47488.  0      54202.  84692
2  V1             0          1              1.17e-15  1.96    -56.4  -0.920  0
.................................................
30 Amount         0          1              8.83e+1   250.    0      5.6     22
31 Class          0          1              1.73e-3   0.0415  0      0       0

The dataset can be found here.

4 Testing ET and CatBoost on the Credit Card dataset

For this evaluation we will create a Python program that performs classification on the Credit Card dataset using both CatBoost and ET. The code will use Python's pandas and sklearn. In the end we will evaluate using the F1-score, the balanced accuracy and finally the ROC-AUC.

4.1 Notes and Difficulties

- CatBoost and ET both have built-in multithreading support for parallelization through the extra parameters thread_count and n_jobs respectively; this is why we didn't need to use Spark for testing.

- The Extremely Randomized Trees package from sklearn doesn't have built-in support for Spark; only the regular random forests do.

- Although the official documentation states that Spark can be used with CatBoost, via a package called catboost_spark, we tried installing it with pip and pip3 in Google Colab and it wasn't recognized.

- Thankfully, both CatBoost and ET have built-in support for parallelization, as stated before.

4.2 The test

CatBoost and ET both take the extra parameters thread_count and n_jobs respectively; we will set them to 4 for the sake of parallelization.
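
A sketch of our test program (the file name, split ratio and model parameters shown here are illustrative choices):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score, roc_auc_score, balanced_accuracy_score

df = pd.read_csv("creditcard.csv")
X, y = df.drop(columns="Class"), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

models = {
    "ET": ExtraTreesClassifier(n_estimators=100, n_jobs=4, random_state=42),
    "CatBoost": CatBoostClassifier(iterations=100, thread_count=4, verbose=False),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    print(name,
          "F1:", f1_score(y_test, y_pred),
          "ROC-AUC:", roc_auc_score(y_test, y_prob),
          "Balanced accuracy:", balanced_accuracy_score(y_test, y_pred))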

After 100 iterations on Google Colab, and with 4 threads, we obtain the results in the table below:

Method     Time (s)                  F1-score   ROC-AUC   Balanced Accuracy
ET         16.06                     0.8457     0.9526    0.8775
CatBoost   5.09 (CPU) / 2.11 (GPU)   0.8636     0.9821    0.8877
From the table shown above we find that CatBoost simply outperforms ET, in both time and classification accuracy. This is due to several reasons:

- CatBoost supports GPU acceleration, unlike ET.

- CatBoost includes several algorithmic improvements over ET, such as ordered boosting, permutation-driven feature importance, and gradient-based sampling.

- CatBoost tends to perform better than Extremely Randomized Trees on imbalanced datasets, due to its handling of class weights and its gradient-based boosting algorithm.
To further illustrate the difference in accuracy, we show the ROC curves in the graph below.

Figure 7: ROC curve comparison between ET and CatBoost

An important thing to keep in mind is that the Credit Card dataset doesn't contain missing values or categorical data; ET doesn't handle these issues while CatBoost contains built-in algorithms for them, which is what makes ET's fast execution time possible here (on a dataset such as KDD99 this would not hold).
Using 500 iterations, we find a smaller difference in the ROC curves between the two:

Figure 8: ROC curve comparison between ET and CatBoost after 500 iterations

though the time difference remains large, and CatBoost's accuracy remains better even after 1000 iterations.

5 Testing ET and CatBoost on the Credit Card dataset After Balancing
For balancing the dataset we will be using the oversampling technique discussed previously, SMOTE, as it is the most popular and relatively safest method for balancing. Applying SMOTE gives each class 284,315 samples. To use it we need the additional code below:

# imblearn provides the SMOTE implementation used here
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
# resample the training set only; the test set stays untouched
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
.....
et.fit(X_train_smote, y_train_smote)
.....
cb.fit(X_train_smote, y_train_smote)

After applying SMOTE and retraining the ensemble methods, we get the following results:

Method     Time (s)                   F1-score   ROC-AUC   Balanced Accuracy
ET         38.59                      0.8723     0.9744    0.9183
CatBoost   11.55 (CPU) / 4.59 (GPU)   0.6908     0.9825    0.9382

Figure 9: ROC curve comparison between ET and CatBoost after balancing

From analysing and comparing the results above to the previous ones, it's apparent that the execution time has more than doubled for both algorithms, which is entirely normal given the amount of extra data added to the minority class, going from only 492 samples to 284,315 samples. As for the performance, it has significantly improved for both algorithms at 100 iterations, which means that for this dataset SMOTE does have a positive impact on the performance of the models, with CatBoost remaining superior in terms of performance and execution time. However, the difference in performance has shrunk compared to before, meaning that on balanced datasets ET and CatBoost have relatively similar performances.
However, it is worth noting that the F1-score has significantly dropped for CatBoost, going from 0.8636 to 0.6908. This could be attributed to the fact that the minority class is extremely small (in our case only 0.17%) while the original dataset has a large number of features (in our case 28 and more). Thankfully we have the other measurements, or else we would be misled.

Conclusion
In conclusion, dealing with imbalanced datasets can be a challenging task
for machine learning algorithms. In such scenarios, techniques such as
SMOTE can be used to balance the dataset and improve the performance
of classification algorithms. When working with balanced datasets, both
Extra Trees and CatBoost can achieve high accuracy and perform well on
different metrics such as F1 score, ROC AUC, and balanced accuracy.
However, the performance of each algorithm may vary depending on the
specific dataset and problem being addressed. Overall, it is recommended
to try different algorithms and techniques to achieve the best possible
results for imbalanced datasets.

References
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017).

[2] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 (2002) 321-357.

[3] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: unbiased boosting with categorical features. Moscow Institute of Physics and Technology, Dolgoprudny, Russia.