Catboost ET Comparaison
Realized By :
- Mohammed Abderrahmane
Bensalem
Year : 2022-2023
Contents
1 Problem Description
Introduction
Classification problems are common in machine learning. A classification algorithm predicts
a class label from the input data (the predictors), where the target variable is categorical.
When designing a predictive model, it is usually assumed that the classes in the training
data are balanced, that is, that each class has roughly the same number of instances. Most
machine learning algorithms rely on this assumption, which causes problems with highly
imbalanced data: performance and predictions for the minority classes are poor. Yet
real-world data tends to be imbalanced.
1 Problem Description
The imbalanced classes problem occurs when a dataset has a highly skewed distribution of
class labels: one class has an overwhelming number of members compared to the others. This
biases predictions and degrades performance on the minority classes. For example, a commonly
used dataset for fraud detection is the Credit Card Fraud Detection dataset, which contains
transactions made by credit cards in September 2013 by European cardholders. The dataset
contains 284,807 transactions, of which only 492 (0.17%) are fraudulent.
On such a dataset, a model can end up predicting that 100% of transactions are non-fraudulent.
Following the classic evaluation approach, the accuracy computed from the confusion matrix
(with class shares in percent) is
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{0 + 99.83}{0 + 99.83 + 0 + 0.17} = 99.83\% $$
The model detects no fraud at all, yet its accuracy is 99.83%, which is misleading. Before
tackling the problem, there are a few points to consider:
- Imbalanced data is less of a problem than the metric used to evaluate the model: as
discussed above, an error rate of 0.17% sounds good on paper.
- The model may overfit the minority class: if the minority class has only 100 samples,
the model can easily memorize them and then generalize poorly.
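The misleading accuracy can be reproduced with a few lines of arithmetic. This is a minimal sketch using the dataset counts quoted above; the "model" is a hypothetical classifier that labels every transaction as non-fraudulent.

```python
# Counts quoted above for the credit card dataset.
total = 284_807
fraud = 492
legit = total - fraud

# Hypothetical trivial model: every transaction is predicted non-fraudulent.
tp, fn = 0, fraud   # every fraud is missed
tn, fp = legit, 0   # every legitimate transaction is classified correctly

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)  # share of frauds that are actually caught

print(f"accuracy        = {accuracy:.2%}")
print(f"recall on fraud = {recall:.0%}")
```

The accuracy stays very high while the recall on the fraud class is zero, which is why accuracy alone is not a suitable metric here.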
2 Algorithms to handle the imbalanced classes problem
This section is divided into three main parts: evaluation metrics, since the classic
accuracy test is misleading; resampling methods, which focus on balancing the dataset; and
ensemble methods. In addition, we look at a more recent technique, transformers.
2.1.1 F1-score
The F1-score was already seen in the course. By combining precision and recall, it provides
a balanced measure of the model's performance, which is useful on imbalanced datasets. It
penalizes a model that predicts the majority class too often and fails to identify the
minority class. In this way, the F1-score helps evaluate the model's ability to identify the
minority class while balancing precision and recall. It is calculated using the following
formula:
$$ F_1 = 2 \cdot \frac{\text{recall} \cdot \text{precision}}{\text{recall} + \text{precision}} \tag{1} $$
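As an illustration, the F1-score can be computed directly from confusion-matrix counts. This is a small sketch; the helper name `f1_score` and the example counts are hypothetical, and returning 0 when precision or recall is undefined is a convention chosen here.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion-matrix counts; returns 0.0 when it is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Hypothetical counts: 80 frauds caught, 20 missed, 10 false alarms.
print(f1_score(tp=80, fp=10, fn=20))

# A model that never predicts fraud catches nothing, so its F1 is 0
# even though its accuracy would be very high.
print(f1_score(tp=0, fp=0, fn=492))
```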
2.1.2 ROC-AUC
The ROC (Receiver Operating Characteristic) curve is a graphical representation of the
performance of a binary classification model at different classification thresholds. It is created
by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various
threshold settings.
On imbalanced datasets, where the positive class is a minority, the ROC curve is especially
useful because it lets the user visualize how well the model performs across different
threshold settings. This matters because the choice of threshold can have a significant
impact on the model's performance. For example, if the goal is to maximize the TPR while
keeping the FPR low, the ROC curve can help identify the threshold that achieves this
trade-off.
The AUC is the area under the ROC curve. Its value ranges from 0 to 1, and the closer it is
to 1, the better the model separates the two classes. It can be calculated using the
following formula:
$$ AUC = \int_0^1 \frac{TP}{TP + FN} \, d(FPR) = \int_0^1 TPR \, d(FPR) \tag{2} $$
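In practice the integral is approximated from a finite set of thresholds. The sketch below uses the trapezoidal rule over the ROC points obtained by sweeping the threshold through the sorted scores. The function name `roc_auc` and the toy labels and scores are assumptions made for illustration.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve, integrated with the trapezoidal rule."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    # Sweep the threshold from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    area = 0.0
    i = 0
    while i < len(order):
        s = scores[order[i]]
        # Tied scores are processed together so they form one ROC point.
        while i < len(order) and scores[order[i]] == s:
            if y_true[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / pos, fp / neg
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
    return area

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))
```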
2.1.3 Balanced accuracy
The balanced accuracy is the average of the true positive rates of the two classes:
$$ \text{Balanced accuracy} = \frac{TPR_1 + TPR_0}{2} \tag{3} $$
where TPR1 is the true positive rate for the positive class (i.e., the class of interest),
and TPR0 is the true positive rate for the negative class.
The balanced accuracy score ranges from 0 to 1, with a higher score indicating better
performance. A score of 0.5 indicates a random classifier, and a score of 1 indicates perfect
classification performance.
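A short sketch shows how balanced accuracy reacts to a model that ignores the minority class. It assumes the usual definition of balanced accuracy as the mean of the per-class true positive rates; the function name and the second set of counts are hypothetical.

```python
def balanced_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Mean of the true positive rates of the positive and negative class."""
    tpr1 = tp / (tp + fn)  # recall on the positive class
    tpr0 = tn / (tn + fp)  # recall on the negative class
    return (tpr1 + tpr0) / 2

# "Always non-fraud" model on the credit card counts: no better than random.
print(balanced_accuracy(tp=0, tn=284_315, fp=0, fn=492))

# Hypothetical model that catches 80% of the positives.
print(balanced_accuracy(tp=80, tn=990, fp=10, fn=20))
```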
2.2 Resampling
Clear, descriptive metrics alone are not enough to handle imbalanced classes. The first,
naive approach is to modify the dataset itself, either by decreasing the sample size of the
majority class (undersampling) or by increasing the sample size of the minority class
(oversampling).
Algorithm SMOTE(T, N, k)
Input: Number of minority class samples T; Amount of SMOTE N%; Number of nearest
neighbors k
Output: (N/100) * T synthetic minority class samples
( If N is less than 100%, randomize the minority class samples as only a random
  percent of them will be SMOTEd. )
if N < 100
    then Randomize the T minority class samples
    T = (N/100) * T
    N = 100
endif
N = (int)(N/100) ( The amount of SMOTE is assumed to be in integral multiples of 100. )
k = Number of nearest neighbors
numattrs = Number of attributes
Sample[ ][ ]: array for original minority class samples
newindex: keeps a count of number of synthetic samples generated, initialized to 0
Synthetic[ ][ ]: array for synthetic samples
( Compute k nearest neighbors for each minority class sample only. )
for i ← 1 to T
    Compute k nearest neighbors for i, and save the indices in the nnarray
    Populate(N, i, nnarray)
endfor

Populate(N, i, nnarray) ( Function to generate the synthetic samples. )
while N ≠ 0
    Choose a random number between 1 and k, call it nn. This step chooses one of
    the k nearest neighbors of i.
    for attr ← 1 to numattrs
        Compute: dif = Sample[nnarray[nn]][attr] - Sample[i][attr]
        Compute: gap = random number between 0 and 1
        Synthetic[newindex][attr] = Sample[i][attr] + gap * dif
    endfor
    newindex++
    N = N - 1
endwhile
return ( End of Populate. )
End of Pseudo-Code.
Example: suppose we have an original sample S = (6, 4) and a neighbor K = (4, 3). The
generated sample F is
$$ F = S + \text{rand}(0, 1) \cdot (K - S) = (6, 4) + \text{rand}(0, 1) \cdot (-2, -1) \tag{4} $$
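The interpolation step of the example can be sketched in a few lines. This is only the generation of one synthetic point, not a full SMOTE implementation (no nearest-neighbor search). It follows the example formula with a single random gap shared by all attributes; the function name `smote_point` and the seed are arbitrary choices.

```python
import random

def smote_point(sample, neighbor, rng):
    """Return one synthetic point on the segment between sample and neighbor."""
    gap = rng.random()  # one random gap in [0, 1), shared by all attributes
    return [s + gap * (k - s) for s, k in zip(sample, neighbor)]

rng = random.Random(42)  # fixed seed, chosen arbitrarily for reproducibility
S, K = (6, 4), (4, 3)
F = smote_point(S, K, rng)
print(F)
```

Whatever the random gap, the synthetic point stays between S and K, so new minority samples are created inside the region already occupied by the minority class.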
2.3 Transformers
Transformers are a neural network architecture widely used in natural language processing
(NLP) tasks such as language translation. They have been shown to be highly effective at NLP
tasks, and many pre-trained transformer models, such as BERT and GPT, are widely used as the
basis for fine-tuning on specific downstream tasks. The original transformer [1] is made of
an encoder and a decoder. The encoder maps an input sequence X to a continuous
representation Z, and the decoder then maps Z to an output sequence Y:
$$ (x_1, ..., x_n) \mapsto (z_1, ..., z_n) \mapsto (y_1, ..., y_m) $$
The encoder and the decoder are each composed of a stack of 6 layers. Each encoder layer has
two sub-layers, while each decoder layer has a third sub-layer. These sub-layers implement
the attention mechanism, which is the most important part of the architecture.
An attention function can be described as mapping a query and a set of key-value pairs to
an output, where the query, keys, values, and output are all vectors. The output is computed
as a weighted sum of the values, where the weight assigned to each value is computed by
a compatibility function of the query with the corresponding key.
To classify a numeric dataset with transformers, one option is to first convert the numeric
features into text and then use the transformer for text classification. A common approach
is to bin the values into discrete categories or ranges, write these categories as text
features, and tokenize the resulting text, for example with the BERT tokenizer.
Suppose we want to map one phrase to another, for example Hello ↦ World. We begin by
encoding the message: it is first tokenized, and each token is turned into a vector by an
embedding layer. The purpose of the embedding layer is to create dense, low-dimensional
vector representations of the input tokens that capture the semantic relationships between
them. The embeddings then pass through a positional encoding layer, which adds a position
vector to each token vector. This step is needed because the attention mechanism on its own
does not preserve the order of the tokens. The position vector is computed with sine
functions, using the following formula [1]:
$$ PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{5} $$
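Here pos is the position of the token in the sequence, i indexes the dimensions of the
encoding, and dmodel is the size of the embedding vectors. Each dimension corresponds to a
sinusoid of a different frequency, controlled by the exponent 2i/dmodel, so that two
different positions do not receive the same encoding. In [1], the odd dimensions 2i + 1 use
the cosine of the same argument.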
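The positional encoding can be sketched as follows. Even dimensions use the sine formula (5); odd dimensions use the cosine of the same argument, as in [1]. The function name and the tiny `d_model=4` are illustrative assumptions, and `d_model` is assumed to be even.

```python
import math

def positional_encoding(pos: int, d_model: int) -> list:
    """Sinusoidal position vector for one position (d_model assumed even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

# Each position receives a different vector, which is added to the token embedding.
for pos in range(3):
    print(pos, [round(v, 3) for v in positional_encoding(pos, d_model=4)])
```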
The result is then fed to the attention heads. Attention is a mechanism that allows the
model to focus on different parts of the input sequence during encoding and decoding. It is
computed from three inputs (queries, keys and values). The idea is similar to a database
lookup, where a query is matched against keys to retrieve the corresponding values. Here the
match is soft, and the output is computed with the formula [1]:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V \tag{6} $$
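A minimal sketch of scaled dot-product attention, written in plain Python for readability. It follows the form given in [1], where a softmax turns the scaled query-key scores into weights that sum to 1 before they are applied to the values. The toy matrices are made up for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax of a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Compatibility of the query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]                 # one query
K = [[1.0, 0.0], [0.0, 1.0]]     # two keys
V = [[10.0, 0.0], [0.0, 10.0]]   # two values
print(attention(Q, K, V))
```

The query is closer to the first key, so the output leans toward the first value without ignoring the second one.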
The encoder output is then passed to the decoder, but not directly. First, a special start
token is fed to the decoder and goes through the embedding and positional encoding layers.
Then, in the encoder-decoder attention sub-layer, the outputs of the encoder serve as keys
and values, and the decoder's representation of the start token serves as the query.
Finally, the result goes through a simple feed-forward neural network, a final linear layer
and a softmax. The softmax assigns a probability to each possible output token, and the one
with the highest probability is emitted. The drawing below simplifies this a bit.
Figure 2: Illustration of a transformer
• Bagging (bootstrap aggregating) splits the training data into subsets called "bags" and
distributes them across the learners. The samples of each bag are drawn at random, so the
same sample can appear more than once in a bag. An example of such an algorithm is random
forest, already seen in class, where different samples of the data are distributed across
multiple weak learners.
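The random drawing of bags can be illustrated with a short sketch. It assumes the usual bootstrap scheme, where each bag has the same size as the training set and is drawn with replacement; the function name `make_bags` and the sizes are hypothetical.

```python
import random

def make_bags(n_samples: int, n_bags: int, rng) -> list:
    """Draw n_bags bootstrap samples of row indices, with replacement."""
    return [[rng.randrange(n_samples) for _ in range(n_samples)]
            for _ in range(n_bags)]

rng = random.Random(0)  # fixed seed, chosen arbitrarily
bags = make_bags(n_samples=10, n_bags=3, rng=rng)
for bag in bags:
    # Duplicates are expected: fewer distinct indices than drawn indices.
    print(sorted(bag), "distinct rows:", len(set(bag)))
```

Each learner would then be trained on the rows listed in its own bag.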
For this lab assignment, we focus on two ensemble methods that will be used later on:
CatBoost and Extremely Randomized Trees.
2.4.1 CatBoost
CatBoost is based on gradient boosting, an algorithm similar to the one used in AdaBoost.
AdaBoost generates stumps, which are trees with only two leaves. Each stump has a different
importance, which plays a role in the vote during classification. Each stump builds on the
errors of the previous one, and each stump makes a decision on one feature at a time.
Gradient boosting works the same way, with one difference: the resulting trees are not
stumps but larger trees of bounded size. These trees can differ in size and in importance.
CatBoost is short for Categorical Boosting and was developed specifically to handle
categorical data. The algorithm starts with ordered target encoding.
Traditional encoding methods such as one-hot encoding risk overfitting and handle rare
categories badly, which ordered target encoding is designed to avoid. It processes the rows
sequentially, one at a time, using the formula
$$ OTE = \frac{OptionCount + 0.05}{n + 1} \tag{7} $$
OptionCount is the number of previous rows in which the observed categorical value appeared
together with class 1, 0.05 is the prior (an initial guess, usually 0.05, used instead of
the mean), and n is the number of previous rows in which the categorical value appeared.
• Row 1: (0 + 0.05)/(0 + 1) = 0.05. Blue has not appeared before, so both n and OptionCount
are 0.
• Row 2: (1 + 0.05)/(1 + 1) = 0.525. Blue has appeared once before, so n = 1; that row had
class 1, so OptionCount = 1.
• Row 3: (0 + 0.05)/(0 + 1) = 0.05
• Row 4: (1 + 0.05)/(2 + 1) = 0.35. Blue has appeared twice before, so n = 2; only one of
those rows had class 1, so OptionCount = 1.
• Row 5: (1 + 0.05)/(1 + 1) = 0.525
• Row 6: (0 + 0.05)/(0 + 1) = 0.05
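The sequential encoding can be sketched as follows. This is a simplified illustration of formula (7), not CatBoost's actual implementation, which also relies on random permutations of the rows. The colors and class labels below are hypothetical data.

```python
from collections import defaultdict

def ordered_target_encoding(categories, targets, prior=0.05):
    """Encode each row using only the rows that come before it."""
    seen = defaultdict(int)      # n: earlier rows with this category
    seen_pos = defaultdict(int)  # OptionCount: earlier rows with it and class 1
    encoded = []
    for cat, y in zip(categories, targets):
        encoded.append((seen_pos[cat] + prior) / (seen[cat] + 1))
        # Counters are updated only after the current row has been encoded,
        # so a row never sees its own target.
        seen[cat] += 1
        seen_pos[cat] += y
    return encoded

colors = ["blue", "blue", "red", "blue", "red", "green"]  # hypothetical feature
classes = [1, 0, 1, 1, 0, 1]                              # hypothetical targets
for row, value in enumerate(ordered_target_encoding(colors, classes), start=1):
    print(f"Row {row}: {value:.3f}")
```

Because each row is encoded from earlier rows only, the encoding of a row does not leak that row's own class label.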
• In CatBoost, the base predictors are oblivious decision trees, also called decision
tables. The term oblivious means that the same splitting criterion is used across an entire
level of the tree. Such trees are balanced, less prone to overfitting, and allow speeding up
execution at testing time significantly. [3]
• CatBoost regularizes the leaf values of its decision trees with an L2 (ridge-style)
penalty, exposed as the l2_leaf_reg parameter, to avoid overfitting.
• When making predictions on new data with missing values, CatBoost uses the same approach
as during training. For numerical features, missing values are by default treated as smaller
than all observed values of the feature (nan_mode set to Min), so a split can separate the
missing values from the non-missing ones. Each tree in the boosting process then routes rows
with missing values through these splits like any other row.
• For classification, CatBoost can choose the learning rate automatically from the dataset
size and the number of iterations when it is not set explicitly. It can also subsample the
training instances at each boosting round according to their gradients, so that instances
with high gradients are more likely to be included. This gradient-based sampling resembles
the "one-side sampling" of LightGBM; in CatBoost it is available as Minimum Variance
Sampling (MVS).
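To make the notion of an oblivious tree concrete, the sketch below evaluates one by treating it as a decision table: since every level applies the same test, the outcomes of the tests form a bit pattern that directly indexes the leaf. The splits and leaf values are hypothetical, and this is an illustration of the idea rather than CatBoost's internal code.

```python
def oblivious_tree_predict(x, splits, leaf_values):
    """Predict with an oblivious tree (decision table).

    splits: one (feature_index, threshold) pair per level of the tree.
    leaf_values: 2 ** len(splits) values, indexed by the bit pattern of
    the per-level comparisons.
    """
    index = 0
    for feature, threshold in splits:
        bit = 1 if x[feature] > threshold else 0
        index = (index << 1) | bit
    return leaf_values[index]

# Hypothetical depth-2 tree: level 1 tests feature 0, level 2 tests feature 1.
splits = [(0, 0.5), (1, 2.0)]
leaf_values = [-1.0, -0.2, 0.3, 1.5]

print(oblivious_tree_predict([0.9, 1.0], splits, leaf_values))
print(oblivious_tree_predict([0.1, 3.0], splits, leaf_values))
```

No branching on the tree structure is needed at prediction time, which is one reason such trees are fast to evaluate.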
Figure 5: ET illustration
tree-based ensemble methods like Random Forest. This is because Extra Trees selects the
splitting threshold of each feature at random, rather than searching for the optimal
threshold according to a criterion like Gini impurity or information gain. The Gini index is
a measure of impurity often used in decision trees and random forests. The formula for Gini
impurity is
$$ Gini = 1 - \sum_{i=1}^{k} p_i^2 $$
where p1, p2, ..., pk are the proportions of each class in a given node.
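The two ideas in this paragraph can be sketched together: the Gini impurity of a node, and a split whose threshold is drawn at random instead of being optimized. This is a simplified illustration, not scikit-learn's implementation; the function names and the toy feature values are assumptions.

```python
import random
from collections import Counter

def gini(labels) -> float:
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def random_split(values, labels, rng):
    """Extra-Trees-style split: threshold drawn uniformly in [min, max]."""
    threshold = rng.uniform(min(values), max(values))
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    return threshold, left, right

print(gini([0, 0, 1, 1]))  # evenly mixed node
print(gini([1, 1, 1]))     # pure node

rng = random.Random(1)  # fixed seed, chosen arbitrarily
values = [0.2, 0.4, 1.5, 3.1]  # hypothetical feature values
labels = [0, 0, 1, 1]
threshold, left, right = random_split(values, labels, rng)
print(round(threshold, 3), left, right)
```

Drawing the threshold costs a single random number, whereas an optimized split has to evaluate the impurity at every candidate threshold.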
3 Data set-Credit Card
We stick to the example mentioned previously. It is important that credit card companies
are able to recognize fraudulent credit card transactions so that customers are not charged
for items that they did not purchase. The dataset contains transactions made by credit cards
in September 2013 by European cardholders. It presents transactions that occurred over two
days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the
positive class (frauds) accounts for 0.172% of all transactions. It contains only numerical
input variables, which are the result of a PCA transformation. Unfortunately, due to
confidentiality issues, the original features and more background information about the data
are not provided. Features V1, V2, ..., V28 are the principal components obtained with PCA;
the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature
'Time' contains the seconds elapsed between each transaction and the first transaction in
the dataset. Feature 'Amount' is the transaction amount; this feature can be used for
example-dependent cost-sensitive learning. Feature 'Class' is the response variable and
takes value 1 in case of fraud and 0 otherwise.
The dataset contains 31 numerical variables. A quick R script using the skimr library shows
the following information about the dataset:
Data Summary
Values
Name data
Number of rows 284807
Number of columns 31
_______________________
Column type frequency:
numeric 31
________________________
Group variables None
Variable type: numeric
   skim_variable n_missing complete_rate     mean       sd     p0     p25    p50
 1 Time                  0             1  9.48e+4   47488.    0    54202.  84692
 2 V1                    0             1 1.17e-15     1.96  -56.4  -0.920      0
.................................................
30 Amount                0             1  8.83e+1     250.    0      5.6      22
31 Class                 0             1  1.73e-3   0.0415    0      0         0
- The Extremely Randomized Trees implementation in sklearn has no built-in support for
Spark; only regular random forests do.
- Although the official documentation states that Spark can be used with CatBoost through a
package called catboost_spark, we tried installing it with pip and pip3 in Google Colab and
it was not recognized.
- Thankfully, both CatBoost and ET have built-in support for parallelization, as stated
before.
After 100 iterations on Google Colab with 4 threads, we obtain the results in the table
below:
Method     Time (s)             F1-score   ROC-AUC   Balanced Accuracy
ET         16.06                0.8457     0.9526    0.8775
CatBoost   without GPU: 5.09    0.8636     0.9821    0.8877
           with GPU: 2.11
From the table above, CatBoost outperforms ET in both time and classification accuracy.
There are several reasons for this:
- CatBoost supports GPU acceleration, unlike ET.
An important thing to keep in mind is that the credit card dataset contains neither missing
values nor categorical data. ET does not handle these issues, whereas CatBoost contains
built-in algorithms to handle them. On datasets that do contain them (for example the kdd99
dataset), this extra processing could make ET the faster of the two.
Using 500 iterations, we find a smaller difference in the ROC curve between the two, though
the time difference remains large, and the accuracy of CatBoost remains better even after
1000 iterations.
5 Testing ET and CatBoost on the creditCard dataset After Balancing
To balance the dataset we use SMOTE, the oversampling technique discussed previously. SMOTE
is the most popular and a relatively safe method for balancing. Applying SMOTE gives each
class 284315 samples. Using it requires the additional code below:
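from imblearn.over_sampling import SMOTE

# Oversample the minority class of the training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
# .....
et.fit(X_train_smote, y_train_smote)  # Extra Trees on the balanced data
# .....
cb.fit(X_train_smote, y_train_smote)  # CatBoost on the balanced data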
After applying SMOTE and refitting the ensemble methods, we get the following results:
Method     Time (s)             F1-score   ROC-AUC   Balanced Accuracy
ET         38.59                0.8723     0.9744    0.9183
CatBoost   without GPU: 11.55   0.6908     0.9825    0.9382
           with GPU: 4.59
Comparing the results above with the previous ones, the execution time has more than
doubled for both algorithms, which is expected given the amount of extra data added to the
minority class, which goes from only 492 samples to 284315 samples. As for performance,
ROC-AUC and balanced accuracy have improved for both algorithms at 100 iterations. This
means that, for this dataset, SMOTE does have a positive impact on the models, with CatBoost
remaining superior on these two metrics and in execution time. However, the difference in
performance has shrunk compared to before, meaning that on balanced datasets ET and CatBoost
perform relatively similarly.
However, it is worth noting that the F1-score has dropped significantly for CatBoost,
going from 0.8636 to 0.6908. This could be because the minority class is very small (only
0.17% in our case) while the dataset has a large number of features (more than 28), so the
synthetic samples generated by SMOTE may not represent the minority class well. Thankfully,
we have other metrics; otherwise we would be misled.
Conclusion
In conclusion, dealing with imbalanced datasets can be a challenging task
for machine learning algorithms. In such scenarios, techniques such as
SMOTE can be used to balance the dataset and improve the performance
of classification algorithms. When working with balanced datasets, both
Extra Trees and CatBoost can achieve high accuracy and perform well on
different metrics such as F1 score, ROC AUC, and balanced accuracy.
However, the performance of each algorithm may vary depending on the
specific dataset and problem being addressed. Overall, it is recommended
to try different algorithms and techniques to achieve the best possible
results for imbalanced datasets.
References
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and
I. Polosukhin, Attention Is All You Need, Advances in Neural Information Processing
Systems 30 (NIPS 2017).