
XGBoost

Mohamed Nachid Boussiala

2022-06-20

The Theory of Extreme Gradient Boosting Machine (XGBoost)

Introduction
• Extreme Gradient Boosting is currently the state-of-the-art algorithm for building predictive models
on real-world datasets.
• Gradient boosting is currently one of the most popular techniques for efficient modeling of tabular
datasets of all sizes.

• XGBoost is a very fast, scalable implementation of gradient boosting.


• Models built with XGBoost regularly win online data science competitions and are used at scale
across different industries.

XGBoost Library

XGBoost is an optimized gradient-boosting machine learning library.

XGBoost is a popular machine learning library for good reasons:

• It was developed originally as a C++ command-line application.

• After winning a popular machine learning competition on Kaggle, Tianqi Chen and Carlos
Guestrin authored XGBoost: A Scalable Tree Boosting System in 2016 to present their algorithm to the larger
machine learning community.
• As the ML community adopted the algorithm, bindings, that is, functions that tap into the
core C++ code, started appearing in a variety of other languages, including Python, R, Scala, and
Julia.

The Key Technical Aspects of XGBoost: Speed

• The Extreme in Extreme Gradient Boosting means pushing computational limits to the extreme.
Pushing computational limits requires knowledge not just of model-building but also of disk reading,
compression, cache, and cores.

The following new design features give XGBoost a big edge in speed over comparable ensemble algorithms (the parameter sketch after the list shows how some of them are exposed in the R package):

1. Approximate split-finding algorithm: XGBoost presents an exact greedy algorithm in addition to a
new approximate split-finding algorithm. The split-finding algorithm uses quantiles, cut points that
divide the data into equal-sized groups, to propose candidate splits. In a global proposal, the same quantiles are used throughout
the entire training, and in a local proposal, new quantiles are provided for each round of splitting.
2. Sparsity-aware split-finding: Sparse matrices are designed to store only data points with non-zero and
non-null values, which saves valuable space. Sparsity-aware split-finding means that when searching for
splits, XGBoost visits only the stored entries, which makes it faster on sparse data.
3. Parallel computing: training is parallelizable onto GPUs and across networks of computers, making it feasible
to train models on very large datasets, on the order of hundreds of millions of training examples.

4. Cache-aware access: The data on a computer is separated into cache and main memory. The cache
is high-speed memory reserved for the data that is used most often.
5. Block compression and sharding:

1. Sharding is a method for distributing a single dataset across multiple databases, which can then
be stored on multiple machines. This allows larger datasets to be split into smaller chunks and
stored on multiple data nodes, increasing the total storage capacity of the system.
• Block sharding decreases read times by sharding the data into multiple disks that alternate
when reading the data.
2. Block compression helps with computationally expensive disk reading by compressing columns.
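Several of these design choices surface directly as training parameters in the xgboost R package. The snippet below is a minimal sketch (parameter values are illustrative, not tuned); dtrain stands for any existing xgb.DMatrix.

# Requesting approximate, histogram-based split finding and multi-threaded training.
# dtrain is assumed to be an existing xgb.DMatrix; values are illustrative only.
params <- list(
  tree_method = "hist", # histogram-based approximate split finding
  max_bin     = 256,    # number of quantile bins used to propose candidate splits
  nthread     = 4       # train in parallel across CPU cores
)
bst <- xgb.train(params = params, data = dtrain, nrounds = 10)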

XGBoost Learning Objective and Base Learners


• In order to fully understand why XGBoost is such a powerful approach to building supervised learning
models (regression and classification), two concepts need to be understood:
– The loss function, also called the learning objective function.
– The base learner, or booster: the machine learning model that is constructed during every
round of boosting. gbtree is XGBoost's default base learner. Other base learners are available,
such as gblinear and DART (Dropouts meet Multiple Additive Regression Trees); a short sketch of
selecting a booster follows this list. Check here to learn more about DART.
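As a quick sketch of how a base learner is chosen (values are illustrative; xgb_train stands for an xgb.DMatrix like the one built in Step 3 below):

# Selecting a non-default base learner via the booster parameter.
params_linear <- list(booster = "gblinear", objective = "reg:squarederror")
params_dart   <- list(booster = "dart", rate_drop = 0.1, objective = "reg:squarederror")

m_linear <- xgb.train(params = params_linear, data = xgb_train, nrounds = 20)
m_dart   <- xgb.train(params = params_dart,   data = xgb_train, nrounds = 20)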

Learning Objective

• The learning objective or objective function of a machine learning model determines how well
the model fits the data. When we construct any machine learning model, we do so in the hopes that
it minimizes the loss function across all of the data points we pass in. That’s our ultimate goal, the
smallest possible loss.
• In the case of XGBoost, the learning objective consists of two parts: the loss function and the regularization term.

obj(θ) = L(θ) + Ω(θ)

where L is the training loss function, and Ω is the regularization term. The training loss measures how
predictive our model is with respect to the training data.
A common choice of L is the mean squared error, which is given by:

L(θ) = Σᵢ (yᵢ − ŷᵢ)²
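To make this concrete, here is a tiny R illustration with made-up values (purely hypothetical numbers, not taken from the Boston example used later):

# Toy targets and predictions, purely illustrative
y    <- c(3.0, 2.5, 4.1)   # observed values y_i
yhat <- c(2.8, 2.9, 3.7)   # model predictions y_hat_i
sum((y - yhat)^2)          # squared-error loss L(theta) = 0.36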

Another commonly used loss function is logistic loss, to be used for logistic regression:

L(θ) = Σᵢ [yᵢ ln(1 + e^(−ŷᵢ)) + (1 − yᵢ) ln(1 + e^(ŷᵢ))]

The regularization term is what people usually forget to add. It controls the complexity of the model,
which helps us avoid overfitting.
For mathematical derivation, check the documentation here.
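In the xgboost R package, the Ω term is controlled through a handful of regularization parameters. The snippet below is a minimal sketch with illustrative, untuned values; dtrain again stands for any existing xgb.DMatrix.

# Regularization knobs that shape the Omega term; values are illustrative only.
params <- list(
  lambda    = 1,    # L2 penalty on leaf weights
  alpha     = 0,    # L1 penalty on leaf weights
  gamma     = 0.5,  # minimum loss reduction required to make a further split
  max_depth = 3     # also caps model complexity
)
bst <- xgb.train(params = params, data = dtrain, nrounds = 10)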

Common loss functions and XGBoost

• Loss functions have specific naming conventions in XGBoost:


– For regression models: the most common loss function used is called reg:squarederror.
– For binary classification models, the most common loss functions used are:
∗ reg:logistic, when you simply want the predicted category of the target.
∗ binary:logistic, when you want the actual predicted probability of the positive class.
– For multiclass models: multi:softprob, which outputs the predicted probability of each class
(multi:softmax instead returns only the class with the highest probability).

Check here for a full list of objective functions. The short sketch below shows how an objective is passed to xgb.train.
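As a minimal sketch (illustrative values only), an objective is passed through the params list; binary_dtrain below is a hypothetical xgb.DMatrix whose label is coded 0/1.

# Regression with squared loss
reg_params <- list(objective = "reg:squarederror")

# Binary classification returning predicted probabilities of the positive class.
# binary_dtrain is a hypothetical xgb.DMatrix with a 0/1 label.
clf_params <- list(objective = "binary:logistic", eval_metric = "logloss")
clf <- xgb.train(params = clf_params, data = binary_dtrain, nrounds = 10)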

Step 1: Load the Necessary Packages

First, we’ll load the necessary libraries: xgboost for the model, caret for data splitting and evaluation metrics, and MASS for the Boston dataset.

#install.packages(c("xgboost", "caret"))
library(xgboost)

## Warning: package 'xgboost' was built under R version 4.1.3

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(MASS)

Step 2: Load the Data
For this example we’ll fit a boosted regression model to the Boston dataset from the MASS package.
This dataset contains 13 predictor variables that we’ll use to predict one response variable called medv, which
represents the median value of homes (in thousands of dollars) in different census tracts around Boston.

data <- Boston


str(data)

## 'data.frame': 506 obs. of 14 variables:


## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

We can see that the dataset contains 506 observations and 14 total variables.

Step 3: Prep the Data


Next, we’ll use the createDataPartition() function from the caret package to split the original dataset into a
training and testing set.
For this example, we’ll choose to use 80% of the original dataset as part of the training set.
Note that the xgboost package requires matrix input, so we’ll use the data.matrix() function to convert our
predictor variables.

#make this example reproducible


set.seed(1711)

#split into training (80%) and testing set (20%)


parts <- createDataPartition(data$medv, p = .8, list = F)
train <- data[parts, ]
test <- data[-parts, ]

#define predictor and response variables in training set


train_X <- data.matrix(train[, -14])  # all 13 predictors (medv is column 14)
train_y <- train[, 14]                # response: medv

#define predictor and response variables in testing set


test_X <- data.matrix(test[, -14])    # all 13 predictors
test_y <- test[, 14]                  # response: medv

#define final training and testing sets
xgb_train <- xgb.DMatrix(data = train_X, label = train_y)
xgb_test <- xgb.DMatrix(data = test_X, label = test_y)

Step 4: Fit the Model


Next, we’ll fit the XGBoost model by using the xgb.train() function, which displays the training and testing
RMSE (root mean squared error) for each round of boosting.
Note that we chose to use 70 rounds for this example, but for much larger datasets it’s not uncommon to
use hundreds or even thousands of rounds. Just keep in mind that the more rounds, the longer the run time.
Also note that the max.depth argument specifies how deep to grow the individual decision trees. We typically
choose this number to be quite low, such as 2 or 3, so that smaller trees are grown. Shallow trees combined with
many boosting rounds tend to produce more accurate models.

#define watchlist
watchlist = list(train=xgb_train, test=xgb_test)

#fit XGBoost model and display training and testing data at each round
model = xgb.train(data = xgb_train, max.depth = 3, watchlist=watchlist, nrounds = 70)

## [1] train-rmse:10.268952 test-rmse:10.308258


## [2] train-rmse:7.621959 test-rmse:7.640498
## [3] train-rmse:5.802636 test-rmse:6.143439
## [4] train-rmse:4.550875 test-rmse:4.928000
## [5] train-rmse:3.731272 test-rmse:4.306933
## [6] train-rmse:3.212170 test-rmse:3.999524
## [7] train-rmse:2.876153 test-rmse:3.749049
## [8] train-rmse:2.679817 test-rmse:3.650994
## [9] train-rmse:2.491618 test-rmse:3.607793
## [10] train-rmse:2.375090 test-rmse:3.527256
## [11] train-rmse:2.313203 test-rmse:3.553428
## [12] train-rmse:2.211834 test-rmse:3.544834
## [13] train-rmse:2.174607 test-rmse:3.565205
## [14] train-rmse:2.139551 test-rmse:3.573385
## [15] train-rmse:2.114976 test-rmse:3.588664
## [16] train-rmse:2.068074 test-rmse:3.570330
## [17] train-rmse:2.051490 test-rmse:3.590506
## [18] train-rmse:1.995807 test-rmse:3.566253
## [19] train-rmse:1.970252 test-rmse:3.570843
## [20] train-rmse:1.934135 test-rmse:3.555940
## [21] train-rmse:1.900306 test-rmse:3.576161
## [22] train-rmse:1.871715 test-rmse:3.577647
## [23] train-rmse:1.826952 test-rmse:3.554332
## [24] train-rmse:1.811795 test-rmse:3.565745
## [25] train-rmse:1.801999 test-rmse:3.565397
## [26] train-rmse:1.779351 test-rmse:3.542693
## [27] train-rmse:1.754780 test-rmse:3.530288
## [28] train-rmse:1.746118 test-rmse:3.532668
## [29] train-rmse:1.705051 test-rmse:3.519894
## [30] train-rmse:1.683708 test-rmse:3.537678

## [31] train-rmse:1.656522 test-rmse:3.540007
## [32] train-rmse:1.645577 test-rmse:3.545167
## [33] train-rmse:1.614263 test-rmse:3.546557
## [34] train-rmse:1.593455 test-rmse:3.539367
## [35] train-rmse:1.567166 test-rmse:3.528534
## [36] train-rmse:1.555617 test-rmse:3.532916
## [37] train-rmse:1.527377 test-rmse:3.540154
## [38] train-rmse:1.517858 test-rmse:3.525271
## [39] train-rmse:1.492600 test-rmse:3.502901
## [40] train-rmse:1.484267 test-rmse:3.523625
## [41] train-rmse:1.468374 test-rmse:3.530097
## [42] train-rmse:1.442361 test-rmse:3.532107
## [43] train-rmse:1.410727 test-rmse:3.527861
## [44] train-rmse:1.402039 test-rmse:3.526187
## [45] train-rmse:1.380398 test-rmse:3.518224
## [46] train-rmse:1.372660 test-rmse:3.518499
## [47] train-rmse:1.349876 test-rmse:3.514045
## [48] train-rmse:1.345266 test-rmse:3.509992
## [49] train-rmse:1.319305 test-rmse:3.503536
## [50] train-rmse:1.296536 test-rmse:3.484637
## [51] train-rmse:1.284502 test-rmse:3.484824
## [52] train-rmse:1.259046 test-rmse:3.491340
## [53] train-rmse:1.253510 test-rmse:3.494199
## [54] train-rmse:1.246855 test-rmse:3.502560
## [55] train-rmse:1.228881 test-rmse:3.501782
## [56] train-rmse:1.218618 test-rmse:3.514161
## [57] train-rmse:1.193581 test-rmse:3.515703
## [58] train-rmse:1.181366 test-rmse:3.516584
## [59] train-rmse:1.161258 test-rmse:3.523906
## [60] train-rmse:1.137540 test-rmse:3.530229
## [61] train-rmse:1.121475 test-rmse:3.519965
## [62] train-rmse:1.111235 test-rmse:3.523945
## [63] train-rmse:1.087250 test-rmse:3.521362
## [64] train-rmse:1.076125 test-rmse:3.512412
## [65] train-rmse:1.058271 test-rmse:3.521056
## [66] train-rmse:1.054615 test-rmse:3.519675
## [67] train-rmse:1.043884 test-rmse:3.518975
## [68] train-rmse:1.034688 test-rmse:3.522976
## [69] train-rmse:1.022151 test-rmse:3.518511
## [70] train-rmse:1.009072 test-rmse:3.507155

From the output we can see that the minimum testing RMSE is achieved at 50 rounds. Beyond this point,
the test RMSE actually begins to increase, which is a sign that we’re overfitting the training data.
Thus, we’ll define our final XGBoost model to use 50 rounds:

#define final model


final <- xgboost(data = xgb_train, max.depth = 3, nrounds = 50, verbose = 1)

## [1] train-rmse:10.268952
## [2] train-rmse:7.621959
## [3] train-rmse:5.802636
## [4] train-rmse:4.550875
## [5] train-rmse:3.731272

## [6] train-rmse:3.212170
## [7] train-rmse:2.876153
## [8] train-rmse:2.679817
## [9] train-rmse:2.491618
## [10] train-rmse:2.375090
## [11] train-rmse:2.313203
## [12] train-rmse:2.211834
## [13] train-rmse:2.174607
## [14] train-rmse:2.139551
## [15] train-rmse:2.114976
## [16] train-rmse:2.068074
## [17] train-rmse:2.051490
## [18] train-rmse:1.995807
## [19] train-rmse:1.970252
## [20] train-rmse:1.934135
## [21] train-rmse:1.900306
## [22] train-rmse:1.871715
## [23] train-rmse:1.826952
## [24] train-rmse:1.811795
## [25] train-rmse:1.801999
## [26] train-rmse:1.779351
## [27] train-rmse:1.754780
## [28] train-rmse:1.746118
## [29] train-rmse:1.705051
## [30] train-rmse:1.683708
## [31] train-rmse:1.656522
## [32] train-rmse:1.645577
## [33] train-rmse:1.614263
## [34] train-rmse:1.593455
## [35] train-rmse:1.567166
## [36] train-rmse:1.555617
## [37] train-rmse:1.527377
## [38] train-rmse:1.517858
## [39] train-rmse:1.492600
## [40] train-rmse:1.484267
## [41] train-rmse:1.468374
## [42] train-rmse:1.442361
## [43] train-rmse:1.410727
## [44] train-rmse:1.402039
## [45] train-rmse:1.380398
## [46] train-rmse:1.372660
## [47] train-rmse:1.349876
## [48] train-rmse:1.345266
## [49] train-rmse:1.319305
## [50] train-rmse:1.296536
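As an aside (not part of the run above), the choice of 50 rounds can also be automated with xgb.train’s early_stopping_rounds argument, which halts training once the evaluation metric on the last watchlist entry has not improved for a given number of rounds. A sketch with illustrative values:

# Hypothetical alternative run: stop automatically when test RMSE stops improving.
model_es <- xgb.train(data = xgb_train, max.depth = 3, watchlist = watchlist,
                      nrounds = 500, early_stopping_rounds = 10, verbose = 0)
model_es$best_iteration  # round that achieved the best test RMSE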

Step 5: Use the Model to Make Predictions


Lastly, we can use the final boosted model to make predictions about the median house value of Boston
homes in the testing set.
We will then calculate the following accuracy measures for the model:
• MSE: Mean Squared Error
• MAE: Mean Absolute Error
• RMSE: Root Mean Squared Error

pred_y <- stats::predict(final, xgb_test)
MSE <- mean((test_y - pred_y)^2) #mse
MAE <- MAE(test_y, pred_y) #mae
RMSE <- RMSE(test_y, pred_y) #rmse

MSE

## [1] 12.1427

MAE

## [1] 2.513421

RMSE

## [1] 3.484637

The root mean squared error turns out to be 3.484637. This represents the typical deviation (in thousands of
dollars) between the predicted median house values and the actual observed values in the test set.
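Finally, as a small usage sketch, a single new observation can be scored the same way, provided it is supplied as a numeric matrix or xgb.DMatrix with the same predictor columns (here we simply reuse the first test row):

# Score one held-out row as if it were a brand-new observation (illustrative only).
new_obs <- xgb.DMatrix(data = test_X[1, , drop = FALSE])
stats::predict(final, new_obs)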
