
XGBoost

Mohamed Nachid Boussiala

2022-06-20

The Theory of Extreme Gradient Boosting Machine (XGBoost)

Introduction
• Extreme Gradient Boosting is currently the state-of-the-art algorithm for building predictive models
on real-world datasets.
• Gradient boosting is currently one of the most popular techniques for efficient modeling of tabular
datasets of all sizes.

• XGBoost is a very fast, scalable implementation of gradient boosting.


• Models built with XGBoost regularly win online data science competitions and are used at scale
across different industries.

XGBoost Library

XGBoost is an optimized gradient-boosting machine learning library.

XGBoost is a popular machine learning library for good reasons:

• It was developed originally as a C++ command-line application.

• After winning a popular machine learning competition on Kaggle, Tianqi Chen and Carlos
Guestrin authored XGBoost: A Scalable Tree Boosting System in 2016 to present their algorithm to the larger
machine learning community.
• As the ML community adopted the algorithm, bindings, that is, functions that tap into the
core C++ code, started appearing in a variety of other languages, including Python, R, Scala, and
Julia.

The Key Technical Aspects of XGBoost: Speed

• The Extreme in Extreme Gradient Boosting means pushing computational limits to the extreme.
Pushing computational limits requires knowledge not just of model-building but also of disk reading,
compression, cache, and cores.

The following new design features give XGBoost a big edge in speed over comparable ensemble algorithms (the parameter sketch after the list shows how some of them are exposed in the R package):

1. Approximate split-finding algorithm: XGBoost presents an exact greedy algorithm in addition to a
new approximate split-finding algorithm. The split-finding algorithm uses quantiles, cut points that
divide the data into equal-sized groups, to propose candidate splits. In a global proposal, the same quantiles are used throughout
the entire training, and in a local proposal, new quantiles are provided for each round of splitting.
2. Sparsity-aware split-finding: Sparse matrices are designed to store only data points with non-zero and
non-null values, which saves valuable space. Sparsity-aware split-finding means that when searching for
splits, XGBoost visits only the stored entries, which makes it faster on sparse data.
3. Parallel computing: training is parallelizable onto GPUs and across networks of computers, making it feasible
to train models on very large datasets, on the order of hundreds of millions of training examples.

4. Cache-aware access: The data on a computer is separated into cache and main memory. The cache
is high-speed memory reserved for the data that is used most often.
5. Block compression and sharding:

1. Sharding is a method for distributing a single dataset across multiple databases, which can then
be stored on multiple machines. This allows larger datasets to be split into smaller chunks and
stored on multiple data nodes, increasing the total storage capacity of the system.
• Block sharding decreases read times by sharding the data into multiple disks that alternate
when reading the data.
2. Block compression helps with computationally expensive disk reading by compressing columns.
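Several of these design choices surface directly as training parameters in the xgboost R package. The snippet below is a minimal sketch (parameter values are illustrative, not tuned); dtrain stands for any existing xgb.DMatrix.

# Requesting approximate, histogram-based split finding and multi-threaded training.
# dtrain is assumed to be an existing xgb.DMatrix; values are illustrative only.
params <- list(
  tree_method = "hist", # histogram-based approximate split finding
  max_bin     = 256,    # number of quantile bins used to propose candidate splits
  nthread     = 4       # train in parallel across CPU cores
)
bst <- xgb.train(params = params, data = dtrain, nrounds = 10)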

XGBoost Learning Objective and Base Learners


• In order to fully understand why XGBoost is such a powerful approach to building supervised learning
models (regression and classification), two concepts need to be understood:
– The loss function, also called the learning objective function.
– The base learner, or booster: the machine learning model that is constructed during every
round of boosting. gbtree is XGBoost's default base learner. Other base learners are available,
such as gblinear and DART (Dropouts meet Multiple Additive Regression Trees); a short sketch of
selecting a booster follows this list. Check here to learn more about DART.
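As a quick sketch of how a base learner is chosen (values are illustrative; xgb_train stands for an xgb.DMatrix like the one built in Step 3 below):

# Selecting a non-default base learner via the booster parameter.
params_linear <- list(booster = "gblinear", objective = "reg:squarederror")
params_dart   <- list(booster = "dart", rate_drop = 0.1, objective = "reg:squarederror")

m_linear <- xgb.train(params = params_linear, data = xgb_train, nrounds = 20)
m_dart   <- xgb.train(params = params_dart,   data = xgb_train, nrounds = 20)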

Learning Objective

• The learning objective or objective function of a machine learning model determines how well
the model fits the data. When we construct any machine learning model, we do so in the hopes that
it minimizes the loss function across all of the data points we pass in. That’s our ultimate goal, the
smallest possible loss.
• In the case of XGBoost, the learning objective consists of two parts: the loss function and the regularization term.

obj(θ) = L(θ) + Ω(θ)

where L is the training loss function, and Ω is the regularization term. The training loss measures how
predictive our model is with respect to the training data.
A common choice of L is the mean squared error, which is given by:

L(θ) = Σᵢ (yᵢ − ŷᵢ)²
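To make this concrete, here is a tiny R illustration with made-up values (purely hypothetical numbers, not taken from the Boston example used later):

# Toy targets and predictions, purely illustrative
y    <- c(3.0, 2.5, 4.1)   # observed values y_i
yhat <- c(2.8, 2.9, 3.7)   # model predictions y_hat_i
sum((y - yhat)^2)          # squared-error loss L(theta) = 0.36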

Another commonly used loss function is logistic loss, to be used for logistic regression:

L(θ) = Σᵢ [yᵢ ln(1 + e^(−ŷᵢ)) + (1 − yᵢ) ln(1 + e^(ŷᵢ))]

The regularization term is what people usually forget to add. It controls the complexity of the model,
which helps us avoid overfitting.
For mathematical derivation, check the documentation here.
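In the xgboost R package, the Ω term is controlled through a handful of regularization parameters. The snippet below is a minimal sketch with illustrative, untuned values; dtrain again stands for any existing xgb.DMatrix.

# Regularization knobs that shape the Omega term; values are illustrative only.
params <- list(
  lambda    = 1,    # L2 penalty on leaf weights
  alpha     = 0,    # L1 penalty on leaf weights
  gamma     = 0.5,  # minimum loss reduction required to make a further split
  max_depth = 3     # also caps model complexity
)
bst <- xgb.train(params = params, data = dtrain, nrounds = 10)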

Common loss functions and XGBoost

• Loss functions have specific naming conventions in XGBoost:


– For regression models: the most common loss function used is called reg:squarederror.
– For binary classification models, the most common loss functions used are:
∗ reg:logistic, when you simply want the predicted category of the target.
∗ binary:logistic, when you want the actual predicted probability of the positive class.
– For multiclass models: multi:softprob, which outputs the predicted probability of each class
(multi:softmax instead returns only the class with the highest probability).

Check here for a full list of objective functions. The short sketch below shows how an objective is passed to xgb.train.
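As a minimal sketch (illustrative values only), an objective is passed through the params list; binary_dtrain below is a hypothetical xgb.DMatrix whose label is coded 0/1.

# Regression with squared loss
reg_params <- list(objective = "reg:squarederror")

# Binary classification returning predicted probabilities of the positive class.
# binary_dtrain is a hypothetical xgb.DMatrix with a 0/1 label.
clf_params <- list(objective = "binary:logistic", eval_metric = "logloss")
clf <- xgb.train(params = clf_params, data = binary_dtrain, nrounds = 10)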

Step 1: Load the Necessary Packages

First, we’ll load the necessary libraries: xgboost for the model, caret for data splitting and evaluation metrics, and MASS for the Boston dataset.

#install.packages(c("xgboost", "caret"))
library(xgboost)

## Warning: package 'xgboost' was built under R version 4.1.3

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(MASS)

Step 2: Load the Data
For this example we’ll fit a boosted regression model to the Boston dataset from the MASS package.
This dataset contains 13 predictor variables that we’ll use to predict one response variable called medv, which
represents the median value of homes (in thousands of dollars) in different census tracts around Boston.

data <- Boston


str(data)

## 'data.frame': 506 obs. of 14 variables:


## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

We can see that the dataset contains 506 observations and 14 total variables.

Step 3: Prep the Data


Next, we’ll use the createDataPartition() function from the caret package to split the original dataset into a
training and testing set.
For this example, we’ll choose to use 80% of the original dataset as part of the training set.
Note that the xgboost package requires matrix input, so we’ll use the data.matrix() function to convert our
predictor variables.

#make this example reproducible


set.seed(1711)

#split into training (80%) and testing set (20%)


parts <- createDataPartition(data$medv, p = .8, list = F)
train <- data[parts, ]
test <- data[-parts, ]

#define predictor and response variables in training set


train_X <- data.matrix(train[, -14])  # all 13 predictors (medv is column 14)
train_y <- train[, 14]                # response: medv

#define predictor and response variables in testing set


test_X <- data.matrix(test[, -14])    # all 13 predictors
test_y <- test[, 14]                  # response: medv

#define final training and testing sets
xgb_train <- xgb.DMatrix(data = train_X, label = train_y)
xgb_test <- xgb.DMatrix(data = test_X, label = test_y)

Step 4: Fit the Model


Next, we’ll fit the XGBoost model by using the xgb.train() function, which displays the training and testing
RMSE (root mean squared error) for each round of boosting.
Note that we chose to use 70 rounds for this example, but for much larger datasets it’s not uncommon to
use hundreds or even thousands of rounds. Just keep in mind that the more rounds, the longer the run time.
Also note that the max.depth argument specifies how deep to grow the individual decision trees. We typically
choose this number to be quite low, such as 2 or 3, so that smaller trees are grown. Shallow trees combined with
many boosting rounds tend to produce more accurate models.

#define watchlist
watchlist = list(train=xgb_train, test=xgb_test)

#fit XGBoost model and display training and testing data at each round
model = xgb.train(data = xgb_train, max.depth = 3, watchlist=watchlist, nrounds = 70)

## [1] train-rmse:10.268952 test-rmse:10.308258


## [2] train-rmse:7.621959 test-rmse:7.640498
## [3] train-rmse:5.802636 test-rmse:6.143439
## [4] train-rmse:4.550875 test-rmse:4.928000
## [5] train-rmse:3.731272 test-rmse:4.306933
## [6] train-rmse:3.212170 test-rmse:3.999524
## [7] train-rmse:2.876153 test-rmse:3.749049
## [8] train-rmse:2.679817 test-rmse:3.650994
## [9] train-rmse:2.491618 test-rmse:3.607793
## [10] train-rmse:2.375090 test-rmse:3.527256
## [11] train-rmse:2.313203 test-rmse:3.553428
## [12] train-rmse:2.211834 test-rmse:3.544834
## [13] train-rmse:2.174607 test-rmse:3.565205
## [14] train-rmse:2.139551 test-rmse:3.573385
## [15] train-rmse:2.114976 test-rmse:3.588664
## [16] train-rmse:2.068074 test-rmse:3.570330
## [17] train-rmse:2.051490 test-rmse:3.590506
## [18] train-rmse:1.995807 test-rmse:3.566253
## [19] train-rmse:1.970252 test-rmse:3.570843
## [20] train-rmse:1.934135 test-rmse:3.555940
## [21] train-rmse:1.900306 test-rmse:3.576161
## [22] train-rmse:1.871715 test-rmse:3.577647
## [23] train-rmse:1.826952 test-rmse:3.554332
## [24] train-rmse:1.811795 test-rmse:3.565745
## [25] train-rmse:1.801999 test-rmse:3.565397
## [26] train-rmse:1.779351 test-rmse:3.542693
## [27] train-rmse:1.754780 test-rmse:3.530288
## [28] train-rmse:1.746118 test-rmse:3.532668
## [29] train-rmse:1.705051 test-rmse:3.519894
## [30] train-rmse:1.683708 test-rmse:3.537678

## [31] train-rmse:1.656522 test-rmse:3.540007
## [32] train-rmse:1.645577 test-rmse:3.545167
## [33] train-rmse:1.614263 test-rmse:3.546557
## [34] train-rmse:1.593455 test-rmse:3.539367
## [35] train-rmse:1.567166 test-rmse:3.528534
## [36] train-rmse:1.555617 test-rmse:3.532916
## [37] train-rmse:1.527377 test-rmse:3.540154
## [38] train-rmse:1.517858 test-rmse:3.525271
## [39] train-rmse:1.492600 test-rmse:3.502901
## [40] train-rmse:1.484267 test-rmse:3.523625
## [41] train-rmse:1.468374 test-rmse:3.530097
## [42] train-rmse:1.442361 test-rmse:3.532107
## [43] train-rmse:1.410727 test-rmse:3.527861
## [44] train-rmse:1.402039 test-rmse:3.526187
## [45] train-rmse:1.380398 test-rmse:3.518224
## [46] train-rmse:1.372660 test-rmse:3.518499
## [47] train-rmse:1.349876 test-rmse:3.514045
## [48] train-rmse:1.345266 test-rmse:3.509992
## [49] train-rmse:1.319305 test-rmse:3.503536
## [50] train-rmse:1.296536 test-rmse:3.484637
## [51] train-rmse:1.284502 test-rmse:3.484824
## [52] train-rmse:1.259046 test-rmse:3.491340
## [53] train-rmse:1.253510 test-rmse:3.494199
## [54] train-rmse:1.246855 test-rmse:3.502560
## [55] train-rmse:1.228881 test-rmse:3.501782
## [56] train-rmse:1.218618 test-rmse:3.514161
## [57] train-rmse:1.193581 test-rmse:3.515703
## [58] train-rmse:1.181366 test-rmse:3.516584
## [59] train-rmse:1.161258 test-rmse:3.523906
## [60] train-rmse:1.137540 test-rmse:3.530229
## [61] train-rmse:1.121475 test-rmse:3.519965
## [62] train-rmse:1.111235 test-rmse:3.523945
## [63] train-rmse:1.087250 test-rmse:3.521362
## [64] train-rmse:1.076125 test-rmse:3.512412
## [65] train-rmse:1.058271 test-rmse:3.521056
## [66] train-rmse:1.054615 test-rmse:3.519675
## [67] train-rmse:1.043884 test-rmse:3.518975
## [68] train-rmse:1.034688 test-rmse:3.522976
## [69] train-rmse:1.022151 test-rmse:3.518511
## [70] train-rmse:1.009072 test-rmse:3.507155

From the output we can see that the minimum testing RMSE is achieved at 50 rounds. Beyond this point,
the test RMSE actually begins to increase, which is a sign that we’re overfitting the training data.
Thus, we’ll define our final XGBoost model to use 50 rounds:

#define final model


final <- xgboost(data = xgb_train, max.depth = 3, nrounds = 50, verbose = 1)

## [1] train-rmse:10.268952
## [2] train-rmse:7.621959
## [3] train-rmse:5.802636
## [4] train-rmse:4.550875
## [5] train-rmse:3.731272

## [6] train-rmse:3.212170
## [7] train-rmse:2.876153
## [8] train-rmse:2.679817
## [9] train-rmse:2.491618
## [10] train-rmse:2.375090
## [11] train-rmse:2.313203
## [12] train-rmse:2.211834
## [13] train-rmse:2.174607
## [14] train-rmse:2.139551
## [15] train-rmse:2.114976
## [16] train-rmse:2.068074
## [17] train-rmse:2.051490
## [18] train-rmse:1.995807
## [19] train-rmse:1.970252
## [20] train-rmse:1.934135
## [21] train-rmse:1.900306
## [22] train-rmse:1.871715
## [23] train-rmse:1.826952
## [24] train-rmse:1.811795
## [25] train-rmse:1.801999
## [26] train-rmse:1.779351
## [27] train-rmse:1.754780
## [28] train-rmse:1.746118
## [29] train-rmse:1.705051
## [30] train-rmse:1.683708
## [31] train-rmse:1.656522
## [32] train-rmse:1.645577
## [33] train-rmse:1.614263
## [34] train-rmse:1.593455
## [35] train-rmse:1.567166
## [36] train-rmse:1.555617
## [37] train-rmse:1.527377
## [38] train-rmse:1.517858
## [39] train-rmse:1.492600
## [40] train-rmse:1.484267
## [41] train-rmse:1.468374
## [42] train-rmse:1.442361
## [43] train-rmse:1.410727
## [44] train-rmse:1.402039
## [45] train-rmse:1.380398
## [46] train-rmse:1.372660
## [47] train-rmse:1.349876
## [48] train-rmse:1.345266
## [49] train-rmse:1.319305
## [50] train-rmse:1.296536
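As an aside (not part of the run above), the choice of 50 rounds can also be automated with xgb.train’s early_stopping_rounds argument, which halts training once the evaluation metric on the last watchlist entry has not improved for a given number of rounds. A sketch with illustrative values:

# Hypothetical alternative run: stop automatically when test RMSE stops improving.
model_es <- xgb.train(data = xgb_train, max.depth = 3, watchlist = watchlist,
                      nrounds = 500, early_stopping_rounds = 10, verbose = 0)
model_es$best_iteration  # round that achieved the best test RMSE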

Step 5: Use the Model to Make Predictions


Lastly, we can use the final boosted model to make predictions about the median house value of Boston
homes in the testing set.
We will then calculate the following accuracy measures for the model:
• MSE: Mean Squared Error
• MAE: Mean Absolute Error
• RMSE: Root Mean Squared Error

pred_y <- stats::predict(final, xgb_test)
MSE <- mean((test_y - pred_y)^2) #mse
MAE <- MAE(test_y, pred_y) #mae
RMSE <- RMSE(test_y, pred_y) #rmse

MSE

## [1] 12.1427

MAE

## [1] 2.513421

RMSE

## [1] 3.484637

The root mean squared error turns out to be 3.484637. This represents the typical deviation (in thousands of
dollars) between the predicted median house values and the actual observed values in the test set.
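Finally, as a small usage sketch, a single new observation can be scored the same way, provided it is supplied as a numeric matrix or xgb.DMatrix with the same predictor columns (here we simply reuse the first test row):

# Score one held-out row as if it were a brand-new observation (illustrative only).
new_obs <- xgb.DMatrix(data = test_X[1, , drop = FALSE])
stats::predict(final, new_obs)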
