
Discretization or Binning in Machine Learning

Uniform Binning | Quantile Binning | K-Means Binning

by smitesh-tamboli
Discretization or Binning
Discretization, also known as binning, involves
converting continuous variables or features into
discrete variables by partitioning their value range into
adjacent intervals known as bins.

Binning helps to change the distribution of skewed
variables, minimize the influence of outliers, and
improve the performance of some machine learning
models.

Types of Binning
Uniform Binning or Equal Width Binning
Quantile Binning or Equal Frequency Binning
K-Means Binning

KBinsDiscretizer(
    n_bins=5,           # number of bins
    encode="ordinal",   # ordinal | onehot | onehot-dense
    strategy="kmeans",  # uniform | quantile | kmeans
)

Uniform Binning or Equal Width Binning
Uniform binning divides the range of continuous
values into a predetermined number of intervals or
bins, ensuring that each bin has the same width.

Formula to calculate the interval width:

width = (max value - min value) / number of bins

For example,
if the values of the variable vary between 0 and 100, we
can create five bins with width = (100 - 0) / 5 = 20.

Perform Uniform Binning

[Figure: uniform binning output; every bin has the same width, e.g. the interval (16.226, 23.474].]
Quantile Binning or Equal Frequency Binning
Quantile binning divides the range of continuous
values into a specified number of intervals or bins
based on the quantiles of the data distribution.
Quantiles are points that divide the data into
equal-sized subsets, for example, the 20th, 40th,
and 60th percentiles.
Each bin represents a specific range of values
determined by the quantiles. Data points within the
same bin share similar positions or ranks in the
dataset.
Quantile binning is particularly effective for handling
outliers and skewed distributions because it partitions
the data based on ranks rather than absolute values.

“Quantiles are values that split sorted data or a
probability distribution into equal parts.”

[Figure: histogram whose bin widths are based on quantiles.]

Perform Quantile Binning

[Figure: quantile binning output; bin edges are based on quantiles, so each bin holds roughly the same number of points.]
K-Means Binning

When continuous numerical data exhibit a cluster
pattern within a feature or column, K-means binning
can be used to discretize them.
The K-means algorithm is applied to create the
intervals or bins.
K-means clustering follows these steps (a minimal
sketch in code follows the list):

1. Randomly choose K observations as the initial centroids of
   the K clusters, and assign the remaining data points to the
   closest cluster.
2. Until convergence (e.g. when the centroids of the clusters
   no longer change significantly between iterations):
   a. Recompute the center of each cluster as the average of
      all observations within that cluster.
   b. Reassign each observation to the cluster with the closest
      centroid based on Euclidean distance.
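To make these steps concrete, here is a minimal 1-D K-means sketch in NumPy; the function name and sample data are hypothetical, and scikit-learn performs the equivalent clustering internally when KBinsDiscretizer is given strategy="kmeans".

import numpy as np

def kmeans_1d(x, k, n_iter=100, seed=0):
    """A minimal 1-D K-means sketch mirroring the steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K observations at random as the initial centroids
    centroids = rng.choice(x, size=k, replace=False)
    for _ in range(n_iter):
        # Assign each observation to the closest centroid
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        # Step 2a: recompute each centroid as the mean of its cluster
        # (keep the old centroid if a cluster happens to be empty)
        new_centroids = np.array([
            x[labels == j].mean() if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer change significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return np.sort(centroids)

# Hypothetical clustered data
x = np.array([1.0, 1.2, 1.1, 5.0, 5.2, 4.9, 9.0, 9.3, 9.1])
print(kmeans_1d(x, k=3))  # roughly [1.1, 5.03, 9.13]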

K-Means Binning

[Figure: binning output where bin edges are based on the K-means clustering algorithm.]
How to create Bins
The KBinsDiscretizer class from the sklearn library can be
used to create bins:

KBinsDiscretizer(
    n_bins=5,           # number of bins
    encode="ordinal",   # ordinal | onehot | onehot-dense
    strategy="kmeans",  # uniform | quantile | kmeans
)
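For completeness, a minimal sketch of the encode options on hypothetical data; "onehot-dense" returns one dense indicator column per bin instead of ordinal integers.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[3.0], [18.0], [42.0], [77.0], [95.0]])  # hypothetical values

# onehot-dense: one indicator column per bin, returned as a dense array
discretizer = KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="uniform")
print(discretizer.fit_transform(X))
# encode="onehot" would return the same matrix in sparse format;
# encode="ordinal" would return a single column of bin indices.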

Benefits of Binning
Binning can handle outliers by grouping them into the lowest
or highest intervals, along with the remaining inlier values of
the distribution.
Binning can improve the spread of values by creating evenly
distributed intervals.

Disadvantage of Binning
Binning can increase Type I (false positive) and Type II (false
negative) errors. For example, suppose 10 falls in bin A while 12
and 20 fall in bin B. Although 12 is much closer to 10, binning
treats it as similar to 20, which may cause issues, as the snippet
below demonstrates.
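A tiny demonstration with NumPy; the bin edge at 11.5 is hypothetical.

import numpy as np

# A bin edge at 11.5 puts 10 in bin 0 and both 12 and 20 in bin 1,
# even though 12 is much closer to 10 than to 20
values = np.array([10, 12, 20])
print(np.digitize(values, bins=[11.5]))  # [0 1 1]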

