Binning or Discretization in Machine Learning
smitesh-tamboli
Discretization or Binning
Discretization, also known as binning, involves
converting continuous variables or features into
discrete variables by partitioning their value range into
adjacent intervals known as bins.
Types of Binning
Uniform Binning or Equal Width Binning
Quantile Binning or Equal Frequency Binning
K-Means Binning
Uniform Binning or Equal Width Binning
Uniform binning divides the range of continuous
values into a predetermined number of intervals, or
bins, ensuring that each bin has the same width.
For example,
if the values of the variable vary between 0 and 100, we
can create five bins, each of width (100 - 0) / 5 = 20.
Perform Uniform Binning
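A minimal sketch of equal-width binning with scikit-learn's KBinsDiscretizer; the sample values here are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical sample data ranging from 0 to 100.
X = np.array([[0], [5], [15], [22], [45], [55], [70], [83], [100]])

# Five equal-width bins: width = (100 - 0) / 5 = 20.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
binned = disc.fit_transform(X)

print(disc.bin_edges_[0])  # edges at 0, 20, 40, 60, 80, 100
print(binned.ravel())      # the bin index assigned to each value
```

With encode="ordinal", each value is replaced by the integer index of the bin it falls into rather than a one-hot vector.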
Quantile Binning or Equal Frequency Binning
Quantile binning divides the range of continuous
values into a specified number of intervals or bins
based on the quantiles of the data distribution.
Quantiles are points that divide the data into equal-
sized subsets, for example the 20th, 40th, and 60th
percentiles.
Each bin represents a specific range of values
determined by the quantiles. Data points within the
same bin share similar positions or ranks in the
dataset.
Quantile binning is particularly effective for handling
outliers and skewed distributions by partitioning the
data based on ranks rather than absolute values.
Perform Quantile Binning
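A minimal sketch of equal-frequency binning with KBinsDiscretizer; the skewed sample data, including one extreme outlier, is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical skewed data: nine small values and one extreme outlier.
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [1000]])

# Five bins with edges at the 20th, 40th, 60th and 80th percentiles.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
binned = disc.fit_transform(X).ravel()

# Each bin receives the same number of points (two here), so the
# outlier simply lands in the highest bin instead of stretching one.
print(np.bincount(binned.astype(int)))
```

Compare this with equal-width binning on the same data, where the outlier at 1000 would force almost every other value into the first bin.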
K-Means Binning
K-Means binning groups values into bins using the
k-means clustering algorithm: each value is assigned
to the bin of its nearest 1D cluster center, so bin
widths adapt to where the data is concentrated.
How to create Bins
The KBinsDiscretizer class from the sklearn library can be
used to create bins:
KBinsDiscretizer( n_bins=<number of bins>,
                  encode="ordinal",
                  strategy="kmeans" )
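Putting it together, a minimal runnable sketch of k-means binning; the two-cluster sample data is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical data with two natural groups of values.
X = np.array([[1], [2], [3], [50], [51], [52]])

# Two bins whose edge sits midway between the 1D k-means centers.
disc = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="kmeans")
binned = disc.fit_transform(X).ravel()

print(binned)  # the low values share one bin, the high values the other
```

Because the bin edge follows the cluster centers rather than the overall min-max range, the gap between 3 and 50 naturally becomes the boundary.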
Benefits of Binning
Binning can handle outliers by grouping them into the lower
or higher intervals, along with the remaining inlier values of
the distribution.
Binning can improve the spread of values by creating evenly
distributed intervals.
Disadvantage of Binning
Binning can increase Type I (false positive) and Type II (false
negative) errors. For example, suppose 10 falls in bin A while 12
and 20 fall in bin B. Although 12 is close to 10, binning treats
it as similar to 20 instead, which can distort downstream analysis.
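This boundary effect can be sketched with NumPy; the bin edges here are hypothetical:

```python
import numpy as np

values = np.array([10, 12, 20])

# Hypothetical bin edges: bin 0 covers [0, 11), bin 1 covers [11, 30).
edges = np.array([0, 11, 30])
bins = np.digitize(values, edges) - 1

print(bins)  # 10 falls in bin 0; 12 and 20 both fall in bin 1,
             # even though 12 is far closer to 10 than to 20
```

A distance of 2 (between 10 and 12) crosses a bin boundary, while a distance of 8 (between 12 and 20) does not, so the binned representation inverts their similarity.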