
What is Machine Learning?

Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.

- IBM

5
What is Machine Learning?
Machine Learning algorithms enable computers to learn from data, and even improve themselves, without being explicitly programmed

- Arthur Lee Samuel


6
Man Vs. Machine

7
Why Machine Learning?

VOLUMINOUS DATA

COMPUTATIONAL POWER

POWERFUL ALGORITHMS

8
ENIAC

9
Machine Learning Evolution

10
Applications of Machine Learning

11
Machine Learning Types

12
Machine Learning Types

13
Supervised Learning
In supervised learning, an AI system is presented with data that is labeled

14
The Goal of Supervised Learning
To approximate the mapping function so well that, when you have new input data (X), you can predict the output variable (Y) for that data

X → f(X) → Y

15
Types of Supervised Learning

Supervised Learning

● Classification: the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”
● Regression: the output variable is a real value, such as “dollars” or “weight”

16
Classification
It refers to a predictive modeling problem where a
class label is predicted for a given example of input
data

17
Classification Types

Binary Classification

Multiclass Classification

Multilabel Classification

Imbalanced Classification
18
Classification Modelling

(Flow: the original dataset is split into a training set (IV & DV) and a test set (IV); the training set goes through ML classification model training to produce a trained model, which outputs predicted labels for the test set)
19
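A minimal sketch of this workflow in Python, assuming scikit-learn is available; the dataset and the choice of a decision-tree classifier are just illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)            # features (IV) and labels (DV)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)    # training set and test set

model = DecisionTreeClassifier()             # ML classification model
model.fit(X_train, y_train)                  # model training -> trained model
predicted_labels = model.predict(X_test)     # predicted labels for the test set
```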
Regression
Regression predictive modeling is the task of
approximating a mapping function (f) from input
variables (X) to a continuous output variable (y)

(Flow: Size, Area, #Floors, #Bedrooms, Parking, … → Regression Model → Prediction (House Price))
20
Regression Modelling

(Flow: the original dataset is split into a training set (IV & DV) and a test set (IV); the training set goes through ML regression model training to produce a trained model, which outputs predicted real values for the test set)

21
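A minimal regression sketch along the lines of the house-price flow above; the feature values and prices below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: size (sq ft), #floors, #bedrooms (hypothetical values)
X = np.array([[1400, 1, 3], [1600, 2, 3], [1700, 2, 4], [1875, 2, 4]])
y = np.array([245000, 312000, 279000, 308000])   # house prices

model = LinearRegression().fit(X, y)             # train the regression model
print(model.predict([[1500, 1, 3]]))             # predicted price for a new house
```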
Unsupervised Learning
In unsupervised learning, an AI system is presented
with unlabeled, uncategorized data and the system’s
algorithms act on the data without prior training

22
Unsupervised Learning Types

Unsupervised Learning

● Clustering: discovering inherent groups in the data
● Association Rules: discovering rules that describe large portions of the data
● Dimensionality Reduction: reducing the number of input variables in a dataset

23
Clustering
● The method of identifying similar groups of data in a dataset is called clustering
○ Entities in each group are more similar to each other than to entities in other groups

24
Association Rules
● Association rule mining finds interesting associations and relationships among large sets of data items
○ A rule shows how frequently an itemset occurs in a transaction (e.g., Market Basket Analysis)

25
Dimensionality Reduction
● It refers to techniques for reducing the number of
input variables in training data
○ Reducing the dimensionality by projecting the data to a
lower dimensional subspace which captures the
“essence” of the data

26
Semi-supervised Learning
● A learning problem that involves a small number of
labeled examples and a large number of unlabeled
examples
○ It is used when working with data where labeling instances is challenging or expensive

27
Reinforcement Learning
● A reinforcement learning algorithm, or agent,
learns by interacting with its environment
○ The agent receives rewards for performing correctly and penalties for performing incorrectly
○ The agent learns without human intervention by maximizing its reward and minimizing its penalty

28
Reinforcement Learning

29
Machine Learning Terminologies
● ML Model
○ The learned program that maps inputs to predictions
○ Alternate Name: Predictor/Classifier/Regression Model

(Flow: Unseen Input → ML Model → Predictions)
30
Machine Learning Terminologies
● A table with the data from which the machine learns
○ The dataset contains the features and the target to predict
■ Features/Inputs/IV: the input columns
■ Output/Target/DV: the value to predict
■ Instance/Record: a single row of the dataset

31
Machine Learning Terminologies
● Training and Test Sets
(Flow: the original dataset is divided by an 80:20 split into a training set and a test set)
32
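The 80:20 split above, sketched with scikit-learn on a toy feature matrix (the data is illustrative).

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy feature matrix (IV)
y = np.arange(10)                  # toy target (DV)

# 80:20 split as in the diagram
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))   # 8 2
```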
Training Vs. Validation Vs. Test Sets
● Training Set
○ The sample of data used to fit the model

○ The model sees and learns from this data


● Validation Set
○ The sample of data used to provide an unbiased evaluation
of a model fit on the training dataset while tuning model
hyperparameters
● Test Set
○ The sample of data used to provide an unbiased evaluation
of a final model fit on the training dataset
33
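One common way to carve out a validation set in addition to the test set is to split twice; the 60/20/20 proportions here are just an example, not a rule from the slides.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(40).reshape(20, 2), np.arange(20)

# First split off 40%, then split that half into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 12 4 4
```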
Validation Set

34
Generalization
● Generalization is a term used to describe a model’s
ability to react to new data
○ after being trained on a training set, a model can digest
new data and make accurate predictions
○ a model’s ability to generalize is central to the success of
a model

35
Overfitting
● If a model has been trained too well on training
data, it will be unable to generalize
● It will make inaccurate predictions when given new
data, making the model useless even though it is
able to make accurate predictions for the training
data

36
Underfitting
● Underfitting happens when a model has not been
trained enough on the data
● In the case of underfitting, the model is just as useless: it is not capable of making accurate predictions, even on the training data

37
Underfit Vs. Balanced Vs. Overfit

38
Underfit Vs. Balanced Vs. Overfit

(Figure panels: Underfitting, Overfitting, Balanced fitting)


39
The Prediction Errors
• The prediction error for any machine learning
algorithm can be broken down into three parts:
• Bias Error
• Variance Error
• Irreducible Error
• The irreducible error cannot be reduced regardless
of what algorithm is used
• It is the error caused by factors like unknown variables
that influence the mapping of the input variables to the
output variable
40
Bias
● The difference between the average prediction of
our model and the correct value which we are
trying to predict
○ Model with high bias pays very little attention to the
training data and oversimplifies the model
○ It always leads to high error on training and test data

41
Variance
● Variance measures the amount that the outputs of
our model will change, if a different dataset is used
○ Models with high variance perform well on training data but have high error rates on test data
○ A high-variance model leads to overfitting

42
Bias – Variance Tradeoff

43
Bias – Variance Tradeoff

44
Bias – Variance Tradeoff

(Figure: Height vs. Weight scatter plot)

45
Bias – Variance Tradeoff

(Figure: Height vs. Weight scatter plot showing training-set and test-set points)

46
Bias – Variance Tradeoff
Training Set with a Linear Model
(Figure: Height vs. Weight scatter plot with a fitted straight line)

47
Bias – Variance Tradeoff
Training Set with a Complex Model
(Figure: Height vs. Weight scatter plot with a complex fitted curve)

48
Bias – Variance Tradeoff
Test Set with a Linear Model
(Figure: Height vs. Weight scatter plot)

49
Bias – Variance Tradeoff
Test Set with a Complex Model
(Figure: Height vs. Weight scatter plot)

50
ML – Performance Metrics
● The metrics that you choose to evaluate your
machine learning algorithms are very important
○ They influence
■ How you weigh the importance of different
characteristics in the results and
■ Your ultimate choice of algorithm

51
Classification Metrics
Confusion Matrix

Classification Accuracy

Classification Report

Area Under ROC Curve

Log Loss

52
Confusion Matrix
● A confusion matrix tabulates the predictions of a model against the actual class labels of the data points

                      Actual Positive        Actual Negative
Predicted Positive    TP (True Positives)    FP (False Positives)
Predicted Negative    FN (False Negatives)   TN (True Negatives)

53
Important Ratios from Confusion
Matrix
$$\mathrm{TPR} = \frac{\text{True Positives}}{\text{Actual Positives}} \qquad \mathrm{TNR} = \frac{\text{True Negatives}}{\text{Actual Negatives}}$$

$$\mathrm{FPR} = \frac{\text{False Positives}}{\text{Actual Negatives}} \qquad \mathrm{FNR} = \frac{\text{False Negatives}}{\text{Actual Positives}}$$

For a model to be smart:
• TP and TN should be as high as possible
• FP and FN should be minimized
• Also, the ratios TPR and TNR should be very high
• And FPR and FNR should be very low
54
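These four ratios can be computed directly from a confusion matrix; a sketch with scikit-learn on made-up labels (note that sklearn's matrix has rows = actual and columns = predicted, the transpose of the slide's layout).

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # toy actual labels
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # toy predicted labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)   # True Positive Rate  (TP / actual positives)
tnr = tn / (tn + fp)   # True Negative Rate  (TN / actual negatives)
fpr = fp / (fp + tn)   # False Positive Rate (FP / actual negatives)
fnr = fn / (fn + tp)   # False Negative Rate (FN / actual positives)
print(tpr, tnr, fpr, fnr)
```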
Classification Metrics - Accuracy
● Accuracy is what its literal meaning says, a measure
of how accurate your model is

$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$$

(Accuracy is algorithm- and dataset-specific)

● Accuracy from the Confusion Matrix

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

55
Classification Metrics - Precision
● It is the ratio of True Positives (TP) and the total
positive predictions
○ It tells us how often a positive prediction was actually positive

$$\text{Precision} = \frac{TP}{TP + FP}$$

56
Classification Metrics - Recall
● It is nothing but TPR (True Positive Rate explained
earlier), also known as Sensitivity
○ Out of all the actual positive points, it tells us how many were predicted positive

$$\text{Recall} = \frac{TP}{TP + FN}$$

57
Classification Metrics - Specificity
● In contrast to Recall, Specificity measures the
proportion of negatives that are correctly identified

$$\text{Specificity} = \frac{TN}{TN + FP}$$

(also known as the True Negative Rate)

58
Classification Metrics - F1 Score
● F1 Score is the Harmonic Mean between precision
and recall
● The range for F1 Score is [0, 1] and it tells you
○ how precise your classifier is (how many instances it
classifies correctly)
○ as well as how robust it is (it does not miss a significant
number of instances)

$$F_1 = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

59
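All of these metrics are available in scikit-learn; a sketch on the same toy labels used above (specificity is simply recall computed on the negative class).

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

print(accuracy_score(y_true, y_pred))                 # (TP + TN) / total
print(precision_score(y_true, y_pred))                # TP / (TP + FP)
print(recall_score(y_true, y_pred))                   # TP / (TP + FN), i.e. TPR
print(recall_score(y_true, y_pred, pos_label=0))      # specificity (TNR)
print(f1_score(y_true, y_pred))                       # harmonic mean of P and R
```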
Harmonic Vs. Arithmetic Mean
● Unlike the arithmetic mean, the harmonic mean punishes extreme values more
○ It is calculated by dividing the number of observations by the sum of the reciprocals of the numbers in the series
○ The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals

60
Area Under the ROC Curve
● It is used for binary classification problems
○ AUC of a classifier is equal to the probability that the
classifier will rank a randomly chosen positive example
higher than a randomly chosen negative example

FPR and TPR are both computed at varying threshold values (e.g., 0.00, 0.02, 0.04, …, 1.00) and a graph is drawn

A rough grading scale for AUC:
.90–1.00 = excellent (A)
.80–.90 = good (B)
.70–.80 = fair (C)
.60–.70 = poor (D)
.50–.60 = fail (F)
61
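ROC AUC needs predicted probabilities (scores), not hard labels; a sketch with scikit-learn on made-up scores.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # predicted P(positive), illustrative

print(roc_auc_score(y_true, y_score))        # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # (FPR, TPR) at varying thresholds
```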
Log Loss
● AUC ROC only takes into account the order of
probabilities
○ Hence, it does not take into account the model’s capability
to predict higher probability for samples more likely to be
positive
● The log loss is the negative average of the log of
corrected predicted probabilities for each instance
$$\text{Log Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \cdot \log\big(p(y_i)\big) + (1 - y_i)\cdot \log\big(1 - p(y_i)\big) \,\right]$$

• p(yi) is the predicted probability of the positive class
• 1 − p(yi) is the predicted probability of the negative class
• yi = 1 for the positive class and 0 for the negative class (actual values)
62
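A quick check of the formula against scikit-learn, on made-up probabilities.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
p      = np.array([0.9, 0.2, 0.6, 0.4])   # predicted P(class = 1), illustrative

# Negative average log of the "corrected" probabilities, per the formula above
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(manual, log_loss(y_true, p))         # the two values agree
```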
Regression Metrics

Mean Absolute Error (MAE)

Mean Squared Error (MSE)

Root Mean Squared Error (RMSE)

63
Mean Absolute Error
● The average of the absolute difference between the original values and the predicted values
○ It gives us a measure of how far the predictions were from the actual output
○ However, it doesn't give us any idea of the direction of the error, i.e., whether we are under-predicting or over-predicting the data

$$\text{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - y_i' \rvert$$

64
Mean Squared Error
● MSE takes the average of the square of the difference
between the original values and the predicted values
○ The advantage of MSE is it is easier to compute the
gradient, whereas Mean Absolute Error requires
complicated linear programming tools to compute the
gradient
$$\text{MSE} = \frac{1}{N}\sum_{i=1}^{N} (y_i - y_i')^2$$

○ As we take the square of the error, the effect of larger errors becomes more pronounced than smaller ones; hence the model can now focus more on the larger errors
65
Root Mean Squared Error (RMSE)
● It follows an assumption that errors are unbiased and follow a normal distribution
○ The square root lets the metric reflect large deviations
○ The 'squared' nature of this metric helps deliver more robust results, preventing positive and negative error values from cancelling out

$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - y_i')^2}$$
66
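The three regression metrics computed both by formula and via scikit-learn, on toy values.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y      = np.array([3.0, 5.0, 7.5, 10.0])   # actual values (illustrative)
y_pred = np.array([2.5, 5.0, 8.0, 11.0])   # predicted values

mae  = np.mean(np.abs(y - y_pred))          # matches mean_absolute_error(y, y_pred)
mse  = np.mean((y - y_pred) ** 2)           # matches mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)                         # square root of MSE
print(mae, mse, rmse)
```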
Data Types

Data types in the world of Machine Learning:

● Categorical: generally means everything else, and in particular discrete labeled groups
● Numerical: anything represented by numbers

67
Stevens’ typology of measurement scales
● Ratio (e.g., Age, Height, Weight)
○ Equal spaces between values
○ A meaningful zero value; mean makes sense
● Interval (e.g., Temperature in Celsius/Fahrenheit, IQ, Credit Score)
○ Equal spaces between values
○ No meaningful zero value; mean still makes sense
● Nominal (e.g., Gender, Ethnicity, Eye Color, Blood Type)
○ No numerical relationship between the different categories
○ Mean and median are meaningless
● Ordinal (e.g., Income Level, Level of Agreement)
○ 1st, 2nd, 3rd values, but not equal spaces between 1st and 2nd, and 2nd and 3rd
○ Median makes sense
68
Nominal/Ordinal Examples

(Figure panels: Nominal examples, Ordinal examples)

69
Nominal Vs. Ordinal Vs. Interval Vs. Ratio

70
A More Detailed Taxonomy
Types of Data
● Quantitative
○ Discrete
○ Continuous
■ Interval
■ Ratio
● Qualitative (non-numerical)
○ Nominal
○ Ordinal
71
Quantitative Vs. Qualitative
● Quantitative data seems to be the easiest to explain; it tries to find answers to questions such as
○ “how many”, “how much” and “how often”
● It can be expressed as a number, so it can be quantified

72
Quantitative Vs. Qualitative
● Qualitative data can’t be expressed as a number, so it
can’t be measured
○ It mainly consists of words, pictures, and symbols, but
not numbers
● It can answer questions like:
○ “how did this happen” or “why did this happen”

73
Categorical Data
● Categorical data represents characteristics.
○ Therefore it can represent things like a person’s gender,
language etc.
○ Categorical data can also take on numerical values
(Example: 1 for female and 0 for male)
● Two types of categorical data
○ Nominal
○ Ordinal

74
Categorical - Nominal
● Nominal values represent discrete units and are
used to label variables
○ Nominal data don’t have any order
○ Therefore, changing the order won’t change their value

75
Categorical - Ordinal
● Ordinal values represent discrete and ordered
units
○ It is therefore nearly the same as nominal data, except that its ordering matters

76
Numerical - Discrete
● We speak of discrete data if its values are distinct
and separate
○ In other words: We speak of discrete data if the data can
only take on certain values
○ This type of data can’t be measured but it can be counted
○ It basically represents information that can be categorized
into a classification
○ Example:
■ The number of students in a class
■ The number of workers in a company
■ The number of test questions you answered correctly
77
Numerical - Continuous
● Continuous Data represents measurements and
therefore their values can’t be counted but they
can be measured
○ Interval
■ Interval values represent ordered units that have the same
difference
○ Ratio
■ Ratio values are also ordered units that have the same
difference
■ Ratio values are the same as interval values, with the
difference that they do have an absolute zero
78
Interval Vs. Ratio

79
Statistical Descriptions of Data
● They help us measure some very special properties
of the data
● One such property is the central tendency
○ Measuring the central tendency helps us know, where
most of the data lies taking into account the whole set of
data

80
Central Tendency - Mean
● Mathematically, the mean of n values can be defined as:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

○ Suppose we have a dataset with an attribute “age” for, say, 100 people
○ The mean of the ages is equivalent to answering, “what age do most of the people belong to?”
○ It is not a good choice when there are some extreme values in the data
81
Central Tendency - Median
● When our dataset has skewness, calculating the
Median could prove to be more beneficial than
Mean
○ Median is defined as the centermost value of an ordered
numerical dataset

82
Central Tendency - Mode
● The mode for a set of data is the value that occurs most frequently in the set
○ Hence, it can be calculated for both qualitative and quantitative attributes
○ A dataset with two modes is known as bimodal
○ In general, a dataset with two or more modes is known as multimodal

83
Central Tendency – Mid Range
● This is defined as the average of the largest and smallest values in the set, i.e., (max + min) / 2

84
Dispersion of the Data
● The dispersion of data means the spread of data
● Measuring the dispersion of data
○ Let x1, x2, x3…xn be a set of observations for some numeric
attribute, X
○ The following terms are used for measuring the dispersion of data:
■ Range
■ Quantile
■ Interquartile Range (IQR)
■ Variance and Standard Deviation

85
Range
● It is defined as the difference between the largest
and smallest values in the set

5 8 9 4 3 2 7 12 15 6

Range = 15 – 2
= 13

86
Quantiles
● These are points taken at regular intervals of data
distribution, dividing it into essentially equal-size
consecutive sets.

Quantile 1 Quantile 2 Quantile 3 Quantile 4

2 3 4 5 7 9 11 13 15 22 24 27 30 31 35

The kth q-quantile for a given data distribution is the value x such that at most k/q of the data values are less than x and at most (q−k)/q of the data values are more than x, where k is an integer such that 0 < k < q. There are in total (q−1) q-quantiles.
87
Quartile – 4 Quantiles
Quartiles are the values that divide a list of numbers into quarters:
• Put the list of numbers in order
• Then cut the list into four equal parts
• The Quartiles are at the "cuts"

88
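A quick way to get the quartile “cuts” with NumPy, on the example list from the quantiles slide; note that several interpolation conventions exist, so values may differ slightly from hand calculation.

```python
import numpy as np

data = [2, 3, 4, 5, 7, 9, 11, 13, 15, 22, 24, 27, 30, 31, 35]
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)   # first quartile, median, third quartile
```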
Quartiles

89
Interquartile Range (IQR)
● The distance between the first and third quartiles is a
simple measure of the spread that gives the range
covered by the middle half of the data

90
Variance & Standard Deviation
● The Standard Deviation is a measure of how spread out
numbers are
● The Variance is defined as the average of the squared
differences from the Mean
○ The variance of N observations x₁, x₂, …, x_N for a numeric attribute X is:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$$

○ Mathematically, the standard deviation is defined as the square root of the variance
91
Example
● The heights (at the shoulders) of the dogs below are:
600mm, 470mm, 170mm, 430mm and 300mm

92
Example - Mean

93
Example - Variance

94
Example – Standard Deviation

95
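Reproducing the dog-height example with NumPy (heights taken from the slide).

```python
import numpy as np

heights = np.array([600, 470, 170, 430, 300])   # mm, from the slide
print(heights.mean())    # mean = 394.0
print(heights.var())     # population variance = 21704.0
print(heights.std())     # standard deviation ≈ 147.32
```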
Standard Deviation – a Look

96
Outliers
● An Outlier is a data object that deviates significantly
from the rest of the objects as if it were generated by a
different mechanism

97
Outlier Example

98
What if we remove the outlier?

99
Outlier Detection using Box Plot
● A box and whisker plot (also called a box plot) displays the five-number summary of a set of data
● Five number summary
○ Minimum
○ First quartile (Q1)
○ Median
○ Third quartile (Q3)
○ Maximum

100
Outlier Detection using Box Plot

101
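A common rule of thumb used with box plots (not stated on the slide, but standard): points beyond 1.5 × IQR from the quartiles are flagged as outliers. The data here is illustrative.

```python
import numpy as np

data = np.array([2, 3, 4, 5, 7, 9, 11, 13, 15, 22, 24, 27, 30, 31, 90])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                   # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker limits
print(data[(data < lower) | (data > upper)])    # flagged outliers (here: 90)
```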
Handling missing values in the dataset
● Real-world data will obviously have a lot of missing values
● Handling missing values:
○ Ignore the tuple with missing values
○ Use a measure of central tendency for the attribute to fill in the missing value
○ Use prediction techniques to fill in the missing value
● Handling missing data is important as many machine learning algorithms do not support data with missing values
102
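A sketch of the first two strategies with pandas; the column name and values are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40, np.nan]})

df_dropped = df.dropna()                              # option 1: ignore the tuples
df["age_filled"] = df["age"].fillna(df["age"].mean()) # option 2: fill with the mean
print(df)
```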
Diabetes Dataset
1. Number of times pregnant.
2. Plasma glucose concentration a 2 hours in an oral glucose
tolerance test.
3. Diastolic blood pressure (mm Hg).
4. Triceps skinfold thickness (mm).
5. 2-Hour serum insulin (mu U/ml).
6. Body mass index (weight in kg/(height in m)^2).
7. Diabetes pedigree function.
8. Age (years).
9. Class variable (0 or 1).
103
Removing noise from the data using the Binning
Technique
● What is defined as a noise in data?
○ Suppose that we have a dataset in which we have some
measured attributes
○ Now, these attributes might carry some random error or
variance
○ Such errors in attribute values are called noise in the data
● If such errors persist in our data, it will return
inaccurate results

104
Binning Vs. Encoding
● For a machine learning model, the dataset needs to be
processed in the form of numerical vectors to train it
using an ML algorithm
○ Feature Binning: Conversion of a continuous variable to
categorical
○ Feature Encoding: Conversion of a categorical variable to
numerical features


105
Binning Technique
● The set of data values is sorted in order, grouped into “buckets” or “bins”, and then each value in a particular bin is smoothed using its neighbors
○ The binning method is said to do local smoothing because it consults nearby values to smooth the values of the attribute
[4, 8, 15, 21, 21, 24, 25, 28, 34]

Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

106
Smoothing by bin means
● In this method, all the values of a particular bin are replaced
by the mean of the values of that particular bin
○ Mean of 4, 8, 15 = 9
○ Mean of 21, 21, 24 = 22
○ Mean of 25, 28, 34 = 29

Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

107
Smoothing by bin medians
● In this method, all the values of a particular bin are replaced
by the median of the values of that particular bin
○ Median of 4, 8, 15 = 8
○ Median of 21, 21, 24 = 21
○ Median of 25, 28, 34 = 28

Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28

108
Smoothing by bin boundaries
● In this method, each value in a bin is replaced by the closest bin boundary (the minimum or maximum value of that bin)

Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

109
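A small sketch implementing all three smoothing variants on the slides' bins (NumPy only; the rounding of means is exact here because each bin mean is an integer).

```python
import numpy as np

bins = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

by_means  = [[int(np.mean(b))] * len(b) for b in bins]     # -> 9s, 22s, 29s
by_median = [[int(np.median(b))] * len(b) for b in bins]   # -> 8s, 21s, 28s
# Boundaries: replace each value with the nearer of the bin's min/max
by_bound  = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]                                 # -> [4,4,15], [21,21,24], [25,25,34]
print(by_means, by_median, by_bound, sep="\n")
```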
Encoding
● Most of the ML algorithms cannot handle categorical
variables and hence it is important to do feature
encoding
Label Encoding

Ordinal Encoding

Frequency Encoding

Binary Encoding

One-hot Encoding

Target mean Encoding


110
Label Encoding
● Label Encoding is a popular encoding technique for
handling categorical variables
○ In this technique, each label is assigned a unique
integer based on alphabetical ordering

(Used for the target variable)

111
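Label encoding with scikit-learn: each label is mapped to an integer based on alphabetical order, and `classes_` shows the mapping (the labels below are illustrative).

```python
from sklearn.preprocessing import LabelEncoder

y = ["dog", "cat", "bird", "dog", "cat"]
le = LabelEncoder()
print(le.fit_transform(y))   # [2 1 0 2 1]
print(le.classes_)           # ['bird' 'cat' 'dog']
```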
Ordinal Encoding
● An ordinal encoding involves mapping each unique
label to an integer value
○ This type of encoding is really only appropriate if there is
a known relationship between the categories

(Used for features)

112
Frequency Encoding
● It transforms an original categorical variable to a
numerical variable by considering the frequency
distribution of the data
○ It can be useful for nominal features

113
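One simple way to do frequency encoding with pandas: map each category to its relative frequency in the data (the column name and values are hypothetical).

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "NY", "LA"]})
freq = df["city"].value_counts(normalize=True)   # relative frequencies
df["city_freq"] = df["city"].map(freq)           # NY -> 0.5, LA -> 0.33, SF -> 0.17
print(df)
```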
Binary encoding
● Binary encoding first assigns each value an integer label, then takes the binary representation of that integer and spreads its bits over separate columns

114
One hot encoding
● One-hot encoding splits the category column into one column per category
○ It creates n different columns, one per category, with a 1 in the column for the observed category and 0 in the rest

115
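One-hot encoding in one line with pandas (the column and values are illustrative).

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})
print(pd.get_dummies(df, columns=["color"], dtype=int))  # one 0/1 column per category
```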
Target Encoding
● Target encoding is the process of replacing a categorical value
with the mean of the target variable
○ Any non-categorical columns are automatically dropped by the
target encoder model

116
Feature Scaling
● Feature scaling means adjusting data that has
different scales so as to avoid biases from big
outliers
○ It standardizes the independent features present in the
data in a fixed range

117
Why Feature Scaling?
● Machine learning algorithm works on numbers and
has no knowledge of what that number represents
○ Many ML algorithms perform better when numerical input variables are scaled to a standard range
○ It is a crucial part of the data preprocessing stage

118
Will Feature Scaling Work for all ML
Algorithms?

It improves the performance of some machine learning algorithms and does not work at all for others
119
Why Feature Scaling?
● Gradient Descent Based Algorithms
○ ML algorithms like linear regression, logistic regression,
neural network, etc. that use gradient descent as an
optimization technique require data to be scaled

○ The presence of the feature value X in the formula affects the step size of the gradient descent
○ Having features on a similar scale can help the gradient descent converge more quickly towards the minima
120
Why Feature Scaling?
● Distance-Based Algorithms
○ Distance algorithms like KNN, K-means, and SVM are
mostly affected by the range of features
○ This is because behind the scenes they are using distances
between data points to determine their similarity

121
Why Feature Scaling?
● Tree-Based Algorithms
○ They are fairly insensitive to the scale of the features
○ Think about it, a decision tree is only splitting a node
based on a single feature
○ This split on a feature is not influenced by other features

122
Feature Scaling Categories

Feature Scaling

Normalization Standardization

123
Normalization
● A scaling technique in which values are shifted and
rescaled so that they end up ranging between 0
and 1
○ It is also known as Min-Max scaling
○ Here’s the formula for normalization:

$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

124
Standardization
● Standardization is another scaling technique where
the values are centered around the mean with a
unit standard deviation
○ This means that the mean of the attribute becomes zero
and the resultant distribution has a unit standard
deviation
○ Here’s the formula for standardization:

$$X' = \frac{X - \mu}{\sigma}$$
125
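Both scaling techniques are available in scikit-learn; a sketch matching the two formulas above, on toy data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # (X - min) / (max - min), in [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # (X - mean) / std: zero mean, unit std
```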
Normalization or Standardization?
● Normalization is good to use when you know that the
distribution of your data does not follow a Gaussian
distribution
● Standardization, on the other hand, can be helpful in cases
where the data follows a Gaussian distribution

○ However, this does not necessarily have to be true


● However, at the end of the day, the choice of using
normalization or standardization will depend on your
problem and the machine learning algorithm you are using

126
Covariance
● Variables may change in relation to each other
● Covariance measures how much the movement in one
variable predicts the movement in a corresponding
variable

127
Covariance

128
Smoking v Lung Capacity Data

The variables Cigarettes and Lung Capacity covary inversely

When smoking is above its group mean, lung capacity tends to be below its group mean.
129
Calculating Covariance

130
Calculating Covariance

131
Covariance Calculation

132
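A covariance sketch with NumPy; the numbers below are made up (the slides' actual smoking/lung-capacity table is not reproduced here), but they show the same inverse covariation.

```python
import numpy as np

cigarettes    = np.array([0, 5, 10, 15, 20])     # hypothetical values
lung_capacity = np.array([45, 42, 33, 31, 29])   # hypothetical values

cov_matrix = np.cov(cigarettes, lung_capacity)   # 2x2 sample covariance matrix
print(cov_matrix[0, 1])                          # negative -> inverse covariation
```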
Calculating Correlation

133
Calculating Correlation

134
Calculating Correlation

Greater smoking exposure implies a greater likelihood of lung damage
135
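Pearson correlation with NumPy, on the same illustrative arrays as the covariance sketch above.

```python
import numpy as np

cigarettes    = np.array([0, 5, 10, 15, 20])
lung_capacity = np.array([45, 42, 33, 31, 29])

r = np.corrcoef(cigarettes, lung_capacity)[0, 1]
print(r)   # close to -1: a strong inverse linear relationship
```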
Different Correlation Values

136
Correlation is not Causation

137
Correlation Is Not Good at Curves

138
