
INTRODUCTION TO

AI AND MACHINE
LEARNING
UNDERSTANDING REGRESSION
• What is regression
• Linear regression
• Implementation issues
WHAT IS REGRESSION

Regression is a method to find the relationship between
independent variables and a dependent variable.

It is very well understood and mathematically defined.

It is used in supervised learning, as labelled data are required
to create and train the model.

INTRO TO AI & ML | DIP IN BORDER SECURITY | SINGAPORE POLYTECHNIC


WHAT IS REGRESSION

Regression can be broadly classified into:


▪ Linear Regression
▪ The model created is linear (i.e. a line) between the input variable(s) and the output variable
▪ The output (dependent) variable is continuous, while the input variable(s) can be either
categorical or continuous

▪ Logistic Regression
▪ The model created is usually sigmoidal (i.e. S-shaped) between the input variable(s) and the
output variable
▪ The output variable is usually categorical
▪ Binary Logistic Regression – only 2 possible outcomes in the output (i.e. Success / Failure)
▪ Multinomial Logistic Regression – >2 possible outcomes in the output + no ordering
▪ Ordinal Logistic Regression – >2 possible outcomes in the output + an order associated with the output
WHAT IS REGRESSION
REGRESSION VS CLASSIFICATION
▪ Regression (the method) studies the relationship between inputs
and outputs, BUT the same method can also be used as a
classifier (i.e. logistic regression)
▪ The objective (or problem) determines which regression method
is suitable.

▪ We will use linear regression in the rest of this lesson: since it
is not suitable as a classifier, this reduces possible confusion
between the objective (of finding a relationship) and the method
Source : https://kindsonthegenius.com/blog/what-is-the-difference-between-classification-and-regression/
HOW LINEAR REGRESSION WORKS
SINGLE DIMENSION LINEAR REGRESSION

▪ For example:
▪ X (independent variable)
▪ Y (dependent variable)

▪ How do we draw the line that gives the best fit?
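The best-fit line can be computed directly with the ordinary least squares formulas. A minimal sketch in plain Python, assuming small illustrative data (the function name `fit_line` is ours):

```python
# Single-dimension linear regression by ordinary least squares:
# slope w = cov(x, y) / var(x), intercept b = mean(y) - w * mean(x)
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - w * mx
    return w, b

w, b = fit_line([1, 2, 3], [2, 4, 6])   # points on y = 2x
```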



HOW LINEAR REGRESSION WORKS
FINDING THE BEST LINE

▪ To get the “best fit” for the training samples, the line should
minimize the error between the observed y and predicted ŷ
values in the training data.



HOW LINEAR REGRESSION WORKS
FINDING THE BEST LINE



HOW LINEAR REGRESSION WORKS
HOW GOOD IS THE FIT

Linear regression uses 2 metrics to check how “good” the model is

▪ Root Mean Squared Error (RMSE)
▪ How erroneous the model’s predictions are when compared with the actual
observed values
▪ High RMSE – 👎, Low RMSE – 👍

▪ Coefficient of determination (R²)
▪ Measures the strength of the relationship between the response and the predictor
variables in the model
▪ If R² = 0.65, the predictor variables explain 65% of the variance in the
response variable
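Both metrics are easy to compute by hand. A sketch, assuming `y_true` and `y_pred` are plain Python lists (the function names are ours):

```python
import math

def rmse(y_true, y_pred):
    # square root of the mean squared difference between observed and predicted
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    # 1 - (residual sum of squares / total sum of squares)
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```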
Source : https://medium.com/wwblog/evaluating-regression-models-using-rmse-and-r%C2%B2-42f77400efee
HOW LINEAR REGRESSION WORKS
HOW GOOD IS THE FIT

▪ RMSE
▪ Range is dependent on the response (output) variable

▪ R²
▪ Range 0 - 1
▪ General formula (there is another formula for the entire population)



HOW LINEAR REGRESSION WORKS
HOW GOOD IS THE FIT

These are the possible outcomes:


▪ Low RMSE, high R² (the best case)
▪ Low RMSE, low R²
▪ High RMSE, high R²
▪ High RMSE, low R² (the worst case)



HOW LINEAR REGRESSION WORKS
HOW GOOD IS THE FIT

Low RMSE, high R² (the best case)

▪ R² = 0.98
▪ Knowing X will help in predicting Y

▪ RMSE = 5.1 (errors range from about -10 to 10)
▪ The Y values are in the order of 10²
(5 vs 100-300), hence the RMSE is small


HOW LINEAR REGRESSION WORKS
HOW GOOD IS THE FIT

Low RMSE, low R²

▪ R² = 0
▪ Knowing X is useless for predicting Y (all
values are around 300)

▪ RMSE = 5 (errors range from about -10 to 10)
▪ The Y values are in the order of 10²
(5 vs 300), hence the RMSE is small



HOW LINEAR REGRESSION WORKS
HOW GOOD IS THE FIT

▪ High RMSE, high R²
▪ With a high R², there is value in the prediction
▪ But the high error (RMSE) results in inaccurate predicted values (output)

▪ High RMSE, low R² (the worst case)
▪ The prediction is groundless
▪ The error is high



ACTIVITY
LINEAR REGRESSION CALCULATOR

X Y
1 2
2 3
3 6
4 7
5 9
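To check the result from an online calculator, the least squares formulas can be applied to this table directly; a sketch:

```python
# Least squares fit for the activity data above
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 6, 7, 9]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
# slope ≈ 1.8, intercept ≈ 0.0, i.e. y ≈ 1.8x
```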



HOW LINEAR REGRESSION WORKS
MULTI DIMENSION LINEAR REGRESSION

In real life, multiple input variables are often used.

For example:
X1 = Distance from MRT station
X2 = House age
Y = Price ($ psf)

The inputs / outputs are stored as matrices, and the
equation needs multiple coefficients.



HOW LINEAR REGRESSION WORKS
MULTI DIMENSION LINEAR REGRESSION

▪ Each x variable will have its own w (another
matrix)
▪ The error E can be calculated
▪ By setting the partial derivative with respect to
each w to 0, each coefficient for the respective
x variable can be optimized.
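This matrix derivation can be sketched with NumPy; the housing-style data below is made up for illustration, and column 0 is the bias term:

```python
import numpy as np

# Multi-dimension linear regression: setting the partial derivatives of the
# squared error E to zero leads to the least squares solution of X w = y
X = np.array([[1.0, 0.5, 10.0],    # [bias, distance from MRT (km), house age]
              [1.0, 1.2, 30.0],
              [1.0, 2.0,  5.0],
              [1.0, 0.8, 20.0]])
y = np.array([1500.0, 1100.0, 1200.0, 1350.0])   # price ($ psf), made up

w, *_ = np.linalg.lstsq(X, y, rcond=None)   # one coefficient per column
y_hat = X @ w                               # model predictions
```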



HOW LINEAR REGRESSION WORKS
MULTI DIMENSION LINEAR REGRESSION

▪ Once we have computed the coefficients and
verified the goodness of fit (RMSE and R²),
the modelling is done.

▪ Typically, it is useful to plot the data on a
graph with the prediction plane to inspect
the model

Source: https://towardsdatascience.com/linear-regression-made-easy-how-does-it-work-and-how-to-use-it-in-python-be0799d2f159
HOW LINEAR REGRESSION WORKS
MULTI DIMENSION LINEAR REGRESSION

▪ Additional analysis work
▪ Create and use derived features (columns) (i.e. age instead of
year-of-birth)
▪ Derived features (x0 .. x3, in this case) could also be powers of a basic
feature (x), making the model a polynomial

▪ Multi-dimensional linear regression can thus fit polynomial curves,
planes and even hyper-planes



HOW LINEAR REGRESSION WORKS
GRADIENT DESCENT

▪ Gradient descent is an optimization technique used to find the
minimum of arbitrarily complex error functions.
▪ It is easy to understand and widely used in ML techniques.
▪ Given the error function, it can find the weights that give the
lowest errors.
▪ Not all error functions can be minimized analytically by
differentiation, hence gradient descent is used.



HOW LINEAR REGRESSION WORKS
GRADIENT DESCENT

▪ Steps
▪ Pick a random set of weights
▪ Iteratively adjust the weights in the direction of the gradient of the error
▪ When the gradient approaches 0 → minimum error, convergence
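The steps above can be sketched for single-dimension linear regression; the toy data, learning rate and iteration count are our assumptions:

```python
import random

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]                 # generated from y = 2x + 1

random.seed(0)
w, b = random.random(), random.random()   # step 1: random initial weights
lr = 0.05                                 # learning rate

for _ in range(5000):                     # step 2: iterate
    # gradients of the mean squared error with respect to w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b
    if abs(grad_w) < 1e-9 and abs(grad_b) < 1e-9:   # step 3: convergence
        break
# w approaches 2 and b approaches 1
```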

Source : https://towardsdatascience.com/how-to-do-linear-regression-using-gradient-descent-79a2ff4ace05


HOW LINEAR REGRESSION WORKS
GRADIENT DESCENT CONSIDERATIONS

▪ The learning rate is an example of a hyper-parameter

▪ Choosing a suitable value for this hyper-parameter matters:
▪ Too small – convergence takes a very long time
▪ Too large – convergence may never happen, as the iterations
bounce between different sides of the minima
▪ Many tools can automatically determine whether gradient
descent has converged (e.g. Orange3)



HOW LINEAR REGRESSION WORKS
GRADIENT DESCENT CONSIDERATIONS

▪ If convergence needs to be detected manually, plot
▪ iteration number (x axis)
vs
▪ cost function (y axis)

▪ Typically, the graph will look as follows



HOW LINEAR REGRESSION WORKS
GRADIENT DESCENT CONSIDERATIONS

▪ Batch gradient descent
▪ Take the average of the gradients over all the training examples (the whole dataset)
▪ Use the mean gradient to update the weights

▪ Stochastic gradient descent (SGD)
▪ Take 1 training example
▪ Calculate the gradient and update the weights
▪ Repeat with the next example (for all examples)
▪ The cost function will decrease overall but will fluctuate
▪ May keep dancing around and never reach the minima

▪ Mini-batch gradient descent
▪ Like SGD, except in batches of 50-256 examples
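A sketch contrasting the three schemes, assuming toy data; the only difference between them is how many examples feed each weight update (the function names are ours):

```python
import random

xs = list(range(1, 6))
ys = [2 * x + 1 for x in xs]            # toy data on y = 2x + 1
data = list(zip(xs, ys))

def grad(batch, w, b):
    # mean gradient of the squared error over the given examples
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb

def train(batch_size, lr=0.02, epochs=3000, seed=0):
    rng = random.Random(seed)
    w = b = 0.0
    rows = data[:]
    for _ in range(epochs):
        rng.shuffle(rows)
        # batch_size == len(rows) -> batch GD; 1 -> SGD; else mini-batch
        for i in range(0, len(rows), batch_size):
            gw, gb = grad(rows[i:i + batch_size], w, b)
            w -= lr * gw
            b -= lr * gb
    return w, b
```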
IMPLEMENTING LINEAR REGRESSION

▪ Generalization
▪ Over-fitting
▪ Regularization



IMPLEMENTING LINEAR REGRESSION
OVERFITTING AND GENERALIZATION

▪ Assumption
▪ Limited dataset
▪ The system could “remember” all the data
points

▪ Over-fitting (green line)
▪ Perfect prediction for the known training data
▪ Likely not as good for unseen data

▪ Generalization (black line)
▪ More suitable for new, unseen data

Source : https://en.wikipedia.org/wiki/Overfitting
IMPLEMENTING LINEAR REGRESSION
OVERFITTING AND GENERALIZATION

▪ “Over-fitting” will increase the generalization error

▪ To reduce the generalization error, we should
▪ Collect as many samples as possible
▪ Use a random subset of the data for training
▪ Keep the training set separate from the test set
▪ Experiment with adding higher-degree polynomial terms (x¹, x², x³)
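The "random subset for training, separate test set" advice can be sketched as a small split helper; the function name and 80/20 ratio are our choices:

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    rows = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)   # random subset, reproducible via seed
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]       # disjoint train and test sets

train, test = train_test_split(list(range(100)))
```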



IMPLEMENTING LINEAR REGRESSION
L1 REGULARIZATION (LASSO)

▪ L1 regularization is also known as Lasso regression (Least
Absolute Shrinkage and Selection Operator)

▪ It shrinks the less important features’ coefficients to 0
▪ Effectively removing the low-impact feature(s)

▪ L1 regularization encourages only a few coefficients to be
non-zero; many are zero.
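The shrink-to-zero behaviour can be sketched with proximal gradient descent (one standard way to fit a Lasso, not necessarily the one a given tool uses); the data is made up so that the second feature is irrelevant to y:

```python
# L1-regularized linear regression via proximal gradient descent (ISTA)
def soft(v, t):
    # soft-thresholding: the proximal operator of the L1 penalty
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

X = [[1.0, 0.1], [2.0, -0.1], [3.0, 0.1], [4.0, -0.1]]
y = [3.0, 6.0, 9.0, 12.0]        # depends only on the first feature (y = 3*x1)
lam, lr = 0.1, 0.05
w = [0.0, 0.0]
for _ in range(2000):
    g = [0.0, 0.0]               # gradient of the mean squared error
    for row, target in zip(X, y):
        err = sum(wj * xj for wj, xj in zip(w, row)) - target
        for j in range(2):
            g[j] += 2 * err * row[j] / len(X)
    w = [soft(wj - lr * gj, lr * lam) for wj, gj in zip(w, g)]
# the irrelevant feature's coefficient is driven exactly to 0
```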



IMPLEMENTING LINEAR REGRESSION
L2 REGULARIZATION (RIDGE)

▪ L2 regularization is also known as Ridge regression

▪ It reduces model complexity by preventing over-fitting to outliers

▪ It adds an additional term to the cost function that penalizes
large weights, thereby minimizing this skew.
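With the penalty lam·‖w‖² added to the squared-error cost, setting the gradient to zero gives the closed form w = (XᵀX + lam·I)⁻¹Xᵀy. A sketch on made-up data showing the weights shrinking as lam grows:

```python
import numpy as np

def ridge(X, y, lam):
    # closed-form ridge solution: w = (X^T X + lam*I)^(-1) X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([8.0, 7.0, 18.0, 17.0])     # generated from y = 2*x1 + 3*x2

w_plain = ridge(X, y, 0.0)       # lam = 0: plain least squares
w_shrunk = ridge(X, y, 100.0)    # large lam: noticeably smaller weights
```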



IMPLEMENTING LINEAR REGRESSION
CATEGORICAL INPUTS

▪ For inputs that are categories (e.g. gender) rather than
numbers, we represent the category values as a one-hot
encoding for use in the linear regression equations (if there is
no built-in support from the data analysis tool).

▪ Examples
▪ Male = x1
▪ Female = x2
▪ Other features = x3 …
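A minimal sketch of one-hot encoding a categorical column by hand, for tools with no built-in support; the function name is ours, and categories are sorted for a deterministic column order:

```python
def one_hot(values):
    # one column per distinct category; exactly one 1 per row
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

one_hot(["Male", "Female", "Female", "Male"])
# "Female" sorts before "Male" -> [[0, 1], [1, 0], [1, 0], [0, 1]]
```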
REGRESSION APPLICATIONS

Regression is usually used for

1. Forecasting
2. The Capital Asset Pricing Model (CAPM)
▪ Establishes the link between an asset's projected return and the
related market risk premium, using a linear regression model
3. Identifying problems
▪ Based on regression and other statistical analysis of in-house data
4. Comparing with the competition
END OF LESSON 3
