L03_The_Regression_Pipeline_-_2

UCCD2063
Artificial Intelligence Techniques
Unit 03:
The Regression Pipeline – 2/2
Machine Learning Pipeline – Regression
1. Look at the Big Picture

2. Get Data
3. Explore Data
4. Prepare Data
5. Select & train model
6. Fine-tune the model
7. Launch, monitor & maintain
Phase 5: Select & train model
▪ Select suitable machine learning algorithm(s)
▪ Select a suitable performance measure
▪ Train a model
▪ Evaluate the model
Task – Regression
▪ Recall that our task is to predict the median house price based on
information (median income, total bedrooms, ocean proximity, ...) of a
district
$\hat{y} = h(\mathbf{x})$
• $\hat{y}$ is the predicted output (median house price)
• $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ are the input features
• $h$ is the model to be learned by machine learning
[Figure: median house price plotted against x, with the fitted model h(x)]
Selecting machine learning algorithms
Choose the appropriate machine learning algorithms for h(x)
▪ Linear Models
• Linear Regression
• Logistic Regression (for classification)
▪ Support Vector Machine (SVM)
▪ Neural Networks (NN)
▪ Decision Tree
▪ Random Forest
▪ K-Nearest Neighbours (k-NN)
▪ More…
Note that most machine learning algorithms can be used for both
regression (continuous output) and classification (discrete output)
Linear Model
▪ A linear model models the data using the hypothesis function:
$\hat{y} = h_\theta(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$
• $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ is the input
• $n$ is the number of features (the dimension of the input)
• $\theta_j$ are the model parameters or weights (the model is also written as $W \cdot X + b$)
▪ A linear model describes a response variable as a linear combination
of the predictor variables.
Linear Regression
▪ Linear Regression finds the linear model that minimizes the sum of
squared prediction errors
• Squared prediction error = (true value – predicted value)^2
$\hat{y} = h_\theta(x) = \theta_0 + \theta_1 x$
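As a minimal sketch of what "minimizing the sum of squared errors" means for a single feature, the closed-form least-squares solution can be computed directly. The sample points and the helper names fit_line and sum_squared_errors are made up for illustration and are not from the lecture.

```python
# Minimal sketch: fit y ≈ theta0 + theta1 * x by least squares (one feature).
# The data points below are hypothetical.

def fit_line(xs, ys):
    """Return (theta0, theta1) that minimize the sum of squared errors."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    # Closed-form least-squares solution for a single feature
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


def sum_squared_errors(theta0, theta1, xs, ys):
    """Sum of (true value - predicted value)^2 over all samples."""
    return sum((y - (theta0 + theta1 * x)) ** 2 for x, y in zip(xs, ys))


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

theta0, theta1 = fit_line(xs, ys)
sse_fit = sum_squared_errors(theta0, theta1, xs, ys)
sse_other = sum_squared_errors(1.0, 2.0, xs, ys)  # any other line does no better
print(theta0, theta1, sse_fit, sse_other)
```

Any other choice of parameters, such as the line y = 1 + 2x tried above, gives a sum of squared errors at least as large as the fitted one.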
Linear Regression
▪ For multi-dimensional data, the linear model is a hyperplane
$\text{Income} = \theta_0 + \theta_1 \cdot \text{year} + \theta_2 \cdot \text{seniority}$
What if the data is not linear?
▪ A linear model underfits!
▪ Alternative: Polynomial Regression
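The underfitting of a straight line on curved data can be illustrated with a small sketch. It assumes NumPy is available and uses a made-up, noise-free quadratic as the ground truth; rmse_of_fit is a hypothetical helper name.

```python
import numpy as np

# Hypothetical non-linear data: a noise-free quadratic
x = np.linspace(-3, 3, 30)
y = 0.5 * x**2 + 1.0


def rmse_of_fit(degree):
    """Fit a polynomial of the given degree by least squares and return its training RMSE."""
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))


rmse_linear = rmse_of_fit(1)      # straight line: cannot follow the curve
rmse_quadratic = rmse_of_fit(2)   # polynomial regression of degree 2
print(rmse_linear, rmse_quadratic)
```

The degree-1 fit leaves a large error because no straight line can follow the curve, while the degree-2 fit matches the data almost exactly.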
Linear Classification (Logistic Regression)
▪ A linear model can also be used to categorize a set of data points
into discrete classes (more on classification in the next lecture)
Find a linear decision boundary that produces the least classification error.
Support Vector Machine
▪ Support Vector Machine (SVM) is a linear model that finds the
hyperplane that maximizes the margin between two distinct sets
of data points.
▪ The vectors (data points) that influence the position and
orientation of the hyperplane are the support vectors.
▪ The SVM algorithm minimizes $\|W\|$ subject to $y\,(W \cdot X + b) \ge 1$.
[Figure: separating hyperplane with margin boundaries $W \cdot X + b = 1$ and $W \cdot X + b = -1$; maximize the margin (or minimize $\|W\|$)]
▪ The C parameter controls the tradeoff between training
classification error and margin error.
Training error = C × classification error + margin error
Non-linear SVM
▪ The basic idea is to transform x to a higher-dimensional space
using a kernel function so that the data points are linearly separable in the
transformed space.
SVM hyperparameters: regularization parameter C, kernel function, gamma, …
▪ SVM is commonly used for classification tasks but can also be
used for regression.
▪ Support Vector Regression (SVR): find the line (hyperplane) that
contains the maximum number of points within a margin.
Artificial Neural Networks
▪ An artificial neural network (ANN) is a network of simple elements
called artificial neurons, which receive inputs and produce an
output that depends on the input weights and the activation function.
[Figure: An artificial neuron]
Neural Networks
▪ An artificial neural network typically consists of multiple layers of
interconnected neurons; such a network is called a multilayer perceptron (MLP).
[Figure: a network mapping input X to output y; the weights are the network parameters, to be learned from training data]
$a^{l} = f\left(\sum_{i} w_{i}^{l}\, a_{i}^{l-1} + b^{l}\right)$
Neural Network Training
▪ A neural network is trained using two passes:
• Forward pass – make a prediction and compute the prediction loss from the difference (Y' – Y)
• Backpropagation – adjust all weights layer by layer to reduce the loss using gradient
descent
NN training hyperparameters: learning rate, number of epochs, momentum, …

Decision Tree
▪ Like a human, a decision tree algorithm makes a prediction by
following a sequence of decisions based on the input information
▪ Example: Should we play golf today?
One possibility:
if outlook is sunny, and if humidity is normal
then play
Decision Tree
▪ A decision tree is a tree representation where each node
represents a feature, each branch represents a decision (rule)
and each leaf represents an outcome (class or value).
▪ A path from the root node to a leaf node produces a prediction
[Figure: tree diagram labelling a node (feature), a branch (decision) and a leaf node (outcome)]
Note: continuous features may be discretized in advance.

How to select a feature to split?
▪ Select the feature whose split maximizes:
• Classification: entropy reduction (ER) or reduction in Gini index
• Regression: standard deviation reduction
[Figure: Play golf dataset with the ER of four candidate splits: 0.247, 0.152, 0.029, 0.048]
Continuing to split
[Figure: ER of the candidate splits at the next level: 0.571, 0.020, 0.971]
• Stopping criteria:
– All samples for a given node belong to the same class (pure)
– A predefined threshold is reached, e.g. a minimum number of samples
– There are no remaining attributes for further partitioning
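The entropy reduction of a split can be sketched as follows. The class counts are an assumption: they come from the classic 14-sample play golf dataset (9 "play", 5 "don't play", split on outlook), which the slide figure appears to use but does not list. The function names are illustrative.

```python
from math import log2


def entropy(counts):
    """Entropy (in bits) of a node given its class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def entropy_reduction(parent_counts, child_counts_list):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent_counts)
    weighted = sum(
        sum(child) / total * entropy(child) for child in child_counts_list
    )
    return entropy(parent_counts) - weighted


# Assumed counts [play, don't play] from the classic play golf data
parent = [9, 5]
outlook_split = [[2, 3], [4, 0], [3, 2]]  # sunny, overcast, rainy

er_outlook = entropy_reduction(parent, outlook_split)
print(round(er_outlook, 3))  # 0.247
```

Under these assumed counts the result matches the largest ER value shown for the first split, so outlook would be chosen as the root feature.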
Decision Tree
▪ Can operate on both numerical and nominal features.
▪ Does not require any assumptions about the distributions or
the independence of attribute values.
▪ Often easy to interpret and can be converted to if-then
rules.
• By tracing the tree from the root node to each leaf node and
forming logical expressions.
▪ May produce too many branches and overfit the training
data.
• Possible solutions: control the depth of the tree, limit the
number of leaf nodes, prune excessive branches
Decision Tree hyperparameters: max_depth, max_features, max_leaf_nodes, min_samples_split, …

Random Forest Algorithm
▪ Random Forest is an ensemble of Decision Trees.
▪ Random forest builds multiple decision trees and merges them
together to get a more accurate and stable prediction.
k-Nearest Neighbour
▪ k-Nearest Neighbour (k-NN) is a non-parametric, instance-based
learning algorithm
• The instances themselves represent the knowledge
▪ k-NN returns the most common value among the k training
examples nearest to the test sample
• k is usually chosen to be an odd number, which avoids tied votes in two-class problems
▪ A decision is made by examining the labels of the k nearest
neighbours and taking a vote
k-Nearest Neighbour
▪ k-NN can be used for both classification and regression.
• Classification: classify an object by a majority vote of its k nearest
neighbours
• Regression: the prediction is the average of the values of the k nearest neighbours
k-NN drawbacks: requires a large amount of memory to store all training samples, and prediction is slow.
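Both uses of k-NN can be sketched in a few lines of plain Python. This is an illustrative toy with one-dimensional inputs and made-up samples, not the lecture's implementation; the function names are hypothetical.

```python
from collections import Counter


def k_nearest(train_x, train_y, query, k):
    """Return the targets of the k training samples closest to the query (1-D inputs)."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))
    return [target for _, target in ranked[:k]]


def knn_classify(train_x, labels, query, k=3):
    """Classification: majority vote among the k nearest neighbours."""
    votes = Counter(k_nearest(train_x, labels, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_x, values, query, k=3):
    """Regression: average of the k nearest neighbours' values."""
    neighbours = k_nearest(train_x, values, query, k)
    return sum(neighbours) / len(neighbours)


# Hypothetical training samples
train_x = [1.0, 2.0, 3.0, 6.0, 7.0]
labels = ["cheap", "cheap", "cheap", "pricey", "pricey"]
values = [10.0, 12.0, 14.0, 30.0, 32.0]

predicted_class = knn_classify(train_x, labels, query=2.2, k=3)
predicted_value = knn_regress(train_x, values, query=2.2, k=3)
print(predicted_class, predicted_value)
```

Note that every prediction scans and sorts the whole training set, which is the memory and prediction-time cost mentioned above.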
Selecting machine learning algorithms
▪ No Free Lunch Theorem [D. Wolpert, 1996]
• There is no one model that works best for every problem. The
assumptions of a model that work for one problem may not hold for
another problem.
▪ Strategy:
• Try out different models from various categories of ML
algorithms.
• Do not spend too much time tweaking the hyperparameters
• Shortlist a few (2 to 5) promising models.
We will try 3 models on the “housing” dataset

• Linear Regression
• Decision Tree
• Random Forest
Performance measure for regression
▪ Select a suitable performance measure – it gives an idea of how much
error the system typically makes in its predictions
• Allows us to evaluate which model is more desirable
• Consider the following three models:
[Figure: three fitted models on the same data, labelled "Best", "Bad" and "Not so good"; both axes run from 0 to 3]
• A good performance measure must capture how well the learnt model predicts:
for the true value $y$ and the prediction $\hat{y} = h(X)$, $|y - \hat{y}|$ must be small
Root Mean Square Error (RMSE)
$RMSE(\mathbf{X}, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2}$
Given the following samples, compute the RMSE for models L1 and L2
x    y     h1   h2
0    0.5   0    2.5
1    2     1    1.5
2    2.5   2    0.5
[Figure: the samples (X) and the lines L1 and L2, plotted as h(x) against x]
$RMSE(L1) = \sqrt{\frac{1}{3}\left[(0 - 0.5)^2 + (1 - 2)^2 + (2 - 2.5)^2\right]} = 0.7071$
$RMSE(L2) = \sqrt{\frac{1}{3}\left[(2.5 - 0.5)^2 + (1.5 - 2)^2 + (0.5 - 2.5)^2\right]} = 1.6583$
The RMSE for L1 is smaller than for L2 → L1 is the better model
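The worked RMSE example can be checked with a short sketch. The sample values are the ones from the table above; the function name rmse is our own.

```python
from math import sqrt


def rmse(predictions, targets):
    """Root mean square error between predictions and true values."""
    m = len(targets)
    return sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / m)


# Samples from the worked example: true values y and the two models' predictions
y = [0.5, 2.0, 2.5]
h1 = [0.0, 1.0, 2.0]   # model L1
h2 = [2.5, 1.5, 0.5]   # model L2

rmse_l1 = rmse(h1, y)
rmse_l2 = rmse(h2, y)
print(round(rmse_l1, 4), round(rmse_l2, 4))  # 0.7071 1.6583
```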
Build a model
Model #1: LinearRegression Model

▪ First, build a linear regression model
LinearRegression is a predictor. It is used to learn a linear model.
Prediction
▪ Next, use the trained model to make predictions and display the
predicted values for 10 random training samples
Evaluate the model

▪ Then, we evaluate our model using RMSE
RMSE = 67862.0057
This is not a great score: the average median housing value is $206855,
so the relative error is 67862/206855 = 33%.
This is an example of model underfitting.
mean_squared_error computes the mean squared error given the actual and predicted labels
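The build, predict and evaluate steps for Model #1 might look like the sketch below. The lecture's own code is not reproduced here, so this is an assumption-laden stand-in: it uses synthetic data in place of the prepared housing features, shows the first 10 training samples rather than 10 random ones, and the names X_train, y_train and lin_reg are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the prepared housing data (assumption, not the real dataset)
rng = np.random.default_rng(42)
X_train = rng.uniform(0, 10, size=(200, 3))
y_train = 3.0 * X_train[:, 0] - 2.0 * X_train[:, 1] + 5.0 + rng.normal(0, 1.0, size=200)

# Build: learn a linear model from the training data
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predict: compare predictions with the true values for a few training samples
predictions = lin_reg.predict(X_train)
print("Predicted:", predictions[:10])
print("Actual:   ", y_train[:10])

# Evaluate: RMSE is the square root of the mean squared error
mse = mean_squared_error(y_train, predictions)
rmse = np.sqrt(mse)
print("Training RMSE:", rmse)
```

Taking the square root of the value returned by mean_squared_error keeps the sketch independent of the scikit-learn version.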
Underfitting vs Overfitting (recall)
▪ Underfitting (high bias) may happen when our model is over-simplified
or not expressive enough (both training error and test error are high).
▪ Overfitting (high variance) may happen when our model is too complex
and fits too specifically to the training set, but it does not generalize
well to new data (low training error but high testing error).
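[Figure: three fits, where green = ground truth model (unknown) and red = fitted model]
• Just right: $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$
• Underfit / high bias: $y = \theta_0 + \theta_1 x$
• Overfit / high variance: $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \ldots + \theta_9 x^9$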
Fixing Underfitting
▪ The main ways to fix underfitting are:
• Add more features or transform the features (e.g., the
log of the population) – expensive!
• Use a more powerful model
− Decision Tree
− Random Forest
• Add nonlinear (polynomial) terms (see Lecture 6)
Model #2: DecisionTreeRegressor
▪ Second, try a decision tree regression model. This is a more powerful
model, capable of finding complex nonlinear relationships
random_state=42 is optional. It is used to ensure that we get the
same result each time we run the code
RMSE = 0 means no error on the training set
→ the model has likely overfit the data.
DecisionTreeRegressor is a non-linear predictor for regression problems.
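A zero training RMSE from an unconstrained decision tree can be reproduced with a small sketch. It assumes synthetic noisy data in place of the housing set; only DecisionTreeRegressor and random_state=42 come from the slide.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic, noisy, non-linear data (assumption: stands in for the housing features)
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 2))
y_train = np.sin(X_train[:, 0]) + 0.1 * X_train[:, 1] + rng.normal(0, 0.3, size=200)

# No depth limit: the tree keeps splitting until every leaf is pure
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train, y_train)

tree_predictions = tree_reg.predict(X_train)
tree_rmse = np.sqrt(mean_squared_error(y_train, tree_predictions))
print("Training RMSE:", tree_rmse)
```

The tree reproduces every training target, noise included, which is why a zero training error is a warning sign rather than a success.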
Model #3 RandomForestRegressor
▪ Lastly, try random forest regression.
▪ Random forest: train many decision trees on random subsets of the
features, then average their predictions.
▪ Building a model on top of many other models is called Ensemble Learning.
▪ Then, we evaluate our model
RMSE = 21184.0454
Much better than Linear Regression (was 67862), but the model may be
overfitting; this needs to be verified
RandomForestRegressor(n_estimators=10) is a predictor. It uses an ensemble of
decision tree regressors to perform prediction.
Discussion on the Training Errors
▪ Linear Regression may have an underfitting issue. Possible fixes:
• Use a more powerful model
• Add more features / transform features
▪ Decision Tree may have an overfitting issue. To confirm,
perform cross-validation (next section). Possible fixes:
• Add more training samples
• Perform regularization
▪ Random Forest Regressor seems promising, but again, we need
to verify through cross-validation (next section)
Model                    Training RMSE
LinearRegression         67862
DecisionTreeRegressor    0
RandomForestRegressor    21461
Model Evaluation using Cross-Validation
Training vs Testing Performance
▪ It is misleading to simply report the performance of the model on the
data that we used to learn the model
• Why? The reported performance will be overly optimistic. A model
that works well on the training set may not do as well on unseen
data
▪ Recommendation: set aside some data purely for validation purposes
(e.g., 80% for training, 20% for validation)
• Learn the model using the training data
• Evaluate the model on the validation data
[Figure: the data split into training data and validation data]
▪ Important: if you have a test set, don't use the test data for validation
purposes during training
Holdout Validation
1. Split the original training data into training data and validation data
(e.g., 80% training data, 20% validation)
[Figure: original training data split into training data and validation data]
2. Learn the parameters on the training set and evaluate the performance on
the validation set
3. Repeat steps 1 and 2 for different parameters
4. Choose the model that achieves the best validation
score. Alternatively, use the best parameter setting and retrain on the
whole training dataset
5. Test the final model using the test set
Problem: might produce a model biased towards the validation set
k-fold Cross-validation
1. Split the original training data into k folds
[Figure: training data divided into fold1, fold2, fold3, fold4, fold5]
2. Hold out one fold as the validation set and use the other folds as
the training set. Repeat using each fold in turn as the validation set.
[Figure: the validation fold moves across the data on each iteration, with the remaining folds used as training data]
k-fold Cross-validation
(Continued)
3. Calculate the average and standard deviation of the validation results
across all folds.
− The average estimates the performance of a particular model
− The standard deviation indicates how reliable that estimate is
(if high, the cross-validation error could be imprecise)
4. Select the parameter setting with the best average score. You may
choose to retrain the final model using the whole training set
5. Evaluate the final model on the test set.
Evaluation Using k-fold Cross-Validation
▪ 5-fold cross-validation for the Linear Regression model
The function returns the negative mean squared error (NMSE), so
$RMSE = \sqrt{-NMSE}$
cross_val_score is used to perform k-fold cross-validation.
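A hedged sketch of 5-fold cross-validation with cross_val_score follows. It assumes synthetic data in place of the housing set, and cv_rmse is a hypothetical helper; the scoring string "neg_mean_squared_error" is the scikit-learn option that returns the negative MSE per fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared housing data (assumption)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(0, 1.0, size=200)


def cv_rmse(model, X, y, folds=5):
    """Return one validation RMSE per fold."""
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=folds)
    return np.sqrt(-scores)  # scores are negative MSE values


lin_rmse = cv_rmse(LinearRegression(), X, y)
tree_rmse = cv_rmse(DecisionTreeRegressor(random_state=42), X, y)

for name, scores in [("Linear Regression", lin_rmse), ("Decision Tree", tree_rmse)]:
    print(name, "scores:", scores)
    print(name, "mean:", scores.mean(), "stdev:", scores.std())
```

On this synthetic data the unconstrained decision tree, which has zero training error, scores worse than linear regression once it is validated on held-out folds.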
Evaluation Using k-fold Cross-Validation
▪ The result is given as follows:
• Since there are 5 folds, cross_val_score returns 5 scores
• The mean and standard deviation of the scores are then computed
▪ DecisionTreeRegressor Model
▪ RandomForestRegressor Model
Summary:
Evaluation set    Linear Regression    Decision Tree Regressor    Random Forest Regressor
Training set      67862                0                          21461
Validation set    68622                69677                      51889
1. Linear Regression underfits the data
2. Decision Tree badly overfits the data
• performs unreasonably well on the whole training set
• when cross-validated, performs worse than Linear Regression
3. Random Forest still has some overfitting – the RMSE on the whole
training set is still much lower than on the validation set. To solve it:
• get more training data
• tune the hyperparameters
• simplify / regularize the model
− limit tree depth and the number of features per split, require more samples per leaf,
prune the tree...

Phase 6: Fine-tuning model
▪ Fine-tune the models using grid search / random search
▪ Use a model ensemble to get an additional performance boost
▪ Analyze the best models and their errors
▪ Evaluate your system on the test set
Fine-tuning Model
▪ Each model comes with different hyperparameters.
Hyperparameters are the settings of a machine learning algorithm that
need to be set prior to training.
▪ Important hyperparameters for RandomForestRegressor are:
• n_estimators: the number of trees in the forest
• max_features: the number of features to consider when looking for
the best split
• bootstrap: True or False; if True, samples are drawn with replacement
▪ Two ways to find the best hyperparameters:
• Grid search
• Random search
Grid Search
▪ Grid search: try different combinations of the hyperparameters in a grid
▪ For example, try the following values for RandomForestRegressor:
• n_estimators (e): [12, 20, 28]
• max_features (f): [2, 4, 6, 8]
• bootstrap (b): [True, False]
▪ Expensive! Needs $n^m$ runs for $m$ hyperparameters with $n$ candidate values each
Grid for each value of b (b = True and b = False):
          f=2       f=4       f=6       f=8
e = 12    e12,f2    e12,f4    e12,f6    e12,f8
e = 20    e20,f2    e20,f4    e20,f6    e20,f8
e = 28    e28,f2    e28,f4    e28,f6    e28,f8
Total runs = 3 × 4 × 2 = 24
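The size of the grid can be confirmed by enumerating every combination. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from itertools import product

# The candidate values from the example above
param_grid = {
    "n_estimators": [12, 20, 28],
    "max_features": [2, 4, 6, 8],
    "bootstrap": [True, False],
}

# One dictionary per grid cell: every combination of the candidate values
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations))  # 24
print(combinations[0])
```

Each combination is one model to train, and with k-fold cross-validation each one is trained k times, which is why grid search becomes expensive quickly.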
Grid Search
▪ Use Scikit-Learn's GridSearchCV to perform grid search with cross-validation
▪ param_grid sets the search grid (a Python dictionary)
▪ You can also set refit=True to retrain the best estimator on the whole training set once it
has been found using cross-validation
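A possible GridSearchCV setup is sketched below. It is an assumption-based illustration: the data is synthetic with 8 made-up features (so that max_features=8 is valid), and cv=5 and the scoring choice are our own settings. The grid values are the ones from the earlier example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared housing data (assumption)
rng = np.random.default_rng(42)
X_train = rng.uniform(0, 1, size=(150, 8))
y_train = 10 * X_train[:, 0] + 5 * X_train[:, 1] ** 2 + rng.normal(0, 0.5, size=150)

# The search grid is a plain Python dictionary
param_grid = {
    "n_estimators": [12, 20, 28],
    "max_features": [2, 4, 6, 8],
    "bootstrap": [True, False],
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,                               # assumed number of folds
    scoring="neg_mean_squared_error",
    refit=True,                         # retrain the best setting on all training data
)
grid_search.fit(X_train, y_train)

best_rmse = np.sqrt(-grid_search.best_score_)
best_model = grid_search.best_estimator_
print(grid_search.best_params_, best_rmse)
```

After fitting, best_params_, best_estimator_ and cv_results_ can be read from the fitted search object.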
Grid Search
▪ To get the best hyperparameter settings from cross-validation and the average CV
score:
49178 / 206855 ≈ 23.8% (close to human expert, 20%)
▪ To get the best model learnt from cross-validation
GridSearchCV.best_params_ returns the best parameter setting found by GridSearchCV
GridSearchCV.best_estimator_ returns the best estimator found by GridSearchCV
GridSearchCV.cv_results_ returns the average scores (across all folds) for all settings
Random Search
▪ Random search: test random combinations of the hyperparameters
▪ With the same number of iterations, the chance of finding a near-optimal setting
is comparatively higher with random search
▪ The drawback is that its results vary more from run to run
▪ Currently the preferred choice for hyperparameter tuning
RandomizedSearchCV is used for randomized search on hyperparameters.
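A randomized search could be set up as in the sketch below. The data is synthetic, and the sampling ranges, n_iter=8 and cv=3 are arbitrary choices made for illustration rather than values from the lecture.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the prepared housing data (assumption)
rng = np.random.default_rng(42)
X_train = rng.uniform(0, 1, size=(150, 8))
y_train = 10 * X_train[:, 0] + 5 * X_train[:, 1] ** 2 + rng.normal(0, 0.5, size=150)

# Distributions to sample from instead of a fixed grid (upper bounds are exclusive)
param_distributions = {
    "n_estimators": randint(10, 30),
    "max_features": randint(2, 9),
}

rnd_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=8,                           # number of random combinations to try
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,                    # fixes which combinations are sampled
)
rnd_search.fit(X_train, y_train)

print(rnd_search.best_params_, np.sqrt(-rnd_search.best_score_))
```

The cost is controlled directly by n_iter, regardless of how many hyperparameters or candidate values are involved.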
Analyze the Best Model
▪ Gain insight into the importance of each attribute by
inspecting the best models.
Analyzing the best model for RandomForestRegressor
median_income is the most important predictor, followed by
location features (INLAND, longitude, latitude).
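Feature importances can be read from a fitted random forest through its feature_importances_ attribute. The sketch below uses synthetic data in which the first column is constructed to dominate; the feature names are borrowed from the housing example purely as labels and do not reflect the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical labels for four synthetic columns (assumption)
feature_names = ["median_income", "longitude", "latitude", "noise"]

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 4))
# By construction, the first column drives the target and the last has no effect
y = 10 * X[:, 0] + 1 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.1, size=300)

forest = RandomForestRegressor(n_estimators=20, random_state=42)
forest.fit(X, y)

# Pair each importance with its feature name and sort from most to least important
importances = forest.feature_importances_
ranked = sorted(zip(importances, feature_names), reverse=True)
for score, name in ranked:
    print(f"{name}: {score:.3f}")
```

With a grid search, the same attribute is available on the best model via best_estimator_.feature_importances_.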
Evaluate the final model on the test set
▪ To evaluate the final model on the test set, we need to ensure that the
test set goes through the same preprocessing steps as the training set.
Evaluate the final model on the test set
▪ Perform prediction on the test set
50712.6789
▪ Test performance is usually slightly worse than validation performance if
your system is fine-tuned to perform well on the validation data.

Phase 7: Launch, Monitor and Maintain your system
▪ Plug the production input data source into your system and write tests
▪ Write monitoring code to check the system's performance at
regular intervals. Trigger alerts when it drops
(models tend to rot as data evolves over time)
▪ Monitor the system's input data quality
▪ Human input is required to evaluate the system's predictions
▪ Retrain your model on a regular basis using fresh
data (this process may be automated).
▪ For online models, save a snapshot of the model's state at
regular intervals to enable rollback to a
previously working state
Next:
The Classification Pipeline
