Unit 03:
The Regression Pipeline – 2/2
Machine Learning Pipeline – Regression
Task – Regression
▪ Recall that our task is to predict the median house price of a district based on its attributes (median income, total bedrooms, ocean proximity, ...)
$\hat{y} = h(\mathbf{x})$
[Figure: scatter plot of median house price against an input feature x, with the fitted curve h(x)]
Selecting machine learning algorithms
Choose an appropriate machine learning algorithm for h(x):
▪ Linear Models
• Linear Regression
• Logistic Regression (for classification)
▪ Support Vector Machine (SVM)
▪ Neural Networks (NN)
▪ Decision Tree
▪ Random Forest
▪ K-Nearest Neighbours (k-NN)
▪ More…
Note that most machine learning algorithms can be used for both
regression (continuous output) and classification (discrete output)
Linear Model
▪ A linear model models the data using the hypothesis function:
$\hat{y} = h_{\theta}(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$
• $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ is the input
• $n$ is the number of features (the data is $n$-dimensional)
• $\theta_j$ are the model parameters or weights (the model can also be written as $\mathbf{W} \cdot \mathbf{X} + b$)
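As a concrete illustration (not from the lecture code; all names are illustrative), the hypothesis can be written directly in NumPy:

import numpy as np

def h(theta, x):
    # Linear hypothesis: theta_0 + theta_1*x_1 + ... + theta_n*x_n
    return theta[0] + np.dot(theta[1:], x)

theta = np.array([1.0, 2.0, 3.0])   # [theta_0, theta_1, theta_2]
x = np.array([0.5, 0.25])           # one sample with n = 2 features
print(h(theta, x))                  # 1 + 2*0.5 + 3*0.25 = 2.75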
Linear Regression
▪ Linear Regression finds the linear model that minimizes the sum of squared errors (prediction errors)
• Squared error = (true value − predicted value)²

$\hat{y} = h_{\theta}(x) = \theta_0 + \theta_1 x$
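A minimal sketch of what "minimizing the sum of squared errors" means in code, using NumPy's least-squares solver on illustrative data:

import numpy as np

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 2.0, 2.5])

# Design matrix with a leading column of ones for the intercept theta_0
X = np.column_stack([np.ones_like(x), x])

# Least squares finds theta minimizing sum((X @ theta - y)**2)
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # [theta_0, theta_1]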
Linear Regression
▪ For multi-dimensional data, the linear model is a hyperplane
What if the data is not linear?
▪ A straight-line model underfits nonlinear data!
▪ Polynomial Regression: add polynomial features (x², x³, ...) and fit a linear model on the expanded features, as sketched below
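A minimal scikit-learn sketch of polynomial regression on illustrative data (the pipeline expands x into polynomial features, then fits an ordinary linear model):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + X.ravel() + rng.normal(size=50)

# Expand x into [x, x^2], then fit a linear model on the new features
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[1.5]]))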
Linear Classification (Logistic Regression)
▪ A linear model can also be used to categorize a set of data points into discrete classes (more on classification in the next lecture)
Support Vector Machine
▪ The SVM algorithm minimizes ‖W‖ subject to y(W·X + b) ≥ 1 for every training sample, which maximizes the margin between the two classes
[Figure: maximum-margin separator with boundary lines W·X + b = 1 and W·X + b = −1]
Non-linear SVM
▪ The basic idea is to transform x into a higher-dimensional space using a kernel function so that the data points become linearly separable in the transformed space.
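A minimal sketch with scikit-learn's kernelized SVM for regression; the RBF kernel performs the high-dimensional transform implicitly (data is illustrative):

import numpy as np
from sklearn.svm import SVR

X = np.linspace(0, 5, 40).reshape(-1, 1)
y = np.sin(X).ravel()

# kernel="rbf" maps inputs into a higher-dimensional feature space
svm = SVR(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X, y)
print(svm.predict([[2.5]]))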
Artificial Neural Networks
▪ An artificial neural network (ANN) is a network of simple elements called artificial neurons; each neuron receives inputs and produces an output depending on the input weights and its activation function.
[Figure: an artificial neuron]
Neural Networks
▪ An artificial neural network typically consists of multiple layers of interconnected neurons; such a network is called a multilayer perceptron (MLP).
[Figure: an MLP mapping input X to output y]
$a^{l} = f(\mathbf{W}^{l} a^{l-1} + \mathbf{b}^{l})$
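A minimal NumPy sketch of this forward computation for a 2-layer MLP, using ReLU as the activation f (sizes and weights are illustrative):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # layer 1: 3 inputs -> 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)  # layer 2: 4 neurons -> 1 output

x = np.array([0.2, -0.1, 0.5])
a1 = relu(W1 @ x + b1)   # a^1 = f(W^1 a^0 + b^1), with a^0 = x
y_hat = W2 @ a1 + b2     # linear output layer for regression
print(y_hat)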
Neural Network Training
▪ A neural network is trained using two passes:
• Forward pass – make predictions and compute the prediction loss (Y′ − Y)
• Backpropagation – adjust all weights layer by layer to reduce the loss using gradient descent
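A minimal sketch of this loop for a single linear neuron with squared loss; a real network repeats the weight updates layer by layer via backpropagation (data is illustrative):

import numpy as np

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 2.0, 2.5])
w, b, lr = 0.0, 0.0, 0.1

for _ in range(200):
    y_hat = w * x + b               # forward pass: predictions
    err = y_hat - y                 # prediction errors
    w -= lr * 2 * np.mean(err * x)  # gradient descent on the mean squared loss
    b -= lr * 2 * np.mean(err)

print(w, b)  # converges towards the least-squares fit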
Decision Tree
▪ A decision tree is a tree representation where each node represents a feature, each branch represents a decision (rule), and each leaf represents an outcome (class or value).
▪ A path from the root node to a leaf node produces a prediction
[Figure: a decision tree; leaf nodes hold the outcomes, and each node is annotated with its error rate (ER)]
Continuing to split
[Figure: error rates (ER) of the nodes as splitting continues]
• Stopping criteria (see the sketch after this list):
– All samples at a given node belong to the same class (pure)
– A predefined threshold is reached, e.g., a minimum number of samples
– There are no remaining attributes for further partitioning
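As referenced above, these criteria correspond to hyperparameters in libraries such as scikit-learn; a minimal sketch (data is illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = X[:, 0] + 0.1 * rng.normal(size=100)

# max_depth and min_samples_leaf are predefined stopping thresholds
tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=5)
tree.fit(X, y)
print(tree.predict([[0.5, 0.5]]))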
k-Nearest Neighbour
▪ k-Nearest Neighbour (k-NN) is a non-parametric, instance-based learning algorithm
• The instances themselves represent the knowledge
▪ k-NN returns the most common value among the k training examples nearest to the test sample
• k should be an odd number (to avoid ties when voting)
▪ A decision is made by examining the labels of the k nearest neighbours and taking a vote
k-Nearest Neighbour
▪ k-NN can be used for both classification and regression, as sketched below.
• Classification: classify an object by a majority vote of its k nearest neighbours
• Regression: the prediction is the average of the k nearest neighbours
k-NN drawbacks: it requires large memory to store all training samples and has long prediction times.
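A minimal scikit-learn sketch of k-NN regression with k = 3 (data is illustrative):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.5, 2.0, 2.5, 3.5, 4.0])

# The prediction is the average target of the 3 nearest training samples
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[1.8]]))  # mean of the targets at x = 1, 2, 3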
Selecting machine learning algorithms
▪ No Free Lunch Theorem [D. Wolpert, 1996]
• There is no one model that works best for every problem. The assumptions of a model that works for one problem may not hold for another problem.
▪ Strategy:
• Try out different models from various categories of ML
algorithms.
• Do not spend too much time tweaking the hyperparameters
• Shortlist a few (2 to 5) promising models.
[Figure: three models' predictions plotted against the true values — panels labelled Best, Bad, and Not so good]
Root Mean Square Error (RMSE)
$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2}$
Given the following samples, compute the RMSE for models L1 and L2:

x   y     h1    h2
0   0.5   0     2.5
1   2     1     1.5
2   2.5   2     0.5

[Figure: the three samples plotted together with the two fitted lines L1 and L2]
$\mathrm{RMSE}(L1) = \sqrt{\frac{1}{3}\left[(0 - 0.5)^2 + (1 - 2)^2 + (2 - 2.5)^2\right]} = 0.7071$

$\mathrm{RMSE}(L2) = \sqrt{\frac{1}{3}\left[(2.5 - 0.5)^2 + (1.5 - 2)^2 + (0.5 - 2.5)^2\right]} = 1.6583$

The RMSE for L1 is smaller than the RMSE for L2 → L1 is the better model.
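The same computation in NumPy confirms the worked example:

import numpy as np

y  = np.array([0.5, 2.0, 2.5])
h1 = np.array([0.0, 1.0, 2.0])
h2 = np.array([2.5, 1.5, 0.5])

rmse = lambda h: np.sqrt(np.mean((h - y) ** 2))
print(rmse(h1))  # 0.7071
print(rmse(h2))  # 1.6583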
Build a model
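The slide's code is not reproduced here; below is a minimal sketch of this step. housing_prepared and housing_labels stand in for the preprocessed features and labels from the earlier pipeline (assumed names, random stand-in data):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
housing_prepared = rng.random((100, 8))    # stand-in for preprocessed features
housing_labels = rng.random(100) * 500000  # stand-in for median house prices

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)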
Evaluate the model
mean_squared_error computes the mean squared error given the actual and predicted labels
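A minimal sketch of this evaluation, continuing the stand-in names above:

import numpy as np
from sklearn.metrics import mean_squared_error

predictions = lin_reg.predict(housing_prepared)
mse = mean_squared_error(housing_labels, predictions)  # actual vs predicted labels
rmse = np.sqrt(mse)                                    # RMSE, same units as the target
print(rmse)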
Underfitting vs Overfitting (recall)
▪ Underfitting (high bias) may happen when our model is over-simplified
or not expressive enough (both training error and test error are high).
▪ Overfitting (high variance) may happen when our model is too complex
and fits too specifically to the training set, but it does not generalize
well to new data (low training error but high testing error).
[Figure: three fits — a degree-3 model $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$ (good fit), a linear model $y = \theta_0 + \theta_1 x$ (underfit), and a degree-9 model $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_9 x^9$ (overfit)]
Model #2: DecisionTreeRegressor
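A minimal sketch, continuing the stand-in data above:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)
tree_rmse = np.sqrt(mean_squared_error(
    housing_labels, tree_reg.predict(housing_prepared)))
print(tree_rmse)  # near 0 on the training set is a classic sign of overfitting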
Model #3: RandomForestRegressor
RMSE = 21184.0454 — much better than Linear Regression (whose RMSE was 67862), but the model may be overfitting; this needs to be verified.
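A minimal sketch, continuing the stand-in data above:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(housing_prepared, housing_labels)
forest_rmse = np.sqrt(mean_squared_error(
    housing_labels, forest_reg.predict(housing_prepared)))
print(forest_rmse)  # training RMSE only; verify with validation data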
Model Evaluation using Cross-Validation
Training vs Testing Performance
▪ It is misleading to simply report the performance of the model on the data that was used to learn the model
• Why? The reported performance will be overly optimistic. A model that works well on the training set may not do as well on unseen data
▪ Important: if you have a test set, don't use the test data for validation purposes during training
Holdout Validation
1. Split the original training data into a training set and a validation set (e.g., 80% training data, 20% validation)
2. Train candidate models on the training set and evaluate them on the validation set
[Figure: the data split into train data and validation data]
k-fold Cross-validation
▪ Split the training data into k folds. Hold out one fold as the validation set and use the other folds as the training set. Repeat, using each fold in turn as the validation set, and average the k validation scores.
Evaluation Using k-fold Cross-Validation
▪ DecisionTreeRegressor Model
▪ RandomForestRegressor Model
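A minimal sketch of scoring these models with 10-fold cross-validation, continuing the stand-in data above (scikit-learn scorers maximize, so the MSE is negated):

import numpy as np
from sklearn.model_selection import cross_val_score

for model in (tree_reg, forest_reg):
    scores = cross_val_score(model, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
    rmse_scores = np.sqrt(-scores)  # negate, then convert MSE to RMSE
    print(type(model).__name__, rmse_scores.mean(), rmse_scores.std())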
Summary
Fine-tuning the Model
▪ Each model comes with different hyperparameters. Hyperparameters are the settings of a machine learning model that need to be set prior to training.
▪ Important hyperparameters for RandomForestRegressor are:
• n_estimators: the number of trees in the forest
• max_features: the number of features to consider when looking for
the best split
• bootstrap: true or false, if true, samples are drawn with replacement
▪ Two ways to find the best hyperparameters:
• Grid search
• Random search
Grid Search
▪ Grid search: try different combinations of the hyperparameters in a grid
▪ For example, try the following values for RandomForestRegressor:
• n_estimators (e): [12, 20, 28]
• max_features (f): [2, 4, 6, 8]
• bootstrap (b): [True, False]
▪ Expensive! The number of runs is the product of the number of values tried for each hyperparameter ($n^m$ runs for $m$ hyperparameters with $n$ values each) — here $3 \times 4 \times 2 = 24$ combinations
The hyperparameter grid is specified as a Python dictionary, as sketched below.
You can also set refit=True to retrain the model on the whole training set once the best estimator has been found using cross-validation.
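A minimal GridSearchCV sketch over the grid above, continuing the stand-in data from earlier:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {                    # the grid is a Python dictionary
    "n_estimators": [12, 20, 28],
    "max_features": [2, 4, 6, 8],
    "bootstrap": [True, False],
}
grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           param_grid,
                           scoring="neg_mean_squared_error",
                           cv=5,
                           refit=True)  # retrain best estimator on all training data
grid_search.fit(housing_prepared, housing_labels)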
Grid Search
▪ To get the best hyperparameter settings found by cross-validation and the average CV score:
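For example (continuing the grid search above):

import numpy as np

print(grid_search.best_params_)           # best hyperparameter combination
print(np.sqrt(-grid_search.best_score_))  # average cross-validation RMSE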
▪ To evaluate the final model on the test set, we need to ensure that the
test set goes through the same preprocessing steps as the training set.
Evaluate the final model on the test set
▪ Perform prediction on the test set
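A minimal sketch, assuming the fitted preprocessing pipeline and the raw test split are available as full_pipeline, X_test, and y_test (assumed names, not from the slides):

import numpy as np
from sklearn.metrics import mean_squared_error

final_model = grid_search.best_estimator_

# Transform (never fit!) the test set with the pipeline fitted on training data
X_test_prepared = full_pipeline.transform(X_test)  # full_pipeline: assumed name

final_predictions = final_model.predict(X_test_prepared)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
print(final_rmse)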
Final test RMSE: 50712.6789
Machine Learning Pipeline – Regression