Linear Regression


Introduction to Machine Learning

Linear Models for Regression


林彥宇 教授
Yen-Yu Lin, Professor
國立陽明交通大學 資訊工程學系
Computer Science, National Yang Ming Chiao Tung University

Some slides are modified from Prof. Sheng-Jyh Wang and Prof. Hwang-Tzong Chen
Regression

• Given a training data set comprising 𝑁 observations {𝐱𝑛}, 𝑛 = 1, … , 𝑁, and the corresponding target values {𝑡𝑛}, the goal of regression is to predict the value of 𝑡 for a new value of 𝐱

https://www.scribbr.com/statistics/linear-regression-in-r/
2
A simple regression model

• A simple linear model:

➢ Each observation lies in a 𝐷-dimensional space: 𝐱 = (𝑥1, … , 𝑥𝐷)T
➢ 𝑦 is a regression model parametrized by 𝐰 = (𝑤0, … , 𝑤𝐷)T
➢ The output is a linear combination of the input variables
➢ It is a linear function of the parameters
➢ The fitting power is quite limited, which motivates seeking a nonlinear extension for the input variables

3
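A sketch of the model these bullets describe, assuming the standard notation of PRML Chapter 3:

$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \cdots + w_D x_D$$

Note that this expression is linear in both the parameters 𝐰 and the input variables; the latter is what limits its fitting power.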
An example

• A regressor in the form of


➢ A straight line in this case → insufficient fitting power
➢ Apply nonlinear feature transforms before linear regression

4
Linear regression with nonlinear basis functions

• Simple linear model:

• A linear model with nonlinear basis functions

where {𝜙𝑗}, 𝑗 = 1, … , 𝑀 − 1, are the nonlinear basis functions,
𝑀 is the number of parameters, and
𝑤0 is the bias parameter allowing a fixed offset

• The regression output is a linear combination of nonlinear basis functions of the inputs
5
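A sketch of the basis-function model described above, assuming the standard PRML form with 𝑀 − 1 nonlinear basis functions and a bias 𝑤0:

$$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x})$$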
Linear regression with nonlinear basis functions

• A linear model with nonlinear basis functions

• Let 𝜙0(𝐱) = 1 be a dummy basis function. The regression function is then equivalently expressed as a single inner product,

where 𝐰 and 𝛟(𝐱) collect all 𝑀 weights and basis functions, respectively (see the sketch after this slide)

6
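Assuming the standard PRML notation for the omitted symbols, the compact form reads:

$$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}), \qquad \mathbf{w} = (w_0, \ldots, w_{M-1})^{\mathrm{T}}, \quad \boldsymbol{\phi}(\mathbf{x}) = (\phi_0(\mathbf{x}), \ldots, \phi_{M-1}(\mathbf{x}))^{\mathrm{T}}$$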
Examples of basis functions
• Polynomial basis function: taking the form of powers of 𝑥

• Gaussian basis function: governed by a location parameter and a scale parameter

➢ The location parameter places the basis function, while the scale parameter governs its width

• Sigmoidal basis function: governed by the same two parameters through a logistic sigmoid function (the exact forms are sketched after this slide)

7
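Assuming the standard PRML definitions, with 𝜇𝑗 the location parameter and 𝑠 the scale parameter, the three families sketched here are:

$$\phi_j(x) = x^j \quad \text{(polynomial)}, \qquad
\phi_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2 s^2}\right) \quad \text{(Gaussian)}, \qquad
\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right) \quad \text{(sigmoidal)},$$

where $\sigma(a) = 1 / (1 + \exp(-a))$ is the logistic sigmoid.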
How basis functions work

• Take Gaussian basis functions as an example

y = w0 + w11 ( x ) + w22 ( x ) + ... + wM −1M −1 ( x )

1(x) 2(x) 3(x) 4(x) 5(x) 6(x) 7(x) 8(x)

8
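A minimal NumPy sketch of this Gaussian-basis expansion; the centers `mu`, scale `s`, and weights `w` below are illustrative choices, not values from the slide:

```python
import numpy as np

def gaussian_design_matrix(x, mu, s):
    """Columns: a constant phi_0(x) = 1, then Gaussian bumps
    phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) for each center mu_j."""
    bumps = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2.0 * s ** 2))
    return np.hstack([np.ones((x.shape[0], 1)), bumps])

# Example: 8 Gaussian basis functions spread over [0, 1]
x = np.linspace(0.0, 1.0, 50)
mu = np.linspace(0.0, 1.0, 8)
Phi = gaussian_design_matrix(x, mu, s=0.1)   # shape (50, 9)
w = np.random.randn(Phi.shape[1])            # illustrative weights
y = Phi @ w                                  # y = w0 + sum_j w_j * phi_j(x)
```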
Maximum likelihood and least squares

• Assume each target value is generated by a deterministic function of the observation plus additive Gaussian noise,

where 𝜀 is a zero-mean Gaussian noise term whose precision (inverse variance) is 𝛽

• Thus, we have the conditional probability

9
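Assuming the standard PRML noise model for the omitted equations:

$$t = y(\mathbf{x}, \mathbf{w}) + \varepsilon, \qquad
p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\!\left(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}\right)$$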
Maximum likelihood and least squares

• Given a data set of inputs X = {x1, . . . , xN} with corresponding target values t1, . . . , tN, we have the likelihood function

• The log likelihood function is

where

10
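A sketch of the likelihood and log likelihood, assuming the standard PRML expressions for this Gaussian noise model:

$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \mid \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\right)$$

$$\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln (2\pi) - \beta E_D(\mathbf{w}),
\qquad
E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \right\}^2$$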
Maximum likelihood and least squares

• Given a data set of inputs X = {x1, . . . , xN} with corresponding target values t1, . . . , tN, we have the likelihood function

• The log likelihood function is

How?

where

11
Maximum likelihood and least squares

• Gaussian noise likelihood ⟺ sum-of-squares error function

• Maximum likelihood solution: optimize 𝐰 by maximizing the log likelihood function

• Step 1: Compute the gradient of log likelihood w.r.t. 𝐰

• Step 2: Set the gradient to zero, which gives

12
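The two steps presumably follow PRML Section 3.1.1; as a sketch:

$$\nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{w}, \beta)
= \beta \sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \right\} \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}}$$

$$0 = \sum_{n=1}^{N} t_n \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}}
- \mathbf{w}^{\mathrm{T}} \left( \sum_{n=1}^{N} \boldsymbol{\phi}(\mathbf{x}_n) \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}} \right)$$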
Maximum likelihood and least squares

• Define the design matrix for this task (sketched after this slide)

➢ It has 𝑁 rows, one for each training sample
➢ It has 𝑀 columns, one for each basis function

13
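A sketch of the design matrix these two bullets describe, assuming the usual PRML layout with entries $\Phi_{nj} = \phi_j(\mathbf{x}_n)$:

$$\boldsymbol{\Phi} =
\begin{pmatrix}
\phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\
\phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N)
\end{pmatrix}$$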
Maximum likelihood and least squares

• Setting the gradient to zero, we have

• How to derive?
➢ Hint 1:
➢ Hint 2:

14
Maximum likelihood and least squares

• The ML solution

• The matrix (𝚽ᵀ𝚽)⁻¹𝚽ᵀ appearing in this solution is known as the Moore-Penrose pseudo-inverse of the design matrix

• 𝚽 has linearly independent columns. Why is 𝚽ᵀ𝚽 invertible?

15
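A minimal NumPy sketch of the maximum likelihood solution $\mathbf{w}_{\mathrm{ML}} = (\boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}$; the function names are illustrative, and `Phi` can be any design matrix such as the Gaussian one sketched earlier:

```python
import numpy as np

def fit_ml(Phi, t):
    """Maximum likelihood / least-squares weights: w_ML = pinv(Phi) @ t."""
    # np.linalg.pinv computes the Moore-Penrose pseudo-inverse of the design matrix;
    # with linearly independent columns it equals inv(Phi.T @ Phi) @ Phi.T.
    return np.linalg.pinv(Phi) @ t

def predict(Phi, w):
    """Regression outputs, one per row of the design matrix."""
    return Phi @ w
```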
Maximum likelihood and least squares

• Similarly, 𝛽 is optimized by maximizing the log likelihood

where

• We get

16
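Assuming the standard PRML result, the noise precision estimate is:

$$\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N} \sum_{n=1}^{N} \left\{ t_n - \mathbf{w}_{\mathrm{ML}}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \right\}^2$$

that is, the average squared residual of the maximum likelihood regression fit.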
Regression for a new data point

• The conditional probability (likelihood function)

• After learning, we get 𝐰 ← 𝐰ML and 𝛽 ← 𝛽ML

• Specify the prediction for a data point 𝐱 in the form of a Gaussian distribution with mean 𝑦(𝐱, 𝐰ML) and variance 𝛽ML⁻¹

17
Regularized least squares

• Adding a regularization term helps alleviate over-fitting

• The simplest form of the regularization term

• The total error function becomes

• Setting the gradient of the function w.r.t. 𝐰 to 0, we have

18
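Assuming the quadratic regularizer of PRML Section 3.1.4, the pieces referred to above are:

$$E_W(\mathbf{w}) = \frac{1}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}, \qquad
E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})
= \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \right\}^2
+ \frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}$$

$$\mathbf{w} = \left( \lambda \mathbf{I} + \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}$$

The regularized solution only adds $\lambda \mathbf{I}$ to the normal equations, which also guarantees invertibility for $\lambda > 0$.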
Regularized least squares

• A more general regularizer

• q=2 → quadratic regularizer


• q=1 → the lasso in the statistics literature
• Contours of the regularization term

19
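A sketch of the general regularized error, assuming the PRML form with exponent 𝑞:

$$\frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \right\}^2
+ \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^{q}$$

Setting $q = 2$ recovers the quadratic regularizer above, while $q = 1$ gives the lasso, which tends to drive some weights exactly to zero.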
Multiple outputs

• In some applications, we wish to predict 𝐾 > 1 target values


➢ One target value: Income → Happiness
➢ Multiple target values: Income → Happiness, Working hours, Health

• Recall the one-dimensional case

• With the same basis functions, the regression approach becomes

where 𝐖 is an 𝑀 × 𝐾 matrix, 𝑀 is the number of basis functions, and 𝐾 is the number of target values

20
Multiple outputs

• The conditional probability of a single observation is

➢ An isotropic Gaussian, i.e., its covariance matrix is a multiple of the identity (and hence diagonal)
➢ The target variables are therefore mutually independent

• The log likelihood function is

21
Multiple outputs: Maximum likelihood solution

• Setting the gradient of the log likelihood function w.r.t. 𝐖 to 0, we have

• Consider the kth column of 𝐖ML

where 𝐭𝑘 is an 𝑁-dimensional vector with components [𝑡𝑛𝑘]

• It leads to 𝐾 independent regression problems

22
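Assuming the standard PRML result, with 𝐓 the 𝑁 × 𝐾 matrix whose 𝑛th row is 𝐭𝑛ᵀ:

$$\mathbf{W}_{\mathrm{ML}} = \left( \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{T},
\qquad
\mathbf{w}_k = \left( \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}_k$$

The pseudo-inverse is shared by all 𝐾 problems, so it only needs to be computed once.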
Sequential learning

• The maximum likelihood derivation above is a batch technique

➢ It takes all training data into account at the same time
➢ Case 1: The training data set is so large that batch processing is costly
➢ Case 2: Data points arrive in a continuous stream

• In both cases, it is worthwhile to use sequential algorithms, also known as on-line algorithms, in which the data points are considered one by one and the model parameters are updated incrementally

23
Sequential learning

• Stochastic gradient descent


➢ The error function comprises a sum over data points: 𝐸 = Σ𝑛 𝐸𝑛

➢ Given data point 𝐱𝑛, the parameter vector 𝐰 is updated by

where 𝜏 is the iteration number and η is the learning rate


➢ In the case of sum-of-squares error, it is

24
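A minimal NumPy sketch of this update for the sum-of-squares case (the LMS rule); the learning rate and the toy data generator below are illustrative assumptions:

```python
import numpy as np

def sgd_step(w, phi_n, t_n, eta=0.05):
    """One stochastic gradient descent step on E_n = 0.5 * (t_n - w @ phi_n)**2:
    w <- w + eta * (t_n - w @ phi_n) * phi_n   (the LMS rule)."""
    return w + eta * (t_n - w @ phi_n) * phi_n

# Streaming usage on a toy 1-D problem with basis (1, x): data arrive one by one
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(1000):
    x_n = rng.uniform(-1.0, 1.0)
    t_n = 0.5 - 0.3 * x_n + 0.1 * rng.standard_normal()   # illustrative generator
    w = sgd_step(w, np.array([1.0, x_n]), t_n)
# w now approximates the generator's coefficients (0.5, -0.3)
```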
Maximum a posteriori

• Likelihood function

• Let’s consider a prior function, which is a Gaussian

where 𝐦0 is the mean and S0 is the covariance matrix

• The posterior function is also a Gaussian

where the posterior mean and covariance are determined by the prior and the observed data (their forms are sketched after this slide)
25
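Given the Gaussian prior 𝒩(𝐰 | 𝐦0, 𝐒0) named above, the posterior presumably takes the standard PRML form:

$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N),
\qquad
\mathbf{m}_N = \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \right),
\qquad
\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}$$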
How to derive the mean and covariance of the posterior

• According to the results for marginal and conditional Gaussians on page 93 of the PRML textbook

26
A zero-mean isotropic Gaussian prior

• A general Gaussian prior function

where 𝐦0 is the mean and S0 is the covariance matrix

• A widely used Gaussian prior

• Mean and covariance of the resulting posterior function

27
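A sketch of this special case, assuming the PRML parameterization with a single precision hyperparameter 𝛼:

$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}),
\qquad
\mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t},
\qquad
\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}$$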
Sequential Bayesian learning: An example

• Data, including observations and target values, are given one by one

• Data lie in a one-dimensional space

• Data are sampled from a fixed underlying function and corrupted by additive Gaussian noise
➢ Note that this function is unknown to the learner
➢ We have access only to the observations and the target values

28
An example

• Regression function

29
An example

• Regression function

• In the beginning, no data are available

• Constant likelihood

• Prior = posterior

• Sample 6 curves of the regression function according to the posterior distribution

30
An example

• Regression function

• One data sample (blue circle) is given
• Likelihood for this sample
• White cross
• Posterior proportional to likelihood × prior
• Sample 6 curves according to the posterior
31
An example

• Regression function

• The second data sample (blue circle) is given
• Likelihood for the second sample
• White cross
• Posterior proportional to likelihood × prior
• Sample 6 curves according to the posterior
32
An example

• Regression function

• 20 data samples (blue circles) are given
• Likelihood for the 20th sample
• White cross
• Posterior proportional to likelihood × prior
• Sample 6 curves according to the posterior
33
Predictive distribution
• Recall the posterior function

where the mean and covariance are those derived earlier

• Given 𝐰, we regress a data sample via

• In the Bayesian treatment, the predictive distribution is

• Then we have

• where
34
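Assuming the standard PRML predictive distribution for this model:

$$p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta)
= \mathcal{N}\!\left( t \mid \mathbf{m}_N^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}), \; \sigma_N^2(\mathbf{x}) \right),
\qquad
\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}} \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x})$$

The first term of the variance reflects the noise in the data, the second the remaining uncertainty in 𝐰; the next slide plots the mean and one standard deviation of this distribution.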
• Green curve: the function used to generate the data; it is unknown to the learner
• Blue circles: sampled data points
• After learning, we obtain the predictive distribution
• Red curve: the mean of the predictive Gaussian
• Red shaded region: one standard deviation on either side of the mean
35
36
• Sample 5 points of 𝐰 according to the posterior function
• Plot the corresponding regression functions

37
References

• Chapters 3.1 and 3.3 in the PRML textbook

38
Thank You for Your Attention!

Yen-Yu Lin (林彥宇)


Email: lin@cs.nctu.edu.tw
URL: https://www.cs.nycu.edu.tw/members/detail/lin

39
