
Chapter 4: Training Regression Models

Dr. Xudong Liu


Assistant Professor
School of Computing
University of North Florida

Monday, 10/14/2019

1 / 41
Overview

Linear regression
Normal equation
Gradient descent
Batch gradient descent
Stochastic gradient descent
Mini-batch gradient descent

Polynomial regression
Regularization for linear models
Ridge regression
Lasso regression
Elastic Net
Logistic regression
Softmax regression

Overview 2 / 41
Linear Regression Model
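The model equation on these two slides did not survive extraction; presumably it is the standard linear regression prediction used throughout the rest of the deck (with the usual convention x_0 = 1 for the bias term):

```latex
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta^T \cdot x
```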

Overview 3 / 41
Linear Regression Model

Overview 4 / 41
Learning Linear Regression Models

We try to learn θ so that the following MSE cost function is minimized:

MSE(θ) = (1/m) Σ_{i=1}^m (θ^T · x^(i) − y^(i))^2

Linear Regression Learning 5 / 41


Normal Equation

To find the θ that minimizes the cost function, we can apply the following closed-form solution:

θ̂ = (X^T · X)^{-1} · X^T · y

How is it derived?

Linear Regression Learning 6 / 41


Normal Equation

Under this cost function, we are looking for a line that minimizes the sum of the squared distances from the data points to the line.
The cost function is convex, so all we need to do is compute the cost function's partial derivative w.r.t. θ, set it to 0, and solve for θ.
The θ that minimizes (1/m) Σ_{i=1}^m (θ^T · X^(i) − y^(i))^2 also minimizes Σ_{i=1}^m (θ^T · X^(i) − y^(i))^2.

Linear Regression Learning 7 / 41


Normal Equation

1. Σ_{i=1}^m (θ^T · X^(i) − y^(i))^2 = (y − Xθ)^T (y − Xθ) = E_θ
2. ∂E_θ/∂θ = (−X)^T (y − Xθ) + (−X)^T (y − Xθ), because of the rule ∂(u^T v)/∂w = (∂u^T/∂w) v + (∂v^T/∂w) u, here with u = v = y − Xθ
3. Thus, ∂E_θ/∂θ = 2 X^T (Xθ − y)
4. Setting ∂E_θ/∂θ = 0, we can solve for θ = (X^T · X)^{-1} · X^T · y

Linear Regression Learning 8 / 41


Normal Equation Experiment
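The experiment on this slide is not reproduced in the extracted text; below is a minimal NumPy sketch of what such an experiment might look like. The synthetic data (y = 4 + 3x + Gaussian noise) and all settings are illustrative assumptions, not taken from the slide.

```python
import numpy as np

# Hypothetical synthetic data: y = 4 + 3x + Gaussian noise.
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)

# Add x0 = 1 to each instance so theta_0 acts as the intercept.
X_b = np.c_[np.ones((m, 1)), X]

# Normal equation: theta = (X^T X)^{-1} X^T y
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_hat)  # should come out close to [[4.], [3.]]
```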

Linear Regression Learning 9 / 41


Normal Equation Complexity

Using the normal equation takes time roughly O(m·n^3), where n is the number of features and m is the number of examples.
So, it scales well with a large number of examples, but poorly with a large number of features.
Now let's look at other ways of learning θ that may be better when there are many features or too many training examples to fit in memory.

Linear Regression Learning 10 / 41


Gradient Descent
The general idea is to tweak parameters iteratively in order to minimize a cost function.
Analogy: suppose you are lost in the mountains in a dense fog. You can only feel the slope of the ground below your feet. A good strategy to get to the bottom quickly is to go downhill in the direction of the steepest slope.
The size of each downhill step is called the learning rate.

Gradient Descent 11 / 41
Learning Rate Too Small

Gradient Descent 12 / 41
Learning Rate Too Big

Gradient Descent 13 / 41
Gradient Descent Pitfalls

Gradient Descent 14 / 41
Gradient Descent with Feature Scaling

When all features have a similar scale, GD tends to converge quickly.
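As a quick illustration (a sketch, not from the slide), features can be standardized with sklearn's StandardScaler before running GD; the example data here is an assumption:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix whose columns have very different scales.
X = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 1000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance
```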

Gradient Descent 15 / 41
Batch Gradient Descent

The gradient vector of the MSE cost function is:

∇_θ MSE(θ) = (2/m) X^T (Xθ − y)

It is something we already computed for the normal equation!

Gradient descent step:

θ^(next step) ← θ − η · ∇_θ MSE(θ)
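A minimal NumPy sketch of this update rule (the function name, variable names, and hyperparameter values are illustrative assumptions, not taken from the slides):

```python
import numpy as np

# Batch gradient descent for linear regression.
# Assumes X_b already has the bias column of 1s and y is an (m, 1) vector.
def batch_gradient_descent(X_b, y, eta=0.1, n_iterations=1000):
    m, n = X_b.shape
    theta = np.random.randn(n, 1)                          # random initialization
    for _ in range(n_iterations):
        gradients = (2 / m) * X_b.T @ (X_b @ theta - y)    # full-batch gradient of MSE
        theta = theta - eta * gradients                    # GD step with learning rate eta
    return theta
```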

Gradient Descent 16 / 41
Batch Gradient Descent: Learning Rates

Gradient Descent 17 / 41
Other Gradient Descent Algorithms

Batch GD: uses the whole training set to compute the gradient.
Stochastic GD: uses one random training example to compute the gradient (see the sketch below).
Mini-batch GD: uses a small random set of training examples to compute the gradient.
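A minimal NumPy sketch of stochastic GD with a simple decreasing learning schedule (the schedule and all hyperparameters are illustrative assumptions):

```python
import numpy as np

# Stochastic gradient descent for linear regression.
# X_b: (m, n) matrix with a bias column; y: (m, 1) targets.
def stochastic_gradient_descent(X_b, y, n_epochs=50, t0=5, t1=50):
    m, n = X_b.shape
    theta = np.random.randn(n, 1)
    for epoch in range(n_epochs):
        for i in range(m):
            idx = np.random.randint(m)                # pick one random example
            xi = X_b[idx:idx + 1]
            yi = y[idx:idx + 1]
            gradients = 2 * xi.T @ (xi @ theta - yi)  # gradient on a single example
            eta = t0 / (epoch * m + i + t1)           # gradually decreasing learning rate
            theta = theta - eta * gradients
    return theta
```

sklearn's SGDRegressor class, used later on the regularization slides, implements this style of training.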

Gradient Descent 18 / 41
Comparing Linear Regression Algorithms

Gradient Descent 19 / 41
Polynomial Regression

Univariate polynomial: ŷ = θ_0 + θ_1 x + ... + θ_d x^d
Multivariate polynomial: more complex
E.g., for degree 2 and 2 variables,
ŷ = θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1 x_2 + θ_4 x_1^2 + θ_5 x_2^2
In general, for degree d and n variables, there are C(n+d, d) = (n+d)!/(d!·n!) terms.¹
Clearly, polynomial regression is more general than linear regression
and can fit non-linear data.

¹ https://mathoverflow.net/questions/225953/number-of-polynomial-terms-for-certain-degree-and-certain-number-of-variables
Polynomial Regression 20 / 41
Polynomial Regression Learning

We may transform the given attributes to get additional attributes of higher degrees.
Then the problem boils down to a linear regression problem.
The following data is generated from y = 0.5x_1^2 + 1.0x_1 + 2.0 + Gaussian noise.

Polynomial Regression 21 / 41
Polynomial Regression Learning
We first transform x_1 into polynomial features of degree 2, then fit the data to learn ŷ = 0.56x_1^2 + 0.93x_1 + 1.78.
The PolynomialFeatures class in sklearn can produce all C(n+d, d) polynomial features.
Of course, in practice we would not know what degree the data was actually generated with.
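A minimal sklearn sketch of this workflow; the data generation mirrors the quadratic function described above, and the exact settings (range of x, noise scale) are assumptions:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic data from y = 0.5*x^2 + 1.0*x + 2.0 + Gaussian noise.
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

# Add the degree-2 polynomial feature x^2, then fit ordinary linear regression.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)            # columns: [x, x^2]
lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)  # roughly 2.0 and [1.0, 0.5]
```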

Polynomial Regression 22 / 41
Polynomial Regression Learning: Overfitting

Generally, the higher the polynomial degree, the better the model fits the training data.
The danger is overfitting, which makes the model generalize poorly to testing/future data.

Polynomial Regression 23 / 41
Learning Curves: Underfitting

Adding more training examples will not help with underfitting.
Instead, we need to use more complex models.

Polynomial Regression 24 / 41
Learning Curves: Overfitting

Adding more training examples may help with overfitting.
Another way to battle overfitting is regularization.
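The learning-curve plots on these two slides are not in the extracted text; below is a minimal sklearn sketch for producing such curves. The model and all settings are illustrative assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic quadratic data (illustrative only).
rng = np.random.default_rng(0)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=100)

# Training and validation error (RMSE) as the training set grows.
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    scoring="neg_mean_squared_error",
    train_sizes=np.linspace(0.1, 1.0, 10))

plt.plot(sizes, np.sqrt(-train_scores.mean(axis=1)), "r-+", label="train")
plt.plot(sizes, np.sqrt(-val_scores.mean(axis=1)), "b-", label="validation")
plt.xlabel("training set size"); plt.ylabel("RMSE"); plt.legend()
plt.show()
```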

Polynomial Regression 25 / 41
Ridge Regression

To regularize a model is to constrain it: the less freedom it has, the harder it will be for it to overfit.
For linear regression, this regularization is typically achieved by constraining the weights (θ's) of the model.
The first way to constrain the weights is Ridge regression, which simply adds α · (1/2) Σ_{i=1}^n θ_i^2 to the cost function:

J(θ) = MSE(θ) + α · (1/2) Σ_{i=1}^n θ_i^2

Notice θ_0 is NOT constrained.

Remember to scale the data before using Ridge regression.
α is a hyperparameter: a bigger α results in a flatter, smoother model.

Regularized Linear Models 26 / 41


Ridge Regression: α

Regularized Linear Models 27 / 41


Ridge Regression

Closed-form solution:
θ̂ = (X^T · X + αA)^{-1} · X^T · y,
where A is the identity matrix except that its top-left cell is 0 (so θ_0 is not regularized).
In sklearn: import Ridge
Stochastic GD uses the gradient of the regularized cost:
∇_θ J(θ) = (2/m) X^T (Xθ − y) + αθ
In sklearn: SGDRegressor(penalty="l2")
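A minimal sklearn usage sketch; the synthetic dataset and hyperparameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge, SGDRegressor

# Small synthetic dataset (illustrative only).
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

# Ridge regression; alpha is the regularization strength.
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X, y)

# The same kind of model trained with stochastic GD and an l2 penalty.
sgd_reg = SGDRegressor(penalty="l2", alpha=0.1, max_iter=1000)
sgd_reg.fit(X, y)
```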

Regularized Linear Models 28 / 41


Lasso Regression

The second way to constrain the weights is Lasso regression.
Lasso: least absolute shrinkage and selection operator.
It adds an l1 norm, instead of the l2 norm used in Ridge, to the cost function:

J(θ) = MSE(θ) + α Σ_{i=1}^n |θ_i|

p-norm: ||x||_p = (Σ_{i=1}^n |x_i|^p)^{1/p}

Lasso tends to completely eliminate the weights of the least important features (i.e., set them to 0), so it automatically performs feature selection.
The role of α is the same as in Ridge regression.

Regularized Linear Models 29 / 41


Lasso Regression

Regularized Linear Models 30 / 41


Lasso Regression

Closed-form solution: does not exist, because J(θ) is not differentiable at θ_i = 0.
Stochastic GD uses a subgradient of the regularized cost:
∇_θ J(θ) = (2/m) X^T (Xθ − y) + α · sign(θ),
where sign(θ_i) = −1 if θ_i < 0; 0 if θ_i = 0; and 1 if θ_i > 0.
In sklearn: SGDRegressor(penalty="l1"), or import Lasso
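A minimal sklearn usage sketch; the synthetic dataset and hyperparameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso, SGDRegressor

# Small synthetic dataset where only the first feature matters (illustrative only).
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Lasso drives the weights of unimportant features toward exactly 0.
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
print(lasso_reg.coef_)   # weights of the 2nd and 3rd features are typically 0

# Equivalent SGD-based training with an l1 penalty.
sgd_reg = SGDRegressor(penalty="l1", alpha=0.1, max_iter=1000)
sgd_reg.fit(X, y)
```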

Regularized Linear Models 31 / 41


Elastic Net

The last way to constrain the weights is Elastic Net, a combination of Ridge and Lasso.
It combines both penalty terms, with mix ratio r:

J(θ) = MSE(θ) + r·α Σ_{i=1}^n |θ_i| + ((1−r)/2)·α Σ_{i=1}^n θ_i^2

When to use which?
Ridge is a good default.
If you suspect some features are not useful, use Lasso or Elastic Net.
When there are more features than training examples, prefer Elastic Net.
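A minimal sklearn sketch; l1_ratio plays the role of the mix ratio r above, and the dataset and values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Small synthetic dataset (illustrative only).
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

# l1_ratio corresponds to r: 1.0 is pure Lasso, 0.0 is pure Ridge.
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
print(elastic_net.intercept_, elastic_net.coef_)
```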

Regularized Linear Models 32 / 41


Logistic Regression
Logistic regression outputs the probability that an instance belongs to a class, which can be used to predict classes for binary classification problems.
Although called regression, it is most often used for binary classification.
Multi-class regression or classification? Softmax regression. (next)
The logistic regression model:
p̂ = h_θ(x) = σ(θ^T · x),
where σ(t) = 1/(1 + e^{−t}) is the logistic function.

Logistic Regression 33 / 41
Logistic Regression

Once p̂ is computed for example x, the model predicts the class ŷ to be 0 if p̂ < 0.5, and 1 otherwise.
In other words, it predicts ŷ to be 0 if θ^T · x is negative, and 1 otherwise.
Now we design a cost function, first for one single training example.
For positive examples, we want the cost to be small when the estimated probability is big; for negative examples, when the probability is small.
Cost function per single training example:

c(θ) = −log(p̂) if y = 1; −log(1 − p̂) if y = 0.

Overall cost function (log loss):

J(θ) = −(1/m) Σ_{i=1}^m [ y^(i) log(p̂^(i)) + (1 − y^(i)) log(1 − p̂^(i)) ]

Logistic Regression 34 / 41
Logistic Regression

No known closed-form equation minimizes J(θ), but J(θ) is convex, so gradient descent can be used.
Log loss partial derivative:

∂J(θ)/∂θ_j = (1/m) Σ_{i=1}^m (σ(θ^T · x^(i)) − y^(i)) x_j^(i)

This is very similar to linear regression's partial derivative:

∂MSE(θ)/∂θ_j = (2/m) Σ_{i=1}^m (θ^T · x^(i) − y^(i)) x_j^(i)

Like linear regression, we can perform SGD, BGD, and MBGD.
In sklearn, logistic regression by default uses an l2 penalty to regularize.

Logistic Regression 35 / 41
Iris Dataset Example
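The code for this example is not in the extracted text; below is a minimal sklearn sketch consistent with the decision boundary discussed on the next slide (classifying Iris virginica from petal width). The exact setup used on the slide is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 3:]                 # petal width (cm)
y = (iris.target == 2).astype(int)   # 1 if Iris virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)

# Estimated probabilities and predictions around the ~1.6 cm decision boundary.
print(log_reg.predict_proba([[1.5], [1.7]]))
print(log_reg.predict([[1.5], [1.7]]))   # typically [0, 1]
```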

Logistic Regression 36 / 41
Iris Dataset Example: Decision Boundary

The decision boundary is at a petal width of around 1.6 cm.

Figure: Consider only one feature: petal width

Logistic Regression 37 / 41
Iris Dataset Example: Decision Boundary

The dashed line is the decision boundary, representing the points where the model estimates a 50% probability.
This line is θ_0 + θ_1 x_1 + θ_2 x_2 = 0.

Figure: Consider two features: petal length and sepal length

Logistic Regression 38 / 41
Softmax Regression

Logistic regression by default does not handle multi-class problems.
But it can be extended to do so; the result is called softmax regression.
Idea: given an instance x, we first compute a score s_k(x) for each class k, then estimate the probability of each class by applying the softmax function.
Scoring function: s_k(x) = θ_k^T · x
Softmax function: p̂_k = e^{s_k(x)} / Σ_{j=1}^K e^{s_j(x)}
Prediction: ŷ = argmax_k p̂_k, i.e., the class with the highest estimated probability (see the sketch below).
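A minimal NumPy sketch of scoring, the softmax function, and prediction; the parameter matrix and example instance are illustrative assumptions:

```python
import numpy as np

def softmax_predict(Theta, x):
    """Theta: (K, n) matrix whose k-th row is theta_k; x: feature vector of length n."""
    scores = Theta @ x                          # s_k(x) = theta_k^T . x for each class k
    exp_scores = np.exp(scores - scores.max())  # subtract max for numerical stability
    probs = exp_scores / exp_scores.sum()       # softmax: p_k = e^{s_k} / sum_j e^{s_j}
    return probs.argmax(), probs                # predicted class and class probabilities

# Tiny illustrative example with K = 3 classes and n = 2 features.
Theta = np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]])
x = np.array([2.0, 1.0])
print(softmax_predict(Theta, x))
```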

Softmax Regression 39 / 41
Softmax Regression Cost Function

The cost function for softmax regression is cross entropy (shown below).
Below, y_k^(i) = 1 if y^(i) = k, and 0 otherwise.
If K = 2, the cross entropy function reduces to the log loss function.
The cross entropy gradient for class k is also shown below.
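The equations themselves did not survive extraction; the standard forms, which match the description above, are:

```latex
J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log\!\left(\hat{p}_k^{(i)}\right)
\qquad
\nabla_{\theta_k} J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{p}_k^{(i)} - y_k^{(i)}\right) x^{(i)}
```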

Softmax Regression 40 / 41
Iris Dataset Example: Decision Boundary

Softmax Regression 41 / 41
