
Chapter 4: Training Regression Models

Dr. Xudong Liu


Assistant Professor
School of Computing
University of North Florida

Monday, 10/14/2019

1 / 41
Overview

Linear regression
Normal equation
Gradient descent
Batch gradient descent
Stochastic gradient descent
Mini-batch gradient descent

Polynomial regression
Regularization for linear models
Ridge regression
Lasso regression
Elastic Net
Logistic regression
Softmax regression

Overview 2 / 41
Linear Regression Model
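The model equation on these two slides did not survive extraction; presumably it is the standard linear regression prediction used throughout the rest of the deck (with the usual convention x_0 = 1 for the bias term):

```latex
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta^T \cdot x
```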

Overview 3 / 41
Linear Regression Model

Overview 4 / 41
Learning Linear Regression Models

We try to learn θ so that the following MSE cost function is minimized:

MSE(θ) = (1/m) Σ_{i=1}^m (θ^T · x^(i) − y^(i))^2

Linear Regression Learning 5 / 41


Normal Equation

To find the θ that minimizes the cost function, we can apply the following closed-form solution:

θ̂ = (X^T · X)^{-1} · X^T · y

How is it derived?

Linear Regression Learning 6 / 41


Normal Equation

Under this cost function, we are looking for a line that minimizes the sum of the squared distances from the data points to the line.
The cost function is convex, so all we need to do is compute the cost function's partial derivative w.r.t. θ, set it to 0, and solve for θ.
The θ that minimizes (1/m) Σ_{i=1}^m (θ^T · X^(i) − y^(i))^2 also minimizes Σ_{i=1}^m (θ^T · X^(i) − y^(i))^2.

Linear Regression Learning 7 / 41


Normal Equation

1. Σ_{i=1}^m (θ^T · X^(i) − y^(i))^2 = (y − Xθ)^T (y − Xθ) = E_θ
2. ∂E_θ/∂θ = (−X)^T (y − Xθ) + (−X)^T (y − Xθ), because of the rule ∂(u^T v)/∂w = (∂u^T/∂w) v + (∂v^T/∂w) u, here with u = v = y − Xθ
3. Thus, ∂E_θ/∂θ = 2 X^T (Xθ − y)
4. Setting ∂E_θ/∂θ = 0, we can solve for θ = (X^T · X)^{-1} · X^T · y

Linear Regression Learning 8 / 41


Normal Equation Experiment
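The experiment on this slide is not reproduced in the extracted text; below is a minimal NumPy sketch of what such an experiment might look like. The synthetic data (y = 4 + 3x + Gaussian noise) and all settings are illustrative assumptions, not taken from the slide.

```python
import numpy as np

# Hypothetical synthetic data: y = 4 + 3x + Gaussian noise.
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)

# Add x0 = 1 to each instance so theta_0 acts as the intercept.
X_b = np.c_[np.ones((m, 1)), X]

# Normal equation: theta = (X^T X)^{-1} X^T y
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_hat)  # should come out close to [[4.], [3.]]
```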

Linear Regression Learning 9 / 41


Normal Equation Complexity

Using the normal equation takes time roughly O(m·n^3), where n is the number of features and m is the number of examples.
So, it scales well with a large number of examples, but poorly with a large number of features.
Now let's look at other ways of learning θ that may be better when there are many features or too many training examples to fit in memory.

Linear Regression Learning 10 / 41


Gradient Descent
The general idea is to tweak parameters iteratively in order to minimize a cost function.
Analogy: suppose you are lost in the mountains in a dense fog. You can only feel the slope of the ground below your feet. A good strategy to get to the bottom quickly is to go downhill in the direction of the steepest slope.
The size of each downhill step is called the learning rate.

Gradient Descent 11 / 41
Learning Rate Too Small

Gradient Descent 12 / 41
Learning Rate Too Big

Gradient Descent 13 / 41
Gradient Descent Pitfalls

Gradient Descent 14 / 41
Gradient Descent with Feature Scaling

When all features have a similar scale, GD tends to converge quickly.
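As a quick illustration (a sketch, not from the slide), features can be standardized with sklearn's StandardScaler before running GD; the example data here is an assumption:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix whose columns have very different scales.
X = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 1000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance
```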

Gradient Descent 15 / 41
Batch Gradient Descent

The gradient vector of the MSE cost function is:

∇_θ MSE(θ) = (2/m) X^T (Xθ − y)

It is something we already computed for the normal equation!

Gradient descent step:

θ^(next step) ← θ − η · ∇_θ MSE(θ)
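A minimal NumPy sketch of this update rule (the function name, variable names, and hyperparameter values are illustrative assumptions, not taken from the slides):

```python
import numpy as np

# Batch gradient descent for linear regression.
# Assumes X_b already has the bias column of 1s and y is an (m, 1) vector.
def batch_gradient_descent(X_b, y, eta=0.1, n_iterations=1000):
    m, n = X_b.shape
    theta = np.random.randn(n, 1)                          # random initialization
    for _ in range(n_iterations):
        gradients = (2 / m) * X_b.T @ (X_b @ theta - y)    # full-batch gradient of MSE
        theta = theta - eta * gradients                    # GD step with learning rate eta
    return theta
```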

Gradient Descent 16 / 41
Batch Gradient Descent: Learning Rates

Gradient Descent 17 / 41
Other Gradient Descent Algorithms

Batch GD: uses the whole training set to compute the gradient.
Stochastic GD: uses one random training example to compute the gradient (see the sketch below).
Mini-batch GD: uses a small random set of training examples to compute the gradient.
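A minimal NumPy sketch of stochastic GD with a simple decreasing learning schedule (the schedule and all hyperparameters are illustrative assumptions):

```python
import numpy as np

# Stochastic gradient descent for linear regression.
# X_b: (m, n) matrix with a bias column; y: (m, 1) targets.
def stochastic_gradient_descent(X_b, y, n_epochs=50, t0=5, t1=50):
    m, n = X_b.shape
    theta = np.random.randn(n, 1)
    for epoch in range(n_epochs):
        for i in range(m):
            idx = np.random.randint(m)                # pick one random example
            xi = X_b[idx:idx + 1]
            yi = y[idx:idx + 1]
            gradients = 2 * xi.T @ (xi @ theta - yi)  # gradient on a single example
            eta = t0 / (epoch * m + i + t1)           # gradually decreasing learning rate
            theta = theta - eta * gradients
    return theta
```

sklearn's SGDRegressor class, used later on the regularization slides, implements this style of training.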

Gradient Descent 18 / 41
Comparing Linear Regression Algorithms

Gradient Descent 19 / 41
Polynomial Regression

Univariate polynomial: ŷ = θ_0 + θ_1 x + ... + θ_d x^d
Multivariate polynomial: more complex
E.g., for degree 2 and 2 variables,
ŷ = θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1 x_2 + θ_4 x_1^2 + θ_5 x_2^2
In general, for degree d and n variables, there are C(n+d, d) = (n+d)!/(d!·n!) terms.¹
Clearly, polynomial regression is more general than linear regression
and can fit non-linear data.

¹ https://mathoverflow.net/questions/225953/number-of-polynomial-terms-for-certain-degree-and-certain-number-of-variables
Polynomial Regression 20 / 41
Polynomial Regression Learning

We may transform the given attributes to get additional attributes of higher degrees.
Then the problem boils down to a linear regression problem.
The following data is generated from y = 0.5x_1^2 + 1.0x_1 + 2.0 + Gaussian noise.

Polynomial Regression 21 / 41
Polynomial Regression Learning
We first transform x_1 into polynomial features of degree 2, then fit the data to learn ŷ = 0.56x_1^2 + 0.93x_1 + 1.78.
The PolynomialFeatures class in sklearn can produce all C(n+d, d) polynomial features.
Of course, in practice we would not know what degree the data was actually generated with.
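A minimal sklearn sketch of this workflow; the data generation mirrors the quadratic function described above, and the exact settings (range of x, noise scale) are assumptions:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic data from y = 0.5*x^2 + 1.0*x + 2.0 + Gaussian noise.
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

# Add the degree-2 polynomial feature x^2, then fit ordinary linear regression.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)            # columns: [x, x^2]
lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)  # roughly 2.0 and [1.0, 0.5]
```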

Polynomial Regression 22 / 41
Polynomial Regression Learning: Overfitting

Generally, the higher the polynomial degree, the better the model fits the training data.
The danger is overfitting, which makes the model generalize poorly to testing/future data.

Polynomial Regression 23 / 41
Learning Curves: Underfitting

Adding more training examples will not help with underfitting.
Instead, we need to use more complex models.

Polynomial Regression 24 / 41
Learning Curves: Overfitting

Adding more training examples may help with overfitting.
Another way to battle overfitting is regularization.
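The learning-curve plots on these two slides are not in the extracted text; below is a minimal sklearn sketch for producing such curves. The model and all settings are illustrative assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic quadratic data (illustrative only).
rng = np.random.default_rng(0)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=100)

# Training and validation error (RMSE) as the training set grows.
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    scoring="neg_mean_squared_error",
    train_sizes=np.linspace(0.1, 1.0, 10))

plt.plot(sizes, np.sqrt(-train_scores.mean(axis=1)), "r-+", label="train")
plt.plot(sizes, np.sqrt(-val_scores.mean(axis=1)), "b-", label="validation")
plt.xlabel("training set size"); plt.ylabel("RMSE"); plt.legend()
plt.show()
```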

Polynomial Regression 25 / 41
Ridge Regression

To regularize a model is to constrain it: the less freedom it has, the harder it will be for it to overfit.
For linear regression, this regularization is typically achieved by constraining the weights (θ's) of the model.
The first way to constrain the weights is Ridge regression, which simply adds α · (1/2) Σ_{i=1}^n θ_i^2 to the cost function:

J(θ) = MSE(θ) + α · (1/2) Σ_{i=1}^n θ_i^2

Notice θ_0 is NOT constrained.

Remember to scale the data before using Ridge regression.
α is a hyperparameter: a bigger α results in a flatter, smoother model.

Regularized Linear Models 26 / 41


Ridge Regression: α

Regularized Linear Models 27 / 41


Ridge Regression

Closed-form solution:
θ̂ = (X^T · X + αA)^{-1} · X^T · y,
where A is the identity matrix except that its top-left cell is 0 (so θ_0 is not regularized).
In sklearn: import Ridge
Stochastic GD uses the gradient of the regularized cost:
∇_θ J(θ) = (2/m) X^T (Xθ − y) + αθ
In sklearn: SGDRegressor(penalty="l2")
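A minimal sklearn usage sketch; the synthetic dataset and hyperparameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge, SGDRegressor

# Small synthetic dataset (illustrative only).
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

# Ridge regression; alpha is the regularization strength.
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X, y)

# The same kind of model trained with stochastic GD and an l2 penalty.
sgd_reg = SGDRegressor(penalty="l2", alpha=0.1, max_iter=1000)
sgd_reg.fit(X, y)
```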

Regularized Linear Models 28 / 41


Lasso Regression

The second way to constrain the weights is Lasso regression.
Lasso: least absolute shrinkage and selection operator.
It adds an l1 norm, instead of the l2 norm used in Ridge, to the cost function:

J(θ) = MSE(θ) + α Σ_{i=1}^n |θ_i|

p-norm: ||x||_p = (Σ_{i=1}^n |x_i|^p)^{1/p}

Lasso tends to completely eliminate the weights of the least important features (i.e., set them to 0), so it automatically performs feature selection.
The role of α is the same as in Ridge regression.

Regularized Linear Models 29 / 41


Lasso Regression

Regularized Linear Models 30 / 41


Lasso Regression

Closed-form solution: does not exist, because J(θ) is not differentiable at θ_i = 0.
Stochastic GD uses a subgradient of the regularized cost:
∇_θ J(θ) = (2/m) X^T (Xθ − y) + α · sign(θ),
where sign(θ_i) = −1 if θ_i < 0; 0 if θ_i = 0; and 1 if θ_i > 0.
In sklearn: SGDRegressor(penalty="l1"), or import Lasso
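A minimal sklearn usage sketch; the synthetic dataset and hyperparameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso, SGDRegressor

# Small synthetic dataset where only the first feature matters (illustrative only).
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Lasso drives the weights of unimportant features toward exactly 0.
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
print(lasso_reg.coef_)   # weights of the 2nd and 3rd features are typically 0

# Equivalent SGD-based training with an l1 penalty.
sgd_reg = SGDRegressor(penalty="l1", alpha=0.1, max_iter=1000)
sgd_reg.fit(X, y)
```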

Regularized Linear Models 31 / 41


Elastic Net

The last way to constrain the weights is Elastic Net, a combination of Ridge and Lasso.
It combines both penalty terms, with mix ratio r:

J(θ) = MSE(θ) + r·α Σ_{i=1}^n |θ_i| + ((1−r)/2)·α Σ_{i=1}^n θ_i^2

When to use which?
Ridge is a good default.
If you suspect some features are not useful, use Lasso or Elastic Net.
When there are more features than training examples, prefer Elastic Net.
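A minimal sklearn sketch; l1_ratio plays the role of the mix ratio r above, and the dataset and values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Small synthetic dataset (illustrative only).
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

# l1_ratio corresponds to r: 1.0 is pure Lasso, 0.0 is pure Ridge.
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
print(elastic_net.intercept_, elastic_net.coef_)
```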

Regularized Linear Models 32 / 41


Logistic Regression
Logistic regression outputs the probability that an instance belongs to a class, which can be used to predict classes for binary classification problems.
Although called regression, it is most often used for binary classification.
Multi-class regression or classification? Softmax regression. (next)
The logistic regression model:
p̂ = h_θ(x) = σ(θ^T · x),
where σ(t) = 1/(1 + e^{−t}) is the logistic function.

Logistic Regression 33 / 41
Logistic Regression

Once p̂ is computed for example x, the model predicts the class ŷ to be 0 if p̂ < 0.5, and 1 otherwise.
In other words, it predicts ŷ to be 0 if θ^T · x is negative, and 1 otherwise.
Now we design a cost function, first for one single training example.
For positive examples, we want the cost to be small when the estimated probability is big; for negative examples, when the probability is small.
Cost function per single training example:

c(θ) = −log(p̂) if y = 1; −log(1 − p̂) if y = 0.

Overall cost function (log loss):

J(θ) = −(1/m) Σ_{i=1}^m [ y^(i) log(p̂^(i)) + (1 − y^(i)) log(1 − p̂^(i)) ]

Logistic Regression 34 / 41
Logistic Regression

No known closed-form equation minimizes J(θ), but J(θ) is convex, so gradient descent can be used.
Log loss partial derivative:

∂J(θ)/∂θ_j = (1/m) Σ_{i=1}^m (σ(θ^T · x^(i)) − y^(i)) x_j^(i)

This is very similar to linear regression's partial derivative:

∂MSE(θ)/∂θ_j = (2/m) Σ_{i=1}^m (θ^T · x^(i) − y^(i)) x_j^(i)

Like linear regression, we can perform SGD, BGD, and MBGD.
In sklearn, logistic regression by default uses an l2 penalty to regularize.

Logistic Regression 35 / 41
Iris Dataset Example
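The code for this example is not in the extracted text; below is a minimal sklearn sketch consistent with the decision boundary discussed on the next slide (classifying Iris virginica from petal width). The exact setup used on the slide is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 3:]                 # petal width (cm)
y = (iris.target == 2).astype(int)   # 1 if Iris virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)

# Estimated probabilities and predictions around the ~1.6 cm decision boundary.
print(log_reg.predict_proba([[1.5], [1.7]]))
print(log_reg.predict([[1.5], [1.7]]))   # typically [0, 1]
```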

Logistic Regression 36 / 41
Iris Dataset Example: Decision Boundary

The decision boundary is at a petal width of around 1.6 cm.

Figure: Consider only one feature: petal width

Logistic Regression 37 / 41
Iris Dataset Example: Decision Boundary

The dashed line is the decision boundary, representing the points where the model estimates a 50% probability.
This line is θ_0 + θ_1 x_1 + θ_2 x_2 = 0.

Figure: Consider two features: petal length and sepal length

Logistic Regression 38 / 41
Softmax Regression

Logistic regression by default does not handle multi-class problems.
But it can be extended to do so; the result is called softmax regression.
Idea: given an instance x, we first compute a score s_k(x) for each class k, then estimate the probability of each class by applying the softmax function.
Scoring function: s_k(x) = θ_k^T · x
Softmax function: p̂_k = e^{s_k(x)} / Σ_{j=1}^K e^{s_j(x)}
Prediction: ŷ = argmax_k p̂_k, i.e., the class with the highest estimated probability (see the sketch below).
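A minimal NumPy sketch of scoring, the softmax function, and prediction; the parameter matrix and example instance are illustrative assumptions:

```python
import numpy as np

def softmax_predict(Theta, x):
    """Theta: (K, n) matrix whose k-th row is theta_k; x: feature vector of length n."""
    scores = Theta @ x                          # s_k(x) = theta_k^T . x for each class k
    exp_scores = np.exp(scores - scores.max())  # subtract max for numerical stability
    probs = exp_scores / exp_scores.sum()       # softmax: p_k = e^{s_k} / sum_j e^{s_j}
    return probs.argmax(), probs                # predicted class and class probabilities

# Tiny illustrative example with K = 3 classes and n = 2 features.
Theta = np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]])
x = np.array([2.0, 1.0])
print(softmax_predict(Theta, x))
```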

Softmax Regression 39 / 41
Softmax Regression Cost Function

The cost function for softmax regression is cross entropy (shown below).
Below, y_k^(i) = 1 if y^(i) = k, and 0 otherwise.
If K = 2, the cross entropy function reduces to the log loss function.
The cross entropy gradient for class k is also shown below.
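The equations themselves did not survive extraction; the standard forms, which match the description above, are:

```latex
J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log\!\left(\hat{p}_k^{(i)}\right)
\qquad
\nabla_{\theta_k} J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{p}_k^{(i)} - y_k^{(i)}\right) x^{(i)}
```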

Softmax Regression 40 / 41
Iris Dataset Example: Decision Boundary

Softmax Regression 41 / 41
