CH 4
Monday, 10/14/2019
Overview
Linear regression
Normal equation
Gradient descent
Batch gradient descent
Stochastic gradient descent
Mini-batch gradient descent
Polynomial regression
Regularization for linear models
Ridge regression
Lasso regression
Elastic Net
Logistic regression
Softmax regression
Linear Regression Model
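For reference, the model predicts ŷ = θ_0 + θ_1 x_1 + … + θ_n x_n = θ^T · x, and training minimizes the MSE cost function MSE(θ) = (1/m) ∑_{i=1}^m (θ^T · x^(i) − y^(i))^2.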
Learning Linear Regression Models
To find the θ that minimizes the cost function, we can apply the following closed-form solution (the normal equation): θ̂ = (X^T · X)^{-1} · X^T · y.
How is it derived?
Under this cost function, we are looking for the line that minimizes the sum of squared distances from the data points to the line.
The cost function is convex, so all we need to do is compute its partial derivative w.r.t. θ, set the partial derivative to 0, and solve for θ.
The θ that minimizes (1/m) ∑_{i=1}^m (θ^T · x^(i) − y^(i))^2 also minimizes ∑_{i=1}^m (θ^T · x^(i) − y^(i))^2.
1. ∑_{i=1}^m (θ^T · x^(i) − y^(i))^2 = (y − Xθ)^T (y − Xθ) = E_θ
2. ∂E_θ/∂θ = (−X)^T (y − Xθ) + (−X)^T (y − Xθ); this is because ∂(u^T v)/∂x = (∂u/∂x)^T v + (∂v/∂x)^T u
3. Thus, ∂E_θ/∂θ = 2 X^T (Xθ − y)
4. Setting ∂E_θ/∂θ = 0 and solving for θ gives θ = (X^T · X)^{-1} · X^T · y
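A minimal NumPy sketch of this closed-form solution; the synthetic dataset (y ≈ 4 + 3x plus noise) is invented purely for illustration:

```python
import numpy as np

# Synthetic data: y ≈ 4 + 3x + Gaussian noise (illustrative values only)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(m)

# Add the bias column x0 = 1 so that theta[0] is the intercept
X_b = np.c_[np.ones((m, 1)), X]

# Normal equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X_b.T @ X_b) @ (X_b.T @ y)
print(theta)   # should be close to [4, 3]
```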
Using the normal equation takes roughly O(mn^2 + n^3) time, where n is the number of features and m is the number of examples.
So it scales well with a large number of examples, but poorly with a large number of features.
Now let's look at other ways of learning θ that may be better when there are many features, or when there are too many training examples to fit in memory.
Learning Rate Too Small
Learning Rate Too Big
Gradient Descent Pitfalls
Gradient Descent with Feature Scaling
Batch Gradient Descent
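A minimal NumPy sketch of batch gradient descent for linear regression; the learning rate eta and the iteration count are arbitrary illustrative choices, and X_b, y, m are assumed to be the bias-augmented data from the normal-equation sketch above:

```python
import numpy as np

eta = 0.1                    # learning rate (arbitrary choice)
n_iterations = 1000
theta = np.random.randn(2)   # random initialization of [intercept, slope]

for _ in range(n_iterations):
    # Gradient of the MSE over the full training set: (2/m) X^T (X theta - y)
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

print(theta)   # should again be close to [4, 3]
```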
Batch Gradient Descent: Learning Rates
Other Gradient Descent Algorithms
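For comparison, a sketch of stochastic gradient descent, which updates θ using one randomly chosen example at a time; the learning-rate schedule is an arbitrary illustrative choice, and X_b, y, m are again the data from the earlier sketches:

```python
import numpy as np

n_epochs = 50
t0, t1 = 5, 50                    # learning-schedule hyperparameters (arbitrary)
theta = np.random.randn(2)

for epoch in range(n_epochs):
    for i in range(m):
        idx = np.random.randint(m)                 # pick one training example at random
        xi = X_b[idx:idx + 1]                      # shape (1, n+1)
        yi = y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)   # gradient estimate from a single example
        eta = t0 / (epoch * m + i + t1)            # learning rate decays over time
        theta = theta - eta * gradients

print(theta)
```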
Comparing Linear Regression Algorithms
Polynomial Regression
Univariate polynomial: ŷ = θ_0 + θ_1 x + … + θ_d x^d
Multivariate polynomial: more complex
E.g., for degree 2 and 2 variables,
ŷ = θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1 x_2 + θ_4 x_1^2 + θ_5 x_2^2
In general, for degree d and n variables, there are C(n+d, d) = (n+d)! / (d! n!) terms.¹
Clearly, polynomial regression is more general than linear regression
and can fit non-linear data.
¹ https://mathoverflow.net/questions/225953/number-of-polynomial-terms-for-certain-degree-and-certain-number-of-variables
Polynomial Regression Learning
We first transform x_1 into polynomial features of degree 2, then fit the data to learn ŷ = 0.56 x_1^2 + 0.93 x_1 + 1.78.
The PolynomialFeatures class in sklearn can produce all C(n+d, d) polynomial features.
Of course, in practice we would not know what degree of polynomial generated the data.
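A minimal sklearn sketch of this two-step process; the synthetic quadratic dataset is invented for illustration, so the learned coefficients will not exactly match the 0.56, 0.93, 1.78 above:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic data roughly following y = 0.5 x^2 + x + 2 + noise (illustrative)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + np.random.randn(m)

# Step 1: transform x1 into polynomial features of degree 2 (adds the x1^2 column)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)                 # columns: [x1, x1^2]

# Step 2: fit ordinary linear regression on the expanded features
lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)       # roughly 2 and [1, 0.5]
```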
Polynomial Regression Learning: Overfitting
Generally, the higher the polynomial degree, the better the model fits the training data.
The danger is overfitting, which makes the model generalize poorly on testing/future data.
Learning Curves: Underfitting
Learning Curves: Overfitting
Ridge Regression
Closed-form solution:
θ̂ = (X^T · X + αA)^{-1} · X^T · y,
where A is the n × n identity matrix except that its top-left cell is 0 (so the bias term is not regularized).
In sklearn: from sklearn.linear_model import Ridge
Stochastic GD: the regularized cost is MSE(θ) + (α/2) ∑_{i=1}^n θ_i^2, and its gradient is ∇_θ MSE(θ) + αθ = (2/m) X^T · (Xθ − y) + αθ.
In sklearn: SGDRegressor(penalty="l2")
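A brief sklearn sketch of both routes; the dataset and the value of α are placeholders chosen only for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge, SGDRegressor

# Placeholder one-feature regression dataset (illustrative only)
X = np.random.rand(100, 1)
y = 3 * X[:, 0] + 1 + 0.1 * np.random.randn(100)

# Closed-form solution of the regularized normal equation
ridge_reg = Ridge(alpha=1.0, solver="cholesky").fit(X, y)

# Stochastic GD with an l2 penalty applies the same kind of regularization
sgd_reg = SGDRegressor(penalty="l2", alpha=0.1, max_iter=1000).fit(X, y)

print(ridge_reg.intercept_, ridge_reg.coef_)
print(sgd_reg.intercept_, sgd_reg.coef_)
```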
Lasso Regression
Lasso tends to completely eliminate the weights of the least important features (i.e., it sets them to 0), so it automatically performs feature selection.
The role of α is the same as in Ridge regression.
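A small sketch of this sparsity effect on a synthetic dataset where only the first of five features actually matters; the feature count and α are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Five features, but y depends only on the first one (illustrative setup)
rng = np.random.RandomState(42)
X = rng.randn(200, 5)
y = 2 * X[:, 0] + 0.1 * rng.randn(200)

lasso_reg = Lasso(alpha=0.1).fit(X, y)
print(lasso_reg.coef_)   # weights of the four unimportant features are driven to ~0
```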
Logistic Regression
Iris Dataset Example
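A minimal sklearn sketch along these lines; using petal width as the single feature and Iris virginica as the positive class is an assumption made here for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 3:]                   # petal width (cm), a single feature
y = (iris.target == 2).astype(int)     # 1 if Iris virginica, else 0

log_reg = LogisticRegression().fit(X, y)

# Estimated probability and predicted class for a 1.7 cm petal width
print(log_reg.predict_proba([[1.7]]))
print(log_reg.predict([[1.7]]))
```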
Iris Dataset Example: Decision Boundary
Softmax Regression
Prediction:
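For reference, the standard softmax formulation: the score of class k for an instance x is s_k(x) = (θ^(k))^T · x, the estimated probability is p̂_k = exp(s_k(x)) / ∑_{j=1}^K exp(s_j(x)), and the prediction is ŷ = argmax_k p̂_k.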
Softmax Regression Cost Function
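For reference, the standard cross-entropy cost (the indicator y_k^(i) is defined just below):
J(Θ) = −(1/m) ∑_{i=1}^m ∑_{k=1}^K y_k^(i) log(p̂_k^(i))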
Below, y_k^(i) = 1 if y^(i) = k, and 0 otherwise.
If K = 2, the cross entropy function is the log loss function.
Cross-entropy gradient vector for class k:
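Its standard form:
∇_{θ^(k)} J(Θ) = (1/m) ∑_{i=1}^m (p̂_k^(i) − y_k^(i)) · x^(i)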
Iris Dataset Example: Decision Boundary