L5: Normal Equations for Regression


Normal Equations

Source for this set of slides: Stanford Intro to ML course


Lecture Outcomes
What are Normal Equations?

Normal Equations versus Gradient Descent (GD)

Invertibility issue with Normal Equations

Derivation of Normal Equations


Normal Equations
Gradient Descent versus Normal Equations
Two possible ways to find optimal parameters that minimize the cost
function:
1. Gradient Descent
• So far, we have been using gradient descent.
• Gradient descent uses iterative steps to find the required parameters.

2. Normal Equations
• The normal equations are obtained by setting the partial derivatives of the sum of squared errors (least squares) to zero;
• The normal equations give us a method to find the parameters directly.
Normal Equation
Direct (Closed Form) Solution
Intuition: If 1D ($\theta \in \mathbb{R}$):

$J(\theta) = a\theta^2 + b\theta + c$; set $\frac{d}{d\theta}J(\theta) = 0$ and solve for $\theta$.

For $\theta \in \mathbb{R}^{n+1}$: set $\frac{\partial}{\partial \theta_j}J(\theta) = 0$ (for every $j$) and solve for $\theta_0, \theta_1, \ldots, \theta_n$.
Multi-Variate Regression
x0   Size (feet^2)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
1    2104            5                    1                   45                    460
1    1416            3                    2                   40                    232
1    1534            3                    2                   30                    315
1     852            2                    1                   36                    178

The matrix of predictor columns (including the leading column of 1's for the intercept term $x_0$) is also called the design matrix $X$; the price column forms the target vector $y$.

The parameters that minimize the cost function are given by:

$\theta = (X^T X)^{-1} X^T y$
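To make the closed form concrete, here is a minimal NumPy sketch (my own illustration, not from the original slides) that applies the normal equation to part of the table above. Keeping only two feature columns is an assumption made so that $X^T X$ stays invertible with just four examples.

```python
import numpy as np

# Design matrix built from the slide's table, keeping the intercept column x0
# plus two features (size in ft^2, number of bedrooms) so that the number of
# examples (4) exceeds the number of parameters (3) and X^T X is invertible.
# Using all four features with only 4 rows would make X^T X singular
# (see the invertibility discussion later in the slides).
X = np.array([
    [1.0, 2104.0, 5.0],
    [1.0, 1416.0, 3.0],
    [1.0, 1534.0, 3.0],
    [1.0,  852.0, 2.0],
])
y = np.array([460.0, 232.0, 315.0, 178.0])  # price in $1000s

# Normal equation: theta = (X^T X)^{-1} X^T y.
# Solving the linear system is numerically safer than forming the inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)

print("theta:", theta)
print("fitted prices:", X @ theta)
```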
Gradient Descent versus Normal Equations
($m$ training examples, $n$ features)

Gradient Descent:
• Need to choose the learning rate $\alpha$
• Needs many iterations
• Works well even when $n$ is large (one choice of threshold: $n > 10{,}000$)
• Need to scale the features

Normal Equation:
• No need to choose $\alpha$
• No need to iterate
• Need to compute $(X^T X)^{-1}$; slow if $n$ is very large, roughly $O(n^3)$
• No need to scale $X$
Which to use?
• As long as the number of features is not too large, use
the Normal Equations.

• When we talk about classification algorithms (e.g. logistic regression) or other more sophisticated algorithms, the normal-equations solution does not work and we have to use gradient descent.
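As a rough illustration of the trade-off above (a sketch on synthetic data, not from the slides), batch gradient descent needs a learning rate and many iterations to approach the same $\theta$ that the normal equation returns in a single linear solve. The data, learning rate, and iteration count below are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: m examples, n features, plus an intercept column of 1's.
m, n = 100, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])
theta_true = np.array([4.0, 2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=m)

# Normal equation: one linear solve, no learning rate, no iterations.
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent: needs a learning rate alpha and many iterations.
alpha, iters = 0.1, 2000
theta_gd = np.zeros(n + 1)
for _ in range(iters):
    grad = X.T @ (X @ theta_gd - y) / m   # gradient of the (1/2m)-scaled squared error
    theta_gd -= alpha * grad

print("normal equation :", theta_ne)
print("gradient descent:", theta_gd)      # agrees to several decimal places
```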
The Invertibility Issue with
Normal Equations
Issue with the Normal Equation

What if $X^T X$ is non-invertible (singular/degenerate)?


Two Issues with Normal Equations
$X^T X$ (an $(n+1) \times (n+1)$ matrix) is related to the covariance matrix of the predictors.

Issue 1: Some of the predictors can be written as linear combinations of others (linear dependence). E.g. $x_1$ = size in feet², $x_2$ = size in m².
• Solution: Remove the redundant predictors in pre-processing before taking the inverse. Otherwise, the inverse does not exist.

Issue 2: If the number of samples is smaller than the number of predictors, the model will overfit the samples.
• Solution: Reduce the number of predictors or use regularization.
Both scenarios can benefit from dimensionality reduction.
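The sketch below (my own illustration with assumed numbers, not from the slides) reproduces Issue 1 by including size in both ft² and m², then shows the two usual remedies: a pseudoinverse (minimum-norm solution) and a small ridge penalty that makes $X^T X + \lambda I$ invertible.

```python
import numpy as np

size_ft2 = np.array([2104.0, 1416.0, 1534.0, 852.0])
size_m2 = size_ft2 * 0.092903              # linear combination of size_ft2 -> redundancy
X = np.column_stack([np.ones(4), size_ft2, size_m2])
y = np.array([460.0, 232.0, 315.0, 178.0])

# The redundant column makes X rank-deficient, so (X^T X)^{-1} does not exist.
print("rank of X:", np.linalg.matrix_rank(X, tol=1e-8))   # 2, not 3

# Remedy 1: drop the redundant predictor in pre-processing (preferred), or use
# the pseudoinverse, which returns the minimum-norm least-squares solution.
theta_pinv = np.linalg.pinv(X) @ y

# Remedy 2: regularization (ridge). X^T X + lam*I is invertible for any lam > 0.
# (In practice the intercept is usually left unpenalized; ignored here for brevity.)
lam = 1e-3
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print("pseudoinverse solution:", theta_pinv)
print("ridge solution        :", theta_ridge)
```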
Derivation of the Normal
Equations
Cost Function in Matrix Form

$J(\theta) = \frac{1}{2}(X\theta - y)^T(X\theta - y) = \frac{1}{2}\sum_{i=1}^{m}\left(\theta^T x^{(i)} - y^{(i)}\right)^2$

Gradient with Respect to a Matrix
The gradient of a function with respect to a matrix (or vector) collects the partial derivatives with respect to each entry: $(\nabla_\theta f)_j = \partial f / \partial \theta_j$.

Gradient with respect to a vector, assuming $A$ symmetric:

$\nabla_x \, x^T A x = 2Ax$

Solving the Normal Equations
Set the partial derivatives (with respect to the vector $\theta$) equal to 0: $\nabla_\theta J(\theta) = 0$.

Given

$J(\theta) = \frac{1}{2}(X\theta - y)^T(X\theta - y) = \frac{1}{2}\left(\theta^T X^T X \theta - 2\theta^T X^T y + y^T y\right)$

Knowing $\nabla_x \, x^T A x = 2Ax$ ($A$ symmetric, here $A = X^T X$):

$\nabla_\theta J(\theta) = X^T X \theta - X^T y = 0 \;\Rightarrow\; X^T X \theta = X^T y \;\Rightarrow\; \theta = (X^T X)^{-1} X^T y$
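One way to sanity-check the derivation (a quick numerical sketch of my own, not from the slides) is to compare the analytic gradient $X^T X\theta - X^T y$ against finite differences of $J(\theta) = \frac{1}{2}(X\theta - y)^T(X\theta - y)$ on random data.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)
theta = rng.normal(size=4)

def J(t):
    r = X @ t - y
    return 0.5 * r @ r          # J(theta) = 1/2 (X theta - y)^T (X theta - y)

analytic = X.T @ X @ theta - X.T @ y   # gradient from the derivation

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
    for e in np.eye(4)
])

print(np.max(np.abs(analytic - numeric)))   # should be ~1e-6 or smaller
```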
Maximum Likelihood
Interpretation of Linear
Regression
Probabilistic Interpretation
• Assume the target is modeled as follows:
  $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$
• $\epsilon^{(i)}$ is an error term that captures either:
  • unmodeled effects (e.g., features very pertinent to predicting housing price that we left out of the regression),
  • or random noise.
• Assume that the $\epsilon^{(i)}$ are distributed IID (independently and identically distributed) according to a Gaussian distribution (also called a Normal distribution) with mean zero and some variance $\sigma^2$.
• We can write this assumption as "$\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$." I.e., the density of $\epsilon^{(i)}$ is given by
  $p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$
Likelihood L(θ)
• The distribution of $y^{(i)}$ given $x^{(i)}$ and parameterized by $\theta$ is then given by:
  $p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$
• This is also represented by $p(y \mid X; \theta)$ for all the data, where $y$ is the vector of all target values $y^{(i)}$.
• Assuming the $y^{(i)}$ are IID (since the $\epsilon^{(i)}$ are distributed IID), the probability of the data $y$ as a whole is the product of the individual densities.
• When we wish to explicitly view this as a function of $\theta$, we call it the likelihood function $L(\theta)$:
  $L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$
Log Likelihood: $\ell(\theta) = \log L(\theta)$
• The principle of maximum likelihood says that we should choose $\theta$ so as to make the data as probable as possible. So, we should choose $\theta$ to maximize $L(\theta)$.
• Instead of maximizing $L(\theta)$, we can also maximize any strictly increasing function of $L(\theta)$.
• A common option is to maximize the log of the likelihood, $\ell(\theta)$:
  $\ell(\theta) = \log L(\theta) = m \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$
Maximum Likelihood Estimate (MLE)
• Maximizing the log likelihood $\ell(\theta)$:
  $\ell(\theta) = m \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$
• becomes equivalent to minimizing:
  $\frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$
• which is the same as minimizing the cost function $J(\theta)$ for linear regression.
• Note that $\sigma$ does not affect the MLE result: the optimal $\theta$ is the same for any value of $\sigma^2$.
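To illustrate the equivalence numerically (a small sketch with assumed synthetic data, not from the slides), minimizing the negative log-likelihood under the Gaussian noise model recovers the same $\theta$ as the normal equation, for any fixed $\sigma$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
m, n = 50, 2
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=m)

sigma = 0.3  # any fixed sigma gives the same argmin over theta

def neg_log_likelihood(theta):
    r = y - X @ theta
    # -l(theta) = m*log(sqrt(2*pi)*sigma) + (1/(2*sigma^2)) * sum(r^2)
    return m * np.log(np.sqrt(2 * np.pi) * sigma) + (r @ r) / (2 * sigma**2)

theta_mle = minimize(neg_log_likelihood, np.zeros(n + 1)).x   # maximum likelihood
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)                  # normal equation

print(np.allclose(theta_mle, theta_ls, atol=1e-4))   # True: MLE matches least squares
```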
