
Improving Learning: Feature Scaling
•  Idea: Ensure that features have similar scales
[Figure: two scatter plots, "Before Feature Scaling" and "After Feature Scaling"; after rescaling, both features span comparable ranges]
•  Makes gradient descent converge much faster

Feature Standardization
•  Rescales features to have zero mean and unit variance
   –  Let $\mu_j$ be the mean of feature $j$:  $\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)}$
   –  Replace each value with:

$$x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j} \qquad \text{for } j = 1 \ldots d \text{ (not } x_0\text{!)}$$

•  $s_j$ is the standard deviation of feature $j$
•  Could also use the range of feature $j$ ($\max_j - \min_j$) for $s_j$
•  Must apply the same transformation to instances for both training and prediction, as in the sketch below
•  Outliers can cause problems
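A minimal NumPy sketch of this procedure (function and variable names are illustrative, not from the slides; the bias feature $x_0 = 1$ is assumed to be appended after scaling):

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute the per-feature mean and standard deviation on the training set only."""
    mu = X_train.mean(axis=0)
    s = X_train.std(axis=0)
    return mu, s

def standardize(X, mu, s):
    """Apply the training-set statistics to any instances (training or prediction time)."""
    return (X - mu) / s

# Toy usage: the same mu and s are reused at prediction time, never recomputed
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_new = np.array([[1.5, 250.0]])
mu, s = fit_standardizer(X_train)
X_train_std = standardize(X_train, mu, s)
X_new_std = standardize(X_new, mu, s)  # same transformation as for training
```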
Quality of Fit

[Figure: three Price vs. Size fits: underfitting (high bias), correct fit, and overfitting (high variance)]

Overfitting:
•  The learned hypothesis may fit the training set very well ($J(\theta) \approx 0$)
•  ...but fails to generalize to new examples

Based on example by Andrew Ng


Regularization
•  A method for automatically controlling the complexity of the learned hypothesis
•  Idea: penalize large values of $\theta_j$
   –  Can incorporate into the cost function
   –  Works well when we have a lot of features, each of which contributes a bit to predicting the label
•  Can also address overfitting by eliminating features (either manually or via model selection)

Regularization
•  Linear regression objective function

$$J(\theta) = \underbrace{\frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2}_{\text{model fit to data}} + \underbrace{\frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2}_{\text{regularization}}$$

   –  $\lambda$ is the regularization parameter ($\lambda \ge 0$)
   –  No regularization on $\theta_0$!
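As a sketch, the objective above could be computed in NumPy as follows (assuming $X$ already includes the $x_0 = 1$ column; names are illustrative, not from the slides):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2n) * sum_i (h_theta(x_i) - y_i)^2 + (lam/2) * sum_{j>=1} theta_j^2."""
    n = len(y)
    residuals = X @ theta - y                  # h_theta(x^(i)) - y^(i) for all i
    fit = (residuals @ residuals) / (2 * n)    # model fit to data
    reg = (lam / 2) * np.sum(theta[1:] ** 2)   # regularization; theta_0 is excluded
    return fit + reg
```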

Understanding Regularization

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$

•  Note that $\sum_{j=1}^{d} \theta_j^2 = \|\theta_{1:d}\|_2^2$
   –  This is the squared magnitude of the feature coefficient vector!
•  We can also think of this as: $\sum_{j=1}^{d} (\theta_j - 0)^2 = \|\theta_{1:d} - \vec{0}\|_2^2$
•  L2 regularization pulls coefficients toward 0
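A one-line sanity check of this identity in NumPy (coefficient values invented for illustration):

```python
import numpy as np

theta = np.array([0.5, -1.2, 3.0, 0.7])    # [theta_0, theta_1, ..., theta_d]
penalty = np.sum(theta[1:] ** 2)           # sum_{j=1}^{d} theta_j^2
norm_sq = np.linalg.norm(theta[1:]) ** 2   # ||theta_{1:d}||_2^2
assert np.isclose(penalty, norm_sq)
```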
Understanding Regularization

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$

•  What happens if we set $\lambda$ to be huge (e.g., $10^{10}$)?

[Figure: Price vs. Size with a high-degree polynomial fit]

Based on example by Andrew Ng


Understanding Regularization

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$

•  What happens if we set $\lambda$ to be huge (e.g., $10^{10}$)? The penalty dominates, so $\theta_1, \ldots, \theta_d$ are all driven to $\approx 0$ and the hypothesis reduces to the constant $h_\theta(x) \approx \theta_0$, which underfits.

[Figure: Price vs. Size with a flat fit at $\theta_0$; the penalized coefficients $\theta_1, \ldots, \theta_d$ are all $\approx 0$]

Based on example by Andrew Ng
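To see this numerically, here is a sketch that uses the closed-form minimizer of the objective above rather than gradient descent (toy data and all names invented for illustration):

```python
import numpy as np

# Made-up (size, price) data and degree-4 polynomial features with a bias column
size = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
price = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.vander(size, N=5, increasing=True)   # columns: 1, x, x^2, x^3, x^4

def ridge_fit(X, y, lam):
    """Closed-form minimizer of (1/2n)*||X theta - y||^2 + (lam/2)*sum_{j>=1} theta_j^2.
    Setting the gradient to zero gives (X^T X + n*lam*I') theta = X^T y,
    where I' is the identity with a 0 in position (0, 0) so theta_0 is unpenalized."""
    n = len(y)
    I = np.eye(X.shape[1])
    I[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + n * lam * I, X.T @ y)

theta = ridge_fit(X, price, lam=1e10)
print(theta)   # theta_1..theta_4 are ~0; the fit collapses to the flat line theta_0
```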


Regularized Linear Regression
•  Cost Function

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$

•  Fit by solving $\min_\theta J(\theta)$
•  Gradient update:

$$\frac{\partial}{\partial \theta_0} J(\theta): \qquad \theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$

$$\frac{\partial}{\partial \theta_j} J(\theta): \qquad \theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$$

(the final $-\alpha \lambda \theta_j$ term is the regularization)
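A compact NumPy sketch of one such update (assuming $X$ contains the $x_0 = 1$ column; names are illustrative, not from the slides):

```python
import numpy as np

def gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent update; theta[0] is the unpenalized intercept."""
    n = len(y)
    residuals = X @ theta - y        # h_theta(x^(i)) - y^(i)
    grad = (X.T @ residuals) / n     # (1/n) * sum_i (...) * x_j^(i), for every j
    grad[1:] += lam * theta[1:]      # + lambda * theta_j, for j >= 1 only
    return theta - alpha * grad

# Toy usage
X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]])   # ones column plus one feature
y = np.array([1.0, 2.0, 3.0])
theta = np.zeros(2)
for _ in range(1000):
    theta = gradient_step(theta, X, y, alpha=0.1, lam=0.01)
```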
Regularized Linear Regression

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$

$$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$

$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$$

•  We can rewrite the gradient step as:

$$\theta_j \leftarrow \theta_j (1 - \alpha \lambda) - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$

•  Since $0 < 1 - \alpha \lambda < 1$ for typical choices of $\alpha$ and $\lambda$, each step first shrinks $\theta_j$ slightly toward 0 and then applies the ordinary (unregularized) gradient update.