04: Linear Regression


Linear Regression

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made
their course materials freely available online. Feel free to reuse or adapt these slides for your own academic
purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
Regression
Given:
–  Data X = \{x^{(1)}, \ldots, x^{(n)}\} where x^{(i)} \in \mathbb{R}^d
–  Corresponding labels y = \{y^{(1)}, \ldots, y^{(n)}\} where y^{(i)} \in \mathbb{R}

[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1975–2015, with linear and quadratic regression fits.]
Data from G. Witt. Journal of Statistics Education, Volume 21, Number 1 (2013)
Prostate Cancer Dataset
•  97 samples, partitioned into 67 train / 30 test
•  Eight predictors (features):
–  6 continuous (4 log transforms), 1 binary, 1 ordinal
•  Continuous outcome variable:
–  lpsa: log(prostate specific antigen level)

Based on slide by Jeff Howbert


Linear Regression
•  Hypothesis:

   y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j

   Assume x_0 = 1

•  Fit model by minimizing sum of squared errors

Figures are courtesy of Greg Shakhnarovich


Least Squares Linear Regression
•  Cost function:

   J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

•  Fit by solving \min_\theta J(\theta)
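As a minimal sketch of this cost function (assuming NumPy and a design matrix X whose first column is all ones; the function name is ours):

```python
import numpy as np

def cost(theta, X, y):
    """Least squares cost J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = X @ theta - y              # h_theta(x^(i)) - y^(i) for every instance i
    return (residuals @ residuals) / (2 * len(y))
```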
Intuition Behind Cost Function

   J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

For insight on J(), let's assume x \in \mathbb{R}, so \theta = [\theta_0, \theta_1]

Based on example by Andrew Ng
Intuition Behind Cost Function (continued)

(for fixed θ, h_θ(x) is a function of x)        (J is a function of the parameter θ_1)

[Figure: left panel – the training data (1,1), (2,2), (3,3) and the line h_θ(x) plotted over x ∈ [0, 3] for a fixed θ; right panel – J plotted against θ_1 ∈ [−0.5, 2.5].]

Example: for θ = [0, 0.5],

   J([0, 0.5]) = \frac{1}{2 \cdot 3} \left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] \approx 0.58

Based on example by Andrew Ng
Intuition Behind Cost Function (continued)

For θ = [0, 0],

   J([0, 0]) = \frac{1}{2 \cdot 3} \left[ 1^2 + 2^2 + 3^2 \right] \approx 2.333

J(θ) is convex, so it has a single global minimum.

Based on example by Andrew Ng
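A quick numerical check of these worked values (a sketch assuming NumPy; the toy dataset (1,1), (2,2), (3,3) is inferred from the computation above):

```python
import numpy as np

# Toy dataset inferred from the slide: x = 1, 2, 3 with targets y = 1, 2, 3.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # first column is x0 = 1
y = np.array([1.0, 2.0, 3.0])

def cost(theta):
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * len(y))

print(round(cost(np.array([0.0, 0.5])), 2))  # 0.58
print(round(cost(np.array([0.0, 0.0])), 2))  # 2.33
```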
Intuition Behind Cost Function (continued)

[A sequence of image-only slides by Andrew Ng: each pairs the hypothesis h_θ(x) for a fixed θ (a function of x, left) with the corresponding cost J as a function of the parameters θ (right), for several different choices of θ.]
Basic Search Procedure
•  Choose an initial value for θ
•  Until we reach a minimum:
–  Choose a new value for θ to reduce J(θ)

[Figure by Andrew Ng: surface plot of J(θ_0, θ_1) over the (θ_0, θ_1) plane.]
Basic Search Procedure (continued)

Since the least squares objective function is convex, we don't need to worry about local minima.

[Figure by Andrew Ng: surface plot of J(θ_0, θ_1).]
Gradient Descent
•  Initialize θ
•  Repeat until convergence:

   \theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)        (simultaneous update for j = 0 ... d)

   α is the learning rate (small), e.g., α = 0.05

[Figure: J(θ) plotted against θ, with gradient descent steps of size α moving toward the minimum.]
Gradient Descent
•  Initialize θ
•  Repeat until convergence:

   \theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)        (simultaneous update for j = 0 ... d)

For linear regression:

   \frac{\partial}{\partial \theta_j} J(\theta)
      = \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
      = \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2
      = \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)
      = \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)}
      = \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
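As a sanity check on this derivation, the analytic gradient can be compared against a finite-difference approximation (a sketch assuming NumPy; the data and θ below are synthetic, for illustration only):

```python
import numpy as np

def cost(theta, X, y):
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

def gradient(theta, X, y):
    # (1/n) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i), computed for all j at once
    return X.T @ (X @ theta - y) / len(y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])  # x0 = 1 plus two features
y = rng.normal(size=5)
theta = rng.normal(size=3)

eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(gradient(theta, X, y), numeric))  # True
```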
Gradient Descent for Linear Regression
•  Initialize θ
•  Repeat until convergence:

   \theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}        (simultaneous update for j = 0 ... d)

•  To achieve the simultaneous update:
–  At the start of each GD iteration, compute h_\theta(x^{(i)})
–  Use this stored value in the update step loop
•  Assume convergence when \| \theta_{new} - \theta_{old} \|_2 < \epsilon

   L2 norm:  \| v \|_2 = \sqrt{ \sum_i v_i^2 } = \sqrt{ v_1^2 + v_2^2 + \ldots + v_{|v|}^2 }
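Putting the pieces together, a minimal sketch of this loop (assuming NumPy, a design matrix X with a leading column of ones, and illustrative values for α, ε, and the iteration cap):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=10000):
    """Batch gradient descent for least squares linear regression."""
    n, d_plus_1 = X.shape
    theta = np.zeros(d_plus_1)
    for _ in range(max_iters):
        h = X @ theta                                   # h_theta(x^(i)) for all i, computed once
        theta_new = theta - alpha * X.T @ (h - y) / n   # simultaneous update of all theta_j
        if np.linalg.norm(theta_new - theta) < eps:     # ||theta_new - theta_old||_2 < eps
            return theta_new
        theta = theta_new
    return theta
```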
Gradient Descent (illustration)

[A sequence of image-only slides by Andrew Ng showing successive gradient descent iterations: the left panel plots the current hypothesis h_θ(x) for fixed θ as a function of x; the right panel plots J as a function of the parameters θ. The initial hypothesis shown is h(x) = -900 - 0.1x.]
Choosing α

α too small:
•  Slow convergence

α too large:
•  Increasing value for J(θ)
•  May overshoot the minimum
•  May fail to converge
•  May even diverge

To see if gradient descent is working, print out J(θ) each iteration
•  The value should decrease at each iteration
•  If it doesn't, adjust α
Extending Linear Regression to More Complex Models
•  The inputs X for linear regression can be:
–  Original quantitative inputs
–  Transformations of quantitative inputs
   •  e.g., log, exp, square root, square, etc.
–  Polynomial transformations
   •  example: y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3
–  Basis expansions
–  Dummy coding of categorical inputs
–  Interactions between variables
   •  example: x_3 = x_1 \cdot x_2

This allows use of linear regression techniques to fit non-linear datasets, as sketched below.
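A minimal sketch of building such transformed inputs (assuming NumPy; the raw features x1, x2 and their values are hypothetical, for illustration):

```python
import numpy as np

# Hypothetical raw features for a handful of instances.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.5, 1.0, 1.5, 2.0])

# Design matrix with a bias column plus transformed and interaction features.
X = np.column_stack([
    np.ones_like(x1),   # x0 = 1 (bias)
    x1,                 # original quantitative input
    np.log(x2),         # transformation of a quantitative input
    x1 ** 2,            # polynomial term
    x1 * x2,            # interaction: x3 = x1 * x2
])
# X can now be fed to ordinary linear regression (gradient descent or closed form).
```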
Linear Basis Function Models
•  Generally,

   h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)

   where \phi_j is a basis function

•  Typically \phi_0(x) = 1, so that \theta_0 acts as a bias
•  In the simplest case, we use linear basis functions: \phi_j(x) = x_j

Based on slide by Christopher Bishop (PRML)


Linear Basis Function Models
•  Polynomial basis functions:
–  These are global; a small change in x affects all basis functions

•  Gaussian basis functions:
–  These are local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (width).

Based on slide by Christopher Bishop (PRML)
Linear Basis Function Models
•  Sigmoidal basis functions:
–  These are also local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (slope).

Based on slide by Christopher Bishop (PRML)
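A sketch of these basis functions in code (assuming NumPy, and using the standard forms from Bishop's PRML, which the slide images presumably showed: polynomial x^j, Gaussian exp(-(x - μ_j)^2 / (2 s^2)), and sigmoidal σ((x - μ_j)/s) with the logistic sigmoid):

```python
import numpy as np

def polynomial_basis(x, degree):
    # phi_j(x) = x^j, j = 0..degree  (global: changing x changes every feature)
    return np.column_stack([x ** j for j in range(degree + 1)])

def gaussian_basis(x, centers, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))  (local, centered at mu_j with width s)
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * s ** 2))

def sigmoidal_basis(x, centers, s):
    # phi_j(x) = sigma((x - mu_j) / s), sigma(a) = 1 / (1 + exp(-a))  (local, slope set by s)
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))

x = np.linspace(0.0, 1.0, 5)
centers = np.array([0.25, 0.5, 0.75])
print(gaussian_basis(x, centers, s=0.1).shape)  # (5, 3): one column per basis function
```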


Example of Fitting a Polynomial Curve with a Linear Model

   y = \theta_0 + \theta_1 x + \theta_2 x^2 + \ldots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j
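As a minimal sketch of such a fit (assuming NumPy; the synthetic data and the choice of degree p = 3 are illustrative), the polynomial model is just a linear model on the polynomial basis features:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)   # noisy target, for illustration

p = 3
X = np.column_stack([x ** j for j in range(p + 1)])          # columns: 1, x, x^2, x^3
theta, *_ = np.linalg.lstsq(X, y, rcond=None)                # least squares fit of the linear model
y_hat = X @ theta                                            # fitted polynomial curve
```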
Linear Basis Function Models
•  Basic linear model:

   h_\theta(x) = \sum_{j=0}^{d} \theta_j x_j

•  Generalized linear model:

   h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)

•  Once we have replaced the data by the outputs of the basis functions, fitting the generalized model is exactly the same problem as fitting the basic model
–  Unless we use the kernel trick – more on that when we cover support vector machines
–  Therefore, there is no point in cluttering the math with basis functions

Based on slide by Geoff Hinton
Linear Algebra Concepts
•  A vector in \mathbb{R}^d is an ordered set of d real numbers
–  e.g., v = [1, 6, 3, 4] is in \mathbb{R}^4
–  "[1, 6, 3, 4]" is a column vector:

   \begin{pmatrix} 1 \\ 6 \\ 3 \\ 4 \end{pmatrix}

–  as opposed to a row vector:

   \begin{pmatrix} 1 & 6 & 3 & 4 \end{pmatrix}

•  An m-by-n matrix is an object with m rows and n columns, where each entry is a real number:

   \begin{pmatrix} 1 & 2 & 8 \\ 4 & 78 & 6 \\ 9 & 3 & 2 \end{pmatrix}

Based on slides by Joseph Bradley
Linear Algebra Concepts
•  Transpose: reflect a vector/matrix across its main diagonal:

   \begin{pmatrix} a \\ b \end{pmatrix}^{\top} = \begin{pmatrix} a & b \end{pmatrix}
   \qquad
   \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{\top} = \begin{pmatrix} a & c \\ b & d \end{pmatrix}

–  Note: (Ax)^{\top} = x^{\top} A^{\top}   (we'll define multiplication soon...)

•  Vector norms:
–  The Lp norm of v = (v_1, \ldots, v_k) is \left( \sum_i |v_i|^p \right)^{1/p}
–  Common norms: L1, L2
–  L_\infty = \max_i |v_i|
•  The length of a vector v is L2(v)

Based on slides by Joseph Bradley
Linear Algebra Concepts
•  Vector dot product:

   u \cdot v = (u_1 \; u_2) \cdot (v_1 \; v_2) = u_1 v_1 + u_2 v_2

–  Note: the dot product of u with itself equals \mathrm{length}(u)^2 = \| u \|_2^2

•  Matrix product:

   A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \quad
   B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}

   AB = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix}

Based on slides by Joseph Bradley


Linear Algebra Concepts
•  Vector products:
–  Dot product:

   u \cdot v = u^{\top} v = (u_1 \; u_2) \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2

–  Outer product:

   u v^{\top} = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} (v_1 \; v_2) = \begin{pmatrix} u_1 v_1 & u_1 v_2 \\ u_2 v_1 & u_2 v_2 \end{pmatrix}

Based on slides by Joseph Bradley
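A minimal NumPy sketch of these operations (the values are illustrative only):

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

print(u @ v)                      # dot product: u1*v1 + u2*v2 = 11.0
print(np.outer(u, v))             # outer product: [[u1*v1, u1*v2], [u2*v1, u2*v2]]
print(A.T)                        # transpose
print(A @ B)                      # matrix product
print(np.linalg.norm(u))          # L2 norm (length) of u
print(np.linalg.norm(u, np.inf))  # L_infinity norm: max_i |u_i|
```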


Vectorization
•  Benefits of vectorization:
–  More compact equations
–  Faster code (using optimized matrix libraries)
•  Consider our model:

   h(x) = \sum_{j=0}^{d} \theta_j x_j

•  Let

   \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix}
   \qquad
   x^{\top} = \begin{bmatrix} 1 & x_1 & \ldots & x_d \end{bmatrix}

•  Can write the model in vectorized form as h(x) = \theta^{\top} x
Vectorization
•  Consider our model for n instances:

   h\!\left(x^{(i)}\right) = \sum_{j=0}^{d} \theta_j x_j^{(i)}

•  Let

   \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \in \mathbb{R}^{(d+1) \times 1}
   \qquad
   X = \begin{bmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(i)} & \ldots & x_d^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{bmatrix} \in \mathbb{R}^{n \times (d+1)}

•  Can write the model in vectorized form as h_\theta(x) = X\theta
Vectorization
•  For the linear regression cost function:

   J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
             = \frac{1}{2n} \sum_{i=1}^{n} \left( \theta^{\top} x^{(i)} - y^{(i)} \right)^2

•  Let

   y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} \in \mathbb{R}^{n \times 1}

   Then

   J(\theta) = \frac{1}{2n} (X\theta - y)^{\top} (X\theta - y)

   where X \in \mathbb{R}^{n \times (d+1)}, \theta \in \mathbb{R}^{(d+1) \times 1}, (X\theta - y)^{\top} \in \mathbb{R}^{1 \times n}, and (X\theta - y) \in \mathbb{R}^{n \times 1}
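In code, the vectorized form computes the same value as the explicit sum; a sketch assuming NumPy and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # shape (n, d+1)
y = rng.normal(size=n)
theta = rng.normal(size=d + 1)                                # shape (d+1,)

# Loop form: J = 1/(2n) * sum_i (theta^T x^(i) - y^(i))^2
J_loop = sum((theta @ X[i] - y[i]) ** 2 for i in range(n)) / (2 * n)

# Vectorized form: J = 1/(2n) * (X theta - y)^T (X theta - y)
r = X @ theta - y
J_vec = (r @ r) / (2 * n)

print(np.isclose(J_loop, J_vec))  # True
```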
Closed Form Solution
•  Instead of using GD, solve for the optimal θ analytically
–  Notice that the solution is where \frac{\partial}{\partial \theta} J(\theta) = 0
•  Derivation:

   J(\theta) = \frac{1}{2n} (X\theta - y)^{\top} (X\theta - y)
            \propto \theta^{\top} X^{\top} X \theta - y^{\top} X \theta - \theta^{\top} X^{\top} y + y^{\top} y
            \propto \theta^{\top} X^{\top} X \theta - 2\theta^{\top} X^{\top} y + y^{\top} y

   Take the derivative and set it equal to 0, then solve for θ:

   \frac{\partial}{\partial \theta} \left( \theta^{\top} X^{\top} X \theta - 2\theta^{\top} X^{\top} y + y^{\top} y \right) = 0
   (X^{\top} X)\theta - X^{\top} y = 0
   (X^{\top} X)\theta = X^{\top} y

•  Closed form solution: \theta = (X^{\top} X)^{-1} X^{\top} y
Closed Form Solution
•  Can obtain θ by simply plugging X and y into

   \theta = (X^{\top} X)^{-1} X^{\top} y

   where

   X = \begin{bmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(i)} & \ldots & x_d^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{bmatrix}
   \qquad
   y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}

•  If X^{\top} X is not invertible (i.e., singular), we may need to:
–  Use the pseudo-inverse instead of the inverse
   •  In Python: numpy.linalg.pinv(a)
–  Remove redundant (not linearly independent) features
–  Remove extra features to ensure that d ≤ n
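A minimal sketch of the closed-form fit (assuming NumPy; pinv is used as the slide suggests, which also covers the singular case, and the function names are ours):

```python
import numpy as np

def fit_closed_form(X, y):
    """theta = (X^T X)^{-1} X^T y, computed with the pseudo-inverse for robustness."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Equivalent when X^T X is invertible, and more numerically stable:
def fit_normal_equations(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)
```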
Gradient Descent vs. Closed Form Solution

Gradient Descent:
•  Requires multiple iterations
•  Need to choose α
•  Works well when n is large
•  Can support incremental learning

Closed Form Solution:
•  Non-iterative
•  No need for α
•  Slow if n is large
–  Computing (X^{\top} X)^{-1} is roughly O(d^3), and forming X^{\top} X is O(n d^2)
Improving Learning: Feature Scaling
•  Idea: Ensure that features have similar scales

[Figure: the (θ_1, θ_2) parameter space before feature scaling and after feature scaling.]

•  Makes gradient descent converge much faster
Feature Standardization
•  Rescales features to have zero mean and unit variance
–  Let μ_j be the mean of feature j:

   \mu_j = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)}

–  Replace each value with:

   x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j}        for j = 1 ... d  (not x_0!)

   •  s_j is the standard deviation of feature j
   •  Could also use the range of feature j (max_j – min_j) for s_j

•  Must apply the same transformation to instances for both training and prediction
•  Outliers can cause problems
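A minimal sketch of this (assuming NumPy and a design matrix whose column 0 is the all-ones bias, which is left unscaled; the function name is ours):

```python
import numpy as np

def standardize(X_train, X_test):
    """Standardize features to zero mean, unit variance; column 0 (bias) is untouched."""
    mu = X_train[:, 1:].mean(axis=0)           # per-feature means from the training set
    s = X_train[:, 1:].std(axis=0)             # per-feature standard deviations
    X_train = X_train.copy()
    X_test = X_test.copy()
    X_train[:, 1:] = (X_train[:, 1:] - mu) / s
    X_test[:, 1:] = (X_test[:, 1:] - mu) / s   # same transformation applied at prediction time
    return X_train, X_test
```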
Quality of Fit

[Figure: three fits of Price vs. Size – underfitting (high bias), correct fit, and overfitting (high variance).]

Overfitting:
•  The learned hypothesis may fit the training set very well (J(\theta) \approx 0)
•  ...but fails to generalize to new examples

Based on example by Andrew Ng


Regularization
•  A method for automatically controlling the complexity of the learned hypothesis
•  Idea: penalize large values of θ_j
–  Can incorporate into the cost function
–  Works well when we have a lot of features, each of which contributes a bit to predicting the label

•  Can also address overfitting by eliminating features (either manually or via model selection)
Regularization
•  Linear regression objective function:

   J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2

   (first term: model fit to data; second term: regularization)

–  λ is the regularization parameter (λ ≥ 0)
–  No regularization on θ_0!
Understanding Regularization

   J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2

•  Note that \sum_{j=1}^{d} \theta_j^2 = \| \theta_{1:d} \|_2^2
–  This is the squared magnitude of the feature coefficient vector!

•  We can also think of this as:

   \sum_{j=1}^{d} (\theta_j - 0)^2 = \| \theta_{1:d} - \vec{0} \|_2^2

•  L2 regularization pulls coefficients toward 0
Understanding Regularization

   J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2

•  What happens if we set λ to be huge (e.g., 10^10)?
–  The penalty drives θ_1, ..., θ_d toward 0, leaving an (approximately) constant hypothesis h_θ(x) ≈ θ_0

[Figure: Price vs. Size, with the fitted hypothesis flattening to a horizontal line as the coefficients are driven to 0.]

Based on example by Andrew Ng


Regularized Linear Regression
•  Cost function:

   J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2

•  Fit by solving \min_\theta J(\theta)

•  Gradient update:

   \theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
   \theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j

   (the last term is the regularization term)
Regularized Linear Regression

   J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2

   \theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
   \theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j

•  We can rewrite the gradient step as:

   \theta_j \leftarrow \theta_j (1 - \alpha \lambda) - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
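A minimal sketch of one such regularized update step (assuming NumPy; lam denotes λ, the default values are illustrative, and the function name is ours):

```python
import numpy as np

def ridge_gradient_step(theta, X, y, alpha=0.05, lam=0.1):
    """One regularized gradient descent step; theta[0] (the bias) is not penalized."""
    n = len(y)
    grad = X.T @ (X @ theta - y) / n       # (1/n) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
    penalty = lam * theta
    penalty[0] = 0.0                       # no regularization on theta_0
    return theta - alpha * (grad + penalty)
```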
Regularized Linear Regression
•  To incorporate regularization into the closed form solution:

   \theta = \left( X^{\top} X + \lambda \begin{bmatrix} 0 & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \end{bmatrix} \right)^{-1} X^{\top} y
Regularized Linear Regression (continued)
•  Can derive this the same way, by solving \frac{\partial}{\partial \theta} J(\theta) = 0
•  Can prove that for λ > 0, the inverse in the equation above always exists
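A minimal sketch of this regularized closed-form solution (assuming NumPy; lam denotes λ and the function name is ours):

```python
import numpy as np

def fit_ridge_closed_form(X, y, lam=0.1):
    """theta = (X^T X + lam * D)^{-1} X^T y, where D is the identity with D[0, 0] = 0."""
    d_plus_1 = X.shape[1]
    D = np.eye(d_plus_1)
    D[0, 0] = 0.0                          # do not regularize the bias term theta_0
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)
```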
