Lecture 2 - Introduction to Learning
David Sontag
New York University
[Bishop]
Hypo. Space: Degree-N Polynomials
• Infinitely many hypotheses
• None / Infinitely …

[Figure: degree-M polynomial fits to the data for M = 0, 1, 3, 9; x-axis x, y-axis t, ranging from −1 to 1 (Bishop)]
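A minimal sketch of this hypothesis space in NumPy, fitting each degree to Bishop's running example (points from sin(2πx) plus Gaussian noise); the data generation here is an assumption, not the lecture's exact data set:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)  # assumed toy data

for M in [0, 1, 3, 9]:
    # Least-squares fit of a degree-M polynomial: one hypothesis chosen from
    # the infinitely many in the class (deg=9 may warn: near-singular fit).
    w = np.polyfit(x, t, deg=M)
    f = np.poly1d(w)
    print(M, round(float(f(0.5)), 3))
```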
Learning curve
We measure error using a loss function L(y, ŷ). For regression, a common choice is squared loss:

    L(yᵢ, f(xᵢ)) = (yᵢ − f(xᵢ))²

[Figure: E_RMS (0 to 1) on the training and test sets as model complexity grows; the widening gap between training and test error is an example of overfitting (Bishop)]
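A sketch that reproduces the shape of this learning curve, assuming the same synthetic sin(2πx) data as above; e_rms here is the root of the average squared loss:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

x_tr, t_tr = make_data(10)     # small training set
x_te, t_te = make_data(100)    # held-out test set

def e_rms(f, x, t):
    # Root of the average squared loss L(t, f(x)) = (t - f(x))^2
    return np.sqrt(np.mean((t - f(x)) ** 2))

for M in range(10):
    f = np.poly1d(np.polyfit(x_tr, t_tr, deg=M))
    print(f"M={M}: train={e_rms(f, x_tr, t_tr):.3f}, test={e_rms(f, x_te, t_te):.3f}")
```

Training error keeps falling as M grows, while test error eventually rises: the learning curve above.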
• Expected (test) loss = E_(x,y)∼p[ L(y, f(x)) ], also called the risk, R(f)
• Ideally, learning would choose the hypothesis that minimizes the risk – but this is impossible to compute!
• The empirical risk, R̂(f, D_N) = (1/N) Σᵢ L(yᵢ, f(xᵢ)), is an unbiased estimate of the risk (by linearity of expectation)
• The principle of empirical risk minimization: f*(D_N) = argmin_{f ∈ F} R̂(f, D_N)
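A minimal sketch of empirical risk minimization over a finite hypothesis class; the class F and the data below are illustrative stand-ins, not from the lecture:

```python
import numpy as np

def empirical_risk(f, X, y, loss=lambda y, yhat: (y - yhat) ** 2):
    # R_hat(f, D_N) = (1/N) * sum_i L(y_i, f(x_i)), with squared loss by default
    return np.mean([loss(yi, f(xi)) for xi, yi in zip(X, y)])

def erm(F, X, y):
    # f*(D_N) = argmin over f in F of R_hat(f, D_N)
    return min(F, key=lambda f: empirical_risk(f, X, y))

# Toy data and a three-hypothesis class (assumed for illustration)
X = np.array([0.0, 0.5, 1.0])
y = np.array([0.1, 1.1, 2.1])
F = [lambda x: x, lambda x: 2 * x, lambda x: 2 * x + 0.1]

best = erm(F, X, y)
print(empirical_risk(best, X, y))  # 0.0: the third hypothesis fits exactly
```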
Key Issues in Machine Learning
• How do we choose a hypothesis space?
– Often we use prior knowledge to guide this choice
– The ability to answer the next two questions also affects this choice
[Samy Bengio]
Binary classification
• Input: email
• Output: spam/ham
• Setup:
  – Get a large collection of example emails, each labeled spam or ham
  – Note: someone has to hand-label all this data!
  – Want to learn to predict labels of new, future emails
• Features: the attributes used to make the ham/spam decision
  – Words: FREE!
  – Text patterns: $dd, CAPS
  – Non-text: SenderInContacts
  – …

Example emails:
“Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …”
“TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.”
“99 MILLION EMAIL ADDRESSES FOR ONLY $99”
“Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.”
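A sketch of how such features might be extracted from an email; the feature names and regular expressions are illustrative assumptions, not a fixed scheme:

```python
import re

def features(email_text, sender_in_contacts=False):
    f = {"BIAS": 1}
    for word in re.findall(r"[a-z']+", email_text.lower()):
        f["word=" + word] = 1                  # Words: FREE!
    if re.search(r"\$\d\d\b", email_text):
        f["pattern=$dd"] = 1                   # Text patterns: $dd
    letters = [c for c in email_text if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.5:
        f["mostly_caps"] = 1                   # Text patterns: CAPS
    if sender_in_contacts:
        f["SenderInContacts"] = 1              # Non-text feature
    return f

print(features("99 MILLION EMAIL ADDRESSES FOR ONLY $99"))
```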
The perceptron algorithm
• 1957: Perceptron algorithm invented by Rosenblatt
Wikipedia: “A handsome bachelor, he drove a classic MGA sports… for several years taught an
interdisciplinary undergraduate honors course entitled "Theory of Brain Mechanisms" that
drew students equally from Cornell's Engineering and Liberal Arts colleges…this course was
a melange of ideas .. experimental brain surgery on epileptic patients while conscious,
experiments on .. the visual cortex of cats, ... analog and digital electronic circuits that
modeled various details of neuronal behavior (i.e. the perceptron itself, as a machine).”
[William Cohen]
Linear Classifiers
• Inputs are feature values
• Each feature has a weight
• Sum is the activation
• If the activation is positive, output +1 = SPAM; if negative, −1 = HAM

Example: the input “free money” has features (BIAS: 1, free: 1, money: 1); with weights (BIAS: −3, free: 4, money: 2), the activation is −3 + 4 + 2 = 3 > 0, so the output is +1 = SPAM.

[Figure: decision boundary in the (free, money) feature plane]
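The same activation as a few lines of Python, using the weights and feature counts from the slide:

```python
# Activation = sum over features of weight * feature value, thresholded at zero.
w = {"BIAS": -3, "free": 4, "money": 2}   # learned weights (from the slide)
f = {"BIAS": 1, "free": 1, "money": 1}    # features of the input "free money"

activation = sum(w.get(k, 0) * v for k, v in f.items())
label = "+1 = SPAM" if activation > 0 else "-1 = HAM"
print(activation, label)                  # 3 +1 = SPAM
```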
w₀ = ln((1 − θ)/θ) + (µᵢ₀² + µᵢ₁²)/(2σᵢ²),  wᵢ = (µᵢ₀ + µᵢ₁)/σᵢ²

The perceptron algorithm
• Start with weight vector w = 0
• For each training instance (xᵢ, yᵢ*):
  – Classify with current weights: y = sign(w · f(xᵢ))
  – If correct (y = yᵢ*): no change!
  – If wrong: w = w + yᵢ* f(xᵢ)
(Compare the gradient update for logistic regression: w = w + η Σⱼ [yⱼ* − p(yⱼ | xⱼ, w)] f(xⱼ).)
[Rong Jin]
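A minimal NumPy sketch of the loop above; the toy data and epoch count are assumptions for illustration:

```python
import numpy as np

def perceptron(X, y, epochs=10):
    w = np.zeros(X.shape[1])                  # start with weight vector w = 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            pred = 1 if w @ x_i > 0 else -1   # classify with current weights
            if pred != y_i:                   # if wrong: w = w + y* f(x)
                w = w + y_i * x_i
    return w

# Toy data; the last coordinate is a bias feature fixed to 1.
X = np.array([[2.0, 1.0, 1.0], [0.0, 0.0, 1.0], [1.0, 2.0, 1.0], [0.5, 0.2, 1.0]])
y = np.array([1, -1, 1, -1])
print(perceptron(X, y))
```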
Def: Linearly separable data
A data set D = {(xᵢ, yᵢ)} is linearly separable if there exists a weight vector w such that yᵢ (w · xᵢ) > 0 for all i.
[Rong Jin]
Mistake Bound: Separable Case
• Assume the data set D is linearly separable
with margin γ, i.e., there is a unit-norm w* with yᵢ (w* · xᵢ) ≥ γ for all i
• Assume ‖xᵢ‖ ≤ R for all i
• Theorem: The maximum number of
mistakes made by the perceptron algorithm
is bounded by R²/γ²
[Rong Jin]
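A quick empirical check of the bound, on assumed synthetic data made separable with margin at least 0.1:

```python
import numpy as np

rng = np.random.default_rng(2)
w_star = np.array([1.0, 1.0]) / np.sqrt(2)   # unit-norm separator (assumed)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X @ w_star >= 0, 1, -1)
keep = np.abs(X @ w_star) >= 0.1             # discard points inside the margin
X, y = X[keep], y[keep]

R = np.max(np.linalg.norm(X, axis=1))        # ||x_i|| <= R
gamma = np.min(y * (X @ w_star))             # achieved margin, >= 0.1

w, mistakes = np.zeros(2), 0
for _ in range(1000):                        # loop until an epoch has no mistakes
    changed = False
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i) <= 0:             # mistake (ties count as errors)
            w += y_i * x_i
            mistakes += 1
            changed = True
    if not changed:
        break

print(f"mistakes={mistakes}  bound={R**2 / gamma**2:.1f}")  # mistakes <= bound
```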
Proof by induction
[Rong Jin]
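A sketch of the standard induction (Novikoff's argument) in LaTeX; it assumes the unit-norm w* with margin γ and ‖xᵢ‖ ≤ R from the previous slide, writing w_k for the weight vector after the k-th mistake:

```latex
% Novikoff's argument. Assumes a unit-norm w* with y_i (w* . x_i) >= gamma and
% ||x_i|| <= R; the cross term in the second line is <= 0 because the k-th
% example was misclassified, and the last line uses Cauchy-Schwarz with ||w*|| = 1.
\begin{align*}
w_k \cdot w^{*} &= (w_{k-1} + y_i x_i) \cdot w^{*}
   \;\ge\; w_{k-1} \cdot w^{*} + \gamma
   &&\Rightarrow\; w_k \cdot w^{*} \ge k\gamma \\
\|w_k\|^2 &= \|w_{k-1}\|^2 + 2\, y_i (w_{k-1} \cdot x_i) + \|x_i\|^2
   \;\le\; \|w_{k-1}\|^2 + R^2
   &&\Rightarrow\; \|w_k\|^2 \le k R^2 \\
k\gamma \;&\le\; w_k \cdot w^{*} \;\le\; \|w_k\| \;\le\; \sqrt{k}\, R
   &&\Rightarrow\; k \le R^2 / \gamma^2
\end{align*}
```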
Properties of the perceptron algorithm
[Figure: linearly separable vs. non-separable training sets]
Problems with the perceptron algorithm