12 LinearModels1 Annotated
|video|https://www.youtube.com/embed/YGBZE6RA7aU|
Machine Learning
mlvu.github.io
Vrije Universiteit Amsterdam
machine learning: the basic recipe

Abstract your problem to a standard task.
  Classification, Regression, Clustering, Density estimation, Generative Modeling, Online learning, Reinforcement Learning, Structured Output Learning
Choose your instances and their features.
  For supervised learning, choose a target.
Choose your model class.
  Linear models
Search for a good model.
  Choose a loss function, choose a search method to minimise the loss.

Here is the "basic recipe" for machine learning we saw in the last lecture. In this lecture, we'll have a look at linear models, both for regression and classification. We'll see how to define a linear model, how to formulate a loss function and how to search for a model that minimises that loss function.

Most of the lecture will be focused on search methods. The linear models themselves aren't that strong, but because they're pretty simple, we can use them to explain various search methods that we can also apply to more complex models as the course progresses. Specifically, the method of gradient descent, which we'll introduce here, will be the search method used for almost all approaches we will discuss.
[figure: abstract example data, a table of numeric instances; the data is fed to a learner, which produces a model]
This is the example data we used to illustrate regression: predicting the body mass of a penguin from its flipper length.

flipper length (dm), feature    body mass (kg), target
1.93    3.65
1.88    4.70
1.90    4.95
2.17    5.70
1.90    3.32
1.92    5.70
2.23    5.00
2.05    3.45
2.08    3.05
1.93    5.30
1.88    5.55

data source: https://allisonhorst.github.io/palmerpenguins/, https://github.com/mcnakhaee/palmerpenguins (python package)
image source: https://allisonhorst.github.io/palmerpenguins/
feature space

[figure: the same data plotted in feature space, with flipper length (dm) on the horizontal axis and body mass (kg) on the vertical axis]
regression: example data

To simplify things, we'll use a very simple data set in the rest of this lecture. There is one input feature x, one output value t (for target), and we have six instances.
notation

Throughout the course, we will use the following notation: lowercase non-bold (x, y, z) for scalars, lowercase bold for vectors and uppercase bold for matrices.

Our model for a single feature x is a line: f(x) = wx + b. Note that the line shown on the slide isn't a very good fit for the data. Our job is to find better numbers w and b.
two features

If we have multiple features, each feature gets its own weight (also known as a coefficient):

f(x1, x2) = w1 x1 + w2 x2 + b

a 2D linear function

Here's what that looks like. The thick orange lines together indicate a plane (which rises in the x2 direction, and declines in the x1 direction). The parameter b describes how high above the origin this plane lies (what the value of f is if both features are 0). The value w1 indicates how much f increases if we take a step of 1 along the x1 axis, and the value w2 indicates how much f increases if we take a step of size 1 along the x2 axis.
for n features

For an arbitrary number of features, the pattern continues as you'd expect: we summarize the w's in a vector w, with one weight per feature.

f_w,b(x) = w1 x1 + w2 x2 + w3 x3 + ... + b = wT x + b

with w = (w1, ..., wn) and x = (x1, ..., xn)

We call the w's the weights, and b the bias. The weights and the bias are the parameters of the model. We need to choose these to fit the model to our data.

The operation of multiplying the elements of w by the corresponding elements of x and summing them is the dot product of w and x.
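The model above can be sketched in a few lines. This is a minimal illustration, not a library implementation; the example weights and inputs are made up.

```python
# A sketch of the linear model f_{w,b}(x) = w^T x + b:
# multiply elements of w and x pairwise, sum, and add the bias.

def linear_model(w, x, b):
    """Compute w^T x + b (dot product plus bias)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w = [2.0, -1.0, 0.5]   # one weight per feature
x = [1.0, 3.0, 4.0]    # one instance with three features
b = 0.5

print(linear_model(w, x, b))  # 2*1 - 1*3 + 0.5*4 + 0.5 = 1.5
```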
dot product

The dot product of two vectors is simply the sum of the products of their elements:

wT x = w · x = sum_i wi xi

If we place the features into a vector x and the weights into a vector w, then a linear function is simply their dot product (plus the bias b).

The dot product will come back a lot in the rest of the course. We don't have time to discuss it in depth, but if your memory is hazy, we strongly recommend that you take a minute to go back to your linear algebra book and look up the various interpretations of what the dot product means.
example: predicting high blood pressure

To build some intuition for the meaning of the weights w, let's look at an example. Imagine we are trying to predict the risk of high blood pressure based on three features: the patient's age, whether they have a stressful job, and whether they have a healthy diet. We'll assume that the features are expressed in some number that measures these properties. The instances are patients.

dot product

Here's what the dot product expresses:

wT x = x1 w1 + x2 w2 + x3 w3

where x holds the features of a patient, and each weight indicates how predictive the corresponding feature is. For some features, like job stress, we want to learn a positive weight (since more job stress should contribute to a higher risk of high blood pressure). For others, we want to learn a negative weight (the healthier your diet, the lower your risk of high blood pressure). Finally, we can control the magnitude of the weights to control their relative importance: if age and job stress both contribute positively, but age is the bigger risk factor, then age should get the larger weight.
But which model fits our data best?

So, that's our model defined in detail. But we still don't know which model to choose for a given dataset. Given some data, which values should we choose for the parameters w and b?

In order to answer this question, we need two more ingredients:
• a loss function
• a search method (next video)

First, we need a loss function, which tells us how well a particular choice of model does (for the given data), and second, we need a way to search the space of all models for a particular model that results in a low loss (a model for which the loss function returns a low value).
mean squared error loss

Here is a common loss function for regression: the mean squared error (MSE) loss, which we saw briefly in the last lecture:

loss_X,T(w, b) = (1/n) sum_j (w xj + b - tj)^2

The model maps the data to the output; the loss function maps a model to a loss value. The data is a constant in the loss function.

The MSE loss takes the residual for each instance in our data, squares them, and returns the average. One reason for the squaring step is to ensure that negative and positive residuals don't cancel out (giving us a small loss even though we have big residuals). But that's not the only reason.

The squares also ensure that big errors affect the loss more heavily than small errors. You can visualise this as shown here: the mean squared error is the mean of the areas of the green squares (it's also called sum-of-squares loss). When we search for a well-fitting model, the search will try to reduce the big squares much more than the small squares.
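The MSE formula above translates directly to code. This sketch uses the first six rows of the penguin table as data; the specific choices of w and b are arbitrary, just to show the loss being evaluated.

```python
# A sketch of the MSE loss for the one-feature model f(x) = w*x + b:
# average of the squared residuals (w*x_j + b - t_j)^2.

def mse_loss(w, b, xs, ts):
    """Mean squared error of the model (w, b) on data (xs, ts)."""
    return sum((w * x + b - t) ** 2 for x, t in zip(xs, ts)) / len(xs)

xs = [1.93, 1.88, 1.90, 2.17, 1.90, 1.92]  # flipper length (dm)
ts = [3.65, 4.70, 4.95, 5.70, 3.32, 5.70]  # body mass (kg)

print(mse_loss(2.0, 0.0, xs, ts))  # loss for one particular choice of w, b
print(mse_loss(0.0, 4.67, xs, ts)) # loss for a flat line at the mean mass
```

Note that the data is a constant here: the loss is a function of w and b only.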
slight variations

You may see slightly different versions of the MSE loss: sometimes we take the average of the squares, sometimes just the sum; sometimes we multiply by 1/2 to make the derivative simpler; and sometimes we take the square root of the mean:

sum_j (fp(xj) - yj)^2
(1/n) sum_j (fp(xj) - yj)^2
(1/2) sum_j (fp(xj) - yj)^2
sqrt( (1/n) sum_j (fp(xj) - yj)^2 )

In practice, the differences don't mean much, because we're not interested in the absolute value, just in how the loss changes from one model to another. We will switch between these based on what is most useful in a given context.
|section|Searching for a good model|
|video|https://youtube.com/embed/q97nOAYfpHg|
loss surface

As we saw in the previous lecture, we can plot the loss for every point in our model space. This is called the loss surface or sometimes the loss landscape. If you imagine a 2D model space, you can think of the loss surface as a landscape of rolling hills (or sometimes of jagged cliffs).

Here is what that actually looks like for the two parameters w and b of the one-feature linear regression. Note that this is specific to the data we saw earlier. For a different dataset, we get a different loss landscape.

model space

To minimize the loss, we need to search this space to find the brightest point in this picture, or equivalently, the lowest point in the loss landscape: the model with as low a loss value as possible.
machine learning: find the lowest loss that generalizes

Minimize the loss on the test data, seeing only the training data.

Optimization is concerned with finding the absolute minimum (or maximum) of a function. The lower the better, with no ifs or buts. In machine learning, if we have a very expressive model class (like the regression tree from the last lecture), the model that actually minimizes the loss on the training data is the one that overfits. In such cases, we're not looking to minimize the loss on the training data, since that would mean overfitting; we're looking to minimize the loss on the test data. Of course, we don't get to see the test data, so we use the training data as a stand-in, and try to control against overfitting as best we can.
random search

Let's start with a very simple example: random search. We simply make a small step to a nearby point. If the loss goes up, we move back to our previous point. If it goes down, we stay in the new point. Then we repeat the process.

start with a random point p in the model space
loop:
  pick a random point p' close to p
  if loss(p') < loss(p):
    p <- p'
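The loop above can be sketched directly for the one-feature model. The data, the step size and the number of iterations below are illustrative choices, not values from the slides.

```python
import random

# A sketch of random search on the one-feature regression problem:
# propose a nearby point, keep it only if the loss goes down.

def loss(w, b, data):
    return sum((w * x + b - t) ** 2 for x, t in data) / len(data)

def random_search(data, steps=10_000, step_size=0.1, seed=0):
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # random start point
    for _ in range(steps):
        w2 = w + rng.uniform(-step_size, step_size)
        b2 = b + rng.uniform(-step_size, step_size)
        if loss(w2, b2, data) < loss(w, b, data):
            w, b = w2, b2  # the step lowered the loss: stay there
    return w, b

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # a toy dataset, roughly t = 2x
w, b = random_search(data)
print(w, b, loss(w, b, data))
```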
convexity

One of the reasons such a simple approach works well enough here is that our problem is convex. A surface (like our loss landscape) is convex if a line drawn between any two points on the surface lies entirely above the surface. One of the implications of convexity is that any point that looks like a minimum locally (because all nearby points are higher) must be the global minimum: it's lower than any other point on the surface.

This means that so long as we know we're moving down (to a point with lower loss), we can be sure we're moving towards the global minimum: the best of all possible models.
local vs global minima

Let's look at what happens if the loss surface isn't convex: what if the loss surface has multiple local minima? These are points that are lower than all nearby points, but if we move far enough away from them, we can find a point that is even lower.

This loss surface isn't based on actual data. It's just some function that illustrates the idea.

Note that changing the step size will not help us here. Once the search is stuck, it stays stuck.
simulated annealing

There are a few tricks that can help us to escape local minima. Here's a popular one, called simulated annealing: if the next point chosen isn't better than the current one, we still pick it, but only with some small probability. In other words, we allow the algorithm to occasionally travel uphill. This means that whenever it gets stuck in a local minimum, it still has some probability of escaping, and finding the global minimum.

pick a random point p in the model space
loop:
  pick a random point p' close to p
  if loss(p') < loss(p):
    p <- p'
  else:
    with probability q: p <- p'

The name "simulated annealing" is a bit of a historical accident, so don't read too much into it. It comes from the fact that this algorithm can be used to simulate the cooling of a material like metal. The carefully controlled cooling of a material to promote the growth of particular kinds of crystals is called annealing. In physical terms this is like looking for the minimum in an energy landscape, which is mathematically similar to our loss landscape.
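The pseudocode above can be sketched on a simple one-dimensional, non-convex loss. Everything here is illustrative: the test function is made up, and q and the step size are set fairly high so that a short demo run escapes the local minimum with high probability. We also track the best point seen, since the walk may end at a worse point after an uphill move.

```python
import random

# A sketch of simulated annealing: worse points are still accepted,
# but only with some probability q.

def annealing_search(loss, start, steps=50_000, step_size=0.5, q=0.4, seed=1):
    rng = random.Random(seed)
    p = best = start
    for _ in range(steps):
        p2 = p + rng.uniform(-step_size, step_size)
        if loss(p2) < loss(p) or rng.random() < q:
            p = p2  # accept improvements always, other moves with probability q
        if loss(p) < loss(best):
            best = p  # remember the lowest point visited so far
    return best

# a non-convex loss with a local minimum near x = 0.95 and a lower,
# global minimum near x = -1.04 (just a function to illustrate the idea)
f = lambda x: x**4 - 2 * x**2 + 0.3 * x

best = annealing_search(f, start=1.0)  # start inside the local basin
print(best, f(best))
```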
All this talk about global minima may suggest that the local minima are always terrible. Remember, however, that if we have a complex model, the global minimum will probably overfit. In such cases, we may actually be more interested in finding a good local minimum.

Note: in many situations, the local minima are fine. We do not always need an algorithm that is guaranteed to find the global minimum. In short, we want to think carefully about whether our algorithm can escape bad local minima, but that doesn't mean that local minima are always bad solutions.
variations on random search

The fixed step size we used so far is just one way to sample the next point. To allow the algorithm to occasionally make smaller steps, you can sample p' so that it is at most some distance away from p, instead of exactly that distance. Another approach is to sample the distance from a Normal distribution. That way, most points will be close to the original p, but every point in the model space can theoretically be reached in one step.

Fixed radius | Random uniform | Normal
discrete model spaces

The space of linear models is continuous: between every two models, there is always another model, no matter how close they are together. Random search also works in discrete model spaces, such as the space of tree models: you just need to define which models are "close" to each other. In this slide, we've decided that two trees are close if I can turn one into the other by adding or removing a single node.
parallel search Another thing you can do is just to run random search a
couple of times independently (one after the other, or in
parallel). If you’re lucky one of these runs may start you off
close enough to the global minimum.
population methods

To make parallel search even more useful, we can introduce some form of communication or synchronization between the searches happening in parallel. If we see the parallel searches as a population of agents that occasionally "communicate" in some way, we can guide the search a lot more. Here are some examples of such population methods:

evolutionary algorithms
• genetic algorithms
• evolutionary strategies
particle swarm optimization
ant colony optimization

We won't go into these too deeply; we will only take a (very) brief look at evolutionary algorithms. Often, there are specific variants for discrete and for continuous model spaces.
evolutionary algorithms

Here is a basic outline of an evolutionary method (although many other variations exist). We start with a population of models, we remove the half with the worst loss, and pair up the remainder to breed a new population.

Start with a population of k models.
loop:
  rank the population by loss
  remove the half with the worst loss
  "breed" a new population of k models
  optional: add a little noise to each child

In order to instantiate this, we need to define what it means to "breed" a population of new models from an existing population. A common approach is to select two random parents and to somehow average their models. This is easy to do in a continuous model space: we can literally average the two parent models to create a child.

In a discrete model space, it's more difficult, and it depends more on the specifics of the model space. In such cases, designing the breeding process (sometimes called the crossover operator) is usually the most difficult part of designing an effective evolutionary algorithm.
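For a continuous model space like our (w, b) plane, the outline above can be sketched directly. The population size, noise level and generation count are illustrative choices, and the toy data is made up.

```python
import random

# A sketch of the evolutionary method above for the one-feature
# linear model (w, b): rank by loss, keep the best half, breed
# children by averaging two random parents and adding a little noise.

def loss(model, data):
    w, b = model
    return sum((w * x + b - t) ** 2 for x, t in data) / len(data)

def evolve(data, k=20, generations=200, noise=0.1, seed=0):
    rng = random.Random(seed)
    population = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(k)]
    for _ in range(generations):
        population.sort(key=lambda m: loss(m, data))  # rank by loss
        survivors = population[: k // 2]              # drop the worst half
        children = []
        while len(survivors) + len(children) < k:
            (w1, b1), (w2, b2) = rng.sample(survivors, 2)  # two random parents
            children.append(((w1 + w2) / 2 + rng.gauss(0, noise),
                             (b1 + b2) / 2 + rng.gauss(0, noise)))
        population = survivors + children
    return min(population, key=lambda m: loss(m, data))

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # a toy dataset, roughly t = 2x
best = evolve(data)
print(best, loss(best, data))
```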
population methods

Powerful
Easy to parallelise
Difficult to tune

Population methods are very powerful, but computing the loss for so many different models is often expensive. They can also come with a lot of different parameters to control the search, each of which you will need to carefully tune.
black box optimization

All these search methods are instances of black box optimization. Black box optimization refers to those methods that only require us to be able to compute the loss function. We don't need to know anything about the internals of the model.

random search, simulated annealing:
• very simple
• we only need to compute the loss function for each model
• can require many iterations
• also works for discrete model spaces (like tree models)

These are usually very simple starting points. Often, there is some knowledge about your model that you can add to improve the search, but sometimes the black box approach is good enough. If nothing else, they serve as a good starting point and point of comparison for the more sophisticated approaches.
towards gradient descent: branching search

As a stepping stone to what we'll discuss in this video, let's take the random search from the previous video, and add a little more inspection of the local neighborhood before taking a step. Instead of taking one random step, we'll look at k random steps and move in the direction of the one that gives us the lowest loss.

pick a random point p in the model space
loop:
  pick k random points {pi} close to p
  p' <- argmin_pi loss(pi)
  if loss(p') < loss(p):
    p <- p'

In the hiker analogy, you can think of this algorithm as the case where the hiker taps his foot on the ground in a couple of random directions, and then moves in the direction with the strongest downward slope.
k=2 | k=5 | k=15

As you can see, the more samples we take, the more directly we head for the region of low loss. The more closely we inspect our local neighbourhood, to determine in which direction the function decreases quickest, the faster we converge.
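The branching-search loop above can be sketched on a simple one-dimensional loss. The toy loss function, k, step size and iteration count are all illustrative choices.

```python
import random

# A sketch of branching search: inspect k random nearby points and
# move to the best of them, but only if it improves on the current point.

def branching_search(loss, start, k=10, steps=2000, step_size=0.1, seed=0):
    rng = random.Random(seed)
    p = start
    for _ in range(steps):
        candidates = [p + rng.uniform(-step_size, step_size) for _ in range(k)]
        p2 = min(candidates, key=loss)  # the best of k nearby points
        if loss(p2) < loss(p):
            p = p2
    return p

f = lambda x: (x - 3) ** 2  # a convex toy loss with its minimum at x = 3
print(branching_search(f, start=0.0))
```

With larger k, each step points more reliably downhill, which is the behaviour the k=2 / k=5 / k=15 comparison illustrates.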
calculus basics: slope

Before we dig into the gradient descent algorithm, let's review some basic principles from calculus. First up, slope. The slope of a linear function is simply how much it moves up if we move one step to the right. In the case of f(x) = -1x + 3 in this picture, the slope is negative, because the line moves down.
tangent line, derivative

The tangent line of a function at a particular point p is the line that just touches the function at p. The derivative of the function gives us the slope of the tangent line:

g(x) = slope · x + c = f'(p) x + c

Traditionally, we find the minimum of a function by setting the derivative equal to 0 and solving for x. This gives us the point where the tangent line has slope 0, and is therefore horizontal. For complex models, it may not be possible to solve for x in this way.

Looking at the example in the slide, we note that the tangent line moves down (i.e. the slope is negative). This tells us that we should move to the right to follow the function downward. As we take small steps to the right, the derivative stays negative, but gets smaller and smaller as we close in on the minimum. This suggests that the magnitude of the slope lets us know how big the steps are that we should take, and the sign gives us the direction.
vectors as directions (with magnitudes) To represent this direction we'll use a vector.
the direction of steepest ascent So, now that we have a local linear approximation g to our
function f, which is the direction of steepest ascent on the
<latexit sha1_base64="Hx9t4qh4dmJwoy91jyhymIL/X44=">AAAKS3icfVbNbttGEKaSprHUpnHSYy9EhBZpIxikFEsyDAOJJSc5NLFrWHYAUzGWqxVFaPmD3aUkZs3X6K2P1Afoc/QW5NDlj2SSS3UvGs73fcPZmeFqTR/blGnaP7V797958O3DnXrju+8f/fB498nTS+oFBKIR9LBHPpqAImy7aMRshtFHnyDgmBhdmfNBjF8tEKG2516w0EdjB1iuPbUhYMJ1s/un9dywIDdW0a/qL0eq4QnbXEbqpws186vGXPwevjAOTWEbDcG6vV3zbm9V41A8ptT4CXqUGwD7M5Cy1Tx6pOqJ02BeKUhJdrPb1Pa0ZKmyoWdGU8nW2c2Tnb+MiQcDB7kMYkDpta75bMwBYTbEKGoYAUU+gHNgoeuATftjbrt+wJALI/VngU0DrIqs4hqpE5sgyHAoDACJLSKocAYIgExUslEMRZELHERbk4Xt09SkCys1GBBtGPNV0qboUUHJLQL8mQ1XhdQ4cKgD2Exy0tAxi04UYEQWTtEZpymSLDFXiECbxkU4E5U59ePW0wvvLMNnoT9DLo14QHCUFwoAEYKmQpiYFLHA58luxLzN6REjAWrFZuI7GgIyP0eTlohTcBTTmWIPsKLLFNsQ1XHREnqOA9wJN/yIGwytGDdae1FSuzx6HnFuxIUyTfU8hgvohxz6IYqK4EkOPBFgER1t0Kk6Kksvc+Cl9NarHHpVlppBDg0kdJFDF1Jk8Y3cwUsJXuXQlYSGOTSU0M859LNcZyDG4ro95mkvkqbyU2wv0FuCkBvxZjsq74WIfl/rRUk8A7ypR0m5J2gqDqsUcMKYzt9dvP894oN+e1/rRmWGiQO0pmid7v5AkyhWmk3G0fr99rHE8QhwrU2g4Un3tS4H8gPi4w2p1+u8OZAjhQhjb7mJNDgetjslkihIMSe9p2taefdLC64J3V5vqJXzWeI7wuvBsNMrv2ZJNripd1C7XLwlviNMOv2XcgBzg3e0bvdAwvEdodcBvYlEmG/wg2SVPyiRQHkasqarTV2VpseqomelrBSYVYJ0ZCr5c5n/loBwC9urir6epEqFX6VYj1WlIqxSrGdsrShKlhVlSoap8gXJGEl0vJ1f0bNk0rZEr6Lj7fyKjiVjuCV6FR1v51f0N5nR6kLO/fj8iy84fnxTADilDJG4QxD0XpyLp+J/DzCP/CYOQ2I5tphD8Wu0Yuv/iGC1JgqrEV9o9PL1RTYu23v6/p72x8vmq+PsarOj/KQ8U54rutJTXinvlDNlpEDla+1Z7UWtVf+7/m/9S/1rSr1XyzQ/KoXVePAfPO3Fug==</latexit>
approximation g?
g(x) = wTx + b
     = ||w|| ||x|| cos α    (with b = 0)
with ||x|| = 1:  g(x) = ||w|| cos α

Since g is linear, many details don't matter: we can set b to zero, since that just translates the hyperplane up or down. It doesn't matter how big a step we take in any direction, so we'll take a step of size 1. Finally, it doesn't matter where we start from, so we will just start from the origin. So the question becomes: for which input x of magnitude 1 (which unit vector) does g(x) provide the biggest output?
58
To see the answer, we need to use the geometric definition
of the dot product. Since we required that ||x||= 1, this
disappears from the equation, and we only need to
maximise the quantity ||w|| cos(α) (where only α depends
on our choice of x, and w is the gradient we computed).
cos(α) is maximal when α is zero: that is, when x and w are
pointing in the same direction.
summary
59
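As a numerical sanity check of this argument (a sketch, with an arbitrary example vector w): among many random unit vectors x, the one that maximises wᵀx points in approximately the same direction as w itself.

```python
import numpy as np

w = np.array([3.0, 4.0])  # an arbitrary example gradient vector

# Sample many random directions and normalize them to unit length.
rng = np.random.default_rng(0)
xs = rng.normal(size=(10000, 2))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)  # now ||x|| = 1 for each row

best = xs[np.argmax(xs @ w)]    # the unit vector with the largest g(x) = w.T @ x
unit_w = w / np.linalg.norm(w)  # w scaled to unit length

print(best, unit_w)  # the two vectors approximately coincide
```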
magnitude of the gradient: speed of ascent Note that the gradient is a vector: it has a direction and a
magnitude (the length of the arrow). The magnitude tells us
how quickly the linear function is rising. This is very useful
in search, since the more the function is changing, the
bigger the steps that we want to take. Once the function
stops changing as much, it's a good bet we are approaching
a minimum, so we'd like to slow down.
60
gradient descent Here is the gradient descent algorithm. Starting from
some candidate p, we simply compute the gradient at p,
subtract it from the current choice, and iterate this process:
pick a random point p in the model space
loop:
  p ← p - η∇loss(p)
62
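In code, the process just described might look like the following sketch, on a hypothetical one-dimensional loss f(p) = (p - 3)² with derivative f'(p) = 2(p - 3); the learning rate and step count are arbitrary choices.

```python
# A sketch of gradient descent: repeatedly step against the gradient.
def gradient_descent(grad, p, eta=0.1, steps=100):
    for _ in range(steps):
        p = p - eta * grad(p)  # subtract the gradient (scaled by eta)
    return p

# Toy loss f(p) = (p - 3)^2, so grad(p) = 2 * (p - 3).
p_min = gradient_descent(lambda p: 2 * (p - 3), p=-10.0)
print(p_min)  # close to the true minimum at p = 3
```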
Let’s go back to our example problem, and see how we can
apply gradient descent here.
63
loss(w, b) = (1/n) Σᵢ (wxᵢ + b - tᵢ)²

where xᵢ is the feature and tᵢ is the target of instance i.

∇loss(w, b) = ( ∂loss(w, b)/∂w , ∂loss(w, b)/∂b )
64
∂loss(w, b)/∂w = ∂[(1/n) Σᵢ (wxᵢ + b - tᵢ)²]/∂w
               = (1/n) Σᵢ ∂(wxᵢ + b - tᵢ)²/∂w
               = (1/n) Σᵢ ∂(wxᵢ + b - tᵢ)²/∂(wxᵢ + b - tᵢ) · ∂(wxᵢ + b - tᵢ)/∂w
               = (2/n) Σᵢ (wxᵢ + b - tᵢ)xᵢ

∂loss(w, b)/∂b = ∂[(1/n) Σᵢ (wxᵢ + b - tᵢ)²]/∂b
               = (2/n) Σᵢ (wxᵢ + b - tᵢ)

• First we use the sum rule, moving the derivative inside the sum symbol.
• Then we use the chain rule, to split the function into the composition of computing the residual and squaring, computing the derivative of each with respect to its argument.

The second homework exercise and the formula sheet both provide a list of the most common rules for derivatives.

On your first pass through the slides, it's ok to take my word for it that these are the derivatives and to skip the derivation. However, there are a lot more derivations like these coming up, so you should work through every step before moving on to the next lecture, or you'll struggle in the later parts of the course.

65
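One way to check a derivation like this is to compare the derived gradient against a finite-difference approximation of the loss. A sketch, on a small made-up dataset (the values of x, t, w and b below are arbitrary):

```python
import numpy as np

# Made-up data and parameters, purely to check the formulas.
x = np.array([1.9, 2.1, 2.2, 1.8])
t = np.array([3.6, 4.7, 5.0, 3.3])
w, b = 1.5, -0.5

def loss(w, b):
    return np.mean((w * x + b - t) ** 2)

def grad(w, b):  # the derivatives derived above
    r = w * x + b - t  # residuals
    return (2 / len(x)) * np.sum(r * x), (2 / len(x)) * np.sum(r)

# Central finite differences approximate the same derivatives.
eps = 1e-6
gw = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
gb = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)

print(grad(w, b), (gw, gb))  # the two pairs should match closely
```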
gradient descent for our example Here's what we've just worked out: gradient descent, but specific to this particular model. We start with some initial guess, compute the gradient of the loss with the two functions we've just worked out, and we subtract that vector (times some scalar η) from our current guess.

pick a random point (w, b) in the model space
loop:
  (w, b) ← (w, b) - η · ( (2/n) Σᵢ (wxᵢ + b - tᵢ)xᵢ ,  (2/n) Σᵢ (wxᵢ + b - tᵢ) )

Hopefully, repeating this process a number of times in small steps will directly follow the loss surface down to a (local) minimum.
66
68
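Putting the update rule into code, a minimal sketch on the first few rows of the penguin example (the learning rate and step count are arbitrary choices):

```python
import numpy as np

# The first few rows of the penguin example:
# flipper length (dm) and body mass (kg).
x = np.array([1.93, 1.88, 1.90, 2.17, 1.90, 1.92])
t = np.array([3.65, 4.70, 4.95, 5.70, 3.32, 5.70])

w, b = 0.0, 0.0  # initial guess
eta = 0.01       # learning rate (an arbitrary choice)
n = len(x)

for _ in range(10_000):
    r = w * x + b - t                      # residuals of the current guess
    w = w - eta * (2 / n) * np.sum(r * x)  # gradient step for w
    b = b - eta * (2 / n) * np.sum(r)      # gradient step for b

print(w, b, np.mean((w * x + b - t) ** 2))
```

With this tiny dataset the loss drops quickly at first, but convergence along the flatter direction of the loss surface is slow, which is one reason the choice of learning rate matters.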
playground.tensorflow.org (bit.ly/2MnehJp) Here is a very helpful little browser app that we'll return to a few times during the course. It contains a few things that we haven't discussed yet, but if you remove all hidden layers, and set the target to regression, you'll get a linear model of the kind that we've been discussing. Click the following link to see a version with only the currently relevant features: playground.tensorflow.org. We will enable different additional features as we discuss them in the course.
Note that the page calls this model a neural network (which
we won’t discuss for a few more weeks). Linear models are
just a very simple neural network.
71
Here, we see the effect of the learning rate. If we set it too high, the gradient descent jumps out of the first minimum it finds. A little lower and it stays in the neighborhood of the first minimum, but it sort of bounces from side to side, only very slowly moving towards the actual minimum.
gradient descent
very accurate
but actually… It's worth saying that for linear regression, although it makes a nice, simple illustration, none of this searching is actually necessary. For linear regression, we can set the derivatives equal to zero and solve explicitly for w and for b. This would give us the optimal solution directly, without searching.

∂loss(w, b)/∂w = 0
∂loss(w, b)/∂b = 0

there's an analytical solution (for this model)

However, this trick requires more advanced linear algebra to work out than we want to introduce here. You should learn about this in most linear algebra courses, where the problem is called ordinary least squares, and is solved by computing the pseudo-inverse of the data matrix. We won't go down this route in this course because it'll stop working very quickly once we start looking at more complicated models.

74
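For reference, this is what the analytical route can look like with numpy's built-in least-squares solver, sketched on the first few rows of the penguin example:

```python
import numpy as np

# Flipper length (dm) and body mass (kg), as before.
x = np.array([1.93, 1.88, 1.90, 2.17, 1.90, 1.92])
t = np.array([3.65, 4.70, 4.95, 5.70, 3.32, 5.70])

# Data matrix with a column of ones, so that the bias b is included.
X = np.column_stack([x, np.ones_like(x)])

# lstsq solves X @ [w, b] ≈ t via the pseudo-inverse: no searching.
(w, b), *_ = np.linalg.lstsq(X, t, rcond=None)
print(w, b)
```

At this solution the derivatives of the loss are (numerically) zero, which is exactly the condition on the slide.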
75
classification Now, let's look at how this works for classification.
77
To define a linear decision boundary, we take the same functional form we used for the linear regression: some weight vector w, and a bias b.

wTx + b > 0
wTx + b = 0
wTx + b < 0

The way we define the decision boundary is a little different than the way we defined the regression line. Here, we say that if wTx + b is larger than 0, we call x one class, and if it is smaller than 0, we call it the other (we'll stick to binary classification for now).

79

1D linear classifier The actual hyperplane this function y = wTx + b defines can be thought of as lying above and below the feature space.

wx + b > 0

80
This also shows us another interpretation of w. Since it is the direction of steepest ascent on this hyperplane, it is the vector perpendicular to the decision boundary, pointing to the class we assigned to the case where wTx + b is larger than 0 (the blue class in this case).

81
example data Here is a simple classification dataset, which we'll use to illustrate the principle.
x1 x2 class
1 2 true
1 1 false
2 3 true
3 3 true
3 1 false
3 2 false
82
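As a sketch of the decision rule on this dataset: with a hypothetical weight vector and bias (chosen by hand, not learned) that happen to separate these points, we classify each point by the sign of wᵀx + b.

```python
import numpy as np

# The example dataset from the slide.
X = np.array([[1, 2], [1, 1], [2, 3], [3, 3], [3, 1], [3, 2]], dtype=float)
y = np.array([True, False, True, True, False, False])

# A hypothetical weight vector and bias that separate these points.
w = np.array([-1.0, 1.5])
b = -0.75

pred = X @ w + b > 0  # the decision rule: positive class iff wTx + b > 0
print(pred)
```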
what loss function do we use? This gives us a model space, but how do we decide the quality of any particular model? What is our loss function for classification?

nr. of misclassified examples (error)?

The thing we are usually trying to minimise is the error: the number of misclassified examples. Sometimes we are looking for something else, but in the simplest classification problems, this is what we are ultimately interested in: a classifier that makes as few mistakes as possible. So let's start there: can we use the error as a loss function?
83
This is what our loss surface looks like for the error function on our simple dataset. Note that it consists almost entirely of flat regions. This is because changing a model a tiny bit will usually not change the number of misclassified examples. And if it does, the loss function will suddenly jump a lot.

Note that our model now has three parameters w1, w2 and b, so the loss surface is a function on a 3d space (a 4d "surface"). In order to plot it in two dimensions, we have fixed w2 = 1.
Sometimes your loss function should not be the same as your evaluation function.

A loss function serves two purposes:
1. To express what quantity we want to maximise in our search for a good model.
2. To provide a smooth loss surface, so that we can find a path from a bad model to a good one.

For this reason, it's common not to use the error as a loss function, even when it's the thing we're actually interested in minimizing. Instead, we'll replace it by a loss function that has its minimum at (roughly) the same model, but that provides a smooth, differentiable loss surface.

85
classification losses In this course, we will investigate three common loss functions for classification. The first, least-squares loss, is just an application of MSE loss to classification; we will discuss that in the remainder of the lecture. It's not usually that good, but it gives you an idea of what a classification loss might look like.

Least squares loss (this video)
Log loss / Cross entropy (Lecture 5, Probability)
SVM loss (Lecture 6, Linear Models 2)
86
least-squares loss The least squares classifier essentially turns the classification problem into a regression problem: it assigns points in one class the numeric value +1 and points in the other class the value -1; we then use a basic MSE loss on these target values:

loss(w, b) = Σᵢ∈pos (wTxᵢ + b - 1)²  +  Σᵢ∈neg (wTxᵢ + b + 1)²
             (loss for instances in pos class)  (loss for instances in neg class)

88
Note, however, that the optimum under this loss function may
not always perfectly separate the classes, even if they are
linearly separable. It does in our case, but this result is not
guaranteed.
89
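A sketch of the least-squares classifier on the example dataset: regress onto the targets +1 and -1 (here solved in closed form with numpy's least-squares solver for brevity, rather than by gradient descent), then classify by the sign of the fitted linear function.

```python
import numpy as np

# The example dataset, with true -> +1 and false -> -1.
X = np.array([[1, 2], [1, 1], [2, 3], [3, 3], [3, 1], [3, 2]], dtype=float)
y = np.array([1, -1, 1, 1, -1, -1], dtype=float)

Xb = np.column_stack([X, np.ones(len(X))])         # append a bias column
w1, w2, b = np.linalg.lstsq(Xb, y, rcond=None)[0]  # solve for (w1, w2, b)

pred = X @ np.array([w1, w2]) + b > 0  # the usual decision rule
print(pred)
```

On this dataset the resulting boundary happens to separate the two classes, but as noted above, that is not guaranteed in general.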
Here is the result in feature space, with the final decision
boundary in orange.
90
playground.tensorflow.org (bit.ly/2Me1fxU) The tensorflow playground also allows us to play around with linear classifiers. Note that the linear decision boundary is appropriate for only one of the two datasets.
91
mlcourse@peterbloem.nl
92