
Bayesian Linear Models: Model Selection

SPR Day 09, Spring 2020
Prof. Mike Hughes
https://www.cs.tufts.edu/comp/136/2020s/
Recap: Bayesian Linear Regression



Recap: Bayesian Linear Regression
General prior: must specify both a mean vector and a precision matrix.
Simpler prior: assume zero mean, so we just need to define the precision.



Problem: Hyperparameter Selection
How do we pick the prior hyperparameter alpha (the prior precision)?
How do we pick the likelihood hyperparameter beta (the noise precision)?



Hyperparameter Selection
                            Fixed valid. set    K-fold              Bayesian
                            (fraction f)        cross-validation    evidence

Fraction of data            1.0 - f             (K-1)/K             1.0
used for training           (higher is better: better use of training data)

Total runs / examples       1 run,              K runs,             1 run,
seen for training           (1-f) N             (K-1) N             N
                            (lower is better: faster training)

Total runs / examples       1 run,              K runs,             1 run,
seen for evaluation         f N                 N                   N
of fitness                  (lower is better: faster evaluation)

Fitness function            Heldout             Heldout             Evidence
                            likelihood          likelihood
Why use Model Evidence?



Related Problem: Model Selection
How do we pick the feature transform?

<latexit sha1_base64="CxQr/GUsCoVzg/46vJ75x7m2L4Y=">AAACLHicbVDLSgMxFM34rPVVdekmWIR2U2aqoBuh2I0bpYJ9QDsMmUymDc1khiQjlqH9Hzf+iiAuLOLW7zBtBx+tFwLncS8397gRo1KZ5thYWl5ZXVvPbGQ3t7Z3dnN7+w0ZxgKTOg5ZKFoukoRRTuqKKkZakSAocBlpuv3qxG/eEyFpyO/UICJ2gLqc+hQjpSUnV+1EPVp4cHgRXsA2nDDHmvHRaEbLP5R5oZLfxrU2borQdnJ5s2ROCy4CKwV5kFbNyb10vBDHAeEKMyRl2zIjZSdIKIoZGWY7sSQRwn3UJW0NOQqItJPpsUN4rBUP+qHQjys4VX9PJCiQchC4ujNAqifnvYn4n9eOlX9uJ5RHsSIczxb5MYMqhJPkoEcFwYoNNEBYUP1XiHtIIKx0vlkdgjV/8iJolEvWSal8e5qvXKZxZMAhOAIFYIEzUAFXoAbqAINH8AzewNh4Ml6Nd+Nj1rpkpDMH4E8Zn1+qJqVx</latexit>
(xn ) = [ 1 (xn ) 2 (xn ) ... M (xN )]

What size M?
Which functions to include?

Challenge: there could be an unbounded number of choices!

Need to enumerate a small set of possible models (size L) if we want to average over them properly.
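For concreteness, here is a minimal Python sketch of one common feature transform, polynomial features of order M (the function name and interface are illustrative, not from the course materials):

```python
import numpy as np

def poly_features(x, M):
    # Map a 1D input array x of shape (N,) to polynomial features of
    # order M, returning Phi of shape (N, M+1): columns [1, x, ..., x^M].
    return np.vstack([x ** m for m in range(M + 1)]).T
```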



Model Selection for Regression
Model family (size M, feature functions, hyperparameters)

Specific parameters for chosen model

t_n \mid x_n \sim \mathcal{N}(\mathbf{w}^T \phi(x_n),\ \beta^{-1})

Observed outputs (and corresponding inputs)
Model Selection via Evidence

(Figure: evidence p(dataset | model) for models of increasing complexity, plotted over the space of possible datasets)

Key idea: “Goldilocks” principle


• If the model is too simple, it puts high mass on only a few datasets
• If the model is too complex, it spreads its mass over too many datasets
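This follows from normalization: the evidence p(dataset | model) must sum (or integrate) to 1 over all possible datasets, so a model that spreads its mass across many datasets can assign only a little to the one dataset we actually observed. The best-evidence model is the one "just right" for the observed data.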
Predictive distribution
Average over the predictive distribution for each of the L possible models, weighted by posterior probability.

p(t \mid x, \mathcal{D}) = \sum_{i=1}^{L} p(t \mid x, \mathcal{D}, \mathcal{M}_i) \, p(\mathcal{M}_i \mid \mathcal{D})

(first factor: predictive distrib. for t given model i; second factor: posterior of model i)

Key idea: We can use all L models; we don't need to pick just one.
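As a minimal Python sketch of this average (assuming a uniform prior over the L models, with hypothetical argument names):

```python
import numpy as np

def model_averaged_predictive(log_evidences, per_model_densities):
    # Posterior model probabilities p(M_i | data): with a uniform model
    # prior, these are proportional to the evidences p(data | M_i).
    log_post = log_evidences - np.logaddexp.reduce(log_evidences)
    weights = np.exp(log_post)              # shape (L,), sums to 1
    # per_model_densities has shape (L, n_query): each row holds one
    # model's predictive density evaluated at the same query points.
    return weights @ per_model_densities    # shape (n_query,)
```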



Ideal predictive posterior
If we want to predict new data given old data, ideally we would average over all parameters w, alpha, beta, weighted by their posterior probability.
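Written out (a reconstruction consistent with the description above; it matches PRML Eq. 3.74), the ideal predictive posterior is

p(t_* \mid \mathbf{t}) = \iiint p(t_* \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, p(\alpha, \beta \mid \mathbf{t})\, d\mathbf{w}\, d\alpha\, d\beta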

But, this integral is hard!



Tractable predictive posterior
Assume:
• We have enough data that the posterior p(alpha, beta | data) is sharply peaked at its MAP point estimate

\hat{\alpha}, \hat{\beta} = \arg\max_{\alpha, \beta} \; p(\alpha, \beta \mid \mathbf{t})
• The prior on alpha, beta is relatively uniform ("flat", i.e. proportional to a constant), so these estimates might as well be maximum likelihood estimates

\hat{\alpha}, \hat{\beta} = \arg\max_{\alpha, \beta} \; p(\mathbf{t} \mid \alpha, \beta) \, \text{flat}(\alpha, \beta)
Then the tractable estimate of the predictive posterior becomes:
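Under these assumptions (cf. PRML Eq. 3.82):

p(t_* \mid \mathbf{t}) \approx p(t_* \mid \mathbf{t}, \hat{\alpha}, \hat{\beta}) = \int p(t_* \mid \mathbf{w}, \hat{\beta})\, p(\mathbf{w} \mid \mathbf{t}, \hat{\alpha}, \hat{\beta})\, d\mathbf{w}

A minimal NumPy sketch of this plug-in predictive (the posterior formulas are the standard ones, PRML Eqs. 3.53-3.54 and 3.58-3.59; the function name and array shapes are my assumptions):

```python
import numpy as np

def plug_in_predictive(Phi, t, Phi_star, alpha_hat, beta_hat):
    # Posterior over weights: p(w | t) = N(w | m_N, S_N), with
    # S_N^{-1} = alpha I + beta Phi^T Phi and m_N = beta S_N Phi^T t.
    M = Phi.shape[1]
    S_N_inv = alpha_hat * np.eye(M) + beta_hat * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta_hat * S_N @ (Phi.T @ t)
    # Gaussian predictive at each row phi(x_*) of Phi_star (shape (Q, M)):
    # mean m_N^T phi(x_*), variance 1/beta + phi(x_*)^T S_N phi(x_*).
    mean = Phi_star @ m_N
    var = 1.0 / beta_hat + np.einsum('qm,mk,qk->q', Phi_star, S_N, Phi_star)
    return mean, var
```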



Now, we wish to solve this hyperparameter estimation problem:

\hat{\alpha}, \hat{\beta} = \arg\max_{\alpha, \beta} \; p(\mathbf{t} \mid \alpha, \beta) \, \text{flat}(\alpha, \beta)

where the factor p(\mathbf{t} \mid \alpha, \beta) is the "evidence".
But first, what is this "evidence" anyway?



Evidence for Linear Regression

• Probability of training data t given alpha, beta


• Marginalizes over the weights w
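Written out (a reconstruction consistent with the two bullets above; cf. PRML Eq. 3.77):

p(\mathbf{t} \mid \alpha, \beta) = \int p(\mathbf{t} \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)\, d\mathbf{w}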



Simplifying the Evidence

Key ideas:
• Bring constants outside the integral
• Recognize the inside of the integral as a Gaussian, and "complete the square"



Closed-form Log Evidence

Precision matrix for posterior p(w | data)

Mean vector for posterior p(w | data)
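The formula itself (reconstructed from the labels above; it matches PRML Eq. 3.86) is

\ln p(\mathbf{t} \mid \alpha, \beta) = \frac{M}{2} \ln \alpha + \frac{N}{2} \ln \beta - E(\mathbf{m}_N) - \frac{1}{2} \ln |\mathbf{A}| - \frac{N}{2} \ln(2\pi)

where \mathbf{A} = \alpha \mathbf{I} + \beta \Phi^T \Phi is the posterior precision matrix, \mathbf{m}_N = \beta \mathbf{A}^{-1} \Phi^T \mathbf{t} is the posterior mean vector, and E(\mathbf{m}_N) = \frac{\beta}{2} \|\mathbf{t} - \Phi \mathbf{m}_N\|^2 + \frac{\alpha}{2} \mathbf{m}_N^T \mathbf{m}_N.

A direct NumPy translation (a sketch; the function name is mine):

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    # Closed-form log marginal likelihood ln p(t | alpha, beta) for
    # Bayesian linear regression with prior w ~ N(0, alpha^{-1} I).
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi     # posterior precision
    m_N = beta * np.linalg.solve(A, Phi.T @ t)     # posterior mean
    E_mN = (0.5 * beta * np.sum((t - Phi @ m_N) ** 2)
            + 0.5 * alpha * np.dot(m_N, m_N))
    _, logdet_A = np.linalg.slogdet(A)             # stable log |A|
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet_A - 0.5 * N * np.log(2.0 * np.pi))
```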



How to estimate?
\hat{\alpha}, \hat{\beta} = \arg\max_{\alpha, \beta} \; p(\mathbf{t} \mid \alpha, \beta) \, \text{flat}(\alpha, \beta)
• Can do gradient descent
• Can do coordinate descent (EM, later in the course)
• Can get the estimates analytically via an eigendecomposition (see textbook!), cycling the updates until convergence; see the sketch below
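For the analytic route, the textbook's fixed-point updates (PRML Eqs. 3.91-3.95) can be sketched in NumPy as below; the starting values and iteration count are arbitrary choices, and a real implementation would check convergence:

```python
import numpy as np

def fit_alpha_beta(Phi, t, n_iters=50):
    # Evidence maximization (empirical Bayes) for alpha and beta via
    # fixed-point updates, cycled until convergence.
    N, M = Phi.shape
    alpha, beta = 1.0, 1.0                      # arbitrary initialization
    eigs = np.linalg.eigvalsh(Phi.T @ Phi)      # eigendecomposition, done once
    for _ in range(n_iters):
        lam = beta * eigs                       # eigenvalues of beta Phi^T Phi
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        gamma = np.sum(lam / (alpha + lam))     # effective number of parameters
        alpha = gamma / np.dot(m_N, m_N)
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
    return alpha, beta
```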



Example: 1D sinusoid data



Model Selection for Linear Regr.
(using polynomial features of order M)

Why does M=2 have low evidence?


We can set the quadratic term's weight to 0, but then we pay for complexity we never use, so the model is "too complex".
We can set the quadratic term nonzero, but a sinusoid (an odd function) is not fit well by a quadratic.
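Putting the pieces together, a hypothetical evidence-based sweep over polynomial orders, reusing the sketched helpers poly_features, fit_alpha_beta, and log_evidence from earlier, might look like:

```python
# x_train, t_train: 1D NumPy arrays of inputs and targets (assumed given).
scores = {}
for M in range(10):
    Phi = poly_features(x_train, M)
    alpha, beta = fit_alpha_beta(Phi, t_train)
    scores[M] = log_evidence(Phi, t_train, alpha, beta)
best_M = max(scores, key=scores.get)   # order with the highest log evidence
```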

