
Bayesian Linear Models: Model Selection

SPR Day 09, Spring 2020
Prof. Mike Hughes
https://www.cs.tufts.edu/comp/136/2020s/
Recap: Bayesian Linear Regression



Recap: Bayesian Linear Regression
General prior: must specify both a mean vector and a precision matrix.
Simpler prior: assume zero mean, so we just need to define the precision.



Problem: Hyperparameter Selection
How do we pick the prior hyperparameter alpha (the prior precision)?
How do we pick the likelihood hyperparameter beta (the noise precision)?



Hyperparameter Selection
                            Fixed valid. set    K-fold              Bayesian
                            (fraction f)        cross-validation    evidence

Fraction of data            1.0 - f             (K-1)/K             1.0
used for training           (higher is better: better use of training data)

Total runs / examples       1 run,              K runs,             1 run,
seen for training           (1-f) N             (K-1) N             N
                            (lower is better: faster training)

Total runs / examples       1 run,              K runs,             1 run,
seen for evaluation         f N                 N                   N
of fitness                  (lower is better: faster evaluation)

Fitness function            Heldout             Heldout             Evidence
                            likelihood          likelihood
Why use Model Evidence?



Related Problem: Model Selection
How do we pick the feature transform?

<latexit sha1_base64="CxQr/GUsCoVzg/46vJ75x7m2L4Y=">AAACLHicbVDLSgMxFM34rPVVdekmWIR2U2aqoBuh2I0bpYJ9QDsMmUymDc1khiQjlqH9Hzf+iiAuLOLW7zBtBx+tFwLncS8397gRo1KZ5thYWl5ZXVvPbGQ3t7Z3dnN7+w0ZxgKTOg5ZKFoukoRRTuqKKkZakSAocBlpuv3qxG/eEyFpyO/UICJ2gLqc+hQjpSUnV+1EPVp4cHgRXsA2nDDHmvHRaEbLP5R5oZLfxrU2borQdnJ5s2ROCy4CKwV5kFbNyb10vBDHAeEKMyRl2zIjZSdIKIoZGWY7sSQRwn3UJW0NOQqItJPpsUN4rBUP+qHQjys4VX9PJCiQchC4ujNAqifnvYn4n9eOlX9uJ5RHsSIczxb5MYMqhJPkoEcFwYoNNEBYUP1XiHtIIKx0vlkdgjV/8iJolEvWSal8e5qvXKZxZMAhOAIFYIEzUAFXoAbqAINH8AzewNh4Ml6Nd+Nj1rpkpDMH4E8Zn1+qJqVx</latexit>
(xn ) = [ 1 (xn ) 2 (xn ) ... M (xN )]

What size M?
Which functions to include?

Challenge: there could be an unbounded number of choices!

Need to enumerate a small set of possible models (size L) if we want to average over them properly.
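For concreteness, here is a minimal Python sketch of one common feature transform, polynomial features of order M (the function name and interface are illustrative, not from the course materials):

```python
import numpy as np

def poly_features(x, M):
    # Map a 1D input array x of shape (N,) to polynomial features of
    # order M, returning Phi of shape (N, M+1): columns [1, x, ..., x^M].
    return np.vstack([x ** m for m in range(M + 1)]).T
```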



Model Selection for Regression
Model family (size M, feature functions, hyperparameters)

Specific parameters for chosen model

t_n \mid x_n \sim \mathcal{N}(\mathbf{w}^T \phi(x_n),\ \beta^{-1})

Observed outputs (and corresponding inputs)
Model Selection via Evidence

(Figure: evidence p(dataset | model) for models of increasing complexity, plotted over the space of possible datasets)

Key idea: “Goldilocks” principle


• If the model is too simple, it puts high mass on only a few datasets
• If the model is too complex, it spreads its mass over too many datasets
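This follows from normalization: the evidence p(dataset | model) must sum (or integrate) to 1 over all possible datasets, so a model that spreads its mass across many datasets can assign only a little to the one dataset we actually observed. The best-evidence model is the one "just right" for the observed data.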
Predictive distribution
Average over the predictive distribution for each of the L possible models, weighted by posterior probability.

p(t \mid x, \mathcal{D}) = \sum_{i=1}^{L} p(t \mid x, \mathcal{D}, \mathcal{M}_i) \, p(\mathcal{M}_i \mid \mathcal{D})

(first factor: predictive distrib. for t given model i; second factor: posterior of model i)

Key idea: We can use all L models; we don't need to pick just one.
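As a minimal Python sketch of this average (assuming a uniform prior over the L models, with hypothetical argument names):

```python
import numpy as np

def model_averaged_predictive(log_evidences, per_model_densities):
    # Posterior model probabilities p(M_i | data): with a uniform model
    # prior, these are proportional to the evidences p(data | M_i).
    log_post = log_evidences - np.logaddexp.reduce(log_evidences)
    weights = np.exp(log_post)              # shape (L,), sums to 1
    # per_model_densities has shape (L, n_query): each row holds one
    # model's predictive density evaluated at the same query points.
    return weights @ per_model_densities    # shape (n_query,)
```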



Ideal predictive posterior
If we want to predict new data given old data, ideally we would average over all parameters w, alpha, beta, weighted by their posterior probability.
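Written out (a reconstruction consistent with the description above; it matches PRML Eq. 3.74), the ideal predictive posterior is

p(t_* \mid \mathbf{t}) = \iiint p(t_* \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, p(\alpha, \beta \mid \mathbf{t})\, d\mathbf{w}\, d\alpha\, d\beta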

But, this integral is hard!



Tractable predictive posterior
Assume:
• We have enough data that the posterior p(alpha, beta | data) is sharply peaked at its MAP point estimate

\hat{\alpha}, \hat{\beta} = \arg\max_{\alpha, \beta} \; p(\alpha, \beta \mid \mathbf{t})
• The prior on alpha, beta is relatively uniform ("flat", i.e. proportional to a constant), so these estimates might as well be maximum likelihood estimates

\hat{\alpha}, \hat{\beta} = \arg\max_{\alpha, \beta} \; p(\mathbf{t} \mid \alpha, \beta) \, \text{flat}(\alpha, \beta)
Then the tractable estimate of the predictive posterior becomes:
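Under these assumptions (cf. PRML Eq. 3.82):

p(t_* \mid \mathbf{t}) \approx p(t_* \mid \mathbf{t}, \hat{\alpha}, \hat{\beta}) = \int p(t_* \mid \mathbf{w}, \hat{\beta})\, p(\mathbf{w} \mid \mathbf{t}, \hat{\alpha}, \hat{\beta})\, d\mathbf{w}

A minimal NumPy sketch of this plug-in predictive (the posterior formulas are the standard ones, PRML Eqs. 3.53-3.54 and 3.58-3.59; the function name and array shapes are my assumptions):

```python
import numpy as np

def plug_in_predictive(Phi, t, Phi_star, alpha_hat, beta_hat):
    # Posterior over weights: p(w | t) = N(w | m_N, S_N), with
    # S_N^{-1} = alpha I + beta Phi^T Phi and m_N = beta S_N Phi^T t.
    M = Phi.shape[1]
    S_N_inv = alpha_hat * np.eye(M) + beta_hat * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta_hat * S_N @ (Phi.T @ t)
    # Gaussian predictive at each row phi(x_*) of Phi_star (shape (Q, M)):
    # mean m_N^T phi(x_*), variance 1/beta + phi(x_*)^T S_N phi(x_*).
    mean = Phi_star @ m_N
    var = 1.0 / beta_hat + np.einsum('qm,mk,qk->q', Phi_star, S_N, Phi_star)
    return mean, var
```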



Now, we wish to solve this hyperparameter estimation problem:

\hat{\alpha}, \hat{\beta} = \arg\max_{\alpha, \beta} \; p(\mathbf{t} \mid \alpha, \beta) \, \text{flat}(\alpha, \beta)

where the factor p(\mathbf{t} \mid \alpha, \beta) is the "evidence".
But first, what is this "evidence" anyway?



Evidence for Linear Regression

• Probability of training data t given alpha, beta


• Marginalizes over the weights w
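Written out (a reconstruction consistent with the two bullets above; cf. PRML Eq. 3.77):

p(\mathbf{t} \mid \alpha, \beta) = \int p(\mathbf{t} \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)\, d\mathbf{w}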



Simplifying the Evidence

Key ideas:
• Bring constants outside the integral
• Recognize the inside of the integral as a Gaussian, and "complete the square"



Closed-form Log Evidence

Precision matrix for posterior p(w | data)

Mean vector for posterior p(w | data)
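The formula itself (reconstructed from the labels above; it matches PRML Eq. 3.86) is

\ln p(\mathbf{t} \mid \alpha, \beta) = \frac{M}{2} \ln \alpha + \frac{N}{2} \ln \beta - E(\mathbf{m}_N) - \frac{1}{2} \ln |\mathbf{A}| - \frac{N}{2} \ln(2\pi)

where \mathbf{A} = \alpha \mathbf{I} + \beta \Phi^T \Phi is the posterior precision matrix, \mathbf{m}_N = \beta \mathbf{A}^{-1} \Phi^T \mathbf{t} is the posterior mean vector, and E(\mathbf{m}_N) = \frac{\beta}{2} \|\mathbf{t} - \Phi \mathbf{m}_N\|^2 + \frac{\alpha}{2} \mathbf{m}_N^T \mathbf{m}_N.

A direct NumPy translation (a sketch; the function name is mine):

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    # Closed-form log marginal likelihood ln p(t | alpha, beta) for
    # Bayesian linear regression with prior w ~ N(0, alpha^{-1} I).
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi     # posterior precision
    m_N = beta * np.linalg.solve(A, Phi.T @ t)     # posterior mean
    E_mN = (0.5 * beta * np.sum((t - Phi @ m_N) ** 2)
            + 0.5 * alpha * np.dot(m_N, m_N))
    _, logdet_A = np.linalg.slogdet(A)             # stable log |A|
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet_A - 0.5 * N * np.log(2.0 * np.pi))
```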



How to estimate?
\hat{\alpha}, \hat{\beta} = \arg\max_{\alpha, \beta} \; p(\mathbf{t} \mid \alpha, \beta) \, \text{flat}(\alpha, \beta)
• Can do gradient descent
• Can do coordinate descent (EM, later in the course)
• Can get the estimates analytically via an eigendecomposition (see textbook!), cycling the updates until convergence; see the sketch below
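For the analytic route, the textbook's fixed-point updates (PRML Eqs. 3.91-3.95) can be sketched in NumPy as below; the starting values and iteration count are arbitrary choices, and a real implementation would check convergence:

```python
import numpy as np

def fit_alpha_beta(Phi, t, n_iters=50):
    # Evidence maximization (empirical Bayes) for alpha and beta via
    # fixed-point updates, cycled until convergence.
    N, M = Phi.shape
    alpha, beta = 1.0, 1.0                      # arbitrary initialization
    eigs = np.linalg.eigvalsh(Phi.T @ Phi)      # eigendecomposition, done once
    for _ in range(n_iters):
        lam = beta * eigs                       # eigenvalues of beta Phi^T Phi
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        gamma = np.sum(lam / (alpha + lam))     # effective number of parameters
        alpha = gamma / np.dot(m_N, m_N)
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
    return alpha, beta
```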



Example: 1D sinusoid data



Model Selection for Linear Regr.
(using polynomial features of order M)

Why does M=2 have low evidence?


We can set the quadratic term's weight to 0, but then we pay for complexity we never use, so the model is "too complex".
We can set the quadratic term nonzero, but a sinusoid (an odd function) is not fit well by a quadratic.
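Putting the pieces together, a hypothetical evidence-based sweep over polynomial orders, reusing the sketched helpers poly_features, fit_alpha_beta, and log_evidence from earlier, might look like:

```python
# x_train, t_train: 1D NumPy arrays of inputs and targets (assumed given).
scores = {}
for M in range(10):
    Phi = poly_features(x_train, M)
    alpha, beta = fit_alpha_beta(Phi, t_train)
    scores[M] = log_evidence(Phi, t_train, alpha, beta)
best_M = max(scores, key=scores.get)   # order with the highest log evidence
```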

