
Bayesian Kernel Methods

Dan Lo
Department of Computer Science
Kennesaw State University
Maximum Likelihood Estimate for Linear Regression
• Given data $D = \{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, assume the $y_i \sim N(w^T x_i, \sigma^2)$ are i.i.d. with a known variance.
• We want to find $w$ that maximizes $p(D \mid w)$.
• So $w_{MLE} \in \operatorname{argmax}_{w \in \mathbb{R}^d} p(D \mid w)$.
• $p(D \mid w) = p(y_1, y_2, \dots, y_n \mid x_1, x_2, \dots, x_n, w) = \prod_{i=1}^{n} p(y_i \mid x_i, w)$
• $= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(y_i - w^T x_i)^2} = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n} e^{-\frac{1}{2\sigma^2}\sum_i (y_i - w^T x_i)^2}$
• $\sum_i (y_i - w^T x_i)^2 = (y - Aw)^T (y - Aw) = \lVert y - Aw \rVert^2$, where $A = \begin{pmatrix} -\,x_1^T- \\ \vdots \\ -\,x_n^T- \end{pmatrix}$
Cont.
• So maximizing $p(D \mid w)$ is equivalent to minimizing $(y - Aw)^T (y - Aw)$.
• $(y - Aw)^T (y - Aw) = y^T y - 2 y^T A w + w^T A^T A w$
• Taking the derivative with respect to $w$: $-2 A^T y + 2 A^T A w = 0$
• $A^T A w = A^T y \Rightarrow w = (A^T A)^{-1} A^T y$
• Invertible? $A^T A$ is invertible if the columns of $A$ are linearly independent.
• To show this is a minimum, the second derivative (the Hessian) must be positive semi-definite.
• $\mathcal{H} = \nabla^2 f = \left[\frac{\partial^2 f}{\partial x_i \partial x_j}\right]_{ij}$
• $\nabla_w^2 \, (y - Aw)^T (y - Aw) = 2 A^T A$
• $A^T A$ is always positive semi-definite, so the stationary point is indeed a minimum.
Cont.
• The result is identical to the least-squares estimate for linear regression!
• We may let $\mathcal{L}(w) = \frac{1}{2}\lVert y - Aw \rVert^2$ in the previous derivation.
• $\operatorname{argmin}_w \mathcal{L}(w) = \operatorname{argmin}_w \frac{1}{2}\lVert y - Aw \rVert^2 = \operatorname{argmin}_w \lVert y - Aw \rVert = \operatorname{argmin}_w d(y, Aw)$. So MLE is trying to bring the estimate of $y$ closer to the labeled value.
• $f(x) = w_{MLE}^T x$
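To make this concrete, here is a minimal NumPy sketch (not from the slides) that builds a synthetic design matrix $A$ and computes $w_{MLE}$ with a least-squares solver; the data-generating parameters are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y_i = w_true^T x_i + Gaussian noise (assumed example setup).
n, d = 100, 3
A = rng.normal(size=(n, d))            # design matrix with rows x_i^T
w_true = np.array([1.0, -2.0, 0.5])
y = A @ w_true + rng.normal(scale=0.1, size=n)

# w_MLE = (A^T A)^{-1} A^T y, computed stably with a least-squares solver.
w_mle, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w_mle)                           # should be close to w_true
```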
Bayesian Regression
• In linear regression, we get a predicted value, but we don't know how confident that value is! There is an overfitting issue as well.
• A Bayesian approach allows us to go beyond optimizing a loss function.
• Given data $D = \{X, y\}$, we want to learn a function to predict $y$.
• A Gaussian process defines a distribution over functions $p(f)$ which can be used for Bayesian regression:
• $p(f \mid D) = \frac{p(D \mid f)\, p(f)}{p(D)}$
• This gives a variance of the prediction as well!
Example of Computing Posterior
• Given $D = \{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
• The $y_i$'s are independent given $w$, with $y_i \sim N(w^T x_i, \sigma_y^2)$ and $w \sim N(0_d, \sigma_w^2 I)$; each $w_i$ is i.i.d.
• Let $a^{-1} = \sigma_y^2$ and $b^{-1} = \sigma_w^2$. Assume $a, b > 0$ are known; they are called precisions.
• Likelihood: $p(D \mid w) \propto e^{-\frac{a}{2}(y - Aw)^T (y - Aw)}$ where $A = \begin{pmatrix} -\,x_1^T- \\ \vdots \\ -\,x_n^T- \end{pmatrix}$
• Posterior: $p(w \mid D) \propto p(D \mid w)\, p(w) \propto e^{-\frac{a}{2}(y - Aw)^T (y - Aw) - \frac{b}{2} w^T w}$
• We want to know the posterior distribution! If so, we can use it to predict $y$, given an $x$.
• $\Rightarrow a (y - Aw)^T (y - Aw) + b\, w^T w = a\left(y^T y - w^T A^T y - y^T A w + w^T A^T A w\right) + b\, w^T w = a\, y^T y - 2a\, w^T A^T y + a\, w^T A^T A w + b\, w^T w$
• $\Rightarrow a\, y^T y - 2a\, w^T A^T y + w^T (a A^T A + b I) w$
Posterior Distribution
• For the posterior distribution to follow a Gaussian, it must have the exponential form $(w - \mu)^T \Sigma^{-1} (w - \mu) = w^T \Sigma^{-1} w - 2 w^T \Sigma^{-1} \mu + \text{constant}$.
• Compare this form to what we have: $a\, y^T y - 2a\, w^T A^T y + w^T (a A^T A + b I) w$
• We have $\Sigma^{-1} = a A^T A + b I$ and $w^T \Sigma^{-1} \mu = a\, w^T A^T y \Rightarrow \mu = a \Sigma A^T y$
• So $p(w \mid D) = N(w \mid \mu, \Sigma)$. That is, the posterior follows a Gaussian distribution.
Max A Posteriori Estimate of w
• $w_{MAP} = \mu = a \Sigma A^T y = a (a A^T A + b I)^{-1} A^T y = \left(A^T A + \frac{b}{a} I\right)^{-1} A^T y$
• C.f. $w_{MLE} = (A^T A)^{-1} A^T y$
• The term $\frac{b}{a}$ serves as the regularization parameter.
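A minimal NumPy sketch of these estimates, under assumed precisions $a$, $b$ and synthetic data: it forms the posterior covariance $\Sigma = (aA^TA + bI)^{-1}$ and compares $w_{MAP}$ with $w_{MLE}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 3
A = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
a, b = 1.0 / 0.25, 1.0                 # a = 1/sigma_y^2, b = 1/sigma_w^2 (assumed values)
y = A @ w_true + rng.normal(scale=0.5, size=n)

# Posterior p(w | D) = N(mu, Sigma) with Sigma^{-1} = a A^T A + b I and mu = a Sigma A^T y.
Sigma = np.linalg.inv(a * A.T @ A + b * np.eye(d))
w_map = a * Sigma @ A.T @ y            # equals (A^T A + (b/a) I)^{-1} A^T y (ridge regression)

w_mle = np.linalg.solve(A.T @ A, A.T @ y)
print(w_map, w_mle)                    # MAP is shrunk toward 0 relative to MLE
```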
Gaussian Process
• A Gaussian process defines a distribution over functions, $p(f)$, where $f$ is a function mapping some input space $\mathcal{X}$ to $\mathbb{R}$.
• $f: \mathcal{X} \to \mathbb{R}$
• Notice that $f$ can be an infinite-dimensional quantity, e.g. when $\mathcal{X} = \mathbb{R}$.
• Let $\mathbf{f} = (f(x_1), \dots, f(x_n))$ be an $n$-dimensional vector of function values evaluated at $n$ points $x_i \in \mathcal{X}$. Note $\mathbf{f}$ is a random variable.
• Definition: $p(f)$ is a Gaussian process if for any finite subset $\{x_1, \dots, x_n\} \subset \mathcal{X}$, the marginal distribution over that finite subset, $p(\mathbf{f})$, is a multivariate Gaussian distribution.
Examples
• Let the input space be a dataset $\{(x_i, y_i)\}_{i=1}^n$ and assume $f(x_i) = y_i$ for all $i$. Let the subset of the dataset be $\{(x_i, y_i)\}_{i=1}^r$ and $z = (y_1, \dots, y_r) \in \mathbb{R}^r$. Then $p(z)$ is a trivial Gaussian process and a multivariate Gaussian distribution.
• If we let $r = 1$ and $z = y = f(x) = m x$ where $m \sim N(0, 1)$ in the above example, we get random lines in this Gaussian process.
Random Lines
Brownian Motion ($k(x, y) = \min(x, y)$)
Gaussian ($k(x, y) = \exp(-100\,(x - y)^T (x - y))$)
Existence of Gaussian Process
• Theorem: For any set $S$, any mean function $\mu: S \to \mathbb{R}$, and any covariance function $k: S \times S \to \mathbb{R}$, there exists a Gaussian process $(z_t)$ on $S$ such that $E[z_t] = \mu(t)$ and $\mathrm{Cov}(z_s, z_t) = k(s, t)$ for all $s, t \in S$.
• Note that any covariance matrix is symmetric and positive semi-
definite!
• This theorem allows us to choose whatever mean function and
whatever covariance to build a Gaussian process!
Application of Gaussian Process
• Gaussian processes define distributions on functions which can be
used for non-linear/linear regressions, classification, ranking,
preference learning, ordinal regressions, etc.
• GPs are closely related to many other models such as
• Bayesian kernel machines
• Linear regression with basis functions
• Infinite multi-layer perceptron neural networks
• Spline models
• Compared to SVM, GP offers several advantages: learning the kernel
and regularization parameters, integrated feature selection, fully
probabilistic predictions, and interpretability.
Relations among univariate/multivariate/infinite Gaussian distributions
• Univariate: $(\mu, \sigma^2)$
• Multivariate: $(\boldsymbol{\mu}, \Sigma)$
• Gaussian process: $(\mu(\cdot), K(\cdot, \cdot))$ where $\mu$ is a mean function and $K$ is a covariance function (kernel).
• A GP is an infinite-dimensional generalization of the multivariate Gaussian distribution.
GP Kernels
• $p(f)$ is a Gaussian process if for any finite subset $\{x_1, \dots, x_n\} \subset \mathcal{X}$, the marginal distribution over that finite subset, $p(f)$, has a multivariate Gaussian distribution.
• GPs are parameterized by a mean function $\mu(x)$ and a covariance function, or kernel, $K(x, x')$.
• $p(f(x), f(x')) = N(\mu, \Sigma)$ where
• $\mu = \begin{pmatrix} \mu(x) \\ \mu(x') \end{pmatrix}$ and $\Sigma = \begin{pmatrix} K(x, x) & K(x, x') \\ K(x', x) & K(x', x') \end{pmatrix}$
• Similarly for $p(f(x_1), \dots, f(x_n))$, where $\mu$ is an $n \times 1$ vector and $\Sigma$ is an $n \times n$ matrix.
Example of Covariance Function
• $K(x_i, x_j) = v_0\, e^{-\left(\frac{\lvert x_i - x_j \rvert}{r}\right)^{\alpha}} + v_1 + v_2\, \delta_{ij}$ with parameters $(v_0, v_1, v_2, r, \alpha)$.
• The kernel parameters can be learned from data:
• $v_0$: signal variance
• $v_1$: variance of bias
• $v_2$: noise variance
• $r$: length scale
• $\alpha$: roughness
• Once the mean and covariance are defined, everything else about GPs
follows from the basic rules of probability applied to multivariate
Gaussians.
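A small sketch of this covariance function for 1-D inputs, with illustrative (assumed) parameter values; the Kronecker delta term is applied only when the two indices coincide.

```python
import numpy as np

def k(xi, xj, i, j, v0=1.0, v1=0.1, v2=0.01, r=1.0, alpha=2.0):
    """Covariance from the slide: v0*exp(-(|xi-xj|/r)^alpha) + v1 + v2*delta_ij."""
    delta = 1.0 if i == j else 0.0
    return v0 * np.exp(-(abs(xi - xj) / r) ** alpha) + v1 + v2 * delta

# Build the full covariance matrix for a few 1-D inputs (assumed example points).
x = np.array([0.0, 0.5, 2.0])
K = np.array([[k(x[i], x[j], i, j) for j in range(len(x))] for i in range(len(x))])
print(K)   # symmetric, with the noise term v2 added only on the diagonal
```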
Gaussian Process Priors
• GP: a consistent Gaussian prior on any set of function values $\mathbf{f} = \{f_n\}_{n=1}^N$, given corresponding inputs $X = \{x_n\}_{n=1}^N$.
• Points nearby are highly correlated; points far apart are nearly independent.
• So the correlation matrix ($N \times N$) has high values along a diagonal band and values close to zero far from the diagonal.
• Covariance: $K_{nn'} = K(x_n, x_{n'}; \theta) = v\, e^{-\frac{1}{2} \sum_{d=1}^{D} \left(\frac{x_n^d - x_{n'}^d}{r_d}\right)^2}$
GP: Prior
• $f(x) \sim GP(m(x), K(x, x'))$
• $m(x) = E[f(x)]$
• $K(x, x') = E\big[(f(x) - m(x))\,(f(x') - m(x'))^T\big]$
• $k(x, x') = e^{-\frac{1}{2}\lVert x - x' \rVert^2}$
• Steps to sample from the prior (see the sketch below):
1. Create $N$ points $x_i$
2. Draw $u \sim N(0_N, I_N)$ and build $K_{N \times N}$
3. Use the Cholesky decomposition as a square root of $K = L L^T$, so that $L u \sim N(0, K)$
4. $f^{(i)} \sim N(0_N, K)$, obtained as $f^{(i)} = L u$ with $u \sim N(0_N, I)$
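A minimal sketch of these steps using the squared-exponential kernel $k(x, x') = e^{-\frac{1}{2}(x - x')^2}$; the small diagonal jitter is a standard numerical safeguard and is not part of the slide.

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 1: create N points x_i.
N = 200
x = np.linspace(-5, 5, N)

# Step 2: build K (squared-exponential kernel) and draw u ~ N(0_N, I_N).
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
u = rng.normal(size=(N, 10))                   # 10 independent standard normal vectors

# Step 3: Cholesky factor L with K = L L^T (jitter added for numerical stability).
L = np.linalg.cholesky(K + 1e-9 * np.eye(N))

# Step 4: f = L u ~ N(0_N, K), i.e. 10 samples from the GP prior.
f = L @ u                                      # each column is one prior sample
print(f.shape)                                 # (200, 10)
```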
10 samples from the GP Prior
GP Posterior
• Given the data set $D = \{(x_i, f_i), i = 1{:}N\}$, $p(f \mid D) = \frac{p(D \mid f)\, p(f)}{p(D)}$.
• Ten samples from the GP posterior are as follows:
Using GP for Nonlinear Regression
• Given a data set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n = (X, y)$, the model is $y_i = f(x_i) + \epsilon_i$ where $f \sim GP(\cdot \mid 0, K)$ and $\epsilon_i \sim N(\cdot \mid 0, \sigma^2)$.
• The prior on $f$ is a GP and the likelihood is Gaussian, so the posterior on $f$ is also a GP ($p(f \mid D) = p(D \mid f)\, p(f) / p(D)$).
• To make a prediction: $p(y_* \mid x_*, D) = \int p(y_* \mid x_*, f, D)\, p(f \mid D)\, df$. That is, we average the prediction over the posterior of $f$.
• We can compute the marginal likelihood (evidence) and use it to compare or tune covariance functions: $p(y \mid X) = \int p(y \mid f, X)\, p(f)\, df$.
Why Does the Gaussian Process Work?
• From Bayes' rule, we know $p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$.
• If we consider the joint probability, we know $p(Y \mid X) = \frac{p(X, Y)}{p(X)}$.
• Consider a simple example: given data $(x, y)$ with $y = f(x)$ and a test point $x_*$, we want to predict $y_*$ based on the assumption that $(y_*, y)$ follows a bivariate normal distribution with a known covariance matrix.
• Now, if we know the distribution of $y_*$ given $y$, then the mean of that distribution is our best estimate of $y_*$ and its variance gives the uncertainty.
• Luckily, that conditional distribution is normal, and $p(y_* \mid y) = \frac{p(y_*, y)}{p(y)}$.
Multivariate Gaussian Distribution
• Definition (characterization): The random variables $X_1, \dots, X_n$ are said to have an $n$-dimensional normal distribution if every linear combination $a_1 X_1 + \dots + a_n X_n$ has a normal distribution. Let $X = (X_1, \dots, X_n)^T$, $a = (a_1, \dots, a_n)^T$, $\mu = (m_1, \dots, m_n)^T$. Then $X$ is $n$-dimensional normal if and only if $a^T X \sim N(a^T \mu, a^T \Sigma a)$ for all $a = (a_1, \dots, a_n)^T$.
• Each marginal distribution of an $n$-dimensional normal distribution is one-dimensional normal, because $X_k = 0 \cdot X_1 + \dots + 1 \cdot X_k + \dots + 0 \cdot X_n$.
• The other way around is not true: $X$ and $Y$ can each be normal without being jointly normal.
• Example: $X \sim N(0, 1)$, $Y = X(2B - 1)$, so $Y \sim N(0, 1)$, where $B$ is Bernoulli(1/2).
• $Y + X = \begin{cases} 2X & \text{if } B = 1 \\ 0 & \text{if } B = 0 \end{cases}$ is not normally distributed!
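A quick simulation of this counterexample (a sketch, not from the slides): $X$ and $Y$ are each standard normal, yet $X + Y$ has a point mass at 0 and therefore cannot be normal.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

X = rng.normal(size=n)
B = rng.integers(0, 2, size=n)        # Bernoulli(1/2)
Y = X * (2 * B - 1)                   # Y is also N(0, 1) marginally

S = X + Y                             # equals 2X when B = 1 and 0 when B = 0
print(np.std(X), np.std(Y))           # both close to 1
print(np.mean(S == 0))                # about 0.5: a point mass at 0, so S is not normal
```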
Multivariate Gaussian Probability Density Function
• Let $\Sigma = (\sigma_{ij})$ be the covariance matrix, where $\sigma_{ij} = C(X_i, X_j)$.
• Assuming $\Sigma$ is non-singular, we have the following multivariate Gaussian probability density function:
• $p(X = x) = \frac{1}{(2\pi)^{n/2} \sqrt{\det(\Sigma)}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$
• The distribution is said to be non-singular.
• The density is constant on the ellipsoids $(x - \mu)^T \Sigma^{-1} (x - \mu) = C$ in $\mathbb{R}^n$.
• The density function of an $n$-dimensional normal distribution is uniquely determined by the expectations and covariances.
Conditioning in the Bivariate Normal Distribution
• Let's consider the previous simple example. Let $X$ and $Y$ have a bivariate normal distribution with expectations $m_X$ and $m_Y$, variances $\sigma_X^2$ and $\sigma_Y^2$, and covariance $C(X, Y) = \sigma_{XY}$. Let $\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$ be the correlation coefficient. Assume the covariance matrix is non-singular.
• The observed data is $Y$ and we want to predict $X$.
• So the conditional density function for $X$ given $Y = y$ is
• $f_{X \mid Y = y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$
Cont.
• The marginal distribution of $Y$ is normal (by the characterization). So $f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma_Y} \exp\!\left(-\frac{1}{2\sigma_Y^2}(y - m_Y)^2\right)$.
• The covariance matrix is $\Sigma = \begin{pmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{XY} & \sigma_Y^2 \end{pmatrix}$.
• Because $\Sigma$ is non-singular, $\Sigma^{-1} = \frac{1}{\det \Sigma} \begin{pmatrix} \sigma_Y^2 & -\sigma_{XY} \\ -\sigma_{XY} & \sigma_X^2 \end{pmatrix}$.
• $f_{X,Y}(x, y) = \frac{1}{2\pi \sqrt{\det \Sigma}} \exp\!\left(-\frac{1}{2} \left(\begin{pmatrix} x \\ y \end{pmatrix} - \mu\right)^T \Sigma^{-1} \left(\begin{pmatrix} x \\ y \end{pmatrix} - \mu\right)\right)$
• We want to compute $\frac{f_{X,Y}(x, y)}{f_Y(y)}$.
Cont.
• Non-exponential part: $\frac{1/(2\pi \sqrt{\det \Sigma})}{1/(\sqrt{2\pi}\,\sigma_Y)} = \frac{1}{\sqrt{2\pi}\, \sqrt{(\sigma_X^2 \sigma_Y^2 - \sigma_{XY}^2)/\sigma_Y^2}}$, and $\rho^2 = \frac{\sigma_{XY}^2}{\sigma_X^2 \sigma_Y^2}$
• $\Rightarrow \frac{1}{\sqrt{2\pi}\, \sqrt{(\sigma_X^2 \sigma_Y^2 - \rho^2 \sigma_X^2 \sigma_Y^2)/\sigma_Y^2}} = \frac{1}{\sqrt{2\pi}\, \sigma_X \sqrt{1 - \rho^2}}$
• So $\sigma_{X \mid Y = y}^2 = \sigma_X^2 (1 - \rho^2)$.
• The exponential part: $-\frac{1}{2} \left\{ \left(\begin{pmatrix} x \\ y \end{pmatrix} - \mu\right)^T \Sigma^{-1} \left(\begin{pmatrix} x \\ y \end{pmatrix} - \mu\right) - \frac{1}{\sigma_Y^2}(y - m_Y)^2 \right\}$
• $\Rightarrow -\frac{1}{2 \det \Sigma} \begin{pmatrix} x - m_X \\ y - m_Y \end{pmatrix}^T \begin{pmatrix} \sigma_Y^2 & -\sigma_{XY} \\ -\sigma_{XY} & \sigma_X^2 \end{pmatrix} \begin{pmatrix} x - m_X \\ y - m_Y \end{pmatrix} + \frac{1}{2\sigma_Y^2}(y - m_Y)^2$
Cont.
• $\Rightarrow -\frac{1}{2 \det \Sigma} \left\{ \sigma_Y^2 (x - m_X)^2 - \sigma_{XY}(x - m_X)(y - m_Y) - \sigma_{XY}(x - m_X)(y - m_Y) + \sigma_X^2 (y - m_Y)^2 \right\} + \frac{1}{2\sigma_Y^2}(y - m_Y)^2$
Cont.
• Collecting the $x$-dependent terms and completing the square:
• $(x - m_X)^2 - 2\frac{\rho \sigma_X}{\sigma_Y}(x - m_X)(y - m_Y) + \frac{\sigma_X^2 \rho^2}{\sigma_Y^2}(y - m_Y)^2$
• $= x^2 - 2\left(m_X + \frac{\rho \sigma_X}{\sigma_Y}(y - m_Y)\right) x + m_X^2 + 2\frac{\rho \sigma_X}{\sigma_Y} m_X (y - m_Y) + \frac{\sigma_X^2 \rho^2}{\sigma_Y^2}(y - m_Y)^2$
• $= \left(x - \left(m_X + \frac{\rho \sigma_X}{\sigma_Y}(y - m_Y)\right)\right)^2$
Cont.
• $f_{X \mid Y = y}(x) = \frac{1}{\sqrt{2\pi}\, \sigma_X \sqrt{1 - \rho^2}} \exp\!\left(-\frac{\left(x - \left(m_X + \frac{\rho \sigma_X}{\sigma_Y}(y - m_Y)\right)\right)^2}{2 \sigma_X^2 (1 - \rho^2)}\right)$
• So the conditional distribution is $N\!\left(m_X + \frac{\rho \sigma_X}{\sigma_Y}(y - m_Y),\ \sigma_X^2 (1 - \rho^2)\right)$.
• The mean depends on the observed value $y$.
• The variance is independent of the observed value and is a constant.
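A small Monte Carlo check of this result with assumed parameter values: sample from the bivariate normal, keep pairs whose $Y$ is near a fixed $y$, and compare the empirical conditional mean and variance with the formulas above.

```python
import numpy as np

rng = np.random.default_rng(4)
mX, mY, sX, sY, rho = 1.0, -2.0, 2.0, 0.5, 0.8          # assumed parameters
cov = np.array([[sX**2, rho*sX*sY], [rho*sX*sY, sY**2]])

XY = rng.multivariate_normal([mX, mY], cov, size=500_000)
y_obs = -1.5
mask = np.abs(XY[:, 1] - y_obs) < 0.01                   # condition on Y close to y_obs
x_cond = XY[mask, 0]

print(x_cond.mean(), mX + rho*sX/sY*(y_obs - mY))        # empirical vs. formula: conditional mean
print(x_cond.var(), sX**2 * (1 - rho**2))                # empirical vs. formula: conditional variance
```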
Multivariate Conditional Normal Distribution
• Let $z \in \mathbb{R}^n \sim N(\mu, \Sigma)$ and $\epsilon \in \mathbb{R}^n \sim N(0, \sigma^2 I)$, with $z$ and $\epsilon$ independent.
• Let $Y = z + \epsilon \Rightarrow Y \sim N(\mu, \Sigma + \sigma^2 I)$
• Split $Y$ into two parts using the indices $a = (1, \dots, l)$ and $b = (l + 1, \dots, n)$:
• $Y = \begin{pmatrix} y_a \\ y_b \end{pmatrix}$, $y_a = \begin{pmatrix} y_1 \\ \vdots \\ y_l \end{pmatrix}$, $y_b = \begin{pmatrix} y_{l+1} \\ \vdots \\ y_n \end{pmatrix}$
• Split $\mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}$
• Let $\Sigma + \sigma^2 I = C = \begin{pmatrix} C_{aa} & C_{ab} \\ C_{ba} & C_{bb} \end{pmatrix}$
• Let $\Sigma = \begin{pmatrix} K_{aa} & K_{ab} \\ K_{ba} & K_{bb} \end{pmatrix}$
Cont.
• We want to find $p(Y_a \mid Y_b = y_b)$.
• $Y_a \mid Y_b = y_b \sim N(m, D)$ where $m = \mu_a + C_{ab} C_{bb}^{-1} (y_b - \mu_b)$
• $m = \mu_a + K_{ab} (K_{bb} + \sigma^2 I)^{-1} (y_b - \mu_b)$ and
• $D = C_{aa} - C_{ab} C_{bb}^{-1} C_{ba} = K_{aa} + \sigma^2 I - K_{ab} (K_{bb} + \sigma^2 I)^{-1} K_{ba}$
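These formulas translate directly into NumPy; the sketch below (a hypothetical helper, not from the slides) uses `solve` instead of an explicit inverse for numerical stability, and relies on $K$ being symmetric so that $K_{ba} = K_{ab}^T$.

```python
import numpy as np

def conditional_gaussian(mu_a, mu_b, K_aa, K_ab, K_bb, y_b, sigma2):
    """Return m, D of p(Y_a | Y_b = y_b) for Y = z + eps, z ~ N(mu, K), eps ~ N(0, sigma2*I)."""
    C_bb = K_bb + sigma2 * np.eye(len(y_b))
    # m = mu_a + K_ab (K_bb + sigma^2 I)^{-1} (y_b - mu_b)
    m = mu_a + K_ab @ np.linalg.solve(C_bb, y_b - mu_b)
    # D = K_aa + sigma^2 I - K_ab (K_bb + sigma^2 I)^{-1} K_ba
    D = K_aa + sigma2 * np.eye(len(mu_a)) - K_ab @ np.linalg.solve(C_bb, K_ab.T)
    return m, D
```

With $K$ built from a kernel, the same function gives the GP regression prediction on the following slides; setting `sigma2 = 0` recovers the noiseless case.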
GP Regression Prediction
• Given $y_b$, we want to find the posterior of $y_a$ (the prediction).
• $(z_x) \sim GP(\mu, k)$ on $\mathbb{R}^d$
• $z_{x_i}$ is the random variable corresponding to the data point $x_i$.
• Let $Y_i = z_{x_i} + \epsilon_i$ where $\epsilon \sim N(0, \sigma^2 I)$.
• We want to compute $p(y_a \mid y_b)$!
• Let $\tilde{z} = (z_{x_1}, \dots, z_{x_n})$; then $Y = \tilde{z} + \epsilon$.
• We know $\tilde{z} \sim N(\tilde{\mu}, K)$ with $\tilde{\mu} = (\mu(x_1), \dots, \mu(x_n))^T$ and $K_{ij} = k(x_i, x_j)$, from the GP definition.
Cont.
• $\tilde{\mu} = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}$ and $K = \begin{pmatrix} K_{aa} & K_{ab} \\ K_{ba} & K_{bb} \end{pmatrix}$
• So we have
• $m = \mu_a + K_{ab} (K_{bb} + \sigma^2 I)^{-1} (y_b - \mu_b)$ and
• $D = K_{aa} + \sigma^2 I - K_{ab} (K_{bb} + \sigma^2 I)^{-1} K_{ba}$
A Regression Example
• Given 3 points $x_1, x_2, x_3$ and their corresponding targets $f_1, f_2, f_3$ where $f(x_i) = f_i$, we are modeling $f(x)$.
(Figure: $f_1, f_2, f_3$ plotted against $x_1, x_2, x_3$ on the $f(x)$ vs. $x$ axes.)
• Assume the $f_i$'s are drawn from a Gaussian distribution. So we have the following: $\begin{pmatrix} f_1 \\ f_2 \\ f_3 \end{pmatrix} \sim N\!\left(\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K_{11} & K_{12} & K_{13} \\ K_{21} & K_{22} & K_{23} \\ K_{31} & K_{32} & K_{33} \end{pmatrix}\right)$
Cont.
• Assuming nearby points are highly correlated and far-apart points are nearly independent, we might have $\begin{pmatrix} f_1 \\ f_2 \\ f_3 \end{pmatrix} \sim N\!\left(\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.9 & 0.01 \\ 0.9 & 1 & 0.02 \\ 0.01 & 0.02 & 1 \end{pmatrix}\right)$.
• To measure proximity, let $K_{ij} = e^{-\frac{1}{2}(x_i - x_j)^2}$ (with parameters chosen appropriately).
• Now, given a test point $x_*$, we want to find $f_*$, and we assume $f_* \sim N(0, K_{**})$ where $K_{**}$ is its variance.
• So we add $f_*$ to the 3-dimensional Gaussian distribution; we now have a 4-dimensional Gaussian.
Cont.
• With the test point, we have $\begin{pmatrix} f_1 \\ f_2 \\ f_3 \\ f_* \end{pmatrix} \sim N\!\left(\begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K_{11} & K_{12} & K_{13} & K_{1*} \\ K_{21} & K_{22} & K_{23} & K_{2*} \\ K_{31} & K_{32} & K_{33} & K_{3*} \\ K_{*1} & K_{*2} & K_{*3} & K_{**} \end{pmatrix}\right)$
• We put the predicted value $f_*$ at the end; that's fine. $C_{aa}$ and $C_{bb}$ are the same as before, and $C_{ab}$ is the rightmost column in this arrangement.
• Let $K_* = \begin{pmatrix} K_{1*} \\ K_{2*} \\ K_{3*} \end{pmatrix}$. The prediction is $f(x_*) = f_* = \mu_* = K_*^T K^{-1} f$.
• Also, the variance is $\sigma_*^2 = K_{**} - K_*^T K^{-1} K_*$.
Noiseless GP Regression
• Given a training set $D = \{(x_i, f_i), i = 1{:}N\}$ where $f_i = f(x_i)$ and a test set $X_*$ of $N_*$ points, we want to predict the function outputs $\mathbf{f}_*$.
• $\begin{pmatrix} \mathbf{f} \\ \mathbf{f}_* \end{pmatrix} \sim N\!\left(\begin{pmatrix} \boldsymbol{\mu} \\ \boldsymbol{\mu}_* \end{pmatrix}, \begin{pmatrix} \mathbf{K} & \mathbf{K}_* \\ \mathbf{K}_*^T & \mathbf{K}_{**} \end{pmatrix}\right)$ where $\mathbf{K}_{N \times N} = k(X, X)$, $\mathbf{K}_* = k(X, X_*)$, and $\mathbf{K}_{**} = k(X_*, X_*)$.
• $k(x, x') = \sigma_f^2\, e^{-\frac{1}{2 l^2}(x - x')^2}$
• $p(\mathbf{f}_* \mid X, X_*, \mathbf{f}) = N(\mathbf{f}_* \mid \boldsymbol{\mu}_*, \Sigma_*)$
• $\boldsymbol{\mu}_* = \boldsymbol{\mu}(X_*) + \mathbf{K}_*^T \mathbf{K}^{-1} (\mathbf{f} - \boldsymbol{\mu}(X))$
• $\Sigma_* = \mathbf{K}_{**} - \mathbf{K}_*^T \mathbf{K}^{-1} \mathbf{K}_*$
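Putting the pieces together, here is a minimal sketch of noiseless GP regression with the squared-exponential kernel above; the toy data, the hyperparameters $\sigma_f = l = 1$, the zero prior mean, and the diagonal jitter are all assumptions made for illustration.

```python
import numpy as np

def sqexp(XA, XB, sigma_f=1.0, l=1.0):
    """k(x, x') = sigma_f^2 * exp(-(x - x')^2 / (2 l^2)) for 1-D inputs."""
    return sigma_f**2 * np.exp(-0.5 * (XA[:, None] - XB[None, :])**2 / l**2)

# Toy training data (assumed): noiseless observations f_i = sin(x_i).
X = np.array([-3.0, -1.0, 0.0, 2.0])
f = np.sin(X)
Xs = np.linspace(-4, 4, 100)                      # test inputs X_*

K = sqexp(X, X) + 1e-9 * np.eye(len(X))           # K = k(X, X), jitter for numerical stability
Ks = sqexp(X, Xs)                                 # K_* = k(X, X_*)
Kss = sqexp(Xs, Xs)                               # K_** = k(X_*, X_*)

# mu_* = K_*^T K^{-1} f   and   Sigma_* = K_** - K_*^T K^{-1} K_*   (zero prior mean assumed)
alpha = np.linalg.solve(K, f)
mu_s = Ks.T @ alpha
Sigma_s = Kss - Ks.T @ np.linalg.solve(K, Ks)
print(mu_s[:5], np.diag(Sigma_s)[:5])             # predictive mean and variance at the first test points
```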
