Introduction to Gaussian Processes

Neil D. Lawrence

GPSS
14th September 2015
Outline

The Gaussian Density

Covariance from Basis Functions


The Gaussian Density

- Perhaps the most common probability density.

  p(y|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right) = \mathcal{N}(y|\mu, \sigma^2)

- The Gaussian density.


Gaussian Density

Figure: The Gaussian PDF p(h|µ, σ²) with µ = 1.7 and variance σ² = 0.0225, plotted against height h/m. The mean is shown as a red line. It could represent the heights of a population of students.
Gaussian Density

\mathcal{N}(y|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)

σ² is the variance of the density and µ is the mean.
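
A minimal NumPy sketch (not part of the original slides) evaluating this density for the height example and checking that it normalises:

```python
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    """Evaluate the univariate Gaussian density N(y | mu, sigma2)."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Height example from the slides: mu = 1.7 m, sigma^2 = 0.0225 m^2.
h = np.linspace(0.0, 2.5, 2001)
p = gaussian_pdf(h, mu=1.7, sigma2=0.0225)
print(np.trapz(p, h))  # should be ~1.0: the density integrates to one
```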
Two Important Gaussian Properties

Sum of Gaussians

- Sum of Gaussian variables is also Gaussian.

  y_i \sim \mathcal{N}(\mu_i, \sigma_i^2)

  And the sum is distributed as

  \sum_{i=1}^{n} y_i \sim \mathcal{N}\left(\sum_{i=1}^{n} \mu_i, \sum_{i=1}^{n} \sigma_i^2\right)

  (Aside: as the number of terms in the sum increases, the sum of non-Gaussian, finite variance variables is also Gaussian [central limit theorem].)
Two Important Gaussian Properties

Scaling a Gaussian

- Scaling a Gaussian leads to a Gaussian.

  y \sim \mathcal{N}(\mu, \sigma^2)

  And the scaled density is distributed as

  wy \sim \mathcal{N}(w\mu, w^2\sigma^2)
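
Both properties are easy to confirm by Monte Carlo; a small sketch (illustrative numbers, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
sigma2 = np.array([0.5, 1.5, 2.0])

# Sum of independent Gaussians: means and variances add.
samples = rng.normal(mu, np.sqrt(sigma2), size=(100000, 3)).sum(axis=1)
print(samples.mean(), samples.var())   # close to -0.5 and 4.0

# Scaling: w*y has mean w*mu and variance w^2 * sigma^2.
w, y = 3.0, rng.normal(1.0, np.sqrt(0.5), size=100000)
print((w * y).mean(), (w * y).var())   # close to 3.0 and 4.5
```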
Linear Function

Figure: A linear regression between x and y, showing the data points and the best fit line.
Regression Examples

- Predict a real value, y_i, given some inputs x_i.
- Predict quality of meat given spectral measurements (Tecator data).
- Radiocarbon dating, the C14 calibration curve: predict age given quantity of C14 isotope.
- Predict quality of different Go or Backgammon moves given expert rated training data.
y = mx + c

Figure: The straight line y = mx + c plotted for x from 0 to 5, with the slope m and the intercept c annotated on the graph.
y = mx + c

point 1: x = 1, y = 3
3 = m + c
point 2: x = 3, y = 1
1 = 3m + c
point 3: x = 2, y = 2.5
2.5 = 2m + c
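
A quick worked check (an added step, not on the original slide): points 1 and 2 alone give m = −1 and c = 4, but substituting into point 3 gives 2 × (−1) + 4 = 2 ≠ 2.5. No single straight line passes exactly through all three points, which is what motivates the noise model y = mx + c + ε introduced shortly.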
A Philosophical Essay on Probabilities (Laplace), page 6, as shown on the slide:

"… height: 'The day will come when, by study pursued through several ages, the things now concealed will appear with evidence; and posterity will be astonished that truths so clear had escaped us.' Clairaut then undertook to submit to analysis the perturbations which the comet had experienced by the action of the two great planets, Jupiter and Saturn; after immense calculations he fixed its next passage at the perihelion toward the beginning of April, 1759, which was actually verified by observation. The regularity which astronomy shows us in the movements of the comets doubtless exists also in all phenomena.

The curve described by a simple molecule of air or vapor is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance.

Probability is relative, in part to this ignorance, in part to our knowledge. We know that of three or a greater number of events a single one ought to occur; but nothing induces us to believe that one of them will occur rather than the others. In this state of indecision it is impossible for us to announce their occurrence with certainty. It is, however, probable that one of these events, chosen at will, will not occur because we see several cases equally possible which exclude its occurrence, while only a single one favors it.

The theory of chance consists in reducing all the events of the same kind to a certain number of cases equally possible, that is to say, to such as we may be equally undecided about in regard to their existence, and in determining the number of cases favorable to the event whose probability is sought."
y = mx + c + ε

point 1: x = 1, y = 3
3 = m + c + \epsilon_1
point 2: x = 3, y = 1
1 = 3m + c + \epsilon_2
point 3: x = 2, y = 2.5
2.5 = 2m + c + \epsilon_3
Underdetermined System

What about two unknowns and one observation?

y_1 = m x_1 + c

Figure: a single data point, plotted with x from 0 to 3.
Underdetermined System

Can compute m given c:

m = \frac{y_1 - c}{x_1}

Different choices of the intercept give different slopes, e.g.

c = 1.75 ⇒ m = 1.25
c = −0.777 ⇒ m = 3.78
c = −4.01 ⇒ m = 7.01
c = −0.718 ⇒ m = 3.72
c = 2.45 ⇒ m = 0.545
c = −0.657 ⇒ m = 3.66
c = −3.13 ⇒ m = 6.13
c = −1.47 ⇒ m = 4.47

Assume

c \sim \mathcal{N}(0, 4),

and we find a distribution of solutions.

Figure: the single data point with the lines implied by different sampled values of c.
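
A minimal numerical sketch of this sampling step; the data point (x_1, y_1) = (1, 3) is inferred from the c, m pairs above, so treat it as an assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
x1, y1 = 1.0, 3.0                 # single observation (consistent with the slide's c, m pairs)

c = rng.normal(0.0, 2.0, size=8)  # c ~ N(0, 4), i.e. standard deviation 2
m = (y1 - c) / x1                 # each sampled intercept fixes a slope
for ci, mi in zip(c, m):
    print(f"c = {ci:6.3f}  =>  m = {mi:6.3f}")
```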
Probability for Under- and Overdetermined

- To deal with the overdetermined system we introduced a probability distribution for the 'variable', ε_i.
- For the underdetermined system we introduced a probability distribution for the 'parameter', c.
- This is known as a Bayesian treatment.
Multivariate Prior Distributions

- For general Bayesian inference we need multivariate priors.
- E.g. for multivariate linear regression:

  y_i = \sum_{j} w_j x_{i,j} + \epsilon_i = \mathbf{w}^\top \mathbf{x}_{i,:} + \epsilon_i

  (where we've dropped c for convenience), we need a prior over w.
- This motivates a multivariate Gaussian density.
- We will use the multivariate Gaussian to put a prior directly on the function (a Gaussian process).
Prior Distribution

- Bayesian inference requires a prior on the parameters.
- The prior represents your belief, before you see the data, about the likely value of the parameters.
- For linear regression, consider a Gaussian prior on the intercept:

  c \sim \mathcal{N}(0, \alpha_1)
Posterior Distribution

- The posterior distribution is found by combining the prior with the likelihood.
- The posterior distribution is your belief, after you see the data, about the likely value of the parameters.
- The posterior is found through Bayes' Rule

  p(c|y) = \frac{p(y|c)\,p(c)}{p(y)}
Bayes Update

p(c) = \mathcal{N}(c|0, \alpha_1)

p(y|m, c, x, \sigma^2) = \mathcal{N}(y|mx + c, \sigma^2)

p(c|y, m, x, \sigma^2) = \mathcal{N}\left(c \,\middle|\, \frac{y - mx}{1 + \sigma^2/\alpha_1},\; (\sigma^{-2} + \alpha_1^{-1})^{-1}\right)

Figure: A Gaussian prior combines with a Gaussian likelihood for a Gaussian posterior (densities plotted against c).
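
A tiny sketch of this update in code; the specific numbers are made up for illustration:

```python
import numpy as np

def posterior_c(y, m, x, sigma2, alpha1):
    """Posterior over the intercept c for one observation y = m*x + c + noise."""
    var = 1.0 / (1.0 / sigma2 + 1.0 / alpha1)     # (sigma^-2 + alpha_1^-1)^-1
    mean = (y - m * x) / (1.0 + sigma2 / alpha1)  # shrunk towards the prior mean 0
    return mean, var

print(posterior_c(y=3.0, m=1.0, x=1.0, sigma2=0.25, alpha1=2.0))
```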
Stages to Derivation of the Posterior

- Multiply likelihood by prior
  - they are "exponentiated quadratics", the answer is always also an exponentiated quadratic because exp(a²) exp(b²) = exp(a² + b²).
- Complete the square to get the resulting density in the form of a Gaussian.
- Recognise the mean and (co)variance of the Gaussian. This is the estimate of the posterior.
Multivariate Regression Likelihood

- Noise corrupted data point

  y_i = \mathbf{w}^\top \mathbf{x}_{i,:} + \epsilon_i

- Multivariate regression likelihood:

  p(\mathbf{y}|\mathbf{X}, \mathbf{w}) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i - \mathbf{w}^\top \mathbf{x}_{i,:}\right)^2\right)

- Now use a multivariate Gaussian prior:

  p(\mathbf{w}) = \frac{1}{(2\pi\alpha)^{p/2}} \exp\left(-\frac{1}{2\alpha} \mathbf{w}^\top \mathbf{w}\right)
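
Combining this likelihood and prior gives the standard conjugate Gaussian posterior over w; a short sketch with synthetic data (assumed numbers, not from the slides):

```python
import numpy as np

def posterior_w(X, y, sigma2, alpha):
    """Gaussian posterior over w for the Gaussian likelihood and prior above."""
    p = X.shape[1]
    cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / alpha)
    mean = cov @ X.T @ y / sigma2
    return mean, cov

rng = np.random.default_rng(2)
w_true = np.array([0.5, -1.0])
X = rng.normal(size=(50, 2))
y = X @ w_true + rng.normal(scale=0.1, size=50)
mean, cov = posterior_w(X, y, sigma2=0.01, alpha=1.0)
print(mean)   # close to w_true; cov carries the remaining uncertainty
```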
Two Dimensional Gaussian

- Consider height, h/m, and weight, w/kg.
- Could sample height from a distribution:

  h \sim \mathcal{N}(1.7, 0.0225)

- And similarly weight:

  w \sim \mathcal{N}(75, 36)


Height and Weight Models

Figure: Gaussian distributions, p(h) and p(w), for height and weight.


Sampling Two Dimensional Variables

Figure: Samples of height and weight. The joint distribution is shown in the centre, with the marginal distributions p(h) and p(w) on the axes (h/m and w/kg).


Independence Assumption

- This assumes height and weight are independent.

  p(h, w) = p(h)p(w)

- In reality they are dependent (body mass index = w/h²).
Sampling Two Dimensional Variables

Figure: Samples of height and weight from the joint distribution, with the marginal distributions p(h) and p(w) shown on the axes (h/m and w/kg).
Independent Gaussians

p(w, h) = p(w)p(h)

p(w, h) = \frac{1}{\sqrt{2\pi\sigma_1^2}\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{1}{2}\left(\frac{(w-\mu_1)^2}{\sigma_1^2} + \frac{(h-\mu_2)^2}{\sigma_2^2}\right)\right)

p(w, h) = \frac{1}{\sqrt{2\pi\sigma_1^2}\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{1}{2}\left(\begin{bmatrix}w\\h\end{bmatrix} - \begin{bmatrix}\mu_1\\\mu_2\end{bmatrix}\right)^\top \begin{bmatrix}\sigma_1^2 & 0\\0 & \sigma_2^2\end{bmatrix}^{-1} \left(\begin{bmatrix}w\\h\end{bmatrix} - \begin{bmatrix}\mu_1\\\mu_2\end{bmatrix}\right)\right)

p(\mathbf{y}) = \frac{1}{\left|2\pi\mathbf{D}\right|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^\top \mathbf{D}^{-1}(\mathbf{y}-\boldsymbol{\mu})\right)
Correlated Gaussian

Form correlated from original by rotating the data space using matrix R.

p(\mathbf{y}) = \frac{1}{\left|2\pi\mathbf{D}\right|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{R}^\top\mathbf{y}-\mathbf{R}^\top\boldsymbol{\mu})^\top \mathbf{D}^{-1}(\mathbf{R}^\top\mathbf{y}-\mathbf{R}^\top\boldsymbol{\mu})\right)

p(\mathbf{y}) = \frac{1}{\left|2\pi\mathbf{D}\right|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^\top \mathbf{R}\mathbf{D}^{-1}\mathbf{R}^\top(\mathbf{y}-\boldsymbol{\mu})\right)

this gives a covariance matrix:

\mathbf{C}^{-1} = \mathbf{R}\mathbf{D}^{-1}\mathbf{R}^\top

p(\mathbf{y}) = \frac{1}{\left|2\pi\mathbf{C}\right|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^\top \mathbf{C}^{-1}(\mathbf{y}-\boldsymbol{\mu})\right)

this gives a covariance matrix:

\mathbf{C} = \mathbf{R}\mathbf{D}\mathbf{R}^\top
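
A small sketch of the rotation construction; the rotation angle and the idea of rotating height/weight axes are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

D = np.diag([0.0225, 36.0])                 # independent variances (height, weight)
theta = 0.3                                 # assumed rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

C = R @ D @ R.T                             # correlated covariance, C = R D R^T
mu = np.array([1.7, 75.0])
print(C)
print(rng.multivariate_normal(mu, C, size=5))  # correlated samples
```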
Recall Univariate Gaussian Properties

1. Sum of Gaussian variables is also Gaussian.

   y_i \sim \mathcal{N}(\mu_i, \sigma_i^2)

   \sum_{i=1}^{n} y_i \sim \mathcal{N}\left(\sum_{i=1}^{n} \mu_i, \sum_{i=1}^{n} \sigma_i^2\right)

2. Scaling a Gaussian leads to a Gaussian.

   y \sim \mathcal{N}(\mu, \sigma^2)

   wy \sim \mathcal{N}(w\mu, w^2\sigma^2)
Multivariate Consequence

- If

  \mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})

- And

  \mathbf{y} = \mathbf{W}\mathbf{x}

- Then

  \mathbf{y} \sim \mathcal{N}(\mathbf{W}\boldsymbol{\mu}, \mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^\top)
Sampling a Function

Multi-variate Gaussians

- We will consider a Gaussian with a particular structure of covariance matrix.
- Generate a single sample from this 25 dimensional Gaussian distribution, f = [f_1, f_2, ..., f_25]^T.
- We will plot these points against their index (see the sketch below).
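
A sketch of the sampling step; the exponentiated quadratic covariance and its hyperparameters are assumptions here (the slides only construct this kernel explicitly later on):

```python
import numpy as np

rng = np.random.default_rng(4)

x = np.linspace(-1.0, 1.0, 25)          # 25 equally spaced inputs
alpha, ell = 1.0, 0.3                   # assumed kernel hyperparameters
K = alpha * np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell ** 2))

# One sample from N(0, K); a small jitter keeps the Cholesky factor stable.
f = np.linalg.cholesky(K + 1e-10 * np.eye(25)) @ rng.standard_normal(25)
print(np.round(f, 2))                   # plot f against its index to reproduce the figure
```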


Gaussian Distribution Sample

Figure: A sample from a 25 dimensional Gaussian distribution. (a) A 25 dimensional correlated random variable (values f_i plotted against index i). (b) Colormap showing correlations between dimensions.


Gaussian Distribution Sample

Figure: A sample from a 25 dimensional Gaussian distribution. (a) A 25 dimensional correlated random variable (values plotted against index). (b) The correlation between f_1 and f_2: the corresponding 2×2 correlation matrix has 1 on the diagonal and 0.96587 off the diagonal.


Prediction of f2 from f1

Figure: A single contour of the joint Gaussian density p(f_1, f_2); the correlation between f_1 and f_2 is 0.96587.

- The single contour of the Gaussian density represents the joint distribution, p(f_1, f_2).
- We observe that f_1 = −0.313.
- Conditional density: p(f_2 | f_1 = −0.313).
Prediction with Correlated Gaussians

- Prediction of f_2 from f_1 requires the conditional density.
- The conditional density is also Gaussian.

  p(f_2|f_1) = \mathcal{N}\left(f_2 \,\middle|\, \frac{k_{1,2}}{k_{1,1}} f_1,\; k_{2,2} - \frac{k_{1,2}^2}{k_{1,1}}\right)

  where the covariance of the joint density is given by

  \mathbf{K} = \begin{bmatrix} k_{1,1} & k_{1,2} \\ k_{2,1} & k_{2,2} \end{bmatrix}
Prediction of f5 from f1

Figure: A single contour of the joint Gaussian density p(f_1, f_5); the correlation between f_1 and f_5 is 0.57375.

- The single contour of the Gaussian density represents the joint distribution, p(f_1, f_5).
- We observe that f_1 = −0.313.
- Conditional density: p(f_5 | f_1 = −0.313).
Prediction with Correlated Gaussians

- Prediction of f_* from f requires the multivariate conditional density.
- The multivariate conditional density is also Gaussian,

  p(\mathbf{f}_*|\mathbf{f}) = \mathcal{N}\left(\mathbf{f}_*|\boldsymbol{\mu}, \boldsymbol{\Sigma}\right)

  \boldsymbol{\mu} = \mathbf{K}_{*,\mathbf{f}} \mathbf{K}_{\mathbf{f},\mathbf{f}}^{-1} \mathbf{f}

  \boldsymbol{\Sigma} = \mathbf{K}_{*,*} - \mathbf{K}_{*,\mathbf{f}} \mathbf{K}_{\mathbf{f},\mathbf{f}}^{-1} \mathbf{K}_{\mathbf{f},*}

- Here the covariance of the joint density is given by

  \mathbf{K} = \begin{bmatrix} \mathbf{K}_{\mathbf{f},\mathbf{f}} & \mathbf{K}_{\mathbf{f},*} \\ \mathbf{K}_{*,\mathbf{f}} & \mathbf{K}_{*,*} \end{bmatrix}
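
A minimal sketch of this conditioning formula, using the correlation 0.96587 and the observation f_1 = −0.313 from the earlier slides:

```python
import numpy as np

def gp_conditional(K_ff, K_sf, K_ss, f):
    """Mean and covariance of p(f_* | f) for a zero-mean joint Gaussian."""
    A = np.linalg.solve(K_ff, K_sf.T).T   # K_{*,f} K_{f,f}^{-1}, via a solve rather than an explicit inverse
    mu = A @ f
    Sigma = K_ss - A @ K_sf.T
    return mu, Sigma

K = np.array([[1.0, 0.96587],
              [0.96587, 1.0]])
mu, Sigma = gp_conditional(K[:1, :1], K[1:, :1], K[1:, 1:], f=np.array([-0.313]))
print(mu, Sigma)   # conditional mean and variance of f_2 given f_1 = -0.313
```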
Covariance Functions
Where did this covariance matrix come from?

Exponentiated Quadratic Kernel Function (RBF, Squared Exponential, Gaussian)

k(\mathbf{x}, \mathbf{x}') = \alpha \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|_2^2}{2\ell^2}\right)

- Covariance matrix is built using the inputs to the function x.
- For the example above it was based on Euclidean distance.
- The covariance function is also known as a kernel.

Figure: functions sampled from a Gaussian process with this covariance, plotted over inputs from −1 to 1.
Covariance Functions
Where did this covariance matrix come from?

k(x_i, x_j) = \alpha \exp\left(-\frac{\|x_i - x_j\|^2}{2\ell^2}\right)

For x_1 = −3.0, x_2 = 1.20 and x_3 = 1.40 with ℓ = 2.00 and α = 1.00, each entry is computed in turn, for example

k_{2,1} = 1.00 \times \exp\left(-\frac{(1.20 - (-3.0))^2}{2\times 2.00^2}\right) = 0.110,

and filling in the matrix entry by entry gives

\mathbf{K} = \begin{bmatrix} 1.00 & 0.110 & 0.0889 \\ 0.110 & 1.00 & 0.995 \\ 0.0889 & 0.995 & 1.00 \end{bmatrix}
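
The covariance matrices on this and the next two slides can be reproduced with a few lines of NumPy (a sketch, not part of the original deck):

```python
import numpy as np

def eq_cov(x, alpha, ell):
    """Exponentiated quadratic covariance matrix for 1-D inputs x."""
    sq_dist = (x[:, None] - x[None, :]) ** 2
    return alpha * np.exp(-sq_dist / (2 * ell ** 2))

x = np.array([-3.0, 1.2, 1.4])
print(eq_cov(x, alpha=1.0, ell=2.0).round(3))                   # 0.110, 0.089, 0.995 entries
print(eq_cov(np.append(x, 2.0), alpha=1.0, ell=2.0).round(2))   # with x_4 = 2.0 added
print(eq_cov(x, alpha=4.0, ell=5.0).round(2))                   # larger alpha and lengthscale
```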


Covariance Functions
Where did this covariance matrix come from?

k(x_i, x_j) = \alpha \exp\left(-\frac{\|x_i - x_j\|^2}{2\ell^2}\right)

Adding a fourth input, x_4 = 2.0, to x_1 = −3, x_2 = 1.2 and x_3 = 1.4 with ℓ = 2.0 and α = 1.0, e.g.

k_{4,1} = 1.0 \times \exp\left(-\frac{(2.0 - (-3))^2}{2\times 2.0^2}\right) = 0.044,

gives the covariance matrix

\mathbf{K} = \begin{bmatrix} 1.0 & 0.11 & 0.089 & 0.044 \\ 0.11 & 1.0 & 1.0 & 0.92 \\ 0.089 & 1.0 & 1.0 & 0.96 \\ 0.044 & 0.92 & 0.96 & 1.0 \end{bmatrix}


Covariance Functions
Where did this covariance matrix come from?

k(x_i, x_j) = \alpha \exp\left(-\frac{\|x_i - x_j\|^2}{2\ell^2}\right)

With a larger variance and lengthscale, α = 4.00 and ℓ = 5.00, the inputs x_1 = −3.0, x_2 = 1.20 and x_3 = 1.40 give

\mathbf{K} = \begin{bmatrix} 4.00 & 2.81 & 2.72 \\ 2.81 & 4.00 & 4.00 \\ 2.72 & 4.00 & 4.00 \end{bmatrix}


Outline

The Gaussian Density

Covariance from Basis Functions


Basis Function Form

Radial basis functions commonly have the form

\phi_k(x_i) = \exp\left(-\frac{\|x_i - \mu_k\|^2}{2\ell^2}\right).

- The basis function maps data into a "feature space" in which a linear sum is a non-linear function of the inputs.

Figure: A set of radial basis functions φ(x) with width ℓ = 2 and location parameters µ = [−4 0 4]^T, plotted for x from −8 to 8.
Basis Function Representations

- Represent a function by a linear sum over a basis,

  f(\mathbf{x}_{i,:}; \mathbf{w}) = \sum_{k=1}^{m} w_k \phi_k(\mathbf{x}_{i,:}),    (1)

- Here: m basis functions, φ_k(·) is the kth basis function and w = [w_1, ..., w_m]^T.
- For the standard linear model: φ_k(x_{i,:}) = x_{i,k}.
Random Functions

Functions derived using

f(x) = \sum_{k=1}^{m} w_k \phi_k(x),

where the elements of w are independently sampled from a Gaussian density,

w_k \sim \mathcal{N}(0, \alpha).

Figure: Functions sampled using the basis set from the previous figure. Each line is a separate sample, generated by a weighted sum of the basis set. The weights, w, are sampled from a Gaussian density with variance α = 1.
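
A short sketch of this sampling procedure, using the basis locations µ = [−4, 0, 4]^T and width ℓ = 2 from the earlier slide:

```python
import numpy as np

rng = np.random.default_rng(5)

mu_centres = np.array([-4.0, 0.0, 4.0])   # basis locations
ell, alpha = 2.0, 1.0

def Phi(x):
    """RBF basis matrix: one column per basis function."""
    return np.exp(-(x[:, None] - mu_centres[None, :]) ** 2 / (2 * ell ** 2))

x = np.linspace(-8.0, 8.0, 200)
w = rng.normal(0.0, np.sqrt(alpha), size=(3, 5))   # five independent weight vectors
F = Phi(x) @ w                                     # each column is one sampled function
print(F.shape)                                     # (200, 5): plot the columns against x
```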
Direct Construction of Covariance Matrix

Use matrix notation to write the function,

f(x_i; \mathbf{w}) = \sum_{k=1}^{m} w_k \phi_k(x_i),

computed at the training data. This gives a vector

\mathbf{f} = \boldsymbol{\Phi}\mathbf{w}, \qquad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha\mathbf{I}).

- w and f are only related by an inner product.
- Φ ∈ ℝ^{n×p} is a design matrix.
- Φ is fixed and non-stochastic for a given training set.
- f is Gaussian distributed.
Expectations

- We have

  \langle\mathbf{f}\rangle = \boldsymbol{\Phi}\langle\mathbf{w}\rangle.

- Prior mean of w was zero giving

  \langle\mathbf{f}\rangle = \mathbf{0}.

- Prior covariance of f is

  \mathbf{K} = \langle\mathbf{f}\mathbf{f}^\top\rangle - \langle\mathbf{f}\rangle\langle\mathbf{f}\rangle^\top

  \langle\mathbf{f}\mathbf{f}^\top\rangle = \boldsymbol{\Phi}\langle\mathbf{w}\mathbf{w}^\top\rangle\boldsymbol{\Phi}^\top,

  giving

  \mathbf{K} = \alpha\boldsymbol{\Phi}\boldsymbol{\Phi}^\top.

We use \langle\cdot\rangle to denote expectations under prior distributions.
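
A quick numerical check of K = αΦΦ^T against the empirical covariance of many sampled f vectors (a sketch, with the same assumed basis as above):

```python
import numpy as np

rng = np.random.default_rng(6)

mu_centres = np.array([-4.0, 0.0, 4.0])
ell, alpha = 2.0, 1.0
x = np.linspace(-8.0, 8.0, 10)
Phi = np.exp(-(x[:, None] - mu_centres[None, :]) ** 2 / (2 * ell ** 2))

K_analytic = alpha * Phi @ Phi.T                  # K = alpha * Phi Phi^T

W = rng.normal(0.0, np.sqrt(alpha), size=(3, 200000))
F = Phi @ W                                       # many sampled function vectors
K_empirical = np.cov(F, bias=True)                # empirical covariance across samples

print(np.abs(K_analytic - K_empirical).max())     # small: the two covariances agree
```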


Covariance between Two Points

- The prior covariance between two points x_i and x_j is

  k(x_i, x_j) = \alpha\, \boldsymbol{\phi}_:(x_i)^\top \boldsymbol{\phi}_:(x_j),

  or in sum notation

  k(x_i, x_j) = \alpha \sum_{k=1}^{m} \phi_k(x_i)\, \phi_k(x_j)

- For the radial basis used this gives

  k(x_i, x_j) = \alpha \sum_{k=1}^{m} \exp\left(-\frac{\|x_i - \mu_k\|^2 + \|x_j - \mu_k\|^2}{2\ell^2}\right).
Covariance Functions

RBF Basis Functions

k(x, x') = \alpha\, \boldsymbol{\phi}(x)^\top \boldsymbol{\phi}(x')

\phi_k(x) = \exp\left(-\frac{\|x - \mu_k\|_2^2}{\ell^2}\right)

\boldsymbol{\mu} = \begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix}

Figure: the resulting covariance function plotted over inputs from −3 to 3.
Selecting Number and Location of Basis

- Need to choose
  1. location of centers
  2. number of basis functions
  Restrict analysis to 1-D input, x.
- Consider uniform spacing over a region:

  k(x_i, x_j) = \alpha\, \phi_k(x_i)^\top \phi_k(x_j)
              = \alpha \sum_{k=1}^{m} \phi_k(x_i)\phi_k(x_j)
              = \alpha \sum_{k=1}^{m} \exp\left(-\frac{(x_i - \mu_k)^2}{2\ell^2}\right)\exp\left(-\frac{(x_j - \mu_k)^2}{2\ell^2}\right)
              = \alpha \sum_{k=1}^{m} \exp\left(-\frac{(x_i - \mu_k)^2}{2\ell^2} - \frac{(x_j - \mu_k)^2}{2\ell^2}\right)
              = \alpha \sum_{k=1}^{m} \exp\left(-\frac{x_i^2 + x_j^2 - 2\mu_k(x_i + x_j) + 2\mu_k^2}{2\ell^2}\right),
Uniform Basis Functions

- Set each center location to

  \mu_k = a + \Delta\mu \cdot (k - 1).

- Specify the basis functions in terms of their indices,

  k(x_i, x_j) = \alpha' \Delta\mu \sum_{k=1}^{m} \exp\left(-\frac{x_i^2 + x_j^2}{2\ell^2} - \frac{2\left(a + \Delta\mu\cdot(k-1)\right)^2 - 2\left(a + \Delta\mu\cdot(k-1)\right)\left(x_i + x_j\right)}{2\ell^2}\right).

- Here we've scaled the variance of the process by ∆µ.
Infinite Basis Functions

- Take

  \mu_1 = a \quad\text{and}\quad \mu_m = b \quad\text{so}\quad b = a + \Delta\mu\cdot(m-1)

- This implies

  b - a = \Delta\mu(m - 1)

  and therefore

  m = \frac{b - a}{\Delta\mu} + 1

- Take the limit as ∆µ → 0 so m → ∞:

  k(x_i, x_j) = \alpha' \int_a^b \exp\left(-\frac{x_i^2 + x_j^2}{2\ell^2} + \frac{\frac{1}{2}\left(x_i + x_j\right)^2 - 2\left(\mu - \frac{1}{2}\left(x_i + x_j\right)\right)^2}{2\ell^2}\right)\mathrm{d}\mu,

  where we have used a + k·∆µ → µ.
Result

- Performing the integration leads to

  k(x_i, x_j) = \alpha'\sqrt{\pi\ell^2} \exp\left(-\frac{(x_i - x_j)^2}{4\ell^2}\right) \times \frac{1}{2}\left[\mathrm{erf}\left(\frac{b - \frac{1}{2}(x_i + x_j)}{\ell}\right) - \mathrm{erf}\left(\frac{a - \frac{1}{2}(x_i + x_j)}{\ell}\right)\right],

- Now take the limit as a → −∞ and b → ∞

  k(x_i, x_j) = \alpha \exp\left(-\frac{(x_i - x_j)^2}{4\ell^2}\right),

  where \alpha = \alpha'\sqrt{\pi\ell^2}.
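
A numerical sanity check (an addition, not from the slides) that the scaled finite sum over densely, uniformly spaced RBF basis functions approaches this erf-weighted form:

```python
import numpy as np
from scipy.special import erf

ell, alpha_prime = 2.0, 1.0
a, b, dmu = -20.0, 20.0, 0.01            # wide interval, densely spaced centres
mu = np.arange(a, b + dmu, dmu)

def phi(x):
    """RBF basis functions evaluated at a scalar input x."""
    return np.exp(-(x - mu) ** 2 / (2 * ell ** 2))

xi, xj = -0.5, 1.3
k_sum = alpha_prime * dmu * phi(xi) @ phi(xj)        # alpha' * dmu * sum_k phi_k(xi) phi_k(xj)
mid = 0.5 * (xi + xj)
k_erf = (alpha_prime * np.sqrt(np.pi * ell ** 2)
         * np.exp(-(xi - xj) ** 2 / (4 * ell ** 2))
         * 0.5 * (erf((b - mid) / ell) - erf((a - mid) / ell)))
print(k_sum, k_erf)                                  # the two values agree closely
```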
Infinite Feature Space

- An RBF model with infinite basis functions is a Gaussian process.
- The covariance function is the exponentiated quadratic:

  k(x_i, x_j) = \alpha \exp\left(-\frac{(x_i - x_j)^2}{4\ell^2}\right).

- Note: the functional forms of the covariance function and the basis functions are similar.
  - this is a special case,
  - in general they are very different.

Similar results can be obtained for multi-dimensional input models (Williams, 1998; Neal, 1996).
References I

P. S. Laplace. Essai philosophique sur les probabilités. Courcier, Paris, 2nd edition, 1814. Sixth edition of 1840 translated and reprinted (1951) as A Philosophical Essay on Probabilities, New York: Dover; fifth edition of 1825 reprinted 1986 with notes by Bernard Bru, Paris: Christian Bourgois Éditeur; translated by Andrew Dale (1995) as Philosophical Essay on Probabilities, New York: Springer-Verlag.

R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996. Lecture Notes in Statistics 118.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

C. K. I. Williams. Computation with infinite neural networks. Neural Computation, 10(5):1203–1216, 1998.
