Wilson2020 Part1
Andrew Gordon Wilson
https://cims.nyu.edu/~andrewgw
Courant Institute of Mathematical Sciences
Center for Data Science
New York University
Gaussian processes
Definition
A Gaussian process (GP) is a collection of random variables, any finite number of
which have a joint Gaussian distribution.
Nonparametric Regression Model
▶ Prior: $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$, meaning $(f(x_1), \ldots, f(x_N)) \sim \mathcal{N}(\mu, K)$,
with $\mu_i = m(x_i)$ and $K_{ij} = \mathrm{cov}(f(x_i), f(x_j)) = k(x_i, x_j)$.

$$\underbrace{p(f(x) \mid \mathcal{D})}_{\text{GP posterior}} \;\propto\; \underbrace{p(\mathcal{D} \mid f(x))}_{\text{Likelihood}} \;\underbrace{p(f(x))}_{\text{GP prior}}$$
[Figure: samples from the GP prior (left) and from the GP posterior after conditioning on data (right); outputs f(x) plotted against inputs x.]
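To make the prior-to-posterior picture concrete, here is a minimal numpy sketch (my own illustration, not from the slides): it draws functions from a zero-mean GP prior with an RBF kernel and conditions on a few made-up noisy observations. The kernel, data, and noise level are all assumed for illustration.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # k(x, x') = sigma^2 * exp(-(x - x')^2 / (2 * lengthscale^2))
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

x_star = np.linspace(-10, 10, 200)            # test inputs
x_obs = np.array([-4.0, 0.0, 3.0])            # observed inputs (made up)
y_obs = np.array([-1.0, 1.5, 0.5])            # observed outputs (made up)
noise = 1e-2

# Samples from the GP prior: f(x*) ~ N(0, K(x*, x*))
K_ss = rbf_kernel(x_star, x_star) + 1e-8 * np.eye(len(x_star))
prior_samples = np.random.multivariate_normal(np.zeros(len(x_star)), K_ss, size=3)

# GP posterior p(f(x*) | D) by Gaussian conditioning
K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
K_s = rbf_kernel(x_star, x_obs)
post_mean = K_s @ np.linalg.solve(K, y_obs)
post_cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
```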
Example: RBF Kernel
The Power of Non-Parametric Representations
[Figure: radial basis functions centered at $c_{i-1}$, $c_i$, $c_{i+1}$, whose weighted sum forms $f(x)$.]

$$f(x) = \sum_{i=1}^{J} w_i \phi_i(x)\,, \qquad w_i \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{J}\right), \qquad \phi_i(x) = \exp\!\left(-\frac{(x - c_i)^2}{2\ell^2}\right) \qquad (1)$$

$$\Rightarrow \quad k(x, x') = \frac{\sigma^2}{J} \sum_{i=1}^{J} \phi_i(x)\, \phi_i(x') \qquad (2)$$
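The finite basis construction above is easy to probe numerically. The sketch below (an illustration with made-up centers and hyperparameters, not from the slides) builds the kernel of Eq. (2) and shows that, as the centers become dense, it approaches a smooth RBF-like form.

```python
import numpy as np

def finite_basis_kernel(x1, x2, centers, sigma2=1.0, ell=1.0):
    # k(x, x') = (sigma^2 / J) * sum_i phi_i(x) phi_i(x'), as in Eq. (2)
    phi1 = np.exp(-(x1[:, None] - centers[None, :]) ** 2 / (2 * ell ** 2))
    phi2 = np.exp(-(x2[:, None] - centers[None, :]) ** 2 / (2 * ell ** 2))
    return (sigma2 / len(centers)) * phi1 @ phi2.T

x = np.linspace(-5, 5, 50)
for J in (5, 50, 500):
    centers = np.linspace(-10, 10, J)   # basis centers covering the input range
    K = finite_basis_kernel(x, x, centers)
    # As J grows, the normalized kernel approaches exp(-(x - x')^2 / (4 ell^2))
    # up to a constant factor, i.e. an RBF kernel with a rescaled lengthscale.
    print(J, np.round(K[0, :4] / K[0, 0], 3))
```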
Deriving the RBF Kernel
[Figure: radial basis functions centered at $c_{i-1}$, $c_i$, $c_{i+1}$, whose weighted sum forms $f(x)$.]

$$f(x) = \sum_{i=1}^{J} w_i \phi_i(x)\,, \qquad w_i \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{J}\right), \qquad \phi_i(x) = \exp\!\left(-\frac{(x - c_i)^2}{2\ell^2}\right) \qquad (3)$$

$$\Rightarrow \quad k(x, x') = \frac{\sigma^2}{J} \sum_{i=1}^{J} \phi_i(x)\, \phi_i(x') \qquad (4)$$
Deriving the RBF Kernel
$$f(x) = \sum_{i=1}^{J} w_i \phi_i(x)\,, \qquad w_i \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{J}\right), \qquad \phi_i(x) = \exp\!\left(-\frac{(x - c_i)^2}{2\ell^2}\right) \qquad (6)$$

$$\Rightarrow \quad k(x, x') = \frac{\sigma^2}{J} \sum_{i=1}^{J} \phi_i(x)\, \phi_i(x') \qquad (7)$$
[Figure: three panels of sample functions drawn from the model, output f(x) plotted against input x.]
Learning and Model Selection
$$p(\mathcal{M}_i \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathcal{M}_i)\, p(\mathcal{M}_i)}{p(\mathbf{y})} \qquad (11)$$
We can write the evidence of the model as
$$p(\mathbf{y} \mid \mathcal{M}_i) = \int p(\mathbf{y} \mid \mathbf{f}, \mathcal{M}_i)\, p(\mathbf{f})\, d\mathbf{f} \qquad (12)$$
[Figure: the evidence p(y|M) as a normalized distribution over all possible datasets y, alongside a plot of output f(x) against input x.]
Gaussian Processes for Machine Learning. Rasmussen, C.E. and Williams, C.K.I. MIT Press, 2006.
Bayesian Methods for Adaptive Models. MacKay, D.J. PhD Thesis, 1992.
Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes. Wilson, A.G. PhD Thesis, 2014.
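For GP regression with Gaussian noise, the evidence integral (12) is available in closed form, p(y | M) = N(y; 0, K_M + σ²I), so candidate models can be compared directly. A small illustrative sketch (the toy data, kernels, and noise level are assumptions, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf_kernel(x1, x2, ell=1.0, sigma2=1.0):
    return sigma2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell ** 2)

def linear_kernel(x1, x2, sigma2=1.0):
    return sigma2 * np.outer(x1, x2)

x = np.linspace(-3, 3, 20)
y = np.sin(x) + 0.1 * np.random.randn(len(x))   # toy data (made up)
noise = 0.1 ** 2

# With equal priors p(M_i), the posterior over models (11) follows the evidence
for name, K in [("RBF", rbf_kernel(x, x)), ("linear", linear_kernel(x, x))]:
    log_evidence = multivariate_normal(
        mean=np.zeros(len(x)), cov=K + noise * np.eye(len(x))).logpdf(y)
    print(name, log_evidence)
```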
Machine Learning for Econometrics
(The Start of My Journey...)
Machine Learning for Econometrics
Autoregressive Conditional Heteroscedasticity (ARCH)
2003 Nobel Prize in Economics
But what about the kernel?
Gaussian Processes and Neural Networks
Introduction to Gaussian processes. MacKay, D. J. In Bishop, C. M. (ed.), Neural Networks and Machine
Learning, Chapter 11, pp. 133-165. Springer-Verlag, 1998.
Gaussian Process Regression Networks
[Figure: left, the Gaussian process regression network architecture, in which inputs x feed latent GP node functions $\hat{f}_1(x), \ldots, \hat{f}_q(x)$ that are mixed by input-dependent weights $W_{11}(x), \ldots, W_{pq}(x)$ to produce outputs $y_1(x), \ldots, y_p(x)$; right, a predictive spatial field plotted over longitude and latitude.]
Gaussian process regression networks. Wilson, A.G., Knowles, D.A., Ghahramani, Z. ICML 2012.
The Inverse Wishart Process
Introduce a completely flexible Bayesian non-parametric distribution over kernels:

$$c \sim \mathcal{IWP}(\nu, k_\theta) \qquad (13)$$
$$y \mid c \sim \mathcal{GP}(\phi, (\nu - 2)\, c) \qquad (14)$$

Student-t Processes as Alternatives to Gaussian Processes. Shah, A., Wilson, A.G., Ghahramani, Z. AISTATS 2014.
Heteroscedasticity revisited...
[Figure: two modeling choices shown side by side, "Choice 1" and "Choice 2".]
Inductive Biases
▶ While the inverse Wishart process is flexible, its inductive biases are wrong.
▶ It is challenging to say anything about the covariance function of a stochastic process from a single draw if no assumptions are made.
▶ If we allow the covariance between any two points in the input space to arise from any positive definite function, with equal probability, then we gain essentially no information from a single realization.
▶ Most commonly one assumes a restriction to stationary kernels, meaning that covariances are invariant to translations in the input space.
Spectral Mixture Kernels for Pattern Discovery
Let $\tau = x - x' \in \mathbb{R}^P$. From Bochner's theorem,

$$k(\tau) = \int_{\mathbb{R}^P} S(s)\, e^{2\pi i s^\top \tau}\, ds \qquad (32)$$

Then

$$k(\tau) = \exp\{-2\pi^2 \tau^2 \sigma^2\} \cos(2\pi \tau \mu)\,. \qquad (34)$$

$$k(\tau) = \sum_{q=1}^{Q} w_q \cos(2\pi \tau^\top \mu_q) \prod_{p=1}^{P} \exp\{-2\pi^2 \tau_p^2\, v_q^{(p)}\}\,. \qquad (35)$$
Gaussian Process Kernels for Pattern Discovery and Extrapolation. Wilson, A.G. and Adams, R.P. ICML, 2013.
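For reference, a minimal one-dimensional implementation of the spectral mixture kernel of Eq. (35); the hyperparameter values below are made up purely for illustration.

```python
import numpy as np

def spectral_mixture_kernel(x1, x2, weights, means, variances):
    # 1-D SM kernel, Eq. (35):
    # k(tau) = sum_q w_q * cos(2 pi tau mu_q) * exp(-2 pi^2 tau^2 v_q)
    tau = x1[:, None] - x2[None, :]
    k = np.zeros_like(tau)
    for w, mu, v in zip(weights, means, variances):
        k += w * np.cos(2 * np.pi * tau * mu) * np.exp(-2 * np.pi ** 2 * tau ** 2 * v)
    return k

x = np.linspace(0, 10, 100)
K = spectral_mixture_kernel(x, x, weights=[1.0, 0.5],
                            means=[0.1, 0.35], variances=[0.01, 0.02])
```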
GP Model for Pattern Extrapolation
▶ Observations $y(x) \sim \mathcal{N}(y(x); f(x), \sigma^2)$ (can easily be relaxed).
▶ $f(x) \sim \mathcal{GP}(0, k_{\mathrm{SM}}(x, x' \mid \theta))$ ($f(x)$ is a GP with SM kernel).
▶ $k_{\mathrm{SM}}(x, x' \mid \theta)$ can approximate many different kernels with different settings of its hyperparameters $\theta$.
▶ Learning involves training these hyperparameters through maximum marginal likelihood optimization (using BFGS); a sketch follows the equation below.

$$\log p(\mathbf{y} \mid \theta, X) = \underbrace{-\tfrac{1}{2} \mathbf{y}^\top (K_\theta + \sigma^2 I)^{-1} \mathbf{y}}_{\text{model fit}} \; \underbrace{-\, \tfrac{1}{2} \log |K_\theta + \sigma^2 I|}_{\text{complexity penalty}} \; -\, \tfrac{N}{2} \log(2\pi)\,. \qquad (36)$$
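A minimal sketch of this training procedure (my own illustration): it maximizes Eq. (36) with scipy's L-BFGS-B, using a plain RBF kernel in place of the SM kernel for brevity, on made-up data.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, x, y):
    ell, sf2, sn2 = np.exp(log_params)   # lengthscale, signal var, noise var
    K = sf2 * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell ** 2)
    L = np.linalg.cholesky(K + sn2 * np.eye(len(x)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # Negative of Eq. (36): model fit + complexity penalty + constant
    return (0.5 * y @ alpha + np.sum(np.log(np.diag(L)))
            + 0.5 * len(x) * np.log(2 * np.pi))

x = np.linspace(0, 10, 50)
y = np.sin(x) + 0.2 * np.random.randn(len(x))        # toy data (made up)
result = minimize(neg_log_marginal_likelihood, x0=np.zeros(3),
                  args=(x, y), method="L-BFGS-B")
lengthscale, signal_var, noise_var = np.exp(result.x)
```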
Function Learning Example
[Figure: airline passengers (thousands) plotted against year, 1949–1961; training data shown.]
Function Learning Example
[Figure: the airline passenger training data (thousands) against year, together with a hypothesized human extrapolation ("Human?").]
Results of Representation Learning
[Figure: airline passenger data with the training region, the withheld test data, a hypothesized human extrapolation ("Human?"), and the learned extrapolation ("Alien?").]
Results, Airline Passengers
[Figure: the learned spectral mixture (SM) spectral density plotted against frequency (1/month) for the airline data.]
Results, CO2
[Figure: left, CO2 concentration (ppm) against year (1968–2004) with training data, the SM extrapolation, and a 95% credible region (CR); right, learned spectral densities for the SM and SE kernels against frequency (1/month).]
What do we need for large scale pattern extrapolation?
[Figure: image extrapolation panels comparing Train, Test, and Full images with reconstructions from GPatt, SSGP, FITC, and GP-MA.]
Fast kernel learning for multidimensional pattern extrapolation. Wilson et al., NeurIPS 2014.
More Patterns
Scalable Exact Gaussian Processes
Exact Gaussian Processes on a Million Data Points
Deep Kernel Learning
Deep kernel learning combines the inductive biases of deep learning architectures
with the non-parametric flexibility of Gaussian processes.
[Figure: the deep kernel learning architecture. Inputs $x_1, \ldots, x_D$ pass through hidden layers $h^{(1)}, \ldots, h^{(L)}$ with weights $W^{(1)}, \ldots, W^{(L)}$; the final hidden representation feeds an infinite basis function layer $h_\infty(\theta)$ (the GP), producing outputs $y_1, \ldots, y_P$.]
Deep Kernel Learning. Wilson, A.G., Hu, Z., Salakhutdinov, R., Xing, E.P. AISTATS, 2016
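A minimal numpy sketch of the idea, not the authors' implementation: a small network g(x; w) extracts features and a base RBF kernel is applied to those features, so the network weights become kernel hyperparameters. The architecture, dimensions, and initialization below are illustrative assumptions; in practice all parameters are learned jointly by maximizing the marginal likelihood (36).

```python
import numpy as np

def mlp_features(x, W1, b1, W2, b2):
    # A small feature extractor g(x; w): two tanh layers (illustrative only)
    h = np.tanh(x @ W1 + b1)
    return np.tanh(h @ W2 + b2)

def rbf_kernel(z1, z2, ell=1.0, sigma2=1.0):
    d2 = ((z1[:, None, :] - z2[None, :, :]) ** 2).sum(-1)
    return sigma2 * np.exp(-0.5 * d2 / ell ** 2)

def deep_kernel(x1, x2, params):
    # k_deep(x, x') = k_base(g(x; w), g(x'; w)): the base kernel acts on learned
    # features, combining the network's inductive biases with the GP's flexibility
    return rbf_kernel(mlp_features(x1, *params), mlp_features(x2, *params))

rng = np.random.default_rng(0)
D, H, F = 5, 16, 2                                   # input, hidden, feature dims
params = (rng.normal(size=(D, H)), np.zeros(H),
          rng.normal(size=(H, F)), np.zeros(F))
x = rng.normal(size=(10, D))
K = deep_kernel(x, x, params)                        # a valid Gram matrix on inputs
```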
Face Orientation Extraction
Figure: Top: Randomly sampled examples of the training and test data. Bottom: The
two dimensional outputs of the convolutional network on a set of test cases. Each
point is shown using a line segment that has the same orientation as the input face.
Learning Flexible Non-Euclidean Similarity Metrics
Figure: Left: The induced covariance matrix using DKL-SM (spectral mixture)
kernel on a set of test cases, where the test samples are ordered according to the
orientations of the input faces. Middle: The respective covariance matrix using
DKL-RBF kernel. Right: The respective covariance matrix using regular RBF
kernel. The models are trained with n = 12, 000.
Step Function
Figure: Recovering a step function. We show the predictive mean and 95% of the
predictive probability mass for regular GPs with RBF and SM kernels, and DKL with
SM base kernel.
Deep Kernel Learning for Autonomous Driving
[Figure: top, driving trajectories plotted as north (mi) against east (mi), colored by speed; bottom, two rows of panels showing front distance (m) against side distance (m).]
▶ $b$ is a bias, $v_i$ are the hidden-to-output weights, $h$ is any bounded hidden unit transfer function, $u_i$ are the input-to-hidden weights, and $J$ is the number of hidden units. Let $b$ and $v_i$ be independent with zero mean and variances $\sigma_b^2$ and $\sigma_v^2 / J$, respectively, and let the $u_i$ have independent identical distributions. Collecting all free parameters into the weight vector $\mathbf{w}$,

$$\mathbb{E}_{\mathbf{w}}[f(x)] = 0\,, \qquad (38)$$

$$\mathrm{cov}[f(x), f(x')] = \mathbb{E}_{\mathbf{w}}[f(x)\, f(x')] = \sigma_b^2 + \frac{\sigma_v^2}{J} \sum_{i=1}^{J} \mathbb{E}_{\mathbf{u}}[h_i(x; u_i)\, h_i(x'; u_i)]\,, \qquad (39)$$

$$= \sigma_b^2 + \sigma_v^2\, \mathbb{E}_{\mathbf{u}}[h(x; u)\, h(x'; u)]\,. \qquad (40)$$

We can show any collection of values $f(x_1), \ldots, f(x_N)$ must have a joint Gaussian distribution using the central limit theorem.
Bayesian Learning for Neural Networks. Neal, R. Springer, 1996.
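This limit is easy to probe empirically. The sketch below (my own illustration, with assumed prior variances) samples many finite but wide networks with erf hidden units and estimates the covariance of f at two inputs, which should approach the form of Eq. (40).

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
J, n_nets = 2000, 2000                        # hidden units per network, networks sampled
sigma_b2, sigma_v2, sigma_u = 1.0, 1.0, 2.0   # assumed prior variances (illustrative)

x = np.array([0.5, -1.0])                     # two inputs at which to compare f
x_tilde = np.stack([np.ones_like(x), x])      # augmented inputs (1, x)

# f(x) = b + sum_i v_i * erf(u_i . (1, x)), with priors as described above
f = np.zeros((n_nets, 2))
for n in range(n_nets):
    b = rng.normal(0.0, np.sqrt(sigma_b2))
    v = rng.normal(0.0, np.sqrt(sigma_v2 / J), size=J)
    u = rng.normal(0.0, sigma_u, size=(J, 2))
    f[n] = b + v @ erf(u @ x_tilde)

print(np.cov(f.T))   # approaches sigma_b^2 + sigma_v^2 * E_u[h(x; u) h(x'; u)]
```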
Neural Network Kernel
$$f(x) = b + \sum_{i=1}^{J} v_i\, h(x; u_i)\,. \qquad (41)$$

▶ Let $h(x; u) = \mathrm{erf}(u_0 + \sum_{j=1}^{P} u_j x_j)$, where $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2} dt$
▶ Choose $u \sim \mathcal{N}(0, \Sigma)$

Then we obtain

$$k_{\mathrm{NN}}(x, x') = \frac{2}{\pi} \sin^{-1}\!\left(\frac{2\tilde{x}^\top \Sigma \tilde{x}'}{\sqrt{(1 + 2\tilde{x}^\top \Sigma \tilde{x})(1 + 2\tilde{x}'^\top \Sigma \tilde{x}')}}\right), \qquad (42)$$

where $\tilde{x} = (1, x_1, \ldots, x_P)^\top$ is the augmented input vector.
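For reference, a small numpy implementation of Eq. (42) for one-dimensional inputs, using the augmented input x̃ = (1, x); the Σ below is an illustrative choice.

```python
import numpy as np

def nn_kernel(x1, x2, Sigma):
    # Neural network (arcsine) kernel of Eq. (42) for 1-D inputs
    x1t = np.stack([np.ones_like(x1), x1], axis=1)   # augmented inputs (1, x)
    x2t = np.stack([np.ones_like(x2), x2], axis=1)
    cross = 2.0 * x1t @ Sigma @ x2t.T
    d1 = 1.0 + 2.0 * np.sum((x1t @ Sigma) * x1t, axis=1)
    d2 = 1.0 + 2.0 * np.sum((x2t @ Sigma) * x2t, axis=1)
    return (2.0 / np.pi) * np.arcsin(cross / np.sqrt(np.outer(d1, d2)))

x = np.linspace(-5, 5, 200)
Sigma = np.diag([1.0, 10.0])              # Sigma = diag(sigma_0^2, sigma^2)
K = nn_kernel(x, x, Sigma)
samples = np.random.multivariate_normal(
    np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
```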
Neural Network Kernel

$$k_{\mathrm{NN}}(x, x') = \frac{2}{\pi} \sin^{-1}\!\left(\frac{2\tilde{x}^\top \Sigma \tilde{x}'}{\sqrt{(1 + 2\tilde{x}^\top \Sigma \tilde{x})(1 + 2\tilde{x}'^\top \Sigma \tilde{x}')}}\right) \qquad (43)$$

Set $\Sigma = \mathrm{diag}(\sigma_0^2, \sigma^2)$. Draws from a GP with a neural network kernel with varying $\sigma$:

[Figure: GP sample draws for several values of $\sigma$.]

Gaussian Processes for Machine Learning. Rasmussen, C.E. and Williams, C.K.I. MIT Press, 2006.
Neural Tangent Kernels
▶ Recent work [e.g., 1, 2, 3] derives neural tangent kernels from infinite neural network limits, with promising results.
▶ Closely related to Radford Neal's [4] result showing a Bayesian neural network with infinitely many hidden units converges to a Gaussian process.
▶ Note that most kernels from infinite neural network limits have a fixed structure. On the other hand, standard neural networks essentially learn a similarity metric (kernel) for the data. Learning a kernel amounts to representation learning. Bridging this gap is interesting future work.
[1] Neural tangent kernel: convergence and generalization in neural networks. Jacot et al., NeurIPS 2018.
[2] On exact computation with an infinitely wide neural net. Arora et al., NeurIPS 2019.
[3] Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks. Arora et al., arXiv 2019.
[4] Bayesian Learning for Neural Networks. Neal, R. Springer, 1996.
Functional Kernel Learning
See, e.g.,
[1] Function-Space Distributions over Kernels.
Benton, G., Maddox, W. J., Salkey, J., Wilson, A. G. NeurIPS, 2019.
[2] Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes.
Wilson, 2014.
[3] Bayesian Nonparametric Spectral Estimation. Tobar, F. NeurIPS, 2018.
Functional Kernel Learning (Model Specification)
Results (Sinc)
Texture Extrapolation
Multi-Task Learning