
An Introduction to Gaussian Processes for Spatial Data (Predictions!)

Matthew Kupilik
College of Engineering Seminar Series

Nov 2016



Why?
When evidence is easier to come by than understanding

Image: http://xkcd.com/1425/, CC BY-NC 2.5; Flickr Park or Bird: http://parkorbird.flickr.com/
Data Missing A Model
◮ In many engineering applications we are confronted by the
need for a model. We apply our knowledge of the physics
involved to deduce a specific model form
◮ Often our knowledge of the underlying physical processes
is lousy or involves too many assumptions or variables we
cannot measure
◮ In these cases machine learning attempts to solve the
problem by learning relationships from existing data or
measurements



State of the Art
◮ Significant publicity on models applied to social systems and
commercial applications (recommendation engines, facial
recognition)
◮ Very flashy results from Deep Convolutional Neural
Networks [1]
◮ Such solutions are still far from commonplace in applied
engineering for a variety of reasons (data availability,
processing, knowledge)

Engineers are also somewhat loath to hand their analysis off to a
black box without having a good idea of how that box operates.



Gaussian Processes
GPs provide a machine learning approach where
uncertainty in the model is concretely available and
the model can be used to gain physical insight into
the process.

◮ A Gaussian Process is equivalent to a neural network with infinitely
many weights and a distribution assigned to each weight
◮ The two approaches are actually quite intertwined



What is a Gaussian Process?
◮ Formally a Gaussian Process is a collection of random
variables, any finite number of which have a joint
Gaussian distribution [2]
◮ With Gaussian Processes we seek to find the most likely
function that could have produced our data, using a Bayesian
approach



Distribution over Functions
A Gaussian process is completely specified by a mean and
covariance function:

m(x) = E[f(x)]
k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))]

Or more concisely:

f(x) ∼ GP(m(x), k(x, x′))
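
As a rough illustration of what a "distribution over functions" means, here is a minimal sketch (plain Matlab/Octave, no toolbox assumed) that draws sample functions from a zero-mean GP prior with a squared exponential covariance; the length scale is an arbitrary illustrative value:

x = linspace(0, 10, 200)';                          % input locations
ell = 1.0;                                          % length scale (illustrative)
K = exp(-bsxfun(@minus, x, x').^2 / (2*ell^2));     % covariance matrix k(x, x')
L = chol(K + 1e-10*eye(numel(x)), 'lower');         % small jitter for stability
f = L * randn(numel(x), 3);                         % three draws f(x) ~ GP(0, k)
plot(x, f);                                         % each column is one sample function

Each draw is one plausible function under the prior; conditioning on data (next slides) narrows this distribution.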



Distribution over Functions
The values of the unknown function themselves then follow a joint
Gaussian probability distribution, where the random variables are
the function values at specific inputs:

(f(x1), . . . , f(xn)) ∼ N(µ, K)

where the mean is µi = m(xi) and the covariance is Ki,j = cov(f(xi), f(xj)).
We write all the inputs and outputs as scalars here, but everything
expands in dimension, for example if we wanted the input space to
be x = [lat, long, time].



Applying Bayes
Given a dataset D = (y, x), predict the output at a new input
x∗. If we assign a prior and a likelihood, then we can condition
the prior on our observations and calculate a posterior:

p(f | x, y) ∝ p(y | x, f) p(f)

This is mathematically equivalent to drawing random functions from
the prior and rejecting those that don't agree with the data.
If we assume the prior and likelihood are Gaussian, there are
closed-form solutions to this conditioning.



Prediction
We can then predict the output (y∗) at an input location (x∗)
that is not in our training data set D:

p(y∗ | D, x∗) = ∫ p(y∗ | x, x∗, f) p(f | D) df
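
For a Gaussian likelihood this integral has the standard closed form from Rasmussen and Williams [2]. A minimal sketch in Matlab, where kfun stands in for whatever covariance function has been chosen and sn2 is the noise variance (both names are assumptions of this sketch):

K     = kfun(x, x);                          % n x n training covariance
Ks    = kfun(x, xs);                         % n x m train/test cross-covariance
kss   = kfun(xs, xs);                        % m x m test covariance
L     = chol(K + sn2*eye(size(K,1)), 'lower');
alpha = L' \ (L \ y);                        % (K + sn2*I)^{-1} y via Cholesky
mu    = Ks' * alpha;                         % predictive mean at xs
v     = L \ Ks;
s2    = diag(kss) - sum(v.^2, 1)' + sn2;     % predictive variance at xs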



Kernel Trick
◮ All the previous integrals have closed-form solutions only
when everything is Gaussian
◮ They deal only with inner products rather than the
high-dimensional feature space

For Gaussian Processes we need a prior and a likelihood; we will
express these inner products using a kernel, or covariance, function!



Covariance Function
The covariance function or kernel is extremely important. It
controls our assumptions about how the output is expected to
vary within the input space. One of the key components of
Gaussian Processes is that we need to choose an expressive,
valid kernel function.
There are many common examples, such as the squared exponential
covariance function (very popular in machine learning):

Kse(x − x′) = exp(−(x − x′)² / (2ℓ²))

The chosen covariance function often has many parameters θ
(hyperparameters) which we do not know!
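
As a sketch of how a covariance function is handled in practice, here is the squared exponential in the GPML toolbox [7]; function names assume GPML 3.x and the hyperparameter values are illustrative:

covfunc = {@covSEiso};                       % isotropic squared exponential
hyp.cov = log([2.0; 1.0]);                   % theta = [log(length scale); log(signal std)]
K  = feval(covfunc{:}, hyp.cov, x);          % training covariance matrix
Ks = feval(covfunc{:}, hyp.cov, x, xs);      % cross-covariance to test inputs xs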



Learning the Covariance Function
Since we do not know the values of the parameters that make
up the covariance function, we can fit these parameters to our
available data. It is common to maximize the marginal likelihood
using nonlinear conjugate gradient descent.

Marginal likelihood: the probability of observing the given data
under our prior:

log p(y | x, θ)
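
A minimal sketch of this step with GPML [7], assuming a Gaussian likelihood and a squared exponential covariance; the inference function name is the GPML 3.x one (newer releases call it infGaussLik) and the initial values are placeholders:

meanfunc = {@meanZero};  covfunc = {@covSEiso};  likfunc = {@likGauss};
hyp = struct('mean', [], 'cov', log([1; 1]), 'lik', log(0.1));    % initial theta
% minimize the negative log marginal likelihood with conjugate gradients
hyp = minimize(hyp, @gp, -100, @infExact, meanfunc, covfunc, likfunc, x, y);
nlml = gp(hyp, @infExact, meanfunc, covfunc, likfunc, x, y);      % -log p(y|x, theta)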



Complexity
A naive implementation has complexity O(n³) for learning and
inference and O(n²) for prediction. This complexity comes from
inverting the covariance matrix.

This has traditionally limited GP approaches to a few thousand
data points!



Spatial Gaussian Processes - Kriging
Moisture Content in soil, 265 data points




Spatial Gaussian Processes - Kriging
The function is estimated on a much finer grid of 128,841 points
using a squared exponential covariance function; the result is
compared to linear interpolation.

[Figure: predicted moisture content — left: Gaussian Process, right: Linear Interpolation]


Spatial Gaussian Processes - Kriging
Same data, isometric viewpoint; with kriging we have no
problem extrapolating outside the range.

[Figure: isometric view of the predicted surface — left: Gaussian Process, right: Linear Interpolation]



Learning and Prediction Example
Take a messy nonlinear example function. Draw 10 points
randomly from this function. Choose a squared exponential
as the covariance function and optimize the hyperparameters
using the ten known points. Perform prediction over the same
range.
[Figure: Output vs. Input over the training range]



Learning and Prediction Example
We can find a very nice fit when we have observations around
the predictions. However, our covariance function dies out as
the distance in time from the measurements grows, and the
prediction reverts to the mean. For example, here is the same
nonlinear function predicted into the future.
[Figure: Output vs. Input, predicting beyond the training range]



Engineering Required
For this low-dimensional example it is easy to see that I need to
build some periodicity into the covariance function. In higher-
dimensional spaces these are the sorts of patterns we wish to
discover.

Engineering input is required in the building of a covariance
function!



Spectral Mixture Kernels
There are many existing kernel functions, but few that are
highly expressive (able to recreate many types of structures or
patterns). Some recent results have looked at describing the
spectral structure of the covariance function [3]:

k(x − x′) = Σ_{q=1}^{Q} wq exp(−2π²(x − x′)² vq) cos(2π(x − x′) µq)
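
A minimal sketch implementing this one-dimensional kernel directly from the expression above, where tau = x − x′ and w, v, mu are length-Q vectors of weights, variances, and frequencies; GPML also provides a spectral mixture covariance (covSM) that plays the same role:

function K = sm_kernel(tau, w, v, mu)
  % spectral mixture kernel of Wilson and Adams [3], evaluated at lags tau
  K = zeros(size(tau));
  for q = 1:numel(w)
    K = K + w(q) * exp(-2*pi^2 * tau.^2 * v(q)) .* cos(2*pi * tau * mu(q));
  end
end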



Spectral Mixture Kernels
The previous example, same training range, with a 5-component
spectral mixture kernel. If the dynamics are not in the training
set, they won't be in the prediction!
[Figure: Output vs. Input, spectral mixture kernel prediction]



Spectral Mixture Kernels
The previous example, with a doubled training range and a
5-component spectral mixture kernel.
[Figure: Output vs. Input, doubled training range]



Some Recent Improvements
The computational complexity has been reduced significantly
through the use of grids, Kronecker products, and circulant
Toeplitz embeddings.
◮ O(PN^((P+1)/P)) computations and O(PN^(2/P)) storage for
hyperparameter fitting, for N datapoints and P input
dimensions [4]
◮ Together with Toeplitz block structure and kernel
interpolation, O(n) learning and O(n) prediction! [5]



Spatial and Temporal Data
GPs with spectral mixture kernels can be applied almost out of
the box to many standard spatiotemporal datasets.
◮ Build a covariance function that can express the dynamics
(hard), or choose a covariance function that can learn the
dynamics (easier)
◮ Initialize the hyperparameters for that covariance
function. If you chose the easier route above, this
step is now hard.
◮ Optimize the hyperparameters to best match your dataset
(bottleneck)
◮ Inference with your covariance function and training data
(build a covariance matrix)
◮ Predict away! (A minimal sketch of this pipeline follows below.)
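
A minimal sketch of these steps with GPML [7]. The component count, likelihood, and initial values are placeholders, initSMhypers is a hypothetical initialization helper (step 2 is whatever scheme you actually choose), and function names assume GPML 3.x:

Q = 5;                                        % number of spectral mixture components
covfunc  = {@covSM, Q};                       % step 1: an expressive covariance
hyp.cov  = initSMhypers(Q, x, y);             % step 2: initialization (hypothetical helper)
meanfunc = {@meanZero};  likfunc = {@likGauss};
hyp.mean = [];  hyp.lik = log(0.1);
hyp = minimize(hyp, @gp, -200, @infExact, meanfunc, covfunc, likfunc, x, y);   % step 3: fit
[mu, s2] = gp(hyp, @infExact, meanfunc, covfunc, likfunc, x, y, xs);           % steps 4-5: infer and predict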



Example
Publicly available African violence dataset [6]
◮ Output is the number of violent events per grid cell per
month for 1990–2009
◮ Approximately 42 possible input regressor variables (lat,
long, time, temperature, population, etc.)
◮ 814,490 measurements in a potentially 42-dimensional
space
◮ A decently large data set



Initial Hack Method
Initially, choose only space and time as the identification space;
perform identification using 40 months and predict the next
12. Negative binomial likelihood and a 15-component spectral
mixture kernel.
◮ Grid size: 63 × 70 × 40 = 176,400

◮ We are using almost none of the information available in
the data set!

Events = f(Lat, Long, Time)



Initial Hack Results
Prediction for month 41:

[Figure: predicted Violent Events for month 41 over the latitude/longitude grid]



Dimension Reduction
For inputs, choose latitude, longitude, population, a count of
nearby recent violent events, and time:

Events = f(Lat, Long, Population, Past Events, Time)

Principal component analysis is applied for dimension reduction.

Grid size: 300 × 100 × 39 = 1,170,000!
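
A sketch of this step in Matlab (zscore and pca are in the Statistics Toolbox); X here stands for the n × 5 matrix of regressors listed above, and keeping three components matches the setup on the next slide:

Xz = zscore(X);                  % standardize each regressor column
[coeff, score] = pca(Xz);        % principal component decomposition
Xr = score(:, 1:3);              % keep the first three principal components
% Xr then replaces the raw regressors as the GP input space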



PCA Results
Three dimensions, negative binomial likelihood, and a 15-component
spectral mixture kernel. Prediction for month 41:
[Figure: Violent Events, predicted for month 41]

Average error per month is approximately 27 events over the
year's worth of prediction.
Societal Impact
Machine learning is being applied to models for which we have
massive amounts of data and almost no knowledge of the
underlying processes (violence, voting, music popularity, etc).
These social systems are seeing the impacts of machine
learning first because of the massive amounts of data available,
which give us the ability to find patterns.



Engineering Impact
The modeling impact on the physical sciences is still
nascent but will only increase. It is already happening in many
applications (civil, electrical, and mechanical) including traffic
prediction, erosion, weather, climate modeling, electric power
demand, and more.
It will be important to combine our ability to spot patterns
with our knowledge of the physics to create truly useful models
for society.



Additional Resources
◮ Open Source Toolboxes in R, Matlab, Python, C++,
Octave
◮ http://www.gaussianprocess.org/ for lots of tools
◮ All of the examples in this presentation made use of the
GPML library for Matlab [7]



Questions



References
[1] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[2] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[3] A. Wilson and R. Adams. Gaussian process kernels for pattern discovery and extrapolation. In Proceedings of the 30th International Conference on Machine Learning, 28(3):1067–1075, 2013.
[4] Seth R. Flaxman, Andrew Gordon Wilson, Daniel B. Neill, Hannes Nickisch, and Alexander J. Smola. Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 607–616, 2015.
[5] Andrew Gordon Wilson, Christoph Dann, and Hannes Nickisch. Thoughts on massively scalable Gaussian processes. Preprint, pages 1–25, 2015.
[6] Clionadh Raleigh, Andrew Linke, Håvard Hegre, and Joakim Karlsen. Introducing ACLED: An armed conflict location and event dataset special data feature. Journal of Peace Research, 47(5):651–660, 2010.
[7] Carl Edward Rasmussen and Hannes Nickisch. Gaussian processes for machine learning (GPML) toolbox. Journal of Machine Learning Research, 11:3011–3015, December 2010.

