
An Introduction to Gaussian Processes for Spatial Data (Predictions!)

Matthew Kupilik
College of Engineering Seminar Series

Nov 2016



Why?
When evidence is easier to come by than understanding

Image: http://xkcd.com/1425/, CC BY-NC 2.5; Flickr Park or Bird: http://parkorbird.flickr.com/
Data Missing A Model
◮ In many engineering applications we are confronted by the
need for a model. We apply our knowledge of the physics
involved to deduce a specific model form
◮ Often our knowledge of the underlying physical processes
is lousy or involves too many assumptions or variables we
cannot measure
◮ In these cases machine learning attempts to solve the
problem by learning relationships from existing data or
measurements



State of the Art
◮ Significant publicity on models applied to social systems and
commercial applications (recommendation engines, facial
recognition)
◮ Very flashy results from Deep Convolutional Neural
Networks [1]
◮ Such solutions are still far from commonplace in applied
engineering for a variety of reasons (data availability,
processing, knowledge)

Engineers are also somewhat loath to hand their analysis off to a
black box without having a good idea of how that box operates.



Gaussian Processes
GPs provide a machine learning approach where
uncertainty in the model is concretely available and
the model can be used to gain physical insight into
the process.

◮ A Gaussian Process is equivalent to a neural network with infinitely
many weights and a distribution assigned to each weight
◮ The two approaches are actually quite intertwined



What is a Gaussian Process?
◮ Formally a Gaussian Process is a collection of random
variables, any finite number of which have a joint
Gaussian distribution [2]
◮ With Gaussian Processes we seek to find the most likely
function that could have produced our data, using a Bayesian
approach



Distribution over Functions
A Gaussian process is completely specified by a mean and
covariance function:

m(x) = E[f(x)]
k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))]

Or more concisely:

f(x) ∼ GP(m(x), k(x, x′))
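
As a rough illustration of what a "distribution over functions" means, here is a minimal sketch (plain Matlab/Octave, no toolbox assumed) that draws sample functions from a zero-mean GP prior with a squared exponential covariance; the length scale is an arbitrary illustrative value:

x = linspace(0, 10, 200)';                          % input locations
ell = 1.0;                                          % length scale (illustrative)
K = exp(-bsxfun(@minus, x, x').^2 / (2*ell^2));     % covariance matrix k(x, x')
L = chol(K + 1e-10*eye(numel(x)), 'lower');         % small jitter for stability
f = L * randn(numel(x), 3);                         % three draws f(x) ~ GP(0, k)
plot(x, f);                                         % each column is one sample function

Each draw is one plausible function under the prior; conditioning on data (next slides) narrows this distribution.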



Distribution over Functions
The values of the unknown function themselves then follow a joint
Gaussian probability distribution, where the random variables are
the function values at specific inputs:

(f(x1), . . . , f(xn)) ∼ N(µ, K)

where the mean is µi = m(xi) and the covariance is Ki,j = cov(f(xi), f(xj)).
We write all the inputs and outputs as scalars here, but everything
expands in dimension, for example if we wanted the input space to
be x = [lat, long, time].



Applying Bayes
Given a dataset D = (y, x), predict the output at a new input
x∗. If we assign a prior and a likelihood, then we can condition
the prior on our observations and calculate a posterior:

p(f | x, y) ∝ p(y | x, f) p(f)

This is mathematically equivalent to drawing random functions from
the prior and rejecting those that don't agree with the data.
If we assume the prior and likelihood are Gaussian, there are
closed-form solutions to this conditioning.



Prediction
We can then predict the output (y∗) at an input location (x∗)
that is not in our training data set D:

p(y∗ | D, x∗) = ∫ p(y∗ | x, x∗, f) p(f | D) df
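
For a Gaussian likelihood this integral has the standard closed form from Rasmussen and Williams [2]. A minimal sketch in Matlab, where kfun stands in for whatever covariance function has been chosen and sn2 is the noise variance (both names are assumptions of this sketch):

K     = kfun(x, x);                          % n x n training covariance
Ks    = kfun(x, xs);                         % n x m train/test cross-covariance
kss   = kfun(xs, xs);                        % m x m test covariance
L     = chol(K + sn2*eye(size(K,1)), 'lower');
alpha = L' \ (L \ y);                        % (K + sn2*I)^{-1} y via Cholesky
mu    = Ks' * alpha;                         % predictive mean at xs
v     = L \ Ks;
s2    = diag(kss) - sum(v.^2, 1)' + sn2;     % predictive variance at xs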



Kernel Trick
◮ All the previous integrals have closed-form solutions only
when everything is Gaussian
◮ They deal only with inner products rather than the
high-dimensional feature space

For Gaussian Processes we need a prior and a likelihood; we will
express these inner products using a kernel, or covariance, function!



Covariance Function
The covariance function or kernel is extremely important. It
controls our assumptions about how the output is expected to
vary within the input space. One of the key components of
Gaussian Processes is that we need to choose an expressive,
valid kernel function.
There are many common examples, such as the squared exponential
covariance function (very popular in machine learning):

Kse(x − x′) = exp(−(x − x′)² / (2ℓ²))

The chosen covariance function often has many parameters θ
(hyperparameters) which we do not know!
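
As a sketch of how a covariance function is handled in practice, here is the squared exponential in the GPML toolbox [7]; function names assume GPML 3.x and the hyperparameter values are illustrative:

covfunc = {@covSEiso};                       % isotropic squared exponential
hyp.cov = log([2.0; 1.0]);                   % theta = [log(length scale); log(signal std)]
K  = feval(covfunc{:}, hyp.cov, x);          % training covariance matrix
Ks = feval(covfunc{:}, hyp.cov, x, xs);      % cross-covariance to test inputs xs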



Learning the Covariance Function
Since we do not know the values of the parameters that make
up the covariance function, we can fit these parameters to our
available data. It is common to maximize the marginal likelihood
using nonlinear conjugate gradient descent.

Marginal likelihood: the probability of observing the given data
under our prior:

log p(y | x, θ)
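
A minimal sketch of this step with GPML [7], assuming a Gaussian likelihood and a squared exponential covariance; the inference function name is the GPML 3.x one (newer releases call it infGaussLik) and the initial values are placeholders:

meanfunc = {@meanZero};  covfunc = {@covSEiso};  likfunc = {@likGauss};
hyp = struct('mean', [], 'cov', log([1; 1]), 'lik', log(0.1));    % initial theta
% minimize the negative log marginal likelihood with conjugate gradients
hyp = minimize(hyp, @gp, -100, @infExact, meanfunc, covfunc, likfunc, x, y);
nlml = gp(hyp, @infExact, meanfunc, covfunc, likfunc, x, y);      % -log p(y|x, theta)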



Complexity
A naive implementation has complexity O(n³) for learning and
inference and O(n²) for prediction. This complexity comes from
inverting the covariance matrix.

This has traditionally limited GP approaches to a few thousand
data points!



Spatial Gaussian Processes - Kriging
Moisture Content in soil, 265 data points




Spatial Gaussian Processes - Kriging
The function is estimated on a much finer grid of 128,841 points
using a squared exponential covariance function; the result is
compared to linear interpolation.

[Figure: predicted moisture content — left: Gaussian Process, right: Linear Interpolation]


Spatial Gaussian Processes - Kriging
Same data, isometric viewpoint; with kriging we have no
problem extrapolating outside the range.

[Figure: isometric view of the predicted surface — left: Gaussian Process, right: Linear Interpolation]



Learning and Prediction Example
Take a messy nonlinear example function. Draw 10 points
randomly from this function. Choose a squared exponential
as the covariance function and optimize the hyperparameters
using the ten known points. Perform prediction over the same
range.
[Figure: Output vs. Input over the training range]



Learning and Prediction Example
We can find a very nice fit when we have observations around
the predictions. However, our covariance function dies out as
the distance in time from the measurements grows, and the
prediction reverts to the mean. For example, here is the same
nonlinear function predicted into the future.
[Figure: Output vs. Input, predicting beyond the training range]



Engineering Required
For this low-dimensional example it is easy to see that I need to
build some periodicity into the covariance function. In higher-
dimensional spaces these are the sorts of patterns we wish to
discover.

Engineering input is required in the building of a covariance
function!



Spectral Mixture Kernels
There are many existing kernel functions, but few that are
highly expressive (able to recreate many types of structures or
patterns). Some recent results have looked at describing the
spectral structure of the covariance function [3]:

k(x − x′) = Σ_{q=1}^{Q} wq exp(−2π²(x − x′)² vq) cos(2π(x − x′) µq)
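
A minimal sketch implementing this one-dimensional kernel directly from the expression above, where tau = x − x′ and w, v, mu are length-Q vectors of weights, variances, and frequencies; GPML also provides a spectral mixture covariance (covSM) that plays the same role:

function K = sm_kernel(tau, w, v, mu)
  % spectral mixture kernel of Wilson and Adams [3], evaluated at lags tau
  K = zeros(size(tau));
  for q = 1:numel(w)
    K = K + w(q) * exp(-2*pi^2 * tau.^2 * v(q)) .* cos(2*pi * tau * mu(q));
  end
end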



Spectral Mixture Kernels
The previous example, same training range, with a 5-component
spectral mixture kernel. If the dynamics are not in the training
set, they won't be in the prediction!
[Figure: Output vs. Input, spectral mixture kernel prediction]



Spectral Mixture Kernels
The previous example, with a doubled training range and a
5-component spectral mixture kernel.
[Figure: Output vs. Input, doubled training range]



Some Recent Improvements
The computational complexity has been reduced significantly
through the use of grids, Kronecker products, and circulant
Toeplitz embeddings.
◮ O(PN^((P+1)/P)) computations and O(PN^(2/P)) storage for
hyperparameter fitting, for N datapoints and P input
dimensions [4]
◮ Together with Toeplitz block structure and kernel
interpolation, O(n) learning and O(n) prediction! [5]



Spatial and Temporal Data
GPs with spectral mixture kernels can be applied almost out of
the box to many standard spatiotemporal datasets.
◮ Build a covariance function that can express the dynamics
(hard), or choose a covariance function that can learn the
dynamics (easier)
◮ Initialize the hyperparameters for that covariance
function. If you chose the easier route above, this
step is now hard.
◮ Optimize the hyperparameters to best match your dataset
(bottleneck)
◮ Inference with your covariance function and training data
(build a covariance matrix)
◮ Predict away! (A minimal sketch of this pipeline follows below.)
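
A minimal sketch of these steps with GPML [7]. The component count, likelihood, and initial values are placeholders, initSMhypers is a hypothetical initialization helper (step 2 is whatever scheme you actually choose), and function names assume GPML 3.x:

Q = 5;                                        % number of spectral mixture components
covfunc  = {@covSM, Q};                       % step 1: an expressive covariance
hyp.cov  = initSMhypers(Q, x, y);             % step 2: initialization (hypothetical helper)
meanfunc = {@meanZero};  likfunc = {@likGauss};
hyp.mean = [];  hyp.lik = log(0.1);
hyp = minimize(hyp, @gp, -200, @infExact, meanfunc, covfunc, likfunc, x, y);   % step 3: fit
[mu, s2] = gp(hyp, @infExact, meanfunc, covfunc, likfunc, x, y, xs);           % steps 4-5: infer and predict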



Example
Publicly available African violence dataset [6]
◮ Output is the number of violent events per grid cell per
month for 1990–2009
◮ Approximately 42 possible input regressor variables (lat,
long, time, temperature, population, etc.)
◮ 814,490 measurements in a potentially 42-dimensional
space
◮ A decently large data set



Initial Hack Method
Initially, choose only space and time as the identification space;
perform identification using 40 months and predict the next
12. Negative binomial likelihood and a 15-component spectral
mixture kernel.
◮ Grid size: 63 × 70 × 40 = 176,400

◮ We are using almost none of the information available in
the data set!

Events = f(Lat, Long, Time)



Initial Hack Results
Prediction for month 41:

[Figure: predicted Violent Events for month 41 over the latitude/longitude grid]



Dimension Reduction
For inputs, choose latitude, longitude, population, a count of
nearby recent violent events, and time:

Events = f(Lat, Long, Population, Past Events, Time)

Principal component analysis is applied for dimension reduction.

Grid size: 300 × 100 × 39 = 1,170,000!
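
A sketch of this step in Matlab (zscore and pca are in the Statistics Toolbox); X here stands for the n × 5 matrix of regressors listed above, and keeping three components matches the setup on the next slide:

Xz = zscore(X);                  % standardize each regressor column
[coeff, score] = pca(Xz);        % principal component decomposition
Xr = score(:, 1:3);              % keep the first three principal components
% Xr then replaces the raw regressors as the GP input space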



PCA Results
Three dimensions, negative binomial likelihood, and a 15-component
spectral mixture kernel. Prediction for month 41:
[Figure: Violent Events, predicted for month 41]

Average error per month is approximately 27 events over the
year's worth of prediction.
Societal Impact
Machine learning is being applied to models for which we have
massive amounts of data and almost no knowledge of the
underlying processes (violence, voting, music popularity, etc).
These social systems are seeing the impacts of machine
learning first because of the massive amounts of data available,
which give us the ability to find patterns.



Engineering Impact
The modeling impact on the physical sciences is still
nascent but will only increase. It is already happening in many
applications (civil, electrical, and mechanical) including traffic
prediction, erosion, weather, climate modeling, electric power
demand, and more.
It will be important to combine our ability to spot patterns
with our knowledge of the physics to create truly useful models
for society.



Additional Resources
◮ Open Source Toolboxes in R, Matlab, Python, C++,
Octave
◮ http://www.gaussianprocess.org/ for lots of tools
◮ All of the examples in this presentation made use of the
GPML library for Matlab [7]



Questions



References
[1] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[2] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[3] A. Wilson and R. Adams. Gaussian process kernels for pattern discovery and extrapolation. In Proceedings of the 30th International Conference on Machine Learning, 28(3):1067–1075, 2013.
[4] Seth R. Flaxman, Andrew Gordon Wilson, Daniel B. Neill, Hannes Nickisch, and Alexander J. Smola. Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 607–616, 2015.
[5] Andrew Gordon Wilson, Christoph Dann, and Hannes Nickisch. Thoughts on massively scalable Gaussian processes. Preprint, pages 1–25, 2015.
[6] Clionadh Raleigh, Andrew Linke, Håvard Hegre, and Joakim Karlsen. Introducing ACLED: An armed conflict location and event dataset special data feature. Journal of Peace Research, 47(5):651–660, 2010.
[7] Carl Edward Rasmussen and Hannes Nickisch. Gaussian processes for machine learning (GPML) toolbox. Journal of Machine Learning Research, 11:3011–3015, December 2010.

