
Technical Report 2007-003

Kriging
Author: Alberto Lovison
Date: December 03, 2007

Abstract

Kriging is a very popular regression methodology based on Gaussian Processes, originally developed in geostatistics for predicting gold concentration at extraction sites. Kriging also produces an estimate of the error, i.e., a prediction value together with an expected deviation from the prediction. A noise parameter can be tuned, turning the predictor from an interpolant into an approximant. The smoothness of the model is controlled by the covariance function (in geostatistics, the variogram function), which is a function of the distance between points and governs their mutual correlation.

Key Words: Kriging, Metamodeling, RSM, Gaussian Processes, Bayesian paradigm



1 Origin and pronunciation


Kriging is a regression methodology that originated from the extensive work of Professor Daniel Krige, of the University of the Witwatersrand in South Africa, especially on problems of gold extraction¹. The formalization and dissemination of this methodology, now universally employed in all branches of geostatistics, such as oil extraction and hydrology among others, is due to Professor Georges Matheron, who referred to Krige's regression technique as krigeage [?]. This is why the pronunciation of kriging with a soft "g" seems the more correct one, despite the hard "g" pronunciation prevailing in the U.S.

2 Definitions
To fix the notation, we consider a dataset of training points:

$$D := \{ (x_i, y_i) : x_i \in \Omega \subset \mathbb{R}^M,\ y_i \in \mathbb{R},\ i = 1, \dots, n \};$$

our parameters x, or inputs, belong to a given subset Ω of R^M, which is our parameter space. The outputs y are real values.
The regression problem is to obtain a likely value for outcomes y∗ at given places
x∗ different from the x1 , . . . , xn of the dataset. In other words, we want to find a
function
f : Ω −→ R, f (xi ) = yi , ∀i = 1, . . . , n.
Kriging offers an answer to this issue by means of the Gaussian Process framework.
The essential feature resides in the definition of the covariance function, which for
kriging only depends on the distance between points, the lag.

2.1 Gaussian Vectors


A random vector X = (X_1, ..., X_n) is said to be a Gaussian random vector if, for all u ∈ R^n, ⟨X, u⟩ := u_1 X_1 + ... + u_n X_n is a Gaussian random variable, i.e., ⟨X, u⟩ ∈ N(m, σ²), where m, σ are given by

$$m = \langle \vec{m}, u \rangle, \quad \text{where } \vec{m} = (E[X_1], \dots, E[X_n]),$$
$$\sigma^2 = \langle \Gamma u, u \rangle,$$

and Γ is the covariance matrix

$$\Gamma_{i,j} := \mathrm{Cov}(X_i, X_j) := E\left[ (X_i - E[X_i]) (X_j - E[X_j]) \right] = E[X_i X_j] - E[X_i] E[X_j].$$

Let us note that, as a result, X_1, ..., X_n are all Gaussian,

$$X_i \in N(m_i, \sigma_i^2) = N(E[X_i], \mathrm{Var}(X_i)).$$


¹ Recall that the Witwatersrand gold site is the world's largest, and it accounts for about 40% of the total world production.

Equivalently, one can define a Gaussian random vector with expected value m⃗ and covariance matrix Γ by assigning the following Gaussian probability density,

$$p(x) = \frac{1}{(2\pi)^{n/2} (\det \Gamma)^{1/2}} \exp\left( -\frac{1}{2} \left\langle \Gamma^{-1}(x - \vec{m}),\ (x - \vec{m}) \right\rangle \right),$$

or, in other words, we can give the probability law

$$P\{(X_1, \dots, X_n) \in A\} = \int_A \frac{1}{(2\pi)^{n/2} (\det \Gamma)^{1/2}} \exp\left( -\frac{1}{2} \left\langle \Gamma^{-1}(x - \vec{m}),\ (x - \vec{m}) \right\rangle \right) d\mu(x),$$

and say that X ∈ N(m⃗, Γ). More elegantly, we can write

$$X = A Y + \vec{m}, \qquad A A^T = \Gamma, \qquad Y \in N(\vec{0}, I).$$

Note that the matrix A can be taken as the Cholesky factor of the symmetric positive definite matrix Γ, which is also very useful for numerical purposes.
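As a side illustration (not part of the original report), the relation X = AY + m⃗ with AAᵀ = Γ translates directly into a sampling recipe; the following Python sketch, with names of our own choosing, draws samples from N(m⃗, Γ) via the Cholesky factor computed by NumPy.

```python
import numpy as np

def sample_gaussian_vector(mean, cov, n_samples=1, rng=None):
    """Draw samples from N(mean, cov) using X = A Y + m, with A A^T = cov."""
    rng = np.random.default_rng() if rng is None else rng
    A = np.linalg.cholesky(cov)                       # lower-triangular Cholesky factor
    Y = rng.standard_normal((len(mean), n_samples))   # Y ~ N(0, I)
    return (mean[:, None] + A @ Y).T                  # each row is one sample of X

# Example: a 2-dimensional Gaussian vector with correlated components
m = np.array([1.0, -2.0])
Gamma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
samples = sample_gaussian_vector(m, Gamma, n_samples=1000)
print(samples.mean(axis=0))   # should be close to m
print(np.cov(samples.T))      # should be close to Gamma
```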

2.2 Gaussian Processes


A stochastic process X̄ is an indexed family of random variables,

$$\bar{X} := (X_t)_{t \in I}, \quad X_t \text{ a random variable}, \qquad I = \{1, \dots, n\},\ \mathbb{Z},\ \mathbb{R}_+,\ \mathbb{R}^n,\ \mathbb{C}, \text{ etc.}$$

If the index set I is finite, i.e., X̄ = (X_1, ..., X_n), then X̄ is a random vector; if I = R_+, we are considering a time series; while if the indexes are points in space, i.e., I = Ω ⊆ R², R³, Rⁿ, we are talking about a random field.
A Gaussian Process is a stochastic process (f(x))_{x∈Ω}, i.e., a (possibly infinite) collection of random variables such that any finite subset is a Gaussian vector.
A Gaussian Process is completely determined by its mean function m(x) and its covariance function k(x, x'):

$$m(x) := E[f(x)], \qquad (1)$$
$$k(x, x') := E[(f(x) - m(x))(f(x') - m(x'))], \qquad (2)$$

and we will write

$$f(x) \sim N(m(x), k(x, x')).$$

2.3 Bayesian Prediction with Gaussian Processes


What kind of probability distribution can we write to represent the shape of a prob-
able function passing exactly through the points assigned in a training dataset D =
((x1 , f1 ), . . . , (xn , fn )) = (X, f )?
First we have to assign a prior, i.e., an a priori probability distribution for the
(large) family of functions we want to employ to represent our unknown function.


Figure 1: A flat Gaussian process considered as a prior, and the posterior corresponding to the functions passing through a given dataset. The shaded areas correspond to the standard deviations, without and with conditioning. The solid line is the average m(x), while the dashed lines are sample functions drawn from the probability distributions.

This could be, e.g., the following flat Gaussian prior: a Gaussian process with zero mean and Gaussian covariance,

$$f(x) \sim N\!\left( 0,\ \exp\left( -\tfrac{1}{2} (x - x')^2 \right) \right).$$
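For concreteness, the following sketch (ours, with an assumed grid and random seed) builds the covariance matrix of this flat prior on a one-dimensional grid and draws a few sample functions, mimicking the dashed lines of Figure 1.

```python
import numpy as np

def k_se(x1, x2):
    """Squared-exponential covariance, k(x, x') = exp(-0.5 * (x - x')^2)."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2)

# Draw a few sample functions from the flat prior f(x) ~ N(0, k_se)
xs = np.linspace(-2.0, 2.0, 200)
K = k_se(xs, xs) + 1e-10 * np.eye(len(xs))            # jitter for numerical stability
L = np.linalg.cholesky(K)
rng = np.random.default_rng(0)
prior_samples = L @ rng.standard_normal((len(xs), 3)) # three sample functions
prior_std = np.sqrt(np.diag(K))                       # shaded band: +/- one standard deviation
```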

Next, we will pick from this family only the functions which satisfy the required condition, i.e., to pass through D.
The resulting posterior distribution of f_* at a family of points X_* can be written as

$$f_* \mid X_*, X, f \;\sim\; N\Big( K(X_*, X) K(X, X)^{-1} f,\;\; K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*) \Big). \qquad (3)$$

For a single prediction f_* at a point x_*, writing K = K(X, X) for the matrix of covariances between the random variables at the training points, and k_* for the vector of covariances between the unknown point x_* and the training sites X = (x_1, ..., x_n), we write:

$$f_* = k_*^T K^{-1} f, \qquad (4)$$
$$\sigma^2(f_*) = \sigma^2(x_*) - k_*^T K^{-1} k_*. \qquad (5)$$


2.4 Kriging
In other words, for each point x_* in the domain of the function f we are considering a random variable Y(x_*) distributed according to the posterior given D. The expected value m(x) and standard deviation s(x) are adopted a priori or inferred a posteriori. Furthermore, the two random variables Y(x') and Y(x'') are correlated according to a covariance function Cov(x', x''), which can be very complicated. The Kriging method adopts the following hypotheses for the prior:

1. m(x) ≡ m, σ(x) ≡ Sill, i.e., the average and the standard deviation are the same at all points.

2. Cov(x', x'') := E[(Y(x') − m)(Y(x'') − m)] =: C(‖x' − x''‖) = s(1 − γ(‖x' − x''‖)), i.e., the covariance depends only on the (absolute value of the) spatial separation between points. The model function γ(h) describing the correlation is called the variogram.

These assumptions have the following consequences on the regression problem. Let us consider a point x_* different from the training points. The residual standard deviation at the point x_* varies according to the value y_* we could assign to the unknown function. The Kriging estimator is a linear estimator, i.e., the estimated value is expressed as a linear combination of the training values; in other words,

$$y_* = \sum_{i=1}^{n} \lambda_i(x_*)\, y_i,$$

where the weights λ_1, ..., λ_n are obviously point-dependent. From the assumptions we made on f(x), we can compute the residual standard deviation even without knowing the actual value of the unknown function at the given point x_*. The residual variance is

$$\begin{aligned}
\sigma^2(x_*) = E\left[ \| y_* - f(x_*) \|^2 \right]
&= E\left[ \Big( \sum_{i=1}^{n} \lambda_i y_i - f(x_*) \Big)^{\!2} \right] \\
&= E\left[ \Big( \sum_{i=1}^{n} \lambda_i y_i - f(x_*) \Big) \cdot \Big( \sum_{i=1}^{n} \lambda_i y_i - f(x_*) \Big) \right] \\
&= E\left[ \sum_{i=1}^{n} \lambda_i y_i \sum_{i=1}^{n} \lambda_i y_i \right] - 2\, E\left[ \sum_{i=1}^{n} \lambda_i y_i f(x_*) \right] + E\left[ f(x_*)^2 \right] \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j E[y_i y_j] - 2 \sum_{i=1}^{n} \lambda_i E[y_i f(x_*)] + E\left[ f(x_*)^2 \right].
\end{aligned}$$

From the definition of covariance,

$$E[y_i y_j] = C(\|x_i - x_j\|) + m^2, \qquad \text{and} \qquad E\left[ f(x_*)^2 \right] = C(0) + m^2, \quad \text{with } C(0) = \sigma^2(f).$$

We do not lose generality by assuming the constant average equal to zero,

$$m(x) \equiv m = 0.$$

The residual variance results in

$$\begin{aligned}
\sigma^2(x_*) = E\left[ \| y_* - f(x_*) \|^2 \right]
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j \mathrm{Cov}(x_i, x_j) - 2 \sum_{i=1}^{n} \lambda_i \mathrm{Cov}(x_i, x_*) + \mathrm{Cov}(x_*, x_*) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j C(\|x_i - x_j\|) - 2 \sum_{i=1}^{n} \lambda_i C(\|x_i - x_*\|) + C(0),
\end{aligned}$$

which is minimized by setting the first derivative to zero, i.e., by looking for the critical point with respect to the λ_i:

$$\frac{1}{2} \frac{\partial}{\partial \lambda_i} E\left[ \| y_* - f(x_*) \|^2 \right] = \sum_{j=1}^{n} \lambda_j C(\|x_i - x_j\|) - C(\|x_i - x_*\|) = 0.$$

In the end, the best values for the coefficients λ_1, ..., λ_n are found by solving a symmetric, positive definite linear system,

$$C \lambda = b,$$

where C_{i,j} = C(‖x_i − x_j‖) is the covariance matrix, completely determined by the dataset, while the right-hand side b_i = C(‖x_i − x_*‖) is the vector of the covariances associated with the distances between the unknown site x_* and the training dataset.
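As an illustrative sketch, assuming a zero mean and a Gaussian covariance model (the report does not prescribe either), the simple kriging prediction obtained from Cλ = b can be written as follows; function and variable names are our own.

```python
import numpy as np

def gaussian_cov(h, sill=1.0, range_=1.0):
    """Assumed Gaussian covariance model C(h) = sill * exp(-(h / range)^2)."""
    return sill * np.exp(-(h / range_) ** 2)

def simple_kriging(X_train, y_train, x_new, cov=gaussian_cov):
    """Predict y(x_new) as sum_i lambda_i * y_i, with C lambda = b."""
    d_train = np.linalg.norm(X_train[:, None, :] - X_train[None, :, :], axis=-1)
    C = cov(d_train)                                   # covariances between training sites
    b = cov(np.linalg.norm(X_train - x_new, axis=-1))  # covariances to the new site
    lam = np.linalg.solve(C, b)                        # kriging weights
    y_hat = lam @ y_train                              # predicted value
    var = cov(0.0) - lam @ b                           # residual variance C(0) - lambda^T b
    return y_hat, var

# Usage with a small 2-D dataset (assuming zero mean, m = 0)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([0.1, 0.8, -0.3, 0.5])
y_hat, var = simple_kriging(X, y, np.array([0.4, 0.6]))
```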

2.5 Numerical aspects


The main numerical difficulty is the solution of a linear system, required for every estimation at a new site. However, this difficulty is mitigated by the fact that the covariance matrix is symmetric and positive definite. Indeed, the covariance matrix can be factorized by Cholesky,

$$C = L \cdot L^T,$$

with L lower triangular, so the original system splits into two triangular systems straightforwardly solved by substitution:

$$C\lambda = b \iff L L^T \lambda = b: \quad \text{first solve } L y = b, \text{ then } L^T \lambda = y.$$

The advantage is that the factorization is performed only once, and the two triangular systems are very easy to solve. More precisely, solving a general linear system from scratch (e.g., by Gauss-Jordan elimination) costs O(n³) for every new right-hand side, while the Cholesky factorization is computed once, at roughly half the cost of a general factorization, and each pair of triangular solves then costs only O(n²).
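A minimal sketch of this factor-once, solve-many pattern, using SciPy's Cholesky helpers (our choice of library, not the report's):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def kriging_weights_for_sites(C, B):
    """Solve C lambda = b for many right-hand sides, factorizing C only once.

    C : (n, n) covariance matrix of the training sites (symmetric positive definite)
    B : (n, m) matrix whose columns are the covariance vectors b for m new sites
    """
    c_factor = cho_factor(C, lower=True)   # Cholesky factorization, done once
    return cho_solve(c_factor, B)          # two triangular solves per column

# Usage: weights for 3 prediction sites sharing the same training set
C = np.array([[1.0, 0.4, 0.2],
              [0.4, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
B = np.random.default_rng(1).uniform(0.1, 0.9, size=(3, 3))  # arbitrary illustrative b vectors
lambdas = kriging_weights_for_sites(C, B)   # column j holds the weights for site j
```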


On the other hand, when the matrix grows in dimension, its condition number usually grows as well, making the round-off errors appreciably large. In this case it is convenient to switch to moving-neighborhood strategies, i.e., to involve in the computations only the points which lie within a given radius of the target site x_*.
For scattered datasets, this strategy implies solving a smaller but possibly different system for every new site, requiring a new factorization at every step.
If instead the dataset is regular, for instance a grid, the reduced covariance matrix is always the same, letting us compute the Cholesky factorization, which is the most important numerical bottleneck, only once.

3 Anatomy of a Variogram
Figure 2: Correspondence between (a) the anatomy of a Gaussian variogram γ(h), with nugget δ, sill σ and range ρ, and (b) the spatially dependent covariance function C(h) = σ − γ(h).

As described before, the covariance function for the Kriging methodology depends only on the mutual disposition of the points; moreover, in the simple Kriging model only the distance (the lag) between points is relevant.
This can be seen as a very natural generalization of the notion of continuity of functions: we expect from a regular (continuous) function that the values corresponding to close points are themselves close. Kriging refines this natural concept by describing this spatial correlation between points more quantitatively, introducing and managing the characteristic scales of the problem.
The variation of the function as the parameters vary can be summarized in a variogram. More precisely, an experimental variogram is built in the following way: all pairs of training points are binned according to their mutual distance, and the average squared difference between the values is computed for the pairs in each bin. Given bin edges 0 < h_1 < ... < h_ℓ,

$$\gamma(h_i) = \left\langle \| y_p - y_q \|^2 \right\rangle := \frac{\displaystyle\sum_{h_i < \|x_p - x_q\| < h_{i+1}} \| y_p - y_q \|^2}{\# \left\{ (p, q) : h_i < \| x_p - x_q \| < h_{i+1} \right\}}.$$
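The binning procedure can be sketched as follows; this is our own illustration, with hypothetical names, not code from the report.

```python
import numpy as np

def experimental_variogram(X, y, bin_edges):
    """Average squared difference of values for point pairs binned by distance."""
    n = len(X)
    iu = np.triu_indices(n, k=1)                          # all pairs p < q
    dists = np.linalg.norm(X[iu[0]] - X[iu[1]], axis=-1)  # lags ||x_p - x_q||
    sqdiff = (y[iu[0]] - y[iu[1]]) ** 2                   # ||y_p - y_q||^2
    gamma = np.full(len(bin_edges) - 1, np.nan)
    for i in range(len(bin_edges) - 1):
        in_bin = (dists > bin_edges[i]) & (dists < bin_edges[i + 1])
        if in_bin.any():
            gamma[i] = sqdiff[in_bin].mean()              # bin average
    return gamma

# Usage on a small random 2-D dataset
rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(50, 2))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(50)
edges = np.linspace(0.0, 1.0, 11)
gamma_hat = experimental_variogram(X, y, edges)
```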


As one can imagine, γ(h) will be very small for small h, and γ will then grow, on average, until it reaches an "indifference" distance. This asymptotic value is called the sill σ, while the distance at which the sill is considered reached is called the range ρ.
Furthermore, the continuity hypothesis asks for γ(h) → 0 as h → 0. However, if some phenomenon occurs at a very small scale, γ jumps suddenly to a non-zero value called the nugget δ.
The growth rate typically follows one of the standard models: Gaussian, Exponential, Spherical or Polynomial. The smoothest one, the Gaussian, usually gives the best results but, as a drawback, is usually the most expensive in numerical terms, because it easily generates large condition numbers for the covariance matrix.
Once the best fitting variogram is determined, the covariance function is obtained as follows:

$$\mathrm{Cov}(x', x'') = C(\| x' - x'' \|) := \sigma - \gamma(\| x' - x'' \|).$$

See Figure 2 for the correspondence between variograms and covariance functions.
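As an example, one common parameterization of a Gaussian variogram with nugget, sill and range, together with the corresponding covariance, could look like the following sketch; the exact model forms used in the report are not specified, so this shape is an assumption.

```python
import numpy as np

def gaussian_variogram(h, sill, range_, nugget=0.0):
    """A common Gaussian variogram model: rises from the nugget toward the sill."""
    h = np.asarray(h, dtype=float)
    gamma = nugget + (sill - nugget) * (1.0 - np.exp(-(h / range_) ** 2))
    return np.where(h > 0.0, gamma, 0.0)    # gamma(0) = 0 by definition

def covariance_from_variogram(h, sill, range_, nugget=0.0):
    """Covariance function C(h) = sill - gamma(h), as in Figure 2."""
    return sill - gaussian_variogram(h, sill, range_, nugget)
```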

4 Simple Kriging in modeFRONTIER


Our main source of inspiration is surely GSLib (www.gslib.com), which collects state-of-the-art implementations of the kriging methodology and its geostatistical applications. GSLib is extensively documented [2].

4.1 Autofitting
In modeFRONTIER there are two possibilities for the automatic determination of range, sill and noise: maximizing the likelihood and maximizing the leave-one-out predictive probability. Maximizing the likelihood means finding the least complex model which fits the training set acceptably. The likelihood brings the advantage of employing the whole dataset at the same time, i.e., the whole information available. Maximizing the leave-one-out predictive probability, also called the pseudo-likelihood, is an autofitting procedure of the cross-validation type. A performance function is built by training a new model on all the available points except one site at a time and evaluating the prediction error at the removed site. The functional also takes into account the prediction error provided by kriging, i.e., large errors at remote sites are weighted less than moderate errors at densely sampled sites. Cross-validation procedures are usually reliable and robust; nevertheless they can be misleading for very small datasets, where removing even one site from the training set is problematic, or for very dense datasets, where removing one site produces little effect. For more information refer to [1].
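The leave-one-out idea can be sketched generically as follows; this is our own illustration of a cross-validation-type functional, not modeFRONTIER's actual implementation, and the predictor interface is assumed.

```python
import numpy as np

def loo_log_predictive_probability(X, y, predict):
    """Leave-one-out pseudo-likelihood for a predictor returning (mean, variance).

    predict(X_train, y_train, x_new) -> (y_hat, var) is any kriging/GP predictor,
    e.g. the simple_kriging sketch above.
    """
    total = 0.0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i                   # drop site i from training
        y_hat, var = predict(X[mask], y[mask], X[i])
        var = max(var, 1e-12)                           # guard against numerically zero variance
        # Gaussian log predictive density at the held-out value: errors at sites
        # with large predicted variance are penalized less.
        total += -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y[i] - y_hat) ** 2 / var
    return total
```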

5 Acknowledgements
Special thanks go to the Department of Mathematical Methods and Models for Scientific Applications of the University of Padova, in particular to Professor Mario Putti and to Ing. Andrea Comerlati, for sharing with us their mastery of Kriging and its applications in hydrology [3].

References
[1] Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.

[2] Deutsch, C. and Journel, A. (1992). GSLIB: Geostatistical Software Library and User's Guide. Oxford University Press.

[3] Gambolati, G., Putti, M., Comerlati, A. and Ferronato, M. (2004). Saving Venice by sea water. Journal of Geophysical Research, 109(F03006).
