Kriging
Author: Alberto Lovison
Date: December 03, 2007
2 Definitions
To fix the notation, we consider a dataset of training points:

D := { (xᵢ, yᵢ) : xᵢ ∈ Ω ⊂ Rᴹ, yᵢ ∈ R, i = 1, …, n }.

For a random vector X = (X₁, …, Xₙ) and a weight vector u, the linear combination ⟨u, X⟩ has mean and variance

m = ⟨m⃗, u⟩, where m⃗ = (E[X₁], …, E[Xₙ]),

σ² = ⟨Γu, u⟩,

Γᵢ,ⱼ := Cov(Xᵢ, Xⱼ) := E[(Xᵢ − E[Xᵢ])(Xⱼ − E[Xⱼ])] = E[XᵢXⱼ] − E[Xᵢ]E[Xⱼ].
Tec. Rep. 2007-003 December 03, 2007
X = AY + m⃗,  with AAᵀ = Γ and Y ∼ N(0⃗, I).

Note that the matrix A is the Cholesky factor of the symmetric positive definite matrix Γ (Γ = AAᵀ with A lower triangular), which is also very useful for numerical purposes.
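As a minimal numerical sketch of this construction (assuming NumPy; the function name and the example numbers are ours, not from the report), one draws samples of X = AY + m⃗ by computing the Cholesky factor of Γ and applying it to independent standard normal draws:

```python
import numpy as np

def sample_correlated_gaussian(m, Gamma, n_samples, seed=None):
    """Draw samples of X = A Y + m, where A A^T = Gamma and Y ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    A = np.linalg.cholesky(Gamma)                  # Gamma = A A^T, A lower triangular
    Y = rng.standard_normal((len(m), n_samples))   # independent N(0, 1) draws
    return (A @ Y).T + m                           # each row is one sample of X

# Illustrative 2-dimensional example with correlated components.
m = np.array([1.0, -2.0])
Gamma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
X = sample_correlated_gaussian(m, Gamma, n_samples=50_000, seed=0)
```

The empirical mean and covariance of the samples approach m⃗ and Γ as the sample count grows.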
[Figure 1: two panels, the prior on the left and the conditioned posterior on the right.]

Figure 1: A flat Gaussian process considered as a prior, and the posterior corresponding to the functions passing through a given dataset. The shaded areas correspond to the standard deviations, without and with conditioning. The solid line is the average m(x), while the dashed lines are sample functions drawn from the probability distributions.
This could be, e.g., the following flat Gaussian prior: a Gaussian process with zero mean and Gaussian covariance,

f(x) ∼ N( 0, exp( −½ ‖x − x′‖² ) ).
Next, we will pick from this family only the functions which satisfy the condition
required, i.e., to pass through D.
The resulting posterior distribution f⋆ at a family of points X⋆ can be written as

f⋆ | X, f, X⋆ ∼ N( K(X⋆, X) K(X, X)⁻¹ f,  K(X⋆, X⋆) − K(X⋆, X) K(X, X)⁻¹ K(X, X⋆) ).
For a single prediction f? at a point x? , and writing K = K(X, X) for the matrix of
covariances between the r.v. at the training points, and k? for the covariances vector
between the unknown point x? and the training sites X = (x1 , . . . , xn ), we write:
f⋆ = k⋆ᵀ K⁻¹ f,   (4)

σ²(f⋆) = σ²(x⋆) − k⋆ᵀ K⁻¹ k⋆.   (5)
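Equations (4) and (5) can be sketched directly in code. This is a minimal illustration assuming the flat Gaussian prior above (zero mean and unit-variance Gaussian covariance, so σ²(x⋆) = 1) and NumPy; the helper names are ours:

```python
import numpy as np

def gaussian_cov(x1, x2):
    """Gaussian covariance k(x, x') = exp(-(x - x')^2 / 2) on 1-d inputs."""
    d = np.asarray(x1)[:, None] - np.asarray(x2)[None, :]
    return np.exp(-0.5 * d**2)

def predict(X, f, x_star):
    """Posterior mean (eq. 4) and variance (eq. 5) at a single point x_star."""
    K = gaussian_cov(X, X)                       # K = K(X, X)
    k_star = gaussian_cov(X, [x_star])[:, 0]     # k_* between x_star and X
    f_star = k_star @ np.linalg.solve(K, f)      # eq. (4): k_*^T K^{-1} f
    var = 1.0 - k_star @ np.linalg.solve(K, k_star)  # eq. (5), sigma^2(x_*) = 1
    return f_star, var

X = np.array([-1.0, 0.0, 1.0])   # training sites
f = np.sin(X)                    # training values
mu0, var0 = predict(X, f, 0.0)   # at a training site: exact interpolation
mu5, var5 = predict(X, f, 5.0)   # far from the data
```

At a training site the predictor interpolates exactly and the residual variance vanishes; far from the data the variance climbs back toward the prior value σ² = 1.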
2.4 Kriging
In other words, we are considering a random variable Y ∈ P_D for each point x⋆ in the domain of the function f. The expected value m(x) and the standard deviation s(x) are adopted a priori or argued a posteriori. Furthermore, the two random variables Y(x′) and Y(x″) are correlated according to a covariance function Cov(x′, x″), which can be very complicated. The Kriging method adopts the following hypotheses for the prior:

1. m(x) ≡ m, σ(x) ≡ Sill, i.e., the average and the standard deviation are the same for all points.

2. Cov(x′, x″) := E[(Y(x′) − m) · (Y(x″) − m)] =: C(‖x′ − x″‖) = σ − γ(‖x′ − x″‖), i.e., the covariance depends only on the (norm of the) spatial separation between points. The model function γ(h) of the correlation is called the variogram.
These assumptions have the following consequences for the regression problem. Let us consider a point x⋆ different from the training points. The residual standard deviation at the point x⋆ varies according to the value y⋆ we could assign to the unknown function. The Kriging estimator is a linear estimator, i.e., the estimated value is expressed as a linear combination of the training values; in other words:
y⋆ = Σᵢ₌₁ⁿ λᵢ(x⋆) yᵢ,
m(x) ≡ m = 0.
which is minimized by setting the first derivative to zero, i.e., looking for the critical point with respect to λᵢ:

∂/∂λᵢ E[ ‖y⋆ − f(x⋆)‖² ] = Σⱼ₌₁ⁿ λⱼ C(‖xᵢ − xⱼ‖) − C(‖xᵢ − x⋆‖) = 0.
In the end, the best values for the coefficients λ₁, …, λₙ are found by solving a symmetric, positive definite linear system:

Cλ = b,

which the Cholesky factorization C = LLᵀ reduces to two triangular systems: first solve Ly = b by forward substitution, then Lᵀλ = y by back substitution.
The advantage consists in the fact that the factorization is performed only once and
the two factorized linear systems are very easy to solve.
More precisely, the whole Gauss–Jordan solution of a general linear system costs O(n³), while the Cholesky factorization, though also O(n³), requires only about n³/3 operations and is performed once; after that, each factorized triangular system is solved in only O(n²) operations.
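A sketch of this factor-once strategy (assuming SciPy; the function name and the example system are ours): Cλ = b is reduced to the two triangular systems Ly = b and Lᵀλ = y.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def kriging_weights(C, b):
    """Solve C lambda = b via Cholesky: C = L L^T, then L y = b, L^T lambda = y."""
    L = cholesky(C, lower=True)                   # ~n^3/3 flops, done once
    y = solve_triangular(L, b, lower=True)        # forward substitution, O(n^2)
    lam = solve_triangular(L.T, y, lower=False)   # back substitution, O(n^2)
    return lam

# Small symmetric positive definite example system.
C = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])
b = np.array([1.0, 2.0, 3.0])
lam = kriging_weights(C, b)
```

The factor L can be kept and reused for every new right-hand side b, which is exactly the saving exploited when the covariance matrix does not change.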
On the other hand, as the matrix grows in dimension, its condition number usually grows as well, making the round-off errors appreciably large. In this case it is convenient to pass to moving-neighborhood strategies, i.e., to involve in the computations only the points which lie within a given radius of the target site x⋆.
For scattered datasets, this strategy implies the solution of a smaller but possibly different system at every new site, requiring a new factorization at every step. If instead the dataset is regular, e.g., a grid, the reduced covariance matrix is always the same, letting us compute the Cholesky factorization, which is the most important numerical bottleneck, only once.
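The moving-neighborhood idea can be sketched as follows (a 1-d NumPy illustration under our own naming; the covariance model is passed in as a function):

```python
import numpy as np

def neighborhood(x_train, x_star, radius):
    """Indices of the training sites within the given radius of the target x_star."""
    return np.flatnonzero(np.abs(x_train - x_star) <= radius)

def local_kriging_weights(x_train, x_star, radius, cov):
    """Simple-kriging weights from the moving neighborhood only: a smaller
    system C lambda = b is assembled and solved at each new target site."""
    idx = neighborhood(x_train, x_star, radius)
    xs = x_train[idx]
    C = cov(np.abs(xs[:, None] - xs[None, :]))  # reduced covariance matrix
    b = cov(np.abs(xs - x_star))                # covariances with the target
    lam = np.linalg.solve(C, b)                 # small dense solve per site
    return idx, lam
```

When the target coincides with a training site inside the neighborhood, the weights reduce to the indicator of that site, so the estimator interpolates.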
3 Anatomy of a Variogram
[Figure 2: two panels, (a) a Gaussian variogram γ(h) with its nugget δ, sill σ, and range ρ marked, and (b) the corresponding covariance function σ − γ(h).]

Figure 2: Correspondence between (a) a Gaussian variogram γ(h) and (b) the spatially dependent covariance function C(h) = σ − γ(h).
As described before, the covariance function for the Kriging methodology depends only on the mutual disposition of the points; in the simple Kriging model, furthermore, only the distance (the lag) between points is relevant.
As can be easily argued, this is a very natural generalization of the notion of continuity of functions: we expect from a regular (continuous) function that the values corresponding to close points are near each other. Kriging improves on this natural concept by describing the spatial correlation between points more quantitatively, introducing and managing the characteristic scales of the problem.
The variations of the function as the parameters vary can be plotted in a variogram. More precisely, an experimental variogram is built in the following way: all pairs of training points are binned according to their mutual distance, and half the average squared difference between the values is computed for the pairs in each bin. For lag bins 0 < h₁ < ⋯ < h_ℓ,

γ(hᵢ) = ⟨‖y_p − y_q‖²⟩ := ( Σ ‖y_p − y_q‖² ) / ( 2 · #{ hᵢ < ‖x_p − x_q‖ < hᵢ₊₁ } ),

where the sum runs over the pairs (p, q) falling in the i-th bin.
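A minimal sketch of this binning procedure for 1-d scattered data (assuming NumPy; the function name is ours):

```python
import numpy as np

def experimental_variogram(x, y, bin_edges):
    """Experimental variogram: for each lag bin, half the mean squared
    difference of the values over all pairs whose distance falls in the bin."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    i, j = np.triu_indices(len(x), k=1)      # all unordered pairs p < q
    lags = np.abs(x[i] - x[j])               # 1-d lags |x_p - x_q|
    sq = (y[i] - y[j]) ** 2
    bin_edges = np.asarray(bin_edges, dtype=float)
    gamma = np.full(len(bin_edges) - 1, np.nan)   # NaN marks empty bins
    for b in range(len(bin_edges) - 1):
        in_bin = (bin_edges[b] < lags) & (lags <= bin_edges[b + 1])
        if in_bin.any():
            gamma[b] = sq[in_bin].sum() / (2.0 * in_bin.sum())
    return gamma
```

A constant dataset gives a zero variogram, while for data with genuine spatial variation γ grows with the lag, as described below.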
As one can imagine, γ(h) will be very small for small h, and γ will progressively grow, on average, until it reaches an "indifference" distance. This asymptotic value is called the sill σ, while the distance at which the sill is considered reached is called the range ρ. Furthermore, the continuity hypothesis asks for γ(h) → 0 as h → 0. However, if some phenomena occur at a very small scale, γ jumps suddenly to a nonzero value called the nugget δ.
The growth rate typically follows one of the standard models: Gaussian, Exponential, Spherical, or Polynomial. The smoothest one, the Gaussian, usually gives the best results but, as a drawback, is usually the most expensive in numerical terms, because it easily generates large condition numbers for the covariance matrix.
Once the best fitting variogram is determined, the covariance function is obtained
as follows:
Cov(x′, x″) = C(‖x′ − x″‖) := σ − γ(‖x′ − x″‖).
See Figure 2 for the correspondence between variograms and covariance functions.
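A sketch of a Gaussian variogram model and the covariance derived from it (the parametrization is ours; the factor 3 implements the common geostatistical convention that γ effectively reaches the sill at the range, and the nugget appears as a jump just above h = 0):

```python
import numpy as np

def gaussian_variogram(h, nugget, sill, rho):
    """Gaussian variogram: gamma(0) = 0, a jump to the nugget at h = 0+,
    then a smooth rise that effectively reaches the sill at the range rho."""
    h = np.asarray(h, dtype=float)
    g = nugget + (sill - nugget) * (1.0 - np.exp(-3.0 * (h / rho) ** 2))
    return np.where(h == 0.0, 0.0, g)

def covariance(h, nugget, sill, rho):
    """Spatially dependent covariance C(h) = sill - gamma(h)."""
    return sill - gaussian_variogram(h, nugget, sill, rho)
```

By construction C(0) equals the sill and C(h) decays toward zero beyond the range, mirroring the two panels of Figure 2.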
4.1 Autofitting
In modeFRONTIER there are two possibilities for the automatic determination of range, sill and noise: maximizing the likelihood and maximizing the leave-one-out predictive probability. Maximizing the likelihood means finding the least complex model which fits the training set acceptably. Likelihood brings the advantage of employing the whole dataset at the same time, i.e., the whole information available. Maximizing the leave-one-out predictive probability, also called pseudo-likelihood, is an autofitting procedure of the cross-validation type. A performance function is built up by training a new model on all the available points, excluding one site at a time, and evaluating the prediction error at the removed site. The functional also involves the prediction error offered by Kriging, i.e., large errors at remote sites are weighted less than moderate errors at very dense sites. Usually cross-validation procedures are reliable and robust; nevertheless they can be misleading for very small datasets, where removing even one site from training is problematic, or for very dense datasets, where removing one site produces little effect. For more information refer to [1].
5 Acknowledgements
Special thanks go to the Department of Mathematical Methods and Models for Scientific Applications of the University of Padova, in particular to Professor Mario Putti and to Ing. Andrea Comerlati, for sharing with us their total mastery of Kriging and its applications in hydrology [3].

¹ www.gslib.com
References
[1] Rasmussen, C. E. and Williams, C. K. I., 2006. Gaussian Processes for Machine Learning. MIT Press.
[3] Gambolati, G., Putti, M., Comerlati, A. and Ferronato, M., 2004. Saving Venice by sea water. Journal of Geophysical Research, 109(F03006).