
Proceedings of the Annual Conference of the International Association for Mathematical Geology, Berlin-

Germany, September 15-20, 2002, p. 479-484.

Optimizing Spatial Declustering Weights – Comparison of Methods


G. Dubois (1), M. Saisana (2)

(1) Inst. of Mineralogy and Petrography, University of Lausanne, Switzerland, (2) Joint
Research Centre of the European Commission, Ispra, Italy

Michaela Saisana, European Commission, Joint Research Centre, Via Enrico Fermi 1, 21020 Ispra, Italy, michaela.saisana@jrc.it

Abstract
The analysis of a spatial phenomenon is strongly affected by irregular sampling schemes
and by preferential clustering of the samples. To obtain representative statistics for an
area of interest, the influence of clustered measurements needs to be reduced by attributing
them lower weights. In this case study, two standard methods, the polygonal and the
cell-declustering methods, are compared with the recently proposed Coefficient of
Representativity. The latter combines Thiessen polygons with nearest-neighbour distances
and is further coupled with information provided by the borders of the region of interest.
The relative performance of these methods is assessed by estimating the average elevation
from several subsets of a Digital Elevation Model of Switzerland. For this case study, the
Coefficient of Representativity gave better results overall than the other methods.

1. Introduction
Most studies of spatial phenomena are based on the information provided by sampling
campaigns and/or monitoring networks. Because many factors can restrict access to
particular geographical locations, the collected information is usually not distributed
regularly in space and may therefore not be representative of the analysed phenomenon.
Moreover, sampling campaigns often target the high or low end of the statistical
distribution of the investigated variable, thereby introducing a systematic bias in the
statistics. This bias can be taken into account by weighting each sample value so as to
reduce the influence of clustered measurements and to increase that of isolated measurements.

Two declustering methods are traditionally used in geostatistics: the polygonal method and
the cell-declustering method. Both use a weighted linear combination of the available sample
values to estimate the exhaustive mean.

2. Overview of existing declustering methods


In the polygonal method, each sample is associated with a polygon of influence (also called
Thiessen or Voronoi polygon, see e.g. Thiessen, 1911), which is constructed so that it
contains all the locations that are closer to that sample than to any other measurement.
The estimated global mean, F(x), of a data set is then defined by

$$F(x) = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \qquad (1)$$

where the weights w_i are defined by the surfaces of the polygons (Isaaks and Srivastava,
1989, p. 243-247). Clearly, isolated points will have larger polygons than clustered points.
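
As an illustration of Equation (1), the following minimal sketch computes the weighted mean for a small set of made-up values. The function name `declustered_mean`, the elevations and the polygon areas are all hypothetical and are not taken from the case study.

```python
def declustered_mean(values, weights):
    """Weighted mean of Equation (1): sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Hypothetical data: three clustered high readings with small polygons
# and one isolated low reading with a large polygon.
elevations = [2000.0, 2100.0, 1900.0, 500.0]   # metres
polygon_areas = [10.0, 10.0, 10.0, 70.0]       # km^2, used as weights w_i

naive_mean = sum(elevations) / len(elevations)
weighted_mean = declustered_mean(elevations, polygon_areas)
print(naive_mean, weighted_mean)  # 1625.0 950.0
```

The isolated low value carries 70% of the total weight, so the weighted mean moves well below the unweighted one.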

Because the external borders of the polygons are frequently delimited by a convex hull, which
is constructed on the basis of the outer points, a bias may be introduced when the shape of the
study area does not correspond to the convex hull. Consequently, one is often constrained to
work only within the geographical limits defined by the sampling network, even if the
network has been designed to investigate a larger area. Nevertheless, Geographic Information
Systems (GIS) facilitate the use of additional information that can better define the boundaries
of the region of interest. This is illustrated in Figure 1, which shows Thiessen polygons that
correspond to 300 points located in Switzerland. The use of the country border has clearly
reduced the impact of the very large polygons found in the South and East of the country.

Figure 1. Thiessen polygons delimited by country borders and by a convex hull.

Another limitation of the method is that two points can be very close in space and still
have large Thiessen polygons, as is often the case at the borders of the study area.
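
The effect described above can be reproduced with a rasterised approximation of the Thiessen polygons: the region of interest is given as a list of cell centres, and each cell is assigned to its nearest sample. This is only a sketch under that assumption, not an exact polygon construction; the function `thiessen_areas` and the 6 x 6 test region are hypothetical.

```python
def thiessen_areas(points, region_cells, cell_area=1.0):
    """Approximate polygon areas by assigning each region cell to its nearest sample."""
    areas = [0.0] * len(points)
    for cx, cy in region_cells:
        nearest = min(
            range(len(points)),
            key=lambda i: (points[i][0] - cx) ** 2 + (points[i][1] - cy) ** 2,
        )
        areas[nearest] += cell_area
    return areas


# Hypothetical 6 x 6 region made of unit cells. Samples A and B are only
# 0.2 units apart on the western border; sample C lies further east.
region = [(i + 0.5, j + 0.5) for i in range(6) for j in range(6)]
samples = [(0.2, 2.9), (0.2, 3.1), (5.0, 3.0)]
areas = thiessen_areas(samples, region)
print(areas)  # [9.0, 9.0, 18.0]
```

Only cells listed in `region_cells` contribute, so restricting that list to the cells inside a country border plays the role of closing the polygons with the border rather than with a convex hull. In this example the two nearly coincident border samples still receive a quarter of the region each.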

The cell-declustering method was proposed first by Journel (1983) and a modified version
was later presented by Deutsch (1989). The method uses a moving window system, which
will split the study area into rectangular cells, and each measurement receives a weight that is
inversely proportional to the number of points that fall within the same window. Accordingly,
the number of cells covering the investigated region defines the sum of the weights in
Equation (1). A drawback of the method is the arbitrary definition of (a) the cell size (i.e.
rectangular size in each dimension) and (b) the origin of the cell network, which will
influence the results. The few rules of thumb that are usually applied to select an
appropriate cell size depend on the type of clustering. When data are clustered at random,
the cell size is selected so that there is approximately one datum per cell in the sparse
areas. If it is known that the high- or low-valued areas have been oversampled, the cell
size can be selected so that the weights give the minimum or maximum weighted mean of the
data, respectively. If there are a few extreme high or low values in the data, one should
systematically shift the origin of the cell network for a fixed cell size. The weights
obtained for each origin can then be averaged to obtain a unique set of weights for that
cell size. The results are more stable than with a single origin, and the estimated mean
obtained this way is not sensitive to small changes in the data locations.
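
The procedure can be sketched as follows. This is not the DECLUS program: the function names are hypothetical, and shifting the origin along the cell diagonal in equal steps is an assumption made here for simplicity.

```python
import math
from collections import Counter


def cell_weights(points, cell_size, origin=(0.0, 0.0)):
    """Weights for one grid: 1 / (number of points falling in the same cell)."""
    dx, dy = cell_size
    cells = [
        (math.floor((x - origin[0]) / dx), math.floor((y - origin[1]) / dy))
        for x, y in points
    ]
    counts = Counter(cells)
    return [1.0 / counts[c] for c in cells]


def cell_declustering_weights(points, cell_size, n_offsets=5):
    """Sum the weights over several origin shifts, then normalise them to sum to 1."""
    dx, dy = cell_size
    summed = [0.0] * len(points)
    for k in range(n_offsets):
        # Assumed offset scheme: equal steps along the cell diagonal.
        origin = (-k * dx / n_offsets, -k * dy / n_offsets)
        for i, w in enumerate(cell_weights(points, cell_size, origin)):
            summed[i] += w
    total = sum(summed)
    return [w / total for w in summed]


# Hypothetical layout: three clustered points and one isolated point.
pts = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.1), (5.5, 5.5)]
single_grid = cell_weights(pts, (1.0, 1.0))
averaged = cell_declustering_weights(pts, (1.0, 1.0))
```

For a single grid the weights sum to the number of occupied cells (two in this example), and the isolated point receives a larger weight than each clustered point.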

On the basis of the preceding observations, a new measure combining Thiessen polygons and
the distance of each point to its nearest neighbour was proposed. This "Coefficient of
Representativity" (CR) (Dubois, 2000a,b) was mainly developed to facilitate the comparison
of different sampling strategies and to define the impact of each measurement or monitoring
station on a survey.

The CR is a product of two terms:

o The first term takes the surface of the Thiessen polygon into account. It is equal to the
  ratio of the surface of the Thiessen polygon (S_Th) to the ideal surface it should have
  in a homogeneous monitoring network. This ideal surface is simply defined as the mean
  surface (S_m), that is the total surface of the investigated area (S_Total) divided by
  the number of sampling points N.
o The second term is equal to the ratio of the squared distance between a point and its
  nearest neighbour (NN_dist) to the mean surface of the Thiessen polygons. The choice of
  the squared distance is based on the desire to compare the structure of a monitoring
  network to a reference, which is here a regular grid where points are located in the
  middle of each cell of the grid. Here, the regular square grid has been kept as reference
  and, consequently, NN_dist^2 = S_Th.

Thus, the CR is defined by


$$CR = \frac{S_{Th}}{S_m} \cdot \frac{NN_{dist}^{2}}{S_m} \qquad (2)$$

If a point is isolated, then CR > 1. If the point is closer to a neighbouring point than
the theoretical distance, then CR < 1. The declustering step is simply performed by
rescaling the CRs so that they sum to 1 and using the relative value of the CR associated
with each point as its weight. As a result, the CR does not require any subjective decision
and benefits from the information provided by the borders delineating the study area.
Moreover, the weights associated with clustered points, which may nevertheless have large
polygons of influence, are reduced by taking the nearest-neighbour distances into account.
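
A minimal sketch of Equation (2) and of the normalisation step is given below. It assumes that the Thiessen polygon areas have already been computed by some other means; the function names and both test layouts are hypothetical.

```python
import math


def coefficient_of_representativity(points, polygon_areas, total_area):
    """CR_i = (S_Th,i / S_m) * (NN_dist,i^2 / S_m), with S_m = S_Total / N."""
    mean_area = total_area / len(points)
    cr = []
    for i, (xi, yi) in enumerate(points):
        # Distance from point i to its nearest neighbour (needs at least 2 points).
        nn_dist = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        cr.append((polygon_areas[i] / mean_area) * (nn_dist ** 2 / mean_area))
    return cr


def cr_weights(cr_values):
    """Declustering weights: CR values rescaled so that they sum to 1."""
    total = sum(cr_values)
    return [c / total for c in cr_values]


# Reference case: a regular 2 x 2 grid with unit cells, so every CR equals 1.
grid = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5), (1.5, 1.5)]
cr_grid = coefficient_of_representativity(grid, [1.0] * 4, total_area=4.0)

# Hypothetical clustered case: two close border points with large polygons
# and one isolated point.
clustered = [(0.2, 2.9), (0.2, 3.1), (5.0, 3.0)]
cr_clustered = coefficient_of_representativity(clustered, [9.0, 9.0, 18.0], total_area=36.0)
weights = cr_weights(cr_clustered)
```

In the clustered case the two close points have CR well below 1 despite their large polygons, while the isolated point has CR above 1 and receives almost all of the weight.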

3. Case study
In order to assess the relative performances of the above-described declustering methods, a
Digital Elevation Model (DEM) of Switzerland was used as an exhaustive data set from
which subsets were extracted. The mean values of the declustered subsets were then
compared to the mean value of the elevation given by the exhaustive data set. The DEM
contains 402 cells of size 10 km × 10 km. A shaded relief map of the DEM data and the
corresponding statistics are presented in Figure 2 and Table 1 respectively.

[Figure 2 legend: elevation classes (m): 305, 915, 2133, 3353, >3353]

Figure 2. Digital elevation model of Switzerland, resolution of 10 km x 10 km.

The high standard deviation of the elevation data in Switzerland, 60% of the mean, underlines
the high variability in the local terrain surface.

N     Minimum   Maximum   Mean   Median   Std. Dev.   Skewness
402   152       3505      1370   1524     826         0.717

Table 1: Descriptive statistics of the Swiss Digital Elevation Model (402 cells of 10 km x 10 km, the
elevation is given in meters).

The subsets of the DEM, simulating sampling campaigns or monitoring networks, were
expected to provide information for the whole surface of Switzerland. To better simulate
real case studies, the subsets were generated so as to mimic preferential sampling
strategies: five independent random sets (noted 1 to 5) of 300 points preferentially
sampled in areas of high elevation were prepared, each set being further decomposed into
smaller dependent subsets of 250, 200, 150, 100 and 50 points. The same operation was
repeated for 5 further sets (sets 6 to 10) for which the clustered data were located in regions
with lower values. The mean values of the elevation as well as the standard deviation for all
the 60 subsets are given in Table 2. One can see from this table that there is a clear need for
efficient declustering methods since the mean values of the elevation for all these subsets are
very poor estimates of the true mean given by the exhaustive data set (1370 m).
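
The text does not describe how the preferential subsets were drawn. The sketch below shows one possible way to generate nested, preferentially sampled subsets and is purely an assumption: each cell gets a selection weight that increases (or decreases) with its value, cells are ordered by weighted random keys (the Efraimidis-Spirakis scheme for weighted sampling without replacement), and the subset of size k is the first k cells of that ordering. The function name, the weight formula and the synthetic values are all hypothetical.

```python
import random


def nested_preferential_subsets(values, sizes, prefer_high=True, seed=0):
    """Return {size: list of cell indices}; smaller subsets are prefixes of larger ones."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    keys = []
    for idx, v in enumerate(values):
        rank = (v - lo) / span                       # 0 for the lowest cell, 1 for the highest
        # Assumed selection weight, kept strictly positive.
        w = 0.05 + (rank if prefer_high else 1.0 - rank)
        keys.append((rng.random() ** (1.0 / w), idx))  # weighted random key
    order = [idx for _, idx in sorted(keys, reverse=True)]
    return {k: order[:k] for k in sizes}


# Synthetic stand-in for the 402 DEM cells (not the real elevations).
values = [152.0 + 8.0 * i for i in range(402)]
sizes = [300, 250, 200, 150, 100, 50]
subsets = nested_preferential_subsets(values, sizes, prefer_high=True)
mean_50 = sum(values[i] for i in subsets[50]) / 50
```

Because the cells with the largest keys tend to be the high-valued ones, the smallest subsets are the most biased under this scheme.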

          300 points    250 points    200 points    150 points    100 points    50 points
          Mean  Std.    Mean  Std.    Mean  Std.    Mean  Std.    Mean  Std.    Mean  Std.
Set 1     1542  834     1645  854     1757  895     1993  880     2359  788     2356  774
Set 2     1628  807     1720  830     1838  864     2052  865     2368  773     2387  803
Set 3     1631  808     1735  818     1870  835     2071  844     2432  652     2444  649
Set 4     1627  807     1722  831     1839  863     2019  863     2369  773     2387  803
Set 5     1628  807     1735  818     1870  835     2071  844     2432  652     2444  649
Set 6     1170  794     1013  732      970  701      988  720      985  721     1042  798
Set 7     1147  762      995  700      946  659      963  679      972  701      994  726
Set 8     1146  764      986  683      946  659      952  668      956  684      960  693
Set 9     1160  786     1011  733      964  690      960  681      980  724     1009  770
Set 10    1160  786      997  710      956  682      976  709      968  704      985  732

Table 2. Mean values and standard deviations (Std.) of the random subsets extracted from the Swiss
DEM. Sets 1-5 and sets 6-10 have preferential samplings in regions of high and low values,
respectively. Units are in meters.

The declustering methods applied to the 60 data sets were the cell-declustering method,
Thiessen polygons closed by the country border, Thiessen polygons closed by the convex hull,
and the CR. The cell size of the cell-declustering method was defined on a case-by-case
basis. The weighted mean of each data set was calculated for several combinations of cell
sizes in the two directions (West-East and North-South), using 5 different offsets of the
origin for each cell size (Deutsch showed that 5 offsets are usually enough). For the data
sets belonging to sets 1 to 5, the cell size that gave the minimum declustered mean was
selected, since the clustered samples of those data sets were all located in areas of high
elevation. Conversely, since the data sets of sets 6 to 10 contain points clustered in areas
of low elevation, the rectangular cell size selected for them was the one that gave the
maximum weighted mean.
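
The selection rule can be sketched as a search over candidate cell sizes. The code below is an illustration under stated assumptions rather than the procedure actually used: the helper names, the diagonal origin shifts, the candidate sizes and the sample data are all hypothetical.

```python
import math
from collections import Counter


def cell_declustered_mean(points, values, cell_size, n_offsets=5):
    """Cell-declustered mean, with weights accumulated over shifted grid origins."""
    dx, dy = cell_size
    weights = [0.0] * len(points)
    for k in range(n_offsets):
        ox, oy = -k * dx / n_offsets, -k * dy / n_offsets  # assumed diagonal shifts
        cells = [(math.floor((x - ox) / dx), math.floor((y - oy) / dy)) for x, y in points]
        counts = Counter(cells)
        for i, c in enumerate(cells):
            weights[i] += 1.0 / counts[c]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def select_cell_size(points, values, candidate_sizes, clustered_in_high=True):
    """Pick the (dx, dy) giving the minimum mean (high clusters) or the maximum (low clusters)."""
    pick = min if clustered_in_high else max
    return pick(candidate_sizes, key=lambda size: cell_declustered_mean(points, values, size))


# Hypothetical data: four clustered high values and two isolated low values.
pts = [(1.1, 1.1), (1.2, 1.3), (1.3, 1.2), (1.4, 1.4), (5.5, 5.5), (8.5, 2.5)]
vals = [2000.0, 2000.0, 2000.0, 2000.0, 500.0, 600.0]
candidates = [(0.01, 0.01), (2.0, 2.0), (100.0, 100.0)]
best = select_cell_size(pts, vals, candidates, clustered_in_high=True)
print(best)  # (2.0, 2.0)
```

Very small cells (one point per cell) and very large cells (all points in one cell) both give equal weights and therefore the unweighted mean, which is why an intermediate size is selected here.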

4. Results
The Mean Absolute Percentage Error (MAPE) and the Root Mean Square Error (RMSE) were
used to quantify the differences between the estimated mean values and the true mean
elevation given by the exhaustive data set. These error measures are given in Table 3.
The Thiessen polygons gave the worst results, although the estimates clearly improved when
the country borders were taken into account. The CR gave results similar to those obtained
by the cell-declustering approach, and the best results in the worst-case scenario, that is
when only 50 samples are available. On average over all the data sets, the CR gave better
results, even though the cell-declustering method performed better when samples were
clustered in regions of low values.
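
The two error measures can be written as below. As an illustration only, they are applied to the raw (non-declustered) means of Set 1 from Table 2 against the true mean of 1370 m; the function names are hypothetical, and the resulting figures are not values reported in Table 3.

```python
import math


def mape(estimates, true_value):
    """Mean Absolute Percentage Error (%) of several estimates of one true value."""
    return 100.0 * sum(abs(e - true_value) for e in estimates) / (
        len(estimates) * abs(true_value)
    )


def rmse(estimates, true_value):
    """Root Mean Square Error, in the units of the estimates."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))


# Raw means of Set 1 (300, 250, 200, 150, 100 and 50 points), from Table 2.
set1_raw_means = [1542, 1645, 1757, 1993, 2359, 2356]
true_mean = 1370

set1_mape = mape(set1_raw_means, true_mean)   # about 41.8 %
set1_rmse = rmse(set1_raw_means, true_mean)   # about 657 m
```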

To better highlight the differences between the results obtained by the cell-declustering
method and by the CR, the estimated mean values derived from each subset are shown in
Figure 4. The X axis shows the code associated with each data set. This coding was defined
as follows: sets 1-30 are the data sets with a preferential sampling in regions of high
elevation, and sets 31-60 are those where the sampling was made preferentially in regions
of low values. Each group of 6 (e.g. 1-6, 7-12, ...) corresponds to the 300, 250, 200, 150,
100 and 50 points scenarios. Three ellipses (sets 41, 47 and 60, with 100, 100 and 50
points, respectively) highlight the three outliers of the CR which, as a result, generated
a higher mean RMSE and MAPE for the situation corresponding to the preferential sampling in
regions with lower elevation values. The maps showed that a few isolated points had a very
high weight compared to the average. This underlines the need to investigate further the
limitations of the CR, which is more sensitive than the cell-declustering method to the
relative locations of the samples. On the other hand, the results of the cell-declustering
method are influenced by the deviation of the sample mean from the mean of the exhaustive
data set. For the data sets with a preferential sampling in regions with low values, the
non-declustered mean deviated on average by around 26.4% from the true mean, while for the
other sets, for which the cell-declustering method gave significantly worse results than
the CR, the average deviation was 46%.
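
The coding of the data sets on the X axis of Figure 4 can be decoded as follows. The function `decode_set_code` is a hypothetical helper written for this illustration; it only restates the coding rule given in the text.

```python
SIZES = (300, 250, 200, 150, 100, 50)


def decode_set_code(code):
    """Map a Figure 4 code (1-60) to its set number, subset size and cluster type."""
    if not 1 <= code <= 60:
        raise ValueError("code must be between 1 and 60")
    group, position = divmod(code - 1, 6)
    return {
        "set": group + 1,                       # sets 1 to 10 of Table 2
        "n_points": SIZES[position],
        "clustered_in": "high" if code <= 30 else "low",
    }


# The three CR outliers highlighted in Figure 4.
outliers = [decode_set_code(c) for c in (41, 47, 60)]
```

Under this rule the three outliers correspond to set 7 with 100 points, set 8 with 100 points and set 10 with 50 points, all of them clustered in regions of low values.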

Sets      Type of   Cell-          Polygonal declustering   Polygonal declustering   Coefficient of
          error     declustering   with convex hull         with country borders     Representativity
1 to 10   MAPE      8.9            21.1                     15.5                     7.5
1 to 10   RMSE      175            377                      261                      139
1 to 5    MAPE      11.7           30.1                     19.6                     7.7
1 to 5    RMSE      224            501                      327                      143
6 to 10   MAPE      6.0            12.1                     11.4                     7.2
6 to 10   RMSE      103            181                      167                      132

Table 3. Mean Absolute Percentage Error (%) and Root Mean Square Error (m) in estimating the true
elevation mean based on the declustered points of the subsets of the sets 1 to 10. Sets 1 to 5 are
preferentially clustered in areas of high elevation, while sets 6 to 10 are preferentially clustered in
areas of low elevation.

[Figure 3 plot: Root Mean Square Error between declustered and true elevation mean (y axis, 0 to 700) versus number of sampling points (x axis, 0 to 350); series: cell declustering, Thiessen polygons with convex hull, Thiessen polygons with country borders, CR.]

Figure 3. Comparisons of the elevation mean RMSE of the declustering methods

The CR seems to be a very good candidate as a future declustering method. It has the strong
advantage of being independent of any subjective judgement and can therefore be easily
automated. However, the method is computationally heavier than cell-declustering, and its
limitations for certain sampling schemes still need to be clearly defined.
[Figure 4 plot: elevation mean (m) (y axis, 1000 to 2000) versus sample set number (x axis); series: cell declustering, CR, true mean.]

Figure 4. Estimates of the true mean obtained by the cell-declustering method and the CR for each
subset: sets 1-30 are data sets with a preferential sampling in regions of high elevation, sets 31-60 are
the sets where the sampling has been made preferentially in regions of low values. Each group of 6
(1-6, 7-12, …) corresponds to the 300, 250, 200, 150, 100 and 50 points scenarios.

Acknowledgements

M. Saisana gratefully acknowledges support provided by a grant from the Bodossakis foundation.

5. References
Dubois, G. (2000a). How representative are samples in a sampling network? Journal of Geographic
Information and Decision Analysis, 4(1): 1-10.

Dubois, G. (2000b). Intégration de Systèmes d’Informations Géographiques (SIG) et de méthodes
géostatistiques. Ph.D. Thesis (in French), Inst. of Mineralogy & Petrography, Dept. of Earth
Sciences, University of Lausanne, 260 p.

Thiessen, A. H. (1911). Precipitation average for large areas. Monthly Weather Review, 39: 1082-1084.

Journel, A. G. (1983). Nonparametric estimation of spatial distributions. Mathematical Geology,
15(3): 445-468.

Deutsch, C. V. (1989). DECLUS: a FORTRAN 77 program for determining optimum spatial
declustering weights. Computers & Geosciences, 15(3): 325-332.

Isaaks, E. H. & Srivastava, R. M. (1989). An introduction to applied geostatistics. Oxford University
Press, 561 p.
