BAYREUTH
net/publication/268308168
1 author:
Peter Lischer
ConStat Consulting
All content following this page was uploaded by Peter Lischer on 04 October 2016.
Abstract
Key words and phrases: Interlaboratory studies, robust distance, robust
estimation of components of variance, multivariate outlier.
AMS 1991 subject classifications: Primary 62F35; secondary 62J10.
1 Introduction
For the results of analytical chemical measurements to be meaningful, procedures
must be developed well enough that a reanalysis does not drastically change the
conclusions, and specified well enough that different laboratories reach similar
conclusions for the same sample. This means that there has to be a standard, i.e.
a written document that lays down in full detail how the test should be carried
out. A standardised method has to be robust; that is, small variations in the
procedure should not produce unexpectedly large changes in the results. (ISO,
* This paper won the 1995 W.J. Youden Award in Interlaboratory Testing from the
American Statistical Association.
252 P. LISCHER
2 Method-performance studies
$$ y_{ij} = m + b_i + e_{ij}, $$
where $m$ is the true (or consensus) value, $b_i$ is the laboratory bias with
variance $\sigma_L^2$ and $e_{ij}$ is the replication error with variance
$\sigma_r^2$. The $b_i$ and $e_{ij}$ are assumed to be uncorrelated and centred.
The parameter $\sigma_r$ is called the repeatability standard deviation and
$\sigma_R = (\sigma_L^2 + \sigma_r^2)^{1/2}$ the reproducibility standard deviation.
Repeatability and reproducibility are the traditional precision parameters in
chemistry (ISO, 1987).
The repeatability ($r$) of the method is the value below which the absolute
difference between two single analytical results obtained with the same method on
identical sample material and under constant conditions as regards laboratory,
analyst, apparatus, chemicals and interval of time, is expected (with 95%
confidence) to lie. The reproducibility ($R$) is the value below which the absolute
difference between two single analytical results obtained with the same method on
identical sample material and under different conditions of laboratory, analyst,
apparatus, chemicals and interval of time, is expected (with 95% confidence) to
lie. For normally distributed errors we have
$$ r = \sqrt{2}\,\Phi^{-1}(0.975)\,\sigma_r $$
$$ R = \sqrt{2}\,\Phi^{-1}(0.975)\,\sqrt{\sigma_L^2 + \sigma_r^2}, $$
where Φ(z) is the cumulative standard normal distribution. The measurements yij
are not all uncorrelated. We have
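$$
\mathrm{Cov}(y_{ij}, y_{kl}) =
\begin{cases}
\sigma_L^2 + \sigma_r^2, & \text{if } (i, j) = (k, l) \\
\sigma_L^2, & \text{if } i = k \text{ and } j \neq l \\
0, & \text{if } i \neq k
\end{cases}
$$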
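The relation between the standard deviations and the limits $r$ and $R$ can be checked numerically. The following is a minimal sketch, assuming normally distributed errors as in the formulas above; the function name `precision_limits` and the input values are illustrative choices, not part of the paper.

```python
from math import sqrt
from statistics import NormalDist


def precision_limits(sigma_r, sigma_L):
    """Return (r, R): 95% repeatability and reproducibility limits."""
    # sqrt(2) * Phi^{-1}(0.975) is roughly 2.77
    factor = sqrt(2.0) * NormalDist().inv_cdf(0.975)
    r = factor * sigma_r
    R = factor * sqrt(sigma_L ** 2 + sigma_r ** 2)
    return r, R


# Illustrative standard deviations (assumed values, in mg/l)
r, R = precision_limits(sigma_r=0.08, sigma_L=0.13)
print(f"r = {r:.3f}, R = {R:.3f}")
```

Both limits are simply the corresponding standard deviation multiplied by about 2.77, so $R \geq r$ always holds.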
A drawback of the uniform-level design is that the operator, when testing identical
samples in succession, may be influenced by the result of his first test. To prevent
INTERLABORATORY STUDIES 253
this, an alternative split-level design may be used. In this procedure, instead of
using two samples that the operator has been told are identical, or performing two
tests on the same specimen of material, two series of $n$ samples are prepared at
slightly different levels $m + \Delta$ and $m - \Delta$ (where $\Delta$ is small)
and each of the $n$ laboratories receives one sample of series 1 and one sample of
series 2 for testing. The values of $\sigma_r$ and $\sigma_R$ derived from a
split-level experiment are valid for the mean level $m$.
The aim of a method-performance study consists of finding estimates for the
precision parameters $\sigma_r$ and $\sigma_R$ which are characteristic of the
particular method and not only of the specific study. To achieve this aim, the
following conditions must be fulfilled:
The conditions under a) are relatively easy to meet, although interlaboratory studies
often have to be conducted with volunteers instead of randomly selected participants.
On the other hand, the conditions in b) raise problems which often have not been
satisfactorily solved up to now. The classical analysis of variance supposes normally
distributed errors. However, every analytical chemist knows that deviant or suspect
results occur much more frequently than the normal distribution would predict.
There are many reasons for this; it is enough if just one parameter of the analytical
process is not completely under control. Since very few suspect values deviate by
an order of magnitude, it is often difficult to decide whether the suspect value
should be regarded as valid. In evaluating method-performance studies, extreme
results or all results obtained from a suspect laboratory are often eliminated before
the analysis of variance is conducted, in order to avoid excessively high values for
repeatability and reproducibility and, hence, a bad evaluation of the precision of
the method. But any such elimination inevitably entails the risk of underestimating
laboratory bias and replication error. An international convention about outlier
tests to be used was adopted (Horwitz, 1988). It does not, however, change the
unsatisfactory ’either-or’ situation, which is typical for all outlier tests: as soon as
the conditions for elimination are fulfilled, the value of the desired quantity changes
abruptly. Moreover, the proposed Cochran and Grubbs tests are far from the best
possible ones; e.g. the Grubbs test cannot even safely reject two distant outliers
out of 20 (Hampel, 1985). On the other hand, 30% outliers in method-performance
studies are rather the rule than the exception (Horwitz & Albert, 1986). These and
other unsatisfactory features of outlier tests led the Swiss Federal Committee for
Official Methods in Food Analysis (Lischer, 1987; SLB, 1989) and the Analytical
Methods Committee of the Royal Society of Chemistry (AMC, 1989) to propose
solutions which are similar. Instead of the hitherto usual outlier tests they use robust
statistical methods to calculate σr and σR . The underlying principle is Huber’s
proposal 2 (Huber, 1964). In the following we present two methods to estimate the
scale parameters σr and σR , the official SLB-method (SLB, 1989) and an alternative
method inspired by Rousseeuw’s scale estimator Qn (Rousseeuw & Croux, 1993).
where $\psi_c(t) = \max(-c, \min(t, c))$ and
$S^* = 1.4826\,\mathrm{med}_{ij}\{|x_{ij} - \mathrm{med}_j\, x_{ij}|\}$.
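The two quantities just defined can be sketched in a few lines. This is only an illustration of $\psi_c$ and $S^*$, not of the full SLB procedure; it assumes that $\mathrm{med}_j\, x_{ij}$ is the median of the replicates of laboratory $i$, and the tuning constant `c = 1.5` and the small data set are assumptions chosen for the example.

```python
from statistics import median


def psi(t, c):
    """Huber's clipping function: psi_c(t) = max(-c, min(t, c))."""
    return max(-c, min(t, c))


def initial_scale(labs):
    """S* = 1.4826 * med_ij |x_ij - med_j x_ij|.

    `labs` is a list with one list of replicates per laboratory.
    """
    deviations = []
    for replicates in labs:
        centre = median(replicates)  # within-laboratory median (assumption)
        deviations.extend(abs(x - centre) for x in replicates)
    return 1.4826 * median(deviations)


# Small illustrative data set: five laboratories, two replicates each
labs = [[4.2, 4.4], [3.1, 3.1], [3.2, 3.1], [3.2, 2.7], [2.9, 3.1]]
print(psi(3.7, 1.5), psi(-0.4, 1.5))  # large values are clipped at c
print(initial_scale(labs))
```

Clipping with $\psi_c$ limits the influence of any single residual to $\pm c$, which is what makes the resulting estimates insensitive to a few deviant values.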
The two quantities $h_i = (\bar{x}_{i\cdot} - T)/S$ and $k_i = s_i/S_r$, where
$\bar{x}_{i\cdot}$ and $s_i$ are the mean and within-laboratory standard deviation
of laboratory $i$, will later be used to detect laboratories which have produced
unreliable results.
The third step is inspired by Huber’s proposal 2, but location T and scale S are
calculated separately whereas the proposed algorithm of the Analytical Methods
Committee of the Royal Society of Chemistry (AMC, 1989) calculates T and S
simultaneously. Reichenbach (1989) compared the two algorithms in a simulation
study. He showed that the AMC-algorithm converges slowly and that the breakdown
point is lower than 25% for moderate sample sizes. This is too small as 30% outliers
are not uncommon in interlaboratory trials. A procedure for evaluating such trials
should allow two bad laboratories out of eight. The SLB-procedure with separate
determination of location and scale has better convergence properties, a comparable
relative efficiency and a breakdown point of ≈ 30%.
The constant $f_n$ is a small-sample correction factor. The scale estimator $Q_n$
does not need any location estimate. Instead of measuring how far away the
observations are from a central value, $Q_n$ looks at a typical distance between
observations, which remains valid for asymmetric distributions. The Gaussian
efficiency of $Q_n$ is 82%.
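In the case of a split-level design, $Q_n$ immediately yields estimates for
$\sigma_r$ and $\sigma_R$. Let $\{(y_{i1}, y_{i2}),\ i = 1, 2, \ldots, n\}$ be the
results of the experiment, $v = \{(y_{i1} + y_{i2})/2\}$ and
$w = \{(y_{i1} - y_{i2})/2\}$, so that $\mathrm{Var}[v_i] = \sigma_L^2 + \sigma_r^2/2$
and $\mathrm{Var}[w_i] = \sigma_r^2/2$. Then
$$ \hat{\sigma}_r = \sqrt{2}\, Q_n(w) \qquad (1) $$
$$ \hat{\sigma}_R = \sqrt{\hat{\sigma}_L^2 + \hat{\sigma}_r^2} = \sqrt{Q_n^2(v) + Q_n^2(w)} \qquad (2) $$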
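Equations (1) and (2) can be tried out with a naive implementation of $Q_n$. The sketch below assumes the definition of Rousseeuw & Croux (1993): the $k$-th smallest of the pairwise distances $|x_i - x_j|$, $i < j$, with $k = \binom{h}{2}$ and $h = \lfloor n/2 \rfloor + 1$, multiplied by 2.2191. The small-sample factor $f_n$ is omitted, and the data are made-up numbers, not results from the paper.

```python
from itertools import combinations
from math import comb, sqrt


def qn(x):
    """Naive O(n^2) Qn scale estimate, without the small-sample factor f_n."""
    n = len(x)
    h = n // 2 + 1
    k = comb(h, 2)  # rank of the order statistic among pairwise distances
    distances = sorted(abs(a - b) for a, b in combinations(x, 2))
    return 2.2191 * distances[k - 1]


def split_level_estimates(pairs):
    """Return (sigma_r, sigma_R) from pairs (y_i1, y_i2), as in (1) and (2)."""
    v = [(y1 + y2) / 2 for y1, y2 in pairs]
    w = [(y1 - y2) / 2 for y1, y2 in pairs]
    sigma_r = sqrt(2.0) * qn(w)
    sigma_R = sqrt(qn(v) ** 2 + qn(w) ** 2)
    return sigma_r, sigma_R


# Hypothetical split-level results; the last laboratory is deviant
pairs = [(10.4, 9.5), (10.1, 9.3), (10.9, 9.8),
         (10.2, 9.6), (10.6, 9.4), (13.0, 9.7)]
print(split_level_estimates(pairs))
```

Because $Q_n$ uses only differences between observations, the constant offset $\Delta$ contained in every $w_i$ does not affect $\hat{\sigma}_r$.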
$$ \{|y_{ij} - y_{kl}|;\ i = 1, 2, \ldots, n-1,\ j = 1, 2, \ldots, p,\ k = i+1, i+2, \ldots, n,\ l = 1, 2, \ldots, p\} $$
Then
$$ \hat{\sigma}_r = 2.2191\, d_r / \sqrt{2} $$
and
$$ \hat{\sigma}_R = 2.2191\, d_R. $$
                          Laboratories
Sample      1      2      3      4      5      6      7      8
(1)       4.2    3.1    3.2    3.2    3.2    3.2    3.2    3.2
(1)       4.4    3.1    3.2    3.1    3.1    3.3    3.2    2.7
(2)      26.2   26.5   27.0   26.8   26.4   28.8   28.2   26.0
(2)      26.0   26.6   27.2   26.5   26.2   28.0   28.2   25.9
(3)      48.5   44.4   46.4   45.7   44.1   48.8   45.1   45.5
(3)      48.3   44.5   46.6   46.0   45.0   48.5   45.6   49.3

                          Laboratories
Sample      9     10     11     12     13     14     15
(1)       3.0    2.9    3.1    2.6    3.6    3.0    3.1
(1)       3.0    3.1    3.1    2.7    3.5    3.1    3.1
(2)      25.9   26.2   29.6   24.7   29.2   25.1   25.9
(2)      26.0   26.4   30.0   24.1   29.6   25.2   26.0
(3)      43.8   45.0   50.7   45.8   49.0   42.9   44.9
(3)      44.2   45.2   50.6   46.1   50.0   42.9   44.7

Table 1: A trial of determination of nitrate in drinking water [mg/l] for three
samples at different concentration levels performed by fifteen laboratories. Each
sample has two rows, one per replicate.
Mandel (1989) presented a procedure for flagging outliers, based on two statistics,
called h and k. The h-values are calculated independently for each concentration
level. The overall average at that level is subtracted from each cell average and
the difference is divided by the standard deviation. The result is a measure of
where a particular laboratory's average lies with respect to the consensus value.
The k-values are also calculated independently at each level. Each k-value is
simply the ratio of the within-cell standard deviation to the pooled value over all
laboratories at that level. It is evident that this non-robust procedure suffers
from the masking effect. There is, however, a simple remedy. Instead of Mandel's h
and k we use the two quantities $h_i = (\bar{x}_{i\cdot} - T)/S$ and
$k_i = s_i/S_r$ introduced earlier.
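To make the classical statistics concrete, here is a minimal sketch of the non-robust h and k as described in the paragraph above, applied to the sample (1) rows of Table 1. It assumes that "the standard deviation" in the definition of h is the sample standard deviation of the cell averages and that the pooled value is the root mean square of the within-cell standard deviations; the function name `mandel_h_k` is an illustrative choice.

```python
from math import sqrt
from statistics import mean, stdev


def mandel_h_k(cells):
    """Classical (non-robust) h and k for one level.

    `cells` holds one sequence of replicates per laboratory.
    """
    cell_means = [mean(c) for c in cells]
    cell_sds = [stdev(c) for c in cells]
    grand_mean = mean(cell_means)
    sd_of_means = stdev(cell_means)
    pooled_sd = sqrt(mean(s * s for s in cell_sds))
    h = [(m - grand_mean) / sd_of_means for m in cell_means]
    k = [s / pooled_sd for s in cell_sds]
    return h, k


# Sample (1) of Table 1: two replicates for each of the fifteen laboratories
rep1 = [4.2, 3.1, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2, 3.0, 2.9, 3.1, 2.6, 3.6, 3.0, 3.1]
rep2 = [4.4, 3.1, 3.2, 3.1, 3.1, 3.3, 3.2, 2.7, 3.0, 3.1, 3.1, 2.7, 3.5, 3.1, 3.1]
h, k = mandel_h_k(list(zip(rep1, rep2)))
for lab, (hi, ki) in enumerate(zip(h, k), start=1):
    print(f"lab {lab:2d}  h = {hi:6.2f}  k = {ki:5.2f}")
```

In this sketch the extreme mean of laboratory 1 inflates the standard deviation of the cell averages, so the h-values of all other laboratories shrink; for example, laboratory 13 obtains an h of only about 1.0. Replacing the mean and standard deviation by robust estimates $T$, $S$ and $S_r$ avoids this effect.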
We will illustrate this procedure in terms of an interlaboratory study published
partially in the SLB (1989). It deals with the photometric determination of nitrate
in drinking water at three concentration levels. Every laboratory determined two
replicates for each sample (Table 1). The statistical analysis of the data was done
with the SLB- and the $Q_n$-method (Table 2). The estimates do not differ much.
The h-values are displayed in Figure 1 and the k-values in Figure 2. h- and k-values
with absolute values ≤ 2 are traditionally considered "satisfactory", those between
2 and 3 "questionable" and those ≥ 3 "unsatisfactory".
At a glance, we see what is going on: laboratories 1, 11, 12 and 13 got at least
one deviant mean value with |h| > 2 and laboratory 8 has a high within-laboratory
variation for samples 1 and 3 (|k| > 3). The organiser of the study has now to find
SLB-method
              µ̂       σ̂L      σ̂r      σ̂R
Sample 1     3.12    0.13    0.08    0.16
Sample 2    26.49    1.45    0.21    1.47
Sample 3    46.18    2.34    0.30    2.36

Qn-method
              µ̂       σ̂L      σ̂r      σ̂R
Sample 1     3.12    0.18    0.13    0.22
Sample 2    26.49    1.31    0.26    1.33
Sample 3    46.18    1.98    0.26    2.00
3 Laboratory-performance studies
These two aims are somewhat divergent, but the motivation is the same: the
identification of laboratories that produce data of unacceptable quality.
Most proficiency testing schemes proceed by comparing the bias estimate
$(x - x_{\mathrm{true}})$ with a target value for the standard deviation that forms
the criterion of performance. An obvious approach is to form z-scores given by
$$ z = \frac{x - x_{\mathrm{true}}}{\sigma}, $$
where $\sigma$ is the target value for the standard deviation. If $\hat{x}$ and
$\hat{\sigma}$ are good estimates of $x_{\mathrm{true}}$ and the standard deviation
$\sigma$, respectively, and if the underlying distribution
where $\bar{x}$ is the mean of the group and $V$ is the sample covariance matrix.
Asymptotically the $d_i^2$ follow a chi-squared distribution with $p$ degrees of
freedom. If $\bar{x}$ and $V$ were not estimated but were known population
parameters, outlying values of $x_i$ would yield large values of the squared
distance $d_i^2$. However, the effect of such values on the estimation of $\bar{x}$
and $V$ leads to the rapid breakdown of the Mahalanobis distance for the detection
of outliers, particularly if several outliers are present. In Rousseeuw & Leroy
(1987) it was suggested to calculate robust distances by using the Minimum Volume
Ellipsoid estimator. This estimator has a maximal breakdown point but is not very
efficient. Further, a rule of thumb of Rousseeuw states that there should be at
least five observations per dimension ($n/p \geq 5$). This requirement
We will illustrate the procedure in terms of a proficiency test carried out at the
Swiss Federal Research Station for Agricultural Chemistry and Hygiene of the
Environment (FAC) with 23 participating laboratories. The primary goal was a
laboratory comparison. A number of dried sewage sludges was analysed for different
heavy metals and P and K. In this numerical example the results of the Zn-analyses
are presented (Table 3).
Each laboratory $L_i$ ($i = 1, \ldots, 23$) had to analyse 10 specimens
$S_1, S_2, \ldots, S_{10}$. $S_1, S_2, \ldots, S_5$ were taken from 5 different
homogenised samples of dried sludges, $S_6, S_7, \ldots, S_9$ were mixtures
($S_6$: 50% $S_1$ + 50% $S_2$; $S_7$: 50% $S_3$ + 50% $S_2$; $S_8$: 50% $S_4$ +
50% $S_2$; $S_9$: 50% $S_5$ + 50% $S_2$) and $S_{10}$ was an extracted solution of
$S_1$, produced in a laboratory of the FAC. This specimen was analysed to obtain an
estimate of the effect of the sample preparation for different laboratories. The
mixtures $S_6$, $S_7$, $S_8$ and $S_9$, together with $S_1$, $S_3$, $S_4$ and
$S_5$, allowed four alternative determinations of the true content of $S_2$
(Table 4). Inconsistencies between the direct and the indirect determinations of
the true concentration of sample $S_2$ indicated poor performance by a laboratory.
With the four supplementary results for the true concentration of specimen $S_2$
it was possible to get a more reliable value of the true content of $S_2$ and to
estimate the scale parameters $\sigma_r$ and $\sigma_R$ for this concentration level.
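The indirect determinations follow from the mixing proportions: if a mixture contains 50% of a directly analysed sample and 50% of $S_2$, then $S_2 = 2 \cdot \text{mixture} - \text{sample}$. A minimal sketch, assuming exact 50/50 mixing and using the Zn results of laboratory 1 from Table 3 (the function name `indirect_s2` is an illustrative choice):

```python
def indirect_s2(direct, mixtures):
    """Solve mixture = 0.5 * direct + 0.5 * S2 for S2, pair by pair."""
    return [2 * mix - d for d, mix in zip(direct, mixtures)]


# Laboratory 1: results for S1, S3, S4, S5 and for the mixtures S6, S7, S8, S9
direct = [1531, 1264, 2068, 1150]
mixtures = [2252, 2109, 2526, 2096]
print(indirect_s2(direct, mixtures))  # [2973, 2954, 2984, 3042]
```

These four values agree with the entries $w_1, \ldots, w_4$ for laboratory 1 in Table 4 and can be compared with its direct result of 2945 for $S_2$.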
                               Specimens
Labs     s1     s2     s3     s4     s5     s6     s7     s8     s9    s10
 1     1531   2945   1264   2068   1150   2252   2109   2526   2096   1499
 2     1527   2440   1250   2110   1160   2080   2140   2430   2010   1500
 3     1500   2900   1200   1900   1100   2200   1900   2400   2000   1746
 4     1490   2830   1210   2060   1090   2220   2060   2440   2030   1350
 5     1547   2900   1238   2060   1120   2211   2094   2463   1961   1500
 6     1432     --     --     --   1076   2029     --          1813   1368
 7     1393   2466   1131   1886   1011   2040   1985   2321   1133   1430
 8     1579   2954   1188   2240   1229   2313   2172   2642   2160   1524
 9     1520   2950   1240   2080   1080   2250   2070   2520   2010   1514
10     1653   3199   1369   2489   1182   2537   2565   2729   2371   1648
11     1577   3030   1264   2113   1129   2296   2191   2676   2129   1600
12     1536   3014   1260   2089   1143   2304   2113   2513   2089   1616
13     1500   3150   1220   2030   1070   1940   1780   2370   1940   1520
14     1512   2779   1252   2089   1120   2245   2042   2517   2000   1485
15     1458   2688   1214   1896   1122   1618   1836   2160   1849   1655
16     1500   3160   1160   2150   1065   2180   2058   2482   2003   1428
17     1525   2865   1245   2125   1120   2215   2180   2600   2065   1495
18     1488   2720   1713   1968   1044   2143   2016   2344   1992   1448
19     1464   2857   1187   2024   1073   2153   2024   2413   1932   1576
20     1600   2400   1300   1900    440   2400   1900   2300   1900   1500
21     1558   3077   1279   2082   1116   2271   2152   2558   2133   1481
22     1595   3271   1344   2241   1202   2420   2280   2670   2134     --
23     1803   2408   1084   2309   1473   2860   2063   2429   1369   1250
m̂j     1527   2882   1238   2075   1114   2222   2071   2479   2003   1507
σ̂j     60.0  271.3   62.8  132.5   60.6  153.2  126.3  139.2  128.5  102.5
where the $m_j$ are the true (or consensus) values, the $b_{ij}$ are the random
laboratory effects assumed to have mean 0 and variance $\sigma_{Lj}^2$ and the
$e_{ij}$ are the replication errors assumed to have mean 0 and variance
$\sigma_{rj}^2$. In general $\sigma_{Lj}^2$ and $\sigma_{rj}^2$ are
concentration-dependent but if the same analytical method is used and the
concentration levels are not too different the quotient
$$ q = \frac{\sigma_{Lj}^2}{\sigma_{Lj}^2 + \sigma_{rj}^2} $$
Labs     s2     w1     w2     w3     w4   rd_i^2
 1     2945   2973   2954   2984   3042      2.1
 2     2440   2633   3030   2750   2860     14.9
 3     2900   2900   2600   2900   2900     31.3
 4     2830   2950   2910   2820   2970      7.0
 5     2900   2875   2950   2866   2802      1.0
 6       --   2626     --     --   2550     15.5
 7     2466   2687   2839   2756   1255     98.0
 8     2954   3047   3156   3044   3091     17.3
 9     2950   2980   2900   2960   2940      1.7
10     3199   3421   3761   2969   3560     30.7
11     3030   3015   3118   3239   3129      4.8
12     3014   3072   2966   2937   3035      2.6
13     3150   2380   2340   2710   2810     26.0
14     2779   2978   2832   2945   2880      1.5
15     2688   1778   2458   2424   2576     63.9
16     3160   2860   2956   2814   2941     13.4
17     2865   2905   3115   3075   3010      4.3
18     2720   2798   2319   2720   2940    203.0
19     2857   2842   2861   2802   2791      6.6
20     2400   3200   2500   2700   3360    385.4
21     3077   2984   3025   3034   3150      4.6
22     3271   3245   3216   3099   3066      8.9
23     2408   3917   3042   2549   1265    379.2
The data structure of the $z_{ij}$ is the same as the structure of the results of a
uniform-level experiment. Therefore, the covariance matrix $S$ has a very simple
form: 1 on the diagonal and $q$ elsewhere. With the SLB- or the $Q_n$-method we
immediately get a robust estimate $\hat{q}$ of $q$, and hence $\hat{S}$ of $S$, and
$$ rd_i^2 = z_i \hat{S}^{-1} z_i^T $$
is the squared robust distance of laboratory $i$.
The critical value is $\chi^2_{10,\,0.975} = 20.48$. Laboratories with values of
$rd_i^2 > 20.48$ must be considered as unreliable.
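Because $\hat{S}$ has ones on the diagonal and a single value $\hat{q}$ elsewhere, the quadratic form can be evaluated without a matrix inversion. The sketch below uses the closed-form inverse of such a matrix (a standard identity, valid for $-1/(p-1) < q < 1$), which gives $z S^{-1} z^T = \left[\sum_j z_j^2 - q\,(\sum_j z_j)^2 / (1 + (p-1)q)\right] / (1 - q)$. The z-scores and the value of `q_hat` are hypothetical; only the critical value 20.48 is taken from the text.

```python
def robust_distance_sq(z, q):
    """z S^{-1} z^T for S with ones on the diagonal and q elsewhere."""
    p = len(z)
    total = sum(z)
    sum_sq = sum(v * v for v in z)
    return (sum_sq - q * total ** 2 / (1 + (p - 1) * q)) / (1 - q)


CRITICAL = 20.48  # 0.975 quantile of chi-squared with 10 degrees of freedom

# Hypothetical z-scores of one laboratory for ten specimens
z = [0.4, -1.1, 0.3, 0.9, -0.2, 0.5, 1.2, -0.6, 0.1, 0.7]
q_hat = 0.6  # hypothetical estimate of q
rd2 = robust_distance_sq(z, q_hat)
print(f"rd^2 = {rd2:.2f}, unreliable: {rd2 > CRITICAL}")
```

For $q = 0$ the expression reduces to the plain sum of squared z-scores; a positive $q$ discounts deviations that all point in the same direction, since these are expected from a common laboratory effect.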
In Fig. 3 the z-scores are presented. The first line below the graphic shows the
laboratory codes, the second the number of analysed specimens, the third the
squared robust distance and the fourth the corresponding p-values. We used the
SLB-method to get $\hat{q}$.
in an indirect way from samples $S_6$ and $S_1$, from $S_7$ and $S_3$, from $S_8$
and $S_4$ and from $S_9$ and $S_5$. We could organise a future proficiency test in
the following way: at each round four specimens $P_1$, $P_2$, $P_3$ and $P_4$ are
distributed. $P_1$ and $P_2$ are new samples; $P_3$ and $P_4$ are mixtures of $P_1$
and $P_2$ with $P_0$, where $P_0$ is a specimen from an earlier round whose true
value is known only to the organiser. The organiser can then evaluate any
improvement in performance and even estimate the precision parameters $\sigma_r$
and $\sigma_R$ at the concentration level of $P_0$.
At a glance, we see the laboratories which have analytical problems with the
determination of Zn. Similar graphics can be produced for the other elements. As
the confidentiality of the results is extremely important in this type of
laboratory-performance study, the organiser distributes to the laboratories only
graphics which contain exclusively their own results. Instead of the results of one
element for all laboratories (Fig. 3), he distributes graphics which show the
results of all elements for a particular laboratory.
4 Conclusions
Monitoring the amount of pollutants in soil, water, air, plants, food, etc. is
important nowadays. Analytical tests are required to judge contamination. The
fascinating thing about analytical chemical measurements is that they can quantify
chemical contents objectively. A drawback is that they suffer from a lack of
comparability. It is common knowledge amongst those who practise analysis for trade
and commerce that analysts can obtain different results on the same material.
Obviously it may not be in their interest to expose this fact. It is in the field
of public health and environmental monitoring, where determinand concentrations are
often small and where slight differences may be significant, that interlaboratory
variation has received most attention. The disturbing thing is the suggestion of
unreliability and its possible diffusion to the general public as well as to the
governments responsible in cases where important decisions must be made on the
basis of chemical measurements. However, the situation is not as bad as it seems.
If chemists and statisticians collaborate, try to understand each other's problems
and use realistic models, representative samples, standardised and robust
analytical methods, reference materials, interlaboratory tests and good robust
statistics, errors can be controlled.
References
[1] AMC (1989): Analytical Methods Committee. Robust Statistics Part 2:
Inter-laboratory Trials. Analyst 114 1699-1702.
[3] Hampel, F.R. (1985): The Breakdown Points of the Mean Combined With
Some Rejection Rules, Technometrics 27 95-107.
[5] Horwitz, W. (1988): Protocol for the Design, Conduct and Interpretation of
Collaborative Studies. Pure & Appl. Chem. 60(6) 855-864.
[10] Reichenbach, A. (1989): Robuste Methoden für die Auswertung von
Ringversuchen. Diplomarbeit ETH-Zürich.
[11] Rousseeuw, P.J. & Leroy, A.M. (1987): Robust Regression and Outlier
Detection. Wiley, New York.
[12] Rousseeuw, P.J. & van Zomeren, B.C. (1990): Unmasking multivariate outliers
and leverage points. J. Amer. Statist. Assoc. 85 633-639.
[13] Rousseeuw, P.J. & Croux, C. (1993): Alternatives to the Median Absolute
Deviation, J. Amer. Statist. Assoc. 88 1273-1283.