Horwitz 1982
Evaluation of
Analytical Methods Used for
Regulation of Foods and Drugs
Presented at the 95th Annual Meeting of the Association of Official Analytical Chemists, Washington, D.C., Oct. 19, 1981.
This article not subject to U.S. Copyright. Published 1981 American Chemical Society.
ANALYTICAL CHEMISTRY, VOL. 54, NO. 1, JANUARY 1982 · 67 A

Although the Association of Official Analytical Chemists (AOAC) has been evaluating and approving methods of analysis for almost 100 years, there is practically no discussion in the Journal of the Association of Official Analytical Chemists of the criteria for determining which methods should be approved for regulatory use. These decisions are usually made on the basis of a method's performance in interlaboratory collaborative studies.

John Mandel of the National Bureau of Standards, in his 1981 Shewhart medal address (1), pointed out that the basic objective of conducting interlaboratory tests is not to detect the known statistically significant differences among laboratories: "The real aim is to achieve the practical interchangeability of test results." Interlaboratory tests are conducted to determine how much allowance must be made for variability among laboratories in order to make the values interchangeable.

An irreducible difference exists between supposedly identical measurements made in different laboratories. This point was recently demonstrated by a group of New Zealand government laboratories in attempting to minimize the discrepancies in values for blood alcohol between laboratories. The laboratories went to great pains to discover every source of error, even to the extreme of moving analysts from one laboratory to another. They found that an analyst increases his or her intra-analyst variability when moved to a different laboratory environment. They concluded that the only way to eliminate interlaboratory variability was to conduct all analyses in a single laboratory and presumably by the same analyst. Under our legal system this solution is impossible, since defendants accused by laboratory evidence have a constitutional right to produce rebuttal evidence from any laboratory of their choice. Therefore, the important question to be answered in the evaluation of methods of analysis is how much allowance must be made for between-laboratory variability in interpreting the values produced by different laboratories. If the variability or error produced by the method is excessive (that is, it does not permit effective regulation as required by the statute), the method must be judged unacceptable for the intended purpose.

The purpose of this paper is to suggest some practical limits of acceptable variability in methods of analysis required by AOAC's customers: the regulatory agencies and the regulated industries. The collaborative study procedure has provided the essential data for developing this information.

Method Characteristics

Methods are usually evaluated on the basis of three characteristics: reliability, applicability, and practicability. For our present purpose, reliability is the overriding consideration. In general, when a need exists for a method, we have to accept any reasonable degree of reliability. Applicability to a wide range of sample types and practicability with respect to cost, time, and training constraints both assume greater importance when there are several competing methods.

The important aspects of reliability, listed in their approximate order of importance for most purposes, are:
• Reproducibility, or total between-laboratory precision. This is the measure of the ability of different laboratories to check each other. It is the overall measure of variability, including the within-laboratory component.
• Repeatability, or within-laboratory precision. This is the measure of the ability of a laboratory (or analyst) to check itself.
• Systematic error or bias (sometimes also called "accuracy or inaccuracy"). This is the difference of the value(s) obtained from the true, assigned, or consensus value(s).
• Specificity (when required). This is the ability of the method to measure what it is intended to measure.
• Limit of reliable measurement (when required). This is the smallest amount (or concentration) of a material that can be measured with a stated degree of confidence.

Which of these factors is most important depends upon the purpose for which the data will be used. In regulatory analysis, analytical values are used for three major purposes: to survey a field to determine the extent of a problem; to monitor trends to determine if any corrective action has to be taken; and to determine compliance with an economic or legal specification. A different emphasis on the various method characteristics is required to accomplish each purpose. In surveying a field, the normal variability of the measurement of a commodity or the environment is usually so large that a high degree of accuracy and precision is not an important requirement. The averaging of numerous imprecise determinations often provides a surprisingly good mean. Sometimes all that is needed is to differentiate samples with "none" of the analyte from those that contain "significant" amounts. In monitoring trends, the systematic error, as long as it is constant, is not important. The precision must be good enough to detect when a "significant" difference occurs. In compliance activities a high degree of accuracy (as lack of bias) and precision are required at the specification level, unless the specification is based upon the method itself, in which case only precision is pertinent. The precision requirement may decrease as the distance from the specification value increases. When the "no residue" requirements of the Federal Food, Drug, and Cosmetic Act are involved, specificity and limit of reliable measurement are the most important considerations. In practical work, other requirements come into play. In surveys, the need to analyze many samples makes a rapid method a necessity; in monitoring, repeated sampling of the same population is important; in compliance, practicality, although important, is secondary to reliability.

Therefore, it is apparent that there is no such thing as a "best" method since the definition of "best" will vary with the purpose for which the method will be used. Since this is not usually known beforehand, we must usually assume that our primary interest will be in the achievement of a suitable degree of precision and bias; requirements for specificity and limit of reliable measurement will usually be self-evident.

In regulatory work, or even in analyzing for adherence to commercial specifications, between-laboratory variability is the most important factor. Bias can be tolerated. If it is constant, a correction factor can be used. If it is variable, it becomes a component of reproducibility (between-laboratory variability). In fact, Youden (2) equates systematic error to the "true between-laboratory" variability, which in our terminology is the reproducibility adjusted for the within-laboratory variability (repeatability). Bias, as a recovery factor, particularly in modern trace analysis, is generally permitted to seek its own level, provided it is above 60-80% (3).

Interlaboratory Precision

It would appear that any systematic approach to estimating what constitutes a reasonable precision would be an almost impossible task. Methods are composed of almost infinite combinations of dissolution, cleanup, and measurement procedures. These innumerable combinations are applied to pure substances and complex mixtures as solids, liquids, and gases by analysts with various degrees of competency. Yet, despite this complexity, we have found that analytical variability can be summarized (in an oversimplified fashion to be sure) by plotting the determined mean coefficient of variation (CV), expressed as powers of two, against the analyte level measured, expressed as powers of 10, as shown in Figure 1, taken from our recent paper on quality control (4).

Figure 1. The general curve relating interlaboratory coefficients of variation (expressed as powers of two on the right) with concentration (expressed as powers of 10) along the horizontal center axis. [Plot labels: Pharmaceuticals, Drugs in Feeds, Pesticide Residues, Aflatoxins, Major Nutrients, Minor Nutrients, Trace Elements; horizontal axis: Concentration.]

The sources of these data are an examination of over 150 independent AOAC interlaboratory collaborative studies covering numerous AOAC topics, from drug preparations and pesticide formulations on the high end of the concentration scale to aflatoxin contaminants at the low end, with important stops in between at pesticide residue and trace element concentrations. At least five analytical methods (chromatography, atomic absorption spectrometry, spectrophotometry, polarography, and bioassay) are involved. A convenient, easily remembered reference point is that at 1 ppm (10^-6), the CV is 2^4 = 16%. Other points are given in Figure 1.

The most important and startling point is that this idealized smoothed curve is independent of the nature of
Figure 2. The performance of laboratories analyzing EPA's quality-control samples for pesticide residues in fat and blood over a 13-year period (4). Fat ●, blood ○. [Axes: Coefficient of Variation vs. Year.]

Figure 3. The interlaboratory coefficient of variation and standard deviation (absolute) of the gravimetric ether extraction method for the determination of fat in meat as a function of the concentration (%) of fat (6). [Horizontal axis: Sample Fat Content (%).]
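
The reference point quoted under Interlaboratory Precision (a CV of 2^4 = 16% at 1 ppm, with CV read as powers of two against concentration as powers of 10) can be reproduced with a short calculation. The sketch below assumes the closed form that is commonly quoted for this curve, CV(%) = 2^(1 - 0.5 log10 C), with C as a dimensionless mass fraction. That expression is not stated in the text itself, and the function name `horwitz_cv_percent` is purely illustrative.

```python
import math


def horwitz_cv_percent(mass_fraction: float) -> float:
    """Predicted interlaboratory CV (%) at a given analyte level.

    ASSUMPTION: uses the commonly quoted closed form
    CV(%) = 2 ** (1 - 0.5 * log10(C)); the article only gives the
    curve graphically plus the 1 ppm reference point.
    `mass_fraction` is dimensionless: 1.0 = 100%, 1e-6 = 1 ppm.
    """
    if not 0.0 < mass_fraction <= 1.0:
        raise ValueError("mass_fraction must be in the interval (0, 1]")
    return 2.0 ** (1.0 - 0.5 * math.log10(mass_fraction))


# Concentrations as powers of 10, matching the horizontal axis of Figure 1.
for exponent in (0, -2, -4, -6, -8):
    level = 10.0 ** exponent
    print(f"C = 1e{exponent:+d}  ->  CV = {horwitz_cv_percent(level):.1f}%")
```

Under this assumed form the CV doubles for every hundredfold drop in concentration, which is why powers of two and powers of 10 give a straight-line summary.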
Figure 5. The interlaboratory coefficients of variation for the determination of pesticide residues in butterfat (8) and in wildlife (9) by gas chromatographic methods. Wildlife ○, butterfat ●.

Figure 6. The interlaboratory coefficients of variation of trace elements in blood by various methods as calculated from the data reviewed by Versieck and Cornelis (10).
implicit faith that if a method is followed exactly, the correct result will automatically be produced. However, our review of several hundred collaborative studies in which the samples were examined as true unknowns reveals that often 5 to 15% of the reported values are statistical outliers: values that are far outside the region where most of the other values reside. Outliers are produced by experienced chemists as well as by novices, at macro as well as at trace levels. Schuller et al. (11), in their review of aflatoxin methods, noted that they had to tolerate a 10% outlier rate in recommending methods for international referee status. In the analysis of moon rocks from the Lunar Analysis Program of the U.S. National Aeronautics and Space Administration, Morrison reported (12) that almost 7% of the values had to be discarded as outliers.

Outliers produced by a single analyst or within a laboratory are usually inconspicuous, since they are either unrecognized from analysis of single samples where there is nothing to compare the result with, or they are eliminated by repetition. In an interlaboratory situation with blind samples, where there is no opportunity to censor the data, outliers are more obvious. In current AOAC collaborative studies, outliers are usually eliminated by the techniques suggested by Youden (2): a ranking test to remove consistently high or low laboratories, followed by the elimination of outlying individual values by a Dixon test involving the deviations of extreme values.

We have only recently realized the importance of outliers in the evaluation of methods of analysis for approval by the AOAC. Outliers are a fact of laboratory life and allowance must be made for them. By definition, they lie at the extreme points of the statistical frequency distributions of a series of analytical values. Therefore, they have a large influence on the magnitude of the indices used to measure the performance of methods.

Figures 7 and 8, using the data from the collaborative study of aflatoxin in green cacao beans (13), illustrate this effect. The repeatability (within-laboratory variability) changes from an unacceptable CV of 50% for the 20 values from 10 laboratories to a marginally acceptable 36% by the omission of one value classified as an outlier by the Dixon test. In this case, as in many others, there is no question about the classification as an outlier since 18 ppb of aflatoxin had been added to each sample. In this study, the mean changed little by the elimination of the outlier: from 16.0 to 14.6 ppb, or expressed in terms of recovery, from 89 to 81%.

Figure 7. Original data from the interlaboratory study of the determination of aflatoxin in cacao beans (13). The two values from each laboratory are plotted horizontally. The circled value is an outlier by the Dixon test. [Horizontal axis: Total Aflatoxin (ppb).]

Figure 8. The data from Figure 7 plotted as a normal frequency distribution with the outlier included (20 points, broken line) and the outlier excluded (18 points, solid line). [Curve labels: CV = 50%, CV = 36%; horizontal axis: Concentration (ppb).]

This particular example shows only a 5% outlier rate. We have seen methods approved by the AOAC with an outlier rate as large as 50%. There is only one legitimate excuse for elimination of laboratories without a statistical test: intentional or unintentional failure to follow the method. An intentional failure occurs when the specified equipment or reagent is unavailable and failure to substitute will mean dropping out of the study. But substitution on the basis of "it cannot possibly affect the results" is inexcusable. Collaborative studies are very expensive in terms of time and manpower. Jeopardizing their success with untested changes undermines the entire collaborative study.

Although many AOAC studies show no outliers, a 5 to 15% outlier rate, particularly at the ppm and ppb levels, is not at all unusual. If but one of five or six laboratories required in a minimum statistical pattern turns out to be an outlier by the Youden ranking test, we have reached a 20% rejection point. It appears that we may have to tolerate a 20% outlier rate, because that is the penalty to be paid if we use a minimum number of participating laboratories. This is one of the reasons why having at least 10 laboratories in a study will improve the chances of acquiring adequate data for statistical evaluation.

There is one important statistical problem with outliers: What outlier test should be used? This is a complex statistical problem whose solution depends on the true distribution of values. Chemists seem to have little difficulty in applying intuition and experience to this problem, but many statisticians are appalled at this approach. We hope to apply a number of outlier tests to a number of AOAC collaborative studies to determine if any of the several dozen procedures described in the statistical literature is best suited for application to interlaboratory work.

False Positives and False Negatives

Another potentially useful suitability index for evaluation of methods may be the percentage of false positives and false negatives. False positives (excessively high blanks) may appear when working at any concentration level, but the appearance of both false positives and false negatives is characteristic of trace analysis. These values are not necessarily outliers, but they can be. We have noted in our examination of the available aflatoxin studies that the percentage of false negatives increases much like our CV/concentration curve as the concentration approaches zero. This phenomenon may be more useful for delineating a limit of reliable measurement, i.e., the concentration at which the proportion of false negatives is more than 20% (or some other number), than for evaluating the performance of methods in general.

Bias or Systematic Error

Up to now I have paid little attention to the matter of bias or systematic error because in most cases it takes care of itself. Very few methods are rejected because of low or high recoveries. In the case of macro methods, recovery is frequently very close to theoretical, because these methods are usually based upon stoichiometry or basic physical principles of extraction or separations, and the amount of analyte available for measurement is not limited. Furthermore, many methods in food chemistry are empirical: they are based upon the faith that other participants will adhere to the specified directions to produce equivalent results. Empirical methods by definition have no bias. In trace analysis, the precision characteristic usually takes care of recovery, because the random error is often as large or larger than the systematic error. If the recovery is low but repeatable, as in isotope dilution methods, any recovery is acceptable since correction can be made back to the 100% level. If it is variable, even though within acceptable limits, the correction factor procedure will not do. The method must then be accompanied by sufficient recovery data to indicate the boundaries of performance. In the proposed "SOM" document of the FDA, recovery limits were given as more than 80% for concentrations of 0.1 ppm and above, and more than 60% for lower concentrations (3). Naturally, we would prefer higher recoveries. But these figures do appear to be reasonable in light of actual recoveries under ordinary, and not collaborative, conditions.

Summary

The primary objective of interlaboratory studies is to determine if we have achieved interchangeability of test results among laboratories. But interchangeability is a function of the purpose for which the results will be used: to survey a field; to monitor trends; or to determine compliance
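
The recovery and outlier-rate figures quoted in the outlier and bias discussions are simple ratios, and they can be checked directly. The sketch below is only an illustration: the function names are hypothetical, and the threshold function encodes nothing more than the two "SOM" limits as the text quotes them (more than 80% at 0.1 ppm and above, more than 60% below that).

```python
def recovery_percent(found: float, added: float) -> float:
    """Recovery as a percentage of the amount added (same units for both)."""
    return 100.0 * found / added


def outlier_rate_percent(n_outliers: int, n_total: int) -> float:
    """Share of reported values (or laboratories) rejected as outliers."""
    return 100.0 * n_outliers / n_total


def som_minimum_recovery(concentration_ppm: float) -> float:
    """Recovery limit (%) as quoted in the text for the proposed SOM document.

    ASSUMPTION: only the two quoted limits are encoded here; the
    document itself may contain further conditions.
    """
    return 80.0 if concentration_ppm >= 0.1 else 60.0


def exceeds_som_limit(found_ppb: float, added_ppb: float) -> bool:
    """True if recovery is above the quoted limit for the spiked level."""
    added_ppm = added_ppb / 1000.0  # 1 ppm = 1000 ppb
    return recovery_percent(found_ppb, added_ppb) > som_minimum_recovery(added_ppm)


# Cacao bean study: 18 ppb added; mean found with and without the outlier.
for mean_found in (16.0, 14.6):
    print(mean_found, round(recovery_percent(mean_found, 18.0)),
          exceeds_som_limit(mean_found, 18.0))

# One outlying value among 20, and one outlying laboratory among five.
print(outlier_rate_percent(1, 20), outlier_rate_percent(1, 5))
```

At the 18 ppb spiking level (0.018 ppm) the lower limit applies, so both the 89% and the 81% recoveries clear it.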
1315-43.
(12) Morrison, G. H. Anal. Chem. 1971, 43 (7), 22-31A.
(13) Scott, P. M.; Przybylski, W. J. Assoc. Off. Anal. Chem. 1971, 54, 540-44.