Criteria For Judging Precision and Accuracy in Method Development and Evaluation

CLIN. CHEM.
20/7, 825-833 (1974)
James O. Westgard, R. Neill Carey, and Svante Wold¹

Clinical Laboratories and the Departments of Medicine, Pathology, and Statistics, University of Wisconsin, Madison, Wis. 53706.
¹ Statistician in residence; on leave from the Institute of Chemistry, University of Umeå, Sweden.
Received April 22, 1974; accepted May 10, 1974.

We describe an approach for formulating criteria that can be used to judge whether an analytical method has acceptable precision and accuracy. We derive criteria for several experiments that are commonly used in method-evaluation studies: precision or replicates, recovery, interference, and comparison of patient values between the new method and a proven method. These criteria are based on the medical usefulness of the test results, thus the acceptability of the method is judged with respect to the clinical requirements.

Additional Keyphrases: statistics • quality control

In evaluating the performance of a new laboratory method, the analyst can be guided by several schemes that outline experimental procedures and statistical techniques (1-4). However, none of these schemes provides criteria by which the analyst can judge whether the clinical performance of the method is acceptable. If one is to make reliable decisions, it is necessary to clearly define (a) experimental protocols that provide reliable estimates of performance, (b) standards that represent acceptable performance, and (c) criteria that permit the observed performance to be compared with that performance which is defined as acceptable. Our purpose here is to outline an approach for defining b and c. In order to limit the scope of this paper, we assume that a can be adequately defined, i.e., that the presently recommended experiments are satisfactory if used carefully, and that simple statistics such as the standard deviation, least-squares analysis, and paired t-test can be used to describe the performance of the new method.

A new analytical method must satisfy the requirements of both the user and the analyst. The user of clinical laboratory service, the physician, has certain requirements if the test results are to be medically useful. These specifications include the type of specimen, amount of specimen, expected range of concentration, elapsed time between specimen submission and report of results, and precision and accuracy. The analyst, or clinical chemist, must confer with the physician to obtain an adequate definition of these medical requirements. In addition, he must define other requirements based on the technical and economic resources of the laboratory.

In setting up the analytical method, the analyst must optimize the method to best satisfy all of the requirements. He may develop a new method, or he may evaluate existing methods that appear to satisfy all of the requirements. In developing a method, he will strive to satisfy the physical requirements such as the amount of specimen, range of linearity, etc., and then evaluate the performance of the method. At this early stage of testing or developmental testing, he relies primarily on simple experiments for determination of within-run precision, recovery, and interference. When they indicate that the method appears to be acceptable, a final stage of testing (method evaluation) is conducted that includes experiments for determination of run-to-run precision and comparison of values obtained for patients with the test method and a reference method. Regardless of which experiments are used, the analyst’s objectives are to quantify precision and accuracy; hence to judge whether the method performs acceptably, the analyst must have objective criteria for judging precision and accuracy.

Formulation of Criteria for Judging Precision and Accuracy

To the analyst, precision means random analytic error. This is illustrated in Figure 1 by the distribution of the individual measurements around a mean value. Accuracy, on the other hand, is commonly thought to mean systematic analytic error, which is shown in Figure 1 by the difference between the mean of the measured values and the “true” value. Analysts sometimes find it useful to divide this systematic error into constant and proportional components (5), hence they may speak of constant error and proportional error. None of this terminology is familiar to the physician who uses the test values, therefore he is seldom able to communicate with the analyst in these terms. The physician thinks rather in terms of the total analytical error, which includes both random and systematic components. From his point of view, all types of analytic error are acceptable as long as the total analytic error is less than a specified amount. This total analytic error is illustrated as the second definition of accuracy in Figure 1. This definition is medically more useful; after all, it makes little difference to the patient whether a laboratory value is in error because of random or systematic analytic error, and ultimately he is the one who must live with the error.

[Figure 1 (diagram): observed values distributed about their mean, with the true value marked. Labels: precision = random analytic error; accuracy = (1) systematic analytic error, (2) total analytic error.]
Fig. 1. Definitions of precision and accuracy in terms of random, systematic, and total analytic errors

Because the medical requirements for performance can best and most easily be described in terms of the total analytic error, we formulate standards for acceptable performance by using this concept of total error. To do this, a specified value for the total allowable error is interpreted as a 95% limit² of allowable error (EA).³ In addition to EA, it is necessary to specify the medical decision level (Xc), the concentration at which the performance of the method is critical. With these two pieces of information, we formulate a “performance standard” (PS), which summarizes the medical specification for total analytic error. PS is defined as the allowable error (EA) at the concentration (Xc) where critical medical decisions must be made. If there are several different decision levels, several performance standards can be defined. For example, for glucose we could specify PS1 = 10 mg/dl (EA) at 50 mg/dl (Xc), PS2 = 10 mg/dl at 120 mg/dl, PS3 = 10 mg/dl at 160 mg/dl, and PS4 = 25 mg/dl at 300 mg/dl.

² In some situations, one may wish to use different confidence limits, perhaps 99% or 99.9%. The choice should reflect how critical the medical requirements are for the particular substance being measured.
³ Nonstandard abbreviations used: EA, medically allowable analytic error; Xc, concentration at which critical medical decisions are made; PS, performance standard, defined as the medically allowable analytic error at the critical medical decision concentration; α, level of significance; RE, random analytic error; PE, proportional component of systematic analytic error; CE, constant component of systematic analytic error; SE, systematic analytic error; TE, total analytic error; SD_T, standard deviation of the test method; %R, percent recovery in the standard addition or recovery experiment; %R̄, mean percent recovery; SEM, standard error of the mean; SD_d, standard deviation of the differences in paired t-test; bias, difference between average concentrations in paired t-test; t, t-value in t-test; a, intercept from least-squares analysis; b, slope from least-squares analysis; S_y.x, standard error in the y direction about the least-squares line; Xi, individual patient values by reference method; and X̄, average of patient values by reference method.

Once PS has been defined, “decision” criteria are formulated by comparing the estimates of analytic error in the test method with the defined allowable error (EA). If the errors observed in the test method are smaller than the medically allowable errors, then the method performs acceptably. If larger, then the errors need to be decreased by appropriate modifications, or else the method is unacceptable. We formulate separate criteria for acceptance and rejection. These criteria have a known “level of significance” (α), i.e., the amount of risk or chance of being wrong is known. To formulate criteria at α = 0.05 or the 5% level of significance, we compare the 95% limit of the estimate of analytic error with EA. For example, to decide if the analytic error is small enough for the method to be acceptable, we compare the larger estimate of analytic error (usually the upper 95% limit of the confidence interval) with EA. When this estimate is less than EA, there is only a 5% chance (or less) that the error could be larger than EA. Consequently, performance is acceptable. This is illustrated in Figure 2 by example a. In example b where the smaller estimate (usually lower 95% limit) is greater than EA, the performance of the method is not acceptable. There is only a 5% chance (or less) that the analytic error is small enough for the method to perform acceptably. The errors must be reduced by appropriate modification of the method, or the method judged as unacceptable. Example c shows the situation where the larger 95% limit is greater than EA and the smaller 95% limit is less than EA. In this case, the data are not sufficient for a decision to be made at α = 0.05.

By formulating criteria in this manner, we divide the performance of methods into three classes: (a) acceptable at α = 0.05, (b) unacceptable at α = 0.05, and (c) data not sufficient for a judgment at α = 0.05. The analyst has clear decisions for classes a and b, but not for class c. Class c methods can be better assessed after additional experimental data are obtained. This will narrow the confidence band for the estimate of analytic error, and this may resolve the overlap with EA. Alternatively, the performance of a class c method may be better described by reference to the “operating characteristic of the statistical test” (6), which gives the exact probability that the criterion fails to detect that the observed analytic error is greater than the allowable analytic error. However, it is sufficient at this time to simply
recognize that class c methods are likely to be borderline in terms of performance. Acceptance of such a method will undoubtedly require more stringent quality control and attention from the analyst in order to maintain an adequate performance when the method is in routine service.

[Figure 2 (diagram): observed analytic error for three cases, a, b, and c.]
Fig. 2. Decisions on performance: The uncertainty in the estimate of the analytic error is shown by the vertical bar and the magnitude of a particular value of EA by the dotted horizontal line. a, acceptable; b, not acceptable; c, data not sufficient for a judgment on performance

By using this approach, we have formulated criteria that are applicable to several different evaluation experiments. Random analytic error (RE) is estimated from the replicate or precision experiment, proportional systematic analytic error (PE) from the recovery experiment, constant systematic analytic error (CE) from the interference experiment, mean systematic analytic error (SE, includes both PE and CE) from the patient comparison experiment, and finally total analytic error (TE) from the experiments on replicates and patient comparisons. Table 1 summarizes the analytic errors, the experiments from which they are estimated, and the performance criteria. Appendix 1 gives a detailed discussion of the formulation of the criteria and Appendix 2 gives examples of the application of the criteria.

Table 1. Performance Criteria for Analytic Errors

Analytic error | Experiment | Acceptable | Not acceptable
Random (RE) | Replicates | $2\,SD_{Tu} < E_A$ | $2\,SD_{Tl} > E_A$
Proportional (PE) | Recovery | $\left|\overline{\%R}_{u\ \text{or}\ l} - 100\%\right|_u X_c < E_A$ | $\left|\overline{\%R}_{u\ \text{or}\ l} - 100\%\right|_l X_c > E_A$
Constant (CE) | Interference | $|Bias| + t\,(SD_d/\sqrt{N}) < E_A$ | $|Bias| - t\,(SD_d/\sqrt{N}) > E_A$
Systematic (SE) | Patient comparison | $\left|(a + bX_c \pm W) - X_c\right|_u < E_A$ | $\left|(a + bX_c \pm W) - X_c\right|_l > E_A$
Total (TE) | Replicate and comparison | $\left|(a + bX_c) - X_c\right| + \sqrt{(2\,SD_{Tu})^2 + W^2} < E_A$ | $\left|(a + bX_c) - X_c\right| + \sqrt{(2\,SD_{Tl})^2 + W^2} > E_A$

Discussion

The concept of total analytic error that is presented here is not new. Eisenhart (11) discussed this many years ago, and the ASTM standard on precision and accuracy (12) recommends that accuracy be documented by use of two terms: one that estimates random analytic error and the other systematic analytic error (or precision and bias, in their terminology). The manner in which these terms are combined here is different, but it is consistent with the medical use of the laboratory data and therefore appropriate for the problem with which we are concerned.

Of the five error criteria that we have derived, the TE criterion is the most demanding because it includes both RE and SE. The other criteria consider individual analytic errors and each could be judged as acceptable, even when the total analytic error is not acceptable. This is readily apparent when RE and SE individually approach EA, since TE will be nearly twice EA when RE and SE are summed. Thus the criterion for TE is often sufficient by itself for judging the acceptability of performance. The other criteria are most useful in the early stages of method development and testing. During this time, TE could be approximated by summation of the individual components (RE + PE + CE).

In applying the proposed criteria, the conclusions on performance depend on the definition of performance standards (PS). When initially attempting to define PS, the analyst can be guided by the recommendations of Barnett (13), Campbell and Owen (14), Tonks (15), Cotlove et al. (16), Vanko (17), and Duncan and Geary (18). These authors present recommendations for maximum allowable SD’s that should be multiplied by 2 to provide a 95% limit for allowable analytic error. Barnett’s recommendations most closely represent the medically allowable error, since they are based on medical judgment and include the medical decision levels. Campbell and Owen surveyed physicians for recommendations for “acceptable reproducibility,” but their data are more limited than Barnett’s. Tonks’ and Cotlove’s limits are dependent on the range of variation in normal subjects and are more theoretically defined. Vanko summarizes “state of the art” performance limits, that is, the performance that is available from methods that are in use today. “State of the art” information is also available through state and national quality-control survey programs, and Duncan and Geary have recommended limits based on survey results in Australia. “State of the art” recommendations are most useful in situations where the medical requirements may be more demanding than the performance available from current methods. In such cases, it may be necessary to judge a method in comparison to currently available performance, rather than to judge all methods as unacceptable.

Even though conclusions on acceptability depend on the particular values chosen for EA, the conclusion will be meaningful as long as EA is stated. A proper statement of the conclusion should specify EA; for example, “the method is judged acceptable because the total analytic error is less than 10 mg/dl at concentrations of 50 and 120 mg/dl.” Or, “the method does not perform acceptably because the total analytic error exceeds 10 mg/dl at 120 mg/dl.”

The reliability of such conclusions depends on the reliability of the estimates of analytic errors, which requires proper application of statistics. In particular, statistical models should allow for error in the reference method, or else the experiment should be designed to minimize the statistical bias introduced by such errors. We have discussed some of the factors that can limit the reliability of the statistical estimates in an earlier report (5) and are continuing to investigate this problem. Our main concern here is to outline an approach for formulating objective criteria, assuming that reliable estimates of analytic errors can be obtained by proper application of least-squares techniques.

We personally prefer least-squares techniques for analyzing the patient comparison data, because the slope and intercept terms provide information about the proportional or constant nature of systematic error, and this is useful to the analyst if he needs to modify the method to diminish the analytic errors. Because of this, we chose to derive SE and TE criteria for such a case. However, similar criteria can be derived for t-test statistics. (Confidence limits for the estimate of SE are calculated in the same manner as discussed for constant analytic error. See Appendix 1. Then bias would replace the first term in the TE criteria, and t(SD_d/√N) would replace W in the second term.) Although the criteria are formulated very simply, reliable estimates of SE are more difficult to obtain by t-test statistics. Only when X̄ equals Xc can one be sure that the estimate of systematic analytic error is reliable. Because this restriction is often difficult to satisfy and because the analyst loses the information about the nature of the systematic error, we prefer least-squares techniques.

We have considered only the evaluation model where a reference method is available and where systematic differences between the test and reference methods are not acceptable. Evaluation studies that do not fit this simple model will have to be approached differently. An understanding of the simple case will aid in the development of approaches that are appropriate for the more difficult cases.

Appendix 1. Formulation of Criteria for Specific Types of Analytic Error

Random analytic error (RE). The standard deviation of the test method (SD_T) should be estimated from at least 20 experimental results (1) on a sample whose concentration equals or nearly equals the medical-decision concentration Xc. SD_T is calculated by the usual procedures, and the upper and lower 95% confidence limits (SD_Tu and SD_Tl, resp.) are calculated as described by Natrella (7) and shown in equations 1 and 2,

$$ SD_{Tu} = SD_T\,(\text{upper factor}) \tag{1} $$
$$ SD_{Tl} = SD_T\,(\text{lower factor}) \tag{2} $$

where the upper and lower factors are listed in Table 2 as a function of degrees of freedom, which is N − 1, where N is the number of replicates. The 95% upper limit for random analytic error (RE_u) is given by equation 3 and the 95% lower limit (RE_l) by equation 4:

$$ RE_u = 2\,SD_{Tu} \tag{3} $$
$$ RE_l = 2\,SD_{Tl} \tag{4} $$

RE by itself is acceptable when

$$ 2\,SD_{Tu} < E_A \tag{5} $$

RE by itself is not acceptable when

$$ 2\,SD_{Tl} > E_A \tag{6} $$

Example Calculation: We illustrate the calculations for a glucose method and refer back to our earlier definition of performance standards, PS1 = 10 mg/dl at 50 mg/dl and PS2 = 10 mg/dl at 120 mg/dl. For an observed SD_T of 2.0 mg/dl, which was obtained from 21 measurements on a pool which averaged 55 mg/dl, SD_Tu is 2.7 mg/dl (2.0 × 1.358) and RE_u is 5.4 mg/dl (2 × 2.7). For an observed SD_T of 3.0 mg/dl at 110 mg/dl (N = 21), SD_Tu is 4.1 mg/dl and RE_u is 8.2 mg/dl. For both decision levels, the upper limit of random analytic error is less than the medically allowable error, therefore RE by itself is acceptable.

Proportional component of systematic analytic error (PE). In principle, the experiment should be performed by adding a small amount of pure dry standard material to the sample matrix. In practice, a small volume of a high-concentration aqueous standard is added to a large volume of the sample matrix (say a maximum of 0.1 ml standard to 0.9 ml sample). The original sample and the sample with standard added are analyzed and the percent recovery (%R) is calculated by equation 7.
$$ \%R = \frac{\text{concentration recovered}}{\text{concentration added}}\,(100) \tag{7} $$

The “concentration recovered” is equal to the concentration after addition minus the original sample concentration times the sample dilution factor (volume sample/total volume of sample + standard). The “concentration added” is equal to the standard concentration times the dilution factor (volume standard/total volume of standard + sample).

The magnitude of the proportional error is given by equation 8,

$$ PE = \left|\overline{\%R} - 100\right| \tag{8} $$

where %R̄ is the mean recovery for the series of experiments. The upper and lower confidence limits for PE (PE_u and PE_l, resp.) are given by equations 9 and 10.

$$ PE_u = \left|\overline{\%R}_{u\ \text{or}\ l} - 100\right|_u \tag{9} $$
$$ PE_l = \left|\overline{\%R}_{u\ \text{or}\ l} - 100\right|_l \tag{10} $$

Here the upper and lower limits of the mean recovery are given by equations 11 and 12,

$$ \overline{\%R}_u = \overline{\%R} + t\,(SEM) \tag{11} $$
$$ \overline{\%R}_l = \overline{\%R} - t\,(SEM) \tag{12} $$

where t is obtained from a statistics table for P = 0.05 and N − 1 degrees of freedom, and SEM is the standard error of the mean recovery.⁴ As shown in equation 9, the upper limit of PE is calculated from either %R̄_u or %R̄_l, whichever gives the larger estimate. The lower limit of PE is calculated from whichever value gives the smaller estimate (equation 10).

⁴ The standard error of the mean (SEM) is equal to SD/√N where SD is the standard deviation of the individual observations and N is the number of observations. The imprecision of the method is judged separately in the RE criteria (equations 5 and 6); however, there is still an uncertainty in the estimate of a systematic analytic error caused by the imprecision. This uncertainty can be made small by having a sufficiently large number of observations.

To formulate the decision criteria, PE_u and PE_l are multiplied by Xc to estimate the analytic error in concentration units that can be compared with EA. PE by itself is acceptable when

$$ \left|\overline{\%R}_{u\ \text{or}\ l} - 100\%\right|_u X_c < E_A \tag{13} $$

but is not acceptable when

$$ \left|\overline{\%R}_{u\ \text{or}\ l} - 100\%\right|_l X_c > E_A \tag{14} $$

Example Calculation: %R̄ is 98.0% for nine experiments, the standard deviation of the recoveries is 1.5%. SEM is 1.5%/√9, or 0.5%. The upper 95% limit for recovery is 99.2%, i.e., 98.0% + (2.31 × 0.5%), where 2.31 is the t-value for eight degrees of freedom. The lower 95% limit for recovery is 96.8% [98.0% − (2.31 × 0.5%)]. PE_u is 3.2% (|96.8 − 100|). The analytic error is 1.6 mg/dl at 50 mg/dl (3.2% × 50) and 3.8 mg/dl at 120 mg/dl (3.2% × 120). These analytic errors are smaller than EA (10 mg/dl in our previous definitions of PS1 and PS2), therefore proportional analytic error by itself does not limit the medical usefulness of the test results.

Table 2. Factors for Calculating One-Sided 95% Confidence Limits for a Standard Deviation (a)

Degrees of freedom   Factor for lower limit   Factor for upper limit
  5                  0.672                    2.089
  7                  0.706                    1.797
 10                  0.739                    1.593
 12                  0.756                    1.515
 15                  0.775                    1.437
 20                  0.798                    1.358
 25                  0.815                    1.308
 30                  0.828                    1.274
 40                  0.847                    1.228
 50                  0.861                    1.199
 60                  0.871                    1.179
 70                  0.879                    1.163
 80                  0.886                    1.151
 90                  0.892                    1.141
100                  0.897                    1.133
150                  0.913                    1.107
200                  0.925                    1.091
250                  0.932                    1.080
300                  0.938                    1.073
400                  0.946                    1.062
500                  0.951                    1.055

(a) From Natrella (7).

Constant component of systematic analytic error (CE). Interference studies are performed by adding the suspected interfering material to the original sample matrix, then analyzing both the original sample and the interference sample. The concentration added should be appropriate for the medical use of the method. For example, if the method is to be used for screening purposes, a concentration of interfering material that corresponds to the upper limit of variation in normal subjects may be acceptable. If the method is to be used for hospitalized patients, the concentration added should represent the upper limit of concentration expected in these patients.

The experiment is commonly performed in two different ways. The original sample and the interference sample can be analyzed in replicate, or a series of samples can be analyzed singly. The experimental data can be analyzed with t-test statistics, using
“unpaired” and “paired” forms for the multiple and single sample experiments, respectively. The mean CE is estimated by bias³ and upper and lower limits can be calculated from SD_d,³ N, and the appropriate t-value for P = 0.05 and N − 1 degrees of freedom.

$$ CE_u = Bias_u = |Bias| + t\,(SD_d/\sqrt{N}) \tag{15} $$
$$ CE_l = Bias_l = |Bias| - t\,(SD_d/\sqrt{N}) \tag{16} $$

Performance is acceptable when

$$ |Bias| + t\,(SD_d/\sqrt{N}) < E_A \tag{17} $$

Performance is not acceptable when

$$ |Bias| - t\,(SD_d/\sqrt{N}) > E_A \tag{18} $$

Example Calculation: For addition of 1.5 mg of creatinine per deciliter, bias = 2.0 mg/dl, SD_d = 1.2 mg/dl, N = 9. CE_u is 2.9 mg/dl [2.0 + 2.31(1.2/√9)], thus constant error due to normal concentrations of creatinine is less than our specified allowable error of 10 mg/dl. For additions of 15 mg of creatinine per deciliter, bias = 17.0 mg/dl, SD_d = 2.1 mg/dl, and N = 9. CE_u is 18.6 mg/dl [17 + 2.31(2.1/√9)] and CE_l is 15.4 mg/dl [17 − 2.31(2.1/√9)]. For samples from uremic patients, the analytic errors could exceed the allowable error of 10 mg/dl (both PS1 and PS2).

Systematic analytic error (SE). The mean systematic error can be estimated by analyzing at least 40 patients’ specimens (1) by both the test and reference methods. We assume the case where an accurate reference method is available, therefore differences between the test and reference methods are due to errors that originate in the test method.⁵ The statistical approach must be appropriate for the experimental situation; however, some form of least-squares analysis will generally be applicable, sometimes simple linear regression, or more generally “confluence” analysis⁶ (8, 9). We illustrate the formulation of criteria for the simple case when linear regression can be used, i.e., the case when random analytic error in the reference method (X value) is sufficiently small that unbiased estimates of the slope and intercept are obtained. For confluence analysis, the equations are similar, but more complicated.

⁵ In practice, the selection of a reference method is indeed a problem, but for many commonly measured substances, such as glucose, calcium, etc., there are methods that are generally accepted (by consensus) as being accurate enough for this purpose. Differences between test and reference values are still assumed to be errors in the test method, unless proven otherwise. Enzyme methods, hormone assays, and many other newly developed tests do not fit this experimental model and will have to be approached in a somewhat different manner.
⁶ Confluence analysis is used here to designate a least-squares technique that allows for random error in both x and y. Simple linear regression allows for error only in y and may give an estimate for the slope that is too low and an estimate for the intercept that is too high.

For simple linear regression, the 95% limits of systematic analytic error are estimated as described by Natrella (10). The average test value (Yc) that corresponds to the critical medical concentration (Xc) is calculated from the regression equation.

$$ Y_c = a + bX_c \tag{19} $$

The 95% limits for Yc are given by equations 20 and 21.

$$ Y_{cu} = Y_c + W \tag{20} $$
$$ Y_{cl} = Y_c - W \tag{21} $$

W is the width of the confidence band and is calculated from equation 22,

$$ W = t_{1-\alpha/2}\, S_{y \cdot x} \left[ \frac{1}{N} + \frac{(X_c - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right]^{1/2} \tag{22} $$

where α is 0.05, t_0.975 is obtained from Table 3 for N − 1 degrees of freedom, S_y.x is the standard error about the regression line, X̄ is the mean of the reference values, and Xi is an individual reference value. The upper limit of systematic error (SE_u) is the absolute difference between Xc and the upper or lower limit of Yc, whichever gives the larger estimate.

$$ SE_u = \left| Y_{c\,(u\ \text{or}\ l)} - X_c \right|_u \tag{23} $$

Substituting for Yc_u or Yc_l with equations 20 and 21 and for Yc with equation 19 gives

$$ SE_u = \left| (a + bX_c \pm W) - X_c \right|_u \tag{24} $$

The lower limit of systematic error (SE_l) is whichever difference gives the smaller estimate.

$$ SE_l = \left| (a + bX_c \pm W) - X_c \right|_l \tag{25} $$

Systematic analytic error by itself is acceptable when SE_u is less than EA, or

$$ \left| (a + bX_c \pm W) - X_c \right|_u < E_A \tag{26} $$

SE is not acceptable when SE_l is greater than EA, or

$$ \left| (a + bX_c \pm W) - X_c \right|_l > E_A \tag{27} $$
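The criteria summarized in Table 1 lend themselves to a short computational sketch. The Python below is a minimal illustration, not the authors' procedure: the function names are hypothetical, the Table 2 factors and Table 3 t-value are passed in by hand rather than computed, and the lower limit of SE follows the text's "whichever difference gives the smaller estimate" rule. The input values are the glucose figures used in the example calculations of this appendix for PS2 (EA = 10 mg/dl at Xc = 120 mg/dl).

```python
import math

def classify(lower, upper, ea):
    """Three-way decision at the 5% level from the 95% limits of an error."""
    if upper < ea:
        return "acceptable"
    if lower > ea:
        return "not acceptable"
    return "data not sufficient"

def re_limits(sd_t, lower_factor, upper_factor):
    """Equations 1-4: (RE_l, RE_u) = 2 * SD_T * (lower, upper) Table 2 factor."""
    return 2 * sd_t * lower_factor, 2 * sd_t * upper_factor

def mean_limits(mean, sd, n, t):
    """mean -/+ t * SD / sqrt(N), the form used in equations 11-12 and 15-16."""
    half = t * sd / math.sqrt(n)
    return mean - half, mean + half

def pe_limits(recovery_mean, sd, n, t, xc):
    """Equations 9-14: proportional error limits in concentration units at Xc."""
    errs = [abs(r - 100.0) for r in mean_limits(recovery_mean, sd, n, t)]
    return min(errs) * xc / 100.0, max(errs) * xc / 100.0

def band_width(t, s_yx, n, xc, x_mean, sum_sq):
    """Equation 22: W, the width of the confidence band at Xc."""
    return t * s_yx * math.sqrt(1.0 / n + (xc - x_mean) ** 2 / sum_sq)

def se_limits(a, b, xc, w):
    """Equations 24-25: smaller and larger of |(a + b*Xc +/- W) - Xc|."""
    errs = [abs(a + b * xc + sign * w - xc) for sign in (1, -1)]
    return min(errs), max(errs)

def te_limits(a, b, xc, w, re_l, re_u):
    """TE row of Table 1: mean SE plus sqrt(RE^2 + W^2) for each RE limit."""
    mean_se = abs(a + b * xc - xc)
    return mean_se + math.hypot(re_l, w), mean_se + math.hypot(re_u, w)

# Glucose example, PS2: EA = 10 mg/dl at Xc = 120 mg/dl.
EA, XC = 10.0, 120.0
re_l, re_u = re_limits(3.0, 0.798, 1.358)            # N = 21, 20 degrees of freedom
w = band_width(2.021, 3.0, 41, XC, 110.0, 140_000.0)
se_l, se_u = se_limits(1.0, 0.95, XC, w)
te_l, te_u = te_limits(1.0, 0.95, XC, w, re_l, re_u)

print("RE:", classify(re_l, re_u, EA))
print("SE:", classify(se_l, se_u, EA))
print("TE:", classify(te_l, te_u, EA))
```

Because the sketch carries unrounded intermediate values, its limits can differ in the last digit from hand calculations that round at each step; the three-way classification is what the criteria depend on.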

The 95% limits for TE are given by equations 28
Table 3. t-Values for Calculating the and 29 where the absolute value term on the right-
Confidence Limits of Systematic Analytic Error hand side is the mean systematic error and the
Degrees of freedom
square-root term is the uncertainty:
10 2.228
15 2.131
20 2.086 TE,, = I(a + bX) - Xc I+ V’RE,,2 + W2 (28)
30
40
2.042
2.021
TE, = I(a + bXc) - X + IRE,2 + W2 (29)
60 2.000
120 1.980 RE0, RE,, and W are defined as 95% limits by equa-
1.960 tions 3, 4, and 22, respectively. Performance is ac-
From Natrella (10). ceptable when TE0 is less than EA. Substituting for
RE0 with equation 3, gives the final form of the cri-
terion for acceptability, as shown in equation 30.
(a + bXc) - Xc! + ISDTU) + W2 < E (30)

Example Calculation: For a glucose comparison
Substituting for RE, with equation 4 gives the crite-
experiment, a =1.0, b = 0.95, S,,,,, = 3.0, N = 41, X
= 110,and (X1 X)2 = 140 000. For Xc = 50 (based
-
rion for rejection
on our earlier definition of PS1):
(a + bXc) - Xc! + ‘/(2SDT1)2 + W2 > EA (31)
= a + bXc 1.0 + 0.95(50) = 48.5 Example Calculation: We use here the data in
W = 2.021(3.0)(
1
+
(50_110)2
140,000 ) 1/2
-
-
14 the previous example calculations. For PS1 where Xc
is 50 mg/dl, the mean SE is 1.5 mg/dl (IYc Xc! -
or 148.5 50.0!). RE0 or 2SD

- is 5.4 mg/dl and W
Yc,, = 48.5 + 1.4 = 49.9 is 1.4 mg/dl, therefore the uncertainty term is 5.6
Yct = 48.5 - 1.4 47.1 mg/dl [v’(5.4)2 + (1.4)2]. TE0 is 7.1 mg/dl, which is
less than EA, therefore performance is acceptable at
SE,, = 50.0 -47.1 = 2.9
this medical decision point. For PS2 where Xc is 120
SE, = 50.0 -49.9 = 0.1 mg/dl, the mean SE is 5.0 mg/dl (1115.0 -
120.01). RE0 or 2SDu is 8.2 mg/dl, W is 1.0 mg/dl,

For Xc = 120 (for our definition of PS2): and TE0 is 13.3 mg/dl, which is greater than our
specified EA of 10 mg/dl. Therefore, we cannot be
95% certain that the total analytic error is smaller
Yc = 1.0 + 0.95(120) = 115.0
than EA and cannot judge the performance of the
W =
1
2.021(3.0)(-j +
(120_110)2
140,000 ) 1/2
= t.o method as acceptable. RE, or 2SDri is 4.8 mg/dl (2
x 3.0 x 0.798) and TE, is 9.9 mg/dl, which is less
than EA. Therefore, we cannot be 95% certain that
= 115.0 + 1.0 = 116.0
the total analytic error is greater than EA. This is an
= 115.0 - 1.0 = 114.0 example of a classc method where more experimen-
SE,, = 120.0 - 114.0 = 6.0 tal data are needed before a judgment can be made
with only a 5% risk of being incorrect.
SE1 120.0 - 116.0 = 4.0
At both decision levels, the upper limits of systematic error are less than EA; thus systematic error by itself is small enough for the method to be acceptable.

Total analytic error (TE). The random and systematic components must be considered together to estimate their total effect. On the average, the analytic error is simply the systematic error. But for a single measurement on a patient sample, the actual error may range above or below the mean value by an amount that depends on the total uncertainty. This uncertainty includes both the uncertainty in the estimate of systematic error and the uncertainty of a single measurement (i.e., the random analytic error of the method).

Appendix 2. Example Applications

In this section we illustrate the application of the TE criteria to some glucose studies that are typical of those appearing in the recent literature. In judging the performance of these methods, we consider two performance standards (PS1 = 10 mg/dl at 50 mg/dl, PS2 = 10 mg/dl at 120 mg/dl). The statistical summary of the experimental data is given in Part A of Table 4. Part B summarizes the values that are calculated in order to judge acceptability.

Example 1. An AutoAnalyzer (Technicon) hexokinase method was evaluated. The precision experiment included eight measurements on one pool of serum. Although this is less data than generally desired in an evaluation study, it will serve to illustrate the effect of N in the calculations. The concentration
of the pool is not close to either of our Xc values (50 and 120), but let us assume that SD_r is constant at 1.9 mg/dl over the range 50-160 mg/dl. SD_Tu is 3.9 mg/dl (1.9 × 2.035) and RE_u is 7.8 mg/dl (2.0 × 3.9), both being large because SD_r cannot be estimated very well with only eight measurements. In the comparison experiment, only 20 samples were compared to a glucose oxidase method. The summation term [Σ(X_i − X̄)²] was not actually given, but all the data points were tabulated so it could be calculated. For Xc = 120 mg/dl, Yc is 120.5 mg/dl [0.997 × 120 + 0.90] and SE is 0.5 mg/dl [|120.5 − 120|]. W is 1.1 mg/dl, as calculated below:

\[
W = 2.093\,(2.16)\left(\frac{1}{20} + \frac{(143 - 120)^2}{100{,}678}\right)^{1/2}
\]

TE_u is 8.4 mg/dl [0.5 + √((7.8)² + (1.1)²)]. For Xc = 50 mg/dl, Yc is 50.8 mg/dl, SE is 0.8 mg/dl, W is 1.4 mg/dl, and TE_u is 8.6 mg/dl. Thus, the total analytic error is less than 10 mg/dl at concentrations of 50 and 120 mg/dl. Since performance is acceptable, we need not calculate SD_Tl, RE_l, and TE_l.

Table 4. Example Applications

                                    Example 1   Example 2   Example 3
A. Experimental data
  Precision experiment
    N                                   8           8          48
    SD_r (mg/dl)                        1.9         2.5         5.3
    av. concn (mg/dl)                 158         135         121
  Comparison experiment
    N                                  20         337          25
    a (mg/dl)                           0.900      -7.6         0.0
    b                                   0.997       0.99        1.058
    S_y.x (mg/dl)                       2.16        -           8.0
    X̄ (mg/dl)                         143           -           -
    Σ(X_i − X̄)²                  (100,678)         -           -
B. Calculated values for determining acceptability
    SD_Tu                               3.9         4.5         6.6
    RE_u                                7.8         9.0        13.2
    SD_Tl                               a           1.8         4.4
    RE_l                                a           3.6         8.8
    SE at Xc = 120                      0.5         8.8         7.0
    W at Xc = 120                       1.1         -           -
    TE_u at Xc = 120                    8.4       (17.8)c     (20.6)e
    TE_l at Xc = 120                    a         (12.4)d     (16.4)e
    SE at Xc = 50                       0.8         8.1         -
    W at Xc = 50                        1.4         -           -
    TE_u at Xc = 50                     8.6         -           -
    TE_l at Xc = 50                     a           -           -

a Not calculated because not needed.
b Not calculated because data are lacking.
c Assume S_y.x = 0, then W = 0.
d Assume X̄ → Xc, Σ(X_i − X̄)² is large, 1/N term → 0, then W → 0.
e Assume X̄ → Xc, Σ(X_i − X̄)² is large, then W → 3.3.

Example 2. An automated glucose oxidase method for glucose was evaluated. The precision data again are limited and we make the same assumption as in Example 1. RE_u again is large (9.0 mg/dl) because of the low number of measurements. The method was compared to an automated o-toluidine method and SE is 8.8 mg/dl at 120 mg/dl and 8.1 mg/dl at 50 mg/dl. W cannot be calculated because S_y.x is not given. Let us make an optimistic or low estimate of TE_u by assuming S_y.x to be 0, i.e., let us select a small S_y.x to give the test method the benefit of the doubt. Then TE_u at 120 mg/dl is 17.8 mg/dl [8.8 + 9.0]. We cannot be 95% certain that TE is less than EA, and cannot judge the method as acceptable. While we lack some of the data which are necessary to determine whether TE is larger than EA at α = 0.05, it is likely that TE actually exceeds EA. Because N = 337, the 1/N term in equation 22 is small. If X̄ approaches Xc, or if Σ(X_i − X̄)² is large, W also approaches zero (Table 4, footnote d), and TE_l is then 12.4 mg/dl [8.8 + 3.6]. TE is large primarily because of the large SE. This method most likely is unacceptable, unless SE is proven to originate with the reference method rather than with the test method.

Example 3. A manual kit o-toluidine method was evaluated. The concentration of the pool in the precision experiment agrees with that specified by PS2 and the number of measurements is large (N = 48). Although the amount of data is good, the precision of the method is not. SD_Tu is 6.6 mg/dl [5.3 × 1.23] and RE_u is 13.2 mg/dl, which is greater than EA. Therefore, we cannot be 95% certain that RE by itself is less than EA. RE_l is 8.8 mg/dl, so we cannot be 95% certain that RE by itself exceeds EA. Since RE_u exceeds EA, it is obvious that TE_u must also be greater than EA. SE at 120 is 7.0 mg/dl, so SE by itself is appreciable. As in Example 2, we need to make some assumptions in order to estimate an approximate value of TE_u. If we assume X̄ approaches Xc so that the last part of equation 22 approaches zero, then W would be 3.3 and TE_u would be 20.6 [7.0 + 13.6]. If our assumptions were correct, then TE_l would be 16.4 [7.0 + 9.4], in which case we could be 95% certain that TE exceeded EA.

These three examples point out the difficulty in applying the criteria to published evaluation data. Approximations are usually necessary because the statistical data are not complete. However, even with approximation, the judgments on performance are likely to be much more objective than the judgments made without the aid of these criteria.

References

1. Barnett, R. N., and Youden, W. M., A revised scheme for the comparison of quantitative methods. Amer. J. Clin. Pathol. 54, 454 (1970).
2. Henry, J. B., Beeler, M. F., Copeland, B. E., and Wert, E. B., A format for description of methods in clinical pathology. Amer. J. Clin. Pathol. 52, 296 (1969).
3. Broughton, P. M. G., Buttolph, M. A., Gownlock, A. H., Neill, D. W., and Sketelbery, R. G., Recommended scheme for the evaluation of instruments for automatic analysis in the clinical biochemistry laboratory. J. Clin. Pathol. 22, 278 (1969).
4. Logan, J. E., Evaluation of commercial kits. CRC Crit. Rev. Clin. Lab. Sci. 3, 271 (1972).
5. Westgard, J. O., and Hunt, M. R., Use and interpretation of common statistical tests in method-comparison studies. Clin. Chem. 19, 49 (1973).
6. Natrella, M. G., Experimental statistics. Nat. Bur. Stand. Handb. 91, U. S. Government Printing Office, Washington, D. C., 1963, pp 1-17, 4-3.
7. Natrella, M. G., Experimental statistics. Nat. Bur. Stand. Handb. 91, U. S. Government Printing Office, Washington, D. C., 1963, pp 4-1 to 4-7.
8. Creasy, M. A., Confidence limits for the gradient in the linear functional relationship. J. Roy. Statist. Soc. B 18, 65 (1956).
9. Halperin, M., Fitting of straight lines and prediction when both variables are subject to error. J. Amer. Statist. Ass. 56, 657 (1961).
10. Natrella, M. G., Experimental statistics. Nat. Bur. Stand. Handb. 91, U. S. Government Printing Office, Washington, D. C., 1963, pp 5-14 to 5-19.
11. Eisenhart, E., Realistic evaluation of the precision and accuracy of instrument calibration systems. J. Res. Nat. Bur. Stand., Sect. C 67C, 161 (1963).
12. ASTM Standard E 177.71, Standard recommended practice for use of the terms precision and accuracy as applied to measurement of a property of a material. American Society of Testing Materials, 1916 Race St., Philadelphia, Pa. 1971.
13. Barnett, R. N., Medical significance of laboratory results. Amer. J. Clin. Pathol. 50, 671 (1968).
14. Campbell, D. G., and Owen, J. A., Clinical laboratory error in perspective. Clin. Biochem. 1, 3 (1967).
15. Tonks, D. B., A study of the accuracy and precision of clinical chemistry determinations in 170 Canadian laboratories. Clin. Chem. 9, 217 (1963).
16. Cotlove, E., Harris, E. K., and Williams, G. Q., Biological and analytic components of variation in long-term studies of serum constituents in normal subjects; III. Physiological and medical implications. Clin. Chem. 16, 1028 (1970).
17. Vanko, M., Selected factors which influence the design of a quality control program. Advances in Automated Analysis, Technicon International Congress 1970, 1, E. C. Barton et al., Eds. Thurman Associates, Miami, Fla. 33132, p 159.
18. Duncan, B. M., and Geary, T. D., A method for analyzing results of medical laboratory proficiency surveys. J. Roy. Coll. Pathol. Aust. 5, 91 (1973).
