Criteria For Judging Precision and Accuracy in Method Development and Evaluation
We describe an approach for formulating criteria that can be used to judge whether an analytical method has acceptable precision and accuracy. We derive criteria for several experiments that are commonly used in method-evaluation studies: precision or replicates, recovery, interference, and comparison of patient values between the new method and a proven method. These criteria are based on the medical usefulness of the test results; thus the acceptability of the method is judged with respect to the clinical requirements.

Additional Keyphrases: statistics • quality control

Clinical Laboratories and the Departments of Medicine, Pathology, and Statistics, University of Wisconsin, Madison, Wis. 53706.
1 Statistician in residence; on leave from the Institute of Chemistry, University of Umea, Sweden.
Received April 22, 1974; accepted May 10, 1974.

In evaluating the performance of a new laboratory method, the analyst can be guided by several schemes that outline experimental procedures and statistical techniques (1-4). However, none of these schemes provides criteria by which the analyst can judge whether the clinical performance of the method is acceptable. If one is to make reliable decisions, it is necessary to clearly define (a) experimental protocols that provide reliable estimates of performance, (b) standards that represent acceptable performance, and (c) criteria that permit the observed performance to be compared with that performance which is defined as acceptable. Our purpose here is to outline an approach for defining (b) and (c). In order to limit the scope of this paper, we assume that (a) can be adequately defined, i.e., that the presently recommended experiments are satisfactory if used carefully, and that simple statistics such as the standard deviation, least-squares analysis, and paired t-test can be used to describe the performance of the new method.

A new analytical method must satisfy the requirements of both the user and the analyst. The user of clinical laboratory service, the physician, has certain requirements if the test results are to be medically useful. These specifications include the type of specimen, amount of specimen, expected range of concentration, elapsed time between specimen submission and report of results, and precision and accuracy. The analyst, or clinical chemist, must confer with the physician to obtain an adequate definition of these medical requirements. In addition, he must define other requirements based on the technical and economic resources of the laboratory.

In setting up the analytical method, the analyst must optimize the method to best satisfy all of the requirements. He may develop a new method, or he may evaluate existing methods that appear to satisfy all of the requirements. In developing a method, he will strive to satisfy the physical requirements such as the amount of specimen, range of linearity, etc., and then evaluate the performance of the method. At this early stage of testing, or developmental testing, he relies primarily on simple experiments for determination of within-run precision, recovery, and interference. When they indicate that the method appears to be acceptable, a final stage of testing (method evaluation) is conducted that includes experiments for determination of run-to-run precision and comparison of values obtained for patients with the test method and a reference method. Regardless of which experiments are used, the analyst's objectives are to quantify precision and accuracy; hence, to judge whether the method performs acceptably, the analyst must have objective criteria for judging precision and accuracy.

Formulation of Criteria for Judging Precision and Accuracy

To the analyst, precision means random analytic error. This is illustrated in Figure 1 by the distribution of the individual measurements around a mean value. Accuracy, on the other hand, is commonly thought to mean systematic analytic error, which is shown in Figure 1 by the difference between the mean of the measured values and the "true" value. Analysts sometimes find it useful to divide this systematic error into constant and proportional components.
[Figure: analytic error]
precision and accuracy (12) recommends that accuracy be documented by use of two terms: one that estimates random analytic error and the other systematic analytic error (or precision and bias, in their terminology). The manner in which these terms are combined
comparison to currently available performance, rather than to judge all methods as unacceptable.

Even though conclusions on acceptability depend on the particular values chosen for $E_A$, the conclusion will be meaningful as long as $E_A$ is stated. A proper statement of the conclusion should specify $E_A$; for example, "the method is judged acceptable because the total analytic error is less than 10 mg/dl at concentrations of 50 and 120 mg/dl." Or, "the method does not perform acceptably because the total analytic error exceeds 10 mg/dl at 120 mg/dl."

The reliability of such conclusions depends on the reliability of the estimates of analytic errors, which requires proper application of statistics. In particular, statistical models should allow for error in the reference method, or else the experiment should be designed to minimize the statistical bias introduced by such errors. We have discussed some of the factors that can limit the reliability of the statistical estimates in an earlier report (5) and are continuing to investigate this problem. Our main concern here is to outline an approach for formulating objective criteria, assuming that reliable estimates of analytic errors can be obtained by proper application of least-squares techniques.

We personally prefer least-squares techniques for analyzing the patient comparison data, because the slope and intercept terms provide information about the proportional or constant nature of systematic error, and this is useful to the analyst if he needs to modify the method to diminish the analytic errors. Because of this, we chose to derive SE and TE criteria for such a case. However, similar criteria can be derived for t-test statistics. (Confidence limits for the estimate of SE are calculated in the same manner as discussed for constant analytic error. See Appendix 1. Then bias would replace the first term in the TE criteria, and $t(SD_d/\sqrt{N})$ would replace W in the second term.) Although the criteria are formulated very simply, reliable estimates of SE are more difficult to obtain by t-test statistics. Only when $\bar{X}$ equals $X_c$ can one be sure that the estimate of systematic analytic error is reliable. Because this restriction is often difficult to satisfy and because the analyst loses the information about the nature of the systematic error, we prefer least-squares techniques.

We have considered only the evaluation model where a reference method is available and where systematic differences between the test and reference methods are not acceptable. Evaluation studies that do not fit this simple model will have to be approached differently. An understanding of the simple case will aid in the development of approaches that are appropriate for the more difficult cases.

ed by the usual procedures, and the upper and lower 95% confidence limits ($SD_{Tu}$ and $SD_{Tl}$, resp.) are calculated as described by Natrella (7) and shown in equations 1 and 2,

$$SD_{Tu} = SD_T \times (\text{upper factor}) \tag{1}$$

$$SD_{Tl} = SD_T \times (\text{lower factor}) \tag{2}$$

where the upper and lower factors are listed in Table 2 as a function of degrees of freedom, which is N-1, where N is the number of replicates. The 95% upper limit for random analytic error ($RE_u$) is given by equation 3 and the 95% lower limit ($RE_l$) by equation 4:

$$RE_u = 2\,SD_{Tu} \tag{3}$$

$$RE_l = 2\,SD_{Tl} \tag{4}$$

RE by itself is acceptable when

$$2\,SD_{Tu} < E_A \tag{5}$$

RE by itself is not acceptable when

$$2\,SD_{Tl} > E_A \tag{6}$$

Example Calculation: We illustrate the calculations for a glucose method and refer back to our earlier definition of performance standards, PS1 = 10 mg/dl at 50 mg/dl and PS2 = 10 mg/dl at 120 mg/dl. For an observed $SD_T$ of 2.0 mg/dl, which was obtained from 21 measurements on a pool which averaged 55 mg/dl, $SD_{Tu}$ is 2.7 mg/dl (2.0 × 1.358) and $RE_u$ is 5.4 mg/dl (2 × 2.7). For an observed $SD_T$ of 3.0 mg/dl at 110 mg/dl (N = 21), $SD_{Tu}$ is 4.1 mg/dl and $RE_u$ is 8.2 mg/dl. For both decision levels, the upper limit of random analytic error is less than the medically allowable error; therefore RE by itself is acceptable.

Proportional component of systematic analytic error (PE). In principle, the experiment should be performed by adding a small amount of pure dry standard material to the sample matrix. In practice, a small volume of a high-concentration aqueous standard is added to a large volume of the sample matrix (say a maximum of 0.1 ml standard to 0.9 ml sample). The original sample and the sample with standard added are analyzed and the percent recovery (%R) is calculated by equation 7.
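The random-error criteria of equations 1 to 6 amount to a few multiplications and two comparisons. The Python sketch below is an illustration only, not the authors' procedure: the function names and the "undecided" label are our own, and the upper and lower confidence factors are assumed to be read from the paper's Table 2 and passed in rather than computed.

```python
def random_error_limits(sd_t, upper_factor, lower_factor=None):
    """Return (RE_u, RE_l) following equations 1-4.

    sd_t          observed standard deviation SD_T (mg/dl)
    upper_factor  factor for the upper 95% limit (assumed taken from Table 2)
    lower_factor  factor for the lower 95% limit, or None if it is not needed
    """
    re_u = 2.0 * sd_t * upper_factor          # equations 1 and 3
    re_l = None
    if lower_factor is not None:
        re_l = 2.0 * sd_t * lower_factor      # equations 2 and 4
    return re_u, re_l


def judge_random_error(re_u, re_l, e_a):
    """Apply equations 5 and 6 against the allowable error E_A."""
    if re_u < e_a:
        return "acceptable"                   # equation 5
    if re_l is not None and re_l > e_a:
        return "not acceptable"               # equation 6
    # Neither limit settles the question at the 95% level
    # ("undecided" is a hypothetical label, not the paper's term).
    return "undecided"


# Glucose example from the text: SD_T = 2.0 mg/dl, N = 21, factor 1.358.
re_u, re_l = random_error_limits(2.0, 1.358)
print(round(re_u, 1), judge_random_error(re_u, re_l, e_a=10.0))
```

For the glucose pool at 55 mg/dl this gives an upper limit of about 5.4 mg/dl, which is below the allowable error of 10 mg/dl. Results can differ slightly from the printed values because the paper rounds $SD_{Tu}$ before doubling it.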
Here the upper and lower limits of the mean recovery are given by equations 11 and 12,

$$\%R_u = \overline{\%R} + t(\mathrm{SEM}) \tag{11}$$

$$\%R_l = \overline{\%R} - t(\mathrm{SEM}) \tag{12}$$

4 The standard error of the mean (SEM) is equal to $SD/\sqrt{N}$, where SD is the standard deviation of the individual observations and N is the number of observations. The imprecision of the method is judged separately in the RE criteria (equations 5 and 6); however, there is still an uncertainty in the estimate of a systematic analytic error caused by the imprecision. This uncertainty can be made small by having a sufficiently large number of observations.
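Equations 11 and 12 have the form "mean plus or minus t times SEM", with SEM as defined in footnote 4. The Python sketch below shows that calculation under stated assumptions: the recovery values are invented for illustration, the function names are ours, and the t value is assumed to be looked up in a t table by the reader. The second helper applies the same form to summary statistics, using the bias, $SD_d$, N, and t of 2.31 quoted in the paper's creatinine example.

```python
import math
import statistics


def mean_recovery_limits(recoveries_pct, t_value):
    """Upper and lower limits of the mean recovery (equations 11 and 12)."""
    n = len(recoveries_pct)
    mean_r = statistics.mean(recoveries_pct)
    # SEM = SD / sqrt(N), with SD the sample standard deviation (footnote 4).
    sem = statistics.stdev(recoveries_pct) / math.sqrt(n)
    return mean_r + t_value * sem, mean_r - t_value * sem


def limits_from_summary(mean, sd, n, t_value):
    """Same form, starting from a reported mean, SD, and N."""
    half_width = t_value * sd / math.sqrt(n)
    return mean + half_width, mean - half_width


# Hypothetical recovery data (percent); t = 2.776 for 4 degrees of freedom.
upper, lower = mean_recovery_limits([98.0, 102.0, 100.0, 96.0, 104.0], 2.776)
print(round(upper, 1), round(lower, 1))

# Creatinine interference example: bias = 2.0 mg/dl, SD_d = 1.2 mg/dl, N = 9.
ce_u, ce_l = limits_from_summary(2.0, 1.2, 9, 2.31)
print(round(ce_u, 1))
```

Passing the t value in keeps the sketch free of any statistics library beyond the standard one; a larger N shrinks the half-width, which is the point made in footnote 4.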
Performance is acceptable when

Example Calculation: For addition of 1.5 mg of creatinine per deciliter, bias = 2.0 mg/dl, $SD_d$ = 1.2 mg/dl, N = 9. $CE_u$ is 2.9 mg/dl [2.0 + 2.31(1.2/$\sqrt{9}$)]; thus constant error due to normal concentrations of creatinine is less than our specified allowable error of 10 mg/dl. For additions of 15 mg of creatinine per deciliter, bias = 17.0 mg/dl, $SD_d$ = 2.1 mg/dl, and N = 9. $CE_u$ is 18.6 mg/dl [17 + 2.31(2.1/$\sqrt{9}$)] and $CE_l$ is 15.4 mg/dl [17 - 2.31(2.1/$\sqrt{9}$)]. For samples from uremic patients, the analytic errors could exceed the allowable error of 10 mg/dl (both PS1 and PS2).

Systematic analytic error (SE). The mean systematic error can be estimated by analyzing at least 40 patients' specimens (1) by both the test and reference methods. We assume the case where an accurate reference method is available; therefore differences between the test and reference methods are due to errors that originate in the test method.5 The statistical approach must be appropriate for the experimental situation; however, some form of least-squares analysis will generally be applicable, sometimes simple linear regression, or more generally "confluence" analysis6 (8, 9). We illustrate the formulation of criteria for the simple case when linear regression can be used, i.e., the case when random analytic error in the reference method (X value) is sufficiently small that unbiased estimates of the

5 In practice, the selection of a reference method is indeed a problem, but for many commonly measured substances, such as glucose, calcium, etc., there are methods that are generally accepted (by consensus) as being accurate enough for this purpose. Differences between test and reference values are still assumed to be errors in the test method, unless proven otherwise. Enzyme methods, hormone assays, and many other newly developed tests do not fit this experimental model and will have to be approached in a somewhat different manner.

6 Confluence analysis is used here to designate a least-squares technique that allows for random error in both x and y. Simple linear regression allows for error only in y and may give an estimate for the slope that is too low and an estimate for the intercept that is too high.

$$Y = a + bX \tag{19}$$

W is the width of the confidence band and is calculated from equation 22,

$$W = t_{0.975}\, S_{y \cdot x} \left( \frac{1}{N} + \frac{(X_c - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right)^{1/2} \tag{22}$$

where $\alpha$ is 0.05, $t_{0.975}$ is obtained from Table 3 for N-1 degrees of freedom, $S_{y \cdot x}$ is the standard error about the regression line, $\bar{X}$ is the mean of the reference values, and $X_i$ is an individual reference value. The upper limit of systematic error ($SE_u$) is the absolute difference between $X_c$ and the upper or lower limit of $Y_c$, whichever gives the larger estimate.

$$SE_u = \left| (Y_{cu} \text{ or } Y_{cl}) - X_c \right|_u \tag{23}$$

Substituting for $Y_{cu}$ or $Y_{cl}$ with equations 20 and 21 and for $Y_c$ with equation 19 gives

$$SE_u = \left| (a + bX_c \pm W) - X_c \right|_u \tag{24}$$

The lower limit of systematic error ($SE_l$) is whichever difference gives the smaller estimate.

$$SE_l = \left| (a + bX_c \pm W) - X_c \right|_l \tag{25}$$

Systematic analytic error by itself is acceptable when $SE_u$ is less than $E_A$, or

$$\left| (a + bX_c \pm W) - X_c \right|_u < E_A \tag{26}$$

SE is not acceptable when $SE_l$ is greater than $E_A$, or

$$\left| (a + bX_c \pm W) - X_c \right|_l > E_A \tag{27}$$
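The regression-based quantities of equations 19 and 22 to 27 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are ours, the t value is passed in from a table, and the inputs are the figures quoted in the paper's first worked example (a = 0.90, b = 0.997, $S_{y \cdot x}$ = 2.16, N = 20, mean reference value 143, sum of squares 100,678). The total-error combination, SE plus the square root of RE squared plus W squared, is inferred from the arithmetic shown in the worked examples rather than from a formula in this excerpt.

```python
import math


def confidence_band_width(t_value, s_yx, n, x_c, x_mean, sum_sq_dev):
    """Width W of the confidence band at X_c (equation 22)."""
    return t_value * s_yx * math.sqrt(1.0 / n + (x_c - x_mean) ** 2 / sum_sq_dev)


def systematic_error_limits(a, b, x_c, w):
    """Return (SE_u, SE_l): larger and smaller of |(a + b*X_c +/- W) - X_c|."""
    y_c = a + b * x_c                          # equation 19 evaluated at X_c
    diffs = (abs(y_c + w - x_c), abs(y_c - w - x_c))
    return max(diffs), min(diffs)              # equations 24 and 25


def total_error(se, re, w):
    """Combination inferred from the worked examples: SE + sqrt(RE^2 + W^2)."""
    return se + math.sqrt(re ** 2 + w ** 2)


# Figures quoted in the paper's first example, at X_c = 120 mg/dl.
w = confidence_band_width(2.093, 2.16, 20, 120.0, 143.0, 100678.0)
se_u, se_l = systematic_error_limits(0.90, 0.997, 120.0, w)
print(round(w, 1), round(se_u, 1), round(se_l, 1))
print(round(total_error(0.5, 7.8, 1.1), 1))   # SE, RE_u, W as printed in the example
```

With these inputs W comes out at about 1.1 mg/dl, and the upper limit of systematic error stays well under an allowable error of 10 mg/dl, so equation 26 is satisfied for this case.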
                        Example 1   Example 2   Example 3
SD_Tl                   -a          1.8
RE_l                    -a          3.6         8.8
SE at Xc = 120          0.5         8.8
W at Xc = 120           1.1
TE_u at Xc = 120        8.4         (17.8)c     (20.6)e
TE_l at Xc = 120        -a          (12.4)d     (16.4)e
SE at Xc = 50           0.8         -           -
W at Xc = 50            1.4         -           -
TE_u at Xc = 50         8.6         -           -
TE_l at Xc = 50         -a          -           -

a Not calculated because not needed.
b Not calculated because data are lacking.
c Assume $S_{y \cdot x}$ = 0, then W = 0.
d Assume $\bar{X} \to X_c$, $\sum (X_i - \bar{X})^2$ is large, 1/N term → 0, then W → 0.
e Assume $\bar{X} \to X_c$, $\sum (X_i - \bar{X})^2$ is large, then W → 3.3.

of the pool is not close to either of our $X_c$ values (50 and 120), but let us assume that $SD_T$ is constant at 1.9 mg/dl over the range 50-160 mg/dl. $SD_{Tu}$ is 3.9 mg/dl (1.9 × 2.035) and $RE_u$ is 7.8 (2.0 × 3.9), both being large because $SD_T$ cannot be estimated very well with only eight measurements. In the comparison experiment, only 20 samples were compared to a glucose oxidase method. The summation term [$\sum (X_i - \bar{X})^2$] was not actually given, but all the data points were tabulated so it could be calculated. For $X_c$ = 120 mg/dl, $Y_c$ is 120.5 mg/dl [0.997 × 120 + 0.90] and SE is 0.5 mg/dl [120.5 - 120]. W is 1.1 mg/dl, as calculated below:

$$W = 2.093\,(2.16) \left( \frac{1}{20} + \frac{(143 - 120)^2}{100{,}678} \right)^{1/2}$$

$TE_u$ is 8.4 mg/dl [0.5 + $\sqrt{(7.8)^2 + (1.1)^2}$]. For $X_c$ = 50 mg/dl, $Y_c$ is 50.8, SE is 0.8 mg/dl, W is 1.4 mg/dl, and $TE_u$ is 8.6 mg/dl. Thus, the total analytic error is less than 10 mg/dl at concentrations of 50 and 120 mg/dl. Since performance is acceptable, we need not calculate $SD_{Tl}$, $RE_l$, and $TE_l$.

mg/dl [8.8 + 3.6]. TE is large primarily because of the large SE. This method most likely is unacceptable, unless SE is proven to originate with the reference method rather than with the test method.

Example 3. A manual kit o-toluidine method was evaluated. The concentration of the pool in the precision experiment agrees with that specified by PS2 and the number of measurements is large (N = 48). Although the amount of data is good, the precision of the method is not. $SD_{Tu}$ is 6.6 mg/dl [5.3 × 1.23] and $RE_u$ is 13.2 mg/dl, which is greater than $E_A$. Therefore, we cannot be 95% certain that RE by itself is less than $E_A$. $RE_l$ is 8.8, so we cannot be 95% certain that RE by itself exceeds $E_A$. Since $RE_u$ exceeds $E_A$, it is obvious that $TE_u$ must also be greater than $E_A$. SE at 120 is 7.0 mg/dl, so SE by itself is appreciable. As in example 2, we need to make some assumptions in order to estimate an approximate value of $TE_u$. If we assume $\bar{X}$ approaches $X_c$ so that the last part of equation 22 approaches zero, then W would be 3.3 and $TE_u$ would be 20.6 [7.0 + 13.6]. If our assumptions were correct, then $TE_l$ would be 16.4 [7.0 + 9.4], in which case we could be 95% certain that TE exceeded $E_A$.

These three examples point out the difficulty in applying the criteria to published evaluation data. Approximations are usually necessary because the statistical data are not complete. However, even with approximation, the judgments on performance are likely to be much more objective than the judgments made without the aid of these criteria.

References

1. Barnett, R. N., and Youden, W. M., A revised scheme for the comparison of quantitative methods. Amer. J. Clin. Pathol. 54, 454 (1970).
2. Henry, J. B., Beeler, M. F., Copeland, B. E., and Wert, E. B., A format for description of methods in clinical pathology. Amer. J. Clin. Pathol. 52, 296 (1969).
3. Broughton, P. M. G., Buttolph, M. A., Gownlock, A. H., Neill, D. W., and Sketelbery, R. G., Recommended scheme for the eval-