
Invited Review

Method comparison in the


clinical laboratory
Asger Lundorff Jensen, Mads Kjelgaard-Hansen

Abstract: Studies comparing a new method with an established method, to assess whether the new measurements are comparable
with existing ones, are frequently conducted in clinical pathology laboratories. Assessment usually involves statistical analysis of
paired results from the 2 methods to objectively investigate sources of analytical error (total, random, and systematic). In this
review article, the types of errors that can be assessed in performing this task are described, and a general protocol for comparison
of quantitative methods is recommended. The typical protocol has 9 steps: 1) state the purpose of the experiment, 2) establish
a theoretical basis for the method comparison experiment, 3) become familiar with the new method, 4) obtain estimates of random
error for both methods, 5) estimate the number of samples to be included in the method comparison experiment, 6) define
acceptable difference between the 2 methods, 7) measure the patient samples, 8) analyze the data, and 9) judge acceptability. The
protocol includes the essential investigations and decisions needed to objectively assess the overall analytical performance of
a new method compared to a reference or established method. The choice of statistical methods and recommendations of decision
criteria within the stages are discussed. Use of the protocol for decision-making is exemplified by the comparison of 2 methods for
measuring alanine aminotransferase activity in serum from dogs. Finally, a protocol for comparing simpler semiquantitative
methods with established methods that measure on a continuous scale is suggested. (Vet Clin Pathol. 2006;35:276–286)
©2006 American Society for Veterinary Clinical Pathology

Key Words: Acceptability, difference plot, imprecision, inaccuracy, reference method

I. Introduction . . . 276
II. Basic Concepts of a Method Comparison Study . . . 277
III. Types of Errors . . . 277
IV. Method Comparison . . . 278
   A. Suggested protocol . . . 278
      1. State the purpose of the experiment . . . 278
      2. Establish a theoretical basis for the method comparison experiment . . . 278
      3. Become familiar with the new method . . . 278
      4. Obtain estimates of random error for both methods . . . 278
      5. Estimate the number of samples to be included in the method comparison experiment . . . 278
      6. Define acceptable difference between the 2 methods . . . 278
      7. Measure the patient samples . . . 280
      8. Analyze the data . . . 280
      9. Judge acceptability . . . 281
   B. Special case: comparing semiquantitative and quantitative tests . . . 282
   C. What if the 2 methods do not produce identical results? . . . 282
V. Example: Alanine Aminotransferase . . . 283
VI. References . . . 285

Introduction

Studies comparing a new method with an established method, to assess whether the new measurements are comparable with existing ones, frequently are conducted in clinical pathology laboratories. Method comparison experiments have been the topic of many publications in scientific journals,1–14 textbooks,15,16 and scientific societies,17 as well as web sites.18

Assessment usually involves statistical analysis of paired results from the 2 methods, but it also is valuable to consider the applicability of the new method by looking at the cost of a new analyzer; the costs, safety, and availability of reagents and calibration material; the space occupied by a new analyzer; the requirements for sampling the material; the time to obtain a result when the analyzer is ready and when the analyzer is not ready; operator education; waste handling; etc. Decisions concerning applicability are based almost exclusively on local and subjective assessments, whereas decisions concerning analytical performance usually depend on statistical analyses and objective criteria of acceptability.

In this review article, we address the basic concepts of a method comparison study, suggest a protocol for comparison of quantitative methods, including remarks on statistical methods and error assessment, and present an example of

From the Department of Small Animal Clinical Science, The Royal Veterinary and Agricultural University, Groennegaardsvej 3, DK-1870, Frederiksberg C, Denmark. Corresponding
author: Asger Lundorff Jensen (alj@kvl.dk). This article has been peer-reviewed. ©2006 American Society for Veterinary Clinical Pathology

Page 276 Veterinary Clinical Pathology Vol. 35 / No. 3 / 2006


a method comparison experiment. The main focus will be on methods with measurements on a continuous scale (eg, serum enzymes, WBC counts), where errors can be assessed more objectively. Methods with results on an ordinal scale (eg, cytologic findings) also can be subjected to method comparison studies, usually in the form of intra- and interobserver variation studies applying statistical tests such as kappa statistics, chi-square tests, Fisher's exact tests, and percent agreement. Objective assessment of intra- and interobserver variation studies is, in our opinion, not always straightforward.

Basic Concepts of a Method Comparison Study

Erroneous test results inevitably occur when laboratory methods are applied. A primary function of the laboratory is to minimize the amount of error so that test interpretation, patient care, and consumer safety are not compromised. Important measures used to minimize erroneous test results include the use of validated methods, sufficiently trained personnel, written standard operating procedures, internal and external quality assurance and control programs, and knowledge about the reasons for, types of, and magnitudes of errors. Such information also is of importance when reporting on diagnostic accuracy expressed in terms of sensitivity and specificity, likelihood ratios, diagnostic odds ratio, or area under a receiver operating characteristic (ROC) curve, as has been promoted recently by the STARD initiative.19

Types of Errors

Reasons for erroneous test results are traditionally divided into 3 categories: 1) preanalytical error, eg, wrong patient, wrong sampling technique, or wrong sample handling; 2) total analytical error; and 3) postanalytical error, eg, misspelling, transcription error, or wrong unit (Figure 1). Method comparison studies are used to investigate total analytical error. The magnitude of total analytical error is the summation of random error and systematic error2,16 (Figure 2). Total analytical error20,21 is a quantitative measure; its qualitative counterpart is accuracy or trueness, which is the closeness of the agreement between a test result and the true value.

Figure 1. Types of errors in the clinical laboratory.

Figure 2. Graphical illustration of random error, systematic error, and total analytical error.

Random error3,4 is a matter of precision. Precision, which is a qualitative concept, depends only on the distribution of random error and does not relate to the true value. The quantitative counterpart of precision is imprecision, which is computed as a standard deviation (SD) or a coefficient of variation (CV) of the measurement results. Imprecision depends critically on the specified conditions and may be expressed as the CV of repeatability (closeness of agreement between independent results obtained with the same method on identical test material under the same conditions: same operator, same apparatus, same laboratory, and after short intervals of time) or as the CV of reproducibility (closeness of agreement between independent results obtained with the same method on identical test material but under different conditions: different operators, different apparatus, different laboratories, and/or after different intervals of time).

Systematic error,3,4 also referred to as bias or inaccuracy, is the mean that would result from an infinite number of measurements of the same analyte carried out under repeatability conditions minus a true value of the analyte. In practice, infinite numbers of measurements are impossible to achieve, and thus considerably smaller numbers of measurements are used. A true value is a value that would be obtained by a perfect measurement. This also is impossible to achieve, and the best estimate of a true value is a value produced by a reference method, which can be described as a thoroughly investigated test method, clearly and exactly describing the necessary conditions and procedures for the evaluation of a specific biological endpoint, which has been shown to have accuracy and precision commensurate with its intended use and which can, therefore, be used to assess the accuracy of other methods for the same measurement. When a reference method is not available, certified reference material (which is not identical to calibration material) with values measured by a reference method may be used to assess systematic error. In veterinary clinical pathology, certified species-specific reference material or reference methods are seldom available, and existing, routinely applied methods are frequently used as the method to which a new method is compared.

In other words, systematic error is the new method's difference from what is held to be a true value as determined by a reference method or an existing method in the laboratory. Systematic error can be subdivided into constant and proportional systematic error (Figure 1). Constant systematic errors are systematic deviations estimated as the average differences between the 2 methods; the presence of a constant systematic error indicates that one method measures consistently higher or lower in comparison with the other method. Proportional systematic error means that the differences between the 2 methods are proportionally related to the level of measurements.

Method Comparison

Suggested protocol

A method comparison study is a research experiment, and as with all other research experiments, a research protocol outlining the scope and procedures is essential. Local traditions may influence the structure and content of the protocol. In the following, we suggest a protocol based primarily on previous publications16,18 that includes items which in our experience are useful.

1. State the purpose of the experiment

The reason for performing a method comparison experiment is to estimate the type and magnitude of systematic error between 2 methods and to judge whether the 2 methods are identical within the inherent imprecision of both methods or within preset analytical quality specifications.

2. Establish a theoretical basis for the method comparison experiment

It usually is very helpful to collect and write down information relating to both the new method and the comparative method. Information on sample requirements, analytical process, reaction principles, calibration procedure, calculations, known interferences, and anticipated analytical performance (eg, anticipated imprecision, inaccuracy, reportable range obtained in other species) is essential if unexpected or aberrant results occur. If antigen-antibody reactions are involved in the new method, knowledge or hypotheses concerning the specificity, affinity, and avidity of the applied antibodies are also valuable.

3. Become familiar with the new method

In this phase, a working procedure is established. In practical terms, this means that one establishes sufficient working competence with the method so that one can correctly prepare reagents, set up the analyzer, calibrate the method, and obtain test results. If not done earlier, one also assesses whether the new method can actually measure the analyte in question, eg, by measuring samples with presumed different levels of analyte and mixtures thereof.

4. Obtain estimates of random error for both methods

Estimates of random error (ie, data on imprecision) serve at least 2 purposes. First, estimates of random error are used in the method comparison experiment to judge acceptability of the new method (see point 6). Second, if duplicate or replicate measurements are used in the method comparison experiment, estimates of random error may help in assessing the validity of the measurements by the individual methods and help identify unexpected test results arising from sample mix-ups, transposition errors, and other mistakes.

If estimates of random error for both methods are not already available, imprecision studies should be conducted. For quantitative assays, it is useful to report imprecision as the CV either at 2 or more specified mean values near clinical decision points or at values in the low, middle, and high parts of the analytical range, as obtained by repeating the test over a specified number of days. Within-run CVs are appropriate if all patient samples are analyzed in a single run.

5. Estimate the number of samples to be included in the method comparison experiment

Most authors recommend including at least 40 patient samples in the method comparison experiment.16,22 The samples should cover the working range of the methods and should represent the spectrum of diseases expected in routine application of the methods. Another significant factor that determines the statistical power of a method comparison experiment is the number of samples. Based on simulations, it has been shown that an important factor in deciding the number of samples to include is the range ratio, which is the maximum value divided by the minimum value.8 When the range ratio is low, eg, 2, the number of samples should be high, eg, 500, whereas when the range ratio is high, eg, 10, the number of samples may be lower, eg, 100.

6. Define acceptable difference between the 2 methods

Before the measurements are conducted, the amount of analytical error that is allowable without compromising test


interpretation, patient care, or consumer safety is defined. The basis for a method comparison study is the hypothesis that the 2 methods are identical either within the inherent imprecision of both methods or within preset analytical quality specifications.6

Acceptance limits based on inherent imprecision of both methods. The inherent imprecision of both methods is calculated as √(CV²Method1 + CV²Method2). When means of duplicates are used, the formula is √(CV²Method1/2 + CV²Method2/2). If single measurements are used and the imprecision (CV) is 5% and 3%, then the inherent imprecision of both methods is √(5² + 3²) = 5.8%. If means of duplicates are used, the CV is √(5²/2 + 3²/2) = 4.1%. This means that if the mean value of the 2 methods is 100 and they are expected to measure identically, then the difference between the 2 methods is expected to be within the interval 0 ± 1.96 · CV · mean in 95% of the measurements, ie, 0 ± 1.96 · (0.041 · 100) = 0 ± 8.04.

Acceptance limits based on analytical quality specifications. Analytical quality specifications can be established in a number of ways. In evidence-based health care, types of evidence and grading of recommendations used in clinical practice guidelines are placed in a hierarchy of objectivity, the best being first and the worst being last.23 Recently, a similar hierarchy for analytical quality specifications in human clinical pathology has been proposed24 (Table 1). In veterinary clinical pathology, analytical quality specifications are very rarely derived objectively from an analysis of medical needs in specific clinical situations, one exception being the use of error grid analysis in the evaluation of portable blood glucose meters in dogs and cats.25,26

Table 1. Proposed hierarchy of models to be applied to set analytical quality specifications.

No. | Model | Sources of Information
1 | Evaluation of the effect of analytical performance on clinical outcomes in specific clinical settings |
2 | Evaluation of the effect of analytical performance on clinical decisions in general | 2.a. Data on biological variation; 2.b. Analysis of clinicians' opinions
3 | Published professional recommendations | 3.a. National and international expert bodies; 3.b. Expert local groups or individuals
4 | Performance goals | 4.a. Regulatory bodies; 4.b. Organizers of External Quality Assessment (EQA) schemes
5 | Goals based on current state of the art | 5.a. Data from EQA or proficiency testing schemes; 5.b. Current publications on methodology

Meanwhile, data on biological variation for many blood components in dogs, cows, and rabbits have been available for many years. A list of data on biological variation of some common analytes in dogs is presented in Table 2. Data on biological variation also are available for numerous blood components in humans,27 and these values can be used as starting points until veterinary data are available. Data on biological variation make it possible to calculate objectively maximum allowable values for imprecision28–31 (Imax), inaccuracy31 (Bmax), and total error32 (TEmax) from the within-animal (CVwithin) and between-animal (CVbetween) variations using the following formulas:

Imax = 0.5 · CVwithin
Bmax = 0.25 · √(CV²within + CV²between)
TEmax = Bmax + (1.65 · Imax)

Other criteria that have been suggested for TEmax are: TEmax = Bmax + (2 · Imax), TEmax = Bmax + (3 · Imax), and TEmax = Bmax + (4 · Imax).3

Table 2. Data on biological variation for some canine blood and serum components.*

Analyte | CVG (%) | CVI (%) | CVA (%) | CVmax (%) | Bmax (%) | TEmax (%) | Reference No.
RBC | 4.4 | 5.4 | 2.8 | 2.7 | 1.8 | 6.3 | 39
HCT | 5.2 | 6.4 | 1.1 | 3.2 | 2.1 | 7.4 | 39
Hgb | 4.7 | 5.9 | 2.9 | 3.0 | 1.9 | 6.9 | 39
WBC | 12.3 | 12.1 | 3.7 | 6.1 | 4.3 | 14.4 | 39
ALT | 23.7 | 9.7 | 3.2 | 4.8 | 6.4 | 14.3 | 40
AST | 10.9 | 11.4 | 3.3 | 5.7 | 4.0 | 13.4 | 40
ALP | 34.2 | 8.6 | 1.7 | 4.3 | 8.8 | 15.9 | 40
Albumin | 3.0 | 2.4 | 1.6 | 1.2 | 1.0 | 3.0 | 40
Total protein | 3.1 | 2.6 | 1.1 | 1.3 | 1.0 | 3.2 | 40
Urea | 35.1 | 16.1 | 3.8 | 8.0 | 9.7 | 22.9 | 40
Creatinine | 12.9 | 14.6 | 2.9 | 7.3 | 4.9 | 17.0 | 40
Cholesterol | 15.1 | 7.3 | 3.0 | 3.7 | 4.2 | 10.3 | 40
Glucose | 3.8 | 9.5 | 3.7 | 4.8 | 2.6 | 10.5 | 41
Fructosamine | 4.2 | 11.1 | 2.8 | 5.6 | 3.0 | 12.2 | 41
Potassium | 3.6 | 3.3 | 0.1 | 1.7 | 1.2 | 4.0 | 42
Total thyroxine (TT4) | 17.2 | 17.0 | 4.0 | 8.4 | 6.0 | 19.9 | 43
Thyrotropin (TSH) | 43.6 | 13.6 | 8.8 | 6.8 | 11.4 | 22.6 | 44
Iron | 17.2 | 17.8 | 0.7 | 8.9 | 6.2 | 20.9 | 45
Fibrinogen | 19.0 | 17.1 | 2.8 | 8.5 | 6.4 | 20.4 | 45
C-reactive protein | 29.3 | 24.3 | 7.2 | 12.2 | 9.5 | 29.6 | 45
α-1-acid glycoprotein | 67.0 | 9.6 | 8.1 | 4.8 | 16.9 | 24.8 | 45
Haptoglobin | 20.2 | 17.0 | 4.9 | 8.5 | 6.6 | 20.6 | 45

*CVG indicates between-dog coefficient of variation; CVI, within-dog coefficient of variation; CVA, analytical coefficient of variation; CVmax, maximum allowable imprecision; Bmax, maximum allowable inaccuracy; and TEmax, maximum allowable total error.

Analytical performance data from the Clinical Laboratory Improvement Amendments (CLIA) proficiency testing criteria for medical laboratories also can be a starting point for setting analytical quality requirements (for more details on CLIA, see www.fda.gov/cdrh/CLIA/index.html). Some CLIA total allowable errors are already given in percentages. If given in concentration units, total allowable error is calculated as a percent of the medical decision concentration of interest, ie, divide the total allowable error by the medical decision concentration and multiply by 100 to express it as a percentage. Maximum values for imprecision, inaccuracy, and total error are used later in the experiment to judge the acceptability of the new method (see point 9).

7. Measure the patient samples

Patient samples are measured by both methods, preferably no more than 2 to 4 hours apart, to avoid changes due to instability of the samples (eg, evaporation). Specific considerations are needed if the analyte is unstable (eg, ammonia). To minimize systematic errors that might occur in only a single run, samples should be assayed on several different analytical runs on different days. A minimum of 5 days is recommended for method comparison experiments using, for example, 8–10 patient samples per day.

It is advisable to perform duplicate analysis of the patient samples so that obvious outliers (eg, arising from sample mix-ups) can be detected. It also is advisable to base the duplicate measurements on 2 single measurements of the same specimen divided into 2 separate, randomly placed sample cups. A further advantage of duplicate measurements is that the relationship between the 2 methods can become somewhat clearer if the means of the duplicate measurements rather than single values are compared, so that analytical variation is reduced (see point 6).

8. Analyze the data

The optimal method for the analysis of data from method comparison experiments is a matter of active discussion.1,6,7,9,10,12

Plotting the data. Data analysis usually begins with plotting the data from the new method on the y-axis against the comparative method on the x-axis, setting the line of identity (y = x) in the plot, and making an initial visual assessment of the relationship between the results and the adequacy of the data distribution and data range (Figure 3). Plotting of the data should begin when the method comparison experiment begins, so that large differences can be investigated and the sample in question reanalyzed while the sample is still available.

Figure 3. Data from a method comparison experiment with the new method (Method 2) plotted against the comparative method (Method 1).

Commonly used statistical analyses. Data are inspected for linearity and distribution, and if y and x are linearly related and the data are not clumped in one end of the data range, correlation analysis is performed. Then, the correlation coefficient (r) is calculated, not as a measure of acceptability of the new method but as a means to assess whether subsequent statistical analysis using ordinary linear regression is useful. The reason that the correlation coefficient is not used as a measure of acceptability is that it does not assess agreement but association; a high correlation is no guarantee of good agreement.11,12

When the correlation coefficient (r) is >0.975 (for data encompassing a small range) or >0.99 (for data encompassing a wide range), simple linear regression provides useful information about constant error and proportional error via the intercept and slope, respectively.1,10 Constant error is present if the intercept differs significantly from 0, while proportional error exists if the slope differs from 1. To further ascertain that the data follow a straight line, a simple runs test or the more complex lack-of-fit test can be applied.

If r is <0.975 (or <0.99), the data may need improvement, ie, more data may be needed in certain parts of the plot, analytical variance may need to be decreased by replicate measurements, or alternative regression analysis, such as Deming or Passing-Bablok regression, may be needed. Ordinary least squares regression assumes that the comparative method is free of error and that the error of the new method is normally distributed and constant over the range studied.7 These assumptions are most often not met in method comparison experiments, and therefore some authors have proposed other alternatives, such as the Passing-Bablok or Deming regression techniques.7,13 In Passing-Bablok regression, extreme values can be included, imprecision in both methods is allowed, and the imprecision does not have to be normally distributed or constant over the data range. In Deming regression, both the new and the comparative method may be measured with error.15 Stöckl et al10 found that Passing-Bablok regression should be used with care and that it may treat too many data points as outliers. Meanwhile, Payne33 suggested that the Passing-Bablok regression procedure is likely to be more accurate than Deming's procedure when analytical imprecision increases with measured concentration.

If the data are nonlinear, some sort of systematic proportional error is present. In these cases, data analysis can be performed on subsets of the data range, or nonlinear regression procedures may be used. However, objective assessment of acceptability is often very difficult in these cases.
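The acceptance-limit arithmetic described in point 6 is simple enough to script for reuse. Below is a minimal Python sketch reproducing the worked examples from the text; the function and variable names are our own illustration, not from the article.

```python
import math

def combined_cv(cv1, cv2, duplicates=False):
    """Combined inherent CV (%) of 2 methods; the variances are halved
    when means of duplicate measurements are compared."""
    if duplicates:
        return math.sqrt(cv1**2 / 2 + cv2**2 / 2)
    return math.sqrt(cv1**2 + cv2**2)

def acceptance_interval(cv_combined, mean):
    """95% acceptance interval 0 +/- 1.96 * CV * mean for the
    difference between the 2 methods at a given mean value."""
    half_width = 1.96 * (cv_combined / 100) * mean
    return (-half_width, half_width)

def quality_specs(cv_within, cv_between):
    """Maximum allowable imprecision (Imax), inaccuracy (Bmax), and
    total error (TEmax) derived from biological variation data."""
    i_max = 0.5 * cv_within
    b_max = 0.25 * math.sqrt(cv_within**2 + cv_between**2)
    te_max = b_max + 1.65 * i_max
    return i_max, b_max, te_max

# Worked example from the text: method CVs of 5% and 3%
print(round(combined_cv(5, 3), 1))                   # 5.8 (single measurements)
print(round(combined_cv(5, 3, duplicates=True), 1))  # 4.1 (means of duplicates)
print(tuple(round(v, 2) for v in acceptance_interval(4.1, 100)))  # (-8.04, 8.04)

# Canine ALT from Table 2: CVI (within-dog) = 9.7%, CVG (between-dog) = 23.7%
i_max, b_max, te_max = quality_specs(9.7, 23.7)
print(round(i_max, 1), round(b_max, 1), round(te_max, 1))  # 4.8 6.4 14.4 (cf. the ALT row)
```

Note that TEmax computed from unrounded Imax and Bmax comes out at 14.4% for ALT, fractionally above the 14.3% tabulated; such small discrepancies are rounding artifacts, not errors in the formulas.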

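The difference between ordinary least squares and errors-in-variables regression discussed above can be made concrete with a small simulation. The sketch below implements Deming regression in its standard closed form; the function names, the simulated data, and the choice of equal error variances (delta = 1) are our own illustration, not from the article, and a validated statistical package is preferable for routine work. Passing-Bablok regression, being rank-based, is omitted here.

```python
import random

def deming(x, y, delta=1.0):
    """Deming regression: allows measurement error in BOTH methods.
    delta = ratio of the error variances (y-error / x-error);
    delta = 1 gives orthogonal regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    syy = sum((yi - my) ** 2 for yi in y) / (n - 1)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    slope = (syy - delta * sxx
             + ((syy - delta * sxx) ** 2 + 4 * delta * sxy ** 2) ** 0.5) / (2 * sxy)
    return slope, my - slope * mx

def ols(x, y):
    """Ordinary least squares: assumes the comparative method (x) is error-free."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return slope, my - slope * mx

# Simulated comparison: both methods measure the same true values with noise,
# so OLS attenuates the slope toward 0, whereas Deming regression does not.
random.seed(1)
truth = [random.uniform(20, 200) for _ in range(100)]
x = [t + random.gauss(0, 8) for t in truth]  # comparative method
y = [t + random.gauss(0, 8) for t in truth]  # new method

print("OLS slope:    %.3f" % ols(x, y)[0])
print("Deming slope: %.3f" % deming(x, y)[0])
```

With noise in both methods, the Deming slope is never smaller than the OLS slope and, on average, sits closer to the true value of 1.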
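The acceptability judgment based on inherent imprecision (point 9 of the protocol) reduces to checking whether about 95% of the paired differences fall inside 0 ± 1.96 × combined CV × mean. A minimal sketch, with an illustrative function name of our own, which also reproduces the band widths of the Figure 6 example (CVs of 6% and 7%):

```python
import math

def fraction_within_limits(results_a, results_b, cv1, cv2):
    """Fraction of paired differences inside 0 +/- 1.96 * combined inherent
    CV * mean, ie, inside the acceptance band of the difference plot."""
    cv = math.sqrt(cv1**2 + cv2**2) / 100  # combined CV as a fraction
    inside = sum(1 for a, b in zip(results_a, results_b)
                 if abs(a - b) <= 1.96 * cv * ((a + b) / 2))
    return inside / len(results_a)

# Figure 6 example: CVs of 6% and 7% give a combined CV of 9.2%;
# the band is +/-4.5 at a mean of 25 and +/-31.6 at a mean of 175.
cv = math.sqrt(6**2 + 7**2)
print(round(cv, 1))                     # 9.2
print(round(1.96 * cv / 100 * 25, 1))   # 4.5
print(round(1.96 * cv / 100 * 175, 1))  # 31.6
```

If the returned fraction is below 0.95, the 2 methods are not identical within the inherent imprecision of both methods.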


Jensen, Kjelgaard-Hansen

alized (Figure 5A–B). Additional analysis of the differences,


eg, following logarithmic transformation of the data or by
plotting the percentage differences, can be used to study the
relationship between the 2 methods further, but frequently
a simple difference plot suffices.
Next, the lines representing 0 6 the combined inherent
CV (see point 6) are inserted in the plot (Figure 6). If the 2
methods are identical, the differences should be symmetrically
distributed around 0, and 95% of the differences should be
within the lines.6 If this is not the case, then the 2 methods are
not identical within inherent imprecision.

Acceptability based on preset analytical quality specifica-


tions. In our experience, the easiest way to judge acceptability
based on preset analytical quality specifications is to use the
Figure 4. A difference plot with the difference between the 2 methods maximum allowable total error (TEmax) (see point 6.b.) and
plotted against the mean value of the 2 methods. a medical decision chart (MEDx chart), which is a graphical
tool for comparing inaccuracy and imprecision and which
Other statistical analyses. The ordinary paired t-test and the has an analytical quality requirement stated in the form of
nonparametric Wilcoxon signed rank test are not applicable if allowable total error.3,37
proportional error is present.9 A correlation coefficient of In the MEDx chart, total allowable inaccuracy is on the
concordance has been proposed as an improved version of the y-axis and total allowable imprecision on the x-axis. Four
correlation coefficient that indicates the strength of the lines are drawn, each corresponding to the suggested criteria
relationship between 2 methods that fall on the line of for TEMax (Figure 6):
identity.14 A web-based calculator of the concordance correla-
tion coefficient is available,34 together with a table that can be
used to categorize test performance as poor to good. However,
it has been suggested that this approach be used only if data
on total allowable error are not predetermined.5

9. Judge acceptability

The basis (hypothesis) of a method comparison experiment is


that the 2 methods are identical either within inherent
imprecision of both methods or within preset analytical
performance limits.

Acceptability based on inherent imprecision. To judge


acceptability based on inherent imprecision of both methods,
a difference plot (also known as a Bland-Altman plot or bias
plot) can be used.11,12,35 Basically, the difference plot is
constructed by plotting the difference between the methods
(A–B) on the y-axis against the mean of the methods ([AþB]/2)
on the x-axis (Figure 4). If one method is considered a reference
method, the differences are sometimes plotted against test
results yielded by this method, yet this approach may produce
misleading results, since a plot of the differences against
the standard measurement always appears to show a rela-
tion between difference and magnitude, even when there is
none.36
Ideally, the 2 methods should yield identical results, that
is, the difference between the methods should on average be 0,
a hypothesis that can easily be tested by means of a paired
t-test or Wilcoxon signed rank test. To test whether the dif- Figure 5. Difference plots exhibiting constant error (A) and proportional
ferences change with analyte concentration, ie, proportional error (B). (A) Constant error is indicated by the differences all being above
error, linear regression analysis of the differences can be 0 and more or less constant irrespective of the mean value. (B)
applied. Using these tests and by inspecting the plot, constant Proportional error is indicated by the differences increasing in value
and/or proportional errors can be further detected and visu- relative to the increase in mean value.

Vol. 35 / No. 3 / 2006 Veterinary Clinical Pathology Page 281


Method Comparison

A. Data are collected as described above


B. A scattergram is prepared by plotting the semiquanti-
tative test results on the x-axis and the corresponding
quantitative test results on the y-axis.
C. For each category value on the x-axis (eg, 0, þ1, þ2, þ3)
the mean value, median value, SD, and/or range are
reported for the corresponding values on the y-axis.
D. Statistical tests are done to determine if the values in
each category are significantly different from those
in the other categories, eg, by means of a 1-way ANOVA
(an unpaired t-test in the case of a dichotomous semi-
quantitative test). If the values are not significantly
different, then the semiquantitative test is not useful.
E. If a medically applied cutoff value exists for the
quantitative test, then this is inserted in the scattergram.
F. For each category value on the x-axis, the numbers of
Figure 6. A difference plot with dotted lines representing 0 6 1.96  dots above (a) and under (b) the cutoff value are
the combined inherent coefficient of variation (CV) of the 2 methods. In counted.
this example, CV of method 1 is 6% and CV of method 2 is 7%. The G. The probability that a patient really has a true value
combined
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi inherent CV of both methods given single measurements is above the cutoff is calculated for each category value on
62 þ 72 59.2%. At a mean value of 25, the difference between the 2 the x-axis (ie, a/(aþb)). In tables on binominal distribu-
methods should be within the interval 4.5 to 4.5, while at the mean tions, the 95% confidence interval for this probability
value of 175, the difference between the 2 methods should be within can be obtained.
31.6 to 31.6. The dotted lines represent the interval within which the H. The results are assessed to determine whether the new
differences between the 2 methods should fall if the 2 methods are or simpler semiquantitative test is medically useful
identical within the inherent imprecision of both methods. From the based on the derived probabilities. In most cases, this
figure it can be appreciated that the 2 methods are identical, since the assessment is subjective. In some cases, one chooses to
differences are symmetrically distributed around 0 and all of the use the semiquantitative test as a screening test, in
differences are inside the dotted lines, ie, more than 95% of the which case the risk that a disease may not be detected
differences are within the lines. must be weighed against the time, costs, and risk of the
quantitative test.
Line A (for TEmax = Bmax + (1.65 × Imax)): from TEmax on the y-axis to (TEmax/1.65) on the x-axis
Line B (for TEmax = Bmax + (2 × Imax)): from TEmax on the y-axis to (TEmax/2) on the x-axis
Line C (for TEmax = Bmax + (3 × Imax)): from TEmax on the y-axis to (TEmax/3) on the x-axis
Line D (for TEmax = Bmax + (4 × Imax)): from TEmax on the y-axis to (TEmax/4) on the x-axis.

Imprecision and inaccuracy from the replication study and the method comparison experiment, respectively, are then plotted into the MEDx chart, and it is now easy to judge whether the new method is just acceptable (ie, within control) or of poor, marginal, good, or excellent performance (the designations "poor," "marginal," "good," and "excellent" correspond directly to sigma performance criteria 1, 2, 3, and 4 in "Six Sigma Quality Management") (Figure 7).38

Figure 7. The Medical Decision (MEDx) chart.

Special case: comparing semiquantitative and quantitative tests

In some cases, the aim is to compare a new or simpler semiquantitative test to a quantitative laboratory test. One example would be to compare a reagent dipstick to a quantitative laboratory test (eg, Coomassie blue) for measuring urine protein concentration. Since one method reports discrete data (0, +1, +2, +3) while the other reports data on a continuous scale (g/L or mg/dL), the procedure described above is not applicable. Instead, the procedure outlined in steps A–H above can be applied.

What if the 2 methods do not produce identical results?

If the method comparison experiment has revealed that the 2 methods are not identical either within inherent combined imprecision or within predefined limits, the methods cannot be used interchangeably. In some cases, this can be very frustrating, for example, when a manufacturer has stopped


producing a certain reagent and the method comparison experiment has revealed the new method does not produce identical results. In this case, it may be worthwhile to perform a new method comparison experiment using another new method. If this is not an option and for some reason one is forced to use the new method, new reference intervals for each animal species must be prepared. Obtaining new reference intervals is often a very cumbersome and expensive process but it is, in our opinion, much preferable to simply including the regression equation in the new method, since this may be a significant source of undetectable and unexplainable error at a later stage when everyone has forgotten that a regression equation was included.

Example: Alanine Aminotransferase

Purpose of the experiment

A new method for measuring alanine aminotransferase (E.C. 2.6.1.2) (ALAT) activity in serum from dogs is being considered in the laboratory. The laboratory already has a method for measuring ALAT activity in serum from dogs. The purpose of the method comparison experiment is to judge if the 2 methods are identical either within inherent imprecision of both methods or within preset analytical quality specifications.

Theoretical basis for the method comparison experiment

The sample material that is analyzed is fresh unhemolyzed serum according to the laboratory's standard operating procedure for sample material for clinical chemical analysis. Both methods use the modified International Federation for Clinical Chemistry (IFCC) method where the reaction is initiated by the addition of α-ketoglutarate as a second reagent. The concentration of NADH is measured by its absorbance at 340 nm, and the rate of absorbance decrease is proportional to the ALAT activity. The routine method has an imprecision of 2%. The new method has an imprecision of 5% when applied to feline serum samples.

Familiarization with the new method

The new method has been applied to fresh unhemolyzed canine serum samples for 1 week to obtain a working competence with the method. Samples with different ALAT activities have been mixed, and it has been observed that ALAT activity in the mixtures is comparable to what would be expected from the combined ALAT activities in the original samples. Thus, it is assumed that the new method actually can measure ALAT activity in canine serum samples.

Estimates of random error for both methods

The routine method has an imprecision of 2% (single samples). An experiment on 5 canine serum samples revealed that the imprecision of the new method was 4%.

Number of samples to be included in the method comparison experiment

The laboratory reference interval for ALAT activity in canine serum is 0–80 U/L. Forty patient samples are assumed to be required. Since increased values are of clinical interest, samples with ALAT activities around and above the upper limit of the reference interval are preferred.

Table 3. Example data from an experiment comparing 2 methods (new method and routine method) for measurement of alanine aminotransferase activity (U/L) in fresh unhemolyzed canine serum.

Routine Method   New Method
104   116
102   115
113   125
101   111
106   115
96    106
102   112
108   117
79    86
85    90
116   125
94    104
101   111
110   121
115   126
99    105
95    108
110   122
97    106
93    107
100   108
101   110
94    106
92    106
89    97
115   127
102   116
120   133
111   123
85    90
75    86
70    79
72    79
76    78
79    86
81    90
82    89
88    86
89    91
65    75
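As a first check, the paired results in Table 3 can be entered and the average relative difference inspected; a minimal sketch in Python (variable names ours):

```python
routine = [104, 102, 113, 101, 106, 96, 102, 108, 79, 85,
           116, 94, 101, 110, 115, 99, 95, 110, 97, 93,
           100, 101, 94, 92, 89, 115, 102, 120, 111, 85,
           75, 70, 72, 76, 79, 81, 82, 88, 89, 65]
new = [116, 115, 125, 111, 115, 106, 112, 117, 86, 90,
       125, 104, 111, 121, 126, 105, 108, 122, 106, 107,
       108, 110, 106, 106, 97, 127, 116, 133, 123, 90,
       86, 79, 79, 78, 86, 90, 89, 86, 91, 75]

# Difference in percent of the pairwise mean, as used in a difference plot:
pct_diff = [200.0 * (n - r) / (n + r) for r, n in zip(routine, new)]
mean_pct_diff = sum(pct_diff) / len(pct_diff)
```

Nearly all differences are positive, which already points toward the systematic difference quantified in the analyses below.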

Acceptable difference between the 2 methods


Acceptance limits based on inherent imprecision. Single measurements are used, and the combined inherent imprecision is √(2² + 4²) = 4.5%. Thus, at least 95% of the differences should be within the interval 0 ± 1.96 × 4.5% = 0 ± 8.8%.
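This acceptance limit can be computed directly (a sketch; variable names ours):

```python
import math

cv_routine, cv_new = 2.0, 4.0                       # imprecision of each method (%)
cv_combined = math.sqrt(cv_routine**2 + cv_new**2)  # sqrt(2^2 + 4^2), about 4.5%
diff_limit = 1.96 * cv_combined                     # 95% limit, about 8.8%
```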


Acceptance limits based on analytical quality specifications. From Table 2, the following analytical quality specifications can be obtained for ALAT activity in canine serum: maximum allowable imprecision (Imax): 4.8%; maximum allowable inaccuracy (Bmax): 6.4%; maximum allowable total error (TEmax): 14.3%.
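These specifications are internally consistent with the relation used for line A of the MEDx chart, TEmax = Bmax + 1.65 × Imax, which can be verified in one line (a sketch; the article takes the 14.3% directly from its Table 2):

```python
I_max = 4.8                     # maximum allowable imprecision (%)
B_max = 6.4                     # maximum allowable inaccuracy (%)
TE_max = B_max + 1.65 * I_max   # 6.4 + 7.92 = 14.32, ie, about 14.3%
```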

Measuring the patient samples

The 40 samples were measured by both methods no more than 1 hour apart. Eight different patient samples were analyzed each day for 5 days. The results are presented in Table 3.

Figure 9. An example of an experiment comparing 2 methods for the measurement of ALAT activity in fresh unhemolyzed canine serum samples by means of a difference plot. The dotted lines represent 0 ± 1.96 × inherent imprecision of both methods (4.5%). Only 13 values of 40 measurements (33%) are inside the interval outlined by the dotted lines, and thus the 2 methods are not identical within the inherent imprecision of both methods.

Analysis of the data
Plotting the data. In Figure 8, the data are plotted with the new method on the y-axis and the routine method on the x-axis. A line of identity (y = x) is inserted. Initial visual assessment indicates that the data are linear and symmetrically distributed around a line situated above the line of identity. The data are not clumped in one end of the data range, although most of the data are above the upper limit of the reference interval (80 U/L).

Statistical analyses. Correlation and regression analyses are conducted using the software MedCalc version 7.4.4.1 (www.medcalc.be). The correlation coefficient (r) is .98, and thus simple linear regression analysis should provide useful information about constant error and proportional error via, respectively, intercept and slope.

Ordinary linear regression analysis reveals intercept = −1.3 (95% confidence interval = −8.3 to 5.8) and slope = 1.11 (95% confidence interval = 1.036 to 1.184). Thus, the intercept is not statistically significantly different from 0 and hence no constant error is present. However, proportional error exists, since the slope is different from 1. These findings are further supported by Deming and Passing-Bablok regression analyses. In the Deming regression analysis, intercept = −2.16 (95% confidence interval = −9.55 to 5.22) and slope = 1.12 (95% confidence interval = 1.049 to 1.19). In the Passing-Bablok regression analysis, intercept = −1.7 (95% confidence interval = −7.8 to 5.4) and slope = 1.115 (95% confidence interval = 1.046 to 1.177). The concordance correlation coefficient is .8117, which is considered as indicating almost perfect performance.
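The ordinary least-squares estimates (though not the Deming and Passing-Bablok fits, which need dedicated routines such as those in MedCalc) are easy to reproduce from the Table 3 data; a self-contained sketch in Python:

```python
import math

routine = [104, 102, 113, 101, 106, 96, 102, 108, 79, 85,
           116, 94, 101, 110, 115, 99, 95, 110, 97, 93,
           100, 101, 94, 92, 89, 115, 102, 120, 111, 85,
           75, 70, 72, 76, 79, 81, 82, 88, 89, 65]
new = [116, 115, 125, 111, 115, 106, 112, 117, 86, 90,
       125, 104, 111, 121, 126, 105, 108, 122, 106, 107,
       108, 110, 106, 106, 97, 127, 116, 133, 123, 90,
       86, 79, 79, 78, 86, 90, 89, 86, 91, 75]

n = len(routine)
mean_x = sum(routine) / n
mean_y = sum(new) / n
sxx = sum((x - mean_x) ** 2 for x in routine)
syy = sum((y - mean_y) ** 2 for y in new)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(routine, new))

slope = sxy / sxx                    # about 1.11: proportional error
intercept = mean_y - slope * mean_x  # about -1.3: no constant error
r = sxy / math.sqrt(sxx * syy)       # about .98
```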

Figure 8. An example of an experiment comparing 2 methods for the measurement of ALAT activity (U/L) in fresh unhemolyzed canine serum samples. The dotted line represents the line of identity (y = x). The solid line represents the regression line (y = a + bx; new method = a + b × routine method) with intercept = −1.3 (95% confidence interval = −8.3 to 5.8) and slope = 1.11 (95% confidence interval = 1.038 to 1.184).

Judging acceptability

The analysis so far has revealed a proportional error. The next step is to judge whether the 2 methods are identical either within inherent imprecision of both methods or within preset analytical quality specifications.

Acceptability based on inherent imprecision. A difference plot with the mean value of the methods on the x-axis and the difference between the methods on the y-axis is constructed (Figure 9). The combined inherent imprecision (CV) is 4.5%. Two lines representing, respectively, 0 + 1.96 × 4.5% and 0 − 1.96 × 4.5% are also inserted in the plot. Only 13 of the 40 values (33%) are inside the interval outlined by the dotted lines, and thus the methods are not identical within inherent imprecision of both methods.
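The count behind this judgment can be reproduced from the Table 3 data (a sketch; counts for values very close to the limit depend on the rounding convention, so a recount may differ slightly from the 13 of 40 reported):

```python
routine = [104, 102, 113, 101, 106, 96, 102, 108, 79, 85,
           116, 94, 101, 110, 115, 99, 95, 110, 97, 93,
           100, 101, 94, 92, 89, 115, 102, 120, 111, 85,
           75, 70, 72, 76, 79, 81, 82, 88, 89, 65]
new = [116, 115, 125, 111, 115, 106, 112, 117, 86, 90,
       125, 104, 111, 121, 126, 105, 108, 122, 106, 107,
       108, 110, 106, 106, 97, 127, 116, 133, 123, 90,
       86, 79, 79, 78, 86, 90, 89, 86, 91, 75]

# 95% acceptance limit from the combined inherent imprecision, about 8.8%:
limit = 1.96 * (2.0**2 + 4.0**2) ** 0.5

pct_diff = [200.0 * (n - r) / (n + r) for r, n in zip(routine, new)]
n_within = sum(abs(d) <= limit for d in pct_diff)
fraction_within = n_within / len(pct_diff)

# Far fewer than 95% of the differences fall inside the limits, so the
# 2 methods are not identical within inherent imprecision:
acceptable = fraction_within >= 0.95
```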



Acceptability based on preset analytical quality specifications. Maximum allowable total error (TEmax) is 14.3%. Using this value, a MEDx chart is constructed (Figure 10).
The imprecision of the new method is 4%. Inaccuracy is estimated from the simple linear regression equation y = −1.3 + 1.11x. At the upper limit of the reference interval (80 U/L) the new method would be expected to give a value of 87.5 U/L. The new method thus measures 7.5 U/L or 9.4% higher than expected. The point (4; 9.4) is inserted into the MEDx chart. From the figure, it is evident that the 2 methods are not identical within preset analytical quality specifications.

Figure 10. Medical decision chart (MEDx chart) depicting the new method reported in the example relative to the 4 criteria for allowable total error (TEmax). TEmax is 14.3%. The 4 lines represent, from left to right, different criteria for TEmax (see Figure 7 and the text for further explanation). The dot represents estimated imprecision (4%) and inaccuracy (9.4%) of the new method. This operating point is located to the right of the last line, indicating that the 2 methods are not identical within preset analytical quality specifications.

References

1. Westgard JO, Hunt MR. Use and interpretation of common statistical tests in method-comparison studies. Clin Chem. 1973;19:49–57.
2. Westgard JO, Carey RN, Wold S. Criteria for judging precision and accuracy in method development and evaluation. Clin Chem. 1974;20:825–833.
3. Westgard JO. A method evaluation decision chart (MEDx chart) for judging method performance. Clin Lab Sci. 1995;8:277–283.
4. Westgard JO. Points of care in using statistics in method comparison studies. Clin Chem. 1998;44:2240–2242.
5. Lumsden JH. Laboratory test method validation. Revue Méd Vét. 2000;151:623–630.
6. Petersen PH, Stockl D, Blaabjerg O, Pedersen B, et al. Graphical interpretation of analytical data from comparison of a field method with a reference method by use of difference plots. Clin Chem. 1997;43:2039–2046.
7. Linnet K. Evaluation of regression procedures for methods comparison studies. Clin Chem. 1993;39:424–432.
8. Linnet K. Necessary sample size for method comparison studies based on regression analysis. Clin Chem. 1999;45:882–894.
9. Linnet K. Limitations of the paired t-test for evaluation of method comparison data. Clin Chem. 1999;45:314–315.
10. Stockl D, Dewitte K, Thienpont LM. Validity of linear regression in method comparison studies: is it limited by the statistical model or the quality of the analytical input data? Clin Chem. 1998;44:2340–2346.
11. Altman DG, Bland JM. Measurement in medicine: the analysis of method comparison studies. Statistician. 1983;32:307–317.
12. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1:307–310.
13. Passing H, Bablok W. A new biometrical procedure for testing the equality of measurements from two different analytical methods. Application of linear regression procedures for method comparison studies in clinical chemistry, Part I. J Clin Chem Clin Biochem. 1983;21:709–720.
14. Lin LK. A concordance correlation coefficient to evaluate reproducibility. Biometrics. 1989;45:255–268.
15. Jones RG, Payne RB. Clinical Investigation and Statistics in Laboratory Medicine. London: ACB Venture Publications; 1997:27–65.
16. Koch DD, Peters T. Selection and evaluation of methods. In: Burtis CA, Ashwood ER, eds. Tietz Fundamentals of Clinical Chemistry. 4th ed. Philadelphia, PA: WB Saunders Co; 1996:170–181.
17. Clinical and Laboratory Standards Institute (formerly NCCLS). Method Comparison and Bias Estimation Using Patient Samples; Approved Guideline. 2nd ed. NCCLS document EP9-A2. Wayne, PA: NCCLS; 2002.
18. Westgard JO. Method validation—The comparison of methods experiment. Available at: http://www.westgard.com/lesson23.htm. Accessed August 29, 2005.
19. Bossuyt PM, Reitsma JB, Bruns DE, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Clin Chem. 2003;49:1–6.
20. Inczedy J, Lengyel T, Ure AM. Compendium of Analytical Nomenclature—The Orange Book. Available at: http://www.iupac.org/publications/analytical_compendium/. Accessed August 16, 2005.
21. Fuentes-Arderiu X. Glossary of ISO metrological and related terms and definitions relevant to clinical laboratory sciences. Available at: http://www.westgard.com/isoglossary.htm. Accessed August 16, 2005.
22. Bellamy JEC, Olexson DW. Evaluating laboratory procedures. In: Quality Assurance Handbook for Veterinary Laboratories. Ames, IA: Iowa State University Press; 2000:61–77.
23. Montori VM, Guyatt GH. What is evidence-based medicine and why should it be practiced? Respir Care. 2001;46:1201–1214.
24. Kenny D, Fraser CG, Petersen PH, Kallner A. Consensus agreement. Scand J Clin Lab Invest. 1999;59:585.
25. Wess G, Reusch C. Assessment of five portable blood glucose meters for use in cats. Am J Vet Res. 2000;61:1587–1592.
26. Wess G, Reusch C. Evaluation of five portable blood glucose meters for use in dogs. J Am Vet Med Assoc. 2000;216:203–209.
27. Ricos C, Alvarez V, Cava F, et al. Biological variation database & desirable quality specifications: the 2001 update. Available at: http://www.westgard.com/guest21.htm. Accessed August 29, 2005.
28. Fraser CG. Desirable performance standards for clinical chemistry tests. Adv Clin Chem. 1983;23:299–339.
29. Fraser CG. The application of theoretical goals based on biological variation data in proficiency testing. Arch Pathol Lab Med. 1988;112:404–415.
30. Fraser CG, Harris EK. Generation and application of data on biological variation in clinical chemistry. CRC Crit Rev Clin Lab Sci. 1989;29:409–430.
31. Fraser CG, Petersen PH, Ricos C, Haeckel R. Proposed quality specifications for the imprecision and inaccuracy of analytical systems for clinical chemistry. Eur J Clin Chem Clin Biochem. 1992;30:311–317.


32. Petersen PH, Ricos C, Stockl D, et al. Proposed guidelines for the internal quality control of analytical results in the medical laboratory. Eur J Clin Chem Clin Biochem. 1996;34:983–999.
33. Payne RB. Method comparison: evaluation of least squares, Deming and Passing/Bablok regression procedures using computer simulation. Ann Clin Biochem. 1997;34:319–320.
34. Lin's concordance. Available at: http://www.niwa.co.nz/services/statistical/concordance. Accessed August 16, 2005.
35. Jensen AL, Bantz M. Comparing laboratory tests using the difference plot method. Vet Clin Pathol. 1993;22:46–48.
36. Bland JM, Altman DG. Comparing methods of measurements: why plotting difference against standard method is misleading. Lancet. 1995;346:1085–1087.
37. Jensen AL, Iversen L, Hoier R. Evaluation of analytical performance assisted by total error criteria of a commercial enzyme immunometric assay for canine serum thyrotropin. Vet Clin Pathol. 1999;28:53–56.
38. Westgard JO. Six Sigma Quality Design and Control. Madison, WI: Westgard QC Inc; 2001.
39. Jensen AL, Iversen L, Petersen TK. Study on biological variability of haematological components in dogs. Comp Haemat Internat. 1998;8:202–204.
40. Jensen AL, Aaes H. Critical differences of clinical chemical parameters in blood from dogs. Res Vet Sci. 1993;54:10–14.
41. Jensen AL, Aaes H, Iversen L, Petersen TK. The long-term biological variability of fasting plasma glucose and serum fructosamine in healthy Beagle dogs. Vet Res Commun. 1999;23:73–80.
42. Jensen AL, Pedersen HD, Koch J, Aaes H, Flagstad A. Applicability of the critical difference. Zentralbl Veterinarmed A. 1993;40:624–630.
43. Jensen AL, Hoier R. Evaluation of thyroid function in dogs by hormone analysis: effects of data on biological variation. Vet Clin Pathol. 1996;25:130–134.
44. Iversen L, Jensen AL, Hoier R, Aaes H. Biological variation of canine serum thyrotropin (TSH) concentration. Vet Clin Pathol. 1999;28:16–19.
45. Kjelgaard-Hansen M, Mikkelsen LM, Kristensen AT, Jensen AL. Study on biological variability of five acute-phase reactants in dogs. Comp Clin Path. 2003;12:69–74.
