Publisher: Taylor & Francis

Journal of Biopharmaceutical Statistics
Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/lbps20

Comparison of Measurements by Multiple Methods or Instruments
Douglas M. Hawkins (School of Statistics, University of Minnesota, Minneapolis, Minnesota, USA)
Abha Sharma (Roche Molecular Diagnostics, Pleasanton, California, USA)
Published online: 17 Aug 2010.

To cite this article: Douglas M. Hawkins & Abha Sharma (2010) Comparison of Measurements by Multiple Methods or
Instruments, Journal of Biopharmaceutical Statistics, 20:5, 965-984, DOI: 10.1080/10543401003618991

To link to this article: http://dx.doi.org/10.1080/10543401003618991

Journal of Biopharmaceutical Statistics, 20: 965–984, 2010
Copyright © Taylor & Francis Group, LLC
ISSN: 1054-3406 print/1520-5711 online
DOI: 10.1080/10543401003618991

COMPARISON OF MEASUREMENTS BY MULTIPLE METHODS OR INSTRUMENTS

Douglas M. Hawkins¹ and Abha Sharma²

¹School of Statistics, University of Minnesota, Minneapolis, Minnesota, USA
²Roche Molecular Diagnostics, Pleasanton, California, USA

Much work has been done on comparison of one device with another, but the problem
of comparing three or more devices is less well known. Most existing work has
concentrated on the possibilities of a constant relative bias between the devices, or
of different linear relationships with the underlying true value. These two possibilities
are placed within a hierarchy of models extending them to settings with multiplicative
interaction terms. These additional terms can capture departures such as outliers,
variance changing with the analyte concentration, and different measurement variances
between the devices.

Key Words: Analysis of variance; Bland–Altman plot; Methods comparison; Multiplicative interactions.

1. INTRODUCTION
Comparison of instruments, laboratories, or different assay methods is of
frequent interest (Bland and Altman, 1999; Martinez et al., 2001). It arises, for
example, in “assay round robins” in which one wishes to assess whether different
centers’ measurements are interchangeable, and in regulatory settings of comparing
different measuring instruments. It is also a central element of comparing different
methods of assay. We discuss the problem using a clinical chemistry setting and
terms, but the problem and its solution are much broader and apply in principle to
any setting where several ways of providing putatively equivalent measurements are
compared.
In a typical methods comparison, a collection of specimens, or pools, of widely
differing analyte concentrations will be assayed by each method (or laboratory or
instrument). Three different settings are important:

• In the first, each specimen has a known true value of the analyte.
• In the second, each has a true analyte concentration that is a known multiple of
some unknown value.
• In the third, the true analyte concentrations of the specimens are unknown.

Received February 5, 2008; Accepted June 13, 2009


Address correspondence to Douglas M. Hawkins, School of Statistics, University of Minnesota,
Minneapolis, MN 55455, USA; E-mail: dhawkins@umn.edu


The first setting rarely applies in practice. It would hold if each specimen
was obtained by exact spiking, or exact dilution, using a pure master source. It
is more commonly invoked in settings where the specimens have been assayed by
some standard method and assigned a value. As standard methods are themselves
imperfect, however, so are the values they assign to unknown specimens, and
treating the resulting values as if they were exactly correct can lead to erroneous
conclusions about the methods being compared. For this reason, we do not consider
this first setting further.
The second setting arises when each of the specimens is made by dilution
of a common master solution. While the exact true concentration of the master
solution is seldom known, to the extent that the dilutions are highly precise, each
of the specimens will have an analyte concentration that is a known multiple of this
unknown true concentration of the master solution.
The third setting is commonly seen when patient specimens are used. To
achieve the desirable wide range of analyte values, patients may be selected on
the basis of severity of symptoms, or there may be some prescreening to identify
specimens that span a wide range of concentrations. Other than this attempt to
challenge the methods with wide-ranging specimen values, however, the actual
true values are unknown. Our discussion concentrates on this third setting, with
indications of how to specialize this setting to the first or second.
There is an important distinction to be drawn between settings where only
two methods are compared, versus settings where there are more than two. If
there are only two methods, then we can measure the difference between them
and see whether this difference seems to depend (in mean, variance, or both) on
their average. This is typically done graphically using Bland–Altman plots (Bland
and Altman, 1999; Hawkins, 2002). However, unless the two methods agree exactly,
there is generally no logical basis for deciding what the “true” concentration of
each specimen is; all that can be assessed is the level of agreement between the
two methods. When there are three or more methods, however, it becomes possible
and sensible to use some consensus figure as an estimate of the true concentration
of each specimen. With this, we can investigate absolute biases and absolute
measurement variability in individual methods using their disagreement with this
consensus. The focus of this paper is this richer setting of testing three or more
methods.

2. POSSIBLE METHOD BEHAVIORS


An important aspect of methods comparison studies is that all the methods
attempt to give the same absolute value for the analyte, and in practice do so to
a large degree. This distinguishes the problem from the more general “errors in
variables” settings of comparing methods that do not aim at absolute calibrations.
It is also generally the case that all the methods have a measurement random
variability much smaller than the variability of the specimens used to assess the
methods.
Consider the mean function of each method—the long-term average assay that
it would give, regarded as a function of the (unknown) true analyte concentration.
Suppose there are n specimens and p methods, respectively making up the rows
and the columns of a data matrix. Writing $y_{im}$ for the measurement produced by
method $m$ on specimen $i$ whose true (but unknown) analyte concentration is $\mu_i$, we
can distinguish the following situations:

$$y_{im} = \mu_i + e_{im} \quad (1)$$
$$y_{im} = \mu_i + \beta_m + e_{im} \quad (2)$$
$$y_{im} = \mu_i(1 + \gamma_m) + \beta_m + e_{im} \quad (3)$$
$$y_{im} = g_m(\mu_i) + e_{im} \quad (4)$$

where in all four models $e_{im}$ is a zero-mean random error whose variance may
depend on the method and/or on $\mu_i$, and in model (4), $g_m(\cdot)$ is some method-specific
curve.
The baseline model. The first model is the ideal situation, in which each of
the methods is unbiased for the true quantity of interest $\mu_i$.


The additive two-way layout. The second model is perhaps the most widely
used approach to multiple methods comparison. It defines an additive two-way
layout with specimen and method as the factors, allowing for a set of method-
specific constant relative biases $\beta_m$. We can test whether the $\beta_m$ are all zero using
a two-way analysis of variance (ANOVA), provided the error terms $e_{im}$ satisfy
distributional requirements. In the two-method setting, this analysis is equivalent to
the paired t-test often used to test for mean-equivalence of the methods.

2.1. The Mandel Column Regression Model


The additive two-way layout is widely used as a model for methods
comparisons, but as these comments show, it is not capable of handling even
modestly difficult problems such as one method having a different linear calibration
than the others. A more general model, defined by equation (3), which also underlies
Deming regression and the approach of Carstensen (2004), is Mandel’s (1971)
column regression model:

$$y_{im} = \mu_i + \beta_m + \mu_i\gamma_m + e_{im} = \mu_i(1 + \gamma_m) + \beta_m + e_{im}$$

Method $m$ gives assays that are linearly related to the true analyte
concentration, with intercept $\beta_m$ and slope $1 + \gamma_m$. The $\gamma_m$ measure the
nonparallelism in the calibration lines of the different methods. If all $\gamma_m$ are zero, then
the regression lines are parallel, and the situation drops back to the additive setting.

2.2. The Mandel Multiplicative Interaction (MI) Models


It takes a further model generalization to accommodate curved calibrations.
A powerful, flexible step in this direction (Mandel, 1976, 1995) is given by

$$y_{im} = \mu_i + \beta_m + \mu_i\gamma_m + \lambda_i\kappa_m + e_{im}$$

This model specializes to the column regression model if it turns out that
$\lambda_i = 0$ for all $i$. The $\lambda_i$ terms are row specific, so this model allows for
the possibility of curvature in some of the methods. A nonlinear monotonic
calibration curve, for example, would give positive $\lambda$ values at high and low analyte
concentrations and negative values in the middle; a method that was curved
relative to the consensus would then have a positive $\kappa$ if its calibration was convex,
and negative if it was concave. Methods with linear calibrations would have zero $\kappa$.
The model can also identify an outlier; if only one $\lambda_i$ and only one $\kappa_m$ are
appreciably nonzero, then cell $(i, m)$ is an outlier. Multiple outliers in the same row,
or in the same column, can be captured by this multiplicative interaction term.
A final model allows for an additional layer of structure by adding another
multiplicative interaction term:


$$y_{im} = \mu_i + \beta_m + \mu_i\gamma_m + \sum_{k=1}^{2} \lambda_{ik}\kappa_{mk} + e_{im}$$
This is a generalization of the “additive main effect and multiplicative


interaction” (AMMI) used in agronomy (Cornelius and Crossa, 1999; Ebdon and
Gauch, 2002). As there are no outside constraints on the vectors $\lambda$, $\kappa$, these Mandel
models can be written equivalently without the $\mu_i\gamma_m$ term by absorbing the column
regression term into the MI terms, as indeed Mandel’s discussion has it. However,
even though it is logically redundant, we will keep writing the column regression
term separately in the models, as doing so keeps a useful model hierarchy.
The MI models are indeterminate in that multiplying all $\lambda$ by an arbitrary
nonzero constant and dividing the $\kappa$ by the same constant leaves the product
unchanged. We remove this indeterminacy by adding a pair of scale constants and
length constraints, rewriting the MI model(s) as

$$y_{im} = \mu_i + \beta_m + \mu_i\gamma_m + \sum_{k=1}^{2} \theta_k\lambda_{ik}\kappa_{mk} + e_{im}$$

with the constraints $\sum_i \lambda_{ik}^2 = \sum_m \kappa_{mk}^2 = 1$ for $k = 1, 2$.
Apart from curvature and outliers, these MI models can point to other issues,
such as measurement variance that differs between the methods.

3. FORMAL STATISTICAL TESTING


The sequence of models of interest is:
1. The equivalence model $y_{im} = \mu_i + e_{im}$.
2. The constant relative bias model $y_{im} = \mu_i + \beta_m + e_{im}$.
3. The column regression model $y_{im} = \mu_i + \beta_m + \mu_i\gamma_m + e_{im}$.
4. The MI model $y_{im} = \mu_i + \beta_m + \mu_i\gamma_m + \theta_1\lambda_{i1}\kappa_{m1} + e_{im}$.
5. The two MI term model $y_{im} = \mu_i + \beta_m + \mu_i\gamma_m + \sum_{k=1}^{2}\theta_k\lambda_{ik}\kappa_{mk} + e_{im}$.

These models are nested—each is an elaboration of the preceding model that
allows for a further departure from the ideal unbiasedness, and each is obtained by setting
to zero some of the parameters in the model below it. Specifically, if:
1. All $\beta$ are zero, (2) reduces to (1).
2. All $\gamma$ are zero, (3) reduces to (2).
3. $\theta_1 = 0$, (4) reduces to (3).
4. $\theta_2 = 0$, (5) reduces to (4).
As all the models are fitted by least squares, this suggests that we can find
which of the models best describes a data set by ANOVA methods—fitting the
sequence of models, finding the residual sum of squares of each, and testing each
reduction in sum of squares to see whether it is significantly large.
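To make the testing recipe concrete, the following is a minimal sketch of one rung of this ladder in Python. It assumes the residual sums of squares and residual degrees of freedom of two nested fits are already in hand; the function name and variables are illustrative rather than from any published implementation.

# Minimal sketch of one nested-model comparison: test whether the richer
# model significantly reduces the residual sum of squares.
from scipy import stats

def nested_f_test(sse_reduced, df_reduced, sse_full, df_full):
    """F test of a reduced model against a fuller model, both fitted by
    least squares; returns the F ratio and its p value."""
    df_hyp = df_reduced - df_full               # parameters added by the fuller model
    ms_hyp = (sse_reduced - sse_full) / df_hyp  # "hypothesis" mean square
    ms_err = sse_full / df_full                 # "error" mean square of the fuller model
    f = ms_hyp / ms_err
    return f, stats.f.sf(f, df_hyp, df_full)

# For example, the column regression test compares SSE0 (additive residual,
# N - n - p + 1 d.f.) with SSE0 - SScr (N - n - 2p + 2 d.f.), as in the
# skeleton ANOVA that follows.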
Writing N for the total number of observations (N = np if all combinations
are measured) leads to the following skeleton ANOVA model. Placeholders f1 and
f2 are written in for two of the degrees of freedom entries; these are discussed later.

Source               Sum of squares                Degrees of freedom          Mean square
Samples              SSsamp                        n − 1                       MSsamp
Methods              SSmeth                        p − 1                       MSmeth
Additive residual    SSE0                          N − n − p + 1               MSE0
Column regression    SScr                          p − 1                       MScr
CR residual          SSE0 − SScr                   N − n − 2p + 2              MSE1
MI model             SSb1                          f1                          MSb1
MI residual          SSE0 − SScr − SSb1            N − n − 2p + 2 − f1         MSE2
Second MI term       SSb2                          f2                          MSb2
Final residual       SSE0 − SScr − SSb1 − SSb2     N − n − 2p + 2 − f1 − f2    MSE3

This skeleton has four “error” terms, corresponding to the successive models
2–5. We test the successive models by finding the ratio of each “hypothesis” mean
square to an appropriate error mean square. One of the lines—that for samples—is
not commonly tested; in methods comparisons, the specimens are carefully chosen
to span a wide range of values and it is a given that the F ratio for samples should
be huge.
The first five sums of squares in this skeleton ANOVA follow central
or noncentral chi-squared distributions. The remaining terms, which involve
eigenvalues and nonregular settings, do not have chi-squared distributions, despite
their roots in generalized likelihood ratio testing. Mandel (1971) suggested that one
nevertheless treat them as if they were chi-squared quantities with some degrees of
freedom to be worked out—these are reflected in the f1 and f2 placeholders, and we
follow this idea. A more detailed discussion of the appropriate choice of degrees of
freedom is given in the technical Appendix.
Table 1 lists suggested values for the notional degrees of freedom f1 (for the
first MI term) and f2 (for the second) for various numbers of specimens n up to 100
and methods p up to 15. Each entry was based on simulating 1,000,000 tables of
normal data. Some entries are blank; this indicates that the number of observations
does not permit a test for that bilinear term. The Appendix describes the simulation
and subsequent modeling steps leading to Table 1 in more detail.
Returning to the question of whether the $\mu_i$ are known either exactly, or up to
known multiples of some unknown constant, the skeleton ANOVA shows that this
does not have much fundamental effect. If all $\mu_i$ are known exactly, then we replace
the estimated row main effects by the true concentrations and gain one degree of
freedom for the method and column regression sums of squares. The lower levels of
the table are unaffected.

Table 1 Degrees of freedom for bilinear terms

n  Term  p=3  4  5  6  7  8  9  10  11  12  13  14  15

5 1 4 6 7 9 10 11 12 14 15 16 17 19 20
2 2 3 4 5 6 7 8 9 10 11 12 13
6 1 6 7 9 10 12 13 15 16 17 19 20 21 22
2 3 4 6 6 8 9 10 11 12 13 15 16
7 1 7 9 10 12 13 15 16 18 19 21 22 23 25
2 4 6 7 8 9 11 12 13 14 16 17 18
8 1 8 10 12 13 15 17 18 20 21 23 24 26 27
2 5 6 8 10 11 12 13 15 16 18 19 20
9 1 9 11 13 15 17 18 20 22 23 25 26 28 29
2 6 8 9 11 13 14 15 17 18 19 21 22
10 1 10 12 15 16 18 20 22 23 25 27 28 30 31
2 7 9 11 12 14 15 17 18 20 21 23 24
12 1 12 15 17 19 21 23 25 27 29 30 32 34 35
2 9 11 13 15 17 18 20 21 23 25 26 28
14 1 15 17 20 22 24 26 28 30 32 34 35 37 39
2 11 13 16 18 20 21 23 25 26 28 30 31
16 1 17 20 22 25 27 29 31 33 35 37 39 41 42
2 13 16 18 20 22 24 26 28 30 31 33 35
18 1 19 22 25 27 30 32 34 36 38 40 42 44 46
2 15 18 20 22 25 27 29 31 33 35 36 38
20 1 21 25 27 30 33 35 37 39 41 43 45 47 49
2 17 20 23 25 27 30 32 34 36 38 40 42
25 1 27 30 34 37 39 42 44 47 49 51 53 55 57
2 22 25 28 31 34 36 38 41 43 45 47 49
30 1 32 36 40 43 46 48 51 54 56 58 61 63 65
2 27 31 34 37 40 43 45 48 50 52 55 57
35 1 38 42 46 49 52 55 58 61 63 66 68 70 73
2 32 36 40 43 46 49 52 54 57 59 62 64
40 1 43 47 51 55 58 61 64 67 70 73 75 78 80
2 37 42 45 49 52 55 58 61 63 66 69 71
45 1 48 53 57 61 65 68 71 74 77 80 82 85 87
2 42 47 51 54 58 61 64 67 70 73 75 78
50 1 54 59 63 67 71 74 77 80 83 86 89 92 94
2 47 52 56 60 64 67 71 74 77 79 82 85
60 1 64 70 74 79 83 86 90 93 97 100 103 106 108
2 57 63 67 72 76 79 83 86 89 92 95 98
70 1 75 81 86 90 95 99 102 106 109 113 116 119 122
2 67 73 78 83 87 91 95 98 102 105 108 111
80 1 85 91 97 102 106 111 115 118 122 125 129 132 135
2 78 83 89 94 98 103 107 111 114 118 121 124
90 1 95 102 108 113 118 122 127 131 134 138 142 145 148
2 87 94 100 105 110 114 118 123 126 130 134 137
100 1 106 113 119 125 129 134 138 143 147 150 154 158 161
2 97 104 110 116 121 126 130 134 139 142 146 150

4. COMPUTATION
A fuller discussion of the computations involved in the approach is given in
the Appendix. The broad route map is:
• The first part of the analysis is a conventional unreplicated two-way ANOVA
layout.
• This is followed by forming the matrix of residuals and regressing each column
on the estimated row main effects to get the column regression model.
• Then form the matrix of residuals from the column regression model. A singular
value decomposition (SVD) of this matrix gives the explained sums of squares
and estimated eigenvalues of the two bilinear terms; a complete-data sketch of
the whole sequence follows.
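For a complete data matrix, the whole route map fits in a few lines of NumPy. The sketch below is a simplified complete-data pass; the names are illustrative, and the per-method slope step is the one-pass "score" regression described in the Appendix rather than a full nonlinear fit.

# Sketch of the complete-data computation: additive two-way fit, column
# regressions on the estimated row effects, then SVD of the residuals.
import numpy as np

def mandel_fit(y):
    """y is an n x p complete data matrix (rows = specimens, cols = methods)."""
    mu = y.mean(axis=1)                        # specimen consensus values
    beta = (y - mu[:, None]).mean(axis=0)      # method main effects
    r0 = y - mu[:, None] - beta[None, :]       # additive-model residuals
    x = mu - mu.mean()                         # centered row effects
    gamma = r0.T @ x / (x @ x)                 # per-method slope offsets
    r1 = r0 - np.outer(x, gamma)               # column-regression residuals
    u, s, vt = np.linalg.svd(r1, full_matrices=False)
    # s[0]**2 and s[1]**2 are the sums of squares explained by the
    # first and second multiplicative interaction terms.
    return mu, beta, gamma, s[:2] ** 2, u[:, :2], vt[:2].T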
These computations are routine if the data set is complete. Methods
comparison data sets, however, often have missing cells due, for example, to lost
samples and limitations on dynamic range. This is not problematic provided there
is enough “linking” of methods in different samples to keep the overall matrix well
connected, as is generally the case.
There are standard algorithms for ANOVA and the column regressions for
incomplete arrays, but the SVD for incomplete data is less familiar. A method that is
well suited is the Gabriel and Zamir (1979) alternating least-squares method, whose
details are set out in the technical Appendix.

5. EXAMPLES
Example 1. The first data set (kindly supplied by Bendix Carstensen) is from the
comparison of six measures for HbA1c. Blood samples may be venous or capillary,
and each source's samples may be assayed using three instruments: BR.VC, BR.V2,
and Tosoh. Carstensen's actual data set involves repeated measures leading to the
explicit calculation of measurement variances, but we ignore this information and
use just the mean for each of the n = 38 subjects by each of the six methods.

Figure 1 Venous samples for HbA1c.

A useful initial graphical exploration is given by averaging the six measures to
get a consensus $\hat{\mu}_i$ and plotting the differences between each $y_{im}$ and this consensus,
getting a plot with many of the properties of a Bland–Altman plot. To reduce chart
clutter, the three methods using venous samples are shown in Fig. 1, and the three
using capillary samples in Fig. 2. In addition to the individual points, the plots
also show the regression line of the deviation on the consensus. These lines give a
visual impression of the potential fits of models (1), (2), and (3). As some of the
lines appear to lie consistently above or below the horizontal axis, there seems to
be relative bias between the instruments. None of the lines appears to have much
slope, however, suggesting any relative bias may be constant. The ratio of vertical
to horizontal spread shows that the different methods all have high correlation with
the consensus.
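A minimal sketch of how such a consensus-difference display can be built, assuming a complete n × p array y and matplotlib; it imitates the construction of Figs. 1 and 2 rather than reproducing them, and the names are illustrative.

# Sketch of the Bland-Altman-style consensus plot: deviations of each
# method from the row-mean consensus, with per-method regression lines.
import numpy as np
import matplotlib.pyplot as plt

def consensus_plot(y, names):
    consensus = y.mean(axis=1)                   # simple consensus estimate
    xs = np.array([consensus.min(), consensus.max()])
    for m, name in enumerate(names):
        dev = y[:, m] - consensus                # deviation from consensus
        slope, icept = np.polyfit(consensus, dev, 1)
        pts = plt.scatter(consensus, dev, s=15, label=name)
        plt.plot(xs, icept + slope * xs, color=pts.get_facecolor()[0])
    plt.axhline(0, color="grey", linewidth=0.8)
    plt.xlabel("Consensus value")
    plt.ylabel("Deviation from consensus")
    plt.legend()
    plt.show()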
Going to the formal ANOVA gives

Analysis of variance for Mandel models

Source     SS      df    MS       F       P
Method     3.701   5     0.7402   31.18   0.0000
Error 0    4.392   185   0.0237
Col reg    0.216   5     0.0431   1.86    0.1036
Error 1    4.176   180   0.0232
MI 1       2.914   53    0.0550   5.53    0.0000
Error 2    1.262   127   0.0099
MI 2       0.616   43    0.0143   1.86    0.0076
Error 3    0.646   84    0.0077

Overall mean is 8.1795

The “method” line shows that there is indeed a highly significant relative bias
between the methods. The “col reg” line shows that the column regression fits no
better than the simple constant bias model, which means that the regression lines of
Figs. 1 and 2 are plausibly horizontal but not coincident.
Follow-up comparisons of the method means by the Tukey method produce groupings
showing that there are significant differences involving both the source of blood
(capillary or venous) and the instrument on which the assay is carried out. At this
point, a conventional analysis using two-way ANOVA would be complete, but,
moving on, the MI model adds further insights.
Figure 2 Capillary samples for HbA1c.

The MIs involve a separate “marker” for each method and for each subject.
The following table shows the main effect and the MIs of the methods:

Method m     $\hat{\beta}_m$    $\hat{\kappa}_{m1}$    $\hat{\kappa}_{m2}$
BR.V2_Cap     0.049     0.383    −0.703
BR.V2_Ven     0.102     0.332    −0.009
BR.VC_Cap     0.165    −0.552    −0.299
BR.VC_Ven    −0.083    −0.590     0.135
Tosoh_Cap    −0.224     0.210     0.521
Tosoh_Ven    −0.008     0.217     0.355

The first MI $\kappa_{m1}$ is associated with the three instruments. BR.VC is clearly off
to one side, with BR.V2 and Tosoh on the other, with draw method (venous or
capillary) irrelevant.
These MI terms describe patient–assay interactions. The key to understanding
them is to note that the product $\theta_k\lambda_{ik}\kappa_{mk}$ identifies particular cells $(i, m)$ that deviate
from the overall linear pattern. For the product to be nonzero requires that the row
have an appreciably nonzero “row marker” $\lambda_{ik}$, and the column have an appreciably
nonzero “column marker” $\kappa_{mk}$. A good first step in the diagnosis therefore is to plot
the estimated row markers $\lambda_{ik}$ against the estimated specimen consensus $\hat{\mu}_i$.
Figure 3 First MI of HbA1c data set.

This specimen plot for the first MI is shown in Fig. 3. Patients whose MI
is close to zero are well fitted by the additive two-way ANOVA layout; those for
whom this MI is far from zero are not. There are a number of patients with sizeable
values for the MI; the most extreme is Patient 28 (coordinates 7.4, −0.78), whose
residuals from the additive two-way layout were

BR.V2_Cap  BR.V2_Ven  BR.VC_Cap  BR.VC_Ven  Tosoh_Cap  Tosoh_Ven
  −0.41      −0.09       0.44       0.49      −0.18      −0.25

Patient 15 (coordinates 9.9, 0.4) is one of several in the opposite direction:

BR.V2_Cap  BR.V2_Ven  BR.VC_Cap  BR.VC_Ven  Tosoh_Cap  Tosoh_Ven
   0.21       0.16      −0.30      −0.28       0.08       0.14

In both patients, the venous–capillary pairs broadly agree, but there are
marked instrument differences. These are not general interinstrument biases, as the
instrument main effect has already been removed. Closer study would be needed
to decide whether this interaction between patient and method reflects a genuine
dependence or is a result of some feature of the data gathering.
The second MI is much less important, as measured by its mean square. It
loads primarily on BR.V2_Cap, contrasting its values with the two Tosoh readings.
In the interests of brevity, we just comment that much of the significance of this
term comes from just two patients.

Example 2. A second example is a data set from the analysis of 51 HBV pools
by 6 different assay methods. This data set has an additional element, namely,
incompleteness. Of the 306 possible pool/method assays, only 248 were run, the
remainder being skipped. The missing values result mainly from dynamic range
limitations—that some methods cannot be used below some lower threshold, and
others cannot be used above some upper threshold.

The systematic lack of some portions of the data array is not a cause of
concern; it is understood that only those methods that are able to read in a
particular range of values can be compared in that range. All six methods were used
in the middle range of specimens, so the data array is sufficiently connected for the
fitting algorithms to work.
Instead of the plot of differences versus consensus used in the first example, we
use a more traditional plot of the individual assays against the consensus. Putting
all methods on a single plot would give a cluttered, hard-to-read graph, so they have
been broken into two groups and plotted separately as Figs. 4 and 5. Along with the
individual readings by each method, we show the linear regression of that method’s
values on the consensus figure.
Visually, the individual lines of Fig. 4 look to be identical or near-identical,
suggesting that these assays agree well with the consensus. This is partly a reflection
of the fact that the three methods shown in this figure are variants of the same
methodology. The lines of Fig. 5 seem more divergent, suggesting that one assay
may have a different slope than the others.
Putting the data through the formal ANOVA gives:

Analysis of variance for Mandel models

Source     SS       df    MS       F       P
Method     4.336    5     0.8673   14.61   0.0000
Error 0    11.397   192   0.0594
Col reg    1.061    5     0.2122   3.84    0.0025
Error 1    10.336   187   0.0553
MI 1       4.842    68    0.0712   1.54    0.0196
Error 2    5.494    119   0.0462
MI 2       3.531    58    0.0609   1.89    0.0074
Error 3    1.963    61    0.0322

Overall mean is 5.4154

The formal tests show that there are significant departures from identity in all
aspects of the model sequence. This leads us to estimate and try to interpret the
departures from the coincident model. A first characterization of the six assays is by
their respective column regressions. The six individual equations for the deviation
from consensus are:

Assay m    Intercept offset $\hat{\beta}_m$    Slope offset $\hat{\gamma}_m$
CAPSLD      −0.063    −0.010
CAPSL1      −0.116    −0.001
CAPSL2       0.236    −0.040
COBAS       −0.474     0.067
Versant      0.218    −0.002
HPS          0.199    −0.013
Figure 4 Plot of individual HCV assays versus consensus.

Looking at the slope offsets $\hat{\gamma}_m$ of these separate regressions shows that the
main departure from parallelism comes from the COBAS assays having a steeper,
and the CAPSL2 a shallower, regression than the other assays.
The intercepts divide into three groups. COBAS has an intercept well below
the other methods, but as a consequence of its steeper slope, it agrees quite well with
the consensus in the middle part of the range. CAPSL2, Versant, and HPS have
similar intercepts somewhat above the other methods, and CAPSLD and CAPSL1
are in the middle. Overall, with near-zero intercept and slope offsets, the straight
lines for CAPSLD and CAPSL1 are very close to the consensus.

Figure 5 Plot of further individual HCV assays versus consensus.
Turning to the multiplicative interaction terms, Figs. 6 and 7 are plots of the
first and the second set of $\lambda_{ik}$ against the consensus. The corresponding column
markers $\hat{\kappa}_{mk}$ are:

Assay m    $\hat{\kappa}_{m1}$    $\hat{\kappa}_{m2}$
CAPSLD      0.471    −0.236
CAPSL1      0.245    −0.023
CAPSL2      0.245    −0.152
COBAS      −0.772    −0.442
Versant    −0.243     0.880
HPS         0.204     0.018

Figure 6, showing the row markers of the first MI term, is horseshoe shaped.
The corresponding column marker is dominated by the numerically largest value,
−0.772, for COBAS. This means that a large part of the function of the first
MI term is to capture a horseshoe-shaped departure in the COBAS assays from
consensus. Looking more closely at Fig. 5, we can see that this is the case; the
COBAS assays are above the consensus at both the high and the low ends of the
range, and below the consensus in the middle of the range—COBAS appears to have
a curved rather than a linear calibration relative to the other assays.
The other method with a numerically relatively large value on this first
vector of column markers is CAPSLD, with the value 0.471, a sign opposite to
that of COBAS. This points to a smaller convex departure from consensus. And
indeed, studying Fig. 4 more closely, we see that CAPSLD read several specimens
appreciably above the consensus near the bottom end of the range, but was below
at the extreme left and right.

Figure 6 First MI of HCV data.

Figure 7 Second MI of HCV data.
Neither pattern of deviations from consensus—of COBAS or CAPSLD—was
visually striking in the original plots, but the first MI term captures and models
them effectively.
The row markers of the second MI term are shown in Fig. 7. Their most
striking feature is the three high points making a “caret” shape near x = −2
on the plot. These three values are considerably above any of the other markers in
this vector. The corresponding column markers show a value of 0.880 for Versant,
with much smaller values for the other methods. Multiplying these row and column
markers then tells us that this MI term serves largely to identify three unusually
high Versant assays in the region where the consensus assay is about −2. With this
clue, these three points are also visible in Fig. 5.
A smaller but still appreciable row marker from the second MI is located
around the x = 2 area of Fig. 7. This corresponds to another high Versant assay
visible in Fig. 5.
In summary, the model automatically captured several statistically significant
features of the methods. Fitting linear calibrations gives some differences in slope
between the different methods. At a more detailed level, the COBAS (and less
so CAPSLD) assays seem to involve some curvature, and Versant gave a trio of
unusually high assays.

6. CONCLUSION
There is a large body of writing on the comparison of two methods. Going
to three or more, however, changes the whole landscape, making it logically possible
to go beyond simply assessing the level of disagreement between the methods to
allowing coalitions and consensus, with at least the potential of isolating absolute
biases and differences in measurement variability. The most familiar tool for such
comparisons is the two-way ANOVA, but broadening the model allows a more
replicable and reliable interpretation of the methods comparison. The family of
Mandel models corresponds to realistic assay performance differences and is a
powerful tool for analysis of multiple methods comparison.

APPENDIX

Computation
The modified Mandel model used for the assays is


$$y_{im} = \mu_i + \beta_m + \mu_i\gamma_m + \sum_{k=1}^{2} \theta_k\lambda_{ik}\kappa_{mk} + e_{im} \quad (5)$$

This notation differs from that usually seen in linear modeling exercises in that
there is no overall mean. Rather, we have absorbed this into the row effects, making
$\mu_i$ the true mean concentration of the $i$th pool. The error terms $e_{im}$ are assumed
independent $N(0, \sigma^2)$.
The maximum likelihood estimators of the parameters of the full model and
submodels are given by ordinary least squares (OLS). The first “interesting” model
is the additive two-way layout

$$y_{im} = \mu_i + \beta_m + e_{im} \quad (6)$$

If the data matrix is complete, then OLS involves no more than averaging over
rows and columns. If the matrix is incomplete, then the calculations are heavier, but
still fully standard.
Fitting the additive two-way layout gives estimates $\hat{\mu}_i$ and $\hat{\beta}_m$, and a sum of
squares for the overall method bias, corrected for pools. This has $p - 1$ degrees of
freedom.
The next model in the sequence is the column regression model

$$y_{im} = \mu_i + \beta_m + \mu_i\gamma_m + e_{im} \quad (7)$$

This is a nonlinear model. It is most conveniently fitted by the “score”
algorithm. This uses the estimates of the pool means from the initial additive fit to
recast (7) as

$$y_{im} - \hat{\mu}_i = \beta_m + \hat{\mu}_i\gamma_m + e_{im} \quad (8)$$

For each column $m$, this is a linear regression of the dependent $y_{im} - \hat{\mu}_i$ on the
covariate $\hat{\mu}_i$. The intercept gives a refined estimate of the overall bias of method
m, and the slope estimates that method’s linear slope differential. The reduction in
the residual sum of squares is the sum of squares explained by the separate column
regressions. As this involved the fitting of p − 1 new linear parameters, it has p − 1
degrees of freedom.
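Concretely, the score pass amounts to one ordinary regression per method. A minimal sketch, assuming missing cells are coded as NaN and that mu_hat holds the pool means from the additive fit; the names are illustrative.

# Sketch of the per-column "score" regression (8): within each method's
# nonmissing rows, regress y[:, m] - mu_hat on mu_hat to get a refined
# bias beta_m and the slope differential gamma_m.
import numpy as np

def column_regressions(y, mu_hat):
    n, p = y.shape
    beta = np.empty(p)
    gamma = np.empty(p)
    for m in range(p):
        ok = ~np.isnan(y[:, m])                 # rows observed for method m
        dev = y[ok, m] - mu_hat[ok]
        gamma[m], beta[m] = np.polyfit(mu_hat[ok], dev, 1)
    return beta, gamma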
Because of the nonlinearity, the “score” algorithm does not give the exact
global optimum for the column regression model, and this suboptimality could
give a serious loss of accuracy in situations where the terms beyond the two-way
ANOVA contribute substantially in comparison with the main effects. However, this
situation does not apply in methods comparison problems where the main effect
of sample is overwhelmingly bigger than the MI terms so that any additional fine-
tuning in reestimating the $\mu_i$ would not lead to a materially closer fit.
Having fitted the column regression model (7), we continue with a “score”
approach for the remaining terms, forming the matrix of residuals from the column
regression model
$$r_{im} = y_{im} - (\hat{\mu}_i + \hat{\beta}_m + \hat{\mu}_i\hat{\gamma}_m)$$

and using them to fit the two multiplicative interaction terms through the model

$$r_{im} = \sum_{k=1}^{2} \theta_k\lambda_{ik}\kappa_{mk} + e_{im}$$

These multiplicative terms are the first two terms of the singular value
decomposition (SVD) of the matrix $R = (r_{im})$. If the matrix is complete, then
conventional matrix packages can be used to calculate the SVD. The sums of
squares explained by the successive terms are the squares of the estimated singular
values.
If the matrix is incomplete, then the alternating least squares approach
becomes potentially interesting. The single MI model is

$$r_{im} = \theta_1\lambda_{i1}\kappa_{m1} + e_{im}$$

If the $\lambda_{i1}$ are known, then for each column $m$ this model is a no-intercept
linear regression of the $r_{im}$ on the covariates $\lambda_{i1}$, giving a slope $b_m = \theta_1\kappa_{m1}$. Fit these
$p$ coefficients $b_m$ by least squares using all those residuals that are nonmissing, then
calculate the corresponding $\hat{\theta}_1 = \|b\|$ and $\hat{\kappa}_{m1} = b_m/\|b\|$, where $\|b\|$ is the length of
the vector $b$. Similarly, estimates of the $\kappa_m$ lead to corresponding estimates of the $\lambda_i$
by a regression within each row on the column markers.
The alternating least-squares approach starts with an initial estimate of the $\lambda_{i1}$,
and then successively uses these to estimate the $\kappa_{m1}$; these estimates in turn are
fed back to refine the estimates of the $\lambda_i$, with this alternation between regressions
continuing until convergence.
Since the residual sum of squares decreases at each step of this alternation,
convergence is guaranteed (Kiers, 2002).
While convergence is generally quick, good starting values help. One easy
method that usually gives excellent results is to replace all missing residuals $r_{im}$ by
zero and submit the now-complete matrix to a conventional SVD, using the leading
row eigenvector for the initial $\lambda_{i1}$.
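A minimal sketch of this alternating least-squares fit for a single MI term, with missing cells coded as NaN; it uses the zero-fill SVD start just described and the follow-up no-intercept regression for the scale constant recommended in the next paragraph. Names and conventions are illustrative.

# Sketch of the Gabriel-Zamir alternating least squares for a single
# multiplicative term theta * lam[i] * kap[m]; missing cells are np.nan.
import numpy as np

def rank_one_als(r, max_iter=500, tol=1e-10):
    obs = ~np.isnan(r)
    rz = np.where(obs, r, 0.0)                  # zero-fill the missing cells
    lam = np.linalg.svd(rz)[0][:, 0]            # leading row eigenvector start
    for _ in range(max_iter):
        # Column step: no-intercept regression of each column on lam.
        b = (rz * lam[:, None]).sum(axis=0) / (obs * lam[:, None] ** 2).sum(axis=0)
        kap = b / np.linalg.norm(b)
        # Row step: no-intercept regression of each row on kap.
        c = (rz * kap[None, :]).sum(axis=1) / (obs * kap[None, :] ** 2).sum(axis=1)
        lam_new = c / np.linalg.norm(c)
        if np.linalg.norm(lam_new - lam) < tol:
            lam = lam_new
            break
        lam = lam_new
    # Follow-up no-intercept regression of the observed r on lam*kap for theta.
    prod = np.outer(lam, kap)[obs]
    theta = (r[obs] * prod).sum() / (prod ** 2).sum()
    return theta, lam, kap

# The second MI term is found by applying the same routine to the
# deflated matrix r - theta * np.outer(lam, kap).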

There are some differences from the complete-data case. If there are no missing
data, then the estimate of $\theta_1$ given by the length of the vector of row slopes and that
given by the length of the vector of column slopes are identical. This is no longer
the case with incomplete data. Further, neither of the two estimates of $\theta_1$ necessarily
equals the variance explained by adding the MI term to the model, as is the case
with the full-matrix analysis. It is recommended that $\theta_1$ be estimated in a follow-up
no-intercept regression of all $r_{im}$ on the product $\hat{\lambda}_{i1}\hat{\kappa}_{m1}$.
After the first MI term has been fitted, we deflate the residual matrix, forming
the matrix with $(i, m)$ element equal to

$$r_{im} - \hat{\theta}_1\hat{\lambda}_{i1}\hat{\kappa}_{m1}$$

which is modeled to be $\theta_2\lambda_{i2}\kappa_{m2} + e_{im}$, and apply the alternating least-squares
algorithm in exactly the same way to find the second MI term.

Distribution
The skeleton ANOVA gives a sequence of sums of squares, and associated
degrees of freedom, and dividing the sum of squares by its degrees of freedom gives
a mean square. This leads to the familiar computation of the ratio of an “explained”
to an “error” mean square to get an F ratio testing whether that term has significant
additional explanatory power.
For the first two tests—of overall bias between the methods, and for slope
differences in the linear calibration—these calculations are valid under the model
assumption of independent homoskedastic normal errors.
The variances explained by the two MI terms, however, are eigenvalue
problems and are well known to not follow this familiar rubric. Neither follows a
chi-squared distribution, and parameter counting does not provide a suitable divisor
to make them unbiased estimates of $\sigma^2$.
For example, consider a test using n = 50 pools and p = 7 methods. Parameter
counting might suggest that the first MI term involves 53 free parameters: $\theta_1$; 48
“free” $\lambda$, after we account for the implied orthogonality to the estimated row means
and the length constraint; and 4 “free” $\kappa$, after accounting for orthogonality to the
column means and slopes, and the length constraint.
We simulated 1 million random $N(0, 1)$ tables of size $50 \times 7$, calculating the
variance explained by the leading MI term. This variance explained had a mean of
76, and a variance of 84. The mean is much higher than the 53 given by parameter
counting. Furthermore, if the distribution were chi-squared with some other degrees
of freedom, its variance would be double the mean. The actual variance is far
below this, demonstrating that even finding a more appropriate degrees of freedom
than that given by parameter counting will not be enough to give chi-squared
distributions.
Despite this, it is desirable to use the ANOVA and F -ratio approach,
even if this is at best approximate. Mandel (1971) used simulation to estimate
the expectation of the variance explained by MI, and recommended using this
expectation as a notional degrees of freedom.
Applying this approach to our simulated 50 × 7 tables gives Fig. A1. This
figure shows the actual empirical fractiles of these 1 million tables, plotted against

the nominal fractiles given by a chi-squared distribution with 76 degrees of freedom.
The identity line is also given as a reference. The empirical and nominal fractiles
agree well in the middle of the range, but diverge increasingly toward the edge. For
example, the empiric 1% point of the distribution has a nominal significance under
the 76 degrees of freedom chi-squared distribution of 3.3%, a threefold error.

Figure A1 Checking the distribution of explained variance.
As the primary use of the notional degrees of freedom and mean squares is
for testing the significance of the MI term, we recommend a testing method that is
accurate, not in the middle of the distribution, but at some relevant tail point—for
example the 95% fractile—and our calculated f1 and f2 are defined in that way.
Another subtle difficulty arises with the distribution of the second MI term;
this difficulty relates to the appropriate null hypothesis for testing the second MI
term. It is sensible to test the hypothesis $\theta_2 = 0$ only in settings where it has
already been decided that $\theta_1 > 0$, and the fractiles should reflect this. This affects
the simulations used to study the sampling distribution of the second MI term. It
is not appropriate to simulate “null” data satisfying a model with no multiplicative
interaction terms; instead, for the null we need to simulate data matrices in which
there is a single multiplicative interaction.
In our simulations, therefore, we simulated no-structure $N(0, 1)$ data matrices
to investigate the variance explained by the first MI term. For the second MI term
we simulated data matrices with a single MI term with $\theta_1 = 100$, and $N(0, 1)$ noise.
As a final departure from Mandel’s approach, we wish to use these degrees
of freedom to split the residual variance into “explained” and “error” components,
from which we can calculate pseudo-F ratios. In our modeling, therefore, we
started out with the $(n-1)(p-2)$ degrees of freedom of the error term from the
multiplicative interaction model, and then sought degrees of freedom f1 and f2 such
that the empiric 95% point of the F ratios for the first and second MI terms would
match those of the reference F distribution as closely as possible.

Simulations
Each entry in the degrees of freedom table was estimated using a separate
simulation of 1,000,000 tables, implemented in FORTRAN 95 code using Ripley’s
(1983) method for generating random normal variables. With this many simulations,
there is no perceptible random error in the empiric fractiles. Then $f_1$ and $f_2$ values
were found that best matched the empiric 95% fractiles to the nominal 5% tail area.
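The calibration can be re-created in miniature. A rough sketch, assuming complete tables and far fewer replicates than the 1,000,000 used for Table 1; the mean of the returned variances estimates the notional degrees of freedom f1 (the 95% fractile matching is omitted here). Names are illustrative.

# Rough re-creation of the calibration simulation for f1 on complete
# n x p standard-normal tables; the replicate count is kept small.
import numpy as np

def mi1_explained(n, p, reps=20000, seed=1):
    rng = np.random.default_rng(seed)
    out = np.empty(reps)
    for i in range(reps):
        y = rng.standard_normal((n, p))
        mu = y.mean(axis=1)
        r0 = y - mu[:, None] - (y - mu[:, None]).mean(axis=0)
        x = mu - mu.mean()
        gamma = r0.T @ x / (x @ x)             # column regression slopes
        r1 = r0 - np.outer(x, gamma)           # residuals entering the SVD
        out[i] = np.linalg.svd(r1, compute_uv=False)[0] ** 2
    return out   # e.g., out.mean() for n = 50, p = 7 should be near 76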
Returning to the issue of the dependence of the test of the second MI on the
outcome of the test for the first, there are grounds for concern that a type I error,
of deciding there was a nonzero $\theta_1$ when there was not, might affect the size of the
follow-up test for a nonzero $\theta_2$. To assess this, a simulation was run for three table
configurations: n = 20, p = 6; n = 30, p = 4; and n = 50, p = 8. The first two of
these, with total sample size of 120, perhaps typify common methods comparison
settings, with the third being a much more ambitious investigation. For each sample
configuration, three $\theta_1$ values were simulated: the null value $\theta_1 = 0$; a value giving
50% power for the test of the first MI; and a large value like that used to generate
the f2 table. 10,000,000 tables were simulated for each of these 9 combinations, and
among those samples in which the first MI test was significant at the 5% level, the
proportion yielding a 5% significance for the second MI test was calculated. This
proportion is the size of the test for the second MI, conditional on a statistically
significant first MI. These sizes were:

                 Null $\theta_1$    50% power $\theta_1$    “Infinite” $\theta_1$
n = 20, p = 6    0.036          0.039               0.041
n = 30, p = 4    0.052          0.055               0.056
n = 50, p = 8    0.036          0.045               0.049

The last column of the table should equal 0.05, the nominal size of the test.
That the numbers do not quite match is an indication of the fact that the choice
of f1 and f2 is restricted to integers, making it impossible to hit the target of 0.05
exactly. Looking across the rows, though, shows that the test sizes are encouragingly
similar, and uniformly somewhat smaller than the nominal values in the final
column.
This means that when type I errors occur in the test of the first MI, they are
not compounded by inflated type I error rates for the test of the second MI, but
that the test of the second MI is slightly conservative.

ACKNOWLEDGMENTS
The authors are grateful to the referees for helpful comments, and to
Bendix Carstensen of Steno Diabetes Center and to Roche Molecular Systems for
permission to use the data sets of the examples.

REFERENCES
Bland, J. M., Altman, D. G. (1999). Measuring agreement in method comparison studies.
Stat. Methods Med. Res. 8:135–160.

Carstensen, B. (2004). Comparing and predicting between several methods of measurement.
Biostatistics 5:399–413.
Cornelius, P. L., Crossa, J. (1999). Prediction assessment of shrinkage estimators of
multiplicative models for multi-environment cultivar trials. Crop Sci. 39:998–1009.
Ebdon, J. S., Gauch, H. G. (2002). Additive main effect and multiplicative interaction
analysis of national turfgrass performance trials: II. Cultivar recommendations.
Crop Sci. 42:497–506.
Gabriel, K. R., Zamir, S. (1979). Lower rank approximation of matrices by least squares
with any choice of weights. Technometrics 21:489–498.
Hawkins, D. M. (2002). Diagnostics for conformity of paired quantitative measurements.
Stat. Med. 21:1913–1935.
Hawkins, D. M. (2005). Outliers. In: Armitage, P., Colton, T., eds. Encyclopedia of
Biostatistics. New York: Wiley.
Kiers, H. A. L. (2002). Setting up alternating least squares and iterative majorization
algorithms for solving various matrix optimization problems. Comput. Stat. Data Anal.
41:157–170.
Mandel, J. (1971). A new analysis of variance model for non-additive data. Technometrics
13:1–18.
Mandel, J. (1976). Models, transformations of scale, and weighting. J. Quality Technol.
8:86–97.
Mandel, J. (1995). Analysis of Two-Way Layouts. London: Chapman and Hall.
Martinez, A., Riu, J., Rius, F. X. (2001). Multiple analytical method comparison using
maximum likelihood principal component analysis and linear regression with errors in
both axes. Anal. Chim. Acta 446:147–158.
Ripley, B. D. (1983). Computer generation of random variables: a tutorial. Int. Stat. Rev.
51:301–319.
