Journal of Biopharmaceutical Statistics
To cite this article: Douglas M. Hawkins & Abha Sharma (2010) Comparison of Measurements by Multiple Methods or
Instruments, Journal of Biopharmaceutical Statistics, 20:5, 965-984, DOI: 10.1080/10543401003618991
Journal of Biopharmaceutical Statistics, 20: 965–984, 2010
Copyright © Taylor & Francis Group, LLC
ISSN: 1054-3406 print/1520-5711 online
DOI: 10.1080/10543401003618991
Much work has been done on comparison of one device with another, but the problem
of comparing three or more devices is less well known. Most existing work has
concentrated on the possibilities of a constant relative bias between the devices, or
of different linear relationships with the underlying true value. These two possibilities
are placed within a hierarchy of models extending them to settings with multiplicative
interaction terms. These additional terms can capture departures such as outliers,
variance changing with the analyte concentration, and different measurement variances
between the devices.
1. INTRODUCTION
Comparison of instruments, laboratories, or different assay methods is of
frequent interest (Bland and Altman, 1999; Martinez et al., 2001). It arises, for
example, in “assay round robins” in which one wishes to assess whether different
centers’ measurements are interchangeable, and in regulatory settings of comparing
different measuring instruments. It is also a central element of comparing different
methods of assay. We discuss the problem using a clinical chemistry setting and
terms, but the problem and its solution are much broader and apply in principle to
any setting where several ways of providing putatively equivalent measurements are
compared.
In a typical methods comparison, a collection of specimens, or pools, of widely
differing analyte concentrations will be assayed by each method (or laboratory or
instrument). Three different settings are important:
• In the first, each specimen has a known true value of the analyte.
• In the second, each has a true analyte concentration that is a known multiple of
some unknown value.
• In the third, the true analyte concentrations of the specimens are unknown.
The first setting rarely applies in practice. It would hold if each specimen
was obtained by exact spiking, or exact dilution, using a pure master source. It
is more commonly invoked in settings where the specimens have been assayed by
some standard method and assigned a value. As standard methods are themselves
imperfect, however, so are the values they assign to unknown specimens, and
treating the resulting values as if they were exactly correct can lead to erroneous
conclusions about the methods being compared. For this reason, we do not consider
this first setting further.
The second setting arises when each of the specimens is made by dilution
of a common master solution. While the exact true concentration of the master
solution is seldom known, to the extent that the dilutions are highly precise, each
of the specimens will have an analyte concentration that is a known multiple of this
unknown true concentration of the master solution.
The third setting is commonly seen when patient specimens are used. To
achieve the desirable wide range of analyte values, patients may be selected on
the basis of severity of symptoms, or there may be some prescreening to identify
specimens that span a wide range of concentrations. Other than this attempt to
challenge the methods with wide-ranging specimen values, however, the actual
true values are unknown. Our discussion concentrates on this third setting, with
indications of how to specialize this setting to the first or second.
There is an important distinction to be drawn between settings where only
two methods are compared, versus settings where there are more than two. If
there are only two methods, then we can measure the difference between them
and see whether this difference seems to depend (in mean, variance, or both) on
their average. This is typically done graphically using Bland–Altman plots (Bland
and Altman, 1999; Hawkins, 2002). However, unless the two methods agree exactly,
there is generally no logical basis for deciding what the “true” concentration of
each specimen is; all that can be assessed is the level of agreement between the
two methods. When there are three or more methods, however, it becomes possible
and sensible to use some consensus figure as an estimate of the true concentration
of each specimen. With this, we can investigate absolute biases and absolute
measurement variability in individual methods using their disagreement with this
consensus. The focus of this paper is this richer setting of testing three or more
methods.
where in all four models e_im is a zero-mean random error whose variance may
depend on the method and/or on i, and in model (4), g_m is some method-specific
curve.
The baseline model. The first model is the ideal situation, in which each
method reads the true analyte concentration subject only to measurement error.
The column regression model. Method m gives assays that are linearly related to
the true analyte concentration with intercept β_m and slope 1 + γ_m. The γ_m
measure the nonparallelism in the calibration lines of the different methods. If all
γ_m are zero, then the regression lines are parallel, and the situation drops back to
the additive setting.
y_im = μ_i + β_m + γ_m μ_i + θ_i φ_m + e_im

This model specializes to the column regression model if it turns out that
θ_i φ_m = 0 for all i, m. The θ_i terms are row specific, so this model allows for
y_im = μ_i + β_m + γ_m μ_i + Σ_{k=1}^{2} θ_ik φ_mk + e_im

or, pulling the scale of each bilinear term out into a coefficient λ_k,

y_im = μ_i + β_m + γ_m μ_i + Σ_{k=1}^{2} λ_k θ_ik φ_mk + e_im    (5)

with the constraints Σ_i θ²_ik = Σ_m φ²_mk = 1 for k = 1, 2.
Apart from curvature and outliers, these MI models can point to other issues,
such as measurement variance that differs between the methods.
3. If λ_1 = 0, (4) reduces to (3).
4. If λ_2 = 0, (5) reduces to (4).
As all the models are fitted by least squares, this suggests that we can find
which of the models best describes a data set by ANOVA methods—fitting the
sequence of models, finding the residual sum of squares of each, and testing each
reduction in sum of squares to see whether it is significantly large.
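Each reduction in residual sum of squares can be tested with an ordinary nested-model F ratio. A minimal sketch in Python (the numbers are illustrative only, not values from this paper):

```python
from scipy.stats import f as f_dist

def f_test(rss_reduced, df_reduced, rss_full, df_full):
    """F test of the reduction in residual sum of squares when moving
    from a reduced model to a fuller one. df_* are residual degrees of
    freedom, so df_reduced > df_full."""
    extra_df = df_reduced - df_full            # parameters added
    F = ((rss_reduced - rss_full) / extra_df) / (rss_full / df_full)
    p = f_dist.sf(F, extra_df, df_full)        # upper-tail p value
    return F, p

# Illustrative numbers: e.g., additive model vs. column regression model
F, p = f_test(rss_reduced=52.0, df_reduced=180, rss_full=40.0, df_full=175)
```

For the two bilinear terms the same ratio is formed, but with the notional degrees of freedom f1 and f2 in place of parameter counts.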
Writing N for the total number of observations (N = np if all combinations
are measured) leads to the following skeleton ANOVA model. Placeholders f1 and
f2 are written in for two of the degrees of freedom entries; these are discussed later.
This skeleton has four “error” terms, corresponding to the successive models
2–5. We test the successive models by finding the ratio of each “hypothesis” mean
square to an appropriate error mean square. One of the lines—that for samples—is
not commonly tested; in methods comparisons, the specimens are carefully chosen
to span a wide range of values and it is a given that the F ratio for samples should
be huge.
The first five sums of squares in this skeleton ANOVA follow central
or noncentral chi-squared distributions. The remaining terms, which involve
eigenvalues and nonregular settings, do not have chi-squared distributions, despite
their roots in generalized likelihood ratio testing. Mandel (1971) suggested that one
nevertheless treat them as if they were chi-squared quantities with some degrees of
freedom to be worked out—these are reflected in the f1 and f2 placeholders, and we
follow this idea. A more detailed discussion of the appropriate choice of degrees of
freedom is given in the technical Appendix.
Table 1 lists suggested values for the notional degrees of freedom f1 (for the
first MI term) and f2 (for the second) for various numbers of specimens n up to 100
and methods p up to 15. Each entry was based on simulating 1,000,000 tables of
normal data. Some entries are blank; this indicates that the number of observations
does not permit a test for that bilinear term. The Appendix describes the simulation
and subsequent modeling steps leading to Table 1 in more detail.
Returning to the question of whether the μ_i are known either exactly, or up to
known multiples of some unknown constant, the skeleton ANOVA shows that this
does not have much fundamental effect. If all μ_i are known exactly, then we replace
the estimated row main effects by the true concentrations and gain one degree of
freedom for the method and column regression sums of squares. The lower levels of
the table are unaffected.
Table 1. Notional degrees of freedom f1 (term 1) and f2 (term 2), by number of specimens n and number of methods p

n     Term   p=3   4    5    6    7    8    9    10   11   12   13   14   15
5     f1     4     6    7    9    10   11   12   14   15   16   17   19   20
      f2     —     2    3    4    5    6    7    8    9    10   11   12   13
6     f1     6     7    9    10   12   13   15   16   17   19   20   21   22
      f2     —     3    4    6    6    8    9    10   11   12   13   15   16
7     f1     7     9    10   12   13   15   16   18   19   21   22   23   25
      f2     —     4    6    7    8    9    11   12   13   14   16   17   18
8     f1     8     10   12   13   15   17   18   20   21   23   24   26   27
      f2     —     5    6    8    10   11   12   13   15   16   18   19   20
9     f1     9     11   13   15   17   18   20   22   23   25   26   28   29
      f2     —     6    8    9    11   13   14   15   17   18   19   21   22
10    f1     10    12   15   16   18   20   22   23   25   27   28   30   31
      f2     —     7    9    11   12   14   15   17   18   20   21   23   24
12    f1     12    15   17   19   21   23   25   27   29   30   32   34   35
      f2     —     9    11   13   15   17   18   20   21   23   25   26   28
14    f1     15    17   20   22   24   26   28   30   32   34   35   37   39
      f2     —     11   13   16   18   20   21   23   25   26   28   30   31
16    f1     17    20   22   25   27   29   31   33   35   37   39   41   42
      f2     —     13   16   18   20   22   24   26   28   30   31   33   35
18    f1     19    22   25   27   30   32   34   36   38   40   42   44   46
      f2     —     15   18   20   22   25   27   29   31   33   35   36   38
20    f1     21    25   27   30   33   35   37   39   41   43   45   47   49
      f2     —     17   20   23   25   27   30   32   34   36   38   40   42
25    f1     27    30   34   37   39   42   44   47   49   51   53   55   57
      f2     —     22   25   28   31   34   36   38   41   43   45   47   49
30    f1     32    36   40   43   46   48   51   54   56   58   61   63   65
      f2     —     27   31   34   37   40   43   45   48   50   52   55   57
35    f1     38    42   46   49   52   55   58   61   63   66   68   70   73
      f2     —     32   36   40   43   46   49   52   54   57   59   62   64
40    f1     43    47   51   55   58   61   64   67   70   73   75   78   80
      f2     —     37   42   45   49   52   55   58   61   63   66   69   71
45    f1     48    53   57   61   65   68   71   74   77   80   82   85   87
      f2     —     42   47   51   54   58   61   64   67   70   73   75   78
50    f1     54    59   63   67   71   74   77   80   83   86   89   92   94
      f2     —     47   52   56   60   64   67   71   74   77   79   82   85
60    f1     64    70   74   79   83   86   90   93   97   100  103  106  108
      f2     —     57   63   67   72   76   79   83   86   89   92   95   98
70    f1     75    81   86   90   95   99   102  106  109  113  116  119  122
      f2     —     67   73   78   83   87   91   95   98   102  105  108  111
80    f1     85    91   97   102  106  111  115  118  122  125  129  132  135
      f2     —     78   83   89   94   98   103  107  111  114  118  121  124
90    f1     95    102  108  113  118  122  127  131  134  138  142  145  148
      f2     —     87   94   100  105  110  114  118  123  126  130  134  137
100   f1     106   113  119  125  129  134  138  143  147  150  154  158  161
      f2     —     97   104  110  116  121  126  130  134  139  142  146  150
4. COMPUTATION
A fuller discussion of the computations involved in the approach is given in
the Appendix. The broad route map is:
• The first part of the analysis is a conventional unreplicated two-way ANOVA
layout.
• This is followed by forming the matrix of residuals and regressing each column
on the estimated row main effects to get the column regression model.
• Then form the matrix of residuals from the column regression model. A singular
value decomposition (SVD) of this matrix gives the explained sums of squares
and estimated eigenvalues of the two bilinear terms.
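For a complete data array, the three steps above can be sketched in a few lines of Python/NumPy. This is an illustration of the route map under the Appendix's notation, not the authors' code; the function and variable names are mine:

```python
import numpy as np

def fit_sequence(y):
    """Sketch of the complete-data computations: additive two-way fit,
    column regression of residuals on the estimated row effects, then
    an SVD of the remaining residuals for the bilinear (MI) terms."""
    n, p = y.shape
    mu = y.mean(axis=1)                     # row effects (pool consensus)
    beta = y.mean(axis=0) - y.mean()        # method main effects
    r_add = y - mu[:, None] - beta[None, :]

    # Column regression: regress each method's additive-fit residuals
    # on the centered row effects to get its slope differential.
    x = mu - mu.mean()
    gamma = r_add.T @ x / (x @ x)           # per-column slopes
    r_colreg = r_add - np.outer(x, gamma)

    # SVD of the residual matrix: the leading squared singular values
    # are the sums of squares explained by the MI terms.
    s = np.linalg.svd(r_colreg, compute_uv=False)
    return mu, beta, gamma, s**2, r_colreg
```

With complete data, the squared singular values sum to the residual sum of squares after the column regression model; the two leading ones are the MI sums of squares.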
These computations are routine if the data set is complete. Methods
comparison data sets, however, often have missing cells due, for example, to lost
samples and limitations on dynamic range. This is not problematic provided there
is enough “linking” of methods in different samples to keep the overall matrix well
connected, as is generally the case.
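The required "linking" can be checked mechanically: view specimens and methods as the two sides of a bipartite graph with an edge for each observed cell, and test whether that graph is connected. A sketch (NaN marking a missing cell; the function name is mine):

```python
import numpy as np

def is_connected(y):
    """True if the bipartite specimen-method graph defined by the
    non-missing cells of y (NaN = missing) has a single component."""
    n, p = y.shape
    mask = ~np.isnan(y)
    seen_rows, seen_cols = {0}, set()
    frontier_rows = {0}
    while frontier_rows:
        # Methods reachable from the newly reached specimens...
        new_cols = {m for i in frontier_rows for m in range(p)
                    if mask[i, m] and m not in seen_cols}
        seen_cols |= new_cols
        # ...and specimens reachable from those methods.
        frontier_rows = {i for m in new_cols for i in range(n)
                         if mask[i, m] and i not in seen_rows}
        seen_rows |= frontier_rows
    return len(seen_rows) == n and len(seen_cols) == p
```

If the function returns False, the array splits into groups of methods that share no specimens, and the consensus is not identified across the groups.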
There are standard algorithms for ANOVA and the column regressions for
incomplete arrays, but the SVD for incomplete data is less familiar. A method that is
well suited is the Gabriel and Zamir (1979) alternating least-squares method, whose
details are set out in the technical Appendix.
5. EXAMPLES
Example 1. The first data set (kindly supplied by Bendix Carstensen) is from the
comparison of six measures for HbA1c. Blood samples may be venous or capillary,
and each source’s samples may be assayed using three instruments: BR.VC, BR.V2,
and Tosoh. Carstensen’s actual data set involves repeated measures leading to the
explicit calculation of measurement variances, but we ignore this information and
use just the mean for each of the n = 38 subjects by each of the six methods.
slope, however, suggesting any relative bias may be constant. The ratio of vertical
to horizontal spread shows that the different methods all have high correlation with
the consensus.
Going to the formal ANOVA gives
Source SS df MS F P
The “method” line shows that there is indeed a highly significant relative bias
between the methods. The “col reg” line shows that the column regression fits no
better than the simple constant bias model, which means that the regression lines of
Figs. 1 and 2 are plausibly horizontal but not coincident.
Follow-up comparisons of the method means by the Tukey method produce
the groupings:
showing that there are significant differences involving both the source of blood
(capillary or venous) and the instrument on which the assay is carried out. At this
point, a conventional analysis using two-way ANOVA would be complete, but,
moving on, the MI model adds further insights.
The MIs involve a separate “marker” for each method and for each subject.
The following table shows the main effect and the MIs of the methods:
Method    β̂_m    φ̂_m1    φ̂_m2
The first MI φ_m1 is associated with the three instruments. BR.VC is clearly off
to one side, with BR.V2 and Tosoh on the other, with draw method (venous or
capillary) irrelevant.
These MI terms describe patient–assay interactions. The key to understanding
them is to note that the product Σ_k λ_k θ_ik φ_mk identifies particular cells (i, m)
that deviate from the overall linear pattern. For the product to be nonzero requires
that the row have an appreciably nonzero "row marker" θ_ik and the column an
appreciably nonzero "column marker" φ_mk. A good first step in the diagnosis is
therefore to plot the estimated row markers θ̂_ik against the estimated specimen
consensus μ̂_i.
This specimen plot for the first MI is shown in Fig. 3. Patients whose MI
is close to zero are well fitted by the additive two-way ANOVA layout; those for
whom this MI is far from zero are not. There are a number of patients with sizeable
values for the MI; the most extreme is Patient 28 (coordinates 7.4, −0.78), whose
In both patients, the venous–capillary pairs broadly agree, but there are
marked instrument differences. These are not general interinstrument biases, as the
instrument main effect has already been removed. Closer study would be needed
to decide whether this interaction between patient and method reflects a genuine
dependence or is a result of some feature of the data gathering.
The second MI is much less important, as measured by its mean square. It
loads primarily on BR.V2_Cap, contrasting its values with the two Tosoh readings.
In the interests of brevity, we just comment that much of the significance of this
term comes from just two patients.
Example 2. A second example is a data set from the analysis of 51 HBV pools
by six different assay methods. This data set has an additional element, namely,
incompleteness. Of the 306 possible pool/method assays, only 248 were run, the
remainder being skipped. The missing values result mainly from dynamic range
limitations—that some methods cannot be used below some lower threshold, and
others cannot be used above some upper threshold.
The systematic lack of some portions of the data array is not a cause of
concern; it is understood that only those methods that are able to read in a
particular range of values can be compared in that range. All six methods were used
in the middle range of specimens, so the data array is sufficiently connected for the
fitting algorithms to work.
Instead of the plot of differences versus consensus used in the first example, we
use a more traditional plot of the individual assays against the consensus. Putting
all methods on a single plot would give a cluttered, hard-to-read graph, so they have
been broken into two groups and plotted separately as Figs. 4 and 5. Along with the
individual readings by each method, we show the linear regression of that method’s
values on the consensus figure.
Visually, the individual lines of Fig. 4 look to be identical or near-identical,
suggesting that these assays agree well with the consensus. This is partly a reflection
of the fact that the three methods shown in this figure are variants of the same
methodology. The lines of Fig. 5 seem more divergent, suggesting that one assay
may have a different slope than the others.
Putting the data through the formal ANOVA gives:
Source SS df MS F P
The formal tests show that there are significant departures from identity in all
aspects of the model sequence. This leads us to estimate and try to interpret the
departures from the coincident model. A first characterization of the six assays is by
their respective column regressions. The six individual equations for the deviation
from consensus are:
Looking at the slope offsets of these separate regressions shows that the
main departure from parallelism comes from the COBAS assays having a steeper,
and the CAPSL2 a shallower, regression than the other assays.
The intercepts divide into three groups. COBAS has an intercept well below
the other methods, but as a consequence of its steeper slope, it agrees quite well with
the consensus in the middle part of the range. CAPSL2, Versant, and HPS have
similar intercepts somewhat above the other methods, and CAPSLD and CAPSL1
are in the middle. Overall, with near-zero intercept and slope offsets, the straight
lines for CAPSLD and CAPSL1 are very close to the consensus.
Turning to the multiplicative interaction terms, Figs. 6 and 7 are plots of the
first and the second set of θ̂_ik against the consensus. The corresponding column
markers φ̂_mk are:

Assay    φ̂_m1    φ̂_m2
Figure 6, showing the row markers of the first MI term, is horseshoe shaped.
The corresponding column marker is dominated by the numerically largest value,
−0.772, for COBAS. This means that a large part of the function of the first
MI term is to capture a horseshoe-shaped departure in the COBAS assays from
consensus. Looking more closely at Fig. 5, we can see that this is the case; the
COBAS assays are above the consensus at both the high and the low ends of the
range, and below the consensus in the middle of the range—COBAS appears to have
a curved rather than a linear calibration relative to the other assays.
The other method with a numerically relatively large value on this first
vector of column markers is CAPSLD, with the value 0.471, a sign opposite to
that of COBAS. This points to a smaller convex departure from consensus. And
indeed, studying Fig. 4 more closely, we see that CAPSLD read several specimens
appreciably above the consensus near the bottom end of the range, but was below
at the extreme left and right.
Neither pattern of deviations from consensus—of COBAS or CAPSLD—was
visually striking in the original plots, but the first MI term captures and models
them effectively.
The row markers of the second MI term are shown in Fig. 7. Their most
striking feature is the three high points making a "caret" shape near x = −2
on the plot. These three values are considerably above any of the other markers in
this vector. The corresponding column markers show a value of 0.880 for Versant,
with much smaller values for the other methods. Multiplying these row and column
markers then tells us that this MI term serves largely to identify three unusually
high Versant assays in the region where the consensus assay is about −2. With this
clue, these three points are also visible in Fig. 5.
A smaller but still appreciable row marker from the second MI is located
near x = 2 in Fig. 7. This corresponds to another high Versant assay
visible in Fig. 5.
In summary, the model automatically captured several statistically significant
features of the methods. Fitting linear calibrations gives some differences in slope
between the different methods. At a more detailed level, the COBAS (and less
so CAPSLD) assays seem to involve some curvature, and Versant gave a trio of
unusually high assays.
6. CONCLUSION
There is a large body of writing on the comparison of two methods. Going
to three or more, however, changes the whole landscape, making it logically possible
to go beyond simply assessing the level of disagreement between the methods to
allowing coalitions and consensus, with at least the potential of isolating absolute
biases and differences in measurement variability. The most familiar tool for such
comparisons is the two-way ANOVA, but broadening the model allows a more
replicable and reliable interpretation of the methods comparison. The family of
Mandel models corresponds to realistic assay performance differences and is a
powerful tool for the analysis of multiple-methods comparisons.
APPENDIX
Computation
The modified Mandel model used for the assays is

y_im = μ_i + β_m + γ_m μ_i + Σ_{k=1}^{2} λ_k θ_ik φ_mk + e_im    (5)
This notation differs from that usually seen in linear modeling exercises in that
there is no overall mean. Rather, we have absorbed this into the row effects, making
μ_i the true mean concentration of the ith pool. The error terms e_im are assumed
independent N(0, σ²).
The maximum likelihood estimators of the parameters of the full model and
submodels are given by ordinary least squares (OLS). The first “interesting” model
is the additive two-way layout

y_im = μ_i + β_m + e_im
If the data matrix is complete, then OLS involves no more than averaging over
rows and columns. If the matrix is incomplete, then the calculations are heavier, but
still fully standard.
Fitting the additive two-way layout gives estimates μ̂_i and β̂_m, and a sum of
squares for the overall method bias, corrected for pools. This has p − 1 degrees of
freedom.
The next model in the sequence is the column regression model

y_im = μ_i + β_m + γ_m μ_i + e_im

For each column m, this is a linear regression of the dependent variable y_im − μ̂_i on
the covariate μ̂_i. The intercept gives a refined estimate of the overall bias of method
m, and the slope estimates that method’s linear slope differential. The reduction in
the residual sum of squares is the sum of squares explained by the separate column
regressions. As this involves the fitting of p − 1 new linear parameters, it has p − 1
degrees of freedom.
Because of the nonlinearity, the “score” algorithm does not give the exact
global optimum for the column regression model, and this suboptimality could
give a serious loss of accuracy in situations where the terms beyond the two-way
ANOVA contribute substantially in comparison with the main effects. However, this
situation does not apply in methods comparison problems where the main effect
of sample is overwhelmingly bigger than the MI terms so that any additional fine-
tuning in reestimating the μ_i would not lead to a materially closer fit.
Having fitted the column regression model (4), we continue with a “score”
approach for the remaining terms, forming the matrix of residuals from the column
regression model
r_im = y_im − (μ̂_i + β̂_m + γ̂_m μ̂_i)
and using them to fit the two multiplicative interaction terms through the model

r_im = Σ_{k=1}^{2} λ_k θ_ik φ_mk + e_im
These multiplicative terms are the first two terms of the singular value
decomposition (SVD) of the matrix R = (r_im). If the matrix is complete, then
conventional matrix packages can be used to calculate the SVD. The sums of
squares explained by the successive terms are the squares of the estimated singular
values.
If the matrix is incomplete, then the alternating least squares approach
becomes potentially interesting. The single MI model is

r_im = λ_1 θ_i1 φ_m1 + e_im

If the θ_i1 are known, then for each column m this model is a no-intercept
linear regression of the r_im on the covariates θ_i1, giving a slope b_m = λ_1 φ_m1. Fit these
p coefficients b_m by least squares using all those residuals that are nonmissing, then
calculate the corresponding λ̂_1 = ‖b‖ and φ̂_m1 = b_m/‖b‖, where ‖b‖ is the length of
the vector b. Similarly, estimates of the φ_m1 lead to corresponding estimates of the θ_i1
by a regression within each row on the column markers.
The alternating least-squares approach starts with an initial estimate of the θ_i1
and then successively uses these to estimate the φ_m1, and these estimates in turn are
fed back to refine the estimates of the θ_i1, with this alternation between regressions
continuing until convergence.
Since the residual sum of squares decreases at each step of this alternation,
convergence is guaranteed (Kiers, 2002).
While convergence is generally quick, good starting values help. One easy
method that usually gives excellent results is to replace all missing residuals r_im by
zero and submit the now-complete matrix to a conventional SVD, using the leading
row eigenvector for the initial θ_i1.
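A minimal NumPy sketch of this alternating least-squares scheme, including the zero-filled SVD start and the recommended follow-up regression for the singular value (NaN marks a missing cell; the function name is mine, and this is a sketch of the scheme rather than the authors' implementation):

```python
import numpy as np

def als_rank1(r, n_iter=200, tol=1e-12):
    """Rank-1 fit r_im ~ lam * theta_i * phi_m by alternating least
    squares, using only the non-missing (non-NaN) cells of r."""
    mask = ~np.isnan(r)
    r0 = np.where(mask, r, 0.0)              # zero-fill for starting values
    theta = np.linalg.svd(r0)[0][:, 0]       # leading row eigenvector
    prev = np.inf
    for _ in range(n_iter):
        # Column step: no-intercept regression of each column on theta.
        num = np.nansum(r * theta[:, None], axis=0)
        den = (mask * theta[:, None] ** 2).sum(axis=0)
        b = num / den
        phi = b / np.linalg.norm(b)
        # Row step: no-intercept regression of each row on phi.
        num = np.nansum(r * phi[None, :], axis=1)
        den = (mask * phi[None, :] ** 2).sum(axis=1)
        a = num / den
        theta = a / np.linalg.norm(a)
        rss = np.nansum((r - np.outer(a, phi)) ** 2)
        if prev - rss < tol:                 # monotone decrease guaranteed
            break
        prev = rss
    # Follow-up no-intercept regression of r on the fitted products,
    # the recommended estimate of lambda for incomplete data.
    prod = np.outer(theta, phi)
    lam = np.nansum(r * prod) / (mask * prod ** 2).sum()
    return lam, theta, phi
```

Deflating the residual matrix by lam * theta * phi and rerunning the same function gives the second MI term.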
There are some differences from the complete-data case. If there are no missing
data, then the estimate of λ_1 given by the length of the vector of row slopes and that
given by the length of the vector of column slopes are identical. This is no longer
the case with incomplete data. Further, neither of the two estimates of λ_1 necessarily
equals the variance explained by adding the MI term to the model, as is the case
with the full-matrix analysis. It is recommended that λ_1 be estimated in a follow-up
no-intercept regression of all r_im on the product θ̂_i1 φ̂_m1.
After the first MI term has been fitted, we deflate the residual matrix, forming
the matrix with (i, m) element equal to

r_im − λ̂_1 θ̂_i1 φ̂_m1

which is modeled as λ_2 θ_i2 φ_m2 + e_im, and apply the alternating least-squares
algorithm in exactly the same way to find the second MI term.
Distribution
The skeleton ANOVA gives a sequence of sums of squares, and associated
degrees of freedom, and dividing the sum of squares by its degrees of freedom gives
a mean square. This leads to the familiar computation of the ratio of an “explained”
to an “error” mean square to get an F ratio testing whether that term has significant
additional explanatory power.
For the first two tests—of overall bias between the methods, and for slope
differences in the linear calibration—these calculations are valid under the model
assumption of independent homoskedastic normal errors.
The variances explained by the two MI terms, however, are eigenvalue
problems and are well known not to follow this familiar rubric. Neither follows a
chi-squared distribution, and parameter counting does not provide a suitable divisor
to make them unbiased estimates of σ².
For example, consider a test using n = 50 pools and p = 7 methods. Parameter
counting might suggest that the first MI term involves 53 free parameters: λ_1; 48
"free" θ_i1, after we account for the implied orthogonality to the estimated row means
and the length constraint; and 4 "free" φ_m1, after accounting for orthogonality to the
column means and slopes, and the length constraint.
We simulated 1 million random N(0, 1) tables of size 50 × 7, calculating the
variance explained by the leading MI term. This variance explained had a mean of
76, and a variance of 84. The mean is much higher than the 53 given by parameter
counting. Furthermore, if the distribution were chi-squared with some other degrees
of freedom, its variance would be double the mean. The actual variance is far
below this, demonstrating that even finding a more appropriate degrees of freedom
than that given by parameter counting will not be enough to give chi-squared
distributions.
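A small-scale version of this null simulation can be sketched as follows (far fewer tables than the 1,000,000 used here, and assuming the residuals are formed by the additive fit followed by the column regressions, as described in the Computation section):

```python
import numpy as np

def leading_mi_ss(y):
    """Variance explained by the leading MI term: fit the additive
    two-way layout and the column regressions, then take the largest
    squared singular value of the residual matrix."""
    mu = y.mean(axis=1)
    r = y - mu[:, None] - (y.mean(axis=0) - y.mean())[None, :]
    x = mu - mu.mean()
    r = r - np.outer(x, r.T @ x / (x @ x))   # remove column regressions
    return np.linalg.svd(r, compute_uv=False)[0] ** 2

rng = np.random.default_rng(1)
vals = [leading_mi_ss(rng.standard_normal((50, 7))) for _ in range(400)]
mean_ss = np.mean(vals)   # parameter counting suggests 53;
                          # the reported simulated mean is near 76
```

Averaging this quantity over many null tables gives the notional degrees of freedom in Mandel's sense.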
Despite this, it is desirable to use the ANOVA and F-ratio approach,
even if this is at best approximate. Mandel (1971) used simulation to estimate
the expectation of the variance explained by MI, and recommended using this
expectation as a notional degrees of freedom.
Applying this approach to our simulated 50 × 7 tables gives Fig. A1. This
figure shows the actual empirical fractiles of these 1 million tables, plotted against
Simulations
Each entry in the degrees of freedom table was estimated using a separate
simulation of 1,000,000 tables, implemented in FORTRAN 95 code using Ripley’s
(1983) method for generating random normal variables. With this many simulations,
there is no perceptible random error in the empiric fractiles. Then f1 and f2 values
were found that best matched the empiric 95% fractiles.
Returning to the issue of the dependence of the test of the second MI on the
outcome of the test for the first, there are grounds for concern that a type I error,
of deciding there was a nonzero λ_1 when there was not, might affect the size of the
follow-up test for a nonzero λ_2. To assess this, a simulation was run for three table
configurations: n = 20, p = 6; n = 30, p = 4; and n = 50, p = 8. The first two of
these, with total sample size of 120, perhaps typify common methods comparison
settings, with the third being a much more ambitious investigation. For each sample
configuration, three λ_1 values were simulated: the null value λ_1 = 0; a value giving
50% power for the test of the first MI; and a large value like that used to generate
the f2 table. 10,000,000 tables were simulated for each of these 9 combinations, and
among those samples in which the first MI test was significant at the 5% level, the
proportion yielding a 5% significance for the second MI test was calculated. This
proportion is the size of the test for the second MI, conditional on a statistically
significant first MI. These sizes were:
Null λ_1        50%-power λ_1        "Infinite" λ_1
The last column of the table should equal 0.05, the nominal size of the test.
That the numbers do not quite match reflects the fact that the choice
of f1 and f2 is restricted to integers, making it impossible to hit the target of 0.05
exactly. Looking across the rows though shows that the test sizes are encouragingly
similar, and uniformly somewhat smaller than the nominal values in the final
column.
This means that when type I errors occur in the test of the first MI, they are
not compounded by inflated type I error rates for the test of the second MI, but
that the test of the second MI is slightly conservative.
ACKNOWLEDGMENTS
The authors are grateful to the referees for helpful comments, and to
Bendix Carstensen of Steno Diabetes Center and to Roche Molecular Systems for
permission to use the data sets of the examples.
REFERENCES
Bland, J. M., Altman, D. G. (1999). Measuring agreement in method comparison studies.
Stat. Methods Med. Res. 8:135–160.
Kiers, H. A. L. (2002). Setting up alternating least squares and iterative majorization
algorithms for solving various matrix optimization problems. Comput. Stat. Data Anal.
41:157–170.
Mandel, J. (1971). A new analysis of variance model for non-additive data. Technometrics
13:1–18.
Mandel, J. (1976). Models, transformations of scale, and weighting. J. Quality Technol.
8:86–97.
Mandel, J. (1995). Analysis of Two-Way Layouts. London: Chapman and Hall.
Martinez, A., Riu, J., Rius, F. X. (2001). Multiple analytical method comparison using
maximum likelihood principal component analysis and linear regression with errors in
both axes. Anal. Chim. Acta 446:147–158.
Ripley, B. D. (1983). Computer generation of random variables: a tutorial. Int. Stat. Rev.
51:301–319.