
Journal Pre-proof

Comparison of statistical models for characterizing continuous differences between two biomechanical measurement systems

Daniel Koska, Doris Oriwol, Christian Maiwald

PII: S0021-9290(23)00075-1
DOI: https://doi.org/10.1016/j.jbiomech.2023.111506
Reference: BM 111506

To appear in: Journal of Biomechanics

Accepted date : 13 February 2023

Please cite this article as: D. Koska, D. Oriwol and C. Maiwald, Comparison of statistical models
for characterizing continuous differences between two biomechanical measurement systems.
Journal of Biomechanics (2023), doi: https://doi.org/10.1016/j.jbiomech.2023.111506.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the
addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive
version of record. This version will undergo additional copyediting, typesetting and review before it
is published in its final form, but we are providing this version to give early visibility of the article.
Please note that, during the production process, errors may be discovered which could affect the
content, and all legal disclaimers that apply to the journal pertain.

© 2023 Elsevier Ltd. All rights reserved.


Comparison of statistical models for characterizing continuous differences between two biomechanical measurement systems

Daniel Koska*, Doris Oriwol, Christian Maiwald

Chemnitz University of Technology, Thüringer Weg 11, 09126 Chemnitz, Germany

Abstract

Most biomechanical processes are continuous in nature. Measurement systems record this continuous behavior as curve data, which is often treated inappropriately in validation studies. The current paper compares different statistical models for analyzing the agreement of curves from two measurement systems. All models were evaluated in various error scenarios (simulated and real-world data). Excellent results were obtained using a functional method, with coverage probabilities close to the desired level in all data sets. Pointwise constructed bands had a lower coverage probability, but still contained most of the curve points and may thus be an option in scenarios where assumptions of functional models are violated (e.g., when curves are much noisier than those presented here, or in the presence of drift). Models that account for within-subject variation showed a higher coverage probability and less uncertainty about the variation of band limits. We hope this study, along with the provided research code, will inspire researchers to use methods for curve data more frequently and appropriately.

Keywords: method comparison, curve data, time series, validation, prediction bands

Article Type: Original Article. Word count: 3295

Corresponding author: Daniel Koska, Email: daniel.koska@hsw.tu-chemnitz.de, Tel: +49 371 531-32024, Fax: +49 371 531-832024

1. Introduction

The validity of biomechanical measurement systems is typically evaluated based on simultaneous recordings of two measurement systems (new vs. gold standard). In most studies, the resulting differences between these systems are examined using statistical methods that were originally developed for discrete data. For instance, the Limits of Agreement approach (LoA) by Altman and Bland (1983) was developed in a medical context for scalar variables such as blood pressure. The LoA model describes a symmetric uncertainty interval around the mean difference in which 95% of the normally distributed differences are expected to lie. LoAs are simple and intuitive by design, and thus are presumably the most widely used statistical method in validation studies (Ludbrook, 2010).
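As a point of reference for the continuous methods discussed below, here is a minimal R sketch of the classical scalar LoA computation (the difference values are hypothetical, not data from this study):

```r
# Classical 95% Limits of Agreement for scalar paired differences
# (hypothetical values, new system minus gold standard)
diffs <- c(-0.8, 0.3, 1.1, -0.2, 0.6, -0.4, 0.9, 0.1, -0.5, 0.7)

d_bar <- mean(diffs)                      # mean difference (systematic bias)
s_d   <- sd(diffs)                        # standard deviation of the differences
loa   <- d_bar + c(-1, 1) * 1.96 * s_d    # interval expected to contain 95% of
round(loa, 2)                             # normally distributed differences
```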


Most biomechanical processes, however, are continuous in nature, such as joint angles, ground reaction forces, or center of pressure trajectories. Measurement systems record this continuous behavior as curve data by observing the process value along a discrete set of time points. The term "discrete" is used throughout the manuscript to refer to non-functional univariate quantities rather than to discrete values such as counts or categories.

Reducing curves to single points or discrete variables, such as local extrema (Blair et al., 2018), rates of change (Kluitenberg et al., 2012), or ranges of motion (Koska et al., 2018), results in a number of problems:

1. Much of the original information is ignored.

2. Time-dependent variations of the measurement error are not captured.

3. The validity of statements derived from discrete variables runs the risk of examining points that have little relevance for the system under investigation (Donoghue et al., 2008; Pataky et al., 2008; Richter et al., 2014; Park et al., 2017).

One way to incorporate the continuous structure of an observed curve is to apply discrete statistics to several or all curve points (Schwartz et al., 2004; Pini et al., 2019). This discrete or "pointwise" approach is characterized by the fact that each point is treated independently of the others. For curve data, the coverage probability is typically defined as the probability that all points of the curve are contained in the band simultaneously. Pointwise bands, however, will almost surely be too narrow (Duhamel et al., 2004; Cutti et al., 2014; Degras, 2017), resulting in lower than desired coverage probabilities. This is because they ignore the local correlation structure of continuous curves, which implies that the actual number of independent processes is less than the number of sampling points (Pataky, 2010). Therefore, pointwise bands often lead to a multiple comparisons problem.
A variety of statistical methods are available to address the problem of multiple comparisons in curve data, with a majority rooted in either random field theory (Adler and Taylor, 2007; Friston et al., 2007) or functional data analysis frameworks (Ramsay and Silverman, 2005). Lenhoff et al. (1999) investigated a method of approximating continuous curves from a discretely observed set of points using Fourier series and bootstrapping to create simultaneous prediction bands (Sutherland et al., 1988; Olshen et al., 1989). Using a set of joint angle curves, they showed that their functional prediction bands achieved better coverage probability (86% at 90% nominal coverage level) than pointwise constructed Gaussian bands (54% coverage probability). Prediction bands are defined to estimate a range in which a future observation (i.e., a curve) will fall with a certain probability. They are therefore well suited for method comparisons.

Røislien et al. (2012) developed an approach to extend the LoA for curve data. Similar to Lenhoff et al. (1999), they use Fourier series to create functional bands. Their method, however, must be regarded as pointwise, since the actual calculation of the band limits is performed separately and independently for each point of the curve (which somewhat contradicts the original intention).
Røislien et al. (2012) also highlight another problem: Many methods, including the bootstrap in Lenhoff et al. (1999), assume independent and identically distributed (iid) curves. This assumption is violated when more than one curve per subject is included in the model. Including only one curve per subject, on the other hand, does not allow for conclusions to be drawn about the within-subject error. This introduces additional uncertainty about the random error of a measurement system, which is closely related to the ability to differentiate between noise and actual biomechanical effects. The uncertainty about the random error is further compounded when only a small number of subjects is tested, which is a common phenomenon in validation studies, e.g., N = 6 in Morrow et al. (2017), N = 7 in Røislien et al. (2012), N = 10 in Fusca et al. (2018), or N = 12 in Koldenhoven and Hertel (2018).


The aim of this paper is to illustrate how different statistical models affect the coverage and width of prediction bands and to derive recommendations for constructing continuous agreement intervals. Therefore, we compare different models to address three central questions: (i) What is the coverage probability of pointwise vs. functional prediction bands? (ii) What is the amount of uncertainty regarding the variation of band limits? (iii) How does including information about within-subject variation (via multiple curves per subject) affect the parameters in (i) and (ii)? We compare the following models:

• Pointwise LoA including multiple curves per subject (Bland and Altman, 1999, 2007)

• LoA according to Røislien et al. (2012)

• Bootstrapped functional prediction bands (Sutherland et al., 1988; Olshen et al., 1989; Lenhoff et al., 1999)

The analysis in this paper centers on smooth biomechanical signals; therefore, all models are analyzed using simulated and real joint angle curves in different measurement error scenarios.

2. Methods

2.1. Data sets

Four data sets containing joint angle curves from two measurement systems, a gold standard (GOLD) and a new system (NEW), were used to evaluate the models in section 2.2: Three simulated (GAUSS, NONGAUSS, XSHIFT) and one with real movement data (REAL). The data sets represent a broad range of error characteristics typically encountered when comparing biomechanical measurement systems. Fig. 1 shows the original curves of two measurement systems in each of the four data sets. Fig. 3 contains the associated difference curves. The following list presents an overview of the data sets. A more detailed description of the simulated models and an explanation of the selected model parameters can be found in Appendix A.

1. GAUSS (Fig. 1A): The shape of the curves in both measurement systems is similar. The observed differences are the result of small, normally distributed random fluctuations (i.e., Gaussian errors).

   • This represents the ideal case in which the output of both measurement systems hardly differs.

2. NONGAUSS (Fig. 1B): In this scenario, the curves in NEW have a significantly lower amplitude. The distribution of differences is therefore no longer Gaussian, violating an elementary assumption of models, such as the LoA.

   • This scenario may occur as a result of sensor-to-segment alignment artifacts, e.g., when determining joint angles using inertial measurement units (Raimondo et al., 2022).

3. XSHIFT (Fig. 1C): The amplitude of the curves in NEW differs less from the reference system than in NONGAUSS, but a shift in x-axis direction (i.e., a temporal shift) is introduced. The differences are non-Gaussian as well, but display a different distribution than in NONGAUSS.

   • This scenario may occur as a consequence of relative movement between two measurement systems, e.g., when wearable sensors are not properly attached.

4. REAL (Fig. 1D): This data set contains real-world sagittal plane hip joint angle curves from healthy subjects walking on a treadmill without gradient. Data were simultaneously recorded using an optical motion capture system (GOLD) and an inertial measurement unit. The observed differences display more complex behavior than the simulated curves (Fig. 3), which manifests in, e.g., larger differences in the within-subject variation of different subjects.

[Figure 1 about here.]

All data sets contain N = 220 curves from two measurement systems. This corresponds to 110 pointwise difference curves D_ij(t) (NEW − GOLD), where i = 1, ..., 11 is the number of subjects, j = 1, ..., 10 is the number of curves per subject, and t = 1, ..., 101 is the number of curve points. All curves are of equal length. The number of curves was deliberately kept small to reflect the typical sample size of many validation studies. Systematic offsets between the two measurement systems were not modeled, since this paper focuses on random error components.
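To make this data layout concrete, the following R sketch generates difference curves with the dimensions used here (11 subjects, 10 curves per subject, 101 points) under a simple Gaussian error model. The curve shapes and noise magnitudes are illustrative placeholders, not the simulation parameters described in Appendix A.

```r
set.seed(1)
n_sub <- 11; n_rep <- 10; n_t <- 101          # 11 subjects x 10 curves x 101 points
tt <- seq(0, 1, length.out = n_t)

D <- array(NA_real_, dim = c(n_sub, n_rep, n_t))
for (i in seq_len(n_sub)) {
  # smooth between-subject component (illustrative magnitude)
  subj_shift <- rnorm(1, 0, 0.5) * sin(pi * tt)
  for (j in seq_len(n_rep)) {
    # smooth within-subject (trial-to-trial) component (illustrative magnitude)
    trial_noise <- rnorm(1, 0, 0.3) * sin(2 * pi * tt) +
                   rnorm(1, 0, 0.3) * cos(2 * pi * tt)
    D[i, j, ] <- subj_shift + trial_noise     # no systematic offset, as in the paper
  }
}
str(D)   # difference curves D_ij(t) as an 11 x 10 x 101 array
```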

2.2. Prediction band models

Prediction bands were constructed from difference curves between two measurement systems for each of the four data sets. The bands were created using three different methods - two pointwise methods and one functional method. The significance level α was set to 0.05 in the following description.

2.2.1. Pointwise bands

1. RØISLIEN: LoA according to Røislien et al. (2012)

This model was presented as an extension of the basic LoA model, using Fourier series to approximate a continuous function from a discretely observed set of points. The actual calculation of LoA parameters, however, is performed independently for each point of the curve (R package 'fda' (Ramsay et al., 2021)). This practically eliminates the effect of using Fourier series to create simultaneous bands. We therefore omitted this intermediate step.

The upper (u(t)) and lower (l(t)) limits for 1 − α = 0.95 bands are calculated as:

$[l(t), u(t)] = \bar{d}(t) \pm 1.96 \cdot SD(D(t))$  (1)

where $\bar{d}(t)$ and SD(D(t)) denote the time-dependent mean and standard deviation of the difference curves. The model assumes iid curves; therefore, only one (random) curve per subject is drawn each time the model is fit.
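A minimal R sketch of this pointwise calculation, assuming the difference curves are stored in a matrix D (rows = curves, columns = time points) together with a subject vector; this is an illustration, not the authors' published code:

```r
# Pointwise 95% LoA over curve points (Eq. 1), using one randomly drawn curve
# per subject. D: matrix of difference curves; subject: subject label per row.
roislien_band <- function(D, subject, alpha = 0.05) {
  rows <- tapply(seq_len(nrow(D)), subject,
                 function(idx) idx[sample.int(length(idx), 1)])
  Dsub  <- D[as.integer(rows), , drop = FALSE]
  z     <- qnorm(1 - alpha / 2)            # 1.96 for alpha = 0.05
  d_bar <- colMeans(Dsub)                  # time-dependent mean difference
  d_sd  <- apply(Dsub, 2, sd)              # time-dependent standard deviation
  list(lower = d_bar - z * d_sd, upper = d_bar + z * d_sd)
}
```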
2. POINT: Pointwise LoA including multiple curves per subject (Bland and Altman, 1999)

This model accounts for the variation of curves within subjects. It is assumed that differences can be represented as the sum of the mean difference and two variance components:

$[l(t), u(t)] = \bar{d}(t) \pm 1.96\sqrt{\sigma_{db}^{2}(t) + \sigma_{dw}^{2}(t)}$  (2)

where $\sigma_{db}^{2}(t)$ is the between-subject and $\sigma_{dw}^{2}(t)$ the within-subject variance at each independent point t of the difference curves. Both variance components can be estimated from a one-way analysis of variance (Bland and Altman, 2007).
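A sketch of one way to estimate these pointwise variance components via a one-way ANOVA per time point, using the same hypothetical D matrix and subject vector as above (illustrative only):

```r
# Pointwise band with between- and within-subject variance components (Eq. 2).
point_band <- function(D, subject, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  subject <- factor(subject)
  m <- table(subject)                       # curves per subject (here: 10 each)
  # Corrected divisor for unbalanced designs; reduces to the common number of
  # curves per subject when the design is balanced.
  n0 <- (sum(m)^2 - sum(m^2)) / ((length(m) - 1) * sum(m))
  limits <- sapply(seq_len(ncol(D)), function(t) {
    fit <- aov(D[, t] ~ subject)
    ms  <- summary(fit)[[1]][["Mean Sq"]]   # MS_between, MS_within
    var_w <- ms[2]                          # within-subject variance
    var_b <- max((ms[1] - ms[2]) / n0, 0)   # between-subject variance (>= 0)
    mean(D[, t]) + c(-1, 1) * z * sqrt(var_b + var_w)
  })
  list(lower = limits[1, ], upper = limits[2, ])
}
```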
2.2.2. Functional (simultaneous) bands

3. BOOT: Bootstrapped functional prediction bands (Sutherland et al., 1988; Olshen et al., 1989)

In this model, the distribution of differences is estimated using bootstrapping. For this purpose, continuous curves from a discretely observed set of points are approximated using Fourier series and resampled with replacement (B = 1000 times). The number of basis functions K was set to 50 to avoid unintended smoothing effects. The method originally proposes the use of Fourier series, but is not restricted to them. Other smoothing methods, such as B-splines, can also be used to extend the method's applicability to a wider range of signals, including non-periodic curves and curves with underlying trends. The method is fully functional, i.e., band limits are calculated simultaneously using all curve points, rather than independently for each point:

$[l(t), u(t)] = \hat{f}(t) \pm C \cdot \hat{\sigma}_{\hat{f}}(t)$  (3)

where $\hat{f}(t)$ represents the mean Fourier curve and $\hat{\sigma}_{\hat{f}}(t)$ the standard deviation of the Fourier curves. The constant C is determined by repeatedly bootstrapping the original sample and calculating the maximum normalized deviation of the original curves ($\hat{f}_i(t)$) from the bth bootstrap mean ($\hat{f}_b(t)$). The deviation is normalized to the standard deviation of the bth sample, $\hat{\sigma}_{\hat{f}_b}(t)$. C is chosen to make Eq. 4 approximately equal to the desired coverage probability 1 − α:

$\frac{1}{B}\sum_{b=1}^{B}\left[\frac{1}{j}\sum_{i=1}^{j} I\left(\max_{t}\left\{\frac{|\hat{f}_i(t)-\hat{f}_b(t)|}{\hat{\sigma}_{\hat{f}_b}(t)}\right\}\le C\right)\right] \approx 1-\alpha$  (4)

where I is an indicator function which returns 1 if the deviation is ≤ C. For more details, see the description in Olshen et al. (1989) and Lenhoff et al. (1999).
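The following sketch outlines the bootstrap calibration of C on a set of already smoothed curves. It resamples the curves directly instead of refitting a Fourier basis in each replicate, so it is a simplified illustration of Eqs. 3 and 4 rather than a re-implementation of the original method:

```r
# Simplified bootstrap prediction band (Eq. 3/4) on a curve matrix 'curves'
# (rows = smoothed curves, columns = time points).
boot_band <- function(curves, B = 1000, alpha = 0.05) {
  n <- nrow(curves)
  f_hat  <- colMeans(curves)                 # mean curve
  sd_hat <- apply(curves, 2, sd)             # pointwise SD of the curves
  dev <- matrix(NA_real_, nrow = B, ncol = n)
  for (b in seq_len(B)) {
    idx  <- sample.int(n, n, replace = TRUE)           # naive bootstrap resample
    f_b  <- colMeans(curves[idx, , drop = FALSE])      # bth bootstrap mean curve
    sd_b <- apply(curves[idx, , drop = FALSE], 2, sd)  # bth bootstrap SD curve
    # maximum normalized deviation of each original curve from the bth mean
    dev[b, ] <- apply(curves, 1, function(f) max(abs(f - f_b) / sd_b))
  }
  # smallest C for which the average proportion of curves with deviation <= C
  # reaches 1 - alpha (Eq. 4)
  grid <- seq(1, 6, by = 0.01)
  prop <- sapply(grid, function(C) mean(dev <= C))
  C    <- grid[which(prop >= 1 - alpha)[1]]
  list(lower = f_hat - C * sd_hat, upper = f_hat + C * sd_hat, C = C)
}
```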

The applied (naïve) bootstrap method assumes iid curves; therefore - similar to RØISLIEN - only a single random curve per subject is drawn each time the model is fit (BOOTiid). When addressing research question (iii), we implemented a second, modified version of the BOOT method, in which multiple curves per subject are accounted for (BOOTrep). Therefore, BOOTrep includes the two-stage bootstrap process described in Davison and Hinkley (1997), in which subjects (including all of their curves) are sampled with replacement in the first stage, and one curve per subject is drawn without replacement in the second stage. This way, both within- and between-subject variation are accounted for.
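A sketch of the two-stage resampling step used for BOOTrep (subjects with replacement, then one curve per sampled subject), as an illustration of the scheme described in Davison and Hinkley (1997) rather than the authors' exact implementation:

```r
# Two-stage bootstrap resample: subjects with replacement, then one of each
# sampled subject's curves.
two_stage_sample <- function(curves, subject) {
  ids <- unique(subject)
  sampled_ids <- sample(ids, length(ids), replace = TRUE)  # stage 1: subjects
  rows <- sapply(sampled_ids, function(s) {
    idx <- which(subject == s)
    idx[sample.int(length(idx), 1)]                        # stage 2: one curve
  })
  curves[rows, , drop = FALSE]
}
```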

2.3. Coverage probability

The coverage probability in all methods was evaluated using leave-one-out cross validation (LOOCV). For this, the data set is split into training and test data, where the test data set consists of exactly one difference curve d(t). The remaining D − 1 training curves are used to calculate upper (u(t)) and lower (l(t)) prediction band limits and determine how many points of d(t) are contained in the resulting band. This process is repeated D = 110 times, so that every curve is left out once. For each iteration, the percentage of covered points (PCP) is calculated as:

$PCP(c) = \frac{1}{T}\sum_{t=1}^{T} I\{l(t) \le d_c(t) \le u(t)\} \cdot 100$  (5)

where t = 1, ..., T is the number of curve points, c = 1, ..., D is the curve index, and I is an indicator function which returns 1 if the tth curve point is within the limits.

The actual coverage probability is determined as the mean percentage of D LOOCV bands that contain at least x = 100(1 − α)% of the curve points:

$P_{x\%} = \frac{1}{D}\sum_{c=1}^{D} I\{PCP(c) \ge x\}$  (6)

with x = {100%, 95%, 90%, 50%} representing different percentages of curve points contained in the band. For instance, P50% = 0.9 means that 90% of the LOOCV prediction bands cover at least 50% of the test curve. P100% refers to the coverage for entire curves and corresponds to the conventional definition of coverage probability.
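A compact sketch of this LOOCV loop, written against a generic band-construction function (band_fun must return lower and upper limit vectors; names and interfaces are illustrative):

```r
# LOOCV coverage: leave one difference curve out, fit the band on the rest,
# record the percentage of covered points (Eq. 5) and summarize as P_x% (Eq. 6).
loocv_coverage <- function(D, subject, band_fun, x = c(100, 95, 90, 50)) {
  pcp <- sapply(seq_len(nrow(D)), function(cc) {
    band   <- band_fun(D[-cc, , drop = FALSE], subject[-cc])
    inside <- D[cc, ] >= band$lower & D[cc, ] <= band$upper
    mean(inside) * 100                       # PCP(cc)
  })
  sapply(x, function(xx) mean(pcp >= xx))    # P_100%, P_95%, P_90%, P_50%
}

# Hypothetical usage: loocv_coverage(D, subject, roislien_band)
```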
2.4. Uncertainty estimation

The presented models differ with regard to their approach for estimating the amount of curve variance and the underlying sampling strategies. They are therefore likely to result in different degrees of uncertainty about the width of the band limits, i.e., the degree to which the upper and lower band limits are unknown. We analyzed two primary sources of uncertainty:

1. The uncertainty resulting from repeated, random sampling ('Monte Carlo variability'). This applies to all methods except POINT, where there is no random sampling from the original sample.

2. The uncertainty about an inference (i.e., the uncertainty about the population of band limits based on a random sample from that population).

To estimate both sources of uncertainty, an empirical distribution for both the upper (U(t)) and lower (L(t)) band limits was generated via repeated k-fold cross validation. k equals the number of subjects (k = i = 11). The number of repetitions was set to 300. The uncertainty in both band limits is represented as the sum of the difference between the 97.5th and 2.5th percentile across all curve points (Cumulated Area of Uncertainty, CAU):

$CAU_U = \sum_{t=1}^{T} |U(t)_{0.975} - U(t)_{0.025}|$  (7a)

$CAU_L = \sum_{t=1}^{T} |L(t)_{0.975} - L(t)_{0.025}|$  (7b)

where U and L are matrices containing k · 300 repetitions = 3300 band limits each. The overall uncertainty, denoted as Area of Uncertainty (AU, see Fig. 2), is calculated as the sum of the CAU of both limits:

$AU = CAU_U + CAU_L$  (8)
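A minimal sketch of the CAU/AU computation, assuming U and L are matrices with one row per repeated cross-validation band (3300 rows here) and one column per curve point:

```r
# Cumulated Area of Uncertainty (Eq. 7a/7b) and overall AU (Eq. 8)
cau <- function(limits) {
  q_hi <- apply(limits, 2, quantile, probs = 0.975)
  q_lo <- apply(limits, 2, quantile, probs = 0.025)
  sum(abs(q_hi - q_lo))
}
au <- function(U, L) cau(U) + cau(L)
```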

[Figure 2 about here.]

The lower the AU, the lower the variation of the band limits and therefore the uncertainty regarding the band width. We expect models that account for within-subject variation (POINT, BOOTrep) to yield lower AU values. POINT is likely to have the lowest AU value, since the band limits remain unchanged when the method is calculated several times using the same sample. Thus, only the leave-one-subject-out variation is reflected in the results of POINT, which limits the comparability with the AU values of the other methods.

All methods and data sets were implemented in R (v4.0.5) (R Core Team, 2021) using RStudio (RStudio Team, 2022). The code to reproduce the analysis is provided at https://zenodo.org/badge/latestdoi/334994253.

3. Results

Fig. 3 displays difference curves and prediction bands in the four data sets. The respective coverage probabilities are summarized in Table 1. Table 2 contains the uncertainty areas for the band limits.

[Figure 3 about here.]

Table 1: Leave-one-out cross validated (LOOCV) prediction band coverage probabilities of four models (POINT, RØISLIEN, BOOTrep, BOOTiid) across data sets. Coverage probabilities (P) were calculated for different percentages of curve points contained in the band: 100%, 95%, 90%, 50%. For instance, P95% = 0.9 indicates that 90% of the LOOCV prediction bands cover at least 95% of the points of the test curve. P100% refers to the coverage for entire curves.

Data        Px%     POINT   RØISLIEN   BOOTrep   BOOTiid
GAUSS       P100%   0.67    0.52       0.99      0.85
            P95%    0.78    0.60       0.99      0.96
            P90%    0.81    0.65       1         0.97
            P50%    0.99    0.98       1         1
NONGAUSS    P100%   0.84    0.60       0.96      0.85
            P95%    0.85    0.68       0.96      0.91
            P90%    0.86    0.71       0.98      0.95
            P50%    0.98    0.97       1         1
XSHIFT      P100%   0.84    0.77       0.95      0.89
            P95%    0.85    0.82       0.98      0.95
            P90%    0.88    0.83       1         0.96
            P50%    0.99    0.99       1         1
REAL        P100%   0.74    0.72       0.99      0.96
            P95%    0.79    0.79       1         0.96
            P90%    0.85    0.85       1         0.96
            P50%    0.96    0.95       1         1
Table 2: Area of Uncertainty (AU) for the distribution of band limits across methods and data sets. AUs were calculated using repeated k-fold cross validation. Higher values represent more scattered band limits. The results in POINT differ from the remaining methods in that no variation of the band limits occurs when calculating the bands repeatedly with the same sample. Therefore, only the variation across k cross validation folds is reflected in the results of POINT.

Data        POINT   RØISLIEN   BOOTrep   BOOTiid
GAUSS         33       60         53       192
NONGAUSS      40      116        111       315
XSHIFT        81      138        342       475
REAL         586      694       1104      2313

Regarding the coverage probability for entire curves (Table 1: P100%), the BOOTrep bands achieved nominal coverage across all data sets, while the BOOTiid bands showed a slightly lower coverage probability in three out of four data sets. The prediction bands of the pointwise methods (POINT, RØISLIEN) were noticeably narrower (Fig. 3) and achieved a lower coverage probability. Both pointwise prediction bands were similar in quality, but the coverage probability in RØISLIEN was lower for all data sets.

The high coverage probabilities in the remaining coverage levels (Table 1: P50%−95%) show that all methods covered the vast majority of curve points. For both BOOTiid and BOOTrep, 95% of the curve points were covered with a probability close to 1. For POINT, the same probability was still ≈ 0.8. Here, again, the RØISLIEN method had the lowest coverage probability in all data sets.

With regard to the uncertainty about the band limits, the two methods that do not account for within-subject variation (RØISLIEN, BOOTiid) yielded considerably higher AU values (Table 2). BOOTiid exhibited the greatest uncertainty, with the bands tending to be very wide in some cases (see Fig. 2A and Fig. 3D).

4. Discussion

This paper investigated central aspects of the construction of prediction bands from difference curves between two biomechanical measurement systems. We compared different models in various error scenarios with regard to their coverage probability and the amount of uncertainty about the variation of band limits.

4.1. Coverage probability

Both versions of the functional BOOT method achieved or were close to the desired coverage probability in all data sets. This is in contrast to the two pointwise methods, whose coverage probabilities were below nominal.

A look at the percentage of covered points per curve, however, reveals that the pointwise bands still contain the majority of curve points and represent the course of the difference curves rather well. This is further confirmed by a visual inspection of the band limits (Fig. 3), which are plausible even in the presence of severe violations of parametric model assumptions. Although not investigated in this paper due to our focus on random error components, this robustness to the violation of model assumptions may be advantageous in scenarios where functional models are more likely to fail (see Discussion section 4.3).

The coverage probabilities in our study were higher for models that include multiple curves per subject (POINT, BOOTrep) in all data sets. This corresponds to our expectations, since the additional source of variation leads to wider bands and thus increases the likelihood of covering a future observation. However, unlike the pointwise methods, BOOTiid can be expected to achieve nominal coverage when a larger number of curves is included.

4.2. Uncertainty about band limits

The use of multiple curves per subject also has a positive effect on the uncertainty about the variation of band limits. This is best demonstrated using the two BOOT models. The interpretation of the uncertainty values in POINT is somewhat limited, since the prediction bands in POINT are calculated from the entire original sample, while the other models each draw a random subsample. Therefore, an important source of uncertainty is missing in POINT.

In BOOT, prediction bands are calculated by bootstrapping a random subsample from the original sample to get an estimate of the true standard deviation (Eq. 3). In BOOTiid, the subsample is limited to one random curve per subject, whereas in BOOTrep, any curve of the original sample can be drawn. Therefore, the sampling distribution in BOOTrep is more likely to converge to the true distribution, as it draws from a larger set of curves with more variation between curves. This, in turn, results in less variation of the band width across different subsamples.

The difference between BOOTiid and BOOTrep largely depends on the size of the within-subject variation relative to the overall variance of the curves in the data set. In GAUSS, this relative within-subject variation is smaller than in XSHIFT and REAL. Accordingly, the ratio of the uncertainty areas of BOOTiid and BOOTrep (AU_iid/rep) is larger in GAUSS (AU_iid/rep = 3.6) than in XSHIFT (AU_iid/rep = 1.4) and REAL (AU_iid/rep = 2.1). Therefore, and in general, we recommend including within-subject variation when investigating measurement errors. In addition to improving the coverage performance and the uncertainty regarding the variation of the band limits, it is informative in its own right (see Discussion section 4.4). Of course, the dependency of multiple curves of the same subject should be accounted for in the model to avoid falsely narrow bands (Montenij et al., 2016).

Generally, it seems advisable to err on the side of caution and draw larger samples than typically encountered in validation studies to limit the uncertainty about the random measurement error. Since the width of prediction bands, unlike that of confidence bands, does not converge to zero as n → ∞, drawing unnecessarily large samples is less of a concern when calculating prediction bands. This, of course, does not include ethical aspects (Altman, 1980) and cost considerations. To determine whether the chosen sample is large enough, one may study the convergence of band limits, either a priori via simulation, or using predefined stopping criteria.

4.3. Pointwise vs. functional models

It is hard to determine with certainty the extent to which certain model aspects affect the results. Pataky et al. (2015) suggest that the model of randomness (i.e., pointwise vs. functional analysis) is more important than the distinction between parametric and non-parametric methods. This is reflected in our results as well: Parametric model assumptions were fulfilled in GAUSS, but the difference between the coverage probabilities of the parametric (POINT, RØISLIEN) and non-parametric (BOOT) models was slightly larger in GAUSS than in the other data sets. We interpret this as further evidence for the superiority of functional methods.

It is possible, however, to imagine scenarios in which the use of pointwise bands may be justified. This may be the case when the mean error function varies over time, e.g., in the presence of drift. Drift occurs in measurement systems such as force plates or gyroscopes and causes a violation of the assumption of stationarity, a major assumption in many time series models including BOOT. Pointwise methods may further be useful when curves are less smooth than in our data sets (e.g., EMG data). In such cases, it is hardly possible to fit a mathematical function that adequately represents the signal. A detailed analysis of non-smooth curves, however, was not part of this paper, and smoothness-related issues may further be addressed by prior low-pass filtering of the signals.

In our opinion, methods that combine the properties of different approaches should be further investigated in future research. For bootstrapping, methods such as the block bootstrap have been established, in which curves are reduced to a few presumably independent regions (blocks) (Kunsch, 1989). It could be assumed, e.g., that 10 nearby points form a block and that the correlation structure of the entire curve can be adequately represented with these blocks. Another aspect that deserves attention is the design of asymmetric bands to describe the contribution of the respective measurement systems to the random error, e.g., using percentiles instead of symmetric bands around a measure of central tendency.

4.4. Alternative error bands

This paper focuses on bootstrap methods for constructing functional prediction bands, but there are several other methods in the literature on functional data analysis (Goldsmith et al., 2012; Degras, 2017), some of which have desirable small sample properties (Telschow and Schwartzman, 2022), or offer other advantages (Liebl and Reimherr, 2019). However, those methods may be difficult to apply in scenarios such as the ones presented in this paper, where bands are computed for differences and multiple curves per subject are present.

All models in this paper include the between-subject variation. There are situations, however, in which these bands are too wide, since only within-subject effects, e.g., in intra-individual pre-post interventions, are of interest. Bland and Altman (1999) therefore suggest calculating prediction intervals based solely on the within-subject standard deviation in such cases. These are narrower and therefore reduce the risk of false-negative results. In cases where measurement systems are validated without a clearly defined application area, we recommend using different statistical intervals for within- and between-subject designs.

5. Conclusion

The findings of this paper suggest that there are methods that allow for an adequate characterization of difference curves in various error scenarios. The relevance of this work, however, goes far beyond the mere comparison of measurement systems and concerns any biomechanical study in which two groups of curves are compared. If possible, a functional approach should be chosen to account for the problem of multiple comparisons in pointwise models. Pointwise bands have a lower coverage probability, but still contain most of the curve points. They may thus be an option in scenarios where functional models are bound to fail, e.g., when curves are much noisier than in our examples. Any model should account for within-subject variation, i.e., multiple curves per subject should be included to increase the coverage probability and reduce the uncertainty about the width of the band limits.

The construction of prediction bands is accompanied by an increased degree of complexity in comparison with discrete statistical methods. This, in turn, requires at least some programming experience and statistical background. We suspect that the low prevalence of adequate models for curve data in validation studies is directly related to a lack of such experience and background. To lower these obstacles, we provide the associated R code in addition to the paper (https://zenodo.org/badge/latestdoi/334994253).

Acknowledgements

The authors would like to thank Lisa Peterson for linguistic correction of this paper as a native speaker. We would also like to thank the reviewers for their helpful comments.

References

Adler, R., Taylor, J., 2007. Random Fields and Geometry. Springer New York. doi:10.1007/978-0-387-48116-6.

Altman, D.G., 1980. Statistics and ethics in medical research: III How large a sample? BMJ 281, 1336–1338. doi:10.1136/bmj.281.6251.1336.

Altman, D.G., Bland, J.M., 1983. Measurement in medicine: The analysis of method comparison studies. The Statistician 32, 307. doi:10.2307/2987937.

Blair, S., Duthie, G., Robertson, S., Hopkins, W., Ball, K., 2018. Concurrent validation of an inertial measurement system to quantify kicking biomechanics in four football codes. Journal of Biomechanics 73, 24–32. doi:10.1016/j.jbiomech.2018.03.031.

Bland, J.M., Altman, D.G., 1999. Measuring agreement in method comparison studies. Statistical Methods in Medical Research 8, 135–160. doi:10.1177/096228029900800204.

Bland, J.M., Altman, D.G., 2007. Agreement between methods of measurement with multiple observations per individual. Journal of Biopharmaceutical Statistics 17, 571–582. doi:10.1080/10543400701329422.

Cutti, A., Parel, I., Raggi, M., Petracci, E., Pellegrini, A., Accardo, A., Sacchetti, R., Porcellini, G., 2014. Prediction bands and intervals for the scapulo-humeral coordination based on the bootstrap and two Gaussian methods. Journal of Biomechanics 47, 1035–1044. doi:10.1016/j.jbiomech.2013.12.028.

Davison, A.C., Hinkley, D.V., 1997. Bootstrap Methods and their Application. Cambridge University Press. doi:10.1017/cbo9780511802843.

Degras, D., 2017. Simultaneous confidence bands for the mean of functional data. Wiley Interdisciplinary Reviews: Computational Statistics 9, e1397. doi:10.1002/wics.1397.

Donoghue, O.A., Harrison, A.J., Coffey, N., Hayes, K., 2008. Functional data analysis of running kinematics in chronic Achilles tendon injury. Medicine & Science in Sports & Exercise 40, 1323–1335. doi:10.1249/mss.0b013e31816c4807.

Duhamel, A., Bourriez, J., Devos, P., Krystkowiak, P., Destée, A., Derambure, P., Defebvre, L., 2004. Statistical tools for clinical gait analysis. Gait and Posture 20, 204–212. doi:10.1016/j.gaitpost.2003.09.010.

Friston, K., Ashburner, J., Nichols, T., Penny, W., 2007. Statistical Parametric Mapping: The analysis of functional brain images. Academic Press.

Fusca, M., Negrini, F., Perego, P., Magoni, L., Molteni, F., Andreoni, G., 2018. Validation of a wearable IMU system for gait analysis: Protocol and application to a new system. Applied Sciences 8, 1167. doi:10.3390/app8071167.

Goldsmith, J., Greven, S., Crainiceanu, C., 2012. Corrected confidence bands for functional data using principal components. Biometrics 69, 41–51. doi:10.1111/j.1541-0420.2012.01808.x.

Kluitenberg, B., Bredeweg, S.W., Zijlstra, S., Zijlstra, W., Buist, I., 2012. Comparison of vertical ground reaction forces during overground and treadmill running. A validation study. BMC Musculoskeletal Disorders 13. doi:10.1186/1471-2474-13-235.

Koldenhoven, R.M., Hertel, J., 2018. Validation of a wearable sensor for measuring running biomechanics. Digital Biomarkers 2, 74–78. doi:10.1159/000491645.

Koska, D., Gaudel, J., Hein, T., Maiwald, C., 2018. Validation of an inertial measurement unit for the quantification of rearfoot kinematics during running. Gait and Posture 64, 135–140. doi:10.1016/j.gaitpost.2018.06.007.

Kunsch, H.R., 1989. The jackknife and the bootstrap for general stationary observations. The Annals of Statistics 17. doi:10.1214/aos/1176347265.

Lenhoff, M.W., Santner, T.J., Otis, J.C., Peterson, M.G., Williams, B.J., Backus, S.I., 1999. Bootstrap prediction and confidence bands: a superior statistical method for analysis of gait data. Gait & Posture 9, 10–17. doi:10.1016/s0966-6362(98)00043-5.

Liebl, D., Reimherr, M., 2019. Fast and fair simultaneous confidence bands for functional parameters. arXiv:1910.00131.

Ludbrook, J., 2010. Confidence in Altman–Bland plots: A critical review of the method of differences. Clinical and Experimental Pharmacology and Physiology 37, 143–149. doi:10.1111/j.1440-1681.2009.05288.x.

Montenij, L., Buhre, W., Jansen, J., Kruitwagen, C., de Waal, E., 2016. Methodology of method comparison studies evaluating the validity of cardiac output monitors: a stepwise approach and checklist. British Journal of Anaesthesia 116, 750–758. doi:10.1093/bja/aew094.

Morrow, M.M., Lowndes, B., Fortune, E., Kaufman, K.R., Hallbeck, M.S., 2017. Validation of inertial measurement units for upper body kinematics. Journal of Applied Biomechanics 33, 227–232. doi:10.1123/jab.2016-0120.

Olshen, R.A., Biden, E.N., Wyatt, M.P., Sutherland, D.H., 1989. Gait analysis and the bootstrap. The Annals of Statistics 17. doi:10.1214/aos/1176347372.

Park, J., Seeley, M.K., Francom, D., Reese, C.S., Hopkins, J.T., 2017. Functional vs. traditional analysis in biomechanical gait data: An alternative statistical approach. Journal of Human Kinetics 60, 39–49. doi:10.1515/hukin-2017-0114.

Pataky, T.C., Caravaggi, P., Savage, R., Parker, D., Goulermas, J.Y., Sellers, W.I., Crompton, R.H., 2008. New insights into the plantar pressure correlates of walking speed using pedobarographic statistical parametric mapping (pSPM). Journal of Biomechanics 41, 1987–1994. doi:10.1016/j.jbiomech.2008.03.034.

Pataky, T.C., Vanrenterghem, J., Robinson, M.A., 2015. Zero- vs. one-dimensional, parametric vs. non-parametric, and confidence interval vs. hypothesis testing procedures in one-dimensional biomechanical trajectory analysis. Journal of Biomechanics 48, 1277–1285. doi:10.1016/j.jbiomech.2015.02.051.

Pini, A., Markström, J.L., Schelin, L., 2019. Test–retest reliability measures for curve data: an overview with recommendations and supplementary code. Sports Biomechanics, 1–22. doi:10.1080/14763141.2019.1655089.

R Core Team, 2021. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/.

Raimondo, G.D., Vanwanseele, B., van der Have, A., Emmerzaal, J., Willems, M., Killen, B.A., Jonkers, I., 2022. Inertial sensor-to-segment calibration for accurate 3D joint angle calculation for use in OpenSim. Sensors 22, 3259. doi:10.3390/s22093259.

Ramsay, J.O., Graves, S., Hooker, G., 2021. fda: Functional Data Analysis. URL: https://CRAN.R-project.org/package=fda. R package version 5.5.1.

Ramsay, J.O., Silverman, B.W., 2005. Functional Data Analysis. Springer New York. doi:10.1007/b98888.

Richter, C., O'Connor, N.E., Marshall, B., Moran, K., 2014. Comparison of discrete-point vs. dimensionality-reduction techniques for describing performance-related aspects of maximal vertical jumping. Journal of Biomechanics 47, 3012–3017. doi:10.1016/j.jbiomech.2014.07.001.

Røislien, J., Rennie, L., Skaaret, I., 2012. Functional limits of agreement: A method for assessing agreement between measurements of gait curves. Gait & Posture 36, 495–499. doi:10.1016/j.gaitpost.2012.05.001.

RStudio Team, 2022. RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA. URL: http://www.rstudio.com/.

Schwartz, M.H., Trost, J.P., Wervey, R.A., 2004. Measurement and management of errors in quantitative gait data. Gait & Posture 20, 196–203. doi:10.1016/j.gaitpost.2003.09.011.

Sutherland, D., Olshen, R., Biden, E., Wyatt, M., 1988. Development of mature walking. Mac Keith Press.

Telschow, F.J., Schwartzman, A., 2022. Simultaneous confidence bands for functional data using the Gaussian kinematic formula. Journal of Statistical Planning and Inference 216, 70–94. doi:10.1016/j.jspi.2021.05.008.

Figure Captions

Figure 1: Curves from two measurement systems (gold standard (black) and new measurement system) in four data sets (A - GAUSS, B - NONGAUSS, C - XSHIFT, D - REAL).

Figure 2: Distribution of band limits across 300 calculations in four different models (A - BOOTiid (yellow), B - BOOTrep (blue), C - POINT (grey), D - RØISLIEN (pink)) in the GAUSS data set. Areas of uncertainty are displayed as colored ribbons next to the actual differences.

Figure 3: Prediction bands (α = 0.05) of POINT (dotted), RØISLIEN (pink), BOOTrep (blue) and BOOTiid (yellow) in four data sets (A - GAUSS, B - NONGAUSS, C - XSHIFT, D - REAL). Difference curves are displayed as black curves in the background. Note that the bands in RØISLIEN, BOOTrep, and BOOTiid represent a single random subsample and therefore do not necessarily reflect the coverage probabilities in Table 1. E.g., the orange band (BOOTiid) in D is wider than the blue one (BOOTrep), even though the coverage probability of BOOTiid is lower.
Figure 1

Figure 2

Figure 3
CRediT author statement

Daniel Koska: Conceptualization, Methodology, Investigation, Data Curation, Software, Formal analysis, Investigation, Resources, Writing – Original Draft, Visualization

Doris Oriwol: Conceptualization, Methodology, Formal analysis, Writing – Review & Editing

Christian Maiwald: Conceptualization, Methodology, Formal analysis, Resources, Writing – Review & Editing, Supervision
Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.