
Quality of Kinanthropometric Data

A primer for scientific and professional personnel for assessment
of status and change in high performance athletes.

William D “Bill” Ross, Robin V Carr, JE Lindsay Carter, Curtis
Brackenbury and Francis Holway

Are athletes born or made? The answer is yes: there is an interplay of
genetic and environmental influences unique to every individual.

Quality
Introduction
Quantification
Definition
Measurement
Reliability
Measurement Error
Undependability
Objectivity
Validity
Types of Validity
Content-Related Validity
Logical (Face) Validity
Criterion-Related Validity
Concurrent Validity
Predictive Validity
Predictive Index
Construct-Related Validity
Summary of Validity
Relationships: Quantification, Reliability, Objectivity, Validity
The Meaning of Error
Measurement Error (Error of Observation)
Discrete vs. Continuous Variables
Discrete
Continuous
Random vs. Systematic Error
Random (Experimental) Error
Systematic Error
Observed Values
Imprecision and Bias
Precision
Bias
Accuracy
Target Example
Sources of Error
Conceptual Errors
Illegitimate Errors
Blunders
Unfavorable Ambient Conditions
Errors of Coding
Errors of Computation
Computer Program Errors
Chaotic Errors
Errors in Assembling and Reporting Data
Rounding Procedures
Median Values
Correlations Ignore Systematic Errors
Test-Re-test Reliability Coefficients
T-Tests and F-Tests May Ignore Random Errors
The Technical Error of Measurement
Example Calculation (TEM)
The Percent Technical Error of Measurement
Example Calculation (%TEM)
Spreadsheet Calculation of TEM for Individuals
Standards of Excellence
Overview
Examples in Comprehensive Protocols
A Mathematical Simulation of Scale Increments & TEM
Combining Errors
Theory of Combining Errors
Replication and Percent Reduction in Error
Median as Best Estimate of True Value
Practical Decisions Using TEM
Data Management
AMPlan
Troika Training
Introduction
Assessing individual growth status or monitoring change depends upon
"theory and process," as advocated by W Edwards Deming, the
American who taught Japanese business about quality and who helped
bring about a revolution in manufacturing and business. Deming
believes, after all is said and done, that quality is a function of human
commitment (Peters & Austin (1983)).

The concept of quality in measurement is inextricably linked with the
topics of quantification, reliability, objectivity and validity. This section
of Anthropometry Illustrated (Ross, Carr, Carter (1999)) deals with
these topics first in general terms, and then as they relate to
anthropometry. In practical terms, quality in measurement is related
to the reduction, and ideally the eradication, of measurement error.
This process presupposes a practical appreciation of theory and
process.

Quantification
Quantification is the process of ascribing numbers to variables. This
involves two distinct processes:

1. Definition;

2. Measurement.

Quantification depends first upon clearly defined criteria for evaluation
(definition). In body composition, for example, if two observers from
the general public are asked to rate the extent to which a body builder
is over-weight, they may provide very different assessments, since the
term ‘over-weight’ does not contain any clearly defined criteria. One
observer may not like the appearance of large musculature, and may
therefore rate the body builder as being considerably over-weight.
The other may see the relatively low amount of subcutaneous adipose
tissue and rate the body builder as being not at all over-weight. Who
is right? Neither, if there is no agreement to define the term
‘over-weight.’

A second requirement for quantification is the adequacy of the
measurement technique. Assume the term ‘over-weight’ has been
defined as being ‘proportionately over-fat’ (we won’t give a more
specific definition here). While two experienced evaluators may reach
substantial agreement from a subjective assessment of this criterion,
such assessments are always vulnerable to the intrusion of bias.
Sub-conscious opinions, vested interests, loss of perspective,
emotional state and many other factors may intervene to affect an
individual’s subjective assessment and thereby diminish the agreement
between the two observers. This, therefore, implies the necessity of
quantification through an accurate measurement process.

Kinanthropometry may be thought of as the quantitative interface
between human structure and function. It is utterly dependent upon
highly refined measurement techniques geared to quantify rigorously
defined structural parameters.

Reliability
Reliability is defined as the degree to which repeated measurements of
a variable provide a consistent score. Reliability is enhanced by
reducing two factors:

measurement error;

undependability.

Measurement error is the deviation of an observed measurement value
from the true value. This deviation may be random or systematic,
depending upon the source of the error. For instance, someone
inexperienced in making skinfold measurements is likely to obtain
quite variable values for repeated measurements on the same skinfold
site. Much of the variability in these measurements may be due to
random error, and this measurer would be termed unreliable.
Systematic error may also be present. Most often, this is a matter of
using different landmarks or techniques. Again, these must be defined
explicitly in the measurement protocol. This can be done in two ways:
(1) by fully describing technique, or (2) by referring to a source where
this is done. Systematic error, when uniform over measurement
occasions, would not reduce the reliability of the measurements. Both
random and systematic errors are described in more detail below.
Undependability is a change in the true value of the variable during the
time of repeated measurements. For example, the repeated
measurement of stature throughout the day is likely to show a small
amount of variability. While some of this may be due to error, it may
also be due to the well-known fact that people tend to be slightly
shorter at the end of the day, after gravity has applied long periods of
compression on the spinal column and some fluid has been evacuated
from the intervertebral disks. In this case, any gradual reduction in
the observed values for stature throughout the day may be mostly due
to the change in its true value. Undependability is not error, but it
reduces consistency and therefore it diminishes assessed reliability.

Anthropometry is a very reliable discipline when practiced by highly
trained individuals. Schutz (1998), in an informal and unpublished
survey of reliabilities among tests from a number of disciplines under
the kinesiology / human kinetics umbrella, rates it highest (see table 1
below).

Table 1: Some approximate reliabilities for tests of structure and
function (Schutz (1998)).

*Note: Reliability of 'Other Anthropometry' is increased with
replication of measurements, as are tests in other areas.

Tests                         Reliability
Height, Weight                .99 - 1.00
Other Anthropometry           .90 - .99*
Strength                      .90 - .95
Physiological Tests
  anabolic steroids           .95 - .98
  other substances            .50 - .99
Physical Endurance
  muscular endurance          .85 - .90
  cardiovascular direct       .80 - .85
  cardiovascular indirect     .75 - .85
Reaction Time
  one trial                   .30 - .50
  mean of 20 trials           .80 - .90
Accuracy Tests
  one trial                   .00 - .50
  mean of many trials         .50 - .90
Cognitive Tests               .70 - .80
Attitude Inventories          .60 - .70

Objectivity (Inter-Observer Reliability)


Objectivity can be defined as the degree to which different observers
assign the same score to a variable. As examples, when judges
observe a figure skating or diving performance, the extent to which
they ascribe different scores to the same performance shows fallibility
in objectivity.

Objectivity requires commitment, agreement, standards, training and
evidence of consistency among investigators. This then opens
investigation opportunity to the whole of the scientific community, to
confirm or deny findings. The driving engine of science is strong
inference and testable hypotheses. Objective measurement is also the
basis for cooperative research endeavor.

Validity
The validity of a measurement is the extent to which it measures what
it is supposed to measure. This requires external criteria that are also
objective. The American Psychological Association (1985) outlines
different types of validity, as is shown in figure 1 below:

Figure 1: Types of validity applied to anthropometry.


While this organizational scheme was devised for categorizing the
validity of psychological tests, it can easily be applied to
anthropometry by providing only slight changes to the definitions. The
following are modified definitions and examples of each type.

Content-related validity can be defined as the degree to which a
measurement is representative of some defined universe or ‘domain’
of content. As an example in anthropometry, in a study that looks at
the changes that occur during a six-month weight lifting program for
adults, it would make sense to select a number of muscle girths to be
measured as dependent variables. On the other hand, while
standardized facial dimensions can also readily be measured, they are
unlikely to be of benefit in such a study since they are irrelevant in this
context. Measurements taken in any study should generally be limited
to those that reflect the content and purpose of the study, and which
are under investigation for possible change or potentially important
relationships with other variables. In other words, a measurement
may be valid in some situations and for some purposes, but not in and
for others.

Similarly, acromiale-radiale distance is a perfect measure of arm
length, subject to the limitations of the measuring technique. It is not
a valid measurement for reach, though, since reach would include
other factors. Also, a skinfold measurement may be a good estimate
of the subcutaneous adipose tissue and the two layers of skin at that
site, but it generally is not a good estimate of total body fat (ether-
extractable lipid).

A classic abuse of content-related validity is the use of the Body Mass
Index (BMI) as a tool for individual counseling regarding optimal body
weight and health status appraisal. Since the BMI is an index formed
by weight in kilograms over the square of height in meters, it is
simply an index of ponderosity. It does not take much thought to
realize that high ponderosity could just as easily result from
proportionately large amounts of muscle and/or bone as it could from
fat. While it may have certain limited uses with group data in
epidemiology, the BMI says nothing about an individual’s body
composition, and should never be used by itself for individual
prescription in that regard. In short, a measurement or derived
variable may be perfectly valid in one situation (e.g. the BMI as a
ponderosity index) and irresponsibly invalid in another (e.g. as an
indicator of body composition or health status for individuals).
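The arithmetic behind this argument is easy to demonstrate: the index collapses very different physiques to one number. The weights and heights below are illustrative values, not from the text:

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms over the square of height in meters."""
    return weight_kg / height_m ** 2

# A muscular 90 kg athlete and a 90 kg sedentary person of the same
# stature share one BMI: the index says nothing about composition.
print(round(bmi(90, 1.80), 1))  # prints 27.8 for both physiques
```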

Logical validity (sometimes called face validity) can be defined for
anthropometry as the untested assumption of the validity of a
measurement for a particular purpose, based upon logical conclusions
about what is being measured. For example, take the relationship
between tibiale mediale-sphyrion tibiale length and actual tibia length
as determined from cadaver or radiographic studies. In the absence of
reported information about this relationship, and given knowledge of
the human skeleton, the logic of assuming the latter from the former
appears strong enough to warrant the former’s use as a substitute for
most purposes.

Criterion-related validity proclaims the validity of one or more
measurements or derived variables based upon their concurrent or
predictive relationships to a known criterion variable. There are two
types of criterion-related validity:
Concurrent validity is the degree to which one or more
measurements or derived variables correlate with a criterion
variable which has already been established as being a valid
measure of the attribute of interest. The new variable thereby
assumes the potential role of being a substitute for the criterion. As
an example, body densitometry was for many years considered to
be the criterion ‘gold standard’ for the estimation of body
composition. Because it is a relatively elaborate and time
consuming process, many attempts have been made to develop
simpler substitute tests that correlate well with this criterion. Over
a hundred combinations of skinfold measurements were used as
predictors of this criterion because they showed reasonably high
correlations with the densitometric estimation of fat in different
samples. Many other types of predictors (e.g. electrical impedance
and infra-red technologies) were also validated using this same
concurrent approach. They are no more valid than the criterion,
however, and this criterion has now been seriously challenged
(Martin (1984), Martin et al (1986), Ross et al (1987)).

An empirical finding of criterion-related validity assumes that the
criterion itself is valid. However, this is an unwarranted assumption
unless it is supported by a clear rationale and empirical evidence.
For example, skinfold predictions of percent fat validated from
densitometric criteria are valid only to the extent that the body has
two compartments of known density. The evidence here is not at all
convincing; in fact, it can be shown not to be true.

Predictive validity is the degree to which a criterion variable can be
predicted from one or more measurements or derived variables. For
instance, certain patterns in anthropometric variables have been
found to correlate with participation in elite sports (Carter (1981),
Carter (1985), Ross and Ward (1984a)). Carr (1994) used multiple
regressions to predict standing vertical jumps from various
anthropometric variables.

It must be remembered, however, that error is always involved in
prediction if the correlations are not perfect, and that the errors of
estimate will increase as the correlations decrease. In other words,
a prediction will have less certainty and an increased amount of
expected error as the correlation between the predicted and
predictor variables diminishes.

This can be demonstrated by the Predictive Index (PI) shown in
equation 1, figure 2 and table 2 below. It estimates by how much
better than pure chance one can predict Y from one predictor X,
using the best fitting linear regression equation. Expressed as a
percent, it is 100 times (one minus the square root of (1 - r
squared)). The equation can be written as follows:

Equation 1: The predictive index.

PI = 100 × (1 − √(1 − r²))

Figure 2: The predictive index, expressed as a percent, is 100
times (one minus the square root of 1 - r squared). As shown
above and below (table 2), the prediction from a regression
equation with a correlation of 0.80 is only 40% better than pure
chance; a correlation of 0.50 is less than 14% better than pure
chance.

Table 2: The predictive index for the regression of Y on X,
expressed as a percent better than pure chance, and based on
the linear correlations between the variables.

Correlation (r)  PI (%)
0.00 0
0.10 1
0.20 2
0.30 5
0.40 8
0.50 13
0.60 20
0.70 29
0.80 40
0.90 56
0.95 69
0.98 80
0.99 86
0.999 96
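Table 2 can be reproduced directly from Equation 1; a minimal sketch in Python:

```python
import math

def predictive_index(r):
    """Percent improvement over pure chance when predicting Y from X
    with a linear regression whose correlation is r (Equation 1)."""
    return 100 * (1 - math.sqrt(1 - r ** 2))

for r in (0.50, 0.80, 0.99):
    print(f"r = {r:.2f}  PI = {predictive_index(r):.0f}%")
```

Running this over the full column of r values regenerates the table above.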

It is perhaps worth noting that judgment about significance is number
dependent. Jokingly, we point out that some physiologists in North
America seem to think a large sample is when the subjects outnumber
the co-authors.

Construct-related validity can be defined as the degree to which a
measurement reflects an attribute that cannot be directly measured.
An example may be the use of the Heath-Carter anthropometric
somatotype (Carter (1975)). The endomorphy component is based on
the shape construct of roundness, which is difficult to define and
measure directly on the human form. However, since large amounts
of subcutaneous adipose tissue tend to increase roundness of shape,
skinfold measurements can be used to estimate this construct.

It should be noted that the above definitions may not provide exact
boundaries among the different classifications of validity, and that
some examples may fit into more than one category. The definitions
are intended merely as an aid to the assessment of validity in
interpreting anthropometric data.

In summary, Howard Meredith taught his students that validity had
two requirements:

1. Reliability (both intra- and inter-observer);

2. Relevance (evidence of a demonstrated link between a
measurement and the quantity being assessed).

A whimsical example is perhaps appropriate for students. While it may
be shown that crime rate in any given city over a period of time is
correlated with the salaries of Baptist ministers, there is obviously no
causative link. And while the measurements of both crime and
salaries may be very reliable, reducing Baptist ministers' salaries is not
a relevant approach to reducing crime.

Those having impeccable measurements and claiming validity for a
particular proposition must not forget relevance. The ability to
distinguish relevance is the basis of scientific sophistication.

Relationships Among
Quantification, Reliability, Objectivity & Validity
Theoretically, of all measurements that can be made, only a sub-set of
these measurements will be reliable. And of all reliable
measurements, some will be objective, and only a sub-set of those will
be valid for a specific purpose. The following Venn diagram helps one
visualize the relationships among these concepts:

Figure 3: Venn diagram showing relationships among
quantification, reliability, objectivity and validity.

Paraphrasing an often-noted statistical principle, it can be said that "a
measurement can be reliable without being valid, but to be valid, it
must be reliable."

The Meaning of Error


In science, the word "error" has two meanings:

the actual difference between a measured value and a "true" value;

the approximate difference, estimated from the uncertainty of a
measurement.

The former is a hypothetical situation with continuous variables (see
below), since the true value is never known. The latter is expressed
as a standard deviation, average deviation, probable error, precision
index, technical error of measurement, or some measure of departure
from an expected or theoretical value. It is essentially a measure of
reliability, and may therefore include any undependability that may be
present.

Measurement Error (i.e. Error of Observation)


Reliability and validity are both diminished by error – different types of
error. The following discussion focuses on error theory and practical
problems and solutions surrounding it.

In theory, any observed value in the measurement of a variable is
made up of the true value of the variable, plus any observation error
that may or may not be present:

Equation 2: Components of an observed value.

OBSERVED VALUE = TRUE VALUE + ERROR OF OBSERVATION


It is possible to measure some variables with no error at all, while
other variables will always have some degree of error included in the
measurement, no matter how carefully they are made. This difference
has to do with the type of variable being measured.

Discrete vs. Continuous Variables


Discrete variables can only be measured in whole numbers. For
instance, the number of people in a room is a discrete variable.
Counting the people carefully should give you an accurate
measurement of the number of people present, with no error
resulting. The score in a soccer game, the number of females and
males in a study, and the number of measurements made on a subject
are other examples of discrete variables. If one measures or counts
these variables carefully, one can obtain the true value of the variable
with no error being made at all.

Continuous variables may take any value within a defined range, and
between any two values an indefinitely large number of in-between
values can occur. Continuous variables can thus be measured with
increasingly smaller fractions or greater number of decimal places,
depending upon the amount of accuracy required and the precision of
the measurement process. For instance, the height of the back of a
chair is a continuous variable. The measurement may be 'rounded off'
to the nearest whole meter, which is likely not to be the exact height
of the chair. In other words, some degree of error is introduced by the
rounding off process. While taking the measurement to a greater
number of decimal places (e.g. to the nearest centimeter, or to the
nearest millimeter) can reduce this error, in theory there is always
some error present. In fact, no matter how small the scale of
measurement is, the 'true' height of the chair may be somewhere
between two units on that scale. The error that results from this
rounding off process is called random error, because the rounded
value is just as likely to be above as below the true value.

Random vs. Systematic Error


There are two general types of error that can be made in the
measurement of a continuous variable: random error and systematic
error. Random error is also referred to as experimental error.
As is discussed above, the measurement of continuous variables
always includes some random error, although the amount can usually
be reduced by the following:
better definition of the measurement criteria;

improvements in instrumentation;

improved measurement conditions, and;

practice, ideally under the guidance of a criterion anthropometrist.

Many factors may increase the amount of random error in a
measurement. For example, the cyclical postural movements that are
concomitant with breathing may introduce random error into the
measurement of anterior-posterior (A-P) chest depth by an
inexperienced measurer who has not been taught to measure at the
‘end tidal’ moment of ventilation.

Some examples of random error in anthropometry are:

inconsistent errors in the judgment of close measures (e.g. 170.1 or
170.2 cm?);

chance intermittent conditions (availability of subjects, variability in
clothing);

small disturbances (breathing of subject, talking, distractions such
as phone calls, extraneous conversation, discourtesy, joking,
fleeting inattention, a changed environment caused by celebrity
subjects or visitors);

unintentional random departures from explicit descriptions of
landmarks/techniques;

biological variation of individuals (i.e. individual differences in
structure affecting landmark location or the behavior of calipers or
tapes).
It should be noted that random error has no bias, and has an equal
likelihood of being either greater than or less than the true value.
Over large numbers of measurements, the accumulation of random
error should eventually approach zero, with the positive errors
canceling out the negative errors. This has two important implications.
Group data tends to have random error averaged out to zero,
especially as the number of subjects in the group increases. Since
most research studies involve group data, they are therefore more
resistant to random error than is data focused on an individual.

Both individual and group data benefit by having repeated
measurements of the variable of interest. The greater the number
of replications, the more likely the random error will approach zero.

In anthropometry, random error can be reduced by taking repeated
measurements of the variable and by obtaining the mean or median of
these values. The median of the repeated measurements is more
resistant to extreme values and blatant error than is the mean (see
'Median Values of Measures' below). Hence, the general rule for
replication in anthropometry is to use the median as the best estimate
of the true value.

The other type of error is called systematic error, and it may or may
not be present in a given instance, irrespective of the type of variable
being considered. It is the effect of regular definable departures from
theoretical expectancies. An example of systematic error with a
discrete variable involves two individuals taking someone’s exercise
pulse rate over the same fifteen-second period. The first individual
may use the proper technique, which involves clicking the stopwatch
simultaneously with the counting of the pulse beats from zero. The
other may begin at the same time by starting the pulse count from
one. Thus on every occasion the second individual would have a
systematic error which would overestimate the true value of the pulse
rate by one.

An example of systematic error with a continuous variable in
anthropometry involves weighing people on two separate weigh
scales. One scale may have the dial mis-calibrated to read, say, 0.13
kg when no weight is on it. In this case, everyone measured will
weigh 0.13 kg more on this improperly calibrated scale than on the
properly calibrated one. This constitutes a systematic error with a
continuous variable, which may be easily corrected by an arithmetic
constant.

Some examples of systematic error in anthropometry are:

error due to the improper calibration of any instrument;

some anthropometrist errors (e.g. individual differences in the
locating of a landmark or in the touch or pressure on a tape or
caliper);

certain experimental conditions (e.g. the thickness and angle of a
wall mark for stature or a landmark on a subject; these could be
random effects as well, depending on the schedule for repeated
measurements);

imperfect technique (e.g. the mishandling of instruments, or the
failure to raise a skinfold at the marked site and encompass
entrapped tissue), inexperience and ineptitude or ignorance (e.g.
the common missed location of the iliac crest skinfold);

untrained and inept teachers of anthropometry (the ISAK
certification scheme addresses this problem);

sample bias (an atypical element in a defined group);

ambient effects on measurement. Diurnal variance in stature and
weight is a fact that must be taken into account. People are taller
and lighter in the morning. In some studies the protocol for
determining weight may call for subjects to be postprandial for
twelve hours, and weighed nude after voiding. Student studies
have also shown a change in stature and sitting height
accompanying running, jumping, hanging upside down and partner
stretching. Thus the ambient conditions at the time of the
measurement occasion may be systematic or random, depending on
the schedule and the assignment of subjects.
In summary, many errors may be either random or systematic,
depending upon how consistently they occur under the given
circumstances. We can therefore refine our definition of an observed
value as follows:

Equation 3: Refined components of an observed value.

OBSERVED VALUE = TRUE VALUE + RANDOM ERROR + SYSTEMATIC ERROR

Where:

random error is always present to some extent with continuous
variables, and;

systematic error may or may not be present.
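Equation 3 can be illustrated with a small simulation. The true value, error sizes and seed below are arbitrary illustrative choices, not values from the text:

```python
import random

random.seed(1)            # fixed seed so the sketch is repeatable
TRUE_VALUE = 170.0        # hypothetical true stature, cm
SYSTEMATIC = 0.5          # a constant mis-calibration offset, cm

# Each observation = true value + random error + systematic error
observations = [TRUE_VALUE + random.gauss(0, 0.3) + SYSTEMATIC
                for _ in range(1000)]

mean_obs = sum(observations) / len(observations)
# Over many measurements the random errors cancel out,
# but the systematic offset remains in the mean.
print(f"mean of observations: {mean_obs:.2f} cm (true value {TRUE_VALUE})")
```

Note how replication drives the random component toward zero while leaving the 0.5 cm bias intact, which is exactly why replication alone cannot detect systematic error.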

Imprecision and Bias


Imprecision and bias are the results of random and systematic error,
respectively. Precision refers to the closeness with which the
measurements agree with one another. Bias occurs when the general
tendency, or average, of the results is wrong. Accuracy refers to the
closeness of the measurements to the true value, so accuracy requires
results that are both precise and unbiased. These terms and concepts
can be easily demonstrated by the example of shooting at a target
(modified from Weldon (1986), p 78), with the error of each shot
being its distance and direction from the center (see figure 4 below).

Figure 4: Visual depiction of imprecision and bias.

Target Example

ACCURATE: PRECISE (no random error), UNBIASED (no systematic error)

INACCURATE: PRECISE (no random error), BIASED (systematic error present)

INACCURATE: IMPRECISE (random error present), UNBIASED (no systematic error)

INACCURATE: IMPRECISE (random error present), BIASED (systematic error present)
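The four target panels can be mimicked numerically: precision is agreement of the shots with one another (spread about their own mean), while bias is the offset of that mean from the centre. A minimal sketch with made-up shot coordinates (the function name and data are hypothetical):

```python
def describe(shots, centre=(0.0, 0.0)):
    """Return (bias, spread): distance of the mean shot from the
    centre, and mean distance of shots from their own mean."""
    n = len(shots)
    mx = sum(x for x, _ in shots) / n
    my = sum(y for _, y in shots) / n
    bias = ((mx - centre[0]) ** 2 + (my - centre[1]) ** 2) ** 0.5
    spread = sum(((x - mx) ** 2 + (y - my) ** 2) ** 0.5
                 for x, y in shots) / n
    return bias, spread

tight_on_centre = [(0.1, 0.0), (-0.1, 0.1), (0.0, -0.1)]   # accurate
tight_off_centre = [(2.1, 2.0), (1.9, 2.1), (2.0, 1.9)]    # precise but biased
```

The second group has the same spread as the first (equally precise) but a large bias, so it is inaccurate despite its precision.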

Sources of Error
Conceptual Errors (These reduce validity!)

The concept and definition of what is being measured and the
assumptions used in the analysis may give rise to major errors and
misinterpretations. The use of the BMI (the ratio of body mass to the
square of stature) as an index of obesity to ascribe a "healthy" weight
for individuals is a blatant conceptual error. Conceptual errors are
rampant in the life sciences, and even more so in the social sciences.
They arise usually from unwarranted assumptions made in the effort to
reduce the complexity of a problem. The errors are then embedded,
and even when computers can address the complexity of the problem,
the same assumptions are entered into the computational algorithms
and thus pervade contemporary thinking. This destroys the validity of
the concepts as they are used.

Illegitimate Errors (These reduce reliability, which also reduces
validity!)

Illegitimate errors are those generated in the data assembly. They
should not happen, but they do. They result from inadequate training,
attention, protocol or data management. Deceptively simple
anthropometric data assembly, resolution and report require
knowledge, understanding, practice, and dedication. They also require
explicit descriptions and discipline for compliance. The designation of
a criterion anthropometrist for each data assembly affixes
responsibility for the monitoring of technique and data management as
a coordinated team effort.

Typical illegitimate errors are:

Blunders:
Examples include incorrect reading of values, poor verbalization or
other faulty transmission of these values, recording error, and
computer entry error. Automated instrumentation may help but
cannot substitute for vigilance.

Errors caused by unfavorable ambient conditions for measurers and
subjects:
Examples include haste, non-detached measurers who are in some
way influenced by the outcome (e.g. proving a training gain),
threatening or autocratic leadership, inadequate equipment, poor
lighting and ventilation, an unreasonable schedule, and failure to
recognize fatigue or eye strain.

Errors of coding:
These most commonly involve ascribing data to the wrong subject,
or mis-classifying a subject.
Errors of computation:
These are common when protocol calls for two or three trials which
are averaged, rather than taking three trials and using the median.
They may also include decimal point errors and reversals of digits.

Computer program errors:
Especially with rapidly advancing technology, new programs often
have errors. A new approach to test data analyses is to introduce
known systematic error in a system to assess the resiliency of the
system to detect and flag or otherwise inform the investigator of an
untoward effect.

Chaotic errors:
These are large errors of the kind that force the criterion
anthropometrist to call a "halt and reassessment." For example,
assigning data to the wrong subject, or the failure of an electronic
device; e.g. a scale or instrument dependent on batteries.

Errors in Assembling & Reporting Data


In anthropometry, quality is reflected in the "precision" and "accuracy"
of the measurement. As noted above, these terms are not
synonymous.

Rounding Procedures

In the data assembly, complex rules tend to introduce error. For
simplicity, we thus recommend rounding up when the decimal fraction
is 0.5 or greater, and rounding down when it is less than 0.5. In
statistical analyses, other rules involving significant figures and
arbitrary rounding rules may apply (see Brinkworth (1968) p 101). One
practice is to 1) round up in girths, since the light touch is an element
of style, and 2) round down in lengths, breadths and skinfolds, since
these measures require the application of force. The rule should be
consistent for each study or data assembly.
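The recommended rule (round up at a decimal fraction of 0.5 or greater) differs from the default in some software; Python's built-in round, for instance, takes exact halves to the nearest even digit. A minimal sketch of the recommended rule:

```python
import math

def round_half_up(value, decimals=0):
    """Round up when the decimal fraction is 0.5 or greater,
    down when it is less (the rule recommended above)."""
    factor = 10 ** decimals
    return math.floor(value * factor + 0.5) / factor

print(round_half_up(2.5), round_half_up(3.5))  # 3.0 4.0
print(round(2.5), round(3.5))  # 2 4 -- built-in round takes halves to even
```

Whichever rule is adopted, it should be applied uniformly across the whole data assembly.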

Median Values of Measures

As was discussed above, random error in a measurement can be
reduced by taking repeated measurements of the variable and by then
obtaining a measure of central tendency (i.e. the mean or median).
The median, or mid-score of rank-ordered measurements, is more
resistant to extreme values and blatant error than the mean. Assume,
for example, that the measured values in columns A, B and C in table
3, below, contain triplicate measurements of an object that has a true
value of 5. Assume that the first two measurements of this object
have no error, but that the third measurement, in column C, contains
an extreme error. The mean value of the triplicate measures is
affected by the extreme error and is different from the true mean,
while the median value is the same as the true median. Hence, in
anthropometry the general rule for replication is to use the median of
three measures as the best estimate of the true value.

Table 3: Hypothetical example showing that the median is more
resistant to extreme error than the mean.

                  A    B    C    Mean   Median

True values       5    5    5     5       5

Measured values   5    5    11    7       5

On occasion, anthropometrists take more than three measures. The
general rule for obtaining the median from rank-ordered
measurements is:

for odd numbers of replications, use the mid-score;

for even numbers of replications, use the arithmetic mean of the
middle two scores.
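These two rules are exactly what a standard median routine implements. A minimal sketch in Python, using hypothetical measured values:

```python
from statistics import median

# Odd number of replications: the mid-score of the rank-ordered values.
triplicate = [5.0, 11.0, 5.0]            # one measure contains a blunder
print(median(triplicate))                # 5.0

# Even number of replications: the mean of the middle two rank-ordered scores.
quadruplicate = [5.0, 5.5, 11.0, 5.0]
print(median(quadruplicate))             # (5.0 + 5.5) / 2 = 5.25
```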

Correlations Ignore Systematic Errors

Pearson correlation coefficients, T-tests and F-tests are often used
to assess the reliability of repeated measurements. Each of these has
limitations, however, and the simple examples in Tables 4 and 5 below
serve to illustrate this point.

Table 4: 1st hypothetical example of repeated measurements on
five subjects.

A  B
5  6
4  5
3  4
2  3
1  2

The test-retest reliability correlation coefficient between the repeated
measurements in A and B (Table 4) is a perfect +1.00, yet the results
are obviously not perfectly reliable (i.e. they are not consistent).
Correlation coefficients cannot describe systematic error effects, such
as the one that occurs here. One can predict (regress) A from B by
simply subtracting the constant 1, but B cannot act as an unqualified
substitute for A because of the systematic difference between the
two. Thus, in comparing B with A, we have a precise but inaccurate
system. There is no random error. The inaccuracy is entirely due to
the systematic error.
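The Table 4 effect is easy to verify numerically. A minimal sketch, computing the Pearson coefficient by hand for the hypothetical A and B series:

```python
from math import sqrt

A = [5, 4, 3, 2, 1]
B = [6, 5, 4, 3, 2]   # each retest reads exactly 1 unit high

n = len(A)
mean_a, mean_b = sum(A) / n, sum(B) / n
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(A, B))
sd_a = sqrt(sum((a - mean_a) ** 2 for a in A))
sd_b = sqrt(sum((b - mean_b) ** 2 for b in B))
r = cov / (sd_a * sd_b)

print(round(r, 2))    # 1.0: the coefficient is blind to the systematic +1 shift
```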

T-Tests and F-Tests May Ignore Random Error

T-tests and F-tests are often inadequate for describing random error.
Suppose the repeated measures were as shown below in Table 5.

Table 5: 2nd hypothetical example of repeated measurements
on five subjects.

A  B
5  1
4  2
3  3
2  4
1  5

Here, the mean of measurement occasion A is the same as the mean
of occasion B, and the variance is the same for both. T-tests and
F-tests would show no significant difference between occasions A and B,
supporting the notion that the results are reliable, yet obviously
substantial error has occurred (i.e. the TEM = 2 and %TEM = 67%;
see below) and we have an inaccurate system. We cannot determine
here whether this error is systematic or random.
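The quoted TEM = 2 and %TEM = 67% can be checked directly from the Table 5 data (the TEM formula itself is given below as Equation 4); a minimal sketch:

```python
from math import sqrt

A = [5, 4, 3, 2, 1]
B = [1, 2, 3, 4, 5]   # same mean and variance as A, yet wildly inconsistent

d_squared = sum((a - b) ** 2 for a, b in zip(A, B))
tem = sqrt(d_squared / (2 * len(A)))
pct_tem = 100 * tem / (sum(A) / len(A))   # here relative to the mean of A

print(tem)              # 2.0
print(round(pct_tem))   # 67
```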

The Technical Error of Measurement

The technical error of measurement is a more appropriate statistic to
assess reliability and quality, since it takes into account both random
and systematic differences (Dahlberg (1940); Johnston et al (1972);
Mueller and Martorell (1988); Ross et al (1988b); Carr et al (1989);
Schmidt and Carter (1990); Ross and Eiben (1991a); Ross and
Marfell-Jones (1991b); Knapp (1992); Carr et al (1993); Ross et al
(1994); Ross (1996a); Pederson and Gore (1996); Norton and Olds
(1996); Ross et al (1997)). Furthermore, its units are the same as
those of the variable, and this allows the setting of standards for
precision, and/or confidence limits.

The technical error of measurement is the square root of the quotient
of the sum of the squared differences between replicated measures
and twice the number of subjects.

The equation is as follows:

Equation 4: The technical error of measurement.

TEM = (ΣD²/2n)^0.5

where D = d2 - d1 (the difference between second and first
measurements).

It is similar in concept to the standard deviation of the differences,
except that by dividing by 2n instead of n, half of the differences
(errors) are attributed to each of the measurements upon which these
differences are based.

Following this reasoning, if there are more than two measurements
taken on each subject for a given variable, n indicates the number of
comparisons. This is shown below in Table 6, where three sets of
measurements (A, B, C) are taken on five subjects. There are thus 3
comparisons made for each subject (A-B, B-C, A-C), 15 comparisons
made in all, and 15 difference scores that result. Then 2n becomes
(2 × 15) = 30.

Example Calculation (TEM)

Table 6: Arithmetic example for calculation of technical error of
measurement.

  A     B     C     A-B   B-C   A-C   (A-B)²  (B-C)²  (A-C)²

 10.0  10.0  10.5   0.0  -0.5  -0.5   0.00    0.25    0.25
  9.0   9.5   9.5  -0.5   0.0  -0.5   0.25    0.00    0.25
 12.0  12.0  12.0   0.0   0.0   0.0   0.00    0.00    0.00
 14.0  12.5  12.5   1.5   0.0   1.5   2.25    0.00    2.25
  7.0   7.5   8.0  -0.5  -0.5  -1.0   0.25    0.25    1.00

                               Sums:  2.75    0.50    3.75

                                          ΣD² = 7.00

Using equation 4 above, the technical error of measurement for this
data assembly is:

TEM = (ΣD²/2n)^0.5

    = (7.00/(2 × 15))^0.5

    = 0.2333^0.5

    = 0.48

The Percent Technical Error of Measurement

The technical error of measurement may also be expressed as a
percentage of some measure of central tendency for the raw data,
e.g. the mean or, more often, the median of the first series.

Equation 5: The percent technical error of measurement.

%TEM = 100 (TEM/ Median)

Example Calculation (%TEM)

With the median of the first series above being 10, the %TEM is:

%TEM = 100 (TEM/ Median)

= 100 (0.48/10.0)

= 4.8%
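The whole worked example (Table 6 and Equation 5) can be reproduced in a few lines. A sketch in Python using the same five triplicates:

```python
from math import sqrt
from statistics import median
from itertools import combinations

# Triplicate measures for the five subjects of Table 6.
data = [
    (10.0, 10.0, 10.5),
    (9.0, 9.5, 9.5),
    (12.0, 12.0, 12.0),
    (14.0, 12.5, 12.5),
    (7.0, 7.5, 8.0),
]

# All pairwise comparisons (A-B, A-C, B-C) for every subject: 15 in all.
sq_diffs = [(x - y) ** 2 for row in data for x, y in combinations(row, 2)]
tem = sqrt(sum(sq_diffs) / (2 * len(sq_diffs)))

# %TEM relative to the median of the first series, as in Equation 5.
first_series_median = median(row[0] for row in data)
pct_tem = 100 * tem / first_series_median

print(round(tem, 2), round(pct_tem, 1))   # 0.48 4.8
```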
Spreadsheet Calculation of TEM for Individuals

Table 7 is a copy of a Microsoft Excel spreadsheet technical error and
percent technical error calculator for individual subjects.

Table 7: Microsoft Excel spreadsheet example for calculation of
technical error of measurement for individual subjects, numbers
1 to 5.

.   A        B     C     D     E     F
1   Subject  A     B     C     TEM   %TEM
2   1        10.0  10.0  10.5  0.29  2.9
3   2         9.0   9.5   9.5  0.29  3.2
4   3        12.0  12.0  12.0  0.00  0.0
5   4        14.0  12.5  12.5  0.87  6.2
6   5         7.0   7.5   8.0  0.50  7.1

where cell E2 contains:

=(((B2-C2)^2+(C2-D2)^2+(B2-D2)^2)/(2*COUNT(B2:D2)))^0.5

and cell F2 contains:

=100*(E2/B2)
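The same spreadsheet logic translates directly to code. A sketch reproducing columns E and F for the five subjects (note that, like cell F2, the %TEM here is taken relative to the first trial rather than the median):

```python
from math import sqrt

def individual_tem(a, b, c):
    """Mirror of cell E2: TEM across one subject's three trials."""
    return sqrt(((a - b) ** 2 + (b - c) ** 2 + (a - c) ** 2) / (2 * 3))

subjects = [(10.0, 10.0, 10.5), (9.0, 9.5, 9.5), (12.0, 12.0, 12.0),
            (14.0, 12.5, 12.5), (7.0, 7.5, 8.0)]

for a, b, c in subjects:
    tem = individual_tem(a, b, c)
    pct = 100 * tem / a              # mirror of cell F2
    print(round(tem, 2), round(pct, 1))
```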

It is important to realize that the technical error of measurement only
recognizes systematic errors that occur between or among repeated
measurements of a variable. If all replicated values have the same
systematic error, the technical error of measurement will not suggest
that error is present.
For example, in measuring biacromiale breadths according to the
definition in this treatise, the subject stands erect. In the IBP protocol
the shoulders are rounded slightly to achieve maximum breadth by
bringing the acromial processes forward. With impeccable
anthropometry the IBP measurements are precise but inaccurate by
our standards. Conversely, our measures are similarly precise and
inaccurate by IBP standards.

Adherence to defined specification is essential, and the use of a criterion
anthropometrist propitious. It is incumbent upon all investigators to
cite a reference for each of their measurements, or to define their
measurements in their reports. This definition should include:

evidence of the precision obtained in the data assembly, and;

an indication of how the new techniques differ from generally accepted
techniques.

Standards of Excellence
New instruments for lengths and breadths are now available, and they
appear to reduce the technical error. However, controlled experiments
are difficult to conduct since "experience" with one instrument may
bias the data obtained by another. The arrow in figure 5, below, calls
attention to the relatively high %TEM for hand breadth. A grip breadth
technique is proposed in this treatise since hands are malleable,
especially in children. Skinfold technical errors are relatively small in
absolute terms but are magnified when expressed as percentage
values.

Figure 5: Percent technical error of measurement obtained from
replicated values by different measurers.
Overview

It is immodest and non-verifiable to claim to be the best
anthropometrist in the world. However, this should perhaps be a
personal goal for each measurer on every measurement occasion.
With technique standardized and instruments defined and calibrated,
the TEM and %TEM become our guides for assessing the extent to
which we achieve this goal.

As Norton and Olds (1996) point out, there is often a relationship
between the size of the TEM and the mean value of the variable being
measured. Larger skinfolds, for example, are likely to produce larger
TEM values. This is the theoretical justification for using the %TEM
instead, as a measure of the precision of the data.

However, the %TEM can sometimes produce misleading descriptions of
error. When the variable being measured is very small (e.g. skinfolds
on triathletes, or bone widths on small children) relatively precise
measurements and small TEM values may produce unfairly large
%TEM scores, as is shown in the data in figure 6 below, from Carr
(1994).

Figure 6: TEM and %TEM from anthropometry on 100 athletes (Carr
(1994)).
While skinfold measurements seem to have a reputation for being
imprecise, it must be remembered that the measurements are being
made in millimeters, not the centimeters found with most other
anthropometric variables. The TEM shown for the skinfolds in figure 5
was converted from millimeters to centimeters, to facilitate
comparison with the other variables. It can be seen that the skinfold
measurements here were actually more precise, in an absolute sense,
than all other variables. When taken as a percentage of the mean
value of the skinfolds, however, the %TEM appears considerably less
precise. This is even more pronounced than usual in figure 5, since
the skinfolds belonged to fairly elite athletes and were smaller than
normal to begin with. Which should be reported, TEM or %TEM? To
some extent this becomes a philosophical question, and the answer is
related to the specifics of their use.
What should the standards be for TEM and %TEM? Again, this may
vary according to the situation. In a longitudinal study looking at
changes in a variable over time, the TEM values should be small
enough that they will not obscure real change. This is even more true
for clinical or professional uses, where the data on individuals is more
important than group averages. If the TEM is large, any reported
change may be due to error rather than real biological change. Norton
and Olds (1996), p 404, show the %TEM values required by 68% and
95% confidence intervals for showing real change in variables taken
from 55 different studies. They also show (Table 7, page 409)
proposed target inter- and intra-tester %TEMs for the I.S.A.K.-based
Australian accreditation scheme for anthropometrists. For example, an
experienced level 3 anthropometrist should be able to attain intra-
tester %TEMs of 5% for skinfolds (using Harpenden skinfold calipers,
possibly 7% using Slim Guide calipers) and 1% for other variables.

Examples in Comprehensive Protocols

We have reported technical error of measurement standards on
previous occasions (Carr et al (1989); Carr (1994); Ross et al (1994);
Ross (1996a); Ross et al (1996/97)). Here we show comparisons
(figure 7, below) between two studies:

919 athletes at the World Championships of Swimming, Diving,
Synchronized Swimming and Water Polo (i.e. the KASP Project,
reported by Carter and Ackland (1994));

100 national, provincial, or professional athletes in a variety of
sports in Vancouver, Seattle and Milwaukee (Carr (1994)).

These comparisons show that well-trained anthropometrists can attain
similar levels of measurement quality, even when the source of their
training is different. (Note: KASP values are based on replications
from different measurers (objectivity, or inter-observer reliability);
Carr values are intra-observer reliability estimates.)

Figure 7: Comparison of %TEM for lengths (KASP - Carter and
Ackland (1994); and Carr (1994)).

A Mathematical Simulation: Scale Increments, Rounding
Error, and the TEM
Ostensibly, as noted above in figure 7, the KASP data reported lower
TEMs than the Carr data for skinfold measurements. An explanation of
this difference may come partly from the finer gradations of scale
provided by the Harpenden skinfold caliper. This is demonstrated in
hypothetical data on 200 subjects (paper in progress - the authors),
where numbers were randomly generated to eight decimal places and
then rounded to the nearest 0.1 or 0.5, then assembled
for comparison of their squared deviations between measurement
occasions (figure 8). The results in the three dimensional graph show
that the sums of squared deviations (among repeat measures) are
increased by the coarseness of the rounding.

Figure 8: Rounding Effect: Squared Differences. The coarser the
gradations, the larger the effect on reported error.
From a purely mathematical point of view, with all other factors
remaining constant, the rounding effect will always yield lower
technical error of measurement values for 0.1 mm than for 0.5 mm.
This is in keeping with the direction of the empirical differences shown
in the TEM for Carr (Slim Guide) and KASP (Harpenden) in figure 7,
above.
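The simulation described above can be approximated in a few lines. This sketch is our own illustration, not the original simulation: it assumes a small random reading error (standard deviation 0.05 units) on each of two readings per subject, records each reading on scales graduated in 0.1 and 0.5 units, and compares the resulting TEMs.

```python
import random
from math import sqrt

random.seed(1)

def tem(pairs):
    """Technical error of measurement for paired readings."""
    return sqrt(sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs)))

def simulate(increment, n=2000, noise=0.05):
    """TEM of duplicate readings rounded to the nearest scale increment."""
    pairs = []
    for _ in range(n):
        true = random.uniform(5.0, 15.0)
        # two readings of the same true value, each with small random error,
        # then recorded on a scale graduated in `increment` units
        reads = [round((true + random.gauss(0.0, noise)) / increment) * increment
                 for _ in range(2)]
        pairs.append((reads[0], reads[1]))
    return tem(pairs)

fine, coarse = simulate(0.1), simulate(0.5)
print(fine < coarse)   # True: coarser gradations inflate the reported TEM
```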

Theoretically, this relates to the previous discussion regarding error in
measurement, and the fact that with continuous variables, some
degree of error is always present, and is dependent upon the scale
characteristics of the measurement instrument. In practical terms, the
dynamic effects of the Slim Guide and Harpenden calipers are similar
(Schmidt and Carter (1990)). Based on this simulation, in stating
standards for anthropometrists' attainment, Slim Guide TEM standards
may be approximately 20% larger than the Harpenden caliper for the
same level of technical ability in measurement.
Figure 9: Rounding Effect on Technical Error of Measurement
(TEM). The coarser the gradations, the larger the effect on
error.

Combining Errors

Theory of Combining Errors

The theory of error consists of a body of knowledge which includes an
appreciation of how error arises, what types of error contaminate "true
measures", and the elaboration of statistical theory, as discussed
above (see also Beers (1957); Sokal and Rohlf (1981); Pederson and
Gore (1996)). The theory of reduction of error by combining
replications is fundamental in understanding the basis for high quality
anthropometry.

Reduced to its simplest terms, if we have an original and a replicated
measurement, the total error will be the sum of the two constituent
errors (e1 and e2) plus the covariant factor between e1 and e2. This
may be represented diagrammatically as shown in the iconologic
diagram in figure 10 below. If the errors are dependent, the
covariance (r) = 1, and the total error is the simple sum of the two
errors, as is shown on the right. If the errors are independent, the
total error is the square root of the sum of the squared errors. (You
may recognize this as Pythagoras' theorem for calculating the
hypotenuse of a right-angled triangle.)

Figure 10: Combining uncorrelated & correlated errors.

In anthropometry, dependent errors are systematic and additive. For
the most part, replicated measurements by experienced
anthropometrists are independent. Indeed, elimination of systematic
error is a major objective for assessing quality of measurement. The
assumption of independence of error can be tested indirectly as
reported by Ross and Eiben (1991). They showed that the technical
error of measurement of the sum of skinfolds approximated theoretical
expectancy for combining errors when the covariance factor was zero.

Equation 6: Total error is the square root of the sum of the error
variances.

Total error = (e1² + e2² + ... + en²)^0.5
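For two error sources, the dependent and independent cases described above are both special cases of total error = (e1² + e2² + 2·r·e1·e2)^0.5, where r is the correlation between the error sources. A minimal numeric sketch:

```python
from math import sqrt

def total_error(e1, e2, r=0.0):
    """Combined error of two sources with correlation r between them."""
    return sqrt(e1 ** 2 + e2 ** 2 + 2 * r * e1 * e2)

print(total_error(3.0, 4.0, r=0.0))  # 5.0: independent errors add in quadrature
print(total_error(3.0, 4.0, r=1.0))  # 7.0: fully dependent errors add linearly
```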

With only random error to deal with, precision is predictably enhanced
by using any measure of central tendency for replicated measures. As
shown in figure 11, a reduction in total error is a function of the
number of replications when a measure of central tendency is used as
the best estimate.

Replication and Percent Reduction in Error

Figure 11: The reduction in error using a measure of central
tendency is a function of the number of replications.
In practical terms, the use of the mean of three independent measures
(original and two replications) reduces the total error by 42 percent.
For this reason, careful investigators concerned with individual
assessment routinely use a triple series of measurements, selecting
the median as the best estimate of the true value. The median has
four advantages over the mean:
it resists gross error;

it is a real value;

it is easy to calculate, and;

there is no rounding off error.
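The 42 percent figure quoted above follows from the random error of a mean of n independent replicates being e/√n, so the proportional reduction is 1 - 1/√n. A quick check:

```python
from math import sqrt

def percent_reduction(n):
    """Error reduction when the mean of n independent replicates is used."""
    return 100 * (1 - 1 / sqrt(n))

for n in (2, 3, 4, 5):
    print(n, round(percent_reduction(n)))   # 3 replicates give ~42 percent
```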

The robustness of the median to resist imposed error was
demonstrated in a thesis by Ward (1988), reported by Ward and
Anderson (1998).
Practical Decisions Using TEM

In clinical use, or whenever individual physique status or the
monitoring of change with respect to growth, aging, treatment, diet
or exercise is a focus of interest, take three complete series of
measurements whenever possible and use medians as the best
estimates of true value. When time is limited, two trials may be
taken, with a third being taken only when the first two disagree by
some set tolerance. The mean of the closest two is then used as the
best estimate. In practice, however, it is almost as easy to take
three trials as to calculate tolerances of two to determine if a third
measurement is needed. We have never met an investigator who
regretted taking three measurements -- quite the opposite.

In building norms, or making group comparisons, single
measurements are often adequate since we are concerned with the
sample rather than with individual characteristics. Even so, a
sub-sample with replicated measures is necessary to calculate technical
errors of measurement for use in quality control in the data assembly.

Data Management

An outline for a comprehensive system for data management is shown
below. This is a seven-tiered system, A to G, with the A and B levels
linked by the acronym "IDEA". The overall plan indicates the steps for
efficient data assembly, resolution and report. A current TeP project is
to develop comprehensive software to integrate these steps.

ABC's - AMPlan Levels for Data Assembly, Resolution & Report

Assembly - data acquisition in clinical and field tests - proforma
(A to B: I-D-E-A)

Identity - sample & subject classification - IBP recommendations

Definition - instruments, landmarks, technique - basal references

Error - precision, accuracy, cleaning - methods

Acceptance - standards and quality control - protocol

Basic - acceptable data base from assembly archive - standards

Category - call up of individuals, groups, analytic tools to show status and monitor
change of raw score - spreadsheets

Derived - standard scores, indices, proportionality values, somatotype component
ratings - computational software

Expanded - body composition, fractional mass estimates based on defined
assumptions and premises - algorithms

Functional - anthropometric concomitants of biomechanical and physiological events
based on theoretical and empirical expectancy - analytic systems

General - display and report for scientific and professional interpretation for groups
and individuals - display & report

Troika Training

One cannot develop expertise in measurement by training independently
of others. We recommend training units in groups of three. This
approach is, by happenstance, termed 'troika' training. A troika, from
the Russian, means a three-horse rig for a cart or a sleigh. It was
used to give the idea of pulling together, emphasizing teamwork. It
includes the idea of three interchangeable roles in a measurement
session; subject, measurer, recorder. Thus the training unit is entirely
self-contained, may replicate measurements on one another, and
compare random and systematic error effects as explained above.
This is best done under the tutelage of a criterion anthropometrist,
ideally one certified by ISAK or with demonstrated competence in the
specified measurement techniques.

The advantage of "troika training" is that it is relatively easy to
organize and it is self-contained, with the partners taking turns serving
as subject, measurer, recorder or computer operator.

Troika training has three requisites:

strict adherence to an explicit definition of landmarks and
techniques;

adequate practice in measurement, evaluated by the calculation of
technical errors of measurement from replicated measures to assess
intra- and inter-observer reliability;

over-all scrutiny by a criterion anthropometrist, with comparison
and correction of techniques where systematic error is suspect.

A standard proforma that facilitates troika training and data assembly
is shown in the following chapter, Proforma. An extended, bi-lateral
proforma is given in the chapter Proforma Expanded.

Cited References

American Psychological Association (1985) Joint Standards for Educational and
Psychological Tests. Washington, DC: American Psychological Association.

Beers Y (1957) Introduction to the Theory of Error. Reading: Addison-Wesley.

Brinkworth BJ (1968) An Introduction to Experimentation. London: The English
Universities Press Ltd.

Carr RV (1994) Anthropometric Modelling of the Human Vertical Jump. Doctoral
dissertation, Simon Fraser University, Burnaby, BC, Canada.

Carr RV, Rempel RD and Ross WD (1989) Sitting height: an analysis of five
measurement techniques. Am J Phys Anthropol 79: 339-344.

Carr RV, Blade L, Rempel R and Ross WD (1993) Technical note: on the
measurement of direct vs. projected anthropometric lengths. Am J Phys
Anthropol 90: 515-517.

Carter JEL (1975) The Heath-Carter Somatotype Method (2nd ed). San Diego:
San Diego State University.

Carter JEL (1981) Somatotypes of female athletes. In: J Borms, M Hebbelinck
and A Venerando (eds): The Female Athlete: A Socio-Psychological and
Kinanthropometric Approach. Medicine and Sport. Basel: Karger, 15: 85-116.

Carter JEL (1985) Morphological factors limiting human performance. In: DH
Clarke and HM Eckert (eds): Limits of Human Performance. Champaign, IL:
Human Kinetics, American Academy of Physical Education Papers, 18: 106-

Dahlberg G (1940) Statistical Methods for Medical and Biological Students.
London: George, Allen & Unwin.

Johnston FE, Hamill PVV and Lemeshow S (1972) Skinfold thickness of children
6-11 years. United States DHEW Pub no (HSM) 73, 1602. Washington, DC:
US Government Printing Office.

Knapp TR (1992) Notes and comments. Technical error of measurement: A
methodological critique. Am J Phys Anthropol 87: 235-236.

Martin AD (1984) An anatomical basis for assessing human body composition.
Doctoral dissertation, Simon Fraser University, Burnaby, BC, Canada.

Martin AD, Drinkwater DT, Clarys JP and Ross WD (1986) The inconsistency of
the fat free mass: a reappraisal with implications for densitometry. In: T Reilly, J
Watson and J Borms (eds): Kinanthropometry III. London: E & FN Spon, 92-97.

Mueller WH and Martorell R (1988) Reliability and accuracy of measurement. In:
TG Lohman, AF Roche and R Martorell (eds): Anthropometric Standardization
Reference Manual. Champaign, IL: Human Kinetics Books, 83-86.

Norton K and Olds T (1996) Anthropometrica: A textbook of body measurement
for sports and health courses. Sydney, Australia: University of New South Wales
Press.

Pederson D and Gore C (1996) Anthropometry measurement error. In: K Norton
and T Olds (eds): Anthropometrica. Sydney: University of New South Wales
Press, Chapter 3, 77-96.

Peters TJ and Austin NK (1983) A Passion for Excellence: The Leadership
Difference. New York: Random House.

Ross WD and Ward R (1984a) Proportionality of Olympic athletes. In: JEL Carter
(ed): Physical Structure of Olympic Athletes. Med Sport Sci, Basel: Karger (18):
110-143.

Ross WD, DeRose EH and Ward R (1988b) Anthropometry applied to sport
medicine. In: A Dirix, HG Knuttgen and K Tittel (eds): The Olympic Book of Sports
Medicine. London: Blackwell Scientific Publications, 233-265.

Ross WD and Eiben OG (1991) The sum of skinfolds and the O-Scale System
for physique assessment rating of adiposity. Anthrop Közl, 33: 299-303.

Ross WD and Marfell-Jones MJ (1991b) Kinanthropometry. In: JD MacDougall,
HA Wenger and HJ Green (eds): Physiological Testing of the High-Performance
Athlete (2nd ed). Champaign, IL: Human Kinetics, 223-308.

Ross WD, Kerr DA, Carter JEL, Ackland TR and Bach TM (1994) Appendix B -
Anthropometric Techniques: Precision and Accuracy. In: JEL Carter and TR
Ackland (eds): Kinanthropometry in Aquatic Sports (5). Champaign, IL: Human
Kinetics, 158-169.

Ross WD (1996a) Anthropometry in assessing physique status and monitoring
change. In: Oded Bar-Or (ed): The Child and Adolescent Athlete, Vol. VI. London:
Blackwell Scientific, 538-572.

Ross WD (1997) On combining samples that differ allometrically with size. Rev
Bras Med Esporte 3 (4): 95-100.

Ross WD, Carr RV and Carter JEL (1999) Anthropometry Illustrated. (An
interactive CD-ROM text and learning system.) Surrey, BC: Turnpike Electronic
Publications Inc.

Schmidt PK and Carter JEL (1990) Static and dynamic differences among five
types of skinfold calipers. Hum Biol 62 (3) (June): 369-388.

Schutz RW (1998) An informal survey of reliabilities for various tests taken from
the disciplines involved in kinesiology / human kinetics. Personal correspondence
RVC.

Sokal RR and Rohlf FJ (1981) Biometry: The Principles and Practice of Statistics
in Biological Research (2nd ed). New York: WH Freeman, 571-572.

Ward R (1989) The O-Scale System for Human Physique Assessment. Doctoral
dissertation, Simon Fraser University, Burnaby, BC, Canada.

Ward R and Anderson GS (1998) Resilience of anthropometric data assembly
strategies to imposed error. Sports Science 16 (8): 755-759.

Weldon KL (1986) Statistics: A Conceptual Approach. Englewood Cliffs, NJ:
Prentice-Hall Inc.
