Cpt2. Some Basic Stat Concept
POPULATIONS
The true population of a particular ecosystem can be determined only by carrying out a
census of all living organisms within that ecosystem. This applies equally whether one is
concerned with numbers of people in a town, state or country or with numbers of microbes
in a batch of a food commodity or product. Whilst, in the former case, it is possible at least
theoretically to determine the human population in a non-destructive manner, the same
does not apply to estimates of microbial populations.
When a survey is carried out on people living, for instance, in a single town or village,
it would not be unexpected that the number of residents differs between different houses;
nor that there are differences in ethnicity, age, sex, health and well-being, personal likes and
dislikes, etc. Similarly, there will be both quantitative and qualitative differences in popula-
tion statistics between different towns and villages, different parts of a country and different
countries.
A similar situation pertains when one looks at the microbial populations of a food. The
microbial association of foodstuffs differs according to diverse intrinsic and extrinsic fac-
tors, especially the acidity and water activity, and the extent of any processing effects. Thus
the primary microbial population of acid foods will generally consist of yeasts and moulds,
whereas the primary population of raw meat and other protein-rich foodstuffs will consist
largely of Gram-negative non-fermentative bacteria, with smaller populations of other organisms
(Mossel, 1982). In enumerating microbes, it is essential first to define the population to
be counted. For instance, does one need to assess the total population, that is living and
dead organisms, or only the viable population; if the latter, is one concerned only with spe-
cific groups of organisms, for example aerobes, anaerobes, psychrotrophs and psychrophiles,
mesophiles or thermophiles? Even when such questions have been answered, it would still be
impossible to determine the true ecological population of a particular lot of food, since to
do so would require testing of all the food. Such a task would be both technically and eco-
nomically impossible.
An individual lot or batch consists of a bulk quantity of food that has been processed
under essentially identical conditions on a single occasion. The food may be stored and
distributed in bulk or as pre-packaged units each containing one or more individual units
of product (e.g. a single meat pie or a pack of frozen peas). Assuming that the processing
has been carried out under uniform conditions, then, theoretically, the microbial population
of each unit should be typical of the population of the whole lot. In practice, this will not
always be the case. For instance, high levels of microbial contamination may be associated
only with specific parts of a lot due to some processing defect. In addition, estimates of
microbial populations will be affected by the choice of test regime that is used.
It is not feasible to determine the levels and types of aerobic and anaerobic organisms, or of
acidophilic and non-acidophilic organisms, or other distinct classes of microorganism using a
single test. Thus when a microbiological examination is carried out, the types of microorgan-
isms that are detected will be defined in part by the test protocol. All such constraints therefore
provide a biased estimate of the microbial population of the lot. Hence, sampling of either
bulk or pre-packaged units of product merely provides a sample of the types and numbers of
microorganisms that make up the population of the lot and those population samples will
themselves be further sampled by our choice of examination protocol. In order to ensure that
a series of samples drawn from a lot properly reflect the diversity of types and numbers of
organisms associated with the product it is essential that the primary samples should be drawn
in a random manner, either from a bulk or as individual packaged units of the foodstuff.
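As a minimal sketch of random selection of primary samples, assume the packaged units in a lot can be numbered 1 to N. The lot size, the number of samples and the function name below are hypothetical and chosen only for illustration:

```python
import random


def draw_random_units(lot_size, n_samples, seed=None):
    """Return the unit numbers of n_samples units drawn at random,
    without replacement, from a lot numbered 1..lot_size."""
    rng = random.Random(seed)  # a seed is used here only to make the sketch repeatable
    return sorted(rng.sample(range(1, lot_size + 1), n_samples))


# Hypothetical lot of 500 packaged units from which 5 primary samples are drawn.
selected = draw_random_units(lot_size=500, n_samples=5, seed=1)
print(selected)
```

Each unit has the same chance of selection, which is the property that random sampling is meant to guarantee.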
Analytical chemists frequently draw large primary samples that are blended and resampled
before taking one or more analytical samples; the purpose is to minimize the between-sample
variation in order to determine an average analytical estimate for a particular
analyte. It is not uncommon for several kilograms of material to be taken as a number of
discrete samples that are then combined. Indeed, for some purposes, such multiple sampling
procedures are commonplace. The sampling of foods for microbiological examination can-
not generally be done in this way because of the risks of cross contamination during the
mixing of primary samples.
A population sample (i.e. a unit of product) may itself be subdivided for analytical pur-
poses and it is necessary, therefore, to consider the implications of determining microbial
populations in terms of the number, size and nature of the samples taken. In a few instances
it is possible for the analytical sample to be truly representative of the lot sampled. Liquids,
such as milk, can be sufficiently well mixed that the number of organisms in the analytical
sample is representative of the milk in a bulk storage tank. However, because of problems of
mixing, samples withdrawn from a grain silo, or even from individual sacks of grain, may not
necessarily be truly representative. In such circumstances, deliberate stratification (qv) may
be the only practical way of taking samples. Similar situations obtain when one considers
complex raw material (e.g. animal carcases), or composite food products (e.g. ready-to-cook
frozen meals containing slices of cooked meat, Yorkshire pudding, peas, potato and gravy). It
is necessary to consider also the actual sampling protocol to be used: for instance, in sampling
from a meat or poultry carcase, is the sample to be taken by swabbing, rinsing or excision of
skin? Where on the carcase should the sample be taken? For instance, one area may be more
likely to carry high numbers and types of organism than other areas. Hence, standardisation
of sampling protocols is essential. In situations where a composite food consists of discrete
components, the sampling protocol must reflect the purpose of the test: is a
composite analytical sample required (i.e. one made up from the various ingredients in appropriate
proportions), or should each ingredient be tested separately? These matters are considered
in more detail in Chapter 5.
If a single sample is analysed, the result provides a method-dependent single point estimate
of the population numbers in that sample. Replicate tests on a single sample provide an
improved estimate of population numbers, based on the average of the results, together
with a measure of variability of the estimate for that sample. Similarly, if replicate samples
are tested, the average result provides a better estimate of the number of organisms in the
population based on the inter-sample average and an estimate of the variability between
samples. Thus, we can have greater confidence that the average sample population will
reflect more closely the population in the lot. The standard error of the mean (SEM) pro-
vides an estimate of the extent to which that mean (average) value is reliable. If a sufficient
number of replicate samples is tested then we can derive a frequency distribution for the
counts, such as that shown in Fig. 2.1 (data from Blood, 1974). Note that the distribution
curve has a long left-hand tail and is not symmetrical, probably because the
data were compiled from results obtained in two different production plants. The statistical
aspects of frequency distributions are discussed in Chapter 3.
FIGURE 2.1 Frequency distribution of colony count data determined at 30 °C on beef sausages manufactured in
two factories (modified from Blood, 1974) (reproduced by permission of Leatherhead Food International).
[Histogram not reproduced; vertical axis: % frequency.]
Adding the individual values and dividing by the number of replicate tests provides a simple
arithmetic mean of the values:
$\bar{x} = (x_1 + x_2 + x_3 + \cdots + x_n)/n = \sum_{i=1}^{n} x_i / n$,
where $x_i$ is the value of the ith test and n is the number of tests done. However, it is possible to derive
other forms of average value. For instance, multiplying the individual counts on n samples
and then taking the nth root of the product provides the geometric mean value ($\bar{x}_g$):
$\bar{x}_g = \sqrt[n]{x_1 \times x_2 \times x_3 \times \cdots \times x_n}$
It is simpler to determine the approximate geometric mean by taking logarithms of the original
values ($y = \log_{10} x$), adding the log-transformed values and dividing the sum by n to
obtain the mean log value ($\bar{y}$), which equals the logarithm of the geometric mean, $\log_{10} \bar{x}_g$.
This value is then back-transformed by taking the antilog to obtain an estimate of the geometric mean value:
$\bar{y} = \sum_{i=1}^{n} y_i / n = \sum_{i=1}^{n} \log_{10} x_i / n = \log_{10} \bar{x}_g$
The geometric mean is appropriate for data that conform to a log-normal distribution and
for titres obtained from n-fold dilution series. It is important to understand the difference
between the geometric and the arithmetic mean values since both are used in handling
microbiological data. In terms of microbial colony counts, the log mean count is the log10
of the simple arithmetic mean; by contrast, the mean log-count is the arithmetic average of
the log10-transformed counts that, on back-transformation gives the geometric mean count.
The methods are illustrated in Example 2.1.
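The distinction between the log mean count and the mean log-count can be sketched in a few lines of Python, using the five colony counts quoted in Example 2.1. The variable names are illustrative only. Because the code carries full floating-point precision, its output may differ in the last digit from figures that were rounded at intermediate steps.

```python
import math

counts = [1540, 1360, 1620, 1970, 1420]  # cfu/g, the data of Example 2.1
n = len(counts)

# Arithmetic mean and its logarithm (the "log mean count").
arithmetic_mean = sum(counts) / n
log_mean_count = math.log10(arithmetic_mean)

# Mean of the log10-transformed counts (the "mean log-count") and its
# back-transformation, which is the geometric mean.
log_counts = [math.log10(x) for x in counts]
mean_log_count = sum(log_counts) / n
geometric_mean = 10 ** mean_log_count

# The same geometric mean from its definition: nth root of the product.
geometric_mean_direct = math.prod(counts) ** (1 / n)

print(f"arithmetic mean = {arithmetic_mean:.0f}")
print(f"log mean count  = {log_mean_count:.4f}")
print(f"mean log-count  = {mean_log_count:.4f}")
print(f"geometric mean  = {geometric_mean:.0f}")
```

The geometric mean is never larger than the arithmetic mean, so the mean log-count is never larger than the log mean count.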
A population is described by its parameters: the mean ($\mu$) and the variance ($\sigma^2$). But we
cannot know the values of these parameters except for a finite population (e.g. a set of
pipettes). However, we can obtain estimates of these parameters from the statistics that
describe the sample population in terms of its analytical mean value ($\bar{x}$) and its variance
($s^2$). We can also provide a measure of the likelihood that the same mean result would be
attained if analyses were repeated on a further set of samples from the same lot. Such esti-
mated values are statistics that can be used as estimates of the true population parameters.
Results from replicate analyses of a single sample, and analyses of replicate samples, will
always show some variation that reflects the distribution of microbes in the samples tested,
inadequacies of the sampling technique and technical inaccuracies of the method and the
analyst. The variation can be expressed in several ways.
The statistical range is the simplest way to describe the dispersion of values: it is the
difference between the lowest and the highest estimates. For example, in Example 2.1, the
colony count range is 610 (i.e. 1970 − 1360). The statistical range is often used in Statistical
Process Control (Chapter 12), but because it depends solely on the values for the extreme
counts and takes no account of the distribution of values between the two extremes, its
usefulness is severely limited.
The population variance is derived from the mean of the squares of the deviations, viz.
$\sigma^2 = \sum (x - \mu)^2 / n$, where x is an individual result, $\mu$ the population mean value, n the number
in the population and $\sum$ indicates 'sum of'. Each individual result (x) differs from the population
mean by a value $(x - \mu)$, which is referred to statistically as the deviation. But as
the value of $\mu$ is unknown, the sample mean ($\bar{x}$) is used as an estimate of the population
mean. The sample variance ($s^2$) provides an estimate of the population variance ($\sigma^2$) and is
determined as a weighted mean of the squares of the deviations, weighting being introduced
through the application of the concept of degrees of freedom, which assumes that of n observations,
only (n − 1) are available since one observation has been used already in determining the
mean value. The unbiased estimate ($s^2$) of the population variance ($\sigma^2$) is thus derived from:
$s^2 = \dfrac{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n-1)}$
The alternative form of this equation, $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n-1)$, should not normally be
used in practical calculation of the sample variance since it is based on the squares of the
deviations from the mean value. The calculated mean is usually only a rounded approximation of
a non-terminating decimal value, so each deviation carries a small rounding error; since the
deviations are squared and then summed, these discrepancies are additive and the derived
variance may be inaccurate.
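The two forms of the variance equation are algebraically identical when the exact mean is used, so any difference between them in practice comes only from rounding. A short derivation, using only the symbols defined above and the identity $\sum x_i = n\bar{x}$, is sketched here:

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \\[4pt]
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}
  &= \frac{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n-1)}.
\end{aligned}
```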
The standard deviation (s) around the sample mean is the square root of the variance
($s = \sqrt{s^2}$). The coefficient of variation (CV), often referred to as the relative standard
deviation (RSD), is the standard deviation expressed as a percentage of the mean:
$\%CV = \%RSD = (s/\bar{x}) \times 100$.
The term standard error is often used conventionally to mean the standard devia-
tion (described above) and is a statistical measure of the deviation that estimates would
be expected to show in testing repeat samples from the same population. In other words, it
shows how much variation might be expected to occur merely by chance in the character-
istics of samples drawn equally randomly from a single population. However, the SEM is a
measure of the deviation in the mean value which would be expected if repeated analyses
were undertaken on the same lot of product. The SEM is estimated from the square root
of the variance divided by the number of observations used, that is, $SEM = \sqrt{s^2/n} = s/\sqrt{n}$.
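The standard deviation, coefficient of variation and SEM defined above can be computed as follows. This is a minimal sketch using Python's standard `statistics` module and the five colony counts of Example 2.1; the variable names are illustrative.

```python
import math
import statistics

counts = [1540, 1360, 1620, 1970, 1420]  # cfu/g, the data of Example 2.1
n = len(counts)

mean = statistics.mean(counts)
s = statistics.stdev(counts)      # sample standard deviation, n - 1 degrees of freedom
cv_percent = 100 * s / mean       # %CV, also called %RSD
sem = s / math.sqrt(n)            # standard error of the mean

print(f"s = {s:.1f}, %CV = {cv_percent:.1f}, SEM = {sem:.1f}")
```

Note that the SEM shrinks as the number of observations grows, whereas the standard deviation does not.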
We should pause at this point to consider an important statistical theorem, which underlies
many statistical procedures. The central limit theorem is a statement about the sampling
distribution of the mean values from a defined population. It describes the characteristics of
the distribution of mean values that would be obtained from tests on an infinite number of
independent random samples drawn from that population. The theorem states that, for a distribution
with a population mean $\mu$ and a variance $\sigma^2$, the sampling distribution of the average tends to
be Normal, even when the distribution from which the average is computed is non-Normal.
The limiting Normal distribution has the same mean as the parent distribution and its variance
is equal to the variance of the parent divided by the sample size ($\sigma^2/N$).
Individual results from a finite number of independent, randomly drawn samples from the
same population are distributed around the average (mean) value so that the sum of the deviations
of the values above the average equals the sum of the deviations of the values below it. If
sufficient independent random samples are tested then we can derive a statistical distribution
that describes the occurrence of the population (Chapter 3). Now, no matter what form the
actual distribution takes, the distribution of the average (mean) result in repeated tests always
approaches a Normal distribution when sufficient trials are undertaken. In this situation, the
number of trials relates not to the number of samples per se but to the number of replicate trials.
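The central limit theorem can be illustrated by simulation. The sketch below assumes a strongly skewed parent distribution (an exponential distribution with mean 1 and variance 1, chosen only for illustration) and arbitrary values for the sample size and the number of replicate trials. The mean of the sample means should lie close to the parent mean and their variance close to the parent variance divided by the sample size.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so that the sketch is repeatable


def sample_mean(n):
    """Mean of n independent draws from an exponential parent (mean 1, variance 1)."""
    return statistics.mean(rng.expovariate(1.0) for _ in range(n))


n = 30         # observations per trial (the sample size N of the theorem)
trials = 5000  # number of replicate trials

means = [sample_mean(n) for _ in range(trials)]
mean_of_means = statistics.mean(means)
var_of_means = statistics.variance(means)

print(f"mean of sample means     = {mean_of_means:.3f} (parent mean = 1)")
print(f"variance of sample means = {var_of_means:.4f} (parent variance / n = {1 / n:.4f})")
```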
EXAMPLE 2.1
Assume that we wish to determine the statistics that describe a series of replicate colony
counts on n samples, represented by $x_1, x_2, x_3, \ldots, x_n$, for which the actual values are
1540, 1360, 1620, 1970, 1420 as colony forming units (cfu)/g.
The range of colony counts provides a measure of the extent of overall deviation
between the largest and the smallest data values and is determined by subtracting the
lowest count from the highest count; for the example data the range is 1970 − 1360 = 610.
The median colony count is the middle value (in an odd-numbered set of values) or the
average of the two middle values in an even-numbered set of values; for this sequence of
counts the median value of 1360, 1420, 1540, 1620, 1970 is 1540.
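The range and median can be checked with a few lines of Python. The four-value list is a hypothetical subset of the data, included only to show the even-numbered case.

```python
import statistics

counts = [1540, 1360, 1620, 1970, 1420]  # cfu/g

count_range = max(counts) - min(counts)
median_count = statistics.median(counts)  # middle value of the sorted counts

# Hypothetical even-numbered set: the median is the average of the two middle values.
median_even = statistics.median([1360, 1420, 1540, 1620])

print(count_range, median_count, median_even)
```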
The arithmetic average (mean) colony count is the sum of the individual values divided
by the number of values, that is
$\bar{x} = (x_1 + x_2 + x_3 + \cdots + x_n)/n = \sum_{i=1}^{n} x_i / n$,
where $\bar{x}$ = mean value and $\sum$ means 'sum of'; for our data the mean count
$= \sum x / n = (1540 + 1360 + 1620 + 1970 + 1420)/5 = 1582$.
The geometric mean colony count is the nth root of the product obtained by multiplying
together each value of x. Hence, the geometric mean count $= \sqrt[n]{x_1 \times x_2 \times x_3 \times \cdots \times x_n}$.
Alternatively, we can transform the x values by deriving their logarithms so that
$y = \log_{10} x$; then the geometric mean is the antilog of the sum of y divided by n:
$= \mathrm{antilog}\left(\sum_{i=1}^{n} \log_{10} x_i / n\right) = \mathrm{antilog}\left(\sum_{i=1}^{n} y_i / n\right)$.
For our data the geometric mean colony count $= \mathrm{antilog}(\sum \log_{10} x / n)$
$= \mathrm{antilog}[(\log 1540 + \log 1360 + \log 1620 + \log 1970 + \log 1420)/5]$
$= \mathrm{antilog}(15.9773/5) = \mathrm{antilog}(3.1955) = 1569$
The sample variance ($s^2$) is the sum of the squares of the differences between the values
for x and the mean value ($\bar{x}$), divided by the degrees of freedom of the data set (i.e. n − 1).
(One value of n was used in determining the mean value, hence there are only n − 1 degrees
of freedom (df).) Thus
$s^2 = \dfrac{n \sum x^2 - \left(\sum x\right)^2}{n(n-1)} = \dfrac{\sum x^2 - \left(\sum x\right)^2 / n}{n-1}$
$s^2 = \dfrac{(1540^2 + 1360^2 + 1620^2 + 1970^2 + 1420^2) - (1540 + 1360 + 1620 + 1970 + 1420)^2 / 5}{5-1}$
$= \dfrac{12{,}742{,}900 - 12{,}513{,}620}{4} = \dfrac{229{,}280}{4} = 57{,}320$
An alternative form of the equation is:
$s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n-1)$
$s^2 = \dfrac{(1540-1582)^2 + (1360-1582)^2 + (1620-1582)^2 + (1970-1582)^2 + (1420-1582)^2}{5-1}$
$= \dfrac{(-42)^2 + (-222)^2 + 38^2 + 388^2 + (-162)^2}{4}$
$= \dfrac{1764 + 49{,}284 + 1444 + 150{,}544 + 26{,}244}{4}$
$= \dfrac{229{,}280}{4} = 57{,}320$
Note that in this example, where the mean value was an exact whole number, both methods gave
the same result for the variance. However, where the mean value is a non-terminating decimal
that has to be rounded, rounding errors can cause serious inaccuracies in the variance calculation.
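The effect of a rounded mean can be sketched in Python using exact rational arithmetic. The function names and the three-value data set `toy` are hypothetical and serve only as an illustration; the colony counts are those of this example.

```python
from fractions import Fraction


def variance_from_sums(values):
    """s^2 = [n * sum(x^2) - (sum x)^2] / [n(n - 1)], computed exactly."""
    n = len(values)
    return Fraction(n * sum(x * x for x in values) - sum(values) ** 2, n * (n - 1))


def variance_from_deviations(values, mean):
    """s^2 = sum((x - mean)^2) / (n - 1), using whatever mean is supplied."""
    n = len(values)
    return sum((x - mean) ** 2 for x in values) / (n - 1)


counts = [1540, 1360, 1620, 1970, 1420]
exact = variance_from_sums(counts)
via_deviations = variance_from_deviations(counts, Fraction(sum(counts), len(counts)))

# Hypothetical data whose mean (34/3 = 11.333...) does not terminate.
toy = [10, 11, 13]
toy_exact = variance_from_sums(toy)
toy_rounded = variance_from_deviations(toy, Fraction("11.3"))  # mean rounded to 1 d.p.

print(exact, via_deviations)
print(float(toy_exact), float(toy_rounded))
```

With the exact mean the two formulae agree; with the mean rounded to one decimal place the deviation formula gives a slightly different variance for the hypothetical data.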
The standard deviation (s) around the mean is the square root of the variance and is
given by: $s = \sqrt{s^2} = \sqrt{57{,}320} = 239.4$.
Thence the Relative Standard Deviation (RSD), which is the ratio between the standard
deviation and the mean value expressed as a percentage, is given by $100 \times 239.4/1582 = 15.1\%$.
The variance of the log10-transformed values is derived similarly using the transformed
values, that is, $y = \log_{10} x$, then:
The alternative method, with a mean log-count of 3.1955, gives the variance of y as:
Note the small difference in the variance estimates determined by the two alternative
methods.
The SD of the mean log-count is $\sqrt{0.00399} = 0.0631665 \approx 0.0632$ and the RSD of the
mean log-count is $(0.0632 \times 100)/3.1955 = 1.98\%$.
The reverse transformation of the mean log-count is done by taking the antilog of $\bar{y}$:
$\bar{x}_g = 10^{\bar{y}} = 10^{3.1955} = 1569$, which is the geometric mean. But this is not an
accurate estimate of the arithmetic mean $\bar{x}$.
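The relationship between the log mean count ($\log_{10} \bar{x}$) and the mean log-count ($\bar{y}$) is given
by the formula:
$\log_{10} \bar{x} = \bar{y} + \dfrac{\ln(10)\, s^2}{2} = \bar{y} + \dfrac{2.3026\, s^2}{2}$
where $s^2$ is the variance of the log10-transformed counts.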
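As a numerical check, the sketch below assumes the usual log-normal form of this relationship, $\log_{10} \bar{x} \approx \bar{y} + \ln(10)\, s^2 / 2$, with $s^2$ taken as the variance of the log10-transformed counts. It compares the approximation with the logarithm of the arithmetic mean computed directly. Variable names are illustrative, and the values are computed at full precision, so they may differ slightly from figures rounded at intermediate steps.

```python
import math
import statistics

counts = [1540, 1360, 1620, 1970, 1420]  # cfu/g
log_counts = [math.log10(x) for x in counts]

mean_log = statistics.mean(log_counts)     # mean log-count
var_log = statistics.variance(log_counts)  # variance of the log-counts (n - 1)

# Approximate log mean count, assuming log-normally distributed counts.
approx_log_mean = mean_log + math.log(10) * var_log / 2

# Log of the arithmetic mean computed directly, for comparison.
actual_log_mean = math.log10(statistics.mean(counts))

print(f"mean log-count        = {mean_log:.4f}")
print(f"approx log mean count = {approx_log_mean:.4f}")
print(f"actual log mean count = {actual_log_mean:.4f}")
```

The correction term is small here because the variance of the log-counts is small; it becomes important for widely dispersed counts.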
Note that standard deviations of the mean log-count should not be directly back-transformed
since the value obtained ($10^{0.0635} = 1.1574$) would be misleading. Rather,
the approximate upper and lower 95% confidence limits around the geometric mean
would be determined as $10^{(3.1955 + 2 \times 0.0635)}$ and $10^{(3.1955 - 2 \times 0.0635)}$, that is, $10^{3.3225} = 2101$
and $10^{3.0685} = 1171$. Hence for these data the geometric mean is 1569 and the 95% upper
and lower confidence limits are 2101 and 1171, respectively. A comparison with the arithmetic
mean and its 95% confidence limits is shown below:
For these data the difference between the arithmetic and geometric mean values
is small since the individual counts are reasonably evenly distributed about the mean
value and are not heavily skewed. Note that the median value is smaller than both mean
values because of the small population of results that were examined. The standard
deviation of the arithmetic mean value reflects the level of dispersion of values around
the mean value. Note also that the upper and lower 95% confidence limits are distributed
evenly about the arithmetic mean value (1582 ± 478) but are distributed unevenly
around the geometric mean value (1569 − 398 and 1569 + 532).
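The symmetry of the limits on the arithmetic scale, and their asymmetry after back-transformation from the log scale, can be reproduced with the sketch below. It follows the approach used in this example of taking approximate 95% limits as the mean ± 2 standard deviations. Variable names are illustrative. Because the standard deviation of the log-counts is carried at full precision, the back-transformed limits may differ slightly from figures based on rounded intermediate values.

```python
import math
import statistics

counts = [1540, 1360, 1620, 1970, 1420]  # cfu/g

# Arithmetic scale: limits are symmetric about the mean.
mean = statistics.mean(counts)
s = statistics.stdev(counts)
arith_lower, arith_upper = mean - 2 * s, mean + 2 * s

# Log scale: limits are set on the mean log-count and then back-transformed.
log_counts = [math.log10(x) for x in counts]
mean_log = statistics.mean(log_counts)
s_log = statistics.stdev(log_counts)
geo_mean = 10 ** mean_log
geo_lower = 10 ** (mean_log - 2 * s_log)
geo_upper = 10 ** (mean_log + 2 * s_log)

print(f"arithmetic: {mean:.0f} ({arith_lower:.0f} to {arith_upper:.0f})")
print(f"geometric:  {geo_mean:.0f} ({geo_lower:.0f} to {geo_upper:.0f})")
```

The back-transformed limits are asymmetric because equal steps on the log scale correspond to equal ratios, not equal differences, on the original scale.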
References
Blood, RM (1974) The Clearing House Scheme. Tech Circular No. 558. Leatherhead Food Research
Association.
Mossel, DAA (1982) Microbiology of Foods: The Ecological Essentials of Assurance and Assessment
of Safety and Quality, 3rd edition. University of Utrecht, NL.
Further Reading
Glantz, SA (1981) Primer of Biostatistics, 4th edition. McGraw-Hill, New York, USA.
Hawkins, DM (2005) Biomeasurement: Understanding, Analysing and Communicating Data in the
Biosciences. Oxford University Press, Oxford, UK.
Hoffman, HS (2003) Statistics Explained: Internet Glossary of Statistical Terms.
http://www.animatedsoftware.com/statglos/statglos.htm