Biostat Notes
CBCS_Syllabus_M.Sc._Botany/Zoology/Microbiology_Sem – I
1.1 Measure of Central tendency: Mean, Mode and Median, Frequency distribution
The word "statistics" comes from the Latin word "status" (state), which indicates the
historical importance of governmental data gathering.
Gottfried Achenwall first used the German word "Statistik".
The term statistics has been used to indicate facts and figures of any kind: health
statistics, vital statistics, or business statistics.
It is also used to refer to a body of knowledge known as statistical methods developed for
handling data in general, particularly in the fields of experimentation and research.
The various definitions of statistics are: (1) Principles and methods for the collection,
organization, analysis and interpretation of numerical data of different kinds.
(2) The science and art of dealing with variation in such a way as to obtain reliable
results. (3) The science of experimentation, which may be regarded as mathematics
applied to observational data.
1. Collection of data: The data are collected with a specific and well-defined purpose. If the
data collected are faulty, the conclusions drawn will also be false. Therefore, maximum care
should be taken while collecting the data.
2. Organization of the data: Conclusions cannot be drawn from direct observation alone.
The data should be edited, classified and tabulated.
3. Presentation of the data: The data should be presented either in tabular or graphic form.
It can be represented by diagrams also.
4. Analysis of the data: Mean, Standard deviation, Range, etc.
5. Interpretation of the data: It is the last stage in statistical analysis and the most difficult
one, requiring a high degree of skill.
Biostatistics
Biostatistics: The science in which the mathematical principles of handling and analyzing
data are applied to biological fields like medicine, public health, agricultural genetics, etc.
When an investigator gets data by experiment, by interview or from existing records, the
result is a series of numbers, the observations.
Such observations are a sample from a larger population, and we use these samples in order
to draw conclusions about the population.
Generalization from sample to population entails some risk, and therefore reasoning
from sample to population requires systematic thinking.
Statistical theory is largely devoted to this reasoning. So when a biologist sets the goal of
an investigation and collects information (data), he requires help from statistical methods
to analyze the data and to draw conclusions, which help to decide the future course of action.
For this reason, Biostatistics is applied in the majority of branches of biological sciences, such
as Agriculture, Genetics, Physiology, Biochemistry, Molecular Biology, Taxonomy,
Medicine and Health Sciences.
Classification
Objectives of classification:
Types of Classification
III. Qualitative Classification: The collected data are classified on the basis of some attribute
or quality like gender, religion, literacy, employment, etc. Qualitative classification is simple
when there is one attribute, and manifold when there is more than one attribute.
IV. Quantitative Classification: The collected data are grouped with reference to
characteristics which can be measured and numerically described such as height, weight, age,
income, sales, etc.
• Continuous variable: It can assume all values within an interval and can be divided into
smaller and smaller units. Theoretically the data point can lie anywhere on the numerical
scale.
• Discrete (Discontinuous) variable: It is one where the values of the variable differ from
one another by definite amounts.
• Ranked variable: They are the variables which cannot be measured but can be ranked by
their magnitude/size.
• Derived variable: The derived variables are calculated from two or more independently
measured variables. They show relationships between variables.
• Variates: Measurements on quantitative variables are called variates.
• Attributes: Measurements on qualitative variables are called attributes.
• Frequency distribution: It is the grouping of data into ordered classes and the
determination of the number of observations (the frequency) in each class.
• Ordered array: It is listing of data in order of magnitude from the smallest value to the
largest value.
• Frequency table: The tabular form of a frequency distribution is called a frequency
table.
• Cumulative frequency: It is obtained by successively adding the frequencies of all preceding class intervals.
• Relative frequency: It is obtained by dividing the class frequency by the total frequency.
• Tabulation: It is a process of summarizing classified data in the form of a table.
• Table: It is a systematic arrangement of classified data in columns and rows.
• Simple/One way table: A table which contains data on 1 characteristic is called one-way
table.
• Two way table: A table which contains data on 2 characteristics is called two way table.
• Manifold table: A table which contains data on more than 2 characteristics is called a
manifold table.
• Data: Set of values collected for the variable from each of the elements belonging to the
sample. OR It is a collective term referring to a group of observations.
• Primary data: The data collected by actual observations, measurements, counting, direct
recording, etc, during the course of investigation is called primary data.
• Secondary data: Any data, detached from the original source and reprocessed for one's own
purpose by another person or organization, are called secondary data.
The measure of central tendency is defined as: “It is a sort of average or typical value
of the items in the series and its function is to summarize the series in terms of this average
value”.
Thus a single expression representing the whole group is selected, which may convey a fairly
adequate idea about the whole group. This single expression in statistics is known as the
average. Averages generally lie in the central part of the distribution, and therefore they are
also called measures of central tendency.
Each of them, in its own way, can be called a representative of the characteristics of
the whole group and thus the performance of the group as a whole can be described by the
single value which each of these measures gives. The values of mean, median and mode also
help us in comparing two or more groups or frequency distributions in terms of typical or
characteristic performances.
The main aim of an average is to present huge mass of statistical data in a simple and concise
manner. This makes the central theme of the data readily understandable. The averages are
extremely useful for purposes of comparison.
Arithmetic mean, commonly called ‘mean’, is defined as the sum of all the variates of
a variable divided by the total number of items in the sample.
X̄ = Σx / n = (x1 + x2 + x3 + . . . + xn) / n
Where, X̄ = Arithmetic mean of variable x
Σx = Sum of all the items of the variable x
n = Total number of items in the sample
The mean should be expressed in the same unit in which data is given.
Example (Mean of raw or ungrouped data): Find the mean of the triglyceride levels present in
the blood samples of 10 hospital patients.
25, 30, 21, 55, 47, 10, 15, 17, 45, 35
Solution:
X̄ = Σx / n
= (25 + 30 + 21 + 55 + 47 + 10 + 15 + 17 + 45 + 35) / 10 = 300 / 10
X̄ = 30
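The calculation above can be checked with a short Python sketch (variable names are illustrative):

```python
# Mean of ungrouped data: the sum of the variates divided by the sample
# size, using the triglyceride values from the example above.
values = [25, 30, 21, 55, 47, 10, 15, 17, 45, 35]
mean = sum(values) / len(values)  # 300 / 10
print(mean)  # 30.0
```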
Let x1, x2, x3 …. xn be the variates and let f1, f2, f3 …. fn be their corresponding frequencies;
then their mean X̄ is given by,
X̄ = (f1x1 + f2x2 + f3x3 + . . . + fnxn) / (f1 + f2 + f3 + . . . + fn) = Σfixi / Σfi = Σfixi / N
Where N = f1 + f2 + f3 + . . . + fn
Example (Mean of grouped data): Find the mean of the following grouped data of blood LDL
levels recorded in a sample of patients.
Blood LDL 52 58 60 65 68 70 75
No. of patients 7 5 4 6 3 3 2
Solution:
X F f×x
52 7 364
58 5 290
60 4 240
65 6 390
68 3 204
70 3 210
75 2 150
Total 30 1848
X̄ = Σf·x / Σf = 1848 / 30 = 61.6
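The grouped-data mean can likewise be sketched in Python (list names are illustrative):

```python
# Mean of grouped (frequency) data: X̄ = Σ(f·x) / Σf, using the LDL table above.
x = [52, 58, 60, 65, 68, 70, 75]  # variate values
f = [7, 5, 4, 6, 3, 3, 2]         # frequencies

total = sum(fi * xi for fi, xi in zip(f, x))  # Σ f·x = 1848
n = sum(f)                                    # Σ f = 30
mean = total / n
print(round(mean, 1))  # 61.6
```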
Uses: 1. A common man uses the mean for calculating average marks obtained by students.
2. It is extensively used in practical statistics.
3. Estimates are generally obtained using the mean.
4. A businessman uses it to find out the operation cost, profit per unit of article, output
per man and per machine, average monthly income and expenditure, etc.
Mode:–
Mode is defined as that value in a series which occurs most frequently. In a frequency
distribution, the mode is that variate which has the maximum frequency. In other words, the
mode represents the most frequent, typical or predominant value.
Example: Weight of catfish in g: 8, 9, 10, 9, 17, 10, 19, 15, 10, 12, 19
Array of the data: 8, 9, 9, 10, 10, 10, 12, 15, 17, 19, 19
In the above data, 10 occurs 3 times,
9 and 19 occur 2 times each, and
the others occur once.
Therefore, the mode of the above data is 10 g.
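Finding the most frequent value can be sketched in Python with `collections.Counter`:

```python
# Mode: the value occurring most frequently, found by counting occurrences.
from collections import Counter

weights = [8, 9, 10, 9, 17, 10, 19, 15, 10, 12, 19]  # catfish weights in g
counts = Counter(weights)
mode, freq = counts.most_common(1)[0]  # (value, its frequency)
print(mode, freq)  # 10 3
```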
Demerits: 1. There are different formulas for its calculation which ordinarily give different
answers.
2. Mode may be indeterminate. Some series have two or more than two modes.
3. It cannot be subjected to algebraic treatments. For example, the combined mode cannot
be calculated for the modes of two series.
4. It is an unstable measure as it is affected more by sampling fluctuations.
5. Mode for the series with unequal class-intervals cannot be calculated.
Median:–
Median is defined as the middle most or the central value of the variable in a set of
observations, when the observations are arranged either in ascending or in descending order
of their magnitudes. It divides the arranged series into two equal parts. Median is a positional
average, whereas the mean is a calculated average.
When a series consists of an even number of terms, the median is the mean of the two central
items. It is generally denoted by M.
Example: The following are the lengths in cm of a species of fish. Let us identify the median
length. Length in cm: 17, 16, 15, 18, 16
First let us array the data in ascending order of magnitude. 15, 16, 16, 17, 18
Then, we identify the median as the value of item (n + 1) / 2, where n is the number of
items in the sample; n is 5 in this example.
Median = value of item (n + 1) / 2 = (5 + 1) / 2 = 6 / 2 = item 3.
That is, the median is the value of the item 3 in the array. In above example the value of the
item 3 is 16 cm.
The median length of the sample is 16 cm.
Example: The following are the weights in g of a species of frog, with a sample size n = 8
Weight in g: 75, 60, 55, 80, 45, 70, 40, 85
Array the data as follows: 40, 45, 55, 60, 70, 75, 80, 85
The median is the value of the item (n + 1) / 2 = (8 + 1) / 2 = 9 / 2 = 4.5
That is, the median is the value of item 4.5 in the array. The value of item 4.5 is
calculated as the value midway between items 4 and 5, as follows,
Value of item 4.5 = (value of item 4 + value of item 5) / 2
= (60 + 70) / 2 = 130/2 = 65
The median weight is 65 g.
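Both cases, odd and even n, follow the (n + 1) / 2 position rule described above; a small Python sketch (the function name is illustrative):

```python
# Median via the (n + 1) / 2 position rule; for an even n the two
# central items are averaged, as in the frog-weight example above.
def median(data):
    arr = sorted(data)  # array the data in ascending order
    n = len(arr)
    mid = n // 2
    if n % 2 == 1:
        return arr[mid]  # odd n: the single central item
    return (arr[mid - 1] + arr[mid]) / 2  # even n: mean of the two central items

print(median([17, 16, 15, 18, 16]))              # 16
print(median([75, 60, 55, 80, 45, 70, 40, 85]))  # 65.0
```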
Uses: 1. It is useful in those cases where numerical measurements are not possible.
2. It is also useful in those cases where mathematical calculations cannot be made in order
to obtain the mean.
3. It is generally used in studying phenomena like skill, honesty, intelligence etc.
Frequency Distribution:–
Example: Present the following data from a sample of 40 households in the form of a
frequency table with 9 classes and a class interval of 100.
200, 120, 350, 550, 400, 140, 350, 85, 180, 110,
110, 600, 350, 500, 450, 200, 170, 90, 170, 800,
190, 700, 630, 170, 210, 185, 250, 120, 180, 350,
110, 250, 430, 140, 300, 400, 200, 400, 210, 305
Solution: Arranging the given data in ascending order, we get:
85, 90, 110, 110, 110, 120, 120, 140, 140, 170, 170, 170, 180, 180, 185, 190, 200,
200, 200, 210, 210, 250, 250, 300, 305, 350, 350, 350, 350, 400, 400, 400, 430,
450, 500, 550, 600, 630, 700, 800
Taking a class interval of 100, starting with 0 – 99:
Frequency Distribution Table
Class Intervals Tally bars Frequency
0 – 99 II 2
100 – 199 IIII IIII IIII 14
200 – 299 IIII II 7
300 – 399 IIII I 6
400 – 499 IIII 5
500 – 599 II 2
600 – 699 II 2
700 – 799 I 1
800 – 899 I 1
Total 40
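The tallying above can be reproduced with a short Python sketch (names are illustrative):

```python
# Tally the 40 household values into classes of width 100 starting at 0,
# reproducing the frequency table above.
data = [200, 120, 350, 550, 400, 140, 350, 85, 180, 110,
        110, 600, 350, 500, 450, 200, 170, 90, 170, 800,
        190, 700, 630, 170, 210, 185, 250, 120, 180, 350,
        110, 250, 430, 140, 300, 400, 200, 400, 210, 305]

freq = {}
for value in data:
    lower = (value // 100) * 100  # lower limit of the class interval
    freq[lower] = freq.get(lower, 0) + 1

for lower in sorted(freq):
    print(f"{lower} - {lower + 99}: {freq[lower]}")
```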
Example: The following are the lengths of 25 gold fish, measured to the nearest tenth of a
cm.
3.9, 3.8, 3.6, 3.8, 4.0, 4.2, 3.6, 4.7, 4.3, 3.9, 3.6, 4.5, 3.8,
3.9, 4.3, 4.1, 3.9, 4.4, 4.1, 4.1, 4.4, 4.1, 3.9, 3.3, 4.0
Sr. No. Class intervals Frequency Relative frequency Percent relative frequency
1 3.25 – 3.55 2 2 / 25 = 0.08 0.08 × 100 = 8
2 3.55 – 3.85 5 5 / 25 = 0.20 0.20 × 100 = 20
3 3.85 – 4.15 11 11 / 25 = 0.44 0.44 × 100 = 44
4 4.15 – 4.45 5 5 / 25 = 0.20 0.20 × 100 = 20
5 4.45 – 4.75 2 2 / 25 = 0.08 0.08 × 100 = 8
Total 25 1.00 100
Cumulative frequency series are of two types: (1) Less than series and (2) More than series.
Less than cumulative frequency distribution of the length (cm) of gold fish
Sr. No. Less than Cumulative frequency
1 3.25 0
2 3.55 0+2=2
3 3.85 2+5=7
4 4.15 7 + 11 = 18
5 4.45 18 + 5 = 23
6 4.75 23 + 2 = 25
More than cumulative frequency distribution of the length (cm) of gold fish
Sr. No. More than Cumulative frequency
1 3.25 25
2 3.55 25 – 2 = 23
3 3.85 23 – 5 = 18
4 4.15 18 – 11 = 7
5 4.45 7–5=2
6 4.75 2–2=0
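Both cumulative series above can be generated mechanically; a short Python sketch (list names are illustrative):

```python
# "Less than" and "more than" cumulative frequencies for the gold-fish
# classes above, built from the class frequencies 2, 5, 11, 5, 2.
freqs = [2, 5, 11, 5, 2]

less_than, running = [], 0
for f in freqs:
    running += f  # add each class frequency in turn
    less_than.append(running)
print(less_than)  # [2, 7, 18, 23, 25]

more_than, remaining = [], sum(freqs)
for f in freqs:
    more_than.append(remaining)  # items at or above the true lower limit
    remaining -= f
print(more_than)  # [25, 23, 18, 7, 2]
```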
Example: The heights (in cm) of 40 persons are: 110, 112, 125, 135, 150, 152, 150, 155, 159,
130, 128, 138, 133, 143, 147, 151, 154, 156, 112, 116, 117, 111, 113, 115, 118,
121, 123, 120, 125, 121, 110, 113, 114, 149, 153, 155, 150, 156, 152, 111
Array the data and form a cumulative frequency table with class interval of 10.
Solution: Arranging the given data in ascending order of magnitude, we get: 110, 110, 111,
111, 112, 112, 113, 113, 114, 115, 116, 117, 118, 120, 121, 121, 123, 125, 125, 128,
130, 133, 135, 138, 143, 147, 149, 150, 150, 150, 151, 152, 152, 153, 154, 155, 155, 156, 156, 159
Sr. No. Class interval Tally Frequency Cumulative frequency
1 110 – 119 IIII IIII III 13 13
2 120 – 129 IIII II 7 20
3 130 – 139 IIII 4 24
4 140 – 149 III 3 27
5 150 – 159 IIII IIII III 13 40
Frequency Graphs:–
All types of frequency distributions can be represented by means of graphs. The most
common types of frequency graphs are: (1) the bar diagrams for qualitative and discrete
frequency distributions, (2) histogram for continuous frequency distribution, (3) frequency
polygon and frequency curve, and (4) the ogives, for cumulative frequency distributions.
Bar Diagram: The simple bar diagram can be used to represent qualitative as well as discrete
frequency distributions. The height of each bar is proportional to the frequency of the
respective class.
Frequency polygon and frequency curve: The frequency polygon and frequency curve are
alternative forms to the histogram. Whereas the information given in a histogram is precise,
that given in a frequency polygon or frequency curve is more general and tends to reflect the
nature of the frequency distribution of the population from which the sample was obtained.
The frequency polygon is obtained by joining the middle points of the tops of the rectangles
in a histogram by straight lines.
The frequency curve is drawn in the same manner, except that the midpoints of the class
intervals plotted against the respective frequencies are joined by a smooth curve instead of
straight lines.
To draw an ogive, the true class intervals are marked on the X-axis and the cumulative
frequencies are marked on the Y-axis. Against the true upper limits of the class-intervals,
points are plotted corresponding to the respective cumulative frequency. The lines joining the
points thus plotted give the less than ogive. Likewise, more than cumulative frequencies are
plotted against the respective true lower limit of the class-interval and the plotted points are
joined to get the more than ogive.
The mean alone gives no information about the range of values that comprise a data
set. Two sets of data may have the same measure of central tendency and yet not be
identical.
For e.g., the following three sets of data are not identical but their means are the same.
A : 60, 60, 60, 60, 60 ⇒ 60
B : 30, 50, 85, 75, 60 ⇒ 60
C : 10, 30, 90, 90, 80 ⇒ 60
The mean for all the three sets is 60, but the observations in each set are different. This
variation in data is described by a measure of dispersion.
In the first set there is no dispersion because all the observations are the same. In the second
and third sets dispersion is evident because the observations differ. The
dispersion is small when the values of the observations are close together, i.e., show little
variation. The dispersion is higher when the values are widely spread out. A measure of
dispersion conveys information regarding the amount of variability present in a set of data.
The mean egg production of poultries A, B and C is the same. The mean does not show the
fluctuation or variation in the number of eggs produced daily by poultries B and C. The daily
egg production of poultry A is constant; it shows no variability. The variability in egg
production by poultry B is less than that of poultry C. In poultry B the values spread out
between 3835 and 4140, while in poultry C the values spread out between 800 and 12000. This
means dispersion is greater in poultry C's egg production than in poultry B's.
Importance of Dispersion:–
Range (R):–
Disadvantages: 1. The range does not take into account the number of observations in the
sample; it considers only the largest observation and the smallest
observation, whatever they may be. Because we expect a large sample to include
occasional extreme values, we expect it to have a large range. A measure of variability should
depend on the number of observations.
2. It makes no direct use of many of the observations in the sample. Observations
between the smallest and largest in a set are used only to determine which observations
are smallest and largest. Some use of the actual values of intervening observations
seems desirable.
3. The range also suffers from dependence upon extreme observations.
4. The range cannot be computed in case of open-end distributions.
Standard deviation is defined as the square root of the arithmetic mean of the
squared deviations of the various items from the arithmetic mean. In short, it is called the root
mean square deviation. The mean of squared deviations is called the variance. Therefore, the
square root of the variance (V) is the standard deviation (SD).
Standard deviation (SD) = √( Σ(X − X̄)² / n )
Where, X = variable x
X̄ = mean of the variable x = Σx / n
(X − X̄) = deviation
(X − X̄)² = squared deviation
Σ(X − X̄)² / n = mean of squared deviations = Variance
Example: Compute the standard deviation for the following weights in g of frogs:
30 90 20 10 80 70
X (weight in g) (X − X̄) (X − X̄)²
10 10 – 50 = –40 1600
20 20 – 50 = –30 900
30 30 – 50 = –20 400
70 70 – 50 = 20 400
80 80 – 50 = 30 900
90 90 – 50 = 40 1600
ΣX = 300 Σ(X − X̄)² = 5800
X̄ = ΣX / n = 300 / 6 = 50 g
SD = √( Σ(X − X̄)² / n ) = √(5800 / 6) = √966.67
SD = 31.09 g
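The same steps can be checked with a short Python sketch (variable names are illustrative):

```python
# Standard deviation of the frog weights above: the square root of the
# mean squared deviation from the mean.
from math import sqrt

weights = [30, 90, 20, 10, 80, 70]  # frog weights in g
mean = sum(weights) / len(weights)  # 300 / 6 = 50.0
variance = sum((w - mean) ** 2 for w in weights) / len(weights)  # 5800 / 6
sd = sqrt(variance)
print(round(sd, 2))  # 31.09
```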
Example: Compute the standard deviation for the following data of the number of eggs in
60 nests of a species of bird.
2 2 3 6 2 4 1 0 1 2 3 4 4 5 6 4 4 2 2 0 1 3 6 5 2 5 3 5 4 4
2 4 3 0 4 3 1 5 2 2 3 6 4 3 2 3 6 1 2 3 2 5 4 1 1 4 3 3 2 5
X (No. of eggs / nest) Frequency (f) f·x (X − X̄) (X − X̄)² f·(X − X̄)²
0 5 0 0 – 3 = –3 9 5 × 9 = 45
1 8 8 1 – 3 = –2 4 8 × 4 = 32
2 12 24 2 – 3 = –1 1 12 × 1 = 12
3 12 36 3–3=0 0 12 × 0 = 0
4 12 48 4–3=1 1 12 × 1 = 12
5 7 35 5–3=2 4 7 × 4 = 28
6 4 24 6–3=3 9 4 × 9 = 36
Σf = 60 Σfx = 175 Σf(X − X̄)² = 165
X̄ = Σfx / Σf = 175 / 60 ≈ 3
SD = √( Σf(X − X̄)² / Σf ) = √(165 / 60) = √2.75 = 1.66 eggs
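A Python sketch of the frequency-weighted computation (note that, following the table above, the mean is rounded to 3 before taking deviations):

```python
# SD from a frequency table: each squared deviation weighted by its
# frequency, using the eggs-per-nest table above.
from math import sqrt

x = [0, 1, 2, 3, 4, 5, 6]     # eggs per nest
f = [5, 8, 12, 12, 12, 7, 4]  # frequencies, Σf = 60

n = sum(f)
xbar = round(sum(fi * xi for fi, xi in zip(f, x)) / n)  # 175/60 ≈ 2.92, rounded to 3
ss = sum(fi * (xi - xbar) ** 2 for fi, xi in zip(f, x))  # Σ f·(X − X̄)² = 165
sd = sqrt(ss / n)  # √(165 / 60) = √2.75
print(round(sd, 2))  # 1.66
```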
Variance (V or σ²):–
The variance is the arithmetic mean of the squared deviations from the
mean value of the data. It is also described as the square of the standard deviation. The
methods for calculating variance are the same as for the standard deviation. It is sometimes
denoted by σ².
Variance = V = Σ(X − X̄)² / n
The limitation of the mean deviation for negative values is overcome by squaring the
deviations. To avoid errors due to a biased estimate, degrees of freedom (n – 1) are
used for a small number of values, instead of n values:
V = Σ(x − x̄)² / (n − 1) or V = Σdx² / (n − 1)
Where dx = (x − x̄)
Example: For a data set with n = 50 observations and Σx = 2500:
x̄ = Σx / n = 2500 / 50 = 50
V = 59.18
Demerits: The unit of expression of variance is not the same as that of the observations,
because variance indicates squared deviations, e.g., if the observations are given in meters
then the variance will be in square meters.
Coefficient of variation:–
Coefficient of variation = (Standard deviation / Mean) × 100
CV = (SD / X̄) × 100
When comparing the CV of two or more series of data, the series having lesser CV is less
variable, more stable, more uniform and more consistent, while the series of data having
higher CV is more variable, less stable, less uniform and less consistent.
Example: Lengths (X̄ ± SD in cm) of two species of fish, A & B, are as follows. Comment
on the variability of the length in the two species.
Species A = 67 ± 2.5
Species B = 64 ± 2.4
CV of species A = (SD / X̄) × 100 = (2.5 / 67) × 100 = 250 / 67 = 3.73 %
CV of species B = (SD / X̄) × 100 = (2.4 / 64) × 100 = 240 / 64 = 3.75 %
Since species B has the slightly higher CV, the length of species B is slightly more variable
than that of species A.
Example: A researcher collects data on the weight and length of fishes and is interested to
find out which of the characters is more variable. The data are:
Fish Means Standard deviation
Weight 350 gms 12 gms
Length 16 inches 1.5 inch
Coefficient of variation for weight = (12 / 350) × 100 = 1200 / 350 = 3.43 %
Coefficient of variation for length = (1.5 / 16) × 100 = 150 / 16 = 9.375 %
Since length has the much higher CV, length is the more variable character.
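Because CV is unit-free, characters measured in different units can be compared directly; a minimal Python sketch (the function name is illustrative):

```python
# Coefficient of variation: CV = (SD / mean) × 100, applied to the
# weight-vs-length fish data above.
def cv(sd, mean):
    return sd / mean * 100

print(round(cv(12, 350), 2))  # weight: 3.43 %
print(round(cv(1.5, 16), 3))  # length: 9.375 %, the more variable character
```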
When we survey the entire parent population without leaving out a single individual and
then calculate the mean value of such a survey, it is called the population mean (µ). But such
a population survey needs a lot of labour, money and time.
In order to avoid this, in general we conduct a sample survey. In the sample survey, the
sample plots have to be selected at random without any personal bias.
The principle behind taking a sample at random is that every plot has got equal probability to
be selected in the sample survey.
When the sample survey is conducted in this way, values are recorded from each sample
plot (Xi).
The population consists of data with some clearly defined characteristic. For e.g., a
population may consist of all patients with a particular disease, or all tablets from a
production batch…
Sample – selection of patients to participate in a clinical study.
Sample – tablets chosen for a weight determination.
The sample chosen should be representative of the population. Under these conditions, it is
assumed that the value of x̄ is more or less equal to µ, the value of the population mean.
The question is whether the population mean can be calculated from knowledge of the sample
mean. The answer is no, because under no circumstances can the true value of the population
mean be calculated from the value of the sample mean. However, confidence limits at 95%
and 99% can be established for the population mean.
The confidence limits have one lower and one higher value, and the exact value of the
population mean can be anything ranging from the lower one to
the higher one. The range of values between the lower and the higher limits is known as the
confidence range or the confidence interval.
When the investigator states that he has established the confidence limits for the population
mean at 95% (or at 99%), he is confident, with that level of probability, that the calculated
range will include the value of the population mean.
Confidence limits:–
For 95% . . . . C = 1.96 For 99% . . . . C = 2.58
µ = X̄ ± (C × Se)
Se = Sd / √n
Sd = √V
V = Σdx² / (n − 1)
Once we establish the confidence limits for the population mean, all that can be said is
that the value of the population mean can fall anywhere between the lower limit and the
higher limit, but where exactly it lies is not known at all.
Confidence Interval:–
In order to get information about the parameter, we draw a random sample from
the population and calculate summary measures for the sample. These are known as
statistics. Such statistics give information about population parameters.
However, since sample is only a part of the population, the numerical value of a statistic
cannot be expected to give exact value of the parameter. As statistic is a random variable,
therefore it will have a probability distribution.
The probability distribution of the statistic is known as sampling distribution of the statistic.
The mean is a statistic. The probability distribution of the mean is known as the sampling
distribution of the mean. The standard deviation of a sampling distribution is referred to as
the "standard error" (SE).
Suppose a random sample is drawn from the population with mean (µ) and variance (σ²). We
want to relate the sampling distribution of x̄ to the population from which it is drawn. The
mean and standard deviation (standard error) of the sampling distribution is determined in
terms of µ and σ,
x = mean of the sampling distribution
µ = population mean
SE = Population Sd ÷ √(sample size)
SE = Sd / √n or SE = σ / √n
This shows that the variability of the sample mean is governed by two factors:
(a) population variability and (b) sample size.
Large variability in the population induces large variability in the sample mean, thus making
sample information about µ less reliable. But this can be balanced by taking n appropriately
large. Thus, with increasing sample size, the SE of X̄ decreases and the distribution of X̄
tends to become concentrated around the population mean (µ).
Confidence limit: It is the probability associated with the confidence that any value in the set
of data will fall within a given range of the mean. It is represented by the two
end points of the confidence interval.
Confidence interval: It is the interval between two values based on sample observations.
Example: Following are the data of a sample. From these data, calculate the 95% and 99%
confidence limits for the population mean.
No. 1 2 3 4 5 6 7 8 9 10 11 12
X 25 34 41 39 42 27 26 39 34 46 36 38
Solution:
X (X − X̄) = dx dx²
25 –10.58 111.9
34 –1.58 2.49
41 5.42 29.37
39 3.42 11.69
42 6.42 41.21
27 –8.58 73.61
26 –9.58 91.77
39 3.42 11.69
34 –1.58 2.49
46 10.42 108.57
36 0.42 0.17
38 2.42 5.85
ΣX = 427 Σdx² = 490.81
X̄ = ΣX / n = 427 / 12 = 35.58
V = Σdx² / (n − 1) = 490.81 / 11 = 44.62
Sd = √V = √44.62 = 6.68
Se = Sd / √n = 6.68 / √12 = 1.93
C values: for 95%, C = 1.96; for 99%, C = 2.58
µ = X̄ ± (C × Se)
Result: The 95% and 99% confidence limits (CL) for the population mean are given below.
95%: 31.80 to 39.36
99%: 30.61 to 40.56
Conclusion: From the above result, it is concluded that the 95% confidence limits for the
population mean are 31.80 to 39.36, and the 99% confidence limits for the population
mean are 30.61 to 40.56.
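The whole computation can be sketched in Python; note that this sketch uses C = 2.58 for the 99% level, the conventional normal-distribution value:

```python
# Confidence limits for the population mean, following the steps above:
# variance with n - 1, Sd = sqrt(V), Se = Sd / sqrt(n), mu = mean ± C × Se.
from math import sqrt

x = [25, 34, 41, 39, 42, 27, 26, 39, 34, 46, 36, 38]
n = len(x)
mean = sum(x) / n                                # 427 / 12 ≈ 35.58
v = sum((xi - mean) ** 2 for xi in x) / (n - 1)  # ≈ 44.63
se = sqrt(v) / sqrt(n)                           # ≈ 1.93

for c in (1.96, 2.58):  # C values for 95% and 99%
    print(round(mean - c * se, 2), round(mean + c * se, 2))
# 31.8 39.36
# 30.61 40.56
```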
Chi-square test of Karl Pearson is a statistical device to test the significance of the difference
between observed distribution and the expected distribution. It is an index to measure the
extent and significance of the difference between the observed and expected frequencies.
Chi-square (χ2) is the summation of the squared deviation of each observed frequency (O)
from the respective expected frequency (E) divided by the expected frequency.
χ² = Σ (O − E)² / E
If the differences (deviations) between O & E are greater, the chi-square will be greater, and
vice versa. If there is no difference between O & E, the χ2 will be zero.
1. Every observation of the sample for this test should be independent of all other
observations.
2. The total number of observations used for the test should be large.
3. The expected frequency of any item should not be less than 5.
4. The frequencies used in χ2 should be absolute and not relative in terms.
5. The observations used for the χ² test should be collected by random sampling.
6. Chi-square test is used only for drawing inferences. It cannot be used for estimation of
parameter or any other value.
7. The chi-square test is totally dependent on the degrees of freedom.
Chi-square test is used to compare the observed frequencies with the respective
expected frequencies obtained on an a priori hypothesis.
e.g., Comparison of observed and expected frequency distributions of various types
such as Binomial, Poisson and Normal.
Chi-square test is used to compare the observed and expected frequencies of two or
more attributes and to decide whether these attributes are independent of or dependent on
each other. The expected frequencies are obtained on the basis of the null hypothesis that
there is no association between the attributes.
e.g., To test whether eye-color and hair-color of persons are independent or
associated.
Example: A disease was detected in 382 of 600 animals in species A and in 218 of 300
animals in species B. Test, by means of a suitable test, whether there is any difference in
the detection of the disease between the two species.
Species Disease detected Disease not detected Total
A 382 218 600
B 218 82 300
Calculation:
Species Disease detected Disease not detected Total
A 382 218 600 r1
B 218 82 300 r2
Total 600 c1 300 c2 900 N
Null hypothesis: There is no difference between the detection of disease between the two
species.
E1 = (r1 × c1) / N = (600 × 600) / 900 = 400
E2 = (r1 × c2) / N = (600 × 300) / 900 = 200
E3 = (r2 × c1) / N = (300 × 600) / 900 = 200
E4 = (r2 × c2) / N = (300 × 300) / 900 = 100
χ² = Σ (O − E)² / E
= (382 − 400)² / 400 + (218 − 200)² / 200 + (218 − 200)² / 200 + (82 − 100)² / 100
= 0.81 + 1.62 + 1.62 + 3.24 = 7.29
DF = (r – 1)(c – 1) = (2 – 1)(2 – 1) = 1
Result: Here the calculated χ² value = 7.29 and the tabulated χ² value = 3.84 at df = 1 & LS 0.05.
Conclusion: Since χ²cal > χ²tab, the null hypothesis is rejected, and therefore there is a
significant difference in the detection of the disease between the two species.
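The expected frequencies and χ² statistic for this 2×2 table can be computed with a short Python sketch (variable names are illustrative):

```python
# χ² for the 2×2 table above: expected frequencies from
# (row total × column total) / N, then χ² = Σ (O − E)² / E.
observed = [[382, 218],  # species A: detected, not detected
            [218, 82]]   # species B

row_totals = [sum(row) for row in observed]        # [600, 300]
col_totals = [sum(col) for col in zip(*observed)]  # [600, 300]
n = sum(row_totals)                                # 900

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # expected frequency
        chi_sq += (o - e) ** 2 / e

print(round(chi_sq, 2))  # 7.29, which exceeds the table value 3.84 at df = 1
```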
Hypothesis testing:–
Sampling theory deals with two types of problems, namely estimation and testing of
hypothesis. The modern theory of probability plays an important role in decision making, and
the branch of statistics which helps us in arriving at the criterion for such decisions is known
as testing of hypothesis. It employs statistical techniques to arrive at decisions in certain
situations where there is an element of uncertainty, on the basis of a sample whose size is
fixed in advance.
A hypothesis (H) is a statement about the population parameter. In other words, a hypothesis
is a conclusion which is tentatively drawn on a logical basis. A statistical hypothesis is a
tentative conclusion that specifies the properties of a distribution of a random variable. These
properties generally refer to parameters of the population, and the hypothetical values are
compared with the values of statistics derived from a sample in order to find the
difference between the statistics and the corresponding parameters.
In other words, statistical hypothesis is some assumption or statement, which may or may not
be true, about a population or about the probability distribution characterizing the given
population, which we want to test on the basis of the evidence from a random sample.
Hypothesis testing can be regarded as an example of a decision process, in which data are
assembled in a particular way to produce a quantity that leads to a choice between two
decisions. Each decision then leads to an action. Because data arise from a sampling process,
there is some risk that an incorrect decision will be made, with some loss attached to the
resulting incorrect action.
In testing of hypothesis, a statistic is computed from a sample drawn from the parent
population, and on the basis of this statistic it is observed whether the sample so drawn has
come from a population with certain specified characteristics.
The value of sample statistic may differ from the corresponding population parameter due to
sampling fluctuation.
The test of hypothesis discloses whether the difference between the sample statistic and the
corresponding hypothetical population parameter is significant or not. Thus the test
of hypothesis is also known as the test of significance.
Hypothesis:–
Generally, an investigator has a hypothesis (H). The hypothesis may be that the
sample mean (X̄) is less than the population mean (µ), or that the mean of the “treated” group is
greater than the mean of the “control” group, or that the means of more than two groups are not
the same.
Verbally, the H○ states that there is no significant difference between the sample mean and the
population mean, or between the means of two populations, or between the means of more than two
populations.
Prof. R. A. Fisher remarked, “Null hypothesis is the hypothesis which is to be tested for
possible rejection under the assumption it is true”.
The negation of null hypothesis is called Alternative hypothesis. It means ‘any statistical
hypothesis which is not a null hypothesis is called an alternative hypothesis’. It is represented
by HA or Hα or H1. In other words if null hypothesis is rejected, the alternative hypothesis is
applicable.
According to the alternative hypothesis, the difference between the population mean (µ) and the sample
mean (X̄) is not due to sampling fluctuations, but is real and quite significant.
In case the null hypothesis is not applicable, the verification of the scientific hypothesis will depend
on the alternative hypothesis. The null hypothesis is accepted as true until the
alternative hypothesis disproves it.
In case null hypothesis is rejected in favour of alternative hypothesis, there are two
possible outcomes. Either the null hypothesis has been rejected correctly or incorrectly.
Falsely rejecting the null hypothesis is called Type I Error.
In case the null hypothesis is not rejected, again there are two possible outcomes. Either we have
failed to reject the null hypothesis though it should have been rejected, or we have correctly
failed to reject the null hypothesis because it was not to be rejected. Failing to reject the null
hypothesis when it should have been rejected is called Type II Error.
1. Type I Error: When the null hypothesis is true, but the difference of means is significant and
the hypothesis is rejected, it is called Type I Error. The probability of making a Type I error
is denoted by α. That is, the probability of rejecting the null hypothesis when it is true is α.
2. Type II Error: When the null hypothesis is false, but the difference of means is not significant and
the hypothesis is accepted, it is called Type II Error. It means that in a Type II error, the null
hypothesis is accepted when it is false. The probability of making a Type II error by accepting the null
hypothesis when it is false is represented by β, and the probability of making the correct
decision of rejecting the false null hypothesis is (1 – β).
The decision rule specifies which values of the test statistic will determine the
rejection of the null hypothesis in favour of alternative hypothesis. The decision rule is based
on the probabilities (α and β) of the Type I and Type II errors. Possibilities associated with
making a correct decision can be represented as follows:
Decision about          In reality, null hypothesis is
null hypothesis         True                False
Accept                  Correct             Type II Error
Reject                  Type I Error        Correct
i.e., Type I Error (α): Null hypothesis (H○) rejected though it is true.
Type II Error (β): Null hypothesis (H○) accepted though it is wrong.
Level of Significance (LS) is the quantity of risk of Type I error which can be tolerated in
making a decision about the null hypothesis (H○). Thus, level of significance is the maximum
probability of making a Type I error.
The commonly used levels of significance in practice are 5% (0.05) and 1% (0.01). At the 5% level
of significance, α = 0.05, i.e. the probability of making a Type I error is 0.05. In other words,
there is a probability of making a Type I error 5 times out of 100, and the chances of making a
correct decision are 95 out of 100. It means 95 times the decision made is correct and is wrong
only 5 times. Similarly, at the 1% level of significance (α = 0.01) there is a probability of
making a Type I error 1 time out of 100.
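What α = 0.05 means in practice can be illustrated with a small simulation (a sketch, not from the notes; the population values, sample size and the two-tailed critical value z = 1.96 are assumed for illustration). Samples are drawn repeatedly from a population in which H○ is actually true, and we count how often a z-test wrongly rejects it:

```python
import random
import math

random.seed(42)  # reproducible runs

MU, SIGMA, N, TRIALS = 50.0, 10.0, 25, 10_000
Z_CRIT = 1.96  # two-tailed critical z at LS 0.05

rejections = 0
for _ in range(TRIALS):
    # Sample drawn from the population itself, so H0 (mu = 50) is true
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    mean = sum(sample) / N
    z = (mean - MU) / (SIGMA / math.sqrt(N))
    if abs(z) > Z_CRIT:          # Type I error: rejecting a true H0
        rejections += 1

rate = rejections / TRIALS
print(f"Observed Type I error rate: {rate:.3f}")  # close to 0.05
```

Over many trials the observed rejection rate settles near 0.05, which is exactly the "5 errors out of 100" reading of the level of significance.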
The test statistics used to test the hypothesis H○ follows a known distribution. This is
represented by a standard normal curve or normal probability curve of sampling distribution.
The area under probability curve is divided into two regions:
(1) The region of rejection or critical region and (2) The region of acceptance
The area of critical region is equal to the level of significance α and lies on the tail of the
distribution curve. It may be located on both the sides or only one side i.e. on the one tail
(either right or left).
2. Acceptance region:–
The region of the standard normal curve which is not covered by the rejection region is known as the
acceptance region.
When we compare the calculated probability (area left in the tail) with the level of
significance (LS), if it is less than 0.05 (P < 0.05) or less than 0.01 (P < 0.01), we reject the
H○. If the calculated P is equal to or greater than 0.05 (P ≥ 0.05) or 0.01 (P ≥ 0.01) we fail to
reject H○.
Thus we have only two options regarding our “decision” about the null hypothesis, either
“reject H○” or “fail to reject H○”. We are rejecting a H○ because the probability for its
occurrence is low, lower than 0.05 or 0.01.
Usually in the hypothesis testing, and for that matter any test of significance (χ2-test, t-test,
etc.) the probability for the occurrence of H○ is not calculated, but the critical ratio value (t,
χ2, etc...) is calculated and compared with the table values at specific level of significance
(0.05 or 0.01) and degrees of freedom.
If the calculated critical value is equal to or greater than the table value, then the H○ is
rejected and the given hypothesis is discussed.
1. When the rejection region is on both ends of the normal curve, the test is known as a two tailed test or
two sided test. It is applied in cases where a difference between the sample mean and
population mean in either direction leads to rejection of the null hypothesis.
2. When rejection region is only on one side of the normal curve, the test is known as one
sided test or one tailed tests.
(i) Right Tailed Test: In the right tailed test, the rejection region or critical region lies
entirely on the right tail of the normal curve. It is applied when testing whether the population
mean is larger than some specified value.
(ii) Left Tailed Test: In the left tailed test, the critical region or rejection region lies entirely
on the left tail of the normal curve. It is applied when testing whether the population mean is
smaller than some specified value.
Introduction:–
W.S. Gosset, an English statistician working in Dublin, applied this test in 1908 for testing the
significance of the difference between the means of two samples of small size. It was
named the ‘t-test’. Gosset published under the pen name ‘Student’, hence this test is called Student’s t-test.
It was further elaborated and explained by R.A. Fisher. Student’s t-test is applied to small
samples only.
Student’s t-distribution:–
If x1, x2, x3, …, xn is a random sample of size n drawn from a normal population with mean µ and
variance σ² (not known), then the Student’s t statistic for the mean is defined as,
t = (X̄ – µ) / (S/√(n – 1)) = (X̄ – µ) / (S.E. of mean)
where S is the sample SD and S/√(n – 1) is the standard error (S.E.) of the mean.
Properties of t-distribution:–
Application of t-distribution:–
Population mean is estimated using the table t value at the specific levels of
significance and the n – 1 DF.
Example: A sample of 10 plant yielded a mean sugar level of 45 mg% with a SD of 8 mg%.
Estimate the population mean of the sugar level with 95% confidence.
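A computational check of this example can be sketched as follows, assuming the notes' convention SE = SD/√(n – 1) and the table value t = 2.262 at LS 0.05 and 9 DF:

```python
import math

n, mean, sd = 10, 45.0, 8.0          # sample of 10 plants, sugar level in mg%
t_table = 2.262                      # two-tailed t at LS 0.05, 9 DF

se = sd / math.sqrt(n - 1)           # SE of mean, notes' convention
low, high = mean - t_table * se, mean + t_table * se
print(f"95% CI for the population mean: ({low:.2f}, {high:.2f}) mg%")
# -> (38.97, 51.03) mg%
```

With the modern convention SE = SD/√n the interval would be slightly narrower; the notes use the n – 1 form throughout.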
Types of t-test:–
The following two types of t-test can be performed:
1. t-test for paired samples (t-test for two sample means): When paired observations are
arranged case-wise and a test of treatment effect is performed, this is a test of paired
sample means. It is also known as the ‘correlated t-test’ or ‘paired t-test’.
2. t-test for independent samples: The observations are classified into two groups and a test
of the difference between the group means is performed for a specified variable. This is
called the ‘unpaired t-test’.
Paired t-test:–
One of the experimental designs in biology and medicine is to assign the same
subjects both for ‘control’ and ‘experimental’ treatments.
For example, if an investigator wants to evaluate the efficacy of a new drug formulation in
reducing the blood glucose levels in men, he/she can have a sample of 10 persons who are
willing to undergo the experiment. The usual procedure is to divide the sample of 10 persons
into two groups and administer the placebo/vehicle (the medium in which the drug to be
tested is prepared) and the drug to each group. In the matched pair design, all the ten persons
are given, first, the placebo treatment and their blood samples are analyzed for blood glucose
level. Next, after the expiry of sufficient time, the same persons are given the drug, and their
blood samples are analyzed for blood sugar level. Thus, each subject yields a pair of data,
which can be analyzed using the t-test to assess significance of the mean difference.
(ii) Values obtained after control treatment and after experimental treatment.
(iii) Values obtained at two different periods- now and after a gap of a day, a month, a
year etc.
(2). Significance of the mean difference (D̄) is tested using the t-test.
(3). H○: µD = 0; n – 1 degrees of freedom (n = number of pairs of data); sampling distribution of
mean differences (D̄’s) with a population mean of µD = 0.
(4). Computation: (a) D = difference between the n pairs of values
(b) ΣD
(c) Mean difference, D̄ = ΣD/n
(d) (D – D̄), (D – D̄)² and Σ(D – D̄)²
(e) SD = √(Σ(D – D̄)²/n)
(f) SE of mean difference, SE = SD/√(n – 1)
(5). t = (D̄ – µD)/SE, that is, t = (D̄ – 0)/SE ⇒ ∴ t = D̄/SE
(6). Table t at specific LS and n–1 degrees of freedom.
(7). Decision: If calculated t > table t, reject H○.
(8). Inference/Conclusion: Based on the decision, the given H is discussed.
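The steps (1)–(8) above can be sketched in code; the paired differences below are hypothetical values chosen only to illustrate the arithmetic, using the notes' conventions for SD and SE:

```python
import math

# Hypothetical paired differences D (after - before) for n = 5 subjects
D = [2, 1, 3, 2, 2]
n = len(D)

D_bar = sum(D) / n                                   # (c) mean difference
ss = sum((d - D_bar) ** 2 for d in D)                # (d) sum of squared deviations
sd = math.sqrt(ss / n)                               # (e) SD, notes' convention
se = sd / math.sqrt(n - 1)                           # (f) SE of mean difference
t = D_bar / se                                       # (5) t = D_bar / SE

t_table = 2.776                                      # t at LS 0.05, 4 DF
print(f"t = {t:.3f}; reject H0: {t > t_table}")
```

For these illustrative values t comes out well above the table value, so H○ would be rejected at the 5% level.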
Step: 2 Computation of D̄ = ΣD/n and SD = √(Σ(D – D̄)²/n)
D̄ = ΣD/n = 2 g/100 ml
Step: 6 The sample mean difference is significant. The drug is effective in increasing the
hemoglobin content in aged people. The claim of the company is valid.
Unpaired t-test:–
Using Student’s t-test, we assume a sampling distribution of the difference between the means of
two groups, compute the SE of the difference between means, and locate the observed difference
between means in the sampling distribution in terms of a t value.
The calculated t value is compared with the table t at a specified LS and DF, a decision is made
about the H○ and an inference is drawn about the population difference between means.
Example: Two horticultural plots were each divided into six equal sub-plots. Organic
fertilizer is added to plot 1 and chemical fertilizer is added to plot 2. The yield of
fruits from plot 1 and plot 2 in kg/sub-plot is given below. Can we say the yield
due to organic fertilizer is higher than due to chemical fertilizer?
Plot 1 6.2 5.7 6.5 6.0 6.3 5.8
Plot 2 5.6 5.9 5.6 5.7 5.8 5.7
Calculations: for each sub-plot, tabulate X1, dx1 = (X1 – X̄1) and dx1², and X2, dx2 = (X2 – X̄2) and dx2².
SD = √V = √0.0533 = 0.23
t = 2.707
Table t at 0.05 LS & 10 DF = 2.228
The calculated t (2.707) > table t (2.228)
∴ reject H○
The yield due to organic fertilizer (plot 1) is significantly higher than due to chemical
fertilizer (plot 2).
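The calculation above can be reproduced with a short script (a sketch of the pooled-variance unpaired t-test; the directly computed t ≈ 2.74 differs slightly from the hand value 2.707 because of rounding in the intermediate steps):

```python
import math

plot1 = [6.2, 5.7, 6.5, 6.0, 6.3, 5.8]   # organic fertilizer, kg/sub-plot
plot2 = [5.6, 5.9, 5.6, 5.7, 5.8, 5.7]   # chemical fertilizer, kg/sub-plot

n1, n2 = len(plot1), len(plot2)
m1, m2 = sum(plot1) / n1, sum(plot2) / n2

# Pooled variance from the two sums of squared deviations
ss1 = sum((x - m1) ** 2 for x in plot1)
ss2 = sum((x - m2) ** 2 for x in plot2)
var_pooled = (ss1 + ss2) / (n1 + n2 - 2)

se = math.sqrt(var_pooled * (1 / n1 + 1 / n2))
t = (m1 - m2) / se
print(f"t = {t:.3f} with {n1 + n2 - 2} DF")  # table t at 0.05 LS & 10 DF = 2.228
```

Either way the calculated t exceeds the table value 2.228, so H○ is rejected, matching the conclusion in the notes.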
Suppose we wish to know if three drugs differ in their effectiveness in lowering serum
cholesterol in human subjects. Some subjects (persons) receive drug A, some drug B and
some drug C. After a specified period of time, measurements are taken to determine the
cholesterol level in each group. The effect of the drug in reducing cholesterol differs between
the groups, since there are 3 different drugs. Even within a particular group, for one drug,
the effect differs from subject to subject. This variation is due to differences in the genetic
makeup of the subjects and differences in their diets. By using Analysis of Variance, we will
be able to reach a conclusion regarding the equality of the effectiveness of the 3 drugs.
Principle of ANOVA:–
specific features. Whatever their nature, they are always assumed to have come from
different populations about which inferences are to be drawn.
The observed differences between these groups will consist of two components, viz., (1) A
natural variation (“error”) and (2) Variation due to “treatment” or any other factor.
In ANOVA, the two components of the observed difference are separated, estimated and
compared. The variation due to “treatment” is expected to occur between groups and
therefore, it is referred to as “between variance”. The normal variation would occur within
each of the groups and, therefore, referred to as “within variance”. The variation from the two
sources, “between” and “within” together is called the “total variance”.
If the “within” variability is greater than the “between” variability, it would mean that the
difference between the groups is not significant. On the other hand, if the “between” variability is
greater than the “within” variability, it would suggest a significant difference between the
groups.
While designing experiments in Biology or in other fields, every care must be taken to have
the natural variability (error) to be distributed randomly. If samples are drawn from different
isolated populations for the purpose of comparison of these populations, it is assumed that the
“error” is randomly distributed. In laboratory or field experiments, “randomization”
procedures must be followed in the allotment of experimental units to different groups. If such
randomization procedures are not followed, the “error” would no longer be “random” but
“systematic”. Systematic errors will definitely lead to bias in the inference.
1. Description of Data: The measurements resulting from a one way ANOVA, along with the
means of all the measurements, are displayed in tabular form.
2. Hypothesis: We test the null hypothesis that all populations or treatment means are equal.
H○: µ 1 = µ 2 = µ 3 = …… = µ n
3. Test statistic:
(a) First find out the mean of each sample: X̄1, X̄2, X̄3, ……, X̄k
(c) Take the deviation of each sample mean from the grand mean (X̄), i.e., X̄1 – X̄, X̄2 – X̄,
X̄3 – X̄, ….
(d) Square each deviation and multiply it by the number of items in the corresponding
sample, i.e., n1(X̄1 – X̄)², n2(X̄2 – X̄)², n3(X̄3 – X̄)², ….
(e) Then total these values. This is known as the Sum of Squares between the groups, or SSbetween.
∴ SSbetween = n1(X̄1 – X̄)² + n2(X̄2 – X̄)² + n3(X̄3 – X̄)² + …… + nk(X̄k – X̄)²
ni = number of items in the corresponding sample
(f) Divide the result of step (e), i.e. SSbetween, by the degrees of freedom between the groups (k – 1). This is
known as the Mean Square between the groups, or MSbetween.
MSbetween = SSbetween / (k – 1)
(g) Find out the deviations of the values of the sample items, for all the samples, from the
corresponding sample means. Square each deviation and sum them all.
This is known as the Sum of Squares within the groups, or SSwithin.
SSwithin = Σ(X1 – X̄1)² + Σ(X2 – X̄2)² + Σ(X3 – X̄3)² + …… + Σ(Xk – X̄k)²
or SSwithin = ΣΣxi² – Σ((Σxi)²/ni)
(h) Divide the result of step (g), i.e. SSwithin, by the degrees of freedom within the groups (nk – k). This is
known as the Mean Square within the samples, or MSwithin.
MSwithin = SSwithin / (nk – k)
(k) If the calculated F-value is less than the F-table value, there is no significant difference
among the sample means.
Example: The following data represent the gain in weight (in kg) of a species of edible fish
cultured in 4 diet formulations (D1, D2, D3 & D4) for a period of 3 months. Analyze these data
for significant difference among the diet formulations in terms of gain in weight.
D1 D2 D3 D4
4 8 5 1
5 7 7 4
1 9 8 1
3 6 6 3
2 10 9 1
Calculations:
(1) n1 = 5, n2 = 5, n3 = 5, n4 = 5, k = 4
Σx1 = 15, Σx2 = 40, Σx3 = 35, Σx4 = 10
(2) X̄1 = 15/5 = 3, X̄2 = 8, X̄3 = 7, X̄4 = 2
(5) MSbetween = SSbetween / (k – 1) = 130/(4 – 1) = 130/3 = 43.33
(7) For Σ((Σxi)²/ni),
D1 = (Σx1)²/n1 = (15)²/5 = 225/5 = 45
D2 = (Σx2)²/n2 = (40)²/5 = 1600/5 = 320
D3 = (Σx3)²/n3 = (35)²/5 = 1225/5 = 245
D4 = (Σx4)²/n4 = (10)²/5 = 100/5 = 20
∴ Σ((Σxi)²/ni) = 45 + 320 + 245 + 20 = 630
(8) SSwithin = ΣΣxi² – Σ((Σxi)²/ni) = 668 – 630 = 38
(9) MSwithin = SSwithin / (nk – k) = 38/(20 – 4) = 38/16 = 2.375
F-ratio = MSbetween / MSwithin = 43.33/2.375 = 18.24
The means of the four diet groups are not the same at P < 0.05.
There is significant difference among the diet formulations in terms of weight gain.
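The ANOVA arithmetic for the diet data can be reproduced as follows (a sketch of the between/within decomposition described in the steps above):

```python
groups = {
    "D1": [4, 5, 1, 3, 2],
    "D2": [8, 7, 9, 6, 10],
    "D3": [5, 7, 8, 6, 9],
    "D4": [1, 4, 1, 3, 1],
}

all_values = [x for g in groups.values() for x in g]
grand_mean = sum(all_values) / len(all_values)
k = len(groups)                          # number of groups
N = len(all_values)                      # total observations (nk)

ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in groups.values())
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                for g in groups.values())

ms_between = ss_between / (k - 1)        # 130 / 3
ms_within = ss_within / (N - k)          # 38 / 16
F = ms_between / ms_within
print(f"SSb={ss_between:.0f}, SSw={ss_within:.0f}, F={F:.2f}")
# -> SSb=130, SSw=38, F=18.25 (the notes give 18.24 after rounding)
```

The sums of squares match the hand calculation exactly, and the F-ratio far exceeds the table F at the 5% level, confirming the conclusion above.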
General Introduction:–
Various statistical methods studied so far, like measures of central tendency, averages
and measures of dispersion, are related to one variable only. There are many situations where
two variables are inter-related and a change in the value of one variable causes change in the
value of other variable. For example, we may like to study the relationship between height
and weight of persons, blood pressure and age, consumption of certain nutrient and weight
gain or intensity of stimulus and intensity of reaction. The study of nature and strength of
relationship between two variables is described in terms of correlation and regression.
In correlation analysis, we are concerned whether two variables are independent or they vary
together in positive or negative direction. In correlation the two variables are not related as
independent and dependent variables. It means in correlation both the variables are affected
by a common cause and the degree to which these variables vary together is estimated.
Correlation:–
The relationship between two or more variables is called “correlation”, and the
variables are said to be correlated. The relationship between two variables is also known as
“covariation”. The term “relationship” can be used in two different senses, viz., mutual
dependence, and cause and effect relationship.
well. Similarly when the organism increases its activity (metabolism) it consumes more
oxygen. On the other hand, when the oxygen consumption decreases, the activity, i.e., the
metabolism decreases, when the organism becomes less active, its oxygen consumption also
becomes lesser. A relationship between two variables in which a change in the value of one
of the two variables brings about a change in the value of the other variable is said to be
‘mutually dependent’.
A relationship between two variables in which changes in the values of one variable is
the cause of the changes in the values of the other variable is said to be ‘cause and effect
relationship’ between the two variables.
For example, consider the two variables, environmental temperature and the body
temperature of poikilotherms living in that environment. When there is an increase in the
environmental temperature, there is an increase in the body temperature. Here the increase in
the environmental temperature is the “cause” and the increase in the body temperature is the
“effect”. Such a relationship between two variables is known as a “cause and effect” relationship.
The cause and effect relationship between two variables may be either direct or indirect. In
the example of environmental temperature and body temperature of poikilotherms, one more
variable, namely the oxygen consumption of the organism, may also be considered.
When there is increase in the body temperature of the organisms, there is increase in their
oxygen consumption as well. Here the relationship between increase in the environmental
temperature and body temperature of the poikilotherms is direct. The cause and effect
relationship between environmental temperature and oxygen consumption is through the
other factor namely the body temperature, and therefore the relationship is indirect.
Sometimes you might come across two variables that may not have any type of direct
relationship between them. Yet the value of one variable changes when that of the other
variable changes. This may be due to third factor that causes both variables to increase their
values.
Let us consider an example to illustrate the above situation. The amount of paddy produced
and the amount of cotton produced in the same area, say, a district, obviously do not have any
direct relationship between them. Yet we may find whenever there is increase in the yield of
paddy there might be increase in the yield of the cotton as well. The reason for this
relationship might be the rainfall received in that district. Thus the relationship between the
variables paddy yield and cotton yield is due to the third factor namely the amount the
rainfall.
The nature of correlation between two variables need not be the same at all times. For example,
the relationship between height and weight of humans, or the length and weight of organisms,
may not be the same. Generally, with an increase in height or length, there is an increase in
weight. However, it is common to see people who are tall weighing less and those who are
short weighing more.
Significance of Correlation:–
The study of correlation is of great significance in practical life, because of the following
reasons:
1. The study of correlation enables us to know the nature, direction and degree of relationship
between two or more variables.
2. Correlation studies help us to estimate the changes in the value of one variable as a result
of change in the value of related variable. This is called regression analysis.
3. Correlation analysis helps us in understanding the behavior of certain events under specific
circumstances. For example, we can identify the factors for rainfall in a given area and
how these factors influence paddy production.
4. Correlation facilitates the decision making in the business world. It reduces element of
uncertainty in decision-making.
5. It helps in making predictions.
Types of Correlation:–
Correlation between variables may be simple or multiple. A simple correlation deals
with only two variables, whereas a multiple correlation deals with more than two variables.
A correlation may be positive or negative and, whether positive or negative, it may be linear
or non-linear.
1. Positive Correlation:–
A correlation between two variables in which, with an increase in the values of one
variable the values of the other variable also increases, and with a decrease in the value of the
one variable the value of the other variable also decreases, is said to be a positive correlation.
In other words, in a positive correlation the values of both the variables move in the same
direction. For example, the correlation between the environmental
temperature and the body temperature of poikilotherms is a positive correlation.
2. Negative Correlation:–
A correlation between two variables in which when there is an increase in the value of
one variable, the values of the other variable decreases, and when there is a decrease in the
values of one variable the other variable increases, is said to be a negative correlation.
In other words, in a negative correlation the values of the two variables move in opposite
direction. For example, the correlation between environmental temperature and bacterial
growth, having a cause and effect relationship, is negative one. With an increase in the
temperature the bacterial growth increases.
3. Linear Correlation:–
When the values of two variables vary in a constant ratio, the correlation between two
variables is said to be linear. The correlation between the optical density and the intensity of
the color of a solution is an example of linear correlation.
If the amount of change in the value of one variable and the corresponding amount of
change in the other variable are not in a constant ratio, the correlation between the two
variables is said to be non-linear or curvilinear. The correlation between the length and
weight of fish is generally a non-linear correlation.
We can study the presence or absence and extent of correlation between variables by
one of the following methods: (1) Scatter diagram; (2) Correlation graph; and (3) Karl
Pearson’s Coefficient of Correlation (r).
Scatter diagram and correlation graphs are graphical methods, and they indicate only the
nature of the correlation, whether positive or negative. No numerical measure of the extent
of correlation is given by these methods. The Karl Pearson’s coefficient of correlation gives
the magnitude of correlation between two variables in numerical terms.
It is an easy and simple method of studying correlation between two variables. Scatter
diagram is constructed as follows. If X and Y are pairs of variables, the values of the variable
X are marked in the X-axis and the values of Y are marked on Y-axis. A point is plotted
against each value of X and the corresponding Y value. A swarm of dots is obtained, and this
is called a scatter diagram. The nature of the scatter of dots in the diagram gives an idea of the
nature of correlation between the variables, as shown in the following figures.
If the plotted dots form a straight line running left to right in the upward direction, the
correlation is perfectly positive.
If the dots are scattered around a straight line running from left to right in an upward
direction, then the correlation between the two variables is positive.
Positive correlation
If the dots of the scatter diagram form a straight line running from left to right in the
downward direction, the correlation between the two variables is perfect negative.
A scatter diagram in which the plotted dots form a swarm around a straight line running from left
to right in the downward direction indicates negative correlation between the variables.
Negative correlation
In cases where there is no correlation between the variables, the scatter of dots in the diagram
will not form either a straight line or even a flow of dots from left to right in the upward or
downward direction.
No correlation
When the given variables are with reference to a period of time, correlation graph is
the ideal method to understand the relationship between the two variables. To draw the
correlation graph, the period of time is marked on the X-axis and the value of the two
variables on the Y-axis or, if necessary, on two Y-axes. The points of a variable plotted
against time are joined by a line to form a curve. Different curves are constructed for each of
the two variables (figure).
In a correlation graph, if the curves of the two variables are close to each other and if they
move in the same direction, the variables have a positive correlation. On the other hand, if the
curves of the two variables move in opposite directions, the variables are negatively
correlated.
r = Σ(X – X̄)(Y – Ȳ) / (n·Sx·Sy)
where, r = Coefficient of correlation,
X = variable x,
Y = variable y,
X̄ = Mean of variable x,
Ȳ = Mean of variable y,
n = number of pairs of variables,
Sx = SD of variable x, and
Sy = SD of variable y
r = [ΣXY – (ΣX·ΣY)/n] / √{[ΣX² – (ΣX)²/n]·[ΣY² – (ΣY)²/n]}
or r = Σdx·dy / √(Σdx² · Σdy²), where dx = X – X̄ and dy = Y – Ȳ
An important thing to be understood about the coefficient of correlation (r) is that it does not
represent a percent agreement between the two given variables. At r = –1 or +1 the agreement is
complete, and at r = 0 there is none; but if r = 0.5 or any other value, it does not indicate
50% or any other corresponding percent agreement between the two variables. The percent agreement
between the two variables is measured by r², and therefore increases faster than linearly with increasing r values.
Example: Obtain the coefficient of correlation for the following data on the length (X in cm)
and weight (Y in g) of fish.
X 5 7 3 1 9 12 8 3
Y 8 9 5 4 9 13 7 9
Calculations:
X    Y    dx = (X – X̄)    dy = (Y – Ȳ)    dx²    dy²    dx·dy
5 8 –1 0 1 0 0
7 9 1 1 1 1 1
3 5 –3 –3 9 9 9
1 4 –5 –4 25 16 20
9 9 3 1 9 1 3
12 13 6 5 36 25 30
8 7 2 –1 4 1 –2
3 9 –3 1 9 1 –3
Σx = 48 Σy = 64 Σdx2 = 94 Σdy2 = 54 Σdx·dy = 58
X̄ = ΣX/n = 48/8 = 6,   Ȳ = ΣY/n = 64/8 = 8
r = Σdx·dy / √(Σdx² · Σdy²) = 58/√(94 × 54) = 58/√5076 = 58/71.25
r = 0.81
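The same coefficient can be computed directly with the deviation-score formula r = Σdx·dy / √(Σdx²·Σdy²) (a quick computational sketch of the table above):

```python
import math

X = [5, 7, 3, 1, 9, 12, 8, 3]   # length in cm
Y = [8, 9, 5, 4, 9, 13, 7, 9]   # weight in g

n = len(X)
mx, my = sum(X) / n, sum(Y) / n

dx = [x - mx for x in X]
dy = [y - my for y in Y]

sum_dxdy = sum(a * b for a, b in zip(dx, dy))        # Σdx·dy = 58
r = sum_dxdy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
print(f"r = {r:.2f}")  # -> r = 0.81
```

The deviation sums (94, 54, 58) and the final r = 0.81 agree with the hand calculation.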
Regression
Introduction:–
Two regression lines, x on y and y on x, can be drawn for a series of bivariate data. The
regression line of x on y gives the best estimate of the value of x when a value of y is given. The
regression line of y on x gives the best estimate of the value of y when a value of x is
given.
Regression lines are represented by algebraic expressions called regression equations. Regression
equations are used to draw the regression lines. They are also used as numerical methods to
find the best estimate of the values of one variable from the values of the other variable.
Regression equations:–
For every series of bivariate data two regression equations are derived, viz.,
regression equation of x on y that is used to draw the regression line of x on y, and the
regression equation of y on x that is used to draw the regression line of y on x.
In the above equations, X̄ is the mean of variable x, Ȳ is the mean of variable y, Sx and Sy are the
SDs of the variables x and y respectively, and r is the correlation coefficient between the
variables x and y.
If a value of y is given, the corresponding value of x can be estimated using the equation (1).
If a value of x is given, the corresponding value of y can be estimated using the equation (2).
Equation (1) yields a final form as x = ay + b, and the equation (2), y = ax + b.
Suppose there is a unit change in the value of y, the corresponding change in the value of x is
given by the regression coefficient of x on y. That is, when there is unit change in the value
of y, the value of x will change by the amount rSx/Sy. In other words, regression coefficient
decides what should be the slope of the regression line of x on y.
Similarly, the regression coefficient of y on x, rSy/Sx, gives the amount of change in the value
of y when there is a unit change in the value of x, and thus describes the slope of the
regression line of y on x.
The formulae for obtaining the regression coefficient and constant of the two regression
equations are as follows.
Regression equation of x on y,
(X – X̄) = r·(Sx/Sy)·(Y – Ȳ)
i.e., X = AY + B
A = r·(Sx/Sy)  or  A = Σdx·dy / Σdy²
B = X̄ – A·Ȳ  or  B = [ΣX·ΣY² – ΣY·ΣXY] / [nΣY² – (ΣY)²]
Regression equation of y on x,
(Y – Ȳ) = r·(Sy/Sx)·(X – X̄)
i.e., Y = AX + B
A = Coefficient of x in the regression equation of y on x, and therefore the regression
coefficient of y on x.
A = r·(Sy/Sx)  or  A = Σdx·dy / Σdx²
B = Ȳ – A·X̄  or  B = [ΣY·ΣX² – ΣX·ΣXY] / [nΣX² – (ΣX)²]
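As an illustration, both regression equations can be computed from deviation sums; the fish length–weight data from the correlation example are reused here (an assumption for illustration only, not a worked example from the notes):

```python
X = [5, 7, 3, 1, 9, 12, 8, 3]   # length (cm)
Y = [8, 9, 5, 4, 9, 13, 7, 9]   # weight (g)

n = len(X)
mx, my = sum(X) / n, sum(Y) / n

sum_dxdy = sum((x - mx) * (y - my) for x, y in zip(X, Y))   # 58
sum_dx2 = sum((x - mx) ** 2 for x in X)                     # 94
sum_dy2 = sum((y - my) ** 2 for y in Y)                     # 54

# Regression of y on x: Y = A*X + B, with A = sum_dxdy / sum_dx2
A_yx = sum_dxdy / sum_dx2
B_yx = my - A_yx * mx

# Regression of x on y: X = A*Y + B, with A = sum_dxdy / sum_dy2
A_xy = sum_dxdy / sum_dy2
B_xy = mx - A_xy * my

print(f"y on x: Y = {A_yx:.3f}X + {B_yx:.3f}")
print(f"x on y: X = {A_xy:.3f}Y + {B_xy:.3f}")
```

Note that the two slopes differ (0.617 vs 1.074), which is the one-way character of regression mentioned later: the regression of x on y is not the same as the regression of y on x.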
Regression lines:–
Using the two regression equations, two regression lines, one x on y and another y on
x, can be drawn. Suitable values of the variable x are chosen and their corresponding y values
are estimated using the regression equation of y on x. The chosen x values and their estimated y values are plotted to draw the regression line of y on x. Likewise, suitable y values are chosen and their corresponding x values are estimated using the regression equation of x on y. The chosen y values and their estimated x values are plotted to draw the regression line of x on y.
(1) If the correlation between the two given variables is perfectly positive or perfectly negative, the two regression lines coincide. That is, there will be only one regression line representing both x on y and y on x.
(2) The higher the degree of correlation between the two given variables x and y, i.e., the closer the value of r is to either –1 or +1, the closer the two regression lines lie to each other. On the other hand, if the correlation between the two variables is low, i.e., when r is close to 0, the regression lines lie further apart, as shown in the following figure. Thus the angle between the two regression lines indicates how closely the two variables are related.
(3) When there is no correlation between the two variables, i.e., when r = 0, the two regression lines intersect each other at right angles, as shown below.
(4) The two regression lines intersect each other at the point (X̄, Ȳ). If the values of x and y are found with the help of the two regression equations, i.e., by substituting one equation into the other and solving, these values will be X̄ and Ȳ.
Though correlation and regression are closely related to each other, there are certain
differences between them.
(1) Correlation measures the degree and nature of the relationship between the two given variables. Regression, on the other hand, gives the average change in the value of one variable for a given change in the value of the other variable.
(2) Correlation is a two way relationship between the two variables. That is, if x and y are the
two variables given, the correlation between x and y is the same as the correlation between y
and x. On the other hand, regression is a one-way relationship. The regression of x on y is not
the same as the regression of y on x.
Critical values of the F distribution at the 5% level of significance (α = 0.05). Rows give the degrees of freedom for the denominator (error); columns give the degrees of freedom for the numerator.
df 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6 5.96 5.94 5.91 5.89 5.87 5.86 5.8
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.7 4.68 4.66 4.64 4.62 4.56
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.1 4.06 4.03 4 3.98 3.96 3.94 3.87
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.6 3.57 3.55 3.53 3.51 3.44
8 5.32 4.46 4.07 3.84 3.69 3.58 3.5 3.44 3.39 3.35 3.31 3.28 3.26 3.24 3.22 3.15
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.1 3.07 3.05 3.03 3.01 2.94
10 4.96 4.1 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.94 2.91 2.89 2.86 2.85 2.77
11 4.84 3.98 3.59 3.36 3.2 3.09 3.01 2.95 2.9 2.85 2.82 2.79 2.76 2.74 2.72 2.65
12 4.75 3.89 3.49 3.26 3.11 3 2.91 2.85 2.8 2.75 2.72 2.69 2.66 2.64 2.62 2.54
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.63 2.6 2.58 2.55 2.53 2.46
14 4.6 3.74 3.34 3.11 2.96 2.85 2.76 2.7 2.65 2.6 2.57 2.53 2.51 2.48 2.46 2.39
15 4.54 3.68 3.29 3.06 2.9 2.79 2.71 2.64 2.59 2.54 2.51 2.48 2.45 2.42 2.4 2.33
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.46 2.42 2.4 2.37 2.35 2.28
17 4.45 3.59 3.2 2.96 2.81 2.7 2.61 2.55 2.49 2.45 2.41 2.38 2.35 2.33 2.31 2.23
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.37 2.34 2.31 2.29 2.27 2.19
19 4.38 3.52 3.13 2.9 2.74 2.63 2.54 2.48 2.42 2.38 2.34 2.31 2.28 2.26 2.23 2.16
20 4.35 3.49 3.1 2.87 2.71 2.6 2.51 2.45 2.39 2.35 2.31 2.28 2.25 2.23 2.2 2.12
22 4.3 3.44 3.05 2.82 2.66 2.55 2.46 2.4 2.34 2.3 2.26 2.23 2.2 2.17 2.15 2.07
24 4.26 3.4 3.01 2.78 2.62 2.51 2.42 2.36 2.3 2.25 2.22 2.18 2.15 2.13 2.11 2.03
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.18 2.15 2.12 2.09 2.07 1.99
28 4.2 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.15 2.12 2.09 2.06 2.04 1.96
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.13 2.09 2.06 2.04 2.01 1.93
35 4.12 3.27 2.87 2.64 2.49 2.37 2.29 2.22 2.16 2.11 2.08 2.04 2.01 1.99 1.96 1.88
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.04 2 1.97 1.95 1.92 1.84
45 4.06 3.2 2.81 2.58 2.42 2.31 2.22 2.15 2.1 2.05 2.01 1.97 1.94 1.92 1.89 1.81
50 4.03 3.18 2.79 2.56 2.4 2.29 2.2 2.13 2.07 2.03 1.99 1.95 1.92 1.89 1.87 1.78
60 4 3.15 2.76 2.53 2.37 2.25 2.17 2.1 2.04 1.99 1.95 1.92 1.89 1.86 1.84 1.75
70 3.98 3.13 2.74 2.5 2.35 2.23 2.14 2.07 2.02 1.97 1.93 1.89 1.86 1.84 1.81 1.72
80 3.96 3.11 2.72 2.49 2.33 2.21 2.13 2.06 2 1.95 1.91 1.88 1.84 1.82 1.79 1.7
100 3.94 3.09 2.7 2.46 2.31 2.19 2.1 2.03 1.97 1.93 1.89 1.85 1.82 1.79 1.77 1.68
200 3.89 3.04 2.65 2.42 2.26 2.14 2.06 1.98 1.93 1.88 1.84 1.8 1.77 1.74 1.72 1.62
500 3.86 3.01 2.62 2.39 2.23 2.12 2.03 1.96 1.9 1.85 1.81 1.77 1.74 1.71 1.69 1.59
1000 3.85 3 2.61 2.38 2.22 2.11 2.02 1.95 1.89 1.84 1.8 1.76 1.73 1.7 1.68 1.58
Critical values of the F distribution at the 1% level of significance (α = 0.01). Rows give the degrees of freedom for the denominator (error); columns give the degrees of freedom for the numerator.
df 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
4 21.2 18 16.7 15.9 15.5 15.2 14.9 14.8 14.7 14.6 14.5 14.4 14.3 14.3 14.2 14.0
5 16.3 13.3 12.1 11.4 10.9 10.7 10.5 10.3 10.2 10.1 9.96 9.89 9.82 9.77 9.72 9.55
6 13.8 10.9 9.78 9.15 8.75 8.47 8.26 8.1 7.98 7.87 7.79 7.72 7.66 7.61 7.56 7.4
7 12.3 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.54 6.47 6.41 6.36 6.31 6.16
8 11.3 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.73 5.67 5.61 5.56 5.52 5.36
9 10.6 8.02 6.99 6.42 6.06 5.8 5.61 5.47 5.35 5.26 5.18 5.11 5.05 5.01 4.96 4.81
10 10.0 7.56 6.55 5.99 5.64 5.39 5.2 5.06 4.94 4.85 4.77 4.71 4.65 4.6 4.56 4.41
11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54 4.46 4.4 4.34 4.29 4.25 4.1
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.5 4.39 4.3 4.22 4.16 4.1 4.05 4.01 3.86
13 9.07 6.7 5.74 5.21 4.86 4.62 4.44 4.3 4.19 4.1 4.02 3.96 3.91 3.86 3.82 3.66
14 8.86 6.51 5.56 5.04 4.7 4.46 4.28 4.14 4.03 3.94 3.86 3.8 3.75 3.7 3.66 3.51
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4 3.89 3.8 3.73 3.67 3.61 3.56 3.52 3.37
16 8.53 6.23 5.29 4.77 4.44 4.2 4.03 3.89 3.78 3.69 3.62 3.55 3.5 3.45 3.41 3.26
17 8.4 6.11 5.19 4.67 4.34 4.1 3.93 3.79 3.68 3.59 3.52 3.46 3.4 3.35 3.31 3.16
18 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.6 3.51 3.43 3.37 3.32 3.27 3.23 3.08
19 8.19 5.93 5.01 4.5 4.17 3.94 3.77 3.63 3.52 3.43 3.36 3.3 3.24 3.19 3.15 3
20 8.1 5.85 4.94 4.43 4.1 3.87 3.7 3.56 3.46 3.37 3.29 3.23 3.18 3.13 3.09 2.94
22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26 3.18 3.12 3.07 3.02 2.98 2.83
24 7.82 5.61 4.72 4.22 3.9 3.67 3.5 3.36 3.26 3.17 3.09 3.03 2.98 2.93 2.89 2.74
26 7.72 5.53 4.64 4.14 3.82 3.59 3.42 3.29 3.18 3.09 3.02 2.96 2.9 2.86 2.82 2.66
28 7.64 5.45 4.57 4.07 3.75 3.53 3.36 3.23 3.12 3.03 2.96 2.9 2.84 2.79 2.75 2.6
30 7.56 5.39 4.51 4.02 3.7 3.47 3.3 3.17 3.07 2.98 2.91 2.84 2.79 2.74 2.7 2.55
35 7.42 5.27 4.4 3.91 3.59 3.37 3.2 3.07 2.96 2.88 2.8 2.74 2.69 2.64 2.6 2.44
40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.8 2.73 2.66 2.61 2.56 2.52 2.37
45 7.23 5.11 4.25 3.77 3.45 3.23 3.07 2.94 2.83 2.74 2.67 2.61 2.55 2.51 2.46 2.31
50 7.17 5.06 4.2 3.72 3.41 3.19 3.02 2.89 2.79 2.7 2.63 2.56 2.51 2.46 2.42 2.27
60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63 2.56 2.5 2.44 2.39 2.35 2.2
70 7.01 4.92 4.07 3.6 3.29 3.07 2.91 2.78 2.67 2.59 2.51 2.45 2.4 2.35 2.31 2.15
80 6.96 4.88 4.04 3.56 3.26 3.04 2.87 2.74 2.64 2.55 2.48 2.42 2.36 2.31 2.27 2.12
100 6.9 4.82 3.98 3.51 3.21 2.99 2.82 2.69 2.59 2.5 2.43 2.37 2.31 2.27 2.22 2.07
200 6.76 4.71 3.88 3.41 3.11 2.89 2.73 2.6 2.5 2.41 2.34 2.27 2.22 2.17 2.13 1.97
500 6.69 4.65 3.82 3.36 3.05 2.84 2.68 2.55 2.44 2.36 2.28 2.22 2.17 2.12 2.07 1.92
1000 6.66 4.63 3.8 3.34 3.04 2.82 2.66 2.53 2.43 2.34 2.27 2.2 2.15 2.1 2.06 1.9
Critical values of the F distribution at the 0.1% level of significance (α = 0.001). Rows give the degrees of freedom for the denominator (error); columns give the degrees of freedom for the numerator.
df 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
4 74.14 61.25 56.18 53.44 51.71 50.53 49.66 49 48.5 48.1 47.7 47.4 47.2 46.9 46.8 46.1
5 47.18 37.12 33.2 31.09 29.75 28.84 28.16 27.7 27.3 26.9 26.7 26.4 26.2 26.1 25.9 25.4
6 35.51 27 23.7 21.92 20.8 20.03 19.46 19.0 18.7 18.4 18.2 17.9 17.8 17.7 17.6 17.1
7 29.25 21.69 18.77 17.2 16.21 15.52 15.02 14.6 14.3 14.1 13.9 13.7 13.6 13.4 13.3 12.9
8 25.42 18.49 15.83 14.39 13.49 12.86 12.4 12.1 11.8 11.5 11.4 11.2 11.1 10.9 10.8 10.5
9 22.86 16.39 13.9 12.56 11.71 11.13 10.7 10.4 10.1 9.89 9.72 9.57 9.44 9.33 9.24 8.9
10 21.04 14.91 12.55 11.28 10.48 9.93 9.52 9.2 8.96 8.75 8.59 8.45 8.33 8.22 8.13 7.8
11 19.69 13.81 11.56 10.35 9.58 9.05 8.66 8.36 8.12 7.92 7.76 7.63 7.51 7.41 7.32 7.01
12 18.64 12.97 10.8 9.63 8.89 8.38 8 7.71 7.48 7.29 7.14 7.01 6.89 6.79 6.71 6.41
13 17.82 12.31 10.21 9.07 8.35 7.86 7.49 7.21 6.98 6.8 6.65 6.52 6.41 6.31 6.23 5.93
14 17.14 11.78 9.73 8.62 7.92 7.44 7.08 6.8 6.58 6.4 6.26 6.13 6.02 5.93 5.85 5.56
15 16.59 11.34 9.34 8.25 7.57 7.09 6.74 6.47 6.26 6.08 5.94 5.81 5.71 5.62 5.54 5.25
16 16.12 10.97 9.01 7.94 7.27 6.81 6.46 6.2 5.98 5.81 5.67 5.55 5.44 5.35 5.27 4.99
17 15.72 10.66 8.73 7.68 7.02 6.56 6.22 5.96 5.75 5.58 5.44 5.32 5.22 5.13 5.05 4.78
18 15.38 10.39 8.49 7.46 6.81 6.36 6.02 5.76 5.56 5.39 5.25 5.13 5.03 4.94 4.87 4.59
19 15.08 10.16 8.28 7.27 6.62 6.18 5.85 5.59 5.39 5.22 5.08 4.97 4.87 4.78 4.7 4.43
20 14.82 9.95 8.1 7.1 6.46 6.02 5.69 5.44 5.24 5.08 4.94 4.82 4.72 4.64 4.56 4.29
22 14.38 9.61 7.8 6.81 6.19 5.76 5.44 5.19 4.99 4.83 4.7 4.58 4.49 4.4 4.33 4.06
24 14.03 9.34 7.55 6.59 5.98 5.55 5.24 4.99 4.8 4.64 4.51 4.39 4.3 4.21 4.14 3.87
26 13.74 9.12 7.36 6.41 5.8 5.38 5.07 4.83 4.64 4.48 4.35 4.24 4.14 4.06 3.99 3.72
28 13.5 8.93 7.19 6.25 5.66 5.24 4.93 4.7 4.51 4.35 4.22 4.11 4.01 3.93 3.86 3.6
30 13.29 8.77 7.05 6.13 5.53 5.12 4.82 4.58 4.39 4.24 4.11 4 3.91 3.83 3.75 3.49
35 12.9 8.47 6.79 5.88 5.3 4.89 4.6 4.36 4.18 4.03 3.9 3.79 3.7 3.62 3.55 3.29
40 12.61 8.25 6.6 5.7 5.13 4.73 4.44 4.21 4.02 3.87 3.75 3.64 3.55 3.47 3.4 3.15
45 12.39 8.09 6.45 5.56 5 4.61 4.32 4.09 3.91 3.76 3.64 3.53 3.44 3.36 3.29 3.04
50 12.22 7.96 6.34 5.46 4.9 4.51 4.22 4 3.82 3.67 3.55 3.44 3.35 3.27 3.2 2.95
60 11.97 7.77 6.17 5.31 4.76 4.37 4.09 3.87 3.69 3.54 3.42 3.32 3.23 3.15 3.08 2.83
70 11.8 7.64 6.06 5.2 4.66 4.28 3.99 3.77 3.6 3.45 3.33 3.23 3.14 3.06 2.99 2.74
80 11.67 7.54 5.97 5.12 4.58 4.2 3.92 3.71 3.53 3.39 3.27 3.16 3.07 3 2.93 2.68
100 11.5 7.41 5.86 5.02 4.48 4.11 3.83 3.61 3.44 3.3 3.18 3.07 2.99 2.91 2.84 2.59
200 11.16 7.15 5.63 4.81 4.29 3.92 3.65 3.43 3.26 3.12 3.01 2.9 2.82 2.74 2.67 2.42
500 10.96 7 5.51 4.69 4.18 3.81 3.54 3.33 3.16 3.02 2.91 2.81 2.72 2.64 2.58 2.33
1000 10.89 6.96 5.46 4.66 4.14 3.78 3.51 3.3 3.13 2.99 2.87 2.77 2.69 2.61 2.54 2.3