
BIOSTATISTICS

CBCS_Syllabus_M.Sc._Botany/Zoology/Microbiology_Sem – I

104 - Biostatistics and Bioinformatics

Unit - 1 Basics of Biostatistics

1.1 Measure of Central tendency: Mean, Mode and Median, Frequency distribution

1.2 Measures of dispersion: Range, Variance, Standard deviation, Coefficient of variation

1.3 Confidence limits and confidence intervals

1.4 Chi-square Test

Unit - 2 Statistical Tests in Biology

2.1 Student's t test, meaning of significance and significance levels

2.2 Paired and unpaired t test

2.3 Analysis of Variance

2.4 Regression and Correlation analysis



Biostatistics – General Introduction

The word statistics comes from the Latin word status (state), which indicates the historical
importance of governmental data gathering.
Gottfried Achenwall first used the German word "Statistik".
The term statistics has been used to indicate facts and figures of any kind: health
statistics, vital statistics, or business statistics.
It is also used to refer to a body of knowledge known as statistical methods developed for
handling data in general, particularly in the fields of experimentation and research.
The various definitions of statistics are – (1) Principles and methods for the collection,
organization, analysis and interpretation of numerical data of different kind.
(2) The science and art of dealing with variation in such a way as to obtain reliable
results. (3) The science of experimentation, which may be regarded as mathematics
applied to observational data.

Statistics is a science assisting you to make decisions under uncertainty. The decision
making process must be based on data, not on personal opinion, belief or rumour.
Statistics is a branch of science which deals with scientific methods of collection,
organization, presentation, analysis and interpretation of the data obtained by conducting a
survey or an experimental study.

There are five stages in a statistical investigation:

1. Collection of data: The data is collected with a specific and well defined purpose. If the
data collected is faulty, the conclusions drawn will also be false. Therefore, maximum care
should be taken while collecting the data.
2. Organization of the data: From just direct observation, it is not possible to derive any
conclusions. The data should be edited, classified and tabulated.
3. Presentation of the data: The data should be presented either in tabular or graphic form.
It can be represented by diagrams also.
4. Analysis of the data: Mean, Standard deviation, Range, etc.
5. Interpretation of the data: It is the last stage in statistical analysis but the most difficult
one. It requires a high degree of skill and interpretation.


Biostatistics: The science in which the mathematical principles of handling and analyzing
data are applied to biological fields like medicine, public health, agricultural genetics, etc.

Importance and Scope of Statistical Methods in Biology:–

When an investigator gets data by experiment, by interview or from existing records, the
result is a series of numbers, the observations.
Such observations are a sample from a larger population, and we use these samples in order
to draw conclusions about the population.
Generalization made from sample to population entails some risk and therefore reasoning
from sample to population requires systematic thinking.
Statistical theory is largely devoted to this reasoning. So when a biologist sets the goal of
an investigation and collects information (data), he requires help from statistical methods
to analyze the data and to draw conclusions, which help to decide the future course of action.
For this reason, Biostatistics is applied in the majority of branches of biological sciences such
as Agriculture, Genetics, Physiology, Biochemistry, Molecular Biology, Taxonomy,
Medicine and Health Sciences.

Classification

Classification is the grouping of data according to their identity, similarity or resemblance.

Objectives of classification:

i. It condenses the mass of data in an easily assimilable form


ii. It eliminates unnecessary details
iii. It facilitates comparison and highlights the significant aspects of data
iv. It helps in drawing inferences
v. It helps in statistical treatment of the information collected


Types of Classification

I. Chronological or Temporal Classification: The collected data are arranged according to
the order of time expressed in years, months, weeks, etc. It is generally classified in ascending
order of time, e.g., data related to population, sales of a firm, imports and exports, etc.

II. Geographical Classification: The collected data are classified according to geographical
region or place, e.g., production of wheat in different countries, production of tomatoes in
different states of India, etc.

III. Qualitative Classification: The collected data are classified on the basis of some attribute
or quality like gender, religion, literacy, employment, etc. Qualitative classification is simple
when there is only one attribute, and manifold when there is more than one.

IV. Quantitative Classification: The collected data are grouped with reference to
characteristics which can be measured and numerically described such as height, weight, age,
income, sales, etc.

BASIC TERMINOLOGIES OF BIOSTATISTICS


• Population: The totality or aggregate of individuals under study is known as population.
OR A collection of the set of individuals or objects or events whose properties are to be
analyzed.
• Sample: It is a fraction of population that can be regarded as a representative of the entire
population.
• Sample size: The total number of observations in a sample is called sample size (n).
• Sampling: The procedure of drawing a sample from the population is called sampling.
• Parameter: A descriptive measure of some characteristics of the population.
• Variable: The characteristics on which individuals or objects differ among themselves is
called variable.
• Quantitative variable: Whenever the measurements of a variable are possible on a scale
in some appropriate units, it is called quantitative variable.
• Qualitative variable: It shows variation in object in terms of quality or kind and not in
terms of magnitude.


• Continuous variable: It can assume all values within an interval and can be divided into
smaller and smaller units. Theoretically the data point can lie anywhere on the numerical
scale.
• Discrete (Discontinuous) variable: It is one where the values of the variable differ from
one another by definite amounts.
• Ranked variable: They are the variables which cannot be measured but can be ranked by
their magnitude/size.
• Derived variable: The derived variables are calculated from two or more independent
variables. They show the relationship between variables.
• Variates: Measurements on quantitative variables are called variates.
• Attributes: Measurements on qualitative variables are called attributes.
• Frequency distribution: It is the grouping of data into ordered classes and the determination
of the number of observations falling in each of the classes.
• Ordered array: It is listing of data in order of magnitude from the smallest value to the
largest value.
• Frequency table: The tabular form of a frequency distribution is called a frequency
table.
• Cumulative frequency: It is obtained by successively adding the frequencies of all the
classes up to and including that class.
• Relative frequency: It is obtained by dividing the class frequency by the total frequency.
• Tabulation: It is a process of summarizing classified data in the form of a table.
• Table: It is a systematic arrangement of classified data in columns and rows.
• Simple/One way table: A table which contains data on 1 characteristic is called one-way
table.
• Two way table: A table which contains data on 2 characteristics is called two way table.
• Manifold table: A table which contains data on more than 2 characteristics is called a
manifold table.
• Data: Set of values collected for the variable from each of the elements belonging to the
sample. OR It is a collective term referring to a group of observations.
• Primary data: The data collected by actual observations, measurements, counting, direct
recording, etc, during the course of investigation is called primary data.
• Secondary data: Any data, detached from the original source and reprocessed for one's own
purpose by any other person or organization, is called secondary data.


1.1 Measures of Central Tendency

The measure of central tendency is defined as: “It is a sort of average or typical value
of the items in the series and its function is to summarize the series in terms of this average
value”.

Thus a single expression representing the whole group is selected which may convey a fairly
adequate idea about the whole group. This single expression in statistics is known as the
average. Averages are generally the central part of the distribution and therefore they are
also called the measure of central tendency.

The most common measures of central tendency are:


1. Arithmetic Mean or Mean
2. Median
3. Mode.

Each of them, in its own way, can be called a representative of the characteristics of
the whole group and thus the performance of the group as a whole can be described by the
single value which each of these measures gives. The values of mean, median and mode also
help us in comparing two or more groups or frequency distributions in terms of typical or
characteristic performances.

The main aim of an average is to present huge mass of statistical data in a simple and concise
manner. This makes the central theme of the data readily understandable. The averages are
extremely useful for purposes of comparison.

A good average must have the following characteristics:


1. It should be rigidly defined so that different persons may not interpret it differently.
2. It should be easy to understand and easy to calculate.
3. It should be based on all the observations of the data.
4. It should be easily subjected to further mathematical calculations.
5. It should be least affected by the fluctuations of the sampling.
6. It should not be unduly affected by the extreme values.
7. It should be easy to interpret.


Mean (Arithmetic Mean):–

Arithmetic mean, commonly called ‘mean’, is defined as the sum of all the variates of
a variable divided by the total number of items in the sample.

Mean = Sum of all the items in the sample / Total number of items in the sample

x̄ = Σx / n = (x1 + x2 + x3 + … + xn) / n

Where, x̄ = Arithmetic mean of the variable x
Σx = Sum of all the items of the variable x
n = Total number of items in the sample
The mean should be expressed in the same unit in which the data are given.

Example (Mean of raw or ungrouped data): Find the mean of the triglyceride values in the
blood samples of 10 patients in a hospital.
25, 30, 21, 55, 47, 10, 15, 17, 45, 35
Solution:
x̄ = Σx / n
= (25 + 30 + 21 + 55 + 47 + 10 + 15 + 17 + 45 + 35) / 10 = 300 / 10
x̄ = 30
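Illustration: the same calculation as a minimal Python sketch (an editorial addition, not part of the original solution; it only re-uses the data from the example above):

    # Mean of the triglyceride values from the worked example above.
    values = [25, 30, 21, 55, 47, 10, 15, 17, 45, 35]
    mean = sum(values) / len(values)   # sum of the items / number of items
    print(mean)                        # 30.0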

Mean of grouped data:

Let x1, x2, x3, …, xn be the variates and let f1, f2, f3, …, fn be their corresponding frequencies;
then their mean x̄ is given by,

x̄ = (f1x1 + f2x2 + f3x3 + … + fnxn) / (f1 + f2 + f3 + … + fn) = Σfixi / Σfi = Σfixi / N

Where N = f1 + f2 + f3 + … + fn


Example (Mean of grouped data): Find the mean blood LDL of a sample of patients from the
following data.
Blood LDL 52 58 60 65 68 70 75
No. of patients 7 5 4 6 3 3 2

Solution:
X F f×x
52 7 364
58 5 290
60 4 240
65 6 390
68 3 204
70 3 210
75 2 150
Total 30 1848

Here, N = Σf = 30 and Σfx = 1848

x̄ = Σfx / N = 1848 / 30 = 61.6
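Illustration: a minimal Python sketch of the grouped-data (weighted) mean, using the LDL table above (the variable names are ours):

    # Mean of grouped data: sum(f * x) / sum(f).
    x = [52, 58, 60, 65, 68, 70, 75]   # variates (blood LDL)
    f = [7, 5, 4, 6, 3, 3, 2]          # corresponding frequencies
    mean = sum(fi * xi for fi, xi in zip(f, x)) / sum(f)
    print(round(mean, 1))              # 61.6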

Merits, Demerits and Uses of Mean:–


Merits: 1. It can be easily calculated.
2. Its calculation is based on all the observations.
3. It is easy to understand.
4. It is rigidly defined by a mathematical formula.
5. It is least affected by sampling fluctuations.
6. It is the best measure to compare two or more series (data).
7. It is the average obtained by calculations and it does not depend upon any position.

Demerits: 1. It may not be present in the actual data, so it is theoretical.

2. Extreme values have a greater effect on the mean.
3. It cannot be calculated if all the values are not known.
4. It cannot be determined for qualitative data such as love, beauty, honesty, etc.
5. The mean may lead to fallacious conclusions in the absence of the original observations.


Uses: 1. A common man uses the mean for calculating the average marks obtained by a student.
2. It is extensively used in practical statistics.
3. Estimates are always obtained by the mean.
4. A businessman uses it to find out the operation cost, profit per unit of article, output
per man and per machine, average monthly income and expenditure, etc.

Mode:–
Mode is defined as that value in a series which occurs most frequently. In a frequency
distribution, mode is that variate which has the maximum frequency. In other words, mode
represents the value which is most frequent, typical or predominant.

For example, in the series, 6, 5, 3, 4, 3, 7, 8, 5, 9, 5, 4; we notice that 5 occurs most


frequently, therefore, 5 is the mode.

Mode is also known as Norm.

Example: Weight of catfish in g: 8, 9, 10, 9, 17, 10, 19, 15, 10, 12, 19
Array of the data: 8, 9, 9, 10, 10, 10, 12, 15, 17, 19, 19
In the above data, 10 occurs 3 times,
9 and 19 occur 2 times each, and
the others occur once.
Therefore, the mode of the above data is 10 g.
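Illustration: a minimal Python sketch of the same tally (statistics.multimode requires Python 3.8+; it returns all modes, so it also covers bimodal series):

    from statistics import multimode

    weights = [8, 9, 10, 9, 17, 10, 19, 15, 10, 12, 19]  # catfish weights in g
    print(multimode(weights))   # [10] -- 10 occurs most often (3 times)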

A series of observation may have one or more modes.


Unimodal series: The series of observations which contains only one mode is called a
unimodal series.
Bimodal series: The series of observations which contains two modes is called a bimodal
series. In this series, the two modes are the values of greatest density.
Trimodal series: The series of observations which contains three modes is called a trimodal
series. In this series, the three modes are the values of greatest density
and highest concentration of observations.
Multimodal or polymodal samples also do occur.
A sample with no mode is called a non-modal series.


Merits: 1. It can be easily understood.


2. It can be located in some cases by inspection.
3. It is capable of being ascertained graphically.
4. It is not affected by extreme value.
5. It represents the most frequent value and hence it is very useful in practice.
6. The arrangement of data is not necessary if the items are a few.

Demerits: 1. There are different formulas for its calculation which ordinarily give different
answers.
2. Mode is indeterminate in some cases; some series have two or more than two modes.
3. It cannot be subjected to algebraic treatment. For example, the combined mode cannot
be calculated for the modes of two series.
4. It is an unstable measure as it is affected more by sampling fluctuations.
5. Mode for a series with unequal class-intervals cannot be calculated.

Uses: 1. It is used for the study of most popular fashion.


2. It is extensively used by businessmen and commercial management.

Median:–
Median is defined as the middle most or the central value of the variable in a set of
observations, when the observations are arranged either in ascending or in descending order
of their magnitudes. It divides the arranged series into two equal parts. Median is a positional
average, whereas the mean is a calculated average.

When a series consists of an even number of terms, the median is the mean of the two central
items. It is generally denoted by M.

Median = value of the item (n + 1) / 2 in an array.

Example: The following are the lengths in cm of a species of fish. Let us identify the median
length. Length in cm: 17, 16, 15, 18, 16

First let us array the data in ascending order of magnitude. 15, 16, 16, 17, 18


Then, we identify the median as the value of the item (n + 1) / 2, where n is the number of
items in the sample. It is 5 in this example.
Median = value of item (n + 1) / 2 = (5+1)/2 = 6/2 = 3.
That is, the median is the value of item 3 in the array. In the above example the value of
item 3 is 16 cm.
The median length of the sample is 16 cm.

Example: The following are the weights in g of a species frog, with a sample size n = 8
Weight in g: 75, 60, 55, 80, 45, 70, 40, 85
Array the data as follows: 40, 45, 55, 60, 70, 75, 80, 85
The median is the value of the item (n + 1) / 2 = (8 + 1) / 2 =9/2 = 4.5
That is the median is the value of the item 4.5 in the array. The value of the item 4.5 is
calculated as a value mid-way between item 4 and 5, as follows,
Value of item 4.5 = (value of item 4 + value of item 5) / 2
= (60 + 70) / 2 = 130/2 = 65
The median weight is 65 g.
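Illustration: a minimal Python sketch covering both cases above (statistics.median averages the two central items automatically when n is even):

    from statistics import median

    # Odd sample size: the middle item of the array.
    print(median([17, 16, 15, 18, 16]))              # 16   (fish lengths in cm)
    # Even sample size: the mean of the two central items.
    print(median([75, 60, 55, 80, 45, 70, 40, 85]))  # 65.0 (frog weights in g)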

Merits: 1. It is easily understood.


2. It is not affected by extreme values.
3. It can be located graphically.
4. It is the best measure for qualitative data such as beauty, intelligence, honesty etc.
5. It can be easily located even if the class intervals in the series are unequal.
6. It can be determined even by inspection in many cases.

Demerits: 1. It is not subject to algebraic treatments.


2. It cannot represent the irregular distribution series.
3. It is a positional average and is based on the middle item.
4. It does not have sampling stability.
5. It is an estimate in case of a series containing even number of items.
6. It does not take into account the value of all the items in the series.
7. It is not suitable in those cases where due importance and weight should be given to
extreme value.


Uses: 1. It is useful in those cases where numerical measurements are not possible.
2. It is also useful in those cases where mathematical calculations cannot be made in order
to obtain the mean.
3. It is generally used in studying phenomena like skill, honesty, intelligence etc.

Frequency Distribution:–

A common and very useful method of classification of qualitative and quantitative
data in biology is the construction of a frequency distribution. A frequency distribution is a
classification of a random variable into a number of classes or class-intervals, showing how
frequently the values of each class occur in the data. The frequency distribution is always
presented in a table called a frequency table.

Preparation of a frequency table:–

1. Arrange the scores in ascending order to form an array.


2. Draw a table consisting of three columns: (a) class intervals, (b) tally, and (c) frequency.
3. Bearing in mind the lower and upper limits, write down the class intervals or the variables
in the first column.
4. Against each interval or the variable, write down as many vertical lines in the “Tally
column” as the number of scores it contains.
5. Count the number of vertical lines, crossing of 4 lines to be counted as 5 and put down the
number in the ‘frequency column’.

Example: Present the following data of a sample of 40 households taken from a factory in the
form of a frequency table with 9 classes of class interval 100.
200, 120, 350, 550, 400, 140, 350, 85, 180, 110,
110, 600, 350, 500, 450, 200, 170, 90, 170, 800,
190, 700, 630, 170, 210, 185, 250, 120, 180, 350,
110, 250, 430, 140, 300, 400, 200, 400, 210, 305


Solution: Arranging the given data in ascending order, we get:
85, 90, 110, 110, 110, 120, 120, 140, 140, 170, 170, 170, 180, 180, 185, 190, 200,
200, 200, 210, 210, 250, 250, 300, 305, 350, 350, 350, 350, 400, 400, 400, 430,
450, 500, 550, 600, 630, 700, 800
Taking a class interval of 100, starting with 0 – 99:
Frequency Distribution Table
Class Intervals Tally bars Frequency
0 – 99 II 2
100 – 199 IIII IIII IIII 14
200 – 299 IIII II 7
300 – 399 IIII I 6
400 – 499 IIII 5
500 – 599 II 2
600 – 699 II 2
700 – 799 I 1
800 – 899 I 1
Total 40
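Illustration: a minimal Python sketch that reproduces the table above by binning the 40 values into classes of width 100 (the binning rule x // width is ours):

    # Tally the 40 household values into classes 0-99, 100-199, ...
    data = [200, 120, 350, 550, 400, 140, 350, 85, 180, 110,
            110, 600, 350, 500, 450, 200, 170, 90, 170, 800,
            190, 700, 630, 170, 210, 185, 250, 120, 180, 350,
            110, 250, 430, 140, 300, 400, 200, 400, 210, 305]
    width = 100
    freq = {}
    for x in data:
        lower = (x // width) * width           # lower limit of the class
        freq[lower] = freq.get(lower, 0) + 1
    for lower in sorted(freq):
        print(f"{lower} - {lower + width - 1}: {freq[lower]}")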

Example: The following are the lengths of 25 gold fish, measured to the nearest tenth of a
cm.
3.9, 3.8, 3.6, 3.8, 4.0, 4.2, 3.6, 4.7, 4.3, 3.9, 3.6, 4.5, 3.8,
3.9, 4.3, 4.1, 3.9, 4.4, 4.1, 4.1, 4.4, 4.1, 3.9, 3.3, 4.0

Frequency Distribution Table


Sr. No. Class intervals Tally Frequency
1 3.3 – 3.5 II 2
2 3.6 – 3.8 IIII 5
3 3.9 – 4.1 IIII IIII I 11
4 4.2 – 4.4 IIII 5
5 4.5 – 4.7 II 2
Total 25


Relative frequency distribution:–


We know that the frequency of a class is defined as the total number of data points that fall
within that class. The frequency of each class can also be expressed in fraction or percentage
terms; these are known as relative frequencies. In other words, a relative frequency is the
class frequency expressed as a ratio of the total frequency, i.e.,

Relative frequency = Class frequency / Total frequency

Sr. No.   Class intervals   Frequency   Relative frequency   Percent relative frequency
1 3.25 – 3.55 2 2 / 25 = 0.08 0.08 × 100 = 8
2 3.55 – 3.85 5 5 / 25 = 0.20 0.20 × 100 = 20
3 3.85 – 4.15 11 11 / 25 = 0.44 0.44 × 100 = 44
4 4.15 – 4.45 5 5 / 25 = 0.20 0.20 × 100 = 20
5 4.45 – 4.75 2 2 / 25 = 0.08 0.08 × 100 = 8
Total 25 1.00 100

Cumulative Frequency Distribution:–


Cumulative frequency corresponding to a class is the sum of all the frequencies up to
and including that class. In a cumulative frequency distribution the frequency of a particular
class is obtained by adding to the frequency of that class all the frequencies of its previous
classes. Thus the cumulative frequency table is obtained from the ordinary frequency table by
successively adding the several frequencies.

Cumulative frequency series are of two types: (1) Less than series and (2) More than series.

Simple frequency distribution of the length (cm) of gold fish


Sr. No. Class intervals Frequency
1 3.25 – 3.55 2
2 3.55 – 3.85 5
3 3.85 – 4.15 11
4 4.15 – 4.45 5
5 4.45 – 4.75 2


Less than cumulative frequency distribution of the length (cm) of gold fish
Sr. No. Less than Cumulative frequency
1 3.25 0
2 3.55 0+2=2
3 3.85 2+5=7
4 4.15 7 + 11 = 18
5 4.45 18 + 5 = 23
6 4.75 23 + 2 = 25

More than cumulative frequency distribution of the length (cm) of gold fish
Sr. No. More than Cumulative frequency
1 3.25 25
2 3.55 25 – 2 = 23
3 3.85 23 – 5 = 18
4 4.15 18 – 11 = 7
5 4.45 7–5=2
6 4.75 2–2=0
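Illustration: a minimal Python sketch that derives both cumulative series from the class frequencies of the gold-fish tables above:

    from itertools import accumulate

    freqs = [2, 5, 11, 5, 2]                     # class frequencies
    less_than = list(accumulate(freqs))          # [2, 7, 18, 23, 25]
    total = sum(freqs)
    # "More than" at each lower class limit = total minus everything below it.
    more_than = [total - c for c in [0] + less_than[:-1]]  # [25, 23, 18, 7, 2]
    print(less_than, more_than)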

Example: The heights (in cm) of 40 persons are: 110, 112, 125, 135, 150, 152, 150, 155, 159,
130, 128, 138, 133, 143, 147, 151, 154, 156, 112, 116, 117, 111, 113, 115, 118,
121, 123, 120, 125, 121, 110, 113, 114, 149, 153, 155, 150, 156, 152, 111
Array the data and form a cumulative frequency table with class interval of 10.

Solution: Arranging the given data in ascending order of magnitude, we get: 110, 110, 111,
111, 112, 112, 113, 113, 114, 115, 116, 117, 118, 120, 121, 121, 123, 125, 125, 128, 130,
133, 135, 138, 143, 147, 149, 150, 150, 150, 151, 152, 152, 153, 154, 155, 155, 156, 156, 159
Sr. No. Class interval Tally Frequency Cumulative frequency
1 110 – 119 IIII IIII III 13 13
2 120 – 129 IIII II 7 20
3 130 – 139 IIII 4 24
4 140 – 149 III 3 27
5 150 – 159 IIII IIII III 13 40


Frequency Graphs:–
All types of frequency distributions can be represented by means of graphs. The most
common types of frequency graphs are: (1) the bar diagrams for qualitative and discrete
frequency distributions, (2) histogram for continuous frequency distribution, (3) frequency
polygon and frequency curve, and (4) the ogives, for cumulative frequency distributions.

Bar Diagram: The simple bar diagram can be used to represent qualitative as well as discrete
frequency distributions. The height of each bar is proportional to the frequency of the
respective class.

Fig.: Frequency distribution of the number of eggs per nest of a species of bird

Histogram: Histogram is the graphical representation of continuous frequency distribution. It


resembles the bar diagram with specific differences. Unlike the bar diagram there is no space
between the vertical rectangular bars, indicating that the class intervals run continuously.

Fig.: Histogram of the length of gold fish


Frequency polygon and frequency curve: The frequency polygon and frequency curve are
alternative forms to the histogram. Whereas the information given in a histogram is precise,
that given in a frequency polygon or frequency curve is general and tends to reflect the nature
of the frequency distribution of the population from which the sample was obtained. A
frequency polygon is obtained by joining the middle points of the tops of the rectangles in a
histogram by straight lines.

Fig.: Frequency polygon of the length of gold fish

A frequency curve is drawn in the same manner, except that the points, plotted at the mid-points
of the class intervals against the respective frequencies, are joined by a smooth curve instead
of straight lines.

Fig.: Frequency curve of the length of gold fish


Cumulative frequency graphs or Ogives: Graphical representation of cumulative frequency


distribution is called cumulative frequency graph or an ogive. There are two types of ogives,
less than ogive and more than ogive. The less than ogive is the graphic representation of less
than cumulative frequency distribution, and the more than ogive is the graphic representation
of more than cumulative frequency distribution.

To draw an ogive, the true class intervals are marked on the X-axis and the cumulative
frequencies are marked on the Y-axis. Against the true upper limits of the class-intervals,
points are plotted corresponding to the respective cumulative frequency. The lines joining the
points thus plotted give the less than ogive. Likewise, more than cumulative frequencies are
plotted against the respective true lower limit of the class-interval and the plotted points are
joined to get the more than ogive.

Fig.: Ogives (more than and less than) of the length of gold fish


1.2 Measures of Dispersion

The mean alone gives no information about the range of values that comprise a data
set. Two sets of data need not be identical even though they have the same measures of central
tendency.

For e.g., the following three sets of data are not identical but their means are the same.
A : 60, 60, 60, 60, 60 ⇒ 60
B : 30, 50, 85, 75, 60 ⇒ 60
C : 10, 30, 90, 90, 80 ⇒ 60

The mean for all the three sets is 60 but the observations in each set are different. This
variation in data is described by a Measure of Dispersion.

In the first set there is no dispersion because all the observations are the same. In the second and third
set of observations dispersion is evident because all the observations are different. The
dispersion is small when the values of the observations are close together. i.e., show little
variation. The dispersion is higher when the values are widely spread out. A measure of
dispersion conveys information regarding the amount of variability present in a set of data.

No. Poultry A Poultry B Poultry C


Days Daily egg production Daily egg production Daily egg production
1 4000 4050 3900
2 4000 4025 2100
3 4000 3950 1200
4 4000 4140 800
5 4000 3835 12000
Total 20000 20000 20000
Mean 4000 4000 4000

The mean egg production of A, B and C is the same. The mean does not show the fluctuation
or variation in the number of eggs produced daily by poultries B and C. The daily egg
production of poultry A is the same every day; it shows no variability. The variability in the
egg production of poultry B is less than that of poultry C. In poultry B, the values spread out
between 3835 and 4140, while in poultry C, the values spread out between 800 and 12000. It
means dispersion is more in poultry C's egg production than in poultry B's.


Importance of Dispersion:–

The measure of dispersion is an important tool in biostatistical studies because biological
phenomena are more variable than physical and chemical phenomena. Individual variations
are found in Hb%, number of RBCs and WBCs; the curative effect of a drug varies in different
patients of the same age and sex.

The major objectives of the measures of dispersion are:–


⇒ To judge the reliability of measure of central tendency.
⇒ To obtain correct picture of distribution or dispersion of values in a series.
⇒ To make a comparative study of variability of two or more series or sample.
⇒ To identify causes of variability in samples in order to exercise corrective measure.
⇒ To utilize dispersion values for further statistical analysis.

Types of Measures of Dispersion:


Range, Variance, Standard deviation, Coefficient of variation, etc.

Range (R):–

Range is a very simple measure of dispersion. It is defined as the difference between the
maximum value and the minimum value of the given series of data. More commonly the
range is indicated as "minimum–maximum".
Range = Highest value – Lowest value
R = H – L

Example: The Hb values (per 100 cc) of 15 persons were as follows. Calculate the range.


11.5 13.8 14.3 11.7 13.1 14.5 11.8 14.0 14.7 12.5 14.1 14.8 12.9 14.2 14.9
Solution: Array the data in ascending order.
11.5 11.7 11.8 12.5 12.9 13.1 13.8 14.0 14.1 14.2 14.3 14.5 14.7 14.8 14.9
Range = Highest value – Lowest value
= 14.9 – 11.5
∴ R = 3.4 per 100 cc
The range is 3.4 per 100 cc.
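Illustration: a minimal Python sketch of the range calculation (no arraying is needed, since max and min scan the raw data):

    hb = [11.5, 13.8, 14.3, 11.7, 13.1, 14.5, 11.8, 14.0,
          14.7, 12.5, 14.1, 14.8, 12.9, 14.2, 14.9]   # Hb per 100 cc
    r = max(hb) - min(hb)      # highest value - lowest value
    print(round(r, 1))         # 3.4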


Advantages: 1. It is easy to compute.


2. Its units are the same as the units of the variable being measured.

Disadvantages: 1. The range does not take into account the number of observations in the
sample; it takes into consideration only the largest observation and the smallest
observation, whatever they may be. Because we expect a large sample to include
occasional extreme values, we expect a large range. A measure of variability should
depend on the number of observations.
2. It makes no direct use of many of the observations in the sample. Observations
between the smallest and largest in a set are used only to determine which observations
are smallest and largest. Some use of the actual values of intervening observations
seems desirable.
3. The range also suffers from dependence upon extreme observations.
4. Range cannot be computed in case of an open-end distribution.

Standard Deviation (SD):–

Standard deviation is defined as the square root of the arithmetic mean of the
squared deviations of the various items from the arithmetic mean. In short, it is called the root
mean square deviation. The mean of squared deviations is called the variance. Therefore, the
square root of the variance (V) is the standard deviation (SD).

Standard deviation (SD) = √( Σ(X – x̄)² / n ) = √V

Where, X = variable x
x̄ = mean of the variable x = Σx / n
(X – x̄) = deviation
(X – x̄)² = squared deviation
Σ(X – x̄)² / n = mean of squared deviations = Variance


Example: Compute the standard deviation for the following weights in g of frogs:
30 90 20 10 80 70

X weight in g    (X – x̄)          (X – x̄)²
10               10 – 50 = –40    1600
20               20 – 50 = –30    900
30               30 – 50 = –20    400
70               70 – 50 = 20     400
80               80 – 50 = 30     900
90               90 – 50 = 40     1600
ΣX = 300                          Σ(X – x̄)² = 5800

x̄ = ΣX / n = 300/6 = 50 g

SD = √( Σ(X – x̄)² / n ) = √(5800 / 6) = √966.67
SD = 31.09 g
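Illustration: a minimal Python sketch of the same computation (this is the population formula with divisor n, as in the text; statistics.pstdev gives the same result):

    from math import sqrt

    weights = [30, 90, 20, 10, 80, 70]            # frog weights in g
    mean = sum(weights) / len(weights)            # 50.0
    var = sum((x - mean) ** 2 for x in weights) / len(weights)   # 5800/6
    print(round(sqrt(var), 2))                    # 31.09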

Example: Compute the standard deviation for the following data of the number of eggs in the
60 nests of a species of a bird.
2 2 3 6 2 4 1 0 1 2 3 4 4 5 6 4 4 2 2 0 1 3 6 5 2 5 3 5 4 4
2 4 3 0 4 3 1 5 2 2 3 6 4 3 2 3 6 1 2 3 2 5 4 1 1 4 3 3 2 5
X (No. of eggs/nest)   Frequency (f)   f·x   (X – x̄)      (X – x̄)²   f·(X – x̄)²
0                      5               0     0 – 3 = –3   9          5 × 9 = 45
1                      8               8     1 – 3 = –2   4          8 × 4 = 32
2                      12              24    2 – 3 = –1   1          12 × 1 = 12
3                      12              36    3 – 3 = 0    0          12 × 0 = 0
4                      12              48    4 – 3 = 1    1          12 × 1 = 12
5                      7               35    5 – 3 = 2    4          7 × 4 = 28
6                      4               24    6 – 3 = 3    9          4 × 9 = 36
                       Σf = 60         Σfx = 175                     Σf(X – x̄)² = 165


x̄ = Σfx / n = 175/60 = 2.91 ≈ 3 eggs/nest

SD = √( Σf(X – x̄)² / Σf ) = √(165 / 60) = √2.75

SD = 1.66 eggs per nest

Merits: 1. It is based on all the observations.


2. It is rigidly defined.
3. It has greater mathematical significance and is capable of further mathematical
treatment.
4. It represents the true measurement of dispersion of a series.
5. It is least affected by fluctuations of sampling.
6. It is reliable and dependable measure of dispersion.
7. It is extremely useful in correlation etc.

Demerits: 1. It is difficult to compute unlike other measure of dispersion.


2. It is not simple to understand.
3. It gives more weightage to extreme values.
4. It consumes much time and labour while computing it.

Uses: 1. It is widely used in biological studies.


2. It is used in fitting a normal curve to a frequency distribution.
3. It is most widely used measure of dispersion.

Variance (V or σ²):–

The variance is the arithmetic mean of the squared deviations from the mean value of the
data. It is also described as the square of the standard deviation. The methods for calculating
variance are the same as for the standard deviation. It is denoted by σ².

Variance = V = Σ(x – x̄)² / n


The limitation of the mean deviation with negative values is overcome by squaring the
deviations. To avoid bias in the estimate, degrees of freedom (n – 1) are used instead of n
when the number of values is small:

V = Σ(x – x̄)² / (n – 1)   or   V = Σdx² / (n – 1)

Where dx = (x – x̄)

Variance of grouped data is given by the formula:

V = Σf(x – x̄)² / (Σf – 1)

Example: Compute the mean and variance of the data set given below.

Class interval   Mid-point X   No. of fields f   f·x   (x – x̄)         (x – x̄)²   f(x – x̄)²
31 – 35          33            2                 66    33 – 50 = –17   289        578
36 – 40          38            3                 114   –12             144        432
41 – 45          43            8                 344   –7              49         392
46 – 50          48            12                576   –2              4          48
51 – 55          53            16                848   3               9          144
56 – 60          58            5                 290   8               64         320
61 – 65          63            2                 126   13              169        338
66 – 70          68            2                 136   18              324        648
                               Σf = 50           Σfx = 2500            Σ(x – x̄)² = 1052   Σf(x – x̄)² = 2900

x̄ = Σfx / Σf = 2500/50 = 50

V = Σf(x – x̄)² / (Σf – 1) = 2900 / 49

V = 59.18
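Illustration: a minimal Python sketch of the grouped-data variance with the (Σf – 1) divisor used above (midpoints and frequencies are taken from the table):

    midpoints = [33, 38, 43, 48, 53, 58, 63, 68]
    freqs = [2, 3, 8, 12, 16, 5, 2, 2]
    n = sum(freqs)                                                   # 50
    mean = sum(f * x for f, x in zip(freqs, midpoints)) / n          # 50.0
    ss = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints))  # 2900.0
    print(round(ss / (n - 1), 2))                                    # 59.18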


Merits: 1. It is easy to calculate.


2. It indicates the variability clearly.
3. The variance is the most informative among the measures of dispersion for populations
commonly met with.

Demerits: The unit of expression of variance is not the same as that of the observations,
because variance indicates squared deviations, e.g., if the observations are given in meters
then the variance will be in square meters.

Coefficient of variation:–

Standard deviation is an absolute measure of dispersion, and it is expressed in the


same unit in which the data is given. Therefore, it is not of much use for comparing the extent
of dispersion in two or more series of data, especially if the data are in different units of
measurement. In such instances, a relative measure of dispersion would be desirable. A
relative measure is generally called a coefficient. It is not expressed in any unit of
measurements. Its main objective is to help comparison. The relative measure of standard
deviation is called the Coefficient of Variation (CV).

Coefficient of variation = (Standard deviation / Mean) × 100

CV = (SD / x̄) × 100

When comparing the CV of two or more series of data, the series having lesser CV is less
variable, more stable, more uniform and more consistent, while the series of data having
higher CV is more variable, less stable, less uniform and less consistent.

Example: Lengths (X ± SD in cm) of two species of fish, A & B, are as follows. Comment
on the variability of the length in the two species.
Species A = 67 ± 2.5
Species B = 64 ± 2.4


CV = (SD / x̄) × 100

CV of species A = (2.5 / 67) × 100 = 3.73 %

CV of species B = (2.4 / 64) × 100 = 3.75 %

The length of species B is more variable than the length of species A.

Example: A researcher collects data on the weight and length of fishes and is interested to
find out which of the characters is more variable. The data are:
Fish     Mean        Standard deviation
Weight   350 g       12 g
Length   16 inches   1.5 inch

Coefficient of variation for weight = (12 / 350) × 100 = 3.43 %

Coefficient of variation for length = (1.5 / 16) × 100 = 9.375 %

Since CV (length) is more than CV (weight), there is a greater variability in the
lengths of the fishes than in their weights.
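Illustration: a minimal Python sketch of the comparison (cv is our helper name, not part of the source):

    def cv(sd, mean):
        # Coefficient of variation in percent.
        return sd / mean * 100

    print(round(cv(12, 350), 2))   # 3.43  (weight)
    print(round(cv(1.5, 16), 3))   # 9.375 (length) -> length is more variable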


1.3 Confidence limits & Confidence intervals

Confidence limit for population mean:–

When we survey the entire parent population, without leaving out a single individual, and
then calculate the mean value of such a survey, it is called the population mean (µ). But such a
population survey needs a lot of labour, money and time.

In order to avoid this, in general we conduct a sample survey. In the sample survey, the
sample plots have to be selected at random without any personal bias.

The principle behind taking a sample at random is that every plot has got equal probability to
be selected in the sample survey.

In this trend, when the sample survey is conducted, values are recorded from each sample
plot (Xi).

The population consists of data with some clearly defined characteristic. For e.g., A
population may consist of all patients with a particular disease, Tablet from a production
batch…
Sample – selection of patients to participate in a clinical study.
Sample – tablets chosen for a weight determination.
The sample chosen should be representative of the population. Under these conditions, it is
assumed that the sample mean x̄ is more or less equal to µ, the value of the population mean.

The question is whether the population mean can be calculated from knowledge of the sample
mean. The answer is no, because under no circumstances can the true value of the population
mean be calculated from the value of the sample mean. However, confidence limits at 95% and
99% can be established for the population mean.

The confidence limits have one lower and one higher value, and it is supposed that the
exact value of the population mean can be anything ranging from the lower value to the
higher value; that means the population mean can be any value ranging from the lower one to

the higher one. The range of values between the lower and the higher limits is known as the
confidence range or the confidence interval.

When the investigator says that he has established the confidence limits for the population
mean at 95%, he is 95% confident that the calculated range will include the value of the
population mean.

Similarly, when the investigator says that he has established the confidence limits for the
population mean at 99%, he is 99% confident that the calculated range will include the value
of the population mean.

Confidence limits:–
For 95% . . . . C = 1.96        For 99% . . . . C = 2.58
µ = x̄ ± (C × Se)
Se = Sd / √n
Sd = √V
V = Σdx² / (n – 1)

Here, x̄ = Sample mean
µ = Population mean
Se = Standard error of the distribution
C = a value which denotes how far a sample mean may differ from the population mean
in terms of standard error.

Once we establish the confidence limits for the population mean, all that can be said is
that the value of the population mean can fall anywhere between the lower limit and the
higher limit of the confidence interval, but where exactly it lies is not known at all.

Confidence Interval:–
In order to get the information about the parameter, we draw a random sample from
the population and calculate summary measures for the sample. They are known as statistics.
Such statistics gives information about population parameters.


However, since sample is only a part of the population, the numerical value of a statistic
cannot be expected to give exact value of the parameter. As statistic is a random variable,
therefore it will have a probability distribution.

The probability distribution of the statistic is known as sampling distribution of the statistic.
The mean is a statistic. The probability distribution of mean is known as sampling
distribution of mean. The standard deviation of sampling distribution is referred as “standard
error” (SE).

Suppose a random sample is drawn from the population with mean (µ) and variance (σ²); we
want to relate the sampling distribution of x̄ to the population from which it is drawn. The
mean and standard deviation (standard error) of the sampling distribution are determined in
terms of µ and σ,
x̄ = mean of the sampling distribution
µ = population mean

SE = Population SD ÷ √(sample size)

SE = σ / √n   or   S.D. / √n
This shows that the variability of the sample mean is governed by two factors:
(a) population variability and (b) sample size.

Large variability in the population induces large variability in the sample mean, thus making
sample information about µ less reliable. But this can be balanced by taking n appropriately
large. Thus, with increasing sample size, the SE of x̄ decreases and the distribution of x̄ tends
to become concentrated around the population mean (µ).

Confidence limit: It is one of the two end points of the confidence interval; with the stated
probability, the population mean will fall within the range defined by the
two confidence limits.

Confidence interval: It is the interval between two values based on sample observations.


Example: Following are the sample data. From these data, calculate the 95% and 99%
confidence limits for the population mean.

No. 1 2 3 4 5 6 7 8 9 10 11 12
X 25 34 41 39 42 27 26 39 34 46 36 38

Solution:

X          (X – x̄) = dx   dx²
25         –10.58         111.90
34         –1.58          2.49
41         5.42           29.37
39         3.42           11.69
42         6.42           41.21
27         –8.58          73.61
26         –9.58          91.77
39         3.42           11.69
34         –1.58          2.49
46         10.42          108.57
36         0.42           0.17
38         2.42           5.85
ΣX = 427                  Σdx² = 490.81

x̄ = ΣX / n = 427 / 12 = 35.58

V = Σdx² / (n – 1) = 490.81 / 11

V = 44.61

Sd = √V = √44.61 = 6.67

Se = Sd / √n = 6.67 / √12 = 1.93


C – Value
95% C = 1.96
99% C = 2.58

For 95%, µ = x̄ + (C × Se)
= 35.58 + (1.96 × 1.93) = 35.58 + 3.78
= 39.36
µ = x̄ – (C × Se)
= 35.58 – (1.96 × 1.93) = 35.58 – 3.78
= 31.80

For 99%, µ = x̄ + (C × Se)
= 35.58 + (2.58 × 1.93) = 35.58 + 4.98
= 40.56
µ = x̄ – (C × Se)
= 35.58 – (2.58 × 1.93) = 35.58 – 4.98
= 30.60

Result: The 95% and 99% confidence limits (CL) for the population mean are given below.

              95%     99%
µ (upper)     39.36   40.56
µ (lower)     31.80   30.60

Conclusion: From the above result, it is concluded that the 95% confidence limits for the
population mean are from 31.80 to 39.36, and the 99% confidence limits for the population
mean are from 30.60 to 40.56.
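Illustration: a minimal Python sketch of the whole procedure (it reproduces the worked example above up to rounding; 1.96 and 2.58 are the usual normal-curve C values):

    from math import sqrt

    x = [25, 34, 41, 39, 42, 27, 26, 39, 34, 46, 36, 38]
    n = len(x)
    mean = sum(x) / n                                    # 35.58
    var = sum((xi - mean) ** 2 for xi in x) / (n - 1)    # ~44.6
    se = sqrt(var) / sqrt(n)                             # standard error, ~1.93
    for label, c in (("95%", 1.96), ("99%", 2.58)):
        print(label, round(mean - c * se, 2), "to", round(mean + c * se, 2))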


1.4 Chi-Square (χ²) Test

Chi-square test is the most commonly used nonparametric test in biological
experiments. The chi-square distribution was first described by F. R. Helmert in 1876; the
test in its present form was introduced by Karl Pearson in 1900. The term chi-square was
derived from the Greek letter chi (χ). It is pronounced as 'kye'.

Chi-square test of Karl Pearson is a statistical device to test the significance of the difference
between observed distribution and the expected distribution. It is an index to measure the
extent and significance of the difference between the observed and expected frequencies.

Chi-square (χ²) is the summation of the squared deviations of each observed frequency (O)
from the respective expected frequency (E), each divided by the expected frequency.

χ² = Σ (O – E)² / E

If the differences (deviations) between O & E are greater, the chi-square will be greater, and
vice versa. If there is no difference between O & E, the χ2 will be zero.

Characteristics of Chi-Square Distribution:–

1. Chi-square curve is always positively skewed. i.e., χ2 value is always positive.


2. Chi-square values increase with the increase in degree of freedom.
3. The standard deviation of χ2 distribution is equal to √2v, where v is degree of freedom.
4. The mean of distribution is the number of degree of freedom.
5. The value of χ2 lies between zero and infinity, 0 ≤ χ2 < ∞.
6. The sum of two χ² distributions is again a χ² distribution, i.e. if χ₁² and χ₂² are two
independent χ² distributions with n₁ and n₂ degrees of freedom respectively, then
χ₁² + χ₂² is also a χ² distribution with (n₁ + n₂) degrees of freedom.
7. For different degrees of freedom, the shape of the curve will be different.
8. Chi-square (χ²) is a statistic used for testing hypotheses, and not a parameter.


Conditions of Chi-Square test:–

1. Every observation of the sample for this test should be independent of all other
observations.
2. The total number of observations used for the test should be large.
3. The expected frequency of any item should not be less than 5.
4. The frequencies used in χ2 should be absolute and not relative in terms.
5. The observation collected for χ2 test should be on random sampling.
6. Chi-square test is used only for drawing inferences. It cannot be used for estimation of
parameter or any other value.
7. Chi-square test is totally dependent on degree of freedom.

Procedure of Chi-square test:–

1. A null hypothesis (H○: O – E = 0, or there is no association between the given attributes,
or the attributes are independent) is established.
2. A level of significance, 0.05 or 0.01, is preset to test the H○. The level of significance
is the maximum probability for rejecting the H○.
3. The expected frequency for each observed frequency is calculated on the basis of the
null hypothesis (for the test of independence) or of an apriori hypothesis (for the
goodness-of-fit test), i.e. (Ri × Ci) / N.
4. The observed frequency data and the respective expected frequency data are organized
in Chi-Square table or contingency table.
5. The degree of freedom is obtained as (r – 1)(c – 1), where r is the number of rows, and
c is the number of columns in the table.
6. (O–E), (O–E)2, (O–E)2/E, Σ(O–E)2/E are computed and the calculated χ2 is obtained.
7. The chi-square table is entered at the specified level of significance and the associated
degrees of freedom, and the table χ² value is noted.
8. Calculated χ2 is compared with the table χ2 value.
9. If the calculated χ2 is greater than the table χ2, the H○ is rejected and on the other hand,
if calculated χ2 value is less than the table χ2, the H○ is failed to be rejected.
10. Based on the decision to reject or fail to reject the null hypothesis, the goodness-of-fit
or the independence of the attributes is interpreted.


Applications of Chi-Square test:–

1. Goodness-of-fit test (Chi-Square with Apriori Hypothesis)

Chi-square test is used to compare the observed frequencies with the respective
expected frequencies obtained on apriori hypothesis.
e.g., Comparison of observed and expected frequency distributions of various types
such as Binomial, Poisson and Normal.

2. Test for independence of attributes (Chi-square without apriori hypothesis)

Chi-square test is used to compare the observed and expected frequencies of two or
more attributes and to decide whether these attributes are independent of or dependent on
each other. The expected frequencies are obtained on the basis of the null hypothesis that
there is no association between the attributes.
e.g., To test whether eye-color and hair-color of persons are independent or
associated.

Example: A disease was detected in 382/600 animals of species A and 218/300 animals of
species B. Test by means of a suitable test whether there is a difference in the detection of the
disease between the two species.
Species Disease detected Disease not detected Total
A 382 218 600
B 218 82 300

Calculation:
Species Disease detected Disease not detected Total
A 382 218 600 r1
B 218 82 300 r2
Total 600 c1 300 c2 900 N

Null hypothesis: There is no difference between the detection of disease between the two
species.


      Observed (O)   Expected (E)   (O – E)   (O – E)² / E
E1    382            400            –18       0.81
E2    218            200            18        1.62
E3    218            200            18        1.62
E4    82             100            –18       3.24

χ² = Σ (O – E)² / E = 7.29

E1 = (r1 × c1) / N = (600 × 600) / 900 = 400

E2 = (r1 × c2) / N = (600 × 300) / 900 = 200

E3 = (r2 × c1) / N = (300 × 600) / 900 = 200

E4 = (r2 × c2) / N = (300 × 300) / 900 = 100

DF = (r – 1)(c – 1) = (2 – 1)(2 – 1) = 1

Tabulated value = 3.84 at DF 1 and LS 0.05


Calculated value = 7.29

Result: Here calculated χ2 value = 7.29 and tabulated χ2 value = 3.84 at df 1 & LS 0.05.

Conclusion: Here calculated χ² > tabulated χ², so the null hypothesis is rejected and therefore
there is a significant difference in the detection of the disease between the two species.
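Illustration: a minimal Python sketch of the same 2 × 2 chi-square computation (the expected counts come from (Ri × Cj) / N, as in the procedure above):

    obs = [[382, 218],      # species A: detected, not detected
           [218, 82]]       # species B: detected, not detected
    row_totals = [sum(r) for r in obs]          # 600, 300
    col_totals = [sum(c) for c in zip(*obs)]    # 600, 300
    n = sum(row_totals)                         # 900
    chi2 = 0.0
    for i, row in enumerate(obs):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected frequency
            chi2 += (o - e) ** 2 / e
    print(round(chi2, 2))   # 7.29 > 3.84 (table value at df = 1, LS 0.05)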


2.1 Meaning of Significance & Significance Level

Hypothesis testing:–

Sampling theory deals with two types of problems: estimation and testing of
hypotheses. The modern theory of probability plays an important role in decision making, and
the branch of statistics which helps us in arriving at the criterion for such decisions is known
as testing of hypothesis. It employs statistical techniques to arrive at decisions in certain
situations where there is an element of uncertainty, on the basis of a sample whose size is
fixed in advance.

A hypothesis (H) is a statement about the population parameter. In other words, a hypothesis
is a conclusion which is tentatively drawn on a logical basis. A statistical hypothesis is a
tentative conclusion that specifies the properties of a distribution of a random variable. These
properties generally refer to parameters of the population and the hypothetical values with
which the values of statistics derived from a sample are compared in order to find the
difference between the statistics and the corresponding parameters.

In other words, statistical hypothesis is some assumption or statement, which may or may not
be true, about a population or about the probability distribution characterizing the given
population, which we want to test on the basis of the evidence from a random sample.

Hypothesis testing can be regarded as an example of a decision process, in which data are
assembled in a particular way to produce a quantity that leads to a choice between two
decisions. Each decision then leads to an action. Because data arise from a sampling process,
there is some risk that an incorrect decision will be made, with some loss attached to the
resulting incorrect action.

The testing of hypothesis is a procedure that helps us to ascertain the likelihood of


hypothesized population parameter being correct by making use of the sample statistic. In
other words, it is a process of testing of significance which concerns the testing of some
hypothesis regarding a parameter of the population on the basis of a statistic from the sample.


In testing of hypothesis, a statistic is computed from a sample drawn from the parent
population, and on the basis of this statistic it is observed whether the sample so drawn has
come from a population with certain specified characteristics.

The value of sample statistic may differ from the corresponding population parameter due to
sampling fluctuation.

The test of hypothesis discloses whether the difference between the sample statistic and the
corresponding hypothetical population parameter is significant or not significant. Thus the
test of hypothesis is also known as the test of significance.

Hypothesis:–

Generally, an investigator has a hypothesis (H). The hypothesis may be that the
sample mean (x̄) is less than the population mean (µ), or the mean of a "treated" group is
greater than the mean of the "control" group, or the means of more than two groups are not
the same.

These hypotheses may be shown symbolically as follows:

H: x̄ < µ
H: µ1 > µ2
H: µ1 ≠ µ2 ≠ µ3 ≠ µ4

Whatever be the hypothesis an investigator puts forward, its statistical significance is


obtained by subjecting the null form of the hypothesis to an appropriate test of significance.
The null form of the hypothesis (H) is called null hypothesis (H○), which means that there is
no significant difference between the means of the two groups.
The null hypotheses of the above hypotheses may be represented as follows,
H○: x̄ – µ = 0
In other words, H○: x̄ = µ
H○: µ1 – µ2 = 0
or in other words, H○: µ1 = µ2
H○: µ1 = µ2 = µ3 = µ4


Verbally, the H○ states that there is no significant difference between the sample mean and
the population mean, or between the means of two populations, or between the means of
more than two populations.

Prof. R. A. Fisher remarked, “Null hypothesis is the hypothesis which is to be tested for
possible rejection under the assumption it is true”.

The negation of null hypothesis is called Alternative hypothesis. It means ‘any statistical
hypothesis which is not a null hypothesis is called an alternative hypothesis’. It is represented
by HA or Hα or H1. In other words if null hypothesis is rejected, the alternative hypothesis is
applicable.

According to the alternative hypothesis, the difference between the population mean (µ) and
the sample mean (x̄) is not due to sampling fluctuations, but is real and quite significant.

In case the null hypothesis is not applicable, the verification of the scientific hypothesis will
depend on the alternative hypothesis. The null hypothesis is accepted as true until such time
as the alternative hypothesis disproves it.

Types of Errors in hypothesis testing:–

In case null hypothesis is rejected in favour of alternative hypothesis, there are two
possible outcomes. Either the null hypothesis has been rejected correctly or incorrectly.
Falsely rejecting the null hypothesis is called Type I Error.

In case the null hypothesis is not rejected, again there are two possible outcomes. Either we
have failed to reject the null hypothesis though it should have been rejected, or we have
correctly failed to reject the null hypothesis because it was not to be rejected. Failing to reject
the null hypothesis when it should have been rejected is called a Type II Error.

1. Type I Error: When the null hypothesis is true, but the difference of means is found
significant and the hypothesis is rejected, it is called a Type I Error. The probability of making
a Type I error is denoted by a or α. It means the probability of making a Type I error by rejecting the null

hypothesis when it is true is a or α, and the probability of making the correct decision of
accepting the null hypothesis when it is true will be (1 – a) or (1 – α).

2. Type II Error: When the null hypothesis is false, but the difference of means is found not
significant and the hypothesis is accepted, it is called a Type II Error. It means in a Type II
error, the null hypothesis is accepted when it is false. The probability of making a Type II
error by accepting the null hypothesis when it is false is represented by b or β, and the
probability of making the correct decision of rejecting the false null hypothesis will be
(1 – b) or (1 – β).

Determination of Decision Rule and Level of Significance:–

The decision rule specifies which values of the test statistic will determine the
rejection of the null hypothesis in favour of alternative hypothesis. The decision rule is based
on the probabilities (α and β) of the Type I and Type II errors. Possibilities associated with
making a correct decision can be represented as follows:
Decision about null hypothesis   In reality, H○ is True   In reality, H○ is False
Accept                           Correct                  Type II Error
Reject                           Type I Error             Correct
i.e., Type I Error (α): Null hypothesis (H○) rejected though it is true.
Type II Error (β): Null hypothesis (H○) accepted though it is wrong.

Level of Significance (LS) is the quantity of risk of Type I error which can be tolerated in
making a decision about the null hypothesis (H○). Thus, level of significance is the maximum
probability of making a Type I error.

The commonly used levels of significance in practice are 5% (0.05) and 1% (0.01). It means
at 5% level of significance a = 0.05 or probability of making Type I error is 0.05. It can be
inferred that there is probability of making Type I error 5 out of 100 times or the chances of
making correct decision are 95 times out of 100 times. It means 95 times the decision made is
correct and is wrong only 5 times. Similarly, at 1% level of significance (i.e. a = 0.01) there is
probability of making 1 error out of 100 times.


Critical Region or Rejection Region:–

The test statistics used to test the hypothesis H○ follows a known distribution. This is
represented by a standard normal curve or normal probability curve of sampling distribution.
The area under probability curve is divided into two regions:
(1) The region of rejection or critical region and (2) The region of acceptance

Fig.: Standard normal curve or probability curve of sampling distribution showing
rejection and acceptance regions

1. Rejection region or Critical region:–

It is the region of the standard normal curve of the distribution which corresponds to those
levels of significance that form the basis of rejection of the null hypothesis (H○). The critical
region is responsible for the Type I error. The rejection region indicates that if the value of
the test statistic lies in this region, the null hypothesis H○ will be rejected.

The area of critical region is equal to the level of significance α and lies on the tail of the
distribution curve. It may be located on both the sides or only one side i.e. on the one tail
(either right or left).


2. Acceptance region:–

The region of the standard normal curve that is not covered by the rejection region is known as the acceptance region.

Decision about H○:–

When we compare the calculated probability (area left in the tail) with the level of
significance (LS), if it is less than 0.05 (P < 0.05) or less than 0.01 (P < 0.01), we reject the
H○. If the calculated P is equal to or greater than 0.05 (P ≥ 0.05) or 0.01 (P ≥ 0.01) we fail to
reject H○.

Thus we have only two options regarding our “decision” about the null hypothesis, either
“reject H○” or “fail to reject H○”. We are rejecting a H○ because the probability for its
occurrence is low, lower than 0.05 or 0.01.

Usually in the hypothesis testing, and for that matter any test of significance (χ2-test, t-test,
etc.) the probability for the occurrence of H○ is not calculated, but the critical ratio value (t,
χ2, etc...) is calculated and compared with the table values at specific level of significance
(0.05 or 0.01) and degrees of freedom.

If the calculated value of the test statistic is equal to or greater than the table value, the H○ is rejected and the inference about the given hypothesis is drawn accordingly.

Two Tailed and One Tailed Test:–

The critical region is represented on the probability curve of the sampling distribution. This curve has two sides, which are called its two tails.

1. When the rejection region lies on both ends of the normal curve, the test is known as a two tailed test or two sided test. It is applied when the alternative hypothesis is non-directional, i.e., when a difference between the sample mean and the population mean in either direction leads to rejection of the null hypothesis.


Fig.: Two tailed test diagrams

2. When the rejection region lies on only one side of the normal curve, the test is known as a one sided test or one tailed test.
(i) Right Tailed Test: In the right tailed test the rejection region or critical region lies entirely on the right tail of the normal curve. It is applied when testing whether the population mean is larger than some specified value.


(ii) Left Tailed Test: In the left tailed test the critical region or rejection region lies entirely on the left tail of the normal curve. It is applied when testing whether the population mean is smaller than some specified value.

Fig.: One tailed test diagrams at 5% significance level


Fig.: One tailed test diagrams at 1% significance level


2.2 Student’s t-test: Paired & Unpaired

Introduction:–

W.S. Gosset, an English statistician working in Dublin, applied this test in 1908 for testing the significance of the difference between the means of two samples of small size. It was named the 't-test'. Gosset published under the pen name 'Student'; hence the test is called Student's t-test. It was further elaborated and explained by R.A. Fisher. Student's t-test is applied to small samples only.

Student’s t-distribution:–

The t-distribution is based on the degrees of freedom of the distribution, n – 1, if the sample size is n. The quantity t is defined as,

t = (Difference of population parameter and corresponding statistic) / (Standard error of the statistic)

with (n – 1) degrees of freedom, if the sample size is n.

If x1, x2, x3 … xn is a random sample of size n drawn from a normal population with mean µ and variance σ² (not known), then Student's t statistic for the mean is defined as,

t = (x̄ – µ) / (S.E. of mean) = (x̄ – µ) / (S.D./√(n – 1))

with (n – 1) degrees of freedom, where S.D. = standard deviation of the sample.

Assumptions for t-test:–

The t-test is applied under following assumptions:


1. Samples are drawn from normal population and are random.
2. For testing the equality of two population means, the population variances are regarded as
equal.
3. The population standard deviation may not be known.
4. In case of two samples some adjustments in degrees of freedom for t are made.


Properties of t-distribution:–

1. The t-distribution is asymptotic to the x-axis, i.e., it extends to infinity on either side.
2. The slope or form of the t-distribution curve varies with the degrees of freedom, where the degrees of freedom = sample size – 1.
3. The t-distribution is a symmetrical distribution with mean zero.
4. Its graph is similar to that of the normal distribution, but there is more area in the tails of the t-distribution and the standard normal curve is higher in the middle; i.e., the t-distribution has a greater spread than the normal distribution.
5. The larger the number of degrees of freedom, the more closely the t-distribution resembles the standard normal distribution; i.e., the t-curve approaches the normal curve as the degrees of freedom increase without limit.
6. The sampling distribution of t does not depend on any population parameter; it depends only on the degrees of freedom, n – 1, i.e., on the sample size.
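As a quick numerical check of property 5, here is a minimal sketch (assuming Python with scipy.stats is available) that prints the two-sided 5% critical value of t for increasing degrees of freedom; it approaches the normal value 1.96.

```python
# Property 5 in numbers: t critical values approach the normal value
# as the degrees of freedom grow (assumes scipy is available).
from scipy import stats

for df in (5, 10, 30, 120):
    print(df, round(stats.t.ppf(0.975, df), 3))   # two-sided 5% point
print("normal:", round(stats.norm.ppf(0.975), 3))
# -> 2.571, 2.228, 2.042, 1.980, then 1.960; compare the 0.05 column
#    of the 't'-Table given at the end of this document.
```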

Application of t-distribution:–

The t-distribution has following three important applications in testing of hypothesis


for small samples (n < 30).
1. To test the significance of a single mean, when the population variance σ² is unknown.
2. To test the significance of difference between two sample means the population variances
being equal and unknown.
3. To test the significance of an observed sample correlation coefficient or difference between
means of two sample (dependent samples or paired observations).


Estimation of population mean:–

Population mean is estimated using the table t value at the specific levels of
significance and the n – 1 DF.

95% confidence interval of µ is obtained as,

µ = x̄ ± t(0.05 LS, n–1 DF) × (S.E. of x̄)

99% confidence interval of µ is obtained as,

µ = x̄ ± t(0.01 LS, n–1 DF) × (S.E. of x̄)

Example: A sample of 10 plant yielded a mean sugar level of 45 mg% with a SD of 8 mg%.
Estimate the population mean of the sugar level with 95% confidence.

Step: 1 n = 10, x̄ = 45 mg%, SD = 8 mg%

Step: 2 Assumption of a sampling distribution of sample means with an S.E. computed as,
S.E. of mean = S.D./√(n – 1) = 8/√(10 – 1) = 8/√9 = 8/3 = 2.67

Step: 3 Table value of t at 0.05 LS with 10 – 1 = 9 DF ⟹ 2.26

Step: 4 µ = x̄ ± t(0.05 LS, n–1 DF) × (S.E. of x̄)
= 45 ± 2.26 × (2.67)
= 45 ± 6.03
µ = 38.97 to 51.03 mg%

Example: A sample of 14 measurements of oxygen content in the water of a lake yields a


mean of 50 ppm with an SD of 5 ppm. Estimate the mean oxygen content of the
lake with 99% confidence.

Step: 1 n = 14, x̄ = 50 ppm, SD = 5 ppm

Step: 2 Assumption of a sampling distribution of sample means with an S.E. computed as,
S.E. of mean = S.D./√(n – 1) = 5/√(14 – 1) = 5/√13 = 5/3.61 = 1.39

Step: 3 Table t at 0.01 LS with 13 DF = 3.01

Step: 4 µ = x̄ ± t(0.01 LS, n–1 DF) × (S.E. of x̄)
= 50 ± 3.01 × (1.39)
= 50 ± 4.18
µ = 45.82 ppm to 54.18 ppm
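The same interval can be checked in a few lines of Python. The sketch below (assuming scipy is available) follows this document's convention S.E. = S.D./√(n – 1) and reproduces the first example.

```python
# Confidence interval for a population mean, following the convention
# S.E. = S.D./sqrt(n - 1) used in the worked examples above.
from scipy import stats

n, mean, sd = 10, 45.0, 8.0                     # sample size, mean, S.D. (mg%)
se = sd / (n - 1) ** 0.5                        # 8/3 = 2.67
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)    # two-sided 5% point, 9 DF ≈ 2.262
print(f"95% CI: {mean - t_crit * se:.2f} to {mean + t_crit * se:.2f} mg%")
# -> 38.97 to 51.03 mg%, matching the hand calculation.
```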

Types of t-test:–
The following two types of t-test can be performed:
1. Paired t-test (t-test for two related sample means): When paired observations are arranged case-wise and the effect of a treatment on the same subjects is tested, the test is called a paired t-test or 'correlated t-test'.
2. Unpaired t-test (t-test for independent samples): When the observations are classified into two independent groups and a test of the difference between the group means is performed for a specified variable, the test is called an unpaired t-test or 'independent samples t-test'.

Paired t-test:–

One of the experimental designs in biology and medicine is to assign the same
subjects both for ‘control’ and ‘experimental’ treatments.

For example, if an investigator wants to evaluate the efficacy of a new drug formulation in
reducing the blood glucose levels in men, he/she can have a sample of 10 persons who are
willing to undergo the experiment. The usual procedure is to divide the sample of 10 persons
into two groups and administer the placebo/vehicle (the medium in which the drug to be
tested is prepared) and the drug to each group. In the matched pair design, all the ten persons
are given, first, the placebo treatment and their blood samples are analyzed for blood glucose
level. Next, after the expiry of sufficient time, the same persons are given the drug, and their
blood samples are analyzed for blood sugar level. Thus, each subject yields a pair of data,
which can be analyzed using the t-test to assess significance of the mean difference.

Procedure for the Paired t-test or Paired data analyses:–

(1). The same subjects yield pairs of data.
e.g. (i) Values obtained before and after treatment.
(ii) Values obtained after control treatment and after experimental treatment.
(iii) Values obtained at two different periods – now and after a gap of a day, a month, a year, etc.
(2). Significance of the mean difference (D̄) is tested using the t-test.
(3). H○: D̄ – µD = 0, with n – 1 degrees of freedom (n = number of pairs of data); the sampling distribution of mean differences (D̄'s) has a population mean of µD = 0.
(4). Computation: (a) D = difference between the n pairs of values
(b) ΣD
(c) Mean difference, D̄ = ΣD/n
(d) (D – D̄), (D – D̄)² and Σ(D – D̄)²
(e) SD = √(Σ(D – D̄)²/n)
(f) SE of the mean difference, SE = SD/√(n – 1)
(5). t = (D̄ – µD)/SE, that is, t = (D̄ – 0)/SE ⇒ ∴ t = D̄/SE
(6). Table t at the specified LS and n – 1 degrees of freedom.
(7). Decision: If calculated t > table t, reject H○.
(8). Inference/Conclusion: Based on the decision, the given hypothesis is discussed.

Example: A pharmaceutical company develops a drug, which it claims to increase


hemoglobin content in aged people. The hemoglobin content (g/100ml) of 10
subjects is measured before and after administration of the drug. On the basis of
the following data, determine whether the company’s claim is valid.
Subject 1 2 3 4 5 6 7 8 9 10
Before 10 9 11 12 8 7 12 10 10 9
After 12 11 13 14 9 10 12 14 11 12

Step: 1 H○: D̄ – µD = 0; 0.05 LS with 10 – 1 = 9 DF

Step: 2 Computation of D̄ = ΣD/n and SD = √(Σ(D – D̄)²/n)


Subject   Before (A)   After (B)   D (B – A)   D – D̄   (D – D̄)²
1         10           12          2           0        0
2         9            11          2           0        0
3         11           13          2           0        0
4         12           14          2           0        0
5         8            9           1           –1       1
6         7            10          3           1        1
7         12           12          0           –2       4
8         10           14          4           2        4
9         10           11          1           –1       1
10        9            12          3           1        1
                                   ΣD = 20              Σ(D – D̄)² = 12

D̄ = ΣD/n = 20/10 = 2 g/100 ml

SD = √(Σ(D – D̄)²/n) = √(12/10) = √1.2 = 1.1 g/100 ml

Step: 3 Assumption of a sampling distribution of D̄ with a mean of µD = 0 and SE computed as,
SE of mean difference = SD/√(n – 1) = 1.1/√9 = 1.1/3 = 0.37

Step: 4 Location of the observed mean difference in the sampling distribution in terms of t as,
t = (D̄ – µD)/SE = 2/0.37 = 5.41

Step: 5 Decision about the H○: D̄ – µD = 0
Table t at 0.05 LS & 9 DF = 2.262
The calculated t (5.41) > table t (2.262)
Therefore reject H○

Step: 6 The sample mean difference is significant. The drug is effective in increasing the hemoglobin content in aged people. The claim of the company is valid.
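The same paired analysis can be reproduced in Python. This is a minimal sketch assuming scipy.stats.ttest_rel, whose standard paired t-test is algebraically equivalent to the hand calculation above.

```python
# Paired t-test on the before/after hemoglobin data (g/100 ml).
from scipy import stats

before = [10, 9, 11, 12, 8, 7, 12, 10, 10, 9]
after  = [12, 11, 13, 14, 9, 10, 12, 14, 11, 12]

t_stat, p_value = stats.ttest_rel(after, before)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# -> t ≈ 5.48 (the hand calculation gives 5.41 only because SE was rounded
#    to 0.37); p < 0.05, so H0 is rejected, as concluded above.
```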


Unpaired t-test:–

A more common design of experiment is to assign different subjects (experimental


units) to two groups. It is assumed that the two groups are samples from two different
populations such as “control population” and “experimental population”.

Using student’s t-test we can assume sampling distribution of difference between means of
two groups, compute the SE of difference between means, locate the observed difference
between means in the sampling distribution in terms of t value.

The calculated t-value is compared with the table t at specified LS and DF, a decision is made
about the H○ and inference is drawn about the population difference between means.

Procedure for the Unpaired t-test or Analysis of Uncorrelated groups:–

(1). Data: Group 1: n₁ & x̄₁; Group 2: n₂ & x̄₂
(2). H○: µ₁ – µ₂ = 0 (no significant difference between the means of the two populations from which the samples n₁ & n₂ were obtained, respectively)
(3). Sampling distribution of differences between the means. The standard deviation of the difference between the means of two uncorrelated groups is obtained using the pooled variance,

V = (Σdx₁² + Σdx₂²) / (n₁ + n₂ – 2)

The standard deviation between the means is calculated as,

SD = √V

(4). Calculation of the t-value as,

t = (x̄₁ – x̄₂) / (SD·√(1/n₁ + 1/n₂))

(5). Table t at the specified LS and n₁ + n₂ – 2 DF.
(6). Decision: if calculated t > table t ⟹ reject H○.
(7). Inference/Conclusion: Based on the decision, the given hypothesis is discussed.


Example: Two horticultural plots were each divided into six equal sub-plots. Organic
fertilizer is added to plot 1 and chemical fertilizer is added to plot 2. The yield of
fruits from plot 1 and plot 2 in kg/sub-plot is given below. Can we say the yield
due to organic fertilizer is higher than due to chemical fertilizer?
Plot 1 6.2 5.7 6.5 6.0 6.3 5.8
Plot 2 5.6 5.9 5.6 5.7 5.8 5.7

Calculations:
X₁      dx₁ (X₁ – x̄₁)   dx₁²      X₂      dx₂ (X₂ – x̄₂)   dx₂²
6.2     0.12             0.014     5.6     –0.12            0.014
5.7     –0.38            0.144     5.9     0.18             0.032
6.5     0.42             0.176     5.6     –0.12            0.014
6.0     –0.08            0.006     5.7     –0.02            0.0004
6.3     0.22             0.048     5.8     0.08             0.006
5.8     –0.28            0.078     5.7     –0.02            0.0004
ΣX₁ = 36.5               Σdx₁² = 0.466     ΣX₂ = 34.3       Σdx₂² = 0.067

x̄₁ = ΣX₁/n₁ = 36.5/6 = 6.08          x̄₂ = ΣX₂/n₂ = 34.3/6 = 5.72

V = (Σdx₁² + Σdx₂²)/(n₁ + n₂ – 2) = (0.466 + 0.067)/(6 + 6 – 2) = 0.533/10 = 0.0533

SD = √V = √0.0533 = 0.23

t = (x̄₁ – x̄₂) / (SD·√(1/n₁ + 1/n₂)) = (6.08 – 5.72) / (0.23 × √(1/6 + 1/6)) = 0.36 / (0.23 × 0.58) = 0.36/0.133

t = 2.707
Table t at 0.05 LS & 10 DF = 2.228
The calculated t (2.707) > table t (2.228)
∴ reject H○
The yield due to organic fertilizer (plot 1) is significantly higher than due to chemical
fertilizer (plot 2).
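A quick cross-check in Python, assuming scipy.stats.ttest_ind with its pooled-variance (equal_var=True) form, which matches the procedure above:

```python
# Unpaired (independent samples) t-test on the fertilizer yields (kg/sub-plot).
from scipy import stats

plot1 = [6.2, 5.7, 6.5, 6.0, 6.3, 5.8]   # organic fertilizer
plot2 = [5.6, 5.9, 5.6, 5.7, 5.8, 5.7]   # chemical fertilizer

t_stat, p_value = stats.ttest_ind(plot1, plot2, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# -> t ≈ 2.741 (the hand calculation gives 2.707 because of intermediate
#    rounding); p < 0.05, so the organic-fertilizer yield is significantly higher.
```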


2.3 Analysis of Variance (ANOVA)

The “Analysis of variance” (ANOVA) is the appropriate statistical technique to be


used in situations where we have to compare more than two groups. The term ANOVA was
first proposed by R. A. Fisher.

ANOVA refers to the examination of differences among groups. It is an extremely useful technique in Biology. It is used for estimating and testing hypotheses about population variances and population means, and for examining the significance of the difference among more than two sample means at the same time.

Suppose we wish to know if three drugs differ in their effectiveness in lowering serum
cholesterol in human subjects. Some subjects (persons) receive drug A, some drug B and
some drug C. After a specified period of time, measurements are taken to determine the
cholesterol level in each group. In each group, the effect of drug in reducing cholesterol is
different, as there are 3 different drugs. Even in a particular group of persons, for one drug,
the effect of lowering cholesterol is different. This variation is due to differences in genetic
makeup of the subjects and differences in their diets. By using Analysis of Variance, we will
be able to reach a conclusion regarding the equality of the effectiveness of the 3 drugs.

It is used to analyze the results of two different experimental designs.


(1) Completely randomized design (CRD), and (2) Randomized block design (RBD).

Assumptions in Analysis of Variance:–

1. Each of the samples is drawn randomly from a normal population.


2. Each of the populations has the same variance, that is, σ₁² = σ₂² = σ₃² = σ₄² … and so on.

Principle of ANOVA:–

Suppose we want to compare more than two groups (k groups) with n₁, n₂, n₃ … nₖ samples and means x̄₁, x̄₂, x̄₃ … x̄ₖ. These k groups might be experimental groups with different "treatments", or might be samples from populations which differ in some specific feature. Whatever their nature, they are always assumed to have come from different populations about which inferences are to be drawn.

The observed differences between these groups will consist of two components, viz., (1) A
natural variation (“error”) and (2) Variation due to “treatment” or any other factor.

In ANOVA, the two components of the observed difference are separated, estimated and
compared. The variation due to “treatment” is expected to occur between groups and
therefore, it is referred to as “between variance”. The normal variation would occur within
each of the groups and, therefore, referred to as “within variance”. The variation from the two
sources, “between” and “within” together is called the “total variance”.

If the "within" variability is greater than the "between" variability, it means that the difference between the groups is not significant. On the other hand, if the "between" variability is greater than the "within" variability, it suggests a significant difference between the groups.

While designing experiments in Biology or in other fields, every care must be taken to have
the natural variability (error) to be distributed randomly. If samples are drawn from different
isolated populations for the purpose of comparison of these populations, it is assumed that the
"error" is randomly distributed. In laboratory or field experiments, "randomization" procedures must be followed in the allotment of experimental units to different groups. If no such randomization procedures are followed, the "error" is no longer "random" but "systematic". Systematic errors will definitely lead to bias in the inference.

Computation of ANOVA (Testing steps/Procedure):–

1. Description of Data: The measurements resulting from a one-way ANOVA, along with the means of all the measurements, are displayed in table form.
2. Hypothesis: We test the null hypothesis that all population or treatment means are equal.
H○: µ₁ = µ₂ = µ₃ = …… = µₖ
3. Test statistic:
(a) First find the mean of each sample: x̄₁, x̄₂, x̄₃ …… x̄ₖ


(b) Find the combined (grand) mean of all the samples,

x̿ = (ΣX₁ + ΣX₂ + ΣX₃ + ⋯ + ΣXₖ) / (total no. of observations, N)      (x̿ is read "X double bar")

(c) Take the deviation of each sample mean from the combined mean, i.e., (x̄₁ – x̿), (x̄₂ – x̿), (x̄₃ – x̿) ….

(d) Square each deviation and multiply it by the number of items in the corresponding sample, i.e., n₁(x̄₁ – x̿)², n₂(x̄₂ – x̿)², n₃(x̄₃ – x̿)² ….

(e) Then total these values. This is known as the Sum of Squares between the groups, or SSbetween.
∴ SSbetween = n₁(x̄₁ – x̿)² + n₂(x̄₂ – x̿)² + n₃(x̄₃ – x̿)² + …… + nₖ(x̄ₖ – x̿)²
(nᵢ = number of items in the corresponding sample)

(f) Divide the result of step (e), i.e., SSbetween, by the degrees of freedom between the groups (k – 1). This is known as the Mean Square between the groups, or MSbetween.
MSbetween = SSbetween / (k – 1)

(g) Find the deviation of each observation from the mean of its own sample, square each deviation, and sum all the squared deviations over all the samples. This is known as the Sum of Squares within the groups, or SSwithin.
SSwithin = Σ(X₁ – x̄₁)² + Σ(X₂ – x̄₂)² + Σ(X₃ – x̄₃)² + …… + Σ(Xₖ – x̄ₖ)²
or SSwithin = ΣΣxᵢ² – Σ((Σxᵢ)²/nᵢ)

(h) Divide the result of step (g), i.e., SSwithin, by the degrees of freedom within the groups (nk – k, where each of the k groups contains n observations). This is known as the Mean Square within the samples, or MSwithin.
MSwithin = SSwithin / (nk – k)


(i) Make the ANOVA table:

Source of Variation    SS          DF        MS = SS/DF
Between Samples        SSbetween   k – 1     SSbetween/(k – 1)
Within Samples         SSwithin    nk – k    SSwithin/(nk – k)
Total                  SStotal     nk – 1

(j) Find the F value,

F = MSbetween / MSwithin

(k) If the calculated F-value is less than the F-table value, there is no significant difference among the sample means.

Example: The following data represent the gain in weight (in kg) of a species of edible fish cultured in 4 diet formulations (D1, D2, D3 & D4) for a period of 3 months. Analyze these data for a significant difference among the diet formulations in terms of gain in weight.
D1 D2 D3 D4
4 8 5 1
5 7 7 4
1 9 8 1
3 6 6 3
2 10 9 1

Calculations:
(1) n₁ = 5, n₂ = 5, n₃ = 5, n₄ = 5, k = 4
Σx₁ = 15, Σx₂ = 40, Σx₃ = 35, Σx₄ = 10

(2) x̄₁ = 15/5 = 3, x̄₂ = 8, x̄₃ = 7, x̄₄ = 2


(3) x̿ = (Σx₁ + Σx₂ + Σx₃ + Σx₄)/N = (15 + 40 + 35 + 10)/20 = 100/20 = 5

(4) SSbetween = n₁(x̄₁ – x̿)² + n₂(x̄₂ – x̿)² + n₃(x̄₃ – x̿)² + n₄(x̄₄ – x̿)²
= 5(3 – 5)² + 5(8 – 5)² + 5(7 – 5)² + 5(2 – 5)²
= 5(–2)² + 5(3)² + 5(2)² + 5(–3)²
= 5(4) + 5(9) + 5(4) + 5(9)
= 20 + 45 + 20 + 45
= 130

(5) MSbetween = SSbetween/(k – 1) = 130/(4 – 1) = 130/3 = 43.33

(6) For ΣΣxᵢ²,

D1 = 16 + 25 + 1 + 9 + 4 = 55
D2 = 64 + 49 + 81 + 36 + 100 = 330
D3 = 25 + 49 + 64 + 36 + 81 = 255
D4 = 1 + 16 + 1 + 9 + 1 = 28
∴ ΣΣxᵢ² = 55 + 330 + 255 + 28 = 668

(7) For Σ((Σxᵢ)²/nᵢ),
D1 = (Σx₁)²/n₁ = (15)²/5 = 225/5 = 45
D2 = (Σx₂)²/n₂ = (40)²/5 = 1600/5 = 320
D3 = (Σx₃)²/n₃ = (35)²/5 = 1225/5 = 245
D4 = (Σx₄)²/n₄ = (10)²/5 = 100/5 = 20
∴ Σ((Σxᵢ)²/nᵢ) = 45 + 320 + 245 + 20 = 630

(8) SSwithin = ΣΣxᵢ² – Σ((Σxᵢ)²/nᵢ) = 668 – 630 = 38


(9) MSwithin = SSwithin/(nk – k) = 38/(20 – 4) = 38/16 = 2.375

(10) ANOVA table:


Source of Variation SS DF MS
Between 130 3 43.33
Within 38 16 2.375
Total 168 19

F-ratio = MSbetween/MSwithin = 43.33/2.375 = 18.24

F-table value with 3 and 16 DF at 0.05 LS = 3.24.

Since the calculated F (18.24) > table F (3.24),

Therefore, we reject the null hypothesis (H○).

The means of the four diet groups are not the same at P < 0.05.
There is significant difference among the diet formulations in terms of weight gain.
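The whole analysis collapses to a one-liner in Python. This sketch assumes scipy.stats.f_oneway, which computes the same F-ratio as the hand calculation above.

```python
# One-way ANOVA on the weight gain (kg) under four diet formulations.
from scipy import stats

d1 = [4, 5, 1, 3, 2]
d2 = [8, 7, 9, 6, 10]
d3 = [5, 7, 8, 6, 9]
d4 = [1, 4, 1, 3, 1]

f_stat, p_value = stats.f_oneway(d1, d2, d3, d4)
print(f"F = {f_stat:.2f}, p = {p_value:.5f}")
# -> F ≈ 18.25 with p < 0.05 (DF 3 and 16), so H0 (equal diet means)
#    is rejected, agreeing with the table-based decision above.
```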


2.4 Regression and Correlation Analysis

General Introduction:–

The various statistical methods studied so far, such as measures of central tendency and measures of dispersion, relate to one variable only. There are many situations where two variables are inter-related and a change in the value of one variable causes a change in the value of the other variable. For example, we may like to study the relationship between the height and weight of persons, blood pressure and age, consumption of a certain nutrient and weight gain, or intensity of stimulus and intensity of reaction. The study of the nature and strength of the relationship between two variables is described in terms of correlation and regression.

In correlation analysis, we are concerned with whether two variables are independent or whether they vary together in a positive or negative direction. In correlation, the two variables are not related as independent and dependent variables; both variables may be affected by a common cause, and the degree to which they vary together is estimated.

In regression analysis, the dependence of one variable on another variable is determined.


Therefore, the two variables are related as independent and dependent variables. Regression
analysis is employed to predict or estimate the value of one variable corresponding to a given
value of another variable. Regression equations are applied to determine changes in Y due to
changes in X variables.

Correlation:–

The relationship between two or more variables is called “correlation”, and the
variables are said to be correlated. The relationship between two variables is also known as
“covariation”. The term “relationship” can be used in two different senses, viz., mutual
dependence, and cause and effect relationship.

(1) Mutual Dependence:


Consider the two variables, rate of oxygen consumption and metabolism in organisms. When the oxygen consumption increases, there is an increase in the metabolism as well. Similarly, when the organism increases its activity (metabolism), it consumes more oxygen. On the other hand, when the oxygen consumption decreases, the activity, i.e., the metabolism, decreases; when the organism becomes less active, its oxygen consumption also becomes lower. A relationship between two variables in which a change in the value of one of the two variables brings about a change in the value of the other variable is said to be one of 'mutual dependence'.

(2) Cause and Effect Relationship:

A relationship between two variables in which changes in the values of one variable is
the cause of the changes in the values of the other variable is said to be ‘cause and effect
relationship’ between the two variables.

For example, consider the two variables, environmental temperature and the body temperature of poikilotherms living in that environment. When there is an increase in the environmental temperature, there is an increase in the body temperature. Here the increase in the environmental temperature is the "cause" and the increase in the body temperature is the "effect". Such a relationship between two variables is known as a "cause and effect" relationship.

The cause and effect relationship between two variables may be either direct or indirect. In the example of environmental temperature and body temperature of poikilotherms, one more variable, namely the oxygen consumption of the organism, may also be considered.

When there is increase in the body temperature of the organisms, there is increase in their
oxygen consumption as well. Here the relationship between increase in the environmental
temperature and body temperature of the poikilotherms is direct. The cause and effect
relationship between environmental temperature and oxygen consumption is through the
other factor namely the body temperature, and therefore the relationship is indirect.

Sometimes you might come across two variables that do not have any direct relationship between them, yet the value of one variable changes when that of the other variable changes. This may be due to a third factor that causes both variables to change their values.


Let us consider an example to illustrate the above situation. The amount of paddy produced and the amount of cotton produced in the same area, say a district, obviously do not have any direct relationship between them. Yet we may find that whenever there is an increase in the yield of paddy there is an increase in the yield of cotton as well. The reason for this relationship might be the rainfall received in that district. Thus the relationship between the variables paddy yield and cotton yield is due to a third factor, namely the amount of rainfall.

The nature of the correlation between two variables need not be the same at all times. For example, the relationship between the height and weight of humans, or the length and weight of organisms, may not always be the same. Generally, with an increase in height or length there is an increase in weight. However, it is common to see tall people who weigh less and short people who weigh more.

Significance of Correlation:–

The study of correlation is of great significance in practical life, because of the following
reasons:
1. The study of correlation enables us to know the nature, direction and degree of relationship
between two or more variables.
2. Correlation studies help us to estimate the changes in the value of one variable as a result
of change in the value of related variable. This is called regression analysis.
3. Correlation analysis helps us in understanding the behavior of certain events under specific
circumstances. For example, we can identify the factors for rainfall in a given area and
how these factors influence paddy production.
4. Correlation facilitates decision making in the business world. It reduces the element of uncertainty in decision-making.
5. It helps in making predictions.

Types of Correlation:–
Correlation between variables may be simple or multiple. A simple correlation deals with only two variables, whereas a multiple correlation deals with more than two variables. A correlation may be positive or negative and, whether positive or negative, it may be linear or non-linear.


1. Positive Correlation:–

A correlation between two variables in which, with an increase in the values of one
variable the values of the other variable also increases, and with a decrease in the value of the
one variable the value of the other variable also decreases, is said to be a positive correlation.

In other words, in a positive correlation the values of both variables move in the same direction. For example, the correlation between the environmental temperature and the body temperature of poikilotherms is a positive correlation.

2. Negative Correlation:–

A correlation between two variables in which when there is an increase in the value of
one variable, the values of the other variable decreases, and when there is a decrease in the
values of one variable the other variable increases, is said to be a negative correlation.

In other words, in a negative correlation the values of the two variables move in opposite directions. For example, the correlation between environmental temperature and bacterial growth, having a cause and effect relationship, is a negative one: with an increase in the temperature, the bacterial growth decreases.

3. Linear Correlation:–

When the values of two variables vary in a constant ratio, the correlation between two
variables is said to be linear. The correlation between the optical density and the intensity of
the color of a solution is an example of linear correlation.

4. Non-linear or Curvilinear Correlation:

If the amount of change in the value of one variable and the corresponding amount of
change in the other variable are not in a constant ratio, the correlation between the two
variables is said to be non-linear or curvilinear. The correlation between the length and
weight of fish is generally a non-linear correlation.


Methods of study of Correlation:–

We can study the presence or absence and extent of correlation between variables by
one of the following methods: (1) Scatter diagram; (2) Correlation graph; and (3) Karl
Pearson’s Coefficient of Correlation (r).

Scatter diagrams and correlation graphs are graphical methods, and they indicate only the nature of the correlation, whether positive or negative. No numerical measure of the extent of correlation is given by these methods. Karl Pearson's coefficient of correlation gives the magnitude of the correlation between two variables in numerical terms.

(1) Scatter Diagram:–

It is an easy and simple method of studying the correlation between two variables. A scatter diagram is constructed as follows. If X and Y are pairs of variables, the values of variable X are marked on the X-axis and the values of Y are marked on the Y-axis. A point is plotted for each value of X and the corresponding Y value. A swarm of dots is obtained, and this is called a scatter diagram. The nature of the scatter of dots in the diagram gives an idea of the nature of the correlation between the given variables, as shown in the following figures.

If the plotted dots form a straight line running left to right in the upward direction, the
correlation is perfectly positive.

Perfect positive correlation


If the dots are scattered around a straight line running from left to right in an upward direction, then the correlation between the two variables is positive.

Positive correlation

If the dots of the scatter diagram form a straight line running from left to right in the
downward direction, the correlation between the two variables is perfect negative.

Perfect negative correlation


A scatter diagram in which the plotted dots form a swarm around a straight line running from left to right in the downward direction indicates a negative correlation between the variables.

Negative correlation

In cases where there is no correlation between the variables, the scatter of dots in the diagram
will not form either a straight line or even a flow of dots from left to right in the upward or
downward direction.

No correlation


(2) Correlation Graph:–

When the given variables are recorded with reference to a period of time, the correlation graph is the ideal method to understand the relationship between the two variables. To draw the correlation graph, the period of time is marked on the X-axis and the values of the two variables on the Y-axis (or, if necessary, on two separate Y-axes). The points of a variable plotted against time are joined by a line to form a curve. A separate curve is constructed for each of the two variables (figure).

In a correlation graph, if the curves of the two variables are close to each other and if they
move in the same direction, the variables have a positive correlation. On the other hand, if the
curves of the two variables move in opposite directions, the variables are negatively
correlated.

Correlation graph showing positive correlation between the variable x and y

(3) Karl Pearson’s Coefficient of Correlation:–

Coefficient of correlation (r) is a quantitative measure of the correlation between two


variables. The correlation coefficient is calculated on the basis of the following assumptions about the two variables: (1) the correlation between the two variables is linear; (2) the two variables are related to each other in a cause and effect relationship; (3) the values of the two variables are affected by factors that are common to both variables. The equation for obtaining Karl Pearson's coefficient (r) is,


r = Σ(X – X̄)(Y – Ȳ) / (n·Sx·Sy)

where, r = coefficient of correlation,
X = variable x,
Y = variable y,
X̄ = mean of variable x,
Ȳ = mean of variable y,
n = number of pairs of variables,
Sx = SD of variable x, and
Sy = SD of variable y

A modification of the above basic formula that is easier to use is as follows:

r = Σ(dx·dy) / √(Σdx² · Σdy²)      where dx = X – X̄ and dy = Y – Ȳ

Interpretation of Karl Pearson’s Coefficient of Correlation:–

The value of r always lies between +1 and –1.


When: r = +1, the correlation between the two variables is perfect positive;
r = –1, the correlation is perfect negative;
r = a positive value lying between 0 and +1, the correlation is positive;
r = a negative value lying between –1 and 0, the correlation is negative;
r = 0, there is no linear correlation between the two given variables.

An important thing to be understood about the coefficient of correlation (r) is that it does not represent a percent agreement between the two given variables. If r = –1, +1 or 0 the interpretation is unambiguous, but a value such as r = 0.5 does not indicate 50% agreement between the two variables; the relationship between r and the strength of association is not linear (it is r², the coefficient of determination, that measures the proportion of variation shared by the two variables).


Example: Obtain the coefficient of correlation for the following data on the length (X in cm)
and weight (Y in g) of fish.

X 5 7 3 1 9 12 8 3
Y 8 9 5 4 9 13 7 9

Calculations:
X       Y       dx (X – X̄)   dy (Y – Ȳ)   dx²     dy²     dx·dy
5       8       –1            0             1       0       0
7       9       1             1             1       1       1
3       5       –3            –3            9       9       9
1       4       –5            –4            25      16      20
9       9       3             1             9       1       3
12      13      6             5             36      25      30
8       7       2             –1            4       1       –2
3       9       –3            1             9       1       –3
Σx = 48  Σy = 64                            Σdx² = 94   Σdy² = 54   Σdx·dy = 58

X̄ = Σx/n = 48/8 = 6          Ȳ = Σy/n = 64/8 = 8

r = Σ(dx·dy) / √(Σdx² · Σdy²) = 58 / √(94 × 54) = 58 / √5076 = 58/71.25

r = 0.81
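The same coefficient can be obtained directly in Python. A minimal sketch, assuming scipy.stats.pearsonr:

```python
# Karl Pearson's coefficient of correlation for the fish data.
from scipy import stats

length = [5, 7, 3, 1, 9, 12, 8, 3]   # X, length in cm
weight = [8, 9, 5, 4, 9, 13, 7, 9]   # Y, weight in g

r, p_value = stats.pearsonr(length, weight)
print(f"r = {r:.2f}, p = {p_value:.4f}")
# -> r ≈ 0.81, matching the hand calculation: a strong positive correlation.
```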


Regression

Introduction:–

If two variables, x and y, are related to one another with a significant correlation, it is possible to obtain an estimate of the value of x if y is known, and of the value of y if x is known. The statistical device with the help of which the unknown values of one variable are estimated from the known values of another, correlated variable is called regression. A line that gives the best estimate of the value of one variable when the value of the other variable is given is known as a regression line.

Two regression lines, x on y and y on x, can be drawn for a series of bivariate data. The regression line of x on y gives the best estimate of the value of x when a value of y is given. The regression line of y on x gives the best estimate of the value of y when the value of x is given.

Regression lines are described by algebraic expressions called the regression equations. Regression equations are used to draw the regression lines. They are also used as numerical methods to find the best estimate of the values of one variable from the values of the other variable.

Regression equations:–

For every series of bivariate data two regression equations are derived, viz.,
regression equation of x on y that is used to draw the regression line of x on y, and the
regression equation of y on x that is used to draw the regression line of y on x.

The regression equation of x on y is as follows,

(X – X̄) = r(Sx/Sy)(Y – Ȳ) --------------------(1) ⇒ x = ay + b

The regression equation of y on x is as follows,

(Y – Ȳ) = r(Sy/Sx)(X – X̄) --------------------(2) ⇒ y = ax + b


In the above equations, X̄ is the mean of variable x, Ȳ is the mean of variable y, Sx and Sy are the SD's of the variables x and y respectively, and r is the correlation coefficient between the variables x and y.

If a value of y is given, the corresponding value of x can be estimated using the equation (1).
If a value of x is given, the corresponding value of y can be estimated using the equation (2).
Equation (1) yields a final form as x = ay + b, and the equation (2), y = ax + b.

In the expression x = ay + b, x is considered the dependent variable and y the independent variable; x is said to be a function of y. The coefficient a in the above expression is called the regression coefficient, or the slope of the regression equation of x on y. The term b in the expression is referred to as the constant or intercept.

Similarly, in the expression y = ax + b, y is the dependent variable and x the independent variable, and y is said to be a function of x. Here a is the regression coefficient or slope, and b the constant or intercept, of the regression equation of y on x.

Suppose there is a unit change in the value of y; the corresponding change in the value of x is given by the regression coefficient of x on y. That is, when there is a unit change in the value of y, the value of x will change by the amount rSx/Sy. In other words, the regression coefficient decides the slope of the regression line of x on y.

Similarly, the regression coefficient of y on x, rSy/Sx, gives the amount of change in the value of y when there is a unit change in the value of x, and thus describes the slope of the regression line of y on x.

The formulae for obtaining the regression coefficient and constant of the two regression
equations are as follows.

Regression equation of x on y,

(X – X̄) = r(Sx/Sy)(Y – Ȳ)

X = AY + B

A = coefficient of Y in the regression equation of x on y, and therefore the regression coefficient of x on y:

A = r·Sx/Sy   or   A = Σ(dx·dy) / Σdy²

B = constant or intercept: B = X̄ – A·Ȳ

Regression equation of y on x,

(Y – Ȳ) = r(Sy/Sx)(X – X̄)

Y = AX + B

A = coefficient of X in the regression equation of y on x, and therefore the regression coefficient of y on x:

A = r·Sy/Sx   or   A = Σ(dx·dy) / Σdx²

B = constant or intercept: B = Ȳ – A·X̄
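As an illustration, the sketch below (a minimal example assuming numpy, reusing the fish length/weight data from the correlation example) fits both regression lines by least squares.

```python
# The two regression lines for the fish data (length in cm, weight in g).
# numpy.polyfit(u, v, 1) returns the slope and intercept of v = a*u + b.
import numpy as np

x = np.array([5, 7, 3, 1, 9, 12, 8, 3], dtype=float)   # length
y = np.array([8, 9, 5, 4, 9, 13, 7, 9], dtype=float)   # weight

a_yx, b_yx = np.polyfit(x, y, 1)   # y on x: slope = Σ(dx·dy)/Σdx² = 58/94
a_xy, b_xy = np.polyfit(y, x, 1)   # x on y: slope = Σ(dx·dy)/Σdy² = 58/54
print(f"y on x: y = {a_yx:.3f}x {b_yx:+.3f}")   # ≈ 0.617x + 4.298
print(f"x on y: x = {a_xy:.3f}y {b_xy:+.3f}")   # ≈ 1.074y - 2.593
# Both lines pass through (X̄, Ȳ) = (6, 8) — see property (4) below.
```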

Regression lines:–

Using the two regression equations, two regression lines, one x on y and another y on
x, can be drawn. Suitable values of the variable x are chosen and their corresponding y values
are estimated using regression equation of y on x. The chosen x values and their estimated y
values are plotted to draw the regression line of y on x. Likewise, suitable y values are chosen
and their corresponding x values are estimated using the regression equation of x on y. The
chosen y values and their estimated x values are plotted to draw the regression line x on y.

Properties of regression lines:–

(1) If the correlation between the two given variables is perfectly positive or perfectly
negative, the two regression lines will overlap with each other. That is, there will be only one
regression line representing both x on y and y on x.


Regression lines when r = +1 Regression lines when r = –1

(2) If the degree of correlation between the two given variables x and y is higher, i.e., when
the value of r is close to either –1 or +1, the two regression lines will be closer to each other.

Regression lines close to each other when r is close to +1 or –1

On the other hand, if the correlation between the two variables is low, i.e., when r is close to 0, the regression lines will be further apart, as shown in the following figure. Thus the angle between the two regression lines is an indication of how close or how far apart they are.

Regression lines further apart when r is close to 0


(3) When there is no correlation between the two variables, i.e., when the r = 0, the two
regression lines intersect each other at right angles, as shown below.

Regression lines intersecting each other at right angle when r = 0

(4) The two regression lines intersect each other at the point (X̄, Ȳ). If the values of x and y are found with the help of the two regression equations, i.e., by solving them simultaneously, these values will be X̄ and Ȳ.

Difference between Correlation and Regression:–

Though correlation and regression are closely related to each other, there are certain
differences between them.

(1) Correlation measures the degree and nature of the relationship between two given variables. Regression, on the other hand, gives the average change in the value of one variable for a change in the value of the other variable.

(2) Correlation is a two way relationship between the two variables. That is, if x and y are the
two variables given, the correlation between x and y is the same as the correlation between y
and x. On the other hand, regression is a one-way relationship. The regression of x on y is not
the same as the regression of y on x.


Distribution of Probability – Chi-square [ χ2 - Table]


Probability (P)
Degree of freedom
(df) 0.1 0.05 0.025 0.01 0.001
(90%) (95%) (97.5%) (99%) (99.9%)
1 2.706 3.841 5.024 6.635 10.828
2 4.605 5.991 7.378 9.210 13.816
3 6.251 7.815 9.348 11.345 16.266
4 7.779 9.488 11.143 13.277 18.467
5 9.236 11.070 12.833 15.086 20.515
6 10.645 12.592 14.449 16.812 22.458
7 12.017 14.067 16.013 18.475 24.322
8 13.362 15.507 17.535 20.090 26.124
9 14.684 16.919 19.023 21.666 27.877
10 15.987 18.307 20.483 23.209 29.588
11 17.275 19.675 21.920 24.725 31.264
12 18.549 21.026 23.337 26.217 32.909
13 19.812 22.362 24.736 27.688 34.528
14 21.064 23.685 26.119 29.141 36.123
15 22.307 24.996 27.488 30.578 37.697
16 23.542 26.296 28.845 32.000 39.252
17 24.769 27.587 30.191 33.409 40.790
18 25.989 28.869 31.526 34.805 42.312
19 27.204 30.144 32.852 36.191 43.820
20 28.412 31.410 34.170 37.566 45.315
21 29.615 32.671 35.479 38.932 46.797
22 30.813 33.924 36.781 40.289 48.268
23 32.007 35.172 38.076 41.638 49.728
24 33.196 36.415 39.364 42.980 51.179
25 34.382 37.652 40.646 44.314 52.620
26 35.563 38.885 41.923 45.642 54.052
27 36.741 40.113 43.195 46.963 55.476
28 37.916 41.337 44.461 48.278 56.892
29 39.087 42.557 45.722 49.588 58.301
30 40.256 43.773 46.979 50.892 59.703
40 51.805 55.758 59.342 63.691 73.402
50 63.167 67.505 71.420 76.154 86.661
100 118.498 124.342 129.561 135.807 149.449


‘t’- Table [Distribution of ‘t’ – Probability]

Degree of freedom Probability (P)


(df) 0.1 0.05 0.02 0.01 0.002
1 6.314 12.706 31.821 63.657 318.310
2 2.920 4.303 6.965 9.925 22.326
3 2.353 3.182 4.541 5.841 10.213
4 2.132 2.776 3.747 4.604 7.173
5 2.015 2.571 3.365 4.032 5.893
6 1.943 2.447 3.143 3.707 5.208
7 1.895 2.365 2.998 3.499 4.785
8 1.86 2.306 2.896 3.355 4.501
9 1.833 2.262 2.821 3.250 4.297
10 1.812 2.228 2.764 3.169 4.144
11 1.796 2.201 2.718 3.106 4.025
12 1.782 2.179 2.681 3.055 3.930
13 1.771 2.160 2.650 3.012 3.852
14 1.761 2.145 2.624 2.977 3.787
15 1.753 2.131 2.602 2.947 3.733
16 1.746 2.120 2.583 2.921 3.686
17 1.74 2.110 2.567 2.898 3.646
18 1.734 2.101 2.552 2.878 3.610
19 1.729 2.093 2.539 2.861 3.579
20 1.725 2.086 2.528 2.845 3.552
21 1.721 2.080 2.518 2.831 3.527
22 1.717 2.074 2.508 2.819 3.505
23 1.714 2.069 2.500 2.807 3.485
24 1.711 2.064 2.492 2.797 3.467
25 1.708 2.060 2.485 2.787 3.450
26 1.706 2.056 2.479 2.779 3.435
27 1.703 2.052 2.473 2.771 3.421
28 1.701 2.048 2.467 2.763 3.408
29 1.699 2.045 2.462 2.756 3.396
30 1.697 2.042 2.457 2.750 3.385
40 1.684 2.021 2.423 2.704 3.307
60 1.671 2.000 2.390 2.660 3.232
120 1.658 1.980 2.358 2.617 3.160
Infinity (∞) 1.645 1.960 2.326 2.576 3.090


‘F’ - Table (P = 0.05)


DF1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
DF2
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.76 8.74 8.73 8.71 8.7 8.66

4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6 5.96 5.94 5.91 5.89 5.87 5.86 5.8

5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.7 4.68 4.66 4.64 4.62 4.56

6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.1 4.06 4.03 4 3.98 3.96 3.94 3.87

7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.6 3.57 3.55 3.53 3.51 3.44

8 5.32 4.46 4.07 3.84 3.69 3.58 3.5 3.44 3.39 3.35 3.31 3.28 3.26 3.24 3.22 3.15
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.1 3.07 3.05 3.03 3.01 2.94

10 4.96 4.1 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.94 2.91 2.89 2.86 2.85 2.77

11 4.84 3.98 3.59 3.36 3.2 3.09 3.01 2.95 2.9 2.85 2.82 2.79 2.76 2.74 2.72 2.65

12 4.75 3.89 3.49 3.26 3.11 3 2.91 2.85 2.8 2.75 2.72 2.69 2.66 2.64 2.62 2.54

13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.63 2.6 2.58 2.55 2.53 2.46

14 4.6 3.74 3.34 3.11 2.96 2.85 2.76 2.7 2.65 2.6 2.57 2.53 2.51 2.48 2.46 2.39

15 4.54 3.68 3.29 3.06 2.9 2.79 2.71 2.64 2.59 2.54 2.51 2.48 2.45 2.42 2.4 2.33

16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.46 2.42 2.4 2.37 2.35 2.28

17 4.45 3.59 3.2 2.96 2.81 2.7 2.61 2.55 2.49 2.45 2.41 2.38 2.35 2.33 2.31 2.23

18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.37 2.34 2.31 2.29 2.27 2.19

19 4.38 3.52 3.13 2.9 2.74 2.63 2.54 2.48 2.42 2.38 2.34 2.31 2.28 2.26 2.23 2.16

20 4.35 3.49 3.1 2.87 2.71 2.6 2.51 2.45 2.39 2.35 2.31 2.28 2.25 2.23 2.2 2.12

22 4.3 3.44 3.05 2.82 2.66 2.55 2.46 2.4 2.34 2.3 2.26 2.23 2.2 2.17 2.15 2.07

24 4.26 3.4 3.01 2.78 2.62 2.51 2.42 2.36 2.3 2.25 2.22 2.18 2.15 2.13 2.11 2.03

26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.18 2.15 2.12 2.09 2.07 1.99

28 4.2 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.15 2.12 2.09 2.06 2.04 1.96

30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.13 2.09 2.06 2.04 2.01 1.93

35 4.12 3.27 2.87 2.64 2.49 2.37 2.29 2.22 2.16 2.11 2.08 2.04 2.01 1.99 1.96 1.88

40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.04 2 1.97 1.95 1.92 1.84

45 4.06 3.2 2.81 2.58 2.42 2.31 2.22 2.15 2.1 2.05 2.01 1.97 1.94 1.92 1.89 1.81

50 4.03 3.18 2.79 2.56 2.4 2.29 2.2 2.13 2.07 2.03 1.99 1.95 1.92 1.89 1.87 1.78

60 4 3.15 2.76 2.53 2.37 2.25 2.17 2.1 2.04 1.99 1.95 1.92 1.89 1.86 1.84 1.75

70 3.98 3.13 2.74 2.5 2.35 2.23 2.14 2.07 2.02 1.97 1.93 1.89 1.86 1.84 1.81 1.72

80 3.96 3.11 2.72 2.49 2.33 2.21 2.13 2.06 2 1.95 1.91 1.88 1.84 1.82 1.79 1.7

100 3.94 3.09 2.7 2.46 2.31 2.19 2.1 2.03 1.97 1.93 1.89 1.85 1.82 1.79 1.77 1.68

200 3.89 3.04 2.65 2.42 2.26 2.14 2.06 1.98 1.93 1.88 1.84 1.8 1.77 1.74 1.72 1.62

500 3.86 3.01 2.62 2.39 2.23 2.12 2.03 1.96 1.9 1.85 1.81 1.77 1.74 1.71 1.69 1.59

1000 3.85 3 2.61 2.38 2.22 2.11 2.02 1.95 1.89 1.84 1.8 1.76 1.73 1.7 1.68 1.58


‘F’ - Table (P = 0.01)


DF1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
DF2
3 34.12 30.82 29.46 28.71 28.24 27.91 27.67 27.49 27.4 27.2 27.1 27.1 26.9 26.9 26.9 26.7

4 21.2 18 16.7 15.9 15.5 15.2 14.9 14.8 14.7 14.6 14.5 14.4 14.3 14.3 14.2 14.0

5 16.3 13.3 12.1 11.4 10.9 10.7 10.5 10.3 10.2 10.1 9.96 9.89 9.82 9.77 9.72 9.55

6 13.8 10.9 9.78 9.15 8.75 8.47 8.26 8.1 7.98 7.87 7.79 7.72 7.66 7.61 7.56 7.4

7 12.3 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.54 6.47 6.41 6.36 6.31 6.16

8 11.3 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.73 5.67 5.61 5.56 5.52 5.36

9 10.6 8.02 6.99 6.42 6.06 5.8 5.61 5.47 5.35 5.26 5.18 5.11 5.05 5.01 4.96 4.81

10 10.0 7.56 6.55 5.99 5.64 5.39 5.2 5.06 4.94 4.85 4.77 4.71 4.65 4.6 4.56 4.41

11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54 4.46 4.4 4.34 4.29 4.25 4.1

12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.5 4.39 4.3 4.22 4.16 4.1 4.05 4.01 3.86

13 9.07 6.7 5.74 5.21 4.86 4.62 4.44 4.3 4.19 4.1 4.02 3.96 3.91 3.86 3.82 3.66

14 8.86 6.51 5.56 5.04 4.7 4.46 4.28 4.14 4.03 3.94 3.86 3.8 3.75 3.7 3.66 3.51

15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4 3.89 3.8 3.73 3.67 3.61 3.56 3.52 3.37

16 8.53 6.23 5.29 4.77 4.44 4.2 4.03 3.89 3.78 3.69 3.62 3.55 3.5 3.45 3.41 3.26

17 8.4 6.11 5.19 4.67 4.34 4.1 3.93 3.79 3.68 3.59 3.52 3.46 3.4 3.35 3.31 3.16

18 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.6 3.51 3.43 3.37 3.32 3.27 3.23 3.08

19 8.19 5.93 5.01 4.5 4.17 3.94 3.77 3.63 3.52 3.43 3.36 3.3 3.24 3.19 3.15 3

20 8.1 5.85 4.94 4.43 4.1 3.87 3.7 3.56 3.46 3.37 3.29 3.23 3.18 3.13 3.09 2.94

22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26 3.18 3.12 3.07 3.02 2.98 2.83

24 7.82 5.61 4.72 4.22 3.9 3.67 3.5 3.36 3.26 3.17 3.09 3.03 2.98 2.93 2.89 2.74

26 7.72 5.53 4.64 4.14 3.82 3.59 3.42 3.29 3.18 3.09 3.02 2.96 2.9 2.86 2.82 2.66

28 7.64 5.45 4.57 4.07 3.75 3.53 3.36 3.23 3.12 3.03 2.96 2.9 2.84 2.79 2.75 2.6

30 7.56 5.39 4.51 4.02 3.7 3.47 3.3 3.17 3.07 2.98 2.91 2.84 2.79 2.74 2.7 2.55

35 7.42 5.27 4.4 3.91 3.59 3.37 3.2 3.07 2.96 2.88 2.8 2.74 2.69 2.64 2.6 2.44

40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.8 2.73 2.66 2.61 2.56 2.52 2.37

45 7.23 5.11 4.25 3.77 3.45 3.23 3.07 2.94 2.83 2.74 2.67 2.61 2.55 2.51 2.46 2.31

50 7.17 5.06 4.2 3.72 3.41 3.19 3.02 2.89 2.79 2.7 2.63 2.56 2.51 2.46 2.42 2.27

60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63 2.56 2.5 2.44 2.39 2.35 2.2

70 7.01 4.92 4.07 3.6 3.29 3.07 2.91 2.78 2.67 2.59 2.51 2.45 2.4 2.35 2.31 2.15

80 6.96 4.88 4.04 3.56 3.26 3.04 2.87 2.74 2.64 2.55 2.48 2.42 2.36 2.31 2.27 2.12

100 6.9 4.82 3.98 3.51 3.21 2.99 2.82 2.69 2.59 2.5 2.43 2.37 2.31 2.27 2.22 2.07

200 6.76 4.71 3.88 3.41 3.11 2.89 2.73 2.6 2.5 2.41 2.34 2.27 2.22 2.17 2.13 1.97

500 6.69 4.65 3.82 3.36 3.05 2.84 2.68 2.55 2.44 2.36 2.28 2.22 2.17 2.12 2.07 1.92

1000 6.66 4.63 3.8 3.34 3.04 2.82 2.66 2.53 2.43 2.34 2.27 2.2 2.15 2.1 2.06 1.9


‘F’ - Table (P = 0.001)


DF1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 20
DF2
3 167.03 148.5 141.11 137.1 134.6 132.9 131.6 130.6 129.9 129.3 128.7 128.3 127.9 127.7 127.4 126.4

4 74.14 61.25 56.18 53.44 51.71 50.53 49.66 49 48.5 48.1 47.7 47.4 47.2 46.9 46.8 46.1

5 47.18 37.12 33.2 31.09 29.75 28.84 28.16 27.7 27.3 26.9 26.7 26.4 26.2 26.1 25.9 25.4

6 35.51 27 23.7 21.92 20.8 20.03 19.46 19.0 18.7 18.4 18.2 17.9 17.8 17.7 17.6 17.1

7 29.25 21.69 18.77 17.2 16.21 15.52 15.02 14.6 14.3 14.1 13.9 13.7 13.6 13.4 13.3 12.9

8 25.42 18.49 15.83 14.39 13.49 12.86 12.4 12.1 11.8 11.5 11.4 11.2 11.1 10.9 10.8 10.5

9 22.86 16.39 13.9 12.56 11.71 11.13 10.7 10.4 10.1 9.89 9.72 9.57 9.44 9.33 9.24 8.9

10 21.04 14.91 12.55 11.28 10.48 9.93 9.52 9.2 8.96 8.75 8.59 8.45 8.33 8.22 8.13 7.8

11 19.69 13.81 11.56 10.35 9.58 9.05 8.66 8.36 8.12 7.92 7.76 7.63 7.51 7.41 7.32 7.01

12 18.64 12.97 10.8 9.63 8.89 8.38 8 7.71 7.48 7.29 7.14 7.01 6.89 6.79 6.71 6.41

13 17.82 12.31 10.21 9.07 8.35 7.86 7.49 7.21 6.98 6.8 6.65 6.52 6.41 6.31 6.23 5.93

14 17.14 11.78 9.73 8.62 7.92 7.44 7.08 6.8 6.58 6.4 6.26 6.13 6.02 5.93 5.85 5.56

15 16.59 11.34 9.34 8.25 7.57 7.09 6.74 6.47 6.26 6.08 5.94 5.81 5.71 5.62 5.54 5.25

16 16.12 10.97 9.01 7.94 7.27 6.81 6.46 6.2 5.98 5.81 5.67 5.55 5.44 5.35 5.27 4.99

17 15.72 10.66 8.73 7.68 7.02 6.56 6.22 5.96 5.75 5.58 5.44 5.32 5.22 5.13 5.05 4.78

18 15.38 10.39 8.49 7.46 6.81 6.36 6.02 5.76 5.56 5.39 5.25 5.13 5.03 4.94 4.87 4.59

19 15.08 10.16 8.28 7.27 6.62 6.18 5.85 5.59 5.39 5.22 5.08 4.97 4.87 4.78 4.7 4.43

20 14.82 9.95 8.1 7.1 6.46 6.02 5.69 5.44 5.24 5.08 4.94 4.82 4.72 4.64 4.56 4.29

22 14.38 9.61 7.8 6.81 6.19 5.76 5.44 5.19 4.99 4.83 4.7 4.58 4.49 4.4 4.33 4.06

24 14.03 9.34 7.55 6.59 5.98 5.55 5.24 4.99 4.8 4.64 4.51 4.39 4.3 4.21 4.14 3.87

26 13.74 9.12 7.36 6.41 5.8 5.38 5.07 4.83 4.64 4.48 4.35 4.24 4.14 4.06 3.99 3.72

28 13.5 8.93 7.19 6.25 5.66 5.24 4.93 4.7 4.51 4.35 4.22 4.11 4.01 3.93 3.86 3.6

30 13.29 8.77 7.05 6.13 5.53 5.12 4.82 4.58 4.39 4.24 4.11 4 3.91 3.83 3.75 3.49

35 12.9 8.47 6.79 5.88 5.3 4.89 4.6 4.36 4.18 4.03 3.9 3.79 3.7 3.62 3.55 3.29

40 12.61 8.25 6.6 5.7 5.13 4.73 4.44 4.21 4.02 3.87 3.75 3.64 3.55 3.47 3.4 3.15

45 12.39 8.09 6.45 5.56 5 4.61 4.32 4.09 3.91 3.76 3.64 3.53 3.44 3.36 3.29 3.04

50 12.22 7.96 6.34 5.46 4.9 4.51 4.22 4 3.82 3.67 3.55 3.44 3.35 3.27 3.2 2.95

60 11.97 7.77 6.17 5.31 4.76 4.37 4.09 3.87 3.69 3.54 3.42 3.32 3.23 3.15 3.08 2.83

70 11.8 7.64 6.06 5.2 4.66 4.28 3.99 3.77 3.6 3.45 3.33 3.23 3.14 3.06 2.99 2.74

80 11.67 7.54 5.97 5.12 4.58 4.2 3.92 3.71 3.53 3.39 3.27 3.16 3.07 3 2.93 2.68

100 11.5 7.41 5.86 5.02 4.48 4.11 3.83 3.61 3.44 3.3 3.18 3.07 2.99 2.91 2.84 2.59

200 11.16 7.15 5.63 4.81 4.29 3.92 3.65 3.43 3.26 3.12 3.01 2.9 2.82 2.74 2.67 2.42

500 10.96 7 5.51 4.69 4.18 3.81 3.54 3.33 3.16 3.02 2.91 2.81 2.72 2.64 2.58 2.33

1000 10.89 6.96 5.46 4.66 4.14 3.78 3.51 3.3 3.13 2.99 2.87 2.77 2.69 2.61 2.54 2.3


‘r’ - Table [The Correlation Coefficient]

Degree of Probability (P)


freedom
(df) 0.05 0.02 0.01 0.001
1 0.997 0.999 1 1
2 0.95 0.980 0.99 0.999
3 0.878 0.934 0.959 0.991
4 0.811 0.882 0.917 0.974
5 0.755 0.832 0.875 0.951
6 0.707 0.789 0.834 0.925
7 0.666 0.749 0.798 0.898
8 0.632 0.716 0.765 0.872
9 0.602 0.685 0.735 0.847
10 0.576 0.658 0.708 0.823
11 0.553 0.634 0.684 0.801
12 0.532 0.612 0.661 0.78
13 0.514 0.592 0.641 0.76
14 0.497 0.574 0.623 0.742
15 0.482 0.558 0.606 0.725
16 0.468 0.543 0.59 0.708
17 0.456 0.529 0.575 0.693
18 0.444 0.516 0.561 0.679
19 0.433 0.503 0.549 0.665
20 0.423 0.492 0.537 0.652
25 0.381 0.445 0.487 0.597
30 0.349 0.409 0.449 0.554
35 0.325 0.381 0.418 0.519
40 0.304 0.359 0.393 0.49
45 0.288 0.338 0.372 0.465
50 0.273 0.322 0.354 0.443
60 0.25 0.295 0.325 0.408
70 0.232 0.274 0.302 0.38
80 0.217 0.257 0.283 0.357
90 0.205 0.242 0.267 0.338
100 0.195 0.230 0.254 0.321


Suggested Further Reading:


[1] Fundamentals of Biostatistics – Veer Bala Rastogi
[2] An Introduction to Biostatistics – N. Gurumani
[3] Biostatistics – P. N. Arora & P. K. Malhan
[4] Basic Biostatistics for Geneticists and Epidemiologists – Robert C. Elston &
William D. Johnson
[5] Applied Statistics and Probability for Engineers – Douglas C. Montgomery &
George C. Runger
[6] Basic Statistics – A Primer for the Biomedical Sciences - Olive Jean Dunn &
Virginia A. Clark
[7] Biostatistics and Microbiology: A Survival Manual – Daryl S. Paulson
[8] Biostatistics – A Methodology for the Health Sciences – Gerald van Belle, Lloyd D.
Fisher, Patrick J. Heagerty & Thomas Lumley
[9] Elements of Statistics with Application to Economic Data – Harold T. Davis & W.
F. C. Nelson
[10] Fundamentals of Biostatistics – Bernard Rosner
[11] Introduction to Biostatistics – Robert R. Sokal & F. James Rohlf
[12] Introduction to Statistics and Data Analysis for Physicists – Gerhard Bohm &
Gunter Zech
[13] Introduction to Statistics – David M. Lane
[14] Introductory Biostatistics – Chap T. Le
[15] Medical Biostatistics – Abhaya Indrayan & Sanjeev B. Sarmukaddam
[16] Sample Size Calculations in Clinical Research – Shein-Chung Chow, Jun Shao &
Hansheng Wang
[17] Statistical Methods for Clinical Trials – Mark X. Norleans
[18] Statistical Methods for Biostatistics and Related Fields – Wolfgang Hardle, Yuichi
Mori & Philippe Vieu
[19] Statistical Methods for Food Science - John A. Bower
[20] Statistical Methods in Water Resources – D. R. Helsel & R. M. Hirsch
[21] Statistics for Biology and Health – K. Dietz, M. Gail, K. Krickeberg, J. Samet & A.
Tsiatis
[22] Statistics for the Life Sciences – Myra L. Samuels & Jeffery A. Witmer
[23] Essentials of Statistics – David Brink
[24] Essentials of Statistics: Exercises – David Brink
[25] Statistics for Business and Economics – Marcelo Fernandes
[26] Statistics for Health, Life and Social Sciences – Denis Anthony
[27] Stats Practically Short and Simple – Sidney Tyrrell
[28] Topics in Biostatistics – Walter T. Ambrosius
[29] Tutorials in Biostatistics – R. B. D’Agostino
[30] The Concise Encyclopedia of Statistics – Yadolah Dodge
[31] The Elements of Statistical Learning - Data Mining, Inferences, and Prediction -
Trevor Hastie, Robert Tibshirani & Jerome Friedman
[32] Encyclopedia of Biostatistics Vol. 1-8
