BIOSTATISTICS
The mere mention of the word statistics can ring alarm bells in many minds, and more so in the minds of medics. Medical professionals believe that they are duty bound not to touch anything mathematical. They had opted for Biology instead of Mathematics at FSc level, so it would be a sin even to think of it. They erroneously think of Statistics as a branch of Mathematics. No doubt mathematics is used extensively in statistics, but so is the case with other sciences: Physics, Chemistry, Agriculture, Engineering, and many more. The subject is further defamed for political reasons: for example, when Mark Twain quotes the Victorian-age English Prime Minister Disraeli as saying “there are lies, damned lies and statistics”, we readers immediately implicate the subject Statistics as the worst type of lie. Facts are to the contrary. What Disraeli means by statistics here is the facts and figures presented by governments, whereas the subject Statistics is an entirely different discipline: it also deals with facts and figures, but in a scientific way.
The kind of thinking involved in statistics will not be entirely new to you. Indeed, you
will find that many of our day-to-day assumptions and decisions already depend on it.
Suppose you are told that two adults are sitting in the next room. One is 5 feet tall and the
other is six feet tall. What would be your best guess as to each one’s sex, based on that
information alone? You may be fairly confident in assuming that the six-foot person is a man and the five-footer a woman. You could be wrong, of course, but experience tells you that five-foot men and six-foot women are somewhat rare. You have noticed that, by and large, males tend to be taller than females. Of course you have not seen all men or all women, and you recognize that many women are taller than many men; nevertheless you feel reasonably confident about generalizing from the particular men and women you have known to men and women as a whole.
The above is a simple, everyday example of statistical thinking. There are many other
examples. Anytime you use phrases like: ‘on average, I sleep 52 hours a week’ or ‘we
can expect a lot of rain at this time of year’ or ‘the earlier you start revising the better you
are likely to do in the annual exam’; you are making a statistical statement, even though
you may have performed no calculations.
There are many more things that are not known to us. We may need information on some of them, and without conducting a proper investigation we may remain oblivious to many important facts. To conduct such an investigation, we need knowledge of the subject STATISTICS. Medical science cannot progress without making use of the subject STATISTICS; hence all medical graduates must possess a first-hand knowledge of it. Just think for a moment: from where do we know that the hemoglobin level is 12-15 g/dl in adult males? Have we measured the hemoglobin levels of all men and women? Certainly not, but still we confidently diagnose a man with a hemoglobin level of 10 g/dl as anemic. How do we feel confident about our diagnosis of anemia at 10 g/dl hemoglobin? Take care: you need to learn this as we discuss it in the following pages.
(Professor Mohammad Salim Wazir, Biostatistics Lecture Notes, 2021)
Statistics:
The word statistics is derived from ‘status’, meaning situation; the subject studies situations.
* Note that the word statistics used in everyday language means facts and figures, not the subject Statistics. The word statistic means a figure computed from actual observations (a sample).
SOME DEFINITIONS:
Data:
Record of observations – facts and figures – any piece of information. It could be primary or secondary.
Information:
Variable:
An attribute or characteristic that is variable from one individual to another, e.g., age,
gender, height etc.
Population:
Defined as “the whole set of things or objects about which we want to know”. In statistics a population can be human beings, potatoes, tomatoes, rice, chairs, tables, ECG machines, paracetamol tablets, etc.
Sample:
Census:
Sampling:
The procedure of drawing a sample from a population, i.e., when some members of the population are drawn for examination.
You will appreciate that populations are usually very big and often effectively infinite in size. Time and other resources are always scarce; therefore, researchers almost always opt for sampling rather than conducting a census. To be able to draw meaningful information from our samples, we would like them to be representative of the population they are drawn from. We do not know the population, so how do we know that our sample represents it? We have no foolproof method, but to be reasonably sure that our samples are representative of the population they are drawn from, we must ensure that:
We leave it to Nature to make the selection for us; it should not be subservient to our choice.
We can ensure that samples are drawn randomly, but the size of a sample is usually dictated by the availability of resources, i.e., time, men, money and material, and by statistical requirements.
Sampling Unit:
Sampling Frame:
In an ideal world we would have a list of all the members of the population and then draw a sample by, say, the lottery method. But imagine that we want to know the heights of adult males in district Abbottabad and wish to draw a sample of 1000 adult males. There may be more than 300,000 adult males in district Abbottabad. Do we have any method of obtaining a list of all of them? We surely do not have a complete list of the adult males of district Abbottabad and, similarly, most of the populations we encounter have no complete lists. Therefore, we need to look for other sampling techniques.
Random Samples:
Defined as samples drawn so that each and every member of the population has an equal chance of selection.
A. Probability Sampling
B. Non-probability Sampling
Probability samples are those in which members of the population have a known, though not necessarily equal, chance of being selected as sample members. With this technique of sampling, inferential statements can be made on the basis of samples. In the case of probability sampling the sampling frame is available in some shape. Some of these techniques are as under:
Note: If you can select a sample by the above two methods then
you should not use any other method.
Variable:
An attribute or characteristic that varies from one individual to another.
Types of Data:
Data is collected against variables. Age is a variable, but ages of students of a sample or
population are data. We need to know different types of variables because different
statistical techniques are employed to analyze different variables.
Discrete:
The variable takes a PARTICULAR number in a given range. If parity ranges from zero
to ten then it could be a particular number out of 11 numbers. It cannot be 2.6. It is also
called count variable as we count. Examples are parity, number of injections per day etc.
Continuous:
The variable takes ANY number in a given range. If heights range from 100 cm to 200
cm, there could be infinite values in the given range owing to decimals. These variables
are measured. Examples are hemoglobin value, age, weight etc.
Category or Categorical Variables:
Categorical variables are first categorized and then counted. They are nominal and
ordinal.
Nominal:
Observations have names only, for example male/female or black/white/yellow/brown. There are no orders or ratios. If nominal data have only two groups, e.g., male/female, they are called dichotomous or binary data. Variables with more than two groups, e.g., religion, are called multichotomous variables.
Ordinal:
Also called a RANK variable: data placed into a meaningful order. Students may be ranked 1st, 2nd, 3rd, etc.; however, the interval between ranks is not fixed. The Likert Scale and the Visual Analogue Scale (VAS) are ordinal variables.
Another classification of variables is into independent and dependent variables. These terms are used when we compare variables: independent variables are presumed causes, and dependent variables are presumed effects. The incidence of lung cancer is an example: smoking is the independent variable and lung cancer the dependent variable.
After data collection, the researcher has to organize the scores into some comprehensible
form for further statistical processes. The most commonly used procedure for organizing
a set of data is to place the scores in a frequency distribution.
Frequency Distribution:
The collected data can be presented in tabular or graphic form after organizing it to show the frequencies of the different observations. It can also be organized into groups, which is called a grouped frequency distribution. In a frequency distribution, the disorganized set of scores is arranged in order (ascending or descending) by grouping together all individuals who have the same score. A frequency distribution can be in the form of a table or a graph. Following are the data on the pulse rates/min of 15 students:
72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75
Note: For tallying observations the FIVE-BAR-GATE (tallying) method can be used, as shown in the third column of the above table: four vertical strokes crossed by a fifth stroke make one five-bar gate. Nowadays computer software such as MS Excel, SPSS, etc. is used, and should be used, the manual method having become obsolete.
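As the note suggests, software can do the tallying. A minimal sketch in Python, using the pulse-rate data above (the standard-library `Counter` simply counts how often each value occurs):

```python
from collections import Counter

# Pulse rates (beats/min) of the 15 students from the text
pulses = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]

# Tally each distinct value, then print an ordered frequency table
freq = Counter(pulses)
for value in sorted(freq):
    print(f"{value:>3}  {'|' * freq[value]}  {freq[value]}")
```

Each printed row plays the role of one line of a tally table: the value, its tally marks, and its frequency (73, for instance, occurs three times).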
Once the data are collected, we need to describe them. Before describing the data, it is important to know the type of variable we are going to describe. Quantitative variables are described in terms of measures of central tendency and measures of dispersion; for categorical variables, frequencies and percentages are used.
Mean:
The sum of all observations divided by their number.

    Mean (X̄) = Σx / n        (for a sample)
    µ = ΣX / N               (for a population)

where µ is the mean of a population, ΣX the sum of values in the population, and N the number of values in the population.

To calculate the Mean from the data of pulse rates of students:
Sum of all observations = 1140 (Σx)
Number of observations = 15 (n)

    Mean (X̄) = Σx / n = 1140 / 15 = 76 beats/minute
Advantages of Mean:
i. It represents all the values in a distribution
ii. Can be used in further statistical computations
Disadvantages:
It is affected by extreme (outlying) values in the distribution.
Median: The centre value of a series of observations when the observations are ranked in order from the lowest value to the highest (ascending or descending order). The median divides the distribution into two equal halves.
    Position of Median = (n + 1)/2 th value
Using the same data
We first arrange the observations into an order from lowest to highest
62, 66, 67, 69, 72, 73, 73, 73, 75, 76, 78, 80, 82, 86, 108

    Position of Median = (n + 1)/2 = (15 + 1)/2 = 16/2 = 8; the 8th value is the median, which is 73.
You can see that there are seven values below 73 and an equal number i.e. seven above
the median. In the data shown n=15 which is an odd number. If n=16, an even number
then the Median would be
    Position of Median = (16 + 1)/2 = 8.5

8.5 means the average of the 8th and 9th observations. If, e.g., the 8th value was 73 and the 9th 75, then the median would be

    Median = (73 + 75)/2 = 148/2 = 74 beats/minute
In this case median may not be an actually observed value.
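The positional rule above can be checked against Python's `statistics` module, which applies the same averaging of the two middle values when n is even:

```python
import statistics

pulses = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]

# Rank the observations, then locate the (n + 1)/2-th value
ranked = sorted(pulses)
n = len(ranked)
position = (n + 1) / 2            # 8.0 for n = 15, i.e. the 8th value
median = statistics.median(pulses)

print(position, ranked[int(position) - 1], median)
```

Both the hand rule and the library call give 73 for this sample.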
Advantages of Median:
It is not affected by extreme values; therefore, it is used for data that are skewed, i.e., contain extreme observations.
Disadvantages:
1. It does not take into account all the values of a distribution
2. It is of limited value in further statistical computation
Median can be used for Quantitative (discrete and continuous) and Ordinal data.
Mode:
The most frequently occurring value in a distribution; in our pulse-rate data the mode is 73, which occurs three times. Some distributions have two modes – they are called bimodal distributions. If there are more than two modes, such distributions are known as multimodal distributions. Mode can be used for all types of data, i.e., Nominal, Ordinal, and Quantitative (discrete and continuous).
Note: Mean, Median and Mode have the same units as the observations, and the unit must be stated with the resultant value, e.g., the Mean is 76 beats per minute.
1. Range:
The difference between the highest and lowest values of a distribution.

72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75

The range is 62-108 beats per minute, i.e., 108 − 62 = 46 beats per minute. Range is a good measure of dispersion when we want to know immediately how the data are spread out, but it takes into account only the lowest and highest values of a distribution. Therefore, it is not a good overall measure of dispersion.
2. Variance:
The mean of the squared deviations of the observations from the mean of the distribution.

X     X̄    (X − X̄)   (X − X̄)²
62    76    −14        196
66    76    −10        100
67    76    −9          81
69    76    −7          49
72    76    −4          16
73    76    −3           9
73    76    −3           9
73    76    −3           9
75    76    −1           1
76    76     0           0
78    76    +2           4
80    76    +4          16
82    76    +6          36
86    76   +10         100
108   76   +32        1024
n = 15      Σ(X − X̄) = 0    Σ(X − X̄)² = 1650

    Variance = Σ(X − X̄)² / n = 1650 / 15 = 110
We square the deviations to get rid of the negative signs, but by squaring the values we lose the units. Therefore, variance is of limited value in measuring the dispersion of data.
3. Standard Deviation:
The most useful measure of dispersion, and one that can be used in further statistical computations. It is the square root of the sum of the squared deviations of the observations from the mean of the distribution, divided by the number of observations. (Samples of less than 30 use n − 1, which may be used in big samples too without making much difference.)

    Standard Deviation (SD) = √Variance
    Variance = (Standard Deviation)² = SD²

    SD = √( Σ(x − x̄)² / n )
We know that Σ(x − x̄)² = 1650 and n = 15; therefore:

    SD = √( Σ(x − x̄)² / n ) = √(1650 / 15) = √110 ≈ 10.5 beats/min
By squaring the deviations we get rid of the negative signs, but we lose the original unit; this is taken care of by applying the square root, which restores the original units.
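The whole computation above can be reproduced in a few lines of Python; note that, like the text's worked example, it divides by n rather than n − 1:

```python
import math

pulses = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]

mean = sum(pulses) / len(pulses)                   # 1140 / 15 = 76
sum_sq_dev = sum((x - mean) ** 2 for x in pulses)  # 1650
variance = sum_sq_dev / len(pulses)                # 1650 / 15 = 110
sd = math.sqrt(variance)                           # ~10.5 beats/min

print(mean, variance, round(sd, 1))
```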
The use of standard deviation in statistical data is explained with the Normal Distribution.
4. Co-efficient of Variation:
Measures variability in relation to the mean, and offers a method by which one can compare the relative dispersion of one type of data with the relative dispersion of another.

    Co-efficient of Variation = (SD / Mean) × 100

Our data of heart beats per minute will have a co-efficient of variation of:

    CV of heart-beat data = (10.5 / 76) × 100 = 13.8%
If we also had recorded the systolic blood pressures of the same individuals with a mean
systolic BP of 130mmHg and Standard Deviation of 13mmHg – the co-efficient of
variation would have been
Co-efficient of Variation of Systolic Blood Pressure:

    CV = (SD / Mean) × 100 = (13 / 130) × 100 = 10%
Now we can compare and conclude that, among the persons whose pulse rates and systolic blood pressures were recorded, pulse rate is more variable than systolic blood pressure, since the co-efficient of variation of pulse rate is 13.8% against 10% for systolic BP.
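A small sketch of this comparison; the figures are the ones computed in the text:

```python
# Coefficient of variation: SD expressed as a percentage of the mean.
# Being unit-free, it lets us compare dispersion across different measurements.
def cv(sd, mean):
    return sd / mean * 100

pulse_cv = cv(10.5, 76)    # pulse-rate data
bp_cv = cv(13, 130)        # systolic BP data

print(round(pulse_cv, 1), round(bp_cv, 1))  # 13.8 10.0
```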
Inferential statistics means going beyond the actual observations and stating something (based on the collected data) that has not actually been observed. Here the theory of probability comes in.
Probability:
The number of favourable outcomes out of the total possible number of outcomes is called probability. If we flip a fair coin, the probability of getting a head is 1/2, i.e., 50% or 0.5. The probability of getting either a head or a tail is 1/1, i.e., 100% or 1. Two simple rules of probability need to be remembered.
Addition Rule:
For two or more mutually exclusive events, the probability that one or the other occurs is the sum of their individual probabilities; if the events also cover all possibilities, the probabilities add up to ONE or 100%. For example, there are two possibilities when we flip a fair coin, i.e., either head or tail, and we cannot have both on one flip. The probability of a head is 0.5 or 50%; therefore, by the addition rule the probability of a head or a tail is 0.5 + 0.5 = 1, or 50% + 50% = 100%.
Example: If infant mortality rate is 60 per 1000 in Pakistan, then, the probability of an
infant dying is 60 per 1000 or 6 per 100 or 6% or 0.06. The probability of an infant
surviving is 940 per 1000 i.e., 1000-60=940. It can also be said that the probability of an
infant surviving is 94% or 0.94. As a child can either survive or die and they are mutually
exclusive phenomena, therefore, according to addition rule the probability of either dying
or surviving is 0.06+0.94=1. (If the statement has “or” in it, addition rule is applied)
Multiplication Rule: For two or more independent, randomly occurring events the probabilities multiply. When we flip a fair coin, it is an independent event; when we flip it twice or thrice or more, all flips are independent events. If the probability of getting a head on one flip is 0.5, then the probability of heads on both of two flips is 0.5 × 0.5 = 0.25.
Example: If we know that 10% of the patients visiting a medical OPD suffer from hypertension, the probability of a patient having hypertension is 0.1. So, when the events are independent, the probability that the first two patients entering the OPD both suffer from hypertension is 0.1 × 0.1 = 0.01 or 1%. This is called the multiplication rule.
Note: If the probability is stated to be 1 it is called unity. To say one has to die eventually
the probability will be 1. For one to stay alive for ever the probability will be 0. In
between 0 and 1 there are fractions of 1, which may have many decimals for different
events.
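Both rules can be verified numerically; the infant-mortality and OPD figures are the ones used in the examples above:

```python
# Addition rule: for mutually exclusive, exhaustive events the probabilities sum to 1
p_die = 60 / 1000            # infant mortality rate of 60 per 1000
p_survive = 1 - p_die        # 0.94
total = p_die + p_survive    # "dies OR survives" covers every outcome

# Multiplication rule: independent events multiply
p_two_heads = 0.5 * 0.5      # head AND head on two fair flips = 0.25
p_both_htn = 0.1 * 0.1       # first two OPD patients both hypertensive, ~0.01

print(total, p_two_heads, round(p_both_htn, 2))
```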
NORMAL CURVE
On the X-axis are the values, and the Y-axis shows the frequency of those values, like a frequency distribution. It is important to remember that the normal distribution is a probability distribution and an idealization. If our collected data tend to conform to the normal distribution, we make use of it in statistical inference. The total probability under the curve is equal to 1 or 100%. All the individual values under the curve have probabilities of occurrence (frequencies) ranging between 0 and 1 (0% to 100%), and these total 1.
1. It is bell shaped.
2. It is perfectly symmetrical.
3. Mean, Median and Mode are in the centre of the curve i.e. the dome of the curve.
4. Half the values (50%) lie on each side when it is cut into half at the highest point.
5. It has got two determinants Mean (µ) and Standard Deviation (δ).
6. 68.26% of the values lie within the range Mean ± 1SD (µ − 1δ to µ + 1δ). In other words, the probability of occurrence of values between µ − 1δ and µ + 1δ is 68.26% or 0.6826. This also implies that 31.74% of the values lie either below Mean − 1SD (µ − 1δ) or above Mean + 1SD (µ + 1δ); i.e., the probability of occurrence of values outside this range is 31.74% or 0.3174.
7. 95.45% of the values lie within the range Mean ± 2SD (µ − 2δ to µ + 2δ). In other words, the probability of occurrence of values between µ − 2δ and µ + 2δ is 95.45% or 0.9545. This also implies that 4.55% of the values lie either below µ − 2δ or above µ + 2δ.
To elaborate it further and make it useful, remember the following landmarks also:
a. 95% of the values lie within the range Mean ± 1.96SD (µ − 1.96δ to µ + 1.96δ). In other words, the probability of occurrence of values between µ − 1.96δ and µ + 1.96δ is 95% or 0.95. This also implies that 5% of the values lie either below Mean − 1.96SD or above Mean + 1.96SD; i.e., the probability of occurrence of values outside this range is 5% (2.5% on each side) or 0.05 (0.025 on each side).
b. 99% of the values lie within the range Mean ± 2.58SD (µ − 2.58δ to µ + 2.58δ). In other words, the probability of occurrence of values between µ − 2.58δ and µ + 2.58δ is 99% or 0.99. This also implies that 1% of the values lie either below Mean − 2.58SD or above Mean + 2.58SD; i.e., the probability of occurrence of values outside this range is 1% or 0.01.
The multiple of the standard deviation is called Z (or t in the case of the “t” distribution), and it ranges from 0 to infinity. The area under the normal curve is also referred to as the area under Z.
Note: When we say that a certain percentage of observations lie between Mean ±
Z x SD, the Z in the case of 68.26% is 1, in the case of 95%, Z is 1.96, in the case
of 95.45% Z is 2, in the case of 99% Z is 2.58 and in the case of 99.73% Z is 3.
NOTE: The difference between Normal Curve and STANDARD NORMAL CURVE
(SNC) is that the Mean of SNC is zero with a Standard Deviation of one. Normal Curve
is the frequency distribution of a variable in the entire population, whereas SNC is the
frequency distribution of means of all its samples with respect to a variable.
KURTOSIS: It means the peakedness of a normal curve. The diagram below shows three types according to the size of the population standard deviation. The three curves shown have the same mean:
1. Mesokurtosis: it is in the middle, showing a moderate standard deviation.
2. Platykurtosis: it is flat and has a wide standard deviation.
3. Leptokurtosis: it is elliptical in shape and has a narrow standard deviation.

(Diagram B: the three kurtosis types)
Note: Skew means tail. Skew is said to be to the side where the tail of the
distribution is.
From our data of pulse rates for which we have calculated Mean and Standard Deviation
72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75
Mean = 76
SD = 10.5
n = 15
By using normal distribution we can say that:
1. 68.26% of the values are within Mean ± 1SD, i.e., between 65.5 and 86.5
2. 95% of the values are within Mean ± 1.96SD, i.e., between 55.4 and 96.6
3. 95.45% of the values lie within Mean ± 2SD, i.e., between 55 and 97
4. 99% of the values lie within Mean ± 2.58SD, i.e., between 48.9 and 103.1
5. 99.73% of the values lie within Mean ± 3SD, i.e., between 44.5 and 107.5
These are confidence limits for the Sample (read SAMPLE). Number 1 is 68.26%
confidence limits; number 2 is 95% confidence limits; number 3 is 95.45% confidence
limits; number 4 is 99% confidence limits; and number 5 is 99.73% confidence limits.
This means that you can state with a certain percent of confidence in what range the
values within your sample fall. But do not forget that these confidence limits are for your
sample and not for the population from which the sample is drawn.
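These landmark ranges follow mechanically from the mean and SD; a short sketch that reproduces the five ranges above:

```python
mean, sd = 76, 10.5   # pulse-rate sample statistics from the text

# z multiplier -> percentage of values expected inside mean ± z*SD
landmarks = [(1, 68.26), (1.96, 95), (2, 95.45), (2.58, 99), (3, 99.73)]

for z, pct in landmarks:
    lower, upper = mean - z * sd, mean + z * sd
    print(f"{pct}% of values between {lower:.1f} and {upper:.1f}")
```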
The upper limit of the range is upper confidence limit; the lower limit of the range is
lower confidence limit. In between the upper and lower confidence limits is the
CONFIDENCE INTERVAL. 95% confidence limits for a sample imply that 95% of the
observations in the sample will lie within this range, which in the case of our data are
55.4 to 96.6. It also means that 5% of observations may lie outside these limits either
below lower confidence limit or above the upper confidence limit.
Such calculations are of no use as long as we do not know the population mean (µ) and population standard deviation (δ). And if, after all, we knew the population mean and standard deviation, what would be the need for all this exercise, i.e., studying a sample?
To understand the concept of Standard Error let’s take the example of our data of pulse
rates. We have a mean of the data, which is 76 per minute. If we draw repeated samples
from the same population and compute means of all the samples then we’ll have a
distribution of means of the samples like individual values of pulse rates in one sample.
Central limit theorem states that “means” of many samples from the same population are
normally distributed. The Standard Deviation of the distribution of means of many
samples of one population is known as Standard Error (SE). We use Standard Error to
estimate a population parameter. But do you really think that somebody could actually carry out the exercise of drawing repeated samples from a population, in numbers large enough to construct a meaningful distribution? Only an eccentric would be prepared to do that.
Example: If the number of people with iodine deficiency is 55 out of a randomly selected sample of 440 persons in district ‘x’, the 95% and 99% confidence limits will be as under:
    SE of Proportion = √(pq / n)
The number of persons with Iodine deficiency = 55 out of 440
Then p = 55/440 x 100 = 12.5%
So q = 100 – p = 100 – 12.5 = 87.5%
Sample size = n = 440
Therefore, using the formula:

    SE of Proportion = √(pq / n) = √(12.5 × 87.5 / 440) = 1.57

95% confidence limits are p ± 1.96 × SE = 12.5 ± 3.07
95% confidence limits are 12.5 − 3.07 to 12.5 + 3.07 = [9.43% to 15.57%]
This means that if we draw repeated samples from the population of ‘x’, 95% of the samples will have a proportion of iodine-deficient people between 9.43% and 15.57%.
99% Confidence Limits are p ± 2.58 × SE = 12.5 ± 2.58 × 1.57 = 12.5 ± 4.05
99% Confidence Limits are [8.45% to 16.55%]
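The same computation in Python. Carrying full precision gives SE ≈ 1.58 and limits a few hundredths away from the text's, which rounds SE to 1.57 first:

```python
import math

n = 440
p = 55 / n * 100             # 12.5 % iodine-deficient
q = 100 - p                  # 87.5 %

se = math.sqrt(p * q / n)    # ~1.58 (the text rounds to 1.57)

ci95 = (p - 1.96 * se, p + 1.96 * se)   # ~9.4 % to 15.6 %
ci99 = (p - 2.58 * se, p + 2.58 * se)   # ~8.4 % to 16.6 %

print(round(se, 2), [round(x, 2) for x in ci95], [round(x, 2) for x in ci99])
```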
99% Confidence Interval (CI) is defined as: The range of mean values or
proportions within which there are 99 chances out of 100 that the true population
mean or proportion will fall
95% confidence limits will be Mean ± t × SE. To know the value of t we refer to the t table. First we calculate the degrees of freedom (DF), which are n − 1; our n is 15, hence DF = 15 − 1 = 14. Referring to the t table at 14 DF, the value of t at 0.05 (which corresponds to 95% confidence limits) is 2.14.

95% confidence limits are: Mean ± t × SE = 76 ± 2.14 × 2.8 = 76 ± 5.99
In the same way we can calculate 99% confidence limits. By referring to t table at 14 DF
and 0.01(which means 99% confidence limits) we find the value of t as 2.98. We
substitute 2.98 for t or 2.14 in the previous example and calculate 99% confidence limits.
(Do it yourself)
Note: t is higher than Z (2.14 > 1.96 in the case of 95% CL and 2.98 > 2.58 for 99% CL), but beyond 120 DF t tends to equal Z.
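Putting the numbers together (t = 2.14 at 14 DF and SE = 2.8, both as given in the text):

```python
# 95% confidence limits for the mean pulse rate using the t multiplier
mean, t_value, se = 76, 2.14, 2.8   # figures from the text

margin = t_value * se               # ~5.99
lower, upper = mean - margin, mean + margin

print(f"95% CL: {lower:.1f} to {upper:.1f} beats/min")   # 70.0 to 82.0
```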
We may be interested in comparing two or more populations and determining whether, with regard to some observations, they differ significantly, or whether the differences are just by chance – more precisely, the effect of sampling error. We know that the means of samples even from the same population may differ, but to what extent remains the question, and it has to be answered through significance testing or hypothesis testing.
While comparing two or more samples we may have a hypothesis, which is called
research hypothesis. Such a hypothesis may state that there is a difference or otherwise.
We have to prove it based on the collected data.
NULL HYPOTHESIS (H0): A statistical hypothesis that is tested for rejection with the assumption that it could be true. It states that the different sets of data belong to one population and the observed differences are by chance. In other words:

A = B

ALTERNATIVE HYPOTHESIS (H1): It states that the different sets of data belong to different populations and the differences are statistically significant and not due to chance.

A ≠ B or A > B or A < B
SIGNIFICANCE TESTING:
To test a hypothesis, i.e., to know about significance, we perform different statistical tests in different situations. The following diagram is an algorithm for selecting a statistical test in hypothesis testing.
For data that are not normally distributed – that is, non-parametric data – we use non-parametric tests. Non-parametric data are nominal or ordinal.
NOTE: Parametric tests are more sensitive compared to Non-Parametric tests. It is also
important to note that the data has to be collected randomly to enable the tests to be
meaningful.
5% LEVEL OF SIGNIFICANCE (p = 0.05): A level of probability at which the Null
hypothesis is rejected if an obtained sample difference occurs by chance only 5 times or
less out of 100.
We will discuss Normal distribution test (Z-test) and Chi-square tests only.
    Z = (X̄1 − X̄2) / SE

where X̄1 is the mean of sample 1, X̄2 is the mean of sample 2, and S.E. is the standard error of the difference between the two means:

    SE (diff between two means) = √( SD1²/n1 + SD2²/n2 )
Example:
If we want to compare the weights of girl students of 1st year and Final year, we collect data randomly. After collection and computation we have the following figures: X̄1 = 54, SD1 = 4, n1 = 32 (1st year); X̄2 = 62, SD2 = 5, n2 = 27 (Final year).

    SE = √( SD1²/n1 + SD2²/n2 ) = √( (4)²/32 + (5)²/27 ) = √( 16/32 + 25/27 ) = 1.2

    Z = (X̄1 − X̄2) / SE = (54 − 62) / 1.2 = −6.7, so |Z| = 6.7
Remember that the difference will be statistically significant if Z is more than 1.96 for a
level of 5% and for a level of 1% more than 2.58 (please refer to the properties of normal
distribution)
Our data shows significant difference both at 5% and 1% level. Hence, we can state that
there is statistically significant difference between the girls’ students of 1st and final year
with regard to their weights both at 5% and 1% significance level.
(We will reject null hypothesis)
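The whole test fits in a few lines; the figures are the ones used above:

```python
import math

# Weights: 1st-year vs Final-year girl students (figures from the text)
mean1, sd1, n1 = 54, 4, 32
mean2, sd2, n2 = 62, 5, 27

se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)   # ~1.2
z = (mean1 - mean2) / se                    # ~-6.7

# |Z| > 1.96 -> significant at 5%; |Z| > 2.58 -> significant at 1%
print(round(z, 1), abs(z) > 1.96, abs(z) > 2.58)
```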
Note: One has to determine the significance level during the planning stage of the study.
    SE (standard error of the difference between two proportions) = √( p1q1/n1 + p2q2/n2 )
Example:
If we collect some data randomly with the following observations:
13 out of a sample of 63 fourth-year students are obese, and 17 out of 61 third-year students are obese. Is there any statistically significant difference between 4th and 3rd year students with regard to the frequency of obesity, or are the observed differences due to chance?
    SE = √( p1q1/n1 + p2q2/n2 )
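The text stops at the formula; carrying the arithmetic through (our completion, not the author's) gives:

```python
import math

# 13 obese out of 63 fourth-year students; 17 out of 61 third-year students
p1 = 13 / 63 * 100; q1 = 100 - p1   # ~20.6% and ~79.4%
p2 = 17 / 61 * 100; q2 = 100 - p2   # ~27.9% and ~72.1%

se = math.sqrt(p1 * q1 / 63 + p2 * q2 / 61)   # ~7.7
z = (p1 - p2) / se                            # ~-0.94

# |Z| < 1.96, so at the 5% level the difference could well be due to chance
print(round(se, 1), round(z, 2), abs(z) > 1.96)
```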
We can use t test instead of Z test in the case of both small and large samples but to
decide about the critical level, we have to use t table as stated earlier.
Note: The numbers 1-4 given in brackets are the cell numbers of the above table.
    Chi-Square (χ²) = Σ (O − E)² / E
O = Observed frequencies
E = Expected frequencies
To calculate the expected frequency for each cell (1-4):

    Expected Frequency (E) = (Row Total × Column Total) / Grand Total
Cell No   Observed (O)   Expected (E) = (Row Total × Column Total)/Grand Total   O − E   (O − E)²   (O − E)²/E
1         18             136 × 40/366 = 14.9                                     +3.1    9.61       0.64
2         118            136 × 326/366 = 121.1                                   −3.1    9.61       0.08
3         22             230 × 40/366 = 25.1                                     −3.1    9.61       0.38
4         208            230 × 326/366 = 204.9                                   +3.1    9.61       0.05

    χ² = Σ (O − E)² / E = 0.64 + 0.08 + 0.38 + 0.05 = 1.15
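The table's arithmetic can be checked in Python. Carrying full precision (instead of rounding E to one decimal as the table does) gives χ² ≈ 1.18 rather than 1.15; either way it is far below the usual critical value of 3.84 at 1 DF:

```python
# Observed 2x2 table reconstructed from cells 1-4 of the text
observed = [[18, 118],
            [22, 208]]

row_totals = [sum(row) for row in observed]        # [136, 230]
col_totals = [sum(col) for col in zip(*observed)]  # [40, 326]
grand = sum(row_totals)                            # 366

chi_sq = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / grand
        chi_sq += (observed[i][j] - expected) ** 2 / expected

print(round(chi_sq, 2))   # 1.18
```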
We either accept Null Hypothesis or reject it. When we accept Null Hypothesis, we reject
the Alternative Hypothesis. When we reject Null Hypothesis we accept the Alternative
Hypothesis.
STEPS:
1. State the null and alternative hypotheses, Ho and H1.
2. Select the decision criterion α (or “level of significance”).
3. Establish the critical values
4. Draw a random sample from the population, and calculate the mean of that
sample
5. Select appropriate statistical test and compute the value of the test statistic Z or t
or X2 (as the case may be).
6. Compare the calculated value of test statistic with the critical values of Z/t/X2, and
then accept or reject the null hypothesis.
4. Draw a random sample from the population, and calculate the mean of that
sample: Sample randomly drawn from a district.
5. Select appropriate statistical test and compute the value of the test statistic Z
or t or X2 (as the case may be).
We select X2 test as the data is not continuous and do computations as under (One-way
Chi-square test):
Expected frequencies (E) are calculated by dividing the total frequency by the number of categories. The number of categories is 4 and the total is 508, so all expected frequencies equal 508/4 = 127.
6. Compare the calculated value of test statistic with the critical values of Z/t/X2,
and then accept or reject the null hypothesis.
Our calculated X2 is equal to 108.5. The degrees of freedom in this case are equal to the
number of categories minus one. There are four categories of weapons, therefore, DF =
4-1 = 3. At 3 DF X2 is equal to 7.81 at 0.05. As our calculated value is more than the
table value, therefore, the difference among the use of weapons in causing injuries is
statistically significant and cannot be due to chance alone. Therefore, we reject the Null
Hypothesis and accept the alternative hypothesis.
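The observed weapon counts are not reproduced above, so the sketch below uses HYPOTHETICAL counts (summing to the stated total of 508) purely to show the mechanics; the expected frequency and degrees of freedom follow the text:

```python
# One-way (goodness-of-fit) chi-square across 4 weapon categories.
# The counts below are hypothetical illustrations totalling 508;
# the text's own data give chi-square = 108.5.
observed = [220, 130, 98, 60]
expected = sum(observed) / len(observed)   # 508 / 4 = 127

chi_sq = sum((o - expected) ** 2 / expected for o in observed)
df = len(observed) - 1                     # 3

# Table value at 3 DF and p = 0.05 is 7.81
print(round(chi_sq, 1), chi_sq > 7.81)
```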
TEST RESULT        Ho True                              Ho False
Ho Accepted        Correct                              Type II error (β) – false negative
Ho Rejected        Type I error (α) – false positive    Correct
To avoid Type I error we may decrease our significance level but that will increase the
chance of committing Type II error. It is easy to avoid type I error but avoiding type II
error is not so simple. One way is to increase the sample size and reduce sampling
variations.
STATISTICAL POWER:
The Power of a statistical test is defined as: the ability of a statistical test to reject the null
hypothesis when it is actually false and should be rejected.
The p-value, calculated by referring to the relevant statistical tables, gives the exact probability that the observed difference arose by chance (sampling error). If p = 0.0001, it means that the obtained sample difference would occur by chance one time out of 10,000. Statistical packages calculate the p-value to several decimal places. p = 0.05 means that a result like the one observed would arise by chance 5 times out of 100.
The value of Pearson’s “r” ranges from −1 to +1: at 0 there is no correlation, and at +1 (or −1) the correlation is perfect. As the absolute value of “r” moves from 0 towards 1, the strength of the correlation varies from weak to strong.
Negative correlation means the two variables move in opposite direction. For example if
dose of insulin is increased, the level of blood glucose goes down.
X      Y      (x − x̄)     (y − ȳ)     (x − x̄)(y − ȳ)     (x − x̄)²     (y − ȳ)²
1      5      1−3 = −2     5−7 = −2     (−2)(−2) = +4       4             4
2      6      2−3 = −1     6−7 = −1     (−1)(−1) = +1       1             1
3      7      3−3 = 0      7−7 = 0      0                   0             0
4      8      4−3 = +1     8−7 = +1     (+1)(+1) = +1       1             1
5      9      5−3 = +2     9−7 = +2     (+2)(+2) = +4       4             4
x̄ = 3  ȳ = 7  Σ(x − x̄) = 0  Σ(y − ȳ) = 0  Σ(x − x̄)(y − ȳ) = +10  Σ(x − x̄)² = 10  Σ(y − ȳ)² = 10

    Pearson’s “r” = Σ(x − x̄)(y − ȳ) / √( Σ(x − x̄)² × Σ(y − ȳ)² ) = 10 / √(10 × 10) = 10/10 = 1
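The hand computation above maps directly to code:

```python
import math

# Paired observations from the worked example
X = [1, 2, 3, 4, 5]
Y = [5, 6, 7, 8, 9]

mx = sum(X) / len(X)   # 3
my = sum(Y) / len(Y)   # 7

num = sum((x - mx) * (y - my) for x, y in zip(X, Y))   # 10
den = math.sqrt(sum((x - mx) ** 2 for x in X) *
                sum((y - my) ** 2 for y in Y))         # sqrt(10 * 10) = 10

r = num / den
print(r)   # 1.0, a perfect positive correlation
```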
(Diagrams: scatter plots showing r = +1 (perfect positive correlation), negative correlation, and no correlation.)
EXERCISE:
The following data is collected on the dose of Methyl dopa per 24 hours and systolic BP.
Calculate Pearson’s “r” and interpret through diagram.
CoD = r²
If “r” = 0.6, then CoD = (0.6)² = 0.36. This means that 36% of the change in the ‘Y’ variable may be attributed to the ‘X’ variable.