
Lecture Notes

BIOSTATISTICS
For

4th year MBBS

Department of Community Medicine

Ayub Medical College


Abbottabad

Professor Muhammad Salim Wazir


MBBS (Pesh), MSc(UK), M. Phil



OVERTURE

The mere mention of the word statistics can ring alarm bells in many minds and more so
in the minds of medics. Medical professionals believe that they are duty bound not to
touch anything mathematical. They had opted for Biology instead of Mathematics at FSc
level, so it would be a sin to think of it. They erroneously think of Statistics as a branch
of Mathematics. No doubt mathematics is used extensively in statistics, but the same is
true of other sciences: Physics, Chemistry, Agriculture, Engineering, and many more.
The subject is further defamed for political reasons: for example, when Mark Twain
quotes the Victorian-era English Prime Minister Disraeli as saying "there are lies,
damned lies and statistics", we readers immediately implicate the subject Statistics as
the worst type of lie. The facts are to the contrary. What Disraeli meant by statistics
was the facts and figures presented by the government, whereas the subject Statistics is
an entirely different phenomenon: it also deals with facts and figures, but in a
scientific way.

The kind of thinking involved in statistics will not be entirely new to you. Indeed, you
will find that many of our day-to-day assumptions and decisions already depend on it.
Suppose you are told that two adults are sitting in the next room. One is 5 feet tall and the
other is six feet tall. What would be your best guess as to each one’s sex, based on that
information alone? You may be fairly confident in assuming that the six-foot person is a
man and the five-footer a woman. You could be wrong, of course, but experience tells you
that five-foot men and six-foot women are somewhat rare. You have noticed that, by and
large, males tend to be taller than females. Of course you have not seen all men or all
women, and you recognize that many women are taller than many men; nevertheless you feel
reasonably confident about generalizing from the particular men and women you have known
to men and women as a whole.

The above is a simple, everyday example of statistical thinking. There are many other
examples. Anytime you use phrases like: ‘on average, I sleep 52 hours a week’ or ‘we
can expect a lot of rain at this time of year’ or ‘the earlier you start revising the better you
are likely to do in the annual exam’; you are making a statistical statement, even though
you may have performed no calculations.

There are many more things which are not known to us, and we may need information on
some of them. Without conducting a proper investigation, we may remain oblivious to
many important things, and to conduct such an investigation we need knowledge of the
subject STATISTICS. Medical science cannot progress without making use of
STATISTICS; hence all medical graduates must possess first-hand knowledge of it. Just
think for a moment: how do we know that the hemoglobin level is 12-15 g/dl in adult
males? Have we measured the hemoglobin level of all men and women? Certainly not, yet
we confidently diagnose a man with a hemoglobin level of 10 g/dl as anemic. How do we
feel confident about that diagnosis? Take heart: you will learn this as we discuss it
in the following pages.

Statistics:

The word statistics is derived from 'status', meaning situation; the subject studies
situations. A practical definition would be:

A subject that deals with data: collection, compilation, presentation, analysis,
interpretation, and making inferences about that data.

Statistics are:

a). Descriptive Statistics: Methods used to summarize or describe our observations.
In other words, we are describing our sample.

b). Inferential Statistics: Using those observations as a basis for making estimates
or predictions, i.e., inferences about a situation that has not yet been observed.
Appropriate tests of significance are applied in inferential statistics. Simply:
generalizing towards the population based on the sample.

* Note that the word statistics used in everyday language means facts and figures
and not the subject Statistics. The word statistic means a figure computed from
actual observations (a sample).

SOME DEFINITIONS:

Data:

Record of observations – facts and figures – any piece of information. It could be
primary or secondary.

Primary Data: Data collected first-hand by researchers. It is difficult to collect but
useful, as the researchers gather exactly the data they need.

Secondary Data: Data readily available in some format in papers or computer databases.
It is easy to obtain but limited by the non-availability of all desired variables.

Information:

When data is processed and made meaningful

Variable:

An attribute or characteristic that varies from one individual to another, e.g., age,
gender, height etc.



Population:

Defined as "the whole set of things or objects about which we want to know". In
statistics a population can be human beings, potatoes, tomatoes, rice, chairs, tables,
ECG machines, paracetamol tablets etc.

Sample:

A part taken out of the population for actual study

Census:

When all members of a population are studied

Sampling:

The procedure of drawing a sample from a population, i.e., when some members of the
population are drawn for examination

You will appreciate that populations are usually very big and often effectively
infinite in size. Time and other resources are always scarce; therefore, researchers
almost always opt for sampling rather than conducting a census. To be able to draw
meaningful information from our samples, we want them to be representative of the
population they are drawn from. But we don't know the population, so how do we know
that our sample is representative of it? We have no foolproof method, but to be
reasonably sure that our samples are representative of the population they are drawn
from, we must ensure that:

i. they are drawn randomly

ii. they are of adequate size

We leave it to nature (chance) to make the selection for us; it should not be
subservient to our own choice.

We can ensure samples are drawn randomly, but the size of a sample is usually dictated
by the availability of resources (time, men, money and material) and by statistical
requirements.

Note: For valid research, an adequate sample size is an important prerequisite.

Sampling Unit:

Sampling units may be individuals or households drawn through sampling. They are not
necessarily the observational units. For example, your sampling units may be houses
while you observe children of 1 to 2 years of age for vaccination status.

Sampling Frame:

The list of sampling units is called sampling frame.



SAMPLING TECHNIQUES

In an ideal world we would have a list of all the members of the population and then
draw a sample by, say, the lottery method. But imagine that we want to know the heights
of adult males in district Abbottabad and wish to draw a sample of 1000 adult males.
There may be more than 300,000 adult males in the district. Do we have any way of
obtaining a list of all of them? We surely don't, and similarly, complete lists don't
exist for most of the populations we encounter. Therefore, we need to look for other
sampling techniques.

Random Samples:

A random sample is one in which each and every member of the population has an equal
chance of selection.

There are two types of sampling:

A. Probability Sampling

B. Non-probability Sampling

A. Probability Sampling Techniques:

Probability samples are those in which members of the population have a known, though
not necessarily equal, chance of being selected as sample members. With this technique
of sampling, inferential statements can be made based on samples. In the case of
probability sampling, the sampling frame is available in some shape. Some such
techniques are as under:

a. Simple Random Sample: Drawing a sample from a population by a random method,
e.g., lottery, which gives every individual in the population an equal and
independent chance of appearing in the sample. For example, selecting 20 students
at random from the population of students of Ayub Medical College.

b. Stratified Random Sampling: Drawing a sample from a population which has first
been divided into sub-groups or strata. From each sub-group a sample is drawn by a
random method which gives every individual in the sub-group an equal and independent
chance of appearing in the sample, e.g., if students of Ayub Medical College are
first divided into five classes, i.e., 1st, 2nd, 3rd, 4th, and 5th years, and then
say four students (a specific percentage) from each class are randomly selected.

Note: If you can select a sample by the above two methods then
you should not use any other method.



c. Systematic Sampling: Drawing a sample from a population by a systematic
procedure, e.g., if our sample is 10 from a population of 40, then 40/10 = 4, so we
select every 4th student entering the classroom. But starting from which student?
The starting number has to be determined randomly, through a lottery from 1, 2, 3,
4. If it is 3, then numbers 3, 7, 11, 15, 19, 23, 27, 31, 35, and 39 will be
included in the study. In systematic sampling we don't have a list of the
population, but we know its size.

d. Multi-stage Sampling: A process of sampling a population in a series of
consecutive steps, e.g., a town may be divided into a number of areas and a number
of those areas drawn by a random process; within these drawn areas the schools may
be listed, and a number of these schools drawn by a random process. The pupils
within these schools are then randomly selected and studied. This technique is used
in big surveys where the population is spread over a large geographic area.

e. Cluster Sampling: Cluster sampling is adopted when there is no sampling frame
from which the final sample can be selected. In this type the researcher combs the
area meticulously to find the items needed to form that area's sample, e.g., to
know the vaccination status of children under two years a researcher first selects
a district, then a Union Council, then villages and then a Mohallah – all randomly.
Then he looks for seven or more houses located together that have children less
than two years of age. Such seven or more houses located together are known as a
cluster. It could be multi-stage, i.e., one stage, two stages and so on.

B. Non-probability Sampling: Such samples are putatively non-representative, and no
probability statements can be made based on them because, unlike probability
sampling, the population members have no known chance of being selected as sample
members.

a. Purposive Sampling: Sampling based on a pre-determined idea, for example
selection of all diabetic patients. It is used in qualitative research, wherein, in
the judgement of the researcher, the selected individuals will give good data. Also
called the judgmental sampling technique.

b. Convenience Sampling: Also known as accidental sampling, or the grab technique.
Members are selected according to the convenience of the researcher; for example,
it is convenient to ask the frontbenchers, and hence they become the sample. In
other words, whosoever is available at the time of data collection is included.



c. Quota Sampling: Similar to cluster sampling or stratified random sampling.
Compared with cluster sampling, the researcher does not have to comb the whole area
but examines only a predetermined number – the quota. Compared with stratified
random sampling, the strata are defined, but instead of selecting members through
simple random sampling they are selected through convenience. This technique is
used in quick surveys or polls.

d. Snow-Ball Sampling: If we wish to have a sample of drug addicts, we won't be
able to find many. Hence, we investigate the first addict and through him/her reach
other addicts. Thus the sample is accumulated like a snowball. Useful for hidden
populations and used by qualitative researchers.

e. Consecutive Sampling: Used in hospital setups where no lists are available, as
patients come in unpredictably. Consecutive patients meeting the criteria are
included until the sample size is reached.

 Note: if the sample is not randomly selected, it may not be representative of the
population under study; the resulting distortion is called SYSTEMATIC ERROR or BIAS.
Therefore, to decrease BIAS, the sampling technique is an important consideration. To
decrease the play of CHANCE (random error), sample size is the consideration.



VARIABLES

Variable:
An attribute or characteristic that varies from one individual to another.
Types of Data:
Data is collected against variables. Age is a variable, but ages of students of a sample or
population are data. We need to know different types of variables because different
statistical techniques are employed to analyze different variables.

Quantity or Quantitative Variables:
They are either discrete or continuous.

Discrete:
The variable takes a PARTICULAR (whole) number in a given range. If parity ranges from
zero to ten, then it can be any one of 11 particular numbers; it cannot be 2.6. It is
also called a count variable, as we count it. Examples are parity, number of injections
per day etc.

Continuous:
The variable takes ANY number in a given range. If heights range from 100 cm to 200
cm, there could be infinite values in the given range owing to decimals. These variables
are measured. Examples are hemoglobin value, age, weight etc.
Category or Categorical Variables:
Categorical variables are first categorized and then counted. They are nominal and
ordinal.
Nominal:
Observations have names only for example male/female, black/white/yellow/brown.
There are no orders or ratios. If nominal data has only two groups e.g., male/female it is
called dichotomous or binary data. More than two groups e.g., religion, are called
multichotomous variables.

Ordinal:
Also called RANK variable. When data is placed into meaningful order. Students may be
ranked as 1st, 2nd, 3rd etc. however the interval between orders is not certain. Likert Scale
and Visual Analogue Scale (VAS) are ordinal variables.

Another classification of variables is into independent and dependent variables, used
when we compare variables. Independent variables are presumed causes and dependent
variables presumed effects. Smoking and lung cancer illustrate this: smoking is the
independent variable and lung cancer the dependent variable.



COMPILATION OF DATA

After data collection, the researcher has to organize the scores into some comprehensible
form for further statistical processes. The most commonly used procedure for organizing
a set of data is to place the scores in a frequency distribution.

Frequency Distribution:
The collected data can be plotted in tabular form or graphic form after organizing it
showing frequencies of different observations. It can also be organized in group form
which is called grouped frequency distributions. In frequency distribution, the
disorganized sets of scores are arranged in an order (ascending or descending) by
grouping together all individuals who have the same scores. A frequency distribution can
be in the form of a table or a graph. Following is the data on the pulse rates/min of 15
students:
72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75

The frequency distribution is as under:


Pulse Rate No. (Frequency)
62 1
66 1
67 1
69 1
72 1
73 3
75 1
76 1
78 1
80 1
82 1
86 1
108 1
Total 15
When the data is large, its presentation in a simple frequency distribution table (as
shown above) is not practical. Such large data is condensed into "groups" and labeled
as grouped data, organized in a grouped frequency distribution table as shown below:
Frequency Distribution of the grouped data could be as under:
Pulse Rate (class interval) | No. (Frequency) | Five-bar gate (tallying)
61-70   | 4  | ////
71-80   | 8  | //// ///
81-90   | 2  | //
91-100  | 0  |
101-110 | 1  | /
Total   | 15 |

Note: For tallying observations the FIVE-BAR-GATE method shown in the third column of
the above table can be used: four vertical strokes crossed by a fifth stroke make one
five-bar gate. Nowadays computer software like MS Excel, SPSS etc. is used (and should
be used), with the aforementioned manual method becoming obsolete.
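As a worked illustration, here is a minimal sketch (assuming Python 3 and only its
standard library) that reproduces the tallying above in software; the pulse data and
the 61-70, 71-80, ... class intervals are those of the tables above:

from collections import Counter

pulses = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]

# Simple frequency distribution: one row per distinct value.
for value, freq in Counter(sorted(pulses)).items():
    print(value, freq)

# Grouped frequency distribution: map each pulse to its class interval.
grouped = Counter((p - 61) // 10 for p in pulses)  # 0 -> 61-70, 1 -> 71-80, ...
for k in range(5):
    print(f"{61 + 10 * k}-{70 + 10 * k}: {grouped.get(k, 0)}")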



DESCRIPTION OF DATA

Once the data is collected, we need to describe it. Before describing the data, it is
important to know the type of variable we are going to describe. The quantitative variable
is described in terms of measures of central tendencies and measures of dispersion. For
categorical variables, frequency and percentages are used.

Description of quantitative variable:


A. Central Tendencies of Data:

Central tendency is defined as a single value/score that is representative of the
entire distribution.
Measured by Mean – Median – Mode
Mean (Arithmetic Mean): It is defined as the sum of observations divided by the number
of observations:

Mean = X̄ = Σx / n

Σx = sum of observations in a sample
n = number of observations in a sample

Remember that the mean of a sample is denoted by X̄ (pronounced "X bar") and the mean
of a population is denoted by µ (pronounced "mew"):

Mean = µ = ΣX / N

µ = mean of a population
ΣX = sum of values in a population
N = number of values in a population

To calculate the Mean from the data of pulse rates of students:
Sum of all observations = 1140 (Σx)
Number of observations = 15 (n)

X̄ = Σx / n = 1140 / 15 = 76 beats/min

To calculate the mean of grouped data, the following method is adopted:

Pulse Rate (class interval) | Frequency (F) | Mid point (M) | F x M
61-70   | 4      | 65.5  | 262
71-80   | 8      | 75.5  | 604
81-90   | 2      | 85.5  | 171
91-100  | 0      | 95.5  | 0
101-110 | 1      | 105.5 | 105.5
Total   | n = 15 |       | Σ(F x M) = 1142.5

X̄ = Σ(F x M) / n = 1142.5 / 15 = 76.16 beats/min

In the above table the mid point is calculated as (61 + 70) / 2 = 65.5,
(71 + 80) / 2 = 75.5, and so on.



The mean calculated in this way is not exactly the same as that calculated by adding
individual observations, but it is close. With very large data it comes even closer to
the actual value.
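For readers who wish to check the grouped-mean arithmetic, here is a minimal sketch
(assuming Python 3); each class is represented by its midpoint weighted by its
frequency, exactly as in the table above:

intervals = [(61, 70, 4), (71, 80, 8), (81, 90, 2), (91, 100, 0), (101, 110, 1)]

# Sum of F x M, where the mid point M = (lower limit + upper limit) / 2.
total_fm = sum(f * (low + high) / 2 for low, high, f in intervals)
n = sum(f for _, _, f in intervals)
print(total_fm / n)  # 1142.5 / 15 = 76.17 beats/min (76.16 in the notes)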

Advantages of Mean:
i. It represents all the values in a distribution
ii. Can be used in further statistical computations
Disadvantages:
i. It is affected by extreme values
ii. Sometimes it can give a ridiculous figure, e.g., 2.35 children, 1.13 eggs etc.
Mean is used for continuous data.

Median: The centre value of a series of observations when the observations are ranked
in order from the lowest value to the highest (ascending or descending order). The
median divides the distribution into two equal halves.

Position of Median = (n + 1) / 2 th value

Using the same data, we first arrange the observations in order from lowest to highest:

62, 66, 67, 69, 72, 73, 73, 73, 75, 76, 78, 80, 82, 86, 108

Position of Median = (n + 1) / 2 = (15 + 1) / 2 = 16 / 2 = 8; the 8th value is the
median, which is 73.

You can see that there are seven values below 73 and an equal number, i.e., seven,
above the median. In the data shown n = 15, which is an odd number. If n = 16, an even
number, then:

Position of Median = (n + 1) / 2 = (16 + 1) / 2 = 8.5

8.5 means the average of the 8th and 9th observations. If, e.g., the 8th value was 73
and the 9th 75, then the Median would be (73 + 75) / 2 = 148 / 2 = 74 beats/min. In
this case the median may not be an actually observed value.

Advantages of Median:
It is not affected by extreme values; therefore it is used for data that is skewed,
i.e., data having extreme observations.

Disadvantages:
1. It does not take into account all the values of a distribution
2. It is of limited value in further statistical computation
Median can be used for Quantitative (discrete and continuous) and Ordinal data.



Mode: The most frequently observed value in a distribution is known as the mode. In the
aforementioned data the Mode is 73 beats per minute, which appears three times.
1. Mode can be used for all types of data.
2. Mode is not affected at all by extreme values.
3. Mode is of no value in further statistical computations.
4. Mode does not take into account all the values in a distribution.

Some distribution may have two modes – they are called bimodal distributions. If there
are more than two modes, such distributions are known as multimodal distributions.
Mode can be used for all types of data i.e., Nominal, Ordinal, and Quantitative (discrete
and continuous).

Note: Mean, Median and Mode have the same units as the observations, and the units
must be stated with the resultant value, e.g., Mean is 76 beats per minute.
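The three central tendencies of the pulse-rate data can be checked with Python's
standard library (a minimal sketch, assuming Python 3):

import statistics

pulses = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]

print(statistics.mean(pulses))    # 76 beats/min
print(statistics.median(pulses))  # 73 beats/min, the 8th of 15 sorted values
print(statistics.mode(pulses))    # 73 beats/min, the value appearing three times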

B. Measures of Dispersions/ Variations/Variability/Deviation


It is very important to know whether the scores in a distribution are
spread out or clustered together. The following statistical measures
are used to assess the degree of variability in the data:
1. Range
2. Mean Deviation
3. Variance
4. Standard Deviation
5. Coefficient of Variation

1. Range: It is the difference between the highest and lowest observations.


In the data

72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75

The range is 62 to 108 beats per minute, i.e., 108 - 62 = 46 beats per minute. The
range is a good measure of dispersion when we want to know immediately how the data is
spread out, but it takes into account only the lowest and highest values of a
distribution. Therefore, it is not a good overall measure of dispersion of data.

3. Variance: is equal to the sum of squared deviations of observations from the mean of
the distribution, divided by the number of observations. (Samples of less than 30 use
n - 1, which may be used in big samples too without making much difference. The Mean
Deviation listed above is not elaborated in these notes.)

Variance = Σ(X - X̄)² / n



We arrange the data in the following table:

Heart Rate (X) | Mean (X̄) | Deviation from Mean (X - X̄) | Squared deviation (X - X̄)²
62  | 76 | -14 | 196
66  | 76 | -10 | 100
67  | 76 | -9  | 81
69  | 76 | -7  | 49
72  | 76 | -4  | 16
73  | 76 | -3  | 9
73  | 76 | -3  | 9
73  | 76 | -3  | 9
75  | 76 | -1  | 1
76  | 76 | 0   | 0
78  | 76 | +2  | 4
80  | 76 | +4  | 16
82  | 76 | +6  | 36
86  | 76 | +10 | 100
108 | 76 | +32 | 1024
n = 15 | | Σ(X - X̄) = 0 | Σ(X - X̄)² = 1650

The deviation of each score from the mean of the distribution is calculated as X - X̄.
By this process we get positive (+) and negative (-) scores, which cancel each other;
added together they give zero. That is why it is said that the sum of all deviation
scores is always zero. With a zero answer, further statistical procedures are stopped;
hence we take the square of each deviation score.

Σ(X - X̄)² = 1650

Variance = 1650 / 15 = 110

We square the deviations to get rid of the negative signs, but by squaring the values
we lose the units. Therefore, Variance is of limited value in measuring the dispersion
of the data.
4. Standard Deviation: The most useful measure of dispersion, which can be used in
further statistical computations. It is the square root of the sum of squared
deviations of observations from the mean of the distribution divided by the number of
observations. (Samples of less than 30 use n - 1, which may be used in big samples too
without making much difference.)

Standard Deviation (SD) = √Variance
Variance = (Standard Deviation)² = SD²

SD = √( Σ(X - X̄)² / n )



Using the table constructed for calculating the variance, we know that
Σ(X - X̄)² = 1650 and n = 15.

SD = √( Σ(X - X̄)² / n ) = √(1650 / 15) = √110 = 10.5 beats/min

By squaring the deviations we get rid of the negative signs, but we lose the original
unit; this is taken care of by applying the square root, which restores the original
units.
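A minimal sketch (assuming Python 3) of the variance and standard deviation just
computed; pvariance and pstdev divide by n, while variance and stdev apply the n - 1
correction mentioned earlier:

import statistics

pulses = [72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75]

print(statistics.pvariance(pulses))  # 110.0, i.e. 1650 / 15
print(statistics.pstdev(pulses))     # 10.48..., ~10.5 beats/min
print(statistics.stdev(pulses))      # 10.85..., dividing by n - 1 = 14 instead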
The use of standard deviation in statistical data is explained with the Normal Distribution.

Note: The Standard Deviation of a sample is denoted by the symbol SD, and the Standard
Deviation of a population is denoted by the Greek small letter sigma (σ).

5. Co-efficient of Variation:
Measures variability in relation to the mean, and offers a method by which one can
compare the relative dispersion of one type of data with the relative dispersion of
another type of data.

Co-efficient of Variation = (SD / Mean) x 100

Our data of heart beats per minute has a co-efficient of variation of:
(10.5 / 76) x 100 = 13.8%

If we had also recorded the systolic blood pressures of the same individuals, with a
mean systolic BP of 130 mmHg and a Standard Deviation of 13 mmHg, the co-efficient of
variation would have been:
(13 / 130) x 100 = 10%

Now we can compare and conclude that, among the persons whose pulse rates and systolic
blood pressures were recorded, pulse rates are more variable than systolic blood
pressure, since the co-efficient of variation of pulse rates is 13.8% and that of
systolic BP is 10%.
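A minimal sketch (assuming Python 3) of the comparison above; the helper name cv is
ours, and the BP mean and SD are the figures quoted in the text:

def cv(sd, mean):
    # Coefficient of variation expressed as a percentage of the mean.
    return sd / mean * 100

print(cv(10.5, 76))  # 13.8% for pulse rate
print(cv(13, 130))   # 10.0% for systolic BP, so pulse is the more variable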



INFERENTIAL STATISTICS

Inferential Statistics means going beyond the actual observations and stating something
(based on the collected data) which has not been actually observed. Here the theory of
probability comes in.

Probability:
The number of events occurring out of the total possible number of events is called
probability. If we flip a fair coin, the probability of getting a head is ½, or 50%, or
0.5. The probability of getting either a head or a tail is 1/1, or 100%, or 1. Two
simple rules of probability need to be remembered.

Addition Rule:
For mutually exclusive events, the probability that one or another of them occurs is
the sum of their individual probabilities; for a set of mutually exclusive and
exhaustive events the probabilities sum to ONE, or 100%. For example, there are two
possibilities when we flip a fair coin: either head or tail. We cannot have both head
and tail on one flip. The probability of a head is 0.5 or 50%; therefore, according to
the addition rule, the probability of a head or a tail is 0.5 + 0.5 = 1, or
50% + 50% = 100%.

Example: If infant mortality rate is 60 per 1000 in Pakistan, then, the probability of an
infant dying is 60 per 1000 or 6 per 100 or 6% or 0.06. The probability of an infant
surviving is 940 per 1000 i.e., 1000-60=940. It can also be said that the probability of an
infant surviving is 94% or 0.94. As a child can either survive or die and they are mutually
exclusive phenomena, therefore, according to addition rule the probability of either dying
or surviving is 0.06+0.94=1. (If the statement has “or” in it, addition rule is applied)

Multiplication Rule: For two or more independent, randomly occurring phenomena, the
probabilities multiply. When we flip a fair coin, it is an independent event; when we
flip a fair coin twice or thrice or more, all flips are independent events. The
probability of getting a head on one flip is 0.5, so the probability of heads on two
flips is 0.5 x 0.5 = 0.25.

Example: If we know that 10% of patients visiting a medical OPD suffer from
hypertension, it means the probability of a patient having hypertension is 0.1. So, the
events being independent, the probability that the first two patients entering the OPD
both suffer from hypertension is 0.1 x 0.1 = 0.01, or 1%. This is called the
multiplication rule.

Note: If the probability is stated to be 1 it is called unity. To say one has to die eventually
the probability will be 1. For one to stay alive for ever the probability will be 0. In
between 0 and 1 there are fractions of 1, which may have many decimals for different
events.
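The two rules can be illustrated with a minimal sketch (assuming Python 3); the
simulation simply checks the multiplication rule against repeated pairs of fair-coin
flips:

import random

p_head = 0.5
print(p_head + p_head)  # addition rule: P(head or tail) = 1
print(p_head * p_head)  # multiplication rule: P(head and head) = 0.25

# Simulation: the observed share of double-heads approaches 0.25.
trials = 100_000
double_heads = sum(random.random() < 0.5 and random.random() < 0.5
                   for _ in range(trials))
print(double_heads / trials)  # ~0.25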



NORMAL DISTRIBUTION

Also known as Gaussian Distribution.

The French-born mathematician Abraham de Moivre, working in England, first described it
in the eighteenth century. It is an important statistical distribution and is a
mathematical model of the frequency distribution of most biological values in nature.
Shown diagrammatically, the Normal Distribution is denoted by a curve known as the
Normal Curve or Gaussian Curve.

NORMAL CURVE

On the X-axis are the values and the Y-axis shows the frequency of those values, like a
frequency distribution. It is important to remember that the normal distribution is a
probability distribution and describes an ideal world. If our collected data tends to
conform to the normal distribution, we make use of it in statistical inferences. The
total probability of the frequency of values under the curve is equal to 1, or 100%.
The individual values under the curve have probabilities of occurrence (frequencies)
ranging between 0 and 1 (or 0% to 100%) and totalling 1.

PROPERTIES OF NORMAL CURVE

1. It is bell shaped.
2. It is perfectly symmetrical.
3. Mean, Median and Mode are in the centre of the curve i.e. the dome of the curve.
4. Half the values (50%) lie on each side when it is cut into half at the highest point.
5. It has got two determinants: the Mean (µ) and the Standard Deviation (σ).
6. 68.26% of the values lie within the range Mean ± 1 x SD (µ - 1σ to µ + 1σ). In other
words, the probability of occurrence of values in the range µ - 1σ to µ + 1σ is 68.26%,
or .6826. This also implies that 31.74% of values are either below µ - 1σ or above
µ + 1σ; i.e., the probability of occurrence of values outside this range is 31.74%, or
.3174.
7. 95.45% of the values lie within the range Mean ± 2 x SD (µ - 2σ to µ + 2σ); the
probability of occurrence of values in this range is 95.45%, or .9545. This also
implies that 4.55% of values are either below µ - 2σ or above µ + 2σ; i.e., the
probability of occurrence of values outside this range is 4.55%, or .0455.
8. 99.73% of the values lie within the range Mean ± 3 x SD (µ - 3σ to µ + 3σ); the
probability of occurrence of values in this range is 99.73%, or .9973. This also
implies that 0.27% of values are either below µ - 3σ or above µ + 3σ; i.e., the
probability of occurrence of values outside this range is 0.27%, or .0027.

To elaborate further and make it useful, remember the following landmarks also:

a. 95% of the values lie within the range Mean ± 1.96 x SD (µ - 1.96σ to µ + 1.96σ);
the probability of occurrence of values in this range is 95%, or .95. This implies that
5% of values are either below µ - 1.96σ or above µ + 1.96σ; i.e., the probability of
occurrence of values outside this range is 5% (2.5% on each side), or .05 (.025 on each
side).
b. 99% of the values lie within the range Mean ± 2.58 x SD (µ - 2.58σ to µ + 2.58σ);
the probability of occurrence of values in this range is 99%, or .99. This implies that
1% of values are either below µ - 2.58σ or above µ + 2.58σ; i.e., the probability of
occurrence of values outside this range is 1%, or .01.

The multiple of the standard deviation is called Z (or t, in the case of the t
distribution), and it ranges from 0 to infinity. The area under the normal curve is
also referred to as the area under Z.

Note: When we say that a certain percentage of observations lie between Mean ± Z x SD,
the Z in the case of 68.26% is 1; in the case of 95%, Z is 1.96; in the case of 95.45%,
Z is 2; in the case of 99%, Z is 2.58; and in the case of 99.73%, Z is 3.
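These landmark areas need not be taken on faith; a minimal sketch (assuming Python 3)
verifies them from the normal curve itself, since the probability of lying within
Mean ± z x SD equals erf(z / √2):

import math

def area_within(z):
    # Probability that a normal value lies within Mean +/- z x SD.
    return math.erf(z / math.sqrt(2))

for z in (1, 1.96, 2, 2.58, 3):
    print(z, round(area_within(z) * 100, 2))
# prints 68.27, 95.0, 95.45, 99.01, 99.73 - matching the landmarks above
# (68.26 vs 68.27 and 99 vs 99.01 differ only in rounding)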

NOTE: The difference between the Normal Curve and the STANDARD NORMAL CURVE (SNC) is
that the SNC has a Mean of zero and a Standard Deviation of one; any normal variable
can be placed on this scale through Z = (X - µ)/σ. The Normal Curve describes the
frequency distribution of a variable in the entire population; the frequency
distribution of the means of all its samples (with respect to a variable) is also
normal, and is standardized in the same way.

KURTOSIS: This means the peakedness of a normal curve. Three types are described
according to the size of the population standard deviation; the curves have the same
mean:
1. Mesokurtic: in the middle, showing a moderate standard deviation.
2. Platykurtic: flat, with a wide standard deviation.
3. Leptokurtic: tall and narrow, with a small standard deviation.



OTHER SHAPES OF FREQUENCY DISTRIBUTION
A curve may not be symmetrical, as in most cases of sample data. It may take many
shapes, but two are important to remember. (These shapes are NOT Normal Curves but data
curves.)

Diagram A: a positively (right) skewed distribution; along the X-axis the order of the
averages is Mode, Median, Mean.

Diagram B: a negatively (left) skewed distribution; along the X-axis the order is Mean,
Median, Mode.

Diagram A is a distribution skewed to the right (Positively Skewed); diagram B shows a
distribution skewed to the left (Negatively Skewed).
Now we have three types of curves:
1. Symmetrical Curve
2. Curve Skewed to the right
3. Curve Skewed to the left



1. Symmetrical Curve:
Suppose students appear in a test on subject K. The data show that very few students
score less than 10%. As the scores increase, the number of students increases until a
stage is reached where the scores are around 50%, and most students score around that;
that is the mode of the data. According to the properties of the Normal Curve it is
also the Mean and Median. As scores increase further, the number of students keeps
decreasing until we reach the students scoring around 90%; you can appreciate that they
will be very few. This type of distribution of scores is a normal distribution. Most
biological values are distributed like this, e.g., pulse rates, blood pressure,
hemoglobin values etc.

2. Curve Skewed to the Right (Positively Skewed):
Suppose students appear in a test on subject L. We observe that many students have
scores on the lower side; that will be the mode of the distribution. (We know that the
mode is not affected by extreme values at all.) Next to the mode, to the right, will be
the median, as it is less affected by extreme values. The mean will be furthest to the
right, where the few extreme values lie. On the right of the curve will be the students
with higher scores, fewer in number. This distribution is skewed to the right, which
means that most students scored low and a few scored high marks. Wealth is generally
distributed like this.

3. Curve Skewed to the Left (Negatively Skewed):
Suppose students appear in a test on subject M. We can appreciate that very few
students scored low. Most of the students scored high; therefore the mode of the
distribution falls on the extreme right. To its left will be the median, and further
left the mean. This implies that most of the students did well in the test on subject M
and only a few lagged behind. The distribution of hemoglobin values in children is
skewed to the left.

Note: Skew means tail. The skew is said to be on the side where the tail of the
distribution lies.



ESTIMATION
One main purpose of statistics is to estimate population values/parameters. Estimation
means generalizing to a bigger phenomenon by actually looking at only a part of it –
that is, making statements about a population (which has not been fully examined) on
the basis of a part of it that is actually examined. In other words, we extrapolate our
sample data to the population from which the sample was drawn. Do not forget that, to
be able to make such statements, the sample has to be representative of the population
it is drawn from; and for a sample to be representative, the data has to be collected
in a random manner.

From our data of pulse rates for which we have calculated Mean and Standard Deviation

72, 73, 80, 62, 66, 108, 82, 73, 69, 78, 86, 67, 76, 73, 75

Mean = 76
SD = 10.5
n = 15
By using normal distribution we can say that:

1. 68.26% of the values are within Mean ± 1 SD, i.e., between 65.5 and 86.5
2. 95% of values are within Mean ± 1.96 SD, i.e., between 55.4 and 96.6
3. 95.45% of values lie within Mean ± 2 SD, i.e., between 55 and 97
4. 99% of values lie within Mean ± 2.58 SD, i.e., between 48.9 and 103.1
5. 99.73% of values lie within Mean ± 3 SD, i.e., between 44.5 and 107.5

These are confidence limits for the Sample (read SAMPLE). Number 1 is 68.26%
confidence limits; number 2 is 95% confidence limits; number 3 is 95.45% confidence
limits; number 4 is 99% confidence limits; and number 5 is 99.73% confidence limits.
This means that you can state with a certain percent of confidence in what range the
values within your sample fall. But do not forget that these confidence limits are for your
sample and not for the population from which the sample is drawn.

The upper limit of the range is upper confidence limit; the lower limit of the range is
lower confidence limit. In between the upper and lower confidence limits is the
CONFIDENCE INTERVAL. 95% confidence limits for a sample imply that 95% of the
observations in the sample will lie within this range, which in the case of our data are
55.4 to 96.6. It also means that 5% of observations may lie outside these limits either
below lower confidence limit or above the upper confidence limit.

Such calculations are of limited use as long as we do not know the population mean (µ)
and population Standard Deviation (σ); and if we already knew the population mean and
Standard Deviation, what would be the need of all this exercise, i.e., of studying a
sample?

Therefore, we have to estimate the population parameters, especially the Standard
Deviation (σ) of the population. Here comes the concept of Standard Error.
STANDARD ERROR:

To understand the concept of Standard Error let’s take the example of our data of pulse
rates. We have a mean of the data, which is 76 per minute. If we draw repeated samples
from the same population and compute means of all the samples then we’ll have a
distribution of means of the samples like individual values of pulse rates in one sample.

Central limit theorem states that “means” of many samples from the same population are
normally distributed. The Standard Deviation of the distribution of means of many
samples of one population is known as Standard Error (SE). We use Standard Error to
estimate a population parameter. But do you really think that somebody can actually
carry out the exercise of drawing repeated samples from a population, and in such
numbers as to construct a meaningful distribution? Only an eccentric would be prepared
to do that.

Statistics provides us with a formula to calculate the SE without going through that
cumbersome exercise:

Standard Error (of the Mean), SE = SD / √n

SD = Standard Deviation of the sample
n = number of observations in the sample.

If we apply this formula to our data, we have:

SE = 10.5 / √15 = 10.5 / 3.87 = 2.7
Now we have a standard error (SE) of 2.7, which can help us estimate the population
parameter. Based on it we calculate confidence limits for the population in exactly the
same way as for the sample, but substituting the standard error of the mean for the
standard deviation of the sample.
1. 68.26% confidence limits are: Mean ± 1 x SE, i.e., from 73.3 to 78.7
2. 95% confidence limits are: Mean ± 1.96 x SE, i.e., from 70.7 to 81.3
3. 95.45% confidence limits are: Mean ± 2 x SE, i.e., from 70.6 to 81.4
4. 99% confidence limits are: Mean ± 2.58 x SE, i.e., from 69 to 83
5. 99.73% confidence limits are: Mean ± 3 x SE, i.e., from 67.9 to 84.1
Confidence limits based on the standard error of a mean are confidence limits for the
population, and hence an estimate of the population parameter based on the sample. But
remember that the exact interpretation of confidence limits calculated from the
standard error of the mean differs a little from that of confidence limits calculated
from the actual mean and standard deviation of the population, if known.
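A minimal sketch (assuming Python 3) of the standard-error calculation and the
population confidence limits above:

import math

mean, sd, n = 76, 10.5, 15
se = sd / math.sqrt(n)  # 10.5 / 3.87 = 2.71

for z, label in [(1, "68.26%"), (1.96, "95%"), (2, "95.45%"),
                 (2.58, "99%"), (3, "99.73%")]:
    print(label, round(mean - z * se, 1), "to", round(mean + z * se, 1))
# e.g. the 95% line prints 70.7 to 81.3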



CONFIDENCE LIMITS BASED ON ACTUAL STANDARD DEVIATION AND MEAN: 95% confidence limits
mean that 95% of the values of that particular observation in the population lie within
Mean ± 1.96 x σ.

CONFIDENCE LIMITS BASED ON STANDARD ERROR OF MEAN: 95% confidence limits mean that if
we draw many samples from the same population, 95% of the time the sample means will
fall within these limits. Practically, though, we interpret them as if we knew the
actual mean and standard deviation of the population.

CONFIDENCE LIMITS FOR A PROPORTION: We can also calculate confidence limits for a
proportion, using the standard error of a proportion:

SE of Proportion = √( p x q / n )

p = proportion (percentage, or fraction of 1) of an event occurring
q = proportion of the event NOT occurring, i.e., q = 100 - p (percentage) OR 1 - p
(fraction of 1)
n = number of observations (sample size)

Example: If the number of people with iodine deficiency is 55 out of a randomly
selected sample of 440 persons in district 'x', the 95% and 99% confidence limits will
be as under:

The number of persons with iodine deficiency = 55 out of 440
Then p = 55/440 x 100 = 12.5%
So q = 100 - p = 100 - 12.5 = 87.5%
Sample size = n = 440

Therefore, using the formula:

SE of Proportion = √( p x q / n ) = √( 12.5 x 87.5 / 440 ) = 1.57

95% confidence limits are p ± 1.96 x SE = 12.5 ± 3.07
95% confidence limits are 12.5 - 3.07 to 12.5 + 3.07 = [9.43% to 15.57%]

This means that if we draw repeated samples from the population of 'x', 95% of the
samples will have iodine-deficient people between 9.43% and 15.57%.

99% confidence limits are p ± 2.58 x SE = p ± 2.58 x 1.57 = 12.5 ± 4.05
99% confidence limits are 12.5 - 4.05 to 12.5 + 4.05 = [8.45% to 16.55%]



This means that if we draw repeated samples from the population of district 'x', 99% of
the samples will have iodine-deficient people between 8.45% and 16.55%. Confidence
limits for a proportion imply the same as in the case of confidence limits for a mean.
If we increase the sample size, the standard error decreases, and consequently the
confidence interval contracts.
95% Confidence Interval (CI) is defined as: The range of mean values or
proportions within which there are 95 chances out of 100 that the true population
mean or proportion will fall

99% Confidence Interval (CI) is defined as: The range of mean values or
proportions within which there are 99 chances out of 100 that the true population
mean or proportion will fall
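A minimal sketch (assuming Python 3) of the confidence limits for the iodine-deficiency
proportion worked out above (p, q and n as in the text):

import math

p = 55 / 440 * 100         # 12.5%
q = 100 - p                # 87.5%
n = 440
se = math.sqrt(p * q / n)  # 1.576..., rounded to 1.57 in the notes

for z, label in [(1.96, "95%"), (2.58, "99%")]:
    print(label, round(p - z * se, 2), "to", round(p + z * se, 2))
# 95%: 9.41 to 15.59 and 99%: 8.43 to 16.57; the slightly different
# 9.43-15.57 and 8.45-16.55 above come from rounding SE to 1.57 first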

t-Distribution: When the sample size is small we use the t distribution instead of the
Z distribution (normal distribution). While calculating the confidence limits we use t
in place of Z. Another alteration to the method is in the computation of the standard
error, given as under:

Standard Error of the Mean (t distribution) = SD / √(n - 1)

If we calculate the standard error by this method for our data of pulse rates, it will
be 10.5 / √14 = 10.5 / 3.74 = 2.8

95% confidence limits will be Mean ± t x SE. To find the value of t we refer to the t
table. First we calculate the degrees of freedom (DF), which are n - 1. Our n is 15,
hence DF = 15 - 1 = 14. Referring to the t table at 14 DF, we find the value of t at
0.05 (which corresponds to 95% confidence limits) to be 2.14.

95% confidence limits are: Mean ± t x SE = 76 ± 2.14 x 2.8 = 76 ± 5.99

Therefore, 95% confidence limits are: 70.01 to 81.99

In the same way we can calculate 99% confidence limits. By referring to t table at 14 DF
and 0.01(which means 99% confidence limits) we find the value of t as 2.98. We
substitute 2.98 for t or 2.14 in the previous example and calculate 99% confidence limits.
(Do it yourself)

Note: t is higher than Z (2.14 > 1.96 in the case of 95% CL, and 2.98 > 2.58 for 99%
CL), but beyond 120 DF t is practically equal to Z.
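A minimal sketch (assuming Python 3) of the t-based limits; the critical value 2.14 is
read from the t table at 14 DF and 0.05, since the standard library has no t table of
its own:

import math

mean, sd, n = 76, 10.5, 15
se = sd / math.sqrt(n - 1)  # 10.5 / 3.74 = 2.8
t_crit = 2.14               # from the t table at DF = n - 1 = 14, p = 0.05

print(round(mean - t_crit * se, 2), "to", round(mean + t_crit * se, 2))
# 69.99 to 82.01; the 70.01 to 81.99 above comes from rounding SE to 2.8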



(The t-distribution table, giving critical values of t by degrees of freedom and
significance level, appears here in the original notes.)


SIGNIFICANCE TESTING
HYPOTHESIS TESTING

We may be interested in comparing two or more populations to determine whether, with
regard to some observations, they differ significantly, or whether the differences are
just by chance – more precisely, the result of sampling error. We know that means of
samples even from the same population may differ, but to what extent remains the
question, and it has to be answered through significance testing or hypothesis testing.

While comparing two or more samples we may have a hypothesis, called the research
hypothesis. Such a hypothesis may state that there is a difference, or otherwise. We
have to test it based on the collected data.

NULL HYPOTHESIS (H0): A statistical hypothesis that is tested for rejection with the
assumption that it could be true. It states that the different sets of data belong to
one population and the observed differences are by chance. In other words:

A = B

ALTERNATIVE HYPOTHESIS (H1): It states that the different sets of data belong to
different populations and the differences are statistically significant, not due to
chance.

A ≠ B or A > B or A < B

SIGNIFICANCE TESTING:
To test a hypothesis, or to know about significance, we perform different statistical
tests in different situations. (An algorithm for selecting a statistical test in
hypothesis testing is shown as a diagram in the original notes.)



Tests applied to data which is normally distributed are called parametric tests,
because they are applied to data which have parameters like the Mean and Standard
Deviation. Parametric data is quantitative.

For data which is not normally distributed – that is, non-parametric data – we use
non-parametric tests. Non-parametric data is nominal or ordinal.

NOTE: Parametric tests are more sensitive than non-parametric tests. It is also
important to note that the data has to be collected randomly for the tests to be
meaningful.
5% LEVEL OF SIGNIFICANCE (p = 0.05): A level of probability at which the Null
hypothesis is rejected if an obtained sample difference occurs by chance only 5 times or
less out of 100.

1% LEVEL OF SIGNIFICANCE (p = 0.01): A level of probability at which the Null
hypothesis is rejected if an obtained sample difference occurs by chance only 1 time or
less out of 100.

We will discuss Normal distribution test (Z-test) and Chi-square tests only.

1. Z-test: difference between two means

Pre-requisites:
i. Data is normally distributed
ii. Data is randomly collected

Z = (X̄1 - X̄2) / SE

X̄1 = Mean of sample 1
X̄2 = Mean of sample 2

where SE is the standard error of the difference between two means:

SE (diff. between two means) = √( SD1²/n1 + SD2²/n2 )

SD1 = Standard Deviation of sample 1
SD2 = Standard Deviation of sample 2
n1 = number of observations in sample 1
n2 = number of observations in sample 2

Example:
If we want to compare the weights of girl students of 1st year and Final year, we
collect the data randomly. After collection and computation we have the following
figures:

1st year (Sample 1)
Number of girls = n1 = 32
Mean weight in kg = X̄1 = 54
Standard Deviation = SD1 = 4

Final year (Sample 2)
Number of girls = n2 = 27
Mean weight in kg = X̄2 = 62
Standard Deviation = SD2 = 5

SE = √( SD1²/n1 + SD2²/n2 ) = √( (4)²/32 + (5)²/27 ) = √( 16/32 + 25/27 ) = 1.2

Z = (X̄1 - X̄2) / SE = (54 - 62) / 1.2 = -6.7, i.e., |Z| = 6.7

Remember that the difference is statistically significant if |Z| is more than 1.96 at
the 5% level, and more than 2.58 at the 1% level (please refer to the properties of the
normal distribution).

Our data show a significant difference at both the 5% and 1% levels. Hence we can state
that there is a statistically significant difference between the girl students of 1st
and final year with regard to their weights, at both the 5% and 1% significance levels.
(We will reject null hypothesis)

Note: One has to determine the significance level during the planning stage of the study.
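A minimal sketch (assuming Python 3) of this Z-test, with the figures from the example
above; the helper name z_two_means is ours:

import math

def z_two_means(m1, sd1, n1, m2, sd2, n2):
    # Z statistic for the difference between two sample means.
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (m1 - m2) / se

z = z_two_means(54, 4, 32, 62, 5, 27)
print(round(abs(z), 1))  # 6.7, exceeding both 1.96 and 2.58: significant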

ii. Z-test: difference between two proportions

Z = (p1 - p2) / SE

p1 = percentage (proportion) of occurrence in sample 1
p2 = percentage (proportion) of occurrence in sample 2

SE (standard error of the difference between two proportions) = √( p1 x q1 / n1 + p2 x q2 / n2 )

q1 = percentage (proportion) of non-occurrence in sample 1 (100 - p1)
q2 = percentage (proportion) of non-occurrence in sample 2 (100 - p2)
n1 = number of observations in sample 1
n2 = number of observations in sample 2

Example:
Suppose we collect some data randomly, with the following observations: 13 out of a
sample of 63 fourth-year students are obese, and 17 out of 61 third-year students are
obese. Is there any statistically significant difference between 4th and 3rd year
students with regard to the frequency of obesity, or are the observed differences by
chance?

To know the answer we apply Z test for two proportions.

4th year students (Sample 1)
Percentage of obese = p1 = 13/63 x 100 = 20.6%
Percentage of non-obese = q1 = 100 - p1 = 100 - 20.6 = 79.4%
Number of observations = n1 = 63

3rd year students (Sample 2)
Percentage of obese = p2 = 17/61 x 100 = 27.9%
Percentage of non-obese = q2 = 100 - p2 = 100 - 27.9 = 72.1%
Number of observations = n2 = 61

SE = √( p1 x q1 / n1 + p2 x q2 / n2 ) = √( 20.6 x 79.4 / 63 + 27.9 x 72.1 / 61 ) = 7.7

Z = (p1 - p2) / SE = (20.6 - 27.9) / 7.7 = -7.3 / 7.7 = -0.94, i.e., |Z| = 0.94
As |Z| is less than 1.96, we can say at the 5% significance level that there is no
statistically significant difference between 4th and 3rd year students with regard to
obesity, and the differences observed are due to chance.
(In this case we accept the Null Hypothesis)

We can use t test instead of Z test in the case of both small and large samples but to
decide about the critical level, we have to use t table as stated earlier.
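A minimal sketch (assuming Python 3) of the two-proportion Z-test just worked, on the
percentage scale used in the notes; the helper name z_two_proportions is ours:

import math

def z_two_proportions(p1, n1, p2, n2):
    # Z statistic for the difference between two sample percentages.
    q1, q2 = 100 - p1, 100 - p2
    se = math.sqrt(p1 * q1 / n1 + p2 * q2 / n2)
    return (p1 - p2) / se

p1 = 13 / 63 * 100  # 20.6% obese in 4th year
p2 = 17 / 61 * 100  # 27.9% obese in 3rd year
print(round(abs(z_two_proportions(p1, 63, p2, 61)), 2))
# 0.94, below 1.96: not significant at the 5% level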



2. TESTS FOR CATEGORICAL DATA
CHI-SQUARE TEST:
The Chi-square Test (X²) is applied to categorical data, but the data has to be
collected randomly. It also shows association between two or more variables. There are
many ways to compute X², but we will discuss only the 2x2 contingency table (two-way
chi-square test). Suppose a researcher has invented a new vaccine for measles and
claims that it prevents measles. He randomly selects two groups of children; one group
is inoculated with the vaccine, the other group is left as such, and the outcome is
observed. His observations are recorded in the table below.
               | Developed Measles | No Measles | Total
Inoculated     | 18 (1)            | 118 (2)    | 136
Not inoculated | 22 (3)            | 208 (4)    | 230
Total          | 40                | 326        | 366

Note: The numbers 1-4 given in brackets are the cell numbers of the above table.
Chi-Square (X²) = Σ (O - E)² / E

O = observed frequencies
E = expected frequencies

To calculate the expected frequency for each cell (1-4):

Expected Frequency (E) = (Row Total x Column Total) / Grand Total
Cell No | Observed (O) | Expected (E) = (Row Total x Column Total) / Grand Total | O - E | (O - E)² | (O - E)²/E
1 | 18  | 136 x 40 / 366 = 14.9   | +3.1 | 9.61 | 0.64
2 | 118 | 136 x 326 / 366 = 121.1 | -3.1 | 9.61 | 0.08
3 | 22  | 230 x 40 / 366 = 25.1   | -3.1 | 9.61 | 0.38
4 | 208 | 230 x 326 / 366 = 204.9 | +3.1 | 9.61 | 0.05

Σ (O - E)²/E = 1.15

Our computed X² = 1.15


Now we have to refer to the X² table. Like the t table, the X² table also has degrees
of freedom (DF). To calculate the degrees of freedom we multiply the number of rows
minus 1 by the number of columns minus 1:

DF = (c - 1)(r - 1)

In our case there are two columns and two rows (excluding captions and totals):
DF = (2 - 1)(2 - 1) = (1)(1) = 1



At 1 DF and 0.05 (5% significance level), the table X² = 3.84. Our computed X² = 1.15,
which is less than 3.84. Hence we can say, at the 5% significance level, that there is
no statistically significant difference between group 1 and group 2; we conclude that
the vaccine had no different effect on the children compared with those who were not
vaccinated against measles.
(We accept Null Hypothesis)
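A minimal sketch (assuming Python 3) of the 2x2 chi-square computation, with the
observed vaccine counts from the table above; the helper name chi_square_2x2 is ours:

def chi_square_2x2(table):
    # Chi-square statistic for a 2x2 table of observed frequencies.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (obs - expected) ** 2 / expected
    return chi2

print(round(chi_square_2x2([[18, 118], [22, 208]]), 2))
# 1.18 unrounded (1.15 above, where E was rounded to one decimal first);
# either way it is below the table value of 3.84 at 1 DF and 0.05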



HYPOTHESIS TESTING

We deal with two hypotheses which are:


1. Null Hypothesis
2. Alternative Hypothesis

We either accept Null Hypothesis or reject it. When we accept Null Hypothesis, we reject
the Alternative Hypothesis. When we reject Null Hypothesis we accept the Alternative
Hypothesis.

STEPS:
1. State the null and alternative hypotheses, Ho and H1.
2. Select the decision criterion α (or “level of significance”).
3. Establish the critical values
4. Draw a random sample from the population, and calculate the mean of that
sample
5. Select appropriate statistical test and compute the value of the test statistic Z or t
or X2 (as the case may be).
6. Compare the calculated value of test statistic with the critical values of Z/t/X2, and
then accept or reject the null hypothesis.

Example: A forensic specialist collects data at random on medicolegal cases of
injuries, by the kind of weapon used, in a district over a period of one year.

The data are given as under (counts as used in the computation below):

Type of Weapon Used | Number of injured persons
Sharp      | 184
Blunt      | 168
Firearm    | 123
Corrosives | 33
Total      | 508

We follow the steps given above:

1. State the null and alternative hypotheses, Ho and H1.

Null Hypothesis: There is no difference between the types of weapons used in causing
injuries.

Alternative Hypothesis: There is a preference for certain weapons in inflicting
injuries.

2. Select the decision criterion α (or "level of significance"). We select a 5%
significance level (p = 0.05). Conventionally a 5% level of significance (p = 0.05) is
selected. It can be more stringent, i.e., less than 5% (p < 0.05), but it is never more
than 5%.



3. Establish the critical values: from the X² table at p = 0.05, with degrees of
freedom as calculated below.

4. Draw a random sample from the population: the sample was randomly drawn from a
district.

5. Select the appropriate statistical test and compute the value of the test statistic
Z, t or X² (as the case may be).

We select the X² test, as the data is not continuous, and do the computations as under
(one-way chi-square test):

Type of Weapon Used (Categories) | Observed Frequencies (O) | Expected Frequencies (E) | O - E | (O - E)² | (O - E)²/E
Sharp      | 184 | 127 | +57 | 3249 | 3249/127 = 25.58
Blunt      | 168 | 127 | +41 | 1681 | 1681/127 = 13.23
Firearm    | 123 | 127 | -4  | 16   | 16/127 = 0.125
Corrosives | 33  | 127 | -94 | 8836 | 8836/127 = 69.57
Total      | 508 | 508 |     |      | Σ (O - E)²/E = 108.5

Expected frequencies (E) are calculated by dividing the total frequency by the number
of categories. The number of categories is 4 and the total is 508, so all expected
frequencies equal 127.

6. Compare the calculated value of test statistic with the critical values of Z/t/X2,
and then accept or reject the null hypothesis.

Our calculated X² is equal to 108.5. The degrees of freedom in this case are equal to
the number of categories minus one; there are four categories of weapons, therefore
DF = 4 - 1 = 3. At 3 DF and 0.05, the table X² is 7.81. As our calculated value is more
than the table value, the difference among the weapons used in causing injuries is
statistically significant and cannot be due to chance alone. Therefore we reject the
Null Hypothesis and accept the Alternative Hypothesis.
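A minimal sketch (assuming Python 3) of the one-way (goodness-of-fit) chi-square just
computed; under the null hypothesis every category expects 508 / 4 = 127 cases:

observed = {"Sharp": 184, "Blunt": 168, "Firearm": 123, "Corrosives": 33}

expected = sum(observed.values()) / len(observed)  # 508 / 4 = 127 per category
chi2 = sum((o - expected) ** 2 / expected for o in observed.values())
print(round(chi2, 1))  # 108.5, far above 7.81 (3 DF, 0.05): significant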

While testing a hypothesis we may be liable to commit errors, which are:

Type I Error: Rejecting a true hypothesis (α)

Type II Error: Accepting a false hypothesis. (β)



                                 ACTUAL SITUATION
TEST RESULT  | Ho True                           | Ho False
Ho Accepted  | Correct                           | Type II error (β) – false negative
Ho Rejected  | Type I error (α) – false positive | Correct

To avoid a Type I error we may decrease our significance level, but that increases the
chance of committing a Type II error. It is easy to avoid a Type I error, but avoiding
a Type II error is not so simple; one way is to increase the sample size and reduce
sampling variation.
STATISTICAL POWER:
The Power of a statistical test is defined as: the ability of a statistical test to reject the null
hypothesis when it is actually false and should be rejected.

p-value: We may calculate a p-value instead of doing significance testing or hypothesis
testing as above.

The p-value measures the role of chance (sampling error). If it is more than 5%, or
0.05, we attribute the observed difference to sampling error (chance); in such a
scenario we accept the Null Hypothesis (statistically not significant). If it is equal
to or less than 5%, or 0.05, we may not attribute the difference to the role of chance,
and therefore we reject the Null Hypothesis (statistically significant).

A p-value calculated by referring to the relevant statistical tables gives the exact
probability that the observed difference is due to sampling error. If p = 0.0001, it
means that the obtained sample difference occurs by chance one time out of 10,000.
Statistical packages calculate the p-value to several decimals. A stated p-value thus
gives the probability of being wrong: p = 0.05 means a chance result 5 times out of
100.



CORRELATION
The relation of two variables can be determined through correlation. The coefficient of
correlation, called Pearson's "r", shows the strength and direction of the correlation.

The value of Pearson's "r" ranges from -1 to +1. A magnitude of 0 means no correlation
and a magnitude of 1 perfect correlation; from 0 to 1 the strength varies from weak to
strong.

The direction is determined by the sign, + or -. A + means positive correlation: the
two variables (independent and dependent) are positively related, i.e., both move in
the same direction. For example, as the age of a baby increases, so does its weight.

Negative correlation means the two variables move in opposite directions. For example,
if the dose of insulin is increased, the level of blood glucose goes down.

The following example shows how to calculate the Coefficient of Correlation
(Pearson's "r").

The X variable is the independent variable (age in months of babies).
The Y variable is the dependent variable (weight in kg of the same babies).

Correlation of age in months with weight:

X | Y | x - x̄ | y - ȳ | (x - x̄)(y - ȳ) | (x - x̄)² | (y - ȳ)²
1 | 5 | 1-3 = -2 | 5-7 = -2 | (-2)(-2) = +4 | 4 | 4
2 | 6 | 2-3 = -1 | 6-7 = -1 | (-1)(-1) = +1 | 1 | 1
3 | 7 | 3-3 = 0  | 7-7 = 0  | 0             | 0 | 0
4 | 8 | 4-3 = +1 | 8-7 = +1 | (+1)(+1) = +1 | 1 | 1
5 | 9 | 5-3 = +2 | 9-7 = +2 | (+2)(+2) = +4 | 4 | 4
x̄ = 3 | ȳ = 7 | Σ(x - x̄) = 0 | Σ(y - ȳ) = 0 | Σ(x - x̄)(y - ȳ) = +10 | Σ(x - x̄)² = 10 | Σ(y - ȳ)² = 10

Pearson's "r" = Σ(x - x̄)(y - ȳ) / √( Σ(x - x̄)² x Σ(y - ȳ)² )
             = +10 / √(10 x 10) = +10 / 10 = +1
If "r" is equal to +1, it is a perfect and positive correlation. Diagrammatically it
can be presented in scatter-diagram form.



The line is called the line of best fit. If it moves toward the right and upward, it
shows positive correlation. (Scatter diagrams for negative correlation and for no
correlation are shown in the original notes.)
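A minimal sketch (assuming Python 3.10 or later, whose statistics module includes
correlation) of Pearson's "r" for the age/weight data above:

import statistics

ages = [1, 2, 3, 4, 5]     # X: age in months
weights = [5, 6, 7, 8, 9]  # Y: weight in kg

r = statistics.correlation(ages, weights)
print(r)       # 1.0, a perfect positive correlation
print(r ** 2)  # its square is the coefficient of determination (next section)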

EXERCISE:

The following data were collected on the dose of Methyl dopa per 24 hours and the
change in systolic BP. Calculate Pearson's "r" and interpret it through a diagram.

Dose of MD | Change in systolic BP
50  | +10
100 | 0
150 | -10
200 | -20
250 | -20

Coefficient of Determination (CoD)

CoD = r²

It is the square of Pearson's "r".

If "r" = 0.6, then CoD = r² = (0.6)² = 0.36.

This means that 36% of the change in the 'Y' variable may be attributed to the 'X'
variable.

