2 Biostatistics Lecture Notes Part Two


Introduction to Biostatistics

Wondimu Ayele(Msc, PhD fellow )


SP, AAU
January 2019
Introduction to Probability

Biostatistics -Notes WA , SPH AAU ,2019


Objective

• To provide an understanding of probability and its applications
• To calculate probabilities using frequency distributions
• To explain probability distributions and set the ground for the development of statistical inference


Introduction to sets
• A set is a collection of objects. Sets are usually designated by capital letters A, B, ... etc.
Example: A = {a, b, c, d}. Here "a" is a member of set A, denoted a ∈ A.
• Universal set (U): the set of all objects under consideration.
• Empty/null set (∅): a set that contains no members.
• Given two sets A and B, if being a member of A implies being a member of B, then A is a subset of B, denoted A ⊆ B.


Introduction to sets
• Two sets A and B are equal if A and B have the same members.
• If C = A ∪ B, set C is the union of A and B and contains the elements which are in A or in B or in both.
• If D = A ∩ B, set D is the intersection of A and B and consists of the elements which are in both A and B.
• Example: A = {1, 2, 3, 4, 5}, B = {a, b, 1, 2, 5, c, 6}
• A ∪ B = {1, 2, 3, 4, 5, 6, a, b, c}
• A ∩ B = {1, 2, 5}
Basic characteristics of Set
1. A ∪ ∅ = A; A ∩ ∅ = ∅; A ∪ U = U; A ∩ U = A
2. A ∪ A = A; A ∩ A = A
3. A ∪ B = B ∪ A; A ∩ B = B ∩ A
4. (A ∪ B) ∪ C = A ∪ (B ∪ C); (A ∩ B) ∩ C = A ∩ (B ∩ C)
5. A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C); A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
6. (Ac)c = A
7. (A ∪ B)c = Ac ∩ Bc; (A ∩ B)c = Ac ∪ Bc
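The set operations and De Morgan's laws can be checked directly on the example sets A and B from the notes. This is a minimal sketch using Python's built-in set type; the universal set U here is a hypothetical choice made only so that complements are defined.

```python
# Sets from the worked example in the notes
A = {1, 2, 3, 4, 5}
B = {"a", "b", 1, 2, 5, "c", 6}

# Hypothetical universal set (an assumption, not given in the notes)
U = A | B | {7, 8}

union = A | B          # elements in A or in B or in both
intersection = A & B   # elements in both A and B


def complement(s, universe=U):
    """Return the complement of s relative to the universal set."""
    return universe - s


# De Morgan's laws (property 7)
de_morgan_union = complement(A | B) == complement(A) & complement(B)
de_morgan_intersection = complement(A & B) == complement(A) | complement(B)

print(union, intersection, de_morgan_union, de_morgan_intersection)
```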
Probability
• Probability is the language of chance. The deliberate use of
chance is the central idea of statistical designs for producing data.

• Probability provides the tools needed to capture the uncertain state of our knowledge.
• A probabilistic experiment is any process that produces outcomes which are not predictable in advance.


Probability
• Probabilities are used in everyday communication.
– A patient has a 50-50 chance of surviving a certain operation.
– The chance that a 30-year-old woman will celebrate her 70th birthday is 30%.
• Because medicine is an inexact science, physicians seldom can predict an outcome with absolute certainty.
• Example 1
• To formulate a diagnosis, a physician must rely on available diagnostic information about a patient:
– History and physical examination
– Laboratory studies, X-ray findings, ECG, etc.
Probability
• No test result is absolutely accurate, but each result does affect the probability of the presence (or absence) of a disease.
Example 2
– We may hear a physician say that a patient has a 50-50 chance of surviving a certain operation.
– Another physician may say that she is 95 percent certain that a patient has a particular disease.


Probability
• An understanding of probability is fundamental for quantifying the uncertainty that is inherent in the decision-making process.
• Probability theory also allows us to draw conclusions about a population of patients based on known information about a sample of patients drawn from that population.


Basic terms
• A random experiment is an experiment for which the
outcome cannot be predicted with certainty, but all
possible outcomes can be identified prior to its
performance, and it may be repeated under the same
conditions.

• We call a phenomenon random if:
– The exact outcome is not predictable in advance.
– However, there is a predictable long-term pattern that can be described by the distribution of outcomes of very many trials.
Basic terms
• Sample space is the set of all possible outcomes of a random experiment. It is denoted by S, and P(S) = 1.
• In tossing a single six-sided die once, the sample space is S = {1, 2, 3, 4, 5, 6}.
• Equally likely: A set of events is equally likely if one of them cannot be expected to happen in preference to another.
– E.g. when a fair coin is tossed, the outcome is either heads or tails, and neither is expected in preference to the other.


Basic terms
• Mutually exclusive events: events are mutually exclusive if the occurrence of one of them precludes the occurrence of all the others.
Two events A and B are mutually exclusive if they cannot occur at the same time:
P(A ∩ B) = 0
Example:
o A coin toss cannot produce heads and tails simultaneously.
o The weight of an individual cannot be classified simultaneously as "underweight", "normal" and "overweight".
Basic terms
• Independent events: Two events A and B are independent
– if the probability of the first one happening is the same no matter how the second one turns out.
– The outcome of one event has no effect on the occurrence or non-occurrence of the other.
Example:
– The outcomes of the first and second coin tosses are independent.


Basic terms
• Experiment = any process with an uncertain outcome
– When an experiment is performed, one and only one outcome is obtained.
• Event = something that may or may not happen when the experiment is performed
– An event either occurs or it does not occur.
– Events are represented by uppercase letters such as A, B and C.


Examples
1. Experiment is a blood test to determine HIV status. Possible outcomes are {HIV+} and {HIV-}.
– A1 could be the event that a test comes out positive.
– A2 could be the event that a test comes out negative.
2. Experiment is a blood test and further screening to determine HIV status (HIV+ or HIV-) and AIDS status (D+ or D-). Events are:
– {(HIV+; D+)}; {(HIV+; D-)}; {(HIV-; D+)}; {(HIV-; D-)}


3. Experiment is to record the number of people who get tested for HIV in one week at a given clinic. Suppose 500 is the maximum possible number of tests given in a week. Then any non-negative integer less than or equal to 500 is a conceivable outcome.
Events are: {0}; {1}; {2}; … ; {500}
• Note that unions and intersections of events are events.
A1 is the event that more than 100 people get tested.
A2 is the event that fewer than 220 people get tested.
A3 = A1 ∩ A2 is the event that more than 100 but fewer than 220 people get tested.
• The probability of an event A, denoted by P(A), is in general the chance that A will happen. But how do we measure the chance of occurrence, i.e., how do we determine the probability of an event?
4. Consider a box containing 100 marbles, 90 of them red and the other 10 blue.
– If the question is "What share of the marbles in the box are red?", someone who saw the box's contents would answer "90%."
– But if the question is "If I take one marble at random, do you think I would have a red one?", the answer would be "90% chance."
– The first 90% represents a proportion; the second 90% indicates a probability.


Approaches to probability
1. Subjective probability: definitions of probability as a quantitative measure of the "degree of certainty" of the observer of the experiment.
2. Classical definition: definitions that reduce the concept of probability to the more primitive notion of "equal likelihood".
3. Statistical definition: definitions that take as their point of departure the "relative frequency" of occurrence of the event in a large number of trials.
Approaches to probability
1. Subjective probability: measures the confidence that a particular individual has in the truth of a particular proposition.
– E.g. if someone says that he is 95% certain that a cure for AIDS will be discovered within 5 years, then he means that Pr(discovery of a cure for AIDS within 5 years) = 95%.
• Although the subjective view of probability has enjoyed increased attention over the years, it has not been fully accepted by scientists.


Approaches to probability
2. The classical definition of probability:
The probability P(A) of an event A is equal to the number of possible simple events (outcomes) favorable to A divided by the total number of possible simple events of the experiment:
P(A) = m / N
where m = number of simple events into which the event A can be decomposed, and N = total number of possible simple events of the experiment.
Example 1. Consider the experiment of tossing a balanced coin. P(H) = P(T) = 1/2.


Example 2. Consider the experiment of tossing a balanced die. Let Dk be the event that k dots (k = 1, 2, 3, 4, 5, 6) are observed on the upper face of the die. Therefore, P(Dk) = 1/6 (k = 1, 2, 3, 4, 5, 6).
– Let Dodd be the event that an odd number of dots is observed, and Deven the event that an even number of dots is observed.
– We have P(Dodd) = 3/6 = 1/2 and P(Deven) = 3/6 = 1/2.
– Let A be the event that fewer than 6 dots are observed; then P(A) = 5/6.
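The classical definition P(A) = m / N amounts to counting outcomes. The sketch below applies it to the die example, assuming all six faces are equally likely; the function name is illustrative only.

```python
from fractions import Fraction

# Sample space for one toss of a balanced six-sided die
sample_space = {1, 2, 3, 4, 5, 6}


def classical_probability(event, space=sample_space):
    """P(A) = m / N, valid only when all outcomes are equally likely."""
    favourable = event & space          # the m outcomes favourable to A
    return Fraction(len(favourable), len(space))


p_odd = classical_probability({1, 3, 5})
p_even = classical_probability({2, 4, 6})
p_less_than_6 = classical_probability({k for k in sample_space if k < 6})

print(p_odd, p_even, p_less_than_6)  # 1/2 1/2 5/6
```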
Approaches to probability
3. The statistical/relative frequency probability:
The absolute frequency f(A) of an event A in n trials is the number of times A occurs, and the relative frequency of A in these trials is:
P(A) = f(A) / n
Example 1. Suppose that of 158 people who attended a dinner party, 99 were ill due to food poisoning. The probability of illness for a person selected at random is
Pr(illness) = 99/158 = 0.63 or 63%


Example 2. The records of a certain health center showed that out of 10000 smokers, 2940 developed lung cancer. If one smoker is randomly selected from this group, what is the probability that he will develop lung cancer?
Let L = the smoker develops lung cancer.
P(L) = 2940/10000 = 0.294
Note: We will adopt the relative frequency interpretation of probability, which says that the probability that an event A occurs is equal to the proportion of the time that A occurs if we repeat the random experiment again and again to infinity:
P(A) = lim (n → ∞) f(A) / n
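The relative frequency interpretation can be illustrated by simulation. The sketch below repeats a die toss many times and tracks the proportion of odd results; the seed and the trial counts are arbitrary choices for illustration, and the two observed proportions come from the examples above.

```python
import random


def relative_frequency(event, n_trials, seed=2019):
    """Estimate P(event) as f(A) / n from n simulated tosses of a fair die."""
    rng = random.Random(seed)  # fixed seed only to make the run repeatable
    hits = sum(1 for _ in range(n_trials) if rng.randint(1, 6) in event)
    return hits / n_trials


odd = {1, 3, 5}
# The estimate should settle near 1/2 as the number of trials grows
estimates = {n: relative_frequency(odd, n) for n in (100, 10_000, 200_000)}

# Relative frequencies from the two examples in the notes
p_illness = 99 / 158
p_lung_cancer = 2940 / 10000

print(estimates, round(p_illness, 2), p_lung_cancer)
```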


Properties of probability
• The mathematical development of probability starts with
three basic rules or axioms:
1. The numerical value of a probability always lies between 0 and 1, inclusive: 0 ≤ P(E) ≤ 1
– A value of 0 means the event cannot occur.
– A value of 1 means the event definitely will occur.
– A value of 0.5 means that the probability that the event will occur is the same as the probability that it will not occur.


Properties of probability
2. The sum of the probabilities of all mutually exclusive and exhaustive outcomes is equal to 1.
– P(E1) + P(E2) + .... + P(En) = 1.
3. For any two events A and B, P(A or B) is:
– P(A or B) = P(A) + P(B) - P(A and B) (Addition rule)
– For two mutually exclusive events A and B, P(A or B) = P(A) + P(B).


Properties of probability

4. For any two independent events A and B:
P(A and B) = P(A) P(B) (Multiplication rule)
5. The complement of an event A, denoted by Ā or Ac, is the event that A does not occur; then P(Ac) = 1 - P(A) (complementary events).


Basic Probability Rules
1. Addition rule
A. If events A and B are mutually exclusive:
– P(A or B) = P(A) + P(B)
– P(A and B) = 0
If not mutually exclusive:
– P(A or B) = P(A) + P(B) - P(A and B)
– P(A or B) = P(event A occurs or event B occurs or they both occur)


Example: The probabilities below represent years of schooling completed by mothers of newborn infants.
1. What is the probability that a mother has completed < 12 years of schooling?
2. What is the probability that a mother has completed 12 or more years of schooling?


Class work
The probability that at least three individuals
among the five develop hepatitis B is



Basic Probability Rules
• What is the probability that a mother has completed < 12 years of schooling?
P(≤ 8 years) = 0.056 and P(9-11 years) = 0.159
• Since these two events are mutually exclusive,
P(≤ 8 or 9-11) = P(≤ 8 ∪ 9-11)
= P(≤ 8) + P(9-11) = 0.056 + 0.159
= 0.215
• What is the probability that a mother has completed 12 or more years of schooling?
P(≥ 12) = P(12 or 13-15 or ≥ 16) = P(12 ∪ 13-15 ∪ ≥ 16)
= P(12) + P(13-15) + P(≥ 16)
= 0.321 + 0.218 + 0.230
= 0.769
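The two answers above are direct applications of the addition rule for mutually exclusive events. A minimal sketch follows, using only the five category probabilities quoted in the worked solution; the category labels and function names are illustrative.

```python
# Category probabilities quoted in the worked solution
schooling = {"<=8": 0.056, "9-11": 0.159, "12": 0.321, "13-15": 0.218, ">=16": 0.230}


def prob_of_union(categories, dist=schooling):
    """Addition rule for mutually exclusive events: add the probabilities."""
    return sum(dist[c] for c in categories)


def prob_a_or_b(p_a, p_b, p_a_and_b=0.0):
    """General addition rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b


p_less_than_12 = prob_of_union(["<=8", "9-11"])
p_12_or_more = prob_of_union(["12", "13-15", ">=16"])

print(round(p_less_than_12, 3), round(p_12_or_more, 3))  # 0.215 0.769
```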
Basic Probability Rules
B. If A and B are not mutually exclusive events, then subtract the overlap:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)


Basic Probability Rules
2. Multiplication rule
• If A and B are independent events, then
P(A ∩ B) = P(A) × P(B)
• More generally, whether or not A and B are independent,
P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B)
P(A and B) denotes the probability that A and B both occur at the same time.


Conditional Probability
• Refers to the probability of an event, given that another event is known to have occurred.
• "What happened first is assumed."
• Hint: When thinking about conditional probabilities, think in stages. Think of the two events A and B occurring chronologically, one after the other, either in time or space.
• Conditional probabilities are probabilities based on the knowledge that some event has occurred.


Conditional Probability
• Conditional probabilities are denoted by P(B|A) (also written P(B/A)), i.e. P(Event | conditioning event).
• The formula for calculating a sample conditional probability is:
P(B|A) = (number of observations for which A and B both occur) / (number of observations for which A occurs)


Conditional Probability
The conditional probability that event B has occurred given that event A has already occurred is denoted P(B|A) and is defined as
P(B|A) = P(A ∩ B) / P(A),
provided that P(A) ≠ 0.


Conditional Probability
Example 1.
Table 1. A study investigating the effect of prolonged exposure to bright light on retina damage in premature infants.

                 Retinopathy   Retinopathy   TOTAL
                     YES           NO
Bright light          18             3         21
Reduced light         21            18         39
TOTAL                 39            21         60


Conditional Probability
• We want to know whether the probability of retinopathy for the bright-light infants differs from the probability of retinopathy for the reduced-light infants. Both are conditional probabilities.
• We want to compare the probability of retinopathy, given that the infant was exposed to bright light, with the probability given that the infant was exposed to reduced light.
• Exposure to bright light and exposure to reduced light are conditioning events, events we want to take into account when calculating conditional probabilities.
Conditional Probability
• For the retinopathy data, the conditional probability of retinopathy, given exposure to bright light, is
P(Retinopathy | exposure to bright light)
= (No. of infants with retinopathy exposed to bright light) / (No. of infants exposed to bright light)
= 18/21 = 0.86
• P(Retinopathy | exposure to reduced light)
= (No. of infants with retinopathy exposed to reduced light) / (No. of infants exposed to reduced light)
= 21/39 = 0.54
• The conditional probabilities suggest that premature infants exposed to bright light have a higher risk of retinopathy than premature infants exposed to reduced light.
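The same calculation can be written as a small function over the counts in Table 1. This is only a sketch; the dictionary layout and the function name are illustrative choices.

```python
# Counts from Table 1: (exposure, retinopathy) -> number of infants
counts = {
    ("bright", "yes"): 18, ("bright", "no"): 3,
    ("reduced", "yes"): 21, ("reduced", "no"): 18,
}


def conditional_probability(outcome, exposure, table=counts):
    """Sample P(outcome | exposure) = joint count / count of the conditioning event."""
    joint = table[(exposure, outcome)]
    marginal = sum(n for (exp, _), n in table.items() if exp == exposure)
    return joint / marginal


p_bright = conditional_probability("yes", "bright")
p_reduced = conditional_probability("yes", "reduced")

print(round(p_bright, 2), round(p_reduced, 2))  # 0.86 0.54
```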
Class work
Table 2 shows the frequency of cocaine use by gender among adult cocaine users.
----------------------------------------------------
Life time frequency     Male   Female   Total
of cocaine use
----------------------------------------------------
1-19 times               32       7       39
20-99 times              18      20       38
more than 100 times      25       9       34
----------------------------------------------------
Total                    75      36      111
----------------------------------------------------
1. What is the probability that a randomly picked person is male?
2. What is the probability that a randomly picked person has used cocaine more than 100 times?
3. Given that the selected person is male, what is the probability that he has used cocaine more than 100 times?
4. Given that the person has used cocaine less than 100 times, what is the probability of being female?
5. What is the probability that a randomly picked person is male and has used cocaine more than 100 times?
Conditional Probability
1. For independent events A and B,
P(A|B) = P(A).
2. For any events A and B, including non-independent ones,
P(A and B) = P(A|B) P(B) (General Multiplication Rule)
3. Bayes' theorem:
P(A|B) = P(B|A) P(A) / P(B)


Conditional Probability
Home work
From a city population, the probability of selecting a male or a smoker is 7/10, a male smoker is 2/5, and a male, if a smoker is already selected, is 2/3. Find the probability of selecting (a) a non-smoker, (b) a male, and (c) a smoker, if a male is first selected.
Let A: a male is selected
B: a smoker is selected. We are given
P(A ∪ B) = 7/10, P(A ∩ B) = 2/5, P(A|B) = 2/3
(a) The probability of selecting a non-smoker is
P(Bc) = 1 – P(B) = 1 – P(A ∩ B) / P(A|B)   [since P(A|B) = P(A ∩ B) / P(B)]
= 1 – (2/5)/(2/3) = 1 – 3/5 = 2/5
(b) The probability of selecting a male (by the addition theorem) is:
P(A) = P(A ∪ B) + P(A ∩ B) – P(B)
= (7/10) + (2/5) – (3/5) = 1/2
Class work
(c) Find the probability of selecting a smoker if a male is first selected, P(B|A).
Home work
1. Consider the experiment of tossing a fair die and define the following events:
A = {Observe an even number of dots}
B = {Observe a number of dots less than or equal to 4}.
Are events A and B independent?
2. Suppose that three programmers are designing computer code for a project: Mr. A has designed 60% of the code, Mr. B 30% and Mr. C 10%. Suppose further that Mr. A has a bug in 3% of his work, Mr. B in 7% of his work, and Mr. C in 5% of his.
A. What percentage of the code written has a bug?
B. Given that you find a bug in a line of code, who is most likely to have written it? Who is least likely?
C. How does the ordering compare to the unconditional probabilities and why does this relationship make sense?
Bayes' Theorem
• In the health sciences, a widely used application of probability laws and concepts is the evaluation of screening tests and diagnostic criteria.
• Of interest to clinicians is an enhanced ability to correctly predict the presence or absence of a particular disease from knowledge of test results (positive or negative) and/or the status of presenting symptoms (present or absent).


Bayes' Theorem
• Also of interest is information regarding the likelihood of positive and negative test results and the likelihood of the presence or absence of a particular symptom in patients with and without a particular disease.
• In our consideration of screening tests, we must be aware of the fact that they are not always perfect. That is, a testing procedure may yield a false positive or a false negative.


Bayes Theorem
Total probability
If the event B may occur together with one and only one of n mutually exclusive events A1, A2, ..., An, then
P(B) = P(A1)P(B|A1) + P(A2)P(B|A2) + ... + P(An)P(B|An).

Bayes' Formula
If the event B may occur together with one and only one of n mutually exclusive events A1, A2, ..., An, then
P(Ak|B) = P(Ak)P(B|Ak) / P(B) = P(Ak)P(B|Ak) / Σ_{j=1}^{n} P(Aj)P(B|Aj)
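The total probability rule and Bayes' formula translate almost line by line into code. In the sketch below the priors and likelihoods are hypothetical numbers chosen for illustration (A1 = diseased, A2 = not diseased, B = positive test); they are not taken from the notes.

```python
def total_probability(priors, likelihoods):
    """P(B) = sum over j of P(Aj) * P(B|Aj), for mutually exclusive, exhaustive Aj."""
    return sum(p * l for p, l in zip(priors, likelihoods))


def bayes_posterior(priors, likelihoods):
    """Return the list of P(Ak|B) = P(Ak) P(B|Ak) / P(B) for every k."""
    p_b = total_probability(priors, likelihoods)
    return [p * l / p_b for p, l in zip(priors, likelihoods)]


# Hypothetical inputs (assumptions for illustration only)
priors = [0.10, 0.90]        # P(A1), P(A2)
likelihoods = [0.90, 0.05]   # P(B|A1), P(B|A2)

p_b = total_probability(priors, likelihoods)
posterior = bayes_posterior(priors, likelihoods)

print(round(p_b, 3), [round(p, 3) for p in posterior])
```

With these inputs a positive test raises the probability of disease from the prior of 0.10 to a posterior of about 0.667.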


Sensitivity and Specificity
• Data for assessing the sensitivity and specificity of a test are usually of the form

                     Disease category
Test result    Diseased (+)   Nondiseased (-)   Total
+                   A               B            A+B
-                   C               D            C+D
Total              A+C             B+D         A+B+C+D

Sensitivity: the proportion of diseased people who would be correctly classified, estimated by Sens = A/(A + C).
Specificity: the proportion of non-diseased people who would be correctly classified, estimated by Spec = D/(B + D).
Sensitivity and Specificity
• The prevalence of a disease is the percent of the population
with the disease estimated by R = (A + C)/(A + B + C + D).
Note that a random sample is required to estimate prevalence.
• Positive Predictive Value: is the proportion of people who
tested positive that truly are positive.
estimated by PPV =A/(A + B).
• Negative Predictive Value: is the proportion of people who
tested negative that truly are negative.
estimated by NPV =D/(C + D).
• False Negative: The probability of a false negative is the
probability of testing negative given a truly positive condition.
• False Positive: The probability of a false positive is the
probability of testing positive given a truly negative condition.
Example1
Data for assessing the sensitivity and specificity of a test:

                     Disease category
Test result    Diseased (+)   Nondiseased (-)   Total
+                 10000            5000          15000
-                  1000           84000          85000
Total             11000           89000         100000

• The estimated sensitivity is Sens = A/(A + C) = 90.9%
• The estimated specificity is Spec = D/(B + D) = 94.4%
• The estimated prevalence is R = (A + C)/(A + B + C + D) = 11.0%
• The estimated PPV is PPV = A/(A + B) = 66.7%
• The estimated NPV is NPV = D/(C + D) = 98.8%
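The five estimates above all come from the same four cell counts, so they can be computed together. A minimal sketch, with an illustrative function name and the cell labels A, B, C, D used as in the table layout of the notes:

```python
def screening_measures(a, b, c, d):
    """Screening-test measures from a 2x2 table.

    a = test+ / diseased, b = test+ / non-diseased,
    c = test- / diseased, d = test- / non-diseased.
    """
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "prevalence": (a + c) / (a + b + c + d),
        "ppv": a / (a + b),
        "npv": d / (c + d),
    }


# Counts from Example 1
measures = screening_measures(10000, 5000, 1000, 84000)
for name, value in measures.items():
    print(f"{name}: {value:.1%}")
```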
PROBABILITY DISTRIBUTION



Probability distribution
• Every random variable has a corresponding probability distribution.
• A probability distribution applies the theory of probability to describe the behavior of the random variable.
• The term probability distribution, or just distribution, refers to the way data are distributed; it is used to draw conclusions about a set of data.


Probability distribution
• A probability distribution is a listing of all the possible values that a random variable can take, along with their probabilities.
• A probability distribution of a random variable can be displayed by a table, a graph or a mathematical formula.
• A random variable is any quantity or characteristic that is able to assume a number of different values such that any particular outcome is determined by chance.
• Random variables can be either discrete or continuous.
Probability distribution
• A random variable is a function whose domain is the sample space and whose range is a set of real numbers.
Example 1. Number of HIV+ patients upon taking a single blood test to determine the status.
Example 2. Observe 100 babies to be born in a clinic. The number of boys born is a random variable. It may take values from 0 to 100.
Example 3. Select one student from a university, measure his/her height and record this height as x. Then x is a random variable, assuming values from, say, 100 cm to 250 cm depending upon each specific student.
Basic definition
• A discrete random variable is able to assume only a finite or countable number of outcomes.
• A continuous random variable can take on any value in a specified interval.
Example 1. Experiment is surgery on two people. Outcomes are {ss, sf, fs, ff}.
Example 2. Experiment is to observe the number of people who get tested for HIV in one week at a given clinic. Suppose 500 is the maximum possible number of tests given in a week. Then any non-negative integer less than or equal to 500 is a conceivable outcome.
X = number of tests in a given week.
Example 3. Experiment is to record the number of places that a person has lived in his or her lifetime. Possible outcomes are {1, 2, 3, …}.
X = number of places a person has lived.
Example 4. Experiment is to record the sex of a person. Outcomes are {m, f}.


Discrete Probability distributions
• For a discrete random variable X, a probability distribution is a function that assigns to any possible value x of X the probability P(X = x).
Two requirements for a probability distribution:
1. The sum of the probabilities of all the events in the sample space must equal 1; that is, Σ P(X = x) = 1.
2. The probability of each event in the sample space must be between 0 and 1, inclusive. That is, 0 ≤ P(X = x) ≤ 1.


Example1:
• Consider again the experiment of taking a single blood
test to determine HIV status. Let the random variable X
denote the number of positive tests.
• Then X(HIV+)=1, X(HIV-)=0
If we knew that the prevalence of HIV was 0.11, then
P(X = 1) = 0.11 and P(X = 0) = 0.89
• These two equations completely describe the probability
distribution of the discrete (dichotomous) random
variable X.



Example 2. Consider the value on the face showing up from tossing a die.
• The probability distribution of this variable is
Value on face   1     2     3     4     5     6
Probability     1/6   1/6   1/6   1/6   1/6   1/6
• Notice that the total probability is 1.


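• Example 3
The data show the number of diagnostic services a patient receives (X), with the probabilities used in the calculations that follow:
Number of services (x)   0       1       2       3       4       5
P(X = x)                 0.671   0.229   0.053   0.031   0.010   0.006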


• What is the probability that a patient receives exactly 3
diagnostic services?
P(X=3) = 0.031
• What is the probability that a patient receives at most one
diagnostic service?
P (X≤1) = P(X = 0) + P(X = 1)
= 0.671 + 0.229
= 0.900
• What is the probability that a patient receives at least four
diagnostic services?
P (X≥4) = P(X = 4) + P(X = 5)
= 0.010 + 0.006
= 0.016
Expected Value of a Discrete Random variable
• The average value assumed by a random variable is called its expected value, or the population mean.
• It is represented by E(X) or µ, where µ = Σ x·P(X = x).
Example: expected value for the diagnostic service data:
Mean (X) = 0(0.671) + 1(0.229) + 2(0.053) + 3(0.031) + 4(0.010) + 5(0.006)
= 0.498 ≈ 0.5
• We would expect an average of 0.5 services for each visit.


Variance of a Discrete Random Variable
• The variance of a random variable X is called the population variance and is represented by Var(X) or σ²:
σ² = Σ (xi − µ)² P(X = xi)
The variance for the diagnostic service data above is
σ² = Σ (xi − µ)² P(X = xi) = (0 − 0.5)²(0.671) + (1 − 0.5)²(0.229) + (2 − 0.5)²(0.053) + (3 − 0.5)²(0.031) + (4 − 0.5)²(0.010) + (5 − 0.5)²(0.006) = 0.782
Standard deviation = σ = √0.782 = 0.884
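The mean, variance and standard deviation of a discrete random variable follow directly from its probability table. The sketch below uses the diagnostic service probabilities quoted in the calculations above; it keeps the unrounded mean 0.498, so the variance comes out as 0.781996 rather than the 0.782 obtained with µ rounded to 0.5.

```python
import math

# P(X = x) for the number of diagnostic services, as used in the notes
pmf = {0: 0.671, 1: 0.229, 2: 0.053, 3: 0.031, 4: 0.010, 5: 0.006}


def expected_value(dist):
    """mu = sum of x * P(X = x)."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """sigma^2 = sum of (x - mu)^2 * P(X = x)."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())


mu = expected_value(pmf)
var = variance(pmf)
sd = math.sqrt(var)

print(round(mu, 3), round(var, 3), round(sd, 3))  # 0.498 0.782 0.884
```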


Factorials

• Given the positive integer n, the product of all the whole numbers from n down through 1 is called n factorial and is written n!.
• n! = n × (n−1) × (n−2) × … × 2 × 1 = n × (n−1)!
• By definition, 0! = 1.


Factorials
• Permutation: an ordered arrangement of objects.
• Combination: a selection of objects without regard to order.
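Factorials give the standard counting formulas for permutations and combinations: the number of ordered arrangements of r objects chosen from n is n!/(n−r)!, and the number of unordered selections is n!/(r!(n−r)!). These formulas are standard results rather than something stated in the notes. A short sketch using Python's math module (math.comb requires Python 3.8 or later):

```python
import math


def n_permutations(n, r):
    """Ordered arrangements of r objects chosen from n: n! / (n - r)!"""
    return math.factorial(n) // math.factorial(n - r)


def n_combinations(n, r):
    """Unordered selections of r objects from n: n! / (r! (n - r)!)"""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))


print(math.factorial(5))      # 120
print(n_permutations(5, 2))   # 20
print(n_combinations(5, 2))   # 10
```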


Binomial distribution
• It is one of the most widely encountered discrete distributions.
• The origin of the binomial distribution lies in Bernoulli trials.
• When a single trial of some experiment can result in only one of two mutually exclusive outcomes (success or failure; dead or alive; sick or well; male or female), the trial is called a Bernoulli trial.
Example 1.
– Let X represent smoking status; X = 1 for a smoker and X = 0 for a non-smoker. The two outcomes are mutually exclusive.
– Take the case of the USA: in 1987, 29% of the adults in the USA were smokers, therefore Pr (X = 1) = 0.29 and Pr (X = 0) = 1 - 0.29 = 0.71.


Binomial distribution
• Suppose each trial can have only two outcomes, A (success) and B (failure):
Pr (X = success) = Pr (X = 1) = p
Pr (X = failure) = Pr (X = 0) = 1 - p
• If an experiment is repeated n times and the outcome is independent from one trial to another, the probability that success occurs exactly x times, where X now denotes the number of successes in the n trials, is
Pr (X = x) = [n! / (x! (n - x)!)] p^x (1 - p)^(n - x)
where n (the number of trials) and p (the probability of success on each trial) are the parameters of the binomial distribution, and x is the number of successes. n!, read as "n factorial" or "factorial n", is the product of all integers from 1 to n inclusive. By definition 1! = 0! = 1.


Binomial distribution
Example 2
Suppose now we randomly select two individuals in the USA and record the smoking status of each.
What is the probability that
– both are non-smokers?
– one is a smoker?
– both are smokers?
If Pr (X = 1) = p and Pr (X = 0) = 1 - p, then the above can be calculated using the multiplication rule.
---------------------------------------------------------------------
Outcome of X
Person 1   Person 2   Probability                            No. of smokers
---------------------------------------------------------------------
0          0          (1 - p)(1 - p) = 0.71 × 0.71 = 0.50    0
0          1          (1 - p) p = 0.71 × 0.29 = 0.21         1
1          0          p (1 - p) = 0.29 × 0.71 = 0.21         1
1          1          p p = 0.29 × 0.29 = 0.08               2
---------------------------------------------------------------------
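The table above is the binomial distribution with n = 2 and p = 0.29, so the same numbers can be obtained from the binomial formula. A minimal sketch (the function name is illustrative):

```python
from math import comb


def binomial_pmf(x, n, p):
    """Pr(X = x) = nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 2, 0.29  # two people, 29% smokers (values from the example)
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

for x, pr in enumerate(probs):
    print(f"P({x} smokers) = {pr:.4f}")
```

The middle value, P(1 smoker) = 2 × 0.29 × 0.71, combines the two table rows in which exactly one of the two people smokes.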


Characteristics of a Binomial Distribution
1. The experiment consists of n identical trials. There are only two possible mutually exclusive outcomes, A and B, on each trial.
2. The probability of A remains the same from trial to trial. This probability is denoted by p, and the probability of B is denoted by q. Note that q = 1 - p.
3. The trials are independent.
4. The binomial random variable X is the number of A's in n trials. n and p are the parameters of the binomial distribution.
5. The mean is np and the variance is np(1 - p).
• The general form of the binomial pmf is given by
b(x; n, p) = nCx p^x q^(n−x), where q = 1 − p,
and its cumulative distribution function (cdf) is given by
F(x) = B(x; n, p) = Σ_{i=0}^{x} b(i; n, p) = Σ_{i=0}^{x} nCi p^i q^(n−i)
• It is important to observe that the binomial random variable X is the sum of n independent Bernoulli random variables Xi, i.e., X = X1 + X2 + ... + Xn, where Xi represents the Bernoulli rv at the ith trial, whose value is equal to 0 or 1 (0 for failure and 1 for success), so that the range of X is Rx = {0, 1, 2, ..., n}.
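The cdf is just a running sum of pmf values, and the mean np and variance np(1 − p) can be checked numerically from the pmf. In the sketch below, n = 10 and p = 0.29 are hypothetical values chosen for illustration (a sample of ten adults with the smoking prevalence used earlier).

```python
from math import comb


def binomial_pmf(x, n, p):
    """b(x; n, p) = nCx * p^x * q^(n - x), with q = 1 - p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_cdf(x, n, p):
    """B(x; n, p) = sum of b(i; n, p) for i = 0, ..., x."""
    return sum(binomial_pmf(i, n, p) for i in range(x + 1))


n, p = 10, 0.29  # hypothetical parameters, for illustration only

# Mean and variance computed from the pmf, to compare with np and np(1 - p)
mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
var = sum((x - mean) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))

print(round(binomial_cdf(2, n, p), 4), round(mean, 3), round(var, 3))
```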


 Class work 1
1. Each child born to a particular set of parents has a probability
of 0.25 of having blood type O. If these parents have 5
children. What is the probability that
a. Exactly two of them have blood type O
b. At most 2 have blood type O
c. At least 4 have blood type O
d. 2 do not have blood type O.



Class work 2
2. Suppose you take a sample of N independent biologists to determine how many of them use valid statistical methods.
• In particular, you have a sample of N independent, identically distributed random variables Yi, with p = P(Yi = 1).
• What is the distribution of the number of successes Y = Σ_{i=1}^{N} Yi in N trials? Y ~ Bin(N, p)
• Calculate the probability that 0 out of 10 biologists use valid statistical methods when the probability of using valid statistical methods is 0.8.


The Poisson distribution
• Discrete probability distribution is used to model the
number of occurrences of an event that takes place
infrequently in time or space
• Applicable for counts of events over a given interval of
time, for example:
– number of patients arriving at an emergency
department in a day
– number of new cases of HIV diagnosed at a clinic in a
month
– Daily number of new cases of breast cancer notified
to a cancer registry
– Number of abnormal cells in a fixed area of
histological slides from a series of liver biopsies
The Poisson distribution
• The theoretical situation giving rise to data of this type is
easier to describe in relation to events occurring over
time (or space) at a fixed rate on average, but where each
event occurs independently and at random.
• Such data will have a Poisson distribution
• Suppose events happen randomly and independently in
time at a constant rate. If events happen at a rate of λ
events per unit time, the probability of x events
happening in unit time is:
$$P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}$$
• where x = 0, 1, 2, . . . is a potential outcome of X
• t is the time segment of interest (for an interval of length t,
λ is replaced by λt)
• The constant λ (lambda) represents the rate at which
the event occurs, or the expected number of events
per unit time
• e = 2.71828
• It depends on just one parameter, λ



Three assumptions of Poisson distribution
1. The probability that a single event occurs within a
given small subinterval is proportional to the
length of the subinterval
2. The rate at which the event occurs is constant over
the entire interval t
3. Events occurring in consecutive subintervals are
independent of each other



Example
Example1
The daily number of new registrations of cancer is 2.2 on average.
• What is the probability of
a) Getting no new cases
b) Getting 1 case
c) Getting 2 cases
d) Getting 3 cases
e) Getting 4 cases
Solution
• a) P(X=0) = 0.111
• b) P(X=1) = 0.244
• c) P(X=2) = 0.268
• d) P(X=3) = 0.197
• e) P(X=4) = 0.108
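The solution values can be reproduced from the Poisson pmf, P(X = x) = e^(-λ) λ^x / x!, with λ = 2.2. This is a small sketch; the function name `poisson_pmf` is an illustrative choice.

```python
from math import exp, factorial

def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x) for a Poisson variable with mean lam."""
    return exp(-lam) * lam**x / factorial(x)

lam = 2.2  # average daily number of new cancer registrations
probs = [round(poisson_pmf(x, lam), 3) for x in range(5)]
print(probs)  # probabilities of 0, 1, 2, 3 and 4 cases, to three decimals
```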
The Poisson distribution
• Characteristics;
• The Poisson distribution is very asymmetric when its mean
is small
• With large means it becomes nearly symmetric
• It has no theoretical maximum value, but the probabilities
tail off towards zero very quickly
• λ is the parameter of the Poisson distribution
• The mean is λ and the variance is also λ.



Probability distribution of continuous variables
• Under different circumstances, the outcome of a random
variable may not be limited to categories or counts.
Example 1
– Suppose, X represents the continuous variable
‘Height’; rarely is an individual exactly equal to 170cm
tall
– X can assume an infinite number of intermediate
values 170.1, 170.2, 170.3 etc.

• Because a continuous random variable X can take on an
uncountably infinite number of values, the probability
associated with any one particular value is zero.



Probability distribution of continuous variables
• However, the probability that X will assume
some value in the interval bounded by two
values, say x1 and x2, is greater than zero and is
given by
$$P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} f(x)\,dx$$
where f(x) is the probability density function of X.
• As a continuous variable can take an infinite
number of values, it helps to visualize the
probability distribution as a curve and
probabilities as ‘area under the curve’.
• The most important continuous distribution is the normal distribution.



Normal Distribution
• The Normal Distribution is by far the most important
probability distribution in statistics.
• It is also sometimes known as the Gaussian distribution,
after the mathematician Gauss.
• The distributions of many medical measurements in
populations follow a normal distribution (e.g. serum uric
acid levels, cholesterol levels, blood pressure, height and
weight)
• The normal distribution is a theoretical, continuous
probability distribution whose equation is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}} \quad \text{for } -\infty < x < +\infty$$


Normal Distribution
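• The probability that a normally distributed variable X falls
in any given interval between a and b is:
$$P(a \le X \le b) = \int_{a}^{b} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\,dx$$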



Characteristics of the Normal Distribution
1. It is a probability distribution of a continuous variable. It
extends from minus infinity( -∞) to plus infinity (+∞).

2. It is unimodal, bell-shaped and symmetrical about x = μ.

3. It is determined by two parameters: the mean μ
(read as ‘mu’) and the standard deviation σ (read as ‘sigma’).
– Changing μ alone shifts the entire normal curve to the left or
right.
– Changing σ alone changes the degree to which the distribution
is spread out.
– The mean μ can be any number (negative, positive or zero).
– The standard deviation σ must be a positive number.
Characteristics of the Normal Distribution
4. The height of the frequency curve, which is called the
probability density, cannot be taken as the probability of a
particular value.
– This is because for a continuous variable there are infinitely
many possible values so that the probability of any specific
value is zero.
5. An observation from a normal distribution can be related to a
standard normal distribution: (SND) which has a published
table.
– Thus an observation x from a normal distribution with
mean μ and standard deviation σ can be related to a
Standard normal distribution by calculating :
SND = Z = (x - μ ) / σ
6. Perpendiculars drawn at fixed distances from the mean enclose
fixed proportions of the area under the curve:
– μ ± 1 SD contains about 68%;
– μ ± 2 SD contains about 95%;
– μ ± 3 SD contains about 99.7%

7. The distribution is completely determined by
the parameters μ and σ.
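The 68%, 95% and 99.7% figures can be checked numerically. The sketch below computes the standard normal cdf from the error function in Python's `math` module; the helper names `phi` and `area_within` are illustrative.

```python
from math import erf, sqrt

def phi(z: float) -> float:
    """Standard normal cdf, P(Z <= z), via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def area_within(k: float) -> float:
    """Area under the normal curve within k standard deviations of the mean."""
    return phi(k) - phi(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

The same helper also shows why 1.96 is used for "95%": the area within 2 SD is slightly more than 0.95, while the area within 1.96 SD is 0.95 to two decimals.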
Normal curve



Normal probability
• Normal curve area for Z value of 1.95 in the table



Normal Probability distribution
• Area under any Normal curve
• To find the area under a normal curve (with mean μ and
standard deviation σ) between x = a and x = b, find the Z-scores
corresponding to a and b (call them Z1 and Z2) and then find the
area under the standard normal curve between Z1 and Z2 from
the published table.
Z-scores
• E.g. assume a distribution has a mean of 70 and a standard
deviation of 10.
• How many standard deviation units above the mean is a score of
80? Z = (80 - 70) / 10 = 1
• How many standard deviation units above the mean is a score of
83? Z = (83 - 70) / 10 = 1.3
• The number of standard deviation units is called a Z-score or
Z value.
Standard Normal Distribution
• It is a normal distribution that has a mean equal to 0 and a
SD equal to 1, and is denoted by N(0, 1).
• The main idea is to standardize all the data that is given by
using Z-scores.
• These Z-scores can then be used to find the area (and thus
the probability) under the normal curve.



Z - Transformation
• If a random variable X ~ N(μ, σ²), then we can transform it to
a SND with the help of the Z transformation:
$$Z = \frac{X - \mu}{\sigma}$$
• Z represents the Z-score for a given x value
• This process is known as standardization and gives the
position on a normal curve with μ = 0 and σ = 1, i.e., the SND, Z.
• A Z-score is the number of standard deviations that a
given x value is above or below the mean.



Probability distribution
• In general, Z = (raw score - population mean) / population SD
= (x - μ) / σ
• In the above population, what Z-score corresponds to a raw
score of 68? Z = (68 - 70) / 10 = -0.2
• Z-scores are important because, given a Z value, we can find
the probability of obtaining a score this large or larger (or this low
or lower).
• For example, P(-1 < Z < +1) = 0.6827; P(-1.96 < Z < +1.96) = 0.95 and
P(-2.576 < Z < +2.576) = 0.99.
• From the symmetry of the standard normal distribution,
P(Z ≤ -x) = P(Z ≥ x) = 1 - P(Z ≤ x)



Example
• Example1: Suppose a borderline hypertensive is defined as a person
whose DBP is between 90 and 95 mm Hg inclusive, and the subjects
are 35-44-year-old males whose BP is normally distributed with
mean 80 and variance 144. What is the probability that a randomly
selected person from this population will be a borderline
hypertensive?
Solution: Let X be DBP, X ~ N(80, 144), so σ = √144 = 12
• P(90 < X < 95) = P((90 − 80)/12 < Z < (95 − 80)/12) = P(0.83 < Z < 1.25)
= P(Z < 1.25) − P(Z < 0.83) = 0.8944 − 0.7967 = 0.098
• Thus, approximately 9.8% of this population will be borderline
hypertensive.
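As a cross-check on the table lookup, the same probability can be computed with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch; the variable names are illustrative. Because it does not round the Z-scores to two decimals, the result is about 0.097 rather than the table-based 0.098.

```python
from statistics import NormalDist

# DBP ~ N(80, 144): NormalDist takes the standard deviation, not the variance.
dbp = NormalDist(mu=80, sigma=144 ** 0.5)

z_low = (90 - dbp.mean) / dbp.stdev    # about 0.83
z_high = (95 - dbp.mean) / dbp.stdev   # 1.25
p_borderline = dbp.cdf(95) - dbp.cdf(90)

print(round(z_low, 2), round(z_high, 2), round(p_borderline, 3))
```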



Finding normal curve areas
1. The table gives areas between -∞ and the value of z0.
2. Find the z value in tenths in the column at the left margin
and locate its row. Find the hundredths place in the
appropriate column.
3. Read the value of the area (P) from the body of the table
where the row and column intersect. Values of P are given
as decimals to four places.



• The total area under the curve is 1.0, and the
curve is symmetric, so half is above the mean,
half is below



a. What is the probability that z < -1.96?

1. Sketch a normal curve
2. Draw a perpendicular line at z = -1.96
3. Find the area in the table
4. The answer is the area to the left of the line: P(z < -1.96) =
0.0250



b. What is the probability that -1.96 < z <1.96?

• The area between the values: P(-1.96 < z < 1.96) = .9750 -
.0250 = .9500



c. What is the probability that z > 1.96?

• The answer is the area to the right of the line, found by
subtracting the table value from 1.0000: P(z > 1.96) = 1.0000
- .9750 = .0250



Normal Probability distribution
• Example2: Suppose that total carbohydrate intake in 12-14 year old
males is normally distributed with mean 124 g/1000 cal and SD 20
g/1000 cal.
a) What percent of boys in this age range have carbohydrate intake
above 140g/1000 cal?
b) What percent of boys in this age range have carbohydrate intake
below 90g/1000 cal?
Solution: Let X be carbohydrate intake in 12-14-year-old males and
X ∼ N (124, 400)
a) P(X > 140) = P(Z > (140-124)/20) = P(Z > 0.8) = 1− P(Z < 0.8) = 1−0.7881
= 0.2119
b) P(X < 90) = P(Z < (90-124)/20) = P(Z < -1.7)= P(Z > 1.7) = 1− P(Z < 1.7) =
1− 0.9554 = 0.0446



Class work
1. Assume that among diabetics the fasting blood level of glucose is
approximately normally distributed with a mean of 105 mg per 100
ml and SD of 9 mg per 100 ml.
a) What proportion of diabetics have levels between 90 and 125
mg per 100 ml?
b) What proportion of diabetics have levels below 87.4 mg per 100
ml?
c) What level cuts off the lower 10% of diabetics?
d) What are the two levels which encompass 95% of diabetics?
2. Among a large group of coronary patients it is found that their
serum cholesterol levels approximate a normal distribution. It was
found that 10% of the group had cholesterol levels below 182.3
mg per 100 ml, whereas 5% had values above 359.0 mg per 100
ml. What are the mean and SD of the distribution?



Other types of probability distribution
• The multinomial distribution is similar to the
binomial distribution, but has the advantage of
allowing one to compute probabilities when
there are more than two outcomes

• The lognormal distribution is defined with
reference to the normal distribution such that a
random variable x is lognormal if its natural
logarithm, y = log x, is normal.



Other probability distributions
• The Gamma distribution is a skewed (non-symmetric)
distribution: most of the area under
the density function is located near the origin,
and the density function drops gradually.
• Hypergeometric distribution: the distribution of
a variable that has two outcomes when sampling
is done without replacement. It is defined
only for non-negative integers not exceeding
the sample size or the number of possible
successes.



Summary
• A probability distribution describes, using a table, graph or
formula, the values a random variable can take and their
probabilities, so that conclusions can be drawn about a
particular set of characteristics
• Probability distributions are classified into two types:
• A discrete probability distribution is one in which the
variable can take no other value between two consecutive values.
• It includes:
– Binomial
– Poisson distribution
• A continuous probability distribution is one in which
there is always an infinite number of values
between any two values.
• It includes:
– Normal probability distribution
– Exponential distribution
Assignment
A1. Let A and B denote two independent genetic traits. Suppose the
probability that an individual will exhibit trait A is ½ and the
probability that an individual will exhibit trait B is ¾.
a) What is the probability that an individual will exhibit both traits?
b) Neither trait?
c) trait A but not trait B?
d) trait B but not trait A?
e) exactly one trait?
A2. A physician develops a diagnostic test that is positive for 95% of the
patients who have disease and is positive for 10% of the patients
who do not have disease. Of patients tested, 20% actually have
disease. Suppose you evaluate a patient by administering this
diagnostic test and obtain a positive result. Using the information
given, calculate the probability that this patient has disease.
A3. The height, X, of young American women is normally
distributed with mean μ = 65.5 inches and standard deviation
σ = 2.5 inches. Find the probability of each of the following events.
– a. X < 67
– b. 64 < X < 67
A4. Four buses carrying 148 students arrive at a football stadium.
The buses carry, respectively, 25, 33, 40 and 50 students. After
everyone gets off the buses, a student is picked at
random. Let X denote the number of students that were on
his/her bus. Also, one of the drivers is picked at random.
Let Y denote the number of students that were on his/her bus.
(a) Compute E[X] and E[Y]. How do you explain the difference?
(b) Compute Var[X] and Var[Y].



Sampling and Sampling Distribution



Course objectives:
1. Define population and sample and understand the
different sampling terminologies
2. Identify and describe common methods of sampling
3. Differentiate between probability and Non-Probability
sampling methods and apply different techniques of
sampling
4. Understand the importance of a representative sample
5. Differentiate between random error and bias
6. Enumerate advantages and limitations of the different
sampling methods
7. Understand importance of sampling distribution
8. Understand the sampling distributions of a statistic
SAMPLING METHODS
• Sampling involves the selection of a number of study units
from a defined population.

I. Explain the difference between a sample survey and a census.
II. Discuss the advantages of a sample survey over a census.
III. Discuss the difference between study design and sample design.



Sampling
• If we have to draw a sample, we will be confronted with
the following questions:
A. What is the group of people (population) from which we
want to draw a sample?
B. How many people do we need in our sample?
C. How will these people be selected?



Non-probability sampling methods
• Used when a sampling frame does not exist
• No random selection (the sample may be unrepresentative
of the given population)
• Inappropriate if the aim is to measure variables and
generalize findings obtained from a sample to the
population.



Probability Sampling methods
• A sampling frame exists or can be compiled.
• Involve random selection procedures.
• Every element in the population has a known, non-zero
probability of being included in the sample
• All units of the population should have an equal, or at least
a known, chance of being included in the sample.
• Generalization is possible (from sample to population),
since estimates of the population are unbiased
• Reliability and validity of estimates can be evaluated



• Researchers often use sample survey
methodology to obtain information about a
larger population by selecting and
measuring a sample from that population.

• Since the population is often too large to study in full, we rely on
the information collected from the sample.
– Cost minimization



• Inferences about the population are based on
the information from the sample drawn from
that population.
• However, due to the variability in the
characteristics of the population, scientific
sample designs should be applied to select a
representative sample.
• If not, there is a high risk of distorting the view
of the population.



• A sample is a collection of individuals selected
from a larger population.
• Sampling enables us to estimate the
characteristic of a population by directly
observing a portion of the population.
• Researchers are not interested in the sample
itself, but in what can be learned from the
sample—and how this information can be
applied to the entire population.



Sample Information

Population



Sampling
• The process of selecting a portion of the
population to represent the entire population.
• A main concern in sampling:
1. Ensure that the sample represents the population,
and
2. The findings can be generalized.



Why Sample?
• Feasibility: Sampling may be the only feasible
method of collecting information.
• Reduced cost: Sampling reduces demands on
resource such as finance, personnel, and material.
• Greater speed: Data can be collected and
summarized more quickly
• Efficiency in terms of resource use



Disadvantages of sampling:
• There is always a sampling error.
• Sampling may create a feeling of
discrimination within the population.



Errors in sampling
1) Sampling error: error that arises because only a sample,
rather than the whole population, is observed.
– They cannot be avoided or totally eliminated.
2) Non-sampling error:
- Observational error
- Respondent error
- Lack of preciseness of definition
- Errors in editing and tabulation of data
Sampling error (random error)
• A sample is a subset of a population.
• Because of this property of samples, results
obtained from them cannot reflect the full range
of variation found in the larger group
(population).
• This type of error, arising from the sampling
process itself, is called sampling error, which is a
form of random error.
• Sampling error can be minimized by increasing
the size of the sample.
• When n = N ⇒ sampling error = 0
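The effect of sample size on sampling error can be illustrated with a small simulation. This is only a sketch: the population of 1000 integer values is hypothetical (generated for illustration), and the function name `average_abs_error` is an assumption. It draws repeated simple random samples without replacement and averages the absolute difference between the sample mean and the population mean.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable
# Hypothetical population of N = 1000 measurements (illustrative values only).
population = [random.randint(60, 120) for _ in range(1000)]
mu = mean(population)

def average_abs_error(n: int, repeats: int = 200) -> float:
    """Average |sample mean - population mean| over repeated samples of size n."""
    errors = [abs(mean(random.sample(population, n)) - mu) for _ in range(repeats)]
    return mean(errors)

for n in (10, 100, 1000):
    print(n, round(average_abs_error(n), 3))
```

With n equal to the population size every sample is the whole population, so the error is exactly zero, matching the last bullet above.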
Non‐sampling error (bias)
• Systematic error in the design or conduct of a
sampling procedure which results in distortion
of the sample,
• So that it is no longer representative of the
reference population.
• We can eliminate or reduce the non‐sampling
error (bias) by careful design of the sampling
procedure and not by increasing the sample
size.
Non‐sampling error cont...
• There are several possible sources of bias in sampling.
• The best known source of bias is non- response. It is the
failure to obtain information on some of the subjects
included in the sample to be studied.
• Non-response results in significant bias when the
following two conditions are both fulfilled:
– Non‐respondents constitute a significant
proportion of the sample (about 15% or more)
– Non‐respondents differ significantly from
respondents.



• Sampling frame - the list of all the units in the
reference population, from which a sample is to be
picked.
• Sampling fraction - the ratio of the number of units in the
sample to the number of units in the reference population
(n/N). Its reciprocal (N/n) is the sampling interval.



• Reference population (or target population): the
population of interest to whom the researchers would like
to make generalizations.

• Study population (sample unity): the actual group in which
the study is conducted.
• Study unit: the units on which information will be
collected: persons, housing units, etc.
– The sampling unit is not necessarily the same as the study unit.
- if the objective is to determine the availability of latrine, then the
study unit would be the household;
- if the objective is to determine the prevalence of trachoma, then
the study unit would be the individual.



The hierarchy of sampling
Study subjects
The actual
participants in
the study

Sample
Subjects who are
selected

Sampling Frame
The list of potential subjects
from which the sample is
drawn

Study population
The Population from whom the study
subjects would be obtained

Target population
The population to whom the results would be
applied
Sampling Methods

Probability sampling methods
1. SRS
2. Systematic
3. Stratified
4. Cluster
5. Multi stage
6. Sampling with probability proportional to size

Non-probability sampling methods
1. Convenience or haphazard sampling
2. Volunteer sampling
3. Judgment sampling
4. Quota sampling
5. Snowball sampling technique
6. Purposive sampling



A. Probability sampling
• Involves random selection of a sample
• Every sampling unit has a known and non-zero
probability of selection into the sample.

• Involves the selection of a sample from a
population, based on chance.



• Probability sampling is:
– more complex,
– more time-consuming and
– usually more costly than non-probability
sampling.
• However, because study samples are
randomly selected and their probability of
inclusion can be calculated,
– reliable estimates can be produced and
– inferences can be made about the population.



• There are several different ways in which a
probability sample can be selected.
• The method chosen depends on a number of
factors, such as
– The available sampling frame,
– How spread out the population is,
– How costly it is to survey members of the
population



Most common probability
sampling methods
1. Simple random sampling
2. Systematic random sampling
3. Stratified random sampling
4. Cluster sampling
5. Multi-stage sampling



1. Simple random sampling
• The required number of individuals is
selected at random from the sampling
frame, a list or a database of all individuals in
the population
• Each member of a population has an equal
chance of being included in the sample.



• To use a SRS method:
– Make a numbered list of all the units in the
population
– Each unit should be numbered from 1 to N (where
N is the size of the population)
– Select the required number.
• The randomness of the sample is ensured by:
– Use of ‘lottery’ methods
– Table of random numbers
– Computer programs



Random numbers
…. 8094 2525 8247 1347 7433 3620 1897 ….
…. 3563 2198 8211 9045 2618 2751 2627 ….
…. 1330 6331 3753 9693 8738 6815 1538 ….
…. 3565 0016 2243 6432 4796 6095 5283 ….
…. 7850 5925 5588 7311 2192 4545 3530 ….
…. 4490 5417 9727 6153 5901 4878 9980 ….
…. 6545 9104 9318 8819 7537 2785 9373 ….



Example

• Suppose your school has 500 students and
you need to conduct a short survey on the
quality of the food served in the cafeteria.
• You decide that a sample of 10 students
should be sufficient for your purposes.
• In order to get your sample, you assign a
number from 1 to 500 to each student in
your school.



• To select the sample, you use a table of
randomly generated numbers.

• Pick a starting point in the table (a row and
column number) and look at the random
numbers that appear there.
• In this case, since the student numbers run to three digits
(001 to 500), the random numbers need to contain
three digits as well.



• Ignore all random numbers greater than 500 because
they do not correspond to any of the students
in the school.
• Remember that the sample is without
replacement, so if a number recurs, skip over
it and use the next random number.
• The first 10 different numbers between 001
and 500 make up your sample.
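The procedure above can be sketched in code using the first three rows of the random-number table shown earlier. One assumption is made for illustration: each four-digit table entry is read as a three-digit number by taking its first three digits. The function name `sample_from_table` is hypothetical.

```python
# First three rows of the random-number table shown earlier in these notes.
TABLE = [
    "8094 2525 8247 1347 7433 3620 1897",
    "3563 2198 8211 9045 2618 2751 2627",
    "1330 6331 3753 9693 8738 6815 1538",
]

def sample_from_table(rows, population_size, sample_size):
    """Read the table left to right, skipping out-of-range and repeated numbers."""
    chosen = []
    for row in rows:
        for entry in row.split():
            number = int(entry[:3])  # assumed reading rule: first three digits
            if 1 <= number <= population_size and number not in chosen:
                chosen.append(number)
                if len(chosen) == sample_size:
                    return chosen
    raise ValueError("table exhausted before the sample was complete")

students = sample_from_table(TABLE, population_size=500, sample_size=10)
print(students)
```

When a computer program is used instead of a table, Python's `random.sample(range(1, 501), 10)` draws 10 distinct student numbers without replacement in one call.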



• Advantages of simple random sampling:-
– No bias i.e. no tendency to have too high or too low
statistic when you take many samples.
– Bias is consistent, repeated deviation of the sample
statistic from the population parameter in the same
direction.
– Small variability (of the values of the statistics from
sample to sample).

• SRS has certain limitations:
– Requires a sampling frame.
– Difficult if the reference population is dispersed.
– Minority subgroups of interest may not be selected.
2. Systematic random sampling

• Sometimes called interval sampling
• Selection of individuals from the sampling
frame systematically rather than randomly
• Individuals are taken at regular intervals down
the list
• The starting point is chosen at random



• Important if the reference population is
arranged in some order:
– Order of registration of patients
– Numerical order of house numbers
– Students’ registration books
• Individuals are taken at fixed intervals (every
kth) based on the sampling fraction, e.g. if the
sample includes 20% of the population, then every
fifth individual is taken.



Steps in systematic random sampling

1. Number the units on your frame from 1 to N
(where N is the total population size).

2. Determine the sampling interval (K) by dividing the
number of units in the population by the desired
sample size.



3. Select a number between one and K at random.
This number is called the random start and would
be the first number included in your sample.

4. Select every Kth unit after that first number



Example
• To select a sample of 100 from a population of
400, you would need a sampling interval of
400 ÷ 100 = 4.
• Therefore, K = 4.
• You will need to select one unit out of every
four units to end up with a total of 100 units in
your sample.
• Select a number between 1 and 4 from a table
of random numbers.



• If you choose 3, the third unit on your frame
would be the first unit included in your
sample;

• The sample might consist of the following
units to make up a sample of 100: 3 (the
random start), 7, 11, 15, 19...395, 399 (up to
N, which is 400 in this case).



• Using the above example, you can see that
with a systematic sample approach there are
only four possible samples that can be
selected, corresponding to the four possible
random starts:
A. 1, 5, 9, 13...393, 397
B. 2, 6, 10, 14...394, 398
C. 3, 7, 11, 15...395, 399
D. 4, 8, 12, 16...396, 400
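The four steps of systematic sampling can be sketched as follows for N = 400 and n = 100. The function name `systematic_sample` is an illustrative choice, and the sketch assumes N is an exact multiple of n, as in this example.

```python
import random

def systematic_sample(population_size: int, sample_size: int, start=None):
    """Return the unit numbers selected by systematic sampling."""
    k = population_size // sample_size      # sampling interval K
    if start is None:
        start = random.randint(1, k)        # random start between 1 and K
    # Take the random start and then every Kth unit after it.
    return list(range(start, population_size + 1, k))[:sample_size]

sample_c = systematic_sample(400, 100, start=3)   # random start of 3
print(sample_c[:5], "...", sample_c[-2:])
```

Calling the function with start = 1, 2, 3 or 4 reproduces samples A to D, and together these four samples cover every unit on the frame exactly once.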



• Each member of the population belongs to only one
of the four samples and each sample has the same
chance of being selected.

• The main difference is that with SRS any combination of
100 units would have a chance of making up the
sample, while with systematic sampling there are
only four possible samples.



Systematic sampling
Merits
• Less time-consuming and easier to perform
than SRS
Demerits
• Systematic sampling should not be used when
a cyclic repetition is inherent in the sampling
frame.



3. Stratified random sampling

• It is used when the population is known to be
heterogeneous with regard to some factors, and those
factors are used for stratification.
• Using stratified sampling, the population is divided
into homogeneous, mutually exclusive groups called
strata.
• A population can be stratified by any variable that is
available for all units prior to sampling (e.g., age, sex,
province of residence, income, etc.).
• Between strata there is heterogeneity, and within
each stratum the units are homogeneous.
• A separate sample is taken independently from
each stratum.
• Any of the sampling methods mentioned in this
section (and others that exist) can be used to
sample within each stratum.



Why do we need to create strata?
• It can make the sampling strategy more
efficient.
• A larger sample is required to get a more
accurate estimation if a characteristic varies
greatly from one unit to the other.
• For example, if every person in a population
had the same salary, then a sample of one
individual would be enough to get a precise
estimate of the average salary.
• This is the idea behind the efficiency gain
obtained with stratification.
– If you create strata within which units share
similar characteristics (e.g., income) and are
considerably different from units in other strata
(e.g., occupation, type of dwelling) then you
would only need a small sample from each
stratum to get a precise estimate of total income
for that stratum.
– Then you could combine these estimates to get
a precise estimate of total income for the whole
population.



• If you use a SRS approach in the whole
population without stratification, the sample
would need to be larger than the total of all
stratum samples to get an estimate of total
income with the same level of precision.



• Stratified sampling ensures an adequate
sample size for sub-groups in the population of
interest.

• When a population is stratified, each stratum
becomes an independent population and you
will need to decide the sample size for each
stratum.



• Equal allocation:
– Allocate equal sample size to each stratum
• Proportionate allocation:
$$n_j = \frac{n}{N} N_j$$
– nj is sample size of the jth stratum
– Nj is population size of the jth stratum
– n = n1 + n2 + ...+ nk is the total sample size
– N = N1 + N2 + ...+ Nk is the total population
size



Example: Proportionate Allocation

Village    A     B     C     D     Total
HHs        100   150   120   130   500
S. size    ?     ?     ?     ?     60
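The allocation formula n_j = (n / N) N_j can be applied to this table as follows. This is a sketch: the function name `proportionate_allocation` is illustrative, and rounding each stratum to the nearest whole household is an assumption. In general, rounded allocations need not add up exactly to n and may need a small manual adjustment.

```python
households = {"A": 100, "B": 150, "C": 120, "D": 130}  # N_j for each village
total_sample = 60                                       # n

def proportionate_allocation(strata_sizes: dict, n: int) -> dict:
    """Return n_j = (n / N) * N_j for every stratum, before rounding."""
    N = sum(strata_sizes.values())
    return {name: n * size / N for name, size in strata_sizes.items()}

exact = proportionate_allocation(households, total_sample)
rounded = {name: round(value) for name, value in exact.items()}
print(exact)
print(rounded, sum(rounded.values()))
```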



Stratified random sampling
Merit
• The representativeness of the sample is
improved.
• Adequate representation of minority subgroups
of interest can be ensured by stratification and
by varying the sampling fraction between strata
as required
Demerit
• Sampling frame for the entire population has to
be prepared separately for each stratum.



4. Cluster sampling
• The selection of groups of study units (clusters) instead
of the selection of study units individually
• The sampling unit is a cluster, and the sampling frame is
a list of these clusters.
• Sometimes it is too expensive to carry out SRS because
– Population may be large and scattered.
– Complete list of the study population unavailable
– Travel costs can become expensive if interviewers have to
survey people from one end of the country to the other.
• Cluster sampling is the method most widely used to reduce
cost.
• The clusters should be similar to one another (homogeneous
between clusters), unlike stratified sampling, where the strata
differ from one another (heterogeneous between strata)
Steps in cluster sampling
• Cluster sampling divides the population into groups
or clusters.
• A number of clusters are selected randomly to
represent the total population, and then all units
within selected clusters are included in the sample.
• No units from non-selected clusters are included in
the sample—they are represented by those from
selected clusters.
• This differs from stratified sampling, where some
units are selected from each group.
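The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `cluster_sample` and the made-up school sections are hypothetical, and the clusters are chosen by simple random sampling.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """One-stage cluster sample.

    Randomly selects n_clusters whole clusters and includes every unit
    of each selected cluster. Units in non-selected clusters are left out.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = [unit for name in chosen for unit in clusters[name]]
    return chosen, sample


# Hypothetical population: 6 school sections with 4 students each.
sections = {f"section_{i}": [f"s{i}_{j}" for j in range(1, 5)]
            for i in range(1, 7)}
chosen, sample = cluster_sample(sections, n_clusters=2, seed=1)
print(chosen)
print(sample)
```

The sampling frame needed here is only the list of cluster names, which matches the point that the sampling unit is the cluster rather than the individual.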

Example
• In a school-based study, we assume students of
the same school are homogeneous.

• We can randomly select sections and include all
students of the selected sections only.


Advantages
• Cost reduction
• It creates 'pockets' of sampled units instead of
spreading the sample over the whole territory.
• Sometimes a list of all units in the population
is not available, while a list of all clusters is
either available or easy to create.



Disadvantages
• Creates a loss of efficiency when compared with SRS.
• It is usually better to survey a large number of small
clusters instead of a small number of large clusters.
– This is because neighboring units tend to be more
alike, resulting in a sample that does not represent
the whole spectrum of opinions or situations
present in the overall population.
– Hence, sampling error is usually higher than for a
simple random sample of the same size.



5. Multi-stage sampling
• Similar to cluster sampling, except that it
involves picking a sample from within each
chosen cluster, rather than including all units
in the cluster.
• This type of sampling requires at least two
stages.


• The primary sampling unit (PSU) is the
sampling unit in the first sampling stage.

• The secondary sampling unit (SSU) is the
sampling unit in the second sampling stage,
etc.


Woreda (PSU) → Kebele (SSU) → Sub-Kebele (TSU) → HH


• In the first stage, large groups or clusters are
identified and selected. These clusters contain
more population units than are needed for the
final sample.

• In the second stage, population units are
picked from within the selected clusters (using
any of the possible probability sampling
methods) for a final sample.
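The two stages can be sketched as follows. This is only an illustration: the function name `two_stage_sample`, the kebele and household labels, and the choice of simple random sampling at both stages are assumptions made for the example.

```python
import random


def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Two-stage sample.

    Stage 1: randomly select n_clusters primary sampling units (PSUs).
    Stage 2: within each selected PSU, draw a simple random sample of
    up to n_per_cluster units instead of taking every unit.
    """
    rng = random.Random(seed)
    psus = rng.sample(sorted(clusters), n_clusters)
    sample = {}
    for psu in psus:
        units = clusters[psu]
        k = min(n_per_cluster, len(units))
        sample[psu] = rng.sample(units, k)
    return sample


# Hypothetical frame: 10 kebeles (PSUs) with 20 households each.
kebeles = {f"kebele_{i}": [f"hh_{i}_{j}" for j in range(1, 21)]
           for i in range(1, 11)}
sample = two_stage_sample(kebeles, n_clusters=3, n_per_cluster=5, seed=42)
for psu, households in sample.items():
    print(psu, households)
```

Only the list of kebeles and the household lists of the three selected kebeles are needed, not a household list for the whole population.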


• If more than two stages are used, the process of
choosing population units within clusters continues
until there is a final sample.

• With multi-stage sampling, you still have the benefit
of a more concentrated sample for cost reduction.

• However, the sample is less concentrated than in
cluster sampling, and the sample size is still bigger
than for a simple random sample.


• Also, you do not need to have a list of all of the units in
the population. All you need is a list of clusters and list
of the units in the selected clusters.

• Admittedly, more information is needed in this type
of sample than what is required in cluster sampling.

• However, multi-stage sampling still saves a great
amount of time and effort by not having to create a list
of all the units in a population.


Merit and demerit of Multi-stage sampling
Merit
– Cuts the cost of preparing a sampling frame for the total
population
Demerit
– Sampling error is increased compared with a simple
random sample
• Multistage sampling gives less precise estimates than
simple random sampling for the same sample size, but
the reduction in cost usually far outweighs this, and
allows for a larger sample size.
• One should account for the design effect when using
multistage sampling by multiplying the sample size
calculated for simple random sampling by 1.5 or 2.
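As a small worked illustration of the design-effect adjustment, the sketch below inflates a sample size computed for simple random sampling. The function name and the starting figure of 384 are hypothetical, chosen only to show the arithmetic.

```python
import math


def adjust_for_design_effect(n_srs, design_effect):
    """Inflate a simple-random-sampling sample size by a design effect.

    The result is rounded up, since a fraction of a unit cannot be sampled.
    """
    return math.ceil(n_srs * design_effect)


n_srs = 384  # hypothetical sample size computed for simple random sampling
for deff in (1.5, 2):
    print(deff, adjust_for_design_effect(n_srs, deff))
```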
B. Non-probability sampling
• In non-probability sampling, every item has an
unknown chance of being selected.

• In non-probability sampling, there is an assumption
that there is an even distribution of a characteristic of
interest within the population.

• For probability sampling, randomness is a feature of the
selection process, rather than an assumption about the
structure of the population.

• It is the assumption of an even distribution that makes
the researcher believe that any sample would be
representative and that, because of this, results will be
accurate.


• In non-probability sampling, since elements
are chosen arbitrarily, there is no way to
estimate the probability of any one element
being included in the sample.

• Also, no assurance is given that each item
has a chance of being included, making it
impossible either to estimate sampling
variability or to identify possible bias.


• Reliability cannot be measured in non-probability
sampling; the only way to address data quality is to
compare some of the survey results with available
information about the population.

• Still, there is no assurance that the estimates will
meet an acceptable level of error.

• Researchers are reluctant to use these methods
because there is no way to measure the precision of
the resulting sample.


• Despite these drawbacks, non-probability
sampling methods can be useful when
descriptive comments about the sample itself
are desired.
• Secondly, they are quick, inexpensive and
convenient.
• There are also other circumstances, such as in
some research settings, when it is unfeasible or
impractical to conduct probability sampling.


The most common types of non-probability
sampling

1. Convenience or haphazard sampling
2. Volunteer sampling
3. Judgment sampling
4. Quota sampling
5. Snowball sampling technique


1. Convenience or haphazard sampling

• Convenience sampling is sometimes referred
to as haphazard or accidental sampling.

• It is not normally representative of the target
population because sample units are only
selected if they can be accessed easily and
conveniently.


• The obvious advantage is that the method is
easy to use, but that advantage is greatly
offset by the presence of bias.

• Although useful applications of the technique
are limited, it can deliver accurate results
when the population is homogeneous.


• For example, a scientist could use this method to
determine whether a lake is polluted or not.

• Assuming that the lake water is well-mixed, any
sample would yield similar information.

• A scientist could safely draw water anywhere on the
lake without bothering about whether or not the
sample is representative.


2. Volunteer sampling
• As the term implies, this type of sampling occurs
when people volunteer to be involved in the
study.
• In psychological experiments or pharmaceutical
trials (drug testing), for example, it would be
difficult and unethical to enlist random
participants from the general public.
• In these instances, the sample is taken from a
group of volunteers.



• Sometimes, the researcher offers payment to
attract respondents.

• In exchange, the volunteers accept the
possibility of a lengthy, demanding or
sometimes unpleasant process.


• Sampling voluntary participants as opposed
to the general population may introduce
strong biases.

• Often in opinion polling, only the people
who care strongly enough about the subject
tend to respond.

• The silent majority does not typically
respond, resulting in large selection bias.


3. Judgment sampling
• This approach is used when a sample is taken based
on certain judgments about the overall population.

• The underlying assumption is that the investigator
will select units that are characteristic of the
population.

• The critical issue here is objectivity: how much can
judgment be relied upon to arrive at a typical
sample?


• Researchers often use this method in exploratory
studies like pre-testing of questionnaires and focus
groups.

• They also prefer to use this method in laboratory
settings where the choice of experimental subjects
(i.e., animal, human) reflects the investigator's
pre-existing beliefs about the population.

• The advantage of judgment sampling is the reduced cost
and time involved in acquiring the sample.


• The limitation of judgment sampling is that it is
subject to the researcher's biases and is
perhaps even more biased than haphazard
sampling.

• Since any preconceptions the researcher may
have are reflected in the sample, large biases can
be introduced if these preconceptions are
inaccurate.


4. Quota sampling
• This is one of the most common forms of non-
probability sampling.
• Sampling is done until a specific number of
units (quotas) for various sub-populations have
been selected.
• There are no rules as to how these quotas are
to be filled.
• It is really a means for satisfying sample size
objectives for certain sub-populations.
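The quota-filling idea can be sketched as below. This is a hedged illustration only: the function name `quota_sample`, the respondent list and the rule of accepting people in arrival order are all assumptions, since the notes state that there are no fixed rules for filling quotas.

```python
def quota_sample(respondents, quotas, group_of):
    """Fill quotas in arrival order (one of many possible filling rules).

    respondents: iterable of candidate units, in the order they are met.
    quotas: dict mapping sub-population label to required number of units.
    group_of: function returning the sub-population label of a unit.
    """
    remaining = dict(quotas)
    selected = []
    for person in respondents:
        group = group_of(person)
        if remaining.get(group, 0) > 0:
            selected.append(person)
            remaining[group] -= 1
        # Stop as soon as every quota has been met.
        if not any(remaining.values()):
            break
    return selected


# Hypothetical arrivals: (identifier, sex)
arrivals = [("p1", "F"), ("p2", "F"), ("p3", "M"), ("p4", "F"),
            ("p5", "M"), ("p6", "M"), ("p7", "F")]
selected = quota_sample(arrivals, {"F": 2, "M": 2}, group_of=lambda p: p[1])
print(selected)
```

Nothing in this procedure is random, so the selection probability of any one person is unknown.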
• As with all other non-probability sampling
methods, in order to make inferences about
the population, it is necessary to assume that
persons selected are similar to those not
selected.
• Such strong assumptions are rarely valid.
• The main argument against quota sampling is
that it does not meet the basic requirement of
randomness.
• Some units may have no chance of selection
or the chance of selection may be unknown.
• Therefore, the sample may be biased.
• Quota sampling is generally less expensive
than random sampling. It is also easy to
administer.
• It is an effective sampling method when
information is urgently required and can be
carried out without sampling frames.

• In many cases where the population has no
suitable frame, quota sampling may be the
only appropriate sampling method.
5. Snowball sampling
• A technique for selecting a research sample
where existing study subjects recruit future
subjects from among their acquaintances.
• Thus the sample group appears to grow like a
rolling snowball.



• This sampling technique is often used in
hidden populations which are difficult for
researchers to access; example populations
would be drug users or commercial sex
workers.

• Because sample members are not selected
from a sampling frame, snowball samples are
subject to numerous biases. For example,
people who have many friends are more likely
to be recruited into the sample.


Sampling Distributions


The sampling distribution of a statistic is the
probability distribution of all possible values the
statistic may assume, when computed from
repeated random samples of the same size,
drawn from a specified population.
We consider sample statistics as random
variables.
Parameter & Sample Statistic

• Sample Statistic: A descriptive measure
computed from the data of a sample.
– $\bar{X}$ (sample mean)
– $p$ (sample proportion)
– $S$ (sample standard deviation)
– $r$ (sample correlation coefficient)

• Population Parameter: A descriptive measure
computed from the data of a population.
– $\mu$ (population mean)
– $\pi$ (population proportion)
– $\sigma$ (population standard deviation)
– $\rho$ (population correlation coefficient)
Sampling Distribution of Mean
• The mean calculated from a sample is usually the best
guess for the population mean. But different samples
give different sample means.

• The sampling distribution of $\bar{X}$ is the probability
distribution of all possible values the random
variable $\bar{X}$ may assume, when computed from
repeated random samples of the same size, drawn
from a specified population.


A sampling distribution of a sample statistic (based on n
observations) is the relative frequency distribution of the
values of the statistic theoretically generated by taking
repeated random samples of size n and computing the value
of the statistic for each sample. (See Figure 5.)

Figure 5. Generating the theoretical sampling distribution of the sample
mean
• Example 1: Use computer simulation to find the approximate sampling
distribution of $\bar{x}$, the mean of a random sample of n = 5 observations
from the population of 4,171 values of number of children ever born in
Appendix A.
• Solution: We used a statistical program, for example SPSS, to obtain
100 random samples of size n = 5 from the target population. The first ten
of these samples are presented in Table 6.3.


Sampling distribution
• For each sample of five observations, the sample mean $\bar{x}$ was
computed. The relative frequency distribution of the number
of children ever born for the entire population of 4,171
women was plotted in Figure 6.4 and the 100 values of $\bar{x}$ are
summarized in the relative frequency distribution shown in
Figure 6.

• Figure 6. Relative frequency distribution for 4,171 children
ever born
Sampling distribution
• We can see that the values of $\bar{x}$ in Figure 6.5 tend to cluster around the
population mean, $\mu$ = 3.15 children. Also, the values of the sample
mean are less spread out (that is, they have less variation) than the
population values shown in Figure 6.4. These two observations are
borne out by comparing the means and standard deviations of the two
sets of observations.

• Figure 7. Sampling distribution of $\bar{X}$: relative frequency of $\bar{x}$ based on
100 samples of size n = 5
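The simulation in Example 1 can be imitated with a short script. The Appendix A data are not reproduced in these notes, so the sketch below uses a hypothetical stand-in population of 4,171 integer counts; the numbers it prints will therefore not match the figures quoted above, but the same two patterns should appear.

```python
import random
import statistics

rng = random.Random(2019)

# Hypothetical stand-in for the Appendix A data: 4,171 counts between 0 and 8.
population = [rng.choice(range(0, 9)) for _ in range(4171)]

n = 5            # observations per sample, as in Example 1
n_samples = 100  # number of repeated samples, as in Example 1

# Draw repeated random samples (without replacement) and keep each sample mean.
sample_means = [statistics.mean(rng.sample(population, n))
                for _ in range(n_samples)]

pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)
print("population mean:", round(pop_mean, 2), "SD:", round(pop_sd, 2))
print("mean of sample means:", round(statistics.mean(sample_means), 2),
      "SD of sample means:", round(statistics.stdev(sample_means), 2))
```

The sample means should centre near the population mean and show a smaller standard deviation than the population values.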
Sampling Distribution of Mean

When sampling from a normal population with mean
$\mu$ and standard deviation $\sigma$, sample means from
samples of size n are normally distributed:
$\bar{X} \sim N(\mu, \sigma/\sqrt{n})$, where the second argument is the standard deviation.

The term $\sigma/\sqrt{n}$ is called the standard error (standard
deviation of sample means).


Properties of the Sampling
Distribution of the Sample Mean
• Comparing the population distribution and the
sampling distribution of the mean:
– The sampling distribution is more bell-shaped
and symmetric.
– Both have the same center.
– The sampling distribution of the mean is more
compact, with a smaller variance.


Relationships between Population Parameters and
the Sampling Distribution of the Sample Mean
The expected value of the sample mean is equal to the population mean:

$E(\bar{X}) = \mu_{\bar{X}} = \mu_X$

The variance of the sample mean is equal to the population variance divided by
the sample size:

$V(\bar{X}) = \sigma^2_{\bar{X}} = \frac{\sigma^2_X}{n}$

The standard deviation of the sample mean, known as the standard error of
the mean, is equal to the population standard deviation divided by the square
root of the sample size:

$SD(\bar{X}) = \sigma_{\bar{X}} = \frac{\sigma_X}{\sqrt{n}}$
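A quick numerical check of the standard error formula may help. The function name and the population standard deviation of 2 are hypothetical values picked for the illustration.

```python
import math


def standard_error_of_mean(sigma, n):
    """Standard error of the sample mean: sigma divided by the square root of n."""
    return sigma / math.sqrt(n)


sigma = 2.0  # hypothetical population standard deviation
for n in (4, 16, 100):
    print(n, standard_error_of_mean(sigma, n))
```

Quadrupling the sample size from 4 to 16 only halves the standard error, because n enters through its square root.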
Sampling from a Normal Population
When sampling from a normal population with mean $\mu$ and standard
deviation $\sigma$, the sample mean, $\bar{X}$, has a normal sampling distribution:

$\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$

This means that, as the sample size increases, the sampling
distribution of the sample mean remains centered on the
population mean, but becomes more compactly distributed
around that population mean.

[Figure: Sampling Distribution of the Sample Mean. Density $f(\bar{X})$ for the normal population and for sampling distributions at several sample sizes n.]


The Central Limit Theorem
When sampling from a population with mean $\mu$ and finite
standard deviation $\sigma$, the sampling distribution of the
sample mean will tend to a normal distribution with mean
$\mu$ and standard deviation $\sigma/\sqrt{n}$ as the sample size becomes
large (n > 30).

For “large enough” n: $\bar{X} \sim N(\mu, \sigma^2/n)$
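The theorem can be checked by simulation. The sketch below is an illustration under stated assumptions: it uses an exponential population (strongly skewed, with mean 1 and standard deviation 1), a sample size of 40 and 2,000 repeated samples, none of which come from the original notes.

```python
import math
import random
import statistics

rng = random.Random(7)

mu, sigma = 1.0, 1.0   # mean and SD of an exponential population with rate 1
n = 40                 # sample size, above the n > 30 rule of thumb
n_samples = 2000       # number of repeated samples

# Each sample mean is computed from n draws of the skewed population.
sample_means = [statistics.mean(rng.expovariate(1.0) for _ in range(n))
                for _ in range(n_samples)]

se = sigma / math.sqrt(n)  # standard error predicted by the theorem
within = sum(abs(m - mu) <= 1.96 * se for m in sample_means) / n_samples

print("mean of sample means:", round(statistics.mean(sample_means), 3))
print("SD of sample means:", round(statistics.stdev(sample_means), 3),
      "predicted:", round(se, 3))
print("share within 1.96 standard errors of mu:", round(within, 3))
```

Even though the population is far from normal, the share of sample means within 1.96 standard errors of the population mean should come out close to the 95% expected under a normal distribution.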


Sampling Distribution of Proportion

• The proportion calculated from a sample is usually the best guess
for the population proportion. But different samples give different
sample proportions.

• The sampling distribution of p is the probability distribution
of all possible values the random variable p may assume,
when computed from repeated random samples of the same
size, drawn from a specified population.
Sampling Distribution of
Proportion
• As the sample size, n, increases, the sampling distribution of
proportions from samples of size n becomes approximately normal:
$p \sim N\left(\pi, \sqrt{\frac{\pi(1-\pi)}{n}}\right)$, where the second argument is the standard deviation.
• The standard error (standard deviation of sample proportions) is
$\sqrt{\frac{\pi(1-\pi)}{n}}$
• As an estimate for the standard error we use
$\sqrt{\frac{p(1-p)}{n}}$
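The estimated standard error of a proportion is easy to compute directly. In the sketch below the function name and the input values (a sample proportion of 0.2 from 100 observations) are hypothetical and serve only to show the arithmetic.

```python
import math


def standard_error_of_proportion(p, n):
    """Estimated standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


p_hat, n = 0.2, 100  # hypothetical sample proportion and sample size
print(standard_error_of_proportion(p_hat, n))
```

For a fixed n the standard error is largest at p = 0.5 and shrinks as p moves toward 0 or 1.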
