
Probability Distributions

• A probability distribution is a device for indicating the
values that a random variable may have by applying the
theory of probability

• Random Variable = Any quantity or characteristic that is
able to assume a number of different values such that
any particular outcome is determined by chance
• e.g. No. of patients in pediatric OPD.

PD cont.…

• Random variables can be either discrete or continuous

• A discrete random variable is able to assume only a
finite or countable number of outcomes

• A continuous random variable can take on any value in a
specified interval

PD cont.…

• Therefore, the probability distribution of a random
variable is a table, graph, or formula that gives the
probabilities with which the random variable takes
different values or ranges of values

A. Discrete Probability Distributions

• For a discrete random variable, the probability
distribution specifies each of the possible outcomes of
the random variable along with the probability that each
will occur

• Examples can be:
• Frequency distribution
• Relative frequency distribution
• Cumulative frequency

Properties of Probability Distribution:
1. $P(x) \ge 0$, if X is discrete.
   $f(x) \ge 0$, if X is continuous.

2. $\sum_{x} P(X = x) = 1$, if X is discrete.
   $\int_{x} f(x)\,dx = 1$, if X is continuous.

Note:
•If X is a continuous random variable then
   $P(a < X < b) = \int_{a}^{b} f(x)\,dx$
•Probability of a fixed value of a continuous random
variable is zero.
   $\Rightarrow P(a < X < b) = P(a \le X < b) = P(a < X \le b) = P(a \le X \le b)$

•If X is a discrete random variable taking integer values, then
   $P(a < X < b) = \sum_{x=a+1}^{b-1} P(x)$
   $P(a \le X < b) = \sum_{x=a}^{b-1} P(x)$
   $P(a < X \le b) = \sum_{x=a+1}^{b} P(x)$
   $P(a \le X \le b) = \sum_{x=a}^{b} P(x)$

•Probability means area for a continuous random variable.


cont.…

The following data shows the number of diagnostic
services a patient receives

Number of services (x)   0      1      2      3      4      5
P(X = x)                 0.671  0.229  0.053  0.031  0.010  0.006

cont.…

• What is the probability that a patient receives exactly 3
diagnostic services?
P(X=3) = 0.031
• What is the probability that a patient receives at most
one diagnostic service?
P (X≤1) = P(X = 0) + P(X = 1)
= 0.671 + 0.229
= 0.900

cont.…

• What is the probability that a patient receives at least
four diagnostic services?
P (X≥4) = P(X = 4) + P(X = 5)
= 0.010 + 0.006
= 0.016
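
The three questions above all reduce to summing P(X = x) over the outcomes of interest. A minimal Python sketch, using the probabilities quoted in this example (the helper name `prob` is illustrative, not from the slides):

```python
# Probability mass function for the number of diagnostic services per patient,
# with the values used in the worked answers of this example.
pmf = {0: 0.671, 1: 0.229, 2: 0.053, 3: 0.031, 4: 0.010, 5: 0.006}

def prob(pmf, condition):
    """Sum P(X = x) over every outcome x that satisfies `condition`."""
    return sum(p for x, p in pmf.items() if condition(x))

p_exactly_3 = prob(pmf, lambda x: x == 3)
p_at_most_1 = prob(pmf, lambda x: x <= 1)
p_at_least_4 = prob(pmf, lambda x: x >= 4)

print(round(p_exactly_3, 3), round(p_at_most_1, 3), round(p_at_least_4, 3))
```

The same helper answers any other event, for example `prob(pmf, lambda x: 1 <= x <= 3)`.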

cont.…

• Probability distributions can also be displayed using a
graph

What does it look like?

cont.…

• If a random variable is able to take on a large number of
values, then a probability mass function might not be the
most useful way to summarize its behavior

• Instead, measures of location and dispersion can be
calculated (as long as the data are not categorical)

Definition:
•Let a discrete random variable X assume the values $X_1, X_2, \ldots, X_n$
with the probabilities $P(X_1), P(X_2), \ldots, P(X_n)$
respectively. Then the expected value of X, denoted as
E(X), is defined as:

$$E(X) = X_1 P(X_1) + X_2 P(X_2) + \cdots + X_n P(X_n) = \sum_{i=1}^{n} X_i P(X_i)$$

•Let X be a continuous random variable assuming the
values in the interval (a, b) such that $\int_{a}^{b} f(x)\,dx = 1$,
then $E(X) = \int_{a}^{b} x f(x)\,dx$

•The variance of X is given by:

$$\text{Variance of } X = \mathrm{var}(X) = E(X^2) - [E(X)]^2$$

Where:
$E(X^2) = \sum_{i=1}^{n} x_i^2 P(X = x_i)$, if X is discrete
$E(X^2) = \int_{x} x^2 f(x)\,dx$, if X is continuous.

There are some general rules for mathematical expectation.
Let X and Y be random variables and k be a constant.

RULE 1 $E(k) = k$
RULE 2 $\mathrm{Var}(k) = 0$
RULE 3 $E(kX) = kE(X)$
RULE 4 $\mathrm{Var}(kX) = k^2\,\mathrm{Var}(X)$
RULE 5 $E(X + Y) = E(X) + E(Y)$
cont.…

• For the diagnostic service data,

Mean (X) = 0(0.671) + 1(0.229) + 2(0.053)
+ 3(0.031) + 4(0.010) + 5(0.006)
= 0.498 ≈ 0.5

• We would expect an average of 0.5 services for each visit

Discrete cont.…

$\sigma^2 = (0 - 0.5)^2(0.671) + (1 - 0.5)^2(0.229)$
$\quad + (2 - 0.5)^2(0.053) + (3 - 0.5)^2(0.031)$
$\quad + (4 - 0.5)^2(0.010) + (5 - 0.5)^2(0.006)$
$= 0.782$

Standard deviation = σ = √0.782 = 0.884

Hint: remember calculating variance from grouped data
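
The mean and variance calculation for the diagnostic service data can be reproduced in a few lines. This is a sketch under the assumption that the six probabilities quoted in the calculation are the full distribution; variable names are illustrative.

```python
import math

# Probabilities for the number of diagnostic services, as used in the
# mean and variance calculations for this example.
pmf = {0: 0.671, 1: 0.229, 2: 0.053, 3: 0.031, 4: 0.010, 5: 0.006}

mean = sum(x * p for x, p in pmf.items())                     # E(X)
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())   # sum of (x - mu)^2 * P(x)
std_dev = math.sqrt(variance)

print(f"mean = {mean:.3f}, variance = {variance:.3f}, SD = {std_dev:.3f}")
```

The slide rounds the mean to 0.5 before squaring, while the code keeps 0.498; at three decimals the variance and SD agree either way.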

Cont…
• To obtain the expected value of a discrete random variable X,
we multiply each possible outcome by its associated
probability and sum over all values with a probability greater than
0.
• That is, $E(X) = \mu = \sum_{i=1}^{n} x_i P(X = x_i)$, where the $x_i$'s are the
values the random variable assumes with positive probability
Example: Consider the random variable representing the
number of episodes of diarrhea in the first 2 years of life.
Suppose this random variable has the probability mass function
below

r        0      1      2      3      4      5      6
P(X=r)   0.129  0.264  0.271  0.185  0.095  0.039  0.017
Biostatistics 17
Cont…
• What is the expected number of episodes of diarrhoea in the
first 2 years of life?

• E(X) = 0(.129) + 1(.264) + 2(.271) + 3(.185) + 4(.095) + 5(.039) + 6(.017) = 2.038

• Thus, on the average a child would be expected to have 2
episodes of diarrhoea in the first 2 years of life.

• The variance of a discrete random variable X, denoted by V(X), is
defined by $V(X) = \sigma^2 = \sum_{i=1}^{k} (x_i - \mu)^2 P(X = x_i) = \sum_{i=1}^{k} x_i^2 P(X = x_i) - \mu^2$
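
As a hedged illustration, the sketch below applies both forms of the variance formula to the diarrhoea data and checks that they agree. The function name `mean_and_variance` is hypothetical, and the variance value is computed here rather than taken from the slides.

```python
def mean_and_variance(pmf):
    """Return (E(X), V(X)) for a discrete pmf given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    second_moment = sum(x**2 * p for x, p in pmf.items())   # E(X^2)
    return mu, second_moment - mu**2                          # shortcut formula

# Episodes of diarrhoea in the first 2 years of life (values from the example).
episodes = {0: 0.129, 1: 0.264, 2: 0.271, 3: 0.185, 4: 0.095, 5: 0.039, 6: 0.017}
mu, var = mean_and_variance(episodes)

# Definitional form, sum of (x - mu)^2 * P(x), for comparison with the shortcut.
var_definition = sum((x - mu) ** 2 * p for x, p in episodes.items())

print(round(mu, 3), round(var, 3), round(var_definition, 3))
```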

cont.…

• Examples of discrete probability distributions are the
binomial distribution and the Poisson distribution

1. Binomial Distribution

• It is one of the most widely encountered discrete
distributions
• Consider a dichotomous random variable
• Based on the Bernoulli trial = a single trial of some
experiment that can result in only one of two mutually
exclusive outcomes (success or failure; dead or alive; sick
or well; male or female)

Binomial cont.…

Example:
• We are interested in determining whether a newborn
infant will survive until his/her 70th birthday
• Let Y represent the survival status of the child at age 70
years
• Y = 1 if the child survives and Y = 0 if he/she does not

Binomial cont.…

• The outcomes are mutually exclusive and exhaustive

• Suppose that 72% of infants born survive to age 70 years
P(Y = 1) = p = 0.72
P(Y = 0) = 1 − p = 0.28

Binomial cont.…

• A binomial probability distribution occurs when the
following requirements are met
1. The procedure has a fixed number of trials
2. The trials must be independent
3. The outcomes of each trial must fall into exactly two
categories
4. The probabilities must remain constant for each trial

Binomial cont.…

Characteristics of a Binomial Distribution

• The experiment consists of n identical trials
• Only two possible outcomes, which are mutually exclusive,
on each trial
• The probability of A (success) remains the same from trial
to trial. This probability is denoted by p, and the
probability of B (failure) is denoted by q
q = 1 - p
• The trials are independent
• n and p are the parameters of the binomial distribution
• The mean is np and the variance is np(1 - p)
Binomial cont.…

• Suppose an event can have only binary outcomes A and B

• Let the probability of A be p and that of B be 1 - p

• The probability p stays the same each time the event
occurs

Binomial cont.…

• If an experiment is repeated n times and the outcome is
independent from one trial to another, the probability
that outcome A occurs exactly x times is:

• $P(X = x) = \binom{n}{x} p^{x} q^{n-x}$, x = 0, 1, 2, ..., n

•n denotes the fixed number of trials
•x denotes the number of successes in the n trials
•p denotes the probability of success
•q denotes the probability of failure (1 - p)

• $\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$ counts the number of ways of choosing
a group of x objects from among n objects
• where n! = n(n-1)(n-2)…(1), and 0! = 1
Binomial cont.…

Example:
• Suppose we know that 40% of a certain population are
cigarette smokers. If we take a random sample of 10
people from this population, what is the probability that
we will have exactly 4 smokers in our sample?

Binomial cont.…

• If we assume that the probability that any individual in
the population is a smoker is p = .40, then the
probability that x = 4 smokers out of n = 10 subjects
selected is:

• $P(X = 4) = \binom{10}{4}(.4)^{4}(1 - .4)^{10-4}$
$= \binom{10}{4}(.4)^{4}(.6)^{6} = 210(.0256)(.04666)$
$= 0.25$
• Or the probability of obtaining exactly 4 smokers in the
sample is about 25%
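
The smoker calculation can be checked with Python's standard library. This is a small sketch; `binomial_pmf` is an illustrative helper name, not something defined in the slides.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p): C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# 10 people sampled, 40% smokers in the population, exactly 4 smokers observed.
p_four_smokers = binomial_pmf(4, n=10, p=0.4)
print(round(p_four_smokers, 4))
```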
Binomial cont.…

• We can compute the probability of observing zero
smokers out of 10 subjects selected at random, exactly 1
smoker, and so on, and display the results in a table, as
given below.

• The third column, P(X ≤ x), gives the cumulative
probability. E.g. the probability of selecting 3 or fewer
smokers into the sample of 10 subjects is
P(X ≤ 3) = .3823, or about 38%.
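
A table of P(X = x) and the cumulative P(X ≤ x) for n = 10 and p = 0.4 can be generated as follows. This is a sketch; the column layout is an assumption and may differ from the slide's table.

```python
from math import comb

n, p = 10, 0.4   # sample size and smoking prevalence from the example

rows = []        # each row: (x, P(X = x), P(X <= x))
cumulative = 0.0
for x in range(n + 1):
    prob_x = comb(n, x) * p**x * (1 - p) ** (n - x)
    cumulative += prob_x
    rows.append((x, prob_x, cumulative))

for x, prob_x, cum in rows:
    print(f"{x:2d}  {prob_x:.4f}  {cum:.4f}")
```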
Binomial cont.…

Exercise
• Each child born to a particular set of parents has a
probability of 0.25 of having blood type O. Suppose these
parents have 5 children.
• What is the probability that:
a. Exactly two of them have blood type O
b. At most 2 have blood type O
c. At least 4 have blood type O
d. Exactly 2 do not have blood type O.

Binomial cont.…

a) Solution for ‘a’

$P(X = 2) = \binom{5}{2}(0.25)^{2}(0.75)^{5-2} = 0.2637$

Binomial cont.…

The Mean and Variance of a Binomial Distribution


• Once n and P are specified, we can compute the
proportion of success,

• The mean and variance of the distribution are given by:
$\mu = np$, $\sigma^2 = npq$, $\sigma = \sqrt{npq}$

Binomial cont.…

Example:
• 70% of a certain population has been immunized for
polio. If a sample of size 50 is taken, what is the
“expected total number”, in the sample who have been
immunized?
µ = np = 50(.70) = 35

• This tells us that “on the average” we expect to see 35
immunized subjects in a sample of 50 from this
population.

Binomial cont.…

• If repeated samples of size 10 are selected from the
population of infants born, the mean number of children
per sample who survive to age 70 would be
np = (10)(0.72) = 7.2

• The variance would be npq = (10)(0.72)(0.28) = 2.02 and
the SD would be
√2.02 = 1.42
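
The formulas μ = np and σ² = npq can be cross-checked against the binomial probabilities themselves. A minimal sketch for the survival example (n = 10, p = 0.72); variable names are illustrative.

```python
from math import comb, sqrt

n, p = 10, 0.72          # sample size and survival probability from the example
q = 1 - p

mean = n * p             # mu = np
variance = n * p * q     # sigma^2 = npq
std_dev = sqrt(variance)

# Cross-check: the mean computed directly as the sum of x * P(X = x).
mean_from_pmf = sum(x * comb(n, x) * p**x * q ** (n - x) for x in range(n + 1))

print(round(mean, 2), round(variance, 2), round(std_dev, 2), round(mean_from_pmf, 2))
```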

Exercise
• Suppose that in a certain malarious area past experience indicates
that the probability that a person with a high fever will test positive for
malaria is 0.6.
• Consider 4 randomly selected patients (with high fever) in that
same area.
1) What is the probability that no patient will be positive for malaria?
2) What is the probability that exactly one patient will be positive for
malaria?
3) What is the probability that exactly two of the patients will be
positive for malaria?
4) What is the probability that all patients will be positive for
malaria?
5) Find the mean and the SD of the probability distribution given
above.

2. The Poisson Distribution

• Suppose events happen randomly and independently in
space or time at a constant rate in an interval.

• If events happen with rate λ (Greek lambda) events per
unit time, the probability of x events happening in unit
time is given as:

$$P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots$$

• λ is the parameter of the Poisson distribution and it is
the average number of occurrences of the random event
in the interval (or volume), and e is the constant whose
value is approximately 2.7183.

The Poisson distribution cont’d…

• Theoretically, an infinite number of occurrences of the
event must be possible in the interval, and the probability
of a single occurrence of the event in a given interval is
proportional to the length of the interval.

• The mean (µ) and the variance are equal in the Poisson
distribution and both are equal to λ. Generally, for the Poisson
distribution: E(X) = λ and Var(X) = λ

The Poisson distribution cont’d…

• Example: In a study of suicides, the monthly distribution
of adolescent suicides in an area over a ten-year interval
closely followed a Poisson distribution with parameter λ =
2.75. Find the probability that a randomly selected month
will be one in which three adolescent suicides occurred.
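
One way to work this example is to evaluate the Poisson probability $e^{-\lambda}\lambda^{x}/x!$ at x = 3 with λ = 2.75. The sketch below does that; `poisson_pmf` is an illustrative helper, and the printed value is computed here rather than quoted from the slides.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam): exp(-lam) * lam^x / x!"""
    return exp(-lam) * lam**x / factorial(x)

# Probability of exactly three adolescent suicides in a month when lambda = 2.75.
p_three = poisson_pmf(3, lam=2.75)
print(round(p_three, 4))
```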

The Poisson distribution cont’d…

Example: Suppose λ = 2 is the average number of items per
sample. Assuming that the number of items follows a
Poisson distribution, find the probability that the next
sample taken will contain:
a. One or fewer items? Soln P(X≤1) = 0.406
b. Exactly three items? Soln P(X=3) = 0.180
c. More than five items? Soln P(X>5) = 0.017
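
The three answers can be reproduced numerically. In this sketch "one or fewer" is read as P(X ≤ 1) and "more than five" as 1 − P(X ≤ 5); the helper name is illustrative.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam): exp(-lam) * lam^x / x!"""
    return exp(-lam) * lam**x / factorial(x)

lam = 2  # average number of items per sample

p_one_or_fewer = poisson_pmf(0, lam) + poisson_pmf(1, lam)          # P(X <= 1)
p_exactly_three = poisson_pmf(3, lam)                                # P(X = 3)
p_more_than_five = 1 - sum(poisson_pmf(x, lam) for x in range(6))    # 1 - P(X <= 5)

print(round(p_one_or_fewer, 3), round(p_exactly_three, 3), round(p_more_than_five, 3))
```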

B. Continuous Probability Distributions

• A continuous random variable can take on any value in a
specified interval or range

• With a large number of class intervals, the frequency
polygon begins to resemble a smooth curve.

• The probability distribution of X is represented by a
smooth curve called a probability density function

Continuous cont.…

• The area under the smooth curve is equal to 1
• The area under the curve between any two points x1 and
x2 is the probability that X takes a value between x1 and
x2
Continuous cont.…

• Instead of assigning probabilities to specific outcomes of
the random variable X, probabilities are assigned to
ranges of values
• The probability associated with any one particular value
is equal to 0

Continuous cont.…

• We calculate:
✓Pr [ a < X < b], the probability of an interval of values
of X.

• For the above reason,

is also without meaning

The Normal distribution

• The Normal distribution is the most important probability
distribution in statistics
• It is frequently called the “Gaussian distribution” or the
bell-shaped curve.
• Variables such as blood pressure, weight, height, and serum
cholesterol level are approximately normally
distributed

Normal cont.…
• Distribution of weights of 57 children; the frequency
distribution consists of intervals with a width of 10 lb.

Normal cont.…

• Now imagine that we increase the number of children
to 50,000 and decrease the width of the intervals to
0.01 lb. The histogram would now look more like:

Normal cont.…

• A random variable is said to have a normal distribution if
it has a probability distribution that is symmetric and
bell-shaped

Normal cont.…

• If we continue to increase the size of the data set and
decrease the interval width, we eventually arrive at a
smooth curve superimposed on the histogram, called a
density curve.

Central limit theorem

✓ If sample sizes are fairly large, values of the sample mean x̄ (or
the sample proportion p) in repeated sampling have a very nearly
normal distribution.

Normal cont.…

• The concept “probability of X=x” is replaced by the
“probability density function f(x) evaluated at X=x”
• The probability that the variable assumes any value in an
interval between two specific points a and b is given by
$P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$

Normal cont.…

• A random variable X is said to follow a normal distribution
if and only if its probability density function (a formula
used to represent the distribution of a random variable)
is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \quad -\infty < x < \infty.$$
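
To make the density of a normal random variable with mean μ and standard deviation σ concrete, the sketch below evaluates it directly and checks numerically that the total area under the standard normal curve is close to 1. The function name and the integration grid are illustrative choices.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and SD sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Crude Riemann-sum check of the area under the standard normal curve on [-8, 8].
step = 0.001
area = sum(normal_pdf(-8 + i * step, 0, 1) * step for i in range(16000))

print(round(normal_pdf(0, 0, 1), 4), round(area, 4))
```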

Normal cont.…

• π (pi) = 3.14159
• e = 2.71828, x = Value of X
• Range of possible values of X: -∞ to +∞
• µ = Expected value of X (“the long run average”)
• σ2 = Variance of X
• µ and σ are the parameters of the normal distribution —
they completely define its shape

Normal cont.…
• The normal distribution plays an important role in
statistical inference because:
1. Many real-life distributions are approximately normal.
2. Many other distributions can be almost normalized by
appropriate data transformations (e.g., taking the log).
When log X has a normal distribution, X is said to have
a lognormal distribution.
3. As the sample size increases, the distribution of the means
of samples drawn from a population of any distribution
(with finite variance) will approach the normal
distribution. This theorem, when stated rigorously, is
known as the central limit theorem.
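
The central limit theorem can be illustrated with a small simulation. The sketch below makes several assumptions that are not in the slides: an exponential population (mean 1, SD 1, strongly right-skewed), samples of size 50, 2000 repetitions, and a fixed random seed.

```python
import random
import statistics

rng = random.Random(42)      # fixed seed so the run is repeatable
sample_size = 50
num_samples = 2000

# Draw repeated samples from a skewed population and record each sample mean.
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_sd = 1.0 / sample_size ** 0.5   # population SD / sqrt(n)

# If the sample means are roughly normal, about 95% fall within 1.96 SEs of the mean.
within = sum(abs(m - 1.0) <= 1.96 * expected_sd for m in sample_means) / num_samples

print(round(mean_of_means, 3), round(sd_of_means, 3), round(expected_sd, 3), round(within, 3))
```

Although the population is far from normal, the sample means cluster around the population mean with spread close to σ/√n, and roughly 95% of them fall within 1.96 standard errors.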
Normal cont.…

1. The mean µ tells you about location
• Increase µ - Location shifts right
• Decrease µ - Location shifts left
• Shape is unchanged

2. The variance σ² tells you about narrowness or flatness
of the bell
• Increase σ² - Bell flattens. Extreme values are more likely
• Decrease σ² - Bell narrows. Extreme values are less likely
• Location is unchanged

Normal cont.…

Properties of the Normal Distribution

1. It is symmetrical about its mean, μ.
2. The mean, the median and mode are all equal
3. The total area under the curve above the x-axis is 1
square unit.
4. The curve never touches the x-axis (the x-axis is an asymptote)
5. As the value of σ increases, the curve becomes more
and more flat and vice versa.

Normal cont.…

6. Perpendiculars erected at:
μ ± 1 SD contain about 68%;
μ ± 2 SD contain about 95%;
μ ± 3 SD contain about 99.7%
of the area under the curve.

7. The distribution is completely determined by the
parameters μ and σ.
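
The 68%, 95% and 99.7% figures can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8+). This is a sketch using the standard normal; the same areas hold for any μ and σ.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, SD 1

# Area within k standard deviations of the mean, for k = 1, 2, 3.
areas = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within ±{k} SD: {area:.4f}")
```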

Normal cont.…

• As a result, we have different normal distributions
depending on the values of μ and σ²
• We cannot tabulate every possible distribution
• Tabulated normal probability calculations are available
only for the Normal Distribution with µ = 0 and σ² = 1.

Standard Normal Distribution

• It is a normal distribution that has a mean equal to 0 and
a SD equal to 1, and is denoted by N(0, 1)
• The main idea is to standardize all the data that is given
by using Z-scores
• These Z-scores can then be used to find the area (and
thus the probability) under the normal curve

SND cont.…

• Z-transformation: If a random variable X ~ N(μ, σ²) then we
can transform it to a SND with the help of the
Z-transformation

$$Z = \frac{x - \mu}{\sigma}$$

• Z represents the Z-score for a given x value

SND cont.…

• Consider redefining the scale to be in terms of how many
SDs a value lies away from the mean, for a normal distribution
with μ=110 and σ=15.

Value x        50  65  80  95  110  125  140  155  170
SDs from mean  -4  -3  -2  -1    0    1    2    3    4

SDs from mean computed using (x-110)/15 = (x-μ)/σ
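
The rescaling in the table is simply the Z-transformation applied to each value. A short sketch with the same μ and σ:

```python
mu, sigma = 110, 15
values = [50, 65, 80, 95, 110, 125, 140, 155, 170]

# Number of SDs each value lies from the mean: z = (x - mu) / sigma
z_scores = [(x - mu) / sigma for x in values]

for x, z in zip(values, z_scores):
    print(f"x = {x:3d}  z = {z:+.0f}")
```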

SND cont.…

• This process is known as standardization and gives the
position on a normal curve with μ=0 and σ=1, i.e., the
SND, Z.

• A Z-score is the number of standard deviations that a
given x value is above or below the mean.

SND cont.…

Finding normal curve areas

1. The table gives areas between -∞ and the value of z₀.

2. Find the z value in tenths in the column at the left margin
and locate its row. Find the hundredths place in the
appropriate column.

3. Read the value of the area (P) from the body of the
table where the row and column intersect. Values of P
are in the form of a decimal point and four places.

SND cont.…

Some Useful Tips

SND cont.…

• Standard normal curve and some important divisions.


SND cont.…

a) What is the probability that z < -1.96?

(1) Sketch a normal curve
(2) Draw a line for z = -1.96
(3) Find the area in the table
(4) The answer is the area to the left of the line: P(z < -1.96) =
.0250
SND cont.…

b) What is the probability that -1.96 < z < 1.96?

The area between the values: P(-1.96 < z < 1.96) = .9750 - .0250 = .9500

SND cont.…

c) What is the probability that z > 1.96?

• The answer is the area to the right of the line; found by subtracting
table value from 1.0000; P(z > 1.96) =1.0000 - .9750 = .0250
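
The table look-ups in parts a) to c) can be reproduced with the cumulative distribution function of the standard normal. This sketch uses `statistics.NormalDist` (Python 3.8+), whose `cdf(z)` gives the area to the left of z, as in the table.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal distribution

left_tail = z.cdf(-1.96)                 # a) P(z < -1.96)
middle = z.cdf(1.96) - z.cdf(-1.96)      # b) P(-1.96 < z < 1.96)
right_tail = 1 - z.cdf(1.96)             # c) P(z > 1.96)

print(round(left_tail, 4), round(middle, 4), round(right_tail, 4))
```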

Applications of the Normal Distribution

• The normal distribution is used as a model to study many
different variables.
• We can use the normal distribution to answer probability
questions about continuous random variables.

• Following the model of the normal distribution, a given
value of x must be converted to a z score before it can be
looked up in the z table

Applications cont.…

Example:
• The diastolic blood pressures of males 35–44 years of age
are normally distributed with µ = 80 mm Hg and σ² = 144
mm Hg², so σ = 12 mm Hg

• Therefore, a DBP of 80 + 12 = 92 mm Hg lies 1 SD above
the mean
• Suppose individuals with DBP above 95 mm Hg are considered
to be hypertensive

Applications cont.…

a. What is the probability that a randomly selected male
has a blood pressure above 95 mm Hg?

Z = (95 – 80)/12 = 1.25

P (Z > 1.25) = 0.1056

• Approximately 10.6% of this population would be
classified as hypertensive

Applications cont.…

b. What is the probability that a randomly selected male
has a DBP above 110 mm Hg?

Z = (110 – 80)/12 = 2.50

P (Z > 2.50) = 0.0062

• Approximately 0.6% of the population has a DBP above
110 mm Hg

Applications cont.…

c. What is the probability that a randomly selected male
has a DBP below 60 mm Hg?

Z = (60 – 80)/12 = -1.67

P (Z < -1.67) = 0.0475

• Approximately 4.8% of the population has a
DBP below 60 mm Hg

Applications cont.…

d. What value of DBP cuts off the upper 5% of this
population?
• Looking at the table, the value Z = 1.645 cuts off an area
of 0.05 in the upper tail
• We want the value of X that corresponds to Z = 1.645
Z = (X – μ)/σ
1.645 = (X – 80)/12, so X = 80 + 1.645(12) = 99.7
• Approximately 5% of the men in this population have a
DBP greater than 99.7 mm Hg
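
Parts a to d of the blood pressure example can be checked without a z table. The sketch below models DBP as a normal distribution with µ = 80 and σ = 12, as in the example, using `statistics.NormalDist`; variable names are illustrative.

```python
from statistics import NormalDist

dbp = NormalDist(mu=80, sigma=12)   # diastolic blood pressure, mm Hg

p_above_95 = 1 - dbp.cdf(95)        # a. P(X > 95)
p_above_110 = 1 - dbp.cdf(110)      # b. P(X > 110)
p_below_60 = dbp.cdf(60)            # c. P(X < 60)
upper_5_cutoff = dbp.inv_cdf(0.95)  # d. value with 5% of the area above it

print(round(p_above_95, 4), round(p_above_110, 4),
      round(p_below_60, 4), round(upper_5_cutoff, 1))
```

The result for part c differs slightly from the table answer in the fourth decimal place, because the table method rounds z to -1.67 before the look-up.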
Exercises

1. If the total cholesterol values for a certain target
population are approximately normally distributed
with a mean of 200 (mg/100 mL) and a standard
deviation of 20 (mg/100 mL), the probability that a
person picked at random from this population will
have a cholesterol value greater than 240 (mg/100
mL) is

Exercise cont.…

2. Assume that the test scores for a large class are
normally distributed with a mean of 74 and a standard
deviation of 10.
(a) Suppose that you receive a score of 88. What percent
of the class received scores higher than yours?
(b) Suppose that the teacher wants to limit the number of
A grades in the class to no more than 20%. What would
be the lowest score for an A?

Exercise cont.…

3. Refer to the standard normal distribution. What is the
probability of obtaining a z value of:
a) At least 1.25?
b) At least 0.84?
c) Between -1.96 and 1.96?
d) Between 1.22 and 1.85?
e) Between 0.84 and 1.28?
f) Less than 1.72?
g) Less than 1.25?

Exercise cont.…

4. Refer to the standard normal distribution. Find a z
value such that the probability of obtaining a larger z
value is:
a) 0.05
b) 0.025
c) 0.20

Example 2: Suppose that total carbohydrate intake
in 12–14-year-old males is normally distributed with
mean 124 g/1000 cal and SD 20 g/1000 cal.

A. What percent of boys in this age range have
carbohydrate intake above 140 g/1000 cal?

B. What percent of boys in this age range have
carbohydrate intake below 90 g/1000 cal?
• Solution: Let X be carbohydrate intake in 12–14-year-old
males and X ∼ N(124, 400)

• A) $P(X > 140) = P\left(Z > \frac{140 - 124}{20}\right) = P(Z > 0.8) = 1 - P(Z < 0.8) = 1 - 0.7881 = 0.2119$

• Interpretation: about 21.2% of boys in the age range
of 12-14 yrs have carbohydrate intake of above
140 g/1000 cal.

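• B) $P(X < 90) = P\left(Z < \frac{90 - 124}{20}\right) = P(Z < -1.7) = P(Z > 1.7) = 1 - P(Z < 1.7) = 1 - 0.9554 = 0.0446$

• N.B: By the symmetry of the standard normal distribution, P(Z < -z) = P(Z > z)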
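
Both parts of the carbohydrate example can be verified numerically. This sketch assumes X ∼ N(124, 20²) as stated in the solution and uses `statistics.NormalDist`; variable names are illustrative.

```python
from statistics import NormalDist

intake = NormalDist(mu=124, sigma=20)   # carbohydrate intake, g/1000 cal

p_above_140 = 1 - intake.cdf(140)       # A. P(X > 140) = P(Z > 0.8)
p_below_90 = intake.cdf(90)             # B. P(X < 90)  = P(Z < -1.7)

# Symmetry check: P(Z < -1.7) equals P(Z > 1.7) on the standard normal.
std = NormalDist()
symmetric = abs(std.cdf(-1.7) - (1 - std.cdf(1.7))) < 1e-12

print(round(p_above_140, 4), round(p_below_90, 4), symmetric)
```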


Exercises
1) Assume that among diabetics the fasting blood
level of glucose is approximately normally
distributed with a mean of 90 mg per 100
ml and SD of 4 mg per 100 ml.

a) What proportion of diabetics have levels
between 90 and 125 mg per 100 ml?

b) What proportion of diabetics have levels below
87.4 mg per 100 ml?

c) What level cuts off the lower 10% of diabetics?

d) What are the two levels which encompass 95%
of diabetics?

Exercise: Diskin et al. studied common breath metabolites such
as ammonia, acetone, isoprene, ethanol and acetaldehyde in
five subjects over a period of 30 days. Each day, breath samples
were taken and analyzed in the early morning on arrival at the
laboratory. For subject A, a 27-year-old female, the ammonia
concentration in parts per billion (ppb) followed a normal
distribution over 30 days with mean 491 and standard
deviation 119. What is the probability that on a random day, the
subject's ammonia concentration is between 292 and 649 ppb?