
Binomial and normal distributions

Business Statistics 41000

Fall 2015

1
Topics

1. Sums of random variables

2. Binomial distribution

3. Normal distribution

4. Vignettes

2
Topic: sums of random variables

Sums of random variables are important for two reasons:

1. Because we often care about aggregates and totals (sales, revenue, employees, etc.).

2. Because averages are basically sums, and probabilities are basically averages (of dummy variables), when we go to estimate probabilities we will end up using sums of random variables a lot.

This second point is the topic of the next lecture. For now, we focus on
the direct case.

3
A sum of two random variables

Suppose X is a random variable denoting the profit from one wager and
Y is a random variable denoting the profit from another wager.

If we want to consider our total profit, we may consider the random variable that is the sum of the two wagers, S = X + Y.

To determine the distribution of S, we must first know the joint distribution of (X, Y).

4
A sum of two random variables

Suppose that (X, Y) has the following joint distribution, with values of X down the rows and values of Y across the columns:

            Y = -$200   Y = $100   Y = $200
X = $0          0          1/9        3/9
X = $100       1/9         2/9        2/9

So S can take the values {−200, −100, 100, 200, 300}.

Notice that there are two ways that S can be $200.

5
A sum of two random variables

We can directly determine the distribution of S as:


   s      event                        P(S = s)
-$200     $0 + (-$200)                 0
-$100     $100 + (-$200)               1/9
 $100     $0 + $100                    1/9
 $200     $100 + $100 or $0 + $200     2/9 + 3/9 = 5/9
 $300     $100 + $200                  2/9

When determining the distribution of sums of random variables, we lose information about individual values and aggregate the probability of events giving the same sum.
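To make this bookkeeping concrete, here is a minimal Python sketch of the aggregation step; the joint table above is its only input:

```python
from collections import defaultdict
from fractions import Fraction

# Joint distribution of (X, Y): keys are (x, y) pairs, values are probabilities.
joint = {
    (0, -200): Fraction(0, 9), (0, 100): Fraction(1, 9), (0, 200): Fraction(3, 9),
    (100, -200): Fraction(1, 9), (100, 100): Fraction(2, 9), (100, 200): Fraction(2, 9),
}

# Aggregate probability over all (x, y) pairs that give the same sum s = x + y.
dist_S = defaultdict(Fraction)
for (x, y), prob in joint.items():
    dist_S[x + y] += prob

for s in sorted(dist_S):
    print(s, dist_S[s])  # e.g. P(S = 200) = 5/9
```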

6
Topic: binomial distribution

A binomial random variable can be constructed as the sum of independent Bernoulli random variables.

Familiarity with the binomial distribution eases many practical probability calculations.

See OpenIntro sections 3.4 and 3.6.4.

7
Sums of Bernoulli RVs

When rolling two dice, what is the probability of rolling two ones?

By independence we can calculate this probability as

P(1, 1) = (1/6)(1/6) = 1/36.

Now with three dice, what is the probability of rolling exactly two 1’s?

8
Sums of Bernoulli RVs (cont’d)

The event A = “rolling a one” can be described as a Bernoulli random variable with p = 1/6.

We can denote the three independent rolls by writing

Xi ∼ Bernoulli(p), i = 1, 2, 3, iid.

The notation iid is shorthand for “independent and identically distributed”.

Determining the probability of rolling exactly two 1’s can be done by considering the random variable Y = X1 + X2 + X3 and asking for P(Y = 2).

9
Sums of Bernoulli random variables (cont’d)

Consider the distribution of Y = X1 + X2 + X3 .

Event                  y    P(Y = y)

000                    0    (1 − p)^3

001 or 100 or 010      1    (1 − p)(1 − p)p + p(1 − p)(1 − p) + (1 − p)p(1 − p)

011 or 110 or 101      2    (1 − p)p^2 + p^2(1 − p) + p(1 − p)p

111                    3    p^3

Remember that for this example p = 1/6.

10
Sums of Bernoulli random variables (cont’d)
Determining the probability of a certain number of successes requires
knowing 1) the probability of each individual success and 2) the number
of ways that number of successes can arise.

Event                  y    P(Y = y)

000                    0    (1 − p)^3

001 or 100 or 010      1    3(1 − p)^2 p

011 or 110 or 101      2    3(1 − p)p^2

111                    3    p^3

We find that P(Y = 2) = 3p^2(1 − p) = 3(1/36)(5/6) = 5/72.

11
Sums of Bernoulli random variables (cont’d)
What if we had four rolls, and the probability of success was 1/3?

0000
1000
0100
1100
0010
1010
0110
1110
0001
1001
0101
1101
0011
1011
0111
1111

12
Sums of Bernoulli random variables (cont’d)

Summing up the probabilities for each of the values of Y , we find:

y    P(Y = y)
0    (1 − p)^4
1    4(1 − p)^3 p
2    6(1 − p)^2 p^2
3    4(1 − p)p^3
4    p^4

Substituting p = 1/3 we can now find P(Y = y) for any y = 0, 1, 2, 3, 4.

13
Definition: N choose y

The number of ways we can arrange y successes among N trials can be calculated efficiently by a computer. We denote this number with a special expression.

N choose y

The notation

(N choose y) = N! / ((N − y)! y!)

designates the number of ways that y items can be assigned to N possible positions.

This notation can be used to summarize the entries in the previous tables
for various values of N and y .

14
Definition: Binomial distribution

Binomial distribution

A random variable Y has a binomial distribution with parameters N and p if its probability distribution function is of the form

p(y) = (N choose y) p^y (1 − p)^(N−y)

for integer values of y between 0 and N.
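As a quick sketch, this pmf translates directly into Python via the built-in math.comb; for example, it reproduces the four-roll table from two slides back with p = 1/3:

```python
from math import comb

def binom_pmf(y: int, N: int, p: float) -> float:
    """P(Y = y) for Y ~ Bin(N, p)."""
    return comb(N, y) * p**y * (1 - p)**(N - y)

for y in range(5):
    print(y, round(binom_pmf(y, N=4, p=1/3), 4))
```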

15
Example: drunk batter
What is the probability that our alcoholic major-leaguer gets more than 2
hits in a game in which he has 5 at bats?

Let X = “number of hits”. We model X as a binomial random variable with parameters N = 5 and p = 0.316.

x    P(X = x)
0    (1 − p)^5
1    5(1 − p)^4 p
2    10(1 − p)^3 p^2
3    10(1 − p)^2 p^3
4    5(1 − p)p^4
5    p^5

Substituting p = 0.316 we calculate P(X > 2) = 0.185.
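A quick numerical check of this answer (a sketch using only the standard library):

```python
from math import comb

N, p = 5, 0.316
# P(X > 2) = P(X = 3) + P(X = 4) + P(X = 5)
prob = sum(comb(N, x) * p**x * (1 - p)**(N - x) for x in range(3, N + 1))
print(round(prob, 3))  # 0.185
```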


16
Example: winning a best-of-seven play-off

Assume that the Chicago Bulls have probability 0.4 of beating the Miami
Heat in any given game and that the outcomes of individual games are
independent.

What is the probability that the Bulls win a seven game series against the
Heat?

17
Example: winning a best-of-seven play-off (cont’d)

Consider the number of games won by the Bulls over a full seven games
against the Heat. We model this as a binomial random variable Y with
parameters N = 7 and p = 0.4, which we express with the notation

Y ∼ Bin(7, 0.4).

The symbol “∼” is read “distributed as”. “Bin” is short for “binomial”.
The numbers which follow are the values of the two binomial parameters,
the number of independent Bernoulli trials (N) and the probability of
success at each trial (p).

18
Example: winning a best-of-seven play-off (cont’d)

Although we never see all seven games played (because the series stops
as soon as one team wins four games) we note that in this expanded
event space

- any event with at least four Bulls wins corresponds to an observable Bulls series win,

- any event corresponding to an observed Bulls series win has at least four total Bulls wins.

19
Example: winning a best-of-seven play-off (cont’d)

For example, the observable sequence 011011 (where a 1 stands for a Bulls win) has two possible completions, 0110110 or 0110111. Any hypothetical games played beyond the series-ending fourth win can only increase the total number of wins tallied by Y.

Conversely, the sequence 1010111 is an event corresponding to Y = 5 and we can associate it with the observable subsequence 101011, a Bulls series win in six games.

20
Example: winning a best-of-seven play-off (cont’d)

Therefore, the events corresponding to “Bulls win the series” are precisely those corresponding to Y ≥ 4.

We may conclude that the probability of a series win for the Bulls is

P(Y ≥ 4) = P(Y = 4) + P(Y = 5) + P(Y = 6) + P(Y = 7) = 0.29.

21
Example: winning a best-of-seven play-off (cont’d)

We can arrive at this answer without reference to the binomial random variable Y if we are willing to do our own counting.

P(Bulls series win) = p^4 + (4 choose 3) p^4 (1 − p) + (5 choose 3) p^4 (1 − p)^2 + (6 choose 3) p^4 (1 − p)^3
                    = p^4 + 4p^4 (1 − p) + 10p^4 (1 − p)^2 + 20p^4 (1 − p)^3
                    = 0.29.

This calculation explicitly accounts for the fact that Bulls series wins
necessarily conclude with a Bulls game win.
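Both routes can be verified numerically. In the sketch below, the second method sums over the game (4, 5, 6, or 7) in which the fourth Bulls win occurs; the first g − 1 games must then contain exactly three Bulls wins:

```python
from math import comb

p = 0.4

# Method 1: tail probability of Y ~ Bin(7, 0.4).
tail = sum(comb(7, y) * p**y * (1 - p)**(7 - y) for y in range(4, 8))

# Method 2: the series ends with the fourth win in game g = 4, 5, 6 or 7.
direct = sum(comb(g - 1, 3) * p**4 * (1 - p)**(g - 4) for g in range(4, 8))

print(round(tail, 4), round(direct, 4))  # both 0.2898
```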

22
Example: double lottery winners

In 1971, Jane Adams won the lottery twice in one year! If you read of a
double winner in your daily newspaper, how surprised should you be?

To answer this question we need to make some assumptions. Consider 40 state lotteries. Assume that each ticket has a 1 in 18 million chance of winning, that each lottery has 1 million people who play it daily (say, 250 times a year), and that each player buys 5 tickets each time.

Given these conditions, what is the probability that in one calendar year
there is at least one double winner?

23
Example: double lottery winners (cont’d)

Let Xi be the random variable denoting how many winning tickets person
i has:

Xi ∼ Binomial(5(250), p = (1/18) × 10^−6).

Now let Yi be the dummy variable for the event Xi > 1, which is the
event that person i is a double (or more) winner:

Yi ∼ Bernoulli(q).

We can compute q = 1 − P(Xi = 0) − P(Xi = 1) = 2.4 × 10^−9.

24
Example: double lottery winners (cont’d)

To account for the million people playing the lottery in each of 40 states, we consider Z = Y1 + Y2 + · · · + YN, which is another binomial random variable:

Z ∼ Binomial(N = 4 × 10^7, q).

Finally, the probability that Z > 0 can be found as

1 − P(Z = 0) = 1 − (1 − q)^N ≈ 1/11.

Not so rare!
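A sketch of the full calculation in Python, under exactly the assumptions above:

```python
p = 1 / 18e6   # chance that any single ticket wins
n = 5 * 250    # tickets one player buys over the year

# q = P(a given player has two or more winning tickets)
q = 1 - (1 - p)**n - n * p * (1 - p)**(n - 1)
print(q)       # about 2.4e-9

N = 40 * 10**6  # one million players in each of 40 state lotteries
print(1 - (1 - q)**N)  # about 0.09, i.e. roughly 1 in 11
```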

25
Example: rural vs. urban hospitals

About as many boys as girls are born in hospitals. At small, rural Country Hospital only a few babies are born every week. At urban City General, many babies are born every week. Say that a normal week is one where between 45% and 55% of the babies are female. An unusual week is one where more than 55% are girls or more than 55% are boys.

Which of the following is true?

- Unusual weeks occur equally often at Country Hospital and at City General.

- Unusual weeks are more common at Country Hospital than at City General.

- Unusual weeks are less common at Country Hospital than at City General.

26
Example: rural vs. urban hospital (cont’d)

We can model the births in the two hospitals as two independent random
variables. Let X = “number of baby girls born at Country Hospital” and
Y =“number of baby girls born at City General”.

X ∼ Binomial(N1 , p)

Y ∼ Binomial(N2 , p)

Assume that p = 0.5. The key difference is that N1 is much smaller than
N2 . To illustrate, assume that N1 = 20 and N2 = 500.

27
Example: rural vs. urban hospital (cont’d)

During a usual week at the rural hospital between 0.45N1 = 0.45(20) = 9 and 0.55N1 = 0.55(20) = 11 baby girls are born.

The probability of a usual week is P(9 ≤ X ≤ 11) ≈ 0.50, so the probability of an unusual week is

1 − P(9 ≤ X ≤ 11) = P(X < 9) + P(X > 11) ≈ 0.5.

Note: satisfying the condition X < 9 is the same as not satisfying the
condition X ≥ 9; strict versus non-strict inequalities make a difference.

28
Example: rural vs. urban hospital (cont’d)

[Figure: “Country Hospital” probability histogram of the number of girl births in a week, 0–20.]

29
Example: rural vs. urban hospital (cont’d)

In a usual week at the city hospital between 0.45N2 = 0.45(500) = 225 and 0.55N2 = 0.55(500) = 275 baby girls are born.

Then the probability of a usual week is P(225 ≤ Y ≤ 275) = 0.978, so the probability of an unusual week is

1 − P(225 ≤ Y ≤ 275) = P(Y < 225) + P(Y > 275) = 0.022.
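A sketch reproducing both hospitals’ numbers directly from the binomial pmf:

```python
from math import comb

def pmf(k: int, n: int, p: float = 0.5) -> float:
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Country Hospital: N1 = 20, a usual week has 9 to 11 girls.
usual_small = sum(pmf(k, 20) for k in range(9, 12))
# City General: N2 = 500, a usual week has 225 to 275 girls.
usual_big = sum(pmf(k, 500) for k in range(225, 276))

print(round(1 - usual_small, 3))  # about 0.50: unusual weeks are common
print(round(1 - usual_big, 3))    # about 0.022: unusual weeks are rare
```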

30
Example: rural vs. urban hospital (cont’d)

[Figure: “City General” probability histogram of the number of girl births in a week, roughly 200–290.]

31
Variance of a sum of independent random variables

A useful fact:

Variance of linear combinations of independent random variables

A weighted sum/difference of independent random variables, Y = a1X1 + a2X2 + · · · + amXm, has variance

V(Y) = a1^2 V(X1) + a2^2 V(X2) + · · · + am^2 V(Xm).

How can this be used to derive the expression for the variance of a binomial random variable?

32
Variance of binomial random variable

Variance of a binomial random variable

A binomial random variable X with parameters N and p has variance

V(X) = Np(1 − p).
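This answers the question on the previous slide: write X = X1 + X2 + · · · + XN as a sum of N independent Bernoulli(p) random variables, each with variance p(1 − p). The rule for variances of linear combinations (with all ai = 1) then gives

V(X) = V(X1) + · · · + V(XN) = Np(1 − p).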

33
Variance of a proportion
By dividing through by the total number of babies born each week we
can consider the proportion of girl babies. Define the random variables

P1 = X/N1 and P2 = Y/N2.

Then it follows that

V(P1) = V(X)/N1^2 = N1 p(1 − p)/N1^2 = p(1 − p)/N1

and

V(P2) = V(Y)/N2^2 = N2 p(1 − p)/N2^2 = p(1 − p)/N2.

34
Law of Large Numbers

An arithmetical average of random variables is itself a random variable.

As more and more individual random variables are averaged up, the
variance decreases but the mean stays the same.

As a result, the distribution of the averaged random variable becomes more and more concentrated around its expected value.
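For the sample proportion this shrinkage is visible in the formula V = p(1 − p)/N from the previous slide; a small sketch of the standard deviation for the sample sizes plotted on the next slides:

```python
p = 0.7
for N in (10, 20, 50, 150, 300):
    sd = (p * (1 - p) / N) ** 0.5   # standard deviation of the sample proportion
    print(N, round(sd, 3))
```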

35
Law of Large Numbers

Distribution of sample proportion (N = 10, p = 0.7)


[Histogram: the probability mass is spread widely around 0.7.]

36
Law of Large Numbers

Distribution of sample proportion (N = 20, p = 0.7)


[Histogram: the probability mass is more concentrated around 0.7.]

37
Law of Large Numbers

Distribution of sample proportion (N = 50, p = 0.7)


[Histogram: the probability mass is concentrated still more tightly around 0.7.]

38
Law of Large Numbers

Distribution of sample proportion (N = 150, p = 0.7)


[Histogram: the probability mass is tightly concentrated around 0.7.]

39
Law of Large Numbers

Distribution of sample proportion (N = 300, p = 0.7)


[Histogram: the probability mass is very tightly concentrated around 0.7.]

40
Example: Schlitz Super Bowl taste test

41
Bell curve approximation to binomial
The binomial distributions can be approximated by a smooth density
function for large N.

Normal approximation for binomial distribution with N = 20, p = 0.5


[Figure: Bin(20, 0.5) probability mass (bars, 0–20) with the approximating normal density curve overlaid.]

42
Bell curve approximation to binomial

Normal approximation for binomial distribution with N = 60, p = 0.1


[Figure: Bin(60, 0.1) probability mass (bars, 0–16 shown) with the approximating normal density curve overlaid.]

43
Bell curve approximation to binomial

Normal approximation for binomial distribution with N = 500, p = 0.8


[Figure: Bin(500, 0.8) probability mass (bars, 340–460 shown) with the approximating normal density curve overlaid.]

What are some reasons that very small p or small N lead to bad
approximations?
44
Central limit theorem

The normal distribution can be “justified” via its relationship to the binomial distribution. Roughly: if a random outcome is the combined result of many individual random events, its distribution will follow a normal curve.

The quincunx or Galton box is a device which physically simulates such a scenario using ball bearings and pins stuck in a board.

The CLT can be stated more precisely, but the practical impact is just
this: random variables which arise as sums of many other random
variables (not necessarily normally distributed) tend to be normally
distributed.

45
Normal distributions
The normal family of densities has two parameters, typically denoted µ and σ^2, which govern the location and scale, respectively.

Gaussian densities for various location parameters


[Figure: several bell-shaped densities of identical shape, shifted along the x-axis.]

46
Normal distributions (cont’d)
I will use the terms normal distribution, normal density and normal
random variable more or less interchangeably.

Mean-zero Gaussian densities with differing scale parameters


[Figure: mean-zero bell-shaped densities with differing spreads.]

The normal distribution is also called the Gaussian distribution or the bell curve.
47
Normal means and variances

Mean and variance of a normal random variable

A normal random variable X, with parameters µ and σ^2, is denoted

X ∼ N(µ, σ^2).

The mean and variance of X are

E(X) = µ,
V(X) = σ^2.

The density function is symmetric and unimodal, so the median and mode of X are also given by the location parameter µ. The standard deviation of X is given by σ.

48
Normal approximation to binomial

The binomial distributions can be approximated by a normal distribution.

Normal approximation to the binomial

A Bin(N, p) distribution can be approximated by a N(Np, Np(1 − p)) distribution for N “large enough”.

Notice that this just “matches” the mean and variance of the two
distributions.
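A sketch comparing the exact Bin(20, 0.5) probabilities with the matched N(10, 5) density (the first example plotted earlier):

```python
from math import comb, exp, pi, sqrt

N, p = 20, 0.5
mu, var = N * p, N * p * (1 - p)

for k in range(8, 13):
    exact = comb(N, k) * p**k * (1 - p)**(N - k)
    approx = exp(-(k - mu)**2 / (2 * var)) / sqrt(2 * pi * var)  # normal density at k
    print(k, round(exact, 4), round(approx, 4))
```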

49
Linear transformation of normal RVs

We can add a fixed number to a normal random variable and/or multiply it by a fixed number and get a new normal random variable. This sort of operation is called a linear transformation.

Linear transformation of normal random variables

If X ∼ N(µ, σ^2) and Y = a + bX for fixed numbers a and b, then

Y ∼ N(a + bµ, b^2 σ^2).

For example, if X ∼ N(1, 2) and Y = 3 − 5X, then Y ∼ N(−2, 50).

50
Standard normal RV

Standard normal

A standard normal random variable is one with mean 0 and variance 1. It is often denoted by the letter Z:

Z ∼ N(0, 1).

We can write any normal random variable as a linear transformation of a standard normal RV. For a normal random variable X ∼ N(µ, σ^2), we can write

X = µ + σZ.
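Standardization is also how normal probabilities are computed in practice; here is a minimal sketch built on the error function in Python’s math module (the 68% figure anticipates the empirical rule on the next slides):

```python
from math import erf, sqrt

def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """P(X <= x) for X ~ N(mu, sigma^2), via Z = (X - mu) / sigma."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# P(mu - sigma <= X <= mu + sigma) is the same for every mu and sigma:
print(normal_cdf(1) - normal_cdf(-1))  # about 0.6827
```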

51
The “empirical rule”
It is convenient to characterize where the “bulk” of the probability mass
of a normal distribution resides by providing an interval, in terms of
standard deviations, about the mean.

[Figure: N(µ, σ) density; the interval from µ − σ to µ + σ contains 68% of the probability.]

52
The “empirical rule” (cont’d)
The widespread application of the normal distribution has led this to be dubbed the empirical rule.

[Figure: N(µ, σ) density; the interval from µ − 2σ to µ + 2σ contains 95% of the probability.]

53
The “empirical rule” (cont’d)

It is, for obvious reasons, sometimes called the 68-95-99.7 rule.

[Figure: N(µ, σ) density; the interval from µ − 3σ to µ + 3σ contains 99.7% of the probability.]

54
The “empirical rule” (cont’d)

To revisit some earlier examples:

- 68% of Chicago daily highs in the winter season are between 19 and 48 degrees.

- 95% of NBA players are between 6ft and 7ft 2in.

- In 99.7% of weeks, the proportion of baby girls born at City General is between 0.4985 and 0.5015.

55
Sums of normal random variables

Weighted sums of normal random variables are also normally distributed.

For example if

X1 ∼ N(5, 20) and X2 ∼ N(1, 0.5)

then for Y = 0.1X1 + 0.9X2,

Y ∼ N(m, v),

where m = 0.1(5) + 0.9(1) = 1.4 and v = 0.1^2(20) + 0.9^2(0.5) = 0.605.

56
Linear combinations of normal RVs
Linear combinations of independent normal random variables

For i = 1, . . . , n, let

Xi ∼ N(µi, σi^2), independently.

Define Y = a1X1 + a2X2 + · · · + anXn for weights a1, a2, . . . , an. Then

Y ∼ N(m, v)

where

m = a1µ1 + · · · + anµn and v = a1^2σ1^2 + · · · + an^2σn^2.

57
Example: two-stock portfolio

Consider two stocks, A and B, with annual returns (in percent of investment) distributed according to normal distributions

XA ∼ N(5, 20) and XB ∼ N(1, 0.5).

What fraction of our investment should we put into stock A, with the
remainder put in stock B?

58
Example: two-stock portfolio (cont’d)

For a given fraction α, the total return on our portfolio is

Y = αXA + (1 − α)XB

with distribution

Y ∼ N(m, v),

where m = 5α + (1 − α) and v = 20α^2 + 0.5(1 − α)^2.

59
Example: two-stock portfolio (cont’d)
Suppose we want to find α so that P(Y ≤ 0) is as small as possible.

[Figure: “Two-stock portfolio”: densities of percent return for Stock A and Stock B, with blue densities in between corresponding to varying values of α.]


60
Example: two-stock portfolio (cont’d)
We can plot the probability of a loss as a function of α.

[Figure: “Probability of a loss” as a function of α, for α from 0 to 1; the curve attains its minimum near α = 0.11.]

We see that this probability is minimized when α = 11% approximately.
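A sketch of this search in Python, scanning a grid of α values (the normal CDF is built from the error function, so nothing beyond the standard library is needed):

```python
from math import erf, sqrt

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def loss_prob(alpha: float) -> float:
    """P(Y <= 0) for Y = alpha*XA + (1 - alpha)*XB, XA ~ N(5, 20), XB ~ N(1, 0.5)."""
    m = 5 * alpha + (1 - alpha)
    v = 20 * alpha**2 + 0.5 * (1 - alpha)**2
    return normal_cdf(0, m, sqrt(v))

best = min((i / 1000 for i in range(1001)), key=loss_prob)
print(best, round(loss_prob(best), 4))  # about 0.111 and 0.0357
```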


This is the LLN at work!

61
Variance of a sum of correlated random variables

For correlated (dependent) random variables, we have a modified formula:

Variance of linear combinations of two correlated random variables

A weighted sum/difference of random variables Y = a1X1 + a2X2 has variance

V(Y) = a1^2 V(X1) + a2^2 V(X2) + 2a1a2 Cov(X1, X2).

There is a homework problem that asks you to find the variance of portfolios of stocks, as in the example above, for stocks which are related to one another (in a common industry, for example).

62
Vignettes

1. Differential dispersion

2. Average number of sex partners

3. Mean reversion

63
Vignette: a difference in dispersion

In this vignette we observe how selection (in the sense of evolution, or hiring, or admissions) can turn higher variability into over-representation. The analysis uses the ideas of random variables, distribution functions, and conditional probability.

For more background, read the article “Sex Ed” from the February 2005
issue of the New Republic (available at the course home page).

64
A difference in dispersion
Consider two groups of college graduates with “employee fitness scores”
following the distributions shown below.

Distribution of Capabilities, Group A

Score        −5     −4     −3     −2     −1      0      1      2      3      4      5
Probability  0.043  0.051  0.064  0.085  0.128  0.256  0.128  0.085  0.064  0.051  0.043

Distribution of Capabilities, Group B

Score        −5     −4     −3     −2     −1      0      1      2      3      4      5
Probability  0.003  0.008  0.023  0.063  0.171  0.464  0.171  0.063  0.023  0.008  0.003

These distributions have the same mean, the same median, and the same mode. But they differ in their dispersion, or variability.

65
A difference in dispersion (cont’d)
Let X denote the random variable recording the scores and let A and B denote membership in the respective groups.

[The same Group A and Group B distributions as on the previous slide.]

V(X | A) = 5.87 and V(X | B) = 1.666.

The corresponding standard deviations are σ(X | A) = 2.42 and σ(X | B) = 1.29.

66
A difference in dispersion (cont’d)
But now consider only elite jobs, for which it is necessary that fitness
score X ≥ 4.

[The same Group A and Group B distributions as above.]

We can use Bayes’ rule to calculate P(A | X ≥ 4) and P(B | X ≥ 4).

67


A difference in dispersion (cont’d)

If we assume a priori that P(A) = P(B) = 1/2, we find

P(A | X ≥ 4) = P(X ≥ 4 | A)P(A) / [P(X ≥ 4 | A)P(A) + P(X ≥ 4 | B)P(B)]
             = 0.094(0.5) / [0.094(0.5) + 0.012(0.5)]
             = 0.89.

Why don’t we need to calculate P(B | X ≥ 4) separately?
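Because the two posteriors are complements given X ≥ 4, P(B | X ≥ 4) = 1 − P(A | X ≥ 4). A short check of the Bayes calculation:

```python
p_tail_A = 0.094   # P(X >= 4 | A) = 0.051 + 0.043
p_tail_B = 0.012   # P(X >= 4 | B), as on the previous slide
prior = 0.5        # equal prior on the two groups

post_A = p_tail_A * prior / (p_tail_A * prior + p_tail_B * prior)
print(round(post_A, 2), round(1 - post_A, 2))  # 0.89 and 0.11
```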

68
Larry Summers and women-in-science
“Summers’s critics have repeatedly mangled his suggestion that
innate differences might be one cause of gender disparities ... into
the claim that they must be the only cause. And they have
converted his suggestion that the statistical distributions of men’s
and women’s abilities are not identical to the claim that all men are
talented and all women are not–as if someone heard that women
typically live longer than men and concluded that every woman lives
longer than every man. . . .

In many traits, men show greater variance than women, and are
disproportionately found at both the low and high ends of the
distribution. Boys are more likely to be learning disabled or retarded
but also more likely to reach the top percentiles in assessments of
mathematical ability, even though boys and girls are similar in the
bulk of the bell curve. . . .”

Steven Pinker in The New Republic


69
Example: gender and aptitudes revisited
Assume that job “aptitude” can be represented as a continuous random variable and that the distribution of scores differs by gender.

[Figure: “Aptitude distribution”: score densities for women and men; the men’s density is more spread out, and vertical dashed lines mark a central range of scores.]

For women, 93.7% of the scores are between the vertical dashed lines, whereas only 68.6% of the men’s scores fall in this range.

70
Example: gender and aptitudes revisited (cont’d)
The corresponding CDFs reveal the same difference.

[Figure: cumulative distribution functions of the two score distributions.]

These distributions are meant to be illustrative rather than factual.


71
Sex partners vignette: which average?

Here is a torn-from-the-headlines example of why it pays to know a little probability.

“Everyone knows men are promiscuous by nature... Surveys bear this out. In study after study and in country after country, men report more, often many more, sexual partners than women...

But there is just one problem, mathematicians say. It is logically impossible for heterosexual men to have more partners on average than heterosexual women. Those survey results cannot be true.”

72
A sex-partners statistical model

Question: is it possible for men to have more sex partners, on average, than women?

To answer this question, we will consider a “toy” probability model for homo sapiens mating behavior. Each cell gives the probability that the woman in that row partners with the man in that column:

           John   Lenny   Romeo
Sally      0.07   0.06    0.05
Chastity   0.50   0.50    0.50
Maude      0.05   0.04    0.09

Let’s call it the “summer camp” model.

73
A sex-partners random variable

The quantity of interest is the number of sex partners. In our model, this will be a number between 0 and 3.

For each individual we can compute the distribution of this random variable. We will denote individuals by their first initials and describe each event by the set of partners it involves (the original slides colored an initial red if the pair partnered and black if they did not).

We will assume independence. This means, for example, that Sally hooking up with Romeo makes it neither more nor less likely that she will hook up with Lenny.

74
Sally’s sex-partner distribution

Event (Sally’s partners)     x    P(Xs = x)

none                         0    (1 − 0.07)(1 − 0.06)(1 − 0.05)

J only, L only, or R only    1    (0.07)(1 − 0.06)(1 − 0.05) + (1 − 0.07)(0.06)(1 − 0.05) + (1 − 0.07)(1 − 0.06)(0.05)

JL, LR, or JR                2    (0.07)(0.06)(1 − 0.05) + (1 − 0.07)(0.06)(0.05) + (0.07)(1 − 0.06)(0.05)

JLR                          3    (0.07)(0.06)(0.05)

Can you see the probability laws in action here?


75
Sally’s sex-partner distribution

Event (Sally’s partners)     x    ps(x) = P(Xs = x)

none                         0    0.83

exactly one of J, L, R       1    0.16

exactly two of J, L, R       2    0.01

JLR                          3    0.0002

Here is what it looks like after the calculation (rounded a bit). We can do similarly for each individual.

76
Sally’s sex-partners distribution
Here is a picture of Sally’s sex partner distribution.

[Figure: bar chart, “Distribution of sex partners for Sally”: P(0) = 0.8305, P(1) = 0.1592, P(2) = 0.0101, P(3) = 0.0002.]

The mean is 0(0.83) + 1(0.16) + 2(0.01) + 3(0.0002) = 0.18. What is the mode? What is the median?
77
Female sex-partner distribution

To get the distribution for all females, we sum over the individual women.
We apply the law of total probability using all three conditional
distributions:

pfemale (x) = ps (x)P(Sally) + pc (x)P(Chastity) + pm (x)P(Maude).

We assume that the women are selected at random with equal probability
P(Maude) = P(Chastity) = P(Sally) = 1/3.
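A sketch that carries out the whole mixture calculation, for the women and the men alike, from the “summer camp” table; it reproduces the distributions plotted on the next two slides:

```python
from itertools import product

# Partnering probabilities from the "summer camp" table: rows are women,
# columns are (John, Lenny, Romeo).
women = {
    "Sally":    (0.07, 0.06, 0.05),
    "Chastity": (0.50, 0.50, 0.50),
    "Maude":    (0.05, 0.04, 0.09),
}

def partner_dist(probs):
    """Distribution of the number of partners, assuming independent pairings."""
    dist = [0.0] * (len(probs) + 1)
    for outcome in product([0, 1], repeat=len(probs)):
        pr = 1.0
        for happened, p in zip(outcome, probs):
            pr *= p if happened else 1 - p
        dist[sum(outcome)] += pr
    return dist

# Law of total probability with equal weights 1/3 on the three women.
female = [sum(partner_dist(ps)[x] for ps in women.values()) / 3 for x in range(4)]

# Each man's probabilities are a column of the table.
male = [sum(partner_dist(ps)[x] for ps in zip(*women.values())) / 3 for x in range(4)]

print([round(q, 4) for q in female])  # [0.5951, 0.2315, 0.1315, 0.0418]
print([round(q, 4) for q in male])    # [0.4417, 0.4983, 0.0583, 0.0017]
```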

78
Female sex-partner distribution
At the end we get a distribution like this.

[Figure: bar chart, “Distribution of sex partners for females”: P(0) = 0.5951, P(1) = 0.2315, P(2) = 0.1315, P(3) = 0.0418.]

The mean is 0.62, the mode is 0, and the median is 0.


79
Male sex-partner distribution
We can do the same thing for the males, and we get this.

[Figure: bar chart, “Distribution of sex partners for males”: P(0) = 0.4417, P(1) = 0.4983, P(2) = 0.0583, P(3) = 0.0017.]

The mean is 0.62, the mode is 1, and the median is 1.


80
Sex-partners vignette recap

The narrow lesson is that it pays to be specific about which measure of central tendency you’re talking about!

The more general lesson is that using probability models and a little bit of algebra can help us see a situation more clearly.

This example uses the concepts of random variable, independence, conditional distribution, mean, median... and others.

81
Idea: statistical “null” hypotheses

The hypothesis that events are independent often makes a nice contrast
to other explanations, namely that random events are somehow related.

This vantage point allows us to judge if those other explanations fit the
facts any better than the uninteresting “null” explanation that events are
independent.

82
Vignette: making better pilots

Flight instructors have a policy of berating pilots who make bad landings.
They notice that good landings met with praise mostly result in
subsequently less-good landings, while bad landings met with harsh
criticism mostly result in subsequently improved landings.

Is their causal reasoning necessarily valid?

To stress-test their judgment that “criticism works” we consider the evidence in light of the null hypothesis that subsequent landings are in fact independent of one another, regardless of criticism or praise.

83
Example: making better pilots (cont’d)

Contrary to the assumptions of the instructors, consider each landing as independent of subsequent landings (irrespective of feedback).

Assume that landings can be classified into three types: bad, adequate, or good. Further assume the following probabilities:

Event       Probability
bad         pb
adequate    pa
good        pg

Remember that pb + pa + pg = 1.

84
Example: making better pilots (cont’d)
Assume that the policy of criticism is judged to work when a bad landing is followed by a not-bad landing. Then

P(criticism seems to work) = P(not bad2 | bad1) = P(not bad2) = pa + pg

by independence.

Conversely, the policy of praise appears to work when a good landing is followed by another good landing. So

P(praise seems to work) = P(good2 | good1) = P(good2) = pg.

Since pg ≤ pa + pg, with strict inequality whenever pa > 0, praise always appears to work less often than criticism!


85
Remark: null and alternative hypotheses

The previous example shows that the evidence can appear to favor
criticism over praise even if criticism and praise are totally irrelevant.

Does this mean that criticism does not work?

No, it just means that the observed facts are not compelling evidence
that criticism works, because they are entirely consistent with the null
hypothesis that landing quality is independent of previous landings and
feedback.

In cases like this we say we “fail to reject the null hypothesis”. We’ll
revisit this terminology a couple weeks from now.

86
Example: making better pilots (continuous version)

What if we want to take pilot skill into account?

We will model this situation using normal random variables and see if the
same conclusions (that praise appears to hurt performance and criticism
seems to boost it) could arise by chance.

87
Example: making better pilots (continuous version, cont’d)

Assume that each pilot has a certain ability level, call it A. Each individual landing score arises as a combination of this ability and certain random fluctuations, call them ε. The landing score at time t can be expressed as

St = A + εt.

Assuming that εt ∼ N(0, σ^2), iid, then

St ∼ N(A, σ^2).
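A Monte Carlo sketch of the argument on the next two slides; the ability level A, the population mean M = 0, and σ are made-up illustration values:

```python
import random

random.seed(0)
A, sigma = 0.5, 1.0   # hypothetical above-average pilot (population mean M = 0)

n_great = 0
worse_next_time = 0
for _ in range(1_000_000):
    e1, e2 = random.gauss(0, sigma), random.gauss(0, sigma)
    if e1 > 2 * sigma:                 # an exceptional landing, S1 = A + e1
        n_great += 1
        worse_next_time += (A + e2 < A + e1)

print(worse_next_time / n_great)  # about 0.98: the next landing is almost always worse
```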

88
Example: making better pilots (continuous version, cont’d)
Denote an average landing score as M. Consider a pilot with A > M. When he makes an exceptional landing, one with ε1 > 2σ, he is unlikely to best it on his next landing.

[Figure: density of the next score S2, centered at the pilot’s ability A; the average score M lies below A, and the exceptional score A + ε1 sits far out in the right tail.]

For this reason, praise is unlikely to “work” even though landings are independent of one another.

89
Example: making better pilots (continuous version, cont’d)
For a poor pilot with A < M a similar argument holds. When he makes a very poor landing, one with ε1 < −2σ, he is unlikely to do worse on his next landing.

[Figure: density of the next score S2, centered at A; the average score M lies above A, and the very poor score A + ε1 sits far out in the left tail.]

For this reason, criticism is likely to “work” even though landings are independent.

90
Idea: mean reversion

The previous example illustrates an idea known as mean reversion.

This name refers to the fact that subsequent observations tend to be “pulled back” towards the overall mean even if the events are independent of one another.

Mean reversion describes a probabilistic fact, not a physical process.

What might the flight instructors have done (as an experiment) to really
get to the bottom of their question?

91
