Statistical Snacks
(Copyright 2013 Metin Bektas)
Introduction:
In case you already skimmed through the book or read a sample, you might
be wondering: Is this a textbook? Or is it recreational? Well, it's both and
neither.
This book was certainly written with the intent to entertain with interesting statistical problems and ideas. But statistics cannot be fully enjoyed without understanding its inner workings. Unexpected results are great, but it's even better if you can understand how to arrive at them.
So the intent of the book is not only to entertain, but also to teach important
core concepts of statistics and how to apply them. For this reason the
snacks, as well as the whole book, are written to become more demanding
as the book nears the end. Thus I recommend reading it in order, unless you already have a sound knowledge of statistics.
Try not to rush hastily through the book. Take your time to absorb the
concepts and the applications. Don't worry if there's something you don't understand on the first try; some snacks are written to be especially challenging. And I can assure you that "getting it" on the second or third try is even more rewarding.
You will notice that the titles of the snacks include a certain number of dots. These show the difficulty of the snack with respect to the other snacks in
the chapter. The indicator ranges from one dot for relatively easy snacks to
three dots for tough ones.
It's a good idea to keep a pen and piece of paper ready when reading.
Sometimes you'll find it helpful to do your own visualization of the
problem, other times you'll think of a great variation of the problem or even
a completely new problem. Don't let these ideas go to waste, rather pursue
and share them.
A side note for Amazon Kindle users: don't set the font size too high,
otherwise the formulas will be displayed over more than one line. This can
be somewhat confusing and cause your train of thought to be unnecessarily
interrupted. The Table Of Contents has been optimized for Kindle so that
you may jump to the desired chapter without any hassle.
Hopefully this will be one of those books to which you will always return
and that will stay on your mind long after reading. Enjoy your trip through
the wondrous world of statistics.
Metin Bektas
Chapter One: The Power of Multiplication
Before we enjoy our first snacks, it's wise to look at some basics.
Multiplication is a surprisingly powerful tool in statistics. It enables us to
solve a vast number of problems with relative ease. One thing to remember
though is that the multiplication rule, to which I'll get in a bit, only works
for independent events. So let's talk about those first.
When we roll a die, there's a certain probability that the number six will show. This probability does not depend on what number we rolled before. The events "rolling a three" and "rolling a six" are independent in the sense that the occurrence of the one event does not affect the probability of the other.
Let's look at a card deck. We draw a card and note it. Afterward, we put it
back in the deck and mix the cards. Then we draw another one. Does the
event "draw an ace" in the first try affect the event "draw a king" in the
second try? It does not, because we put the ace back in the deck and mixed
the cards. We basically reset our experiment. In such a case, the events
"draw an ace" and "draw a king" are independent.
But what if we don't put the first card back in the deck? Well, when we take
the ace out of the deck, the chance of drawing a king will increase from 4 /
52 (4 kings out of 52 cards) to 4 / 51 (4 kings out of 51 cards). If we don't
do the reset, the events "draw an ace" and "draw a king" are in fact
dependent. The occurrence of one changes the probability for the other.
With this in mind, we can turn to our powerful tool called multiplication
rule. We start with two independent events, A and B. The probabilities for
their occurrence are respectively p(A) and p(B). The multiplication rule
states that the probability of both events occurring is simply the product of
the probabilities p(A) and p(B). In mathematical terms:
p(A and B) = p(A) · p(B).
A quick look at the dice will make this clear. Let's take both A and B to be
the event "rolling a six". Obviously they are independent, rolling a six on
one try will not change the probability of rolling a six in the following try.
So we are allowed to use the multiplication rule here. The probability of
rolling a six is 1/6, so p(A) = p(B) = 1/6. Using the multiplication rule, we
can calculate the chance of rolling two sixes in a row: p(A and B) = 1/6 · 1/6 = 1/36. Note that if we took A to be "rolling a six" and B to be "rolling a three", we would arrive at the same result. The chance of rolling two sixes in a row is the same as that of rolling a six and then a three.
Can we also use this on the deck of cards, even if we don't reset the
experiment? Indeed we can. But we have to take into account that the
probabilities change as we go along. In more abstract terms, instead of
looking at the general events "draw an ace" and "draw a king", we need to
look at the events A = "draw an ace in the first try" and B = "draw a king
with one ace missing". With the order of the events clearly set, there's no
chance of them interfering. The occurrence of both events, first drawing an
ace and then drawing a king with the ace missing, has the probability: p(A
and B) = p(A) · p(B) = 4/52 · 4/51 = 16/2652 or 1 in about 165 or 0.6 %.
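Because the deck is small, we can confirm this result by brute force instead of trusting the formula: enumerate every ordered pair of distinct cards and count those that are an ace followed by a king. A quick Python sketch (the card representation is of course my own choice):

```python
from fractions import Fraction
from itertools import product

# Build a 52-card deck: 13 ranks in 4 suits.
ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

# Enumerate every ordered pair of two *different* cards, i.e. drawing
# without replacement, and count the pairs "ace first, king second".
favorable = 0
total = 0
for first, second in product(deck, repeat=2):
    if first == second:
        continue
    total += 1
    if first[0] == "A" and second[0] == "K":
        favorable += 1

p = Fraction(favorable, total)   # 16/2652, which reduces to 4/663
```

The count 16/2652 matches 4/52 · 4/51 exactly.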
If you're not familiar with these calculations, give them a try. What's the
probability of rolling three sixes in a row? To arrive at the right answer take A to be "rolling two sixes in a row" as calculated above and B to be "rolling a six". What's the probability of drawing two aces in a row? Remember that
during the first draw, there are 4 aces in 52 cards and during the second
draw, you are left with 3 aces in 51 cards. What about the probability of
drawing all the aces in a row?
Of course when we start doing statistics, we need to talk about coins. Let's
first look at how likely it is to get three heads in a row. Any guesses? The
chance for a coin showing heads is p(heads) = 1/2. Since this event is so
wonderfully independent from basically anything else in the universe, we
can simply start multiplying:
p(three heads in a row) = p(heads) · p(heads) · p(heads) = (1/2)^3 = 1/8
Now what about the chance of getting heads exactly twice within three throws? One way is that the coin shows the sequence heads-heads-tails in that order. The probability for that is 1/2 · 1/2 · 1/2 = (1/2)^3 = 1/8. But the sequences
heads-tails-heads and tails-heads-heads are just as likely. And they too
satisfy our condition of showing heads twice within three throws. So we
get:
p(twice heads in three throws) = 1/8 · 3 = 3/8
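With only 2^3 = 8 possible sequences of three throws, we can simply list them all and count. A small Python sketch:

```python
from itertools import product

# All 8 equally likely outcomes of three coin throws.
sequences = list(product("HT", repeat=3))

# Keep the sequences showing exactly two heads: HHT, HTH, THH.
two_heads = [s for s in sequences if s.count("H") == 2]

p = len(two_heads) / len(sequences)   # 3/8
```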
Imagine a monkey randomly hammering away at a typewriter. What are his chances of reproducing the first two sentences of the prologue to Shakespeare's Romeo and Juliet?
"Two households, both alike in dignity, In fair Verona, where we lay our scene"
There are 26 letters in the English alphabet and since he'll be needing the comma and space, we include those as well. So there's a 1/28 chance of
getting the first character right. Same goes for the second character, third
character, etc ... Because he's typing randomly, the chance of getting a
character right is independent of what preceded it. So we can just start
multiplying:
p(reproduce) = 1/28 · 1/28 · ... · 1/28 = (1/28)^77
The result is about 4 times ten to the power of -112. This is a ridiculously
small chance! Even if he was able to complete one quadrillion tries per
millisecond, it would most likely take him considerably longer than the
estimated age of the universe to reproduce these two sentences.
Now what about the first word? It has only three letters, so he should be
able to get at least this part in a short time. The chance of randomly
reproducing the word "two" is:
p(reproduce) = 1/26 · 1/26 · 1/26 = (1/26)^3
Note that I dropped the comma and space as a choice, so now there's a 1 in
26 chance to get a character right. The result is 5.7 times ten to the power of
-5, which is about a 1 in 17500 chance. Even a slower monkey could easily
get that done within a year, but I guess it's still best to stick to human
writers.
Imagine taking a multiple choice test that has three possible answers to each
question. This means that even if you don't know any answer, your chance
of getting a question right is still 1/3. How likely is it to get all questions
right by guessing if the test contains ten questions?
Here we are looking at the event “correct answer” which occurs with a
probability of p(correct answer) = 1/3. We want to know the odds of this
event happening ten times in a row. For that we simply apply the
multiplication rule:
p(all correct) = (1/3)^10 = 0.000017
Doing the inverse, we can see that this corresponds to about 1 in 60000. So
if we gave this test to 60000 students who only guessed the answers, we
could expect only one to be that lucky. What about the other extreme? How
likely is it to get none of the ten questions right when guessing?
Now we must focus on the event “incorrect answer” which has the probability p(incorrect answer) = 2/3. The odds for this to occur ten times in a row are:
p(none correct) = (2/3)^10 = 0.017
So when guessing, there's about a 1 in 58 chance of getting every single question wrong.
Just out of statistical curiosity, how likely is it that exactly one tomato piece
ends up in the left half and the remaining eleven in the right half? Is that
more or less likely than having all pieces on the right side?
Let's name the pieces for the moment. The piece that falls first is piece 1,
the piece that falls second is piece 2, and so on. One way for only one piece
being on the left is that piece 1 falls to the left and all following to the right.
Since they fall to any side with the probability 0.5, the chance for this
outcome is also 0.00024.
But this time our event can occur in more than one way. If piece 1 falls to
the right, piece 2 to the left and all following again to the right, we end up
once more with only one piece in the left half. How many possibilities are
there in total? Obviously each of the twelve pieces can take the role of
being the outcast. So we have twelve possibilities, all occurring with the
probability 0.00024. The chance of having only one piece on the left is thus:
p(one left) = 12 · 0.00024 = 0.0029
This is about 1 in 350. With 500 pizzas a day, we can expect to get one or
even two pizzas of this kind every day. So this distribution is significantly
more likely than having all the pieces on the right half. And the
probabilities will increase further as we get closer to the uniform
distribution with six pieces on each side.
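The pizza counts can also be generated for every possible split at once. The sketch below uses Python's math.comb to count in how many ways k of the twelve pieces can end up on the left half (this is the binomial coefficient, which the book introduces formally in a later chapter):

```python
from math import comb

def p_left(k, n=12):
    """Chance that exactly k of n pieces land on the left half when
    each piece falls on either side with probability 1/2. comb(n, k)
    counts the ways to pick which k pieces are the 'outcasts'."""
    return comb(n, k) * 0.5 ** n

all_right = p_left(0)    # about 0.00024
one_left = p_left(1)     # about 0.0029, twelve times as likely
even_split = p_left(6)   # about 0.23, the most likely split
```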
I'm thinking of a number between one and ten. Can you guess it? The
chances of guessing it on first try is 1/10. What's the chance of guessing it
on second and third try? Again this is a case of applying the multiplication
rule while taking into account that the probability will change as we go
along.
So what's the probability of guessing the number on second try? For that to
happen, we first need to guess the wrong number. This occurs with a
probability of 9/10. Now you are left with nine numbers to choose from.
The chance of picking the right one is 1/9. So the probability is:
p(correct on second try) = 9/10 · 1/9 = 1/10
Again 1/10. What about the probability of making the right guess on third
try? You miss the first guess with a chance of 9/10, then you miss the
second guess with a chance of 8/9 and finally you get the right number with
a chance of 1/8. So:
p(correct on third try) = 9/10 · 8/9 · 1/8 = 1/10
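To convince yourself that every try really has the same 1/10 chance, you can let exact fractions do the bookkeeping. A sketch (the function name is mine):

```python
from fractions import Fraction

def p_correct_on_try(k, n=10):
    """Chance of guessing the number exactly on the k-th try:
    miss the first k-1 guesses, then hit among the n-k+1 numbers left."""
    p = Fraction(1)
    for i in range(k - 1):
        p *= Fraction(n - 1 - i, n - i)   # another miss
    return p * Fraction(1, n - k + 1)     # the final hit

# Every try from the first to the tenth has the same chance, 1/10.
```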
We want to pick a lock that is secured by a three digit code. What are the
odds of us doing so on first try? And within the first one hundred tries? To
answer these questions, we need to look at yet another powerful
multiplication rule.
Before we turn to the lock, a warm-up: how many different five-character "words" can we form from the 26 letters of the alphabet? The answer again is to start multiplying. For each slot there are 26 possibilities, so the total number of words we can construct using five characters is: 26 · 26 · 26 · 26 · 26 = 11881376.
What does this have to do with our lock? To calculate the probability of
picking it on first try, we need to know how many possible combinations
there are. The lock has three slots and for each slot we have ten options
(from 0 to 9). So the number of possible combinations is: 10 · 10 · 10 =
1000 and the probability of getting the correct one on first try is 1/1000.
So we probably won't be lucky on first try. But what about within the first
one hundred tries? As in the guessing of the random numbers, each try is
equal. You can easily verify that by using the multiplication rule to calculate
the chance of picking the lock on second try, third try, etc ... So let's sum all
the probabilities for the first one hundred tries:
p(within one hundred) = 1/1000 · 100 = 1/10
While reading about the lock, you might be reminded of slot-machines used
in casinos. Just like our lock, they offer three slots, but usually have more
options for each slot. Today's standard is about 20 options per slot, so the
number of possible outcomes is: 20 · 20 · 20 = 8000, reducing our chance
of hitting the jackpot to 1/8000.
Chen and Thomas often challenge each other in a strategy computer game.
An analysis of the duels so far shows that Chen won 55 % of the matches,
while Thomas won the other 45 %. So Chen has a 10 % edge.
To make things more interesting, they put some money in a pot and decide
to play three rounds. Whoever wins the most out of the three rounds gets
the pot. What is Chen's chance of taking the pot home?
One way for Chen to win the game is to win all three rounds. We symbolize
this by the sequence: C – C – C. Using the multiplication rule we calculate
the odds of this outcome:
p(3/3 rounds) = 0.55 · 0.55 · 0.55 = 0.166
He also wins the game if he is victorious in two out of three rounds. One
such sequence is C – C – T and the respective probability is: 0.55 · 0.55 ·
0.45. But let's not forget that there are two other possible sequences: C – T
– C and T – C – C. All have the same probability, so the odds for winning
two out of three rounds are:
p(2/3 rounds) = 3 · 0.55 · 0.55 · 0.45 = 0.408
Summing these two probabilities gives us the chance of Chen taking the pot
home:
p(Chen) = 0.574 = 57.4 %
Of course this also means that Thomas' chance of winning the pot is
p(Thomas) = 0.426 = 42.6 %. Did you notice what happened to the initial
edge? The 10 % edge for winning a single round turned into a 15 % edge
for winning the entire game. For Thomas this is unfavorable, but it shows
that when you want to sift out the better player, it makes more sense to have
a game of three rounds than just a single match.
What happens if they play five instead of three rounds? Does this increase
the edge of winning the game even further? My gut tells me yes, but as you
will see throughout this book, the gut is not always to be trusted in
statistics. So we better crunch the numbers. After a lot of counting this leads
to p(Chen) = 59.3 % and p(Thomas) = 40.7 %. So indeed the edge increases
to about 19 % for a game of five rounds.
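The "lot of counting" can be packed into a small function that works for any odd number of rounds. A sketch, using Python's math.comb to count the winning sequences the same way we counted them by hand above:

```python
from math import comb

def p_win_series(p, rounds):
    """Chance that a player who wins a single round with probability p
    wins the majority of an odd number of rounds (draws excluded)."""
    need = rounds // 2 + 1
    return sum(comb(rounds, k) * p ** k * (1 - p) ** (rounds - k)
               for k in range(need, rounds + 1))

chen_3 = p_win_series(0.55, 3)   # about 0.574
chen_5 = p_win_series(0.55, 5)   # about 0.593
```

With p = 0.5 the function returns exactly 0.5 for any number of rounds, as it should for evenly matched players.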
Each year in the US there are about 5 homicides per 100000 people, so the
probability of falling victim to a homicide in a given year is 0.00005 or 1 in
20000. What are the chances of falling victim to a homicide over a lifespan
of 70 years?
Let's approach this the other way around. The chance of not becoming a
homicide victim during one year is p = 0.99995. Using the multiplication
rule we can calculate the probability of this event occurring 70 times in a
row:
p = 0.99995 · ... · 0.99995 = 0.99995^70
Thus the odds of not becoming a homicide victim over the course of 70
years are 0.9965. This of course also means that there's a 1 - 0.9965 =
0.0035, or 1 in 285, chance of falling victim to a homicide during a life
span. How does this compare to other countries?
In Germany, the homicide rate is about 0.8 per 100000 people. Doing the
same calculation gives us a 1 in 1800 chance of becoming a murder victim.
At the other end of the scale is Honduras with 92 homicides per 100000
people, which translates into a saddening 1 in 16 chance of becoming a
homicide victim over the course of a life.
It can get even worse if you live in a particularly crime ridden part of a
country. The homicide rate for the city San Pedro Sula in Honduras is about
160 per 100000 people. If this remained constant over time and you never
left the city, you'd have a 1 in 9 chance of having your life cut short in a
homicide.
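The same two-step calculation works for any rate, so it's worth packing into a small helper. A sketch (the constant-rate and independent-years assumptions are the same ones made in the text):

```python
def lifetime_risk(rate_per_100k, years=70):
    """Chance of falling victim at least once over `years` years,
    assuming a constant annual rate and independent years."""
    p_year = rate_per_100k / 100000
    return 1 - (1 - p_year) ** years

usa = lifetime_risk(5)        # about 1 in 286
germany = lifetime_risk(0.8)  # about 1 in 1800
honduras = lifetime_risk(92)  # about 1 in 16
```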
There's one important type of problem we didn't look at yet. I'm talking about the unnecessarily dreaded "at least"-questions. Paul enjoys playing
tennis, though he's not particularly good at it. Of all his matches, he won
only 10 %. Now he wants to enter a tournament which requires him to
participate in eight matches. What is the chance of him winning at least one
match?
The chance of him not winning any of the eight matches is easily calculated
using the multiplication rule. There's a 90 % chance of him losing a match,
so we just multiply to find out how likely it is for this event to occur eight
times in a row:
p(no win) = 0.9^8 = 0.43 = 43 %
Now we remember that this probability and the chance of him winning at
least one match add up to one and we get:
p(at least one win) = 1 – 0.43 = 0.57 = 57 %
So despite him having such a hard time to win a match, there's a more than
50/50 chance for him to win at least one tournament match. Of course, Paul
is not so interested in this number, for him it's about the enjoyment, not
winning.
The phrase "at least" gave away how we should approach this. Let's first
calculate how likely it is to not find any mistake. The chance of getting a
word correct is 3999/4000. The chance for this event to happen 30000 times
in a row is:
p(no mistake) = (3999/4000)^30000 = 0.0006
As we saw, there's a high probability that once a book is finished, there will
be some mistakes left in the book. Knowing that, the author goes over the
text once more and finds m(1) mistakes. After that he proofreads it yet
another time and finds m(2) mistakes. How many mistakes can he expect to
still be in the book after that?
Let's call the total number of mistakes in the text M and the proofreader's chance of spotting a mistake p. Then we can expect him to find:
m(1) = p · M
mistakes during the first try. Now there are M – m(1) or (1-p) · M mistakes
left. Assuming the same chance of spotting a mistake, we can expect him to
discover:
m(2) = p · (1-p) · M
mistakes in the second try. Let's use the first equation to eliminate the
unknown p. We can write p = m(1) / M and plug that into the second
equation. As promised, we really won't need to look at his chance of
spotting a mistake. We can get rid of it by straightforward algebra. Doing
this results in the equation:
m(2) = m(1) / M · (1 – m(1) / M) · M
which simplifies to:
m(2) = m(1) – m(1)² / M
All that's left is more algebra. We solve this equation for M to get the
expected number of mistakes in the text before the author started
proofreading:
M = m(1)² / (m(1) – m(2))
We already concluded that the number of mistakes left after the second attempt is simply L = M – m(1) – m(2) or:
L = m(2)² / (m(1) – m(2))
Voila, all the inputs we need to calculate the expected number of mistakes
still in the text are the number of mistakes found on first and second try.
Surprising, isn't it? Of course now that we have a formula for M, we can
also calculate the proofreader's chance of spotting a mistake using p = m(1)
/ M:
p = 1 – m(2) / m(1)
On top of that, we're able to compute how many mistakes are expected to be found during a third proofread: m(3) = p · L.
Let's use some numbers. First the author found m(1) = 30 mistakes and then
m(2) = 12 mistakes. How many mistakes do we expect to be left in the text? Using our formulas, the text originally contained M = 30² / (30 – 12) = 50 mistakes, of which L = 12² / 18 = 8 are expected to still be left. The proofreader's chance of spotting a mistake is:
p = 1 – 12 / 30 = 0.6 = 60 %
This means that during an additional attempt to eliminate the mistakes we can expect him to find m(3) = 0.6 · 8 ≈ 5 of the remaining 8 mistakes.
One thing to remember though is that we need to have m(1) > m(2) in order for the formulas to work. Otherwise our assumption of a constant chance of spotting mistakes cannot be fulfilled and the formula just produces
nonsense like negative values or, even worse, an infinite number of
mistakes. No writer is that bad.
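The whole estimate fits into a few lines of code. A sketch (it uses the simplification L = M – m(1) – m(2) = m(2)²/(m(1) – m(2)), and the guard mirrors the warning about m(1) > m(2)):

```python
def proofread_estimate(m1, m2):
    """From mistakes found in two consecutive passes, estimate the
    original number of mistakes M, the number L still left, and the
    spotting probability p."""
    if m1 <= m2:
        raise ValueError("need m1 > m2 for a constant spotting chance")
    M = m1 ** 2 / (m1 - m2)
    L = m2 ** 2 / (m1 - m2)
    p = 1 - m2 / m1
    return M, L, p

M, L, p = proofread_estimate(30, 12)   # 50 mistakes, 8 left, p = 0.6
```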
(●●) Super-PIN
When you want to access your cell phone, you have to enter a PIN code.
This protects your information from unauthorized access in case the cell
phone gets lost. You are usually given three shots at entering the correct
PIN code. If you fail to do so, you need the super-PIN to gain access.
Assume your chance of entering the known PIN incorrectly is 0.1. How
likely is it that you need to resort to the super-PIN? That's simple. For that
to happen, you need to enter an incorrect code three times in a row.
According to the multiplication rule, this happens with the probability:
p(super-PIN) = 0.1^3 = 0.001 = 1 in 1000
However, this calculation was rather simplistic. When you enter the PIN
incorrectly on first try, you would usually pay more attention during the
second attempt. And if this one also fails, you would be even more alert
since you know that this is your last shot before having to locate the super-
PIN in the disorganized pile of documents.
Let's take this into account. During the first try, your chance of entering the
PIN incorrectly is 0.1. However, during the second and third try the
corresponding chances are only 0.05 and 0.01. How does that change the
odds of needing to resort to the super-PIN? Again we use the multiplication
rule:
p(super-PIN) = 0.1 · 0.05 · 0.01 = 0.00005
Or 1 in 20000. So paying attention reduced the chances of failing to access
the cell phone significantly. We'll use this result to answer another question:
if you access your cell phone fifteen times a day, what are your chances of
needing the super-PIN at least once over the course of a year? Have a guess
before crunching the numbers.
With fifteen accesses a day, there are 15 · 365 = 5475 logins per year. The chance of never needing the super-PIN during that time is 0.99995^5475 = 0.76. From that we can conclude that the probability of needing the super-PIN at least once during a year when logging in fifteen times a day is:
p(at least once) = 1 – 0.76 = 0.24 = 24 %
Suppose someone sends out mails that each have the probability p of getting a response. By the multiplication rule, the chance of n mails all going unanswered is p(no response) = (1-p)^n. We want to find out for what number of mails n his chance at a response is 95 %, or in other words, his chance of getting no response is only 5 %. Thus we can just turn the above formula into an equation by inserting p(no response) = 0.05:
0.05 = (1-p)^n
Applying the natural logarithm to both sides and solving for n gives:
n = ln(0.05) / ln(1-p)
For example, if the response rate is only 1 in 1000, he needs to send about
3000 mails for a 95 % chance at a response. If he doesn't have a program
that automatically sends the mails and it takes him half a minute to send one
manually, he'd be sending mails for 25 hours straight to get to this number.
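Solving the equation for other response rates is a one-liner. A sketch (the function name and the 95 % default are mine):

```python
from math import ceil, log

def mails_needed(response_rate, confidence=0.95):
    """Smallest number of mails n with at least a `confidence`
    chance of one or more responses, from 1-confidence = (1-p)^n."""
    return ceil(log(1 - confidence) / log(1 - response_rate))

n = mails_needed(0.001)   # about 3000 mails for a 95 % chance
```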
Assuming the impacts are normally distributed, one can derive a formula for the probability of striking a circular target of radius R using a missile
with a given CEP:
p = 1 – exp( -0.41 · R² / CEP² )
This quantity is also called the “single shot kill probability” (SSKP). Let's
include some numerical values. Assume a small complex with the
dimensions 100 m by 100 m is targeted with a missile having a CEP of 150
m. Converting the rectangular area into a circle of equal area gives us a
radius of about 56 m. Thus the SSKP is:
p = 1 – exp( -0.41 · 56² / 150² ) = 0.056 = 5.6 %
So the chances of hitting the target are relatively low. But the lack in
accuracy can be compensated by firing several missiles in succession. What
is the chance of at least one missile hitting the target if ten missiles are
fired? First we look at the odds of all missiles missing the target and answer
the question from that. One missile misses with 0.944 probability, the
chance of having this event occur ten times in a row is:
p(all miss) = 0.944^10 = 0.562
This means the chance of at least one missile hitting is 1 – 0.562 = 0.438 or about 44 %.
Still not great considering that a single missile easily costs upwards of 10000 $. How many missiles of this kind must be fired at the complex to
have a 90 % chance at a hit? A 90 % chance at a hit means that the chance
of all missiles missing is 10 %. So we can turn the above formula for p(all
miss) into an equation by inserting p(all miss) = 0.1 and leaving the number
of missiles n undetermined:
0.1 = 0.944^n
All that's left is doing the algebra. Applying the natural logarithm to both
sides and solving for n results in:
n = ln(0.1) / ln(0.944) = 40
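Both steps, the SSKP formula and the shot count, are easy to reproduce. A sketch (it uses the book's rounded miss probability of 0.944 for the shot count):

```python
from math import ceil, exp, log

def sskp(radius, cep):
    """Single shot kill probability for a circular target of the
    given radius, from p = 1 - exp(-0.41 R^2 / CEP^2)."""
    return 1 - exp(-0.41 * radius ** 2 / cep ** 2)

def shots_needed(p_miss, confidence=0.9):
    """Shots required for at least a `confidence` chance of a hit
    when each shot independently misses with probability p_miss."""
    return ceil(log(1 - confidence) / log(p_miss))

hit = sskp(56, 150)       # about 5.6 % per missile
n = shots_needed(0.944)   # 40 missiles for a 90 % chance
```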
Assume the complete collection consists of N stickers. You are only one
sticker short, so you visit a friend who has n of the stickers in total. What
are your chances of him having this one missing sticker?
There's a chance of 1/N for a randomly selected sticker to be the one you're
looking for. Of course this also means that the odds are (N-1)/N = 1 – 1/N
of a sticker not being the desired one. If your friend has n stickers, the
chance of him not having this particular sticker is thus:
p(none) = (1 – 1/N)^n
Since the probability of him not having this sticker and him having it at
least once must add up to one, we get:
p(at least one) = 1 – (1 – 1/N)^n
Let's put some numbers to that. We assume that the complete collection has
N = 30 different stickers. You know that your friend has n = 50 stickers at
home. The chance that he has at least one copy of the sticker you are
missing is thus: p(at least one) = 0.82 = 82 %. The odds are pretty good.
But it won't be much of a help to you if he only has one copy of the sticker
you need. What's more interesting to us is how likely it is that he has at least
two copies of the desired sticker. How can we approach this?
We note that the chance of not having the sticker, the chance of having it
exactly once and the chance of having it at least twice must add up to one,
since one of these excluding outcomes is sure to come up. Thus we can
write:
p(at least two) = 1 – p(one) – p(none)
We already derived a formula for the chance of him not having it, so we go
right to computing the probability of him having exactly one copy. The chance that the first sticker he shows to us is the desired sticker and the rest are not is:
1/N · (1 – 1/N)^(n-1)
But any one of his n stickers could be that single copy, so the chance of him having it exactly once is:
p(one) = n/N · (1 – 1/N)^(n-1)
That was all we needed to calculate the odds of him having at least two copies of the missing sticker:
p(at least two) = 1 – n/N · (1 – 1/N)^(n-1) – (1 – 1/N)^n
With N = 30 and n = 50, this comes out to about 50 %.
Another interesting question is: given that the full collection has N different
stickers, how many should one be expected to buy to get the complete set?
Since we didn't look at expected values yet, we need to postpone this
question until chapter three and revisit the stickers then.
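All three sticker probabilities can be computed by one small helper. A sketch (note the factor n in the "exactly once" term, since any one of the n stickers can be the single copy):

```python
def sticker_odds(N, n):
    """Chances that n random stickers from a collection of N contain
    a particular sticker not at all, exactly once, at least twice."""
    p_none = (1 - 1 / N) ** n
    p_one = n / N * (1 - 1 / N) ** (n - 1)
    p_at_least_two = 1 - p_one - p_none
    return p_none, p_one, p_at_least_two

p_none, p_one, p_two = sticker_odds(30, 50)
at_least_one = 1 - p_none   # about 0.82, as in the text
```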
This is a problem that has been covered in many books and articles already.
But it is too delicious to just leave it out. Given a group of n people, what is
the chance that at least one pair shares a birthday?
Like in all "at least"-problems, we take the detour. Let us rather calculate
how likely it is that none of them share a birthday. Now for one pair, the
chance of them not having their birthday on the same date is just 364/365.
We need this event to occur for all possible pairings in the group. Once we
know the number of pairings m, we can simply write:
p(no shared birthdays) = (364/365)^m
How many pairings are there? The first person can be paired with each of the other n-1 people, the second with the remaining n-2 people, and so on, down to the last person, who adds no new pair. So the number of pairings is the sum m = (n-1) + (n-2) + ... + 1 + 0. We can write 1 as n-(n-1) and 0 as n-n. This will make it clear that we might also formulate this sum as such:
m = n · n – (1 + 2 + 3 + ... + n-1 + n)
Using the well-known formula 1 + 2 + ... + n = n · (n+1) / 2, this simplifies to m = n · (n-1) / 2.
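Plugging the pairing count into the formula answers the classic question of how many people are needed for a better-than-even chance of a shared birthday. A sketch (the pairing count works out to m = n(n-1)/2; note that treating the pairs as independent is an approximation, but a very good one here):

```python
def p_shared_birthday(n):
    """Chance that at least one pair among n people shares a birthday,
    using m = n(n-1)/2 pairings, each distinct with chance 364/365."""
    m = n * (n - 1) // 2
    return 1 - (364 / 365) ** m

# Smallest group with a better-than-even chance of a shared birthday.
crossover = min(n for n in range(2, 60) if p_shared_birthday(n) > 0.5)   # 23
```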
Meet Jack and Mike. They are passionate chess players and take great joy
in challenging each other. Jack has been playing ever since he was a child,
while Mike joined the fun while in his teens. The statistics show that
experience pays in chess, of all the games they played so far, 70 % were
won by Jack.
One Sunday they meet to play three rounds in a row. How likely is it that Jack wins all the rounds? That's easy to calculate using the multiplication rule:
p(3/3 rounds) = 0.7 · 0.7 · 0.7 = 34.3 %
What about the chance of him winning exactly two rounds? One sequence
that fulfills this condition is: Jack – Jack – Mike. This occurs with a
probability of 0.7 · 0.7 · 0.3. But this is not the only possible way in which
Jack can win two out of three rounds. There are two other possible
sequences: Jack – Mike – Jack and Mike – Jack – Jack. As you can easily
verify, all of them occur with the same probability. We need to take all these
possibilities into account, so the odds of Jack winning exactly two rounds
are:
p(2/3 rounds) = 3 · 0.7 · 0.7 · 0.3 = 44.1 %
If you think "this is starting to sound a lot like the good ol' coins", you are
absolutely right. I can't stress enough that we carefully need to count all the
ways in which our desired outcome can come true. Or better yet, let a
formula do the counting.
This is where the binomial coefficient comes in. It tells us how many
possibilities there are to distribute k elements on n slots. We symbolize this
number by C(n,k). If your calculator does not include this function, you can
easily find a binomial coefficient calculator online. For instance, to count the ways in which Jack can win twelve out of twenty rounds, all we need to know is C(20,12). Using a calculator we get:
C(20,12) = 125970
So there are 125970 sequences which have Jack winning twelve out of
twenty rounds. With each sequence having the same probability 0.7^12 · 0.3^8, the chance of Jack winning exactly twelve out of twenty rounds is:
p(12/20 rounds) = 125970 · 0.7^12 · 0.3^8
Which is about 11.4 %. It's time to formulate the binomial distribution more
generally. We are given an experiment with two possible outcomes A and B
for each trial (Jack or Mike, heads or tails, ace or not ace). Event A occurs
with the probability p, event B with the probability 1-p. The probability of
A occurring exactly k times within n trials is:
B(n,k) = C(n,k) · p^k · (1-p)^(n-k)
Agreed, this formula looks impressive. But the last two terms we can
understand more or less intuitively with the multiplication rule. First A,
having the probability p, occurs k times, then B, having the probability 1-p,
occurs the remaining n-k times. That's where the expression p^k · (1-p)^(n-k)
comes from. And the first term is just the binomial coefficient C(n,k),
telling us in how many ways our outcome can happen.
As we saw, there are C(20,12) = 125970 ways for Jack to win twelve out of
twenty rounds. But doesn't that mean that there are just as many ways for
Mike to lose eight of twenty rounds? There should be. Let's ask the
binomial coefficient. If our logic is right, then C(20,8) should be the same
as C(20,12).
Indeed the calculator spits out C(20,8) = 125970. The binomial coefficient
is symmetric around the center for good reasons. Mathematically, this
means that in all cases C(n,k) = C(n,n-k) holds true.
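If you work in Python rather than with a calculator, the binomial coefficient is built in. A short check of both the value and the symmetry:

```python
from math import comb

# math.comb gives the binomial coefficient C(n,k) directly.
ways_jack = comb(20, 12)   # sequences in which Jack wins 12 of 20
ways_mike = comb(20, 8)    # sequences in which Mike wins the other 8

# The symmetry C(n,k) = C(n, n-k) holds for every k.
symmetric = all(comb(20, k) == comb(20, 20 - k) for k in range(21))
```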
We first need to extract all the parameters that are required as inputs in the
binomial distribution. The probability of finding a defective part is p = 0.02.
We take a sample of six parts, so n = 6. And we are interested in knowing
how likely it is to have exactly one defective part in our sample, which
means that k = 1. This is all we need to solve the problem:
p(1/6 defective) = C(6,1) · 0.02^1 · 0.98^5 = 10.8 %
We could have deduced this with the multiplication rule as well. The chance
of the first part being defective and the other five parts being functional is:
0.02 · 0.98^5. In total, there are six possible sequences with one defective part among six parts, all having the same probability. So we get the result:
p(1/6 defective) = 6 · 0.02 · 0.98^5 = 10.8 %
Or about 1 in 9. Note that we already derived this result in chapter one
using the multiplication rule with about the same amount of work. So at
first glance, it seems like we didn't gain that much by using the binomial
formula.
That being said, the chances of getting k = 1 and k = 2 correct answers on the test when choosing the answers at random are:
p(1 correct) = C(10,1) · (1/3)^1 · (2/3)^9 = 0.087
p(2 correct) = C(10,2) · (1/3)^2 · (2/3)^8 = 0.195
As you can see, these outcomes are already much more likely than choosing
only incorrect answers. We could continue and do this for all possible
results up to k = 10 correct answers. Below you can see the plot of this
calculation:
The most likely outcome is to have three correct answers, but two and four
correct answers occur with a probability that is comparable to that. The
probabilities drop sharply towards the end. The odds of nine or ten correct
answers are so small that they don't even show on the graph.
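The plotted values can be recomputed in a few lines, which also lets us confirm the peak at three correct answers. A sketch:

```python
from math import comb

# B(10,k): chance of exactly k correct answers when guessing all
# ten questions with p = 1/3 per question.
dist = [comb(10, k) * (1 / 3) ** k * (2 / 3) ** (10 - k) for k in range(11)]

most_likely = max(range(11), key=lambda k: dist[k])   # k = 3
tail = dist[9] + dist[10]   # too small to show on the plot
```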
What if you know some of the answers for sure and only have to guess the
remaining ones? We can also apply the binomial formula there. Assume we
know six answers for sure and guess the remaining four. Then we have n =
4 trials with the unchanged chance of p = 1/3 for a correct answer. The odds
for no correct answer (besides the six we know for sure, of course) are thus:
p(0 correct) = C(4,0) · (1/3)^0 · (2/3)^4 = 0.198
The definite English article (and most common English word) “the” has a
frequency of about 0.07, meaning that within one hundred words we expect
it to come up seven times. A sample of texts taken from the newspaper
“USA Today” (sample size 15810 words) shows that the average length of
an English sentence is 19 words. What is the probability of the word “the”
not being included in a sentence of average length?
The Frumkina Law states that we are allowed to apply the binomial
distribution in good approximation to the distribution of words in a text. So
let's extract all the parameters we need as inputs for the formula. In abstract
terms, the 19 words correspond to n = 19 trials with each trial having the
chance p = 0.07 of resulting in the event “the”. So the chance of not having
the definite article come up is:
B(19,0) = C(19,0) · 0.07^0 · 0.93^19 = 0.252 = 25.2 %
Just out of curiosity, we'll continue this train of thought and calculate the
probability of the definite article appearing twice in an average English
sentence:
B(19,2) = C(19,2) · 0.07^2 · 0.93^17 = 0.244 = 24.4 %
Just as a side note, the analysis of the sample confirmed the rule of thumb
that the average word length in English is five letters (the actual result was
4.91 letters per word, close enough). It also revealed that reading the “USA
Today” requires a 12th grade reading level. If you are interested in doing
such text analysis, I recommend the “Advanced Text Analyser” featured on
the website usingenglish.com.
Let's get started and find the required inputs. The text tells us that p = 0.8
and n = 5. For k we will have either 3, 4 or 5.
Finally we sum them up to get the odds that the majority of the penalty
shots result in a goal:
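The three binomial terms for k = 3, 4 and 5 and their sum can be checked with a small Python sketch (the helper name is mine):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Chance that the majority (k = 3, 4 or 5) of the five penalty shots score, p = 0.8.
p_majority = sum(binom_pmf(k, 5, 0.8) for k in (3, 4, 5))
print(round(p_majority, 3))  # 0.942
```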
When teaching small groups, there's always a chance of too few people
showing up for the course to take place. One participant is ill, another in an
important meeting and yet another in training. The agreement is: if less
than two people show up, the appointment is canceled.
In deriving these last results, that is the chance of cancellation for groups of
five and six people, we assumed the chance of the participants showing up
to be the same as in a group of four people. An analysis of the groups I
taught over the years shows that this assumption should only be used as a
first approximation.
For groups of four, five and six participants this leads to:
During the process of transmission over the endless distances, errors can
occur. There's always a chance that zeros randomly turn into ones and vice
versa. What can we do to make communication more reliable? One way is
to send duplicates.
Instead of simply sending a 0, we send the string 00000. If not too many
errors occur during the transmission, we can still decode it on arrival. For
example, if it arrives as 00010, we can deduce that the originating string
was with a high probability a 0 rather than a 1. The single transmission
error that occurred did not cause us to incorrectly decode the string.
Adding the probabilities for all these desired events tells us how likely it is
that we can correctly decode the string.
In the graph below you can see the plot of this function. The x-axis
represents the transmission error probability p and the y-axis the chance of
successfully decoding the string. For p = 10 % (1 in 10 bits arrive
incorrectly) the odds of identifying the originating string are still a little
more than 99 %. For p = 20 % (1 in 5 bits arrive incorrectly) this drops to
about 94 %.
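The decoding succeeds as long as at most two of the five bits flip, so the majority vote still identifies the original bit. A minimal Python sketch of this calculation (function name my own):

```python
from math import comb

def decode_success(p_err, n=5):
    """Chance that a 5-bit repetition code is decoded correctly:
    at most 2 of the 5 bits may flip, so the majority vote still wins."""
    return sum(comb(n, k) * p_err**k * (1 - p_err)**(n - k) for k in range(3))

print(round(decode_success(0.10), 4))  # 0.9914
print(round(decode_success(0.20), 4))  # 0.9421
```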
It's election time and we want to know what percentage of the people
support our candidate. So we do a survey, we ask 1000 people if they intend
to vote for our candidate. 55 % answer yes, which makes us optimistic
about the outcome of the election. But how reliable is this answer?
Let's switch to an all-knowing perspective and look down upon the survey
takers. We know that in the entire population p percent support a certain
candidate. If the survey takers ask n people, how likely is it that their
estimate for p, let's call this value q, will fall within 5 % of the actual value?
That's a tough one. Let's do it step by step, starting with the binomial
distribution. The chance that k out of n people support our candidate in the
survey is:
B(n,k) = C(n,k) · p^k · (1-p)^(n-k)
The survey taker's estimate for the level of support will then be: q = k/n.
Now ideally q = p, but this is in no way guaranteed or even realistic. There
will be some deviation from the actual value. To be within 5 % of the actual
value, the survey must result in a value for q that lies within 0.95·p and
1.05·p. Or in other words: the number of supporters k must be within
0.95·p·n and 1.05·p·n. So to find out how likely it is to get a result within 5
% of the actual value, we need to sum the probabilities for all these
outcomes:
- k equals 0.95·p·n
- k equals 0.95·p·n + 1
- k equals 0.95·p·n + 2
- and so on until we reach k equals 1.05·p·n
That's a lot of terms. Suppose we ask 1000 people, with 45 % of the entire
population supporting our candidate. Ideally 450 people will respond
favorably to our candidate, but hitting the nail on the head so precisely is
rather unrealistic. To be within 5 % of the actual value, the number of
supporters in the survey must fall between k = 428 and k = 473. That's 45
binomial terms we need to sum to get the probability for this outcome.
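Nobody wants to sum that by hand, but a computer grinds through it instantly. A quick Python sketch of the sum (the excerpt doesn't state the final value, so I print it without comment):

```python
from math import comb

n, p = 1000, 0.45   # sample size and the true level of support
lo, hi = 428, 473   # survey results within 5 % of the true value

p_within = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(lo, hi + 1))
print(round(p_within, 3))
```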
If you haven't read the penalty kick snack, do it before enjoying this treat.
Otherwise you're good to go. When talking about the penalty kicks, we
focused on one team and ignored the other. Let's include both in our
discussion. One question certainly pops up quickly: What is the chance of
the teams being tied after each team took five shots at the goal? This is the
case if the result is either 0:0, 1:1, 2:2, 3:3, 4:4 or 5:5. Any guesses?
In a similar way we arrive at the formulas for p(1:1), p(2:2), and so on.
Summing them all up gives us the probability of having a tie at the end of
the penalty shoot-out:
p(tie) = p(0 of 5 successful)^2 + p(1 of 5 successful)^2 + p(2 of 5 successful)^2
+ ... + p(5 of 5 successful)^2
Taking the values p = 0.8 and n = 5, we get the individual probabilities via
the binomial distribution.
p(tie) = 0.32 = 32 %
So about 1 out of 3 shoot-outs will not be decided after the regular five
penalty kicks. If you did all the calculations, you probably noticed how
unlikely the result 0:0 is. There's only a 1 in 10 million chance for that to
happen. Of all the ties, the result 4:4 is the most likely one.
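The whole sum of squares fits into a few lines of Python (helper name mine):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Both teams convert with p = 0.8; a tie means both score the same k out of 5.
p_tie = sum(binom_pmf(k, 5, 0.8)**2 for k in range(6))
print(round(p_tie, 2))  # 0.32
```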
After a pleasurable night of drinking, John decides to walk home. But the
high concentration of alcohol in his veins makes it hard for him to stay the
course. He randomly takes either a step forwards (with the probability p) or
a step backwards (with the probability 1-p). Now he has taken n steps in
total. What is the probability that he is x or more steps away from his
origin? I will not give you any numerical values just yet, let's rather draw up
a battle plan which will work for any particular data.
Assume he has taken k of the n steps forwards. This of course also means
that he has taken n-k of the n steps backwards, so his distance from the
origin is:
d = k – (n-k) = 2·k – n
This formula makes sense. If he took all the steps forwards (k = n), then his
distance from the origin is d = n. And if he took all the steps backwards (k =
0), the formula gives us a distance of d = -n. The same magnitude in
distance, but in the opposite direction.
The binomial distribution enables us to calculate the chance that k out of n
steps are taken forwards:
For this value of k he again is x steps away from the origin, but this time in
the opposite direction. Any k smaller than k2 will result in a distance greater
than x. We now know that if he is x or more steps away from the origin, he
either took anywhere between k1 and n steps forwards (distance greater or
equal to x in forward direction) or only 0 and k2 steps forwards (distance
greater or equal to x in backward direction).
To get the overall probability for him to be x or more steps from the origin,
we sum the probabilities p(k forward steps) for all the values of k that we
determined to be of relevance. Phew, that was some tough work. Now let's
make it easier by inserting numerical values.
6 = 2·k1 – 12
And so on for all relevant values of k as determined above. Summing these terms
provides us with the odds that he is 6 or more steps from the origin after
taking 12 random steps:
p(distance 6 steps or more) = 24 %
For doing such calculations it is of great help to have a calculator that can
compute cumulative probabilities, that is, automatically sums the
distribution up to or beyond a specific value. For example, using the "Stat
Trek Binomial Calculator" I only needed to compute two values to get the
above result, the cumulative probability for all k smaller than or equal to 3
and the cumulative probability for all k greater than or equal to 9. This is
certainly much quicker than summing all the terms individually.
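The excerpt doesn't spell out the numeric step probability, but a forward-step chance of p = 0.6 reproduces the quoted 24 %, so here is a sketch under that assumption:

```python
from math import comb

n, x = 12, 6
p = 0.6  # assumed forward-step probability -- not stated in this excerpt

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Distance d = 2k - n, so |d| >= x means k >= (n + x) / 2 or k <= (n - x) / 2.
k1 = (n + x) // 2  # 9 forward steps or more
k2 = (n - x) // 2  # 3 forward steps or fewer
p_far = (sum(binom_pmf(k) for k in range(k1, n + 1))
         + sum(binom_pmf(k) for k in range(0, k2 + 1)))
print(round(p_far, 2))  # 0.24
```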
This statistical problem might seem quite artificial to you, but it actually
comes from a real-world application. The diffusion of a substance in a host
substance happens by means of a random walk. The diffusing particles
collide frequently with the molecules of the host substance and thus
randomly take steps forwards and backwards. Of course here the situation is
even more complex because we are faced with a three-dimensional random
walk, which provides much more possible directions than just forwards and
backwards.
Chapter Three: To Be Expected
We already talked briefly about the expected value while discussing penalty
kicks. Statistical analysis of soccer games reveals that the chance of a
successful penalty kick is 0.8 or 80%. In a penalty shoot-out, each team gets
five penalty kicks. How many goals should we expect from a team? Our
analysis showed that the event "four out of five" has the highest probability.
We could have gotten this result much quicker by multiplying our chance
(0.8) with the number of trials (5).
e=p·n
This is the expected value. It is quite straightforward and we can use it for
many interesting applications. You can use it to decide if a game is fair or
favors a certain party (the player, the casino) or to set interest rates for
credits.
From these numbers we can calculate the odds of him selling a certain
number of cars in a week:
What number of sales can we expect from Martin during a week? To get the
expected value of a discrete probability distribution, we do this weighted
sum:
e = 0 · 0.05 + 1 · 0.12 + ... + 4 · 0.09 = 2.26
Before going to the snacks, let's look at the special case of having a uniform
probability distribution. Here all the outcomes are equally probable. This
means that if we have n possible outcomes, each must have the probability
1/n (otherwise they won't add up to one as required). The expected value
then is:
e = n(1) · 1/n + n(2) · 1/n + n(3) · 1/n + ...
So for a uniform distribution the expected value is just the arithmetic mean.
In other words, we can interpret the expected value (as defined by the
weighted sum for a probability distribution) to be a generalization of the
common arithmetic mean, where each number carries the same weight.
(●) My Fair Game
You meet a nice man on the street offering you a game of dice. For a wager
of just 2 $, you can win 8 $ when the dice shows a six. Sounds good? Let's
say you join in and play 30 rounds. What will be your expected balance
after that?
You roll a six with the probability p = 1/6. So of the 30 rounds, you can
expect to win 1/6 · 30 = 5, resulting in a pay-out of 40 $. But winning 5
rounds of course also means that you lost the remaining 25 rounds, resulting
in a loss of 50 $. Your expected balance after 30 rounds is thus -10 $. Or in
other words: for the player this game results in a loss of 1/3 $ per round.
Let's make a general formula for just this case. We are offered a game
which we win with a probability of p. The pay-out in case of victory is P,
the wager is W. We play this game for a number of n rounds.
The expected number of wins is p·n, so the total pay-out will be: p·n·P. The
expected number of losses is (1-p)·n, so we will most likely lose this
amount of money: (1-p)·n·W.
Now we can set up the formula for the balance. We simply subtract the
losses from the pay-out. But while we're at it, let's divide both sides by n to
get the balance per round. It already includes all the information we need
and requires one less variable.
B = p · P – (1-p) · W
This is what we can expect to win (or lose) per round. Let's check it by
using the above example. We had the winning chance p = 1/6, the pay-out P
= 8 $ and the wager W = 2 $. So from the formula we get this balance per
round:
B = 1/6 · 8 $ – 5/6 · 2 $ = – 1/3 $ per round
Just as we expected. Let's try another example. I'll offer you a dice game. If
you roll two sixes in a row, you get P = 175 $. The wager is W = 5 $. Quite
the deal, isn't it? Let's see. Rolling two sixes in a row occurs with a
probability of p = 1/36. So the expected balance per round is:
B = 1/36 · 175 $ – 35/36 · 5 $ = 0 $ per round
I offered you a truly fair game. No one can be expected to lose in the long
run. Of course if we only play a few rounds, somebody will win and
somebody will lose.
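The balance formula translates directly into code. Using exact fractions avoids rounding noise and shows the fair game landing on exactly zero (function name my own):

```python
from fractions import Fraction as F

def balance_per_round(p, payout, wager):
    """Expected gain (or loss) per round: win the payout with chance p,
    lose the wager otherwise."""
    return p * payout - (1 - p) * wager

print(balance_per_round(F(1, 6), 8, 2))    # -1/3 dollar per round
print(balance_per_round(F(1, 36), 175, 5)) # 0: the fair game from the snack
```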
It's helpful to understand this balance as being sound for a large number of
rounds but rather fragile in case of playing only a few rounds. Casinos are
host to thousands of rounds per day and thus can predict their gains quite
accurately from the balance per round. After a lot of rounds, all the random
streaks and significant one-time events hardly impact the total balance
anymore. The real balance will converge to the theoretical balance more
and more as the number of rounds grows. This is mathematically proven by
the Law of Large Numbers. Assuming finite variance, the proof can be done
elegantly using Chebyshev's Inequality.
You are done with work and want to get home as quickly as possible. Two
options are available to you: take the country road or the highway. Since
you've been driving these routes for quite some time now, you know what
to expect. The route over the country road takes 60 minutes on average,
traffic jams do not arise since most people prefer the highways. The route
over the highway takes 45 minutes if there's no traffic jam and 90 minutes if
there's one. Your experience suggests that there's a 1 in 5 chance that you'll
end up in a jam. Which route should you prefer?
What we seek is the expected travel time when taking the highway. We
could approach this in a similar way as in the case of the dice game. But
let's rather interpret the given data as a discrete probability distribution
having these possible outcomes:
- 45 minutes with probability 0.8
- 90 minutes with probability 0.2
So on average you'll save six minutes per trip (or one hour per week
assuming five workdays) by taking the highway.
As you can see our theoretical result agrees with the numerical simulation
within acceptable boundaries (the above results deviate no more than 4 %
from the expected 54 minutes). If you have basic knowledge of a
programming language, it's always fun to check the result with simulations.
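In that spirit, here is one way such a simulation could look in Python (the numbers of trials and the seed are arbitrary choices of mine):

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

def highway_trip():
    # 1 in 5 chance of a jam: 90 minutes instead of 45
    return 90 if random.random() < 0.2 else 45

trials = 100_000
mean_time = sum(highway_trip() for _ in range(trials)) / trials
print(round(mean_time, 1))  # close to the expected 54 minutes
```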
(●●) Credits
A bank gives a 15000 $ credit to a business owner, who agrees to pay back
the credit after one year with p percent interest. After careful analysis, the
bank estimates that there's a 5 % chance that the money will be lost. What is
the expected win or loss for the bank with a p = 4 % interest? How should
they set the interest rate to have an expected win of 500 $?
Let's look at the first question. Again we can consider this to be a discrete
probability distribution with two outcomes. One outcome is that the
business owner pays back the credit after one year with the agreed interest.
This outcome occurs with 95 % chance and results in a win of 600 $ (4 %
of 15000 $) for the bank. The other possible outcome is that the business
owner will not be able to pay back the money. This occurs with a 5 %
chance and results in a loss of 15000 $ for the bank. So in summary, this is
the probability distribution for this case:
- 600 $ win with chance 0.95
- 15000 $ loss with chance 0.05
So the bank should expect to lose 180 $ per credit of this type. Of course no
bank will grant you such a credit. Rather, they calculate how the interest
rate should be set for them to expect a profit. In our case we want the profit
(the expected value) to be 500 $. We can set up an equation for that. Let's
leave the interest p undetermined and insert 500 $ for the expected
value.
p = 0.088 = 8.8 %
For the bank to make 500 $ profit per credit of this kind, it needs to set the
interest rate at 8.8 %. This is a rather high interest rate. Luckily if you
spread the credit over more than one year (which is usually the case), this
would decrease significantly.
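The equation behind this result is 0.95 · (15000 · p) − 0.05 · 15000 = 500, which a couple of lines of Python solve directly:

```python
# Solve 0.95 * (15000 * p) - 0.05 * 15000 = 500 for the interest rate p.
credit, p_loss, target = 15000, 0.05, 500
p = (target + p_loss * credit) / ((1 - p_loss) * credit)
print(round(p, 3))  # 0.088, i.e. 8.8 %
```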
Assume we want to create a fair game using a wheel that is divided into n
fields. The player spins this wheel after paying the wager. One of these
fields results in a win for the player, the other n-1 fields lead him to lose the
game. Given the payout P and the wager W, how many fields should the
wheel have in order for the game to be fair?
To answer that, we go back to the formula for the balance per round from
the snack “My Fair Game”. Since the winning chance is p = 1/n and the
corresponding chance of losing 1-p = (n-1)/n, we get this balance per
round for the game:
B = 1/n · P – (n-1)/n · W
In a fair game, the balance per round is zero. Neither the player nor the
“casino” is expected to gain anything in the long run. Thus all we need to
do to make things fair is to set B = 0 :
0 = 1/n · P – (n-1)/n · W
Now we can solve algebraically for the number of fields n that satisfy this
condition. Start by multiplying both sides with n and go from there. The
result is:
n = 1 + P/W
If for example the payout is five times the wager, so P/W = 5, our wheel
should consist of n = 6 fields. Including more than six fields will decrease
the chance for a win and thus favor the casino. Similarly, fewer fields will
give the player the edge. Of course in real-world casinos the latter never
happens. The games are always set up in such a way that the casino is
expected to gain in the long run.
Two opponents of equal strength, let's call them Bill and Marcus, are
competing against each other over several rounds of chess. Whoever wins
two rounds in a row, wins the game. With this rule, the game could end
after a minimum of two rounds or, if they win the rounds in an alternating
fashion, could go on forever. The question that comes to mind is: what is
the average length of such a game?
Imagine the game ends after two rounds. That means that either the
sequence B – B or M – M occurred. Both sequences have the probability
0.5 · 0.5, so in total the chance of the game ending so soon is: 2 · 0.5 · 0.5
or:
- Length 2 with p(2) = 2 · 0.5^2 = 0.5
What about if the game ends after three rounds? Again this can only happen
via two sequences: M – B – B or B – M – M. So the odds of the game
ending there is:
- Length 3 with p(3) = 2 · 0.5^3 = 0.5^2
It seems like there is a simple pattern. Let's confirm that by looking at the
chance of having a winner after four rounds. The only possible sequences
for that are: B – M – B – B or M – B – M – M, so we get the probability:
- Length 4 with p(4) = 2 · 0.5^4 = 0.5^3
Indeed we cracked the probability distribution for this game to end after a
certain number of rounds. All that is left is to compute the expected value
for the length. Applying the formula from the introduction of this chapter,
we get:
e = 2 · 0.5 + 3 · 0.5^2 + 4 · 0.5^3 + …
Thus, the average length of this game is three rounds, not as long as you
might have guessed. Let's take it a step further and calculate how likely it is
that the game goes on for six rounds or more. The chance for the game
ending within five rounds is:
p(within 5 rounds) = 0.5 + 0.5^2 + 0.5^3 + 0.5^4
or about 0.94 = 94 %. This means that we can expect only about 6 % of the
games to go on for six rounds or more.
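Both the distribution and the expected length are easy to verify numerically by truncating the infinite sums at some large cutoff (the cutoff K = 200 is my choice; the remaining tail is vanishingly small):

```python
# p(length k) = 2 * 0.5**k for k >= 2; truncate the infinite sums at a large K.
K = 200
probs = {k: 2 * 0.5**k for k in range(2, K)}

total = sum(probs.values())
expected_length = sum(k * pk for k, pk in probs.items())
print(round(total, 6), round(expected_length, 6))  # 1.0 and 3.0
```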
The expected number of purchases to get a new sticker is the inverse of this
probability: e(new) = 1 / p(new). For example if our chance at a new sticker
is 1/8, we will most likely need to buy 8 stickers. So given that we already
have n stickers, this is how many we need to buy to get the next one:
e(new) = N / (N – n)
We begin with no stickers and need to buy e = N/N = 1 sticker for the next
new sticker. Now we have one sticker and need to buy e = N/(N-1) for the
next new one. Having two stickers now, we need to buy yet another e =
N/(N-2) stickers for the next addition. And so on until the collection is
complete. For the total number of purchases E, we need to add all these
expected values:
E = N/N + N/(N-1) + N/(N-2) + … + N/1
E = N · (1/N + 1/(N-1) + 1/(N-2) + … + 1/1)
That's a lot of terms. Luckily there is a handy approximation formula for the
sum in the bracket (which mathematicians call a harmonic series). The
larger N is, the more accurate this formula is:
E = N · (ln(N) + 0.58)
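You can compare the exact harmonic sum with this approximation for any album size; a sketch for a hypothetical 50-sticker album (function names my own):

```python
from math import log

def expected_purchases(N):
    """Exact expected number of purchases to complete a set of N stickers:
    E = N * (1/N + 1/(N-1) + ... + 1/1)."""
    return N * sum(1 / i for i in range(1, N + 1))

def approx_purchases(N):
    """The harmonic-series shortcut E = N * (ln(N) + 0.58)."""
    return N * (log(N) + 0.58)

# For a 50-sticker album the two values already agree closely:
print(round(expected_purchases(50), 1), round(approx_purchases(50), 1))
```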
When you toss a coin a large number of times, sooner or later some unlikely
events (such as a lot of heads in a row) will occur. Let's focus on such
repeats. Given that you tossed the coin n times, what is the longest string of
heads you can expect?
If e is bigger than one, we expect this string to come up during the n coin
tosses. For example, the expected value for a string of k = 5 heads in a row
within n = 500 throws is: e = 16. Of course the longer the string, the smaller
the expected value. For a string of k = 6 heads within n = 500 throws we
get: e = 8.
How long can we keep increasing the string size k before hitting the critical
value of e = 1? To answer this question, we simply set e = 1 and solve the
equation for k:
1 = 0.5^k · n
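Solving this is just a matter of logarithms: 2^k = n, so k = log2(n). A quick numerical check of the snack's values:

```python
from math import log2

n = 500  # number of coin tosses

# e = 0.5**k * n approximates the expected number of runs of k heads.
print(round(0.5**5 * n))  # 16 runs of five heads
print(round(0.5**6 * n))  # 8 runs of six heads

# e = 1 when k = log2(n): the longest run we should still expect.
print(round(log2(n), 2))  # about 8.97, so roughly nine heads in a row
```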
(●) Skewness
Let's take a look at how the skewness of a distribution can influence the
expected value. Two car salesmen, Stephen and Ben, follow Martin's
example and make a table of their weekly sales over the last 100 weeks.
Stephen got the following distribution:
- 0 sales with probability 0.10
- 1 sale with probability 0.20
- 2 sales with probability 0.40
- 3 sales with probability 0.20
- 4 sales with probability 0.10
As you can see it is perfectly symmetrical with two sales per week being
the outcome with the highest probability. The expected value is:
e = 0 · 0.10 + 1 · 0.20 + … + 4 · 0.10 = 2
Not surprisingly, it coincides with the most likely outcome. Over 50 weeks,
we would expect Stephen to sell 50·2 = 100 cars with this sales distribution.
Let's turn to Ben's sales:
- 0 sales with probability 0.15
- 1 sale with probability 0.30
- 2 sales with probability 0.40
- 3 sales with probability 0.10
- 4 sales with probability 0.05
Here the probabilities are visibly skewed towards zero. The expected value
should reflect that. Applying the formula we get:
e = 0 · 0.15 + 1 · 0.30 + … + 4 · 0.05 = 1.6
The expected value is now below the outcome having the highest
probability. Ben will only sell 50·1.6 = 80 cars over the course of 50 weeks
given this distribution.
skew = | (a – b ) / (n – 2) |
Don't worry too much about the outer brackets, they just tell us to take the positive
value of whatever we get. Mathematicians call that the absolute value. For
example: |2| = 2 and |-3| = 3. Let's calculate the skewness of the given sales
distributions. Stephen's distribution has two classes above the mean and two
below the mean. This leads to:
skew = | (2 – 2 ) / 3 | = 0
Ben's distribution has three classes above the mean and two classes below
it. In this case we get:
skew = | (3 – 2 ) / 3 | = 0.33
You probably have been wondering why we divide by n minus two. Why
not simply divide by the number of classes or not divide it at all? Consider
this: The worst case scenario for a skewed distribution is to have only one
class on one side of the mean and the remaining classes on the other side of
the mean. In this case a = n - 1 and b = 1 (or the other way around). When
we calculate the skewness of such a distribution, we get:
skew = | (n – 1 – 1) / (n – 2) | = 1
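The skewness measure is short enough to code up directly. Following the text's counting, classes sitting exactly on the mean (like Stephen's class of two sales) belong to neither a nor b; the small tolerance guards against floating-point noise in the mean:

```python
def skew(dist, tol=1e-9):
    """dist maps outcome -> probability; skew = |(a - b) / (n - 2)|."""
    e = sum(x * p for x, p in dist.items())
    a = sum(1 for x in dist if x > e + tol)  # classes above the mean
    b = sum(1 for x in dist if x < e - tol)  # classes below the mean
    return abs((a - b) / (len(dist) - 2))

stephen = {0: 0.10, 1: 0.20, 2: 0.40, 3: 0.20, 4: 0.10}
ben = {0: 0.15, 1: 0.30, 2: 0.40, 3: 0.10, 4: 0.05}
print(skew(stephen))        # 0.0
print(round(skew(ben), 2))  # 0.33
```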
Our analysis of train delays at a certain station has shown that the
probability p of a t minute delay is p(t) = 0.25·0.75^t. We can turn that into a
table by inserting different values of t:
- 0 minute delay with probability p(0) = 0.25
- 1 minute delay with probability p(1) = 0.19
- 2 minute delay with probability p(2) = 0.14
- 3 minute delay with probability p(3) = 0.11
And so on until infinity. First let's prove that this really is a discrete
probability distribution by showing that all the probabilities add up to one.
Consider the sum:
s = p(0) + p(1) + p(2) + ...
So we get:
Indeed the probabilities all add up to one and p(t) is a discrete probability
distribution. For the expected delay, we need to do this weighted sum:
e = 0·p(0) + 1·p(1) + 2·p(2) + ...
As one last snack, let's now prove that any function of the form p(t) =
(1-q)·q^t is a discrete probability distribution, though we restrict the value of q
to be between zero and one. For example, if you set q = 0.75, you'll arrive at
the function we had in the previous snack. And we were able to show that
this particular function indeed is a discrete probability distribution.
We do the proof by showing that the sum s of the probabilities p(0), p(1),
p(2), and so on, all add up to one. So let's write down this sum in general:
s = p(0) + p(1) + p(2) + ...
Using the summation formula for the sum in the bracket that we looked up
during the "Delay revisited" snack, we can write:
s = (1-q) · 1 / (1-q) = 1
Thus showing that the probabilities sum up to one and concluding the proof.
We can also deduce a general formula for the expected value. The proper
weighted sum is:
e = 0·p(0) + 1·p(1) + 2·p(2) + ...
When we insert q = 0.75, the expected value turns out to be e = 3, just what
we got earlier. So it all checks out.
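A numerical check of both claims takes only a few lines; I truncate the infinite sums at a large cutoff of my choosing, since the tail terms are vanishingly small:

```python
q = 0.75   # p(t) = (1 - q) * q**t, the delay distribution from the snack
K = 2000   # truncation point for the infinite sums

total = sum((1 - q) * q**t for t in range(K))
expected = sum(t * (1 - q) * q**t for t in range(K))
print(round(total, 6), round(expected, 6))  # 1.0 and 3.0
```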
Over the last chapters and snacks, we noticed that a lot of different
mathematical fields play into statistics. On this and many other occasions, a
sound knowledge of algebra proved to be important in solving the problem.
In other cases, we required or will require the help of geometry and
calculus. To succeed at statistics it is vital not to neglect other fields of
mathematics.
Chapter Four: Poisson Distribution
Let's take this to be the introductory example. We know from looking at the
soccer team's history, that it produces goals with a mean rate of 2.4 goals
per game. Now we want to know how likely it is that during a particular
game it will not shoot any goal. Using the Poisson distribution we can
answer this question (and many more questions of this kind)
straightforwardly:
p(no goal) = 9 %
Here's the general formula to solve such problems. We are given an average
rate λ at which an event is occurring over a certain time span (goals per
game, accidents per year, mails per day). If the occurrence of the event is
random and independent of any previous occurrences, we can use this
formula to calculate the chance that it will occur k times during said time
span:
Going back to the above example, we wanted to know how likely it is for k
= 0 goals to occur during a game when the average rate is λ = 2.4 goals per
game:
You are probably wondering about the exclamation mark. What does it
mean to have a number followed by an exclamation mark? We call k! a
factorial and read "k factorial". Whenever we see this, we just multiply all
numbers down to one. For example: 3! = 3·2·1 = 6 or 5! = 5·4·3·2·1 = 120.
So nothing to worry about. Of course for 0! this doesn't work, it is defined
as 0! = 1. Keep that in mind.
Again you can easily find online calculators that do all the computing for
you. I recommend the "Stat Trek Poisson Distribution Calculator", which is
easy to use and displays cumulative probabilities. This can be very helpful
when answering questions featuring the phrases "at least" or "more than".
(●) Tornadoes
Statistics show that in the US state of New York there are on average five
tornadoes per year. How likely is it that during one year we find only two
tornadoes? What's the probability of more than five tornadoes occurring?
Let's turn to the first question. All we need as inputs for the Poisson
distribution is the average rate, in this case λ = 5, and the number of
occurrences, in this case k = 2. Plugging that into the formula gives us:
p(2 tornadoes) = e^-5 · 5^2 / 2! = 8.4 %
So the chance of only two tornadoes forming over a year is about 1 in 12.
This was the simpler of the two questions. What about the chance of having
more than five tornadoes? Since the Poisson distribution is infinite, we
shouldn't try to do this sum:
Continuing this path until we get to five and summing all the terms results
in:
p(5 or less tornadoes) = 0.616
Since the probability for five or less tornadoes and the probability for more
than five tornadoes must add up to one, we can quickly get the desired
result:
p(more than 5 tornadoes) = 0.384 = 38.4 %
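Both tornado results can be verified with a short Python sketch of the Poisson formula (helper name my own):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

lam = 5  # five tornadoes per year on average
print(round(poisson_pmf(2, lam), 3))  # 0.084, about 1 in 12

p_at_most_5 = sum(poisson_pmf(k, lam) for k in range(6))
print(round(1 - p_at_most_5, 3))      # 0.384 for more than five tornadoes
```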
Using the Stat Trek calculator, we could have gotten this probability much
quicker by simply typing in the random Poisson variable as 5 and looking
up the line for the cumulative probability X > 5.
The chance of having no heat wave during one year can be calculated easily
using the Poisson distribution with λ = 0.57 and k = 0:
We can use the multiplication rule to determine how likely it is that this
event occurs three times in a row.
So the chance for that happening is about 18 %. You might think to yourself
now, if there are on average 0.57 heat waves per year that must mean that
we should expect 1.71 heat waves per three years. So can't we just use the
Poisson distribution with λ = 1.71 and k = 0 to calculate in one step how
likely it is that no heat wave occurs within three years? We can certainly try.
Indeed we get the same result (neglecting the difference due to rounding
errors that occurred during the first approach). This is a beautiful property
of the Poisson distribution. When given a rate per year and asked to
calculate the likelihood of the event occurring a number of times over let's
say ten years, we can just compute the rate per ten years and do the exercise
in one take.
First we should extend the given rate to cover the desired area. If there are
1.2 foxes per km², that means we can expect 6 foxes per 5 km². To find the
chance of having no foxes within five square kilometers we use the Poisson
distribution with λ = 6 and k = 0:
Doing the sum gives us the probability of finding less than three foxes in a
5 square kilometer area:
p(less than 3) = 0.062 = 6.2 %
Why were we allowed to apply Poisson here? Remember, our condition for
using the distribution was that the events occur at random and
independently of each other. Foxes don't take preferential routes through the
woods and fields, thus their location at a certain time is rather random. On
top of that they are loners, roaming more or less independently of each
other. From that we can conclude that both our conditions are fulfilled
within acceptable boundaries. For animals forming herds (such as humans),
this would not have been the case. Our "roaming" is not Poisson
compatible.
(●●) Happy Customers
On average 4.6 customers arrive per hour in our office. We also know that
the average service time is 23 minutes. How many employees should we
hire so that the chance of serving the arriving customers without wait is 90
% or higher?
We can use the Poisson distribution and the information λ = 4.6 customers
per hour to make a table of how likely it is that a certain number of
customers arrive during a one hour period:
Including these values in the sum as well, we see that the chance of having
k = 0 to k = 7 customers per hour is 0.905 = 90.5 %, which means that
being able to serve seven customers per hour without wait would indeed
bring us up to the 90 % target.
For seven customers the expected total service time is 7 · 23 minutes = 161
minutes. So we need 161/60 = 2.7 employees or, since we can't have a
decimal number of employees, rather 3 employees to have a 90 % or more
chance of serving the arriving customers without wait.
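The staffing calculation can be automated. The Python sketch below (my own illustration, not from the book) finds the smallest customer count whose cumulative probability reaches 90 % and derives the number of employees from it:

```python
import math

def poisson_cdf(lam, kmax):
    # Cumulative probability of at most kmax events at rate lam
    return sum(math.exp(-lam) * lam**k / math.factorial(k)
               for k in range(kmax + 1))

lam = 4.6            # customers per hour
service_min = 23.0   # minutes of service per customer

# Smallest customer count k whose cumulative probability reaches 90 %
k = 0
while poisson_cdf(lam, k) < 0.9:
    k += 1

employees = math.ceil(k * service_min / 60)
print(k, round(poisson_cdf(lam, k), 3), employees)   # 7, ~0.905, 3
```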
I admit that this is a tricky problem, but it is also quite powerful. With slight
variations, we can use this approach to find a suitable value for the number
of telephone operators in a call-center, the number of docks to unload ships,
the number of salesmen in a store, and so on.
(●●●) Soccer
As the physicist Metin Tolan shows in his book "So werden wir Weltmeister", soccer results follow the Poisson distribution within acceptable boundaries. For a team that on average produces λ goals per game, the probability of scoring k goals in a particular game is:
p(k) = e^(-λ) · λ^k / k!
Let two teams with the respective rates λ(1) and λ(2) participate in a match
against each other. In a first approximation (which under more careful
analysis would certainly require revision to some extent) we can assume
that the teams will produce goals independently of each other. In this case,
we can use the multiplication rule to easily compute the probability for a
certain end-result k(1) : k(2). To achieve this result, team 1 must produce
k(1) goals, this happens with the probability:
p(1) = e^(-λ(1)) · λ(1)^k(1) / k(1)!
At the same time, team 2 needs to score k(2) goals. The Poisson distribution
tells us this happens with the probability:
p(2) = e^(-λ(2)) · λ(2)^k(2) / k(2)!
For ending up with the result k(1) : k(2), both these events must occur. So
the probability is:
p(result) = p(1) · p(2)
In the "1. Bundesliga" (which is the main German soccer league) the
average is about λ = 1.5 goals per game, with the top teams going as high as
λ = 2.5 and underdogs going as low as λ = 0.7. Let's make a test run using
these two extreme values and calculate the probability for three specific
outcomes.
We assign team 1 the number of λ(1) = 2.5 goals per game, team 2 will be
the underdog with only λ(2) = 0.7 goals per game. Let's first calculate the chance of seeing a boring 0:0 at the end of the game. Team 1 will score no goals with the chance:
p(1) = e^(-2.5) = 0.082
and likewise team 2 will score no goals with the chance:
p(2) = e^(-0.7) = 0.497
Using the multiplication rule, we can now compute the chance of seeing a match with no goals occurring:
p(0:0) = 0.082 · 0.497 = 0.041 = 4.1 %
Luckily for the viewer, this result is quite unlikely. Let's see how that
compares to a 1:0 victory for team 1. Using the same approach we find that:
p(1:0) = 0.205·0.497 = 0.102 = 10.2 %
This result is already more than twice as likely as the 0:0. What about the chance of the underdog claiming a 0:1 victory?
p(0:1) = 0.082 · 0.348 = 0.0285 = 2.9 %
As expected, it's much less likely than the 1:0 and, though not as easily
expected, slightly less likely than the 0:0. This way we could check any
result we desire. We can also ask grander questions like: what is the chance
of a tie? This unfortunately would require us to do an infinite number of
calculations since:
p(tie) = p(0:0) + p(1:1) + p(2:2) + ...
So how likely is a tie for our game? Let's evaluate some terms using the
above approach and a Poisson calculator.
- p(0:0) = 0.0407
- p(1:1) = 0.0713
- p(2:2) = 0.0314
- p(3:3) = 0.0060
- p(4:4) = 0.0007
Summing these terms (those beyond p(4:4) are negligibly small) gives p(tie) = ca. 0.15 = 15 %.
Let's look at all x:0 victories for team 1. In each of these games team 2
scores zero goals, so p(2) remains at 0.497, while p(1) changes with x.
Doing the product p(1) · p(2) for all x > 0 and then summing these terms
results in:
p(x:0, x > 0) = p(1 cumulative x > 0) · 0.497 = 0.456
So the chance for team 1 to achieve a x:0 victory against the underdog is
about 46 %. In a similar way we can write down the odds for all x:1
victories. Here team 2 always scores one goal, so p(2) = 0.3478 for all
possible results, while p(1) varies with x.
The odds for a x:1 victory are thus about 25 %. We can just keep on
applying this logic until we get to infinity, or, for practical reasons, until the
probability drops to a ridiculously low value. Of course at the end we add
all the chances for these outcomes to arrive at the approximate probability
of team 1 winning the game. We get:
p(victory team 1) = ca. 77 %
Already knowing the chance of a tie, we can now easily calculate the
chance of an underdog victory:
p(victory team 2) = 8 %
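These victory and tie probabilities can be cross-checked numerically by truncating the infinite sums at a sufficiently high goal count. A Python sketch of my own:

```python
import math

def poisson(lam, k):
    # Poisson probability of exactly k goals at rate lam
    return math.exp(-lam) * lam**k / math.factorial(k)

lam1, lam2 = 2.5, 0.7   # goals per game: favorite and underdog
N = 20                  # truncate the infinite sums at 20 goals per team

p_win1 = sum(poisson(lam1, a) * poisson(lam2, b)
             for a in range(N) for b in range(N) if a > b)
p_tie = sum(poisson(lam1, a) * poisson(lam2, a) for a in range(N))
p_win2 = sum(poisson(lam1, a) * poisson(lam2, b)
             for a in range(N) for b in range(N) if a < b)

# Roughly 77 %, 15 % and 8 %, matching the snack
print(round(p_win1, 2), round(p_tie, 2), round(p_win2, 2))
```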
As you can see, the Poisson distribution provides fantastic possibilities for
calculating probabilities in soccer and sports in general. Whenever we have
teams competing by producing goals or points at a certain rate, we can
apply the formula in first approximation, always keeping in mind though,
that it only holds exactly true if the rates at which the teams score said goals
or points are independent of each other. This is certainly not true under
more careful observation, but for soccer, analysis showed that the Poisson
distribution produces surprisingly reliable numbers. It is not far-fetched that this holds true for many other sports as well.
p(k) = e^(-λ) · λ^k / k!
The time it takes to complete the route will depend on the density. If the
road is clear, the travel time will be a minimum of n minutes. For every
additional car per km, this increases by a factor m. If for example every car
increases the travel time by 5 %, the factor will be m = 1.05. So we have an
exponential relationship between the density and the travel time:
t(k) = n · m^k
We'll not specify any numerical values for now. Given the Poisson
distributed variation in car density and this relation for the travel time, what
is the expected travel time? Note that we have a discrete probability
distribution for the travel time:
- t(0) minutes with the probability p(0)
- t(1) minutes with the probability p(1)
- t(2) minutes with the probability p(2)
And so on. To get the expected value, we do the weighted sum as introduced in the third chapter of the book:
t = Σk t(k) · p(k)
The sign Σ indicates a sum, so the notation on the right side means that we sum the expression over all k. So we get:
t = n · e^(-λ) · Σk m^k · λ^k / k!
Note that I wrote the constants not including k in front of the summation
sign, which is just factoring them out. The fact that we can do that will be
vital in solving this problem. Both m and λ have the exponent k, so we can
write them in one bracket:
t = n · e^(-λ) · Σk (m·λ)^k / k!
Note that except for the missing factor, the terms behind the summation
sign are the terms of a Poisson distribution with the rate m·λ. This enables
us to do a mathematical magic trick to make the sum disappear. We
multiply these terms by 1 (which certainly we are allowed to do), but we'll write the 1 as e^(m·λ) · e^(-m·λ). So we get:
t = n · e^(-λ) · e^(m·λ) · Σk e^(-m·λ) · (m·λ)^k / k!
Now look at the terms following the summation sign again. We included the
missing factor, so it's a sum over all terms of a Poisson distribution. And we
know that since the Poisson distribution is a probability distribution, all the
terms add up to one. Thus:
t = n · e^(-λ) · e^(m·λ)
After a lot of algebra and thought, we arrived at a neat formula for the
expected travel time depending on the average number of cars on the street
(λ), the travel time with no other cars on the road (n) and the factor that tells
us how the travel time increases with car density (m).
If for example there are usually λ = 15 cars per km, the free flow travel time is n = 30 minutes and each car increases the travel time by three percent, so m = 1.03, the expected travel time, given that the car density is Poisson distributed, is:
t = 30 · e^(-15) · e^(1.03·15) = 30 · e^(0.45) = ca. 47 minutes
Every year about one car-sized asteroid will hit the Earth and leave a
formidable crater. Surely this would be a spectacular, but also very
frightening event to witness. Let's do the math: how likely is it that one of these asteroids will impact within 0.5 kilometers of your house?
The Earth has a radius of about 6400 kilometers. With the respective formula for a sphere, we can use this value to calculate the surface area of Earth:
A = 4 · π · (6400 km)² = ca. 515,000,000 km²
This is the total area available for the asteroids to hit. A region of radius 0.5 km around your house has the area:
A = π · (0.5 km)² = ca. 0.79 km²
Doing the ratio gives the chance of such an impact happening within 0.5 km of your house over the course of one year:
p(hit 1 year) = 0.79 / 515,000,000 = ca. 1 in 650 million
So it's amazingly small. But since on average one asteroid of this size will
hit per year, the above number only covers your chance over one year. What
about the chance of such a hit over the course of a life? We approach this
the same way as in the snack “Homicide She Wrote”. We calculate the
chance of not being hit 70 times in a row and from that deduce the chance of a hit over a life span:
p(hit 70 years) = 1 – (1 – 1/650,000,000)^70 = ca. 1 in 9 million
You think that's a low probability? It is, the chance of being struck by
lightning is much higher, but tell that to the about 800 people of the 7000
million alive today who, according to the calculated odds, are expected to
actually witness such a close asteroid strike at some point in their lives.
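The whole asteroid calculation fits in a few lines. A Python sketch of my own reproducing the numbers above:

```python
import math

earth_radius = 6400.0                      # km
surface = 4 * math.pi * earth_radius**2    # total area available for hits
target = math.pi * 0.5**2                  # circle of radius 0.5 km around the house

p_year = target / surface                  # chance of a hit in any one year
p_life = 1 - (1 - p_year)**70              # at least one hit over a 70 year life

witnesses = 7_000_000_000 * p_life         # expected witnesses among 7000 million
print(1 / p_year, 1 / p_life, witnesses)   # ~1 in 650 million, ~1 in 9 million, ~750
```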
In this snack we focused on car-sized asteroids because they have the
practical habit of appearing once a year on average. But once we look at
bigger or smaller asteroids, the impact frequency, and thus the above
calculated probabilities, will change. The picture below shows how the
impact frequency varies with the asteroid diameter.
Note that the scale of the y-axis is logarithmic, which will always produce a
much clearer picture if the quantity in question varies over a wide range of
magnitudes. A side note for the curious: the formula for the red line relates the average time T between two impact events (in years) to the diameter D (in meters) of the asteroid.
T = 0.018 · D^2.72
Let x be the square's side length. The total area of the square is then A(total)
= x². But the length of the circle's diameter is also x. Using a formulary we
find that the area of the circle is π/4 times the square of the diameter:
A(circle) = π/4 · x². To find the probability that the randomly dropped
needle falls within the circle, we simply compute the ratio of the desired
area to the total area:
p(within circle) = A(circle) / A(total) = π/4
Rounded to four digits, this probability is 0.7854, which is roughly three out
of four. What about if we halve the diameter of the circle while keeping the
square at the same size? Try this calculation on your own. You should arrive
at p(within circle) = π/16.
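If formulas are not convincing enough, a Monte Carlo simulation can verify both ratios. In this Python sketch of my own, the "needle drops" become uniformly random points in a unit square:

```python
import math
import random

random.seed(42)

def hit_ratio(diameter_fraction, trials=200_000):
    # Drop points uniformly in a unit square and count those inside a
    # centered circle whose diameter is the given fraction of the side
    r = diameter_fraction / 2
    hits = sum(1 for _ in range(trials)
               if (random.random() - 0.5)**2 + (random.random() - 0.5)**2 <= r * r)
    return hits / trials

print(hit_ratio(1.0))   # approaches pi/4  ~ 0.785
print(hit_ratio(0.5))   # approaches pi/16 ~ 0.196
```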
When you cross the ocean, the region visible to you will approximately
form a rectangle with the length 12500 km and the width 2 · 50 km = 100
km. Try to picture this rectangle extending from Sydney to Seattle in your
mind. So the part of the Pacific Ocean that you are able to observe on your
journey has the area: 12500 km · 100 km = 1.25 million km².
Doing the ratio of the observable area to the total area gives us the chance of finding the island:
p = 1,250,000 km² / 156,000,000 km² = ca. 0.008 = 0.8 %
Not bad actually, considering how incredibly vast the Pacific Ocean is. But
having a visibility of 50 km all the way through the ocean is rather
unrealistic. It's probably going to be considerably lower for some portions
of the trip. But for the sake of simplicity, let's stick to the 50 km visibility.
What distance do I need to cover within the Pacific Ocean to get a 50/50
shot at spotting the island? How long will this take if my ship makes 20
knots? We assume the route will be taken in such a way that the forming
“band of visibility” does not overlap and except for small portions we'll be
traveling in straight lines.
With these assumptions, the sighted area is about d · 100 km, with d being
the distance covered. Now I want to choose d in such a way that the
resulting probability is 0.5. The best way to do this is to set up and solve
this equation:
0.5 = d · 100 km / 156000000 km²
d = 780000 km
This is a little more than 60 times the distance from Sydney to Seattle.
Since 20 knots correspond to 37 km per hour, the journey would take us
21100 hours or 880 days. With ideal visibility throughout and a constant
speed of 20 knots, we can see half the Pacific Ocean in about two and a half
years. More realistic values of visibility would bring us up to five years or
so. Hopefully, we'll come across the island at some point during this time.
This is a problem that's best solved with the help of geometry. In the image
you can see a plot of the situation. The x-axis represents the waiting time,
the y-axis the service time. Both can go up to 30 minutes. All combinations
of waiting and service time must fall somewhere into the square set by this
limit.
Now we would like to know what the probability of x+y being smaller than
15 is. In the image you can see the line x+y = 15. This line separates the
acceptable and unacceptable combinations. Pick any point in the shaded
region (for example x = 5 and y = 5) to check this logic. You'll see that all such points satisfy the condition x + y < 15.
Computing the probability now just means doing the ratio A(shaded) /
A(total). The marked square has an area of A(total) = 30·30 = 900. For the
shaded triangle we remember that the area equals one-half times base times
height, so A(shaded) = 0.5·15·15 = 112.5. Our chance of getting in and out
within fifteen minutes is thus:
p( x + y < 15 ) = 0.125 = 12.5 %
So we shouldn't get our hopes up too much. Problems that can be solved in
such a way are often very similar. Usually we are given two random
variables that can vary within a certain range. Always make a plot to
visualize the problem and try to find the line that separates desired from
undesired outcomes. Once you managed to make this separation, you either
get the answer using simple geometric formulas or, if you're not lucky,
integral calculus.
(●●) Think Fast
After several beers, George decided to measure his reaction times using a
computer program. Several tries later, he concluded that it takes him
between 0.75 s and 1.50 s to react to a signal. He also managed to talk his
(sober) wife Tina into testing her reaction times as well. As expected, she
did better, her reaction time ranged between 0.50 s and 1.25 s.
In direct competition, how likely is it that George will react faster than
Tina? To answer that, we do a plot. We will take the x-axis to be the
reaction time of George and the y-axis that of Tina. You can see that the
possible combinations of reaction times have been marked by a square.
Make sure to study the graph carefully.
The line of equal reaction times has been included as well. It separates the
desired from undesired outcomes. We are interested in the combinations
that lie in the shaded region. Here George's reaction time is smaller than
Tina's.
To confirm this, pick any point in the shaded region and go straight to the x-
axis, this will give you the reaction time of George. Then go from the same
point straight to the y-axis to get the corresponding reaction time for Tina.
You will see that for all points within the shaded area, this leads to a smaller
reaction time for George.
With that in mind, we can now easily compute the chance of George
reacting faster by doing the ratio of shaded area to total area. The total area
is the area of the square in which all possible combinations lie. Since all
sides have the length 0.75, we get: A(total) = 0.75² = 0.5625.
The shaded region is a right triangle with a base and height of length 0.5.
Remembering that the area for such a triangle is one half times base times
height, we get A(shaded) = 0.5·0.5·0.5 = 0.125. All that's left is doing the ratio:
p(George faster) = 0.125 / 0.5625 = 0.222 = 22.2 %
Suppose you travel to work by train and at one point have to transfer to
another train. You are on train A, which is scheduled to arrive at the station
at exactly 9:00. Once arrived, you have to transfer to train B, which is
scheduled to leave the station at 9:10. So between the trains there's a 10
minute wait.
Let's generalize that. Let D(A) be the delay of train A, D(B) the delay of
train B and w the scheduled waiting time in-between. In order to reach train
B in time, the relation D(A) < D(B) + w must be satisfied. Now let the
delays vary randomly between zero and a maximum value m. What is the
probability of missing train B?
It is helpful to look at this problem graphically. In the below plot the x-axis
represents the delay of train A and the y-axis the delay of train B. Both are
only allowed to vary between 0 and m, so all the possible combinations of
the delays must lie within the marked square.
The relation D(A) = D(B) + w marks the border between acceptable and
unacceptable delays. If D(A) is larger than D(B) + w, we miss train B,
otherwise we catch it. This relation is drawn in the graph as well, it is the
straight line that separates the unshaded from the shaded region.
The total area is easy to calculate, it is the area of the square, so A(total) =
m². What about A(shaded)? You might remember that we can calculate the
area of a right triangle by multiplying 0.5 with the base and height. Let's get
these dimensions for the shaded area.
The base extends from w to m on the x-axis, so the length is just m-w. We
can see from the mathematical relation that the line has the slope one, so
when going 1 unit to the right, the line will go 1 unit up. Similarly, going m-
w units to the right results in the line going m-w units up. From this we can
conclude that the height must have the length m-w as well, leading to
A(shaded) = 0.5 · (m-w)².
All that's left is to do the ratio A(shaded) / A(total). The probability for
missing train B, given that the delays vary randomly between zero and a
maximum value m and there's a scheduled waiting time w in-between, is
(after a little algebraic manipulation):
p(missing B) = 0.5 · (1 – w/m)^2
Talk about a tough nut to crack. Don't worry if you don't get all of it on first
try, it is a very demanding problem. But it beautifully shows what the
geometric approach is capable of.
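The train result can also be checked by simulating random delays. A Python sketch of my own; the values m = 30 and w = 10 are hypothetical:

```python
import random

random.seed(1)

def p_miss_exact(m, w):
    # Geometric result from above: 0.5 * (1 - w/m)^2
    return 0.5 * (1 - w / m)**2

def p_miss_simulated(m, w, trials=200_000):
    # Draw both delays uniformly from [0, m] and count missed connections,
    # i.e. cases where D(A) > D(B) + w
    misses = sum(1 for _ in range(trials)
                 if random.uniform(0, m) > random.uniform(0, m) + w)
    return misses / trials

m, w = 30.0, 10.0   # hypothetical: delays up to 30 minutes, 10 minute transfer
print(p_miss_exact(m, w), p_miss_simulated(m, w))   # both ~0.22
```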
Let's look at another challenging problem we can solve using the geometric
approach. This time we'll keep it less general. In a flood two variables are
important in determining the severity: the area of the flooded region F and
the average density of houses D in the affected region. Ideally the flooded
area is strongly limited (small F) and the flood occurs in an uninhabited
region (small D). Worst case is a large flood hitting a densely populated
region.
For our problem, we will let the flood area F vary between 0 and 500 square
kilometers and the housing density D vary between 0 and 15 houses per
square kilometer, both randomly and independently of each other.
Let's make an F-D-coordinate system and include the limits set for each
quantity. All possible combinations of F and D lie in the rectangle spanned
by these limits.
This time the separating line between severe and not severe is a hyperbola:
FD = 2500. If the product FD is higher than 2500, we speak of a severe
flood, otherwise not. The separating line, as well as all combinations
considered severe (shaded), are included in the graph.
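Under the stated uniform and independent assumptions, the chance of a severe flood can be computed. Here is a Python sketch of my own, combining the exact area above the hyperbola with a Monte Carlo cross-check:

```python
import math
import random

random.seed(7)

F_MAX, D_MAX, THRESHOLD = 500.0, 15.0, 2500.0

# Exact area below the hyperbola F*D = 2500 inside the 500-by-15 rectangle:
# a full-height strip up to F = 2500/15, plus the integral of 2500/F beyond it
f_corner = THRESHOLD / D_MAX
below = THRESHOLD + THRESHOLD * math.log(F_MAX / f_corner)
p_exact = 1 - below / (F_MAX * D_MAX)

# Monte Carlo cross-check with F and D uniform and independent
trials = 200_000
severe = sum(1 for _ in range(trials)
             if random.uniform(0, F_MAX) * random.uniform(0, D_MAX) > THRESHOLD)
p_mc = severe / trials

print(round(p_exact, 3), round(p_mc, 3))   # both ~0.30
```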
Data on flood sizes show that our assumption is too crude. In reality, there
is a power law at work, quite similar to the Gutenberg-Richter Law for
earthquakes: the bigger the event, the less frequent its occurrence.
Researchers found this relationship between the flood discharge q
(measured in cubic meters per second) and the frequency f (measured in
events per year):
f = a / q^b
with the exponent b usually being around 2. So a flood twice the size of a certain reference value will occur only one-fourth as often. To make the
problem more realistic, we would need to take this dependence into account
and solve the arising multiple integrals mentioned above. This
unfortunately is beyond the scope of this book.
Chapter Six: Bayes and Bias
For example when given two random events A and B, rather than
calculating p(A) and p(B) straightforward, we calculate the probability of A
occurring, given that B has already occurred. This is called a conditional
probability, symbolized by p(A|B) and read in short as "probability of A
given B". As long as you read the symbol | as “given”, you will always be
able to figure out what is meant by the expression.
p(D) = 1/4
Assume we chose to search island A and thus failed to locate the treasure.
What's the chance of choosing the correct island given that we already ruled
out island A? Since there are three islands left to search we get:
p(D|A) = 1/3
As you can see, the occurrence of the event “choosing island A” changed
the odds of the event “choosing island D” to our favor. Of course the
probabilities p(D|B) and p(D|C) have the same numerical value.
With this in mind, we can take a look at a typical Bayes problem with a
deliciously surprising result. After that we'll turn to the general formula for
solving such problems. If you never dealt with Bayes before, it'll probably
take some getting used to, so choose a comfortable pace for reading.
Assume that 10000 people took the test. At the beginning of the previous
paragraph, we noted that about 1 % of all people have this disease. So
the most likely scenario is that in our sample 0.01 · 10000 = 100 have the
disease and 0.99 · 10000 = 9900 don't.
Let's put the focus on the 100 people who have the disease. The test will
recognize the disease in 0.98 · 100 = 98 of them. What about the 9900 who
don't have it? With the given rate of false recognitions, the test will identify
the disease in 0.02 · 9900 = 198 of them.
Now we can draw our conclusions. In total 296 were identified as having
the disease, but only 98 actually have it. So the chance of someone having
the disease, given that the device recognized it, is just 98 / 296 = 33.1 %.
We might as well just flip a coin!
The test seemed quite accurate, what happened? One problem is that the
disease is so rare. If you test the general public, the number of people not
having the disease will greatly outnumber the small group that has it. So
even a very low rate of false positives will strongly impact the overall
results.
Many Bayes problems are quite similar to our introductory example. There
is a certain probability p(A) of one event A occurring. This event and its
complement B will take the center stage in the problem. Both A and B can
lead to a certain consequence C with the respective probabilities p(C|A) and
p(C|B). These probabilities don't need to add up to one as they did in the
introductory example. Make sure to study the below image to understand
this set-up before reading on.
There are two ways for C to occur, via A or via B. So it's a legitimate question to ask for the probability of A, given that C occurred. In mathematical terms, we are looking for p(A|C). Bayes' theorem gives the answer:
p(A|C) = p(C|A)·p(A) / (p(C|A)·p(A) + p(C|B)·p(B))
Let A be the event "disease", B the event "no disease" and C be the event
"disease recognized". The disease occurred with a probability p(A) = 0.01
in the population. The probability that the disease is correctly recognized
was p(C|A) = 0.98, the chance for false recognitions was p(C|B) = 0.02.
These are all the inputs we need:
p(A|C) = 0.98·0.01 / (0.98·0.01 + 0.02·0.99) = 0.0098 / 0.0296 = 0.331 = 33.1 %
Sure enough, we arrive at the same result. The snacks will provide more
opportunities to apply this grand formula.
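Bayes' theorem for two complementary events translates directly into code. A small Python helper of my own, applied to the disease test:

```python
def bayes(p_a, p_c_given_a, p_c_given_b):
    # p(A|C) for two complementary events A and B leading to consequence C
    p_b = 1 - p_a
    return p_c_given_a * p_a / (p_c_given_a * p_a + p_c_given_b * p_b)

# The disease test: p(A) = 0.01, p(C|A) = 0.98, p(C|B) = 0.02
print(round(bayes(0.01, 0.98, 0.02), 3))   # ~0.331
```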
Let's devote one quick snack to the new notation introduced in this chapter.
Picture a hard drive with three folders: A, B and C. Each folder has 10
subfolders. We agree that A1 stands for the first subfolder in folder A, A2
for the second subfolder in folder A, and so on. Now we want to locate a
file which, unbeknown to us, sits in folder B6. Since there are 3·10 = 30
subfolders in total, the chance of randomly choosing the desired subfolder
B6 is:
p(B6) = 1/30
What about the chance of choosing the correct subfolder, given that we
already expanded folder B? Since folder B contains 10 subfolders, one of
which holds the file we are seeking, we get:
p(B6|B) = 1/10
Given instead that we expanded folder A, the file cannot be found there at all:
p(B6|A) = 0
Of course this only holds true if we are not allowed to revise our decision to
expand a certain folder. I hope this short bit helps you to get used to the
notion of conditional probability.
Assume that 10 % of all drivers are notorious speeders. Now the police set
up a radar trap. Speeders will fall into this trap with a 70 % chance, while
this is true for only 10 % of law abiding drivers (they might have missed the
limit sign or temporarily paid too little attention). What is the chance that a
driver is a notorious speeder, given that the radar caught him?
First we need to identify the events correctly. The word “given” in the
statement of the problem serves as a pointer to event C, the consequence.
So we take C to be the event "fall into trap". With the first sentence telling
us that 10 % are speeders, we should set A to be the event "speeder" and B
the event "law abiding driver". That means that p(A) = 0.1, p(C|A) = 0.7
and p(C|B) = 0.1.
All that's left is to apply Bayes' theorem to get p(A|C), the probability of the
event "speeder" given the event "fall into trap". Let's plug in the numerical
values:
p(A|C) = 0.7·0.1 / (0.7·0.1 + 0.1·0.9) = 0.07 / 0.16 = 0.44 = 44 %
So only about half the drivers that got caught by the trap are actually
notorious speeders. The others were law abiding citizens who just happened
to have missed a sign or were not paying enough attention.
Let's identify the events. C is obviously the event "tired at work" and we
should set A to be "party night" and B to be "regular night". In this case,
p(A) = 3/7 = 0.43, p(C|A) = 0.9 and p(C|B) = 0.1. Make a diagram similar
to the ones above to feel comfortable with the problem. Now we can find
out how likely it is that he had a long night of partying, given that he's tired
at work:
p(A|C) = 0.9·0.43 / (0.9·0.43 + 0.1·0.57) = 0.87 = 87 %
So our remark that he should take it easy with partying when we see him sleeping at the desk is not unfounded. In about 9 of 10 cases we have drawn the correct conclusion.
But beware, what if he's not as much of a party animal as we suspected
and often has a hard time sleeping? We'll let him party only one night a
week, keep the 90 % chance of being tired after a night of partying and
increase his chance of being tired after a regular night to 30 %. With these
inputs, the chance that he had a long night of partying when being tired at
work is:
p(A|C) = 0.9·0.14 / (0.9·0.14 + 0.3·0.86) = 0.33 = 33 %
That goes to show that we should be careful to judge. Bayes really is about
finding bias and avoiding it. The next snack will hopefully serve as a
powerful reminder to that.
The location of the word “given” tells us that we should take event C to be
“suspect identified as black”. Our two center stage events will then be A =
black person committed crime and B = white person committed crime. As
for the numbers, we are given the percentage of black people in the
population. Assuming no inclination towards this crime in any form, we
have p(A) = 0.1. Since the witness is correct 90 % of the time, we get
p(C|A) = 0.9 and p(C|B) = 0.1.
Now we can apply Bayes' theorem to determine how likely it is that the
criminal really is black given that the witness identified him as being so. We
get:
p(A|C) = 0.9·0.1 / (0.9·0.1 + 0.1·0.9) = 0.09 / 0.18 = 0.5 = 50 %
So the statement is, at least with respect to race, no more helpful than the
flip of a coin. How can this be considering the high reliability of the
witness? What sorcery is this? Let's break it down using some numbers.
(●●) Updates
One thing that makes the Bayesian approach so powerful is that we can
update probabilities as the events come in. Consider the following example.
We are watching a Poker game and know that about 20 % of the players are
skilled and the remaining 80 % unskilled. The chance of a skilled player
winning is 60 %, whereas an unskilled player wins only 20 % of his games.
See the image below for a visualization of this situation.
What is the chance of a player being skilled, given that he won a round? We
can apply Bayes' theorem directly to get the answer:
p(skilled|1 win) = 0.6·0.2 / (0.6·0.2 + 0.2·0.8) = 0.43 = 43 %
This shows that we cannot conclude from one win alone that the player is
skilled. But what if he wins two rounds in a row? That should increase the
chances of the victorious player being skilled. From the given information,
we can deduce that the chances of a skilled player winning twice in a row is
0.6² = 0.36, whereas an unskilled player only has the odds 0.2² = 0.04. This
is just applying the multiplication rule from chapter one. We visualize this
updated situation.
We use Bayes' theorem yet another time to determine the chance of a player being skilled, given that he accomplished two wins in a row:
p(skilled|2 wins) = 0.36·0.2 / (0.36·0.2 + 0.04·0.8) = 0.69 = 69 %
Repeating the procedure for a third consecutive win gives:
p(skilled|3 wins) = 87 %
Here we can feel relatively safe to conclude that the player indeed is skilled.
Out of 100 players who manage to get three in a row, we expect only 13 to
be lucky, unskilled players. We could just go on like this. Do the
appropriate calculations to confirm that:
p(skilled|4 wins) = 95 %
p(skilled|5 wins) = 98 %
Feel free to set up a similar scenario for a game you are interested in and
compute how many wins it would take to deduce beyond reasonable doubt
that the victorious player is skilled.
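The repeated-wins updates can be generated for any number of wins in one go. A Python sketch of my own (parameter names are mine):

```python
def posterior_skilled(n_wins, p_skilled=0.2, p_win_skilled=0.6, p_win_unskilled=0.2):
    # Chance that a player is skilled after n_wins wins in a row, via
    # Bayes' theorem with the win probabilities raised to the n-th power
    pa = p_win_skilled**n_wins * p_skilled
    pb = p_win_unskilled**n_wins * (1 - p_skilled)
    return pa / (pa + pb)

for n in range(1, 6):
    print(n, round(posterior_skilled(n), 2))   # 0.43, 0.69, 0.87, 0.95, 0.98
```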
In the snack “Innocent Drivers” we saw that with the given estimates only
44 % of those falling into the radar trap are actually notorious speeders. So
saying “I wasn't going too fast on purpose” provides a believable excuse
when getting caught. But what about if a driver gets caught three times in a
row? Is the excuse still believable then? Or can we say beyond reasonable
doubt that this driver is a speeder?
For a speeder, the chance of getting caught thrice is 0.7³ = 0.343, for a law
abiding driver we get 0.1³ = 0.001. So the odds of a driver being a speeder, given that he or she got caught three times, are:
p(speeder|caught thrice) = 0.343·0.1 / (0.343·0.1 + 0.001·0.9) = 0.97 = 97 %
So no, after three catches the excuse is not believable anymore.
(●●●) Swerving
We can extend Bayes' rule to cover more than two initial events. Consider
the graph below, here we have three events A, B and C, all being able to
result in event D with the respective probabilities p(D|A), p(D|B) and
p(D|C).
For this set-up, Bayes' theorem takes a slightly different form. The chance
for event A, given that D occurred, is:
p(A|D) = p(D|A)·p(A) /
(p(D|A)·p(A) + p(D|B)·p(B) + p(D|C)·p(C))
For reasons of limited space the denominator is written in the second line. If
you compare that to the formula we worked with so far, you'll see that,
except for the event names being changed, the only difference is that we
have an additional term in the denominator.
Let's make an example, again involving law enforcement. The police set up
a check point and pull out all drivers who have been spotted swerving. Just
before putting up the check point, one of the policemen looks up relevant
driver statistics.
They show that of all the drivers, 2 % are intoxicated above limit, 8 %
intoxicated below limit and 90 % sober. Studies show that drivers
intoxicated above limit swerve with a 40 % probability, drivers intoxicated
below limit with a 15 % probability and all others with a 2 % probability.
What is the chance that a swerving driver is intoxicated above the limit? Applying the extended formula:
p(A|D) = 0.4·0.02 /
(0.4·0.02 + 0.15·0.08 + 0.02·0.9) = 0.008 / 0.038 = 0.21 = 21 %
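The extended formula generalizes to any number of initial events, which is easy to express in code. A Python sketch of my own:

```python
def bayes_multi(priors, likelihoods):
    # Posterior for each hypothesis given the observed consequence,
    # generalizing Bayes' theorem to any number of initial events
    total = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / total for p, l in zip(priors, likelihoods)]

# Above limit, below limit, sober -- with their swerving probabilities
posteriors = bayes_multi([0.02, 0.08, 0.90], [0.40, 0.15, 0.02])
print([round(p, 2) for p in posteriors])   # [0.21, 0.32, 0.47]
```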
(●) In Chains
We are given two mutually exclusive states, A and B (see sketch below).
During each step one of the two states is taken. If state A is taken, there's a
probability p that state B will follow. This of course also means that the
chance for remaining in state A is 1-p. On the other hand, if state B is taken,
the chance of state A following is q and the chance of remaining in state B
is thus 1-q.
This was the set-up of the problem. The question that is to be answered is:
What is the overall probability of having state A or state B occur? Or in
other words: Of the total time, what percentage will be spent in state A or B? Symbolizing these percentages by p(A) and p(B), the general formulas to solve this problem are:
p(A) = q / (p+q)
p(B) = p / (p+q)
If that was a little too abstract up to this point, don't worry, the example will
make it clear. Consider this: I know that if I exercise one day, the chance of
me not exercising the next day is p = 0.3 = 30 %. On the other hand, if I
don't exercise one day, the odds of me going back to training the next day
are q = 0.2 = 20 %. Try to visualize this situation similar to the picture given above. The event A is “exercise” and event B is “not exercise”. Plugging into the general formula gives:
p(A) = 0.2 / (0.3 + 0.2) = 0.4 = 40 %
So that's about three days a week I will exercise, not so bad actually. Note
that since the probabilities of switching states are relatively low, the result
will be some days of training followed by some days of laziness rather than
jumping from exercise to not exercise with each new day. This is something
the formula does not capture.
Another Markov quickie at the end: when I write, I finish about five pages
per day. A day of writing is followed with a 30 % chance by a day of not
writing. On the other hand, if I don't get anything done one day, the chance
of getting back to writing the next day is 40 %. How long will it take to
finish a book of 160 pages this way?
Using the general formulas, we can say that with this behavior I'll be
writing 4 / 7 days and not writing the remaining 3 / 7 days. So I'm getting
done 4·5 = 20 pages per week, which means that finishing the book will
take me 160 / 20 = 8 weeks.
You might be wondering how to get from the Markov chain to the general
formula. Without resorting to matrix algebra, this question is a little
complicated to answer, but it can be done. So we know that in the long run
I'll be in state A with a probability p(A). We can arrive there in two ways:
from A with a probability of p(A)·(1-p) or from B with a probability
p(B)·q. Now the probability of arriving in A must be p(A), so:
p(A) = p(A)·(1-p) + p(B)·q (1)
We also know that I'm either in state A or state B, there are no other
possibilities. So the respective probabilities p(A) and p(B) must add up to
one:
p(A) + p(B) = 1 (2)
The rest is algebra. Solve equation (1) for p(B) and insert the resulting
expression for p(B) in equation (2). Then solve for p(A) and you'll end up
with the general formula.
Of course Markov chains don't end here. This was just a quick peek into a
very rich field. The number of states can be anything you like, given that
you have the skills and computational power to back it up. A sound
knowledge of matrix algebra is a must for tackling Markov chains with
more than two states as general formulas get exponentially longer when you
include more states.
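If you'd like to watch these stationary percentages emerge, here's a small simulation sketch in Python (the language, function name and step count are my choices, not the book's) for the exercise example with p = 0.3 and q = 0.2:

```python
import random

def markov_share(p, q, steps=50_000, seed=42):
    """Simulate the two-state chain and return the fraction of time spent in A.
    p: chance of leaving A for B, q: chance of returning from B to A."""
    random.seed(seed)
    state_a = True
    count_a = 0
    for _ in range(steps):
        if state_a:
            count_a += 1
            state_a = random.random() >= p  # leave A with probability p
        else:
            state_a = random.random() < q   # return to A with probability q
    return count_a / steps

# Exercise example: p = 0.3, q = 0.2 → p(A) = 0.2 / 0.5 = 0.4
print(round(markov_share(0.3, 0.2), 2))
```

The simulated fraction lands very close to the q / (p+q) = 0.4 predicted by the general formula.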
When you buy a food product, you can find the weight of the contents on
the box. One manufacturer for example claims that his box of cookies
weighs 100 grams. To test the reliability of this claim, we take a sample of
10 boxes and weigh them. Here are the results: 102 gr, 105 gr, 98 gr, 101 gr,
94 gr, 98 gr, 103 gr, 105 gr, 101 gr and 97 gr.
Our first job will be to calculate the arithmetic mean, usually symbolized by
μ. To do that, we simply add all the values and divide by the number of
measurements:
μ = 1004 / 10 = 100.4 grams
So the mean indeed was very close to the 100 grams claimed by the
manufacturer. But additionally to that, we are interested in knowing how
strongly the weights are expected to spread around this mean. How likely is
getting a 90 gram box for example? Is that something we can expect to
happen often?
To answer such questions, we compute for each measured value m the quantity:
x = (m – μ)²
It is simply the square of the difference between the observed and mean
value. Once we did that for all the measurements, we can get the standard
deviation from this formula:
s = square root ( (x1 + x2 + …) / (n – 1) )
with n being the total number of measured values (in our case 10). Let's
calculate the standard deviation now for the weights we observed. For the
first and second measurement we get:
x1 = (102 – 100.4)² = 2.56
x2 = (105 – 100.4)² = 21.16
Doing the same for the remaining eight measurements and adding up all the values gives a sum of 116.4. With n = 10 we thus arrive at:
s = square root ( 116.4 / 9 ) = 3.6 grams
For a normally distributed quantity, these percentages of all values are expected to fall within the respective intervals around the mean:
- 1 sigma interval (μ ± s): 68.3 %
- 2 sigma interval (μ ± 2·s): 95.5 %
- 3 sigma interval (μ ± 3·s): 99.7 %
For the “2 sigma interval” we take two standard deviations around the
mean. Using the calculated values we get: μ + 2· s = 107.6 grams and μ - 2·
s = 93.2 grams. Our estimate thus is that 95.5 % of all boxes fall within this
range. As you can see, the standard deviation indeed does provide a lot of
information about how the weights spread.
I also mentioned that the above table only holds exactly true when the
quantity in question is normally distributed. If the quantity is distributed in
another way, we can only use the table as a first approximation. As a rule of
thumb one can say that when the observed values are strongly skewed
(significantly more than half of the observed values left or right of the
mean), one should not assume a normal distribution.
The standard deviation should not be confused with the standard error SE,
which rather measures how reliable our result for the mean is. Remember
that we got μ = 100.4 grams as the mean for our sample. Since our sample
was small, there's a good chance that the true mean μ(true) will somewhat
differ from that. Can we say to what extent? The standard error is computed by dividing the standard deviation by the square root of the sample size:
SE = s / square root ( n ) = 3.6 / square root ( 10 ) = 1.14 grams
As a rule of thumb, the true mean will lie within two standard errors of the sample mean, so here within the interval 100.4 ± 2.3 grams.
Despite only having a sample size of 10 boxes, we were able to narrow the
mean down to a relatively small interval. Note how the standard error varies
with the sample size n. Assuming the standard deviation stays relatively
constant when expanding the sample, the standard error will halve when the
sample size increases fourfold. With 40 boxes we could bring the interval
down to about ± 1.14 grams.
So remember that while the standard deviation gives information about the
spread around the mean, the standard error helps us to deduce how close
our sample mean is to the true mean. Both are helpful pieces of information to have for
any sample.
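All of the numbers above can be reproduced in a few lines. Here's a sketch in Python (the variable names are mine):

```python
import math

weights = [102, 105, 98, 101, 94, 98, 103, 105, 101, 97]  # the ten boxes

mean = sum(weights) / len(weights)
# Sample standard deviation: sum of squared deviations, divided by n - 1
s = math.sqrt(sum((w - mean) ** 2 for w in weights) / (len(weights) - 1))
# Standard error of the mean: s divided by the square root of n
se = s / math.sqrt(len(weights))

print(mean)          # 100.4
print(round(s, 1))   # 3.6
print(round(se, 2))  # 1.14
```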
In this snack we'll take a look at how to test a hypothesis statistically. This
is usually done with the Chi-Square test. For it to work, we need to be given
an expected distribution and an observed distribution. From these two we
can calculate the so called Chi-Square χ². It is a measure of how the
expected distribution differs from the observed one.
Imagine we are observing a four-lane road. For 1000 passing cars we'll note
which lane (lane 1, lane 2, lane 3 and lane 4) each car was driving on. Our
assumption is that there is no preferred lane, so we expect the number of
cars being uniformly distributed over the four lanes:
- lane 1: 250 cars
- lane 2: 250 cars
- lane 3: 250 cars
- lane 4: 250 cars
During the Chi-Square test, we check if the null hypothesis holds true. We
can formulate the null hypothesis as such: The deviations between expected
and observed values are not statistically significant.
The first step is to calculate the Chi-Square χ². We symbolize the expected
frequencies by e1, e2, … (in our case all have the value 250) and the
observed frequencies by h1, h2, … For each class, which in our case are the
lanes, we calculate this quantity:
x = (h – e)² / e
These quantities are then added up to give the Chi-Square:
χ² = x1 + x2 + …
Let's do this for the above example. For the first lane we had the expected
value e1 = 250 and the observed value h1 = 238. This is all we need to
compute the quantity x1:
x1 = (238 – 250)² / 250 = 0.58
In the same way we calculate the other inputs needed for the Chi-Square.
So we get a total of:
χ² = x1 + x2 + x3 + x4 = 2.14
Now we know how to calculate the Chi-Square, but what to do with it? To
continue, we first need to know the degrees of freedom for our set-up.
When the expected values are uniformly distributed, as in our example,
the degrees of freedom f is just the number of classes n minus one. So we
get: f = 4 – 1 = 3.
How to find the relevant critical value in the table? First we go to the row
determined by the degrees of freedom, for us, that's row 3. Then we go to
the column determined by the probability 0.05 (for starters, always use this
column). The intersection of this row and column is the critical value we
were looking for: c² = 7.82.
If the Chi-Square is below this critical value, the deviations are not
statistically significant and we accept the hypothesis. In the example we got
χ² = 2.14 and c² = 7.82, so indeed the Chi-Square is smaller than the critical
value. We can accept our hypothesis and conclude that the observation
supports our assumption of the cars being uniformly distributed over the
four lanes.
Let's sum up the process of statistical hypothesis testing in several steps.
Our starting point is always an expected and an observed distribution as
well as the null hypothesis, which states that the deviations between the two
are not statistically significant. Then we proceed as such:
1. Compute Chi-Square
2. Determine degrees of freedom
3. Look up critical value in table
4. Compare Chi-Square and critical value
One thing to keep in mind though is that the Chi-Square test should not be
directly used if one of the expected or observed frequencies is equal to or
smaller than 5. In this case you either have to merge classes to produce
frequencies larger than 5 or use an alternative hypothesis test, like Fisher's
exact test.
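The whole test fits into a short program. In this Python sketch, only the lane 1 count of 238 cars comes from the example above; the counts for the other three lanes are made-up values, chosen merely to illustrate the computation:

```python
def chi_square(observed, expected):
    """Sum of (h - e)² / e over all classes."""
    return sum((h - e) ** 2 / e for h, e in zip(observed, expected))

expected = [250, 250, 250, 250]
# Lane 1 (238 cars) is from the text; the other three counts are
# hypothetical, chosen only so that all four add up to 1000 cars.
observed = [238, 262, 247, 253]

chi2 = chi_square(observed, expected)
critical = 7.82  # table value for f = 3 degrees of freedom at the 0.05 level

print(round(chi2, 2))   # 1.22
print(chi2 < critical)  # True → deviations not statistically significant
```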
Let's look at a certain type of crime, say robberies. Now a statistic shows
that of all the robberies in the country, 80 % have been committed by
natives and 20 % by immigrants. Can we conclude from these numbers that
the immigrants are more inclined to steal than the natives? Many people
would do so.
The police keep basic records of all crimes that have been reported. This
enables us to get a closer look at the situation. Consider the graph below; it
shows the age distribution of people accused of robbery in Canada in 2008.
It immediately becomes clear that it is for the most part a "young person's
crime". The rates are significantly elevated for ages 14 – 20 and then
decrease with age. Even without crunching the numbers it is clear that the
younger a population is, the more robberies will occur.
Let's go back to our fictional country of 90 % natives and 10 % immigrants,
with the immigrant population being younger. Assuming the same
inclination to committing robberies for both groups, the immigrant
population would contribute more than 10 % to the total amount of
robberies for the simple reason that robbery is a crime mainly committed by
young people.
Using a simplistic example, we can put this logic to the test. Let's stick to
our numbers of 90 % natives and 10 % immigrants. This time however,
we'll crudely specify an age distribution for both. For the native population
the breakdown is:
- 15 % below age 15
- 15 % between age 15 and 25
- 70 % above age 25
For the immigrant population, which is younger, the breakdown is:
- 20 % below age 15
- 20 % between age 15 and 25
- 60 % above age 25
We'll set the total population count to 100 million. Now assume that there's
a crime that is committed solely by people in the age group 15 to 25. Within
this age group, 1 in 100000 will commit this crime over the course of one
year, independently of what population group he or she belongs to. Note
that this means that there's no inclination towards this crime in any of the
two groups.
It's time to crunch the numbers. There are 0.9 · 100 million = 90 million
natives. Of these, 0.15 · 90 million = 13.5 million are in the age group 15 to
25. This means we can expect 135 natives to commit this crime during a
year.
As for the immigrants, there are 0.1 · 100 million = 10 million in the
country, with 0.2 · 10 million = 2 million being in the age group of interest.
They will give rise to an expected number of 20 crimes of this kind per
year.
In total, we can expect this crime to be committed 155 times, with the
immigrants having a share of 20 / 155 = 12.9 %. This is higher than their
proportional share of 10 % despite there being no inclination for
committing said crime. All that led to this result was the population being
younger on average.
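The number crunching above can be condensed into a few lines of Python (the variable names are mine):

```python
population = 100_000_000
rate = 1 / 100_000  # yearly chance of committing the crime, ages 15-25 only

natives_15_25 = 0.9 * population * 0.15     # 13.5 million
immigrants_15_25 = 0.1 * population * 0.20  # 2 million

native_crimes = natives_15_25 * rate
immigrant_crimes = immigrants_15_25 * rate
share = immigrant_crimes / (native_crimes + immigrant_crimes)

print(round(native_crimes), round(immigrant_crimes))  # 135 20
print(round(share * 100, 1))                          # 12.9
```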
This is not Menzerath's Law just yet. Consider this linguistic order:
syllables, words, subordinate clauses, clauses, paragraphs and texts. These
are linguistic elements ordered by their size. We looked at words and found
out that the longer they are, the shorter their syllables. Interestingly enough,
statistical analysis showed that the longer the subordinate clause, the
shorter its words. And the longer the clauses, the shorter the subordinate
clauses. There seems to be a deeper relationship at work.
Menzerath's Law states that, on average, the longer the linguistic element,
the shorter the linguistic elements that compose it. If x is the length of the
linguistic element and y the length of its components, then:
y = a · x^b
with a and b being constants that vary with the linguistic elements involved
and language. In order to have a decrease, b must be smaller than zero.
Gabriel Altmann, a pioneer of quantitative
linguistics, analyzed 10000 words from American English and found that
the best fit is produced by the curve:
y = 4.09 · x^(-0.36)
with x being the length of a word (measured in syllables) and y the average
length of the syllables that compose it (measured in letters). The data points
from Altmann's analysis as well as the corresponding fit can be seen in the
image below.
The formula also agrees quite well with the randomly picked words from
the introduction of the snack. For words of x = 2 syllable length, we would
expect the syllables to be on average y = 3.2 letters long, while for words
with x = 7 syllables the formula predicts the syllables to be y = 2 letters
long.
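The two predictions can be checked by simply evaluating the fitted curve, sketched here in Python (the function name is mine):

```python
def avg_syllable_length(x, a=4.09, b=-0.36):
    """Menzerath fit: average syllable length in letters for an x-syllable word."""
    return a * x ** b

print(round(avg_syllable_length(2), 1))  # 3.2
print(round(avg_syllable_length(7), 1))  # 2.0
```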
The above image nicely shows what regression analysis does. It can find the
best curve (we'll specify a little later what is meant by best) through a
number of points in a coordinate system. Consider this crude home
experiment I did a while ago. Equipped with a sound level meter and a
ruler, I dropped a small wooden sphere onto the ground from different
heights to see how the maximum sound level during impact varies with the
impact speed (which you can calculate easily from the height neglecting air
resistance).
Here are the first five of eleven data points I collected. Each point is the
average of twenty measurements. The twenty repetitions were done to make
sure that random fluctuations (for example in the drop height) do not distort
the results.
The goal now is to deduce from the data points a formula that connects the
impact speed v with the sound level s. Most of the work will be done by the
computer, but before it can do anything, one must choose a general form for
the relationship. This formula, with its several still undetermined
parameters, is called the ansatz.
To make a good ansatz, you should be familiar with what the typical graphs
of common functions (linear, quadratic, power, exponential, Boltzmann,
trigonometric) look like. Then you can make an educated guess about which
class of functions will most likely produce a good fit. For my data, I chose a
power function in its most basic form:
s = a · v^b
with the yet to be determined parameters a and b. Now it's the computer's
turn. It will determine the values of the parameters so that the best fit is
produced. But what exactly is meant by best fit?
Imagine I did an approximate fit by hand and for 0.99 m/s the formula spits
out the value 29.12 dB instead of the measured 28.56 dB. So the fit differs
from the observed value at this point by 0.56 dB. This difference is called a
residual r. For each data point, the computer calculates this residual and
squares it. The sum of all the squared residuals is a measure of how strongly
the fit differs from the real data.
S = sum( r² )
The computer now chooses the values for the parameters a and b that
minimize this sum. This process is called the “method of least squares” and
leads to what is considered the best fit. For my data points and ansatz, the
choices a = 29 and b = 0.43 minimized the sum of the squared residuals to S
= 0.40. All other combinations of a and b produced a sum bigger than that.
As you can see, the fit turned out quite well. Some deviations are
always acceptable when working with real-world data, no fit can ever be
perfect down to the last detail. A number that is often used to characterize
the goodness of the fit is the adjusted r-square. The closer it is to one, the
better the fit. For doing fits, I recommend the program OriginPro or, as a
free but less sophisticated alternative, the website xuru.org.
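For readers without fitting software, here's a Python sketch of the least-squares idea. Since the measured data points aren't reproduced here, the sketch generates sample points directly from the fitted curve s = 29 · v^0.43 and recovers the parameters. It also takes a shortcut: the power law is fitted as a straight line in log-log space, which is a simplification of minimizing the squared residuals in the original units:

```python
import math

# Illustrative (speed, sound level) pairs generated from s = 29 · v^0.43;
# the 0.99 m/s value appears in the text, the other speeds are made up.
speeds = [0.99, 1.4, 1.98, 2.43, 2.8]           # impact speeds in m/s
levels = [29 * v ** 0.43 for v in speeds]       # sound levels in dB

# For the ansatz s = a · v^b, taking logarithms gives the linear model
# log s = log a + b · log v, which ordinary least squares solves directly.
xs = [math.log(v) for v in speeds]
ys = [math.log(s) for s in levels]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = math.exp(my - b * mx)

print(round(a, 2), round(b, 2))  # recovers a = 29.0, b = 0.43
```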
When we want a text to be finished quickly, we tend to type faster. But the
faster you type, the more likely you are to make a typing error. So the time
gained by entering more words per minute (WPM) could be eaten up by the
time lost to correcting typing errors. How much do we gain (or maybe even
lose) when we increase typing speed?
To answer this question, we first need data on how the typing error
probability varies with speed. Scientists from Vanderbilt University and
Brooklyn College of CUNY collected data on just this relationship. You can
see the data in the graph below.
As expected, the error probability quickly increases with speed. If you
double the speed from the average of 40 WPM to 80 WPM, the error
probability goes up threefold from a mere 8 % (one typing error per twelve
words) to about 25 % (one typing error per four words).
Let's focus on the quantity “correct words per minute” (CWPM). It depends
both on the WPM and the error probability, which I will symbolize by p. A
little thought leads to this mathematical relationship:
CWPM = (1 – p) · WPM
Using the above plot, we calculate this quantity for several speeds. For
example, when writing at average speed WPM = 40, our error probability is
going to be about p = 0.08. According to the formula, this translates into 37
correct words per minute. At WPM = 50, we have p = 0.1 and thus 45
correct words per minute. We can do this for several other speeds and
compile a table:
- WPM = 40 → CWPM = 37
- WPM = 50 → CWPM = 45
- WPM = 60 → CWPM = 50
- WPM = 70 → CWPM = 55
- WPM = 80 → CWPM = 59
As you can see, the increase in error probability does not eat up the gain,
but it noticeably dampens the progress. We can use the data for the error
probability to do a Boltzmann fit (with the common sense constraints that p
goes to zero as we decrease typing speed and p goes to one as we increase
typing speed) and to expand the table, leading to these surprising results:
- WPM = 90 → CWPM = 57
- WPM = 100 → CWPM = 54
- WPM = 110 → CWPM = 49
For such high speeds (which skilled typists are indeed able to reach),
the rate of correct words actually decreases for the average writer. In this
region the time lost due to errors overtakes the gain by increased typing
speed. This shows that faster typing does not necessarily mean faster
progress. Rather there's an optimum speed, which the data suggests to be at
about twice the regular speed.
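The CWPM formula is easily turned into code. In this Python sketch, the error probabilities are the approximate values read off the graph above:

```python
def cwpm(wpm, p):
    """Correct words per minute for typing speed wpm and error probability p."""
    return (1 - p) * wpm

# Error probabilities are approximate values read off the graph.
for wpm, p in [(40, 0.08), (50, 0.10)]:
    print(wpm, round(cwpm(wpm, p)))  # 40 37, then 50 45
```

The remaining table entries follow in the same way from the error probabilities at the higher speeds.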
This problem is inspired by the great book "Duelling Idiots and Other
Probability Puzzlers" by Paul J. Nahin. In it, he looked at a duel during
which the contestants take turns firing at each other and found that the
dueller getting the first shot is always slightly more likely to be victorious.
In this variation, we'll have the contestants fire at each other simultaneously.
Dueller 1 has the probability of p1 of hitting the opponent with a shot, for
dueller 2 this number is p2. What is the chance of dueller 1 winning?
He wins after one round if he hits his opponent but his opponent doesn't hit
him (if both hit each other, there is no winner). This happens with the
chance:
p(1) = p1 · (1-p2)
For him to win after two rounds, he needs to miss the first shot and get the
second shot, while his opponent misses both. This occurs with the chance:
p(2) = (1-p1) · p1 · (1-p2)²
A win occurs after the third round if he misses the first two times and
manages to get a hit with the third shot. Again, his opponent needs to miss
all the shots. The probability for this particular outcome is:
p(3) = (1-p1)² · p1 · (1-p2)³
To the keen eye, the progression now becomes visible. The probability for a
win after the n-th round is:
p(n) = (1-p1)^(n-1) · p1 · (1-p2)^n
To get the odds for moron 1 to be victorious, we need to sum all these
terms. Unfortunately, this gives us an infinite sum which cannot be
evaluated (to my knowledge) analytically.
But given numerical values, we can always get as close to the result as we
like by evaluating a certain number of terms. If the probabilities p1 and p2
are not too small, the chance for the game to end after a large number of
rounds is so small that it can be ignored with good conscience. Our
numerical example and the following analytical treatment will show that.
Let's set p1 = 0.5 and p2 = 0.4, so dueller 1 has a slight edge. We compute
his chance of being victorious. The odds for him winning after a certain
number of rounds are:
- p(1) = 0.300
- p(2) = 0.090
- p(3) = 0.027
- p(4) = 0.008
- p(5) = 0.002
- p(6) = 0.001
As you can see, the chance of dueller 1 winning after the sixth round is
already quite slim. So we feel comfortable ending here and giving our
approximate result by summing these terms:
p(1 victorious) = ca. 43 %
Now you might wonder why his chance is smaller than 50 %, but remember
that with this set-up, ties are possible. So to really compare the dueller's
chances, we need to compute the probability of dueller 2 being victorious as
well. Since it's rather random which of the two we call moron 1 and moron 2,
we can just swap the probabilities and use the above formulas. So only for
the moment, we set p1 = 0.4 and p2 = 0.5 to find the chance of dueller 2
winning the match:
- p(1) = 0.200
- p(2) = 0.060
- p(3) = 0.018
- p(4) = 0.005
- p(5) = 0.002
- p(6) = 0.001
p(2 victorious) = ca. 29 %
Since either one of the duellers wins or both hit each other, the chance of a tie is what remains:
p(tie) = ca. 28 %
In about 1 in 4 duels there would be no winner. The morons would just end
up shooting each other at the same time. Just a side note for the
mathematically curious: the problem is actually solvable in analytical form
if p1 is fixed at 0.5. This leads to:
p(1) = 0.5 · (1-p2)
Abbreviating this quantity by q, the probabilities take a particularly simple form:
p(1) = q
p(2) = q²
p(3) = q³
And so on. Again we do the sum to find out how likely it is for dueller 1 to
win. This is simply:
p(1 victorious) = q + q² + q³ + …
If you read the snack "Delay Revisited" in chapter three, you know that this
sum can be written in a compact form:
p(1 victorious) = 1 / (1-q) – 1
Note the minus one at the end. This is because the first term usually
included in the sum (q⁰ = 1) was missing. Inserting q = 0.5 · (1-p2) results in:
p(1 victorious) = 1 / (1 – 0.5 · (1-p2)) – 1
Agreed, not a particularly beautiful formula, but it works. We can use this
formula now to check if our approximate answer to dueller 1 winning the
match was accurate. We had p1 = 0.5 (as required by the formula) and p2 =
0.4. Inserting this value gives us:
p(1 victorious) = 1 / (1 – 0.5 · 0.6) – 1 = 0.4286
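The series and the closed form can both be evaluated with a few lines of Python (the function name and the cut-off at 1000 rounds are my choices):

```python
def p_win(p1, p2, rounds=1000):
    """Chance of dueller 1 winning within the given number of rounds."""
    return sum((1 - p1) ** (n - 1) * p1 * (1 - p2) ** n
               for n in range(1, rounds + 1))

print(round(p_win(0.5, 0.4), 4))                 # 0.4286
print(round(1 / (1 - 0.5 * (1 - 0.4)) - 1, 4))   # 0.4286, the closed form
```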
The idea for this problem came to me while doing the numerical simulation
for the snack "Country Roads". We calculated an expected travel time of 54
minutes for the highway route. With only 100 loops in the simulation per
trial, the simulated average travel time always deviated somewhat from the
theoretical value. We got 53.5 minutes, then 55.1 minutes and after that
54.3 minutes.
I thought to myself: "When the computer spits out 55.1 minutes and I,
knowing the theoretical average is rather 54 minutes, predict the next
simulated value to be below the displayed 55.1 minutes, how likely is it that
I'm correct given I keep on applying this strategy?" Good question. Let's
simplify this a little while turning it into a problem for us to solve.
We start with a computer program that will display one of these four
numbers with the respective probabilities:
- 1 with probability 0.2
- 2 with probability 0.4
- 3 with probability 0.3
- 4 with probability 0.1
So the program is most likely to display a 2, but given enough trials, will
have displayed each of the numbers. We can interpret this as a discrete
probability distribution and compute the expected value:
e = 1·0.2 + 2·0.4 + 3·0.3 + 4·0.1 = 2.3
The strategy: whenever the displayed number is smaller than the expected
value 2.3, I predict that the next number will be larger; whenever it is
bigger, I predict a smaller number. How likely is it that my predictions will
be correct using this strategy? Let's
look at each number individually. If it displays a 1, my prediction comes
true if the next number is a 2, 3 or 4. The chance for this to happen is (using
the table):
q(1) = 0.8
If the 2 comes up, my prediction for a larger number is correct if in the next
trial the numbers 3 or 4 come up. So:
q(2) = 0.4
Once the 3 comes up, I'll predict a lower number since 3 is bigger than the
expected value. So my guess is that either a 1 or a 2 will show:
q(3) = 0.6
And when the 4 shows, I'll predict a lower number, so the numbers 1, 2 and 3 make my prediction come true:
q(4) = 0.9
Now suppose we run 1000 trials, during which the number 1 will show about
0.2 · 1000 = 200 times. Since I know that using this strategy I'm right 80 %
of the time when the number 1 shows, its 200 appearances should give me 160 correct
predictions. Similarly, the number 2 provides me with 160 successes, the
number 3 with 180 successes and the number 4 with 90 successes. So in
total, I can expect to get 590 correct predictions during 1000 trials:
p(correct) = 59 %
Thus our prediction strategy is certainly better than just random guessing;
knowing the expected value can be quite helpful.
You might have one last question: why choose to perform 1000 trials? Isn't
it rather random to just choose any value? It is, but had I chosen to make
100 trials or only 10 trials it wouldn't have affected the result. When doing
the ratio at the end, the random choice of number of trials cancels out. This
is especially clear when solving the problem algebraically, without fixing
any specific numerical values.
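The 59 % can also be computed compactly. This Python sketch (the names are mine) encodes the distribution and the prediction strategy:

```python
probs = {1: 0.2, 2: 0.4, 3: 0.3, 4: 0.1}
expected = sum(k * p for k, p in probs.items())  # 2.3

def success_chance(shown):
    """Chance that the strategy's prediction for the next number is correct."""
    if shown < expected:   # predict a larger number next
        return sum(p for k, p in probs.items() if k > shown)
    else:                  # predict a smaller number next
        return sum(p for k, p in probs.items() if k < shown)

# Overall success rate: weight each case by how often the number shows up.
overall = sum(p * success_chance(k) for k, p in probs.items())
print(round(overall, 2))  # 0.59
```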
Some time ago I came across a website that offered a certain number of
articles. Whenever you clicked "next", it would randomly display one of the
articles. This was somewhat annoying because after a while you had seen
some of the articles over and over and you had to keep on clicking several
times to get to a new one. But it does bring up a nice and easily stated
statistical problem.
Given that there are N articles in total, what is the probability of seeing n
different articles with n clicks? So we are asking for the chance of no article
appearing twice when clicking "next" n times. Of course n must be smaller
than N for this to be possible. We can't see more new articles than there
actually are.
Let's keep this general and approach it by looking at each click individually,
assuming we never get an article twice. For the first click, there's a 100 %
chance for being displayed an article that didn't come up before:
p(1) = 1
Now 1 out of N articles is already seen by us, which means N-1 are still
undisplayed. The chance of getting a new article on second click is thus:
p(2) = (N-1) / N
You can guess how this goes on. With only N-2 articles still unseen, the
third click provides us with this probability of seeing a new article:
p(3) = (N-2) / N
Note that the number in the bracket on the right side is always one smaller
than the one in the bracket on the left, which indicates the number of clicks.
So with the n-th click there is this chance of getting a new article:
p(n) = (N-n+1) / N
To always see a new article up to the n-th click, we want all of the above
events to occur. Using the multiplication rule, the probability for that is:
p(only new) = p(1) · p(2) · … · p(n)
This basically solves the problem. But let's apply some algebra to arrive at a
much nicer formula for p(only new). The factorial notation as introduced in
chapter three allows us to write the above formula in a very compact way
after inserting all the terms for p(1), p(2), and so on:
p(only new) = N! / (N^n · (N-n)!)
p(only new) = 30 %
So the chances are very slim for that to happen. As a side note for the
mathematically curious: the formula for the binomial coefficient C(N,n) also
involves factorials in a very similar manner. It is:
C(N,n) = N! / (n! · (N-n)!)
Multiplying both sides by n! yields:
C(N,n) · n! = N! / (N-n)!
Multiplying both sides of the formula for p(only new) with N^n gives us the
same expression on the right side:
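The compact formula for p(only new) is easy to turn into code. This Python sketch evaluates it; the numeric values here are purely illustrative, not the book's own example:

```python
from math import factorial

def p_only_new(N, n):
    """Chance that n clicks on a site with N articles show n different articles."""
    return factorial(N) / (N ** n * factorial(N - n))

# Illustrative value: with N = 10 articles and n = 4 clicks,
# the chance of seeing four different articles is about 50 %.
print(round(p_only_new(10, 4), 3))  # 0.504
```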
p(overlap) = n(overlap) / n(total)
with n(overlap) being the number of possibilities for the marks to overlap
and n(total) the total number of possibilities for two people to place two
marks on a square with n² fields.
n(total) = n² · n² = n⁴
How many possibilities are there for the marks to overlap? Each field
provides exactly one possibility for an overlap, so:
n(overlap) = n²
Thus, the probability that two people choose the same field at random is:
p(overlap) = n² / n⁴ = 1 / n²
In the 5 x 5 grid pictured above, the chances of accidental overlaps are 1/5²
= 1/25 or 4 %. We'll run a computer simulation of this set-up to see if it
checks out. For this purpose, we create six integer variables, x(1), y(1),
x(2), y(2), l and o. x(1) and y(1) represent the coordinates of the field
chosen by one person, x(2) and y(2) those of the field chosen by the other.
During a loop, each of these four variables is assigned a random value
between one and five. Then the coordinates are compared. The variable l
keeps track of the total number of loops, the variable o records the number
of overlaps. The probability of overlaps occurring by chance is then
computed by o/l.
We run three trials with ten million loops each. Here are the results rounded
to two decimal places:
- 4.08 %
- 3.89 %
- 4.06 %
As you can see, it agrees very well with our result above; the deviation is no
more than 3 % of the theoretical value. This means we can take the problem
to the next level by including a third person. Again we start with a grid
having n² fields and each person randomly choosing a field to mark. What
is the chance of an overlap occurring?
We will approach this the same way, by counting possibilities and doing the
ratio. The total number of possible outcomes is relatively easy to derive:
Each person has n² choices for placing the mark. That leads to:
n(total) = n² · n² · n² = n⁶
An overlap can occur in many different ways: only A and B overlap, only A
and C overlap, only B and C overlap or all three overlap.
Assume A and B choose the same field, but C chooses another. How many
possibilities are there for that to happen? A and B have n² choices, C has
(since one field is now taken) n² – 1 choices. So the number of ways, in
which only A and B overlap, is:
n(A and B overlap) = n² · (n²-1)
Due to symmetry, the possibilities are the same for two other cases: only A
and C overlap and only B and C overlap. So the number of ways for two
people to choose the same field and one to choose another is:
n(two overlap) = 3 · n² · (n²-1)
All that's left is to find out how many possibilities we have for all three to
overlap. Each field in the grid provides exactly one possibility for that, so:
n(three overlap) = n²
In total, the number of possibilities for an overlap of any kind is thus:
n(overlap) = 3 · n² · (n²-1) + n²
By doing the ratio n(overlap) / n(total), we can find out what the probability
is that an overlap will occur when three people randomly choose one of n²
fields. After some algebra, we get this formula:
p(overlap) = 3 / n² – 2 / n⁴
In the 5 x 5 grid pictured above, the odds of overlapping are 0.117 or about
11.7 %. About, but not exactly, three times the result we got for two people.
We run a computer simulation to check our result. The procedure is similar
to the program above, we just need some additional variables to include the
third person and to keep proper track of overlaps. The results of the three
trials with ten million loops each are:
- 11.61 %
- 11.73 %
- 11.67 %
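A stripped-down version of the simulation, covering both the two-person and the three-person case, can be sketched in Python as follows (the loop count is reduced here for speed, and the function name is mine):

```python
import random

def overlap_chance(n, people, loops=200_000, seed=1):
    """Monte Carlo estimate of the chance that at least two of the randomly
    placed marks land on the same field of an n x n grid."""
    random.seed(seed)
    overlaps = 0
    for _ in range(loops):
        marks = [(random.randint(1, n), random.randint(1, n))
                 for _ in range(people)]
        if len(set(marks)) < people:  # some field was chosen more than once
            overlaps += 1
    return overlaps / loops

print(round(overlap_chance(5, 2), 3))  # theory: 1/25 = 0.040
print(round(overlap_chance(5, 3), 3))  # theory: 3/25 - 2/625 ≈ 0.117
```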
1) A white field will become black at the next step if at least one
neighboring field is black (has the flu).
2) A black field will become white again (the person recovers) at the next
step with a probability q.
These two rules of local interaction can already produce interesting and
beautiful behavior during the simulation: clusters forming, clusters joining,
fronts propagating, and so on. Of course you need to rely on a computer to
do the large number of computations necessary to get from one distribution
to the next. But if the rules are not too complex, there are some things we
can deduce by hand.
Let's stay with the above example. The absence of any resistance means that
the flu wave could just go on forever. We will denote the percentage of
black fields by p and the total number of fields by N.
Now we can easily compute the chance of having at least one infected
neighbor:
p(infected neighbor) = 1 – (1-p)⁸
With it, the expected percentage of infected fields at step k+1 follows from the percentage at step k:
p(k+1) = p(k) + (1 – (1-p(k))⁸) · (1-p(k)) – q · p(k)
Note that since N appeared as a factor in each term, I just divided both sides
by N to get rid of it. The above formula shows the overall behavior that is to
be expected, but it is not exact as the distribution of the black and white
fields is usually not statistically homogeneous.
With this equation we can take a look at possible steady states, that is, when
p(k+1) = p(k) (with some fluctuations). Since we don't have to worry about steps
here, we'll just write p for the percentage of infected fields. This leads to the
equation:
p = p + (1 – (1-p)⁸) · (1-p) – q · p
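The steady-state equation can also be solved numerically. This Python sketch (the bisection approach and the function names are mine) rearranges it to "new infections minus recoveries" and finds the non-trivial root for an illustrative value q = 0.5, q being the parameter in the q · p term of the equation above:

```python
def net_change(p, q):
    """New infections minus recoveries per step, from the steady-state equation."""
    return (1 - (1 - p) ** 8) * (1 - p) - q * p

def steady_state(q, lo=1e-6, hi=1.0, iters=60):
    """Bisection for the non-trivial root of net_change (p = 0 always solves it)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if net_change(mid, q) > 0:
            lo = mid  # infections still outweigh recoveries, p keeps growing
        else:
            hi = mid
    return (lo + hi) / 2

# For q = 0.5, roughly two thirds of the fields stay infected in equilibrium.
print(round(steady_state(0.5), 3))
```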
If you're interested in cellular automata, make sure to check out the Game
of Life, published by the British mathematician John Conway in 1970. It
quickly became popular and remains so today, with many websites and
blogs dedicated to documenting and researching the game. Based on a few
simple rules, the simulation gives rise to many interesting and unexpected
objects, such as the Blinker, Beacon, Pulsar, Glider, and so on. Some initial
patterns become static after a number of iterations, while others grow
indefinitely. You can easily find applets to try out your own initial patterns
on the internet.
Unfortunately for the players, this logic failed. The chances for black
remained at 0.474, no matter what colors appeared so far. Each spin is a
complete reset of the game. The same goes for coins. No matter how many
times a coin shows heads, the chance for this event will always stay 0.5. An
unlikely string will not alter any probabilities if the events are truly
independent.
with T being the average global temperature and P the number of pirates.
Given enough pirates (about 3.3 million to be specific), we could even
freeze Earth.
But of course nobody in their right mind would see causality at work here;
rather we have two processes, the disappearance of piracy and global
warming, that happened to occur at the same time. So you shouldn't be too
surprised that the recent rise of piracy in Somalia didn't do anything to stop
global warming.
Though not a fallacy in the strict sense, the combination of a low probability
and a high number of trials is also a common cause of incorrect conclusions.
We computed that in roulette the odds of showing black twenty-six times in
a row are only 1 in 270 million. We might conclude that it is basically
impossible for this to happen anywhere.
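A quick back-of-the-envelope calculation shows why that conclusion fails. The casino and spin counts below are rough illustrative assumptions, not real statistics:

```python
p = (18 / 38) ** 26          # probability that a given spin starts 26 blacks
print(1 / p)                 # roughly 1 in 270 million

# Assumed for illustration: 1000 roulette tables, 300 days, 400 spins a day.
spins_per_year = 1000 * 300 * 400
expected = spins_per_year * p
print(expected)              # a sizable fraction of an occurrence per year
```

Even with these modest assumptions, the "impossible" streak is expected to show up somewhere in the world every few years, so observing it once is no miracle at all.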
(●) Inflation
This is an excerpt from the book "Business Math Basics - Practical and
Simple" by Metin Bektas, available on Amazon for Kindle and other e-
reading devices.
There's no denying it: things get more expensive. This happens in all
economies and almost every year. At moderate rates, this increase in price
level is not alarming. The picture below shows the inflation rates for the US
from 1991 to 2012. Only in 2009, shortly after the financial crisis, did
prices actually fall.
What reasons are there for inflation to occur? One way of answering this
question is to take the monetarist approach and focus on the so-called
Equation of Exchange. It will help us to easily identify the culprit.
Also important is the velocity of money V. It tells us how often each dollar
(bill) is used over the course of a year. This quantity depends on the saving
habits of the people in the economy. If they are keen on saving, the bills
will only pass through a few hands each year, thus V is small. On the other
hand, if people love to spend the money they have, any bill will see a lot of
different owners, so V is large. For the introductory example, we'll set V =
5.
Note that the product of these two quantities is the total spending in the
economy. If there are M = 100 billion $ in the economy and each dollar is
spent V = 5 times per year, the total annual spending must be M · V = 500
billion $. This conclusion is vital for understanding the Equation of
Exchange.
There are two more quantities we need to look at, one of which is the price
level P. It tells us the average price of a good in the economy. If there's
inflation, this is the quantity that will increase. Let's assume that in our
fictitious economy the average price of a good is P = 25 $.
Last but not least, there's the number of transactions T, which is just the
total number of goods sold over the entire year. We'll fix this to T = 20
billion for now and draw another very important conclusion.
The product of these last two quantities is the total sales revenue in the
economy. If the average price of a good is P = 25 $ and there are T = 20
billion goods sold in a year, the total sales revenue must be P · T = 500
billion $. It is no accident that the total sales revenue equals the total
spending. Rather, this equality is the (reasonable) foundation of the
Equation of Exchange.
For the total spending to equal the total sales revenue, this equation must
hold true:
M · V = P · T
which is just the Equation of Exchange. Now think about what will happen
if we increase the money supply M in the economy, for example by printing
money or government spending. We'll assume that the spending habits of
the people remain unchanged (constant V). Since we increased the left side
of the equation, the total spending, the right side of the equation, the total
sales revenue, must increase as well.
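We can put numbers on this thought experiment with a few lines of Python, using the example economy from above (T is chosen so that the total sales revenue P · T matches the 500 billion $ of total spending; the 20 % increase in M is an arbitrary illustrative choice):

```python
M = 100e9   # money supply in $
V = 5       # velocity of money
T = 20e9    # number of transactions (goods sold per year)

P = M * V / T          # Equation of Exchange solved for the price level
print(P)               # 25.0 $, matching the example

# Print more money while V and T stay fixed:
P_new = (1.2 * M) * V / T
print(P_new)           # about 30 $ - the price level rises by the same 20 %
```

With V and T held constant, any percentage growth in M passes straight through to the price level.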
One way this can happen is via an increase in price level P (inflation).
Indeed, empirical evidence shows that in the US, increases in the money
supply have typically been followed by a rise in inflation later on.
Luckily there's another quantity on the right side which can absorb some of
the growth in money supply. A rise in the number of transactions T
(increased economic activity) following the "money shower" will dampen
the resulting inflationary drive. On the other hand, a combination of more
money and less economic activity can lead to a dangerous, Weimar-style
hyperinflation.
Of course there can be much more trivial causes for inflation than a
growing money supply. Prices are determined by an equilibrium of supply
and demand. If demand drops, retailers have to lower their prices to sell off
their stocks. Conversely, if demand suddenly increases, retailers will be
able to set higher prices, resulting in inflation. This happens, for example,
when a new technology comes along that quickly rises in popularity.
Appropriately, this kind of price level growth is called demand-pull
inflation.
(●) Hurricanes
In this section we are going to do just what the title says, that is, compute
hurricanes. The great formula that accomplishes this, called the Rankine
formula, is little known among physicists and mathematicians; most are
not aware of its existence. But that doesn't make it any less useful.
For starters, we will assume this pressure difference to be constant over the
life of a hurricane. At a later point we will relax this condition, allowing the
calculations to include strengthening and weakening hurricanes. But for
now, we only care about two quantities: the distance from an observer to the
center of the storm r (any unit of length will do as long as we are consistent)
and the wind speed v at this distance.
The Rankine formula states that this expression is conserved as the
hurricane changes position:
v · r^0.6 = constant
Our strategy will be: first we use current data (a distance and a wind speed)
to compute the constant, then we are able to get an estimate for the wind
speed at any distance. Note that this equation tells us that when we triple the
distance to the center of the hurricane, the wind speed roughly halves (since
3^0.6 ≈ 1.9).
----------------------
20 · 600^0.6 ≈ 930
Now we can set up an equation for the maximum wind speed. Since we
inputted the speed in mph, the result will be in the same unit.
v · 100^0.6 ≈ 930
v · 16 ≈ 930
v ≈ 58 mph
----------------------
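The two steps above translate directly into code. In this Python sketch the distances and speeds are the ones from the example:

```python
r0, v0 = 600, 20        # current distance (miles) and wind speed (mph)
C = v0 * r0 ** 0.6      # the conserved quantity, about 930
print(C)

r1 = 100                # closest approach (miles)
v1 = C / r1 ** 0.6      # wind speed there, in mph
print(v1)               # about 58 mph
```

Once the constant is fixed from current data, the wind speed at any other distance is a single division away.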
Again the approaching hurricane is 600 miles away with current wind
speeds of 20 mph. The pressure difference between the center and
surroundings of the hurricane at this point is about 60 mb. During its
approach, it will come as close as 100 miles and is expected to strengthen
to 80 mb. What is the maximum wind speed v we can expect? For a
strengthening or weakening hurricane, the conserved expression becomes
v · r^0.6 / √p, so the current data gives:
20 · 600^0.6 / √60 ≈ 120
At the closest approach we thus have:
v · 100^0.6 / √80 ≈ 120
v · 1.8 ≈ 120
v ≈ 67 mph
----------------------
It is important to note that all of the equations only hold true outside the eye
of the storm (which is usually about 20 to 40 miles in diameter). The
maximum wind speed in a hurricane is reached at the wall of the eye. Inside
the eye, wind speeds drop sharply. It is, so to speak, "the calm within the
storm" and can make for a quite eerie experience.
We'll draw one last conclusion before moving on. The size of the eye is
more or less a constant. This implies that the maximum wind speed within a
hurricane grows with the square root of the pressure difference. So if the
pressure difference quadruples, the maximum wind speed will
approximately double. Real-world data confirms this conclusion within
acceptable boundaries. As an estimate for the maximum wind speed in a
hurricane you can use this formula:
v(max) ≈ 16 · √p
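As a closing check, this short sketch evaluates the estimate and confirms the quadrupling rule stated above (the 60 mb input is just a sample value):

```python
import math

def v_max(p):
    # Estimated maximum wind speed in mph for a hurricane with a
    # pressure difference of p millibars between center and surroundings.
    return 16 * math.sqrt(p)

print(v_max(60))               # about 124 mph
print(v_max(240) / v_max(60))  # ratio of about 2: quadrupled pressure, doubled speed
```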
Title:
http://commons.wikimedia.org/wiki/File:Typing_monkey_768px.png
Chapter One:
http://pixabay.com/en/flat-icon-symbol-math-mathematics-27155/
http://www.algebralab.org/lessons/lesson.aspx?file=Algebra_ProbabilityMultiplicationRule.xml
http://people.richland.edu/james/lecture/m170/ch05-rul.html
http://www.ohrt.com/odds/binomial.php
(Binomial Coefficient Calculator)
http://shakespeare.mit.edu/romeo_juliet/full.html
http://robison.casinocitytimes.com/article/how-many-stops-on-a-slot-reel-33747
http://www.abc.net.au/science/articles/2003/04/01/817429.htm#.Ua5QcpyWeUk
http://en.wikipedia.org/wiki/DF-21
http://jonathanturley.org/2012/09/29/targeted-hype/
Chapter Two:
http://commons.wikimedia.org/wiki/File:Binomial_distribution_pmf_sl.svg
http://wbs.eu.com/index.php/folders/category/91-statistics-1-revision?download=236:introducing-binomial
http://stattrek.com/online-calculator/binomial.aspx
(Binomial Distribution Calculator)
http://prx.aps.org/pdf/PRX/v3/i2/e021006
Köhler, R. (2012). Quantitative Syntax Analysis. Berlin / Boston: Walter de Gruyter GmbH & Co. KG
http://www.usingenglish.com/resources/text-statistics.php
http://archiv.c6-magazin.de/06/monatsthema/2006/06-fussball-wm/fussball-lexikon/elfmeter.php
Chapter Three:
http://www.flickr.com/photos/go_greener_oz/3047060508/
http://www.intmath.com/counting-probability/11-probability-distributions-concepts.php
http://de.wikipedia.org/wiki/Geometrische_Reihe
Chapter Four:
http://commons.wikimedia.org/wiki/File:Poisson_distribution_PMF.png
http://www.umass.edu/wsp/statistics/lessons/poisson/
http://stattrek.com/online-calculator/poisson.aspx
(Poisson Distribution Calculator)
---- Tornadoes
http://www.erh.noaa.gov/cae/svrwx/tornadobystate.htm
http://www.realestateforums.com/greenref/docs/GRE13_ChristopherMorgan.pdf
http://www.thefoxwebsite.org/After-the-Hunt.pdf
---- Soccer
Chapter Five:
---- The Basics
http://www.artofproblemsolving.com/Store/products/intro-counting/exc2.pdf
---- Asteroids
http://en.wikipedia.org/wiki/Impact_event
http://geography.about.com/library/cia/blcpacific.htm
http://www.wolframalpha.com/input/?i=seattle+sidney
https://de.wikipedia.org/wiki/Sichtweite
http://www.kcl.ac.uk/sspp/departments/geography/people/academic/malamud/floods.pdf
Chapter Six:
http://commons.wikimedia.org/wiki/File:Thomas_Bayes.gif
http://www.cut-the-knot.org/Probability/BayesTheorem.shtml
http://www.math.hmc.edu/funfacts/ffiles/30002.6.shtml
http://www.sussex.ac.uk/Users/christ/crs/kr-ist/lec08b.html
Chapter Seven:
---- Don't be Mean
http://www.princeton.edu/~achaney/tmve/wiki100k/docs/Standard_deviation.html
http://www.greenbook.org/marketing-research.cfm/how-to-interpret-standard-deviation-and-standard-error-in-survey-research-03377
http://grundpraktikum.physik.uni-saarland.de/scripts/Fehlerrechnung_Uni-Ulm.pdf
http://faculty.southwest.tn.edu/jiwilliams/probab2.gif
http://www.math.cornell.edu/~lipa/mec/lesson6.html
http://psoup.math.wisc.edu/mcell/mjcell/mjcell.html
http://www.statcan.gc.ca/pub/85-002-x/2010001/article/11115-eng.htm
http://arxiv.org/pdf/cs/0512102v1.pdf
http://www.glottopedia.org/index.php/Menzerath-Altmann-Gesetz
https://my.vanderbilt.edu/motonoriyamaguchi/files/2011/08/Yamaguchi_Crump_Logan_JEPHPP_inpress.pdf
---- Two Fallacies
http://www.fallacyfiles.org/gamblers.html
http://web.archive.org/web/20070407182624/http://www.venganza.org/about/open-letter/
http://www.statista.com/statistics/221031/total-worldwide-casinos-by-region/
Computer Programs:
OpenOffice.org, Copyright © Sun Microsystems Inc.
(Text Editing)