
Table of Contents

Introduction

Chapter One: The Power of Multiplication

Chapter Two: Binomial Shortcut

Chapter Three: To Be Expected

Chapter Four: Poisson Distribution

Chapter Five: The Geometric Approach

Chapter Six: Bayes and Bias

Chapter Seven: Random Random Problems

Excerpts

References
Statistical Snacks
(Copyright 2013 Metin Bektas)

Introduction:
In case you already skimmed through the book or read a sample, you might
be wondering: Is this a textbook? Or is it recreational? Well, it's both and
neither.

This book was certainly written with the intent to entertain with interesting
statistical problems and ideas. But statistics cannot be fully enjoyed
without understanding its inner workings. Unexpected results are great, but
it's even better if you understand how to arrive at them.

So the intent of the book is not only to entertain, but also to teach important
core concepts of statistics and how to apply them. For this reason the
snacks, as well as the whole book, are written to become more demanding
as the book nears its end. Thus I recommend reading it in the order in which
it stands, unless you already have a sound knowledge of statistics.

It also intends to appeal to your imagination and creativity. Each statistical
concept enables you to solve an endless number of problems that have yet to
be thought up. Be bold. Make up your own problems and try solving them.
It can be anything from real-world applications to pure absurdness. Or
choose an existing problem and take it to the next level. If you need
numerical values, research or guess them. If the problem turns out to be too
complex, try to simplify it or limit yourself to special cases. Often,
when solving special cases, the idea to crack the whole thing will naturally
come to you.

Try not to rush through the book. Take your time to absorb the
concepts and the applications. Don't worry if there's something you
don't understand on the first try; some snacks are written to be especially
challenging. And I can assure you that "getting it" on the second or third try is
even more rewarding.

You will notice that the titles of the snacks include a certain number of dots.
These show the difficulty of the snack relative to the other snacks in
the chapter. The indicator ranges from one dot for relatively easy snacks to
three dots for tough ones.

It's a good idea to keep a pen and piece of paper ready when reading.
Sometimes you'll find it helpful to do your own visualization of the
problem, other times you'll think of a great variation of the problem or even
a completely new problem. Don't let these ideas go to waste; pursue
and share them.

A side note for Amazon Kindle users: don't set the font size too high;
otherwise the formulas will be displayed over more than one line. This can
be somewhat confusing and cause your train of thought to be unnecessarily
interrupted. The Table Of Contents has been optimized for Kindle so that
you may jump to the desired chapter without any hassle.

Hopefully this will be one of those books to which you will always return
and that will stay on your mind long after reading. Enjoy your trip through
the wondrous world of statistics.

Metin Bektas
Chapter One: The Power of Multiplication

(●) The Basics

Before we enjoy our first snacks, it's wise to look at some basics.
Multiplication is a surprisingly powerful tool in statistics. It enables us to
solve a vast amount of problems with relative ease. One thing to remember
though is that the multiplication rule, which I'll get to in a bit, only works
for independent events. So let's talk about those first.

When we roll a die, there's a certain probability that the number six will
show. This probability does not depend on what number we rolled before.
The events "rolling a three" and "rolling a six" are independent in the sense
that the occurrence of one event does not affect the probability of the
other.

Let's look at a card deck. We draw a card and note it. Afterward, we put it
back in the deck and mix the cards. Then we draw another one. Does the
event "draw an ace" in the first try affect the event "draw a king" in the
second try? It does not, because we put the ace back in the deck and mixed
the cards. We basically reset our experiment. In such a case, the events
"draw an ace" and "draw a king" are independent.

But what if we don't put the first card back in the deck? Well, when we take
the ace out of the deck, the chance of drawing a king will increase from 4 /
52 (4 kings out of 52 cards) to 4 / 51 (4 kings out of 51 cards). If we don't
do the reset, the events "draw an ace" and "draw a king" are in fact
dependent. The occurrence of one changes the probability for the other.

With this in mind, we can turn to our powerful tool called multiplication
rule. We start with two independent events, A and B. The probabilities for
their occurrence are respectively p(A) and p(B). The multiplication rule
states that the probability of both events occurring is simply the product of
the probabilities p(A) and p(B). In mathematical terms:
p(A and B) = p(A) · p(B).

A quick look at the die will make this clear. Let's take both A and B to be
the event "rolling a six". Obviously they are independent: rolling a six on
one try will not change the probability of rolling a six in the following try.
So we are allowed to use the multiplication rule here. The probability of
rolling a six is 1/6, so p(A) = p(B) = 1/6. Using the multiplication rule, we
can calculate the chance of rolling two sixes in a row: p(A and B) = 1/6 · 1/6
= 1/36. Note that if we took A to be "rolling a six" and B to be "rolling a
three", we would arrive at the same result. The chance of rolling two sixes in
a row is the same as that of rolling a six and then a three.

Can we also use this on the deck of cards, even if we don't reset the
experiment? Indeed we can. But we have to take into account that the
probabilities change as we go along. In more abstract terms, instead of
looking at the general events "draw an ace" and "draw a king", we need to
look at the events A = "draw an ace in the first try" and B = "draw a king
with one ace missing". With the order of the events clearly set, there's no
chance of them interfering. The occurrence of both events, first drawing an
ace and then drawing a king with the ace missing, has the probability: p(A
and B) = p(A) · p(B) = 4/52 · 4/51 = 16/2652 or 1 in about 165 or 0.6 %.

If you're not familiar with these calculations, give them a try. What's the
probability of rolling three sixes in a row? To arrive at the right answer, take A
to be "rolling two sixes in a row" as calculated above and B to be "rolling a
six". What's the probability of drawing two aces in a row? Remember that
during the first draw, there are 4 aces in 52 cards and during the second
draw, you are left with 3 aces in 51 cards. What about the probability of
drawing all the aces in a row?
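
If you'd like to check these exercises by computer, here is a minimal Python sketch (it assumes a fair die and a standard 52-card deck, as in the text):

```python
from fractions import Fraction

# Three sixes in a row: multiply the chance of two sixes by that of one more.
p_three_sixes = Fraction(1, 36) * Fraction(1, 6)
print(p_three_sixes)        # 1/216

# Two aces in a row without replacement: the probabilities change as we draw.
p_two_aces = Fraction(4, 52) * Fraction(3, 51)
print(p_two_aces)           # 1/221

# All four aces in a row:
p_four_aces = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50) * Fraction(1, 49)
print(p_four_aces)          # 1/270725
```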

Now onto the snacks.

(●) Good ol' Coins

Of course when we start doing statistics, we need to talk about coins. Let's
first look at how likely it is to get three heads in a row. Any guesses? The
chance for a coin showing heads is p(heads) = 1/2. Since this event is so
wonderfully independent from basically anything else in the universe, we
can simply start multiplying:
p(three heads in a row) = p(heads) · p(heads) · p(heads)

Plugging in the numbers gives us a probability of 1/8. Not that unlikely.


Now let's make it a little harder. Again we throw the coin three times, but
now we are interested in the probability of it showing heads exactly twice.
Shouldn't be too hard, right? True, but we need to be careful because there
are several ways in which we can achieve this result.

One way is that the coin shows the sequence heads-heads-tails in that order.
The probability for that is 1/2 · 1/2 · 1/2 = (1/2)^3 = 1/8. But the sequences
heads-tails-heads and tails-heads-heads are just as likely. And they too
satisfy our condition of showing heads twice within three throws. So we
get:
p(twice heads in three throws) = 1/8 · 3 = 3/8

Counting the number of ways in which a result can be achieved is
important; taking into account only one of the many ways in which a result
can come to be will lead to an incorrect result. It's a common beginner's
mistake to overlook that. Of course the need to count can make some
problems quite tough.
Just imagine trying to find out how likely it is to get exactly twenty heads in
fifty throws. You would need to count all the ways to distribute twenty
heads to fifty slots. That's a lot of possibilities (47 129 212 243 960, to be
specific). Luckily in a later chapter the binomial coefficient will come to
our aid for just that case. It's how I got the above number in the first place.
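
If you want to check that number yourself, Python's math.comb computes binomial coefficients directly; a quick sketch:

```python
from math import comb

# Ways to distribute twenty heads over fifty throws:
print(comb(50, 20))              # 47129212243960

# Probability of exactly twenty heads in fifty fair-coin throws:
print(comb(50, 20) * 0.5**50)    # about 0.042
```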

(●) Monkey on a Typewriter

Here are the first two sentences of the prologue to Shakespeare's Romeo
and Juliet:

Two households, both alike in dignity,


In fair Verona, where we lay our scene

It has 77 characters. Now we let a monkey start typing random letters on a
typewriter. Once he has typed 77 characters, we change the sheet and let him
start over. How many tries does he need to randomly reproduce the above
paragraph?

There are 26 letters in the English alphabet and since he'll be needing the
comma and space, we include those as well. So there's a 1/28 chance of
getting the first character right. Same goes for the second character, third
character, etc ... Because he's typing randomly, the chance of getting a
character right is independent of what preceded it. So we can just start
multiplying:
p(reproduce) = 1/28 · 1/28 · ... · 1/28 = (1/28)^77

The result is about 4 times ten to the power of -112. This is a ridiculously
small chance! Even if he was able to complete one quadrillion tries per
millisecond, it would most likely take him considerably longer than the
estimated age of the universe to reproduce these two sentences.

Now what about the first word? It has only three letters, so he should be
able to get at least this part in a short time. The chance of randomly
reproducing the word "two" is:
p(reproduce) = 1/26 · 1/26 · 1/26 = (1/26)^3

Note that I dropped the comma and space as a choice, so now there's a 1 in
26 chance to get a character right. The result is 5.7 times ten to the power of
-5, which is about a 1 in 17500 chance. Even a slower monkey could easily
get that done within a year, but I guess it's still best to stick to human
writers.

(●) Multiple Choice

Imagine taking a multiple choice test that has three possible answers to each
question. This means that even if you don't know any answer, your chance
of getting a question right is still 1/3. How likely is it to get all questions
right by guessing if the test contains ten questions?

Here we are looking at the event “correct answer” which occurs with a
probability of p(correct answer) = 1/3. We want to know the odds of this
event happening ten times in a row. For that we simply apply the
multiplication rule:
p(all correct) = (1/3)^10 = 0.000017

Doing the inverse, we can see that this corresponds to about 1 in 60000. So
if we gave this test to 60000 students who only guessed the answers, we
could expect only one to be that lucky. What about the other extreme? How
likely is it to get none of the ten questions right when guessing?

Now we must focus on the event “incorrect answer” which has the
probability p(incorrect answer) = 2/3. The odds for this to occur ten times in
a row is:

p(all incorrect) = (2/3)^10 = 0.017


In other words: 1 in 60. Among the 60000 guessing students, this outcome
can be expected to appear 1000 times. How would these numbers change if
we only had eight instead of ten questions? Or if we had four options per
question instead of three? I leave this calculation up to you.
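
Here is a small sketch for those variations (the question and option counts are the ones suggested above):

```python
# Odds of all-correct and all-incorrect guessing on a test with
# n questions and m options per question:
def p_all_correct(n, m):
    return (1 / m) ** n

def p_all_incorrect(n, m):
    return (1 - 1 / m) ** n

print(p_all_correct(8, 3), p_all_incorrect(8, 3))     # eight questions, three options
print(p_all_correct(10, 4), p_all_incorrect(10, 4))   # ten questions, four options
```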

(●) Weird Pizza

An industrial pizza manufacturer adds twelve small tomato pieces to each
pizza. The pieces are randomly distributed on the pizza by a machine. This
means that if we divide the pizza into a left and right half, the chance of a
piece falling on a certain half is just 0.5. What are the odds that all tomato
pieces end up on the right half?

This is a straightforward application of the multiplication rule. We want the
event “tomato piece on right half” to occur twelve times in a row. Thus:
p(all right) = 0.5^12 = 0.00024

Or about 1 in 4100. As expected, it is quite unlikely. But if the company for
example makes 500 pizzas a day, we would expect such a pizza to come up
every 8 days. Despite being so unlikely, the event will still occur every now
and then simply because of the high number of repeats.

Just out of statistical curiosity, how likely is it that exactly one tomato piece
ends up in the left half and the remaining eleven in the right half? Is that
more or less likely than having all pieces on the right side?

Let's name the pieces for the moment. The piece that falls first is piece 1,
the piece that falls second is piece 2, and so on. One way for only one piece
being on the left is that piece 1 falls to the left and all following to the right.
Since they fall to any side with the probability 0.5, the chance for this
outcome is also 0.00024.

But this time our event can occur in more than one way. If piece 1 falls to
the right, piece 2 to the left and all following again to the right, we end up
once more with only one piece in the left half. How many possibilities are
there in total? Obviously each of the twelve pieces can take the role of
being the outcast. So we have twelve possibilities, all occurring with the
probability 0.00024. The chance of having only one piece on the left is thus:
p(one left) = 12 · 0.00024 = 0.0029

This is about 1 in 350. With 500 pizzas a day, we can expect to get one or
even two pizzas of this kind every day. So this distribution is significantly
more likely than having all the pieces on the right half. And the
probabilities will increase further as we get closer to the uniform
distribution with six pieces on each side.
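
A short sketch using Python's math.comb to do the counting (the binomial coefficient will be introduced properly in chapter two) confirms this trend toward the uniform distribution:

```python
from math import comb

# Chance that exactly k of the twelve tomato pieces land on the left half:
for k in range(7):
    print(k, comb(12, k) * 0.5**12)
# k = 0 gives 0.00024, k = 1 gives 0.0029, and the even split k = 6
# is the most likely outcome with about 0.23
```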

(●) Guessing a Random Number

I'm thinking of a number between one and ten. Can you guess it? The
chance of guessing it on the first try is 1/10. What's the chance of guessing it
on the second and third try? Again this is a case of applying the multiplication
rule while taking into account that the probability will change as we go
along.

So what's the probability of guessing the number on second try? For that to
happen, we first need to guess the wrong number. This occurs with a
probability of 9/10. Now you are left with nine numbers to choose from.
The chance of picking the right one is 1/9. So the probability is:
p(correct on second try) = 9/10 · 1/9 = 1/10

Again 1/10. What about the probability of making the right guess on third
try? You miss the first guess with a chance of 9/10, then you miss the
second guess with a chance of 8/9 and finally you get the right number with
a chance of 1/8. So:
p(correct on third try) = 9/10 · 8/9 · 1/8 = 1/10

I guess you could have just guessed these results.


(●) Cracking a Three Digit Code

We want to pick a lock that is secured by a three digit code. What are the
odds of us doing so on first try? And within the first one hundred tries? To
answer these questions, we need to look at yet another powerful
multiplication rule.

Assume we have a number of slots. Each slot can be occupied by a number
of different objects. How many possible distributions are there? For
example, consider all possible words that are five characters long. Each of
the five slots can be occupied by 26 letters. How many such words can we
construct?

The answer again is to start multiplying. For each slot there are 26
possibilities, so the total number of words we can construct using five
characters is: 26 · 26 · 26 · 26 · 26 = 26^5 = 11881376.

What does this have to do with our lock? To calculate the probability of
picking it on first try, we need to know how many possible combinations
there are. The lock has three slots and for each slot we have ten options
(from 0 to 9). So the number of possible combinations is: 10 · 10 · 10 =
1000 and the probability of getting the correct one on first try is 1/1000.

So we probably won't be lucky on first try. But what about within the first
one hundred tries? As in the guessing of the random numbers, each try is
equal. You can easily verify that by using the multiplication rule to calculate
the chance of picking the lock on second try, third try, etc ... So let's sum all
the probabilities for the first one hundred tries:
p(within one hundred) = 1/1000 · 100 = 1/10

We end up with a 10 % chance, which is not very good considering the amount of
work we have put into picking the lock. With one try every five seconds, we
would have been setting combinations for about eight minutes now. Even
for a 50 / 50 chance we need forty minutes of continuous combination
setting.

While reading about the lock, you might be reminded of slot-machines used
in casinos. Just like our lock, they offer three slots, but usually have more
options for each slot. Today's standard is about 20 options per slot, so the
number of possible outcomes is: 20 · 20 · 20 = 8000, reducing our chance
of hitting the jackpot to 1/8000.

(●●) Winning by Winning

Chen and Thomas often challenge each other in a strategy computer game.
An analysis of the duels so far shows that Chen won 55 % of the matches,
while Thomas won the other 45 %. So Chen has a 10 % edge.

To make things more interesting, they put some money in a pot and decide
to play three rounds. Whoever wins the most out of the three rounds gets
the pot. What is Chen's chance of taking the pot home?

One way for Chen to win the game is to win all three rounds. We symbolize
this by the sequence: C – C – C. Using the multiplication rule we calculate
the odds of this outcome:
p(3/3 rounds) = 0.55 · 0.55 · 0.55 = 0.166

He also wins the game if he is victorious in two out of three rounds. One
such sequence is C – C – T and the respective probability is: 0.55 · 0.55 ·
0.45. But let's not forget that there are two other possible sequences: C – T
– C and T – C – C. All have the same probability, so the odds for winning
two out of three rounds are:
p(2/3 rounds) = 3 · 0.55 · 0.55 · 0.45 = 0.408

Summing these two probabilities gives us the chance of Chen taking the pot
home:
p(Chen) = 0.574 = 57.4 %

Of course this also means that Thomas' chance of winning the pot is
p(Thomas) = 0.426 = 42.6 %. Did you notice what happened to the initial
edge? The 10 % edge for winning a single round turned into a 15 % edge
for winning the entire game. For Thomas this is unfavorable, but it shows
that when you want to sift out the better player, it makes more sense to have
a game of three rounds than just a single match.

What happens if they play five instead of three rounds? Does this increase
the edge of winning the game even further? My gut tells me yes, but as you
will see throughout this book, the gut is not always to be trusted in
statistics. So we better crunch the numbers. After a lot of counting this leads
to p(Chen) = 59.3 % and p(Thomas) = 40.7 %. So indeed the edge increases
to about 19 % for a game of five rounds.
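
If you'd rather let a computer do that counting, here is a minimal sketch (p_take_pot is my own name; math.comb supplies the binomial coefficient discussed in chapter two):

```python
from math import comb

# Chance that the stronger player wins the majority of an odd number
# of rounds, winning each round independently with probability p:
def p_take_pot(p, rounds):
    need = rounds // 2 + 1
    return sum(comb(rounds, k) * p**k * (1 - p)**(rounds - k)
               for k in range(need, rounds + 1))

print(p_take_pot(0.55, 3))   # about 0.574
print(p_take_pot(0.55, 5))   # about 0.593
```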

(●) Homicide She Wrote

Each year in the US there are about 5 homicides per 100000 people, so the
probability of falling victim to a homicide in a given year is 0.00005 or 1 in
20000. What are the chances of falling victim to a homicide over a lifespan
of 70 years?

Let's approach this the other way around. The chance of not becoming a
homicide victim during one year is p = 0.99995. Using the multiplication
rule we can calculate the probability of this event occurring 70 times in a
row:
p = 0.99995 · ... · 0.99995 = 0.99995^70

Thus the odds of not becoming a homicide victim over the course of 70
years are 0.9965. This of course also means that there's a 1 - 0.9965 =
0.0035, or 1 in 285, chance of falling victim to a homicide during a life
span. How does this compare to other countries?
In Germany, the homicide rate is about 0.8 per 100000 people. Doing the
same calculation gives us a 1 in 1800 chance of becoming a murder victim.
At the other end of the scale is Honduras with 92 homicides per 100000
people, which translates into a saddening 1 in 16 chance of becoming a
homicide victim over the course of a life.

It can get even worse if you live in a particularly crime ridden part of a
country. The homicide rate for the city San Pedro Sula in Honduras is about
160 per 100000 people. If this remained constant over time and you never
left the city, you'd have a 1 in 9 chance of having your life cut short in a
homicide.
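
The same calculation works for any country; a sketch with the rates quoted above (assuming, as in the text, a constant rate over a 70-year span):

```python
# Lifetime homicide risk from an annual rate per 100000 people:
def lifetime_risk(rate_per_100k, years=70):
    return 1 - (1 - rate_per_100k / 100000) ** years

for country, rate in [("US", 5), ("Germany", 0.8), ("Honduras", 92)]:
    risk = lifetime_risk(rate)
    print(country, round(risk, 4), "or about 1 in", round(1 / risk))
```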

(●●) At Least He Tried …

There's one important type of problem we haven't looked at yet. I'm talking
about the unnecessarily dreaded "at least"-questions. Paul enjoys playing
tennis, though he's not particularly good at it. Of all his matches, he has won
only 10 %. Now he wants to enter a tournament which requires him to
participate in eight matches. What is the chance of him winning at least one
match?

A straightforward approach would be to calculate his odds for winning one
match, for winning two matches, for winning three matches and so on. Then
we sum them all. That sounds like a lot of work, and it is. Luckily we can
solve this kind of problem more elegantly. Instead of the above procedure,
we simply calculate how likely it is for him to win no match. One minus
whatever we get here is the probability of him winning at least one match.
This is because the probabilities for "winning no match" and
"winning at least one match" must add up to 1, or 100 %.

The chance of him not winning one of the eight matches is easily calculated
using the multiplication rule. There's a 90 % chance of him losing a match,
so we just multiply to find out how likely it is for this event to occur eight
times in a row:
p(no win) = 0.9^8 = 0.43 = 43 %

Now we remember that this probability and the chance of him winning at
least one match add up to one and we get:

p(at least one win) = 1 – p(no win) = 0.57 = 57 %

So despite him having such a hard time to win a match, there's a more than
50/50 chance for him to win at least one tournament match. Of course, Paul
is not so interested in this number, for him it's about the enjoyment, not
winning.
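
The complement trick is worth wrapping in a function; a minimal sketch:

```python
# "At least one" via the complement: p(at least one) = 1 - p(none).
def p_at_least_one(p_single, trials):
    return 1 - (1 - p_single) ** trials

print(p_at_least_one(0.10, 8))   # Paul's tournament: about 0.57
```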

(●) No Matter how Careful

An especially careful writer has only a 1 in 4000 chance of making a mistake
while writing a word. What are the chances of finding at least one mistake
in a book of 30000 words?

The phrase "at least" gave away how we should approach this. Let's first
calculate how likely it is to not find any mistake. The chance of getting a
word correct is 3999/4000. The chance for this event to happen 30000 times
in a row is:
p(no mistake) = (3999/4000)^30000 = 0.0006

The probability of finding at least one mistake is thus:

p(at least one mistake) = 1 - p(no mistake) = 0.9994 = 99.94 %

So no matter how carefully he writes (and 1 in 4000 is already a ridiculously
low chance of making a mistake), we could safely bet all our money on him
making at least one mistake. In the next snack we'll take a closer look at
mistakes made while writing, or rather at how to eliminate them with
proofreading.
(●●●) The Mistake Formula

As we saw, there's a high probability that once a book is finished, there will
be some mistakes left in the book. Knowing that, the author goes over the
text once more and finds m(1) mistakes. After that he proofreads it yet
another time and finds m(2) mistakes. How many mistakes can he expect to
still be in the book after that?

If we knew his probability of spotting a mistake, it would be easy to
answer this question. Unfortunately we don't know it, and we assume there's
no chance of finding out. Despite that, we can still get the desired result
with some logic and some algebra.

Let's denote the number of mistakes before the first proofread by M. If we
can calculate this number in retrospect, we can easily find the number of
mistakes L left after the second proofread. We just subtract the ones already
found from M, so L = M – m(1) – m(2).

So we start with M mistakes. If his chance of spotting a mistake is p, then
he will find:

m(1) = p · M

mistakes during the first try. Now there are M – m(1) or (1-p) · M mistakes
left. Assuming the same chance of spotting a mistake, we can expect him to
discover:
m(2) = p · (1-p) · M

mistakes in the second try. Let's use the first equation to eliminate the
unknown p. We can write p = m(1) / M and plug that into the second
equation. As promised, we really won't need to look at his chance of
spotting a mistake. We can get rid of it by straightforward algebra. Doing
this results in the equation:
m(2) = m(1) / M · (1 – m(1) / M) · M
which simplifies to:

m(2) = m(1) · (1 – m(1) / M)

All that's left is more algebra. We solve this equation for M to get the
expected number of mistakes in the text before the author started
proofreading:
M = m(1)^2 / (m(1) – m(2))

We already concluded that the number of mistakes left after the second
attempt is simply L = M – m(1) – m(2) or:

L = m(1)^2 / (m(1) – m(2)) – m(1) – m(2)

Voila, all the inputs we need to calculate the expected number of mistakes
still in the text are the number of mistakes found on first and second try.
Surprising, isn't it? Of course now that we have a formula for M, we can
also calculate the proofreader's chance of spotting a mistake using p = m(1)
/ M:
p = 1 – m(2) / m(1)

On top of that, we're able to compute how many mistakes can be expected to be
found during a third proofread: m(3) = p · L.

Let's use some numbers. First the author found m(1) = 30 mistakes and then
m(2) = 12 mistakes. How many mistakes do we expect to be left in the text?

L = 30^2 / (30 – 12) – 30 – 12 = 8

The author's chance of spotting a mistake was:

p = 1 – 12 / 30 = 0.6 = 60 %
This means that during an additional attempt to eliminate the mistakes, we
can expect him to find m(3) = 0.6 · 8 ≈ 5 of the remaining 8 mistakes.
One thing to remember though is that we need m(1) > m(2) in order
for the formulas to work. Otherwise our assumption of a constant chance of
spotting mistakes cannot be fulfilled and the formula just produces
nonsense like negative values or, even worse, an infinite number of
mistakes. No writer is that bad.
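
The whole derivation condenses to a few lines; a sketch (the function name is my own, and it enforces the m(1) > m(2) assumption just discussed):

```python
# Expected mistakes from two proofreading passes with m1 and m2 finds:
def proofread_estimate(m1, m2):
    assert m1 > m2, "constant spotting chance requires m1 > m2"
    M = m1**2 / (m1 - m2)   # mistakes before proofreading
    p = 1 - m2 / m1         # chance of spotting a mistake
    L = M - m1 - m2         # mistakes still left in the text
    return M, p, L

print(proofread_estimate(30, 12))   # (50.0, 0.6, 8.0)
```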

(●●) Super-PIN

When you want to access your cell phone, you have to enter a PIN code.
This protects your information from unauthorized access in case the cell
phone gets lost. You are usually given three shots at entering the correct
PIN code. If you fail to do so, you need the super-PIN to gain access.

Assume your chance of entering the known PIN incorrectly is 0.1. How
likely is it that you need to resort to the super-PIN? That's simple. For that
to happen, you need to enter an incorrect code three times in a row.
According to the multiplication rule, this happens with the probability:
p(super-PIN) = 0.1^3 = 0.001 = 1 in 1000

However, this calculation was rather simplistic. When you enter the PIN
incorrectly on first try, you would usually pay more attention during the
second attempt. And if this one also fails, you would be even more alert
since you know that this is your last shot before having to locate the super-
PIN in the disorganized pile of documents.

Let's take this into account. During the first try, your chance of entering the
PIN incorrectly is 0.1. However, during the second and third try the
corresponding chances are only 0.05 and 0.01. How does that change the
odds of needing to resort to the super-PIN? Again we use the multiplication
rule:
p(super-PIN) = 0.1 · 0.05 · 0.01 = 0.00005
Or 1 in 20000. So paying attention reduced the chances of failing to access
the cell phone significantly. We'll use this result to answer another question:
if you access your cell phone fifteen times a day, what are your chances of
needing the super-PIN at least once over the course of a year? Have a guess
before crunching the numbers.

Again we approach the “at least”-question from another angle. First we
compute the chance of not needing the super-PIN during a year. Our chance
of a successful log-in is 0.99995. The probability of this happening 15 · 365
= 5475 times in a row is:
p(no super-PIN, one year) = 0.99995^5475 = 0.76

From that we can conclude that the probability of needing the super-PIN at
least once during a year when logging in fifteen times a day is:

p(super-PIN, one year) = 1 – 0.76 = 0.24 = 24 %

So despite the large number of repeats, we will probably not be forced to go
through the pile of unorganized documents. Lucky us. However, don't throw
more onto that pile yet; let's first check how likely it is that you need the
super-PIN if you use the cell phone for three years instead of one. Doing the
same calculation with the exponent 15 · 365 · 3 = 16425 leads to:
p(super-PIN, three years) = 1 – 0.44 = 0.56 = 56 %
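
A quick sketch reproduces both results (the log-in counts are the ones assumed above):

```python
# Chance of needing the super-PIN at least once, given a 0.00005
# probability of failing all three PIN attempts on a single log-in:
def p_super_pin(logins_per_day, years):
    n = logins_per_day * 365 * years
    return 1 - (1 - 0.00005) ** n

print(p_super_pin(15, 1))   # about 0.24
print(p_super_pin(15, 3))   # about 0.56
```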

(●●) The Nigerian Prince

A totally legitimate Nigerian prince is trying to access his grand
inheritance. Unfortunately for him, he needs assistance in paying the bank
fees. Thanks to email, he can reach a lot of people quickly, and to make up
for the trouble, he lets whoever assists him keep a fantastic share. Who
wouldn't want to be part of that, right?

Let's assume his chance at a successful response is p (as in prince). How
many emails must he send to have a 95 % chance at getting at least one
successful response? As in all "at least"-problems, we look at the opposite
event, the chance of him not getting any response. A recipient will not
respond with a chance of 1-p. So the chance of having no person respond
after sending n mails is:
p(no response) = (1-p)^n

We want to find out for what number of mails n his chance at a response is
95 %, or in other words, his chance of getting no response is only 5 %. Thus
we can just turn the above formula into an equation by inserting p(no
response) = 0.05.

0.05 = (1-p)^n

Solving for n by applying the natural logarithm (symbolized by ln) to both
sides results in:

n = ln(0.05) / ln(1-p)

For example, if the response rate is only 1 in 1000, he needs to send about
3000 mails for a 95 % chance at a response. If he doesn't have a program
that automatically sends the mails and it takes him half a minute to send one
manually, he'd be sending mails for 25 hours straight to get to this number.
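
In code, the formula reads (a sketch; the 95 % target is left as a parameter so you can play with it):

```python
from math import log

# Mails needed for a given chance of at least one response,
# with per-mail response rate p:
def mails_needed(p, target=0.95):
    return log(1 - target) / log(1 - p)

print(mails_needed(0.001))   # about 3000 mails for a 1-in-1000 rate
```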

(●●) Missile Accuracy

An important quantity when comparing missiles is the CEP (Circular Error
Probable). It is defined as the radius of the circle in which 50 % of the fired
missiles land. The smaller it is, the better the accuracy of the missile. The
German V2 rockets for example had a CEP of about 17 km. So there was a
50/50 chance of a V2 landing within 17 km of its target. Targeting smaller
cities or even complexes was next to impossible with this accuracy; one
could only aim for a general area in which it would land rather randomly.
Today's missiles are significantly more accurate. The latest version of
China's DF-21 has a CEP of about 40 m, allowing the accurate targeting of
small complexes or large buildings, while the CEP of the American-made
Hellfire is as low as 4 m, enabling precision strikes on small buildings or
even tanks.

Assuming the impacts are normally distributed, one can derive a formula
for the probability of striking a circular target of radius R using a missile
with a given CEP:
p = 1 – exp( -0.41 · R² / CEP² )

This quantity is also called the “single shot kill probability” (SSKP). Let's
include some numerical values. Assume a small complex with the
dimensions 100 m by 100 m is targeted with a missile having a CEP of 150
m. Converting the rectangular area into a circle of equal area gives us a
radius of about 56 m. Thus the SSKP is:
p = 1 – exp( -0.41 · 56² / 150² ) = 0.056 = 5.6 %

So the chances of hitting the target are relatively low. But the lack in
accuracy can be compensated by firing several missiles in succession. What
is the chance of at least one missile hitting the target if ten missiles are
fired? First we look at the odds of all missiles missing the target and answer
the question from that. One missile misses with 0.944 probability, the
chance of having this event occur ten times in a row is:
p(all miss) = 0.944^10 = 0.562

Thus the chance of at least one hit is:

p(at least one hit) = 1 – 0.562 = 0.438 = 43.8 %

Still not great considering that a single missile easily costs upwards of
10000 $. How many missiles of this kind must be fired at the complex to
have a 90 % chance at a hit? A 90 % chance at a hit means that the chance
of all missiles missing is 10 %. So we can turn the above formula for p(all
miss) into an equation by inserting p(all miss) = 0.1 and leaving the number
of missiles n undetermined:
0.1 = 0.944^n

All that's left is doing the algebra. Applying the natural logarithm to both
sides and solving for n results in:

n = ln(0.1) / ln(0.944) = 40

So forty missiles with a CEP of 150 m are required to have a 90 % chance
at hitting the complex. As you can verify by doing the appropriate
calculations, three DF-21 missiles would have achieved the same result.
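
Both steps fit into a short sketch (the 100 m by 100 m complex and the CEP values are the ones from the text; sskp is my own name):

```python
from math import exp, log, pi, sqrt

# Single shot kill probability for a circular target of radius R,
# fired at with a missile of the given CEP (both in meters):
def sskp(R, cep):
    return 1 - exp(-0.41 * R**2 / cep**2)

R = sqrt(100 * 100 / pi)                 # the complex as a circle, ~56 m
p = sskp(R, 150)
print(p)                                 # about 0.056

# Missiles needed for a 90 % chance of at least one hit:
print(log(0.1) / log(1 - p))             # about 40
print(log(0.1) / log(1 - sskp(R, 40)))   # DF-21: about 3
```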

(●●●) Complete Collection

Collecting stickers is often quite popular among kids. For me, soccer
stickers were the big thing. But getting the full collection can take a while.
The more you already have, the smaller your chances of getting the missing
ones are. Because of that, collecting the last sticker can take as long as
collecting the previous five or so. Let's take a look at stickers from a
probabilistic point of view.

Assume the complete collection consists of N stickers. You are only one
sticker short, so you visit a friend who has n of the stickers in total. What
are your chances of him having this one missing sticker?

There's a chance of 1/N for a randomly selected sticker to be the one you're
looking for. Of course this also means that the odds are (N-1)/N = 1 – 1/N
of a sticker not being the desired one. If your friend has n stickers, the
chance of him not having this particular sticker is thus:
p(none) = (1 – 1/N)^n

Since the probability of him not having this sticker and him having it at
least once must add up to one, we get:
p(at least one) = 1 – (1 – 1/N)^n

Let's put some numbers to that. We assume that the complete collection has
N = 30 different stickers. You know that your friend has n = 50 stickers at
home. The chance that he has at least one copy of the sticker you are
missing is thus: p(at least one) = 0.82 = 82 %. The odds are pretty good.

But it won't be much of a help to you if he only has one copy of the sticker
you need. What's more interesting to us is how likely it is that he has at least
two copies of the desired sticker. How can we approach this?

We note that the chance of not having the sticker, the chance of having it
exactly once and the chance of having it at least twice must add up to one,
since one of these excluding outcomes is sure to come up. Thus we can
write:
p(at least two) = 1 – p(one) – p(none)

We already derived a formula for the chance of him not having it, so we go
right to computing the probability of him having exactly one copy. The
chance that the first sticker he shows to us is the desired sticker and the rest
are not is:
1/N · (1 – 1/N)^(n-1)

However, it is important to consider all the possible positions the one copy
can take among the n stickers. Since it can occupy any of the available n “slots”
(first, second, and so on up to n-th), there are n possibilities in total, each
having the above probability. So the chance of him having exactly one copy
is:
p(one) = n · 1/N · (1 – 1/N)^(n-1)

That was all we needed to calculate the odds of him having at least two
copies of the missing sticker:

p(at least two) = 1 – (1 – 1/N)^n – n/N · (1 – 1/N)^(n-1)


Let's go back to the numerical values. We had N = 30 stickers in the
complete collection and your friend had in total n = 50 stickers. The odds of
him having two or more copies of the desired sticker are thus p(at least two)
= 50 %. Not as good as we had hoped, but still better than nothing.

Another interesting question is: given that the full collection has N different
stickers, how many should one be expected to buy to get the complete set?
Since we didn't look at expected values yet, we need to postpone this
question until chapter three and revisit the stickers then.
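
Here is a sketch of both sticker probabilities (N and n as in the example above):

```python
# Chance that a friend with n random stickers has your missing one
# at least once, and at least twice (N = size of the full collection):
def p_at_least_one(N, n):
    return 1 - (1 - 1/N) ** n

def p_at_least_two(N, n):
    p_none = (1 - 1/N) ** n
    p_one = n * (1/N) * (1 - 1/N) ** (n - 1)
    return 1 - p_none - p_one

print(p_at_least_one(30, 50))   # about 0.82
print(p_at_least_two(30, 50))   # about 0.50
```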

(●●●) In Gut We Trust

This is a problem that has been covered in many books and articles already.
But it is too delicious to just leave it out. Given a group of n people, what is
the chance that at least one pair shares a birthday?

Like in all "at least"-problems, we take the detour. Let us rather calculate
how likely it is that none of them share a birthday. Now for one pair, the
chance of them not having their birthday on the same date is just 364/365.
We need this event to occur for all possible pairings in the group. Once we
know the number of pairings m, we can simply write:
p(no shared birthdays) = (364/365)^m

So we reduced the problem to analyzing possible pairings. Let's order all
the members of the group by number: person 1, person 2, and so on up to
person n. Person 1 can enter into n-1 pairings. Person 2 can also enter into
n-1 pairings, but since we already covered his pairing with person 1, he
only adds n-2 pairings to the total. Person 3 also has a number of n-1
pairings available, but since we covered his pairings with person 1 and
person 2, he only adds n-3 pairings to the total. In this way we can continue
to add up the total number of pairings, until we get to person n, who won't
contribute anything to the total.
m = n-1 + n-2 + n-3 + ... + 1 + 0

We can write 1 as n-(n-1) and 0 as n-n. This will make it clear that we might
also formulate this sum as such:

m = n · n – (1 + 2 + 3 + ... + n-1 + n)

Luckily there's a handy formula for the sum in the bracket.

1 + 2 + 3 + ... + n-1 + n = 0.5 · n · (n+1)

After inserting this formula and some further algebraic manipulation, we
arrive at this equation for the number of possible pairings in a group of n
people:
m = 0.5 · n · (n-1)

This enables us to calculate the probability that there are no shared
birthdays within a certain group. Let's look at a group of n = 23 people.
There are m = 0.5 · 23 · 22 = 253 possible pairings. The chance of having
no shared birthdays is:
p(no shared birthdays) = (364/365)^253 = 0.5 = 50 %

This of course also means that there's a 50 / 50 chance of finding at least
two people sharing a birthday within a group of 23 people. Isn't that
surprising? I would have guessed that the chances are much smaller than
this. My gut tells me there should be closer to 360 / 2 = 180 people in a
group to have a 50 / 50 shot at shared birthdays. But my gut is wrong, since
it doesn't take into account the large number of possible pairings that can be
constructed even in smaller groups.
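
Here is the pairing argument as a sketch, if you want to try other group sizes:

```python
# Birthday problem via the pairing approximation derived above:
def p_shared_birthday(n):
    m = n * (n - 1) // 2              # number of possible pairings
    return 1 - (364 / 365) ** m

print(p_shared_birthday(23))          # about 0.50
print(p_shared_birthday(50))          # about 0.97
```
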
Chapter Two: Binomial Shortcut

(Picture by Nusha, see references for link)

(●) The Basics

Meet Jack and Mike. They are passionate chess players and take great joy
in challenging each other. Jack has been playing ever since he was a child,
while Mike joined the fun in his teens. The statistics show that
experience pays in chess: of all the games they have played so far, 70 % were
won by Jack.

One Sunday they meet to play three rounds in a row. How likely is it that
Jack wins all three rounds? That's easy to calculate using the multiplication
rule.

p(3/3 rounds) = 0.7 · 0.7 · 0.7 = 34.3 %

What about the chance of him winning exactly two rounds? One sequence
that fulfills this condition is: Jack – Jack – Mike. This occurs with a
probability of 0.7 · 0.7 · 0.3. But this is not the only possible way in which
Jack can win two out of three rounds. There are two other possible
sequences: Jack – Mike – Jack and Mike – Jack – Jack. As you can easily
verify, all of them occur with the same probability. We need to take all these
possibilities into account, so the odds of Jack winning exactly two rounds
are:
p(2/3 rounds) = 3 · 0.7 · 0.7 · 0.3 = 44.1 %

If you think "this is starting to sound a lot like the good ol' coins", you are
absolutely right. I can't stress enough that we need to carefully count all the
ways in which our desired outcome can come true. Or better yet, let a
formula do the counting.

Let's do one example in which we'd be forced to surrender without
what's called the binomial coefficient. This time Jack and Mike are playing
twenty rounds; what's the probability of Jack winning exactly twelve
rounds? One way for this outcome to arise is that Jack wins the first twelve
rounds and loses the remaining eight. The multiplication rule gives us a
chance of 0.7^12 · 0.3^8 for this sequence. Unfortunately for us, there are a lot
of ways to distribute twelve J's on twenty slots.

This is where the binomial coefficient comes in. It tells us how many
possibilities there are to distribute k elements on n slots. We symbolize this
number by C(n,k). If your calculator does not include this function, you can
easily find a binomial coefficient calculator online. In the above problem,
all we need to know is C(20,12) to solve it. Using a calculator we get:
C(20,12) = 125970

So there are 125970 sequences which have Jack winning twelve out of
twenty rounds. With each sequence having the same probability 0.7^12 · 0.3^8,
the chance of Jack winning exactly twelve out of twenty rounds is:
p(12/20 rounds) = 125970 · 0.7^12 · 0.3^8

Which is about 11.4 %. It's time to formulate the binomial distribution more
generally. We are given an experiment with two possible outcomes A and B
for each trial (Jack or Mike, heads or tails, ace or not ace). Event A occurs
with the probability p, event B with the probability 1-p. The probability of
A occurring exactly k times within n trials is:
B(n,k) = C(n,k) · p^k · (1-p)^(n-k)

Agreed, this formula looks impressive. But the last two terms we can
understand more or less intuitively with the multiplication rule. First A,
having the probability p, occurs k times; then B, having the probability 1-p,
occurs the remaining n-k times. That's where the expression p^k · (1-p)^(n-k)
comes from. And the first term is just the binomial coefficient C(n,k),
telling us in how many ways our outcome can happen.
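
The formula translates directly into code; a minimal sketch we can reuse for the following snacks:

```python
from math import comb

# B(n,k): probability of exactly k successes in n independent trials,
# each succeeding with probability p.
def binom(n, k, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(comb(20, 12))          # 125970 ways
print(binom(20, 12, 0.7))    # about 0.114
```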

Now let's enjoy some more snacks to let it sink in.

(●) The Many Ways of Losing

As we saw, there are C(20,12) = 125970 ways for Jack to win twelve out of
twenty rounds. But doesn't that mean that there are just as many ways for
Mike to lose eight of twenty rounds? There should be. Let's ask the
binomial coefficient. If our logic is right, then C(20,8) should be the same
as C(20,12).

Indeed the calculator spits out C(20,8) = 125970. The binomial coefficient
is symmetric around the center for good reasons. Mathematically, this
means that in all cases C(n,k) = C(n,n-k) holds true.

(●) Defective Parts

A machine produces parts with an error rate of 2 %. We take a sample of six
parts. What's the probability of finding exactly one defective part?

We first need to extract all the parameters that are required as inputs in the
binomial distribution. The probability of finding a defective part is p = 0.02.
We take a sample of six parts, so n = 6. And we are interested in knowing
how likely it is to have exactly one defective part in our sample, which
means that k = 1. This is all we need to solve the problem:
p(1/6 defective) = C(6,1) · 0.02^1 · 0.98^5 = 10.8 %

We could have deduced this with the multiplication rule as well. The chance
of the first part being defective and the other five parts being functional is:
0.02 · 0.98^5. In total, there are six sequences that have one defective part
among six parts, all having the same probability. So we get the result:
p(1/6 defective) = 6 · 0.02 · 0.98^5 = 10.8 %

(●) Multiple Choice Revisited

Multiple choice tests provide a nice playground for the binomial
distribution. So let's revisit the problem from chapter one. Picture a multiple
choice test with ten questions in total and each question offering three
possible answers. With the binomial distribution it's relatively easy to
calculate the chances of having no, one, two, and so on correct answers if
you only choose the answers at random.

The ten questions correspond to n = 10 trials. During each trial, there's a p =
1/3 chance of getting a correct answer. The probability of k = 0 correct
answers is:
p(0 correct) = C(10,0) · (1/3)^0 · (2/3)^10 = 0.017

Or about 1 in 60. Note that we already derived this result in chapter one
using the multiplication rule with about the same amount of work. So at
first glance, it seems like we didn't gain that much by using the binomial
formula.

However, deriving the result for k = 1 and k = 2 correct answers would be
much harder to do with the multiplication rule, and for k = 3 next to
impossible. This is where the true power of the binomial distribution lies.
We don't need to concern ourselves with the many ways an outcome can
come to be. We simply plug in another value for k.

That being said, the chances of getting k = 1 and k = 2 correct answers on
the test when choosing the answers at random are:
p(1 correct) = C(10,1) · (1/3)^1 · (2/3)^9 = 0.087

p(2 correct) = C(10,2) · (1/3)^2 · (2/3)^8 = 0.195

As you can see, these outcomes are already much more likely than choosing
only incorrect answers. We could continue and do this for all possible
results up to k = 10 correct answers. Below you can see the plot of this
calculation:

The most likely outcome is to have three correct answers, but two and four
correct answers occur with a probability that is comparable to that. The
probabilities drop sharply towards the end. The odds of nine or ten correct
answers are so small that they don't even show on the graph.

What if you know some of the answers for sure and only have to guess the
remaining ones? We can also apply the binomial formula there. Assume we
know six answers for sure and guess the remaining four. Then we have n =
4 trials with the unchanged chance of p = 1/3 for a correct answer. The odds
for no correct answer (besides the six we know for sure, of course) are thus:
p(0 correct) = C(4,0) · (1/3)^0 · (2/3)^4 = 0.198

Or in other words: there's a respectable 1 – 0.198 = 0.802 = 80.2 % chance
to get at least one additional correct answer by guessing the remaining four.
Not too bad. I leave it up to you to compute how all these results change
when we are given four instead of three possible answers for each question.
This certainly is not favorable to the guessing test takers.
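
To see the whole shape of the distribution described above, a short loop suffices (a sketch along the lines of the binom function from the start of the chapter):

```python
from math import comb

# Distribution of correct answers when guessing all ten questions:
n, p = 10, 1/3
for k in range(n + 1):
    print(k, comb(n, k) * p**k * (1 - p)**(n - k))
# k = 3 is the most likely outcome; nine or ten correct answers
# have probabilities too small to show on a plot
```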

(●) Frumkina Law

The definite English article (and most common English word) “the” has a
frequency of about 0.07, meaning that within one hundred words we expect
it to come up seven times. A sample of texts taken from the newspaper
“USA Today” (sample size 15810 words) shows that the average length of
an English sentence is 19 words. What is the probability of the word “the”
not being included in a sentence of average length?

The Frumkina Law states that we are allowed to apply the binomial
distribution in good approximation to the distribution of words in a text. So
let's extract all the parameters we need as inputs for the formula. In abstract
terms, the 19 words correspond to n = 19 trials with each trial having the
chance p = 0.07 of resulting in the event “the”. So the chance of not having
the definite article come up is:
B(19,0) = C(19,0) · 0.07^0 · 0.93^19 = 0.252 = 25.2 %

What about the chance of it showing up exactly once in a sentence of this
length? We can directly apply the formula again with k = 1 to get the
answer:
B(19,1) = C(19,1) · 0.07^1 · 0.93^18 = 0.360 = 36.0 %

Just out of curiosity, we'll continue this train of thought and calculate the
probability of the definite article appearing twice in an average English
sentence:
B(19,2) = C(19,2) · 0.07^2 · 0.93^17 = 0.244 = 24.4 %

If we go even further, we will see that the probability keeps on decreasing
(and quickly) from this point on, meaning that the most likely outcome is to
have it appear once in a sentence. Note that the three outcomes we looked at
already cover about 85 %. In other words, the chance of “the” coming up
less than three times is 85 %.

Just as a side note, the analysis of the sample confirmed the rule of thumb
that the average word length in English is five letters (the actual result was
4.91 letters per word, close enough). It also revealed that reading the “USA
Today” requires a 12th grade reading level. If you are interested in doing
such text analysis, I recommend the “Advanced Text Analyser” featured on
the website usingenglish.com.

(●●) Penalty Kicks

About 80 % of all the penalty kicks in soccer result in a goal. In a penalty
shoot-out a team takes five shots. What's the probability that the majority of
the penalty kicks are successful? A majority means that either three of five,
four of five or even all of the shots end up in the net. So our grand plan is to
calculate p(3 of 5 successful), p(4 of 5 successful) and p(5 of 5 successful)
and sum them up.

Let's get started and find the required inputs. The text tells us that p = 0.8
and n = 5. For k we will have either 3, 4 or 5.

p(3 of 5 successful) = C(5,3) · 0.8^3 · 0.2^2 = 0.205

p(4 of 5 successful) = C(5,4) · 0.8^4 · 0.2^1 = 0.410

p(5 of 5 successful) = C(5,5) · 0.8^5 · 0.2^0 = 0.328

Finally we sum them up to get the odds that the majority of the penalty
shots result in a goal:

p(majority successful) = 0.943 = 94.3 %


These are surprisingly good odds. Did you see it coming? Note that the
probability p(4 of 5 successful) is the highest among the three we
calculated, so having four goals in the shoot-out is the most likely outcome
for the team. Mathematicians call this the expected value. You can easily
get it for any binomial distribution by multiplying p with n. In our case: 0.8
· 5 = 4. But more on that in the next chapter.

(●●) I'm Off

When teaching small groups, there's always a chance of too few people
showing up for the course to take place. One participant is ill, another is in an
important meeting and yet another is away on training. The agreement is: if fewer
than two people show up, the appointment is canceled.

Let's assume that on average each participant shows up with a chance of 75
%. How likely is it that the appointment gets canceled if the group consists
of four people? To answer this question, we calculate two probabilities: the
probability that no participant shows up and the probability that exactly one
person attends.

Using the inputs p = 0.75 and n = 4 we get:

p(0 of 4 show up) = C(4,0) · 0.75^0 · 0.25^4 = 0.0039

p(1 of 4 show up) = C(4,1) · 0.75^1 · 0.25^3 = 0.047

The chance of the appointment being canceled is:

p(canceled) = 0.047 + 0.0039 = 0.051 = 5.1 %

Or 1 in 20. Theoretically this number should decrease as we increase the
size of the group. The more people are signed up, the less likely it should be
that too few show up. And indeed there's only a 1.6 % (or 1 in 65) chance of
cancellation for a group of five people and only a 0.5 % (or 1 in 215)
chance of cancellation when having six participants.

In deriving these last results, that is the chance of cancellation for groups of
five and six people, we assumed the chance of the participants showing up
to be the same as in a group of four people. An analysis of the groups I
taught over the years shows that this assumption should only be used as a
first approximation.

In reality, there is a clear trend for the attendance probability to decrease
with the number of people in a group. For a sample of twenty courses I
found this relationship between the probability of attendance p and the group
size n:
p = 0.50 + 0.76 · n^-0.8

For groups of four, five and six participants this leads to:

- p = 0.75 for a group of n = 4 people
- p = 0.71 for a group of n = 5 people
- p = 0.68 for a group of n = 6 people

The reason for this decrease might be the diffusion of responsibility. In a
group of four people, each participant has a higher responsibility for the
appointment taking place than in a group of six people. As a result of this,
the odds of cancellation do not decrease as strongly as suggested by our
initial calculations.

For a group of four people, the result of having a 5.1 % chance of
cancellation remains. But taking the above dependence into account, the
corresponding probability for a group of five people is now 2.7 % (or 1 in
37) and for a group of six people 1.4 % (or 1 in 71). Still a significant
decrease, but noticeably dampened.
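
Combining the fitted attendance model with the cancellation condition gives a compact sketch (the model constants are the ones fitted above; p_cancel is my own name):

```python
# Cancellation chance (fewer than two attendees) for a group of n,
# with the fitted attendance probability p = 0.50 + 0.76 * n^-0.8:
def p_cancel(n):
    p = 0.50 + 0.76 * n**-0.8
    none = (1 - p) ** n                # nobody shows up
    one = n * p * (1 - p) ** (n - 1)   # exactly one shows up
    return none + one

for n in (4, 5, 6):
    print(n, p_cancel(n))              # roughly 0.05, 0.027, 0.014
```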

(●●) More Codes


Not long ago, mankind first sent rovers to Mars to analyze the planet
and find out if it ever supported life. The nagging question "Are we alone?"
drives us to penetrate deeper into space. A special challenge associated with
such journeys is communication. There needs to be a constant flow of
digital data, strings of ones and zeros, back and forth to ensure the success
of the space mission.

During the process of transmission over the endless distances, errors can
occur. There's always a chance that zeros randomly turn into ones and vice
versa. What can we do to make communication more reliable? One way is
to send duplicates.

Instead of simply sending a 0, we send the string 00000. If not too many
errors occur during the transmission, we can still decode it on arrival. For
example, if it arrives as 00010, we can deduce that the originating string
was with a high probability a 0 rather than a 1. The single transmission
error that occurred did not cause us to incorrectly decode the string.

Assume that the probability of a transmission error is p and that we add to
each 0 (or 1) four copies, as in the above paragraph. What is the chance of
us being able to decode it correctly? To be able to decode 00000 on arrival
correctly, we can't have more than two transmission errors occurring. So
during the n = 5 transmissions, k = 0, k = 1 and k = 2 errors are allowed.
Using the binomial distribution we can compute the probability for each of
these events:
Using the binomial distribution we can compute the probability for each of
these events:
p(0 errors) = C(5,0) · p^0 · (1-p)^5

p(1 error) = C(5,1) · p^1 · (1-p)^4

p(2 errors) = C(5,2) · p^2 · (1-p)^3

We can simplify these expressions somewhat. A binomial calculator
provides us with these values: C(5,0) = 1, C(5,1) = 5 and C(5,2) = 10. This
leads to:
p(0 errors) = (1-p)^5

p(1 error) = 5 · p · (1-p)^4

p(2 errors) = 10 · p^2 · (1-p)^3

Adding the probabilities for all these desired events tells us how likely it is
that we can correctly decode the string.

p(success) = (1-p)^3 · ((1-p)^2 + 5·p·(1-p) + 10·p^2)

In the graph below you can see the plot of this function. The x-axis
represents the transmission error probability p and the y-axis the chance of
successfully decoding the string. For p = 10 % (1 in 10 bits arrive
incorrectly) the odds of identifying the originating string are still a little
more than 99 %. For p = 20 % (1 in 5 bits arrive incorrectly) this drops to
about 94 %.
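
The success function is easy to explore numerically; a sketch:

```python
# Chance of correctly decoding a five-fold repeated bit when each
# bit flips independently with probability p (at most two errors allowed):
def p_decode(p):
    q = 1 - p
    return q**5 + 5 * p * q**4 + 10 * p**2 * q**3

print(p_decode(0.10))   # about 0.991
print(p_decode(0.20))   # about 0.942
```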

The downside to this gain in accuracy is that the amount of data to be
transmitted, and thus the time it takes for the transmission to complete,
increases fivefold.
(●●●) Survey Accuracy

It's election time and we want to know what percentage of the people
support our candidate. So we do a survey, we ask 1000 people if they intend
to vote for our candidate. 55 % answer yes, which makes us optimistic
about the outcome of the election. But how reliable is this answer?

Let's switch to an all-knowing perspective and look down upon the survey
takers. We know that in the entire population a fraction p supports a certain
candidate. If the survey takers ask n people, how likely is it that their
estimate for p, let's call this value q, will fall within 5 % of the actual value?

That's a tough one. Let's do it step by step, starting with the binomial
distribution. The chance that k out of n people support our candidate in the
survey is:
B(n,k) = C(n,k) · p^k · (1-p)^(n-k)

The survey taker's estimate for the level of support will then be: q = k/n.
Now ideally q = p, but this is in no way guaranteed or even realistic. There
will be some deviation from the actual value. To be within 5 % of the actual
value, the survey must result in a value for q that lies within 0.95·p and
1.05·p. Or in other words: the number of supporters k must be within
0.95·p·n and 1.05·p·n. So to find out how likely it is to get a result within 5
% of the actual value, we need to sum the probabilities for all these
outcomes:
- k equals 0.95·p·n
- k equals 0.95·p·n + 1
- k equals 0.95·p·n + 2
- and so on until we reach k equals 1.05·p·n

That's a lot of terms. Suppose we ask 1000 people, with 45 % of the entire
population supporting our candidate. Ideally 450 people will respond
favorably to our candidate, but hitting the nail on the head so precisely is
rather unrealistic. To be within 5 % of the actual value, the number of
supporters in the survey must fall between k = 428 and k = 472. That's 45
binomial terms we need to sum to get the probability for this outcome.

p(within 5 %) = B(1000,428) + B(1000,429) + B(1000,430) + ... + B(1000,471) + B(1000,472)

A computer programmed to do such calculations will tell us that in our case there's an 89.3 % chance of being within the 5 % tolerance. Of course the
more people we ask, the more reliable the result will be. For example if we
limit ourselves to asking only 500 people, the probability of being within
the 5 % tolerance drops to 73.6 %.
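
If you'd like to program such a computer yourself, here is a Python sketch of the summation (how the boundary values of k are rounded is my own choice, so the output can deviate somewhat from the figures quoted above):

from math import ceil, comb, floor

def p_within_tolerance(n, p, tol=0.05):
    # Sum the binomial terms B(n,k) for all k with k/n inside
    # the band from (1-tol)*p to (1+tol)*p.
    lo, hi = ceil((1-tol)*p*n), floor((1+tol)*p*n)
    return sum(comb(n, k) * p**k * (1-p)**(n-k) for k in range(lo, hi+1))

print(p_within_tolerance(1000, 0.45))
print(p_within_tolerance(500, 0.45))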

(●●) Penalty Kicks Revisited

If you haven't read the penalty kick snack, do it before enjoying this treat.
Otherwise you're good to go. When talking about the penalty kicks, we
focused on one team and ignored the other. Let's include both in our
discussion. One question certainly pops up quickly: What is the chance of
the teams being tied after each team took five shots at the goal? This is the
case if the result is either 0:0, 1:1, 2:2, 3:3, 4:4 or 5:5. Any guesses?

Let's do it one step at a time. We assume both teams have an 80 % chance of making a successful penalty kick. Thus both teams have the same
probability for k successful shots. This simplifies the problem. To find out
what the odds of a 0:0 are, we simply need to calculate p(0 of 5 successful)
and make use of the multiplication rule to answer how likely it is that this
event occurs twice in a row.

p(0:0) = p(0 of 5 successful)^2

In a similar way we arrive at the formulas for p(1:1), p(2:2), and so on.
Summing them all up gives us the probability of having a tie at the end of
the penalty shoot-out:
p(tie) = p(0 of 5 successful)^2 + p(1 of 5 successful)^2 + p(2 of 5 successful)^2 + ... + p(5 of 5 successful)^2

Taking the values p = 0.8 and n = 5, we get the individual probabilities via
the binomial distribution.

p(tie) = 0.32 = 32 %

So about 1 out of 3 shoot-outs will not be decided after the regular five
penalty kicks. If you did all the calculations, you probably noticed how
unlikely the result 0:0 is. There's only a 1 in 10 million chance for that to
happen. Of all the ties, the result 4:4 is the most likely one.
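
For those who skipped the calculations, here is a short Python sketch (my own) that produces all these numbers at once:

from math import comb

p, n = 0.8, 5
pmf = [comb(n, k) * p**k * (1-p)**(n-k) for k in range(n+1)]

print(sum(q**2 for q in pmf))   # p(tie), about 0.32
print(pmf[0]**2)                # p(0:0), about 1 in 10 million
print(pmf[4]**2)                # p(4:4), the most likely tie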

(●●●) Random Walk

After a pleasurable night of drinking, John decides to walk home. But the
high concentration of alcohol in his veins makes it hard for him to stay the
course. He randomly takes either a step forwards (with the probability p) or
a step backwards (with the probability 1-p). Now he has taken n steps in
total. What is the probability that he is x or more steps away from his
origin? I will not give you any numerical values just yet, let's rather draw up
a battle plan which will work for any particular data.

Assume he has taken k of the n steps forwards. This of course also means
that he has taken n-k of the n steps backwards, so his distance from the
origin is:
d = k – (n-k) = 2·k – n

This formula makes sense. If he took all the steps forwards (k = n), then his
distance from the origin is d = n. And if he took all the steps backwards (k =
0), the formula gives us a distance of d = -n. The same magnitude in
distance, but in the opposite direction.
The binomial distribution enables us to calculate the chance that k out of n
steps are taken forwards:

p(k forward steps) = C(n,k) · p^k · (1-p)^(n-k)

Which values of k satisfy our condition that he is a distance of x or more steps away from the origin? Solving this equation for k will help us:
x = 2·k1 – n

For this particular value of k, let's call it k1 as indicated in the equation, he is exactly x steps from the origin. Any k greater than k1 will result in a
distance greater than x. So these values are also of interest to us. But as of
now, we only looked at the forward direction. Being x or more steps from
the origin can also mean that he is at d = -x, d = -x-1 and so on. So we need to
compute yet another special value of k from this equation:
– x = 2·k2 – n

For this value of k he again is x steps away from the origin, but this time in
the opposite direction. Any k smaller than k2 will result in a distance greater
than x. We now know that if he is x or more steps away from the origin, he
either took anywhere between k1 and n steps forwards (distance greater or
equal to x in forward direction) or only 0 and k2 steps forwards (distance
greater or equal to x in backward direction).

To get the overall probability for him to be x or more steps from the origin,
we sum the probabilities p(k forward steps) for all the values of k that we
determined to be of relevance. Phew, that was some tough work. Now let's
make it easier by inserting numerical values.

We'll set the probability of a step forward to p = 0.6. In total he takes n = 12 steps. What's the probability that he is x = 6 or more steps from the origin after taking all the steps?

The distance to the origin after k forward steps is:

d = 2·k – 12

In the forward direction he is exactly 6 steps from the origin if k is given by the solution of this equation:

6 = 2·k1 – 12

or k1 = 9. So for anywhere between k = 9 and k = 12 forward steps he is 6 or more steps from the origin. Now let's turn to the backwards direction.
Here he's exactly 6 steps from the origin if k equals the solution of this
equation:
– 6 = 2·k2 – 12

or k2 = 3. If he took only between k = 0 and k = 3 steps forwards, he is again at a distance of 6 or more steps from the origin but in the opposite
direction (remember, not taking a step forwards means taking a step
backwards). Now we have all we need to apply the binomial distribution.

We separately calculate the probability of taking 9 forward steps, 10 forward steps and so on up to 12. After that we compute the probability of
taking only 0 forward steps, 1 forward step, and so on up to 3. Then we sum
them all.

p(0 forward steps) = C(12,0) · 0.6^0 · 0.4^12

p(1 forward step) = C(12,1) · 0.6^1 · 0.4^11

p(2 forward steps) = C(12,2) · 0.6^2 · 0.4^10

And so on for all relevant values of k as determined above. Making the sum
provides us with the odds that he is 6 or more steps from the origin after
taking 12 random steps:
p(distance 6 steps or more) = 24 %
For doing such calculations it is of great help to have a calculator that can
compute cumulative probabilities, that is, automatically sums the
distribution up to or beyond a specific value. For example, using the "Stat
Trek Binomial Calculator" I only needed to compute two values to get the
above result, the cumulative probability for all k smaller than or equal to 3
and the cumulative probability for all k greater than or equal to 9. This is
certainly much quicker than summing all the terms individually.
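
If you prefer a few lines of code over an online calculator, the whole battle plan fits into this Python sketch (written for the numerical values above):

from math import comb

n, p, x = 12, 0.6, 6
k1 = (x + n) // 2   # exactly x steps forwards: x = 2*k1 - n
k2 = (n - x) // 2   # exactly x steps backwards: -x = 2*k2 - n

def pmf(k):
    return comb(n, k) * p**k * (1-p)**(n-k)

prob = sum(pmf(k) for k in range(k2+1)) + sum(pmf(k) for k in range(k1, n+1))
print(prob)   # about 0.24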

This statistical problem might seem quite artificial to you, but it actually
comes from a real-world application. The diffusion of a substance in a host
substance happens by means of a random walk. The diffusing particles
collide frequently with the molecules of the host substance and thus
randomly take steps forwards and backwards. Of course here the situation is
even more complex because we are faced with a three-dimensional random
walk, which provides much more possible directions than just forwards and
backwards.
Chapter Three: To Be Expected

(Picture by go_greener_oz, see references for link)

(●) The Basics

Knowing the probabilities unfortunately doesn't enable us to predict the future, but it does tell us what to expect, that is, we can deduce what the
most likely outcome is.

We already talked briefly about the expected value while discussing penalty
kicks. Statistical analysis of soccer games reveals that the chance of a
successful penalty kick is 0.8 or 80%. In a penalty shoot-out, each team gets
five penalty kicks. How many goals should we expect from a team? Our
analysis showed that the event "four out of five" has the highest probability.
We could have gotten this result much quicker by multiplying our chance
(0.8) with the number of trials (5).

So in general, if we have an event that occurs with the probability p during one trial and we conduct n trials, we expect the event to occur e times:

e = p · n
This is the expected value. It is quite straightforward and we can use it for
many interesting applications. You can use it to decide if a game is fair or
favors a certain party (the player, the casino) or to set interest rates for
credits.

Let's go a little beyond that to include what is called a discrete probability distribution. Consider Martin, a car salesman. He kept records of his sales
for each week over the last 100 weeks. There were 5 weeks with no sales,
12 weeks with only one sale, 44 weeks with two sales, 30 weeks with three
sales and 9 weeks with four sales. He was not able to get more than four
sales in a week over this time.

From these numbers we can calculate the odds of him selling a certain
number of cars in a week:

- 0 sales with probability 0.05
- 1 sale with probability 0.12
- 2 sales with probability 0.44
- 3 sales with probability 0.30
- 4 sales with probability 0.09

This is a discrete probability distribution. The addition of the word "discrete" tells us that the outcomes only cover whole numbers; we can't
have 3.5 or 2.7 sales. Note that the odds in a probability distribution must
add up to one, so one of these outcomes is certain to appear on the next
trial.

What number of sales can we expect from Martin during a week? To get the
expected value of a discrete probability distribution, we do this weighted
sum:
e = 0 · 0.05 + 1 · 0.12 + ... + 4 · 0.09 = 2.26

You might consider this value to be somewhat useless, as we got a decimal value for the expected number of sales. Can't we just say the expected
number of sales is 2?
If we do this, we would conclude that the number of sales over the next 50
weeks is about 50 · 2 = 100, when the expected value for the sales over the
next 50 weeks is actually 50 · 2.26 = 113. Rounding prematurely would
have cost us a great amount of accuracy. So even for discrete distributions
we should accept decimal expected values and rather interpret them over a
larger number of trials.
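
In code, the weighted sum is a one-liner. A quick Python sketch using Martin's distribution:

sales = [0, 1, 2, 3, 4]
probs = [0.05, 0.12, 0.44, 0.30, 0.09]

e = sum(n * p for n, p in zip(sales, probs))
print(e)        # 2.26 expected sales per week
print(50 * e)   # 113 expected sales over 50 weeks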

Let's generalize that: we are given a number of outcomes associated with the numerical values n(1), n(2), n(3), and so on. The probability of each of
these outcomes is respectively p(1), p(2), p(3), and so on. To calculate the
expected value, we do this weighted sum:
e = n(1) · p(1) + n(2) · p(2) + n(3) · p(3) + ...

Usually the expected value is going to be close to the numerical value associated with the highest probability. In our example the chance of him
selling two cars in a week was the highest and indeed the expected value
was close to two. But this is not necessarily always the case. When the
distribution is strongly skewed, the expected value can differ significantly
from the most likely outcome.

Before going to the snacks, let's look at the special case of having a uniform
probability distribution. Here all the outcomes are equally probable. This
means that if we have n possible outcomes, each must have the probability
1/n (otherwise they won't add up to one as required). The expected value
then is:
e = n(1) · 1/n + n(2) · 1/n + n(3) · 1/n + ...

e = 1/n · (n(1) + n(2) + n(3) + …)

So for a uniform distribution the expected value is just the arithmetic mean.
In other words, we can interpret the expected value (as defined by the
weighted sum for a probability distribution) to be a generalization of the
common arithmetic mean, where each number carries the same weight.
(●) My Fair Game

You meet a nice man on the street offering you a game of dice. For a wager
of just 2 $, you can win 8 $ when the die shows a six. Sounds good? Let's
say you join in and play 30 rounds. What will be your expected balance
after that?

You roll a six with the probability p = 1/6. So of the 30 rounds, you can
expect to win 1/6 · 30 = 5, resulting in a pay-out of 40 $. But winning 5
rounds of course also means that you lost the remaining 25 rounds, resulting
in a loss of 50 $. Your expected balance after 30 rounds is thus -10 $. Or in
other words: for the player this game results in a loss of 1/3 $ per round.

Let's make a general formula for just this case. We are offered a game
which we win with a probability of p. The pay-out in case of victory is P,
the wager is W. We play this game for a number of n rounds.

The expected number of wins is p·n, so the total pay-out will be: p·n·P. The
expected number of losses is (1-p)·n, so we will most likely lose this
amount of money: (1-p)·n·W.

Now we can set up the formula for the balance. We simply subtract the
losses from the pay-out. But while we're at it, let's divide both sides by n to
get the balance per round. It already includes all the information we need
and requires one less variable.

B = p · P – (1-p) · W

This is what we can expect to win (or lose) per round. Let's check it by
using the above example. We had the winning chance p = 1/6, the pay-out P
= 8 $ and the wager W = 2 $. So from the formula we get this balance per
round:
B = 1/6 · 8 $ – 5/6 · 2 $ = – 1/3 $ per round
Just as we expected. Let's try another example. I'll offer you a dice game. If
you roll two sixes in a row, you get P = 175 $. The wager is W = 5 $. Quite
the deal, isn't it? Let's see. Rolling two sixes in a row occurs with a
probability of p = 1/36. So the expected balance per round is:
B = 1/36 · 175 $ – 35/36 · 5 $ = 0 $ per round

I offered you a truly fair game. No one can be expected to lose in the long
run. Of course if we only play a few rounds, somebody will win and
somebody will lose.

It's helpful to understand this balance as being sound for a large number of
rounds but rather fragile in case of playing only a few rounds. Casinos are
host to thousands of rounds per day and thus can predict their gains quite
accurately from the balance per round. After a lot of rounds, all the random
streaks and significant one-time events hardly impact the total balance
anymore. The real balance will converge to the theoretical balance more
and more as the number of rounds grows. This is mathematically proven by
the Law of Large Numbers. Assuming finite variance, the proof can be done
elegantly using Chebyshev's Inequality.

The convergence can be easily demonstrated using a computer simulation. We will let the computer, equipped with random numbers, run our dice
game for 2000 rounds. After each round the computer calculates the
balance per round so far. The below picture shows the difference between
the simulated balance per round and our theoretical result of – 1/3 $ per
round.
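
If you'd like to run the experiment yourself, here is a Python sketch of the simulation, printing the difference instead of plotting it:

import random

random.seed(1)   # any fixed seed makes the run reproducible
balance = 0
for rounds in range(1, 2001):
    if random.randint(1, 6) == 6:
        balance += 8   # pay-out P on a six
    else:
        balance -= 2   # the wager W is lost
    if rounds % 400 == 0:
        print(rounds, balance / rounds + 1/3)   # difference to -1/3 $
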
(●) Country Roads

You are done with work and want to get home as quickly as possible. Two
options are available to you: take the country road or the highway. Since
you've been driving these routes for quite some time now, you know what
to expect. The route over the country road takes 60 minutes on average,
traffic jams do not arise since most people prefer the highways. The route
over the highway takes 45 minutes if there's no traffic jam and 90 minutes if
there's one. Your experience suggests that there's a 1 in 5 chance that you'll
end up in a jam. Which route should you prefer?

What we seek is the expected travel time when taking the highway. We
could approach this in a similar way as in the case of the dice game. But
let's rather interpret the given data as a discrete probability distribution
having these possible outcomes:
- 45 minutes with probability 0.8
- 90 minutes with probability 0.2

To get the expected value we do the weighted sum:

e = 45 · 0.8 + 90 · 0.2 = 54 minutes

So on average you'll save six minutes per trip, or half an hour per week assuming five workdays, by taking the highway.

Let's check this using a simple computer simulation. We create a variable called "sum". Then we loop the following 100 times: The computer creates
a random number between 0 and 1. If this number is smaller than 0.8, it will
add 45 to the sum, otherwise it adds 90. After 100 such loops the sum is
divided by 100, giving the average travel time. Here are the results of three
such trials:
- 54.9 minutes
- 52.2 minutes
- 53.1 minutes

As you can see our theoretical result agrees with the numerical simulation
within acceptable boundaries (the above results deviate no more than 4 %
from the expected 54 minutes). If you have basic knowledge of a
programming language, it's always fun to check the result with simulations.
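
For reference, here is the described simulation in Python:

import random

total = 0
for _ in range(100):
    if random.random() < 0.8:
        total += 45   # free highway
    else:
        total += 90   # traffic jam
print(total / 100)    # average travel time, scattering around 54 minutes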

(●●) Credits

A bank gives a 15000 $ credit to a business owner, who agrees to pay back
the credit after one year with p percent interest. After careful analysis, the
bank estimates that there's a 5 % chance that the money will be lost. What is
the expected win or loss for the bank with a p = 4 % interest? How should
they set the interest rate to have an expected win of 500 $?

Let's look at the first question. Again we can consider this to be a discrete
probability distribution with two outcomes. One outcome is that the
business owner pays back the credit after one year with the agreed interest.
This outcome occurs with 95 % chance and results in a win of 600 $ (4 %
of 15000 $) for the bank. The other possible outcome is that the business
owner will not be able to pay back the money. This occurs with a 5 %
chance and results in a loss of 15000 $ for the bank. So in summary, this is
the probability distribution for this case:
- 600 $ win with chance 0.95
- 15000 $ loss with chance 0.05

The expected value is:

e = 600 · 0.95 + (- 15000) · 0.05 = - 180 $

So the bank should expect to lose 180 $ per credit of this type. Of course no
bank will grant you such a credit. Rather, they calculate how the interest
rate should be set for them to expect a profit. In our case we want the profit
(the expected value) to be 500 $. We can set up an equation for that. Let's
leave the interest p undetermined and insert 500 $ for the expected
value.

500 $ = p · 15000 · 0.95 + (- 15000) · 0.05

Compare this to the above calculation to understand the structure. Solving the equation for p leads to:

p = 0.088 = 8.8 %

For the bank to make 500 $ profit per credit of this kind, it needs to set the
interest rate at 8.8 %. This is a rather high interest rate. Luckily if you
spread the credit over more than one year (which is usually the case), this
would decrease significantly.

(●) My Fair Game Revisited

Assume we want to create a fair game using a wheel that is divided into n
fields. The player spins this wheel after paying the wager. One of these
fields results in a win for the player, the other n-1 fields lead him to lose the
game. Given the payout P and the wager W, how many fields should the
wheel have in order for the game to be fair?

To answer that, we go back to the formula for the balance per round from
the snack “My Fair Game”. Since the winning chance is p = 1/n and the
corresponding chance of losing 1-p = (n-1)/n, we get this balance per
round for the game:
B = 1/n · P – (n-1)/n · W

In a fair game, the balance per round is zero. Neither the player nor the
“casino” is expected to gain anything in the long run. Thus all we need to
do to make things fair is to set B = 0:
0 = 1/n · P – (n-1)/n · W

Now we can solve algebraically for the number of fields n that satisfy this
condition. Start by multiplying both sides with n and go from there. The
result is:
n = 1 + P/W

If for example the payout is five times the wager, so P/W = 5, our wheel
should consist of n = 6 fields. Including more than six fields will decrease
the chance for a win and thus favor the casino. Similarly, fewer fields will
give the player the edge. Of course in real-world casinos the latter never
happens. The games are always set up in such a way that the casino is
expected to gain in the long run.

(●●) This could take a while …

Two opponents of equal strength, let's call them Bill and Marcus, are
competing against each other over several rounds of chess. Whoever wins
two rounds in a row, wins the game. With this rule, the game could end
after a minimum of two rounds or, if they win the rounds in an alternating
fashion, could go on forever. The question that comes to mind is: what is
the average length of such a game?
Imagine the game ends after two rounds. That means that either the
sequence B – B or M – M occurred. Both sequences have the probability
0.5 · 0.5, so in total the chance of the game ending so soon is: 2 · 0.5 · 0.5
or:
- Length 2 with p(2) = 2 · 0.5^2 = 0.5

What about if the game ends after three rounds? Again this can only happen
via two sequences: M – B – B or B – M – M. So the odds of the game
ending there is:
- Length 3 with p(3) = 2 · 0.5^3 = 0.5^2

It seems like there is a simple pattern. Let's confirm that by looking at the
chance of having a winner after four rounds. The only possible sequences
for that are: B – M – B – B or M – B – M – M, so we get the probability:
- Length 4 with p(4) = 2 · 0.5^4 = 0.5^3

Indeed we cracked the probability distribution for this game to end after a
certain number of rounds. All that is left is to compute the expected value
for the length. Applying the formula from the introduction of this chapter,
we get:
e = 2 · 0.5 + 3 · 0.5^2 + 4 · 0.5^3 + …

Note that we simply multiplied the lengths with their respective probabilities and did the sum. This leads to an infinite sum, which can be
solved analytically. But for now, let's just note the size of the terms quickly
decreases as we go along, so we can get a good estimate by just evaluating
the first ten terms. Doing this and rounding results in:
e = 3

Thus, the average length of this game is three rounds, not as long as you
might have guessed. Let's take it a step further and calculate how likely it is
that the game goes on for six rounds or more. The chance for the game
ending within five rounds is:
p(within 5 rounds) = 0.5 + 0.5^2 + 0.5^3 + 0.5^4
or about 0.94 = 94 %. This means that we can expect only about 6 % of the
games to go on for six rounds or more.
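
Here is a Python sketch (my own) that evaluates the first ten terms of the infinite sum and the chance of an early ending:

# Lengths 2 to 11, each occurring with p(L) = 2 * 0.5**L = 0.5**(L-1).
e = sum(L * 0.5**(L-1) for L in range(2, 12))
print(e)   # about 3

# Chance of the game ending within five rounds.
print(sum(0.5**(L-1) for L in range(2, 6)))   # 0.9375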

(●●) Complete Collection Revisited

In chapter one we took a probabilistic look at collecting stickers. It's time to revisit this problem. We assume that the complete sticker collection consists
of N different stickers. The question is: how many should one be expected
to buy to get the full set? This problem is also known as the coupon
collector's problem.

Assume we already have n different stickers in the collection. This means we are missing N – n stickers for the full set. The chance of getting a new
sticker on the next purchase is:
p(new) = (N – n) / N

The expected number of purchases to get a new sticker is the inverse of this
probability: e(new) = 1 / p(new). For example if our chance at a new sticker
is 1/8, we will most likely need to buy 8 stickers. So given that we already
have n stickers, this is how many we need to buy to get the next one:
e(new) = N / (N – n)

We begin with no stickers and need to buy e = N/N = 1 sticker for the next
new sticker. Now we have one sticker and need to buy e = N/(N-1) for the
next new one. Having two stickers now, we need to buy yet another e =
N/(N-2) stickers for the next addition. And so on until the collection is
complete. For the total number of purchases E, we need to add all these
expected values:
E = N/N + N/(N-1) + N/(N-2) + … + N/1
E = N · (1/N + 1/(N-1) + 1/(N-2) + … + 1/1)

That's a lot of terms. Luckily there is a handy approximation formula for the
sum in the bracket (which mathematicians call a harmonic series). The
larger N is, the more accurate this formula is:
E = N · (ln(N) + 0.58)

Let's look at some numbers. How many stickers can we be expected to purchase before completing a set of N = 30 stickers? Using the formula, we
get E = 120 stickers. It's interesting to note that 30 of the 120 purchases will
be for the last missing sticker alone and 15 of the 120 purchases for the one
before that.
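
A few lines of Python compare the exact harmonic sum with the approximation formula:

from math import log

N = 30
exact = N * sum(1/i for i in range(1, N+1))
approx = N * (log(N) + 0.58)
print(exact, approx)   # both close to 120 purchases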

(●) Not Another Repeat

When you toss a coin a large number of times, sooner or later some unlikely
events (such as a lot of heads in a row) will occur. Let's focus on such
repeats. Given that you tossed the coin n times, what is the longest string of
heads you can expect?

The probability of the coin showing heads is 0.5, so a string of k heads in a row occurs with the probability 0.5^k. Over the course of n throws, the expected number of such strings is:

e = 0.5^k · n

If e is bigger than one, we expect this string to come up during the n coin
tosses. For example, the expected value for a string of k = 5 heads in a row
within n = 500 throws is: e = 16. Of course the longer the string, the smaller
the expected value. For a string of k = 6 heads within n = 500 throws we
get: e = 8.

How long can we keep increasing the string size k before hitting the critical
value of e = 1? To answer this question, we simply set e = 1 and solve the
equation for k:
1 = 0.5^k · n

k(max) = ln(n) / ln(2)

where ln is the natural logarithm. For n = 500 tosses, the longest string that
is to be expected is k(max) = 9 heads in a row. For n = 5000 tosses this
number increases to k(max) = 12 heads in a row. Note the painfully slow
growth. We increased the number of tosses tenfold, yet the expected
maximum string size didn't even double.
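
A simulation shows this nicely. Here is a Python sketch (my own) that records the longest string of heads within n tosses:

import random

def longest_heads_run(n):
    best = run = 0
    for _ in range(n):
        if random.random() < 0.5:   # heads
            run += 1
            best = max(best, run)
        else:                       # tails resets the string
            run = 0
    return best

print([longest_heads_run(500) for _ in range(5)])    # typically around 9
print([longest_heads_run(5000) for _ in range(5)])   # typically around 12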

(●) Skewness

Let's take a look at how the skewness of a distribution can influence the
expected value. Two car salesmen, Stephen and Ben, follow Martin's
example and make a table of their weekly sales over the last 100 weeks.
Stephen got the following distribution:
- 0 sales with probability 0.10
- 1 sale with probability 0.20
- 2 sales with probability 0.40
- 3 sales with probability 0.20
- 4 sales with probability 0.10

As you can see it is perfectly symmetrical with two sales per week being
the outcome with the highest probability. The expected value is:
e = 0 · 0.10 + 1 · 0.20 + … + 4 · 0.10 = 2

Not surprisingly, it coincides with the most likely outcome. Over 50 weeks,
we would expect Stephen to sell 50·2 = 100 cars with this sales distribution.
Let's turn to Ben's sales:
- 0 sales with probability 0.15
- 1 sale with probability 0.30
- 2 sales with probability 0.40
- 3 sales with probability 0.10
- 4 sales with probability 0.05

Here the probabilities are visibly skewed towards zero. The expected value
should reflect that. Applying the formula we get:
e = 0 · 0.15 + 1 · 0.30 + … + 4 · 0.05 = 1.6

The expected value is now below the outcome having the highest
probability. Ben will only sell 50·1.6 = 80 cars over the course of 50 weeks
given this distribution.

How can we measure skewness in general? One common approach is to count how many classes are above (a) and below (b) the mean and divide this number by the total number of classes (n) minus two.

skew = | (a – b ) / (n – 2) |

Don't worry too much about the outer bracket, it just tells us to take the positive
value of whatever we get. Mathematicians call that the absolute value. For
example: |2| = 2 and |-3| = 3. Let's calculate the skewness of the given sales
distributions. Stephen's distribution has two classes above the mean and two
below the mean. This leads to:
skew = | (2 – 2 ) / 3 | = 0

Ben's distribution has three classes above the mean and two classes below
it. In this case we get:

skew = | (3 – 2 ) / 3 | = 0.33

You probably have been wondering why we divide by n minus two. Why
not simply divide by the number of classes or not divide it at all? Consider
this: The worst case scenario for a skewed distribution is to have only one
class on one side of the mean and the remaining classes on the other side of
the mean. In this case a = n - 1 and b = 1 (or the other way around). When
we calculate the skewness of such a distribution, we get:
skew = | (n – 1 – 1) / (n – 2) | = 1

So having n minus two in the denominator guarantees us that this measure of skewness remains between 0 (no skewness) and 1 (maximum skewness) and thus allows us to interpret the result in a straightforward way.
(●●●) Delay Revisited

Our analysis of train delays at a certain station has shown that the
probability p of a t minute delay is p(t) = 0.25·0.75^t. We can turn that into a
table by inserting different values of t:
- 0 minute delay with probability p(0) = 0.25
- 1 minute delay with probability p(1) = 0.19
- 2 minute delay with probability p(2) = 0.14
- 3 minute delay with probability p(3) = 0.11

And so on until infinity. First let's prove that this really is a discrete
probability distribution by showing that all the probabilities add up to one.
Consider the sum:
s = p(0) + p(1) + p(2) + ...

Inserting the equation for p(t) results in:

s = 0.25 · (0.75^0 + 0.75^1 + 0.75^2 + ...)


To evaluate the sum in the bracket we either need a great idea or a
formulary. Even though this is an exciting challenge, we will go with the
formulary which tells us that:
q^0 + q^1 + q^2 + ... = 1 / (1-q)

So we get:

s = 0.25 · 1 / (1-0.75) = 0.25 · 1/0.25 = 1

Indeed the probabilities all add up to one and p(t) is a discrete probability
distribution. For the expected delay, we need to do this weighted sum:
e = 0·p(0) + 1·p(1) + 2·p(2) + ...

With the equation for p(t) this becomes:

e = 0.25 · (0·0.75^0 + 1·0.75^1 + 2·0.75^2 + ...)


Again we use a good formulary to look up the sum in the bracket. Lucky for
us, there is a formula for just this case:

0·q^0 + 1·q^1 + 2·q^2 + ... = q / (1-q)^2

Now we can compute the expected delay:

e = 0.25 · 0.75 / (1-0.75)^2 = 3 minutes

As you saw, a discrete probability distribution does not need to be finite, as long as the probabilities decrease in such a manner that they still add up to one. And assuming we can find proper formulas to evaluate the arising infinite sum, we can still calculate the expected value.
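
Without a formulary, we could also have let a computer add up terms until the remainder becomes negligible. A Python sketch:

# Partial sums for the delay distribution p(t) = 0.25 * 0.75**t.
s = sum(0.25 * 0.75**t for t in range(200))       # approaches 1
e = sum(t * 0.25 * 0.75**t for t in range(200))   # approaches 3
print(s, e)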

(●●●) Infinitely Discrete

As one last snack, let's now prove that any function of the form p(t) = (1-q)·q^t is a discrete probability distribution, though we restrict the value of q to be between zero and one. For example, if you set q = 0.75, you'll arrive at
the function we had in the previous snack. And we were able to show that
this particular function indeed is a discrete probability distribution.

We do the proof by showing that the sum s of the probabilities p(0), p(1),
p(2), and so on, all add up to one. So let's write down this sum in general:
s = p(0) + p(1) + p(2) + ...

Inserting the equation for p(t) results in:

s = (1-q) · (q^0 + q^1 + q^2 + ...)

Using the summation formula for the sum in the bracket that we looked up
during the "Delay revisited" snack, we can write:

s = (1-q) · 1 / (1-q) = 1
Thus showing that the probabilities sum up to one and concluding the proof.
We can also deduce a general formula for the expected value. The proper
weighted sum is:
e = 0·p(0) + 1·p(1) + 2·p(2) + ...

Or, using the equation for p(t):

e = (1-q) · (0·q^0 + 1·q^1 + 2·q^2 + ...)


Again we use the corresponding summation formula for this infinite
geometric sum and get:

e = (1-q) · q / (1-q)^2 = q / (1-q)

When we insert q = 0.75, the expected value turns out to be e = 3, just what
we got earlier. So it all checks out.

Over the last chapters and snacks, we noticed that a lot of different
mathematical fields play into statistics. On this and many other occasions, a
sound knowledge of algebra proved to be important in solving the problem.
In other cases, we required or will require the help of geometry and
calculus. To succeed at statistics it is vital not to neglect other fields of
mathematics.
Chapter Four: Poisson Distribution

(●) The Basics

The Poisson distribution is a discrete probability distribution, similar to the binomial distribution. One big difference though is that instead of having
probabilities as inputs, we rather look at the average rate of a certain event
occurring. For example, instead of being given the chance of a goal
occurring during a game, we are given the average number of goals per
game and go from that.

Let's take this to be the introductory example. We know from looking at the
soccer team's history, that it produces goals with a mean rate of 2.4 goals
per game. Now we want to know how likely it is that during a particular
game it will not shoot any goal. Using the Poisson distribution we can
answer this question (and many more questions of this kind) in a straightforward way:
p(no goal) = 9 %

Here's the general formula to solve such problems. We are given an average
rate λ at which an event is occurring over a certain time span (goals per
game, accidents per year, mails per day). If the occurrence of the event is
random and independent of any previous occurrences, we can use this
formula to calculate the chance that it will occur k times during said time
span:

p(k occurrences) = e^(-λ) · λ^k / k!

Going back to the above example, we wanted to know how likely it is for k
= 0 goals to occur during a game when the average rate is λ = 2.4 goals per
game:

p(no goal) = e^(-2.4) · 2.4^0 / 0! = 0.09 = 9 %

You are probably wondering about the exclamation mark. What does it
mean to have a number followed by an exclamation mark? We call k! a
factorial and read "k factorial". Whenever we see this, we just multiply all
numbers down to one. For example: 3! = 3·2·1 = 6 or 5! = 5·4·3·2·1 = 120.
So nothing to worry about. Of course for 0! this doesn't work; it is defined
as 0! = 1. Keep that in mind.

Again you can easily find online calculators that do all the computing for
you. I recommend the "Stat Trek Poisson Distribution Calculator", which is
easy to use and displays cumulative probabilities. This can be very helpful
when answering questions featuring the phrases "at least" or "more than".
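
And if you'd rather build your own calculator, the formula is quickly typed in. A minimal Python sketch:

from math import exp, factorial

def poisson(k, lam):
    # Chance of k occurrences at an average rate of lam.
    return exp(-lam) * lam**k / factorial(k)

print(poisson(0, 2.4))   # about 0.09, the "no goal" example above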

Let's have some snacks.

(●) Tornadoes

Statistics show that in the US state of New York there are on average five
tornadoes per year. How likely is it that during one year we find only two
tornadoes? What's the probability of more than five tornadoes occurring?

Let's turn to the first question. All we need as inputs for the Poisson
distribution is the average rate, in this case λ = 5, and the number of
occurrences, in this case k = 2. Plugging that into the formula gives us:
p(2 tornadoes) = e^(-5) · 5^2 / 2! = 8.4 %

So the chance of only two tornadoes forming over a year is about 1 in 12.
This was the simpler of the two questions. What about the chance of having
more than five tornadoes? Since the Poisson distribution is infinite, we
shouldn't try to do this sum:

p(6 tornadoes) + p(7 tornadoes) + p(8 tornadoes) + ...

A better approach is to compute how likely it is to have five or less tornadoes. One minus whatever we get there is the probability of having
more than five tornadoes. Let's calculate the odds of five or less tornadoes
occurring by simply adding the chances for no tornado, for one tornado, and
so on up to five:
p(no tornado) = e^(-5) · 5^0 / 0! = 0.007

p(1 tornado) = e^(-5) · 5^1 / 1! = 0.034

Continuing this path until we get to five and summing all the terms results
in:
p(5 or less tornadoes) = 0.616

Since the probability for five or less tornadoes and the probability for more
than five tornadoes must add up to one, we can quickly get the desired
result:
p(more than 5 tornadoes) = 0.384 = 38.4 %

Using the Stat Trek calculator, we could have gotten this probability much
quicker by simply typing in the random Poisson variable as 5 and looking
up the line for the cumulative probability X > 5.

(●) It's Getting Hot


The Meteorological Service of Canada defines a heat wave as three or more
consecutive days with a maximum temperature of 32 °C or greater.
According to that definition, the Toronto Pearson International Airport
experienced 17 heat waves over the 30 years from 1971 to 2000. This
translates into a rate of about 0.57 heat waves per year. What are the
chances of having no heat waves over a period of three years?

The chance of having no heat wave during one year can be calculated easily
using the Poisson distribution with λ = 0.57 and k = 0:

p(no wave) = e^(-0.57) · 0.57^0 / 0! = 0.5655

We can use the multiplication rule to determine how likely it is that this
event occurs three times in a row.

p(no wave for 3 years) = 0.5655^3 = 0.1808

So the chance for that happening is about 18 %. You might think to yourself
now, if there are on average 0.57 heat waves per year that must mean that
we should expect 1.71 heat waves per three years. So can't we just use the
Poisson distribution with λ = 1.71 and k = 0 to calculate in one step how
likely it is that no heat wave occurs within three years? We can certainly try.

p(no wave for 3 years) = e^(-1.71) · 1.71^0 / 0! = 0.1809

Indeed we get the same result (neglecting the difference due to rounding
errors that occurred during the first approach). This is a beautiful property
of the Poisson distribution. When given a rate per year and asked to
calculate the likelihood of the event occurring a number of times over let's
say ten years, we can just compute the rate per ten years and do the exercise
in one take.

(●●) Foxy Lady


So far, our rates have always involved a time span. But this does not need to
be the case. Consider this data: from 1999 to 2000 The Mammal Society
conducted a national fox survey in Britain. It concluded that the number of
foxes per km² varies from 0.2 to 2.2. For our snack we'll take the average of
these two values to be the rate for the Poisson distribution. So assuming
there are 1.2 foxes per km², what is the chance of finding less than 3 foxes
in an area of 5 square kilometers?

First we should extend the given rate to cover the desired area. If there are
1.2 foxes per km², that means we can expect 6 foxes per 5 km². To find the
chance of having no foxes within five square kilometers we use the Poisson
distribution with λ = 6 and k = 0:

p(no foxes) = e^(-6) · 6^0 / 0! = 0.0025

Similarly, we use λ = 6 together with k = 1 and k = 2 to get the odds of having one or two foxes within the area of interest:

p(1 fox) = e^(-6) · 6^1 / 1! = 0.0149

p(2 foxes) = e^(-6) · 6^2 / 2! = 0.0446

Doing the sum gives us the probability of finding less than three foxes in a
5 square kilometer area:
p(less than 3) = 0.062 = 6.2 %

Why were we allowed to apply Poisson here? Remember, our condition for
using the distribution was that the events occur at random and
independently of each other. Foxes don't take preferential routes through the
woods and fields, thus their location at a certain time is rather random. On
top of that they are loners, roaming more or less independently of each
other. From that we can conclude that both our conditions are fulfilled
within acceptable boundaries. For animals forming herds (such as humans),
this would not have been the case. Our "roaming" is not Poisson
compatible.
(●●) Happy Customers

On average 4.6 customers arrive per hour in our office. We also know that
the average service time is 23 minutes. How many employees should we
hire so that the chance of serving the arriving customers without wait is 90
% or higher?

We can use the Poisson distribution and the information λ = 4.6 customers
per hour to make a table of how likely it is that a certain numbers of
customers arrive during a one hour period:

- 0 customers with p = e^(-4.6) · 4.6^0 / 0! = 0.010
- 1 customer with p = e^(-4.6) · 4.6^1 / 1! = 0.046
- 2 customers with p = e^(-4.6) · 4.6^2 / 2! = 0.106
- 3 customers with p = e^(-4.6) · 4.6^3 / 3! = 0.163

All of these cases, k = 0 to k = 3 customers per hour, add up to a probability of 0.325 = 32.5 %. So being able to serve three customers without wait does
not bring us even close to the 90 % target yet. We need to continue the
table:
- 4 customers with p = e^(-4.6) · 4.6^4 / 4! = 0.188
- 5 customers with p = e^(-4.6) · 4.6^5 / 5! = 0.173
- 6 customers with p = e^(-4.6) · 4.6^6 / 6! = 0.132
- 7 customers with p = e^(-4.6) · 4.6^7 / 7! = 0.087

Including these values in the sum as well, we see that the chance of having
k = 0 to k = 7 customers per hour is 0.905 = 90.5 %, which means that
being able to serve seven customers per hour without wait would indeed
bring us up to the 90 % target.

For seven customers the expected total service time is 7 · 23 minutes = 161
minutes. So we need 161/60 = 2.7 employees or, since we can't have a
decimal number of employees, rather 3 employees to have a 90 % or more
chance of serving the arriving customers without wait.
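
The search for the smallest number of customers that takes us past the 90 % mark can be automated as well. A Python sketch (the variable names are mine):

from math import ceil, exp, factorial

lam, target = 4.6, 0.9
cumulative, k = 0.0, 0
while cumulative < target:
    cumulative += exp(-lam) * lam**k / factorial(k)
    k += 1

customers = k - 1                 # the last term added to the sum
print(customers, cumulative)      # 7 customers, about 0.905
print(ceil(customers * 23 / 60))  # employees needed: 3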

I admit that this is a tricky problem, but it is also quite powerful. With slight
variations, we can use this approach to find a suitable value for the number
of telephone operators in a call-center, the number of docks to unload ships,
the number of salesmen in a store, and so on.

(●●●) Soccer

As the physicist Metin Tolan shows in his book "So werden wir
Weltmeister", soccer results follow the Poisson distribution within
acceptable boundaries. For a team that on average produces λ goals per
game, the probability of scoring k goals in a particular game is:

p(k goals) = e^(-λ) · λ^k / k!

Let two teams with the respective rates λ(1) and λ(2) participate in a match
against each other. In a first approximation (which under more careful
analysis would certainly require revision to some extent) we can assume
that the teams will produce goals independently of each other. In this case,
we can use the multiplication rule to easily compute the probability for a
certain end-result k(1) : k(2). To achieve this result, team 1 must produce
k(1) goals, this happens with the probability:
p(1) = e^(-λ(1)) · λ(1)^k(1) / k(1)!

At the same time, team 2 needs to score k(2) goals. The Poisson distribution
tells us this happens with the probability:
p(2) = e^(-λ(2)) · λ(2)^k(2) / k(2)!

For ending up with the result k(1) : k(2), both these events must occur. So
the probability is:
p(result) = p(1) · p(2)
In the "1. Bundesliga" (which is the main German soccer league) the
average is about λ = 1.5 goals per game, with the top teams going as high as
λ = 2.5 and underdogs going as low as λ = 0.7. Let's make a test run using
these two extreme values and calculate the probability for three specific
outcomes.

We assign team 1 the number of λ(1) = 2.5 goals per game, team 2 will be
the underdog with only λ(2) = 0.7 goals per game. Let's first calculate the
chance of seeing a boring 0:0 at the end of the game. Team 1 will score no
goals with the chance:

p(1) = e^(-2.5) · 2.5^0 / 0! = 0.082 = 8.2 %

For the underdog this probability is significantly higher. Plugging in the respective values gives:

p(2) = e^(-0.7) · 0.7^0 / 0! = 0.497 = 49.7 %

Using the multiplication rule, we can now compute the chance of seeing a
match with no goals occurring:

p(0:0) = 0.082·0.497 = 0.041 = 4.1 %

Luckily for the viewer, this result is quite unlikely. Let's see how that
compares to a 1:0 victory for team 1. Using the same approach we find that:
p(1:0) = 0.205·0.497 = 0.102 = 10.2 %

This result is already more than twice as likely as the 0:0. What about the
chance of the underdog claiming a 0:1 victory?

p(0:1) = 0.082·0.348 = 0.029 = 2.9 %

As expected, it's much less likely than the 1:0 and, though not as easily
expected, slightly less likely than the 0:0. This way we could check any
result we desire. We can also ask grander questions like: what is the chance
of a tie? This unfortunately would require us to do an infinite number of
calculations since:
p(tie) = p(0:0) + p(1:1) + p(2:2) + ...

But we can still approximate it to a reliable degree by evaluating the first five or so terms (after all, results such as 6:6, 7:7, and so on are so unlikely
that they hardly impact the overall probability for a tie).

So how likely is a tie for our game? Let's evaluate some terms using the
above approach and a Poisson calculator.

- p(0:0) = 0.0407
- p(1:1) = 0.0713
- p(2:2) = 0.0314
- p(3:3) = 0.0060
- p(4:4) = 0.0007

Seeing as how the probabilities quickly drop, we feel comfortable with stopping here and summing them to get a good approximation for the
probability of a tie occurring during our game of leader versus underdog:
p(tie) = ca. 15 %

Let's turn to an even tougher question: What is the chance of team 1 winning the game? Again we need to look at the probabilities of an infinite
number of possible outcomes: all the x:0 victories (x > 0), all the x:1
victories (x > 1), and so on. Here the cumulative probabilities really come
in handy. Don't even try solving it without them.

Let's look at all x:0 victories for team 1. In each of these games team 2
scores zero goals, so p(2) remains at 0.497, while p(1) changes with x.
Doing the product p(1) · p(2) for all x > 0 and then summing these terms
results in:
p(x:0, x > 0) = p(1 cumulative x > 0) · 0.497 = 0.456

So the chance for team 1 to achieve a x:0 victory against the underdog is
about 46 %. In a similar way we can write down the odds for all x:1
victories. Here team 2 always scores one goal, so p(2) = 0.3478 for all
possible results, while p(1) varies with x.

p(x:1, x > 1) = p(1 cumulative x > 1) · 0.348 = 0.248

The odds for a x:1 victory are thus about 25 %. We can just keep on
applying this logic until we get to infinity, or, for practical reasons, until the
probability drops to a ridiculously low value. Of course at the end we add
all the chances for these outcomes to arrive at the approximate probability
of team 1 winning the game. We get:
p(victory team 1) = ca. 77 %

Already knowing the chance of a tie, we can now easily calculate the
chance of an underdog victory:
p(victory team 2) = 8 %

or 1 in about 13 games. Not too bad, considering we took the extreme values for λ. And assuming the league offers three such leaders and three
such underdogs, there will be about 18 games of leader versus underdog per
season, which means we should expect one unlikely victory of this kind to
occur each year.
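
All the probabilities of this snack can be computed with a short Python sketch. It is my own, and it cuts the infinite sums off at 15 goals, beyond which the terms are negligibly small:

from math import exp, factorial

def poisson(k, lam):
    return exp(-lam) * lam**k / factorial(k)

lam1, lam2, K = 2.5, 0.7, 15

tie  = sum(poisson(k, lam1) * poisson(k, lam2) for k in range(K))
win1 = sum(poisson(i, lam1) * poisson(j, lam2)
           for j in range(K) for i in range(j+1, K))
win2 = 1 - tie - win1
print(tie, win1, win2)   # roughly 0.15, 0.77 and 0.08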

As you can see, the Poisson distribution provides fantastic possibilities for
calculating probabilities in soccer and sports in general. Whenever we have
teams competing by producing goals or points at a certain rate, we can
apply the formula in first approximation, always keeping in mind though,
that it only holds exactly true if the rates at which the teams score said goals
or points are independent of each other. This is certainly not true under
more careful observation, but for soccer, analysis showed that the Poisson
distribution produces surprisingly reliable numbers. It is not far-fetched that
this holds true for many other sports as well.

(●●●) Magic Math


This snack is made for the mathematically curious. If that doesn't apply to
you, feel free to skip it and go right to the next chapter. When we take a
certain route, usually there are λ cars per km on the street. This quantity is
called the car density and the higher it is, the slower the traffic moves. We
assume it to be Poisson distributed, so the chance that there are k cars per
km on a given day is:

p(k) = e^(-λ) · λ^k / k!

The time it takes to complete the route will depend on the density. If the
road is clear, the travel time will be a minimum of n minutes. For every
additional car per km, this increases by a factor m. If for example every car
increases the travel time by 5 %, the factor will be m = 1.05. So we have an
exponential relationship between the density and the travel time:

t(k) = n · m^k

We'll not specify any numerical values for now. Given the Poisson
distributed variation in car density and this relation for the travel time, what
is the expected travel time? Note that we have a discrete probability
distribution for the travel time:
- t(0) minutes with the probability p(0)
- t(1) minutes with the probability p(1)
- t(2) minutes with the probability p(2)
And so on. To get the expected value, we do the weighted sum as
introduced in the third chapter of the book.

t = t(0) · p(0) + t(1) · p(1) + … = Σ_k t(k) · p(k)

The sign Σ indicates a sum, so the notation on the right side means that we
sum the expression over all k. So we get:

t = n · e^(-λ) · Σ_k m^k · λ^k / k!

Note that I wrote the constants not including k in front of the summation
sign, which is just factoring them out. The fact that we can do that will be
vital in solving this problem. Both m and λ have the exponent k, so we can
write them in one bracket:

t = n · e^(-λ) · Σ_k (m·λ)^k / k!

Note that except for the missing factor, the terms behind the summation
sign are the terms of a Poisson distribution with the rate m·λ. This enables
us to do a mathematical magic trick to make the sum disappear. We
multiply these terms by 1 (which certainly we are allowed to do), but we'll
write the 1 as e^(m·λ) · e^(-m·λ). So we get:

t = n · e^(-λ) · Σ_k e^(m·λ) · e^(-m·λ) · (m·λ)^k / k!

Again we factor out a constant:

t = n · e^(-λ) · e^(m·λ) · Σ_k e^(-m·λ) · (m·λ)^k / k!

Now look at the terms following the summation sign again. We included the
missing factor, so it's a sum over all terms of a Poisson distribution. And we
know that since the Poisson distribution is a probability distribution, all the
terms add up to one. Thus:
t = n · e^(-λ) · e^(m·λ)

After a lot of algebra and thought, we arrived at a neat formula for the
expected travel time depending on the average number of cars on the street
(λ), the travel time with no other cars on the road (n) and the factor that tells
us how the travel time increases with car density (m).

If for example there are usually λ = 15 cars per km, the free flow travel time
is n = 30 minutes and each car increases the travel time by three percent, so
m = 1.03, the expected travel time, given that the car density is Poisson
distributed, is:

t = 30 · e^(-15) · e^(1.03·15) = ca. 47 minutes

So the expected travel time is about 50 % higher than the free flow travel
time. Another possible variation of this problem is to have the number of
tornadoes Poisson distributed and let the TV coverage of tornadoes (in
minutes) increase exponentially with that. The constant n would then be the
coverage with no tornado occurring (general information, warnings) and m
the increasing factor per additional tornado.
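
To convince ourselves that the magic trick worked, we can compare the closed formula with the direct weighted sum in Python (a sketch using the numbers from the example):

from math import exp, factorial

lam, n, m = 15, 30, 1.03

# Direct weighted sum over the Poisson distributed car density;
# 100 terms are plenty, the later ones are vanishingly small.
direct = sum(n * m**k * exp(-lam) * lam**k / factorial(k) for k in range(100))
closed = n * exp(-lam) * exp(m * lam)
print(direct, closed)   # both about 47 minutes
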
Chapter Five: The Geometric Approach
(●) The Basics

Usually we look at probability as being the ratio of the desired outcomes to all possible outcomes. If we are interested in drawing an ace from a deck of
cards, we have four desired outcomes among fifty-two possible outcomes.
So the probability of drawing an ace is just 4/52.

But we can also interpret probability geometrically as the ratio of the desired area to the total area. This opens up a surprising amount of
interesting applications, as we will see later. Unfortunately, this also means
that we need to resort to advanced mathematics in some of the problems,
since calculating the area enclosed by curves often requires calculus. But I
assure you that the snacks will be rewarding even without the knowledge of
calculus.

(●) Whoa, dude, what was that?

Every year about one car sized asteroid will hit the Earth and leave a
formidable crater. Surely this would be a spectacular, but also very
frightening event to witness. Let's do the math: how likely is it that one of
these asteroids will impact within 0.5 kilometers of your house?

The Earth has a radius of about 6400 kilometers. With the respective
formula for a sphere, we can use this value to calculate the surface area of
Earth:

S = 4 · π · (6400 km)² = 515 million km²

This is the total area available for the asteroids to hit. A region of radius 0.5
km around your house has the area:

A = π · (0.5 km)² = 0.79 km²

Assuming our car sized asteroid falls randomly onto the surface of Earth,
which certainly is a legitimate assumption to make, the chance of the
asteroid hitting within half a kilometer of your house (or your current
location for that matter) is:
p(hit) = A / S = 1 in 655 million

So it's amazingly small. But since on average one asteroid of this size will
hit per year, the above number only covers your chance over one year. What
about the chance of such a hit over the course of a life? We approach this
the same way as in the snack “Homicide She Wrote”. We calculate the
chance of not being hit 70 times in a row and from that deduce the chance
of a hit over a life span:
p(hit 70 years) = 1 in 9 million

You think that's a low probability? It is, the chance of being struck by
lightning is much higher, but tell that to the about 800 people of the 7000
million alive today who, according to the calculated odds, are expected to
actually witness such a close asteroid strike at some point in their lives.
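
Condensed into a Python sketch, the calculations of this snack look like this:

from math import pi

S = 4 * pi * 6400**2     # surface area of the Earth in km²
A = pi * 0.5**2          # area within half a kilometer of your house
p_year = A / S
p_life = 1 - (1 - p_year)**70

print(1 / p_year)        # about 1 in 655 million per year
print(1 / p_life)        # about 1 in 9 million over 70 years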

In this snack we focused on car sized asteroids because they have the
practical habit of appearing once a year on average. But once we look at
bigger or smaller asteroids, the impact frequency, and thus the above
calculated probabilities, will change. The picture below shows how the
impact frequency varies with the asteroid diameter.
Note that the scale of the y-axis is logarithmic, which will always produce a
much clearer picture if the quantity in question varies over a wide range of
magnitudes. A side note for the curious: the formula for the red line can be
used to estimate the average time T between two impact events (in years) from the diameter D (in meters) of the asteroid.

T = 0.018 · D^2.72

So an asteroid the size of a small house (D = 10 m) is expected to occur about every 9 years and an asteroid the size of an oil tanker (D = 50 m)
every 750 years. Luckily, the Earth is massive enough to survive such
bombardment with relative ease and it has been doing so for the past 4.5
billion years. For dinosaurs it's a different story though.

(●) Falling Needle


Picture a circle within a square in such a way that the boundary of the circle
touches the sides of the square. I let a needle fall on its tip anywhere within
the square. What is the probability that it will land within the circle?
Luckily, this can be solved without any higher mathematics.

Let x be the square's side length. The total area of the square is then A(total)
= x². But the length of the circle's diameter is also x. Using a formulary we
find that the area of the circle is π/4 times the square of the diameter:
A(circle) = π/4 · x². To find the probability that the randomly dropped
needle falls within the circle, we simply compute the ratio of the desired
area to the total area:
p(within circle) = A(circle) / A(total) = π/4

Rounded to four digits, this probability is 0.7854, which is roughly three out
of four. What about if we halve the diameter of the circle while keeping the
square at the same size? Try this calculation on your own. You should arrive
at p(within circle) = π/16.

(●) Deep Blue Sea

Somewhere in the Pacific Ocean there is a small island with a fantastic treasure on it. Unfortunately, you have no clue where exactly it is and the
island is too small to be located by satellite. So what to do? In your
desperation, you simply set sail from Sydney (Australia) to Seattle (US) in
the hopes of spotting it somewhere. Assuming your visibility to be 50 km
throughout the journey, what are your chances of finding the island?

To solve this problem, we need two additional pieces of information. The area of the Pacific Ocean is about 156 million km² and the distance between
Sydney and Seattle is 12500 km.

When you cross the ocean, the region visible to you will approximately
form a rectangle with the length 12500 km and the width 2 · 50 km = 100
km. Try to picture this rectangle extending from Sydney to Seattle in your
mind. So the part of the Pacific Ocean that you are able to observe on your
journey has the area: 12500 km · 100 km = 1.25 million km².

Doing the ratio of the observable area to the total area gives us the chance
of finding the island:

p(finding island) = 1.25 / 156 = ca. 1 %

Not bad actually, considering how incredibly vast the Pacific Ocean is. But
having a visibility of 50 km all the way through the ocean is rather
unrealistic. It's probably going to be considerably lower for some portions
of the trip. But for the sake of simplicity, let's stick to the 50 km visibility.

What distance do I need to cover within the Pacific Ocean to get a 50/50
shot at spotting the island? How long will this take if my ship makes 20
knots? We assume the route is chosen in such a way that the forming “band of visibility” does not overlap itself and that, except for small portions, we'll be traveling in straight lines.

With these assumptions, the sighted area is about d · 100 km, with d being
the distance covered. Now I want to choose d in such a way that the
resulting probability is 0.5. The best way to do this is to set up and solve
this equation:
0.5 = d · 100 km / 156000000 km²

d = 780000 km
This is a little more than 60 times the distance from Sydney to Seattle.
Since 20 knots correspond to 37 km per hour, the journey would take us
21100 hours or 880 days. With ideal visibility throughout and a constant
speed of 20 knots, we can see half the Pacific Ocean in about two and a half
years. More realistic values of visibility would bring us up to five years or
so. Hopefully, we'll come across the island at some point during this time.

(●●) Just You Wait

Going to an administrative office can be quite annoying. You have to wait in line for what seems like an eternity just to fill out a ridiculous amount of
forms. But sometimes you have to do it and then you are probably
wondering what your chances of getting in and out quickly are.

The time spent in the administrative office is determined by two random
variables: the waiting time x and the service time y. Assume that both x and
y can vary randomly between 0 and 30 minutes. What's your chance of
being in and out within 15 minutes?

This is a problem that's best solved with the help of geometry. In the image
you can see a plot of the situation. The x-axis represents the waiting time,
the y-axis the service time. Both can go up to 30 minutes. All combinations
of waiting and service time must fall somewhere into the square set by this
limit.

Now we would like to know what the probability of x+y being smaller than
15 is. In the image you can see the line x+y = 15. This line separates the
acceptable and unacceptable combinations. Pick any point in the shaded
region (for example x = 5 and y = 5) to check this logic. You'll see that all
such points satisfy the condition x + y < 15.

Computing the probability now just means doing the ratio A(shaded) /
A(total). The marked square has an area of A(total) = 30·30 = 900. For the
shaded triangle we remember that the area equals one-half times base times
height, so A(shaded) = 0.5·15·15 = 112.5. Our chance of getting in and out
within fifteen minutes is thus:
p( x + y < 15 ) = 0.125 = 12.5 %

So we shouldn't get our hopes up too much. Problems that can be solved in
such a way are often very similar. Usually we are given two random
variables that can vary within a certain range. Always make a plot to
visualize the problem and try to find the line that separates desired from
undesired outcomes. Once you managed to make this separation, you either
get the answer using simple geometric formulas or, if you're not lucky,
integral calculus.
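
As a quick check of the office example, here is a small Monte Carlo sketch
in Python (a simulation of my own, not part of the original solution):

import random

trials = 1_000_000
hits = sum(1 for _ in range(trials)
           if random.uniform(0, 30) + random.uniform(0, 30) < 15)
print(hits / trials)   # approaches 112.5 / 900 = 0.125
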
(●●) Think Fast

After several beers, George decided to measure his reaction times using a
computer program. Several tries later, he concluded that it takes him
between 0.75 s and 1.50 s to react to a signal. He also managed to talk his
(sober) wife Tina into testing her reaction times as well. As expected, she
did better, her reaction time ranged between 0.50 s and 1.25 s.

In direct competition, how likely is it that George will react faster than
Tina? To answer that, we do a plot. We will take the x-axis to be the
reaction time of George and the y-axis that of Tina. You can see that the
possible combinations of reaction times have been marked by a square.
Make sure to study the graph carefully.

The line of equal reaction times has been included as well. It separates the
desired from undesired outcomes. We are interested in the combinations
that lie in the shaded region. Here George's reaction time is smaller than
Tina's.

To confirm this, pick any point in the shaded region and go straight to the x-
axis, this will give you the reaction time of George. Then go from the same
point straight to the y-axis to get the corresponding reaction time for Tina.
You will see that for all points within the shaded area, this leads to a smaller
reaction time for George.
With that in mind, we can now easily compute the chance of George
reacting faster by doing the ratio of shaded area to total area. The total area
is the area of the square in which all possible combinations lie. Since all
sides have the length 0.75, we get: A(total) = 0.75² = 0.5625.

The shaded region is a right triangle with a base and height of length 0.5.
Remembering that the area for such a triangle is one half times base times
height, we get A(shaded) = 0.5·0.5·0.5 = 0.125. All that's left is doing the
ratio:

p(George faster) = 0.125 / 0.5625 = 0.22 = 22 %

So only in about 1 in 5 tries can we expect George to be faster than Tina,
despite their ranges overlapping for the most part. So until he sleeps off the
beers, he won't stand a chance in direct competition. But I'm sure they'll
find other things to do.
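
Those who want to double-check the result can simulate the duel of
reaction times directly. A minimal Python sketch of my own:

import random

trials = 1_000_000
wins = sum(1 for _ in range(trials)
           if random.uniform(0.75, 1.50) < random.uniform(0.50, 1.25))
print(wins / trials)   # approaches 0.125 / 0.5625 = 0.22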

(●●●) Two Delays are Better than One

Suppose you travel to work by train and at one point have to transfer to
another train. You are on train A, which is scheduled to arrive at the station
at exactly 9:00. Once arrived, you have to transfer to train B, which is
scheduled to leave the station at 9:10. So between the trains there's a 10
minute wait.

Of course both trains are subject to delays, which usually shouldn't be a
problem. Even if train A is delayed and train B is not, the 10 minute wait
gives you a buffer. However, if train A is delayed more than 10 minutes,
you need to hope that train B is also experiencing some delay. For you to
catch train B, the delay of train A cannot exceed the delay of train B plus
the ten minutes.

Let's generalize that. Let D(A) be the delay of train A, D(B) the delay of
train B and w the scheduled waiting time in-between. In order to reach train
B in time, the relation D(A) < D(B) + w must be satisfied. Now let the
delays vary randomly between zero and a maximum value m. What is the
probability of missing train B?

It is helpful to look at this problem graphically. In the below plot the x-axis
represents the delay of train A and the y-axis the delay of train B. Both are
only allowed to vary between 0 and m, so all the possible combinations of
the delays must lie within the marked square.
The relation D(A) = D(B) + w marks the border between acceptable and
unacceptable delays. If D(A) is larger than D(B) + w, we miss train B,
otherwise we catch it. This relation is drawn in the graph as well, it is the
straight line that separates the unshaded from the shaded region.

So which region stands for acceptable and which for unacceptable
combinations of delays? If we start on the straight line D(A) = D(B) + w
and go to the right, we end up in the shaded region. But going to the right
also means increasing D(A), so it is the shaded region that encompasses all
combinations of delays that lead us to missing train B. With all this
admittedly tough work, we can solve the problem relatively easily. The
probability of missing train B is just the ratio of the shaded region to the
total area.

The total area is easy to calculate, it is the area of the square, so A(total) =
m². What about A(shaded)? You might remember that we can calculate the
area of a right triangle by multiplying 0.5 with the base and height. Let's get
these dimensions for the shaded area.

The base extends from w to m on the x-axis, so the length is just m-w. We
can see from the mathematical relation that the line has the slope one, so
when going 1 unit to the right, the line will go 1 unit up. Similarly, going m-
w units to the right results in the line going m-w units up. From this we can
conclude that the height must have the length m-w as well, leading to
A(shaded) = 0.5 · (m-w)².

All that's left is to do the ratio A(shaded) / A(total). The probability for
missing train B, given that the delays vary randomly between zero and a
maximum value m and there's a scheduled waiting time w in-between, is
(after a little algebraic manipulation):
p(missing B) = 0.5 · (1 – w/m)²

In the introductory example w = 10 minutes. If we allow the delays to vary
within 0 to 15 minutes, the probability of getting stuck at the station is:
p(missing B) = 0.5 · (1 – 10/15)² = 5.6 %

Talk about a tough nut to crack. Don't worry if you don't get all of it on first
try, it is a very demanding problem. But it beautifully shows what the
geometric approach is capable of.
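
The formula is also easy to verify by simulation. Here is a short Python
sketch (my own construction) that draws both delays uniformly from 0 to
m and counts the missed connections:

import random

def p_missing(w, m, trials=1_000_000):
    # We miss train B whenever D(A) > D(B) + w.
    misses = sum(1 for _ in range(trials)
                 if random.uniform(0, m) > random.uniform(0, m) + w)
    return misses / trials

print(p_missing(10, 15))          # simulation, about 0.056
print(0.5 * (1 - 10 / 15) ** 2)   # formula: 0.0555...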

(●●●) Severe Floods

Let's look at another challenging problem we can solve using the geometric
approach. This time we'll keep it less general. In a flood two variables are
important in determining the severity: the area of the flooded region F and
the average density of houses D in the affected region. Ideally the flooded
area is strongly limited (small F) and the flood occurs in an uninhabited
region (small D). Worst case is a large flood hitting a densely populated
region.

For our problem, we will let the flood area F vary between 0 and 500 square
kilometers and the housing density D vary between 0 and 15 houses per
square kilometer, both randomly and independently of each other.

The product of these two quantities is the number of affected houses H, so
H = FD. We agree to classify the flood as severe if H exceeds 2500. Given
the ranges for the variables F and D, how likely is it that a flood is severe?

Let's make an F-D-coordinate system and include the limits set for each
quantity. All possible combinations of F and D lie in the rectangle spawned
by these limits.
This time the separating line between severe and not severe is a hyperbola:
FD = 2500. If the product FD is higher than 2500, we speak of a severe
flood, otherwise not. The separating line, as well as all combinations
considered severe (shaded), are included in the graph.

Unfortunately, this problem can't be solved without resorting to advanced
mathematics. We need to make use of integral calculus to calculate the area
of the shaded region. This will give us the rounded result A(shaded) = 2253.
The area of the rectangle is A(total) = 500 · 15 = 7500, so the probability of
a severe flood is:
p(severe flood) = A(shaded) / A(total) = 30 %

When we choose a number randomly in the interval 0 to 500, the chance of
it being between 0 and 100 is just as high as the chance of it falling within
400 to 500. So without explicitly stating it, we assumed that large floods
occur with the same probability as small floods. Is this a realistic
assumption? It certainly makes for a significantly simpler calculation, since
having a non-uniform random variable would require us to solve multiple
integrals.

Data on flood sizes show that our assumption is too crude. In reality, there
is a power law at work, quite similar to the Gutenberg-Richter Law for
earthquakes: the bigger the event, the less frequent its occurrence.
Researchers found this relationship between the flood discharge q
(measured in cubic meters per second) and the frequency f (measured in
events per year):
f = a / q^b

with the exponent b usually being around 2. So a flood twice the size of a
certain reference value will occur only one-fourth as often. To make the
problem more realistic, we would need to take this dependence into account
and solve the arising multiple integrals mentioned above. This
unfortunately is beyond the scope of this book.
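
For the uniform case above, though, the integral itself is simple enough to
state: A(shaded) = ∫ (15 – 2500/F) dF, taken from F = 2500/15 to F = 500.
A few lines of Python (a sketch of my own) reproduce the numbers:

from math import log

F_max, D_max, H_crit = 500, 15, 2500
F_low = H_crit / D_max   # below this F, no D in range gives H > 2500
shaded = D_max * (F_max - F_low) - H_crit * log(F_max / F_low)
print(shaded)                    # about 2253
print(shaded / (F_max * D_max))  # about 0.30
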
Chapter Six: Bayes and Bias

(●) The Basics

Bayes' theorem is a very intriguing statistical tool. It approaches statistical
problems from a new angle. Instead of simply calculating probabilities, we
calculate probabilities given that a certain event has already occurred.

For example when given two random events A and B, rather than
calculating p(A) and p(B) directly, we calculate the probability of A
occurring, given that B has already occurred. This is called a conditional
probability, symbolized by p(A|B) and read in short as "probability of A
given B". As long as you read the symbol | as “given”, you will always be
able to figure out what is meant by the expression.

Let's take a look at the notion of conditional probability before going to
Bayes' theorem. Imagine four islands; we'll symbolize them by A, B, C and
D. Of these, island D hides a treasure which we are keen on finding. Our
chance of choosing the right island on first try is:

p(D) = 1/4
Assume we chose to search island A and thus failed to locate the treasure.
What's the chance of choosing the correct island given that we already ruled
out island A? Since there are three islands left to search we get:

p(D|A) = 1/3

As you can see, the occurrence of the event “choosing island A” changed
the odds of the event “choosing island D” to our favor. Of course the
probabilities p(D|B) and p(D|C) have the same numerical value.

With this in mind, we can take a look at a typical Bayes problem with a
deliciously surprising result. After that we'll turn to the general formula for
solving such problems. If you've never dealt with Bayes before, it'll probably
take some getting used to, so choose a comfortable pace for reading.

A hospital receives a new device to test people for a certain disease. We
know that about 1 % of the population have this disease. The manufacturer
of the device has provided us with some very important data. If a person has
the disease, the device will recognize it with a 98 % chance. If a person
does not have the disease, the device will falsely recognize the disease with
2 % chance. So it is quite accurate. What is the probability of someone
having the disease, given that the device recognized it?

Assume that 10000 people took the test. At the beginning of the previous
paragraph, we noted that about 1 % of all people have this disease. So
the most likely scenario is that in our sample 0.01 · 10000 = 100 have the
disease and 0.99 · 10000 = 9900 don't.

Let's put the focus on the 100 people who have the disease. The test will
recognize the disease in 0.98 · 100 = 98 of them. What about the 9900 who
don't have it? With the given rate of false recognitions, the test will identify
the disease in 0.02 · 9900 = 198 of them.

Now we can draw our conclusions. In total 296 were identified as having
the disease, but only 98 actually have it. So the chance of someone having
the disease, given that the device recognized it, is just 98 / 296 = 33.1 %.
We might as well just flip a coin!

The test seemed quite accurate, what happened? One problem is that the
disease is so rare. If you test the general public, the number of people not
having the disease will greatly outnumber the small group that has it. So
even a very low rate of false positives will strongly impact the overall
results.

Many Bayes problems are quite similar to our introductory example. There
is a certain probability p(A) of one event A occurring. This event and its
complement B will take center stage in the problem. Both A and B can
lead to a certain consequence C with the respective probabilities p(C|A) and
p(C|B). These probabilities don't need to add up to one as they did in the
introductory example. Make sure to study the below image to understand
this set-up before reading on.

There are two ways for C to occur, via A or via B. So it's a legitimate
question to ask for the probability of A, given that C occurred. In
mathematical terms, we are looking for p(A|C).

To solve this problem we make use of the Bayes' theorem:

p(A|C) = p(C|A)·p(A) / (p(C|A)·p(A) + p(C|B)·p(B))

where the probability p(B) is equal to 1 – p(A), since B is the
complementary event to A. It is certainly an impressive formula, but after a
few calculations I'm sure you'll feel comfortable with it. For our first
application, let's check it for the above example.

Let A be the event "disease", B the event "no disease" and C be the event
"disease recognized". The disease occurred with a probability p(A) = 0.01
in the population. The probability that the disease is correctly recognized
was p(C|A) = 0.98, the chance for false recognitions was p(C|B) = 0.02.
These are all the inputs we need:

p(A|C) = 0.98·0.01 / ( 0.98·0.01 + 0.02·0.99) = 33.1 %

Sure enough, we arrive at the same result. The snacks will provide more
opportunities to apply this grand formula.
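
Since we will use this formula again and again, it can be handy to wrap it
in a small helper function. A Python sketch (the function name is mine):

def bayes(p_a, p_c_given_a, p_c_given_b):
    # p(A|C), with B the complement of A, so p(B) = 1 - p(A).
    p_b = 1 - p_a
    return p_c_given_a * p_a / (p_c_given_a * p_a + p_c_given_b * p_b)

print(bayes(0.01, 0.98, 0.02))   # disease example: about 0.331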

(●) Where's the file?

Let's devote one quick snack to the new notation introduced in this chapter.
Picture a hard drive with three folders: A, B and C. Each folder has 10
subfolders. We agree that A1 stands for the first subfolder in folder A, A2
for the second subfolder in folder A, and so on. Now we want to locate a
file which, unbeknown to us, sits in folder B6. Since there are 3·10 = 30
subfolders in total, the chance of randomly choosing the desired subfolder
B6 is:

p(B6) = 1/30
What about the chance of choosing the correct subfolder, given that we
already expanded folder B? Since folder B contains 10 subfolders, one of
which holds the file we are seeking, we get:

p(B6|B) = 1/10

What if we expand folder A instead? Folder A also contains 10 subfolders,
but since none of them hold our file, we must conclude that the chance of
choosing the desired subfolder, given that we expanded folder A, is zero:

p(B6|A) = 0

Of course this only holds true if we are not allowed to revise our decision to
expand a certain folder. I hope this short bit helps you to get used to the
notion of conditional probability.

(●) Innocent Drivers

Assume that 10 % of all drivers are notorious speeders. Now the police sets
up a radar trap. Speeders will fall into this trap with a 70 % chance, while
this is true for only 10 % of law abiding drivers (they might have missed the
limit sign or temporarily paid too little attention). What is the chance that a
driver is a notorious speeder, given that the radar caught him?

First we need to identify the events correctly. The word “given” in the
statement of the problem serves as a pointer to event C, the consequence.
So we take C to be the event "fall into trap". With the first sentence telling
us that 10 % are speeders, we should set A to be the event "speeder" and B
the event "law abiding driver". That means that p(A) = 0.1, p(C|A) = 0.7
and p(C|B) = 0.1.
All that's left is to apply Bayes' theorem to get p(A|C), the probability of the
event "speeder" given the event "fall into trap". Let's plug in the numerical
values:

p(A|C) = 0.7·0.1 / ( 0.7·0.1 + 0.1·0.9) = 44 %

So only about half the drivers that got caught by the trap are actually
notorious speeders. The others were law abiding citizens who just happened
to have missed a sign or were not paying enough attention.

What if we changed the percentage of notorious speeders from 10 % to a
mere 5 % and left the rest unchanged? This means that in relation to the
number of speeders, even more law abiding citizens would be caught by the
trap. So the chance that a driver caught in the trap is a speeder should be
lower. Let's check our logic:

p(A|C) = 0.7·0.05 / ( 0.7·0.05 + 0.1·0.95) = 27 %

Indeed now only about 1 in 4 drivers receiving a ticket are actually
notorious speeders. This is the same effect we observed when we were
testing for the disease, the law abiding drivers greatly outnumber the
speeders and so even a low rate of "false recognitions" can significantly
change the outcome.

(●●) Party like it's 1999

Your colleague at work is a true party animal. He devotes on average three
nights of the week (no matter if weekend or work day) to partying. What's
the chance that he was out partying the previous night if he is tired at
work? We assume that there's a 90 % chance of him being tired after a night
of partying and a 10 % chance of him being tired after a regular night
(maybe he just had a hard time sleeping, we don't know that).

Let's identify the events. C is obviously the event "tired at work" and we
should set A to be "party night" and B to be "regular night". In this case,
p(A) = 3/7 = 0.43, p(C|A) = 0.9 and p(C|B) = 0.1. Make a diagram similar
to the ones above to feel comfortable with the problem. Now we can find
out how likely it is that he had a long night of partying, given that he's tired
at work:

p(A|C) = 0.9·0.43 / (0.9·0.43 + 0.1·0.57) = 87 %

So our remark that he should take it easy with the partying when we see
him sleeping at the desk is not unfounded. In about 9 of 10 cases we drew
the correct conclusion.

But beware, what if he's not as much of a party animal as we suspected
and often has a hard time sleeping? We'll let him party only one night a
week, keep the 90 % chance of being tired after a night of partying and
increase his chance of being tired after a regular night to 30 %. With these
inputs, the chance that he had a long night of partying when being tired at
work is:

p(A|C) = 0.9·0.14 / (0.9·0.14 + 0.3·0.86) = 33 %

That goes to show that we should be careful to judge. Bayes really is about
finding bias and avoiding it. The next snack will hopefully serve as a
powerful reminder to that.

(●●) A Touchy Subject


Assume we have a population of 90 % white and 10 % black. In the dark of
night a car has been stolen and only one witness saw it happen. To the
police he describes the suspect as being black. To test the reliability of his
statement, the police have actors reenact the crime. During that, the witness
was able to identify the race of the actor correctly in 90 % of all
reenactments. Now onto the question: given that the witness identified the
suspect as black, what is the chance of said crime being committed by a
black person?

The location of the word “given” tells us that we should take event C to be
“suspect identified as black”. Our two center stage events will then be A =
black person committed crime and B = white person committed crime. As
for the numbers, we are given the percentage of black people in the
population. Assuming no inclination towards this crime in any form, we
have p(A) = 0.1. Since the witness is correct 90 % of the time, we get
p(C|A) = 0.9 and p(C|B) = 0.1.

Now we can apply Bayes' theorem to determine how likely it is that the
criminal really is black given that the witness identified him as being so. We
get:

p(A|C) = 0.9·0.1 / (0.9·0.1 + 0.1·0.9) = 50 %

So the statement is, at least with respect to race, no more helpful than the
flip of a coin. How can this be considering the high reliability of the
witness? What sorcery is this? Let's break it down using some numbers.

We'll let 100 actors reenact the crime. To be representative of the
population, 90 will be white and 10 will be black. The witness is asked to
note whenever he deems the actor to be a black person. With the given
reliability, he will identify 9 of the 10 black actors (90 %) as being black.
On the other hand, he will identify 9 of the 90 white actors (10 %) as being
black. So of the 18 people he judged to be black, only 9 actually were.
Again this is a classic case of how “false positives” can create an
astonishing result.

We'll stay with the population of 90 % white and 10 % black, but this time
the witness makes the statement that the criminal was white. Again we
measure his reliability to be 90 %. When you make the diagram, reassign
the events and apply the formula you get the chance of the criminal being
white, given that the witness said so:

p(A|C) = 0.9·0.9 / (0.9·0.9 + 0.1·0.1) = 99 %

Amazing, isn't it? Let's do a breakdown to understand why this is so likely.

Again 100 actors reenact the crime, 90 of them white and 10 of them black.
The witness is asked to make a note whenever he judges an actor to be
white. With the given reliability, he will identify 1 of the 10 black actors (10
%) as being white and 81 of the 90 white actors (90 %) as being white. Of
the 82 people he deemed white, 81 actually were, this is 99 %. Here we
have the phenomenon of the “false positives” turned upside down. When he
identifies the subject to be a member of the majority group, he is
significantly more likely to be correct as the few false recognitions hardly
have any impact on the result.

(●●) Updates

One thing that makes the Bayesian approach so powerful is that we can
update probabilities as the events come in. Consider the following example.
We are watching a Poker game and know that about 20 % of the players are
skilled and the remaining 80 % unskilled. The chance of a skilled player
winning is 60 %, whereas an unskilled player wins only 20 % of his games.
See the image below for a visualization of this situation.

What is the chance of a player being skilled, given that he won a round? We
can apply Bayes' theorem directly to get the answer:

p(skilled|win) = 0.2·0.6 / (0.2·0.6 + 0.8·0.2) = 43 %

This shows that we cannot conclude from one win alone that the player is
skilled. But what if he wins two rounds in a row? That should increase the
chances of the victorious player being skilled. From the given information,
we can deduce that the chances of a skilled player winning twice in a row is
0.6² = 0.36, whereas an unskilled player only has the odds 0.2² = 0.04. This
is just applying the multiplication rule from chapter one. We visualize this
updated situation.

We use Bayes' theorem yet another time to determine the chance of a player
being skilled, given that he accomplished two wins in a row:

p(skilled|2 wins) = 0.2·0.36 / (0.2·0.36 + 0.8·0.04) = 69 %


Already much more likely, but still not a safe bet. So we continue the train
of thought. Assume a player won three rounds in a row. The probability of a
skilled player doing that is 0.6³ = 0.216, an unskilled player has the odds
0.2³ = 0.008. Again we visualize this situation before applying Bayes'
theorem.

p(skilled|3 wins) = 87 %

Here we can feel relatively safe to conclude that the player indeed is skilled.
Out of 100 players who manage to get three in a row, we expect only 13 to
be lucky, unskilled players. We could just go on like this. Do the
appropriate calculations to confirm that:

p(skilled|4 wins) = 95 %

p(skilled|5 wins) = 98 %

So a player with five wins in a row can be considered skilled beyond
reasonable doubt or at least close to that. Of course if the ratio of skilled to
unskilled players changes or the winning probabilities are altered, the
numbers can turn out to be quite different. In general, the smaller the
percentage of skilled players, the more wins in a row we need to see to
conclude that this player is skilled.

Feel free to set up a similar scenario for a game you are interested in and
compute how many wins it would take to deduce beyond reasonable doubt
that the victorious player is skilled.
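
If you want to experiment, the update rule is only a few lines of Python.
This sketch (names and defaults are mine, taken from the Poker example)
prints the chain of updated probabilities:

def p_skilled(wins, prior=0.2, p_win_skilled=0.6, p_win_unskilled=0.2):
    num = prior * p_win_skilled ** wins
    den = num + (1 - prior) * p_win_unskilled ** wins
    return num / den

for n in range(1, 6):
    print(n, round(p_skilled(n), 2))   # 0.43, 0.69, 0.87, 0.95, 0.98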

(●●) Innocent Drivers Revisited

In the snack “Innocent Drivers” we saw that with the given estimates only
44 % of those falling into the radar trap are actually notorious speeders. So
saying “I wasn't going too fast on purpose” provides a believable excuse
when getting caught. But what about if a driver gets caught three times in a
row? Is the excuse still believable then? Or can we say beyond reasonable
doubt that this driver is a speeder?

Remember that we assumed 10 % of drivers to be notorious speeders and
the remaining 90 % law abiding. A speeder gets caught with a 70 % chance
by the radar trap, a law abiding driver only with a 10 % chance. Let's look
at how the latter percentages change when we have the radar trap activate
three times.

For a speeder, the chance of getting caught thrice is 0.7³ = 0.343, for a law
abiding driver we get 0.1³ = 0.001. So the odds of a driver being a speeder,
given that he or she got caught three times is:

p(A|C) = 0.1·0.343 / (0.1·0.343 + 0.9·0.001) = 97 %

Our updated probability tells us that we shouldn't be inclined to believe the
excuse. However, it's in the eye of the beholder if that already adds up to
“beyond reasonable doubt”. After all, with these odds we can expect 3 of
100 people caught thrice to be very unlucky law abiding drivers who simply
should have paid more attention.

(●●●) Swerving

We can extend Bayes' rule to cover more than two initial events. Consider
the graph below, here we have three events A, B and C, all being able to
result in event D with the respective probabilities p(D|A), p(D|B) and
p(D|C).

For this set-up, Bayes' theorem takes a slightly different form. The chance
for event A, given that D occurred, is:

p(A|D) = p(D|A)·p(A) /
(p(D|A)·p(A) + p(D|B)·p(B) + p(D|C)·p(C))

For reasons of limited space the denominator is written in the second line. If
you compare that to the formula we worked with so far, you'll see that,
except for the event names being changed, the only difference is that we
have an additional term in the denominator.

Let's make an example, again involving law enforcement. The police set up
a check point and pull out all drivers who have been spotted swerving. Just
before putting up the check point, one of the policemen looks up relevant
driver statistics.

They show that of all the drivers, 2 % are intoxicated above limit, 8 %
intoxicated below limit and 90 % sober. Studies show that drivers
intoxicated above limit swerve with a 40 % probability, drivers intoxicated
below limit with a 15 % probability and all others with a 2 % probability.

To solve the problem we set event A = intoxicated above limit, B =


intoxicated below limit and C = sober. Study the below diagram carefully to
absorb all the information given in the problem.

Given the police pulls out a swerving driver, what are the chances that he is
intoxicated above limit? With all the events already neatly ordered, we can
easily find the inputs for Bayes' formula. The probabilities of the individual
events are: p(A) = 0.02, p(B) = 0.08 and p(C) = 0.9. The probabilities of the
events leading to the consequence are: p(D|A) = 0.4, p(D|B) = 0.15 and
p(D|C) = 0.02. Plugging these into the formula results in:

p(A|D) = 0.4·0.02 /
(0.4·0.02 + 0.15·0.08 + 0.02·0.9)

Using a common calculator, we get that the chance of a driver being
intoxicated above limit, given that he was seen swerving, is 0.21 = 21 %. So
only 1 in 5 drivers the police pull out will have to fear for his/her driver's
license, despite their chance of swerving being 20 times as high as for the
sober driver.
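
The extended formula generalizes to any number of initial events, which
makes it convenient to code once and reuse. A Python sketch of my own:

def bayes_multi(priors, likelihoods, i=0):
    # p(event i | consequence) for any number of initial events.
    total = sum(p * l for p, l in zip(priors, likelihoods))
    return priors[i] * likelihoods[i] / total

# Swerving example: above limit, below limit, sober.
print(bayes_multi([0.02, 0.08, 0.9], [0.4, 0.15, 0.02]))   # about 0.21
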
Chapter Seven: Random Random Problems

(●) The Basics

In this chapter you'll find an assortment of interesting statistical problems
and ideas, which did not quite fit in any of the other chapters. So onto the
last bowl of snacks.

(●) In Chains

Another interesting take on statistical problems is provided by Markov
chains. In this quick look at the topic, we'll restrict ourselves to Markov
chains with two states and solve them using a general formula. With these
restrictions, the formulation of the problems is always going to be quite
similar.

We are given two mutually exclusive states, A and B (see sketch below).
During each step one of the two states is taken. If state A is taken, there's a
probability p that state B will follow. This of course also means that the
chance for remaining in state A is 1-p. On the other hand, if state B is taken,
the chance of state A following is q and the chance of remaining in state B
is thus 1-q.

This was the set-up of the problem. The question that is to be answered is:
What is the overall probability of having state A or state B occur? Or in
other words: Of the total time, what percentage will be spent in state A or
B? Symbolizing these percentages by p(A) and p(B), the general formula to
solve this problem is:

p(A) = q / (p+q)

p(B) = p / (p+q)

If that was a little too abstract up to this point, don't worry, the example will
make it clear. Consider this: I know that if I exercise one day, the chance of
me not exercising the next day is p = 0.3 = 30 %. On the other hand, if I
don't exercise one day, the odds of me going back to training the next day
are q = 0.2 = 20 %. Try to visualize this situation similar to the picture
given above. The event A is “exercise” and event B is “not exercise”.

Now what percentage of days will I be exercising with this behavior?
According to the formulas we get:

p(exercise) = 0.2 / 0.5 = 0.4 = 40 %

p(not exercise) = 0.3 / 0.5 = 0.6 = 60 %

So that's about three days a week I will exercise, not so bad actually. Note
that since the probabilities of switching states are relatively low, the result
will be some days of training followed by some days of laziness rather than
jumping from exercise to not exercise with each new day. This is something
the formula does not capture.

Another Markov quickie at the end: when I write, I finish about five pages
per day. A day of writing is followed with a 30 % chance by a day of not
writing. On the other hand, if I don't get anything done one day, the chance
of getting back to writing the next day is 40 %. How long will it take to
finish a book of 160 pages this way?

Using the general formulas, we can say that with this behavior I'll be
writing on 4 out of 7 days and not writing on the remaining 3 out of 7. So
I'm getting done 4·5 = 20 pages per week, which means that finishing the
book will take me 160 / 20 = 8 weeks.

You might be wondering how to get from the Markov chain to the general
formula. Without resorting to matrix algebra, this question is a little
complicated to answer, but it can be done. So we know that in the long run
I'll be in state A with a probability p(A). We can arrive there in two ways:
from A with a probability of p(A)·(1-p) or from B with a probability
p(B)·q. Now the probability of arriving in A must be p(A), so:

p(A) = p(A) · (1-p) + p(B) · q

With some algebraic manipulation, this leads to:

(1) p(A) · p = p(B) · q

We also know that I'm either in state A or state B; there are no other
possibilities. So the respective probabilities p(A) and p(B) must add up to
one:

(2) p(A) + p(B) = 1

The rest is algebra. Solve equation (1) for p(B) and insert the resulting
expression for p(B) in equation (2). Then solve for p(A) and you'll end up
with the general formula.

Of course Markov chains don't end here. This was just a quick peek into a
very rich field. The number of states can be anything you like, given that
you have the skills and computational power to back it up. A sound
knowledge of matrix algebra is a must for tackling Markov chains with
more than two states as general formulas get exponentially longer when you
include more states.
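
The long-run percentages can also be checked by simply letting the chain
run. Here is a small Python simulation (my own sketch) of the exercise
example with p = 0.3 and q = 0.2:

import random

def fraction_in_a(p, q, steps=1_000_000):
    # p: chance of leaving state A; q: chance of leaving state B.
    in_a, count_a = True, 0
    for _ in range(steps):
        count_a += in_a
        if in_a:
            in_a = random.random() >= p   # stay in A with chance 1-p
        else:
            in_a = random.random() < q    # jump to A with chance q
    return count_a / steps

print(fraction_in_a(0.3, 0.2))   # approaches q/(p+q) = 0.4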

(●) Don't be Mean


There is an important statistical quantity that we haven't talked about yet.
The standard deviation tells us how strongly observed values deviate from
their averaged value. It also allows us to say how likely it is that further
measurements will fall into a certain interval around the mean. Let's go
through an example of calculating the standard deviation and drawing
conclusions from that.

When you buy a food product, you can find the weight of the contents on
the box. One manufacturer for example claims that his box of cookies
weighs 100 grams. To test the reliability of this claim, we take a sample of
10 boxes and weigh them. Here are the results: 102 gr, 105 gr, 98 gr, 101 gr,
94 gr, 98 gr, 103 gr, 105 gr, 101 gr and 97 gr.

Our first job will be to calculate the arithmetic mean, usually symbolized by
μ. To do that, we simply add all the values and divide by the number of
measurements:
μ = 1004 / 10 = 100.4 grams

So the mean indeed was very close to the 100 grams claimed by the
manufacturer. But additionally to that, we are interested in knowing how
strongly the weights are expected to spread around this mean. How likely is
getting a 90 gram box for example? Is that something we can expect to
happen often?

To get a sense of the spread we turn to the standard deviation. Let's
symbolize the measured values with m. For each of these values we
compute this quantity:

x = (m – μ)²

It is simply the square of the difference between the observed and mean
value. Once we did that for all the measurements, we can get the standard
deviation from this formula:
s = square root ( (x1 + x2 + …) / (n – 1) )
with n being the total number of measured values (in our case 10). Let's
calculate the standard deviation now for the weights we observed. For the
first and second measurement we get:
x1 = (102 – 100.4)² = 2.6

x2 = (105 – 100.4)² = 21.2

We proceed in a similar fashion to get the corresponding results for the
other 8 measurements. Once this is done, we sum them all up. The sum
turns out to be 116.6. Plugging that value into the above formula for the
standard deviation results in:
s = square root ( 116.6 / 9 ) = 3.6 grams

This is the standard deviation of our sample. If the quantity of interest,
which for us is the weight, is normally distributed (an assumption we can
make if we don't have any further information), one can use this table to
draw conclusions:
- 68.3 % chance of being within μ ± s
- 95.5 % chance of being within μ ± 2· s
- 99.7 % chance of being within μ ± 3· s

In the example we have μ + s = 104.0 grams and μ - s = 96.8 grams. If our
sample is representative, we can expect 68.3 % of all further measurements
to be within this interval. This is what's called the “1 sigma interval”. It
covers one standard deviation around the mean.

For the “2 sigma interval” we take two standard deviations around the
mean. Using the calculated values we get: μ + 2· s = 107.6 grams and μ - 2·
s = 93.2 grams. Our estimate thus is that 95.5 % of all boxes fall within this
range. As you can see, the standard deviation indeed does provide a lot of
information about how the weights spread.

As mentioned, these conclusions are only reliable if our sample is
representative, which means that it should be large and have no systematic
errors. In our example we shouldn't be too confident, as 10 measurements
do not make for a large sample. And if we used a cheap scale to weigh the
boxes or
handled the scale incorrectly (for example not properly setting it to zero
before starting the experiment), a systematic error is a possibility. So make
sure to consider that when looking at the mean and standard deviation of a
sample.

I also mentioned that the above table only holds exactly true when the
quantity in question is normally distributed. If the quantity is distributed in
another way, we can only use the table as a first approximation. As a rule of
thumb one can say that when the observed values are strongly skewed
(significantly more than half of the observed values left or right of the
mean), one should not assume a normal distribution.

The standard deviation should not be confused with the standard error SE,
which rather measures how reliable our result for the mean is. Remember
that we got μ = 100.4 grams as the mean for our sample. Since our sample
was small, there's a good chance that the true mean μ(true) will somewhat
differ from that. Can we say to what extent?

To do that, we calculate the standard error by dividing the standard
deviation by the square root of the number of measurements:

SE = s / square root (n)

Plugging in the corresponding values for our example results in SE = 1.14.
Now what does that tell us? We can conclude (and this is a general rule)
that the true mean lies within two standard errors of the sample mean with
a chance of 95 %. For us this means that there's 95 % certainty that the true
mean of the box weights lies within this interval:
μ(true) = 100.4 ± 2.28 gr

Despite only having a sample size of 10 boxes, we were able to narrow the
mean down to a relatively small interval. Note how the standard error varies
with the sample size n. Assuming the standard deviation stays relatively
constant when expanding the sample, the standard error will halve when the
sample size increases fourfold. With 40 boxes we could bring the interval
down to about ± 1.14 grams.

So remember that while the standard deviation gives information about the
spread around the mean, the standard error helps us to deduce how close
our sample mean is to the true mean. Both are helpful pieces of
information to have for any sample.
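
All of these quantities are one-liners on a computer. A Python sketch
using only the standard library (the variable names are mine):

from math import sqrt
from statistics import mean, stdev

weights = [102, 105, 98, 101, 94, 98, 103, 105, 101, 97]
mu = mean(weights)               # 100.4
s = stdev(weights)               # sample standard deviation, about 3.6
se = s / sqrt(len(weights))      # standard error, about 1.14
print(mu - 2 * se, mu + 2 * se)  # 95 % interval for the true mean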

(●●) Tai Chi for Mathematicians

In this snack we'll take a look at how to test a hypothesis statistically. This
is usually done with the Chi-Square test. For it to work, we need to be given
an expected distribution and an observed distribution. From these two we
can calculate the so-called Chi-Square χ². It is a measure of how the
expected distribution differs from the observed one.

Imagine we are observing a four-lane road. For 1000 passing cars we'll note
which lane (lane 1, lane 2, lane 3 and lane 4) each car was driving on. Our
assumption is that there is no preferred lane, so we expect the number of
cars being uniformly distributed over the four lanes:
- lane 1: 250 cars
- lane 2: 250 cars
- lane 3: 250 cars
- lane 4: 250 cars

These were the observed values:

- lane 1: 238 cars
- lane 2: 245 cars
- lane 3: 269 cars
- lane 4: 248 cars

Now the question is: do these measurements confirm our assumption of a
uniform distribution? It's clear that even if there's no preferred lane, there
are always going to be some fluctuations. But just how much of a deviation
can we accept before we need to reconsider our assumption? The Chi-
Square test is capable of answering these questions.

During the Chi-Square test, we check if the null hypothesis holds true. We
can formulate the null hypothesis as such: The deviations between expected
and observed values are not statistically significant.

The first step is to calculate the Chi-Square χ². We symbolize the expected
frequencies by e1, e2, … (in our case all have the value 250) and the
observed frequencies by h1, h2, … For each class, which in our case are the
lanes, we calculate this quantity:
x = (h – e)² / e

The sum of these is the Chi-Square:

χ² = x1 + x2 + …

Let's do this for the above example. For the first lane we had the expected
value e1 = 250 and the observed value h1 = 238. This is all we need to
compute the quantity x1:
x1 = (238 – 250)² / 250 = 0.58

In the same way we calculate the other inputs needed for the Chi-Square.
So we get:

x2 = (245 – 250)² / 250 = 0.10

x3 = (269 – 250)² / 250 = 1.44

x4 = (248 – 250)² / 250 = 0.02

Summing them up leads to the Chi-Square:

χ² = 2.14

Now we know how to calculate the Chi-Square, but what to do with it? To
continue, we first need to know the degrees of freedom for our set-up.
When the expected values are uniformly distributed, as it is in our example,
the degrees of freedom f is just the number of classes n minus one. So we
get: f = 4 – 1 = 3.

Next we need a Chi-Square table to look up the critical value. Remember
that the Chi-Square is a measure for the deviation between theory and
reality, so it makes sense to have a critical value it is not allowed to exceed
in order for us to accept the hypothesis. Once the calculated Chi-Square
exceeds the critical value, we are forced to reject our hypothesis.

How to find the relevant critical value in the table? First we go to the row
determined by the degrees of freedom, for us, that's row 3. Then we go to
the column determined by the probability 0.05 (for starters, always use this
column). The intersection of this row and column is the critical value we
were looking for: c² = 7.82.

If the Chi-Square is below this critical value, the deviations are not
statistically significant and we accept the hypothesis. In the example we got
χ² = 2.14 and c² = 7.82, so indeed the Chi-Square is smaller than the critical
value. We can accept our hypothesis and conclude that the observation
supports our assumption of the cars being uniformly distributed over the
four lanes.

Let's sum up the process of statistical hypothesis testing in several steps.
Our starting point is always an expected and an observed distribution as
well as the null hypothesis, which states that the deviations between the two
are not statistically significant. Then we proceed as such:
1. Compute Chi-Square
2. Determine degrees of freedom
3. Look up critical value in table
4. Compare Chi-Square and critical value

One thing to keep in mind though is that the Chi-Square test should not be
directly used if one of the expected or observed frequencies is equal to or
smaller than 5. In this case you either have to merge classes to produce
frequencies larger than 5 or use an alternative hypothesis test, like Fisher's
exact test.
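
The computation itself is quickly done in Python. A sketch of my own,
using the lane data from above (the SciPy call is optional and only shown
as a comment):

observed = [238, 245, 269, 248]
expected = [250, 250, 250, 250]

chi_square = sum((h - e) ** 2 / e for h, e in zip(observed, expected))
print(chi_square)   # 2.14, below the critical value 7.82 for f = 3

# With SciPy installed, the p-value comes for free:
# from scipy.stats import chisquare
# print(chisquare(observed, expected))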

(●) Immigrants and Crime

Assume we are given a country with a population that is 90 % native and 10
% immigrant. As is often the case in the first world, the native population
is on average older than the immigrant population.

Let's look at a certain type of crime, say robberies. Now a statistic shows
that of all the robberies in the country, 80 % have been committed by
natives and 20 % by immigrants. Can we conclude from these numbers that
the immigrants are more inclined to steal than the natives? Many people
would do so.

The police keep basic records of all crimes that have been reported. This
enables us to get a closer look at the situation. Consider the graph below, it
shows the age distribution of people accused of robbery in Canada in 2008.
It immediately becomes clear that it is for the most part a "young person's
crime". The rates are significantly elevated for ages 14 – 20 and then
decrease with age. Even without crunching the numbers it is clear that the
younger a population is, the more robberies will occur.

Let's go back to our fictional country of 90 % natives and 10 % immigrants,
with the immigrant population being younger. Assuming the same
inclination to committing robberies for both groups, the immigrant
population would contribute more than 10 % to the total amount of
robberies for the simple reason that robbery is a crime mainly committed by
young people.

Using a simplistic example, we can put this logic to the test. Let's stick to
our numbers of 90 % natives and 10 % immigrants. This time however,
we'll crudely specify an age distribution for both. For the native population
the breakdown is:

- 15 % below age 15
- 15 % between age 15 and 25
- 70 % above age 25

For the immigrants we take a slightly different distribution that results in a
lower average age:

- 20 % below age 15
- 20 % between age 15 and 25
- 60 % above age 25

We'll set the total population count to 100 million. Now assume that there's
a crime that is committed solely by people in the age group 15 to 25. Within
this age group, 1 in 100000 will commit this crime over the course of one
year, independently of what population group he or she belongs to. Note
that this means that there's no inclination towards this crime in any of the
two groups.

It's time to crunch the numbers. There are 0.9 · 100 million = 90 million
natives. Of these, 0.15 · 90 million = 13.5 million are in the age group 15 to
25. This means we can expect 135 natives to commit this crime during a
year.

As for the immigrants, there are 0.1 · 100 million = 10 million in the
country, with 0.2 · 10 million = 2 million being in the age group of interest.
They will give rise to an expected number of 20 crimes of this kind per
year.

In total, we can expect this crime to be committed 155 times, with the
immigrants having a share of 20 / 155 = 12.9 %. This is higher than their
proportional share of 10 % despite there being no inclination for
committing said crime. All that led to this result was the population being
younger on average.
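
The whole toy calculation fits into a few lines of Python (a sketch with
names of my own choosing):

population = 100_000_000
rate = 1 / 100_000   # yearly rate within the 15-25 age group
# (population share, share aged 15 to 25)
groups = {"natives": (0.9, 0.15), "immigrants": (0.1, 0.20)}

crimes = {name: population * share * young * rate
          for name, (share, young) in groups.items()}
print(crimes)                                       # 135 and 20
print(crimes["immigrants"] / sum(crimes.values()))  # about 0.129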

So concluding from a larger than proportional share of crime that there's an
inclination towards crime in this part of the population is not
mathematically sound. To be able to draw any conclusions, we would need
to know the expected value, which can be calculated from the age
distribution of the crime and that of the population and can differ quite
strongly from the proportional value.

(●) Menzerath's Weird Law

When we measure the length of words and compare it to the length of their
syllables, we can make an interesting discovery, which seems to hold true in
all languages: the longer a word is, the shorter its syllables are.
Consider the word "people". It has a length of two syllables. The syllables
in this word are "peo" and "ple". So the average length of its syllables is
three letters. Let's look at a longer word, "Machiavellianism". It has a length
of seven syllables. The corresponding syllables are: "Ma", "chi", "a", "vel",
"li", "an" and "ism". In this case the syllables have an average length of
statistical analysis confirms the trend that longer words have shorter
syllables.

This is not Menzerath's Law just yet. Consider this linguistic order:
syllables, words, subordinate clauses, clauses, paragraphs and texts. These
are linguistic elements ordered by their size. We looked at words and found
out that the longer they are, the smaller their syllables. Interestingly enough,
statistical analysis showed that the longer the subordinate clause, the
shorter its words. And the longer the clause, the shorter its subordinate
clauses. There seems to be a deeper relationship at work.

Menzerath's Law states that, on average, the longer the linguistic element,
the shorter the linguistic elements that compose it. If x is the length of the
linguistic element and y the length of its components, then:

y = a · x^b

with a and b being constants that vary with the linguistic elements involved
and language. In order to have a decrease, b must be smaller than zero.
Gabriel Altmann, a Czech scientist who specializes in quantitative
linguistics, analyzed 10000 words from American English and found that
the best fit is produced by the curve:

y = 4.09 · x^(-0.36)

with x being the length of a word (measured in syllables) and y the average
length of the syllables that compose it (measured in letters). The data points
from Altmann's analysis as well as the corresponding fit can be seen in the
image below.

The formula also agrees quite well with the randomly picked words from
the introduction of the snack. For words of x = 2 syllable length, we would
expect the syllables to be on average y = 3.2 letters long, while for words
with x = 7 syllables the formula predicts the syllables to be y = 2 letters
long.

There's another linguistic law, which initially seems to contradict
Menzerath's Law. Arens' Law states that the longer the clauses, the longer
the words. But looking at the linguistic hierarchy and considering
Menzerath's Law, this is to be expected. Longer clauses mean shorter
subordinate clauses and shorter subordinate clauses mean longer words.

The law probably stems from our limitations in processing complex
information. As word length grows, information is added and the word
becomes harder to process. Shortening the syllables helps to keep the
complexity at an acceptable level. Under stress, when the ability to deal
with complex information is reduced further, this effect becomes even more
pronounced. An analysis of letters written by suicidal persons showed that
the average length of subclauses decreases much quicker with clause length
than in letters of people who were not under significant stress while writing.

Interestingly enough, the mathematical form of the law also holds true on
the semantic level. Fickermann analyzed how the number of meanings a
word has depends on its length and found out that the longer the word, the
fewer meanings it will take on. For English, Fickermann found that the
best fit is produced with the constants a = 33.21 and b = -1.36, with x =
length of the words (measured in letters) and y = number of meanings. So
words of x = 6 letters can be expected to have y = 3 meanings on average.

(●) Regression Analysis

The above image nicely shows what regression analysis does. It can find the
best curve (we'll specify a little later what is meant by best) through a
number of points in a coordinate system. Consider this crude home
experiment I did a while ago. Equipped with a sound level meter and a
ruler, I dropped a small wooden sphere onto the ground from different
heights to see how the maximum sound level during impact varies with the
impact speed (which you can calculate easily from the height neglecting air
resistance).

Here are the first five of eleven data points I collected. Each point is the
average of twenty measurements. The twenty repetitions were done to make
sure that random fluctuations (for example in the drop height) do not distort
the results.

- 0.99 m/s → 28.56 dB
- 1.21 m/s → 30.89 dB
- 1.40 m/s → 33.22 dB
- 1.57 m/s → 34.72 dB
- 1.72 m/s → 36.00 dB

The goal now is to deduce from the data points a formula that connects the
impact speed v with the sound level s. Most of the work will be done by the
computer, but before it can do anything, one must choose a general form for
the relationship. This formula, with its several still undetermined
parameters, is called the ansatz.

To make a good ansatz, you should be familiar with what the typical graphs
of common functions (linear, quadratic, power, exponential, Boltzmann,
trigonometric) look like. Then you can make an educated guess on which
class of functions will most likely produce a good fit. For my data, I chose a
power function in its most basic form:
s = a · v^b

with the yet to be determined parameters a and b. Now it's the computer's
turn. It will determine the values of the parameters so that the best fit is
produced. But what exactly is meant by best fit?

Imagine I did an approximate fit by hand and for 0.99 m/s the formula spits
out the value 29.12 dB instead of the measured 28.56 dB. So the fit differs
from the observed value at this point by 0.56 dB. This difference is called a
residual r. For each data point, the computer calculates this residual and
squares it. The sum of all the squared residuals is a measure of how strongly
the fit differs from the real data.

S = sum( r² )

The computer now chooses the values for the parameters a and b that
minimize this sum. This process is called the “method of least squares” and
leads to what is considered the best fit. For my data points and ansatz, the
choices a = 29 and b = 0.43 minimized the sum of the squared residuals to
S = 0.40. All other combinations of a and b produced a sum bigger than that.
As you can see, the fit turned out to be quite good. Some deviations are
always acceptable when working with real-world data, no fit can ever be
perfect down to the last detail. A number that is often used to characterize
the goodness of the fit is the adjusted r-square. The closer it is to one, the
better the fit. For doing fits, I recommend the program OriginPro or, as a
free but less sophisticated alternative, the website xuru.org.
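
In Python, the least-squares machinery is available through SciPy. Here is
a sketch of my own, using only the five data points listed above, so the
fitted values will differ slightly from the full eleven-point fit:

import numpy as np
from scipy.optimize import curve_fit

v = np.array([0.99, 1.21, 1.40, 1.57, 1.72])       # impact speed in m/s
s = np.array([28.56, 30.89, 33.22, 34.72, 36.00])  # sound level in dB

def ansatz(v, a, b):
    # Power-law ansatz: s = a * v^b
    return a * v ** b

params, _ = curve_fit(ansatz, v, s)
print(params)   # close to the a = 29 and b = 0.43 quoted above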

(●) Typing Speed

When we want a text to be finished quickly, we tend to type faster. But the
faster you type, the more likely you are to make a typing error. So the time
gained by entering more words per minute (WPM) could be eaten up by the
time lost to correcting typing errors. How much do we gain (or maybe even
lose) when we increase typing speed?

To answer this question, we first need data on how the typing error
probability varies with speed. Scientists from the Vanderbilt University and
Brooklyn College of CUNY collected data on just this relationship. You can
see the data in the graph below.
As expected, the error probability quickly increases with speed. If you
double the speed from the average of 40 WPM to 80 WPM, the error
probability goes up threefold from a mere 8 % (one typing error per twelve
words) to about 25 % (one typing error per four words).

Let's focus on the quantity “correct words per minute” (CWPM). It depends
both on the WPM and the error probability, which I will symbolize by p. A
little thought leads to this mathematical relationship:

CWPM = (1 – p) · WPM

Using the above plot, we calculate this quantity for several speeds. For
example, when writing at average speed WPM = 40, our error probability is
going to be about p = 0.08. According to the formula, this translates into 37
correct words per minute. At WPM = 50, we have p = 0.1 and thus 45
correct words per minute. We can do this for several other speeds and
compile a table:

- WPM = 40 → CWPM = 37
- WPM = 50 → CWPM = 45
- WPM = 60 → CWPM = 50
- WPM = 70 → CWPM = 55
- WPM = 80 → CWPM = 59

As you can see, the increase in error probability does not eat up the gain,
but it noticeably dampens the progress. We can use the data for the error
probability to do a Boltzmann fit (with the common sense constraints that p
goes to zero as we decrease typing speed and p goes to one as we increase
typing speed) and to expand the table, leading to these surprising results:
- WPM = 90 → CWPM = 57
- WPM = 100 → CWPM = 54
- WPM = 110 → CWPM = 49

For such high speeds (which skilled typists are indeed able to reach),
the rate of correct words actually decreases for the average writer. In this
region the time lost due to errors overtakes the gain by increased typing
speed. This shows that faster typing does not necessarily mean faster
progress. Rather there's an optimum speed, which the data suggests lies at
about twice the average speed.

(●●) Duelling Morons

This problem is inspired by the great book "Duelling Idiots and Other
Probability Puzzlers" by Paul J. Nahin. In it, he looked at a duel during
which the contestants take turns firing at each other and found out that the
dueller getting the first shot is always slightly more likely to be victorious.

In this variation, we'll have the contestants fire at each other simultaneously.
Dueller 1 has the probability of p1 of hitting the opponent with a shot, for
dueller 2 this number is p2. What is the chance of dueller 1 winning?

He wins after one round if he hits his opponent but his opponent doesn't hit
him (if both hit each other, there is no winner). This happens with the
chance:

p(1) = p1 · (1-p2)
For him to win after two rounds, he needs to miss the first shot and get the
second shot, while his opponent misses both. This occurs with the chance:

p(2) = (1-p1) · p1 · (1-p2)²

A win occurs after the third round if he misses the first two times and
managed to get a hit with the third shot. Again, his opponent needs to miss
all the shots. The probability for this particular outcome is:

p(3) = (1-p1)² · p1 · (1-p2)³

To the keen eye, the progression now becomes visible. The probability for a
win after the n-th round is:

p(n) = (1-p1)^(n-1) · p1 · (1-p2)^n

To get the odds for moron 1 to be victorious, we need to sum all these
terms. This gives us an infinite sum. It is in fact a geometric series, so
a closed form exists (the side note at the end of this snack derives it for
a special case), but let's first attack the sum numerically.

p(1 victorious) = p(1) + p(2) + p(3) + ...

But given numerical values, we can always get as close to the result as we
like by evaluating a certain number of terms. If the probabilities p1 and p2
are not too small, the chance for the game to end after a large number of
rounds is so small that those terms can be ignored with good conscience. Our
numerical example and the following analytical treatment will show that.

Let's set p1 = 0.5 and p2 = 0.4, so dueller 1 has a slight edge. We compute
his chance of being victorious. The odds for him winning after a certain
number of rounds are:
- p(1) = 0.300
- p(2) = 0.090
- p(3) = 0.027
- p(4) = 0.008
- p(5) = 0.002
- p(6) = 0.001

As you can see, the chance of dueller 1 winning after the sixth round is
already quite slim. So we feel comfortable ending here and giving our
approximate result by summing these terms:

p(1 victorious) = ca. 43 %
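
If you want to let the machine do the adding, a few lines of Python
reproduce the partial sums (six rounds are plenty here, since the terms
shrink geometrically):

p1, p2 = 0.5, 0.4

total = 0.0
for n in range(1, 7):
    pn = (1 - p1)**(n - 1) * p1 * (1 - p2)**n
    total += pn
    print(n, round(pn, 3))
print(round(total, 3))   # 0.428, i.e. ca. 43 %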

Now you might wonder why his chance is smaller than 50 %, but remember
that with this set-up, ties are possible. So to really compare the dueller's
chances, we need to compute the probability of dueller 2 being victorious as
well. Since it's rather arbitrary which of the two we call moron 1 and moron 2,
we can just swap the probabilities and use the above formulas. So only for
the moment, we set p1 = 0.4 and p2 = 0.5 to find the chance of dueller 2
winning the match:
- p(1) = 0.200
- p(2) = 0.060
- p(3) = 0.018
- p(4) = 0.005
- p(5) = 0.002
- p(6) = 0.001

Again, we stop here and do the sum:

p(2 victorious) = ca. 29 %

So the 10 % edge in the probability of hitting the opponent leads to a 14 %
edge in the chance of winning the duel. Of course the probability for a tie
must be:

p(tie) = ca. 28 %

In about 1 in 4 duels there would be no winner. The morons would just end
up shooting each other at the same time. Just a side note for the
mathematically curious: as promised, the sum can be done analytically. To
keep the algebra light, we'll fix p1 at 0.5. This leads to:

p(1) = 0.5 · (1-p2)

p(2) = 0.5² · (1-p2)²

p(3) = 0.5³ · (1-p2)³

Abbreviating 0.5 · (1-p2) with q leads to:

p(1) = q

p(2) = q²

p(3) = q³

And so on. Again we do the sum to find out how likely it is for dueller 1 to
win. This is simply:

p(1 victorious) = q + q² + q³ + ...

If you read the snack "Delay Revisited" in chapter three, you know that this
sum can be written in a compact form:

p(1 victorious) = 1 / (1-q) – 1

Note the minus one at the end. This is because the first term usually
included in the sum (q⁰ = 1) was missing. Inserting 0.5 · (1-p2) results in:

p(1 victorious) = 1 / (1 – 0.5 · (1-p2)) – 1

Agreed, not a particularly beautiful formula, but it works. We can use this
formula now to check if our approximate answer to dueller 1 winning the
match was accurate. We had p1 = 0.5 (as required by the formula) and p2 =
0.4. Inserting this value gives us:

p(1 victorious) = 1 / (1 – 0.5 · 0.6) – 1 ≈ 0.4286

Which is 42.86 %, very close to the estimated 43 % (the estimate would
have been even closer had I not rounded the approximate result in the first
place). What about the chances for a tie? Is it possible to set up a formula
for that and give an analytical solution (even if just for a special case)? I
leave that up to you.

(●●) Trying to Predict

The idea for this problem came to me while doing the numerical simulation
for the snack "Country Roads". We calculated an expected travel time of 54
minutes for the highway route. With only 100 loops in the simulation per
trial, the simulated average travel time always deviated somewhat from the
theoretical value. We got 53.5 minutes, then 55.1 minutes and after that
54.3 minutes.

I thought to myself: "When the computer spits out 55.1 minutes and I,
knowing the theoretical average is rather 54 minutes, predict the next
simulated value to be below the displayed 55.1 minutes, how likely is it that
I'm correct given I keep on applying this strategy?" Good question. Let's
simplify this a little while turning it into a problem for us to solve.

We start with a computer program that will display one of these four
numbers with the respective probabilities:

- 1 with p(1) = 0.2
- 2 with p(2) = 0.4
- 3 with p(3) = 0.3
- 4 with p(4) = 0.1

So the program is most likely to display a 2, but given enough trials, will
have displayed each of the numbers. We can interpret this as a discrete
probability distribution and compute the expected value:

e = 1·0.2 + 2·0.4 + 3·0.3 + 4·0.1 = 2.3

Knowing that, I choose this prediction strategy: If the computer displays a
number smaller than 2.3, I predict the next number to be larger than the
previous one. If it displays a number greater than 2.3, my guess is that a
smaller number will follow the one just displayed.

How likely is it that my predictions will be correct using this strategy? Let's
look at each number individually. If it displays a 1, my prediction comes
true if the next number is a 2, 3 or 4. The chance for this to happen is (using
the table):

q(1) = 0.8

If the 2 comes up, my prediction for a larger number is correct if in the next
trial the numbers 3 or 4 come up. So:

q(2) = 0.4

Once the 3 comes up, I'll predict a lower number since 3 is bigger than the
expected value. So my guess is that either a 1 or a 2 will show:

q(3) = 0.6

And finally, if I see the number 4, I assume it will be followed by a smaller
number, which means either a 1, 2 or 3. Here my chances of being correct
are:

q(4) = 0.9

Assume we performed 1000 trials. According to the table, this is the
number of times we expect a certain number to be displayed during the
simulation:

- 1 with 200 occurrences
- 2 with 400 occurrences
- 3 with 300 occurrences
- 4 with 100 occurrences

Since I know that using this strategy I'm right 80 % of the time when the
number 1 shows, its 200 appearances should give me 160 correct
predictions. Similarly, the number 2 provides me with 160 successes, the
number 3 with 180 successes and the number 4 with 90 successes. So in
total, I can expect to get 590 correct predictions during 1000 trials:

p(correct) = 59 %

Thus our prediction strategy is certainly better than just random guessing;
knowing the expected value can be quite helpful.

You might have one last question: why choose to perform 1000 trials? Isn't
it rather arbitrary to just choose any value? It is, but had I chosen to make
100 trials or only 10 trials it wouldn't have affected the result. When doing
the ratio at the end, the random choice of number of trials cancels out. This
is especially clear when solving the problem algebraically, without fixing
any specific numerical values.
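
For readers who want to see the strategy in action, here's a short Python
simulation (random.choices draws the numbers with the given weights; a
repeat of the same number counts as a miss, just as in the bookkeeping
above):

import random

values = [1, 2, 3, 4]
probs = [0.2, 0.4, 0.3, 0.1]
e = sum(v * p for v, p in zip(values, probs))   # expected value, 2.3

trials, correct = 1_000_000, 0
prev = random.choices(values, probs)[0]
for _ in range(trials):
    nxt = random.choices(values, probs)[0]
    # below the expected value predict "larger", above it predict "smaller"
    if (prev < e and nxt > prev) or (prev > e and nxt < prev):
        correct += 1
    prev = nxt
print(correct / trials)   # hovers around 0.59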

(●●) Annoying Websites

Some time ago I came across a website that offered a certain number of
articles. Whenever you clicked "next", it would randomly display one of the
articles. This was somewhat annoying because after a while you had seen
some of the articles over and over and you had to keep on clicking several
times to get to a new one. But it does bring up a nice and easily stated
statistical problem.

Given that there are N articles in total, what is the probability of seeing n
different articles with n clicks? So we are asking for the chance of no article
appearing twice when clicking "next" n times. Of course n must be smaller
than N for this to be possible. We can't see more new articles than there
actually are.

Let's keep this general and approach it by looking at each click individually,
assuming we never get an article twice. For the first click, there's a 100 %
chance of being shown an article that didn't come up before:

p(1) = 1

Now 1 out of N articles is already seen by us, which means N-1 are still
undisplayed. The chance of getting a new article on second click is thus:

p(2) = (N-1) / N

You can guess how this goes on. With only N-2 articles still unseen by
us, the third click provides us with this probability of seeing a new article:

p(3) = (N-2) / N

Note that the number in the bracket on the right side is always one smaller
than the one in the bracket on the left, which indicates the number of clicks.
So with the n-th click there is this chance of getting a new article:

p(n) = (N-(n-1)) / N = (N-n+1) / N

To always see a new article up to the n-th click, we want all of the above
events to occur. Using the multiplication rule, the probability for that is:

p(only new) = p(1) · p(2) · ... · p(n)

This basically solves the problem. But let's apply some algebra to arrive at a
much nicer formula for p(only new). The factorial notation as introduced in
chapter three allows us to write the above formula in a very compact way
after inserting all the terms for p(1), p(2), and so on:

p(only new) = N! / (N^n · (N-n)!)

So if the pool consists of N = 10 articles and we click n = 5 times in total,
this is the probability that we will encounter only previously unseen articles
during the process:

p(only new) = 30 %

So there's less than a one in three chance of that happening. As a side note
for the mathematically curious: the formula for the binomial coefficient C(n,k) also
involves factorials in a very similar manner. It is:

C(n,k) = n! / (k! · (n-k)!)

When renaming the variables using n = N and k = n, the similarities become
even more clear:

C(N,n) = N! / (n! · (N-n)!)

Multiplying both sides with n! results in:

C(N,n) · n! = N! / (N-n)!

Multiplying both sides of the formula for p(only new) with N^n gives us the
same expression on the right side:

p(only new) · N^n = N! / (N-n)!

So we can easily express p(only new) using the binomial coefficient.
Combining the two equations brings us to this truly beautiful formula for
the probability of getting n different articles with n clicks when the articles
are chosen at random from a pool of N articles:

p(only new) = C(N,n) · n! / N^n
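
For the computer-inclined, here's a small Python sketch that evaluates the
formula exactly and double-checks it with a simulation (math.comb computes
the binomial coefficient):

from math import comb, factorial
import random

def p_only_new(N, n):
    return comb(N, n) * factorial(n) / N**n   # C(N,n) · n! / N^n

print(p_only_new(10, 5))                      # 0.3024

trials = 100_000
hits = sum(len({random.randrange(10) for _ in range(5)}) == 5
           for _ in range(trials))
print(hits / trials)                          # hovers around 0.30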


(●●) Overlapping Fields

Picture a square of size n by n. When we turn it into a grid it has n² fields.
Two people independently of each other choose and mark one of the fields.
Since there are a lot of fields, we shouldn't expect them to choose the same
field. But it can happen. What is the probability of their marks overlapping?

The easiest way to approach this is to count possibilities. We can calculate
the probability of overlapping using:

p(overlap) = n(overlap) / n(total)

with n(overlap) being the number of possibilities for the marks to overlap
and n(total) the total number of possibilities for two people to place two
marks on a square with n² fields.

In the snack "Cracking a three digit Code" we got to know a multiplication
rule for computing possibilities. We can apply it here. Since Person A has n²
choices and Person B has n² choices, the total number of possibilities is just:

n(total) = n² · n² = n⁴

How many possibilities are there for the marks to overlap? Each field
provides exactly one possibility for an overlap, so:

n(overlap) = n²

Thus, the probability that two people choose the same field at random is:

p(overlap) = n² / n⁴ = 1 / n²

In the 5 x 5 grid pictured above, the chances of accidental overlaps are 1/5²
= 1/25 or 4 %. We'll run a computer simulation of this set-up to see if it
checks out. For this purpose, we create six integer variables, x(1), y(1),
x(2), y(2), l and o. x(1) and y(1) represent the coordinates of the field
chosen by one person, x(2) and y(2) those of the field chosen by the other.
During a loop, each of these four variables is assigned a random value
between one and five. Then the coordinates are compared. The variable l
keeps track of the total number of loops, the variable o records the number
of overlaps. The probability of overlaps occurring by chance is then
computed by o/l.
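
In Python, the program just described might look like this (a sketch; the
original simulations were done in Visual Studio, as noted in the references,
but the logic is the same):

import random

n = 5                    # the grid is n by n
l, o = 10_000_000, 0     # loop counter and overlap counter
for _ in range(l):
    x1, y1 = random.randint(1, n), random.randint(1, n)
    x2, y2 = random.randint(1, n), random.randint(1, n)
    if x1 == x2 and y1 == y2:
        o += 1
print(o / l)             # should hover around 1/n² = 0.04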

We run three trials with ten million loops each. Here are the results rounded
to two digits after the decimal point:

- 4.08 %
- 3.89 %
- 4.06 %

As you can see, it agrees very well with our result above; the deviation is no
more than 3 % of the theoretical value. Encouraged by this, we take the problem
to the next level by including a third person. Again we start with a grid
having n² fields and each person randomly choosing a field to mark. What
is the chance of an overlap occurring?

We will approach this the same way, by counting possibilities and doing the
ratio. The total number of possible outcomes is relatively easy to derive:
Each person has n² choices for placing the mark. That leads to:

n(total) = n² · n² · n² = n⁶

An overlap can occur in many different ways: only A and B overlap, only A
and C overlap, only B and C overlap or all three overlap.

Assume A and B choose the same field, but C chooses another. How many
possibilities are there for that to happen? A and B have n² choices, C has
(since one field is now taken) n² – 1 choices. So the number of ways, in
which only A and B overlap, is:

n(only A and B overlap) = n² · (n²-1)

Due to symmetry, the possibilities are the same for two other cases: only A
and C overlap and only B and C overlap. So the number of ways for two
people to choose the same field and one to choose another is:

n(two of three overlap) = 3 · n² · (n²-1)

All that's left is to find out how many possibilities we have for all three to
overlap. Each field in the grid provides exactly one possibility for that, so:

n(three overlap) = n²

The total number of possible overlaps is thus:

n(overlap) = 3 · n² · (n²-1) + n²

By doing the ratio n(overlap) / n(total), we can find out what the probability
is that an overlap will occur when three people randomly choose one of n²
fields. After some algebra, we get this formula:

p(overlap) = 3 / n² – 2 / n⁴

In the 5 x 5 grid pictured above, the odds of overlapping are 0.117 or about
11.7 %. About, but not exactly, three times the result we got for two people.
We run a computer simulation to check our result. The procedure is similar
to the program above, we just need some additional variables to include the
third person and to keep proper track of overlaps. The results of the three
trials with ten million loops each are:

- 11.61 %
- 11.73 %
- 11.67 %

Again, very well in agreement with the theoretical calculation. Interesting
side note: as the number of fields increases, the above function will
converge to 3 / n². So the ratio of p(overlap 3 people) / p(overlap 2 people)
will converge to 3 as n grows to infinity.

(●●) Cellular Automatons

Cellular automatons are very powerful in simulating how local interactions
play out in the grand scheme of things. You are given a grid and each field
within it can take one of two states (usually symbolized by black and
white). Additionally, there is a set of rules that defines how a field interacts
with its eight neighboring fields. The states of the neighbors during one step
determine the state of the field in the next step.

Let's look at a simple example. Imagine each field to be a person. A white
field stands for a healthy person, while a black field represents a person that
has the flu. To simulate the rise, spread and fall of the flu wave, we define a
set of rules connecting each field to its eight neighbors. For now we will
just ignore any resistance to the flu that might occur.

1) A white field will become black at the next step if at least one
neighboring field is black (has the flu).

2) Each field that is black has a certain chance q of becoming white
(healthy) at the next step.

These two rules of local interaction can already produce interesting and
beautiful behavior during the simulation: clusters forming, clusters joining,
fronts propagating, and so on. Of course you need to rely on a computer to
do the large number of computations necessary to get from one distribution
to the next. But if the rules are not too complex, there are some things we
can deduce by hand.

Let's stay with the above example. The absence of any resistance means that
the flu wave could just go on forever. We will denote the percentage of
black fields by p and the total number of fields by N.

Given a certain percentage p of infected fields, what is the chance of a field
having at least one infected neighbor? As in all "at least"-problems, we first
calculate the probability of having no infected neighbor.

p(no infected) = (1-p)⁸

Now we can easily compute the chance of having at least one infected
neighbor:

p(at least one infected) = 1 – (1-p)⁸


So of the (1-p)·N white fields, we expect a number of p(at least one
infected)·(1-p)·N to become infected at the next step. This is just restating
rule 1 mathematically. We can also do the same for rule 2: of the p·N
infected fields, q·p·N will become white at the next step. Using these two
results, we set up a balance equation that relates the percentage of infected
fields during step k+1 to the percentage of infected fields during the
previous step k.

p_{k+1} = p_k + (1 – (1-p_k)⁸) · (1-p_k) – q · p_k

Note that since N appeared as a factor in each term, I just divided both sides
by N to get rid of it. The above formula shows the overall behavior that is to
be expected, but it is not exact as the distribution of the black and white
fields is usually not statistically homogeneous.

With this equation we can take a look at possible steady states, that is, when
p_{k+1} = p_k (with some fluctuations). Since we don't have to worry about steps
here, we'll just write p for the percentage of infected fields. This leads to the
equation:

p = p + (1 – (1-p)⁸) · (1-p) – q · p

Unfortunately, this equation can not be solved analytically. But we can
solve it numerically when given a value for q. If we set q = 0.5 = 50 %, we
get two possible outcomes: the flu dies out (p = 0) or the flu goes on forever
at about 2 in 3 people infected (p = 0.66). Which outcome actually occurs
depends in a complex way on the initial distribution.
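
A few lines of Python suffice for the numerical solution: we simply iterate
the balance equation until it settles. Starting at p = 0.5 steers us to the
nonzero steady state (starting at exactly p = 0 would keep the flu extinct):

q, p = 0.5, 0.5
for _ in range(1000):
    p = p + (1 - (1 - p)**8) * (1 - p) - q * p
print(p)   # ≈ 0.666, i.e. about 2 in 3 infected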

If you're interested in cellular automatons, make sure to check out the Game
of Life, published by the British mathematician John Conway in 1970. It
quickly became popular and remains so today, with many websites and
blogs dedicated to documenting and researching the game. Based on a few
simple rules, the simulation gives rise to many interesting and unexpected
objects, such as the Blinker, Beacon, Pulsar, Glider, and so on. Some initial
patterns become static after a number of iterations, while others grow
indefinitely. You can easily find applets to try out your own initial patterns
on the internet.

Of course there are many other interesting cellular automatons worth
observing and analyzing; the free-to-use applet "Mirek's Java Cellebration"
offers a great variety.

(●) Two and a Half Fallacies

The field of statistics gives rise to a great number of fallacies (and
intentional misuse for that matter). One of the most common is the
Gambler's Fallacy. It is the idea that an event can be "due" if it hasn't
appeared against all odds for quite some time.

In August 1913 an almost impossible string of events occurred in a casino
in Monte Carlo. The roulette table showed black a record number of
twenty-six times in a row. Since the chance for black on a single spin is
about 0.474, the odds for this string are: 0.474^26 ≈ 1 in 270 million.
For the casino, this was a lucky day. It profited greatly from players
believing that once the table showed black several times in a row, the
probability for another black to show up was impossibly slim. Red was due.

Unfortunately for the players, this logic failed. The chances for black
remained at 0.474, no matter what colors appeared so far. Each spin is a
complete reset of the game. The same goes for coins. No matter how many
times a coin shows heads, the chance for this event will always stay 0.5. An
unlikely string will not alter any probabilities if the events are truly
independent.

Another common statistical fallacy is "correlation implies causation". In
countries with sound vaccination programmes, cancer rates are significantly
elevated, whereas in countries where vaccination hardly takes place, there
are only few people suffering from cancer. This seems to be a clear case
against vaccination: it correlates with (and thus surely somehow must
cause) cancer.

However, taking a third variable and additional knowledge about cancer
into account produces a very different picture. Cancer is a disease of old
age. Because it requires a string of undesired mutations to take place, it is
usually not found in young people. It is thus clear that in countries with a
higher life expectancy, you will find higher cancer rates. This increased life
expectancy is reached via the many different tools of health care,
vaccination being an important one of them. So vaccination leads to a
higher life expectancy, which in turn leads to elevated rates in diseases of
old age (among which is cancer). The real story behind the correlation
turned out to be quite different from what could be expected at first.

Another interesting correlation was found by the parody religion FSM
(Flying Spaghetti Monster). Deducing causation here would be madness.
Over the 18th and 19th century, piracy, the one with the boats, not the one
with the files and the sharing, slowly died out. At the same time, possibly
within a natural trend and / or for reasons of increased industrial activity,
the global temperature started increasing. If you plot the number of pirates
and the global temperature in a coordinate system, you find a relatively
strong correlation between the two. The more pirates there are, the colder
the planet is. Here's the corresponding formula:

T = 16 – 0.05 · P^0.33

with T being the average global temperature and P the number of pirates.
Given enough pirates (several tens of millions, to be specific), we could
even freeze Earth.

But of course nobody in their right mind would see causality at work here;
rather we have two processes, the disappearance of piracy and global
warming, that happened to occur at the same time. So you shouldn't be too
surprised that the recent rise of piracy in Somalia didn't do anything to stop
global warming.
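
If you want to check the freezing figure yourself, set T = 0 in the formula
(assuming T is the average temperature in degrees Celsius) and solve for P;
one line of Python does it:

P = (16 / 0.05) ** (1 / 0.33)   # pirates needed for an average of 0 degrees
print(P)                        # ≈ 3.9e7, several tens of millions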

As we saw, a correlation between quantities can arise in many ways and
does not always imply causation. Sometimes there is a third, unseen
variable in the line of causation, other times it's two completely independent
processes happening at the same time. So be careful when drawing your
conclusions.

Though not a fallacy in the strict sense, combinations of low probability and
a high number of trials are also a common cause for incorrect conclusions.
We computed that in roulette the odds of showing black twenty-six times in
a row are only 1 in 270 million. We might conclude that it is basically
impossible for this to happen anywhere.

But considering there are something in the order of 3500 casinos
worldwide, each playing roughly 100 rounds of roulette per day, we get
about 130 million rounds per year. With this large number of trials, it would
be foolish not to expect a 1 in 270 million event to occur every now and
then. So when faced with a low probability for an event, always take a look
at the number of trials. Maybe it's not as unlikely to happen as suggested by
the odds.
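
A quick back-of-the-envelope check in Python (a rough estimate only: it
ignores the subtleties of counting overlapping streaks within a casino's
sequence of spins, but the order of magnitude is what counts here):

p_streak = 0.474**26
print(1 / p_streak)                 # ≈ 270 million

spins_per_year = 3500 * 100 * 365   # ≈ 128 million rounds worldwide
print(spins_per_year * p_streak)    # ≈ 0.5 expected streaks per year

So we should expect roughly one such record run every two years somewhere
on the planet.
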
Excerpts

As a reward to all those readers who have come this far, I'll include two
excerpts from my other books. If you're interested in understanding
mathematics, which you probably are considering you bought and read this
book, I'm sure you will enjoy the following excerpts.

(●) Inflation

This is an excerpt from the book "Business Math Basics - Practical and
Simple" by Metin Bektas, available on Amazon for Kindle and other e-
reading devices.

There's no denying it: things get more expensive. This happens in all
economies and almost every year. At moderate rates, this increase in price
level is not alarming. The picture below shows the inflation rates for the US
from 1991 to 2012. Only in 2009, shortly after the financial crisis, did
prices actually fall.

What reasons are there for inflation to occur? One way of answering this
question is to take the monetarist approach and focus on the so-called
Equation of Exchange. It will help us to easily identify the culprit.

Let's look at the quantities necessary to understand this equation step by
step, using an example. One quantity is the money supply M. It's simply
the total amount of money present in the economy. For introductory
purposes, I'll set this value to M = 100 billion $.

Also important is the velocity of money V. It tells us how often each dollar
(bill) is used over the course of a year. This quantity depends on the saving
habits of the people in the economy. If they are keen on saving, the bills
will only pass through a few hands each year, thus V is small. On the other
hand, if people love to spend the money they have, any bill will see a lot of
different owners, so V is large. For the introductory example, we'll set V =
5.

Note that the product of these two quantities is the total spending in the
economy. If there are M = 100 billion $ in the economy and each dollar is
spent V = 5 times per year, the total annual spending must be M · V = 500
billion $. This conclusion is vital for understanding the Equation of
Exchange.

There are two more quantities we need to look at, one of which is the price
level P. It tells us the average price of a good in the economy. If there's
inflation, this is the quantity that will increase. Let's assume that in our
fictitious economy the average price of a good is P = 25 $.

Last but not least, there's the number of transactions T, which is just the
total number of goods sold over the entire year. We'll fix this to T = 20
billion for now and make another very important conclusion.

The product of these last two quantities is the total sales revenue in the
economy. If the average price of a good is P = 25 $ and there are T = 20
billion goods sold in a year, the total sales revenue must be P · T = 500
billion $. It is no accident that the total sales revenue equals the total
spending. Rather, this equality is the (reasonable) foundation of the
Equation of Exchange.

For the total spending to equal the total sales revenue, this equation must
hold true:

M · V = P · T

which is just the Equation of Exchange. Now think about what will happen
if we increase the money supply M in the economy, for example by printing
money or government spending. We'll assume that the spending habits of
the people remain unchanged (constant V). Since we increased the left side
of the equation, the total spending, the right side of the equation, the total
sales revenue, must increase as well.

One way this can happen is via an increase in price level P (inflation).
Indeed empirical evidence shows that in the US every increase in money
supply was followed by a rise in inflation later on.

Luckily there's another quantity on the right side which can absorb some of
the growth in money supply. A rise in the number of transactions T
(increased economic activity) following the "money shower" will dampen
the resulting inflationary drive. On the other hand, a combination of more
money and less economic activity can lead to a dangerous, Weimar-style
hyperinflation.
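
Here's the introductory example in a few lines of Python, together with what
happens when the money supply grows while V and T stay fixed (the 20 %
increase is just for illustration):

M, V, T = 100e9, 5, 20e9   # money supply ($), velocity, transactions
print(M * V / T)           # price level P = 25.0 $ per good

M = 120e9                  # expand the money supply by 20 %
print(M * V / T)           # 30.0 $: prices rise by the same 20 %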

At some point in your life, you probably thought to yourself: If
governments can print money, why the hell don't they just make everyone a
millionaire? The answer to this question is now obvious: The Equation of
Exchange, that's why. If the government just started printing money like
crazy, the rise in price level would just eat the newly found wealth up. Each
dollar bill would gain three zeros, but you couldn't buy more with it than
before.

Of course there can be much more trivial causes for inflation than a
growing money supply. Prices are determined by an equilibrium of supply
and demand. If demand drops, retailers have to lower their prices to sell off
their stocks. Conversely, if demand suddenly increases, the retailer will be
able to set higher prices, resulting in inflation. This happens for example
when a new technology comes along that quickly rises in popularity.
Appropriately, this kind of price level growth is called a demand-pull
inflation.

(●) Hurricanes

This is an excerpt from the book "Great Formulas Explained - Physics,
Mathematics, Economics" by Metin Bektas, available on Amazon for
Kindle and other e-reading devices.

In this section we are going to do just what the title says, that is, compute
hurricanes. The great formula that accomplishes this, called the Rankine
formula, is little known among physicists and mathematicians; most are not
aware of its existence. But that doesn't make it any less useful.

One of the most important quantities that is used to characterize a hurricane,
aside from the size, is the pressure difference p (usually in millibars, in
short: mb) between the center and the surrounding of the hurricane. Air
always flows from high to low pressure and thus, when an area of low
pressure forms, air starts flowing towards it. Because of earth's rotation, the
resulting flow is not direct. The air rather circulates around and into this
region of low pressure. The greater the pressure difference, the more violent
the movement of air will be.

For starters, we will assume this pressure difference to be constant over the
life of a hurricane. At a later point we will relax this condition, allowing the
calculations to include strengthening and weakening hurricanes. But for
now, we only care about two quantities: the distance from an observer to the
center of the storm r (any unit of length will do as long as we are consistent)
and the wind speed v at this distance.

The Rankine formula states that this expression is conserved as the
hurricane changes position:

v · r^0.6 = constant

Our strategy will be: first we use current data (a distance and a wind speed)
to compute the constant, then we are able to get an estimate for the wind
speed at any distance. Note that this equation tells us that when we triple the
distance to the center of the hurricane, the wind speed roughly halves.

----------------------

A hurricane is approaching and according to TV reports it is currently
about 600 miles away from our town. Current wind speeds are about 20
mph. From the projected path we can deduce that the hurricane will come
as close as 100 miles. What is the maximum wind speed v we can expect?

First we determine the constant using the current data:

20 · 600^0.6 ≈ 930

Now we can set up an equation for the maximum wind speed. Since we
inputted the speed in mph, the result will be in the same unit.

v · 100^0.6 ≈ 930

v · 16 ≈ 930

v ≈ 58 mph

Simple as that. But remember that we assumed the hurricane to be of
constant strength during its approach. If this is not the case, we need to
include the pressure difference in our calculations, which is what we will do
now.
----------------------

In case of hurricanes of changing strength, the pressure difference p appears
as a variable in the Rankine formula. This makes things a little harder, but
luckily not by much.

v · r^0.6 / √p = constant

Let's turn to an example. We stick to the strategy: first determine the
constant using current data (a distance, a wind speed and a pressure
difference), then we can calculate the wind speed at any distance and
pressure difference.

----------------------

Again the approaching hurricane is 600 miles away with current wind
speeds of 20 mph. The pressure difference between the center and
surroundings of the hurricane at this point is about 60 mb. During its
approach, it will come as close as 100 miles and is expected to strengthen
to 80 mb. What is the maximum wind speed v we can expect?

First we determine the constant:

20 · 600^0.6 / √60 ≈ 120

Now let's find the maximum wind speed:

v · 100^0.6 / √80 ≈ 120

v · 1.8 ≈ 120

v ≈ 67 mph

----------------------
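
Both calculations are easy to wrap into a small Python helper. The function
below is only a sketch (the name rankine_speed is my own) covering the
constant-strength case as well as the extended formula:

from math import sqrt

def rankine_speed(v0, r0, r, p0=None, p=None):
    # Wind speed at distance r, given speed v0 at distance r0. If the
    # pressure differences p0 (now) and p (later) are supplied, the
    # extended formula with the square root term is used.
    c = v0 * r0**0.6
    if p0 is not None and p is not None:
        return c / sqrt(p0) * sqrt(p) / r**0.6
    return c / r**0.6

print(rankine_speed(20, 600, 100))          # ≈ 58.6 mph, constant strength
print(rankine_speed(20, 600, 100, 60, 80))  # ≈ 67.7 mph, strengthening
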
It is important to note that all of the equations only hold true outside the eye
of the storm (which is usually about 20 to 40 miles in diameter). The
maximum wind speed in a hurricane is reached at the wall of the eye. Inside
the eye wind speeds drop sharply. It is, so to speak, "the calm within the
storm" and can make for a quite eerie experience.

We'll draw one last conclusion before moving on. The size of the eye is
more or less a constant. This implies that the maximum wind speed within a
hurricane grows with the square root of the pressure difference. So if the
pressure difference quadruples, the maximum wind speed will
approximately double. Real-world data confirms this conclusion within
acceptable boundaries. As an estimate for the maximum wind speed in a
hurricane you can use this formula:

v(max) ≈ 16 · √p

The result is in mph. For an average category four hurricane (p = 80 mb) we
can expect maximum wind speeds of 140 mph.

References

Title:
http://commons.wikimedia.org/wiki/File:Typing_monkey_768px.png

Chapter One:
http://pixabay.com/en/flat-icon-symbol-math-mathematics-27155/

---- The Basics

http://www.algebralab.org/lessons/lesson.aspx?file=Algebra_ProbabilityMultiplicationRule.xml

http://people.richland.edu/james/lecture/m170/ch05-rul.html

---- Good ol' Coins

http://www.ohrt.com/odds/binomial.php
(Binomial Coefficient Calculator)

---- Monkey on a Typewriter

http://shakespeare.mit.edu/romeo_juliet/full.html

---- Cracking a Three Digit Code

http://robison.casinocitytimes.com/article/how-many-stops-on-a-slot-reel-33747

---- Homicide She Wrote

http://www.unodc.org/unodc/en/data-and-analysis/homicide.html

---- Missile Accuracy

http://www.abc.net.au/science/articles/2003/04/01/817429.htm#.Ua5QcpyWeUk

http://en.wikipedia.org/wiki/DF-21

http://jonathanturley.org/2012/09/29/targeted-hype/

---- In Gut We Trust

Crilly, T. (2009). 50 Mathematical Ideas. London: Quercus Publishing Plc.

Chapter Two:
http://commons.wikimedia.org/wiki/File:Binomial_distribution_pmf_sl.svg

---- The Basics

http://wbs.eu.com/index.php/folders/category/91-statistics-1-revision?download=236:introducing-binomial

---- Defective Parts

http://stattrek.com/online-calculator/binomial.aspx
(Binomial Distribution Calculator)

---- Frumkina Law

http://prx.aps.org/pdf/PRX/v3/i2/e021006

Köhler, R. (2012). Quantitative Syntax Analysis. Berlin / Boston: Walter
de Gruyter GmbH & Co. KG

http://www.usingenglish.com/resources/text-statistics.php

---- Penalty Kicks

http://archiv.c6-magazin.de/06/monatsthema/2006/06-fussball-wm/fussball-lexikon/elfmeter.php

Chapter Three:
http://www.flickr.com/photos/go_greener_oz/3047060508/

---- The Basics

http://www.intmath.com/counting-probability/11-probability-distributions-concepts.php

---- Delay Revisited

http://de.wikipedia.org/wiki/Geometrische_Reihe

Chapter Four:

http://commons.wikimedia.org/wiki/File:Poisson_distribution_PMF.png

---- The Basics

http://www.umass.edu/wsp/statistics/lessons/poisson/

http://stattrek.com/online-calculator/poisson.aspx
(Poisson Distribution Calculator)

---- Tornadoes

http://www.erh.noaa.gov/cae/svrwx/tornadobystate.htm

---- It's Getting Hot

http://www.realestateforums.com/greenref/docs/GRE13_ChristopherMorgan.pdf

---- Foxy Lady

http://www.thefoxwebsite.org/After-the-Hunt.pdf

---- Soccer

Tolan, M. (2010). So werden wir Weltmeister. Munich: Piper Verlag GmbH.

Chapter Five:
---- The Basics

http://www.artofproblemsolving.com/Store/products/intro-counting/exc2.pdf

Nahin, P.J. (2010). Duelling Idiots and Other Probability Puzzlers. Oxfordshire: Princeton University Press.

---- Asteroids

http://en.wikipedia.org/wiki/Impact_event

---- Deep Blue Sea

http://geography.about.com/library/cia/blcpacific.htm

http://www.wolframalpha.com/input/?i=seattle+sidney

https://de.wikipedia.org/wiki/Sichtweite

---- Severe Floods

http://www.kcl.ac.uk/sspp/departments/geography/people/academic/malamud/floods.pdf

Chapter Six:
http://commons.wikimedia.org/wiki/File:Thomas_Bayes.gif

---- The Basics

http://www.cut-the-knot.org/Probability/BayesTheorem.shtml

http://www.math.hmc.edu/funfacts/ffiles/30002.6.shtml

---- Innocent Drivers

http://www.sussex.ac.uk/Users/christ/crs/kr-ist/lec08b.html

Chapter Seven:
---- Don't be Mean

http://www.princeton.edu/~achaney/tmve/wiki100k/docs/Standard_deviation.html

http://www.greenbook.org/marketing-research.cfm/how-to-interpret-standard-deviation-and-standard-error-in-survey-research-03377

---- Tai Chi for Mathematicians

http://grundpraktikum.physik.uni-saarland.de/scripts/Fehlerrechnung_Uni-Ulm.pdf

http://faculty.southwest.tn.edu/jiwilliams/probab2.gif

---- Cellular Automatons

http://www.math.cornell.edu/~lipa/mec/lesson6.html

http://psoup.math.wisc.edu/mcell/mjcell/mjcell.html

---- Immigrants and Crime

http://www.statcan.gc.ca/pub/85-002-x/2010001/article/11115-eng.htm

---- Menzerath Law

http://arxiv.org/pdf/cs/0512102v1.pdf

http://www.glottopedia.org/index.php/Menzerath-Altmann-Gesetz

---- Typing Speed

https://my.vanderbilt.edu/motonoriyamaguchi/files/2011/08/Yamaguchi_Crump_Logan_JEPHPP_inpress.pdf

---- Two and a Half Fallacies

http://www.fallacyfiles.org/gamblers.html

http://web.archive.org/web/20070407182624/http://www.venganza.org/about/open-letter/

http://www.statista.com/statistics/221031/total-worldwide-casinos-by-region/

Computer Programs:
OpenOffice.org, Copyright © Sun Microsystems Inc.
(Text Editing)

Visual Studio 2008, Copyright © Microsoft
(Numerical Simulations)

OriginPro, Copyright © OriginLab Corporation
(Visualization)

Mathscribe Lite, Copyright © Mathscribe Inc.
(Visualization)

SIGIL, GNU General Public License v3
(ebook Conversion)
