Notes On Scientific Methodology For Phil 1030


On Scientific Methodology

Wayne C. Myrvold
Revised October 2017

1. Arguments, deductive and inductive. Philosophers use the word “argument” in a sense
that is completely different from the ordinary use of the term. The philosophers’ use is summed
up nicely by Michael Palin in Monty Python's "Argument Clinic" sketch.[1]

Palin: Well, an argument is not the same as contradiction.


Cleese: Can be.
Palin: No it can’t. An argument is a connected series of statements intended to establish a
proposition.
Cleese: No it isn’t.
Palin: Yes it is! It isn’t just contradiction.

As philosophers use the term, an argument is, as Palin says, a connected series of statements
intended to establish a proposition (we will use the words “statement” and “proposition” more or
less interchangeably; they are what are expressed by declarative sentences, the sorts of things that,
unlike, for example, questions, can be true or false).

For any argument, there must be a proposition, called the conclusion, and one or more statements,
called premises, that are offered as support for the conclusion.

There are two types of arguments: deductive and inductive. The difference lies in the relation that
is supposed to hold between the premises and the conclusion. Consider the following example.

Argument 1
Premise 1: An argument isn’t just contradiction.
Premise 2: What’s happening here in this room is just contradiction.
Conclusion: What’s happening here in this room isn’t an argument.

It is a feature of this argument that if the premises are true, then the conclusion must also be true.
This feature is called deductive validity; this is a deductively valid argument. Note that this is
different from saying that the premises are true, or that the conclusion is. You can recognize that
this is a deductively valid argument even if you have no clue whether or not the premises are true.

Also, someone could accept that it’s a deductively valid argument, and still reject the conclusion.
But that person would have to reject at least one of the premises. For example, Cleese could
maintain that what is happening in the room is an argument, by saying that an argument can be
just contradiction. Accepting an argument as deductively valid simply means accepting that
one cannot deny the conclusion without also denying one of the premises.

[1] If any of you are Monty Python fans, this will already be familiar to you. If you're not already familiar with it:
https://www.youtube.com/watch?v=kQFKtI6gn9Y.
A deductive argument is one that is intended by the person presenting the argument to be
deductively valid. It might or might not succeed in being deductively valid. For example, later in
the Argument Clinic sketch, an argument occurs about whether Palin has paid for extra arguing
time. “Well, if I haven’t paid, then why are you arguing? Gotcha!” he exclaims. There’s an implicit
argument here.

Argument 2
Premise: You’re arguing.
Conclusion: You’ve been paid.

The tone, and the “Gotcha!” strongly suggest that this is supposed to be a deductive argument.
Cleese points out that this is not a valid argument, as it’s possible for the premise to be true without
the conclusion being true: “I could be arguing in my spare time.” This shows that the argument, as
written above, is not valid. All it takes to show invalidity is that it is possible, that is, that it
could happen, that all the premises be true and the conclusion false at the same time. It's an invalid
argument, even though, in the sketch, Palin has paid, and the conclusion is in fact true.

A deductive argument is one that is intended to be deductively valid. It might or might not succeed
in being deductively valid. Argument 1, above, is an example of a deductively valid argument;
Argument 2, of a deductively invalid argument.

In addition to deductive arguments, there are inductive arguments. An inductive argument is not
intended to establish its conclusion with certainty; even if the premises are all true, at best it shows
that the conclusion is probably true.

Arguments of this sort are central to science. Any argument for a conclusion about the future based
on evidence about the past has to be inductive. No matter how reliable a pattern has proved to be
in the past (for example, patterns observed in the motions of the planets), it’s not impossible for
that pattern to not persist in the future. Conclusions about things that are not directly observable
(such as subatomic particles) based on premises about what has been observed are also inductive
arguments. Much of the discussion of scientific methodology involves distinguishing good,
strong inductive arguments from weak ones.

2. Hypothetico-deductive method.
2.1. Basics of the hypothetico-deductive method. When people first began to try to explicitly
formulate scientific reasoning, in the seventeenth century, the basic picture that emerged was
something like the following.

You formulate a hypothesis. It could be something that you can’t test directly by observation, as it
might be a hypothesis about something that happened in the past, or it might be a hypothesis about
things out of your reach, such as the orbits of the planets, or about things that are too small to see,
such as atoms.

You don’t yet have much of an idea of whether the hypothesis is true; but you can ask yourself
what things would look like if the hypothesis is true. We call things that are true if the hypothesis
is true consequences of the hypothesis. These are things that are deduced from the supposition
that the hypothesis is true, hence the name hypothetico-deductive method.

What you want is to find some consequences of the hypothesis that can be checked by observation.
If the hypothesis is about the past, you should be asking yourself: what would things be like now
if the hypothesis is true? If the hypothesis is about the motions of the planets, and if you don’t
have access to spacecraft that can be used for observations (as was true for most of the history of
science), the question to ask is: what will we see here on earth, when we look up at the sky, if the
hypothesis is true? If the hypothesis is about things that are too small to see, ask yourself what
consequences there might be of your hypothesis for things that can be subjects of observation.
That is, you use the hypothesis to make predictions about what you will observe.

To get a feel for how this reasoning might work: suppose you have two roommates, Betty and
Barney. You come home late one evening from a hard day of studying in the library, and you find
that all the beer that you had bought a few days ago is gone, and all the empties are in the recycling
bin.

You might form two hypotheses:

H1: Betty drank all your beer.
H2: Barney drank all your beer.

These are both hypotheses about the past, so you ask yourself: what consequences are there, of the
two hypotheses, that you might be able to check in the present?

You reason as follows: if either one of your roommates drank all of your beer, then that person
will have a hangover in the morning (it was a lot of beer), and hence that person will display
symptoms of a hangover. This is something that you can check.

Betty gets up at 6 am and goes for her usual run. You conclude that Betty does not have a hangover,
and that H1 is false.

Barney sleeps until 11 am, gets up, drinks a litre of water and takes three Advil, and goes back to
bed. You take this behavior to be symptoms of a hangover.

What do you conclude from this? Notice that H1 and H2 are not an exhaustive set of alternatives;
it’s possible that they’re both false (exercise: think of a few ways that this could happen). So, it
would be rash to be certain that H2 is true. But you can take the fact that Barney is showing what
you take to be hangover symptoms as evidence in support of the hypothesis H2. We say,
in such a case, that the hypothesis is supported by the evidence, or that it is corroborated.
Sometimes it is said that the hypothesis is confirmed or verified, but don't take this to mean that
it is conclusively confirmed.

Here's a seventeenth-century scientist, Christiaan Huygens (1629–1695), on the method of science, from
the introduction to his Treatise on Light, published in 1690.

One finds in this subject a sort of demonstration, which does not produce a certainty
as great as that of Geometry, and which differs much from it, in that, where the
Geometers prove their propositions by certain and incontestable principles, here the
principles are verified by the conclusions that one draws from them; the nature of
things does not permit it to be done otherwise.

It is nevertheless possible to arrive at a degree of probability [vraisemblance, which could
also mean "verisimilitude"] which very often falls only a little short of complete
evidence. This is the case when the things, which one has demonstrated by the
principles assumed, agree perfectly with the phenomena that are observed in
experience; and all the more so when there is a great number of them, and above
all when one creates and foresees new phenomena, which must follow from the
hypotheses one is employing, and finds the effect corresponding to our expectation
(Huygens 1690, 2–3, my translation).

What Huygens is saying is: natural science is different from pure mathematics. In geometry, one
starts from principles (usually called axioms), that are regarded as “certain and incontestable.” In
natural science, one starts from hypotheses, draws conclusions from them, which can be tested by
observation. If the predictions turn out to be correct, this doesn’t guarantee that your hypotheses
are correct, but, as Huygens says, the nature of things doesn’t permit certainty. However, when
things go well—and, among other things, this means precise predictions that turn out to be correct,
a large number of different predictions, and prediction of new phenomena—then we can have a
degree of probability that is close to complete certainty.

Schematically, the reasoning goes like this. From a hypothesis H we deduce a consequence, O,
which must be true if the hypothesis is true, and which can be tested by observation. We then check
to see whether O is, in fact, true. There are two possibilities: either O is true, or it's not.

What happens next is very different depending on whether or not O is true. If it’s false, then, since
O must be true if H is, we conclude that H is false. This argument form is:

If H then O.
O is false.
Therefore, H is false.

This is a deductively valid argument.

If, however, O turns out true, we take H to be better supported by the evidence we have than it was
before we knew that O is true.

2.2. Auxiliary hypotheses. The sketch of the hypothetico-deductive method in the previous
section isn’t bad as a first pass. But there’s a wrinkle.

Most interesting hypotheses in science don’t, by themselves, make any predictions at all about what
will be observed. One way to see this is that, in order to check a consequence of the hypothesis,
we might require some sophisticated lab equipment, and your hypothesis might only have
consequences about what will happen if the equipment is working properly. Even if there's no
equipment involved, you'll be using your eyes or ears or other sensory apparatus to check the
consequence, and the hypothesis will have consequences about what you will see or hear only on
the assumption that you're not hallucinating or subject to some sort of sensory illusion.

As an example: in September 2011 a group of physicists who were measuring the speed at which
neutrinos travelled (underground!) between CERN in Geneva, Switzerland, and an underground
lab at Gran Sasso, Italy, 730 kilometers away, announced that the result they obtained was slightly
larger than the speed of light, about 1.00002 times the speed of light.[2]

Despite what most of the news stories at the time suggested, the physicists involved in the
experiment did not conclude that these neutrinos were travelling faster than light. Why not? For
one thing, according to modern physical theory, and in particular Einstein's theory of relativity,
nothing goes faster than light, and it would be difficult to reconcile faster-than-light neutrinos with
a lot of other evidence that supports the theory of relativity. For another, the experiment involved
a complicated, difficult measurement, requiring very precise estimates of the distance traversed
and the time elapsed, and, if either one of those numbers was in error by a small amount, then the
measured speed of the neutrinos could be less than the speed of light. To give you a sense of the
precision involved: they measured the distance travelled, 730 kilometres, within 20 centimetres,
and the difference between the estimated flight time of the neutrinos and the time it would take
light to travel that distance was only 0.0000000607 seconds.

[2] The original press release, and subsequent follow-ups, can be found at
http://press.cern/press-releases/2011/09/opera-experiment-reports-anomaly-flight-time-neutrinos-cern-gran-sasso.

What the scientists involved in the experiment said, in a paper that was posted online, is

Despite the large significance of the measurement reported here and the stability of
the analysis, the potentially great impact of the result motivates the continuation of
our studies in order to investigate possible still unknown systematic effects that
could explain the observed anomaly. We deliberately do not attempt any theoretical
or phenomenological interpretation of the results (Adam et al. 2011, p. 22).

That is: they refrained from drawing any conclusions, and published the result in order to facilitate
further investigation into possible unknown systematic errors. They eventually decided that the
measured timing was in error, because of a faulty optical fibre connection (Adam et al. 2012, p.
17). The corrected estimate of time of flight of the neutrinos yielded an estimated speed that is less
than the speed of light.

How does this fit into the schema of the Hypothetico-Deductive method? Let T be Einstein’s theory
of relativity. This theory entails that nothing travels faster than light. Does it follow from the
theory that, if you measure the speed of something, the result you get won’t be a speed greater than
the speed of light?

Clearly not; your measurement could be in error, and the theory T doesn’t say that there aren’t
ever any measurement errors. To interpret your result as indicating the actual speed of the thing
requires certain extra assumptions, called auxiliary assumptions, about such matters as how the
measuring apparatus works.

Let M be the (false) statement that the speed of neutrinos estimated in the OPERA experiment is
less than the speed of light. As mentioned, M cannot be deduced from T by itself, but it can be
deduced from T plus some auxiliary assumptions A1, A2, …, An. So, schematically, we have, as
premises, the statement that T and the auxiliary assumptions together entail M, and the statement that M is false.
From this we conclude that either T is false, or one or more of the auxiliary assumptions is false.

If T and A1 and A2 and … and An, then M.
M is false.
Therefore, either T is false, or at least one of A1, A2, …, An is false.

Take this schema, rather than the earlier one, as the right one for how the H-D method works in
cases of evidence that disconfirms a prediction.

2.3. Limitations of the HD method. The HD method gets part of scientific methodology
right; this accounts for its persistence in accounts of scientific method. But it can’t be the whole
story, and it requires supplementation.

One reason is that, in the case of disconfirming evidence, we want something stronger than the
conclusion that either the hypothesis is false, or one of the auxiliary hypotheses is false. In the
neutrinos case, most people at the time regarded it as far more likely that there was some unknown
systematic error, than that the neutrinos were actually travelling faster than light.

Another reason has to do with cases in which a theory turns out correct. Not all predictions should
count equally in favour of a theory. If I have a theory of planetary motion that predicts that there
will be a solar eclipse sometime within the next ten years, and you have a theory that predicts the
date and time within a few seconds, then your prediction coming true should count for more. The
idea is that good predictions are those that you wouldn’t expect to be true unless the theory is true.

There's a deeper problem: some hypotheses lead to no definite predictions whatsoever. Suppose you want
to know whether a die is fair or loaded. Consider the two hypotheses:

H1: The die is fair.
H2: The die is biased in favour of 6 coming up.

To test these hypotheses, the natural thing to do is to toss the die a large number of times, and to
see whether the die shows signs of a tendency to produce six more often than other results.
However, no matter what results you get, they will be consistent with both hypotheses—even if,
for example, you toss it 1,000 times and get a six in 900 of those tosses. It’s possible for a fair die
to do this; it’s just bloody unlikely.

And, if you think about it, in the case I used to introduce these topics, the Mystery of the Vanishing
Beer, neither the Betty nor the Barney hypothesis makes predictions about what you will, with
certainty, observe if the hypothesis is true. Rather, what you have seen supports one better than
the other, because what you observe is more along the lines of what you expect if the Barney
hypothesis is true.

All of these things suggest that we should introduce concepts such as probability and likelihood
into our considerations. That will be the subject of the next section.

3. Statistical hypothesis testing. In a lot of interesting cases, we have a hypothesis that doesn’t
make definite predictions. Nonetheless, we regard certain things to be more or less likely if the
hypothesis is true. For example, consider testing in a medical context. You want to know whether
a given treatment is effective. You can administer the treatment to a test group. It is too much to
expect the treatment to be 100% effective, so you would expect only some of the patients to whom
the treatment is administered to show signs of improvement.

Suppose, then, that you administer the treatment to 100 patients, and 75% of them show signs of
improvement. This, by itself, doesn’t tell you anything about whether the treatment is effective,
because some people improve without treatment. The key idea is contrast: you want to know
whether people who are treated tend to improve more than those who are not.

What researchers do is to divide the subjects of a trial into two groups: a treatment group, to whom
the treatment under study is administered, and a control group, who get no treatment. They collect
results regarding improvement in symptoms from the two groups. But not all differences between
the two groups are treated as significant. It could happen, purely by chance, that there are more
people in the treatment group who would have gotten better anyway, even without the treatment.
What is wanted is a difference between the two groups that is not the sort of thing that you would
expect if the treatment is ineffective. Only if we see a difference between the two groups of a sort
that would be very unlikely if the treatment is not effective do we count the result as a statistically
significant result. I’ll explain what this means in connection with an example used by the
pioneering statistician Ronald A. Fisher in his classic book, The Design of Experiments, the first
edition of which was published in 1935.

3.1. The lady tasting tea. Suppose that a lady, who is very particular about the way her tea is
prepared, claims that she can tell, by taste, whether the milk was added to the cup first or the
tea was.

The key idea is to ask yourself: what if she has no such ability? We design an experiment in such
a way that we know what to expect if she has no ability to distinguish between the two situations,
and we look for deviations from those expectations.

Call the hypothesis that the lady has no ability to tell the difference between the two ways of
preparing her tea the null hypothesis, H0. The null hypothesis is equivalent to the claim that her
guesses are no better than a random coin flip. We will design an experiment in such a way that,
for any outcome of the experiment, the null hypothesis gives that outcome a definite probability,
which we can calculate, and look for significant departures from what the null hypothesis would
lead one to expect.

In Fisher’s experiment, we prepare eight cups of tea, four milk-first, and four tea-first. To avoid
the possibility of her detecting some kind of pattern in the order of presentation, we randomize the
order in which the two preparations are presented to her.

If she has no ability to discriminate, then her guesses are no better than a random coin flip. She
might, by sheer coincidence, get them all correct. But the probability that she will is small. For
each cup, the probability of a correct guess is, on the supposition of the null hypothesis, ½. Since
the guesses are, on the null hypothesis, independent, to get the probability of getting them all
correct, we multiply ½ × ½ × … × ½, with eight factors in all. The probability of getting all correct, if the null
hypothesis is true, is 1/256, which is about 0.004. This is small enough that we would probably
acknowledge that getting all of them correct is an indication that she has the claimed ability to
distinguish between the two preparations. Certainly, it should count as evidence in favour of the
claim.

Whether or not it’s sufficient evidence to accept the claim depends on two sorts of consideration.
One is the prior plausibility of the hypothesis. If a hypothesis is regarded as highly implausible,
then it takes more evidence to make it believable. The second sort of consideration is what’s at
stake. Statisticians distinguish between two types of errors:

• Type I errors consist of rejecting the null hypothesis when it is actually true. In Fisher’s
example, rejecting the null hypothesis consists of accepting that the lady has the ability to
distinguish between the two preparations, so a Type I error would consist of concluding
that she has the ability when in fact she hasn't. A Type I error is a false positive.

• Type II errors consist of failing to reject the null hypothesis when it is actually false. In the
example, this would mean not concluding that the lady has the ability when in fact she
does. A Type II error is a false negative.

With these things in mind, let’s ask ourselves: what would we say if she gets seven out of eight
correct? Intuitively, this seems pretty good, and we would probably acknowledge her ability in
such a case. But what about six out of eight, or five out of eight? Should those count?

Fisher’s proposal (and this has become standard statistical procedure), is to ask, for the case in
which she gets seven correct, or six, or any other number: what is the probability that she would
do as well or better than that by pure chance? If she gets 6 correct, the question of the probability
of getting 6 or better correct by chance is the question of the probability, if the null hypothesis is
correct, of getting 6 or 7 or 8 correct.

These aren’t hard to calculate. There are 256 ways to make guesses about the eight preparations,
and, on the null hypothesis, each of them has the same probability, 1/256. Out of these 256, only
one gets all right. There are 8 ways to get only one wrong, 28 ways to get two wrong (I won’t go
through the details here of how to calculate these, but I’d be happy to show you if you like). There
are also online calculators for this sort of thing: see, for example,
http://stattrek.com/online-calculator/binomial.aspx. Results are in the following table.

n | Probability of getting exactly n correct by chance | Probability of getting at least n correct by chance
0 | 1/256 = 0.00390625 | 256/256 = 1
1 | 8/256 = 0.03125 | 255/256 = 0.99609375
2 | 28/256 = 0.109375 | 247/256 = 0.96484375
3 | 56/256 = 0.21875 | 219/256 = 0.85546875
4 | 70/256 = 0.2734375 | 163/256 = 0.63671875
5 | 56/256 = 0.21875 | 93/256 = 0.36328125
6 | 28/256 = 0.109375 | 37/256 = 0.14453125
7 | 8/256 = 0.03125 | 9/256 = 0.03515625
8 | 1/256 = 0.00390625 | 1/256 = 0.00390625

This number, the probability, if the null hypothesis is true, of getting results at least as good as the
results actually obtained, is called the significance level, or the p-value. Small p-values mean a
more significant result. Standard hypothesis testing procedure consists of specifying, in advance
of the test, a threshold p-value, and counting the results as significant if one obtains a p-value that
is less than or equal to the threshold. In many fields it is conventional to take p = 0.05 as the
threshold for taking a result to be significant; in others, p = 0.01 is the standard.

Suppose, now, that our lady gets seven out of eight correct. We can see from the above table that
the probability of getting at least seven correct, if she's just randomly guessing, is only about 0.035. This
is smaller than 0.05, and so this result counts as significant at the 0.05 level. It’s not smaller than
0.01, and so it doesn’t count as significant at the 0.01 level.
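
If you'd like to check these numbers yourself, here is a minimal sketch, in Python, that reproduces the table (and with it the p-values). It assumes, as the null hypothesis says, that each of the 256 patterns of guesses is equally likely.

from math import comb

N = 8  # number of cups in the experiment

for n in range(N + 1):
    exactly = comb(N, n)  # guess-patterns with exactly n correct
    at_least = sum(comb(N, k) for k in range(n, N + 1))
    # The second number is the p-value for the result "n correct".
    print(f"{n}: exactly {exactly}/256 = {exactly / 256:.8f}; at least {at_least}/256 = {at_least / 256:.8f}")

For n = 7 this prints a p-value of 9/256 = 0.03515625, the 0.035 used above.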

3.2. Interpretation of results. Suppose that you perform an experiment of this kind, and decide
to count a result as significant if you get a p-value of less than 0.05. You do the experiment, and
the subject gets 7 out of 8 correct. What does this tell you?

There’s a strong temptation to say that the experiment shows that the lady does, indeed, have the
claimed ability to distinguish between the two preparations. But this is too quick. The result still
could be due to chance; that is, it could be a false positive. The probability of getting seven or
more, in the absence of any ability to distinguish, is about 0.035, or about 1 in 28. If you were to
repeatedly do this experiment, every day, on subjects with no ability whatsoever, then, on average
you’d get a “statistically significant” result about once a month.

A statistically significant result is some evidence in favour of there being a genuine effect, not
attributable to mere chance. It is not a proof that there is a genuine effect, nor does it, by itself,
show that it is even probable that there is a genuine effect. More on this in the next section.

3.3. The base-rate fallacy. If you have a significant result, it’s not enough to conclude that
there’s a real effect. To see why, let’s consider another example. You are a police officer, and you
have a sobriety test with the following properties.

• If the person to whom the test is administered is drunk, the test will, with certainty, tell you
that.

• If the person is not drunk, there’s a 1 in 100 chance that the test will be positive. That is,
the test has a 1% false positive rate.

You pull over a driver, completely at random, and administer the test. The test comes out positive.
What should you now think about whether the driver is drunk? You can’t be certain, of course, but
should you at least regard it as probable that the driver is drunk?

Take a moment to think about this, before reading on.

Answer: we can’t answer this question with only the information given; we need something else
before we can give an answer. We need to know: how likely do you regard it that a randomly
selected driver is drunk? That is, what do you take the rate of drunk drivers to be?

To see this, suppose that, on average, one out of every 1,000 drivers is drunk (I hope that the
percentage is not that high, but let’s take that number for illustrative purposes). Then, if you
randomly pull over drivers and test them, on average about one out of every thousand will actually
be drunk, and 999 will not be. But out of those 999 drivers who are not drunk, on average about
1%, or about 10, will falsely test positive. So, out of every 1,000 drivers, on average you’ll get
about 11 positive tests, 1 of which is actually drunk, the remaining 10 being false positives. That
is, most of the people who test positive will be false positives. If you pull someone over at random
and administer the test, and it comes out positive, you should regard it as more likely that the
person is not drunk than that the person is.
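
Here is the same calculation as a minimal Python sketch; the base rate and false positive rate are the illustrative numbers assumed above.

drivers = 1000
drunk = drivers / 1000              # assumed base rate: 1 in 1,000 drivers is drunk
sober = drivers - drunk             # 999
true_positives = drunk * 1.0        # the test always catches a drunk driver
false_positives = sober * 0.01      # 1% of sober drivers falsely test positive

total_positives = true_positives + false_positives
print(round(true_positives / total_positives, 3))   # 0.091: about 1 in 11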

Here's a similar situation that allows us to visualize these numbers. Figure 2 shows 1,000
coloured squares, with 10 coloured green and 1 coloured red. Suppose I tell you that I selected a
square at random, and that it was either red or green. Based on that information, what should your
degree of belief be that the selected square is red?

[Figure 2: 1,000 coloured squares: 989 blue, 10 green, and 1 red.]

Answer: telling you that the square I picked is red or green means that you can disregard all of the
blue squares and focus on the non-blue squares. Out of the eleven squares that are not blue, only
one is red. All the squares are equally likely. Therefore, conditional on the information that the
chosen square is red or green, your degree of belief that it is red should be 1/11.

3.4. Base rates and significance testing: the importance of replication. Consider medical
research. There are lots of research groups testing various treatments for effectiveness. Suppose
that they consider a result significant if it passes a significance test at the 5% significance level.
What does a significant result tell us?

The case is analogous to the drunk-driving case. Counting something as significant if a p-value of
0.05 or less is obtained is to adopt a procedure with a 5% false positive rate, that is, a one in twenty
false positive rate.

Here, again, we need a base rate. If the vast majority of treatments tested are ineffective (and it
isn’t easy to find effective drugs or other treatments; that’s one of the reasons medical research is
difficult), then most results that pass a significance test at the 5% level will be false positives.

When researchers publish a study showing an improvement between a control group and a
treatment group that is significant at the 5% level, this does not mean that we can be 95% sure that
the treatment is effective. If effective treatments are very rare, it is much more likely that this is a
false positive, that is, that it is due to chance, than that the treatment is actually effective. We
should not take this as telling us that the treatment is effective, but only that it’s worthy of further
study. If the experiment is repeated, and a significant result is again obtained, then the chance of
this happening, if the treatment is ineffective is only 0.0025; this will increase our confidence that
it is indeed effective. If the result turns out to be robust—that is, if we get positive results again
and again in repeated trials—then, with enough testing, the evidence will become good enough to
overcome initial skepticism that the treatment is effective. This is one reason why replication is
regarded as important in science; if some research group gets a positive result in a clinical trial, it
is important that the trial be repeated.
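
To see how quickly replication can overcome a low base rate, here is a minimal sketch with made-up numbers: suppose that only 1 in 100 treatments tested is genuinely effective, that the false positive rate per trial is 0.05, and (an idealization) that an effective treatment always yields a significant result.

base_rate = 0.01   # assumed fraction of tested treatments that really work
alpha = 0.05       # false positive rate per trial

for trials in (1, 2, 3):
    true_pos = base_rate                            # effective treatments pass every trial (idealization)
    false_pos = (1 - base_rate) * alpha ** trials   # ineffective ones pass all trials only by chance
    prob_effective = true_pos / (true_pos + false_pos)
    print(trials, round(prob_effective, 3))         # 0.168, 0.802, 0.988

On these assumptions, a single significant result leaves it more likely than not that the treatment is ineffective, but two or three independent replications make effectiveness probable.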

Here’s what Fisher has to say about this.

It is usual and convenient for experimenters to take 5 per cent. as a standard level
of significance, in the sense that they are prepared to ignore all results which fail to
reach this standard, and, by this means, to eliminate from further discussion the
greater part of the fluctuations which chance causes have introduced into their
experimental results. No such selection can eliminate the whole of the possible
effects of chance coincidence, and if we accept this convenient convention, and
agree that an event which would occur by chance only once in 70 trials is decidedly
“significant,” in the statistical sense, we thereby admit that no isolated experiment,
however significant in itself, can suffice for the experimental demonstration of any
natural phenomenon; for the “one chance in a million” will undoubtedly occur, with
no less and no more than its appropriate frequency, however surprised we may be
that it should occur to us. In order to assert that a natural phenomenon is
experimentally demonstrable we need, not an isolated record, but a reliable method
of procedure. In relation to the test of significance, we may say that a phenomenon
is experimentally demonstrable when we know how to conduct an experiment
which will rarely fail to give us a statistically significant result (Fisher 1960, pp.
13–14).

That is, adopting the convention that we will count something as significant if it passes a
significance test at the p = 0.05 level is, according to Fisher, a way of eliminating from further
discussion the ones that don’t, so that we can focus on the ones that do as worthy of further study.
If a treatment passes that test, what the experiment shows is that we probably shouldn’t disregard
it as unworthy of further study.

4. Probabilistic, or “Bayesian” approaches. Standard statistical testing tells you when a result
gives you some evidence in favour of a given hypothesis. This, by itself, doesn’t tell you whether
you should take the hypothesis as probably true, or likely to be true. To do this, we need something
more: we need to be able to talk about degrees of belief in various hypotheses. That’s the subject
of this section. This section has a bit more math in it than we’ve used before, and it has some
equations. But don’t worry. This isn’t hard to understand, though it might take you a while to get
familiar with it.

4.1. Degrees of belief as probabilities. Certainty and uncertainty come in degrees: even if you
aren’t absolutely certain about who drank your beer, you might more strongly suspect one person
than another. In a case like the neutrino experiment, a scientist might regard it as much more
probable that there is some unknown systematic error than that the neutrinos were really travelling
faster than light.

One way to do this is to assign numbers to various propositions to indicate degree of belief in them.
Typically these are taken to run from 0, for complete disbelief, absolute certainty that the
proposition is false, to 1, indicating absolute certainty that the proposition is true. A value of ½
means that you regard it as equally likely to be true or false.

How, you may ask, can something as nebulous as uncertainty be assigned a number? One way to
think about this is: someone who claims to be sure about something ought to be willing to put their
money where their mouth is. We ask: how much would you bet? Or, if you don’t like to risk your
own money, suppose I give you a choice between two free gifts: one given unconditionally, the
other given only if some proposition (say, the proposition that it will rain tomorrow) is true. For
example, which of the following would you prefer?

A. $80, unconditionally.
B. $100, if it rains tomorrow, and nothing otherwise.

Obviously, if you’re certain that it will rain tomorrow, then B is the better choice. (I’m assuming
that you have a use for the money, and that you would prefer getting $100 to getting $80). Also, if
you’re certain that it won’t rain tomorrow, then you’ll take A. You’ll only choose B if you’re pretty
sure that it will rain tomorrow.

Suppose, now, that we vary the amount in A, and offer you choices between

An. $n, unconditionally.
B. $100, if it rains tomorrow, and nothing otherwise.

For different values of n, we’ll ask you about your preferences.

If n is 100, A will, with certainty, give you the maximum amount that B could, but might not, give
you, and so, if you have the slightest uncertainty about whether it will rain tomorrow, then you’ll
prefer A100 to B, as B can’t give you a better reward than A100, and might give you worse. If n is
zero, and if you aren’t absolutely certain that it won’t rain, then you should prefer B to A0, since
A0 can’t give you a better reward than B, and might (if it rains) do worse.

Presumably, you’ll prefer B to An for small n, and prefer An to B for sufficiently large n. There will
be a shift in your preference for some number in between. Suppose there is a number x such that
you prefer B to An for all n less than x, and you prefer An to B for all n greater than x. We will then
say that your degree of belief in the proposition that it will rain tomorrow, or the probability you
assign to the proposition, is the ratio of $x to $100. For example: suppose your threshold is $75.
You prefer B to An for all n less than 75, and you prefer An to B for all n greater than 75. $75 is
three-quarters of $100, so this corresponds to assigning a probability 3/4 to the proposition that it
will rain tomorrow. This probability is sometimes called your betting ratio for bets on the
proposition that it will rain tomorrow, and the amount, $75, is what you regard as a fair price for
a ticket that entitles you to a payment of $100 if it rains, since it’s the amount you’re willing to
exchange for such a ticket.[3]

[3] This is one way in which the word "probability" is used, to refer to a degree of belief, which can vary from person
to person. There is another sense of the word, which is meant to refer to something objective, such as the odds of
winning a lottery; this is a fact about how the lottery is set up and how the winning ticket is drawn, and it doesn't
depend on what anyone thinks about it. There's no need to choose one of these as the "right" meaning of the word;
they're both perfectly legitimate concepts, and both have a role to play in science.
Consider, now, two propositions that are mutually exclusive. Let R be the proposition that it will
rain tomorrow, and let S be the proposition that there will be clear, sunny skies all day tomorrow.
(Notice: these are mutually exclusive, but not exhaustive. Though they can’t both be true, they
could both be false, as it could be a cloudy day with no rain). Suppose that you attach probability
x to R and y to S. What is the probability you should attach to the proposition R or S, which is true
if either R is true or S is true?

Let B, C and D be the offers:

B: $100, if R, and nothing otherwise.
C: $100, if S, and nothing otherwise.
D: $100, if either R or S, and nothing otherwise.

Note that if you accept both B and C, this amounts to the same payoff, no matter what happens, as
D. Therefore, you should set your fair prices for these tickets so that:

fair price for D = (fair price for B) + (fair price for C).

This means that

Pr(R or S) = Pr(R) + Pr(S).

The same argument can be applied to any two propositions that are mutually exclusive.

Here is a useful way to think about probabilities, which I borrow from Bas van Fraassen (1989). We
represent various propositions, P, Q, R, by regions on a plane (see Figure 1).

[Figure 1: a rectangle containing three overlapping circles, labelled P, Q, and R.]

Think of the points in the rectangle as representing the various ways things could be, and the region
in the circle P being the ways that P could be true. The points that are in both P and Q represent
ways for P and Q to both be true, and the points in all three circles, ways for all three to be true.

We can indicate probabilities by heaping mud on the diagram. Heap more mud on regions that you
regard as more probable. We will do so in such a way that the probability of a given proposition,
such as P, is equal to the fraction, out of all of the mud on the diagram, that is on the region that
represents P:

probability of P = (amount of mud on P) / (total amount of mud on the diagram)

4.2. Conditional probabilities. Suppose you have a jar that contains 100 plain M&Ms and 100
peanut M&Ms. 50 of the plain M&Ms are red, and the rest are green. 75 of the peanut M&Ms are
red, and the rest are green. You shake the jar, and pick one candy at random, with each one having
an equal chance of being drawn. What should your degree of belief (before the draw) be that the
M&M you will draw will be red?

(Take a moment to think about this, then read on.)

       | Red | Green | Total
Plain  | 50  | 50    | 100
Peanut | 75  | 25    | 100
Total  | 125 | 75    | 200

Answer: There are 200 M&Ms in all, of which 125 are red: 50 plain, and 75 peanut. Since they’re
all equally likely, the probability of getting a red one is

Pr(red) = 125/200 = 5/8 = 0.625.

This is the betting ratio at which you should bet on the proposition that the drawn M&M will be a
red one.

Now suppose that you learn that the drawn M&M is a peanut M&M, and nothing more (perhaps
you get to feel its shape, without looking at it to see its colour). Now, what should your degree of
belief be that it’s a red one? Again, take a moment to think about it.

Answer: Learning that the M&M is a peanut one means that you can disregard the “plain” row of
the above table, and focus on the “peanut” row. There are 100 peanut M&Ms, of which 75 are red.
So, having learned that it’s a peanut M&M, your degree of belief that it’s red should be

Pr(red | peanut) = 75/100 = 3/4 = 0.75.

This probability is called a conditional probability. Read the above as

The probability of the drawn M&M being red, conditional on its being a peanut
M&M, is ¾.

When the probability of Q is greater than zero, we define the conditional probability of P,
conditional on Q, as

Pr(P|Q) = Pr(P & Q) / Pr(Q).    (1)
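
As a quick check, here is a minimal sketch, in Python, applying definition (1) to the M&M example above:

total = 200
pr_red_and_peanut = 75 / total    # Pr(red & peanut): 75 red peanut M&Ms out of 200
pr_peanut = 100 / total           # Pr(peanut): 100 peanut M&Ms out of 200

print(pr_red_and_peanut / pr_peanut)   # 0.75, agreeing with the direct count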

4.3. Axioms of probability. Everything that we need to know about probability for this course can
be summed up in four axioms.[4] The first three are:

I. For any proposition P, Pr(P) is greater than or equal to zero.

II. If you are certain that P is the sort of proposition that is true no matter what (also known
as a necessarily true proposition), then Pr(P) = 1.

III. If P and Q are mutually exclusive (that is, if they cannot possibly be both true), then
Pr(P or Q) = Pr(P) + Pr(Q).

If you think about it, these have to be true for betting ratios of a rational person. The amount you’d
bet for a ticket that pays $100 if P is true can’t be a negative number. If P is guaranteed to be true,
and you know it, then you should be willing to pay any amount up to $100 for a ticket that pays
you $100 if P is true. And we’ve seen an argument for the third in section 4.1.

The last axiom relates conditional to unconditional probability.

IV. For any propositions P and Q, if there is a conditional probability Pr(P|Q), then
Pr(P and Q) = Pr(P|Q) × Pr(Q).

[4] If you take a course on probability theory, you might see yet another axiom, called the axiom of
countable additivity. It involves situations in which an infinite number of propositions are
considered. We won't need it for this course, so I'm not going to include it here.

4.4. Updating beliefs on evidence. Conditional probabilities are used to represent the impact of
new evidence on degrees of belief. Suppose that you acquire a new piece of evidence, E, and

nothing else, and suppose that you don’t change your mind about the conditional probability
Pr(H|E), for some hypothesis H. Then your degree of belief in the hypothesis undergoes the shift

Pr(H) ⇒ Pr(H|E) = Pr(H & E) / Pr(E).    (2)

Pr(H|E) is called the posterior probability of H, distinguished from Pr(H), its prior probability.
This shift is called updating by conditionalization, or conditionalizing on the new evidence.

We can think of this process in terms of the muddy Venn diagram. Heap mud on the diagram to
indicate your prior degrees of belief in H, E, H&E, etc. When you learn that E is true, this means
that you now have zero degree of belief in anything outside the E circle. Wipe off the mud that's
not on E; the mud that remains represents your new degrees of belief.

[Figure 3: a muddy Venn diagram with two overlapping circles, labelled H and E.]
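
Here is a minimal sketch, in Python, of the muddy-diagram picture of conditionalization; the amounts of mud on the four regions are made-up numbers for illustration.

mud = {
    ("H", "E"): 30.0,          # H true, E true
    ("H", "not E"): 20.0,      # H true, E false
    ("not H", "E"): 10.0,
    ("not H", "not E"): 40.0,
}

total = sum(mud.values())
prior_H = (mud[("H", "E")] + mud[("H", "not E")]) / total   # 0.5

# Learning E: wipe off the mud outside E, then take fractions of what's left.
mud_on_E = mud[("H", "E")] + mud[("not H", "E")]
posterior_H = mud[("H", "E")] / mud_on_E                    # Pr(H|E) = 30/40 = 0.75

print(prior_H, posterior_H)

Here learning E raises the probability of H from 0.5 to 0.75, because H's share of the mud on E is larger than its share of the mud on the whole diagram.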

To see what sorts of circumstances are conducive to a high (or low) posterior probability, it’s
useful to rewrite the posterior probability, using what is known as Bayes’ theorem, which follows
from the definition of conditional probability, equation (1) (see section 3.3 for how to get Bayes’
theorem).
Pr(H|E) = (Pr(E|H) / Pr(E)) × Pr(H).    (3)

The various bits of Bayes’ theorem have names. Pr(H|E), as we have already mentioned, is called
the posterior probability of H, and Pr(H), the prior probability of H. Pr(E|H) is called the
likelihood, and Pr(E), the prior probability of the evidence.

Looking at (3) gives us an idea of what counts for a high posterior probability. The prior probability
gets multiplied by the factor Pr(E|H)/Pr(E). The probability of H increases if this is larger than
one, decreases if it is smaller than one, and stays the same if it’s equal to one. So, evidence E
counts in favour of a hypothesis H if Pr(E|H) is bigger than Pr(E), that is, if E is something that
you’d regard as more likely to be true on the supposition of H than you would otherwise.

This means that contrast is key. Not just any evidence E counts in favour of a hypothesis, even if
the hypothesis is one that leads you to expect that E is true. When you're trying to find evidence
that, if true, would count in favour of a hypothesis, ask yourself: What would things be like if H is
true and what would things be like if H isn’t true? What you want is something that you would
expect to be true if H is, but not otherwise.

Also, if the prior probability of H is low—meaning that you initially regard it as very implausible—
then it takes very good evidence to raise it to something appreciable. The slogan often used is
“extraordinary claims require extraordinary evidence.”

We can also use Bayes’ theorem to compare the impact of evidence on two hypotheses. If we have
two hypotheses, H1 and H2, and we’re only concerned about the relative probabilities of the two,
Bayes’ theorem gives, for the ratio of the posterior probabilities of these two hypotheses,

Pr(H1|E) / Pr(H2|E) = (Pr(E|H1) / Pr(E|H2)) × (Pr(H1) / Pr(H2))    (4)

4.5. Hypothetico-deductive method revisited. Consider the special case in which E can be
deduced from H. That is, if H is true, then E must be true. Then

Pr(E|H) = 1,    (5)

and so, in such a case,

Pr(H|E) = (1 / Pr(E)) × Pr(H)    (6)

Since Pr(E) is less than or equal to 1, its inverse, 1/Pr(E), is greater than or equal to 1. It's larger
than 1 if Pr(E) is smaller than 1, and it gets bigger, the smaller Pr(E) is. So, predictions from a
hypothesis count in favour of the hypothesis, but not all predictions count equally. What counts
are surprising predictions that turn out to be true, that is, predictions with low prior probability.
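
Here is a tiny numerical illustration of equation (6), with made-up values of Pr(E):

# When H entails E, the prior gets multiplied by 1/Pr(E); the factor is
# large only when E was antecedently improbable (a surprising prediction).
for pr_E in (0.9, 0.5, 0.01):
    print(f"Pr(E) = {pr_E}: prior multiplied by {1 / pr_E:.1f}")
# Pr(E) = 0.9 barely moves the prior; Pr(E) = 0.01 multiplies it a hundredfold.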

4.6. Statistical hypothesis testing revisited. As we saw in sections 3.3 and 3.4, above, telling you
that an experiment has been done that yielded a statistically significant result is not enough to tell
you whether you should take the hypothesis being tested to be probably true. We now have some
terminology for what is missing. If you want to know how probable you should regard a given
hypothesis, in light of the experimental evidence, this is asking for a posterior probability. You
can see from Bayes’ theorem that you can’t get this, without a prior probability.

Cases in which an interesting hypothesis has a well-defined prior probability that everyone can
agree on are rare. This is why standard statistical analysis leaves out all talk of prior probabilities.
The cost is that it also leaves out all talk of posterior probabilities, which is what we really want.

Let’s apply Bayes’ theorem to the case of the lady tasting tea. Let H0 be, as before, the null
hypothesis that the lady has no ability to distinguish between the two preparations. Compare this
to a hypothesis H1 that the lady has an ability to distinguish preparations that is so good that it is
virtually certain that, in a trial like the one considered, she will achieve a result that counts as
significant at the 0.05 level. These are mutually exclusive, but they aren't exhaustive; she might
have some ability, but not as strong as imagined in H1.

Suppose that, before the experiment, you have prior degrees of belief Pr(H0) and Pr(H1). Let E be
the statement that the experiment yielded a result that is statistically significant at the 0.05 level.
This means that the lady got seven or more correct in the experiment. If someone tells you that E
is true, what does this do to your degrees of belief in these two hypotheses?

Since we’re interested in comparing only these two hypotheses, let’s compare the ratio of their
probabilities, and see how this changes when you learn that E is true. We have, from equation (4),

Pr(H1|E) / Pr(H0|E) = (Pr(E|H1) / Pr(E|H0)) × (Pr(H1) / Pr(H0))    (7)

H1, remember, is the hypothesis that the lady’s ability is so good as to virtually guarantee a
significant result. That means that Pr(E|H1) is so close to 1 that it makes no difference, so we will
use 1 as a good approximation. Pr(E|H0) is about 0.035, or 1/28. Putting these into (7) gives us

Pr(H1|E) / Pr(H0|E) ≈ (1 / 0.035) × (Pr(H1) / Pr(H0)) ≈ 28 × (Pr(H1) / Pr(H0)).    (8)

So, the evidence E boosts your degree of belief in H1 relative to your degree of belief in H0. That
is, the evidence counts in favour of H1, compared to H0. If you initially regarded the two hypotheses
as equally plausible, then, after learning E, you should regard H1 as 28 times more probable than
H0. If you initially regarded H0 as a million times more probable than H1, then, after learning E,
you should regard H0 as only about 35,000 times more probable than H1. So, you can acknowledge
that the evidence counts in favour of H1, while remaining very skeptical of that hypothesis. What
the evidence does is make you a bit less skeptical.
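
Here is the calculation in (8) as a minimal Python sketch, for the two prior ratios just considered:

likelihood_ratio = 1 / (9 / 256)   # Pr(E|H1) / Pr(E|H0), about 28.4

for prior_ratio in (1.0, 1e-6):    # Pr(H1)/Pr(H0) before the experiment
    posterior_ratio = likelihood_ratio * prior_ratio
    print(prior_ratio, posterior_ratio)
# With equal priors, H1 ends up about 28 times more probable than H0; with the
# very skeptical prior, H0 remains about 35,000 times more probable than H1.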

4.7. How to get Bayes' theorem. Putting H in for P and E in for Q in (1) gives us

Pr(H|E) = Pr(H & E) / Pr(E),    (9)

from which we can get

Pr(H & E) = Pr(H|E) × Pr(E).    (10)

Putting E in for P and H in for Q in (1), and going through the same steps, gives us

Pr(E & H) = Pr(E|H) × Pr(H).    (11)

But H&E and E&H are the same proposition, and so, equating (10) and (11), we get

Pr(H|E) × Pr(E) = Pr(E|H) × Pr(H).    (12)

Dividing both sides by Pr(E) gives us

Pr(H|E) = (Pr(E|H) / Pr(E)) × Pr(H),    (13)
which is Bayes’ theorem.

References

Adam, T., et al. (OPERA Collaboration) (2011). "Measurement of the neutrino velocity with the
OPERA detector in the CNGS beam." http://arxiv.org/abs/1109.4897v1. Accessed September 12,
2016.

——— (2012). “Measurement of the neutrino velocity with the OPERA detector in the CNGS
beam.” http://arxiv.org/abs/1109.4897v4. Accessed September 12, 2016.

Fisher, Ronald A. (1960). The Design of Experiments, Seventh edition. Edinburgh and London:
Oliver and Boyd.

Huygens, Christiaan (1690). Traité de la Lumière. Leiden: Pierre van der Aa, Marchand Libraire.

van Fraassen, Bas (1989). Laws and Symmetry. Oxford: Oxford University Press.

