Evangelos Kitsos
A practical guide to applied statistics
Table of Contents
Preface........................................................................................................................... v
About the author ....................................................................................................... vi
Part A: The theoretical foundations ...................................................................... 7
1 Data collection considerations .................................................................... 8
Preface
This book does not intend to teach statistical theory. It is not even a book.
Instead, it is a workbook (size 18.2 cm x 12.8 cm) that can be used at any
time as a practical guide on how to apply inferential statistics by using
Microsoft Excel (main focus) and Minitab. It consists of the following parts.
This book is ideal for students and practitioners who want to understand the
application of inferential statistics for solving real problems in a practical way.
Copyright
Minitab® and all other trademarks and logos for the Company's products and
services are the exclusive property of Minitab, LLC. All other marks referenced
remain the property of their respective owners. See minitab.com for more
information.
[Figure: using sample data to learn about an unknown population]
In general, inferential statistics have two main functions. We can either use
sample data to estimate the parameters of an unknown population or test a
specific hypothesis that we may formulate. Such a distinction is useful for
Within the quantitative category we can discern three different types of data
and a “special case”. The following analysis aims to clarify the topic.
Binary. Q: Overall, are you satisfied by the services that we provide? 1. Yes 2. No
Nominal. Q: What kind of car colour do you own? 1. Blue 2. Black 3. Red 4. White 5. Other
Table 1. Examples of attributes data
Q: Based on your overall experience, how likely are you to recommend us to friends? 1. Very likely 2. Likely 3. Neutral 4. Unlikely 5. Very unlikely
Q: How often do you use our online banking services? 1. Every day or more 2. 3-6 times a week 3. About once or twice a week 4. About once or twice a month 5. Never
Table 2. Examples of ordinal data
The limitation of ordinal data is that the intervals between the various categories are unknown. For example, Mike came 1st while Paul came 2nd in a race, but what is their true difference? Can we say that Mike is twice as good as Paul? Of course not! This information will remain unknown unless we measure their times.
[Figure: with variables data true ratios are known; a length of 8.4 cm is twice as long as one of 4.2 cm]
Variables data is very rich in information. For example, the result “Mike and
Paul completed the race (attributes)” offers less information compared to “Mike
finished 1st and Paul 2nd (ordinal)”, which gives less information compared to
“Mike finished the race in 10′ & 21′′ while Paul did it in 10′ & 25′′(variables)”.
Because of their nature, variables data allows the application of the most
powerful statistical procedures, and thus should be preferred if available.
This “illusion” usually leads to poor statistical conclusions, because this data should mainly, yet not necessarily always, be treated as attributes data. To differentiate it from variables data, ask whether it makes sense to have a true decimal point or negative values in your data. If it does not (e.g., −5 or 5.5 bumps make no sense), then your data is counts (discrete).
Once the population has been defined, we need to develop a sampling strategy that will guide the data collection process.
A sampling strategy should be well planned and executed. Otherwise, one may
introduce bias in the sample which will be reflected in the results. There are
different sampling strategies, each of which has its merits and drawbacks.
The main point is that no matter how advanced the applied statistics are, the output of the investigation can only be as good as the input. The popular phrase “Garbage in, garbage out” reflects this concept well. With that in mind,
there are two basic principles that need to be considered.
Validity: the measurements are close to the “true value” being measured. For example, a scale should measure the actual weight of the individual being measured.
Reliability: repeated measurements are consistent in the information that they convey. For example, the measurements of someone’s weight under the same conditions should not vary significantly.
Figure 3. The two basic principles of measurements’ quality
If the factors that affect the validity and reliability of the measurement process
are not considered and controlled, excessive error will be introduced in the
measurement system and thus the reached conclusions will be of low quality.
Several terms are used to describe the types of error associated with a measurement system. However, this error can usually be broken down into three elements: accuracy, repeatability and reproducibility. The following table defines each of them and presents some indicative sources that you may wish to consider in your measurement systems.
There are advanced techniques to quantify these errors and judge whether a
particular measurement system is capable, which will not be discussed within
the scope of this book. However, being aware of them and tackling the sources
for each type of error is a very good start.
Let us consider for example the following report from a quality inspection
process. The employee manually enters a product’s serial number so that the
system can automatically return its weight and length.
• Repeated Numbers
Please, note that outliers can be either errors that occurred during data
collection and need to be corrected, or simply observations of a very
skewed distribution that need to be investigated.
• Missing observations
Once you identify these kinds of issues in your data, it is useful to go back to the process or the phenomenon measured and investigate their sources. Knowledge of the process always gives great insight.
The first step after data collection has been completed is to understand what the sample data is saying. For that reason, we use what is known as descriptive, or summary, statistics. These can be categorized into three main types.
Category | Definition | Measures and tools | Types of data (→ page 59)
Shape | Visualise the dataset | Histograms | Variables
Central tendency | The central number | Arithmetic mean | Variables
Please note that neither the list of measures and tools in the table is exhaustive
nor every tool listed is suitable for every situation. The aim of this section is to
theoretically introduce what one should do to prepare the ground for applying
inferential statistics. The choice and application of the right tool for a particular
situation will be later discussed in this book.
2.1.1 Histogram
There are different methods that one could use to visualize variables data. The most common is the histogram.
A bar chart, by contrast, displays the possible values of the categorical variable on the x-axis and the frequencies on the y-axis.
Bar charts are like histograms. However, unlike histograms the bars of a bar
chart are not connected, due to the categorical nature of the data.
Because they gather all the attribute measurements in a single circle, pie charts provide a holistic perspective of the data being analysed, and the data analyst can easily see the importance of each category.
X̄ = Σxi / n
Where: xi = each value in the dataset, n = the number of values.
For the dataset 3, 5, 5, 6, 8, 9, 11:
X̄ = (3 + 5 + 5 + 6 + 8 + 9 + 11) / 7 = 47 / 7 = 6.714
Note that while the arithmetic mean is more powerful compared to other
measurements of central tendency, it is highly affected by extreme values or in
other words skewed distributions. In such cases, it is always wise to investigate
the extreme values, as well as have an idea of other parameters, such as the
variance of the dataset, or other measurements of central tendency before a
decision is made.
2.2.2 Median
The median is the middle value in a set of raw data that has been ordered from
the lowest to the highest, or the other way around. It is the middle value of a
distribution, in the sense that 50% of the values lie above and 50% below it. In
the example we used above:
Rank of the median = (n + 1) / 2 = 8 / 2 = 4 (place of the median) => Median = 6
Note that if there is an even number of values in the dataset the median is the
average of the two middle values.
The median is a value that can provide useful information when the dataset is
skewed due to extreme values. Therefore, the median should be calculated alongside the arithmetic mean to check whether there are any extreme values in the
dataset. When the dataset includes such values, the median is probably a better
measurement of central tendency to consider when making decisions.
2.2.3 Mode
Mode refers to the value that occurs most frequently in a dataset. In the example
we used above:
𝑀𝑜𝑑𝑒 = 5
The mode is a simple parameter and can be used with any type of data; even
purely qualitative. The mode can provide very useful information because the
value that occurs with the highest frequency may require further attention.
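For readers who want to check these hand calculations outside Excel, here is a minimal Python sketch (standard library only) that reproduces the mean, median and mode of the example dataset; the variable names are illustrative.

import statistics

data = [3, 5, 5, 6, 8, 9, 11]           # example dataset from the text
print(statistics.mean(data))             # 6.714... (arithmetic mean)
print(statistics.median(data))           # 6 (middle value of the ordered data)
print(statistics.mode(data))             # 5 (most frequent value)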
approximately 25 minutes door to door. What time should you leave your
house? One may say that leaving the house at 7:35am will be ok, yet this is not
a good decision to make. The commuting process, similar to any other process
in the world, is subject to variation. Factors such as traffic, traffic lights and
weather conditions will have an impact on the time it takes you to get to work.
It may take for example from 10 to 40 minutes, and thus leaving the house
before 7:20am will be the right thing to do.
The previous example indicates that for any dataset the values will vary from
each other, and this variability is an essential parameter that also needs to be
calculated and considered before a decision is made. In some cases, it is even
possible to end up with datasets that have the same central tendency but have
totally different variability. Such differences are important, and thus you need
to apply statistics to measure the variability or dispersion of your data.
2.3.1 Range
Range is the most straightforward and simple parameter that you can use to
describe the dispersion of your data. It is the difference between your highest
and lowest values in a dataset. For our example above:
Range = 11 − 3 = 8
An advantage of the range is that it can be used with ordinal data. On the other
hand, the range is highly affected by extreme values.
Step 1: Order your dataset from the lowest to the highest value.
Product length (ordered): 22, 26, 28, 33, 45, 45, 46, 47, 48, 48, 52, 52, 52, 53, 53, 53, 56, 57, 58, 58, 63, 63, 63, 64, 64, 64, 64, 65, 65, 66, 67, 68, 71, 74, 75, 75, 82, 87, 95, 98.
Step 2: Find the median of the dataset.
In this case, the median is the average of 58 and 63, or else 60.5.
Step 3: Find and subtract the medians of the 1st and 2nd half of your dataset.
The median for the first half is (48 + 52) / 2 = 50. Respectively, the median for the second half is (66 + 67) / 2 = 66.5. Ultimately, the interquartile range is 66.5 − 50 = 16.5.
The interquartile range can be used even when extreme values exist in the
dataset as it is not affected by the values that lie at the edge of a distribution.
s² = Σ(xi − X̄)² / (n − 1)
Where:
s² = the variance of the sample
Σ(xi − X̄)² = the sum of the squared deviation scores
n = the size of the sample
X̄ = the sample mean
For our example above (3, 5, 5, 6, 8, 9, 11 with X̄ = 6.714): s² = 45.43 / 6 ≈ 7.57.
The variance is expressed in squared units and thus it is difficult to relate it to the actual measurements in a dataset. For that reason, we also calculate the standard deviation of our data, which is simply the square root of the variance. In our example, s = √7.57 ≈ 2.75.
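As a cross-check of the variance and standard deviation formulas, a small Python sketch using the standard library (sample variance with the n − 1 denominator):

import statistics

data = [3, 5, 5, 6, 8, 9, 11]
s2 = statistics.variance(data)   # sample variance, divides by n - 1  -> ~7.57
s = statistics.stdev(data)       # sample standard deviation = sqrt(s2) -> ~2.75
print(s2, s)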
The classical probability of an event can range from 0 (0%) to 1 (100%). For example, the probability of rolling a 2 by throwing a die is P(number 2) = 1 / 6 = 16.67%. Such probabilities are called a priori probabilities because they are known in advance.
The classical approach can only be applied when the outcomes and likelihoods can be determined in advance, such as in games of chance. But what if these cannot be known? This leads to the empirical definition, which is given by the following function:
P(event) = number of times the event occurred / total number of observations
Finally, there are some cases which are not suitable for the classical approach and for which there is no empirical data. For example, a manager is trying to make predictions about the market’s needs or preferences for various products. In these situations, one cannot rely on data; instead, one can assign subjective probabilities by making a personal assessment of the situation based on experience and intuition. It goes without saying that this method is subject to the greatest error, as human judgment is highly affected by unconscious bias; however, in some cases it is the only way.
can be stated and proved mathematically, but it is easier to show their justification graphically using Venn diagrams.
Let’s consider a company that is using agents in order to promote its products in the European market. The company has 100 agents across Europe, with 45 of them specializing in products of type A and 55 specializing in products of type B. Of these agents, 15 specialize in both products A and B. The emails of these agents are in a common database, so when an email is received from a customer, any one of these agents can respond. At the moment, there is no specific algorithm to link the customer request with a specialized agent; the allocation is completely random. So, what is the probability that an agent with a specific specialization answers a customer request?
[Figure: Venn diagram of the agents’ specializations; S: Total (100), No specialization (15)]
Now we are ready to calculate the probabilities of the various events in this situation. For example, the probability that an agent has no specialization is P(No specialization) = 15 / 100 = 15%. Given that the sum of the probabilities of all different alternatives in a given situation (the total area in the Venn diagram) is always equal to 1 (100%), we can then say that the probability of having a specialization is 85% (100% − 15%). The general formula is:
P(A) + P(A’) = 1.
The P(A’) is the complement of P(A) and represents the probability of P(A) not
happening. The sum of complement probabilities is always equal to 1.
The symbol U means union, or else “or”. On the other hand, the events “having a specialization in product A” and “having a specialization in product B” are not mutually exclusive, given that an agent can specialize in both products. In these cases, the following generic formula of addition should be used:
P(A U B) = P(A) + P(B) − P(A ∩ B)
The symbol ∩ means intersection, or else “and”. In our case, this is the probability P(Specialization in product A) + P(Specialization in product B) − P(Specialization in both), which is equal to 45/100 + 55/100 − 15/100 = 85/100 = 85%.
Note that for disjoint (mutually exclusive) events the probability of their intersection is equal to zero, and therefore it is not considered in the equation.
What happens though if a customer request comes in before the agent completes
the previous request? Such a case reflects what is known as sampling without
replacement, as the agent is not available to deal with the request and therefore
the probabilities of the various events will change. For example, the probability
of having the new request addressed by an agent with any type of specialization
is now:
P(Specialization in A U B) = 45 / 99 + 55 / 99 – 15 / 99 = 85 / 99 = 0.859
This probability is slightly higher than before. This is because the same number of agents with a specialization are available, but the total number of agents has been reduced, given that one agent without a specialization is not available.
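The agent example can be replayed in a few lines of Python; this is only a sketch of the addition and complement rules with the numbers from the text.

total, a, b, both = 100, 45, 55, 15       # agents and specializations from the text

p_a, p_b, p_both = a / total, b / total, both / total
p_a_or_b = p_a + p_b - p_both             # addition rule for non-mutually exclusive events
p_none = 1 - p_a_or_b                     # complement rule
print(p_a_or_b, p_none)                   # 0.85 and 0.15

# Sampling without replacement: one non-specialized agent is busy, so 99 remain.
p_a_or_b_99 = 45/99 + 55/99 - 15/99       # ~0.859, slightly higher than before
print(p_a_or_b_99)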
3.3 Expectation
'Expectation' is the theoretical or long-run average of a phenomenon that is measured. This is given by the following formula:
E(x) = Σ [x ∗ P(x)]
For example, the expected value of the two-dice rolls in a game of Monopoly will be 7, or at least the long-run average will be something very close to it (the logic behind this deviation will be explained later).
E(x) = 2∗(1/36) + 3∗(2/36) + 4∗(3/36) + 5∗(4/36) + 6∗(5/36) + 7∗(6/36) + 8∗(5/36) + 9∗(4/36) + 10∗(3/36) + 11∗(2/36) + 12∗(1/36) = 7
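A short Python sketch that enumerates the 36 equally likely outcomes of two dice and confirms that the expected sum is 7; purely illustrative.

from itertools import product

outcomes = [d1 + d2 for d1, d2 in product(range(1, 7), repeat=2)]  # the 36 possible sums
expected = sum(outcomes) / len(outcomes)   # average over equally likely outcomes
print(expected)                            # 7.0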
This tool is highly used in economics and decision theory. For example, a businessman knows that a decision could lead to either a profit of €400 with a probability of 0.7 or a loss of €200 with a probability of 0.3. As a result, we would expect:
E(x) = 0.7 ∗ €400 + 0.3 ∗ (−€200) = €280 − €60 = €220
The calculated profit in the example represents the expected value and is usually
called the utility of the decision. Utilities are presented in a payoff table that
shows the results of each combination of actions and outcomes. For example, a
decision could lead to a loss or a profit, or different decisions can lead to
different results. Each outcome is also linked with a probability of occurring,
For example, if we are rolling a die 3 times, then we have k = 6, as there are 6 possible outcomes of a die, and n = 3. Rolling the same die is considered a single type of event, as the potential outcomes of the experiment remain constant. Therefore, there are k^n = 6^3 = 216 possible outcomes.
On the other hand, if we roll two different dice once, one with six sides and another one with eight sides, then we have two different types of events. This is because k changes from six to eight. In this case the formula k1 ∗ k2 applies: in our example, we have 6 ∗ 8 = 48 possible outcomes for one roll of each die.
Counting rules are not always that straightforward though. For example, how
many even 3-digit numbers can be formed from the digits 1, 2, 5, 6, 9 if each
digit can only be used once?
If only even numbers can be formed, the digit in the units position can only be 2 or 6. Therefore, 4 other digits are available for the hundreds position and, to prevent duplication, only 3 digits will be available for the tens position. This gives 2 ∗ 4 ∗ 3 = 24 possible even numbers.
n! = n ∗ (n − 1) ∗ (n − 2) ∗ … ∗ 1
1 2 3 4 5 6
Job 1 Bob Bob Hellen Hellen Matt Matt
Job 2 Hellen Matt Bob Matt Hellen Bob
Job 3 Matt Hellen Matt Bob Bob Hellen
Let’s assume now that the leader of the team wants to know in how many ways, out of these 6 in total, any two employees have been arranged. We can ask this question in two ways: considering a particular order, or regardless of order. The first one is called a permutation and can be calculated by the following equation.
nPx = n! / (n − x)!
For a permutation the order matters. For example, the team leader wants to promote two employees, one to principal and one to senior level. The order here matters because the two jobs are not of equal level; therefore we need to consider all the potential ways of arranging any two of these employees. In this case, 3P2 = 6.
1 2 3 4 5 6
Principal Bob Bob Hellen Hellen Matt Matt
Senior Hellen Matt Bob Matt Hellen Bob
Note that the fact that the permutation here is equal to n! is a coincidence. If we had four employees, and thus n! = 24, the permutation would have been 4P2 = 12.
On the other hand, when the particular order of how the groups will be formed
does not matter, we use the combination rule.
nCx = n! / (x! ∗ (n − x)!)
For example, if the leader wanted to promote two of these employees into a senior position, then a combination like Bob / Hellen would be of equal value to Hellen / Bob. To avoid such double counting, we add the x! term to the denominator of the permutation formula. In this case, 3C2 = 3.
1 2 3
Senior Bob Bob Hellen
Senior Hellen Matt Matt
Note that while these rules do not provide direct probabilities, they can help us
establish them. For example, in a race of 8 cars we would like to bet that
specifically three of them will have the first three places. What are the chances
of winning this bet?
If we bet that the three cars will finish in a particular order, then we will use the
permutation rule; 8P3 = 336. Thus, there are 336 different ways of arranging any
3 of them in the first 3 places. However, we are interested in 3 specific cars
finishing in a specific order, thus our chances of winning are 1 / 336 = 0.297%.
On the other hand, if we bet that 3 specific cars will finish in the first three
places, but we do not care about the order, then we have to use the combination
rule; 8C3 = 56. Therefore, our chances of winning are 1 / 56 = 1.78%. This is
still low, yet we have better chances now.
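The counting rules above are available directly in Python’s standard library; a quick sketch reproducing the numbers used in the text:

from math import perm, comb

print(perm(3, 2))        # 3P2 = 6  (ordered arrangements of 2 out of 3 employees)
print(comb(3, 2))        # 3C2 = 3  (unordered selections of 2 out of 3)
print(perm(8, 3))        # 8P3 = 336, so P(specific order) = 1/336, about 0.3%
print(comb(8, 3))        # 8C3 = 56,  so P(any order)      = 1/56,  about 1.8%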
Although there are many distributions, here we will use the normal distribution to illustrate how they work. To make the learning clearer, we will assume an example of a manufacturing company that produces metallic cylinders. We are the engineers of the company, and our customers request that the average volume of our cylinders should be 1.5m3 ± 0.8m3. To meet these specifications, we have set our machinery to produce cylinders with a specified mean volume of 1.5m3 and a known standard deviation of 0.02m3.
Random variation is due to the combination of various factors that take place
during a process. In our case for example, some of the factors push towards
making larger cylinders, while some others push towards making smaller
cylinders. In a subsequent production run the same factors can push towards the
opposite direction to the one they were pushing before. Nobody knows how
they will behave as these factors are susceptible to other factors that behave
randomly. However, for most of the cylinders the factors will balance out to
offer a product close, if not equal, to the specified mean value. Only on very rare occasions will all the factors push in one direction, resulting in a cylinder with an extreme volume value.
For our example, we will assume that the production line of cylinders follows the popular normal distribution. Note that we assume the normal distribution for now; as we will see later in this book, this assumption should also be tested. Indeed, not everything in life follows normality, and when this is the case, we need to apply different types of tests. So, for a specified mean of 1.5m3 and a standard deviation of 0.02m3, the following expectations arise.
[Figure: the normal curve with z values from −3 to +3 around the mean; 34.13% of values fall within each 1σ of the mean, 13.59% between 1σ and 2σ, 2.14% between 2σ and 3σ, and 0.13% beyond 3σ on each side; 95.45% of values lie within ±2σ and 99.73% within ±3σ.]
We always read distributions from the smallest (left) to the largest (right) value.
The x-axis indicates the values that the phenomenon under investigation can
take (cylinders’ volume) while the y-axis indicates the frequency of occurrence
of a specific value x. In hypothesis testing we use the area under the curve which
indicates the probability of occurrence of events that lie between two specified
values of interest. For example, we expect that 0.13% of the cylinders produced
will have a volume lower than 1.44m3, or 13.59% of the values will be
somewhere between 1.46m3 and 1.48m3.
You can use Microsoft Excel to calculate the probability of occurrence for any value that you are interested in. For example, you may be interested in the probability of getting a cylinder with a volume between 1.46m3 and 1.48m3.
Table 10. Using Excel to calculate probabilities with the Normal distribution
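The Excel table is not reproduced here, but the same probability can be checked with Python’s statistics.NormalDist; the mean of 1.5 m3 and standard deviation of 0.02 m3 are the values assumed in the cylinder example.

from statistics import NormalDist

volume = NormalDist(mu=1.5, sigma=0.02)         # assumed process distribution
p = volume.cdf(1.48) - volume.cdf(1.46)         # P(1.46 <= X <= 1.48)
print(round(p, 4))                              # ~0.1359, i.e. 13.59%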
As you can see, for any x value of m3 we can get a corresponding z value. Then,
we can calculate the probability of getting a value between any range of z
values. This follows the same logic and provides the same results as the method that we applied above. However, by using the transformed distribution
we can bring any normal distribution down to a common scale and make
meaningful comparisons that are helpful in inferential statistics.
5 Discrete distributions
Examining the distribution of your dataset will help you understand your data in context and make a prediction of what future results may be. In this
section of the book, we will discuss some key discrete distributions.
If these conditions are met, then the binomial distribution can be used to calculate the probability of 0, 1, 2, 3, … successes in n trials. This is given by the following equation:
P(r) = [n! / (r! (n − r)!)] ∗ p^r ∗ q^(n−r)
Where:
r = number of successes
p = probability of success
q = probability of failure (1 − p)
n = the known number of trials (sample size)
Formula in Excel: = BINOM.DIST(r, n, p, FALSE)
The expected value of the binomial distribution, or else the mean value, is given by the equation E(x) = n ∗ p, while the variance is σ2 = n ∗ p ∗ q. Naturally, the standard deviation is the square root of this value.
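A minimal Python sketch of the binomial formula above (math.comb supplies the n!/(r!(n−r)!) term); the numbers are placeholders, not taken from the book.

from math import comb

def binom_pmf(r, n, p):
    """P(r successes in n trials); mirrors Excel's BINOM.DIST(r, n, p, FALSE)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 10, 0.3                      # hypothetical number of trials and success probability
print(binom_pmf(3, n, p))           # probability of exactly 3 successes
print(n * p, n * p * (1 - p))       # mean and variance: np and npq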
As p and q start deviating from 0.5, we will observe skewness, which however can be balanced to a certain extent by the size of the sample. As a rough guideline, we say that the normal approximation of the binomial distribution is reasonable when both np ≥ 10 and nq ≥ 10, or at least 5.
P(r) = (NpCr ∗ NqC(n−r)) / NCn
Where:
r = number of successes
p = probability of success
q = probability of failure (1 – p)
n = the sample size
N = the population size (batch or trials)
C = combination
Formula in Excel: = HYPGEOM.DIST(r, n, N*p, N, False)
The expected value of the hypergeometric distribution, or else the mean value, is given by the equation E(x) = n ∗ p, while the variance is σ2 = n ∗ p ∗ q ∗ (N − n) / (N − 1). Naturally, the standard deviation is the square root of this value.
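The hypergeometric formula can be checked the same way with math.comb; the batch size and counts below are made-up values.

from math import comb

def hypergeom_pmf(r, n, N, p):
    """P(r successes in a sample of n drawn without replacement from a batch of N
    that contains N*p successes); mirrors HYPGEOM.DIST(r, n, N*p, N, FALSE)."""
    successes = round(N * p)
    return comb(successes, r) * comb(N - successes, n - r) / comb(N, n)

print(hypergeom_pmf(r=2, n=10, N=50, p=0.1))   # hypothetical batch of 50 with 5 defectives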
P(x) = (x−1)C(r−1) ∗ (1 − p)^(x−r) ∗ p^r
Where:
r = number of successes
x = number of trials till r successes
p = probability of success of a single trial
C = combination
Formula in Excel: = NEGBINOM.DIST(Failures, r, p, False)
The expected value of the negative binomial distribution, or else the mean
value, is given by the equation E(x) = r * (1 – p) / p, while the variance is σ2 =
r * (1 – p) / p ^ 2.
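A sketch of the negative binomial formula above; note that Excel’s NEGBINOM.DIST takes the number of failures (x − r), whereas the formula is written in terms of the total number of trials x. The values used are hypothetical.

from math import comb

def negbinom_pmf(x, r, p):
    """P(the r-th success occurs on trial x): C(x-1, r-1) * (1-p)^(x-r) * p^r."""
    return comb(x - 1, r - 1) * (1 - p)**(x - r) * p**r

print(negbinom_pmf(x=7, r=3, p=0.4))   # hypothetical: 3rd success occurring on the 7th trial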
If these conditions are met, then the Poisson distribution can model the situation and estimate the respective probabilities based on the following equation:
P(x) = (λ^x ∗ e^(−λ)) / x!
Where:
λ = the expected (average) number of occurrences
Note that the larger the λ, the more the Poisson distribution can be approximated by the normal distribution. The Poisson distribution also approximates the binomial, especially when n > 30 and p < 0.1.
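A corresponding Python sketch of the Poisson formula; λ below is an arbitrary illustrative value.

from math import exp, factorial

def poisson_pmf(x, lam):
    """P(x occurrences) = lambda^x * e^(-lambda) / x!"""
    return lam**x * exp(-lam) / factorial(x)

print(poisson_pmf(x=2, lam=3.5))    # probability of exactly 2 occurrences when 3.5 are expected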
The problem with using a sample though, is that it is extremely unlikely that
the sample’s average will be equal to 1.5m3, simply due to the sampling error.
That is, one cannot know if the values of the sampled cylinders will absolutely
balance out to give the specified expected mean.
What we do know though is that the averages of the samples collected will form
their own normal distribution. This is also known as the sampling distribution
of the mean that models the phenomenon under investigation.
The standard deviation of this sampling distribution is σX̄ = σX / √n, where n is the sample size and σX is the standard deviation of the individual values in the population.
The average of the sampling distribution (𝑋̿ ) is the average of all the samples’
averages. Therefore, in a theoretical scenario of having all the averages it is
expected to be equal to the average of the entire population (μx).
On the other hand, the standard deviation of the sampling distribution (σX̄), also usually referred to as the standard error, is affected by the sample size. The larger the sample, the smaller this error and thus, as we will see later, the better the inferences that can be made.
To start with, there are always two opposing hypotheses, the key characteristics of which are presented in the following table.
In real life, human beings usually form a hypothesis and then try to find
evidence to prove it. In the world of statistics things are slightly different. The
aim is not to prove the H0 per se. Instead, we start with the presupposition that the H0 is valid. Then, we collect and analyse sample data to decide whether H0 should be retained or rejected in favour of H1.
A good way to illustrate how hypothesis testing works is to look at the example
of a trial. There are two hypotheses in a trial:
By law, the jurors should start with the assumption that the accused party is
innocent (H0 is taken for granted) until proven guilty. Then, the evidence is
presented, and the jurors evaluate it so that eventually they will come to a
verdict. Their thinking process follows the logic:
“…. provided that the accused party is indeed innocent (Ho is assumed to be
true) would one ever possibly observe the evidence (sample data) presented?
If the answer is yes, then the jurors fail to reject the H0 (we do not have
enough accusation evidence), otherwise they reject the H0 (we have
significant evidence to suggest that the party is not innocent) in favour of H1
(is guilty)”.
We also know that this was the case up to now, as we were selling these
cylinders for many years to the same customers with no complaints about their
volume. So, if “nothing has changed” we would still expect to have an average
of 1.5m3. On this basis, the null hypothesis (the expectation) is μx = 1.5m3 while
the alternative hypothesis (the testing hypothesis) is μx ≠ 1.5m3.
H0: μx = 1.5m3
H1: μx ≠ 1.5m3
The answer is that it depends. We know from the basics of the normal
distribution that the more a value deviates from the mean the less chances it has
of occurring by pure chance. Thus, the question is how much deviation from
the expected mean (1.5m3) are we willing to accept as being random? Accepting
a large deviation means that we need to see a very extreme sample mean (strong evidence) before rejecting the null hypothesis.
This measure of strength that we would like to see in our sample data before
rejecting the null hypothesis is reflected in the significance level alpha (α). The
alpha is set by the analyst and thus there is no single right value that can be used
in all cases. It depends on the context and the importance placed by the
individual on the test that is to be conducted. Having said that, alpha values equal to 5% or 1% (stronger evidence needed) are the most commonly used. In this
book, all the examples will use an alpha of 5%.
Now we are ready to finalize the test. On the one hand, we have a sample mean
that is expressed as a number that deviates from the population mean, while on
the other hand we have a significance level α which is expressed as a probability
that represents the amount of deviation we are willing to accept. We then need to transform both these values into standardized z values and compare them. The underlying distribution is the normal sampling distribution of the mean that has been presented above. The following steps should be applied.
zcalc = (x̄ − μ0) / σx̄ = (x̄ − μ0) / (σx / √n) = (1.507 − 1.5) / (0.02 / √42) = 2.2683
Note that we have used the standard deviation of the sampling distribution of the mean.
Note that the critical values need to be placed on both sides of the distribution (±), as one can get values that are significantly lower or higher than the specified mean.
• Step 3: Plot both values in the distribution of the sample mean and
compare them, as can be seen in the following figure.
If the test statistic falls within these limits, we fail to reject the
null hypothesis.
Figure 12. 1-sample z-test results (zcalc = 2.26)
In our example, the difference (deviation) observed in the sample data from the mean is significant, as the calculated value is outside the limits set by the critical values. This suggests that the mean has most likely changed; it has most likely increased, given that the calculated value lies at the right edge of the distribution. Therefore, the null hypothesis should be rejected, and corrective action needs to be taken to bring the process back to the centre.
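The 1-sample z-test above can also be reproduced in Python as a cross-check of the Excel steps; the sample mean, specified mean, standard deviation and sample size are the ones used in the example.

from statistics import NormalDist
from math import sqrt

x_bar, mu0, sigma, n, alpha = 1.507, 1.5, 0.02, 42, 0.05

z_calc = (x_bar - mu0) / (sigma / sqrt(n))            # ~2.268
z_crit = NormalDist().inv_cdf(1 - alpha / 2)          # ~1.96 for a two-tailed test
print(z_calc, z_crit, abs(z_calc) > z_crit)           # True -> reject H0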
Figure 13. Impact of the significance level α on hypothesis testing
It is obvious that the higher the alpha value the more chances one has for
rejecting the null hypothesis. This would be fine, provided that the null
hypothesis is indeed false. However, what if the H0 is true? Note that we do not
know the true state of a population, which is the reason why we run the test.
In this sense, α also indicates a risk of potentially being wrong that we are
willing to take. In other words, rejecting a null hypothesis that should not have
been rejected is an error caused by the choice of having a very large α. At the
same time, a very low α can lead to a case where the null hypothesis is retained
when it should have been rejected. These are the two main types of errors in
inferential statistics, as can be seen in the following table.
Expanding on the two types of errors goes beyond the scope of this book.
However, it is important to keep in mind that an outcome of a hypothesis test is
not always right. In fact, we can be 1-α% confident that we have reached the
right conclusion, and we have an α% chance of being wrong. In this respect, a
choice of an α at the level of 5% or 1%, accompanied by a large enough sample
to improve the Power (1-β) of the test and thus reduce the β error will most
likely give you a robust result. If you have serious concerns about the results of
a test, try running it again by increasing the sample size if possible.
However, what tails we use depends on the nature of the hypothesis we want to
test. For a two-tailed test, like the z-test we have seen already, we are interested
in both directions of the distribution. That is, if the test statistic indicates a value
significantly larger or smaller than the hypothesized μo, then we need to reject
the null hypothesis and accept the alternative one. The word significance is
reflected in the alpha value and thus the α is divided into two equal groups that
are placed on both tails of the distribution. It may also be possible though that
we are interested in one direction of the distribution (one-tailed test).
[Figure: allocation of the significance level α: for a two-tailed test, α/2 is placed on each tail; for a one-tailed test, the full α is placed on a single tail]
Left tail tests hypothesize that the value will not be smaller than a specified
value. In this case we do not care if the test statistic indicates a significantly
larger value. However, if the test statistic indicates a value that is significantly
smaller, then we need to reject the null hypothesis and accept the alternative
one. Again, significance is determined based on the chosen α value, which,
given the focus of the one-tailed tests on a single direction, must be placed in
full on the tail of the distribution that we are interested in; in this case, the left side. Of course, the opposite applies for right-tail tests.
It may also be possible that a situation will require testing parameters from two or
even more populations. The following table is using the mean as an example to
illustrate how this difference can affect the way your hypotheses are stated.
Generic formula:
Sample statistic − Error ≤ Population parameter ≤ Sample statistic + Error
Estimating the mean value (confidence interval, 1 − α with α = 5%):
μ = x̄ ± Z ∗ σx / √n = 1.507 ± 1.96 ∗ 0.02 / √42
which gives 1.501 ≤ μ ≤ 1.513 around x̄ = 1.507.
Depending on what is inferred, the parameters of the generic formula will take a specific form. For estimating the mean, we follow the statistics from the distribution of the sample mean. On this basis, we can say that we are 95% (1 − α) confident that the new mean of the process is somewhere between 1.501m3 and 1.513m3. If we recall the specification limits that the customers requested (1.5m3 ± 0.8m3), we can understand why they complain. They seem to receive a significant number of products that are beyond the upper limit.
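The same confidence interval in Python, again only as a sketch with the example’s numbers:

from statistics import NormalDist
from math import sqrt

x_bar, sigma, n, alpha = 1.507, 0.02, 42, 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)          # 1.96
margin = z * sigma / sqrt(n)
print(x_bar - margin, x_bar + margin)            # ~1.501 to ~1.513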
Although p-values and critical values will always lead to the same result, they
approach the test from a different perspective. In the example with the metallic cylinders, the sample mean was 1.507m3, which was different from the specified mean (1.5m3) by 0.007m3. From the perspective of p-values, we would ask:
Figure 16. P-values on a distribution (for a two-tailed test)
In our example, we were interested in both directions and thus we had to equally
allocate the α to both tails. To balance that, we become interested in the
probability of getting either a difference of +0.007 or -0.007 and thus both sides
To calculate the p-value, we need to find the area under the normal curve that lies outside the two Zcalc values in the figure. We can do so by using the function “=NORM.S.DIST(-Zcalc,TRUE)”, which will give us the area from the minimum value (left side) up to the -Zcalc value (= 1.17%). This would have been the probability if it were a one-tailed test. However, we are running a two-tailed test and thus we need to multiply it by 2 (1.17% * 2 = 2.33%) to include the p-value area from the right side of the distribution (it is symmetrical).
Then, your task is to compare the calculated p-value with the significance level (α), which essentially is the cut-off probability. The two possible outcomes of this comparison are:
a. If p-value < α then reject H0. It is very unlikely that the result is due to
pure chance. Probably, something has changed.
b. If p-value > α then fail to reject H0. The probability is not low enough to
suggest that something has changed. The difference is probably due to
pure chance.
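The p-value calculation and the decision rule can be sketched in Python as follows (the z value is the one calculated earlier):

from statistics import NormalDist

z_calc, alpha = 2.2683, 0.05
p_value = 2 * NormalDist().cdf(-abs(z_calc))     # two-tailed: both sides of the curve
print(round(p_value, 4))                         # ~0.0233 (2.33%)
print("reject H0" if p_value < alpha else "fail to reject H0")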
[Figure: the normal and Student's t distributions]
7 A practical process
There are many distributions and statistical tests that one can run depending on
what is to be tested. Despite their differences though, all these tests follow more or less the same logic as the z-test that we have already seen. Thus, the
following steps summarize the knowledge of this chapter as well as can be used
as a general guide for the tests that will be presented in this book.
Step 1. Define the problem: Clarify the issue and what needs to be tested. Capture it in a question if possible.
Step 2. Choose the right test: Use the maps in this book to guide you through the assumptions of the various tests.
Step 3. Is it one-tailed or two-tailed? If the direction is specified (i.e., larger or smaller than …) then the test is one-tailed, otherwise it is a two-tailed test.
Step 4. Formulate the hypotheses: Capture the nature of the problem to be tested in a set of clear hypotheses.
Step 5. Collect and analyse your sample data: Evaluate the quality and summarize your sample data by using descriptive statistics.
Step 6. Decide on the level of significance (α): This reflects the strength you would like to see in the evidence as well as the risk you are willing to take. Usually it is either 5% or 1%.
Step 7. Calculate the test statistic and find the critical value or use the p-value: Use the formulas in this book to calculate the test statistic. Use either the critical value or the p-value depending on what is more convenient. In this book we will work with p-values.
Step 8. Decide whether to reject or fail to reject the Ho: Compare the critical to the calculated value, or the p-value to the α, to decide the fate of the Ho.
Step 9. Apply the decision: Translate the result of the test to something that has meaning for those involved in or affected by the decision.
Table 14. Basic steps to hypothesis testing
The Microsoft Excel spreadsheet package can be used to carry out many statistical tests, and with a little manual contribution from our side it can become comparable to some of the more advanced statistical software on the market. Of course, it has limitations, and this is the reason why we have also included Minitab in the discussion.
Before we jump into the tests, we first need to set up Microsoft Excel for statistics. This requires enabling the “Analysis ToolPak” add-in. The process for doing so is illustrated below.
• Step 1: Select “File” at the top left corner of Microsoft Excel and then
select “Options”.
• Step 2: In the options window select the “Add-ins” tab and then select
“Go…” as it can be seen in the following figure.
• Step 3: In the window that pops up, check the “Analysis ToolPak” box and then press OK.
• Step 4: If you have done everything correctly, the “Data Analysis” option should now appear in the “Data” tab.
For those of you who do not have access to something like Minitab, these notes offer a solid and straightforward approach to hypothesis testing using only Microsoft Excel.
Finally, please note that this guide has been developed on a Windows computer. Those of you who run other systems may have to think a little about how to apply the instructions and steps presented here.
Variables (→ Page 61): continuous data that can take on any number (i.e., length, weight …)
Note that sometimes the categorization of a dataset into a specific type can be tricky. The following scale aims to help you with that issue. Start from the top question and, as you go down, check whether your data possesses the indicated qualities.
Is the data continuous?
• Yes → Variables (quantitative): data that can have a “logical” decimal point and take any value on a theoretically infinite scale (i.e., weight, length, volume, size).
• No → Do true differences exist?
   • Yes → Counts, discrete (quantitative): numerical data with a logical order, like variables data, but that can take only non-negative integer values (i.e., the number of defects on a product or the number of people entering a café per hour).
   • No → Is the data in order?
      • Yes → Ordinal (qualitative): data in order, like counts-discrete, but the “true” value of the intervals (differences) between the categories cannot be known (i.e., Likert scales, preferences, order at the finish line).
      • No → How many categories?
         • 3+ → Counts, nominal (attributes): like binary data, but the phenomenon can have many features (i.e., the number of cars of a specific colour in a parking lot or the number of visits to various countries).
         • 2 → Binary (attributes).
(Statistical POWER increases as you move from nominal data up towards variables data.)
The analysis of variables data should be conducted with caution. This is because variables data is the type with the most assumptions that need to be checked.
Figure 20. Map for choosing between parametric and non-parametric tests (key questions: is the data normally distributed? if not, are you happy to proceed?)
This chapter will help you reach the point of choosing between parametric and
non-parametric tests. These will be then discussed in separate chapters.
We will use this example to show how variables data can be analysed as well
as how we can check for normality in Microsoft Excel and Minitab.
ID 1 to 20, Length (m): 17.3, 14.5, 17.5, 18.6, 18.0, 19.2, 19.0, 16.4, 21.2, 20.3, 20.6, 18.4, 22.1, 21.3, 14.4, 14.9, 20.0, 18.4, 20.8, 18.3

Step 1: Arrange your data in a single column (as shown on the left).
Step 2: Open the Data Analysis window and select “Descriptive statistics”.
Step 3: In the Input range, insert all the cells that you are interested in analysing. Select the title “Length (m)” too, and tick the box “Labels in the first row”.
Step 4: In the Output range, insert an empty cell.
Step 5: Tick the box “Summary statistics”.
If you follow this process, Microsoft Excel will automatically produce some
descriptive statistics for your data. This can be seen below.
Minitab calculates the same values as Microsoft Excel. It can also immediately calculate the quartiles and the interquartile range of the dataset, which will be helpful in constructing the boxplot.
Boxplot value / Formula in Excel:
Quartile 1 (Q1) = 17.35, = QUARTILE.EXC(Dataset,1)
Quartile 2 (Median, Q2) = 18.5, = QUARTILE.EXC(Dataset,2)
Quartile 3 (Q3) = 20.525, = QUARTILE.EXC(Dataset,3)
IQR = 3.175 = Q3 − Q1
[Figure: boxplot of the dataset with Q1 = 17.35, Q3 = 20.525 and IQR = 3.175; each region between the quartiles and the whiskers contains 25% of the data]
Note that we have to use some measures from the table with the descriptive statistics that has been presented above (mean, min, max). The first step is to calculate the limits of our dataset beyond which we will treat values as outliers.
In our example, it seems that there are no outliers. Both the min and the max
values are within the boundaries set by the two values above.
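For completeness, a Python sketch that reproduces the quartiles, the IQR and an outlier check for the 20 length measurements; the text does not state the fence formula, so the conventional 1.5 × IQR rule is assumed here.

import statistics

lengths = [17.3, 14.5, 17.5, 18.6, 18.0, 19.2, 19.0, 16.4, 21.2, 20.3,
           20.6, 18.4, 22.1, 21.3, 14.4, 14.9, 20.0, 18.4, 20.8, 18.3]

q1, q2, q3 = statistics.quantiles(lengths, n=4)   # exclusive method, like QUARTILE.EXC
iqr = q3 - q1                                     # 20.525 - 17.35 = 3.175
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # assumed 1.5*IQR outlier fences
print(q1, q2, q3, iqr)
print(min(lengths) < lower or max(lengths) > upper)   # False -> no outliers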
Step 4: In the Input range add the raw data, and in the Bin range add the bins that you have calculated (including the headings).
Step 5: Tick the box “Labels”.
Step 6: In the Output range insert an empty cell.
Step 7: Tick the box “Chart output”.
Microsoft Excel will then automatically calculate the number of values that fall into the classes you have created and will produce the histogram of your data.
As we will see later, in Minitab we can generate the histogram along with the
test for normality which will be discussed next.
The p-value of the Anderson-Darling test is 0.463, which is greater than 0.05; we therefore fail to reject normality and can treat the dataset as coming from a normal distribution. Note that Minitab generates additional measures, such as the histogram and the boxplot values, automatically.
The following section will present the χ2 – test “Goodness of fit”, which is an
alternative that can be used in Microsoft Excel.
Distribution (Model): χ2
P-value = CHISQ.DIST.RT(χ2,ν)
Notes
When testing for normality, the result of the test will be affected by the way the classes have been formed. It is suggested to apply the formula 1 + 3.322·log10(n) to calculate the number of classes, as has been shown above.
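A small Python sketch of that class-count guideline (a Sturges-type rule), applied to the 20-value length dataset; the rounding choice is an assumption.

from math import log10, ceil

n = 20                                   # number of observations
classes = 1 + 3.322 * log10(n)           # ~5.3
print(ceil(classes))                     # round up to 6 classes (rounding is a judgment call)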
The answer comes from the “golden rule” of the Central Limit Theorem:
“ . . . for any distribution with a well-defined mean and variance and given
random and independent samples of n observations each, the distribution of
sample means approaches normality as the size of n increases, regardless of
the shape of the population distribution.”
Figure 23. Sample size impact on the sampling distribution of the means (from n = 1 to n > 30)
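The Central Limit Theorem can be illustrated with a quick simulation; the skewed exponential population below is an arbitrary choice, not taken from the book.

import random, statistics

random.seed(1)
population = [random.expovariate(1.0) for _ in range(100_000)]   # a clearly skewed population

means = [statistics.mean(random.sample(population, 30)) for _ in range(2_000)]
print(statistics.mean(means))    # close to the population mean (~1.0)
print(statistics.stdev(means))   # close to sigma/sqrt(30); the sample means look roughly normal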
When the sample size is less than 30, like in the example we used above where
we had only 20 observations, try to increase the sample. If this is not possible,
check whether the expected values meet the assumptions of the test. In our case,
all the values are larger than 1 (the last one is very close), with one value being
larger than 5 and one very close to it. Thus, although a borderline case, we could
proceed with caution.
2. Check transformed data for normality (if not normal revisit step 1).
Testing means or variances?
• Variances: how many?
   • 2 → F-test (→ Page 80)
   • >2 → Bartlett’s test* (→ Page 88)
• Means: how many?
   • 1 → 1-sample t-test (→ Page 78)
   • 2 → In pairs? Yes → Paired t-test (→ Page 86). No → F-test (→ Page 80): equal variances? Yes → 2-sample t-test (→ Page 82); No → Welch’s t-test (→ Page 84)
   • >2 → Bartlett’s test* (→ Page 88): equal variances? Yes → One-way ANOVA (→ Page 90)
* Difficult to run in Excel
10.1 χ2 – test
Aim: To test whether a sample derives from a population with a specified
variance.
Hypothesis: Ho: σ2 = σo2
H1: σ2 ≠ σo2
Degrees of freedom: ν = n − 1
Where:
n: The sample size
Distribution (Model): χ2
Confidence intervals: (n − 1)·σ̂² / χ²(1−α/2, ν)  to  (n − 1)·σ̂² / χ²(α/2, ν)
Notes
This test is very sensitive to departures from normality. Minitab also offers a p-value based on Bonett’s confidence interval, which may be useful to check as well, especially if the data is non-normal.
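A hedged Python sketch of this test (assuming SciPy is available); the statistic (n − 1)·σ̂²/σ0² is the standard one for a single variance, and the sample values are made up.

from scipy.stats import chi2
import statistics

sample = [1.52, 1.48, 1.50, 1.47, 1.53, 1.49, 1.51, 1.46]   # hypothetical measurements
sigma0_sq = 0.02**2                                          # specified variance

n = len(sample)
stat = (n - 1) * statistics.variance(sample) / sigma0_sq
p_value = 2 * min(chi2.cdf(stat, n - 1), chi2.sf(stat, n - 1))   # two-sided
print(stat, p_value)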
Hypothesis: Ho: μχ = μ0
H1: μχ ≠ μ0
Test statistic:
t = (x̄ − μ0) / (σ̂ / √n)
Where:
x̄: Sample mean
μ0: Specified population mean value
σ̂: Sample standard deviation
n: Sample size
Confidence intervals: x̄ ± t(α/2, ν) · σ̂ / √n
Notes
In most cases it will be preferred to the z-test, as the population variance is rarely known with certainty.
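With SciPy the whole test is a single call; the data below is hypothetical.

from scipy.stats import ttest_1samp

sample = [1.52, 1.48, 1.50, 1.47, 1.53, 1.49, 1.51, 1.46]   # hypothetical measurements
result = ttest_1samp(sample, popmean=1.5)                   # two-tailed by default
print(result.statistic, result.pvalue)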
10.3 F-test
Aim: To test whether two independent samples derive from population
distributions with equal variances.
Distribution (Model) F
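There is no single SciPy call for the classic two-variance F-test, so a short sketch built from the F distribution is shown instead; the ratio of the two sample variances is the standard statistic, and the samples are hypothetical.

from scipy.stats import f
import statistics

a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]    # hypothetical sample 1
b = [12.6, 11.2, 12.9, 11.5, 12.8, 11.4]    # hypothetical sample 2

F = statistics.variance(a) / statistics.variance(b)
dfa, dfb = len(a) - 1, len(b) - 1
p_value = 2 * min(f.cdf(F, dfa, dfb), f.sf(F, dfa, dfb))   # two-sided
print(F, p_value)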
Hypothesis: Ho: μ1 = μ2
H1: μ1 ≠ μ2
Confidence intervals: (x̄1 − x̄2) ± t(α/2, ν) · σ̂ · √(1/n1 + 1/n2)
Minitab Path: Stat → Basic Statistics → 2-Sample t (Select options and tick the box “Assume equal variances”)
Notes*
Calculating the 𝜎̂ is challenging. Use the Data Analysis ToolPak.
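A SciPy sketch of the equal-variance (pooled) 2-sample t-test; the data is hypothetical.

from scipy.stats import ttest_ind

a = [21.1, 22.4, 20.8, 23.0, 21.9]        # hypothetical sample 1
b = [22.5, 23.8, 22.1, 24.0, 23.2]        # hypothetical sample 2
result = ttest_ind(a, b, equal_var=True)  # pooled-variance version
print(result.statistic, result.pvalue)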
Hypothesis: Ho: μ1 = μ2
H1: μ1 ≠ μ2
Confidence intervals: (x̄1 − x̄2) ± t(α/2, ν) · √(σ̂1²/n1 + σ̂2²/n2)
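The same SciPy call with equal_var=False gives Welch's version, which does not assume equal variances; hypothetical data again.

from scipy.stats import ttest_ind

a = [21.1, 22.4, 20.8, 23.0, 21.9]
b = [22.5, 23.8, 22.1, 24.0, 26.9, 19.0]   # hypothetical, more spread out
print(ttest_ind(a, b, equal_var=False))    # Welch's t-test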
Hypothesis: Ho: μ1 = μ2
H1: μ1 ≠ μ2
Confidence intervals: d̄ ± t(α/2, ν) · σ̂d / √n
Notes
Samples need to be of equal size and paired.
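A SciPy sketch of the paired t-test on hypothetical before/after measurements of equal length:

from scipy.stats import ttest_rel

before = [70.1, 68.4, 72.3, 69.8, 71.0]
after = [69.0, 67.9, 71.1, 69.5, 70.2]      # paired with 'before', same subjects
print(ttest_rel(before, after))             # tests the mean difference against zero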
Test statistic
P-value = CHISQ.DIST.RT(B,ν)
Minitab Path: Stat → ANOVA → Test for equal variances (Select options
and tick the box “Use test based on normal distribution”)
Notes
This test is difficult to run in Excel as we cannot automate the calculations.
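Since the Bartlett statistic is tedious to automate in Excel, a one-call SciPy sketch may be a useful cross-check; the three groups below are hypothetical.

from scipy.stats import bartlett

g1 = [10.1, 9.8, 10.4, 10.0, 9.9]
g2 = [10.6, 10.2, 10.9, 10.5, 10.4]
g3 = [9.5, 9.9, 9.7, 10.0, 9.6]
stat, p_value = bartlett(g1, g2, g3)    # H0: all group variances are equal
print(stat, p_value)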
Hypothesis: Ho: μ1 = μ2 = μ3 = … = μk
H1: at least one mean differs
Distribution (Model): F
P-value = F.DIST.RT(F, νΒ, νW)
Notes
You can use the Data Analysis ToolPak to run the test
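The same kind of comparison via SciPy's one-way ANOVA (only a sketch; the Data Analysis ToolPak gives the full ANOVA table). Hypothetical groups:

from scipy.stats import f_oneway

g1 = [10.1, 9.8, 10.4, 10.0, 9.9]
g2 = [10.6, 10.2, 10.9, 10.5, 10.4]
g3 = [9.5, 9.9, 9.7, 10.0, 9.6]
print(f_oneway(g1, g2, g3))     # H0: all group means are equal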
Hypothesis: Ho: μ1 = μ2 = μ3 = … = μk
Test statistic:
F = [ (1/(k−1)) · Σj wj (x̄j − x̄′)² ] / [ 1 + (2(k−2)/(k²−1)) · Σj (1/(nj−1)) · (1 − wj/w)² ]
Where:
wj = nj / sj²     w = Σj wj     x̄′ = Σj wj x̄j / w
Degrees of freedom:
ν = (k² − 1) / [ 3 · Σj (1/(nj−1)) · (1 − wj/w)² ]
Distribution (Model): F
P-value = F.DIST.RT(F, k−1, ν)
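This Welch-type ANOVA is not a single call in older SciPy releases, so the formula above can be coded directly; the sketch below follows the equations as written and uses hypothetical groups.

from scipy.stats import f
import statistics

def welch_anova(groups):
    k = len(groups)
    n = [len(g) for g in groups]
    m = [statistics.mean(g) for g in groups]
    s2 = [statistics.variance(g) for g in groups]
    w = [n_i / s2_i for n_i, s2_i in zip(n, s2)]          # w_j = n_j / s_j^2
    w_sum = sum(w)
    grand = sum(w_i * m_i for w_i, m_i in zip(w, m)) / w_sum
    num = sum(w_i * (m_i - grand) ** 2 for w_i, m_i in zip(w, m)) / (k - 1)
    tail = sum((1 / (n_i - 1)) * (1 - w_i / w_sum) ** 2 for n_i, w_i in zip(n, w))
    F = num / (1 + 2 * (k - 2) / (k ** 2 - 1) * tail)
    nu = (k ** 2 - 1) / (3 * tail)
    return F, f.sf(F, k - 1, nu)

print(welch_anova([[10.1, 9.8, 10.4, 10.0], [10.6, 10.2, 10.9, 10.5], [9.5, 9.9, 9.7]]))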
Analyse attributes data: what type of attributes, and how many proportions?
• Binary, 1 proportion → 1-proportion test (→ Page 96)
• Binary, 2 proportions → Paired? → 2-proportion test (→ Page 98)
• 1 column of counts → χ2 Goodness of fit (→ Page 102)
Note that attributes data, especially nominal, is the least flexible type of data when it comes to applying statistical analysis. For accuracy purposes, we prefer measuring, if possible, at the variables or at least the ordinal level.
For example, the following figure shows the number of trips (counts - nominal
data) that a travel agent has organized towards six popular destinations
throughout a year. As a reference for comparison, the targets that the company set at the beginning of the year have been plotted too.
[Figure: bar chart of No. of trips vs Target per destination]
Destination: Norway, Czech Republic, Italy, Spain, United Kingdom, Greece
No. of trips: 26, 28, 34, 48, 29, 37
Target: 25, 35, 40, 30, 25, 35
We can see that for some destinations the company met the targets while for
others it did not. The question then is whether the differences observed between the targets and the number of trips are significant or simply an outcome of random variation (pure chance). You can also ask whether there is any preference towards a particular destination among your customers. These types of questions are the focus of the inferential statistics that we will explore below.
Hypothesis: Ho: p = p0
H1: p ≠ p0
Assumptions: n·p̂ ≥ 10 and n·(1 − p̂) ≥ 10 *
Test statistic:
z = (p̂ − p0) / σp, where σp = √(p0 · (1 − p0) / n)
Where:
p̂: Sample proportion
p0: Specified population proportion value
n: Sample size
Confidence intervals: p̂ ± z(α/2) · √(p̂ · (1 − p̂) / n)
Notes
If you cannot meet the assumptions, you should try to increase the sample size.
If not possible, you can use 5, instead of 10, as a minimum required value, but
in this case, proceed with caution.
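A Python sketch of the 1-proportion z-test, with the standard error based on the hypothesized proportion p0 (an assumption on my part, since the extracted formula is ambiguous); the counts are hypothetical.

from statistics import NormalDist
from math import sqrt

x, n, p0 = 58, 100, 0.5                 # hypothetical: 58 successes out of 100, testing p = 0.5
p_hat = x / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 2 * NormalDist().cdf(-abs(z))
margin = NormalDist().inv_cdf(0.975) * sqrt(p_hat * (1 - p_hat) / n)
print(z, p_value, (p_hat - margin, p_hat + margin))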
Hypothesis: Ho: p1 = p2
H1: p1 ≠ p2
Test statistic:
z = [(p̂1 − p̂2) − k] / σ̂(p̂1−p̂2), where σ̂(p̂1−p̂2) = √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)
Where:
p̂i: Proportion of sample i
ni: Size of sample i
k: Hypothesized difference between the two proportions (0 under Ho)
Notes
If you cannot meet the assumptions, you should try to increase the sample size.
If not possible, you can use 5, instead of 10, as a minimum required value, but
in this case, proceed with caution.
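And a sketch of the 2-proportion z-test with k = 0 (no hypothesized difference); hypothetical counts.

from statistics import NormalDist
from math import sqrt

x1, n1, x2, n2 = 45, 120, 62, 130           # hypothetical successes / sample sizes
p1, p2 = x1 / n1, x2 / n2
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = (p1 - p2) / se
print(z, 2 * NormalDist().cdf(-abs(z)))     # two-tailed p-value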
Test statistic:
x2 = (ABS(StF − FtS) − 1)² / N
Where:
StF: No. of proportions that changed from Success to Failure
FtS: No. of proportions that changed from Failure to Success
N = StF + FtS
P-value* = CHISQ.DIST.RT(x2, ν)
Confidence intervals: δ̂ ± (z(α/2) · SE + 1/n)
Where:
n: No. of paired comparisons made
δ̂ = (StF − FtS) / n
SE = √(StF + FtS − n·δ̂²) / n
Notes*
A p-value is also offered by the formula 2*BINOM.DIST(p, N, 0.5, TRUE), where p is the StF frequency. The minimum of the two generated p-values can then be used as the p-value.
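The statistic above is simple enough to code directly; SciPy’s chi-square distribution supplies the p-value (the single degree of freedom is the standard choice for this McNemar-style test). StF and FtS below are hypothetical counts of pairs that changed category.

from scipy.stats import chi2

stf, fts = 18, 7                                   # hypothetical Success->Failure / Failure->Success counts
n_changed = stf + fts
stat = (abs(stf - fts) - 1) ** 2 / n_changed       # continuity-corrected statistic from the text
print(stat, chi2.sf(stat, df=1))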
Distribution (Model) Χ2
P-value =CHISQ.DIST.RT(χ2,ν)
Notes
There are different ways to calculate the expected values. You can use the
average of the counts to be tested or use target counts (i.e., targets or historical
data) that are converted to proportions as a comparison basis. The example that
is presented uses both approaches.
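A sketch of the goodness-of-fit test applied to the travel-agent counts shown earlier, with the targets rescaled to proportions as the notes describe (SciPy assumed):

from scipy.stats import chisquare

observed = [26, 28, 34, 48, 29, 37]                     # No. of trips per destination
targets = [25, 35, 40, 30, 25, 35]                      # company targets
expected = [t / sum(targets) * sum(observed) for t in targets]   # targets as proportions

stat, p_value = chisquare(observed, f_exp=expected)
print(stat, p_value)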
Distribution (Model): χ2
P-value = CHISQ.DIST.RT(χ2,ν)
Notes
In order to run this test in Microsoft Excel you need to set up a two-way table with the observed frequencies and then calculate the expected values. For proportions this takes the form of a 2 (success/failure) × k table, where k is the number of proportions to be tested.
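For the test of independence, SciPy takes the two-way table of observed frequencies directly and returns the expected values as well; the table below is hypothetical.

from scipy.stats import chi2_contingency

table = [[40, 35, 25],     # hypothetical: successes per group
         [60, 65, 75]]     # failures per group
stat, p_value, dof, expected = chi2_contingency(table)
print(stat, p_value, dof)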
Ordinal data is like attributes data. However, its additional characteristic of being in order allows us to run more advanced statistics. These usually utilize the ranks of the data and compare medians.
[Map for ordinal data: how many samples? in pairs?]
Test statistic
Notes*
The 2-sample sign test follows the same process but applies the test to the values
that derive from the paired differences of the two samples (paired test).
Ui = SRi − ni(ni + 1)/2        μu = n1·n2 / 2
σu = √[ (n1·n2 / 12) · ((n1 + n2 + 1) − adj) ]
Where:
SRi: Sum of relative ranks for sample i
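SciPy’s mannwhitneyu implements this test (the normal approximation and tie adjustment are handled internally); a sketch with hypothetical ordinal scores:

from scipy.stats import mannwhitneyu

group_a = [3, 4, 2, 5, 4, 3, 4]          # hypothetical Likert-type scores
group_b = [2, 3, 2, 3, 1, 2, 3]
print(mannwhitneyu(group_a, group_b, alternative="two-sided"))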
Hypothesis: Ho: η1 = η2 = η3 = … = ηk
Test statistic
Where:
O: Observed deviations from the Grand Median*
Distribution (Model): χ2
P-value = CHISQ.DIST.RT(χ2,ν)
Notes*
Calculate the median of all the data (the Grand Median). Then, create a two-way table that counts how many values are larger than the Grand Median as well as how many are smaller than or equal to it. Treat this table as your observed values in a χ2 Test of Independence.
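SciPy also ships a median test that builds exactly this kind of above/below table; a sketch with hypothetical samples:

from scipy.stats import median_test

g1 = [12, 15, 14, 10, 18, 13]
g2 = [16, 19, 17, 21, 15, 18]
g3 = [11, 14, 12, 13, 15, 10]
stat, p_value, grand_median, table = median_test(g1, g2, g3)
print(stat, p_value, grand_median)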
How many medians are compared?
• 1 → Check for symmetry (→ Page 62). Symmetric and free of outliers? Yes → 1-sample Wilcoxon test (→ Page 118); No → 1-sample sign test (→ Page 108)
• 2 → In pairs? Yes → 2-sample Wilcoxon test (→ Page 118); No → Levene’s test (→ Page 116): equal variances and similar shape? Yes → Mann-Whitney test (→ Page 110); No → Mood’s Median Test (→ Page 112)
• >2 → Levene’s test (→ Page 116): equal variances and no outliers? Yes → Kruskal-Wallis Test (→ Page 120); No → Mood’s Median Test (→ Page 112)
$SE = \sqrt{\dfrac{n_o(n_o + 1)(2n_o + 1)}{24}}$
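For illustration, the Python sketch below (assuming SciPy is available, with hypothetical data) combines the signed-rank statistic with this standard error in a normal approximation:

from scipy.stats import rankdata, norm

# Hypothetical sample and hypothesised median
values = [12.1, 14.3, 9.8, 15.2, 11.7, 13.4, 10.9, 14.8, 12.6, 13.1]
eta0 = 12.0

diffs = [v - eta0 for v in values if v != eta0]   # zero differences are dropped
n0 = len(diffs)
ranks = rankdata([abs(d) for d in diffs])

w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
mean_w = n0 * (n0 + 1) / 4
se = (n0 * (n0 + 1) * (2 * n0 + 1) / 24) ** 0.5

z = (w_plus - mean_w) / se
p_value = 2 * norm.sf(abs(z))
print(w_plus, p_value)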
Hypothesis: Ho : η1 = η2 = η3 = ⋯ = ηk
Test statistic: $K = \dfrac{12}{N(N+1)} \sum_{i=1}^{k} \dfrac{SR_i^2}{n_i} - 3(N+1)$
Where:
N : Total no. of values
SRi : Sum of relative ranks for sample i
ni : Size of sample i
Distribution (Model) χ2
P-value = CHISQ.DIST.RT(K,ν)
Notes
It is important that your data has no outliers as this can distort the test.
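For illustration, a minimal Python sketch (assuming SciPy is available) computes K and its p-value from hypothetical samples of three groups:

from scipy.stats import rankdata, chi2

# Hypothetical samples from three groups
groups = [[23, 41, 54, 66, 78], [45, 55, 60, 70, 72], [18, 30, 34, 40, 44]]

all_values = [v for g in groups for v in g]
ranks = rankdata(all_values)
N = len(all_values)

K, start = 0.0, 0
for g in groups:
    sr = sum(ranks[start:start + len(g)])    # sum of relative ranks for the group
    K += sr ** 2 / len(g)
    start += len(g)
K = 12 / (N * (N + 1)) * K - 3 * (N + 1)

dof = len(groups) - 1
print(K, chi2.sf(K, dof))                     # CHISQ.DIST.RT(K, ν)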
In the previous sections we focused mainly, yet not entirely, on the analysis of
data that is related to a single variable. We were not particularly concerned
about potential relationships or causations between the different groups that were
compared. However, sometimes it may be necessary to determine whether two
or more variables are related to each other, and if so, in what way and by how
much. Answers to such questions are given by what is known as regression
analysis.
There are many types of regression analysis, the choice of which depends on
what we are trying to model. Things that need to be considered include, among
others, the type of data, the number of variables to be modelled, and the
nature of the relationship. The following map aims to help you choose a suitable
approach.
Start: What type of y?
• Variables — How many xs?
  – 1: Simple linear regression modelling (→ Page 125). If a good fit is not achieved, try nonlinear regression (→ Page 135).
  – 2+: Multiple linear regression (→ Page 140). If a good fit is achieved, stop; if not, check for multicollinearity and consider partial least squares (Minitab).
• Attributes:
  – Ordinal: Ordinal logistic regression (Minitab)
  – Nominal: Nominal logistic regression (Minitab)
• Counts (Numerical): Poisson regression (Minitab)
Within the scope of this book, we will focus on regression techniques for
variables data, which are most commonly used in practice.
When there are only two variables to be modelled, one independent and one
dependent, the first step is to try to fit a simple linear model. The process is
illustrated below.
1. Plot the relationship
2. Fit a regression model
3. Evaluate the model
4. Make predictions
The following table presents the sample data collected by the
marketing department.
A/A Spending (x) Revenue (y) A/A Spending (x) Revenue (y)
1 1784.58 5995.48 13 2117.26 8805.13
2 2576.63 11087.70 14 1985.55 7526.36
3 3148.32 12850.94 15 3005.69 12546.28
4 2224.88 8072.03 16 3014.65 13589.67
5 2676.73 9871.44 17 2856.69 10989.62
6 1814.69 7137.16 18 1958.00 6139.10
7 2425.06 10457.57 19 1896.59 6759.36
8 2475.72 8330.34 20 2646.43 11809.99
9 2378.58 9243.69 21 2876.58 12586.34
10 2755.48 11542.70 22 2250.69 9986.58
11 2563.37 9725.25 23 2895.68 12036.67
12 2158.00 7256.00 24 2370.90 8482.15
Table 16. Marketing spending and revenue generated for period x
As you can see, for each spending value we have a related revenue value. Once
the table is created, the two variables can be analysed individually. We
can apply descriptive statistics to summarize the data, check for normality and,
if necessary, run some additional hypothesis tests.
Continuing then with our regression analysis, we can plot the data on a scatter
diagram to visually represent the relationship between the two variables. On the
x axis of the diagram we plot the independent variable (i.e., marketing
investment), while on the y axis we add the dependent one (i.e., generated
income).
Scatter diagram: Revenue (y) against Spending (x)
In our case, there are no outliers, and it seems that there is a positive monotonic
association between the two variables; the higher the marketing spending the
higher the generated revenue. It also seems that there is a linear relationship
between the two. Such visual checks are important before you proceed to any
further analysis.
𝑦̂ = 𝑎 + 𝑏𝑥
Although the full derivation of these coefficients is beyond the scope of this book, it is
important to understand that the “line of best fit” is the one that minimizes the sum of the
squared deviations (least squares) between the predicted values $\hat{y}_i$ and the observed
values $y_i$, also known as errors or residuals. This happens when:
$b = \dfrac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$   and   $a = \dfrac{\sum x^2 \sum y - \sum x \sum xy}{n \sum x^2 - (\sum x)^2}$
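These formulas are simple to implement. The Python sketch below uses a few hypothetical spending/revenue pairs; applying it to the full columns of Table 16 should reproduce the coefficients that Excel reports:

def least_squares(x, y):
    # Slope b and intercept a of the least-squares line y-hat = a + b*x
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi ** 2 for xi in x)
    b = (sxy - sx * sy / n) / (sxx - sx ** 2 / n)
    a = (sxx * sy - sx * sxy) / (n * sxx - sx ** 2)
    return a, b

# Hypothetical spending/revenue pairs; the full Table 16 columns can be used instead
x = [1784.58, 2576.63, 3148.32, 2224.88]
y = [5995.48, 11087.70, 12850.94, 8072.03]
print(least_squares(x, y))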
Luckily for us, we do not have to do the math as Microsoft Excel can
automatically produce the linear equation. The following process should be
followed:
1. Select the scatterplot and click on the plus symbol next to it. Then
tick the box “Trendline”.
2. Right-click on the trendline and then select the option “Format
trendline…”.
$\hat{\sigma} = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n-2}}$
The table at the bottom provides the values a and b of the regression line.
However, the question is whether we can trust these values. To answer that we
need to conduct some checks.
1. Predictive power (R2). This is the same value as the one generated in
the scatterplot above. If it is low, we can conclude that the chosen model
is not suitable for the data.
2. Significance of the model. If the p-value for the F-test in the second
table is less than 5%, then we can reject Ho: β = 0 in favour of
H1: β ≠ 0. This suggests that the model is significant in the population.
The residuals should be randomly distributed around and across the line
for the significance value to be unbiased. If special observations exist,
you may wish to transform the variables.
$\hat{Y} \pm t_{1-\alpha/2,\,\nu}\, SE_{\hat{Y}}$
$SE_{\hat{Y}} = \hat{\sigma}\sqrt{1 + \dfrac{1}{n} + \dfrac{(x_{estimated} - \bar{x})^2}{s_x^2 (n-1)}}$
Where:
$\hat{Y}$ : The result from applying the linear equation (point prediction)
ν = n – 2
$s_x^2$ : Variance of the independent sample data x
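A Python sketch of this interval, assuming SciPy is available; the slope and intercept a and b are taken from a fit such as the earlier least-squares sketch:

from math import sqrt
from statistics import mean, variance
from scipy.stats import t

def prediction_interval(x, y, a, b, x_new, alpha=0.05):
    # Interval for a single new observation at x_new, given the fitted line y-hat = a + b*x
    n = len(x)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sigma_hat = sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    se = sigma_hat * sqrt(1 + 1 / n + (x_new - mean(x)) ** 2 / (variance(x) * (n - 1)))
    t_crit = t.ppf(1 - alpha / 2, n - 2)
    y_hat = a + b * x_new
    return y_hat - t_crit * se, y_hat + t_crit * se

# Hypothetical usage: a, b = least_squares(x, y); print(prediction_interval(x, y, a, b, 2550))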
Let us assume, for example, that the marketing department wants to estimate how
much revenue would be generated if 2550€ and 3200€ are spent on advertising.
We can state with 95% confidence that if the marketing department spends
2550€, they will generate something between 8507€ and 11920€ in revenue. If they
spend 3200€, they will generate something between 11802€ and 15445€ in revenue.
They can now make a decision that is based on mathematical expectations.
15 Non-linear relationships
As a general principle, we should always try to use the simplest model possible. This is
because advanced models might be able to describe peculiar relationships, but they also
create problems of complexity that can be difficult to deal with.
Order   Model       Equation                        Transformation
1       Quadratic   y = a x² + b x + c              y → a x² + b x + c
2       Cubic       y = a x³ + b x² + c x + d       y → a x³ + b x² + c x + d
We usually prefer models that require the minimum level of transformation and,
if possible, these transformations should be applied to the x and not the y variable. This is
because the errors in the regression equation are errors in the y-value. If the y-
value is transformed, so are the errors. As a result, the assumption that the
residuals are normally distributed can be silently invalidated, and thus
unexpected issues may arise.
For the reasons stated above, the polynomial models (i.e., quadratic, cubic) are
more frequently used compared to other types of non-linear models.
We can see that the distribution of the points is quite peculiar compared to the
line. Therefore, although the R2 is relatively high, we may wish to test a
quadratic relationship. This can be done if we right-click on the line of
the scatterplot and then select the option “Format trendline…”.
The quadratic line offers a higher R2 value compared to the linear line. As a
result, it seems a better model to use, and thus we can proceed to the testing
phase. The process is exactly the same as with the simple linear regression. The
difference is that we need to create a column of x2 values and then use both
columns (x and x2) as the Input x Range in the Data Analysis ToolPak.
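Outside Excel, a rough Python equivalent of that step, assuming NumPy is available and using hypothetical data, adds the x2 column and solves the least-squares system directly:

import numpy as np

# Hypothetical x/y data showing curvature
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 7.2, 11.8, 18.3, 26.1, 36.2, 47.9])

# Equivalent of adding an x^2 column next to x in the Data Analysis ToolPak
X = np.column_stack([np.ones_like(x), x, x ** 2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)   # [c, b, a] for y = a x^2 + b x + c

y_hat = X @ coeffs
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(coeffs, r2)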
If the model is not valid, we can try another one that may offer a better fit to the
data. The most common choice is to increase the order of the polynomial
regression up to the point where a further increase in the order no longer offers a
significant increase in the R2 value. In our example, a cubic relationship would offer an R2
value of 0.8803, which is higher compared to the 0.8744 of the quadratic line.
The question is, is it worth adding an additional layer of complexity for such
a small increase, or are we simply overfitting the line to the data? In this case,
the quadratic line is probably the best choice.
𝑆𝐸𝑦̂ = 𝜎̂ ∗ 1.1
This will not give the exact estimate, but it can be a relatively good
approximation, as can be seen in the example above.
$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k$
Although the mathematics for calculating the coefficients is more advanced
than for simple linear regression, the overall logic is the same. The main
difference is that we need to investigate the importance of each variable
individually, as well as the potential interactions between them.
Correlation matrix (excerpt):
              Temperature   Humidity
Temperature   1
Humidity      0.735         1
All three predictors seem to be highly correlated to the response, but at the same
time there seems to be a relatively strong correlation between temperature and
humidity. In such cases, we should investigate the correlation between the two
elements and act accordingly:
SUMMARY OUTPUT
ν=n–k–1
Where:
k: number of variables x.
ANOVA
              df    SS             MS            F      Significance F
Regression    3     106983213.0    35661071.0    62.4   0.000
Residual      20    11424749.9     571237.49
Total         23    118407962.9
Ho: β1 = β2 = … = βk = 0 (all the slopes are equal to zero).
H1: At least one βi ≠ 0 (not all the slopes are equal to zero).
The p-value of the ANOVA (F) test shows whether the model is significant overall. If we
fail to reject Ho, then the analysis should stop here, as the model is not
significant in the population. However, if Ho is rejected, we need to check the
individual tests for each independent variable. That is, a rejection of Ho
suggests that the model has some overall predictive power, but does this mean
that all independent variables can predict the dependent one?
The individual p-values suggest that the predictors “Spending” and “No. of calls”
are significant, while the predictor “Bonuses” is not significant. Therefore, we
need to re-run the regression without considering it.
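A Python sketch of this workflow, assuming the statsmodels package is available and using synthetic data in which one predictor has no real effect:

import numpy as np
import statsmodels.api as sm

# Hypothetical data: three candidate predictors and a response
rng = np.random.default_rng(1)
n = 24
spending = rng.normal(2500, 400, n)
calls    = rng.normal(120, 20, n)
bonuses  = rng.normal(50, 10, n)
revenue  = 2.5 * spending + 15 * calls + rng.normal(0, 800, n)   # bonuses has no effect

X = sm.add_constant(np.column_stack([spending, calls, bonuses]))
model = sm.OLS(revenue, X).fit()
print(model.pvalues)            # individual t-test p-values for each coefficient

# Re-run without the non-significant predictor (here, bonuses)
X2 = sm.add_constant(np.column_stack([spending, calls]))
print(sm.OLS(revenue, X2).fit().summary())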
The predictors of the new model are all significant and thus we can use it for
making predictions. Note that if a multiple linear model is not suitable, we may
have to apply a non-linear system with advanced mathematical formulas.
𝑆𝐸𝑦̂ = 𝜎̂ ∗ 1.1
This will not give the exact estimate, but it can be a relatively good
approximation, as can be seen in the example above.
By conducting such an analysis, the data analyst can understand how the phenomenon
behaves and thus make predictions about what is expected to happen in the
future.
Year   Quarter   Sales       Year   Quarter   Sales
2015   1         £74,841     2017   1         £108,694
2015   2         £70,324     2017   2         £104,969
2015   3         £59,016     2017   3         £92,851
2015   4         £53,249     2017   4         £80,099
2016   1         £87,976     2018   1         £138,898
2016   2         £79,822     2018   2         £115,678
2016   3         £77,859     2018   3         £102,534
2016   4         £67,409     2018   4         £95,432
The years have been broken down into quarters and for each quarter the
accumulated sales revenue has been provided. As data analysts, we are
interested in understanding the relationship between the quarters and how sales
changed over this period. In order to visually illustrate their relationship, you
can either use a scatter plot or a line graph. The latter is used here.
The analysis of the graph clearly illustrates three (+1) key points:
• Seasonality component (St). Within every single year, the first quarter
delivers the highest amount of revenues and then the sales gradually drop
till the first quarter of the following year where they increase again. In
other words, there seems to be a kind of a cycle that is repeated annually.
• Trend component (Tt). The overall sales income increases over the
years. Essentially, if you were to calculate the average value of sales for
each seasonal cycle, then you would find out that these averages increase.
The first step in calculating the moving average is to define the period over
which seasonality is observed. In our case, this period is equal to 4 intervals, or
quarters. Whether the period is an even or an odd number plays an
important role in how the CMA is calculated.
For an even period of 4 quarters, with values a1, b1, c1, d1 for the quarters of year 1 and a2 for the first quarter of year 2, the centred moving average is:
$CMA = \dfrac{0.5\,a_1 + b_1 + c_1 + d_1 + 0.5\,a_2}{4}$
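A minimal Python sketch of this centred moving average, applied to the quarterly sales figures given above:

def centred_moving_average(series, period=4):
    # Centred moving average for an even period (here, 4 quarters):
    # CMA_t = (0.5*y[t-2] + y[t-1] + y[t] + y[t+1] + 0.5*y[t+2]) / 4
    cma = [None] * len(series)
    half = period // 2
    for t in range(half, len(series) - half):
        window = (0.5 * series[t - half]
                  + sum(series[t - half + 1:t + half])
                  + 0.5 * series[t + half])
        cma[t] = window / period
    return cma

sales = [74841, 70324, 59016, 53249, 87976, 79822, 77859, 67409,
         108694, 104969, 92851, 80099, 138898, 115678, 102534, 95432]
print(centred_moving_average(sales))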
The outcome is a set of values that reflect the seasonal indices of the various
periods. Each index represents how much the actual original values (i.e., sales)
deviate from the corresponding baseline of the smoothed CMA values. For
example, in quarter 3 of 2015 the actual sales were 10.6% (1 – 0.894) below the
smoothed average of this period. Similarly, in quarter 1 of 2016 the sales were
21.6% above the average of this period.
The values produced above reflect the seasonal component of the series data
along with the irregularity that comes from the various periods. The next step,
therefore, is to extract the pure seasonal indices. In order to do that, we need to
calculate the average index for each similar interval in the series. The following
table presents these indices for our data.
Quarter   St
1         1.23
2         1.06
3         0.93
4         0.77
Average   0.997
Table 23. Calculating the average seasonal index
For example, the seasonal index for every quarter 1 is equal to
(1.216 + 1.187 + 1.285) / 3 = 1.23.
Keep in mind that the average of the quarterly indices should be close to unity.
Using the seasonal indices, we can now deseasonalise the original data. In
order to do that, we simply divide each value by its seasonal index. The results
can be found in the table below.
t   Sales (Yt)   CA(4)   St (with It)   St   Deseasonalised sales (Yt/St)
1 £74,841 1.23 £60,893
2 £70,324 1.06 £66,183
3 £59,016 £65,999 0.894 0.93 £63,648
4 £53,249 £68,828 0.774 0.77 £69,123
5 £87,976 £72,371 1.216 1.23 £71,580
6 £79,822 £76,496 1.043 1.06 £75,121
7 £77,859 £80,856 0.963 0.93 £83,971
8 £67,409 £86,590 0.778 0.77 £87,504
9 £108,694 £91,607 1.187 1.23 £88,438
10 £104,969 £95,067 1.104 1.06 £98,788
11 £92,851 £100,429 0.925 0.93 £100,138
12 £80,099 £105,543 0.759 0.77 £103,976
13 £138,898 £108,092 1.285 1.23 £113,012
14 £115,678 £111,219 1.040 1.06 £108,866
15 £102,534 0.93 £110,582
16 £95,432 0.77 £123,881
Table 24. Deseasonalising the original data
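The seasonal indices and the deseasonalised series in Table 24 can be reproduced (up to rounding) with a short Python sketch; the centred moving average step is repeated here so the block stands on its own:

sales = [74841, 70324, 59016, 53249, 87976, 79822, 77859, 67409,
         108694, 104969, 92851, 80099, 138898, 115678, 102534, 95432]

# Centred moving average (as in the earlier sketch)
cma = [None] * len(sales)
for t in range(2, len(sales) - 2):
    cma[t] = (0.5 * sales[t - 2] + sales[t - 1] + sales[t]
              + sales[t + 1] + 0.5 * sales[t + 2]) / 4

# Seasonal index (with irregularity) for the periods where a CMA exists
raw_index = [y / c if c else None for y, c in zip(sales, cma)]

# Average the raw indices of matching quarters to get the pure seasonal indices
seasonal = []
for q in range(4):
    vals = [raw_index[t] for t in range(q, len(sales), 4) if raw_index[t] is not None]
    seasonal.append(sum(vals) / len(vals))

# Deseasonalise by dividing each observation by its quarter's seasonal index
deseasonalised = [y / seasonal[t % 4] for t, y in enumerate(sales)]
print([round(s, 3) for s in seasonal])
print([round(d) for d in deseasonalised])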
SUMMARY OUTPUT
Regression Statistics
Multiple R            0.987
R Square              0.974
Adjusted R Square     0.972
Standard Error        3351.556
Observations          16
Trendline equation: y = 4134.7x + 53960.9 (r² = 0.974)
ANOVA
              df    SS             MS             F       Significance F
Regression    1     5812753274     5812753274     517.4   0.000
Residual      14    157260942      11232924
Total         15    5970014216
After we analyse the p-values in the table and the residuals, we can use the
coefficients of the linear regression to forecast future sales. Let’s assume that
we are interested in one year ahead. The first step is to calculate the trend
component (Tt). This can be done by using the equation provided in the
regression analysis.
Tt = 4134.7x + 53960.9
Forecast = Tt * St
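Putting the two equations together, a short Python sketch can generate point forecasts for the next four quarters, using the rounded seasonal indices from Table 23:

# Forecasting the next four quarters (t = 17..20) with the trend equation
# and the seasonal indices from Table 23
seasonal = {1: 1.23, 2: 1.06, 3: 0.93, 4: 0.77}

for t in range(17, 21):
    trend = 4134.7 * t + 53960.9          # Tt from the regression equation
    quarter = (t - 1) % 4 + 1
    forecast = trend * seasonal[quarter]  # Forecast = Tt * St
    print(t, quarter, round(forecast))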
Eventually, this will give us the forecasted values, or in other words the point predictions.
Of course, you can also calculate the confidence intervals for these predictions
by using the Student’s (t) distribution, as discussed above. In that
way you will be able to make better-informed estimations.
Note that we generated forecasted values for the historical period as well.
Although we know the actual data for this period, we may wish to calculate
these values in order to compare them with the actual historical data.
If we have conducted the analysis correctly, the two should be
close, which is a simple logical check to run.
Finally, keep in mind that any projections into the future should not go too far
ahead. This is because the further into the future one goes, the higher the
uncertainty, and thus the less reliable the forecasts are. To quantify how
accurate they are, you can calculate the errors of the estimation by following the
methods presented in the previous section.