
Applied statistics for business using Microsoft Excel (+ Minitab)

A practical workbook on how to gaze long into the abyss of inferential statistics

Evangelos Kitsos

Table of Contents

Preface........................................................................................................................... v
About the author ....................................................................................................... vi
Part A: The theoretical foundations ...................................................................... 7
1 Data collection considerations .................................................................... 8

1.1 Types of data ....................................................................... 8


1.1.1 Attributes data .............................................................................. 8
1.1.2 Ordinal data ................................................................................. 9
1.1.3 Variables data .............................................................................. 9
1.1.4 Counts (discrete): A special case................................................ 10
1.2 Sampling strategies ........................................................... 10
1.3 Quality of measurements ................................................... 12
1.4 Visually inspect your data ................................................. 14
2 Analyse sample data .................................................................................... 16

2.1 Shape (Visual methods) .................................................... 17


2.1.1 Histogram .................................................................................. 17
2.1.2 Bar charts ................................................................................... 17
2.1.3 Pie chart ..................................................................................... 18
2.2 Central tendency................................................................ 19
2.2.1 Arithmetic mean ........................................................................ 19
2.2.2 Median ....................................................................................... 20
2.2.3 Mode .......................................................................................... 20

2.3 Measures of dispersion ...................................................... 20


2.3.1 Range ......................................................................................... 21
2.3.2 Interquartile range ...................................................................... 21
2.3.3 Variance & Standard deviation .................................................. 22
3 Introduction to probability theory.......................................................... 23

3.1 Basics of probability.......................................................... 23


3.1.1 The notion of probability ........................................................... 23
3.1.2 Addition rules and Venn diagrams ............................................. 24
3.1.3 Summing probabilities ............................................................... 26
3.1.4 Sampling with replacement vs no-replacement .......................... 27


3.2 Joint probability .................................................................27


3.2.1 Independent events ..................................................................... 28
3.2.2 Conditional (or dependent) events .............................................. 28
3.3 Expectation ........................................................................29
3.4 Counting rules....................................................................30
3.4.1 Single and different type of events ............................................. 30
3.4.2 Permutation and combination rules ............................................ 31
4 The normal distribution ............................................................................ 33

4.1 Variation is part of our life ................................................34


4.2 Using the normal distribution ............................................34
4.3 The standardized normal distribution ................................36
5 Discrete distributions ................................................................................. 37

5.1 Binomial distribution .........................................................38


5.2 Hypergeometric distribution ..............................................39
5.3 Negative binomial distribution ..........................................40
5.4 Poisson distribution ...........................................................40
6 Applying inferential statistics .................................................................. 41

6.1 Sampling distributions .......................................................42


6.2 Formulate the hypothesis (The trial) ..................................43
6.3 Hypothesis testing (1-sample z-test) ..................................44
6.4 Errors in hypothesis testing................................................48
6.5 Two-tailed vs one-tailed tests ............................................49
6.6 Number of populations involved .......................................50
6.7 Making estimations............................................................52
6.8 Working with p-values ......................................................53
6.9 Degrees of freedom............................................................55
7 A practical process....................................................................................... 56

Part B: Using Microsoft Excel for hypothesis testing ..................................... 57


8 Types of data ................................................................................................. 59
9 Parametric vs non-parametric tests for variables data .................... 61

9.1 Analyse variables data .......................................................62


9.1.1 Descriptive statistics for variables data....................................... 63
9.1.2 Box and whisker plots ................................................................ 65
9.1.3 Visualize the distribution ............................................................ 66


9.2 Tests for Normality ........................................................... 68


9.2.1 Anderson-Darling test ................................................................ 69
9.2.2 χ2 – test “Goodness of fit” .......................................................... 70
9.2.3 Central limit theorem ................................................................. 72

9.3 Transform your data .......................................................... 73


10 Parametric tests for variables data ......................................................... 75

10.1 χ2 – test .............................................................................. 76


10.2 1-sample t-test ................................................................... 78
10.3 F-test ................................................................................. 80
10.4 2-sample t-test ................................................................... 82
10.5 2-sample t-test (Aspin-Welch) .......................................... 84
10.6 Paired t-test ....................................................................... 86
10.7 Bartlett’s test ..................................................................... 88
10.8 One-way ANOVA ............................................................. 90
10.9 Welch One-way ANOVA ................................................. 92
11 Tests for attributes data ............................................................................. 95

11.1 Analyze attributes data ...................................................... 96


11.2 1-Proportion test ................................................................ 97
11.3 2-Proportion test ................................................................ 98
11.4 McNemar’s test ............................................................... 100
11.5 χ2 – Test “Goodness of fit” .............................................. 102
11.6 χ2 – Test of Independence ................................................ 104
12 Tests for ordinal data ................................................................................ 107

12.1 Analyze attributes data .................................................... 107


12.2 1-sample sign test (2-sample sign test*) .......................... 108
12.3 Mann-Whitney test .......................................................... 110
12.4 Mood’s – Median test ...................................................... 112
13 Nonparametric tests for variables data ............................................... 115

13.1 Levene’s test ................................................................... 116


13.2 1-sample Wilcoxon test (+2-sample Wilcoxon)* ............ 118
13.3 Kruskal-Wallis Test......................................................... 120

Part C: Regression analysis................................................................................. 123


14 Simple linear regression .......................................................................... 125

14.1 Plot the relationship......................................................... 125


14.2 Fit a model .......................................................................128


14.3 Test the model .................................................................131
14.4 Make predictions .............................................................134
15 Non-linear relationships .......................................................................... 135

15.1 Types of non-linear models .............................................135


15.2 Looking at the R2 .............................................................137
15.3 Predicting with non-linear models ...................................139
16 Multiple linear regression ....................................................................... 140

16.1 Correlated predictors .......................................................140


16.2 Applying a multiple linear regression ..............................141
16.3 Predicting with the multiple linear regression..................144
17 Forecasting with time-series .................................................................. 144

17.1 Components in time-series...............................................145


17.2 Central moving average ...................................................147
17.3 Working on irregularity ...................................................149
17.4 Running a linear regression .............................................151
Index ........................................................................................................................ 155


Preface

Statistical analysis offers the means to capture and conceptualize knowledge


about a phenomenon, which can then be used in order to make important
decisions in business and life in general. At the same time though, statistics is
something that puzzles many students and practitioners.

This book does not intend to teach statistical theory. It is not even a book.
Instead, it is a workbook (size 18.2 cm x 12.8 cm) that can be used at any
time as a practical guide on how to apply inferential statistics by using
Microsoft Excel (main focus) and Minitab. It consists of the following parts.

• Part A: Without going into proofs of statistical theorems, this part


discusses the theoretical foundations that are needed in order to
understand the logic of applying inferential statistics.
• Part B: This part gradually builds a map that can lead the reader in
applying the right hypothesis test given the conditions faced in a
problem. There are more than 20 tests that are discussed and
illustrated in examples.
• Part C: This part discusses the simple, multiple, and non-linear types
of regression and presents relevant examples. Finally, the last section
discusses a simple method for working with time-series.

This book is ideal for students and practitioners who want to understand the
application of inferential statistics for solving real problems in a practical way.


About the author

Evangelos Kitsos is the owner of a Greek consulting company, Epariston. He


helps companies improve their processes so that they can create more value for
their customers as well as reduce the cost of offering their products and services.
He is also an external partner of Warwick university, where he teaches as well
as supervises students in the subject areas of decision making, applied statistics,
business excellence and Lean 6σ.

Copyright
Minitab® and all other trademarks and logos for the Company's products and
services are the exclusive property of Minitab, LLC. All other marks referenced
remain the property of their respective owners. See minitab.com for more
information.

Portions of information contained in this publication/book are printed with


permission of Minitab, LLC. All such material remains the exclusive property
and copyright of Minitab, LLC. All rights reserved.


Part A: The theoretical foundations

In applied statistics everything starts with a problem that is related to the
"unknown" characteristics of a population. For example, we may be interested
in how many people are in favour of the government's policies, to what extent
our customers value our services, or what the expected weight is of the product
produced by a factory's machinery. Statistical problems come in all sorts of
types, but the underlying principle is the same: if a problem can be measured,
it can also be answered.

Obviously, statistical problems require the collection and analysis of data.


However, very rarely one has access to complete data, as in most cases
measuring everything is not practical, affordable, or even possible. In situations
like that we need to rely on inferential statistics. These use the theory of
probability to draw conclusions about a population based on the characteristics
and attributes that have been observed in a sample.

Figure 1. Steps of inferential statistics: formulate the hypotheses, apply a
sampling strategy, collect sample data from the unknown population, analyse the
sample data, and make inferences.

In general, inferential statistics have two main functions. We can either use
sample data to estimate the parameters of an unknown population or test a
specific hypothesis that we may formulate. Such a distinction is useful for
practical reasons, given that a particular approach can be useful depending on
what one is trying to achieve. However, the two are opposite sides of the same
coin, as the statistics supporting them are not different. This book has been
structured based on the logic of hypothesis testing, while the process of making
estimations will be discussed alongside it where suitable.

1 Data collection considerations

Tip: During data collection remember the phrase "Garbage in, Garbage out".

The quality of the data to be analysed is of critical importance and requires
careful consideration. Indeed, no matter how advanced the applied statistics
are, the output of the analysis can only be as good as the input. In this
respect, the following sections will present some key concepts that need to be
considered.

1.1 Types of data

Understanding the type of data to be collected, or that has already been
collected, is of major importance, as it affects the analysis process that one
can follow. In general, data can be classified as:

• Quantitative: Numerical variables data.
• Qualitative: Attributes that are usually presented in a quantified form.

Tip: Using statistics that are not suitable for the data at hand is one of the
most common mistakes made.

Within the quantitative category we can discern three different types of data
and a "special case". The following analysis aims to clarify the topic.

1.1.1 Attributes data


Attributes data can be either binary or nominal data. Both are categorical data
in nature, with their only difference being the number of categories the
phenomenon under investigation can fall into.

8
A practical guide to applied statistics

Binary: "Overall, are you satisfied by the services that we provide?"
1. Yes 2. No
Nominal: "What kind of car colour do you own?"
1. Blue 2. Black 3. Red 4. White 5. Other
Table 1. Examples of attributes data

This type of data is usually collected by observing a phenomenon and then


counting how many times its specific features / qualities of interest occur.

1.1.2 Ordinal data


Like attributes data, ordinal data falls into categories, but these categories
can be ordered in terms of their value.

Q1: "Based on your overall experience, how likely are you to recommend us to
friends?" 1. Very likely 2. Likely 3. Neutral 4. Unlikely 5. Very unlikely
Q2: "How often do you use our online banking services?" 1. Every day or more
2. 3-6 times a week 3. About once or twice a week 4. About once or twice a
month 5. Never
Table 2. Examples of ordinal data

The limitation of this data is that the intervals between the various categories
are unknown. For example, Mike came 1st while Paul came 2nd in a race, but what
is their true difference? Can we say that Mike is twice as good as Paul? Of
course not! This information will remain unknown unless we measure their times.

1.1.3 Variables data


Variables data is continuous (non-categorical) data that informs about the order
of the values and allows for a meaningful calculation of their differences and,
in most cases, their ratios. This type of data is measured through physical
instruments.


Figure 2. Example of variables data: one item measures 8.4 cm, twice as long as
another at 4.2 cm.

Variables data is very rich in information. For example, the result “Mike and
Paul completed the race (attributes)” offers less information compared to “Mike
finished 1st and Paul 2nd (ordinal)”, which gives less information compared to
“Mike finished the race in 10′ & 21′′ while Paul did it in 10′ & 25′′(variables)”.
Because of their nature, variables data allows the application of the most
powerful statistical procedures, and thus should be preferred if available.

1.1.4 Counts (discrete): A special case


Counts – discrete data is highly misused in statistics. This is because, while
it falls into ordered categories, these categories are numerical in nature,
which usually creates an illusion that they can be analysed as ordinal or
variables data. For example, one may count how many bumps a car's door has
(e.g., 5, 6, 7). Obviously, this type of data is not only ordered but its
intervals also have a true numerical meaning (e.g., 20 bumps are twice as many
as 10).

This "illusion" usually leads to poor statistical conclusions, because this data
should mainly, yet not necessarily always, be treated as attributes data. To
differentiate it from variables data, ask whether it makes sense to have a true
decimal point or negative values in your data. If it does not (e.g., -5 or 5.5
bumps makes no sense), then your data is counts – discrete.

1.2 Sampling strategies


For any inference to be valid, it is important that the sample we use is truly
representative of the population that we are interested in. For that reason, the
first step is to clearly define the population that we should sample from. After

the population has been defined, we need to develop a sampling strategy that
will guide the data collection process.

A sampling strategy should be well planned and executed. Otherwise, one may
introduce bias in the sample which will be reflected in the results. There are
different sampling strategies, each of which has its merits and drawbacks.

Strategy: Random
Definition: A process which secures that each member of the population has an
equal and known chance of being selected.
Merit: If applied properly it will give, in most cases, the best possible
sample.
Drawback: Expensive and difficult to achieve. An arbitrary selection process is
often followed, which creates an illusion of randomness, as people may
unconsciously follow patterns.

Strategy: Stratified (Cluster)
Definition: Break the population down into groups that share a common
characteristic and then apply random sampling to the groups.
Merit: For very large populations, random sampling may introduce bias; sampling
from established groups can offer a more balanced view.
Drawback: There can be items that fall into more than one group, or the
classification may lead us to neglect items that should have been considered.

Strategy: Systematic
Definition: Every Nth item of a list or production line is selected.
Merit: Easy and convenient to apply, especially for production lines.
Drawback: Bias is introduced when a pattern in the population coincides with
the sampling interval.

Strategy: Multistage
Definition: Break the population down into stages while applying random
sampling in each stage.
Merit: Makes the process of collecting samples from large populations
manageable.
Drawback: Filtering items in stages can narrow down the focus in a way that
leads to biased samples.

Table 3. Common sampling strategies


The list is by no means exhaustive, but it presents the most common strategies
for collecting unbiased, representative samples. Depending on the prevailing
conditions, one should choose the most suitable strategy and carefully apply it.

1.3 Quality of measurements


The quality of the sample data is affected not only by the sampling strategy but
also by the actual data collection, or else measurement, process. A measurement
process is influenced by various factors, such as methods, facilities,
environment, people, instruments and so on. A data analyst needs to make sure
that these factors are under control to the extent that this is possible. This
is especially true if the data analyst has not been involved in the data
collection process, in which case it may be wise and useful to conduct a
retrospective examination of the measurement process.

The main point is that no matter how advanced the applied statistics are, the
output of the investigation can only be as good as the input. The popular phrase
"Garbage in, Garbage out" is very reflective of this concept. With that in mind,
there are two basic principles that need to be considered.

Validity: The measurements are close to the "true value" being measured. For
example, a scale should measure the actual weight of the individual being
measured.

Reliability: Repeated measurements are consistent in the information that they
convey. For example, measurements of someone's weight under the same conditions
should not vary significantly.

Figure 3. The two basic principles of measurement quality


If the factors that affect the validity and reliability of the measurement process
are not considered and controlled, excessive error will be introduced in the
measurement system and thus the reached conclusions will be of low quality.

Several terms are used to describe the types of error associated with a
measurement system. Usually, however, this error can be broken down into three
elements: accuracy, repeatability, and reproducibility. The following table
defines each element and presents some indicative sources that you may wish to
consider in your measurement systems.

Accuracy (Validity)
What it is: The difference between the average of measurements and the "true"
value of the phenomenon measured.
Possible source: Unsuitable measuring processes, or improper calibration or use
of the measuring instrument in general.

Repeatability (Reliability)
What it is: The variation observed over time in measurements of the same item
that were made by a single operator using the same measuring instrument.
Possible source: Natural variation deriving from the various elements of the
measurement system.

Reproducibility (Reliability)
What it is: The variation observed over time in the average of the measurements
of the same item that were made by different operators using the same measuring
instrument.
Possible source: Lack of standardized measuring processes, during which each
operator measures as "seems" appropriate.

Table 4. Measurement error & variation

There are advanced techniques to quantify these errors and judge whether a
particular measurement system is capable, which will not be discussed within
the scope of this book. However, being aware of them and tackling the sources
for each type of error is a very good start.


1.4 Visually inspect your data


A good practice, after you collect the raw data, is to inspect it visually for
patterns and peculiarities. Doing so requires some experience and knowledge
of what has been measured, but there are some basic checks one can run.

Let us consider for example the following report from a quality inspection
process. The employee manually enters a product’s serial number so that the
system can automatically return its weight and length.

Product code Weight (g) Length (cm)


00756 681 14.60
00500 A 621 13.64
00500 A 621 13.64
0059 B 684 14.63
00549 692 14.21
00547 726 C 128.50 E
0061 B 719 15.77
00550 634 15.33
0073 B 759 14.65
00554 713 16.18
00700 676 15.57
00722 726 C 14.07
00747 715 14.29
00781 726 C 11.58
00790 713 14.58
00822 9 D 21.43 G
00826 4 D 15.88
00832 695 13.58
00840 703 H
Table 5. Product Test Results


There can be different types of potential errors in a dataset. In the example
above we could be suspicious regarding the following issues:

• Typing or recording errors
A: Successive numbers being identical. Has the same product been tested twice?
B: These products have a four-digit code. Maybe this is a different type of
product?

• Repeated numbers
C: The multiple appearance of 726 is unexpected. Difficult to explain; requires
investigation.

• Outliers (extreme observations)
D: Very low numbers. An issue with the measurement tools?
E: Very high number (obvious). An issue with the measurement tools?
G: Very high number (not clear). Confirm the outlier by further analysis (e.g.,
a boxplot).

Please note that outliers can be either errors that occurred during data
collection and need to be corrected, or simply observations from a very skewed
distribution that need to be investigated.

• Missing observations
H: An empty space. A problem during data collection?

Once you identify these kinds of issues in your data, it is useful to go back to
the process or the phenomenon measured and investigate their sources. Knowledge
of the process always gives great insight.


2 Analyse sample data

The first step after the data collection has been completed is to understand
what the sample data is saying. For that reason, we use what is known as
descriptive, or summary, statistics. These can be categorized into three main
types.

Shape
Definition: Visualise the dataset to examine its distribution.
Measures and tools: Histograms (variables); Bar charts (attributes, ordinal);
Pie chart (attributes, ordinal).

Central tendency
Definition: The central number or the most common value of a dataset.
Measures and tools: Arithmetic mean (variables); Median (variables, ordinal);
Mode (attributes, ordinal).

Dispersion
Definition: The level of variability in the dataset.
Measures and tools: Standard deviation (variables); Range (variables, ordinal);
Interquartile range (variables, ordinal).

(For the types of data, → page 59.)

Table 6. The three basic types of descriptive statistics

Please note that neither the list of measures and tools in the table is
exhaustive, nor is every tool listed suitable for every situation. The aim of
this section is to introduce, in theoretical terms, what one should do to
prepare the ground for applying inferential statistics. The choice and
application of the right tool for a particular situation will be discussed later
in this book.


2.1 Shape (Visual methods)


Quite often data analysts have to deal with a large amount of complex data whose
patterns might be difficult to read in raw form. In such cases, visual displays
can make life easier, as they simplify the complexities and present the data in
a clear and straightforward way.

2.1.1 Histogram
There are different types of methods that one could use to visualize variables
data. The most common is the histogram.

Figure 4. Histogram for variables data

A histogram is a visual representation of a dataset's frequency distribution.
The x-axis has all the values of the collected data grouped in classes, while
the y-axis represents the absolute frequency of each class. In that way, we can
see the dataset's shape.

2.1.2 Bar charts


If you are working with categorical variables, that is ordinal or attributes data,
then bar charts can be extremely useful. To create a bar chart, you place the

possible values of the categorical variable on the x-axis and the frequencies on
the y-axis.

Figure 5. Bar chart visualizing categorical data

Bar charts are like histograms. However, unlike histograms the bars of a bar
chart are not connected, due to the categorical nature of the data.

2.1.3 Pie chart


A useful and popular alternative to a bar chart is the pie chart.

Figure 6. Pie Chart for customer feedback

Because they gather all the attribute measurements in a single circle, pie
charts provide a holistic perspective of the data being analysed, and the data
analyst can easily see the relative importance of each category.


2.2 Central tendency


Graphical methods of presentation get the overall impression of a group of
figures across. For further analysis, however, it is necessary to condense the
data further into individual figures. The central tendency is a basic parameter
that is highly used for that reason. It describes the central number or else the
most common value of a dataset.

2.2.1 Arithmetic mean


The mean is the arithmetic average of a dataset that includes variables data. It
is referred to as X-bar (X̄) and it is calculated by summing all the values of the
collected data and dividing this sum by the total number of values. The equation
is:

X̄ = ∑Xi / n

Where:
X̄ = the average value
∑Xi = the sum of all the raw scores
n = the size of the sample

As an example, consider the figures:

3, 5, 5, 6, 8, 9, 11

X̄ = (3 + 5 + 5 + 6 + 8 + 9 + 11) / 7 = 47 / 7 = 6.714
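
If the data is in a worksheet, the same result can be obtained with Excel's
AVERAGE function; the range A1:A7 below is hypothetical, assuming the seven
values sit in one column:

Formula in Excel: =AVERAGE(A1:A7) → returns 6.714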

Note that while the arithmetic mean is more powerful compared to other
measurements of central tendency, it is highly affected by extreme values or in
other words skewed distributions. In such cases, it is always wise to investigate
the extreme values, as well as have an idea of other parameters, such as the
variance of the dataset, or other measurements of central tendency before a
decision is made.


2.2.2 Median
The median is the middle value in a set of raw data that has been ordered from
the lowest to the highest, or the other way around. It is the middle value of a
distribution, in the sense that 50% of the values lie above and 50% below it. In
the example we used above:

Median rank = (n + 1) / 2 = 8 / 2 = 4 (place of the median) => Median = 6

Note that if there is an even number of values in the dataset the median is the
average of the two middle values.

The median is a value that can provide useful information when the dataset is
skewed due to extreme values. Therefore, the median should be calculated
alongside the arithmetic mean to check whether there are any extreme values in
the dataset. When the dataset includes such values, the median is probably the
better measurement of central tendency to consider when making decisions.
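
In Excel, assuming the same hypothetical range A1:A7, the median is given by:

Formula in Excel: =MEDIAN(A1:A7) → returns 6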

2.2.3 Mode
Mode refers to the value that occurs most frequently in a dataset. In the example
we used above:

𝑀𝑜𝑑𝑒 = 5

The mode is a simple parameter and can be used with any type of data; even
purely qualitative. The mode can provide very useful information because the
value that occurs with the highest frequency may require further attention.
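
In Excel, assuming the same hypothetical range A1:A7, the mode is given by:

Formula in Excel: =MODE.SNGL(A1:A7) → returns 5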

2.3 Measures of dispersion


Central values are an essential aspect of a dataset, yet a data analyst cannot
solely rely on them. Let us assume that you have a meeting with your CEO at
8:00 am. You know from experience that on average commuting takes


approximately 25 minutes door to door. What time should you leave your
house? One may say that leaving the house at 7:35am will be ok, yet this is not
a good decision to make. The commuting process, similar to any other process
in the world, is subject to variation. Factors such as traffic, traffic lights and
weather conditions will have an impact on the time it takes you to get to work.
It may take for example from 10 to 40 minutes, and thus leaving the house
before 7:20am will be the right thing to do.

The previous example indicates that for any dataset the values will vary from
each other, and this variability is an essential parameter that also needs to be
calculated and considered before a decision is made. In some cases, it is even
possible to end up with datasets that have the same central tendency but have
totally different variability. Such differences are important, and thus you need
to apply statistics to measure the variability or dispersion of your data.

2.3.1 Range
Range is the most straightforward and simple parameter that you can use to
describe the dispersion of your data. It is the difference between your highest
and lowest values in a dataset. For our example above:

Range = 11 − 3 = 8

An advantage of the range is that it can be used with ordinal data. On the other
hand, the range is highly affected by extreme values.
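
Excel has no dedicated range function, but the range can be computed from MAX
and MIN; again assuming the hypothetical range A1:A7:

Formula in Excel: =MAX(A1:A7) - MIN(A1:A7) → returns 8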

2.3.2 Interquartile range


The interquartile range (IQR) is the range of the middle 50% of the values in a
dataset. In our example above the IQR is 4. However, let us also consider a more
complicated example of a product type's length to illustrate how to calculate
the IQR.


Step 1: Order your data

Product length (ordered): 22, 26, 28, 33, 45, 45, 46, 47, 48, 48, 52, 52, 52, 53,
53, 53, 56, 57, 58, 58, 63, 63, 63, 64, 64, 64, 64, 65, 65, 66, 67, 68, 71, 74, 75,
75, 82, 87, 95, 98.

Step 2: Find the median

Product length (ordered): 22, 26, 28, 33, 45, 45, 46, 47, 48, 48, 52, 52, 52, 53,
53, 53, 56, 57, 58, 58, 63, 63, 63, 64, 64, 64, 64, 65, 65, 66, 67, 68, 71, 74, 75,
75, 82, 87, 95, 98.

In this case, the median is the average of 58 and 63; or else 60.5.

Step 3: Find and subtract the medians of the 1st and 2nd halves of your dataset

Product length (ordered): 22, 26, 28, 33, 45, 45, 46, 47, 48, 48, 52, 52, 52, 53,
53, 53, 56, 57, 58, 58, 63, 63, 63, 64, 64, 64, 64, 65, 65, 66, 67, 68, 71, 74, 75,
75, 82, 87, 95, 98.

The median for the first half is (48 + 52) / 2 = 50. Respectively, the median
for the second half is (66 + 67) / 2 = 66.5. Ultimately, the interquartile range
is 66.5 − 50 = 16.5.

The interquartile range can be used even when extreme values exist in the
dataset as it is not affected by the values that lie at the edge of a distribution.
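
In Excel the quartiles can be obtained with QUARTILE.INC; the range A1:A40
below is hypothetical, assuming the 40 ordered lengths sit in one column. Note
that Excel interpolates its quartiles, so the result can differ slightly from
the median-of-halves method shown above:

Formula in Excel: =QUARTILE.INC(A1:A40, 3) - QUARTILE.INC(A1:A40, 1)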

2.3.3 Variance & Standard deviation


The variance and the standard deviation are the most commonly used statistics
to quantify the amount of dispersion of variables data. They indicate how much
on average each value in a dataset deviates from its average value. In inferential
statistics we are usually interested in the variance of the sample and thus we use
the following formula.


s² = ∑(Xi − X̄)² / (n − 1)

Where:
s² = the variance of the sample
∑(Xi − X̄)² = the sum of the squared deviation scores
n = the size of the sample
X̄ = the sample mean
For our example above:

s² = [(3 − 6.7)² + (5 − 6.7)² + (5 − 6.7)² + (6 − 6.7)² + (8 − 6.7)² +
(9 − 6.7)² + (11 − 6.7)²] / (7 − 1)

s² = (13.8 + 2.9 + 2.9 + 0.5 + 1.7 + 5.2 + 18.4) / 6 = 45.4 / 6 ≈ 7.57

The variance is a squared quantity and thus it is difficult to relate it to the
actual measurements in a dataset. For that reason, we also calculate the
standard deviation of our data, which is simply the square root of the variance.

Standard deviation = √Variance = √7.57 ≈ 2.75

This figure can be easily compared to the raw values and thus allows us to reach
conclusions about the variability of the dataset.
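
In Excel the sample variance and standard deviation can be obtained directly;
again assuming the hypothetical range A1:A7:

Formulas in Excel: =VAR.S(A1:A7) → returns 7.57, and =STDEV.S(A1:A7) → returns
2.75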

3 Introduction to probability theory

Descriptive statistics are excellent at describing how a situation or a process
works according to the sample data. However, making inferences about a
population requires working with probabilities.

3.1 Basics of probability


In this section we will introduce the fundamentals of probability theory.

3.1.1 The notion of probability


In probability theory, the result of an activity or experiment is called an event.
The classical probability of such events is calculated by the following function:


P(Event X) = (Number of outcomes where the event occurs) / (Total number of
possible outcomes)

The classical probability of an event can range from 0 (0%) to 1 (100%). For
example, the probability of rolling a 2 with a dice is P(number 2) = 1 / 6 =
16.67%. Such probabilities are called a priori probabilities because they are
known in advance.

The classical approach can only be applied when the outcomes and likelihoods
can be determined in advance, such as in games of chance. But what if these
cannot be known? This leads to the empirical definition, which is given by the
following function:

P(Event X) = (Number of times the event occurred) / (Number of times the
experiment is carried out)

Finally, there are some cases which are not suitable for the classical approach
and for which there is no empirical data. For example, a manager may be trying
to make predictions about the market's needs or preferences for various
products. In these situations, one cannot rely on data to establish
probabilities, but one can assign subjective probabilities by making a personal
assessment of the situation based on experience and intuition. It goes without
saying that this method is subject to the greatest error, as human judgment is
highly affected by unconscious bias; however, in some cases it is the only way.

3.1.2 Addition rules and Venn diagrams


Frequently we will encounter situations where the outcomes that interest us are
actually the result of two or more events taking place at the same time or in a
particular order. The way in which these elemental outcomes affect compound
outcomes is governed by what is known as the laws of probability. These laws can
be stated and proved mathematically, but it is easier to show their
justification graphically using Venn diagrams.

Let’s consider a company that uses agents to promote its products in the
European market. The company has 100 agents across Europe, with 45 of them
specializing in products of type A and 55 specializing in products of type B. Of
these agents, 15 specialize in both products A and B. The emails of these agents
are in a common database, so when an email is received from a customer, any one
of these agents can respond. At the moment, there is no specific algorithm to
link the customer request with a specialized agent; the allocation is absolutely
random. So, what is the probability of an agent with a specific specialization
answering a customer request?

Figure 7. Venn diagram of the European agents: Product A (45) and Product B (55)
overlap in "Both" (15); No specialization (15); S: Total (100)

By looking at the Venn diagram it is easy to calculate some basic probabilities.
To start with, let us look at the number of agents who specialize in a product:

• Agents specializing in just Product A: 30 (45 – 15)

• Agents specializing in just Product B: 40 (55 – 15)

• Agents with no specialization: 15 (100 – 30 – 40 – 15)

• Sum of agents: 30 + 40 + 15 (Specialization in both) + 15 = 100


Now we are ready to calculate the probabilities of various events about the
situation. For example:

P(No Specialization) = 15 / 100 = 0.15 or 15%

Based on this probability, and given that the sum of the probabilities of all
different alternatives in a given situation (the total area in the Venn diagram)
is always equal to 1 (100%), we can say that the probability of having a
specialization is 85% (100% – 15%). The general formula is:

P(A) + P(A’) = 1

P(A’) is the complement of P(A) and represents the probability of A not
happening. The sum of complementary probabilities is always equal to 1.

3.1.3 Summing probabilities


When two events are mutually exclusive or else disjoint, the following formula
should be used to find their sum:

P(A U B) = P (A) + P(B)

The symbol U means union or else “or”. On the other hand, the events “having
specialization in product A” and “having specialization in product B” are non-
mutually exclusive, given that an agent can specialize in both products. In these
cases, the following generic formula of addition should be used:

P(A U B) = P (A) + P(B) – P(A ∩ B)

The symbol ∩ means Intersection or else “and”. In our case, this is the
probability P(Specialization in product A) + P(Specialization in product B) –
P(Specialization in both) which is equal to:

P(Specialization in A U B) = 45 / 100 + 55 / 100 – 15 / 100 = 85 / 100 = 0.85


or 85%


Note that for disjoint probabilities the probability of their intersection is equal
to zero and therefore it is not considered in the equation.

3.1.4 Sampling with replacement vs no-replacement


Let’s assume that a customer request has just come in. The request has been
answered by an agent with no specialization: a 15% probability. The agent
completes the request, and then a second request comes in. In this case the
probability of an agent without specialization dealing with the new request is
still 15%. In fact, the probabilities of all the events will remain the same,
given the random allocation of customer requests. Such actions are called
sampling with replacement.

What happens though if a customer request comes in before the agent completes
the previous request? Such a case reflects what is known as sampling without
replacement, as the agent is not available to deal with the request and therefore
the probabilities of the various events will change. For example, the probability
of having the new request addressed by an agent with any type of specialization
is now:

P(Specialization in A U B) = 45 / 99 + 55 / 99 – 15 / 99 = 85 / 99 = 0.859

This probability is slightly higher than before, because the same number of
agents with a specialization are available while the total number of agents has
been reduced, given that one agent without specialization is not available.

3.2 Joint probability


Joint probability is the probability of two or more events occurring either
together or in succession.


3.2.1 Independent events


Independent events mean that knowing something about one event does not change
the probability of the other event occurring. In other words, the events are not
related. For such events the following formula provides the joint probability:
P(A ∩ B) = P(A) * P(B)

For example, what is the probability of having two consecutive requests


answered by an agent without specialization, with replacement? This will be
equal to 2.25% (15 / 100 * 15 / 100). Note that the joint probability is always
lower than the marginal probabilities, as it is the outcome of multiplying two
fractions.

3.2.2 Conditional (or dependent) events


Let us now consider the two events “specialization in product A” and
“specialization in product B”. Are these events independent? Of course, they
are not. The likelihood of picking an agent with “specialization in product A”
is 45%. However, what if one tells you that the agent has a “specialization in
product B”? This condition changes the probability of picking an agent with
“specialization in product A”, because there are only 15 agents who specialize
in both products from the group of 45 who have a “specialization in product A”.
In such cases, we use the general rule of multiplication.

P(A ∩ B) = P(A) * P(B | A)

In our example this probability is equal to 45 / 100 * 15 / 45 = 15%, which is


correct as only 15 agents out of 100 have a specialization in both products. Note
that the general rule of multiplication works for every case, including events
that show statistical independence like the case above.


3.3 Expectation
'Expectation' is the theoretical or long-run average of a phenomenon that is
measured. It is given by the following formula:

E(x) = ∑ pi xi

Where:
xi = the value of the ith event
pi = the probability of the ith event

For example, the expected value of all the dice’s rolls in a Monopoly game will
be 7; or at least something very close to it (the logic behind this deviation will
be explained later).

E(x) = 2(1/36) + 3(2/36) + 4(3/36) + 5(4/36) + 6(5/36) + 7(6/36) + 8(5/36) +
9(4/36) + 10(3/36) + 11(2/36) + 12(1/36) = 7

This tool is highly used in economics and decision theory. For example, a
businessman knows that a decision could lead either to a profit of €400 with a
probability of 0.7 or to a loss of €200 with a probability of 0.3. As a result,
we would expect:

E(x) = 400 * 0.7 + (−200) * 0.3 = 220

The calculated profit in the example represents the expected value and is
usually called the utility of the decision. Utilities are presented in a payoff
table that shows the results of each combination of actions and outcomes. For
example, a decision could lead to a loss or a profit, or different decisions can
lead to different results. Each outcome is also linked with a probability of
occurring, which allows us to calculate the expected utility. As was shown in
previous sections, all the probabilities need to sum up to 1.
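
In Excel the expected utility can be computed with SUMPRODUCT; the array
constants below simply restate the example's payoffs and their probabilities:

Formula in Excel: =SUMPRODUCT({400, -200}, {0.7, 0.3}) → returns 220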

3.4 Counting rules


In order to calculate probabilities in decision making, it is important to know
how many possible outcomes or alternatives can potentially exist; that is, the
denominator of a probability. There are different ways to do so, depending on
the characteristics of the situation.

3.4.1 Single and different type of events


For a single type of event, or events that have a constant number of potential
outcomes (k) and occur n times, the following formula applies:

Total possible outcomes = kⁿ

For example, if we roll a dice 3 times, then we have k = 6, as there are 6
possible outcomes on a dice, and n = 3. Rolling the same dice is considered a
single type of event, as the potential outcomes of the experiment remain
constant. Therefore, there are 6³ = 216 possible outcomes.

On the other hand, if we roll two different dice once, one with six sides and
one with eight sides, then we have two different types of events. This is
because k changes from six to eight. In this case the following formula applies:

Total possible outcomes = k1 * k2 * k3 * … * kn

In our example, we have 6 * 8 = 48 possible outcomes for one roll of each dice.

Counting rules are not always that straightforward though. For example, how
many even 3-digit numbers can be formed from the digits 1, 2, 5, 6, 9 if each
digit can only be used once?


If only even numbers can be formed, the digit in the units position can only be
2 or 6. Therefore, 4 other digits are available for the hundreds position and,
to prevent duplication, only 3 digits will be available for the tens position.

Total possible outcomes = (4) (3) (2) = 24

Obviously, this requires some logical thinking process to take place.

3.4.2 Permutation and combination rules


For any specific number of objects n, the number of ways in which they can be
arranged is given by the factorial of n.

n! = n * (n − 1) * (n − 2) * (n − 3) * … * 1

For example, suppose one wants to allocate 3 job positions to 3 employees: Bob,
Hellen, and Matt. The possible ways of allocating these jobs are 3! = 6.

1 2 3 4 5 6
Job 1 Bob Bob Hellen Hellen Matt Matt
Job 2 Hellen Matt Bob Matt Hellen Bob
Job 3 Matt Hellen Matt Bob Bob Hellen

Table 7. Arranging 3 jobs to 3 employees

Let’s assume now that the leader of the team wants to know in how many ways, out
of these 6 in total, any two employees can be arranged. We can ask this question
in two ways: arranged in a particular order, or regardless of order. The first
is called a permutation and can be calculated by the following equation.

nPx = n! / (n − x)!


For a permutation the order matters. For example, the team leader wants to
promote two employees, one to principal and one to senior level. The order here
matters because the two jobs are not of equal level; therefore we need to
consider all the potential ways of arranging any two of these employees. In this
case, 3P2 = 6.

1 2 3 4 5 6
Principal Bob Bob Hellen Hellen Matt Matt
Senior Hellen Matt Bob Matt Hellen Bob

Table 8. Permutation of 2 employees from a group of 3

Note that the fact that the permutation here is equal to n! is a coincidence. If
we had four employees, and thus n! = 24, the permutation would have been
4P2 = 12.

On the other hand, when the particular order in which the groups are formed does
not matter, we use the combination rule.

nCx = n! / (x! (n − x)!)

For example, if the leader wanted to promote two of these employees to a senior
position, then a combination like Bob / Hellen would be of equal value to
Hellen / Bob. To avoid such double counting, we add x! to the denominator of the
permutation. In this case, 3C2 = 3.

1 2 3
Senior Bob Bob Hellen
Senior Hellen Matt Matt

Table 9. Combination of 2 employees from a group of 3


Note that while these rules do not provide direct probabilities, they can help
us establish them. For example, in a race of 8 cars we would like to bet that
three specific cars will take the first three places. What are the chances of
winning this bet?

If we bet that the three cars will finish in a particular order, then we use the
permutation rule: 8P3 = 336. Thus, there are 336 different ways of arranging any
3 of them in the first 3 places. Since we are interested in 3 specific cars
finishing in a specific order, our chances of winning are 1 / 336 = 0.298%. On
the other hand, if we bet that 3 specific cars will finish in the first three
places, but we do not care about the order, then we use the combination rule:
8C3 = 56. Therefore, our chances of winning are 1 / 56 = 1.79%. This is still
low, yet we have better chances now.
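
Excel's built-in counting functions can reproduce these results; applied to the
examples above:

Formulas in Excel: =FACT(3) → 6, =PERMUT(8, 3) → 336, =COMBIN(8, 3) → 56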

4 The normal distribution

Although in most cases we cannot know the actual outcome of an experiment


until it has taken place, we can usually (especially with past knowledge
available) describe the probability of the possible outcomes from that
experiment occurring. A listing of the possible outcomes and their probabilities
is called a probability distribution.

Although there are many distributions, here we will use the normal distribution
to illustrate their function. To make the learning clearer, we will assume the
example of a manufacturing company that produces metallic cylinders. We are the
engineers of the company, and our customers request that the average volume of
our cylinders should be 1.5m³ ± 0.8m³. To meet these specifications, we have set
our machinery to produce cylinders with a specified mean volume of 1.5m³ and a
known standard deviation of 0.02m³.


4.1 Variation is part of our life

Tip: Everything in life is susceptible to random variation. Does it take the
same amount of time to commute every day?

Why, though, would we not set our machinery to produce every cylinder with a
volume of exactly 1.5m³? Besides, variation is cost! If a cylinder is much
larger or smaller than the average suggests, then it may not be suitable for
use. Efforts should therefore be made to reduce variation where it is
economically beneficial. However, no matter what we do, variation cannot be
eliminated. It will always be there, no matter what one is measuring. Thus, we
must learn how to manage and live with it. In fact, this is what applied
statistics is all about!

Random variation is due to the combination of various factors that take place
during a process. In our case, for example, some factors push towards making
larger cylinders, while others push towards making smaller ones. In a subsequent
production run the same factors can push in the opposite direction to the one
they were pushing before. Nobody knows how they will behave, as these factors
are themselves susceptible to other factors that behave randomly. However, for
most of the cylinders the factors will balance out to offer a product close, if
not equal, to the specified mean value. Only on very rare occasions will all the
factors push in one direction, resulting in a cylinder with an extreme volume
value.

4.2 Using the normal distribution


Inferential statistics use theoretical distributions that describe how phenomena
and industrial situations are expected to behave under the impact of random
variation. In that way, they allow us to consider the sample data in the context
of the natural expectations, which ultimately helps us make inferences. There
are many distributions suitable for different situations and types of data.


For our example, we will assume that the production line of cylinders follows
the popular normal distribution. Note that we assume normality for now; as we
will see later in this book, this should also be tested. Indeed, not everything
in life follows normality, and when that is the case we need to apply different
types of tests. So, for a specified mean of 1.5m³ and a standard deviation of
0.02m³, the following expectations arise.

Figure 8. The normal distribution for the cylinders' volume: approximately
68.27% of the values lie within ±1 standard deviation of the mean (1.48–1.52m³),
95.45% within ±2 (1.46–1.54m³), and 99.73% within ±3 (1.44–1.56m³); only 0.13%
lie below 1.44m³ and 0.13% above 1.56m³.

We always read distributions from the smallest (left) to the largest (right)
value. The x-axis shows the values that the phenomenon under investigation can
take (the cylinders' volume), while the y-axis indicates the frequency of
occurrence of a specific value x. In hypothesis testing we use the area under
the curve, which indicates the probability of occurrence of events that lie
between two specified values of interest. For example, we expect that 0.13% of
the cylinders produced will have a volume lower than 1.44m³, and that 13.59% of
the values will fall somewhere between 1.46m³ and 1.48m³.


You can use Microsoft Excel to calculate the probability of occurrence for any
value that you are interested in. For example, you may want to know: what is the
probability of getting a cylinder with a volume between 1.46m³ and 1.48m³?

Table 10. Using Excel to calculate probabilities with the Normal distribution

Remember, we always “read” a distribution from the smallest to the largest
value. Thus, if the cumulative probability of getting a value less than 1.46m³
is 2.28%, and that of a value less than 1.48m³ is 15.87%, then the probability
of getting a cylinder with a volume between 1.46m³ and 1.48m³ is 15.87% − 2.28%
= 13.59%.
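
A sketch of this calculation with Excel's NORM.DIST function (the fourth
argument set to TRUE returns the cumulative probability of a value below x):

Formula in Excel: =NORM.DIST(1.48, 1.5, 0.02, TRUE) - NORM.DIST(1.46, 1.5, 0.02,
TRUE) → returns 13.59%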

4.3 The standardized normal distribution


You have probably noticed that the calculations are affected by the mean value
and the standard deviation of the distribution. Indeed, every normal
distribution is characterized by these two values. So how can we compare normal
distributions with different means and standard deviations? To do so, we use the
standardized normal distribution, which is a transformed distribution with a
mean of 0 and a standard deviation of 1.


The standardized normal distribution uses Z values. These values indicate how
many standard deviations a specific x value deviates from the mean. We can
calculate this in Microsoft Excel by using the following formulas.

Figure 9. The standardized normal distribution
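
A sketch of the two Excel functions involved: STANDARDIZE converts an x value
into a z value, and NORM.S.DIST returns the cumulative probability for that z
value. For example, for x = 1.48m³:

Formulas in Excel: =STANDARDIZE(1.48, 1.5, 0.02) → returns -1, and
=NORM.S.DIST(-1, TRUE) → returns 15.87%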

As you can see, for any x value in m³ we can get a corresponding z value. Then
we can calculate the probability of getting a value between any range of z
values. This follows the same logic and provides the same results as the method
applied above. However, by using the transformed distribution we can bring any
normal distribution down to a common scale and make meaningful comparisons that
are helpful in inferential statistics.

5 Discrete distributions

Examining the distribution of your dataset will help you understand your data
in context and make a prediction of what future results may be. In this
section of the book, we will discuss some key discrete distributions.


5.1 Binomial distribution


The binomial distribution describes the probability of a binary experiment. In
particular, a situation can be modelled by the binomial distribution if:

• It has only two outcomes: success/failure, accept/reject, good/bad etc.
(Bernoulli process)

• The total number of trials (n) is known.

• Sampling follows a random collection strategy.

• The outcomes of the various trials are independent.

• The probability of success or failure remains constant.

If these conditions are met, then the binomial distribution can be used to
calculate the probability of 0, 1, 2, 3, … etc. successes in n trials. This is given
by the following equation:

$$P(r) = \frac{n!}{r!\,(n-r)!} \; p^r q^{(n-r)}$$

Where:
r = number of successes
p = probability of success
q = probability of failure (1 − p)
n = the known number of trials (sample size)

Formula in Excel: = BINOM.DIST(r, n, p, False)
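As a quick, hypothetical illustration: if a process produces defectives with
p = 0.1 and we draw n = 10 items, the probability of finding exactly r = 2
defectives is = BINOM.DIST(2, 10, 0.1, False) ≈ 19.4%.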

The expected value of the binomial distribution, or the mean value, is given
by the equation E(x) = n * p, while the variance is σ2 = n * p * q. Naturally, the
standard deviation is the square root of this value.

A key point regarding the binomial distribution is that it can be approximated
by the normal distribution. In fact, it is perfectly symmetric when p = q = 0.5.


As p and q start deviating from 0.5, we will observe skewness, which however
can be balanced to a certain extent by the size of the sample. As a rough
guideline, we say that the normal approximation of the binomial distribution is
reasonable when both np ≥ 10 and nq ≥ 10 (or, as a looser rule, at least 5).

5.2 Hypergeometric distribution


A basic assumption of the binomial distribution is that the probability of success
remains constant throughout an experiment. However, this assumption can be
violated. Think, for example, of the case of drawing samples from a batch. If a
sampled item is defective, it will usually be thrown away or sent for repair. By
doing so though, the next sample picked from the batch will have a different
probability of success, especially if the population is small and its size is known
(also known as a finite population).

In such cases of sampling without replacement, it is more appropriate to use the
hypergeometric distribution instead of the binomial. The following equation
gives the probabilities:

$$P(r) = \frac{{}_{Np}C_{r} \;\cdot\; {}_{Nq}C_{n-r}}{{}_{N}C_{n}}$$

Where:
r = number of successes
p = probability of success
q = probability of failure (1 − p)
n = the sample size
N = the population size (batch or trials)
C = combination

Formula in Excel: = HYPGEOM.DIST(r, n, N*p, N, False)
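For instance, take a hypothetical batch of N = 20 items containing 5 defectives
(so N*p = 5), from which we draw n = 4 items without replacement. The
probability of exactly r = 1 defective is = HYPGEOM.DIST(1, 4, 5, 20, False) ≈ 47%.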

The expected value of the hypergeometric distribution, or else the mean value,
is given by the equation E(x) = n * p, while the variance is σ2 = n * p * q * (N
− n) / (N − 1). Naturally, the standard deviation is the square root of this value.


5.3 Negative binomial distribution


The binomial distribution provides the probability of having r successes in a
given number of trials n. However, sometimes we need to know the probability
of the r-th success occurring by the x-th trial. In such cases, we can use the
negative binomial distribution.

$$P(x) = {}_{x-1}C_{r-1} \;\cdot\; (1-p)^{x-r} \; p^{r}$$

Where:
r = number of successes
x = number of trials until the r-th success
p = probability of success of a single trial
C = combination

Formula in Excel: = NEGBINOM.DIST(Failures, r, p, False), where Failures = x − r
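As a small, hypothetical check: with p = 0.5, the probability that the 3rd success
lands exactly on the 5th trial (i.e., 2 failures along the way) is
= NEGBINOM.DIST(2, 3, 0.5, False) = 18.75%.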

The expected number of failures before the r-th success is E = r * (1 − p) / p (so
the expected number of trials is r / p), while the variance is σ2 = r * (1 − p) / p^2.

5.4 Poisson distribution


The Poisson distribution describes the probability of events that occur in
specified intervals. In particular, a situation can be modelled by the
Poisson distribution if:

• The number of times an event (x) occurred can be known.

• Events cannot co-occur, with single events occurring with a constant
probability over the time period.

• The potential opportunities for a success to happen are unknown and
very large (theoretically infinite).

• A constant measuring interval exists.



• Events occur independently, with the occurrence of one not affecting
the occurrence of another.

If these conditions are met, then the Poisson distribution can model the situation
and estimate the respective probabilities based on the following equation:

$$P(x) = \frac{\lambda^{x}\,e^{-\lambda}}{x!}$$

Where:
λ = the expected (average) number of occurrences
e = the base of the natural logarithm (2.7183)
x = the number of occurrences

Formula in Excel: = POISSON.DIST(x, λ, False)
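For example, under a hypothetical average of λ = 2 defects per hour, the
probability of observing exactly x = 3 defects in an hour is
= POISSON.DIST(3, 2, False) ≈ 18%.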

A basic characteristic of the Poisson distribution is that its expected value, or
else the mean value, and its variance are both equal to λ.

Note that the larger the λ, the better the Poisson distribution can be approximated
by the normal distribution. The Poisson distribution also approximates the
binomial, especially when n > 30 and p < 0.1.

6 Applying inferential statistics

Inferential statistics allow us to go a step further compared to what we can
immediately observe in a sample and draw conclusions about the entire
population. There are many tools that can be used to make such inferences and
their suitability depends on the type of data that has been collected as well as
the conditions of the case at hand. Quite a few of these tests will be discussed
in the context of this book, but for now we will use the z-test for a single mean
in the example of the cylinders to illustrate the theoretical foundations of
inferential statistics. Please note that this section builds on the knowledge
provided in the previous sections.

6.1 Sampling distributions


Let us assume that our customers have started complaining that some cylinders
are not of the right volume size anymore. This suggests that possibly something
has changed in our production process. It could be that the specified mean is
not 1.5m3 anymore, which is something that needs to be tested. In an
ideal world, we would measure every cylinder that will ever be produced
by this machine and then compare their average to the specified mean
of 1.5m3. However, this is not possible and thus we need to use a sample.

The problem with using a sample though, is that it is extremely unlikely that
the sample’s average will be equal to 1.5m3, simply due to the sampling error.
That is, one cannot know if the values of the sampled cylinders will absolutely
balance out to give the specified expected mean.

What we do know though is that the averages of the samples collected will form
their own normal distribution. This is also known as the sampling distribution
of the mean that models the phenomenon under investigation.

[Figure: the population of individual volumes X has μx = 1.5m3 and σx = 0.02m3;
the corresponding sampling distribution of the mean X̄ is narrower, with
standard deviation σx̄ = σx / √n, where n is the sample size.]

Figure 10. From a population to a sampling distribution

The average of the sampling distribution (X̿) is the average of all the samples’
averages. Therefore, in the theoretical scenario of having all possible sample
averages, it is expected to be equal to the average of the entire population (μx).

On the other hand, the standard deviation of the sampling distribution (σX̄),
also usually referred to as the standard error, is affected by the sample size. The
larger the sample, the smaller this error and thus, as we will see later, the better
the inferences that can be made.

6.2 Formulate the hypothesis (The trial)


Hypotheses are theories that we develop about a situation. These can be based
on past or anecdotal data, on observations of a phenomenon, or sometimes on
one’s gut feeling. Understanding the process of setting the hypotheses is of
crucial importance in inferential statistics as it reflects the underpinning logic
of the whole concept.

To start with, there are always two opposing hypotheses, the key characteristics
of which are presented in the following table.

Null hypothesis (Ho)
• Expressed as: a statement of equality
• Represents: the expectations of what the case might be
• Statistically: there is no significant difference between the expectations and the sample data

Alternative hypothesis (H1)
• Expressed as: a statement of inequality (the exact opposite of Ho)
• Represents: the testing hypothesis that challenges the expectations
• Statistically: there is a significant difference between the expectations and the sample data

Table 11. Null and alternative hypothesis

In real life, human beings usually form a hypothesis and then try to find
evidence to prove it. In the world of statistics things are slightly different. The
aim is not to prove the H0 per se. Instead, we start with the presupposition that
the H0 is valid. Then, we collect and analyse sample data to decide whether H0
should be retained or rejected in favour of H1.

A good way to illustrate how hypothesis testing works is to look at the example
of a trial. There are two hypotheses in a trial:

• H0: The accused party is innocent.

• H1: The accused party is not innocent (is guilty).

By law, the jurors should start with the assumption that the accused party is
innocent (H0 is taken for granted) until proven guilty. Then, the evidence is
presented, and the jurors evaluate it so that eventually they will come to a
verdict. Their thinking process follows the logic:

“…provided that the accused party is indeed innocent (Ho is assumed to be
true), would one ever possibly observe the evidence (sample data) presented?
If the answer is yes, then the jurors fail to reject the H0 (we do not have
enough accusation evidence); otherwise they reject the H0 (we have
significant evidence to suggest that the party is not innocent) in favour of H1
(is guilty).”

Note that this language is important in hypothesis testing as it underpins every
statistical test that we will see in this book.

6.3 Hypothesis testing (1-sample z-test)


We would like to run a 1-sample z-test to see whether our customers’
complaints that the volume of the cylinders has changed are valid or not. To do
so, the first step is to formulate our null and alternative hypotheses. We know
that the machine is set up to produce cylinders with a specified mean of 1.5m3.


We also know that this was the case up to now, as we were selling these
cylinders for many years to the same customers with no complaints about their
volume. So, if “nothing has changed” we would still expect to have an average
of 1.5m3. On this basis, the null hypothesis (the expectation) is μx = 1.5m3 while
the alternative hypothesis (the testing hypothesis) is μx ≠ 1.5m3.

To test these hypotheses, we decided to select a random sample of 42 cylinders,
which have an average volume of 1.507m3. We can see immediately that
the average of the sample is bigger than the expected mean value. However, the
big question is whether the difference that we observe between the sample
average and the expected value is due to pure chance (sampling error) or
whether something has indeed changed in our process, so that the mean of the
population is not 1.5m3 anymore.

[Figure: H0: μx = 1.5m3 vs H1: μx ≠ 1.5m3; the sample average of 1.507m3
deviates from the hypothesized mean of 1.5m3.]

Figure 11. Deviation from the sample mean

The answer is that it depends. We know from the basics of the normal
distribution that the more a value deviates from the mean, the less chance it has
of occurring purely by chance. Thus, the question is how much deviation from
the expected mean (1.5m3) are we willing to accept as being random? Accepting
a large deviation means that we need to see a very extreme sample mean (strong
evidence) before we reject a null hypothesis. On the other hand, accepting a
small deviation means that a less extreme value (weak evidence) would be
enough to reject the null hypothesis.

This measure of strength that we would like to see in our sample data before
rejecting the null hypothesis is reflected in the significance level alpha (α). The
alpha is set by the analyst and thus there is no single right value that can be used
in all cases. It depends on the context and the importance placed by the
individual on the test that is to be conducted. Having said that, an alpha value
equal to 5% or 1% (stronger evidence needed) is most commonly used. In this
book, all the examples will use an alpha of 5%.

Now we are ready to finalize the test. On the one hand, we have a sample mean
that is expressed as a number that deviates from the population mean, while on
the other hand we have a significance level α, which is expressed as a probability
that represents the amount of deviation we are willing to accept. We then need
to transform both these values into standardized z values and compare them.
The underlying distribution is the normal sampling distribution for the mean that
has been presented above. The following steps should be applied.

• Step 1: Use the transformation equation to transform the sample mean
to a z calculated value (test statistic).

$$z_{calc} = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu_0}{\sigma_x/\sqrt{n}} = \frac{1.507 - 1.5}{0.02/\sqrt{42}} = 2.2683$$

Note that we have used the standard deviation of the sampling
distribution of the mean.

• Step 2: Transform the significance level α (5%) into a z critical value.
This can be done in Microsoft Excel with the following formula.

Zcrit = NORM.S.INV(α/2) = NORM.S.INV(0.05/2) = −1.96, and by
symmetry the critical values are ±1.96.

Note that the critical values need to be placed on both sides of the
distribution (±), as one can get values that are significantly lower or
higher than the specified mean.

• Step 3: Plot both values on the distribution of the sample mean and
compare them, as can be seen in the following figure.

[Figure: if the test statistic falls within the limits −zcrit = −1.96 and
zcrit = 1.96, we fail to reject the null hypothesis; if it falls outside
them, we reject the null hypothesis in favour of the alternative. Here
zcalc = 2.2683 falls beyond the upper limit.]
Figure 12. 1-sample z-test results

In our example, the difference (deviation) observed between the sample data and
the expected mean is significant, as the calculated value is outside the limits set
by the critical values. This suggests that most likely the mean has changed; most
likely it has increased, given that the test statistic lies at the right edge of the
distribution. Therefore, the null hypothesis should be rejected, and corrective
action needs to be taken to bring the process back to the centre.


6.4 Errors in hypothesis testing


The test statistic plays a significant role in the decision to be made. However,
this is only half of the whole picture. The outcome of the hypothesis testing is also
affected by the significance level (α) that the data analyst chooses to use.
Consider the case of tests that were conducted using the same sample data, and
thus the same zcalc, at different levels of α. The α value will define the zcrit and
thereby the result of the test.

Tip: As the significance level (α) increases, the test becomes less strict. That is,
less evidence is needed to reject the null hypothesis.

[Figure: the same zcalc compared against critical values ±zcrit that move
inwards as α increases.]
Figure 13. Impact of significance level α to hypothesis testing

It is obvious that the higher the alpha value, the more chance one has of
rejecting the null hypothesis. This would be fine, provided that the null
hypothesis is indeed false. However, what if the H0 is true? Note that we do not
know the true state of a population, which is the reason why we run the test.

In this sense, α also indicates a risk of potentially being wrong that we are
willing to take. In other words, rejecting a null hypothesis that should not have
been rejected is an error caused by the choice of having a very large α. At the
same time, a very low α can lead to a case where the null hypothesis is retained
when it should have been rejected. These are the two main types of errors in
inferential statistics, as it can be seen in the following table.


Table 12. Type I and Type II errors in hypothesis testing
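In outline, the possible outcomes of a test are:

                     H0 is true                      H0 is false
Reject H0            Type I error (probability α)    Correct decision (Power, 1 − β)
Fail to reject H0    Correct decision (1 − α)        Type II error (probability β)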

Expanding on the two types of errors goes beyond the scope of this book.
However, it is important to keep in mind that the outcome of a hypothesis test is
not always right. In fact, we can be (1 − α)% confident that we have reached the
right conclusion, and we have an α% chance of being wrong. In this respect, a
choice of an α at the level of 5% or 1%, accompanied by a large enough sample
to improve the power (1 − β) of the test and thus reduce the β error, will most
likely give you a robust result. If you have serious concerns about the results of
a test, try running it again with an increased sample size if possible.

6.5 Two-tailed vs one-tailed tests


The hypothesis testing process is a comparison between a test statistic, which
is calculated based on the sample data and is placed on the standardized
distribution, and the critical values, which are always placed at the tails of the
distribution.

However, which tails we use depends on the nature of the hypothesis we want to
test. For a two-tailed test, like the z-test we have seen already, we are interested
in both directions of the distribution. That is, if the test statistic indicates a value
significantly larger or smaller than the hypothesized μo, then we need to reject
the null hypothesis and accept the alternative one. Significance is reflected in
the alpha value, and thus the α is divided into two equal parts that are placed
on the two tails of the distribution. It may also be possible though that
we are interested in only one direction of the distribution (one-tailed test).

[Figure: three cases side by side.
One-tailed test (left tail): H0: μx ≥ μo vs H1: μx < μo; the whole α is placed on the left tail.
Two-tailed test (no specified direction): H0: μx = μo vs H1: μx ≠ μo; α/2 is placed on each tail.
One-tailed test (right tail): H0: μx ≤ μo vs H1: μx > μo; the whole α is placed on the right tail.]

Figure 14. Two-tailed vs one-tailed tests

Left-tail tests hypothesize that the value will not be smaller than a specified
value. In this case we do not care if the test statistic indicates a significantly
larger value. However, if the test statistic indicates a value that is significantly
smaller, then we need to reject the null hypothesis and accept the alternative
one. Again, significance is determined based on the chosen α value which,
given the focus of one-tailed tests on a single direction, must be placed in
full on the tail of the distribution that we are interested in; in this case, the left
side. Of course, the opposite applies for right-tail tests.

6.6 Number of populations involved


The number of populations involved in hypothesis testing can vary. In some
cases, it will be necessary to test a parameter from a single population, yet a
situation may also require testing parameters from two or even more
populations. The following table uses the mean as an example to illustrate how
this difference affects the way your hypotheses are stated.

Single population
• Logic: the population parameter is tested against a specified value.
• Hypotheses: H0: μx = μ0 vs H1: μx ≠ μ0
• Example: Is the mean weight of the product equal to 2.5 kg?

Two independent populations
• Logic: the parameters of two independent populations are tested to see if there is a significant difference between the two populations.
• Hypotheses: H0: μ1 = μ2 vs H1: μ1 ≠ μ2
• Example: Is there a difference between the salaries of men and women in a company?

Two paired populations
• Logic: the parameters of two related or dependent populations are tested to see whether there is a significant difference between the two populations.
• Hypotheses: H0: μpre = μpost vs H1: μpre ≠ μpost
• Example: Does a particular treatment have an effect on heart rate at rest?

More than two populations
• Logic: the parameter of many populations is tested to see if there is any one that is significantly different.
• Hypotheses: H0: μ1 = μ2 = μ3 = … vs H1: Not all means are equal
• Example: Are the revenues of the last five years significantly different?
Table 13. Number of populations involved in hypothesis testing

Note that it is important to consider the number of populations involved in your
analysis, as this will affect the test statistic that needs to be used in your
decision analysis.


6.7 Making estimations


Tip: Hypothesis testing and making estimations are the two sides of the same coin.

Apart from hypothesis testing, inferential statistics can be used to make
estimations of a population parameter. For example, now that we have shown
that the mean volume value of our cylinders is not 1.5m3 anymore, we may wish
to know what the new mean is. To do so, we need to place the confidence
interval (1 − α), which is formulated by the sampling distribution, around the
value that we have measured in our sample. The necessary calculations are
presented in the following figure.

Generic formula: sample statistic − error ≤ population parameter ≤ sample
statistic + error, with confidence 1 − α.

Estimating the mean value:

$$\mu_0 = \bar{x} \pm z \cdot \frac{\sigma_x}{\sqrt{n}} = 1.507 \pm 1.96 \cdot \frac{0.02}{\sqrt{42}} \;\Rightarrow\; 1.501 \;\text{to}\; 1.513$$

where α = 5%.

Figure 15. Estimating the mean using the z
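In Excel the two interval ends can be computed directly, for example
= 1.507 - NORM.S.INV(0.975) * 0.02 / SQRT(42) ≈ 1.501 and
= 1.507 + NORM.S.INV(0.975) * 0.02 / SQRT(42) ≈ 1.513.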

Depending on what is inferred, the parameters of the generic formula will take
a specific form. For estimating the mean, we follow the statistics from the
distribution of the sample mean. On this basis, we can say that we are 95% (1 −
α) confident that the new mean of the process is somewhere between 1.501m3
and 1.513m3. If we recall the specification limits that the customers requested
(1.5m3 ± 0.8m3), we can understand why they complain. They seem to receive
a significant number of products that are beyond the upper limits.


6.8 Working with p-values


In the test that we ran above, we compared the test statistic (zcalc) with the
critical values (zcrit) to decide the fate of the null hypothesis. Sometimes though
it is preferable to work with what is known as the p-values. This is especially
true when we work with statistical software programmes, like Minitab or
Microsoft Excel, as many of them use p-values in hypothesis testing.

Although p-values and critical values will always lead to the same result, they
approach the test from a different perspective. In the example with the metallic
cylinders, the sample mean was 1.507m3, which differed from the specified
mean (1.5m3) by 0.007m3. From the perspective of p-values, we would ask:

“What is the probability of observing this difference, or more extreme
differences, by pure chance, provided that the null hypothesis is right?”

This probability is illustrated below as an area covered with dashed lines.

Figure 16. P-values on a distribution (for a two-tail test)

In our example, we were interested in both directions and thus we had to
allocate the α equally to both tails. To balance that, we become interested in the
probability of getting either a difference of +0.007 or −0.007, and thus both sides
of the distribution need to be considered. If it were a one-tail test, we would
have both the p-value and the α value allocated to one of the tails.

To calculate the p-value, we need to find the area under the normal curve that
lies outside the two zcalc values in the figure. We can do so by using the
function =NORM.S.DIST(-Zcalc, True), which gives us the area from the
minimum value (left side) up to the -Zcalc value (= 1.17%). This would have been
the p-value if it were a one-tail test. However, we are running a two-tail test
and thus we need to multiply it by 2 (1.17% * 2 = 2.33%) to include
the p-value area from the right side of the distribution (it is symmetrical).

Then, your task is to compare the calculated p-value with the significance level
(α), which essentially is the cut-off probability. The two possible outcomes of
this comparison are:

a. If p-value < α then reject H0. It is very unlikely that the result is due to
pure chance. Probably, something has changed.

b. If p-value > α then fail to reject H0. The probability is not low enough to
suggest that something has changed. The difference is probably due to
pure chance.

Tip: If the p-value is low, the null has to go.

In our example, for an α equal to 5% we can conclude that the p-value is less
than α, and thus we need to reject the null hypothesis. As you can see, whether
you decide to run hypothesis testing with p-values or critical values makes no
difference. Both will give you the same result, since for every calculated test
statistic (zcalc) there is a corresponding p-value. Nevertheless, understanding
both approaches can be useful as it gives you a wider perspective on the issue.


6.9 Degrees of freedom


Although the z-test is very powerful, it requires many assumptions to be met.
For example, we need to know the variance of the population in advance. In
that way, we know the exact shape of the underlying normal distribution and
can thereby make the inferences. However, this is not always possible, and
therefore we are forced to use another test, known as the t-test.

The t-test is modelled by the Student’s t-distribution which, similar to the
normal, is also symmetrical. However, its shape depends on the sample size. The
more the sample size increases, the more the t-distribution resembles the normal.
For small sample sizes (< 120), though, we need to pay the price of having
more variability (standard error) in the sampling distribution, as we estimate the
standard deviation of the population based on the sample data.

[Figure: the Student’s t-distribution compared to the normal; the t has the same
symmetric shape but heavier tails.]
Figure 17. Student’s (t) vs normal (z) distribution

To capture this variability (uncertainty) in the shape of the t-distribution, and in
many other distributions, we use something called the degrees of freedom (ν).
In general, these are equal to the number of sample observations minus the
parameters estimated from the sample data. For example, in the 1-sample t-test
we estimate the standard deviation of the population and thus the degrees of
freedom are ν = n − 1. The formula for calculating the degrees of freedom varies
for different distributions. These formulas will be given for each test in this book.

7 A practical process

There are many distributions and statistical tests that one can run, depending on
what is to be tested. Despite their differences though, all these tests follow more
or less the same logic as the z-test that we have already seen. Thus, the
following steps summarize the knowledge of this chapter and can be used
as a general guide for the tests that will be presented in this book.

Step 1. Define the problem: Clarify the issue and what needs to be tested. Capture it in a question if possible.
Step 2. Choose the right test: Use the maps in this book to guide you through the assumptions of the various tests.
Step 3. Is it one-tailed or two-tailed? If the direction is specified (i.e., larger or smaller than…) then the test is one-tailed, otherwise it is a two-tailed test.
Step 4. Formulate the hypotheses: Capture the nature of the problem to be tested in a set of clear hypotheses.
Step 5. Collect and analyse your sample data: Evaluate the quality of, and summarize, your sample data by using descriptive statistics.
Step 6. Decide on the level of significance (α): This reflects the strength you would like to see in the evidence as well as the risk you are willing to take. Usually, it is either 5% or 1%.
Step 7. Calculate the test statistic and find the critical value or the p-value: Use the formulas in this book to calculate the test statistic; then use either the critical value or the p-value, depending on what is more convenient. In this book we will work with p-values.
Step 8. Decide whether to reject or fail to reject the Ho: Compare the critical to the calculated value, or the p-value to the α, to decide the fate of the Ho.
Step 9. Apply the decision: Translate the result of the test into something that has meaning for those involved in or affected by the decision.
Table 14. Basic steps to hypothesis testing

Part B: Using Microsoft Excel for hypothesis testing

The Microsoft Excel spreadsheet package can be used to carry out many statistical
tests, and with a little manual work from our side it can become comparable to
some of the more advanced statistical software packages on the market. Of
course, it has limitations, and this is the reason why we have also included
Minitab in the discussion.

Before we jump into the tests, we need first to set up Microsoft Excel for
statistics. This requires enabling the “Data Analysis ToolPak”. The process for
doing so is illustrated below.

• Step 1: Select “File” at the top left corner of Microsoft Excel and then
select “Options”.

• Step 2: In the options window select the “Add-ins” tab and then select
“Go…” as it can be seen in the following figure.


• Step 3: In the window that will pop up, check the “Analysis ToolPak”
box and then press OK.

• Step 4: If you have done everything correctly, then in the “Data” tab you
should have the “Data Analysis” option.

For those of you who do not have access to something like Minitab, these notes
will offer a solid and straightforward approach to hypothesis testing using only
Microsoft Excel.

Finally, please note that this guide has been developed on a Windows computer.
Those of you who run other systems may have to adapt the instructions and
steps presented here.


8 Types of data

Tip: Using statistics that are not suitable for the data at hand is one of the most
common mistakes made.

After you clearly define the problem that you are interested in investigating, it
is important to understand the type of data that needs to be analyzed. This is
because different types of tests should be followed for different types of data.

[Figure: Start → What is the type of the data?
• Attributes (→ page 95): proportions or discrete data, usually manifested as a count of a characteristic (i.e., yes/no, times something appears).
• Ordinal (→ page 107): categorical data like attributes data, but it also indicates order (i.e., 1st, 2nd, … last).
• Variables (→ page 61): continuous data that can take on any number (i.e., length, weight…).]

Figure 18. Map of choosing the Type of data

Note that sometimes the categorization of a dataset into a specific type can be
tricky. The following scale aims to help you with that issue. Start from the top
question and, as you go down, check whether your data possesses the indicated
qualities.


[Figure: the scale of data.
• Is the data continuous? Yes → Variables (quantitative): data that can have a “logical” decimal point and take any value on a theoretically infinite scale (i.e., weight, length, volume, size).
• If not, do true differences exist? Yes → Counts, discrete (quantitative): numerical data with logical order, like variables data, but it can take only non-negative integer values (i.e., the number of defects on a product, or the number of people entering a café per hour).
• If not, is the data in order? Yes → Ordinal (qualitative): data in order, like counts, but the “true” value of the intervals (differences) between the categories cannot be known (i.e., Likert scales, preferences, order at the finish line).
• If not, how many categories? 3+ → Counts, nominal (attributes): like binary data, but the phenomenon can have many features (i.e., the number of cars of a specific colour in a parking lot, or the number of visits to various countries). 2 → Binary (attributes): the number of times two opposing features of a phenomenon happen is counted (i.e., Yes/No, Good/Bad quality, Pass/Fail inspection of a product, True/False).
Statistical power decreases as we move down the scale.]

Figure 19. The scale of data


9 Parametric vs non-parametric tests for variables data

The analysis of variables data should be conducted with caution. This is because
variables data is the type with the most assumptions that need to be checked.

[Figure: Start (page 59) → Analyse variables data (page 62).
• Do you have access to Minitab? If not, and the sample size is not larger than 30*, collect more data (page 8).
• Run a normality test (page 68).
• Is the data normally distributed? Yes → parametric tests (page 75).
• If not, you can transform the data (page 73); if you are happy to proceed after the transformation, use parametric tests, otherwise use non-parametric tests (page 115).
* A sample size of at least 30 is suggested when checking for normality in Excel (Central Limit Theorem → page 71).]

Figure 20. Map for choosing between parametric and non-parametric tests

This chapter will help you reach the point of choosing between parametric and
non-parametric tests. These will be then discussed in separate chapters.


9.1 Analyse variables data


After we visually inspect the raw data, we should summarize and explore it.
The aim is to understand what the data is actually trying to say. The following
process applies.

[Figure: run descriptive statistics → check for outliers (box plot) → if outliers
exist, investigate and resolve them → visualise and examine the distribution.]

Figure. The process of analysing variables data

Let us assume a manufacturing company that produces iron sticks for


construction purposes. A key quality characteristic of the products is their
length, with the customers requesting that the products are 19.5 ± 4.5 metres
long. Recently, there have been complaints regarding the quality of the products
and thus you have decided to collect a sample and investigate the process.

ID Length (m) ID Length (m) ID Length (m) ID Length (m)


1 17.3 6 19.2 11 20.6 16 14.9
2 14.5 7 19.0 12 18.4 17 20.0
3 17.5 8 16.4 13 22.1 18 18.4
4 18.6 9 21.2 14 21.3 19 20.8
5 18.0 10 20.3 15 14.4 20 18.3
Table 15. Raw variables data

We will use this example to show how variables data can be analysed as well
as how we can check for normality in Microsoft Excel and Minitab.


9.1.1 Descriptive statistics for variables data


Microsoft Excel is very good at summarizing variables data. The following
presents the steps you need to follow.

Step 1: Arrange your data in a single column, with the title “Length (m)” at the top.

Step 2: Open the Data Analysis window and select “Descriptive statistics”.

Step 3: In the Input Range insert all the cells that you are interested in analysing.
Select the title “Length (m)” too and tick the box “Labels in the first row”.

Step 4: In the Output Range insert an empty cell.

Step 5: Tick the box “Summary statistics”.

Step 6: Then, click OK.

If you follow this process, Microsoft Excel will automatically produce some
descriptive statistics for your data. This can be seen below.


Length (m)
Mean                18.56
Standard Error      0.504944
Median              18.5
Mode                18.4
Standard Deviation  2.258178
Sample Variance     5.099368
Kurtosis            -0.498805
Skewness            -0.455790
Range               7.7
Minimum             14.4
Maximum             22.1
Sum                 371.2
Count               20

We have seen most of these statistics in Part A of this book. Please note that the
standard error is the standard deviation of the sampling distribution of the mean.
Also, the count, range and minimum values will be used later to construct the
histogram.

In Minitab, arrange the dataset in a column and then go to:

Stat → Basic statistics → Display Descriptive Statistics…

This will bring up the following window.


Step 1: Add the variable(s) you would like to summarize in the “Variables” box.

Step 2: Click “Statistics…” to choose the measures you would like to calculate.

Step 3: Click “OK” and then again “OK”.


Minitab calculates the same values as Microsoft Excel. It can also immediately
calculate the quartiles and the interquartile range of the dataset, which will be
helpful in constructing the boxplot.

9.1.2 Box and whisker plots


Box and whisker plots, or boxplots, are very useful visual summaries of a
distribution that reveal information about its shape and possible outliers.
Boxplots are created based on the logic of the interquartile range (IQR), as can
be seen in the example with the length data below.

Step 1: Calculate the quartile values and interquartile range.

Boxplot value
Quartile 1 (Q1) = 17.35 = QUARTILE.EXC(Dataset,1)
Quartile 2 (Median - Q2) = 18.5 = QUARTILE.EXC(Dataset,2)
Quartile 3 (Q3) = 20.525 = QUARTILE.EXC(Dataset,3)
IQR = 3.175 = Q3 - Q1

Step 2: Create the boxplot

[Figure: boxplot of the length data. Min = 14.4, Q1 = 17.35, Q2 (median) = 18.5,
mean X̄ = 18.56, Q3 = 20.525, Max = 22.1; the IQR (Q3 − Q1) = 3.175 covers the
middle 50% of the values, with 25% of the values in each of the four segments.]



Note that we have to use some measures from the table with the descriptive
statistics that has been presented above (mean, min, max).

Step 3: Check for outliers

The first step is to calculate the limits of our dataset beyond which we will treat
values as outliers.

Boundaries for outliers


Extreme min value = 12.59 = Q1 - 1.5*IQR
Extreme max value = 25.29 = Q3 + 1.5*IQR

In principle, outliers should be investigated. However, an outlier is not
necessarily an error or something peculiar that requires action (i.e., removal)
to be taken. It can simply be an expected value of a very skewed distribution.

In our example, it seems that there are no outliers. Both the min and the max
values are within the boundaries set by the two values above.

9.1.3 Visualize the distribution


Understanding the shape of variables data requires constructing the histogram
of the dataset. In order to do that, we need to group our data into meaningful
classes. In Microsoft Excel the following process can be applied.

Step 1: Develop the classes

No. of classes = 1 + 3.322 log(n) = 5.32
Rounded No. of classes = 5
Class range = Range / Rounded No. of classes = 7.7 / 5 = 1.54

Step 2: Create the bins for the classes


A/A   Length (Bins)
1     15.94   ← Min + Class range
2     17.48   ← Previous bin + Class range
3     19.02
4     20.56
5     22.1

Step 3: Open the Data Analysis window and select “Histogram”.

Step 4: In the Input Range add the raw data and in the Bin Range add the bins
that you have calculated (including the headings).

Step 5: Tick the box “Labels”.

Step 6: In the Output Range insert an empty cell.

Step 7: Tick the box “Chart output”.

Step 8: Then, click OK.

Microsoft Excel will then automatically calculate the number of values that
fall into the classes you have created, and will produce the histogram of
your data.


Length (Bins)   Frequency
15.94           3
17.48           2
19.02           7
20.56           3
22.1            4
More            1

Note: By default, Microsoft Excel will leave big gaps between the bars. To
minimize them, right-click on the bars, select “Format data series” and then
take the “Gap Width” down to 0.

Figure 21. A histogram of variables data

As we will see later, in Minitab we can generate the histogram along with the
test for normality which will be discussed next.

9.2 Tests for Normality


Checking whether a dataset derives from a normal distribution or not is
extremely important before any choice of using more advanced statistical
methods is made. In fact, assuming normality enables us to use parametric tests.
These are more powerful in their inferences compared to the non-parametric
ones that do not assume any particular underlying distribution.


9.2.1 Anderson-Darling test


The Anderson-Darling test is a powerful tool that can be used to test a dataset
for normality. Unfortunately, running it in Microsoft Excel is too complicated
and thus it will be shown only in Minitab.

Hypothesis: Ho: The data derives from a normal distribution

H1: The data does not derive from a normal distribution

Path: Stat → Basic Statistics → Graphical Summary → Fill Variables Box → OK

Figure 22. Summary report in Minitab

The p-value of the Anderson-Darling test is 0.463, which is well above α = 5%;
we therefore fail to reject the hypothesis that the dataset derives from a normal
distribution. Note that Minitab automatically generates additional output, such
as the histogram and the boxplot values.

The following section will present the χ2 – test “Goodness of fit”, which is an
alternative that can be used in Microsoft Excel.


9.2.2 χ2 – test “Goodness of fit”


Aim: To test whether sample data comes from a normally distributed population.

Hypothesis: Ho: The data derives from a normal distribution

H1: The data does not derive from a normal distribution

Assumptions: Significant sample size (preferably > 30).

Expected values > 5 (Ideal / Strict assumption)

Expected values > 1, with more than 80% of them > 5 (Relaxed assumption)

Test statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Where:
O : Observed values in the sample data
E : Expected values based on the assumed model

Degrees of freedom: ν = k − p − 1

Where:
k : No. of categories or classes in the sample data
p : No. of parameters estimated from the sample data (for normality tests this is equal to 2).

Distribution (Model) Χ2

P-value =CHISQ.DIST.RT(χ2,ν)

Minitab Path: Not suitable.

Notes

When testing for normality, the result of the test will be affected by the way the
classes have been formed. It is suggested to apply the formula 1 + 3.322 log(n)
to calculate the No. of classes, as has been shown above.

Include the class “More” that is generated with the histogram, if its frequency is not 0.
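As a sketch of the mechanics with the length data (mean 18.56, standard
deviation 2.258, n = 20): the expected count for a class is the class probability
under the fitted normal times n. For the first class (values up to 15.94) this is
= 20 * NORM.DIST(15.94, 18.56, 2.258, True) ≈ 2.46; for an interior class,
subtract the two cumulative probabilities, e.g.
= 20 * (NORM.DIST(17.48, 18.56, 2.258, True) - NORM.DIST(15.94, 18.56, 2.258, True)).
Summing (O − E)^2 / E over the classes gives χ2, and with k classes the p-value
is = CHISQ.DIST.RT(χ2, k - 3).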


9.2.3 Central limit theorem


The world of statistics is governed by a principle: the larger the sample size, the
better. For example, in the context of the χ2 test for normality, a larger sample
size makes it easier to meet the assumption that the expected values are more
than 5. However, why is the number 30 suggested as an ideal minimum?

The answer comes from the “golden rule” of the Central Limit Theorem:

“ . . . for any distribution with a well-defined mean and variance and given
random and independent samples of n observations each, the distribution of
sample means approaches normality as the size of n increases, regardless of
the shape of the population distribution.”

Graphically this is shown in the following figure.

[Figure: with samples of n = 1 the sampling distribution of the mean X̄ mirrors
the population; as n grows beyond 30 it becomes approximately normal.]

Figure 23. Sample size impact on the sampling distribution of the means

The exact sample size needed to formulate a normal sampling distribution of
the means depends on the characteristics of the population. However, the
golden rule suggests that it is usually necessary to have a sample size of at
least 30 observations.


Be aware that when a sample size of 30+ observations is used, one cannot
automatically assume normality and apply parametric tests. This is a popular,
yet worrying, misconception. The test for normality should be run no matter
what. A sample of 30+ observations can only offer additional security.

When the sample size is less than 30, as in the example above where we had
only 20 observations, try to increase the sample. If this is not possible, check
whether the expected values meet the assumptions of the test. In our case, all
the values are larger than 1 (the last one is very close), with one value being
larger than 5 and one very close to it. Thus, although a borderline case, we could
proceed with caution.

9.3 Transform your data


What if your data is not normal? In such cases, you have two options: either
proceed to non-parametric tests (please note that some of them are difficult to
run in Microsoft Excel) or try to transform your data. The latter requires
applying a mathematical function (i.e., log(x) or the n-th root of x) so that the
transformed data can meet the assumption of normality. This will then allow
one to run parametric tests.

The process one needs to follow includes 5 main steps.

1. Select a mathematical function to use.

2. Check transformed data for normality (if not normal revisit step 1).

3. Calculate the mean and standard deviation of the transformed data.

4. Use parametric tests with extreme caution.

5. If necessary, apply the inverse function to bring the data back to its
original form.
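For instance, if a right-skewed dataset sits in column A, a hypothetical log
transformation is = LOG10(A2) copied down a helper column; the normality test
is then run on the helper column, and any mean calculated there can be brought
back to the original scale with the inverse = 10^x.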


Be very cautious with such transformations. Some statisticians suggest that it
might be better to simply use a non-parametric test instead of “messing with
your data”. If you want to play safe, you may decide to run both parametric
and non-parametric tests and then compare the results.


10 Parametric tests for variables data

[Figure: Start (page 61) → Are you testing means or variances?
Variances, how many?
• 1 → χ2-test (page 76)
• 2 → F-test (page 80)
• >2 → Bartlett’s test (page 88)*
Means, how many?
• 1 → 1-sample t-test (page 78)
• 2, in pairs → paired t-test (page 86)
• 2, independent → F-test (page 80): if the variances are equal → 2-sample t-test (page 82); if not → Welch’s t-test (page 84)
• >2 → Bartlett’s test (page 88)*: if the variances are equal → One-way ANOVA (page 90); if not → Welch’s One-way ANOVA (page 92)*
* Difficult to run in Excel.]

Figure 24. Map of parametric tests


10.1 χ2 – test
Aim: To test whether a sample derives from a population with a specified
variance.

Hypothesis: Ho: 𝜎 2 = 𝜎𝑜2

H1: 𝜎 2 ≠ 𝜎𝑜2

Assumptions: Data should be normally distributed*


Measurements should be independent

Test statistic:

$$\chi^2 = \frac{(n-1)\hat{\sigma}^2}{\sigma_0^2}$$

Where:
σ̂2 : Variance estimated from the sample
σo2 : Specified value of variance

Degrees of freedom: ν = n − 1, where n is the sample size.

Distribution (Model) χ2

P-value:
= CHISQ.DIST.RT(χ2, ν) → Right tail
= 1 - CHISQ.DIST.RT(χ2, ν) → Left tail
= 2 * (Min One-tail p-value) → Two tail

Confidence interval:

$$\frac{(n-1)\hat{\sigma}^2}{\chi^2_{1-\alpha/2,\,\nu}} \;\text{to}\; \frac{(n-1)\hat{\sigma}^2}{\chi^2_{\alpha/2,\,\nu}}$$

Minitab Path: Stat → Basic Statistics → One-Sample Variance

Notes
This test is very sensitive to normality. Minitab also offers a p-value based on
Bonett’s method, which may be useful to check as well, especially if the data
is non-normal.


10.2 1-sample t-test


Aim: To test whether a sample derives from a population with a specified mean.

Hypothesis: Ho : 𝜇𝜒 = 𝜇0
H1 : 𝜇𝜒 ≠ 𝜇0

Assumptions: Data should be normally distributed


Measurements should be independent

Test statistic:

$$t = \frac{\bar{x} - \mu_0}{\hat{\sigma}/\sqrt{n}}$$

Where:
x̄ : Sample mean
μ0 : Specified population mean value
σ̂ : Sample standard deviation
n : Sample size

Degrees of freedom Where:

ν = n−1 n : sample size

Distribution (Model) Student’s t

P-value:
= T.DIST.2T(ABS(t), ν) → Two tail
= 1 - T.DIST.RT(t, ν) → Left tail
= T.DIST.RT(t, ν) → Right tail

Confidence intervals: $\bar{x} \pm t_{\alpha/2,\nu}\,\hat{\sigma}/\sqrt{n}$

Minitab Path: Stat → Basic Statistics → 1-Sample t

Notes
In most cases it will be preferred to the z-test as the population variance is
difficult to be known with certainty.
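As a quick illustration with the iron-stick sample from section 9.1 (x̄ = 18.56,
σ̂ = 2.258, n = 20) tested against the customers' specified length of μ0 = 19.5:
t = (18.56 − 19.5) / (2.258 / SQRT(20)) ≈ −1.86 with ν = 19, and
= T.DIST.2T(ABS(-1.86), 19) ≈ 0.08, so at α = 5% we would fail to reject Ho.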

10.3 F-test
Aim: To test whether two independent samples derive from population
distributions with equal variances.

Hypothesis: Ho: σ1² = σ2²

H1: σ1² ≠ σ2²

Assumptions: Data of both samples should be normally distributed*


Samples and their measurements should be independent
Test statistic:

$$F = \frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2}$$

Where:
σ1² : Variance of sample 1
σ2² : Variance of sample 2

Degrees of freedom: ν1 = n1 − 1 (numerator), ν2 = n2 − 1 (denominator)

Distribution (Model) F

P-value:
= F.DIST.RT(F, ν1, ν2) → Right tail
= 1 - F.DIST.RT(F, ν1, ν2) → Left tail
= 2 * (Min One-tail p-value) → Two tail

Confidence interval:

$$\frac{\sigma_1^2}{\sigma_2^2} \cdot F_{1-\alpha/2;\,\nu_2,\nu_1} \;\text{to}\; \frac{\sigma_1^2}{\sigma_2^2} \cdot F_{\alpha/2;\,\nu_2,\nu_1}$$

Minitab Path: Stat → Basic Statistics → Two-Sample Variance (Select


options and tick the box “Use test ..… based on normal
distribution”)
Notes
The test is very sensitive to the normality assumption.

10.4 2-sample t-test


Aim: To test whether two independent samples derive from populations with
equal means, provided that they have equal variances in the population.

Hypothesis: Ho:𝜇1 = 𝜇2

H1: 𝜇1 ≠ 𝜇2

Assumptions: Data of both samples should be normally distributed


Samples and their measurements should be independent
𝜎12 = 𝜎22

Test statistic:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - k}{\hat{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

Where:
x̄i : Mean of sample i
k : Hypothesized difference
ni : Size of sample i
σ̂ : Pooled standard deviation*

Degrees of freedom: ν = n1 + n2 − 2
Distribution (Model) Student’s t

P-value:
= T.DIST.2T(ABS(t), ν) → Two tail
= 1 - T.DIST.RT(t, ν) → Left tail
= T.DIST.RT(t, ν) → Right tail

Confidence intervals: $(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\nu}\,\hat{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$

Minitab Path: Stat → Basic Statistics → 2-Sample t (Select options and tick
the box “Assume equal variances)

Notes*
Calculating the 𝜎̂ is challenging. Use the Data Analysis ToolPak.
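For reference, the pooled standard deviation combines the two sample variances
weighted by their degrees of freedom:
$\hat{\sigma} = \sqrt{\frac{(n_1-1)\hat{\sigma}_1^2 + (n_2-1)\hat{\sigma}_2^2}{n_1+n_2-2}}$.
In Excel a sketch of this is
= SQRT(((n1-1)*VAR.S(range1) + (n2-1)*VAR.S(range2)) / (n1+n2-2)),
with n1 and n2 obtained via COUNT.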


10.5 2-sample t-test (Aspin-Welch)


Aim: To test whether two independent samples derive from populations with
equal means, provided that they have unequal variances in the population.

Hypothesis: Ho:𝜇1 = 𝜇2
H1: 𝜇1 ≠ 𝜇2

Assumptions: Data of both samples should be normally distributed


Samples and their measurements should be independent
𝜎12 ≠ 𝜎22
Test statistic:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - k}{\sqrt{\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2}}}$$

Where:
x̄i : Mean of sample i
k : Hypothesized difference
ni : Size of sample i
σ̂i² : The variance of sample i

Degrees of freedom:

$$\nu = \frac{\left(\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2}\right)^2}{\frac{1}{n_1-1}\left(\frac{\hat{\sigma}_1^2}{n_1}\right)^2 + \frac{1}{n_2-1}\left(\frac{\hat{\sigma}_2^2}{n_2}\right)^2}$$

Distribution (Model) Student’s t

P-value:
= T.DIST.2T(ABS(t), ν) → Two tail
= 1 - T.DIST.RT(t, ν) → Left tail
= T.DIST.RT(t, ν) → Right tail

Confidence intervals: $(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\nu}\sqrt{\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2}}$

Minitab Path: Stat → Basic Statistics → 2-Sample t (Select options and untick
the box “Assume equal variances”)
Notes
Calculating the ν can be challenging. Use the Data Analysis ToolPak.


10.6 Paired t-test


Aim: To test whether two dependent samples derive from populations with
equal means.

Hypothesis: Ho: 𝜇1 = 𝜇2

H1: 𝜇1 ≠ 𝜇2

Assumptions: Data of both samples has no outliers (ideally, they should be


normally distributed)
The paired differences should be normally distributed
Measurements should be independent
Test statistic:

$$t = \frac{\bar{d} - k}{\hat{\sigma}_d/\sqrt{n}}$$

Where:
d̄ = Σdj / n, with dj = x1,j − x2,j (the paired differences)
n : No. of paired values
σ̂d : Standard deviation of the differences
Degrees of freedom Where:

ν=𝑛−1 𝑛 : No. of paired values

Distribution (Model) Student’s t

P-value:
= T.DIST.2T(ABS(t), ν) → Two tail
= 1 - T.DIST.RT(t, ν) → Left tail
= T.DIST.RT(t, ν) → Right tail

Confidence intervals: $\bar{d} \pm t_{\alpha/2,\nu}\,\hat{\sigma}_d/\sqrt{n}$

Minitab Path: Stat → Basic Statistics → Paired t

Notes
Samples need to be of equal size and paired.


10.7 Bartlett’s test


Aim: To test whether k independent samples derive from populations with
equal variances.

Hypothesis: Ho: σ1² = σ2² = σ3² = … = σk²

H1: At least one of the variances is not equal

Assumptions: Data of all samples should be normally distributed


Samples and their measurements should be independent
All sample sizes should be larger than 6

Test statistic:

$$B = \frac{2.3026}{C}\left[(N-k)\log\left(\frac{\sum (n_j-1)S_j^2}{N-k}\right) - \sum (n_j-1)\log S_j^2\right]$$

Where:

$$C = 1 + \frac{1}{3(k-1)}\left\{\sum \frac{1}{n_j-1} - \frac{1}{N-k}\right\}$$

nj : Size of sample j
Sj² : Variance of sample j
N : Total number of observations
k : No. of samples compared

Degrees of freedom Where:


ν=k−1 k : No. of samples compared

Distribution (Model) χ2 (Approximated when 𝑛𝑗 > 6)

P-value = CHISQ.DIST.RT(B,ν)

Minitab Path: Stat → ANOVA → Test for equal variances (Select options
and tick the box “Use test based on normal distribution”)

Notes
This test is difficult to run in Excel as we cannot automate the calculations.


10.8 One-way ANOVA


Aim: To test whether k independent samples derive from populations with
equal means, provided that they have equal variances.

Hypothesis: Ho: μ1 = μ2 = μ3 = … = μk

H1: At least one of the means is not equal

Assumptions: Data of all samples should be normally distributed*


Samples and their measurements should be independent
𝜎12 = 𝜎22 = 𝜎32 … … . . = 𝜎𝑘2
Groups should have similar sample sizes (ideally equal)

Test statistic:

$$F = \frac{MSG}{MSW}$$

Where:
MSG = mean sum of squares between groups: $MSG = \frac{SSG}{k-1} = \frac{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{\bar{x}})^2}{k-1}$
MSW = mean sum of squares within groups: $MSW = \frac{SSW}{N-k} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2}{N-k}$

Degrees of freedom: νMSG = k − 1, νMSW = N − k

Where:
k : Number of populations compared
N : Total number of observations

Distribution (Model) F

P-value = F.DIST.RT(F, νMSG, νMSW)

Minitab Path: Stat → ANOVA → One-way …. (Select options and tick


the box “Assume equal variances”)

Notes
You can use the Data Analysis ToolPak to run the test

* The test is very sensitive to this assumption.


10.9 Welch One-way ANOVA


Aim: To test whether k independent samples derive from populations with
equal means, provided that they have unequal variances.

Hypothesis: Ho: μ1 = μ2 = μ3 = … = μk

H1: At least one of the means is not equal

Assumptions: Data of all samples should be normally distributed


Samples and their measurements should be independent
The variances of the populations do not need to be equal

Test statistic:

$$F = \frac{\frac{1}{k-1}\sum_{j=1}^{k} w_j(\bar{x}_j - \bar{x}')^2}{1 + \frac{2(k-2)}{k^2-1}\sum_{j=1}^{k}\left(\frac{1}{n_j-1}\right)\left(1 - \frac{w_j}{w}\right)^2}$$

Where:

$$w_j = \frac{n_j}{S_j^2}, \qquad w = \sum_{j=1}^{k} w_j, \qquad \bar{x}' = \frac{\sum_{j=1}^{k} w_j \bar{x}_j}{w}$$

nj : Size of sample j
Sj² : Variance of sample j
k : No. of samples compared

Degrees of freedom:

$$\nu = \frac{k^2 - 1}{3\sum_{j=1}^{k}\left(\frac{1}{n_j-1}\right)\left(1 - \frac{w_j}{w}\right)^2}$$

Distribution (Model) F

P-value = F.DIST.RT(F,k-1,ν)

Minitab Path: Stat → ANOVA → One-way …. (Select options and untick


the box “Assume equal variances”)
Notes


11 Tests for attributes data

Hypothesis testing for attributes data utilizes the frequencies observed during
the examination of a phenomenon.

[Figure: Start (page 59) → Analyse attributes data (page 96) → What type of attributes?
Binary, how many proportions?
• 1 → 1-proportion test (page 97)
• 2, independent → 2-proportion test (page 98)
• 2, paired → McNemar’s test (page 100)
Counts (discrete or nominal), how many columns?
• 1 column → χ2 Goodness of fit (page 102)
• >1 column → χ2 test for independence (page 104)]

Figure 25. Map of tests for attributes data

Note that attributes data, especially nominal, is the least flexible type of data
when it comes to applying statistical analysis. For accuracy purposes, we prefer
to measure, if possible, at the variables or at least the ordinal level.


11.1 Analyze attributes data


To summarize attributes data, we utilize the observed frequencies, which we then
illustrate in a bar chart. Visuals can be very helpful as they allow us to spot
patterns and thereby analyze the case at hand.

For example, the following figure shows the number of trips (counts - nominal
data) that a travel agent has organized towards six popular destinations
throughout a year. As a reference for comparison, the targets that the company
set at the beginning of the year have been plotted too.

[Figure: bar chart of No. of trips vs target per destination.

Destination       No. of trips   Target
Norway            26             25
Czech republic    28             35
Italy             34             40
Spain             48             30
Greece            29             25
United Kingdom    37             35]

Figure 26. No. of trips vs targets for last year

We can see that for some destinations the company met the targets while for
others it did not. The question then is whether the differences observed between
the targets and the number of trips are significant or simply an outcome of
random variation (pure chance). You can also ask whether there is any
preference towards a destination among your customers. These types of
questions are the focus of the inferential statistics that we will explore below.


11.2 1-Proportion test


Aim: To test whether a sample proportion is equal to a specified value in the
population.

Hypothesis: Ho : 𝑝 = 𝑝0

H1 : 𝑝 ≠ 𝑝0

Assumptions: Observations should be independent

n · p̂ ≥ 10 *
n · (1 − p̂) ≥ 10 *

Test statistic:

z = (p̂ − p₀) / σₚ,    where σₚ = √( p₀(1 − p₀) / n )

Where:
p̂ : Sample proportion
p₀ : Specified population proportion value
n : Sample size

Distribution (Model) Approximated by the z

= NORM.S.DIST(z, True) → Left tail


P-value = 1 - NORM.S.DIST(z, True) → Right tail
= 2 * (Min One-tail p-value) → Two tail

Confidence intervals:    p̂ ± zα/2 √( p̂(1 − p̂) / n )

Minitab Path: Stat → Basic Statistics → 1 Proportion

Notes
If you cannot meet the assumptions, you should try to increase the sample size.
If not possible, you can use 5, instead of 10, as a minimum required value, but
in this case, proceed with caution.
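The test is easy to mirror outside Excel. A minimal Python sketch (assuming scipy; the counts are hypothetical) follows the same formulas as above:

    from math import sqrt
    from scipy.stats import norm

    n, successes, p0 = 200, 120, 0.5   # hypothetical sample and hypothesized p0
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    # norm.cdf(z) = NORM.S.DIST(z, TRUE); norm.sf(z) = 1 - NORM.S.DIST(z, TRUE)
    p_two_tail = 2 * min(norm.cdf(z), norm.sf(z))
    print(z, p_two_tail)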


11.3 2-Proportion test


Aim: To test whether two proportions are equal in the population.

Hypothesis: Ho : 𝑝1 = 𝑝2
H1 : 𝑝1 ≠ 𝑝2

Assumptions: Samples and their observations should be independent


nᵢ · p̂ᵢ ≥ 10 *
nᵢ · (1 − p̂ᵢ) ≥ 10 *

Test statistic:

z = ((p̂₁ − p̂₂) − k) / σ̂,    where σ̂ = √( p̂₁(1 − p̂₁)/n₁ + p̂₂(1 − p̂₂)/n₂ )

Where:
p̂ᵢ : Proportion of sample i
nᵢ : Size of sample i
k : Specified difference

Distribution (Model) Approximated by the z

= NORM.S.DIST(z, True) → Left tail


P-value = 1 - NORM.S.DIST(z, True) → Right tail
= 2 * (Min One-tail p-value) → Two tail

Confidence intervals:    (p̂₁ − p̂₂) ± zα/2 · σ̂

Minitab Path: Stat → Basic Statistics → 2 Proportions

Notes
If you cannot meet the assumptions, you should try to increase the sample size.
If not possible, you can use 5, instead of 10, as a minimum required value, but
in this case, proceed with caution.
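As with the 1-proportion test, a short Python sketch (hypothetical counts; k = 0 tests for equality) mirrors the Excel formulas:

    from math import sqrt
    from scipy.stats import norm

    # hypothetical counts: x successes out of n trials per sample
    n1, x1, n2, x2, k = 180, 72, 220, 110, 0
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = ((p1 - p2) - k) / se
    print(2 * min(norm.cdf(z), norm.sf(z)))   # two-tail p-value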

11.4 McNemar’s test


Aim: To test whether two paired proportions are equal in the population.

Hypothesis: Ho : 𝑝𝑝𝑟𝑒 = 𝑝𝑝𝑜𝑠𝑡


H1 : 𝑝𝑝𝑟𝑒 ≠ 𝑝𝑝𝑜𝑠𝑡

Assumptions: Samples should be paired


Observations should be independent
Categories should be mutually exclusive

Test statistic:

χ² = (ABS(StF − FtS) − 1)² / N

Where:
StF : No. of pairs that changed from Success to Failure
FtS : No. of pairs that changed from Failure to Success
N = StF + FtS

Degrees of freedom ν=1

Distribution (Model) χ² (& the binomial distribution)

P-value* = CHISQ.DIST.RT(χ²,ν)

Confidence intervals:

δ̂ ± ( zα/2 · SE + 1/n )

Where:
δ̂ = (StF − FtS) / n
SE = √( StF + FtS − n·δ̂² ) / n
n : No. of paired comparisons made

Minitab Path: Stat → Tables→ Cross tabulation and Chi-Square

Notes*
A p-value is also offered by the formula 2*BINOM.DIST(p,N,0.5,TRUE), where p is the StF frequency. The minimum of the two generated p-values can then be used as the p-value.
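A small Python sketch (hypothetical discordant counts) computes both the corrected χ² p-value and an exact binomial one; note that the binomial line below uses the smaller of the two discordant counts, which yields the exact two-sided p-value directly:

    from scipy.stats import chi2, binom

    StF, FtS = 8, 19                   # hypothetical discordant pair counts
    N = StF + FtS
    x2 = (abs(StF - FtS) - 1) ** 2 / N             # continuity-corrected statistic
    p_chi2 = chi2.sf(x2, 1)                        # CHISQ.DIST.RT(x2, 1)
    p_exact = 2 * binom.cdf(min(StF, FtS), N, 0.5) # exact binomial alternative
    print(p_chi2, p_exact, min(p_chi2, p_exact))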

11.5 χ² – Test “Goodness of fit”


Aim: To test whether there is a significant difference between observed values
and a distribution of expected values.

Hypothesis: Ho: There is no significant difference between observed and


expected values

H1: There is a significant difference between observed and


expected values

Assumptions: Expected values should be > 5


Observations should be independent
Categories should be mutually exclusive

Test statistic:

χ² = Σ (Ο − Ε)² / Ε

Where:
Ο : Observed values in the sample data
Ε : Expected values (refer to the notes)

Degrees of freedom:    ν = k − 1,    where k : No. of response categories in the sample data

Distribution (Model): χ²

P-value = CHISQ.DIST.RT(χ²,ν)

Minitab Path: Stat → Tables → Chi-Square Goodness-of-fit Test

Notes

There are different ways to calculate the expected values. You can use the
average of the counts to be tested or use target counts (i.e., targets or historical
data) that are converted to proportions as a comparison basis. The example that
is presented uses both approaches.
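Using the trip data of Figure 26 as a worked illustration, the sketch below (assuming scipy) rescales the targets into proportions so that the expected counts share the total of the observed counts:

    from scipy.stats import chisquare

    observed = [26, 28, 34, 48, 29, 37]   # trips from Figure 26
    targets  = [25, 35, 40, 30, 25, 35]   # targets from Figure 26
    # scale the targets into expected counts with the same total as the observed data
    expected = [t * sum(observed) / sum(targets) for t in targets]
    stat, p = chisquare(observed, f_exp=expected)
    print(stat, p)                        # p = CHISQ.DIST.RT(stat, k-1)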


11.6 χ² – Test of Independence


Aim: To test whether two categorical variables are associated in the population.
Also, it is used to test whether k unknown proportions for a single category are
equal in the population.

Hypothesis: Ho: There is no relationship between the two categories


H1: There is a relationship between the two categories

Assumptions: The categories within each variable should be independent


Observations should be independent
Expected values should be > 5

Test statistic:

χ² = Σ (Ο − Ε)² / Ε,    where Ε = (Σ Row × Σ Column) / Σ Total

Where:
Ο : Observed values in the sample data
Ε : Expected values

Degrees of freedom:    ν = (r − 1) × (c − 1)

Where:
r : No. of rows in the two-way table
c : No. of columns in the two-way table

Distribution (Model): χ²

P-value = CHISQ.DIST.RT(χ²,ν)

Minitab Path: Stat → Tables → Chi-Square Test for association

Notes
In order to run this test in Microsoft Excel, you need to set up a two-way table with the observed frequencies and then calculate the expected values. For proportions, this takes the form of a 2 (success/failure) × k table, where k is the number of proportions to be tested.
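In Python, scipy builds the expected values and the p-value in one call; the two-way table below is hypothetical:

    from scipy.stats import chi2_contingency

    # hypothetical 2 x 3 two-way table of observed frequencies
    table = [[30, 45, 25],
             [20, 35, 45]]
    chi2_stat, p, dof, expected = chi2_contingency(table)
    print(chi2_stat, p, dof)   # dof = (r-1)*(c-1); expected = row*column/total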

A practical guide to applied statistics

12 Tests for ordinal data

Ordinal data is similar to attributes data. However, its additional characteristic of being ordered allows us to run more advanced statistics. These usually utilize the ranks of the data and compare medians.

Start (Page 59) → Analyse ordinal data. How many samples?

- 1 → 1-sample sign test (→ Page 108)
- 2, not in pairs → Mann-Whitney test (→ Page 110)
- 2, in pairs → 2-sample sign test (→ Page 108)
- >2 → Mood’s Median Test (→ Page 112)

Figure 27. Map of tests for ordinal data

12.1 Analyze ordinal data


The analysis process for ordinal data is similar to the one we follow for attributes data. We again use the frequencies observed in samples, which we then visualize in a bar chart or another meaningful means of illustration. The difference is that, because this type of data is ordered, we can also calculate medians and ranges for summary and comparison purposes. These have been discussed in Part A of this book.


12.2 1-sample sign test (2-sample sign test*)


Aim: To test whether a sample derives from a population with a specified
median.
Hypothesis: Ho : 𝜂1 = 𝜂0
H1 : 𝜂1 ≠ 𝜂0

Assumptions: The test is distribution free


Measurements should be independent
Data should be measured at least at ordinal level

Test statistic:

D+ : The number of values in the sample data larger than the specified median
D− : The number of values in the sample data smaller than the specified median
N : The total number of deviations (sample size − ties)

Distribution (Model) Binomial distribution

= BINOM.DIST(D-,N,0.5,TRUE) → Right tail

P-value = BINOM.DIST(D+,N,0.5,TRUE) → Left tail

= 2 * (Min One-tail p-value) → Two tail

Minitab Path: Stat → Nonparametrics → One-Sample Sign

Notes*
The 2-sample sign test follows the same process but applies the test to the values
that derive from the paired differences of the two samples (paired test).
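The binomial p-value is straightforward to compute in Python; the ordinal scores below are hypothetical:

    from scipy.stats import binom

    data = [7, 4, 6, 8, 5, 5, 9, 6, 7, 8]   # hypothetical ordinal scores
    eta0 = 5                                 # hypothesized median
    d_plus = sum(x > eta0 for x in data)
    d_minus = sum(x < eta0 for x in data)
    N = d_plus + d_minus                     # ties with eta0 are excluded
    p_two_tail = 2 * min(binom.cdf(d_plus, N, 0.5), binom.cdf(d_minus, N, 0.5))
    print(d_plus, d_minus, p_two_tail)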


12.3 Mann-Whitney test


Aim: To test whether two independent samples derive from populations with
equal medians.
Hypothesis: Ho : 𝜂1 = 𝜂2
H1 : 𝜂1 ≠ 𝜂2
Assumptions: Samples should have equal variances (n/a for ordinal data)
Samples and their measurements should be independent
Data should be measured at least at ordinal level
Test statistic:

Preparation: Rank all the values in reference to the entire dataset. Then:

Uᵢ = SRᵢ − nᵢ(nᵢ + 1)/2,    μᵤ = n₁n₂ / 2

σᵤ = √( (n₁n₂ / 12) · ((n₁ + n₂ + 1) − adj) )

z = (Uₘₐₓ − μᵤ − 0.5) / σᵤ *

adj = Σᵢ₌₁ᵏ (tᵢ³ − tᵢ) / ( (n₁ + n₂)(n₁ + n₂ − 1) )

Where:
SRᵢ : Sum of relative ranks for sample i
tᵢ : No. of observations tied at value i (the tie correction term)
Distribution (Model) Approximated by the Z distribution

P-value:
                                      U₁ > U₂      U₁ < U₂
= NORM.S.DIST(Z,TRUE)                 Left tail    Right tail
= 1 - NORM.S.DIST(Z,TRUE)             Right tail   Left tail
= 2 * (Min One-tail p-value)          Two tail     Two tail

Minitab Path: Stat → Nonparametrics → Mann-Whitney…
Notes
The adjusted σᵤ is more accurate, but the non-adjusted one is more conservative and is thus preferred in close calls.
When the distributions do not have a similar shape, the test can only evaluate whether there are differences between the distributions of the two samples.
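In Python, scipy handles the ranking and tie adjustment for you (it picks an exact or normal-approximation method depending on sample size and ties); the samples below are hypothetical:

    from scipy.stats import mannwhitneyu

    a = [12, 15, 11, 18, 14, 16]   # hypothetical samples
    b = [10, 13, 9, 12, 11, 14]
    u, p = mannwhitneyu(a, b, alternative="two-sided")
    print(u, p)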


12.4 Mood’s Median test


Aim: To test whether k independent samples derive from populations with
equal medians.

Hypothesis: Ho : η₁ = η₂ = η₃ = ⋯ = ηₖ

H1 : At least one of the medians is not equal

Assumptions: Samples and their measurements should be independent


Data should be measured at least at ordinal level
χ2 assumptions need to be met

Test statistic:

χ² = Σ (Ο − Ε)² / Ε,    where Ε = (Σ Row × Σ Column) / Σ Total

Where:
Ο : Observed deviations from the Grand Median*
Ε : Expected values

Degrees of freedom:    ν = (r − 1) × (c − 1)

Where:
r : No. of rows in the two-way table (always 2, so r − 1 = 1)
c : No. of columns in the two-way table

Distribution (Model): χ²

P-value = CHISQ.DIST.RT(χ²,ν)

Minitab Path: Stat → Nonparametrics → Mood’s-Median Test…

Notes *

Calculate the median of all the data (Grand Median). Then, create a two-way table that counts how many values are larger than the Grand Median as well as how many are smaller than or equal to it. Treat this table as your observed values in a χ² – Test of Independence.
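scipy performs exactly this procedure; ties="below" counts values equal to the grand median as below it, matching the description above. The samples are hypothetical:

    from scipy.stats import median_test

    g1 = [22, 25, 28, 24, 26]   # hypothetical samples
    g2 = [19, 23, 21, 20, 24]
    g3 = [27, 29, 26, 30, 28]
    stat, p, grand_median, table = median_test(g1, g2, g3, ties="below")
    print(stat, p, grand_median)
    print(table)   # counts above / at-or-below the grand median per sample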


13 Nonparametric tests for variables data

Start (Page 59) → Testing medians or variances?

• Variances — how many?
  - 1 → Bonett’s test (→ Page 76; only in Minitab)
  - 2+ → Levene’s test (→ Page 116)

• Medians — how many samples?
  - 1 → Check for symmetry (→ Page 62): if symmetric → 1-sample Wilcoxon test (→ Page 118); if not → 1-sample sign test (→ Page 108)
  - 2, in pairs → 2-sample Wilcoxon test (→ Page 118)
  - 2, not in pairs → Levene’s test (→ Page 116): if equal variances → Mann-Whitney test (→ Page 110); if not → Mood’s Median Test (→ Page 112)
  - >2 → Levene’s test (→ Page 116): if similar shape, equal variances and lack of outliers → Kruskal-Wallis Test (→ Page 120); if not → Mood’s Median Test (→ Page 112)

Figure 28. Map of nonparametric tests


13.1 Levene’s test


Aim: To test whether two or more independent samples derive from population
distributions with equal variances.
Hypothesis: Ho: σ₁² = σ₂² = ⋯ = σₖ²
H1: At least one of the variances is not equal
Assumptions: Samples and their measurements should be independent
ANOVA assumptions apply to the formulated deviations
Test statistic Create a table of deviations for each value in each sample
from the sample’s median:
Deviation ij = ABS(xij – median i)
Then, apply a One-Way ANOVA test to this table.
F = MSG / MSW

Where:
MSG : Mean sum of squares between groups = SSG / (k − 1) = Σᵢ₌₁ᵏ nᵢ(x̄ᵢ − x̿)² / (k − 1)
MSW : Mean sum of squares within groups = SSW / (N − k) = Σᵢ₌₁ᵏ Σⱼ₌₁ⁿⁱ (xᵢⱼ − x̄ᵢ)² / (N − k)
Degrees of freedom:    νMSG = k − 1,    νMSW = N − k
Where k : Number of populations compared; N : Total number of observations
Distribution (Model) F
P-value = F.DIST.RT(F, νMSG, νMSW)
Minitab Path: Stat → ANOVA → Test for equal variances (Uncheck
“Use test based on normal distribution” in options)
Notes
Use the Data Analysis ToolPak to run the ANOVA test.
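scipy implements the median-centred (Brown–Forsythe) variant of Levene’s test described above in a single call; the samples are hypothetical:

    from scipy.stats import levene

    g1 = [22.1, 25.4, 28.0, 24.3, 26.2]   # hypothetical samples
    g2 = [19.5, 23.0, 21.7, 20.1, 24.8]
    g3 = [27.3, 29.9, 26.4, 30.2, 28.8]
    # center="median" builds the ABS(x - median) deviations, as in the table above
    stat, p = levene(g1, g2, g3, center="median")
    print(stat, p)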


13.2 1-sample Wilcoxon test (+2-sample Wilcoxon)*


Aim: To test whether a sample derives from a population with a specified
median.
Hypothesis: Ho : 𝜂1 = 𝜂0
H1 : 𝜂1 ≠ 𝜂0
Assumptions: Measurements should be independent
The distribution of the sample needs to be relatively
symmetric (check histogram).
Test statistic:

Z = ( W − n₀(n₀ + 1)/4 ) / SE,    where SE = √( n₀(n₀ + 1)(2n₀ + 1) / 24 )

Where:
W : Sum of the ranks of the positive absolute deviations from the hypothesized median, after excluding the ties (refer to the example)
n₀ : No. of values that are not equal to η₀
η₀ : Hypothesized median

Distribution (Model) Approximated by Z using a correction


Cor = 0.5/SE
= NORM.S.DIST(Z + Cor, TRUE) → Left tail
P-value = 1 - NORM.S.DIST(Z - Cor, TRUE) → Right tail
= 2 * (Min One-tail p-value) → Two tail
Confidence intervals This is given by the Quartile values of the Pairwise
averages of the sample data.
Minitab Path: Stat → Nonparametrics→ 1-Sample Wilcoxon
Notes*
A 2-sample Wilcoxon applies the same test on the values that derive from the
paired differences of the two samples (paired test).
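In Python, the one-sample case is run by testing the deviations from the hypothesized median (scipy drops zero deviations, i.e., ties, by default); the measurements are hypothetical:

    from scipy.stats import wilcoxon

    data = [5.2, 6.1, 4.8, 5.9, 6.4, 5.5, 6.0, 5.1]   # hypothetical measurements
    eta0 = 5.0                                         # hypothesized median
    diffs = [x - eta0 for x in data]                   # zeros (ties) are dropped by scipy
    stat, p = wilcoxon(diffs, correction=True)         # continuity correction, as above
    print(stat, p)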


13.3 Kruskal-Wallis Test

Aim: To test whether k independent samples derive from populations with


equal medians.

Hypothesis: Ho : η₁ = η₂ = η₃ = ⋯ = ηₖ

H1 : At least one of the medians is not equal

Assumptions: Samples and their measurements should be independent


Distributions should be of similar shape
Samples should have equal variances
Samples should have no outliers (check boxplot)

Test statistic:

K = ( 12 / (N(N + 1)) · Σᵢ₌₁ᵏ SRᵢ² / nᵢ ) − 3(N + 1)

Where:
N : Total no. of values
SRᵢ : Sum of relative ranks for sample i
nᵢ : Size of sample i

Degrees of freedom Where:

ν=k−1 k : No. of samples

Distribution (Model) χ2

P-value = CHISQ.DIST.RT(K,ν)

Minitab Path: Stat → Nonparametrics→ Kruskal-Wallis…

Notes
It is important that your data has no outliers as this can distort the test.
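A one-line cross-check in Python (hypothetical samples):

    from scipy.stats import kruskal

    g1 = [4.1, 5.2, 3.8, 4.6, 4.9]   # hypothetical samples
    g2 = [5.8, 6.1, 5.5, 6.4, 5.9]
    g3 = [4.4, 4.0, 4.7, 4.2, 4.5]
    K, p = kruskal(g1, g2, g3)        # p = CHISQ.DIST.RT(K, k-1)
    print(K, p)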


Part C: Regression analysis

In the previous sections we focused mainly, yet not entirely, on the analysis of data related to a single variable. We were not particularly concerned about potential relationships or causations between the different groups that were compared. However, sometimes it may be necessary to determine whether two or more variables are related to each other, and if so, in what way and by how much. Answers to such questions are given by what is known as regression analysis.

In regression analysis there is a dependent variable (y), or a set of such variables in more advanced cases, that depends on a single independent variable or a set of independent variables (x). The aim of the analysis is to see how the independent variables, which are things that can be controlled, affect the dependent variable, which cannot be controlled directly and is thus treated as the response. By mathematically modelling the relationship between the variables, we can then estimate one variable based on a chosen value of the other.

Essentially, the question is “what is the expected value of the dependent


variable Y, for a given value of the independent variable(s) X?”. It is obvious
that there is a notion of prediction, which is expressed as an expectation.
Therefore, uncertainty is part of regression analysis which indicates a direct link
to the theories discussed in the previous parts. Such links will be indicated
throughout the section when necessary.

There are many types of regression analysis, the choice of which depends on what we are trying to model. Things that need to be considered include, among others, the type of data, the number of variables to be modelled, as well as the nature of the relationship. The following map aims to help you choose a suitable approach.

Start → What type of y?

• Variables data:
  - 1 x → Simple linear regression modelling (→ Page 125). Fit achieved? If yes, stop; if no → Nonlinear regression (→ Page 135).
  - 2+ xs → Multiple linear regression (→ Page 140). Fit achieved? If yes, stop; if no, or in the presence of multicollinearity → Partial least squares (Minitab).

• Attributes data:
  - Ordinal → Ordinal logistic (Minitab)
  - Nominal → Nominal logistic (Minitab)
  - Binary → Binary logistic (Minitab)
  - Counts (numerical) → Poisson (Minitab)

Figure 29. Map of regression methods



Within the scope of this book, we will focus on regression techniques for
variables data, which are most commonly used in practice.

14 Simple linear regression

When there are only two variables to be modelled, one independent and one dependent, the first step is to try to fit a simple linear model. The process is illustrated below.

1. Plot the relationship
2. Fit a regression model
3. Evaluate the model
4. Make predictions

Figure 30. Basic steps of fitting a regression model

The following sections will discuss these steps by using an example of a


marketing department that is interested in determining how marketing spending
affects the revenue.

14.1 Plot the relationship


The first step of the analysis is to understand and appropriately collate the data.
By clarifying the role of the variables and putting them into the right table
format, the problem becomes clearer and thus the analysis procedure more


straightforward. The following tables present the sample data collected by the
marketing department.

A/A Spending (x) Revenue (y) A/A Spending (x) Revenue (y)
1 1784.58 5995.48 13 2117.26 8805.13
2 2576.63 11087.70 14 1985.55 7526.36
3 3148.32 12850.94 15 3005.69 12546.28
4 2224.88 8072.03 16 3014.65 13589.67
5 2676.73 9871.44 17 2856.69 10989.62
6 1814.69 7137.16 18 1958.00 6139.10
7 2425.06 10457.57 19 1896.59 6759.36
8 2475.72 8330.34 20 2646.43 11809.99
9 2378.58 9243.69 21 2876.58 12586.34
10 2755.48 11542.70 22 2250.69 9986.58
11 2563.37 9725.25 23 2895.68 12036.67
12 2158.00 7256.00 24 2370.90 8482.15
Table 16. Marketing spending and revenue generated for period x

As you can see, for each spending value we have a related revenue value. Once the table is created, the two variables can be analysed independently. We can apply descriptive statistics to summarize the data, check for normality, and if necessary, run some additional hypothesis tests.

Continuing then with our regression analysis, we can plot the data on a scatter diagram to visually represent the relationship between the two variables. On the x axis of the diagram we plot the independent variable (i.e., marketing investment), while on the y axis we add the dependent one (i.e., generated income).


[Scatter diagram: Spending (x) on the horizontal axis, Revenue (y) on the vertical axis]

Figure 31. Relationship between Spending and Revenue

Each plotted point represents the relative position of an individual measurement


in reference to the two variables. In this way, the scatter diagram helps the data
analyst to understand the characteristics and evaluate some basic assumptions
of the relationship between the variables, including:

• Spot outliers. This is important because outliers can have a large


effect on the strength and maybe the direction of the relationship. If
outliers are spotted, they need to be investigated before proceeding.

• Monotonicity. A monotonic relationship is one where the dependent


variable either only increases or only decreases as its independent
variable increases. Non-monotonic relationships are very
complicated in their analysis.

• Check for linearity. This allows us to check if a linear model can be


applied or whether higher order models should be used. Not applying
the right model can bring back bizarre results.


• Check the direction. The y values of the points can increase as we


move from the left to the right side of the x axis (positive relationship)
or decrease (negative relationship).

In our case, there are no outliers, and it seems that there is a positive monotonic
association between the two variables; the higher the marketing spending the
higher the generated revenue. It also seems that there is a linear relationship
between the two. Such visual checks are important before you proceed to any
further analysis.

14.2 Fit a model


Provided that we meet the assumptions listed above, we can run a linear
regression. The simple linear regression is probably the most popular
mathematical model in the world of science. It models the relationship of two
variables, one predictor and one response, based on the following equation:

𝑦̂ = 𝑎 + 𝑏𝑥

The parameters a and b describe the relationship between the independent (control) variable x and the dependent (response) variable y, with “b” indicating the slope of the regression line (the regression coefficient) and “a” the intercept, i.e., the point where the line’s extension meets the y axis. In order to estimate them, we need to collect sample data for x and y, and then find the “line of best fit” for the observed points.

[Figure: a fitted regression line showing the intercept (a), the slope (b), and, for an observed point, the residual or error eᵢ = yᵢ − ŷᵢ]

Figure 32. Simple linear regression

While the mathematical proof is beyond the scope of this book, it is important to understand that the “line of


best fit” is the one that minimizes the sum of the squared deviations (least
squares) between the predicted values 𝑦̂𝑖 and the observed values 𝑦𝑖 , also
known as errors or residuals. This happens when:

b = ( Σxy − (Σx · Σy)/n ) / ( Σx² − (Σx)²/n )    &    a = ( Σx² · Σy − Σx · Σxy ) / ( n·Σx² − (Σx)² )

Luckily for us, we do not have to do the math as Microsoft Excel can
automatically produce the linear equation. The following process should be
followed:

1. Select the scatterplot and click on the plus symbol next to it. Then
tick the box “Trendline”.

2. Press right click on the trendline and then select the option “Format
trendline….”

3. On the right side of the window tick the boxes:

a. Display equation on chart

b. Display R-squared value on chart.

4. Evaluate the results.

The R² is called the coefficient of determination. It shows how much of the variance in one variable can be predicted or explained by the variance in the other. In other words, it indicates how good the chosen model is at predicting the response variable. As a squared number, it will always be positive and will lie somewhere between 0 (or 0%) and 1 (or 100%). The closer this value gets to 1, the higher the predictability and thus the better the model. Note that R² derives from Pearson’s coefficient of correlation (R), which ranges from -1 (negative correlation) to +1 (positive correlation) and shows the same thing.
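Outside Excel, scipy fits the same line in one call. The sketch below uses only the first six observations of Table 16 for brevity (in practice you would feed in all 24):

    from scipy.stats import linregress

    spending = [1784.58, 2576.63, 3148.32, 2224.88, 2676.73, 1814.69]
    revenue  = [5995.48, 11087.70, 12850.94, 8072.03, 9871.44, 7137.16]

    fit = linregress(spending, revenue)
    print(fit.intercept, fit.slope)   # a and b of y-hat = a + b*x
    print(fit.rvalue ** 2)            # R-squared, as displayed on the chart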


14.3 Test the model


The parameters of the regression have been calculated based on random sample data. They can only be estimates of some values α and β, which are “true” in the population:

Y = α + βx

Obviously, there is a direct analogy, although the math is more complicated, to the concept of inferential statistics that we have seen in the previous part of the book. That is, just as x̄ is expected to be normally distributed around a true value μ, Y is normally distributed around ŷ, with a measurable standard deviation σ̂. The latter is known as the standard error of regression, and it is given by the following equation:

σ̂ = √( Σ(yᵢ − ŷᵢ)² / (n − 2) )

Since Y can be anywhere within the areas formed by the distributions shown in the figure, the line’s slope can also change, and thus there is a chance that the line becomes flat (β = 0). This would mean no true regression power in the population, as a flat line would always generate the same Y = α, regardless of what x is. To determine this, we look at the line formulated by the sample data and test whether the differences observed in the ŷᵢ for the various xᵢ are due to the regression (rᵢ) or due to the random errors (eᵢ). This is, in essence, an ANOVA test that can be done in Excel by using the Data Analysis ToolPak.

The table at the bottom provides the values a and b of the regression line.
However, the question is whether we can trust these values. To answer that we
need to conduct some checks.

1. Predictive power (R2). This is the same value as the one generated in
the scatterplot above. If it is low, we can conclude that the chosen model
is not suitable for the data.

2. Significance of the model. If the p-value for the F-test in the second table is less than 5%, then we can reject the Ho: β = 0 in favour of the H1: β ≠ 0. This suggests that the model is significant in the population.

3. Normally distributed residuals. This indicates an unbiased model and


thus we can trust the results on the model’s significance. You can test
the residuals for normality by using the techniques we have seen in the
part B of this book. If they are not normally distributed, you may wish
to transform the variables.

4. Residuals have constant variances. When using the Data Analysis


ToolPak in Excel, tick the box “Residual Plots”. Microsoft Excel will
bring back the following diagram. You simply need to use the
“Predicted Y” values on the Y axis.

The residuals should be randomly distributed around and across the line
for the significance value to be unbiased. If special observations exist,
you may wish to transform the variables.

5. Residuals are independent. In order to check this, you have to create a plot similar to the one above (copy-paste), but on the x axis you should put the observation column from the residuals table.

Similar to the plot above, the residuals should be randomly distributed


around and across the line for the significance value to be unbiased. If
not, investigate why the observations are dependent.

If a model passes these checks, we can trust it for making predictions.


Otherwise, action needs to be taken. This may include, among others,
investigating the statistical problem further, using a different model to describe
the relationship or transforming the original data.

14.4 Make predictions


A valid model is a good predictor that can be used to estimate a y value for a
specific x value. Similar to the estimations made in the previous part of the
book, we use the following formula:

Ŷ ± t(1−α/2, ν) · SE_Ŷ

SE_Ŷ = σ̂ √( 1 + 1/n + (x_estimated − x̄)² / ( sₓ²(n − 1) ) )

Where:
Ŷ : The result from applying the linear equation (point prediction)
ν = n − 2
sₓ² : Variance of the independent sample data x


Let us assume for example that the marketing department wants to estimate how much revenue would be generated if 2500€ and 3200€ were spent on advertising.

Table 17. Estimating with the simple linear regression

We can state with 95% confidence that if the marketing department spends 2500€, it will generate somewhere between 8507€ and 11920€ in revenue. If it spends 3200€, it will generate somewhere between 11802€ and 15445€. The department can now make a decision that is based on mathematical expectations.
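The same interval can be computed in Python; the sketch below implements the SE_Ŷ formula above, so feeding it the 24 observations of Table 16 with x_new = 2500 should reproduce an interval close to the one quoted:

    import numpy as np
    from scipy.stats import t

    def prediction_interval(x, y, x_new, alpha=0.05):
        """Point prediction and interval, using the SE formula above."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x)
        b, a = np.polyfit(x, y, 1)                         # slope, intercept
        sigma = np.sqrt(((y - (a + b * x)) ** 2).sum() / (n - 2))  # SE of regression
        se = sigma * np.sqrt(1 + 1 / n +
                             (x_new - x.mean()) ** 2 / (x.var(ddof=1) * (n - 1)))
        margin = t.ppf(1 - alpha / 2, n - 2) * se
        y_hat = a + b * x_new
        return y_hat - margin, y_hat + margin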

15 Non-linear relationships

It is possible that fitting a simple linear regression to two variables may be


inadequate. For example, the R2 may be low or the p-value may not indicate a
true relationship in the population. In such cases, we need to test other non-
linear models. Some of these tests can be done in Excel, while others will need
more powerful software packages like Minitab.

15.1 Types of non-linear models


The aim of regression analysis is to find a model that best describes the
relationship between the variables. However, at the same time we need to apply

the simplest model possible. This is because, while advanced models may be able to describe peculiar relationships, they also create problems of complexity that can be difficult to deal with.

The task is to convert non-linear models into a linear regression by manipulating or transforming the x and y values. This is because it is far more convenient to deal with linear models, and any sacrifice in accuracy due to the transformation applied is usually acceptable. The following table presents some transformations you can apply to your data.

Order    Model          Equation                  Transformed y    Linear form
1        Quadratic      y = ax² + bx + c          y                ax² + bx + c
2        Cubic          y = ax³ + bx² + cx + d    y                ax³ + bx² + cx + d
3        Logarithmic    y = b·ln(x) + a           y                b·ln(x) + a
4        Exponential    y = a·e^(bx)              ln(y)            bx + ln(a)
5        Power          y = a·x^b                 ln(y)            b·ln(x) + ln(a)

Table 18. Models available in Excel and suitable transformations

We usually prefer models that require the minimum level of transformation, and if possible, the transformation should be applied to the x and not the y variable. This is because the errors in the regression equation are errors in the y-value; if the y-value is transformed, so are the errors. As a result, the assumption that the residuals are normally distributed can be invalidated in the background, and thus unexpected issues may arise.

For the reasons stated above, the polynomial models (i.e., quadratic, cubic) are
more frequently used compared to other types of non-linear models.

15.2 Looking at the R²


The scatterplot and the R2 value can offer significant information regarding the
model to apply. Let us consider the following linear relationship.

We can see that the distribution of the points is quite peculiar compared to the line. Therefore, although the R² is relatively high, we may wish to test a quadratic relationship. This can be done by right-clicking the line on the scatterplot and then selecting the option “Format trendline…”.

The quadratic line offers a higher R² value compared to the linear line. As a result, it seems a better model to use, and thus we can proceed to the testing phase. The process is exactly the same as with the simple linear regression. The difference is that we need to create a column of x² values and then use both columns (x and x²) as the Input X Range in the Data Analysis ToolPak.
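A quick way to compare polynomial orders outside Excel is numpy’s polyfit; the data below is hypothetical, chosen only to show a curved pattern:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    y = np.array([2.1, 3.9, 7.2, 11.8, 18.3, 26.1, 35.8, 47.5])

    coeffs = np.polyfit(x, y, 2)      # a, b, c of y = a*x^2 + b*x + c
    y_hat = np.polyval(coeffs, x)
    r2 = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    print(coeffs, r2)                  # compare r2 against the degree-1 fit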


If the model is not valid, we can try another one that may offer a better fit to the data. The most common choice is to increase the order of the polynomial regression up to the point where an increase in the order offers a significant increase in the R² value. In our example, a cubic relationship would offer an R² value of 0.8803, which is higher compared to the 0.8744 of the quadratic line. The question is: is it worth adding an additional layer of complexity for such a small increase, or are we simply overfitting the line to the data? In this case, the quadratic line is probably the best choice.

15.3 Predicting with non-linear models


If the model offers a satisfactory result, we can use it to make predictions. Doing
so in Excel though can be hard. The problem is that we cannot easily calculate
the standard error of prediction. Therefore, if we have no access to a more
advanced software like Minitab, we will have to inflate the standard error of
regression:

𝑆𝐸𝑦̂ = 𝜎̂ ∗ 1.1

This will not give the exact estimations but can be a relatively good approximation, as can be seen in the example above.


16 Multiple linear regression

In the previous sections we discussed cases where a single independent variable (x) has an impact on a dependent variable (y). However, it is also possible to have relationships where the dependent variable (y) is affected by more than one independent variable. In such cases, you will need to run a multiple linear regression.

ŷ = a + b₁x₁ + b₂x₂ + b₃x₃ + ⋯ + bₖxₖ

Although the mathematics for calculating the coefficients is more advanced compared to the simple linear regression, the overall logic is the same. The main difference is that we need to investigate the importance of each variable individually, as well as the potential interactions between them.

16.1 Correlated predictors


We need to consider the cases where predictors of a model are correlated to
each other, a phenomenon known as multicollinearity. High correlations can
result in peculiar estimations of the regression coefficients. Consider for
example the following case with three predictors (Temperature, Humidity,
Time) and one response (Strength).

Temperature Humidity Time (s) Strength

Temperature 1

Humidity 0.735 1

Time (s) 0.292 0.294 1

Strength 0.829 0.894 0.726 1

Table 19. Correlation of 3 predictors and 1 response

140
A practical guide to applied statistics

All three predictors seem to be highly correlated to the response, but at the same
time there seems to be a relatively strong correlation between temperature and
humidity. In such cases, we should investigate the correlation between the two
elements and act accordingly:

• If a correlated predictor is not significantly correlated to the response,


we may wish to remove it from the model. This will affect the
coefficients of the other predictors.

• If all the predictors are significantly correlated to the response, we


should use the “Partial least squares regression” method (very
difficult to run in Excel) or transform the variables.

If we do not act in the presence of multicollinearity, we will generate a model


that will have poor predictive power.

16.2 Applying a multiple linear regression


Let us consider the example of the marketing department we used above, this time with three predictors affecting revenue: spending, bonuses to employees, and the number of calls made to customers. By using the Data Analysis ToolPak we get the following results.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.951
R Square             0.904
Adjusted R Square    0.889
Standard Error       755.803
Observations         24

Note: In the case of multiple regression, we use the adjusted R square, as it considers the number of variables involved and the sample size.

141
Regression analysis

Note: ν = n − k − 1, where k is the number of x variables.

ANOVA
              df    SS             MS            F       Significance F
Regression    3     106983213.0    35661071.0    62.4    0.000
Residual      20    11424749.9     571237.49
Total         23    118407962.9

In multiple regression the hypotheses are:

Ho: β1 = β2 = β3 = 0 (all the slopes are equal to 0).

H1: At least one βi ≠ 0 (not all the slopes are equal to zero).

The p-value of the ANOVA test will show the overall power of the model. If we fail to reject Ho, then the analysis should stop here, as the model is not significant in the population. However, if Ho is rejected, we need to check the individual tests for each independent variable. That is, a rejection of Ho suggests that the model has some overall predictive power, but does this mean that all independent variables can predict the dependent one?

Coefficients Standard Error t Stat P-value


Intercept -5295.116 1524.178 -3.474 0.002
Spending 4.715 0.463 10.192 0.000
No. of calls 40.195 18.131 2.217 0.038
Bonuses 0.264 0.413 0.639 0.530

The individual p-values suggest that the predictors “Spending” and “No. of calls” are significant, while the predictor “Bonuses” is not. Therefore, we need to re-run the regression without considering it.
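The same workflow can be reproduced in Python with statsmodels; the small arrays below are hypothetical, in the spirit of the example above:

    import numpy as np
    import statsmodels.api as sm

    spending = np.array([1785, 2577, 3148, 2225, 2677, 1815, 2425, 2476])
    calls    = np.array([40, 55, 60, 45, 52, 38, 50, 48])
    bonuses  = np.array([300, 450, 500, 350, 420, 310, 400, 380])
    revenue  = np.array([5995, 11088, 12851, 8072, 9871, 7137, 10458, 8330])

    X = sm.add_constant(np.column_stack([spending, calls, bonuses]))
    model = sm.OLS(revenue, X).fit()
    print(model.rsquared_adj)   # adjusted R-squared
    print(model.pvalues)        # intercept plus one p-value per predictor

Dropping a non-significant predictor is then just a matter of removing its column from X and refitting.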


The predictors of the new model are all significant and thus we can use it for
making predictions. Note that if a multiple linear model is not suitable, we may
have to apply a non-linear system with advanced mathematical formulas.

16.3 Predicting with the multiple linear regression


If the model offers a satisfactory result, we can use it to make predictions.
However, similar to the case of non-linear models, doing so in Excel can be
hard. The problem is again that we cannot easily calculate the standard error of
prediction. Therefore, if we have no access to a more advanced software like
Minitab, we will need to inflate the standard error of regression:

𝑆𝐸𝑦̂ = 𝜎̂ ∗ 1.1

This will not give the exact estimations but can be a relatively good approximation, as can be seen in the example above.

17 Forecasting with time-series

Time series is a special type of bivariate data, where a variable is presented in time order at regular intervals. The aim is to capture movement and analyse how the dependent variable has changed over a period of time. By


conducting such analysis, the data analyst can understand how the phenomenon
behaves and thus make predictions about what is expected to happen in the
future.

17.1 Components in time-series


The analysis of time series requires the application of the tools that have been
discussed in this section. However, with time series, there are some additional
elements that need to be considered. Let’s work on the following table that
illustrates the historic sales data of a company.

Year Quarter Sales Year Quarter Sales

2015 1 £74,841 2017 1 £108,694

2 £70,324 2 £104,969

3 £59,016 3 £92,851

4 £53,249 4 £80,099

2016 1 £87,976 2018 1 £138,898

2 £79,822 2 £115,678

3 £77,859 3 £102,534

4 £67,409 4 £95,432

Table 20. Historic sales data

The years have been broken down into quarters and for each quarter the
accumulated sales revenue has been provided. As data analysts, we are
interested in understanding the relationship between the quarters and how sales
changed over this period. In order to visually illustrate their relationship, you
can either use a scatter plot or a line graph. The latter is used here.


Figure 33. Time series plot of sales

The analysis of the graph clearly illustrates three (+1) key points:

• Seasonality component (St). Within every single year, the first quarter
delivers the highest amount of revenues and then the sales gradually drop
till the first quarter of the following year where they increase again. In
other words, there seems to be a kind of a cycle that is repeated annually.

• Trend component (Tt). The overall sales income increases over the
years. Essentially, if you were to calculate the average value of sales for
each seasonal cycle, then you would find out that these averages increase.

• Random or irregularity component (It). Neither the trend nor the


seasonality happens at a stable pace. Yes, there are the ups and downs
within the cycles of seasonality, but these ups and downs are not the same
from year to year. Such irregularity is expected provided that every
system and any kind of data is a “victim” of variation.

• Cyclical component (Ct). This is a long-term fluctuation that “looks”


like a wave and does not happen in a fixed period of time. It usually
coincides with the cycles of the economy and is difficult to analyse. The
data in the example is probably a part of a cycle.

17.2 Central moving average


Since seasonality affects the time series, we need to remove it before making any estimations about the future. There are different ways to do so. Here, we will use the central moving average (CMA), which is an average computed around a designated centre point.

The first step in calculating the moving average is to define the period over
which seasonality is observed. In our case, this period is equal to 4 intervals or
else quarters. Whether the period is an even or an odd number plays an
important role on how to calculate the CMA.

For an odd number of intervals, the CMA is aligned with the middle value. For a cycle of five intervals holding the values a₁ to e₁, the following formula should be used:

CMA = (a₁ + b₁ + c₁ + d₁ + e₁) / 5

For an even number of intervals, the CMA is placed between the two middle intervals. In this case, the first value of the following cycle (a₂) needs to be considered as well:

CMA = (0.5·a₁ + b₁ + c₁ + d₁ + 0.5·a₂) / 4

Figure 34. Calculating the CMA

In our example, we have an even number of intervals (4).
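The even-period formula is easy to implement in Python; the final lines check the sketch against the first CMA value of Table 21:

    def cma_even(y, period=4):
        """Centred moving average for an even period, per the formula above."""
        half = period // 2
        out = [None] * len(y)
        for t in range(half, len(y) - half):
            window = (0.5 * y[t - half] + sum(y[t - half + 1:t + half])
                      + 0.5 * y[t + half])
            out[t] = window / period
        return out

    sales = [74841, 70324, 59016, 53249, 87976]   # first five quarters of Table 20
    print(cma_even(sales)[2])                      # ~65999, matching Table 21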


Year    Quarter    Sales (Yt)    CMA(4)
2015    1          £74,841
        2          £70,324
        3          £59,016       £65,999
        4          £53,249       £68,828
2016    1          £87,976       £72,371
        2          £79,822       £76,496
        3          £77,859       £80,856
        4          £67,409       £86,590
2017    1          £108,694      £91,607
        2          £104,969      £95,067
        3          £92,851       £100,429
        4          £80,099       £105,543
2018    1          £138,898      £108,092
        2          £115,678      £111,219
        3          £102,534
        4          £95,432

Note: The CMA starts at the third quarter of 2015 and stops at the second quarter of 2018, because any CMA outside this range would include a blank cell, or else unknown data.

Table 21. Calculation of the central moving average
Once the CMA is calculated, plot it on the initial graph.


17.3 Working on irregularity


The next step is to compare the two lines. To do so we divide the actual sales
by the moving average. In that way, we can learn more about the seasonality
of the data.

Year Quarter Sales (Yt) CMA(4) St (with It ) = Yt / CMA


2015 1 £74,841
2 £70,324
3 £59,016 £65,999 0.894
4 £53,249 £68,828 0.774
2016 1 £87,976 £72,371 1.216
2 £79,822 £76,496 1.043
3 £77,859 £80,856 0.963
4 £67,409 £86,590 0.778
2017 1 £108,694 £91,607 1.187
2 £104,969 £95,067 1.104
3 £92,851 £100,429 0.925
4 £80,099 £105,543 0.759
2018 1 £138,898 £108,092 1.285
2 £115,678 £111,219 1.040
3 £102,534
4 £95,432

Table 22. Calculating the seasonal index for each period

The outcome is a set of values that reflect the seasonal indices of the various
periods. Each index represents how much the actual original values (i.e., sales)
deviate from the corresponding baseline of the smoothed CMA values. For
example, in quarter 3 of 2015 the actual sales were 10.6% (1 – 0.894) below the


smoothed average of this period. Similarly, in quarter 1 of 2016 the sales were
21.6% above the average of this period.

The values produced above reflect the seasonal component of the series data
along with the irregularity that comes from the various periods. The next step
therefore, is to extract the pure seasonal indices. In order to do that, we need to
calculate the average index for each similar interval in the series. The following
table presents these indices for our data.

Quarter     St
1           1.23
2           1.06
3           0.93
4           0.77
Average     0.997

For example, the seasonal index for quarter 1 is equal to (1.216 + 1.187 + 1.285)/3 = 1.23. Keep in mind that the average of the seasonal indices should be close to unity.

Table 23. Calculating the average seasonal index

Using the seasonal indices, we can now deseasonalise the original data. In order to do that, we simply divide the original data by the corresponding seasonal index. The results can be found in the table below.


t    Sales (Yt)    CMA(4)    St (with It)    St    Deseasonalised sales (Yt/St)
1 £74,841 1.23 £60,893
2 £70,324 1.06 £66,183
3 £59,016 £65,999 0.894 0.93 £63,648
4 £53,249 £68,828 0.774 0.77 £69,123
5 £87,976 £72,371 1.216 1.23 £71,580
6 £79,822 £76,496 1.043 1.06 £75,121
7 £77,859 £80,856 0.963 0.93 £83,971
8 £67,409 £86,590 0.778 0.77 £87,504
9 £108,694 £91,607 1.187 1.23 £88,438
10 £104,969 £95,067 1.104 1.06 £98,788
11 £92,851 £100,429 0.925 0.93 £100,138
12 £80,099 £105,543 0.759 0.77 £103,976
13 £138,898 £108,092 1.285 1.23 £113,012
14 £115,678 £111,219 1.040 1.06 £108,866
15 £102,534 0.93 £110,582
16 £95,432 0.77 £123,881
Table 24. Deseasonalising the original data
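The averaging step is a simple group-by; the sketch below (assuming pandas) feeds in the Yt/CMA ratios of Table 22 and reproduces the indices of Table 23:

    import pandas as pd

    ratios = pd.DataFrame({
        "quarter": [3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
        "ratio":   [0.894, 0.774, 1.216, 1.043, 0.963, 0.778,
                    1.187, 1.104, 0.925, 0.759, 1.285, 1.040],
    })
    seasonal_index = ratios.groupby("quarter")["ratio"].mean()
    print(seasonal_index)      # ~1.23, 1.06, 0.93, 0.77 as in Table 23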

17.4 Running a linear regression


At this stage we are ready to run a simple linear regression to model the relationship between time and the deseasonalised data that we generated. Note that this is the reason why we coded the intervals with ordered numbers (t), as can be seen in the table above. The rest of the analysis can be conducted in Microsoft Excel, as we have already discussed in the previous sections of this book.


SUMMARY OUTPUT

Regression Statistics
Multiple R           0.987
R Square             0.974
Adjusted R Square    0.972
Standard Error       3351.556
Observations         16

(Fitted equation from the chart: y = 4134.7x + 53960.9, r² = 0.974)

ANOVA
              df    SS            MS            F        Significance F
Regression    1     5812753274    5812753274    517.4    0.000
Residual      14    157260942     11232924
Total         15    5970014216

              Coefficients    Standard Error    t Stat    P-value
Intercept     53960.983       1757.571          30.702    0.000
t             4134.771        181.764           22.748    0.000
Table 25. Regression analysis for the time series

After we analyse the p-values in the table and the residuals, we can use the coefficients of the linear regression to forecast future sales. Let’s assume that we are interested in forecasting one year ahead. The first step is to calculate the trend component (Tt). This can be done by using the equation provided by the regression analysis.

Tt = 4134.7·t + 53960.9

Once the trend component is calculated, we need to re-seasonalize the data. We do so by multiplying the trend component by the seasonal component.

Forecast = Tt * St


Eventually, this will give us the forecasted values, or else the point predictions.
Of course, you can also calculate the confidence intervals for these predictions
by using the Student’s (t) distribution, as this has been discussed above. In that
way you will be able to make more accurate estimations.
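Putting the last two steps together in Python, the loop below uses the coefficients of Table 25 and the seasonal indices of Table 23; the output matches Table 26 up to the rounding of the seasonal indices:

    # trend coefficients from Table 25 and seasonal indices from Table 23
    seasonal = {1: 1.23, 2: 1.06, 3: 0.93, 4: 0.77}
    for t in range(17, 21):                     # 2019, quarters 1 to 4
        quarter = (t - 1) % 4 + 1
        trend = 53960.983 + 4134.771 * t        # Tt from the regression
        print(t, quarter, round(trend * seasonal[quarter]))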

t Year Quarter Sales (Yt) St Tt Forecast


1 2015 1 £74,841 1.23 £58,096 £71,403
2 2 £70,324 1.06 £62,231 £66,125
3 3 £59,016 0.93 £66,365 £61,535
4 4 £53,249 0.77 £70,500 £54,310
5 2016 1 £87,976 1.23 £74,635 £91,730
6 2 £79,822 1.06 £78,770 £83,699
7 3 £77,859 0.93 £82,904 £76,871
8 4 £67,409 0.77 £87,039 £67,051
9 2017 1 £108,694 1.23 £91,174 £112,057
10 2 £104,969 1.06 £95,309 £101,273
11 3 £92,851 0.93 £99,443 £92,206
12 4 £80,099 0.77 £103,578 £79,792
13 2018 1 £138,898 1.23 £107,713 £132,385
14 2 £115,678 1.06 £111,848 £118,847
15 3 £102,534 0.93 £115,983 £107,542
16 4 £95,432 0.77 £120,117 £92,533
17 2019 1 1.23 £124,252 £152,712
18 2 1.06 £128,387 £136,421
19 3 0.93 £132,522 £122,877
20 4 0.77 £136,656 £105,274

Table 26. Forecasting sales revenues


Note that we generated forecasted values for the historical period as well. Although we know the actual data for this period, we may wish to calculate these values in order to compare them with the actual historical data. Naturally, if we have conducted the analysis correctly, the two should be close, which serves as a simple logical check.

Finally, keep in mind that any projections into the future should not go too far
ahead. This is because the further into the future one goes, the higher the
uncertainty and thus the less reliable the forecasts are. To make them more
accurate, you can calculate the errors of the estimation by following the
methods presented in the previous section.


Index list

a priori probability 24
alpha value 45 – 49
alternative hypothesis 43 – 47
attributes 8, 59
average 19
Bernoulli process 38
bias 11
central limit theorem 61, 72
central moving average 147 – 148
central tendency 19
cluster sampling 11
coefficient of correlation 129
coefficient of determination 129
confidence intervals 52
degrees of freedom 55
descriptive statistics 16, 63, 126
discrete data 10, 59
dispersion 16, 20
expected value 29, 38, 39, 40, 41, 45, 70, 102, 104, 112
finite population 39
frequency distribution 17, 33 – 41
histograms 16 – 18
historical data 102, 154
hypothesis testing 44 – 122, 131
independence 28, 104
interquartile range 21, 65
interval estimations 52
joint probability 27
least squares 127
linear regression 125 – 130
mean value: see average
median 20
multiple regression 140 – 144
mutually exclusive events 26
normal distribution 33 – 37; test for normality 68 – 71
null hypothesis 43 – 47
one-tail tests 49
p-value 53
paired 51
parameters 19 – 23
payoff table 29
Poisson distribution 40
point prediction (estimate) 135, 139, 144
qualitative data 8
quantitative data 8
quartiles 65
random sampling 11
range 21
regression analysis 123 – 154
risk 48, 56
sample size 55, 61, 72
sampling distribution of the mean 42
sampling error 45
scatter diagram 126
significance levels 46
skewed distribution 15, 19, 66
standard deviation 22, 35; sampling distribution 42
stratified sampling: see cluster sampling
Student’s t distribution 55
subjective distribution 24
systematic sampling 11
two-tailed tests 49
type I/type II errors 48
Venn diagram 24